On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland <isaac.morl...@gmail.com> wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent characters that
>> can't be automatically transcoded (without failure) to unicode?
>
> Here I’m just anticipating a hypothetical objection, “what about characters
> that can’t be represented in UTF-8?” to my suggestion to always use UTF-8 and
> I’m saying we shouldn’t care about them. I believe the answers to your
> questions in this paragraph are “yes”, and “none”.

Years ago, I remember SJIS being cited as an example of an encoding
that had characters which weren't part of Unicode. I don't know
whether this is still a live issue. But I do think that sometimes
users are reluctant to perform encoding conversions on the data that
they have. Sometimes they're not completely certain what encoding
their data is in, and sometimes they're worried that the encoding
conversion might fail or produce wrong answers.

In theory, if your existing data is validly encoded and you know what
encoding it's in and it's easily mapped onto UTF-8, there's no
problem. You can just transcode it and be done. But a lot of times the
reality is a lot messier than that.

Which gives me some sympathy with the idea of wanting multiple
character sets within a database. Such a feature exists in some other
database systems and is, presumably, useful to some people. On the
other hand, to do that in PostgreSQL, we'd need to propagate the
character set/encoding information into all of the places that
currently get the typmod and collation, and that is not a small number
of places. It's a lot of infrastructure for the project to carry
around for a feature that's probably only going to continue to become
less relevant.

I suppose you never know, though. Maybe the Unicode consortium will
explode in a tornado of fiery rage and there will be dueling standards
making war over the proper way of representing an emoji of a dog
eating broccoli for decades to come. In that case, our hypothetical
multi-character-set feature might seem prescient.

--
Robert Haas
EDB: http://www.enterprisedb.com