RebeccaJ wrote:

And I wonder why you like SQL_ASCII better than UTF8, and whether
others have any opinions about those two. (My web server's LC_CTYPE is
C, so I can use any character set.) Wouldn't UTF8 allow more
characters than SQL_ASCII?

I've had a LOT of experience dealing with apps that use 8-bit byte strings (like SQL_ASCII `text') to store data, and I've rarely seen one that *doesn't* have text encoding handling bugs.

If you store your text as byte streams that don't know, check, or enforce their own encoding you must keep track of the encoding separately - either with another value stored alongside the string, or through your app logic.

If you start storing data with multiple different text encodings in the DB, you're most likely to land up tracking down annoying "corrupt text" bugs sooner or later.

If, on the other hand, you use UTF-8, you *know* that everything in the database is well-formed UTF-8. You ensure that it is UTF-8 before storing it in the DB and know it'll be UTF-8 coming out. The DB takes care of encoding conversion for you if you ask it to, by setting client_encoding - the only downside being that it'll refuse to return strings that can't be represented in your current client_encoding, like say Cyrillic (Russian etc) text if you're using ISO-8859-1 (latin-1) for your client encoding.

Even with a UTF-8 database you must still get your I/O to/from libraries and the rest of the system right, converting UTF-8 text to whatever the system expects or vice versa. Alternately, if you set client_encoding, you must be prepared for cases where the DB can't send you what you ask for because your encoding can't represent it.

All in all, I personally think a UTF-8 database is considerably better for most uses. There are certainly cases where I'd use SQL_ASCII, but not most.

--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to