I am posting this to the General and JDBC groups so that it's in the archive -- I know many people have had problems with Latin1 characters, and I'd like to share what has worked for us. If anyone can add this information to a more permanent FAQ, I'd be much obliged.


---

Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

This is a general primer for using postgres with alternate character sets. For those who have done so successfully, the process is obvious in hindsight. For those who haven't gotten it to work yet, it can be very complex, especially with web applications where four separate encodings can come into play.

Postgres is quite intelligent in dealing with character sets, but even so, it's possible to get into a situation where your application returns garbage for Latin1 or other extended characters (n with a tilde, etc.). This is particularly true in Java-based web applications.

There are four steps to ensuring proper character retrieval, with the fourth applying to Java web applications. They are explained briefly here, and then in more detail further down in the document:

1. Use a database encoding that will store the appropriate characters.

2. Set the client encoding first, before importing or adding data to the database.

3. Set the client encoding for each session where a user or program accesses the database.

4. Set the proper character encoding for the web page.

Do these four steps, and things should work out nicely. Miss even one of them, and your application will mysteriously return goofy extended characters.

For anyone who has already done this, the solution is probably obvious. However, it took us months to figure this out through trial-and-error, even with the help of the Java and Postgres communities. After all, we had five different potential encodings to deal with, in four separate steps, and that yields a LOT of non-working combinations.

We found dozens of people with the same problem, but most of them never fully resolved their problems. We tried a heck of a lot of different strategies, but the only thing that worked was breaking the situation down into these steps and finding the right encoding combination for each step, one at a time. Now that our application works, we want to help minimize the number of people who have to reinvent this wheel from scratch.

The rest of this document explains the steps in more detail.

---

1.  Create the database with the proper encoding.

In our experience, the best thing to do is simply choose Unicode as the database's character encoding if you think there's a chance of storing Latin1 or other characters.

You could choose Latin1, and this should work in most cases. However, there are times when normal-looking characters refuse to be stored in a Latin1 database, such as character 239, which is the same in Latin1 as it is in UTF-8. Rather than attempt to beat our heads against this wall, we went with Unicode because it will hold whatever we need to hold.
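For what it's worth, here is a minimal sketch of creating such a database through JDBC (the database name, connection URL, and login below are invented; the same CREATE DATABASE command can just as easily be run from psql):

// Assumes import java.sql.*; and the PostgreSQL JDBC driver on the classpath.
// All names here are examples only.
Class.forName("org.postgresql.Driver");
Connection adminCon = DriverManager.getConnection(
        "jdbc:postgresql://localhost/template1", "postgres", "secret");
Statement stmt = adminCon.createStatement();
// 'UNICODE' tells Postgres to store everything as UTF-8 internally
stmt.execute("CREATE DATABASE mydb WITH ENCODING = 'UNICODE'");
stmt.close();
adminCon.close();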

---

2.  Set client encoding before importing or adding data to the database.

If you do a mass import from another database, or a data restore from Postgres, make sure the encoding is set first. If you try to restore Unicode characters to a Latin1 database, the data likely won't look right. Likewise, if you try to restore Latin1 characters into a Unicode database without first telling Postgres that it's getting Latin1 characters, the extended characters will come back as garbage, *even though* a Unicode database can hold them.

The command to set the client encoding for a session is:

"SET CLIENT_ENCODING TO Unicode", "SET CLIENT_ENCODING TO Latin1", and so forth.

If you find out what kind of characters are coming into the Postgres database and tell Postgres ahead of time, it should be able to translate them automatically, as it is designed to do. Just remember that Unicode is a HUGE character set, and there will be Unicode characters that can't be translated into the much smaller Latin1 character set.

You may have to experiment to find out what the right client encoding is. It will likely depend upon the database from which you're getting data, such as a backup of a Latin1 Postgres database, or data retrieved from a SQL Server database, which will have its own encoding. It could also depend upon the character set used by your operating system, such as the character set of data someone entered into a Microsoft Access database (which, if they used US English Windows, will be cp1252). Fortunately, a Unicode or Latin1 client encoding will handle most Windows-related encoding issues.
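To make the idea concrete, here is a rough Java sketch of setting the client encoding before pushing rows into Postgres. The table, column, and encoding below are placeholders; use whatever encoding actually matches the data you are feeding in.

// Assumes import java.sql.*; and an open Connection pgCon to the Postgres
// database. 'LATIN1' is only an example -- match it to your source data.
Statement enc = pgCon.createStatement();
enc.execute("SET CLIENT_ENCODING TO 'LATIN1'");
enc.close();

// Only after the client encoding is set do we start adding data.
PreparedStatement insert =
        pgCon.prepareStatement("INSERT INTO customers (name) VALUES (?)");
insert.setString(1, "Muñoz");   // extended character, stored correctly now
insert.executeUpdate();
insert.close();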

---

3. Set the client encoding for each session where a user or program accesses the database.

If you have users entering data from a Java application, the encoding will vary from JVM to JVM. A Windows-based JVM might have an encoding of cp1252, while a Linux JVM might be utf-8. A Windows database application will likely have a cp1252 or Latin1 character set, but that could depend upon what application you're using or what version of Windows you're using.

Postgres needs to know what kind of characters to store and return. If you tell Postgres to return unicode characters to an application that's using Latin1, extended characters will look like gibberish.

Even in Java, where you can do all sorts of character-encoding translation, it can be impossible to translate data retrieved from Postgres if it's in the wrong encoding. We've tried changing the JVM encoding, altering the JDBC driver, translating encodings on the database read, and translating encodings after the read while building a new string, to no avail. We tried 25 combinations of each strategy (five different possible read encodings and five different possible string encodings), and nothing worked. We could get an application working in one JVM with one encoding, but another JVM would break, and no amount of translation would help.

But when we finally told Postgres what to return, everything worked like a charm.

Just as with step two, the key is to use the "SET CLIENT_ENCODING TO (encoding)" SQL command. If you're using an application where you can send SQL to the server, this is all you need. In something like MS Access, you'll have to move to a pass-through query. For Java, you'll need to send a command through JDBC:

String DBEncoding = "Unicode"; // use a real encoding, either returned from the JVM or explicitly stated
PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING TO '" + DBEncoding + "'");
statement.execute();
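Put together, a more complete (but still hypothetical -- the URL, login, table, and query are invented) version of that snippet looks something like this:

// A self-contained sketch: connect, declare the client encoding, then query.
// Everything except the SET CLIENT_ENCODING idea is made up for illustration.
import java.sql.*;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection dbCon = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "appuser", "secret");

        // Tell Postgres what encoding this client expects to send and receive.
        String DBEncoding = "UNICODE"; // or LATIN1, etc., to match the JVM/app
        PreparedStatement setEnc =
                dbCon.prepareStatement("SET CLIENT_ENCODING TO '" + DBEncoding + "'");
        setEnc.execute();
        setEnc.close();

        // From here on, extended characters should come back intact.
        Statement query = dbCon.createStatement();
        ResultSet rs = query.executeQuery("SELECT name FROM customers");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        query.close();
        dbCon.close();
    }
}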


---

4.  Set the proper character encoding for the web page.

Web browsers also need to know what kind of characters they are displaying. If you send them Unicode characters and the browser thinks it's getting Latin1, extended characters will again come up as nonsense. If you're writing a web-based application, it's a good idea to include the character set as part of the html, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For example, we have an application that serves up pages perfectly from a Linux JVM, which has a utf-8 encoding, and serves up pages with goofy extended characters from a Windows JVM, which has a cp1252 encoding. The solution is either to translate the strings in the Windows JVM into Unicode before serving the page or, alternatively, to tell the web page to use a different character set.
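If you are writing servlets by hand (as mentioned at the end of this post), the same thing can be declared on the response itself. The servlet below is only an illustrative sketch; the class name and page content are made up.

// Hypothetical servlet showing the idea: declare the charset on the response
// so the browser and the page agree on the encoding.
import java.io.*;
import javax.servlet.http.*;

public class EncodingTestServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Must be called before getWriter() for the charset to take effect.
        response.setContentType("text/html; charset=utf-8");
        PrintWriter out = response.getWriter();
        out.println("<html><head>");
        out.println("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
        out.println("</head><body>ñ and other extended characters</body></html>");
    }
}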


---

Those are all the basics. Many times you may not have to worry about one or more of the steps because your application server or JSP generator may already handle the character set conversion. But if it doesn't, or if you're writing Servlets and not JSPs, these steps are good ones to consider if you work with Latin1 characters.





