Paul J Stevens wrote:
For a very long time postgresql users have complained about charset
incompatibilities in dbmail.

This is only a problem for PostgreSQL users whose database uses an encoding other than SQL_ASCII. The SQL_ASCII encoding basically ignores encoding completely, which is a good thing for something like email storage (where messages arrive in all sorts of encodings).

The problem is that I have my dbmail tables stored in the same database as a number of different company programs to provide an integrated solution. At this time, PostgreSQL doesn't support per-table encoding (it's on their to-do list, but I don't expect it to happen any time soon), and since other aspects of my database require UNICODE, I'm stuck with my DBMail database as UNICODE, which has been the source of many problems.

I posted a question about this to the PostgreSQL hackers mailing list and got a response from Tom Lane saying that if we really need to avoid encoding issues altogether, we should either use SQL_ASCII for the whole database or, if that is not an option, use the bytea datatype.
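
To make the difference concrete, here is a minimal Python sketch (not dbmail code; the function name is made up for illustration) of the check a UNICODE database effectively applies to incoming text. A Latin-1 header fails it, while the same bytes stored as bytea or in an SQL_ASCII database are never checked at all.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Café" as it might arrive in a Latin-1 encoded message.
latin1_subject = "Café".encode("latin-1")
# The same text after conversion to UTF-8.
utf8_subject = "Café".encode("utf-8")

print(is_valid_utf8(latin1_subject))  # rejected by a text column in a UTF-8 database
print(is_valid_utf8(utf8_subject))    # accepted
```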

As Paul mentioned, we are using bytea for messageblks and this seems to be working nicely. I agree with your concern over moving the message header tables to bytea at this point in the cycle, but I think it's worth considering. Encoding issues are a nightmare, and avoiding the whole issue has some merit. I'm not entirely sure of the performance or sorting implications of this either, but I can look into it. What are you most concerned about losing?

So, I came up with a solution I want to run by you.

I've just landed a change that will convert all strings inserted into
the headervalue and subjectfield columns into UTF-8 encoded strings using
gmime's iconv facilities. The subject and address parts of the envelope
are encoded as 7-bit ASCII (RFC 2047 encoded-words), which also makes it
safe to insert them into utf8 tables regardless of the original charset
encoding.
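
For readers who want to see the idea in isolation, here is a rough Python sketch of the two conversions described above. It uses Python's standard email.header module as a stand-in for gmime's iconv facilities, so it is not the dbmail implementation, and both function names are hypothetical.

```python
from email.header import Header, decode_header, make_header


def header_to_utf8(raw_value: str) -> bytes:
    """Decode any RFC 2047 encoded-words in raw_value and return UTF-8 bytes.

    Hypothetical helper: roughly what ends up in headervalue/subjectfield.
    """
    decoded = make_header(decode_header(raw_value))
    return str(decoded).encode("utf-8")


def header_to_rfc2047(text: str, charset: str = "utf-8") -> str:
    """Encode text as RFC 2047 encoded-words, which are pure 7-bit ASCII.

    Hypothetical helper: roughly what the envelope parts look like.
    """
    return Header(text, charset).encode()


# A subject line as it appears on the wire, declared as ISO-8859-1.
wire_subject = "=?iso-8859-1?q?Caf=E9?="

stored = header_to_utf8(wire_subject)    # UTF-8 bytes, safe for a utf8 column
envelope = header_to_rfc2047("Café")     # ASCII only, safe for any column

print(stored)
print(envelope)
```

Because the encoded-word form contains only ASCII characters, it is valid in UTF-8, Latin-1, KOI8 and SQL_ASCII databases alike, which is why the envelope parts are not affected by the table encoding.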

Pro:
- we don't need to alter the schema.
- imap-sort behaves as expected.

Are you sure that imap-sort won't behave as expected if we convert the header columns to bytea?

Con:
- this means starting from 2.2.0 dbmail expects tables to use utf8 encoding.

I haven't tested yet how this new behaviour affects people using non-utf8
encoded tables, like latin1 or koi8. People with experience in these
matters are invited to speak up.

Does gmime have the ability to convert to any number of encodings? Could we specify in dbmail.conf that our database uses a specific encoding and have gmime do the conversion before it hits the database?
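
To sketch what I have in mind (in Python rather than gmime, and with a made-up db_encoding parameter standing in for a dbmail.conf setting that does not exist today):

```python
def to_db_encoding(raw: bytes, source_charset: str, db_encoding: str = "utf-8") -> bytes:
    """Transcode raw header bytes from source_charset to db_encoding.

    Hypothetical helper. Bytes that are invalid in source_charset and
    characters that db_encoding cannot represent are replaced instead of
    raising, so the insert never fails, but the conversion can be lossy.
    """
    text = raw.decode(source_charset, errors="replace")
    return text.encode(db_encoding, errors="replace")


# Latin-1 input into a UTF-8 database: lossless.
print(to_db_encoding(b"Caf\xe9", "latin-1", "utf-8"))

# KOI8-R input into a Latin-1 database: Cyrillic has no Latin-1 equivalent,
# so every character degrades to "?".
print(to_db_encoding("Привет".encode("koi8-r"), "koi8-r", "latin-1"))
```

The second call shows the catch: any target other than a Unicode encoding will lose characters for at least some incoming mail, which is an argument for either UTF-8 or bytea.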

Also: is there a procedure to change the encoding on a table in postgresql?

Not an easy one, changing the encoding is painful.

Please also keep in mind that I am by no means an encoding expert. I've stumbled through some encoding issues and learned a few things along the way, but that's about it.


Matt
