On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn > * Ensure that all data containing non-ASCII is quoted-printable > encoded for PostgreSQL, instead of merely all data not claiming to be > "text/plain" > > Why is this? It seems it is doing more harm than good -- it makes it > harder to use indexes and makes it harder to use it from the tables > directly? PostgreSQL is very capable of storing any kind of character > set? We use UTF-8 for everything, this fix breaks a lot of things for > us.
The commit in question is 3a9c38ed, which changes: } elsif ( !$RT::Handle->BinarySafeBLOBs && $MIMEType !~ /text\/plain/gi && !Encode::is_utf8( $Body, 1 ) ) { $ContentEncoding = 'quoted-printable'; } ...to: } elsif ( !$RT::Handle->BinarySafeBLOBs && $Body =~ /\P{ASCII}/ && !Encode::is_utf8( $Body, 1 ) ) { $ContentEncoding = 'quoted-printable'; } That is, any data which claimed to be "text/plain" would blindly be attempted to be shoved into the database; this includes content which contained _invalid_ UTF-8 byte sequences. The commit was in RT 4.0; RT 4.2 is much better about transcoding to real UTF-8, or marking the part as "application/octet-stream" if it cannot be decoded. In that sense, this logic is now less necessary. However, the commit was absolutely necessary at the time to not _lose_ data. Erring on the side of data preservation, at the expense of possibly-unnecessary encoding, seems like not an unreasonable choice. That being said, the Encode::is_utf8() check attempts to verify that we only quoted-printable encode data whose bytes are not currently a correctly-encoded UTF-8 byte stream. The bug now lies there, as after the "utf8-reckoning" branch (2620658..af9fe7c), the $Body argument is guaranteed to contain bytes, not characters. Per the Encode::is_utf8() documentation: [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If CHECK is true, also checks whether STRING contains well-formed UTF-8. Returns true if successful, false otherwise. The UTF8 flag is almost certainly off on $Body (it _may_ be on, and still contain only codepoints < 128, but this is unlikely), so the well-formed-ness of the string (which is the relevant thing we wish to confirm) is never checked. I _believe_ that this code may be replaced by (untested): } elsif ( !$RT::Handle->BinarySafeBLOBs && !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK); 1 } ) { $ContentEncoding = 'quoted-printable'; } I may push a branch later that makes this change, and see what breaks. All of that aside, note that "it makes it harder to use indexes" makes little sense -- there are no indexes on Content. The full-text index uses its own tsvector, which it constructs after decoding the Content, so the transfer encoding of the Content column is irrelevant to that. - Alex