Re: Handling mislabeled emails encoded with Windows-1252
Hi David,

Thanks for the hints! I'll prepare a test and the patch based on master
shortly.

Cheers,
Sebastian

David Bremner writes:

> Sebastian Poeplau writes:
>
>>> Nice, I'll add it.
>>
>> Updated patch attached.
>>
>> Cheers,
>> Sebastian
>
> Thanks to both of you for working on this. The code looks ok to me, I
> have only some procedural comments.
>
> In order to merge it I'll need at least one test. I think
> test/T300-encoding.sh is probably the right place. There are a few
> different styles of test; you can either put things in variables as in
> that file, or use the more dominant
>
>     test_begin_subtest "description"
>     cat << EOF > EXPECTED
>     this is my expected output
>     EOF
>     notmuch show STUFF > OUTPUT
>     test_expect_equal_file EXPECTED OUTPUT
>
> Feel free to bug the list for help on making tests (or #notmuch on
> freenode).
>
> Please also use git-send-email to send your patch(es), with commit
> messages with an eye to
>
>     https://notmuchmail.org/contributing/#index5h2
>
> To minimize the chance of problems, it's probably best to base your
> commits on master, although the patch you sent applied fine here.
>
> Thanks,
>
> David

___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch
Re: Handling mislabeled emails encoded with Windows-1252
Sebastian Poeplau writes:

>> Nice, I'll add it.
>
> Updated patch attached.
>
> Cheers,
> Sebastian

Thanks to both of you for working on this. The code looks ok to me, I
have only some procedural comments.

In order to merge it I'll need at least one test. I think
test/T300-encoding.sh is probably the right place. There are a few
different styles of test; you can either put things in variables as in
that file, or use the more dominant

    test_begin_subtest "description"
    cat << EOF > EXPECTED
    this is my expected output
    EOF
    notmuch show STUFF > OUTPUT
    test_expect_equal_file EXPECTED OUTPUT

Feel free to bug the list for help on making tests (or #notmuch on
freenode).

Please also use git-send-email to send your patch(es), with commit
messages with an eye to

    https://notmuchmail.org/contributing/#index5h2

To minimize the chance of problems, it's probably best to base your
commits on master, although the patch you sent applied fine here.

Thanks,

David
Re: Handling mislabeled emails encoded with Windows-1252
Hi,

>> As an added optimization, you could try limiting that block of code to
>> just when the charset is one of the iso-8859-* charsets.
>>
>> The following code snippet should help with that:
>>
>>     charset = charset ? g_mime_charset_canon_name (charset) : NULL;
>>     if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
>>         ...
>>
>> The reason you need to use g_mime_charset_canon_name (if you decide to
>> add the optimization) is that mail software does not always use the
>> canonical form of the various charset names that they use. Often you
>> will get stuff like "latin1" or "iso_8859-1".
>
> Nice, I'll add it.

Updated patch attached.

Cheers,
Sebastian

diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.0 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-30 09:41:05.491636418 +0200
@@ -272,6 +272,7 @@
     GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
     GMimeStream *stream_filter = NULL;
     GMimeFilter *crlf_filter = NULL;
+    GMimeFilter *windows_filter = NULL;
     GMimeDataWrapper *wrapper;
     const char *charset;
 
@@ -282,13 +283,37 @@
     if (stream_out == NULL)
         return;
 
+    charset = g_mime_object_get_content_type_parameter (part, "charset");
+    charset = charset ? g_mime_charset_canon_name (charset) : NULL;
+    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+    if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
+        GMimeStream *null_stream = NULL;
+        GMimeStream *null_stream_filter = NULL;
+
+        /* Check for mislabeled Windows encoding */
+        null_stream = g_mime_stream_null_new ();
+        null_stream_filter = g_mime_stream_filter_new (null_stream);
+        windows_filter = g_mime_filter_windows_new (charset);
+        g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+                                 windows_filter);
+        g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+        charset = g_mime_filter_windows_real_charset(
+            (GMimeFilterWindows *) windows_filter);
+
+        if (null_stream_filter)
+            g_object_unref (null_stream_filter);
+        if (null_stream)
+            g_object_unref (null_stream);
+        /* Keep a reference to windows_filter in order to prevent the
+         * charset string from deallocation. */
+    }
+
     stream_filter = g_mime_stream_filter_new (stream_out);
     crlf_filter = g_mime_filter_crlf_new (false, false);
     g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
                              crlf_filter);
     g_object_unref (crlf_filter);
 
-    charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
         GMimeFilter *charset_filter;
         charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,11 +338,12 @@
         }
     }
 
-    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
     if (wrapper && stream_filter)
         g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
     if (stream_filter)
         g_object_unref(stream_filter);
+    if (windows_filter)
+        g_object_unref (windows_filter);
 }
 
 static const char*
Re: Handling mislabeled emails encoded with Windows-1252
Hi,

> Yes, that looks good. I would have probably unreffed the null_stream
> and null_stream_filter inside of that if-block rather than at the end
> of the function, but that's a stylistic issue that the notmuch authors
> can comment on. The patch as it stands should work correctly from what
> I can tell.

I was worried about the string returned by
g_mime_filter_windows_real_charset: once I unref everything, isn't there
a risk of the filter being deleted? As far as I can tell from the code,
the returned charset might be a pointer into the filter object...

> As an added optimization, you could try limiting that block of code to
> just when the charset is one of the iso-8859-* charsets.
>
> The following code snippet should help with that:
>
>     charset = charset ? g_mime_charset_canon_name (charset) : NULL;
>     if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
>         ...
>
> The reason you need to use g_mime_charset_canon_name (if you decide to
> add the optimization) is that mail software does not always use the
> canonical form of the various charset names that they use. Often you
> will get stuff like "latin1" or "iso_8859-1".

Nice, I'll add it.

Thanks a lot,
Sebastian
Re: Handling mislabeled emails encoded with Windows-1252
Hi Sebastian,

Yes, that looks good. I would have probably unreffed the null_stream and
null_stream_filter inside of that if-block rather than at the end of the
function, but that's a stylistic issue that the notmuch authors can
comment on. The patch as it stands should work correctly from what I can
tell.

As an added optimization, you could try limiting that block of code to
just when the charset is one of the iso-8859-* charsets.

The following code snippet should help with that:

    charset = charset ? g_mime_charset_canon_name (charset) : NULL;
    if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
        ...

The reason you need to use g_mime_charset_canon_name (if you decide to
add the optimization) is that mail software does not always use the
canonical form of the various charset names that they use. Often you
will get stuff like "latin1" or "iso_8859-1".

Hope that helps,

Jeff

On 7/28/18, 7:22 AM, "Sebastian Poeplau" wrote:

    Hi all,

    Here's the updated patch. It filters the message through the
    GMimeFilterWindows that Jeff mentioned and then uses the charset it
    detects for GMimeFilterCharset in the actual rendering of the
    message. Jeff, is this how to use the filter correctly?

    Cheers,
    Sebastian
Re: Handling mislabeled emails encoded with Windows-1252
Hi all,

Here's the updated patch. It filters the message through the
GMimeFilterWindows that Jeff mentioned and then uses the charset it
detects for GMimeFilterCharset in the actual rendering of the message.
Jeff, is this how to use the filter correctly?

Cheers,
Sebastian

diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.0 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-28 10:25:25.358502880 +0200
@@ -271,7 +271,10 @@
 {
     GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
     GMimeStream *stream_filter = NULL;
+    GMimeStream *null_stream = NULL;
+    GMimeStream *null_stream_filter = NULL;
     GMimeFilter *crlf_filter = NULL;
+    GMimeFilter *windows_filter = NULL;
     GMimeDataWrapper *wrapper;
     const char *charset;
 
@@ -282,13 +285,27 @@
     if (stream_out == NULL)
         return;
 
+    charset = g_mime_object_get_content_type_parameter (part, "charset");
+    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+    if (wrapper && charset) {
+        /* Check for mislabeled Windows encoding */
+        null_stream = g_mime_stream_null_new ();
+        null_stream_filter = g_mime_stream_filter_new (null_stream);
+        windows_filter = g_mime_filter_windows_new (charset);
+        g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+                                 windows_filter);
+        g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+        charset = g_mime_filter_windows_real_charset(
+            (GMimeFilterWindows *) windows_filter);
+        g_object_unref (windows_filter);
+    }
+
     stream_filter = g_mime_stream_filter_new (stream_out);
     crlf_filter = g_mime_filter_crlf_new (false, false);
     g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
                              crlf_filter);
     g_object_unref (crlf_filter);
 
-    charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
         GMimeFilter *charset_filter;
         charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,9 +330,12 @@
         }
     }
 
-    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
     if (wrapper && stream_filter)
         g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
+    if (null_stream_filter)
+        g_object_unref (null_stream_filter);
+    if (null_stream)
+        g_object_unref (null_stream);
     if (stream_filter)
         g_object_unref(stream_filter);
 }

Sebastian Poeplau writes:

> Hi Jeff,
>
>> GMime actually comes with a stream filter (GMimeFilterWindows) which can
>> auto-detect this situation.
>>
>> In this particular case, you'd instantiate the GMimeFilterWindows like this:
>>
>>     filter = g_mime_filter_windows_new ("iso-8859-1");
>>
>> "iso-8859-1" being the charset that the content claims to be in.
>>
>> Then you'd pipe the raw (decoded but not converted to utf-8) content through
>> the filter and afterward call g_mime_filter_windows_real_charset (filter)
>> which would return, in this user's case, "windows-1252".
>
> Nice, this is exactly what I was looking for! Somehow I missed it when
> checking GMime. I'll adapt my local fix and post the results here.
>
> Thanks,
> Sebastian
Re: Handling mislabeled emails encoded with Windows-1252
Hi Jeff,

> GMime actually comes with a stream filter (GMimeFilterWindows) which can
> auto-detect this situation.
>
> In this particular case, you'd instantiate the GMimeFilterWindows like this:
>
>     filter = g_mime_filter_windows_new ("iso-8859-1");
>
> "iso-8859-1" being the charset that the content claims to be in.
>
> Then you'd pipe the raw (decoded but not converted to utf-8) content through
> the filter and afterward call g_mime_filter_windows_real_charset (filter)
> which would return, in this user's case, "windows-1252".

Nice, this is exactly what I was looking for! Somehow I missed it when
checking GMime. I'll adapt my local fix and post the results here.

Thanks,
Sebastian
Re: Handling mislabeled emails encoded with Windows-1252
Hi all (sent this to David already using Reply instead of Reply-All,
d'oh!),

GMime actually comes with a stream filter (GMimeFilterWindows) which can
auto-detect this situation.

In this particular case, you'd instantiate the GMimeFilterWindows like
this:

    filter = g_mime_filter_windows_new ("iso-8859-1");

"iso-8859-1" being the charset that the content claims to be in.

Then you'd pipe the raw (decoded but not converted to utf-8) content
through the filter and afterward call
g_mime_filter_windows_real_charset (filter) which would return, in this
user's case, "windows-1252".

Hope that helps,

Jeff

On 7/23/18, 9:49 PM, "notmuch on behalf of David Bremner" wrote:

    Sebastian Poeplau writes:

    > Hi,
    >
    > This email is to suggest a minor change in how notmuch handles text
    > encoding when displaying emails. The motivation is the following: I keep
    > receiving emails that are encoded with Windows-1252 but claim to be
    > ISO 8859-1. The two character sets only differ in the range between 0x80
    > and 0x9F where Windows-1252 contains special characters (e.g. “quotation
    > marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
    > thus causes some special characters in such emails to be displayed with
    > a replacement symbol for non-printable characters.

    Hi Sebastian;

    Everyone's mail situation is unique, but I haven't noticed this
    problem. Do you have a mechanical (e.g. scripted) way of detecting
    such mails? I suppose it could just look for characters in the range
    0x80 to 0x95 in allegedly ISO_8859-1 messages. A census of the
    situation in my own mail would help me think about this problem, I
    think.

    David
Re: Handling mislabeled emails encoded with Windows-1252
Hi again,

>> Everyone's mail situation is unique, but I haven't noticed this
>> problem. Do you have a mechanical (e.g. scripted) way of detecting such
>> mails? I suppose it could just look for characters in the range 0x80 to
>> 0x95 in allegedly ISO_8859-1 messages. A census of the situation in my
>> own mail would help me think about this problem, I think.
>
> Yes, I guess that should be a good enough heuristic for detecting
> affected mail. I'll try to come up with a simple script and post it
> here.

Attached is a Python script that checks individual message files and
prints their name if it finds them to contain mislabeled Windows-1252
text. The heuristic seems to work well on my mail - let me know if you
encounter any issues!

Cheers,
Sebastian

Attachment: find_mislabeled_cp1252.py
Re: Handling mislabeled emails encoded with Windows-1252
Hi David,

> Everyone's mail situation is unique, but I haven't noticed this
> problem. Do you have a mechanical (e.g. scripted) way of detecting such
> mails? I suppose it could just look for characters in the range 0x80 to
> 0x95 in allegedly ISO_8859-1 messages. A census of the situation in my
> own mail would help me think about this problem, I think.

Yes, I guess that should be a good enough heuristic for detecting
affected mail. I'll try to come up with a simple script and post it
here.

Cheers,
Sebastian
Re: Handling mislabeled emails encoded with Windows-1252
Sebastian Poeplau writes:

> Hi,
>
> This email is to suggest a minor change in how notmuch handles text
> encoding when displaying emails. The motivation is the following: I keep
> receiving emails that are encoded with Windows-1252 but claim to be
> ISO 8859-1. The two character sets only differ in the range between 0x80
> and 0x9F where Windows-1252 contains special characters (e.g. “quotation
> marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
> thus causes some special characters in such emails to be displayed with
> a replacement symbol for non-printable characters.

Hi Sebastian;

Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x95 in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.

David