Re: Handling mislabeled emails encoded with Windows-1252

2018-07-31 Thread Sebastian Poeplau
Hi David,

Thanks for the hints! I'll prepare a test and the patch based on master
shortly.

Cheers,
Sebastian


David Bremner  writes:

> Sebastian Poeplau  writes:
>
>>> Nice, I'll add it.
>>
>> Updated patch attached.
>>
>> Cheers,
>> Sebastian
>
> Thanks to both of you for working on this. The code looks ok to me, I
> have only some procedural comments.
>
> In order to merge it I'll need at least one test. I think
> test/T300-encoding.sh is probably the right place. There are a few
> different styles of test; you can either put things in variables as in
> that file, or use the more dominant
>
> test_begin_subtest "description"
> cat << EOF > EXPECTED
> this is my expected output
> EOF
> notmuch show STUFF > OUTPUT
> test_expect_equal_file EXPECTED OUTPUT
>
> Feel free to bug the list for help on making tests (or #notmuch on
> freenode).
>
> Please also use git-send-email to send your patch(es), with commit
> messages with an eye to
>
>  https://notmuchmail.org/contributing/#index5h2
>
> To minimize the chance of problems, it's probably best to base your
> commits on master, although the patch you sent applied fine here.
>
> Thanks,
>
> David
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-31 Thread David Bremner
Sebastian Poeplau  writes:

>> Nice, I'll add it.
>
> Updated patch attached.
>
> Cheers,
> Sebastian

Thanks to both of you for working on this. The code looks ok to me, I
have only some procedural comments.

In order to merge it I'll need at least one test. I think
test/T300-encoding.sh is probably the right place. There are a few
different styles of test; you can either put things in variables as in
that file, or use the more dominant

test_begin_subtest "description"
cat << EOF > EXPECTED
this is my expected output
EOF
notmuch show STUFF > OUTPUT
test_expect_equal_file EXPECTED OUTPUT

Feel free to bug the list for help on making tests (or #notmuch on
freenode).

Please also use git-send-email to send your patch(es), with commit
messages with an eye to

 https://notmuchmail.org/contributing/#index5h2

To minimize the chance of problems, it's probably best to base your
commits on master, although the patch you sent applied fine here.

Thanks,

David




Re: Handling mislabeled emails encoded with Windows-1252

2018-07-30 Thread Sebastian Poeplau
Hi,

>> As an added optimization, you could try limiting that block of code to
>> just when the charset is one of the iso-8859-* charsets.
>>
>> The following code snippet should help with that:
>>
>> charset = charset ? g_mime_charset_canon_name (charset) : NULL;
>> if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
>> ...
>>
>> The reason you need to use g_mime_charset_canon_name (if you decide to
>> add the optimization) is that mail software does not always use the
>> canonical form of the various charset names that they use. Often you
>> will get stuff like "latin1" or "iso_8859-1".
>
> Nice, I'll add it.

Updated patch attached.

Cheers,
Sebastian


diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c	2018-06-13 03:42:34.0 +0200
+++ notmuch-0.27-patched/notmuch-show.c	2018-07-30 09:41:05.491636418 +0200
@@ -272,6 +272,7 @@
 GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
 GMimeStream *stream_filter = NULL;
 GMimeFilter *crlf_filter = NULL;
+GMimeFilter *windows_filter = NULL;
 GMimeDataWrapper *wrapper;
 const char *charset;
 
@@ -282,13 +283,37 @@
 if (stream_out == NULL)
 	return;
 
+charset = g_mime_object_get_content_type_parameter (part, "charset");
+charset = charset ? g_mime_charset_canon_name (charset) : NULL;
+wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
+	GMimeStream *null_stream = NULL;
+	GMimeStream *null_stream_filter = NULL;
+
+	/* Check for mislabeled Windows encoding */
+	null_stream = g_mime_stream_null_new ();
+	null_stream_filter = g_mime_stream_filter_new (null_stream);
+	windows_filter = g_mime_filter_windows_new (charset);
+	g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+ windows_filter);
+	g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+	charset = g_mime_filter_windows_real_charset(
+	(GMimeFilterWindows *) windows_filter);
+
+	if (null_stream_filter)
+	g_object_unref (null_stream_filter);
+	if (null_stream)
+	g_object_unref (null_stream);
+	/* Keep a reference to windows_filter in order to prevent the
+	 * charset string from deallocation. */
+}
+
 stream_filter = g_mime_stream_filter_new (stream_out);
 crlf_filter = g_mime_filter_crlf_new (false, false);
 g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
 			 crlf_filter);
 g_object_unref (crlf_filter);
 
-charset = g_mime_object_get_content_type_parameter (part, "charset");
 if (charset) {
 	GMimeFilter *charset_filter;
 	charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,11 +338,12 @@
 	}
 }
 
-wrapper = g_mime_part_get_content_object (GMIME_PART (part));
 if (wrapper && stream_filter)
 	g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
 if (stream_filter)
 	g_object_unref(stream_filter);
+if (windows_filter)
+	g_object_unref (windows_filter);
 }
 
 static const char*


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-30 Thread Sebastian Poeplau
Hi,

> Yes, that looks good. I would have probably unreffed the null_stream
> and null_stream_filter inside of that if-block rather than at the end
> of the function, but that's a stylistic issue that the notmuch authors
> can comment on. The patch as it stands should work correctly from what
> I can tell.

I was worried about the string returned by
g_mime_filter_windows_real_charset: once I unref everything, isn't there
a risk of the filter being deleted? As far as I can tell from the code,
the returned charset might be a pointer into the filter object...

> As an added optimization, you could try limiting that block of code to
> just when the charset is one of the iso-8859-* charsets.
>
> The following code snippet should help with that:
>
> charset = charset ? g_mime_charset_canon_name (charset) : NULL;
> if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
> ...
>
> The reason you need to use g_mime_charset_canon_name (if you decide to
> add the optimization) is that mail software does not always use the
> canonical form of the various charset names that they use. Often you
> will get stuff like "latin1" or "iso_8859-1".

Nice, I'll add it.

Thanks a lot,
Sebastian


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-28 Thread Jeffrey Stedfast
Hi Sebastian,

Yes, that looks good. I would have probably unreffed the null_stream and 
null_stream_filter inside of that if-block rather than at the end of the 
function, but that's a stylistic issue that the notmuch authors can comment on. 
The patch as it stands should work correctly from what I can tell.

As an added optimization, you could try limiting that block of code to just 
when the charset is one of the iso-8859-* charsets.

The following code snippet should help with that:

charset = charset ? g_mime_charset_canon_name (charset) : NULL;
if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
...

The reason you need to use g_mime_charset_canon_name (if you decide to add the 
optimization) is that mail software does not always use the canonical form of 
the various charset names that they use. Often you will get stuff like "latin1" 
or "iso_8859-1".

Hope that helps,

Jeff
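
The canonicalize-then-prefix-match idea can be tried outside GMime. The sketch below is only an analogue, assuming Python's codec registry as a stand-in for g_mime_charset_canon_name; the helper name is_iso_8859 is hypothetical, and Python's canonical spelling ("iso8859-1") differs from GMime's.

```python
import codecs


def is_iso_8859(label: str) -> bool:
    """Return True if a charset label resolves to one of the ISO 8859 charsets.

    Hypothetical helper: codecs.lookup() plays the role that
    g_mime_charset_canon_name plays in the C snippet above.
    """
    try:
        canonical = codecs.lookup(label).name  # e.g. "latin1" -> "iso8859-1"
    except LookupError:
        return False  # label unknown to Python's codec registry
    return canonical.startswith("iso8859-")


# Non-canonical spellings seen in real mail resolve to the same codec.
for label in ("iso-8859-1", "latin1", "iso_8859-1", "ISO-8859-15", "utf-8"):
    print(label, is_iso_8859(label))
```

Matching the raw label against "iso-8859-" would miss "latin1" and "iso_8859-1", which is why the lookup comes first.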

On 7/28/18, 7:22 AM, "Sebastian Poeplau"  wrote:

Hi all,

Here's the updated patch. It filters the message through the
GMimeFilterWindows that Jeff mentioned and then uses the charset it
detects for GMimeFilterCharset in the actual rendering of the message.

Jeff, is this how to use the filter correctly?

Cheers,
Sebastian






Re: Handling mislabeled emails encoded with Windows-1252

2018-07-28 Thread Sebastian Poeplau
Hi all,

Here's the updated patch. It filters the message through the
GMimeFilterWindows that Jeff mentioned and then uses the charset it
detects for GMimeFilterCharset in the actual rendering of the message.

Jeff, is this how to use the filter correctly?

Cheers,
Sebastian


diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c	2018-06-13 03:42:34.0 +0200
+++ notmuch-0.27-patched/notmuch-show.c	2018-07-28 10:25:25.358502880 +0200
@@ -271,7 +271,10 @@
 {
 GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
 GMimeStream *stream_filter = NULL;
+GMimeStream *null_stream = NULL;
+GMimeStream *null_stream_filter = NULL;
 GMimeFilter *crlf_filter = NULL;
+GMimeFilter *windows_filter = NULL;
 GMimeDataWrapper *wrapper;
 const char *charset;
 
@@ -282,13 +285,27 @@
 if (stream_out == NULL)
 	return;
 
+charset = g_mime_object_get_content_type_parameter (part, "charset");
+wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+if (wrapper && charset) {
+	/* Check for mislabeled Windows encoding */
+	null_stream = g_mime_stream_null_new ();
+	null_stream_filter = g_mime_stream_filter_new (null_stream);
+	windows_filter = g_mime_filter_windows_new (charset);
+	g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+ windows_filter);
+	g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+	charset = g_mime_filter_windows_real_charset(
+	(GMimeFilterWindows *) windows_filter);
+	g_object_unref (windows_filter);
+}
+
 stream_filter = g_mime_stream_filter_new (stream_out);
 crlf_filter = g_mime_filter_crlf_new (false, false);
 g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
 			 crlf_filter);
 g_object_unref (crlf_filter);
 
-charset = g_mime_object_get_content_type_parameter (part, "charset");
 if (charset) {
 	GMimeFilter *charset_filter;
 	charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,9 +330,12 @@
 	}
 }
 
-wrapper = g_mime_part_get_content_object (GMIME_PART (part));
 if (wrapper && stream_filter)
 	g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
+if (null_stream_filter)
+	g_object_unref (null_stream_filter);
+if (null_stream)
+	g_object_unref (null_stream);
 if (stream_filter)
 	g_object_unref(stream_filter);
 }



Sebastian Poeplau  writes:

> Hi Jeff,
>
>> GMime actually comes with a stream filter (GMimeFilterWindows) which can 
>> auto-detect this situation.
>>
>> In this particular case, you'd instantiate the GMimeFilterWindows like this:
>>
>> filter = g_mime_filter_windows_new ("iso-8859-1");
>>
>> "iso-8859-1" being the charset that the content claims to be in.
>>
>> Then you'd pipe the raw (decoded but not converted to utf-8) content through
>> the filter and afterward call g_mime_filter_windows_real_charset (filter)
>> which would return, in this user's case, "windows-1252".
>
> Nice, this is exactly what I was looking for! Somehow I missed it when
> checking GMime. I'll adapt my local fix and post the results here.
>
> Thanks,
> Sebastian


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-24 Thread Sebastian Poeplau
Hi Jeff,

> GMime actually comes with a stream filter (GMimeFilterWindows) which can 
> auto-detect this situation.
>
> In this particular case, you'd instantiate the GMimeFilterWindows like this:
>
> filter = g_mime_filter_windows_new ("iso-8859-1");
>
> "iso-8859-1" being the charset that the content claims to be in.
>
> Then you'd pipe the raw (decoded but not converted to utf-8) content through
> the filter and afterward call g_mime_filter_windows_real_charset (filter)
> which would return, in this user's case, "windows-1252".

Nice, this is exactly what I was looking for! Somehow I missed it when
checking GMime. I'll adapt my local fix and post the results here.

Thanks,
Sebastian


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-24 Thread Jeffrey Stedfast
Hi all (sent this to David already using Reply instead of Reply-All, d'oh!),

GMime actually comes with a stream filter (GMimeFilterWindows) which can 
auto-detect this situation.

In this particular case, you'd instantiate the GMimeFilterWindows like this:

filter = g_mime_filter_windows_new ("iso-8859-1");

"iso-8859-1" being the charset that the content claims to be in.

Then you'd pipe the raw (decoded but not converted to utf-8) content through the
filter and afterward call g_mime_filter_windows_real_charset (filter) which
would return, in this user's case, "windows-1252".

Hope that helps,

Jeff
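
The detect-then-convert flow described above can be sketched in a few lines. This is not GMime's implementation, only a rough analogue assuming the claimed charset is ISO 8859-1; the function names real_charset and to_text are hypothetical.

```python
def real_charset(claimed: str, content: bytes) -> str:
    """Guess the actual charset of content that claims to be ISO 8859-1.

    Hypothetical analogue of running the content through a detection
    filter and then asking it for the real charset. Only the
    iso-8859-1 -> windows-1252 case is handled here.
    """
    uses_c1_range = any(0x80 <= byte <= 0x9F for byte in content)
    if claimed.lower() == "iso-8859-1" and uses_c1_range:
        return "windows-1252"
    return claimed


def to_text(claimed: str, content: bytes) -> str:
    """Decode content using the detected charset rather than the claimed one."""
    return content.decode(real_charset(claimed, content), errors="replace")


body = b"\x93quotation marks\x94"      # transfer-decoded, not yet converted
print(real_charset("iso-8859-1", body))
print(to_text("iso-8859-1", body))
```

The first pass only inspects the bytes; the conversion to text happens afterwards with whichever charset the inspection settled on.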

On 7/23/18, 9:49 PM, "notmuch on behalf of David Bremner" 
 wrote:

Sebastian Poeplau  writes:

> Hi,
>
> This email is to suggest a minor change in how notmuch handles text
> encoding when displaying emails. The motivation is the following: I keep
> receiving emails that are encoded with Windows-1252 but claim to be
> ISO 8859-1. The two character sets only differ in the range between 0x80
> and 0x9F where Windows-1252 contains special characters (e.g. “quotation
> marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
> thus causes some special characters in such emails to be displayed with
> a replacement symbol for non-printable characters.

Hi Sebastian;

Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.

David




Re: Handling mislabeled emails encoded with Windows-1252

2018-07-24 Thread Sebastian Poeplau
Hi again,

>> Everyone's mail situation is unique, but I haven't noticed this
>> problem. Do you have a mechanical (e.g. scripted) way of detecting such
>> mails? I suppose it could just look for characters in the range 0x80 to
>> 0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
>> own mail would help me think about this problem, I think.
>
> Yes, I guess that should be a good enough heuristic for detecting
> affected mail. I'll try to come up with a simple script and post it
> here.

Attached is a Python script that checks individual message files and
prints their name if it finds them to contain mislabeled Windows-1252
text. The heuristic seems to work well on my mail - let me know if you
encounter any issues!

Cheers,
Sebastian




find_mislabeled_cp1252.py
Description: Binary data
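
The attachment itself is not reproduced in this archive. As a rough idea of what such a check can look like, here is a minimal sketch of the heuristic discussed in the thread, assuming Python's standard email package. It is not the attached script, and the function names are hypothetical.

```python
import codecs
import email
from email import policy


def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte lies in the range 0x80..0x9F."""
    return any(0x80 <= byte <= 0x9F for byte in data)


def claims_latin1(charset: str) -> bool:
    """Return True if the label is some spelling of ISO 8859-1."""
    try:
        return codecs.lookup(charset).name == "iso8859-1"
    except LookupError:
        return False


def is_mislabeled(raw_message: bytes) -> bool:
    """Heuristic: a text part labeled ISO 8859-1 contains 0x80..0x9F bytes."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        if not claims_latin1(part.get_content_charset() or ""):
            continue
        # decode=True undoes base64/quoted-printable but leaves the charset alone
        payload = part.get_payload(decode=True)
        if payload and has_c1_bytes(payload):
            return True
    return False


def find_mislabeled(paths):
    """Return the subset of message file paths that look mislabeled."""
    hits = []
    for path in paths:
        with open(path, "rb") as handle:
            if is_mislabeled(handle.read()):
                hits.append(path)
    return hits
```

Calling find_mislabeled() with a list of message file paths returns the ones that trip the heuristic. Since ISO 8859-1 assigns only control characters to that range, false positives should be rare in ordinary text mail.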


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-24 Thread Sebastian Poeplau
Hi David,

> Everyone's mail situation is unique, but I haven't noticed this
> problem. Do you have a mechanical (e.g. scripted) way of detecting such
> mails? I suppose it could just look for characters in the range 0x80 to
> 0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
> own mail would help me think about this problem, I think.

Yes, I guess that should be a good enough heuristic for detecting
affected mail. I'll try to come up with a simple script and post it
here.

Cheers,
Sebastian


Re: Handling mislabeled emails encoded with Windows-1252

2018-07-23 Thread David Bremner
Sebastian Poeplau  writes:

> Hi,
>
> This email is to suggest a minor change in how notmuch handles text
> encoding when displaying emails. The motivation is the following: I keep
> receiving emails that are encoded with Windows-1252 but claim to be
> ISO 8859-1. The two character sets only differ in the range between 0x80
> and 0x9F where Windows-1252 contains special characters (e.g. “quotation
> marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
> thus causes some special characters in such emails to be displayed with
> a replacement symbol for non-printable characters.

Hi Sebastian;

Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.

David
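
To make the difference described in the quoted message concrete, the small sketch below decodes the same bytes under both labels. It uses only Python's built-in codecs, and the sample bytes are made up for illustration.

```python
# Bytes 0x93 and 0x94 are curly quotation marks in Windows-1252,
# but C1 control characters in ISO 8859-1.
raw = b"\x93quotation marks\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_cp1252))
print(repr(as_latin1))

# Outside 0x80..0x9F the two charsets agree, e.g. 0xE9 is e-acute in both.
same_outside = b"\xe9".decode("cp1252") == b"\xe9".decode("iso-8859-1")
print(same_outside)
```

A viewer that trusts the ISO 8859-1 label ends up with control characters where the sender intended quotation marks, which is what produces the replacement symbols.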

