On Wed, 2018-07-18 at 14:42 -0700, Andrew Morton wrote:
> On Wed, 18 Jul 2018 16:52:54 +0200 Geert Uytterhoeven 
> <[email protected]> wrote:
> 
> > As Perl uses its own internal character encoding, always calling
> > encode("utf8", ...) on the author name may cause corruption, leading to
> > an author signoff mismatch.
> > 
> > This happens in the following cases:
> >   - If a patch is in ISO-8859, and contains a non-ASCII author name in
> >     the From: line, it is converted to UTF-8, while the Signed-off-by
> >     line will still be in ISO-8859.
> >   - If a patch is in UTF-8, and contains a non-ASCII author name in the
> >     body (not header) From: line, it is assumed to be encoded in Perl's
> >     internal character encoding, and converted to UTF-8 incorrectly,
> >     while the Signed-off-by line will be in real UTF-8.
> > 
> > Fix this by only doing the encode step if the From: line used UTF-8
> > quoted printable encoding.
> 
> Works for me, thanks.

Me too so far, but I've more testing I'd like to do.

> Relatedly, would it be worth adding a checkpatch warning if a patch
> contains anything other than ASCII or UTF-8?
> 
> I added this to my little local patch-checking script.
> 
>       if ! file "$p" | grep -q -P "ASCII text|Unicode text"
>       then
>               echo "$p: weird charset"
>       fi

That might be hard to make effective.

For instance, the lkml mail I've kept so far this year has a
mixture of charsets (ascii, utf-8, iso-8859, windows-1252 and
a few others), and several different transfer encodings are
in use too.

$ grep -Poh "\bcharset=\S+" \
    ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/* |
    cut -f3- -d: | sort | uniq -c | sort -rn
    821 charset=us-ascii
    469 charset="UTF-8"
    394 charset="ISO-8859-1"
    252 charset=US-ASCII
    221 charset=utf-8
    118 charset=utf-8;
     97 charset="utf-8"
     66 charset=UTF-8
     60 charset="us-ascii"
     33 charset=ISO-8859-15
     24 charset=iso-8859-1
     18 charset=US-ASCII;
     11 charset=us-ascii;
      7 charset=windows-1252;
      7 charset="utf-8";
      6 charset="UTF-8";
      5 charset=windows-1252
      5 charset="iso-8859-1"
      4 charset="windows-1252"
      3 charset=UTF-8;
      3 charset="US-ASCII"
      2 charset="iso-2022-jp"
      2 charset=gbk;
      2 charset="gb2312"
      1 charset="utf-7"
      1 charset="iso-8859-15"
      1 charset=ISO-8859-1
      1 charset="gbk";
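
The tally above splits each charset across several spellings (case,
quotes, trailing semicolon). A minimal sketch of folding those together
before counting follows. The names normalize_charsets and tally_charsets
are hypothetical helpers, not part of the command above, and GNU grep is
assumed for -P, as in the original.

```shell
# Hypothetical helper: fold case and strip quotes and semicolons so that
# charset="UTF-8", charset=utf-8; and charset=UTF-8 all count as one label.
normalize_charsets() {
    tr -d '";' | tr '[:upper:]' '[:lower:]'
}

# Hypothetical wrapper: tally the normalized charset labels found in the
# files given as arguments. Assumes GNU grep (for -P).
tally_charsets() {
    grep -Poh '\bcharset=\S+' "$@" | normalize_charsets | sort | uniq -c | sort -rn
}
```

It could be run as "tally_charsets" followed by the same mail folder glob
used above. Folding the listing above by hand in this way gives 1165
us-ascii, 987 utf-8 and 424 iso-8859-1 occurrences, with the remaining 58
spread over six other charsets.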

And

$ grep "^Content-Transfer-Encoding:" \
    ~/.local/share/evolution/mail/local/.MailingLists.Linux-Kernel/cur/* |
    cut -f3- -d: | sort | uniq -c | sort -rn
    873 Content-Transfer-Encoding: 7bit
    212 Content-Transfer-Encoding: 8bit
     97 Content-Transfer-Encoding: quoted-printable
     63 Content-Transfer-Encoding: base64
     56 Content-Transfer-Encoding: 8BIT
     24 Content-Transfer-Encoding: 7Bit
      3 Content-Transfer-Encoding: 7BIT
      2 Content-Transfer-Encoding: QUOTED-PRINTABLE
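
As for the local check quoted earlier, one possible variant is to test
whether a patch file actually decodes as UTF-8 rather than matching the
output of file. This is only a sketch: is_utf8_clean and check_patches are
hypothetical names, and it assumes an iconv that exits non-zero on an
invalid input sequence (as the glibc one does). Plain ASCII passes as
well, since ASCII is a subset of UTF-8.

```shell
# Hypothetical helper: succeed if the file decodes cleanly as UTF-8.
# Assumes iconv exits non-zero when it meets an invalid byte sequence.
is_utf8_clean() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical wrapper: report each file that is neither ASCII nor UTF-8.
check_patches() {
    for p in "$@"; do
        is_utf8_clean "$p" || echo "$p: not valid ASCII/UTF-8"
    done
}
```

This looks only at the raw bytes of the file. It knows nothing about the
charset or Content-Transfer-Encoding headers tallied above, so a base64 or
quoted-printable mail body would pass regardless of what it decodes to.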
