Re: Problem with accented characters in mailbox.Maildir()

2023-05-09 Thread Peter J. Holzer
On 2023-05-08 23:02:18 +0200, jak wrote:
> Peter J. Holzer wrote:
> > On 2023-05-06 16:27:04 +0200, jak wrote:
> > > Chris Green wrote:
> > > > Chris Green  wrote:
> > > > > A bit more information, msg.get("subject", "unknown") does return a
> > > > > string, as follows:-
> > > > > 
> > > > >   Subject: 
> > > > > =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> > [...]
> > > > ... and of course I now see the issue!  The Subject: with utf-8
> > > > characters in it gets spaces changed to underscores.  So searching for
> > > > '(Waterways Continental Europe)' fails.
> > > > 
> > > > I'll either need to test for both versions of the string or I'll need
> > > > to change underscores to spaces in the Subject: returned by msg.get().
[...]
> > > 
> > > subj = email.header.decode_header(raw_subj)[0]
> > > 
> > > subj[0].decode(subj[1])
[...]
> > email.header.decode_header returns a *list* of chunks and you have to
> > process and concatenate all of them.
> > 
> > Here is a snippet from a mail to html converter I wrote a few years ago:
> > 
> > def decode_rfc2047(s):
> >     if s is None:
> >         return None
> >     r = ""
> >     for chunk in email.header.decode_header(s):
[...]
> >                 r += chunk[0].decode(chunk[1])
[...]
> >     return r
[...]
> > 
> > I do have to say that Python is extraordinarily clumsy in this regard.
> 
> Thanks for the reply. In fact, I gave that answer because I did
> not understand what the OP wanted to achieve. In addition, the
> OP opened a second thread on the similar topic in which I gave a
> more correct answer (subject: "What do these '=?utf-8?' sequences
> mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").

Right. I saw that after writing my reply. I should have read all
messages, not just that thread before replying.

> the OP, I discovered that the MAME is not the only format used
> to compose the subject.

Not sure what "MAME" is. If it's a typo for MIME, then the base64
variant of RFC 2047 is just as much a part of it as the quoted-printable
variant.

> This made me think that a library could not delegate to the programmer
> the burden of managing all these exceptions,

email.header.decode_header handles both variants, but it produces bytes
sequences which still have to be decoded to get a Python string.
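
As a rough illustration of that point, the sketch below builds the same
word ("Saône", chosen here only as an example) in both the base64 (B) and
the quoted-printable (Q) form and passes each to decode_header. The
variable names are made up for this example.

```python
import base64
from email.header import decode_header

text = "Saône"

# Build the same word in both RFC 2047 encodings.
b_word = "=?utf-8?B?" + base64.b64encode(text.encode("utf-8")).decode("ascii") + "?="
q_word = "=?utf-8?Q?Sa=C3=B4ne?="

b_chunks = decode_header(b_word)
q_chunks = decode_header(q_word)

# Each chunk is a (bytes, charset) pair, so a decode step is still needed.
# Indexing [0] is only safe here because each header is a single encoded word.
b_decoded = b_chunks[0][0].decode(b_chunks[0][1])
q_decoded = q_chunks[0][0].decode(q_chunks[0][1])
```

Both calls should yield the same bytes and charset, which is why the
caller does not need to care which variant the sender used.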


> then I have further investigated to discover that the library also
> provides the conversion function beyond that of coding and this makes
> our labors vain:
> 
> --
> from email.header import decode_header, make_header
> 
> subject = make_header(decode_header(raw_subject))
> --

Yup. I somehow missed that. That's a lot more convenient than calling
decode in a loop (or generator expression). Depending on what you want
to do with the subject you may have to wrap that in a call to str(), but
it's still a one-liner.
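
A minimal sketch of that one-liner applied to the OP's subject, with the
str() call included so that the "in" test works on a plain string. The
names raw_subject and search_txt are placeholders for this example.

```python
from email.header import decode_header, make_header

# Raw header value as returned by msg.get("subject") in the thread.
raw_subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
search_txt = "Waterways Continental Europe"

# decode_header splits the value into chunks; make_header reassembles them
# into a Header object, and str() turns that into an ordinary string.
subject = str(make_header(decode_header(raw_subject)))

found = search_txt in subject
```

Searching the raw value fails because the spaces are still underscores
there; searching the decoded string succeeds.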

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Problem with accented characters in mailbox.Maildir()

2023-05-08 Thread Peter J. Holzer
On 2023-05-06 16:27:04 +0200, jak wrote:
> Chris Green wrote:
> > Chris Green  wrote:
> > > A bit more information, msg.get("subject", "unknown") does return a
> > > string, as follows:-
> > > 
> > >  Subject: 
> > > =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
[...]
> > ... and of course I now see the issue!  The Subject: with utf-8
> > characters in it gets spaces changed to underscores.  So searching for
> > '(Waterways Continental Europe)' fails.
> > 
> > I'll either need to test for both versions of the string or I'll need
> > to change underscores to spaces in the Subject: returned by msg.get().

You need to decode the Subject properly. Unfortunately the Python email
module doesn't do that for you automatically. But it does provide the
necessary tools. Don't roll your own unless you've read and understood
the relevant RFCs.
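
One such tool, sketched here under the assumption that the raw message
bytes can be parsed with email.message_from_bytes and email.policy.default
(neither is mentioned in this thread): with that policy the header comes
back already decoded, whereas the legacy default policy returns the raw
encoded word. The sample message is invented for the example.

```python
import email
from email import policy

# A tiny hand-made message carrying the subject from this thread.
raw_message = (
    b"Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_"
    b"(Waterways_Continental_Europe)?=\r\n"
    b"\r\n"
    b"body\r\n"
)

# Legacy behaviour: the header value is returned still encoded.
legacy = email.message_from_bytes(raw_message)
legacy_subject = str(legacy["subject"])

# With policy.default the encoded word is decoded on access.
modern = email.message_from_bytes(raw_message, policy=policy.default)
modern_subject = str(modern["subject"])
```

Whether this fits a filter built on mailbox.MaildirMessage depends on the
rest of the program, so treat it as an option to evaluate rather than a
drop-in change.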

> 
> This is probably what you need:
> 
> import email.header
> 
> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
> 
> subj = email.header.decode_header(raw_subj)[0]
> 
> subj[0].decode(subj[1])
> 
> 'aka Marne à la Saône (Waterways Continental Europe)'

You are on the right track, but that works only because the example
consists of just a single encoded word. This is not always the case (and
indeed not what the RFC recommends).

email.header.decode_header returns a *list* of chunks and you have to
process and concatenate all of them.

Here is a snippet from a mail to html converter I wrote a few years ago:

import email.header

def decode_rfc2047(s):
    if s is None:
        return None
    r = ""
    for chunk in email.header.decode_header(s):
        if chunk[1]:
            # Encoded word: decode with the declared charset.
            try:
                r += chunk[0].decode(chunk[1])
            except LookupError:
                # Unknown charset name: fall back to windows-1252.
                r += chunk[0].decode("windows-1252")
            except UnicodeDecodeError:
                # Bytes invalid for the declared charset: same fallback.
                r += chunk[0].decode("windows-1252")
        elif type(chunk[0]) == bytes:
            r += chunk[0].decode('us-ascii')
        else:
            r += chunk[0]
    return r

(this is maybe a bit more forgiving than the OP needs, but I had to deal
with malformed mails)
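
To make the "list of chunks" point concrete, here is a small sketch with
a made-up subject that mixes plain text and one encoded word. The helper
join_chunks is hypothetical and deliberately simpler than the function
above (no fallback charset); it only shows that every chunk has to be
visited.

```python
import email.header

# Invented header: plain text, one encoded word, more plain text.
raw = "Re: =?utf-8?Q?Sa=C3=B4ne?= trip report"
chunks = email.header.decode_header(raw)


def join_chunks(chunks):
    """Decode and concatenate every (data, charset) chunk."""
    parts = []
    for data, charset in chunks:
        if isinstance(data, bytes):
            # Plain-text chunks carry no charset; treat them as ASCII.
            parts.append(data.decode(charset or "us-ascii"))
        else:
            parts.append(data)
    return "".join(parts)


decoded = join_chunks(chunks)
```

Taking only element [0] of chunks would drop the encoded word and the
text after it.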

I do have to say that Python is extraordinarily clumsy in this regard.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Problem with accented characters in mailbox.Maildir()

2023-05-08 Thread jak

Chris Green wrote:
> Chris Green  wrote:
> > A bit more information, msg.get("subject", "unknown") does return a
> > string, as follows:-
> >
> >  Subject:
> > =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> >
> > So it's the 'searchTxt in msg.get("subject", "unknown")' that's
> > failing. I.e. for some reason 'in' isn't working when the searched
> > string has utf-8 characters.
> >
> > Surely there's a way to handle this.
> >
> ... and of course I now see the issue!  The Subject: with utf-8
> characters in it gets spaces changed to underscores.  So searching for
> '(Waterways Continental Europe)' fails.
>
> I'll either need to test for both versions of the string or I'll need
> to change underscores to spaces in the Subject: returned by msg.get().
> It's a long enough string that I'm searching for that I won't get any
> false positives.
>
>
> Sorry for the noise everyone, it's a typical case of explaining the
> problem shows one how to fix it! :-)



This is probably what you need:

import email.header

raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='


subj = email.header.decode_header(raw_subj)[0]

subj[0].decode(subj[1])

'aka Marne à la Saône (Waterways Continental Europe)'






Re: Problem with accented characters in mailbox.Maildir()

2023-05-08 Thread Chris Green
Chris Green  wrote:
> A bit more information, msg.get("subject", "unknown") does return a
> string, as follows:-
> 
> Subject: 
> =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> 
> So it's the 'searchTxt in msg.get("subject", "unknown")' that's
> failing. I.e. for some reason 'in' isn't working when the searched
> string has utf-8 characters.  
> 
> Surely there's a way to handle this.
> 
... and of course I now see the issue!  The Subject: with utf-8
characters in it gets spaces changed to underscores.  So searching for
'(Waterways Continental Europe)' fails.

I'll either need to test for both versions of the string or I'll need
to change underscores to spaces in the Subject: returned by msg.get().
It's a long enough string that I'm searching for that I won't get any
false positives.


Sorry for the noise everyone, it's a typical case of explaining the
problem shows one how to fix it! :-)
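
The underscores come from the "Q" encoding used in the header, where "_"
stands for a space and "=C3=A0" style sequences stand for raw bytes. The
sketch below, using the standard quopri module on just the payload part
of the encoded word, is meant only to illustrate why swapping underscores
for spaces leaves the accented characters undecoded. It is not a
substitute for the email.header functions.

```python
import quopri

# Payload of the encoded word, without the "=?utf-8?Q?" and "?=" wrapper.
payload = b"aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)"

# header=True makes quopri treat "_" as an encoded space.
decoded = quopri.decodestring(payload, header=True).decode("utf-8")

# Replacing underscores by hand fixes the spaces but not the accents.
naive = payload.decode("ascii").replace("_", " ")
```

The naive version is enough for matching an ASCII-only search string, but
it would still miss any search text containing an accented character.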

-- 
Chris Green


Re: Problem with accented characters in mailbox.Maildir()

2023-05-08 Thread Chris Green
A bit more information, msg.get("subject", "unknown") does return a
string, as follows:-

Subject: 
=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

So it's the 'searchTxt in msg.get("subject", "unknown")' that's
failing. I.e. for some reason 'in' isn't working when the searched
string has utf-8 characters.  

Surely there's a way to handle this.

-- 
Chris Green


Re: Problem with accented characters in mailbox.Maildir()

2023-05-08 Thread jak

Chris Green wrote:
> I have a custom mail filter in python that uses the mailbox package to
> open a mail message and give me access to the headers.
>
> So I have the following code to open each mail message:-
>
>  #
>  #
>  # Read the message from standard input and make a message object from it
>  #
>  msg = mailbox.MaildirMessage(sys.stdin.buffer.read())
>
> and then later I have (among many other bits and pieces):-
>
>  #
>  #
>  # test for string in Subject:
>  #
>  if searchTxt in str(msg.get("subject", "unknown")):
>      do
>      various
>      things
>
>
> This works exactly as intended most of the time but occasionally a
> message whose subject should match the test is missed.  I have just
> realised when this happens, it's when the Subject: has accented
> characters in it (this is from a mailing list about canals in France).
>
> So, for example, the latest case of this happening has:-
>
>  Subject: aka Marne à la Saône (Waterways Continental Europe)
>
> where the searchTxt in the code above is "Waterways Continental Europe".
>
>
> Is there any way I can work round this issue?  E.g. is there a way to
> strip out all extended characters from a string?  Or maybe it's
> msg.get() that isn't managing to handle the accented string correctly?
>
> Yes, I know that accented characters probably aren't allowed in
> Subject: but I'm not going to get that changed! :-)




Hi,
you could try extracting the "Content-Type:charset" and then using it
for subject conversion:

subj = str(raw_subj, encoding='...')



Problem with accented characters in mailbox.Maildir()

2023-05-08 Thread Chris Green
I have a custom mail filter in python that uses the mailbox package to
open a mail message and give me access to the headers.

So I have the following code to open each mail message:-

# 
# 
# Read the message from standard input and make a message object from it 
# 
msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

and then later I have (among many other bits and pieces):-

#
#
# test for string in Subject:
#
if searchTxt in str(msg.get("subject", "unknown")):
    do
    various
    things


This works exactly as intended most of the time but occasionally a
message whose subject should match the test is missed.  I have just
realised when this happens, it's when the Subject: has accented
characters in it (this is from a mailing list about canals in France).

So, for example, the latest case of this happening has:-

Subject: aka Marne à la Saône (Waterways Continental Europe)

where the searchTxt in the code above is "Waterways Continental Europe".


Is there any way I can work round this issue?  E.g. is there a way to
strip out all extended characters from a string?  Or maybe it's
msg.get() that isn't managing to handle the accented string correctly?

Yes, I know that accented characters probably aren't allowed in
Subject: but I'm not going to get that changed! :-)


-- 
Chris Green