[issue1079] decode_header does not follow RFC 2047

2012-06-03 Thread Barry A. Warsaw

Barry A. Warsaw ba...@python.org added the comment:

On Jun 02, 2012, at 09:59 PM, R. David Murray wrote:

> I've applied this to 3.3.  Because the preservation of spaces around the
> ascii parts is a visible behavior change that could cause working programs to
> break, I don't think I can backport it.  I'm going to leave this open until I
> can consult with Barry to see if he thinks a backport is justified.  Anyone
> else can feel free to chime in with an opinion as well :)

I think a backport is risky.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1079
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue1079] decode_header does not follow RFC 2047

2012-06-03 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

OK, I'm closing this, then, and will close the related issues as well.

Thanks again for the patch, Ralf.

--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed
versions:  -Python 2.7, Python 3.2




[issue1079] decode_header does not follow RFC 2047

2012-06-03 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 0808cb8c60fd by R David Murray in branch 'default':
#2658: Add test for issue fixed by fix for #1079.
http://hg.python.org/cpython/rev/0808cb8c60fd

--




[issue1079] decode_header does not follow RFC 2047

2012-06-02 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 8c03fe231877 by R David Murray in branch 'default':
#1079: Fix parsing of encoded words.
http://hg.python.org/cpython/rev/8c03fe231877

--
nosy: +python-dev




[issue1079] decode_header does not follow RFC 2047

2012-06-02 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

I've applied this to 3.3.  Because the preservation of spaces around the ascii 
parts is a visible behavior change that could cause working programs to break, 
I don't think I can backport it.  I'm going to leave this open until I can 
consult with Barry to see if he thinks a backport is justified.  Anyone else 
can feel free to chime in with an opinion as well :)

--




[issue1079] decode_header does not follow RFC 2047

2012-05-29 Thread Ralf Schlatterbeck

Ralf Schlatterbeck r...@runtux.com added the comment:

On Mon, May 28, 2012 at 08:15:05PM +, R. David Murray wrote:
 
> R. David Murray rdmur...@bitdance.com added the comment:
>
> Ralf, thanks very much for this patch.  I'm considering applying it.
> Given that the current code breaks on parsing various legitimate
> constructs, it seems like the behavior change (preserving whitespace
> in the non-EW parts...which IMO is correct) should be an acceptable
> tradeoff.
>
> Could you please submit a contributor agreement?
> (http://www.python.org/psf/contrib/)

Thanks for considering my patch.
I've just sent the agreement.

Ralf

--




[issue1079] decode_header does not follow RFC 2047

2012-05-28 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Ralf, thanks very much for this patch.  I'm considering applying it.  Given 
that the current code breaks on parsing various legitimate constructs, it seems 
like the behavior change (preserving whitespace in the non-EW parts...which IMO 
is correct) should be an acceptable tradeoff.

Could you please submit a contributor agreement?  
(http://www.python.org/psf/contrib/)

--
components: +email -Library (Lib)




[issue1079] decode_header does not follow RFC 2047

2012-04-20 Thread Patrick Hahn

Changes by Patrick Hahn ph...@janestreet.com:


--
nosy: +phahn




[issue1079] decode_header does not follow RFC 2047

2012-01-03 Thread Ralf Schlatterbeck

Ralf Schlatterbeck r...@runtux.com added the comment:

Fine, I see what you mean, this involves very careful reading of the RFC
and could have been a little more verbose ...

Right. Should have been a ')'

> Adding the RFC tests would be great (patches gladly accepted).  Fixes
> for ones we fail would be great, too, but at the very least we can
> mark them as expected failures.  I don't usually like adding tests
> that we expect to fail, but in the case of externally defined tests
> such as the RFC examples I think it is worthwhile, so that we can
> check in a complete test set.

Patch attached (against current tip, 74241:120a79b8bb11). We currently
fail *all* of the tests in the RFC due to the same problem, the closing
')', I've marked them accordingly.

I've made the 5th test (with newline in the string) two cases, one with
\r\n for the newline, one with only \n. They fail differently.

I plan to look into this a little more, my current plan is to make the
outer regex non-greedy (if possible) and remove the trailing whitespace.
That would involve parsing (and ignoring) additional whitespace
*between* encoded words but not at the boundary to a non-encoded word.

Any objections/further infos?

Ralf
-- 
Dr. Ralf Schlatterbeck  Tel:   +43/2243/26465-16
Open Source Consulting  www:   http://www.runtux.com
Reichergasse 131, A-3411 Weidling   email: off...@runtux.com
osAlliance member   email: r...@osalliance.com

--
Added file: http://bugs.python.org/file24130/python.patch

diff -r 120a79b8bb11 Lib/test/test_email/test_email.py
--- a/Lib/test/test_email/test_email.py Tue Jan 03 06:26:13 2012 +0200
+++ b/Lib/test/test_email/test_email.py Tue Jan 03 16:16:09 2012 +0100
@@ -2056,6 +2056,67 @@
         self.assertEqual(decode_header(s),
                         [(b'andr\xe9=zz', 'iso-8659-1')])
 
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_1(self):
+        # 1st testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_2(self):
+        # 2nd testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?= b)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b' b)', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_3(self):
+        # 3rd testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b'b', 'iso-8859-1'),
+             (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_4(self):
+        # 4th testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b'b', 'iso-8859-1'),
+             (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_5a(self):
+        # 5th testcase at end of rfc2047 newline is \r\n
+        s = '(=?ISO-8859-1?Q?a?=\r\n=?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b'b', 'iso-8859-1'),
+             (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_5b(self):
+        # 5th testcase at end of rfc2047 newline is \n
+        s = '(=?ISO-8859-1?Q?a?=\n=?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b'b', 'iso-8859-1'),
+             (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_6(self):
+        # 6th testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a_b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a b', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_7(self):
+        # 7th testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?= =?ISO-8859-2?Q?_b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b' b', 'iso-8859-2'),
+             (b')', None)])
+
 
 # Test the MIMEMessage class
 class TestMIMEMessage(TestEmailBase):



[issue1079] decode_header does not follow RFC 2047

2012-01-03 Thread Ralf Schlatterbeck

Ralf Schlatterbeck r...@runtux.com added the comment:

enclosed please find a fixed patch -- decode_header consolidates
multiple encoded strings with the same encoding into a single entry in
the returned parts.
-- 
Dr. Ralf Schlatterbeck  Tel:   +43/2243/26465-16
Open Source Consulting  www:   http://www.runtux.com
Reichergasse 131, A-3411 Weidling   email: off...@runtux.com
osAlliance member   email: r...@osalliance.com

--
Added file: http://bugs.python.org/file24131/python.patch.2

diff -r 120a79b8bb11 Lib/test/test_email/test_email.py
--- a/Lib/test/test_email/test_email.py Tue Jan 03 06:26:13 2012 +0200
+++ b/Lib/test/test_email/test_email.py Tue Jan 03 17:09:34 2012 +0100
@@ -2056,6 +2056,63 @@
         self.assertEqual(decode_header(s),
                         [(b'andr\xe9=zz', 'iso-8659-1')])
 
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_1(self):
+        # 1st testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_2(self):
+        # 2nd testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?= b)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b' b)', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_3(self):
+        # 3rd testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'ab', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_4(self):
+        # 4th testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?=  =?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'ab', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_5a(self):
+        # 5th testcase at end of rfc2047 newline is \r\n
+        s = '(=?ISO-8859-1?Q?a?=\r\n=?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'ab', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_5b(self):
+        # 5th testcase at end of rfc2047 newline is \n
+        s = '(=?ISO-8859-1?Q?a?=\n=?ISO-8859-1?Q?b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'ab', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_6(self):
+        # 6th testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a_b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a b', 'iso-8859-1'), (b')', None)])
+
+    @unittest.expectedFailure
+    def test_rfc2047_rfc2047_7(self):
+        # 7th testcase at end of rfc2047
+        s = '(=?ISO-8859-1?Q?a?= =?ISO-8859-2?Q?_b?=)'
+        self.assertEqual(decode_header(s),
+            [(b'(', None), (b'a', 'iso-8859-1'), (b' b', 'iso-8859-2'),
+             (b')', None)])
+
 
 # Test the MIMEMessage class
 class TestMIMEMessage(TestEmailBase):



[issue1079] decode_header does not follow RFC 2047

2012-01-03 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Well, a caution that tweaking the regex can have unexpected consequences as 
past issues have proven (but by all means go for it), and a note that the 
parsing strategy is going to change completely in email6 (see 
http://pypi.python.org/email and http://hg.python.org/features/email6).  I 
think your tests should pass on that branch; I'll be interested to try it when 
I get some time.

(Note: I'm removing 3.1 from versions since it doesn't get bug fixes any more.)

Also, I'm not sure the (non-essential) change to consolidate like-charset 
encoded words is appropriate for a bug fix.  It's hard to see how it would 
break anything, but why take the risk if it isn't needed to fix the bug.
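
For readers unfamiliar with the consolidation step under discussion: it amounts to merging neighbouring decoded parts that share a charset. The following is only a minimal sketch, assuming the parts are `(bytes, charset)` pairs like those in the tests above; `collapse_same_charset` is a hypothetical name and not a function from the patch.

```python
def collapse_same_charset(decoded_words):
    """Merge adjacent (bytes, charset) pairs whose charset is identical."""
    collapsed = []
    for word, charset in decoded_words:
        if collapsed and collapsed[-1][1] == charset:
            # Same charset as the previous part: concatenate the payloads.
            collapsed[-1] = (collapsed[-1][0] + word, charset)
        else:
            collapsed.append((word, charset))
    return collapsed


parts = [(b'(', None), (b'a', 'iso-8859-1'), (b'b', 'iso-8859-1'), (b')', None)]
merged = collapse_same_charset(parts)
print(merged)
```

Parts with different charsets (for example iso-8859-1 followed by iso-8859-2) are left as separate entries, which is why the 7th RFC example keeps four parts.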

--
versions:  -Python 3.1




[issue1079] decode_header does not follow RFC 2047

2012-01-03 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Gah, that's what I get for not reading carefully (or looking at the patch 
first).  Your test change is fine, of course.

--




[issue1079] decode_header does not follow RFC 2047

2012-01-03 Thread Ralf Schlatterbeck

Ralf Schlatterbeck r...@runtux.com added the comment:

Attached please find a patch that
- keeps all spaces between non-encoded and encoded parts
- doesn't create spaces between non-encoded and encoded parts in case
  these are already there or not needed (because they are non-ctext
  characters of RFC822 like ')') in the methods encode and __str__
  of class Header.
In all other cases spaces are still inserted; this keeps many tests
happy and probably won't break too much existing code.

I've re-read RFC2047 (and parts of 822) and now share your opinion that
it requires that encoded parts *must* be followed by a
'linear-white-space' if the following (or preceding) token is text or ctext.
(p.7 5. Use of encoded-words in message headers)

With the special-casing of ctext characters mentioned above,
roundtripping is now possible, so if you parse a normalized string
consisting of encoded and non-encoded parts, (even multiple) whitespace
is preserved.

I still think we should do it like everyone else and *not* automatically
insert whitespace at boundaries between encoded and non-encoded words,
even if the RFC requires it. Someone wanting to create headers
consisting of mixed encoded/non-encoded parts without whitespace must
know what to do anyway -- the previous implementation also didn't check
for all border cases.

I've *not yet* tested this against the email6 branch you mentioned.

Note that I didn't have to make the regex non-greedy, it already
was. I've just removed the whitespace at the end of the regex.

I've changed all the tests that test for removal of whitespace between
non-encoded and encoded parts. Obviously I've also changed a test that
relied on failing to parse adjacent encoded strings. But please look at
my changes of the tests.

The rfc2047 tests now all pass.

The patch also fixes issue1467619 Header.decode_header eats up spaces
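
The whitespace handling described in this message can be sketched in isolation. The function below mirrors the droplist loop from the attached patch, but it is a standalone illustration: `drop_inter_word_space` and the sample `words` list are made-up names, and each entry is assumed to be a `(text, encoding, charset)` tuple with `encoding` set to `None` for unencoded text.

```python
def drop_inter_word_space(words):
    """Remove whitespace-only parts that sit between two encoded words."""
    drop = {
        n - 1
        for n, w in enumerate(words)
        # w and the part two positions back are both encoded words, and the
        # part in between consists of whitespace only.
        if n > 1 and w[1] and words[n - 2][1] and words[n - 1][0].isspace()
    }
    return [w for i, w in enumerate(words) if i not in drop]


words = [
    ('(', None, None),
    ('a', 'q', 'iso-8859-1'),
    (' ', None, None),          # whitespace between two encoded words
    ('b', 'q', 'iso-8859-1'),
    (')', None, None),
]
cleaned = drop_inter_word_space(words)
print(cleaned)
```

Whitespace next to an unencoded part, such as the `' b)'` in the 2nd RFC example, is not touched, which is what makes the round trip mentioned above possible.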

Ralf
-- 
Dr. Ralf Schlatterbeck  Tel:   +43/2243/26465-16
Open Source Consulting  www:   http://www.runtux.com
Reichergasse 131, A-3411 Weidling   email: off...@runtux.com
osAlliance member   email: r...@osalliance.com

--
Added file: http://bugs.python.org/file24133/python.patch3

diff -r 3609d32cec46 Lib/email/header.py
--- a/Lib/email/header.py   Tue Jan 03 17:48:19 2012 +0100
+++ b/Lib/email/header.py   Tue Jan 03 22:19:30 2012 +0100
@@ -40,7 +40,6 @@
   \?                    # literal ?
   (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
   \?=                   # literal ?=
-  (?=[ \t]|$)           # whitespace or the end of the string
   ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
 
 # Field name regexp, including trailing colon, but not separating whitespace,
@@ -86,8 +85,12 @@
     words = []
     for line in header.splitlines():
         parts = ecre.split(line)
+        first = True
         while parts:
-            unencoded = parts.pop(0).strip()
+            unencoded = parts.pop(0)
+            if first:
+                unencoded = unencoded.lstrip()
+                first = False
             if unencoded:
                 words.append((unencoded, None, None))
             if parts:
@@ -95,6 +98,16 @@
                 encoding = parts.pop(0).lower()
                 encoded = parts.pop(0)
                 words.append((encoded, encoding, charset))
+    # Now loop over words and remove words that consist of whitespace
+    # between two encoded strings.
+    import sys
+    droplist = []
+    for n, w in enumerate(words):
+        if n>1 and w[1] and words[n-2][1] and words[n-1][0].isspace():
+            droplist.append(n-1)
+    for d in reversed(droplist):
+        del words[d]
+
     # The next step is to decode each encoded word by applying the reverse
     # base64 or quopri transformation.  decoded_words is now a list of the
     # form (decoded_word, charset).
@@ -217,22 +230,27 @@
         self._normalize()
         uchunks = []
         lastcs = None
+        lastspace = None
         for string, charset in self._chunks:
             # We must preserve spaces between encoded and non-encoded word
             # boundaries, which means for us we need to add a space when we go
             # from a charset to None/us-ascii, or from None/us-ascii to a
             # charset.  Only do this for the second and subsequent chunks.
+            # Don't add a space if the None/us-ascii string already has
+            # a space (trailing or leading depending on transition)
             nextcs = charset
             if nextcs == _charset.UNKNOWN8BIT:
                 original_bytes = string.encode('ascii', 'surrogateescape')
                 string = original_bytes.decode('ascii', 'replace')
             if uchunks:
+                hasspace = string and self._nonctext(string[0])
                 if lastcs 

[issue1079] decode_header does not follow RFC 2047

2012-01-02 Thread Ralf Schlatterbeck

Ralf Schlatterbeck r...@runtux.com added the comment:

Maybe it would be a good start to include the examples at the end of RFC2047 
in the regression tests? These examples at least support the case that a '?' 
may immediately follow an encoded string:

encoded form                                displayed as
(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)     (ab)

when trying this in python 2.7:

>>> decode_header ('(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)')
[('(', None), ('a', 'iso-8859-1'), ('=?ISO-8859-1?Q?b?=)', None)]

this fails. So I consider this a bug.

Note that although RFC2047 is vague concerning the interpretation if two 
encoded strings could follow each other without a whitespace, these *are* seen 
in the wild and *are* interpreted correctly by the mailers I've tested: mutt, 
thunderbird, exchange in various versions, even lotus notes seems to get this 
right. So I guess Python should be liberal in what it accepts and parse 
something like 
'(=?ISO-8859-1?Q?a?==?ISO-8859-1?Q?b?=)'
into
[ ('(', None)
, ('a', 'iso-8859-1')
, ('b', 'iso-8859-1')
, (')', None)
]
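
The inputs from this message can be tried directly. This is a small sketch, assuming an interpreter that already contains the fix applied later in this thread (Python 3.3 or newer, going by the messages above); on such versions the parts are `bytes` and adjacent same-charset words may come back consolidated into one entry.

```python
from email.header import decode_header

examples = [
    '(=?ISO-8859-1?Q?a?=)',                      # encoded word followed by ')'
    '(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)',   # RFC 2047 example, one space
    '(=?ISO-8859-1?Q?a?==?ISO-8859-1?Q?b?=)',    # no whitespace, seen in the wild
]

# Map each header string to the list of (payload, charset) parts it decodes to.
results = {s: decode_header(s) for s in examples}
for s, parts in results.items():
    print(s, '->', parts)
```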

--
nosy: +runtux




[issue1079] decode_header does not follow RFC 2047

2012-01-02 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

The RFC isn't at all vague about encoded words not separated by white space.  
That isn't allowed by the BNF.  As you say, though, they occur in the wild and 
should be parsed correctly.

In your other point I think you mean immediately followed by a ), right?  
Yes, that is allowed and no, we don't currently parse that correctly.

Adding the RFC tests would be great (patches gladly accepted).  Fixes for ones 
we fail would be great, too, but at the very least we can mark them as expected 
failures.  I don't usually like adding tests that we expect to fail, but in the 
case of externally defined tests such as the RFC examples I think it is 
worthwhile, so that we can check in a complete test set.

--




[issue1079] decode_header does not follow RFC 2047

2011-03-13 Thread R. David Murray

Changes by R. David Murray rdmur...@bitdance.com:


--
versions: +Python 3.3




[issue1079] decode_header does not follow RFC 2047

2010-11-30 Thread R. David Murray

Changes by R. David Murray rdmur...@bitdance.com:


--
assignee:  -> r.david.murray




[issue1079] decode_header does not follow RFC 2047

2010-09-26 Thread Tokio Kikuchi

Tokio Kikuchi tkiku...@users.sourceforge.net added the comment:

Hi, all

I am against applying these patches because they will insert space separations 
in re-composed header (with str() function).

Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord
-> [('Sm', None), ('\xf6', 'iso-8859-1'), ('rg', None), ('\xe5', 'iso-8859-1'),
('sbord', None)]
-> Sm =?iso-8859-1?q?=F6?= rg =?iso-8859-1?q?=E5?= sbord

Instead, I submit a small recipe for decoding non-compliant RFC2047 header 
where space separation is not properly inserted between encoded_word and 
us-ascii characters.

Please try!
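
The attached recipe is not reproduced in this archive. As a rough illustration of the same goal (getting readable text without extra spaces being inserted), one can join the decoded parts by hand instead of going through `str()` on a `Header`. The sketch below is hypothetical and is not the contents of u2u_decode.py; it assumes a Python 3 interpreter that already splits embedded encoded words (3.3 or newer, per the later messages), and `decode_to_text` is an invented name.

```python
from email.header import decode_header


def decode_to_text(header):
    """Decode a header and join the parts without adding separators."""
    pieces = []
    for part, charset in decode_header(header):
        if isinstance(part, bytes):
            # Unencoded parts carry charset None; treat them as ASCII.
            pieces.append(part.decode(charset or 'ascii', 'replace'))
        else:
            # decode_header returns str when no encoded word is present.
            pieces.append(part)
    return ''.join(pieces)


text = decode_to_text('Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord')
print(text)
```

An unknown charset name would raise `LookupError` here; real code would need to decide how to handle that.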

--
Added file: http://bugs.python.org/file19026/u2u_decode.py




[issue1079] decode_header does not follow RFC 2047

2010-09-18 Thread Mark Lawrence

Changes by Mark Lawrence breamore...@yahoo.co.uk:


--
stage: needs patch -> patch review
versions: +Python 3.2 -Python 2.6, Python 3.0




[issue1079] decode_header does not follow RFC 2047

2010-04-10 Thread Oliver Martin

Oliver Martin oli...@volatilevoid.net added the comment:

I got bitten by this too. In addition to not decoding encoded words without 
whitespace after them, it throws an exception if there is a valid encoded word 
later in the string and the first encoded word is followed by something that 
isn't a hex number:

>>> decode_header('aaa=?iso-8859-1?q?bbb?=xxx asdf =?iso-8859-1?q?jkl?=')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/email/header.py", line 93, in decode_header
    dec = email.quoprimime.header_decode(encoded)
  File "/usr/lib/python2.5/email/quoprimime.py", line 336, in header_decode
    return re.sub(r'=\w{2}', _unquote_match, s)
  File "/usr/lib/python2.5/re.py", line 150, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "/usr/lib/python2.5/email/quoprimime.py", line 324, in _unquote_match
    return unquote(s)
  File "/usr/lib/python2.5/email/quoprimime.py", line 106, in unquote
    return chr(int(s[1:3], 16))
ValueError: invalid literal for int() with base 16: 'xx'

I think it should join the encoded words with the surrounding text if there's 
no whitespace in between. That seems to be consistent with what the 
non-RFC-compliant MUAs out there mean when they send such things.

Reverting the change from Issue 1582282 doesn't seem to be a good idea, since 
it was introduced in response to problems with mailman (see 
https://bugs.launchpad.net/mailman/+bug/266370). Instead of leaving 
Sm=?ISO-8859-1?B?9g==?=rg=?ISO-8859-1?B?5Q==?=sbord as it is, my patch 
converts it to [('Sm\xf6rg\xe5sbord', 'iso-8859-1')]. This shouldn't 
reintroduce the problem mailman was having while fixing ours.
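
The traceback can be reproduced without the email package. This is a sketch under stated assumptions: `STRICT` is a compact rewrite of the lookahead pattern quoted elsewhere in this thread (assumed to match what header.py used at the time), `naive_unquote` imitates the `r'=\w{2}'` substitution visible in the traceback, and `strict_unquote` is a hypothetical alternative that is not taken from any attached patch.

```python
import re

# Encoded-word pattern with the trailing "whitespace or end" lookahead.
STRICT = re.compile(r'=\?([^?]*?)\?([qb])\?(.*?)\?=(?=[ \t]|$)',
                    re.IGNORECASE | re.MULTILINE)


def naive_unquote(s):
    # Treats any "=" plus two word characters as a hex escape.
    return re.sub(r'=\w{2}', lambda m: chr(int(m.group(0)[1:3], 16)), s)


def strict_unquote(s):
    # Hypothetical variant: only substitute genuine two-digit hex escapes.
    return re.sub(r'=[0-9A-Fa-f]{2}',
                  lambda m: chr(int(m.group(0)[1:3], 16)), s)


header = 'aaa=?iso-8859-1?q?bbb?=xxx asdf =?iso-8859-1?q?jkl?='
# The first "?=" is rejected because "x" follows it, so the non-greedy group
# keeps growing until the final "?=" at the end of the string.
swallowed = STRICT.search(header).group(3)
print(repr(swallowed))

try:
    naive_unquote(swallowed)
except ValueError as exc:
    error = exc
else:
    error = None
print(error)
```

The captured payload contains `=xx`, which is what `int(..., 16)` chokes on.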

--
nosy: +leromarinvit
Added file: http://bugs.python.org/file16857/rfc2047_embed.patch




[issue1079] decode_header does not follow RFC 2047

2009-04-08 Thread Atsuo Ishimoto

Atsuo Ishimoto ishim...@gembook.org added the comment:

+1 for Tony's patch.

This patch reverts the fix for Issue1582282 filed by tkikuchi.
I cannot understand the rationale for the solution proposed in
Issue1582282. How does that fix make it easier to read mail from
Entourage?

--
nosy: +ishimoto, tkikuchi




[issue1079] decode_header does not follow RFC 2047

2009-04-04 Thread R. David Murray

R. David Murray rdmur...@bitdance.com added the comment:

Tony, I don't think I agree with your reading of the RFC.  IMO, your
inversion of test_rfc2047_without_whitespace is not correct.  '=' is not
a 'special' in RFC[2]822 terms, so the atom does not end at the apparent
end of the encoded word.  I say apparent because if I'm interpreting the
RFC correctly it isn't a valid encoded word.  I presume you are thinking
that once you've got an atom composed of several encoded words, there's
no reason not to parse them into individual words (and I'm inclined to
agree with you), but the RFC BNF doesn't support that interpretation as
far as I can see. That is, there is no indication I could find that an
atom can be composed of multiple encoded words.

The encoded word followed by a 'special' is more subtle.  In section 5,
the RFC says:

 An 'encoded-word' that appears within a 'phrase' MUST be
 separated from any adjacent 'word', 'text' or
 'special' by 'linear-white-space'.

This would apply to encoded words in names in To and From headers. But
in other places where an encoded word can appear the requirement of
white-space separation from specials is not asserted.  It's not clear
how to make this RFC compliant without implementing a full BNF parser :(

It would probably be reasonable to fix this case so that a word followed
by a special with no intervening white space was allowed.  I've attached
a test case for the 'special' example based on that idea.

--
nosy: +r.david.murray
stage:  -> needs patch
type:  -> behavior
versions: +Python 3.0, Python 3.1 -Python 2.5
Added file: http://bugs.python.org/file13616/issue1079-test.patch




[issue1079] decode_header does not follow RFC 2047

2009-04-04 Thread Tony Nelson

Tony Nelson tony_nel...@users.sourceforge.net added the comment:

The email package does not follow the RFCs in anything to do with header
parsing or decoding.  This is a known deficiency.  So no, I am not
thinking of atoms at all -- and neither is email.header.decode_header()! :-(

Until email.header actually parses headers into atoms and then decodes
atoms, it doesn't matter what parsed atoms would look like.  Currently,
email.header.decode_header() just stumbles through raw text, and doesn't
know if it is looking at atoms or not, or usually even what header the
text came from.

In order to interpret the RFC correctly, email.header.decode_header()
needs either a parser and the name of the header it is decoding, or
parsed header data.  I think the latter is being considered for a
redesign of the email package for 3.1 or 3.2 (3 months to a year or so,
and not for 2.x at all), but until then, it is better to decode every
likely encoded-word than to skip encoded-words that, for example, have a
parenthesis on one side or the other.

--




[issue1079] decode_header does not follow RFC 2047

2009-04-03 Thread Tony Nelson

Tony Nelson tony_nel...@users.sourceforge.net added the comment:

I think the problem is best viewed this way: headers are not being parsed
according to RFC2822 and only then decoded, so the recognition of
encoded words should be looser and not require whitespace around them,
since whitespace is not required in all contexts.

Patch and test, tested on 2.6.1, 2.7trunk.  The test mostly just
reverses the sense of test_rfc2047_without_whitespace().

--
keywords: +patch
nosy: +barry, tony_nelson
versions: +Python 2.6, Python 2.7
Added file: http://bugs.python.org/file13608/header_encwd_nows.patch




[issue1079] decode_header does not follow RFC 2047

2009-02-04 Thread Gabriel Genellina

Changes by Gabriel Genellina gagsl-...@yahoo.com.ar:


--
nosy: +gagenellina




[issue1079] decode_header does not follow RFC 2047

2009-02-03 Thread Tom Lynn

Tom Lynn tl...@users.sourceforge.net added the comment:

The only difference between the two regexps is that the email/header.py
version looks for::

  (?=[ \t]|$)   # whitespace or the end of the string

at the end (with re.MULTILINE, so $ also matches '\n').

To expand on "There is nothing about that thing in RFC 2047", it says::

   IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
   by an RFC 822 parser.

RFC 822 says::

   atom        =  1*<any CHAR except specials, SPACE and CTLs>
  ...
   specials    =  "(" / ")" / "<" / ">" / "@"  ; Must be in quoted-
               /  "," / ";" / ":" / "\" / <">  ;  string, to use
               /  "." / "[" / "]"              ;  within a word.

So an example of mis-parsing is::

   >>> import email.header
   >>> h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
   >>> email.header.decode_header(h)
   [('=?utf-8?q?=E2=98=BA?=(unicode white smiling face)', None)]

The correct result would be::

   >>> email.header.decode_header(h)
   [('\xe2\x98\xba', 'utf-8'), ('(unicode white smiling face)', None)]

which is what you get if you insert a space before the '(' in h.
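
The effect of that trailing lookahead can be checked with `re` alone. The sketch below rebuilds the pattern with and without it; the pattern text is a reconstruction based on what is quoted in this thread, and the names `ECRE_STRICT` and `ECRE_LOOSE` are illustrative, not names from the standard library.

```python
import re

_BASE = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
_FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

# Variant that requires whitespace or end of line after the closing "?=".
ECRE_STRICT = re.compile(_BASE + r'(?=[ \t]|$)', _FLAGS)
# Variant without that requirement.
ECRE_LOOSE = re.compile(_BASE, _FLAGS)

h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
strict_match = ECRE_STRICT.search(h)
loose_match = ECRE_LOOSE.search(h)
print(strict_match)                   # the "(" after "?=" blocks the match
print(loose_match.group('encoded'))
```

Inserting a space before the `(` lets the strict variant match as well, in line with the observation above.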

--
nosy: +tlynn




[issue1079] decode_header does not follow RFC 2047

2007-09-17 Thread Sean Reifschneider

Sean Reifschneider added the comment:

Can you provide an example of an address that triggers this?  Preferably
in a code sample that can be used to reproduce it?  Uber-ideally, a
patch to the email module test suite would be great.

--
nosy: +jafo
priority:  -> normal




[issue1079] decode_header does not follow RFC 2047

2007-09-01 Thread Mickaël Guérin

New submission from Mickaël Guérin:

email.header.decode_header expects a space or end of line after the end
of an encoded word (?=). There is nothing about that thing in RFC 2047.

Python 2.5.1 ChangeLog seems to indicate that this bug has been solved.
Unfortunately, the function still doesn't work.

A visible effect of the bad regex is the behavior reported in Issue
1467619.

It seems there are two different regexes with the same purpose in two
different files (ecre in header.py and ecre in utils.py). The one in
utils.py seems to give better results.

--
components: Library (Lib)
messages: 5
nosy: kael
severity: normal
status: open
title: decode_header does not follow RFC 2047
versions: Python 2.5
