Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new scan

2010-11-14 Thread Earl Hood
On Sun, Nov 14, 2010 at 11:45 AM, Jon Steinhart j...@fourwinds.com wrote:
 My preference is to say that we'll treat any =?...?= as an encoded word
 wherever it appears and that we'll decode it.  It appears that the authors of
 RFC2047 expect that everything will be parsed into tokens and examined before
 looking for encoded words.

You're right.  RFC 822 defined the basic tokenization rules,
and MIME attempts to stay compatible with that.  I.e., you have
a system that knows how to do RFC 822 tokenization, and then
that token data can be passed to the MIME-aware layer.
Here is a relevant note from RFC 2047:

   IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
   by an RFC 822 parser.  As a consequence, unencoded white space
   characters (such as SPACE and HTAB) are FORBIDDEN within an
   'encoded-word'.  For example, the character sequence

  =?iso-8859-1?q?this is some text?=

   would be parsed as four 'atom's, rather than as a single 'atom' (by
   an RFC 822 parser) or 'encoded-word' (by a parser which understands
   'encoded-words').  The correct way to encode the string "this is some
   text" is to encode the SPACE characters as well, e.g.

  =?iso-8859-1?q?this=20is=20some=20text?=
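
To make the RFC's point concrete, here is a minimal Python sketch (not nmh code). It assumes that splitting on white space is a good enough stand-in for RFC 822 atom tokenization, which ignores specials and quoting; the names ENCODED_WORD and classify_tokens are hypothetical.

```python
import re

# Simplified pattern for one RFC 2047 encoded-word: =?charset?encoding?text?=
# No white space is allowed anywhere inside it.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

def classify_tokens(body):
    """Split on white space first, then flag tokens that are encoded-words."""
    return [(tok, bool(ENCODED_WORD.fullmatch(tok))) for tok in body.split()]

# The malformed example breaks into four tokens, none of them an encoded-word.
bad = classify_tokens("=?iso-8859-1?q?this is some text?=")
# The correctly encoded example stays a single token that is an encoded-word.
good = classify_tokens("=?iso-8859-1?q?this=20is=20some=20text?=")
print(bad)
print(good)
```

Tokenizing before looking for encoded words is what makes the unencoded spaces fatal: by the time the MIME-aware layer sees the tokens, the original string has already been cut apart.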


I think many mail implementations today probably do
not work that way, mainly due to ignorance of the developers.
Although not related to this topic, an example of this
ignorance is the syntax adopted in DKIM headers.

As for space between encoded words, such space should be
collapsed.  I.e., two adjacent encoded words should be
concatenated together after decoding, with no space between
them.
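
RFC 2047 section 6.2 describes this for display: linear white space separating a pair of adjacent encoded words is ignored. A minimal Python sketch of that rule follows (not nmh code). The helper names are hypothetical, and the BETWEEN pattern is a simplification that only looks for "?=" followed by white space and "=?".

```python
import base64
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")
# White space sitting between the end of one encoded-word and the start of the next.
BETWEEN = re.compile(r"(\?=)[ \t\r\n]+(?==\?)")

def q_to_bytes(text):
    """Decode Q encoding: "_" means SPACE, "=XX" is a hex-escaped byte."""
    text = text.replace("_", " ")
    text = re.sub(r"=([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)
    return text.encode("latin-1")

def decode_word(match):
    charset, encoding, text = match.groups()
    raw = base64.b64decode(text) if encoding.lower() == "b" else q_to_bytes(text)
    return raw.decode(charset, errors="replace")

def decode_text(body):
    """Drop white space between adjacent encoded-words, then decode each one."""
    joined = BETWEEN.sub(r"\1", body)
    return ENCODED_WORD.sub(decode_word, joined)

print(decode_text("=?iso-8859-1?q?H?= =?iso-8859-1?q?I?="))
print(decode_text("Re: =?iso-8859-1?q?caf=E9?= menu"))
```

White space between an encoded word and ordinary text is left alone; only the gap between two encoded words is removed.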

Note, it is a mistake to blindly assume that all sequences
of =?...?= should be decoded, which has led to some erroneous
uses by some software.  For example, using =?...?= inside
parameter values instead of using RFC 2184 (now RFC 2231).

 My current plan for the new scan code is to:

  1.  Read a header field name.

  2.  Read a header field body if the header field is used by the format,
     unfolding folded lines in the process.

  3.  Look for encoded words and decode them creating a UTF-8 version of the
     header field body.
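
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the nmh scan code: scan_field and unfold are hypothetical names, the set of wanted field names stands in for "used by the format", and Python's standard email.header module does the decoding. Note that email.header decodes any =?...?= sequence it finds, regardless of context.

```python
import re
from email.header import decode_header, make_header

def unfold(text):
    """Remove the line break of each folded line, keeping the white space after it."""
    return re.sub(r"\r?\n(?=[ \t])", "", text)

def scan_field(raw_field, wanted):
    # Step 1: read the header field name.
    name, _, body = raw_field.partition(":")
    name = name.strip()
    # Step 2: read the body only if the format uses this field, unfolding it.
    if name.lower() not in wanted:
        return None
    body = unfold(body).strip()
    # Step 3: decode encoded words and produce a UTF-8 version of the body.
    decoded = str(make_header(decode_header(body)))
    return name, decoded.encode("utf-8")

print(scan_field("Subject: =?iso-8859-1?q?caf=E9?=", {"subject"}))
print(scan_field("X-Mailer: something", {"subject"}))
```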

I've never really dived into MH/nmh parsing code.  Is there any
attempt to perform RFC 822 based tokenization before doing any
other processing?

Decoding of encoded words should only be done in specific contexts.
Look at Section 5 of RFC 2047 for the contexts in which encoded
words are allowed.
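
One of those context rules is that an encoded word must not appear within a quoted-string. A minimal Python sketch of honoring just that one rule (not nmh code; the names are hypothetical, and the other Section 5 restrictions such as addr-spec and MIME parameters are not handled):

```python
import re

# A double-quoted string, allowing backslash-escaped characters inside it.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

def decodable_words(body):
    """Return encoded-word candidates that are not inside a quoted-string."""
    # Blank out quoted strings so nothing inside them can match.
    outside = QUOTED.sub(lambda m: " " * len(m.group(0)), body)
    return ENCODED_WORD.findall(outside)

field = '"=?us-ascii?q?x?=" =?us-ascii?q?y?= <a@example.org>'
print(decodable_words(field))
```

Only the second candidate is reported; the first sits inside a quoted-string and would be left as literal text by a context-aware decoder.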

--ewh

___
Nmh-workers mailing list
Nmh-workers@nongnu.org
http://lists.nongnu.org/mailman/listinfo/nmh-workers


Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new scan

2010-11-14 Thread Jon Steinhart
 On Sun, Nov 14, 2010 at 11:45 AM, Jon Steinhart j...@fourwinds.com wrote:
  My preference is to say that we'll treat any =?...?= as an encoded word
  wherever it appears and that we'll decode it.  It appears that the authors of
  RFC2047 expect that everything will be parsed into tokens and examined before
  looking for encoded words.
 
 You're right.  RFC 822 defined the basic tokenization rules,
 and MIME attempts to stay compatible with that.  I.e., you have
 a system that knows how to do RFC 822 tokenization, and then
 that token data can be passed to the MIME-aware layer.
 Here is a relevant note from RFC 2047:
 
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
by an RFC 822 parser.  As a consequence, unencoded white space
characters (such as SPACE and HTAB) are FORBIDDEN within an
'encoded-word'.  For example, the character sequence
 
   =?iso-8859-1?q?this is some text?=
 
would be parsed as four 'atom's, rather than as a single 'atom' (by
an RFC 822 parser) or 'encoded-word' (by a parser which understands
'encoded-words').  The correct way to encode the string "this is some
text" is to encode the SPACE characters as well, e.g.
 
   =?iso-8859-1?q?this=20is=20some=20text?=

Well sure, that's in the RFC but it doesn't really make a lot of sense to me.
Would be way more sensible in my opinion to decode everything and then parse it
as it would eliminate a zillion special cases in RFC-land.  And, the fact that
you can't have an encoded word for H next to an encoded word for I to make HI
just leads to the RFC2231 ugliness.  In any case, they chose to do it the
overly complex way.  But my question is really what to do when somebody sends
me this:

=?iso-8859-1?q?this is some text?=

Seems more sensible to treat the whole thing as an encoded word and to decode it.
Are you suggesting that I should just treat it as text and not decode it?
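
A minimal Python sketch of that lenient behavior (not nmh code; LENIENT_WORD and decode_lenient are hypothetical names). It assumes the only deviation tolerated is unencoded white space inside the encoded text, and it deliberately goes beyond what RFC 2047 permits.

```python
import base64
import re

# Like an RFC 2047 encoded-word, except the text part may contain white space.
LENIENT_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?(.*?)\?=")

def q_to_bytes(text):
    """Decode Q encoding: "_" means SPACE, "=XX" is a hex-escaped byte."""
    text = text.replace("_", " ")
    text = re.sub(r"=([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), text)
    return text.encode("latin-1")

def decode_word(match):
    charset, encoding, text = match.groups()
    if encoding.lower() == "b":
        # Stray white space carries no meaning in base64, so drop it.
        raw = base64.b64decode("".join(text.split()))
    else:
        raw = q_to_bytes(text)
    return raw.decode(charset, errors="replace")

def decode_lenient(body):
    """Decode every =?...?= sequence, even one with raw spaces inside."""
    return LENIENT_WORD.sub(decode_word, body)

print(decode_lenient("=?iso-8859-1?q?this is some text?="))
print(decode_lenient("=?iso-8859-1?q?this=20is=20some=20text?="))
```

Both the malformed and the correctly encoded form come out as the same text, and a body with no =?...?= sequence passes through unchanged.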

 As for space between encoded words, such space should be
 collapsed.  I.e., two adjacent encoded words should be
 concatenated together after decoding, with no space between
 them.

Where in what RFC do you find this?  RFC2047 section 5, (1) says that encoded
words must be separated from each other by linear white space but doesn't say
that that white space is later removed.  Collapsing out the space doesn't seem
to make sense since there seems to be some intent to treat encoded words as
normal text words which one would normally separate by spaces.

 Note, it is a mistake to blindly assume that all sequences
 of =?...?= should be decoded, which has led to some erroneous
 uses by some software.  For example, using =?...?= inside
 parameter values instead of using RFC 2184 (now RFC 2231).

Hmm.  Where in what RFC is this prohibited?  I'll agree that it doesn't make
a whole lot of sense to have so many mechanisms that do the same thing, but what
harm would come from this if it was decoded properly?

So once again, I'm not asking what is proper when encoding a message.  I'm asking
for guidance on sensible behavior when decoding an improperly encoded message.

I'm unaware of any cases where the character sequences for encoded words would
appear in any properly formatted items such as dates or addresses.  So it seems
to me that no harm would be done if I decoded such illegal stuff anyway as the
alternative is an error message.

I'm trying to design a simple piece of code that will reasonably process everything.
Of course, it can't be that simple since there are two incompatible Q encodings and
other such cruft.  But I really don't want to have to parse every single type of
header because it's pretty much all text from the scan point of view.

Jon
