Re: Issue in RFC2047 encoding of Subject

Gero Treuner Thu, 05 Nov 2020 07:48:26 -0800

Hi Arnt,

Thanks for the input.

On Thu, Nov 05, 2020 at 02:32:03PM +0100, Arnt Gulbrandsen wrote:
> On Thursday 5 November 2020 09:49:44 CET, Gero Treuner wrote:
> > * I interpret RFC2047 that separating encoded blocks by space (without
> >   newline) means two words, so space should be displayed to the user as-is
> >   (Mutt doesn't display the space, but this probably is a different story)
> 
> Mutt is right and you misunderstand. See RFC 2047 page 10, first sentence,
> and the surrounding text.
> 
> This rule seems strange at first, but makes sense when you remember that
> some languages use very long sequences of nonwhitespace, while RFC822 (like
> its successors) does not permit much more than seventy sequential
> nonwhitespace characters in a header field. Somehow, the encoded form of a
> string in such a language has to contain ignorable white space.

My understanding is based on this section on page 3:

  An 'encoded-word' may not be more than 75 characters long, including
  'charset', 'encoding', 'encoded-text', and delimiters.  If it is
  desirable to encode more text than will fit in an 'encoded-word' of
  75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
  be used.

So the question is: if text has to be broken into multiple encoded
words, MUST they be separated by CRLF followed by a space, or is a
space alone also allowed?
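
To make the quoted rule concrete, here is a minimal sketch of an encoder
that keeps every encoded-word within 75 characters and joins the pieces
with CRLF SPACE. It is only an illustration, not Mutt's code: the names
encode_words() and q_encode_byte() are made up for this example, and it
assumes a single-byte charset such as iso-8859-1, so it never has to
worry about splitting a multi-byte character.

```python
def q_encode_byte(byte):
    """Return the "Q" encoding of one byte (hypothetical helper)."""
    if byte < 128 and chr(byte).isalnum():
        return chr(byte)          # ASCII letters and digits pass through
    if byte == 0x20:
        return "_"                # a space is written as an underscore
    return "=%02X" % byte         # everything else becomes =XX


def encode_words(text, charset="iso-8859-1", limit=75):
    """Split text into encoded-words of at most `limit` characters.

    Assumption: `charset` is single-byte, so any byte boundary is also
    a character boundary.
    """
    prefix = "=?%s?Q?" % charset
    suffix = "?="
    room = limit - len(prefix) - len(suffix)
    words, current = [], ""
    for byte in text.encode(charset):
        token = q_encode_byte(byte)
        if len(current) + len(token) > room:
            # The next token would not fit: close this encoded-word.
            words.append(prefix + current + suffix)
            current = ""
        current += token
    if current:
        words.append(prefix + current + suffix)
    # Fold with CRLF SPACE, as in the RFC text quoted above.
    return "\r\n ".join(words)


print(encode_words("Große"))
# =?iso-8859-1?Q?Gro=DFe?=
```

With an artificially small limit of 23, the same function emits
"=?iso-8859-1?Q?Gro=DF?=" and "=?iso-8859-1?Q?e?=" as two encoded-words,
which shows how a length limit alone can produce a 4 + 1 split. Whether
that is what happens inside Mutt is a separate question.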

My test case in the GitLab issue also shows inconsistent behaviour. This
header

  Subject: =?iso-8859-1?Q?Gro=DF?= =?iso-8859-1?Q?e?= wordxxxx word wordxx

is displayed as

  Subject: Große wordxxxx word wordxx

So the space between the encoded words is not shown, while the space
between an encoded word and the unencoded words is shown. If this is
common practice, then for a run of several words that need encoding and
have spaces in between, these are the options:

  1. Embed spaces in the encoded words

  2. When breaking into parts, avoid positions before or after
     whitespace
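
The display behaviour and option 1 can be tried out with a small
decoder. This is a sketch that only models the rule from RFC 2047
section 6.2 (whitespace between two adjacent encoded-words is dropped);
it is not Mutt's implementation, and decode_word() and
decode_header_value() are names invented for the example.

```python
import base64
import binascii
import re

# One encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")
# Whitespace (including a folded CRLF SPACE) between two encoded-words.
GAP = re.compile(r"(\?=)[ \t\r\n]+(?==\?[^?\s]+\?[bBqQ]\?)")


def decode_word(match):
    """Decode a single encoded-word matched by ENCODED_WORD."""
    charset, encoding, text = match.groups()
    if encoding.lower() == "b":
        raw = base64.b64decode(text)
    else:
        # header=True makes "_" decode to a space, as "Q" requires.
        raw = binascii.a2b_qp(text.encode("ascii"), header=True)
    return raw.decode(charset)


def decode_header_value(value):
    """Decode a header value, ignoring gaps between encoded-words."""
    joined = GAP.sub(r"\1", value)   # drop whitespace between encoded-words
    return ENCODED_WORD.sub(decode_word, joined)


subject = "=?iso-8859-1?Q?Gro=DF?= =?iso-8859-1?Q?e?= wordxxxx word wordxx"
print(decode_header_value(subject))
# Große wordxxxx word wordxx

# Option 1: the space travels inside the first encoded-word as "_".
print(decode_header_value("=?iso-8859-1?Q?one_?= =?iso-8859-1?Q?two?="))
# one two
```

Under this rule a plain space and CRLF SPACE between encoded-words are
treated the same way on display, while whitespace next to unencoded
text is kept.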

It is not clear to me why Mutt splits the word "Große" into parts of 4
+ 1 characters, whether the different cases for displaying the space
are specified somewhere, or which of the approaches above Mutt should
follow.

In the GitLab issue I included the key section of code. Thinking more
about it, I guess that the "n = t1 - t - 1;" statement (where the last
letter is cut off) is not well aligned with choose_block() in this
special case. I need to look closer.

Gero
