Re: [mime4j] newlines and parsing of nested (encoded) rfc822 messages

Stefano Bagnara Sat, 19 Jul 2008 09:30:46 -0700

Oleg Kalnichevski wrote:

Robert Burrell Donkin wrote:

On 7/18/08, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Robert Burrell Donkin wrote:

On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <[EMAIL PROTECTED]> wrote:

Robert Burrell Donkin wrote:

On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <[EMAIL PROTECTED]>
wrote:

Stefano Bagnara wrote:

<snip>


can we rewind a little

- If the message has only newlines, it seems mime4j ends up outputting
headers with CRLF and body with LF.

am i right in assuming that this is about using Mime4J for
roundtripping via org.apache.james.mime4j.message.Message?

It involves both reading and writing.

In our specific case, I note that we accept an LF as a separator in
headers, but we treat a CR as a character that is part of the header
(even though that is invalid).

E.g., I would say that in the case of an isolated CR in headers we have 3
options:
1) consider it a newline
 1a) output it as-is when roundtripping
 1b) convert it to CRLF when roundtripping
2) fail parsing (malformed message)
3) use it as part of the header value.

Now we do #3 and I think this is the worst solution.
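
A minimal sketch of how the three parsing options above could be expressed as a policy. All names here (BareCrHeaderSketch, BareCrPolicy, splitHeaderLines) are hypothetical and are not mime4j API. The sketch assumes CRLF and an isolated LF always end a header line, ignores header folding, and works on a String rather than on bytes. Options 1a and 1b differ only in what the writer emits, so they share one parsing policy here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not part of mime4j.
final class BareCrHeaderSketch {

    /** How to treat an isolated CR found inside a header block. */
    enum BareCrPolicy {
        NEWLINE,        // option 1: the CR ends the current header line
        FAIL,           // option 2: the message is rejected as malformed
        PART_OF_VALUE   // option 3: the CR stays inside the header value
    }

    /** Splits a raw header block into lines (header folding is ignored). */
    static List<String> splitHeaderLines(String raw, BareCrPolicy policy) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            boolean crlf = c == '\r' && i + 1 < raw.length() && raw.charAt(i + 1) == '\n';
            boolean endOfLine;
            if (crlf) {
                i++;                 // consume the LF of the CRLF pair
                endOfLine = true;
            } else if (c == '\n') {
                endOfLine = true;    // isolated LF is accepted as a separator
            } else if (c == '\r') {
                if (policy == BareCrPolicy.FAIL) {
                    throw new IllegalArgumentException("isolated CR at offset " + i);
                }
                endOfLine = policy == BareCrPolicy.NEWLINE;
            } else {
                endOfLine = false;
            }
            if (endOfLine) {
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);   // includes the CR under PART_OF_VALUE
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

With PART_OF_VALUE the CR survives inside the parsed value, which is the behaviour described as the current one.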

I don't know if mime4j should support all 4 of the solutions above for a
CR (4 configurations seems too much to me), but I think we should discuss
the merits of each solution and decide which ones we want to support.

i understand this argument. however, i still think we need to step
back a little and gain some perspective.

round tripping involves two distinct components.  the parser parses
the message into a DOM (Message) which is then written out.

AIUI it is this complete cycle that results in the line ending
inconsistency noted between the input and the output.  is my
understanding correct?

I think we should discuss parsing separately from outputting
something we have in memory.

Yes

IMHO, it's clear we'll never be able to
alter malformed mime content while preserving the malformations, so we
have to accept that in output we always have to create a canonical mime
message. This is currently not the case, but this is the least of my
concerns (because it is easier to fix, I think).


So I think there's rough consensus that writing the DOM should
canonicalise. Yes, I agree that this can be accommodated by altering
the DOM writer.
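
A minimal sketch of what canonicalising on output could mean for textual content: every CRLF, isolated CR, and isolated LF is written as CRLF. The class and method names are hypothetical, not mime4j API, and the sketch assumes it is only applied to headers and non-binary bodies.

```java
// Hypothetical sketch: not the mime4j writer, just the normalisation rule.
final class CanonicalLineEndingsSketch {

    /** Rewrites CRLF, isolated CR and isolated LF as CRLF. */
    static String toCrlf(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Treat CRLF as a single line break so it is not doubled.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append("\r\n");
            } else if (c == '\n') {
                out.append("\r\n");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The function is idempotent: running it on already canonical text leaves the text unchanged.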

So the issue is also during parsing:

1) we now have special treatment for isolated LF, but we do not have
something similar for CR. AFAIK both are special end-of-line delimiters
used on some specific platforms and not compliant with the canonical mime
format, so I think we *should* support both special chars (in lenient
parsing).

If this logic can be accommodated easily then it sounds like we
probably should unless there are good reasons not to

2) ((TextBody) b).getReader(). This gives me a reader, so it supports
the "line" concept: I expect it to treat "non canonical" newlines the
way the header/structure parser does. If headers are allowed to
terminate with an isolated LF, then lines in text content should be
allowed to do the same (because probably the whole mime message has LF
instead of CRLF). [The RFC seems to suggest that in this case the MIME
message is encoded using LF instead of CRLF and that this specific
encoding breaks binary parts, but we want to be smarter wrt this issue].


TextBody is part of the DOM. This can and should be addressed there
(rather than in the parser). I think that doing this should satisfy
both needs without compromising the performance of the parser.
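
For callers that only need lines from a Reader, the JDK already behaves leniently: java.io.BufferedReader.readLine() treats LF, CR, and CRLF as line terminators. A small sketch, assuming the Reader comes from something like getReader(); the helper class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the lines of any Reader into a list.
final class ReaderLinesSketch {

    static List<String> readLines(Reader reader) {
        List<String> lines = new ArrayList<>();
        try {
            BufferedReader buffered = new BufferedReader(reader);
            String line;
            // readLine() ends a line at LF, CR, or CRLF and strips the terminator.
            while ((line = buffered.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

This only covers consumers of the text body; it says nothing about how the parser itself finds header or boundary lines.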

If this is indeed something we can all agree on, I can try to solve the first problem (strict/lenient line delimiter handling) using a pluggable strategy of some kind.
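
One possible shape for such a pluggable strategy, as a rough sketch only. The interface, the two implementations, and countLines are all hypothetical names, not an existing or planned mime4j API, and the sketch works on a CharSequence for brevity where a real parser would work on bytes.

```java
// Hypothetical sketch of a pluggable line delimiter strategy.
final class LineDelimiterStrategySketch {

    interface LineDelimiterStrategy {
        /**
         * Returns the length of the line delimiter starting at pos,
         * or 0 if there is none. May throw on malformed input.
         */
        int delimiterLength(CharSequence data, int pos);
    }

    /** Lenient: CRLF, isolated CR and isolated LF all end a line. */
    static final LineDelimiterStrategy LENIENT = (data, pos) -> {
        char c = data.charAt(pos);
        if (c == '\r') {
            return (pos + 1 < data.length() && data.charAt(pos + 1) == '\n') ? 2 : 1;
        }
        return c == '\n' ? 1 : 0;
    };

    /** Strict: only CRLF ends a line; an isolated CR or LF is an error. */
    static final LineDelimiterStrategy STRICT = (data, pos) -> {
        char c = data.charAt(pos);
        if (c == '\r' && pos + 1 < data.length() && data.charAt(pos + 1) == '\n') {
            return 2;
        }
        if (c == '\r' || c == '\n') {
            throw new IllegalArgumentException(
                    "isolated " + (c == '\r' ? "CR" : "LF") + " at offset " + pos);
        }
        return 0;
    };

    /** Example consumer: counts lines, including an unterminated last line. */
    static int countLines(CharSequence data, LineDelimiterStrategy strategy) {
        int lines = 0;
        boolean unterminated = false;
        int pos = 0;
        while (pos < data.length()) {
            int len = strategy.delimiterLength(data, pos);
            if (len > 0) {
                lines++;
                unterminated = false;
                pos += len;
            } else {
                unterminated = true;
                pos++;
            }
        }
        return unterminated ? lines + 1 : lines;
    }
}
```

The consumer never inspects CR or LF itself, so swapping the strategy changes the behaviour without touching the scanning loop.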


Oleg

My limited knowledge of mime4j details doesn't let me reply "+1". So I'll simply say what I expect from mime4j as a user:


Lenient line delimiter parsing:

- consider isolated LF and CR in the mime stream as newlines as long as a newline concept exists in that specific place (everywhere but binary body parts having ContentTransferEncoding = "binary").

- This means that a CR in a base64 stream is a newline, a CR in a text/plain is a newline, a "CR<boundary> CR" sequence is a valid multipart boundary, "CRLFCR", "CRLFCR", "CRCRLF", "LFCRLF", "LFCR", "CRCR" or "LFLF" sequences are valid separators between header and body because they are considered as equivalent to "CRLFCRLF".

- This also means that writing this stuff in output will result in a mime stream with NO isolated CRs or LFs (unless they are in a "binary" encoded body).
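
The header/body separator rule above can be sketched as "two consecutive line breaks, where each break is CRLF, an isolated CR, or an isolated LF". The class and method names are hypothetical, not mime4j API, and the sketch assumes CRLF is matched greedily, so a lone "CRLF" counts as one break rather than two.

```java
// Hypothetical sketch: lenient detection of the blank line that ends the headers.
final class HeaderBodySeparatorSketch {

    /** Length of the line break at pos: 2 for CRLF, 1 for isolated CR or LF, else 0. */
    static int lineBreakLength(CharSequence s, int pos) {
        if (pos >= s.length()) {
            return 0;
        }
        char c = s.charAt(pos);
        if (c == '\r') {
            return (pos + 1 < s.length() && s.charAt(pos + 1) == '\n') ? 2 : 1;
        }
        return c == '\n' ? 1 : 0;
    }

    /** Returns the offset of the first body character, or -1 if no separator is found. */
    static int bodyStart(CharSequence message) {
        for (int i = 0; i < message.length(); i++) {
            int first = lineBreakLength(message, i);
            if (first == 0) {
                continue;
            }
            int second = lineBreakLength(message, i + first);
            if (second > 0) {
                return i + first + second; // two breaks in a row: end of headers
            }
            i += first - 1; // skip the rest of a single break and keep scanning
        }
        return -1;
    }
}
```

A message that starts directly with a blank line (no headers at all) is not handled by this sketch.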

Strict line delimiter parsing (I don't care if we have this now, I just think we should have this in mind while factoring mime4j because it should be possible to implement this with no major changes).

- Isolated LFs and CRs are not considered newlines and result in errors raised by the parser (invalid header, invalid content, and so on) that will result in a parsing failure or (if the raised errors are ignored) in an invalid DOM (I'm not sure how we currently handle this case for unexpected 8bit content in a header, but it should be the same).

- writing this content in output should result in well-formed content, so:

- if an LF in the header is somehow "encodable" as a valid sequence, it should be parsed as LF and then encoded while outputting. If instead an LF in the header is not encodable, then we should fail parsing, or remove it (or convert it to "?" or anything similar) if we want to be lenient.

I'm not saying that I want mime4j to support all of this before a release, I just want to understand if this is what you also expect and if this can be considered a common goal.


Stefano
