Oleg Kalnichevski wrote:
Robert Burrell Donkin wrote:
On 7/18/08, Stefano Bagnara <[EMAIL PROTECTED]> wrote:
Robert Burrell Donkin wrote:
On Fri, Jul 18, 2008 at 9:34 AM, Stefano Bagnara <[EMAIL PROTECTED]>
wrote:
Robert Burrell Donkin wrote:
On Thu, Jul 17, 2008 at 4:02 PM, Stefano Bagnara <[EMAIL PROTECTED]>
wrote:
Stefano Bagnara wrote:
<snip>
can we rewind a little
- If the message has only bare LF newlines, it seems mime4j ends up
outputting headers with CRLF and body with LF.
am i right in assuming that this is about using Mime4J for
roundtripping via org.apache.james.mime4j.message.Message?
It involves both reading and writing.
In our specific case I note that we accept an LF as a separator in
headers, but we treat a CR as a character that is part of the header
(although it is invalid).
E.g.: I would say that in the case of an isolated CR in headers we
have 3 options:
1) consider it a newline
1a) output it as-is when roundtripping
1b) convert it to CRLF when roundtripping
2) fail parsing (malformed message)
3) use it as part of the header value.
Now we do #3 and I think this is the worst solution.
I don't know if mime4j should support all 4 of the solutions above
for a CR (4 configurations seem too many to me), but I think we should
discuss the merits of each solution and decide which ones we want to
support.
i understand this argument. however, i still think we need to step
back a little and gain some perspective.
round tripping involves two distinct components. the parser parses
the message into a DOM (Message) which is then written out.
AIUI it is this complete cycle that results in the line ending
inconsistency noted between the input and the output. is my
understanding correct?
I think we should discuss parsing separately from outputting
something we have in memory.
Yes
IMHO, it's clear we'll never be able to
alter malformed MIME content while preserving the malformations, so we
have to accept that on output we always create a canonical MIME
message. This is currently not the case, but it is the least of my
concerns (because it is easier to fix, I think).
So I think there's rough consensus that writing the DOM should
canonicalise. Yes, I agree that this can be accommodated by altering
the DOM writer.
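
A minimal sketch of what canonicalising on output could look like for
text content, assuming the writer has the text available as a String.
The class and method names here are hypothetical and are not part of
the mime4j API; parts with a "binary" transfer encoding would have to
bypass this step.

```java
// Hypothetical helper, not mime4j API: normalises line endings of text content.
final class CanonicalLineEndings {
    private CanonicalLineEndings() {}

    /** Rewrites every CRLF, isolated CR, or isolated LF as a single CRLF. */
    static String toCrlf(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Swallow the LF of an existing CRLF pair so it is not doubled.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append("\r\n");
            } else if (c == '\n') {
                out.append("\r\n");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a header block read with LF-only line endings and a
body read with mixed endings would both be written with CRLF only.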
So the issue is also during parsing:
1) we now have special treatment for an isolated LF, but nothing
similar for CR. AFAIK both are end-of-line delimiters used on
specific platforms and not compliant with the canonical MIME format, so
I think we *should* support both special chars (in lenient parsing).
If this logic can be accommodated easily then it sounds like we
probably should unless there are good reasons not to
2) ((TextBody) b).getReader(). This gives me a reader, so it supports
the "line" concept: I expect it to treat "non-canonical" newlines the
same way the header/structure parser does. If headers are allowed to
terminate with an isolated LF, then lines in text content should be too
(because the whole MIME message probably uses LF instead of CRLF).
[The RFC seems to suggest that in this case the MIME message is
encoded using LF instead of CRLF and that this specific encoding breaks
binary parts, but we want to be smarter about this issue.]
TextBody is part of the DOM. This can and should be addressed there
(rather than in the parser). I think that doing this should satisfy
both needs without compromising the performance of the parser.
If this is indeed something we can all agree on, I can try to solve the
first problem (strict/lenient line delimiter handling) using a pluggable
strategy of some kind.
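
A rough sketch of how such a pluggable strategy might be shaped,
assuming it only has to split raw text into lines. The interface and
class names are hypothetical and do not exist in mime4j; a real
implementation would work on byte streams rather than Strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical strategy interface, not mime4j API.
interface LineDelimiterStrategy {
    /** Splits raw text into lines, or throws if a delimiter is not acceptable. */
    List<String> splitLines(String raw);
}

/** Lenient: CRLF, isolated CR, and isolated LF all end a line. */
final class LenientDelimiters implements LineDelimiterStrategy {
    public List<String> splitLines(String raw) {
        List<String> lines = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r' || c == '\n') {
                // Treat CRLF as one delimiter, not two.
                if (c == '\r' && i + 1 < raw.length() && raw.charAt(i + 1) == '\n') {
                    i++;
                }
                lines.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) {
            lines.add(cur.toString()); // trailing text without a terminator
        }
        return lines;
    }
}

/** Strict: only CRLF ends a line; an isolated CR or LF is an error. */
final class StrictDelimiters implements LineDelimiterStrategy {
    public List<String> splitLines(String raw) {
        List<String> lines = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r' && i + 1 < raw.length() && raw.charAt(i + 1) == '\n') {
                i++;
                lines.add(cur.toString());
                cur.setLength(0);
            } else if (c == '\r' || c == '\n') {
                throw new IllegalArgumentException("isolated CR or LF at offset " + i);
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) {
            lines.add(cur.toString());
        }
        return lines;
    }
}
```

The caller would pick one implementation up front, so the parsing loop
itself does not need to know which mode is active.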
Oleg
My limited knowledge of mime4j details doesn't let me reply "+1". So I
will simply say what I expect from mime4j as a user:
Lenient line delimiter parsing:
- consider isolated LF and CR in the MIME stream as newlines as long as
a newline concept exists in that specific place (everywhere except
binary body parts with Content-Transfer-Encoding "binary").
- This means that a CR in a base64 stream is a newline, a CR in a
text/plain is a newline, a "CR<boundary> CR" sequence is a valid
multipart boundary, "CRLFCR", "CRLFLF", "CRCRLF", "LFCRLF", "LFCR",
"CRCR" or "LFLF" sequences are valid separators between header and body
because they are considered as equivalent to "CRLFCRLF".
- This also means that writing this stuff to output will result in a
MIME stream with NO isolated CRs or LFs (unless they are in a "binary"
encoded body).
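
The lenient header/body separator rule above can be sketched as
follows, assuming the raw message is available as a String. The class
and method names are hypothetical (not mime4j API), and the sketch
ignores the corner case of a message that starts with an empty header
block.

```java
// Hypothetical helper, not mime4j API: lenient detection of the header/body separator.
final class LenientSeparator {
    private LenientSeparator() {}

    /** Length of the newline starting at pos: 2 for CRLF, 1 for an isolated CR or LF, 0 if none. */
    static int newlineLength(String s, int pos) {
        if (pos >= s.length()) {
            return 0;
        }
        char c = s.charAt(pos);
        if (c == '\r') {
            return (pos + 1 < s.length() && s.charAt(pos + 1) == '\n') ? 2 : 1;
        }
        return c == '\n' ? 1 : 0;
    }

    /** Index of the first body character, or -1 if no two consecutive newlines are found. */
    static int bodyStart(String raw) {
        for (int i = 0; i < raw.length(); i++) {
            int first = newlineLength(raw, i);
            if (first == 0) {
                continue;
            }
            int second = newlineLength(raw, i + first);
            if (second > 0) {
                // Two newlines in a row: equivalent to CRLFCRLF in lenient mode.
                return i + first + second;
            }
            i += first - 1; // skip the rest of a lone CRLF
        }
        return -1;
    }
}
```

Because CRLF is matched greedily, the sequence "CRLF" on its own is
always read as one newline, never as an isolated CR followed by an
isolated LF.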
Strict line delimiter parsing (I don't care if we have this now, I just
think we should have this in mind while factoring mime4j because it
should be possible to implement this with no major changes).
- Isolated LFs and CRs are not considered newlines and result in
errors raised by the parser (invalid header, invalid content, and so
on). These lead to a parsing failure or (if the raised errors are
ignored) to an invalid DOM (I'm not sure how we currently handle this
case for unexpected 8bit content in a header, but it should be the
same).
- writing this content to output should result in well-formed content, so:
- if an LF in the header is somehow "encodable" as a valid sequence
it should be parsed as LF and then encoded while outputting. If instead
an LF in the header is not encodable then we should fail parsing or
remove it (or convert it to "?" or anything similar) if we want to be
lenient.
I'm not saying that I want mime4j to support all of this before a
release, I just want to understand if this is what you also expect and
if this can be considered a common goal.
Stefano