Markus Wiederkehr wrote:
On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
<[email protected]> wrote:
On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
<[email protected]> wrote:
When Mime4j is used to build a DOM some header fields are parsed more
than once..
...i suspected that this might be the case...

This is what happens:
 * AbstractEntity.parseField parses the raw string into a name-value pair
 * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
valid fields
 * DefaultBodyDescriptor parses header fields such as Content-Type
 * MaximalBodyDescriptor parses additional header fields such as Mime-Version
 * eventually MimeStreamParser.parse retrieves the raw field from
AbstractEntity and notifies a ContentHandler
 * the ContentHandler (again) has to parse the raw string into name and value
 * the ContentHandler (again) has to parse certain structural fields
already parsed by a body descriptor
IIRC the descriptor stuff was added to satisfy requirements for SAX
and pull parsing downstream (in particular IMAP which uses a lot of
MIME meta-data)

In case of building a DOM the latter two items are done by
MessageBuilder by calling Field.parse().

There are several issues here:
 * the ContentHandler has to do a lot of work that has already been
done before by AbstractEntity.
well, probably done before (dependent of parser settings)

 * Field.parse() uses a JavaCC-generated parser whereas the body
descriptors have their own parsing code.
yes - the body descriptors support a wider variety of fields than the
Field stuff

 * we have unnecessary duplicate code and the two parsers are very
likely inconsistent with each other.
yes - i trust the code in AbstractEntity more highly then that in Field

I'm not so sure.. For example AbstractEntity uses
MimeUtil.getHeaderParams() to parse Content-Type and
Content-Disposition. It looks like getHeaderParams does not correctly
handle comments that are allowed in quoted-strings.. Consider this
example:

        String test = "text/plain; key=value; foo= (test) \"bar\"";
        Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
        System.out.println(headerParams);

The result is {=text/plain, foo=(test), key=value} which is wrong. The
Field equivalent..

        ContentTypeField f = Fields.contentType(test);
        System.out.println(f.getParameters());

..gives the correct result: {foo=bar, key=value}.

But of course we could fix this and use MimeUtil.getHeaderParams() for
parsing Fields, too. In fact I would not mind if most of the javacc
parsers were replaced by handcrafted equivalents..


Folks

For what is it worth to you. HttpComponents Core provides an extensive support for parsing and formatting of MIME headers. The code has been around for quite some time, has been relatively well tested, has been extensively optimized for performance and memory footprint and is able to operate with virtually zero intermediate garbage and no intermediate memory copying.

The parsing/formatting framework is based on the CharArrayBuffer class, which is used by MimeTokenStream for low level parsing operations.

http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueParser.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueParser.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueFormatter.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueFormatter.java

Feel free to take a look at the code and see if it might make sense replacing MimeUtil.getHeaderParams(value) with some bits from HC.

I fully admit my personal bias, so feel free to ignore me.

Oleg


Reply via email to