Re: Duplicate parsing of some header fields

Oleg Kalnichevski Wed, 04 Feb 2009 05:26:41 -0800

Markus Wiederkehr wrote:

On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
<[email protected]> wrote:

On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
<[email protected]> wrote:

When Mime4j is used to build a DOM some header fields are parsed more
than once..

...i suspected that this might be the case...

This is what happens:
 * AbstractEntity.parseField parses the raw string into a name-value pair
 * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
valid fields
 * DefaultBodyDescriptor parses header fields such as Content-Type
 * MaximalBodyDescriptor parses additional header fields such as Mime-Version
 * eventually MimeStreamParser.parse retrieves the raw field from
AbstractEntity and notifies a ContentHandler
 * the ContentHandler (again) has to parse the raw string into name and value
 * the ContentHandler (again) has to parse certain structural fields
already parsed by a body descriptor

IIRC the descriptor stuff was added to satisfy requirements for SAX
and pull parsing downstream (in particular IMAP which uses a lot of
MIME meta-data)

In case of building a DOM the latter two items are done by
MessageBuilder by calling Field.parse().

There are several issues here:
 * the ContentHandler has to do a lot of work that has already been
done before by AbstractEntity.

well, probably done before (dependent of parser settings)

 * Field.parse() uses a JavaCC-generated parser whereas the body
descriptors have their own parsing code.

yes - the body descriptors support a wider variety of fields than the
Field stuff

 * we have unnecessary duplicate code and the two parsers are very
likely inconsistent with each other.

yes - i trust the code in AbstractEntity more highly then that in Field


I'm not so sure.. For example AbstractEntity uses
MimeUtil.getHeaderParams() to parse Content-Type and
Content-Disposition. It looks like getHeaderParams does not correctly
handle comments that are allowed in quoted-strings.. Consider this
example:

        String test = "text/plain; key=value; foo= (test) \"bar\"";
        Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
        System.out.println(headerParams);

The result is {=text/plain, foo=(test), key=value} which is wrong. The
Field equivalent..

        ContentTypeField f = Fields.contentType(test);
        System.out.println(f.getParameters());

..gives the correct result: {foo=bar, key=value}.

But of course we could fix this and use MimeUtil.getHeaderParams() for
parsing Fields, too. In fact I would not mind if most of the javacc
parsers were replaced by handcrafted equivalents..


Folks

For what is it worth to you. HttpComponents Core provides an extensivesupport for parsing and formatting of MIME headers. The code has beenaround for quite some time, has been relatively well tested, has beenextensively optimized for performance and memory footprint and is ableto operate with virtually zero intermediate garbage and no intermediatememory copying.

The parsing/formatting framework is based on the CharArrayBuffer class,which is used by MimeTokenStream for low level parsing operations.


http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueParser.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueParser.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueFormatter.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueFormatter.java

Feel free to take a look at the code and see if it might make sensereplacing MimeUtil.getHeaderParams(value) with some bits from HC.


I fully admit my personal bias, so feel free to ignore me.

Oleg

Re: Duplicate parsing of some header fields

Reply via email to