Hello all,

First off, let me say that I'm really impressed with the state of Tika. I've been following Tika pretty much since day one and feel that a *lot* has been done in such a short period of time, especially given the fairly small number of people working on it.

Now I've got a couple of comments and ideas for potential improvements, but the first one I would like to raise is related to the HTML SAX events. I feel that the information they are supposed to convey is currently fairly difficult to use, because XHTML-type events are returned, which limits the result to the tag names and structure allowed in XHTML. For instance, if you look at the MP3 parser, it currently returns something like this:

<h1>title of the song</h1> --> the H1 is clearly just a container for the title; it could just as well have been a P, a head/title or something else
<p>name of the artist</p>
<p>year</p>
....

It feels like a more XML-ish set of events would make sense, something along these lines:

<title>title of the song</title>
<artist>name of the artist</artist>
<year>year of the song</year>

The above example conveys the same information, but in a way that a Tika user can leverage more easily.
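
To make this concrete, here is a minimal sketch in Java of how a consumer could pick up such fields by element name with a plain SAX handler. It uses only the JDK's SAX classes rather than Tika itself, and it assumes the events arrive wrapped in a root element; the <song> wrapper and the names SongMetadataSketch, FieldCollector and collect are made up for illustration and are not part of Tika.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: none of these names exist in Tika.
public class SongMetadataSketch {

    /** Collects the text of each leaf element, keyed by its element name. */
    static class FieldCollector extends DefaultHandler {
        final Map<String, String> fields = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // A new element starts: discard any text buffered so far.
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so accumulate them.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Only elements that directly contain text are recorded,
            // so the assumed <song> wrapper itself is skipped.
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                fields.put(qName, value);
            }
            text.setLength(0);
        }
    }

    /** Parses the given XML and returns element name -> text content. */
    public static Map<String, String> collect(String xml) throws Exception {
        FieldCollector handler = new FieldCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<song>"
                + "<title>title of the song</title>"
                + "<artist>name of the artist</artist>"
                + "<year>year of the song</year>"
                + "</song>";
        Map<String, String> fields = collect(xml);
        System.out.println("title  = " + fields.get("title"));
        System.out.println("artist = " + fields.get("artist"));
        System.out.println("year   = " + fields.get("year"));
    }
}
```

With the current XHTML-style output, the same handler would only see "h1" and "p" and would have to guess from their position which "p" is the artist and which is the year.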

The same comment goes for most of the parsers (Image, asm, audio, Office...), except maybe the HTML parser, in which case it's fine because the input is already HTML ;)

So here are the questions:
1) Is there a reason that would prevent Tika from returning XML-type events as opposed to XHTML events?

2) Do you feel XML events would provide substantial improvements over the current solution?


Once again, kudos to the team for the hard work.
All the best,

Stephane Bastian


