Hello all,
First off, let me say that I'm really impressed with the state of Tika.
I've been following Tika pretty much since day one and feel that a *lot*
has been done in such a short period of time, especially looking at the
fairly small number of people working on it.
Now I've got a couple of comments and ideas for potential improvements,
but the first one I would like to make is related to the HTML SAX
events. I feel that the information they are supposed to convey is
currently fairly difficult to use, because the parsers return XHTML-type
events, which limits the result to the tag names and such allowed in
XHTML. For instance, if you look at the MP3 parser, it currently
returns something like this:
<h1>title of the song</h1> --> the H1 is clearly just a container for
the title; it could just as well be a P, a head/title or something else
<p>name of the artist</p>
<p>year</p>
....
It feels like a more XML-ish set of events would make sense, something
along these lines:
<title>title of the song</title>
<artist>name of the artist</artist>
<year>year of the song</year>
The above example conveys the same information, but in a way that can be
more easily leveraged by a Tika user.
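
To illustrate what I mean by "more easily leveraged", here is a rough
sketch of what a consumer could look like if the events carried
meaningful element names. It only uses the standard Java SAX API, not
Tika itself. The class name SongFields, the FieldHandler helper and the
<song> wrapper element are made up for the example, and the
title/artist/year names are just the ones proposed above, not something
Tika emits today.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SongFields {

    /** Hypothetical handler: stores the text of each leaf element under its name. */
    static class FieldHandler extends DefaultHandler {
        final Map<String, String> fields = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start a fresh buffer for every element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                fields.put(qName, value); // the element name is the field name
            }
            text.setLength(0);
        }
    }

    /** Parses an XML string and returns the element-name to text mapping. */
    static Map<String, String> extract(String xml) throws Exception {
        FieldHandler handler = new FieldHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        // <song> is an assumed wrapper so that the input is well-formed XML.
        String xml = "<song><title>title of the song</title>"
                + "<artist>name of the artist</artist>"
                + "<year>year of the song</year></song>";
        System.out.println(extract(xml));
    }
}
```

With the current XHTML-style output, the same handler would only see h1
and p elements, so a consumer would have to rely on the order of the
paragraphs to tell the artist from the year.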
The same comment goes for most of the parsers (Image, asm, audio,
Office...), except maybe the HTML parser, in which case it's fine
because it's already HTML ;)
So here are my questions:
1) Is there a reason that would prevent Tika from returning XML-type
events as opposed to XHTML events?
2) Do you feel XML events would provide substantial improvements over
the current solution?
Once again, kudos to the team for the hard work.
All the best,
Stephane Bastian