Hi,

You're definitely right that there would be a mapping between a given document and XML, via a ContentHandler, which is kind of what Tika does already. This also means that metadata would be extracted from the "raw" ContentHandler. In any case, as you pointed out, Tika might not be the best place to do this. However, going back to my initial short-term issue, which is extending the HTML parser, I would definitely take the solution you proposed earlier if it's still on the table ;)

BR,

Stephane Bastian

Jukka Zitting wrote:
Hi,

On Tue, Dec 9, 2008 at 12:19 PM, Stephane Bastian
<[EMAIL PROTECTED]> wrote:
Parsing goes through several fairly well-defined steps, and in the case of
Tika it could be represented as follows:
1) Generate SAX events out of the stream
2) Extract metadata and save it in an instance of the Metadata class
3) Generate SAX events about the structure of a document

For many document types steps 1 and 2 are reversed, and 1 and 3 are
actually just a single step. I'm not sure if there's much room for
generalization here.

How about if we slightly modify Tika to hook custom code into step 1) as well? We
could do this by adding an extra ContentHandler to the parse method:

public void parse(InputStream stream, ContentHandler rawHandler,
ContentHandler structuredHandler, Metadata metadata);

Most document types simply don't have a "raw" SAX stream, so I don't
think this is a good idea in the general case. The only SAX events you
have are the ones sent to the content handler we have now, so what
you're trying to do could just as well be achieved using a
TeeContentHandler on top of the existing Parser interface.
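
As a rough illustration of the tee idea, here is a minimal sketch that uses only the JDK's SAX classes. `SimpleTee`, `ElementRecorder`, `TextCollector` and `parseInto` are hypothetical names made up for this example; `SimpleTee` is a simplified stand-in for a tee handler, not Tika's own class, and the JDK XML parser stands in for a Tika Parser. The point is only that one event stream can feed several handlers without changing the parse signature.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeSketch {

    /** Hypothetical tee: forwards each SAX event to every delegate, in order. */
    static class SimpleTee extends DefaultHandler {
        private final ContentHandler[] delegates;

        SimpleTee(ContentHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            for (ContentHandler h : delegates) {
                h.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            for (ContentHandler h : delegates) {
                h.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            for (ContentHandler h : delegates) {
                h.characters(ch, start, length);
            }
        }
    }

    /** Custom hook: records the name of every element that is opened. */
    static class ElementRecorder extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            names.add(qName);
        }
    }

    /** The "normal" consumer: collects character data only. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the XML once and sends the same events to all handlers. */
    static void parseInto(String xml, ContentHandler... handlers) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new SimpleTee(handlers));
    }

    public static void main(String[] args) throws Exception {
        ElementRecorder structure = new ElementRecorder();
        TextCollector text = new TextCollector();
        parseInto("<html><body><p>Hello</p></body></html>", structure, text);
        System.out.println(structure.names + " " + text.text);
    }
}
```

With an actual Tika parser, the same pattern would mean passing the tee as the single content handler argument, so the custom handler and the regular one both see every event.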

What I believe you are looking for is a mechanism that would map the
low-level details of all sorts of document types to XML. That might
be interesting, but I'm not sure if Tika is the best place to do that.
It might be a better idea to approach the parser libraries directly
about a potential SAX mapping, as they are in a much better position
to evaluate what such a mapping should look like and whether
implementing it is reasonable.

2) Ability to leverage the MatchingContentHandler, which also works in
streaming mode. BTW, to me this part would probably deserve a project of its
own

Thanks, I did think it was a good idea, but it's good to hear that
others like it too. :-)

BR,

Jukka Zitting
