Extending existing Parsers - Not easy to do right now, could we make it easier?

Stephane Bastian Mon, 08 Dec 2008 23:28:07 -0800

Hi All,

I finally found some time to send an email and share some thoughts on one of the stickiest issues I have had so far with Tika: it's almost impossible to leverage and override the functionality of existing Parsers. I believe the main reason is that the parse method leaves no room to override existing behavior or provide my own logic. It's pretty much an all-or-nothing kind of thing.

For instance, take the Html Parser and let's say I just need to extract some metadata not currently handled by Tika. If I'm not mistaken, I basically have two solutions:

1) Modify the current Html Parser, add code to extract the new metadata and submit a patch to Tika

2) Create my own class:

   - do a copy/paste of the existing code. The reason for this is that the current parse() method leaves very little room to override existing behavior or provide my own logic. It's pretty much an all-or-nothing kind of thing.

   - add my code
   - register my class so that it's called for a given mimeType

In all the cases I have had so far, I simply needed to be able to register my own ContentHandler on the source document (and not on the structured content). Unfortunately, that's currently not possible.
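
To make the idea concrete, here is a rough sketch of what "registering my own ContentHandler on the source document" could look like. It is not Tika code: it uses only the JDK's SAX classes and a well-formed XHTML string in place of real HTML, and the names SourceHandlerSketch, MetaCollector, TextCollector, TeeHandler and extractMeta are all made up for illustration. TextCollector stands in for whatever a parser already does today, and TeeHandler forwards the same source events to an extra, user-supplied handler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: none of these classes exist in Tika.
class SourceHandlerSketch {

    // User-supplied handler: records <meta name="..." content="..."/> pairs.
    static class MetaCollector extends DefaultHandler {
        final Map<String, String> meta = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("meta".equalsIgnoreCase(qName)) {
                String name = atts.getValue("name");
                String content = atts.getValue("content");
                if (name != null && content != null) {
                    meta.put(name, content);
                }
            }
        }
    }

    // Stand-in for a parser's existing logic: here it only gathers text.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Forwards each source-document event to the built-in handler and to the extra one.
    static class TeeHandler extends DefaultHandler {
        private final DefaultHandler builtIn;
        private final DefaultHandler extra;

        TeeHandler(DefaultHandler builtIn, DefaultHandler extra) {
            this.builtIn = builtIn;
            this.extra = extra;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            builtIn.startElement(uri, localName, qName, atts);
            extra.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            builtIn.endElement(uri, localName, qName);
            extra.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            builtIn.characters(ch, start, length);
            extra.characters(ch, start, length);
        }
    }

    // Parses well-formed XHTML and returns the metadata seen by the extra handler.
    static Map<String, String> extractMeta(String xhtml) throws Exception {
        MetaCollector metaCollector = new MetaCollector();
        TextCollector builtIn = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                new TeeHandler(builtIn, metaCollector));
        return metaCollector.meta;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><meta name=\"author\" content=\"Jane\"/></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractMeta(doc));
    }
}
```

The point of the sketch is only that the existing logic stays untouched while the extra handler sees the same source events; how such a hook would actually be exposed by a Parser is exactly the open question.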

So, I wanted to know: 1) have other people had trouble extending existing Parsers? and 2) is this an issue we should tackle?


BR,

Stephane Bastian
