Hi All,

I finally found some time to send an email and share some thoughts on one of the stickiest issues I have had so far with Tika: it is almost impossible to leverage and override the functionality of existing Parsers. I believe the main reason is that the parse method leaves no room to override existing behavior or provide my own logic. It's pretty much an all-or-nothing kind of thing.

For instance, take the Html Parser and let's say I just need to extract some metadata not currently handled by Tika. If I'm not mistaken, I basically have two solutions:

1) Modify the current Html Parser, add code to extract the new metadata, and submit a patch to Tika
2) Create my own class:
   - copy/paste the existing code, because the current parse() method leaves very little room to override existing behavior or provide my own logic
   - add my code
   - register my class so that it's called for a given mimeType

In all the cases I have had so far, I simply needed to be able to register my own ContentHandler on the source document (and not on the structured content). Unfortunately, that is currently not possible.
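
To make this more concrete, here is a rough sketch of the kind of hook I have in mind. It deliberately uses only the JDK's SAX classes rather than Tika's: ExtensibleParser, addSourceHandler, FanOut and MetaCollector are hypothetical names, not existing Tika API, and the JDK SAX parser merely stands in for whatever component reads the source document (so the sample input has to be well-formed XHTML).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SourceHandlerSketch {

    /** Hypothetical parser (not a Tika class) that lets callers observe the source events. */
    static class ExtensibleParser {
        private final List<ContentHandler> sourceHandlers = new ArrayList<>();

        /** Hypothetical hook: register an extra handler on the source document. */
        void addSourceHandler(ContentHandler handler) {
            sourceHandlers.add(handler);
        }

        /** Parses the stream, feeding the built-in handler and every registered handler. */
        void parse(InputStream in, ContentHandler builtIn) throws Exception {
            List<ContentHandler> all = new ArrayList<>();
            all.add(builtIn);
            all.addAll(sourceHandlers);
            SAXParserFactory.newInstance().newSAXParser().parse(in, new FanOut(all));
        }
    }

    /** Forwards element and character events to several handlers. */
    static class FanOut extends DefaultHandler {
        private final List<ContentHandler> targets;

        FanOut(List<ContentHandler> targets) {
            this.targets = targets;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : targets) h.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (ContentHandler h : targets) h.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : targets) h.characters(ch, start, length);
        }
    }

    /** My own logic: collect <meta name="..." content="..."/> pairs from the source. */
    static class MetaCollector extends DefaultHandler {
        final Map<String, String> meta = new LinkedHashMap<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("meta".equals(qName)) {
                String name = atts.getValue("name");
                String content = atts.getValue("content");
                if (name != null && content != null) meta.put(name, content);
            }
        }
    }

    /** Runs the hypothetical parser with a MetaCollector attached and returns what it found. */
    static Map<String, String> extractMeta(String xhtml) throws Exception {
        ExtensibleParser parser = new ExtensibleParser();
        MetaCollector collector = new MetaCollector();
        parser.addSourceHandler(collector);
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        // A no-op DefaultHandler stands in for the parser's existing, unchanged behavior.
        parser.parse(in, new DefaultHandler());
        return collector.meta;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><meta name=\"dc.rights\" content=\"CC-BY\"/></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractMeta(doc));
    }
}
```

The point of the sketch is that the existing parsing logic stays untouched and the extra metadata comes from a handler I plug in, with no copy/paste of the parser. How such a hook would actually fit into Tika's Parser interface is an open question.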

So, I wanted to know 1) whether other people have had trouble extending existing Parsers, and 2) whether this is an issue we should tackle.

BR,

Stephane Bastian
