Hi All,
I finally found some time to send an email and share some thoughts on
one of the stickiest issues I have had so far with Tika: it is almost
impossible to leverage and override the functionality of existing
Parsers. I believe the main reason is that the parse method leaves no
room to override existing behavior or provide my own logic. It's
pretty much an all or nothing kind of thing.
For instance, take the Html Parser and let's say I just need to extract
some metadata not currently handled by Tika. If I'm not mistaken, I
basically have two solutions:
1) Modify the current Html Parser, add code to extract the new metadata
and submit a patch to Tika
2) Create my own class:
- copy/paste the existing code, because the current parse() method
cannot be partially overridden (the all or nothing problem described
above)
- add my code
- register my class so that it's called for a given mimeType
In all the cases I have had so far, I simply needed to be able to
register my own ContentHandler on the source document (and not on the
structured content). Unfortunately, that is currently not possible.
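To make this concrete, here is a rough sketch of the kind of hook I have
in mind. It uses only the JDK's SAX classes and is not Tika code:
HookableParser, setSourceHook and MetaTagCollector are made-up names for
illustration, and the input is assumed to be well-formed markup. The
parser keeps its built-in behavior (collecting text) and forwards the
source events to whatever handler I register.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical user handler: collects <meta name=".." content=".."> pairs. */
class MetaTagCollector extends DefaultHandler {
    final Map<String, String> meta = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("meta".equalsIgnoreCase(qName)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                meta.put(name, content);
            }
        }
    }
}

/** Hypothetical stand-in for an extensible parser; NOT Tika's HtmlParser. */
class HookableParser {
    // No-op by default, so the parser behaves as before when no hook is set.
    private ContentHandler sourceHook = new DefaultHandler();

    void setSourceHook(ContentHandler hook) {
        this.sourceHook = hook;
    }

    /** Built-in behavior: return the text content. Source events also go to the hook. */
    String parse(String markup) {
        StringBuilder text = new StringBuilder();
        DefaultHandler builtIn = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException {
                sourceHook.startElement(uri, local, qName, atts);
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                sourceHook.endElement(uri, local, qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                text.append(ch, start, length);           // the parser's own logic
                sourceHook.characters(ch, start, length); // the user's logic
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), builtIn);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }
}
```

With something like this, extracting extra metadata would only require
calling setSourceHook(new MetaTagCollector()) before parse(), with no
copy/paste of the parser itself.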
So, I wanted to know 1) whether other people have had trouble extending
existing Parsers, and 2) whether this is an issue we should tackle.
BR,
Stephane Bastian