On 16/12/11 15:12, Jukka Zitting wrote:
> As mentioned by Antoni, in the end the metadata keys are just strings,
> so with a little coordination we don't need to delay the introduction
> of new keys over multiple releases.
Hmm, they're not quite just strings - with the new Property stuff they
can also have validation too. I think, however, that having a parser
temporarily include its own copy of a definition shouldn't be the end
of the world.
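
To make that a bit more concrete, the sort of thing I'd picture a parser
doing in the interim is below - the interface and key name are just made
up for illustration:

import org.apache.tika.metadata.Property;

public interface MyFormatMetadata {
    // Hypothetical key, kept locally in the parser until an equivalent
    // definition lands in tika-core. Because it's a Property rather than
    // a bare string, it still carries type/validation information.
    Property SAMPLE_COUNT = Property.internalInteger("myformat:sampleCount");
}

Once the shared definition does arrive in tika-core, the parser just
switches over to it and drops the local copy.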
> More generally though, I think it would make sense over time to have
> tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
> etc.) that aren't directly tied to any specific parser or file format.
> Format-specific keys like the ones we now have in the MSOffice
> interface
Ah, that MSOffice one is now badly named - lots of the other parsers
make use of keys that it provides. We should maybe rename it to
something more general, to indicate it relates to most productivity
document formats.
In general though, I agree that re-using an existing defined key name
(e.g. XMP where it covers it) makes sense. At the very least, it avoids
work trying to come up with a name, and you get the documentation for
the entry for free :)
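
For example, rather than inventing a new key for audio sample rates, a
parser can just reuse the xmpDM one that tika-core already defines
(illustrative snippet only):

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.XMPDM;

public class ReuseExistingKeys {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        // The shared, documented xmpDM:audioSampleRate key, rather than
        // a parser-specific "my-sample-rate" string
        metadata.set(XMPDM.AUDIO_SAMPLE_RATE, 44100);
        System.out.println(metadata);
    }
}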
> That way, as long as the generic metadata keys in
> tika-core are more or less complete (i.e. cover all of the key
> metadata standards), there should be little need for a parser
> implementation to need changes in the rest of Tika if it wants to
> introduce a new custom metadata key.
I think we're not quite there yet though, so for at least the next year
(at a guess) we're going to need to be adding new keys, and
rationalising existing ones.
>> * Consistency - both markup and metadata keys will be harder to keep
>> consistent when they aren't in the same codebase
> Yep, that can be a problem. I guess the ultimate solution to this
> would be to come up with a well documented definition of what a parser
> should ideally output for specific kinds of content, but that's quite
> a bit of work.
Possibly we could use some tooling to identify the differences, then
have a periodic check to ensure things haven't got worse. My hunch is
that this shouldn't be too hard to set up, but I'm not volunteering to
do it...!
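
As a very rough sketch of the sort of tooling I mean - parse the "same"
document saved in two formats and diff the metadata key names (the
fixture paths below are only placeholders):

import java.io.InputStream;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class MetadataKeyDiff {
    // Parse one test document and return the metadata key names it produced
    static Set<String> keysFor(String resource) throws Exception {
        Metadata metadata = new Metadata();
        InputStream in = MetadataKeyDiff.class.getResourceAsStream(resource);
        try {
            new AutoDetectParser().parse(
                    in, new DefaultHandler(), metadata, new ParseContext());
        } finally {
            in.close();
        }
        return new TreeSet<String>(Arrays.asList(metadata.names()));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder fixtures: the "same" document saved as .doc and .odt
        Set<String> doc = keysFor("/test-documents/testWORD.doc");
        Set<String> odt = keysFor("/test-documents/testODT.odt");

        Set<String> onlyDoc = new TreeSet<String>(doc);
        onlyDoc.removeAll(odt);
        Set<String> onlyOdt = new TreeSet<String>(odt);
        onlyOdt.removeAll(doc);

        System.out.println("Keys only in the .doc output: " + onlyDoc);
        System.out.println("Keys only in the .odt output: " + onlyOdt);
    }
}

Run over the whole test corpus as a periodic check, that would at least
flag when the outputs drift further apart.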
>> For detectors, there's an extra issue here. At the moment, both the Zip
>> and OLE2 detectors handle more than just the POI formats, and in the Zip
>> case rely on code shared between the parsers (poi+keynote) and detector.
>> How would this work if the container detectors were handed to POI?
> I guess this would require some level of code duplication, i.e. having
> a Zip detector in POI that knows about OOXML types, and another in
> tika-parsers that knows about other types of Zips.
Hmm, I'd rather we didn't have too much duplication. I think this might
end up with quite a bit, and would need quite a lot of testing to ensure
things worked well. Potentially we could end up with something like 5
Zip-based detectors in that model, such as:
* OOXML one, in POI (needs POI bits)
* iWorks one, in future iWorks library (needs iWorks parser bits)
* ODF one, in ODFToolkit (needs ODF bits)
* Core Tika one (zip, jar, war etc)
At that point maybe we need a zip detector plugin model...
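
Very roughly, I'd picture something like this - the class name and the
use of java.util.ServiceLoader are purely illustrative:

import java.io.IOException;
import java.io.InputStream;
import java.util.ServiceLoader;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Illustrative only: each library (POI for OOXML, a future iWorks library,
// ODFToolkit, plus tika-core for plain zip/jar/war) would register its own
// Detector via META-INF/services, and this wrapper - called once the stream
// already looks like a Zip container - asks each plugin in turn.
public class PluggableZipDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata)
            throws IOException {
        for (Detector plugin : ServiceLoader.load(Detector.class)) {
            // Per the Detector contract, each plugin leaves the stream
            // positioned where it found it, so we can simply try the next one
            MediaType type = plugin.detect(input, metadata);
            if (!MediaType.OCTET_STREAM.equals(type)) {
                return type;
            }
        }
        return MediaType.APPLICATION_ZIP; // nothing more specific matched
    }
}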
(The OLE2 case is fine - because the detector is powered by POIFS, OLE2
formats that POI itself doesn't support are probably best detected by
code within POI)
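
Roughly, detection there comes down to which entries the POIFS root
directory holds - a simplified illustration, not the real detector code:

import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class Ole2Sniffer {
    // Simplified sketch of OLE2 type sniffing via POIFS entry names
    public static String sniff(InputStream stream) throws IOException {
        DirectoryNode root = new POIFSFileSystem(stream).getRoot();
        if (root.hasEntry("WordDocument")) {
            return "application/msword";
        }
        if (root.hasEntry("Workbook") || root.hasEntry("Book")) {
            return "application/vnd.ms-excel";
        }
        if (root.hasEntry("PowerPoint Document")) {
            return "application/vnd.ms-powerpoint";
        }
        return "application/x-tika-msoffice"; // generic OLE2 fallback
    }
}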
Nick