On 16/12/11 15:12, Jukka Zitting wrote:
> As mentioned by Antoni, in the end the metadata keys are just strings,
> so with a little coordination we don't need to delay the introduction
> of new keys over multiple releases.

Hmm, they're not quite just strings - with the new Property stuff they can also carry validation. I think, however, that having a parser temporarily include its own copy of a definition shouldn't be the end of the world.
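
Just to illustrate what I mean, a parser-local definition could look something like this (the key names are invented for the example, but Property.internalDate and Property.internalInteger are the real tika-core factory methods):

  import org.apache.tika.metadata.Property;

  public interface MyFormatMetadata {
      // Hypothetical format-specific keys, with typing/validation supplied by Property
      Property CREATED = Property.internalDate("myformat:created");
      Property PAGE_COUNT = Property.internalInteger("myformat:page-count");
  }

The parser then does metadata.set(MyFormatMetadata.CREATED, ...) as usual, and promoting the definition into a shared interface later is mostly just a matter of moving it and updating the import.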

> More generally though, I think it would make sense over time to have
> tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
> etc.) that aren't directly tied to any specific parser or file format.
> Format-specific keys like the ones we now have in the MSOffice
> interface

Ah, that MSOffice one is now badly named - lots of the other parsers make use of keys that it provides. We should probably rename it to something more general, to indicate that it relates to most productivity document formats.

In general though, I agree that re-using an existing, already-defined key name (e.g. an xmp one where it covers what's needed) makes sense. At the very least, it avoids the work of trying to come up with a name, and you get the documentation for the entry for free :)
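
For instance, an audio parser can just pull in the shared xmpDM properties from tika-core rather than inventing its own strings - roughly like this (the values are obviously placeholders):

  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.metadata.XMPDM;

  public class XmpDmReuseExample {
      public static void main(String[] args) {
          Metadata metadata = new Metadata();
          // Reuse the documented xmpDM keys rather than parser-local name strings
          metadata.set(XMPDM.ARTIST, "Example Artist");
          metadata.set(XMPDM.ALBUM, "Example Album");
          System.out.println(metadata.get(XMPDM.ARTIST));
      }
  }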

> That way, as long as the generic metadata keys in
> tika-core are more or less complete (i.e. cover all of the key
> metadata standards), there should be little need for a parser
> implementation to make changes in the rest of Tika if it wants to
> introduce a new custom metadata key.

I think we're not quite there yet though, so for at least the next year (at a guess) we're going to need to keep adding new keys and rationalising existing ones.

>> * Consistency - both our markup and our metadata keys will be harder to
>>   ensure when everything isn't in the same codebase

> Yep, that can be a problem. I guess the ultimate solution to this
> would be to come up with a well-documented definition of what a parser
> should ideally output for specific kinds of content, but that's quite
> a bit of work.

Possibly we could use some tooling to identify the differences between parsers, and then have a periodic check to ensure things haven't got worse. My hunch is that this shouldn't be too hard to set up, but I'm not volunteering to do it...!
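
Something along these lines, say - only a rough sketch, and the test document path and the baseline keys are invented for the example:

  import java.io.InputStream;
  import java.util.Arrays;
  import java.util.TreeSet;

  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.sax.BodyContentHandler;

  public class MetadataKeyDriftCheck {
      public static void main(String[] args) throws Exception {
          // Parse a known test document and collect the metadata keys it produces
          Metadata metadata = new Metadata();
          InputStream stream = MetadataKeyDriftCheck.class
                  .getResourceAsStream("/test-documents/testWORD.doc");
          try {
              new AutoDetectParser().parse(stream, new BodyContentHandler(-1),
                      metadata, new ParseContext());
          } finally {
              stream.close();
          }

          // Compare against a recorded baseline; in practice the baseline would
          // live in a checked-in file per test document
          TreeSet<String> produced = new TreeSet<String>(Arrays.asList(metadata.names()));
          TreeSet<String> baseline = new TreeSet<String>(
                  Arrays.asList("Content-Type", "Author", "title"));
          if (!produced.equals(baseline)) {
              throw new AssertionError("Metadata keys have drifted: " + produced);
          }
      }
  }

Run that over the corpus of test documents on a schedule and we'd at least know when a parser starts emitting different keys than before.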

>> For detectors, there's an extra issue here. At the moment, both the Zip
>> and OLE2 detectors handle more than just the POI formats, and in the Zip
>> case rely on code shared between the parsers (poi+keynote) and the
>> detector. How would this work if the container detectors were handed to
>> POI?

> I guess this would require some level of code duplication, i.e. having
> a Zip detector in POI that knows about OOXML types, and another in
> tika-parsers that knows about other types of Zips.

Hmm, I'd rather we didn't have too much duplication. I think this model might end up with quite a bit, and it would need quite a lot of testing to ensure things worked well. Potentially we could end up with something like 5 Zip-based detectors, such as:
* OOXML one, in POI (needs POI bits)
* iWorks one, in future iWorks library (needs iWorks parser bits)
* ODF one, in ODFToolkit (needs ODF bits)
* Core Tika one (zip, jar, war etc)

At that point maybe we need a zip detector plugin model...
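
Purely hypothetically, that could be a small interface that each zip-aware library implements and registers as a service, with the top-level detector discovering the implementations via the service loader and asking each in turn. All the names below are invented, just to show the shape of it:

  import java.util.ServiceLoader;

  import org.apache.commons.compress.archivers.zip.ZipFile;
  import org.apache.tika.mime.MediaType;

  // Each zip-aware library (POI, a future iWorks library, ODFToolkit,
  // tika-core itself) would ship one of these
  public interface ZipContainerDetectorPlugin {
      /** Return the detected type, or null if this plugin doesn't recognise the zip. */
      MediaType detect(ZipFile zip);
  }

  class PluggableZipDetector {
      MediaType detect(ZipFile zip) {
          for (ZipContainerDetectorPlugin plugin :
                  ServiceLoader.load(ZipContainerDetectorPlugin.class)) {
              MediaType type = plugin.detect(zip);
              if (type != null) {
                  return type;
              }
          }
          // Nobody claimed it, so fall back to plain zip
          return MediaType.application("zip");
      }
  }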

(The OLE2 case is fine - because the detector is powered by POIFS, non-POI-supported OLE2 formats are probably best detected by code within POI.)
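
For reference, the POIFS-based detection essentially boils down to opening the container and looking at the entry names in the root directory - something like this much-simplified sketch (the entry names are real ones used by Word and Excel, but the class itself is just an illustration, not the actual detector):

  import java.io.InputStream;

  import org.apache.poi.poifs.filesystem.DirectoryNode;
  import org.apache.poi.poifs.filesystem.POIFSFileSystem;
  import org.apache.tika.mime.MediaType;

  public class Ole2TypeSketch {
      public static MediaType detect(InputStream stream) throws Exception {
          // POIFS reads the OLE2 container, then we peek at the root directory entries
          DirectoryNode root = new POIFSFileSystem(stream).getRoot();
          if (root.hasEntry("WordDocument")) {
              return MediaType.application("msword");
          }
          if (root.hasEntry("Workbook") || root.hasEntry("Book")) {
              return MediaType.application("vnd.ms-excel");
          }
          // Some other OLE2-based format we don't have a specific mapping for
          return MediaType.application("x-tika-msoffice");
      }
  }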


Nick
