On 16/12/11 15:12, Jukka Zitting wrote:
> As mentioned by Antoni, in the end the metadata keys are just strings,
> so with a little coordination we don't need to delay the introduction
> of new keys over multiple releases.
Hmm, they're not quite just strings - with the new Property stuff they
can also have validation too. I think, however, that having a parser
temporarily include its own copy of a definition shouldn't be the end
of the world.
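
To make that a bit more concrete, the sort of thing I'd picture a parser
doing in the interim is below - the interface and key name are just made
up for illustration:

import org.apache.tika.metadata.Property;

public interface MyFormatMetadata {
    // Hypothetical key, kept locally in the parser until an equivalent
    // definition lands in tika-core. Because it's a Property rather than
    // a bare string, it still carries type/validation information.
    Property SAMPLE_COUNT = Property.internalInteger("myformat:sampleCount");
}

Once the shared definition does arrive in tika-core, the parser just
switches over to it and drops the local copy.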
> More generally though, I think it would make sense over time to have
> tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
> etc.) that aren't directly tied to any specific parser or file format.
> Format-specific keys like the ones we now have in the MSOffice
> interface
Ah, that MSOffice one is now badly named - lots of the other parsers
make use of keys that it provides. We should maybe rename it to
something more general, to indicate it relates to most productivity
document formats.
In general though, I agree that re-using an existing defined key name
(e.g. XMP where it covers it) makes sense. At the very least, it avoids
work trying to come up with a name, and you get the documentation for
the entry for free :)
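
For example, rather than inventing a new key for audio sample rates, a
parser can just reuse the xmpDM one that tika-core already defines
(illustrative snippet only):

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.XMPDM;

public class ReuseExistingKeys {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        // The shared, documented xmpDM:audioSampleRate key, rather than
        // a parser-specific "my-sample-rate" string
        metadata.set(XMPDM.AUDIO_SAMPLE_RATE, 44100);
        System.out.println(metadata);
    }
}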
> That way, as long as the generic metadata keys in
> tika-core are more or less complete (i.e. cover all of the key
> metadata standards), there should be little need for a parser
> implementation to need changes in the rest of Tika if it wants to
> introduce a new custom metadata key.
I think we're not quite there yet though, so for at least the next year
(at a guess) we're going to need to be adding new keys, and
rationalising existing ones.
>> * Consistency - both markup and metadata keys will be harder to keep
>> consistent when they aren't in the same codebase
> Yep, that can be a problem. I guess the ultimate solution to this
> would be to come up with a well documented definition of what a parser
> should ideally output for specific kinds of content, but that's quite
> a bit of work.
Possibly we could use some tooling to identify the differences, then
have a periodic check to ensure things haven't got worse. My hunch is
that this shouldn't be too hard to set up, but I'm not volunteering to
do it...!
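
As a very rough sketch of the sort of tooling I mean - parse the "same"
document saved in two formats and diff the metadata key names (the
fixture paths below are only placeholders):

import java.io.InputStream;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class MetadataKeyDiff {
    // Parse one test document and return the metadata key names it produced
    static Set<String> keysFor(String resource) throws Exception {
        Metadata metadata = new Metadata();
        InputStream in = MetadataKeyDiff.class.getResourceAsStream(resource);
        try {
            new AutoDetectParser().parse(
                    in, new DefaultHandler(), metadata, new ParseContext());
        } finally {
            in.close();
        }
        return new TreeSet<String>(Arrays.asList(metadata.names()));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder fixtures: the "same" document saved as .doc and .odt
        Set<String> doc = keysFor("/test-documents/testWORD.doc");
        Set<String> odt = keysFor("/test-documents/testODT.odt");

        Set<String> onlyDoc = new TreeSet<String>(doc);
        onlyDoc.removeAll(odt);
        Set<String> onlyOdt = new TreeSet<String>(odt);
        onlyOdt.removeAll(doc);

        System.out.println("Keys only in the .doc output: " + onlyDoc);
        System.out.println("Keys only in the .odt output: " + onlyOdt);
    }
}

Run over the whole test corpus as a periodic check, that would at least
flag when the outputs drift further apart.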
>> For detectors, there's an extra issue here. At the moment, both the Zip
>> and OLE2 detectors handle more than just the POI formats, and in the Zip
>> case rely on code shared between the parsers (poi+keynote) and detector.
>> How would this work if the container detectors were handed to POI?
> I guess this would require some level of code duplication, i.e. having
> a Zip detector in POI that knows about OOXML types, and another in
> tika-parsers that knows about other types of Zips.
Hmm, I'd rather we didn't have too much duplication. I think this might
end up with quite a bit, and would need quite a lot of testing to ensure
things worked well. Potentially we could end up with something like 5
Zip-based detectors in that model, such as:
* OOXML one, in POI (needs POI bits)
* iWorks one, in future iWorks library (needs iWorks parser bits)
* ODF one, in ODFToolkit (needs ODF bits)
* Core Tika one (zip, jar, war etc)
At that point maybe we need a zip detector plugin model...
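
Very roughly, I'd picture something like this - the class name and the
use of java.util.ServiceLoader are purely illustrative:

import java.io.IOException;
import java.io.InputStream;
import java.util.ServiceLoader;

import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Illustrative only: each library (POI for OOXML, a future iWorks library,
// ODFToolkit, plus tika-core for plain zip/jar/war) would register its own
// Detector via META-INF/services, and this wrapper - called once the stream
// already looks like a Zip container - asks each plugin in turn.
public class PluggableZipDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata)
            throws IOException {
        for (Detector plugin : ServiceLoader.load(Detector.class)) {
            // Per the Detector contract, each plugin leaves the stream
            // positioned where it found it, so we can simply try the next one
            MediaType type = plugin.detect(input, metadata);
            if (!MediaType.OCTET_STREAM.equals(type)) {
                return type;
            }
        }
        return MediaType.APPLICATION_ZIP; // nothing more specific matched
    }
}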
(The OLE2 case is fine - because the detector is powered by POIFS, OLE2
formats that POI itself doesn't support are probably best detected by
code within POI)
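
Roughly, detection there comes down to which entries the POIFS root
directory holds - a simplified illustration, not the real detector code:

import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class Ole2Sniffer {
    // Simplified sketch of OLE2 type sniffing via POIFS entry names
    public static String sniff(InputStream stream) throws IOException {
        DirectoryNode root = new POIFSFileSystem(stream).getRoot();
        if (root.hasEntry("WordDocument")) {
            return "application/msword";
        }
        if (root.hasEntry("Workbook") || root.hasEntry("Book")) {
            return "application/vnd.ms-excel";
        }
        if (root.hasEntry("PowerPoint Document")) {
            return "application/vnd.ms-powerpoint";
        }
        return "application/x-tika-msoffice"; // generic OLE2 fallback
    }
}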
Nick