Hi,

Tika currently has no good guidelines on the semantics and usage of
metadata keys. We've mostly ended up with a few basic keys like
CONTENT_TYPE and a bunch of other keys that are used more or less
inconsistently. As a result, a client that wants to assign any
reasonable semantics to the extracted metadata must first check the
reported CONTENT_TYPE and then use it to deduce the meanings of the
other available metadata keys from the documentation in [1].

This is not optimal. It should be up to the Tika parsers to interpret
the metadata available in each supported document type and map it as
well as possible to a single standard like Dublin Core. That way a
client only needs to know a single set of metadata semantics.

The parser can still make the raw underlying metadata available using
metadata keys that are specific to the actual metadata schema used in
the document type, but that should be considered an extra feature
beyond the normalized Dublin Core output.

One corollary of this is that we should replace the current HTTP-based
CONTENT_TYPE metadata key with the Dublin Core FORMAT.
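
To illustrate the idea, here is a rough sketch in plain Java. It is not
actual Tika code and does not use the Tika Metadata class. The class
name, the "dc:" and "raw:" key prefixes, and the native key names
("Author", "Title", "Keywords") are all assumptions made up for this
example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika API: a parser maps format-specific
// metadata keys to Dublin Core and keeps the raw values as an extra.
public class DublinCoreNormalizer {

    // Assumed mapping from one document format's native keys to
    // Dublin Core elements. A real parser would define its own.
    private static final Map<String, String> NATIVE_TO_DC = new HashMap<>();
    static {
        NATIVE_TO_DC.put("Author", "dc:creator");
        NATIVE_TO_DC.put("Title", "dc:title");
        NATIVE_TO_DC.put("Keywords", "dc:subject");
    }

    public static Map<String, String> normalize(
            String mediaType, Map<String, String> nativeMetadata) {
        Map<String, String> result = new LinkedHashMap<>();
        // The media type is reported as the Dublin Core "format" element
        // instead of an HTTP-style content type key.
        result.put("dc:format", mediaType);
        for (Map.Entry<String, String> entry : nativeMetadata.entrySet()) {
            String dcKey = NATIVE_TO_DC.get(entry.getKey());
            if (dcKey != null) {
                // Normalized output that every client can rely on.
                result.put(dcKey, entry.getValue());
            }
            // Raw value stays available under a schema-specific key.
            result.put("raw:" + entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

With something like this, a client would read "dc:creator" or
"dc:format" regardless of the document type, and would only look at the
"raw:" keys when it needs format-specific details.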

WDYT?

[1] http://lucene.apache.org/tika/formats.html

BR,

Jukka Zitting
