Hi,

Currently Tika doesn't have any good guidelines on the semantics and usage of metadata keys. Mostly we've just ended up with a few basic keys like CONTENT_TYPE and a bunch of more or less inconsistently used other keys. The result is that a client that currently wants to assign any reasonable semantics to the extracted metadata needs to first check the reported CONTENT_TYPE and use that to deduce the meanings of the other available metadata keys based on documentation in [1].
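To make the problem concrete, here is a rough sketch of the kind of lookup a client ends up writing today. It uses a plain Map rather than any Tika class, and every key name in it ("Content-Type", "pdf-title", "html-title", "doc-title") is made up for illustration; it is not what the parsers actually emit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side helper, not part of Tika.
final class TitleLookupSketch {

    // Assumed key name for the reported content type (illustrative only).
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the document title, or null if the content type is unknown.
    // The client has to know one format-specific key per content type.
    static String titleOf(Map<String, String> metadata) {
        String type = metadata.get(CONTENT_TYPE);
        if (type == null) {
            return null;
        }
        switch (type) {
            case "application/pdf":
                return metadata.get("pdf-title");   // made-up key
            case "text/html":
                return metadata.get("html-title");  // made-up key
            case "application/msword":
                return metadata.get("doc-title");   // made-up key
            default:
                return null; // no documented mapping for this type
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "text/html");
        metadata.put("html-title", "Example page");
        System.out.println(titleOf(metadata));
    }
}
```

Every new document type means another case in every client that cares about titles, and the same again for authors, dates and so on.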
This is not optimal. It should be up to the Tika parsers to interpret the metadata available in the supported document types and map that as well as possible to a single standard like Dublin Core. This way a client only needs to know a single set of metadata semantics. The parser can still make the raw underlying metadata available using metadata keys that are specific to the actual metadata schema used in the document type, but that should be considered an extra feature beyond the normalized Dublin Core output.

One corollary of this is that we should replace the current HTTP-based CONTENT_TYPE metadata key with the Dublin Core FORMAT.

WDYT?

[1] http://lucene.apache.org/tika/formats.html

BR,

Jukka Zitting
