Hi Jukka,

I like this.

For the implementation (I also noted this in the corresponding JIRA issue):
How about using QNames as keys in the metadata map (e.g. Map<QName,
String>)? For the standard Dublin Core metadata entries that are
"mandatory" for all parsers, like Title (the current constants like
CONTENT_TYPE), we could simply redefine the constants as QNames with the DC
namespace URI and maybe a prefix. (The prefix is only there for reference:
equals and hashCode do not use it, so QNames are simply pairs of
(URI, Name). This makes QNames very elegant.) This would make most parsers
automatically source-compatible. The only parsers needing an update are
the ones that use plain Strings as keys.
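
A rough sketch of what I mean, using javax.xml.namespace.QName. The class
name, the constant names and the sample values are only placeholders for
illustration, not existing Tika API; the namespace URI is the Dublin Core
element set one.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

class QNameMetadataSketch {
    // Namespace URI of the Dublin Core element set.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Hypothetical replacements for the current String constants.
    // The third argument is the prefix, which is kept for reference only.
    static final QName TITLE = new QName(DC_NS, "title", "dc");
    static final QName FORMAT = new QName(DC_NS, "format", "dc");

    // Builds a small metadata map keyed by QName (sample values are made up).
    static Map<QName, String> sampleMetadata() {
        Map<QName, String> metadata = new HashMap<>();
        metadata.put(TITLE, "Example document");
        metadata.put(FORMAT, "application/pdf");
        return metadata;
    }

    public static void main(String[] args) {
        Map<QName, String> metadata = sampleMetadata();

        // Same (URI, Name) pair, different prefix: still the same key,
        // because equals and hashCode ignore the prefix.
        QName sameKeyOtherPrefix = new QName(DC_NS, "title", "dublincore");
        System.out.println(TITLE.equals(sameKeyOtherPrefix)); // true
        System.out.println(metadata.get(sameKeyOtherPrefix)); // Example document
    }
}
```

A parser that already writes through the constants would keep compiling
once the constant and map key types change; only code that passes String
literals as keys would have to be touched.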

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax:  +49 421 218 65505
http://www.pangaea.de/
E-mail: [EMAIL PROTECTED]

> -----Original Message-----
> From: Jukka Zitting [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 03, 2008 1:35 AM
> To: [email protected]
> Subject: Normalize metadata to Dublin Core
> 
> Hi,
> 
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
> 
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
> 
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
> 
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
> 
> WDYT?
> 
> [1] http://lucene.apache.org/tika/formats.html
> 
> BR,
> 
> Jukka Zitting
