Hi Jukka, I like this.

For the implementation (I also noted this in the corresponding JIRA
issue): how about using QNames as keys in the metadata map (e.g.
Map<QName, String>)? For the standard Dublin Core metadata entries that
are "mandatory" for all parsers, like Title (the current constants like
CONTENT_TYPE), we could simply redefine the constants as QNames with the
DC namespace URI and maybe a prefix. The prefix is not part of a QName's
identity: it is only there for reference, and equals() and hashCode() do
not use it. QNames are simply (URI, name) pairs, which makes them very
elegant.

This would make most parsers automatically source-compatible. The only
parsers needing an update are the ones that use plain Strings as keys.

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax: +49 421 218 65505
http://www.pangaea.de/
E-mail: [EMAIL PROTECTED]

> -----Original Message-----
> From: Jukka Zitting [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, December 03, 2008 1:35 AM
> To: [email protected]
> Subject: Normalize metadata to Dublin Core
>
> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>
> WDYT?
>
> [1] http://lucene.apache.org/tika/formats.html
>
> BR,
>
> Jukka Zitting
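
To make the QName-keyed map from the reply above a bit more concrete, here is a minimal sketch using javax.xml.namespace.QName from the JDK. The class name, the constant names and the sample values are hypothetical and are not Tika's actual API; the only fixed fact assumed is the Dublin Core element namespace URI, http://purl.org/dc/elements/1.1/.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;

// Hypothetical sketch, not Tika's Metadata class.
class QNameMetadataSketch {
    // Dublin Core element set namespace URI.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Illustrative constants: (namespace URI, local name, prefix).
    static final QName TITLE = new QName(DC_NS, "title", "dc");
    static final QName FORMAT = new QName(DC_NS, "format", "dc");

    // Builds a small metadata map keyed by QName (sample values are made up).
    static Map<QName, String> sample() {
        Map<QName, String> metadata = new HashMap<>();
        metadata.put(TITLE, "Example document");
        metadata.put(FORMAT, "application/pdf");
        return metadata;
    }

    public static void main(String[] args) {
        Map<QName, String> metadata = sample();

        // A QName built without a prefix is equal to TITLE, because
        // equals() and hashCode() only consider URI and local name.
        QName titleNoPrefix = new QName(DC_NS, "title");
        System.out.println(TITLE.equals(titleNoPrefix));
        System.out.println(metadata.get(titleNoPrefix));
    }
}
```

A parser that already refers to the constants (TITLE, FORMAT) would not notice the key type changing from String to QName, while a parser that passes string literals as keys would fail to compile and need updating.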
