Hi Uwe,

Using QNames as met keys sounds like a very interesting proposal, provided we
still allow the simple met API of adding String keys as well. QNames
shouldn't be the only mechanism for adding new met keys: we should support
them, but for backwards compatibility and ease of use we should also keep
allowing metadata to be added with String keys. So, this should still work:

Metadata met = new Metadata();
met.addMetadata("format", "val1");
met.addMetadata("format", "val2");

And, this should work:

Metadata met = new Metadata();
met.addMetadata(new QName("http://purl.org/dc/elements/1.0","format","dc"),"val1");
met.addMetadata(new QName("http://purl.org/dc/elements/1.0","format","dc"),"val2");

To support the simple case, we could have the method:

Metadata#addMetadata(String,String)

simply be a wrapper around

Metadata#addMetadata(QName,String)

like:

public void addMetadata(String key, String val){
  this.addMetadata(new QName(key), val);
}
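
To make that a bit more concrete, here is a rough, self-contained sketch of
how the two methods could sit together. MetadataSketch and getValues are
made-up names for illustration (this is not the current Tika Metadata class),
and the multi-valued Map<QName, List<String>> storage is only my assumption.
The namespace URI is copied from the example above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.namespace.QName;

// Hypothetical sketch only; not the actual Tika Metadata class.
class MetadataSketch {
    // Assumed storage: each QName key maps to the list of values added for it.
    private final Map<QName, List<String>> values = new LinkedHashMap<>();

    // Primary method: keys are QNames.
    public void addMetadata(QName key, String val) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(val);
    }

    // Convenience wrapper: a plain String key becomes a QName with no namespace.
    public void addMetadata(String key, String val) {
        addMetadata(new QName(key), val);
    }

    // Returns the values stored under the key, or an empty list if there are none.
    public List<String> getValues(QName key) {
        List<String> found = values.get(key);
        return found == null ? Collections.<String>emptyList() : found;
    }

    // String-keyed lookup, mirroring the String-keyed add.
    public List<String> getValues(String key) {
        return getValues(new QName(key));
    }
}
```

One thing to keep in mind with this approach: QName equality looks only at
the namespace URI and the local part, so the "dc" prefix makes no difference
to lookups, but a plain String key "format" (no namespace) and the
namespaced "format" QName are two different keys.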

WDYT?

At a later time, if we really find no one uses the String keys, we can
deprecate the methods and then remove them from the API...

Cheers,
Chris




On 12/3/08 1:20 AM, "Uwe Schindler" <[EMAIL PROTECTED]> wrote:

> Hi Jukka,
>
> I like this.
>
> For the implementation (I noted this also in the corresponding JIRA issue):
> How about using QNames as Keys in the metadata map (e.g. Map<QName,
> String>)? For the standard metadata entries from Dublin core, that are
> "mandatory" for all parsers like Title (the current constants like
> CONTENT_TYPE), we could simply redefine the constants as QNames with the DC
> namespace-URI and maybe a prefix (but the prefix is not used in QNames, it's
> only there for reference; equals and hashCode do not use it. QNames are
> simply pairs of (URI,Name)). This makes QNames very elegant. This would make
> most parsers automatically source-compatible. The parsers needing an update
> are the ones that use plain Strings as keys.
>
> -----
> UWE SCHINDLER
> Webserver/Middleware Development
> PANGAEA - Publishing Network for Geoscientific and Environmental Data
> MARUM - University of Bremen
> Room 2500, Leobener Str., D-28359 Bremen
> Tel.: +49 421 218 65595
> Fax:  +49 421 218 65505
> http://www.pangaea.de/
> E-mail: [EMAIL PROTECTED]
>
>> -----Original Message-----
>> From: Jukka Zitting [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, December 03, 2008 1:35 AM
>> To: [email protected]
>> Subject: Normalize metadata to Dublin Core
>>
>> Hi,
>>
>> Currently Tika doesn't have any good guidelines on the semantics and
>> usage of metadata keys. Mostly we've just ended up with a few basic
>> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
>> other keys. The result is that a client that currently wants to assign
>> any reasonable semantics to the extracted metadata needs to first
>> check the reported CONTENT_TYPE and use that to deduce the meanings of
>> the other available metadata keys based on documentation in [1].
>>
>> This is not optimal. It should be up to the Tika parsers to interpret
>> the metadata available in the supported document types and map that as
>> well as possible to a single standard like Dublin Core. This way a
>> client only needs to know a single set of metadata semantics.
>>
>> The parser can still make the raw underlying metadata available using
>> metadata keys that are specific to the actual metadata schema used in
>> the document type, but that should be considered an extra feature
>> beyond the normalized Dublin Core output.
>>
>> One corollary of this is that we should replace the current HTTP-based
>> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>>
>> WDYT?
>>
>> [1] http://lucene.apache.org/tika/formats.html
>>
>> BR,
>>
>> Jukka Zitting
>
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

