Hi Jukka,

Very nice set of requirements. I'd like to chime in on each below:

> 
> 0) Metadata in Tika is always about the document being parsed.

+1

> 
> 1) Metadata should consist of a modifiable set of keys mapped to values.

Let's be concrete here:

I think that Metadata should be of the form:

[key]=>1...n [value]

where [key]s are modifiable (as are [value]s).

This is what you are expressing, correct?
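
To make the question concrete, here is a minimal sketch of the shape I have in mind. It is not the existing Metadata class; the name MultiValuedMetadata and its methods are hypothetical, and values are assumed to be Strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the existing Tika Metadata class.
class MultiValuedMetadata {
    // Each key maps to 1..n String values, kept in insertion order.
    private final Map<String, List<String>> entries = new LinkedHashMap<>();

    // Append a value to the key, creating the key if it does not exist yet.
    void add(String key, String value) {
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Replace all values of the key with a single value.
    void set(String key, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        entries.put(key, values);
    }

    // Remove the key together with all of its values.
    void remove(String key) {
        entries.remove(key);
    }

    // Read-only view of the values; empty if the key is absent.
    List<String> getValues(String key) {
        List<String> values = entries.get(key);
        if (values == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(values);
    }
}
```

Both keys and values stay modifiable through add(), set() and remove(), while callers only ever see a read-only view of the value list.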

> 
> 2) Metadata keys should be designed to avoid collisions or misspellings.

+1 in all cases that we are in control of, with the caveat that sometimes we
aren't in control of the keys being used, especially in situations where we
have automated Metadata retrieval. For instance, in Nutch, we use the
Metadata object to store "content-type", which is returned from a web
server, and the metadata keys are processed in an automated fashion. Some
servers return "Content-type", some return "content-type", and some return
"content type"; you get the picture. So we still need to support that type
of use case.
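
As an illustration of the kind of support I mean, here is a minimal sketch that folds those header variants onto one key. KeyNormalizer is a hypothetical name; this is plain normalization and is not meant to reproduce what SpellCheckedMetadata does.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch or Tika.
class KeyNormalizer {
    // Trim the key, lower-case it, and collapse any run of whitespace,
    // underscores or hyphens into a single hyphen.
    static String normalize(String rawKey) {
        return rawKey.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[\\s_-]+", "-");
    }
}
```

With this, "Content-type", "content-type" and "content type" all normalize to the same key, so keys we don't control can still be looked up consistently.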

> 
> 3) It should be possible to store non-String metadata, like Locale
> settings, Date instances, thumbnail images, etc.

I somewhat have to agree with Jerome here -- what is the value of storing
non-String metadata [values]? To me, this goes against standards like Dublin
Core, or ISO 11179, which explicitly define metadata values to be of String
form.


> 
> 4) We should document and enforce a standard set of metadata keys,
> based on Dublin Core and other standards where possible.

+1, with the caveat that we need to allow folks to define their own keys as
well.

> 
> 5) It should be easy to extend the set of metadata keys to include
> custom metadata.

+1, ah perfect, you covered my caveat from #4 above here. Great.

> 
> 6) All metadata keys (both standard and custom) should be clearly
> documented with the expected value type and recommended usage.

+1

> 
> 7) No two distinct metadata keys should be used for the same metadata
> semantics.

Could you elaborate on this with an explicit example?

> 
> 8) The Metadata class should have convenience methods for accessing
> the most commonly used metadata.

I'm not so sure that these types of methods should be part of a canonical
Metadata class. To me, it's like building the data model into the software,
which is typically a bad practice. Keeping them as separate, independent
entities allows the data and software models to evolve independently and
promotes loose coupling between them, making the software less fragile to
changes in the underlying data model. (What if Dublin Core changes in 5
years, doing away with certain fields while adding others?)

I would support this idea, however, as a set of higher level, convenience
Metadata decorator classes (e.g., like SpellCheckedMetadata).
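
Here is a rough sketch of the decorator idea, under stated assumptions: MetadataStore, PlainMetadata and DublinCoreMetadata are hypothetical names, not existing Tika classes, and values are single Strings to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal contract for a metadata store (not Tika's actual API).
interface MetadataStore {
    void set(String key, String value);
    String get(String key);
}

// Canonical store: knows nothing about Dublin Core or any other data model.
class PlainMetadata implements MetadataStore {
    private final Map<String, String> values = new HashMap<>();

    @Override
    public void set(String key, String value) {
        values.put(key, value);
    }

    @Override
    public String get(String key) {
        return values.get(key);
    }
}

// Decorator: adds Dublin Core convenience methods on top of any MetadataStore.
class DublinCoreMetadata implements MetadataStore {
    static final String TITLE = "title";
    static final String FORMAT = "format";

    private final MetadataStore delegate;

    DublinCoreMetadata(MetadataStore delegate) {
        this.delegate = delegate;
    }

    // Generic access is passed straight through to the wrapped store.
    @Override
    public void set(String key, String value) {
        delegate.set(key, value);
    }

    @Override
    public String get(String key) {
        return delegate.get(key);
    }

    // Data-model-specific conveniences live here, not in the canonical class.
    String getTitle() { return get(TITLE); }
    void setTitle(String title) { set(TITLE, title); }
    String getFormat() { return get(FORMAT); }
    void setFormat(String format) { set(FORMAT, format); }
}
```

If Dublin Core changes, only the decorator needs to change; the canonical store and everything written against it are unaffected.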

> 
> The current Metadata class fails somewhat with 2 (there's even a
> SpellCheckedMetadata class) and doesn't support 3 or 8. The constants
> in o.a.tika.metadata interfaces go some way towards 4 and 6, but not
> as far as they could. And we don't do that well on 7.
> 
> So in general I think there's much that we could improve on. To
> resolve most of the issues I'd like to modify metadata handling as
> follows:
> 
> a) Allow both metadata keys and values to be arbitrary Objects.

I'd still like to know first what value is added by having Metadata values
be non-Strings.

> 
> b) Instead of String constants as metadata keys, use constant object
> instances like DublinCore.TITLE = new DublinCore("title"). These
> objects should have good hashCode(), equals(), and toString()
> implementations.

Having metadata keys be non-Strings may lead to some edge cases, and
requires developers to understand how to write good hashCode() and equals()
methods, which have been shown in practice to be things that developers are
not really good at.

In addition, doing things this way arguably makes support for #5 above a bit
more difficult...
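
For reference, this is roughly what every key class, including each custom one, would have to get right under (b). It is only a sketch: PropertyKey and its namespace/name fields are hypothetical and are not taken from your proposal.

```java
import java.util.Objects;

// Hypothetical key type illustrating the equals()/hashCode() contract.
final class PropertyKey {
    private final String namespace;
    private final String name;

    PropertyKey(String namespace, String name) {
        this.namespace = Objects.requireNonNull(namespace);
        this.name = Objects.requireNonNull(name);
    }

    // Two keys are equal only if both namespace and name match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof PropertyKey)) {
            return false;
        }
        PropertyKey that = (PropertyKey) other;
        return namespace.equals(that.namespace) && name.equals(that.name);
    }

    // Must be computed from the same fields that equals() compares.
    @Override
    public int hashCode() {
        return Objects.hash(namespace, name);
    }

    @Override
    public String toString() {
        return namespace + ":" + name;
    }
}
```

If a custom key class overrides equals() but forgets hashCode(), lookups in a hash-based map can silently miss, which is the kind of edge case I'm worried about.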

> 
> c) Use Date instances for date metadata, URI instances for URIs, etc.
> All value objects should preferably have good toString()
> implementations.
> 
> d) Use the Dublin Core "identifier" property instead of the current
> RESOURCE_NAME_KEY, and the "format" property instead of CONTENT_TYPE.

+1

> 
> e) Add utility methods like set/getIdentifier(), set/getFormat(),
> set/getTitle(), etc. to the Metadata class for accessing the key
> Dublin Core metadata.

As a decorator, I totally support this. As extensions to the canonical
Metadata class, I think they are a bit heavy-weight, and tightly coupled to
the underlying data model.


In general, I like your proposal, Jukka -- I'd just like to hear some more
rationale for things like using non-String metadata keys, and for placing
some of the data-model-specific items (e.g., DublinCore) in the canonical
Metadata class rather than in decorators.

My 2 cents,
 Chris


> 
> WDYT?
> 
> BR,
> 
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

