Hi Jukka,

Very nice set of requirements. I'd like to chime in on each below:
>
> 0) Metadata in Tika is always about the document being parsed.

+1

>
> 1) Metadata should consist of a modifiable set of keys mapped to values.

Let's be concrete here: I think that Metadata should be of the form:

[key] => 1...n [value]

where the [key]s are modifiable (as are the [value]s). This is what you
are expressing, correct?

>
> 2) Metadata keys should be designed to avoid collisions or misspellings.

+1 in all cases that we are in control of, with the caveat that
sometimes we aren't in control of the keys being used, especially in
situations where we have automated Metadata retrieval. For instance, in
Nutch, we use the Metadata object to store "content-type", which is
returned from a web server, processing the met keys in an automated
fashion. Some servers return "Content-type", some return "content-type"
or "content type"; well, you get the picture. So we still need tolerance
for key variants, at least to support that type of use case.

>
> 3) It should be possible to store non-String metadata, like Locale
> settings, Date instances, thumbnail images, etc.

I somewhat have to agree with Jerome here -- what is the value of
storing non-String metadata [values]? To me, this goes against standards
like Dublin Core, or ISO 11179, which explicitly define metadata values
to be of String form.

>
> 4) We should document and enforce a standard set of metadata keys,
> based on Dublin Core and other standards where possible.

+1, with the caveat that we need to allow folks to define their own keys
as well.

>
> 5) It should be easy to extend the set of metadata keys to include
> custom metadata.

+1, ah perfect, you covered my caveat from #4 above here. Great.

>
> 6) All metadata keys (both standard and custom) should be clearly
> documented with the expected value type and recommended usage.

+1

>
> 7) No two distinct metadata keys should be used for the same metadata
> semantics.

Could you elaborate on this with an explicit example?
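
To make the [key] => 1...n [value] form from #1 and the key-variant
problem from #2 concrete, here is a rough sketch. The class name
MultiValuedMetadata and the normalization rule (lowercase, with spaces
and underscores treated as hyphens) are made up for illustration; this
is not the existing Tika or Nutch Metadata class, just one way the idea
could look.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not Tika's or Nutch's Metadata class. */
final class MultiValuedMetadata {

    // Normalized key -> ordered list of String values.
    private final Map<String, List<String>> entries = new LinkedHashMap<>();

    /** Assumed rule: lowercase, and treat whitespace/underscores as hyphens. */
    static String normalize(String key) {
        return key.trim().toLowerCase(Locale.ROOT).replaceAll("[\\s_]+", "-");
    }

    /** Appends a value to the key, keeping any values already stored. */
    void add(String key, String value) {
        entries.computeIfAbsent(normalize(key), k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values of the key with a single value. */
    void set(String key, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        entries.put(normalize(key), values);
    }

    /** Returns a read-only view of the values, or an empty list if absent. */
    List<String> getValues(String key) {
        List<String> values = entries.get(normalize(key));
        return values == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(values);
    }

    public static void main(String[] args) {
        MultiValuedMetadata met = new MultiValuedMetadata();
        met.add("Content-type", "text/html");   // as one server might send it
        met.add("content type", "text/plain");  // as another might send it
        // Both spellings land under the same normalized key.
        System.out.println(met.getValues("content-type"));
        // prints [text/html, text/plain]
    }
}
```

Normalizing on the way in keeps the stored keys clean while still
accepting whatever spelling an automated source hands us.
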
>
> 8) The Metadata class should have convenience methods for accessing
> the most commonly used metadata.

I'm not so sure that these types of methods should be part of a
canonical Metadata class. To me, it's like building the data model into
the software, which is typically a bad practice. Keeping them as
separate, independent entities allows both the data and software models
to evolve independently, and promotes loose coupling between them,
making the software less fragile to changes in the underlying data model
(what if Dublin Core changes in 5 years, and does away with certain
fields while adding others?). I would support this idea, however, as a
set of higher-level convenience Metadata decorator classes (e.g., like
SpellCheckedMetadata).

>
> The current Metadata class fails somewhat with 2 (there's even a
> SpellCheckedMetadata class) and doesn't support 3 or 8. The constants
> in o.a.tika.metadata interfaces go some way towards 4 and 6, but not
> as far as they could. And we don't do that well on 7.
>
> So in general I think there's much that we could improve on. To
> resolve most of the issues I'd like to modify metadata handling as
> follows:
>
> a) Allow both metadata keys and values to be arbitrary Objects.

I'd first like to know what the added value is of having Metadata values
be non-Strings.

>
> b) Instead of String constants as metadata keys, use constant object
> instances like DublinCore.TITLE = new DublinCore("title"). These
> objects should have good hashCode(), equals(), and toString()
> implementations.

Having metadata keys be non-Strings may lead to some boundary-case
situations, and requires developers to understand how to write good
hashCode() and equals() methods, which in practice developers are not
really good at. In addition, doing things this way arguably makes
support for #5 above a bit more difficult...

>
> c) Use Date instances for date metadata, URI instances for URIs, etc.
> All value objects should preferably have good toString()
> implementations.
>
> d) Use the Dublin Core "identifier" property instead of the current
> RESOURCE_NAME_KEY, and the "format" property instead of CONTENT_TYPE.

+1

>
> e) Add utility methods like set/getIdentifier(), set/getFormat(),
> set/getTitle(), etc. to the Metadata class for accessing the key
> Dublin Core metadata.

As a decorator, I totally support this. As extensions to the canonical
Metadata class, I think they are a bit heavy-weight, and tightly coupled
to the underlying data model.

In general, I like your proposal Jukka -- I'd just like to hear some
more rationale for things like using non-String met keys, and for not
having some of the data model specific items (e.g., DublinCore) be
placed in decorators rather than the canonical Metadata class.

My 2 cents,
Chris

>
> WDYT?
>
> BR,
>
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.