Hi,

On Mon, Mar 10, 2008 at 3:55 PM, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > 1) Metadata should consist of a modifiable set of keys mapped to values.
>
> Let's be concrete here:
>
> I think that Metadata should be of the form:
>
> [key]=>1...n [value]
>
> Where [key]'s are modifiable (as are [value]'s)
>
> This is what you are expressing, correct?

Yes, though I'm not sure where the 1...n requirement for metadata values
comes from. In my proposal I'd handle such needs more generally by allowing
structured metadata values, not just strings.

> > 2) Metadata keys should be designed to avoid collisions or misspellings.
>
> +1 in all cases that we are in control of, with the caveat that sometimes we
> aren't in control of the keys being used, especially in situations where we
> have automated Metadata retrieval.

Metadata is not very useful if it can't be reliably and automatically
processed, so I'd rather avoid situations where there's a chance for
confusion. Also, it's IMHO better to resolve any ambiguities before putting
things into the Metadata instance instead of guessing later on what the
metadata you have really means.

In the Nutch case you mentioned, I would ask Nutch to understand the
difference between the various forms of HTTP headers and to normalize that
metadata before feeding it to Tika. After all, there's nothing HTTP-specific
in Tika, whereas Nutch knows much more about the relevant details and actual
reality out there.

> > 3) It should be possible to store non-String metadata, like Locale
> > settings, Date instances, thumbnail images, etc.
>
> I somewhat have to agree with Jerome here -- what is the value of storing
> non-String metadata [values]? To me, this goes against standards like Dublin
> Core, or ISO 11179, which explicitly define metadata values to be of String
> form.

Again, the more reliably the metadata can be automatically processed, the
better. It's of course possible to serialize all sorts of metadata to
strings, but each such case introduces an inherently unreliable parsing
operation. Since Tika doesn't need to worry about serializing the metadata,
we should IMHO opt for structured data types instead of strings where
appropriate.

> > 7) No two distinct metadata keys should be used for the same metadata
> > semantics.
>
> Could you elaborate on this with an explicit example?

For example, we currently have both DublinCore.FORMAT and
HttpHeaders.CONTENT_TYPE whose semantics are largely overlapping. Each such
case adds ambiguity and makes automatic metadata processing harder.

> > 8) The Metadata class should have convenience methods for accessing
> > the most commonly used metadata.
>
> I'm not so sure that these types of methods should be part of a canonical
> Metadata class. To me, it's like building the data model into the software,
> which is typically a bad practice.

Good point, having such methods on a decorator layer makes sense.

BR,

Jukka Zitting
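
A rough sketch of how points 2) and 3) above could fit together, for
illustration only. The names Key, TypedMetadata, LANGUAGE and MODIFIED are
hypothetical and are not part of the existing Tika Metadata class. Keys are
declared once as typed constants, so a misspelled key fails at compile time,
and values are stored as structured objects, so no string parsing is needed
when reading them back.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical typed key; the constant replaces a free-form string name. */
final class Key<T> {
    final String name;
    final Class<T> type;

    private Key(String name, Class<T> type) {
        this.name = name;
        this.type = type;
    }

    static <T> Key<T> of(String name, Class<T> type) {
        return new Key<T>(name, type);
    }
}

/** Hypothetical container for illustration; NOT the Tika Metadata class. */
final class TypedMetadata {
    // One constant per metadata concept, so no two keys share the same semantics.
    static final Key<Locale> LANGUAGE = Key.of("language", Locale.class);
    static final Key<Date> MODIFIED = Key.of("modified", Date.class);

    // Keys compare by identity, so only the declared constants can be used.
    private final Map<Key<?>, Object> values = new HashMap<Key<?>, Object>();

    /** Stores a value; the compiler rejects a value of the wrong type. */
    <T> void set(Key<T> key, T value) {
        values.put(key, value);
    }

    /** Returns the structured value, or null if the key is not set. */
    <T> T get(Key<T> key) {
        return key.type.cast(values.get(key));
    }
}

// Usage:
//   TypedMetadata metadata = new TypedMetadata();
//   metadata.set(TypedMetadata.LANGUAGE, Locale.FRENCH);
//   Locale language = metadata.get(TypedMetadata.LANGUAGE);
```

With this shape, a client such as Nutch would normalize its HTTP headers
into the declared keys before calling set(), rather than leaving the
consumer to guess what a given string means.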