Hi,

On Mon, Mar 10, 2008 at 3:55 PM, Chris Mattmann
<[EMAIL PROTECTED]> wrote:
>  > 1) Metadata should consist of a modifiable set of keys mapped to values.
>
>  Let's be concrete here:
>
>  I think that Metadata should be of the form:
>
>  [key]=>1...n [value]
>
>  Where [key]'s are modifiable (as are [value]'s)
>
>  This is what you are expressing, correct?

Yes, though I'm not sure where the 1...n requirement for metadata
values comes from. In my proposal I'd handle such needs more
generally by allowing structured metadata values, not just strings.
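
As a rough sketch of what I mean by structured values (the Key and
TypedMetadata names below are hypothetical, not existing Tika classes),
a typed key could carry the value type, so a multi-valued entry is
simply a key whose value type is an array:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical typed key: the type parameter records what kind of value it maps to.
final class Key<T> {
    private final String name;
    private final Class<T> type;

    Key(String name, Class<T> type) {
        this.name = name;
        this.type = type;
    }

    String name() {
        return name;
    }

    // Checked cast, so a value stored under the wrong type fails fast.
    T cast(Object value) {
        return type.cast(value);
    }
}

// Hypothetical container; an illustration only, not the current Metadata class.
final class TypedMetadata {
    // Keys are compared by identity, which is fine when keys are shared constants.
    private final Map<Key<?>, Object> values = new HashMap<>();

    <T> void set(Key<T> key, T value) {
        values.put(key, value);
    }

    // Returns null when the key has not been set.
    <T> T get(Key<T> key) {
        return key.cast(values.get(key));
    }
}

// Example key constants (names are assumptions made for this sketch).
final class ExampleKeys {
    static final Key<Locale> LANGUAGE = new Key<>("language", Locale.class);
    static final Key<String[]> SUBJECT = new Key<>("subject", String[].class);
}
```

With something like this, metadata.get(ExampleKeys.LANGUAGE) returns a
Locale directly, and the 1...n case is covered by ExampleKeys.SUBJECT
without a special multi-value API.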

>  > 2) Metadata keys should be designed to avoid collisions or misspellings.
>
>  +1 in all cases that we are in control of, with the caveat that sometimes we
>  aren't in control of the keys being used, especially in situations where we
>  have automated Metadata retrieval.

Metadata is not very useful if it can't be reliably and automatically
processed, so I'd rather avoid situations where there's a chance for
confusion.

Also, it's IMHO better to resolve any ambiguities before putting
things into the Metadata instance instead of guessing later on what
the metadata you have really means.

In the Nutch case you mentioned, I would ask Nutch to understand the
difference between the various forms of HTTP headers and to normalize
that metadata before feeding it to Tika. After all, there's nothing
HTTP-specific in Tika, whereas Nutch knows much more about the
relevant details and actual reality out there.
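
To illustrate the kind of caller-side normalization I have in mind, here
is a minimal sketch. The HeaderNormalizer class and its list of headers
are assumptions made up for this example, not Nutch or Tika code, and it
assumes header names and values are non-null:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper living on the caller's side (e.g. in the crawler), not in Tika.
final class HeaderNormalizer {
    // Canonical spellings, keyed by the lower-cased header name.
    private static final Map<String, String> CANONICAL = new LinkedHashMap<>();
    static {
        CANONICAL.put("content-type", "Content-Type");
        CANONICAL.put("content-length", "Content-Length");
        CANONICAL.put("last-modified", "Last-Modified");
    }

    // Maps raw header names to their canonical spelling and trims the values.
    static Map<String, String> normalize(Map<String, String> rawHeaders) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : rawHeaders.entrySet()) {
            String lookup = entry.getKey().trim().toLowerCase(Locale.ROOT);
            String canonical = CANONICAL.get(lookup);
            if (canonical != null) {
                result.put(canonical, entry.getValue().trim());
            }
            // Unknown headers are dropped here rather than guessed at.
        }
        return result;
    }
}
```

The point is only that the ambiguity gets resolved once, by the
component that knows about HTTP, before anything reaches the Metadata
instance.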

>  > 3) It should be possible to store non-String metadata, like Locale
>  > settings, Date instances, thumbnail images, etc.
>
>  I somewhat have to agree with Jerome here -- what is the value of storing
>  non-String metadata [values]? To me, this goes against standards like Dublin
>  Core, or ISO 11179, which explicitly define metadata values to be of String
>  form.

Again, the more reliably the metadata can be automatically processed,
the better. It's of course possible to serialize all sorts of metadata
to strings, but each such case introduces an inherently unreliable
parsing operation. Since Tika doesn't need to worry about serializing
the metadata, we should IMHO opt for structured data types instead of
strings where appropriate.

>  > 7) No two distinct metadata keys should be used for the same metadata
>  > semantics.
>
>  Could you elaborate on this with an explicit example?

For example, we currently have both DublinCore.FORMAT and
HttpHeaders.CONTENT_TYPE whose semantics are largely overlapping. Each
such case adds ambiguity and makes automatic metadata processing
harder.

>  > 8) The Metadata class should have convenience methods for accessing
>  > the most commonly used metadata.
>
>  I'm not so sure that these types of methods should be part of a canonical
>  Metadata class. To me, it's like building the data model into the software,
>  which is typically a bad practice.

Good point, having such methods on a decorator layer makes sense.
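
Something along these lines, perhaps. This is only a sketch:
PlainMetadata stands in for a generic key/value store, and the
DocumentMetadata decorator and its key names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a generic key/value metadata store; not the actual Tika class.
final class PlainMetadata {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) {
        values.put(key, value);
    }

    // Returns null when the key has not been set.
    String get(String key) {
        return values.get(key);
    }
}

// Decorator that adds convenience accessors while the core store stays model-free.
final class DocumentMetadata {
    // Key names are assumptions for this example.
    static final String TITLE = "title";
    static final String CONTENT_TYPE = "Content-Type";

    private final PlainMetadata delegate;

    DocumentMetadata(PlainMetadata delegate) {
        this.delegate = delegate;
    }

    String getTitle() {
        return delegate.get(TITLE);
    }

    void setTitle(String title) {
        delegate.set(TITLE, title);
    }

    String getContentType() {
        return delegate.get(CONTENT_TYPE);
    }

    void setContentType(String contentType) {
        delegate.set(CONTENT_TYPE, contentType);
    }
}
```

The canonical class then knows nothing about titles or content types,
and clients that want the shortcuts wrap it in the decorator.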

BR,

Jukka Zitting
