> >  I think that Metadata should be of the form:
> >
> >  [key]=>1...n [value]
> >
> >  Where [key]'s are modifiable (as are [value]'s)
> >
> >  This is what you are expressing, correct?
>
> Yes, though I'm not sure where the 1...n requirement for metadata
> values comes from.

It comes from Nutch, and more generally from HTTP, where a header (a piece of
metadata) can be multivalued.


> In my proposal I'd handle such needs more
> generally by allowing structured metadata values, not just strings.

If I understand correctly, instead of storing many values for a given key, I
would store a single List of values under that key?
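
To check that we mean the same thing, here is a minimal sketch of the two
shapes. The class and method names are hypothetical and only for illustration;
this is not Tika's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Tika's actual API: one key maps to 1..n string values.
class MultiValuedMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    // Append a value; earlier values for the same key are kept.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // All values for the key in insertion order; an empty list if the key is absent.
    List<String> getValues(String key) {
        List<String> found = values.get(key);
        return found == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(found);
    }
}

// Hypothetical sketch of the alternative: one key maps to exactly one value,
// and that value may itself be structured (for example a List).
class StructuredMetadata {
    private final Map<String, Object> values = new HashMap<>();

    // Replace whatever was stored under the key.
    void set(String key, Object value) {
        values.put(key, value);
    }

    // The stored value, or null if the key is absent.
    Object get(String key) {
        return values.get(key);
    }
}
```

In the first shape the container itself knows about multiplicity; in the
second, a multivalued header would simply be stored as one List value.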

> In the Nutch case you mentioned, I would ask Nutch to understand the
> difference between the various forms of HTTP headers and to normalize
> that metadata before feeding it to Tika. After all, there's nothing
> HTTP-specific in Tika, whereas Nutch knows much more about the
> relevant details and actual reality out there.

+1


> parsing operation. Since Tika doesn't need to worry about serializing
> the metadata, we should IMHO opt for structured data types instead of
> strings where appropriate.

+1, but it adds a level of complexity to metadata handling for Tika users:
they have to know the type associated with each specific metadata key, no?
(I agree that this is already more or less the case with date or URL values
serialized as strings.)
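
A minimal sketch of what I mean by that extra complexity, with a hypothetical
TypedMetadata class (an assumption for illustration, not Tika's actual API):
the caller must name the expected type for every key it reads.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Tika's actual API: values keep their Java type,
// so the caller has to say which type it expects for each key.
class TypedMetadata {
    private final Map<String, Object> values = new HashMap<>();

    // Store a value of any type under the key, replacing any previous value.
    void set(String key, Object value) {
        values.put(key, value);
    }

    // Returns the value if it is present and of the expected type, otherwise null.
    <T> T get(String key, Class<T> type) {
        Object value = values.get(key);
        return type.isInstance(value) ? type.cast(value) : null;
    }
}
```

A client that stores a java.util.Date under some key and later asks for a
String gets null back, so the key-to-type mapping has to be documented
somewhere.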


>
> >  > 7) No two distinct metadata keys should be used for the same metadata
> >  > semantics.
> >
> >  Could you elaborate on this with an explicit example?
>
> For example, we currently have both DublinCore.FORMAT and
> HttpHeaders.CONTENT_TYPE whose semantics are largely overlapping. Each
> such case adds ambiguity and makes automatic metadata processing
> harder.

-1
HttpHeaders.CONTENT_TYPE and DublinCore.FORMAT have the same semantics, but
they don't come from the same level of information: HTTP is a low-level
source and Dublin Core is a high-level one. A Tika client should therefore
have access to both values and decide which one is more reliable in its own
case.
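
As a minimal sketch of that client-side choice (the FormatResolver class and
the key strings are assumptions made up for this example, not Tika's actual
constants or API):

```java
import java.util.Map;

// Hypothetical sketch, not Tika's actual API: both sources stay available
// under their own keys and the client applies its own trust policy.
class FormatResolver {
    // Illustrative key names, assumed for this example only.
    static final String HTTP_CONTENT_TYPE = "Content-Type";
    static final String DC_FORMAT = "format";

    // Prefer one source and fall back to the other; null if neither is set.
    static String resolve(Map<String, String> metadata, boolean trustHttpFirst) {
        String http = metadata.get(HTTP_CONTENT_TYPE);
        String dc = metadata.get(DC_FORMAT);
        if (trustHttpFirst) {
            return http != null ? http : dc;
        }
        return dc != null ? dc : http;
    }
}
```

A crawler might trust the HTTP header first, while a client reading files from
disk might prefer the document's own Dublin Core value. Merging the two keys
would take that choice away.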

Best Regards

Jérôme


-- 
Jérôme Charron
Directeur Technique @ WebPulse
Tel: +33673716743 - [EMAIL PROTECTED]
http://blog.shopreflex.com/
http://www.shopreflex.com/
http://www.staragora.com/
