Hi,

On Mon, Mar 10, 2008 at 8:21 PM, Jérôme Charron
<[EMAIL PROTECTED]> wrote:
>  > In my proposal I'd handle such needs more
>  > generally by allowing structured metadata values, not just strings.
>
>  If I understand, instead of storing many values for a specified key, I will
>  store a List of values?

Yes.

>  > parsing operation. Since Tika doesn't need to worry about serializing
>  > the metadata, we should IMHO opt for structured data types instead of
>  > strings where appropriate.
>
>  +1, but it adds a level of complexity in metadata handling for Tika users:
>  knowing the type associated with a specific metadata key, no?
>  (I agree that it is more or less the case with date or url values
>  serialized)

Yes, but you can't do anything (I'm assuming automated processing
here) with a piece of metadata unless you know what type it is. And as
long as the structured values have good toString() implementations,
they will remain useful for manual processing as well.

IMHO it's much better to know that this piece of metadata is a Date
than that it's a String that (hopefully) matches one of the ISO 8601
or other well known date patterns.
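
To make the idea concrete, here is a rough sketch of what typed values
could look like from the client side. Everything in it is hypothetical:
StructuredMetadata, set(), get() and getAsString() are names I made up
for illustration, not the current Tika Metadata API, and java.time.Instant
simply stands in for whatever date type we would pick.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch only: this is NOT Tika's Metadata class.
final class StructuredMetadata {
    // Values are stored as objects (Instant, List, String, ...), not just strings.
    private final Map<String, Object> values = new HashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    // Returns the value only if it is present and has the requested type.
    <T> Optional<T> get(String key, Class<T> type) {
        Object value = values.get(key);
        return type.isInstance(value) ? Optional.of(type.cast(value)) : Optional.empty();
    }

    // Manual processing can still fall back to toString().
    String getAsString(String key) {
        Object value = values.get(key);
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        StructuredMetadata metadata = new StructuredMetadata();
        // The key names and values below are made up for the example.
        metadata.set("date", Instant.parse("2008-03-10T00:00:00Z"));
        metadata.set("subject", List.of("metadata", "parsing"));

        // Automated processing asks for the type it expects.
        metadata.get("date", Instant.class)
                .ifPresent(date -> System.out.println("epoch seconds: " + date.getEpochSecond()));

        // A mismatched type is reported as absent instead of being mis-parsed.
        System.out.println(metadata.get("date", String.class).isPresent());

        // Human-readable view through toString().
        System.out.println(metadata.getAsString("subject"));
    }
}
```

The point of the typed getter is that a client never has to guess a date
pattern: it either gets a date object or learns that the value is not one.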

>  > For example, we currently have both DublinCore.FORMAT and
>  > HttpHeaders.CONTENT_TYPE whose semantics are largely overlapping. Each
>  > such case adds ambiguity and makes automatic metadata processing
>  > harder.
>
>  -1
>  HttpHeaders.CONTENT_TYPE and DublinCore.FORMAT have the same semantics, but
>  they don't come from the same level of information: HTTP is a low level
>  of information and Dublin Core is a high level => a Tika client should have
>  access to both pieces of information and then decide which is more reliable
>  in its case.

That's true if you think of the Metadata just as a container for
information from various sources.

My motivation for requirement 7 was more about Tika as a whole and
especially the Tika parsers. The parsers should always try to produce
as accurate and consistent document metadata as possible. IMHO it
would be a major problem for one parser to report the document type as
CONTENT_TYPE and another as FORMAT. We should pick one and only one
metadata key as the canonical place for document type information
reported by Tika.

If Tika receives a document of type X with HTTP Content-Type set to A
and Dublin Core dc:format set to B, then the metadata output to a
client should be X. It would be nice if the output also contained A
and B as auxiliary information, but we should make it very clear that
the metadata key used for X in this case is the one and only key that
a client should look at to find the most accurate information that
Tika has about the document.
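
A minimal sketch of that X/A/B scenario, just to illustrate the intent.
The class, the method and all three key names are placeholders I invented
for this mail; I'm not proposing them as the actual keys, and the type
detection that produces X is assumed to have happened elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: the key names are placeholders, not Tika constants.
final class CanonicalTypeExample {
    // The one and only key a client should read for the document type.
    static final String CANONICAL_TYPE = "canonical-type";
    // Auxiliary keys that record what the individual sources claimed.
    static final String AUX_HTTP = "source:http-content-type";
    static final String AUX_DC = "source:dc-format";

    // detected = X (Tika's own conclusion), http = A, dcFormat = B.
    static Map<String, String> report(String detected, String http, String dcFormat) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put(CANONICAL_TYPE, detected);
        if (http != null) {
            out.put(AUX_HTTP, http);
        }
        if (dcFormat != null) {
            out.put(AUX_DC, dcFormat);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample values are made up: the sources disagree with the detected type.
        Map<String, String> out =
                report("application/pdf", "text/plain", "application/octet-stream");
        out.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Whichever parser handled the document, a client would read only the
canonical key; the auxiliary entries are there for clients that want to
second-guess the result.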

BR,

Jukka Zitting
