Hi Chris,

Thanks for your comments.

>I am not strongly supportive of changing the HashMap internal 
>representation in Metadata out.
>A couple of things I like about the HashMap:
>
>* It's simple.
>* It doesn't require dependency on any external libraries and helps keep 
>tika-core minimal.
>
>Wouldn't it be possible for example to simply have XMP be something that sits 
>on top of the Metadata object?

There are definitely many different ways we could implement metadata handling 
in Tika. From my perspective, though, the rationale for using XMP as the 
underlying data model of whichever implementation we choose is the following:

Right now Tika provides only a limited, common set of metadata for all 
supported file formats. We have already agreed that this is fine and should 
stay that way, which also means clients can continue to use the current simple 
API. But there are clients and use cases that need access to more than the 
currently supported set, and that need semantic information (i.e. namespace 
information) to travel with the metadata.
One example of demand for extended metadata is video workflows, which 
increasingly use temporal metadata that is quite structured and complex 
(compared to a simple property like a title). The same is true for the face 
recognition metadata that all current image applications (and also camera 
devices) are already writing into the assets.
An example of the importance of semantic information: at Adobe we already have 
to worry about something as simple as the "creation date". It could be the date 
the asset was written to the hard disk I am currently looking at, the date the 
original creator wrote it to their hard disk, the date the artwork was 
digitized (i.e. scanned), or the date the work shown in the digital image was 
created. Namespaces provide that information.
Oh, and copyright information is also pretty sensitive when it comes to 
semantics :)
A namespace registry like the one provided by the XMP library is pretty handy 
here, because storing information under bare prefixes is easy but also 
dangerous: a prefix is just a variable, and only the namespace URI it is bound 
to identifies the meaning.
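
To illustrate why prefixes alone are not enough, here is a minimal sketch of 
what a namespace registry does. The class and method names are made up for 
this mail and are not the XMP library's API; the second URI is a placeholder. 
The point is only that the URI is the identity, and the prefix is handed out 
(and adjusted on collision) by the registry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, minimal namespace registry (illustration only). */
public class NamespaceRegistrySketch {
    // namespace URI -> prefix actually assigned
    private final Map<String, String> prefixByUri = new HashMap<>();
    // assigned prefix -> namespace URI
    private final Map<String, String> uriByPrefix = new HashMap<>();

    /**
     * Registers a namespace URI under a suggested prefix. If the prefix is
     * already bound to a different URI, a numbered variant is used instead.
     * Returns the prefix that is actually bound to the URI.
     */
    public String register(String uri, String suggestedPrefix) {
        String existing = prefixByUri.get(uri);
        if (existing != null) {
            return existing; // URI already known: keep its prefix stable
        }
        String prefix = suggestedPrefix;
        int n = 1;
        while (uriByPrefix.containsKey(prefix)) {
            prefix = suggestedPrefix + "_" + n++;
        }
        prefixByUri.put(uri, prefix);
        uriByPrefix.put(prefix, uri);
        return prefix;
    }

    /** Resolves a prefix back to its namespace URI, or null if unknown. */
    public String uriFor(String prefix) {
        return uriByPrefix.get(prefix);
    }
}
```

With this, two parsers that both ask for the prefix "xmp" for different URIs 
get different prefixes back, so their properties cannot be confused later on.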

I would argue that it is difficult to store such data faithfully in a simple 
HashMap. And having two data models storing the data is pretty error-prone.
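
To make the "creation date" problem concrete, here is a small sketch. It is 
not Tika code: the class name, the bare key and the date values are invented 
for illustration. The namespace URIs and property names are the XMP basic and 
EXIF ones as I know them from the XMP specification. A map keyed by a bare 
name silently keeps only one of the dates, while namespace-qualified keys keep 
all of them apart.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: bare keys versus namespace-qualified keys. */
public class CreationDateSketch {
    static final String NS_XMP = "http://ns.adobe.com/xap/1.0/";
    static final String NS_EXIF = "http://ns.adobe.com/exif/1.0/";

    /** Builds a key of the form "{namespaceURI}localName". */
    static String qname(String namespaceUri, String localName) {
        return "{" + namespaceUri + "}" + localName;
    }

    /** Bare key: each put overwrites the previous "creation date". */
    static Map<String, String> bareKeys() {
        Map<String, String> m = new HashMap<>();
        m.put("creation date", "2010-06-01"); // file written to disk (made-up value)
        m.put("creation date", "1998-03-15"); // picture taken (made-up value)
        m.put("creation date", "2009-11-20"); // print scanned (made-up value)
        return m;
    }

    /** Qualified keys: the three dates stay distinct and self-describing. */
    static Map<String, String> qualifiedKeys() {
        Map<String, String> m = new HashMap<>();
        m.put(qname(NS_XMP, "CreateDate"), "2010-06-01");
        m.put(qname(NS_EXIF, "DateTimeOriginal"), "1998-03-15");
        m.put(qname(NS_EXIF, "DateTimeDigitized"), "2009-11-20");
        return m;
    }
}
```

Qualified keys only solve the simple-property case, of course. Structured 
values such as temporal or face-region metadata still need more than a flat 
map of strings.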

The XMP library would add a dependency and some size to tika-core, but it is 
really just the data model plus a parser/serializer for the XML/RDF, so the 
footprint is small.
Additionally, if you want to provide XMP output from Tika, you need something 
like the XMP library to manage and serialize the data, because it would be too 
painful to write a decent XML/RDF serializer again.

Regards
Jörg
