Hi Jörg,

On Apr 25, 2012, at 10:27 AM, Joerg Ehrlich wrote:
> 
>> I am not strongly supportive of changing out the HashMap internal 
>> representation in Metadata.
>> A couple of things I like about the HashMap:
>> 
>> * It's simple.
>> * It doesn't require dependency on any external libraries and helps keep 
>> tika-core minimal.
>> 
>> Wouldn't it be possible for example to simply have XMP be something that 
>> sits on top of the Metadata object?
> 
> There are definitely a lot of different ways we could implement the metadata 
> handling in Tika.
> But having XMP as the underlying data model of whatever implementation we are 
> going to choose has the following rationale from my perspective:
> 
> Right now Tika is just providing a limited, common set of metadata for all 
> supported file formats. We already said that this is fine and should stay 
> that way, which also means clients can continue to use the current simple 
> API. But there are clients and use cases which would like to have access to 
> more than the currently supported limited set of metadata and also have 
> semantic information travelling with it (i.e. Namespace information). 

Agreed, that's fine by me too, I'm +1 to support those clients.

> One example for extended metadata interest are video workflows where you see 
> more and more temporal metadata being used which is quite structured and 
> complex (compared to a simple property like a title). The same is true for 
> face recognition metadata that all current image applications (and also 
> camera devices) are already writing into the assets. 
> An example for the importance of semantic information: At Adobe we already 
> have to worry about something as simple as the "creation date". Because it 
> could be the date the asset has been written to the hard disk I am currently 
> looking at, it could be the date the original creator has written it on his hard 
> disk, it could also be the date the art work has been digitized (i.e. 
> scanned) or it could be the date the work shown on the digital image has been 
> created. Namespaces provide that information. 

Yep agreed. We have the same issues at NASA too for all sorts of planetary, 
Earth science, astrophysics, and other data :)
Metadata is super important, and Tika's support for it is definitely basic at 
best. It needs to be improved.

> Oh, and copyright information is also pretty sensitive when it comes to 
> semantics :)
> A namespace registry as it is provided by the XMP library is in this case 
> pretty handy, because storing information in just prefixes is easy, but also 
> dangerous as they are just variables.
> 
> I would argue that it is difficult to store such data faithfully with a 
> simple HashMap. And having two data models storing data is pretty error-prone.

I'm not convinced that it's difficult to store data faithfully in a hash map. 
You can encode all sorts of information in field keys (including
namespaces). We discussed this a long time ago in Tika (I think Bertrand 
reported it when we were in the Incubator):

https://issues.apache.org/jira/browse/TIKA-61

The discussion then was that XMP would be something that we could use to help 
drive it, but I'm just saying I don't think
it's the HashMap that's the limitation here.
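
To make the point about field keys concrete, here is a minimal sketch of how a plain HashMap could carry namespace information in its keys. It is only an illustration under stated assumptions: the QualifiedKeys helper and the "{namespaceURI}localName" key format are hypothetical and not part of Tika, and the date values are made up. The namespace URIs and property names are the ones published for the XMP basic, Exif, and Photoshop schemas.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
final class QualifiedKeys {
    // Namespace URIs as published for the XMP basic, Exif, and Photoshop schemas.
    static final String NS_XMP = "http://ns.adobe.com/xap/1.0/";
    static final String NS_EXIF = "http://ns.adobe.com/exif/1.0/";
    static final String NS_PHOTOSHOP = "http://ns.adobe.com/photoshop/1.0/";

    private QualifiedKeys() {}

    // Encode the full namespace URI (not a prefix) into the key:
    // "{namespaceURI}localName". This format is an assumption of the sketch.
    static String key(String namespaceUri, String localName) {
        return "{" + namespaceUri + "}" + localName;
    }

    // Returns the namespace URI of a qualified key, or "" for a plain key.
    static String namespaceOf(String key) {
        int end = key.indexOf('}');
        return (key.startsWith("{") && end > 0) ? key.substring(1, end) : "";
    }

    // Returns the local name of a qualified key, or the key itself if plain.
    static String localNameOf(String key) {
        int end = key.indexOf('}');
        return (key.startsWith("{") && end > 0) ? key.substring(end + 1) : key;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();

        // Three different "creation dates" coexist because the namespace
        // disambiguates them. The values are illustrative, not real data.
        metadata.put(key(NS_XMP, "CreateDate"), "2012-04-25T10:27:00Z");
        metadata.put(key(NS_EXIF, "DateTimeOriginal"), "2011-08-01T09:00:00Z");
        metadata.put(key(NS_PHOTOSHOP, "DateCreated"), "1998-05-17");

        // A plain, un-namespaced key still works for simple clients.
        metadata.put("title", "Sample asset");

        for (Map.Entry<String, String> e : metadata.entrySet()) {
            System.out.println(namespaceOf(e.getKey()) + " | "
                    + localNameOf(e.getKey()) + " = " + e.getValue());
        }
    }
}
```

Putting the URI itself in the key, rather than a prefix, sidesteps the "prefixes are just variables" problem raised above, at the cost of longer keys. Structured values (e.g., temporal or face-region metadata) would still need some additional convention that this sketch does not attempt.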

Why couldn't we simply add a new module at the tika-* level, called tika-xmp? 
At the very least, it would be the least
intrusive way of exploring some of these ideas.

>
> The XMP library would add a dependency and size to Tika-Core but it is really 
> just the data model and a parser/serializer for the XML/RDF, so the footprint 
> is small.
> Additionally if you want to provide XMP output from Tika you need to have 
> something like the XMP library to manage and serialize the data, because it 
> would be too painful to write a decent XML/RDF serializer again.

I'm +1 to try out some of these ideas, but think that doing it in a tika-xmp 
module at the same level as e.g., tika-core, tika-server, etc.,
might be less intrusive.
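
As a rough sketch of what "sitting on top" could look like, the class below wraps an existing flat map and adds a prefix registry, without copying the data into a second model. Everything here is an assumption for illustration: NamespaceView is a hypothetical name, a plain Map<String, String> stands in for the Metadata object, and nothing in it is the API of Tika or of the XMP library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration; not an existing Tika or XMP library API.
final class NamespaceView {
    // The existing flat metadata map. It is shared, not copied, so there is
    // still only one place where the data lives.
    private final Map<String, String> backing;
    // Registry mapping prefixes to namespace URIs.
    private final Map<String, String> prefixToUri = new HashMap<>();

    NamespaceView(Map<String, String> backing) {
        this.backing = backing;
    }

    void registerPrefix(String prefix, String namespaceUri) {
        prefixToUri.put(prefix, namespaceUri);
    }

    // Turns "prefix:local" into "{namespaceURI}local" using the registry.
    // The key format is an assumption of this sketch.
    private String resolve(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No prefix in: " + prefixedName);
        }
        String uri = prefixToUri.get(prefixedName.substring(0, colon));
        if (uri == null) {
            throw new IllegalArgumentException("Unregistered prefix in: " + prefixedName);
        }
        return "{" + uri + "}" + prefixedName.substring(colon + 1);
    }

    void set(String prefixedName, String value) {
        backing.put(resolve(prefixedName), value);
    }

    String get(String prefixedName) {
        return backing.get(resolve(prefixedName));
    }

    public static void main(String[] args) {
        Map<String, String> flat = new HashMap<>();
        flat.put("title", "Sample asset"); // simple clients keep using plain keys

        NamespaceView view = new NamespaceView(flat);
        view.registerPrefix("dc", "http://purl.org/dc/elements/1.1/");
        view.set("dc:rights", "illustrative rights statement");

        // The namespaced entry lands in the same underlying map.
        System.out.println(flat);
        System.out.println(view.get("dc:rights"));
    }
}
```

A real tika-xmp module would presumably delegate serialization to the XMP library rather than hand-roll anything like this; the sketch only shows that the layering itself does not require changing the underlying map.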

Thanks for discussing this with me.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
