Hi Chris,

Those are all valid points, and I agree that you could do everything with a 
HashMap. 
Having the parsers fill the Metadata class and its HashMap with all the needed 
information, which is then consumed by an XMP component sitting on top of 
Tika-Core, is definitely an interesting solution. It would keep Tika-Core 
clean of any dependencies and make it possible to introduce new XMP-related 
APIs in the least intrusive way.
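As a rough sketch of that layering (all names here are invented for illustration; none of this is real Tika or Adobe XMP library API), such a component could consume the plain map the parsers filled and serialize it afterwards:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical adapter sitting on top of a parser-filled metadata map,
// keeping the core free of any XMP dependency. Illustrative only.
class XmpAdapter {
    private final Map<String, String> properties;

    XmpAdapter(Map<String, String> properties) {
        this.properties = properties;
    }

    // Serialize the map into a minimal XMP-style rdf:Description fragment.
    // A real implementation would use a proper XML serializer and declare
    // namespace prefixes; this only illustrates the layering idea.
    String toXmp() {
        StringBuilder sb = new StringBuilder("<rdf:Description>\n");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            sb.append("  <").append(e.getKey()).append('>')
              .append(e.getValue())
              .append("</").append(e.getKey()).append(">\n");
        }
        return sb.append("</rdf:Description>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        parsed.put("dc:title", "Example Document");
        System.out.println(new XmpAdapter(parsed).toXmp());
    }
}
```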
But from my point of view it is also a question of how much time and effort you 
would like to spend implementing and testing code in the Metadata class when 
something tested and stable is already available for exactly that purpose. 
Another thought that just came to mind: a lot of file formats already use XMP 
as one metadata container, or even the only one. With a map-based model you 
would end up filling the metadata map with the data from the file's XMP and 
converting it back to XMP later on, compared to just parsing it as is and 
having most of the metadata available right away. 

Good discussion :)
Regards
Jörg

---
Jörg Ehrlich | Computer Scientist | XMP Technology | Adobe Systems | 
joerg.ehrl...@adobe.com | work: +49(40)306360

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Wednesday, 25 April 2012 22:40
To: <dev@tika.apache.org>
Subject: Re: [metadata] roadmap proposal available on the wiki

Hi Jörg,

On Apr 25, 2012, at 10:27 AM, Joerg Ehrlich wrote:
> 
>> I am not strongly supportive of changing out the HashMap internal 
>> representation in Metadata.
>> A couple of things I like about the HashMap:
>> 
>> * It's simple.
>> * It doesn't require dependency on any external libraries and helps keep 
>> tika-core minimal.
>> 
>> Wouldn't it be possible for example to simply have XMP be something that 
>> sits on top of the Metadata object?
> 
> There are definitely a lot of different ways we could implement the metadata 
> handling in Tika.
> But having XMP as the underlying data model of whatever implementation we are 
> going to choose, has the following rationale from my perspective:
> 
> Right now Tika is just providing a limited, common set of metadata for all 
> supported file formats. We already said that this is fine and should stay 
> that way, which also means clients can continue to use the current simple 
> API. But there are clients and use cases which would like to have access to 
> more than the currently supported limited set of metadata and also have 
> semantic information travelling with it (i.e. Namespace information). 

Agreed, that's fine by me too; I'm +1 to support those clients.

> One example of interest in extended metadata is video workflows, where you 
> see more and more temporal metadata being used which is quite structured and 
> complex (compared to a simple property like a title). The same is true for 
> face recognition metadata that all current image applications (and also 
> camera devices) are already writing into the assets. 
> An example of the importance of semantic information: at Adobe we already 
> have to worry about something as simple as the "creation date". It could be 
> the date the asset was written to the hard disk I am currently looking at, 
> the date the original creator wrote it to his hard disk, the date the 
> artwork was digitized (i.e. scanned), or the date the work shown in the 
> digital image was created. Namespaces provide that information. 
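The creation-date ambiguity described above can be made concrete with namespace-qualified keys. The property names below are real XMP properties, but the timestamp values and the plain-map representation are just an illustrative sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Three distinct "creation dates", disambiguated by their XMP namespace
// prefix; the values are invented for illustration.
class DateSemantics {
    static Map<String, String> exampleDates() {
        Map<String, String> dates = new LinkedHashMap<>();
        dates.put("xmp:CreateDate", "2012-04-20T09:15:00Z");  // resource was created
        dates.put("xmp:ModifyDate", "2012-04-25T18:02:00Z");  // last written to disk
        dates.put("photoshop:DateCreated", "2011-11-03");     // intellectual content created
        return dates;
    }

    public static void main(String[] args) {
        // Without the namespace prefixes, all of these would collapse
        // into one ambiguous "creation date" entry.
        exampleDates().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```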

Yep agreed. We have the same issues at NASA too for all sorts of planetary, 
Earth science, astrophysics, and other data :) Metadata is super important, and 
Tika's support for it is definitely basic at best. It needs to be improved.

> Oh, and copyright information is also pretty sensitive when it comes 
> to semantics :) A namespace registry as provided by the XMP library is 
> pretty handy in this case, because storing information in just prefixes is 
> easy, but also dangerous, as they are just variables.
> 
> I would argue that it is difficult to store such data faithfully with a 
> simple HashMap. And having two data models storing data is pretty error-prone.

I'm not convinced that it's difficult to store data faithfully in a hash map. 
You can encode all sorts of information in field keys (including namespaces). 
We discussed this a long time ago in Tika (I think Bertrand reported it when 
we were in the Incubator):

https://issues.apache.org/jira/browse/TIKA-61

The discussion then was that XMP would be something that we could use to help 
drive it, but I'm just saying I don't think it's the HashMap that's the 
limitation here.
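For example, a plain map can carry namespace information by qualifying its keys with the full schema URI. A minimal sketch, not any real Tika API:

```java
import java.util.HashMap;
import java.util.Map;

// Encoding the namespace in the field key itself, using the full URI.
class NamespacedKeys {
    // Dublin Core element-set namespace, used to qualify plain map keys.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    static Map<String, String> example() {
        Map<String, String> metadata = new HashMap<>();
        // Full-URI keys stay unambiguous even if two schemas reuse a
        // property name such as "creator" or "date".
        metadata.put(DC + "creator", "Jane Doe");
        metadata.put(DC + "title", "Example");
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(example().get(DC + "creator"));
    }
}
```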

Why couldn't we simply add a new module at the tika-* level, called tika-xmp? 
At the very least, it would be the least intrusive way of exploring some of 
these ideas. 
> 
> The XMP library would add a dependency and size to Tika-Core, but it is 
> really just the data model and a parser/serializer for the XML/RDF, so the 
> footprint is small.
> Additionally, if you want to provide XMP output from Tika, you need 
> something like the XMP library to manage and serialize the data, because it 
> would be too painful to write a decent XML/RDF serializer again.

I'm +1 to try out some of these ideas, but think that doing it in a tika-xmp 
module at the same level as e.g., tika-core, tika-server, etc., might be less 
intrusive.

Thanks for discussing this with me.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department University of Southern 
California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
