Hi Chris,

Thanks for your comments.

>I am not strongly supportive of changing the HashMap internal
>representation in Metadata out.
>A couple of things I like about the HashMap:
>
>* It's simple.
>* It doesn't require dependency on any external libraries and helps keep
>tika-core minimal.
>
>Wouldn't it be possible for example to simply have XMP be something that sits
>on top of the Metadata object?

There are definitely a lot of different ways we could implement the metadata handling in Tika. But from my perspective, using XMP as the underlying data model of whatever implementation we choose has the following rationale:

Right now Tika provides only a limited, common set of metadata for all supported file formats. We already said that this is fine and should stay that way, which also means clients can continue to use the current simple API. But there are clients and use cases that need access to more than this limited set, and that need semantic information (i.e. namespace information) to travel with the metadata.

One example of the demand for extended metadata is video workflows, where temporal metadata is used more and more. It is quite structured and complex compared to a simple property like a title. The same is true for the face recognition metadata that all current image applications (and camera devices) already write into the assets.

An example of the importance of semantic information: at Adobe we already have to worry about something as simple as the "creation date". It could be the date the asset was written to the hard disk I am currently looking at, the date the original creator wrote it to their hard disk, the date the artwork was digitized (i.e. scanned), or the date the work shown in the digital image was created. Namespaces provide that information.
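
To make the creation-date point concrete, here is a minimal Java sketch. It is not Tika or XMP library code: the class and method names are hypothetical, the date values are made up, and the nested map only stands in for a real namespace-aware data model. The two URIs are the XMP basic and EXIF namespaces.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Tika or the XMP library.
public class NamespacedMetadataSketch {

    // Namespace URIs for the XMP basic and EXIF schemas.
    public static final String NS_XMP = "http://ns.adobe.com/xap/1.0/";
    public static final String NS_EXIF = "http://ns.adobe.com/exif/1.0/";

    /** Flat model: a single key, so the second "creation date" overwrites the first. */
    public static Map<String, String> flat() {
        Map<String, String> m = new HashMap<>();
        m.put("creation date", "2010-06-01T09:30:00"); // when this file was written (made-up value)
        m.put("creation date", "1998-03-14T12:00:00"); // when the original was captured (made-up value)
        return m;
    }

    /** Qualified model: namespace URI -> property name -> value, so both dates survive. */
    public static Map<String, Map<String, String>> qualified() {
        Map<String, Map<String, String>> m = new LinkedHashMap<>();
        m.computeIfAbsent(NS_XMP, k -> new LinkedHashMap<>())
         .put("CreateDate", "2010-06-01T09:30:00");
        m.computeIfAbsent(NS_EXIF, k -> new LinkedHashMap<>())
         .put("DateTimeOriginal", "1998-03-14T12:00:00");
        return m;
    }

    public static void main(String[] args) {
        System.out.println(flat());      // only one date is left
        System.out.println(qualified()); // both dates, each with its own meaning
    }
}
```

In the flat map the reader cannot tell which "creation date" was kept; in the qualified map each value carries its namespace, and with it its meaning.
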
Oh, and copyright information is also pretty sensitive when it comes to semantics :)

A namespace registry, as provided by the XMP library, is pretty handy in this case: storing information under just a prefix is easy but also dangerous, because prefixes are just variables. I would argue that it is difficult to store such data faithfully in a simple HashMap. And having two data models storing the same data is pretty error-prone.

The XMP library would add a dependency and some size to tika-core, but it is really just the data model plus a parser/serializer for the XML/RDF, so the footprint is small. Additionally, if you want to provide XMP output from Tika, you need something like the XMP library to manage and serialize the data, because it would be too painful to write a decent XML/RDF serializer again.

Regards,
Jörg