RE: Metadata situation and XMP support in Tika

Joerg Ehrlich Fri, 13 Apr 2012 06:14:37 -0700

Hi Ray,

Using ExifTool as an external parser is a good idea.
Currently at Adobe we also use the XMPFiles C++ library in our Java projects to 
read/write metadata, although not as a Tika parser (yet). But that is one idea 
for the future.
And yes, we should definitely coordinate on the metadata enhancements :)


Jörg

-----Original Message-----
From: Ray Gauss II [mailto:ray.ga...@alfresco.com] 
Sent: Wednesday, April 11, 2012 00:04
To: dev@tika.apache.org
Subject: Re: Metadata situation and XMP support in Tika

Hi Jörg,

As you've seen from TIKA-859 and TIKA-842 I've had to deal with similar issues.

TIKA-774 depends on those issues and itself contains another mapping that 
converts the data output by ExifTool to the proper IPTC metadata defined in 
TIKA-842.

The code for the ExifTool parser is now at 
https://github.com/Alfresco/tika-exiftool, and that mapping specifically is at:

https://github.com/Alfresco/tika-exiftool/blob/master/src/main/java/org/apache/tika/parser/exiftool/ExiftoolTikaIptcMapper.java

I'm more than happy to coordinate with you on the XMP stuff going forward if 
you'd like.

Ray Gauss II
DAM Architect, Alfresco

On Apr 5, 2012, at 8:58 AM, Joerg Ehrlich wrote:

> Hi everyone,
> 
> I am an engineer in the XMP/Metadata team at Adobe and we would like to 
> leverage Tika in current projects for metadata extraction (and mimetype 
> detection).
> Our current systems primarily use the XMP data model to manage and interact 
> with metadata.
> As far as I can see, the support for the XMP data model and also for standard 
> metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is pretty 
> suboptimal as of today.
> But instead of wrapping Tika in our own layers of code in our systems, we 
> feel that it would be more useful to contribute to the project going forward.
> 
> I have taken a closer look at Tika and at how to improve its metadata/XMP 
> output.
> I saw that you have a bug for XMP already (TIKA-756), which I would probably 
> use to submit any patches related to that.
> But I am currently unsure what the best approach would be to do the mapping 
> to XMP and I would like to hear your opinion on it before starting any work.
> 
> Let me quickly summarize the basic metadata concept to check that I have 
> understood it correctly:
> 
> 1.       Each parser fills a Metadata map which is a simple key-value list 
> where values can also be multi-values
> 
> 2.       Mostly the keys for the Metadata map are taken from fixed lists 
> which are defined as interfaces in the Metadata class
> 
> 3.       Those keys are usually Property objects, where the Property class 
> also serves as a static list which registers every property that is created 
> in the Metadata interfaces. This Property class resembles the XMP data model 
> to some extent but does not store, for example, any hierarchical information. 
> It also leaves every client the choice to store property names with prefixes 
> or not.
> 
> 4.       Any metadata outputter just iterates over the Metadata map and could 
> query the Property list for additional information.
> 
> 5.       In case of the XMP outputter (XMPContentHandler), only those 
> properties that are stored with a prefix in the Property list are output.
> 
> 
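A minimal sketch of the model summarized in points 1 to 5, written with plain Java collections rather than Tika's own classes. The names MetadataModelSketch, add, get and prefixedOnly are hypothetical, and treating any key that contains a colon as "stored with a prefix" is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the metadata model described above; not Tika's actual classes.
class MetadataModelSketch {
    // Key -> list of values, so a single key can hold multiple values (point 1).
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value to the key, creating the value list on first use.
    public void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns all values for the key, or an empty list if the key is unknown.
    public List<String> get(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }

    // Mimics the behaviour in point 5: only keys of the form "prefix:name" are emitted.
    // Assumption: a colon after the first character marks a prefixed key.
    public Map<String, List<String>> prefixedOnly() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : values.entrySet()) {
            if (e.getKey().indexOf(':') > 0) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        MetadataModelSketch m = new MetadataModelSketch();
        m.add("dc:subject", "harbour");
        m.add("dc:subject", "sunset");
        m.add("Author", "Jane Doe"); // unprefixed key, dropped by prefixedOnly()
        System.out.println(m.prefixedOnly());
    }
}
```

In this sketch the unprefixed "Author" entry never reaches the prefixed output, which is the kind of loss that point 5 describes for the XMP output.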
> I see two potential ways to improve the situation:
> 
> 
> 1.       Have a fixed mapping table for each mime type which would be used in 
> XMPContentHandler to map from the Metadata map to the XMP data model. Such 
> mapping tables would be pretty ugly as each parser produces different 
> metadata maps and there is no consistent way to handle them. This option 
> would be least invasive for other clients of Tika but would also be a real 
> hack and would not really improve the metadata situation in Tika in general.
> 
> 2.       Try to improve the key interface lists of the Metadata class and 
> adjust all parsers accordingly. This could be done by adding new keys with 
> prefixes and keeping/deprecating the existing ones so as not to disturb 
> existing clients. This is similar to what is proposed for the DublinCore 
> namespace in TIKA-859 and TIKA-842.
> This would be more invasive but would offer the opportunity to really improve 
> the metadata situation. I already saw a couple of places in the code that 
> clearly break existing standards. But there are also examples where mapping 
> might have to be done to different properties at the same time: If you look 
> at the mapping of GPS data from Exif, this is currently mapped to W3C 
> vocabulary in Tika. But in XMP this mapping is defined otherwise [1] by CIPA 
> (the EXIF standardization committee). So probably both mappings have to be 
> supported.
> 
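A rough sketch of what option two could look like in practice: a legacy key is kept next to a new prefixed key, and one source field is written to two vocabularies at once. Everything here is hypothetical, including the class name DualMappingSketch and the key names ("geo:lat", "exif:GPSLatitude", "dc:creator"), which are illustrative only. The sketch copies values unchanged and ignores any value-format differences between the vocabularies.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "option two"; key names are illustrative, not Tika's actual constants.
class DualMappingSketch {
    // One source field may fan out to several target keys.
    static final Map<String, List<String>> TARGETS = new LinkedHashMap<>();
    static {
        // The legacy unprefixed key is kept next to a new prefixed key.
        TARGETS.put("Author", Arrays.asList("Author", "dc:creator"));
        // GPS latitude is written under both vocabularies.
        TARGETS.put("GPSLatitude", Arrays.asList("geo:lat", "exif:GPSLatitude"));
    }

    // Copies each source entry to all of its target keys; unmapped keys pass through as-is.
    public static Map<String, String> map(Map<String, String> source) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            List<String> keys = TARGETS.get(e.getKey());
            if (keys == null) {
                out.put(e.getKey(), e.getValue());
                continue;
            }
            for (String k : keys) {
                out.put(k, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("Author", "Jane Doe");
        source.put("GPSLatitude", "48.1");
        System.out.println(map(source));
    }
}
```

Existing clients that read "Author" keep working in this sketch, while an XMP-oriented client can read the prefixed keys instead.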
> I personally would prefer option two. What do you think?
> Looking forward to working with you guys.
> Regards
> Jörg
> 
> [1] http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html
>
