RE: Metadata situation and XMP support in Tika

Joerg Ehrlich Fri, 13 Apr 2012 06:12:59 -0700

Hi Ray,

Yes, that is pretty much what I would propose. Aliasing is one idea, or you 
could simply have lists like the ones at the end of the IPTC class, which simply 
reference the namespace properties. I don't have a strong opinion here.
And I'm with you that I don't really see the benefit of including all 
interfaces in the Metadata class. I think it would be clearer if 
parsers/clients used the namespace or standard properties explicitly 
instead of the Metadata ones. But your idea of having a set of "standard" 
properties available in the Metadata class would be a good help for clients who 
don't care which "title" or "author" they read. They could just say 
"Metadata.title" instead of "DublinCore.title".
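
To make the "Metadata.title" idea concrete, here is a minimal sketch. It is not Tika's actual API: DublinCoreKeys, GenericKeys and the plain String keys are hypothetical stand-ins, and the metadata map is just a HashMap. It only shows that a parser can write with the explicit namespace key while a client reads through the generic alias.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical namespace interface: keys carry their namespace prefix.
interface DublinCoreKeys {
    String TITLE = "dc:title";
    String CREATOR = "dc:creator";
}

// Hypothetical "standard" set for the generic metadata class:
// pure aliases, no new property definitions.
interface GenericKeys {
    String TITLE = DublinCoreKeys.TITLE;
    String AUTHOR = DublinCoreKeys.CREATOR;
}

class GenericAliasSketch {
    // What a parser might do: store values under the explicit namespace keys.
    static Map<String, String> parse() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(DublinCoreKeys.TITLE, "Quarterly report");
        metadata.put(DublinCoreKeys.CREATOR, "J. Doe");
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = parse();
        // A client that does not care which "title" it reads uses the alias.
        System.out.println(metadata.get(GenericKeys.TITLE));
        System.out.println(metadata.get(GenericKeys.AUTHOR));
    }
}
```

Because the alias is the same key as the namespace property, nothing is stored twice; the generic name is only a second way to reach the same entry.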


Regards
Jörg


From: Ray Gauss II [mailto:ray.ga...@alfresco.com] 

For the IPTC example specifically, all properties are defined using their 
respective namespaces, but some are defined 'inline' while others are an alias 
to the referenced standard, i.e.

   Property KEYWORDS = DublinCore.DC_SUBJECT; 

If I'm understanding you correctly your proposal is to do that same aliasing 
for all the IPTC properties by creating new IptcCore, IptcExt, Photoshop, Plus, 
and XmpRights metadata interfaces that contain the full set of properties under 
those standards and simply referencing them from IPTC?

If so, I'm on board, and that's the direction I've wanted to take things.  I'd 
go so far as to say the Tika Metadata interface itself should cherry pick 
properties from other standards using that same aliasing approach rather than 
attempting to include the entire standard via implements which can obviously 
lead to name conflicts without prefixing the properties.
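
A rough sketch of that aliasing approach, under stated assumptions: Prop is a hypothetical stand-in for Tika's Property class, and DcNamespace, PhotoshopNamespace and IptcSketch are illustrative names, not existing interfaces. The namespace interfaces hold the prefixed definitions, and the standard only references the ones it uses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a property: just a namespace-prefixed name.
final class Prop {
    final String name;
    private Prop(String name) { this.name = name; }
    static Prop of(String prefix, String local) { return new Prop(prefix + ":" + local); }
}

// Namespace interfaces: each defines the properties of exactly one namespace.
interface DcNamespace {
    Prop SOURCE = Prop.of("dc", "source");
    Prop SUBJECT = Prop.of("dc", "subject");
}

interface PhotoshopNamespace {
    Prop SOURCE = Prop.of("photoshop", "Source");
    Prop HEADLINE = Prop.of("photoshop", "Headline");
}

// Standard interface: cherry-picks by alias and lists what it uses.
interface IptcSketch {
    Prop KEYWORDS = DcNamespace.SUBJECT;
    Prop HEADLINE = PhotoshopNamespace.HEADLINE;
    Prop SOURCE = PhotoshopNamespace.SOURCE;
    List<Prop> PROPERTIES =
            Collections.unmodifiableList(Arrays.asList(KEYWORDS, HEADLINE, SOURCE));
}

// By contrast, a type declared as "implements DcNamespace, PhotoshopNamespace"
// inherits two fields named SOURCE, and any unqualified use of SOURCE inside it
// is rejected by the compiler as ambiguous.

class AliasingSketch {
    public static void main(String[] args) {
        for (Prop p : IptcSketch.PROPERTIES) {
            System.out.println(p.name);
        }
    }
}
```

The alias is the same object as the namespace property, so a value written under DcNamespace.SUBJECT is the value a client reads under IptcSketch.KEYWORDS.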


On Apr 13, 2012, at 8:32 AM, Joerg Ehrlich wrote:

> Hi,
> 
> Looking at the current constants defined for the Metadata map, the interfaces 
> do not follow a common pattern.
> They are organized in interfaces for specific namespaces like DublinCore or 
> XMPDM, there are interfaces for specific standards like IPTC or 
> CreativeCommons and there are interfaces for a specific functional purpose 
> like Geographic or MSOffice. There are also namespace interfaces that mix 
> properties from different namespaces, e.g. TIFF. 
> Overall, a clear separation of responsibilities and semantics is not always 
> ensured. 
> 
> I would propose to reorganize and rename the interfaces in two groups: First 
> in namespaces and second in standards which simply contain lists with the 
> properties they use from the namespace interfaces.
> The reason is that only those two concepts have unambiguous and clearly 
> defined semantics where each client knows what to do with it.
> Properties which are currently not connected to a namespace (like the 
> properties from MSOffice interface) would also be moved to a namespace.
> 
> Old property definitions should be kept intact, of course, for existing 
> clients, but that is independent of the internal interface organization.
> 
> For example the IPTC standard uses properties from six different namespaces 
> (dc, photoshop, plus, iptc_core, iptc_ext, xmp_rights), but not all of the 
> properties that are defined in those namespaces. 
> I think it would make sense to have in this case six interfaces for the 
> namespaces which contain all properties from those namespaces. And one 
> interface for IPTC itself which contains just lists of the properties they 
> use from the namespaces. The IPTC interface already has those lists, I would 
> just remove all the property definitions from it.
> A mapping of EXIF properties to XMP for example, which is defined to use five 
> namespaces (exif, exifEX, tiff, xmp and dc), can then also reuse the 
> properties defined in the namespace interfaces.
> The functional interface "Geographic", for example, I would rename to 
> "W3C_Geographic" or the like, as that clearly names the semantics, which are 
> bound to the W3C vocabulary and differ from what is meant by the 
> mapping to the EXIF namespace defined by CIPA.
> In case of the MSOffice metadata this could either be mapped to properties 
> defined in Open Document standard [1] or the Microsoft OOXML one [2].
> 
> A parser can then map properties to whichever namespaces it sees fit, or to 
> several at the same time, and the client can then decide which semantics 
> (i.e. which properties) it would like to use.
> 
> Regards
> Jörg
> 
> [1] 
> http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
> [2] 
> http://www.ecma-international.org/publications/standards/Ecma-376.htm  
> (in part 2)
> 
> ---
> Jörg Ehrlich | Computer Scientist | XMP Technology | Adobe Systems | 
> joerg.ehrl...@adobe.com | work: +49(40)306360
> 
> -----Original Message-----
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Donnerstag, 5. April 2012 16:21
> To: dev@tika.apache.org
> Subject: Re: Metadata situation and XMP support in Tika
> 
> Hi Jörg,
> 
> Great summary! I would be in favor of option #2 as well, with the caveat that 
> if we take it slow, I think there might be a way to not really have as much 
> of a client/API impact, using deprecations and other techniques as you 
> suggested.
> 
> Looking forward to your participation!
> 
> Cheers,
> Chris
> 
> On Apr 5, 2012, at 5:58 AM, Joerg Ehrlich wrote:
> 
>> Hi everyone,
>> 
>> I am an engineer in the XMP/Metadata team at Adobe and we would like to 
>> leverage Tika in current projects for metadata extraction (and mimetype 
>> detection).
>> Our current systems primarily use the XMP data model to manage and interact 
>> with metadata.
>> As far as I can see, the support for the XMP data model and also for 
>> standard metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is 
>> pretty suboptimal as of today.
>> But instead of wrapping Tika in our own layers of code in our systems, we 
>> feel that it would be more useful to contribute to the project going 
>> forward.
>> 
>> I have taken a deeper look at Tika and at how to improve its metadata/XMP 
>> output.
>> I saw that you have a bug for XMP already (TIKA-756), which I would probably 
>> use to submit any patches related to that.
>> But I am currently unsure what the best approach would be to do the mapping 
>> to XMP and I would like to hear your opinion on it before starting any work.
>> 
>> Let me quickly summarize if I have understood the basic metadata concept 
>> correctly:
>> 
>> 1. Each parser fills a Metadata map, which is a simple key-value list 
>> where values can also be multi-valued.
>> 
>> 2. Mostly, the keys for the Metadata map are taken from fixed lists 
>> which are defined as interfaces in the Metadata class.
>> 
>> 3. Those keys are usually Property objects, where the Property class 
>> also serves as a static list which registers every property that is created 
>> in the Metadata interfaces. This Property class resembles the XMP data model 
>> to some extent but does not store, e.g., any hierarchical information. And it 
>> leaves every client the choice to store property names with prefixes or not.
>> 
>> 4. Any metadata outputter just iterates over the Metadata map and 
>> could query the Property list for additional information.
>> 
>> 5. In case of the XMP outputter (XMPContentHandler), only those 
>> properties are output which are stored with a prefix in the Property list.
>> 
>> 
>> I see two potential ways to improve the situation:
>> 
>> 
>> 1. Have a fixed mapping table for each mime type which would be used 
>> in XMPContentHandler to map from the Metadata map to the XMP data model. 
>> Such mapping tables would be pretty ugly as each parser produces different 
>> metadata maps and there is no consistent way to handle them. This option 
>> would be least invasive for other clients of Tika but would also be a real 
>> hack and would not really improve the metadata situation in Tika in general.
>> 
>> 2. Try to improve the key interface lists of the Metadata class and adjust 
>> all parsers accordingly. This could be done by adding new keys with prefixes 
>> and keeping/deprecating the existing ones to not disturb existing clients, 
>> similar to what is proposed for the DublinCore namespace in TIKA-859 and 
>> TIKA-842.
>> This would be more invasive but would offer the opportunity to really 
>> improve the metadata situation. I already saw a couple of places in the code 
>> that clearly break existing standards. But there are also examples where 
>> mapping might have to be done to different properties at the same time: If 
>> you look at the mapping of GPS data from Exif, this is currently mapped to 
>> the W3C vocabulary in Tika. But in XMP this mapping is defined differently 
>> [1] by CIPA (the EXIF standardization committee). So probably both mappings 
>> have to be supported.
>> 
>> I personally would prefer option two. What do you think?
>> Looking forward to working with you guys.
>> Regards
>> Jörg
>> 
>> [1]
>> http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html
>> 
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department University of 
> Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
