Hi Antoni,

> The roadmap doesn't give much detail about the intended vocabularies.
> Dublin core is great, but what else? Joerg? What other kinds of metadata
> information would you like to extract with Tika, and what vocabularies would
> you like to use to express them?
>
> At Adobe, you'll likely want Tika to transparently get the XMP metadata from
> the docs (using whatever vocabularies you use to express whatever info you
> need) into your metadata-processing software, that already "understands" the
> semantics of those XMP properties and values. What data would you like to
> have Tika transform to common vocabularies and what vocabularies will that be?

Your description of how we handle metadata at Adobe is correct.

Regarding the intended vocabularies, I think we have to distinguish between "common", file-format-neutral metadata and data that is specific to a given format or purpose. For the common metadata, the proposal was to use the vocabulary defined in the ISO XMP specification, Part One, section 8 (see [1]). That vocabulary is essentially Dublin Core with additional elements from IPTC and the Adobe Media Management namespace.

Apart from the core properties, the general idea is to extract as much metadata from resources as possible. Which vocabulary that data is mapped to really depends on the use case (i.e. file format and purpose), I think. Here a pragmatic approach that uses established standards wherever possible is preferable. Unfortunately, the established standards often overlap or define contradictory mappings, and that's where the pragmatic aspect comes into play :)

The Media Annotation Working Group [2] has made a good attempt at a vocabulary into which all sorts of information can be mapped, but unfortunately they leave out a lot of information a developer needs to actually use it. From my point of view, a more usable recommendation, at least for common image formats, has been defined by the Metadata Working Group [3].

Does this answer your questions?

Regards
Jörg

[1] http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=57421
[1b] http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart1.pdf
[2] http://www.w3.org/TR/2012/REC-mediaont-10-20120209/
[3] http://metadataworkinggroup.com/