Re: Metadata use by Apache Java projects

Jeremias Maerki Tue, 20 Nov 2007 00:06:37 -0800

Hi Chris

On 19.11.2007 18:27:56 Chris Mattmann wrote:
> Hi Folks,
>  
> >> Sanselan and Tika have both chosen a very simple approach but is it
> >> versatile enough for the future? While the simple Map<String, String[]> in
> >> Tika allows for multiple authors, for example, it doesn't support
> >> language alternatives for things such as dc:title or dc:description.
> > 
> > IMHO it would be good to have a more flexible metadata model in Tika.
> > Better yet if it's a standard used across multiple projects. Best if
> > we don't need to implement it in Tika. :-)
> 
> I'm not quite sure I understand how Tika's metadata model isn't flexible
> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> and haven't been able to. I think it's important to realize that a balance
> must be struck between over-bloating a metadata library (and attaching on
> RDF support, inference, synonym support, etc.) and making sure that the
> smallest subset of it is actually useful.


I'm sorry. I didn't intend to step on anyone's toes.

At any rate, I'm not talking about full RDF support. I'm talking about
XMP, which uses only a subset of RDF.

> Also, I'd be against moving Metadata support out of Tika because that was
> one of the project's original goals (Metadata support), and I think it's
> advantageous for Tika to be a provider for a Metadata capability (of course,
> one related to document/content extraction).

Metadata capability in the context of content extraction, certainly yes.
Nobody disputes that. But other projects have different needs (like
embedding metadata). So in all this there are certain common needs and
I'm trying to see if we can find a common ground in the form of a
uniform way of manipulating and storing metadata in memory while at the
same time working off a freely available standard.

> I'm wondering too what it means that Tika doesn't support "language
> alternatives"? Do you mean synonyms?

Frankly, I don't know whether those are synonyms. Maybe they are in RDF
terminology. The XMP spec talks about "property qualifiers" of which
"language alternatives" (using xml:lang) are a special case. The easiest
way to explain is by example:

<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:creator>
        <rdf:Seq>
          <rdf:li>John Doe</rdf:li>
          <rdf:li>Jane Doe</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Manual</rdf:li>
          <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
          <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:date>2006-06-02T10:36:40+02:00</dc:date>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

You can see that the title is available in three languages. The example
also shows the case with multiple authors.
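
To make the gap concrete, here is a minimal sketch of what a flat
Map<String, String[]> can and cannot hold for the dc:title in the XMP
example above, next to a language-keyed map with an x-default fallback.
The class and method names (LangAltSketch, getLocalized, exampleTitle)
are hypothetical and exist only for illustration. This is not the API of
Tika, Sanselan or any XMP toolkit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not the API of Tika, Sanselan, or any XMP toolkit.
public class LangAltSketch {

    // XMP's reserved language code for the default alternative.
    static final String X_DEFAULT = "x-default";

    // Returns the value for the requested language, falling back to
    // x-default. Returns null if neither entry is present.
    static String getLocalized(Map<String, String> alternatives, String lang) {
        String value = alternatives.get(lang);
        return value != null ? value : alternatives.get(X_DEFAULT);
    }

    // Builds the dc:title alternatives from the XMP example (language -> text).
    static Map<String, String> exampleTitle() {
        Map<String, String> title = new LinkedHashMap<String, String>();
        title.put(X_DEFAULT, "Manual");
        title.put("de", "Bedienungsanleitung");
        title.put("fr", "Mode d'emploi");
        return title;
    }

    public static void main(String[] args) {
        // Flat model: all three titles fit into a String[], but nothing
        // records which language each entry belongs to.
        Map<String, String[]> flat = new LinkedHashMap<String, String[]>();
        flat.put("dc:title",
                new String[] {"Manual", "Bedienungsanleitung", "Mode d'emploi"});
        System.out.println(flat.get("dc:title").length + " titles, languages unknown");

        // Language-keyed model: the lookup can honor the requested language.
        Map<String, String> title = exampleTitle();
        System.out.println(getLocalized(title, "de")); // prints "Bedienungsanleitung"
        System.out.println(getLocalized(title, "it")); // no "it" entry, prints "Manual"
    }
}
```

The flat map could of course encode the language into the key or the
value by convention, but then every consumer has to know that convention.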

To access the title using Adobe's XMP toolkit you'd do the following:

XMPMeta meta = XMPMetaFactory.parse(in);
String s;

//Get default title
s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);

//Get title in user language if available
String userLang = System.getProperty("user.language");
s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);

Easy, isn't it? :-) That's the generic access to properties as Adobe's
XMP toolkit provides it. But it can also be useful to have concrete
adapters for easier use and higher type-safety. Here's what I do in XML
Graphics Commons at the moment:

Metadata meta = XMPParser.parseXMP(url);
DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
String s;
s = dc.getTitle();
String userLang = System.getProperty("user.language");
s = dc.getTitle(userLang);

(Obviously, the same could be done for Adobe's XMP toolkit.)
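
For illustration only, here is a rough sketch of how such a typed adapter
could sit on top of a generic property store. All names in it
(AdapterSketch, GenericMetadata, DublinCoreView, exampleMetadata) are
made up for this example and do not mirror the actual classes in XML
Graphics Commons or Adobe's XMP toolkit. The store is assumed to key
properties by namespace plus local name and to keep one value per
language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the names do not mirror XML Graphics Commons
// or Adobe's XMP toolkit.
public class AdapterSketch {

    static final String NS_DC = "http://purl.org/dc/elements/1.1/";
    static final String X_DEFAULT = "x-default";

    // Generic store: (namespace + property name) -> language -> value.
    static class GenericMetadata {
        private final Map<String, Map<String, String>> props =
                new HashMap<String, Map<String, String>>();

        void setLocalizedText(String ns, String name, String lang, String value) {
            String key = ns + name;
            Map<String, String> alts = props.get(key);
            if (alts == null) {
                alts = new HashMap<String, String>();
                props.put(key, alts);
            }
            alts.put(lang, value);
        }

        // Returns the text for lang, else the x-default text, else null.
        String getLocalizedText(String ns, String name, String lang) {
            Map<String, String> alts = props.get(ns + name);
            if (alts == null) {
                return null;
            }
            String value = alts.get(lang);
            return value != null ? value : alts.get(X_DEFAULT);
        }
    }

    // Typed view for Dublin Core: callers no longer pass namespace
    // URIs or property names as strings.
    static class DublinCoreView {
        private final GenericMetadata meta;

        DublinCoreView(GenericMetadata meta) {
            this.meta = meta;
        }

        String getTitle() {
            return getTitle(X_DEFAULT);
        }

        String getTitle(String lang) {
            return meta.getLocalizedText(NS_DC, "title", lang);
        }
    }

    // Fills a store with the three titles from the XMP example.
    static GenericMetadata exampleMetadata() {
        GenericMetadata meta = new GenericMetadata();
        meta.setLocalizedText(NS_DC, "title", X_DEFAULT, "Manual");
        meta.setLocalizedText(NS_DC, "title", "de", "Bedienungsanleitung");
        meta.setLocalizedText(NS_DC, "title", "fr", "Mode d'emploi");
        return meta;
    }

    public static void main(String[] args) {
        DublinCoreView dc = new DublinCoreView(exampleMetadata());
        System.out.println(dc.getTitle());     // prints "Manual"
        System.out.println(dc.getTitle("fr")); // prints "Mode d'emploi"
    }
}
```

The adapter adds no storage of its own, so several schema-specific views
could share one generic store.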

> Also, you mention it's relatively easy
> in other libraries to map between different file format metadata. I think
> that this is fairly easy to do in Tika too, seeing as though its primary
> purpose is to support metadata extraction from different file formats.

No argument there. I don't claim I know all the requirements and use
cases of Tika. But I would imagine it's important to preserve as much
metadata as possible. XMP is certainly one of the best containers I've
seen to achieve that goal.

> > 
> >> My questions:
> >> - Any interest in converging on a unified model/approach?
> > 
> > Certainly.
> 
> +1
> 
> > 
> >> - If yes, where shall we develop this? As part of Tika (although it's
> >> still in incubation)? As a separate project (maybe as Apache Commons
> >> subproject)? If more than XML Graphics uses this, XML Graphics is
> >> probably not the right home.
> >> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> >> the JempBox or XML Graphics Commons approach more interesting?
> > 
> > If there already exists acceptably licensed good code outside the ASF,
> > then I would prefer using that instead of reinventing the wheel within
> > the foundation.
> 
> I'm not sure we're "re-inventing the wheel" here Jukka. Tika's Metadata
> framework began in Nutch, and at the time based on a short survey that
> Jerome Charron and I undertook, there was no easy-to-use, Metadata library
> framework, that met the needs of the types of things done in Nutch/Tika --
> document extraction of metadata from large corpuses, supporting many values
> for keys: mapping between keys, etc. So, in my mind, we're definitely not
> re-inventing any wheel and the framework was born more out of need/ease of
> use than anything else.
> 
> In any case, the use of a common framework is a good one to discuss and I'm
> open to it. So long as people like me can better understand the gaps in the
> current Tika Metadata framework and the benefits of addressing those gaps to
> all the projects that would need it.


Jeremias Maerki
