Hi Antoni,

Thanks for the interesting information. Frankly, you've scared me there just a bit. It's interesting to see that such encompassing efforts are underway in some places. To me, full RDF still has a scare factor; at least the subset XMP provides is "manageable" for mere mortals. :-) That's my impression, anyway. Maybe I still just know too little about RDF. IMO, XMP strikes a good compromise between expressiveness and simplicity. The positive points for Adobe's XMP toolkit: it is in Java, available now, and under a license we can easily use in Apache projects.
In your point 4, you mention some restrictions you see for XMP. But XMP is a subset of RDF, so does RDF really restrict you from an RDF point of view? I didn't really understand that point.

We'll see how this works out.

Jeremias Maerki

On 20.11.2007 15:25:44 Antoni Mylka wrote:
> Hi Jeremias, tika-dev,
>
> My name is Antoni Mylka. I am involved in aperture.sourceforge.net,
> which addresses similar things as Tika; we got your mail on the
> tika-dev mailing list. I also work for the Nepomuk Social Semantic
> Desktop project, where I'm the maintainer of the Nepomuk Information
> Element Ontology. More below.
>
> Your mail addresses four more-or-less orthogonal issues:
>
> 1. The standardization of schemas, i.e. how the metadata should be
> represented: URIs of classes and properties.
>
> 2. The standardization of the representational language. This means
> the conventions about how to use RDF (e.g. Bags, Seqs, Alts etc.) and
> the formal semantics.
>
> 3. The standardization of the API that will work with the RDF triples
> and handle operations such as adding, deleting and querying triples
> (and maybe the inference).
>
> 4. The standardization of the RDF storage mechanisms.
>
> XMP provides its answers to all these questions, but they aren't the
> only ones. I know of at least two other standardization initiatives:
>
> 1. Freedesktop.org's XESAM project, a gathering of the major
> open-source desktop search engines:
> http://xesam.org/main
>
> 2. The Nepomuk Social Semantic Desktop Project, an EU-funded research
> project with a Semantic Web background:
> http://nepomuk.semanticdesktop.org
>
> Many of the issues you are bound to run into have already been
> recognized and some answers have been given. Naturally, the
> requirements might have been different and the solutions aren't
> optimal, but it may be interesting for you to skim through the output
> of those projects. To sum it up:
>
> 1.
> Freedesktop.org schema:
> <http://xesam.org/main/XesamOntology90>
>
> Nepomuk schema:
> <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
> Let the pointers take you from there.
> There is also an archive of discussions around the drafts of NIE
> (there have been 8 so far):
> <http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>
>
> 2.
> Freedesktop don't use any specific representational language, but they
> support property inheritance. They implement it themselves, without
> any general-purpose RDF inference.
>
> Nepomuk uses the Nepomuk Representational Language (NRL). It has been
> considered better for our purposes, since it employs more intuitive
> semantics (the so-called closed-world assumption: in normal RDF, if
> you say that the value of the nie:kisses property is a Human, and you
> write "Antoni nie:kisses Frog", you can infer that the frog is a
> human; in NRL you can't).
>
> 3.
> No one has tried to standardize the API; there are many libraries
> that work with both in-memory and persistent RDF repositories.
>
> A few pointers:
>
> There are many APIs out there:
> * jena.sourceforge.net - big API for RDF, by HP
> * www.openrdf.org - RDF API optimized for client/server setups
> * http://wiki.ontoworld.org/wiki/RDF2Go - abstraction API over the above
>
> There are also APIs generating "schema-specific adapters"; the
> well-known ones in Java are:
> * http://wiki.ontoworld.org/wiki/RDFReactor
> * elmo
> ** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
> ** http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
> * https://sommer.dev.java.net/
>
> Of the above, elmo is quite stable and advanced.
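To make the open-world vs. closed-world difference above concrete, here is a minimal plain-Java sketch. The nie:kisses/Human names come straight from Antoni's example; the class and method names are made up for illustration and don't correspond to NRL or any RDF library:

```java
import java.util.*;

// Illustrative sketch only: contrasts open-world RDFS range *inference*
// with NRL-style closed-world range *checking*. Names are hypothetical.
public class RangeSemanticsDemo {

    /** A triple is just three strings here. */
    public record Triple(String s, String p, String o) {}

    // Pretend the range of nie:kisses is declared to be Human.
    static final String RANGE_OF_KISSES = "Human";

    /** Open world: the range axiom lets us *infer* the object's type. */
    public static Set<String> openWorldTypes(List<Triple> data, String resource) {
        Set<String> types = new HashSet<>();
        for (Triple t : data) {
            if (t.s().equals(resource) && t.p().equals("rdf:type")) {
                types.add(t.o());
            }
            // Range inference: x nie:kisses y  =>  y rdf:type Human
            if (t.p().equals("nie:kisses") && t.o().equals(resource)) {
                types.add(RANGE_OF_KISSES);
            }
        }
        return types;
    }

    /** Closed world: the range axiom is a *constraint* on new statements. */
    public static boolean closedWorldValid(List<Triple> data, Triple candidate) {
        if (!candidate.p().equals("nie:kisses")) return true;
        // The object must already be asserted to be a Human.
        return data.contains(new Triple(candidate.o(), "rdf:type", RANGE_OF_KISSES));
    }

    public static void main(String[] args) {
        List<Triple> data = new ArrayList<>();
        data.add(new Triple("Frog", "rdf:type", "Animal"));
        Triple kiss = new Triple("Antoni", "nie:kisses", "Frog");
        data.add(kiss);

        // Open world: the frog is inferred to also be a Human.
        System.out.println(openWorldTypes(data, "Frog"));
        // Closed world: the statement is rejected instead.
        System.out.println(closedWorldValid(data, kiss));
    }
}
```

Same axiom, two reactions: RDFS adds a type, NRL flags an error.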
> There are murmurs of standardization of RDF APIs.
> Max Völkel (FZI, maintainer of RDF2Go), Henry Story (www.bblfish.net),
> and Leo Sauermann (DFKI, http://leobard.twoday.net) have repeatedly
> thought about starting a JSR discussion on an RDF API, but that never
> happened. The W3C may be interested in doing something like this (they
> did it for DOM, I think, and for XML, or?); the contact people would
> be the deployment group:
> http://www.w3.org/2006/07/SWD/
>
> So, to sum it up: there are many things out there handling RDF in
> Java, but nothing dominates yet as a single monopoly. In my
> surroundings (my company, aperture.sourceforge.net) we prefer to use
> RDF2Go as "the API"; it's not perfect but it seems to work quite well.
>
> 4.
> XMP prescribes that the metadata be contained within the files
> themselves. There are many scenarios where this is a limitation. Each
> application will have to maintain its indexes by itself and possibly
> use a different API to work with XMP storage (in the files) and the
> common storage (e.g. an index). There is an ongoing effort to combine
> the flexibility of RDF with the search capabilities of Lucene. Two of
> the more prominent ones are:
>
> Sesame LuceneSail
> <https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
> AFAIK there is no project page yet, but this idea has been worked on
> for at least two years now, e.g. in the gnowsis project:
> www.gnowsis.org
>
> Boca TextIndexing feature, part of the IBM SLRP:
> <http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>
>
> In our opinion, such an initiative deserves at least a separate
> mailing list. We have already been working on metadata standardization
> for some time now and would be happy to help. Chris Mattman has
> written that it's necessary to strike a balance between functionality
> and over-bloating. From my own experience I can say that it is VERY
> difficult :).
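The idea behind these RDF-plus-Lucene efforts can be sketched in a few lines of plain Java. A HashMap stands in for the Lucene index here, and all names are illustrative rather than taken from LuceneSail or Boca: triples go into a store as usual, but literal values are also tokenized into a full-text index so resources can be found by keyword.

```java
import java.util.*;

// Illustrative sketch: a triple store that side-feeds literal values
// into a toy full-text index (a real setup would use Lucene instead).
public class TextIndexedStore {
    public record Triple(String s, String p, String o) {}

    private final List<Triple> triples = new ArrayList<>();
    private final Map<String, Set<String>> textIndex = new HashMap<>();

    /** Add a triple; the object value is also tokenized into the index. */
    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
        for (String token : o.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                textIndex.computeIfAbsent(token, k -> new HashSet<>()).add(s);
            }
        }
    }

    /** Keyword search: subjects whose literals mention the word. */
    public Set<String> search(String keyword) {
        return textIndex.getOrDefault(keyword.toLowerCase(), Set.of());
    }

    /** Ordinary triple-pattern lookup is still available (null = wildcard). */
    public List<Triple> match(String s, String p) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : triples) {
            if ((s == null || t.s().equals(s)) && (p == null || t.p().equals(p))) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TextIndexedStore store = new TextIndexedStore();
        store.add("doc1", "dc:title", "Metadata standardization effort");
        store.add("doc2", "dc:creator", "Antoni Mylka");
        System.out.println(store.search("metadata")); // finds doc1
    }
}
```

The point is that structured queries and keyword search run against the same data, which is exactly what an application loses when the metadata lives only inside each file.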
> Antoni Mylka
> [EMAIL PROTECTED]
>
> On Nov 19, 2007 10:26 AM, Jeremias Maerki <[EMAIL PROTECTED]> wrote:
> > (I realize this is heavy cross-posting but it's probably the best
> > way to reach all the players I want to address.)
> >
> > As you may know, I've started developing an XMP metadata package
> > inside XML Graphics Commons in order to support XMP metadata (and
> > ultimately PDF/A) in Apache FOP. Therefore, I have quite an interest
> > in metadata.
> >
> > What is XMP? XMP, for those who don't know about it, is based on a
> > subset of RDF to provide a flexible and extensible way of
> > storing/representing document metadata.
> >
> > Yesterday, I was surprised to discover that Adobe has published an
> > XMP Toolkit with Java support under the BSD license. In contrast to
> > my effort, Adobe's toolkit is quite complete, if maybe a bit more
> > complicated to use. That got me thinking:
> >
> > Every project I'm sending this message to is using document metadata
> > in some form:
> > - Apache XML Graphics: embeds document metadata in the generated
> >   files (just FOP at the moment, but Batik is a similar candidate)
> > - Tika (in incubation): has as one of its main purposes the
> >   extraction of metadata
> > - Sanselan (in incubation): extracts and embeds metadata from/in
> >   bitmap images
> > - PDFBox (incubation in discussion): extracts and embeds XMP
> >   metadata from/in PDF files (see also JempBox)
> >
> > Every one of these projects has its own means to represent metadata
> > in memory. Wouldn't it make sense to have a common approach? I've
> > worked with XMP for some time now and I can say it's ideal to work
> > with. It also defines guidelines to embed XMP metadata in various
> > file formats. It's also relatively easy to map metadata between
> > different file formats (Dublin Core, EXIF, PDF Info etc.).
> >
> > Sanselan and Tika have both chosen a very simple approach, but is it
> > versatile enough for the future?
> > While the simple Map<String, String[]> in Tika allows for multiple
> > authors, for example, it doesn't support language alternatives for
> > things such as dc:title or dc:description.
> >
> > I'm seriously thinking about abandoning most of my XMP package work
> > in XML Graphics Commons in favor of Adobe's XMP Toolkit. What it
> > doesn't support, though:
> > - Metadata merging functionality (which I need for synchronizing the
> >   PDF Info object and the XMP packet for PDF/A)
> > - Schema-specific adapters (for Dublin Core and many other XMP
> >   schemas) for easier programming (which both Ben and I have written
> >   for JempBox and XML Graphics Commons). Adobe's toolkit only allows
> >   generic access.
> >
> > Some links:
> > Adobe XMP website: http://www.adobe.com/products/xmp/
> > Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> > JempBox: http://sourceforge.net/projects/jempbox
> > Apache XML Graphics Commons:
> > http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
> >
> > My questions:
> > - Any interest in converging on a unified model/approach?
> > - If yes, where shall we develop this? As part of Tika (although
> >   it's still in incubation)? As a separate project (maybe as an
> >   Apache Commons subproject)? If more than XML Graphics uses this,
> >   XML Graphics is probably not the right home.
> > - Is Adobe's XMP toolkit interesting for adoption (!= incubation)?
> >   Is the JempBox or XML Graphics Commons approach more interesting?
> > - Where's the best place to discuss this? We can't keep posting to
> >   several mailing lists.
> >
> > At any rate, I would volunteer to spearhead this effort, especially
> > since I have an immediate need for complete XMP functionality. I've
> > almost finished mapping all XMP structures in XG Commons, but I
> > haven't committed my latest changes (for structured properties) and
> > I may still not cover all details of XMP.
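The language-alternatives limitation mentioned above is easy to see in plain Java. This is just an illustration with made-up values, not Tika's actual API: a flat property-to-values map copes with several dc:creator entries, but has nowhere to put the language tag that XMP's "lang alt" arrays attach to properties like dc:title.

```java
import java.util.*;

// Illustrative sketch of the two metadata shapes discussed above;
// property names are Dublin Core, the values are invented.
public class LangAltDemo {

    /** Flat model (roughly the simple approach): property -> values. */
    public static Map<String, String[]> flatMetadata() {
        Map<String, String[]> m = new HashMap<>();
        // Multiple authors: no problem for a flat multi-valued map.
        m.put("dc:creator", new String[] {"Alice", "Bob"});
        return m;
    }

    /** XMP-style "lang alt": property -> (language tag -> value). */
    public static Map<String, Map<String, String>> langAltMetadata() {
        Map<String, String> titles = new LinkedHashMap<>();
        titles.put("x-default", "A title");
        titles.put("de", "Ein Titel");
        Map<String, Map<String, String>> m = new HashMap<>();
        m.put("dc:title", titles);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(flatMetadata().get("dc:creator")));
        // Selecting the German alternative needs the extra nesting level.
        System.out.println(langAltMetadata().get("dc:title").get("de"));
    }
}
```

Flattening the inner map into a String[] would lose the language tags, which is precisely why the simple model can't round-trip a lang-alt property.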
> > Thanks for reading this far,
> > Jeremias Maerki
>
> --
> Antoni Myłka
> [EMAIL PROTECTED]
