Hi Folks,

>> Sanselan and Tika have both chosen a very simple approach but is it
>> versatile enough for the future? While the simple Map<String, String[]> in
>> Tika allows for multiple authors, for example, it doesn't support
>> language alternatives for things such as dc:title or dc:description.
>
> IMHO it would be good to have a more flexible metadata model in Tika.
> Better yet if it's a standard used across multiple projects. Best if
> we don't need to implement it in Tika. :-)

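To make the distinction in the quoted text concrete, here is a minimal sketch in plain Java. It does not use Tika's Metadata class: the class name LangAltSketch, the helper localized, and the sample values are all made up for illustration. It also assumes that "language alternatives" refers to XMP-style alternatives, where each value carries a language tag and "x-default" marks the fallback.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Tika or any XMP toolkit.
public class LangAltSketch {

    // Multi-valued model: one key maps to a flat array of values.
    static Map<String, String[]> multiValued() {
        Map<String, String[]> md = new LinkedHashMap<>();
        // Several authors fit naturally.
        md.put("dc:creator", new String[] {"Author One", "Author Two"});
        // Two titles can be stored, but nothing records the language of each.
        md.put("dc:title", new String[] {"Annual Report", "Jahresbericht"});
        return md;
    }

    // Language-alternative model: each value is keyed by a language tag.
    static Map<String, Map<String, String>> langAlt() {
        Map<String, String> title = new LinkedHashMap<>();
        title.put("x-default", "Annual Report");
        title.put("en", "Annual Report");
        title.put("de", "Jahresbericht");
        Map<String, Map<String, String>> md = new LinkedHashMap<>();
        md.put("dc:title", title);
        return md;
    }

    // Return the value for the requested language, else the x-default one.
    static String localized(Map<String, Map<String, String>> md,
                            String key, String lang) {
        Map<String, String> alts = md.get(key);
        if (alts == null) {
            return null;
        }
        String value = alts.get(lang);
        return value != null ? value : alts.get("x-default");
    }

    public static void main(String[] args) {
        System.out.println(multiValued().get("dc:creator").length);
        System.out.println(localized(langAlt(), "dc:title", "de"));
        System.out.println(localized(langAlt(), "dc:title", "fr"));
    }
}
```

In the first model a consumer cannot tell which of the two titles is the German one; in the second it can ask for "de" and fall back to the default for a language that is missing.
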
I'm not quite sure I understand how Tika's metadata model isn't flexible enough. Of course, I'm a bit biased, but I'm really trying to understand here and haven't been able to. I think it's important to realize that a balance must be struck between bloating a metadata library (bolting on RDF support, inference, synonym support, etc.) and making sure that the smallest subset of it is actually useful. Also, I'd be against moving Metadata support out of Tika because that was one of the project's original goals, and I think it's advantageous for Tika to be a provider of a Metadata capability (of course, one related to document/content extraction).

I'm wondering too what it means that Tika doesn't support "language alternatives". Do you mean synonyms? Also, you mention it's relatively easy in other libraries to map between different file format metadata. I think that this is fairly easy to do in Tika too, seeing as its primary purpose is to support metadata extraction from different file formats.

>
>> My questions:
>> - Any interest in converging on a unified model/approach?
>
> Certainly. +1
>
>> - If yes, where shall we develop this? As part of Tika (although it's
>> still in incubation)? As a separate project (maybe as Apache Commons
>> subproject)? If more than XML Graphics uses this, XML Graphics is
>> probably not the right home.
>> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
>> the JempBox or XML Graphics Commons approach more interesting?
>
> If there already exists acceptably licensed good code outside the ASF,
> then I would prefer using that instead of reinventing the wheel within
> the foundation.

I'm not sure we're "re-inventing the wheel" here, Jukka.
Tika's Metadata framework began in Nutch. At the time, based on a short survey that Jerome Charron and I undertook, there was no easy-to-use Metadata library that met the needs of the kind of work done in Nutch/Tika: extracting metadata from documents in large corpora, supporting multiple values per key, mapping between keys, etc. So, in my mind, we're definitely not re-inventing any wheel; the framework was born more out of need and ease of use than anything else.

In any case, the use of a common framework is a good thing to discuss and I'm open to it, so long as people like me can better understand the gaps in the current Tika Metadata framework and the benefits that addressing those gaps would bring to all the projects that would need it.

Thanks!

Cheers,
Chris

>
> BR,
>
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
Phone: 818-354-8810
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
