Thanks Nick, +1.

I'll try and follow and see if I can help in places.

Cheers,
Chris

On May 16, 2012, at 5:50 AM, Nick Burch wrote:

> Hi All
> 
> I've just been brainstorming with Ray Gauss, and we think we've come up with 
> a way to move towards cleaner and clearer metadata property definitions 
> (prefixes, properties with types etc), whilst maintaining backwards 
> compatibility and avoiding too much work for parsers during the migration. 
> It'll hopefully also help with the larger plan of improving the metadata, and 
> make life easier for people working on that.
> 
> I'll use DublinCore as an example, but it's not the only one this'll apply to.
> 
> Today, we have all the keys from DublinCore imported onto the Metadata 
> object, and the parsers all call eg Metadata.DESCRIPTION rather than 
> DublinCore.DESCRIPTION. This is a string key, not a property, so there's no 
> information on it about type etc, and it's a raw key of "description" so 
> people outside of the Java space (eg tika-cli users) don't know what it is 
> defined as.
> 
> What I think we'd really like is for that to be a property, with type, with a 
> key that includes our chosen prefix (so that tika-cli users etc know what it 
> is), that doesn't break backwards compatibility until 2.0.
> 
> Additionally, we want to identify which properties are common, which all 
> parsers should be mapping their metadata onto (eg everything should map the 
> metadata that corresponds roughly to what Dublin Core explains Description to 
> be, no matter what the format calls it), in addition to any format-specific 
> ones (which only advanced users want).
> 
> We think we have a plan!
> 
> In order to avoid breaking backwards compatibility, we've looked and 
> basically nothing uses the metadata key interfaces directly. Everything seems 
> to use the Metadata one instead, eg Metadata.DESCRIPTION rather than 
> DublinCore.DESCRIPTION. So, we think we can change the dublin core one, 
> provided that Metadata is unchanged.
> 
> Step one is therefore to change all the definitions in Dublin Core to be 
> proper properties. We copy over the old strings to Metadata, and @deprecate 
> them (until 2.0). Everything should still work.
> 
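A minimal sketch of how step one could look, to make the idea concrete. All class names here (TypedProperty, DublinCoreSketch, MetadataSketch) are hypothetical stand-ins rather than Tika's actual classes, and the "dc" prefix is an assumption, since the prefix has not been chosen in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical typed property: a prefixed key plus a value type.
final class TypedProperty {
    enum ValueType { TEXT, DATE, INTEGER }

    private final String name;    // prefixed key, e.g. "dc:description"
    private final ValueType type;

    TypedProperty(String prefix, String localName, ValueType type) {
        this.name = prefix + ":" + localName;
        this.type = type;
    }

    String getName() { return name; }
    ValueType getType() { return type; }
}

// Hypothetical Dublin Core definitions, now proper properties.
final class DublinCoreSketch {
    // The "dc" prefix is an assumption made for illustration.
    static final TypedProperty DESCRIPTION =
            new TypedProperty("dc", "description", TypedProperty.ValueType.TEXT);

    private DublinCoreSketch() {}
}

// Hypothetical metadata holder that keeps the old raw string key, deprecated.
final class MetadataSketch {
    /** @deprecated use {@link DublinCoreSketch#DESCRIPTION} instead */
    @Deprecated
    static final String DESCRIPTION = "description";

    private final Map<String, String> values = new HashMap<>();

    // Old style: raw string key with no type information.
    void set(String key, String value) { values.put(key, value); }

    // New style: typed property, stored under its prefixed name.
    void set(TypedProperty property, String value) {
        values.put(property.getName(), value);
    }

    String get(String key) { return values.get(key); }

    String get(TypedProperty property) { return values.get(property.getName()); }
}
```

In this sketch, existing callers of MetadataSketch.DESCRIPTION keep compiling (with a deprecation warning), while new code can use DublinCoreSketch.DESCRIPTION and get a prefixed key and a type.
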
> Next, we define a class to hold the common Tika metadata properties. These 
> are the ones we consider to be common across all formats, which parsers 
> should be trying to populate wherever they can. (Most parsers already do 
> this, eg for title or description). We'll do a few of these, but we'll need 
> others to contribute to help decide the rest. These will be delegated out to 
> a standard property that someone else has already defined, as we do now.
> 
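One way the "common properties" class might delegate to existing standard definitions, as a rough sketch only. StandardProperty, DublinCoreProps and CommonProps are hypothetical names, and the "dc:" keys are assumed for illustration.

```java
// Hypothetical property type, reduced to just a prefixed name.
final class StandardProperty {
    private final String name;

    StandardProperty(String name) { this.name = name; }

    String getName() { return name; }
}

// Hypothetical standard definitions that someone else has already defined.
final class DublinCoreProps {
    static final StandardProperty TITLE = new StandardProperty("dc:title");
    static final StandardProperty DESCRIPTION = new StandardProperty("dc:description");

    private DublinCoreProps() {}
}

// Hypothetical holder for the common properties that all parsers should try
// to populate. Each entry delegates to a standard definition instead of
// introducing a new key.
final class CommonProps {
    static final StandardProperty TITLE = DublinCoreProps.TITLE;
    static final StandardProperty DESCRIPTION = DublinCoreProps.DESCRIPTION;

    private CommonProps() {}
}
```

Because the common entries are the same objects as the standard ones, a parser that sets CommonProps.DESCRIPTION is writing the Dublin Core description key, with no second key to keep in sync.
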
> With that done, we can also specify some aliases, so that when you set one 
> property it can be defined to also set some others. This allows us to say 
> "when you set the new dublin core description, for now also go and set the 
> old style description". This support will also be helpful for mappings on xmp 
> aware (or similar) formats, to map between their custom properties and our 
> common ones.
> 
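The alias idea could be sketched roughly as follows: setting a property also writes the same value under each of its alias keys. AliasedProperty and AliasingMetadata are hypothetical names, not Tika's API, and the "dc:description" key is again an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical property that knows which other keys to set alongside itself.
final class AliasedProperty {
    private final String name;
    private final List<String> aliasKeys;

    AliasedProperty(String name, String... aliasKeys) {
        this.name = name;
        this.aliasKeys =
                Collections.unmodifiableList(new ArrayList<>(Arrays.asList(aliasKeys)));
    }

    String getName() { return name; }

    List<String> getAliasKeys() { return aliasKeys; }
}

// Hypothetical metadata holder that applies the aliases on every set.
final class AliasingMetadata {
    private final Map<String, String> values = new HashMap<>();

    void set(AliasedProperty property, String value) {
        values.put(property.getName(), value);
        // Also populate each alias, e.g. the old-style raw key.
        for (String alias : property.getAliasKeys()) {
            values.put(alias, value);
        }
    }

    String get(String key) { return values.get(key); }
}
```

With a property declared as new AliasedProperty("dc:description", "description"), a single set call fills both the new prefixed key and the old raw key, so users of raw keys such as tika-cli see no change during the migration.
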
> Finally, we go through the parsers and update them to set the new properties, 
> rather than the old strings. They'll maintain compatibility for all users 
> (those using the Java lookups, and those using raw keys eg tika-cli), and 
> when we drop that in 2.0 the parsers don't need to change.
> 
> We'll be opening issues for all of these, and doing the work in small chunks 
> so everyone can follow. I believe this all fits with what everyone has been 
> discussing for a while, doesn't break anything, and moves us forward. Despite 
> the long email, it's actually quite small changes!
> 
> Nick


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
