[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719 ]
Ray Gauss II commented on TIKA-1607:
------------------------------------

Hi [~talli...@mitre.org], apologies for the delay in responding here.

1. POJOs

bq. We might have better documentation of POJOs and compile-time guarantees about methods and typed values.

Agreed, but the DOM persistence doesn't preclude us from also using Java 'helper' classes that know how to more easily get and set values for particular schemas that we'd like to focus on.

bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd and maintain it?

I'd vote for sticking as true to a specification's original schema as possible when there is one, but whether we'd want to build and maintain one for those that don't is a good question.

2. Passthrough

bq. why couldn't we literally pass that through via the String version of the xml?

I think we could, but we'd first have to 'merge' with the metadata being modeled by the parsers and could then allow access to the full DOM {{Document}} object, which clients could easily serialize to a string if need be.

3. Serialization to JSON

There seem to be several libraries available that can help with XML to JSON, though I don't think this would belong in core.

4. Multilingual fields

Great question. XMP uses RDF and xml:lang:
{noformat}
<dc:title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">quick brown fox</rdf:li>
    <rdf:li xml:lang="it">rapido fox marrone</rdf:li>
  </rdf:Alt>
</dc:title>
{noformat}
that's one possibility.

bq. I'm wondering if we want to add structure only where structured data doesn't exist within the document and let the client parse what they'd like out of structured metadata that is in the document?

This also relates to passthrough above, but one thing to keep in mind is that the metadata we're parsing could be coming from several different parts of the binary. For example, EXIF doesn't necessarily also live in XMP (though most apps also write it there these days) and there can be more than one XMP packet present in a file.
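As a rough illustration of points 2 and 4 above, here is a minimal sketch that parses the dc:title fragment into a DOM {{Document}}, reads the xml:lang alternatives, and serializes the document back to a string. It uses only the JDK's javax.xml and org.w3c.dom classes and is not Tika API: the class name XmpLangAltSketch, its method names, and the namespace declarations added to the sample are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Tika.
class XmpLangAltSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // The dc:title fragment from the comment, with namespace declarations
    // added (an assumption) so that it parses on its own.
    static final String SAMPLE =
        "<dc:title xmlns:dc=\"" + DC_NS + "\" xmlns:rdf=\"" + RDF_NS + "\">"
      + "<rdf:Alt>"
      + "<rdf:li xml:lang=\"x-default\">quick brown fox</rdf:li>"
      + "<rdf:li xml:lang=\"it\">rapido fox marrone</rdf:li>"
      + "</rdf:Alt>"
      + "</dc:title>";

    // Parse an XML string into a namespace-aware DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Map each xml:lang value to the text of its rdf:li element.
    // This collects every rdf:li in the document; a real implementation
    // would scope the lookup to the rdf:Alt of a specific property.
    static Map<String, String> titlesByLanguage(Document doc) {
        Map<String, String> titles = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagNameNS(RDF_NS, "li");
        for (int i = 0; i < items.getLength(); i++) {
            Element li = (Element) items.item(i);
            titles.put(li.getAttributeNS(XML_NS, "lang"), li.getTextContent());
        }
        return titles;
    }

    // Serialize the full DOM Document back to a string (the passthrough case).
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(titlesByLanguage(doc));
        System.out.println(serialize(doc));
    }
}
```

A client that only wants one language could look up titlesByLanguage(doc).get("it") and fall back to "x-default", while a client that wants everything could take the serialized string as-is.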
It would be nice to bring these different sources into a unified persistence structure, even if for simpler metadata everything lives at the top level.

bq. how do we transfer as much normalized/structured metadata as possible in as simple a way to the end user.

This also gets back to passthrough and the possibility of access to the full DOM {{Document}} object.

Thanks for keeping the discussion going. We obviously need to take great care in changing such a fundamental area of the code.

> Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.11
>
>         Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and enhancement of the Tika support for phone number extraction and metadata modeling.
> Right now we utilize the String[] multivalued support available within Tika to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the String[] paradigm by implementing a more abstract Collection of Objects such that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach... additionally it is a fundamental change to the core Metadata API. I hope, however, that the <String, Object> mapping is flexible enough to allow me to model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)