[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719 ]

Ray Gauss II commented on TIKA-1607:
------------------------------------

Hi [~talli...@mitre.org], apologies for the delay in responding here.

1. POJOs
bq. We might have better documentation of POJOs and compile-time guarantees 
about methods and typed values.

Agreed, but the DOM persistence doesn't preclude us from also using Java 
'helper' classes that know how to more easily get and set values for particular 
schemas that we'd like to focus on.
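
A minimal sketch of such a helper, assuming the metadata lives in a namespace-aware DOM {{Document}}. The class name {{DublinCoreHelper}}, its methods and the {{metadata}} root element are hypothetical and are not part of Tika; only the Dublin Core namespace URI and the standard {{org.w3c.dom}} calls are real.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an existing Tika class: typed accessors on top of a DOM Document.
final class DublinCoreHelper {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private final Document doc;

    DublinCoreHelper(Document doc) {
        this.doc = doc;
    }

    /** Creates an empty namespace-aware document with a hypothetical <metadata> root. */
    static Document newMetadataDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElementNS(null, "metadata"));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Sets dc:title, reusing the existing element when there is one. */
    void setTitle(String title) {
        NodeList existing = doc.getElementsByTagNameNS(DC_NS, "title");
        Element element;
        if (existing.getLength() > 0) {
            element = (Element) existing.item(0);
        } else {
            element = doc.createElementNS(DC_NS, "dc:title");
            doc.getDocumentElement().appendChild(element);
        }
        element.setTextContent(title);
    }

    /** Returns the dc:title text, or null when no dc:title element exists. */
    String getTitle() {
        NodeList existing = doc.getElementsByTagNameNS(DC_NS, "title");
        return existing.getLength() > 0 ? existing.item(0).getTextContent() : null;
    }
}
```

Callers get compile-time checked methods for the schemas we care about, while anything the helper does not know about stays reachable through the underlying {{Document}}.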

bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd 
and maintain it?

I'd vote for sticking as true to a specification's original schema as possible 
when there is one, but whether we'd want to build and maintain an xsd for 
those that don't have one is a good question.

2. Passthrough
bq. why couldn't we literally pass that through via the String version of the 
xml?

I think we could, but we'd first have to 'merge' it with the metadata being 
modeled by the parsers. We could then allow access to the full DOM 
{{Document}} object, which clients could easily serialize to a string if need be.
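
For reference, serializing a {{Document}} to a string on the client side can be sketched with the standard JAXP identity transformer. The class name {{DomToString}} and its methods are hypothetical; the {{javax.xml.transform}} calls are standard JDK APIs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical utility, not part of Tika.
final class DomToString {

    /** Parses an XML string into a namespace-aware DOM Document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the whole document using an identity transform, without an XML declaration. */
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```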

3. Serialization to JSON
There seem to be several libraries available that can help with XML to JSON, 
though I don't think this would belong in core.

4. Multilingual fields
Great question.  XMP uses RDF and xml:lang:
{noformat}
<dc:title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">quick brown fox</rdf:li>
    <rdf:li xml:lang="it">rapido fox marrone</rdf:li>
  </rdf:Alt>
</dc:title>
{noformat}
that's one possibility.
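
A rough sketch of how a client might read those language alternatives from a DOM, assuming the fragment declares the usual {{dc}} and {{rdf}} namespaces. The class {{LangAltReader}} is hypothetical, and for brevity it collects every {{rdf:li}} in the input rather than only those under {{dc:title}}.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader, not part of Tika.
final class LangAltReader {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Maps each xml:lang value to the text of its rdf:li element, in document order. */
    static Map<String, String> readAlternatives(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            Map<String, String> byLanguage = new LinkedHashMap<>();
            NodeList items = doc.getElementsByTagNameNS(RDF_NS, "li");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // xml:lang lives in the reserved XML namespace; an absent attribute yields "".
                String lang = item.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                byLanguage.put(lang, item.getTextContent());
            }
            return byLanguage;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```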

bq. I'm wondering if we want to add structure only where structured data 
doesn't exist within the document and let the client parse what they'd like out 
of structured metadata that is in the document?

This also relates to passthrough above, but one thing to keep in mind is that 
the metadata we're parsing could be coming from several different parts of the 
binary.  For example, EXIF doesn't necessarily also live in XMP (though most 
apps also write it there these days) and there can be more than one XMP packet 
present in a file.  It would be nice to bring these different sources into a 
unified persistence structure, even if for simpler metadata everything lives at 
the top level.
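
One way to picture that unified structure is to import each source's root element under a single parent. This is only a sketch under assumptions: the class {{MetadataMerger}} and the {{metadata}} wrapper element are hypothetical, and each source (an EXIF block, one or more XMP packets) is assumed to be already available as its own DOM {{Document}}.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical merger, not part of Tika.
final class MetadataMerger {

    private static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder();
    }

    /** Parses an XML string into a namespace-aware DOM Document. */
    static Document parse(String xml) {
        try {
            return newBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Deep-copies each source root under one <metadata> element; the sources are left untouched. */
    static Document merge(Document... sources) {
        try {
            Document unified = newBuilder().newDocument();
            Element root = unified.createElementNS(null, "metadata");
            unified.appendChild(root);
            for (Document source : sources) {
                Element sourceRoot = source.getDocumentElement();
                if (sourceRoot != null) {
                    // importNode copies the subtree into the target document's ownership.
                    root.appendChild(unified.importNode(sourceRoot, true));
                }
            }
            return unified;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping each source as its own subtree preserves where a value came from, which matters when EXIF and XMP disagree or when a file carries more than one XMP packet.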

bq. how do we transfer as much normalized/structured metadata as possible in as 
simple a way to the end user.

This also gets back to passthrough and the possibility of access to the full 
DOM {{Document}} object.

Thanks for keeping the discussion going.  We obviously need to take great care 
in changing such a fundamental area of the code.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.11
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
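
For comparison, the {{<String, Object>}} shape proposed in the issue description above could look roughly like the following in plain Java collections. This is a sketch under assumptions, not code from any of the attached patches: the class and method names are hypothetical, and attribute values are simplified to strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the proposal, not part of Tika.
final class PhoneNumberMetadataSketch {

    /** Builds one entry of the form: number -> (attribute name -> attribute value). */
    static Map<String, Map<String, String>> entry(String number, String countryCode, String numberType) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", numberType);
        Map<String, Map<String, String>> entry = new LinkedHashMap<>();
        entry.put(number, attributes);
        return entry;
    }

    /** Stores the collection of entries under the "phonenumbers" key of a String -> Object map. */
    static Map<String, Object> buildMetadata() {
        List<Map<String, Map<String, String>>> phoneNumbers = new ArrayList<>();
        phoneNumbers.add(entry("+162648743476", "US", "International"));
        phoneNumbers.add(entry("+1292611054", "UK", "International"));
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", phoneNumbers);
        return metadata;
    }

    /** Reading a value back needs an unchecked cast, since the map only promises Object. */
    @SuppressWarnings("unchecked")
    static String countryCodeOf(Map<String, Object> metadata, String number) {
        List<Map<String, Map<String, String>>> phoneNumbers =
                (List<Map<String, Map<String, String>>>) metadata.get("phonenumbers");
        if (phoneNumbers == null) {
            return null;
        }
        for (Map<String, Map<String, String>> entry : phoneNumbers) {
            Map<String, String> attributes = entry.get(number);
            if (attributes != null) {
                return attributes.get("LibPN-CountryCode");
            }
        }
        return null;
    }
}
```

The unchecked cast in {{countryCodeOf}} shows where the compile-time guarantees discussed under POJOs are lost with an {{Object}}-valued map.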



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
