[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441
 ] 

Ray Gauss II commented on TIKA-1607:
------------------------------------

To clarify, the work mentioned above that uses an XPath-like syntax is only a 
workaround for mapping structured metadata into the current 'flat' metadata 
model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe. 
(maybe that's now?)

This is simply restating some of what's already been said, but there are many 
aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our 
serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM 
({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and 
getting child nodes, etc. rather than hard coding POJOs for each metadata 
standard we want to support.
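
As a rough sketch of what that could look like (this is not existing Tika API: 
the class name, the namespace URI, and the element names are all hypothetical, 
and the phone number values are borrowed from the issue description), a DOM 
document can hold namespaced, nested metadata while still exposing a flat 
String[] view for backwards compatibility:

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch only; nothing here exists in Tika today.
final class DomMetadataSketch {
    // Illustrative namespace URI, assumed for this example.
    static final String PN_NS = "http://example.org/tika/phonenumbers";

    private final Document doc;
    private final Element root;

    DomMetadataSketch() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DOM implementation", e);
        }
        root = doc.createElement("metadata");
        doc.appendChild(root);
    }

    // Structured entry: one namespaced element per number, with child elements.
    Element addPhoneNumber(String number, String countryCode, String numberType) {
        Element pn = doc.createElementNS(PN_NS, "pn:phonenumber");
        pn.setAttribute("value", number);
        appendText(pn, "pn:countryCode", countryCode);
        appendText(pn, "pn:numberType", numberType);
        root.appendChild(pn);
        return pn;
    }

    private void appendText(Element parent, String qualifiedName, String text) {
        Element child = doc.createElementNS(PN_NS, qualifiedName);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    // Flat, backwards-compatible view: local name -> String[] of values.
    String[] getValues(String localName) {
        NodeList nodes = root.getElementsByTagNameNS(PN_NS, localName);
        String[] values = new String[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            values[i] = e.hasAttribute("value")
                    ? e.getAttribute("value")
                    : e.getTextContent();
        }
        return values;
    }

    // Discovery of top-level elements: qualified names of the root's children.
    List<String> topLevelNames() {
        List<String> names = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }
}
```

Because everything used here is in {{org.w3c.dom}} and {{javax.xml.parsers}}, 
which ship with the JDK, this would add no new dependencies to tika-core, and 
XML serialization comes from the standard transformer APIs.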

I'll try to find some time to put together an example patch for that approach 
in the next few days.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.10
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach. 
> Additionally, it is a fundamental change to the core Metadata API. I hope, 
> however, that the <String, Object> mapping is flexible enough to allow me to 
> model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
