[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660093#comment-14660093
 ] 

Tim Allison commented on TIKA-1607:
-----------------------------------

Y, I agree that we should push the parsers to do as much as possible.  I think 
whether we push complexity into the values or into the properties, the users 
will still have to take the time to learn about the options.  

In favor of my proposal: the values are actual Java objects holding 
primitives, etc.  Neither the user nor the Metadata object is responsible for 
converting strings to Java values (e.g. getDate/getInt); the knowledge of the 
underlying types lives in the value classes and their API.  
We could have enums and other typed/checked objects.  
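
To make the typed-value idea concrete, here is a minimal sketch.  Every name 
in it (NumberType, PhoneNumberValue, TypedMetadata) is hypothetical and made 
up for illustration; none of it comes from the attached patches or from the 
existing Tika API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; none of these exist in Tika.
enum NumberType { INTERNATIONAL, NATIONAL, UNKNOWN }

// A typed value: fields are checked Java types rather than strings to parse.
final class PhoneNumberValue {
    private final String number;
    private final String countryCode;
    private final NumberType type;

    PhoneNumberValue(String number, String countryCode, NumberType type) {
        this.number = number;
        this.countryCode = countryCode;
        this.type = type;
    }

    String getNumber() { return number; }
    String getCountryCode() { return countryCode; }
    NumberType getType() { return type; }
}

// Stand-in for a metadata container keyed by name, holding arbitrary objects.
final class TypedMetadata {
    private final Map<String, List<Object>> values = new HashMap<>();

    void add(String key, Object value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns only the values stored under key that are instances of type.
    <T> List<T> get(String key, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (Object v : values.getOrDefault(key, Collections.<Object>emptyList())) {
            if (type.isInstance(v)) {
                out.add(type.cast(v));
            }
        }
        return out;
    }
}
```

Under these assumptions, client code would call something like 
get("phonenumbers", PhoneNumberValue.class) and read getCountryCode() or 
getType() directly, with no string parsing on the caller's side.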

Y, the user has to learn what methods are available, but the user has to learn 
about the sub-properties of the properties, too.


For the record, I really don't like the doubling up of responsibility for 
checking whether a given property can go with a given value in my proposal.  
And, the patch is still quite rough.

As you suggest, it would help to see what the client code would look like for 
the PhoneNumber, MultiLingual and MediaTrack examples.  Would there be a way to 
encode a geoshape?  What would that look like?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.10
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
