[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711801#comment-14711801
 ] 

Tim Allison commented on TIKA-1607:
-----------------------------------

[~rgauss], thank you for this demo code! I haven't had a chance to review it 
thoroughly, and I'm sure I've missed plenty (and, [~chrismattmann], I still 
need to send info for the metadata discussion over on OODT).

I really like the non-POJO flexibility, the actual namespacing and the 
full-blown DOM + XPath.  I side with you on avoiding the literal indexing 
(shoehorning) of keys if there are multiple complex values.  Your patch is just 
plain elegant.

You mention fleshing out the requirements list above, but I'm not sure there's 
much left to add. :)

Some half-baked thoughts:
# The flip side of POJO-bloat for every new metadata schema is unfettered 
flexibility/modifications.  With POJOs, we _might_ have better documentation and 
compile-time guarantees about methods and typed values.  Schemas/xsds can 
enforce plenty, I know, but would we want to build an xsd and maintain it?  
That starts feeling like as much work as POJOs, but maybe not.
# Your comment about "passing through" and the example of the vcard made me 
wonder...for complex structures (xmp, vcard), why couldn't we literally pass 
that through via the String version of the xml?  If a client has enough 
sophistication and knowledge of that structure to make use of it, why not pass 
it through literally and let them do the DOM parsing (I've been thinking about 
proposing this for the XMP that we're pulling out of PDFs and jpegs)?  The 
balancing act, of course, is determining which elements to pull into "regular" 
metadata values (e.g. Dublin Core, etc).
# How will we serialize to JSON, if that is desired?  Would we dump the XML as a 
string for values of properties of type DOM_Element?
# What would a multilingual field look like?
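
To make the "pass it through literally" idea in point 2 concrete, here is a minimal sketch of what the client side might look like. It is not taken from the patch: the metadata key "vcard:raw", the sample vcard XML and the class name are all hypothetical, and a plain Map stands in for Tika's Metadata object. Only standard JDK DOM and XPath calls are used.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PassThroughXmlSketch {

    // Parses the pass-through XML string and returns the string value of
    // the first node matched by the XPath expression ("" if nothing matches).
    public static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The XML comes out of arbitrary documents, so refuse DOCTYPEs.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the parser stored the raw XML as an ordinary String
        // value under a hypothetical "vcard:raw" key.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("vcard:raw",
                "<vcard><fn>Jane Doe</fn><tel type=\"work\">+1-555-0100</tel></vcard>");

        // The client decides which parts of the structure it cares about.
        String raw = metadata.get("vcard:raw");
        System.out.println(evaluate(raw, "/vcard/fn"));
        System.out.println(evaluate(raw, "/vcard/tel[@type='work']"));
    }
}
```

With this approach the metadata store stays String-valued, and the cost of understanding the structure moves entirely to clients that actually need it.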

I'm wondering if we want to add structure only where structured data doesn't 
exist within the document and let the client parse what they'd like out of 
structured metadata that is in the document?  Or, in the case of multimedia, 
perhaps, generate PBCore XML as a value for a "regular" key.

No solutions, just thoughts...

Thank you again.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.11
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
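
For readers who want to see the <String, Object> shape from the issue description in code, the following is a rough sketch only. It does not use or reflect the real Tika Metadata API: a plain Map stands in for it, the class and method names are hypothetical, and the phone numbers and LibPN-* keys are copied from the example in the description. The second helper hints at one way the existing String[] view could be derived for backwards compatibility.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PhoneNumberMetadataSketch {

    // Builds one entry: phone number -> (attribute name -> attribute value).
    public static Map<String, Map<String, String>> entry(
            String number, String countryCode, String numberType) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", numberType);
        Map<String, Map<String, String>> entry = new LinkedHashMap<>();
        entry.put(number, attributes);
        return entry;
    }

    // Flattens the structured entries back to the String[] form used today,
    // keeping only the numbers themselves.
    public static String[] toLegacyValues(List<Map<String, Map<String, String>>> entries) {
        List<String> numbers = new ArrayList<>();
        for (Map<String, Map<String, String>> e : entries) {
            numbers.addAll(e.keySet());
        }
        return numbers.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a <String, Object> metadata store.
        Map<String, Object> metadata = new LinkedHashMap<>();

        List<Map<String, Map<String, String>>> phoneNumbers = new ArrayList<>();
        phoneNumbers.add(entry("+162648743476", "US", "International"));
        phoneNumbers.add(entry("+1292611054", "UK", "International"));
        metadata.put("phonenumbers", phoneNumbers);

        System.out.println(metadata);
        System.out.println(String.join(", ", toLegacyValues(phoneNumbers)));
    }
}
```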



