[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659998#comment-14659998
 ] 

Tim Allison edited comment on TIKA-1607 at 8/6/15 1:45 PM:
-----------------------------------------------------------

This patch adds examples for a MultilingualValue and demo/hack examples of a 
phone number value (to meet the initial example) and a multimedia tracks 
example.  

I've fixed the JSON serialization so that it can handle more complex objects, 
and MetadataValues are no longer required to parse their own string as part of 
serialization.

There is still some stink around requiring a string representation in the base 
class...perhaps we should move back to an abstract class for the base 
MetadataValue and use a StringMetadataValue for those metadata values that can 
reasonably be represented by a single string.
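
A rough sketch of what that split could look like. Apart from MetadataValue and 
StringMetadataValue, every name here (toSerializable, PhoneNumberValue, the 
field names) is hypothetical and not taken from the patch; the point is only 
that the base class no longer demands a string form.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical abstract base: no string representation is required here.
abstract class MetadataValue {
    // Assumption: each value returns a String/Map/List tree that a JSON
    // serializer can walk, so values never have to re-parse their own string.
    abstract Object toSerializable();
}

// For values that can reasonably be represented by a single string.
final class StringMetadataValue extends MetadataValue {
    private final String value;

    StringMetadataValue(String value) {
        this.value = Objects.requireNonNull(value, "value");
    }

    String getString() {
        return value;
    }

    @Override
    Object toSerializable() {
        return value;
    }
}

// Hypothetical complex value with no natural single-string form.
final class PhoneNumberValue extends MetadataValue {
    private final String number;
    private final String countryCode;

    PhoneNumberValue(String number, String countryCode) {
        this.number = Objects.requireNonNull(number, "number");
        this.countryCode = Objects.requireNonNull(countryCode, "countryCode");
    }

    @Override
    Object toSerializable() {
        // Insertion-ordered map keeps the serialized field order stable.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("number", number);
        fields.put("countryCode", countryCode);
        return fields;
    }
}
```

With this shape, only StringMetadataValue exposes getString(); callers that 
need a string have to ask for that type explicitly.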

I also don't like having the MetadataValue determine if the property is right.  
It will get annoying to have a bunch of new Properties _and_ a bunch of new 
MetadataValues...any solutions?

The phone number and mediatracks examples are purely for demo purposes.  We 
should integrate/translate [~rgauss]'s 
[tika-ffmpeg|https://github.com/AlfrescoLabs/tika-ffmpeg] properly later.

One challenge ahead is that we'll likely have a profusion of Property types and 
of MetadataValues...I think this should be manageable...what other metadata 
values do we need?

GeoShape (point, shape)...it would be neat to parse KML or other shape files 
and get meaningful shapes 
([com.spatial4j.core.shape|https://github.com/locationtech/spatial4j] (?)) out 
of those.  It would be great to have a Point/LatLon as well, obviously, for 
images.
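
For the Point/LatLon case, a minimal sketch might be as small as the class 
below. LatLonValue is a hypothetical name, it does not use spatial4j, and how 
it would plug into MetadataValue is left open.

```java
// Hypothetical value type for a single geographic point, e.g. from image EXIF data.
final class LatLonValue {
    private final double lat;
    private final double lon;

    LatLonValue(double lat, double lon) {
        // The negated range checks also reject NaN.
        if (!(lat >= -90.0 && lat <= 90.0)) {
            throw new IllegalArgumentException("latitude out of range: " + lat);
        }
        if (!(lon >= -180.0 && lon <= 180.0)) {
            throw new IllegalArgumentException("longitude out of range: " + lon);
        }
        this.lat = lat;
        this.lon = lon;
    }

    double getLat() {
        return lat;
    }

    double getLon() {
        return lon;
    }
}
```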

[~dsmiley], would you have any interest in this kind of use for spatial4j?  If 
only there were a highly reliable, scalable and fault tolerant search system, 
providing distributed indexing, replication and load-balanced querying, 
automated failover and recovery, centralized configuration that could handle 
geo-search. ;)



> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.10
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope, 
> however, that the <String, Object> mapping is flexible enough to allow me to 
> model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
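
For reference, the structure proposed in the description above can be sketched 
in plain Java collections. This is only an illustration under assumptions: 
PhoneNumberMetadataSketch, entry and build are hypothetical names, the map 
stands in for Tika's Metadata rather than using its API, and the phone numbers 
and LibPN-* keys are the ones from the example in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder class; a plain Map<String, Object> stands in for Metadata.
final class PhoneNumberMetadataSketch {

    // Builds one entry of the form number -> (attribute name -> attribute value).
    static Map<String, Map<String, Object>> entry(String number,
                                                  String countryCode,
                                                  String numberType) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", numberType);

        Map<String, Map<String, Object>> entry = new LinkedHashMap<>();
        entry.put(number, attributes);
        return entry;
    }

    // Builds the <String, Object> mapping with a collection of entries
    // stored under the "phonenumbers" key.
    static Map<String, Object> build() {
        List<Map<String, Map<String, Object>>> phoneNumbers = new ArrayList<>();
        phoneNumbers.add(entry("+162648743476", "US", "International"));
        phoneNumbers.add(entry("+1292611054", "UK", "International"));

        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", phoneNumbers);
        return metadata;
    }
}
```

Reading a value back requires casts at each level, which is the cost of the 
untyped Object mapping and part of the motivation for typed MetadataValues.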



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
