[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196042#comment-15196042
 ] 

Tim Allison commented on TIKA-1607:
---

bq. I'm also aware that we've strayed a bit from the original issue here of 
structured metadata. Should we create a separate issue?

Agreed.  Thank you.  Let's move this to TIKA-1903.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach. 
> Additionally, it is a fundamental change to the core Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
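
For illustration only, the proposed String: Object shape could be sketched in plain Java roughly as follows. This is not the Tika Metadata class: the class StructuredMetadataSketch and its methods are hypothetical names, and only java.util collections are used. The phone numbers and LibPN-* keys are copied from the example in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Tika Metadata API.
class StructuredMetadataSketch {

    // Key -> arbitrary value (a String, a List, a nested Map, ...).
    private final Map<String, Object> values = new LinkedHashMap<>();

    void put(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    // Builds one phone number entry: number -> attribute map.
    static Map<String, Object> phoneNumber(String number, String countryCode, String numberType) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("LibPN-CountryCode", countryCode);
        attrs.put("LibPN-NumberType", numberType);
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put(number, attrs);
        return entry;
    }

    // Backwards-compatible view: flattens the structured entries to the
    // String[] of bare numbers that the current multi-valued support stores.
    static String[] flattenNumbers(List<Map<String, Object>> entries) {
        List<String> numbers = new ArrayList<>();
        for (Map<String, Object> entry : entries) {
            numbers.addAll(entry.keySet());
        }
        return numbers.toArray(new String[0]);
    }
}
```

The flattenNumbers helper is one possible way to keep serving existing String[] consumers while richer values are stored alongside; whether that addresses the compatibility concern is an open question in the thread.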



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196030#comment-15196030
 ] 

Ray Gauss II commented on TIKA-1607:


bq. It might be more easily configurable to use the 
ParsingEmbeddedDocumentExtractor as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method 
currently takes a {{Metadata}} object which is only associated with the 
embedded resource (not the same metadata object associated with the 'container' 
file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:
{code}
/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources or null if no resources
     * were stored during parsing
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();

}
{code}

then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and 
make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some {{EmbeddedResources}} object to be optionally populated along with 
the {{Metadata}} in the {{Parser.parse}} method?
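
As a rough, self-contained sketch of how Options 1 and 2 might fit together: the interfaces below are simplified stand-ins so the example compiles without Tika on the classpath. SimpleEmbeddedExtractor is not the real EmbeddedDocumentExtractor signature, and the Map<String, byte[]> type, the resource-name key, and the class names are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for Tika's EmbeddedDocumentExtractor so the sketch
// compiles on its own; the real interface has a different signature.
interface SimpleEmbeddedExtractor {
    void parseEmbedded(InputStream stream, String resourceName) throws IOException;
}

// Hypothetical extension along the lines of Option 1.
interface StoringExtractor extends SimpleEmbeddedExtractor {
    // Returns the stored resources, or null if nothing was stored.
    Map<String, byte[]> getEmbeddedResources();
}

// Hypothetical in-memory implementation along the lines of Option 2.
class InMemoryStoringExtractor implements StoringExtractor {

    // Created lazily, so null means "nothing stored during parsing".
    private Map<String, byte[]> resources;

    @Override
    public void parseEmbedded(InputStream stream, String resourceName) throws IOException {
        // Copy the whole embedded stream into memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        if (resources == null) {
            resources = new LinkedHashMap<>();
        }
        resources.put(resourceName, buffer.toByteArray());
    }

    @Override
    public Map<String, byte[]> getEmbeddedResources() {
        return resources;
    }
}
```

Because every embedded resource is held in memory with no cap, a real implementation would probably need the kind of opt-in switch or size limit discussed elsewhere in this thread.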

Other options?  Maybe they don't need the RAW XMP?

I'm also aware that we've strayed a bit from the original issue here of 
structured metadata.  Should we create a separate issue?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195882#comment-15195882
 ] 

Tim Allison commented on TIKA-1607:
---

Y, makes sense. It might be more easily configurable to use the 
ParsingEmbeddedDocumentExtractor as is and let users write their own XMP parsers, 
no?  That would allow for all of the benefits of easy configurability (thank 
you, Nick!) of parsers for advanced handling of XMP.  

Or is there a reason to avoid that option?

The one benefit to storing encoded bytes in the metadata object is that clients 
in other languages could decode and process.  That said, I much prefer the 
direction you're recommending here.
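
To make the cross-language point concrete, here is a minimal sketch of storing bytes as an encoded String value. The thread does not say which encoding was intended; Base64 is an assumption here, and EncodedBytesSketch is a hypothetical helper, not part of Tika.

```java
import java.util.Base64;

// Hypothetical helper; Base64 is an assumed encoding, not one named in the thread.
class EncodedBytesSketch {

    // Encodes raw bytes as a Base64 string that can sit in a String-valued field.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Reverses encode(); any client with a Base64 decoder can do the same.
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

The encoded form is about a third larger than the raw bytes, which is one cost of keeping such values inside String-only metadata.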



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195326#comment-15195326
 ] 

Ray Gauss II commented on TIKA-1607:


Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment).

We can currently dump stuff to files in some parsers with the {{--extract}} CLI 
option which sticks a {{FileEmbeddedDocumentExtractor}} in the context.

The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}.

Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when 
enabled, would also save the embedded resources in memory for an advanced user 
to do whatever they need, knowing the risk and resources required for that 
option?

Or provide some other in-memory implementation that advanced users could 
explicitly set in the context?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193845#comment-15193845
 ] 

Ray Gauss II commented on TIKA-1607:


Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178059#comment-15178059
 ] 

Tim Allison commented on TIKA-1607:
---

FWIW, I extracted ~300k XMPs and XFAs from some of our corpus.  The largest 
file was an XFA weighing in at 11MB (compressible via gzip to 900k).
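
For anyone wanting to take a similar measurement on their own extracted packets, a minimal sketch using the JDK's gzip stream follows. GzipSizeSketch is a hypothetical helper name; the figures quoted above (11MB down to 900k) correspond to a ratio of roughly 0.08.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for measuring how well extracted bytes compress.
class GzipSizeSketch {

    // Returns the gzip-compressed form of the input.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so close before reading the buffer.
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    // Compressed size as a fraction of the original size.
    static double ratio(byte[] raw) throws IOException {
        return (double) gzip(raw).length / raw.length;
    }
}
```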



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167208#comment-15167208
 ] 

Tim Allison commented on TIKA-1607:
---

Aside from XMP, I can't think of an example where we'd have multiple DOMs of 
the same type (property name).  For some (rare) PDF files, I could see having a 
DOM for XFA and one or more DOMs for XMP, but they'd be under different 
keys...in my current plan.

I could also see someone modifying an existing parser to generate a DOM to this 
type of field, say, by translating what we're pulling out of the metadata for a 
multimedia file into pbcore.


On the one hand, this is a hack on the way to your unified DOM proposal...basic 
users can get what they want from key/value, and advanced users who actually 
know a given standard can find what they need.

On the other, this would allow advanced users to extract potentially 
conflicting metadata (one XMP packet has dc:creator X, but the update XMP 
packet has dc:creator Y...and we even have this in one of our test files :)).  
By following the XMP standard (iirc), the more recent packet information would 
overwrite the earlier.  Some users will want the "standard" (dc:creator=Y); 
some advanced users might want "all" (dc:creator=X;Y).
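
The two views can be sketched as two merge policies over the packets. This is a simplification: each XMP packet is modeled as a flat map of property name to value, which ignores the real RDF/XML structure, and XmpPacketMergeSketch is a hypothetical class. The "later packet wins" rule follows the recollection of the XMP standard stated above rather than a checked reading of the spec.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; packets are modeled as flat property maps.
class XmpPacketMergeSketch {

    // "Standard" view: a later packet overwrites an earlier one.
    static Map<String, String> latestWins(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet);
        }
        return merged;
    }

    // "All" view: every distinct value is kept, in packet order.
    static Map<String, List<String>> keepAll(List<Map<String, String>> packets) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            for (Map.Entry<String, String> e : packet.entrySet()) {
                List<String> values = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                if (!values.contains(e.getValue())) {
                    values.add(e.getValue());
                }
            }
        }
        return merged;
    }
}
```

With a first packet holding dc:creator=X and an update packet holding dc:creator=Y, latestWins yields Y and keepAll yields the list X, Y.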


The initial motivation for giving access to the raw bytes: if we allow access 
to the raw bytes for a DOM, super advanced users could run their own content 
stripping that tolerates slightly dodgy/invalid XML, and we already have an 
example of invalid XMP in one of our multimedia files.

However, I'm persuaded that making "bytes" available could lead to disaster.




[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-25 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167135#comment-15167135
 ] 

Ray Gauss II commented on TIKA-1607:


I know there can be multiple XMP packets in a single file, but do we have many 
other examples where we'd need multiple DOMs associated with a single file?

I'm trying to understand if the metadata is really the right place for this.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154216#comment-15154216
 ] 

Tim Allison commented on TIKA-1607:
---

Thank you, Nick and Ray.  Y, I had a failure of imagination in the potentials 
for misuse, which is why I sought feedback.  Thank you!

Is your recommendation that we avoid this proposal altogether, or are you ok 
with the basic storage mechanism and the {{getDOM}} option?




[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154208#comment-15154208
 ] 

Nick Burch commented on TIKA-1607:
--

We have generally required those developing a parser to do more thinking, so 
that users of Tika don't need to. A random bytes bucket does seem to be going 
the other way, making it very easy for a parser developer to chuck random stuff 
into this "other" bucket, and putting all the work onto the now-confused user. 
So, like Ray, I'd advise against it.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154205#comment-15154205
 ] 

Ray Gauss II commented on TIKA-1607:


In my experience people gravitate towards 'other' buckets, i.e.: "I didn't know 
(bother to read) what the designated ones were so I just used 'other'".

{{getBytes}} feels like 'other'.

While people could still do really stupid things with {{getDOM}} if they wanted 
to, {{getBytes}} seems to encourage a developer to go ahead and try to use each 
frame of a 120fps 8K video as a 'metadata' value.  An extreme and unlikely 
example of course, but you get the gist.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149267#comment-15149267
 ] 

Tim Allison commented on TIKA-1607:
---

Y, probably.  We could add limits on length although we're not currently doing 
this with Metadata String values.  To be fair, of course, I realize that 
embedded binary metadata objects (XMP/XFA...) are typically longer than regular 
metadata values.
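
A length limit of the kind mentioned here could look roughly like the sketch below. BoundedBytesMetadataSketch and its setBytes/getBytes methods are hypothetical and are not the methods from the attached patch; the limit value is left to the caller because the thread does not propose one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a byte-valued store with a per-value length cap.
class BoundedBytesMetadataSketch {

    private final int maxLength;
    private final Map<String, byte[]> values = new LinkedHashMap<>();

    BoundedBytesMetadataSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    // Stores a defensive copy, or rejects the value if it exceeds the limit.
    void setBytes(String key, byte[] value) {
        if (value.length > maxLength) {
            throw new IllegalArgumentException(
                    key + ": " + value.length + " bytes exceeds limit of " + maxLength);
        }
        values.put(key, value.clone());
    }

    // Returns a copy of the stored bytes, or null if the key is absent.
    byte[] getBytes(String key) {
        byte[] stored = values.get(key);
        return stored == null ? null : stored.clone();
    }
}
```

Rejecting oversized values with an exception is only one choice; truncating or silently skipping them are alternatives with different trade-offs for parser authors.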

What else is in the can?

Or, is this just a plain bad idea?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149231#comment-15149231
 ] 

Ray Gauss II commented on TIKA-1607:


Are we opening a can of worms by encouraging the use of a byte array directly 
with no restrictions on length, etc.?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Pascal Essiembre (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148914#comment-15148914
 ] 

Pascal Essiembre commented on TIKA-1607:


In the case of XFA forms, the form IS the content. 

One issue I can see with not extracting XFA text as part of the content by 
default is that the generic message put there by PDF/XFA editors will be 
extracted as if it were legitimate content:

{noformat}
Please wait... 
  
If this message is not eventually replaced by the proper contents of the 
document, your PDF 
viewer may not be able to display this type of document. 
  
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or 
Linux® by 
visiting  http://www.adobe.com/go/reader_download. 
  
For more assistance with Adobe Reader visit  http://www.adobe.com/go/acrreader. 
  
Windows is either a registered trademark or a trademark of Microsoft 
Corporation in the United States and/or other countries. Mac is a trademark 
of Apple Inc., registered in the United States and other countries. Linux is 
the registered trademark of Linus Torvalds in the U.S. and other 
countries.
{noformat}

That message is an example inserted by PDF/XFA editors as a workaround for PDF 
viewers not supporting XFA and is not genuine content published by the author 
(the XFA forms are).

I'll support whichever way you pick, but I personally can't see use cases where 
extracting that workaround message is the intent when using Tika.  I do see 
value in keeping the entire DOM though.  Maybe you can do as you suggest, but 
"in addition" to returning the XFA text as the content?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148608#comment-15148608
 ] 

Tim Allison commented on TIKA-1607:
---

I'd like to turn something like the above "thought" into a proposal...

Over on TIKA-1857, [~pascal.essiembre] has opened a pull request to strip XFA 
contents into the ContentHandler for PDFDocuments.

It might be more elegant to store the XFA in the metadata object and let 
consumers process that stream.

Would anyone object to adding two new {{ValueType}}s of Property, BYTES and 
DOM?

Both would be stored as String values (base-64 encoded {{byte[]}}) in the 
regular {{Metadata}} object.

Similar to what we're doing with {{getDate()}} in the {{Metadata}} object, 
we'd add a {{getBytes(Property binaryProperty)}} that would return a decoded 
{{byte[]}}, and we could also add a {{getDOM(Property domProperty)}} that would 
return a {{org.w3c.dom.Document}}.

We could also store raw XMP by this mechanism.

Is this a reasonable first (half) step towards this issue?  Any objections?
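
To make the proposal concrete, here is a minimal sketch of the idea using only JDK classes. The class name {{BinaryMetadataSketch}}, the {{setBytes}} method and the key {{pdf:xfa}} are hypothetical stand-ins for illustration; this is not the actual {{Metadata}} class or the attached patch. It only assumes that values are kept as base-64 Strings and decoded on demand.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical String-only store; a stand-in for illustration, not Tika's Metadata. */
class BinaryMetadataSketch {
    private final Map<String, String> values = new HashMap<>();

    /** Encodes raw bytes as base-64 so the underlying store stays String-only. */
    void setBytes(String key, byte[] raw) {
        values.put(key, Base64.getEncoder().encodeToString(raw));
    }

    /** Plain String access keeps working and returns the base-64 text. */
    String get(String key) {
        return values.get(key);
    }

    /** Decodes the stored base-64 value, or returns null if the key is absent. */
    byte[] getBytes(String key) {
        String encoded = values.get(key);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    /** Parses the decoded bytes as XML, or returns null if the key is absent. */
    Document getDOM(String key) {
        byte[] raw = getBytes(key);
        if (raw == null) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Embedded XML comes from untrusted files, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(raw));
        } catch (Exception e) {
            throw new IllegalStateException("Value for " + key + " is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        BinaryMetadataSketch metadata = new BinaryMetadataSketch();
        byte[] xfa = "<xfa><field>name</field></xfa>".getBytes(StandardCharsets.UTF_8);
        metadata.setBytes("pdf:xfa", xfa);
        System.out.println(metadata.get("pdf:xfa")); // the base-64 text
        System.out.println(metadata.getDOM("pdf:xfa").getDocumentElement().getNodeName());
    }
}
```

Existing consumers that only call the String getter would see an opaque base-64 value, which is the backwards-compatibility trade-off of this half step.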



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-09-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747338#comment-14747338
 ] 

Tim Allison commented on TIKA-1607:
---

Thank you, [~rgauss], for your thoughtful responses and example code!


Y, you're absolutely right about POJOs and helper classes for the common 
elements.  Thank you.

I agree that most of my comments really had to do with pass through.

bq. we'd first have to 'merge' with the metadata being modeled by the parsers 
and could then allow access to the full DOM Document object which clients could 
easily serialize to a string if need be.

Agreed, my thought is a crude/knuckle-dragging/transitional approach for this 
type of merge using our current simple structure.  If there is an XMP packet 
(or multiple as you point out) or any other type of xmlified standard in a 
document, use our current simple structure and store the XMP as a String and 
let clients parse the String to DOM, or we could add a new property type ("DOM") 
with a helper method that returns a DOM object (similar to what we're doing now 
with {{getDate()}} and {{getInt()}}).  This would be in addition to pulling out 
the most commonly used Dublin Core elements (as we're doing now) into our 
current structure (or maybe not if there is a conflict with native metadata???).

If the xmlified standard doesn't exist in the document, but there is a 
known+obvious standard (e.g. PBCore), the parser could generate that XML String 
from the file's metadata and store it as an element in our current structure: 
Property pbcore...or similar.

Another benefit to this transitional approach is that we could store both the 
original XMP(s) (e.g.)  _and_ the native metadata, and we wouldn't have to 
worry about deconflicting...the user can recover that the XMP said the author 
was "Joe Smith" but the native metadata said the author was "Bob Doe".  
Currently, at least in the PDFParser, we're overwriting native metadata items 
with XMP metadata.  Perhaps, though, this is an edge case, and most users just 
want "one answer"...

Another benefit is that if there is something non-standard/unparseable in the 
stored XML string, the client could still recover the String (or byte[]?) that 
was stored in the original document via the current {{get(Property property)}}.

bq. bring these different sources into a unified persistence structure

The above "thought" (not even a proposal!) tentatively approaches that in an 
inelegant way, and it has a strong odor of hack that I don't like.  I very much 
appreciate the goal of a unified structure!



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-09-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719
 ] 

Ray Gauss II commented on TIKA-1607:


Hi [~talli...@mitre.org], apologies for the delay in responding here.

1. POJOs
bq. We might have better documentation of POJOs and compile-time guarantees 
about methods and typed values.

Agreed, but the DOM persistence doesn't preclude us from also using Java 
'helper' classes that know how to more easily get and set values for particular 
schemas that we'd like to focus on.

bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd 
and maintain it?

I'd vote for sticking as true to a specification's original schema as possible 
when there is one but whether we'd want to build and maintain for those that 
don't is a good question.

2. Passthrough
bq. why couldn't we literally pass that through via the String version of the 
xml?

I think we could, but we'd first have to 'merge' with the metadata being 
modeled by the parsers and could then allow access to the full DOM {{Document}} 
object which clients could easily serialize to a string if need be.

3. Serialization to JSON
There seem to be several libraries available that can help with XML to JSON, 
though I don't think this would belong in core.

4. Multilingual fields
Great question.  XMP uses RDF and xml:lang:
{noformat}

  
quick brown fox
rapido fox marrone
  

{noformat}
that's one possibility.
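
As a rough illustration of how a client might read such a multilingual value from a DOM, the sketch below assumes the usual XMP language-alternative shape: an {{rdf:Alt}} containing {{rdf:li}} items, each tagged with {{xml:lang}}. The enclosing {{dc:title}} element, the language tags {{x-default}} and {{it}}, and the class name {{LangAltSketch}} are assumptions made for the example, not taken from the patch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that maps xml:lang tags to the text of each rdf:li item. */
class LangAltSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static Map<String, String> readLangAlt(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace-based lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> byLang = new LinkedHashMap<>();
            NodeList items = doc.getElementsByTagNameNS(RDF_NS, "li");
            for (int i = 0; i < items.getLength(); i++) {
                Element li = (Element) items.item(i);
                // xml:lang lives in the reserved XML namespace.
                String lang = li.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                byLang.put(lang, li.getTextContent());
            }
            return byLang;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse language alternatives", e);
        }
    }

    public static void main(String[] args) {
        // Element name and language tags are assumed for the example.
        String xml =
            "<dc:title xmlns:dc='http://purl.org/dc/elements/1.1/' xmlns:rdf='" + RDF_NS + "'>"
          + "<rdf:Alt>"
          + "<rdf:li xml:lang='x-default'>quick brown fox</rdf:li>"
          + "<rdf:li xml:lang='it'>rapido fox marrone</rdf:li>"
          + "</rdf:Alt>"
          + "</dc:title>";
        System.out.println(readLangAlt(xml));
    }
}
```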

bq. I'm wondering if we want to add structure only where structured data 
doesn't exist within the document and let the client parse what they'd like out 
of structured metadata that is in the document?

This also relates to passthrough above but one thing to keep in mind is that 
the metadata we're parsing could be coming from several different parts of the 
binary.  For example, EXIF doesn't necessarily also live in XMP (though most 
apps also write it there these days) and there can be more than one XMP packet 
present in a file.  It would be nice to bring these different sources into a 
unified persistence structure, even if for simpler metadata everything lives at 
the top level.

bq. how do we transfer as much normalized/structured metadata as possible in as 
simple a way to the end user.

This also gets back to passthrough and the possibility of access to the full 
DOM {{Document}} object.

Thanks for keeping the discussion going.  We obviously need to take great care 
in changing such a fundamental area of the code.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713567#comment-14713567
 ] 

Tim Allison commented on TIKA-1607:
---

Not the same, but these two issues are related...how do we transfer as much 
normalized/structured metadata as possible in as simple a way to the end user.

It looks like TIKA-1607 addresses step IV in the metadata 
[Roadmap|http://wiki.apache.org/tika/MetadataRoadmap], and TIKA-1691 addresses 
step VI as defined in Revision #3 (dated 2012-04-30) of the Roadmap.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-25 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711801#comment-14711801
 ] 

Tim Allison commented on TIKA-1607:
---

[~rgauss], thank you for this demo code! I haven't had a chance to review 
thoroughly, and I'm sure I've missed plenty  (and, [~chrismattmann], I still 
need to send info for the metadata discussion over on OODT).  

I really like the non-POJO flexibility, the actual namespacing and the full 
blown DOM + XPath.  I side with you on avoiding the literal indexing 
(shoehorning) of keys if there are multiple complex values.  Your patch is just 
plain elegant.

You mention fleshing out the requirements list above, but I'm not sure there's 
much left to add. :)

Some half-baked thoughts:
# The flip side of POJO-bloat for every new metadata schema is unfettered 
flexibility/modifications.  We _might_ have better documentation of POJOs and 
compile-time guarantees about methods and typed values.  Schemas/xsds can 
enforce plenty, I know, but would we want to build an xsd and maintain it?  
That starts feeling like as much work as POJOs, but maybe not.
#  Your comment about passing through and the example of the vcard made me 
wonder...for complex structures (xmp, vcard), why couldn't we literally pass 
that through via the String version of the xml?  If a client has enough 
sophistication and knowledge of that structure to make use of it, why not pass 
it through literally and let them do the DOM parsing (I've been thinking about 
proposing this for the XMP that we're pulling out of PDFs and jpegs)?  The 
balancing act, of course, is determining which elements to pull into regular 
metadata values (e.g. Dublin Core, etc).
# How will we serialize to JSON, if that is desired?  XML dump for values for 
property of type DOM_Element?
# What would a multilingual field look like?

I'm wondering if we want to add structure only where structured data doesn't 
exist within the document and let the client parse what they'd like out of 
structured metadata that is in the document?  Or, in the case of multimedia, 
perhaps, generate PBCore XML as a value for a regular key.

No solutions, just thoughts...

Thank you again.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706706#comment-14706706
 ] 

Ray Gauss II commented on TIKA-1607:


Yes, by shoehorn I meant that the index is embedded in the key (in this case 
sub-group name) and that all parsers and consuming client apps must know to 
utilize that syntax rather than either a separate, explicit index field or a 
well defined structure like that of the DOM approach.

Perhaps we should flesh out a solid requirements list (possibly using the 
[comment 
above|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=14660441&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660441]
 as a starting point).



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-20 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704880#comment-14704880
 ] 

Ray Gauss II commented on TIKA-1607:


I did see that, but I was after full URI namespaces, i.e. 
{{http://purl.org/dc/elements/1.1/}}, not just prefixes.

The OODT approach looks like you'd have to shoehorn the index into the group 
name, much like the tika-ffmpeg workaround, rather than a more strictly defined 
structure.

OODT might support deeper structures in the inner {{Group}} class, but the 
public methods appear to only support a single level?  For example, how could 
one get to something like the value of the city of the 3rd contact's 2nd 
address, i.e. p1:contact[2]/p1:address[1]/p1:city?

We could mimic XPath syntax but the DOM approach allows us to use 
{{javax.xml.xpath.XPath}} processing.  From the [test mentioned 
above|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]:
{code:java}
String expression = "/tika:metadata/vcard:tel[1]/vcard:uri";
assertEquals(telUri, metadata.getValueByXPath(expression));
{code}

The DOM approach would also allow us to leverage things like attributes to 
further describe a particular metadata value in the future if need be.

We might also be able to pass through entire metadata structures that Tika 
hasn't explicitly modeled.

It's certainly a larger change, but I think it gives us a lot more options.
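
For readers who want to see the XPath side in isolation, the following is a minimal sketch using only the JDK's DOM and {{javax.xml.xpath}} APIs. The class {{XPathLookupSketch}}, its {{valueByXPath}} helper and the {{tika}} namespace URI are placeholders invented for the example; they are not the {{getValueByXPath}} implementation from the linked test. The {{vcard}} URI is the xCard namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Hypothetical helper showing namespace-aware XPath lookups over a metadata DOM. */
class XPathLookupSketch {
    // Prefix-to-URI bindings; the tika URI is a placeholder chosen for this sketch.
    static final Map<String, String> NAMESPACES = Map.of(
            "tika", "http://example.org/tika/metadata",
            "vcard", "urn:ietf:params:xml:ns:vcard-4.0");

    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
        }
        @Override public String getPrefix(String uri) {
            return null; // reverse lookup is not needed for evaluation
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return Collections.emptyIterator();
        }
    };

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Returns the string value of the first node matching the expression ("" if none). */
    static String valueByXPath(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<tika:metadata xmlns:tika='http://example.org/tika/metadata'"
          + " xmlns:vcard='urn:ietf:params:xml:ns:vcard-4.0'>"
          + "<vcard:tel><vcard:uri>tel:+1-555-0100</vcard:uri></vcard:tel>"
          + "<vcard:tel><vcard:uri>tel:+1-555-0199</vcard:uri></vcard:tel>"
          + "</tika:metadata>");
        System.out.println(valueByXPath(doc, "/tika:metadata/vcard:tel[1]/vcard:uri"));
        System.out.println(valueByXPath(doc, "/tika:metadata/vcard:tel[2]/vcard:uri"));
    }
}
```

Note that XPath positions are 1-based, so {{vcard:tel[1]}} selects the first telephone entry.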



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-20 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705623#comment-14705623
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

Hi Ray. I'm not sure what "shoehorn the index into the group" means - do you 
mean, e.g., indexing arrays? If so, I don't see how it needs to be shoehorned 
into the group or anything. By default keys are multi-valued metadata - thus if 
you have an array and you want the ith key, e.g., as in your example above, 
OODT metadata (and, I hope, Tika metadata) would (using / as a delimiter):

1. Find the group tika:metadata; and if it exists
2. Find its sub-group, vcard:tel[1]; and if it exists
3. Grab the value vcard:uri out of it.
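
The three steps above can be sketched with a small nested-group structure. {{GroupSketch}} and its methods are hypothetical names made up for this illustration and do not reflect OODT's actual {{Metadata}} or {{Group}} API; the sketch only assumes that groups nest by name and that a path is split on {{/}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical nested group: named sub-groups plus key/value pairs at each level. */
class GroupSketch {
    private final Map<String, GroupSketch> children = new LinkedHashMap<>();
    private final Map<String, String> values = new LinkedHashMap<>();

    /** Returns the named sub-group, creating it on first use. */
    GroupSketch child(String name) {
        return children.computeIfAbsent(name, n -> new GroupSketch());
    }

    void put(String key, String value) {
        values.put(key, value);
    }

    /**
     * Walks a '/'-delimited path: every segment except the last names a
     * sub-group, and the last names a value. Returns null as soon as any
     * group or the final key is missing.
     */
    String lookup(String path) {
        String[] parts = path.split("/");
        GroupSketch current = this;
        for (int i = 0; i < parts.length - 1; i++) {
            current = current.children.get(parts[i]);
            if (current == null) {
                return null;
            }
        }
        return current.values.get(parts[parts.length - 1]);
    }

    public static void main(String[] args) {
        GroupSketch root = new GroupSketch();
        root.child("tika:metadata").child("vcard:tel[1]").put("vcard:uri", "tel:+1-555-0100");
        System.out.println(root.lookup("tika:metadata/vcard:tel[1]/vcard:uri"));
        System.out.println(root.lookup("tika:metadata/vcard:tel[2]/vcard:uri")); // no such group
    }
}
```

In this sketch the index {{[1]}} is simply part of the sub-group's name, so producers and consumers have to agree on that naming convention.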





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703966#comment-14703966
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

[~rgauss] did you look at the OODT metadata?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703924#comment-14703924
 ] 

Ray Gauss II commented on TIKA-1607:


I've put together the start of the DOM metadata store option on [GitHub as 
well|https://github.com/apache/tika/compare/trunk...rgauss:trunk].

The crux of the change is using a {{org.w3c.dom.Document}} object instead of a 
{{Map<String, String[]>}} as the metadata store and Property objects based on 
{{QName}}s instead of Strings.

A few things to note:
* This does bring in commons-lang for XML escaping, which we could change if 
need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but 
I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed

The [test 
added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]
 demonstrates creating a DOM structure, adding it to the metadata, then pulling 
it out both programmatically and via XPath expression (sticking to the 
telephone number example).

That programmatic creation of the DOM structure is a bit cumbersome and we 
could certainly employ Java classes specific to each standard as a convenience 
(somewhat similar to [~talli...@mitre.org]'s proposal), but I do like the 
generic nature of the DOM store.

The {{toString}} method of the metadata object after building that example is 
properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tika:metadata xmlns:tika="http://tika.apache.org/">
  <vcard:tel xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
    <vcard:parameters>
      <vcard:type>
        <vcard:text>work</vcard:text>
      </vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}
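
For readers who don't want to open the linked test, here is a rough, self-contained sketch of the same idea using only JDK classes: build the telephone structure as a namespaced DOM, then read the URI back with an XPath expression. The class and method names (DomMetadataSketch, build, firstTelUri) are hypothetical and are not taken from the fork; only the element names and namespace URIs come from the XML above.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch; not the code in the GitHub fork.
final class DomMetadataSketch {
    static final String TIKA_NS = "http://tika.apache.org/";
    static final String VCARD_NS = "urn:ietf:params:xml:ns:vcard-4.0";

    // Programmatically creates the tika:metadata/vcard:tel structure.
    static Document build() {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElementNS(TIKA_NS, "tika:metadata");
        doc.appendChild(root);
        Element tel = doc.createElementNS(VCARD_NS, "vcard:tel");
        root.appendChild(tel);
        Element parameters = doc.createElementNS(VCARD_NS, "vcard:parameters");
        tel.appendChild(parameters);
        Element type = doc.createElementNS(VCARD_NS, "vcard:type");
        parameters.appendChild(type);
        Element text = doc.createElementNS(VCARD_NS, "vcard:text");
        text.setTextContent("work");
        type.appendChild(text);
        Element uri = doc.createElementNS(VCARD_NS, "vcard:uri");
        uri.setTextContent("tel:+1-800-555-1234");
        tel.appendChild(uri);
        return doc;
    }

    // Pulls the first telephone URI back out via a namespace-aware XPath.
    static String firstTelUri(Document doc) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                if ("tika".equals(prefix)) return TIKA_NS;
                if ("vcard".equals(prefix)) return VCARD_NS;
                return XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceUri) {
                return null; // not needed for evaluation in this sketch
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceUri) {
                return Collections.<String>emptyIterator();
            }
        });
        try {
            return xpath.evaluate("/tika:metadata/vcard:tel[1]/vcard:uri", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The verbosity of build is the cumbersome part mentioned above; convenience classes per standard could wrap it without changing the underlying store.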

There's obviously lots of room for improvement and discussion but I wanted to 
put it out there before the momentum on this slows.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704108#comment-14704108
 ] 

Ray Gauss II commented on TIKA-1607:


[~chrismattmann], I did.

It seemed more similar to the XPath-like workaround I described with the notion 
of groups in the store, rather than the full-fledged DOM store proposed in the 
GitHub fork, i.e. I didn't see where anything was namespaced.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704225#comment-14704225
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

Awesome. Yeah, namespaces can be achieved by grouping, e.g., DublinCore/Author, 
or DublinCore/Creator, or EXIF/ImageHeight. You can also use _ to namespace. 
Make sense? I think if we're going to do this in Tika I would much prefer doing 
it the way OODT did it (whose Metadata container, by the way, evolved at the 
same time as Tika's, and the two drew from one another).
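
A rough sketch of namespacing by grouping over a flat, multi-valued store. GroupedKeySketch and its methods are hypothetical and are not the OODT Metadata API; the values used in the usage note ("Lewis", "480") are placeholders for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the OODT or Tika Metadata API.
final class GroupedKeySketch {
    private final Map<String, List<String>> store = new LinkedHashMap<>();

    // Flat, ungrouped keys keep working unchanged.
    void add(String key, String value) {
        store.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // A group acts as a namespace: the stored key is "group/name".
    void add(String group, String name, String value) {
        add(group + "/" + name, value);
    }

    String[] getValues(String key) {
        List<String> values = store.get(key);
        return values == null ? new String[0] : values.toArray(new String[0]);
    }

    // Lists the stored keys whose path starts with the given group.
    List<String> keysInGroup(String group) {
        List<String> keys = new ArrayList<>();
        for (String key : store.keySet()) {
            if (key.startsWith(group + "/")) {
                keys.add(key);
            }
        }
        return keys;
    }
}
```

For example, add("DublinCore", "Creator", "Lewis") and add("EXIF", "ImageHeight", "480") store the keys DublinCore/Creator and EXIF/ImageHeight, while a plain add("title", "...") behaves as it does today.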



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681243#comment-14681243
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

Hey Tim, it doesn't require a NASA login, just a login on the OODT JIRA. Can 
you email me offline and I'll grant one?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681240#comment-14681240
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

Ray, this is the exact approach we took in OODT. It's fully backwards 
compatible. We used / (and/or _) as a group and path prefix and then supported 
a wide array of functions for backwards compatibility.

See: 
http://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/Metadata.java



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662477#comment-14662477
 ] 

Tim Allison commented on TIKA-1607:
---

I put my proposal for POJO values on [github| 
https://github.com/tballison/tika/tree/TIKA-1607-POJO-Values].  There are some 
things that I don't like about my proposal, and I really like the list that 
[~rgauss] put together of things that would be good to do.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660283#comment-14660283
 ] 

Chris A. Mattmann commented on TIKA-1607:
-

I'm confused about Ray's Tika FFMPEG that you're talking about. Also, [~gostep], 
the stuff being talked about here (how to handle properties and typed values, 
names, etc.) is precisely what I think you were trying to get at with your 
proposal, so you should probably comment here. 

Good job on actually producing code for this, [~talli...@apache.org]. I'd like 
to take a look at it more before commenting further. One thing I know too is 
that the [OODT Metadata 
Object|https://oodt.jpl.nasa.gov/jira/si/jira.issueviews:issue-html/OODT-303/OODT-303.html]
 discussion that we had internally at JPL a long time ago is EXTREMELY similar 
to this one and it should be considered. I pointed [~lewismc] at this during 
the initial discussion.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661353#comment-14661353
 ] 

David Smiley commented on TIKA-1607:


TIKA isn't my area of expertise, but I think it should try and expose metadata 
using types that don't require dependencies, except for perhaps XML DOM or 
whatever JSON's DOM equivalent is (I don't think there is one in the JDK).  WKT 
strings could make sense as a spatial type specifically; for simple points I 
wouldn't though.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660304#comment-14660304
 ] 

Tim Allison commented on TIKA-1607:
---

[~chrismattmann], any and all feedback would be great.  The link you sent 
requires a NASA login.  I'm not a rocket scientist, so no luck. :( :)




[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441
 ] 

Ray Gauss II commented on TIKA-1607:


To clarify, the work mentioned above that uses an XPath-like syntax is only a 
workaround for mapping structured metadata into the current 'flat' metadata 
model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe. 
(maybe that's now?)

This is simply restating some of what's already been said, but there are many 
aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our 
serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM 
({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and 
getting child nodes, etc. rather than hard coding POJOs for each metadata 
standard we want to support.
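
As a rough illustration of how a DOM store might still serve simple string key/values, here is a minimal sketch using only JDK classes. Everything in it is an assumption made for illustration: the class name DomBackedMetadataSketch, the "metadata" root element, and the choice to hold flat values in "entry" elements with a "name" attribute (which avoids having to turn arbitrary keys into valid XML element names).

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not Tika's Metadata class.
final class DomBackedMetadataSketch {
    private final Document doc = newDocument();
    private final Element root = doc.createElement("metadata");

    DomBackedMetadataSketch() {
        doc.appendChild(root);
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simple string accessor: each value becomes <entry name="...">value</entry>.
    void add(String name, String value) {
        Element entry = doc.createElement("entry");
        entry.setAttribute("name", name);
        entry.setTextContent(value);
        root.appendChild(entry);
    }

    // Multi-valued read, matching the existing String[] style of access.
    String[] getValues(String name) {
        List<String> out = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node instanceof Element
                    && "entry".equals(node.getNodeName())
                    && name.equals(((Element) node).getAttribute("name"))) {
                out.add(node.getTextContent());
            }
        }
        return out.toArray(new String[0]);
    }

    String get(String name) {
        String[] values = getValues(name);
        return values.length == 0 ? null : values[0];
    }

    // Structured access: callers may append arbitrary child nodes here.
    Element root() {
        return root;
    }
}
```

Top-level discovery then amounts to iterating the root's child nodes, and structured standards could append their own subtrees under the same root.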

I'll try to find some time to put together an example patch for that approach 
in the next few days.

 Introduce new arbitrary object key/values data structure for persistence of 
 Tika Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.10

 Attachments: TIKA-1607v1_rough_rough.patch, 
 TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch


 I am currently working on implementing more comprehensive extraction and 
 enhancement of the Tika support for phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  Object
 {code}
 Where Object could be a Collection<HashMap<String/Property, 
 HashMap<String/Property, String/Int/Long>>> e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach... 
 additionally, it is a fundamental change to the core Metadata API. I hope, 
 however, that the String-to-Object mapping is flexible enough to allow me to 
 model Tika Metadata the way I want.
 Any comments, folks? Thanks
 Lewis
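
A minimal sketch of the String-to-Object mapping proposed in the description above, assuming a plain map-backed store. The class {{ObjectMetadataSketch}} and its methods are hypothetical and are not the Tika {{Metadata}} API. It also shows one possible answer to the backwards compatibility concern: a flattened String[] view of a structured value.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: this is not the Tika Metadata class.
class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<String, Object>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    /**
     * Backwards-compatible view: flattens whatever is stored to a String[].
     * A Map contributes its keys, a Collection its elements, anything else itself.
     */
    String[] getValues(String key) {
        Object v = values.get(key);
        List<String> out = new ArrayList<String>();
        if (v instanceof Map) {
            for (Object k : ((Map<?, ?>) v).keySet()) {
                out.add(String.valueOf(k));
            }
        } else if (v instanceof Collection) {
            for (Object o : (Collection<?>) v) {
                out.add(String.valueOf(o));
            }
        } else if (v != null) {
            out.add(String.valueOf(v));
        }
        return out.toArray(new String[0]);
    }

    /** Builds the phone number example: number -> (property -> value). */
    static ObjectMetadataSketch example() {
        Map<String, String> first = new LinkedHashMap<String, String>();
        first.put("LibPN-CountryCode", "US");
        first.put("LibPN-NumberType", "International");

        Map<String, String> second = new LinkedHashMap<String, String>();
        second.put("LibPN-CountryCode", "UK");
        second.put("LibPN-NumberType", "International");

        Map<String, Map<String, String>> numbers =
                new LinkedHashMap<String, Map<String, String>>();
        numbers.put("+162648743476", first);
        numbers.put("+1292611054", second);

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        return metadata;
    }

    public static void main(String[] args) {
        ObjectMetadataSketch metadata = example();
        // Old-style callers still see a flat list of numbers.
        for (String number : metadata.getValues("phonenumbers")) {
            System.out.println(number);
        }
    }
}
```

In this sketch, callers that only understand the String[] paradigm keep working through getValues, while newer callers can cast the result of get to reach the per-number properties.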





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660145#comment-14660145
 ] 

Tim Allison commented on TIKA-1607:
---

Doh! A related point: binary values. At some point I think Jukka(?) suggested 
putting thumbnails of an embedded document in that doc's metadata.
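
One way binary values such as thumbnails could fit into a string-valued store is Base64 encoding. The sketch below is only an assumption about how that might look; the "thumbnail" key and the class name are made up for illustration, and nothing in this thread settles on this representation.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: binary metadata carried as Base64 text.
class BinaryMetadataSketch {

    /** Encodes raw bytes so they can live in a String-valued metadata field. */
    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Decodes a Base64 metadata value back into the original bytes. */
    static byte[] decode(String value) {
        return Base64.getDecoder().decode(value);
    }

    public static void main(String[] args) {
        // Stand-in for real thumbnail bytes.
        byte[] thumbnail = new byte[] {1, 2, 3};

        Map<String, String> metadata = new HashMap<String, String>();
        metadata.put("thumbnail", encode(thumbnail)); // assumed key name

        byte[] restored = decode(metadata.get("thumbnail"));
        System.out.println(Arrays.equals(thumbnail, restored));
    }
}
```

The trade-off is size: Base64 inflates the payload by roughly a third, so a store with a native byte[] value type would be more compact.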



