[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196042#comment-15196042 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

bq. I'm also aware that we've strayed a bit from the original issue here of structured metadata. Should we create a separate issue?

Agreed. Thank you. Let's move this to TIKA-1903.

> Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.13
>
>         Attachments: TIKA-1607_bytes_dom_values.patch, TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for Phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...)
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the code Metadata API. I hope that
> the Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
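The {{String: Object}} shape proposed in the issue description can be sketched in plain Java without touching Tika's real {{Metadata}} class. Everything named below ({{StructuredMetadataSketch}}, {{PhoneNumberValue}}, and their methods) is hypothetical and exists only for illustration; the {{getValues}} method shows one possible way to keep a {{String[]}}-style view for existing callers, which is an assumption rather than anything decided in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a structured metadata store; NOT Tika's Metadata class. */
class StructuredMetadataSketch {
    // Each key maps to a list of arbitrary Object values instead of a String[].
    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    void add(String key, Object value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    List<Object> get(String key) {
        List<Object> stored = values.get(key);
        if (stored == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(stored);
    }

    /** Flattened view resembling today's String[] multi-value support. */
    String[] getValues(String key) {
        List<Object> stored = get(key);
        String[] flat = new String[stored.size()];
        for (int i = 0; i < flat.length; i++) {
            flat[i] = String.valueOf(stored.get(i));
        }
        return flat;
    }
}

/** Hypothetical structured value: a phone number plus named attributes. */
class PhoneNumberValue {
    final String number;
    final Map<String, String> attributes = new LinkedHashMap<>();

    PhoneNumberValue(String number) {
        this.number = number;
    }

    PhoneNumberValue with(String name, String value) {
        attributes.put(name, value);
        return this;
    }

    @Override
    public String toString() {
        // The flattened form is just the number, matching the current String[] layout.
        return number;
    }
}
```

A caller would add {{new PhoneNumberValue("+162648743476").with("LibPN-CountryCode", "US")}} under the {{phonenumbers}} key and read it back either as an object or through {{getValues}}. The sketch sidesteps the backwards compatibility concerns raised in the description; it only shows the data shape.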
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196030#comment-15196030 ]

Ray Gauss II commented on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding correctly that alone would still force users to write 'Tika-based' XMP parsers rather than allowing them access to the RAW XMP encoded bytes you're referring to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way that hopefully doesn't require sweeping changes to the parsers (I'm thinking of this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method currently takes a {{Metadata}} object which is only associated with the embedded resource (not the same metadata object associated with the 'container' file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:
{code}
/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources or null if no resources
     * were stored during parsing
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();
}
{code}
then modify ParsingEmbeddedDocumentExtractor to implement it with an option which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor that users could set in the context?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to include some {{EmbeddedResources}} object to be optionally populated along with the {{Metadata}} in the {{Parser.parse}} method?

Other options? Maybe they don't need the RAW XMP?

I'm also aware that we've strayed a bit from the original issue here of structured metadata. Should we create a separate issue?
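Options 1 and 2 in Ray's comment above (an extractor that keeps embedded resources in memory, enabled by an option) might look roughly like the following. This is a sketch under stated assumptions: the interfaces are simplified, hypothetical stand-ins so the example compiles without Tika on the classpath, and they do not reproduce the real {{EmbeddedDocumentExtractor}} signature. The {{Map<String, byte[]>}} element types are also an assumption, since the comment's snippet does not show them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, hypothetical stand-in for Tika's EmbeddedDocumentExtractor. */
interface SimpleEmbeddedExtractor {
    void parseEmbedded(InputStream stream, String resourceName) throws IOException;
}

/** Hypothetical storing variant, modelled on Option 1 of the comment. */
interface SimpleStoringExtractor extends SimpleEmbeddedExtractor {
    /** Returns the stored resources, or null if nothing was stored during parsing. */
    Map<String, byte[]> getEmbeddedResources();
}

/** In-memory implementation with a switch that 'turns it on'. */
class InMemoryStoringExtractor implements SimpleStoringExtractor {
    private final boolean storeEnabled;
    private Map<String, byte[]> resources; // stays null until something is stored

    InMemoryStoringExtractor(boolean storeEnabled) {
        this.storeEnabled = storeEnabled;
    }

    @Override
    public void parseEmbedded(InputStream stream, String resourceName) throws IOException {
        // Read the whole embedded resource into memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        // A real extractor would hand the bytes to a parser at this point.
        if (storeEnabled) {
            if (resources == null) {
                resources = new LinkedHashMap<>();
            }
            resources.put(resourceName, buffer.toByteArray());
        }
    }

    @Override
    public Map<String, byte[]> getEmbeddedResources() {
        return resources;
    }
}
```

With the switch off the extractor behaves as before and {{getEmbeddedResources}} returns null, which matches the javadoc in the comment. Holding every resource in memory carries the cost the thread warns about, so an advanced user would enable it knowingly.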
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195882#comment-15195882 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Y, makes sense. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no? That would allow for all of the benefits of easy configurability (thank you, Nick!) of parsers for advanced handling of XMP. Or is there a reason to avoid that option?

The one benefit to storing encoded bytes in the metadata object is that clients in other languages could decode and process. That said, I much prefer the direction you're recommending here.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195326#comment-15195326 ]

Ray Gauss II commented on TIKA-1607:

Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment).

We can currently dump stuff to files in some parsers with the {{--extract}} CLI option which sticks a {{FileEmbeddedDocumentExtractor}} in the context. The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}.

Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when enabled, would also save the embedded resources in memory for an advanced user to do whatever they need, knowing the risk and resources required for that option?

Or provide some other in-memory implementation that advanced users could explicitly set in the context?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193845#comment-15193845 ]

Ray Gauss II commented on TIKA-1607:

Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178059#comment-15178059 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

FWIW, I extracted ~300k XMPs and XFAs from some of our corpus. The largest file was an XFA weighing in at 11MB (compressible via gzip to 900k).
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167208#comment-15167208 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Aside from XMP, I can't think of an example where we'd have multiple DOMs of the same type (property name). For some (rare) PDF files, I could see having a DOM for XFA and one or more DOMs for XMP, but they'd be under different keys...in my current plan.

I could also see someone modifying an existing parser to generate a DOM to this type of field, say, by translating what we're pulling out of the metadata for a multimedia file into pbcore.

On the one hand, this is a hack on the way to your unified DOM proposal...basic users can get what they want from key/value, and advanced users who actually know a given standard can find what they need.

On the other, this would allow advanced users to extract potentially conflicting metadata (one XMP packet has dc:creator X, but the update XMP packet has dc:creator Y...and we even have this in one of our test files :)). By following the XMP standard (iirc), the more recent packet information would overwrite the earlier. Some users will want the "standard" (dc:creator=Y); some advanced users might want "all" (dc:creator=X;Y).

The initial motivation for giving access to the raw bytes...if we allow access to the raw bytes for a DOM, this could also allow super advanced users to run their own content stripping that might not care about slightly dodgy/invalid xml, and we already have an example of invalid XMP in one of our multimedia files.

However, I'm persuaded that making "bytes" available could lead to disaster.
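The two views Tim describes for conflicting XMP packets, "standard" (the later packet wins, dc:creator=Y) and "all" (dc:creator=X;Y), can be sketched as two merge policies. This is a minimal illustration, assuming each packet has already been flattened to a map of property name to a single string value; the class and method names are hypothetical, and the "later packet overwrites" rule is taken from the comment's own recollection of the XMP standard rather than verified here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; packets are assumed to be pre-flattened property maps. */
class XmpPacketMergeSketch {

    /** "Standard" view: a value from a later packet replaces the earlier one. */
    static Map<String, String> latestWins(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet);
        }
        return merged;
    }

    /** "All" view: keep every distinct value per property, in packet order. */
    static Map<String, List<String>> keepAll(List<Map<String, String>> packets) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            for (Map.Entry<String, String> entry : packet.entrySet()) {
                List<String> seen = merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
                if (!seen.contains(entry.getValue())) {
                    seen.add(entry.getValue());
                }
            }
        }
        return merged;
    }
}
```

Given a first packet with dc:creator=X and an update packet with dc:creator=Y, {{latestWins}} reports Y and {{keepAll}} reports both X and Y, which mirrors the two user expectations in the comment.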
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167135#comment-15167135 ]

Ray Gauss II commented on TIKA-1607:

I know there can be multiple XMP packets in a single file, but do we have many other examples where we'd need multiple DOMs associated with a single file?

I'm trying to understand if the metadata is really the right place for this.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154216#comment-15154216 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Thank you, Nick and Ray. Y, I had a failure of imagination in the potentials for misuse, which is why I sought feedback. Thank you!

Is your recommendation that we avoid this proposal altogether, or are you ok with the basic storage mechanism and the {{getDOM}} option?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154208#comment-15154208 ]

Nick Burch commented on TIKA-1607:
----------------------------------

We have generally required those developing a parser to do more thinking, so that users of Tika don't need to. A random bytes bucket does seem to be going the other way, making it very easy for a parser developer to chuck random stuff into this "other" bucket, and putting all the work onto the now-confused user.

So, like Ray, I'd advise against it.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154205#comment-15154205 ]

Ray Gauss II commented on TIKA-1607:

In my experience people gravitate towards 'other' buckets, i.e.: "I didn't know (bother to read) what the designated ones were so I just used 'other'". {{getBytes}} feels like 'other'.

While people could still do really stupid things with {{getDOM}} if they wanted to, {{getBytes}} seems to encourage a developer to go ahead and try to use each frame of a 120fps 8K video as a 'metadata' value. An extreme and unlikely example of course, but you get the gist.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149267#comment-15149267 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Y, probably. We could add limits on length although we're not currently doing this with Metadata String values. To be fair, of course, I realize that embedded binary metadata objects (XMP/XFA...) are typically longer than regular metadata values.

What else is in the can? Or, is this just a plain bad idea?
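The "limits on length" idea in Tim's comment above could be enforced at the point where a byte value is stored. The following is a hypothetical sketch only: {{BoundedBytesStoreSketch}} is not part of Tika, and the choice to reject oversized values (rather than truncate them or throw) is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical byte-value store with a per-value size cap; not a Tika class. */
class BoundedBytesStoreSketch {
    private final int maxBytesPerValue;
    private final Map<String, byte[]> values = new LinkedHashMap<>();

    BoundedBytesStoreSketch(int maxBytesPerValue) {
        if (maxBytesPerValue < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.maxBytesPerValue = maxBytesPerValue;
    }

    /** Stores a defensive copy; returns false and stores nothing if the value is over the limit. */
    boolean put(String key, byte[] value) {
        if (value.length > maxBytesPerValue) {
            return false;
        }
        values.put(key, value.clone());
        return true;
    }

    /** Returns a copy of the stored bytes, or null if the key is absent. */
    byte[] get(String key) {
        byte[] stored = values.get(key);
        return stored == null ? null : stored.clone();
    }
}
```

A cap bounds each value but not the number of values, so it narrows the can of worms without closing it.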
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149231#comment-15149231 ]

Ray Gauss II commented on TIKA-1607:

Are we opening a can of worms by encouraging the use of a byte array directly with no restrictions on length, etc.?

> Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
> ---
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and enhancement of the Tika support for phone number extraction and metadata modeling.
> Right now we utilize the String[] multivalued support available within Tika to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the String[] paradigm by implementing a more abstract Collection of Objects such that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), (LibPN-NumberType: International), (etc: etc)...), (+1292611054: (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach... additionally it is a fundamental change to the core Metadata API. I hope that the <String, Object> Mapping however is flexible enough to allow me to model Tika Metadata the way I want.
> Any comments folks?
> Thanks
> Lewis

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
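For readers following the issue description, the proposed "Metadata: String: Object" shape can be pictured with plain Java collections. This is only a rough sketch under stated assumptions: `StructuredMetadataSketch`, `addPhoneNumber` and `attribute` are hypothetical names, not part of the Tika Metadata API, and a map keyed by phone number stands in for the nested collection described in the issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plain collections only, NOT the Tika Metadata API.
final class StructuredMetadataSketch {

    // Top level mirrors the proposed "Metadata: String: Object" mapping.
    private final Map<String, Object> values = new LinkedHashMap<>();

    // Stores one phone number together with its descriptive attributes.
    void addPhoneNumber(String number, Map<String, String> attributes) {
        @SuppressWarnings("unchecked")
        Map<String, Map<String, String>> numbers = (Map<String, Map<String, String>>)
                values.computeIfAbsent("phonenumbers",
                        key -> new LinkedHashMap<String, Map<String, String>>());
        numbers.put(number, new LinkedHashMap<>(attributes));
    }

    // Returns one attribute of one phone number, or null if either is unknown.
    @SuppressWarnings("unchecked")
    String attribute(String number, String attributeName) {
        Map<String, Map<String, String>> numbers =
                (Map<String, Map<String, String>>) values.get("phonenumbers");
        if (numbers == null || !numbers.containsKey(number)) {
            return null;
        }
        return numbers.get(number).get(attributeName);
    }

    public static void main(String[] args) {
        StructuredMetadataSketch metadata = new StructuredMetadataSketch();
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", "US");
        attributes.put("LibPN-NumberType", "International");
        metadata.addPhoneNumber("+162648743476", attributes);
        System.out.println(metadata.attribute("+162648743476", "LibPN-CountryCode")); // prints US
    }
}
```

The unchecked casts are the price of an `Object`-valued map, which is one face of the backwards-compatibility and typing concern raised in the description.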
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148914#comment-15148914 ]

Pascal Essiembre commented on TIKA-1607:

In the case of XFA forms, the form IS the content. One issue I can see with not extracting XFA text as part of the content by default is that the generic message inserted by PDF/XFA editors will be extracted as if it were legitimate content:
{noformat}
Please wait...
If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download.
For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.
Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
{noformat}
That message is a placeholder inserted by PDF/XFA editors as a workaround for PDF viewers that do not support XFA; it is not genuine content published by the author (the XFA forms are).

I'll support whichever way you pick, but I personally can't see use cases where extracting that workaround message is the intent when using Tika. I do see value in keeping the entire DOM though. Maybe you can do as you suggest, but "in addition" to returning the XFA text as the content?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148608#comment-15148608 ]

Tim Allison commented on TIKA-1607:
---

I'd like to turn something like the above "thought" into a proposal...

Over on TIKA-1857, [~pascal.essiembre] has opened a pull request to strip XFA contents into the ContentHandler for PDFDocuments. It might be more elegant to store the XFA in the metadata object and let consumers process that stream.

Would anyone object to adding two new {{ValueType}}s of Property: BYTES and DOM? Both would be stored as String values (base-64 encoded {{byte[]}}) in the regular {{Metadata}} object. Similar to what we're doing with {{getDate()}} in the {{Metadata}} object, we'd add a {{getBytes(Property binaryProperty)}} that would return a decoded {{byte[]}}, and we could also add a {{getDOM(Property domProperty)}} that would return a {{org.w3c.dom.Document}}.

We could also store raw XMP by this mechanism.

Is this a reasonable first (half) step towards this issue? Any objections?
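A rough illustration of the BYTES/DOM idea, for discussion only. `EncodedMetadataSketch` is a hypothetical stand-in for the {{Metadata}} object and takes plain String keys rather than a Property; the method names echo the proposal, but none of this is existing Tika API. Only JDK classes ({{java.util.Base64}}, {{javax.xml.parsers}}) are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical stand-in for Metadata; not existing Tika API.
final class EncodedMetadataSketch {

    // Everything is still persisted as a String, as in the current structure.
    private final Map<String, String> values = new HashMap<>();

    // Stores raw bytes as a base-64 encoded String value.
    void setBytes(String key, byte[] raw) {
        values.put(key, Base64.getEncoder().encodeToString(raw));
    }

    // Plain String access keeps working for clients that want the stored form.
    String get(String key) {
        return values.get(key);
    }

    // Decodes the stored String back into the original bytes, or null if absent.
    byte[] getBytes(String key) {
        String encoded = values.get(key);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    // Parses the decoded bytes as XML, or returns null if the key is absent.
    Document getDOM(String key) {
        byte[] raw = getBytes(key);
        if (raw == null) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // The bytes come from untrusted documents, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(raw));
        } catch (Exception e) {
            throw new IllegalStateException("Value is not well-formed XML: " + key, e);
        }
    }

    public static void main(String[] args) {
        EncodedMetadataSketch metadata = new EncodedMetadataSketch();
        byte[] xfa = "<form><field>a</field></form>".getBytes(StandardCharsets.UTF_8);
        metadata.setBytes("xfa", xfa);
        System.out.println(metadata.getDOM("xfa").getDocumentElement().getNodeName()); // prints form
    }
}
```

Base-64 inflates the stored value by roughly a third, so a length limit on what goes into `setBytes` may be worth discussing alongside this.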
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747338#comment-14747338 ]

Tim Allison commented on TIKA-1607:
---

Thank you, [~rgauss], for your thoughtful responses and example code!

Yes, you're absolutely right about POJOs and helper classes for the common elements. Thank you.

I agree that most of my comments really had to do with pass through.

bq. we'd first have to 'merge' with the metadata being modeled by the parsers and could then allow access to the full DOM Document object which clients could easily serialize to a string if need be.

Agreed. My thought is a crude/knuckle-dragging/transitional approach for this type of merge using our current simple structure. If there is an XMP packet (or multiple, as you point out) or any other type of xmlified standard in a document, use our current simple structure and store the XMP as a String and let clients parse the String to DOM, or we could add a new property type ("DOM") with a helper method that returns a DOM object (similar to what we're doing now with {{getDate()}} and {{getInt()}}). This would be in addition to pulling out the most commonly used Dublin Core elements (as we're doing now) into our current structure (or maybe not if there is a conflict with native metadata???).

If the xmlified standard doesn't exist in the document, but there is a known+obvious standard (e.g. PBCore), the parser could generate that XML String from the file's metadata and store it as an element in our current structure: Property pbcore...or similar.

Another benefit to this transitional approach is that we could store both the original XMP(s) (e.g.) _and_ the native metadata, and we wouldn't have to worry about deconflicting...the user can recover that the XMP said the author was "Joe Smith" but the native metadata said the author was "Bob Doe". Currently, at least in the PDFParser, we're overwriting native metadata items with XMP metadata. Perhaps, though, this is an edge case, and most users just want "one answer"...

Another benefit is that if there is something non-standard/unparseable in the stored XML string, the client could still recover the String (or byte[]?) that was stored in the original document via the current {{get(Property property)}}.

bq. bring these different sources into a unified persistence structure

The above "thought" (not even a proposal!) tentatively approaches that in an inelegant way, and it has a strong odor of hack that I don't like. I very much appreciate the goal of a unified structure!
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719 ]

Ray Gauss II commented on TIKA-1607:

Hi [~talli...@mitre.org], apologies for the delay in responding here.

1. POJOs

bq. We might have better documentation of POJOs and compile-time guarantees about methods and typed values.

Agreed, but the DOM persistence doesn't preclude us from also using Java 'helper' classes that know how to more easily get and set values for particular schemas that we'd like to focus on.

bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd and maintain it?

I'd vote for sticking as true to a specification's original schema as possible when there is one, but whether we'd want to build and maintain one for those that don't is a good question.

2. Passthrough

bq. why couldn't we literally pass that through via the String version of the xml?

I think we could, but we'd first have to 'merge' with the metadata being modeled by the parsers and could then allow access to the full DOM {{Document}} object which clients could easily serialize to a string if need be.

3. Serialization to JSON

There seem to be several libraries available that can help with XML to JSON, though I don't think this would belong in core.

4. Multilingual fields

Great question. XMP uses RDF and xml:lang:
{noformat}
quick brown fox
rapido fox marrone
{noformat}
that's one possibility.

bq. I'm wondering if we want to add structure only where structured data doesn't exist within the document and let the client parse what they'd like out of structured metadata that is in the document?

This also relates to passthrough above, but one thing to keep in mind is that the metadata we're parsing could be coming from several different parts of the binary. For example, EXIF doesn't necessarily also live in XMP (though most apps also write it there these days) and there can be more than one XMP packet present in a file. It would be nice to bring these different sources into a unified persistence structure, even if for simpler metadata everything lives at the top level.

bq. how do we transfer as much normalized/structured metadata as possible in as simple a way to the end user.

This also gets back to passthrough and the possibility of access to the full DOM {{Document}} object.

Thanks for keeping the discussion going. We obviously need to take great care in changing such a fundamental area of the code.
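On the multilingual point, a small sketch of how a client might read language alternatives out of an RDF/xml:lang structure once it has the DOM. The `SAMPLE` fragment, its language tags and the `LangAltSketch` class are assumptions made for illustration; the markup of the example in the comment above did not survive, so this is not a reconstruction of it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not Tika API: reads xml:lang alternatives from a DOM.
final class LangAltSketch {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Illustrative fragment only; element layout and language tags are assumed.
    static final String SAMPLE =
            "<dc:title xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
            + " xmlns:rdf=\"" + RDF_NS + "\">"
            + "<rdf:Alt>"
            + "<rdf:li xml:lang=\"en\">quick brown fox</rdf:li>"
            + "<rdf:li xml:lang=\"it\">rapido fox marrone</rdf:li>"
            + "</rdf:Alt>"
            + "</dc:title>";

    // Maps each xml:lang tag to the text of its rdf:li element, in document order.
    static Map<String, String> valuesByLanguage(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagNameNS(RDF_NS, "li");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String lang = item.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                result.put(lang, item.getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valuesByLanguage(SAMPLE).get("it")); // prints rapido fox marrone
    }
}
```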
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713567#comment-14713567 ]

Tim Allison commented on TIKA-1607:
---

Not the same, but these two issues are related: how do we transfer as much normalized/structured metadata as possible, in as simple a way as possible, to the end user?

It looks like TIKA-1607 addresses step IV in the metadata [Roadmap|http://wiki.apache.org/tika/MetadataRoadmap], and TIKA-1691 addresses step VI as defined in Revision #3 (dated 2012-04-30) of the Roadmap.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711801#comment-14711801 ]

Tim Allison commented on TIKA-1607:
---

[~rgauss], thank you for this demo code! I haven't had a chance to review thoroughly, and I'm sure I've missed plenty (and, [~chrismattmann], I still need to send info for the metadata discussion over on OODT).

I really like the non-POJO flexibility, the actual namespacing and the full blown DOM + XPath. I side with you on avoiding the literal indexing (shoehorning) of keys if there are multiple complex values. Your patch is just plain elegant. You mention fleshing out the requirements list above, but I'm not sure there's much left to add. :)

Some half-baked thoughts:
# The flip side of POJO-bloat for every new metadata schema is unfettered flexibility/modifications. We _might_ have better documentation of POJOs and compile-time guarantees about methods and typed values. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd and maintain it? That starts feeling like as much work as POJOs, but maybe not.
# Your comment about passing through and the example of the vcard made me wonder...for complex structures (xmp, vcard), why couldn't we literally pass that through via the String version of the xml? If a client has enough sophistication and knowledge of that structure to make use of it, why not pass it through literally and let them do the DOM parsing (I've been thinking about proposing this for the XMP that we're pulling out of PDFs and jpegs)? The balancing act, of course, is determining which elements to pull into regular metadata values (e.g. Dublin Core, etc).
# How will we serialize to JSON, if that is desired? XML dump for values for property of type DOM_Element?
# What would a multilingual field look like?

I'm wondering if we want to add structure only where structured data doesn't exist within the document and let the client parse what they'd like out of structured metadata that is in the document? Or, in the case of multimedia, perhaps, generate PBCore XML as a value for a regular key.

No solutions, just thoughts...

Thank you again.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706706#comment-14706706 ]

Ray Gauss II commented on TIKA-1607:

Yes, by shoehorn I meant that the index is embedded in the key (in this case the sub-group name) and that all parsers and consuming client apps must know to utilize that syntax rather than either a separate, explicit index field or a well defined structure like that of the DOM approach.

Perhaps we should flesh out a solid requirements list (possibly using the [comment above|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=14660441&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660441] as a starting point).
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704880#comment-14704880 ]

Ray Gauss II commented on TIKA-1607:

I did see that, but I was after full URI namespaces, i.e. {{http://purl.org/dc/elements/1.1/}}, not just prefixes.

The OODT approach looks like you'd have to shoehorn the index into the group name, much like the tika-ffmpeg workaround, rather than a more strictly defined structure.

OODT might support deeper structures in the inner {{Group}} class, but the public methods appear to only support a single level? For example, how could one get to something like the value of the city of the 3rd contact's 2nd address, i.e. p1:contact[2]/p1:address[1]/p1:city?

We could mimic XPath syntax but the DOM approach allows us to use {{javax.xml.xpath.XPath}} processing. From the [test mentioned above|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]:
{code:java}
String expression = "/tika:metadata/vcard:tel[1]/vcard:uri";
assertEquals(telUri, metadata.getValueByXPath(expression));
{code}

The DOM approach would also allow us to leverage things like attributes to further describe a particular metadata value in the future if need be.

We might also be able to pass through entire metadata structures that Tika hasn't explicitly modeled.

It's certainly a larger change, but I think it gives us a lot more options.
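For anyone who wants to try the XPath style of lookup outside the patch, here is a standalone sketch using only the JDK's {{javax.xml.xpath.XPath}}. The `XPathLookupSketch` class, the `urn:example:tika-metadata` namespace URI and the sample document are assumptions made for illustration, and the vCard URI is the xCard namespace as far as I know; the real structure is whatever the patch's {{getValueByXPath}} operates on.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch, not the patch's Metadata class.
final class XPathLookupSketch {

    static final String TIKA_NS = "urn:example:tika-metadata";        // assumed URI
    static final String VCARD_NS = "urn:ietf:params:xml:ns:vcard-4.0"; // xCard namespace

    // Illustrative document with two telephone entries.
    static final String SAMPLE =
            "<tika:metadata xmlns:tika=\"" + TIKA_NS + "\" xmlns:vcard=\"" + VCARD_NS + "\">"
            + "<vcard:tel><vcard:uri>tel:+1-555-0100</vcard:uri></vcard:tel>"
            + "<vcard:tel><vcard:uri>tel:+1-555-0101</vcard:uri></vcard:tel>"
            + "</tika:metadata>";

    // Evaluates an XPath expression against the XML and returns its string value.
    static String valueByXPath(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    if ("tika".equals(prefix)) return TIKA_NS;
                    if ("vcard".equals(prefix)) return VCARD_NS;
                    return XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String namespaceURI) {
                    return null; // not needed for evaluation
                }
                @Override
                public Iterator<String> getPrefixes(String namespaceURI) {
                    return Collections.emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("XPath lookup failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // XPath positions are 1-based, so tel[1] is the first telephone entry.
        System.out.println(valueByXPath(SAMPLE, "/tika:metadata/vcard:tel[1]/vcard:uri"));
    }
}
```

Because prefixes are resolved to full namespace URIs by the `NamespaceContext`, the expression keeps working even if a document binds the same namespaces to different prefixes.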
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705623#comment-14705623 ]

Chris A. Mattmann commented on TIKA-1607:
-

Hi Ray. I'm not sure what you mean by "shoehorn the index into the group" - do you mean, e.g., indexing arrays? If so, I don't see how it needs to be shoe-horned into the group or anything. By default keys are multi-valued metadata - thus if you have an array and you want the ith key, e.g., as in your example above, OODT metadata (and my hope Tika metadata) would (using / as a delimiter):

1. Find the group tika:metadata; and if it exists
2. Find its sub-group, vcard:tel[1]; and if it exists
3. Grab the value vcard:uri out of it.
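The three lookup steps above could be sketched roughly as follows. This is not the OODT Metadata API: `GroupPathSketch`, `Group` and `lookup` are hypothetical names, and the sketch only illustrates walking nested groups with "/" as the delimiter and multi-valued keys at the leaf.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of nested metadata groups; NOT the OODT Metadata API.
final class GroupPathSketch {

    static final class Group {
        final Map<String, Group> children = new LinkedHashMap<>();
        final Map<String, List<String>> values = new LinkedHashMap<>();

        // Returns the named sub-group, creating it on first use.
        Group child(String name) {
            return children.computeIfAbsent(name, key -> new Group());
        }

        // Keys are multi-valued: each add appends to the key's value list.
        void add(String key, String value) {
            values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
    }

    // Walks every path segment but the last as a group name, then reads the
    // last segment as a key. Returns an empty list if any step is missing.
    static List<String> lookup(Group root, String path) {
        String[] parts = path.split("/");
        Group current = root;
        for (int i = 0; i < parts.length - 1; i++) {
            current = current.children.get(parts[i]);
            if (current == null) {
                return new ArrayList<>();
            }
        }
        List<String> found = current.values.get(parts[parts.length - 1]);
        return found == null ? new ArrayList<>() : found;
    }

    public static void main(String[] args) {
        Group root = new Group();
        root.child("tika:metadata").child("vcard:tel[1]").add("vcard:uri", "tel:+1-555-0100");
        System.out.println(lookup(root, "tika:metadata/vcard:tel[1]/vcard:uri")); // prints [tel:+1-555-0100]
    }
}
```

Note that in this sketch "vcard:tel[1]" is simply the literal name of a sub-group, so the index lives inside the key and both writers and readers have to agree on that spelling.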
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703966#comment-14703966 ]

Chris A. Mattmann commented on TIKA-1607:
-----------------------------------------

[~rgauss] did you look at the OODT metadata?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703924#comment-14703924 ]

Ray Gauss II commented on TIKA-1607:
------------------------------------

I've put together the start of the DOM metadata store option on [GitHub as well|https://github.com/apache/tika/compare/trunk...rgauss:trunk].

The crux of the change is using a {{org.w3c.dom.Document}} object instead of a {{Map<String, String[]>}} as the metadata store and Property objects based on {{QName}}s instead of Strings.

A few things to note:
* This does bring in commons-lang for XML escaping; we could change that if need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed

The [test added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394] demonstrates creating a DOM structure, adding it to the metadata, then pulling it out both programmatically and via XPath expression (sticking to the telephone number example).

That programmatic creation of the DOM structure is a bit cumbersome and we could certainly employ Java classes specific to each standard as a convenience (somewhat similar to [~talli...@mitre.org]'s proposal), but I do like the generic nature of the DOM store.

The {{toString}} method of the metadata object after building that example is properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tika:metadata xmlns:tika="http://tika.apache.org/">
  <vcard:tel xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
    <vcard:parameters>
      <vcard:type>
        <vcard:text>work</vcard:text>
      </vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}

There's obviously lots of room for improvement and discussion but I wanted to put it out there before the momentum on this slows.
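As a rough illustration of the DOM store idea described in the comment above, the sketch below builds the same telephone-number structure with the JDK's {{org.w3c.dom}} classes and reads it back with a namespace-aware XPath expression. It uses only standard JDK APIs; the class and method names (DomMetadataSketch, buildTelephoneExample, evaluate) are made up for this example and are not taken from the linked fork or from Tika.

```java
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical sketch of a DOM-backed metadata structure; not Tika's Metadata class. */
public class DomMetadataSketch {
    static final String TIKA_NS = "http://tika.apache.org/";
    static final String VCARD_NS = "urn:ietf:params:xml:ns:vcard-4.0";

    /** Builds the namespaced telephone-number structure shown in the comment. */
    public static Document buildTelephoneExample() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element root = doc.createElementNS(TIKA_NS, "tika:metadata");
            doc.appendChild(root);
            Element tel = vcardChild(doc, root, "vcard:tel");
            Element type = vcardChild(doc, vcardChild(doc, tel, "vcard:parameters"), "vcard:type");
            vcardChild(doc, type, "vcard:text").setTextContent("work");
            vcardChild(doc, tel, "vcard:uri").setTextContent("tel:+1-800-555-1234");
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Creates a vCard-namespaced element and appends it to the given parent. */
    private static Element vcardChild(Document doc, Element parent, String qualifiedName) {
        Element element = doc.createElementNS(VCARD_NS, qualifiedName);
        parent.appendChild(element);
        return element;
    }

    /** Evaluates an XPath expression against the document, resolving the tika and vcard prefixes. */
    public static String evaluate(Document doc, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                if ("tika".equals(prefix)) {
                    return TIKA_NS;
                }
                if ("vcard".equals(prefix)) {
                    return VCARD_NS;
                }
                return XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return null; // reverse lookup is not needed for evaluation
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                return null; // reverse lookup is not needed for evaluation
            }
        });
        try {
            return xpath.evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling evaluate(buildTelephoneExample(), "/tika:metadata/vcard:tel[1]/vcard:uri") pulls the telephone URI back out of the structure. The verbosity of buildTelephoneExample also shows why standard-specific convenience classes on top of the generic store are tempting.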
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704108#comment-14704108 ]

Ray Gauss II commented on TIKA-1607:
------------------------------------

[~chrismattmann], I did. It seemed more similar to the XPath-like workaround I described with the notion of groups in the store, rather than the full-fledged DOM store proposed in the GitHub fork, i.e. I didn't see where anything was namespaced.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704225#comment-14704225 ]

Chris A. Mattmann commented on TIKA-1607:
-----------------------------------------

Awesome. Yeah, namespaces can be achieved by grouping, e.g., DublinCore/Author, or DublinCore/Creator, or EXIF/ImageHeight. You can also use _ to namespace. Make sense? I think if we're going to do this in Tika I would much prefer doing it the way OODT did it (whose Metadata container, by the way, evolved at the same time as Tika's, and the two drew from one another).
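A minimal sketch of namespacing by grouping, as suggested in the comment above: flat keys such as DublinCore/Author are split on the first / into a group and a local name. The GroupedKeys class is hypothetical, values are single-valued for brevity, and the _ variant is left out because splitting on _ would be ambiguous for names that already contain underscores.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that treats the part of a key before the first '/' as its namespace group. */
public class GroupedKeys {

    /** Splits "DublinCore/Author" into {"DublinCore", "Author"}; keys without '/' get an empty group. */
    public static String[] split(String key) {
        int slash = key.indexOf('/');
        return slash < 0
                ? new String[] {"", key}
                : new String[] {key.substring(0, slash), key.substring(slash + 1)};
    }

    /** Regroups a flat key/value map into group -> (local name -> value). */
    public static Map<String, Map<String, String>> byGroup(Map<String, String> flat) {
        Map<String, Map<String, String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> entry : flat.entrySet()) {
            String[] parts = split(entry.getKey());
            grouped.computeIfAbsent(parts[0], g -> new TreeMap<>()).put(parts[1], entry.getValue());
        }
        return grouped;
    }
}
```

Existing callers that only know plain string keys keep working under this scheme, since a grouped key is still just a string. The cost is that the grouping is a naming convention rather than something the store enforces.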
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681243#comment-14681243 ]

Chris A. Mattmann commented on TIKA-1607:
-----------------------------------------

Hey Tim, it doesn't require a NASA login - just a login on the OODT JIRA. Can you email me offline and I'll grant one?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681240#comment-14681240 ]

Chris A. Mattmann commented on TIKA-1607:
-----------------------------------------

Ray, this is the exact approach we took in OODT. It's fully back compat. We used / as a group and path prefix (and/or _) and then supported a wide array of functions for back compat. See:

http://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/java/org/apache/oodt/cas/metadata/Metadata.java
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662477#comment-14662477 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

I put my proposal for POJO values on [github|https://github.com/tballison/tika/tree/TIKA-1607-POJO-Values]. There are some things that I don't like about my proposal, and I really like the list that [~rgauss] put together of things that would be good to do.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660283#comment-14660283 ]

Chris A. Mattmann commented on TIKA-1607:
-----------------------------------------

I'm confused about Ray's Tika FFMPEG that you're talking about. Also, [~gostep], the stuff being talked about here (how to handle properties and typed values, names, etc.) is precisely what I think you were trying to get at with your proposal and so forth, so you should probably comment here.

Good job on actually producing code for this, [~talli...@apache.org]. I'd like to take a look at it more before commenting further. One thing I know too is that the [OODT Metadata Object|https://oodt.jpl.nasa.gov/jira/si/jira.issueviews:issue-html/OODT-303/OODT-303.html] discussion that we had internally at JPL a long time ago is EXTREMELY similar to this one and it should be considered. I pointed [~lewismc] at this during the initial discussion.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661353#comment-14661353 ]

David Smiley commented on TIKA-1607:
------------------------------------

Tika isn't my area of expertise, but I think it should try to expose metadata using types that don't require dependencies, except perhaps for XML DOM or whatever JSON's DOM equivalent is (I don't think there is one in the JDK). WKT strings could make sense as a spatial type specifically; for simple points I wouldn't use them, though.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660304#comment-14660304 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

[~chrismattmann], any and all feedback would be great. The link you sent requires a NASA login. I'm not a rocket scientist, so no luck. :( :)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441 ]

Ray Gauss II commented on TIKA-1607:
------------------------------------

To clarify, the work mentioned above that uses an XPath-like syntax is only a workaround for mapping structured metadata into the current 'flat' metadata model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe. (Maybe that's now?)

This is simply restating some of what's already been said, but there are many aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM ({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and getting child nodes, etc., rather than hard coding POJOs for each metadata standard we want to support.

I'll try to find some time to put together an example patch for that approach in the next few days.
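To make the XPath-like workaround mentioned in the comment above more concrete, here is one way structured metadata could be flattened into string keys for a flat model. The key format (qualified names with 1-based sibling indexes), the FlattenSketch class, and the second telephone number in the usage note are assumptions for illustration; the thread does not spell out the syntax the actual workaround uses.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical flattening of a DOM tree into XPath-like string keys for a flat key/value store. */
public class FlattenSketch {

    /** Returns one entry per leaf element, keyed by its path from the root element. */
    public static Map<String, String> flatten(Element root) {
        Map<String, String> flat = new LinkedHashMap<>();
        walk(root, root.getNodeName(), flat);
        return flat;
    }

    private static void walk(Element element, String path, Map<String, String> flat) {
        Map<String, Integer> seen = new HashMap<>(); // per-name sibling counters
        boolean leaf = true;
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                leaf = false;
                int index = seen.merge(n.getNodeName(), 1, Integer::sum);
                walk((Element) n, path + "/" + n.getNodeName() + "[" + index + "]", flat);
            }
        }
        if (leaf) {
            flat.put(path, element.getTextContent().trim());
        }
    }

    /** Parses an XML string and returns its root element (namespace-aware). */
    public static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a tika:metadata document with two vcard:tel elements, this produces keys such as tika:metadata/vcard:tel[1]/vcard:uri[1] and tika:metadata/vcard:tel[2]/vcard:uri[1]. Those keys fit the existing string key/value model, but consumers must parse them to recover the structure, which is the limitation a structured store would remove.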
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660145#comment-14660145 ]

Tim Allison commented on TIKA-1607:
-----------------------------------

Doh! A related point: binary values. At some point I think Jukka(?) suggested putting thumbnails of an embedded document in that doc's metadata.