[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf
[ https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196568#comment-15196568 ]

Ajay Kumar Loganathan Ravichandran commented on TIKA-1888:
--

Yeah. It is missing one magic match.

> Update mimetype for application/x-netcdf
>
> Key: TIKA-1888
> URL: https://issues.apache.org/jira/browse/TIKA-1888
> Project: Tika
> Issue Type: Improvement
> Components: core, mime
> Affects Versions: 1.13
> Reporter: Ajay Kumar Loganathan Ravichandran
> Labels: mimetypes
> Fix For: 1.13
>
> Updating tika-mimetypes.xml to identify .cdf and .nc file formats.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
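The comment above does not say which magic match is missing. For context only, here is a minimal sketch of the kind of header check a magic match expresses, assuming the classic netCDF layout in which a file begins with the bytes "CDF" followed by a version byte (0x01 for the classic format, 0x02 for the 64-bit offset variant). The class and method names are hypothetical and are not part of Tika; netCDF-4 files, which use an HDF5 container, are deliberately not covered.

```java
/** Hypothetical helper, not part of Tika: checks the classic netCDF magic bytes. */
class NetcdfMagicSketch {

    /** Returns true when the header starts with "CDF" followed by version byte 1 or 2. */
    static boolean hasClassicNetcdfMagic(byte[] header) {
        return header != null
                && header.length >= 4
                && header[0] == 'C' && header[1] == 'D' && header[2] == 'F'
                && (header[3] == 0x01 || header[3] == 0x02);
    }
}
```

In tika-mimetypes.xml the same idea would be written declaratively as match entries rather than code; the sketch only illustrates what each entry tests.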
[jira] [Comment Edited] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)
[ https://issues.apache.org/jira/browse/TIKA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196045#comment-15196045 ]

Tim Allison edited comment on TIKA-1903 at 3/15/16 7:37 PM:

Copied from [~rgauss] on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding correctly that alone would still force users to write 'Tika-based' XMP parsers rather than allowing them access to the RAW XMP encoded bytes you're referring to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way that hopefully doesn't require sweeping changes to the parsers (I'm thinking of this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes a Metadata object which is only associated with the embedded resource (not the same metadata object associated with the 'container' file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:

    /**
     * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
     * resources during parsing for retrieval.
     */
    public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {
        /**
         * Gets the map of known embedded resources or null if no resources
         * were stored during parsing
         *
         * @return the embedded resources
         */
        Map getEmbeddedResources();
    }

then modify ParsingEmbeddedDocumentExtractor to implement it with an option which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to include some EmbeddedResources object to be optionally populated along with the Metadata in the Parser.parse method?

Other options? Maybe they don't need the RAW XMP?

> Allow for more flexibility in handling embedded metadata objects (e.g. XMP)
>
> Key: TIKA-1903
> URL: https://issues.apache.org/jira/browse/TIKA-1903
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> On TIKA-1607, we veered a bit from allowing flexible metadata structures to
> how to handle embedded metadata documents, such as XMP. Let's use this issue
> to discuss and design.
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196042#comment-15196042 ]

Tim Allison commented on TIKA-1607:
---

bq. I'm also aware that we've strayed a bit from the original issue here of structured metadata. Should we create a separate issue?

Agreed. Thank you. Let's move this to TIKA-1903.

> Introduce new arbitrary object key/values data structure for persistence of
> Tika Metadata
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch,
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch,
> TIKA-1607v3.patch
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...)
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the code Metadata API. I hope that
> the Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
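To make the proposal in the quoted description easier to picture, here is a minimal sketch of a key-to-objects store that still offers a String[] view for existing callers. It is an illustration under assumptions, not Tika's Metadata class and not the code in the attached patches: the class name, method names and the flattening via String.valueOf are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; this is NOT Tika's Metadata class. */
class ObjectMetadataSketch {
    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    /** Adds one value (a String, a Map of attributes, ...) under the key. */
    void add(String key, Object value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Structured view: the objects exactly as they were added. */
    List<Object> getObjects(String key) {
        return Collections.unmodifiableList(
                values.getOrDefault(key, Collections.emptyList()));
    }

    /** String[] view for existing callers: each object flattened with String.valueOf. */
    String[] getValues(String key) {
        List<Object> objects = values.getOrDefault(key, Collections.emptyList());
        String[] flat = new String[objects.size()];
        for (int i = 0; i < flat.length; i++) {
            flat[i] = String.valueOf(objects.get(i));
        }
        return flat;
    }
}
```

A phone number could then be added as a Map of its attributes (country code, number type) while older code keeps reading strings. The backwards compatibility cost mentioned in the description shows up in getValues: the flattened strings are lossy and their format depends on each object's toString.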
[jira] [Created] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)
Tim Allison created TIKA-1903: - Summary: Allow for more flexibility in handling embedded metadata objects (e.g. XMP) Key: TIKA-1903 URL: https://issues.apache.org/jira/browse/TIKA-1903 Project: Tika Issue Type: Improvement Reporter: Tim Allison On TIKA-1607, we veered a bit from allowing flexible metadata structures to how to handle embedded metadata documents, such as XMP. Let's use this issue to discuss and design. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196030#comment-15196030 ]

Ray Gauss II commented on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding correctly that alone would still force users to write 'Tika-based' XMP parsers rather than allowing them access to the RAW XMP encoded bytes you're referring to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way that hopefully doesn't require sweeping changes to the parsers (I'm thinking of this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method currently takes a {{Metadata}} object which is only associated with the embedded resource (not the same metadata object associated with the 'container' file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:

{code}
/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {
    /**
     * Gets the map of known embedded resources or null if no resources
     * were stored during parsing
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();
}
{code}

then modify ParsingEmbeddedDocumentExtractor to implement it with an option which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor that users could set in the context?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to include some {{EmbeddedResources}} object to be optionally populated along with the {{Metadata}} in the {{Parser.parse}} method?

Other options? Maybe they don't need the RAW XMP?

I'm also aware that we've strayed a bit from the original issue here of structured metadata. Should we create a separate issue?
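As a rough illustration of what Option 2 could look like, here is a minimal in-memory sketch. To stay self-contained it does not use Tika's real EmbeddedDocumentExtractor; instead it declares a reduced stand-in interface, so every name below (SimpleEmbeddedExtractor, InMemoryStoringExtractor, the "resourceName" key, the maxBytes budget) is an assumption made for the example. The memory budget reflects the concern raised in the discussion that keeping embedded resources in memory carries a resource risk the user has to accept.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Reduced, hypothetical stand-in for Tika's EmbeddedDocumentExtractor. */
interface SimpleEmbeddedExtractor {
    void parseEmbedded(InputStream stream, Map<String, String> metadata) throws IOException;
}

/** Hypothetical extractor that keeps the raw bytes of each embedded resource in memory. */
class InMemoryStoringExtractor implements SimpleEmbeddedExtractor {
    private final Map<String, byte[]> resources = new LinkedHashMap<>();
    private final long maxBytes;   // total budget across all stored resources
    private long storedBytes;

    InMemoryStoringExtractor(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    @Override
    public void parseEmbedded(InputStream stream, Map<String, String> metadata) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            // Refuse to grow past the budget instead of exhausting the heap.
            if (storedBytes + buffer.size() + read > maxBytes) {
                throw new IOException("embedded resources exceed the configured memory budget");
            }
            buffer.write(chunk, 0, read);
        }
        String name = metadata.getOrDefault("resourceName", "embedded-" + resources.size());
        if (resources.containsKey(name)) {
            name = name + "-" + resources.size();   // keep both resources on a name clash
        }
        resources.put(name, buffer.toByteArray());
        storedBytes += buffer.size();
    }

    /** Returns the stored resources, or null if none were stored (as in the proposed interface). */
    Map<String, byte[]> getEmbeddedResources() {
        return resources.isEmpty() ? null : Collections.unmodifiableMap(resources);
    }
}
```

A caller would hand such an extractor to the parser and read the raw XMP bytes back from getEmbeddedResources() after parsing. Whether this belongs in ParsingEmbeddedDocumentExtractor behind a switch (Option 1) or in a separate class (Option 2) is exactly the open question in the comment.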
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195882#comment-15195882 ]

Tim Allison commented on TIKA-1607:
---

Y, makes sense. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no? That would allow for all of the benefits of easy configurability (thank you, Nick!) of parsers for advanced handling of XMP. Or is there a reason to avoid that option? The one benefit to storing encoded bytes in the metadata object is that clients in other languages could decode and process. That said, I much prefer the direction you're recommending here.
[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845 ]

Ray Gauss II edited comment on TIKA-1607 at 3/15/16 1:57 PM:
-

Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedDocumentExtractor}} implementation they could use without resorting to extracting them to files?

was (Author: rgauss):

Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files?
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195326#comment-15195326 ]

Ray Gauss II commented on TIKA-1607:

Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment).

We can currently dump stuff to files in some parsers with the {{--extract}} CLI option which sticks a {{FileEmbeddedDocumentExtractor}} in the context. The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}.

Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when enabled, would also save the embedded resources in memory for an advanced user to do whatever they need, knowing the risk and resources required for that option? Or provide some other in-memory implementation that advanced users could explicitly set in the context?
[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195282#comment-15195282 ] Bob Paulin commented on TIKA-1894: -- I think that sounds like a good idea. > Add XMPMM metadata extraction to JempboxExtractor > - > > Key: TIKA-1894 > URL: https://issues.apache.org/jira/browse/TIKA-1894 > Project: Tika > Issue Type: New Feature >Reporter: Tim Allison >Priority: Minor > > The XMP Media Management (XMPMM) section of xmp carries some useful > information. We currently have keys for many of the important attributes in > tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but > the wiring between the two has not yet been installed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content
[ https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195280#comment-15195280 ] Tim Allison commented on TIKA-1859: --- related [discussion|https://mail-archives.apache.org/mod_mbox/poi-user/201509.mbox/%3cloom.20150918t140431-...@post.gmane.org%3E] Out of roughly 28k xlsx files we now have in our regression testing corpus, I could only find 4 that showed an increase in content after this change. Still happy to have this fixed. Thank you for raising this issue! > file poi reads tika does not bring the content > -- > > Key: TIKA-1859 > URL: https://issues.apache.org/jira/browse/TIKA-1859 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.12 >Reporter: Movses >Priority: Blocker > Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch > > > I have a file xlsx I'm able to read and process in using poi but in tika it > does not extract the content of the file -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195222#comment-15195222 ]

Tim Allison commented on TIKA-1607:
---

Oh, very nice. So, for example, the PDFParser could set the mime to "application/xmp+xml" and call parse on the byte[] with the EmbeddedResourceHandler. For starters, a user could add that mime to the DcXmlParser, and the xml would be treated as a regular attachment?
[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195210#comment-15195210 ]

Tim Allison commented on TIKA-1894:
---

Thank you, [~rgauss]. [~bobpaulin], if you're ok with this, I'll rename the module today.
RE: [DISCUSS] options for XMP parsing?
Thank you! Will take a look at the SO link, and I'll see if I can dig up any of these in our regression testing corpus.

-----Original Message-----
From: Ray Gauss [mailto:ray.ga...@alfresco.com]
Sent: Monday, March 14, 2016 1:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] options for XMP parsing?

Hi Tim,

Consolidated handling of XMP would be great, I'm glad you're taking a look at it and I'll try to help out where I can.

> You've been happy with it at Alfresco?

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray

[1] http://stackoverflow.com/a/22661992

> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. wrote:
>
> Hi Ray,
> Got it. Thank you.
>
> That'd be great. In follow up discussion with PDFBox devs, they mentioned
> that it is not a design feature/restriction on XMPBox that it doesn't handle
> non PDF/A files...only a matter of patching and building out their current
> code base. The downside is there's quite a bit to do, the upside is that it
> is a living code base.
>
> I'll experiment with Adobe's xmp-core. If you have any pointers/examples,
> let me know...I'll be starting with:
> https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/.
> You've been happy with it at Alfresco?
>
> No matter which package we use, it would be nice to build out uniform
> extraction of XMP for all image and PDF files for the common elements -- with
> special handling by file type if necessary. As you mentioned, it would also
> be great to add or modify our XMPScanner to extract all XMP packets from a
> file...I've started dabbling with this here:
> https://github.com/tballison/tika/tree/xmp_scanner . I'd be interested to
> hear more about what happens with InDesign files. In our own test set, we
> have a PDF file with two packets containing conflicting authorship info IIRC!
> :) It would be nice to expose both the canonical XMP info (with proper
> processing of "later-xmp-overrides-earlier") as well as all of the info that
> can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two
> different use cases.
>
> Thank you, again.
>
> Cheers,
>
> Tim
>
> -----Original Message-----
> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
>
> To clarify... the 'we' in my third sentence was referring to Alfresco, not
> Tika.
>
> I'm not sure how much of that code would be useful but I may be able to
> contribute some of it.
>
> Regards,
>
> Ray
>
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. wrote:
>>
>> Thank you. Will take a look.
>>
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>>
>> Hi Tim,
>>
>> We're already using Adobe's xmpcore in tika-xmp which works fine for parsing
>> XMP (though has not seen updates in a while), but getting the XMP packets
>> out of the files is trickier.
>>
>> We have XMPPacketScanner which works for many cases, but not all. InDesign
>> files for example do some strange things.
>>
>> In the past we've used different packet scanners depending on the file type
>> (including Exiftool command-line) to get the XMP out then used xmpcore to
>> parse into simple flattened properties.
>>
>> Regards,
>>
>> Ray
>>
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. wrote:
>>>
>>> All,
>>>
>>> PDFBox 2.0 is soon to be released. In the course of its development, the
>>> project has migrated from Jempbox (which we're now using) to XmpBox; and
>>> Jempbox is now on its last legs.
>>>
>>> XmpBox was "written for PDF/A checking," not for robust processing of
>>> common variants of XMPs in the wild; I found that it fails on roughly 40%
>>> of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
>>>
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>>
>>> Has anyone had any luck with an Apache-friendly XMP parser? Are there
>>> better options than copying and pasting jempbox into Tika and maintaining
>>> it ourselves (yuk!)?
>>>
>>> Best,
>>>
>>> Tim
>>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: d...@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>>
>>> I think the problem is that XmpBox was written for PDF/A checking, so it
>>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the
>>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties
[jira] [Commented] (TIKA-1893) Add new mimetype for *.icns (Apple Icon Image Format) files
[ https://issues.apache.org/jira/browse/TIKA-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194924#comment-15194924 ]

Manisha Kampasi commented on TIKA-1893:
---

I am working on implementing a custom parser class to extract certain metadata from the *.icns files, such as the number of icons in the file and their size in pixels. I will update this thread with more details once I have the code ready.

> Add new mimetype for *.icns (Apple Icon Image Format) files
>
> Key: TIKA-1893
> URL: https://issues.apache.org/jira/browse/TIKA-1893
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.11
> Reporter: Manisha Kampasi
> Priority: Minor
> Labels: patch
>
> Currently, TIKA does not support the "image/icns" mime type for *.icns files
> (Apple Icon Image Format). This can be added to the tika-mimetypes.xml.
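The parser mentioned in the comment has not been posted yet, so the following is only a sketch of how the container could be walked, not the reporter's code. It assumes the commonly documented icns layout: a 4-byte magic "icns", a 4-byte big-endian total file length, then a sequence of elements that each start with a 4-byte type code and a 4-byte big-endian length that includes the 8-byte element header. The class and method names are hypothetical. Note that not every element is an icon (some carry a table of contents or other data), so a real parser would map type codes to pixel sizes and filter accordingly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not part of Tika: lists the element type codes in an icns file. */
class IcnsElementSketch {
    private static final int ICNS_MAGIC = 0x69636e73; // the ASCII bytes "icns"

    static List<String> listElementTypes(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        if (file.length < 8 || buf.getInt() != ICNS_MAGIC) {
            throw new IllegalArgumentException("not an icns file");
        }
        int declaredLength = buf.getInt();
        int end = Math.min(declaredLength, file.length);
        List<String> types = new ArrayList<>();
        while (buf.position() + 8 <= end) {
            int elementStart = buf.position();
            byte[] type = new byte[4];
            buf.get(type);
            int elementLength = buf.getInt(); // includes the 8-byte element header
            long next = (long) elementStart + elementLength;
            if (elementLength < 8 || next > end) {
                break; // truncated or corrupt element: stop instead of reading past the data
            }
            types.add(new String(type, StandardCharsets.US_ASCII));
            buf.position((int) next);
        }
        return types;
    }
}
```

Counting the returned entries that correspond to image types would give the "number of icons" metadata value; the pixel size can be derived from the type code once a lookup table is added.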