[jira] [Commented] (TIKA-1888) Update mimetype for application/x-netcdf

2016-03-15 Thread Ajay Kumar Loganathan Ravichandran (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196568#comment-15196568
 ] 

Ajay Kumar Loganathan Ravichandran commented on TIKA-1888:
--

Yeah. It is missing one magic match.
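
As a rough illustration of what a magic match checks, here is a minimal Java sketch. It assumes the classic NetCDF signatures, the ASCII bytes "CDF" followed by a version byte of 1 or 2; the comment above does not say which match is missing, and the class name NetcdfMagicSketch is hypothetical, not part of Tika.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: checks the classic NetCDF signatures. */
final class NetcdfMagicSketch {
    private NetcdfMagicSketch() {}

    /** Returns true if the header starts with "CDF" followed by version byte 1 or 2. */
    static boolean looksLikeClassicNetcdf(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        byte[] prefix = "CDF".getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < prefix.length; i++) {
            if (header[i] != prefix[i]) {
                return false;
            }
        }
        // Assumption: only the two classic version bytes are accepted here.
        return header[3] == 1 || header[3] == 2;
    }
}
```

In tika-mimetypes.xml the same idea would be expressed declaratively as magic entries rather than in code.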

> Update mimetype for application/x-netcdf
> 
>
> Key: TIKA-1888
> URL: https://issues.apache.org/jira/browse/TIKA-1888
> Project: Tika
>  Issue Type: Improvement
>  Components: core, mime
>Affects Versions: 1.13
>Reporter: Ajay Kumar Loganathan Ravichandran
>  Labels: mimetypes
> Fix For: 1.13
>
>
> Updating tika-mimetypes.xml to identify the .cdf and .nc file formats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196045#comment-15196045
 ] 

Tim Allison edited comment on TIKA-1903 at 3/15/16 7:37 PM:


Copied from [~rgauss] on TIKA-1607:

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes 
a Metadata object which is only associated with the embedded resource (not the 
same metadata object associated with the 'container' file) and is populated 
with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:


/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();

}


then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them 
use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some EmbeddedResources object to be optionally populated along with the 
Metadata in the Parser.parse method?

Other options? Maybe they don't need the RAW XMP?
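
To make Option 2 a little more concrete, here is a minimal, self-contained Java sketch of an in-memory storing extractor. Everything in it is an assumption for illustration: EmbeddedExtractorSketch is a simplified stand-in for the real EmbeddedDocumentExtractor (which also takes a ContentHandler and a Metadata object), a plain Map<String, String> stands in for Metadata, and Map<String, byte[]> is just one guess at what the proposed map could hold.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified stand-in for Tika's EmbeddedDocumentExtractor; not the real signature. */
interface EmbeddedExtractorSketch {
    boolean shouldParseEmbedded(Map<String, String> metadata);

    void parseEmbedded(InputStream stream, Map<String, String> metadata) throws IOException;
}

/** Hypothetical implementation that keeps the raw bytes of each embedded resource. */
final class InMemoryStoringExtractorSketch implements EmbeddedExtractorSketch {
    private final Map<String, byte[]> resources = new LinkedHashMap<>();

    @Override
    public boolean shouldParseEmbedded(Map<String, String> metadata) {
        return true; // store everything; a real implementation might filter by type
    }

    @Override
    public void parseEmbedded(InputStream stream, Map<String, String> metadata)
            throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        // Assumption: the resource name is available under the key "resourceName".
        String name = metadata.getOrDefault("resourceName", "embedded-" + resources.size());
        resources.put(name, buffer.toByteArray());
    }

    /** Returns the stored resources, or null if nothing was stored during parsing. */
    Map<String, byte[]> getEmbeddedResources() {
        return resources.isEmpty() ? null : resources;
    }
}
```

Holding every embedded resource in memory is the cost of this approach, so it would presumably stay opt-in.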



was (Author: talli...@mitre.org):
Copied from [~rgauss] on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes 
a Metadata object which is only associated with the embedded resource (not the 
same metadata object associated with the 'container' file) and is populated 
with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:


/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();

}


then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them 
use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some EmbeddedResources object to be optionally populated along with the 
Metadata in the Parser.parse method?

Other options? Maybe they don't need the RAW XMP?


> Allow for more flexibility in handling embedded metadata objects (e.g. XMP)
> ---
>
> Key: TIKA-1903
> URL: https://issues.apache.org/jira/browse/TIKA-1903
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>
> On TIKA-1607, we veered a bit from allowing flexible metadata structures to 
> how to handle embedded metadata documents, such as XMP.  Let's use this issue 
> to discuss and design.





[jira] [Commented] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196045#comment-15196045
 ] 

Tim Allison commented on TIKA-1903:
---

Copied from [~rgauss] on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes 
a Metadata object which is only associated with the embedded resource (not the 
same metadata object associated with the 'container' file) and is populated 
with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:


/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();

}


then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them 
use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some EmbeddedResources object to be optionally populated along with the 
Metadata in the Parser.parse method?

Other options? Maybe they don't need the RAW XMP?




[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196042#comment-15196042
 ] 

Tim Allison commented on TIKA-1607:
---

bq. I'm also aware that we've strayed a bit from the original issue here of 
structured metadata. Should we create a separate issue?

Agreed.  Thank you.  Let's move this to TIKA-1903.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope, 
> however, that the mapping is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
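
For readers trying to picture the proposal, the following Java sketch shows one possible shape for a String-to-Object metadata store, using the phone number example from the description above. It is only an illustration under assumptions: ObjectMetadataSketch and its methods are hypothetical names, the nested map layout is one guess at the intended structure, and getValues shows just one way legacy String[] callers might still be served.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical String -> Object metadata store; not Tika's Metadata class. */
final class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    /** Legacy-style view: flattens a stored value to strings for String[] callers. */
    String[] getValues(String key) {
        Object value = values.get(key);
        if (value == null) {
            return new String[0];
        }
        if (value instanceof Map) {
            List<String> keys = new ArrayList<>();
            for (Object k : ((Map<?, ?>) value).keySet()) {
                keys.add(String.valueOf(k));
            }
            return keys.toArray(new String[0]);
        }
        return new String[] {String.valueOf(value)};
    }

    /** Builds the phonenumbers entry using the values quoted in the issue. */
    static ObjectMetadataSketch phoneNumberExample() {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("LibPN-CountryCode", "US");
        first.put("LibPN-NumberType", "International");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("LibPN-CountryCode", "UK");
        second.put("LibPN-NumberType", "International");

        Map<String, Map<String, String>> numbers = new LinkedHashMap<>();
        numbers.put("+162648743476", first);
        numbers.put("+1292611054", second);

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        return metadata;
    }
}
```

The flattening in getValues loses the per-number attributes, which is the backwards compatibility trade-off the description alludes to.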





[jira] [Created] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)

2016-03-15 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1903:
-

 Summary: Allow for more flexibility in handling embedded metadata 
objects (e.g. XMP)
 Key: TIKA-1903
 URL: https://issues.apache.org/jira/browse/TIKA-1903
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison


On TIKA-1607, we veered a bit from allowing flexible metadata structures to how 
to handle embedded metadata documents, such as XMP.  Let's use this issue to 
discuss and design.





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196030#comment-15196030
 ] 

Ray Gauss II commented on TIKA-1607:


bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method 
currently takes a {{Metadata}} object which is only associated with the 
embedded resource (not the same metadata object associated with the 'container' 
file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:
{code}
/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();

}
{code}

then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and 
make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some {{EmbeddedResources}} object to be optionally populated along with 
the {{Metadata}} in the {{Parser.parse}} method?

Other options?  Maybe they don't need the RAW XMP?

I'm also aware that we've strayed a bit from the original issue here of 
structured metadata.  Should we create a separate issue?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195882#comment-15195882
 ] 

Tim Allison commented on TIKA-1607:
---

Y, makes sense. It might be more easily configurable to use the 
ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, 
no?  That would allow for all of the benefits of easy configurability (thank 
you, Nick!) of parsers for advanced handling of XMP.  

Or is there a reason to avoid that option?

The one benefit to storing encoded bytes in the metadata object is that clients 
in other languages could decode and process.  That said, I much prefer the 
direction you're recommending here.



[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845
 ] 

Ray Gauss II edited comment on TIKA-1607 at 3/15/16 1:57 PM:
-

Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedDocumentExtractor}} 
implementation they could use without resorting to extracting them to files?


was (Author: rgauss):
Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195326#comment-15195326
 ] 

Ray Gauss II commented on TIKA-1607:


Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment).

We can currently dump stuff to files in some parsers with the {{--extract}} CLI 
option which sticks a {{FileEmbeddedDocumentExtractor}} in the context.

The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}.

Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when 
enabled, would also save the embedded resources in memory for an advanced user 
to do whatever they need, knowing the risk and resources required for that 
option?

Or provide some other in-memory implementation that advanced users could 
explicitly set in the context?
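
The "set it in the context" idea relies on a lookup keyed by type, with a default when the user has set nothing. A minimal Java sketch of that pattern follows; ParseContextSketch and ExtractorChoiceSketch are hypothetical stand-ins written for this example and only mimic the general shape of Tika's ParseContext, not its actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for an extractor; describe() exists only so the sketch has something to call. */
interface ExtractorChoiceSketch {
    String describe();
}

/** Hypothetical type-keyed context: one instance per interface class. */
final class ParseContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    /** Registers the instance to use for the given type. */
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the registered instance, or the default when none was set. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}
```

A parser would ask the context for the extractor and fall back to the parsing one, so an advanced user opts in to in-memory storage simply by setting a different instance before calling parse.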



[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-03-15 Thread Bob Paulin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195282#comment-15195282
 ] 

Bob Paulin commented on TIKA-1894:
--

I think that sounds like a good idea.

> Add XMPMM metadata extraction to JempboxExtractor
> -
>
> Key: TIKA-1894
> URL: https://issues.apache.org/jira/browse/TIKA-1894
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
>
> The XMP Media Management (XMPMM) section of xmp carries some useful 
> information.  We currently have keys for many of the important attributes in 
> tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but 
> the wiring between the two has not yet been installed. 





[jira] [Commented] (TIKA-1859) file poi reads tika does not bring the content

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195280#comment-15195280
 ] 

Tim Allison commented on TIKA-1859:
---

related 
[discussion|https://mail-archives.apache.org/mod_mbox/poi-user/201509.mbox/%3cloom.20150918t140431-...@post.gmane.org%3E]

Out of roughly 28k xlsx files we now have in our regression testing corpus, I 
could only find 4 that showed an increase in content after this change.  

Still happy to have this fixed.  Thank you for raising this issue!

> file poi reads tika does not bring the content
> --
>
> Key: TIKA-1859
> URL: https://issues.apache.org/jira/browse/TIKA-1859
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.12
>Reporter: Movses
>Priority: Blocker
> Attachments: testing.Xlsx, upgrade_to_POI_3_14_beta2.patch
>
>
> I have an xlsx file that I can read and process using POI, but Tika does 
> not extract the content of the file.





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195222#comment-15195222
 ] 

Tim Allison commented on TIKA-1607:
---

Oh, very nice.  So, for example, the PDFParser could set the mime to 
"application/xmp+xml" and call parse on the byte[] with the 
EmbeddedResourceHandler.  For starters, a user could add that mime to the 
DcXmlParser, and the xml would be treated as a regular attachment?



[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-03-15 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195210#comment-15195210
 ] 

Tim Allison commented on TIKA-1894:
---

Thank you, [~rgauss].  [~bobpaulin], if you're ok with this, I'll rename the 
module today.



RE: [DISCUSS] options for XMP parsing?

2016-03-15 Thread Allison, Timothy B.
Thank you!  Will take a look at the SO link, and I'll see if I can dig up any of 
these in our regression testing corpus.

-Original Message-
From: Ray Gauss [mailto:ray.ga...@alfresco.com] 
Sent: Monday, March 14, 2016 1:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] options for XMP parsing?

Hi Tim,

Consolidated handling of XMP would be great, I'm glad you're taking a look at it 
and I'll try to help out where I can.

> You've been happy with it at Alfresco? 

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray


[1] http://stackoverflow.com/a/22661992


> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B.  wrote:
> 
> Hi Ray,
>  Got it.  Thank you.
> 
> That'd be great.  In follow up discussion with PDFBox devs, they mentioned 
> that it is not a design feature/restriction on XMPBox that it doesn't handle 
> non PDF/A files...only a matter of patching and building out their current 
> code base.   The downside is there's quite a bit to do, the upside is that it 
> is a living code base.
> 
> I'll experiment with Adobe's xmp-core.  If you have any pointers/examples, 
> let me know...I'll be starting with: 
> https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/.
>  You've been happy with it at Alfresco? 
> 
> No matter which package we use, it would be nice to build out uniform 
> extraction of XMP for all image and PDF files for the common elements -- with 
> special handling by file type if necessary.  As you mentioned, it would also 
> be great to add or modify our XMPScanner to extract all XMP packets from a 
> file...I've started dabbling with this here: 
> https://github.com/tballison/tika/tree/xmp_scanner .  I'd be interested to 
> hear more about what happens with InDesign files. In our own test set, we 
> have a PDF file with two packets containing conflicting authorship info IIRC! 
> :)  It would be nice to expose both the canonical XMP info (with proper 
> processing of "later-xmp-overrides-earlier") as well as all of the info that 
> can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two 
> different use cases.
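
The two use cases in the paragraph above (all packets versus one canonical view) can be sketched in a few lines of Java. This is a rough illustration under assumptions, not the XMPScanner from the linked branch: XmpPacketSketch is a hypothetical class, it only looks for the ASCII "<?xpacket begin=" and "<?xpacket end=" wrapper markers (so UTF-16 packets would be missed), and it takes already-flattened per-packet properties as plain maps.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: finds XMP packets by their wrapper markers and merges properties. */
final class XmpPacketSketch {
    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    private XmpPacketSketch() {}

    /** Returns every span from a begin marker through the close of the next end marker. */
    static List<String> findPackets(byte[] data) {
        // ISO-8859-1 maps each byte to one char, so offsets stay aligned with the input.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(BEGIN, from);
            if (start < 0) {
                break;
            }
            int end = text.indexOf(END, start);
            if (end < 0) {
                break;
            }
            int close = text.indexOf("?>", end);
            if (close < 0) {
                break;
            }
            packets.add(text.substring(start, close + 2));
            from = close + 2;
        }
        return packets;
    }

    /** Canonical view: properties from later packets override those from earlier ones. */
    static Map<String, String> mergeLaterOverridesEarlier(List<Map<String, String>> perPacket) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> properties : perPacket) {
            merged.putAll(properties);
        }
        return merged;
    }
}
```

Keeping the per-packet list alongside the merged map would let a caller see both conflicting authors as well as the canonical one.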
> 
> Thank you, again.
> 
> Cheers,
> 
>   Tim
> 
> 
> 
> 
> -Original Message-
> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
> 
> To clarify... the 'we' in my third sentence was referring to Alfresco, not 
> Tika.
> 
> I'm not sure how much of that code would be useful but I may be able to 
> contribute some of it.
> 
> Regards,
> 
> Ray
> 
> 
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B.  wrote:
>> 
>> Thank you.  Will take a look.
>> 
>> -Original Message-
>> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>> 
>> Hi Tim,
>> 
>> We're already using Adobe's xmpcore in tika-xmp which works fine for parsing 
>> XMP (though has not seen updates in a while), but getting the XMP packets 
>> out of the files is trickier.  
>> 
>> We have XMPPacketScanner which works for many cases, but not all.  InDesign 
>> files for example do some strange things.
>> 
>> In the past we've used different packet scanners depending on the file type 
>> (including Exiftool command-line) to get the XMP out then used xmpcore to 
>> parse into simple flattened properties.
>> 
>> Regards,
>> 
>> Ray
>> 
>> 
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B.  wrote:
>>> 
>>> All,
>>> 
>>> PDFBox 2.0 is soon to be released.  In the course of its development, the 
>>> project has migrated from Jempbox (which we're now using) to XmpBox; and 
>>> Jempbox is now on its last legs.  
>>> 
>>> XmpBox was "written for PDF/A checking," not for robust processing of 
>>> common variants of XMPs in the wild; I found that it fails on roughly 40% 
>>> of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
>>> 
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>> 
>>> Has anyone had any luck with an Apache-friendly XMP parser?  Are there 
>>> better options than copying and pasting jempbox into Tika and maintaining 
>>> it ourselves (yuk!)?
>>> 
>>>Best,
>>> 
>>>   Tim
>>> 
>>> -Original Message-
>>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: d...@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>> 
>>> I think the problem is that XmpBox was written for PDF/A checking, so it 
>>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the 
>>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties

[jira] [Commented] (TIKA-1893) Add new mimetype for *.icns (Apple Icon Image Format) files

2016-03-15 Thread Manisha Kampasi (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194924#comment-15194924
 ] 

Manisha Kampasi commented on TIKA-1893:
---

I am working on implementing a custom parser class to extract certain 
metadata from the *.icns files, such as the number of icons in the file, 
their size in pixels, etc. I will update this thread with more details once I 
have the code ready.
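
Until that code is posted, here is a minimal Java sketch of how such metadata could be read. It is not the parser described above. It assumes the commonly documented ICNS layout: the 4-byte magic "icns", a 4-byte big-endian file length, then entries that each start with a 4-byte type code and a 4-byte big-endian length that includes the 8-byte entry header. IcnsEntrySketch is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the entry type codes found in an ICNS file. */
final class IcnsEntrySketch {
    private IcnsEntrySketch() {}

    static List<String> listEntryTypes(byte[] data) {
        if (data == null || data.length < 8) {
            throw new IllegalArgumentException("too short for an icns header");
        }
        ByteBuffer buffer = ByteBuffer.wrap(data); // ByteBuffer is big-endian by default
        byte[] tag = new byte[4];
        buffer.get(tag);
        if (!"icns".equals(new String(tag, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("missing icns magic");
        }
        buffer.getInt(); // declared total file length; not validated in this sketch

        List<String> types = new ArrayList<>();
        while (buffer.remaining() >= 8) {
            buffer.get(tag);
            int length = buffer.getInt(); // includes the 8-byte entry header
            if (length < 8 || length - 8 > buffer.remaining()) {
                break; // malformed or truncated entry: stop rather than guess
            }
            types.add(new String(tag, StandardCharsets.US_ASCII));
            buffer.position(buffer.position() + length - 8); // skip the entry payload
        }
        return types;
    }
}
```

Not every entry is necessarily an icon, so a real parser would map type codes to pixel sizes and ignore the rest before reporting an icon count.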

> Add new mimetype for *.icns (Apple Icon Image Format) files 
> 
>
> Key: TIKA-1893
> URL: https://issues.apache.org/jira/browse/TIKA-1893
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.11
>Reporter: Manisha Kampasi
>Priority: Minor
>  Labels: patch
>
> Currently, TIKA does not support the "image/icns" mime type for *.icns files 
> (Apple Icon Image Format). This can be added to the tika-mimetypes.xml.


