[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149736#comment-15149736
 ] 

Tim Allison commented on TIKA-1857:
---

This is great.  Thank you!

So, to get the best coverage for extracted content, should we do the following:

Check for fields in the AcroForm.

a) If those exist (Static XFA), use the content extracted from the AcroForm and 
ignore the XFA
b) If they don't exist (Dynamic XFA), scrape/extract info from the XFA 

In your experience, will we miss any info if we ignore the XFA for Static XFAs 
and rely solely on the AcroForm?
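
For illustration, the check proposed above could be sketched as follows. This is only a sketch: {{FormSourceChooser}} and its inputs are hypothetical names, not part of Tika or PDFBox, and the caller is assumed to have already counted the AcroForm fields and checked for an XFA entry.

```java
/** Hypothetical helper (not a Tika or PDFBox class): picks which form data to extract. */
final class FormSourceChooser {

    enum Source { ACROFORM, XFA, NONE }

    /**
     * @param acroFormFieldCount number of fields found in the AcroForm (0 if there are none)
     * @param hasXfa             whether the PDF carries an XFA entry
     */
    static Source choose(int acroFormFieldCount, boolean hasXfa) {
        if (acroFormFieldCount > 0) {
            // a) Static XFA (or a plain AcroForm): use the AcroForm, ignore the XFA.
            return Source.ACROFORM;
        }
        if (hasXfa) {
            // b) Dynamic XFA: no AcroForm fields, so the XFA is the only source.
            return Source.XFA;
        }
        return Source.NONE;
    }
}
```

Whether branch a) loses information is exactly the open question in this comment; the sketch only encodes the proposed rule.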



> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149405#comment-15149405
 ] 

Maruan Sahyoun commented on TIKA-1857:
--

The reason you are not getting the data is that it is stored in the data node 
of an XML data structure that matches the binding information in the field. 
In this file, the data is in {{xfa.datasets.data}}, with the {{my_exhibitor}} 
value stored in the {{Exhibitorname}} field.

Extracting {{speak|text|exData}} will give you the boilerplate text but not the 
field value.

There are two types of XFA forms: static and dynamic. Static XFA forms 
have an XFA entry and AcroForm fields. Dynamic XFA forms have only an 
XFA entry and no AcroForm fields.

When a static XFA form is filled out with an XFA-aware PDF processor, both 
the {{xfa.datasets.data}} information and the {{V}} entry of the AcroForm 
form field are updated. A non-XFA-aware PDF processor sees only the AcroForm 
information when filling out a static form, so it updates only the {{V}} 
entry of the AcroForm form fields.

A non-XFA-aware PDF processor trying to fill a dynamic XFA form will not see 
any form fields at all.

I'm happy to provide more information on that topic but thought that this will 
give you a first outline.
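
As a rough sketch of reading the values from the data node, the code below parses an XFA XML stream with the JDK's DOM parser and collects the leaf elements under {{data}}. The class name {{XfaDataSketch}} is hypothetical, the namespace URI is the one I would expect for XFA datasets and should be treated as an assumption, and a real form may repeat field names (later values overwrite earlier ones here).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical sketch: collects field values stored under xfa.datasets.data. */
final class XfaDataSketch {

    // Assumption: namespace of the XFA datasets packet.
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    static Map<String, String> fieldValues(byte[] xfaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted PDFs cannot trigger XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xfaXml));

            Map<String, String> values = new LinkedHashMap<>();
            NodeList dataNodes = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
            for (int i = 0; i < dataNodes.getLength(); i++) {
                collectLeaves((Element) dataNodes.item(i), values);
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XFA XML", e);
        }
    }

    /** Records every non-empty leaf element as (local element name -> trimmed text). */
    private static void collectLeaves(Element parent, Map<String, String> out) {
        boolean hasChildElement = false;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hasChildElement = true;
                collectLeaves((Element) n, out);
            }
        }
        if (!hasChildElement) {
            String text = parent.getTextContent().trim();
            if (!text.isEmpty()) {
                out.put(parent.getLocalName(), text);
            }
        }
    }
}
```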



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149256#comment-15149256
 ] 

Tilman Hausherr commented on TIKA-1857:
---

Sorry, I have no experience with XFA. [~msahyoun] might know more.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149267#comment-15149267
 ] 

Tim Allison commented on TIKA-1607:
---

Y, probably.  We could add limits on length although we're not currently doing 
this with Metadata String values.  To be fair, of course, I realize that 
embedded binary metadata objects (XMP/XFA...) are typically longer than regular 
metadata values.

What else is in the can?

Or, is this just a plain bad idea?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> ---------------------------------------------------------------------------
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149231#comment-15149231
 ] 

Ray Gauss II commented on TIKA-1607:


Are we opening a can of worms by encouraging the use of a byte array directly 
with no restrictions on length, etc.?



[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149171#comment-15149171
 ] 

Tim Allison edited comment on TIKA-1857 at 2/16/16 8:09 PM:


I've only looked at a handful of files that contain xfa...this metadata is 
entirely new to me.  The files I've looked at come from govdocs1 and are fairly 
old by now.

In the attached {{041617_filled_out.pdf}}, I've added content to the forms and 
saved the document.

With the patch, I'm getting all of the boilerplate from the xfa extraction, but 
I'm not getting any content from the form because it isn't in 
{{<(speak|text|exData)>}} elements.  However, with our old code, I am seeing 
the entered data, e.g. {{my_exhibitor}}.
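
To make the behaviour concrete, here is a minimal sketch of regex-based extraction limited to {{speak}}, {{text}} and {{exData}} elements. The pattern and the class name {{XfaBoilerplateSketch}} are my own assumptions and not the regexes in the patch; the point is only that captions and spoken hints match, while a value held elsewhere in the XFA does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pulls text out of speak/text/exData elements only. */
final class XfaBoilerplateSketch {

    // \b keeps <textEdit> from matching as <text>; (?<!/) skips self-closing tags.
    private static final Pattern TEXT_ELEMENT = Pattern.compile(
            "<(speak|text|exData)\\b[^>]*(?<!/)>(.*?)</\\1>", Pattern.DOTALL);

    static List<String> boilerplate(String xfaXml) {
        List<String> out = new ArrayList<>();
        Matcher m = TEXT_ELEMENT.matcher(xfaXml);
        while (m.find()) {
            // Drop any nested markup (e.g. XHTML inside exData) and collapse whitespace.
            String text = m.group(2)
                    .replaceAll("<[^>]+>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }
}
```

A filled-in value that lives only in the datasets packet or in the AcroForm {{V}} entry never appears inside these three elements, which would explain the missing {{my_exhibitor}}.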

Is this PDF storing the contents of the form in both the xfa _and_ in the 
traditional AcroForm?

I imagine that won't happen in all PDFs, and there will be an either/or?

To avoid duplication of content, do we want to skip processing of AcroForm data 
if XFA exists?  Will we miss anything?

The other major question: I like the narrow focus that the current regexes 
yield, but why wouldn't we want to run our HtmlParser or our DcXMLParser 
against the bytes and pull everything out?  We'd have to skip inline/embedded 
images or handle those properly at some point...but any other reasons?

[~tilman], have you worked with XFA?  Any recommendations for pulling as much 
info as we can without duplication?

We could make this configurable, of course. :)






[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149014#comment-15149014
 ] 

Tim Allison edited comment on TIKA-1857 at 2/16/16 7:49 PM:


from TIKA-1607's 
[comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the 
form also contains the PDF's standard metadata...(author etc.) which is not 
necessarily stored in the older mechanism: COSDictionary.  govdocs1's 
{{517660.pdf}} shows this -- the author and title can be extracted from the 
XFA, but that info is not extracted with our current methods.

bq. I'll support whichever way you pick, but I personally can't see use cases 
where extracting that workaround message is the intent when using Tika. I do 
see value in keeping the entire DOM though. Maybe you can do as you suggest, 
but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.








[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1857:
--
Attachment: 041617_filled_out.pdf

I've only looked at a handful of files that contain xfa...this metadata is 
entirely new to me.  The files I've looked at come from govdocs1 and are fairly 
old by now.

In the attached, I've added content to the forms and saved the document.

With the patch, I'm getting all of the boilerplate from the xfa extraction, but 
I'm not getting any content from the form because it isn't in 
{{<(speak|text|exData)>}} elements.  However, with our old code, I am seeing 
the entered data, e.g. {{my_exhibitor}}.

Is this PDF storing the contents of the form in both the xfa _and_ in the 
traditional AcroForm?

I imagine that won't happen in all PDFs, and there will be an either/or?

To avoid duplication of content, do we want to skip processing of AcroForm data 
if XFA exists?  Will we miss anything?

[~tilman], have you worked with XFA?  Any recommendations for pulling as much 
info as we can without duplication?

We could make this configurable, of course. :)




[jira] [Updated] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1607:
--
Attachment: TIKA-1607_bytes_dom_values.patch

This is a sketch of the proposal to store base64 encoded bytes and/or DOM as a 
value in our current metadata object.

If decoding or parsing fails with getBytes/getDOM, this returns null, following 
the behavior of getDate(Property...).

One last thing we might consider doing is gzipping the byte array before 
encoding.  I'd want bzip2, but I don't want the added dependency.

Any objections to this change or recommendations?
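
A rough illustration of the gzip-then-encode idea, using only JDK classes. {{GzipBase64Sketch}} is a hypothetical name and this is not the code in the attached patch; {{java.util.Base64}} also assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: gzip a byte[] and store it as a base64 String value. */
final class GzipBase64Sketch {

    static String encode(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    /** Returns the original bytes, or null if the value is not valid base64 or not gzip. */
    static byte[] decode(String encoded) {
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(Base64.getDecoder().decode(encoded)))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException | IllegalArgumentException e) {
            return null;
        }
    }
}
```

Returning null on failure mirrors the getBytes/getDOM behaviour described above.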



[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1857:
--
Attachment: xfa_in_govdocs1.txt

list of PDFs in govdocs1 that have a non-null PDXFAResource object, found with 
PDFBox 2.0's trunk.



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149014#comment-15149014
 ] 

Tim Allison commented on TIKA-1857:
---

from TIKA-1607's 
[comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the 
form also contains the PDF's standard metadata...(author etc.) which is not 
necessarily stored in the older mechanism: COSDictionary.

bq. I'll support whichever way you pick, but I personally can't see use cases 
where extracting that workaround message is the intent when using Tika. I do 
see value in keeping the entire DOM though. Maybe you can do as you suggest, 
but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.





[jira] [Commented] (TIKA-1856) Error while parsing an ogg file

2016-02-16 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149012#comment-15149012
 ] 

Chris A. Mattmann commented on TIKA-1856:
-

Hey Nick, it's possible they were truncated by Nutch crawls and content 
limits. See http://github.com/chrismattmann/trec-dd-polar/ for a description of 
the dataset.

> Error while parsing an ogg file
> -------------------------------
>
> Key: TIKA-1856
> URL: https://issues.apache.org/jira/browse/TIKA-1856
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.12
> Environment: python
> Reporter: Yash Tanna
> Labels: newbie, tika
> Attachments: 
> 1B7A7AE8FE999D22E2A677EFDA38982C8957CF77BEF3371E48852F7D67A7, 
> 1DE811ACAB8432D526EFE9D941E5EFE58F3C89F1AAB6CB7152091961DD854431, 
> 4600B9FF184F6AB71AA0CF6873E580FB0A31D75CE1218998057E9A185A5FFBB2, 
> 5E5892EA6C2B4A07BE998403A04127C7924E5539DB3EB0D27B9BD34D11A1575B, 
> CA3065B754E6CE79E4BF128464F4A202B0F2CF0336FBE73FA33F13776CD01CE8, 
> F036789D92EE18032556D9D0ECAC75073CED52226E1833001E379740E23E183D, 
> F33BFE4B1AF562D40E5B9D9F5D4B34EA6734F8F3A06F99535F100F957958D9BA, 
> F47F833BFD4A7E55C128DD76DB3666EEFFD0F5EDA24BF31D6F2427BA092D, 
> FA9D1D2B8D0FB50CFE306FA6024EC48BD771562878B9B70D38D106DF4E61147A
>
>
> Unable to detect a malformed ogg file. The error thrown was 
> Exception in thread "main" java.io.IOException: Asked to read 4335 bytes from 0 but hit EoF at 780
>     at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:39)
>     at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:31)
>     at org.gagravarr.ogg.OggPage.<init>(OggPage.java:82)
>     at org.gagravarr.ogg.OggPacketReader.getNextPacket(OggPacketReader.java:116)
>     at org.gagravarr.tika.OggDetector.detect(OggDetector.java:97)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
>     at org.apache.tika.cli.TikaCLI$10.process(TikaCLI.java:291)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> [xdatadeploy@xdata upload]$





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Pascal Essiembre (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148914#comment-15148914
 ] 

Pascal Essiembre commented on TIKA-1607:


In the case of XFA forms, the form IS the content. 

One issue I can see with not extracting XFA text as part of the content by 
default is that the generic message inserted by PDF/XFA editors will be 
extracted as if it were legitimate content:

{noformat}
Please wait... 
  
If this message is not eventually replaced by the proper contents of the 
document, your PDF 
viewer may not be able to display this type of document. 
  
You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or 
Linux® by 
visiting  http://www.adobe.com/go/reader_download. 
  
For more assistance with Adobe Reader visit  http://www.adobe.com/go/acrreader. 
  
Windows is either a registered trademark or a trademark of Microsoft 
Corporation in the United States and/or other countries. Mac is a trademark 
of Apple Inc., registered in the United States and other countries. Linux is 
the registered trademark of Linus Torvalds in the U.S. and other 
countries.
{noformat}

That message is an example of what PDF/XFA editors insert as a workaround for 
PDF viewers that do not support XFA; it is not genuine content published by 
the author (the XFA forms are).

I'll support whichever way you pick, but I personally can't see use cases where 
extracting that workaround message is the intent when using Tika.  I do see 
value in keeping the entire DOM though.  Maybe you can do as you suggest, but 
"in addition" to returning the XFA text as the content?



[jira] [Commented] (TIKA-1856) Error while parsing an ogg file

2016-02-16 Thread Yash Tanna (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148861#comment-15148861
 ] 

Yash Tanna commented on TIKA-1856:
--

The files are part of the TREC Dynamic Domain Polar Dataset, which was 
collected by [~chrismattmann] and his students.



[jira] [Commented] (TIKA-1856) Error while parsing an ogg file

2016-02-16 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148629#comment-15148629
 ] 

Nick Burch commented on TIKA-1856:
--

Picking one of those files to look at, {{oggz-info}} processes it without 
warning. {{ogginfo}} warns about the EOS being missing on both streams, but 
otherwise gives no errors.

Trying with mplayer, it reports some issues with the file:
{code}
[vorbis @ 0x7f1470f5cb00]partition out of bounds: type, begin, end, size, 
blocksize: 2, 0, 192, 16, 1024
[vorbis @ 0x7f1470f5cb00] Vorbis setup header packet corrupt (residues). 
[vorbis @ 0x7f1470f5cb00]Setup header corrupt.
Could not open codec.
{code}

Do you know where these files came from? It looks like they have been truncated 
somehow; could that be the case? 

(If so, we'd probably just need to improve the truncation error handling)
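
One possible shape for that handling, as a sketch only: treat an {{IOException}} raised while probing as "could not detect" and fall back to a generic type instead of letting it propagate. {{TruncationSketch}} and {{Probe}} are hypothetical names, not Tika or Ogg library APIs, and catching every {{IOException}} would also hide genuine I/O failures, so a real fix would likely be narrower.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch: a detector wrapper that tolerates truncated input. */
final class TruncationSketch {

    /** Stand-in for a format-specific probe that may hit EOF on truncated files. */
    interface Probe {
        String detect(InputStream in) throws IOException;
    }

    static String detectOrFallback(Probe probe, InputStream in, String fallback) {
        try {
            return probe.detect(in);
        } catch (IOException e) {
            // e.g. "Asked to read 4335 bytes from 0 but hit EoF at 780":
            // report the fallback type rather than failing the whole run.
            return fallback;
        }
    }
}
```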



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148608#comment-15148608
 ] 

Tim Allison commented on TIKA-1607:
---

I'd like to turn something like the above "thought" into a proposal...

Over on TIKA-1857, [~pascal.essiembre] has opened a pull request to strip XFA 
contents into the ContentHandler for PDFDocuments.

It might be more elegant to store the XFA in the metadata object and let 
consumers process that stream.

Would anyone object to adding two new {{ValueType}}s of Property: BYTES and 
DOM?

Both would be stored as String values (base-64 encoded {{byte[]}}) in the 
regular {{Metadata}} object.

Similar to what we're doing with {{getDate()}} in the {{Metadata}} object, 
we'd add a {{getBytes(Property binaryProperty)}} that would return a decoded 
{{byte[]}}, and we could also add a {{getDOM(Property domProperty)}} that would 
return a {{org.w3c.dom.Document}}.
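
A minimal sketch of how the base-64 round trip described above might look. This is not Tika code: the class below is a hypothetical stand-in for {{Metadata}} backed by a plain String map, it uses String keys rather than {{Property}} objects, and it assumes {{java.util.Base64}} (Java 8) is available.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical stand-in for Metadata: values are stored as Strings only.
final class BinaryMetadataSketch {
    private final Map<String, String> values = new HashMap<>();

    // Store raw bytes as a base-64 encoded String value.
    void setBytes(String key, byte[] raw) {
        values.put(key, Base64.getEncoder().encodeToString(raw));
    }

    // What existing String-only consumers would still see.
    String get(String key) {
        return values.get(key);
    }

    // Decode the stored String back into bytes; null if the key is absent.
    byte[] getBytes(String key) {
        String encoded = values.get(key);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    // Decode the bytes and parse them as XML; null if the key is absent.
    Document getDOM(String key) {
        byte[] raw = getBytes(key);
        if (raw == null) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid external-entity tricks.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(raw));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Value is not parseable XML: " + key, e);
        }
    }
}
```

With something like this, a parser could call {{setBytes}} with the XFA stream, and a consumer could pick either {{getBytes}} or {{getDOM}} depending on how much processing it wants to do itself.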

We could also store raw XMP by this mechanism.

Is this a reasonable first (half) step towards this issue?  Any objections?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the  Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
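
For illustration only, the String-to-Object idea in the description above might be modeled along these lines. The class and method names are hypothetical, the "number" key is an assumption, and the values are the ones quoted in the issue; this is not the attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of metadata whose values are arbitrary Objects.
final class ObjectMetadataSketch {

    // One phone number with its attributes, kept in insertion order.
    static Map<String, String> phone(String number, String countryCode, String numberType) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("number", number); // "number" is an assumed key name
        entry.put("LibPN-CountryCode", countryCode);
        entry.put("LibPN-NumberType", numberType);
        return entry;
    }

    // Metadata: String -> Object, where the Object here is a List of entries.
    static Map<String, Object> build() {
        List<Map<String, String>> phoneNumbers = new ArrayList<>();
        phoneNumbers.add(phone("+162648743476", "US", "International"));
        phoneNumbers.add(phone("+1292611054", "UK", "International"));

        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", phoneNumbers);
        return metadata;
    }
}
```

The backwards-compatibility concern is visible even here: a caller expecting a String[] for "phonenumbers" would now have to cast and inspect an Object.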





[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148608#comment-15148608
 ] 

Tim Allison edited comment on TIKA-1607 at 2/16/16 1:46 PM:


I'd like to turn something like the above "thought" into a proposal...

Over on TIKA-1857, [~pascal.essiembre] has opened a pull request to strip XFA 
contents into the ContentHandler for PDFDocuments.

It might be more elegant to store the XFA in the metadata object and let 
consumers process that stream.

Would anyone object to adding two new {{ValueType}} s of Property: BYTES and 
DOM?

Both would be stored as String values (base-64 encoded {{byte[]}}) in the 
regular {{Metadata}} object.

Similar to what we're doing with {{getDate()}} in the {{Metadata}} object, 
we'd add a {{getBytes(Property binaryProperty)}} that would return a decoded 
{{byte[]}}, and we could also add a {{getDOM(Property domProperty)}} that would 
return a {{org.w3c.dom.Document}}.

We could also store raw XMP by this mechanism.

Is this a reasonable first (half) step towards this issue?  Any objections?


was (Author: talli...@mitre.org):
I'd like to turn something like the above "thought" into a proposal...

Over on TIKA-1857, [~pascal.essiembre] has opened a pull request to strip XFA 
contents into the ContentHandler for PDFDocuments.

It might be more elegant to store the XFA in the metadata object and let 
consumers process that stream.

Would anyone object to adding two new {{ValueType}}s of Property: BYTES and 
DOM?

Both would be stored as String values (base-64 encoded {{byte[]}}) in the 
regular {{Metadata}} object.

Similar to what we're doing with {{getDate()}} in the {{Metadata}} object, 
we'd add a {{getBytes(Property binaryProperty)}} that would return a decoded 
{{byte[]}}, and we could also add a {{getDOM(Property domProperty)}} that would 
return a {{org.w3c.dom.Document}}.

We could also store raw XMP by this mechanism.

Is this a reasonable first (half) step towards this issue?  Any objections?



