[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Lui updated TIKA-1857:
------------------------------
    Attachment: doc8.pdf

I cannot copy the file out of the secured environment, but this is a file I found on the Internet that has the same issue, and I used it to test my PDFBox script as well.

> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, doc8.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA). Information about XFA:
> https://en.wikipedia.org/wiki/XFA

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1857:
-----------------------------
    Priority: Major  (was: Trivial)
[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1857:
-----------------------------
    Attachment: govdocs1_xfas.zip

194 XFAs from govdocs1, as exported with PDFBox 2.0 (trunk, built within the last few weeks).
[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1857:
-----------------------------
    Attachment: 041617_filled_out.pdf

I've only looked at a handful of files that contain XFA...this metadata is entirely new to me. The files I've looked at come from govdocs1 and are fairly old by now.

In the attached file, I've added content to the forms and saved the document. With the patch, I'm getting all of the boilerplate from the XFA extraction, but I'm not getting any of the content entered into the form because it isn't in {{<(speak|text|exData)>}} elements. However, with our old code, I am seeing the entered data, e.g. {{my_exhibitor}}.

Is this PDF storing the contents of the form in both the XFA _and_ the traditional AcroForm? I imagine that won't happen in all PDFs, and that some will store it in only one or the other?

To avoid duplication of content, do we want to skip processing of AcroForm data if XFA exists? Will we miss anything?

[~tilman], have you worked with XFA? Any recommendations for pulling as much info as we can without duplication?

We could make this configurable, of course. :)
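To make the question in the comment above concrete, here is a minimal Python sketch of the two ideas it raises: collecting text from {{speak}}, {{text}} and {{exData}} elements in an XFA packet, and a configurable switch for skipping AcroForm values when XFA text is present. It is not the patch attached to this issue and not Tika or PDFBox code. The function names, the {{skip_acroform_if_xfa}} flag, and the sample XML are all hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Element names taken from the comment above; namespaces are ignored.
TEMPLATE_TEXT_TAGS = {"speak", "text", "exData"}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def xfa_template_text(xfa_xml):
    """Collect text inside <speak>, <text> and <exData> elements, in document order."""
    found = []
    for element in ET.fromstring(xfa_xml).iter():
        if local_name(element.tag) in TEMPLATE_TEXT_TAGS:
            # itertext() also picks up text in nested markup (e.g. HTML in exData).
            text = " ".join("".join(element.itertext()).split())
            if text:
                found.append(text)
    return found


def combine_form_text(xfa_text, acroform_text, skip_acroform_if_xfa=False):
    """Merge both sources, dropping exact duplicates.

    skip_acroform_if_xfa is a hypothetical option: when set and XFA text
    exists, AcroForm values are ignored entirely.
    """
    if skip_acroform_if_xfa and xfa_text:
        return list(xfa_text)
    combined = []
    for item in list(xfa_text) + list(acroform_text):
        if item not in combined:
            combined.append(item)
    return combined


# Made-up XFA fragment, not taken from any attachment on this issue.
SAMPLE_XFA = """<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">
  <template xmlns="http://www.xfa.org/schema/xfa-template/2.8/">
    <subform>
      <draw><value><text>Exhibitor name</text></value></draw>
      <field name="exhibitor">
        <assist><speak>Enter the exhibitor name</speak></assist>
      </field>
    </subform>
  </template>
</xdp:xdp>"""

if __name__ == "__main__":
    boilerplate = xfa_template_text(SAMPLE_XFA)
    print(combine_form_text(boilerplate, ["my_exhibitor"]))
```

With the flag left off, the entered value {{my_exhibitor}} from the AcroForm side is kept alongside the XFA boilerplate. With the flag on, it would be dropped, which is exactly the "will we miss anything?" risk raised in the comment.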
[jira] [Updated] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1857:
-----------------------------
    Attachment: xfa_in_govdocs1.txt

List of PDFs in govdocs1 that have a non-null PDXFAResource object, found with PDFBox 2.0 trunk.