[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

Tim Allison (JIRA) Tue, 16 Feb 2016 12:26:01 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149171#comment-15149171
 ]


Tim Allison edited comment on TIKA-1857 at 2/16/16 8:09 PM:
------------------------------------------------------------

I've only looked at a handful of files that contain XFA; this metadata is 
entirely new to me.  The files I've looked at come from govdocs1 and are fairly 
old by now.

In the attached {{041617_filled_out.pdf}}, I've added content to the forms and 
saved the document.

With the patch, I'm getting all of the boilerplate from the XFA extraction, but 
I'm not getting any of the entered form content because it isn't in 
{{<(speak|text|exData)>}} elements.  However, with our old code, I am seeing 
the entered data, e.g. {{my_exhibitor}}.
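
To illustrate the gap, here is a minimal sketch, not the patch itself.  The 
pattern is an assumption modeled on the element names above, and the sample 
XML, including the {{exhibitor}} element name, is made up; it simply assumes 
the entered value sits in an element the regex does not target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XfaRegexSketch {
    // Assumed pattern (NOT the patch's actual regex): text-only content of
    // <speak>, <text> or <exData> elements, attributes allowed on the open tag.
    private static final Pattern TEXT_ELEMENTS =
            Pattern.compile("<(speak|text|exData)\\b[^>]*>([^<]*)</\\1>");

    static List<String> extract(String xfa) {
        List<String> out = new ArrayList<>();
        Matcher m = TEXT_ELEMENTS.matcher(xfa);
        while (m.find()) {
            String s = m.group(2).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical fragment: a boilerplate label plus an entered value
        // stored in an arbitrarily named element.
        String xfa = "<draw><value><text>Exhibitor name:</text></value></draw>"
                + "<exhibitor>my_exhibitor</exhibitor>";
        System.out.println(extract(xfa)); // the label is found, my_exhibitor is not
    }
}
```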

Is this PDF storing the contents of the form in both the XFA _and_ in the 
traditional AcroForm?

I imagine that won't happen in all PDFs; will some store the data in only one 
or the other?

To avoid duplication of content, do we want to skip processing of AcroForm data 
if XFA exists?  Will we miss anything?
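
A rough sketch of the trade-off, with the extraction itself left out.  The 
{{Policy}} enum and {{merge}} method are hypothetical names, not existing Tika 
options, and the "dedupe" branch only drops exact string matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class FormTextMergeSketch {
    /** Hypothetical switch; not an existing Tika configuration option. */
    enum Policy { SKIP_ACROFORM_IF_XFA, BOTH_DEDUPED }

    static List<String> merge(List<String> xfaText,
                              List<String> acroFormValues,
                              Policy policy) {
        if (xfaText.isEmpty()) {
            // No XFA text at all: fall back to the AcroForm values.
            return new ArrayList<>(acroFormValues);
        }
        if (policy == Policy.SKIP_ACROFORM_IF_XFA) {
            return new ArrayList<>(xfaText);
        }
        // Keep XFA order, then append AcroForm values not already seen
        // (exact string match only).
        LinkedHashSet<String> merged = new LinkedHashSet<>(xfaText);
        merged.addAll(acroFormValues);
        return new ArrayList<>(merged);
    }
}
```

If the XFA pass yields only a label such as "Exhibitor name:" while the 
AcroForm holds {{my_exhibitor}}, the first policy drops the entered value; 
that is the "will we miss anything?" case.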

The other major question: I like the narrow focus that the current regexes 
yield, but why wouldn't we want to run our HtmlParser or our DcXMLParser 
against the bytes and pull everything out?  We'd have to skip inline/embedded 
images or handle those properly at some point, but are there any other reasons 
not to?
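
As a sketch of the "pull everything out" alternative, using the JDK's SAX 
parser rather than Tika's own parsers: it collects all character data and 
skips anything inside an {{image}} element.  Treating {{image}} as the only 
element to skip is an assumption, and the sample XML is invented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class XfaSaxSketch {
    static List<String> allText(byte[] xfaBytes) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        final List<String> out = new ArrayList<>();
        parser.parse(new ByteArrayInputStream(xfaBytes), new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private int skipDepth = 0; // > 0 while inside an <image> element

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                flush();
                if (skipDepth > 0 || "image".equals(local)) {
                    skipDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                flush();
                if (skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (skipDepth == 0) {
                    buf.append(ch, start, length);
                }
            }

            // Emit buffered text at every element boundary.
            private void flush() {
                String s = buf.toString().trim();
                if (!s.isEmpty()) {
                    out.add(s);
                }
                buf.setLength(0);
            }
        });
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<xdp><template><draw><value><text>Exhibitor name:</text>"
                + "</value></draw><image>iVBORw0KGgo=</image></template>"
                + "<datasets><data><exhibitor>my_exhibitor</exhibitor></data>"
                + "</datasets></xdp>";
        System.out.println(allText(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

On the invented sample this picks up both the label and the entered value 
while dropping the base64 image payload, at the cost of also picking up any 
other text in the stream.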

[~tilman], have you worked with XFA?  Any recommendations for pulling as much 
info as we can without duplication?

We could make this configurable, of course. :)




> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
