[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

Maruan Sahyoun (JIRA) Thu, 18 Feb 2016 23:33:41 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153851#comment-15153851
 ]


Maruan Sahyoun commented on TIKA-1857:
--------------------------------------

Sorry for my delay in answering your question.

May I propose the following strategy:

a) for static XFA, if there is datasets.data, use that content for the field 
values; otherwise extract them from the AcroForm.
b) for dynamic XFA, scrape/extract the info from the XFA.

Why a different proposal for a) from yours? Adobe Reader/Acrobat use the 
information from datasets.data for the field value over the possibly differing 
content in the AcroForm (the two might differ if the form was filled out with an 
XFA-aware processor and afterwards amended with a non-XFA-aware processor).
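
To make a) concrete, here is a rough sketch of the precedence rule using only 
JDK classes. It is not the Tika or PDFBox implementation. The class name 
XfaFieldValueResolver and the method resolve are hypothetical, and it assumes 
the caller has already obtained the XFA datasets packet as an XML string and 
the AcroForm values as a map. Matching a data leaf to an AcroForm field by its 
simple element name is a simplifying assumption; real forms use fully 
qualified field names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating "datasets.data wins over AcroForm".
final class XfaFieldValueResolver {

    // Namespace of the XFA datasets packet (xfa:datasets / xfa:data).
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    // Returns the AcroForm values, overridden by any leaf found under xfa:data.
    static Map<String, String> resolve(String datasetsXml,
                                       Map<String, String> acroFormValues) {
        Map<String, String> result = new LinkedHashMap<>(acroFormValues);
        if (datasetsXml == null || datasetsXml.isEmpty()) {
            return result; // no datasets packet: fall back to the AcroForm
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid XXE on untrusted PDFs.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(datasetsXml)));
            NodeList data = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
            if (data.getLength() > 0) {
                collectLeaves((Element) data.item(0), result);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XFA datasets", e);
        }
    }

    // Walks the data tree; every element without element children is a value.
    private static void collectLeaves(Element parent, Map<String, String> out) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (!(node instanceof Element)) {
                continue;
            }
            Element element = (Element) node;
            if (hasElementChild(element)) {
                collectLeaves(element, out);
            } else {
                // Assumption: leaf name equals the AcroForm field name.
                out.put(element.getLocalName(), element.getTextContent().trim());
            }
        }
    }

    private static boolean hasElementChild(Element element) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                return true;
            }
        }
        return false;
    }
}
```

Fields that exist only in the AcroForm keep their AcroForm value, so a form 
amended by a non-XFA-aware processor still yields text for those fields.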

> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
