[jira] [Resolved] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

Tim Allison (JIRA) Tue, 01 Mar 2016 18:27:13 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison resolved TIKA-1857.
-------------------------------
    Resolution: Fixed

[~pascal.essiembre], thank you for this pull request!  I made a few 
modifications, but we now have basic XFA processing, thanks to you.  To obtain 
the XFA-only behavior, you'll need to do something like this:

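{noformat}
        // needs org.apache.tika.parser.ParseContext and
        // org.apache.tika.parser.pdf.PDFParserConfig
        ParseContext context = new ParseContext();
        PDFParserConfig config = new PDFParserConfig();
        // if the PDF contains an XFA form, extract only the XFA content
        config.setIfXFAExtractOnlyXFA(true);
        context.set(PDFParserConfig.class, config);
        // then pass context to parser.parse(stream, handler, metadata, context)
{noformat}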

[~msahyoun], thank you, again, for helping me understand XFA and AcroForms!

For posterity, here are some areas for improvement in XFA parsing:
 *     handle metadata stored in <desc> section (govdocs1: 754282.pdf, 
982106.pdf)
 *     handle pdf metadata (access permissions, etc.) in <pdf> element
 *     extract different types of uris as metadata
 *     add extraction of <image> data (govdocs1: 754282.pdf)
 *     add computation of traversal order for fields
 *     figure out when text extracted from xfa fields is duplicative of that
       extracted from the rest of the pdf...and do this efficiently and quickly
 *     avoid duplication with <speak> and <tooltip> elements
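
To make the last item concrete, here is a rough sketch of one way the <speak>/<tooltip> duplication could be avoided. It is not the code that was committed to Tika. The class XfaTextSketch and its extractText method are hypothetical names, the XML fragment is a made-up example that assumes the usual XFA layout of <caption>, <assist> and <value> under a <field>, and only JDK XML classes are used:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Tika.
final class XfaTextSketch {

    // Returns the text of an XFA-like XML fragment, skipping <speak> and
    // <toolTip> subtrees, whose content often repeats the visible caption.
    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted input cannot pull in entities.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> parts = new ArrayList<>();
            collect(root, parts);
            return String.join(" ", parts);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XFA fragment", e);
        }
    }

    // Depth-first walk in document order, collecting non-empty text nodes.
    private static void collect(Node node, List<String> parts) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                parts.add(text);
            }
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getLocalName();
            // Case-insensitive match, so both "tooltip" and "toolTip" are skipped.
            if ("speak".equalsIgnoreCase(name) || "toolTip".equalsIgnoreCase(name)) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, parts);
        }
    }

    public static void main(String[] args) {
        String xml = "<field><caption><value><text>City</text></value></caption>"
                + "<assist><speak>Enter your city</speak><toolTip>City</toolTip></assist>"
                + "<value><text>Boston</text></value></field>";
        System.out.println(extractText(xml)); // prints: City Boston
    }
}
```

Dropping these two elements outright is the bluntest option. A real implementation might prefer to keep their text when no caption is present, and this sketch does nothing about the traversal-order or image items above.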


> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
