[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-1857.
-------------------------------
    Resolution: Fixed

[~pascal.essiembre], thank you for this pull request! I made a few modifications, but we now have basic XFA processing, thanks to you.

To obtain the XFA-only behavior, you'll need to do something like this:
{noformat}
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
config.setIfXFAExtractOnlyXFA(true);
context.set(PDFParserConfig.class, config);
{noformat}

[~msahyoun], thank you, again, for helping me understand XFA and Acroforms!

For posterity, here are some areas for improvement in XFA parsing:
* handle metadata stored in <desc> section (govdocs1: 754282.pdf, 982106.pdf)
* handle pdf metadata (access permissions, etc.) in <pdf> element
* extract different types of uris as metadata
* add extraction of <image> data (govdocs1: 754282.pdf)
* add computation of traversal order for fields
* figure out when text extracted from xfa fields is duplicative of that extracted from the rest of the pdf...and do this efficiently and quickly
* avoid duplication with <speak> and <tooltip> elements

> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>         Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA). Information about XFA:
> https://en.wikipedia.org/wiki/XFA

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)