[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149014#comment-15149014
 ] 

Tim Allison commented on TIKA-1857:
-----------------------------------

From TIKA-1607's 
[comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

Looking at a few of the govdocs1 documents with XFA data, it appears that the 
form also contains the PDF's standard metadata (author, etc.), which is not 
necessarily stored in the older mechanism, the COSDictionary.

bq. I'll support whichever way you pick, but I personally can't see use cases 
where extracting that workaround message is the intent when using Tika. I do 
see value in keeping the entire DOM though. Maybe you can do as you suggest, 
but "in addition" to returning the XFA text as the content?

Yes, that would be in addition.  Thank you, again.



> Enhance PDFParser to extract text from XFA forms
> ------------------------------------------------
>
>                 Key: TIKA-1857
>                 URL: https://issues.apache.org/jira/browse/TIKA-1857
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Pascal Essiembre
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.13
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
