[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888179#comment-15888179 ]

Tim Allison commented on TIKA-1857:
---
I pushed the fix to our new repo. Let me know if that fixes this issue. Thank you.

> Enhance PDFParser to extract text from XFA forms
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, doc8.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt
>
> Extract text from PDF Forms (XFA). Information about XFA: https://en.wikipedia.org/wiki/XFA

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887106#comment-15887106 ]

Tim Allison commented on TIKA-1857:
---
{noformat}
IT IS EASY 0123456789 JUST TRY DUDE 2015-02-19 PL DO YOUR OWN JOB DON'T EXPECT ME TO DO IT! IT'S XML! 012345678 READ THE DOCUMENTATION DUDE LEARN BEFORE YOU CODE
{noformat}
Is now extracted as:
{noformat}
Nazwa pełna: IT IS EASY
REGON:
REGON:
REGON:
Nazwisko: DUDE
ImiePierwsze: JUST TRY
DataUrodzenia: 2015-02-19
PESEL:
Numer Identyfikacji Podatkowej:
Numer PESEL:
Kraj:
KodKraju: PL
Województwo: DO YOUR OWN JOB
Powiat: DON'T EXPECT ME TO DO IT!
Gmina: IT'S XML!
Ulica:
Nr domu: 012345678
Nr lokalu:
Miejscowość: READ THE DOCUMENTATION
Kod pocztowy: DUDE
Poczta: LEARN BEFORE YOU CODE
{noformat}
Once our git is back up and running, I'll push the fix. Thank you for raising this issue and sharing a triggering document.
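For readers following along, here is a minimal sketch of how "label: value" lines like the ones in this comment could be produced by pairing each field's caption with its value from the data section. It is not the code from the actual fix; the class name, method name, and the field keys ({{f1}}, {{f2}}, {{f3}}) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Tika's XFAExtractor: pairs field captions with data values.
final class XfaMergeSketch {

    /** Emits one "caption: value" line per field, in template order. */
    static List<String> merge(Map<String, String> captionsByField,
                              Map<String, String> valuesByField) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> field : captionsByField.entrySet()) {
            // A field with no entry in the data section is emitted with an empty value.
            String value = valuesByField.getOrDefault(field.getKey(), "");
            String caption = field.getValue();
            lines.add(value.isEmpty() ? caption + ":" : caption + ": " + value);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Field keys f1..f3 are made up; captions and values are taken from the comment above.
        Map<String, String> captions = new LinkedHashMap<>();
        captions.put("f1", "Nazwisko");
        captions.put("f2", "ImiePierwsze");
        captions.put("f3", "Ulica");
        Map<String, String> values = new HashMap<>();
        values.put("f1", "DUDE");
        values.put("f2", "JUST TRY");
        merge(captions, values).forEach(System.out::println);
    }
}
```

Keeping the captions in a LinkedHashMap preserves the order in which the fields appear, which is what makes the merged output readable as a form.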
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865598#comment-15865598 ]

Tim Allison commented on TIKA-1857:
---
Are you able to submit the triggering document? If not, are you able to share it personally?
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864963#comment-15864963 ]

Kenneth Lui commented on TIKA-1857:
---
Hi, I tried to use this feature but it doesn't seem to work. I understand this is not the right place to ask troubleshooting questions, so I put the details at http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content . Could you please help me determine whether I misconfigured Tika or whether this is an issue with the feature implementation? Thanks!
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235261#comment-15235261 ]

Tim Allison commented on TIKA-1857:
---
[~pascal.essiembre], we may be headed towards a release of 1.13 within the month (ish). Will the current update meet your needs? Thank you, again, for your patch!
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175818#comment-15175818 ]

Hudson commented on TIKA-1857:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #919 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/919/])
Fix for side effect of TIKA-1857-- javax.xml.stream is no longer (tallison: rev 9a1ba9494cf2a786e4615f0d72ca5f7c303840fa)
* tika-bundle/pom.xml
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174940#comment-15174940 ]

Hudson commented on TIKA-1857:
--
UNSTABLE: Integrated in tika-2.x #41 (See [https://builds.apache.org/job/tika-2.x/41/])
TIKA-1857: add basic XFA extraction via Pascal Essiembre. (tallison: rev f1e4ebdb422d24b7080d02620f3c38f6dda57910)
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* CHANGES.txt
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-test-resources/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174937#comment-15174937 ]

Hudson commented on TIKA-1857:
--
UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/916/])
TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)
* tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: rev 7c245fa87507cf0887838001c54c65b79b7e7cbc)
* CHANGES.txt
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174859#comment-15174859 ]

ASF GitHub Bot commented on TIKA-1857:
--
Github user asfgit closed the pull request at: https://github.com/apache/tika/pull/74
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169395#comment-15169395 ]

Tim Allison commented on TIKA-1857:
---
I implemented a first-attempt XFA scraper with StAX; this pulls the content from the fields that Pascal identified into the ContentHandler, and it merges the "values" from the data section with the fields section.

Currently, if XFA exists, I process that and skip the AcroForm data. I'm not certain what the best path is for ignoring/processing content extracted from the "regular" PDF if there is XFA data. For now, I'm also processing the contents of the rest of the PDF. I'm more averse to losing data than to duplication because my main use case is search...but I realize this will be really frustrating to users who want "just one copy" of the content.

In looking at the PDFs with XFA data in govdocs1, it looks like there would be lost content in _some_ files if we processed only the XFA and did not do the regular text extraction. On the other hand, for most of the files I examined, it looked like the content is entirely duplicative -- [~pascal.essiembre]'s point above.

I propose adding a parameter to the PDFParserConfig along the lines of {{ifXFAExistsProcessItAlone}}...this would allow the behavior of Pascal's patch. I propose that the default be set to "false", erring on the side of extracting more content at the cost of duplication. Is this ok? Or, is there an easy way to determine if regular content is entirely duplicative of XFA content?
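To make the proposal in this comment concrete, here is a small sketch of which extraction passes would run for a given document. It only models the behavior described above and is not Tika's implementation; the class, enum, and method names are hypothetical, and the no-XFA branch (regular text plus AcroForm) is an assumption, since the comment only describes the case where XFA exists.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the proposed ifXFAExistsProcessItAlone flag; not Tika code.
final class XfaPassesSketch {

    enum Pass { XFA, ACROFORM, PAGE_TEXT }

    static Set<Pass> passesFor(boolean xfaExists, boolean ifXFAExistsProcessItAlone) {
        if (!xfaExists) {
            // Assumption: without XFA, fall back to regular text plus AcroForm fields.
            return EnumSet.of(Pass.PAGE_TEXT, Pass.ACROFORM);
        }
        if (ifXFAExistsProcessItAlone) {
            // The behavior of Pascal's patch: XFA only, no duplication.
            return EnumSet.of(Pass.XFA);
        }
        // Proposed default ("false"): XFA plus regular text, AcroForm skipped.
        return EnumSet.of(Pass.XFA, Pass.PAGE_TEXT);
    }

    public static void main(String[] args) {
        System.out.println(passesFor(true, false));
        System.out.println(passesFor(true, true));
    }
}
```

The default branch deliberately accepts duplicated text in exchange for not losing content, which matches the search-oriented use case named in the comment.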
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154293#comment-15154293 ]

Tim Allison commented on TIKA-1857:
---
Ha. Sorry. Figured that was a typo. We'll still have it around for a while to process though. :) Thank you, again.
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154290#comment-15154290 ]

Maruan Sahyoun commented on TIKA-1857:
--
XFA is not deprecated in PDFBox. It will be deprecated in the PDF 2.0 specification (as it currently stands).
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154274#comment-15154274 ]

Tim Allison commented on TIKA-1857:
---
Doh! Sorry. I was looking at PDXFAResource. Thank you, again.

bq. PDF 2.0 as there XFA is deprecated

Oh, no...I guess we could copy/paste from the current PDFBox if XFA goes away in PDFBox...less than ideal. I don't see deprecation tags in PDXFAResource or PDAcroForm's {{getXFA()}}...which XFA handling might go away?
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154260#comment-15154260 ]

Maruan Sahyoun commented on TIKA-1857:
--
{quote}
Do I understand correctly then: no matter whether static or dynamic, try to pull data from XFA; if that doesn't exist, fall back to the AcroForm?
{quote}
If you'd like to replicate Adobe Reader/Acrobat behavior - yes. BTW don't know what will happen with PDF 2.0 as there XFA is deprecated which might have an implication for future versions.

{quote}
Also, is there an obvious way to determine static vs. dynamic aside from checking to see if there are fields in the AcroForm?
{quote}
There is {{PDAcroForm.xfaIsDynamic()}}, which will give you the information (it checks if there is XFA and no AcroForm fields).
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154162#comment-15154162 ]

Tim Allison commented on TIKA-1857:
---
No problem at all. I think this will take some time for me to get right...there's no rush. :)

Do I understand correctly then: no matter whether static or dynamic, try to pull data from XFA; if that doesn't exist, fall back to the AcroForm?

Also, is there an obvious way to determine static vs. dynamic aside from checking to see if there are fields in the AcroForm?

Thank you, again!
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153851#comment-15153851 ]

Maruan Sahyoun commented on TIKA-1857:
--
Sorry for my delay in answering your question. May I propose the following strategy:

a) for static XFA, if there is datasets.data, use that content for the field values; otherwise extract from the AcroForm.
b) for dynamic XFA, scrape/extract info from the XFA.

Why a different proposal for a) from yours? Adobe Reader/Acrobat use the information from datasets.data for the field value over the possibly differing content in the AcroForm (which might happen if the form has been filled out with an XFA-aware processor and afterwards was amended with a non-XFA-aware processor).
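The strategy in this comment can be written down as a small decision function. This is only a sketch of the proposal as stated, using plain booleans instead of PDFBox objects so that it stays self-contained; the class, enum, and parameter names are hypothetical, and the no-XFA branch is an assumption added for completeness.

```java
// Hypothetical sketch of the a)/b) strategy above; not PDFBox or Tika code.
final class XfaSourceSketch {

    enum Source { XFA_DATASETS, ACROFORM, XFA_SCRAPE, NONE }

    static Source choose(boolean hasXfa, boolean hasAcroFormFields, boolean hasDatasetsData) {
        if (!hasXfa) {
            // Assumption: without XFA, only the AcroForm (if any) can supply values.
            return hasAcroFormFields ? Source.ACROFORM : Source.NONE;
        }
        if (!hasAcroFormFields) {
            // b) dynamic XFA: XFA entry and no AcroForm fields, so scrape the XFA.
            return Source.XFA_SCRAPE;
        }
        // a) static XFA: prefer datasets.data, otherwise fall back to the AcroForm.
        return hasDatasetsData ? Source.XFA_DATASETS : Source.ACROFORM;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, true));   // static form with data
        System.out.println(choose(true, false, true));  // dynamic form
    }
}
```

Preferring datasets.data for static forms mirrors the Adobe Reader/Acrobat behavior described in the comment, where the XFA data wins over a possibly stale AcroForm value.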
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149736#comment-15149736 ]

Tim Allison commented on TIKA-1857:
---
This is great. Thank you! So, to get the best coverage for extracted content, should we do the following:

Check for fields in the AcroForm.
a) If those exist (Static XFA), use the content extracted from the AcroForm and ignore the XFA
b) If they don't exist (Dynamic XFA), scrape/extract info from the XFA

In your experience, will we miss any info if we ignore the XFA for Static XFAs and rely solely on the AcroForm?
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149405#comment-15149405 ]

Maruan Sahyoun commented on TIKA-1857:
--
The reason you are not getting the data is that it is stored as part of the data node in an XML data structure which matches the binding information in the field. That data is in {{xfa.datasets.data}}, with the {{my_exibitor}} value stored in the {{Exhibitorname}} field. Extracting {{speak|text|exData}} will give you the boilerplate text but not the field value.

Now there are two types of XFA forms - static and dynamic. Static XFA forms will have an XFA entry and AcroForm fields. Dynamic XFA forms will only have an XFA entry and no AcroForm fields.

When a static XFA form is filled out with an XFA-aware PDF processor, both the {{xfa.datasets.data}} information and the {{V}} entry of the AcroForm form field are updated. If you fill out a static form with a non-XFA-aware PDF processor, it will only see the AcroForm information and as a result only updates the AcroForm form field's {{V}} entry. When trying to fill a dynamic XFA form with a non-XFA-aware PDF processor, it will not see any form fields at all.

I'm happy to provide more information on that topic but thought that this would give you a first outline.
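As an illustration of where the field value lives, here is a minimal StAX sketch that collects leaf element values from the data node of an XFA datasets packet. It is not the extractor that was committed for this issue. The sample XML in main is made up to match the {{Exhibitorname}} / {{my_exibitor}} example above (the {{form1}} wrapper is hypothetical), and matching on the local name "data" alone, without checking the namespace, is a simplification.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: reads field values from the data node of an XFA datasets packet.
final class XfaDataSketch {

    /** Returns element name -> text for the innermost elements found inside the data node. */
    static Map<String, String> readDataValues(String datasetsXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // XFA packets come from untrusted PDFs, so disable DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        Map<String, String> values = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(datasetsXml));
            boolean inData = false;
            String currentName = null;          // most recently opened element inside data
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (!inData) {
                        inData = "data".equals(name);
                    } else {
                        currentName = name;     // a child start tag replaces the parent
                        text.setLength(0);
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && currentName != null) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if (inData && name.equals(currentName)) {
                        // Closed with no child elements in between: treat as a field value.
                        values.put(name, text.toString().trim());
                        currentName = null;
                    } else if ("data".equals(name)) {
                        inData = false;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XFA datasets XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
                + "<xfa:data><form1><Exhibitorname>my_exibitor</Exhibitorname></form1></xfa:data>"
                + "</xfa:datasets>";
        System.out.println(readDataValues(xml));
    }
}
```

Note that an empty container element would also be reported as an empty value by this sketch; a real extractor would need the binding information from the template to tell fields and containers apart.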
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149256#comment-15149256 ]

Tilman Hausherr commented on TIKA-1857:
---
Sorry, I have no experience with XFA. [~msahyoun] might know more.
[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149014#comment-15149014 ]

Tim Allison commented on TIKA-1857:
---
from TIKA-1607's [comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content.

Got it. Doh. Thank you. As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the form also contains the PDF's standard metadata...(author etc.) which is not necessarily stored in the older mechanism: COSDictionary.

bq. I'll support whichever way you pick, but I personally can't see use cases where extracting that workaround message is the intent when using Tika. I do see value in keeping the entire DOM though. Maybe you can do as you suggest, but "in addition" to returning the XFA text as the content?

Y, that would be in addition. Thank you, again.