[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888179#comment-15888179
 ] 

Tim Allison commented on TIKA-1857:
---

I pushed the fix to our new repo.  Let me know if that fixes this issue.  Thank 
you.

> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Pascal Essiembre
>  Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, doc8.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887106#comment-15887106
 ] 

Tim Allison commented on TIKA-1857:
---

{noformat}
IT IS EASY




0123456789
JUST TRY
DUDE
2015-02-19



PL
DO YOUR OWN JOB
DON'T EXPECT ME TO DO IT!
IT'S XML!
012345678
READ THE DOCUMENTATION
DUDE
LEARN BEFORE YOU CODE
{noformat}

Is now extracted as:
{noformat}
Nazwa pełna: 
IT IS EASY
REGON: 
REGON: 
REGON: 
Nazwisko: 
DUDE
ImiePierwsze: 
JUST TRY
DataUrodzenia: 
2015-02-19
PESEL: 
Numer Identyfikacji Podatkowej: 
Numer PESEL: 
Kraj: 
KodKraju: 


PL
Województwo: 
DO YOUR OWN JOB
Powiat: 
DON'T EXPECT ME TO DO IT!
Gmina: 
IT'S XML!
Ulica: 
Nr domu: 
012345678
Nr lokalu: 
Miejscowość: 
READ THE DOCUMENTATION
Kod pocztowy: 
DUDE
Poczta: 
LEARN BEFORE YOU CODE
{noformat}
Once our git is back up and running, I'll push the fix.  Thank you for raising 
this issue and sharing a triggering document.



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865598#comment-15865598
 ] 

Tim Allison commented on TIKA-1857:
---

Are you able to submit the triggering document?  If not, are you able to share 
it personally?



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-13 Thread Kenneth Lui (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864963#comment-15864963
 ] 

Kenneth Lui commented on TIKA-1857:
---

Hi, I tried to use this feature, but it doesn't seem to work. I understand this 
is not the right place for troubleshooting questions, so I put the details at 
http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content
 . Could you please help me determine whether I misconfigured Tika or whether 
this is an issue with the feature implementation? Thanks!



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-04-11 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235261#comment-15235261
 ] 

Tim Allison commented on TIKA-1857:
---

[~pascal.essiembre], we may be headed towards a release of 1.13 within the 
month (ish).  Will the current update meet your needs?  Thank you, again, for 
your patch!



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-03-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175818#comment-15175818
 ] 

Hudson commented on TIKA-1857:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #919 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/919/])
Fix for side effect of TIKA-1857-- javax.xml.stream is no longer (tallison: rev 
9a1ba9494cf2a786e4615f0d72ca5f7c303840fa)
* tika-bundle/pom.xml




[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-03-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174940#comment-15174940
 ] 

Hudson commented on TIKA-1857:
--

UNSTABLE: Integrated in tika-2.x #41 (See 
[https://builds.apache.org/job/tika-2.x/41/])
TIKA-1857: add basic XFA extraction via Pascal Essiembre. (tallison: rev 
f1e4ebdb422d24b7080d02620f3c38f6dda57910)
* 
tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* CHANGES.txt
* 
tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* 
tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* 
tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* 
tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
* 
tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* 
tika-test-resources/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf




[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-03-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174937#comment-15174937
 ] 

Hudson commented on TIKA-1857:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/916/])
TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: 
rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)
* tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* 
tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: 
rev 7c245fa87507cf0887838001c54c65b79b7e7cbc)
* CHANGES.txt




[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174859#comment-15174859
 ] 

ASF GitHub Bot commented on TIKA-1857:
--

Github user asfgit closed the pull request at:

https://github.com/apache/tika/pull/74




[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-26 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169395#comment-15169395
 ] 

Tim Allison commented on TIKA-1857:
---

I implemented a first-attempt XFA scraper with StAX; this pulls the content 
from the fields that Pascal identified into the ContentHandler, and it merges 
the "values" from the data section with the fields section.
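
To illustrate the merge step described above, here is a minimal sketch. It is not the code that was committed; the class name {{XfaMergeSketch}}, the method {{merge}}, and the two-map representation (field name to label, field name to value) are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Tika.
public class XfaMergeSketch {

    /**
     * Pairs each field label with the value found in the data section.
     * Both maps are keyed by field name. Fields without a value are still
     * emitted (with an empty value) so that no label is lost.
     */
    public static List<String> merge(Map<String, String> labels,
                                     Map<String, String> values) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> field : labels.entrySet()) {
            String value = values.getOrDefault(field.getKey(), "");
            lines.add(field.getValue() + ": " + value);
        }
        // Values with no matching field are appended under their own name,
        // erring on the side of keeping content.
        for (Map.Entry<String, String> data : values.entrySet()) {
            if (!labels.containsKey(data.getKey())) {
                lines.add(data.getKey() + ": " + data.getValue());
            }
        }
        return lines;
    }
}
```

Iterating the labels first keeps the output in form order, and appending unmatched values afterwards follows the same "prefer duplication over loss" preference stated below.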

Currently, if XFA exists, I process that and skip the AcroForm data.  

I'm not certain what the best path is for ignoring/processing content extracted 
from the "regular" PDF if there is XFA data.

For now, I'm also processing the contents of the rest of the PDF. I'm more 
averse to losing data than to duplication because my main use case is 
search...but I realize this will be really frustrating to users who want "just 
one copy" of the content.

In looking at the PDFs with XFA data in govdocs1, it looks like there would be 
lost content in _some_ files if we processed only the XFA and did not do the 
regular text extraction.  On the other hand, for most of the files I examined, 
it looked like the content is entirely duplicative -- [~pascal.essiembre]'s 
point above.

I propose adding a parameter to the PDFParserConfig along the lines of 
{{ifXFAExistsProcessItAlone}}...this would allow the behavior of Pascal's 
patch.  I propose that the default be set to "false", erring on the side of 
extracting more content at the cost of duplication.

Is this ok?  Or, is there an easy way to determine if regular content is 
entirely duplicative of XFA content?
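
A rough sketch of how the proposed switch could behave, for discussion. The class {{XfaConfigSketch}} and the method {{selectContent}} are hypothetical stand-ins rather than the real PDFParserConfig; only the parameter name {{ifXFAExistsProcessItAlone}} and its proposed default of "false" come from the proposal above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed config option; not the real PDFParserConfig.
public class XfaConfigSketch {

    // Proposed default: false, i.e. keep the regular PDF text even when XFA exists.
    private boolean ifXFAExistsProcessItAlone = false;

    public void setIfXFAExistsProcessItAlone(boolean ifXFAExistsProcessItAlone) {
        this.ifXFAExistsProcessItAlone = ifXFAExistsProcessItAlone;
    }

    /**
     * Returns the pieces of content to emit, XFA text first.
     * Regular text is skipped only when XFA text exists AND the flag is set.
     */
    public List<String> selectContent(String xfaText, String regularText) {
        List<String> out = new ArrayList<>();
        boolean hasXfa = xfaText != null && !xfaText.isEmpty();
        if (hasXfa) {
            out.add(xfaText);
        }
        if (!hasXfa || !ifXFAExistsProcessItAlone) {
            if (regularText != null && !regularText.isEmpty()) {
                out.add(regularText);
            }
        }
        return out;
    }
}
```

With the default, a PDF that has both XFA and regular text yields two pieces (possibly duplicative); with the flag set to true it yields only the XFA text, which matches the behavior of Pascal's patch as described above.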





[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154293#comment-15154293
 ] 

Tim Allison commented on TIKA-1857:
---

Ha. Sorry. Figured that was a typo.  We'll still have it around for a while to 
process though. :)  Thank you, again.



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-19 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154290#comment-15154290
 ] 

Maruan Sahyoun commented on TIKA-1857:
--

XFA is not deprecated in PDFBox. It will be deprecated in the PDF 2.0 
specification (as it currently stands).



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154274#comment-15154274
 ] 

Tim Allison commented on TIKA-1857:
---

Doh! Sorry.  I was looking at PDXFAResource.  Thank you, again.

bq. PDF 2.0 as there XFA is deprecated 

Oh, no...I guess we could copy/paste from the current PDFBox if XFA goes away 
in PDFBox...less than ideal. I don't see deprecation tags in PDXFAResource or 
PDAcroForm's {{getXFA()}}...which XFA handling might go away?



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-19 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154260#comment-15154260
 ] 

Maruan Sahyoun commented on TIKA-1857:
--

{quote}
Do I understand correctly then: no matter whether static or dynamic, try to 
pull data from XFA; if that doesn't exist, fall back to the AcroForm?
{quote}

If you'd like to replicate Adobe Reader/Acrobat behavior, yes. BTW, I don't 
know what will happen with PDF 2.0 as there XFA is deprecated, which might have 
an implication for future versions.

{quote}
Also, is there an obvious way to determine static vs. dynamic aside from 
checking to see if there are fields in the AcroForm?
{quote}

There is {{PDAcroForm.xfaIsDynamic()}}, which will give you that information 
(it checks whether there is XFA and no AcroForm fields).



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154162#comment-15154162
 ] 

Tim Allison commented on TIKA-1857:
---

No problem at all.  I think this will take some time for me to get 
right...there's no rush. :)

Do I understand correctly then: no matter whether static or dynamic, try to 
pull data from XFA; if that doesn't exist, fall back to the AcroForm?

Also, is there an obvious way to determine static vs. dynamic aside from 
checking to see if there are fields in the AcroForm?

Thank you, again!



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-18 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153851#comment-15153851
 ] 

Maruan Sahyoun commented on TIKA-1857:
--

Sorry for my delay in answering your question.

May I propose the following strategy:

a) For static XFA, if there is datasets.data, use that content for the field 
values; otherwise extract from the AcroForm.
b) For dynamic XFA, scrape/extract info from the XFA.

Why a different proposal for a) from yours? Adobe Reader/Acrobat uses the 
information from datasets.data for the field value over the possibly differing 
content in the AcroForm (which might happen if the form has been filled out 
with an XFA-aware processor and afterwards was amended with a non-XFA-aware 
processor).
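
The strategy above can be sketched as a small decision function. This is only an illustration under stated assumptions: the class {{XfaStrategySketch}}, the enum {{Source}}, and the three boolean inputs are hypothetical names, and the case of a PDF with no XFA at all is an addition for completeness rather than part of the proposal.

```java
// Hypothetical illustration of the proposed strategy; not PDFBox or Tika code.
public class XfaStrategySketch {

    public enum Source { XFA_DATASETS, ACROFORM, XFA_SCRAPE, NONE }

    /**
     * @param hasXfa            the PDF has an XFA entry
     * @param hasAcroFormFields the AcroForm contains fields
     * @param hasDatasetsData   the XFA contains a datasets.data section
     */
    public static Source chooseSource(boolean hasXfa,
                                      boolean hasAcroFormFields,
                                      boolean hasDatasetsData) {
        if (!hasXfa) {
            // Not covered by the proposal: plain AcroForm or no form at all.
            return hasAcroFormFields ? Source.ACROFORM : Source.NONE;
        }
        if (!hasAcroFormFields) {
            // b) dynamic XFA: XFA entry but no AcroForm fields.
            return Source.XFA_SCRAPE;
        }
        // a) static XFA: prefer datasets.data, fall back to the AcroForm.
        return hasDatasetsData ? Source.XFA_DATASETS : Source.ACROFORM;
    }
}
```

For example, {{chooseSource(true, true, true)}} selects the datasets.data values, mirroring the Adobe Reader/Acrobat preference described above.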



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149736#comment-15149736
 ] 

Tim Allison commented on TIKA-1857:
---

This is great.  Thank you!

So, to get the best coverage for extracted content, should we do the following:

Check for fields in the AcroForm.

a) If those exist (Static XFA), use the content extracted from the AcroForm and 
ignore the XFA
b) If they don't exist (Dynamic XFA), scrape/extract info from the XFA 

In your experience, will we miss any info if we ignore the XFA for Static XFAs 
and rely solely on the AcroForm?





[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149405#comment-15149405
 ] 

Maruan Sahyoun commented on TIKA-1857:
--

The reason you are not getting the data is that it is stored as part of the 
data node in an XML data structure which matches the binding information in the 
field. That data is in {{xfa.datasets.data}} with the {{my_exibitor}} value 
stored in the {{Exhibitorname}} field.

Extracting {{speak|text|exData}} will give you the boilerplate text but not the 
field value.

Now there are two types of XFA forms: static and dynamic. Static XFA forms 
have an XFA entry and AcroForm fields. Dynamic XFA forms have only an XFA entry 
and no AcroForm fields.

When a static XFA form is filled out with an XFA-aware PDF processor, both the 
{{xfa.datasets.data}} information and the {{V}} entry of the AcroForm form 
field are updated. If you fill out a static form with a non-XFA-aware PDF 
processor, it will only see the AcroForm information and as a result only 
updates the {{V}} entry of the AcroForm form fields.

When trying to fill a dynamic XFA form with a non-XFA-aware PDF processor, it 
will not see any form fields at all.

I'm happy to provide more information on that topic but thought that this would 
give you a first outline.
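
As a minimal sketch of reading field values from the data node with StAX: the class {{XfaDataSketch}} and the method {{readDataValues}} are hypothetical names, and the sketch assumes the XFA XML is already available as a string and that every element named {{data}} (in any namespace) is the data node. It is not PDFBox or Tika code.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; not part of PDFBox or Tika.
public class XfaDataSketch {

    /** Collects the text of leaf elements below the data node, keyed by local element name. */
    public static Map<String, String> readDataValues(String xfaXml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Form XML is untrusted input: turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xfaXml));

        Map<String, String> values = new LinkedHashMap<>();
        int depth = -1;            // -1: outside the data node; 0: directly inside it
        String current = null;     // most recently opened element below the data node
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (depth >= 0) {
                        depth++;
                        current = reader.getLocalName();
                        text.setLength(0);
                    } else if ("data".equals(reader.getLocalName())) {
                        depth = 0;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth > 0) {
                        String value = text.toString().trim();
                        // Only leaf elements still match "current" when they close.
                        if (reader.getLocalName().equals(current) && !value.isEmpty()) {
                            values.put(current, value);
                        }
                        current = null;
                        depth--;
                    } else if (depth == 0) {
                        depth = -1;   // leaving the data node
                    }
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }
}
```

Keying by local name alone means repeated field names overwrite each other; a real scraper would probably need the full path to match the binding information mentioned above.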



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149256#comment-15149256
 ] 

Tilman Hausherr commented on TIKA-1857:
---

Sorry, I have no experience with XFA. [~msahyoun] might know more.



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149014#comment-15149014
 ] 

Tim Allison commented on TIKA-1857:
---

from TIKA-1607's 
[comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the 
form also contains the PDF's standard metadata...(author etc.) which is not 
necessarily stored in the older mechanism: COSDictionary.

bq. I'll support whichever way you pick, but I personally can't see use cases 
where extracting that workaround message is the intent when using Tika. I do 
see value in keeping the entire DOM though. Maybe you can do as you suggest, 
but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.


