[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149014#comment-15149014
 ] 

Tim Allison edited comment on TIKA-1857 at 2/16/16 7:49 PM:


from TIKA-1607's 
[comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

Looking at a few of the govdocs1 docs with XFA data, it appears that the XFA 
also carries the PDF's standard metadata (author, etc.), which is not 
necessarily stored in the older mechanism, the COSDictionary.  govdocs1's 
{{517660.pdf}} shows this: the author and title can be extracted from the 
XFA, but our current methods do not extract that info.

bq. I'll support whichever way you pick, but I personally can't see use cases 
where extracting that workaround message is the intent when using Tika. I do 
see value in keeping the entire DOM though. Maybe you can do as you suggest, 
but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.




was (Author: talli...@mitre.org):
from TIKA-1607's 
[comment|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=15148914&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15148914]

bq. In the case of XFA forms, the form IS the content. 

Got it.  Doh.  Thank you. 

As I look at a few of these docs from govdocs1 w/ XFA data, it looks like the 
form also contains the PDF's standard metadata...(author etc.) which is not 
necessarily stored in the older mechanism: COSDictionary.

bq. I'll support whichever way you pick, but I personally can't see use cases 
where extracting that workaround message is the intent when using Tika. I do 
see value in keeping the entire DOM though. Maybe you can do as you suggest, 
but "in addition" to returning the XFA text as the content?

Y, that would be in addition.  Thank you, again.



> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Priority: Trivial
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149171#comment-15149171
 ] 

Tim Allison edited comment on TIKA-1857 at 2/16/16 8:09 PM:


I've only looked at a handful of files that contain XFA; this metadata is 
entirely new to me.  The files I've looked at come from govdocs1 and are fairly 
old by now.

In the attached {{041617_filled_out.pdf}}, I've added content to the forms and 
saved the document.

With the patch, I'm getting all of the boilerplate from the XFA extraction, but 
I'm not getting any content from the form because it isn't in 
{{<(speak|text|exData)>}} elements.  However, with our old code, I am seeing 
the entered data, e.g. {{my_exhibitor}}.
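
For reference, a minimal sketch of the kind of regex extraction described 
above.  This is not the patch itself: the class name {{XfaRegexSketch}}, the 
exact pattern, and the sample XML used to exercise it are assumptions made for 
illustration.  It only shows why caption and tooltip text comes out while 
values stored in other elements do not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or PDFBox.
final class XfaRegexSketch {

    // Matches <speak>, <text> or <exData> elements (attributes allowed) and
    // captures the body. The lookbehind rejects self-closing tags such as
    // <text/>, and the backreference \1 requires a matching closing tag.
    private static final Pattern TEXT_ELEMENT = Pattern.compile(
            "<(speak|text|exData)\\b[^>]*(?<!/)>(.*?)</\\1>", Pattern.DOTALL);

    static List<String> extract(String xfaXml) {
        List<String> out = new ArrayList<>();
        Matcher m = TEXT_ELEMENT.matcher(xfaXml);
        while (m.find()) {
            // Drop any nested markup and collapse whitespace.
            String body = m.group(2)
                    .replaceAll("<[^>]+>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!body.isEmpty()) {
                out.add(body);
            }
        }
        return out;
    }
}
```

Anything that lives outside those three element names, for example a value 
held in an element named after the field, is never matched by this approach.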

Is this PDF storing the contents of the form in both the XFA _and_ in the 
traditional AcroForm?

I imagine that won't happen in all PDFs, and there will be an either/or?

To avoid duplication of content, do we want to skip processing of AcroForm data 
if XFA exists?  Will we miss anything?

The other major question: I like the narrow focus that the current regexes 
yield, but why wouldn't we want to run our HtmlParser or our DcXMLParser 
against the bytes and pull everything out?  We'd have to skip inline/embedded 
images or handle those properly at some point...but any other reasons?
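
To make the "pull everything out" alternative concrete, here is a rough sketch 
that uses the JDK's SAX parser rather than HtmlParser or DcXMLParser.  The 
class name {{XfaAllTextSketch}} and the choice to skip any element whose local 
name is {{image}} are assumptions for illustration, not a description of how 
Tika handles this.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
final class XfaAllTextSketch {

    // Returns every non-blank run of character data in the XFA XML,
    // except for content inside <image> elements (assumed to be base64).
    static List<String> allText(byte[] xfaBytes) {
        final List<String> chunks = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid external entity tricks.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(xfaBytes), new DefaultHandler() {
                private final StringBuilder buf = new StringBuilder();
                private int skipDepth = 0;

                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    flush();
                    if (skipDepth > 0 || "image".equals(local)) {
                        skipDepth++;
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (skipDepth == 0) {
                        buf.append(ch, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if (skipDepth > 0) {
                        buf.setLength(0);
                        skipDepth--;
                    } else {
                        flush();
                    }
                }

                private void flush() {
                    String s = buf.toString().trim();
                    if (!s.isEmpty()) {
                        chunks.add(s);
                    }
                    buf.setLength(0);
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XFA XML", e);
        }
        return chunks;
    }
}
```

The trade-off against the regexes is visible here: every text node comes back, 
entered data included, but so does any boilerplate the narrower patterns would 
have ignored.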

[~tilman], have you worked with XFA?  Any recommendations for pulling as much 
info as we can without duplication?

We could make this configurable, of course. :)



was (Author: talli...@mitre.org):
I've only looked at a handful of files that contain xfa...this metadata is 
entirely new to me.  The files I've looked at come from govdocs1 and are fairly 
old by now.

In the attached, I've added content to the forms and saved the document.

With the patch, I'm getting all of the boilerplate from the xfa extraction, but 
I'm not getting any content from the form because it isn't in 
{{<(speak|text|exData)>}} elements.  However, with our old code, I am seeing 
the entered data, e.g. {{my_exhibitor}}.

Is this PDF storing the contents of the form in both the xfa _and_ in the 
traditional AcroForm?

I imagine that won't happen in all PDFs, and there will be an either/or?

To avoid duplication of content, do we want to skip processing of AcroForm data 
if XFA exists?  Will we miss anything?

[~tilman], have you worked with XFA?  Any recommendations for pulling as much 
info as we can without duplication?

We could make this configurable, of course. :)




[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-18 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153851#comment-15153851
 ] 

Maruan Sahyoun edited comment on TIKA-1857 at 2/19/16 7:33 AM:

Sorry for my delay in answering your question.

May I propose the following strategy:

a) for static XFA if there is datasets.data use that content for the field 
values otherwise extract from the AcroForm.
b) for dynamic XFA scrape/extract info from the XFA.

Why a different proposal for a) from yours? Adobe Reader/Acrobat uses the 
information from datasets.data for the field value in preference to the 
possibly differing content in AcroForm (which might happen if the form was 
filled out with an XFA-aware processor and afterwards amended with a 
non-XFA-aware processor).
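
The proposed strategy can be sketched as a small selection function.  This is 
only an illustration of the rule: the class {{XfaFieldSourceSketch}}, its 
parameters, and the idea of passing in a precomputed static/dynamic flag are 
all hypothetical, and the thread does not say how that flag would be 
determined.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the a)/b) rule; not Tika or PDFBox code.
final class XfaFieldSourceSketch {

    static Map<String, String> fieldValues(boolean dynamicXfa,
                                           Map<String, String> datasetsData,
                                           Map<String, String> acroFormValues,
                                           Map<String, String> scrapedFromXfa) {
        if (dynamicXfa) {
            // b) dynamic XFA: rely on what was scraped from the XFA itself.
            return new LinkedHashMap<>(scrapedFromXfa);
        }
        if (datasetsData != null && !datasetsData.isEmpty()) {
            // a) static XFA with datasets.data: it wins over AcroForm,
            // mirroring what Adobe Reader/Acrobat is described as doing.
            return new LinkedHashMap<>(datasetsData);
        }
        // a) static XFA without datasets.data: fall back to AcroForm.
        return new LinkedHashMap<>(acroFormValues);
    }
}
```

Choosing a single source per document, rather than merging, also addresses 
the duplication concern raised earlier in the thread.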


was (Author: msahyoun):
Sorry for my delay in answering your question.

May I propose the following strategy:

a) for static XFA if there is datasets.data use that content for the filed 
values otherwise extract from the AcroForm.
b) for dynamic XFA scrape/extract info from the XFA.

Why a different proposal for a) from yours? Adobe Reader/Acrobat use the 
information from dataset.data for the field value over the possibly differing 
content in AcroForm (which might happen if the form has been filled out with an 
XFA aware processor and afterwards was amended with a non XFA aware processor)



[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865598#comment-15865598
 ] 

Tim Allison edited comment on TIKA-1857 at 2/14/17 12:12 PM:

Are you able to share mocked up xml, sanitized of patient data?


was (Author: talli...@mitre.org):
Are you able to submit the triggering document?  If not, are you able to share 
it personally?

> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-14 Thread Kenneth Lui (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865806#comment-15865806
 ] 

Kenneth Lui edited comment on TIKA-1857 at 2/14/17 2:04 PM:


I cannot copy the file out of the secured environment. But this is a file I 
found on the Internet to have the same issue and I used this to test my pdfbox 
script as well.

Edit: it may not be obvious from the comment that I attached doc8.pdf. That is 
the file I am referring to.


was (Author: hkkenneth):
I cannot copy the file out of the secured environment. But this is a file I 
found on the Internet to have the same issue and I used this to test my pdfbox 
script as well.

> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Pascal Essiembre
> Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, doc8.pdf, govdocs1_xfas.zip, 
> xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA





[jira] [Comment Edited] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2017-02-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887106#comment-15887106
 ] 

Tim Allison edited comment on TIKA-1857 at 2/28/17 3:10 PM:


{noformat}
IT IS EASY
JUST TRY
DUDE
DO YOUR OWN JOB
DON'T EXPECT ME TO DO IT!
IT'S XML!
READ THE DOCUMENTATION
DUDE
LEARN BEFORE YOU CODE
{noformat}

Is now extracted as:
{noformat}
Nazwa pełna: IT IS EASY
Nazwisko: DUDE
ImiePierwsze: JUST TRY
Województwo: DO YOUR OWN JOB
Powiat: DON'T EXPECT ME TO DO IT!
Gmina: IT'S XML!
Miejscowość: READ THE DOCUMENTATION
Kod pocztowy: DUDE
Poczta: LEARN BEFORE YOU CODE
{noformat}
Once our git is back up and running, I'll push the fix.  Thank you for raising 
this issue and sharing a triggering document.
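
For readers following along, the "label: value" shape of the new output can be 
approximated as below.  How the fix actually pairs labels with values is not 
shown in this thread, so the class {{XfaLabelValueSketch}} and its input map 
are purely hypothetical; the sketch only illustrates emitting one line per 
field and dropping fields with no value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical formatter; not the code that was pushed for this issue.
final class XfaLabelValueSketch {

    // Expects an insertion-ordered map (e.g. LinkedHashMap) of label -> value.
    static List<String> format(Map<String, String> labelToValue) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : labelToValue.entrySet()) {
            String value = e.getValue() == null ? "" : e.getValue().trim();
            if (value.isEmpty()) {
                continue; // skip labels that have no entered value
            }
            lines.add(e.getKey() + ": " + value);
        }
        return lines;
    }
}
```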


was (Author: talli...@mitre.org):
{noformat}
IT IS EASY




0123456789
JUST TRY
DUDE
2015-02-19



PL
DO YOUR OWN JOB
DON'T EXPECT ME TO DO IT!
IT'S XML!
012345678
READ THE DOCUMENTATION
DUDE
LEARN BEFORE YOU CODE
{noformat}

Is now extracted as:
{noformat}
Nazwa pełna: 
IT IS EASY
REGON: 
REGON: 
REGON: 
Nazwisko: 
DUDE
ImiePierwsze: 
JUST TRY
DataUrodzenia: 
2015-02-19
PESEL: 
Numer Identyfikacji Podatkowej: 
Numer PESEL: 
Kraj: 
KodKraju: 


PL
Województwo: 
DO YOUR OWN JOB
Powiat: 
DON'T EXPECT ME TO DO IT!
Gmina: 
IT'S XML!
Ulica: 
Nr domu: 
012345678
Nr lokalu: 
Miejscowość: 
READ THE DOCUMENTATION
Kod pocztowy: 
DUDE
Poczta: 
LEARN BEFORE YOU CODE
{noformat}
Once our git is back up and running, I'll push the fix.  Thank you for raising 
this issue and sharing a triggering document.
