[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371607#comment-14371607
 ] 

Maruan Sahyoun commented on TIKA-1575:
--------------------------------------

For 719128.pdf it is correct to get the content of help_print only once, as 
this is the field's value and the field exists only once in the form. The text 
is displayed 4 times because the field has 4 widgets assigned to it, each 
representing the same field visually.

So the question is: what is the expected result? If you only want to extract 
the field's content, 1.8.9 is correct and 1.8.8 wasn't. If you would like to 
get the text 4 times, as it visually appears 4 times, then either
- the annotation's appearance stream needs to be processed, as this will have 
the same text content (if it has been updated), or
- the text needs to be extracted n times (n being the number of widgets a form 
field has).
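
To make the two possible results concrete, here is a minimal sketch in plain 
Java. It does not use the PDFBox or Tika APIs: FormField, widgetCount and 
extract are hypothetical names invented for this illustration, and the field 
value is a placeholder. It only models "one field value, n widgets" and shows 
the difference between emitting the value once and emitting it once per widget.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WidgetTextSketch {

    /** Hypothetical stand-in for a form field: one value, shown by n widgets. */
    static final class FormField {
        final String name;
        final String value;
        final int widgetCount;

        FormField(String name, String value, int widgetCount) {
            this.name = name;
            this.value = value;
            this.widgetCount = widgetCount;
        }
    }

    /**
     * Collects field values. With repeatPerWidget == false each value is
     * emitted once (the "field content" reading). With repeatPerWidget == true
     * it is emitted once per widget (the "as it visually appears" reading).
     * Assumption: a field without widgets is still emitted once.
     */
    static List<String> extract(List<FormField> fields, boolean repeatPerWidget) {
        List<String> out = new ArrayList<>();
        for (FormField field : fields) {
            int n = repeatPerWidget ? Math.max(1, field.widgetCount) : 1;
            for (int i = 0; i < n; i++) {
                out.add(field.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder value; the real content of help_print is not shown here.
        List<FormField> fields =
                Arrays.asList(new FormField("help_print", "<help_print value>", 4));
        System.out.println(extract(fields, false).size()); // prints 1
        System.out.println(extract(fields, true).size());  // prints 4
    }
}
```

The first option above (processing the appearance stream) is not modelled 
here, as it depends on the stream having been updated.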

As I'm not familiar with the Tika code, maybe you could point me to the 
class/method where that is handled so I can review that section (also having 
2.0.0 in mind :-)).

> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx, 
> reports_1_8_9_multithread_vs_single.zip
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9.



