[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371607#comment-14371607 ]
Maruan Sahyoun commented on TIKA-1575:
--------------------------------------

For 719128.pdf it's correct to get the content of help_print only once, as this is the field's value and the field exists only once in the form. The reason the text is displayed 4 times is that the field has 4 widgets assigned to it, representing the same field 4 times visually.

So the question is: what's the expected result? If you only want to extract the field's content, 1.8.9 is correct and 1.8.8 wasn't.

If you would like to get the text 4 times, as it visually appears 4 times, then either
- the annotation's appearance stream needs to be processed, as this will have the same text content (if it has been updated), or
- the text needs to be extracted n times (n being the number of widgets a form field has).

As I'm not familiar with the Tika code, maybe you could point me to the class/method where that is handled so I can review that section (also having 2.0.0 in mind :-))

> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json,
> 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx,
> reports_1_8_9_multithread_vs_single.zip
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9
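
To make the choice raised in the comment concrete, here is a minimal sketch of the two possible results: the field's value once, or the value repeated once per widget. It is only an illustration and not Tika or PDFBox code. `FormField`, `extract_values`, and the `per_widget` flag are hypothetical names, and the placeholder string stands in for the actual help_print text, which is not quoted in this thread. It covers only the "extract n times" option; reading the annotation's appearance stream is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FormField:
    """Hypothetical stand-in for a PDF form field (not a PDFBox or Tika class)."""
    name: str
    value: str
    widget_count: int  # number of widget annotations that display this field


def extract_values(fields: List[FormField], per_widget: bool = False) -> List[str]:
    """Collect field text once per field, or once per widget when per_widget is True."""
    out: List[str] = []
    for field in fields:
        # Assumption: in per-widget mode a field is emitted as often as it is shown.
        repeats = field.widget_count if per_widget else 1
        out.extend([field.value] * repeats)
    return out


# One field shown through four widgets, as described for 719128.pdf.
help_print = FormField(name="help_print", value="(field text)", widget_count=4)

once = extract_values([help_print])                         # field content only
per_widget = extract_values([help_print], per_widget=True)  # as it appears visually
```

With the default, the value is collected once, which is the result described as correct for 1.8.9; with `per_widget=True` it is collected four times, matching what is visible on the page.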