[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360492#comment-14360492 ]
Tim Allison edited comment on TIKA-1575 at 3/13/15 3:47 PM: ------------------------------------------------------------ Form clutter...This was embedded inside 776568. With PDFBox 1.8.8, we extracted the keys for the subform (but there was no meaningful content in this doc): {noformat}Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: \n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: \n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: \n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: \n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: \n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: \n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: \n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: \n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: \n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: \n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: \n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: \n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: \n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: \n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: \n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: \n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: \n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: \n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: \n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: \n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: \n\tCheckBox6[11]: \n\n\n\n\n",{noformat} In 1.8.9, there's just this: {noformat} Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\n\n\n {noformat} There's no difference with PDFBox app's ExtractText between 1.8.8 and 1.8.9 on this file. was (Author: talli...@mitre.org): Form clutter...This was embedded inside 776568. With PDFBox 1.8.8, we extracted the keys for the subform (but there was no meaningful content in this doc): {noformat}Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: \n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: \n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: \n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: \n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: \n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: \n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: \n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: \n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: \n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: \n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: \n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: \n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: \n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: \n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: \n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: \n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: \n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: \n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: \n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: \n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: \n\tCheckBox6[11]: \n\n\n\n\n",{noformat} In 1.8.9, there's just this: {noformat} Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\n\n\n {noformat} > Upgrade to PDFBox 1.8.9 when available > -------------------------------------- > > Key: TIKA-1575 > URL: https://issues.apache.org/jira/browse/TIKA-1575 > Project: Tika > Issue Type: Improvement > Reporter: Tim Allison > Priority: Minor > Attachments: 10-814_Appendix B_v3.pdf, > PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, > PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip > > > The PDFBox community is about to release 1.8.9. Let's use this issue to > track discussions before the release and to track Tika's upgrade to PDFBox > 1.8.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)