[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378362#comment-14378362 ]
Tim Allison edited comment on TIKA-1575 at 3/26/15 4:08 PM:
------------------------------------------------------------

I looked into the lang id some more. My initial suspicion was wrong. In the few files I reviewed, the strings were exactly the same. I'm using cybozu lab's langdetect, and its algorithm relies on sampling via Random(). When I properly(?) set the seed before running the analysis, all lang ids were the same.


was (Author: talli...@mitre.org):
I looked into the lang id some more. My initial suspicion was wrong. In the few files I reviewed, the strings were exactly the same. I'm using cybozu lab's langdetect, and it's algorithm relies on sampling via Random(). When I properly(?) set the seed before running the analysis, all lang ids were the same.

> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json,
> 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip,
> content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx,
> reports_1_8_9_multithread_vs_single.zip
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
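
A minimal sketch of the seeding point made in the comment above: when an analysis draws samples through java.util.Random, two runs over the same string can differ unless both start from the same seed. This is not langdetect's code. The class SeededSampling, the method sampleTrigrams, and the seed value are hypothetical names and values chosen only for illustration. With langdetect itself, the equivalent step is believed to be a call to DetectorFactory.setSeed(int) before creating detectors, but check that against the version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeededSampling {

    /**
     * Draws `count` character trigrams from `text` at positions chosen by `rng`.
     * Hypothetical stand-in for a sampling-based analysis; not langdetect's algorithm.
     */
    public static List<String> sampleTrigrams(String text, int count, Random rng) {
        List<String> samples = new ArrayList<>();
        if (text.length() < 3) {
            return samples; // too short to contain a trigram
        }
        for (int i = 0; i < count; i++) {
            int start = rng.nextInt(text.length() - 2);
            samples.add(text.substring(start, start + 3));
        }
        return samples;
    }

    public static void main(String[] args) {
        String text = "The PDFBox community is about to release 1.8.9.";
        long seed = 0L; // arbitrary fixed value (assumption for this sketch)

        // Same input and same seed: both runs draw the same samples.
        List<String> first = sampleTrigrams(text, 5, new Random(seed));
        List<String> second = sampleTrigrams(text, 5, new Random(seed));
        System.out.println(first.equals(second)); // prints "true"

        // An unseeded Random picks its own seed, so these samples may
        // differ from run to run even though the input string is identical.
        List<String> unseeded = sampleTrigrams(text, 5, new Random());
        System.out.println(unseeded);
    }
}
```

The design point is only that the random source is passed in (or seeded) explicitly, so a comparison between two extraction runs is not disturbed by sampling noise.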