[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368449#comment-14368449
 ] 

Tim Allison edited comment on TIKA-1575 at 3/19/15 3:35 AM:
------------------------------------------------------------

From manual review...overall, I'm not sure there is anything glaring, esp. 
given that we're testing against ~250k documents.

Based on the More_in_A column, it looks like there are two docs with much more 
language content in 1.8.8 vs 1.8.9.  

* 005937.pdf is an anomaly that can't be reproduced in a single-threaded 
environment.  Multithreading "bug" improves extraction?! :)

* 524276.pdf: it looks like much of the first page is duplicated with 1.8.8, 
whereas 1.8.9 gets junk for the first copy but still maintains the content 
that was duplicated in 1.8.8.


It looks like there are quite a few documents where "this page is intentionally 
left blank" is captured more often in the 1.8.8 output than in the 1.8.9 
output. For example, in 473194.pdf, there's the main content, and then the 
Bookmarks are dumped at the end of the document. In 1.8.8, "this page is 
intentionally left blank" correctly appears three times, but it only appears 
once in 1.8.9.  Not a big loss of information in my opinion, unless it points 
to a potential underlying problem...I don't know.

A similar thing happens with 719128.pdf, where the footer is repeated 3 times 
with 1.8.8 but is only extracted once with 1.8.9; the correct number should be 
4.
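
For anyone who wants to spot-check these repeat counts on other files, here is 
a minimal sketch. It is not part of the comparison tooling used for this 
issue. The class name PhraseCount, the method countOccurrences, and the sample 
strings are all hypothetical; the sample strings merely stand in for the text 
extracted by each PDFBox version.

```java
import java.util.Locale;

class PhraseCount {

    // Counts non-overlapping, case-insensitive occurrences of phrase in text.
    // Runs of whitespace in the text are collapsed to single spaces first,
    // so a phrase that wraps across a line break is still counted.
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        String needle = phrase.toLowerCase(Locale.ROOT);
        int count = 0;
        int idx = haystack.indexOf(needle);
        while (idx >= 0) {
            count++;
            idx = haystack.indexOf(needle, idx + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String phrase = "this page is intentionally left blank";
        // Hypothetical stand-ins for the two extraction outputs.
        String textA = "Intro\nThis page is intentionally left blank\n"
                + "Body\nThis page is intentionally\nleft blank\n"
                + "End\nThis page is intentionally left blank\n";
        String textB = "Intro\nBody\nEnd\nThis page is intentionally left blank\n";
        System.out.println("A: " + countOccurrences(textA, phrase));
        System.out.println("B: " + countOccurrences(textB, phrase));
    }
}
```

In practice the two strings would be read from the extracted-text files for 
each version; a difference in the counts only flags a file for manual review 
and does not say which version is correct.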

There appear to be some differences in AcroForm language -- "Yes, No". In the 
one I checked, 496816.pdf, the extraction appears to be more accurate in 1.8.9 
than in 1.8.8. The 1.8.8 text 
{noformat} "Primary: Yes\n\tline: Yes\n\n\tPiggyback:" {noformat} only 
has one "Yes" in 1.8.9.

[~tilman], what are you finding?


> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9



