On 14.09.2016 at 22:42, Allison, Timothy B. wrote:
Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction
looks to have improved quite a bit" :-)
Y, absolutely. Thank _you_ for reviewing the output and all of your other
work, of course!
Tim, we have to thank you for running those tests again!!
BR
Andreas
Cheers,
Tim
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Wednesday, September 14, 2016 2:50 PM
To: [email protected]
Subject: Re: PDFBox 2.0.3 TIKA comparison
On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
There are some regressions in content extraction, but overall,
content extraction looks to have improved quite a bit. Looks like ~2
million more "common English words" via Tilman's methodology.
After some wandering around I finally looked at content extraction only, at column P
("TOP_10_MORE_IN_A") for cells with meaningful words.
It turned out that all files were from Delaware courts, so I've decided to look
only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is
IN THE COUR T OF CHAN CER Y O F TH E STA TE OF D ELA WARE
in 2.0.1 and 1.8 it is
IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
For 1.8 the explanation is that text extraction works on whole words, while in
2.* each character is processed individually.
The bad result in 2.0.3 is because of an incorrect /W array. The space has a
width of 3, while other characters have widths between 200 and 722. So PDFBox
believes that there are spaces where there are none.
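To illustrate the effect, here is a minimal sketch of a gap-based space heuristic. It is a simplified model and not PDFBox's actual code: the function name `join_glyphs`, the `tolerance` factor of 0.5, and the glyph positions are all assumptions made up for this example. Only the space width of 3 and the glyph widths in the 200 to 722 range come from the file discussed above.

```python
def join_glyphs(glyphs, space_width, tolerance=0.5):
    """Join glyphs into text, inserting a space wherever the horizontal gap
    between two consecutive glyphs exceeds space_width * tolerance.

    glyphs: list of (char, x_start, x_end) tuples in glyph units, left to right.
    This is a hypothetical simplification, not the PDFBox implementation.
    """
    out = []
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None and (x_start - prev_end) > space_width * tolerance:
            out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# "COURT" with made-up positions: the glyphs touch each other, except for a
# small gap of 18 units (e.g. from kerning) before the final "T".
glyphs = [
    ("C", 0, 667),
    ("O", 667, 1389),
    ("U", 1389, 2111),
    ("R", 2111, 2778),
    ("T", 2796, 3407),
]

print(join_glyphs(glyphs, space_width=3))    # COUR T
print(join_glyphs(glyphs, space_width=250))  # COURT
```

With the bogus space width of 3, the threshold drops to 1.5 units, so even a tiny positioning adjustment is taken for a word break. With a plausible space width (250 is an assumed value here), the same gap is ignored.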
The only mystery that remains is why it worked in 2.0.1. Maybe that one took an average glyph width
for spaces, or the width value from the font itself. I'll find this out later, but it isn't a high
priority. A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes,
"content extraction looks to have improved quite a bit" :-)
Thanks for testing!
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]