On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> There are some regressions in content extraction, but overall,
> content extraction looks to have improved quite a bit. Looks like ~2
> million more "common English words" via Tilman's methodology.
After some wandering around I finally looked at content extraction only,
at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
It turned out that all files were from Delaware courts, so I've decided
to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is
IN THE COUR T OF CHAN CER Y O F TH E STA TE OF D ELA WARE
in 2.0.1 and 1.8 it is
IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
For 1.8 the explanation is that text extraction works on whole words,
while in 2.* each character is processed individually.
The bad result in 2.0.3 is caused by an incorrect /W array. The space
has a width of 3, while the other characters have widths between 200 and
722. With such a tiny space width, even a very small gap between two
glyphs looks wider than a space, so PDFBox believes that there are
spaces where there are none.
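
To illustrate the effect, here is a rough sketch in Python. It is not
PDFBox's actual algorithm, only a simplified gap heuristic: a space is
inserted whenever the gap between two glyphs exceeds a fraction of the
declared space width. The function name, the tolerance of 0.5, the font
size of 12 and the glyph positions are all made up for the example.

```python
def extract_text(glyphs, space_width, font_size=12.0, tolerance=0.5):
    """Join glyphs into a string, inserting a space at "large" gaps.

    glyphs      -- list of (char, x_start, advance) in page units (hypothetical data)
    space_width -- width of the space glyph in 1/1000 em, as a /W array would give it
    tolerance   -- assumed fraction of the space width that counts as a word gap
    """
    # Convert the space width to page units and scale by the tolerance.
    threshold = space_width / 1000.0 * font_size * tolerance
    out = []
    prev_end = None
    for ch, x_start, advance in glyphs:
        # A gap larger than the threshold is treated as a word boundary.
        if prev_end is not None and x_start - prev_end > threshold:
            out.append(" ")
        out.append(ch)
        prev_end = x_start + advance
    return "".join(out)


# Made-up positions for the word "COURT": the glyphs touch each other,
# except for a small gap of about 0.3 units before the "T".
GLYPHS = [
    ("C", 0.0, 8.0),
    ("O", 8.0, 9.0),
    ("U", 17.0, 8.5),
    ("R", 25.5, 8.0),
    ("T", 33.8, 7.0),
]

print(extract_text(GLYPHS, space_width=3))    # COUR T
print(extract_text(GLYPHS, space_width=250))  # COURT
```

With a space width of 3 the threshold is only 0.018 units, so the small
gap before the "T" is reported as a space. With a more plausible width
of 250 the threshold is 1.5 units and the word stays in one piece.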
The only mystery that remains is why it worked in 2.0.1. Maybe that one
took an average glyph width for spaces, or the width value from the font
itself. I'll find this out later, but it isn't a high priority. A look
at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes,
"content extraction looks to have improved quite a bit" :-)
Thanks for testing!
Tilman