On 14.09.2016 at 18:38, Allison, Timothy B. wrote:


> There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit. Looks like ~2 million more "common English words" via Tilman's methodology.

After some wandering around, I finally looked only at content extraction, specifically at column P ("TOP_10_MORE_IN_A"), for cells with meaningful words. It turned out that all of those files were from Delaware courts, so I decided to look at just one file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8, the explanation is that text extraction works on whole words, while in 2.* each character is processed individually.

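The bad result in 2.0.3 is caused by an incorrect /W array. The space has a width of 3, while other characters have widths between 200 and 722. With such a tiny space width, even a very small gap between two glyphs looks at least as wide as a space, so PDFBox believes that there are spaces where there are none.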
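
To illustrate the effect, here is a minimal sketch of a gap-based space heuristic. It is a toy model and not the actual PDFBox code: the class name, the tolerance factor of 0.5, the glyph positions and the per-letter widths are all assumptions made up for the example. Only the space width of 3 and the 200 to 722 range for the other widths come from the file.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a gap-based space heuristic. NOT the PDFBox implementation;
// all names and numbers here are illustrative assumptions.
class SpaceHeuristicSketch {

    // One positioned glyph, in 1/1000 text space units.
    static final class Glyph {
        final char c;
        final double startX;
        final double width;

        Glyph(char c, double startX, double width) {
            this.c = c;
            this.startX = startX;
            this.width = width;
        }
    }

    // Emits a space whenever the gap to the previous glyph exceeds
    // spaceWidth * tolerance.
    public static String extract(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder sb = new StringBuilder();
        double threshold = spaceWidth * tolerance;
        double prevEnd = Double.NaN;
        for (Glyph g : glyphs) {
            if (!Double.isNaN(prevEnd) && g.startX - prevEnd > threshold) {
                sb.append(' ');
            }
            sb.append(g.c);
            prevEnd = g.startX + g.width;
        }
        return sb.toString();
    }

    // "COURT OF" with a small 20 unit positioning gap before the T
    // and a real 250 unit word gap before "OF" (hypothetical layout).
    public static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('C', 0, 667));
        glyphs.add(new Glyph('O', 667, 722));
        glyphs.add(new Glyph('U', 1389, 722));
        glyphs.add(new Glyph('R', 2111, 667));
        glyphs.add(new Glyph('T', 2798, 611));
        glyphs.add(new Glyph('O', 3659, 722));
        glyphs.add(new Glyph('F', 4381, 556));
        return glyphs;
    }

    public static void main(String[] args) {
        // Space width 3, as in the broken /W array: threshold is 1.5 units.
        System.out.println(extract(sample(), 3, 0.5));   // prints "COUR T OF"
        // A plausible space width of 250: threshold is 125 units.
        System.out.println(extract(sample(), 250, 0.5)); // prints "COURT OF"
    }
}
```

With the width of 3, the 20 unit gap before the T is already far above the threshold, which gives the same kind of split words as in the 2.0.3 output above.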

The only mystery that remains is why it worked in 2.0.1. Maybe that one took an average glyph width for spaces, or the width value from the font itself. I'll find this out later, but it isn't a high priority. A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction looks to have improved quite a bit" :-)

Thanks for testing!

Tilman


