Re: PDFBox 2.0.3 TIKA comparison

Tilman Hausherr Wed, 14 Sep 2016 11:50:31 -0700

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit. Looks like ~2 million more "common English words" via Tilman's methodology.

After some wandering around I finally looked at content extraction only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words. It turned out that all files were from Delaware courts, so I decided to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.

The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8 the explanation is that text extraction takes words, while in 2.* each character is taken alone.

The bad result in 2.0.3 is because of an incorrect /W array. The space has a width of 3, while other characters have widths between 200 and 722. So PDFBox believes that there are spaces where there are none.
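
The effect of the tiny space width can be illustrated with a small sketch. This is not PDFBox's actual code; it assumes a simplified rule in which a word break is inserted whenever the gap between two glyphs exceeds a fraction of the font's space width. The class name, the glyph positions and the 0.5 tolerance are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Simplified illustration only; NOT the PDFBox implementation.
final class SpaceWidthSketch {

    // Hypothetical glyph record: character, left edge and advance width,
    // all in 1/1000 text space units.
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed heuristic: emit a space when the gap between the end of the
    // previous glyph and the start of the next one exceeds
    // spaceWidth * tolerance.
    static String extract(List<Glyph> glyphs, double spaceWidth, double tolerance) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    // "COURT" with made-up widths and a small 10-unit gap before the 'T',
    // the kind of gap that kerning or positioning adjustments can produce.
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph('C', 0, 667),
                new Glyph('O', 667, 722),
                new Glyph('U', 1389, 722),
                new Glyph('R', 2111, 667),
                new Glyph('T', 2788, 611));
    }

    public static void main(String[] args) {
        // Space width 3, as in the broken /W array of this file.
        System.out.println(extract(sample(), 3, 0.5));
        // Space width 250, an assumed plausible value for such a font.
        System.out.println(extract(sample(), 250, 0.5));
    }
}
```

With a space width of 3, the threshold is only 1.5 units, so the 10-unit gap before the T is enough to trigger a break and the sketch yields "COUR T". With an assumed space width of 250, the threshold is 125 units and the same glyphs come out as "COURT".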

The only mystery that remains is why it worked in 2.0.1. Maybe that one took an average glyph width for spaces, or the width value from the font itself. I'll find this out later, but it isn't a high priority. A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction looks to have improved quite a bit" :-)


Thanks for testing!

Tilman


