On 14.09.2016 at 20:50, Tilman Hausherr wrote:

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:


There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit. Looks like ~2 million more "common English words" via Tilman's methodology.

After some wandering around I finally looked at content extraction only, specifically at column P ("TOP_10_MORE_IN_A"), for cells with meaningful words. It turned out that all the files were from Delaware courts, so I decided to look at just one file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8 the explanation is that its text extraction works on whole words, while in 2.* each character is handled individually.

My first guess was that the bad result in 2.0.3 comes from an incorrect /W array: the space would have a width of 3 while the other characters have widths between 200 and 722, so PDFBox would believe that there are spaces where there are none.

The story is different, though: the space width (which is 250, not 3, because the table is a ranges array) is NOT taken from the space glyph, but from an average over all glyphs. It's a good thing I looked back through the history. The breaking change was in rev 1744613 (PDFBOX-3354) and is related to the calculation of the average glyph width. Before rev 1744613 the averageWidth was always 0 (due to a bug, probably introduced accidentally in some refactoring), and text extraction replaced that 0 with a default value (1000).
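To illustrate why a near-zero average width leads to spurious spaces, here is a minimal Python sketch. It is not PDFBox code (PDFBox is Java) and none of the names below are PDFBox API. It assumes a simplified model: the extractor uses the space glyph width if it has one, otherwise the average width, otherwise the default of 1000, and it reports a space when the gap between two glyphs exceeds some fraction of that width. The function names, the 0.5 tolerance and the 20-unit gap are illustrative assumptions.

```python
DEFAULT_WIDTH = 1000.0  # default mentioned in the text, in glyph units


def effective_space_width(space_glyph_width, average_width):
    """Pick the width used for space detection (hypothetical, simplified model)."""
    if space_glyph_width > 0:
        return space_glyph_width
    if average_width > 0:
        return average_width
    return DEFAULT_WIDTH


def is_word_break(gap, space_width, tolerance=0.5):
    """Report a space when the gap exceeds a fraction of the space width."""
    return gap > tolerance * space_width


# A small gap of 20 units between two letters of the same word.
# The space glyph width is treated as unavailable (0), as in the case described.
gap = 20.0
cases = [
    ("average 0, default used", 0.0),
    ("average skewed by zero widths", 0.52861196),
    ("average with zero widths ignored", 549.8571),
]
for label, average in cases:
    width = effective_space_width(0.0, average)
    print(label, width, is_word_break(gap, width))
```

Under these assumptions only the skewed average turns the small gap into a word break, which is consistent with output such as "COUR T OF CHAN CER Y".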

Starting with rev 1744613 an average width was calculated, but because of the many 0 values in the /W ranges array (the ranges run up to CID 65534), the result was unreliable:

/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]

Solution: ignore widths that are <= 0. Such values are already ignored in PDFont, but not in PDCIDFont.

Average width before the fix: 0.52861196. After the fix: 549.8571.
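The two averages can be checked by hand from the /W array. The following Python sketch is not the PDFBox implementation; it assumes the average is a plain mean over the per-CID widths, with each "first last width" range expanded to one entry per CID. The names parse_ranges and average_width are made up for this illustration.

```python
# Ranges form of /W: triples "first_cid last_cid width", copied from the array above.
W_RANGES = """
1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0
"""


def parse_ranges(text):
    """Split the flat token list into (first_cid, last_cid, width) triples."""
    nums = [int(tok) for tok in text.split()]
    if len(nums) % 3 != 0:
        raise ValueError("expected a multiple of three numbers")
    return [tuple(nums[i:i + 3]) for i in range(0, len(nums), 3)]


def average_width(ranges, skip_non_positive):
    """Mean width per CID; optionally leave out widths <= 0."""
    total = 0
    count = 0
    for first, last, width in ranges:
        if skip_non_positive and width <= 0:
            continue
        cids = last - first + 1  # a range covers first..last inclusive
        total += cids * width
        count += cids
    return total / count if count else 0.0


ranges = parse_ranges(W_RANGES)
print(average_width(ranges, skip_non_positive=False))
print(average_width(ranges, skip_non_positive=True))
```

With zeros included, the sum of 34641 is spread over 65532 CIDs, which gives about 0.5286. With zeros skipped, the same sum is spread over only 63 CIDs, which gives about 549.857. Both agree with the figures above.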

I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, but in 2.0.4.

Tilman

