Re: PDFBox 2.0.3 TIKA comparison

Tilman Hausherr Thu, 15 Sep 2016 09:08:48 -0700

On 14.09.2016 at 20:50, Tilman Hausherr wrote:

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit. Looks like ~2 million more "common English words" via Tilman's methodology.
After some wandering around I finally looked at content extraction only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words. It turned out that all files were from Delaware courts, so I've decided to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
For 1.8 the explanation is that text extraction takes words, while in 2.* each character is taken alone.
The bad result in 2.0.3 is because of an incorrect /W array. The space has a width of 3, while other characters have widths between 200 and 722. So PDFBox believes that there are spaces where there are none.

The story is different: the space width (which is 250, not 3; the table is a ranges array) is NOT taken from the space glyph, but from an average of all glyphs. It's a good thing I looked back in the history. The breaking change was in rev 1744613 (PDFBOX-3354) and is related to the calculation of the average glyph width. Before rev 1744613 the averageWidth was always 0 (due to a bug likely accidentally introduced in some refactoring), which was corrected to a default value (1000) in text extraction.
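To illustrate the ranges-array point: in a /W array written as "first last width" triples, the entry `2 3 250` assigns width 250 to codes 2 through 3, so the 3 is the end of a range and not a width. The lookup below is a minimal sketch in plain Java; the class and method names are hypothetical and are not PDFBox API, and it assumes the array uses only the triple form (as the one in this file does).

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of PDFBox.
class WRangeLookup {
    // Returns the width assigned to a CID by a /W array made only of
    // "first last width" triples, or empty if no range covers the CID.
    static OptionalInt widthOf(int cid, int[] w) {
        for (int i = 0; i + 2 < w.length; i += 3) {
            if (cid >= w[i] && cid <= w[i + 1]) {
                return OptionalInt.of(w[i + 2]);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // First four triples of the /W array from the file discussed here.
        int[] w = {1, 1, 0, 2, 3, 250, 4, 10, 0, 11, 12, 333};
        System.out.println(widthOf(2, w).getAsInt()); // 250
        System.out.println(widthOf(3, w).getAsInt()); // 250
        System.out.println(widthOf(4, w).getAsInt()); // 0
    }
}
```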

Starting with rev 1744613 an average width was calculated, but due to the many 0 values in the /W ranges array (the ranges extend up to 65534), the result was unreliable:


/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]

Solution: ignore widths that are <= 0. Zero values are already ignored in PDFont, but not in PDCIDFont.


Average width before the fix: 0.52861196. After the fix: 549.8571.
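For anyone who wants to check these two figures, they can be reproduced from the /W array quoted in this message. The following is a standalone sketch in plain Java, not the PDFBox code; the class and method names are hypothetical. It assumes that every code inside a range counts once toward the average, which is the reading that matches both numbers.

```java
// Standalone sketch; names are hypothetical and not PDFBox API.
class AverageWidthCheck {
    // The /W array of the file, copied line by line from this thread.
    static final String W =
        "1 1 0 2 3 250 4 10 0 11 "
        + "12 333 13 14 0 15 15 250 16 16 "
        + "333 17 17 250 18 18 277 19 19 0 "
        + "20 23 500 24 35 0 36 36 722 37 "
        + "37 666 38 39 722 40 40 666 41 41 "
        + "610 42 43 777 44 44 389 45 45 0 "
        + "46 46 777 47 47 666 48 48 943 49 "
        + "49 722 50 50 777 51 51 610 52 52 "
        + "0 53 53 722 54 54 556 55 55 666 "
        + "56 57 722 59 59 0 60 60 722 61 "
        + "67 0 68 68 500 69 69 556 70 70 "
        + "443 71 71 556 72 72 443 73 73 333 "
        + "74 74 500 75 75 556 76 76 277 77 "
        + "77 0 78 78 556 79 79 277 80 80 "
        + "833 81 81 556 82 82 500 83 84 556 "
        + "85 85 443 86 86 389 87 87 333 88 "
        + "88 556 89 89 0 90 90 722 91 92 "
        + "500 93 178 0 179 180 500 181 181 0 "
        + "182 182 333 183 751 0 752 752 198 753 "
        + "794 0 795 795 612 796 1126 0 1127 1127 "
        + "125 1129 1129 2000 1130 65534 0";

    // Averages the widths over all codes covered by "first last width"
    // triples. With positiveOnly set, widths <= 0 are skipped.
    static double average(String wArray, boolean positiveOnly) {
        String[] t = wArray.trim().split("\\s+");
        long sum = 0;
        long count = 0;
        for (int i = 0; i + 2 < t.length; i += 3) {
            int first = Integer.parseInt(t[i]);
            int last = Integer.parseInt(t[i + 1]);
            int width = Integer.parseInt(t[i + 2]);
            if (positiveOnly && width <= 0) {
                continue;
            }
            long n = last - first + 1; // number of codes in this range
            sum += n * width;
            count += n;
        }
        return count == 0 ? 0 : (double) sum / count;
    }

    public static void main(String[] args) {
        System.out.println(average(W, false)); // about 0.5286
        System.out.println(average(W, true));  // about 549.857
    }
}
```

With zeros included, 65532 codes share a total width of 34641; with zeros skipped, only 63 codes remain, which gives the second figure.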

I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, but in 2.0.4.


Tilman
