On 14.09.2016 at 22:42, Allison, Timothy B. wrote:
A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes,
"content extraction looks to have improved quite a bit" :-)

Yes, absolutely.  Thank _you_ for reviewing the output and for all of your other 
work, of course!
Tim, we have to thank you for running those tests again!!

BR
Andreas

Cheers,

          Tim

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Wednesday, September 14, 2016 2:50 PM
To: [email protected]
Subject: Re: PDFBox 2.0.3 TIKA comparison


On 14.09.2016 at 18:38, Allison, Timothy B. wrote:


There are some regressions in content extraction, but overall,
content extraction looks to have improved quite a bit.  Looks like ~2
million more "common English words" via Tilman's methodology.

After some wandering around I finally looked at content extraction only, at column P 
("TOP_10_MORE_IN_A") for cells with meaningful words.
It turned out that all files were from Delaware courts, so I've decided to look 
only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8, the explanation is that its text extraction works on whole words, while in
2.* each character is processed individually.

The bad result in 2.0.3 is because of an incorrect /W array. The space has a 
width of 3, while other characters have widths between 200 and 722. So PDFBox 
believes that there are spaces where there are none.
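
To illustrate the effect described above, here is a minimal sketch of a gap-based word-break heuristic. It is not PDFBox's actual code: the function names, the tolerance factor of 0.5, the glyph width of 600 and the 5-unit gap are all assumptions chosen for illustration. The only values taken from this thread are the declared space width of 3 and the general range of the other glyph widths.

```python
def join_glyphs(glyphs, space_width, tolerance=0.5):
    """Rebuild text from (char, x_start, x_end) glyphs on one line.

    A space is emitted whenever the gap to the previous glyph exceeds
    tolerance * space_width. This is a simplified, hypothetical model.
    """
    out = []
    prev_end = None
    for ch, start, end in glyphs:
        if prev_end is not None and start - prev_end > tolerance * space_width:
            out.append(" ")
        out.append(ch)
        prev_end = end
    return "".join(out)


def layout(text, gaps, glyph_width=600):
    """Place glyphs left to right; gaps[i] is the extra offset after glyph i."""
    glyphs = []
    x = 0
    for i, ch in enumerate(text):
        glyphs.append((ch, x, x + glyph_width))
        x += glyph_width
        if i < len(gaps):
            x += gaps[i]
    return glyphs


# A tiny positioning adjustment of 5 units before the final "T" (assumed).
glyphs = layout("COURT", gaps=[0, 0, 0, 5])

# Space width of 3, as declared by the faulty /W array: threshold is 1.5 units.
broken = join_glyphs(glyphs, space_width=3)

# A plausible space width of 250 (assumed): threshold is 125 units.
sane = join_glyphs(glyphs, space_width=250)

print(repr(broken))
print(repr(sane))
```

With the space width of 3, even a 5-unit adjustment is larger than the threshold and is taken for a word break, which is the same kind of split seen in "COUR T". With a realistic space width the same gap is ignored.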

The only mystery that remains is why it worked in 2.0.1. Maybe that one took an average glyph width 
for spaces, or the width value from the font itself. I'll find this out later, but it isn't a high 
priority. A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, 
"content extraction looks to have improved quite a bit" :-)

Thanks for testing!

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

