https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
This run was against the full corpus, not just PDFs. I used a fairly recent
nightly build of PDFBox and POI's 3.15-rc1.
The one major new exception for PDF files appears to have been fixed before
2.0.3. So, please ignore that one!
There are some regressions in content extraction, but overall, content
extraction looks to have improved quite a bit: roughly 2 million more
"common English words" by Tilman's methodology.
Let me know if you have any questions.
Cheers,
Tim
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Monday, September 12, 2016 12:58 PM
To: [email protected]
Subject: Re: PDFBox 2.0.3?
On 12.09.2016 at 18:47, Allison, Timothy B. wrote:
> Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/
> Tika 1.13).
Yes please, when you have the time, I expect no more changes.
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected] For additional
commands, e-mail: [email protected]