RE: PDFBox 2.0.3?

Allison, Timothy B. Wed, 14 Sep 2016 09:39:12 -0700

https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip


This run was against the full corpus, not just PDFs.  I used a fairly recent 
nightly build of PDFBox and POI's 3.15-rc1.

The one major new exception for PDF files appears to have been fixed before 
2.0.3.  So, please ignore that one!

There are some regressions in content extraction, but overall, content 
extraction looks to have improved quite a bit.  Looks like ~2 million more 
"common English words" via Tilman's methodology.

Let me know if you have any questions.

Cheers,

         Tim

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]] 
Sent: Monday, September 12, 2016 12:58 PM
To: [email protected]
Subject: Re: PDFBox 2.0.3?

Am 12.09.2016 um 18:47 schrieb Allison, Timothy B.:
> Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/ 
> Tika 1.13).

Yes please, when you have the time, I expect no more changes.

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


