RE: PDFBox 2.0.3 TIKA comparison

Allison, Timothy B. Wed, 14 Sep 2016 10:01:31 -0700

That was caused by a cap we placed in Tika on extracting XMP history: TIKA-1999 [1].


We haven't switched to XMPBox...still on JempBox from 1.8.x.

[1] https://issues.apache.org/jira/browse/TIKA-1999

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]] 
Sent: Wednesday, September 14, 2016 12:52 PM
To: [email protected]
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
>
> This run was against the full corpus, not just PDFs.  I used a fairly recent 
> nightly build of PDFBox and POI's 3.15-rc1.
>
> The one apparent major new exception for PDF files was apparently fixed 
> before 2.0.3.  So, please ignore that one!
>
> There are some regressions in content extraction, but overall, content 
> extraction looks to have improved quite a bit.  Looks like ~2 million more 
> "common English words" via Tilman's methodology.
>
> Let me know if you have any questions.

I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

metadata went from 6766 to 4134.

Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman



