That was caused by a cap we placed in Tika on extracting XMP history: TIKA-1999 [1]
We haven't switched to XMPBox...still on JempBox from 1.8.x.

[1] https://issues.apache.org/jira/browse/TIKA-1999

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, September 14, 2016 12:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
>
> This run was against the full corpus, not just PDFs. I used a fairly recent
> nightly build of PDFBox and POI's 3.15-rc1.
>
> The one apparent major new exception for PDF files was apparently fixed
> before 2.0.3. So, please ignore that one!
>
> There are some regressions in content extraction, but overall, content
> extraction looks to have improved quite a bit. Looks like ~2 million more
> "common English words" via Tilman's methodology.
>
> Let me know if you have any questions.

I wonder what happened here:

commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

metadata went from 6766 to 4134. Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman