Hi Tim

Great to hear that you managed to use the dataset from CommonCrawl.

Thanks!
Julien

On 14 April 2015 at 14:15, Allison, Timothy B. <talli...@mitre.org> wrote:

> +1
>
> Thank you, Tyler!
>
> Apologies to Hong-Thai and community for not recognizing the severity of
> TIKA-1600 when I voted in favor of rc1!
>
> Details...
>
> I reran against govdocs1, and there aren't any major surprises.
>
> On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that
> Julien Nioche created for us, and I ran against that as well. That turned
> up TIKA-1605 and another exceedingly rare NPE in the PDFParser. I don't
> think either of these are blockers, and they're now fixed in trunk.
>
> There are slightly fewer metadata values for some jpegs. For the one file
> that I manually reviewed, 1.8-rc was missing these values (that were
> available in 1.7):
>
> JPEG quality
> IPTC-NAA record
> Plug-in 1 Data
>
> Comparison reports are available here (much more work remains to be done
> on tika-eval):
>
> https://github.com/tballison/share/tree/master/tika_comparisons
>
> ________________________________________
> From: Tyler Palsulich <tpalsul...@apache.org>
> Sent: Monday, April 13, 2015 1:56 PM
> To: dev@tika.apache.org; u...@tika.apache.org
> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>
> Hi Folks,
>
> A candidate for the Tika 1.8 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>
> The SHA1 checksum of the archive is
> 5e22fee9079370398472e59082d171ae2d7fdd31.
>
> In addition, a staged maven repository is available here:
> https://repository.apache.org/content/repositories/orgapachetika-1009
>
> Please vote on releasing this package as Apache Tika 1.8. The vote is open
> for the next 72 hours and passes if a majority of at least three +1 Tika
> PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.8
> [ ] ±0 I don't object to this release, but I haven't checked it
> [ ] -1 Do not release this package because...
>
> Thanks,
> Tyler
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble