Awesome!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: <Allison>, "Timothy B." <talli...@mitre.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Tuesday, April 14, 2015 at 12:09 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: topic change: common crawl slice on TIKA-1302 vm

>Hi Julien,
>
> We're just beginning to scratch the surface. There's much to learn
>from this set. Apologies for my delay, and thank you!
>
>These proportions line up pretty closely with your blog post
>(http://digitalpebble.blogspot.com/2014/11/generating-test-corpus-for-apache-tika.html)
>
>Total files: 2,135,515
>
>Detected content types:
>DETECTED_CONTENT_TYPE (by TIKA 1.8-rc2)       COUNT
>image/jpeg                                  857,625
>application/pdf                             320,443
>text/plain; charset=ISO-8859-1              276,152
>image/png                                   184,855
>text/plain; charset=windows-1252            164,327
>image/gif                                    51,809
>text/plain; charset=UTF-8                    44,766
>audio/x-wav                                  34,402
>application/octet-stream                     28,586
>message/rfc822                               18,231
>text/html; charset=ISO-8859-1                17,528
>application/xhtml+xml; charset=UTF-8         16,845
>application/zip                              14,385
>text/html; charset=UTF-8                      9,626
>audio/mpeg                                    8,670
>text/html; charset=windows-1252               7,818
>application/msword                            7,782
>application/x-archive                         5,970
>application/x-bibtex-text-file                5,274
>application/xml                               5,234
>image/vnd.djvu                                5,063
>application/rss+xml                           4,726
>application/gzip                              4,443
>application/xhtml+xml; charset=ISO-8859-1     4,228
>application/epub+zip                          3,458
>image/tiff                                    2,980
>image/jp2                                     2,706
>application/rtf                               1,622
>
>________________________________________
>From: Julien Nioche <lists.digitalpeb...@gmail.com>
>Sent: Tuesday, April 14, 2015 9:24 AM
>To: dev@tika.apache.org
>Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2
>
>Hi Tim
>
>Great to hear that you managed to use the dataset from CommonCrawl.
>Thanks!
>
>Julien
>
>On 14 April 2015 at 14:15, Allison, Timothy B. <talli...@mitre.org> wrote:
>
>> +1
>>
>> Thank you, Tyler!
>>
>> Apologies to Hong-Thai and community for not recognizing the severity of
>> TIKA-1600 when I voted in favor of rc1!
>>
>> Details...
>>
>> I reran against govdocs1, and there aren't any major surprises.
>>
>> On our Rackspace vm, I _finally_ unzipped the Common Crawl slice that
>> Julien Nioche created for us, and I ran against that as well. That turned
>> up TIKA-1605 and another exceedingly rare NPE in the PDFParser. I don't
>> think either of these are blockers, and they're now fixed in trunk.
>>
>> There are slightly fewer metadata values for some jpegs. For the one file
>> that I manually reviewed, 1.8-rc was missing these values (that were
>> available in 1.7):
>>
>> JPEG quality
>> IPTC-NAA record
>> Plug-in 1 Data
>>
>> Comparison reports are available here (much more work remains to be done
>> on tika-eval):
>>
>> https://github.com/tballison/share/tree/master/tika_comparisons
>>
>> ________________________________________
>> From: Tyler Palsulich <tpalsul...@apache.org>
>> Sent: Monday, April 13, 2015 1:56 PM
>> To: dev@tika.apache.org; u...@tika.apache.org
>> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>>
>> Hi Folks,
>>
>> A candidate for the Tika 1.8 release is available at:
>> https://dist.apache.org/repos/dist/dev/tika/
>>
>> The release candidate is a zip archive of the sources in:
>> http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>>
>> The SHA1 checksum of the archive is
>> 5e22fee9079370398472e59082d171ae2d7fdd31.
>>
>> In addition, a staged maven repository is available here:
>> https://repository.apache.org/content/repositories/orgapachetika-1009
>>
>> Please vote on releasing this package as Apache Tika 1.8. The vote is open
>> for the next 72 hours and passes if a majority of at least three +1 Tika
>> PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.8
>> [ ] ±0 I don't object to this release, but I haven't checked it
>> [ ] -1 Do not release this package because...
>>
>> Thanks,
>> Tyler
>>
>
>--
>
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
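
For anyone who wants to compare the detected-type proportions in Tim's table against another corpus, here is a rough sketch of how the shares could be computed. It is only an illustration, not part of tika-eval: the helper names (`base_type`, `shares_by_base_type`) are made up for this example, only the top seven rows of the table are included, and folding the charset variants into one base type is an assumption about what is useful, not something the thread prescribes.

```python
from collections import Counter

# Total file count and a SUBSET of the rows quoted from the thread
# (the full table has more rows; add them the same way if needed).
TOTAL_FILES = 2_135_515
DETECTED = {
    "image/jpeg": 857_625,
    "application/pdf": 320_443,
    "text/plain; charset=ISO-8859-1": 276_152,
    "image/png": 184_855,
    "text/plain; charset=windows-1252": 164_327,
    "image/gif": 51_809,
    "text/plain; charset=UTF-8": 44_766,
}


def base_type(content_type: str) -> str:
    """Drop parameters such as '; charset=UTF-8' from a content type string."""
    return content_type.split(";", 1)[0].strip()


def shares_by_base_type(counts: dict, total: int) -> dict:
    """Merge charset variants and return each base type's share of `total`.

    The result is ordered from the most to the least frequent base type.
    """
    merged = Counter()
    for content_type, count in counts.items():
        merged[base_type(content_type)] += count
    return {name: count / total for name, count in merged.most_common()}


if __name__ == "__main__":
    for name, share in shares_by_base_type(DETECTED, TOTAL_FILES).items():
        print(f"{name:<20} {share:6.2%}")
```

Dividing by the overall file total rather than by the sum of the listed rows keeps the shares comparable with other reports on the same slice, even when only part of the table is loaded.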