Awesome!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "Allison, Timothy B." <talli...@mitre.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Tuesday, April 14, 2015 at 12:09 PM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: topic change: common crawl slice on TIKA-1302 vm

>Hi Julien,
>  We're just beginning to scratch the surface.  There's much to learn
>from this set.  Apologies for my delay, and thank you!
>
>These proportions line up pretty closely with your blog post
>(http://digitalpebble.blogspot.com/2014/11/generating-test-corpus-for-apache-tika.html)
>
>Total files: 2,135,515
>
>Detected content types:
>DETECTED_CONTENT_TYPE (by TIKA 1.8-rc2)      COUNT
>image/jpeg                                 857,625
>application/pdf                            320,443
>text/plain; charset=ISO-8859-1             276,152
>image/png                                  184,855
>text/plain; charset=windows-1252           164,327
>image/gif                                   51,809
>text/plain; charset=UTF-8                   44,766
>audio/x-wav                                 34,402
>application/octet-stream                    28,586
>message/rfc822                              18,231
>text/html; charset=ISO-8859-1               17,528
>application/xhtml+xml; charset=UTF-8        16,845
>application/zip                             14,385
>text/html; charset=UTF-8                     9,626
>audio/mpeg                                   8,670
>text/html; charset=windows-1252              7,818
>application/msword                           7,782
>application/x-archive                        5,970
>application/x-bibtex-text-file               5,274
>application/xml                              5,234
>image/vnd.djvu                               5,063
>application/rss+xml                          4,726
>application/gzip                             4,443
>application/xhtml+xml; charset=ISO-8859-1    4,228
>application/epub+zip                         3,458
>image/tiff                                   2,980
>image/jp2                                    2,706
>application/rtf                              1,622
>
>________________________________________
>From: Julien Nioche <lists.digitalpeb...@gmail.com>
>Sent: Tuesday, April 14, 2015 9:24 AM
>To: dev@tika.apache.org
>Subject: Re: [VOTE] Apache Tika 1.8 Release Candidate #2
>
>Hi Tim
>
>Great to hear that you managed to use the dataset from CommonCrawl.
>Thanks!
>
>Julien
>
>On 14 April 2015 at 14:15, Allison, Timothy B. <talli...@mitre.org> wrote:
>
>> +1
>>
>> Thank you, Tyler!
>>
>> Apologies to Hong-Thai and community for not recognizing the severity of
>> TIKA-1600 when I voted in favor of rc1!
>>
>> Details...
>>
>> I reran against govdocs1, and there aren't any major surprises.
>>
>> On our Rackspace VM, I _finally_ unzipped the Common Crawl slice that
>> Julien Nioche created for us, and I ran against that as well.  That turned
>> up TIKA-1605 and another exceedingly rare NPE in the PDFParser.  I don't
>> think either of these is a blocker, and both are now fixed in trunk.
>>
>> There are slightly fewer metadata values for some JPEGs.  For the one file
>> that I manually reviewed, 1.8-rc was missing these values (that were
>> available in 1.7):
>>
>> JPEG quality
>> IPTC-NAA record
>> Plug-in 1 Data
>>
>> Comparison reports are available here (much more work remains to be done
>> on tika-eval):
>>
>> https://github.com/tballison/share/tree/master/tika_comparisons
>>
>> ________________________________________
>> From: Tyler Palsulich <tpalsul...@apache.org>
>> Sent: Monday, April 13, 2015 1:56 PM
>> To: dev@tika.apache.org; u...@tika.apache.org
>> Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
>>
>> Hi Folks,
>>
>> A candidate for the Tika 1.8 release is available at:
>>   https://dist.apache.org/repos/dist/dev/tika/
>>
>> The release candidate is a zip archive of the sources in:
>>   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
>>
>> The SHA1 checksum of the archive is
>>   5e22fee9079370398472e59082d171ae2d7fdd31.
>>
>> In addition, a staged maven repository is available here:
>>   https://repository.apache.org/content/repositories/orgapachetika-1009
>>
>> Please vote on releasing this package as Apache Tika 1.8. The vote is open
>> for the next 72 hours and passes if a majority of at least three +1 Tika
>> PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.8
>> [ ] ±0 I don't object to this release, but I haven't checked it
>> [ ] -1 Do not release this package because...
>>
>> Thanks,
>> Tyler
>>
>
>
>
>--
>
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
