+1, this makes immense sense to me. Thanks, Juls and Tim.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "tallison314...@gmail.com" <tallison314...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Friday, April 3, 2015 at 5:35 AM
To: "d...@pdfbox.apache.org" <d...@pdfbox.apache.org>, "dev@tika.apache.org"
<dev@tika.apache.org>, "d...@poi.apache.org" <d...@poi.apache.org>
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
>  What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>CommonCrawl currently has the WET format, which extracts plain text from
>web pages.  My guess is that this only strips text from text-based
>formats.  Let me know if I'm wrong!
>
>
>Would there be any interest in adding another format, WETT (WET-Tika), or
>in supplementing the current WET by using Tika to extract content from
>binary formats too (PDF, MS Word, etc.)?
>
>
>Julien Nioche kindly carved out 220 GB on a Rackspace VM for us to
>experiment with on TIKA-1302
><https://issues.apache.org/jira/browse/TIKA-1302>.  But I'm now wondering
>whether it would make more sense to have CommonCrawl run Tika as part of
>its regular process and make the output available in one of your
>standard formats.
>
>
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies: PDFBox, POI, etc.) could get the stack traces
>to help prioritize bug fixes.
>
>
>Cheers,
>
>
>          Tim 
>
>
>
>
