Re: Any interest in running Apache Tika as part of CommonCrawl?

Mattmann, Chris A (3980) Fri, 03 Apr 2015 11:08:10 -0700

+1 this makes immense sense to me. Thanks Juls and Tim.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 3, 2015 at 5:35 AM
To: "[email protected]" <[email protected]>, "[email protected]"
<[email protected]>, "[email protected]" <[email protected]>
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
>  What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:
>
>CommonCrawl currently has the WET format that extracts plain text from
>web pages.  My guess is that the text is stripped only from text-based
>formats.  Let me know if I'm wrong!
>
>
>Would there be any interest in adding another format, WETT (WET-Tika), or
>in supplementing the current WET by using Tika to extract contents from
>binary formats too (PDF, MS Word, etc.)?
>
>
>Julien Nioche kindly carved out 220 GB for us to experiment with on
>TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a
>Rackspace VM.  But I'm wondering now if it would make more sense to have
>CommonCrawl run Tika as part of its regular process and make the output
>available in one of your standard formats.
>
>
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies, PDFBox, POI, etc.) could get the stacktraces
>to help prioritize bug fixes.
>
>
>Cheers,
>
>
>          Tim 
