Fwd: Any interest in running Apache Tika as part of CommonCrawl?

tallison314159 Fri, 03 Apr 2015 09:13:51 -0700

All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, [email protected] wrote:


> CommonCrawl currently has the WET format, which contains plain text 
> extracted from web pages.  My guess is that this is done by stripping text 
> out of text-based formats.  Let me know if I'm wrong!
>
> Would there be any interest in adding another format, WETT (WET-Tika), or 
> in supplementing the current WET by using Tika to extract content from 
> binary formats too (PDF, MSWord, etc.)?
>
> Julien Nioche kindly carved out 220 GB on a Rackspace VM for us to 
> experiment with for TIKA-1302 
> <https://issues.apache.org/jira/browse/TIKA-1302>.  But I'm wondering now 
> if it would make more sense to have CommonCrawl run Tika as part of its 
> regular process and make the output available in one of your standard 
> formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community 
> (along with those of its dependencies: PDFBox, POI, etc.) could get the 
> stack traces to help prioritize bug fixes.
>
> Cheers,
>
>           Tim 
>
