RE: Any interest in running Apache Tika as part of CommonCrawl?

Allison, Timothy B. Fri, 03 Apr 2015 05:59:13 -0700

Sorry, link wasn’t included:

https://groups.google.com/forum/#!topic/common-crawl/Cv21VRQjGN0

From: [email protected] [mailto:[email protected]]
Sent: Friday, April 03, 2015 8:35 AM
To: [email protected]; [email protected]; [email protected]
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

All,
  What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, 
[email protected]<mailto:[email protected]> wrote:
CommonCrawl currently offers the WET format, which contains plain text 
extracted from web pages.  My guess is that this text comes from stripping 
markup out of text-based formats only.  Let me know if I'm wrong!

Would there be any interest in adding another format, WETT (WET-Tika), or in 
supplementing the current WET by using Tika to extract content from binary 
formats too: PDF, MS Word, etc.?

Julien Nioche kindly carved out 220 GB on a Rackspace VM for us to experiment 
with on TIKA-1302<https://issues.apache.org/jira/browse/TIKA-1302>.  
But I'm now wondering whether it would make more sense to have CommonCrawl run 
Tika as part of its regular process and make the output available in one of 
your standard formats.

CommonCrawl consumers would get Tika output, and the Tika dev community 
(along with those of its dependencies: PDFBox, POI, etc.) could get the stack 
traces to help prioritize bug fixes.

Cheers,

          Tim
