All,

What do we think?

On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
> CommonCrawl currently has the WET format that extracts plain text from web
> pages. My guess is that this is text stripping from text-y formats. Let
> me know if I'm wrong!
>
> Would there be any interest in adding another format: WETT (WET-Tika) or
> supplementing the current WET by using Tika to extract contents from binary
> formats too: PDF, MSWord, etc.
>
> Julien Nioche kindly carved out 220 GB for us to experiment with on
> TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on
> a Rackspace vm. But, I'm wondering now if it would make more sense to have
> CommonCrawl run Tika as part of its regular process and make the output
> available in one of your standard formats.
>
> CommonCrawl consumers would get Tika output, and the Tika dev community
> (including its dependencies, PDFBox, POI, etc.) could get the stacktraces
> to help prioritize bug fixes.
>
> Cheers,
>
> Tim