+1 this makes immense sense to me. Thanks Juls and Tim.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "tallison314...@gmail.com" <tallison314...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Friday, April 3, 2015 at 5:35 AM
To: "d...@pdfbox.apache.org" <d...@pdfbox.apache.org>, "dev@tika.apache.org" <dev@tika.apache.org>, "d...@poi.apache.org" <d...@poi.apache.org>
Subject: Fwd: Any interest in running Apache Tika as part of CommonCrawl?

>All,
>  What do we think?
>
>On Friday, April 3, 2015 at 8:23:11 AM UTC-4, talliso...@gmail.com wrote:
>
>CommonCrawl currently has the WET format that extracts plain text from
>web pages. My guess is that this is text stripping from text-y formats.
>Let me know if I'm wrong!
>
>Would there be any interest in adding another format: WETT (WET-Tika) or
>supplementing the current WET by using Tika to extract contents from
>binary formats too: PDF, MSWord, etc.
>
>Julien Nioche kindly carved out 220 GB for us to experiment with on
>TIKA-1302 <https://issues.apache.org/jira/browse/TIKA-1302> on a
>Rackspace vm. But, I'm wondering now if it would make more sense to have
>CommonCrawl run Tika as part of its regular process and make the output
>available in one of your standard formats.
>
>CommonCrawl consumers would get Tika output, and the Tika dev community
>(including its dependencies, PDFBox, POI, etc.) could get the stacktraces
>to help prioritize bug fixes.
>
>Cheers,
>
>  Tim