That could be an interesting experiment to run with the Common Crawl dataset and Tika on Behemoth, assuming of course that Tika detects the content types correctly. Does anyone have a spare cluster on EC2? ;-)
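For what it's worth, the counting side of such an experiment is simple once detection has been done. Here is a minimal sketch, assuming the MIME types have already been detected (e.g. by Tika) and are available as plain strings. The names `HTML_TYPES`, `normalize`, `breakdown` and `non_html_share` are hypothetical, and treating `application/xhtml+xml` as HTML is my own assumption, not something Tika or Behemoth prescribes.

```python
from collections import Counter

# Assumption: these two media types are what we count as "HTML".
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def normalize(mime):
    """Drop parameters such as '; charset=UTF-8' and lowercase the type."""
    return mime.split(";", 1)[0].strip().lower()


def breakdown(detected_types):
    """Return {media type: percentage of documents}, most common first."""
    counts = Counter(normalize(m) for m in detected_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: 100.0 * c / total for t, c in counts.most_common()}


def non_html_share(detected_types):
    """Return the percentage of documents whose type is not HTML."""
    types = [normalize(m) for m in detected_types]
    if not types:
        return 0.0
    non_html = sum(1 for t in types if t not in HTML_TYPES)
    return 100.0 * non_html / len(types)


if __name__ == "__main__":
    # Made-up sample input, only to show the call shape.
    sample = ["text/html; charset=UTF-8", "text/html",
              "application/pdf", "image/png"]
    print(breakdown(sample))
    print(non_html_share(sample))
```

On a real crawl the same tally would be done as a distributed job rather than in memory, but the per-document logic would look much the same.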
Julien

On 28 January 2012 02:01, Mattmann, Chris A (388J) <chris.a.mattm...@jpl.nasa.gov> wrote:

> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> that estimates
> the breakout (by % or some other metric) of content types out there out
> the web
> (with a whole web crawl or a meaningful representative dataset) that are
> non HTML.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com