That could be an interesting experiment to run with the Common Crawl
dataset and Tika on Behemoth, assuming of course that Tika detects the
content types correctly. Does anyone have a spare cluster on EC2 ;-) ?
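
For what it's worth, the tallying part of such an experiment could look
roughly like the sketch below. It assumes the detected MIME types (e.g. as
reported by Tika) are already available as plain strings; the function
names and the sample values are made up for illustration and are not real
crawl data or part of Tika or Behemoth.

```python
from collections import Counter


def content_type_breakdown(detected_types):
    """Return {mime_type: percentage} for an iterable of MIME type strings."""
    # Normalise each value: drop parameters such as "; charset=utf-8"
    # and lowercase, so that variants of the same type are counted together.
    counts = Counter(t.split(";")[0].strip().lower() for t in detected_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mime: 100.0 * n / total for mime, n in counts.items()}


def non_html_share(breakdown):
    """Sum the percentages of every type that is not HTML or XHTML."""
    # Assumption: these two types are what we treat as "HTML".
    html_types = {"text/html", "application/xhtml+xml"}
    return sum(pct for mime, pct in breakdown.items() if mime not in html_types)


# Hypothetical sample input, standing in for per-document detection results.
sample = ["text/html; charset=UTF-8", "text/html", "application/pdf", "image/jpeg"]
breakdown = content_type_breakdown(sample)
print(breakdown)
print(non_html_share(breakdown))
```

On a real crawl the counting would of course be done as a distributed job
rather than in memory, but the per-type percentages come out the same way.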

Julien

On 28 January 2012 02:01, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> of the breakdown (by % or some other metric) of the content types out
> there on the web that are non-HTML (based on a whole web crawl or a
> meaningful representative dataset).
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
