Re: % of different content types out there on the web

Julien Nioche Sun, 29 Jan 2012 08:29:54 -0800

That could be an interesting experiment to do with the commoncrawl dataset
and Tika on Behemoth. Assuming of course that the detection is done
correctly by Tika.  Does anyone have a spare cluster on EC2 ;-) ?
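
For what it's worth, the tallying side of such an experiment is simple. Here is a minimal sketch, assuming the MIME types have already been detected upstream (e.g. by Tika) and are available as plain strings. The function names and the choice of which types count as "HTML" are my own assumptions for illustration, not part of Tika or Behemoth.

```python
from collections import Counter

# Hypothetical helper: the input is any iterable of MIME type strings,
# as a detector such as Tika might report them (parameters like
# "; charset=utf-8" are stripped before counting).
def content_type_breakdown(mime_types):
    """Return {mime_type: percentage} sorted by descending frequency."""
    counts = Counter(m.split(";")[0].strip().lower() for m in mime_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mime: 100.0 * n / total for mime, n in counts.most_common()}

# Assumption: these two types are treated as "HTML"; everything else is non-HTML.
HTML_TYPES = {"text/html", "application/xhtml+xml"}

def non_html_share(breakdown):
    """Sum the percentages of all types not listed in HTML_TYPES."""
    return sum(pct for mime, pct in breakdown.items() if mime not in HTML_TYPES)

if __name__ == "__main__":
    sample = ["text/html", "text/html; charset=utf-8", "application/pdf", "image/png"]
    breakdown = content_type_breakdown(sample)
    for mime, pct in breakdown.items():
        print(f"{mime}: {pct:.1f}%")
    print(f"non-HTML: {non_html_share(breakdown):.1f}%")
```

On a real crawl the counting would of course be done as a distributed job rather than in memory, but the per-type percentages come out the same way.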


Julien

On 28 January 2012 02:01, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> (sorry for the cross post)
>
> Hey Guys,
>
> I'm trying to find a good citation or estimate (if anyone has done one)
> of the breakdown (by % or some other metric) of the non-HTML content
> types out there on the web, based on a whole web crawl or a meaningful
> representative dataset.
>
> Anyone have any ideas about this?
>
> Thanks!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
