On Wed, 14 Dec 2011 09:15:12 +0100, Boris Zbarsky <bzbar...@mit.edu> wrote:
> On 12/14/11 3:01 AM, Simon Pieters wrote:
>>> What I have so far as a result is a list of about 1.7 million
>>> barewords used across several tens of thousands of pages.
>>
>> Do you have a more accurate figure for the number of pages?
>
> "57,444 unique urls, all taken from the top 21,000 domains" is all the
> information I have there so far.

Thanks!

>>> If people are interested in the exact methodology, I can probably get
>>> a description.
>>
>> I'm interested. It's hard to make conclusions from data without knowing
>> what the data is, how it is biased, what false positives it might have,
>> etc.
>
> Yeah, understood. Working on getting that description.
>
> -Boris
cheers
--
Simon Pieters
Opera Software