On Fri, 17 Sep 2010 04:46:44 -0700 (PDT), kenf_nc
<ken.fos...@realestate.com> wrote:

>A slightly different route to take, but one that should help test/refine
>a semantic parser, is Wikipedia. They make available their entire corpus,
>or any subset you define. The whole thing is around 14 terabytes, but you
>can get smaller sets.

Actually, I already do heavy analysis of the entire Wikipedia corpus, plus
the top 1M webpages from Alexa and all of the DMOZ URLs, in order to build
the semantic engine in the first place.  However, an outside corpus is
required to test its quality beyond that space.

Cheers, Ian
