Rob,
That helps a lot.

Is this related to limiting URLs per host in any way?
https://issues.apache.org/jira/browse/NUTCH-272

Thanks,
Krish
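
P.S. If depth 2 is indeed what's needed, I assume the adjusted command from
my original post would look like this (same placeholders as before, and the
exact topN per Rob's suggestion; not tested yet):

  bin/nutch crawl urls/<list of domains> -dir <out> -depth 2 -topN 50000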

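P.P.S. On restricting the crawl to first-level pages only: one knob that
might help (an untested sketch; these are hypothetical rules for Nutch's
conf/regex-urlfilter.txt, where the first matching pattern wins) is a URL
filter that accepts only single-segment paths:

  # accept the seed home pages themselves
  +^https?://[^/]+/?$
  # accept pages exactly one level below the host root
  +^https?://[^/]+/[^/]+$
  # reject everything else
  -.

If anyone knows whether this interacts with the per-host limits discussed
in NUTCH-272, I'd appreciate a pointer.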
On Wed, Oct 27, 2010 at 5:14 PM, Rob Hunter <[email protected]> wrote:

> Krish,
>
>   I think what you're looking for is a depth of 2 - I believe a depth of
> 1 will only return foo.bar.  Also, given your depth change, I think you
> can reduce your topN to 50k.  I'm unsure whether your results will be
> evenly distributed across your domains; hopefully someone else has an
> answer for that.
>
> -- Rob
>
> -----Original Message-----
> From: Krish Pan [mailto:[email protected]]
> Sent: Wednesday, October 27, 2010 2:29 PM
> To: [email protected]
> Subject: downloading exact number of pages from list of seed urls
>
> Hi,
>
> I am trying to use Nutch to download an exact number of (say 5) HTML
> pages from each seed URL I provide.
>
> I was wondering if this is the right approach:
>
> Seed Urls = total 10,000
>
>  bin/nutch crawl urls/<list of domains> -dir <out> -depth 1 -topN 60000
>
> here depth = 1 because I only want pages from the first level
>
> i.e. if the domain is foo.bar, I want to download
>
> foo.bar/spam.htm
> foo.bar/ham.htm
> foo.bar/eggs.htm
>
> but NOT
> foo.bar/ham/spam.htm
>
> And,
>
> -topN is 60,000 because there are 10,000 seed URLs:
> 10,000 home pages plus 5 top pages per URL
>
> Any suggestions?
>
> Thanks,
> krish
>
