On 6/12/07, patrik <[EMAIL PROTECTED]> wrote:
> > When the generator runs in distributed mode, it partitions urls to
> > separate map tasks according to their hosts. This way, urls under
> > the same host end up in the same map task (which is necessary for
> > politeness). So, in your case, you either have very few hosts (of
> > which one has almost 100K urls) or there is a problem with
> > partitioning.
>
> Got it. Yup, all the urls are from one host. I understand it's not
> polite, but is there any configuration setting that'll change that?

There is no configuration option. You may change PartitionUrlByHost's
code so that it no longer partitions urls by host :)
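For example, a replacement partitioner could hash the whole url instead
of just its host. This is only a rough sketch, not the actual Nutch
class (the class name is made up, and it uses the old mapred Partitioner
interface, so check it against your Hadoop version):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Spreads urls over partitions by hashing the whole url instead of its
// host, so a single big host no longer collapses into one map task.
public class PartitionUrlIgnoringHost implements Partitioner<Text, Writable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, Writable value, int numReduceTasks) {
    // hash the full url rather than url.getHost(), so urls from the
    // same host land in different partitions (this defeats per-host
    // politeness)
    return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}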

I think you may also run a segment merge. If you run segmerge on a
single segment (with the number of reduce tasks set to the desired
number of fetchers), segmerge will put an equal number of urls into
every part. Then set fetcher.max.threads.per.host to a value greater
than 1 and you have a very impolite fetcher. Please don't run this to
fetch a site you don't control :)
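Roughly, the two settings involved would look like the sketch below if
you set them programmatically on a JobConf. This is just an
illustration, not a complete job driver, and the
"fetcher.max.threads.per.host" key is the one named above, so
double-check it against nutch-default.xml for your version:

import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.util.NutchConfiguration;

public class ImpoliteFetchSettings {
  public static void main(String[] args) {
    JobConf job = new JobConf(NutchConfiguration.create());

    // one reduce output part per desired fetcher, so the merged
    // segment is split into equal-sized parts regardless of host
    job.setNumReduceTasks(10);

    // allow more than one fetcher thread to hit the same host at once
    job.setInt("fetcher.max.threads.per.host", 5);

    System.out.println("reduce tasks = " + job.getNumReduceTasks());
    System.out.println("threads/host = "
        + job.getInt("fetcher.max.threads.per.host", 1));
  }
}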

-- 
Doğacan Güney