Hi,

Being new to Nutch, I'm unsure how it deals with fetching more URLs
for a site that is currently being fetched.

For example, suppose we run inject -> generate -> fetch, and while
that fetch is still running we want to add more URLs, potentially for
the same sites that are currently being fetched. What's in place to
ensure that we continue to adhere to the politeness policy? As I
understand it, in the initial fetch all URLs for the same site are
partitioned into the same segment, though I'm unsure what would
maintain this for the second fetch.

Perhaps it's as simple as only allowing one fetch process per cluster
at a time, and then limiting the number of URLs per host so that the
second fetch can start in a timely manner?
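
To make the per-host limit concrete, here is a rough sketch in Python
of what I have in mind. It is not Nutch code and I'm not assuming
anything about how Nutch does this internally; the function name
partition_by_host and the max_per_host parameter are made up purely
for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, max_per_host):
    """Group URLs by host, keeping at most max_per_host per host.

    Hypothetical helper, not part of Nutch. Returns (selected, deferred):
    selected maps host -> URLs for this fetch round, and deferred lists
    the URLs held back for a later round.
    """
    selected = defaultdict(list)
    deferred = []
    for url in urls:
        # hostname is None for malformed URLs; bucket those under "".
        host = urlsplit(url).hostname or ""
        if len(selected[host]) < max_per_host:
            selected[host].append(url)
        else:
            deferred.append(url)
    return dict(selected), deferred
```

With a cap like this, each round stays short for any one host, and
anything deferred would simply wait for the next generate/fetch cycle
rather than being fetched concurrently against the same site.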

Thanks for your help,
Dan
