Julien Nioche-4 wrote:
>
>> I was saying that based on what the previous poster stated. Also the fact
>> that I have read through quite a bit of posts stating that the problem with
>> crawling in a vertical environment has to do with the way fetcher2 was
>> built. The fetches are grouped by domain name and if you have a lot of urls
>> from the same domain then you are not able to do quick mapreduce jobs.
>>
>
> Nutch's default behaviour is to be polite to the hosts it visits. If you own
> the hosts (or have an agreement with the owner) you can of course hit them
> as hard as you want and set a higher number of threads per host or time
> between hits. If you don't own the hosts then you simply should not do that
> and use the defaults used in Nutch as a matter of courtesy. (moreover if you
> are too aggressive in your choice of parameters then you'll probably be
> blacklisted by the target servers and won't be allowed to fetch any
> content)
>
> Let's be completely clear once and for all: there is no particular issue
> with using Nutch for vertical crawls - loads of people have done and still
> do that.
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
Julien,

We do own the hosts, and I have been experimenting with the number of threads I can use without crashing our web server and database. This has led me to refactor some of our code to improve connection pooling and resource allocation.

What I don't know how to speed up is the mapreduce jobs. It takes approximately 12 hours JUST to do the fetching of 250,000 or so urls. The map reduce part takes about 36 hours. Is this normal? Is there any way to speed this up?

I have seen talk of the generate.max.per.host setting, but I don't want to limit the number of urls I fetch, and to me that is what this setting would accomplish.
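
For anyone reading along, here is a rough back-of-the-envelope sketch in Python of the numbers above. It only does arithmetic: it turns the reported 250,000 urls in 12 hours into an average rate, and compares that against a simple ceiling for a crawl where each host is served by a fixed number of threads. The function names and the example values (1 host, 10 threads, 1 second per request) are my own assumptions for illustration, not Nutch settings or measurements from this thread.

```python
def observed_rate(urls: int, hours: float) -> float:
    """Average fetch rate in URLs per second over the whole fetch."""
    return urls / (hours * 3600)


def polite_ceiling(hosts: int, threads_per_host: int,
                   seconds_per_request: float) -> float:
    """Rough upper bound on URLs per second.

    Assumes each thread completes one request every
    `seconds_per_request` seconds (delay plus response time)
    and that threads on different hosts do not slow each other down.
    """
    return hosts * threads_per_host / seconds_per_request


def hours_needed(urls: int, rate: float) -> float:
    """Hours needed to fetch `urls` at `rate` URLs per second."""
    return urls / rate / 3600


if __name__ == "__main__":
    # Figures reported in the post: about 250,000 urls in about 12 hours.
    rate = observed_rate(250_000, 12)
    print(f"observed: {rate:.2f} URLs/s")

    # Hypothetical setup: 1 host, 10 threads, 1 second per request.
    ceiling = polite_ceiling(hosts=1, threads_per_host=10,
                             seconds_per_request=1.0)
    print(f"ceiling:  {ceiling:.2f} URLs/s")
    print(f"time at ceiling: {hours_needed(250_000, ceiling):.1f} hours")
```

If the observed rate is already close to the ceiling for your thread and delay settings, the fetch time is bounded by what the hosts are allowed to serve, and the remaining time would have to come from the other jobs. If it is far below, the fetch itself probably has room to improve.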

