Hello all! When using Nutch 1.3 in fully distributed mode, where does the fetching occur? Does each node get a list of URLs to fetch? Which property in Hadoop/MapReduce, etc. decides how many URLs a node gets to fetch? I am worried about memory on my nodes. Some of the files in our enterprise are very large, such as 800 MB PDF files.
I am able to run inject on my cluster, but then the generate step fails and I always lose one node from the cluster.

