We use Hadoop and Nutch to crawl the website. We fetch the URL list from a SQL server and split it among the nodes of the cluster. When we increase the number of mappers, the number of duplicate results increases with it. For example, with 2 mappers each record may appear twice, and with 8 instances each result is duplicated 8 times. Any idea what causes this? Where could the problem be?

--
View this message in context: http://www.nabble.com/Duplicate-Input-and-duplicate-result-tp20905297p20905297.html
Sent from the Hadoop core-dev mailing list archive at Nabble.com.
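
The symptom described above (N mappers, N copies of every record) can be reproduced in a toy model. The sketch below is plain Python, not the actual Hadoop/Nutch job, and it rests on an unconfirmed assumption: that every map task ends up reading the complete URL list (for instance by rerunning the same SQL query) instead of only its own share. The names `run_job` and `split_input` are hypothetical and exist only for this illustration.

```python
def run_job(urls, num_mappers, split_input):
    """Toy model of a map-only job: return every URL emitted by all mappers."""
    emitted = []
    for mapper_id in range(num_mappers):
        if split_input:
            # Each mapper takes a disjoint slice of the list.
            my_urls = urls[mapper_id::num_mappers]
        else:
            # Assumed failure mode: each mapper reads the whole list.
            my_urls = urls
        emitted.extend(my_urls)
    return emitted


# Placeholder URLs, not taken from the real crawl.
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.com/d",
]

for n in (2, 8):
    unsplit = run_job(urls, n, split_input=False)
    split = run_job(urls, n, split_input=True)
    print(f"mappers={n} unsplit={len(unsplit)} split={len(split)}")
```

If the real job behaves like the unsplit case, the output grows linearly with the number of mappers, which matches the 2x and 8x duplication reported. Checking how the input splits are produced, and whether each split covers a distinct range of the URL list, would confirm or rule out this guess.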
