Hello Mike,

I am having the same problem with my own map/reduce jobs. I have a job that requires two pieces of data per key, and as a sanity check I verify in the reducer that both arrive, but sometimes they don't. Stranger still, the same tasks that complain about missing key/value pairs may fail two or three times and then succeed on a subsequent attempt, which leads me to believe the bug involves some source of randomness (I'm not sure, but I think the map outputs are shuffled?).
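For reference, the sanity check looks roughly like this. A minimal sketch against the old, pre-generics org.apache.hadoop.mapred interfaces; the class name and the hard-coded expectation of exactly two values per key are illustrative, not my actual job:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PairCheckReducer extends MapReduceBase implements Reducer {

  // Every key is expected to arrive with exactly two values; fewer
  // means a map output went missing somewhere before the reduce.
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    int seen = 0;
    while (values.hasNext()) {
      output.collect(key, (Writable) values.next());
      seen++;
    }
    if (seen != 2) {
      // Flag the anomaly in the task status and logs so failed
      // attempts can be compared against the retries that succeed.
      reporter.setStatus("key " + key + ": saw " + seen + " of 2 values");
      System.err.println("SANITY CHECK FAILED for key " + key
          + ": only " + seen + " value(s)");
    }
  }
}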
All of my code works perfectly with 0.9, so I went back and compared the sizes of the outputs. For some jobs, the 0.11 outputs were consistently 4 bytes larger, probably due to changes in SequenceFile. But for other jobs the output sizes were all over the place: some partitions were empty, some were correct, and some were missing data. Something seems to be seriously wrong with 0.11, so I suggest you use 0.9. I've been trying to pinpoint the bug, but its random nature makes that hard.
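The size comparison above was just a listing of the part files in each job's output directory. A minimal sketch of that check; note that the FileStatus-based listing shown here is from later Hadoop releases than 0.11, and the class name is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartSizes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outDir = new Path(args[0]); // the job's output directory
    long total = 0;
    for (FileStatus stat : fs.listStatus(outDir)) {
      String name = stat.getPath().getName();
      // Only the reduce outputs (part-00000, part-00001, ...) matter here.
      if (name.startsWith("part-")) {
        System.out.println(name + "\t" + stat.getLen());
        total += stat.getLen();
      }
    }
    System.out.println("TOTAL\t" + total);
  }
}

Running this against the same job's output under 0.9 and 0.11 (or two runs under 0.11) makes the empty and short partitions obvious at a glance.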
On 2/9/07, Mike Smith <[EMAIL PROTECTED]> wrote:

The map/reduce jobs are not consistent in both the Hadoop 0.11 release and trunk: rerunning the same job gives different results. I have observed this inconsistency in the map output of several different jobs. A simple test to double-check is to use Hadoop 0.11 with the Nutch trunk:

1) Make a crawl.
2) Update the crawldb.
3) Use readdb -stats to get the statistics.
4) Update the crawldb again (the crawldb should still be the same, since no new crawl has happened).
5) Now use readdb -stats to get the statistics again.

You will see that the two sets of statistics are different:

07/02/08 22:13:43 INFO crawl.CrawlDbReader: TOTAL urls: 6782524
07/02/08 22:13:43 INFO crawl.CrawlDbReader: retry 0: 6757921
07/02/08 22:13:43 INFO crawl.CrawlDbReader: retry 1: 24601
07/02/08 22:13:43 INFO crawl.CrawlDbReader: retry 2: 2
07/02/08 22:13:43 INFO crawl.CrawlDbReader: min score: 0.0090
07/02/08 22:13:43 INFO crawl.CrawlDbReader: avg score: 0.436
07/02/08 22:13:43 INFO crawl.CrawlDbReader: max score: 9005.445
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 6102449
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 2 (db_fetched): 570983
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 3 (db_gone): 23359
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 41248
07/02/08 22:13:43 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 44485
07/02/08 22:13:50 INFO crawl.CrawlDbReader: CrawlDb statistics: done

07/02/09 02:38:29 INFO crawl.CrawlDbReader: TOTAL urls: 6438347
07/02/09 02:38:29 INFO crawl.CrawlDbReader: retry 0: 6414923
07/02/09 02:38:29 INFO crawl.CrawlDbReader: retry 1: 23422
07/02/09 02:38:29 INFO crawl.CrawlDbReader: retry 2: 2
07/02/09 02:38:29 INFO crawl.CrawlDbReader: min score: 0.0090
07/02/09 02:38:29 INFO crawl.CrawlDbReader: avg score: 0.453
07/02/09 02:38:29 INFO crawl.CrawlDbReader: max score: 10358.287
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 5787233
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 2 (db_fetched): 547037
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 3 (db_gone): 22311
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 39315
07/02/09 02:38:29 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 42451
07/02/09 02:38:36 INFO crawl.CrawlDbReader: CrawlDb statistics: done

If you keep doing this, you will see different statistics each time. This is not a Nutch problem, since it happens for non-Nutch jobs as well. My guess is that somewhere between the mappers and the reducers some keys go missing at random. Has anybody experienced this?

Thanks, Mike
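As an independent cross-check of those readdb -stats numbers, the CrawlDb entries can be counted directly, bypassing the stats map/reduce job entirely; if the raw count also changes after an update, the data itself is being corrupted rather than the stats computation. A minimal sketch, assuming the usual <crawldb>/current/part-*/data MapFile layout; the class name and path handling are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class CrawlDbCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Assumed layout: <crawldb>/current/part-*/data, where each part is
    // a MapFile whose data file is readable as a plain SequenceFile.
    Path current = new Path(args[0], "current");
    long total = 0;
    for (FileStatus part : fs.listStatus(current)) {
      Path data = new Path(part.getPath(), "data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        total++; // one CrawlDb entry per key/value pair
      }
      reader.close();
    }
    System.out.println("TOTAL urls: " + total);
  }
}

Comparing this total before and after step 4 would show whether updatedb itself is dropping records, independently of what readdb -stats reports.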
