Good. Data skew should not look stuck. Try sending status updates so
that at least you can tell one mapper is still busy. And yes, salting
the key with extra data or including another field in it can help
reduce the skew.
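Something along these lines, using the old org.apache.hadoop.mapred
API (0.19.2/0.20.2); the class name, the input/output types, and the
reporting interval here are only placeholders:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: periodically tell the framework a slow mapper is alive.
    public class StatusReportingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private long records = 0;

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // ... actual map work and output.collect(...) go here ...
            if (++records % 100000 == 0) {
                // setStatus() shows up in the task list of the web UI;
                // progress() resets the task timeout so a mapper stuck
                // on a skewed key range is not killed as unresponsive.
                reporter.setStatus("processed " + records + " records");
                reporter.progress();
            }
        }
    }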
Kevin

-----Original Message-----
From: Shi Yu [mailto:sh...@uchicago.edu]
Sent: Thursday, March 24, 2011 5:56 PM
To: common-user@hadoop.apache.org
Subject: Re: Program freezes at Map 99% Reduce 33%

Hi Kevin,

thanks for the suggestion. I think I found the problem. My code is a
chained map/reduce, and in the previous iteration there was an
.lzo_deflate output 40 times larger than the other files. That was
because of a special "key" value which has significantly more
occurrences than any other key. I used a self-defined partitioner:

    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

so maybe all the occurrences of this key were sent to a single reducer,
which yielded such a huge compressed file. And moreover, the
.lzo_deflate file is not splittable, is it? I removed this huge file
and reran the current task, and it worked smoothly. I didn't notice
this problem before on the small test because the output was not
compressed in that run. Now I am trying to add some random string to
that special key, since the previous outputs are still input for the
next task and there is no reason to force them all to a single reducer.
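Something like this is what I have in mind; the hot-key literal, the
"#" delimiter, and the bucket count are made up here, and the next job
in the chain has to strip the suffix before using the key:

    import java.util.Random;

    // Sketch: spread one known hot key over several reducers by
    // appending a random bucket suffix before partitioning.
    public class KeySalter {

        private static final String HOT_KEY = "the-special-key"; // placeholder
        private static final int SALT_BUCKETS = 16;              // placeholder
        private final Random random = new Random();

        public String salt(String first) {
            // "key" becomes e.g. "key#7", so the hash-based
            // partitioner above now maps the hot key to up to
            // SALT_BUCKETS reducers instead of one.
            if (HOT_KEY.equals(first)) {
                return first + "#" + random.nextInt(SALT_BUCKETS);
            }
            return first;
        }

        public String unsalt(String salted) {
            // Strip the suffix again downstream (assumes '#' does
            // not occur in real keys).
            int i = salted.lastIndexOf('#');
            return i < 0 ? salted : salted.substring(0, i);
        }
    }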
Shi

On 3/24/2011 4:03 PM, kevin.le...@thomsonreuters.com wrote:
> Shi,
>
> This states "Of course, the framework discards the sub-directory of
> unsuccessful task-attempts."
> http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html
>
> So yes, the missing directory is likely a failure.
>
> If you can, narrow the problem down by looking at sections of your
> data on a smaller number of mappers. Look at just the tail of your
> input and a couple of mappers, and then compare that to your program
> run without Hadoop.
>
> Kevin
>
> -----Original Message-----
> From: Shi Yu [mailto:sh...@uchicago.edu]
> Sent: Thursday, March 24, 2011 3:57 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Program freezes at Map 99% Reduce 33%
>
> Hi Kevin,
>
> thanks for the reply. I can hardly imagine an example of an incomplete
> record. The mapper is very simple: it just reads line by line as
> Strings, splits each line by tab, and outputs a TextPair for sort and
> secondary sort. If there were an incomplete record, there would be an
> error, and the only place it could happen is the tab-splitting stage.
> I use the LZO codec to compress the mapper output and reducer output.
>
> There is one issue which I think might be the reason. The files in
> the log have the following structure. It seems I had 294 mappers.
> Notice that one directory is missing:
> "attempt_201103231501_0007_m_000292_0". Where is the output of mapper
> Nr. 292? Was it a failed node or something?
>
> All the logs just hang there; nothing has been written for 2 hours
> (since ~12:35, while the current time is 14:50). Some folders were
> visited around 14:45 by me. So neither a reducer nor a mapper has
> generated any logs in the past two hours.
>
> Shi
>
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
> drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
> drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0
>
> Shi
>
> On 3/24/2011 2:25 PM, kevin.le...@thomsonreuters.com wrote:
>> Shi,
>>
>> The key here is the 99%-done mapper. Nothing can move on until all
>> mappers complete. Is it possible your data in the larger set has an
>> incomplete record or some such at the end?
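>>
>> A cheap way to check for that would be a guard in the mapper that
>> counts truncated lines instead of failing; the class name, counter
>> names, and two-field layout below are placeholders, not your actual
>> code:
>>
>>     import java.io.IOException;
>>     import org.apache.hadoop.io.LongWritable;
>>     import org.apache.hadoop.io.Text;
>>     import org.apache.hadoop.mapred.MapReduceBase;
>>     import org.apache.hadoop.mapred.Mapper;
>>     import org.apache.hadoop.mapred.OutputCollector;
>>     import org.apache.hadoop.mapred.Reporter;
>>
>>     // Sketch: skip and count incomplete records instead of letting
>>     // them break the tab-splitting stage.
>>     public class GuardedSplitMapper extends MapReduceBase
>>             implements Mapper<LongWritable, Text, Text, Text> {
>>
>>         public void map(LongWritable key, Text value,
>>                         OutputCollector<Text, Text> output,
>>                         Reporter reporter) throws IOException {
>>             String[] fields = value.toString().split("\t", -1);
>>             if (fields.length < 2) {
>>                 // A counter makes truncated lines visible in the
>>                 // job UI without producing gigabytes of logs.
>>                 reporter.incrCounter("Input", "MALFORMED_LINES", 1);
>>                 return;
>>             }
>>             output.collect(new Text(fields[0]), new Text(fields[1]));
>>         }
>>     }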
>>
>> Kevin
>>
>> -----Original Message-----
>> From: Shi Yu [mailto:sh...@uchicago.edu]
>> Sent: Thursday, March 24, 2011 3:02 PM
>> To: hadoop user
>> Subject: Program freezes at Map 99% Reduce 33%
>>
>> I am running a Hadoop program processing terabyte-size data. The
>> code was tested successfully on a small sample (100 GB) and it
>> worked. However, when trying it on the full problem, the program
>> freezes forever at Map 99% Reduce 33%. There is no error, and the
>> userlog folder stays small (<30 MB); otherwise it would generate
>> gigabytes of error logs. I checked the logs of the mappers and
>> reducers, and it seems that a reducer is waiting for a map output
>> that never arrives. What could be causing this? Most of the
>> configuration is left at the defaults. I set
>> "mapred.child.java.opts=-Xmx2000M
>> hadoop.job.history.user.location=none". The problem occurs on both
>> 0.19.2 and 0.20.2. Thanks!
>>
>> Example of mapper logs:
>>
>> 2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
>> 2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 461743 bytes
>> 2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index: (11015008, 10030254, 607594)
>> 2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201103231501_0007_m_000286_0 is done. And is in the process of commiting
>> 2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201103231501_0007_m_000286_0' done.
>>
>> Example of reducer logs:
>>
>> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
>> 2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
>> 2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0 Need another 1 map output(s) where 0 is already in progress
>> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
>> 2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
>>
>> Shi