Hi Kevin,

Thanks for the reply. I can hardly imagine how there could be an incomplete record. The mapper is very simple: it reads the input line by line as Strings, splits each line on tabs, and outputs a Text pair for the sort and secondary sort. If there were an incomplete record, there should be an error, and the only place it could happen is the tab-splitting stage. I use the LZO codec to compress both the mapper output and the reducer output.
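To make the tab-splitting stage concrete, here is a minimal sketch in plain Java of a guard that rejects a line with the wrong number of fields instead of letting it fail later. The class name, the method name, and the field count of 3 are assumptions for illustration only; the real job's record layout is not shown in this thread.

```java
import java.util.Optional;

class TabRecordParser {
    // Hypothetical field count; the actual layout of the input is not given here.
    static final int EXPECTED_FIELDS = 3;

    /** Returns the fields if the line has exactly expectedFields tab-separated fields, else empty. */
    static Optional<String[]> parse(String line, int expectedFields) {
        if (line == null || line.isEmpty()) {
            return Optional.empty();
        }
        // A limit of -1 keeps trailing empty fields, so "a\tb\t" counts as 3 fields, not 2.
        String[] fields = line.split("\t", -1);
        return fields.length == expectedFields ? Optional.of(fields) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] lines = {"k1\tk2\tvalue", "k1\tk2"};
        for (String line : lines) {
            // A truncated record is reported and skipped rather than indexed past its end.
            System.out.println(parse(line, EXPECTED_FIELDS).isPresent() ? "ok" : "skipped");
        }
    }
}
```

Inside a mapper, the "skipped" branch could also bump a job counter, so that truncated records show up in the job summary rather than only in the task logs.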

There is one issue that I think might be the cause. The log directory has the structure listed below. It seems I had 294 mappers. Notice that one entry is missing: "attempt_201103231501_0007_m_000292_0". Where is the output of mapper Nr. 292? Was it a failed node or something?
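A gap like this can also be found mechanically instead of by eye. The sketch below, in plain Java, extracts the map task index from each attempt directory name and reports the indices with no attempt directory in a given range. The class and method names are hypothetical, and a gap only shows that no attempt directory exists in the listing examined, not why.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MissingMapAttempts {
    // Matches map attempt names such as attempt_201103231501_0007_m_000292_0.
    private static final Pattern MAP_ATTEMPT =
            Pattern.compile("attempt_\\d+_\\d+_m_(\\d{6})_\\d+");

    /** Returns map task indices in [from, to) that have no attempt directory in names. */
    static List<Integer> missing(List<String> names, int from, int to) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (String name : names) {
            Matcher m = MAP_ATTEMPT.matcher(name);
            if (m.matches()) {
                seen.add(Integer.parseInt(m.group(1)));
            }
        }
        List<Integer> gaps = new ArrayList<>();
        for (int i = from; i < to; i++) {
            if (!seen.contains(i)) {
                gaps.add(i);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // A few names taken from the listing in this message; reducer entries are ignored.
        List<String> names = Arrays.asList(
                "attempt_201103231501_0007_m_000290_0",
                "attempt_201103231501_0007_m_000291_0",
                "attempt_201103231501_0007_m_000293_0",
                "attempt_201103231501_0007_r_000000_0");
        System.out.println(missing(names, 290, 294)); // prints [292]
    }
}
```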


All the logs have been stalled for about two hours (since ~12:35, while the current time is 14:50). The few folders with timestamps around 14:45 are ones I visited myself. So neither the reducers nor the mappers have generated any logs in the past two hours.

Shi

drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0

Shi

On 3/24/2011 2:25 PM, kevin.le...@thomsonreuters.com wrote:
Shi,

The key here is the 99% done mapper. Nothing can move on until all
mappers complete.
Is it possible your data in the larger set has an incomplete record or
some such at the end?

Kevin

-----Original Message-----
From: Shi Yu [mailto:sh...@uchicago.edu]
Sent: Thursday, March 24, 2011 3:02 PM
To: hadoop user
Subject: Program freezes at Map 99% Reduce 33%

I am running a Hadoop program that processes terabyte-scale data. The
code was tested successfully on a small sample (100G). However, when I
try it on the full problem, the program freezes forever at Map 99%
Reduce 33%. There is no error, and the userlog folder is small (<30M);
if there were errors, it would contain gigabytes of error logs.

I checked the mapper and reducer logs. It seems that the reducer is
waiting for an output from a mapper, and that output never arrives.
What could cause this? Most of the configuration is left at the
defaults. I set "mapred.child.java.opts=-Xmx2000M
hadoop.job.history.user.location=none". The problem occurs on both
0.19.2 and 0.20.2. Thanks!

Example of Mapper logs:

2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging 3
sorted segments
2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 3 segments left of total size: 461743 bytes
2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index:
(11015008, 10030254, 607594)
2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201103231501_0007_m_000286_0 is done. And is in the process
of commiting
2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201103231501_0007_m_000286_0' done.

Example of Reducer logs:

2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
and0 dup hosts)
2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Need another 1 map output(s) where
0 is already in progress
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
and0 dup hosts)


Shi
