Hi Kevin,

thanks for the suggestion. I think I found the problem. My code is a chained map/reduce, and in the previous iteration there is one .lzo_deflate output that is 40 times larger than the other files. That was caused by a special "key" value, which occurs significantly more often than the other keys. I used a self-defined partitioner:

    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

Maybe all the occurrences of this key are sent to a single reducer, which yields such a huge compressed file. And moreover, is the .lzo_deflate file not splittable? I removed this huge file and reran the current task, and it worked smoothly. I didn't notice this problem before on the small test because the output was not compressed in that run.
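For illustration, a minimal stand-alone sketch (plain String instead of TextPair, names made up) of why that happens: with a hash partitioner like the one above, every record sharing the same first-field key is always assigned the same partition, so one very frequent key funnels all of its records to a single reducer:

```java
// Minimal sketch of the partitioning arithmetic (illustrative only):
// identical keys always map to the same partition, so a very frequent
// key overloads one reducer.
public class PartitionDemo {
    static int getPartition(String firstField, int numPartitions) {
        // Same formula as the custom partitioner: mask off the sign bit,
        // then take the remainder modulo the number of reducers.
        return (firstField.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 19;  // e.g. the 19 reduce attempts in this job
        // Every occurrence of the hot key lands on the same reducer.
        System.out.println(getPartition("hotKey", numReducers));
        System.out.println(getPartition("hotKey", numReducers));
    }
}
```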

Now I am trying to append a random string to that special key: the previous outputs are only input for the next task, so there is no reason to force all of them onto a single reducer.
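A minimal sketch of that "salting" idea (class name, delimiter, and bucket count are made up for illustration): spread the hot key over N sub-keys on the way in, and strip the suffix again in the next job before aggregating:

```java
import java.util.Random;

// Sketch of key "salting" (names and bucket count are illustrative):
// append a random suffix so one hot key is spread over several reducers,
// then strip the suffix in the next job to recover the original key.
public class KeySalter {
    static final int BUCKETS = 16;          // sub-keys per hot key
    static final Random RANDOM = new Random();

    // Map side: turn "hotKey" into e.g. "hotKey#7".
    static String salt(String key) {
        return key + "#" + RANDOM.nextInt(BUCKETS);
    }

    // Next job: drop the suffix before grouping/aggregating.
    static String unsalt(String saltedKey) {
        int i = saltedKey.lastIndexOf('#');
        return i < 0 ? saltedKey : saltedKey.substring(0, i);
    }

    public static void main(String[] args) {
        String salted = salt("hotKey");
        System.out.println(salted);           // e.g. hotKey#7
        System.out.println(unsalt(salted));   // hotKey
    }
}
```

This only works when, as here, the exact grouping can be restored (or is not needed) downstream; the delimiter must not occur in real keys.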

Shi


On 3/24/2011 4:03 PM, kevin.le...@thomsonreuters.com wrote:
Shi,

This states "Of course, the framework discards the sub-directory of
unsuccessful task-attempts."
http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html

So yes, the missing directory is likely a failure.

If you can, narrow the problem down by looking at sections of your data
on a smaller number of mappers. Look at just the tail of your input and
a couple of mappers and then compare that to your program run without
hadoop.

Kevin

-----Original Message-----
From: Shi Yu [mailto:sh...@uchicago.edu]
Sent: Thursday, March 24, 2011 3:57 PM
To: common-user@hadoop.apache.org
Subject: Re: Program freezes at Map 99% Reduce 33%

Hi Kevin,

thanks for the reply. I can hardly imagine an example of an incomplete
record. The mapper is very simple: it reads line by line as Strings,
splits each line by tab, and outputs a TextPair for sort and secondary
sort. If there were an incomplete record, there should be an error, and
the only place it could happen is the tab-splitting stage. I use the
LZO codec to compress the mapper output and reducer output.
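For reference, a hedged sketch of how that compression is typically wired up in the old (0.19/0.20-era) API; the codec class name is from the hadoop-lzo project and this is a config fragment, not a complete job:

```java
// Config sketch (assumes the hadoop-lzo library is on the classpath;
// class/method names as I recall them, not verified against this job):
JobConf conf = new JobConf(MyJob.class);

// Compress intermediate map output with LZO.
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(com.hadoop.compression.lzo.LzoCodec.class);

// Compress the final reducer output with LZO as well.
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, com.hadoop.compression.lzo.LzoCodec.class);
```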

There is one issue which I think might be the reason. The files in the
log directory have the following structure. It seems I had 294 mappers.
Notice that one directory is missing: "attempt_201103231501_0007_m_000292_0".
Where is the output of mapper Nr. 292? Was it a failed node or
something?


All the logs have just been sitting there for two hours (most timestamps
are ~12:35, while the current time is 14:50). A few folders show ~14:45
because I visited them myself. So neither the reducers nor the mappers
have generated any logs in the past two hours.

Shi

drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36
attempt_201103231501_0007_m_000281_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
attempt_201103231501_0007_m_000282_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
attempt_201103231501_0007_m_000283_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35
attempt_201103231501_0007_m_000284_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
attempt_201103231501_0007_m_000285_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59
attempt_201103231501_0007_m_000286_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
attempt_201103231501_0007_m_000287_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44
attempt_201103231501_0007_m_000288_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37
attempt_201103231501_0007_m_000289_0
drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37
attempt_201103231501_0007_m_000289_1
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22
attempt_201103231501_0007_m_000290_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45
attempt_201103231501_0007_m_000291_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_m_000293_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000000_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000001_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000002_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000003_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000004_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000005_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000006_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000007_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47
attempt_201103231501_0007_r_000008_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000009_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000010_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000011_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000012_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000013_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000014_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000015_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000016_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07
attempt_201103231501_0007_r_000017_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42
attempt_201103231501_0007_r_000018_0

Shi

On 3/24/2011 2:25 PM, kevin.le...@thomsonreuters.com wrote:
Shi,

The key here is the 99% done mapper. Nothing can move on until all
mappers complete.
Is it possible your data in the larger set has an incomplete record or
some such at the end?

Kevin

-----Original Message-----
From: Shi Yu [mailto:sh...@uchicago.edu]
Sent: Thursday, March 24, 2011 3:02 PM
To: hadoop user
Subject: Program freezes at Map 99% Reduce 33%

I am running a Hadoop program processing terabyte-scale data. The code
was tested successfully on a small sample (100 GB) and it worked.
However, when trying it on the full data set, the program freezes
forever at Map 99% Reduce 33%. There is no error, and the userlog
folder stays small (<30 MB); otherwise it would generate gigabytes of
error logs. I checked the logs of the mappers and reducers, and it
seems the reducers are waiting for a map output that never arrives.
What could be causing this? Most of the configuration is left at the
defaults; I set "mapred.child.java.opts=-Xmx2000M" and
"hadoop.job.history.user.location=none". The problem occurs on both
0.19.2 and 0.20.2. Thanks!

Example of Mapper logs:

2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging
3 sorted segments
2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 3 segments left of total size: 461743 bytes
2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index:
(11015008, 10030254, 607594)
2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201103231501_0007_m_000286_0 is done. And is in the
process of commiting
2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201103231501_0007_m_000286_0' done.

Example of Reducer logs:

2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
and 0 dup hosts)
2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Need another 1 map output(s)
where 0 is already in progress
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts
and 0 dup hosts)


Shi

