[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789537#action_12789537
 ] 

Hudson commented on MAPREDUCE-1171:
---

Integrated in Hadoop-Mapreduce-trunk #171 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/171/])
MAPREDUCE-1171. Allow shuffle retries and read-error reporting to be 
configurable. Contributed by Amareshwari Sriramadasu.
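
For reference, here is a minimal sketch (not the committed patch) of how a 
reduce-side copier could read the two knobs this change makes configurable. The 
property names are the ones discussed on this issue; the class name, field 
names, and default values are assumptions for illustration only.

{code}
// Hypothetical sketch only -- not the actual patch. It shows reading the two
// shuffle knobs discussed on this issue from the job configuration.
import org.apache.hadoop.conf.Configuration;

public class ShuffleRetryConfig {
  // Property names as they appear in the comments on this issue.
  static final String MAX_FETCH_FAILURES =
      "mapreduce.reduce.shuffle.maxfetchfailures";
  static final String NOTIFY_READ_ERROR =
      "mapreduce.reduce.shuffle.notify.readerror";

  final int maxFetchFailuresBeforeReporting; // retries before reporting to the JobTracker
  final boolean notifyReadError;             // report every read error immediately?

  ShuffleRetryConfig(Configuration conf) {
    // 10 was the hardcoded retry count on 0.21 per the discussion below;
    // the defaults used here are assumptions.
    this.maxFetchFailuresBeforeReporting = conf.getInt(MAX_FETCH_FAILURES, 10);
    this.notifyReadError = conf.getBoolean(NOTIFY_READ_ERROR, true);
  }
}
{code}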


> Lots of fetch failures
> --
>
> Key: MAPREDUCE-1171
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1171
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: task
>Affects Versions: 0.21.0
>Reporter: Christian Kunz
>Assignee: Amareshwari Sriramadasu
>Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1171-1-ydist.txt, patch-1171-1.txt, 
> patch-1171-2.txt, patch-1171-ydist.txt, patch-1171.txt
>
>
> Since we upgraded to hadoop-0.20.1 from hadoop-0.18.3, we see a lot more map 
> task failures because of 'Too many fetch-failures'.
> One of our jobs makes hardly any progress, because 3000 reduces are unable to 
> fetch the map output of 2 trailing maps (with about 80GB of output each), 
> which are repeatedly marked as failed because reduces cannot fetch their map 
> output.
> One difference to hadoop-0.18.3 seems to be that reduce tasks report a failed 
> map-output fetch even after a single try when it was a read error 
> (cr.getError().equals(CopyOutputErrorType.READ_ERROR)). I do not think this is 
> a good idea, as trailing map tasks will be attacked by all reduces 
> simultaneously.
> Here is a log output of a reduce task:
> {noformat}
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: 
> attempt_200910281903_0028_r_00_0 copy failed: 
> attempt_200910281903_0028_m_002781_1 from some host
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: 
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
> at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task 
> attempt_200910281903_0028_r_00_0: Failed fetch #1 from 
> attempt_200910281903_0028_m_002781_1
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
> fetch map-output from attempt_200910281903_0028_m_002781_1 even after 
> MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to 
> the JobTracker.
> {noformat}
> Also I saw a few log messages which look suspicious as if successfully 
> fetched map output is discarded because of the map being marked as failed 
> (because of too many fetch failures). This would make the situation even 
> worse.
> {noformat}
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: header: 
> attempt_200910281903_0028_m_001076_0, compressed len: 21882555, decompressed 
> len: 23967845
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
> 23967845 bytes (21882555 raw bytes) into RAM from 
> attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Read 
> 23967845 bytes from map-output for attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from 
> attempt_200910281903_0028_m_001076_0 -> (20, 39772) from some host
> ...
> 2009-10-29 22:10:07,220 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring 
> obsolete output of FAILED map-task: 'attempt_200910281903_0028_m_001076_0'
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-10 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789099#action_12789099
 ] 

Arun C Murthy commented on MAPREDUCE-1171:
--

Dhruba - currently the plan is to put this into 0.21 and y20. 




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-10 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789049#action_12789049
 ] 

dhruba borthakur commented on MAPREDUCE-1171:
-

Hi Amareshwari & Jothi: can you please advise whether you plan to check this 
patch into the 0.20 release, or into the Yahoo! dist 0.20 release?




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-10 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788651#action_12788651
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


All unit tests except TestHdfsProxy passed on my machine with the ydist patch.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-10 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788603#action_12788603
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


The test failure, TestTrackerBlacklistAcrossJobs, is unrelated to the patch. 
The log has the following ZipException:
{noformat}
2009-12-10 07:51:17,307 WARN  mapred.TaskTracker 
(TaskTracker.java:startNewTask(1887)) - Error initializing 
attempt_20091210075025802_0001_m_01_0:
java.lang.RuntimeException: java.util.zip.ZipException: ZIP_Read: error reading 
zip file
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1600)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1408)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1352)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:574)
at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1874)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:392)
at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:925)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:869)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1883)
at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:109)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1848)
{noformat}

The same test passes on my machine.


[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788596#action_12788596
 ] 

Hadoop QA commented on MAPREDUCE-1171:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427572/patch-1171-2.txt
  against trunk revision 889085.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788510#action_12788510
 ] 

Hadoop QA commented on MAPREDUCE-1171:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427567/patch-1171-ydist.txt
  against trunk revision 889085.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/181/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788137#action_12788137
 ] 

Hadoop QA commented on MAPREDUCE-1171:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427455/patch-1171.txt
  against trunk revision 888761.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/artifact/trunk/build/test/checkstyle-errors.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/console

This message is automatically generated.


[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-09 Thread Jothi Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788027#action_12788027
 ] 

Jothi Padmanabhan commented on MAPREDUCE-1171:
--

Patch looks fine to me; a couple of minor nits:

# Can we rename {{maxFetchFailuresBeforeReport}} to 
{{maxFetchFailuresBeforeReporting}}?
# I think the documentation in mapred-default for 
{{mapreduce.reduce.shuffle.notify.readerror}} can be changed to something like 
{{Expert. Flag to decide whether JobTracker should be notified on every read 
error or not. If the flag is false, read errors are treated similar to 
connection errors}}.
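
For illustration, the corresponding mapred-default.xml entry might then read as 
follows. This is only a sketch using the suggested wording above; the default 
value shown is an assumption, not taken from the patch.

{noformat}
<property>
  <name>mapreduce.reduce.shuffle.notify.readerror</name>
  <!-- Default value is assumed here for illustration. -->
  <value>true</value>
  <description>Expert. Flag to decide whether JobTracker should be notified
  on every read error or not. If the flag is false, read errors are treated
  similar to connection errors.</description>
</property>
{noformat}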





[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-09 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788019#action_12788019
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


bq. verified with configuration "mapreduce.reduce.shuffle.notify.readerror" set 
to true, the fetch failure is notified after 
"mapreduce.reduce.shuffle.maxfetchfailures". With 
"mapreduce.reduce.shuffle.notify.readerror" set to false, it is notified 
immediately after the failure.
Sorry, I interchanged true and false in the above statement. Verified: with 
"mapreduce.reduce.shuffle.notify.readerror" set to false, the fetch failure is 
reported only after "mapreduce.reduce.shuffle.maxfetchfailures" failures; with 
it set to true, the failure is reported immediately.
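
Stated as code, here is a minimal sketch of that verified behavior. The class 
and method names are illustrative assumptions, not the actual shuffle 
internals.

{code}
// Sketch of the semantics described above; illustrative names only.
public class ReadErrorNotification {
  /**
   * @param readError        the failed copy ended in CopyOutputErrorType.READ_ERROR
   * @param failures         fetch failures recorded so far for this map output
   * @param notifyReadError  value of mapreduce.reduce.shuffle.notify.readerror
   * @param maxFetchFailures value of mapreduce.reduce.shuffle.maxfetchfailures
   */
  static boolean shouldNotifyJobTracker(boolean readError, int failures,
      boolean notifyReadError, int maxFetchFailures) {
    if (readError && notifyReadError) {
      return true;                       // true: report a read error immediately
    }
    return failures >= maxFetchFailures; // false: read errors behave like
                                         // connection errors and are reported
                                         // only after maxfetchfailures
  }
}
{code}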


[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-08 Thread Jothi Padmanabhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787415#action_12787415
 ] 

Jothi Padmanabhan commented on MAPREDUCE-1171:
--

bq. Provide a config option whether to send a notification for read error or not

Just to clarify this a little more: currently a notification is sent for 
_every_ read error. The config option would be to disable this behavior and 
treat read errors like connection errors.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-08 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787414#action_12787414
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


One solution I could see is: "Provide a config option for whether to send a 
notification on a read error or not, and make the number of retries on a fetch 
failure (currently hardcoded to 10 on 0.21) configurable."
However, for the Yahoo! distribution, the config option for whether to send a 
notification on a read error would suffice.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-12-08 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787363#action_12787363
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


bq. In branch 0.21, MAPREDUCE-353 makes connect and read timeout configurable 
for a job. Moreover, Shuffle is simplified by MAPREDUCE-318. Essentially, 
HADOOP-3327 is no more there.

Sorry, I just verified the new shuffle code; the code for HADOOP-3327 still 
exists.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-11-18 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779371#action_12779371
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


No, they are not checked in.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-11-17 Thread dhruba borthakur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779352#action_12779352
 ] 

dhruba borthakur commented on MAPREDUCE-1171:
-

Are the fixes for this problem already checked into the Yahoo! 0.20 release?




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-11-01 Thread Christian Kunz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772443#action_12772443
 ] 

Christian Kunz commented on MAPREDUCE-1171:
---

Yes, in the absence of MAPREDUCE-318, MAPREDUCE-353 should be sufficient to 
work around the issue.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-11-01 Thread Christian Kunz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772426#action_12772426
 ] 

Christian Kunz commented on MAPREDUCE-1171:
---

Just for the record, we use a 0.20.1 Yahoo! release.
I checked that Cloudera releases have contained HADOOP-3327 as far back as 
hadoop-0.20.0+61.




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-11-01 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772424#action_12772424
 ] 

Amareshwari Sriramadasu commented on MAPREDUCE-1171:


Christian, are you using the Yahoo! distribution of 0.20?

In branch 0.21, MAPREDUCE-353 makes the connect and read timeouts configurable 
per job. Moreover, the shuffle is simplified by MAPREDUCE-318. Essentially, 
HADOOP-3327 is no longer there.

Christian, making the connect and read timeouts configurable should address 
this issue, right?




[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-10-31 Thread Christian Kunz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772300#action_12772300
 ] 

Christian Kunz commented on MAPREDUCE-1171:
---

The algorithm in hadoop-0.20 that determines when to mark a map task as having 
too many fetch failures seems to have big problems with jobs that have a few 
outliers in map-task runtimes.

The job mentioned in the description has a couple of map tasks requiring about 
3 hours of runtime (1hr application execution time and 2hrs merge phase), while 
the overall average map runtime is much smaller.
The consequence is that all reduces just wait for these few map tasks to 
finish, and these maps are declared 'failed' very quickly because of many 
socket timeout exceptions on the reduce side, although there is nothing wrong 
with the map tasks themselves.

The first attempt to run the job failed after 28hrs (we allow for up to 10 
map-task failures), with only about 200 reduces successfully fetching map 
output each time the long-running map tasks completed; these map tasks were 
declared failed less than 1 minute after completion.

We deployed an emergency fix in the ReduceTask, commenting out the portion 
where a single read error results in a report:
{noformat}
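// With the READ_ERROR clause commented out, a single read error no longer
// triggers an immediate fetch-failure report: a failure is reported only once
// noFailedFetches reaches fetchRetriesPerMap, and on every second failure
// after that.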
  
if (//cr.getError().equals(CopyOutputErrorType.READ_ERROR) ||
    ((noFailedFetches >= fetchRetriesPerMap)
     && ((noFailedFetches - fetchRetriesPerMap) % 2) == 0)) {
  synchronized (ReduceTask.this) {
    taskStatus.addFetchFailedMap(mapTaskId);
{noformat}
and it helped somewhat.
The job finished after 6 re-executions of the long-running map tasks 
(fortunately, we allow a high number of map-task failures), but still took 
18hrs instead of the optimal less than 4hrs. About 600 reduces fetched map 
output successfully before the map tasks were declared failed, always because 
of 5+ socket timeout exceptions (the otherwise advantageous hadoop-0.20 
behavior of retrying a fetch within a few seconds was counter-productive here).

With hadoop-0.18.3 this job needed less than 9hrs (not optimal, but we could 
live with it).

Takeaway:
The current algorithm is not general enough to cover all corner cases, in 
particular jobs with a handful of much longer-running map tasks.
It should be improved (there should be an allowance for socket timeout 
exceptions when a high number of reduces fetch output from a single map task 
in parallel), or there should be a configuration option to override the 
threshold on a per-job basis.
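
A minimal sketch of the kind of per-job override being asked for here, 
assuming the configuration names that the patches attached to this issue 
appear to introduce (mapreduce.reduce.shuffle.maxfetchfailures and 
mapreduce.reduce.shuffle.notify.readerror; both the names and the values below 
are assumptions, not verified against the committed code):

{noformat}
import org.apache.hadoop.mapred.JobConf;

public class FetchFailureTuningExample {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Tolerate more failed fetches per map before reporting the map to the
    // JobTracker (assumed property name).
    conf.setInt("mapreduce.reduce.shuffle.maxfetchfailures", 20);
    // Do not report a single read error as an immediate fetch failure
    // (assumed property name).
    conf.setBoolean("mapreduce.reduce.shuffle.notify.readerror", false);
    return conf;
  }
}
{noformat}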


[jira] Commented: (MAPREDUCE-1171) Lots of fetch failures

2009-10-29 Thread Qi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771769#action_12771769
 ] 

Qi Liu commented on MAPREDUCE-1171:
---

This is caused by a behavioral change in hadoop-0.20.1. In 0.18.3, each 
map-output fetch would be retried N (by default 6) times. In 0.20.1, not every 
map-output fetch gets N retries: for a particular map output, only the first 
fetch attempt gets N retries. If those N retries fail, each subsequent fetch 
attempt, even one against a different node, gets only 2 retries before 
failure. This greatly increases the chance of hitting "too many fetch 
failures".

The relevant code is in src/mapred/org/apache/hadoop/mapred/ReduceTask.java, 
lines 2090 to 2104.

I would argue that if a subsequent map-output fetch attempt targets the same 
mapper node, it should get only 2 retries. However, if the map output comes 
from a different mapper node (basically a different map attempt), it should 
still get N retries.
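
To make that proposal concrete, here is a hypothetical sketch (invented names; 
this is not the actual ReduceTask code) of the retry budget described above, 
with the 0.20.1 behavior noted in the comments:

{noformat}
// Hypothetical illustration only; ReduceTask does not expose such a method.
final class FetchRetryPolicy {
  // n is the configured retry count, by default 6.
  static int retryBudget(boolean firstAttemptForMapOutput,
                         boolean sameNodeAsLastAttempt, int n) {
    if (firstAttemptForMapOutput) {
      return n;  // first fetch of this map output: full N retries
    }
    // 0.20.1 effectively returns 2 here unconditionally; the proposal is to
    // restore the full budget when the output moved to a different node.
    return sameNodeAsLastAttempt ? 2 : n;
  }
}
{noformat}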
