AW: Mapreduce Job fails if one Node is offline?

Mike Wenzel Mon, 24 Oct 2016 23:38:32 -0700

Hey alex,

Yes, I do use streaming and my own Perl mapper. I already read about some 
trouble with Python and the shebang line in this case, but on the first try I 
couldn’t fix my problem.


I’ll give it another look today. Thanks so far for supporting me.

Best Regards,
Mike.

From: wget.n...@gmail.com [mailto:wget.n...@gmail.com]
Sent: Monday, October 24, 2016 5:13 PM
To: Mike Wenzel <mwen...@proheris.de>; user@hadoop.apache.org
Subject: RE: Mapreduce Job fails if one Node is offline?

Hey Mike,

I only know this error from streaming jobs, and maybe with Python and R – do 
you have any of those running?
> PipeMapRed.waitOutputThreads(): subprocess failed with code
It means the mapper is not able to read the input file, which leads me to think 
that you might be using streaming?
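
One way to narrow this down is to run the mapper outside Hadoop and look at 
its exit code, since the "subprocess failed with code N" message reports the 
exit status of the streaming process. The following is only a minimal sketch: 
the STAND_IN_MAPPER command and the sample input are made up for illustration, 
and the real mapper command (for example a hypothetical ./mapper.pl) would be 
passed in its place.

```python
import subprocess
import sys

# Hypothetical stand-in for the real mapper: upper-cases stdin, then exits
# with code 2 to imitate a failing script. Replace with e.g. ["./mapper.pl"].
STAND_IN_MAPPER = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper()); sys.exit(2)",
]


def run_mapper_locally(command, sample_input):
    """Pipe sample_input to the mapper's stdin, as Hadoop Streaming does.

    Returns (exit_code, stdout_text, stderr_text).
    """
    result = subprocess.run(
        command,
        input=sample_input.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # a non-zero exit code is the thing we want to inspect
    )
    return (
        result.returncode,
        result.stdout.decode("utf-8"),
        result.stderr.decode("utf-8"),
    )


code, out, err = run_mapper_locally(STAND_IN_MAPPER, "key\tvalue")
print("exit code:", code)
```

If the script cannot be started at all (for instance because of a missing or 
wrong shebang line), subprocess.run raises an OSError instead of returning an 
exit code, which already separates that case from a mapper that starts and 
then fails.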

--alex

--
B: mapredit.blogspot.com

From: Mike Wenzel<mailto:mwen...@proheris.de>
Sent: Monday, October 24, 2016 2:53 PM
To: wget.n...@gmail.com<mailto:wget.n...@gmail.com>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: AW: Mapreduce Job fails if one Node is offline?

Hey alex,

first of all thanks for your reply.

>> Dfs replication has nothing to do with Yarn or MapReduce, it’s HDFS. 
>> Replication defines how many replicas exist in a cluster.
Okay, I got this. I mixed things up here.

>> When you kill the NM and you don’t have yarn.nodemanager.recovery.enabled 
>> (https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html)
>>  set, the containers running on that node are getting lost or killed, but 
>> your job will likely run and wait until that NM comes back.
As far as I understand it, NodeManager restart is for keeping containers 
until the NodeManager is online again, but I don’t see why this should have 
anything to do with my problem.

My problem is:
If one of my nodes is offline, my MapReduce jobs don’t finish successfully 
anymore. I don’t want to wait until it is online again. I want my jobs to run 
and finish on all my other healthy nodes instead.

I see the same behavior after enabling and configuring NM restart.
The error messages of the attempts are:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 111 at
…
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 2 at
…
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 111 at
…
Task KILL is received. Killing attempt!
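
One possible reading of the codes 111 and 2, offered as an assumption and not 
confirmed anywhere in this thread: when a Perl script ends through die, Perl 
uses the current value of $! (errno) as the exit status if it is non-zero. On 
Linux, errno 111 is ECONNREFUSED and errno 2 is ENOENT, so the codes may point 
to a refused connection and a missing file. The sketch below just looks the 
numbers up on the local platform; describe_exit_code is a hypothetical helper 
name, and errno numbering differs between operating systems.

```python
import errno
import os


def describe_exit_code(code):
    """Interpret a subprocess exit code as an errno value, if one matches."""
    name = errno.errorcode.get(code)
    if name is None:
        return "exit code %d: no matching errno on this platform" % code
    return "exit code %d: %s (%s)" % (code, name, os.strerror(code))


# The two codes from the failed attempts above.
for exit_code in (111, 2):
    print(describe_exit_code(exit_code))
```

A mapper that calls exit with its own status codes would make this lookup 
meaningless, so the result is a hint for where to look in the task logs, not 
a diagnosis.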

If my problem is not clear, please let me know. If I should post some logs, just 
let me know which ones you’re looking for.

Thanks in advance,
And best regards.
-- Mike

From: wget.n...@gmail.com<mailto:wget.n...@gmail.com> 
[mailto:wget.n...@gmail.com]
Sent: Friday, October 21, 2016 11:53 AM
To: Mike Wenzel <mwen...@proheris.de<mailto:mwen...@proheris.de>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: RE: Mapreduce Job fails if one Node is offline?

Hey Mike,

Dfs replication has nothing to do with Yarn or MapReduce, it’s HDFS. Replication 
defines how many replicas exist in a cluster.
When you kill the NM and you don’t have yarn.nodemanager.recovery.enabled 
(https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html)
 set, the containers running on that node are getting lost or killed, but your 
job will likely run and wait until that NM comes back.

http://hortonworks.com/blog/resilience-of-yarn-applications-across-nodemanager-restarts/
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_ha_yarn_work_preserving_recovery.html

--alex

--
B: mapredit.blogspot.com

From: Mike Wenzel<mailto:mwen...@proheris.de>
Sent: Friday, October 21, 2016 11:29 AM
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Mapreduce Job fails if one Node is offline?

I have a small cluster for testing and learning Hadoop:

Node1 - Namenode + ResourceManager + JobhistoryServer
Node2 - SecondaryNamenode
Node3 - Datanode + NodeManager
Node4 - Datanode + NodeManager
Node5 - Datanode + NodeManager

My dfs.replication is set to 2.

When I kill the DataNode and NodeManager processes on Node5, I expect Hadoop 
to keep running and finish my MapReduce jobs successfully.
In reality the job fails because it tries to transfer blocks to Node5, which is 
offline. Replication is set to 2, so I expect it to see that Node5 is offline 
and work only with the other two nodes.
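
The expectation on the storage side can be illustrated with a toy model. HDFS 
keeps the replicas of a block on different DataNodes, so with dfs.replication 
set to 2 the loss of a single DataNode should still leave one live replica of 
every block. The block names and the placement below are made up for 
illustration and are not taken from the real cluster.

```python
def readable_blocks(placements, offline):
    """Return the block IDs that still have a replica on a live node."""
    return sorted(
        block
        for block, nodes in placements.items()
        if set(nodes) - set(offline)
    )


# Hypothetical placement: three blocks, replication factor 2, spread over
# the three DataNodes of the test cluster.
placements = {
    "blk_1": ["Node3", "Node4"],
    "blk_2": ["Node4", "Node5"],
    "blk_3": ["Node3", "Node5"],
}

print(readable_blocks(placements, {"Node5"}))
print(readable_blocks(placements, {"Node4", "Node5"}))
```

With only Node5 offline every block stays readable in this model; blocks only 
become unavailable once both nodes holding their replicas are down. The model 
says nothing about where YARN schedules containers, which is a separate 
question from block availability.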

Can someone please explain to me how Hadoop should work in this case?
If my expectation of Hadoop is correct and someone would like to help me out, I 
can add logs and configuration.

Best Regards,
Mike.
