Hi Костарев, 
I tried to reproduce your case on my 5-node setup (2 nodes in dc1/rack1, 
1 node in dc1/rack2 and 2 nodes in dc2/rack2) but didn't see anything unusual. 
In my test, even with 3 replicas, a job with 150 map tasks was distributed 
across all nodes, regardless of datacenter. 
Can you try again with a job that has more map tasks? Scheduling is fairly 
random when a job only has 10 map tasks. In your case, it seems b1, b3 
and b2 each take 3-4 maps in one heartbeat, which is quite normal. Let me 
know the distribution and version you are using if it still doesn't work with 
more tasks. 
By the way, you can find the history log for each job on the application page, 
can't you? 
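To see why a 10-map job can look lopsided, here is a toy simulation of heartbeat-driven assignment (this is an illustration of the idea only, not the actual RM scheduler; the nodes list and the 3-4 tasks-per-heartbeat figure are taken from the discussion above, everything else is made up):

```python
# Toy model: the RM hands out a few containers per node heartbeat, so a
# small job can be fully consumed before some nodes ever heartbeat.
import itertools

def assign(num_tasks, nodes, tasks_per_heartbeat=4):
    """Hand out tasks a few at a time, cycling through node heartbeats."""
    remaining = num_tasks
    counts = {n: 0 for n in nodes}
    for node in itertools.cycle(nodes):
        if remaining == 0:
            break
        take = min(tasks_per_heartbeat, remaining)
        counts[node] += take
        remaining -= take
    return counts

nodes = ["a1", "a2", "b1", "b2", "b3"]
print(assign(10, nodes))    # the first few heartbeating nodes drain the queue
print(assign(150, nodes))   # with many tasks, every node gets a share
```

With 10 tasks the last nodes to heartbeat get nothing, while with 150 tasks the same mechanism spreads work over all five nodes.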

Thanks, 

Junping 

----- Original Message -----

From: "Костарев А.Ф." <[email protected]> 
To: [email protected], [email protected] 
Sent: Friday, July 12, 2013 11:42:14 AM 
Subject: Re: Algorithm of distribution Map and Reduce tasks at various topology 
of a network 

On 07/09/2013 05:36 PM, "Костарев А.Ф." wrote: 
Hi Junping! 

We have launched MapReduce tasks in YARN cluster. Its topology is described in 
file topology.data. All files are here: 
https://drive.google.com/folderview?id=0B1dEBvIR3qbpeFVxR0R6UXhyNEU&usp=sharing 
And we found difference between MRv1 and YARN. In usual MapReduce tasks were 
executed everywhere, but in YARN they are executed only within one datacenter. 
It's shown on screenshots. Input file had replication factor 5. 
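For context, Hadoop resolves hosts to racks through a user-supplied topology script (the file named by net.topology.script.file.name); a minimal sketch of such a script follows. The hostnames and rack paths here are illustrative assumptions, not taken from the attached topology.data:

```python
#!/usr/bin/env python3
# Minimal topology-script sketch: Hadoop invokes it with one or more
# host names/IPs as arguments and expects one rack path per line.
import sys

# Hypothetical mapping; the real one would mirror topology.data.
RACKS = {
    "a1": "/dc1/rack1",
    "a2": "/dc1/rack1",
    "b1": "/dc2/rack1",
    "b2": "/dc2/rack1",
    "b3": "/dc2/rack2",
}

def resolve(hosts):
    # Unknown hosts fall back to the conventional default rack.
    return [RACKS.get(h, "/default-rack") for h in hosts]

if __name__ == "__main__":
    print("\n".join(resolve(sys.argv[1:])))
```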

So, is it possible to run one job not only in one datacenter but on all 
servers of the cluster at the same time? 

One more question. In MapReduce, the output directory in HDFS contained 
_SUCCESS, log, and part-r-xxxx. Now I can't find the log. Is it in another 
place, or is there no such file in YARN? 




Thank you for your prompt response. 

We will try to repeat the test on Thursday and show more details. 


On 07/09/2013 05:18 PM, Jun Ping Du wrote: 

<blockquote>
Hi Костарев, 
I think it should work in YARN, even though YARN doesn't yet support layers 
above rack (I am actually working on supporting multi-layer topologies for 
YARN in YARN-18). 
Current YARN should simply recognize your topology as three racks: "dc1/rack1", 
"dc2/rack1", "dc2/rack2". Each node (NM) with free resources should be assigned 
containers in its heartbeat with the RM, regardless of locality level. The 
only exceptions should be: 1. there are no pending resource requests; 2. the 
NM capacity is too small to meet the resource request; 3. delay scheduling is 
enabled and there is no data-local attempt. In your case, I don't see anything 
stopping task assignment on a1 and a2. Anyone here can correct me if I'm 
misunderstanding something. :) 
Anyway, I will give it a try later (with your configuration) to see whether 
there are bugs in boundary cases or some misconfiguration. Which minor 
version (2.0.x or trunk) are you using now? 
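Exception 3 (delay scheduling) can be sketched as follows. This is a toy version of the idea only, not YARN's actual scheduler code; the function name, the rack-set representation, and the max_delay value are all made up for illustration:

```python
# Toy delay-scheduling check: a scheduler may skip a non-local assignment
# for a bounded number of missed heartbeats, hoping a data-local node
# heartbeats first, then relax locality and assign anyway.
def should_assign(node_rack, preferred_racks, missed_heartbeats, max_delay=3):
    if node_rack in preferred_racks:
        return True                        # rack/data-local: assign immediately
    return missed_heartbeats >= max_delay  # waited long enough: relax locality

# Local node is assigned at once; a remote node only after the delay expires.
assert should_assign("/dc1/rack1", {"/dc1/rack1"}, 0)
assert not should_assign("/dc2/rack2", {"/dc1/rack1"}, 1)
assert should_assign("/dc2/rack2", {"/dc1/rack1"}, 3)
```

The point for this thread: with delay scheduling enabled and no data-local attempts available, assignment to a node can be postponed, which is why it is listed as an exception above.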

Thanks, 

Junping 

----- Original Message ----- 
From: "Костарев А.Ф." <[email protected]> 
To: [email protected] 
Sent: Tuesday, July 9, 2013 5:48:49 PM 
Subject: Algorithm of distribution Map and Reduce tasks at various topology of 
a network 

Hi, 
I have a cluster in two datacenters: 

            CLUSTER 
               | 
     +---------+---------+ 
     |                   | 
datacenter1         datacenter2 
     |                   | 
   rack1           +-----+-----+ 
     |             |           | 
    +-a1         rack1       rack2 
    +-a2           |           | 
                  +-b1        +-b3 
                  +-b2 


The cluster has a file with replication factor 5. 
All of the file's blocks reside on all servers of the cluster. 
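If it helps to double-check the placement, the standard HDFS commands below show the replication and the rack of every replica (run against the live cluster; the path is a placeholder):

```shell
# Set (and wait for) replication factor 5 on the input file
hdfs dfs -setrep -w 5 /path/to/input

# List each block's replicas, including which rack they live on
hdfs fsck /path/to/input -files -blocks -racks
```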

When I run a job with standard MapReduce (MRv1) (launched on b1), Map and 
Reduce tasks run on all servers: b1, b2, b3, a1, a2. 
When I run it with YARN (MRv2) (launched on b1), Map and Reduce tasks run 
only on b1, b2, b3. 

Can I make YARN run Map tasks on all servers? 







</blockquote>


-- 
Consultant, 1st category
Костарев А.Ф. 
