[ 
https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896854#comment-15896854
 ] 

Huangkaixuan edited comment on YARN-6289 at 3/6/17 8:02 AM:
------------------------------------------------------------

The experiment details:
7 node cluster (1 master, 6 data nodes/node managers)
HostName   Simple37   Simple27   Simple28   Simple30   Simple31   Simple32   Simple33
Role       Master     Node1      Node2      Node3      Node4      Node5      Node6

Configure HDFS with replication factor 2
The file has a single block in HDFS
Configure Spark to use dynamic allocation
Configure YARN with both the MapReduce shuffle service and the Spark shuffle service (a configuration sketch follows these steps)
Add a single small file (a few bytes) to HDFS
Run wordcount on the file (using Spark/MapReduce)
Check whether the single task of the map stage was scheduled on a node holding the data
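
For reference, the settings behind the configuration steps above look roughly as follows. This is only a minimal sketch using the standard Hadoop 2.7.1 / Spark 1.6.2 property names; only the settings mentioned in the steps are shown, and the exact files/values used on the actual cluster may differ.

hdfs-site.xml:
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

yarn-site.xml:
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>

spark-defaults.conf:
  spark.dynamicAllocation.enabled   true
  spark.shuffle.service.enabled     true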

Results of experiment one (run 10 times):
7 node cluster (1 master, 6 data nodes/node managers), 2x replication, 1 block file, MapReduce wordcount

Round No.   Data location   Scheduled node   Hit   Time cost
1           Node3/Node4     Node6            No    20s
2           Node5/Node3     Node6            No    17s
3           Node3/Node5     Node1            No    21s
4           Node2/Node3     Node6            No    18s
5           Node1/Node2     Node1            Yes   15s
6           Node4/Node5     Node3            No    19s
7           Node2/Node3     Node2            Yes   14s
8           Node1/Node4     Node5            No    16s
9           Node1/Node6     Node6            Yes   15s
10          Node3/Node5     Node4            No    17s


7 node cluster (1 master, 6 data nodes/node managers), 2x replication, 1 block file, Spark wordcount

Round No.   Data location   Scheduled node   Hit   Time cost
1           Node3/Node4     Node4            Yes   24s
2           Node2/Node3     Node5            No    30s
3           Node3/Node5     Node4            No    35s
4           Node2/Node3     Node2            Yes   24s
5           Node1/Node2     Node4            No    26s
6           Node4/Node5     Node2            No    25s
7           Node2/Node3     Node4            No    27s
8           Node1/Node4     Node1            Yes   22s
9           Node1/Node6     Node2            No    23s
10          Node1/Node2     Node4            No    33s
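
Summarizing the two tables: both frameworks placed the single map task on a node holding the block in only 3 of the 10 rounds. A short Python sketch of that summary arithmetic (the per-round values are copied from the tables above; the script itself is only illustrative):

  # Per-round (node_local, elapsed_seconds) pairs, copied from the tables above.
  mapreduce_rounds = [(False, 20), (False, 17), (False, 21), (False, 18), (True, 15),
                      (False, 19), (True, 14), (False, 16), (True, 15), (False, 17)]
  spark_rounds = [(True, 24), (False, 30), (False, 35), (True, 24), (False, 26),
                  (False, 25), (False, 27), (True, 22), (False, 23), (False, 33)]

  for name, rounds in (("MapReduce", mapreduce_rounds), ("Spark", spark_rounds)):
      hits = sum(1 for local, _ in rounds if local)
      avg_time = sum(t for _, t in rounds) / float(len(rounds))
      # Prints 3/10 node-local for both; average 17.2s (MapReduce) and 26.9s (Spark).
      print("%s: %d/%d node-local, average %.1fs" % (name, hits, len(rounds), avg_time))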


was (Author: huangkx6810):
Experiment1:
       7 node Hadoop cluster (1 master, 6 data nodes/node managers)
       HostName   Simple37   Simple27   Simple28   Simple30   Simple31   Simple32   Simple33
       Role       Master     Node1      Node2      Node3      Node4      Node5      Node6
       Configure HDFS with replication factor 2
       File has a single block in HDFS
       Configure Spark to use dynamic allocation
       Configure Yarn for both mapreduce shuffle service and Spark shuffle 
service
       Add a single small file (few bytes) to HDFS
       Run wordcount on the file (using Spark/MapReduce)
       Inspect if the single task for the map stage was scheduled on the node 
with the data
  
The results are shown in the web UI as follows:
[web UI screenshots not reproduced in this mail]
Result1:
7 node cluster (1 master, 6 data nodes/node managers), 2x replication, 1 block file
MapReduce wordcount

Times   Data location   Scheduled node   Hit   Time
1       Node3/Node4     Node6            No    20s
2       Node5/Node3     Node6            No    17s
3       Node3/Node5     Node1            No    21s
4       Node2/Node3     Node6            No    18s
5       Node1/Node2     Node1            Yes   15s
6       Node4/Node5     Node3            No    19s
7       Node2/Node3     Node2            Yes   14s
8       Node1/Node4     Node5            No    16s
9       Node1/Node6     Node6            Yes   15s
10      Node3/Node5     Node4            No    17s

7 node cluster (1 master, 6 data nodes/node managers), 2x replication, 1 block file
Spark wordcount

Times   Data location   Scheduled node   Hit   Time
1       Node3/Node4     Node4            Yes   24s
2       Node2/Node3     Node5            No    30s
3       Node3/Node5     Node4            No    35s
4       Node2/Node3     Node2            Yes   24s
5       Node1/Node2     Node4            No    26s
6       Node4/Node5     Node2            No    25s
7       Node2/Node3     Node4            No    27s
8       Node1/Node4     Node1            Yes   22s
9       Node1/Node6     Node2            No    23s
10      Node1/Node2     Node4            No    33s

> yarn got little data locality
> -----------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacity scheduler
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread 
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2   Hadoop-2.7.1 
>            Reporter: Huangkaixuan
>            Priority: Minor
>
> When I ran this experiment with both Spark and MapReduce wordcount on the 
> file, I noticed that the job did not get data locality every time. It was 
> seemingly random in the placement of the tasks, even though there is no other 
> job running on the cluster. I expected the task placement to always be on the 
> single machine which is holding the data block, but that did not happen.


