WebUI shows poor locality during task scheduling

2015-04-26 Thread eric wong
Hi developers,

I have sent this to the user mailing list but got no response...

When running an experimental KMeans job, the cached RDD is the original
points data.
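
For reference, the points RDD is cached roughly like this (a minimal Scala
sketch only; the actual job is the bundled JavaKMeans example from the submit
command below, so the class and variable names here are just illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansCacheSketch"))

    // Parse the input points and cache them in memory; this is the RDD
    // whose blocks (rdd_3_*) show up in the executor logs below.
    val points = sc.textFile("hdfs://master:8000/kmeans/data-Kmeans-7g")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // 8 clusters, 1 iteration, matching the arguments in the submit command.
    val model = KMeans.train(points, 8, 1)
    println(model.clusterCenters.mkString("\n"))
    sc.stop()
  }
}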

I see poor locality in the task details on the WebUI: almost half of the
task input comes over the Network instead of from Memory.

A task with Network input takes almost the same time as a task with
Hadoop (disk) input, and differs by roughly a factor of two from a task with
Memory input, e.g.:
Task (Memory): 16s
Task (Network): 9s
Task (Hadoop): 9s


In the executor logs I see that fetching a ~30 MB cached RDD block from a
remote node costs about 5 seconds. For example, below, task 58 finds block
rdd_3_15 locally and finishes in 6 seconds, while task 60 has to fetch
rdd_3_17 remotely and takes 14 seconds end to end:

15/03/31 04:08:52 INFO CoarseGrainedExecutorBackend: Got assigned task 58
15/03/31 04:08:52 INFO Executor: Running task 15.0 in stage 1.0 (TID 58)
15/03/31 04:08:52 INFO HadoopRDD: Input split:
hdfs://master:8000/kmeans/data-Kmeans-5.3g:2013265920+134217728
15/03/31 04:08:52 INFO BlockManager: Found block rdd_3_15 locally
15/03/31 04:08:58 INFO Executor: Finished task 15.0 in stage 1.0 (TID 58).
1920 bytes result sent to driver
15/03/31 04:08:58 INFO CoarseGrainedExecutorBackend: Got assigned task 60
15/03/31 04:08:58 INFO Executor: Running task 17.0 in stage 1.0 (TID 60)
15/03/31 04:08:58 INFO HadoopRDD: Input split:
hdfs://master:8000/kmeans/data-Kmeans-5.3g:2281701376+134217728
15/03/31 04:09:02 INFO BlockManager: Found block rdd_3_17 remotely
15/03/31 04:09:12 INFO Executor: Finished task 17.0 in stage 1.0 (TID 60).
1920 bytes result sent to driver


So:

1) Does this mean I should persist the RDD with MEMORY_AND_DISK instead of
memory only? (See the sketch below.)

2) Should I expand the network capacity, or tune the scheduling locality
parameters? I set spark.locality.wait up to 15000, but it does not seem to
increase the percentage of Memory input.
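
To make the two questions concrete, here is a minimal Scala sketch of what I
mean (only the storage level and the locality setting are the points in
question; everything else is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LocalitySketch {
  def main(args: Array[String]): Unit = {
    // Question 2: raise the locality wait so the scheduler waits longer for
    // a node-local slot before falling back to a non-local (ANY) one.
    val conf = new SparkConf()
      .setAppName("LocalitySketch")
      .set("spark.locality.wait", "15000")  // milliseconds in Spark 1.1.0
    val sc = new SparkContext(conf)

    // Question 1: persist with MEMORY_AND_DISK so partitions that do not fit
    // in memory spill to local disk instead of being dropped and recomputed.
    val points = sc.textFile("hdfs://master:8000/kmeans/data-Kmeans-7g")
      .persist(StorageLevel.MEMORY_AND_DISK)

    println(points.count())  // force an action so the blocks actually get cached
    sc.stop()
  }
}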

Any suggestions would be appreciated.


--Env info---

Cluster: 4 workers, each with 1 core and 2 GB executor memory

Spark version: 1.1.0

Network: 30MB/s

-Submit shell---
bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans
--master spark://master:7077 --executor-memory 1g
lib/spark-examples-1.1.0-hadoop2.3.0.jar
hdfs://master:8000/kmeans/data-Kmeans-7g 8 1


Thanks very much, and please forgive my poor English.

-- 
Wang Haihua

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Re: WebUI shows poor locality during task scheduling

2015-04-26 Thread Patrick Wendell
Hi Eric - please direct this to the user@ list. This list is for
development of Spark itself.

On Sun, Apr 26, 2015 at 1:12 AM, eric wong win19...@gmail.com wrote:




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org