Hi,

My Spark Streaming application receives data from one Kafka topic (one partition), and the resulting RDD has 30 partitions.
However, the scheduler assigns all tasks to executors running on the same host (the one where the Kafka topic partition was created), with NODE_LOCAL locality level. Below are the logs:

16/05/06 11:21:38 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:21:38 DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1
16/05/06 11:21:38 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY
16/05/06 11:21:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m04.novalocal, partition 0,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m04.novalocal, partition 1,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m04.novalocal, partition 2,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,NODE_LOCAL, 2248 bytes)
16/05/06 11:21:38 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,NODE_LOCAL, 2248 bytes)

I started seeing this behavior after upgrading Spark from 1.5 to 1.6; the same application distributed the RDD partitions evenly across executors in Spark 1.5. As suggested on some Spark developer blogs, I tried setting spark.shuffle.reduceLocality.enabled=false, and with that setting the RDD partitions are distributed across executors on all hosts, with PROCESS_LOCAL locality level.
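For reference, this is roughly how I pass the flag at submit time (a sketch only; the application class name, jar name, and master setting below are placeholders, not my actual deployment):

```shell
# Sketch: disable reduce-task locality preference at submit time.
# Note: spark.shuffle.reduceLocality.enabled is an internal/undocumented
# flag in Spark 1.6, which may be why it is absent from the public
# configuration documentation. Class and jar names are placeholders.
spark-submit \
  --master yarn \
  --conf spark.shuffle.reduceLocality.enabled=false \
  --class com.example.StreamingApp \
  streaming-app.jar
```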
Below are the logs:

16/05/06 11:24:46 INFO YarnScheduler: Adding task set 1.0 with 30 tasks
16/05/06 11:24:46 DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NO_PREF, ANY
16/05/06 11:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ivcp-m02.novalocal, partition 0,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, ivcp-m01.novalocal, partition 1,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, ivcp-m06.novalocal, partition 2,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, ivcp-m04.novalocal, partition 3,PROCESS_LOCAL, 2248 bytes)
16/05/06 11:24:46 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, ivcp-m04.novalocal, partition 4,PROCESS_LOCAL, 2248 bytes)

Is the above configuration the correct solution to this problem? And why is spark.shuffle.reduceLocality.enabled not mentioned in the Spark configuration documentation?

Regards,
Prateek

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-RDD-Partitions-not-distributed-evenly-to-executors-tp26911.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.