Re: No. of Task vs No. of Executors

2015-07-21 Thread shahid ashraf
Thanks All!

Thanks, Ayan!

I repartitioned to 20, so the job used all the cores in the cluster and
finished in 3 minutes. It seems the data was skewed towards that one partition.
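Roughly, the change looked like this (a minimal sketch in the spark-shell with
the predefined `sc`; the path is illustrative, not from the real job):

    // Re-split the 9 HDFS-derived partitions into 20 so every core gets work.
    val input = sc.textFile("hdfs:///path/to/data")   // illustrative path
    val balanced = input.repartition(20)              // full shuffle; spreads out the skewed partition
    balanced.count()                                  // stands in for the real computation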



On Tue, Jul 14, 2015 at 8:05 PM, ayan guha guha.a...@gmail.com wrote:

 Hi,

 As you can see, Spark has taken data locality into consideration and thus
 scheduled all tasks as NODE_LOCAL. Because Spark could run each task on a
 node where its data is present, it went ahead and scheduled the tasks there,
 which is actually good for reading. If you really want to fan out the
 processing, you may do a repartition(n).
 Regarding the slowness: as you can see, another task completed successfully
 in 6 minutes on executor ID 2, so it does not seem that the node itself is
 slow. It is possible that the computation for one node is skewed. You may
 want to switch on speculative execution to see whether the same task gets
 completed faster on another node. If yes, it is a node issue; otherwise it
 is most likely a data issue.

 On Tue, Jul 14, 2015 at 11:43 PM, shahid sha...@trialx.com wrote:

 Hi,

 I have a 10-node cluster. I loaded the data onto HDFS, so the number of
 partitions I get is 9. I am running a Spark application and it gets stuck
 on one of the tasks; looking at the UI, it seems the application is not
 using all nodes for the calculations. Attached is a screenshot of the
 tasks; it seems tasks are placed on some nodes more than once. Looking at
 the tasks, 8 of them complete in under 7-8 minutes while one task takes
 around 30 minutes, causing the delay in the results.
 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png
 







 --
 Best Regards,
 Ayan Guha




-- 
with Regards
Shahid Ashraf


Re: No. of Task vs No. of Executors

2015-07-18 Thread Gylfi
You could even try changing the block size of the input data on HDFS (this can
be done on a per-file basis); that would get all workers going right from the
get-go in Spark.
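For illustration, a rough sketch of that idea, assuming the data is rewritten
through Spark (it can equally be done with the HDFS CLI at upload time); the
paths and the 32 MB figure are only examples:

    // Rewrite the input with a smaller per-file HDFS block size so the next
    // read produces more input splits (spark-shell, `sc` predefined).
    sc.hadoopConfiguration.set("dfs.blocksize", (32 * 1024 * 1024).toString)  // 32 MB blocks
    sc.textFile("hdfs:///data/original")
      .saveAsTextFile("hdfs:///data/smaller-blocks")

    // Reading the rewritten data should now yield roughly (total size / 32 MB) partitions.
    val rdd = sc.textFile("hdfs:///data/smaller-blocks")
    println(rdd.partitions.length)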






Re: No. of Task vs No. of Executors

2015-07-18 Thread David Mitchell
This is likely due to data skew. If you are using key-value pairs, one key has
many more records than the other keys. Do you have any groupBy operations?
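A quick way to check is to count the records per key before the groupBy. A
hedged sketch (spark-shell style; the pairs RDD below is synthetic and stands
in for your real key-value data):

    // Count records per key and print the heaviest keys.
    val pairs = sc.parallelize(Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 4, "c" -> 5))
    val keyCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
    keyCounts
      .map { case (key, count) => (count, key) }
      .sortByKey(ascending = false)
      .take(10)
      .foreach { case (count, key) => println(s"$key -> $count records") }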

David


On Tue, Jul 14, 2015 at 9:43 AM, shahid sha...@trialx.com wrote:

 Hi,

 I have a 10-node cluster. I loaded the data onto HDFS, so the number of
 partitions I get is 9. I am running a Spark application and it gets stuck
 on one of the tasks; looking at the UI, it seems the application is not
 using all nodes for the calculations. Attached is a screenshot of the
 tasks; it seems tasks are placed on some nodes more than once. Looking at
 the tasks, 8 of them complete in under 7-8 minutes while one task takes
 around 30 minutes, causing the delay in the results.
 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png
 







-- 
### Confidential e-mail, for recipient's (or recipients') eyes only, not
for distribution. ###


No. of Task vs No. of Executors

2015-07-14 Thread shahid
Hi,

I have a 10-node cluster. I loaded the data onto HDFS, so the number of
partitions I get is 9. I am running a Spark application and it gets stuck on
one of the tasks; looking at the UI, it seems the application is not using
all nodes for the calculations. Attached is a screenshot of the tasks; it
seems tasks are placed on some nodes more than once. Looking at the tasks, 8
of them complete in under 7-8 minutes while one task takes around 30 minutes,
causing the delay in the results.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png
 






Re: No. of Task vs No. of Executors

2015-07-14 Thread ayan guha
Hi,

As you can see, Spark has taken data locality into consideration and thus
scheduled all tasks as NODE_LOCAL. Because Spark could run each task on a
node where its data is present, it went ahead and scheduled the tasks there,
which is actually good for reading. If you really want to fan out the
processing, you may do a repartition(n).
Regarding the slowness: as you can see, another task completed successfully
in 6 minutes on executor ID 2, so it does not seem that the node itself is
slow. It is possible that the computation for one node is skewed. You may
want to switch on speculative execution to see whether the same task gets
completed faster on another node. If yes, it is a node issue; otherwise it is
most likely a data issue.
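For reference, a hedged sketch of turning speculation on (spark.speculation is
a standard Spark setting; the application name below is only illustrative):

    // At submit time:
    //   spark-submit --conf spark.speculation=true ...
    // Or programmatically in a submitted application, before creating the SparkContext:
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("skew-vs-slow-node")        // illustrative name
      .set("spark.speculation", "true")       // relaunch stragglers on other nodes
    val sc = new SparkContext(conf)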

On Tue, Jul 14, 2015 at 11:43 PM, shahid sha...@trialx.com wrote:

 Hi,

 I have a 10-node cluster. I loaded the data onto HDFS, so the number of
 partitions I get is 9. I am running a Spark application and it gets stuck
 on one of the tasks; looking at the UI, it seems the application is not
 using all nodes for the calculations. Attached is a screenshot of the
 tasks; it seems tasks are placed on some nodes more than once. Looking at
 the tasks, 8 of them complete in under 7-8 minutes while one task takes
 around 30 minutes, causing the delay in the results.
 
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n23824/Screen_Shot_2015-07-13_at_9.png
 







-- 
Best Regards,
Ayan Guha