Hey, I was reading the Berkeley paper "Spark: Cluster Computing with Working Sets" and came across a sentence that is bothering me. I am currently trying to run a Python script on Spark that executes a parallel k-means. My problem is that after the algorithm finishes working with the dataset (ca. 50 s), Spark seems to need the rest of the time (ca. 7 min) to collect all the data. The Berkeley paper mentions that Spark does not support parallel collection. Is that really the case?
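To make the question more concrete, here is a simplified sketch of what the end of my script looks like (the names and data are made up for illustration, not my actual code); the collect() at the end is where the ~7 minutes seem to go:

from pyspark import SparkContext

sc = SparkContext(appName="kmeans-collect-sketch")

# pretend "points" is the final RDD of (cluster_id, point) pairs
# after the k-means iterations have converged
points = sc.parallelize([(i % 10, [float(i), float(i) * 2.0])
                         for i in range(100000)])

# this is the step that takes minutes for me: everything is funneled
# back to the driver in one place
results = points.collect()
print(len(results))

# would writing out to storage instead avoid the serial collection? e.g.:
# points.saveAsTextFile("hdfs:///tmp/kmeans-output")

sc.stop()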
If there is a way to make this run faster in Spark, please tell me how, because I have another problem: Spark is not really responding to my configuration changes. I ran over 25 tests with different settings for executor.memory, task.cpus and akka.threads, but nothing changed (configurations from 2-62 g of RAM, 4-912 cpus and 4-912 threads). I also read that you cannot run more than 1 executor per node while Spark is running in standalone mode. Do I really need to run Spark on YARN to get more than 1 executor per node? If so, does anyone have a tutorial on how to install YARN and run Spark on top of it?

Thank you for your help,
best
makevnin
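P.S. For reference, a simplified sketch of how I passed the configuration in one of the test runs (the values shown are just one example; the actual numbers varied across the ~25 runs, and none of the combinations changed the runtime for me):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kmeans-config-sketch")
        .set("spark.executor.memory", "32g")   # tried 2g up to 62g
        .set("spark.task.cpus", "4")           # tried 4 up to 912
        .set("spark.akka.threads", "32"))      # tried 4 up to 912

sc = SparkContext(conf=conf)
# ... k-means code runs here ...
sc.stop()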