Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?
Hi Simon,

Thanks. I did actually have "SPARK_WORKER_CORES=8" in spark-env.sh; it's commented as 'to set the number of cores to use on this machine'. I'm not sure how this would interplay with SPARK_EXECUTOR_INSTANCES and SPARK_EXECUTOR_CORES, but I removed it and still see no scale-up with increasing cores. Nothing else is set in spark-env.sh.

However, your email has drawn my attention to the comments in spark-env.sh, which indicate that SPARK_EXECUTOR_INSTANCES and SPARK_EXECUTOR_CORES are only read for YARN configurations. Based also on what is listed under "Options for the daemons used in the standalone deploy mode", I guess the standalone thing to do would be to use:

# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node

But as I'm running locally, and there is a separate comment and section for "Options read when launching programs locally with ./bin/run-example or ./bin/spark-submit", I don't believe the daemon settings would be read for my setup. In fact I just tried switching to SPARK_WORKER_CORES and SPARK_WORKER_INSTANCES and the cores don't scale, so it's probably using all cores available on the machine, and I don't have control of executors and cores/executor when running local.

Lan's comments here: http://stackoverflow.com/questions/24696777/what-is-the-relationship-between-workers-worker-instances-and-executors mention the standalone cluster manager; I had assumed they would also apply to a large local machine.

Will I be able to control executors and cores/executor in future versions of Spark? Are there any plans for this?
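For reference, the only parallelism knob I've found that local mode does respect is the thread count in the master URL. A sketch of what I mean (this is my understanding from the 1.6 docs, so treat the comments as assumptions rather than a definitive answer):

```shell
# Local mode runs everything in a single JVM: one "executor" backed by
# N worker threads, where N comes from the master URL. Settings like
# spark.executor.cores / SPARK_EXECUTOR_CORES appear to be ignored here.
spark-submit --master local[4] \
  --driver-memory 9g \
  --class asap.examples.mllib.TfIdfExample \
  ml-operators_2.10-1.0.jar
# Note: in local mode the executor lives inside the driver process, so
# --driver-memory (not spark.executor.memory) controls the heap.
```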
Please let me know if my current understanding of what's possible in Spark local mode is incorrect.

Many thanks,
Karen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Correct-way-of-setting-executor-numbers-and-executor-cores-in-Spark-1-6-1-for-non-clustered-mode-tp26894p26896.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?
Hi,

I'm running Spark 1.6.1 on a single machine, initially a small one (8 cores, 16GB RAM), using "--master local[*]" with spark-submit, and I'm trying to see scaling with increasing cores, unsuccessfully.

Initially I'm setting SPARK_EXECUTOR_INSTANCES=1 and increasing the cores for each executor. I'm setting cores per executor either with "SPARK_EXECUTOR_CORES=1" (up to 4) or with --conf "spark.executor.cores=1 spark.executor.memory=9g". I'm repartitioning the RDD of the large dataset into 4/8/10 partitions for different runs.

Am I setting executors/cores correctly for running Spark 1.6 locally/standalone mode? The logs show the same overall timings for the key stages (within a stage the number of tasks matches the data partitioning value) whether I set 1, 4 or 8 cores per executor, and the process table looks like the requested cores aren't being used. I know e.g. "--num-executors=X" is only an argument for YARN. I can't find instructions in one place for setting these parameters (executors/cores) for Spark running on one machine.

An example of my full spark-submit command is:

SPARK_EXECUTOR_INSTANCES=1 SPARK_EXECUTOR_CORES=4 spark-submit --master local[*] --conf "spark.executor.cores=4 spark.executor.memory=9g" --class asap.examples.mllib.TfIdfExample /home/ubuntu/spark-1.6.1-bin-hadoop2.6/asap_ml/target/scala-2.10/ml-operators_2.10-1.0.jar

The settings are duplicated here, but it shows the different ways I've been setting the parameters.

Thanks,
Karen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Correct-way-of-setting-executor-numbers-and-executor-cores-in-Spark-1-6-1-for-non-clustered-mode-tp26894.html
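One workaround for the lack of executor control in local mode is to run a small standalone cluster on the same machine and submit to it instead of to local[*]. A sketch, assuming the 1.6.1 binary distribution layout (the "should" in the final comment is my reading of the standalone scheduling docs, not something I've verified on this workload):

```shell
# Start a standalone master and one worker on this machine.
./sbin/start-master.sh
./sbin/start-slave.sh spark://$(hostname):7077

# Submit against the standalone master instead of local[*].
spark-submit --master spark://$(hostname):7077 \
  --executor-memory 9g \
  --total-executor-cores 4 \
  --conf spark.executor.cores=1 \
  --class asap.examples.mllib.TfIdfExample \
  ml-operators_2.10-1.0.jar
# With spark.executor.cores=1 and 4 total cores, the standalone master
# should launch 4 single-core executors on the worker, giving explicit
# control over both executor count and cores per executor.
```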
Spark MLLib benchmarks
Hi,

I'm benchmarking Spark (1.6) MLlib TF-IDF (with HDFS) on a 20GB dataset, and not seeing much scale-up when I increase cores/executors/RAM according to the Spark tuning documentation. I suspect I'm missing a trick in my configuration.

I'm running on a shared-memory machine (96 cores, 256GB RAM) and testing various combinations of:

- Number of executors (1, 2, 4, 8)
- Number of cores per executor (1, 2, 4, 8, 12, 24)
- Memory per executor (calculated as per the Cloudera recommendations)

of course in line with the combined resource limits. I'm also setting the RDD partitioning number to 2, 4, 6, 8 (I see the best results at 4 partitions, about 5% better than the worst case). I have also varied/switched the following settings:

- Using the Kryo serializer
- Setting driver memory
- Setting compressed oops
- Dynamic scheduling
- Trying different storage levels for persisting RDDs

As we up the cores in the best of these configurations we still see a running time of 19-20 minutes. Is there anything else I should be configuring to get better scale-up? Are there any documented TF-IDF benchmark results that I could compare against to validate, even if only very approximately and indirectly?

Any advice would be much appreciated.

Thanks,
Karen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLib-benchmarks-tp26878.html
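For readers wondering what "calculated as per the Cloudera recommendations" amounts to, the usual per-executor memory arithmetic looks roughly like this (plain Python; the 10% OS reserve and 7% JVM-overhead fractions are illustrative assumptions in the spirit of the common tuning guides, not numbers from this thread):

```python
def executor_heap_gb(total_ram_gb, num_executors,
                     os_fraction=0.10, overhead_fraction=0.07):
    """Rough executor heap size: reserve some RAM for the OS, split the
    rest across executors, then leave a slice of each executor's share
    for off-heap / JVM overhead. Fractions here are assumptions."""
    usable = total_ram_gb * (1 - os_fraction)
    per_executor = usable / num_executors
    return int(per_executor * (1 - overhead_fraction))

# e.g. a 256 GB machine split across 8 executors -> 26 GB of heap each
heap = executor_heap_gb(256, 8)
```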
Re: Pagerank implementation
Hiya,

I too am looking for a PageRank solution in GraphX where the probabilities sum to 1. I tried a few modifications, including dividing by the total number of vertices in the first part of the equation, as well as trying to return the full rank instead of the delta (though not correctly, as evident from an exception at runtime).

Tom, did you manage to make a version which sums to 1? Could you possibly divulge the changes if so?

Also, I'm interested to know whether the algorithm handles the case where there are no outgoing links from a node. Does it avoid unfairness with sinks? I'm new to Scala (and Spark). I had a look at the code and don't see that it does, but I could be missing something.

Thanks,
Karen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pagerank-implementation-tp19013p20687.html
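For anyone following along, the sum-to-1 behaviour asked about here can be illustrated with a toy power-iteration PageRank in plain Python (this is not the GraphX implementation; redistributing the dangling mass uniformly is one common way to handle sinks, and it is what keeps the total at exactly 1 each iteration):

```python
def pagerank(links, n_iter=50, d=0.85):
    """links: dict mapping node -> list of outgoing neighbours.
    Returns ranks that sum to 1, treating nodes with no outgoing
    links (sinks) by spreading their rank uniformly over all nodes."""
    nodes = set(links)
    for outs in links.values():
        nodes.update(outs)
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(n_iter):
        # Rank sitting on dangling nodes would otherwise leak away.
        dangling = sum(ranks[v] for v in nodes if not links.get(v))
        # Teleport term plus the redistributed dangling mass.
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = d * ranks[v] / len(outs)
                for w in outs:
                    new[w] += share
        ranks = new
    return ranks

# Tiny graph with a sink: "c" has no outgoing links.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": []})
```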
Re: Spark 1.1.1, Hadoop 2.6 - Protobuf conflict
I had this problem also with Spark 1.1.1. At the time I was using Hadoop 0.20. To get around it I installed Hadoop 2.5.2 and set protobuf.version to 2.5.0 in the build command, like so:

mvn -Phadoop-2.5 -Dhadoop.version=2.5.2 -Dprotobuf.version=2.5.0 -DskipTests clean package

So I changed Spark's pom.xml to read protobuf.version from the command line. If I didn't explicitly set protobuf.version, it picked up an older version that existed somewhere on my filesystem.

Karen

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-1-Hadoop-2-6-Protobuf-conflict-tp20656p20658.html
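Concretely, the pom.xml change amounts to making the protobuf version a Maven property with a default that -Dprotobuf.version on the command line can override. Roughly like this (a sketch only: the default shown and the exact dependency block are illustrative, not copied from Spark's actual pom):

```xml
<properties>
  <!-- Illustrative default; overridden by -Dprotobuf.version=2.5.0 -->
  <protobuf.version>2.4.1</protobuf.version>
</properties>

<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>${protobuf.version}</version>
</dependency>
```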