Standalone scheduler issue - one job occupies the whole cluster somehow
Hi all,

Recently we started having issues with one of our background processing scripts, which we run on Spark. The cluster runs only two jobs: one runs for days, and the other usually takes a couple of hours. Both jobs run on a cron schedule. The cluster is small: just 2 slaves, 24 cores, 25.4 GB of memory. Each job takes 6 cores and 6 GB per worker, so when both jobs are running they use 12 of the 24 cores and 24 of the 25.4 GB. But sometimes I see this:

https://www.dropbox.com/s/6uad4hrchqpihp4/Screen%20Shot%202016-01-25%20at%201.16.19%20PM.png

So basically the long-running job has somehow occupied the whole cluster, and the fast one can't make any progress because the cluster doesn't have resources. This is what I see in the logs:

    16/01/25 21:26:48 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

When I log in to the slaves I see this:

slave 1:

    /usr/lib/jvm/java/bin/java -cp -Xms6144M -Xmx6144M -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler --executor-id 450 --hostname 10.191.4.151 --cores 1 --app-id app-20160124152439-1468 --worker-url akka.tcp://sparkWorker@10.191.4.151:53144/user/Worker

    /usr/lib/jvm/java/bin/java -cp -Xms6144M -Xmx6144M -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler --executor-id 451 --hostname 10.191.4.151 --cores 1 --app-id app-20160124152439-1468 --worker-url akka.tcp://sparkWorker@10.191.4.151:53144/user/Worker

slave 2:

    /usr/lib/jvm/java/bin/java -cp -Xms6144M -Xmx6144M -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler --executor-id 1 --hostname 10.253.142.59 --cores 3 --app-id app-20160124152439-1468 --worker-url akka.tcp://sparkWorker@10.253.142.59:33265/user/Worker

    /usr/lib/jvm/java/bin/java -cp -Xms6144M -Xmx6144M -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler --executor-id 448 --hostname 10.253.142.59 --cores 1 --app-id app-20160124152439-1468 --worker-url akka.tcp://sparkWorker@10.253.142.59:33265/user/Worker

So somehow Spark created 4 executors, 2 on each machine (1 core + 1 core, and 3 cores + 1 core), to reach the total of 6 cores. But because the 6 GB setting is per executor, the job ends up occupying 24 GB instead of the 12 GB it would use with 2 executors (3 cores + 3 cores), and blocks the other Spark job.

My wild guess is that for some reason one executor of the long job failed, so the job became 3 cores short and asked the scheduler for 3 more cores; the scheduler then distributed them evenly across the slaves (2 cores + 1 core), but this allocation can't be satisfied until the short job finishes, because the short job holds the rest of the memory. This would explain the 3 + 1 on one slave, but it doesn't explain the 1 + 1 on the other.

Did anyone experience anything similar to this? Any ideas how to avoid it?

Thanks,
Mikhail
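For reference, the memory arithmetic in the message above can be checked with a quick sketch (plain Python, numbers from the cluster described there): since executor memory is configured per executor, the job's footprint grows with the number of executors it gets, not with its total core count.

```python
# Executor memory in standalone mode is a per-executor setting, so an
# application's total memory footprint is
#   (executor memory) x (number of executors),
# regardless of how many cores each executor holds.
EXECUTOR_MEMORY_GB = 6

def app_memory_gb(executor_cores):
    """Total memory the app occupies, given the core count of each of its executors."""
    return EXECUTOR_MEMORY_GB * len(executor_cores)

# Expected layout: one 3-core executor per slave -> 12 GB.
print(app_memory_gb([3, 3]))
# Observed layout: 1 + 1 cores on slave 1, 3 + 1 cores on slave 2 -> 24 GB,
# which is why the second job cannot get any memory.
print(app_memory_gb([1, 1, 3, 1]))
```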
Re: issues with ./bin/spark-shell for standalone mode
Hi Patrick,

I used the 1.0 branch, but it was not an official release; I just git-pulled whatever was there and compiled it.

Thanks,
Mikhail

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
issues with ./bin/spark-shell for standalone mode
Hi!

I've been using Spark compiled from the 1.0 branch at some point (~2 months ago). The setup is a standalone cluster with 4 worker machines and 1 master machine. I used to run the Spark shell like this:

    ./bin/spark-shell -c 30 -em 20g -dm 10g

Today I've finally updated to the Spark 1.0 release. Now I can only run the Spark shell like this:

    ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g

The documentation at http://spark.apache.org/docs/latest/spark-standalone.html says:

    You can also pass an option --cores numCores to control the number of cores that spark-shell uses on the cluster.

This doesn't work; you need to pass --total-executor-cores numCores instead. The documentation also says:

    Note that if you are running spark-shell from one of the spark cluster machines, the bin/spark-shell script will automatically set MASTER from the SPARK_MASTER_IP and SPARK_MASTER_PORT variables in conf/spark-env.sh.

This is not working for me either. I run the shell from the master machine, and I do have SPARK_MASTER_IP set up in conf/spark-env.sh like this:

    export SPARK_MASTER_IP='10.2.1.5'

But if I omit --master spark://10.2.1.5:7077, the console starts but I can't see it in the UI at http://10.2.1.5:8080. And when I go to http://10.2.1.5:4040 (the application UI), I see that the app is using only the master as a worker.

My question is: are these just bugs in the documentation? That there is no --cores option, and that SPARK_MASTER_IP is no longer used when I run the Spark shell from the master?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: issues with ./bin/spark-shell for standalone mode
Thanks Andrew,

    ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g

works well; I just wanted to make sure that I'm not missing anything.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: output tuples in CSV format
You can just use something like this:

    myRdd.map(_.productIterator.mkString(",")).saveAsTextFile("/path/to/output")

On Tue, Jun 10, 2014 at 6:34 PM, SK skrishna...@gmail.com wrote:

My output is a set of tuples, and when I save it using saveAsTextFile, my file looks as follows:

    (field1_tup1, field2_tup1, field3_tup1,...)
    (field1_tup2, field2_tup2, field3_tup2,...)

In Spark, is there some way I can simply have it output in CSV format as follows (i.e. without the parentheses)?

    field1_tup1, field2_tup1, field3_tup1,...
    field1_tup2, field2_tup2, field3_tup2,...

I could write a script to remove the parentheses, but it would be easier if I could omit them in the first place. I did not find a saveAsCsvFile in Spark.

thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/output-tuples-in-CSV-format-tp7363.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
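For what it's worth, the effect is easy to see without Spark. Here is a plain-Python sketch of the same idea (mkString in Scala, str.join in Python), with made-up field values; note that this naive join does no CSV quoting, so it's only safe when fields can't themselves contain commas:

```python
# A tuple's default string form keeps the parentheses -- this is what
# ends up in the file when the tuples are saved as text directly.
rows = [("field1_tup1", "field2_tup1", "field3_tup1"),
        ("field1_tup2", "field2_tup2", "field3_tup2")]

print(str(rows[0]))  # ('field1_tup1', 'field2_tup1', 'field3_tup1')

# Joining the fields yourself produces plain CSV lines, no parentheses.
csv_lines = [",".join(map(str, row)) for row in rows]
print(csv_lines[0])  # field1_tup1,field2_tup1,field3_tup1
```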
Re: stage kill link is awfully close to the stage name
Nick Chammas wrote:

    I think it would be better to have the kill link flush right, leaving a large amount of space between it and the stage detail link.

I think even better would be to have a pop-up confirmation: "Do you really want to kill this stage?" :)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stage-kill-link-is-awfully-close-to-the-stage-name-tp7153p7154.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.