Standalone scheduler issue - one job occupies the whole cluster somehow

2016-01-25 Thread Mikhail Strebkov
Hi all,

Recently we started having issues with one of our background processing
scripts which we run on Spark. The cluster runs only two jobs. One job runs
for days, and the other usually takes a couple of hours. Both jobs run on a
cron schedule. The cluster is small: just 2 slaves, 24 cores, 25.4 GB of
memory. Each job takes 6 cores and 6 GB per worker. So when both jobs are
running, that's 12 cores out of 24 and 24 GB out of 25.4 GB. But
sometimes I see this:

https://www.dropbox.com/s/6uad4hrchqpihp4/Screen%20Shot%202016-01-25%20at%201.16.19%20PM.png

So basically the long running job somehow occupied the whole cluster and
the fast one can't make any progress because the cluster doesn't have
resources. That's what I see in the logs:

16/01/25 21:26:48 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources


When I log in to the slaves I see this:

slave 1:

> /usr/lib/jvm/java/bin/java -cp  -Xms6144M -Xmx6144M
> -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler
> --executor-id 450 --hostname 10.191.4.151 --cores 1 --app-id
> app-20160124152439-1468 --worker-url
> akka.tcp://sparkWorker@10.191.4.151:53144/user/Worker
>
> /usr/lib/jvm/java/bin/java -cp  -Xms6144M -Xmx6144M
> -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler
> --executor-id 451 --hostname 10.191.4.151 --cores 1 --app-id
> app-20160124152439-1468 --worker-url
> akka.tcp://sparkWorker@10.191.4.151:53144/user/Worker


slave 2:

> /usr/lib/jvm/java/bin/java -cp  -Xms6144M -Xmx6144M
> -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler
> --executor-id 1 --hostname 10.253.142.59 --cores 3 --app-id
> app-20160124152439-1468 --worker-url
> akka.tcp://sparkWorker@10.253.142.59:33265/user/Worker
>
> /usr/lib/jvm/java/bin/java -cp  -Xms6144M -Xmx6144M
> -Dspark.driver.port=42548 -Drun.mode=production -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
> akka.tcp://sparkDriver@10.233.17.48:42548/user/CoarseGrainedScheduler
> --executor-id 448 --hostname 10.253.142.59 --cores 1 --app-id
> app-20160124152439-1468 --worker-url
> akka.tcp://sparkWorker@10.253.142.59:33265/user/Worker


So somehow Spark created 4 executors, 2 on each machine: 1 core + 1 core on
one slave and 3 cores + 1 core on the other, for the total of 6 cores. But
because the 6 GB setting is per executor, the job ends up occupying 24 GB
instead of the 12 GB that 2 executors (3 cores + 3 cores) would use, and it
blocks the other Spark job.

My wild guess is that for some reason 1 executor of the long job failed, so
the job became 3 cores short and asked the scheduler for 3 more cores. The
scheduler then distributed them across the slaves (2 cores + 1 core), but
those new executors can't make progress until the short job finishes,
because the short job holds the rest of the memory. This would explain the
3 + 1 on one slave, but it doesn't explain the 1 + 1 on the other.
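In case it helps anyone else hitting this, one mitigation I'm thinking of
trying (just a sketch, not yet verified on our cluster) is to pin the
per-executor core count so the memory-to-core ratio can't drift when
executors are re-allocated. spark.cores.max, spark.executor.cores, and
spark.executor.memory are real standalone-mode settings; the values below
are assumptions based on our 2-slave setup:

```
# spark-defaults.conf (sketch -- values assume our cluster: 2 slaves, 12 cores each)
spark.cores.max        6    # cap the app at 6 cores total
spark.executor.cores   3    # force exactly 3 cores per executor...
spark.executor.memory  6g   # ...and 6g each, so 6 cores can never cost more than 12g
```

With a fixed 3-cores-per-executor shape, replacing a dead executor should
always take the form of one 3-core/6g executor rather than several 1-core
executors that each grab a full 6g.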

Did anyone experience anything similar to this? Any ideas how to avoid it?

Thanks,
Mikhail


Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Mikhail Strebkov
Hi Patrick,

I used the 1.0 branch, but it was not an official release; I just git
pulled whatever was there and compiled.

Thanks,
Mikhail



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Hi! I've been using Spark compiled from the 1.0 branch at some point (~2
months ago). The setup is a standalone cluster with 4 worker machines and 1
master machine. I used to run the Spark shell like this:

  ./bin/spark-shell -c 30 -em 20g -dm 10g

Today I've finally updated to Spark 1.0 release. Now I can only run spark
shell like this:

  ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30
--executor-memory 20g --driver-memory 10g

The documentation at
http://spark.apache.org/docs/latest/spark-standalone.html says:

> You can also pass an option --cores numCores to control the number of
> cores that spark-shell uses on the cluster.

This doesn't work; you need to pass --total-executor-cores numCores
instead.

It also says:

> Note that if you are running spark-shell from one of the spark cluster
> machines, the bin/spark-shell script will automatically set MASTER from
> the SPARK_MASTER_IP and SPARK_MASTER_PORT variables in conf/spark-env.sh.

This is not working for me either. I run the shell from the master machine,
and I do have SPARK_MASTER_IP set in conf/spark-env.sh like this:

  export SPARK_MASTER_IP='10.2.1.5'

But if I omit --master spark://10.2.1.5:7077 then the console starts, yet I
can't see it in the UI at http://10.2.1.5:8080, and when I go to
http://10.2.1.5:4040 (the application UI) I see that the app is using only
the master as a worker.

My question is: are these just bugs in the documentation? I.e. there is no
--cores option anymore, and SPARK_MASTER_IP is no longer used when running
the Spark shell from the master?
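For reference, here is what I have in spark-env.sh, plus the workaround
that does work for me (the IP and port are of course specific to my
cluster; treat this as a sketch of my config, not a recommendation):

```
# conf/spark-env.sh -- what the docs suggest should be enough:
export SPARK_MASTER_IP='10.2.1.5'
export SPARK_MASTER_PORT='7077'   # 7077 is the standalone default port

# Workaround that works for me: pass the master explicitly every time:
#   ./bin/spark-shell --master spark://10.2.1.5:7077 \
#     --total-executor-cores 30 --executor-memory 20g --driver-memory 10g
```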



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107.html


Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Thanks Andrew, 
 ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30
--executor-memory 20g --driver-memory 10g
works well; I just wanted to make sure that I'm not missing anything.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9111.html


Re: output tuples in CSV format

2014-06-10 Thread Mikhail Strebkov
you can just use something like this:
  myRdd.map(_.productIterator.mkString(",")).saveAsTextFile(path)
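Spelled out a bit (a sketch: myRdd and the output path are placeholders;
the trick is that productIterator works on any TupleN, so mkString joins
the fields without the parentheses that toString adds):

```scala
// Plain Scala demo of the formatting; on an RDD you'd use the same
// function inside rdd.map(...) before calling saveAsTextFile.
object TupleToCsv {
  def main(args: Array[String]): Unit = {
    val tuples = Seq(("field1_tup1", "field2_tup1", 1),
                     ("field1_tup2", "field2_tup2", 2))
    // productIterator walks the tuple's elements left to right;
    // mkString(",") joins them with commas and no parentheses.
    val lines = tuples.map(_.productIterator.mkString(","))
    lines.foreach(println)
    // prints:
    // field1_tup1,field2_tup1,1
    // field1_tup2,field2_tup2,2
  }
}
```

Note this does no CSV quoting or escaping, so it is only safe when the
fields themselves contain no commas or newlines.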


On Tue, Jun 10, 2014 at 6:34 PM, SK skrishna...@gmail.com wrote:

 My output is a set of tuples and when I output it using saveAsTextFile, my
 file looks as follows:

 (field1_tup1, field2_tup1, field3_tup1,...)
 (field1_tup2, field2_tup2, field3_tup2,...)

 In Spark, is there some way I can simply have it output in CSV format as
 follows (i.e. without the parentheses):
 field1_tup1, field2_tup1, field3_tup1,...
 field1_tup2, field2_tup2, field3_tup2,...

 I could write a script to remove the parentheses, but would be easier if I
 could omit the parentheses. I did not find a saveAsCsvFile in Spark.

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/output-tuples-in-CSV-format-tp7363.html



Re: stage kill link is awfully close to the stage name

2014-06-06 Thread Mikhail Strebkov
Nick Chammas wrote
 I think it would be better to have the kill link flush right, leaving a
 large amount of space between it and the stage detail link.

I think even better would be to have a pop-up confirmation "Do you really
want to kill this stage?" :)





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stage-kill-link-is-awfully-close-to-the-stage-name-tp7153p7154.html