Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread kmurph
Hi Simon,

Thanks.  I did actually have "SPARK_WORKER_CORES=8" in spark-env.sh - it's
commented as 'to set the number of cores to use on this machine'.
I'm not sure how this would interplay with SPARK_EXECUTOR_INSTANCES and
SPARK_EXECUTOR_CORES, but I removed it and still see no scale-up with
increasing cores.  Nothing else is set in spark-env.sh.

However, your email has drawn my attention to the comments in spark-env.sh,
which indicate that SPARK_EXECUTOR_INSTANCES and SPARK_EXECUTOR_CORES are
only read for YARN configurations.  Based also on what is listed under
"Options for the daemons used in the standalone deploy mode", I guess the
standalone thing to do would be to use:

# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
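
For standalone mode I assume that would look something like this in
conf/spark-env.sh (values purely illustrative):

SPARK_WORKER_INSTANCES=2   # two worker JVMs on this machine
SPARK_WORKER_CORES=4       # cores each worker can hand out to executors
SPARK_WORKER_MEMORY=9g     # memory each worker can hand out to executors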

But as I'm running locally, and there is a separate section for "Options
read when launching programs locally with ./bin/run-example or
./bin/spark-submit", I don't believe the daemon settings would be read for
my setup.  In fact I just tried switching to SPARK_WORKER_CORES and
SPARK_WORKER_INSTANCES and the cores still don't scale, so it's probably
using all cores available on the machine, and I don't have control over
executors and cores/executor when running locally.
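
(If I've understood local mode right - and I may not have - everything runs
in a single JVM, so the only knobs that matter are the thread count in the
master string and the driver memory, e.g. something like this, with the jar
path abbreviated:

spark-submit --master local[4] --driver-memory 9g \
  --class asap.examples.mllib.TfIdfExample /path/to/ml-operators_2.10-1.0.jar

i.e. local[4] rather than local[*] to pin the parallelism at 4 threads, with
the spark.executor.* settings having no effect.)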

Lan's comments here:
http://stackoverflow.com/questions/24696777/what-is-the-relationship-between-workers-worker-instances-and-executors
mention the standalone cluster manager.  I had assumed they would also
apply to a large local machine.

Will future versions of Spark let me control executors and cores per
executor?  Are there any plans for this?

Please let me know if my current understanding of what's possible in Spark
local mode is incorrect.

Many thanks

Karen







Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-07 Thread kmurph

Hi,

I'm running Spark 1.6.1 on a single machine, initially a small one (8 cores,
16GB RAM), passing "--master local[*]" to spark-submit, and I'm trying to
see scaling with increasing cores, unsuccessfully.
Initially I'm setting SPARK_EXECUTOR_INSTANCES=1 and increasing the cores
for each executor.  I'm setting cores per executor either with
"SPARK_EXECUTOR_CORES=1" (up to 4) or with --conf
"spark.executor.cores=1 spark.executor.memory=9g".
I'm repartitioning the RDD of the large dataset into 4/8/10 partitions for
different runs.

Am I setting executors/cores correctly for running Spark 1.6 in
local/standalone mode?
The logs show the same overall timings for the key stages (within a stage
the number of tasks matches the data partitioning value) whether I ask for
1, 4 or 8 cores per executor, and the process table suggests the requested
cores aren't being used.

I know that e.g. "--num-executors" is only an argument for YARN.  I can't
find specific instructions in one place for setting these parameters
(executors/cores) when running Spark on one machine.

An example of my full spark-submit command is:

SPARK_EXECUTOR_INSTANCES=1 SPARK_EXECUTOR_CORES=4 spark-submit --master
local[*] --conf "spark.executor.cores=4 spark.executor.memory=9g" --class
asap.examples.mllib.TfIdfExample
/home/ubuntu/spark-1.6.1-bin-hadoop2.6/asap_ml/target/scala-2.10/ml-operators_2.10-1.0.jar

The settings are duplicated here, but it shows the different ways I've been
setting the parameters.
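
Written out with one property per --conf flag, which I believe is the form
spark-submit expects, that would be:

spark-submit --master local[*] \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=9g \
  --class asap.examples.mllib.TfIdfExample \
  /home/ubuntu/spark-1.6.1-bin-hadoop2.6/asap_ml/target/scala-2.10/ml-operators_2.10-1.0.jar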

Thanks
Karen







Spark MLLib benchmarks

2016-05-04 Thread kmurph

Hi, 

I'm benchmarking Spark (1.6) MLlib TF-IDF (with HDFS) on a 20GB dataset,
and I'm not seeing much scale-up when I increase cores/executors/RAM
according to the Spark tuning documentation.  I suspect I'm missing a trick
in my configuration.

I'm running on a shared-memory machine (96 cores, 256GB RAM) and testing
various combinations of:
Number of executors (1, 2, 4, 8)
Number of cores per executor (1, 2, 4, 8, 12, 24)
Memory per executor (calculated as per the Cloudera recommendations)
all within the combined resource limits, of course.

I'm also setting the RDD partitioning number to 2/4/6/8 (I see the best
results at 4 partitions, about 5% better than the worst case).
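
For reference, I believe the TF-IDF part of the job boils down to the
standard MLlib pattern - roughly the sketch below, where 'documents' stands
in for the RDD of tokenised documents and the partition count is what I
vary between runs:

import org.apache.spark.mllib.feature.{HashingTF, IDF}

val docs = documents.repartition(4)        // 2/4/6/8 in different runs
val tf = new HashingTF().transform(docs)   // term-frequency vectors
tf.cache()                                 // tf is used twice below, so cache it
val idf = new IDF().fit(tf)                // build the IDF model
val tfidf = idf.transform(tf)              // final TF-IDF vectors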

I have also varied/switched the following settings (a rough sketch of how
I'm passing some of these to spark-submit is below):
Using the Kryo serialiser
Setting driver memory
Enabling compressed oops
Dynamic scheduling
Trying different storage levels for persisting RDDs
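
For example, a typical submit line for these runs looks something like this
(master URL omitted; exact values and paths illustrative):

spark-submit \
  --driver-memory 16g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  --conf spark.executor.extraJavaOptions=-XX:+UseCompressedOops \
  --class asap.examples.mllib.TfIdfExample \
  /path/to/ml-operators_2.10-1.0.jar

with the storage level chosen in code, e.g.
rdd.persist(StorageLevel.MEMORY_ONLY_SER).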

As we scale up the cores in the best of these configurations we still see a
running time of 19-20 minutes.
Is there anything else I should be configuring to get better scale-up?
Are there any documented TF-IDF benchmark results that I could compare
against to validate this (even very approximate, indirect comparisons)?

Any advice would be much appreciated,
Thanks
Karen







Re: Pagerank implementation

2014-12-15 Thread kmurph

Hiya,

I too am looking for a PageRank solution in GraphX where the probabilities
sum to 1.
I tried a few modifications, including dividing by the total number of
vertices in the first part of the equation, as well as trying to return the
full rank instead of the delta (though not correctly, as evident from an
exception at runtime).
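
One workaround I've been considering is simply normalising the ranks after
the fact rather than changing the update rule itself - something along
these lines for a Graph[VD, ED] called graph (sketch only):

import org.apache.spark.graphx._

val ranks = graph.pageRank(0.0001).vertices    // un-normalised ranks
val total = ranks.map(_._2).sum()              // sum over all vertices
val normalised = ranks.mapValues(_ / total)    // now sums to 1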

Tom, did you manage to make a version which sums to 1?  If so, could you
possibly divulge the changes?

Also, I'm interested to know whether the algorithm handles the case where
there are no outgoing links from a node.  Does it avoid unfairness with
sinks?
I'm new to Scala (and Spark).  I had a look at the code and don't see that
it does, but I could be missing something.

Thanks

Karen






Re: Spark 1.1.1, Hadoop 2.6 - Protobuf conflict

2014-12-12 Thread kmurph
I had this problem too, with Spark 1.1.1.  At the time I was using Hadoop
0.20.

To get around it I installed Hadoop 2.5.2 and set protobuf.version to 2.5.0
in the build command, like so:
mvn -Phadoop-2.5 -Dhadoop.version=2.5.2 -Dprotobuf.version=2.5.0
-DskipTests clean package

So I changed Spark's pom.xml to read protobuf.version from the command
line.
If I didn't explicitly set protobuf.version, it was picking up an older
version that existed somewhere on my filesystem.
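
Concretely, the pom change just makes the protobuf version an overridable
property and has the protobuf dependency reference it - roughly like this
(snippet from memory, not an exact diff):

  <properties>
    <protobuf.version>2.5.0</protobuf.version>
  </properties>

  <dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>${protobuf.version}</version>
  </dependency>

That way -Dprotobuf.version=... on the mvn command line takes precedence
over the default in the pom.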

Karen


