Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mich Talebzadeh
Well, that is what the OP stated: "I have a spark cluster consisting of 4 nodes in a standalone mode"... HTH

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Michael Segel
Did the OP say he was running a standalone cluster of Spark, or on YARN?

> On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh wrote:
>
> Hi Jakub,
>
> Any reason why you are running in standalone mode, given that you are familiar with YARN?

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mathieu Longtin
From experience, here are the kinds of things that cause the driver to run out of memory:
- Way too many partitions (1 and up)
- Something like this (sketched below): data = load_large_data(); rdd = sc.parallelize(data)
- Any call to rdd.collect() or rdd.take(N) where the resulting data is bigger than driver memory
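To make the second item concrete, here is a minimal PySpark sketch of that anti-pattern and a safer alternative; load_large_data() and the input path are hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="demo")

    # Anti-pattern: load_large_data() (a hypothetical helper) materializes the
    # whole dataset in the driver's memory before anything is distributed.
    # data = load_large_data()
    # rdd = sc.parallelize(data)

    # Better: let the executors read in parallel; the driver holds only metadata.
    rdd = sc.textFile("hdfs:///path/to/large_input")  # the path is an assumption

    # Likewise, prefer cluster-side aggregation over rdd.collect().
    print(rdd.count())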

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mich Talebzadeh
Hi Jakub, Any reason why you are running in standalone mode, given that you are familiar with YARN? In theory your settings are correct. I checked your environment tab settings and they look correct. I assume you have checked this link: http://spark.apache.org/docs/latest/spark-standalone.html

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Jakub Stransky
So now that we have clarified that everything is submitted in cluster standalone mode, what is left to explain why the application (an ML pipeline) doesn't take advantage of the full cluster's power but essentially runs just on the master node until its resources are exhausted? Why doesn't training an ML DecisionTree scale to the rest of the cluster?
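One quick sanity check from the driver, assuming `sc` is the application's SparkContext and using the master URL that appears elsewhere in this thread:

    print(sc.master)               # should be spark://spark.master:7077, not local[*]
    print(sc.defaultParallelism)   # should roughly equal the total cores across the workers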

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Well, this will be apparent from the Environment tab of the GUI. It will show how the job is actually running. Jacek's point is correct. I suspect this is actually running in local mode, as it looks like it is consuming everything from the master node. HTH

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jacek Laskowski
On Mon, Jul 4, 2016 at 8:36 PM, Mathieu Longtin wrote:
> Are you using a --master argument, or equivalent config, when calling spark-submit?
> If you don't, it runs in standalone mode.

s/standalone/local[*]

Jacek

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
Are you using a --master argument, or equivalent config, when calling spark-submit? If you don't, it runs in standalone mode.
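For reference, an explicit --master on the submit line, reusing the host, class, and jar names that appear elsewhere in this thread, would look like this:

    spark-submit --master spark://spark.master:7077 \
      --driver-class-path spark/sqljdbc4.jar \
      --class DemoApp SparkPOC.jar 10 4.3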

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
OK, spark-submit by default starts its GUI at port 4040. You can change that to any other port using --conf "spark.ui.port=". In the GUI, what do you see under the Environment and Executors tabs? Can you send a snapshot? HTH
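For example (4041 is an arbitrary choice):

    spark-submit --conf "spark.ui.port=4041" --class DemoApp SparkPOC.jar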

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hi Mich, sure, the workers are listed in the slaves file. I can see them in the Spark master UI, and even after startup they are "blocked" for this application, but their CPU and memory consumption is close to nothing. Thanks, Jakub

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Silly question: have you added your workers to the conf/slaves file, and have you started them with sbin/start-slaves.sh? On the master node, when you type jps, what do you see? The problem seems to be that the workers are ignored and Spark is essentially running in local mode. HTH
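A minimal sketch of that checklist; the worker hostnames are hypothetical:

    # conf/slaves on the master node, one worker host per line
    spark-worker-1
    spark-worker-2
    spark-worker-3

    $ sbin/start-slaves.sh
    $ jps   # expect a Master JVM on this node and a Worker JVM on each slave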

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Mathieu, there is no rocket science there. Essentially it creates a DataFrame and then calls fit from the ML pipeline. The thing which I do not understand is how the parallelization is done in terms of the ML algorithm. Is it based on the parallelism of the DataFrame? Because the ML algorithm doesn't offer such a setting.
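MLlib's tree training runs its tasks over the partitions of the input data, so the DataFrame's partitioning is the knob to check; a hedged sketch, where the partition count is an arbitrary example:

    # `df` is the training DataFrame used by the pipeline.
    print(df.rdd.getNumPartitions())  # if this is 1, training cannot spread across workers

    # Repartition before fit() so tasks can land on every executor.
    df = df.repartition(24)           # pick roughly a small multiple of total cluster cores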

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hi Mich, I have set up the Spark default configuration in the conf directory, in spark-defaults.conf, where I specify the master, hence no need to put it on the command line: spark.master spark://spark.master:7077. The same applies to the driver memory, which has been increased to 4GB, and the same is set for the executor memory.
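Spelled out, such a spark-defaults.conf would look roughly like this; the executor line is an assumption, since the original message is truncated at that point:

    # conf/spark-defaults.conf
    spark.master            spark://spark.master:7077
    spark.driver.memory     4g
    spark.executor.memory   4g   # assumed value; not confirmed in the thread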

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
When the driver is running out of memory, it usually means you're loading data in a non-parallel way (without using an RDD). Make sure anything that requires a non-trivial amount of memory is done by an RDD. Also, the default memory for everything is 1GB, which may not be enough for you.
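Raising those defaults on the submit line uses standard spark-submit flags; the 4g values here are illustrative only:

    spark-submit --master spark://spark.master:7077 \
      --driver-memory 4g --executor-memory 4g \
      --class DemoApp SparkPOC.jar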

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Hi Jakub, In standalone mode Spark does the resource management. Which version of Spark are you running? How do you define your SparkConf() parameters, for example setMaster etc.? From

    spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp SparkPOC.jar 10 4.3

I did not see any master setting being passed.
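For completeness, setting the master in code instead of on the command line looks roughly like this in PySpark (the app name and master URL reuse names from this thread):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("DemoApp").setMaster("spark://spark.master:7077")
    sc = SparkContext(conf=conf)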

Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hello, I have a Spark cluster consisting of 4 nodes in standalone mode: a master plus 3 worker nodes, with available memory, CPUs, etc. configured. I have a Spark application which is essentially an MLlib pipeline for training a classifier, in this case a RandomForest but it could be a DecisionTree
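In minimal PySpark form, such a pipeline might look like this; the column names, feature list, and tree settings are all hypothetical:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler

    # `df` is assumed to be a DataFrame with numeric columns f1, f2, f3 and a label column.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)

    pipeline = Pipeline(stages=[assembler, rf])
    model = pipeline.fit(df)  # tree training is distributed over df's partitions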