Well that is what the OP stated.
I have a spark cluster consisting of 4 nodes in a standalone mode,..
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Did the OP say he was running a stand alone cluster of Spark, or on Yarn?
> On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh
> wrote:
>
> Hi Jakub,
>
> Any reason why you are running in standalone mode, given that you are
> familiar with YARN?
>
> In theory your
From experience, here are the kinds of things that cause the driver to run
out of memory:
- Way too many partitions
- Something like this, where the entire dataset is materialized in driver
  memory before being distributed:
  data = load_large_data()
  rdd = sc.parallelize(data)
- Any call to rdd.collect() or rdd.take(N) where the resulting data is
  bigger than the driver's memory
Hi Jakub,
Any reason why you are running in standalone mode, given that you are
familiar with YARN?
In theory your settings are correct. I checked your environment tab
settings and they look correct.
I assume you have checked this link
http://spark.apache.org/docs/latest/spark-standalone.html
So now that we have clarified that everything is submitted in standalone
cluster mode, what is left is that the application (an ML pipeline) doesn't
take advantage of the full cluster's power but is essentially running just
on the master node until its resources are exhausted. Why doesn't training
the ML Decision Tree scale to the rest of the cluster?
Well, this will be apparent from the Environment tab of the GUI. It will
show how the job is actually running.
Jacek's point is correct. I suspect this is actually running in local mode,
as it appears to be consuming everything from the master node.
HTH
Dr Mich Talebzadeh
On Mon, Jul 4, 2016 at 8:36 PM, Mathieu Longtin
wrote:
> Are you using a --master argument, or equivalent config, when calling
> spark-submit?
>
> If you don't, it runs in standalone mode.
>
s/standalone/local[*]
Jacek
Are you using a --master argument, or equivalent config, when calling
spark-submit?
If you don't, it runs in standalone mode.
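Making the master explicit removes the ambiguity. A sketch of the submit command, using the standalone master URL Jakub quotes elsewhere in the thread from his spark-defaults.conf, together with his class and jar names:

```
# Without --master (and with no spark.master configured),
# spark-submit falls back to local[*].
spark-submit \
  --master spark://spark.master:7077 \
  --class DemoApp \
  SparkPOC.jar 10 4.3
```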
On Mon, Jul 4, 2016 at 2:27 PM Jakub Stransky wrote:
> Hi Mich,
>
> sure that workers are mentioned in slaves file. I can see them in spark
>
OK, spark-submit by default starts its GUI at port 4040. You can change
that using --conf "spark.ui.port=" to pick any other port.
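For example, to move the UI off 4040 (4041 here is just an illustrative value):

```
spark-submit --conf "spark.ui.port=4041" ...
```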
In GUI what do you see under Environment and Executors tabs. Can you send
the snapshot?
HTH
Dr Mich Talebzadeh
Hi Mich,
sure, the workers are mentioned in the slaves file. I can see them in the
Spark master UI, and even after start they are "blocked" for this
application, but their CPU and memory consumption is close to nothing.
Thanks
Jakub
On 4 July 2016 at 18:36, Mich Talebzadeh
Silly question. Have you added your workers to the conf/slaves file, and
have you started sbin/start-slaves.sh?
On the master node, when you type jps, what do you see?
The problem seems to be that the workers are ignored and Spark is
essentially running in local mode.
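The checks being suggested, written out as commands; the paths assume a standard Spark layout under $SPARK_HOME, and for Spark of this era the worker list lives in conf/slaves:

```
# one worker hostname per line
cat $SPARK_HOME/conf/slaves

# start a Worker on every host listed there
$SPARK_HOME/sbin/start-slaves.sh

# on the master node, jps should list a Master JVM;
# on each worker node it should list a Worker JVM
jps
```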
HTH
Dr Mich Talebzadeh
Mathieu,
there is no rocket science there. Essentially it creates a DataFrame and
then calls fit from the ML pipeline. The thing which I do not understand is
how the parallelization is done in terms of the ML algorithm. Is it based
on the parallelism of the DataFrame? Because the ML algorithm doesn't offer
such a setting.
Hi Mich,
I have set up the Spark default configuration in the conf directory, in
spark-defaults.conf, where I specify the master, hence no need to put it on
the command line:
spark.master spark://spark.master:7077
The same applies to the driver memory, which has been increased to 4GB,
and the same is for
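Put together, a minimal sketch of that spark-defaults.conf; the spark.driver.memory line is an assumed spelling of the 4GB setting mentioned above:

```
spark.master          spark://spark.master:7077
spark.driver.memory   4g
```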
When the driver is running out of memory, it usually means you're loading
data in a non-parallel way (without using an RDD). Make sure anything that
requires a non-trivial amount of memory is done by an RDD. Also, the
default memory for everything is 1GB, which may not be enough for you.
On Mon, Jul
Hi Jakub,
In standalone mode Spark does the resource management. Which version of
Spark are you running?
How do you define your SparkConf() parameters, for example setMaster() etc.?
From
spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
SparkPOC.jar 10 4.3
I did not see any
Hello,
I have a Spark cluster consisting of 4 nodes in standalone mode: a master
plus 3 worker nodes, with available memory, CPUs etc. configured.
I have a Spark application which is essentially an MLlib pipeline for
training a classifier, in this case a RandomForest, but it could be a
DecisionTree