Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Mathieu Longtin
>>> sun.boot.class.path
>>>  
>>> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/resources.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rt.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jsse.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jce.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/charsets.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/rhino.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/jfr.jar:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/classes
>>> sun.boot.library.path
>>>  /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre/lib/amd64
>>> sun.cpu.endian  little
>>> sun.cpu.isalist
>>> sun.io.unicode.encoding  UnicodeLittle
>>> sun.java.command  org.apache.spark.deploy.SparkSubmit --conf
>>> spark.driver.extraClassPath=/home/sparkuser/sqljdbc4.jar --class  --class
>>> DemoApp SparkPOC.jar 10 4.3
>>> sun.java.launcher  SUN_STANDARD
>>> sun.jnu.encoding  UTF-8
>>> sun.management.compiler  HotSpot 64-Bit Tiered Compilers
>>> sun.nio.ch.bugLevel
>>> sun.os.patch.level  unknown
>>> user.country  US
>>> user.dir  /home/sparkuser
>>> user.home  /home/sparkuser
>>> user.language  en
>>> user.name  sparkuser
>>> user.timezone  Etc/UTC
>>> Classpath Entries
>>>
>>> Resource  Source
>>> /home/sparkuser/sqljdbc4.jar  System Classpath
>>> /usr/local/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.1-hadoop2.2.0.jar  System Classpath
>>> /usr/local/spark-1.6.1/conf/  System Classpath
>>> /usr/local/spark-1.6.1/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar  System Classpath
>>> /usr/local/spark-1.6.1/lib_managed/jars/datanucleus-core-3.2.10.jar  System Classpath
>>> /usr/local/spark-1.6.1/lib_managed/jars/datanucleus-rdbms-3.2.9.jar  System Classpath
>>> http://10.2.0.4:35639/jars/SparkPOC.jar  Added By User
>>>
>>> On 4 July 2016 at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Well, this will be apparent from the Environment tab of the GUI. It will
>>>> show how the job is actually running.
>>>>
>>>> Jacek's point is correct. I suspect this is actually running in local
>>>> mode, as it looks like it is consuming everything from the master node.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 4 July 2016 at 20:35, Jacek Laskowski <ja...@japila.pl> wrote:
>>>>
>>>>> On Mon, Jul 4, 2016 at 8:36 PM, Mathieu Longtin <
>>>>> math...@closetwork.org> wrote:
>>>>>
>>>>>> Are you using a --master argument, or equivalent config, when calling
>>>>>> spark-submit?
>>>>>>
>>>>>> If you don't, it runs in standalone mode.
>>>>>>
>>>>>
>>>>> s/standalone/local[*]
>>>>>
>>>>> Jacek
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jakub Stransky
>>> cz.linkedin.com/in/jakubstransky
>>>
>>>
>>
>
>
> --
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky
>
> --
Mathieu Longtin
1-514-803-8977


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
Are you using a --master argument, or equivalent config, when calling
spark-submit?

If you don't, it runs in standalone mode.
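
For example, a submission that explicitly targets a standalone master might
look like this (the host name, port and script name are placeholders):

  spark-submit --master spark://master-host:7077 --executor-memory 4g my_app.py

Without an explicit master (on the command line, in spark-defaults.conf, or
in the code's SparkConf), the job can easily end up running only on the
submitting machine with the local scheduler, which is the behavior described
in this thread.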

On Mon, Jul 4, 2016 at 2:27 PM Jakub Stransky <stransky...@gmail.com> wrote:

> Hi Mich,
>
> sure, the workers are mentioned in the slaves file. I can see them in the
> Spark master UI, and even after start they are "blocked" for this
> application, but the CPU and memory consumption is close to nothing.
>
> Thanks
> Jakub
>
> On 4 July 2016 at 18:36, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Silly question: have you added your workers to the conf/slaves file, and
>> have you started sbin/start-slaves.sh?
>>
>> on master node when you type jps what do you see?
>>
>> The problem seems to be that the workers are ignored and Spark is
>> essentially running in local mode.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 July 2016 at 17:05, Jakub Stransky <stransky...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> I have set up the Spark default configuration in the conf directory
>>> (spark-defaults.conf), where I specify the master, hence there is no need
>>> to put it on the command line:
>>> spark.master   spark://spark.master:7077
>>>
>>> The same applies to the driver memory, which has been increased to 4GB,
>>> and to spark.executor.memory, set to 12GB, as the machines have 16GB.
>>>
>>> Jakub
>>>
>>>
>>>
>>>
>>> On 4 July 2016 at 17:44, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jakub,
>>>>
>>>> In standalone mode Spark does the resource management. Which version of
>>>> Spark are you running?
>>>>
>>>> How do you define your SparkConf() parameters for example setMaster
>>>> etc.
>>>>
>>>> From
>>>>
>>>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>>>> SparkPOC.jar 10 4.3
>>>>
>>>> I did not see any executor, memory allocation, so I assume you are
>>>> allocating them somewhere else?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 4 July 2016 at 16:31, Jakub Stransky <stransky...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a Spark cluster consisting of 4 nodes in standalone mode, a
>>>>> master + 3 worker nodes, with available memory and CPUs etc. configured.
>>>>>
>>>>> I have a Spark application which is essentially an MLlib pipeline for
>>>>> training a classifier, in this case a RandomForest, but it could be a
>>>>> DecisionTree just for the sake of simplicity.
>>>>>
>>>>> But when I submit the Spark application to the cluster via spark-submit,
>>>>> it runs out of memory. Even though the executors are "taken"/created in
>>>>> the cluster, they are essentially doing nothing (poor CPU and memory
>>>>> utilization) while the master seems to do all the work, which finally
>>>>> results in an OOM.
>>>>>
>>>>> My submission is following:
>>>>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>>>>> SparkPOC.jar 10 4.3
>>>>>
>>>>> I am submitting from the master node.
>>>>>
>>>>> By default it is running in client mode, in which the driver process is
>>>>> attached to spark-submit.
>>>>>
>>>>> Do I need to set some configuration to make the MLlib algorithms
>>>>> parallelized and distributed as well, or is it all driven by the
>>>>> parallelism of the DataFrame holding the input data?
>>>>>
>>>>> Essentially it seems that all the work is done on the master and the
>>>>> rest is idle.
>>>>> Any hints what to check?
>>>>>
>>>>> Thx
>>>>> Jakub
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jakub Stransky
>>> cz.linkedin.com/in/jakubstransky
>>>
>>>
>>
>
>
> --
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky
>
> --
Mathieu Longtin
1-514-803-8977


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
When the driver is running out of memory, it usually means you're loading
data in a non-parallel way (without using an RDD). Make sure anything that
requires a non-trivial amount of memory is done by an RDD. Also, the default
memory for everything is 1 GB, which may not be enough for you.
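
A small PySpark illustration of the difference (the input path and the
parsing are made up):

  # parallel: each executor reads and parses its own partitions
  rdd = sc.textFile("hdfs:///data/big_input.txt")
  parsed = rdd.map(lambda line: line.split(","))

  # non-parallel anti-pattern: the whole file is read into driver memory first
  rows = [line.split(",") for line in open("/data/big_input.txt")]
  rdd2 = sc.parallelize(rows)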

On Mon, Jul 4, 2016 at 11:44 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Jakub,
>
> In standalone mode Spark does the resource management. Which version of
> Spark are you running?
>
> How do you define your SparkConf() parameters for example setMaster etc.
>
> From
>
> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
> SparkPOC.jar 10 4.3
>
> I did not see any executor or memory allocation settings, so I assume you
> are allocating them somewhere else?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 July 2016 at 16:31, Jakub Stransky <stransky...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a Spark cluster consisting of 4 nodes in standalone mode, a master
>> + 3 worker nodes, with available memory and CPUs etc. configured.
>>
>> I have a Spark application which is essentially an MLlib pipeline for
>> training a classifier, in this case a RandomForest, but it could be a
>> DecisionTree just for the sake of simplicity.
>>
>> But when I submit the Spark application to the cluster via spark-submit,
>> it runs out of memory. Even though the executors are "taken"/created in the
>> cluster, they are essentially doing nothing (poor CPU and memory
>> utilization) while the master seems to do all the work, which finally
>> results in an OOM.
>>
>> My submission is following:
>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>> SparkPOC.jar 10 4.3
>>
>> I am submitting from the master node.
>>
>> By default it is running in client mode, in which the driver process is
>> attached to spark-submit.
>>
>> Do I need to set some configuration to make the MLlib algorithms
>> parallelized and distributed as well, or is it all driven by the
>> parallelism of the DataFrame holding the input data?
>>
>> Essentially it seems that all the work is done on the master and the rest
>> is idle.
>> Any hints what to check?
>>
>> Thx
>> Jakub
>>
>>
>>
>>
> --
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
Try to figure out what the env vars and arguments of the worker JVM and
Python process are. Maybe you'll get a clue.
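
One way to do that on a Linux worker node is via /proc (the PID below is a
placeholder; the grep patterns are only examples):

  ps -ef | grep -E 'pyspark.daemon|CoarseGrainedExecutorBackend'   # find the PIDs
  tr '\0' ' '  < /proc/<PID>/cmdline; echo                         # full command line
  tr '\0' '\n' < /proc/<PID>/environ | grep -i spark               # environment variables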

On Mon, Jul 4, 2016 at 11:42 AM Mathieu Longtin <math...@closetwork.org>
wrote:

> I started with a download of 1.6.0. These days, we use a self compiled
> 1.6.2.
>
> On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav <ashraag...@gmail.com>
> wrote:
>
>> I am thinking of any possibilities as to why this could be happening. If
>> the cores are multi-threaded, should that affect the daemons? Your spark
>> was built from source code or downloaded as a binary, though that should
>> not technically change anything?
>>
>> On Mon, Jul 4, 2016 at 9:03 PM, Mathieu Longtin <math...@closetwork.org>
>> wrote:
>>
>>> 1.6.1.
>>>
>>> I have no idea. SPARK_WORKER_CORES should do the same.
>>>
>>> On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav <ashraag...@gmail.com>
>>> wrote:
>>>
>>>> Which version of Spark are you using? 1.6.1?
>>>>
>>>> Any ideas as to why it is not working in ours?
>>>>
>>>> On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin <math...@closetwork.org
>>>> > wrote:
>>>>
>>>>> 16.
>>>>>
>>>>> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav <ashraag...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I tried what you suggested and started the slave using the following
>>>>>> command:
>>>>>>
>>>>>> start-slave.sh --cores 1 
>>>>>>
>>>>>> But it still seems to start as many pyspark daemons as the number of
>>>>>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>>>>>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>>>>>
>>>>>> When you said it helped you and limited it to 2 processes in your
>>>>>> cluster, how many cores did each machine have?
>>>>>>
>>>>>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <
>>>>>> math...@closetwork.org> wrote:
>>>>>>
>>>>>>> It depends on what you want to do:
>>>>>>>
>>>>>>> If, on any given server, you don't want Spark to use more than one
>>>>>>> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
>>>>>>> --cores=1
>>>>>>>
>>>>>>> If you have a bunch of servers dedicated to Spark, but you don't
>>>>>>> want a driver to use more than one core per server, then: 
>>>>>>> spark.executor.cores=1
>>>>>>> tells it not to use more than 1 core per server. However, it seems it 
>>>>>>> will
>>>>>>> start as many pyspark as there are cores, but maybe not use them.
>>>>>>>
>>>>>>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav <ashraag...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Mathieu,
>>>>>>>>
>>>>>>>> Isn't that the same as setting "spark.executor.cores" to 1? And how
>>>>>>>> can I specify "--cores=1" from the application?
>>>>>>>>
>>>>>>>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
>>>>>>>> math...@closetwork.org> wrote:
>>>>>>>>
>>>>>>>>> When running the executor, put --cores=1. We use this and I only
>>>>>>>>> see 2 pyspark process, one seem to be the parent of the other and is 
>>>>>>>>> idle.
>>>>>>>>>
>>>>>>>>> In your case, are all pyspark process working?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 4, 2016 at 3:15 AM ar7 <ashraag...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>>>>>>>>> application
>>>>>>>>>> is run, the load on the workers seems to go more than what was
>>>>>>>>>> given. When I
>>>>>>>>>> ran top, I noticed 

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
I started with a download of 1.6.0. These days, we use a self compiled
1.6.2.

On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav <ashraag...@gmail.com> wrote:

> I am thinking of any possibilities as to why this could be happening. If
> the cores are multi-threaded, should that affect the daemons? Was your
> Spark built from source code or downloaded as a binary? Though that should
> not technically change anything.
>
> On Mon, Jul 4, 2016 at 9:03 PM, Mathieu Longtin <math...@closetwork.org>
> wrote:
>
>> 1.6.1.
>>
>> I have no idea. SPARK_WORKER_CORES should do the same.
>>
>> On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav <ashraag...@gmail.com>
>> wrote:
>>
>>> Which version of Spark are you using? 1.6.1?
>>>
>>> Any ideas as to why it is not working in ours?
>>>
>>> On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin <math...@closetwork.org>
>>> wrote:
>>>
>>>> 16.
>>>>
>>>> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav <ashraag...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I tried what you suggested and started the slave using the following
>>>>> command:
>>>>>
>>>>> start-slave.sh --cores 1 
>>>>>
>>>>> But it still seems to start as many pyspark daemons as the number of
>>>>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>>>>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>>>>
>>>>> When you said it helped you and limited it to 2 processes in your
>>>>> cluster, how many cores did each machine have?
>>>>>
>>>>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <
>>>>> math...@closetwork.org> wrote:
>>>>>
>>>>>> It depends on what you want to do:
>>>>>>
>>>>>> If, on any given server, you don't want Spark to use more than one
>>>>>> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
>>>>>> --cores=1
>>>>>>
>>>>>> If you have a bunch of servers dedicated to Spark, but you don't want
>>>>>> a driver to use more than one core per server, then: 
>>>>>> spark.executor.cores=1
>>>>>> tells it not to use more than 1 core per server. However, it seems it 
>>>>>> will
>>>>>> start as many pyspark as there are cores, but maybe not use them.
>>>>>>
>>>>>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav <ashraag...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Mathieu,
>>>>>>>
>>>>>>> Isn't that the same as setting "spark.executor.cores" to 1? And how
>>>>>>> can I specify "--cores=1" from the application?
>>>>>>>
>>>>>>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
>>>>>>> math...@closetwork.org> wrote:
>>>>>>>
>>>>>>>> When running the executor, put --cores=1. We use this and I only
>>>>>>>> see 2 pyspark process, one seem to be the parent of the other and is 
>>>>>>>> idle.
>>>>>>>>
>>>>>>>> In your case, are all pyspark process working?
>>>>>>>>
>>>>>>>> On Mon, Jul 4, 2016 at 3:15 AM ar7 <ashraag...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>>>>>>>> application
>>>>>>>>> is run, the load on the workers seems to go more than what was
>>>>>>>>> given. When I
>>>>>>>>> ran top, I noticed that there were too many Pyspark.daemons
>>>>>>>>> processes
>>>>>>>>> running. There was another mail thread regarding the same:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>>>>>>>>
>>>>>>>>> I followed what was mentioned there, i.e. reduced the number of
>>>>>>>>> executor
>>>>>>>>> cores and number of executors in one node to 1. But the number of
>>>>>>>>> pyspark.daemons process is still not coming down. It looks like
>>>>>>>>> initially
>>>>>>>>> there is one Pyspark.daemons process and this in turn spawns as
>>>>>>>>> many
>>>>>>>>> pyspark.daemons processes as the number of cores in the machine.
>>>>>>>>>
>>>>>>>>> Any help is appreciated :)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ashwin Raaghav.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>> Nabble.com.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Mathieu Longtin
>>>>>>>> 1-514-803-8977
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>>
>>>>>>> Ashwin Raaghav
>>>>>>>
>>>>>> --
>>>>>> Mathieu Longtin
>>>>>> 1-514-803-8977
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Ashwin Raaghav
>>>>>
>>>> --
>>>> Mathieu Longtin
>>>> 1-514-803-8977
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
1.6.1.

I have no idea. SPARK_WORKER_CORES should do the same.
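
For reference, a sketch of the two places this can be set (the master URL is
a placeholder):

  # conf/spark-env.sh on each worker node, before restarting the worker
  export SPARK_WORKER_CORES=1

  # or as a worker option when starting it by hand
  $SPARK_HOME/sbin/start-slave.sh --cores 1 spark://master-host:7077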

On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav <ashraag...@gmail.com> wrote:

> Which version of Spark are you using? 1.6.1?
>
> Any ideas as to why it is not working in ours?
>
> On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin <math...@closetwork.org>
> wrote:
>
>> 16.
>>
>> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav <ashraag...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I tried what you suggested and started the slave using the following
>>> command:
>>>
>>> start-slave.sh --cores 1 
>>>
>>> But it still seems to start as many pyspark daemons as the number of
>>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>>
>>> When you said it helped you and limited it to 2 processes in your
>>> cluster, how many cores did each machine have?
>>>
>>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <math...@closetwork.org>
>>> wrote:
>>>
>>>> It depends on what you want to do:
>>>>
>>>> If, on any given server, you don't want Spark to use more than one
>>>> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
>>>> --cores=1
>>>>
>>>> If you have a bunch of servers dedicated to Spark, but you don't want a
>>>> driver to use more than one core per server, then: spark.executor.cores=1
>>>> tells it not to use more than 1 core per server. However, it seems it will
>>>> start as many pyspark as there are cores, but maybe not use them.
>>>>
>>>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav <ashraag...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mathieu,
>>>>>
>>>>> Isn't that the same as setting "spark.executor.cores" to 1? And how
>>>>> can I specify "--cores=1" from the application?
>>>>>
>>>>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
>>>>> math...@closetwork.org> wrote:
>>>>>
>>>>>> When running the executor, put --cores=1. We use this and I only see
>>>>>> 2 pyspark process, one seem to be the parent of the other and is idle.
>>>>>>
>>>>>> In your case, are all pyspark process working?
>>>>>>
>>>>>> On Mon, Jul 4, 2016 at 3:15 AM ar7 <ashraag...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>>>>>> application
>>>>>>> is run, the load on the workers seems to go more than what was
>>>>>>> given. When I
>>>>>>> ran top, I noticed that there were too many Pyspark.daemons processes
>>>>>>> running. There was another mail thread regarding the same:
>>>>>>>
>>>>>>>
>>>>>>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>>>>>>
>>>>>>> I followed what was mentioned there, i.e. reduced the number of
>>>>>>> executor
>>>>>>> cores and number of executors in one node to 1. But the number of
>>>>>>> pyspark.daemons process is still not coming down. It looks like
>>>>>>> initially
>>>>>>> there is one Pyspark.daemons process and this in turn spawns as many
>>>>>>> pyspark.daemons processes as the number of cores in the machine.
>>>>>>>
>>>>>>> Any help is appreciated :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ashwin Raaghav.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>> --
>>>>>> Mathieu Longtin
>>>>>> 1-514-803-8977
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Ashwin Raaghav
>>>>>
>>>> --
>>>> Mathieu Longtin
>>>> 1-514-803-8977
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
16.

On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav <ashraag...@gmail.com> wrote:

> Hi,
>
> I tried what you suggested and started the slave using the following
> command:
>
> start-slave.sh --cores 1 
>
> But it still seems to start as many pyspark daemons as the number of cores
> in the node (1 parent and 3 workers). Limiting it via spark-env.sh file by
> giving SPARK_WORKER_CORES=1 also didn't help.
>
> When you said it helped you and limited it to 2 processes in your cluster,
> how many cores did each machine have?
>
> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <math...@closetwork.org>
> wrote:
>
>> It depends on what you want to do:
>>
>> If, on any given server, you don't want Spark to use more than one core,
>> use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1
>>
>> If you have a bunch of servers dedicated to Spark, but you don't want a
>> driver to use more than one core per server, then: spark.executor.cores=1
>> tells it not to use more than 1 core per server. However, it seems it will
>> start as many pyspark as there are cores, but maybe not use them.
>>
>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav <ashraag...@gmail.com>
>> wrote:
>>
>>> Hi Mathieu,
>>>
>>> Isn't that the same as setting "spark.executor.cores" to 1? And how can
>>> I specify "--cores=1" from the application?
>>>
>>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <math...@closetwork.org>
>>> wrote:
>>>
>>>> When running the executor, put --cores=1. We use this and I only see 2
>>>> pyspark process, one seem to be the parent of the other and is idle.
>>>>
>>>> In your case, are all pyspark process working?
>>>>
>>>> On Mon, Jul 4, 2016 at 3:15 AM ar7 <ashraag...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>>>> application
>>>>> is run, the load on the workers seems to go more than what was given.
>>>>> When I
>>>>> ran top, I noticed that there were too many Pyspark.daemons processes
>>>>> running. There was another mail thread regarding the same:
>>>>>
>>>>>
>>>>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>>>>
>>>>> I followed what was mentioned there, i.e. reduced the number of
>>>>> executor
>>>>> cores and number of executors in one node to 1. But the number of
>>>>> pyspark.daemons process is still not coming down. It looks like
>>>>> initially
>>>>> there is one Pyspark.daemons process and this in turn spawns as many
>>>>> pyspark.daemons processes as the number of cores in the machine.
>>>>>
>>>>> Any help is appreciated :)
>>>>>
>>>>> Thanks,
>>>>> Ashwin Raaghav.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>>> Mathieu Longtin
>>>> 1-514-803-8977
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
It depends on what you want to do:

If, on any given server, you don't want Spark to use more than one core,
use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1

If you have a bunch of servers dedicated to Spark, but you don't want a
driver to use more than one core per server, then spark.executor.cores=1
tells it not to use more than 1 core per server. However, it seems it will
still start as many pyspark daemons as there are cores, but maybe not use them.
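
A sketch of that second, per-application variant (the application file name
is a placeholder):

  spark-submit --conf spark.executor.cores=1 my_app.py

  # or equivalently in conf/spark-defaults.conf
  # spark.executor.cores   1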

On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav <ashraag...@gmail.com> wrote:

> Hi Mathieu,
>
> Isn't that the same as setting "spark.executor.cores" to 1? And how can I
> specify "--cores=1" from the application?
>
> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <math...@closetwork.org>
> wrote:
>
>> When running the executor, put --cores=1. We use this and I only see 2
>> pyspark process, one seem to be the parent of the other and is idle.
>>
>> In your case, are all pyspark process working?
>>
>> On Mon, Jul 4, 2016 at 3:15 AM ar7 <ashraag...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>> application
>>> is run, the load on the workers seems to go more than what was given.
>>> When I
>>> ran top, I noticed that there were too many Pyspark.daemons processes
>>> running. There was another mail thread regarding the same:
>>>
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>>
>>> I followed what was mentioned there, i.e. reduced the number of executor
>>> cores and number of executors in one node to 1. But the number of
>>> pyspark.daemons process is still not coming down. It looks like initially
>>> there is one Pyspark.daemons process and this in turn spawns as many
>>> pyspark.daemons processes as the number of cores in the machine.
>>>
>>> Any help is appreciated :)
>>>
>>> Thanks,
>>> Ashwin Raaghav.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
When running the executor, put --cores=1. We use this and I only see 2
pyspark processes; one seems to be the parent of the other and is idle.

In your case, are all pyspark processes working?

On Mon, Jul 4, 2016 at 3:15 AM ar7 <ashraag...@gmail.com> wrote:

> Hi,
>
> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
> application
> is run, the load on the workers seems to go more than what was given. When
> I
> ran top, I noticed that there were too many Pyspark.daemons processes
> running. There was another mail thread regarding the same:
>
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>
> I followed what was mentioned there, i.e. reduced the number of executor
> cores and number of executors in one node to 1. But the number of
> pyspark.daemons process is still not coming down. It looks like initially
> there is one Pyspark.daemons process and this in turn spawns as many
> pyspark.daemons processes as the number of cores in the machine.
>
> Any help is appreciated :)
>
> Thanks,
> Ashwin Raaghav.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Mathieu Longtin
1-514-803-8977


Re: Do tasks from the same application run in different JVMs

2016-06-29 Thread Mathieu Longtin
Same JVMs. Tasks from the same application run as threads inside that
application's executor JVMs; only tasks from different applications get
separate executor JVMs.

On Wed, Jun 29, 2016 at 8:48 AM Huang Meilong <ims...@outlook.com> wrote:

> Hi,
>
> In spark, tasks from different applications run in different JVMs, then
> what about tasks from the same application?
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Reporting warnings from workers

2016-06-16 Thread Mathieu Longtin
It turns out you can easily use a Python set, so I can send back a list of
failed files. Thanks.
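
A minimal sketch of that approach with a custom accumulator; the helper
process() and the idea that each record is a file path are made up for
illustration:

  from pyspark.accumulators import AccumulatorParam

  class SetAccumulatorParam(AccumulatorParam):
      def zero(self, initial):
          return set()
      def addInPlace(self, a, b):
          a |= b          # merge the two sets
          return a

  failed_files = sc.accumulator(set(), SetAccumulatorParam())

  def somefunction(path):
      try:
          return process(path)            # normal case (hypothetical helper)
      except IOError:
          failed_files.add({path})        # add() expects a set here
          return None

  newrdd = rdd.map(somefunction)
  newrdd.count()                          # run an action so the tasks execute
  print(failed_files.value)               # the set of failed files, on the driver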

On Wed, Jun 15, 2016 at 4:28 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Have you looked at:
>
> https://spark.apache.org/docs/latest/programming-guide.html#accumulators
>
> On Wed, Jun 15, 2016 at 1:24 PM, Mathieu Longtin <math...@closetwork.org>
> wrote:
>
>> Is there a way to report warnings from the workers back to the driver
>> process?
>>
>> Let's say I have an RDD and do this:
>>
>> newrdd = rdd.map(somefunction)
>>
>> In *somefunction*, I want to catch when there are invalid values in *rdd
>> *and either put them in another RDD or send some sort of message back.
>>
>> Is that possible?
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
> --
Mathieu Longtin
1-514-803-8977


Reporting warnings from workers

2016-06-15 Thread Mathieu Longtin
Is there a way to report warnings from the workers back to the driver
process?

Let's say I have an RDD and do this:

newrdd = rdd.map(somefunction)

In *somefunction*, I want to catch when there are invalid values in *rdd *and
either put them in another RDD or send some sort of message back.

Is that possible?
-- 
Mathieu Longtin
1-514-803-8977


Re: Not able to write output to local filsystem from Standalone mode.

2016-05-25 Thread Mathieu Longtin
Experience. I don't use Mesos or Yarn or Hadoop, so I don't know.


On Wed, May 25, 2016 at 2:51 AM Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Mathieu,
>
> Thanks a lot for the answer! I did *not* know it's the driver to
> create the directory.
>
> You said "standalone mode", is this the case for the other modes -
> yarn and mesos?
>
> p.s. Did you find it in the code or...just experienced before? #curious
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, May 24, 2016 at 4:04 PM, Mathieu Longtin <math...@closetwork.org>
> wrote:
> > In standalone mode, executor assume they have access to a shared file
> > system. The driver creates the directory and the executor write files, so
> > the executors end up not writing anything since there is no local
> directory.
> >
> > On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi <stutiawas...@hcl.com>
> wrote:
> >>
> >> hi Jacek,
> >>
> >> Parent directory already present, its my home directory. Im using Linux
> >> (Redhat) machine 64 bit.
> >> Also I noticed that "test1" folder is created in my master with
> >> subdirectory as "_temporary" which is empty. but on slaves, no such
> >> directory is created under /home/stuti.
> >>
> >> Thanks
> >> Stuti
> >> 
> >> From: Jacek Laskowski [ja...@japila.pl]
> >> Sent: Tuesday, May 24, 2016 5:27 PM
> >> To: Stuti Awasthi
> >> Cc: user
> >> Subject: Re: Not able to write output to local filsystem from Standalone
> >> mode.
> >>
> >> Hi,
> >>
> >> What happens when you create the parent directory /home/stuti? I think
> the
> >> failure is due to missing parent directories. What's the OS?
> >>
> >> Jacek
> >>
> >> On 24 May 2016 11:27 a.m., "Stuti Awasthi" <stutiawas...@hcl.com>
> wrote:
> >>
> >> Hi All,
> >>
> >> I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
> >> Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
> >> shell , read the input file from local filesystem and perform
> transformation
> >> successfully. When I try to write my output in local filesystem path
> then I
> >> receive below error .
> >>
> >>
> >>
> >> I tried to search on web and found similar Jira :
> >> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
> >> resolved for Spark 1.3+ but already people have posted the same issue
> still
> >> persists in latest versions.
> >>
> >>
> >>
> >> ERROR
> >>
> >> scala> data.saveAsTextFile("/home/stuti/test1")
> >>
> >> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID
> 2,
> >> server1): java.io.IOException: The temporary job-output directory
> >> file:/home/stuti/test1/_temporary doesn't exist!
> >>
> >> at
> >>
> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
> >>
> >> at
> >>
> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
> >>
> >> at
> >>
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
> >>
> >> at
> >> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
> >>
> >> at
> >>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
> >>
> >> at
> >>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
> >>
> >> at
> >> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> >>
> >> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> >>
> >> at
> >> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> >>
> >> at
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>
> >> at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>

Re: Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Mathieu Longtin
I have been seeing the same behavior in standalone with a master.

On Tue, May 24, 2016 at 3:08 PM Pradeep Nayak <pradeep1...@gmail.com> wrote:

>
>
> I have posted the same question of Stack Overflow:
> http://stackoverflow.com/questions/37421852/spark-submit-continues-to-hang-after-job-completion
>
> I am trying to test spark 1.6 with hdfs in AWS. I am using the wordcount
> python example available in the examples folder. I submit the job with
> spark-submit, the job completes successfully and its prints the results on
> the console as well. The web-UI also says its completed. However the
> spark-submit never terminates. I have verified that the context is stopped
> in the word count example code as well.
>
> What could be wrong ?
>
> This is what I see on the console.
>
>
> 6-05-24 14:58:04,749 INFO  [Thread-3] handler.ContextHandler 
> (ContextHandler.java:doStop(843)) - stopped 
> o.s.j.s.ServletContextHandler{/stages/stage,null}2016-05-24 14:58:04,749 INFO 
>  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/stages/json,null}2016-05-24 
> 14:58:04,749 INFO  [Thread-3] handler.ContextHandler 
> (ContextHandler.java:doStop(843)) - stopped 
> o.s.j.s.ServletContextHandler{/stages,null}2016-05-24 14:58:04,749 INFO  
> [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - stopped 
> o.s.j.s.ServletContextHandler{/jobs/job/json,null}2016-05-24 14:58:04,750 
> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/jobs/job,null}2016-05-24 14:58:04,750 
> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/jobs/json,null}2016-05-24 14:58:04,750 
> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/jobs,null}2016-05-24 14:58:04,802 INFO 
>  [Thread-3] ui.SparkUI (Logging.scala:logInfo(58)) - Stopped Spark web UI at 
> http://172.30.2.239:40402016-05-24 14:58:04,805 INFO  [Thread-3] 
> cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) - Shutting 
> down all executors2016-05-24 14:58:04,805 INFO  [dispatcher-event-loop-2] 
> cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) - Asking each 
> executor to shut down2016-05-24 14:58:04,814 INFO  [dispatcher-event-loop-5] 
> spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(58)) - 
> MapOutputTrackerMasterEndpoint stopped!2016-05-24 14:58:04,818 INFO  
> [Thread-3] storage.MemoryStore (Logging.scala:logInfo(58)) - MemoryStore 
> cleared2016-05-24 14:58:04,818 INFO  [Thread-3] storage.BlockManager 
> (Logging.scala:logInfo(58)) - BlockManager stopped2016-05-24 14:58:04,820 
> INFO  [Thread-3] storage.BlockManagerMaster (Logging.scala:logInfo(58)) - 
> BlockManagerMaster stopped2016-05-24 14:58:04,821 INFO  
> [dispatcher-event-loop-3] 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint 
> (Logging.scala:logInfo(58)) - OutputCommitCoordinator stopped!2016-05-24 
> 14:58:04,824 INFO  [Thread-3] spark.SparkContext (Logging.scala:logInfo(58)) 
> - Successfully stopped SparkContext2016-05-24 14:58:04,827 INFO  
> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
> remote.RemoteActorRefProvider$RemotingTerminator 
> (Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote daemon.2016-05-24 
> 14:58:04,828 INFO  [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
> remote.RemoteActorRefProvider$RemotingTerminator 
> (Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down; proceeding 
> with flushing remote transports.2016-05-24 14:58:04,843 INFO  
> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
> remote.RemoteActorRefProvider$RemotingTerminator 
> (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting shut down.
>
>
> I have to do a ctrl-c to terminate the spark-submit process. This is
> really a weird problem and I have no idea how to fix this. Please let me
> know if there are any logs I should be looking at, or doing things
> differently here.
>
>
> --
Mathieu Longtin
1-514-803-8977


Re: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Mathieu Longtin
In standalone mode, executors assume they have access to a shared file
system. The driver creates the output directory and the executors write the
files, so the executors end up not writing anything since the directory does
not exist locally on their nodes.
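
Two possible workarounds, sketched in PySpark (the paths are placeholders):
either write to a directory that is mounted on every node, or pull a small
result back to the driver and write it there.

  # option 1: every node can see /shared_nfs, so every executor can write its part
  data.saveAsTextFile("file:///shared_nfs/stuti/test1")

  # option 2: only for results that fit in driver memory
  with open("/home/stuti/test1.txt", "w") as out:
      for line in data.collect():
          out.write(str(line) + "\n")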

On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi <stutiawas...@hcl.com> wrote:

> hi Jacek,
>
> Parent directory already present, its my home directory. Im using Linux
> (Redhat) machine 64 bit.
> Also I noticed that "test1" folder is created in my master with
> subdirectory as "_temporary" which is empty. but on slaves, no such
> directory is created under /home/stuti.
>
> Thanks
> Stuti
> --
> *From:* Jacek Laskowski [ja...@japila.pl]
> *Sent:* Tuesday, May 24, 2016 5:27 PM
> *To:* Stuti Awasthi
> *Cc:* user
> *Subject:* Re: Not able to write output to local filsystem from
> Standalone mode.
>
> Hi,
>
> What happens when you create the parent directory /home/stuti? I think the
> failure is due to missing parent directories. What's the OS?
>
> Jacek
> On 24 May 2016 11:27 a.m., "Stuti Awasthi" <stutiawas...@hcl.com> wrote:
>
> Hi All,
>
> I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
> Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
> shell , read the input file from local filesystem and perform
> transformation successfully. When I try to write my output in local
> filesystem path then I receive below error .
>
>
>
> I tried to search on web and found similar Jira :
> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
> resolved for Spark 1.3+ but already people have posted the same issue still
> persists in latest versions.
>
>
>
> *ERROR*
>
> scala> data.saveAsTextFile("/home/stuti/test1")
>
> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,
> server1): java.io.IOException: The temporary job-output directory
> file:/home/stuti/test1/_temporary doesn't exist!
>
> at
> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>
> at
> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
>
> at
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>
> at
> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
>
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
>
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> What is the best way to resolve this issue if suppose I don’t want to have
> Hadoop installed OR is it mandatory to have Hadoop to write the output from
> Standalone cluster mode.
>
>
>
> Please suggest.
>
>
>
> Thanks 
>
> Stuti Awasthi
>
>
>
>
>
> ::DISCLAIMER::
>
> 
>
> The contents of this e-mail and any attachment(s) are confidential and
> intended for the named recipient(s) only.
> E-mail transmission is not guaranteed to be secure or error-free as
> information could be intercepted, corrupted,
> lost, destroyed, arrive late or incomplete, or may contain viruses in
> transmission. The e mail and its contents
> (with or without referred errors) shall therefore not attach any liability
> on the originator or HCL or its affiliates.
> Views or opinions, if any, presented in this email are solely those of the
> author and may not necessarily reflect the
> views or opinions of HCL or its affiliates. Any form of reproduction,
> dissemination, copying, disclosure, modification,
> distribution and / or publication of this message without the prior
> written consent of authorized representative of
> HCL is strictly prohibited. If you have received this email in error
> please delete it and notify the sender immediately.
> Before opening any email and/or attachments, please check them for viruses
> and other defects.
>
>
> 
>
> - To
> unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org

-- 
Mathieu Longtin
1-514-803-8977


Re: How to set the degree of parallelism in Spark SQL?

2016-05-23 Thread Mathieu Longtin
Since the default is 200, I would guess you're only running 2 executors.
Try to verify how many executors you are actually running with the web
interface (port 8080 on the host where the master is running).
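
For reference, in Spark 1.6 the setting itself can be applied either way
(note the equals sign, which Ted points out below):

  sqlContext.sql("SET spark.sql.shuffle.partitions=200")
  # or
  sqlContext.setConf("spark.sql.shuffle.partitions", "200")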

On Sat, May 21, 2016 at 11:42 PM Ted Yu <yuzhih...@gmail.com> wrote:

> Looks like an equal sign is missing between partitions and 200.
>
> On Sat, May 21, 2016 at 8:31 PM, SRK <swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> How to set the degree of parallelism in Spark SQL? I am using the
>> following
>> but it somehow seems to allocate only two executors at a time.
>>
>>  sqlContext.sql(" set spark.sql.shuffle.partitions  200  ")
>>
>> Thanks,
>> Swetha
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
> --
Mathieu Longtin
1-514-803-8977


Re: Starting executor without a master

2016-05-20 Thread Mathieu Longtin
Correct, what I do to start workers is the equivalent of start-slaves.sh.
It ends up running the same command on the worker servers as start-slaves
does.

It definitely uses all workers, and workers starting later pick up work as
well. If you have a long-running job, you can add workers dynamically and
they will pick up work as long as there are enough partitions to go around.

I set spark.locality.wait to 0 so that workers never wait to pick up tasks.
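
For reference, that setting can live in spark-defaults.conf or be passed per
job (the application file name is a placeholder):

  # conf/spark-defaults.conf
  spark.locality.wait   0

  # or on the command line
  spark-submit --conf spark.locality.wait=0 my_app.py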



On Fri, May 20, 2016 at 2:57 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> OK this is basically form my notes for Spark standalone. Worker process is
> the slave process
>
> [image: Inline images 2]
>
>
>
> You start worker as you showed
>
> $SPARK_HOME/sbin/start-slaves.sh
> Now that picks up the worker host node names from the $SPARK_HOME/conf/slaves
> file. So you still have to tell Spark where to run workers.
>
> However, if I am correct regardless of what you have specified in slaves,
> in this standalone mode there will not be any spark process spawned by the
> driver on the slaves. In all probability you will be running one
> spark-submit process on the driver node. You can see this through the
> output of
>
> jps|grep SparkSubmit
>
> and you will see the details by running jmonitor for that SparkSubmit job
>
> However, I still doubt whether Scheduling Across applications is feasible
> in standalone mode.
>
> The doc says
>
> *Standalone mode:* By default, applications submitted to the standalone
> mode cluster will run in FIFO (first-in-first-out) order, and each
> application will try to use *all available nodes*. You can limit the
> number of nodes an application uses by setting the spark.cores.max
> configuration property in it, or change the default for applications that
> don’t set this setting through spark.deploy.defaultCores. Finally, in
> addition to controlling cores, each application’s spark.executor.memory
> setting controls its memory use.
>
> It uses the words "all available nodes", but I am not convinced it will use
> those nodes. Can someone clarify this?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 May 2016 at 02:03, Mathieu Longtin <math...@closetwork.org> wrote:
>
>> Okay:
>> *host=my.local.server*
>> *port=someport*
>>
>> This is the spark-submit command, which runs on my local server:
>> *$SPARK_HOME/bin/spark-submit --master spark://$host:$port
>> --executor-memory 4g python-script.py with args*
>>
>> If I want 200 worker cores, I tell the cluster scheduler to run this
>> command on 200 cores:
>> *$SPARK_HOME/sbin/start-slave.sh --cores=1 --memory=4g
>> spark://$host:$port *
>>
>> That's it. When the task starts, it uses all available workers. If for
>> some reason, not enough cores are available immediately, it still starts
>> processing with whatever it gets and the load will be spread further as
>> workers come online.
>>
>>
>> On Thu, May 19, 2016 at 8:24 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> In a normal operation we tell spark which node the worker processes can
>>> run by adding the nodenames to conf/slaves.
>>>
>>> Not very clear on this in your case all the jobs run locally with say
>>> 100 executor cores like below:
>>>
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>
>>> --master local[*] \
>>>
>>> --driver-memory xg \  --default would be 512M
>>>
>>> --num-executors=1 \   -- This is the constraint in
>>> stand-alone Spark cluster, whether specified or not
>>>
>>> --executor-memory=xG \ --
>>>
>>> --executor-cores=n \
>>>
>>> --master local[*] means all cores and --executor-cores in your case need
>>> not be specified? or you can cap it like above --executor-cores=n. If
>>> it is not specified then the Spark app will go and grab every core.
>>> Although in practice that does not happen it is just an upper ceiling. It
>>> is FIFO.
>>>
>>> What typical executor memory is specified in your case?
>>>
>>> Do you have a  sample snapshot of spark-submit job by any chance Mathieu?
>>>
>>> Cheers
>>>
>>>
>>> Dr Mich Talebzadeh
>>>

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
I'm looking to bypass the master entirely. I manage the workers outside of
Spark. So I want to start the driver, then start workers that connect
directly to the driver.

Anyway, it looks like I will have to live with our current solution for a
while.

On Thu, May 19, 2016 at 8:32 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> Hi Mathieu,
>
> There's nothing like that in Spark currently. For that, you'd need a
> new cluster manager implementation that knows how to start executors
> in those remote machines (e.g. by running ssh or something).
>
> In the current master there's an interface you can implement to try
> that if you really want to (ExternalClusterManager), but it's
> currently "private[spark]" and it probably wouldn't be a very simple
> task.
>
>
> On Thu, May 19, 2016 at 10:45 AM, Mathieu Longtin
> <math...@closetwork.org> wrote:
> > First a bit of context:
> > We use Spark on a platform where each user starts workers as needed. This
> > has the advantage that all permission management is handled by the OS, so
> > the users can only read files they have permission to.
> >
> > To do this, we have some utility that does the following:
> > - start a master
> > - start worker managers on a number of servers
> > - "submit" the Spark driver program
> > - the driver then talks to the master, telling it how many executors it
> > needs
> > - the master tells the worker nodes to start executors and talk to the
> > driver
> > - the executors are started
> >
> > From here on, the master doesn't do much, and neither does the process
> > manager on the worker nodes.
> >
> > What I would like to do is simplify this to:
> > - Start the driver program
> > - Start executors on a number of servers, telling them where to find the
> > driver
> > - The executors connect directly to the driver
> >
> > Is there a way I could do this without the master and worker managers?
> >
> > Thanks!
> >
> >
> > --
> > Mathieu Longtin
> > 1-514-803-8977
>
>
>
> --
> Marcelo
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Okay:
*host=my.local.server*
*port=someport*

This is the spark-submit command, which runs on my local server:
*$SPARK_HOME/bin/spark-submit --master spark://$host:$port
--executor-memory 4g python-script.py with args*

If I want 200 worker cores, I tell the cluster scheduler to run this
command on 200 cores:
*$SPARK_HOME/sbin/start-slave.sh --cores=1 --memory=4g spark://$host:$port *

That's it. When the task starts, it uses all available workers. If for some
reason, not enough cores are available immediately, it still starts
processing with whatever it gets and the load will be spread further as
workers come online.


On Thu, May 19, 2016 at 8:24 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> In a normal operation we tell spark which node the worker processes can
> run by adding the nodenames to conf/slaves.
>
> Not very clear on this in your case all the jobs run locally with say 100
> executor cores like below:
>
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[*] \
>
> --driver-memory xg \  --default would be 512M
>
> --num-executors=1 \   -- This is the constraint in
> stand-alone Spark cluster, whether specified or not
>
> --executor-memory=xG \ --
>
> --executor-cores=n \
>
> --master local[*] means all cores and --executor-cores in your case need
> not be specified? or you can cap it like above --executor-cores=n. If it
> is not specified then the Spark app will go and grab every core. Although
> in practice that does not happen it is just an upper ceiling. It is FIFO.
>
> What typical executor memory is specified in your case?
>
> Do you have a  sample snapshot of spark-submit job by any chance Mathieu?
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 May 2016 at 00:27, Mathieu Longtin <math...@closetwork.org> wrote:
>
>> Mostly, the resource management is not up to the Spark master.
>>
>> We routinely start 100 executor-cores for 5 minute job, and they just
>> quit when they are done. Then those processor cores can do something else
>> entirely, they are not reserved for Spark at all.
>>
>> On Thu, May 19, 2016 at 4:55 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Then in theory every user can fire multiple spark-submit jobs. do you
>>> cap it with settings in  $SPARK_HOME/conf/spark-defaults.conf , but I
>>> guess in reality every user submits one job only.
>>>
>>> This is an interesting model for two reasons:
>>>
>>>
>>>- It uses parallel processing across all the nodes or most of the
>>>nodes to minimise the processing time
>>>- it requires less intervention
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 21:33, Mathieu Longtin <math...@closetwork.org> wrote:
>>>
>>>> Driver memory is default. Executor memory depends on job, the caller
>>>> decides how much memory to use. We don't specify --num-executors as we want
>>>> all cores assigned to the local master, since they were started by the
>>>> current user. No local executor.  --master=spark://localhost:someport. 1
>>>> core per executor.
>>>>
>>>> On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks Mathieu
>>>>>
>>>>> So it would be interesting to see what resources allocated in your
>>>>> case, especially the num-executors and executor-cores. I gather every node
>>>>> has enough memory and cores.
>>>>>
>>>>>
>>>>>
>>>>> ${SPARK_HOME}/bin/spark-submit \
>>>>>
>>>>> --master local[2] \
>>>>>
>>>>> --driver-memory 4g \
>>>>>
>>>>> --num-executors=1 \
>>>>>
>>>>>  

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Mostly, the resource management is not up to the Spark master.

We routinely start 100 executor-cores for 5 minute job, and they just quit
when they are done. Then those processor cores can do something else
entirely, they are not reserved for Spark at all.

On Thu, May 19, 2016 at 4:55 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Then in theory every user can fire multiple spark-submit jobs. Do you cap
> it with settings in $SPARK_HOME/conf/spark-defaults.conf? I guess in
> reality every user submits one job only.
>
> This is an interesting model for two reasons:
>
>
>- It uses parallel processing across all the nodes or most of the
>nodes to minimise the processing time
>- it requires less intervention
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 21:33, Mathieu Longtin <math...@closetwork.org> wrote:
>
>> Driver memory is default. Executor memory depends on job, the caller
>> decides how much memory to use. We don't specify --num-executors as we want
>> all cores assigned to the local master, since they were started by the
>> current user. No local executor.  --master=spark://localhost:someport. 1
>> core per executor.
>>
>> On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Mathieu
>>>
>>> So it would be interesting to see what resources are allocated in your case,
>>> especially the num-executors and executor-cores. I gather every node has
>>> enough memory and cores.
>>>
>>>
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>
>>> --master local[2] \
>>>
>>> --driver-memory 4g \
>>>
>>> --num-executors=1 \
>>>
>>> --executor-memory=4G \
>>>
>>> --executor-cores=2 \
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 21:02, Mathieu Longtin <math...@closetwork.org> wrote:
>>>
>>>> The driver (the process started by spark-submit) runs locally. The
>>>> executors run on any of thousands of servers. So far, I haven't tried more
>>>> than 500 executors.
>>>>
>>>> Right now, I run a master on the same server as the driver.
>>>>
>>>> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> OK, so you are using some form of NFS-mounted file system shared among
>>>>> the nodes, and basically you start the processes through spark-submit.
>>>>>
>>>>> In standalone mode, a simple cluster manager included with Spark does
>>>>> the management of resources, so it is not clear to me what you are
>>>>> referring to as the worker manager here?
>>>>>
>>>>> This is my take on your model:
>>>>> The application will go and grab all the cores in the cluster.
>>>>> You only have one worker that lives within the driver JVM process.
>>>>> The Driver node runs on the same host that the cluster manager is
>>>>> running on. The Driver requests resources from the Cluster Manager to run
>>>>> tasks. In this case there is only one executor for the Driver? The
>>>>> Executor runs tasks for the Driver.
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Driver memory is the default. Executor memory depends on the job; the caller
decides how much memory to use. We don't specify --num-executors, as we want
all cores assigned to the local master, since they were started by the
current user. No local executor. --master=spark://localhost:someport, 1
core per executor.
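Purely as an illustration (the port and the memory figure below are invented,
and this is not necessarily how Mathieu's utility does it), the same
per-executor sizing can also be expressed in the driver program instead of on
the spark-submit command line:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://localhost:7077")  # master running on the driver's host
        .setAppName("one-core-executors")
        .set("spark.executor.cores", "1")     # 1 core per executor
        .set("spark.executor.memory", "4g"))  # chosen per job by the caller
sc = SparkContext(conf=conf)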

On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Mathieu
>
> So it would be interesting to see what resources are allocated in your case,
> especially the num-executors and executor-cores. I gather every node has
> enough memory and cores.
>
>
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[2] \
>
> --driver-memory 4g \
>
> --num-executors=1 \
>
> --executor-memory=4G \
>
> --executor-cores=2 \
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 21:02, Mathieu Longtin <math...@closetwork.org> wrote:
>
>> The driver (the process started by spark-submit) runs locally. The
>> executors run on any of thousands of servers. So far, I haven't tried more
>> than 500 executors.
>>
>> Right now, I run a master on the same server as the driver.
>>
>> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK, so you are using some form of NFS-mounted file system shared among
>>> the nodes, and basically you start the processes through spark-submit.
>>>
>>> In standalone mode, a simple cluster manager included with Spark does
>>> the management of resources, so it is not clear to me what you are
>>> referring to as the worker manager here?
>>>
>>> This is my take on your model:
>>> The application will go and grab all the cores in the cluster.
>>> You only have one worker that lives within the driver JVM process.
>>> The Driver node runs on the same host that the cluster manager is
>>> running on. The Driver requests resources from the Cluster Manager to run
>>> tasks. In this case there is only one executor for the Driver? The Executor
>>> runs tasks for the Driver.
>>>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 20:37, Mathieu Longtin <math...@closetwork.org> wrote:
>>>
>>>> No master and no node manager, just the processes that do actual work.
>>>>
>>>> We use the "stand alone" version because we have a shared file system
>>>> and a way of allocating computing resources already (Univa Grid Engine). If
>>>> an executor were to die, we have other ways of restarting it, we don't need
>>>> the worker manager to deal with it.
>>>>
>>>> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Mathieu
>>>>>
>>>>> What does this approach provide that the norm lacks?
>>>>>
>>>>> So basically each node has its master in this model.
>>>>>
>>>>> Are these supposed to be individual stand alone servers?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 19 May 2016 at 18:45, Mathieu Longtin <math...@closetwork.org>
>>>>> wrote:
>>>>>
>>>>>> First a bit of context:
>>>>>> We use Spark on a platform where each us

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
The driver (the process started by spark-submit) runs locally. The
executors run on any of thousands of servers. So far, I haven't tried more
than 500 executors.

Right now, I run a master on the same server as the driver.

On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> OK, so you are using some form of NFS-mounted file system shared among the
> nodes, and basically you start the processes through spark-submit.
>
> In standalone mode, a simple cluster manager included with Spark does
> the management of resources, so it is not clear to me what you are
> referring to as the worker manager here?
>
> This is my take on your model:
> The application will go and grab all the cores in the cluster.
> You only have one worker that lives within the driver JVM process.
> The Driver node runs on the same host that the cluster manager is running on.
> The Driver requests resources from the Cluster Manager to run tasks. In this
> case there is only one executor for the Driver? The Executor runs tasks for
> the Driver.
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 20:37, Mathieu Longtin <math...@closetwork.org> wrote:
>
>> No master and no node manager, just the processes that do actual work.
>>
>> We use the "stand alone" version because we have a shared file system and
>> a way of allocating computing resources already (Univa Grid Engine). If an
>> executor were to die, we have other ways of restarting it, we don't need
>> the worker manager to deal with it.
>>
>> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Mathieu
>>>
>>> What does this approach provide that the norm lacks?
>>>
>>> So basically each node has its master in this model.
>>>
>>> Are these supposed to be individual stand alone servers?
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 18:45, Mathieu Longtin <math...@closetwork.org> wrote:
>>>
>>>> First a bit of context:
>>>> We use Spark on a platform where each user starts workers as needed.
>>>> This has the advantage that all permission management is handled by the OS,
>>>> so the users can only read files they have permission to.
>>>>
>>>> To do this, we have some utility that does the following:
>>>> - start a master
>>>> - start worker managers on a number of servers
>>>> - "submit" the Spark driver program
>>>> - the driver then talks to the master and tells it how many executors it
>>>> needs
>>>> - the master tells the worker nodes to start executors and talk to the
>>>> driver
>>>> - the executors are started
>>>>
>>>> From here on, the master doesn't do much, and neither do the process
>>>> managers on the worker nodes.
>>>>
>>>> What I would like to do is simplify this to:
>>>> - Start the driver program
>>>> - Start executors on a number of servers, telling them where to find
>>>> the driver
>>>> - The executors connect directly to the driver
>>>>
>>>> Is there a way I could do this without the master and worker managers?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> --
>>>> Mathieu Longtin
>>>> 1-514-803-8977
>>>>
>>>
>>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
> --
Mathieu Longtin
1-514-803-8977


Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
No master and no node manager, just the processes that do actual work.

We use the "stand alone" version because we have a shared file system and a
way of allocating computing resources already (Univa Grid Engine). If an
executor were to die, we have other ways of restarting it, we don't need
the worker manager to deal with it.

On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Mathieu
>
> What does this approach provide that the norm lacks?
>
> So basically each node has its master in this model.
>
> Are these supposed to be individual stand alone servers?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 18:45, Mathieu Longtin <math...@closetwork.org> wrote:
>
>> First a bit of context:
>> We use Spark on a platform where each user starts workers as needed. This
>> has the advantage that all permission management is handled by the OS, so
>> the users can only read files they have permission to.
>>
>> To do this, we have some utility that does the following:
>> - start a master
>> - start worker managers on a number of servers
>> - "submit" the Spark driver program
>> - the driver then talks to the master and tells it how many executors it needs
>> - the master tells the worker nodes to start executors and talk to the
>> driver
>> - the executors are started
>>
>> From here on, the master doesn't do much, and neither do the process managers
>> on the worker nodes.
>>
>> What I would like to do is simplify this to:
>> - Start the driver program
>> - Start executors on a number of servers, telling them where to find the
>> driver
>> - The executors connect directly to the driver
>>
>> Is there a way I could do this without the master and worker managers?
>>
>> Thanks!
>>
>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
> --
Mathieu Longtin
1-514-803-8977


Re: support for golang

2016-05-14 Thread Mathieu Longtin
Considering that Pyspark is a very tightly integrated library rather than
an RPC integration, I doubt a Go integration would come any time soon.

On Fri, May 13, 2016 at 10:22 PM Sourav Chakraborty <soura...@gmail.com>
wrote:

> Folks,
>   Was curious to find out if anybody ever considered/attempted to support
> golang with spark .
>
> -Thanks
> Sourav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: pyspark mappartions ()

2016-05-14 Thread Mathieu Longtin
From memory:

def processor(iterator):
    for item in iterator:
        newitem = do_whatever(item)
        yield newitem

newdata = data.mapPartitions(processor)

Basically, your function takes an iterator as an argument, and must itself
be a generator (as above, with yield) or return an iterator.
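A self-contained usage sketch (the squaring function below is just a stand-in
for do_whatever):

from pyspark import SparkContext

sc = SparkContext("local[2]", "mappartitions-demo")

def processor(iterator):
    for item in iterator:
        yield item * item  # stand-in for do_whatever(item)

data = sc.parallelize(range(10), 4)     # 4 partitions
newdata = data.mapPartitions(processor)
print(newdata.collect())                # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
sc.stop()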

On Sat, May 14, 2016 at 12:39 AM Abi <analyst.tech.j...@gmail.com> wrote:

>
>
> On Tue, May 10, 2016 at 2:20 PM, Abi <analyst.tech.j...@gmail.com> wrote:
>
>> Is there any example of this ? I want to see how you write the the
>> iterable example
>
>
> --
Mathieu Longtin
1-514-803-8977


Re: Is this possible to do in spark ?

2016-05-12 Thread Mathieu Longtin
Make a function (or lambda) that reads the text file. Make an RDD with a
list of X/Y paths, then map that RDD through the file-reading function. Same
with your X/Y/Z directory. You then have RDDs with the content of each file
as a record. Work with those as needed.
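Assuming a file system that all the executors can see, a rough sketch of that
approach could look like this (the paths and the parsing are placeholders):

from pyspark import SparkContext

sc = SparkContext("local[*]", "per-file-read")

def read_file(path):
    # Runs on the executors; keep the path so each record knows its origin.
    with open(path) as f:
        return (path, f.read())

a_paths = ["/X/Y/a.txt"]      # in reality, one entry per day and system
b_paths = ["/X/Y/Z/b.txt"]

a_rdd = sc.parallelize(a_paths).map(read_file)
b_rdd = sc.parallelize(b_paths).map(read_file)
# Each record is now (path, contents); parse and join them as needed.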

On Wed, May 11, 2016 at 2:36 PM Pradeep Nayak <pradeep1...@gmail.com> wrote:

> Hi -
>
> I have a very unique problem which I am trying to solve and I am not sure
> if spark would help here.
>
> I have a directory: /X/Y/a.txt and in the same structure /X/Y/Z/b.txt.
>
> a.txt contains a unique serial number, say:
> 12345
>
> and b.txt contains key value pairs.
> a,1
> b,1,
> c,0 etc.
>
> Every day you receive data for a system Y, so there are multiple a.txt and
> b.txt for a serial number. The serial number doesn't change and that is the
> key. So there are multiple systems and the data of a whole year is
> available, and it is huge.
>
> I am trying to generate a report of unique serial numbers where the value
> of the option a has changed to 1 over the last few months. Let's say the
> default is 0. Also figure out how many times it was toggled.
>
>
> I am not sure how to read two text files in Spark at the same time and
> associate them with the serial number. Is there a way of doing this in
> place, given that we know the directory structure? Or should we be
> transforming the data anyway to solve this?
>
-- 
Mathieu Longtin
1-514-803-8977


Re: best fit - Dataframe and spark sql use cases

2016-05-10 Thread Mathieu Longtin
Spark SQL is translated to DataFrame operations by the SQL engine. Use
whichever is more comfortable for the task. Unless I'm doing something very
straightforward, I go with SQL, since any improvement to the SQL engine
will improve the resulting DataFrame operations. A hard-coded DataFrame
operation won't change even if a better operation becomes available.
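A small illustration of the same aggregation written both ways (Spark 1.x
API; the table name, column names, and data are invented):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "sql-vs-dataframe")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

# DataFrame API: hard-coded operations
result_df = df.groupBy("key").sum("value")

# Spark SQL: the engine translates this into equivalent DataFrame operations
df.registerTempTable("t")
result_sql = sqlContext.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key")

print(result_df.collect())
print(result_sql.collect())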

On Mon, May 9, 2016 at 10:37 PM Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Hi,
> I would like to know the uses cases where data frames is best fit and use
> cases where Spark SQL is best fit based on the one's  experience .
>
>
> Thanks,
> Divya
>
>
>
>
>
> --
Mathieu Longtin
1-514-803-8977


Re: removing header from csv file

2016-05-03 Thread Mathieu Longtin
This only works if the files are "unsplittable". For example, with gzip files
each partition is one file (if you have more partitions than files), so the
first line of each partition is the header.

The spark-csv extension reads the very first line of the RDD, assumes it's the
header, and then filters every occurrence of that line. Something like this
(Python code here, but Scala should be very similar):

header = data.first()
data = data.filter(lambda line: line != header)

Since I had lots of small CSV files, and not all of them have the same
exact header, I use the following:

file_list = sc.parallelize(list_of_csv)
data = file_list.flatMap(function_that_reads_csvs_and_extracts_the_columns_I_want)
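One possible shape for that reader function, purely as a sketch (the column
names are hypothetical); csv.DictReader consumes each file's own header, so
files with differing headers are handled per file:

import csv

def read_csv_keep_columns(path, wanted=("serial", "value")):
    # Runs on the executors; yields one tuple per data row, header excluded.
    with open(path) as f:
        for row in csv.DictReader(f):
            yield tuple(row[c] for c in wanted)

# file_list = sc.parallelize(list_of_csv)
# data = file_list.flatMap(read_csv_keep_columns)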




On Tue, May 3, 2016 at 3:23 AM Abhishek Anand <abhis.anan...@gmail.com>
wrote:

> You can use this function to remove the header from your
> dataset(applicable to RDD)
>
> def dropHeader(data: RDD[String]): RDD[String] = {
>   data.mapPartitionsWithIndex((idx, lines) => {
>     if (idx == 0) lines.drop(1) else lines
>   })
> }
>
>
> Abhi
>
> On Wed, Apr 27, 2016 at 12:55 PM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> If you are using the Scala API you can do
>> myRdd.zipWithIndex.filter(_._2 > 0).map(_._1)
>>
>> Maybe a little bit complicated, but it will do the trick.
>> As per spark-csv, you will get back a DataFrame which you can convert back
>> to an RDD.
>> Hth
>> Marco
>> On 27 Apr 2016 6:59 am, "nihed mbarek" <nihe...@gmail.com> wrote:
>>
>>> You can add a filter with a string that you are sure appears only in the
>>> header
>>>
>>> Le mercredi 27 avril 2016, Divya Gehlot <divya.htco...@gmail.com> a
>>> écrit :
>>>
>>>> Yes, you can remove the headers by removing the first row
>>>>
>>>> You can use first() or head() to do that
>>>>
>>>>
>>>> Thanks,
>>>> Divya
>>>>
>>>> On 27 April 2016 at 13:24, Ashutosh Kumar <kmr.ashutos...@gmail.com>
>>>> wrote:
>>>>
>>>>> I see there is a library, spark-csv, which can be used for removing the
>>>>> header and processing of CSV files. But it seems it works with SQLContext
>>>>> only. Is there a way to remove the header from CSV files without SQLContext?
>>>>>
>>>>> Thanks
>>>>> Ashutosh
>>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>> M'BAREK Med Nihed,
>>> Fedora Ambassador, TUNISIA, Northern Africa
>>> http://www.nihed.com
>>>
>>> <http://tn.linkedin.com/in/nihed>
>>>
>>>
>>> --
Mathieu Longtin
1-514-803-8977


Re: Transformation question

2016-04-27 Thread Mathieu Longtin
I would make a DataFrame (or Dataset) out of the RDD and use a SQL join.
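A rough sketch of that idea (Spark 1.x API; the schemas, join key, and
aggregation below are invented for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "rdd-to-dataframe-join")
sqlContext = SQLContext(sc)

# Entries to predict and historical data, as RDDs of tuples
to_predict = sqlContext.createDataFrame(
    sc.parallelize([(1, "x"), (2, "y")]), ["key", "features"])
history = sqlContext.createDataFrame(
    sc.parallelize([(1, 10.0), (1, 14.0), (2, 3.0)]), ["key", "label"])

# Average the historical values per key, then join back to the entries
avg_history = history.groupBy("key").avg("label")
result = to_predict.join(avg_history, "key")
result.show()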

On Wed, Apr 27, 2016 at 2:50 PM Eduardo <erocha@gmail.com> wrote:

> Is there a way to write a transformation that for each entry of an RDD
> uses certain other values of another RDD? As an example, imagine you have
> an RDD of entries to predict a certain label. In a second RDD, you have
> historical data. So for each entry in the first RDD, you want to find
> similar entries in the second RDD and take, let's say, the average. Does
> that fit the Spark model? Is there any alternative?
>
> Thanks in advance
>
-- 
Mathieu Longtin
1-514-803-8977