Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Thanks Phuong. But the point of my post is how to achieve this without using the
deprecated mllib package. The mllib package already has multinomial
regression built in.
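
A minimal sketch of the spark.ml route, using the OneVsRest wrapper that comes up
further down this thread. This is illustration only, against the Spark 1.6 API;
training and test are hypothetical DataFrames with the usual "label" and
"features" columns.

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Binary base classifier from spark.ml (not the deprecated mllib package).
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)

// OneVsRest trains one binary model per class, giving multiclass predictions.
val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = ovr.fit(training)            // training: DataFrame("label", "features")
val predictions = ovrModel.transform(test)  // adds a "prediction" column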

2016-05-28 21:19 GMT-07:00 Phuong LE-HONG :

> Dear Stephen,
>
> Yes, you're right, LogisticGradient is in the mllib package, not ml
> package. I just want to say that we can build a multinomial logistic
> regression model from the current version of Spark.
>
> Regards,
>
> Phuong
>
>
>
> On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch 
> wrote:
> > Hi Phuong,
> >The LogisticGradient exists in the mllib but not ml package. The
> > LogisticRegression chooses either the breeze LBFGS - if L2 only (not
> elastic
> > net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
> > otherwise: it does not appear to choose GD in either scenario.
> >
> > If I have misunderstood your response please do clarify.
> >
> > thanks stephenb
> >
> > 2016-05-28 20:55 GMT-07:00 Phuong LE-HONG :
> >>
> >> Dear Stephen,
> >>
> >> The Logistic Regression currently supports only binary regression.
> >> However, the LogisticGradient does support computing gradient and loss
> >> for a multinomial logistic regression. That is, you can train a
> >> multinomial logistic regression model with LogisticGradient and a
> >> class to solve optimization like LBFGS to get a weight vector of the
> >> size (numClasses-1)*numFeatures.
> >>
> >>
> >> Phuong
> >>
> >>
> >> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch 
> >> wrote:
> >> > Followup: just encountered the "OneVsRest" classifier in
> >> > ml.classification: I will look into using it with the binary
> >> > LogisticRegression as the provided classifier.
> >> >
> >> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch :
> >> >>
> >> >>
> >> >> Presently only the mllib version has the one-vs-all approach for
> >> >> multinomial support.  The ml version with ElasticNet support only
> >> >> allows
> >> >> binary regression.
> >> >>
> >> >> With feature parity of ml vs mllib having been stated as an objective
> >> >> for
> >> >> 2.0.0 -  is there a projected availability of the  multinomial
> >> >> regression in
> >> >> the ml package?
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> `
> >> >
> >> >
> >
> >
>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Phuong LE-HONG
Dear Stephen,

Yes, you're right, LogisticGradient is in the mllib package, not ml
package. I just want to say that we can build a multinomial logistic
regression model from the current version of Spark.

Regards,

Phuong



On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch  wrote:
> Hi Phuong,
>The LogisticGradient exists in the mllib but not ml package. The
> LogisticRegression chooses either the breeze LBFGS - if L2 only (not elastic
> net) and no regularization or the Orthant Wise Quasi Newton (OWLQN)
> otherwise: it does not appear to choose GD in either scenario.
>
> If I have misunderstood your response please do clarify.
>
> thanks stephenb
>
> 2016-05-28 20:55 GMT-07:00 Phuong LE-HONG :
>>
>> Dear Stephen,
>>
>> The Logistic Regression currently supports only binary regression.
>> However, the LogisticGradient does support computing gradient and loss
>> for a multinomial logistic regression. That is, you can train a
>> multinomial logistic regression model with LogisticGradient and a
>> class to solve optimization like LBFGS to get a weight vector of the
>> size (numClasses-1)*numFeatures.
>>
>>
>> Phuong
>>
>>
>> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch 
>> wrote:
>> > Followup: just encountered the "OneVsRest" classifier in
>> > ml.classification: I will look into using it with the binary
>> > LogisticRegression as the provided classifier.
>> >
>> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch :
>> >>
>> >>
>> >> Presently only the mllib version has the one-vs-all approach for
>> >> multinomial support.  The ml version with ElasticNet support only
>> >> allows
>> >> binary regression.
>> >>
>> >> With feature parity of ml vs mllib having been stated as an objective
>> >> for
>> >> 2.0.0 -  is there a projected availability of the  multinomial
>> >> regression in
>> >> the ml package?
>> >>
>> >>
>> >>
>> >>
>> >> `
>> >
>> >
>
>




Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Hi Phuong,
   The LogisticGradient exists in the mllib but not the ml package. The
LogisticRegression chooses either the breeze LBFGS (when the penalty is L2 only,
i.e. not elastic net, or there is no regularization) or the Orthant-Wise Quasi-Newton
(OWLQN) otherwise; it does not appear to choose GD in either scenario.
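
As a rough illustration of that switch against the Spark 1.6 spark.ml API (the
parameter values below are placeholders only):

import org.apache.spark.ml.classification.LogisticRegression

// elasticNetParam = 0.0 keeps the penalty pure L2, so the breeze LBFGS path is used.
val lrL2 = new LogisticRegression()
  .setRegParam(0.1)
  .setElasticNetParam(0.0)

// Any L1 component (elasticNetParam > 0.0 with regParam > 0.0) selects breeze OWLQN.
val lrElasticNet = new LogisticRegression()
  .setRegParam(0.1)
  .setElasticNetParam(0.5)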

If I have misunderstood your response please do clarify.

thanks stephenb

2016-05-28 20:55 GMT-07:00 Phuong LE-HONG :

> Dear Stephen,
>
> The Logistic Regression currently supports only binary regression.
> However, the LogisticGradient does support computing gradient and loss
> for a multinomial logistic regression. That is, you can train a
> multinomial logistic regression model with LogisticGradient and a
> class to solve optimization like LBFGS to get a weight vector of the
> size (numClasses-1)*numFeatures.
>
>
> Phuong
>
>
> On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch 
> wrote:
> > Followup: just encountered the "OneVsRest" classifier in
> > ml.classification: I will look into using it with the binary
> > LogisticRegression as the provided classifier.
> >
> > 2016-05-28 9:06 GMT-07:00 Stephen Boesch :
> >>
> >>
> >> Presently only the mllib version has the one-vs-all approach for
> >> multinomial support.  The ml version with ElasticNet support only allows
> >> binary regression.
> >>
> >> With feature parity of ml vs mllib having been stated as an objective
> for
> >> 2.0.0 -  is there a projected availability of the  multinomial
> regression in
> >> the ml package?
> >>
> >>
> >>
> >>
> >> `
> >
> >
>


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Phuong LE-HONG
Dear Stephen,

The Logistic Regression currently supports only binary regression.
However, the LogisticGradient does support computing the gradient and loss
for a multinomial logistic regression. That is, you can train a
multinomial logistic regression model with LogisticGradient and an
optimization solver such as LBFGS to get a weight vector of size
(numClasses-1)*numFeatures.
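
A rough sketch of that approach against the Spark 1.6 mllib API; the data here is a
toy stand-in, and only the wiring of LogisticGradient with the LBFGS solver is the
point (it assumes a live SparkContext named sc, e.g. the spark-shell):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.regression.LabeledPoint

// Toy training data with labels 0..numClasses-1.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.1)),
  LabeledPoint(1.0, Vectors.dense(0.2, 1.3)),
  LabeledPoint(2.0, Vectors.dense(1.1, 1.0))))

val numClasses = 3
val numFeatures = training.first().features.size

// LBFGS.runLBFGS works on (label, features) pairs.
val data = training.map(lp => (lp.label, lp.features))

val (weights, lossHistory) = LBFGS.runLBFGS(
  data,
  new LogisticGradient(numClasses),              // multinomial gradient and loss
  new SquaredL2Updater,                          // L2 regularization
  10,                                            // numCorrections
  1e-4,                                          // convergenceTol
  100,                                           // maxNumIterations
  0.01,                                          // regParam
  Vectors.zeros((numClasses - 1) * numFeatures)) // initial weights

// weights then has (numClasses - 1) * numFeatures entries, as described above.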


Phuong


On Sat, May 28, 2016 at 12:25 PM, Stephen Boesch  wrote:
> Followup: just encountered the "OneVsRest" classifier in
> ml.classification: I will look into using it with the binary
> LogisticRegression as the provided classifier.
>
> 2016-05-28 9:06 GMT-07:00 Stephen Boesch :
>>
>>
>> Presently only the mllib version has the one-vs-all approach for
>> multinomial support.  The ml version with ElasticNet support only allows
>> binary regression.
>>
>> With feature parity of ml vs mllib having been stated as an objective for
>> 2.0.0 -  is there a projected availability of the  multinomial regression in
>> the ml package?
>>
>>
>>
>>
>> `
>
>




Re: join function in a loop

2016-05-28 Thread heri wijayanto
I am sorry, we cannot divide the data set and process it separately. Does
it mean that I am overusing Spark for my data size, because it consumes a long
time to shuffle the data?

On Sun, May 29, 2016 at 8:53 AM, Ted Yu  wrote:

> Heri:
> Is it possible to partition your data set so that the number of rows
> involved in join is under control ?
>
> Cheers
>
> On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You are welcome
>>
>> Also you can use the OS command /usr/bin/free to see how much free memory
>> you have on each node.
>>
>> You should also see from SPARK GUI (first job on master node:4040, next
>> on 4041etc) the  resource and Storage (memory usage) for each SparkSubmit
>> job.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 01:16, heri wijayanto  wrote:
>>
>>> Thank you, Dr Mich Talebzadeh, I will capture the error messages, but
>>> currently, my cluster is running to do the other job. After it finished, I
>>> will try your suggestions
>>>
>>> On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 You should have errors in yarn-nodemanager and yarn-resourcemanager
 logs.

 Something like below for a healthy container

 2016-05-29 00:50:50,496 INFO
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Memory usage of ProcessTree 29769 for container-id
 container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory
 used; 2.7 GB of 8.4 GB virtual memory used

 It appears that you are running out of memory. Have you also checked
 with jps and jmonitor for SparkSubmit (the driver process) for the failing
 job? It will show you the resource usage= like memory/heap/cpu etc

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 29 May 2016 at 00:26, heri wijayanto  wrote:

> I implement spark with join function for processing in around 250
> million rows of text.
>
> When I just used several hundred of rows, it could run, but when I use
> the large data, it is failed.
>
> My spark version in 1.6.1, run above yarn-cluster mode, and we have 5
> node computers.
>
> Thank you very much, Ted Yu
>
> On Sun, May 29, 2016 at 6:48 AM, Ted Yu  wrote:
>
>> Can you let us know your case ?
>>
>> When the join failed, what was the error (consider pastebin) ?
>>
>> Which release of Spark are you using ?
>>
>> Thanks
>>
>> > On May 28, 2016, at 3:27 PM, heri wijayanto 
>> wrote:
>> >
>> > Hi everyone,
>> > I perform join function in a loop, and it is failed. I found a
>> tutorial from the web, it says that I should use a broadcast variable but
>> it is not a good choice for doing it on the loop.
>> > I need your suggestion to address this problem, thank you very much.
>> > and I am sorry, I am a beginner in Spark programming
>>
>
>

>>>
>>
>


Re: join function in a loop

2016-05-28 Thread Ted Yu
Heri:
Is it possible to partition your data set so that the number of rows
involved in the join is under control?
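
A rough sketch of one way to do that with plain pair RDDs: bucket the join key and
join one bucket at a time. The inputs below are toy stand-ins for the real data
sets, and it assumes a live SparkContext named sc:

import org.apache.spark.rdd.RDD

// Hypothetical inputs; in practice these would be the two large key/value RDDs.
val left:  RDD[(String, Int)]    = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right: RDD[(String, String)] = sc.parallelize(Seq(("a", "x"), ("c", "z")))

val numBuckets = 16
def bucket(k: String): Int = ((k.hashCode % numBuckets) + numBuckets) % numBuckets

// Join one bucket at a time so each individual join stays small,
// then union the per-bucket results back together.
val joined = (0 until numBuckets)
  .map { b =>
    left.filter { case (k, _) => bucket(k) == b }
      .join(right.filter { case (k, _) => bucket(k) == b })
  }
  .reduce(_ union _)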

Cheers

On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh 
wrote:

> You are welcome
>
> Also you can use the OS command /usr/bin/free to see how much free memory you
> have on each node.
>
> You should also see from SPARK GUI (first job on master node:4040, next on
> 4041etc) the  resource and Storage (memory usage) for each SparkSubmit job.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 01:16, heri wijayanto  wrote:
>
>> Thank you, Dr Mich Talebzadeh, I will capture the error messages, but
>> currently, my cluster is running to do the other job. After it finished, I
>> will try your suggestions
>>
>> On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You should have errors in yarn-nodemanager and yarn-resourcemanager
>>> logs.
>>>
>>> Something like below for a healthy container
>>>
>>> 2016-05-29 00:50:50,496 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>> Memory usage of ProcessTree 29769 for container-id
>>> container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory
>>> used; 2.7 GB of 8.4 GB virtual memory used
>>>
>>> It appears that you are running out of memory. Have you also checked
>>> with jps and jmonitor for SparkSubmit (the driver process) for the failing
>>> job? It will show you the resource usage= like memory/heap/cpu etc
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 00:26, heri wijayanto  wrote:
>>>
 I implement spark with join function for processing in around 250
 million rows of text.

 When I just used several hundred of rows, it could run, but when I use
 the large data, it is failed.

 My spark version in 1.6.1, run above yarn-cluster mode, and we have 5
 node computers.

 Thank you very much, Ted Yu

 On Sun, May 29, 2016 at 6:48 AM, Ted Yu  wrote:

> Can you let us know your case ?
>
> When the join failed, what was the error (consider pastebin) ?
>
> Which release of Spark are you using ?
>
> Thanks
>
> > On May 28, 2016, at 3:27 PM, heri wijayanto 
> wrote:
> >
> > Hi everyone,
> > I perform join function in a loop, and it is failed. I found a
> tutorial from the web, it says that I should use a broadcast variable but
> it is not a good choice for doing it on the loop.
> > I need your suggestion to address this problem, thank you very much.
> > and I am sorry, I am a beginner in Spark programming
>


>>>
>>
>


Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You are welcome

Also, you can use the OS command /usr/bin/free to see how much free memory you
have on each node.

You should also see from the Spark GUI (first job on master node:4040, the next on
4041, etc.) the resource and Storage (memory usage) details for each SparkSubmit job.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 May 2016 at 01:16, heri wijayanto  wrote:

> Thank you, Dr Mich Talebzadeh, I will capture the error messages, but
> currently, my cluster is running to do the other job. After it finished, I
> will try your suggestions
>
> On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You should have errors in yarn-nodemanager and yarn-resourcemanager logs.
>>
>> Something like below for a healthy container
>>
>> 2016-05-29 00:50:50,496 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>> Memory usage of ProcessTree 29769 for container-id
>> container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory
>> used; 2.7 GB of 8.4 GB virtual memory used
>>
>> It appears that you are running out of memory. Have you also checked with
>> jps and jmonitor for SparkSubmit (the driver process) for the failing job?
>> It will show you the resource usage= like memory/heap/cpu etc
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 00:26, heri wijayanto  wrote:
>>
>>> I implement spark with join function for processing in around 250
>>> million rows of text.
>>>
>>> When I just used several hundred of rows, it could run, but when I use
>>> the large data, it is failed.
>>>
>>> My spark version in 1.6.1, run above yarn-cluster mode, and we have 5
>>> node computers.
>>>
>>> Thank you very much, Ted Yu
>>>
>>> On Sun, May 29, 2016 at 6:48 AM, Ted Yu  wrote:
>>>
 Can you let us know your case ?

 When the join failed, what was the error (consider pastebin) ?

 Which release of Spark are you using ?

 Thanks

 > On May 28, 2016, at 3:27 PM, heri wijayanto 
 wrote:
 >
 > Hi everyone,
 > I perform join function in a loop, and it is failed. I found a
 tutorial from the web, it says that I should use a broadcast variable but
 it is not a good choice for doing it on the loop.
 > I need your suggestion to address this problem, thank you very much.
 > and I am sorry, I am a beginner in Spark programming

>>>
>>>
>>
>


Re: join function in a loop

2016-05-28 Thread heri wijayanto
Thank you, Dr Mich Talebzadeh. I will capture the error messages, but
currently my cluster is running another job. After it finishes, I
will try your suggestions.

On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh 
wrote:

> You should have errors in yarn-nodemanager and yarn-resourcemanager logs.
>
> Something like below for a healthy container
>
> 2016-05-29 00:50:50,496 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Memory usage of ProcessTree 29769 for container-id
> container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory
> used; 2.7 GB of 8.4 GB virtual memory used
>
> It appears that you are running out of memory. Have you also checked with
> jps and jmonitor for SparkSubmit (the driver process) for the failing job?
> It will show you the resource usage= like memory/heap/cpu etc
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 00:26, heri wijayanto  wrote:
>
>> I implement spark with join function for processing in around 250 million
>> rows of text.
>>
>> When I just used several hundred of rows, it could run, but when I use
>> the large data, it is failed.
>>
>> My spark version in 1.6.1, run above yarn-cluster mode, and we have 5
>> node computers.
>>
>> Thank you very much, Ted Yu
>>
>> On Sun, May 29, 2016 at 6:48 AM, Ted Yu  wrote:
>>
>>> Can you let us know your case ?
>>>
>>> When the join failed, what was the error (consider pastebin) ?
>>>
>>> Which release of Spark are you using ?
>>>
>>> Thanks
>>>
>>> > On May 28, 2016, at 3:27 PM, heri wijayanto 
>>> wrote:
>>> >
>>> > Hi everyone,
>>> > I perform join function in a loop, and it is failed. I found a
>>> tutorial from the web, it says that I should use a broadcast variable but
>>> it is not a good choice for doing it on the loop.
>>> > I need your suggestion to address this problem, thank you very much.
>>> > and I am sorry, I am a beginner in Spark programming
>>>
>>
>>
>


Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You should have errors in yarn-nodemanager and yarn-resourcemanager logs.

Something like below for a healthy container

2016-05-29 00:50:50,496 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 29769 for container-id
container_1464210869844_0061_01_01: 372.6 MB of 4 GB physical memory
used; 2.7 GB of 8.4 GB virtual memory used

It appears that you are running out of memory. Have you also checked with
jps and jmonitor for SparkSubmit (the driver process) for the failing job?
It will show you the resource usage (memory/heap/CPU, etc.).
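
If the containers do turn out to be hitting their memory limit, a hedged example of
explicitly sizing the executors for a yarn-cluster submit might look something like
the below (the jar path and figures are placeholders only and need to be tuned to
the actual cluster):

${SPARK_HOME}/bin/spark-submit \
  --master yarn-cluster \
  --num-executors 5 \
  --executor-cores 2 \
  --executor-memory 3G \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  /path/to/your-app.jar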

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 29 May 2016 at 00:26, heri wijayanto  wrote:

> I implement spark with join function for processing in around 250 million
> rows of text.
>
> When I just used several hundred of rows, it could run, but when I use the
> large data, it is failed.
>
> My spark version in 1.6.1, run above yarn-cluster mode, and we have 5 node
> computers.
>
> Thank you very much, Ted Yu
>
> On Sun, May 29, 2016 at 6:48 AM, Ted Yu  wrote:
>
>> Can you let us know your case ?
>>
>> When the join failed, what was the error (consider pastebin) ?
>>
>> Which release of Spark are you using ?
>>
>> Thanks
>>
>> > On May 28, 2016, at 3:27 PM, heri wijayanto  wrote:
>> >
>> > Hi everyone,
>> > I perform join function in a loop, and it is failed. I found a tutorial
>> from the web, it says that I should use a broadcast variable but it is not
>> a good choice for doing it on the loop.
>> > I need your suggestion to address this problem, thank you very much.
>> > and I am sorry, I am a beginner in Spark programming
>>
>
>


Re: join function in a loop

2016-05-28 Thread heri wijayanto
I implement Spark with the join function for processing around 250 million
rows of text.

When I used just several hundred rows, it could run, but when I use the
large data, it fails.

My Spark version is 1.6.1, running in yarn-cluster mode, and we have 5 node
computers.

Thank you very much, Ted Yu

On Sun, May 29, 2016 at 6:48 AM, Ted Yu  wrote:

> Can you let us know your case ?
>
> When the join failed, what was the error (consider pastebin) ?
>
> Which release of Spark are you using ?
>
> Thanks
>
> > On May 28, 2016, at 3:27 PM, heri wijayanto  wrote:
> >
> > Hi everyone,
> > I perform join function in a loop, and it is failed. I found a tutorial
> from the web, it says that I should use a broadcast variable but it is not
> a good choice for doing it on the loop.
> > I need your suggestion to address this problem, thank you very much.
> > and I am sorry, I am a beginner in Spark programming
>


Re: join function in a loop

2016-05-28 Thread Ted Yu
Can you let us know your case ?

When the join failed, what was the error (consider pastebin) ?

Which release of Spark are you using ?

Thanks

> On May 28, 2016, at 3:27 PM, heri wijayanto  wrote:
> 
> Hi everyone,
> I perform join function in a loop, and it is failed. I found a tutorial from 
> the web, it says that I should use a broadcast variable but it is not a good 
> choice for doing it on the loop. 
> I need your suggestion to address this problem, thank you very much.
> and I am sorry, I am a beginner in Spark programming




join function in a loop

2016-05-28 Thread heri wijayanto
Hi everyone,
I perform a join function in a loop, and it fails. I found a tutorial
on the web; it says that I should use a broadcast variable, but that is not
a good choice for doing it in a loop.
I need your suggestions to address this problem, thank you very much.
And I am sorry, I am a beginner in Spark programming.
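
For reference, a rough sketch of the broadcast (map-side) join the tutorial was
presumably describing, assuming the smaller side fits in driver memory; the inputs
below are toy stand-ins and sc is the SparkContext:

import org.apache.spark.rdd.RDD

// Hypothetical inputs: small fits comfortably in memory, large does not.
val small: RDD[(String, Int)]    = sc.parallelize(Seq(("a", 1), ("b", 2)))
val large: RDD[(String, String)] = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z")))

// Broadcast the small side once, outside the loop, then join map-side on the
// large side; this avoids re-shuffling the large RDD on every iteration.
val smallMap = sc.broadcast(small.collectAsMap())

val joined = large.mapPartitions { iter =>
  val lookup = smallMap.value
  iter.flatMap { case (k, v) => lookup.get(k).map(s => (k, (v, s))) }
}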


Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Great, Thanks.

On Sun, May 29, 2016 at 12:38 AM, Chris Fregly  wrote:

> btw, here's a handy Spark Config Generator by Ewan Higgs in in Gent,
> Belgium:
>
> code:  https://github.com/ehiggs/spark-config-gen
>
> demo:  http://ehiggs.github.io/spark-config-gen/
>
> my recent tweet on this:
> https://twitter.com/cfregly/status/736631633927753729
>
> On Sat, May 28, 2016 at 10:50 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> hang on. Free is telling me you have 8GB of memory. I was under the
>> impression that you had 4GB of RAM :)
>>
>> So with no app you have 3.99GB free ~ 4GB
>>  1st app takes 428MB of memory and the second is 425MB so pretty lean apps
>>
>> The question is the apps that I run take 2-3GB each. But your mileage
>> varies. If you end up with free memory running these minute apps and no
>> sudden spike in memory/cpu usage then as long as they run and finish within
>> SLA you should be OK whichever environment you run. May be you apps do not
>> require that amount of memory.
>>
>> I don't think there is clear cut answer to NOT to use local mode in prod.
>> Others may have different opinions on this.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 18:37, sujeet jog  wrote:
>>
>>> ran these from muliple bash shell for now, probably a multi threaded
>>> python script would do ,  memory and resource allocations are seen as
>>> submitted parameters
>>>
>>>
>>> *say before running any applications . *
>>>
>>> [root@fos-elastic02 ~]# /usr/bin/free
>>>  total   used   free sharedbuffers cached
>>> Mem:   8058568*4066296 *   3992272  10172 141368
>>>  1549520
>>> -/+ buffers/cache:23754085683160
>>> Swap:  8290300 1086728181628
>>>
>>>
>>> *only 1 App : *
>>>
>>> [root@fos-elastic02 ~]# /usr/bin/free
>>>  total   used   free sharedbuffers cached
>>> Mem:   8058568*4494488*3564080  10172 141392
>>>  1549948
>>> -/+ buffers/cache:28031485255420
>>> Swap:  8290300 1086728181628
>>>
>>>
>>> ran the single APP twice in parallel ( memory used double as expected )
>>>
>>> [root@fos-elastic02 ~]# /usr/bin/free
>>>  total   used   free sharedbuffers cached
>>> Mem:   8058568*4919532 *   3139036  10172 141444
>>>  1550376
>>> -/+ buffers/cache:32277124830856
>>> Swap:  8290300 1086728181628
>>>
>>>
>>> Curious to know if local mode is used in real deployments where there is
>>> a scarcity of resources.
>>>
>>>
>>> Thanks,
>>> Sujeet
>>>
>>> On Sat, May 28, 2016 at 10:50 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 OK that is good news. So briefly how do you kick off spark-submit for
 each (or sparkConf). In terms of memory/resources allocations.

 Now what is the output of

 /usr/bin/free



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 28 May 2016 at 18:12, sujeet jog  wrote:

> Yes Mich,
> They are currently emitting the results parallely,
> http://localhost:4040 &  http://localhost:4041 , i also see the
> monitoring from these URL's,
>
>
> On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> ok they are submitted but the latter one 14302 is it doing anything?
>>
>> can you check it with jmonitor or the logs created
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 18:03, sujeet jog  wrote:
>>
>>> Thanks Ted,
>>>
>>> Thanks Mich,  yes i see that i can run two applications by
>>> submitting these,  probably Driver + Executor running in a single JVM .
>>> In-Process Spark.
>>>
>>> wondering if this can be used in production systems,  the reason for
>>> me considering local instead of standalone cluster mode is purely 
>>> because
>>> of CPU/MEM resources,  i.e,  i currently do not have the liberty to use 
>>> 1
>>> Driver & 1 Executor per application,

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Chris Fregly
btw, here's a handy Spark Config Generator by Ewan Higgs in Gent,
Belgium:

code:  https://github.com/ehiggs/spark-config-gen

demo:  http://ehiggs.github.io/spark-config-gen/

my recent tweet on this:
https://twitter.com/cfregly/status/736631633927753729

On Sat, May 28, 2016 at 10:50 AM, Mich Talebzadeh  wrote:

> hang on. Free is telling me you have 8GB of memory. I was under the
> impression that you had 4GB of RAM :)
>
> So with no app you have 3.99GB free ~ 4GB
>  1st app takes 428MB of memory and the second is 425MB so pretty lean apps
>
> The question is the apps that I run take 2-3GB each. But your mileage
> varies. If you end up with free memory running these minute apps and no
> sudden spike in memory/cpu usage then as long as they run and finish within
> SLA you should be OK whichever environment you run. May be you apps do not
> require that amount of memory.
>
> I don't think there is clear cut answer to NOT to use local mode in prod.
> Others may have different opinions on this.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 18:37, sujeet jog  wrote:
>
>> ran these from muliple bash shell for now, probably a multi threaded
>> python script would do ,  memory and resource allocations are seen as
>> submitted parameters
>>
>>
>> *say before running any applications . *
>>
>> [root@fos-elastic02 ~]# /usr/bin/free
>>  total   used   free sharedbuffers cached
>> Mem:   8058568*4066296 *   3992272  10172 141368
>>  1549520
>> -/+ buffers/cache:23754085683160
>> Swap:  8290300 1086728181628
>>
>>
>> *only 1 App : *
>>
>> [root@fos-elastic02 ~]# /usr/bin/free
>>  total   used   free sharedbuffers cached
>> Mem:   8058568*4494488*3564080  10172 141392
>>  1549948
>> -/+ buffers/cache:28031485255420
>> Swap:  8290300 1086728181628
>>
>>
>> ran the single APP twice in parallel ( memory used double as expected )
>>
>> [root@fos-elastic02 ~]# /usr/bin/free
>>  total   used   free sharedbuffers cached
>> Mem:   8058568*4919532 *   3139036  10172 141444
>>  1550376
>> -/+ buffers/cache:32277124830856
>> Swap:  8290300 1086728181628
>>
>>
>> Curious to know if local mode is used in real deployments where there is
>> a scarcity of resources.
>>
>>
>> Thanks,
>> Sujeet
>>
>> On Sat, May 28, 2016 at 10:50 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> OK that is good news. So briefly how do you kick off spark-submit for
>>> each (or sparkConf). In terms of memory/resources allocations.
>>>
>>> Now what is the output of
>>>
>>> /usr/bin/free
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 18:12, sujeet jog  wrote:
>>>
 Yes Mich,
 They are currently emitting the results parallely,
 http://localhost:4040 &  http://localhost:4041 , i also see the
 monitoring from these URL's,


 On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> ok they are submitted but the latter one 14302 is it doing anything?
>
> can you check it with jmonitor or the logs created
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 18:03, sujeet jog  wrote:
>
>> Thanks Ted,
>>
>> Thanks Mich,  yes i see that i can run two applications by submitting
>> these,  probably Driver + Executor running in a single JVM .  In-Process
>> Spark.
>>
>> wondering if this can be used in production systems,  the reason for
>> me considering local instead of standalone cluster mode is purely because
>> of CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
>> Driver & 1 Executor per application,( running in a embedded network
>> switch  )
>>
>>
>> jps output
>> [root@fos-elastic02 ~]# jps
>> 14258 SparkSubmit
>> 14503 Jps
>> 14302 SparkSubmit
>> ,
>>
>> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
>> 

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
hang on. Free is telling me you have 8GB of memory. I was under the
impression that you had 4GB of RAM :)

So with no app you have 3.99GB free (~ 4GB).
The 1st app takes 428MB of memory and the second 425MB, so these are pretty lean
apps.

The thing is, the apps that I run take 2-3GB each. But your mileage
varies. If you end up with free memory running these minute apps and no
sudden spike in memory/CPU usage, then as long as they run and finish within
SLA you should be OK in whichever environment you run. Maybe your apps do not
require that amount of memory.

I don't think there is a clear-cut answer to NOT using local mode in prod.
Others may have different opinions on this.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 May 2016 at 18:37, sujeet jog  wrote:

> ran these from muliple bash shell for now, probably a multi threaded
> python script would do ,  memory and resource allocations are seen as
> submitted parameters
>
>
> *say before running any applications . *
>
> [root@fos-elastic02 ~]# /usr/bin/free
>  total   used   free sharedbuffers cached
> Mem:   8058568*4066296 *   3992272  10172 141368
>  1549520
> -/+ buffers/cache:23754085683160
> Swap:  8290300 1086728181628
>
>
> *only 1 App : *
>
> [root@fos-elastic02 ~]# /usr/bin/free
>  total   used   free sharedbuffers cached
> Mem:   8058568*4494488*3564080  10172 141392
>  1549948
> -/+ buffers/cache:28031485255420
> Swap:  8290300 1086728181628
>
>
> ran the single APP twice in parallel ( memory used double as expected )
>
> [root@fos-elastic02 ~]# /usr/bin/free
>  total   used   free sharedbuffers cached
> Mem:   8058568*4919532 *   3139036  10172 141444
>  1550376
> -/+ buffers/cache:32277124830856
> Swap:  8290300 1086728181628
>
>
> Curious to know if local mode is used in real deployments where there is a
> scarcity of resources.
>
>
> Thanks,
> Sujeet
>
> On Sat, May 28, 2016 at 10:50 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> OK that is good news. So briefly how do you kick off spark-submit for
>> each (or sparkConf). In terms of memory/resources allocations.
>>
>> Now what is the output of
>>
>> /usr/bin/free
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 18:12, sujeet jog  wrote:
>>
>>> Yes Mich,
>>> They are currently emitting the results parallely,
>>> http://localhost:4040 &  http://localhost:4041 , i also see the
>>> monitoring from these URL's,
>>>
>>>
>>> On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 ok they are submitted but the latter one 14302 is it doing anything?

 can you check it with jmonitor or the logs created

 HTH



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 28 May 2016 at 18:03, sujeet jog  wrote:

> Thanks Ted,
>
> Thanks Mich,  yes i see that i can run two applications by submitting
> these,  probably Driver + Executor running in a single JVM .  In-Process
> Spark.
>
> wondering if this can be used in production systems,  the reason for
> me considering local instead of standalone cluster mode is purely because
> of CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
> Driver & 1 Executor per application,( running in a embedded network
> switch  )
>
>
> jps output
> [root@fos-elastic02 ~]# jps
> 14258 SparkSubmit
> 14503 Jps
> 14302 SparkSubmit
> ,
>
> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Ok so you want to run all this in local mode. In other words
>> something like below
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>
>> --master local[2] \
>>
>> --driver-memory 2G \
>>
>> --num-executors=1 \
>>
>> --executor-memory=2G \
>>
>> --executor-cores=2 \
>>
>>
>> I am not sure it will work for multiple drivers (app/JVM).  The only
>> 

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
I ran these from multiple bash shells for now; probably a multi-threaded python
script would do. Memory and resource allocations are seen as the submitted
parameters.


Say, before running any applications:

[root@fos-elastic02 ~]# /usr/bin/free
             total       used       free     shared    buffers     cached
Mem:       8058568    4066296    3992272      10172     141368    1549520
-/+ buffers/cache:    2375408    5683160
Swap:      8290300     108672    8181628


Only 1 app:

[root@fos-elastic02 ~]# /usr/bin/free
             total       used       free     shared    buffers     cached
Mem:       8058568    4494488    3564080      10172     141392    1549948
-/+ buffers/cache:    2803148    5255420
Swap:      8290300     108672    8181628


Ran the single app twice in parallel (memory used doubled, as expected):

[root@fos-elastic02 ~]# /usr/bin/free
             total       used       free     shared    buffers     cached
Mem:       8058568    4919532    3139036      10172     141444    1550376
-/+ buffers/cache:    3227712    4830856
Swap:      8290300     108672    8181628


Curious to know if local mode is used in real deployments where there is a
scarcity of resources.


Thanks,
Sujeet

On Sat, May 28, 2016 at 10:50 PM, Mich Talebzadeh  wrote:

> OK that is good news. So briefly how do you kick off spark-submit for each
> (or sparkConf). In terms of memory/resources allocations.
>
> Now what is the output of
>
> /usr/bin/free
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 18:12, sujeet jog  wrote:
>
>> Yes Mich,
>> They are currently emitting the results parallely,
>> http://localhost:4040 &  http://localhost:4041 , i also see the
>> monitoring from these URL's,
>>
>>
>> On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> ok they are submitted but the latter one 14302 is it doing anything?
>>>
>>> can you check it with jmonitor or the logs created
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 18:03, sujeet jog  wrote:
>>>
 Thanks Ted,

 Thanks Mich,  yes i see that i can run two applications by submitting
 these,  probably Driver + Executor running in a single JVM .  In-Process
 Spark.

 wondering if this can be used in production systems,  the reason for me
 considering local instead of standalone cluster mode is purely because of
 CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
 Driver & 1 Executor per application,( running in a embedded network
 switch  )


 jps output
 [root@fos-elastic02 ~]# jps
 14258 SparkSubmit
 14503 Jps
 14302 SparkSubmit
 ,

 On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Ok so you want to run all this in local mode. In other words something
> like below
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[2] \
>
> --driver-memory 2G \
>
> --num-executors=1 \
>
> --executor-memory=2G \
>
> --executor-cores=2 \
>
>
> I am not sure it will work for multiple drivers (app/JVM).  The only
> way you can find out is to do try it running two apps simultaneously. You
> have a number of tools.
>
>
>
>1. use jps to see the apps and PID
>2. use jmonitor to see memory/cpu/heap usage for each spark-submit
>job
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 17:41, Ted Yu  wrote:
>
>> Sujeet:
>>
>> Please also see:
>>
>> https://spark.apache.org/docs/latest/spark-standalone.html
>>
>> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Sujeet,
>>>
>>> if you have a single machine then it is Spark standalone mode.
>>>
>>> In Standalone cluster mode Spark allocates resources based on
>>> cores. By default, an application will grab all the cores in the 

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, that is good news. So briefly, how do you kick off spark-submit for each
(or SparkConf), in terms of memory/resource allocations?

Now what is the output of

/usr/bin/free



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 May 2016 at 18:12, sujeet jog  wrote:

> Yes Mich,
> They are currently emitting the results parallely,
> http://localhost:4040 &  http://localhost:4041 , i also see the
> monitoring from these URL's,
>
>
> On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> ok they are submitted but the latter one 14302 is it doing anything?
>>
>> can you check it with jmonitor or the logs created
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 18:03, sujeet jog  wrote:
>>
>>> Thanks Ted,
>>>
>>> Thanks Mich,  yes i see that i can run two applications by submitting
>>> these,  probably Driver + Executor running in a single JVM .  In-Process
>>> Spark.
>>>
>>> wondering if this can be used in production systems,  the reason for me
>>> considering local instead of standalone cluster mode is purely because of
>>> CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
>>> Driver & 1 Executor per application,( running in a embedded network
>>> switch  )
>>>
>>>
>>> jps output
>>> [root@fos-elastic02 ~]# jps
>>> 14258 SparkSubmit
>>> 14503 Jps
>>> 14302 SparkSubmit
>>> ,
>>>
>>> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Ok so you want to run all this in local mode. In other words something
 like below

 ${SPARK_HOME}/bin/spark-submit \

 --master local[2] \

 --driver-memory 2G \

 --num-executors=1 \

 --executor-memory=2G \

 --executor-cores=2 \


 I am not sure it will work for multiple drivers (app/JVM).  The only
 way you can find out is to do try it running two apps simultaneously. You
 have a number of tools.



1. use jps to see the apps and PID
2. use jmonitor to see memory/cpu/heap usage for each spark-submit
job

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 28 May 2016 at 17:41, Ted Yu  wrote:

> Sujeet:
>
> Please also see:
>
> https://spark.apache.org/docs/latest/spark-standalone.html
>
> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Sujeet,
>>
>> if you have a single machine then it is Spark standalone mode.
>>
>> In Standalone cluster mode Spark allocates resources based on cores.
>> By default, an application will grab all the cores in the cluster.
>>
>> You only have one worker that lives within the driver JVM process
>> that you start when you start the application with spark-shell or
>> spark-submit in the host where the cluster manager is running.
>>
>> The Driver node runs on the same host that the cluster manager is
>> running. The Driver requests the Cluster Manager for resources to run
>> tasks. The worker is tasked to create the executor (in this case there is
>> only one executor) for the Driver. The Executor runs tasks for the 
>> Driver.
>> Only one executor can be allocated on each worker per application. In 
>> your
>> case you only have
>>
>>
>> The minimum you will need will be 2-4G of RAM and two cores. Well
>> that is my experience. Yes you can submit more than one spark-submit (the
>> driver) but they may queue up behind the running one if there is not 
>> enough
>> resources.
>>
>>
>> You pointed out that you will be running few applications in parallel
>> on the same host. The likelihood is that you are using a VM machine for
>> this purpose and the best option is to try running the first one, Check 
>> Web
>> GUI on  4040 to see the progress of this Job. If you start the next JVM
>> then assuming it is working, it will be using port 4041 and so forth.
>>
>>
>> In actual fact try the command "free" to see how much free memory you
>> 

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Yes Mich,
They are currently emitting the results in parallel, http://localhost:4040
& http://localhost:4041; I also see the monitoring from these URLs.


On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh  wrote:

> ok they are submitted but the latter one 14302 is it doing anything?
>
> can you check it with jmonitor or the logs created
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 18:03, sujeet jog  wrote:
>
>> Thanks Ted,
>>
>> Thanks Mich,  yes i see that i can run two applications by submitting
>> these,  probably Driver + Executor running in a single JVM .  In-Process
>> Spark.
>>
>> wondering if this can be used in production systems,  the reason for me
>> considering local instead of standalone cluster mode is purely because of
>> CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
>> Driver & 1 Executor per application,( running in a embedded network
>> switch  )
>>
>>
>> jps output
>> [root@fos-elastic02 ~]# jps
>> 14258 SparkSubmit
>> 14503 Jps
>> 14302 SparkSubmit
>> ,
>>
>> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Ok so you want to run all this in local mode. In other words something
>>> like below
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>
>>> --master local[2] \
>>>
>>> --driver-memory 2G \
>>>
>>> --num-executors=1 \
>>>
>>> --executor-memory=2G \
>>>
>>> --executor-cores=2 \
>>>
>>>
>>> I am not sure it will work for multiple drivers (app/JVM).  The only way
>>> you can find out is to do try it running two apps simultaneously. You have
>>> a number of tools.
>>>
>>>
>>>
>>>1. use jps to see the apps and PID
>>>2. use jmonitor to see memory/cpu/heap usage for each spark-submit
>>>job
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 17:41, Ted Yu  wrote:
>>>
 Sujeet:

 Please also see:

 https://spark.apache.org/docs/latest/spark-standalone.html

 On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Sujeet,
>
> if you have a single machine then it is Spark standalone mode.
>
> In Standalone cluster mode Spark allocates resources based on cores.
> By default, an application will grab all the cores in the cluster.
>
> You only have one worker that lives within the driver JVM process that
> you start when you start the application with spark-shell or spark-submit
> in the host where the cluster manager is running.
>
> The Driver node runs on the same host that the cluster manager is
> running. The Driver requests the Cluster Manager for resources to run
> tasks. The worker is tasked to create the executor (in this case there is
> only one executor) for the Driver. The Executor runs tasks for the Driver.
> Only one executor can be allocated on each worker per application. In your
> case you only have
>
>
> The minimum you will need will be 2-4G of RAM and two cores. Well that
> is my experience. Yes you can submit more than one spark-submit (the
> driver) but they may queue up behind the running one if there is not 
> enough
> resources.
>
>
> You pointed out that you will be running few applications in parallel
> on the same host. The likelihood is that you are using a VM machine for
> this purpose and the best option is to try running the first one, Check 
> Web
> GUI on  4040 to see the progress of this Job. If you start the next JVM
> then assuming it is working, it will be using port 4041 and so forth.
>
>
> In actual fact try the command "free" to see how much free memory you
> have.
>
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 16:42, sujeet jog  wrote:
>
>> Hi,
>>
>> I have a question w.r.t  production deployment mode of spark,
>>
>> I have 3 applications which i would like to run independently on a
>> single machine, i need 

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, they are submitted, but the latter one (14302), is it doing anything?

Can you check it with jmonitor or the logs created?

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 May 2016 at 18:03, sujeet jog  wrote:

> Thanks Ted,
>
> Thanks Mich,  yes i see that i can run two applications by submitting
> these,  probably Driver + Executor running in a single JVM .  In-Process
> Spark.
>
> wondering if this can be used in production systems,  the reason for me
> considering local instead of standalone cluster mode is purely because of
> CPU/MEM resources,  i.e,  i currently do not have the liberty to use 1
> Driver & 1 Executor per application,( running in a embedded network
> switch  )
>
>
> jps output
> [root@fos-elastic02 ~]# jps
> 14258 SparkSubmit
> 14503 Jps
> 14302 SparkSubmit
> ,
>
> On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Ok so you want to run all this in local mode. In other words something
>> like below
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>
>> --master local[2] \
>>
>> --driver-memory 2G \
>>
>> --num-executors=1 \
>>
>> --executor-memory=2G \
>>
>> --executor-cores=2 \
>>
>>
>> I am not sure it will work for multiple drivers (app/JVM).  The only way
>> you can find out is to do try it running two apps simultaneously. You have
>> a number of tools.
>>
>>
>>
>>1. use jps to see the apps and PID
>>2. use jmonitor to see memory/cpu/heap usage for each spark-submit job
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 17:41, Ted Yu  wrote:
>>
>>> Sujeet:
>>>
>>> Please also see:
>>>
>>> https://spark.apache.org/docs/latest/spark-standalone.html
>>>
>>> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Sujeet,

 if you have a single machine then it is Spark standalone mode.

 In Standalone cluster mode Spark allocates resources based on cores.
 By default, an application will grab all the cores in the cluster.

 You only have one worker that lives within the driver JVM process that
 you start when you start the application with spark-shell or spark-submit
 in the host where the cluster manager is running.

 The Driver node runs on the same host that the cluster manager is
 running. The Driver requests the Cluster Manager for resources to run
 tasks. The worker is tasked to create the executor (in this case there is
 only one executor) for the Driver. The Executor runs tasks for the Driver.
 Only one executor can be allocated on each worker per application. In your
 case you only have


 The minimum you will need will be 2-4G of RAM and two cores. Well that
 is my experience. Yes you can submit more than one spark-submit (the
 driver) but they may queue up behind the running one if there is not enough
 resources.


 You pointed out that you will be running few applications in parallel
 on the same host. The likelihood is that you are using a VM machine for
 this purpose and the best option is to try running the first one, Check Web
 GUI on  4040 to see the progress of this Job. If you start the next JVM
 then assuming it is working, it will be using port 4041 and so forth.


 In actual fact try the command "free" to see how much free memory you
 have.


 HTH





 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 28 May 2016 at 16:42, sujeet jog  wrote:

> Hi,
>
> I have a question w.r.t  production deployment mode of spark,
>
> I have 3 applications which i would like to run independently on a
> single machine, i need to run the drivers in the same machine.
>
> The amount of resources i have is also limited, like 4- 5GB RAM , 3 -
> 4 cores.
>
> For deployment in standalone mode : i believe i need
>
> 1 Driver JVM,  1 worker node ( 1 executor )
> 1 Driver JVM,  1 worker node ( 1 executor )
> 1 Driver JVM,  1 worker node ( 1 executor )
>
> The issue here is i will require 6 JVM running in parallel, for which

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Thanks Ted,

Thanks Mich, yes I see that I can run two applications by submitting
these, probably Driver + Executor running in a single JVM (in-process
Spark).

Wondering if this can be used in production systems; the reason for me
considering local instead of standalone cluster mode is purely because of
CPU/MEM resources, i.e. I currently do not have the liberty to use 1
Driver & 1 Executor per application (running in an embedded network
switch).


jps output
[root@fos-elastic02 ~]# jps
14258 SparkSubmit
14503 Jps
14302 SparkSubmit
,

On Sat, May 28, 2016 at 10:21 PM, Mich Talebzadeh  wrote:

> Ok so you want to run all this in local mode. In other words something
> like below
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[2] \
>
> --driver-memory 2G \
>
> --num-executors=1 \
>
> --executor-memory=2G \
>
> --executor-cores=2 \
>
>
> I am not sure it will work for multiple drivers (app/JVM).  The only way
> you can find out is to do try it running two apps simultaneously. You have
> a number of tools.
>
>
>
>1. use jps to see the apps and PID
>2. use jmonitor to see memory/cpu/heap usage for each spark-submit job
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 17:41, Ted Yu  wrote:
>
>> Sujeet:
>>
>> Please also see:
>>
>> https://spark.apache.org/docs/latest/spark-standalone.html
>>
>> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Sujeet,
>>>
>>> if you have a single machine then it is Spark standalone mode.
>>>
>>> In Standalone cluster mode Spark allocates resources based on cores. By
>>> default, an application will grab all the cores in the cluster.
>>>
>>> You only have one worker that lives within the driver JVM process that
>>> you start when you start the application with spark-shell or spark-submit
>>> in the host where the cluster manager is running.
>>>
>>> The Driver node runs on the same host that the cluster manager is
>>> running. The Driver requests the Cluster Manager for resources to run
>>> tasks. The worker is tasked to create the executor (in this case there is
>>> only one executor) for the Driver. The Executor runs tasks for the Driver.
>>> Only one executor can be allocated on each worker per application. In your
>>> case you only have
>>>
>>>
>>> The minimum you will need will be 2-4G of RAM and two cores. Well that
>>> is my experience. Yes you can submit more than one spark-submit (the
>>> driver) but they may queue up behind the running one if there is not enough
>>> resources.
>>>
>>>
>>> You pointed out that you will be running few applications in parallel on
>>> the same host. The likelihood is that you are using a VM machine for this
>>> purpose and the best option is to try running the first one, Check Web GUI
>>> on  4040 to see the progress of this Job. If you start the next JVM then
>>> assuming it is working, it will be using port 4041 and so forth.
>>>
>>>
>>> In actual fact try the command "free" to see how much free memory you
>>> have.
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 28 May 2016 at 16:42, sujeet jog  wrote:
>>>
 Hi,

 I have a question w.r.t  production deployment mode of spark,

 I have 3 applications which i would like to run independently on a
 single machine, i need to run the drivers in the same machine.

 The amount of resources i have is also limited, like 4- 5GB RAM , 3 - 4
 cores.

 For deployment in standalone mode : i believe i need

 1 Driver JVM,  1 worker node ( 1 executor )
 1 Driver JVM,  1 worker node ( 1 executor )
 1 Driver JVM,  1 worker node ( 1 executor )

 The issue here is i will require 6 JVM running in parallel, for which i
 do not have sufficient CPU/MEM resources,


 Hence i was looking more towards a local mode deployment mode, would
 like to know if anybody is using local mode where Driver + Executor run in
 a single JVM in production mode.

 Are there any inherent issues upfront using local mode for production
 base systems.?..


>>>
>>
>


Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, so you want to run all this in local mode. In other words, something like
below:

${SPARK_HOME}/bin/spark-submit \

--master local[2] \

--driver-memory 2G \

--num-executors=1 \

--executor-memory=2G \

--executor-cores=2 \
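
For example, a complete pair of submissions might look something like the below (the
class and jar names are placeholders only; note that in local mode everything runs
inside the one driver JVM, so --driver-memory is the setting that actually bounds
each app):

${SPARK_HOME}/bin/spark-submit \
  --master local[2] \
  --driver-memory 2G \
  --class com.example.AppOne \
  /path/to/app-one.jar &

${SPARK_HOME}/bin/spark-submit \
  --master local[2] \
  --driver-memory 2G \
  --class com.example.AppTwo \
  /path/to/app-two.jar &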


I am not sure it will work for multiple drivers (app/JVM). The only way
you can find out is to try running two apps simultaneously. You have
a number of tools:



   1. use jps to see the apps and their PIDs
   2. use jmonitor to see memory/cpu/heap usage for each spark-submit job
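
As a rough illustration of what such an application looks like from the code
side, here is a minimal Scala sketch (the object name and workload are made
up, not taken from this thread). With a local[N] master, driver and executor
threads share the one JVM that spark-submit starts, so executor-oriented
flags such as --num-executors generally have no separate process to apply to;
--driver-memory is the setting that sizes the JVM.

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("local-mode-app")
      .setMaster("local[2]") // two worker threads inside the driver JVM
    val sc = new SparkContext(conf)
    try {
      // trivial workload, just to exercise the single-JVM pipeline end to end
      val counts = sc.parallelize(1 to 100000).map(_ % 10).countByValue()
      counts.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k -> $v") }
    } finally {
      sc.stop()
    }
  }
}

Submitting three such jars with three separate spark-submit commands gives you
three independent JVMs, each with its own 4040/4041/... web UI as described
above.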

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 May 2016 at 17:41, Ted Yu  wrote:

> Sujeet:
>
> Please also see:
>
> https://spark.apache.org/docs/latest/spark-standalone.html
>
> On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Sujeet,
>>
>> if you have a single machine then it is Spark standalone mode.
>>
>> In Standalone cluster mode Spark allocates resources based on cores. By
>> default, an application will grab all the cores in the cluster.
>>
>> You only have one worker that lives within the driver JVM process that
>> you start when you start the application with spark-shell or spark-submit
>> in the host where the cluster manager is running.
>>
>> The Driver node runs on the same host that the cluster manager is
>> running. The Driver requests the Cluster Manager for resources to run
>> tasks. The worker is tasked to create the executor (in this case there is
>> only one executor) for the Driver. The Executor runs tasks for the Driver.
>> Only one executor can be allocated on each worker per application. In your
>> case you only have one worker, so one executor.
>>
>>
>> The minimum you will need will be 2-4G of RAM and two cores. Well that is
>> my experience. Yes you can submit more than one spark-submit (the driver)
>> but they may queue up behind the running one if there is not enough
>> resources.
>>
>>
>> You pointed out that you will be running few applications in parallel on
>> the same host. The likelihood is that you are using a VM machine for this
>> purpose and the best option is to try running the first one, Check Web GUI
>> on  4040 to see the progress of this Job. If you start the next JVM then
>> assuming it is working, it will be using port 4041 and so forth.
>>
>>
>> In actual fact try the command "free" to see how much free memory you
>> have.
>>
>>
>> HTH
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 28 May 2016 at 16:42, sujeet jog  wrote:
>>
>>> Hi,
>>>
>>> I have a question w.r.t  production deployment mode of spark,
>>>
>>> I have 3 applications which i would like to run independently on a
>>> single machine, i need to run the drivers in the same machine.
>>>
>>> The amount of resources i have is also limited, like 4- 5GB RAM , 3 - 4
>>> cores.
>>>
>>> For deployment in standalone mode : i believe i need
>>>
>>> 1 Driver JVM,  1 worker node ( 1 executor )
>>> 1 Driver JVM,  1 worker node ( 1 executor )
>>> 1 Driver JVM,  1 worker node ( 1 executor )
>>>
>>> The issue here is i will require 6 JVM running in parallel, for which i
>>> do not have sufficient CPU/MEM resources,
>>>
>>>
>>> Hence i was looking more towards a local mode deployment mode, would
>>> like to know if anybody is using local mode where Driver + Executor run in
>>> a single JVM in production mode.
>>>
>>> Are there any inherent issues upfront using local mode for production
>>> base systems.?..
>>>
>>>
>>
>


Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Ted Yu
Sujeet:

Please also see:

https://spark.apache.org/docs/latest/spark-standalone.html

On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh 
wrote:

> Hi Sujeet,
>
> if you have a single machine then it is Spark standalone mode.
>
> In Standalone cluster mode Spark allocates resources based on cores. By
> default, an application will grab all the cores in the cluster.
>
> You only have one worker that lives within the driver JVM process that you
> start when you start the application with spark-shell or spark-submit in
> the host where the cluster manager is running.
>
> The Driver node runs on the same host that the cluster manager is running.
> The Driver requests the Cluster Manager for resources to run tasks. The
> worker is tasked to create the executor (in this case there is only one
> executor) for the Driver. The Executor runs tasks for the Driver. Only one
> executor can be allocated on each worker per application. In your case you
> only have one worker, so one executor.
>
>
> The minimum you will need will be 2-4G of RAM and two cores. Well that is
> my experience. Yes you can submit more than one spark-submit (the driver)
> but they may queue up behind the running one if there is not enough
> resources.
>
>
> You pointed out that you will be running few applications in parallel on
> the same host. The likelihood is that you are using a VM machine for this
> purpose and the best option is to try running the first one, Check Web GUI
> on  4040 to see the progress of this Job. If you start the next JVM then
> assuming it is working, it will be using port 4041 and so forth.
>
>
> In actual fact try the command "free" to see how much free memory you have.
>
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 28 May 2016 at 16:42, sujeet jog  wrote:
>
>> Hi,
>>
>> I have a question w.r.t  production deployment mode of spark,
>>
>> I have 3 applications which i would like to run independently on a single
>> machine, i need to run the drivers in the same machine.
>>
>> The amount of resources i have is also limited, like 4- 5GB RAM , 3 - 4
>> cores.
>>
>> For deployment in standalone mode : i believe i need
>>
>> 1 Driver JVM,  1 worker node ( 1 executor )
>> 1 Driver JVM,  1 worker node ( 1 executor )
>> 1 Driver JVM,  1 worker node ( 1 executor )
>>
>> The issue here is i will require 6 JVM running in parallel, for which i
>> do not have sufficient CPU/MEM resources,
>>
>>
>> Hence i was looking more towards a local mode deployment mode, would like
>> to know if anybody is using local mode where Driver + Executor run in a
>> single JVM in production mode.
>>
>> Are there any inherent issues upfront using local mode for production
>> base systems.?..
>>
>>
>


Re: Spark_API_Copy_From_Edgenode

2016-05-28 Thread Ajay Chander
Hi Everyone, Any insights on this thread? Thank you.

On Friday, May 27, 2016, Ajay Chander  wrote:

> Hi Everyone,
>
>            I have some data located on the EdgeNode. Right
> now, the process I follow to copy the data from the Edgenode to HDFS is
> through a shell script which resides on the Edgenode. In Oozie I am using an
> SSH action to execute the shell script on the Edgenode, which copies the data
> to HDFS.
>
>   I was just wondering if there is any built-in
> API within Spark to do this job. I want to read the data from the Edgenode
> into an RDD using JavaSparkContext and then do saveAsTextFile("hdfs://...").
> Does JavaSparkContext provide any method to pass the Edgenode's access
> credentials and read the data into an RDD?
>
> Thank you for your valuable time. Any pointers are appreciated.
>
> Thank You,
> Aj
>
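
One hedged sketch of doing this without a built-in copy API, assuming the
driver itself runs on the edge node so the file is locally readable there
(the paths are placeholders, sc is the usual SparkContext, and for large
files this pulls everything through driver memory, so hdfs dfs -put may still
be the simpler tool; JavaSparkContext offers the same parallelize and
saveAsTextFile calls):

import scala.io.Source

// Read the edge-node-local file on the driver, then distribute and write to HDFS.
val localLines = Source.fromFile("/data/edge/input.txt").getLines().toList
val rdd = sc.parallelize(localLines)
rdd.saveAsTextFile("hdfs:///user/etl/input_copy")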


Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Follow-up: I just encountered the "OneVsRest" classifier in
ml.classification; I will look into using it with the binary
LogisticRegression as the provided classifier.
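
For reference, a minimal sketch of that approach with the spark.ml API
(the DataFrames training and test, with "label" and "features" columns, are
assumed rather than taken from this thread):

import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Binary LogisticRegression as the base classifier; OneVsRest trains one
// model per class and picks the highest-scoring one at prediction time.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = ovr.fit(training)
val predictions = ovrModel.transform(test) // adds a "prediction" column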

2016-05-28 9:06 GMT-07:00 Stephen Boesch :

>
> Presently only the mllib version has the one-vs-all approach for
> multinomial support.  The ml version with ElasticNet support only allows
> binary regression.
>
> With feature parity of ml vs mllib having been stated as an objective for
> 2.0.0 -  is there a projected availability of the  multinomial regression
> in the ml package?
>
>
>
>
> `
>


Re: ANOVA test in Spark

2016-05-28 Thread cyberjog
If a specific algorithm is not present, perhaps you can use R or Python's
scikit-learn: pipe your data to it and get the model back.

I'm currently trying this, and it works fine.
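
For example, RDD.pipe can stream each partition through an external process;
the script name below is hypothetical and would need to be executable on the
worker nodes, reading lines from stdin and writing its output to stdout
(sc is the usual SparkContext):

// Each partition's lines go to the script's stdin; whatever it prints
// comes back as an RDD of strings.
val rows = sc.parallelize(Seq("g1,1.2", "g1,0.9", "g2,1.7", "g2,2.1"))
val anovaOutput = rows.pipe("./run_anova.py").collect()
anovaOutput.foreach(println)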



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ANOVA-test-in-Spark-tp26949p27043.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
Hi Sujeet,

If you have a single machine, then it is Spark standalone mode.

In Standalone cluster mode Spark allocates resources based on cores. By
default, an application will grab all the cores in the cluster.

You only have one worker, which lives within the driver JVM process that you
start when you launch the application with spark-shell or spark-submit on
the host where the cluster manager is running.

The Driver node runs on the same host that the cluster manager is running.
The Driver requests the Cluster Manager for resources to run tasks. The
worker is tasked to create the executor (in this case there is only one
executor) for the Driver. The Executor runs tasks for the Driver. Only one
executor can be allocated on each worker per application. In your case you
only have one worker, so one executor.


The minimum you will need is 2-4G of RAM and two cores; that is my
experience, at least. Yes, you can submit more than one spark-submit (i.e.
more than one driver), but they may queue up behind the running one if there
are not enough resources.
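
If the concern is one application grabbing every core, here is a hedged
sketch of how a single application can cap its own footprint in standalone
mode; spark.cores.max and spark.executor.memory are standard Spark
properties, but the values shown are only illustrative for a 4-5 GB,
3-4 core box:

import org.apache.spark.{SparkConf, SparkContext}

// Caps so that three applications can coexist on one small standalone node.
val conf = new SparkConf()
  .setAppName("app-one")
  .set("spark.cores.max", "1")        // total cores this application may take
  .set("spark.executor.memory", "1g") // keep the single executor small
val sc = new SparkContext(conf)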


You pointed out that you will be running a few applications in parallel on
the same host. The likelihood is that you are using a VM for this purpose,
and the best option is to try running the first one and check the Web GUI
on port 4040 to see the progress of that job. If you then start the next JVM
and it is working, it will use port 4041, and so forth.


In fact, try the command "free" to see how much free memory you have.


HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 28 May 2016 at 16:42, sujeet jog  wrote:

> Hi,
>
> I have a question w.r.t  production deployment mode of spark,
>
> I have 3 applications which i would like to run independently on a single
> machine, i need to run the drivers in the same machine.
>
> The amount of resources i have is also limited, like 4- 5GB RAM , 3 - 4
> cores.
>
> For deployment in standalone mode : i believe i need
>
> 1 Driver JVM,  1 worker node ( 1 executor )
> 1 Driver JVM,  1 worker node ( 1 executor )
> 1 Driver JVM,  1 worker node ( 1 executor )
>
> The issue here is i will require 6 JVM running in parallel, for which i do
> not have sufficient CPU/MEM resources,
>
>
> Hence i was looking more towards a local mode deployment mode, would like
> to know if anybody is using local mode where Driver + Executor run in a
> single JVM in production mode.
>
> Are there any inherent issues upfront using local mode for production base
> systems.?..
>
>


local Vs Standalonecluster production deployment

2016-05-28 Thread cyberjog
Hi, 

I have a question w.r.t. the production deployment mode of Spark.

I have 3 applications which I would like to run independently on a single
machine; I need to run the drivers on the same machine.

The amount of resources I have is also limited: roughly 4-5 GB of RAM and
3-4 cores.

For deployment in standalone mode, I believe I need:

1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)

The issue here is that I will require 6 JVMs running in parallel, for which
I do not have sufficient CPU/memory resources.


Hence I was looking more towards a local deployment mode, and would like to
know if anybody is using local mode, where Driver + Executor run in a single
JVM, in production.

Are there any inherent issues with using local mode for production
systems?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/local-Vs-Standalonecluster-production-deployment-tp27042.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Presently only the mllib version has the one-vs-all approach for
multinomial support.  The ml version with ElasticNet support only allows
binary regression.

With feature parity of ml vs mllib having been stated as an objective for
2.0.0, is there a projected availability of the multinomial regression
in the ml package?






local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Hi,

I have a question w.r.t. the production deployment mode of Spark.

I have 3 applications which I would like to run independently on a single
machine; I need to run the drivers on the same machine.

The amount of resources I have is also limited: roughly 4-5 GB of RAM and
3-4 cores.

For deployment in standalone mode, I believe I need:

1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)
1 Driver JVM, 1 worker node (1 executor)

The issue here is that I will require 6 JVMs running in parallel, for which
I do not have sufficient CPU/memory resources.


Hence I was looking more towards a local deployment mode, and would like to
know if anybody is using local mode, where Driver + Executor run in a single
JVM, in production.

Are there any inherent issues with using local mode for production
systems?