Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Thanks, Phuong. But the point of my post is how to achieve this without using the deprecated mllib package. The mllib package already has multinomial regression built in. 2016-05-28 21:19 GMT-07:00 Phuong LE-HONG : > Dear Stephen, > > Yes, you're right, LogisticGradient is in

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Phuong LE-HONG
Dear Stephen, Yes, you're right, LogisticGradient is in the mllib package, not ml package. I just want to say that we can build a multinomial logistic regression model from the current version of Spark. Regards, Phuong On Sun, May 29, 2016 at 12:04 AM, Stephen Boesch

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Hi Phuong, The LogisticGradient exists in the mllib package but not the ml package. The ml LogisticRegression chooses either Breeze's LBFGS - when regularization is L2-only (no elastic net) or absent - or Orthant-Wise Limited-memory Quasi-Newton (OWLQN) otherwise; it does not appear to choose GD in either scenario. If I have
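The solver choice Stephen describes can be sketched as a small decision function. This is an illustration of the branch, not the actual Scala code; the parameter names mirror ml's elasticNetParam / regParam:

```python
def choose_optimizer(elastic_net_param, reg_param):
    """Sketch of ml LogisticRegression's solver selection: plain LBFGS
    when there is no L1 component (pure L2 regularization, or none at
    all), OWLQN when elastic net introduces an L1 term."""
    if reg_param == 0.0 or elastic_net_param == 0.0:
        return "LBFGS"
    return "OWLQN"
```

Note that neither branch is gradient descent, which is Stephen's point: GD lives only in the mllib optimization package.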

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Phuong LE-HONG
Dear Stephen, The Logistic Regression currently supports only binary regression. However, the LogisticGradient does support computing gradient and loss for a multinomial logistic regression. That is, you can train a multinomial logistic regression model with LogisticGradient and a class to solve
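The gradient/loss computation Phuong refers to can be sketched in pure Python. Caveat: mllib's LogisticGradient parameterizes K classes with K-1 weight vectors against a pivot class; the simpler one-vector-per-class (softmax) form below is an assumption made for readability, not the library's exact layout:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of class scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def multinomial_gradient(weights, x, label, num_classes):
    """Gradient and loss of the negative log-likelihood for one example,
    in the spirit of LogisticGradient.compute.

    weights: one weight vector per class; x: feature vector;
    label: true class index. Returns (gradient, loss)."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    probs = softmax(scores)
    loss = -math.log(probs[label])
    grad = []
    for k in range(num_classes):
        indicator = 1.0 if k == label else 0.0
        # d(loss)/d(w_k) = (p_k - 1{k == label}) * x
        grad.append([(probs[k] - indicator) * x_i for x_i in x])
    return grad, loss
```

Feeding this per-example gradient to an optimizer (mllib's LBFGS or GradientDescent) is what "a class to solve" the model amounts to.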

Re: join function in a loop

2016-05-28 Thread heri wijayanto
I am sorry, we cannot divide the data set and process it separately. Does it mean that Spark is overkill for my data size, since it takes a long time to shuffle the data? On Sun, May 29, 2016 at 8:53 AM, Ted Yu wrote: > Heri: > Is it possible to partition your data set

Re: join function in a loop

2016-05-28 Thread Ted Yu
Heri: Is it possible to partition your data set so that the number of rows involved in the join is under control? Cheers On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh wrote: > You are welcome > > Also use can use OS command /usr/bin/free to see how much free memory
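Ted's suggestion - bound the rows in flight by joining one key-partition at a time - can be sketched in plain Python, with lists of (key, value) pairs standing in for RDDs (the function name and partitioning-by-hash scheme are illustrative assumptions, not Spark API):

```python
def chunked_join(left, right, num_partitions):
    """Join two lists of (key, value) pairs one key-partition at a
    time, so only a fraction of the keys is held in memory at once."""
    results = []
    for p in range(num_partitions):
        # Keep only the keys that hash into the current partition.
        left_part = [(k, v) for k, v in left
                     if hash(k) % num_partitions == p]
        right_index = {}
        for k, v in right:
            if hash(k) % num_partitions == p:
                right_index.setdefault(k, []).append(v)
        for k, lv in left_part:
            for rv in right_index.get(k, []):
                results.append((k, (lv, rv)))
    return results
```

In Spark terms this corresponds to filtering both RDDs down to a key range (or pre-partitioning with a HashPartitioner) and joining the pieces iteratively, trading one big shuffle for several smaller ones.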

Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You are welcome. Also, you can use the OS command /usr/bin/free to see how much free memory you have on each node. You should also see from the Spark GUI (first job on master node port 4040, the next on 4041, etc.) the resource and storage (memory usage) for each spark-submit job. HTH Dr Mich Talebzadeh

Re: join function in a loop

2016-05-28 Thread heri wijayanto
Thank you, Dr Mich Talebzadeh. I will capture the error messages, but currently my cluster is running another job. After it finishes, I will try your suggestions. On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh wrote: > You should have errors in

Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You should have errors in the yarn-nodemanager and yarn-resourcemanager logs. Something like below for a healthy container: 2016-05-29 00:50:50,496 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29769 for container-id

Re: join function in a loop

2016-05-28 Thread heri wijayanto
I implement Spark with a join function for processing around 250 million rows of text. When I used just several hundred rows it could run, but when I use the large data it fails. My Spark version is 1.6.1, running in yarn-cluster mode, and we have 5 node computers. Thank you very much,

Re: join function in a loop

2016-05-28 Thread Ted Yu
Can you let us know your use case? When the join failed, what was the error (consider pastebin)? Which release of Spark are you using? Thanks > On May 28, 2016, at 3:27 PM, heri wijayanto wrote: > > Hi everyone, > I perform join function in a loop, and it is failed. I

join function in a loop

2016-05-28 Thread heri wijayanto
Hi everyone, I perform a join function in a loop, and it fails. I found a tutorial on the web that says I should use a broadcast variable, but it is not a good choice for doing it in a loop. I need your suggestion to address this problem, thank you very much. And I am sorry, I am a
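The broadcast-variable approach from the tutorial is essentially a map-side join: ship the small table to every task instead of shuffling the large one. A minimal pure-Python sketch, with an ordinary dict standing in for the value of a Spark broadcast variable (the function name is an illustration, not Spark API):

```python
def broadcast_join(large_rows, small_table):
    """Map-side (broadcast) join: look each key of the large side up
    in the small table, with no shuffle of the large side at all.

    large_rows: iterable of (key, value) pairs (the big data set)
    small_table: dict - stand-in for sc.broadcast(small_dict).value"""
    joined = []
    for key, value in large_rows:
        if key in small_table:
            joined.append((key, (value, small_table[key])))
    return joined
```

In PySpark this would be `b = sc.broadcast(small_dict)` followed by a `map` over the large RDD. The loop problem the tutorial warns about: re-broadcasting inside each iteration re-ships the table to every executor each time, so it only pays off when the small side is broadcast once outside the loop.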

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Great, Thanks. On Sun, May 29, 2016 at 12:38 AM, Chris Fregly wrote: > btw, here's a handy Spark Config Generator by Ewan Higgs in in Gent, > Belgium: > > code: https://github.com/ehiggs/spark-config-gen > > demo: http://ehiggs.github.io/spark-config-gen/ > > my recent tweet

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Chris Fregly
btw, here's a handy Spark Config Generator by Ewan Higgs in Gent, Belgium: code: https://github.com/ehiggs/spark-config-gen demo: http://ehiggs.github.io/spark-config-gen/ my recent tweet on this: https://twitter.com/cfregly/status/736631633927753729 On Sat, May 28, 2016 at 10:50 AM, Mich

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
Hang on. free is telling me you have 8GB of memory; I was under the impression that you had 4GB of RAM :) So with no app you have 3.99GB free, ~4GB. The 1st app takes 428MB of memory and the second 425MB, so pretty lean apps. The question is that the apps I run take 2-3GB each. But your mileage

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
I ran these from multiple bash shells for now; probably a multi-threaded Python script would do. Memory and resource allocations are seen as submitted parameters. *Say, before running any applications:* [root@fos-elastic02 ~]# /usr/bin/free total used free shared

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, that is good news. So briefly, how do you kick off spark-submit for each (or SparkConf), in terms of memory/resource allocations? Now, what is the output of /usr/bin/free? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Yes Mich, they are currently emitting the results in parallel, http://localhost:4040 & http://localhost:4041; I also see the monitoring from these URLs. On Sat, May 28, 2016 at 10:37 PM, Mich Talebzadeh wrote: > ok they are submitted but the latter one 14302 is

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, they are submitted, but the latter one, 14302 - is it doing anything? Can you check it with jmonitor or the logs created? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread sujeet jog
Thanks Ted, thanks Mich. Yes, I see that I can run two applications by submitting these - probably Driver + Executor running in a single JVM, in-process Spark. Wondering if this can be used in production systems; the reason for me considering local instead of standalone cluster mode is purely

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, so you want to run all this in local mode. In other words, something like below: ${SPARK_HOME}/bin/spark-submit \ --master local[2] \ --driver-memory 2G \ --num-executors=1 \ --executor-memory=2G \
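A complete invocation along the lines Mich sketches might look like the following. The jar and class names are hypothetical placeholders; note also that in local mode the driver and executors share one JVM, so --driver-memory effectively caps the whole app, while --num-executors / --executor-memory are YARN-oriented settings that local mode does not use:

```shell
# Hypothetical app jar/class - substitute your own build artifacts.
${SPARK_HOME}/bin/spark-submit \
  --master local[2] \
  --driver-memory 2G \
  --class com.example.MyApp \
  target/myapp-assembly-1.0.jar
```

Running three such commands with small --driver-memory values is how several local-mode apps can coexist on a 4-5GB box.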

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Ted Yu
Sujeet: Please also see: https://spark.apache.org/docs/latest/spark-standalone.html On Sat, May 28, 2016 at 9:19 AM, Mich Talebzadeh wrote: > Hi Sujeet, > > if you have a single machine then it is Spark standalone mode. > > In Standalone cluster mode Spark allocates

Re: Spark_API_Copy_From_Edgenode

2016-05-28 Thread Ajay Chander
Hi Everyone, Any insights on this thread? Thank you. On Friday, May 27, 2016, Ajay Chander wrote: > Hi Everyone, > >I have some data located on the EdgeNode. Right > now, the process I follow to copy the data from Edgenode to HDFS is through > a

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Followup: just encountered the "OneVsRest" classifier in ml.classification: I will look into using it with the binary LogisticRegression as the provided classifier. 2016-05-28 9:06 GMT-07:00 Stephen Boesch : > > Presently only the mllib version has the one-vs-all approach
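The strategy ml's OneVsRest wraps around a base binary classifier can be shown in miniature: relabel each class against "the rest", fit one binary model per class, and predict by the most confident model. Everything below is a pure-Python illustration (the toy nearest-mean learner included), not the ml API:

```python
def one_vs_rest_fit(examples, num_classes, fit_binary):
    """Train one binary model per class, relabeling that class as 1
    and every other class as 0 - the one-vs-rest reduction."""
    models = []
    for k in range(num_classes):
        relabeled = [(x, 1 if y == k else 0) for x, y in examples]
        models.append(fit_binary(relabeled))
    return models

def one_vs_rest_predict(models, x):
    # Pick the class whose binary model scores the example highest.
    scores = [m(x) for m in models]
    return scores.index(max(scores))

def nearest_mean_learner(data):
    """Toy binary 'learner' for 1-D features: scores an example by its
    (negated) distance to the positive-class mean."""
    pos = [x for x, y in data if y == 1]
    mean = sum(pos) / len(pos)
    return lambda x, mean=mean: -abs(x - mean)
```

With the real OneVsRest, `fit_binary` would be a binary LogisticRegression estimator, which is exactly the combination Stephen proposes.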

Re: ANOVA test in Spark

2016-05-28 Thread cyberjog
If a specific algorithm is not present, perhaps you can use R or Python's scikit-learn: pipe your data to it and get the model back. I'm currently trying this, and it works fine. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ANOVA-test-in-Spark-tp26949p27043.html

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
Hi Sujeet, if you have a single machine then it is Spark standalone mode. In Standalone cluster mode Spark allocates resources based on cores. By default, an application will grab all the cores in the cluster. You only have one worker that lives within the driver JVM process that you start when
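The grab-all-cores default Mich mentions can be capped per application so several apps coexist on one standalone box. A sketch of conf/spark-defaults.conf (or equivalent --conf flags); the property names are real Spark settings, but the values are illustrative assumptions for a 4-core/5GB machine running three apps:

```properties
# Cap each application so three can run side by side (standalone mode).
spark.cores.max          1
spark.executor.memory    1g
spark.driver.memory      512m
```

Without spark.cores.max, the first submitted application claims every core and the other two queue behind it.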

local Vs Standalonecluster production deployment

2016-05-28 Thread cyberjog
Hi, I have a question w.r.t. the production deployment mode of Spark. I have 3 applications which I would like to run independently on a single machine; I need to run the drivers on the same machine. The amount of resources I have is also limited: 4-5GB RAM, 3-4 cores. For deployment

Multinomial regression with spark.ml version of LogisticRegression

2016-05-28 Thread Stephen Boesch
Presently only the mllib version has the one-vs-all approach for multinomial support. The ml version with ElasticNet support only allows binary regression. With feature parity of ml vs mllib having been stated as an objective for 2.0.0 - is there a projected availability of the multinomial
