Best range of parameters for grid search?
I would like to run a naive implementation of grid search with MLlib, but I am a bit confused about choosing the 'best' range of parameters. Obviously, I do not want to waste too many resources on a combination of parameters that will probably not give a better model. Any suggestions from your experience?

val intercept: List[Boolean] = List(false)
val classes: List[Int] = List(2)
val validate: List[Boolean] = List(true)
val tolerance: List[Double] = List(0.0001, 0.001, 0.01, 0.1, 1.0)
val gradient: List[Gradient] = List(new LogisticGradient(), new LeastSquaresGradient(), new HingeGradient())
val corrections: List[Int] = List(5, 10, 15)
val iters: List[Int] = List(1, 10, 100, 1000)
val regparam: List[Double] = List(0.0, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0)
val updater: List[Updater] = List(new SimpleUpdater(), new L1Updater(), new SquaredL2Updater())

val combinations = for (
  a <- intercept; b <- classes; c <- validate; d <- tolerance; e <- gradient;
  f <- corrections; g <- iters; h <- regparam; i <- updater
) yield (a, b, c, d, e, f, g, h, i)

for ((interceptS, classesS, validateS, toleranceS, gradientS, correctionsS, itersS, regParamS, updaterS) <- combinations.take(3)) {
  val lr: LogisticRegressionWithLBFGS = new LogisticRegressionWithLBFGS().
    setIntercept(addIntercept = interceptS).
    setNumClasses(numClasses = classesS).
    setValidateData(validateData = validateS)
  lr.optimizer.
    setConvergenceTol(tolerance = toleranceS).
    setGradient(gradient = gradientS).
    setNumCorrections(corrections = correctionsS).
    setNumIterations(iters = itersS).
    setRegParam(regParam = regParamS).
    setUpdater(updater = updaterS)
  }
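For what it's worth, a minimal sketch of how each combination could be scored on a held-out split so that only the best model is kept. This is an illustration only: it assumes an RDD named data of LabeledPoint is already in scope, and the train/valid split and the 0-1 error metric are my own choices, not part of the original code.

// Sketch only: evaluate every combination on a held-out split and keep the best one.
val Array(train, valid) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
val scored = combinations.map { case (interceptS, classesS, validateS, toleranceS, gradientS, correctionsS, itersS, regParamS, updaterS) =>
  val lr = new LogisticRegressionWithLBFGS().
    setIntercept(interceptS).
    setNumClasses(classesS).
    setValidateData(validateS)
  lr.optimizer.
    setConvergenceTol(toleranceS).
    setGradient(gradientS).
    setNumCorrections(correctionsS).
    setNumIterations(itersS).
    setRegParam(regParamS).
    setUpdater(updaterS)
  val model = lr.run(train)
  // fraction of misclassified points on the validation split
  val error = valid.map(p => if (model.predict(p.features) != p.label) 1.0 else 0.0).mean()
  (error, (toleranceS, itersS, regParamS, updaterS))
}
val (bestError, bestParams) = scored.minBy(_._1)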
Re: Grid Search using Spark MLLib Pipelines
Great. I like your second solution. But how can I make sure that cvModel holds the best model overall (as opposed to the last one that was tried out by the grid search)? In addition, do you have an idea how to collect the average error of each grid-search combination (here 1x1x1)?

On 12/08/2016 08:59 PM, Bryan Cutler wrote:
You will need to cast bestModel to include the MLWritable trait. The class Model does not mix it in by default. For instance:
cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")
Alternatively, you could save the CV model directly, which takes care of this:
cvModel.save("/my/path")

On Fri, Aug 12, 2016 at 9:17 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:
Hi,

Assuming that I have run the following pipeline and have got the best logistic regression model, how can I then save that model for later use? The following command throws an error:
cvModel.bestModel.save("/my/path")
Also, is it possible to get the error (a collection of them) for each combination of parameters? I am using Spark 1.6.2.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

val lr = new LogisticRegression()
val pipeline = new Pipeline().
  setStages(Array(lr))
val paramGrid = new ParamGridBuilder().
  addGrid(lr.elasticNetParam, Array(0.1)).
  addGrid(lr.maxIter, Array(10)).
  addGrid(lr.regParam, Array(0.1)).
  build()
val cv = new CrossValidator().
  setEstimator(pipeline).
  setEvaluator(new BinaryClassificationEvaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(2)
val cvModel = cv.fit(training)
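In case it is useful to anyone following this thread: as far as I understand, CrossValidator re-fits on the full dataset using the best-scoring ParamMap, so cvModel.bestModel should already be the best model overall rather than the last one tried. For the per-combination error, a minimal sketch, assuming Spark 1.6 where CrossValidatorModel exposes avgMetrics aligned one-to-one with the parameter grid:

// avgMetrics(i) is the metric averaged over the folds for paramGrid(i)
// (areaUnderROC here, since a BinaryClassificationEvaluator is used)
val gridWithMetrics = paramGrid.zip(cvModel.avgMetrics)
gridWithMetrics.foreach { case (params, metric) =>
  println(s"$params -> $metric")
}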
Grid Search using Spark MLLib Pipelines
Hi,

Assuming that I have run the following pipeline and have got the best logistic regression model, how can I then save that model for later use? The following command throws an error:
cvModel.bestModel.save("/my/path")
Also, is it possible to get the error (a collection of them) for each combination of parameters? I am using Spark 1.6.2.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

val lr = new LogisticRegression()
val pipeline = new Pipeline().
  setStages(Array(lr))
val paramGrid = new ParamGridBuilder().
  addGrid(lr.elasticNetParam, Array(0.1)).
  addGrid(lr.maxIter, Array(10)).
  addGrid(lr.regParam, Array(0.1)).
  build()
val cv = new CrossValidator().
  setEstimator(pipeline).
  setEvaluator(new BinaryClassificationEvaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(2)
val cvModel = cv.fit(training)
Re: How to compute the probability of each class in Naive Bayes
Thanks Sean. As far as I can see, the probabilities are NOT normalized; the denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator, I refer to the probability of feature X). So, for a given lambda, how do I compute the denominator? FYI: https://github.com/apache/spark/blob/v1.5.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

*// Adamantios*

On Thu, Sep 10, 2015 at 7:03 PM, Sean Owen wrote:
> The log probabilities are unlikely to be very large, though the probabilities may be very small. The direct answer is to exponentiate brzPi + brzTheta * testData.toBreeze -- apply exp(x).
>
> I have forgotten whether the probabilities are normalized already though. If not you'll have to normalize to get them to sum to 1 and be real class probabilities. This is better done in log space though.
>
> On Thu, Sep 10, 2015 at 5:12 PM, Adamantios Corais wrote:
> > great. so, provided that model.theta represents the log-probabilities (and hence the result of brzPi + brzTheta * testData.toBreeze is a big number too), how can I get back the non-log-probabilities which - apparently - are bounded between 0.0 and 1.0?
> >
> > // Adamantios
> >
> > On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen wrote:
> >> (pedantic: it's the log-probabilities)
> >>
> >> On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang wrote:
> >> > Actually brzPi + brzTheta * testData.toBreeze is the probabilities of the input Vector on each class, however it's a Breeze Vector. Pay attention the index of this Vector need to map to the corresponding label index.
> >> >
> >> > 2015-08-28 20:38 GMT+08:00 Adamantios Corais:
> >> >> Hi,
> >> >>
> >> >> I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark but I have to use Spark 1.1.0.
> >> >>
> >> >> Any help is appreciated.
> >> >>
> >> >> override def predict(testData: Vector): Double = {
> >> >>   labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
> >> >> }
> >> >>
> >> >> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
> >> >>
> >> >> // Adamantios
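For anyone who lands on this thread later, a minimal sketch of the normalization discussed above, done in log space as Sean suggests. This is my own sketch, not code from MLlib; it assumes the raw per-class scores brzPi + brzTheta * testData.toBreeze are available as an Array[Double]:

// logScores(i) = log(pi_i) + sum_j theta_ij * x_j, i.e. the unnormalized log-posterior of class i
def toProbabilities(logScores: Array[Double]): Array[Double] = {
  val maxScore = logScores.max                    // subtract the max for numerical stability
  val exps = logScores.map(s => math.exp(s - maxScore))
  val logSumExp = maxScore + math.log(exps.sum)   // log of the denominator, i.e. log P(x)
  logScores.map(s => math.exp(s - logSumExp))     // probabilities that sum to 1
}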
Re: How to compute the probability of each class in Naive Bayes
great. so, provided that *model.theta* represents the log-probabilities (and hence the result of *brzPi + brzTheta * testData.toBreeze* is a big number too), how can I get back the *non-*log-probabilities which - apparently - are bounded between *0.0 and 1.0*?

*// Adamantios*

On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen wrote:
> (pedantic: it's the log-probabilities)
>
> On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang wrote:
> > Actually brzPi + brzTheta * testData.toBreeze is the probabilities of the input Vector on each class, however it's a Breeze Vector. Pay attention the index of this Vector need to map to the corresponding label index.
> >
> > 2015-08-28 20:38 GMT+08:00 Adamantios Corais <adamantios.cor...@gmail.com>:
> >> Hi,
> >>
> >> I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark but I have to use Spark 1.1.0.
> >>
> >> Any help is appreciated.
> >>
> >> override def predict(testData: Vector): Double = {
> >>   labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
> >> }
> >>
> >> https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
> >>
> >> // Adamantios
How to compute the probability of each class in Naive Bayes
Hi,

I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark but I have to use Spark 1.1.0.

Any help is appreciated.

override def predict(testData: Vector): Double = {
  labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
}

https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

*// Adamantios*
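For reference, a minimal sketch of what a modified method could look like against the 1.1.0 code linked above, returning the per-class (log-space) scores instead of the arg-max label. The method name predictRaw is mine, not from MLlib, and the normalization discussed elsewhere in this thread would still be needed to turn these scores into probabilities:

// Sketch only: same computation as predict(), but without the brzArgmax step.
// The i-th entry of the returned array corresponds to labels(i).
def predictRaw(testData: Vector): Array[Double] = {
  (brzPi + brzTheta * testData.toBreeze).toArray
}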
How to determine a good set of parameters for a ML grid search task?
I have a sparse dataset of size 775946 x 845372. I would like to perform a grid search in order to tune the parameters of my LogisticRegressionWithSGD model. I have noticed that building each model takes about 300 to 400 seconds, which means that in order to try all possible combinations of parameters I would have to wait for about 24 hours. Most importantly, though, I am not sure whether the following combinations make sense at all. So, how should I pick these parameters more wisely, in a way that also reduces the waiting time?

val numIterations = Seq(100, 500, 1000, 5000, 1, 5, 10, 50)
val stepSizes = Seq(10, 50, 100, 500, 1000, 5000, 1, 5)
val miniBatchFractions = Seq(1.0)
val updaters = Seq(new SimpleUpdater, new SquaredL2Updater, new L1Updater)

Any advice is appreciated.

*// Adamantios*
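One possible way to cut the waiting time, sketched below: run a coarse pass over a small, log-scaled grid on a sample of the data, and only then refine around the best region on the full dataset. This is only an illustration; the sample fraction, the grid values and the 0-1 error metric are arbitrary choices of mine, and it assumes an RDD named data of LabeledPoint is in scope.

// Coarse pass on a 10% sample; a finer grid on the full data can follow around the best region.
val sample = data.sample(withReplacement = false, fraction = 0.1, seed = 1L)
val Array(train, valid) = sample.randomSplit(Array(0.8, 0.2), seed = 1L)

val coarse = for (iters <- Seq(100, 1000); step <- Seq(0.1, 1.0, 10.0); reg <- Seq(0.0, 0.01, 1.0)) yield {
  val algo = new LogisticRegressionWithSGD()
  algo.optimizer.
    setNumIterations(iters).
    setStepSize(step).
    setRegParam(reg).
    setUpdater(new SquaredL2Updater)
  val model = algo.run(train)
  val error = valid.map(p => if (model.predict(p.features) != p.label) 1.0 else 0.0).mean()
  ((iters, step, reg), error)
}
coarse.sortBy(_._2).take(3).foreach(println)  // the region worth a finer search on the full data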
Re: How to binarize data in spark
I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?

val data = test11.map(x => ((x(0), x(1)), x(2))).groupByKey().map(x => (x._1, x._2.toArray)).map { x =>
  var lt: Array[Double] = new Array[Double](test12.size)
  val id = x._1._1
  val cl = x._1._2
  val dt = x._2
  var i = -1
  test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
  val vs = Vectors.dense(lt)
  (id, cl, vs)
}

*// Adamantios*

On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang wrote:
> I think you want to flatten the 1M products to a vector of 1M elements, of course mostly zero. It looks like HashingTF
> <https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf>
> can help you.
>
> 2015-08-07 11:02 GMT+08:00 praveen S:
>> Use StringIndexer in MLlib 1.4:
>> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html
>>
>> On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:
>>> I have a set of data based on which I want to create a classification model. Each row has the following form:
>>>
>>> user1,class1,product1
>>> user1,class1,product2
>>> user1,class1,product5
>>> user2,class1,product2
>>> user2,class1,product5
>>> user3,class2,product1
>>> etc
>>>
>>> There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib) BUT in order to apply that function I have to create the dense vectors (with the 0s), first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
>>>
>>> *// Adamantios*
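In case it helps, a minimal sketch of the same idea built directly as sparse vectors, so the dense intermediate array is never materialized. This is a sketch only; as in the code above, it assumes test11 holds (user, class, product) rows, test12 is the local list of distinct products, and Vectors is org.apache.spark.mllib.linalg.Vectors:

// Map each product to a column index once, then build sparse vectors directly.
val productIndex = test12.zipWithIndex.toMap
val sparseData = test11.
  map(x => ((x(0), x(1)), x(2))).
  groupByKey().
  map { case ((id, cl), products) =>
    val indices = products.map(productIndex).toArray.distinct.sorted
    val values = Array.fill(indices.length)(1.0)
    (id, cl, Vectors.sparse(test12.size, indices, values))
  }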
How to binarize data in spark
I have a set of data based on which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1
etc

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib) BUT in order to apply that function I have to create the dense vectors (with the 0s), first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?

*// Adamantios*
Cannot build "learning spark" project
Hi,

I am trying to build this project https://github.com/databricks/learning-spark with mvn package. This should work out of the box but unfortunately it doesn't. In fact, I get the following error:

> mvn pachage -X
> Apache Maven 3.0.5
> Maven home: /usr/share/maven
> Java version: 1.7.0_76, vendor: Oracle Corporation
> Java home: /usr/lib/jvm/java-7-oracle/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.13.0-45-generic", arch: "amd64", family: "unix"
> [INFO] Error stacktraces are turned on.
> [DEBUG] Reading global settings from /usr/share/maven/conf/settings.xml
> [DEBUG] Reading user settings from /home/adam/.m2/settings.xml
> [DEBUG] Using local repository at /home/adam/.m2/repository
> [DEBUG] Using manager EnhancedLocalRepositoryManager with priority 10 for /home/adam/.m2/repository
> [INFO] Scanning for projects...
> [DEBUG] Extension realms for project com.oreilly.learningsparkexamples:java:jar:0.0.2: (none)
> [DEBUG] Looking up lifecyle mappings for packaging jar from ClassRealm[plexus.core, parent: null]
> [ERROR] The build could not read 1 project -> [Help 1]
> org.apache.maven.project.ProjectBuildingException: Some problems were encountered while processing the POMs:
> [ERROR] 'dependencies.dependency.artifactId' for org.scalatest:scalatest_${scala.binary.version}:jar with value 'scalatest_${scala.binary.version}' does not match a valid id pattern. @ line 101, column 19
>   at org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:363)
>   at org.apache.maven.DefaultMaven.collectProjects(DefaultMaven.java:636)
>   at org.apache.maven.DefaultMaven.getProjectsForMavenReactor(DefaultMaven.java:585)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:234)
>   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
>   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
>   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
>   at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>   at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> [ERROR]
> [ERROR] The project com.oreilly.learningsparkexamples:java:0.0.2 (/home/adam/learning-spark/learning-spark-master/pom.xml) has 1 error
> [ERROR] 'dependencies.dependency.artifactId' for org.scalatest:scalatest_${scala.binary.version}:jar with value 'scalatest_${scala.binary.version}' does not match a valid id pattern. @ line 101, column 19
> [ERROR]
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

As a further step I would like to know how to build it against DataStax Enterprise 4.6.2.

Any help is appreciated!

*// Adamantios*
Re: How do I alter the combination of keys that exit the Spark shell?
This doesn't solve my problem... Apparently, my problem is that from time to time I accidentally press CTRL + C (instead of CTRL + ALT + V for copying commands in the shell) and that results in closing my shell. In order to solve this, I was wondering if I could just deactivate the CTRL + C combination altogether. Any ideas?

*// Adamantios*

On Fri, Mar 13, 2015 at 7:37 PM, Marcelo Vanzin wrote:
> You can type ":quit".
>
> On Fri, Mar 13, 2015 at 10:29 AM, Adamantios Corais wrote:
> > Hi,
> >
> > I want to change the default combination of keys that exits the Spark shell (i.e. CTRL + C) to something else, such as CTRL + H. Is that possible?
> >
> > Thank you in advance.
> >
> > // Adamantios
>
> --
> Marcelo
How do I alter the combination of keys that exit the Spark shell?
Hi,

I want to change the default combination of keys that exits the Spark shell (i.e. CTRL + C) to something else, such as CTRL + H. Is that possible?

Thank you in advance.

*// Adamantios*
Spark (SQL) as OLAP engine
Hi,

After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine. My goal is to push aggregated data (to Cassandra or another low-latency data store) and then be able to project the results on a web page (web service). New data will be added (aggregated) once a day, only. On the other hand, the web service must be able to run some fixed(?) queries (either on Spark or Spark SQL) at any time and plot the results with D3.js.

Note that I can already achieve similar speeds while in REPL mode by caching the data. Therefore, I believe that my problem should be re-phrased as follows: "How can I automatically cache the data once a day and make them available on a web service that is capable of running any Spark or Spark (SQL) statement in order to plot the results with D3.js?"

Note that I already have some experience with Spark (+ Spark SQL) as well as D3.js, but not at all with OLAP engines (at least in their traditional form). Any ideas or suggestions?

*// Adamantios*
Supported Notebooks (and other viz tools) for Spark 0.9.1?
Hi,

I am using Spark 0.9.1 and I am looking for a proper viz tool that supports that specific version. As far as I have seen, all the relevant tools (e.g. spark-notebook, zeppelin-project, etc.) only support 1.1 or 1.2; there is no mention of older versions of Spark. Any ideas or suggestions?

*// Adamantios*
Re: which is the recommended workflow engine for Apache Spark jobs?
Hi again,

As Jimmy asked, any thoughts about Luigi and/or any other tools? So far it seems that Oozie is the best and only choice here. Is that right?

On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain wrote:
> I have used Oozie for all our workflows with Spark apps but you will have to use a java event as the workflow element. I am interested in anyone's experience with Luigi and/or any other tools.
>
> On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:
>> I have some previous experience with Apache Oozie while I was developing in Apache Pig. Now, I am working explicitly with Apache Spark and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use \ recommend?
>
> --
> "Nothing under the sun is greater than education. By educating one person and sending him/her into the society of his/her generation, we make a contribution extending a hundred generations to come."
> -Jigoro Kano, Founder of Judo-
which is the recommended workflow engine for Apache Spark jobs?
I have some previous experience with Apache Oozie while I was developing in Apache Pig. Now, I am working explicitly with Apache Spark and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use \ recommend?
Re: return probability \ confidence instead of actual class
Thank you Sean. I'll try to do it externally as you suggested, however, can you please give me some hints on how to do that? In fact, where can I find the 1.2 implementation you just mentioned? Thanks! On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen wrote: > Plain old SVMs don't produce an estimate of class probabilities; > predict_proba() does some additional work to estimate class > probabilities from the SVM output. Spark does not implement this right > now. > > Spark implements the equivalent of decision_function (the wTx + b bit) > but does not expose it, and instead gives you predict(), which gives 0 > or 1 depending on whether the decision function exceeds the specified > threshold. > > Yes you can roll your own just like you did to calculate the decision > function from weights and intercept. I suppose it would be nice to > expose it (do I hear a PR?) but it's not hard to do externally. You'll > have to do this anyway if you're on anything earlier than 1.2. > > On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais > wrote: > > ok let me rephrase my question once again. python-wise I am preferring > > .predict_proba(X) instead of .decision_function(X) since it is easier > for me > > to interpret the results. as far as I can see, the latter functionality > is > > already implemented in Spark (well, in version 0.9.2 for example I have > to > > compute the dot product on my own otherwise I get 0 or 1) but the former > is > > not implemented (yet!). what should I do \ how to implement that one in > > Spark as well? what are the required inputs here and how does the formula > > look like? > > > > On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen wrote: > >> > >> It looks like you are directly computing the SVM decision function in > >> both cases: > >> > >> val predictions2 = m_users_double.map{point=> > >> point.zip(weights).map(a=> a._1 * a._2).sum + intercept > >> }.cache() > >> > >> clf.decision_function(T) > >> > >> This does not give you +1/-1 in SVMs (well... not for most points, > >> which will be outside the margin around the separating hyperplane). > >> > >> You can use the predict() function in SVMModel -- which will give you > >> 0 or 1 (rather than +/- 1 but that's just differing convention) > >> depending on the sign of the decision function. I don't know if this > >> was in 0.9. > >> > >> At the moment I assume you saw small values of the decision function > >> in scikit because of the radial basis function. >
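A minimal sketch of what "doing it externally" could look like on 0.9.x: compute the raw margin w·x + b from the trained model's weights and intercept and, if a 0..1 score is easier to work with, squash it with a logistic function. This is my own sketch, not the MLlib implementation, and the result is not a calibrated probability (scikit-learn's predict_proba does additional work, e.g. Platt scaling), just a monotone mapping of the margin:

// Sketch only: raw SVM margin and a logistic squashing of it, for a trained SVMModel `model`.
val weights = model.weights.toArray
val intercept = model.intercept

def margin(features: Array[Double]): Double =
  features.zip(weights).map { case (x, w) => x * w }.sum + intercept

def score(features: Array[Double]): Double =
  1.0 / (1.0 + math.exp(-margin(features)))  // in (0, 1), but not a calibrated probability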
Re: return probability \ confidence instead of actual class
ok let me rephrase my question once again. python-wise I am preferring .predict_proba(X) instead of .decision_function(X) since it is easier for me to interpret the results. as far as I can see, the latter functionality is already implemented in Spark (well, in version 0.9.2 for example I have to compute the dot product on my own otherwise I get 0 or 1) but the former is not implemented (yet!). what should I do \ how to implement that one in Spark as well? what are the required inputs here and how does the formula look like? On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen wrote: > It looks like you are directly computing the SVM decision function in > both cases: > > val predictions2 = m_users_double.map{point=> > point.zip(weights).map(a=> a._1 * a._2).sum + intercept > }.cache() > > clf.decision_function(T) > > This does not give you +1/-1 in SVMs (well... not for most points, > which will be outside the margin around the separating hyperplane). > > You can use the predict() function in SVMModel -- which will give you > 0 or 1 (rather than +/- 1 but that's just differing convention) > depending on the sign of the decision function. I don't know if this > was in 0.9. > > At the moment I assume you saw small values of the decision function > in scikit because of the radial basis function. > > On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri wrote: > > Not familiar with scikit SVM implementation ( and I assume you are using > > linearSVC). To figure out an optimal decision boundary based on the > scores > > obtained, you can use an ROC curve varying your thresholds. > > >
Re: return probability \ confidence instead of actual class
Well, apparently, the above Python set-up is wrong. Please consider the following set-up, which DOES use the 'linear' kernel... And the question remains the same: how to interpret the Spark results (or why the Spark results are NOT bounded between -1 and 1)?

On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri wrote:
> One diff I can find is you may have different kernel functions for your training. In Spark, you end up using a linear kernel whereas for scikit you are using the rbf kernel. That can explain the difference in the coefficients you are getting.
>
> On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:
>> Hi again,
>>
>> Finally, I found the time to play around with your suggestions. Unfortunately, I noticed some unusual behavior in the MLlib results, which is more obvious when I compare them against their scikit-learn equivalent. Note that I am currently using Spark 0.9.2. Long story short: I find it difficult to interpret the results: the scikit-learn SVM always returns a value between 0 and 1, which makes it easy for me to set up a threshold in order to keep only the most significant classifications (this is the case for both short and long input vectors). On the other hand, Spark MLlib makes it impossible to interpret the results; the results are hardly ever bounded between -1 and +1, and hence it is impossible to choose a good cut-off value - the results are of no practical use. And here is the strangest thing ever: although - it seems that - MLlib does NOT generate the right weights and intercept, when I feed MLlib with the weights and intercept from scikit-learn the results become pretty accurate. Any ideas about what is happening? Any suggestion is highly appreciated.
>>
>> PS: to make things easier I have quoted both of my implementations, as well as the results, below.
>> >> // >> >> SPARK (short input): >> training_error: Double = 0.0 >> res2: Array[Double] = Array(-1.4420684459128205E-19, >> -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749, >> 0.7498, 0.7498, 0.7498) >> >> SPARK (long input): >> training_error: Double = 0.0 >> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241, >> -0.782207630902241, 0.9522394329769612, 2.6866864968561632, >> 2.6866864968561632, 2.6866864968561632) >> >> PYTHON (short input): >> array([[-1.0001], >>[-1.0001], >>[-1.0001], >>[-0.], >>[ 1.0001], >>[ 1.0001], >>[ 1.0001]]) >> >> PYTHON (long input): >> array([[-1.0001], >>[-1.0001], >>[-1.0001], >>[-0.], >>[ 1.0001], >>[ 1.0001], >>[ 1.0001]]) >> >> // >> >> import analytics.MSC >> >> import java.util.Calendar >> import java.text.SimpleDateFormat >> import scala.collection.mutable >> import scala.collection.JavaConversions._ >> import org.apache.spark.SparkContext._ >> import org.apache.spark.mllib.classification.SVMWithSGD >> import org.apache.spark.mllib.regression.LabeledPoint >> import org.apache.spark.mllib.optimization.L1Updater >> import com.datastax.bdp.spark.connector.CassandraConnector >> import com.datastax.bdp.spark.SparkContextCassandraFunctions._ >> >> val sc = MSC.sc >> val lg = MSC.logger >> >> //val s_users_double_2 = Seq( >> // (0.0,Seq(0.0, 0.0, 0.0)), >> // (0.0,Seq(0.0, 0.0, 0.0)), >> // (0.0,Seq(0.0, 0.0, 0.0)), >> // (1.0,Seq(1.0, 1.0, 1.0)), >> // (1.0,Seq(1.0, 1.0, 1.0)), >> // (1.0,Seq(1.0, 1.0, 1.0)) >> //) >> val s_users_double_2 = Seq( >> (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)), >> (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)), >> (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)), >&g
Re: return probability \ confidence instead of actual class
5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0] ] clf = svm.SVC() clf.fit(X, Y) svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False) clf.decision_function(T) /// On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri wrote: > For multi-class you can use the same SVMWithSGD (for binary > classification) with One-vs-All approach constructing respective training > corpuses consisting one Class i as positive samples and Rest of the classes > as negative one, and then use the same method provided by Aris as a measure > of how far Class i is from the decision boundary. > > On Wed, Sep 24, 2014 at 4:06 PM, Aris wrote: > >> Χαίρε Αδαμάντιε Κοραήέαν είναι πράγματι το όνομα σου.. >> >> Just to follow up on Liquan, you might be interested in removing the >> thresholds, and then treating the predictions as a probability from 0..1 >> inclusive. SVM with the linear kernel is a straightforward linear >> classifier -- so you with the model.clearThreshold() you can just get the >> raw predicted scores, removing the threshold which simple translates that >> into a positive/negative class. >> >> API is here >> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel >> >> Enjoy! >> Aris >> >> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei wrote: >> >>> HI Adamantios, >>> >>> For your first question, after you train the SVM, you get a model with a >>> vector of weights w and an intercept b, point x such that w.dot(x) + b = 1 >>> and w.dot(x) + b = -1 are points that on the decision boundary. The >>> quantity w.dot(x) + b for point x is a confidence measure of >>> classification. >>> >>> Code wise, suppose you trained your model via >>> val model = SVMWithSGD.train(...) >>> >>> and you can set a threshold by calling >>> >>> model.setThreshold(your threshold here) >>> >>> to set the threshold that separate positive predictions from negative >>> predictions. >>> >>> For more info, please take a look at >>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel >>> >>> For your second question, SVMWithSGD only supports binary >>> classification. >>> >>> Hope this helps, >>> >>> Liquan >>> >>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais < >>> adamantios.cor...@gmail.com> wrote: >>> >>>> Nobody? >>>> >>>> If that's not supported already, can please, at least, give me a few >>>> hints on how to implement it? >>>> >>>> Thanks! >>>> >>>> >>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais < >>>> adamantios.cor...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I am working with the SVMWithSGD classification algorithm on Spark. 
>>>>> It works fine for me, however, I would like to distinguish the instances that are classified with a high confidence from those with a low one. How do we define the threshold here? Ultimately, I want to keep only those for which the algorithm is very *very* certain about its decision! How to do that? Is this feature supported already by any MLlib algorithm? What if I had multiple categories?
>>>>>
>>>>> Any input is highly appreciated!
>>>
>>> --
>>> Liquan Pei
>>> Department of Physics
>>> University of Massachusetts Amherst
Re: return probability \ confidence instead of actual class
Nobody?

If that's not supported already, can you please, at least, give me a few hints on how to implement it?

Thanks!

On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:
> Hi,
>
> I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me, however, I would like to distinguish the instances that are classified with a high confidence from those with a low one. How do we define the threshold here? Ultimately, I want to keep only those for which the algorithm is very *very* certain about its decision! How to do that? Is this feature supported already by any MLlib algorithm? What if I had multiple categories?
>
> Any input is highly appreciated!
return probability \ confidence instead of actual class
Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me, however, I would like to distinguish the instances that are classified with a high confidence from those with a low one. How do we define the threshold here? Ultimately, I want to keep only those for which the algorithm is very *very* certain about its decision! How to do that? Is this feature supported already by any MLlib algorithm? What if I had multiple categories?

Any input is highly appreciated!