Best range of parameters for grid search?
I would like to run a naive implementation of grid search with MLlib, but I am a bit confused about choosing the 'best' range of parameters. Obviously, I do not want to waste too many resources on a combination of parameters that will probably not give a better model. Any suggestions from your experience?

    val intercept: List[Boolean] = List(false)
    val classes: List[Int] = List(2)
    val validate: List[Boolean] = List(true)
    val tolerance: List[Double] = List(0.0001, 0.001, 0.01, 0.1, 1.0)
    val gradient: List[Gradient] = List(new LogisticGradient(), new LeastSquaresGradient(), new HingeGradient())
    val corrections: List[Int] = List(5, 10, 15)
    val iters: List[Int] = List(1, 10, 100, 1000)
    val regparam: List[Double] = List(0.0, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0)
    val updater: List[Updater] = List(new SimpleUpdater(), new L1Updater(), new SquaredL2Updater())

    val combinations = for (
      a <- intercept; b <- classes; c <- validate; d <- tolerance; e <- gradient;
      f <- corrections; g <- iters; h <- regparam; i <- updater
    ) yield (a, b, c, d, e, f, g, h, i)

    for ((interceptS, classesS, validateS, toleranceS, gradientS,
          correctionsS, itersS, regParamS, updaterS) <- combinations.take(3)) {
      val lr: LogisticRegressionWithLBFGS = new LogisticRegressionWithLBFGS().
        setIntercept(addIntercept = interceptS).
        setNumClasses(numClasses = classesS).
        setValidateData(validateData = validateS)
      lr.optimizer.
        setConvergenceTol(tolerance = toleranceS).
        setGradient(gradient = gradientS).
        setNumCorrections(corrections = correctionsS).
        setNumIterations(iters = itersS).
        setRegParam(regParam = regParamS).
        setUpdater(updater = updaterS)
    }

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
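The full Cartesian product above is huge (every extra list multiplies the number of models to fit), so one cheap way to bound the cost is to evaluate only a randomly sampled subset of the grid (random search). A minimal, pure-Scala sketch of that idea, using a shortened grid and an arbitrary budget of 10 settings:

```scala
import scala.util.Random

// Shortened stand-ins for three of the parameter lists above.
val tolerance = List(0.0001, 0.001, 0.01, 0.1, 1.0)
val iters     = List(1, 10, 100, 1000)
val regparam  = List(0.0, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0)

// The exhaustive grid: 5 * 4 * 8 = 160 combinations.
val allCombinations =
  for (t <- tolerance; i <- iters; r <- regparam) yield (t, i, r)

// Evaluate only a fixed budget of randomly chosen settings instead of all 160.
val budget  = 10
val rng     = new Random(42) // fixed seed, so the sample is reproducible
val sampled = rng.shuffle(allCombinations).take(budget)
```

Each element of `sampled` can then be fed to the same model-building loop as in the post above.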
Re: Grid Search using Spark MLLib Pipelines
Great, I like your second solution. But how can I make sure that cvModel holds the best model overall (as opposed to the last one that was tried out by the grid search)? In addition, do you have an idea how to collect the average error of each grid search combination (here 1x1x1)?

On 12/08/2016 08:59 μμ, Bryan Cutler wrote:

You will need to cast bestModel to include the MLWritable trait. The class Model does not mix it in by default. For instance:

    cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path")

Alternatively, you could save the CV model directly, which takes care of this:

    cvModel.save("/my/path")

On Fri, Aug 12, 2016 at 9:17 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

Hi,

Assuming that I have run the following pipeline and have got the best logistic regression model, how can I then save that model for later use? The following command throws an error:

    cvModel.bestModel.save("/my/path")

Also, is it possible to get the error (a collection of them) for each combination of parameters? I am using Spark 1.6.2.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

    val lr = new LogisticRegression()
    val pipeline = new Pipeline().
      setStages(Array(lr))
    val paramGrid = new ParamGridBuilder().
      addGrid(lr.elasticNetParam, Array(0.1)).
      addGrid(lr.maxIter, Array(10)).
      addGrid(lr.regParam, Array(0.1)).
      build()
    val cv = new CrossValidator().
      setEstimator(pipeline).
      setEvaluator(new BinaryClassificationEvaluator).
      setEstimatorParamMaps(paramGrid).
      setNumFolds(2)
    val cvModel = cv.fit(training)
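On the two follow-up questions: in Spark ML, `cvModel.bestModel` is fit with the parameter map that scored best across the whole grid (not the last one tried), and `CrossValidatorModel.avgMetrics` holds one cross-validated metric per parameter map, in the same order as `getEstimatorParamMaps`. A sketch of pairing them up, with plain-Scala stand-ins (the metric values and labels below are made up for illustration):

```scala
// Stand-ins for cvModel.avgMetrics and the param grid; with a real model the
// pairing is cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).
val avgMetrics = Array(0.81, 0.86, 0.79) // one metric per parameter map
val paramMaps  = Array("regParam=0.01", "regParam=0.1", "regParam=1.0")

val metricsByParams = paramMaps.zip(avgMetrics)

// For an evaluator where larger is better (e.g. areaUnderROC):
val (bestParams, bestMetric) = metricsByParams.maxBy(_._2)
```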
Grid Search using Spark MLLib Pipelines
Hi,

Assuming that I have run the following pipeline and have got the best logistic regression model, how can I then save that model for later use? The following command throws an error:

    cvModel.bestModel.save("/my/path")

Also, is it possible to get the error (a collection of them) for each combination of parameters? I am using Spark 1.6.2.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}

    val lr = new LogisticRegression()
    val pipeline = new Pipeline().
      setStages(Array(lr))
    val paramGrid = new ParamGridBuilder().
      addGrid(lr.elasticNetParam, Array(0.1)).
      addGrid(lr.maxIter, Array(10)).
      addGrid(lr.regParam, Array(0.1)).
      build()
    val cv = new CrossValidator().
      setEstimator(pipeline).
      setEvaluator(new BinaryClassificationEvaluator).
      setEstimatorParamMaps(paramGrid).
      setNumFolds(2)
    val cvModel = cv.fit(training)
Re: How to compute the probability of each class in Naive Bayes
Great. So, provided that model.theta represents the log-probabilities (and hence the result of brzPi + brzTheta * testData.toBreeze is a big number too), how can I get back the non-log probabilities, which - apparently - are bounded between 0.0 and 1.0?

// Adamantios

On Tue, Sep 1, 2015 at 12:57 PM, Sean Owen <so...@cloudera.com> wrote:

(pedantic: it's the log-probabilities)

On Tue, Sep 1, 2015 at 10:48 AM, Yanbo Liang <yblia...@gmail.com> wrote:

Actually brzPi + brzTheta * testData.toBreeze is the probabilities of the input Vector on each class; however, it's a Breeze Vector. Pay attention that the index of this Vector needs to map to the corresponding label index.

2015-08-28 20:38 GMT+08:00 Adamantios Corais <adamantios.cor...@gmail.com>:

Hi,

I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark, but I have to use Spark 1.1.0. Any help is appreciated.

    override def predict(testData: Vector): Double = {
      labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
    }

https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

// Adamantios
Re: How to compute the probability of each class in Naive Bayes
Thanks Sean. As far as I can see, the probabilities are NOT normalized; the denominator isn't implemented in either v1.1.0 or v1.5.0 (by denominator, I refer to the probability of feature X). So, for a given lambda, how do I compute the denominator? FYI: https://github.com/apache/spark/blob/v1.5.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

// Adamantios

On Thu, Sep 10, 2015 at 7:03 PM, Sean Owen <so...@cloudera.com> wrote:

The log probabilities are unlikely to be very large, though the probabilities may be very small. The direct answer is to exponentiate brzPi + brzTheta * testData.toBreeze -- apply exp(x).

I have forgotten whether the probabilities are normalized already though. If not, you'll have to normalize to get them to sum to 1 and be real class probabilities. This is better done in log space though.
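Sean's suggestion (exponentiate, then normalize, done in log space) can be sketched in pure Scala. This is not the Spark implementation, just the standard log-sum-exp trick applied to a vector of scores like brzPi + brzTheta * testData.toBreeze:

```scala
import scala.math.{exp, log}

// Turn unnormalized log-probabilities into probabilities in [0, 1] that sum to 1.
// Subtracting the max first keeps exp() from underflowing for very negative inputs.
def normalizeLogProbs(logProbs: Array[Double]): Array[Double] = {
  val maxLog = logProbs.max
  val logSum = maxLog + log(logProbs.map(lp => exp(lp - maxLog)).sum) // log of the normalizer
  logProbs.map(lp => exp(lp - logSum))
}
```

For example, `normalizeLogProbs(Array(-1000.0, -1001.0))` returns valid probabilities even though `exp(-1000)` on its own underflows to 0.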
How to determine a good set of parameters for a ML grid search task?
I have a sparse dataset of size 775946 x 845372. I would like to perform a grid search in order to tune the parameters of my LogisticRegressionWithSGD model. I have noticed that building each model takes about 300 to 400 seconds, which means that in order to try all possible combinations of parameters I would have to wait for about 24 hours. Most importantly, though, I am not sure whether the following combinations make sense at all. So, how should I pick those parameters more wisely, and in a way that takes less time?

    val numIterations = Seq(1, 5, 10, 50, 100, 500, 1000, 5000)
    val stepSizes = Seq(1, 5, 10, 50, 100, 500, 1000, 5000)
    val miniBatchFractions = Seq(1.0)
    val updaters = Seq(new SimpleUpdater, new SquaredL2Updater, new L1Updater)

Any advice is appreciated.

// Adamantios
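One common rule of thumb that shrinks grids like these: search step size and regularization on a logarithmic scale (a few powers of 10) first, then refine around the best value with a second, smaller grid. A tiny helper to generate such a scale (the exponent ranges below are just an example, not a recommendation for this dataset):

```scala
// Powers of 10 from 10^minExp to 10^maxExp, inclusive.
def logSpaced(minExp: Int, maxExp: Int): Seq[Double] =
  (minExp to maxExp).map(e => math.pow(10.0, e))

val stepSizes = logSpaced(-2, 2) // 0.01, 0.1, 1.0, 10.0, 100.0
val regParams = logSpaced(-4, 0) // 0.0001 ... 1.0
```

A coarse 5 x 5 log-scale grid is 25 models instead of the 8 x 8 x 1 x 3 = 192 above, and the coarse-then-fine pass usually lands in the same neighborhood.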
How to compute the probability of each class in Naive Bayes
Hi,

I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark, but I have to use Spark 1.1.0. Any help is appreciated.

    override def predict(testData: Vector): Double = {
      labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
    }

https://github.com/apache/spark/blob/v1.1.0/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala

// Adamantios
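For reference, the change amounts to returning the whole score vector instead of its argmax. A plain-Scala sketch of the per-class computation (arrays stand in for the Breeze vector/matrix types, so this is illustrative, not the actual 1.1.0 code):

```scala
// score(c) = pi(c) + theta(c) . x  -- the quantity predict() takes the argmax of.
// Returning the full array exposes one (log-scale) score per class.
def classScores(pi: Array[Double], theta: Array[Array[Double]],
                x: Array[Double]): Array[Double] =
  pi.indices.map { c =>
    pi(c) + theta(c).zip(x).map { case (t, xi) => t * xi }.sum
  }.toArray
```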
Re: How to binarize data in spark
I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, provided that I can only use MLlib 1.2?

    val data = test11.map(x => ((x(0), x(1)), x(2))).groupByKey().map(x => (x._1, x._2.toArray)).map { x =>
      var lt: Array[Double] = new Array[Double](test12.size)
      val id = x._1._1
      val cl = x._1._2
      val dt = x._2
      var i = -1
      test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
      val vs = Vectors.dense(lt)
      (id, cl, vs)
    }

// Adamantios

On Fri, Aug 7, 2015 at 8:36 AM, Yanbo Liang <yblia...@gmail.com> wrote:

I think you want to flatten the 1M products to a vector of 1M elements, of which most are zero. It looks like HashingTF (https://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf) can help you.

2015-08-07 11:02 GMT+08:00 praveen S <mylogi...@gmail.com>:

Use StringIndexer in MLlib 1.4: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/ml/feature/StringIndexer.html

On Thu, Aug 6, 2015 at 8:49 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

I have a set of data based on which I want to create a classification model. Each row has the following form:

    user1,class1,product1
    user1,class1,product2
    user1,class1,product5
    user2,class1,product2
    user2,class1,product5
    user3,class2,product1
    etc.

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib), BUT in order to apply that function I have to create the dense vectors (with the 0s) first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?

// Adamantios
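Since almost all of each 1M-element row is zeros, the dense intermediate array is the main cost. A pure-Scala sketch of building the sparse form directly (the tiny `products` vocabulary is hypothetical; with MLlib, the resulting pair is what `Vectors.sparse(size, indices, values)` expects):

```scala
// Index each product once, then map a user's product set to sorted indices.
val products = Vector("product1", "product2", "product5") // hypothetical vocabulary
val productIndex: Map[String, Int] = products.zipWithIndex.toMap

def toSparse(owned: Set[String]): (Array[Int], Array[Double]) = {
  val indices = owned.flatMap(productIndex.get).toArray.sorted // unknown products dropped
  (indices, Array.fill(indices.length)(1.0))
}
```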
How to binarize data in spark
I have a set of data based on which I want to create a classification model. Each row has the following form:

    user1,class1,product1
    user1,class1,product2
    user1,class1,product5
    user2,class1,product2
    user2,class1,product5
    user3,class2,product1
    etc.

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib), BUT in order to apply that function I have to create the dense vectors (with the 0s) first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?

// Adamantios
Cannot build learning spark project
Hi,

I am trying to build this project (https://github.com/databricks/learning-spark) with mvn package. This should work out of the box, but unfortunately it doesn't. In fact, I get the following error:

    $ mvn package -X
    Apache Maven 3.0.5
    Maven home: /usr/share/maven
    Java version: 1.7.0_76, vendor: Oracle Corporation
    Java home: /usr/lib/jvm/java-7-oracle/jre
    Default locale: en_US, platform encoding: UTF-8
    OS name: linux, version: 3.13.0-45-generic, arch: amd64, family: unix
    [INFO] Error stacktraces are turned on.
    [DEBUG] Reading global settings from /usr/share/maven/conf/settings.xml
    [DEBUG] Reading user settings from /home/adam/.m2/settings.xml
    [DEBUG] Using local repository at /home/adam/.m2/repository
    [DEBUG] Using manager EnhancedLocalRepositoryManager with priority 10 for /home/adam/.m2/repository
    [INFO] Scanning for projects...
    [DEBUG] Extension realms for project com.oreilly.learningsparkexamples:java:jar:0.0.2: (none)
    [DEBUG] Looking up lifecyle mappings for packaging jar from ClassRealm[plexus.core, parent: null]
    [ERROR] The build could not read 1 project -> [Help 1]
    org.apache.maven.project.ProjectBuildingException: Some problems were encountered while processing the POMs:
    [ERROR] 'dependencies.dependency.artifactId' for org.scalatest:scalatest_${scala.binary.version}:jar with value 'scalatest_${scala.binary.version}' does not match a valid id pattern. @ line 101, column 19
        at org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:363)
        at org.apache.maven.DefaultMaven.collectProjects(DefaultMaven.java:636)
        at org.apache.maven.DefaultMaven.getProjectsForMavenReactor(DefaultMaven.java:585)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:234)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
    [ERROR]
    [ERROR] The project com.oreilly.learningsparkexamples:java:0.0.2 (/home/adam/learning-spark/learning-spark-master/pom.xml) has 1 error
    [ERROR] 'dependencies.dependency.artifactId' for org.scalatest:scalatest_${scala.binary.version}:jar with value 'scalatest_${scala.binary.version}' does not match a valid id pattern. @ line 101, column 19
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

As a further step, I would also like to know how to build it against DataStax Enterprise 4.6.2. Any help is appreciated!

// Adamantios
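The error itself says that ${scala.binary.version} is never resolved when Maven reads the POM, so the scalatest artifactId is left with a literal `${...}` in it. One possible workaround, assuming the project targets Scala 2.10 (check the repository's profiles or README before hardcoding this), is to define the property yourself, either on the command line (`mvn package -Dscala.binary.version=2.10`) or in the POM:

```xml
<!-- assumption: added to the existing <properties> block of pom.xml;
     2.10 is a guess and must match the Scala version the project builds against -->
<properties>
  <scala.binary.version>2.10</scala.binary.version>
</properties>
```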
How do I alter the combination of keys that exit the Spark shell?
Hi,

I want to change the default key combination that exits the Spark shell (i.e. Ctrl + C) to something else, such as Ctrl + H. How can I do that? Thank you in advance.

// Adamantios
Spark (SQL) as OLAP engine
Hi,

After some research I have decided that Spark (SQL) would be ideal for building an OLAP engine. My goal is to push aggregated data (to Cassandra or some other low-latency data store) and then be able to project the results on a web page (web service). New data will be added (aggregated) only once a day. On the other hand, the web service must be able to run some fixed(?) queries (either on Spark or Spark SQL) at any time and plot the results with D3.js.

Note that I can already achieve similar speeds while in REPL mode by caching the data. Therefore, I believe that my problem should be re-phrased as follows: how can I automatically cache the data once a day and make them available to a web service that is capable of running any Spark or Spark SQL statement in order to plot the results with D3.js?

Note that I already have some experience with Spark (and Spark SQL) as well as D3.js, but none at all with OLAP engines (at least in their traditional form). Any ideas or suggestions?

// Adamantios
Supported Notebooks (and other viz tools) for Spark 0.9.1?
Hi,

I am using Spark 0.9.1 and I am looking for a proper viz tool that supports that specific version. As far as I have seen, all the relevant tools (e.g. spark-notebook, zeppelin-project, etc.) only support 1.1 or 1.2; there is no mention of older versions of Spark. Any ideas or suggestions?

// Adamantios
which is the recommended workflow engine for Apache Spark jobs?
I have some previous experience with Apache Oozie from when I was developing in Apache Pig. Now I am working exclusively with Apache Spark, and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use / recommend?
Re: which is the recommended workflow engine for Apache Spark jobs?
Hi again,

As Jimmy said, any thoughts about Luigi and/or any other tools? So far it seems that Oozie is the best and only choice here. Is that right?

On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain <jimmy.mcerl...@gmail.com> wrote:

I have used Oozie for all our workflows with Spark apps, but you will have to use a Java action as the workflow element. I am interested in anyone's experience with Luigi and/or any other tools.

On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

I have some previous experience with Apache Oozie from when I was developing in Apache Pig. Now I am working exclusively with Apache Spark, and I am looking for a tool with similar functionality. Is Oozie recommended? What about Luigi? What do you use / recommend?
Re: return probability \ confidence instead of actual class
Thank you Sean. I'll try to do it externally as you suggested; however, can you please give me some hints on how to do that? In fact, where can I find the 1.2 implementation you just mentioned? Thanks!

On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen <so...@cloudera.com> wrote:

Plain old SVMs don't produce an estimate of class probabilities; predict_proba() does some additional work to estimate class probabilities from the SVM output. Spark does not implement this right now. Spark implements the equivalent of decision_function (the wTx + b bit) but does not expose it, and instead gives you predict(), which returns 0 or 1 depending on whether the decision function exceeds the specified threshold. Yes, you can roll your own, just like you did, to calculate the decision function from weights and intercept. I suppose it would be nice to expose it (do I hear a PR?), but it's not hard to do externally. You'll have to do this anyway if you're on anything earlier than 1.2.

On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

OK, let me rephrase my question once again. Python-wise, I prefer .predict_proba(X) over .decision_function(X) since it is easier for me to interpret the results. As far as I can see, the latter functionality is already implemented in Spark (well, in version 0.9.2, for example, I have to compute the dot product on my own, otherwise I get 0 or 1), but the former is not implemented (yet!). What should I do / how can I implement that in Spark as well? What are the required inputs here, and how does the formula look?

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen <so...@cloudera.com> wrote:

It looks like you are directly computing the SVM decision function in both cases:

    val predictions2 = m_users_double.map { point =>
      point.zip(weights).map(a => a._1 * a._2).sum + intercept
    }.cache()

    clf.decision_function(T)

This does not give you +1/-1 in SVMs (well... not for most points, which will be outside the margin around the separating hyperplane). You can use the predict() function in SVMModel, which will give you 0 or 1 (rather than +/- 1, but that's just a differing convention) depending on the sign of the decision function. I don't know if this was in 0.9. At the moment I assume you saw small values of the decision function in scikit because of the radial basis function.
Re: return probability \ confidence instead of actual class
OK, let me rephrase my question once again. Python-wise, I prefer .predict_proba(X) over .decision_function(X) since it is easier for me to interpret the results. As far as I can see, the latter functionality is already implemented in Spark (well, in version 0.9.2, for example, I have to compute the dot product on my own, otherwise I get 0 or 1), but the former is not implemented (yet!). What should I do / how can I implement that in Spark as well? What are the required inputs here, and how does the formula look?

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen <so...@cloudera.com> wrote:

It looks like you are directly computing the SVM decision function in both cases:

    val predictions2 = m_users_double.map { point =>
      point.zip(weights).map(a => a._1 * a._2).sum + intercept
    }.cache()

    clf.decision_function(T)

This does not give you +1/-1 in SVMs (well... not for most points, which will be outside the margin around the separating hyperplane). You can use the predict() function in SVMModel, which will give you 0 or 1 (rather than +/- 1, but that's just a differing convention) depending on the sign of the decision function. I don't know if this was in 0.9. At the moment I assume you saw small values of the decision function in scikit because of the radial basis function.

On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri <sunny.k...@gmail.com> wrote:

Not familiar with the scikit SVM implementation (and I assume you are using LinearSVC). To figure out an optimal decision boundary based on the scores obtained, you can use an ROC curve, varying your thresholds.
Re: return probability \ confidence instead of actual class
[The beginning of this message was truncated in the archive; the surviving fragment shows the tail of the scikit-learn example data (rows of 1.0s with a single 2.0 near the end of each row), followed by:]

    clf = svm.SVC()
    clf.fit(X, Y)
    svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
            gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
            random_state=None, shrinking=True, tol=0.001, verbose=False)
    clf.decision_function(T)

On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <sunny.k...@gmail.com> wrote:

For multi-class you can use the same SVMWithSGD (for binary classification) with a One-vs-All approach: construct the respective training corpora with class i as positive samples and the rest of the classes as negative, and then use the same method provided by Aris as a measure of how far class i is from the decision boundary.

On Wed, Sep 24, 2014 at 4:06 PM, Aris <arisofala...@gmail.com> wrote:

Greetings Adamantios Korais, if that is indeed your name. Just to follow up on Liquan, you might be interested in removing the thresholds and then treating the predictions as a probability from 0..1 inclusive. SVM with the linear kernel is a straightforward linear classifier, so with model.clearThreshold() you can just get the raw predicted scores, removing the threshold which simply translates them into a positive/negative class. The API is here: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

Enjoy!
Aris

On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <liquan...@gmail.com> wrote:

Hi Adamantios,

For your first question: after you train the SVM, you get a model with a vector of weights w and an intercept b. Points x such that w.dot(x) + b = 1 and w.dot(x) + b = -1 are the points on the decision boundary. The quantity w.dot(x) + b for a point x is a confidence measure of the classification. Code-wise, suppose you trained your model via

    val model = SVMWithSGD.train(...)

You can then set a threshold by calling

    model.setThreshold(yourThreshold)

to set the threshold that separates positive predictions from negative predictions. For more info, please take a look at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

For your second question: SVMWithSGD only supports binary classification.

Hope this helps,
Liquan

On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

Nobody? If that's not supported already, can someone please at least give me a few hints on how to implement it? Thanks!

On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me; however, I would like to distinguish the instances that are classified with high confidence from those with low confidence. How do we define the threshold here? Ultimately, I want to keep only those for which the algorithm is very, very certain about its decision! How can I do that? Is this feature already supported by any MLlib algorithm? What if I had multiple categories? Any input is highly appreciated!

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
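The "keep only the very confident predictions" idea can be sketched on top of clearThreshold(): treat the raw margin w.dot(x) + b as the confidence and abstain on points whose absolute margin falls below a cutoff. A pure-Scala illustration (the cutoff value is arbitrary here and would normally be tuned):

```scala
// Raw SVM decision value: w . x + b (what predict() returns once the
// threshold has been cleared).
def decisionValue(weights: Array[Double], intercept: Double,
                  x: Array[Double]): Double =
  weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept

// Label a point only when the margin is large enough; otherwise abstain (None).
def confidentLabel(weights: Array[Double], intercept: Double,
                   x: Array[Double], cutoff: Double): Option[Double] = {
  val margin = decisionValue(weights, intercept, x)
  if (math.abs(margin) >= cutoff) Some(if (margin > 0) 1.0 else 0.0)
  else None
}
```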
Re: return probability \ confidence instead of actual class
Nobody? If that's not supported already, can someone please at least give me a few hints on how to implement it? Thanks!

On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <adamantios.cor...@gmail.com> wrote:

Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me; however, I would like to distinguish the instances that are classified with high confidence from those with low confidence. How do we define the threshold here? Ultimately, I want to keep only those for which the algorithm is very, very certain about its decision! How can I do that? Is this feature already supported by any MLlib algorithm? What if I had multiple categories? Any input is highly appreciated!
return probability \ confidence instead of actual class
Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me; however, I would like to distinguish the instances that are classified with high confidence from those with low confidence. How do we define the threshold here? Ultimately, I want to keep only those for which the algorithm is very, very certain about its decision! How can I do that? Is this feature already supported by any MLlib algorithm? What if I had multiple categories? Any input is highly appreciated!