Re: return probability \ confidence instead of actual class
Thank you, Sean. I'll try to do it externally as you suggested; however, could you please give me some hints on how to do that? In fact, where can I find the 1.2 implementation you just mentioned? Thanks!

On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen so...@cloudera.com wrote:

Plain old SVMs don't produce an estimate of class probabilities; predict_proba() does some additional work to estimate class probabilities from the SVM output. Spark does not implement this right now. Spark implements the equivalent of decision_function (the w^T x + b bit) but does not expose it; instead it gives you predict(), which returns 0 or 1 depending on whether the decision function exceeds the specified threshold. Yes, you can roll your own, just like you did, to calculate the decision function from the weights and intercept. I suppose it would be nice to expose it (do I hear a PR?) but it's not hard to do externally. You'll have to do this anyway if you're on anything earlier than 1.2.
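Concretely, "rolling your own" decision function as Sean describes might look like the following minimal Scala sketch. It assumes the weights and intercept have already been pulled out of the trained model as plain values; the helper name and argument types are illustrative, not part of MLlib:

    // Raw SVM decision function w^T x + b, computed outside of MLlib.
    // `x` is one feature vector; `weights` and `intercept` come from the trained model.
    def decisionFunction(x: Array[Double], weights: Array[Double], intercept: Double): Double =
      x.zip(weights).map { case (xi, wi) => xi * wi }.sum + intercept

Larger positive values mean the point lies further on the positive side of the separating hyperplane; values near zero mean the classification is least confident.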
Re: return probability \ confidence instead of actual class
OK, let me rephrase my question once again. Python-wise, I prefer .predict_proba(X) to .decision_function(X), since its results are easier for me to interpret. As far as I can see, the latter functionality is already implemented in Spark (well, in version 0.9.2, for example, I have to compute the dot product on my own, otherwise I get 0 or 1), but the former is not implemented (yet!). What should I do \ how can I implement that one in Spark as well? What are the required inputs here, and what does the formula look like?

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen so...@cloudera.com wrote:

It looks like you are directly computing the SVM decision function in both cases:

    val predictions2 = m_users_double.map { point =>
      point.zip(weights).map(a => a._1 * a._2).sum + intercept
    }.cache()

    clf.decision_function(T)

This does not give you +1/-1 in SVMs (well... not for most points, which will be outside the margin around the separating hyperplane). You can use the predict() function in SVMModel, which will give you 0 or 1 (rather than +/-1, but that's just a differing convention) depending on the sign of the decision function. I don't know if this was in 0.9. At the moment I assume you saw small values of the decision function in scikit because of the radial basis function.
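For what it's worth, predict_proba() is not a fixed formula over the decision values: scikit-learn fits a Platt-scaling sigmoid p(y=1|x) = 1 / (1 + exp(A*f(x) + B)) on top of the decision function f(x), estimating A and B by cross-validation on the training data. A minimal Scala sketch of just the final mapping, assuming the raw margin is already computed; the default a and b here are placeholders, not fitted values:

    // Platt scaling: squash a raw SVM margin into (0, 1).
    // a and b must be fit to held-out data (scikit-learn does this internally);
    // -1.0 and 0.0 below are placeholders, not calibrated values.
    def plattProbability(margin: Double, a: Double = -1.0, b: Double = 0.0): Double =
      1.0 / (1.0 + math.exp(a * margin + b))

With a = -1 and b = 0 this degenerates to the plain logistic sigmoid of the margin, which is a common rough-and-ready confidence score but not a calibrated probability.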
Re: return probability \ confidence instead of actual class
Not familiar with the scikit SVM implementation (and I assume you are using LinearSVC). To figure out an optimal decision boundary based on the scores obtained, you can use an ROC curve, varying your thresholds.

On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais adamantios.cor...@gmail.com wrote:

Well, apparently, the above Python set-up is wrong. Please consider the following set-up, which DOES use the 'linear' kernel... And the question remains the same: how should I interpret the Spark results (or why are the Spark results NOT bounded between -1 and 1)?
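On a Spark version that has the mllib.evaluation package (1.0+, so not the 0.9.2 used elsewhere in this thread), the threshold sweep Sunny describes might look roughly like this; scoreAndLabels is a hypothetical RDD of (raw score, true label) pairs, e.g. produced with clearThreshold() and predict():

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // Sweep candidate thresholds over the raw scores and trace out the ROC curve.
    def printRoc(scoreAndLabels: RDD[(Double, Double)]): Unit = {
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      println(s"area under ROC = ${metrics.areaUnderROC()}")
      // Each point is (false positive rate, true positive rate) at one threshold.
      metrics.roc().collect().foreach { case (fpr, tpr) => println(s"$fpr, $tpr") }
    }

On 0.9.x the same curve can be computed by hand: sort the scores and, for each candidate threshold, count the resulting true and false positives.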
Re: return probability \ confidence instead of actual class
Hi again,

Finally, I found the time to play around with your suggestions. Unfortunately, I noticed some unusual behavior in the MLlib results, which becomes more obvious when I compare them against their scikit-learn equivalents. Note that I am currently using Spark 0.9.2. Long story short: I find it difficult to interpret the results. The scikit-learn SVM always returns a value between -1 and 1, which makes it easy for me to set up a threshold in order to keep only the most significant classifications (this is the case for both the short and the long input vectors). On the other hand, Spark MLlib makes it impossible to interpret the results: they are hardly ever bounded between -1 and +1, hence it is impossible to choose a good cut-off value, and so the results are of no practical use. And here is the strangest thing ever: although (it seems that) MLlib does NOT generate the right weights and intercept, when I feed MLlib with the weights and intercept from scikit-learn, the results become pretty accurate! Any ideas about what is happening? Any suggestion is highly appreciated.

PS: to make things easier, I have quoted both of my implementations as well as the results, below.

//
SPARK (short input):
training_error: Double = 0.0
res2: Array[Double] = Array(-1.4420684459128205E-19, -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749, 0.7498, 0.7498, 0.7498)

SPARK (long input):
training_error: Double = 0.0
res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241, -0.782207630902241, 0.9522394329769612, 2.6866864968561632, 2.6866864968561632, 2.6866864968561632)

PYTHON (short input):
array([[-1.0001], [-1.0001], [-1.0001], [-0.], [ 1.0001], [ 1.0001], [ 1.0001]])

PYTHON (long input):
array([[-1.0001], [-1.0001], [-1.0001], [-0.], [ 1.0001], [ 1.0001], [ 1.0001]])
//

import analytics.MSC
import java.util.Calendar
import java.text.SimpleDateFormat
import scala.collection.mutable
import scala.collection.JavaConversions._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.optimization.L1Updater
import com.datastax.bdp.spark.connector.CassandraConnector
import com.datastax.bdp.spark.SparkContextCassandraFunctions._

val sc = MSC.sc
val lg = MSC.logger

// Short input (3 features per point):
//val s_users_double_2 = Seq(
//  (0.0, Seq(0.0, 0.0, 0.0)),
//  (0.0, Seq(0.0, 0.0, 0.0)),
//  (0.0, Seq(0.0, 0.0, 0.0)),
//  (1.0, Seq(1.0, 1.0, 1.0)),
//  (1.0, Seq(1.0, 1.0, 1.0)),
//  (1.0, Seq(1.0, 1.0, 1.0))
//)

// Long input (42 features per point):
val s_users_double_2 = Seq(
  (0.0, Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
  (0.0, Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
  (0.0, Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
  (1.0, Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
  (1.0, Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
  (1.0, Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
)

val s_users_double = sc.parallelize(s_users_double_2)

val s_users_parsed = s_users_double.map { line =>
  LabeledPoint(line._1, line._2.toArray)
}.cache()

val iterations = 100
val model = SVMWithSGD.train(s_users_parsed, iterations)

val predictions1 = s_users_parsed.map { point =>
  (point.label, model.predict(point.features))
}.cache()

val training_error = predictions1.filter(r => r._1 != r._2).count().toDouble / s_users_parsed.count()

val TP = predictions1.map(s => if (s._1 == 1.0 && s._2 == 1.0) true else false).filter(t => t).count()
val FP = predictions1.map(s => if (s._1 == 0.0 && s._2 == 1.0) true else false).filter(t => t).count()
val TN = predictions1.map(s => if (s._1 == 0.0 && s._2 == 0.0) true else false).filter(t => t).count()
val FN = predictions1.map(s => if (s._1 == 1.0 && s._2 == 0.0) true else false).filter(t => t).count()
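The res2 values quoted at the top of this message presumably come from a step that the archive truncated: computing the raw decision values manually, since 0.9.2's predict() only returns 0 or 1. A reconstruction of what that step likely looked like (not present in the truncated message):

    // Raw decision values w^T x + b for every training point.
    // In 0.9.2 both point.features and model.weights are Array[Double], so zip works directly.
    val res2 = s_users_parsed.map { point =>
      point.features.zip(model.weights).map { case (x, w) => x * w }.sum + model.intercept
    }.collect()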
Re: return probability \ confidence instead of actual class
One difference I can find is that you may have different kernel functions for your training: in Spark you end up using a linear kernel, whereas in scikit you are using the rbf kernel. That can explain the difference in the coefficients you are getting.
Re: return probability \ confidence instead of actual class
Greetings, Adamantios Corais (if that is indeed your name)... Just to follow up on Liquan: you might be interested in removing the threshold and then treating the raw predictions as a confidence measure. SVM with the linear kernel is a straightforward linear classifier, so with model.clearThreshold() you can get the raw predicted scores, removing the threshold that simply translates them into a positive/negative class. The API is here: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel Enjoy! Aris

On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei liquan...@gmail.com wrote:

Hi Adamantios, For your first question: after you train the SVM, you get a model with a vector of weights w and an intercept b; points x such that w.dot(x) + b = 1 or w.dot(x) + b = -1 are points that lie on the margin around the decision boundary. The quantity w.dot(x) + b for a point x is a confidence measure of the classification. Code-wise, suppose you trained your model via val model = SVMWithSGD.train(...); you can then call model.setThreshold(your threshold here) to set the threshold that separates positive predictions from negative predictions. For more info, please take a look at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel For your second question: SVMWithSGD only supports binary classification. Hope this helps, Liquan

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
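Putting Aris's and Liquan's advice together, the raw-score route might look like this minimal sketch (Spark 1.0+ API, per the linked SVMModel docs; trainingData and testData are hypothetical RDDs):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val model = SVMWithSGD.train(trainingData, 100) // trainingData: RDD[LabeledPoint]
    model.clearThreshold()                          // predict() now returns the raw score w.dot(x) + b
    val scores = testData.map(x => (x, model.predict(x))) // larger |score| = further from the boundary
    // Or restore hard 0/1 decisions at a custom cut-off:
    model.setThreshold(0.5)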
Re: return probability \ confidence instead of actual class
For multi-class, you can use the same SVMWithSGD (which is binary) with a one-vs-all approach: construct the respective training corpora by taking class i as the positive samples and the rest of the classes as the negative ones, and then use the same method provided by Aris as a measure of how far a point is from class i's decision boundary.
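A sketch of that one-vs-all construction, assuming the Spark 1.0+ API and hypothetical names throughout (data as an RDD of (classIndex, features) pairs, numClasses and numIterations given):

    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainOneVsAll(data: RDD[(Int, Array[Double])], numClasses: Int, numIterations: Int): Seq[SVMModel] =
      (0 until numClasses).map { i =>
        // Class i becomes the positive class; every other class becomes negative.
        val binary = data.map { case (label, features) =>
          LabeledPoint(if (label == i) 1.0 else 0.0, Vectors.dense(features))
        }.cache()
        val model = SVMWithSGD.train(binary, numIterations)
        model.clearThreshold() // keep raw scores so the per-class models can be compared
        model
      }

    // Predict the class whose model puts the point furthest on its positive side:
    def predictClass(models: Seq[SVMModel], x: Vector): Int =
      models.zipWithIndex.maxBy(_._1.predict(x))._2

The usual caveat applies: raw margins of independently trained SVMs are not strictly comparable without calibration, so this picks a winner but does not yield probabilities.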
Re: return probability \ confidence instead of actual class
Nobody? If that's not supported already, can someone please at least give me a few hints on how to implement it? Thanks!
return probability \ confidence instead of actual class
Hi, I am working with the SVMWithSGD classification algorithm on Spark. It works fine for me; however, I would like to distinguish the instances that are classified with high confidence from those with low confidence. How do we define the threshold here? Ultimately, I want to keep only those instances for which the algorithm is very *very* certain about its decision! How can I do that? Is this feature already supported by any MLlib algorithm? What if I had multiple categories? Any input is highly appreciated!