Re: return probability \ confidence instead of actual class

2014-10-11 Thread Adamantios Corais
Thank you, Sean. I'll try to do it externally as you suggested; however, can
you please give me some hints on how to do that? Also, where can I find the
1.2 implementation you just mentioned? Thanks!




On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen so...@cloudera.com wrote:

 Plain old SVMs don't produce an estimate of class probabilities;
 predict_proba() does some additional work to estimate class
 probabilities from the SVM output. Spark does not implement this right
 now.

 Spark implements the equivalent of decision_function (the wTx + b bit)
 but does not expose it, and instead gives you predict(), which gives 0
 or 1 depending on whether the decision function exceeds the specified
 threshold.

 Yes you can roll your own just like you did to calculate the decision
 function from weights and intercept. I suppose it would be nice to
 expose it (do I hear a PR?) but it's not hard to do externally. You'll
 have to do this anyway if you're on anything earlier than 1.2.
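
Concretely, that external computation might look like this (a minimal
sketch, assuming Spark 0.9.x, where the model exposes its weights as an
Array[Double]; variable names are illustrative):

val w = model.weights              // Array[Double] in Spark 0.9.x
val b = model.intercept
// raw decision function wTx + b for each point, computed outside the model
val scores = dataPoints.map { x => // dataPoints: RDD[Array[Double]]
  x.zip(w).map { case (xi, wi) => xi * wi }.sum + b
}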

 On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais
 adamantios.cor...@gmail.com wrote:
  ok let me rephrase my question once again. python-wise I prefer
  .predict_proba(X) to .decision_function(X) since it is easier for me
  to interpret the results. as far as I can see, the latter functionality
  is already implemented in Spark (well, in version 0.9.2 for example I
  have to compute the dot product on my own, otherwise I get 0 or 1) but
  the former is not implemented (yet!). what should I do \ how do I
  implement that one in Spark as well? what are the required inputs here
  and what does the formula look like?
 
  On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen so...@cloudera.com wrote:
 
  It looks like you are directly computing the SVM decision function in
  both cases:
 
  val predictions2 = m_users_double.map { point =>
    point.zip(weights).map(a => a._1 * a._2).sum + intercept
  }.cache()
 
  clf.decision_function(T)
 
  This does not give you +1/-1 in SVMs (well... not for most points,
  which will be outside the margin around the separating hyperplane).
 
  You can use the predict() function in SVMModel -- which will give you
  0 or 1 (rather than +/- 1 but that's just differing convention)
  depending on the sign of the decision function. I don't know if this
  was in 0.9.
 
  At the moment I assume you saw small values of the decision function
  in scikit because of the radial basis function.



Re: return probability \ confidence instead of actual class

2014-10-08 Thread Adamantios Corais
ok let me rephrase my question once again. python-wise I prefer
.predict_proba(X) to .decision_function(X) since it is easier for me to
interpret the results. as far as I can see, the latter functionality is
already implemented in Spark (well, in version 0.9.2 for example I have to
compute the dot product on my own, otherwise I get 0 or 1) but the former
is not implemented (yet!). what should I do \ how do I implement that one
in Spark as well? what are the required inputs here and what does the
formula look like?
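
For what it's worth, a common way to approximate predict_proba on top of a
linear SVM is Platt scaling: squash the raw margin wTx + b through a
logistic function fit on held-out scores. A rough sketch (A and B below are
placeholders that you would fit yourself, e.g. by logistic regression on
held-out margins; dataPoints is an illustrative RDD[Array[Double]]):

val w = model.weights
val b = model.intercept
// margin = wTx + b, the raw decision function
val margins = dataPoints.map { x =>
  x.zip(w).map { case (xi, wi) => xi * wi }.sum + b
}
// Platt scaling: P(y = 1 | x) ~= 1 / (1 + exp(A * margin + B)); A is
// usually negative so that large positive margins map to values near 1.0
val A = -1.0  // placeholder, not fitted
val B = 0.0   // placeholder, not fitted
val probabilities = margins.map(m => 1.0 / (1.0 + math.exp(A * m + B)))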

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen so...@cloudera.com wrote:

 It looks like you are directly computing the SVM decision function in
 both cases:

 val predictions2 = m_users_double.map { point =>
   point.zip(weights).map(a => a._1 * a._2).sum + intercept
 }.cache()

 clf.decision_function(T)

 This does not give you +1/-1 in SVMs (well... not for most points,
 which will be outside the margin around the separating hyperplane).

 You can use the predict() function in SVMModel -- which will give you
 0 or 1 (rather than +/- 1 but that's just differing convention)
 depending on the sign of the decision function. I don't know if this
 was in 0.9.

 At the moment I assume you saw small values of the decision function
 in scikit because of the radial basis function.

 On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri sunny.k...@gmail.com wrote:
  Not familiar with the scikit SVM implementation (and I assume you are
  using LinearSVC). To figure out an optimal decision boundary based on the
  scores obtained, you can use an ROC curve, varying your thresholds.
 



Re: return probability \ confidence instead of actual class

2014-10-07 Thread Sunny Khatri
Not familiar with the scikit SVM implementation (and I assume you are using
LinearSVC). To figure out an optimal decision boundary based on the scores
obtained, you can use an ROC curve, varying your thresholds.
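
In Spark 1.x, MLlib can build the ROC curve from score/label pairs for you;
a minimal sketch (BinaryClassificationMetrics is in MLlib from 1.0 onwards;
variable names are illustrative):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

model.clearThreshold()                    // predict() now returns raw scores
val scoreAndLabels = data.map { point =>  // data: RDD[LabeledPoint]
  (model.predict(point.features), point.label)
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val roc = metrics.roc()          // (false positive rate, true positive rate)
val auc = metrics.areaUnderROC() // area under the ROC curve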

On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Well, apparently, the above Python set-up is wrong. Please consider the
 following set-up, which DOES use the 'linear' kernel... And the question
 remains the same: how do I interpret the Spark results (or why are the
 Spark results NOT bounded between -1 and 1)?

 On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri sunny.k...@gmail.com wrote:

 One difference I can find is that you may have different kernel functions
 for your training. In Spark, you end up using a linear kernel, whereas for
 scikit you are using the RBF kernel. That can explain the difference in
 the coefficients you are getting.

 On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Hi again,

 Finally, I found the time to play around with your suggestions.
 Unfortunately, I noticed some unusual behavior in the MLlib results, which
 is more obvious when I compare them against their scikit-learn equivalent.
 Note that I am currently using Spark 0.9.2. Long story short: I find it
 difficult to interpret the results: scikit-learn SVM always returns a value
 between 0 and 1, which makes it easy for me to set up a threshold in order
 to keep only the most significant classifications (this is the case for
 both short and long input vectors). On the other hand, Spark MLlib makes it
 impossible to interpret the results; the results are hardly ever bounded
 between -1 and +1 and hence it is impossible to choose a good cut-off
 value; the results are of no practical use. And here is the strangest thing
 ever: although it seems that MLlib does NOT generate the right weights and
 intercept, when I feed MLlib with the weights and intercept from
 scikit-learn the results become pretty accurate. Any ideas about what is
 happening? Any suggestion is highly appreciated.

 PS: to make things easier I have quoted both of my implementations as well
 as the results, below.

 //

 SPARK (short input):
 training_error: Double = 0.0
 res2: Array[Double] = Array(-1.4420684459128205E-19,
 -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749,
 0.7498, 0.7498, 0.7498)

 SPARK (long input):
 training_error: Double = 0.0
 res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
 -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
 2.6866864968561632, 2.6866864968561632)

 PYTHON (short input):
 array([[-1.0001],
[-1.0001],
[-1.0001],
[-0.],
[ 1.0001],
[ 1.0001],
[ 1.0001]])

 PYTHON (long input):
 array([[-1.0001],
[-1.0001],
[-1.0001],
[-0.],
[ 1.0001],
[ 1.0001],
[ 1.0001]])

 //

 import analytics.MSC

 import java.util.Calendar
 import java.text.SimpleDateFormat
 import scala.collection.mutable
 import scala.collection.JavaConversions._
 import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.optimization.L1Updater
 import com.datastax.bdp.spark.connector.CassandraConnector
 import com.datastax.bdp.spark.SparkContextCassandraFunctions._

 val sc = MSC.sc
 val lg = MSC.logger

 //val s_users_double_2 = Seq(
 //  (0.0,Seq(0.0, 0.0, 0.0)),
 //  (0.0,Seq(0.0, 0.0, 0.0)),
 //  (0.0,Seq(0.0, 0.0, 0.0)),
 //  (1.0,Seq(1.0, 1.0, 1.0)),
 //  (1.0,Seq(1.0, 1.0, 1.0)),
 //  (1.0,Seq(1.0, 1.0, 1.0))
 //)
 val s_users_double_2 = Seq(
 (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
 (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
 (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
 (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
 (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Adamantios Corais
Hi again,

Finally, I found the time to play around with your suggestions.
Unfortunately, I noticed some unusual behavior in the MLlib results, which
is more obvious when I compare them against their scikit-learn equivalent.
Note that I am currently using Spark 0.9.2. Long story short: I find it
difficult to interpret the results: scikit-learn SVM always returns a value
between 0 and 1, which makes it easy for me to set up a threshold in order
to keep only the most significant classifications (this is the case for
both short and long input vectors). On the other hand, Spark MLlib makes it
impossible to interpret the results; the results are hardly ever bounded
between -1 and +1 and hence it is impossible to choose a good cut-off
value; the results are of no practical use. And here is the strangest thing
ever: although it seems that MLlib does NOT generate the right weights and
intercept, when I feed MLlib with the weights and intercept from
scikit-learn the results become pretty accurate. Any ideas about what is
happening? Any suggestion is highly appreciated.

PS: to make things easier I have quoted both of my implementations as well
as the results, below.

//

SPARK (short input):
training_error: Double = 0.0
res2: Array[Double] = Array(-1.4420684459128205E-19,
-1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749,
0.7498, 0.7498, 0.7498)

SPARK (long input):
training_error: Double = 0.0
res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
-0.782207630902241, 0.9522394329769612, 2.6866864968561632,
2.6866864968561632, 2.6866864968561632)

PYTHON (short input):
array([[-1.0001],
   [-1.0001],
   [-1.0001],
   [-0.],
   [ 1.0001],
   [ 1.0001],
   [ 1.0001]])

PYTHON (long input):
array([[-1.0001],
   [-1.0001],
   [-1.0001],
   [-0.],
   [ 1.0001],
   [ 1.0001],
   [ 1.0001]])

//

import analytics.MSC

import java.util.Calendar
import java.text.SimpleDateFormat
import scala.collection.mutable
import scala.collection.JavaConversions._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.optimization.L1Updater
import com.datastax.bdp.spark.connector.CassandraConnector
import com.datastax.bdp.spark.SparkContextCassandraFunctions._

val sc = MSC.sc
val lg = MSC.logger

//val s_users_double_2 = Seq(
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (1.0,Seq(1.0, 1.0, 1.0)),
//  (1.0,Seq(1.0, 1.0, 1.0)),
//  (1.0,Seq(1.0, 1.0, 1.0))
//)
val s_users_double_2 = Seq(
(0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
(0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
(0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
(1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
(1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
(1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
)
val s_users_double = sc.parallelize(s_users_double_2)

val s_users_parsed = s_users_double.map { line =>
  LabeledPoint(line._1, line._2.toArray)
}.cache()

val iterations = 100

val model = SVMWithSGD.train(s_users_parsed, iterations)

val predictions1 = s_users_parsed.map { point =>
  (point.label, model.predict(point.features))
}.cache()

val training_error = predictions1.filter(r => r._1 != r._2).count().toDouble /
  s_users_parsed.count()

val TP = predictions1.map(s => if (s._1 == 1.0 && s._2 == 1.0) true else
  false).filter(t => t).count()
val FP = predictions1.map(s => if (s._1 == 0.0 && s._2 == 1.0) true else
  false).filter(t => t).count()
val TN = predictions1.map(s => if (s._1 == 0.0 && s._2 == 0.0) true else
  false).filter(t => t).count()
val FN = predictions1.map(s => if (s._1 == 1.0 && s._2 == 0.0) true else
  false).filter(t => t).count()

Re: return probability \ confidence instead of actual class

2014-10-06 Thread Sunny Khatri
One difference I can find is that you may have different kernel functions
for your training. In Spark, you end up using a linear kernel, whereas for
scikit you are using the RBF kernel. That can explain the difference in
the coefficients you are getting.

On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi again,

 Finally, I found the time to play around with your suggestions.
 Unfortunately, I noticed some unusual behavior in the MLlib results, which
 is more obvious when I compare them against their scikit-learn equivalent.
 Note that I am currently using Spark 0.9.2. Long story short: I find it
 difficult to interpret the results: scikit-learn SVM always returns a value
 between 0 and 1, which makes it easy for me to set up a threshold in order
 to keep only the most significant classifications (this is the case for
 both short and long input vectors). On the other hand, Spark MLlib makes it
 impossible to interpret the results; the results are hardly ever bounded
 between -1 and +1 and hence it is impossible to choose a good cut-off
 value; the results are of no practical use. And here is the strangest thing
 ever: although it seems that MLlib does NOT generate the right weights and
 intercept, when I feed MLlib with the weights and intercept from
 scikit-learn the results become pretty accurate. Any ideas about what is
 happening? Any suggestion is highly appreciated.

 PS: to make things easier I have quoted both of my implementations as well
 as the results, below.

 //

 SPARK (short input):
 training_error: Double = 0.0
 res2: Array[Double] = Array(-1.4420684459128205E-19,
 -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749,
 0.7498, 0.7498, 0.7498)

 SPARK (long input):
 training_error: Double = 0.0
 res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
 -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
 2.6866864968561632, 2.6866864968561632)

 PYTHON (short input):
 array([[-1.0001],
[-1.0001],
[-1.0001],
[-0.],
[ 1.0001],
[ 1.0001],
[ 1.0001]])

 PYTHON (long input):
 array([[-1.0001],
[-1.0001],
[-1.0001],
[-0.],
[ 1.0001],
[ 1.0001],
[ 1.0001]])

 //

 import analytics.MSC

 import java.util.Calendar
 import java.text.SimpleDateFormat
 import scala.collection.mutable
 import scala.collection.JavaConversions._
 import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.optimization.L1Updater
 import com.datastax.bdp.spark.connector.CassandraConnector
 import com.datastax.bdp.spark.SparkContextCassandraFunctions._

 val sc = MSC.sc
 val lg = MSC.logger

 //val s_users_double_2 = Seq(
 //  (0.0,Seq(0.0, 0.0, 0.0)),
 //  (0.0,Seq(0.0, 0.0, 0.0)),
 //  (0.0,Seq(0.0, 0.0, 0.0)),
 //  (1.0,Seq(1.0, 1.0, 1.0)),
 //  (1.0,Seq(1.0, 1.0, 1.0)),
 //  (1.0,Seq(1.0, 1.0, 1.0))
 //)
 val s_users_double_2 = Seq(
 (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
 (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
 (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
 (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
 (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
 (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
 )
 val s_users_double = sc.parallelize(s_users_double_2)

 val s_users_parsed = s_users_double.map { line =>
   LabeledPoint(line._1, line._2.toArray)
 }.cache()

 val iterations = 100

 val model = SVMWithSGD.train(s_users_parsed, iterations)

 val predictions1 = s_users_parsed.map { point =>
   (point.label, model.predict(point.features))
 }.cache()

 val training_error = predictions1.filter(r => r._1 != r._2).count().toDouble /
   s_users_parsed.count()

Re: return probability \ confidence instead of actual class

2014-09-24 Thread Aris
Greetings, Adamantios Corais... if that is indeed your name!

Just to follow up on Liquan: you might be interested in removing the
threshold and then treating the predictions as raw confidence scores. SVM
with the linear kernel is a straightforward linear classifier, so with
model.clearThreshold() you can just get the raw predicted scores, removing
the thresholding step that simply translates them into a positive/negative
class.

API is here
http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
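
A minimal sketch of that, against the Spark 1.x API (dataset and variable
names are illustrative, and the cutoff is arbitrary):

import org.apache.spark.mllib.classification.SVMWithSGD

val model = SVMWithSGD.train(trainingData, 100) // trainingData: RDD[LabeledPoint]
model.clearThreshold()               // predict() now returns the raw score
val scored = testData.map { p =>     // testData: RDD[LabeledPoint]
  (model.predict(p.features), p.label)
}
// keep only predictions far from the decision boundary, i.e. confident ones
val confident = scored.filter { case (score, _) => math.abs(score) > 1.0 }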

Enjoy!
Aris

On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Adamantios,

 For your first question: after you train the SVM, you get a model with a
 vector of weights w and an intercept b; points x such that w.dot(x) + b = 1
 or w.dot(x) + b = -1 lie on the margin. The quantity w.dot(x) + b for a
 point x is a confidence measure of the classification.

 Code-wise, suppose you trained your model via
 val model = SVMWithSGD.train(...)

 and you can set a threshold by calling

 model.setThreshold(your threshold here)

 to set the threshold that separates positive predictions from negative
 predictions.

 For more info, please take a look at
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
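
Putting that into code, a sketch against the Spark 1.x API (testData and
the cutoff value are illustrative):

// the raw confidence for a point x is w.dot(x) + b
model.setThreshold(1.0)  // only scores above 1.0 are predicted positive
val positives = testData.filter(p => model.predict(p.features) == 1.0)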

 For your second question, SVMWithSGD only supports binary classification.

 Hope this helps,

 Liquan

 On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Nobody?

 If that's not supported already, can someone please at least give me a few
 hints on how to implement it?

 Thanks!


 On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Hi,

 I am working with the SVMWithSGD classification algorithm on Spark. It
 works fine for me; however, I would like to distinguish the instances that
 are classified with a high confidence from those with a low one. How do we
 define the threshold here? Ultimately, I want to keep only those for which
 the algorithm is very *very* certain about its decision! How can I do
 that? Is this feature supported already by any MLlib algorithm? What if I
 had multiple categories?

 Any input is highly appreciated!





 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst



Re: return probability \ confidence instead of actual class

2014-09-24 Thread Sunny Khatri
For multi-class you can use the same SVMWithSGD (which does binary
classification) with a one-vs-all approach: construct the respective
training corpora with class i as the positive samples and the rest of the
classes as the negative ones, and then use the same method provided by Aris
as a measure of how far a point is from class i's decision boundary.
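
A sketch of that one-vs-all construction against the Spark 1.x API
(numClasses, data, and the iteration count are illustrative):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vector

val numClasses = 3
// one binary SVM per class: class k as positive, the rest as negative
val models = (0 until numClasses).map { k =>
  val relabeled = data.map(p =>  // data: RDD[LabeledPoint], labels 0.0, 1.0, ...
    LabeledPoint(if (p.label == k.toDouble) 1.0 else 0.0, p.features))
  val m = SVMWithSGD.train(relabeled, 100)
  m.clearThreshold()             // keep the raw distance-from-boundary score
  m
}
// predict the class whose one-vs-rest model reports the largest raw score
def predictClass(features: Vector): Int =
  models.map(_.predict(features)).zipWithIndex.maxBy(_._1)._2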

On Wed, Sep 24, 2014 at 4:06 PM, Aris arisofala...@gmail.com wrote:

 Greetings, Adamantios Corais... if that is indeed your name!

 Just to follow up on Liquan: you might be interested in removing the
 threshold and then treating the predictions as raw confidence scores. SVM
 with the linear kernel is a straightforward linear classifier, so with
 model.clearThreshold() you can just get the raw predicted scores, removing
 the thresholding step that simply translates them into a positive/negative
 class.

 API is here
 http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

 Enjoy!
 Aris

 On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei liquan...@gmail.com wrote:

 Hi Adamantios,

 For your first question: after you train the SVM, you get a model with a
 vector of weights w and an intercept b; points x such that w.dot(x) + b = 1
 or w.dot(x) + b = -1 lie on the margin. The quantity w.dot(x) + b for a
 point x is a confidence measure of the classification.

 Code-wise, suppose you trained your model via
 val model = SVMWithSGD.train(...)

 and you can set a threshold by calling

 model.setThreshold(your threshold here)

 to set the threshold that separates positive predictions from negative
 predictions.

 For more info, please take a look at
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

 For your second question, SVMWithSGD only supports binary classification.

 Hope this helps,

 Liquan

 On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Nobody?

 If that's not supported already, can someone please at least give me a few
 hints on how to implement it?

 Thanks!


 On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais 
 adamantios.cor...@gmail.com wrote:

 Hi,

 I am working with the SVMWithSGD classification algorithm on Spark. It
 works fine for me; however, I would like to distinguish the instances that
 are classified with a high confidence from those with a low one. How do we
 define the threshold here? Ultimately, I want to keep only those for which
 the algorithm is very *very* certain about its decision! How can I do
 that? Is this feature supported already by any MLlib algorithm? What if I
 had multiple categories?

 Any input is highly appreciated!





 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst





Re: return probability \ confidence instead of actual class

2014-09-22 Thread Adamantios Corais
Nobody?

If that's not supported already, can someone please at least give me a few
hints on how to implement it?

Thanks!


On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais 
adamantios.cor...@gmail.com wrote:

 Hi,

 I am working with the SVMWithSGD classification algorithm on Spark. It
 works fine for me; however, I would like to distinguish the instances that
 are classified with a high confidence from those with a low one. How do we
 define the threshold here? Ultimately, I want to keep only those for which
 the algorithm is very *very* certain about its decision! How can I do
 that? Is this feature supported already by any MLlib algorithm? What if I
 had multiple categories?

 Any input is highly appreciated!



return probability \ confidence instead of actual class

2014-09-19 Thread Adamantios Corais
Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It
works fine for me; however, I would like to distinguish the instances that
are classified with a high confidence from those with a low one. How do we
define the threshold here? Ultimately, I want to keep only those for which
the algorithm is very *very* certain about its decision! How can I do
that? Is this feature supported already by any MLlib algorithm? What if I
had multiple categories?

Any input is highly appreciated!