Hi again,

Finally, I found the time to play around with your suggestions.
Unfortunately, I noticed some unusual behavior in the MLlib results, which
becomes more obvious when I compare them against their scikit-learn
equivalents. Note that I am currently using Spark 0.9.2. Long story short:
I find the results difficult to interpret. scikit-learn's SVM always
returns a value between -1 and +1, which makes it easy for me to set up a
threshold in order to keep only the most significant classifications (this
is the case for both the short and the long input vectors). Spark MLlib,
on the other hand, produces scores that are hardly ever bounded between -1
and +1, so I cannot choose a good cut-off value and the results are of no
practical use. And here is the strangest part: although MLlib does not
seem to generate the right weights and intercept, when I feed my MLlib
scoring code with the weights and intercept from scikit-learn, the results
become quite accurate! Any ideas about what is happening? Any suggestion
is highly appreciated.
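
In case it helps the discussion: the workaround I am considering is to
squash the unbounded MLlib margin through a logistic function, so that
every score lands in (0, 1) and a single cut-off can be applied. A minimal
sketch, reusing margins copied from my long-input results below; the 0.8
cut-off is illustrative, not tuned:

// Map unbounded margins (w . x + b) into (0, 1) via the logistic function.
val margins = Seq(-0.782207630902241, 0.9522394329769612, 2.6866864968561632)
val confidences = margins.map(m => 1.0 / (1.0 + math.exp(-m)))
// Keep only the classifications the model is most certain about, on either side.
val cutoff = 0.8
val confident = confidences.filter(c => c > cutoff || c < 1.0 - cutoff)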

PS: To make things easier, I have quoted both of my implementations, as
well as their results, below.

//////////////////////////////////////////////////

SPARK (short input):
training_error: Double = 0.0
res2: Array[Double] = Array(-1.4420684459128205E-19,
-1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749999999999999,
0.7499999999999998, 0.7499999999999998, 0.7499999999999998)

SPARK (long input):
training_error: Double = 0.0
res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
-0.782207630902241, 0.9522394329769612, 2.6866864968561632,
2.6866864968561632, 2.6866864968561632)

PYTHON (short input):
array([[-1.00000001],
       [-1.00000001],
       [-1.00000001],
       [-0.        ],
       [ 1.00000001],
       [ 1.00000001],
       [ 1.00000001]])

PYTHON (long input):
array([[-1.00000001],
       [-1.00000001],
       [-1.00000001],
       [-0.        ],
       [ 1.00000001],
       [ 1.00000001],
       [ 1.00000001]])

//////////////////////////////////////////////////

import analytics.MSC

// Only the imports actually used in this snippet; Calendar, SimpleDateFormat,
// the Cassandra connector, and L1Updater were unused and have been dropped.
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val sc = MSC.sc
val lg = MSC.logger

//val s_users_double_2 = Seq(
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (1.0,Seq(1.0, 1.0, 1.0)),
//  (1.0,Seq(1.0, 1.0, 1.0)),
//  (1.0,Seq(1.0, 1.0, 1.0))
//)
val s_users_double_2 = Seq(
    (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
    (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
    (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
    (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
    (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
    (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
)
val s_users_double = sc.parallelize(s_users_double_2)

val s_users_parsed = s_users_double.map { line =>
  LabeledPoint(line._1, line._2.toArray) // (label, feature vector)
}.cache()

val iterations = 100

val model = SVMWithSGD.train(s_users_parsed, iterations)
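
// Note: the call above uses the default step size and regularization, which
// may itself explain the weights I am seeing. A hedged sketch of the explicit
// overload (numIterations, stepSize, regParam, miniBatchFraction); the
// 1.0 / 0.01 / 1.0 values are illustrative, not tuned:
//val model = SVMWithSGD.train(s_users_parsed, iterations, 1.0, 0.01, 1.0)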

val predictions1 = s_users_parsed.map { point =>
  (point.label, model.predict(point.features)) // (actual, predicted)
}.cache()

val training_error = predictions1.filter(r => r._1 != r._2).count().toDouble /
  s_users_parsed.count()

val TP = predictions1.filter(s => s._1 == 1.0 && s._2 == 1.0).count()
val FP = predictions1.filter(s => s._1 == 0.0 && s._2 == 1.0).count()
val TN = predictions1.filter(s => s._1 == 0.0 && s._2 == 0.0).count()
val FN = predictions1.filter(s => s._1 == 1.0 && s._2 == 0.0).count()
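
// From these counts, precision and recall follow directly; a small sketch
// (the zero guards only matter because the data set here is tiny):
val precision = if (TP + FP > 0) TP.toDouble / (TP + FP) else 0.0
val recall    = if (TP + FN > 0) TP.toDouble / (TP + FN) else 0.0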

val weights = model.weights

val intercept = model.intercept

//val m_users_double_2 = Seq(
//  Seq(0.0, 0.0, 0.0),
//  Seq(0.0, 0.0, 0.0),
//  Seq(0.0, 0.0, 0.0),
//  Seq(0.5, 0.5, 0.5),
//  Seq(1.0, 1.0, 1.0),
//  Seq(1.0, 1.0, 1.0),
//  Seq(1.0, 1.0, 1.0)
//)
val m_users_double_2 = Seq(
    Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
    Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
    Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
      Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5),
    Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
    Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
    Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
)
val m_users_double = sc.parallelize(m_users_double_2)

val predictions2 = m_users_double.map { point =>
  // manual linear score: w . x + b
  point.zip(weights).map(a => a._1 * a._2).sum + intercept
}.cache()

predictions2.collect()
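
// For completeness: I compute the dot product manually above because
// clearThreshold() does not exist in 0.9.2. On a newer Spark where
// SVMModel.clearThreshold() is available (the API Aris linked below), the
// raw margins could be obtained from the model directly. A sketch, assuming
// such a version (where features is an MLlib Vector, not an Array[Double]):
//model.clearThreshold() // predict() then returns the raw margin w . x + b
//val rawScores = s_users_parsed.map(p => model.predict(p.features))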

//////////////////////////////////////////////////

from sklearn import svm

flag = 'short' # 'long'

if flag == 'short':
    X = [
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0]
    ]
    Y = [
        0.0,
        0.0,
        0.0,
        1.0,
        1.0,
        1.0
    ]
    T = [
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.5, 0.5, 0.5],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0]
    ]

if flag == 'long':
    X = [
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
    ]
    Y = [
        0.0,
        0.0,
        0.0,
        1.0,
        1.0,
        1.0
    ]
    T = [
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
    ]

clf = svm.SVC()  # defaults: kernel='rbf', C=1.0
clf.fit(X, Y)
# fit() echoes the estimator configuration in the REPL:
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
#     gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
#     random_state=None, shrinking=True, tol=0.001, verbose=False)
# decision_function returns the signed distance of each sample in T from
# the separating hyperplane (not a probability).
clf.decision_function(T)

//////////////////////////////////////////////////




On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <sunny.k...@gmail.com> wrote:

> For multi-class you can use the same SVMWithSGD (which is a binary
> classifier) with a one-vs-all approach: construct one training corpus per
> class i, with class i as the positive samples and the rest of the classes
> as the negative ones, and then use the same method provided by Aris as a
> measure of how far class i is from the decision boundary.
>
> On Wed, Sep 24, 2014 at 4:06 PM, Aris <arisofala...@gmail.com> wrote:
>
>> Greetings, Adamantios Korais... if that is indeed your name...
>>
>> Just to follow up on Liquan: you might be interested in removing the
>> threshold and then working with the raw prediction scores directly. SVM
>> with the linear kernel is a straightforward linear classifier, so with
>> model.clearThreshold() you can get the raw predicted scores; the
>> threshold simply translates those scores into a positive/negative class.
>>
>> API is here
>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>
>> Enjoy!
>> Aris
>>
>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <liquan...@gmail.com> wrote:
>>
>>> HI Adamantios,
>>>
>>> For your first question: after you train the SVM, you get a model with a
>>> vector of weights w and an intercept b. Points x such that w.dot(x) + b = 1
>>> or w.dot(x) + b = -1 lie on the margin boundaries (the decision boundary
>>> itself is w.dot(x) + b = 0). The quantity w.dot(x) + b for a point x is a
>>> confidence measure of the classification.
>>>
>>> Code wise, suppose you trained your model via
>>> val model = SVMWithSGD.train(...)
>>>
>>> and you can set a threshold by calling
>>>
>>> model.setThreshold(your threshold here)
>>>
>>> to set the threshold that separates positive predictions from negative
>>> ones.
>>>
>>> For more info, please take a look at
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>
>>> For your second question, SVMWithSGD only supports binary
>>> classification.
>>>
>>> Hope this helps,
>>>
>>> Liquan
>>>
>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <
>>> adamantios.cor...@gmail.com> wrote:
>>>
>>>> Nobody?
>>>>
>>>> If that's not supported already, can you please, at least, give me a
>>>> few hints on how to implement it?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
>>>> adamantios.cor...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working with the SVMWithSGD classification algorithm on Spark. It
>>>>> works fine for me; however, I would like to distinguish the instances
>>>>> that are classified with high confidence from those with low
>>>>> confidence. How do we define the threshold here? Ultimately, I want to
>>>>> keep only those for which the algorithm is very *very* certain about
>>>>> its decision! How can I do that? Is this feature already supported by
>>>>> any MLlib algorithm? What if I had multiple categories?
>>>>>
>>>>> Any input is highly appreciated!
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Liquan Pei
>>> Department of Physics
>>> University of Massachusetts Amherst
>>>
>>
>>
>
