Re: scikit-learn and mllib difference in predictions python

2016-12-25 Thread Yuhao Yang
Hi ioanna,

I'd like to help look into it. Is there a way to access your training data?
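In the meantime, one thing worth checking: scikit-learn's SVC() defaults to an RBF kernel, while MLlib's SVMWithSGD trains a linear SVM, so the two models are not directly comparable even on identical data. A closer like-for-like baseline on the scikit-learn side might look like this (a sketch, untested against your data):

from sklearn.svm import LinearSVC

# Linear decision boundary, closer to what SVMWithSGD optimizes.
svc_model = LinearSVC()
svc_model.fit(X_train, y_train)

# decision_function returns raw margins, comparable to the scores
# MLlib's predict() returns after clearThreshold().
print svc_model.decision_function([[15, 15, 0, 15, 15, 4, 12, 8, 0, 7]])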



scikit-learn and mllib difference in predictions python

2016-12-20 Thread ioanna
I have an issue with an SVM model trained for binary classification using
Spark 2.0.0.
I have followed the same logic in both scikit-learn and MLlib, on the exact
same dataset.
For scikit-learn I have the following code:

from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

print "supposed to be 1"
print svc_model.predict([15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0])
print svc_model.predict([15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0])
print svc_model.predict([15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0])
print svc_model.predict([7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0])

print "supposed to be 0"
print svc_model.predict([18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0])
print svc_model.predict([11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0])
print svc_model.predict([15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0])
print svc_model.predict([15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0])


and it returns:

supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]

For Spark I am doing:

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors

model_svm = SVMWithSGD.train(trainingData, iterations=100)

# With the threshold cleared, predict() returns the raw prediction
# score instead of a 0/1 label.
model_svm.clearThreshold()

print "supposed to be 1"
print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0))

print "supposed to be 0"
print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0))
print model_svm.predict(Vectors.dense(11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0))

which returns:

supposed to be 1
12.8250120159
16.0786937313
14.2139435305
16.5115589658
supposed to be 0
17.1311777004
14.075461697
20.8883372052
12.9132580999

When I set a threshold, I get either all zeros or all ones.
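For reference, this is what I mean by setting the threshold, using the pyspark.mllib setThreshold API (the cutoff of 15.0 is only an illustration):

# Restore a decision threshold so predict() returns 0/1 labels again.
# With the raw scores printed above, the "supposed to be 1" examples
# span roughly 12.8-16.5 and the "supposed to be 0" examples 12.9-20.9,
# so the ranges overlap and no single cutoff separates the two classes.
model_svm.setThreshold(15.0)
print model_svm.predict(Vectors.dense(15.0, 15.0, 0.0, 15.0, 15.0, 4.0, 12.0, 8.0, 0.0, 7.0))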

Does anyone know how to approach this problem?

As I said, I have checked multiple times that my dataset and feature
extraction logic are exactly the same in both cases.
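One thing I have not yet ruled out is feature scaling: SGD-based training is sensitive to feature scale, so I may also try standardizing the features before calling SVMWithSGD.train. A sketch of that, assuming trainingData is an RDD of LabeledPoint (as SVMWithSGD.train requires):

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

# Standardize to zero mean / unit variance; SGD converges much more
# reliably when all features are on a comparable scale.
features = trainingData.map(lambda lp: lp.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled_data = trainingData.zip(scaler.transform(features)) \
    .map(lambda (lp, f): LabeledPoint(lp.label, f))

model_svm = SVMWithSGD.train(scaled_data, iterations=100)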


