Hi,
I created a dataset of 100 points, with X ranging from 1.0 to 100.0, and set
y = 0.0 if X < 51.0 and 1.0 otherwise. I then fit an SVMWithSGD model. When I
predict y for the same X values as in the sample, I get back 1.0 for every
point!
Incidentally, I don't get perfect separation either when I replace SVMWithSGD
with LogisticRegressionWithSGD or NaiveBayes.
Here's the code:
import sys
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD, LogisticRegressionModel
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint
import numpy as np

sc = SparkContext(appName="Prem")

# Load and parse the data: each line is "label<TAB>feature".
def parsePoint(line):
    values = [float(x) for x in line.split('\t')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("c:/python27/classifier.txt")
parsedData = data.map(parsePoint)
print parsedData

# Build the model
model = SVMWithSGD.train(parsedData, iterations=100)
model.setThreshold(0.5)
print model

## Build the model
#model = LogisticRegressionWithSGD.train(parsedData, iterations=100, intercept=True)
#print model

## Build the model
#model = NaiveBayes.train(parsedData)
#print model

for i in range(100):
    print i+1, model.predict(np.array([float(i+1)]))
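For reference, the input file was built along these lines (a reconstruction
of my generation step, which isn't shown above): each line is the label, a
tab, then the X value, matching what parsePoint expects.

```python
# Build 100 tab-separated "label\tX" lines:
# y = 0.0 for X < 51.0, 1.0 otherwise.
lines = []
for i in range(1, 101):
    x = float(i)
    y = 0.0 if x < 51.0 else 1.0
    lines.append("%s\t%s" % (y, x))

with open("classifier.txt", "w") as f:
    f.write("\n".join(lines))
```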
Incidentally, the weight I observe in MLlib is 0.8949991, while if I run the
same data through the scikit-learn support vector machine, I get 0.05417109.
Is this indicative of the problem?
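One thing I notice about those numbers: if no intercept is being fitted
(which I believe is SVMWithSGD's default), the margin is just w*x, and with
w = 0.8949991 every X from 1.0 to 100.0 exceeds my 0.5 threshold. A quick
sanity check of that arithmetic, under that no-intercept assumption:

```python
# With the MLlib-reported weight and an assumed zero intercept,
# see what a 0.5 threshold predicts for X = 1.0 .. 100.0.
w = 0.8949991
threshold = 0.5
preds = [1 if w * x > threshold else 0 for x in range(1, 101)]
# The smallest margin is w*1.0 = 0.8949991, already above 0.5,
# so every point comes out as class 1 -- matching what I see.
```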
Can you please let me know what I am doing wrong?
Thanks,
Prem
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/support-vector-machine-does-not-classify-properly-tp26216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.