Hi,
I've written an application that performs some machine learning on some
data. I've validated that the data _should_ give good output with a decent
RMSE by using LIBSVM:
Mean squared error = 0.00922063 (regression)
Squared correlation coefficient = 0.9987 (regression)

When I try to use Spark ML to do the exact same thing I get:
Mean Squared Error = 8.466193152067944E224

Which is "somewhat" worse... I've inspected the data before it is fed to
the model and printed it to a file (this is in fact the same data that
produced the LIBSVM result above). Somewhere there must be a huge mistake,
but I cannot find it in my code (see below).
trainingLP and testLP are the training and test data, as RDD[LabeledPoint].

// Generate the model
val model_gen = new RidgeRegressionWithSGD()
val model = model_gen.run(trainingLP)

// Predict on the test data
val valuesAndPreds = testLP.map { point =>
  val prediction = model.predict(point.features)
  println("label: " + point.label + ", pred: " + prediction)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("Mean Squared Error = " + MSE)
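For reference, that last MSE line is just the mean of the squared residuals; the same computation on plain Scala collections (with made-up label/prediction pairs, no Spark involved) would be:

```scala
// Hypothetical (label, prediction) pairs standing in for valuesAndPreds
val pairs = Seq((5.04, 4.90), (3.59, 3.50), (2.85, 3.00))

// Same formula as above: mean of squared (label - prediction)
val mse = pairs.map { case (v, p) => math.pow(v - p, 2) }.sum / pairs.size
```

So an astronomically large MSE can only come from the predictions themselves, not from this step.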


I've printed the label and prediction values for each data point in the
test set, and the result looks something like this:
label: 5.04, pred: -4.607899000641277E112
label: 3.59, pred: -3.96787105480399E112
label: 5.06, pred: -2.8263294374576145E112
label: 2.85, pred: -1.1536508029072844E112
label: 2.1, pred: -4.269312783707508E111
label: 2.75, pred: -3.0072665148591558E112
label: -0.29, pred: -2.035681731641989E112
label: 1.98, pred: -3.163404340354783E112

So there is obviously something wrong with the prediction step. I'm using
the SparseVector representation for the Vector in each LabeledPoint, which
looks something like this for reference (shortened for convenience):
(-1.59,(2080,[29,59,62,74,127,128,131,144,149,175,198,200,239,247,267,293,307,364,374,393,410,424,425,431,448,469,477,485,501,525,532,533,538,560,..],[1.0,1.0,2.0,8.0,1.0,1.0,6.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,2.0,1.0,1.0,..]))
(-1.75,(2080,[103,131,149,208,296,335,520,534,603,620,661,694,709,748,859,1053,1116,1156,1186,1207,1208,1223,1256,1278,1356,1375,1399,1480,1569,..],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,4.0,1.0,7.0,1.0,3.0,2.0,1.0]))
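For reference, each entry above is (label, (size, indices, values)): a vector of 2080 dimensions with non-zeros only at the listed indices. A toy-sized sketch of what that expands to (pure Scala, values made up for illustration):

```scala
// Toy sparse vector: 10 dimensions, non-zeros at indices 2, 5 and 7
val size = 10
val indices = Array(2, 5, 7)
val values = Array(1.0, 2.0, 8.0)

// Expand to a dense array, as Spark's SparseVector.toDense would
val dense = Array.fill(size)(0.0)
for ((i, v) <- indices.zip(values)) dense(i) = v
// dense: [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 8.0, 0.0, 0.0]
```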

I do get one kind of warning, but that's about it! (And to my
understanding, these native BLAS libraries are not required for correct
results, only for better performance.)
6010 [main] WARN  com.github.fommil.netlib.BLAS  - Failed to load
implementation from: com.github.fommil.netlib.NativeSystemBLAS
6011 [main] WARN  com.github.fommil.netlib.BLAS  - Failed to load
implementation from: com.github.fommil.netlib.NativeRefBLAS

So where do I go from here? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-using-Spark-ML-tp22591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.