Hi, I've written an application that trains a regression model on some data. I've validated that the data _should_ give a good output with a decent error by running it through LibSVM:

  Mean squared error = 0.00922063 (regression)
  Squared correlation coefficient = 0.9987 (regression)
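For clarity, the number I'm comparing across the two tools is the plain mean of squared residuals. A minimal self-contained Scala check of that formula (same computation as the Spark snippet below, just on plain collections; the sample numbers are made up):

```scala
// Mean squared error over (label, prediction) pairs -- the same formula
// both LibSVM and the Spark snippet below report.
def meanSquaredError(pairs: Seq[(Double, Double)]): Double =
  pairs.map { case (v, p) => math.pow(v - p, 2) }.sum / pairs.size

// Tiny sanity check: errors are 1.0 and -2.0, squares 1.0 and 4.0, mean 2.5
val mse = meanSquaredError(Seq((2.0, 1.0), (3.0, 5.0)))
// mse == 2.5
```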
When I try to do the exact same thing with Spark MLlib I get:

  Mean Squared Error = 8.466193152067944E224

which is "somewhat" worse. I've inspected the data before it is fed to the model and printed it to file (this is the same data that produced the LibSVM result above). Somewhere there must be a huge mistake, but I cannot find it in my code (see below). trainingLP and testLP are the training and test data, as RDD[LabeledPoint].

  // Generate the model
  val model_gen = new RidgeRegressionWithSGD()
  val model = model_gen.run(trainingLP)

  // Predict on the test data
  val valuesAndPreds = testLP.map { point =>
    val prediction = model.predict(point.features)
    println("label: " + point.label + ", pred: " + prediction)
    (point.label, prediction)
  }

  val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
  println("Mean Squared Error = " + MSE)

I've printed the label and prediction for each data point in the test set, and the result looks like this:

  label: 5.04, pred: -4.607899000641277E112
  label: 3.59, pred: -3.96787105480399E112
  label: 5.06, pred: -2.8263294374576145E112
  label: 2.85, pred: -1.1536508029072844E112
  label: 2.1, pred: -4.269312783707508E111
  label: 2.75, pred: -3.0072665148591558E112
  label: -0.29, pred: -2.035681731641989E112
  label: 1.98, pred: -3.163404340354783E112

So there is obviously something wrong with the prediction step.
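One thing I plan to rule out, based on what I've read about SGD, is divergence caused by unscaled features together with the default step size. A sketch of what I'd try (my assumption, not verified; requires a Spark runtime, and trainingLP/testLP are the RDD[LabeledPoint]s above):

```scala
// Sketch: standardize features before training, since SGD can blow up when
// features are on very different scales. withMean = false keeps the vectors
// sparse (mean-centering would densify them).
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}

val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(trainingLP.map(_.features))

val scaledTraining = trainingLP.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
val scaledTest     = testLP.map(p => LabeledPoint(p.label, scaler.transform(p.features)))

// Also set the optimizer parameters explicitly instead of relying on the
// defaults (a smaller step size and more iterations to see if it converges).
val model_gen = new RidgeRegressionWithSGD()
model_gen.optimizer
  .setNumIterations(200)
  .setStepSize(0.1)
  .setRegParam(0.01)

val model = model_gen.run(scaledTraining)
```

If the predictions come back in a sane range after this, the original problem was the optimizer diverging rather than the data itself.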
I'm using the SparseVector representation for the features in each LabeledPoint. For reference, the points look something like this (shortened for convenience):

  (-1.59,(2080,[29,59,62,74,127,128,131,144,149,175,198,200,239,247,267,293,307,364,374,393,410,424,425,431,448,469,477,485,501,525,532,533,538,560,..],[1.0,1.0,2.0,8.0,1.0,1.0,6.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,2.0,1.0,1.0,..]))
  (-1.75,(2080,[103,131,149,208,296,335,520,534,603,620,661,694,709,748,859,1053,1116,1156,1186,1207,1208,1223,1256,1278,1356,1375,1399,1480,1569,..],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,4.0,1.0,7.0,1.0,3.0,2.0,1.0]))

I do get one type of warning, but that's about it (and as far as I understand, the native BLAS libraries are only needed for performance, not for correct results):

  6010 [main] WARN com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
  6011 [main] WARN com.github.fommil.netlib.BLAS - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

So where do I go from here?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-using-Spark-ML-tp22591.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.