I think you need to call setIntercept(true) to allow a non-zero intercept. I also somewhat agree that this is not an obvious or intuitive default.
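A minimal sketch of what I mean — using the SVMWithSGD class directly instead of the static train helper, so setIntercept can be set (the regParam value here is just an illustrative choice):

```scala
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `training` is assumed to be the RDD[LabeledPoint] from your post.
def trainWithIntercept(training: RDD[LabeledPoint]): SVMModel = {
  val svm = new SVMWithSGD()
  svm.setIntercept(true)          // fit a non-zero intercept
  svm.optimizer
    .setNumIterations(100)        // 100 should be plenty as a default
    .setRegParam(0.01)            // illustrative regularization value
  svm.run(training)
}
```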
Is your data set highly imbalanced, with lots of positive examples? That could explain why the predictions are heavily skewed.

Iterations should definitely not be on the same order of magnitude as your input, which could have millions of elements. 100 should be plenty as a default.

The threshold is not related to the 0/1 labels in SVMs; it is a threshold on the SVM margin. The margin is 0 at the decision boundary, not 0.5.

There's no grid search at this stage, but it's easy to code up in a short method.

On Wed, Nov 12, 2014 at 12:41 AM, Caron <caron.big...@gmail.com> wrote:

> I'm hoping to fit a linear classifier on a dataset.
> I'm using SVMWithSGD to train the data.
> After running with the default options:
>   val model = SVMWithSGD.train(training, numIterations)
> I don't think the SVM has done the classification correctly.
>
> My observations:
> 1. The intercept is always 0.0.
> 2. The predicted labels are all 1's, no 0's.
>
> My questions are:
> 1. What should numIterations be? I tried setting it to
>    10*trainingSetSize; is that sufficient?
> 2. Since MLlib only accepts data with labels "0" or "1", shouldn't the
>    default threshold for SVMWithSGD be 0.5 instead of 0.0?
> 3. It seems counter-intuitive to me to have the default intercept be 0.0,
>    meaning the line has to go through the origin.
> 4. Does Spark MLlib provide an API to do grid search like scikit-learn
>    does?
>
> Any help would be greatly appreciated!
>
> Thanks!
> -Caron
>
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
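The "short method" grid search I mentioned could be sketched like this — a loop over a few regularization values, scored by area under the ROC curve on a held-out set. The method name, the grid values, and the train/validation split are all illustrative, not an MLlib API:

```scala
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Try several regParam values; keep the model with the best validation AUC.
def gridSearch(train: RDD[LabeledPoint],
               validation: RDD[LabeledPoint]): (Double, SVMModel) = {
  val regParams = Seq(0.001, 0.01, 0.1, 1.0)   // illustrative grid
  val scored = for (reg <- regParams) yield {
    // train(input, numIterations, stepSize, regParam)
    val model = SVMWithSGD.train(train, 100, 1.0, reg)
    model.clearThreshold()                     // predict raw margins, not 0/1
    val scoresAndLabels =
      validation.map(p => (model.predict(p.features), p.label))
    val auc = new BinaryClassificationMetrics(scoresAndLabels).areaUnderROC()
    (auc, model)
  }
  scored.maxBy(_._1)
}
```

Note clearThreshold() here: it makes predict return the raw margin rather than a thresholded 0/1 label, which is what the ROC metric needs.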