I think you need to call setIntercept(true) to get it to fit a non-zero
intercept. I also agree that's not an obvious or intuitive default.
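
For example, something like this (a sketch, assuming `training` is your
RDD[LabeledPoint]; the static SVMWithSGD.train helpers don't expose the
intercept option, so you construct the algorithm directly):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD

val svm = new SVMWithSGD()
svm.setIntercept(true)              // fit a non-zero intercept
svm.optimizer.setNumIterations(100) // the default is fine for most data
val model = svm.run(training)
```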

Is your data set highly imbalanced, with lots of positive examples? That
could explain why the predictions are heavily skewed toward 1.

The number of iterations should definitely not be of the same order of
magnitude as your input, which could have millions of elements. The default
of 100 should be plenty.

The threshold is not related to the 0/1 labels in SVMs. It is a threshold
on the SVM margin, and the margin is 0 at the decision boundary, not 0.5.
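
To see what the model is actually computing, you can clear the threshold and
look at the raw margins (sketch, assuming `test` is an RDD[LabeledPoint]):

```scala
// With the threshold cleared, predict() returns the raw margin
// (w . x + b) rather than a 0/1 label.
model.clearThreshold()
val margins = test.map(p => model.predict(p.features))

// Restore thresholding at the decision boundary to get 0/1 labels again.
model.setThreshold(0.0)
```

If nearly all margins are positive, that points back at class imbalance (or a
missing intercept) rather than a wrong threshold.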

There's no grid search API at this stage, but it's easy to code up in a
short method.
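
A minimal sketch of such a method (here `evaluate` is a hypothetical helper
standing in for whatever metric you choose, e.g. AUC via
BinaryClassificationMetrics on a held-out `validation` set):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD

// Try every (regParam, numIterations) pair and keep the best-scoring model.
val grid = for {
  reg  <- Seq(0.01, 0.1, 1.0)
  iter <- Seq(100, 200)
} yield {
  val stepSize = 1.0 // SVMWithSGD.train(input, numIterations, stepSize, regParam)
  val m = SVMWithSGD.train(training, iter, stepSize, reg)
  (reg, iter, evaluate(m, validation))
}
val (bestReg, bestIter, bestScore) = grid.maxBy(_._3)
```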


On Wed, Nov 12, 2014 at 12:41 AM, Caron <caron.big...@gmail.com> wrote:

> I'm hoping to get a linear classifier on a dataset.
> I'm using SVMWithSGD to train the data.
> After running with the default options: val model =
> SVMWithSGD.train(training, numIterations),
> I don't think SVM has done the classification correctly.
>
> My observations:
> 1. the intercept is always 0.0
> 2. the predicted labels are ALL 1's, no 0's.
>
> My questions are:
> 1. what should the numIterations be? I tried to set it to
> 10*trainingSetSize, is that sufficient?
> 2. since MLlib only accepts data with labels "0" or "1", shouldn't the
> default threshold for SVMWithSGD be 0.5 instead of 0.0?
> 3. It seems counter-intuitive to me to have the default intercept be 0.0,
> meaning the line has to go through the origin.
> 4. Does Spark MLlib provide an API to do grid search like scikit-learn
> does?
>
> Any help would be greatly appreciated!
>
> -----
> Thanks!
> -Caron
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
