Hi,
I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I get a different result. Since everything else stays
constant, I was wondering if this is due to the kFold function I used.
Have a look at the source code for MLUtils.kFold. Yes, there is a
random element. That's good; you want the folds to be randomly chosen.
Note there is a seed parameter, as in a lot of the APIs, that lets you
fix the RNG seed and so get the same result every time, if you need
to.
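
To illustrate the point about the seed, here is a minimal sketch in plain
Python. It is not Spark's implementation of MLUtils.kFold; the function
k_fold and its parameters are hypothetical stand-ins, and the shuffle-based
split is only an assumption made to keep the example short. It shows that a
fixed seed makes the random folds repeatable from run to run.

```python
import random

def k_fold(data, num_folds, seed):
    """Return one (training, validation) pair per fold (hypothetical helper)."""
    rng = random.Random(seed)      # private RNG: the seed alone decides the shuffle
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Deal the shuffled items round-robin into num_folds validation sets.
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    pairs = []
    for i, validation in enumerate(folds):
        # Training data is everything that is not in the i-th validation fold.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pairs.append((training, validation))
    return pairs

data = list(range(20))
same_a = k_fold(data, num_folds=4, seed=42)
same_b = k_fold(data, num_folds=4, seed=42)
other = k_fold(data, num_folds=4, seed=7)

print(same_a == same_b)  # True: same seed, same folds
print(same_a == other)   # a different seed will usually give different folds
```

If the folds are identical across runs and the metric still changes, the
variation has to come from somewhere other than the split.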
On Fri, Jan 30,
Are you using SGD for logistic regression? There's a random element
there too, by nature. I looked into the code and see that you can't
set a seed, but actually, the sampling is done with a fixed seed per
partition anyway. Hm.
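
As a rough illustration of where the randomness in SGD comes from, here is a
small sketch in plain Python. It is not MLlib's code; train_sgd, its
parameters, and the toy data are all hypothetical. Each iteration samples a
mini-batch, so the result depends on the RNG, and fixing the seed makes
single-process runs repeat exactly.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(points, num_iterations, step_size, batch_fraction, seed):
    """Mini-batch SGD for logistic regression on (features, label) pairs."""
    rng = random.Random(seed)
    weights = [0.0] * len(points[0][0])
    batch_size = max(1, int(len(points) * batch_fraction))
    for _ in range(num_iterations):
        batch = rng.sample(points, batch_size)  # the random element
        grad = [0.0] * len(weights)
        for features, label in batch:
            margin = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(margin) - label
            for k, x in enumerate(features):
                grad[k] += error * x
        # Step against the average gradient of the sampled batch.
        weights = [w - step_size * g / batch_size for w, g in zip(weights, grad)]
    return weights

# Toy data: a bias term plus one feature x; the label is 1 when x > 0.
data_rng = random.Random(0)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
points = [([1.0, x], 1.0 if x > 0 else 0.0) for x in xs]

run_a = train_sgd(points, 50, 0.5, 0.1, seed=1)
run_b = train_sgd(points, 50, 0.5, 0.1, seed=1)
run_c = train_sgd(points, 50, 0.5, 0.1, seed=2)

print(run_a == run_b)  # True: same seed, same sampled batches, same weights
print(run_a == run_c)  # a different seed will usually give different weights
```

On a cluster, the partitioning of the data and the order in which partial
results are combined can also differ between runs, which this
single-process sketch does not model.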
In general you would not expect these algorithms to produce the same
Thanks. I did specify a seed parameter.
It seems that the problem is not caused by kFold. I actually ran another
experiment without cross validation. I just built a model with the training
data and then tested the model on the test data. However, the accuracy
still varies from one run to another.