Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Hi, I am using the utility function kFold provided in Spark to do k-fold cross validation with logistic regression. However, each time I run the experiment, I get a different result. Since everything else stays constant, I was wondering if this is due to the kFold function I used.

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Have a look at the source code for MLUtils.kFold. Yes, there is a random element. That's good; you want the folds to be randomly chosen. Note there is a seed parameter, as in a lot of the APIs, that lets you fix the RNG seed and so get the same result every time, if you need to.
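To illustrate why fixing the seed makes the split reproducible, here is a minimal plain-Python sketch of a seeded k-fold split. It is not Spark's implementation: the function name k_fold and the list-based data are hypothetical stand-ins. In Spark itself the call would be MLUtils.kFold(rdd, numFolds, seed).

```python
import random


def k_fold(data, num_folds, seed):
    """Return a list of (training, validation) pairs, one per fold.

    Hypothetical helper for illustration only; not Spark's MLUtils.kFold.
    """
    # A private RNG, so the split depends only on the seed passed in.
    rng = random.Random(seed)
    # Randomly assign every element to exactly one validation fold.
    assignment = [rng.randrange(num_folds) for _ in data]
    folds = []
    for fold in range(num_folds):
        training = [x for x, a in zip(data, assignment) if a != fold]
        validation = [x for x, a in zip(data, assignment) if a == fold]
        folds.append((training, validation))
    return folds


data = list(range(20))
first_run = k_fold(data, num_folds=4, seed=42)
second_run = k_fold(data, num_folds=4, seed=42)
print(first_run == second_run)  # True: same seed, same split
```

With the seed held fixed, any remaining run-to-run variation has to come from somewhere other than the fold assignment.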

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Are you using SGD for logistic regression? There's a random element there too, by nature. I looked into the code and see that you can't set a seed, but actually, the sampling is done with a fixed seed per partition anyway. Hm. In general you would not expect these algorithms to produce the same
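As a rough sketch of where the randomness in SGD comes from, the plain-Python example below samples a mini-batch on every iteration. It assumes the sampling seed is derived from a fixed base seed plus the iteration number, so two runs pick the same batches. The function sgd_logistic, its parameters, and the toy data are hypothetical and do not reproduce Spark's code.

```python
import math
import random


def sgd_logistic(points, num_iterations, batch_fraction, step, base_seed):
    """Mini-batch SGD for logistic regression on (features, label) pairs.

    Labels are 0 or 1. Hypothetical helper for illustration only.
    """
    weights = [0.0] * len(points[0][0])
    for i in range(1, num_iterations + 1):
        # Assumption: the sampler is re-seeded deterministically per iteration.
        rng = random.Random(base_seed + i)
        batch = [p for p in points if rng.random() < batch_fraction]
        if not batch:
            continue  # nothing sampled this round
        gradient = [0.0] * len(weights)
        for features, label in batch:
            margin = sum(w * x for w, x in zip(weights, features))
            prediction = 1.0 / (1.0 + math.exp(-margin))
            for j, x in enumerate(features):
                gradient[j] += (prediction - label) * x
        # Average the gradient and decay the step size with sqrt(iteration).
        scale = step / (len(batch) * math.sqrt(i))
        weights = [w - scale * g for w, g in zip(weights, gradient)]
    return weights


# Toy data: a bias term plus one feature; label is 1 when the feature is positive.
points = [([1.0, x / 4.0], 1 if x > 0 else 0) for x in range(-8, 9) if x != 0]
run_one = sgd_logistic(points, 50, 0.5, 1.0, base_seed=42)
run_two = sgd_logistic(points, 50, 0.5, 1.0, base_seed=42)
print(run_one == run_two)  # True: identical batches, identical weights
```

If the sampling were seeded from the clock instead, or the input order changed between runs, the batches and therefore the final weights could differ.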

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Thanks. I did specify a seed parameter. Seems that the problem is not caused by kFold. I actually ran another experiment without cross validation. I just built a model with the training data and then tested the model on the test data. However, the accuracy still varies from one run to another.