Hi Thomas,

What quality do you get on the training set?
There is no silver bullet, but there is a fairly common technique you can
use to check whether you are using an appropriate algorithm. Look at the
gap between the "train" and "validation" scores on the learning curves
(example:
<http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#example-model-selection-plot-learning-curve-py>).
If you see a big gap, you can reduce the complexity of your model to
overcome overfitting (reduce the interaction parameter / number of
variables / iterations / ...). If you see a small gap, you can try to
increase the model complexity to fit your data better. (A minimal sketch of
this check is appended below the quoted message.)

Moreover, I see you have a tiny dataset and use a 50/50 split. I presume
you will train the "production" model on the whole available dataset. In
that case, I suggest you use more data for training and use an (almost)
leave-one-out
<http://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out-loo>
approach to better estimate your predictive quality. But be really cautious
with cross-validation, as you can easily overfit your data. (A second
sketch below illustrates this.)

2016-10-01 15:59 GMT+01:00 Thomas Evangelidis <[email protected]>:

> Dear scikit-learn users and developers,
>
> I have a dataset consisting of 42 observations (molnames) and 4 variables
> (VDWAALS, EEL, EGB, ESURF) with which I want to make a predictive model
> that estimates the experimental value (Expr). I tried multivariate linear
> regression using 10,000 bootstrap repeats, each time using 21 observations
> for training and the remaining 21 for testing, but the average correlation
> was only R = 0.1727 +- 0.19779.
>
>> molname          VDWAALS      EEL         EGB       ESURF     Expr
>> CHEMBL108457    -20.4848    -96.5826     23.4584   -5.4045   -7.27193
>> CHEMBL388269    -50.3860     28.9403    -51.5147   -6.4061   -6.8022
>> CHEMBL244078    -49.1466    -21.9869     17.7999   -6.4588   -6.61742
>> CHEMBL244077    -53.4365    -32.8943     34.8723   -7.0384   -6.61742
>> CHEMBL396772    -51.4111    -34.4904     36.0326   -6.5443   -5.82207
>> ........
>
> I would like your advice about what other machine learning algorithm I
> could try with these data. E.g. can I make a decision tree, or are the
> observations and variables too few to avoid overfitting? I could include
> more variables, but the number of observations will always remain 42.
>
> I would greatly appreciate any advice!
>
> Thomas
>
>
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn
>

--
Yours sincerely,
https://www.linkedin.com/in/alexey-dral
Alexey A. Dral
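
P.S. Below is a minimal, untested sketch of the learning-curve check I
mentioned above. It assumes your table lives in a whitespace-separated file
(the name "binding_data.tsv" is just a placeholder) with the same column
names as in your message, and that plain LinearRegression is the model you
want to probe:

# Compare train vs. validation scores as a function of training set size.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Placeholder file name; replace with your own data source.
df = pd.read_csv("binding_data.tsv", sep=r"\s+")
X = df[["VDWAALS", "EEL", "EGB", "ESURF"]].values
y = df["Expr"].values

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.3, 1.0, 5), cv=cv, scoring="r2")

print("train sizes    :", train_sizes)
print("train R^2      :", train_scores.mean(axis=1).round(3))
print("validation R^2 :", valid_scores.mean(axis=1).round(3))
# A large, persistent gap between the two score rows points to overfitting
# (reduce complexity); a small gap with low scores points to underfitting.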
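
And here is a sketch of the (almost) leave-one-out evaluation, reusing X and
y from the snippet above. With only 42 observations, full leave-one-out is
cheap, and the pooled out-of-fold predictions give you one correlation
estimate instead of an average over tiny test sets:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Each of the 42 folds trains on 41 observations and predicts the one left out.
y_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

r = np.corrcoef(y, y_pred)[0, 1]              # Pearson R, observed vs. predicted
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print("LOO Pearson R = %.3f, RMSE = %.3f" % (r, rmse))
# Caution: if you use this number to choose among many models or feature
# sets, you can still overfit the cross-validation itself.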
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
