Re: [Scikit-learn-general] Comparisons of classifiers

2016-04-12 Thread Gael Varoquaux
On Sat, Mar 26, 2016 at 05:31:36PM -0400, Sebastian Raschka wrote: > I wouldn’t fundamentally change the random forest algorithm in scikit-learn > using ideas from xgboost, since it wouldn’t be a random forest anymore, then. > Please don’t get me wrong, I’d also like to see a more efficient (pred

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
I don't think we can deny this is strange, certainly for real-world, IID data! On 13 April 2016 at 10:31, Juan Nunez-Iglesias wrote: > Yes but would you expect sampling 280K / 3M to be qualitatively different > from sampling 70K / 3M? > > At any rate I'll attempt a more rigorous test later this

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Juan Nunez-Iglesias
Yes but would you expect sampling 280K / 3M to be qualitatively different from sampling 70K / 3M? At any rate I'll attempt a more rigorous test later this week and report back. Thanks! Juan. On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman wrote: > It's hard to believe this is a software problem

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Joel Nothman
It's hard to believe this is a software problem rather than a data problem. If your data was accidentally a duplicate of the dataset, you could certainly get 100%. On 13 April 2016 at 10:10, Juan Nunez-Iglesias wrote: > Hallelujah! I'd given up on this thread. Thanks for resurrecting it, Andy! >

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Juan Nunez-Iglesias
Hallelujah! I'd given up on this thread. Thanks for resurrecting it, Andy! =) However, I don't think data distribution can explain the result, since GridSearchCV gives the expected result (~0.8 accuracy) with 3K and 70K random samples but changes to perfect classification for 280K samples. I don't

Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

2016-04-12 Thread Andreas Mueller
Have you tried to "score" the grid-search on the non-training set? The cross-validation is using stratified k-fold while your confirmation used the beginning of the dataset vs the rest. Your data is probably not IID. On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote: Hi all, TL;DR: when I ru
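To make the point about the two splitting schemes concrete, here is a minimal pure-Python sketch. It does not use scikit-learn's own splitter: `stratified_folds` is a hypothetical helper written only for illustration, and the ordered label list is made up. It shows how stratified folds keep the class balance constant on ordered data, while a "beginning of the dataset vs the rest" split does not.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds round-robin within each class.

    Illustrative stand-in for stratified k-fold, not scikit-learn's code.
    """
    per_class = defaultdict(list)
    for idx, label in enumerate(labels):
        per_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in per_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def positive_rate(labels, indices):
    """Fraction of positive labels among the given sample indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Hypothetical ordered data: positives are rare early and common late.
labels = [0] * 80 + [1] * 20 + [0] * 20 + [1] * 80

# Every stratified fold sees the overall class balance.
folds = stratified_folds(labels, n_folds=5)
fold_rates = [positive_rate(labels, fold) for fold in folds]

# A contiguous split trains and tests on very different distributions.
head = list(range(50))
tail = list(range(50, len(labels)))
head_rate = positive_rate(labels, head)
tail_rate = positive_rate(labels, tail)

print("stratified fold positive rates:", fold_rates)
print("head vs tail positive rates:", head_rate, tail_rate)
```

If the rows are ordered in some meaningful way, scores from the two schemes are not directly comparable, which is one reason to score the fitted grid search on data it never saw.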

Re: [Scikit-learn-general] Class Weight Random Forest Classifier

2016-04-12 Thread Andreas Mueller
Another possibility is to threshold the predict_proba differently, such that the decision maximizes whatever metric you have defined. On 03/15/2016 07:44 AM, Mamun Rashid wrote: Hi All, I asked this question a couple of weeks ago on the list. I have a two class problem where my positive cl
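A minimal sketch of the thresholding idea, in pure Python. The probabilities and labels below are made up, F1 is only an example metric, and `f1_at` and `best_threshold` are hypothetical helper names. In practice the probabilities would be the positive-class column of `predict_proba` on a held-out validation set, not on the training data.

```python
def f1_at(threshold, probabilities, labels):
    """F1 score when predicting positive for probability >= threshold."""
    tp = fp = fn = 0
    for prob, label in zip(probabilities, labels):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(probabilities, labels, metric=f1_at):
    """Try each observed probability as a cut-off and keep the best one."""
    candidates = sorted(set(probabilities))
    return max(candidates, key=lambda t: metric(t, probabilities, labels))


# Hypothetical validation-set outputs for an imbalanced two-class problem.
probabilities = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 1, 0, 1, 0, 1]

threshold = best_threshold(probabilities, labels)
print("chosen threshold:", threshold)
print("F1 at chosen threshold:", f1_at(threshold, probabilities, labels))
print("F1 at default 0.5:", f1_at(0.5, probabilities, labels))
```

Any other metric with the same signature can be passed as `metric`; the chosen cut-off should then be checked on a separate test set, since it is itself a fitted parameter.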

Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Vlad Niculae
I would definitely join the sprint, anything after June 17 works for me. I was thinking to come hang around during ICML, even if I might not be able to afford the conference. Cheers, Vlad On Tue, Apr 12, 2016 at 11:39 AM, Andreas Mueller wrote: > So should we pick another or possibly an addition

Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Andreas Mueller
So should we pick another or possibly an additional date? Will anyone be in NYC for ICML / UAI / COLT? On 04/12/2016 03:56 AM, Alexandre Gramfort wrote: >> Sorry, ICML is at the same dates as the big brain imaging conference, so >> I will not be able to attend (neither the conference, nor a sprint

Re: [Scikit-learn-general] load_svmlight_file value error

2016-04-12 Thread Gunjan Dewan
Hi Manjush, Yes, this issue has been reported. You can use the data from the following link. Its train and test data sets do not have spaces between commas, so I was able to load this using svmlight. Link: http://research.microsoft.com/en-us/um/people/manik/downloads/XC/XMLRepository.html On
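If switching data sets is not an option, one possible workaround is to normalize the label field before loading. This is a rough sketch under an assumption the thread does not spell out: that the failing file writes multilabel targets as `1, 2, 3` with whitespace around the commas, and that commas occur nowhere else on a line. `normalize_svmlight_line` is a hypothetical helper, not part of scikit-learn.

```python
import re


def normalize_svmlight_line(line):
    """Remove whitespace around commas so labels read '1,2,3'.

    Assumes commas only appear in the label field; feature entries
    ('index:value') are separated by spaces and are left untouched.
    """
    return re.sub(r"\s*,\s*", ",", line)


# Made-up example lines in the assumed problematic format.
raw_lines = [
    "1, 2, 3 4:1.0 7:0.5",
    "5 2:0.25 9:1.0",
]
clean_lines = [normalize_svmlight_line(line) for line in raw_lines]

for line in clean_lines:
    print(line)
```

The cleaned lines could then be written to a new file and passed to `load_svmlight_file` with `multilabel=True`.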

Re: [Scikit-learn-general] Data properties for mutual information feature selection

2016-04-12 Thread Manjush Vundemodalu
It depends on your problem statement and the data set you are using to train your model. Can you be more specific? Regards, Manjush On Wed, Feb 17, 2016 at 8:26 AM Shishir Pandey wrote: > Hi > > What properties of data should I look at to justify that mutual > information is a good feature selection

Re: [Scikit-learn-general] load_svmlight_file value error

2016-04-12 Thread Manjush Vundemodalu
Has this issue been reported already? I am getting the same error while trying to load the Kaggle train.csv (same file) with load_svmlight_file. Regards, Manjush On Sat, Feb 13, 2016 at 9:56 AM Gunjan Dewan wrote: > I'll do that. > > Thanks a lot. > > Gunjan > > On Sat, Feb 13, 2016 at 6:04 AM, Mathieu Blonde

Re: [Scikit-learn-general] sklearn Hackathon during ICML ?

2016-04-12 Thread Alexandre Gramfort
> Sorry, ICML is at the same dates as the big brain imaging conference, so > I will not be able to attend (neither the conference, nor a sprint). same for me. Surprisingly... Alex