Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-17 Thread Leon Palafox
Jake, I changed that line and it worked perfectly. The checksum also matched the one for the file I downloaded on Linux (which was the correct one). Thanks for the help, Leon On Sun, Nov 18, 2012 at 1:20 AM, Jake Vanderplas <vanderp...@astro.washington.edu> wrote: > Leon, > I think the problem might b

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Olivier Grisel
No: - either you use models that can stream over the data without loading everything into memory at once, i.e. models that support the `partial_fit` API as explained above (tree-based models do not, but Perceptron, SGDClassifier, or PassiveAggressiveClassifier would work)
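
The streaming pattern described here can be sketched without any library. This is a minimal illustration, not scikit-learn's implementation: `StreamingPerceptron` and `iter_chunks` are hypothetical names, and the toy perceptron only mimics the shape of a `partial_fit`-style update, where each call sees one chunk and the full data set is never held in memory at once.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]


class StreamingPerceptron:
    """Toy binary perceptron (labels -1/+1) updated one chunk at a time.

    Hypothetical stand-in for an estimator with a partial_fit-style method.
    """

    def __init__(self, n_features: int) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b: float = 0.0

    def decision(self, x: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "StreamingPerceptron":
        # Classic perceptron rule: update only on misclassified samples.
        for x, target in zip(X, y):
            if target * self.decision(x) <= 0:
                self.w = [wi + target * xi for wi, xi in zip(self.w, x)]
                self.b += target
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[int]:
        return [1 if self.decision(x) > 0 else -1 for x in X]


def iter_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most chunk_size rows from any iterable (e.g. a file reader)."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Toy data standing in for a stream that is too large to load at once.
rows = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -2.0], -1), ([-2.0, -0.5], -1)] * 5

model = StreamingPerceptron(n_features=2)
for chunk in iter_chunks(rows, chunk_size=3):
    X_chunk = [x for x, _ in chunk]
    y_chunk = [label for _, label in chunk]
    model.partial_fit(X_chunk, y_chunk)

predictions = model.predict([[2.0, 1.0], [-1.0, -2.0]])
```

With a real estimator the loop has the same shape: read a chunk, call the incremental update, discard the chunk.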

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Lars Buitinck
2012/11/17 Andreas Mueller: > It would be easy to implement this for the naive Bayes models, but > they don't have partial_fit yet afaik. For multinomial and Bernoulli NB, this would be quite easy. It just involves keeping the feature frequencies in the estimator and re-estimating the parameters
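
The idea of keeping feature frequencies in the estimator can be sketched as follows. This is an illustrative pure-Python toy, not the scikit-learn code: the class name `CountingMultinomialNB` and the smoothing parameter `alpha` are assumptions made for the example. Each `partial_fit` call only adds to running counts, and the probabilities are re-derived from those counts at prediction time.

```python
import math
from collections import defaultdict


class CountingMultinomialNB:
    """Toy multinomial naive Bayes that learns incrementally from count vectors."""

    def __init__(self, n_features, alpha=1.0):
        self.n_features = n_features
        self.alpha = alpha  # additive (Laplace) smoothing, assumed for the sketch
        self.feature_counts = defaultdict(lambda: [0.0] * n_features)
        self.class_counts = defaultdict(int)

    def partial_fit(self, X, y):
        # Only the sufficient statistics (counts) are updated; no past data is kept.
        for x, label in zip(X, y):
            self.class_counts[label] += 1
            counts = self.feature_counts[label]
            for j, value in enumerate(x):
                counts[j] += value
        return self

    def _log_joint(self, x, label):
        counts = self.feature_counts[label]
        total = sum(counts) + self.alpha * self.n_features
        n_samples = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_samples)
        for j, value in enumerate(x):
            score += value * math.log((counts[j] + self.alpha) / total)
        return score

    def predict(self, X):
        classes = list(self.class_counts)
        return [max(classes, key=lambda c: self._log_joint(x, c)) for x in X]


nb = CountingMultinomialNB(n_features=3)
nb.partial_fit([[3, 0, 1], [2, 0, 0]], ["spam", "spam"])  # first batch
nb.partial_fit([[0, 2, 1], [0, 3, 0]], ["ham", "ham"])    # second batch
labels = nb.predict([[4, 0, 0], [0, 5, 0]])
```

Because the model is fully determined by the counts, feeding the data in two batches gives the same state as feeding it in one.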

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Ronnie Ghose
Ehh... how do I say this ~ I'm 95% sure this stems from me not having much of a clue, but: so for the trees you can use a small subset of the data for each tree, and in SGD you can iterate over the data and use parts of it at a time. Are there any other methods in sklea

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Andreas Mueller
On 11/17/2012 04:19 PM, Ronnie Ghose wrote: > See you guys just said I could use trees on subsets and they will work > well. > > So why not partial_fits + trees? > As I tried to say, these are different stories: Gilles said (and wrote about) using a different small subset of the data for each tre

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Ronnie Ghose
Yes... that's what I mean. Could I do a similar thing to get around this, or to use other methods besides SGD and trees? On 17 November 2012 11:26, Richard T. Guy wrote: > It only makes sense to train a tree on a subset as part of an ensemble method, > and in that case you can train a set of trees by training

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Richard T. Guy
It only makes sense to train a tree on a subset as part of an ensemble method, and in that case you can train a set of trees by training each one on a subset of the data (be sure to choose the subset randomly, though). It's true that ensemble methods like RandomForest don't have partial_fit, but you cou
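
A rough sketch of the subset-per-tree idea, under stated assumptions: to stay self-contained it uses one-level decision stumps in place of full trees, and `fit_stump`, `fit_subsample_ensemble`, and `predict_ensemble` are hypothetical helper names, not library functions. Each stump sees only a random subset of the rows, and the ensemble predicts by majority vote.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({x[j] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[j] <= threshold]
            right = [lab for x, lab in zip(X, y) if x[j] > threshold]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    j, threshold, left_label, right_label = stump
    return left_label if x[j] <= threshold else right_label


def fit_subsample_ensemble(X, y, n_estimators, subset_size, seed=0):
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_estimators):
        # Each learner is trained on its own random subset of the rows.
        idx = rng.sample(range(len(X)), subset_size)
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble


def predict_ensemble(ensemble, X):
    return [majority([predict_stump(s, x) for s in ensemble]) for x in X]


# Toy data: the class depends only on the first feature.
X = [[float(i), float((i * 7) % 10)] for i in range(40)]
y = [0 if i < 20 else 1 for i in range(40)]

# subset_size=25 is chosen so every subset of this toy set contains both classes;
# on a large data set the subsets would be a much smaller fraction.
ensemble = fit_subsample_ensemble(X, y, n_estimators=7, subset_size=25, seed=0)
votes = predict_ensemble(ensemble, [[0.0, 0.0], [39.0, 3.0]])
```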

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-17 Thread Jake Vanderplas
Leon, I think the problem might be that the fetch script writes the file without binary mode. We should replace
    open(LOCAL_FILE, 'w').write(fhandle.read())
with
    open(LOCAL_FILE, 'wb').write(fhandle.read())
Could you check if that solves the problem? Jake On 11/17/2012 01:58 AM, L
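
For context, a small sketch of why the mode matters. On Windows, Python 2 text mode translates "\n" bytes to "\r\n" on write, which silently corrupts binary payloads; Python 3 text mode refuses bytes altogether. Writing in 'wb' avoids both. The helper name `save_binary` and the sample payload are made up for the illustration and are not part of the fetch script.

```python
import os
import tempfile


def save_binary(payload: bytes, path: str) -> None:
    """Write bytes exactly as received; 'wb' disables newline translation."""
    with open(path, "wb") as out:
        out.write(payload)


# A payload containing newline and carriage-return bytes, as binary data often does.
payload = b"\x93HEADER\r\n\x1a\n\x00\x01binary\ndata"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.bin")
    save_binary(payload, path)
    with open(path, "rb") as handle:
        round_trip = handle.read()
```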

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Ronnie Ghose
See you guys just said I could use trees on subsets and they will work well. So why not partial_fits + trees? On 17 November 2012 11:12, Gael Varoquaux wrote: > On Sat, Nov 17, 2012 at 11:10:52AM -0500, Ronnie Ghose wrote: > > hmm i'm asking is it possible to run all of the typical ~ whatever t

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Gael Varoquaux
On Sat, Nov 17, 2012 at 11:10:52AM -0500, Ronnie Ghose wrote: > hmm i'm asking is it possible to run all of the typical ~ whatever that > means ~ models in sklearn on a subset of that data and have it work > pretty well most of the time? No, only those that have 'partial_fit'. G

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Ronnie Ghose
hmm i'm asking is it possible to run all of the typical ~ whatever that means ~ models in sklearn on a subset of that data and have it work pretty well most of the time? On 17 November 2012 11:08, Andreas Mueller wrote: > On 11/17/2012 03:41 PM, Ronnie Ghose wrote: > > Hmmm interesting so I cou

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Andreas Mueller
On 11/17/2012 03:41 PM, Ronnie Ghose wrote: > Hmmm interesting so I could run > ex: > Naive Bayes, > Bayesian Nets > Boosting + Bagging > Generalized Unsupervised Learning > > on subsets O_O? The idea with trees and subsets is that you work with an ensemble anyway (a random forest). So you can tr

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Andreas Mueller
On 11/17/2012 03:41 PM, Ronnie Ghose wrote: > Also I have no clue what this is: > http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation > But is it relevant to my problem? It popped up on StackOverflow. > Are you sure that was meant when LDA was said? It could have also been Linear Discriminan

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Ronnie Ghose
Also I have no clue what this is: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation But is it relevant to my problem? It popped up on StackOverflow. On 17 November 2012 10:41, Ronnie Ghose wrote: > Hmmm interesting so I could run > ex: > Naive Bayes, > Bayesian Nets > Boosting + Bagging

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Ronnie Ghose
Hmmm interesting so I could run ex: Naive Bayes, Bayesian Nets Boosting + Bagging Generalized Unsupervised Learning on subsets O_O? And yeah 20 GB isn't that much, but that's because I'm still downloading. I'm about to start downloading ~2 GB or so per day, and I want to run it on those additional

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Olivier Grisel
It works: see Gilles' paper: http://orbi.ulg.ac.be/handle/2268/130099 -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Gilles Louppe
> For Trees, you could subsample and train trees on different > subsets but not sure how well this works if the subsets > are only a small fraction of the whole dataset. This often works surprisingly well :) (when subsampling both examples and features)

Re: [Scikit-learn-general] Joblib dump memory error

2012-11-17 Thread Olivier Grisel
The problem is likely the `vocabulary_` Python dict of the CountVectorizer. It's pickled using the default Python pickler, which is probably not very efficient. Anyway, for large text data, using a hashing vectorizer would be a much better solution. You can follow progress on this branch that shoul
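
To illustrate why a hashing vectorizer sidesteps the large `vocabulary_` dict, here is a minimal sketch of the hashing trick. It is not the branch mentioned above: `hashed_features`, the whitespace tokenizer, and the bucket count are assumptions for the example. Tokens are mapped straight to column indices by a hash function, so there is no fitted vocabulary to store or pickle.

```python
import hashlib
from collections import Counter


def hashed_features(text, n_buckets=2 ** 10):
    """Map a document to {column_index: count} without storing a vocabulary."""
    counts = Counter()
    for token in text.lower().split():
        # md5 is used only as a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        counts[index] += 1
    return dict(counts)


doc = "the cat saw the other cat"
features = hashed_features(doc)
features_again = hashed_features(doc)
```

The trade-off is that distinct tokens can collide in the same bucket and the mapping cannot be inverted to recover the words.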

Re: [Scikit-learn-general] Machine Learning on Large Data Sets

2012-11-17 Thread Andreas Mueller
On 11/17/2012 12:11 AM, Ronnie Ghose wrote: > :/ darnit. I wanted to run CARTs and Neural Nets on it >_<. though it > was a mystery to me how that would work. > You can do neural nets the same as the linear classifiers. We don't implement them yet, though ;) Btw, my tip would be to get a machine

Re: [Scikit-learn-general] GridSearch example

2012-11-17 Thread Andreas Mueller
On 11/16/2012 08:41 PM, Fred Mailhot wrote: On 15 November 2012 23:20, Andreas Mueller wrote: [...] You can give GridSearchCV not only a grid but also a list of grids. I would go with that. (Is that sufficiently documented?) This doesn't app
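
What "a list of grids" means can be shown with a small pure-Python sketch. GridSearchCV accepts a list of dicts as `param_grid`, and each dict is expanded on its own, so parameters that only make sense together stay together. The helper `expand_grids` and the parameter names below are illustrative assumptions, not library code.

```python
from itertools import product


def expand_grids(grids):
    """Yield one parameter dict per combination, expanding each grid separately."""
    for grid in grids:
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            yield dict(zip(keys, values))


# Two grids: gamma is only explored together with the rbf kernel.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

candidates = list(expand_grids(param_grid))
```

A single combined grid would instead try every gamma value with the linear kernel too, which wastes fits on settings that have no effect.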

Re: [Scikit-learn-general] RandomForest benchmark

2012-11-17 Thread Olivier Grisel
Yeah, actually they can only be better if the data is memmapped in advance (for instance using joblib.dump(data, filename) / joblib.load(filename, mmap_mode='c')). Also, this is only really interesting for large datasets (e.g. larger than 100MB), which is probably not the case here in retrospect. 201

Re: [Scikit-learn-general] RandomForest benchmark

2012-11-17 Thread Peter Prettenhofer
Olivier, I tested it with the joblib PR - results got a bit worse. See below. Best, Peter

arcene   r               py
score    0.2700 (0.03)   0.2633 (0.02)
train    3.9454 (0.09)   4.6661 (0.20)
test

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-17 Thread Leon Palafox
Ok, I did a new set of downloads and ran 2 MD5 checksums: the two data files do in fact have different checksums. Can someone post the correct checksum, just to be sure? I was thinking that it might have something to do with my system being in Japanese (although I doubt it). Can someone try it on a Windo
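
For anyone comparing downloads the same way, a minimal sketch of computing an MD5 checksum with Python's standard `hashlib`. The function name `md5_of_file` and the chunk size are choices made for the example; the file is read in binary mode and in blocks, so large data files do not need to fit in memory.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstration on a throwaway file; a real check would point at the downloaded data file.
sample = b"example payload\r\nwith mixed line endings\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.bin")
    with open(sample_path, "wb") as out:
        out.write(sample)
    checksum = md5_of_file(sample_path, chunk_size=8)
```

Two copies of the same file should give identical digests; a mismatch means the bytes differ somewhere.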

Re: [Scikit-learn-general] Data Set on Tutorial: Machine Learning for Astronomy with Scikit-learn

2012-11-17 Thread Gael Varoquaux
On Sat, Nov 17, 2012 at 01:10:21PM +0900, Leon Palafox wrote: > Indeed, when I tried to re-run it on my Windows PC at home it also found NaN. > The problem appears to be when I download the data using the script, since I > tried it with the data I downloaded from the Linux server and it ran fine.