Re: [Scikit-learn-general] Dynamic Bayes network/hierarchical hidden markov model (hhmm)

2013-08-23 Thread Shuo Wang
Hi Andy, thanks. It seems that people have not been very interested in dynamic Bayesian networks in recent years; even Murphy's BNT has been inactive for quite a while. Any idea why interest in this method is so low? It seems to me a good way to learn and predict regimes in time s

[Scikit-learn-general] Reviewers + testers needed | memory issue with numpy.dot for numpy < 1.8 tackled

2013-08-23 Thread Denis-Alexander Engemann
Dear scikit-learners, during the last sprint we spotted an efficiency issue with numpy.dot for numpy versions < 1.8. Apparently, dot allocates additional copies in order to deliver appropriate input to the underlying BLAS gemm function, which expects Fortran-contiguous memory layout f

Re: [Scikit-learn-general] Scikit-learn for large datasets?

2013-08-23 Thread amir rahimi
Hi Helge, this ECML/PKDD paper [1] might be helpful in the case of semi-supervised learning. Some time ago, one of the authors of [1] and I talked about implementing the algorithm in sklearn. I think now is a good time to mention it on the mailing list. I'm not sure if there is any online semi-s

Re: [Scikit-learn-general] lightning vs. scikit-learn benchmark

2013-08-23 Thread Mathieu Blondel
A poor-man's scikit-learn compatible wrapper around VW would be to call the command line via popen and feed it data through stdin. If you do that, create a gist and add it to the third-party snippet list in https://github.com/scikit-learn/scikit-learn/wiki/Useful-Snippets Mathieu On Fri, Aug 23
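The popen-and-stdin idea above could be sketched roughly as follows. This is a minimal illustration, not Mathieu's code and not an official VW binding: `to_vw_line` and `pipe_examples` are hypothetical helper names, and the stand-in command simply echoes stdin so the sketch runs without VW installed. It assumes VW's plain-text input format of `label | name:value ...`.

```python
import subprocess
import sys


def to_vw_line(label, features):
    """Format one example as 'label | name:value ...' (VW's plain-text input format)."""
    feats = " ".join(f"{name}:{value}" for name, value in sorted(features.items()))
    return f"{label} | {feats}"


def pipe_examples(command, examples):
    """Start `command`, write the formatted examples to its stdin, return its stdout."""
    payload = "\n".join(to_vw_line(y, x) for y, x in examples) + "\n"
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in command (an assumption for illustration): a Python one-liner that
# echoes stdin. With VW on the PATH, the real command line would go here;
# check its flags against your VW version.
ECHO = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

examples = [(1, {"height": 1.5, "width": 2.0}), (-1, {"height": 0.3})]
output = pipe_examples(ECHO, examples)
print(output, end="")
```

`subprocess.run` uses `Popen` internally; a wrapper that needs to stream very large inputs would more likely hold a `Popen` object open and write to its stdin incrementally.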

Re: [Scikit-learn-general] Scikit-learn for large datasets?

2013-08-23 Thread Olivier Grisel
Thanks for the details. My main advice is still the same: try on small subsamples with increasing sizes and check the impact of the size of the training set on the test score. For a linear binary classifier I am pretty sure that it's not going to help you to use all the data (unless you learn non-
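The subsampling advice above can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the synthetic one-dimensional data, the toy threshold classifier standing in for a linear binary classifier, and the helper names. With scikit-learn, the same loop would wrap a real estimator instead.

```python
import random


def make_data(n, rng):
    """Synthetic 1-D binary data: class 1 centred at +1, class 0 centred at -1."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
        rows.append((x, y))
    return rows


def fit_threshold(rows):
    """Toy linear classifier: threshold at the midpoint of the two class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:  # degenerate subsample with a single class
        return 0.0
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'x > threshold' matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


def learning_curve(train, test, sizes, seed=0):
    """Fit on random subsamples of increasing size; return (size, test accuracy) pairs."""
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    return [(n, accuracy(fit_threshold(shuffled[:n]), test)) for n in sizes]


rng = random.Random(42)
train, test = make_data(2000, rng), make_data(1000, rng)
sizes = [20, 200, 2000]
curve = learning_curve(train, test, sizes)
for n, acc in curve:
    print(n, round(acc, 3))
```

If the test score stops improving between consecutive sizes, training on more rows is unlikely to help that model.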

Re: [Scikit-learn-general] Scikit-learn for large datasets?

2013-08-23 Thread helge.reike...@gmail.com
Thanks a lot, Nick and Olivier. To answer your questions: How many samples? About 1 billion rows. How many features? It will depend on the nature of the analyses. Many of the categorical variables have taxonomies that can be used to reduce cardinality. Sometimes I'll want to use these,

Re: [Scikit-learn-general] lightning vs. scikit-learn benchmark

2013-08-23 Thread Eustache DIEMERT
> The kind of thing I would like to do is run vowpal-wabbit from within scikit learn.
I know VW has a C interface now, so it is theoretically possible to develop a Python binding (hunch.net seems down as of now, but John Langford wrote about it on the blog). However, memory structures possib

Re: [Scikit-learn-general] lightning vs. scikit-learn benchmark

2013-08-23 Thread Sean Violante
Thanks Lars - I would really like to clarify the problems with my suggestion, in particular if/how a CLI interface would break the scikit learn interface. You obviously can immediately identify the problems. The kind of thing I would like to do is run vowpal-wabbit from within scikit learn. There

Re: [Scikit-learn-general] Scikit-learn for large datasets?

2013-08-23 Thread Olivier Grisel
2013/8/23 helge.reike...@gmail.com: > Good day, can anyone perhaps give me an idea of how large a dataset scikit-learn algorithms can typically handle? I have about 4 TB of structured data. I might be able to normalize that down to, say, 1 TB if necessary. The tasks would typically be log

Re: [Scikit-learn-general] Scikit-learn for large datasets?

2013-08-23 Thread Nick Pentreath
Hey Helge, funny, I just saw this drop into my inbox! Hope you are well. What does your data look like? Is it sparse? For classification tasks (read: SGDClassifier), one can stream data one-by-one and thus be "out-of-core", though in this case I'd recommend doing it in "mini-batches". This would u
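The mini-batch streaming mentioned above might look like the following sketch. The generator is plain standard-library Python; `mini_batches` is a hypothetical helper name, and the in-memory stream is a stand-in for rows read lazily from disk.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield lists of up to `batch_size` items from any iterable, one batch at a time."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stream of (feature, label) rows; only one batch is held in
# memory at any time.
rows = ((i, i % 2) for i in range(10))
batch_sizes = [len(batch) for batch in mini_batches(rows, 4)]
print(batch_sizes)  # → [4, 4, 2]
```

Each batch would then be converted to arrays and passed to `SGDClassifier.partial_fit`, which needs the full list of class labels via its `classes` argument on the first call.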

[Scikit-learn-general] Scikit-learn for large datasets?

2013-08-23 Thread helge.reike...@gmail.com
Good day, can anyone perhaps give me an idea of how large a dataset scikit-learn algorithms can typically handle? I have about 4 TB of structured data. I might be able to normalize that down to, say, 1 TB if necessary. The tasks would typically be logistic regression, Naive Bayes, k-Means and possib