Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Joseph Turian
Actually, it turns out I was incorrect. According to the docs: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees "each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a
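
The bootstrap step quoted from the docs can be illustrated with a minimal sketch. It uses only the Python standard library, not scikit-learn's own implementation; the helper name `bootstrap_sample` and the toy training set are assumptions made for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
training_set = list(range(10))    # toy stand-in for the training rows

# Each tree in the ensemble would get its own sample like this one.
sample = bootstrap_sample(training_set, rng)

# Rows that were never drawn are "out of bag" for this tree.
out_of_bag = sorted(set(training_set) - set(sample))

print(sample)
print(out_of_bag)
```

Because the draw is with replacement, some rows typically appear more than once in `sample` and others not at all.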

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Olivier Grisel
2012/10/26 Philipp Singer: > On 26.10.2012 15:35, Olivier Grisel wrote: >> BTW, in the meantime you could encode your co-occurrences as text >> identifiers and use either Lucene/Solr in Java using the sunburnt python >> client or Whoosh [1] in python as a way to do efficient sparse lookups >> in such

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Philipp Singer
On 27.10.2012 23:43, Joseph Turian wrote: > If you only care about near matches and not the full n^2 matrix: > > +1 to OG's suggestion to use pylucene. > > You can use pylucene to generate candidates, and then compute the > exact tf*idf cosine distance on the shortlist. Yes exactly. I would only

Re: [Scikit-learn-general] Jython and Scikit-Learn

2012-10-27 Thread Robert Kern
On Sat, Oct 27, 2012 at 10:39 PM, Joseph Turian wrote: > How does jnius compare with jpype? It isn't dead, mostly. More seriously, with active developers and Cython underpinnings, they might accept some PRs to add efficient numpy support. -- Robert Kern ---

Re: [Scikit-learn-general] All-pairs-similarity calculation

2012-10-27 Thread Joseph Turian
If you only care about near matches and not the full n^2 matrix: +1 to OG's suggestion to use pylucene. You can use pylucene to generate candidates, and then compute the exact tf*idf cosine distance on the shortlist. I assume this will be n log n. Another option for fast all-pairs is to use loc
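
The two-stage idea above (generate candidates cheaply, then score only the shortlist exactly) can be sketched as follows. This is a pure-Python illustration, not pylucene: a small in-memory inverted index stands in for the search engine, the function names are hypothetical, the idf formula is one common variant chosen as an assumption, and the score is cosine similarity (one minus the cosine distance mentioned in the message).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(n / df) + 1, so shared terms keep some weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(docs):
    """Stand-in for the index lookup: pairs sharing at least one term."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            index.setdefault(term, []).append(doc_id)
    pairs = set()
    for ids in index.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pairs.add((ids[a], ids[b]))
    return pairs


docs = [
    ["random", "forest", "tree"],
    ["random", "forest", "bootstrap"],
    ["lucene", "index", "search"],
]
vectors = tfidf_vectors(docs)
shortlist = candidate_pairs(docs)

# Exact scoring happens only on the shortlist, never on all n^2 pairs.
scores = {pair: cosine(vectors[pair[0]], vectors[pair[1]]) for pair in shortlist}
print(scores)
```

Pairs with no term in common never reach the scoring step, which is where the saving over the full matrix comes from.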

Re: [Scikit-learn-general] Jython and Scikit-Learn

2012-10-27 Thread Joseph Turian
How does jnius compare with jpype? On Fri, Oct 26, 2012 at 4:52 PM, Robert Kern wrote: > On Fri, Oct 26, 2012 at 4:52 PM, Didier Vila wrote: >> Mathieu and Olivier, >> >> Thanks for your emails. >> >> My interest in Python and scikit-learn grows each day so I will try a >> solution for the new

Re: [Scikit-learn-general] Precision-recall now requires probas_pred to be in [0, 1]

2012-10-27 Thread Gael Varoquaux
On Fri, Oct 26, 2012 at 06:24:28PM +0100, Andreas Mueller wrote: > Which PR was that. That is bad :-( > > I suggest to change it back to working with any non-bounded test > > statistic. Any reason not to? I am proposing to do the work. > +1 Done in 90c007981f54 G

Re: [Scikit-learn-general] ANN: astroML version 0.1

2012-10-27 Thread Jake Vanderplas
Thanks Gael, Yes, I've been thinking a lot about density estimation, and I've designed all the astroML code to be fairly easy to move upstream if desired. I have a bit of a vision for density estimation: I'd love in the future to create an sklearn.density submodule which has things like KDE (

Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Richard T. Guy
That explains the confusion! Thanks, guys. Tommy On Sat, Oct 27, 2012 at 5:25 AM, Joseph Turian wrote: > Gilles, > > I met Tommy Guy at the pydata conference today. > If I remember correctly, Brian Eoff (I don't have his email address) > errantly said that random forests partition/sample the

Re: [Scikit-learn-general] Jython and Scikit-Learn

2012-10-27 Thread didier vila
All, it looks like the ERP system that we want to implement already has an API in C++, so this is good news for Python and scikit-learn. It will just be a matter of creating a wrapper in Python to access the system through its C++ API. Does that look sensible? Regards, Didier > F

Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Joseph Turian
Gilles, I met Tommy Guy at the pydata conference today. If I remember correctly, Brian Eoff (I don't have his email address) errantly said that random forests partition/sample the features before creating each tree. I didn't want to correct him in front of the audience, and it slipped my mind to

Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Gilles Louppe
Hi, > I know the speaker at pydata today claimed that the features are > partitioned, Can you elaborate? If you pick your features prior to the construction of the tree and then build it on that subset only, then indeed, this is not random forest. That algorithm is called Random Subspaces. Best,
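
The distinction drawn above can be sketched in a few lines. This is a simplified illustration that only shows where the feature sampling happens; actual tree construction is omitted, and the function names and sizes are assumptions rather than scikit-learn API.

```python
import random


def random_subspace_features(n_features, k, rng):
    """Random Subspaces: pick k features once, before the tree is built."""
    return sorted(rng.sample(range(n_features), k))


def random_forest_split_features(n_features, k, n_splits, rng):
    """Random forest: draw a fresh set of k candidate features at every split."""
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_splits)]


rng = random.Random(0)

# One fixed subset for the whole tree.
subspace = random_subspace_features(10, 3, rng)

# A new candidate subset per split, so the tree as a whole can
# end up using many more than k distinct features.
per_split = random_forest_split_features(10, 3, 5, rng)
tree_can_use = sorted(set().union(*per_split))

print(subspace)
print(per_split)
print(tree_can_use)
```

In the first case the tree never sees the features outside `subspace`; in the second, every feature remains available to the tree, and only the candidates examined at each individual split are restricted.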

Re: [Scikit-learn-general] random forest question

2012-10-27 Thread Joseph Turian
> So the short answer is no. All features will be considered when > building a decision tree, as it should. Tommy, I know the speaker at pydata today claimed that the features are partitioned, but I don't believe this to be the case in how random forests were originally specified. Best, Josep

Re: [Scikit-learn-general] ANN: astroML version 0.1

2012-10-27 Thread Gael Varoquaux
It looks really awesome! The examples are superb. It looks like you have some really cool density estimation code. I would personally love to see such functionality in the scikit. Do you think that some of it could be moved upstream? Thanks a lot for being our astrophysics figure-head! I feel th