Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ark
Ronnie Ghose writes:
> do you think it isn't saving correctly or it isn't loading correctly?
I am thinking the issue is with writing the length of the compressed zfile, in write_zfile in numpy_pickle.py (although I might be wrong :) ). [Er, btw bit unrelated, but since I moved t

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ronnie Ghose
do you think it isn't saving correctly or it isn't loading correctly? On Thu, Jan 24, 2013 at 8:14 PM, Ark wrote: > Gael Varoquaux writes: > > > > > On Wed, Jan 23, 2013 at 12:16:32AM +, Afik Cohen wrote: > > Hi, I'm working with Ark on this project. Yes, that's what it looks like > > > -

Re: [Scikit-learn-general] Joblib compressed file error?

2013-01-24 Thread Ark
Gael Varoquaux writes: > > On Wed, Jan 23, 2013 at 12:16:32AM +, Afik Cohen wrote: > Hi, I'm working with Ark on this project. Yes, that's what it looks like > > - some investigation into this appears to show that either this is a bug > > in zlib (the length returned is incorrect) or this

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Satrajit Ghosh
hi ariel, on the unit sphere, the dot product of the vectors would be exactly that except for the range. you would want to scale it so that -1 to 1 maps to 0 to 1 and then run spectral clustering on that matrix. if you have too many vectors you can create a sparse matrix, but on my mbp i can handl
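The rescaling described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `cosine_affinity` is hypothetical and not from the thread, and the rows of the input are normalized first so that the dot products really are cosines.

```python
import numpy as np


def cosine_affinity(X):
    """Map pairwise dot products of unit vectors from [-1, 1] to [0, 1].

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X = np.asarray(X, dtype=float)
    # Normalize each row so that every vector lies on the unit sphere.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Dot products of unit vectors are cosines of the angles between them.
    cos = np.clip(U @ U.T, -1.0, 1.0)
    # Shift and scale: -1 maps to 0, +1 maps to 1.
    return (cos + 1.0) / 2.0


A = cosine_affinity([[1, 0, 0], [0, 1, 0], [-1, 0, 0]])
```

A matrix built this way could then be passed to `sklearn.cluster.SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(A)`, with `k` chosen by the user.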

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Ariel Rokem
Hi Satra,
Thanks for the hint. "Affinity", as in an affine transformation? Isn't the angle between any two unit vectors that transformation?
Cheers,
Ariel
On Thu, Jan 24, 2013 at 1:01 PM, Satrajit Ghosh wrote:
> hi ariel,
>
> if you can precompute affinity between your vectors, you could al

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread Osman Başkaya
Thanks Olivier. But I want to ask a question about this. Isn't it a problem that we take such big steps? I worked with Weka a bit more and it seems that Logistic Regression in scikit is similar to Weka's LibLINEAR. I tried them on the same dataset. Results are similar except especially *accommodat

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Satrajit Ghosh
hi ariel,
if you can precompute affinity between your vectors, you could also try spectral clustering.
cheers,
satra
On Thu, Jan 24, 2013 at 1:18 PM, Ariel Rokem wrote:
> Thanks everyone for chiming in!
>
> For now, I think that I will take Alex's heuristic (?) solution (also
> described here

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread Olivier Grisel
You could speed up your grid search greatly by using an exponential scale for the values of C:
    import numpy as np
    parameters = {"C": np.logspace(0, 4, 5)}
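To make the suggestion concrete, here is a small sketch of what that grid contains. The variable names `c_values` and `n_candidates` are only for illustration; the thread does not show the rest of the grid search setup.

```python
import numpy as np

# Five values of C, evenly spaced on a log scale from 10**0 to 10**4.
parameters = {"C": np.logspace(0, 4, 5)}

# The candidates span four orders of magnitude with only five fits per fold.
c_values = parameters["C"]
n_candidates = len(c_values)

print(c_values)
```

A dictionary like `parameters` is the form that scikit-learn's grid search accepts as its parameter grid; a finer grid around the best value could be searched in a second pass if the steps seem too coarse.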

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread Osman Başkaya
> Even though there is no parameter optimization for Weka, it looks significantly
> better for these data. Is there something I missed?
I am correcting my conclusion: Even though there is no parameter optimization for Weka, it looks significantly better for the first three words. Is there something I missed? I

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread Osman Başkaya
Dear Olivier and Gael,
Thank you guys. Olivier, Do you mean each feature vector sums to 1, right? Yes, and their values range from 0 to 1. These are actually probability distributions.
> You should never use the default settings of a classifier to compare
> scores. Always grid sea

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread Olivier Grisel
2013/1/24 O. B.:
> Sorry, I forgot to mention:
>
> Scikit's Logistic Regression is incredibly fast compared to Weka. Weka's
> implementation (mostly based on this paper) is slow as well as VERY memory
> intensive. Sometimes it wasn't enough to allocate 3 GB as heap size. My
> dataset (words in abo

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Ariel Rokem
Thanks everyone for chiming in! For now, I think that I will take Alex's heuristic (?) solution (also described here: http://www.sci.utah.edu/~weiliu/research/clustering_fmri/Zhong_sphericalKmeans.pdf). I also like CP's projection idea - might work for my kind of data, which has antipodal symmetr

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread Gael Varoquaux
On Thu, Jan 24, 2013 at 08:06:58PM +0200, O. B. wrote:
> Sorry, I forgot to mention:
> Scikit's Logistic Regression is incredibly fast compared to Weka. Weka's
> implementation (mostly based on this paper) is slow as well as VERY memory
> intensive. Sometimes it wasn't enough to allocate 3 GB as h

Re: [Scikit-learn-general] Scikit-learn vs Weka on Logistic Regression

2013-01-24 Thread O. B.
Sorry, I forgot to mention: Scikit's Logistic Regression is incredibly fast compared to Weka. Weka's implementation (mostly based on this paper) is slow as well as VERY memory intensive. Sometimes it wasn't

Re: [Scikit-learn-general] (no subject)

2013-01-24 Thread Olivier Grisel
2013/1/24 Ronnie Ghose:
> I'm primarily thinking svm for which it looks horrible. Actually out of svm,
> trees and logistic regression - log reg is the best. Just to confirm,
> gridsearch uses score which returns R^2 yes? Getting negative scores with
> svm and gridsearch was confusing
For classif

Re: [Scikit-learn-general] (no subject)

2013-01-24 Thread Ronnie Ghose
I'm primarily thinking svm for which it looks horrible. Actually out of svm, trees and logistic regression - log reg is the best. Just to confirm, gridsearch uses score which returns R^2 yes? Getting negative scores with svm and gridsearch was confusing
On Jan 24, 2013 11:51 AM, "Flavio Vinicius"

Re: [Scikit-learn-general] (no subject)

2013-01-24 Thread Flavio Vinicius
I think you can only guarantee that R2 is always positive when performing linear regression with no constraints. Is this your case, or are you using another model? For example, when using a regression forest you cannot guarantee positive R2. Better explanation here: http://stats.stackexchange.com/

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Bertrand Thirion
As alluded to previously, you need to find a way to compute the centroid by minimizing the sum of squared distances to a given set of points within each cluster. However, it is true that re-projecting the Euclidean mean to the sphere would approximate the theoretical solution well in most cases.

Re: [Scikit-learn-general] (no subject)

2013-01-24 Thread Bertrand Thirion
If you use a cross validation scheme, where you estimate the residual variance on left-out data and compare it to the variance of the model with the intercept only, then R^2 can be negative. This approach is an alternative to adjusted R^2 for model selection, and probably makes more sense when
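A tiny numeric sketch of this point, assuming the common definition R^2 = 1 - SS_res / SS_tot with the mean of the held-out targets as the intercept-only baseline. The function name `r2` and the toy arrays are made up for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # intercept-only baseline
    return 1.0 - ss_res / ss_tot


y_test = np.array([1.0, 2.0, 3.0])            # held-out targets (toy data)
baseline_pred = np.full(3, y_test.mean())     # predicting the mean everywhere
bad_pred = np.array([3.0, 2.0, 1.0])          # predictions worse than the mean

score_baseline = r2(y_test, baseline_pred)    # 0.0
score_bad = r2(y_test, bad_pred)              # 1 - 8/2 = -3.0
```

Any model whose held-out predictions are worse than the constant baseline gets a negative score under this definition, which is consistent with the negative grid search scores discussed in this thread.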

Re: [Scikit-learn-general] (no subject)

2013-01-24 Thread Alexandre Gramfort
hi, have a look at the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/metrics.py#L1536
if you feel some wording / doc should be improved/fixed do not hesitate to send a PR.
Best,
Alex

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Lars Buitinck
2013/1/24 Philipp Singer:
> Just another question: If the OVR predicts multiple labels for a sample,
> are they somehow ranked? I know it is just the one vs rest approach, but
> maybe there is some kind of confidence involved. Because then the
> evaluation would be interesting, by looking at ranki

Re: [Scikit-learn-general] (no subject)

2013-01-24 Thread Ronnie Ghose
is it adjusted R^2? The usual R^2 can never be negative afaik http://en.wikipedia.org/wiki/Coefficient_of_determination
On Wed, Jan 23, 2013 at 2:42 PM, Andreas Mueller wrote:
> On 23.01.2013 20:32, Ronnie Ghose wrote:
> > How can _best_score in GridSearchCV be negative? R^2 can only be from

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Stéfan van der Walt
On Jan 24, 2013 5:35 PM, "Charles-Pierre Astolfi" wrote:
> There's no projection that conserves the distance wrt any pair of
> points on the sphere (although there are some that conserve the
> distance wrt 1 or 2 specific points on the sphere)
>
> BUT the gnomonic projection conserves the short

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Charles-Pierre Astolfi
I'm a noob when it comes to data on a sphere, but is there any issue with preprocessing the data to project it on a plane, run kmeans in the plane and then reproject it back on the sphere? There's no projection that conserves the distance wrt any pair of points on the sphere (although there are

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Alexandre Gramfort
hi Ariel, what I would do first, if the data are not too big, is reimplement my kmeans in 10 lines and after you update the centers, normalize them to put them back on the sphere. I don't think you can say much about convergence but it might work well enough in practice. HTH Alex On Thu, Jan 24,
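The heuristic described above could look roughly like the sketch below. It is not Alex's code: the function name `spherical_kmeans`, its parameters, and the toy data are assumptions made for illustration. Points are assigned by largest dot product, which for unit vectors and unit-norm centers picks the same center as the smallest Euclidean distance.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, init=None, seed=0):
    """K-means with centers re-normalized onto the unit sphere (a heuristic sketch)."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # put the data on the sphere
    if init is None:
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centers = np.asarray(init, dtype=float)
        centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    for _ in range(n_iter):
        # Assignment step: nearest center = largest cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue                                # keep the old center
            mean = members.mean(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centers[j] = mean / norm                # back onto the sphere
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers


# Toy data: three points near the north pole, three near the +x axis.
points = np.array([[0, 0, 1], [0.1, 0, 1], [0, 0.1, 1],
                   [1, 0, 0], [1, 0.1, 0], [1, 0, 0.1]], dtype=float)
labels, centers = spherical_kmeans(points, k=2, init=points[[0, 3]])
```

As with ordinary k-means, the result depends on the initial centers, and nothing here says anything about convergence guarantees; the sketch only shows the "update, then normalize" step.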

Re: [Scikit-learn-general] K means on a sphere

2013-01-24 Thread Vince Fernando
Are there any theoretical problems if one uses the great circle (orthodromic) distance on a sphere in k-means or any other clustering algorithm?
vince
On 24 January 2013 07:11, Mathieu Blondel wrote:
> On Thu, Jan 24, 2013 at 9:24 AM, Gael Varoquaux wrote:
> > Yes, there is a massive diffe
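For reference, the great circle distance between two unit vectors is the angle between them, and it is tied to the straight-line (chord) distance by chord = 2 sin(angle / 2). The sketch below only illustrates that relation; the function names are hypothetical and it does not settle the theoretical question.

```python
import numpy as np


def _unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def great_circle_distance(u, v):
    """Angle in radians between two vectors, i.e. arc length on the unit sphere."""
    cos = np.clip(np.dot(_unit(u), _unit(v)), -1.0, 1.0)
    return float(np.arccos(cos))


def chord_distance(u, v):
    """Euclidean distance between the two points on the unit sphere."""
    return float(np.linalg.norm(_unit(u) - _unit(v)))


theta = great_circle_distance([1, 0, 0], [0, 1, 0])   # quarter circle
chord = chord_distance([1, 0, 0], [0, 1, 0])          # straight line through the ball
```

Because the chord length grows monotonically with the angle on [0, pi], assigning a point to its nearest center gives the same result under either distance; the two choices differ in how the center of a cluster is defined and updated.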

Re: [Scikit-learn-general] Using sklearn in Hadoop

2013-01-24 Thread Nick Pentreath
May I suggest you look at Spark (http://spark-project.org/ and https://github.com/mesos/spark). It is written in Scala, has a Java API and the current master branch has the new Python API (0.7.0 release when it happens). I've been doing some testing, including using sklearn together with Spark, an

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Philipp Singer
Yep, I know that. The PR looks promising, will look into it. Just another question: If the OVR predicts multiple labels for a sample, are they somehow ranked? I know it is just the one vs rest approach, but maybe there is some kind of confidence involved. Because then the evaluation would be i

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Joly Arnaud
You should also be aware that the current metrics module doesn't handle multilabels correctly. The following PR https://github.com/scikit-learn/scikit-learn/pull/1606 might interest you. It adds multilabel support for some metrics.
Best regards,
Arnaud Joly
On 23/01/2013 18:44, Andreas Muel

Re: [Scikit-learn-general] Multilabel questions

2013-01-24 Thread Philipp Singer
On 23.01.2013 18:39, Lars Buitinck wrote:
> 2013/1/23 Andreas Mueller:
>> On 23.01.2013 16:47, Philipp Singer wrote:
>>> That's what I originally thought, but then I tried it with just using
>>> LinearSVC and it magically worked for my sample dataset, really
>>> interesting. I think it is work