Re: [Scikit-learn-general] Trees with unbalanced classes

2013-07-12 Thread Sergey Feldman
Thanks, Manish! Exactly what I was looking for. On Fri, Jul 12, 2013 at 4:52 PM, Manish Amde wrote: > Hi Sergey, > > There is a sample_weights option (not very well documented) in the random > forest classifier that might help. You might want to check out the SVC > example to see the sample_we

Re: [Scikit-learn-general] Question about using sample weights to fit an svm

2013-07-12 Thread Anne Dwyer
Peter, I tried your suggestion. But my training error with sample weights is still not the same as without sample weights. It seems like I am missing something here. It doesn't seem to work for me. Anne Dwyer On Fri, Jul 12, 2013 at 5:19 PM, Peter Prettenhofer < peter.prettenho...@gmail.com> wr

Re: [Scikit-learn-general] Trees with unbalanced classes

2013-07-12 Thread Manish Amde
Hi Sergey, There is a sample_weights option (not very well documented) in the random forest classifier that might help. You might want to check out the SVC example to see the sample_weights format. http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html You can provide diffe
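
A minimal sketch of the per-sample weighting idea in the reply above, assuming inverse class-frequency weights. This is a common heuristic and not necessarily the scheme Manish had in mind, since the message is truncated. The helper name `inverse_frequency_weights` and the toy labels are hypothetical. In scikit-learn, such weights are passed through the `sample_weight` argument (singular) of `fit`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


# Toy, made-up labels: class "a" is four times more frequent than class "b".
y = ["a", "a", "a", "a", "b"]
weights = inverse_frequency_weights(y)

# After weighting, each class carries the same total weight.
total_a = sum(w for w, label in zip(weights, y) if label == "a")
total_b = sum(w for w, label in zip(weights, y) if label == "b")
print(weights, total_a, total_b)

# With scikit-learn installed, the weights could then be used like:
#   RandomForestClassifier().fit(X, y, sample_weight=weights)
```

Rare classes get proportionally larger weights, so a misclassified sample from a small class costs more during training.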

[Scikit-learn-general] Trees with unbalanced classes

2013-07-12 Thread Sergey Feldman
I'm dealing with a 50-class classification problem with extremely unbalanced classes. The smallest class has about 1000 samples and the largest has 500,000. The random forest I've trained is being heavily skewed towards the big classes. Is there a good way to deal with this kind of problem in sk

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Olivier Grisel
2013/7/12 Lars Buitinck : > 2013/7/12 Antonio Manuel Macías Ojeda : >> I'm not sure how you are using it but something to take into account is that >> the default NLTK tokenizer is meant to be used on sentences, not on whole >> paragraphs or documents, so it should operate on the output of a senten

Re: [Scikit-learn-general] Question about using sample weights to fit an svm

2013-07-12 Thread Peter Prettenhofer
try float(len(y_train)) - seems like C default is int... On 13.07.2013 at 00:10, "Anne Dwyer" wrote: > Peter, > > Thanks for your answers. When I scale C by len(y_train), I get the > following error: > > ValueError: C <= 0 > > Anne Dwyer > > > On Fri, Jul 12, 2013 at 3:34 PM, Peter Prettenhofer < >
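
The `ValueError: C <= 0` quoted above is consistent with integer division: under Python 2, dividing an int `C` by the int `len(y_train)` truncates to 0. The sketch below illustrates that effect, assuming `C = 1` and 208 training labels (the sonar data set size given in the original question). The `//` operator is used to reproduce Python 2's int/int behaviour on Python 3.

```python
C = 1                     # assumed integer value of C
y_train = [0, 1] * 104    # placeholder labels, 208 samples

scaled_int = C // len(y_train)           # Python 2 style int / int
scaled_float = C / float(len(y_train))   # the float() fix suggested above

print(scaled_int)     # truncated to zero, which trips the C <= 0 check
print(scaled_float)   # small but strictly positive
```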

Re: [Scikit-learn-general] Question about using sample weights to fit an svm

2013-07-12 Thread Anne Dwyer
Peter, Thanks for your answers. When I scale C by len(y_train), I get the following error: ValueError: C <= 0 Anne Dwyer On Fri, Jul 12, 2013 at 3:34 PM, Peter Prettenhofer < peter.prettenho...@gmail.com> wrote: > Hi Anne, > > I would also expect that using uniform weights should result in th

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Antonio Manuel Macías Ojeda
Yeah it's definitely not built with speed as its design goal. Good patch! On Fri, Jul 12, 2013 at 1:45 PM, Lars Buitinck wrote: > 2013/7/12 Antonio Manuel Macías Ojeda : > > I'm not sure how you are using it but something to take into account is > that > > the default NLTK tokenizer is meant t

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Lars Buitinck
2013/7/12 Antonio Manuel Macías Ojeda : > I'm not sure how you are using it but something to take into account is that > the default NLTK tokenizer is meant to be used on sentences, not on whole > paragraphs or documents, so it should operate on the output of a sentence > tokenizer not on the raw t

Re: [Scikit-learn-general] Question about using sample weights to fit an svm

2013-07-12 Thread Peter Prettenhofer
2013/7/12 Peter Prettenhofer > Hi Anne, > > I would also expect that using uniform weights should result in the same > solution as no weights -- but maybe there is an interaction with the C > parameter... for this we would need to know more about the internals of > libsvm and how it handles sampl

Re: [Scikit-learn-general] Question about using sample weights to fit an svm

2013-07-12 Thread Peter Prettenhofer
Hi Anne, I would also expect that using uniform weights should result in the same solution as no weights -- but maybe there is an interaction with the C parameter... for this we would need to know more about the internals of libsvm and how it handles sample weights - try scaling C by ``len(y_train

[Scikit-learn-general] Question about using sample weights to fit an svm

2013-07-12 Thread Anne Dwyer
I have been using the sonar data set (I believe this is a sample data set used in many demonstrations of machine learning). It is a two-class data set with 60 features and 208 training examples. I have a question about using sample weights in fitting the SVM model. When I fit the model using sc

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Antonio Manuel Macías Ojeda
Hi! > I found that about 75% of > the time was spent in MiniBatchKMeans.fit, while the rest of it was > spent inside nltk.word_tokenize (!) > I'm not sure how you are using it but something to take into account is that the default NLTK tokenizer is meant to be used on sentences, not on whole par

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Fred Mailhot
On 12 July 2013 09:48, Lars Buitinck wrote: > 2013/7/11 Tom Fawcett : > [...] > > I guess because it's terribly slow. I recently tried to cluster a > sample of Wikipedia text at the word level. What kind of results did you get? I did some work recently clustering short-form text and was general

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Olivier Grisel
2013/7/12 Lars Buitinck : > 2013/7/11 Tom Fawcett : >>> On Sun, Jul 7, 2013 at 6:58 AM, Joel Nothman >>> wrote: >>> (But I'm also not convinced that NLTK is the right tool for a lot of >>> large-scale feature extraction jobs.) >> >> I’m curious – why? > > I guess because it's terribly slow. I re

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Lars Buitinck
2013/7/11 Tom Fawcett : >> On Sun, Jul 7, 2013 at 6:58 AM, Joel Nothman >> wrote: >> (But I'm also not convinced that NLTK is the right tool for a lot of >> large-scale feature extraction jobs.) > > I’m curious – why? I guess because it's terribly slow. I recently tried to cluster a sample of W

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Gael Varoquaux
On Fri, Jul 12, 2013 at 09:06:03AM +0200, Andreas Mueller wrote: > > Structured prediction in sklearn was one of the outcomes from the survey. > > Would it be a better idea to send people to pystruct, rather than > > implement it here? > I think so. I think so too. > We decided that structured p

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Olivier Grisel
2013/7/12 Hakan : > Unfortunately it's not pretty straight forward as you > said... The error message was: TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] It is completely straightforward. It says that the object you are dealing with is a sparse matrix as written in the docum

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Hakan
On Fri, 12 Jul 2013 17:59:29 +0200 Olivier Grisel wrote: >> X_train=X_in >> y_train=y_in >> X_test=X_in >> y_test=y_in > > This is a methodological mistake: you should never use >the same data > for training and testing a model. Instead use: > > from sklearn.cross_validation import train_test
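
A short sketch of the hold-out split recommended in the quoted advice. It assumes a current scikit-learn release, where `train_test_split` lives in `sklearn.model_selection` (the `sklearn.cross_validation` module named in the 2013 message has since been removed). The 25% test size, the linear kernel, and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = SVC(kernel="linear").fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)   # accuracy on unseen data only
print(X_train.shape, X_test.shape, test_accuracy)
```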

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Hakan
Unfortunately it's not as straightforward as you said... I have made the changes Mathieu and you mentioned, but loading the feature set into an array with "X=X.toarray()" doesn't respond immediately to run any example with libsvm datasets. Please have a look at the following code...decision boundary

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Olivier Grisel
2013/7/12 Hueseyin Hakan Pekmezci : > Hi scikit-learn members, > > 0.13.1 documentation states that individual datasets can > be loaded in svmlight / libsvm format. So I have fed in > "iris.scale" libSVM dataset however some erroneous > behaviour happens. I am just trying to reproduce > "plot_iris

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Andreas Mueller
On 07/12/2013 05:14 PM, Hakan wrote: > as you see initially I was loading the iris data exactly > like example. But being able to work for individual > datasets, I needed to give it a libSVM try. Is there any > piece of code, example to point out its smooth integration > with scikit-learn? I mean s

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Hakan
As you see, initially I was loading the iris data exactly like the example. But to be able to work with individual datasets, I needed to give libSVM a try. Is there any piece of code or example that shows its smooth integration with scikit-learn? I mean some svm classifier example with svmlight_l

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Hakan
Initially I tried the one you mentioned, but I hit the following barrier. Then I started to reconsider: maybe there is a problem with the libSVM reading... Traceback (most recent call last): File "linsvm.py", line 48, in <module> pl.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=pl.cm.Paired

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Andreas Mueller
Hi. If you just want the iris dataset, you can get it using "datasets.load_iris()" (and scale it with StandardScaler). The problem in your code is that load_svmlight_file returns X as a sparse matrix. You need to convert it to an nd-array if you want to use the example using X.toarray(). (I thin
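
The point about `load_svmlight_file` returning a sparse matrix can be sketched as follows. The two rows of data are made up for illustration, and an in-memory `BytesIO` stands in for a real file such as iris.scale; `zero_based=False` is an assumption that feature indices in the file start at 1.

```python
from io import BytesIO

from scipy.sparse import issparse
from sklearn.datasets import load_svmlight_file

# Two made-up rows in svmlight / libsvm format: "<label> <index>:<value> ..."
raw = b"1 1:0.5 3:1.2\n2 2:0.7\n"

# zero_based=False: feature indices in the data start at 1.
X, y = load_svmlight_file(BytesIO(raw), zero_based=False)

print(issparse(X))       # the loader returns a SciPy sparse matrix
X_dense = X.toarray()    # dense ndarray, usable with X[:, 0]-style plotting code
print(X_dense.shape, y)
```

Converting with `toarray()` is reasonable for small data sets like iris; for large, high-dimensional data the dense copy may not fit in memory.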

Re: [Scikit-learn-general] libsvm data support

2013-07-12 Thread Mathieu Blondel
Well the error message says it all: you cannot use len on a sparse matrix. Instead of len(X), use X.shape[0]. Mathieu On Fri, Jul 12, 2013 at 11:35 PM, Hueseyin Hakan Pekmezci < pekme...@rhrk.uni-kl.de> wrote: > Hi scikit-learn members, > > 0.13.1 documentation states that individual datasets ca
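
A small sketch of the fix described above, using a toy SciPy sparse matrix rather than the poster's data: `len()` raises `TypeError`, while `shape[0]` gives the number of rows.

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.eye(4))   # toy 4 x 4 sparse matrix

try:
    len(X)
    len_failed = False
except TypeError as exc:
    len_failed = True
    print("len(X) failed:", exc)

n_samples = X.shape[0]      # row count, as suggested in the reply
print(n_samples)
```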

[Scikit-learn-general] libsvm data support

2013-07-12 Thread Hueseyin Hakan Pekmezci
Hi scikit-learn members, 0.13.1 documentation states that individual datasets can be loaded in svmlight / libsvm format. So I have fed in "iris.scale" libSVM dataset however some erroneous behaviour happens. I am just trying to reproduce "plot_iris_exercise.py" with iris.scale(http://www.csi

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Olivier Grisel
skstruct? In French it translates to "c'est quoi ce truc?" :) -- Olivier

Re: [Scikit-learn-general] Improving Text Classification

2013-07-12 Thread Nigel Legg
I'm coming at this from a market research point of view (that's my background). There seem to be a number of opportunities there for classification, clustering, and regression analysis tools, so I am building - or rather attempting to build - tools with the aim that they will go on the web, and peo

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Mathieu Blondel
On Fri, Jul 12, 2013 at 4:06 PM, Andreas Mueller wrote: > > About naming it scikit-struct: is there any requirement to become a scikit? > Also: is there much benefit - pandas seems to be doing quite well > without the brand ;) > My suggestion was half a joke :). But I find it a little bit disappo

Re: [Scikit-learn-general] Improving Text Classification

2013-07-12 Thread Ian Ozsvald
Hi Nigel. I see you're in the UK, I'm based east of you in London. My goal with the disambiguator is to provide a well documented pipeline such that it can be easily retrained. I have a notion that in the future I'll host a version of my code production-ready under my http://annotate.io/ , ready f

Re: [Scikit-learn-general] Improving Text Classification

2013-07-12 Thread Ian Ozsvald
Hi Harold. Are you using different models for the different types of social media? I'd guess that the grammar/terms used in a tweet could look quite different to what you see in e.g. a Google+ Comment (different demographic->probably higher quality English, less space restrictions->longer/clearer w

Re: [Scikit-learn-general] Paris Sprint location

2013-07-12 Thread Mathieu Blondel
On Fri, Jul 12, 2013 at 7:01 PM, Gilles Louppe wrote: > Otherwise, on my part, I plan to complete PR #2131 if it is not yet > merged in by the time of the sprint, and then address tree-related > issues/PRs that have been lying around for months now. Also, if > someone has a special request for th

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Peter Prettenhofer
2013/7/12 Andreas Mueller > On 07/12/2013 01:26 AM, Robert Layton wrote: > > Structured prediction in sklearn was one of the outcomes from the survey. > > Would it be a better idea to send people to pystruct, rather than > > implement it here? > > > I think so. We decided that structured predicti

Re: [Scikit-learn-general] Paris Sprint location

2013-07-12 Thread Gilles Louppe
> - discuss with the tree growers guys on how to best parallelize > random forest trainings on multi-core without copying the training set > in memory >- either with threads in joblib and "with nogil" statements in the > inner loops of the (new) cython code >- or with shared memory

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Olivier Grisel
2013/7/12 Andreas Mueller : > On 07/12/2013 09:23 AM, Vlad Niculae wrote: >> The requirements are definitely the blocking thing here. Not just the >> dependency on cvxopt but also the inference packages and the fact they >> need to be built manually. The api is sklearn-ish enough even with >> list

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Andreas Mueller
On 07/12/2013 09:23 AM, Vlad Niculae wrote: > The requirements are definitely the blocking thing here. Not just the > dependency on cvxopt but also the inference packages and the fact they > need to be built manually. The api is sklearn-ish enough even with > lists-of-lists. > The API, yes, but th

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Vlad Niculae
The requirements are definitely the blocking thing here. Not just the dependency on cvxopt but also the inference packages and the fact they need to be built manually. The api is sklearn-ish enough even with lists-of-lists. On Fri, Jul 12, 2013 at 10:06 AM, Andreas Mueller wrote: > On 07/12/201

Re: [Scikit-learn-general] Pystruct website and mailing list

2013-07-12 Thread Andreas Mueller
On 07/12/2013 01:26 AM, Robert Layton wrote: > Structured prediction in sklearn was one of the outcomes from the survey. > Would it be a better idea to send people to pystruct, rather than > implement it here? > I think so. We decided that structured prediction was out of scope for sklearn, right