Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Mathieu Blondel
On Fri, Jun 21, 2013 at 6:56 AM, Nicolas Trésegnie < nicolas.treseg...@gmail.com> wrote: > >- To impute only some of the missing values (rows, columns or a >combination) > > I think this can be added later if you have time. For now, I would rather not clutter the API. For rows, one can jus

Re: [Scikit-learn-general] SVM: select the training set randomly

2013-06-20 Thread Bilal Dadanlar
you can have a look at "sklearn.cross_validation.train_test_split()" and some other methods from here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation On Fri, Jun 21, 2013 at 3:59 AM, Joel Nothman wrote: > Please see > http://scikit-learn.org/stable/tutorial/
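The splitting that train_test_split performs can be sketched with plain NumPy. This is a simplified illustration, not the actual implementation; the real helper also accepts a fractional test_size and keeps multiple arrays aligned:

```python
import numpy as np

# Toy data: 100 samples with 5 features each, plus known labels
X = np.arange(500).reshape(100, 5)
y = np.arange(100) % 2

# Shuffle indices, then split 80/20 (mimics what train_test_split does)
rng = np.random.RandomState(0)
idx = rng.permutation(100)
train_idx, test_idx = idx[:80], idx[80:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```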

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Mathieu Blondel
> So are you proposing that sparse matrices encode missing values with nan? Or something else? There are different scenarios: 1) fewer missing values than non-missing values 2) many more missing values than non-missing values (recommendation system setting) 3) fewer missing values than non-missin

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 3:17 PM, Joel Nothman wrote: > On Fri, Jun 21, 2013 at 3:07 PM, Mathieu Blondel wrote: > >> >> In my research, I have used CSR matrices, since they can be processed >> very efficiently in Cython. Since CSR matrices are indeed non-trivial to >> construct, we could provide a utility function which takes an iterator of >> (feature_index

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 3:36 PM, Lars Buitinck wrote: > > Ah, right. I missed the other email. It's unfortunate that > scipy.sparse doesn't allow any other value than 0 to be omitted. > I'm sure a PR would be welcomed :P ---

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Lars Buitinck
2013/6/21 Mathieu Blondel : > Dense formats like masked arrays or arrays with missing-values encoded by > NaN won't work for recommender datasets (unless you can fit an n_users x > n_items dense matrix in memory). So, we do need to find a suitable sparse > format to work with, since the second half

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
> > And I agree that sparse input should probably not be handled (for now?)... > If for no other reason than that for sparse X you can't currently do things > neatly like X[X == np.nan] = 1 (but you probably will be able to do it > after this GSOC). > I guess you can do X.data[X.data == np.nan] =
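Joel's suggestion can be made to work, with one caveat: `X.data == np.nan` is always False, because NaN never compares equal to itself, so `np.isnan` is needed. A small sketch with SciPy, assuming the missing values are stored as explicit NaN entries of the sparse matrix:

```python
import numpy as np
import scipy.sparse as sp

# NaN entries are "stored" values in the sparse matrix (NaN != 0)
X = sp.csr_matrix(np.array([[1.0, 0.0], [np.nan, 3.0]]))

# X.data[X.data == np.nan] would match nothing; use np.isnan instead
X.data[np.isnan(X.data)] = 1.0
```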

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 3:07 PM, Mathieu Blondel wrote: > > In my research, I have used CSR matrices, since they can be processed very > efficiently in Cython. Since CSR matrices are indeed non-trivial to > construct, we could provide a utility function which takes an iterator of > (feature_index

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Mathieu Blondel
On Fri, Jun 21, 2013 at 1:28 PM, Lars Buitinck wrote: > Besides, scipy.sparse is hard to update in-place, is a very wasteful > representation for dense data and is harder to work with than np.array > (for us, but more importantly for users). > Dense formats like masked arrays or arrays with miss

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 2:28 PM, Lars Buitinck wrote: > > Besides, scipy.sparse is hard to update in-place, is a very wasteful > representation for dense data and is harder to work with than np.array > (for us, but more importantly for users). > And it can't be masked..? > > Currently, -1 is us

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Lars Buitinck
2013/6/21 Joel Nothman : > As long as the representation of unknown values is known (be it a particular > value, or use of a masked array), writing a Transformer should be pretty > straightforward, but I don't understand why you need extra arguments to > transform (which you imply by linking to #19

Re: [Scikit-learn-general] SVM: select the training set randomly

2013-06-20 Thread Joel Nothman
Please see http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html On Fri, Jun 21, 2013 at 10:31 AM, Gianni Iannelli wrote: > Dear All, > > I have one question. I have a dataset of 100 vector each with some > features. Of this 100 I already know the classification of a

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
I'm not certain I've understood all of your suggestions. I assume we can consider a simple example like: * X has some unknown values for some features * they should be filled with the mean of the known values for those features As long as the representation of unknown values is known (be it a par
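The simple example above (fill each unknown with the per-feature mean of the known values) can be sketched in plain NumPy, assuming NaN is the missing-value marker:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

col_means = np.nanmean(X, axis=0)   # mean of the known values per column
rows, cols = np.where(np.isnan(X))  # positions of the unknowns
X[rows, cols] = col_means[cols]     # fill each with its column's mean
```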

[Scikit-learn-general] SVM: select the training set randomly

2013-06-20 Thread Gianni Iannelli
Dear All, I have one question. I have a dataset of 100 vectors, each with some features. For all 100 I already know the classification. What I want to do is randomly select a subset of these 100 to use as a training set and the rest as a test set. Is there something already implemented in

[Scikit-learn-general] Shade/tint a segment

2013-06-20 Thread Brickle Macho
I over-segment an image using a superpixel algorithm. I then region-grow using the superpixels to end up with a segmented image, a label image. I overlay the label boundaries using mark_boundaries(). I can click on a segment/region and mark it as either foreground or background. This foreg

[Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Nicolas Trésegnie
Hi, A part of my job for the GSoC is to discuss with you an interface for data imputation. This topic is strongly related to the issue #1963 and this mail

Re: [Scikit-learn-general] Time for GSoC 2013!

2013-06-20 Thread Kemal Eren
Hi all, Thanks, Nicolas, for starting the introductions. I would also like to introduce my GSOC project. I'll be adding biclustering capabilities to scikit-learn: some well-known algorithms such as Spectral Biclustering, Cheng and Church, and BiMax, as well as scoring metrics and utilities for g

Re: [Scikit-learn-general] Time for GSoC 2013!

2013-06-20 Thread Nicolas Trésegnie
Hi everybody, First of all, I would like to thank the people who participated in the selection process. I know the choice wasn't easy and I'm glad to have been selected to participate in the Google Summer of Code. As Vlad said, I'll work on the matrix completion problem. I first wanted to wo

Re: [Scikit-learn-general] Random forest with zero features

2013-06-20 Thread Olivier Grisel
The error message could indeed be improved, but this is a pathological case anyway. I would rather make the grid search fault-tolerant instead of making all the scikit-learn estimators accept invalid inputs (such as an empty dataset). -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [Scikit-learn-general] Random forest with zero features

2013-06-20 Thread Michal Romaniuk
Here is an example script:

import numpy
from sklearn import ensemble

y = numpy.random.random_integers(0, 1, 100)
X = numpy.zeros((100, 0))
rf = ensemble.RandomForestClassifier()
rf.fit(X, y)

Michal

> I am not sure to understand. Please provide a minimalistic > reproduction script (10 lines max) and

Re: [Scikit-learn-general] Random forest with zero features

2013-06-20 Thread Olivier Grisel
2013/6/20 Michal Romaniuk : > What is the default behaviour for random forests with zero features? It > seems to me that it just gives an error (although I'm not 100% sure if > that's the cause). This is a problem when using a feature selection step > and searching a grid for a good feature selecti

[Scikit-learn-general] Random forest with zero features

2013-06-20 Thread Michal Romaniuk
What is the default behaviour for random forests with zero features? It seems to me that it just gives an error (although I'm not 100% sure if that's the cause). This is a problem when using a feature selection step and searching a grid for a good feature selection parameter. Occasionally there mig

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Peter Prettenhofer
You can try an ordinal encoding instead: just map each categorical value to an integer, so that you end up with 8 numerical features. If you use enough trees and grow them deep, it may work. 2013/6/20 Maheshakya Wijewardena > And yes Gilles, It is the Amazon challenge :D > > > On Thu, Jun 20, 20
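Peter's ordinal-encoding suggestion amounts to replacing each category with an integer code, which np.unique can do in one call (a sketch with made-up data, applied column by column):

```python
import numpy as np

# One hypothetical categorical column
col = np.array(["red", "green", "blue", "green"])

# categories: sorted unique values; codes: integer id per row
categories, codes = np.unique(col, return_inverse=True)
```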

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
And yes Gilles, It is the Amazon challenge :D On Thu, Jun 20, 2013 at 8:21 PM, Maheshakya Wijewardena < pmaheshak...@gmail.com> wrote: > The shape of X after encoding is (32769, 16600). Seems as if that is too > big to be converted into a dense matrix. Can Random forest handle this > amount of f

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
The shape of X after encoding is (32769, 16600). Seems as if that is too big to be converted into a dense matrix. Can Random forest handle this amount of features? On Thu, Jun 20, 2013 at 7:31 PM, Olivier Grisel wrote: > 2013/6/20 Lars Buitinck : > > 2013/6/20 Olivier Grisel : > >>> Actually twi

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
2013/6/20 Lars Buitinck : > 2013/6/20 Gilles Louppe : >> This looks like the dataset from the Amazon challenge currently >> running on Kaggle. When one-hot-encoded, you end up with roughly >> 15000 binary features, which means that the dense representation >> requires at least 32000*15000*4 bytes

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
2013/6/20 Lars Buitinck : > 2013/6/20 Olivier Grisel : >>> Actually twice as much, even on a 32-bit platform (float size is >>> always 64 bits). >> >> The decision tree code always uses 32-bit floats: >> >> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38 >> >> b

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Joel Nothman
So Maheshakya's `toarray` might work with `X.astype(np.float32).toarray('F')`... (But by "might work" I mean won't throw a ValueError...) On Thu, Jun 20, 2013 at 11:56 PM, Olivier Grisel wrote: > 2013/6/20 Lars Buitinck : > > 2013/6/20 Gilles Louppe : > >> This looks like the dataset from the Am

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Lars Buitinck
2013/6/20 Olivier Grisel : >> Actually twice as much, even on a 32-bit platform (float size is >> always 64 bits). > > The decision tree code always uses 32-bit floats: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38 > > but you have to cast your data to `dt
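The cast Lars and Olivier are discussing can be sketched as follows; whether it actually halves memory in a given pipeline depends on whether intermediate float64 copies are created along the way:

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.eye(3))    # data stored as float64 by default
X32 = X.astype(np.float32)      # halves the size of the data array
dense = X32.toarray(order='F')  # Fortran-ordered dense array
```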

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Lars Buitinck
2013/6/20 Gilles Louppe : > This looks like the dataset from the Amazon challenge currently > running on Kaggle. When one-hot-encoded, you end up with roughly > 15000 binary features, which means that the dense representation > requires at least 32000*15000*4 bytes to hold in memory (or even twice

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
What is the cardinality of each feature? -- This SF.net email is sponsored by Windows: Build for Windows Store. http://p.sf.net/sfu/windows-dev2dev ___ Scikit-learn-general mai

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Gilles Louppe
Hi, This looks like the dataset from the Amazon challenge currently running on Kaggle. When one-hot-encoded, you end up with roughly 15000 binary features, which means that the dense representation requires at least 32000*15000*4 bytes to hold in memory (or even twice as much, depending on your
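Gilles' back-of-the-envelope figure works out to roughly 1.9 GB for the dense one-hot matrix, and double that at float64 precision:

```python
n_samples, n_features = 32000, 15000

bytes_float32 = n_samples * n_features * 4  # ~1.9 GB
bytes_float64 = n_samples * n_features * 8  # ~3.8 GB at double precision
```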

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Joel Nothman
Hi Maheshakya, It's probably right: your feature space is too big and sparse to be reasonable for random forests. What sort of categorical data are you encoding? What is the shape of the matrix after applying one-hot encoding? If you need to use random forests, and not a method that natively hand

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Olivier Grisel
2013/6/20 Maheshakya Wijewardena : > The shape is (32769, 8). There are 8 categorical variables before applying > OneHotEncoding. And what is the shape after? -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel --

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
The shape is (32769, 8). There are 8 categorical variables before applying OneHotEncoding. On Thu, Jun 20, 2013 at 5:43 PM, Peter Prettenhofer < peter.prettenho...@gmail.com> wrote: > > Hi, > > seems like your sparse matrix is too large to be converted to a dense > matrix. What shape does X have

Re: [Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Peter Prettenhofer
Hi, seems like your sparse matrix is too large to be converted to a dense matrix. What shape does X have? How many categorical variables do you have (before applying the OneHotTransformer)?

[Scikit-learn-general] Using Random forest classifier after One hot encoding

2013-06-20 Thread Maheshakya Wijewardena
Hi, I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to encode my training and test data. After encoding, I tried to train a Random forest classifier using that data, but I get the following error when fitting (here is the error trace): model.fit(X_train, y_train)