Re: [Scikit-learn-general] Interface for data imputation

2013-06-30 Thread Nicolas Trésegnie
On 06/21/2013 08:52 AM, Mathieu Blondel wrote: I think I like calling it transform() but one concern is how this will play in the matrix-factorization based matrix completion object. Similarly to PCA or other matrix factorization algorithms from sklearn.decomposition, transform() could also be u

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Mathieu Blondel
On Fri, Jun 21, 2013 at 6:56 AM, Nicolas Trésegnie < nicolas.treseg...@gmail.com> wrote: > >- To impute only some of the missing values (rows, columns or a >combination) > > I think this can be added later if you have time. For now, I would rather not clutter the API. For rows, one can jus

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Mathieu Blondel
> So are you proposing that sparse matrices encode missing values with nan? Or something else? There are different scenarios: 1) fewer missing values than non-missing values 2) many more missing values than non-missing values (recommendation system setting) 3) fewer missing values than non-missin

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 3:17 PM, Joel Nothman wrote: > On Fri, Jun 21, 2013 at 3:07 PM, Mathieu Blondel wrote: > >> >> In my research, I have used CSR matrices, since they can be processed >> very efficiently in Cython. Since CSR matrices are indeed non-trivial to >> construct, we could provide an

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 3:36 PM, Lars Buitinck wrote: > > Ah, right. I missed the other email. It's unfortunate that > scipy.sparse doesn't allow any other value than 0 to be omitted. > I'm sure a PR would be welcomed :P

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Lars Buitinck
2013/6/21 Mathieu Blondel : > Dense formats like masked arrays or arrays with missing-values encoded by > NaN won't work for recommender datasets (unless you can fit an n_users x > n_items dense matrix in memory). So, we do need to find a suitable sparse > format to work with, since the second half

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
> > And I agree that sparse input should probably not be handled (for now?)... > If for no other reason than that for sparse X you can't currently do things > neatly like X[np.isnan(X)] = 1 (but you probably will be able to do it > after this GSoC). > I guess you can do X.data[np.isnan(X.data)] =
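A note on the NaN test discussed in this message: NaN never compares equal to anything, itself included, so a mask built with `== np.nan` selects nothing, and `np.isnan` is the reliable test. The sketch below illustrates this on a dense array and on the `.data` array of a CSR matrix. The example matrix and the fill value of 1 are illustrative assumptions, not taken from the thread.

```python
import numpy as np
from scipy import sparse

X = np.array([[1.0, np.nan, 0.0],
              [0.0, 2.0, np.nan]])

# NaN is not equal to anything, including itself, so this mask is all False.
assert not (X == np.nan).any()

# Dense case: np.isnan builds the intended boolean mask.
X_dense = X.copy()
X_dense[np.isnan(X_dense)] = 1.0

# Sparse case: NaN is not zero, so it is stored explicitly when the CSR
# matrix is built from the dense array, and it can be replaced through .data.
X_sparse = sparse.csr_matrix(X)
X_sparse.data[np.isnan(X_sparse.data)] = 1.0
```

This only touches explicitly stored entries; implicit zeros in the sparse matrix are left alone, which is why the choice of encoding for missing values matters in the rest of the thread.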

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 3:07 PM, Mathieu Blondel wrote: > > In my research, I have used CSR matrices, since they can be processed very > efficiently in Cython. Since CSR matrices are indeed non-trivial to > construct, we could provide a utility function which takes an iterator of > (feature_index
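The quoted proposal is cut off in this archive, so the exact input format is not known. As a rough sketch only, assume each sample is an iterable of (feature_index, value) pairs; the function name `csr_from_pairs` and its `n_features` parameter are hypothetical and do not correspond to any scikit-learn API.

```python
import numpy as np
from scipy import sparse


def csr_from_pairs(samples, n_features):
    """Build a CSR matrix from per-sample (feature_index, value) pairs.

    Hypothetical helper: the input format is an assumption, since the
    original proposal is truncated.
    """
    data, indices, indptr = [], [], [0]
    for pairs in samples:
        for feature_index, value in pairs:
            indices.append(feature_index)
            data.append(value)
        # indptr[i + 1] marks where row i ends in data/indices.
        indptr.append(len(indices))
    return sparse.csr_matrix(
        (np.asarray(data, dtype=float),
         np.asarray(indices, dtype=np.int32),
         np.asarray(indptr, dtype=np.int32)),
        shape=(len(indptr) - 1, n_features),
    )


# Three samples (e.g. users); only the observed entries are listed.
samples = [[(0, 5.0), (2, 3.0)], [], [(1, 4.0)]]
R = csr_from_pairs(samples, n_features=3)
```

Entries that are not listed end up as implicit zeros, which is exactly the ambiguity the thread is debating: in a recommender setting they would have to be read as "missing" rather than as the value 0.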

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Mathieu Blondel
On Fri, Jun 21, 2013 at 1:28 PM, Lars Buitinck wrote: > Besides, scipy.sparse is hard to update in-place, is a very wasteful > representation for dense data and is harder to work with than np.array > (for us, but more importantly for users). > Dense formats like masked arrays or arrays with miss

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
On Fri, Jun 21, 2013 at 2:28 PM, Lars Buitinck wrote: > > Besides, scipy.sparse is hard to update in-place, is a very wasteful > representation for dense data and is harder to work with than np.array > (for us, but more importantly for users). > And it can't be masked..? > > Currently, -1 is us

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Lars Buitinck
2013/6/21 Joel Nothman : > As long as the representation of unknown values is known (be it a particular > value, or use of a masked array), writing a Transformer should be pretty > straightforward, but I don't understand why you need extra arguments to > transform (which you imply by linking to #19

Re: [Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Joel Nothman
I'm not certain I've understood all of your suggestions. I assume we can consider a simple example like: * X has some unknown values for some features * they should be filled with the mean of the known values for those features As long as the representation of unknown values is known (be it a par
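The simple example described in this message (fill each feature's unknown values with the mean of its known values) can be sketched as a transformer whose `transform` takes only `X`. This is a minimal illustration under stated assumptions: missing values are encoded as NaN in a dense array, and the class name `MeanImputer` and attribute `means_` are hypothetical, not the interface that was eventually adopted.

```python
import numpy as np


class MeanImputer:
    """Hypothetical mean imputer; assumes dense input with NaN as 'missing'."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-feature mean over the known (non-NaN) values only.
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # work on a copy, leave the input intact
        rows, cols = np.nonzero(np.isnan(X))
        X[rows, cols] = self.means_[cols]
        return X


X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 6.0]])
imputer = MeanImputer().fit(X_train)
X_filled = imputer.transform(X_train)
```

A feature with no known values at all would get a NaN mean here, so a real implementation would need a policy for that case.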

[Scikit-learn-general] Interface for data imputation

2013-06-20 Thread Nicolas Trésegnie
Hi, A part of my job for the GSoC is to discuss with you an interface for data imputation. This topic is strongly related to the issue #1963 and this mail