Thank you for your feedback, Alex!

On 10/02/2018 09:28 AM, Alex Garel wrote:

  * chunk processing (kind of handling streaming data) :  when dealing
    with lot of data, the ability to fit_partial, then use transform
    on chunks of data is of good help. But it's not well exposed in
    current doc and API,

This has been discussed in the past, but it looks like no one was excited enough about it to add it to the roadmap. It would require quite a few additions to the API. Olivier, who was quite interested in this before, now seems more interested in integration with dask, which might achieve the same thing.
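
For context, here is a minimal sketch of the chunked workflow that is already possible today with estimators that implement partial_fit (the actual scikit-learn method name for what the quoted text calls fit_partial). The synthetic data, the chunk size and the iter_chunks helper are assumptions made up for illustration; StandardScaler.partial_fit and SGDClassifier.partial_fit are existing scikit-learn methods. Note that it already needs two passes over the data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a dataset too large to process at once.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)


def iter_chunks(X, y, chunk_size=100):
    """Hypothetical helper: yield consecutive (X, y) chunks."""
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        yield X[start:stop], y[start:stop]


# Pass 1: accumulate the scaling statistics chunk by chunk.
scaler = StandardScaler()
for X_chunk, _ in iter_chunks(X, y):
    scaler.partial_fit(X_chunk)

# Pass 2: transform each chunk and update the classifier incrementally.
# partial_fit needs the full set of classes on its first call.
clf = SGDClassifier(random_state=0)
classes = np.unique(y)
for X_chunk, y_chunk in iter_chunks(X, y):
    clf.partial_fit(scaler.transform(X_chunk), y_chunk, classes=classes)
```

The scaler has to see every chunk before any chunk can be transformed for the classifier, so the loop over the data is written out twice by hand.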

  * and a lot of models do not support it, while they could.

Can you give examples of that?

  * Also pipeline does not support fit_partial and there is not
    fit_transform_partial.

What would you expect those to do? Each step in the pipeline might require passing over the whole dataset multiple times before being able to transform anything. That basically makes it impossible for the pipeline to work with the current interface. Even if only a single pass over the dataset were required, that wouldn't work with the current interface. If we were passing around generators that allow looping over the whole dataset, that would work, but it would be unclear how to support a streaming setting.
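
To make the generator idea concrete, here is a rough pure-Python sketch. Everything in it is hypothetical and not part of scikit-learn: the fit_chunks method, the two toy steps and fit_chunked_pipeline are invented for illustration only. The key assumption is that the data is handed around as a callable that can restart the loop over all chunks, which is exactly what a true one-shot stream cannot offer.

```python
from typing import Callable, Iterator, List

Chunk = List[float]
# Assumption: calling the source restarts a full pass over the data.
ChunkSource = Callable[[], Iterator[Chunk]]


class MeanCenterer:
    """Hypothetical step that needs one full pass to learn the mean."""

    def fit_chunks(self, source: ChunkSource) -> "MeanCenterer":
        total, count = 0.0, 0
        for chunk in source():
            total += sum(chunk)
            count += len(chunk)
        self.mean_ = total / count
        return self

    def transform(self, chunk: Chunk) -> Chunk:
        return [x - self.mean_ for x in chunk]


class MaxAbsRescaler:
    """Hypothetical step that needs one full pass to learn the scale."""

    def fit_chunks(self, source: ChunkSource) -> "MaxAbsRescaler":
        largest = max(abs(x) for chunk in source() for x in chunk)
        self.scale_ = largest or 1.0  # avoid dividing by zero
        return self

    def transform(self, chunk: Chunk) -> Chunk:
        return [x / self.scale_ for x in chunk]


def _transformed(source: ChunkSource, step) -> ChunkSource:
    """Wrap a source so that each new pass yields transformed chunks."""
    def new_source() -> Iterator[Chunk]:
        for chunk in source():
            yield step.transform(chunk)
    return new_source


def fit_chunked_pipeline(steps, source: ChunkSource) -> ChunkSource:
    """Fit each step on the output of the previous ones, one pass per step."""
    current = source
    for step in steps:
        step.fit_chunks(current)
        current = _transformed(current, step)
    return current


data = [[1.0, 2.0], [3.0, 4.0], [5.0]]
passes = {"count": 0}


def raw_source() -> Iterator[Chunk]:
    passes["count"] += 1  # count how often the raw data is re-read
    return iter(data)


final_source = fit_chunked_pipeline([MeanCenterer(), MaxAbsRescaler()], raw_source)
output = list(final_source())
```

With two steps, the raw data is read three times here: once to fit each step and once more to produce the transformed output. That works for data that can be re-read from disk, but not for a stream that can only be consumed once.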

  * while handling "Passing around information that is not (X, y)", is
    there any plan to have transform being able to transform X and y ?
    This would ease lots of problems like subsampling, resampling or
    masking data when too incomplete.

An API for subsampling is on the roadmap :)


