Hi Nicolas,

You are right; I am just checking this in the source code now.
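For the archive, here is a minimal sketch of the behaviour Nicolas describes (the data, column indices, and fill values below are my own illustration, not from the thread):

```python
# Sketch of Nicolas's two points:
# 1) ColumnTransformer can apply a different imputer to each column.
# 2) SimpleImputer learns its statistics in fit() and reuses them on unseen data.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])

# Column 0 is imputed with the training mean, column 1 with a constant.
ct = ColumnTransformer([
    ("mean_col0", SimpleImputer(strategy="mean"), [0]),
    ("const_col1", SimpleImputer(strategy="constant", fill_value=-1.0), [1]),
])
ct.fit(X_train)

# On unseen data, the values learnt from the training set are applied:
X_test = np.array([[np.nan, np.nan]])
print(ct.transform(X_test))  # column 0 -> train mean (2.0), column 1 -> -1.0
```

So the per-column behaviour and the train-to-test propagation are both already there; the fit/transform split is exactly what carries the learnt parameters forward.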

Sorry for the confusion, and thanks for the quick response.

Cheers

Sole

On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nico...@gmail.com> wrote:

> Hi Sole,
>
> I'm not sure the 2 limitations you mentioned are correct.
> 1) In your example, using the ColumnTransformer you can impute different
> values for different columns.
> 2) The sklearn transformers do learn on the training set and are able to
> perpetuate the values learnt from the training set to unseen data.
>
> Nicolas
>
> On Wed, Apr 10, 2019, 18:25 Sole Galli <solegal...@gmail.com> wrote:
>
>> Dear Scikit-Learn team,
>>>
>>> Feature engineering is a major task that precedes building machine
>>> learning models. It involves imputing missing values, encoding categorical
>>> variables, discretisation, variable transformation, etc.
>>>
>>> Sklearn includes some functionality for feature engineering, which is
>>> useful, but it has a few limitations:
>>>
>>> 1) It does not allow for feature specification: it applies the same
>>> process to all variables, for example SimpleImputer. Typically, we want
>>> to impute different columns with different values.
>>> 2) It does not capture information from the training set, that is, it
>>> does not learn; therefore, it is not able to perpetuate the values learnt
>>> from the train set to unseen data.
>>>
>>> The 2 limitations above apply to all the feature transformers in
>>> sklearn, I believe.
>>>
>>> Therefore, if these transformers are used as part of a pipeline, we
>>> could end up applying different transformations to the train and test
>>> sets, depending on the characteristics of each dataset. For business
>>> purposes, this is not desirable.
>>>
>>> I think that building transformers that learn from the train set would
>>> be of much use for the community.
>>>
>>> To this end, I built a Python package called Feature-engine
>>> <https://pypi.org/project/feature-engine/> which expands the
>>> sklearn API with additional feature engineering techniques, and the
>>> functionality that allows the transformers to learn from data and store
>>> the parameters learnt.
>>>
>>> The techniques included have been used worldwide, both in business and
>>> in data competitions, and reported in KDD reports and other articles. I
>>> also cover them in a Udemy course
>>> <https://www.udemy.com/feature-engineering-for-machine-learning> which
>>> has enrolled several thousand students.
>>>
>>> The package capitalises on pandas to capture the features, but I am
>>> confident that the column names could be captured and the DataFrame
>>> transformed to a NumPy array to comply with sklearn requirements.
>>>
>>> I wondered whether it would be of interest to include the functionality
>>> of this package within sklearn?
>>> If you would consider extending the sklearn API to include these
>>> transformers, I would be happy to help.
>>>
>>> Alternatively, would you consider adding the package to the section of
>>> your website where you mention the libraries that extend sklearn
>>> functionality?
>>>
>>> All feedback is welcome.
>>>
>>> Many thanks, and I look forward to hearing from you.
>>>
>>> Thank you so much for such an awesome contribution through the sklearn
>>> API.
>>>
>>> Kind regards
>>>
>>> Sole
>>>
>>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
