Hi Nicolas,

You are right, I am just checking this in the source code.
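Since the thread hinges on these two behaviours, here is a minimal sketch (with toy data invented for illustration) of what you describe: ColumnTransformer imputing different columns with different strategies, and the imputers storing the statistics learnt on the train set and reusing them on unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data for illustration: a numeric and a categorical column, both with NaNs.
train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["London", np.nan, "Paris", "London"],
})
test = pd.DataFrame({
    "age": [np.nan, 25.0],
    "city": [np.nan, "Madrid"],
})

# Point 1: different imputation strategies for different columns.
ct = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["city"]),
])

# Point 2: the imputers learn on the training set only...
ct.fit(train)

# ...and the values learnt from train (median age 30.0, mode "London")
# are perpetuated to unseen data.
out = ct.transform(test)
print(out)  # first test row is filled with 30.0 and "London"
```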
Sorry for the confusion, and thanks for the quick response.

Cheers,
Sole

On Wed, 10 Apr 2019 at 18:43, Nicolas Goix <goix.nico...@gmail.com> wrote:

> Hi Sole,
>
> I'm not sure the two limitations you mentioned are correct.
> 1) In your example, using the ColumnTransformer you can impute different
> values for different columns.
> 2) The sklearn transformers do learn on the training set and are able to
> perpetuate the values learnt from the train set to unseen data.
>
> Nicolas
>
> On Wed, Apr 10, 2019, 18:25 Sole Galli <solegal...@gmail.com> wrote:
>
>> Dear Scikit-Learn team,
>>>
>>> Feature engineering is a big task ahead of building machine learning
>>> models. It involves imputation of missing values, encoding of categorical
>>> variables, discretisation, variable transformation, etc.
>>>
>>> Sklearn includes some functionality for feature engineering, which is
>>> useful, but it has a few limitations:
>>>
>>> 1) It does not allow for feature specification - it applies the same
>>> process to all variables, for example SimpleImputer. Typically, we want
>>> to impute different columns with different values.
>>> 2) It does not capture information from the training set, that is, it
>>> does not learn; therefore, it is not able to perpetuate the values learnt
>>> from the train set to unseen data.
>>>
>>> The two limitations above apply to all the feature transformers in
>>> sklearn, I believe.
>>>
>>> Therefore, if these transformers are used as part of a pipeline, we
>>> could end up doing different transformations to train and test, depending
>>> on the characteristics of the datasets. For business purposes, this is not
>>> a desired option.
>>>
>>> I think that building transformers that learn from the train set would
>>> be of much use for the community.
>>>
>>> To this end, I built a Python package called feature engine
>>> <https://pypi.org/project/feature-engine/>, which expands the
>>> sklearn API with additional feature engineering techniques, and with the
>>> functionality that allows the transformers to learn from data and store the
>>> parameters learnt.
>>>
>>> The techniques included have been used worldwide, both in business and
>>> in data competitions, and reported in KDD reports and other articles. I
>>> also cover them in a udemy course
>>> <https://www.udemy.com/feature-engineering-for-machine-learning>, which
>>> has enrolled several thousand students.
>>>
>>> The package capitalises on the use of pandas to capture the features,
>>> but I am confident that the column names could be captured and the df
>>> transformed to a numpy array to comply with sklearn requirements.
>>>
>>> I wondered whether it would be of interest to include the functionality
>>> of this package within sklearn?
>>> If you would consider extending the sklearn API to include these
>>> transformers, I would be happy to help.
>>>
>>> Alternatively, would you consider adding the package to your website,
>>> where you mention the libraries that extend sklearn functionality?
>>>
>>> All feedback is welcome.
>>>
>>> Many thanks, and I look forward to hearing from you.
>>>
>>> Thank you so much for such an awesome contribution through the sklearn
>>> API.
>>>
>>> Kind regards,
>>>
>>> Sole
>>>
>>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn