Re: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?

2020-05-06 Thread Joel Nothman
When it comes to trees, the API for handling categoricals is simpler than the implementation. Traditionally, tree-based models handle categorical variables differently from both ordinal and one-hot encoding, although both of those work reasonably well for many problems. We are working on implem
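To make the difference between the three approaches concrete, here is a rough sketch (plain Python, not scikit-learn's implementation) that enumerates the two-way splits a single tree node could consider for one categorical feature under each scheme. The helper names `ordinal_splits`, `one_hot_splits`, and `subset_splits` are made up for illustration, and the sketch assumes the ordinal encoding follows the order of the input list.

```python
from itertools import combinations


def ordinal_splits(categories):
    """Threshold splits after ordinal encoding (codes follow list order)."""
    return [
        (frozenset(categories[:i]), frozenset(categories[i:]))
        for i in range(1, len(categories))
    ]


def one_hot_splits(categories):
    """One-vs-rest splits after one-hot encoding (one per indicator column)."""
    all_cats = frozenset(categories)
    return [(frozenset([c]), all_cats - {c}) for c in categories]


def subset_splits(categories):
    """Every partition into two non-empty subsets (subset-style splitting)."""
    all_cats = frozenset(categories)
    first, rest = categories[0], categories[1:]
    splits = []
    # Pin the first category to the left side so mirror images are not
    # counted twice; r < len(rest) keeps the right side non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = frozenset((first,) + extra)
            splits.append((left, all_cats - left))
    return splits


colors = ["red", "green", "blue", "yellow"]
print(len(ordinal_splits(colors)))  # 3 = k - 1
print(len(one_hot_splits(colors)))  # 4 = k
print(len(subset_splits(colors)))   # 7 = 2**(k - 1) - 1
```

With k categories, a single node sees k - 1 candidate splits under ordinal encoding, k under one-hot encoding, and 2**(k - 1) - 1 when arbitrary subsets are allowed. Deeper trees can recover more groupings from the encoded forms, which is one reason both encodings often do fine in practice.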

2020-05-06 Thread Fernando Marcos Wittmann
That's an excellent discussion! I've always wondered how, or whether, other tools like R handle naturally categorical variables. LightGBM has a scikit-learn API that handles categorical features when given their column names (or indexes): ``` import lightgbm lgb=lightgbm.LGBMClassifier() lgb.fit(*X*