Hi everyone, I wanted to use the scikit-learn transformer API to clean up some messy data as input to a neural network. One of the steps involves converting categorical variables (of very high cardinality) into integers for use in an embedding layer.
Unfortunately, I cannot quite use LabelEncoder to solve this. When dealing with categorical variables of very high cardinality, I found it useful in practice to have a frequency threshold below which a category ends up with the 'unk' or 'rare' label. This same label would also be applied at test time to categories that were not observed in the train set. This is relatively straightforward to add to the existing label encoder code, but it breaks the contract slightly: if we encode some categories with a 'rare' label, then the transform operation is no longer a bijection. Is this feature too niche for the main sklearn? I saw there was a package ( https://feature-engine.readthedocs.io/en/latest/RareLabelCategoricalEncoder.html) that implemented a similar feature discussed in the mailing list.
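For concreteness, here is a minimal sketch of what I have in mind, written against the scikit-learn estimator API. The class name `RareLabelEncoder` and the `min_freq` parameter are my own made-up names, not anything that exists in sklearn; code 0 is reserved for the shared rare/unknown label:

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RareLabelEncoder(BaseEstimator, TransformerMixin):
    """Encode a 1-D categorical column as integers, mapping categories whose
    training frequency falls below `min_freq` -- as well as categories unseen
    at fit time -- to a single shared 'rare' code (0).

    Note: hypothetical sketch, not an existing sklearn class.
    """

    def __init__(self, min_freq=10):
        self.min_freq = min_freq

    def fit(self, y):
        counts = Counter(y)
        # Code 0 is reserved for rare/unknown; frequent categories get 1..K.
        frequent = sorted(c for c, n in counts.items() if n >= self.min_freq)
        self.mapping_ = {cat: i + 1 for i, cat in enumerate(frequent)}
        return self

    def transform(self, y):
        # Any category not in the learned mapping (rare at fit time, or never
        # seen at all) collapses to code 0 -- hence no longer a bijection.
        return np.array([self.mapping_.get(v, 0) for v in y])
```

With `min_freq=2`, fitting on `["a", "a", "b", "b", "c"]` keeps `a` and `b` but collapses `c` (seen once) to the rare code, and an unseen `d` at transform time gets the same code 0.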
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn