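Hi Team,

I tried to run Ignite ML on a dataset with categorical features and ran into some problems.

My dataset is the Mushrooms <https://www.kaggle.com/uciml/mushroom-classification> dataset from Kaggle. It has only categorical features and categorical labels (a classification problem). You can find my attempt in my repo <https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java> . My goal is to build a pipeline that takes raw string values, encodes them as numbers, and then trains a model.

The first problem is the Vectorizer. I started with DummyVectorizer, but it supports only Double labels. All the other vectorizers have the same issue, because all of them inherit from DefaultLabelVectorizer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36> , where Double labels are hardcoded at the generic level. I didn’t find a way to work with purely categorical data using the standard Ignite vectorizers, so I wrote my own.

The second problem is EncoderTrainer (in my case with STRING_ENCODER). It doesn’t encode labels; the trainer just ignores them. See EncoderTrainer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169> . Ignoring labels probably makes sense there, but it means string labels stay unencoded.

The third problem is a ClassCastException. There are casts of labels to Double that are “hidden” from the user inside model trainers, e.g. SVMLinearClassificationTrainer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191> , DiscreteNaiveBayesTrainer, etc. Feel free to use my regex \(Double\).*\.label\(\) to search for other casts.

To sum up, the code assumes that labels are numeric values, but in a classification problem labels can be anything, and I didn’t find an easy way to preprocess them.

If you have any questions or need details, feel free to write to me.

Best regards,
Ravil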
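
P.S. To illustrate the kind of label preprocessing discussed above, here is a minimal sketch in plain Java. It is not Ignite ML code and not the code from the linked repo: LabelIndexer and its methods fit, encode and decode are hypothetical names, and the sketch assumes it is acceptable to number the classes 0.0, 1.0, ... in order of first appearance. The idea is to turn string labels into doubles before they reach a trainer that casts labels to Double, and to map predictions back afterwards.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not an Ignite ML class): maps string class labels to double codes. */
final class LabelIndexer {
    /** Label -> code, kept in order of first appearance. */
    private final Map<String, Double> index = new LinkedHashMap<>();

    private LabelIndexer() {
    }

    /** Assigns 0.0, 1.0, ... to the distinct labels in order of first appearance. */
    static LabelIndexer fit(List<String> labels) {
        LabelIndexer indexer = new LabelIndexer();
        for (String label : labels)
            indexer.index.putIfAbsent(label, (double) indexer.index.size());
        return indexer;
    }

    /** Returns the numeric code of a label seen during fit. */
    double encode(String label) {
        Double code = index.get(label);
        if (code == null)
            throw new IllegalArgumentException("Unknown label: " + label);
        return code;
    }

    /** Returns the original label for a numeric code, e.g. for a model prediction. */
    String decode(double code) {
        for (Map.Entry<String, Double> entry : index.entrySet()) {
            if (entry.getValue() == code)
                return entry.getKey();
        }
        throw new IllegalArgumentException("Unknown code: " + code);
    }

    public static void main(String[] args) {
        // "p" and "e" are the class values used in the mushroom dataset.
        LabelIndexer indexer = fit(Arrays.asList("p", "e", "e", "p"));
        System.out.println(indexer.encode("e"));
        System.out.println(indexer.decode(0.0));
    }
}
```

With something like this, a custom vectorizer could call encode when it builds the label of each row, so the trainers only ever see Double labels, and decode could be applied to the model output. Whether a given trainer expects particular numeric values for the classes is a separate question that this sketch does not address.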
My dataset is Mushrooms <https://www.kaggle.com/uciml/mushroom-classification> dataset from Kaggle. There are only categorial features and categorical labels. (so-called classification problem). My attempt you can find in my repo <https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java> . My goal is to make a pipeline which takes raw string values, encodes them to numbers, then train a model. The first problem is the Vectorizer. I started with DummyVectorizer but it supports only Double labels. All other vectorizers have the same issue because all of them are inherited from DefaultLabelVectorizer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36> where Double labels are hardcoded at the generic level. I didn’t find an approach to work with only categorical data with standard Ignite vectorizers. I wrote my own. The second problem. EncoderTrainer (in my case STRING_ENCODER). It doesn’t encode labels. The trainer just ignores labels. See EncoderTrainer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169> . Probably ignoring labels makes sense, but… The third problem. ClassCastException. There are “hidden” (for user) casts labels to Double in model trainers i.e. SVMLinearClassificationTrainer <https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191>, DiscreteNaiveBayesTrainer etc. Feel free to use my regex \(Double\).*\.label\(\) to search other casts. To sum up, I can say that there are assumptions that labels are numeric values, but if we solve a classification problem, labels can be whatever. But I didn’t find an easy way to preprocess them. If you have any question or need details, feel free to write to me. Best regards, Ravil