Hi Team,

I tried to run Ignite ML on a dataset with categorical features and
ran into some problems.

My dataset is the Mushrooms
<https://www.kaggle.com/uciml/mushroom-classification> dataset from Kaggle.
It has only categorical features and categorical labels, so it is a
classification problem. You can find my attempt in my repo
<https://github.com/dehasi/mushrooms/blob/master/src/main/java/me/dehasi/mushrooms/MushroomsMain.java>.

My goal is to build a pipeline that takes raw string values, encodes them
as numbers, and then trains a model.
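
To make the encoding step concrete, here is a minimal sketch in plain Java with no Ignite API involved. The class name ColumnStringEncoder is hypothetical, and assigning indices in first-seen order is my assumption for illustration; it is not necessarily how STRING_ENCODER assigns them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps each string value to a per-column index. */
final class ColumnStringEncoder {
    /** One dictionary per column: raw value -> index in first-seen order. */
    private final List<Map<String, Integer>> dictionaries = new ArrayList<>();

    /** Learns the dictionaries from the raw rows. */
    void fit(List<String[]> rows) {
        for (String[] row : rows) {
            while (dictionaries.size() < row.length)
                dictionaries.add(new LinkedHashMap<>());

            for (int col = 0; col < row.length; col++) {
                Map<String, Integer> dict = dictionaries.get(col);
                // A new value gets the next free index for its column.
                dict.putIfAbsent(row[col], dict.size());
            }
        }
    }

    /** Encodes one raw row into numbers; unseen values are rejected. */
    double[] transform(String[] row) {
        if (row.length > dictionaries.size())
            throw new IllegalArgumentException("Row has more columns than were fitted");

        double[] encoded = new double[row.length];
        for (int col = 0; col < row.length; col++) {
            Integer idx = dictionaries.get(col).get(row[col]);
            if (idx == null)
                throw new IllegalArgumentException(
                    "Unseen value '" + row[col] + "' in column " + col);
            encoded[col] = idx;
        }
        return encoded;
    }
}
```

The resulting double[] is the kind of numeric row a trainer could consume after it is wrapped in a vector.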

The first problem is the Vectorizer.

I started with DummyVectorizer, but it supports only Double labels.

All other vectorizers have the same issue, because all of them inherit
from DefaultLabelVectorizer
<https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/dataset/feature/extractor/ExtractionUtils.java#L36>,
where Double labels are hardcoded in the generic type parameters.

I didn’t find a way to work with purely categorical data using the standard
Ignite vectorizers, so I wrote my own.

The second problem is EncoderTrainer (in my case, STRING_ENCODER).

It doesn’t encode labels; the trainer just ignores them. See
EncoderTrainer
<https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L169>.

Ignoring labels probably makes sense for a feature encoder, but it leaves
string labels unencoded.

The third problem is a ClassCastException.

There are casts of labels to Double, “hidden” from the user, in model
trainers, e.g. SVMLinearClassificationTrainer
<https://github.com/apache/ignite/blob/master/modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearClassificationTrainer.java#L191>,
DiscreteNaiveBayesTrainer, etc.
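
The failure mode can be reproduced without Ignite. The sketch below is plain Java; the class and method names are hypothetical and only mimic a trainer that assumes numeric labels. The cast compiles, but it fails at runtime as soon as the label is a String.

```java
/** Hypothetical stand-in for a trainer that assumes numeric labels. */
final class LabelCastDemo {
    /** Compiles for any Object, but only works when the label really is a Double. */
    static double asDouble(Object label) {
        return (Double) label;
    }

    /** Returns true if treating the label as a Double throws ClassCastException. */
    static boolean failsWithClassCast(Object label) {
        try {
            asDouble(label);
            return false;
        }
        catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // "e" is a sample string label; 1.0 is autoboxed to a Double.
        System.out.println("String label fails: " + failsWithClassCast("e"));
        System.out.println("Double label fails: " + failsWithClassCast(1.0));
    }
}
```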

Feel free to use my regex \(Double\).*\.label\(\) to search for other casts.
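
For anyone who wants to run the search from code rather than from an IDE, here is a small sketch using java.util.regex. The sample source lines are made up for illustration and are not copied from Ignite; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper that flags source lines casting a label to Double. */
final class DoubleLabelCastFinder {
    /** The regex from this message, escaped for a Java string literal. */
    static final Pattern CAST = Pattern.compile("\\(Double\\).*\\.label\\(\\)");

    /** Returns true if the line contains a (Double) cast followed by .label(). */
    static boolean hasLabelCast(String sourceLine) {
        return CAST.matcher(sourceLine).find();
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to exercise the pattern.
        List<String> lines = Arrays.asList(
            "double lb = (Double) row.label();",
            "Vector features = row.features();",
            "String lb = (String) row.label();");

        for (String line : lines)
            System.out.println(hasLabelCast(line) + "  " + line);
    }
}
```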

To sum up, there are assumptions throughout that labels are numeric
values, but in a classification problem labels can be anything, and
I didn’t find an easy way to preprocess them.
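
The kind of label preprocessing I have in mind could look roughly like the sketch below. It is plain Java and does not use any Ignite API; the class name LabelIndexer is hypothetical, and numbering classes 0.0, 1.0, ... in first-seen order is an assumption. Which numeric values a particular trainer actually expects would still have to be checked.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps string class labels to doubles and back. */
final class LabelIndexer {
    /** Raw label -> numeric label, in first-seen order. */
    private final Map<String, Double> toIndex = new LinkedHashMap<>();

    /** Returns the numeric label, assigning the next free index to a new label. */
    double encode(String label) {
        Double idx = toIndex.get(label);
        if (idx == null) {
            idx = (double) toIndex.size();
            toIndex.put(label, idx);
        }
        return idx;
    }

    /** Maps a numeric label (e.g. a model prediction) back to the raw string. */
    String decode(double index) {
        for (Map.Entry<String, Double> e : toIndex.entrySet()) {
            if (e.getValue().doubleValue() == index)
                return e.getKey();
        }
        throw new IllegalArgumentException("Unknown label index: " + index);
    }

    public static void main(String[] args) {
        LabelIndexer labels = new LabelIndexer();
        // "p" and "e" are sample string labels.
        System.out.println(labels.encode("p"));
        System.out.println(labels.encode("e"));
        System.out.println(labels.decode(1.0));
    }
}
```

With something like this applied before training, the casts to Double would no longer see a String.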



If you have any questions or need details, feel free to write to me.

Best regards,

Ravil
