I'm having trouble getting decision forests to work with categorical
features. I have a dataset with a categorical feature with 40 values.
It seems to be treated as a continuous/numeric value by the
implementation.

Digging deeper, I see logic in the code indicating that a categorical
feature with N values does not work unless the number of bins is at
least 2*((2^N - 1) - 1). I understand this as the naive brute-force
condition, wherein the decision tree tests all possible splits of the
categorical values.

However, this becomes unusable quickly: the number of bins should be
in the tens or hundreds at most, so this requirement rules out
categorical features with more than 10 or so values. But, of course,
it's not unusual to have categorical features with high cardinality.
It's almost common.
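
To make the growth concrete, here is a minimal sketch that simply
evaluates the bin formula as quoted above, taken at face value. The
function names are hypothetical and are not part of MLlib; the bin
budgets of 100 and 1000 are illustrative assumptions.

```python
def required_bins(n_values: int) -> int:
    """Bin count from the formula as quoted in the post: 2*((2^N - 1) - 1)."""
    return 2 * ((2 ** n_values - 1) - 1)


def max_cardinality(bin_budget: int) -> int:
    """Largest N whose required bin count still fits within bin_budget."""
    n = 1
    while required_bins(n + 1) <= bin_budget:
        n += 1
    return n


# Show how fast the requirement grows with the number of category values.
for n in (2, 5, 10, 20, 40):
    print(f"N={n:>2}: {required_bins(n):,} bins")

# Illustrative budgets (assumptions): "tens or hundreds" of bins.
for budget in (100, 1000):
    print(f"budget={budget}: N <= {max_cardinality(budget)}")
```

Under this formula, a budget of a few hundred bins only accommodates
features with a single-digit number of values, and a 40-value feature
would need on the order of 2^41 bins.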

There are some pretty fine heuristics for selecting 'bins' over
categorical features when the number of bins is far fewer than the
complete, exhaustive set.
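
As an illustration, one well-known heuristic of this kind (an
assumption on my part that it is representative; it is not taken from
the MLlib code) is to sort the category values by their mean label and
then consider only the N-1 splits that respect that order, instead of
every possible subset. A minimal sketch, with a hypothetical function
name and toy data:

```python
from collections import defaultdict


def ordered_category_splits(categories, labels):
    """Sort category values by mean label, then return the N-1 candidate
    left-hand subsets formed by cutting that ordering at each position."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, labels):
        totals[c] += y
        counts[c] += 1
    # Order category values by their average label (ascending).
    ordered = sorted(totals, key=lambda c: totals[c] / counts[c])
    # Each prefix of the ordering is one candidate split.
    return [set(ordered[:i]) for i in range(1, len(ordered))]


# Toy data (assumption): three category values with numeric labels.
cats = ["a", "b", "c", "a", "b", "c"]
ys = [1.0, 0.0, 0.5, 1.0, 0.0, 0.5]
for left in ordered_category_splits(cats, ys):
    print(sorted(left))
```

With this approach a 40-value feature yields 39 candidate splits, which
fits comfortably within tens or hundreds of bins.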

Before I open a JIRA or continue, does anyone know what I am talking
about, or am I mistaken? Is this a real limitation, and is it worth
pursuing these heuristics? I can't figure out how to proceed with
decision forests in MLlib otherwise.
