GitHub user MLnick commented on the issue:
https://github.com/apache/spark/pull/18513
@sethah thanks for reviewing.
_For the first question:_
Yes, currently categorical columns that are numeric need to be explicitly
encoded as strings. I mentioned this as a follow-up improvement. It's easy to
handle; it's just the API that I'm not certain of yet. Here are the two
options I see:
1. User can specify param `categoricalCols` to explicitly set the categorical
columns. But do we then assume that all other columns not in that list that
are strings are also categorical, i.e. that this param is effectively only for
numeric columns that must be treated as categorical? Or do we ignore all other
non-numeric columns?
2. User can specify param `realCols` to explicitly set the numeric columns.
All other columns are treated as categorical.
We could potentially offer both, though I tend to gravitate towards (2)
above, since the default use case will be encoding many (usually high
cardinality) categorical columns, with maybe a few real columns in there.
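
To make option (2) concrete, here is a rough plain-Python sketch, not the code in this PR. The names `hash_features`, `real_cols` and `num_features` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the transformer actually uses. It assumes real columns hash the column name and keep the numeric value, while every other column hashes `column=value` with weight 1.0:

```python
import zlib


def hash_features(row, num_features=16, real_cols=()):
    """Hash one row (dict of column -> value) into a sparse {index: weight} dict.

    Assumption for illustration: columns named in real_cols are numeric,
    every other column is treated as categorical.
    """
    vec = {}
    for col, value in row.items():
        if col in real_cols:
            # Real column: hash the column name, keep the numeric value.
            key, weight = col, float(value)
        else:
            # Categorical column: hash "column=value", indicator weight 1.0.
            key, weight = f"{col}={value}", 1.0
        # crc32 is a deterministic stand-in hash (hypothetical choice).
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


# "zip" is numeric in the data but is hashed as a category,
# because only "price" is listed as a real column.
row = {"zip": 94105, "city": "SF", "price": 2.5}
print(hash_features(row, real_cols=("price",)))
```

With this shape, a numeric column that is really categorical needs no string conversion; it is categorical simply because it is not listed in `real_cols`.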
_For the second question:_
There is no way (at least that I know of) to provide a `dropLast` feature,
since we don't know how many features there are. The whole point of hashing is
to avoid keeping the `feature <-> index` mapping, for speed and memory
efficiency.
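
A small plain-Python sketch of the difference, again not code from this PR: `one_hot_index` and `hashed_index` are hypothetical helpers and `zlib.crc32` is a stand-in hash. An indexed encoder has to see all values and store a mapping, so it knows which category is "last"; a hashed encoder computes the index from the value alone and stores nothing.

```python
import zlib


def one_hot_index(values, drop_last=True):
    """Indexed encoding: builds and keeps a value -> index mapping."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    # The vocabulary size is known, so the last category can be dropped.
    size = len(mapping) - 1 if drop_last else len(mapping)
    return mapping, size


def hashed_index(col, value, num_features=1 << 18):
    """Hashed encoding: stateless, no mapping and no known vocabulary size."""
    # crc32 is a deterministic stand-in hash (hypothetical choice).
    return zlib.crc32(f"{col}={value}".encode("utf-8")) % num_features


mapping, size = one_hot_index(["red", "green", "blue", "red"])
print(mapping, size)                 # mapping is stored; size excludes the last category
print(hashed_index("color", "red"))  # nothing stored; index depends only on the input
```

Recovering a "last" category in the hashed case would mean tracking the distinct values per column, which brings back the mapping that hashing is meant to avoid.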