GitHub user actuaryzhang commented on the issue:
https://github.com/apache/spark/pull/17967
@yanboliang I understand your points. The issue is that `OneHotEncoder` only
supports `dropLast`.
The ideal solution to match R exactly (both the category dropped and the
ordering of feature columns) would be to use `alphabetAsc` in `StringIndexer`
and `dropFirst` in `OneHotEncoder`.
Without changing `OneHotEncoder`, the best I can do in this PR is to match
only the category that is dropped in R. This ensures that the model
interpretation and the magnitude of the coefficients are consistent with R, but
the ordering of the feature columns is still different, which is a minor issue.
That is also why I sorted the coefficients first in the example above before
comparing the GLM results.
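To make the difference concrete, here is a small plain-Python sketch (it does not
use Spark). The helper names `index_labels` and `encode_columns` are hypothetical,
`dropFirst` does not exist in `OneHotEncoder` today, and the use of a descending
alphabetical order to get the same dropped category with `dropLast` is my
assumption about how the matching would work:

```python
def index_labels(labels, order):
    """Return distinct labels in index order (hypothetical stand-in for StringIndexer)."""
    distinct = sorted(set(labels))
    if order == "alphabetAsc":
        return distinct
    if order == "alphabetDesc":
        return distinct[::-1]
    raise ValueError(f"unsupported order: {order}")


def encode_columns(ordered_labels, drop):
    """Return the labels that keep a dummy column (hypothetical stand-in for OneHotEncoder)."""
    if drop == "last":
        return ordered_labels[:-1]
    if drop == "first":  # assumed option, not available in OneHotEncoder
        return ordered_labels[1:]
    raise ValueError(f"unsupported drop: {drop}")


labels = ["b", "a", "c", "a"]

# R-like behaviour: ascending order, reference (first) level dropped.
r_like = encode_columns(index_labels(labels, "alphabetAsc"), "first")

# Achievable with dropLast only: descending order, last level dropped.
drop_last_only = encode_columns(index_labels(labels, "alphabetDesc"), "last")

print(r_like)          # columns for b, c
print(drop_last_only)  # columns for c, b
```

Both variants drop category `a`, so the coefficients have the same meaning, but
the columns come out in reverse order, which is why the coefficients need to be
sorted before comparing.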
Please let me know if this is clear, and share your thoughts on `OneHotEncoder`.
If adding a `dropFirst` option is preferred, I can also update `OneHotEncoder`,
but that may cause some disruption. Thanks.