GitHub user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967

@yanboliang I understand your points. The issue is that `OneHotEncoder` only supports `dropLast`. The ideal solution to match R exactly (both the category dropped and the ordering of feature columns) would be to use `alphabetAsc` in `StringIndexer` and `dropFirst` in `OneHotEncoder`. Without changing `OneHotEncoder`, the best I can do in this PR is to match only the category that R drops. This ensures that the model interpretation and the magnitude of the coefficients are consistent with R, but the ordering of the feature columns is still different, which is a minor issue. That is also why I sorted the coefficients first in the example above when comparing GLM results.

Please let me know if this is clear, and your thoughts on `OneHotEncoder`. If adding `dropFirst` is preferred, I can also update `OneHotEncoder`, but that may cause some disruption. Thanks.
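To make the trade-off concrete, here is a minimal plain-Python sketch. It is not Spark code: the helper `design_columns` and the sample data are hypothetical, and it assumes the workaround is a descending alphabetical index combined with `dropLast`, which is only one reading of the approach described above. It compares that workaround with the R-like combination of `alphabetAsc` and a hypothetical `dropFirst`.

```python
def design_columns(categories, order, drop):
    """Return (kept_columns, reference_category) for a categorical feature.

    order: "alphabetAsc" or "alphabetDesc", i.e. how labels are indexed.
    drop:  "first" or "last", i.e. which indexed category is dropped.
    This is an illustrative stand-in, not an actual Spark API.
    """
    labels = sorted(set(categories), reverse=(order == "alphabetDesc"))
    if drop == "last":
        return labels[:-1], labels[-1]
    if drop == "first":
        return labels[1:], labels[0]
    raise ValueError("drop must be 'first' or 'last'")


data = ["b", "a", "c", "a", "b"]  # made-up sample values

# R-like behaviour: ascending alphabetical index, drop the first level.
r_like = design_columns(data, "alphabetAsc", "first")

# Assumed workaround: descending alphabetical index, drop the last level.
workaround = design_columns(data, "alphabetDesc", "last")

print(r_like)
print(workaround)
```

Under these assumptions both variants drop the same reference category, so the coefficients describe the same contrasts, but the kept columns come out in reverse order. That is why sorting the coefficients is needed before comparing the two fits.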