[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

actuaryzhang Mon, 22 May 2017 08:11:56 -0700

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @yanboliang I understand your points. The issue is that `OneHotEncoder` only 
supports `dropLast`. 
    The ideal solution to match R exactly (both the category dropped and the 
ordering of feature columns) would be to use `alphabetAsc` in `StringIndexer` and 
`dropFirst` in `OneHotEncoder`. 
    
    Without changing `OneHotEncoder`, the best I can do in this PR is to match 
only the category that is dropped in R. This ensures that the model 
interpretation and the magnitude of the coefficients are consistent with R, but the 
ordering of the feature columns is still different, which is a minor issue. 
That's also why I sorted the coefficients first in the example above when comparing 
GLM results. 
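    
    To illustrate the point about dropped category versus column ordering, here 
is a minimal plain-Python sketch. It does not call Spark; `index_labels` and 
`kept_columns` are hypothetical helpers that only mimic the indexing and dropping 
behavior, and `"first"` stands in for the `dropFirst` option that `OneHotEncoder` 
does not currently have. It assumes that descending alphabetical indexing combined 
with `dropLast` is how the R baseline category gets dropped.

```python
def index_labels(labels, order):
    """Assign an index to each distinct label.

    order: "alphabetAsc" or "alphabetDesc" (illustrative stand-ins for the
    string-ordering choices discussed in this thread).
    """
    distinct = sorted(set(labels), reverse=(order == "alphabetDesc"))
    return {label: i for i, label in enumerate(distinct)}


def kept_columns(labels, order, drop):
    """Return the categories that keep a feature column, in column order.

    drop: "last" mimics dropLast; "first" mimics the hypothetical dropFirst.
    """
    mapping = index_labels(labels, order)
    ordered = sorted(mapping, key=mapping.get)
    return ordered[:-1] if drop == "last" else ordered[1:]


labels = ["b", "a", "c", "a"]

# R-style: ascending order, drop the first (baseline) category "a".
r_like = kept_columns(labels, "alphabetAsc", "first")

# What is possible with dropLast only: descending order, drop the last
# category, which is again "a".
with_drop_last = kept_columns(labels, "alphabetDesc", "last")

print(r_like)          # ['b', 'c']
print(with_drop_last)  # ['c', 'b']
```

    Both variants drop `a`, so the coefficients refer to the same baseline, but 
the remaining columns come out in reverse order, which is why the coefficients 
need to be sorted before comparing.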
    
    Please let me know if this is clear, and share your thoughts on `OneHotEncoder`. If 
adding a `dropFirst` option is preferred, I can also update `OneHotEncoder`, but that 
may cause some disruption. Thanks.



