[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?
[ https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249331#comment-15249331 ]

Apache Spark commented on SPARK-14478:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12519

> Should StandardScaler use biased variance to scale?
> ---------------------------------------------------
>
>                 Key: SPARK-14478
>                 URL: https://issues.apache.org/jira/browse/SPARK-14478
>             Project: Spark
>          Issue Type: Question
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> Currently, MLlib's StandardScaler scales columns using the corrected standard
> deviation (sqrt of unbiased variance). This matches what R's scale function
> does.
> However, it is a bit odd for 2 reasons:
> * Optimization/ML algorithms which require scaled columns generally assume
> unit variance (for mathematical convenience). That requires using biased
> variance.
> * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance.
> *Question*: Should we switch to biased?
> *Decision*: No. Document what we do, and possibly add support for biased
> later on.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
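
For reference, the difference between the two variances in the issue description can be written out as follows. This is a worked note, not part of the JIRA thread; the symbols n, x_i, s and \hat{\sigma} are introduced here for illustration only.

```latex
% Column x_1, ..., x_n with mean \bar{x}.
\begin{align*}
  s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
      && \text{(unbiased, "corrected" variance)} \\
  \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
      = \frac{n-1}{n}\, s^2
      && \text{(biased variance)}
\end{align*}
% Scaling by the corrected standard deviation, z_i = x_i / s, gives
\begin{align*}
  \text{unbiased variance of } z &= \frac{s^2}{s^2} = 1, &
  \text{biased variance of } z &= \frac{\hat{\sigma}^2}{s^2} = \frac{n-1}{n}.
\end{align*}
% Scaling by the biased standard deviation, z_i = x_i / \hat{\sigma}, gives
\begin{align*}
  \text{biased variance of } z &= 1, &
  \text{unbiased variance of } z &= \frac{s^2}{\hat{\sigma}^2} = \frac{n}{n-1}.
\end{align*}
```

The two scalings differ by a factor of sqrt(n / (n - 1)), so the gap matters mostly for small n and vanishes as n grows.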
[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?
[ https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248825#comment-15248825 ]

Joseph K. Bradley commented on SPARK-14478:
-------------------------------------------

Adding a param seems reasonable, though probably pretty low priority.

To make a judgement call... how about we leave it as is for now? I'll send a PR to document that it's using unbiased variance. If any user ever needs biased, then we can add the Param (but I've never heard anyone except myself complain).

> Should StandardScaler use biased variance to scale?
> ---------------------------------------------------
>
>                 Key: SPARK-14478
>                 URL: https://issues.apache.org/jira/browse/SPARK-14478
>             Project: Spark
>          Issue Type: Question
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>
> Currently, MLlib's StandardScaler scales columns using the unbiased standard
> deviation. This matches what R's scale function does.
> However, it is a bit odd for 2 reasons:
> * Optimization/ML algorithms which require scaled columns generally assume
> unit variance (for mathematical convenience). That requires using biased
> variance.
> * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance.
> *Question*: Should we switch to biased?
[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?
[ https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232468#comment-15232468 ]

Yanbo Liang commented on SPARK-14478:
-------------------------------------

Should we add a param that controls whether to use biased or unbiased variance in StandardScaler?
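
The param Yanbo Liang suggests could behave roughly like the flag in the sketch below. This is plain Python for illustration, not Spark code: the function names `column_std` and `scale_column` and the flag `unbiased` are hypothetical and do not correspond to any existing StandardScaler API. The sketch only divides by the standard deviation (no centering), and mapping a constant column to zeros is a choice made here, not something taken from the thread.

```python
import math


def column_std(values, unbiased=True):
    """Standard deviation of one column.

    unbiased=True divides the sum of squared deviations by n - 1
    (the corrected standard deviation); unbiased=False divides by n.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    return math.sqrt(sum_sq / (n - 1 if unbiased else n))


def scale_column(values, unbiased=True):
    """Divide every value by the column's standard deviation.

    The hypothetical `unbiased` flag selects which standard deviation is used.
    """
    std = column_std(values, unbiased)
    if std == 0.0:
        # Constant column: this sketch returns zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [v / std for v in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Scaled with the corrected (unbiased) standard deviation, as the issue
# description says StandardScaler currently does.
scaled_unbiased = scale_column(column, unbiased=True)

# Scaled with the biased standard deviation, as the issue description says
# scikit-learn and glmnet do.
scaled_biased = scale_column(column, unbiased=False)

# Biased std of each result: exactly unit variance only in the second case.
print(column_std(scaled_unbiased, unbiased=False))
print(column_std(scaled_biased, unbiased=False))
```

With a flag like this, the current behavior would stay the default and callers who need unit biased variance could opt in.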
[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?
[ https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231532#comment-15231532 ]

Joseph K. Bradley commented on SPARK-14478:
-------------------------------------------

I'm listing this as "Major" priority since it is a behavioral change and would be good to decide before 2.0.