[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231097#comment-15231097 ]

Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:01 AM:
--------------------------------------------------------------------

Note on StandardScaler: MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses the biased sample std. R's [scale function | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std. I'm used to seeing the biased sample std in ML, probably because it is handy for proofs to know that columns have L2 norm 1.
* [~mengxr] reports that glmnet uses the biased sample std.
* *Q*: Should we change StandardScaler to use the biased sample std?

was (Author: josephkb):

StandardScaler: This may be 2 confounded issues. MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* *Q*: [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses the biased sample std. R's [scale function | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std. I'm used to seeing the biased sample std in ML, probably because it is handy for proofs to know that columns have L2 norm 1.

My main question is: What does glmnet do? This is important since we compare against it in MLlib's GLM unit tests. The difference might be insignificant, though, for the GLMs and datasets we test on.

> Update RDD.treeAggregate not to use reduce
> ------------------------------------------
>
>          Key: SPARK-14408
>          URL: https://issues.apache.org/jira/browse/SPARK-14408
>      Project: Spark
>   Issue Type: Bug
>   Components: ML, MLlib, Spark Core
>     Reporter: Joseph K. Bradley
>     Assignee: Joseph K. Bradley
>     Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and {{combOp}} functions to modify and return their first argument, just like {{RDD.aggregate}}. However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce. Use fold instead.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
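The contract at issue in the description above can be sketched in plain Python (illustrative only, not Spark code; {{reduce}} with and without an initializer stands in for the RDD-level reduce and fold):

```python
from functools import reduce

# A combOp that modifies and returns its first argument, as MLlib's
# seqOp/combOp functions were assumed to be allowed to do.
def comb_op(acc, x):
    acc.extend(x)
    return acc

# reduce seeds the accumulator with the first input element, so the
# mutating combOp corrupts the input data.
data = [[1], [2], [3]]
reduce(comb_op, data)
mutated_first = data[0]          # [1, 2, 3] -- the input was modified

# A fold passes a caller-supplied zero value as the initial accumulator,
# so the inputs are left intact. (Python's reduce with an initializer
# behaves like a fold.)
data = [[1], [2], [3]]
folded = reduce(comb_op, data, [])
intact_first = data[0]           # [1] -- input untouched
print(mutated_first, folded, intact_first)
```

This is why a reduce-based implementation is unsafe when the seq/combOps mutate their first argument, while a fold-based one is safe, matching the decision recorded above.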
[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230994#comment-15230994 ]

Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:00 AM:
--------------------------------------------------------------------

After a bit of a scare (b/c of the confounding issue of StandardScaler not matching sklearn), it's definitely an issue with my initial PR to "fix" treeAggregate's implementation. That said, I'm still having a hard time figuring out the right way to fix the implementation. I'll comment more on the PR.

was (Author: josephkb):

Not meaning to cause panic here, but I'm escalating this since it might be a critical bug in MLlib. [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you please help me confirm that this is a bug? If you agree, then we can:
* Change this to a blocker for 2.0.
* Update all failing unit tests.
** I propose to do this in a single PR. It would be great to get help with fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary {{private[spark] def brokenTreeAggregate}} method to be used for unit tests not yet ported to the fixed treeAggregate. But I'd prefer not to do this, since we will want to backport the fix.
* Backport to all reasonable versions. This will be painful because of the unit tests.

Currently, I'm testing StandardScaler and IDF.
[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230994#comment-15230994 ]

Joseph K. Bradley edited comment on SPARK-14408 at 4/7/16 9:24 PM:
--------------------------------------------------------------------

Not meaning to cause panic here, but I'm escalating this since it might be a critical bug in MLlib. [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you please help me confirm that this is a bug? If you agree, then we can:
* Change this to a blocker for 2.0.
* Update all failing unit tests.
** I propose to do this in a single PR. It would be great to get help with fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary {{private[spark] def brokenTreeAggregate}} method to be used for unit tests not yet ported to the fixed treeAggregate. But I'd prefer not to do this, since we will want to backport the fix.
* Backport to all reasonable versions. This will be painful because of the unit tests.

Currently, I'm testing StandardScaler and IDF.

was (Author: josephkb):

Not meaning to cause panic here, but I'm escalating this since it might be a critical bug in MLlib. [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you please help me confirm that this is a bug? If you agree, then we can:
* Change this to a blocker for 2.0.
* Update all failing unit tests.
** I propose to do this in a single PR. It would be great to get help with fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary {{private[spark] def brokenTreeAggregate}} method to be used for unit tests not yet ported to the fixed treeAggregate. But I'd prefer not to do this, since we will want to backport the fix.
* Backport to all reasonable versions. This will be painful because of the unit tests.

Currently, I'm testing StandardScaler a little more carefully to check its results.
[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce
[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231097#comment-15231097 ]

Joseph K. Bradley edited comment on SPARK-14408 at 4/7/16 9:22 PM:
--------------------------------------------------------------------

StandardScaler: This may be 2 confounded issues. MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* *Q*: [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses the biased sample std. R's [scale function | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std. I'm used to seeing the biased sample std in ML, probably because it is handy for proofs to know that columns have L2 norm 1.

My main question is: What does glmnet do? This is important since we compare against it in MLlib's GLM unit tests. The difference might be insignificant, though, for the GLMs and datasets we test on.

was (Author: josephkb):

StandardScaler
* This may be 2 confounded issues. MLlib's StandardScaler uses the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
** *Q*: [sklearn.preprocessing.StandardScaler | http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html] uses the biased sample std. R's [scale function | https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the unbiased sample std. I'm used to seeing the biased sample std in ML, probably because it is handy for proofs to know that columns have L2 norm 1.

My main question is: What does glmnet do? This is important since we compare against it in MLlib's GLM unit tests. The difference might be insignificant, though, for the GLMs and datasets we test on.
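The biased/unbiased distinction discussed above can be checked with a small stdlib-Python sketch (illustrative only; the sample values are made up, not taken from any MLlib test):

```python
import math
from statistics import pstdev, stdev

x = [1.0, 2.0, 3.0, 4.0]
n = len(x)
mean = sum(x) / n

biased = pstdev(x)    # population std, divides by n     -- sklearn's convention
unbiased = stdev(x)   # sample std, divides by n - 1     -- MLlib's and R scale()'s convention

# Rescaling the centered column by the biased std gives it L2 norm sqrt(n),
# so dividing by sqrt(n) yields unit norm -- the property noted above as
# handy in proofs.
scaled = [(v - mean) / biased for v in x]
l2 = math.sqrt(sum(v * v for v in scaled))
print(biased, unbiased, l2)   # l2 == sqrt(n) == 2.0 here
```

For small n the two stds differ noticeably (here by a factor of sqrt(4/3)), which is enough to make unit tests that compare MLlib's output against sklearn's fail even when both implementations are correct.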