[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231097#comment-15231097
 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:01 AM:


Note on StandardScaler: MLlib's StandardScaler uses the unbiased sample std to 
rescale, whereas sklearn uses the biased sample std.
* [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses biased sample std.  R's [scale package | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know columns have L2 norm 1. 
* [~mengxr] reports that glmnet uses the biased sample std.
* *Q*: Should we change StandardScaler to use unbiased sample std?


was (Author: josephkb):
StandardScaler: This may be 2 confounded issues.  MLlib's StandardScaler uses 
the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* *Q*: [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses biased sample std.  R's [scale package | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know columns have L2 norm 1.  My 
main question is: What does glmnet do?  This is important since we compare with 
it for MLlib GLM unit tests.  The difference might be insignificant, though, 
for GLMs and the datasets we are testing on.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230994#comment-15230994
 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/8/16 12:00 AM:


After a bit of a scare (b/c of the confounding issue of StandardScaler not 
matching sklearn), it's definitely an issue with my initial PR to "fix" 
treeAggregate's implementation.  That said, I'm still having a hard time 
figuring out the right way to fix the implementation.  I'll comment more on the 
PR.


was (Author: josephkb):
Not meaning to cause panic here, but I'm escalating this since it might be a 
critical bug in MLlib.  [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you 
please help me confirm that this is a bug?  If you agree, then we can:
* Change this to a blocker for 2.0
* Update all failing unit tests.
** I propose to do this in a single PR.  It would be great to get help with 
fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary 
{{private[spark] def brokenTreeAggregate}} method to be used for unit tests not 
yet ported to the fixed treeAggregate.  But I'd prefer not to do this since we 
will want to backport the fix.
* Backport to all reasonable versions.  This will be painful because of unit 
tests.

Currently, I'm testing StandardScaler and IDF.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230994#comment-15230994
 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/7/16 9:24 PM:
---

Not meaning to cause panic here, but I'm escalating this since it might be a 
critical bug in MLlib.  [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you 
please help me confirm that this is a bug?  If you agree, then we can:
* Change this to a blocker for 2.0
* Update all failing unit tests.
** I propose to do this in a single PR.  It would be great to get help with 
fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary 
{{private[spark] def brokenTreeAggregate}} method to be used for unit tests not 
yet ported to the fixed treeAggregate.  But I'd prefer not to do this since we 
will want to backport the fix.
* Backport to all reasonable versions.  This will be painful because of unit 
tests.

Currently, I'm testing StandardScaler and IDF.


was (Author: josephkb):
Not meaning to cause panic here, but I'm escalating this since it might be a 
critical bug in MLlib.  [~dbtsai] [~mengxr] [~mlnick] [~srowen] could you 
please help me confirm that this is a bug?  If you agree, then we can:
* Change this to a blocker for 2.0
* Update all failing unit tests.
** I propose to do this in a single PR.  It would be great to get help with 
fixing the unit tests via PRs sent to my PR.
** Alternatively, we could split up this work by creating a temporary 
{{private[spark] def brokenTreeAggregate}} method to be used for unit tests not 
yet ported to the fixed treeAggregate.  But I'd prefer not to do this since we 
will want to backport the fix.
* Backport to all reasonable versions.  This will be painful because of unit 
tests.

Currently, I'm testing StandardScaler a little more carefully to check its 
results.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14408) Update RDD.treeAggregate not to use reduce

2016-04-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231097#comment-15231097
 ] 

Joseph K. Bradley edited comment on SPARK-14408 at 4/7/16 9:22 PM:
---

StandardScaler: This may be 2 confounded issues.  MLlib's StandardScaler uses 
the unbiased sample std to rescale, whereas sklearn uses the biased sample std.
* *Q*: [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses biased sample std.  R's [scale package | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know columns have L2 norm 1.  My 
main question is: What does glmnet do?  This is important since we compare with 
it for MLlib GLM unit tests.  The difference might be insignificant, though, 
for GLMs and the datasets we are testing on.


was (Author: josephkb):
StandardScaler
* This may be 2 confounded issues.  MLlib's StandardScaler uses the unbiased 
sample std to rescale, whereas sklearn uses the biased sample std.
** *Q*: [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses biased sample std.  R's [scale package | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know columns have L2 norm 1.  My 
main question is: What does glmnet do?  This is important since we compare with 
it for MLlib GLM unit tests.  The difference might be insignificant, though, 
for GLMs and the datasets we are testing on.

> Update RDD.treeAggregate not to use reduce
> --
>
> Key: SPARK-14408
> URL: https://issues.apache.org/jira/browse/SPARK-14408
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, Spark Core
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org