[ 
https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231097#comment-15231097
 ] 

Joseph K. Bradley commented on SPARK-14408:
-------------------------------------------

StandardScaler
* This may be two confounded issues.  MLlib's StandardScaler uses the unbiased 
sample std to rescale, whereas sklearn uses the biased sample std.
** *Q*: [sklearn.preprocessing.StandardScaler | 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html]
 uses the biased sample std.  R's [scale function | 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html] uses the 
unbiased sample std.  I'm used to seeing the biased sample std used in ML, 
probably because it is handy for proofs to know that each centered, rescaled 
column has L2 norm sqrt(n) (or 1 after dividing by sqrt(n)).  My main question 
is: What does glmnet do?  This is important since we compare with it for MLlib 
GLM unit tests.  The difference might be insignificant, though, for GLMs and 
the datasets we are testing on.
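
To make the biased/unbiased distinction concrete, here is a minimal sketch in plain Python. It is not the MLlib or sklearn code; the helper {{standardize}} and its {{ddof}} parameter are hypothetical names used only for illustration. It shows that the two conventions differ by a constant factor of sqrt(n / (n - 1)), which shrinks toward 1 as n grows.

```python
import math

def standardize(column, ddof):
    """Center a column and divide by its std (hypothetical helper).

    ddof=0 divides the sum of squares by n (biased sample std);
    ddof=1 divides by n - 1 (unbiased sample variance).
    """
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / (n - ddof)
    std = math.sqrt(var)
    return [(x - mean) / std for x in column]

column = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(column)

biased = standardize(column, ddof=0)    # divisor n
unbiased = standardize(column, ddof=1)  # divisor n - 1

# Sum of squares of the rescaled column: n for biased, n - 1 for unbiased.
sum_sq_biased = sum(x * x for x in biased)
sum_sq_unbiased = sum(x * x for x in unbiased)

# Every entry differs by the same constant factor between the two conventions.
ratio = biased[0] / unbiased[0]
expected_ratio = math.sqrt(n / (n - 1))
```

For n = 5 the factor is about 1.118; for n = 10000 it is about 1.00005, which is why the difference may not matter on larger test datasets.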

> Update RDD.treeAggregate not to use reduce
> ------------------------------------------
>
>                 Key: SPARK-14408
>                 URL: https://issues.apache.org/jira/browse/SPARK-14408
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, Spark Core
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.
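
The concern behind the decision can be sketched with a plain-Python analogy. This is not Spark code: {{merge_in_place}} and {{fold}} are hypothetical stand-ins for a combOp that mutates its first argument and for a fold that starts from a fresh zero value. With a reduce, the first argument is one of the input elements, so an in-place op overwrites input data; with a fold, the first argument is always the accumulator.

```python
from functools import reduce

def merge_in_place(acc, other):
    """Combine op that mutates and returns its first argument (hypothetical)."""
    for i, value in enumerate(other):
        acc[i] += value
    return acc

def fold(zero_factory, items, op):
    """Fold starting from a fresh zero value, so only the accumulator is mutated."""
    acc = zero_factory()
    for item in items:
        acc = op(acc, item)
    return acc

partials = [[1, 2], [3, 4], [5, 6]]

# reduce: the first input element doubles as the accumulator and is overwritten.
reduce_input = [list(p) for p in partials]
reduce_result = reduce(merge_in_place, reduce_input)
reduce_clobbered_input = reduce_input[0] != partials[0]

# fold: the zero value absorbs all mutation; the inputs are left untouched.
fold_input = [list(p) for p in partials]
fold_result = fold(lambda: [0, 0], fold_input, merge_in_place)
fold_preserved_input = fold_input == partials
```

Both paths compute the same sum, but only the fold leaves the inputs intact. That is the property that matters if an input element could be cached or reused elsewhere.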


