[ https://issues.apache.org/jira/browse/SPARK-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231871#comment-15231871 ]

DB Tsai commented on SPARK-14408:
---------------------------------

I remember that when we implemented the scaler, we had a similar discussion. We 
ended up following R's scale function, which uses the unbiased std. How about we 
add an extra flag to StandardScaler to make it biased, but default to unbiased?

I remember that when I implemented LOR/LiR in Spark, there were packages in R 
using the unbiased std to scale the features, and most of the time, when the 
number of samples is huge, the difference is not a concern. So I ended up just 
using the standard scaler in MLlib.
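
For reference, a minimal sketch in plain Scala (made-up sample data) of the two 
estimators. Since the ratio between them is sqrt(n / (n - 1)), it goes to 1 as 
the number of samples grows, which is why the choice rarely matters on large data:

{code:scala}
// Biased (population) vs. unbiased (sample) standard deviation.
// Per the comment above, R's scale and MLlib's StandardScaler use the unbiased form.
val xs = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)  // made-up data
val n = xs.size
val mean = xs.sum / n
val sumSq = xs.map(x => (x - mean) * (x - mean)).sum
val biased   = math.sqrt(sumSq / n)        // divides by n      -> 2.0
val unbiased = math.sqrt(sumSq / (n - 1))  // divides by n - 1  -> ~2.14
println(f"biased = $biased%.4f, unbiased = $unbiased%.4f")
{code}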

> Update RDD.treeAggregate not to use reduce
> ------------------------------------------
>
>                 Key: SPARK-14408
>                 URL: https://issues.apache.org/jira/browse/SPARK-14408
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, Spark Core
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Minor
>
> **Issue**
> In MLlib, we have assumed that {{RDD.treeAggregate}} allows the {{seqOp}} and 
> {{combOp}} functions to modify and return their first argument, just like 
> {{RDD.aggregate}}.  However, it is not documented that way.
> I started to add docs to this effect, but then noticed that {{treeAggregate}} 
> uses {{reduceByKey}} and {{reduce}} in its implementation, neither of which 
> technically allows the seq/combOps to modify and return their first arguments.
> **Question**: Is the implementation safe, or does it need to be updated?
> **Decision**: Avoid using reduce.  Use fold instead.
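
A minimal sketch (local mode, hypothetical buffer layout) of the mutate-and-return 
pattern MLlib relies on. With the decision above, the inter-partition combine step 
goes through {{fold}}, whose contract, unlike {{reduce}}'s, explicitly allows the 
op to modify and return its first argument, since that argument always starts from 
a fresh copy of the zero value rather than a raw element of the RDD:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("treeAggregate-sketch"))

// Aggregation buffer: Array(sum, count). Both seqOp and combOp mutate the
// buffer in place and return it, as MLlib's aggregators assume is safe.
val agg = sc.parallelize(1 to 100, numSlices = 8)
  .treeAggregate(Array(0.0, 0.0))(
    seqOp  = (buf, x)  => { buf(0) += x;     buf(1) += 1;     buf },
    combOp = (b1, b2)  => { b1(0) += b2(0);  b1(1) += b2(1);  b1 },
    depth  = 2)

println(s"sum = ${agg(0)}, count = ${agg(1)}")  // sum = 5050.0, count = 100.0
sc.stop()
{code}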


