[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866535#comment-15866535 ]

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM:
-----------------------------------------------------------------

I am not sure we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a DataFrame for its own sake, rather than as part of an ML pipeline. For 
instance, there was no attempt to fit these algorithms into the spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the implementation details (a single private UDAF, 
more optimized Catalyst-based transforms) can be changed in the future without 
breaking the API.
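
To make the "single private UDAF" idea concrete, here is a rough sketch (the 
class name is made up, it only tracks min/max for brevity, it assumes a known 
numFeatures, and a real implementation would exploit sparsity and avoid the 
per-row buffer conversion):

{code}
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sketch: one aggregate tracking per-dimension min and max of a vector column.
class VectorMinMax(numFeatures: Int) extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("features", SQLDataTypes.VectorType)

  def bufferSchema: StructType = new StructType()
    .add("min", ArrayType(DoubleType, containsNull = false))
    .add("max", ArrayType(DoubleType, containsNull = false))

  def dataType: DataType = new StructType()
    .add("min", SQLDataTypes.VectorType)
    .add("max", SQLDataTypes.VectorType)

  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq.fill(numFeatures)(Double.PositiveInfinity)
    buffer(1) = Seq.fill(numFeatures)(Double.NegativeInfinity)
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val values = input.getAs[Vector](0).toArray  // densified for clarity only
    val mins = buffer.getSeq[Double](0).toArray
    val maxs = buffer.getSeq[Double](1).toArray
    var i = 0
    while (i < numFeatures) {
      if (values(i) < mins(i)) mins(i) = values(i)
      if (values(i) > maxs(i)) maxs(i) = values(i)
      i += 1
    }
    buffer(0) = mins.toSeq
    buffer(1) = maxs.toSeq
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val mins1 = buffer1.getSeq[Double](0).toArray
    val maxs1 = buffer1.getSeq[Double](1).toArray
    val mins2 = buffer2.getSeq[Double](0)
    val maxs2 = buffer2.getSeq[Double](1)
    var i = 0
    while (i < numFeatures) {
      if (mins2(i) < mins1(i)) mins1(i) = mins2(i)
      if (maxs2(i) > maxs1(i)) maxs1(i) = maxs2(i)
      i += 1
    }
    buffer1(0) = mins1.toSeq
    buffer1(1) = maxs1.toSeq
  }

  def evaluate(buffer: Row): Any = Row(
    Vectors.dense(buffer.getSeq[Double](0).toArray),
    Vectors.dense(buffer.getSeq[Double](1).toArray))
}
{code}

Usage would then look like {{df.agg(new VectorMinMax(n)(col("features")))}} or 
{{df.groupBy("label").agg(...)}}, and the same aggregate should work under a 
streaming groupBy.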

As an intermediate step, if introducing Catalyst rules is too hard for now and 
we want to address [~mlnick]'s points (a) and (b), we could have an API like 
this:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...))
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...))
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features"))
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights"))
{code}
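
Either form should also compose with groupBy (and hence structured streaming). 
A purely hypothetical usage, assuming the methods above return a Column backed 
by an aggregate expression:

{code}
df.groupBy("label")
  .agg(VectorSummary.summaryStats("min", "mean").summary("features").as("stats"))
{code}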

What do you think? I will be happy to put together a proposal.



> MultivariateOnlineSummarizer performance optimization
> -----------------------------------------------------
>
>                 Key: SPARK-19208
>                 URL: https://issues.apache.org/jira/browse/SPARK-19208
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>         Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} use {{MultivariateOnlineSummarizer}} 
> to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused 
> statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays (plus a few scalar counters):
> {code}
> private var currMean: Array[Double] = _
> private var currM2n: Array[Double] = _
> private var currM2: Array[Double] = _
> private var currL1: Array[Double] = _
> private var totalCnt: Long = 0
> private var totalWeightSum: Double = 0.0
> private var weightSquareSum: Double = 0.0
> private var weightSum: Array[Double] = _
> private var nnz: Array[Long] = _
> private var currMax: Array[Double] = _
> private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.
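>
> As an illustration only (a rough sketch, not the actual change in the PR), the 
> max of absolute values can be computed with a single array:
> {code}
> import org.apache.spark.ml.linalg.Vector
> import org.apache.spark.sql.DataFrame
>
> // Sketch: per-dimension max of absolute values, using one Array[Double]
> // instead of the eight arrays kept by MultivariateOnlineSummarizer.
> def maxAbs(df: DataFrame, featuresCol: String, numFeatures: Int): Array[Double] = {
>   df.select(featuresCol).rdd
>     .map(_.getAs[Vector](0))
>     .treeAggregate(Array.fill(numFeatures)(0.0))(
>       seqOp = (acc, v) => {
>         // abs(0.0) == 0.0, so skipping the implicit zeros of sparse vectors is safe here
>         v.foreachActive { (i, value) =>
>           val a = math.abs(value)
>           if (a > acc(i)) acc(i) = a
>         }
>         acc
>       },
>       combOp = (acc1, acc2) => {
>         var i = 0
>         while (i < numFeatures) {
>           if (acc2(i) > acc1(i)) acc1(i) = acc2(i)
>           i += 1
>         }
>         acc1
>       })
> }
> {code}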


