[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2018-03-20 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407451#comment-16407451
 ] 

Teng Peng commented on SPARK-19208:
---

[~timhunter] Has the Jira ticket been opened? I believe this would be a great 
improvement.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-07-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088259#comment-16088259
 ] 

Apache Spark commented on SPARK-19208:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14950

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-16 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870655#comment-15870655
 ] 

Timothy Hunter commented on SPARK-19208:


I put together the ideas in this thread into a document. I will update the 
umbrella ticket with sub tasks once folks have had a chance to comment:

https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866768#comment-15866768
 ] 

Timothy Hunter commented on SPARK-19208:


Yes, I meant returning a struct and then projecting this struct. I do not think 
there is any other way right now with the current UDAFs, as you mention. In 
that proposal, {{VectorSummarizer.metrics(...).summary(...)}} returns a struct, 
the fields of which are decided by the arguments in {{.metrics}}, and each of 
the individual functions  {{VectorSummarizer.min/max/variasce(...)}} returns 
columns of vectors or matrices.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755
 ] 

Nick Pentreath commented on SPARK-19208:


Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{{code}}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{{code}}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714
 ] 

Timothy Hunter commented on SPARK-19208:


Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), 
col("VectorSummary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute independently their 
statistics: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866675#comment-15866675
 ] 

Nick Pentreath commented on SPARK-19208:


When I said "estimator-like", I didn't mean it should necessarily be an actual 
{{Estimator}} (I agree it is not really intended to fit into transformers & 
pipelines), but rather mimic the API, i.e. that the summarizer is "fitted" on a 
dataset to return a summary.

I just wasn't too keen on the idea of returning a struct as it just feels sort 
of clunky relative to returning a df with vector columns {{"mean", "min", 
"max"}} etc.

Supporting SS and {{groupBy}} seems like an important goal, so something like 
[~timhunter]'s suggestion looks like it will work nicely.

For doing it via catalyst rules, that would be first prize to automatically 
re-use the intermediate results for multiple end-result computations, and only 
compute what is necessary for the required end-results. But, is that even 
supported for UDTs currently? I'm not an expert but my understanding was that 
is not supported yet.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535
 ] 

Timothy Hunter commented on SPARK-19208:


I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have a the 
following API:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...)
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", 
...)
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features")
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", 
"weights")
{code}

What do you think? I will be happy to put together a proposal.


> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855090#comment-15855090
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

You're right that sharing intermediate results will be necessary.

I'm happy with [~mlnick]'s VectorSummarizer API.  I also think that, if we 
wanted to use the API I suggested above, the version returning a single struct 
col would work: {{df.select(VectorSummary.summary("features", "weights"))}}.  
The new column could be constructed from intermediate columns which would not 
show up in the final output.  (Is this essentially the "private UDAF" 
[~podongfeng] is mentioning above?)

I'm OK either way.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-02 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849837#comment-15849837
 ] 

zhengruifeng commented on SPARK-19208:
--

[~mlnick] +1
I think we can create a private UDAF to compute the metrics.
And for users, we create a wrapper {{VectorSummarizer}} based on the internal 
UDAF.
In algorithms, we can use both the {{VectorSummarizer}} and the UDAF (if we 
need to groupby)

[~imatiach] [~josephkb] Is this OK?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848140#comment-15848140
 ] 

zhengruifeng commented on SPARK-19208:
--

[~mlnick] What about supporting {{groupBy}} in it? I think it is useful in some 
cases, for example, we can use it to compute the {{weightedSum}} per label in 
{{NaiveBayes}}, and {{variance}} per label in GaussianNB.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848108#comment-15848108
 ] 

Nick Pentreath commented on SPARK-19208:


Another option would be an "Estimator" like API, where the UDAF is purely 
internal to the API and not exposed to users, e.g.

{code}
val summarizer = new VectorSummarizer().setMetrics("min", 
"max").setInputCol("features").setWeightCol("weight")
val summary = summarizer.fit(df)

(or summarizer.evaluate, summarizer.summarize, etc?)

// this would need to throw exceptions (or perhaps return empty vectors) if the 
metric was not set
val min: Vector = summary.getMin

// OR DataFrame-based result:

val min: Vector = summary.select("min").as[Vector]
{code}

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843884#comment-15843884
 ] 

zhengruifeng commented on SPARK-19208:
--

[~josephkb] I have considered of the analogy of Double column stats.
But there is a small difference: Some temporary intermediate variables are used 
by multi metrics.

{code}
val results: DataFrame = df.select(VectorSummary.mean("features"), 
VectorSummary.variance("features"))
{code}

The {{currMean}} and {{weightSum}} are used both in {{VectorSummary.mean}} and 
{{VectorSummary.variance}}. So we maybe have to compute {{currMean}} and 
{{weightSum}} twice, if we use two seperate udaf.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843670#comment-15843670
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

Thanks for writing out your ideas.  Here are my thoughts about the API:

*Reference API: Double column stats*
When working with Double columns (not Vectors), one would expect write things 
like: {{myDataFrame.select(min("x"), max("x"))}} to select 2 stats, min and 
max.  Here, min and max are functions provided by Spark SQL which return 
columns.

*Analogy*
We should probably provide an analogous API.  Here's what I imagine:
{code}
import org.apache.spark.ml.stat.VectorSummary
val df: DataFrame = ...

val results: DataFrame = df.select(VectorSummary.min("features"), 
VectorSummary.mean("features"))
val weightedResults: DataFrame = df.select(VectorSummary.min("features"), 
VectorSummary.mean("features", "weight"))
// Both of these result DataFrames contain 2 Vector columns.
{code}

I.e., we provide vectorized versions of stats functions.

If you want to put everything into a single function, then we could also have 
VectorSummary have a function "summary" which returns a struct type with every 
stat available:
{code}
val results = df.select(VectorSummary.summary("features", "weights"))
// results DataFrame contains 1 struct column, which has a Vector field for 
every statistic we provide.
{code}

Note: I removed "online" from the name since it the user does not need to know 
that it does online aggregation.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org