[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2018-03-20 Thread Teng Peng (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407451#comment-16407451 ]

Teng Peng edited comment on SPARK-19208 at 3/21/18 4:44 AM:


[~timhunter] Has the Jira ticket been opened? I believe the new API for 
statistical info would be a great improvement.
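
For context, a minimal usage sketch of the {{Summarizer}} API that shipped in 
{{org.apache.spark.ml.stat}} with Spark 2.3 and follows the builder shape 
discussed below; it assumes a DataFrame {{df}} with a {{Vector}} column named 
"features":

{code}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Request only the needed metrics; the aggregate returns a single struct
// column with one field per metric, which can then be flattened.
val stats = df.select(
  Summarizer.metrics("min", "max").summary(col("features")).as("stats"))
stats.select("stats.min", "stats.max").show()
{code}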



> MultivariateOnlineSummarizer performance optimization
> -----------------------------------------------------
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Currently, {{MaxAbsScaler}} and {{MinMaxScaler}} use 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused 
> statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env: --driver-memory 4G --executor-memory 1G --num-executors 4
> data: [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] 748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM.
> {{MultivariateOnlineSummarizer}} maintains 8 arrays (plus a few scalar counters):
> {code}
> private var currMean: Array[Double] = _
> private var currM2n: Array[Double] = _
> private var currM2: Array[Double] = _
> private var currL1: Array[Double] = _
> private var totalCnt: Long = 0
> private var totalWeightSum: Double = 0.0
> private var weightSquareSum: Double = 0.0
> private var weightSum: Array[Double] = _
> private var nnz: Array[Long] = _
> private var currMax: Array[Double] = _
> private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (the max of absolute values).
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz).
> After the modification in the PR, the above example runs successfully.
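
For illustration, a minimal sketch (not Spark's actual code) of a summarizer 
trimmed to what {{MaxAbsScaler}} needs: a single array instead of the eight 
above.

{code}
// Hypothetical sketch: track only the element-wise max of absolute values.
class MaxAbsSummarizer(numFeatures: Int) extends Serializable {
  private val currMaxAbs = Array.fill(numFeatures)(0.0)

  // Fold one dense row into the running max-abs.
  def add(values: Array[Double]): this.type = {
    var i = 0
    while (i < values.length) {
      val v = math.abs(values(i))
      if (v > currMaxAbs(i)) currMaxAbs(i) = v
      i += 1
    }
    this
  }

  // Merge a partial result from another partition.
  def merge(other: MaxAbsSummarizer): this.type = {
    var i = 0
    while (i < currMaxAbs.length) {
      if (other.currMaxAbs(i) > currMaxAbs(i)) currMaxAbs(i) = other.currMaxAbs(i)
      i += 1
    }
    this
  }

  def maxAbs: Array[Double] = currMaxAbs.clone()
}
{code}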






[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866755#comment-15866755 ]

Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM:

Ah, right, I see - yes, rewrite rules would be a good ultimate goal.

One question I have - if we do:

{code}
val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features"))
{code}

How would we return a DF with cols {{min}} and {{max}}, since it seems 
multiple return cols are not supported by a UDAF?

Or do we have to live with the struct return type for now, until we can do 
the rewrite version?
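
The constraint in question: a {{UserDefinedAggregateFunction}} declares 
exactly one {{dataType}}, so returning several results means packing them 
into a struct. A minimal scalar sketch of that pattern (a hypothetical 
min/max over a single {{Double}} column, not the vector case):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MinMaxAgg extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("min", DoubleType) :: StructField("max", DoubleType) :: Nil)
  // The single declared return type: both results packed into one struct.
  def dataType: DataType = StructType(
    StructField("min", DoubleType) :: StructField("max", DoubleType) :: Nil)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Double.PositiveInfinity
    buffer(1) = Double.NegativeInfinity
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val v = input.getDouble(0)
      buffer(0) = math.min(buffer.getDouble(0), v)
      buffer(1) = math.max(buffer.getDouble(1), v)
    }
  }
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    b1(0) = math.min(b1.getDouble(0), b2.getDouble(0))
    b1(1) = math.max(b1.getDouble(1), b2.getDouble(1))
  }
  def evaluate(buffer: Row): Any = Row(buffer.getDouble(0), buffer.getDouble(1))
}
{code}

Flattening the struct afterwards, e.g. 
{{df.agg(new MinMaxAgg()(col("x")).as("mm")).select("mm.min", "mm.max")}}, 
is then the "live with the struct" option.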






[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866714#comment-15866714 ]

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM:

Thanks for the clarification, [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by Catalyst rules is supporting the case in which the 
user naturally requests multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(
  col("vector_summary(features).min").as("min(features)"),
  col("vector_summary(features).variance").as("variance(features)"))
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern, such as 
{{VectorSummarizer.metrics("min", "max").summary("features")}}
 - some simple wrappers that (inefficiently) compute their statistics 
independently: {{VectorSummarizer.min("features")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules; a 
skeleton of the builder shape follows below.
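
A compilable skeleton of that builder shape; every name below follows the 
proposal above rather than any existing Spark class, and the aggregation 
itself is stubbed out:

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Skeleton only: metrics(...) fixes which statistics to compute, and
// summary(...) would wrap a single UDAF computing exactly those metrics.
class SummaryBuilder(metrics: Seq[String]) {
  // Stub: a real implementation returns a UDAF-backed aggregate column
  // whose result is a struct with one field per requested metric.
  def summary(featuresCol: String): Column = col(featuresCol)
}

object VectorSummarizer {
  def metrics(names: String*): SummaryBuilder = new SummaryBuilder(names)

  // The shortcut wrappers then reduce to one-liners over the builder:
  def min(featuresCol: String): Column =
    metrics("min").summary(featuresCol).getField("min")
}
{code}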

What do you think?






[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866535#comment-15866535 ]

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM:

I am not sure we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a DataFrame for its own sake, rather than as part of an ML pipeline. 
For instance, there was no attempt to fit these algorithms into the 
spark.mllib estimator/model API, and basic scalers are already in the 
transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation (a single private UDAF, or 
more optimized Catalyst-based transforms) can evolve in the future without 
changing the API.
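
That composability is concrete: an aggregate-expression API drops into 
{{groupBy}} (and hence streaming aggregations) unchanged. A sketch using the 
{{Summarizer}} API that later shipped in {{org.apache.spark.ml.stat}} 
(Spark 2.3), assuming {{df}} has a {{label}} column and a {{Vector}} column 
{{features}}:

{code}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// The same aggregate expression works in a plain select, under groupBy,
// or in a streaming aggregation, with no API retrofitting.
val perGroup = df.groupBy("label").agg(
  Summarizer.metrics("min", "mean").summary(col("features")).as("stats"))
perGroup.select("label", "stats.min", "stats.mean").show()
{code}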

As an intermediate step, if introducing Catalyst rules is too hard for now 
and we want to address [~mlnick]'s points (a) and (b), we can have an API 
like this:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...))
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...))
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features"))
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights"))
{code}

What do you think? I will be happy to put together a proposal.






[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread Nick Pentreath (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848108#comment-15848108 ]

Nick Pentreath edited comment on SPARK-19208 at 2/1/17 8:09 AM:


Another option would be an "Estimator"-like API, where the UDAF is purely 
internal and not exposed to users, e.g.

{code}
val summarizer = new VectorSummarizer()
  .setMetrics("min", "max")
  .setInputCol("features")
  .setWeightCol("weight")
val summary = summarizer.fit(df)
// (or summarizer.evaluate, summarizer.summarize, etc.?)

// This would need to throw an exception (or perhaps return an empty
// vector) if the metric was not set:
val min: Vector = summary.getMin

// Or, with a DataFrame-based result:
val minDF = summary.select("min")
{code}

Agreed, it is important (and the point of this issue) to (a) only compute the 
required metrics, and (b) not duplicate computation, for efficiency.
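
A hypothetical skeleton of that Estimator-like shape; none of these names 
exist in Spark, and {{fit()}} is stubbed where a real implementation would 
run the single-pass aggregation:

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

class VectorSummarizer {
  private var metrics: Seq[String] = Nil
  private var inputCol: String = _
  private var weightCol: Option[String] = None
  def setMetrics(m: String*): this.type = { metrics = m; this }
  def setInputCol(c: String): this.type = { inputCol = c; this }
  def setWeightCol(c: String): this.type = { weightCol = Some(c); this }
  // Stub: a real fit() aggregates df once and fills in the results map.
  def fit(df: DataFrame): VectorSummary = new VectorSummary(Map.empty)
}

// Getters throw if the metric was not requested, per the comment above.
class VectorSummary(results: Map[String, Vector]) {
  def getMin: Vector = results.getOrElse("min",
    throw new NoSuchElementException("metric 'min' was not computed"))
  def getMax: Vector = results.getOrElse("max",
    throw new NoSuchElementException("metric 'max' was not computed"))
}
{code}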


