[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407451#comment-16407451 ] Teng Peng commented on SPARK-19208: --- [~timhunter] Has the Jira ticket been opened? I believe this would be a great improvement. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Major > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088259#comment-16088259 ] Apache Spark commented on SPARK-19208: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/14950 > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870655#comment-15870655 ] Timothy Hunter commented on SPARK-19208: I put together the ideas in this thread into a document. I will update the umbrella ticket with sub tasks once folks have had a chance to comment: https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866768#comment-15866768 ] Timothy Hunter commented on SPARK-19208: Yes, I meant returning a struct and then projecting this struct. I do not think there is any other way right now with the current UDAFs, as you mention. In that proposal, {{VectorSummarizer.metrics(...).summary(...)}} returns a struct, the fields of which are decided by the arguments in {{.metrics}}, and each of the individual functions {{VectorSummarizer.min/max/variasce(...)}} returns columns of vectors or matrices. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755 ] Nick Pentreath commented on SPARK-19208: Ah right I see - yes rewrite rules would be a good ultimate goal. One question I have - if we do: {{code}} val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features")) {{code}} How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple return cols are not supported by UDAF? Or do we have to live with the struct return type for now until we could do the rewrite version? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714 ] Timothy Hunter commented on SPARK-19208: Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), col("VectorSummary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866675#comment-15866675 ] Nick Pentreath commented on SPARK-19208: When I said "estimator-like", I didn't mean it should necessarily be an actual {{Estimator}} (I agree it is not really intended to fit into transformers & pipelines), but rather mimic the API, i.e. that the summarizer is "fitted" on a dataset to return a summary. I just wasn't too keen on the idea of returning a struct as it just feels sort of clunky relative to returning a df with vector columns {{"mean", "min", "max"}} etc. Supporting SS and {{groupBy}} seems like an important goal, so something like [~timhunter]'s suggestion looks like it will work nicely. For doing it via catalyst rules, that would be first prize to automatically re-use the intermediate results for multiple end-result computations, and only compute what is necessary for the required end-results. But, is that even supported for UDTs currently? I'm not an expert but my understanding was that is not supported yet. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535 ] Timothy Hunter commented on SPARK-19208: I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. - more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of a ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have a the following API: {code} df.select(VectorSummary.summary("features", "min", "mean", ...) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features") df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights") {code} What do you think? I will be happy to put together a proposal. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855090#comment-15855090 ] Joseph K. Bradley commented on SPARK-19208: --- You're right that sharing intermediate results will be necessary. I'm happy with [~mlnick]'s VectorSummarizer API. I also think that, if we wanted to use the API I suggested above, the version returning a single struct col would work: {{df.select(VectorSummary.summary("features", "weights"))}}. The new column could be constructed from intermediate columns which would not show up in the final output. (Is this essentially the "private UDAF" [~podongfeng] is mentioning above?) I'm OK either way. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849837#comment-15849837 ] zhengruifeng commented on SPARK-19208: -- [~mlnick] +1 I think we can create a private UDAF to compute the metrics. And for users, we create a wrapper {{VectorSummarizer}} based on the internal UDAF. In algorithms, we can use both the {{VectorSummarizer}} and the UDAF (if we need to groupby) [~imatiach] [~josephkb] Is this OK? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848140#comment-15848140 ] zhengruifeng commented on SPARK-19208: -- [~mlnick] What about supporting {{groupBy}} in it? I think it is useful in some cases, for example, we can use it to compute the {{weightedSum}} per label in {{NaiveBayes}}, and {{variance}} per label in GaussianNB. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848108#comment-15848108 ] Nick Pentreath commented on SPARK-19208: Another option would be an "Estimator" like API, where the UDAF is purely internal to the API and not exposed to users, e.g. {code} val summarizer = new VectorSummarizer().setMetrics("min", "max").setInputCol("features").setWeightCol("weight") val summary = summarizer.fit(df) (or summarizer.evaluate, summarizer.summarize, etc?) // this would need to throw exceptions (or perhaps return empty vectors) if the metric was not set val min: Vector = summary.getMin // OR DataFrame-based result: val min: Vector = summary.select("min").as[Vector] {code} > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843884#comment-15843884 ] zhengruifeng commented on SPARK-19208: -- [~josephkb] I have considered of the analogy of Double column stats. But there is a small difference: Some temporary intermediate variables are used by multi metrics. {code} val results: DataFrame = df.select(VectorSummary.mean("features"), VectorSummary.variance("features")) {code} The {{currMean}} and {{weightSum}} are used both in {{VectorSummary.mean}} and {{VectorSummary.variance}}. So we maybe have to compute {{currMean}} and {{weightSum}} twice, if we use two seperate udaf. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843670#comment-15843670 ] Joseph K. Bradley commented on SPARK-19208: --- Thanks for writing out your ideas. Here are my thoughts about the API: *Reference API: Double column stats* When working with Double columns (not Vectors), one would expect write things like: {{myDataFrame.select(min("x"), max("x"))}} to select 2 stats, min and max. Here, min and max are functions provided by Spark SQL which return columns. *Analogy* We should probably provide an analogous API. Here's what I imagine: {code} import org.apache.spark.ml.stat.VectorSummary val df: DataFrame = ... val results: DataFrame = df.select(VectorSummary.min("features"), VectorSummary.mean("features")) val weightedResults: DataFrame = df.select(VectorSummary.min("features"), VectorSummary.mean("features", "weight")) // Both of these result DataFrames contain 2 Vector columns. {code} I.e., we provide vectorized versions of stats functions. If you want to put everything into a single function, then we could also have VectorSummary have a function "summary" which returns a struct type with every stat available: {code} val results = df.select(VectorSummary.summary("features", "weights")) // results DataFrame contains 1 struct column, which has a Vector field for every statistic we provide. {code} Note: I removed "online" from the name since it the user does not need to know that it does online aggregation. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org