[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407451#comment-16407451 ]

Teng Peng edited comment on SPARK-19208 at 3/21/18 4:44 AM:

[~timhunter] Has the Jira ticket been opened? I believe the new API for statistical info would be a great improvement.

> MultivariateOnlineSummarizer performance optimization
> -----------------------------------------------------
>
>          Key: SPARK-19208
>          URL: https://issues.apache.org/jira/browse/SPARK-19208
>      Project: Spark
>   Issue Type: Improvement
>   Components: ML
>     Reporter: zhengruifeng
>     Priority: Major
>  Attachments: Tests.pdf, WechatIMG2621.jpeg
>
> Currently, {{MaxAbsScaler}} and {{MinMaxScaler}} use {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused statistics. This slows down the task and makes it more prone to OOM.
> For example:
> env: --driver-memory 4G --executor-memory 1G --num-executors 4
> data: [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] (748,401 instances, 29,890,095 features)
> {{MaxAbsScaler.fit}} fails because of OOM.
> {{MultivariateOnlineSummarizer}} maintains 8 arrays (plus 3 scalar counters):
> {code}
> private var currMean: Array[Double] = _
> private var currM2n: Array[Double] = _
> private var currM2: Array[Double] = _
> private var currL1: Array[Double] = _
> private var totalCnt: Long = 0
> private var totalWeightSum: Double = 0.0
> private var weightSquareSum: Double = 0.0
> private var weightSum: Array[Double] = _
> private var nnz: Array[Long] = _
> private var currMax: Array[Double] = _
> private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (the max of absolute values).
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz).
> After the modification in the PR, the above example runs successfully.
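The description's point is that {{MaxAbsScaler}} needs only one of those arrays. As a hedged illustration (plain Scala, not Spark's actual implementation), a stripped-down summarizer could keep a single running max-abs array and still support merging partial results from different partitions:

```scala
// Minimal sketch (not Spark's code): a summarizer that keeps only the running
// max of absolute values, the single array MaxAbsScaler needs.
class MaxAbsSummarizer(numFeatures: Int) extends Serializable {
  private val currMaxAbs: Array[Double] = Array.fill(numFeatures)(0.0)

  // Fold one (dense) sample into the running state.
  def add(sample: Array[Double]): this.type = {
    require(sample.length == numFeatures)
    var i = 0
    while (i < numFeatures) {
      val v = math.abs(sample(i))
      if (v > currMaxAbs(i)) currMaxAbs(i) = v
      i += 1
    }
    this
  }

  // Combine with a summarizer built on another partition of the data.
  def merge(other: MaxAbsSummarizer): this.type = {
    var i = 0
    while (i < numFeatures) {
      if (other.currMaxAbs(i) > currMaxAbs(i)) currMaxAbs(i) = other.currMaxAbs(i)
      i += 1
    }
    this
  }

  def maxAbs: Array[Double] = currMaxAbs.clone()
}

val part1 = new MaxAbsSummarizer(3).add(Array(1.0, -4.0, 2.0))
val part2 = new MaxAbsSummarizer(3).add(Array(-3.0, 1.0, 0.5))
val result = part1.merge(part2).maxAbs  // Array(3.0, 4.0, 2.0)
```

Because the per-partition state is one `Array[Double]` instead of eight, memory pressure for the 29,890,095-feature example above drops by roughly the same factor.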
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755 ]

Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM:

Ah right, I see - yes, rewrite rules would be a good ultimate goal. One question I have - if we do:

{code}
val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features"))
{code}

how will we return a DataFrame with columns {{min}} and {{max}}? It seems multiple return columns are not supported by UDAFs. Or do we have to live with the struct return type for now, until we could do the rewrite version?
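The "live with the struct return type" option can be sketched in plain Scala (names and implementation hypothetical, no Spark involved): the builder collects the requested metric names, the aggregation returns one struct-like value, and individual metrics are pulled out of it afterwards, much as a later `select("summary.min")` would on a real struct column:

```scala
// Hypothetical sketch: one struct-like return value holding all requested
// metrics, with per-metric access after the fact.
case class Summary(values: Map[String, Array[Double]]) {
  // Pull one metric out of the "struct"; fails if it was not requested.
  def getCol(metric: String): Array[Double] =
    values.getOrElse(metric, sys.error(s"metric '$metric' was not requested"))
}

class SummarizerBuilder(metrics: Seq[String]) {
  // Stand-in for the UDAF: aggregate in-memory rows column-by-column.
  def summary(rows: Seq[Array[Double]]): Summary = {
    val n = rows.head.length
    val computed = metrics.map {
      case "min" => "min" -> (0 until n).map(i => rows.map(_(i)).min).toArray
      case "max" => "max" -> (0 until n).map(i => rows.map(_(i)).max).toArray
      case m     => sys.error(s"unsupported metric: $m")
    }
    Summary(computed.toMap)
  }
}

object VectorSummarizer {
  def metrics(ms: String*): SummarizerBuilder = new SummarizerBuilder(ms)
}

val s = VectorSummarizer.metrics("min", "max")
  .summary(Seq(Array(1.0, 5.0), Array(3.0, 2.0)))
val mins = s.getCol("min")  // Array(1.0, 2.0)
```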
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714 ]

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM:

Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by Catalyst rules is supporting the case in which the user naturally requests multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features"))
{code}

and having a simple rule that rewrites this logical plan to use a single UDAF under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(
  col("vector_summary(features).min").as("min(features)"),
  col("vector_summary(features).variance").as("variance(features)"))
{code}

Of course this is more advanced, and we should probably start with:
- a UDAF that follows a builder pattern, such as {{VectorSummarizer.metrics("min", "max").summary("features")}}
- some simple wrappers that (inefficiently) compute their statistics independently: {{VectorSummarizer.min("features")}} is a shortcut for

{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}

etc. We can always optimize this use case later using rewrite rules. What do you think?
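The payoff of such a rewrite rule is that several statistics share one traversal of the data. A hedged plain-Scala sketch of the underlying technique (a running min maintained alongside Welford's online variance update, one pass total):

```scala
// Sketch: one pass maintaining min and variance together (Welford's online
// update) - the kind of shared aggregation a rewrite rule would produce
// instead of two independent scans.
class MinVarianceSummarizer {
  private var n = 0L
  private var mean = 0.0
  private var m2 = 0.0                       // running sum of squared deviations
  private var currMin = Double.PositiveInfinity

  def add(x: Double): this.type = {
    n += 1
    val delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)                 // note: uses the updated mean
    if (x < currMin) currMin = x
    this
  }

  def getMin: Double = currMin
  // Sample variance; needs at least two observations.
  def variance: Double = { require(n > 1); m2 / (n - 1) }
}

val s = new MinVarianceSummarizer
Seq(1.0, 2.0, 3.0, 4.0).foreach(s.add)
// s.getMin == 1.0; s.variance == 5.0 / 3.0
```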
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535 ]

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM:

I am not sure we should follow the Estimator API for classical statistics:
- it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API;
- more generally, I would argue that the use case is to get some information about a DataFrame for its own sake, rather than as part of an ML pipeline. For instance, there was no attempt to fit these algorithms into the spark.mllib estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized Catalyst-based transforms) can be implemented in the future without changing the API.

As an intermediate step, if introducing Catalyst rules is too hard for now and we want to address [~mlnick]'s points (a) and (b), we can have an API like this:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...))
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...))
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features"))
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights"))
{code}

What do you think? I will be happy to put together a proposal.
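For the weighted variant, the semantics per dimension are the usual ones; as a hedged plain-Scala sketch (a single column, names hypothetical): the weighted mean divides by the total weight, and min considers only samples with positive weight.

```scala
// Hypothetical sketch of summaryWeighted semantics for one column:
// weighted mean, and min over samples whose weight is > 0.
def weightedSummary(values: Seq[Double], weights: Seq[Double]): (Double, Double) = {
  require(values.length == weights.length && weights.forall(_ >= 0.0))
  val totalWeight = weights.sum
  require(totalWeight > 0.0, "at least one positive weight is required")
  val mean = values.zip(weights).map { case (v, w) => v * w }.sum / totalWeight
  val min = values.zip(weights).collect { case (v, w) if w > 0.0 => v }.min
  (mean, min)
}

val (wMean, wMin) = weightedSummary(Seq(1.0, 3.0), Seq(1.0, 3.0))
// wMean == 2.5 (i.e. (1*1 + 3*3) / 4), wMin == 1.0
```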
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848108#comment-15848108 ]

Nick Pentreath edited comment on SPARK-19208 at 2/1/17 8:09 AM:

Another option would be an "Estimator"-like API, where the UDAF is purely internal to the API and not exposed to users, e.g.

{code}
val summarizer = new VectorSummarizer()
  .setMetrics("min", "max")
  .setInputCol("features")
  .setWeightCol("weight")
val summary = summarizer.fit(df)  // or summarizer.evaluate, summarizer.summarize, etc.?

// this would need to throw an exception (or perhaps return an empty vector)
// if the metric was not set
val min: Vector = summary.getMin
// OR a DataFrame-based result:
val min: Vector = summary.select("min").as[Vector]
{code}

I agree it is important (and the point of this issue) to (a) only compute the required metrics, and (b) not duplicate computation, for efficiency.
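The Estimator-like shape above can be mocked in plain Scala to make the behavior concrete (a simplified sketch: no Spark, no input/weight columns, and {{fit}} folds over in-memory rows instead of running a UDAF). Note how {{getMin}} throws when "min" was not among the requested metrics, as the comment suggests:

```scala
// Plain-Scala mock of the Estimator-like API sketched in the comment above.
class VectorSummarizer {
  private var metrics: Set[String] = Set.empty

  def setMetrics(ms: String*): this.type = { metrics = ms.toSet; this }

  // Stand-in for fit(df): aggregate in-memory rows, computing only the
  // requested metrics.
  def fit(rows: Seq[Array[Double]]): VectorSummary = {
    val n = rows.head.length
    val mins =
      if (metrics("min")) Some((0 until n).map(i => rows.map(_(i)).min).toArray) else None
    val maxes =
      if (metrics("max")) Some((0 until n).map(i => rows.map(_(i)).max).toArray) else None
    new VectorSummary(mins, maxes)
  }
}

class VectorSummary(mins: Option[Array[Double]], maxes: Option[Array[Double]]) {
  // Throws if the metric was not requested, per the comment's suggestion.
  def getMin: Array[Double] =
    mins.getOrElse(throw new IllegalStateException("'min' was not computed"))
  def getMax: Array[Double] =
    maxes.getOrElse(throw new IllegalStateException("'max' was not computed"))
}

val summary = new VectorSummarizer()
  .setMetrics("min", "max")
  .fit(Seq(Array(1.0, 5.0), Array(3.0, 2.0)))
// summary.getMin: Array(1.0, 2.0); summary.getMax: Array(3.0, 5.0)
```

This shape addresses point (a) directly: only the metrics passed to {{setMetrics}} are ever computed.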