GitHub user uzadude opened a pull request: https://github.com/apache/spark/pull/19683
[SPARK-21657][SQL] optimize explode quadratic memory consumpation ## What changes were proposed in this pull request? The issue has been raised in two Jira tickets: [SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657), [SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Basically, what happens is that in collection generators like explode/inline we create many rows from each row. Currently each exploded row contains also the column on which it was created. This causes, for example, if we have a 10k array in one row that this array will get copy 10k times - to each of the row. this results a qudratic memory consumption. However, it is a common case that the original column gets projected out after the explode, so we can avoid duplicating it. In this solution we propose to identify this situation in the optimizer and turn on a flag for omitting the original column in the generation process. ## How was this patch tested? 1. We added a benchmark test to MiscBenchmark that shows x16 improvement in runtimes. 2. We ran some of the other tests in MiscBenchmark and they show 15% improvements. 3. We ran this code on a specific case from our production data with rows containing arrays of size ~200k and it reduced the runtime from 6 hours to 3 mins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uzadude/spark optimize_explode Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19683.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19683 ---- commit ce7c3694a99584348957dc756234bb667466be4e Author: oraviv <ora...@paypal.com> Date: 2017-11-07T11:34:21Z [SPARK-21657][SQL] optimize explode quadratic memory consumpation ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org