[GitHub] spark pull request #19642: [SPARK-22410][SQL] Remove unnecessary output from...

viirya Thu, 02 Nov 2017 03:48:38 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/19642


    [SPARK-22410][SQL] Remove unnecessary output from BatchEvalPython's 
children plans

    ## What changes were proposed in this pull request?
    
    When we insert `BatchEvalPython` for Python UDFs into a query plan, if its 
child has some outputs that are not used by the original parent node, 
`BatchEvalPython` will still take those outputs and save into the queue. When 
the data for those outputs are big, it is easily to generate big spill on disk.
    
    For example, the following reproducible code is from the JIRA ticket.
    
    ```python
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    
    lines_of_file = [ "this is a line" for x in xrange(10000) ]
    file_obj = [ "this_is_a_foldername/this_is_a_filename", lines_of_file ]
    data = [ file_obj for x in xrange(5) ]
    
    small_df = spark.sparkContext.parallelize(data).map(lambda x : (x[0], 
x[1])).toDF(["file", "lines"])
    exploded = small_df.select("file", explode("lines"))
    
    def split_key(s):
        return s.split("/")[1]
    
    split_key_udf = udf(split_key, StringType())
    
    with_filename = exploded.withColumn("filename", split_key_udf("file"))
    with_filename.explain(True)
    ```
    
    The physical plan before/after this change:
    
    Before:
    
    ```
    *Project [file#0, col#5, pythonUDF0#14 AS filename#9]
    +- BatchEvalPython [split_key(file#0)], [file#0, lines#1, col#5, 
pythonUDF0#14]
       +- Generate explode(lines#1), true, false, [col#5]
          +- Scan ExistingRDD[file#0,lines#1]
    
    ```
    
    After:
    
    ```
    *Project [file#0, col#5, pythonUDF0#14 AS filename#9]
    +- BatchEvalPython [split_key(file#0)], [col#5, file#0, pythonUDF0#14]
       +- *Project [col#5, file#0]
          +- Generate explode(lines#1), true, false, [col#5]
             +- Scan ExistingRDD[file#0,lines#1]
    ```
    
    Before this change, `lines#1` is a redundant input to `BatchEvalPython`. 
This patch removes it by adding a Project.
    
    ## How was this patch tested?
    
    Manually test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-22410

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19642.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19642
    
----
commit 4e0974dec907c0a74fca5701263001c2bab9c250
Author: Liang-Chi Hsieh <[email protected]>
Date:   2017-11-02T10:37:45Z

    Remove unnecessary output from BatchEvalPython's input.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #19642: [SPARK-22410][SQL] Remove unnecessary output from...

Reply via email to