GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/8833

    [SPARK-10685] [SPARK-8632] [SQL] [PYSPARK] Python UDF should only compute the upstream once

    This PR buffers the rows from upstream in a queue, then zips them with
    the results from the Python UDF, so that the upstream is computed only once
    instead of twice.
    
    Thanks to @rxin for the idea, which simplifies the buffer greatly!
    
    cc @marmbrus 
    
    Closes #8662  
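
The buffering idea in the description above can be sketched roughly as follows. This is a minimal, standalone Python illustration, not the code from the patch: `eval_udf_once` and `batch_udf` are hypothetical names, and the sketch assumes the UDF returns exactly one result per input row, in input order.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

Row = TypeVar("Row")
Out = TypeVar("Out")


def eval_udf_once(
    upstream: Iterable[Row],
    batch_udf: Callable[[Iterator[Row]], Iterator[Out]],
) -> Iterator[Tuple[Row, Out]]:
    """Pair each upstream row with its UDF result, reading upstream once.

    Hypothetical helper for illustration only. Assumes batch_udf yields
    one result per input row, in the same order as the input.
    """
    queue = deque()  # rows already handed to the UDF, awaiting results

    def feed() -> Iterator[Row]:
        # Remember every row as it is handed to the UDF, so it can be
        # paired with its result later without recomputing upstream.
        for row in upstream:
            queue.append(row)
            yield row

    # The UDF may read ahead (for example, in batches); the queue holds
    # whatever has been consumed but not yet matched with a result.
    for result in batch_udf(feed()):
        yield queue.popleft(), result
```

For example, `list(eval_udf_once(iter([1, 2, 3]), lambda rows: (r * 10 for r in rows)))` yields `[(1, 10), (2, 20), (3, 30)]` while iterating the input only once. The queue grows only as far as the UDF reads ahead of the results it has produced.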

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark pyudf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8833.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8833
    
----
commit b98cf2a45984d24bdb181e4d43d8f83ca1849aff
Author: Davies Liu <dav...@databricks.com>
Date:   2015-09-19T05:42:26Z

    compute the upstream once

----


