Github user zero323 commented on the issue:

    https://github.com/apache/spark/pull/16534
  
    Don't worry, I get it :) The point is to make the user experience better, not worse, right? In practice:
    
    - These changes are pretty far from the data, so the overall impact is negligible and constant.
    - UDF creation overhead is around 8 microseconds (this doesn't include any JVM communication).
    - With a Py4J call (JUDF and Column creation), everything is bound by JVM communication, which has latency three orders of magnitude higher than our Python code.
    
    Rough tests (build 8f33731e796750e6f60dc9e2fc33a94d29d198b4):
    
    ```
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
          /_/
    
    Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
    SparkSession available as 'spark'.
    
    In [1]: from pyspark.sql.functions import udf
    
    In [2]: from functools import wraps
    
    In [3]: def wrapped(f):
       ...:     f_ = udf(f)
       ...:     @wraps(f)
       ...:     def wrapped_(*args):
       ...:         return f_(*args)
       ...:     return wrapped_
       ...: 
    
    In [4]: %timeit udf(lambda x: x)
    The slowest run took 8.96 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 3.45 µs per loop
    
    In [5]: %timeit wrapped(lambda x: x)
    The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 12.3 µs per loop
    
    In [6]: %timeit udf(lambda x: x)("x")
    The slowest run took 13.64 times longer than the fastest. This could mean that an intermediate result is being cached.
    100 loops, best of 3: 11.3 ms per loop
    
    In [7]: %timeit wrapped(lambda x: x)("a")
    100 loops, best of 3: 9.9 ms per loop
    
    In [8]: %timeit -n10  spark.range(0, 10000).toDF("id").select(udf(lambda x: x)("id")).rdd.foreach(lambda _: None)
    10 loops, best of 3: 227 ms per loop
    
    In [9]: %timeit -n10  spark.range(0, 10000).toDF("id").select(wrapped(lambda x: x)("id")).rdd.foreach(lambda _: None)
    10 loops, best of 3: 206 ms per loop
    ```
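
    For reference, here is a minimal, Spark-free sketch of what the `wrapped` helper above buys you: `functools.wraps` copies `__name__`, `__doc__`, and related metadata from the user's function onto the wrapper, so the resulting UDF still introspects like the original. The `fake_udf` stand-in below is hypothetical, just so the snippet runs without a SparkSession; the real `udf` returns a callable that builds a Column through a Py4J call.

    ```python
    from functools import wraps

    def fake_udf(f):
        # Hypothetical stand-in for pyspark.sql.functions.udf, so this
        # snippet runs without a SparkSession.
        return lambda *args: (f, args)

    def wrapped(f):
        f_ = fake_udf(f)

        @wraps(f)  # copies __name__, __doc__, __module__, etc. from f
        def wrapped_(*args):
            return f_(*args)

        return wrapped_

    def add_one(x):
        """Increment x by one."""
        return x + 1

    w = wrapped(add_one)
    print(w.__name__)  # add_one -- not wrapped_
    print(w.__doc__)   # Increment x by one.
    ```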


