Github user justinuang commented on the pull request:

    https://github.com/apache/spark/pull/8662#issuecomment-140181530

    Sorry for the delay, here is the code I ran and here are the results:

        from pyspark.sql.functions import udf
        from pyspark.sql.types import IntegerType
        import time

        mult = udf(lambda x: 2 * x, IntegerType())

        for i in range(0, 6):
            df = sqlContext.range(1000000).withColumnRenamed("id", "f")
            for j in range(i):
                df = df.select(mult(df.f).alias('f'))
            start = time.time()
            df.count()  # make sure the Python UDF is evaluated
            used = time.time() - start
            print "Number of udfs: {} - {}".format(i, used)

    The results are as expected. The Python overhead is about 1.5 seconds per
    UDF, but you can see how the runtime becomes exponential without the fix,
    since the cost of recomputing the upstream plan twice includes the
    expensive Python operations themselves.

    With fix:
        Number of udfs: 0 - 0.091050863266
        Number of udfs: 1 - 1.72215199471
        Number of udfs: 2 - 3.32698297501
        Number of udfs: 3 - 5.64863801003
        Number of udfs: 4 - 7.06328701973
        Number of udfs: 5 - 9.22025489807

    Without fix:
        Number of udfs: 0 - 1.00539588928
        Number of udfs: 1 - 3.12671899796
        Number of udfs: 2 - 5.91188406944
        Number of udfs: 3 - 11.124516964
        Number of udfs: 4 - 24.3277280331
        Number of udfs: 5 - 47.621573925
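A minimal sketch of why the timings diverge the way they do. This is a hypothetical cost model, not code from the PR: if evaluating a plan with n chained Python UDFs recomputes its upstream child twice, the total cost obeys the recurrence T(n) = 2*T(n-1) + c, which is O(2^n); with the fix, each stage runs once, so T(n) = T(n-1) + c, which is linear. The per-UDF base cost of ~1.5 seconds is taken from the measurements above.

    # Hypothetical cost model (an illustration, not the PR's code):
    # without the fix, each UDF stage re-evaluates its upstream twice,
    # so T(n) = 2 * T(n-1) + base  ->  exponential growth.
    # With the fix, each stage is evaluated once,
    # so T(n) = T(n-1) + base      ->  linear growth.

    def cost_without_fix(n, base=1.5):
        """Upstream recomputed twice per UDF stage: exponential in n."""
        total = 0.0
        for _ in range(n):
            total = 2 * total + base
        return total

    def cost_with_fix(n, base=1.5):
        """Each stage evaluated once: linear in n."""
        return n * base

    for n in range(6):
        print(n, cost_with_fix(n), cost_without_fix(n))

Under this model the "without fix" estimate for 5 UDFs comes out near 46.5 seconds, which tracks the measured 47.6 seconds reasonably well, while the "with fix" estimates grow linearly as observed.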