Github user justinuang commented on the pull request:

    https://github.com/apache/spark/pull/8662#issuecomment-140181530
  
    Sorry for the delay. Here is the code I ran, and here are the results:
    
        from pyspark.sql.functions import udf
        from pyspark.sql.types import IntegerType
        import time

        mult = udf(lambda x: 2 * x, IntegerType())

        for i in range(0, 6):
            df = sqlContext.range(1000000).withColumnRenamed("id", "f")
            # chain i applications of the Python UDF
            for j in range(i):
                df = df.select(mult(df.f).alias('f'))

            start = time.time()
            df.count()  # make sure the Python UDFs are evaluated
            used = time.time() - start
            print "Number of udfs: {} - {}".format(i, used)
    
    The results are as expected. The Python overhead is about 1.5 seconds per
UDF, but you can see how the time grows exponentially without the fix, since
the cost of evaluating the upstream plan twice includes the expensive Python
operations themselves.
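
    To make the expected growth concrete, here is a toy cost model of the
behavior described above (illustrative only; the constant c = 1.5 is an
assumption taken from the observed per-UDF overhead). With the fix, each UDF
adds a roughly constant cost; without it, each UDF re-evaluates its whole
upstream twice, giving the recurrence cost(i) = 2 * cost(i - 1) + c:

        # toy cost model, purely illustrative
        c = 1.5  # assumed cost of one Python UDF pass, in seconds

        def cost_with_fix(i):
            return c * i  # upstream evaluated once per UDF

        def cost_without_fix(i):
            if i == 0:
                return 0.0
            return 2 * cost_without_fix(i - 1) + c  # upstream evaluated twice

        for i in range(0, 6):
            print i, cost_with_fix(i), cost_without_fix(i)

    With c = 1.5 the without-fix case gives 1.5, 4.5, 10.5, 22.5, 46.5, which
lines up with the measured numbers below once the ~1 second base overhead is
subtracted.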
    
        With fix:
        Number of udfs: 0 - 0.091050863266
        Number of udfs: 1 - 1.72215199471
        Number of udfs: 2 - 3.32698297501
        Number of udfs: 3 - 5.64863801003
        Number of udfs: 4 - 7.06328701973
        Number of udfs: 5 - 9.22025489807
    
        Without fix:
        Number of udfs: 0 - 1.00539588928
        Number of udfs: 1 - 3.12671899796
        Number of udfs: 2 - 5.91188406944
        Number of udfs: 3 - 11.124516964
        Number of udfs: 4 - 24.3277280331
        Number of udfs: 5 - 47.621573925
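
    As a quick sanity check on the exponential growth, the without-fix totals
(rounded from the numbers above) roughly double at each step:

        totals = [1.01, 3.13, 5.91, 11.12, 24.33, 47.62]  # without fix, rounded
        for a, b in zip(totals, totals[1:]):
            print round(b / a, 2)  # ratios: 3.1, 1.89, 1.88, 2.19, 1.96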


