Github user justinuang commented on the pull request:

    https://github.com/apache/spark/pull/8662#issuecomment-140181530

    Sorry for the delay, here is the code I ran and here are the results:

        from pyspark.sql.functions import udf
        from pyspark.sql.types import IntegerType
        import time

        mult = udf(lambda x: 2 * x, IntegerType())

        for i in range(0, 6):
            df = sqlContext.range(1000000).withColumnRenamed("id", "f")
            for j in range(i):
                df = df.select(mult(df.f).alias('f'))
            start = time.time()
            df.count()  # make sure the Python UDF is evaluated
            used = time.time() - start
            print "Number of udfs: {} - {}".format(i, used)

    The results are as expected. The Python overhead is about 1.5 seconds per
    UDF, but you can see how the runtime becomes exponential without the fix,
    since the cost of recomputing the upstream plan twice includes the
    expensive Python operations themselves.

    With fix:
        Number of udfs: 0 - 0.091050863266
        Number of udfs: 1 - 1.72215199471
        Number of udfs: 2 - 3.32698297501
        Number of udfs: 3 - 5.64863801003
        Number of udfs: 4 - 7.06328701973
        Number of udfs: 5 - 9.22025489807

    Without fix:
        Number of udfs: 0 - 1.00539588928
        Number of udfs: 1 - 3.12671899796
        Number of udfs: 2 - 5.91188406944
        Number of udfs: 3 - 11.124516964
        Number of udfs: 4 - 24.3277280331
        Number of udfs: 5 - 47.621573925
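A minimal sketch of why the timings diverge the way they do. This is a hypothetical cost model, not code from the PR: if evaluating a plan with n chained Python UDFs recomputes its upstream child twice, the total cost obeys the recurrence T(n) = 2*T(n-1) + c, which is O(2^n); with the fix, each stage runs once, so T(n) = T(n-1) + c, which is linear. The per-UDF base cost of ~1.5 seconds is taken from the measurements above.

    # Hypothetical cost model (an illustration, not the PR's code):
    # without the fix, each UDF stage re-evaluates its upstream twice,
    # so T(n) = 2 * T(n-1) + base  ->  exponential growth.
    # With the fix, each stage is evaluated once,
    # so T(n) = T(n-1) + base      ->  linear growth.

    def cost_without_fix(n, base=1.5):
        """Upstream recomputed twice per UDF stage: exponential in n."""
        total = 0.0
        for _ in range(n):
            total = 2 * total + base
        return total

    def cost_with_fix(n, base=1.5):
        """Each stage evaluated once: linear in n."""
        return n * base

    for n in range(6):
        print(n, cost_with_fix(n), cost_without_fix(n))

Under this model the "without fix" estimate for 5 UDFs comes out near 46.5 seconds, which tracks the measured 47.6 seconds reasonably well, while the "with fix" estimates grow linearly as observed.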