I am trying to do high-performance calculations which require custom functions.
As a first stage I am trying to profile the effect of using a UDF, and I am getting odd results. I created a simple test (at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6100413009427980/1838590056414286/3882373005101896/latest.html). The basic process is as follows:

- I create a dataframe using the range option with 50M records and cache it.
- I then filter for values smaller than 10 and count them: once via column < 10 and once via a UDF.
- I ran each action 10 times to get a good time estimate.

I then ran this test on 3 different systems:

- On the Databricks system, both methods took about the same time: ~4 seconds.
- On the first on-premise cluster (8 nodes, using YARN, each node with ~40GB memory and plenty of cores): 1 second for the first option (no UDF) and 8 seconds for the second (using a UDF).
- On the second on-premise cluster (2 nodes, using YARN, each node with ~16GB memory and few cores): 1 second for the first option (no UDF) and 16 seconds for the second (using a UDF).

I am trying to understand the big differences in performance (I reran the test several times on each system and got fairly consistent results).

Thanks,
Assaf.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/UDF-UDAF-performance-tp27612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.