I am trying to do high-performance calculations which require custom 
functions.

As a first stage I am trying to profile the effect of using UDFs, and I am 
getting weird results.

I created a simple test (in 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6100413009427980/1838590056414286/3882373005101896/latest.html)

The basic process is as follows:
- I create a DataFrame using the range option with 50M records and cache it.
- I then filter for the values smaller than 10 and count them, once by doing 
column < 10 and once via a UDF.
- I ran each action 10 times to get a good time estimate.

I then ran this test on 3 different systems:

- On the Databricks system, both methods took around the same time: ~4 
seconds.

- On the first on-premise cluster (8 nodes, using YARN, each node with ~40GB 
memory and plenty of cores): 1 second for the first option (no UDF) and 8 
seconds for the second (using a UDF).

- On the second on-premise cluster (2 nodes, using YARN, each node with ~16GB 
memory and few cores): 1 second for the first option (no UDF) and 16 seconds 
for the second (using a UDF).

I am trying to understand the big differences in performance (I reran the test 
several times on each of the systems and got relatively consistent results).
Thanks,
Assaf.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/UDF-UDAF-performance-tp27612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.