I am trying to do high-performance calculations which require custom functions.
As a first stage I am trying to profile the effect of using a UDF, and I am getting odd results. I created a simple test (at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6100413009427980/1838590056414286/3882373005101896/latest.html). The basic process is as follows:

- I create a dataframe using the range option with 50M records and cache it.
- I then filter for values smaller than 10 and count them: once via column < 10 and once via a UDF.
- I ran each action 10 times to get a good time estimate.

I then ran this test on 3 different systems:

- On the Databricks system, both methods took about the same time: ~4 seconds.
- On the first on-premise cluster (8 nodes, using YARN, each node with ~40GB memory and plenty of cores): 1 second for the first option (no UDF) and 8 seconds for the second (using a UDF).
- On the second on-premise cluster (2 nodes, using YARN, each node with ~16GB memory and few cores): 1 second for the first option (no UDF) and 16 seconds for the second (using a UDF).

I am trying to understand the big differences in performance (I reran the test several times on each system and got fairly consistent results).

Thanks,
Assaf.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/UDF-UDAF-performance-tp27612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.