Can you try running it directly on Hive, or maybe through spark-sql, to see the timing?
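For reference, the side-by-side timing could be done from the shell roughly like this (a sketch, not a definitive recipe; it assumes both CLIs are on the PATH and point at the same metastore, and uses a trimmed-down version of the query from this thread):

```shell
# Sketch: compare wall-clock timing of the same query in Hive vs spark-sql.
# Assumes the `tb` table from the thread is visible to both engines.
QUERY="select max(length(uid)) from tb"

time hive -e "$QUERY"        # Hive: MapReduce, intermediates spill to disk
time spark-sql -e "$QUERY"   # Spark: keeps intermediates in memory when it can
```

Comparing the `real` times from the two runs gives a rough sense of how much the in-memory execution helps for this data set.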
Spark does what Hive does, that is, processing large sets of data, but it
attempts to do the intermediate iterations in memory if it can (i.e. if
there is enough memory available to keep the data set in memory).
"select 'uid', max(length(uid)), count(distinct(uid)), count(uid),
sum(case when uid is null then 0 else 1 end),
sum(case when uid is null then 1 else 0 end),
sum(case when uid is null then 1 else 0 end)/count(uid)
from tb"
Is this query as-is, or did you use a UDF here?
-Sahil
On Thu, Dec 3, 2015 at 4:06