Can you try running it directly on Hive, or through spark-sql, to see the timing?
Spark does what Hive does, that is, it processes large sets of data, but it attempts to do the intermediate iterations in memory if it can (i.e. if there is enough memory available to keep the data set in memory).
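If it is easier to stay inside the spark-shell, a rough comparison of the cold (uncached) run against a cached run might look like the sketch below. This is only an illustration under assumptions: it assumes a spark-shell session where sqlContext is already defined and the Hive table is named user as in the message below; the timed helper is a hypothetical convenience wrapper, not part of any Spark API.

  // Hypothetical helper to time an action from the spark-shell.
  def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  // First run reads straight from the Hive table (no cache).
  timed("uncached run") {
    sqlContext.sql("select max(length(uid)), count(distinct uid), count(uid) from user").collect()
  }

  // Cache the table, force materialization with an action, then rerun from memory.
  val cached = sqlContext.sql("select * from user")
  cached.cache()
  timed("materializing cache") { cached.count() }
  cached.registerTempTable("user_cached")
  timed("cached run") {
    sqlContext.sql("select max(length(uid)), count(distinct uid), count(uid) from user_cached").collect()
  }

Comparing the two printed timings against the hive or spark-sql CLI runs should show whether the slowness comes from Spark itself or from reading the table each time.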
Dear All,
I have a Hive table with 100 million rows, and I just ran some very simple operations on this dataset, like:
val df = sqlContext.sql("select * from user").toDF()
df.cache()
df.registerTempTable("tb")
val b = sqlContext.sql(
  "select 'uid', max(length(uid)), count(distinct(uid)), count(uid), " +
  "sum(case when uid is null then 0 else 1 end), " +
  "sum(case when uid is null then 1 else 0 end), " +
  "sum(case when uid is null then 1 else 0 end)/count(uid) from tb")
Is this as is, or did you use a UDF here?
-Sahil
On Thu, Dec 3, 2015 at 4:06