Hi Spark users and developers,

My team and I are running some simple benchmarks, and we found a potential
performance issue when using Hive via SparkSQL. We would really appreciate
your help in understanding why it is so slow.

First, we have some text files in HDFS which are also managed by Hive as a
table called "m". There is nothing special about the table name "m".

In pure Spark, I just do the following to get the total number of lines in
the text files:

scala> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count

This takes 2.7 minutes.

With SparkSQL, I do this instead:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
hql("use test")
hql("select count(*) from m").collect.foreach(println)

This takes 11.9 minutes!

This is about 4.4x slower than pure Spark (11.9 minutes vs. 2.7 minutes).

Does anyone know what causes this performance difference?

For the curious, the dataset is about 200-300 GB and we are using 10
machines for this benchmark. Given that the environment is the same for both
experiments, why is pure Spark faster than SparkSQL?
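
In case it helps anyone reproduce the comparison, here is a minimal sketch of
how both counts could be timed in the same spark-shell session. Note that
`timed` is a hypothetical helper written for illustration, not a Spark API,
and the usage lines assume `sc`, `hiveContext` and the table "m" are set up
as described in this message.

```scala
// Hypothetical helper (not part of Spark): evaluates a block once,
// prints the elapsed wall-clock time in seconds, and returns the result.
def timed[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  val elapsedSec = (System.nanoTime() - start) / 1e9
  println(f"$label: $elapsedSec%.1f s")
  result
}

// Assumed usage in spark-shell, after `import hiveContext._` and hql("use test"):
// timed("textFile count") {
//   sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
// }
// timed("hql count") {
//   hql("select count(*) from m").collect()
// }
```

Running both in one session keeps the cluster and configuration the same for
the two measurements.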

Best Regards,

Jerry
