Hi Spark users,

To put the performance issue into perspective, we also ran the same query
on Hive. It took about 5 minutes to run.
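
In case it helps anyone reproduce the numbers, here is a minimal sketch of how
the timings could be taken consistently in the spark-shell. The helper names
`timeMinutes` and `slowdown` are hypothetical (they are not part of Spark), and
this measures wall-clock time for a single run only, so treat it as a rough
sketch rather than a rigorous benchmark harness.

```scala
// Hypothetical helper (not part of Spark): runs a block once and returns
// its result together with the elapsed wall-clock time in minutes.
def timeMinutes[A](block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block
  val elapsedMinutes = (System.nanoTime() - start) / 1e9 / 60.0
  (result, elapsedMinutes)
}

// Hypothetical helper: how many times slower one run is than another.
def slowdown(slowMinutes: Double, fastMinutes: Double): Double =
  slowMinutes / fastMinutes

// Intended use in the spark-shell, with sc and hql set up as in the
// quoted message below (shown as comments because they need a cluster):
//   val (lines, textMinutes) = timeMinutes {
//     sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
//   }
//   val (rows, sqlMinutes) = timeMinutes {
//     hql("select count(*) from m").collect
//   }
//   println(slowdown(sqlMinutes, textMinutes))
```

With the figures reported in this thread, `slowdown(11.9, 2.7)` comes out at
about 4.4, and `slowdown(11.9, 5.0)` at about 2.4 relative to Hive.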

Best Regards,

Jerry




On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam <chiling...@gmail.com> wrote:

> By the way, I also tried hql("select * from m").count. It is terribly slow
> too.
>
>
> On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> I'm running some simple benchmarks with my team, and we found a
>> potential performance issue when using Hive via SparkSQL. It is quite
>> troublesome, so any help in understanding why it is so slow would be
>> much appreciated.
>>
>> First, we have some text files in HDFS which are also managed by Hive as
>> a table called "m". There is nothing special about the table name "m".
>>
>> In pure Spark, I would do the following to get the total number of
>> lines in the text files:
>>
>> scala>
>> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
>>
>> This takes 2.7 minutes.
>>
>> If I use SparkSQL, I will do this:
>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> import hiveContext._
>> hql("use test")
>> hql("select count(*) from m").collect.foreach(println)
>>
>> This takes 11.9 minutes!
>>
>> This is more than 4x slower than using pure Spark.
>>
>> Does anyone know what causes this performance issue?
>>
>> For the curious, the dataset is about 200-300GB and we are using 10
>> machines for this benchmark. Given that the environment is the same for
>> both experiments, why is pure Spark faster than SparkSQL?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>>
>>
>
