Hi Spark users,

To put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run.
Best Regards,

Jerry

On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam <chiling...@gmail.com> wrote:

> By the way, I also tried hql("select * from m").count. It is terribly slow
> too.
>
> On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark users and developers,
>>
>> I'm doing some simple benchmarks with my team, and we found a potential
>> performance issue when using Hive via SparkSQL. It is very bothersome,
>> so your help in understanding why it is terribly slow is very important.
>>
>> First, we have some text files in HDFS which are also managed by Hive as
>> a table called "m". There is nothing special about the table name "m".
>>
>> In the pure Spark way, I would just do the following to get the total
>> number of lines in the text files:
>>
>> scala> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count
>>
>> This takes 2.7 minutes.
>>
>> If I use SparkSQL, I do this:
>>
>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> import hiveContext._
>> hql("use test")
>> hql("select count(*) from m").collect.foreach(println)
>>
>> This takes 11.9 minutes!
>>
>> This is about 4x slower than using pure Spark.
>>
>> I wonder if anyone knows what causes the performance issue?
>>
>> For the curious mind, the dataset is about 200-300GB and we are using 10
>> machines for this benchmark. Given that the environment is the same for
>> both experiments, why is pure Spark faster than SparkSQL?
>>
>> Best Regards,
>>
>> Jerry