Re: Using HQL is terribly slow: Potential Performance Issue
Hey Jerry,

When you ran these queries using different methods, did you see any discrepancy in the returned results (i.e. the counts)?

On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.com wrote:
> [...]
Using HQL is terribly slow: Potential Performance Issue
Hi Spark users and developers,

I'm doing some simple benchmarks with my team and we found a potential performance issue when using Hive via Spark SQL. It is very bothersome, so your help in understanding why it is so slow is very, very important.

First, we have some text files in HDFS which are also managed by Hive as a table called m. There is nothing special about the table name m. The pure Spark way to get the total number of lines in the text files is:

    scala> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count

This takes 2.7 minutes.

If I use Spark SQL, I do this:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext._
    hql("use test")
    hql("select count(*) from m").collect.foreach(println)

This takes 11.9 minutes, about 4x slower than pure Spark! I wonder if anyone knows what causes the performance issue?

For the curious mind: the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given that the environment is identical between the two experiments, why is pure Spark faster than Spark SQL?

Best Regards,
Jerry
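The two timings above can be compared in one spark-shell session with a small helper. This is a hypothetical sketch, not part of the original benchmark; `time` and `label` are names introduced here for illustration.

```scala
// Minimal wall-clock timing helper for ad-hoc spark-shell benchmarks
// (hypothetical; not from the original post). Evaluates the body once,
// prints the elapsed time, and returns the body's result.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  val secs = (System.nanoTime() - start) / 1e9
  println(f"$label took $secs%.1f s")
  result
}
```

Usage would look like `time("raw count") { sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count }` versus `time("hql count") { hql("select count(*) from m").collect }`, so both paths are measured the same way.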
Re: Using HQL is terribly slow: Potential Performance Issue
By the way, I also tried hql("select * from m").count. It is terribly slow too.

On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote:
> [...]
Re: Using HQL is terribly slow: Potential Performance Issue
Hi Spark users,

Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run.

Best Regards,
Jerry

On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote:
> [...]
Re: Using HQL is terribly slow: Potential Performance Issue
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote:
> For the curious mind, the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given the env is equal between the two experiments, why pure spark is faster than SparkSQL?

There is going to be some overhead to parsing data using the Hive SerDes instead of the native Spark code; however, the slowdown you are seeing here is much larger than I would expect. Can you tell me more about the table? What does the schema look like? Is it partitioned?

> By the way, I also try hql("select * from m").count. It is terribly slow too.

FYI, this query is actually identical to the one where you write out COUNT(*).
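The SerDe overhead Michael describes can be illustrated in plain Scala. This is a toy sketch, not the actual Hive SerDe code path: counting raw lines does no per-line work beyond iteration, while a SerDe-style path splits every line into its columns before the count is taken.

```scala
// Toy illustration of parsed-count vs raw-count cost (hypothetical;
// not the actual Hive SerDe implementation). The table in the thread
// has 11 String columns, so each fake line carries 11 fields.
val lines = Seq.fill(100000)("c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11")

// Raw count: just the number of lines, no parsing.
val rawCount = lines.size

// SerDe-style count: every line is split into its 11 string columns
// first, even though only the row count is needed.
val parsedCount = lines.map(_.split(",", -1)).count(_.length == 11)

assert(rawCount == parsedCount) // same answer, far more work per line
```

The extra allocation and splitting per row is real overhead, but as Michael notes, it should not account for a 4x gap on its own.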
Re: Using HQL is terribly slow: Potential Performance Issue
Hi Michael,

Yes, the table is partitioned on 1 column. There are 11 columns in the table and they are all String type.

I understand that the SerDes contribute some overhead, but using pure Hive we could run the query about 5 times faster than Spark SQL. Given that Hive has the same SerDes overhead, there must be something additional that Spark SQL adds to the overall overhead that Hive doesn't have.

Best Regards,
Jerry

On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com wrote:
> [...]
Re: Using HQL is terribly slow: Potential Performance Issue
Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at Databricks to investigate: https://issues.apache.org/jira/browse/SPARK-2443

On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam chiling...@gmail.com wrote:
> [...]