Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry,

When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?






Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users and developers,

My team and I are running some simple benchmarks, and we found a potential
performance issue when using Hive via Spark SQL. It is very puzzling, so any
help in understanding why it is so slow would be greatly appreciated.

First, we have some text files in HDFS which are also managed by Hive as a
table called m. There is nothing special about the table name m.

The pure Spark way to get the total line count of the text files is:

scala> sc.textFile("hdfs://namenode:8020/user/hive/warehouse/test.db/m/*").count

This takes 2.7 minutes.

If I use Spark SQL, I do this:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
hql("use test")
hql("select count(*) from m").collect.foreach(println)

This takes 11.9 minutes!

That is about 4x slower than pure Spark.

Does anyone know what causes the performance issue?

For the curious: the dataset is about 200-300 GB, and we are using 10
machines for this benchmark. Given that the environment is identical in the
two experiments, why is pure Spark faster than Spark SQL?

Best Regards,

Jerry
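For anyone trying to reproduce this, both measurements can be taken in a single spark-shell session, so the cluster, JVMs, and input data are identical for the two timings. A sketch (the time helper is illustrative and not from the original post; assumes a Spark build with Hive support):

```scala
// Sketch: run both counts back to back in one spark-shell session so
// the environment is identical for the two measurements.
val path = "hdfs://namenode:8020/user/hive/warehouse/test.db/m/*"

// Illustrative timing helper (not part of the original post).
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 60e9}%.1f min")
  result
}

// Pure Spark: count raw lines.
val rawCount = time("textFile count") { sc.textFile(path).count() }

// Spark SQL through the Hive metastore.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
hql("use test")
val sqlCount = time("hql count(*)") { hql("select count(*) from m").collect() }
```

Running both in one session also rules out differences in executor warm-up or HDFS cache state between the two experiments.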


Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql("select * from m").count. It is terribly slow
too.








Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users,

Also, to put the performance issue into perspective, we ran the same query
directly in Hive. It took about 5 minutes.

Best Regards,

Jerry
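Lining up the three timings reported in this thread (just an arithmetic check of the ratios, using the quoted numbers):

```scala
// Wall-clock times reported in this thread, in minutes.
val times = Seq(
  "pure Spark textFile.count" -> 2.7,
  "Hive"                      -> 5.0,
  "Spark SQL hql count(*)"    -> 11.9
)
val baseline = times.head._2
for ((name, minutes) <- times)
  println(f"$name%-26s $minutes%5.1f min  (${minutes / baseline}%.1fx pure Spark)")
```

So Hive is roughly 1.9x the pure-Spark time, and Spark SQL roughly 4.4x.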











Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote:

 For the curious: the dataset is about 200-300 GB, and we are using 10
 machines for this benchmark. Given that the environment is identical in
 the two experiments, why is pure Spark faster than Spark SQL?


There is going to be some overhead from parsing data with the Hive SerDes
instead of the native Spark code; however, the slowdown you are seeing here
is much larger than I would expect. Can you tell me more about the table?
What does the schema look like? Is it partitioned?

 By the way, I also tried hql("select * from m").count. It is terribly slow
 too.


FYI, this query is actually identical to the one where you write out
COUNT(*).
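Assuming the same HiveContext setup as in the original post, the two forms side by side (a sketch; per the reply above they should do equivalent work):

```scala
// Both count the rows of table m. The second form returns a SchemaRDD
// whose .count tallies the same rows the explicit COUNT(*) aggregates,
// so neither should be meaningfully cheaper than the other.
val viaAggregate = hql("select count(*) from m").collect()  // one Row holding the count
val viaCount     = hql("select * from m").count()           // Long
```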


Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Michael,

Yes, the table is partitioned on one column. There are 11 columns in the
table, all of type String.

I understand that the SerDes contributes some overhead, but in pure Hive we
could run the query about 5 times faster than in Spark SQL. Since Hive pays
the same SerDes cost, there must be some additional overhead in Spark SQL
that Hive doesn't have.

Best Regards,

Jerry
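One way to see where the extra time goes is to compare what each query actually executes. A sketch, assuming the hql-enabled shell from earlier in the thread (whether EXPLAIN is passed through, and its output format, depends on the Spark version):

```scala
// Inspect the plan Spark SQL builds for the aggregate query...
hql("explain select count(*) from m").collect().foreach(println)

// ...and for the full scan behind select *. Comparing the two plans,
// and the number of tasks each stage launches over the partitioned
// table, may show where the extra time is spent.
hql("explain select * from m").collect().foreach(println)
```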






Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
Yeah, sorry. I think you are seeing some weirdness with partitioned tables
that I have also seen elsewhere. I've created a JIRA and assigned someone
at Databricks to investigate.

https://issues.apache.org/jira/browse/SPARK-2443

