Re: performance difference between Thrift server and SparkSQL?

2015-10-05 Thread Jeff Thompson
Thanks for the suggestion. The output from EXPLAIN is indeed equivalent both in sparkSQL and via the Thrift server. I did some more testing. The source of the performance difference is the way I was triggering the sparkSQL query: I was using .count() instead of .collect(). When I use

Re: performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Michael Armbrust
Underneath the covers, the thrift server is just calling hiveContext.sql(...) so this is surprising. Maybe running EXPLAIN or EXPLAIN

performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Jeff Thompson
Hi, I'm running a simple SQL query over a ~700 million row table of the form: SELECT * FROM my_table WHERE id = '12345'; When I submit the query via beeline and the JDBC Thrift server, it returns in 35s. When I submit the exact same query using sparkSQL from a pyspark shell (sqlContext.sql("SELECT *