Under the covers, the Thrift server just calls hiveContext.sql(...) (see <https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L224>), so this is surprising. Maybe running EXPLAIN or EXPLAIN EXTENDED in both modes would help with debugging?
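For instance, something like this (a sketch using the table and predicate from the report below; adjust names to your environment) would let you diff the two plans directly:

```sql
-- Run in beeline against the Thrift server, then compare the output
-- with the plan printed in the pyspark shell by
--   sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345'")
-- If the optimizer is behaving the same in both modes, the physical
-- plans should match.
EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345';
```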
On Sat, Oct 3, 2015 at 1:08 PM, Jeff Thompson <jeffreykeatingthomp...@gmail.com> wrote:

> Hi,
>
> I'm running a simple SQL query over a ~700 million row table of the form:
>
> SELECT * FROM my_table WHERE id = '12345';
>
> When I submit the query via beeline & the JDBC Thrift server, it returns in 35s.
> When I submit the exact same query using sparkSQL from a pyspark shell
> (sqlContext.sql("SELECT * FROM ....")), it returns in 3s.
>
> Both times are obtained from the Spark web UI. The query only returns 43
> rows, a small amount of data.
>
> The table was created by saving a sparkSQL dataframe as a parquet file and
> then calling createExternalTable.
>
> I have tried to ensure that all relevant cluster parameters are equivalent
> across the two queries:
> spark.executor.memory = 6g
> spark.executor.instances = 100
> no explicit caching (the storage tab in the web UI is empty)
> spark version: 1.4.1
> Hadoop v2.5.0-cdh5.3.0, running Spark on top of YARN
> jobs run on the same physical cluster (on-site hardware)
>
> From the web UIs, I can see that the query plans are clearly different,
> and I think this may be the source of the performance difference.
>
> Thrift server job:
> 1 stage only, stage 1 (35s): map -> Filter -> mapPartitions
>
> SparkSQL job:
> 2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange,
> stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
>
> Is this a known issue? Is there anything I can do to get the Thrift server
> to use the same query optimizer as the one used by sparkSQL? I'd love to
> pick up a ~10x performance gain for my jobs submitted via the Thrift server.
>
> Best regards,
>
> Jeff