Under the covers, the thrift server is just calling hiveContext.sql(...) (see
<https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L224>),
so this is surprising.  Maybe running EXPLAIN or EXPLAIN EXTENDED in both
modes would be helpful in debugging?
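For instance, using the table and predicate from the report below (just a sketch, not verified against that cluster), in beeline you could run:

```sql
EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345';
```

and compare the output against the same statement issued as sqlContext.sql("EXPLAIN EXTENDED ...") from the pyspark shell; if the physical plans differ, that would confirm the two paths are planning the query differently.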

On Sat, Oct 3, 2015 at 1:08 PM, Jeff Thompson <
jeffreykeatingthomp...@gmail.com> wrote:

> Hi,
>
> I'm running a simple SQL query over a ~700 million row table of the form:
>
> SELECT * FROM my_table WHERE id = '12345';
>
> When I submit the query via beeline & the JDBC thrift server it returns in
> 35s
> When I submit the exact same query using sparkSQL from a pyspark shell
> (sqlContext.sql("SELECT * FROM ....")) it returns in 3s.
>
> Both times are obtained from the spark web UI.  The query only returns 43
> rows, a small amount of data.
>
> The table was created by saving a sparkSQL dataframe as a parquet file and
> then calling createExternalTable.
>
> I have tried to ensure that all relevant cluster parameters are equivalent
> across the two queries:
> spark.executor.memory = 6g
> spark.executor.instances = 100
> no explicit caching (storage tab in web UI is empty)
> spark version: 1.4.1
> Hadoop v2.5.0-cdh5.3.0, running spark on top of YARN
> jobs run on the same physical cluster (on-site hardware)
>
> From the web UIs, I can see that the query plans are clearly different,
> and I think this may be the source of the performance difference.
>
> Thrift server job:
> 1 stage only, stage 1 (35s) map -> Filter -> mapPartitions
>
> SparkSQL job:
> 2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate -> Exchange,
> stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
>
> Is this a known issue?  Is there anything I can do to get the Thrift server
> to use the same query optimizer as the one used by sparkSQL?  I'd love to
> pick up a ~10x performance gain for my jobs submitted via the Thrift server.
>
> Best regards,
>
> Jeff
>