[ https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275483#comment-15275483 ]

Raymond Honderdors commented on SPARK-14946:
--------------------------------------------

Currently, the JDBC/ODBC Thrift Server shares a cluster-global HiveContext 
(a special type of SQLContext).

When you run a query through the Thrift Server, the results are returned in 
one of two ways, depending on the spark.sql.thriftServer.incrementalCollect 
flag (true/false):

1) false: the Driver calls collect() to retrieve all partitions from all 
Worker nodes, then sends the data back to the client.

This option retrieves all partitions in parallel and is therefore faster 
overall.

However, this may not be an option, as the Driver currently has a 4 GB limit 
when calling collect(). Calling collect() on result sets over 4 GB will fail.
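(As an aside: in stock Spark, the driver-side cap on collect()ed results is 
governed by the standard spark.driver.maxResultSize setting; the 4 GB figure 
above is the commenter's. A sketch of raising it in spark-defaults.conf, with 
an illustrative value:)

```
# spark-defaults.conf -- illustrative value, tune to driver heap size
spark.driver.maxResultSize  4g
```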

2) true: the Driver calls foreachPartition() to retrieve each partition 
incrementally and sequentially from each Worker. As each partition arrives, 
the Driver sends it back to the client. This option handles result sets 
larger than 4 GB, but will likely be slower overall.
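The trade-off between the two modes can be sketched with plain Python lists 
standing in for partitions. This is an illustrative model only, not the Thrift 
Server's actual code path; the names collect_all and collect_incrementally are 
hypothetical:

```python
from typing import Iterator, List

# Toy stand-in for a distributed result set: each inner list is one partition.
partitions: List[List[int]] = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def collect_all(parts: List[List[int]]) -> List[int]:
    """Mode 1 (incrementalCollect=false): materialize every partition on the
    driver at once. Fast, but the whole result must fit in driver memory."""
    rows: List[int] = []
    for p in parts:  # in Spark, collect() fetches these in parallel
        rows.extend(p)
    return rows

def collect_incrementally(parts: List[List[int]]) -> Iterator[int]:
    """Mode 2 (incrementalCollect=true): stream one partition at a time, so
    the driver only ever holds a single partition in memory."""
    for p in parts:  # partitions are fetched sequentially
        for row in p:
            yield row

print(collect_all(partitions))                  # all rows at once
print(list(collect_incrementally(partitions)))  # same rows, streamed
```

In a real deployment the mode is switched via the 
spark.sql.thriftServer.incrementalCollect flag mentioned above, e.g. in 
spark-defaults.conf.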

> Spark 2.0 vs 1.6.1 Query Time(out)
> ----------------------------------
>
>                 Key: SPARK-14946
>                 URL: https://issues.apache.org/jira/browse/SPARK-14946
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Raymond Honderdors
>            Priority: Critical
>         Attachments: Query Plan 1.6.1.png, screenshot-spark_2.0.png, 
> spark-defaults.conf, spark-env.sh
>
>
> I ran a query using the JDBC driver on version 1.6.1 and it returned after 
> 5-6 min. The same query against version 2.0 fails after 2 h (due to 
> timeout). For details on how to reproduce, also see the comments below.
> Here is what I tried.
> I ran the following query: select * from pe_servingdata sd inner join 
> pe_campaigns_gzip c on sd.campaignid = c.campaign_id ;
> (with and without a count and group by on campaign_id)
> I ran Spark 1.6.1 and the Thrift Server, then ran the SQL from Beeline or 
> SQuirreL. After a few minutes I got an answer (0 rows), which is correct, 
> since my data did not have matching campaign ids in both tables.
> When I ran Spark 2.0 and the Thrift Server, I once again ran the SQL 
> statement, and after 2:30 min it gave up, but already after 30/60 seconds 
> I stopped seeing activity on the Spark UI.
> (Sorry for the delay in completing the description of the bug; I was on 
> and off work due to national holidays.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
