[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211748#comment-15211748
 ] 

Rui Li commented on HIVE-13293:
-------------------------------

Just did some research about this. Actually the overhead is not so big as I 
thought. If a query is complicated, we'll have multiple stages. For spark, 
intermediate stages are called {{ShuffleMapStage}} and the last stage is called 
{{ResultStage}}. Suppose we have the following stage graph:
{noformat}
ShuffleMapStage1
      | (join)
ShuflleMapStage2
      | (groupBy)
ShuffleMapStage3
      | (sortByKey)
   ResultStage4
{noformat}
When calling sortByKey, spark launches the sampling job. The job triggers 
computation of ShuffleMapStage1, ShuflleMapStage2 and a ResultStage that shares 
most of ShuffleMapStage3. When we launch the real job, we'll submit the above 
stage graph. But at this point, spark will consider ShuffleMapStage1 and 
ShuflleMapStage2 as already computed because the shuffle outputs are still in 
local disk. Therefore what's re-computed is just ShuffleMapStage3. I have done 
some tests to verify this.
That being said, when ShuffleMapStage3 is complicated enough, we'll still have 
some considerable overhead. And I think that's the case for Q10 in TPCx-BB.
Rather than splitting the task, I think a better and easier way is to cache the 
RDD before calling sortByKey. We can use DISK_ONLY storage level if memory is a 
concern. I'll come up with a patch for review.

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-13293
>                 URL: https://issues.apache.org/jira/browse/HIVE-13293
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Lifeng Wang
>            Assignee: Rui Li
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found 
> query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to