[ 
https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Sherman reassigned HIVE-17677:
-------------------------------------

    Assignee:     (was: Andrew Sherman)

> Investigate using hive statistics information to optimize HoS parallel order 
> by
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-17677
>                 URL: https://issues.apache.org/jira/browse/HIVE-17677
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Andrew Sherman
>            Priority: Major
>
> I think Spark's native parallel order by works in a similar way to what we do 
> for Hive-on-MR.  That is, it scans the RDD once and sample the data to 
> determine what ranges the data should be partitioned into, and then scans the 
> RDD again to do the actual order by (with multiple reducers). 
> One optimization suggested by [~stakiar] is that if we have column stats 
> about the col we are ordering by, then the first scan on the RDD is not 
> necessary. If we have histogram data about the RDD, we already know what the 
> ranges of the order by should be. This should work when running parallel 
> order by on simple tables, will be harder when we run it on derived datasets 
> (although not impossible). 
> To do his we would have to understand more about the internals of 
> JavaPairRDD. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to