[jira] [Assigned] (HIVE-17677) Investigate using hive statistics information to optimize HoS parallel order by

2018-12-11 Thread Andrew Sherman (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Sherman reassigned HIVE-17677:
-

Assignee: (was: Andrew Sherman)

> Investigate using hive statistics information to optimize HoS parallel order 
> by
> ---
>
> Key: HIVE-17677
> URL: https://issues.apache.org/jira/browse/HIVE-17677
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Andrew Sherman
>Priority: Major
>
> I think Spark's native parallel order by works in a similar way to what we do 
> for Hive-on-MR. That is, it scans the RDD once and samples the data to 
> determine what ranges the data should be partitioned into, and then scans the 
> RDD again to do the actual order by (with multiple reducers). 
> One optimization suggested by [~stakiar] is that if we have column stats 
> about the column we are ordering by, then the first scan on the RDD is not 
> necessary. If we have histogram data about the RDD, we already know what the 
> ranges of the order by should be. This should work when running parallel 
> order by on simple tables, but will be harder when we run it on derived 
> datasets (although not impossible). 
> To do this we would have to understand more about the internals of 
> JavaPairRDD. 
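
The description above proposes skipping Spark's sampling pass by deriving partition range
boundaries from Hive column statistics. A minimal sketch of that idea, assuming long-valued
sort keys and a hypothetical boundary array computed elsewhere from histogram/min-max stats;
the class name and helper method are illustrative and are not part of Hive or Spark:

    import java.util.Arrays;

    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    // Sketch only: a range partitioner whose split points come from Hive column
    // statistics (e.g. histogram quantiles) instead of Spark's sampling scan.
    public class StatsRangePartitioner extends Partitioner {

      // Upper bound of each partition except the last, in ascending order,
      // assumed to have been computed from column stats.
      private final long[] bounds;

      public StatsRangePartitioner(long[] boundsFromColumnStats) {
        this.bounds = boundsFromColumnStats;
      }

      @Override
      public int numPartitions() {
        return bounds.length + 1;
      }

      @Override
      public int getPartition(Object key) {
        long k = ((Number) key).longValue();
        int i = Arrays.binarySearch(bounds, k);
        // binarySearch returns (-(insertionPoint) - 1) when the key is absent.
        return i >= 0 ? i : -(i + 1);
      }

      // Usage sketch: instead of rdd.sortByKey(), which samples the data to build
      // a RangePartitioner, partition by the stats-derived bounds and sort within
      // each partition, so the data is only scanned once.
      public static JavaPairRDD<Long, String> orderByWithStats(
          JavaPairRDD<Long, String> rdd, long[] boundsFromColumnStats) {
        return rdd.repartitionAndSortWithinPartitions(
            new StatsRangePartitioner(boundsFromColumnStats));
      }
    }
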



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-17677) Investigate using hive statistics information to optimize HoS parallel order by

2017-10-02 Thread Andrew Sherman (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Sherman reassigned HIVE-17677:
-


> Investigate using hive statistics information to optimize HoS parallel order 
> by
> ---
>
> Key: HIVE-17677
> URL: https://issues.apache.org/jira/browse/HIVE-17677
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Andrew Sherman
>Assignee: Andrew Sherman
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)