[jira] [Assigned] (HIVE-17677) Investigate using hive statistics information to optimize HoS parallel order by
[ https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Sherman reassigned HIVE-17677: - Assignee: (was: Andrew Sherman) > Investigate using hive statistics information to optimize HoS parallel order > by > --- > > Key: HIVE-17677 > URL: https://issues.apache.org/jira/browse/HIVE-17677 > Project: Hive > Issue Type: Improvement >Affects Versions: 3.0.0 >Reporter: Andrew Sherman >Priority: Major > > I think Spark's native parallel order by works in a similar way to what we do > for Hive-on-MR. That is, it scans the RDD once and sample the data to > determine what ranges the data should be partitioned into, and then scans the > RDD again to do the actual order by (with multiple reducers). > One optimization suggested by [~stakiar] is that if we have column stats > about the col we are ordering by, then the first scan on the RDD is not > necessary. If we have histogram data about the RDD, we already know what the > ranges of the order by should be. This should work when running parallel > order by on simple tables, will be harder when we run it on derived datasets > (although not impossible). > To do his we would have to understand more about the internals of > JavaPairRDD. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVE-17677) Investigate using hive statistics information to optimize HoS parallel order by
[ https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Sherman reassigned HIVE-17677: - > Investigate using hive statistics information to optimize HoS parallel order > by > --- > > Key: HIVE-17677 > URL: https://issues.apache.org/jira/browse/HIVE-17677 > Project: Hive > Issue Type: Improvement >Affects Versions: 3.0.0 >Reporter: Andrew Sherman >Assignee: Andrew Sherman > > I think Spark's native parallel order by works in a similar way to what we do > for Hive-on-MR. That is, it scans the RDD once and sample the data to > determine what ranges the data should be partitioned into, and then scans the > RDD again to do the actual order by (with multiple reducers). > One optimization suggested by [~stakiar] is that if we have column stats > about the col we are ordering by, then the first scan on the RDD is not > necessary. If we have histogram data about the RDD, we already know what the > ranges of the order by should be. This should work when running parallel > order by on simple tables, will be harder when we run it on derived datasets > (although not impossible). > To do his we would have to understand more about the internals of > JavaPairRDD. -- This message was sent by Atlassian JIRA (v6.4.14#64029)