[
https://issues.apache.org/jira/browse/HIVE-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ankit Kamboj updated HIVE-7989:
-------------------------------
Status: Patch Available (was: Reopened)
On queries:
Query1: select a.*,sum(b) over(partition by a order by b desc ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW) from window_test_1 a limit 100;
Query2: select a.*,sum(b) over(partition by a order by b desc ROWS BETWEEN
CURRENT ROW AND UNBOUNDED FOLLOWING) from window_test_1 a limit 100;
The performance impact that I have seen is:
a) With dataset with 5 partitions each having 10K rows, the performance
improves by 53%
b) With dataset with 5 partitions each having 20K rows, the performance
improves by 70%
c) With dataset with 5 partitions each having 50K rows, the performance
improves by 91%
> Optimize Windowing function performance for row frames
> ------------------------------------------------------
>
> Key: HIVE-7989
> URL: https://issues.apache.org/jira/browse/HIVE-7989
> Project: Hive
> Issue Type: Improvement
> Components: PTF-Windowing
> Affects Versions: 0.13.0
> Reporter: Ankit Kamboj
> Attachments: HIVE-7989.patch
>
>
> To find aggregate value for each row, current windowing function
> implementation creates a new aggregation buffer for each row, iterates over
> all the rows in respective window frame, puts them in buffer and then finds
> the aggregated value. This causes bottleneck for partitions with huge number
> of rows because this process runs in n-square complexity (n being rows in a
> partition) for each partition. So, if there are multiple partitions in a
> dataset, each with millions of rows, aggregation for all rows will take days
> to finish.
> There is scope of optimization for row frames, for following cases:
> a) For UNBOUNDED PRECEDING start and bounded end: Instead of iterating on
> window frame again for each row, we can slide the end one row at a time and
> aggregate, since we know the start is fixed for each row. This will have
> running time linear to the size of partition.
> b) For bounded start and UNBOUNDED FOLLOWING end: Instead of iterating on
> window frame again for each row, we can slide the start one row at a time and
> aggregate in reverse, since we know the end is fixed for each row. This will
> have running time linear to the size of partition.
> Also, In general for both row and value frames, we don't need to iterate over
> the range and re-create aggregation buffer if the start as well as end remain
> same. Instead, can re-use the previously created aggregation buffer.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)