[ https://issues.apache.org/jira/browse/HIVE-21217?focusedWorklogId=199264&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-199264 ]
ASF GitHub Bot logged work on HIVE-21217:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Feb/19 14:40
            Start Date: 15/Feb/19 14:40
    Worklog Time Spent: 10m
      Work Description: szlta commented on pull request #538: HIVE-21217: Optimize range calculation for PTF
URL: https://github.com/apache/hive/pull/538

   @pvary

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 199264)
    Time Spent: 10m
    Remaining Estimate: 0h

> Optimize range calculation for PTF
> ----------------------------------
>
>                 Key: HIVE-21217
>                 URL: https://issues.apache.org/jira/browse/HIVE-21217
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21217.0.patch, HIVE-21217.1.patch,
> HIVE-21217.2.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> During window function execution Hive has to iterate over the rows
> neighbouring the current row to find the beginning and end of the proper
> range (on which the aggregation will be executed).
> When we're using range-based windows and have many rows with a certain key
> value, this can take a lot of time (e.g. with a partition size of 80M in
> which we have 2 ranges of 40M rows according to the order-by column, within
> each of these 40M-row sets we're doing 40M x 40M/2 steps, which is n^2 time
> complexity).
> I propose to introduce a cache that keeps track of already calculated range
> ends so that they can be reused in future scans.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
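
To make the idea in the issue description concrete, here is a minimal sketch of a range-end cache. It is not the code from the attached patches or from pull request #538. The class name RangeEndCache, the method rangeEnd and the steps counter are hypothetical names chosen for illustration. The sketch assumes a single partition whose order-by values are already sorted, and a window whose end is the last peer of the current row (rows with an equal order-by value).

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not Hive's actual PTF classes.
class RangeEndCache {
    // Order-by values of one partition, assumed to be sorted already.
    private final long[] keys;
    // Order-by value -> exclusive index of the end of its run of peer rows.
    private final Map<Long, Integer> endByKey = new HashMap<>();
    // Number of forward scan steps taken, kept only to show the saving.
    long steps = 0;

    RangeEndCache(long[] sortedKeys) {
        this.keys = sortedKeys;
    }

    // Returns the exclusive end index of the peer range of the given row.
    int rangeEnd(int row) {
        long key = keys[row];
        Integer cached = endByKey.get(key);
        if (cached != null) {
            return cached; // reuse the end found by an earlier scan
        }
        int end = row;
        // Scan forward while the neighbouring rows share the same key.
        while (end < keys.length && keys[end] == key) {
            end++;
            steps++;
        }
        endByKey.put(key, end);
        return end;
    }
}
```

Without the map, every row in a run of n equal keys would repeat the forward scan, giving roughly n x n/2 steps as described in the issue. With it, only the first row of each run scans, and the later rows are answered from the cache. For the keys {1, 1, 1, 1, 2, 2, 2, 2}, calling rangeEnd for each of the 8 rows takes 8 scan steps in total instead of 20.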