[ https://issues.apache.org/jira/browse/HIVE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547625#comment-13547625 ]
Harish Butani commented on HIVE-896: ------------------------------------ Hi Alan, Thanks for taking the time. Here are my responses: 1. Could you point out the interfaces... Yes you are right, from a function writer perspective TableFunctionEvaluator, TableFunctionResolver are the important ifcs; PTFPartition(and PTFPartitionIterator) is the data container ifc. 2. If I read this right you are using CLUSTER BY and SORT BY instead of PARTITION BY and ORDER BY for syntax in OVER. Why? To highlight the similarity. The Partition/Order specs in a Window clause have the same meaning as Cluster/Distribute in HQL. Note you can use a Cluster/Distribute at the query level and not specify any Partition spec in a Window clause. So the following are different ways for saying the same thing: a. select p_mfgr, p_name, sum(p_retailprice) over (distribute by p_mfgr sort by p_name rows between unbounded preceding and current row) from part; b. select p_mfgr, p_name, p_size, sum(p_retailprice) over (rows between unbounded preceding and current row) from part distribute by p_mfgr sort by p_name; c. select p_mfgr, p_name, p_size, sum(p_retailprice) over (w1) from part window w1 as distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following; (I just realized that there are no egs of using Cluster/Distribute in Wdw clauses in the tests; we are adding them now) 3. Can I put one of the existing aggregate functions in an OVER clause using this? I am not exactly clear what your question is. I may have answered it above. To be clear there is no special Window Function. Any existing Hive UDAF invocation can have a Windowing specification. tests 31,40,41 cover most of the UDAFs. 4. Could you explain how the partition is handled in memory... Partitions are backed by a Persistent List ( see ptf.ds.PartitionedByteBasedList) . We need do to some work to refactor this package. Yes you are right, things can be done in delaying bringing rows into a partition and getting rid of rows once outside the window. This is true for Windowing Table Function; especially for Range based Windows. But for a general PTF the contract is Partition in Partition out. For e.g. CandidateFrequency function will read the rows in a partition multiple times. The PartitionedByteBasedList is backed by a set of PersistentByteBasedLists which uses weak refs and stores its data on disk. Done some testing with partitions with a million rows. But I agree with what you are getting at: there is stuff that can be done to reduce the memory footprint. Haven't gotten around to it.... > Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive. > --------------------------------------------------------------- > > Key: HIVE-896 > URL: https://issues.apache.org/jira/browse/HIVE-896 > Project: Hive > Issue Type: New Feature > Components: OLAP, UDF > Reporter: Amr Awadallah > Priority: Minor > Attachments: HIVE-896.1.patch.txt > > > Windowing functions are very useful for click stream processing and similar > time-series/sliding-window analytics. > More details at: > http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1006709 > http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007059 > http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007032 > -- amr -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira