On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <vil...@hazelcast.com> wrote:
> > 2. rename the hop_size to emit_frequency. Because that's what this > argument tries to say: e.g. emit in every x time_unit. > > It's not about how often you emit. By "emit" I mean the moment when the > rows are sent out. For example, when working with event time and the input > stalls, you might emit much later. Or you can use it for historical data in > which case the actual interval will be much shorter. It's the time interval > by which the window "hops". Two subsequent windows will be this much apart. > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented by up to 4 > member functions: describe, start, fulfill, finish. The `describe` function > is supposed to be called during validation. It can validate the arguments. > In case of TUMBLE, it can check whether the column described by the > descriptor is present in the input table and if it is of a correct type. It > can throw errors which will become validation errors. Yes. Descriptor has a feature to enable a type checking. If considering it, functions will be (argument name not decided yet): TUMBLE( data TABLE, time_column DESCRIPTOR(*TIMESTAMP*), window_size INTERVAL DAY TO SECOND, offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS) HOP( data TABLE, time_column DESCRIPTOR(*TIMESTAMP*), window_size INTERVAL DAY TO SECOND, emit_frequency INTERVAL DAY TO SECOND, offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS) > It also determines > the actual output table type. > > I think it's worthwhile to read the PTF spec before implementing this, it's > freely available here: > > https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip > > Viliam > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <amaliu...@apache.org> wrote: > > > Thanks Danny for your summary! > > > > For question one, I prefer the ordering of parameter in option two. > > However, for argument names, I think it might be better to > > 1. name the window size explicitly by "window_size". > > 2. rename the hop_size to emit_frequency. Because that's what this > > argument tries to say: e.g. emit in every x time_unit. > > 3. don't use "table" as the first argument name, because "table" is a > > keyword. So we can replace it by "data" > > > > so it might be > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset]) > > HOP(data, DESCRIPTOR(time_column), window_size, emit_frequency[, offset]) > > > > Meanwhile, one thing worths mentioning is, like Viliam always discussed > in > > CALCITE-4000[1], we can also finalize the data types for each signature: > > TUMBLE( > > data TABLE, > > time_column DESCRIPTOR, > > window_size INTERVAL DAY TO SECOND, > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS) > > > > HOP( > > data TABLE, > > time_column DESCRIPTOR, > > window_size INTERVAL DAY TO SECOND, > > emit_frequency INTERVAL DAY TO SECOND, > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS) > > > > So those intervals have to be INTERVAL DAY TO SECOND, not INTERVAL YEAR > TO > > MONTH, because year to month varies in seconds. > > > > > > Regarding question two, I tried to dig into codebase and I think it could > > be a RexCorrelVariable to replace that table_name, so it becomes > > table_function(RexCorrelVariable, DESCRIPTOR($1), internal:INTERVAL > MINUTE) > > > > I know RexCorrelVariable might be not designed for this purpose, but I > > find RexCorrelVariable is really useful because: > > 1. It can provide an id, to indicate which input of TableFunctionScan > that > > this table_function is applied on. Note that TableFunctionScan might not > > have one input. > > 2. It can provide RelDataType, to save the referenced input's schema. The > > input schema is useful during operator argument type validation (for > > DESCRIPTOR) and operator return type inference. > > > > > > [1]: > > > > > https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665 > > > > > > -Rui > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <yuzhao....@gmail.com> wrote: > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the window > > table > > > functions to replace the legacy group window syntax, the idea comes > from > > > the SQL-2016 polymorphic table functions. But the current > implementation > > > also got some feedbacks(or confusion), this thread tries to have a > > > discussion on theses questions that are undecided. > > > > > > One is about the window function argument names: > > > > > > === option1 > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset) > > > HOP(table, DESCRIPTOR(time_column), slide, size) > > > > > > === option2 > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset) > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset) > > > > > > I would prefer for option2 for the reasons pointed out by Viliam in > > > CALCITE-3737. > > > > > > > > > Another is about how to translate the query > > > > > > For query, select * from TUMBLE(table, DESCRIPTOR(time_column), > > interval); > > > > > > Currently our outputs plan is: > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2], > > > window_end=[$3]) > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1), > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER ORDERID, > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start, TIMESTAMP(0) > > > window_end)]) > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1]) > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]]) > > > > > > But this seems not right, because the "table table_name" argument is > not > > a > > > relational expression, in CALCITE-3955, Julian has pointed out that we > > > should translate the "table" argument as a CURSOR (or TABLE by Viliam), > > but > > > another question is how to translate the referenced name "time_column", > > to > > > a correlate variable ? which is also confusing because there is no > > > correlation in the plan actually. > > > > > > Any suggestions are appreciated ! > > > > > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955 > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737 > > > > > > Best, > > > Danny Chan > > > > > > > > -- > Viliam Durina > Jet Developer > hazelcastĀ® > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA 94402 | > USA > +1 (650) 521-5453 <(650)%20521-5453> | hazelcast.com < > https://www.hazelcast.com> > > -- > This message contains confidential information and is intended only for > the > individuals named. If you are not the named addressee you should not > disseminate, distribute or copy this e-mail. Please notify the sender > immediately by e-mail if you have received this e-mail by mistake and > delete this e-mail from your system. E-mail transmission cannot be > guaranteed to be secure or error-free as information could be intercepted, > corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. > The sender therefore does not accept liability for any errors or omissions > in the contents of this message, which arise as a result of e-mail > transmission. If verification is required, please request a hard-copy > version. -Hazelcast >