Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Rui Wang Thu, 22 Oct 2020 23:36:30 -0700

Yes. I had offline chats with Kenneth and Julian. I think the primary
reason that window metadata is not as a STRUCT, is just because STRUCT
might not be well supported by engines. Even in Calcite, ROW (STRUCT) might
now be supported well.


Meanwhile, it seems like there is no serious issue for having two
separate columns. For example, if you want to construct PK for windows, you
can use window_start (or the ends)  as PK for most of the windowing cases.
If you still need wstart and wend as a ROW, you could construct it by row
constructor.

-Rui

On Thu, Oct 22, 2020 at 10:46 PM Pengcheng Liu <[email protected]>
wrote:

> Hi, Danny,
>    I think there is already an explanation in [1], as the author noted:
> > The return value of Tumble is a relation that includes all columns of
> data as well as
> > additional event time columns wstart and wend. It was considered to place
> the original
> > row in a nested row and wstart and wend in a separate nested row, for
> simplicity of
> > namespacing, but that would limit these extensions to engines supporting
> nested rows.
>
> [1]
>
> https://docs.google.com/document/d/138uA7VTpbF84CFrd--cz3YVe0-AQ9ALnsavaSE2JeE4
>
>
> Danny Chan <[email protected]> 于2020年10月22日周四 下午7:40写道：
>
> > Rui Wang, can you help to consult with the author of the window TVF
> syntax
> > ? We want to hear more suggestions.
> >
> > cc @Rui Wang
> >
> > Rui Wang <[email protected]> 于2020年10月22日周四 下午12:01写道：
> >
> > > This seems reasonable to merge window metadata as a column (more
> compact
> > > format).
> > >
> > > Maybe Julian can comment whether there was special consideration to
> > define
> > > window_start and window_end as two separate columns?
> > >
> > >
> > > -Rui
> > >
> > > On Wed, Oct 21, 2020 at 8:20 PM Danny Chan <[email protected]>
> wrote:
> > >
> > > > Some of our fellows suggest the window table value functions to
> append
> > a
> > > > structure type field:
> > > >
> > > > window: ROW(start: timestamp(3), end: timestamp(3))
> > > >
> > > > For example if
> > > >
> > > > table T has schema: (ts: timestamp(3), b: varchar(20)), then
> > > >
> > > > Tumble(TABLE T, DESCRIPTOR(T.ts), INTERVAL '5' MINUTE)
> > > >
> > > > has return type
> > > >
> > > > Row(ts timestamp(3), b varchar(20), window Row(start timestamp(3),
> end
> > > > timestamp(3)))
> > > >
> > > > Thus in the aggregate key or the join condition, we can compare the
> > > > structure “window” field directly, e.g.
> > > >
> > > > GROUP BY T.window or ON L.window = R.window
> > > >
> > > > What do you think about this ?
> > > >
> > > > Rui Wang <[email protected]> 于2020年6月9日周二 上午1:34写道：
> > > >
> > > > > On Mon, Jun 8, 2020 at 12:04 AM Viliam Durina <
> [email protected]>
> > > > > wrote:
> > > > >
> > > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> > this
> > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > >
> > > > > > It's not about how often you emit. By "emit" I mean the moment
> when
> > > the
> > > > > > rows are sent out. For example, when working with event time and
> > the
> > > > > input
> > > > > > stalls, you might emit much later. Or you can use it for
> historical
> > > > data
> > > > > in
> > > > > > which case the actual interval will be much shorter. It's the
> time
> > > > > interval
> > > > > > by which the window "hops". Two subsequent windows will be this
> > much
> > > > > apart.
> > > > > >
> > > > > > Regarding the DESCRIPTOR, the PTF is supposed to be implemented
> by
> > up
> > > > to
> > > > > 4
> > > > > > member functions: describe, start, fulfill, finish. The
> `describe`
> > > > > function
> > > > > > is supposed to be called during validation. It can validate the
> > > > > arguments.
> > > > > > In case of TUMBLE, it can check whether the column described by
> the
> > > > > > descriptor is present in the input table and if it is of a
> correct
> > > > type.
> > > > > It
> > > > > > can throw errors which will become validation errors.
> > > > >
> > > > > Yes. Descriptor has a feature to enable a type checking. If
> > considering
> > > > it,
> > > > > functions will be (argument name not decided yet):
> > > > >
> > > > > TUMBLE(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > > HOP(
> > > > > data TABLE,
> > > > > time_column DESCRIPTOR(*TIMESTAMP*),
> > > > > window_size INTERVAL DAY TO SECOND,
> > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > It also determines
> > > > > > the actual output table type.
> > > > > >
> > > > > > I think it's worthwhile to read the PTF spec before implementing
> > > this,
> > > > > it's
> > > > > > freely available here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://standards.iso.org/ittf/PubliclyAvailableStandards/c069776_ISO_IEC_TR_19075-7_2017.zip
> > > > > >
> > > > > > Viliam
> > > > > >
> > > > > > On Sun, 7 Jun 2020 at 06:23, Rui Wang <[email protected]>
> > wrote:
> > > > > >
> > > > > > > Thanks Danny for your summary!
> > > > > > >
> > > > > > > For question one, I prefer the ordering of parameter in option
> > two.
> > > > > > > However, for argument names, I think it might be better to
> > > > > > > 1. name the window size explicitly by "window_size".
> > > > > > > 2. rename the hop_size to emit_frequency. Because that's what
> > this
> > > > > > > argument tries to say: e.g. emit in every x time_unit.
> > > > > > > 3. don't use "table" as the first argument name, because
> "table"
> > > is a
> > > > > > > keyword. So we can replace it by "data"
> > > > > > >
> > > > > > > so it might be
> > > > > > > TUMBLE(data, DESCRIPTOR(time_column), window_size[, offset])
> > > > > > > HOP(data, DESCRIPTOR(time_column), window_size,
> emit_frequency[,
> > > > > offset])
> > > > > > >
> > > > > > > Meanwhile, one thing worths mentioning is, like Viliam always
> > > > discussed
> > > > > > in
> > > > > > > CALCITE-4000[1], we can also finalize the data types for each
> > > > > signature:
> > > > > > > TUMBLE(
> > > > > > > data TABLE,
> > > > > > > time_column DESCRIPTOR,
> > > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > > >
> > > > > > > HOP(
> > > > > > > data TABLE,
> > > > > > > time_column DESCRIPTOR,
> > > > > > > window_size INTERVAL DAY TO SECOND,
> > > > > > > emit_frequency INTERVAL DAY TO SECOND,
> > > > > > > offset INTERVAL DAY TO SECOND DEFAULT INTERVAL '0' SECONDS)
> > > > > > >
> > > > > > > So those intervals have to be INTERVAL DAY TO SECOND, not
> > INTERVAL
> > > > YEAR
> > > > > > TO
> > > > > > > MONTH, because year to month varies in seconds.
> > > > > > >
> > > > > > >
> > > > > > > Regarding question two, I tried to dig into codebase and I
> think
> > it
> > > > > could
> > > > > > > be a RexCorrelVariable to replace that table_name, so it
> becomes
> > > > > > > table_function(RexCorrelVariable, DESCRIPTOR($1),
> > internal:INTERVAL
> > > > > > MINUTE)
> > > > > > >
> > > > > > > I know RexCorrelVariable might be not designed for this
> purpose,
> > > but
> > > > I
> > > > > > > find RexCorrelVariable is really useful because:
> > > > > > > 1. It can provide an id, to indicate which input of
> > > TableFunctionScan
> > > > > > that
> > > > > > > this table_function is applied on. Note that TableFunctionScan
> > > might
> > > > > not
> > > > > > > have one input.
> > > > > > > 2. It can provide RelDataType, to save the referenced input's
> > > schema.
> > > > > The
> > > > > > > input schema is useful during operator argument type validation
> > > (for
> > > > > > > DESCRIPTOR) and operator return type inference.
> > > > > > >
> > > > > > >
> > > > > > > [1]:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://jira.apache.org/jira/browse/CALCITE-4000?focusedCommentId=17107665&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17107665
> > > > > > >
> > > > > > >
> > > > > > > -Rui
> > > > > > >
> > > > > > > On Thu, Jun 4, 2020 at 6:02 AM Danny Chan <
> [email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi, all, in the last 2 releases, Rui Wang has contributed the
> > > > window
> > > > > > > table
> > > > > > > > functions to replace the legacy group window syntax, the idea
> > > comes
> > > > > > from
> > > > > > > > the SQL-2016 polymorphic table functions. But the current
> > > > > > implementation
> > > > > > > > also got some feedbacks(or confusion), this thread tries to
> > have
> > > a
> > > > > > > > discussion on theses questions that are undecided.
> > > > > > > >
> > > > > > > > One is about the window function argument names:
> > > > > > > >
> > > > > > > > === option1
> > > > > > > > TUMBLE(table, DESCRIPTOR(datetime), interval, offset)
> > > > > > > > HOP(table, DESCRIPTOR(time_column), slide, size)
> > > > > > > >
> > > > > > > > === option2
> > > > > > > > TUMBLE(table, DESCRIPTOR(time_column), size, offset)
> > > > > > > > HOP(table, DESCRIPTOR(time_column), size, hop_size, offset)
> > > > > > > >
> > > > > > > > I would prefer for option2 for the reasons pointed out by
> > Viliam
> > > in
> > > > > > > > CALCITE-3737.
> > > > > > > >
> > > > > > > >
> > > > > > > > Another is about how to translate the query
> > > > > > > >
> > > > > > > > For query, select * from TUMBLE(table,
> DESCRIPTOR(time_column),
> > > > > > > interval);
> > > > > > > >
> > > > > > > > Currently our outputs plan is:
> > > > > > > >
> > > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1], window_start=[$2],
> > > > > > > > window_end=[$3])
> > > > > > > > LogicalTableFunctionScan(invocation=[TUMBLE(DESCRIPTOR($1),
> > > > > > > > interal:INTERVAL MINUTE)], rowType=[RecordType(INTEGER
> ORDERID,
> > > > > > > > TIMESTAMP(0) time_column, TIMESTAMP(0) window_start,
> > TIMESTAMP(0)
> > > > > > > > window_end)])
> > > > > > > > LogicalProject(ORDERID=[$0], ROWTIME=[$1])
> > > > > > > > LogicalTableScan(table=[[CATALOG, SALES, SHIPMENTS]])
> > > > > > > >
> > > > > > > > But this seems not right, because the "table table_name"
> > argument
> > > > is
> > > > > > not
> > > > > > > a
> > > > > > > > relational expression, in CALCITE-3955, Julian has pointed
> out
> > > that
> > > > > we
> > > > > > > > should translate the "table" argument as a CURSOR (or TABLE
> by
> > > > > Viliam),
> > > > > > > but
> > > > > > > > another question is how to translate the referenced name
> > > > > "time_column",
> > > > > > > to
> > > > > > > > a correlate variable ? which is also confusing because there
> is
> > > no
> > > > > > > > correlation in the plan actually.
> > > > > > > >
> > > > > > > > Any suggestions are appreciated !
> > > > > > > >
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/CALCITE-3955
> > > > > > > > [2] https://issues.apache.org/jira/browse/CALCITE-3737
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Danny Chan
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Viliam Durina
> > > > > > Jet Developer
> > > > > >       hazelcast®
> > > > > >
> > > > > >   <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo,
> CA
> > > > > 94402 |
> > > > > > USA
> > > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453>
> <(650)%20521-5453> |
> > > > hazelcast.com <
> > > > > > https://www.hazelcast.com>
> > > > > >
> > > > > > --
> > > > > > This message contains confidential information and is intended
> only
> > > for
> > > > > > the
> > > > > > individuals named. If you are not the named addressee you should
> > not
> > > > > > disseminate, distribute or copy this e-mail. Please notify the
> > sender
> > > > > > immediately by e-mail if you have received this e-mail by mistake
> > and
> > > > > > delete this e-mail from your system. E-mail transmission cannot
> be
> > > > > > guaranteed to be secure or error-free as information could be
> > > > > intercepted,
> > > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > > > viruses.
> > > > > > The sender therefore does not accept liability for any errors or
> > > > > omissions
> > > > > > in the contents of this message, which arise as a result of
> e-mail
> > > > > > transmission. If verification is required, please request a
> > hard-copy
> > > > > > version. -Hazelcast
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Some quesitons about the new introduced window table functions (TUMBLE, HOP and SESSION)

Reply via email to