It happens in "raw Beam" somewhat often that there is one windowed
aggregation followed by another. It is pretty unheard of to "assign"
windows twice before an aggregation, because Beam cannot do anything useful
with this: Beam does not have the ability to have two different
fields/columns that are both associated with a watermark. So TUMBLE
followed by TUMBLE is more expressive. You get a number of columns all of
which can be used for windowed aggregation.

One reason this could happen is that you are authoring a table function
that operates generically. It does not know that its input has already been
TUMBLED. In Beam you would reify the prior window information, do your own
TUMBLE. Likely you would do an aggregation and then restore the prior
window information somehow. A compelling example does not come to mind, but
I think considering applying TUMBLE to a generic input helps clarify that
it could happen in code worth allowing.

Kenn

On Wed, Sep 23, 2020 at 11:47 AM Julian Hyde <jh...@apache.org> wrote:

> I don't think it's a problem with table functions in general. And
> besides, we can't change the semantics of table functions. A table
> function must not produce duplicate column names.
>
> The problem is with the semantics of these particular table functions
> - HOP, TUMBLE, SESSION - and what semantics are desirable depends on
> how people will typically use them. Is it common to follow TUMBLE with
> TUMBLE? What would a user expect to be the output columns?
>
> On Wed, Sep 23, 2020 at 11:21 AM Rui Wang <amaliu...@apache.org> wrote:
> >
> > >Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > generally want two sets of window_start, window_end columns?
> >
> > I think it is reasonable to apply TUMBLE to TUMBLE or even TUMBLE to HOP
> > join, as long as there is a real requirement there. The window
> starts/ends
> > are not duplicates. For example TUMBLE as L JOIN HOP as R, L offers a
> > window start and a window end, same for R. This is no different from a
> > normal JOIN case where both JOIN sides have the same column names (but
> they
> > are not considered duplicates). The SQL rule is still applying: within a
> > scope there shouldn't be ambiguous column names (e.g. duplicated column
> > name). For JOIN duplicate names from JOIN inputs are differentiated by
> > table alias.
> >
> >
> > Regarding https://issues.apache.org/jira/browse/CALCITE-4274, this is an
> > interesting case that is different from the JOIN case, and I also think
> > this is a general case (not limited to TUMBLE).
> >
> > Think about that for any query that uses table function of the pattern
> > in CALCITE-4274. The first table function generates column A and then it
> > becomes the input for the second table function, which also wants to
> append
> > a column named "A". How should Calcite handle this case?
> >
> >
> > -Rui
> >
> >
> >
> > On Wed, Sep 23, 2020 at 9:13 AM Julian Hyde <jh...@apache.org> wrote:
> >
> > > I think we should also discuss
> > > https://issues.apache.org/jira/browse/CALCITE-4274 here.
> > >
> > > We've never discussed what should happen if you apply TUMBLE to TUMBLE
> > > (or TUMBLE to HOP, etc.). What happens now is that you get duplicate
> > > columns.
> > >
> > > Is it reasonable to apply TUMBLE to TUMBLE? If so, would people
> > > generally want two sets of window_start, window_end columns?
> > >
> > > Julian
> > >
> > > On Wed, Sep 23, 2020 at 2:41 AM Danny Chan <yuzhao....@gmail.com>
> wrote:
> > > >
> > > > Thanks for the feedback, I agree we should keep the verbose part
> > > >
> > > > **L.window_start = R.window_start AND L.window_end =R.window_end**
> > > >
> > > > Which would make the semantic more clear ~
> > > >
> > > > Best,
> > > > Danny Chan
> > > > 在 2020年9月23日 +0800 PM3:24,Viliam Durina <vil...@hazelcast.com>,写道:
> > > > > You can also use
> > > > >
> > > > > SELECT L.f0, R.f2, L.window_start, L.window_end
> > > > > FROM
> > > > > Tumble(table T1, descriptor(T1.ts), INTERVAL ‘5’ MINUTE) L
> > > > > JOIN
> > > > > Tumble(table T2, descriptor(T2.ts), INTERVAL ‘5’ MINUTE) R
> > > > > USING (f0, window_start)
> > > > >
> > > > > Viliam
> > > > >
> > > > > On Wed, 23 Sep 2020 at 08:02, Rui Wang <amaliu...@apache.org>
> wrote:
> > > > >
> > > > > > Regarding to **L.window_start = R.window_start AND L.window_end =
> > > > > > R.window_end**:
> > > > > >
> > > > > > In general, the current table function windowing model is to
> append
> > > window
> > > > > > metadata to table directly, thus window metadata becomes a part
> of
> > > table
> > > > > > (or call it data). So as a part of table, these two columns
> should be
> > > > > > treated as normal columns thus they should be in the join on
> > > condition.
> > > > > >
> > > > > > If you want to make it optional, it makes window start/end
> columns
> > > special
> > > > > > and has a semantic binding with special table functions (TUMBLE,
> HOP,
> > > > > > SESSION), which then becomes really not a SQL thing. For
> example, we
> > > can
> > > > > > allow users to define their own windowing table function. In that
> > > case, how
> > > > > > will you utilize window start/end produced by a customized
> windowing
> > > table
> > > > > > function? What if users produce wired windows that have
> overlapped
> > > window
> > > > > > starts or window ends?
> > > > > >
> > > > > > Keeping windows start/end as a part of the table, treating them
> no
> > > > > > different from other columns, could give a consistent behavior
> for
> > > either
> > > > > > built-in table function or user-defined table function.
> > > > > >
> > > > > > If you think it is too verbose, there are two options to
> optimize:
> > > > > >
> > > > > > 1. for TUMBLE/HOP/SESSION, to identify a unique window, you will
> > > only need
> > > > > > either window start or end, so you can simplify it, for example,
> to
> > > > > > L.window_start = R.window_start only.
> > > > > > 2. (not recommended), you can cut off **L.window_start =
> > > R.window_start AND
> > > > > > L.window_end = R.window_end**, but add window metadata
> comparison to
> > > join
> > > > > > implicitly by execution engine. E.g. you can make up the join
> > > condition in
> > > > > > your JoinRel if two inputs are TUMBLE.
> > > > > >
> > > > > >
> > > > > >
> > > > > > -Rui
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 22, 2020 at 10:27 PM Danny Chan <
> yuzhao....@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Yes, the red part is **L.window_start = R.window_start AND
> > > L.window_end =
> > > > > > > R.window_end**
> > > > > > >
> > > > > > > > Is this a limitation for "triggered by the watermark of the
> > > stream”?
> > > > > > >
> > > > > > > No, because in most of the cases, there is no need to output
> the
> > > > > > > intermediate/partial join records then send retractions.
> > > > > > >
> > > > > > >
> > > > > > > So, how do you think about the condition syntax
> **L.window_start =
> > > > > > > R.window_start AND L.window_end = R.window_end** ?
> > > > > > >
> > > > > > > Best,
> > > > > > > Danny Chan
> > > > > > > 在 2020年9月23日 +0800 PM12:47,dev@calcite.apache.org,写道:
> > > > > > > >
> > > > > > > > L.window_start = R.window_start AND L.window_end =
> R.window_end
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Viliam Durina
> > > > > Jet Developer
> > > > > hazelcast®
> > > > >
> > > > > <https://www.hazelcast.com> 2 W 5th Ave, Ste 300 | San Mateo, CA
> > > 94402 |
> > > > > USA
> > > > > +1 (650) 521-5453 <(650)%20521-5453> <(650)%20521-5453> |
> hazelcast.com <
> > > https://www.hazelcast.com>
> > > > >
> > > > > --
> > > > > This message contains confidential information and is intended only
> > > for the
> > > > > individuals named. If you are not the named addressee you should
> not
> > > > > disseminate, distribute or copy this e-mail. Please notify the
> sender
> > > > > immediately by e-mail if you have received this e-mail by mistake
> and
> > > > > delete this e-mail from your system. E-mail transmission cannot be
> > > > > guaranteed to be secure or error-free as information could be
> > > intercepted,
> > > > > corrupted, lost, destroyed, arrive late or incomplete, or contain
> > > viruses.
> > > > > The sender therefore does not accept liability for any errors or
> > > omissions
> > > > > in the contents of this message, which arise as a result of e-mail
> > > > > transmission. If verification is required, please request a
> hard-copy
> > > > > version. -Hazelcast
> > >
>

Reply via email to