Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

刘芃成 Fri, 09 Oct 2020 21:01:35 -0700

Hi, Benchao,
       I think I got your point, actually, in current implementation for group 
window aggregation, the value of time attributes(e.g. 
TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I think 
we can just use it directly if you need this. But I think this time attributes 
is mainly suggested to use in case of cascaded window operations.
Regarding the example you provided, I think the semantics of the SQL in your 
example which doing interval join(e.g. with TUMBLE_ROWTIME) after window 
aggregation is not clear in the current implementation, and I think that’s a 
strong reason why we need the new TVFs syntax.
      With the new syntax, users should understand which time column to use and 
how to generate it when doing interval join and etc.


Best,
Pengcheng

发件人: Benchao Li <libenc...@apache.org>
日期: 2020年10月10日 星期六 上午11:02
收件人: pengcheng Liu <pengchengliucr...@gmail.com>
抄送: dev <dev@flink.apache.org>
主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

Hi pengcheng,

Thanks for your response.
I knew that the original time attribute column will be retained after the TVF,
what I'm questioning is how do we get the time attribute column after 
Aggregation.
Your answer did not remove my doubts about this.

It's ok if we did not plan to integrate new TVF aggregate with old "time 
attribute scenarios"
listed in my previous email in this FLIP. However it's good to elaborate more 
on this, and
leave it to the future plan.

pengcheng Liu <pengchengliucr...@gmail.com<mailto:pengchengliucr...@gmail.com>> 
于2020年10月10日周六 上午10:45写道：
Hi，Benchao,
    In TVFs, the time attributes is just passed through from parent rels, and 
the TVFs just add two
    additional window attributes(i.e. window_start & window_end). Also, I think 
the time columns can be not only a time attribute
    with type of `TimeIndicatorType` but also a regular column with type of 
`Timestamp`.

    For cascaded window operations, we can use window_start/window_end of the 
previous window result directly to
    indicate operating on the same window, or use  new DESCRIPTOR column to 
assign new windows, in case of the change of
    the time column(e.g. in some case, the original timestamp is inaccurate and 
need some conversion to be used).

    You can check the definition or signature of these TVFs in the FLIP.
     e.g.
  SELECT * FROM TABLE(
   TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
    In the example, the `bidtime` is the time attribute column, which is the 
first operand of the DESCRIPTOR function.

    +1 start voting.

Benchao Li <libenc...@apache.org<mailto:libenc...@apache.org>> 于2020年10月10日周六 
上午10:08写道：
Hi Jark,

2 & 3 sounds good to me.

Regarding time attribute,
I still have some questions, I knew it's easy to support cascaded window 
aggregate using new TVFs.
However there are some other places where need time attribute:
- CEP
- interval join
- order by
- over window
If there is no time attribute column, how do we integrate these old features 
with the new TVFs.
E.g.
StreamA -> new window aggregate -> interval join -> Sink
                                                         /
StreamB -----------------------------------


Jark Wu <imj...@gmail.com<mailto:imj...@gmail.com>> 于2020年10月9日周五 下午11:51写道：
Hi Benchao,

1) time attribute
Yes. We don't need time attribute auxiliary function. Because the new
window operations are all based on the
 window_start and window_end columns instead of on the time attributes. So
we don't need to propagate time attributes.
Cascaded window aggregate can be expressed by simply GROUP BY the
window_start and window_end of the previous window result.
I have added a cascaded window aggregate example in the Tumbling Window
section in the FLIP.
If you want to define proctime window aggregate, the time column in TVF
should be a proctime attribute field (or PROCTIME() function).

2) batch support
Yes. The proposed syntax/API are unified for batch and streaming. Batch
support is in the plan, but may not have enough time to catch up 1.12.

3) support `grouping sets`
This is not included in the FLIP, but I think it's great if we can support
`grouping sets`.
The existing window impl doesn't support this because we convert the
LogicalAggregate into WindowAggregate in the beginning,
the expand grouping sets rule can't be applied in this situation.
Fortunately, with the new window impl, the conversion to WindowAggregate
will happen at the end, so I think the expand rule can be
 applied and support this feature naturally.
Therefore, IMO, we don't need to include this feature in this FLIP to avoid
the FLIP being too large.
This can be a follow-up issue (maybe just add tests and docs) after the
FLIP.

Best,
Jark


On Fri, 9 Oct 2020 at 19:09, 刘 芃成 
<pengchengliucr...@gmail.com<mailto:pengchengliucr...@gmail.com>> wrote:

> Hi，Benchao,
>         Welcome to join the discussion, yes, this new syntax can make SQL
> more clear and simpler.
>         For your first question, the `window_start` and `window_end`
> columns will be added automatically,
>         so we don't need to use auxiliary group functions to infer or
> access the window properties.
>
>         For the `grouping sets` on TVFs, I think it's interesting if we
> can support it, as we already supported `grouping sets`
>         on streaming aggregates in blink planner. But I'm not sure if it
> will be included into this FLIP.
>
>         cc @Jark Wu
>
> Best,
> Pengcheng
>
>
> 在 2020/10/9 下午5:25，“Benchao 
> Li”<libenc...@apache.org<mailto:libenc...@apache.org>> 写入:
>
>     Thanks Jark for bringing this discussion, I like this FLIP very much.
>
>     Especially the cumulate window, it's much like the current TUMBLE
> window +
>     Fast Emit (which is an undocumented experimental feature), however,
> it's
>     more powerful.
>
>     And This will make the SQL semantic more standard, especially for the
>     HOPPING window.
>
>     Regarding time attribute,
>     It seems that we don't need a specific function to infer the time
> attribute
>     like
>     `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and
>     `window_end`
>     column a time attribute column automatically?
>     - If not, what will be the time attribute of the result relation of
> these
>     TVFs?
>       Especially after the window aggregation.
>     - If yes, then how do we handle proctime?
>
>     Regarding batch operators,
>     It's great to hear that we can reuse the batch operators in continuous
>     batch mode
>     as you mentioned in the FLIP.
>     Current window aggregate could also be used in batch mode with
> rowtime. Do
>     you plan
>     to support these TVFs for batch mode in this FLIP? Hence the Table/SQL
> is a
>     unified
>     API, it's great if we can keep the features complete both in streaming
> and
>     batch mode.
>
>     There is one more question, I don't know whether it should be
> considered in
>     this FLIP.
>     Does the new window support `grouping sets`? (It's not supported in old
>     window impl).
>
>     Jark Wu <imj...@gmail.com<mailto:imj...@gmail.com>> 于2020年10月9日周五 
> 下午4:14写道：
>
>     > Hi all,
>     >
>     > I know we have a lot of discussion and development on going right
> now but
>     > it would be great if we can get FLIP-145 into a votable state.
>     > If there are no objections, I would like to start voting in the next
> days.
>     >
>     > Best,
>     > Jark
>     >
>     > On Thu, 1 Oct 2020 at 14:29, Jark Wu 
> <imj...@gmail.com<mailto:imj...@gmail.com>> wrote:
>     >
>     > > Hi everyone,
>     > >
>     > > I have added a section for Performance Optimization to describe
> how to
>     > > improve the performance in the short-term and long-term
>     > > and sketch the future performance potential under the new window
> API.
>     > > Introducing the window API is just the first step, we will
>     > > continuously improve the performance to make it powerful and
> useful.
>     > >
>     > > Best,
>     > > Jark
>     > >
>     > > On Thu, 1 Oct 2020 at 14:28, Jark Wu 
> <imj...@gmail.com<mailto:imj...@gmail.com>> wrote:
>     > >
>     > >> Hi Pengcheng,
>     > >>
>     > >> Yes, the window TVF is part of the FLIP. Welcome to contribute
> and join
>     > >> the discussion.
>     > >> Regarding the SESSION window aggregation, users can use the
> existing
>     > >> grouped session window function.
>     > >>
>     > >> Best,
>     > >> Jark
>     > >>
>     > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng <
> pengchengliucr...@gmail.com<mailto:pengchengliucr...@gmail.com>
>     > >
>     > >> wrote:
>     > >>
>     > >>> Hi Jark,
>     > >>>         Thanks for reply, yes, I think it's a good feature, it
> can
>     > >>> improve the NRT scenarios
>     > >>>         as you mentioned in the FLIP. Also, I think it can
> improve the
>     > >>> streaming SQL greatly,
>     > >>>         it can support richer window operations in flink SQL and
> bring
>     > >>> great convenience to users.
>     > >>>         (we are now only supported group window in flink).
>     > >>>
>     > >>>         Regarding the SESSION window, I think it's especially
> useful
>     > for
>     > >>> user behavior analysis(e.g.
>     > >>>         counting user visits on a news website or social
> platform), but
>     > >>> I agree that we can keep it
>     > >>>         out of the FLIP now to catch up 1.12.
>     > >>>
>     > >>>         Recently, I've done some work on the stream planner with
> the
>     > >>> TVFs, and I'm willing to contribute
>     > >>>         to this part. Is it in the plan of this FLIP?
>     > >>>
>     > >>>         Best,
>     > >>>         PengchengLiu
>     > >>>
>     > >>>
>     > >>> 在 2020/9/26 下午11:09，“Jark 
> Wu”<imj...@gmail.com<mailto:imj...@gmail.com>> 写入:
>     > >>>
>     > >>>     Hi pengcheng,
>     > >>>
>     > >>>     That's great to see you also have the need of window join.
>     > >>>     You are right, the windowing TVF is a powerful feature which
> can
>     > >>> support
>     > >>>     more operations in the future.
>     > >>>     I think it as of the date time "partition" selection in
> batch SQL
>     > >>> jobs,
>     > >>>     with this new syntax, I think it is possible
>     > >>>      to migrate traditional batch SQL jobs to Flink SQL by
> changing a
>     > >>> few lines.
>     > >>>
>     > >>>     Regarding the SESSION window, this is on purpose to keep it
> out of
>     > >>> the
>     > >>>     FLIP, because we want to keep the
>     > >>>     FLIP small to catch up 1.12 and SESSION TVF is rarely useful
> (e.g.
>     > >>> session
>     > >>>     window join?).
>     > >>>
>     > >>>     Best,
>     > >>>     Jark
>     > >>>
>     > >>>     On Fri, 25 Sep 2020 at 22:59, liupengcheng <
>     > >>> pengchengliucr...@gmail.com<mailto:pengchengliucr...@gmail.com>>
>     > >>>     wrote:
>     > >>>
>     > >>>     > Hi, Jark,
>     > >>>     >         I'm very interested in this feature, and I'm also
> working
>     > >>> on this
>     > >>>     > recently.
>     > >>>     >         I just have a glance at the FLIP, it's good, but I
> found
>     > >>> that
>     > >>>     > there is no plan to add SESSION windows.
>     > >>>     >         Also, I think there can be more things we can do
> based on
>     > >>> this new
>     > >>>     > syntax. For example,
>     > >>>     >         - window sort support
>     > >>>     >         - window union/intersect/minus support
>     > >>>     >         - Improve dimension table join
>     > >>>     >         We can have more deep discussion on this new
> feature
>     > later
>     > >>> .
>     > >>>     >         I've also opened an jira that is related to this
> feature
>     > >>> recently:
>     > >>>     > https://issues.apache.org/jira/browse/FLINK-18830
>     > >>>     >
>     > >>>     > Best!
>     > >>>     > PengchengLiu
>     > >>>     >
>     > >>>     > 在 2020/9/25 下午10:30，“Jark 
> Wu”<imj...@gmail.com<mailto:imj...@gmail.com>> 写入:
>     > >>>     >
>     > >>>     >     Hi everyone,
>     > >>>     >
>     > >>>     >     I want to start a FLIP about supporting windowing
>     > table-valued
>     > >>>     > functions
>     > >>>     >     (TVF).
>     > >>>     >     The main purpose of this FLIP is to improve the near
>     > real-time
>     > >>> (NRT)
>     > >>>     >     experience of Flink.
>     > >>>     >
>     > >>>     >     FLIP-145:
>     > >>>     >
>     > >>>     >
>     > >>>
>     >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function
>     > >>>     >
>     > >>>     >     We want to introduce TUMBLE, HOP, CUMULATE windowing
> TVFs,
>     > the
>     > >>>     > CUMULATE is
>     > >>>     >     a new kind of window.
>     > >>>     >     With the windowing TVFs, we can support richer
> operations on
>     > >>> windows,
>     > >>>     >     including window join, window TopN and so on.
>     > >>>     >     This makes things simple: we only need to assign
> windows at
>     > the
>     > >>>     > beginning
>     > >>>     >     of the query, and then apply operations after that like
>     > >>> traditional
>     > >>>     > batch
>     > >>>     >     SQL.
>     > >>>     >     We hope it can help to reduce the learning curve of
> windows,
>     > >>> improve
>     > >>>     > NRT
>     > >>>     >     for Flink, and attract more batch users.
>     > >>>     >
>     > >>>     >     A simple code snippet for 10 minutes tumbling window
>     > aggregate:
>     > >>>     >
>     > >>>     >     SELECT window_start, window_end, SUM(price)
>     > >>>     >     FROM TABLE(
>     > >>>     >         TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL
> '10'
>     > >>> MINUTES))
>     > >>>     >     GROUP BY window_start, window_end;
>     > >>>     >
>     > >>>     >     I'm looking forward to your feedback.
>     > >>>     >
>     > >>>     >     Best,
>     > >>>     >     Jark
>     > >>>     >
>     > >>>     >
>     > >>>     >
>     > >>>
>     > >>>
>     > >>>
>     >
>
>
>     --
>
>     Best,
>     Benchao Li
>


--

Best,
Benchao Li


--

Best,
Benchao Li

Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function

Reply via email to