Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Alexey Leonov-Vendrovskiy Mon, 28 Aug 2023 15:22:45 -0700

SGTM

For future reference, responding here:


5) Any operator can introduce system (psedo) columns.
>
> This is clearly out of scope for this FLIP. The implementation effort
> would be huge and could introduce a lot of bugs.
>

 I didn't imply any specific implementation or feature coverage in the
currently proposed FLIP, but rather as a way to describe the semantics of
system columns that those could be (but don't have to) introduced by any
operator.

Thanks,
Alexey


On Tue, Aug 22, 2023 at 2:18 PM Jing Ge <[email protected]> wrote:

> Hi Timo,
>
> Your last suggestion sounds good.
>
> Best regards,
> Jing
>
> On Mon, Aug 21, 2023 at 4:21 AM Benchao Li <[email protected]> wrote:
>
> > It sounds good to me too, that we avoid introducing the concept of
> "system
> > columns" for now.
> >
> > Timo Walther <[email protected]> 于2023年8月18日周五 22:38写道：
> >
> > > Great, I also like my last suggestion as it is even more elegant. I
> will
> > > update the FLIP until Monday.
> > >
> > > Regards,
> > > Timo
> > >
> > > On 17.08.23 13:55, Jark Wu wrote:
> > > > Hi Timo,
> > > >
> > > > I'm fine with your latest suggestion that introducing a flag to
> control
> > > > expanding behavior of metadata virtual columns, but not introducing
> > > > any concept of system/pseudo columns for now.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Tue, 15 Aug 2023 at 23:25, Timo Walther <[email protected]>
> wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> I would like to bump this thread up again.
> > > >>
> > > >> Esp. I would like to hear opinions on my latest suggestions to
> simply
> > > >> use METADATA VIRTUAL as system columns and only introduce a config
> > > >> option for the SELECT * behavior. Implementation-wise this means
> > minimal
> > > >> effort and less new concepts.
> > > >>
> > > >> Looking forward to any kind of feedback.
> > > >>
> > > >> Thanks,
> > > >> Timo
> > > >>
> > > >> On 07.08.23 12:07, Timo Walther wrote:
> > > >>> Hi everyone,
> > > >>>
> > > >>> thanks for the various feedback and lively discussion. Sorry, for
> the
> > > >>> late reply as I was on vacation. Let me answer to some of the
> topics:
> > > >>>
> > > >>> 1) Systems columns should not be shown with DESCRIBE statements
> > > >>>
> > > >>> This sounds fine to me. I will update the FLIP in the next
> iteration.
> > > >>>
> > > >>> 2) Do you know why most SQL systems do not need any prefix with
> their
> > > >>> pseudo column?
> > > >>>
> > > >>> Because most systems do not have external catalogs or connectors.
> And
> > > >>> also the number of system columns is limited to a handful of
> columns.
> > > >>> Flink is more generic and thus more complex. And we have already
> the
> > > >>> concepts of metadata columns. We need to be careful with not
> > > overloading
> > > >>> our language.
> > > >>>
> > > >>> 3) Implementation details
> > > >>>
> > > >>>   > how to you plan to implement the "system columns", do we need
> to
> > > add
> > > >>> it to `RelNode` level? Or we just need to do it in the
> > > >>> parsing/validating phase?
> > > >>>   > I'm not sure that Calcite's "system column" feature is fully
> > ready
> > > >>>
> > > >>> My plan would be to only modify the parsing/validating phase. I
> would
> > > >>> like to avoid additional complexity in planner rules and
> > > >>> connector/catalog interfaces. Metadata columns already support
> > > >>> projection push down and are passed through the stack (via Schema,
> > > >>> ResolvedSchema, SupportsReadableMetadata). Calcite's "system
> column"
> > > >>> feature is not fully ready yet and it would be a large effort
> > > >>> potentially introducing bugs in supporting it. Thus, I'm proposing
> to
> > > >>> leverage what we already have. The only part that needs to be
> > modified
> > > >>> is the "expand star" method in SqlValidator and Table API.
> > > >>>
> > > >>> Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would
> > show
> > > >>> $rowtime as the expand star has only a special case when in the
> scope
> > > >>> for `FROM t`. All further subqueries treat it as a regular column.
> > > >>>
> > > >>> 4) Built-in defined pseudo-column "$rowtime"
> > > >>>
> > > >>>   > Did you consider making it as a built-in defined pseudo-column
> > > >>> "$rowtime" which returns the time attribute value (if exists) or
> null
> > > >>> (if non-exists) for every table/query, and pseudo-column
> "$proctime"
> > > >>> always returns PROCTIME() value for each table/query
> > > >>>
> > > >>> Built-in pseudo-columns mean that connector or catalog providers
> need
> > > >>> consensus in Flink which pseudo-columns should be built-in. We
> should
> > > >>> keep the concept generic and let platform providers decide which
> > pseudo
> > > >>> columns to expose. $rowtime might be obvious but others such as
> > > >>> $partition or $offset are tricky to get consensus as every external
> > > >>> connector works differently. Also a connector might want to expose
> > > >>> different time semantics (such as ingestion time).
> > > >>>
> > > >>> 5) Any operator can introduce system (psedo) columns.
> > > >>>
> > > >>> This is clearly out of scope for this FLIP. The implementation
> effort
> > > >>> would be huge and could introduce a lot of bugs.
> > > >>>
> > > >>> 6) "Metadata Key Prefix Constraint" which is still a little complex
> > > >>>
> > > >>> Another option could be to drop the naming pattern constraint. We
> > could
> > > >>> make it configurable that METADATA VIRTUAL columns are never
> selected
> > > by
> > > >>> default in SELECT * or visible in DESCRIBE. This would further
> > simplify
> > > >>> the FLIP and esp lower the impact on the planner and all
> interfaces.
> > > >>>
> > > >>> What do you think about this? We could introduce a flag:
> > > >>>
> > > >>> table.expand-metadata-columns (better name to be defined)
> > > >>>
> > > >>> This way we don't need to introduce the concept of system columns
> > yet,
> > > >>> but can still offer similar functionality with minimal overhead in
> > the
> > > >>> code base.
> > > >>>
> > > >>> Regards,
> > > >>> Timo
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote:
> > > >>>> Looks like both kinds of system columns can converge.
> > > >>>> We can say that any operator can introduce system (psedo) columns.
> > > >>>>
> > > >>>> cc Eugene who is also interested in the subject.
> > > >>>>
> > > >>>> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <[email protected]>
> > > wrote:
> > > >>>>
> > > >>>>> Hi Timo,
> > > >>>>>
> > > >>>>> Thanks for starting the discussion! System columns are no doubt a
> > > >>>>> good boost on Flink SQL’s usability, and I see the feedbacks are
> > > >>>>> mainly concerns about the accessibility of system columns.
> > > >>>>>
> > > >>>>> I think most of the concerns could be solved by clarifying the
> > > >>>>> ownership of the system columns. Different from databases like
> > > >>>>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the
> > > >>>>> data/metadata from external systems. That means Flink could
> > > >>>>> have 2 kinds of system columns (take ROWID for example):
> > > >>>>>
> > > >>>>> 1. system columns provided by external systems via catalogs, such
> > > >>>>>       as ROWID from the original system.
> > > >>>>> 2. system columns generated by Flink, such as ROWID generated by
> > > >>>>>       Flink itself.
> > > >>>>>
> > > >>>>> IIUC, the FLIP is proposing the 1st approach: the catalog defines
> > > what
> > > >>>>> system columns to provide, and Flink treats them as normal
> columns
> > > >>>>> with a special naming pattern.
> > > >>>>>
> > > >>>>> On the other hand, Jark is proposing the 2nd one: the system
> > columns
> > > >>>>> are defined and owned by Flink, and can be inferred from external
> > > >>>>> systems. Therefore, system columns should be predefined by Flink,
> > > >>>>> and optionally implemented by the catalogs.
> > > >>>>>
> > > >>>>> Personally, I’m in favor of the 2nd approach, because it makes
> the
> > > >>>>> system columns very accessible and more aligned across the
> > catalogs.
> > > >>>>>
> > > >>>>> BTW, I second Alexey that systems columns should not be shown
> with
> > > >>>>> DESCRIBE statements.
> > > >>>>>
> > > >>>>> WDYT? Thanks!
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Paul Lam
> > > >>>>>
> > > >>>>>> 2023年7月31日 23:54，Jark Wu <[email protected]> 写道：
> > > >>>>>>
> > > >>>>>> Hi Timo,
> > > >>>>>>
> > > >>>>>> Thanks for your proposal. I think this is a nice feature for
> users
> > > >>>>>> and I
> > > >>>>>> prefer option 3.
> > > >>>>>>
> > > >>>>>> I only have one concern about the concept of pseudo-column or
> > > >>>>>> system-column,
> > > >>>>>> because this is the first time we introduce it in Flink SQL. The
> > > >>>>>> confusion is similar to the
> > > >>>>>> question of Benchao and Sergey about the propagation of
> > > pseudo-column.
> > > >>>>>>
> > > >>>>>>   From my understanding, a pseudo-column can be get from an
> > > arbitrary
> > > >>>>> query,
> > > >>>>>> just similar to
> > > >>>>>> ROWNUM in Oracle[1], such as :
> > > >>>>>>
> > > >>>>>> SELECT *
> > > >>>>>> FROM (SELECT * FROM employees ORDER BY employee_id)
> > > >>>>>> WHERE ROWNUM < 11;
> > > >>>>>>
> > > >>>>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be
> > got
> > > >>>>>> from
> > > >>>>>> the physical table
> > > >>>>>> and can't be got from queries even if the query propagates the
> > > rowtime
> > > >>>>>> attribute. There was also
> > > >>>>>> a discussion about adding a pseudo-column "_proctime" [2] to
> make
> > > >>>>>> lookup
> > > >>>>>> join easier to use
> > > >>>>>> which can be got from arbitrary queries. That "_proctime" may
> > > conflict
> > > >>>>> with
> > > >>>>>> the proposed
> > > >>>>>> pseudo-column concept.
> > > >>>>>>
> > > >>>>>> Did you consider making it as a built-in defined pseudo-column
> > > >>>>>> "$rowtime"
> > > >>>>>> which returns the
> > > >>>>>> time attribute value (if exists) or null (if non-exists) for
> every
> > > >>>>>> table/query, and pseudo-column
> > > >>>>>> "$proctime" always returns PROCTIME() value for each
> table/query.
> > In
> > > >>>>>> this
> > > >>>>>> way, catalogs only need
> > > >>>>>> to provide a default rowtime attribute and users can get it in
> the
> > > >> same
> > > >>>>>> way. And we don't need
> > > >>>>>> to introduce the contract interface of "Metadata Key Prefix
> > > >> Constraint"
> > > >>>>>> which is still a little complex
> > > >>>>>> for users and devs to understand.
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Jark
> > > >>>>>>
> > > >>>>>> [1]:
> > > >>>>>>
> > > >>>>>
> > > >>
> > >
> >
> https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255
> > > >>>>>> [2]:
> > > https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy <
> > > >>>>>> [email protected]> wrote:
> > > >>>>>>
> > > >>>>>>>>
> > > >>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> > > >>>>>>>> Am I right that it will show `$rowtime` in output ?
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Yes, all explicitly selected columns become a part of the
> result
> > > (and
> > > >>>>>>> intermediate) schema, and hence propagate.
> > > >>>>>>>
> > > >>>>>>> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy <
> > > >>>>>>> [email protected]> wrote:
> > > >>>>>>>
> > > >>>>>>>> Thank you, Timo, for starting this FLIP!
> > > >>>>>>>>
> > > >>>>>>>> I propose the following change:
> > > >>>>>>>>
> > > >>>>>>>> Remove the requirement that DESCRIBE need to show system
> > columns.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Some concrete vendor specific catalog implementations might
> > prefer
> > > >>>>>>>> this
> > > >>>>>>>> approach.
> > > >>>>>>>> Usually the same system columns are available on all (or
> family)
> > > of
> > > >>>>>>>> tables, and it can be easily captured in the documentation.
> > > >>>>>>>>
> > > >>>>>>>> For example, BigQuery does exactly this: there, pseudo-columns
> > do
> > > >> not
> > > >>>>>>> show
> > > >>>>>>>> up in the table schema in any place, but can be accessed via
> > > >>>>>>>> reference.
> > > >>>>>>>>
> > > >>>>>>>> So I propose we:
> > > >>>>>>>> a) Either we say that DESCRIBE doesn't show system columns,
> > > >>>>>>>> b) Or leave this vendor-specific / or configurable via flag
> (if
> > > >>>>> needed).
> > > >>>>>>>>
> > > >>>>>>>> Regards,
> > > >>>>>>>> Alexey
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin <
> > > >> [email protected]>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Timo,
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for the FLIP.
> > > >>>>>>>>> I also tend to think that Option 3 is better.
> > > >>>>>>>>>
> > > >>>>>>>>> I would be also interested in a question mentioned by Benchao
> > Li.
> > > >>>>>>>>> And a similar question about nested queries like
> > > >>>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> > > >>>>>>>>> Am I right that it will show `$rowtime` in output ?
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li <
> > [email protected]
> > > >
> > > >>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Timo,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for the FLIP, I also like the idea and option 3
> sounds
> > > >>>>>>>>>> good to
> > > >>>>>>>>> me.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I would like to discuss a case which is not mentioned in the
> > > >>>>>>>>>> current
> > > >>>>>>>>> FLIP.
> > > >>>>>>>>>> How are the "System column"s expressed in intermediate
> result,
> > > >> e.g.
> > > >>>>>>>>> Join?
> > > >>>>>>>>>> E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not
> include
> > > >>>>> "system
> > > >>>>>>>>>> columns" from t1 and t2 as you proposed, and for `SELECT
> > > >>>>>>>>>> t1.$rowtime,
> > > >>>>>>> *
> > > >>>>>>>>>> FROM t1 JOIN t2`, it should also be valid.
> > > >>>>>>>>>> Then the question is how to you plan to implement the
> "system
> > > >>>>>>> columns",
> > > >>>>>>>>> do
> > > >>>>>>>>>> we need to add it to `RelNode` level? Or we just need to do
> it
> > > >>>>>>>>>> in the
> > > >>>>>>>>>> parsing/validating phase?
> > > >>>>>>>>>> I'm not sure that Calcite's "system column" feature is fully
> > > ready
> > > >>>>> for
> > > >>>>>>>>> this
> > > >>>>>>>>>> since the code about this part is imported from the earlier
> > > >> project
> > > >>>>>>>>> before
> > > >>>>>>>>>> it gets into Apache, and has not been considered much in the
> > > past
> > > >>>>>>>>>> development.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Jing Ge <[email protected]> 于2023年7月26日周三 00:01写
> > > >>>>>>>>>> 道：
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Timo,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for your proposal. It is a very pragmatic feature.
> > Among
> > > >>>>>>>>>>> all
> > > >>>>>>>>>> options
> > > >>>>>>>>>>> in the FLIP, option 3 is one I prefer too and I'd like to
> ask
> > > >> some
> > > >>>>>>>>>>> questions to understand your thoughts.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1. I did some research on pseudo columns, just out of
> > > >>>>>>>>>>> curiosity, do
> > > >>>>>>>>> you
> > > >>>>>>>>>>> know why most SQL systems do not need any prefix with their
> > > >> pseudo
> > > >>>>>>>>>> column?
> > > >>>>>>>>>>> 2. Some platform providers will use ${variable_name} to
> > define
> > > >>>>>>>>>>> their
> > > >>>>>>>>> own
> > > >>>>>>>>>>> configurations and allow them to be embedded into SQL
> > scripts.
> > > >>>>>>>>>>> Will
> > > >>>>>>>>> there
> > > >>>>>>>>>>> be any conflict with option 3?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Best regards,
> > > >>>>>>>>>>> Jing
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf
> > > >>>>>>>>>>> <[email protected]
> > > >>>>>>>>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Timo,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> this makes sense to me. Option 3 seems reasonable, too.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Cheers,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Konstantin
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther <
> > > >>>>>>>>>>>> [email protected]
> > > >>>>>>>>>>>>> :
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi everyone,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I would like to start a discussion about introducing the
> > > >> concept
> > > >>>>>>>>> of
> > > >>>>>>>>>>>>> "System Columns" in SQL and Table API.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> The subject sounds bigger than it actually is. Luckily,
> > Flink
> > > >>>>>>> SQL
> > > >>>>>>>>>>>>> already exposes the concept of metadata columns. And this
> > > >>>>>>>>> proposal is
> > > >>>>>>>>>>>>> just a slight adjustment for how metadata columns can be
> > used
> > > >> as
> > > >>>>>>>>>> system
> > > >>>>>>>>>>>>> columns.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> The biggest problem of metadata columns currently is
> that a
> > > >>>>>>>>> catalog
> > > >>>>>>>>>>>>> implementation can't provide them by default because they
> > > would
> > > >>>>>>>>>> affect
> > > >>>>>>>>>>>>> `SELECT *` when adding another one.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Looking forward to your feedback on FLIP-348:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>> Timo
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> --
> > > >>>>>>>>>>>> https://twitter.com/snntrable
> > > >>>>>>>>>>>> https://github.com/knaufk
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> --
> > > >>>>>>>>>>
> > > >>>>>>>>>> Best,
> > > >>>>>>>>>> Benchao Li
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> Best regards,
> > > >>>>>>>>> Sergey
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> > >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>

Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Reply via email to