SGTM For future reference, responding here:
5) Any operator can introduce system (psedo) columns. > > This is clearly out of scope for this FLIP. The implementation effort > would be huge and could introduce a lot of bugs. > I didn't imply any specific implementation or feature coverage in the currently proposed FLIP, but rather as a way to describe the semantics of system columns that those could be (but don't have to) introduced by any operator. Thanks, Alexey On Tue, Aug 22, 2023 at 2:18 PM Jing Ge <[email protected]> wrote: > Hi Timo, > > Your last suggestion sounds good. > > Best regards, > Jing > > On Mon, Aug 21, 2023 at 4:21 AM Benchao Li <[email protected]> wrote: > > > It sounds good to me too, that we avoid introducing the concept of > "system > > columns" for now. > > > > Timo Walther <[email protected]> 于2023年8月18日周五 22:38写道: > > > > > Great, I also like my last suggestion as it is even more elegant. I > will > > > update the FLIP until Monday. > > > > > > Regards, > > > Timo > > > > > > On 17.08.23 13:55, Jark Wu wrote: > > > > Hi Timo, > > > > > > > > I'm fine with your latest suggestion that introducing a flag to > control > > > > expanding behavior of metadata virtual columns, but not introducing > > > > any concept of system/pseudo columns for now. > > > > > > > > Best, > > > > Jark > > > > > > > > On Tue, 15 Aug 2023 at 23:25, Timo Walther <[email protected]> > wrote: > > > > > > > >> Hi everyone, > > > >> > > > >> I would like to bump this thread up again. > > > >> > > > >> Esp. I would like to hear opinions on my latest suggestions to > simply > > > >> use METADATA VIRTUAL as system columns and only introduce a config > > > >> option for the SELECT * behavior. Implementation-wise this means > > minimal > > > >> effort and less new concepts. > > > >> > > > >> Looking forward to any kind of feedback. > > > >> > > > >> Thanks, > > > >> Timo > > > >> > > > >> On 07.08.23 12:07, Timo Walther wrote: > > > >>> Hi everyone, > > > >>> > > > >>> thanks for the various feedback and lively discussion. Sorry, for > the > > > >>> late reply as I was on vacation. Let me answer to some of the > topics: > > > >>> > > > >>> 1) Systems columns should not be shown with DESCRIBE statements > > > >>> > > > >>> This sounds fine to me. I will update the FLIP in the next > iteration. > > > >>> > > > >>> 2) Do you know why most SQL systems do not need any prefix with > their > > > >>> pseudo column? > > > >>> > > > >>> Because most systems do not have external catalogs or connectors. > And > > > >>> also the number of system columns is limited to a handful of > columns. > > > >>> Flink is more generic and thus more complex. And we have already > the > > > >>> concepts of metadata columns. We need to be careful with not > > > overloading > > > >>> our language. > > > >>> > > > >>> 3) Implementation details > > > >>> > > > >>> > how to you plan to implement the "system columns", do we need > to > > > add > > > >>> it to `RelNode` level? Or we just need to do it in the > > > >>> parsing/validating phase? > > > >>> > I'm not sure that Calcite's "system column" feature is fully > > ready > > > >>> > > > >>> My plan would be to only modify the parsing/validating phase. I > would > > > >>> like to avoid additional complexity in planner rules and > > > >>> connector/catalog interfaces. Metadata columns already support > > > >>> projection push down and are passed through the stack (via Schema, > > > >>> ResolvedSchema, SupportsReadableMetadata). Calcite's "system > column" > > > >>> feature is not fully ready yet and it would be a large effort > > > >>> potentially introducing bugs in supporting it. Thus, I'm proposing > to > > > >>> leverage what we already have. The only part that needs to be > > modified > > > >>> is the "expand star" method in SqlValidator and Table API. > > > >>> > > > >>> Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would > > show > > > >>> $rowtime as the expand star has only a special case when in the > scope > > > >>> for `FROM t`. All further subqueries treat it as a regular column. > > > >>> > > > >>> 4) Built-in defined pseudo-column "$rowtime" > > > >>> > > > >>> > Did you consider making it as a built-in defined pseudo-column > > > >>> "$rowtime" which returns the time attribute value (if exists) or > null > > > >>> (if non-exists) for every table/query, and pseudo-column > "$proctime" > > > >>> always returns PROCTIME() value for each table/query > > > >>> > > > >>> Built-in pseudo-columns mean that connector or catalog providers > need > > > >>> consensus in Flink which pseudo-columns should be built-in. We > should > > > >>> keep the concept generic and let platform providers decide which > > pseudo > > > >>> columns to expose. $rowtime might be obvious but others such as > > > >>> $partition or $offset are tricky to get consensus as every external > > > >>> connector works differently. Also a connector might want to expose > > > >>> different time semantics (such as ingestion time). > > > >>> > > > >>> 5) Any operator can introduce system (psedo) columns. > > > >>> > > > >>> This is clearly out of scope for this FLIP. The implementation > effort > > > >>> would be huge and could introduce a lot of bugs. > > > >>> > > > >>> 6) "Metadata Key Prefix Constraint" which is still a little complex > > > >>> > > > >>> Another option could be to drop the naming pattern constraint. We > > could > > > >>> make it configurable that METADATA VIRTUAL columns are never > selected > > > by > > > >>> default in SELECT * or visible in DESCRIBE. This would further > > simplify > > > >>> the FLIP and esp lower the impact on the planner and all > interfaces. > > > >>> > > > >>> What do you think about this? We could introduce a flag: > > > >>> > > > >>> table.expand-metadata-columns (better name to be defined) > > > >>> > > > >>> This way we don't need to introduce the concept of system columns > > yet, > > > >>> but can still offer similar functionality with minimal overhead in > > the > > > >>> code base. > > > >>> > > > >>> Regards, > > > >>> Timo > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote: > > > >>>> Looks like both kinds of system columns can converge. > > > >>>> We can say that any operator can introduce system (psedo) columns. > > > >>>> > > > >>>> cc Eugene who is also interested in the subject. > > > >>>> > > > >>>> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <[email protected]> > > > wrote: > > > >>>> > > > >>>>> Hi Timo, > > > >>>>> > > > >>>>> Thanks for starting the discussion! System columns are no doubt a > > > >>>>> good boost on Flink SQL’s usability, and I see the feedbacks are > > > >>>>> mainly concerns about the accessibility of system columns. > > > >>>>> > > > >>>>> I think most of the concerns could be solved by clarifying the > > > >>>>> ownership of the system columns. Different from databases like > > > >>>>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the > > > >>>>> data/metadata from external systems. That means Flink could > > > >>>>> have 2 kinds of system columns (take ROWID for example): > > > >>>>> > > > >>>>> 1. system columns provided by external systems via catalogs, such > > > >>>>> as ROWID from the original system. > > > >>>>> 2. system columns generated by Flink, such as ROWID generated by > > > >>>>> Flink itself. > > > >>>>> > > > >>>>> IIUC, the FLIP is proposing the 1st approach: the catalog defines > > > what > > > >>>>> system columns to provide, and Flink treats them as normal > columns > > > >>>>> with a special naming pattern. > > > >>>>> > > > >>>>> On the other hand, Jark is proposing the 2nd one: the system > > columns > > > >>>>> are defined and owned by Flink, and can be inferred from external > > > >>>>> systems. Therefore, system columns should be predefined by Flink, > > > >>>>> and optionally implemented by the catalogs. > > > >>>>> > > > >>>>> Personally, I’m in favor of the 2nd approach, because it makes > the > > > >>>>> system columns very accessible and more aligned across the > > catalogs. > > > >>>>> > > > >>>>> BTW, I second Alexey that systems columns should not be shown > with > > > >>>>> DESCRIBE statements. > > > >>>>> > > > >>>>> WDYT? Thanks! > > > >>>>> > > > >>>>> Best, > > > >>>>> Paul Lam > > > >>>>> > > > >>>>>> 2023年7月31日 23:54,Jark Wu <[email protected]> 写道: > > > >>>>>> > > > >>>>>> Hi Timo, > > > >>>>>> > > > >>>>>> Thanks for your proposal. I think this is a nice feature for > users > > > >>>>>> and I > > > >>>>>> prefer option 3. > > > >>>>>> > > > >>>>>> I only have one concern about the concept of pseudo-column or > > > >>>>>> system-column, > > > >>>>>> because this is the first time we introduce it in Flink SQL. The > > > >>>>>> confusion is similar to the > > > >>>>>> question of Benchao and Sergey about the propagation of > > > pseudo-column. > > > >>>>>> > > > >>>>>> From my understanding, a pseudo-column can be get from an > > > arbitrary > > > >>>>> query, > > > >>>>>> just similar to > > > >>>>>> ROWNUM in Oracle[1], such as : > > > >>>>>> > > > >>>>>> SELECT * > > > >>>>>> FROM (SELECT * FROM employees ORDER BY employee_id) > > > >>>>>> WHERE ROWNUM < 11; > > > >>>>>> > > > >>>>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be > > got > > > >>>>>> from > > > >>>>>> the physical table > > > >>>>>> and can't be got from queries even if the query propagates the > > > rowtime > > > >>>>>> attribute. There was also > > > >>>>>> a discussion about adding a pseudo-column "_proctime" [2] to > make > > > >>>>>> lookup > > > >>>>>> join easier to use > > > >>>>>> which can be got from arbitrary queries. That "_proctime" may > > > conflict > > > >>>>> with > > > >>>>>> the proposed > > > >>>>>> pseudo-column concept. > > > >>>>>> > > > >>>>>> Did you consider making it as a built-in defined pseudo-column > > > >>>>>> "$rowtime" > > > >>>>>> which returns the > > > >>>>>> time attribute value (if exists) or null (if non-exists) for > every > > > >>>>>> table/query, and pseudo-column > > > >>>>>> "$proctime" always returns PROCTIME() value for each > table/query. > > In > > > >>>>>> this > > > >>>>>> way, catalogs only need > > > >>>>>> to provide a default rowtime attribute and users can get it in > the > > > >> same > > > >>>>>> way. And we don't need > > > >>>>>> to introduce the contract interface of "Metadata Key Prefix > > > >> Constraint" > > > >>>>>> which is still a little complex > > > >>>>>> for users and devs to understand. > > > >>>>>> > > > >>>>>> Best, > > > >>>>>> Jark > > > >>>>>> > > > >>>>>> [1]: > > > >>>>>> > > > >>>>> > > > >> > > > > > > https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255 > > > >>>>>> [2]: > > > https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5 > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy < > > > >>>>>> [email protected]> wrote: > > > >>>>>> > > > >>>>>>>> > > > >>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);` > > > >>>>>>>> Am I right that it will show `$rowtime` in output ? > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Yes, all explicitly selected columns become a part of the > result > > > (and > > > >>>>>>> intermediate) schema, and hence propagate. > > > >>>>>>> > > > >>>>>>> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy < > > > >>>>>>> [email protected]> wrote: > > > >>>>>>> > > > >>>>>>>> Thank you, Timo, for starting this FLIP! > > > >>>>>>>> > > > >>>>>>>> I propose the following change: > > > >>>>>>>> > > > >>>>>>>> Remove the requirement that DESCRIBE need to show system > > columns. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> Some concrete vendor specific catalog implementations might > > prefer > > > >>>>>>>> this > > > >>>>>>>> approach. > > > >>>>>>>> Usually the same system columns are available on all (or > family) > > > of > > > >>>>>>>> tables, and it can be easily captured in the documentation. > > > >>>>>>>> > > > >>>>>>>> For example, BigQuery does exactly this: there, pseudo-columns > > do > > > >> not > > > >>>>>>> show > > > >>>>>>>> up in the table schema in any place, but can be accessed via > > > >>>>>>>> reference. > > > >>>>>>>> > > > >>>>>>>> So I propose we: > > > >>>>>>>> a) Either we say that DESCRIBE doesn't show system columns, > > > >>>>>>>> b) Or leave this vendor-specific / or configurable via flag > (if > > > >>>>> needed). > > > >>>>>>>> > > > >>>>>>>> Regards, > > > >>>>>>>> Alexey > > > >>>>>>>> > > > >>>>>>>> On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin < > > > >> [email protected]> > > > >>>>>>>> wrote: > > > >>>>>>>> > > > >>>>>>>>> Hi Timo, > > > >>>>>>>>> > > > >>>>>>>>> Thanks for the FLIP. > > > >>>>>>>>> I also tend to think that Option 3 is better. > > > >>>>>>>>> > > > >>>>>>>>> I would be also interested in a question mentioned by Benchao > > Li. > > > >>>>>>>>> And a similar question about nested queries like > > > >>>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);` > > > >>>>>>>>> Am I right that it will show `$rowtime` in output ? > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li < > > [email protected] > > > > > > > >>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Hi Timo, > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks for the FLIP, I also like the idea and option 3 > sounds > > > >>>>>>>>>> good to > > > >>>>>>>>> me. > > > >>>>>>>>>> > > > >>>>>>>>>> I would like to discuss a case which is not mentioned in the > > > >>>>>>>>>> current > > > >>>>>>>>> FLIP. > > > >>>>>>>>>> How are the "System column"s expressed in intermediate > result, > > > >> e.g. > > > >>>>>>>>> Join? > > > >>>>>>>>>> E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not > include > > > >>>>> "system > > > >>>>>>>>>> columns" from t1 and t2 as you proposed, and for `SELECT > > > >>>>>>>>>> t1.$rowtime, > > > >>>>>>> * > > > >>>>>>>>>> FROM t1 JOIN t2`, it should also be valid. > > > >>>>>>>>>> Then the question is how to you plan to implement the > "system > > > >>>>>>> columns", > > > >>>>>>>>> do > > > >>>>>>>>>> we need to add it to `RelNode` level? Or we just need to do > it > > > >>>>>>>>>> in the > > > >>>>>>>>>> parsing/validating phase? > > > >>>>>>>>>> I'm not sure that Calcite's "system column" feature is fully > > > ready > > > >>>>> for > > > >>>>>>>>> this > > > >>>>>>>>>> since the code about this part is imported from the earlier > > > >> project > > > >>>>>>>>> before > > > >>>>>>>>>> it gets into Apache, and has not been considered much in the > > > past > > > >>>>>>>>>> development. > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> Jing Ge <[email protected]> 于2023年7月26日周三 00:01写 > > > >>>>>>>>>> 道: > > > >>>>>>>>>> > > > >>>>>>>>>>> Hi Timo, > > > >>>>>>>>>>> > > > >>>>>>>>>>> Thanks for your proposal. It is a very pragmatic feature. > > Among > > > >>>>>>>>>>> all > > > >>>>>>>>>> options > > > >>>>>>>>>>> in the FLIP, option 3 is one I prefer too and I'd like to > ask > > > >> some > > > >>>>>>>>>>> questions to understand your thoughts. > > > >>>>>>>>>>> > > > >>>>>>>>>>> 1. I did some research on pseudo columns, just out of > > > >>>>>>>>>>> curiosity, do > > > >>>>>>>>> you > > > >>>>>>>>>>> know why most SQL systems do not need any prefix with their > > > >> pseudo > > > >>>>>>>>>> column? > > > >>>>>>>>>>> 2. Some platform providers will use ${variable_name} to > > define > > > >>>>>>>>>>> their > > > >>>>>>>>> own > > > >>>>>>>>>>> configurations and allow them to be embedded into SQL > > scripts. > > > >>>>>>>>>>> Will > > > >>>>>>>>> there > > > >>>>>>>>>>> be any conflict with option 3? > > > >>>>>>>>>>> > > > >>>>>>>>>>> Best regards, > > > >>>>>>>>>>> Jing > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf > > > >>>>>>>>>>> <[email protected] > > > >>>>>>>> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>>> Hi Timo, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> this makes sense to me. Option 3 seems reasonable, too. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Cheers, > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Konstantin > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther < > > > >>>>>>>>>>>> [email protected] > > > >>>>>>>>>>>>> : > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> Hi everyone, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> I would like to start a discussion about introducing the > > > >> concept > > > >>>>>>>>> of > > > >>>>>>>>>>>>> "System Columns" in SQL and Table API. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> The subject sounds bigger than it actually is. Luckily, > > Flink > > > >>>>>>> SQL > > > >>>>>>>>>>>>> already exposes the concept of metadata columns. And this > > > >>>>>>>>> proposal is > > > >>>>>>>>>>>>> just a slight adjustment for how metadata columns can be > > used > > > >> as > > > >>>>>>>>>> system > > > >>>>>>>>>>>>> columns. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> The biggest problem of metadata columns currently is > that a > > > >>>>>>>>> catalog > > > >>>>>>>>>>>>> implementation can't provide them by default because they > > > would > > > >>>>>>>>>> affect > > > >>>>>>>>>>>>> `SELECT *` when adding another one. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Looking forward to your feedback on FLIP-348: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>> > > > >>>>> > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Thanks, > > > >>>>>>>>>>>>> Timo > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> -- > > > >>>>>>>>>>>> https://twitter.com/snntrable > > > >>>>>>>>>>>> https://github.com/knaufk > > > >>>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> -- > > > >>>>>>>>>> > > > >>>>>>>>>> Best, > > > >>>>>>>>>> Benchao Li > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> -- > > > >>>>>>>>> Best regards, > > > >>>>>>>>> Sergey > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > >> > > > > > > > > > > > > > > -- > > > > Best, > > Benchao Li > > >
