Hi all, As I see this thread has consensus, here is the issue and PR to disable the legacy behavior by default: https://issues.apache.org/jira/browse/FLINK-26551 https://github.com/apache/flink/pull/19020
We target to merge it before the release of 1.15, unless there are any objections. FG On Tue, Mar 1, 2022 at 2:18 PM Marios Trivyzas <mat...@gmail.com> wrote: > Indeed, if we manage to use the configuration from *flink-conf.yaml* down > the stack, > it would be easy for everyone to configure a "system-wide" legacy cast > behaviour. > > Best regards, > Marios > > On Tue, Mar 1, 2022 at 2:52 PM Timo Walther <twal...@apache.org> wrote: > > > +1 > > > > Thanks for bringing up this discussion one more time Marios. > > > > I strongly support enabling the new behavior in 1.15. It definitely has > > implications on existing users, but as Seth said, thinking about the > > upcoming upgrade story we need to make sure that at least the core/basic > > operations are correct. Otherwise we will have to maintain multiple > > versions of functions with broken semantics. > > > > I since we also try to fix various issues around configuration, maybe it > > might still be possible to configure the legacy cast behavior globally > > via flink-conf.yaml. This should make the transitioning period easier in > > production. > > > > Regards, > > Timo > > > > Am 28.02.22 um 19:04 schrieb Seth Wiesman: > > > +1 > > > > > > Especially as SQL upgrades are right around the corner, it makes sense > to > > > get our defaults right. > > > > > > Seth > > > > > > On Mon, Feb 28, 2022 at 7:14 AM Martijn Visser <mart...@ververica.com> > > > wrote: > > > > > >> +1 for setting this option to disabled by default. I believe failures > > >> should be brought forward as soon as possible, so they can be fixed as > > fast > > >> as possible. It will also be less confusing for new users. Last but > not > > >> least, I believe the impact on existing users will be minimal (since > it > > can > > >> be changed by changing one flag). > > >> > > >> Best regards, > > >> > > >> Martijn > > >> > > >> On Tue, 22 Feb 2022 at 17:55, Marios Trivyzas <mat...@gmail.com> > wrote: > > >> > > >>> Thanks Francesco, > > >>> > > >>> The two arguments you posted, further strengthen the need to make it > > >>> DISABLED by default. > > >>> > > >>> On Tue, Feb 22, 2022 at 12:10 PM Francesco Guardiani < > > >>> france...@ververica.com> wrote: > > >>> > > >>>> Hi all, > > >>>> I'm +1 with what everything you said Marios. > > >>>> > > >>>> I'm gonna add another argument on top of that: the > > >> "legacy-cast-behavior" > > >>>> has also a broken type inference, leading to incorrect results or > > >> further > > >>>> errors down in the pipeline[1]. For example, take this: > > >>>> > > >>>> SELECT COALESCE(CAST('a' AS INT), 0) ... > > >>>> > > >>>> With the legacy cast behavior ENABLED, this is going to lead to the > > >> wrong > > >>>> result, as 'a' is inferred as VARCHAR NOT NULL, the CAST return > value > > >> is > > >>>> inferred as INT NOT NULL, so the planner will drop COALESCE, and > will > > >>> never > > >>>> return 0. Essentially, CAST will infer the wrong nullability leading > > to > > >>>> assigning its result to a NOT NULL type, when its value can > > effectively > > >>> be > > >>>> NULL. > > >>>> > > >>>>> You introduce a deprecated flag to help users > > >>>> using old versions of the software to smoothly transition to the new > > >>>> version, while the new users experience the new features/behavior, > > >>>> without the need to set a flag. > > >>>> > > >>>> This is IMO the major point on why we should disable the legacy cast > > >>>> behavior by default. This is even more relevant with 1.15 and the > > >> upgrade > > >>>> story, as per the problem described above, because users will now > end > > >> up > > >>>> generating plans with wrong type inference, which will be hard to > > >> migrate > > >>>> in the next flink versions. > > >>>> > > >>>> FG > > >>>> > > >>>> [1] In case you're wondering why it wasn't fixed, the reason is that > > >>> fixing > > >>>> it means creating a breaking change, for details > > >>>> https://github.com/apache/flink/pull/18611#issuecomment-1028174877 > > >>>> > > >>>> > > >>>> On Tue, Feb 22, 2022 at 10:07 AM Marios Trivyzas <mat...@gmail.com> > > >>> wrote: > > >>>>> Hello all! > > >>>>> > > >>>>> I would like to bring back the discussion regarding the > > >>>>> *table.exec.legacy-cast-behaviour* > > >>>>> configuration option which we are introducing with Flink *1.15*. > This > > >>>>> option provides the users > > >>>>> with the flexibility to continue using the old (incorrect, > according > > >> to > > >>>> SQL > > >>>>> standards) behaviour > > >>>>> of *CAST.* > > >>>>> > > >>>>> With Flink *1.15* we have introduced a bunch of fixes, improvements > > >> and > > >>>> new > > >>>>> casting functionality > > >>>>> between types, see > https://issues.apache.org/jira/browse/FLINK-24403 > > >> , > > >>>> and > > >>>>> some of them are > > >>>>> guarded behind the legacy behaviour option: > > >>>>> > > >>>>> - Trimming and padding when casting to CHAR/VARCHAR types to > > >> respect > > >>>> the > > >>>>> specified length > > >>>>> - Changes for casting various types to CHAR/VARCHAR/STRING > > >>>>> - Runtime errors for CAST no longer emit *null *as result but > > >>>> exceptions > > >>>>> are thrown with a meaningful message for the cause, that fail > the > > >>>>> pipeline. *TRY_CAST > > >>>>> *is introduced instead, which emits *null* results instead of > > >>> throwing > > >>>>> exceptions. > > >>>>> > > >>>>> Those changes become active if users set the > > >>>>> *table.exec.legacy-cast-behaviour > > >>>>> *option to *DISABLED*, otherwise > > >>>>> they will continue to experience the old, *erroneous*, behaviour of > > >>>> *CAST*. > > >>>>> Currently, we have set the *table.exec.legacy-cast-behaviour > *option > > >>>>> to be *ENABLED > > >>>>> *by default, so if users want > > >>>>> to get the new correct behaviour, they are required to set > explicitly > > >>> the > > >>>>> option to *DISABLED*. Moreover, the option > > >>>>> itself is marked as deprecated, since the plan is to be removed in > > >> the > > >>>>> future, so that the old, erroneous behaviour > > >>>>> won't be an option, and the *TRY_CAST* would be the way to go if > > >> users > > >>>>> don't want to have errors and failed pipelines, > > >>>>> but have *null*s emitted in case of runtime errors when casting. > > >>>>> > > >>>>> I would like to start a discussion and maybe ask for voting, so > that > > >> we > > >>>> set > > >>>>> the *table.exec.legacy-cast-behaviour* option > > >>>>> to *DISABLED *by default, so that users that want to keep their old > > >>>>> pipelines working the same way, without changing their > > >>>>> SQL/TableAPI code, would need to explicitly set it to *ENABLED.* > > >>>>> > > >>>>> My main argument for changing the default value for the option, is > > >> that > > >>>> the > > >>>>> *DISABLED* value is the one that enables > > >>>>> the *correct* behaviour for CAST which should be the default for > all > > >>> new > > >>>>> users. This way, new FLINK users, or users which > > >>>>> build new pipelines, from now on would get the correct behaviour by > > >>>> default > > >>>>> without the need of changing some flag in their > > >>>>> configuration. It feels weird to me, especially for people very > > >>> familiar > > >>>>> with standard SQL, to be obliged to set some config > > >>>>> flag, to be able to get the correct behaviour for CAST. On top, > users > > >>>> that > > >>>>> won't read about this option in our docs, will, > > >>>>> "blindly", experience the old incorrect behaviour for their new > > >>>> pipelines, > > >>>>> and issues that could cause the CAST to fail > > >>>>> will remain hidden from them, since *nulls* would be emitted. IMHO, > > >>> this > > >>>>> last part is also very important during the development > > >>>>> stages of an application/pipeline. Normally, developers would want > to > > >>> see > > >>>>> all possible errors/scenarios during development > > >>>>> stages , in order to build a robust production system. If errors, > are > > >>>>> hidden then, they can easily end up with those errors > > >>>>> in the production system which would be even harder to discover and > > >>>> debug, > > >>>>> since no exception will ever be thrown. > > >>>>> Imagine that there is a CAST which generates nulls because of > runtime > > >>>>> errors, and its result is used in an aggregate function: > > >>>>> > > >>>>> SELECT AVG(CAST(col1 AS FLOAT)), ... FROM t GROUP BY .... > > >>>>> > > >>>>> If nulls are emitted by the CAST (let's say because some records > > >> have a > > >>>>> quoted string value for col1, then simply the AVG result > > >>>>> would be wrong and in would be extremely difficult to realise and > fix > > >>> the > > >>>>> issue by simply wrapping col1 first with, for example, > > >>>>> a REGEXP_REPLACE, to get rid of the quotes. > > >>>>> > > >>>>> My second argument is that, since we have marked > > >>>>> *table.exec.legacy-cast-behaviour* as deprecated, and we want to > > >>>> completely > > >>>>> remove it in the future, if the default value is *DISABLED*, when > > >> it's > > >>>>> removed we also make a breaking change, by changing the > > >>>>> default behaviour for all users, which is against the common > software > > >>>>> practices. You introduce a deprecated flag to help users > > >>>>> using old versions of the software to smoothly transition to the > new > > >>>>> version, while the new users experience the new features/behaviour, > > >>>>> without the need to set a flag. So, when in the future this flag is > > >>>>> completely removed, for those "new" users it would be completely > > >>>>> transparent. If we continue with having the default value > *ENABLED*, > > >>>> those > > >>>>> new user would experience an "unnatural" breaking change > > >>>>> when this option is completely removed. > > >>>>> > > >>>>> Best, > > >>>>> Marios > > >>>>> > > >>> > > >>> -- > > >>> Marios > > >>> > > > > > > -- > Marios >