> NULL in SQL essentially means "UNKNOWN"; it's not as scary as a null in
> Java, which will easily cause an NPE, or some random behavior with a C++
> function call.

This is true from the user's point of view, except that our runtime doesn't
treat null as a value on which you can safely execute operations and get
"no-op" results. In our runtime, null is Java's null, so it causes issues
and generates NPEs here and there where nulls are not expected.

> It will really create a big mess after users upgrade their SQL jobs

This is what I don't really understand: how does adding a configuration
option cause issues here? We make it very clear in the release notes that
you need to switch that flag if you're relying on the old behavior, and
that's it: if you reprocess jobs every time you upgrade, you just flip the
switch before reprocessing and you won't have any issues. If you don't,
because you use the hybrid source, you either upgrade your query or flip
the flag, and in neither case should this generate any issue.
Since it's a big change, I also expect to keep this flag for several
releases, at least up to Flink 2.

On Sat, Nov 20, 2021 at 7:25 AM Kurt Young <ykt...@gmail.com> wrote:

> Hi Francesco,
>
> Thanks for sharing your opinion about this and the examples from other
> programming languages. I just want to mention that NULL in the SQL world
> has a somewhat different meaning than in programming languages like Java.
>
> NULL in SQL essentially means "UNKNOWN"; it's not as scary as a null in
> Java, which will easily cause an NPE, or some random behavior with a C++
> function call. UNKNOWN means it could be any value. In Java, the
> condition "null == null" always returns true. But in SQL, it returns
> NULL, which means UNKNOWN.
>
> Another example: if you run the following statements:
> select 'true' where 3 in (1, 2, 3, null) -- this will print true
> select 'true' where 3 not in (1, 2, null) -- this won't print anything
>
> In summary, SQL's NULL is a bit different from other languages' null; it
> has its own meaning. So I wouldn't compare the behavior of returning
> NULL with programming languages and then judge it as bad behavior. And
> it's not a very big deal if we return NULL when trying to cast "abc" to
> an integer; it means we don't know the correct value.
>
> Still, I'm OK with changing the behavior, just not now. It would really
> create a big mess after users upgrade their SQL jobs. I'm fine with
> doing it either in some really big version change like Flink 2.0, or
> after we have some universal error-record handling mechanism, so that
> users have a chance to handle such situations.
>
> Best,
> Kurt
>
>
> On Fri, Nov 19, 2021 at 7:29 PM Francesco Guardiani <
> france...@ververica.com>
> wrote:
>
> > Hi all,
> >
> > tl;dr:
> >
> > I think Timo pretty much said it all. As described in the issue, my
> > proposal is:
> >
> > * Let's switch the default behavior of CAST to fail
> > * Let's add TRY_CAST to have the old behavior
> > * Let's add a rule (disabled by default) that wraps every CAST in a
> > TRY, in order to keep the old behavior.
> > * Let's put a giant warning in the release notes explaining to enable
> > the rule, in case you're depending on the old behavior.
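To make the proposed split concrete, here is a hypothetical illustration
in Python; `cast_strict` and `try_cast` are made-up names for the sketch,
not actual Flink API:

```python
# Hypothetical illustration of the proposed CAST / TRY_CAST split.
# Names and signatures are invented for this sketch; this is not Flink API.

def cast_strict(value: str) -> int:
    """Proposed CAST: fail loudly on invalid input."""
    return int(value)  # raises ValueError for input like "abc"

def try_cast(value: str):
    """Proposed TRY_CAST: the old behavior, returning NULL (None) on failure."""
    try:
        return int(value)
    except ValueError:
        return None

print(try_cast("42"))   # 42
print(try_cast("abc"))  # None
try:
    cast_strict("abc")
except ValueError as err:
    print(f"cast failed with a descriptive error: {err}")
```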
> >
> > This way, we break no SQL scripts, as you can apply this flag to every
> > previously running script. We can also think of a strategy other than
> > the planner rule to keep the old behavior, always behind a flag that
> > is disabled by default.
> >
> > Timing of this proposal is also crucial, since CAST is a basic
> > primitive of our language and, once we have the upgrade story in
> > place, this is going to be a whole lot harder to deal with.
> >
> > And I would say that in the near future we should start thinking about
> > supporting proper error handling strategies, that is:
> >
> > * How are users supposed to handle records that fail an expression
> > computation, aggregation, etc.?
> > * Can we provide some default strategies, like log-and-discard or
> > send-to-a-dead-letter-queue?
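Roughly, such strategies could look like the following minimal sketch
(`process`, `dead_letter_queue`, and the strategy labels are all
illustrative names, not an actual Flink mechanism):

```python
# Sketch of per-record error-handling strategies; not Flink API.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")

dead_letter_queue = []  # stand-in for e.g. a side output or a Kafka topic

def process(records, transform, strategy="dead_letter"):
    """Apply transform to each record, routing failures by strategy."""
    results = []
    for record in records:
        try:
            results.append(transform(record))
        except Exception as err:
            if strategy == "log_and_discard":
                log.warning("dropping record %r: %s", record, err)
            elif strategy == "dead_letter":
                dead_letter_queue.append((record, str(err)))
            else:
                raise  # "fail" strategy: surface the error to the job
    return results

print(process(["1", "2", "abc", "4"], int))  # [1, 2, 4]
print(dead_letter_queue)  # the "abc" record with its error message
```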
> >
> > Now let me go a bit deeper into the reasons we really need this change:
> >
> > For me the issue is not really about being compliant with the SQL
> > standard, or the fact that other databases behave differently from us,
> > but the fact that the CAST function we have right now is effectively a
> > footgun <https://en.wiktionary.org/wiki/footgun> for our users.
> > The concept of casting one value to another inherently involves some
> > concept of failure; this is something I expect as a programmer,
> > exactly like when dividing a value by 0 or when sending a message to
> > an external system. This is why every programming language has some
> > explicit way to signal that such failures exist, and some of them even
> > force you to deal with those failures: Java has exceptions and the
> > try-catch block, Rust has the ? operator, Go returns an error together
> > with the result. Not failing when failure is inherent in the operation
> > itself, or not even being explicit about the fact that the operation
> > can fail, leads users to mistakenly think the operation they're
> > performing is always safe and cannot lead to failures. And this is
> > IMHO really the problem with the CAST primitive we have right now: it
> > has a concept of failure but we hide it, we're not even explicit about
> > the fact that we're hiding it [1], and we expect users to go to the
> > documentation, read that CAST can return NULL, and then deal with it.
> > I even question why, for example, we return NULL rather than a sane
> > default value when an exception happens; e.g. it would be way better
> > to return epoch 0 rather than NULL when failing a cast from STRING to
> > TIMESTAMP. This is, for example, the approach taken by Go: the
> > string-to-timestamp parsing function returns both a sane value and an
> > error.
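To illustrate that Go-style shape, a hedged sketch in Python
(`parse_timestamp` and the chosen format are invented for the example;
Go's actual time.Parse similarly returns a (time.Time, error) pair):

```python
# Go-style parsing sketch: always return a (value, error) pair.
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # the "sane default" value

def parse_timestamp(s: str):
    """On failure the caller gets epoch 0 plus the error, never a bare NULL."""
    try:
        ts = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        return ts, None
    except ValueError as err:
        return EPOCH, err

ts, err = parse_timestamp("2021-11-19 12:00:00")
print(ts, err)  # a valid timestamp and None

ts, err = parse_timestamp("abc")
print(ts, err is None)  # epoch 0 and False (an error was returned)
```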
> >
> > And the cherry on top is that, for our users, the consequences of a
> > bad usage of CAST are simply disastrous: best case, some operator
> > fails with an NPE; worst case, you get bad results or some data is
> > even lost down the pipeline. We give no indication at all that the
> > cast failed, and even if we pushed a change to log "hey, this cast
> > failed on this record", it would still be extremely complicated to
> > track down how badly a single cast failure affected the results of a
> > projection, a grouping, an aggregation, etc. Hence my definition of
> > our CAST function as a footgun.
> >
> > The bottom line for me is that our CAST primitive goes directly
> > against the goal of Flink SQL, which is to provide a simple-to-use API
> > for developers and business people to build computation pipelines:
> > it's not explicit, it silently fails with NULLs, and we require users
> > to deal with that.
> >
> > The very same discussion applies to TO_TIMESTAMP, which among the
> > others might be even more crucial, because we use it directly in our
> > documentation to tell users how to compute rowtime.
> >
> > FG
> >
> > [1] Note: here the naming is a fundamental part of the issue. The
> > function we have today is named CAST, not TRY_CAST or CAST_OR_NULL or
> > any other name indicating that the operation might fail and produce a
> > result different from the cast result.
> >
> >
> > On Fri, Nov 19, 2021 at 4:00 AM Kurt Young <ykt...@gmail.com> wrote:
> >
> > > Hi Timo,
> > >
> > > Regarding CAST, I think no one denies the standard behavior, which
> > > should be to raise an error on failure. The only question is how we
> > > get there, given that lots of users already rely on the current,
> > > more tolerant behavior. A violation of the standard that is
> > > otherwise acceptable behavior doesn't deserve a breaking change in a
> > > Flink minor version, IMO; I'm more comfortable fixing it in a
> > > version like Flink 2.0.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Thu, Nov 18, 2021 at 11:44 PM Timo Walther <twal...@apache.org>
> > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > >
> > > > thanks for finally having this discussion on the mailing list. As
> > > > both a contributor and a user, I have experienced a couple of
> > > > issues around nullability coming out of nowhere in a pipeline.
> > > > This discussion should cover not only CAST but failure handling in
> > > > general.
> > > >
> > > > Let me summarize my opinion:
> > > >
> > > > 1) CAST vs. TRY_CAST
> > > >
> > > > CAST is a SQL standard core operation with well-defined semantics
> > > > across all major SQL vendors. There should be no discussion about
> > > > whether it returns NULL or an error; the semantics are already
> > > > defined externally. I don't agree with "Streaming computing is a
> > > > resident program ... users do not want it to frequently fail"; the
> > > > same argument is also true for nightly batch jobs. A batch job can
> > > > also get stuck on a SQL statement that the user has not defined
> > > > leniently enough.
> > > >
> > > > An option that restores the old behavior and TRY_CAST for the future
> > > > should solve this use case and make all parties happy.
> > > >
> > > > 2) TO_TIMESTAMP / TO_DATE
> > > >
> > > > We should distinguish between CASTING and CONVERSION / PARSING.
> > > > As a user, I would expect that parsing can fail and that I have to
> > > > deal with this accordingly. Therefore, I'm fine with returning
> > > > NULL in TO_ or CONVERT_ functions. This is also consistent with
> > > > other vendors. Take SQL Server's PARSE as an example [1]: "If a
> > > > parameter with a null value is passed at run time, then a null is
> > > > returned, to avoid canceling the whole batch." Here we can be more
> > > > flexible with the semantics because users need to read the docs
> > > > anyway.
> > > >
> > > > 3) Null at other locations
> > > >
> > > > In general, we should stick to our data type constraints.
> > > > Everything else will mess up the architecture of
> > > > functions/connectors and their return types. Take the rowtime
> > > > (event-time timestamp) attribute as an example: PRs like the one
> > > > for FLINK-24885 [2] are just the tip of the iceberg. If we allowed
> > > > rowtime columns to be NULL, we would need to check all time-based
> > > > operators and implement additional handling logic for this.
> > > >
> > > > It would be better to define unified error handling for operators
> > > > and maybe drop rows if the per-element processing failed. We
> > > > should have a unified approach to logging such records or sending
> > > > them to a side output.
> > > >
> > > > Until this is in place, I would suggest we spend some time on
> > > > rules, enabled via an option, that modify the plan and wrap
> > > > frequently failing expressions in a generic TRY() function. In
> > > > this case, we don't need to deal with NULL in all built-in
> > > > functions, we can throw helpful errors during development, and we
> > > > can return NULL even though the return type is NOT NULL. It would
> > > > also make the NULL-returning behavior explicit in the plan.
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > [1]
> > > > https://docs.microsoft.com/en-us/sql/t-sql/functions/parse-transact-sql?view=sql-server-ver15
> > > > [2] https://issues.apache.org/jira/browse/FLINK-24885
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 18.11.21 11:34, Kurt Young wrote:
> > > > > Sorry, I forgot to add the user ML. I would also like to gather
> > > > > some user feedback on this, since I haven't gotten any feedback
> > > > > from users on this topic before.
> > > > >
> > > > > Best,
> > > > > Kurt
> > > > >
> > > > >
> > > > > On Thu, Nov 18, 2021 at 6:33 PM Kurt Young <ykt...@gmail.com>
> wrote:
> > > > >
> > > > >> (added user ML to this thread)
> > > > >>
> > > > >> HI all,
> > > > >>
> > > > >> I would like to raise a different opinion about this change. I
> > > > >> agree with Ingo that we should not just break existing
> > > > >> behavior, and even if we introduce an option to control the
> > > > >> behavior, I would propose to set the default value to the
> > > > >> current behavior.
> > > > >>
> > > > >> I want to mention one angle for assessing whether we should
> > > > >> change it or not, which is "what could users gain from the
> > > > >> change". To me, it looks like:
> > > > >>
> > > > >> * new users: happy about the behavior
> > > > >> * existing users: suffer from the change; it either forces them
> > > > >> to modify their SQL or gets them a late-night call reporting
> > > > >> that their online job crashed and couldn't restart.
> > > > >>
> > > > >> I would like to mention another breaking change we made when
> > > > >> we adjusted the time-related functions in FLIP-162 [1]. In that
> > > > >> case, both new and existing users suffered from the
> > > > >> *incorrectly* implemented time function behavior, and we saw a
> > > > >> lot of feedback and complaints from various channels. After we
> > > > >> fixed that, we never saw related problems again.
> > > > >>
> > > > >> Back to this topic: have we ever seen a user complain about
> > > > >> the current CAST behavior? From my side, no.
> > > > >>
> > > > >> To summarize:
> > > > >>
> > > > >> +1 to introduce TRY_CAST to better prepare for the future.
> > > > >> -1 to modify the default behavior.
> > > > >> +0 to introduce a config option with the default set to the
> > > > >> existing behavior. It's +0 because it doesn't seem necessary
> > > > >> given that I'm -1 on changing the default behavior and I also
> > > > >> don't see an urgent need to modify it.
> > > > >>
> > > > >>
> > > > >> [1]
> > > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-162%3A+Consistent+Flink+SQL+time+function+behavior
> > > > >>
> > > > >> Best,
> > > > >> Kurt
> > > > >>
> > > > >>
> > > > >> On Thu, Nov 18, 2021 at 4:26 PM Ingo Bürk <i...@ververica.com>
> > wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> first of all, thanks for the summary of both sides, and for
> > > > >>> bringing up the discussion on this.
> > > > >>> I think it is obvious that this is not something we can just
> > > > >>> "break", so the config option seems mandatory to me.
> > > > >>>
> > > > >>> Overall I agree with Martijn and Till that throwing errors is
> > > > >>> the more expected behavior. I mostly think this is valuable
> > > > >>> default behavior because it allows developers to find mistakes
> > > > >>> early and diagnose them much more easily, compared to having
> > > > >>> to "work backwards" and figure out that it is the CAST that
> > > > >>> failed. It also means that pipelines using TRY_CAST are
> > > > >>> self-documenting, because using it can signal "we might
> > > > >>> receive broken data here".
> > > > >>>
> > > > >>>
> > > > >>> Best
> > > > >>> Ingo
> > > > >>>
> > > > >>> On Thu, Nov 18, 2021 at 9:11 AM Till Rohrmann <
> > trohrm...@apache.org>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Hi everyone,
> > > > >>>>
> > > > >>>> personally, I would also prefer the system telling me that
> > > > >>>> something is wrong instead of silently ignoring records. If
> > > > >>>> there is a TRY_CAST function that has the old behaviour,
> > > > >>>> people can still get the old behaviour. For backwards
> > > > >>>> compatibility reasons it is a good idea to introduce a switch
> > > > >>>> to get back the old behaviour. By default, we could set it to
> > > > >>>> the new behaviour, though. Of course, we should explicitly
> > > > >>>> document this new behaviour so that people are aware of it
> > > > >>>> before running their jobs for days and then encountering an
> > > > >>>> invalid input.
> > > > >>>>
> > > > >>>> Cheers,
> > > > >>>> Till
> > > > >>>>
> > > > >>>> On Thu, Nov 18, 2021 at 9:02 AM Martijn Visser <
> > > mart...@ververica.com
> > > > >
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Hi Caizhi,
> > > > >>>>>
> > > > >>>>> Thanks for bringing this up for discussion. I think the
> > > > >>>>> important part is what developers expect as the default
> > > > >>>>> behaviour of a CAST function when the cast fails. If I look
> > > > >>>>> at Postgres [1] or MSSQL [2], the default behaviour of a
> > > > >>>>> failing CAST is to return an error, which would be the new
> > > > >>>>> behaviour. Returning a value when a CAST fails can leave
> > > > >>>>> users unable to immediately understand where that value came
> > > > >>>>> from. So, I would be in favor of the new behaviour by
> > > > >>>>> default, but including a configuration flag to maintain the
> > > > >>>>> old behaviour, to avoid the need to rewrite all these jobs.
> > > > >>>>>
> > > > >>>>> Best regards,
> > > > >>>>>
> > > > >>>>> Martijn
> > > > >>>>>
> > > > >>>>> [1] https://www.postgresql.org/docs/current/sql-createcast.html
> > > > >>>>> [2]
> > > > >>>>> https://docs.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver15
> > > > >>>>>
> > > > >>>>> On Thu, 18 Nov 2021 at 03:17, Caizhi Weng <
> tsreape...@gmail.com>
> > > > >>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi devs!
> > > > >>>>>>
> > > > >>>>>> We're discussing the behavior of casting functions
> > > > >>>>>> (including cast, to_timestamp, to_date, etc.) for invalid
> > > > >>>>>> input in https://issues.apache.org/jira/browse/FLINK-24924.
> > > > >>>>>> As this topic is crucial to compatibility and usability,
> > > > >>>>>> we'd like to continue discussing it publicly on the mailing
> > > > >>>>>> list.
> > > > >>>>>>
> > > > >>>>>> The main topic is whether casting functions should return
> > > > >>>>>> null (keeping the current behavior) or throw exceptions
> > > > >>>>>> (introducing a new behavior). I'll try to summarize the
> > > > >>>>>> ideas on both sides. Correct me if I miss something.
> > > > >>>>>>
> > > > >>>>>> *From the devs who support throwing exceptions (new
> > > > >>>>>> behavior):*
> > > > >>>>>>
> > > > >>>>>> The main concern is that if we silently return null, then
> > > > >>>>>> unexpected results or exceptions (mainly
> > > > >>>>>> NullPointerException) may be produced. It will be hard for
> > > > >>>>>> users to reason about this because there is no detailed
> > > > >>>>>> message. If we throw an exception in the first place, then
> > > > >>>>>> it's much easier to catch the exception with a nice,
> > > > >>>>>> detailed message explaining what is going wrong. Especially
> > > > >>>>>> for DATE/TIME/TIMESTAMP it's very helpful to have a
> > > > >>>>>> detailed error and see where and why the parsing broke.
> > > > >>>>>>
> > > > >>>>>> For compatibility concerns, we can provide a TRY_CAST
> > > > >>>>>> function which is exactly the same as the current CAST
> > > > >>>>>> function, returning null for invalid input.
> > > > >>>>>>
> > > > >>>>>> *From the devs who support returning null (current
> > > > >>>>>> behavior):*
> > > > >>>>>>
> > > > >>>>>> The main concern is compatibility and usability.
> > > > >>>>>>
> > > > >>>>>> On usability: The upstream system may occasionally produce
> > > > >>>>>> invalid data, and if we throw an exception when seeing it,
> > > > >>>>>> the job will fail again and again even after restarts
> > > > >>>>>> (because the invalid data is always there). A streaming job
> > > > >>>>>> is a long-running program, and users do not want it to fail
> > > > >>>>>> frequently without being able to recover automatically.
> > > > >>>>>> Most users are willing to just skip such a record and
> > > > >>>>>> continue processing. Imagine an online job running for a
> > > > >>>>>> couple of weeks that suddenly fails due to some unexpected
> > > > >>>>>> dirty data. What choices do users have to quickly resume
> > > > >>>>>> the job?
> > > > >>>>>>
> > > > >>>>>> On compatibility: There are currently thousands of users
> > > > >>>>>> and tens of thousands of jobs relying on the current
> > > > >>>>>> behavior to filter out invalid input. If we change the
> > > > >>>>>> behavior, it will be a disaster for users, as they will
> > > > >>>>>> have to rewrite and check their SQL very carefully.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> What do you think? We're looking forward to your feedback.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > > >
> > > >
> > > >
> > >
> >
>
