+1, the existing "NULL on error" behavior is terrible for data quality.
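
As a minimal sketch of that behavior (assuming a running SparkSession named
`spark`): with ANSI mode off, a failing expression silently becomes NULL,
while with ANSI mode on, the same query raises an error.

    // ANSI off (current default): the error is swallowed as NULL
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT 1/0 AS x").show()   // x is NULL

    // ANSI on: the same query fails loudly
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT 1/0 AS x").show()   // throws [DIVIDE_BY_ZERO]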

I have one concern about error reporting with DataFrame APIs. Query
execution is lazy, so the place where an error happens can be far away from
where the DataFrame/column was created. We are improving this (PR
<https://github.com/apache/spark/pull/45377>), but it's not fully done yet.
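
For instance (a hypothetical sketch, assuming a SparkSession named `spark`),
the faulty column can be defined long before anything executes:

    // the invalid cast is declared here...
    val df = spark.range(1).selectExpr("CAST('abc' AS INT) AS v")
    // ...but under ANSI mode the CAST_INVALID_INPUT error is only raised
    // later, at the action, far from where the column was defined
    df.collect()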

On Fri, Apr 12, 2024 at 2:21 PM L. C. Hsieh <vii...@gmail.com> wrote:

> +1
>
> I believe ANSI mode is well developed after many releases, so there is no
> doubt it can be used.
> Since it is very easy to disable it and restore the current behavior, the
> impact should be limited (see the sketch below).
> Do we know the possible impacts, i.e., what are the major changes (e.g.,
> what kinds of queries/expressions will fail)? We can describe them in the
> release note.
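>
> As a minimal sketch of how easy the opt-out is (assuming a SparkSession
> named `spark`), restoring the current behavior is a single configuration
> change:
>
>     // revert to the pre-4.0, non-ANSI behavior
>     spark.conf.set("spark.sql.ansi.enabled", "false")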
>
> On Thu, Apr 11, 2024 at 10:29 PM Gengliang Wang <ltn...@gmail.com> wrote:
> >
> >
> > +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly
> enhance data quality and integrity. I fully support this initiative.
> >
> > > In other words, the current Spark ANSI SQL implementation becomes the
> > > first behavior Spark SQL users face, while `spark.sql.ansi.enabled=false`
> > > remains available in the same way, without losing any capability.
> >
> > BTW, the try_* functions and SQL Error Attribution Framework will also
> be beneficial in migrating to ANSI SQL mode.
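> >
> > As a small sketch of that migration aid (assuming a SparkSession named
> > `spark`), the try_* variants return NULL on failure even with ANSI mode
> > enabled:
> >
> >     spark.sql("SELECT try_divide(1, 0), try_cast('abc' AS INT)").show()
> >     // both values are NULL instead of raising DIVIDE_BY_ZERO /
> >     // CAST_INVALID_INPUT errors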
> >
> >
> > Gengliang
> >
> >
> > On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
> >>
> >> Hi, All.
> >>
> >> Thanks to you, we have achieved many things and have ongoing SPIPs.
> >> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more
> narrowly
> >> by asking your opinions about Apache Spark's ANSI SQL mode.
> >>
> >>     https://issues.apache.org/jira/browse/SPARK-44111
> >>     Prepare Apache Spark 4.0.0
> >>
> >> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the
> >> desirable items for 4.0.0 because it's a big behavior change.
> >>
> >>     https://issues.apache.org/jira/browse/SPARK-44444
> >>     Use ANSI SQL mode by default
> >>
> >> Historically, spark.sql.ansi.enabled was added in Apache Spark 3.0.0 and
> >> has aimed to provide better Spark SQL compatibility in a standard way.
> >> We also have a daily CI job to protect the behavior.
> >>
> >>     https://github.com/apache/spark/actions/workflows/build_ansi.yml
> >>
> >> However, it's still gated behind the configuration, with several known
> >> issues, e.g.,
> >>
> >>     SPARK-41794 Reenable ANSI mode in test_connect_column
> >>     SPARK-41547 Reenable ANSI mode in test_connect_functions
> >>     SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
> >>
> >> To be clear, we know that many DBMSes have their own implementations of
> >> the SQL standard, and they are not all the same. Like them, SPARK-44444
> >> aims only to enable the existing Spark configuration,
> >> `spark.sql.ansi.enabled=true`. There is nothing more than that.
> >>
> >> In other words, the current Spark ANSI SQL implementation becomes the
> >> first behavior Spark SQL users face, while `spark.sql.ansi.enabled=false`
> >> remains available in the same way, without losing any capability.
> >>
> >> If we don't want this change for some reason, we can simply exclude
> >> SPARK-44444 from SPARK-44111 as part of the Apache Spark 4.0.0
> >> preparation. It's time to make a go/no-go decision on this item as part
> >> of the overall planning for the Apache Spark 4.0.0 release. After 4.0.0,
> >> we are unlikely to aim for this again for the next four years, until
> >> 2028.
> >>
> >> WDYT?
> >>
> >> Bests,
> >> Dongjoon
>
