+1, the existing "NULL on error" behavior is terrible for data quality.
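For anyone who hasn't hit it, this is roughly the difference in behavior (a rough spark-shell sketch, not taken from any test; it assumes a local SparkSession `spark`):

    // Current default (spark.sql.ansi.enabled=false): an invalid cast is
    // silently turned into NULL, so the bad input just disappears from the data.
    spark.conf.set("spark.sql.ansi.enabled", false)
    spark.sql("SELECT CAST('abc' AS INT) AS v").show()
    // -> one row whose value is NULL

    // With ANSI mode on, the same query fails loudly instead of hiding it.
    spark.conf.set("spark.sql.ansi.enabled", true)
    spark.sql("SELECT CAST('abc' AS INT) AS v").show()
    // -> throws a runtime error (a CAST_INVALID_INPUT-style exception),
    //    pointing users to try_cast() if NULL-on-error is really intended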
I have one concern about error reporting with DataFrame APIs: query execution is lazy, so the place where an error happens can be far away from where the DataFrame/column was created. We are improving this (PR <https://github.com/apache/spark/pull/45377>), but it's not fully done yet.
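To make that concrete, a minimal sketch of the kind of pipeline I mean (made-up column names, ANSI mode assumed on):

    import org.apache.spark.sql.functions.col

    val raw = spark.createDataFrame(Seq(("1", 1), ("abc", 2))).toDF("amount", "id")

    // The problematic expression is defined here...
    val parsed = raw.withColumn("amount_int", col("amount").cast("int"))

    // ...but nothing fails yet, because execution is lazy. The plan can pass
    // through many more transformations, functions, or even files.
    val result = parsed.groupBy("id").sum("amount_int")

    // Only the action runs the cast, so the runtime error surfaces here,
    // far from the line that created `amount_int`.
    result.show()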
On Fri, Apr 12, 2024 at 2:21 PM L. C. Hsieh <vii...@gmail.com> wrote:

> +1
>
> I believe ANSI mode is well developed after many releases. No doubt it
> could be used.
> Since it is very easy to disable it to restore the current behavior, I
> guess the impact could be limited.
> Do we know the possible impacts, such as what the major changes are
> (e.g., what kinds of queries/expressions will fail)? We can describe
> them in the release notes.
>
> On Thu, Apr 11, 2024 at 10:29 PM Gengliang Wang <ltn...@gmail.com> wrote:
> >
> > +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly
> > enhance data quality and integrity. I fully support this initiative.
> >
> > > In other words, the current Spark ANSI SQL implementation becomes the
> > > first implementation for Spark SQL users to face while providing
> > > `spark.sql.ansi.enabled=false` in the same way without losing any
> > > capability.
> >
> > BTW, the try_* functions and the SQL Error Attribution Framework will
> > also be beneficial in migrating to ANSI SQL mode.
> >
> > Gengliang
> >
> > On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >>
> >> Hi, All.
> >>
> >> Thanks to you, we've been achieving many things and have ongoing SPIPs.
> >> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more
> >> narrowly by asking your opinions about Apache Spark's ANSI SQL mode.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-44111
> >> Prepare Apache Spark 4.0.0
> >>
> >> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the
> >> desirable items for 4.0.0 because it's a big behavior change.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-44444
> >> Use ANSI SQL mode by default
> >>
> >> Historically, spark.sql.ansi.enabled was added in Apache Spark 3.0.0
> >> and has been aiming to provide better Spark SQL compatibility in a
> >> standard way. We also have a daily CI to protect the behavior.
> >>
> >> https://github.com/apache/spark/actions/workflows/build_ansi.yml
> >>
> >> However, it's still behind the configuration, with several known
> >> issues, e.g.,
> >>
> >> SPARK-41794 Reenable ANSI mode in test_connect_column
> >> SPARK-41547 Reenable ANSI mode in test_connect_functions
> >> SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
> >>
> >> To be clear, we know that many DBMSes have their own implementations
> >> of the SQL standard and they are not the same. Like them, SPARK-44444
> >> aims to enable only the existing Spark configuration,
> >> `spark.sql.ansi.enabled=true`. There is nothing more than that.
> >>
> >> In other words, the current Spark ANSI SQL implementation becomes the
> >> first implementation for Spark SQL users to face while providing
> >> `spark.sql.ansi.enabled=false` in the same way without losing any
> >> capability.
> >>
> >> If we don't want this change for some reason, we can simply exclude
> >> SPARK-44444 from SPARK-44111 as a part of the Apache Spark 4.0.0
> >> preparation. It's time to make a go/no-go decision on this item as
> >> part of the global optimization for the Apache Spark 4.0.0 release.
> >> After 4.0.0, it's unlikely that we will aim for this again for the
> >> next four years, until 2028.
> >>
> >> WDYT?
> >>
> >> Bests,
> >> Dongjoon
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org