This is a side issue, but I’d like to bring people’s attention to SPARK-28024. 

Cases 2, 3, and 4 described in that ticket are still problems today on master 
(I just rechecked) even with ANSI mode enabled.

Well, maybe not problems exactly, but I’m flagging them since Spark’s behavior 
in these cases differs from Postgres, as described in the ticket.
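For anyone who hasn’t followed the ANSI work closely, here is a plain-Python analogy (not Spark code; the function names are mine) of the kind of behavior `spark.sql.ansi.enabled` governs, and where the `try_*` functions fit in: legacy mode returns NULL on invalid operations, ANSI mode raises a runtime error, and the `try_*` variants restore NULL-on-failure even under ANSI.

```python
# Plain-Python analogy of Spark's ANSI-mode semantics (illustrative only;
# these are NOT Spark APIs, just sketches of the three behaviors).

def cast_to_int_legacy(s):
    """Non-ANSI (legacy) mode: an invalid cast silently yields NULL (None)."""
    try:
        return int(s)
    except ValueError:
        return None

def cast_to_int_ansi(s):
    """ANSI mode: the same invalid cast raises a runtime error instead."""
    return int(s)  # raises ValueError, analogous to a CAST failure under ANSI

def try_divide(a, b):
    """Analogy for Spark's try_divide: NULL instead of an ANSI-mode error."""
    try:
        return a / b
    except ZeroDivisionError:
        return None

print(cast_to_int_legacy("abc"))  # None: the silent-NULL behavior ANSI removes
print(try_divide(1, 0))           # None: the try_* escape hatch under ANSI
try:
    cast_to_int_ansi("abc")
except ValueError:
    print("ANSI-style failure")
```

The point of the analogy is only that the flag trades silent NULLs for loud errors, and `try_*` gives back the old tolerant behavior where a query author explicitly opts into it.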


> On Apr 12, 2024, at 12:09 AM, Gengliang Wang <ltn...@gmail.com> wrote:
> 
> 
> +1, enabling Spark's ANSI SQL mode in version 4.0 will significantly enhance 
> data quality and integrity. I fully support this initiative.
> 
> > In other words, the current Spark ANSI SQL implementation becomes the first 
> > implementation Spark SQL users face, while `spark.sql.ansi.enabled=false` 
> > remains available in the same way without losing any capability.
> 
> BTW, the try_* 
> <https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#useful-functions-for-ansi-mode>
>  functions and SQL Error Attribution Framework 
> <https://issues.apache.org/jira/browse/SPARK-38615> will also be beneficial 
> in migrating to ANSI SQL mode.
> 
> 
> Gengliang
> 
> 
> On Thu, Apr 11, 2024 at 7:56 PM Dongjoon Hyun <dongjoon.h...@gmail.com 
> <mailto:dongjoon.h...@gmail.com>> wrote:
>> Hi, All.
>> 
>> Thanks to you, we've been achieving many things and have ongoing SPIPs.
>> I believe it's time to scope Apache Spark 4.0.0 (SPARK-44111) more narrowly
>> by asking your opinions about Apache Spark's ANSI SQL mode.
>> 
>>     https://issues.apache.org/jira/browse/SPARK-44111
>>     Prepare Apache Spark 4.0.0
>> 
>> SPARK-44444 was proposed last year (on 15/Jul/23) as one of the desirable
>> items for 4.0.0 because it's a big behavior change.
>> 
>>     https://issues.apache.org/jira/browse/SPARK-44444
>>     Use ANSI SQL mode by default
>> 
>> Historically, spark.sql.ansi.enabled was added in Apache Spark 3.0.0 and has
>> been aiming to provide better Spark SQL compatibility in a standard way.
>> We also have a daily CI to protect the behavior.
>> 
>>     https://github.com/apache/spark/actions/workflows/build_ansi.yml
>> 
>> However, it's still behind a configuration flag, with several known issues, e.g.,
>> 
>>     SPARK-41794 Reenable ANSI mode in test_connect_column
>>     SPARK-41547 Reenable ANSI mode in test_connect_functions
>>     SPARK-46374 Array Indexing is 1-based via ANSI SQL Standard
>> 
>> To be clear, we know that many DBMSes have their own implementations of the
>> SQL standard and they are not all the same. Like them, SPARK-44444 aims only
>> to enable the existing Spark configuration, `spark.sql.ansi.enabled=true`.
>> There is nothing more than that.
>> 
>> In other words, the current Spark ANSI SQL implementation becomes the first
>> implementation Spark SQL users face, while `spark.sql.ansi.enabled=false`
>> remains available in the same way without losing any capability.
>> 
>> If we don't want this change for some reason, we can simply exclude
>> SPARK-44444 from SPARK-44111 as part of the Apache Spark 4.0.0 preparation.
>> It's time to make a go/no-go decision on this item as part of the global
>> optimization for the Apache Spark 4.0.0 release. After 4.0.0, it's unlikely
>> we will aim for this again for the next four years, until 2028.
>> 
>> WDYT?
>> 
>> Bests,
>> Dongjoon
