Hi everyone,

I'd like to call for a new vote on SPARK-28885
<https://issues.apache.org/jira/browse/SPARK-28885> "Follow ANSI store
assignment rules in table insertion by default" after revising the ANSI
store assignment policy (SPARK-29326
<https://issues.apache.org/jira/browse/SPARK-29326>).
When inserting a value into a column of a different data type, Spark
performs type coercion. Currently, we support three policies for the store
assignment rules: ANSI, Legacy and Strict, which can be set via the option
"spark.sql.storeAssignmentPolicy" (a short sketch after the list below
illustrates the differences):
1. ANSI: Spark performs the store assignment as per ANSI SQL. In practice,
the behavior is mostly the same as PostgreSQL. It disallows certain
unreasonable type conversions such as converting `string` to `int` and
`double` to `boolean`. It will throw a runtime exception if the value is
out of range (overflow).
2. Legacy: Spark allows the store assignment as long as it is a valid
`Cast`, which is very loose. E.g., converting either `string` to `int` or
`double` to `boolean` is allowed. It is the current behavior in Spark 2.x
for compatibility with Hive. When inserting an out-of-range value to an
integral field, the low-order bits of the value are inserted (the same as
Java/Scala numeric type casting). For example, if 257 is inserted into a
field of Byte type, the result is 1.
3. Strict: Spark doesn't allow any possible precision loss or data
truncation in store assignment, e.g., converting either `double` to `int`
or `decimal` to `double` is not allowed. The rules were originally designed
for the Dataset encoder. As far as I know, no mainstream DBMS is using this
policy by default.
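To make the differences concrete, here is a minimal Spark SQL sketch. The
table name `t` and its single `BYTE` column are hypothetical, chosen only
for illustration, and the comments simply restate the behaviors described
above (assuming the policy can be switched per session via SET):

    CREATE TABLE t (b BYTE) USING parquet;

    SET spark.sql.storeAssignmentPolicy=ANSI;
    INSERT INTO t VALUES ('1');  -- rejected: string-to-numeric store assignment is disallowed
    INSERT INTO t VALUES (257);  -- runtime exception: 257 is out of range for BYTE

    SET spark.sql.storeAssignmentPolicy=LEGACY;
    INSERT INTO t VALUES ('1');  -- allowed: the string is cast to BYTE, 1 is stored
    INSERT INTO t VALUES (257);  -- allowed: only the low-order bits are kept, 1 is stored

    SET spark.sql.storeAssignmentPolicy=STRICT;
    INSERT INTO t VALUES (1);    -- rejected: INT to BYTE may truncate, so strict mode refuses it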

Currently, the V1 data sources use the "Legacy" policy by default, while V2
uses "Strict". This proposal is to use the "ANSI" policy by default for both
V1 and V2 in Spark 3.0.
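
If the proposal is adopted, users who depend on the old behavior should
still be able to switch back per session (a sketch, assuming the existing
config values remain available):

    SET spark.sql.storeAssignmentPolicy=LEGACY;  -- restore the Spark 2.x store assignment behavior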

This vote is open until Friday (Oct. 11).

[ ] +1: Accept the proposal
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thank you!

Gengliang
