Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Dongjoon Hyun
+1

Bests,
Dongjoon

On Thu, Oct 10, 2019 at 10:14 Ryan Blue  wrote:

> +1
>
> Thanks for fixing this!
>
> On Thu, Oct 10, 2019 at 6:30 AM Xiao Li  wrote:
>
>> +1
>>
>> On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:
>>
>>> +1 (binding)
>>>
>>> 2019년 10월 10일 (목) 오후 5:11, Takeshi Yamamuro 님이
>>> 작성:
>>>
 Thanks for the great work, Gengliang!

 +1 for that.
 As I said before, the behaviour is pretty common in DBMSs, so the change
 helps DBMS users.

 Bests,
 Takeshi


 On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
 gengliang.w...@databricks.com> wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
>  "Follow ANSI
> store assignment rules in table insertion by default" after revising the
> ANSI store assignment policy (SPARK-29326
> ).
> When inserting a value into a column with a different data type,
> Spark performs type coercion. Currently, we support 3 policies for the
> store assignment rules: ANSI, legacy and strict, which can be set via the
> option "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
> practice, the behavior is mostly the same as PostgreSQL. It disallows
> certain unreasonable type conversions such as converting `string` to `int`
> and `double` to `boolean`. It will throw a runtime exception if the value
> is out of range (overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value into an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules were originally for the
> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
> by default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while
> V2 uses "Strict". This proposal is to use "ANSI" policy by default for 
> both
> V1 and V2 in Spark 3.0.
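
A minimal sketch of how the three policies described above would treat the
257-into-Byte example, assuming a Spark 3.0 SparkSession named `spark`; the
table name and data source format are hypothetical:

    // Hypothetical table with a single TINYINT (Byte) column.
    spark.sql("CREATE TABLE t (b TINYINT) USING parquet")

    // Legacy: the implicit cast keeps the low-order 8 bits, so 257 (0x101) is
    // stored as 1.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO t VALUES (257)")

    // ANSI: the INT -> TINYINT assignment passes analysis, but the out-of-range
    // value is expected to fail at runtime under the revised policy (SPARK-29326).
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    spark.sql("INSERT INTO t VALUES (257)")

    // Strict: the potentially lossy INT -> TINYINT assignment is rejected at
    // analysis time, before any data is written.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
    spark.sql("INSERT INTO t VALUES (257)")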
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


 --
 ---
 Takeshi Yamamuro

>>> --
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Ryan Blue
+1

Thanks for fixing this!

On Thu, Oct 10, 2019 at 6:30 AM Xiao Li  wrote:

> +1
>
> On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:
>
>> +1 (binding)
>>
>> 2019년 10월 10일 (목) 오후 5:11, Takeshi Yamamuro 님이 작성:
>>
>>> Thanks for the great work, Gengliang!
>>>
>>> +1 for that.
>>> As I said before, the behaviour is pretty common in DBMSs, so the change
>>> helps DBMS users.
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
>>> gengliang.w...@databricks.com> wrote:
>>>
 Hi everyone,

 I'd like to call for a new vote on SPARK-28885
  "Follow ANSI store
 assignment rules in table insertion by default" after revising the ANSI
 store assignment policy (SPARK-29326
 ).
 When inserting a value into a column with a different data type,
 Spark performs type coercion. Currently, we support 3 policies for the
 store assignment rules: ANSI, legacy and strict, which can be set via the
 option "spark.sql.storeAssignmentPolicy":
 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
 practice, the behavior is mostly the same as PostgreSQL. It disallows
 certain unreasonable type conversions such as converting `string` to `int`
 and `double` to `boolean`. It will throw a runtime exception if the value
 is out of range (overflow).
 2. Legacy: Spark allows the store assignment as long as it is a valid
 `Cast`, which is very loose. E.g., converting either `string` to `int` or
 `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
 for compatibility with Hive. When inserting an out-of-range value into an
 integral field, the low-order bits of the value are inserted (the same as
 Java/Scala numeric type casting). For example, if 257 is inserted into a
 field of Byte type, the result is 1.
 3. Strict: Spark doesn't allow any possible precision loss or data
 truncation in store assignment, e.g., converting either `double` to `int`
 or `decimal` to `double` is not allowed. The rules were originally for the
 Dataset encoder. As far as I know, no mainstream DBMS is using this policy
 by default.

 Currently, the V1 data source uses "Legacy" policy by default, while V2
 uses "Strict". This proposal is to use "ANSI" policy by default for both V1
 and V2 in Spark 3.0.

 This vote is open until Friday (Oct. 11).

 [ ] +1: Accept the proposal
 [ ] +0
 [ ] -1: I don't think this is a good idea because ...

 Thank you!

 Gengliang

>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Xiao Li
+1

On Thu, Oct 10, 2019 at 2:13 AM Hyukjin Kwon  wrote:

> +1 (binding)
>
> 2019년 10월 10일 (목) 오후 5:11, Takeshi Yamamuro 님이 작성:
>
>> Thanks for the great work, Gengliang!
>>
>> +1 for that.
>> As I said before, the behaviour is pretty common in DBMSs, so the change
>> helps DBMS users.
>>
>> Bests,
>> Takeshi
>>
>>
>> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to call for a new vote on SPARK-28885
>>>  "Follow ANSI store
>>> assignment rules in table insertion by default" after revising the ANSI
>>> store assignment policy (SPARK-29326
>>> ).
>>> When inserting a value into a column with a different data type,
>>> Spark performs type coercion. Currently, we support 3 policies for the
>>> store assignment rules: ANSI, legacy and strict, which can be set via the
>>> option "spark.sql.storeAssignmentPolicy":
>>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>>> certain unreasonable type conversions such as converting `string` to `int`
>>> and `double` to `boolean`. It will throw a runtime exception if the value
>>> is out of range (overflow).
>>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>>> for compatibility with Hive. When inserting an out-of-range value into an
>>> integral field, the low-order bits of the value are inserted (the same as
>>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>>> field of Byte type, the result is 1.
>>> 3. Strict: Spark doesn't allow any possible precision loss or data
>>> truncation in store assignment, e.g., converting either `double` to `int`
>>> or `decimal` to `double` is not allowed. The rules were originally for the
>>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>>> by default.
>>>
>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>> and V2 in Spark 3.0.
>>>
>>> This vote is open until Friday (Oct. 11).
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>> Thank you!
>>>
>>> Gengliang
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
> --


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Hyukjin Kwon
+1 (binding)

2019년 10월 10일 (목) 오후 5:11, Takeshi Yamamuro 님이 작성:

> Thanks for the great work, Gengliang!
>
> +1 for that.
> As I said before, the behaviour is pretty common in DBMSs, so the change
> helps DBMS users.
>
> Bests,
> Takeshi
>
>
> On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang <
> gengliang.w...@databricks.com> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a new vote on SPARK-28885
>>  "Follow ANSI store
>> assignment rules in table insertion by default" after revising the ANSI
>> store assignment policy (SPARK-29326
>> ).
>> When inserting a value into a column with a different data type,
>> Spark performs type coercion. Currently, we support 3 policies for the
>> store assignment rules: ANSI, legacy and strict, which can be set via the
>> option "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>> certain unreasonable type conversions such as converting `string` to `int`
>> and `double` to `boolean`. It will throw a runtime exception if the value
>> is out of range (overflow).
>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>> for compatibility with Hive. When inserting an out-of-range value into an
>> integral field, the low-order bits of the value are inserted (the same as
>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>> field of Byte type, the result is 1.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in store assignment, e.g., converting either `double` to `int`
>> or `decimal` to `double` is not allowed. The rules were originally for the
>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>> by default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>> and V2 in Spark 3.0.
>>
>> This vote is open until Friday (Oct. 11).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Takeshi Yamamuro
Thanks for the great work, Gengliang!

+1 for that.
As I said before, the behaviour is pretty common in DBMSs, so the change
helps DBMS users.

Bests,
Takeshi


On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
>  "Follow ANSI store
> assignment rules in table insertion by default" after revising the ANSI
> store assignment policy (SPARK-29326
> ).
> When inserting a value into a column with a different data type,
> Spark performs type coercion. Currently, we support 3 policies for the
> store assignment rules: ANSI, legacy and strict, which can be set via the
> option "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
> practice, the behavior is mostly the same as PostgreSQL. It disallows
> certain unreasonable type conversions such as converting `string` to `int`
> and `double` to `boolean`. It will throw a runtime exception if the value
> is out of range (overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value into an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules were originally for the
> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
> by default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2
> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
> and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


-- 
---
Takeshi Yamamuro


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-08 Thread Russell Spitzer
+1 (non-binding). Sounds good to me

On Mon, Oct 7, 2019 at 11:58 PM Wenchen Fan  wrote:

> +1
>
> I think this is the most reasonable default behavior among the three.
>
> On Mon, Oct 7, 2019 at 6:06 PM Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> I have been following this standardization effort and I think it is sound
>> and it provides the needed flexibility via the option.
>>
>> Best regards,
>> Alessandro
>>
>> On Mon, 7 Oct 2019 at 10:24, Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to call for a new vote on SPARK-28885
>>>  "Follow ANSI store
>>> assignment rules in table insertion by default" after revising the ANSI
>>> store assignment policy (SPARK-29326
>>> ).
>>> When inserting a value into a column with a different data type,
>>> Spark performs type coercion. Currently, we support 3 policies for the
>>> store assignment rules: ANSI, legacy and strict, which can be set via the
>>> option "spark.sql.storeAssignmentPolicy":
>>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>>> certain unreasonable type conversions such as converting `string` to `int`
>>> and `double` to `boolean`. It will throw a runtime exception if the value
>>> is out of range (overflow).
>>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>>> for compatibility with Hive. When inserting an out-of-range value into an
>>> integral field, the low-order bits of the value are inserted (the same as
>>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>>> field of Byte type, the result is 1.
>>> 3. Strict: Spark doesn't allow any possible precision loss or data
>>> truncation in store assignment, e.g., converting either `double` to `int`
>>> or `decimal` to `double` is not allowed. The rules were originally for the
>>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>>> by default.
>>>
>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>> and V2 in Spark 3.0.
>>>
>>> This vote is open until Friday (Oct. 11).
>>>
>>> [ ] +1: Accept the proposal
>>> [ ] +0
>>> [ ] -1: I don't think this is a good idea because ...
>>>
>>> Thank you!
>>>
>>> Gengliang
>>>
>>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Wenchen Fan
+1

I think this is the most reasonable default behavior among the three.

On Mon, Oct 7, 2019 at 6:06 PM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> +1 (non-binding)
>
> I have been following this standardization effort and I think it is sound
> and it provides the needed flexibility via the option.
>
> Best regards,
> Alessandro
>
> On Mon, 7 Oct 2019 at 10:24, Gengliang Wang 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a new vote on SPARK-28885
>>  "Follow ANSI store
>> assignment rules in table insertion by default" after revising the ANSI
>> store assignment policy (SPARK-29326
>> ).
>> When inserting a value into a column with a different data type,
>> Spark performs type coercion. Currently, we support 3 policies for the
>> store assignment rules: ANSI, legacy and strict, which can be set via the
>> option "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
>> practice, the behavior is mostly the same as PostgreSQL. It disallows
>> certain unreasonable type conversions such as converting `string` to `int`
>> and `double` to `boolean`. It will throw a runtime exception if the value
>> is out of range (overflow).
>> 2. Legacy: Spark allows the store assignment as long as it is a valid
>> `Cast`, which is very loose. E.g., converting either `string` to `int` or
>> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
>> for compatibility with Hive. When inserting an out-of-range value into an
>> integral field, the low-order bits of the value are inserted (the same as
>> Java/Scala numeric type casting). For example, if 257 is inserted into a
>> field of Byte type, the result is 1.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in store assignment, e.g., converting either `double` to `int`
>> or `decimal` to `double` is not allowed. The rules were originally for the
>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>> by default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>> and V2 in Spark 3.0.
>>
>> This vote is open until Friday (Oct. 11).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Alessandro Solimando
+1 (non-binding)

I have been following this standardization effort and I think it is sound
and it provides the needed flexibility via the option.

Best regards,
Alessandro

On Mon, 7 Oct 2019 at 10:24, Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
>  "Follow ANSI store
> assignment rules in table insertion by default" after revising the ANSI
> store assignment policy (SPARK-29326
> ).
> When inserting a value into a column with a different data type,
> Spark performs type coercion. Currently, we support 3 policies for the
> store assignment rules: ANSI, legacy and strict, which can be set via the
> option "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In
> practice, the behavior is mostly the same as PostgreSQL. It disallows
> certain unreasonable type conversions such as converting `string` to `int`
> and `double` to `boolean`. It will throw a runtime exception if the value
> is out of range (overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value into an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules were originally for the
> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
> by default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2
> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
> and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Gengliang Wang
Thanks for the great suggestions from Ryan, Russell, and Wenchen.
As there are -1 votes from Ryan and Felix, this vote doesn't pass.

As per the SQL standard, data rounding or truncation is allowed when
assigning a value to a numeric/datetime type. So, I think we can discuss
whether data rounding/truncation should be allowed in strict mode, as Spark
doesn't produce invalid null values with data rounding/truncation.
For example, the conversion between `Date` and `Timestamp` should be
allowed as per the SQL standard. For another example, converting `Decimal`
to `Double` should be allowed as well, since in Spark SQL the max precision
of Decimal type is 38, while the range of `Double` is
[-1.7976931348623157E308, 1.7976931348623157E308]. Converting `Decimal` to
`Double` won't cause an overflow in Spark.
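
As a rough sanity check of that claim in plain Scala (no Spark required): the
largest value of Decimal(38, 0) is thirty-eight 9s, roughly 1e38, which is far
below the `Double` maximum of about 1.7977e308.

    // Largest value representable by Spark SQL's Decimal(38, 0).
    val maxDecimal38 = BigDecimal("9" * 38)
    // Well inside the Double range, so Decimal -> Double cannot overflow.
    assert(maxDecimal38.toDouble < Double.MaxValue)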

After refining the ANSI and strict modes, we can vote for the default
table insertion behavior for both V1 and V2.



On Thu, Sep 12, 2019 at 2:09 PM Wenchen Fan  wrote:

> I think it's too risky to enable the "runtime exception" mode by default
> in the next release. We don't even have a spec to describe when Spark would
> throw runtime exceptions. Currently the "runtime exception" mode works for
> overflow, but I believe there are more places that need to be considered (e.g.
> divide by zero).
>
> However, Ryan has a good point that if we use the ANSI store assignment
> policy, we should make sure the table insertion behavior completely follows
> the SQL spec. After reading the related section in the SQL spec, the rule
> is to throw a runtime exception for values out of range, which is the overflow
> check we already have in Spark. I think we should enable the overflow
> check during table insertion, when ANSI policy is picked. This should be
> done no matter which policy becomes the default eventually.
>
> On Mon, Sep 9, 2019 at 8:00 AM Felix Cheung 
> wrote:
>
>> I’d prefer strict mode and fail fast (analysis check)
>>
>> Also I like what Alastair suggested about standard clarification.
>>
>> I think we can re-visit this proposal and restart the vote
>>
>> --
>> *From:* Ryan Blue 
>> *Sent:* Friday, September 6, 2019 5:28 PM
>> *To:* Alastair Green
>> *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
>> *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in
>> table insertion by default
>>
>>
>> We discussed this thread quite a bit in the DSv2 sync up and Russell
>> brought up a really good point about this.
>>
>> The ANSI rule used here specifies how to store a specific value, V, so
>> this is a runtime rule — an earlier case covers when V is NULL, so it is
>> definitely referring to a specific value. The rule requires that if the
>> type doesn’t match or if the value cannot be truncated, an exception is
>> thrown for “numeric value out of range”.
>>
>> That runtime error guarantees that even though the cast is introduced at
>> analysis time, unexpected NULL values aren’t inserted into a table in place
>> of data values that are out of range. Unexpected NULL values are the
>> problem that was concerning to many of us in the discussion thread, but it
>> turns out that real ANSI behavior doesn’t have the problem. (In the sync,
>> we validated this by checking Postgres and MySQL behavior, too.)
>>
>> In Spark, the runtime check is a separate configuration property from
>> this one, but in order to actually implement ANSI semantics, both need to
>> be set. So I think it makes sense to *change both defaults to be ANSI*.
>> The analysis check alone does not implement the ANSI standard.
>>
>> In the sync, we also agreed that it makes sense to be able to turn off
>> the runtime check in order to avoid job failures. Another, safer way to
>> avoid job failures is to require an explicit cast, i.e., strict mode.
>>
>> I think that we should amend this proposal to change the default for both
>> the runtime check and the analysis check to ANSI.
>>
>> As this stands now, I vote -1. But I would support this if the vote were
>> to set both runtime and analysis checks to ANSI mode.
>>
>> rb
>>
>> On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
>>  wrote:
>>
>>> Makes sense.
>>>
>>> While the ISO SQL standard automatically becomes an American national
>>>  (ANSI) standard, changes are only made to the International (ISO/IEC)
>>> Standard, which is the authoritative specification.
>>>
>>> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2),
>>> section 9.2.
>>>
>>> Could we rename the propose

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-12 Thread Wenchen Fan
I think it's too risky to enable the "runtime exception" mode by default in
the next release. We don't even have a spec to describe when Spark would
throw runtime exceptions. Currently the "runtime exception" mode works for
overflow but I believe there are more places need to be considered (e.g.
divide by zero).

However, Ryan has a good point that if we use the ANSI store assignment
policy, we should make sure the table insertion behavior completely follows
the SQL spec. After reading the related section in the SQL spec, the rule
is to throw a runtime exception for values out of range, which is the overflow
check we already have in Spark. I think we should enable the overflow
check during table insertion, when ANSI policy is picked. This should be
done no matter which policy becomes the default eventually.
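
Concretely, the overflow check being discussed is the difference between
Java/Scala-style wrapping and failing fast. A tiny sketch of the wrapping half,
as described for the Legacy policy in the proposal:

    // 257 does not fit in a Byte, so only the low-order 8 bits survive: result 1.
    assert(257.toByte == 1.toByte)

With the check enabled for table insertion under the ANSI policy, the same
out-of-range assignment would be expected to raise an ArithmeticException
instead of silently storing the wrapped value.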

On Mon, Sep 9, 2019 at 8:00 AM Felix Cheung 
wrote:

> I’d prefer strict mode and fail fast (analysis check)
>
> Also I like what Alastair suggested about standard clarification.
>
> I think we can re-visit this proposal and restart the vote
>
> --
> *From:* Ryan Blue 
> *Sent:* Friday, September 6, 2019 5:28 PM
> *To:* Alastair Green
> *Cc:* Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
> *Subject:* Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in
> table insertion by default
>
>
> We discussed this thread quite a bit in the DSv2 sync up and Russell
> brought up a really good point about this.
>
> The ANSI rule used here specifies how to store a specific value, V, so
> this is a runtime rule — an earlier case covers when V is NULL, so it is
> definitely referring to a specific value. The rule requires that if the
> type doesn’t match or if the value cannot be truncated, an exception is
> thrown for “numeric value out of range”.
>
> That runtime error guarantees that even though the cast is introduced at
> analysis time, unexpected NULL values aren’t inserted into a table in place
> of data values that are out of range. Unexpected NULL values are the
> problem that was concerning to many of us in the discussion thread, but it
> turns out that real ANSI behavior doesn’t have the problem. (In the sync,
> we validated this by checking Postgres and MySQL behavior, too.)
>
> In Spark, the runtime check is a separate configuration property from this
> one, but in order to actually implement ANSI semantics, both need to be
> set. So I think it makes sense to *change both defaults to be ANSI*. The
> analysis check alone does not implement the ANSI standard.
>
> In the sync, we also agreed that it makes sense to be able to turn off the
> runtime check in order to avoid job failures. Another, safer way to avoid
> job failures is to require an explicit cast, i.e., strict mode.
>
> I think that we should amend this proposal to change the default for both
> the runtime check and the analysis check to ANSI.
>
> As this stands now, I vote -1. But I would support this if the vote were
> to set both runtime and analysis checks to ANSI mode.
>
> rb
>
> On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
>  wrote:
>
>> Makes sense.
>>
>> While the ISO SQL standard automatically becomes an American national
>>  (ANSI) standard, changes are only made to the International (ISO/IEC)
>> Standard, which is the authoritative specification.
>>
>> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section
>> 9.2.
>>
>> Could we rename the proposed default to “ISO/IEC (ANSI)”?
>>
>> — Alastair
>>
>> On Thu, Sep 5, 2019 at 17:17, Reynold Xin  wrote:
>>
>> Having three modes is a lot. Why not just use ansi mode as default, and
>> legacy for backward compatibility? Then over time there's only the ANSI
>> mode, which is standard compliant and easy to understand. We also don't
>> need to invent a standard just for Spark.
>>
>>
>> On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan 
>> wrote:
>>
>>> +1
>>>
>>> To be honest I don't like the legacy policy. It's too loose and easy for
>>> users to make mistakes, especially when Spark returns null if a function
>>> hits errors like overflow.
>>>
>>> The strict policy is not good either. It's too strict and stops valid
>>> use cases like writing timestamp values to a date type column. Users do
>>> expect truncation to happen without adding cast manually in this case. It's
>>> also weird to use a spark specific policy that no other database is using.
>>>
>>> The ANSI policy is better. It stops invalid use cases like writing
>>> string values to an int type column, while keeping valid use cases like
>>>

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-08 Thread Felix Cheung
I’d prefer strict mode and fail fast (analysis check)

Also I like what Alastair suggested about standard clarification.

I think we can re-visit this proposal and restart the vote


From: Ryan Blue 
Sent: Friday, September 6, 2019 5:28 PM
To: Alastair Green
Cc: Reynold Xin; Wenchen Fan; Spark dev list; Gengliang Wang
Subject: Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table 
insertion by default


We discussed this thread quite a bit in the DSv2 sync up and Russell brought up 
a really good point about this.

The ANSI rule used here specifies how to store a specific value, V, so this is 
a runtime rule — an earlier case covers when V is NULL, so it is definitely 
referring to a specific value. The rule requires that if the type doesn’t match 
or if the value cannot be truncated, an exception is thrown for “numeric value 
out of range”.

That runtime error guarantees that even though the cast is introduced at 
analysis time, unexpected NULL values aren’t inserted into a table in place of 
data values that are out of range. Unexpected NULL values are the problem that 
was concerning to many of us in the discussion thread, but it turns out that 
real ANSI behavior doesn’t have the problem. (In the sync, we validated this by 
checking Postgres and MySQL behavior, too.)

In Spark, the runtime check is a separate configuration property from this one, 
but in order to actually implement ANSI semantics, both need to be set. So I 
think it makes sense to change both defaults to be ANSI. The analysis check 
alone does not implement the ANSI standard.

In the sync, we also agreed that it makes sense to be able to turn off the 
runtime check in order to avoid job failures. Another, safer way to avoid job 
failures is to require an explicit cast, i.e., strict mode.

I think that we should amend this proposal to change the default for both the 
runtime check and the analysis check to ANSI.

As this stands now, I vote -1. But I would support this if the vote were to set 
both runtime and analysis checks to ANSI mode.

rb

On Fri, Sep 6, 2019 at 3:12 AM Alastair Green 
 wrote:
Makes sense.

While the ISO SQL standard automatically becomes an American national  (ANSI) 
standard, changes are only made to the International (ISO/IEC) Standard, which 
is the authoritative specification.

These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section 9.2.

Could we rename the proposed default to “ISO/IEC (ANSI)”?

— Alastair

On Thu, Sep 5, 2019 at 17:17, Reynold Xin <r...@databricks.com> wrote:

Having three modes is a lot. Why not just use ansi mode as default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard compliant and easy to understand. We also don't need to invent a 
standard just for Spark.


On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
+1

To be honest I don't like the legacy policy. It's too loose and easy for users 
to make mistakes, especially when Spark returns null if a function hit errors 
like overflow.

The strict policy is not good either. It's too strict and stops valid use cases 
like writing timestamp values to a date type column. Users do expect truncation 
to happen without adding cast manually in this case. It's also weird to use a 
spark specific policy that no other database is using.

The ANSI policy is better. It stops invalid use cases like writing string 
values to an int type column, while keeping valid use cases like timestamp -> 
date.

I think it's no doubt that we should use ANSI policy instead of legacy policy 
for v1 tables. Except for backward compatibility, ANSI policy is literally 
better than the legacy policy.

The v2 table is arguable here. Although the ANSI policy is better than strict 
policy to me, this is just the store assignment policy, which only partially 
controls the table insertion behavior. With Spark's "return null on error" 
behavior, the table insertion is more likely to insert invalid null values with 
the ANSI policy compared to the strict policy.

I think we should use ANSI policy by default for both v1 and v2 tables, because
1. End-users don't care how the table is implemented. Spark should provide 
consistent table insertion behavior between v1 and v2 tables.
2. Data Source V2 is unstable in Spark 2.x so there is no backward 
compatibility issue. That said, the baseline to judge which policy is better 
should be the table insertion behavior in Spark 2.x, which is the legacy policy 
+ "return null on error". ANSI policy is better than the baseline.
3. We expect more and more users to migrate their data sources to the V2 API. 
The strict policy can be a stopper as it's too big a breaking change, which may 
break many existing queries.

Thanks,
Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:

Hi everyone,

I'd like

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-06 Thread Ryan Blue
We discussed this thread quite a bit in the DSv2 sync up and Russell
brought up a really good point about this.

The ANSI rule used here specifies how to store a specific value, V, so this
is a runtime rule — an earlier case covers when V is NULL, so it is
definitely referring to a specific value. The rule requires that if the
type doesn’t match or if the value cannot be truncated, an exception is
thrown for “numeric value out of range”.

That runtime error guarantees that even though the cast is introduced at
analysis time, unexpected NULL values aren’t inserted into a table in place
of data values that are out of range. Unexpected NULL values are the
problem that was concerning to many of us in the discussion thread, but it
turns out that real ANSI behavior doesn’t have the problem. (In the sync,
we validated this by checking Postgres and MySQL behavior, too.)

In Spark, the runtime check is a separate configuration property from this
one, but in order to actually implement ANSI semantics, both need to be
set. So I think it makes sense to *change both defaults to be ANSI*. The
analysis check alone does not implement the ANSI standard.
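
A hedged sketch of what "setting both" could look like: the store assignment key
is the one named in the proposal, while the runtime-check key below is the name
the ANSI runtime flag eventually shipped under in Spark 3.0 and is an assumption
relative to the snapshot discussed in this thread.

    // Analysis-time check: reject unreasonable assignments such as string -> int.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    // Runtime check: fail on out-of-range values instead of writing NULLs or
    // wrapped values (key name assumed; it is Spark 3.0's eventual flag).
    spark.conf.set("spark.sql.ansi.enabled", "true")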

In the sync, we also agreed that it makes sense to be able to turn off the
runtime check in order to avoid job failures. Another, safer way to avoid
job failures is to require an explicit cast, i.e., strict mode.

I think that we should amend this proposal to change the default for both
the runtime check and the analysis check to ANSI.

As this stands now, I vote -1. But I would support this if the vote were to
set both runtime and analysis checks to ANSI mode.

rb

On Fri, Sep 6, 2019 at 3:12 AM Alastair Green
 wrote:

> Makes sense.
>
> While the ISO SQL standard automatically becomes an American national
>  (ANSI) standard, changes are only made to the International (ISO/IEC)
> Standard, which is the authoritative specification.
>
> These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section
> 9.2.
>
> Could we rename the proposed default to “ISO/IEC (ANSI)”?
>
> — Alastair
>
> On Thu, Sep 5, 2019 at 17:17, Reynold Xin  wrote:
>
> Having three modes is a lot. Why not just use ansi mode as default, and
> legacy for backward compatibility? Then over time there's only the ANSI
> mode, which is standard compliant and easy to understand. We also don't
> need to invent a standard just for Spark.
>
>
> On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan  wrote:
>
>> +1
>>
>> To be honest I don't like the legacy policy. It's too loose and easy for
>> users to make mistakes, especially when Spark returns null if a function
>> hits errors like overflow.
>>
>> The strict policy is not good either. It's too strict and stops valid use
>> cases like writing timestamp values to a date type column. Users do expect
>> truncation to happen without adding cast manually in this case. It's also
>> weird to use a spark specific policy that no other database is using.
>>
>> The ANSI policy is better. It stops invalid use cases like writing string
>> values to an int type column, while keeping valid use cases like timestamp
>> -> date.
>>
>> I think it's no doubt that we should use ANSI policy instead of legacy
>> policy for v1 tables. Except for backward compatibility, ANSI policy is
>> literally better than the legacy policy.
>>
>> The v2 table is arguable here. Although the ANSI policy is better than
>> strict policy to me, this is just the store assignment policy, which only
>> partially controls the table insertion behavior. With Spark's "return null
>> on error" behavior, the table insertion is more likely to insert invalid
>> null values with the ANSI policy compared to the strict policy.
>>
>> I think we should use ANSI policy by default for both v1 and v2 tables,
>> because
>> 1. End-users don't care how the table is implemented. Spark should
>> provide consistent table insertion behavior between v1 and v2 tables.
>> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
>> compatibility issue. That said, the baseline to judge which policy is
>> better should be the table insertion behavior in Spark 2.x, which is the
>> legacy policy + "return null on error". ANSI policy is better than the
>> baseline.
>> 3. We expect more and more users to migrate their data sources to the V2
>> API. The strict policy can be a stopper as it's too big a breaking change,
>> which may break many existing queries.
>>
>> Thanks,
>> Wenchen
>>
>>
>> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-28885 
>>  "Follow ANSI store 
>> assignment rules in table insertion by default".
>> When inserting a value into a column with a different data type, Spark 
>> performs type coercion. Currently, we support 3 policies for the type 
>> coercion rules: ANSI, legacy and strict, which can be set via the option 
>> 

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-06 Thread Alastair Green
Makes sense.
While the ISO SQL standard automatically becomes an American national (ANSI) 
standard, changes are only made to the International (ISO/IEC) Standard, which 
is the authoritative specification.
These rules are specified in SQL/Foundation (ISO/IEC SQL Part 2), section 9.2.
Could we rename the proposed default to “ISO/IEC (ANSI)”?
— Alastair

On Thu, Sep 5, 2019 at 17:17, Reynold Xin  wrote:
Having three modes is a lot. Why not just use ansi mode as default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard compliant and easy to understand. We also don't need to invent a 
standard just for Spark.

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
+1
To be honest I don't like the legacy policy. It's too loose and easy for users 
to make mistakes, especially when Spark returns null if a function hit errors 
like overflow.
The strict policy is not good either. It's too strict and stops valid use cases 
like writing timestamp values to a date type column. Users do expect truncation 
to happen without adding cast manually in this case. It's also weird to use a 
spark specific policy that no other database is using.
The ANSI policy is better. It stops invalid use cases like writing string 
values to an int type column, while keeping valid use cases like timestamp -> 
date.
I think it's no doubt that we should use ANSI policy instead of legacy policy 
for v1 tables. Except for backward compatibility, ANSI policy is literally 
better than the legacy policy.
The v2 table is arguable here. Although the ANSI policy is better than strict 
policy to me, this is just the store assignment policy, which only partially 
controls the table insertion behavior. With Spark's "return null on error" 
behavior, the table insertion is more likely to insert invalid null values with 
the ANSI policy compared to the strict policy.
I think we should use ANSI policy by default for both v1 and v2 tables, because 
1. End-users don't care how the table is implemented. Spark should provide 
consistent table insertion behavior between v1 and v2 tables. 2. Data Source V2 
is unstable in Spark 2.x so there is no backward compatibility issue. That 
said, the baseline to judge which policy is better should be the table 
insertion behavior in Spark 2.x, which is the legacy policy + "return null on 
error". ANSI policy is better than the baseline. 3. We expect more and more 
users to migrate their data sources to the V2 API. The strict policy can be a 
stopper as it's too big a breaking change, which may break many existing 
queries.
Thanks, Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
Hi everyone,

I'd like to call for a vote on SPARK-28885 
[https://issues.apache.org/jira/browse/SPARK-28885] "Follow ANSI store 
assignment rules in table insertion by default".  
When inserting a value into a column with a different data type, Spark 
performs type coercion. Currently, we support 3 policies for the type coercion 
rules: ANSI, legacy and strict, which can be set via the option 
"spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the 
behavior is mostly the same as PostgreSQL. It disallows certain unreasonable 
type conversions such as converting `string` to `int` and `double` to `boolean`.
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, 
which is very loose. E.g., converting either `string` to `int` or `double` to 
`boolean` is allowed. It is the current behavior in Spark 2.x for compatibility 
with Hive.
3. Strict: Spark doesn't allow any possible precision loss or data truncation 
in type coercion, e.g., converting either `double` to `int` or `decimal` to 
`double` is not allowed. The rules were originally for the Dataset encoder. As 
far as I know, no mainstream DBMS is using this policy by default.

Currently, the V1 data source uses "Legacy" policy by default, while V2 uses 
"Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 
in Spark 3.0.

There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the dev 
mailing list.

This vote is open until next Thurs (Sept. 12th).

[ ] +1: Accept the proposal
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thank you!

Gengliang

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Reynold Xin
Having three modes is a lot. Why not just use ansi mode as default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard compliant and easy to understand. We also don't need to invent a 
standard just for Spark.

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> +1
> 
> 
> To be honest I don't like the legacy policy. It's too loose and easy for
> users to make mistakes, especially when Spark returns null if a function
> hits errors like overflow.
> 
> 
> The strict policy is not good either. It's too strict and stops valid use
> cases like writing timestamp values to a date type column. Users do expect
> truncation to happen without adding cast manually in this case. It's also
> weird to use a spark specific policy that no other database is using.
> 
> 
> The ANSI policy is better. It stops invalid use cases like writing string
> values to an int type column, while keeping valid use cases like timestamp
> -> date.
> 
> 
> I think it's no doubt that we should use ANSI policy instead of legacy
> policy for v1 tables. Except for backward compatibility, ANSI policy is
> literally better than the legacy policy.
> 
> 
> The v2 table is arguable here. Although the ANSI policy is better than
> strict policy to me, this is just the store assignment policy, which only
> partially controls the table insertion behavior. With Spark's "return null
> on error" behavior, the table insertion is more likely to insert invalid
> null values with the ANSI policy compared to the strict policy.
> 
> 
> I think we should use ANSI policy by default for both v1 and v2 tables,
> because
> 1. End-users don't care how the table is implemented. Spark should provide
> consistent table insertion behavior between v1 and v2 tables.
> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
> compatibility issue. That said, the baseline to judge which policy is
> better should be the table insertion behavior in Spark 2.x, which is the
> legacy policy + "return null on error". ANSI policy is better than the
> baseline.
> 3. We expect more and more users to migrate their data sources to the V2
> API. The strict policy can be a stopper as it's too big a breaking change,
> which may break many existing queries.
> 
> 
> Thanks,
> Wenchen 
> 
> 
> 
> 
> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang <gengliang.w...@databricks.com> wrote:
> 
> 
>> Hi everyone,
>> 
>> I'd like to call for a vote on SPARK-28885 (
>> https://issues.apache.org/jira/browse/SPARK-28885 ) "Follow ANSI store
>> assignment rules in table insertion by default".  
>> When inserting a value into a column with a different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the type
>> coercion rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice,
>> the behavior is mostly the same as PostgreSQL. It disallows certain
>> unreasonable type conversions such as converting `string` to `int` and
>> `double` to `boolean`.
>> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`,
>> which is very loose. E.g., converting either `string` to `int` or `double`
>> to `boolean` is allowed. It is the current behavior in Spark 2.x for
>> compatibility with Hive.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in type coercion, e.g., converting either `double` to `int` or
>> `decimal` to `double` is not allowed. The rules were originally for the
>> Dataset encoder. As far as I know, no mainstream DBMS is using this policy
>> by default.
>> 
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both
>> V1 and V2 in Spark 3.0.
>> 
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in
>> the dev mailing list.
>> 
>> This vote is open until next Thurs (Sept. 12th).
>> 
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ... Thank you! Gengliang
>> 
>> 
> 
>

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Wenchen Fan
+1

To be honest I don't like the legacy policy. It's too loose and easy for
users to make mistakes, especially when Spark returns null if a function
hits errors like overflow.

The strict policy is not good either. It's too strict and stops valid use
cases like writing timestamp values to a date type column. Users do expect
truncation to happen without adding cast manually in this case. It's also
weird to use a spark specific policy that no other database is using.

The ANSI policy is better. It stops invalid use cases like writing string
values to an int type column, while keeping valid use cases like timestamp
-> date.

I think it's no doubt that we should use ANSI policy instead of legacy
policy for v1 tables. Except for backward compatibility, ANSI policy is
literally better than the legacy policy.

The v2 table is arguable here. Although the ANSI policy is better than
strict policy to me, this is just the store assignment policy, which only
partially controls the table insertion behavior. With Spark's "return null
on error" behavior, the table insertion is more likely to insert invalid
null values with the ANSI policy compared to the strict policy.

I think we should use ANSI policy by default for both v1 and v2 tables,
because
1. End-users don't care how the table is implemented. Spark should provide
consistent table insertion behavior between v1 and v2 tables.
2. Data Source V2 is unstable in Spark 2.x so there is no backward
compatibility issue. That said, the baseline to judge which policy is
better should be the table insertion behavior in Spark 2.x, which is the
legacy policy + "return null on error". ANSI policy is better than the
baseline.
3. We expect more and more users to migrate their data sources to the V2
API. The strict policy can be a stopper as it's too big a breaking change,
which may break many existing queries.

Thanks,
Wenchen


On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a vote on SPARK-28885 
>  "Follow ANSI store 
> assignment rules in table insertion by default".
> When inserting a value into a column with a different data type, Spark 
> performs type coercion. Currently, we support 3 policies for the type 
> coercion rules: ANSI, legacy and strict, which can be set via the option 
> "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the 
> behavior is mostly the same as PostgreSQL. It disallows certain unreasonable 
> type conversions such as converting `string` to `int` and `double` to 
> `boolean`.
> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, 
> which is very loose. E.g., converting either `string` to `int` or `double` to 
> `boolean` is allowed. It is the current behavior in Spark 2.x for 
> compatibility with Hive.
> 3. Strict: Spark doesn't allow any possible precision loss or data truncation 
> in type coercion, e.g., converting either `double` to `int` or `decimal` to 
> `double` is not allowed. The rules were originally for the Dataset encoder. 
> As far as I know, no mainstream DBMS is using this policy by default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2 uses 
> "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 
> in Spark 3.0.
>
> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the 
> dev mailing list.
>
> This vote is open until next Thurs (Sept. 12th).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>
>