You were joking when you said "informed widely and discussed in many ways
twice", right?

This thread doesn't even talk about char/varchar: 
https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

(Yes, it talked about changing the default data source provider, but that's
just one of the ways this char/varchar issue surfaces.)

On Thu, Mar 19, 2020 at 8:41 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> 
> +1 for Wenchen's suggestion.
> 
> I believe that the difference and effects are informed widely and
> discussed in many ways twice.
> 
> First, this was shared last December.
> 
>     "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax", 2019/12/06
>    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
> 
> Second (this time, in this thread), it has been discussed according to
> the new community rubric.
> 
>     - https://spark.apache.org/versioning-policy.html (Section:
> "Considerations When Breaking APIs")
> 
> 
> Thank you all.
> 
> 
> Bests,
> Dongjoon.
> 
> On Tue, Mar 17, 2020 at 10:41 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
> 
> 
>> OK let me put a proposal here:
>> 
>> 
>> 1. Permanently ban CHAR for native data source tables, and only keep it
>> for Hive compatibility.
>> It's OK to forget about padding, as Snowflake and MySQL have done, but
>> it's hard for Spark to require consistent CHAR behavior across all data
>> sources. Since the CHAR type is not that useful nowadays, it seems OK to
>> just ban it. Another option is to document that CHAR padding is data
>> source dependent, but it's a bit weird to leave this inconsistency in
>> Spark.
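>> 
>> To illustrate the inconsistency, here is a rough spark-shell sketch
>> (assuming a Hive-enabled 3.0-era build; exact results may vary by
>> version and data source):
>> 
>>   // Hive serde table: CHAR(5) values are space-padded to length 5.
>>   spark.sql("CREATE TABLE t_hive (c CHAR(5)) STORED AS PARQUET")
>>   spark.sql("INSERT INTO t_hive VALUES ('a')")
>>   spark.sql("SELECT length(c) FROM t_hive").show()   // 5 ('a    ')
>> 
>>   // Native data source table: CHAR(5) behaves as a plain string,
>>   // so no padding is applied.
>>   spark.sql("CREATE TABLE t_native (c CHAR(5)) USING parquet")
>>   spark.sql("INSERT INTO t_native VALUES ('a')")
>>   spark.sql("SELECT length(c) FROM t_native").show() // 1 ('a')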
>> 
>> 
>> 2. Leave VARCHAR unchanged in 3.0.
>> The VARCHAR type is so widely used in databases that it would be weird if
>> Spark didn't support it. VARCHAR is exactly the same as Spark's StringType
>> as long as the length limit is not hit, so I'm fine with temporarily
>> leaving this flaw in 3.0; users may see behavior changes when string
>> values exceed the VARCHAR length limit.
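>> 
>> A sketch of that flaw (assumed 3.0 behavior; the outcome depends on the
>> data source):
>> 
>>   // For native tables VARCHAR(3) is mapped to StringType, so a value
>>   // longer than the declared limit is stored with no error and no
>>   // truncation; a Hive serde table would typically truncate instead.
>>   spark.sql("CREATE TABLE t_v (c VARCHAR(3)) USING parquet")
>>   spark.sql("INSERT INTO t_v VALUES ('abcdef')")
>>   spark.sql("SELECT c FROM t_v").show()   // abcdef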
>> 
>> 
>> 3. Finalize the VARCHAR behavior in 3.1.
>> For now I have two ideas:
>> a) Make VARCHAR(x) a first-class data type. This means Spark data sources
>> should support VARCHAR, and CREATE TABLE should fail if a column has
>> VARCHAR type and the underlying data source doesn't support it (e.g.
>> JSON/CSV). Type cast, type coercion, table insertion, etc. should be
>> updated as well.
>> b) Simply document that the underlying data source may or may not enforce
>> the length limit of VARCHAR(x).
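>> 
>> Under idea a), a session might look like this (purely hypothetical: a
>> sketch of the proposed behavior, not existing code):
>> 
>>   // A source that can enforce the limit: accepted.
>>   spark.sql("CREATE TABLE t1 (c VARCHAR(10)) USING parquet")
>> 
>>   // A source that cannot enforce it: CREATE TABLE fails fast.
>>   spark.sql("CREATE TABLE t2 (c VARCHAR(10)) USING csv")
>>   // => AnalysisException: data source does not support VARCHAR
>> 
>>   // Writing an over-length value would fail at insertion time.
>>   spark.sql("INSERT INTO t1 VALUES ('longer than ten characters')")
>>   // => error: value exceeds VARCHAR(10) length limit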
>> 
>> 
>> Please let me know if you have different ideas.
>> 
>> 
>> Thanks,
>> Wenchen
>> 
>> On Wed, Mar 18, 2020 at 1:08 AM Michael Armbrust <michael@databricks.com>
>> wrote:
>> 
>> 
>>> 
>>>> What I'd oppose is to just ban char for the native data sources and
>>>> not have a plan to address this problem systematically.
>>>> 
>>> 
>>> +1
>>> 
>>>> Just forget about padding, like what Snowflake and MySQL have done.
>>>> Document that char(x) is just an alias for string. And then move on.
>>>> Almost no work needs to be done...
>>>> 
>>> 
>>> +1
>>> 
>> 
>> 
> 
>
