I haven't spent enough time thinking about it to give a strong opinion, but 
this is of course very different from TRIM.

TRIM is a publicly documented function with two arguments, and we silently 
swapped those two arguments. TRIM has also been in common use for a long 
time.

CHAR is an undocumented data type without clearly defined semantics. It's not 
great that we are changing the value here, but the value is already inconsistent: 
it depends on the underlying data source, and seemingly unrelated configurations 
(e.g. the ORC settings) affect it.
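To make the inconsistency concrete, here is a minimal sketch (purely illustrative; `read_char` is a hypothetical helper, not Spark's actual read path) of how the same stored bytes can yield different lengths depending on whether the reader applies ANSI-style space padding on read:

```python
# Illustrative sketch, NOT Spark's implementation: a CHAR(n) column's
# observed value depends on whether the reader pads on read.

def read_char(value: str, n: int, pad: bool) -> str:
    """Emulate reading a CHAR(n) column: right-pad with spaces to n if `pad`."""
    return value.ljust(n) if pad else value

# INSERT ... SELECT 'a ' stores a two-character string; the observed
# length then depends on the reader, not on the stored bytes.
stored = "a "                                   # written into a CHAR(3) column
hive_style = read_char(stored, 3, pad=True)     # e.g. the Hive-fallback path
native_style = read_char(stored, 3, pad=False)  # e.g. a native file-source reader

print(len(hive_style), len(native_style))  # 3 2
```

This is why flipping a seemingly unrelated reader config can change `length(a)` for the same table.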

On Mon, Mar 16, 2020 at 4:01 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> (And +Michael Armbrust)
> 
> 
> If you think so, do you think it's okay that we change the return value
> silently? Then, I'm wondering why we reverted `TRIM` functions then?
> 
> 
> > Are we sure "not padding" is "incorrect"?
> 
> 
> 
> Bests,
> Dongjoon.
> 
> 
> 
> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav.sengupta@gmail.com > wrote:
> 
> 
>> Hi,
>> 
>> 
>> 100% agree with Reynold.
>> 
>> 
>> 
>> 
>> Regards,
>> Gourav Sengupta
>> 
>> 
>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@databricks.com > wrote:
>> 
>> 
>>> Are we sure "not padding" is "incorrect"?
>>> 
>>> 
>>> 
>>> I don't know whether ANSI SQL actually requires padding, but plenty of
>>> databases don't actually pad.
>>> 
>>> 
>>> 
>>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html :
>>> "Snowflake currently deviates from common CHAR semantics in that
>>> strings shorter than the maximum length are not space-padded at the end."
>>> 
>>> 
>>> 
>>> MySQL:
>>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>> 
>>> 
>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>> 
>>>> Hi, Reynold.
>>>> 
>>>> 
>>>> Please see the following for the context.
>>>> 
>>>> 
>>>> https://issues.apache.org/jira/browse/SPARK-31136
>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>> syntax"
>>>> 
>>>> 
>>>> I raised the above issue according to the new rubric, and the ban was
>>>> proposed as an alternative to mitigate the potential issue.
>>>> 
>>>> 
>>>> Please give us your opinion, since it's still a PR.
>>>> 
>>>> 
>>>> Bests,
>>>> Dongjoon.
>>>> 
>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@databricks.com > wrote:
>>>> 
>>>> 
>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>>>>> of both new and old users?
>>>>> 
>>>>> 
>>>>> For old users, their old code that was working for char(3) would now stop
>>>>> working. 
>>>>> 
>>>>> 
>>>>> For new users, the underlying metastore's char(3) is either supported
>>>>> but different from ANSI SQL (which is not that big of a deal if we
>>>>> explain it) or not supported at all. 
>>>>> 
>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon.hyun@gmail.com > wrote:
>>>>> 
>>>>> 
>>>>>> Hi, All.
>>>>>> 
>>>>>> Apache Spark has suffered from a known consistency issue in `CHAR`
>>>>>> type behavior across its usages and configurations. However, the
>>>>>> behavior has been gradually converging toward consistency inside Apache
>>>>>> Spark, because we don't support `CHAR` officially. The following is a
>>>>>> summary.
>>>>>> 
>>>>>> In 1.6.x ~ 2.3.x, `STORED AS PARQUET` produces a different result.
>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>>> Hive behavior.)
>>>>>> 
>>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>> 
>>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>> 
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>> 
>>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>>>>> behavior.)
>>>>>> 
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a   3
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>> 
>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) became
>>>>>> consistent.
>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>>> fallback to Hive behavior.)
>>>>>> 
>>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>>     a 2
>>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>>     a 2
>>>>>> 
>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in the
>>>>>> following syntax to be safe.
>>>>>> 
>>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>>     https://github.com/apache/spark/pull/27902
>>>>>> 
>>>>>> This email is sent out to inform you, based on the new policy we voted
>>>>>> on. The recommendation is to always use Apache Spark's native `String`
>>>>>> type.
>>>>>> 
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>> 
>>>>>> References:
>>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>>      https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>> syntax", 2019/12/06
>>>>>>    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>
