Re: FYI: The evolution on `CHAR` type behavior

Dongjoon Hyun Mon, 16 Mar 2020 16:02:28 -0700

Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value
silently? In that case, I'm wondering why we reverted the `TRIM` functions.
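
For illustration only, the difference can be modeled in a few lines of
Python. This is a rough sketch of the two CHAR(3) read behaviors shown in the
quoted summary below (padded to the declared length, as on the Hive fallback
path, versus returned as stored, as on Spark's native path). It is not Spark
code, and the function names `read_padded` and `read_unpadded` are
hypothetical.

```python
# Toy model of the two CHAR(n) read behaviors discussed in this thread.
# Not Spark code; the function names are hypothetical.

def read_padded(value: str, n: int) -> str:
    """Pad with trailing spaces up to the declared length n."""
    return value.ljust(n)


def read_unpadded(value: str, n: int) -> str:
    """Return the stored value unchanged (no padding)."""
    return value


stored = "a "  # the value inserted in the quoted example
padded = read_padded(stored, 3)
unpadded = read_unpadded(stored, 3)

print(repr(padded), len(padded))      # 'a  ' 3
print(repr(unpadded), len(unpadded))  # 'a ' 2
print(padded == unpadded)             # False
```

In this model, a length test or an equality check written against one
behavior gives a different answer under the other, and nothing raises an
error.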


> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> 100% agree with Reynold.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Are we sure "not padding" is "incorrect"?
>>
>> I don't know whether ANSI SQL actually requires padding, but plenty of
>> databases don't actually pad.
>>
>> Snowflake:
>> https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
>> "Snowflake currently deviates from common CHAR semantics in that strings
>> shorter than the maximum length are not space-padded at the end."
>>
>> MySQL:
>> https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
>>
>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, Reynold.
>>>
>>> Please see the following for the context.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-31136
>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax"
>>>
>>> I raised the above issue according to the new rubric, and the ban was
>>> proposed as an alternative to reduce potential problems.
>>>
>>> Please give us your opinion, since it's still just a PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell
>>>> out of both new and old users?
>>>>
>>>> For old users, their old code that was working for char(3) would now
>>>> stop working.
>>>>
>>>> For new users, depending on the underlying metastore, char(3) is
>>>> either supported but different from ANSI SQL (which is not that big of a
>>>> deal if we explain it) or not supported.
>>>>
>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Apache Spark has suffered from a known consistency issue in `CHAR`
>>>>> type behavior across its usages and configurations. However, the
>>>>> behavior has been gradually converging toward consistency inside
>>>>> Apache Spark, because we don't officially have `CHAR`. The following
>>>>> is the summary.
>>>>>
>>>>> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives a different result, as
>>>>> shown below.
>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to
>>>>> Hive behavior.)
>>>>>
>>>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>>>>
>>>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>>>>
>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>     a   3
>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>     a   3
>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>     a 2
>>>>>
>>>>> Since 2.4.0, `STORED AS ORC` became consistent.
>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to
>>>>> Hive behavior.)
>>>>>
>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>     a   3
>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>     a 2
>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>     a 2
>>>>>
>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause)
>>>>> became consistent.
>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>>>> fallback to Hive behavior.)
>>>>>
>>>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>>>     a 2
>>>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>>>     a 2
>>>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>>>     a 2
>>>>>
>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban the `CHAR`/`VARCHAR`
>>>>> types in the following syntax, to be safe.
>>>>>
>>>>>     CREATE TABLE t(a CHAR(3));
>>>>>     https://github.com/apache/spark/pull/27902
>>>>>
>>>>> This email is sent out to inform you, per the new policy we voted on.
>>>>> The recommendation is to always use Apache Spark's native type `String`.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> References:
>>>>> 1. "CHAR implementation?", 2017/09/15
>>>>>
>>>>> https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE
>>>>> TABLE syntax", 2019/12/06
>>>>>
>>>>> https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>>>>
>>>>
>>
