Re: FYI: The evolution on `CHAR` type behavior

Reynold Xin Sun, 15 Mar 2020 20:31:23 -0700

Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of 
databases don't actually pad.


Snowflake: 
https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
"Snowflake currently deviates from common CHAR semantics in that strings 
shorter than the maximum length are not space-padded at the end."

MySQL: 
https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql
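
To make the two behaviors concrete, here is a minimal Python sketch of
"padding" versus "not padding" for a CHAR(n) column. It is only an
illustration, not code from Spark or any database: the function names are
hypothetical, and raising an error on over-length input is an assumption
(real systems differ on that point). It uses the same 'a ' into CHAR(3)
case as the quoted message below.

```python
def store_char_padded(value: str, n: int) -> str:
    """Model a space-padded CHAR(n): right-pad the value to exactly n characters."""
    if len(value) > n:
        # Assumption for this sketch: over-length input is rejected.
        raise ValueError(f"value longer than CHAR({n})")
    return value.ljust(n)


def store_char_unpadded(value: str, n: int) -> str:
    """Model a non-padding CHAR(n): keep the value exactly as written."""
    if len(value) > n:
        # Same assumption as above.
        raise ValueError(f"value longer than CHAR({n})")
    return value


if __name__ == "__main__":
    print(len(store_char_padded("a ", 3)))    # 3
    print(len(store_char_unpadded("a ", 3)))  # 2
```

The only difference between the two models is whether trailing spaces are
added on write, which is exactly what shows up as length 3 versus length 2.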

On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Reynold.
> 
> 
> Please see the following for the context.
> 
> 
> https://issues.apache.org/jira/browse/SPARK-31136
> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE
> syntax"
> 
> 
> I raised the above issue according to the new rubric, and the ban was
> proposed as an alternative to reduce the potential issue.
> 
> 
> Please give us your opinion, since it is still an open PR.
> 
> 
> Bests,
> Dongjoon.
> 
> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < r...@databricks.com > wrote:
> 
> 
>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out
>> of both new and old users?
>> 
>> 
>> For old users, their old code that was working for char(3) would now stop
>> working. 
>> 
>> 
>> For new users, depending on the underlying metastore, char(3) is either
>> supported but different from ANSI SQL (which is not that big of a deal if
>> we explain it) or not supported.
>> 
>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon.h...@gmail.com >
>> wrote:
>> 
>> 
>>> Hi, All.
>>> 
>>> Apache Spark has suffered from a known consistency issue in `CHAR` type
>>> behavior across its usages and configurations. However, the behavior has
>>> been gradually converging toward consistency inside Apache Spark, because
>>> we don't have `CHAR` officially. The following is a summary.
>>> 
>>> With 1.6.x ~ 2.3.x, `STORED AS PARQUET` gives a different result, as
>>> shown below. (`spark.sql.hive.convertMetastoreParquet=false` provides a
>>> fallback to Hive behavior.)
>>> 
>>>     spark-sql> CREATE TABLE t1(a CHAR(3));
>>>     spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>>>     spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>> 
>>>     spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>>>     spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 2.4.0, `STORED AS ORC` has been consistent with `STORED AS PARQUET`.
>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive
>>> behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a   3
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> Since 3.0.0-preview2, `CREATE TABLE` (without a `STORED AS` clause) has
>>> been consistent as well.
>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a
>>> fallback to Hive behavior.)
>>> 
>>>     spark-sql> SELECT a, length(a) FROM t1;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t2;
>>>     a 2
>>>     spark-sql> SELECT a, length(a) FROM t3;
>>>     a 2
>>> 
>>> In addition, in 3.0.0, SPARK-31147 aims to ban the `CHAR`/`VARCHAR` types
>>> in the following syntax, to be safe.
>>> 
>>>     CREATE TABLE t(a CHAR(3));
>>>     https://github.com/apache/spark/pull/27902
>>> 
>>> This email is sent to inform you, in accordance with the new policy we
>>> voted on. The recommendation is to always use Apache Spark's native type
>>> `String`.
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> References:
>>> 1. "CHAR implementation?", 2017/09/15
>>>    https://lists.apache.org/thread.html/96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev.spark.apache.org%3E
>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>    syntax", 2019/12/06
>>>    https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E
>>> 
>> 
>> 
> 
>
