I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char.
On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > Thank you, Stephen and Reynold. > > > To Reynold. > > > The way I see the following is a little different. > > > > CHAR is an undocumented data type without clearly defined > semantics. > > Let me describe in Apache Spark User's View point. > > > Apache Spark started to claim `HiveContext` (and `hql/hiveql` function) at > Apache Spark 1.x without much documentation. In addition, there still > exists an effort which is trying to keep it in 3.0.0 age. > > https:/ / issues. apache. org/ jira/ browse/ SPARK-31088 ( > https://issues.apache.org/jira/browse/SPARK-31088 ) > Add back HiveContext and createExternalTable > > Historically, we tried to make many SQL-based customer migrate their > workloads from Apache Hive into Apache Spark through `HiveContext`. > > Although Apache Spark didn't have a good document about the inconsistent > behavior among its data sources, Apache Hive has been providing its > documentation and many customers rely the behavior. > > - https:/ / cwiki. apache. org/ confluence/ display/ Hive/ > LanguageManual+Types > ( https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types ) > > At that time, frequently in on-prem Hadoop clusters by well-known vendors, > many existing huge tables were created by Apache Hive, not Apache Spark. > And, Apache Spark is used for boosting SQL performance with its *caching*. > This was true because Apache Spark was added into the Hadoop-vendor > products later than Apache Hive. > > > Until the turning point at Apache Spark 2.0, we tried to catch up more > features to be consistent at least with Hive tables in Apache Hive and > Apache Spark because two SQL engines share the same tables. > > For the following, technically, while Apache Hive doesn't changed its > existing behavior in this part, Apache Spark evolves inevitably by moving > away from the original Apache Spark old behaviors one-by-one. > > > > the value is already fucked up > > > The following is the change log. > > - When we switched the default value of `convertMetastoreParquet`. > (at Apache Spark 1.2) > - When we switched the default value of `convertMetastoreOrc` (at > Apache Spark 2.4) > - When we switched `CREATE TABLE` itself. (Change `TEXT` table to > `PARQUET` table at Apache Spark 3.0) > > To sum up, this has been a well-known issue in the community and among the > customers. > > Bests, > Dongjoon. > > On Mon, Mar 16, 2020 at 5:24 PM Stephen Coy < scoy@ infomedia. com. au ( > s...@infomedia.com.au ) > wrote: > > >> Hi there, >> >> >> I’m kind of new around here, but I have had experience with all of all the >> so called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL >> Server as well as Postgresql. >> >> >> They all support the notion of “ANSI padding” for CHAR columns - which >> means that such columns are always space padded, and they default to >> having this enabled (for ANSI compliance). >> >> >> MySQL also supports it, but it defaults to leaving it disabled for >> historical reasons not unlike what we have here. >> >> >> In my opinion we should push toward standards compliance where possible >> and then document where it cannot work. >> >> >> If users don’t like the padding on CHAR columns then they should change to >> VARCHAR - I believe that was its purpose in the first place, and it does >> not dictate any sort of “padding". >> >> >> I can see why you might “ban” the use of CHAR columns where they cannot be >> consistently supported, but VARCHAR is a different animal and I would >> expect it to work consistently everywhere. >> >> >> >> >> Cheers, >> >> >> Steve C >> >> >>> On 17 Mar 2020, at 10:01 am, Dongjoon Hyun < dongjoon. hyun@ gmail. com ( >>> dongjoon.h...@gmail.com ) > wrote: >>> >>> Hi, Reynold. >>> (And +Michael Armbrust) >>> >>> >>> If you think so, do you think it's okay that we change the return value >>> silently? Then, I'm wondering why we reverted `TRIM` functions then? >>> >>> >>> > Are we sure "not padding" is "incorrect"? >>> >>> >>> >>> Bests, >>> Dongjoon. >>> >>> >>> >>> On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta < gourav. sengupta@ gmail. >>> com ( gourav.sengu...@gmail.com ) > wrote: >>> >>> >>>> Hi, >>>> >>>> >>>> 100% agree with Reynold. >>>> >>>> >>>> >>>> >>>> Regards, >>>> Gourav Sengupta >>>> >>>> >>>> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < rxin@ databricks. com ( >>>> r...@databricks.com ) > wrote: >>>> >>>> >>>>> >>>>> Are we sure "not padding" is "incorrect"? >>>>> >>>>> >>>>> >>>>> I don't know whether ANSI SQL actually requires padding, but plenty of >>>>> databases don't actually pad. >>>>> >>>>> >>>>> >>>>> https:/ / docs. snowflake. net/ manuals/ sql-reference/ data-types-text. >>>>> html >>>>> ( >>>>> https://aus01.safelinks.protection.outlook.com/?url=https:%2F%2Fdocs.snowflake.net%2Fmanuals%2Fsql-reference%2Fdata-types-text.html%23:~:text%3DCHAR%2520%252C%2520CHARACTER%2C(1)%2520is%2520the%2520default.%26text%3DSnowflake%2520currently%2520deviates%2520from%2520common%2Cspace-padded%2520at%2520the%2520end.&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=BvnZTTPTZBAi8oGWIvJk2fC%2FYSgdvq%2BAxtOj0nVzufk%3D&reserved=0 >>>>> ) : "Snowflake currently deviates from common CHAR semantics in that >>>>> strings shorter than the maximum length are not space-padded at the end." >>>>> >>>>> >>>>> >>>>> MySQL: https:/ / stackoverflow. com/ questions/ 53528645/ >>>>> why-char-dont-have-padding-in-mysql >>>>> ( >>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F53528645%2Fwhy-char-dont-have-padding-in-mysql&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062044368&sdata=3OGLht%2Fa28GcKhAGwJPXIR%2BMODiIwXGVuNuResZqwXM%3D&reserved=0 >>>>> ) >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. >>>>> com >>>>> ( dongjoon.h...@gmail.com ) > wrote: >>>>> >>>>>> Hi, Reynold. >>>>>> >>>>>> >>>>>> Please see the following for the context. >>>>>> >>>>>> >>>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-31136 ( >>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FSPARK-31136&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=pWQ9QhfVY4Uzyc8oIJ1QONQ0zOBAQ2DGSemyBj%2BvFeM%3D&reserved=0 >>>>>> ) >>>>>> "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE >>>>>> syntax" >>>>>> >>>>>> >>>>>> I raised the above issue according to the new rubric, and the banning was >>>>>> the proposed alternative to reduce the potential issue. >>>>>> >>>>>> >>>>>> Please give us your opinion since it's still PR. >>>>>> >>>>>> >>>>>> Bests, >>>>>> Dongjoon. >>>>>> >>>>>> On Sat, Mar 14, 2020 at 17:54 Reynold Xin < rxin@ databricks. com ( >>>>>> r...@databricks.com ) > wrote: >>>>>> >>>>>> >>>>>>> I don’t understand this change. Wouldn’t this “ban” confuse the hell out >>>>>>> of both new and old users? >>>>>>> >>>>>>> >>>>>>> For old users, their old code that was working for char(3) would now >>>>>>> stop >>>>>>> working. >>>>>>> >>>>>>> >>>>>>> For new users, depending on whether the underlying metastore char(3) is >>>>>>> either supported but different from ansi Sql (which is not that big of a >>>>>>> deal if we explain it) or not supported. >>>>>>> >>>>>>> On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun < dongjoon. hyun@ gmail. >>>>>>> com >>>>>>> ( dongjoon.h...@gmail.com ) > wrote: >>>>>>> >>>>>>> >>>>>>>> Hi, All. >>>>>>>> >>>>>>>> Apache Spark has been suffered from a known consistency issue on `CHAR` >>>>>>>> type behavior among its usages and configurations. However, the >>>>>>>> evolution >>>>>>>> direction has been gradually moving forward to be consistent inside >>>>>>>> Apache >>>>>>>> Spark because we don't have `CHAR` offically. The following is the >>>>>>>> summary. >>>>>>>> >>>>>>>> With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different >>>>>>>> result. >>>>>>>> (`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to >>>>>>>> Hive behavior.) >>>>>>>> >>>>>>>> spark-sql> CREATE TABLE t1(a CHAR(3)); >>>>>>>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC; >>>>>>>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET; >>>>>>>> >>>>>>>> spark-sql> INSERT INTO TABLE t1 SELECT 'a '; >>>>>>>> spark-sql> INSERT INTO TABLE t2 SELECT 'a '; >>>>>>>> spark-sql> INSERT INTO TABLE t3 SELECT 'a '; >>>>>>>> >>>>>>>> spark-sql> SELECT a, length(a) FROM t1; >>>>>>>> a 3 >>>>>>>> spark-sql> SELECT a, length(a) FROM t2; >>>>>>>> a 3 >>>>>>>> spark-sql> SELECT a, length(a) FROM t3; >>>>>>>> a 2 >>>>>>>> >>>>>>>> Since 2.4.0, `STORED AS ORC` became consistent. >>>>>>>> (`spark.sql.hive.convertMetastoreOrc=false` provides a fallback to Hive >>>>>>>> behavior.) >>>>>>>> >>>>>>>> spark-sql> SELECT a, length(a) FROM t1; >>>>>>>> a 3 >>>>>>>> spark-sql> SELECT a, length(a) FROM t2; >>>>>>>> a 2 >>>>>>>> spark-sql> SELECT a, length(a) FROM t3; >>>>>>>> a 2 >>>>>>>> >>>>>>>> Since 3.0.0-preview2, `CREATE TABLE` (without `STORED AS` clause) >>>>>>>> became >>>>>>>> consistent. >>>>>>>> (`spark.sql.legacy.createHiveTableByDefault.enabled=true` provides a >>>>>>>> fallback to Hive behavior.) >>>>>>>> >>>>>>>> spark-sql> SELECT a, length(a) FROM t1; >>>>>>>> a 2 >>>>>>>> spark-sql> SELECT a, length(a) FROM t2; >>>>>>>> a 2 >>>>>>>> spark-sql> SELECT a, length(a) FROM t3; >>>>>>>> a 2 >>>>>>>> >>>>>>>> In addition, in 3.0.0, SPARK-31147 aims to ban `CHAR/VARCHAR` type in >>>>>>>> the >>>>>>>> following syntax to be safe. >>>>>>>> >>>>>>>> CREATE TABLE t(a CHAR(3)); >>>>>>>> https:/ / github. com/ apache/ spark/ pull/ 27902 ( >>>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fspark%2Fpull%2F27902&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062054364&sdata=lhwUP5TcTtaO%2BLUTmx%2BPTjT0ASXPrQ7oKLL0N6EG0Ug%3D&reserved=0 >>>>>>>> ) >>>>>>>> >>>>>>>> This email is sent out to inform you based on the new policy we voted. >>>>>>>> The recommendation is always using Apache Spark's native type `String`. >>>>>>>> >>>>>>>> Bests, >>>>>>>> Dongjoon. >>>>>>>> >>>>>>>> References: >>>>>>>> 1. "CHAR implementation?", 2017/09/15 >>>>>>>> https:/ / lists. apache. org/ thread. html/ >>>>>>>> 96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%40%3Cdev. >>>>>>>> spark. apache. org%3E ( >>>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F96b004331d9762e356053b5c8c97e953e398e489d15e1b49e775702f%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=6hkno6zKTkcIrO%2FJo4hTYihsYvNynMuWcxhzL0fZR68%3D&reserved=0 >>>>>>>> ) >>>>>>>> 2. "FYI: SPARK-30098 Use default datasource as provider for CREATE >>>>>>>> TABLE >>>>>>>> syntax", 2019/12/06 >>>>>>>> https:/ / lists. apache. org/ thread. html/ >>>>>>>> 493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev. >>>>>>>> spark. apache. org%3E ( >>>>>>>> https://aus01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fthread.html%2F493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%2540%253Cdev.spark.apache.org%253E&data=02%7C01%7Cscoy%40infomedia.com.au%7C5346c8d2675342008b5708d7c9fdff54%7C45d5407150f849caa59f9457123dc71c%7C0%7C0%7C637199965062064358&sdata=QJnEU3mvUJff53Gw8F%2FAbxzd%2F8ZA1hhuoQwicX4ZXyI%3D&reserved=0 >>>>>>>> ) >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >> >> >> >> This email contains confidential information of and is the copyright of >> Infomedia. It must not be forwarded, amended or disclosed without consent >> of the sender. If you received this message by mistake, please advise the >> sender and delete all copies. Security of transmission on the internet >> cannot be guaranteed, could be infected, intercepted, or corrupted and you >> should ensure you have suitable antivirus protection in place. By sending >> us your or any third party personal details, you consent to (or confirm >> you have obtained consent from such third parties) to Infomedia’s privacy >> policy. http:/ / www. infomedia. com. au/ privacy-policy/ ( >> http://www.infomedia.com.au/privacy-policy/ ) >> > >
smime.p7s
Description: S/MIME Cryptographic Signature