Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
I agree it sucks. We started with some decision that might have made sense back in 2013 (let's use Hive as the default source, and guess what, pick the slowest possible serde by default). We have been paying that debt ever since. Thanks for bringing this thread up though. We don't have a clear

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Dongjoon Hyun
Technically, I have been suffering with (1) `CREATE TABLE` due to its many differences for a long time (since 2017). So, I had a wrong assumption about the implication of that "(2) FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax", Reynold. I admit that. You may not feel in the

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
You are joking when you said " informed widely and discussed in many ways twice" right? This thread doesn't even talk about char/varchar:  https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E (Yes it talked about changing the

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Dongjoon Hyun
+1 for Wenchen's suggestion. I believe that the difference and effects are informed widely and discussed in many ways twice. First, this was shared last December. "FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax", 2019/12/06

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
OK let me put a proposal here: 1. Permanently ban CHAR for native data source tables, and only keep it for Hive compatibility. It's OK to forget about padding like what Snowflake and MySQL have done. But it's hard for Spark to require consistent behavior about CHAR type in all data sources. Since

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
> > What I'd oppose is to just ban char for the native data sources, and do not have a plan to address this problem systematically. > +1 > Just forget about padding, like what Snowflake and MySQL have done. Document that char(x) is just an alias for string. And then move on. Almost no

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Maryann Xue
It would be super weird not to support VARCHAR as a SQL engine. Banning CHAR is probably fine, as its semantics are genuinely confusing. We can issue a warning when parsing VARCHAR with a limit and suggest the usage of String instead. On Tue, Mar 17, 2020 at 10:27 AM Wenchen Fan wrote: > I agree

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
I agree that Spark can define the semantics of CHAR(x) differently than the SQL standard (no padding), and ask the data sources to follow it. But the problem is, some data sources may not be able to skip padding, like the Hive serde table. On the other hand, it's easier to require padding for

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
I don’t think I can recall any usages of type CHAR in any situation. Really, its only use (on any traditional SQL database) would be when you *want* a fixed width character column that has been right padded with spaces. On 17 Mar 2020, at 12:13 pm, Reynold Xin <r...@databricks.com>

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
For sure. There's another reason I feel char is not that important and it's more important to be internally consistent (e.g. all data sources support it with the same behavior, vs. one data source behaving one way and another behaving the other way). char was created at a time when cpu was slow and

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Thank you for sharing and confirming. We had better consider all heterogeneous customers in the world. And, I also have experiences with the non-negligible cases in on-prem. Bests, Dongjoon. On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin wrote: > −User > > char barely showed up (honestly

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
−User char barely showed up (honestly negligible). I was comparing select vs select. On Mon, Mar 16, 2020 at 5:40 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > Ur, are you comparing the number of SELECT statement with TRIM and CREATE > statements with `CHAR`? > > > I looked up our

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Ur, are you comparing the number of SELECT statements with TRIM and CREATE statements with `CHAR`? > I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char. We need to discuss more about what to do. This thread is what

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
BTW I'm not opposing us sticking to SQL standard (I'm in general for it). I was merely pointing out that if we deviate away from SQL standard in any way we are considered "wrong" or "incorrect". That argument itself is flawed when plenty of other popular database systems also deviate away from

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I looked up our usage logs (sorry I can't share this publicly) and trim has at least four orders of magnitude higher usage than char. On Mon, Mar 16, 2020 at 5:27 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > Thank you, Stephen and Reynold. > > > To Reynold. > > > The way I see

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Thank you, Stephen and Reynold. To Reynold. The way I see the following is a little different. > CHAR is an undocumented data type without clearly defined semantics. Let me describe it from the Apache Spark user's point of view. Apache Spark started to claim `HiveContext` (and `hql/hiveql`

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there, I’m kind of new around here, but I have had experience with all of the so-called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server as well as PostgreSQL. They all support the notion of “ANSI padding” for CHAR columns - which means that such columns are always
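
For readers skimming the thread, here is a minimal sketch of the two behaviors being debated. It assumes "ANSI padding" means right-padding a value with spaces up to the declared CHAR length, which is how another message in this thread describes fixed-width columns. The helper names `store_char_padded` and `store_char_unpadded` are hypothetical and are not Spark, Hive, or any database's API.

```python
def store_char_padded(value: str, length: int) -> str:
    """Hypothetical: emulate a CHAR(length) column that right-pads with spaces."""
    if len(value) > length:
        # Assumption: over-length values are rejected rather than truncated.
        raise ValueError(f"value longer than CHAR({length})")
    return value.ljust(length)


def store_char_unpadded(value: str, length: int) -> str:
    """Hypothetical: emulate treating CHAR(length) as a length-checked string, no padding."""
    if len(value) > length:
        raise ValueError(f"value longer than CHAR({length})")
    return value


padded = store_char_padded("ab", 3)
unpadded = store_char_unpadded("ab", 3)

# The same input yields different stored values, so a plain equality check
# against the original literal succeeds in one mode and fails in the other.
print(repr(padded), padded == "ab")
print(repr(unpadded), unpadded == "ab")
```

This only illustrates why the same `char(3)` column can return different results depending on whether the underlying source pads; it says nothing about how any particular engine compares padded values.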

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
I haven't spent enough time thinking about it to give a strong opinion, but this is of course very different from TRIM. TRIM is a publicly documented function with two arguments, and we silently swapped the two arguments. And trim is also quite commonly used from a long time ago. CHAR is an

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Dongjoon Hyun
Hi, Reynold. (And +Michael Armbrust) If you think so, do you think it's okay that we change the return value silently? If so, I'm wondering why we reverted the `TRIM` functions. > Are we sure "not padding" is "incorrect"? Bests, Dongjoon. On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta

Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Reynold Xin
Are we sure "not padding" is "incorrect"? I don't know whether ANSI SQL actually requires padding, but plenty of databases don't actually pad. https://docs.snowflake.net/manuals/sql-reference/data-types-text.html (

Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Dongjoon Hyun
Hi, Reynold. Please see the following for the context. https://issues.apache.org/jira/browse/SPARK-31136 "Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax" I raised the above issue according to the new rubric, and the banning was the proposed alternative to reduce

Re: FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Reynold Xin
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of both new and old users? For old users, their old code that was working for char(3) would now stop working. For new users, depending on whether the underlying metastore char(3) is either supported but different from ansi

FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Dongjoon Hyun
Hi, All. Apache Spark has suffered from a known consistency issue on `CHAR` type behavior among its usages and configurations. However, the evolution direction has been gradually moving forward to be consistent inside Apache Spark because we don't have `CHAR` officially. The following is the