Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
OK, let me put a proposal here:

1. Permanently ban CHAR for native data source tables, and only keep it for Hive compatibility. It's OK to forget about padding, like what Snowflake and MySQL have done. But it's hard for Spark to require consistent behavior for the CHAR type across all data sources. Since
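For readers outside the thread, the padding behavior under discussion can be illustrated without Spark at all. Below is a minimal Python sketch (an analogy, not Spark or Hive code; the function names are invented for illustration) of SQL-standard pad-on-write `CHAR(n)` versus a plain string column:

```python
def store_char(value: str, n: int) -> str:
    """Simulate SQL-standard CHAR(n): right-pad with spaces to length n."""
    if len(value) > n:
        raise ValueError(f"value longer than CHAR({n})")
    return value.ljust(n)

def store_string(value: str) -> str:
    """Simulate a plain STRING column: store the value as-is."""
    return value

# The same input round-trips differently under the two semantics.
as_char = store_char("ab", 5)
as_string = store_string("ab")

print(repr(as_char))          # 'ab   '
print(as_char == as_string)   # False: padding leaks into equality checks
```

The last comparison is the crux of the thread: if some data sources pad and others do not, the same value compares unequal depending on where it was stored.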

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
> > What I'd oppose is to just ban char for the native data sources, and do not have a plan to address this problem systematically.
> +1
> Just forget about padding, like what Snowflake and MySQL have done. Document that char(x) is just an alias for string. And then move on. Almost no

Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-17 Thread Ben Roling
I tried this on the users mailing list but didn't get traction. It's probably more appropriate here anyway. I've noticed that Dataset.sqlContext is public in Scala but the equivalent (DataFrame._sc) in PySpark is named as if it should be treated as private. Is this intentional? If so, what's
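The asymmetry being asked about comes down to Python's convention that a leading underscore marks an attribute as internal. A toy sketch (not PySpark; `DataFrameLike`, `Session`, and `session` are invented names for illustration) of the difference between a public accessor, like Scala's `Dataset.sqlContext`, and an underscore-private attribute:

```python
class Session:
    """Stand-in for a SQL session object."""
    def sql(self, query: str) -> str:
        return f"executed: {query}"

class DataFrameLike:
    """Toy stand-in for a DataFrame showing the naming convention at issue."""
    def __init__(self, session: Session):
        # Leading underscore signals "internal API; may change without notice".
        self._sc = session

    @property
    def session(self) -> Session:
        # A public accessor makes the contract explicit, the way a public
        # field does on the Scala side.
        return self._sc

df = DataFrameLike(Session())
print(df.session.sql("SELECT 1"))  # public, stable access
print(df._sc.sql("SELECT 1"))      # works, but relies on a private name
```

Both calls succeed, which is the point of the question: nothing stops users from reaching `_sc`, but the underscore tells them they shouldn't depend on it.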

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Maryann Xue
It would be super weird not to support VARCHAR as a SQL engine. Banning CHAR is probably fine, as its semantics are genuinely confusing. We can issue a warning when parsing VARCHAR with a limit and suggest the usage of String instead. On Tue, Mar 17, 2020 at 10:27 AM Wenchen Fan wrote: > I agree

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Wenchen Fan
I agree that Spark can define the semantics of CHAR(x) differently from the SQL standard (no padding), and ask the data sources to follow it. But the problem is, some data sources may not be able to skip padding, like the Hive serde table. On the other hand, it's easier to require padding for
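The trade-off Wenchen describes can be sketched concretely: if some sources always hand back padded values and cannot be changed, the only way to get one consistent behavior is to normalize every source to the padded form. A minimal Python sketch (an invented illustration of the idea, not Spark's implementation):

```python
def read_char(stored: str, n: int, source_pads: bool) -> str:
    """Normalize a CHAR(n) value on read so every source agrees.

    Some sources (e.g. a Hive serde table) return values already padded
    to n characters; others store the raw string. Padding on read gives
    one uniform behavior regardless of the source.
    """
    return stored if source_pads else stored.ljust(n)

# A padding source and a non-padding source now agree on the same value.
from_padded_source = read_char("ab   ", 5, source_pads=True)
from_raw_source = read_char("ab", 5, source_pads=False)
print(from_padded_source == from_raw_source)  # True: both are 'ab   '
```

This is why "require padding everywhere" is the easier direction: the sources that cannot skip padding need no change, and the rest only need a normalization step.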

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-17 Thread Hyukjin Kwon
Option 2 seems fine to me. On Tue, Mar 17, 2020 at 3:41 PM, Wenchen Fan wrote:
> I don't think option 1 is possible.
> For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns.
> For

Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-17 Thread Wenchen Fan
I don't think option 1 is possible.

For option 2: I think we need to do it anyway. It's kind of a bug that the typed Scala UDF doesn't support case classes and thus can't support struct-type input columns.

For option 3: It's a bit risky to add a new API, but it seems like we have a good reason. The
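For readers outside the thread, the underlying problem is that a UDF declared over a primitive type has no way to represent null, so a null input silently becomes the primitive's default value. A Python analogy of the Scala 2.12 behavior (invented function names; in Scala the coercion happens implicitly for `(x: Int) => x`, whereas a boxed `java.lang.Integer` parameter can propagate null):

```python
def primitive_int_udf(x) -> int:
    """Analogy for a Scala (x: Int) => x UDF: a primitive parameter
    cannot hold null, so a null input degrades to the default value 0."""
    return 0 if x is None else x  # silent coercion: the surprising behavior

def boxed_int_udf(x):
    """Analogy for a boxed java.lang.Integer parameter: null propagates."""
    return None if x is None else x

print(primitive_int_udf(None))  # 0    <- surprising: null turned into data
print(boxed_int_udf(None))      # None <- the SQL semantics users expect
```

The options debated in the thread are different ways of avoiding that silent `null -> 0` coercion for untyped Scala UDFs.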