Wenchen, could you start a new thread? Many people have probably already muted this one, and it isn't really on topic.
The question that needs to be discussed is whether this is a safe change for the 3.1 release, and reusing an old thread is not a great way to get people's attention about something potentially harmful like that.

On Tue, Dec 1, 2020 at 10:46 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I'm reviving this thread because this feature was reverted before the 3.0
> release, and now we are trying to add it back since the CREATE TABLE syntax
> is unified.
>
> The benefits are pretty clear: CREATE TABLE by default (without USING or
> STORED AS) should create native tables that work best with Spark. You can
> see all the benefits listed in https://github.com/apache/spark/pull/30554.
>
> I'm sending this email to collect feedback about the risks. AFAIK
> the broken use cases are:
> 1. A user issues `CREATE TABLE ... LOCATION ...`. After some table
> insertions they want to read the data files directly from the table location.
> Because the file format is changed from Hive text to Parquet, this use case
> may be broken.
> 2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE`
> or `LOAD DATA`. These two are Hive-specific commands and don't work with
> Spark native tables.
> 3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions
> with different serdes to this table. Spark doesn't allow a native
> partitioned table to have partitions with different formats.
>
> From my personal experience, Hive text tables are usually used to
> import CSV-like data. It's very likely that people will create Hive text
> tables explicitly, as they need the Hive syntax to specify options like
> the delimiter. Besides, I'm not sure how many Spark users are using this
> feature, as the native CSV data source can do the same job.
>
> I'd consider it a bad user experience if a simple `CREATE TABLE` gives
> users a very slow table. Changing it to return a native Parquet table doesn't
> seem to break many people, but I could be wrong.
>
> Please reply to this thread if you know more use cases that may be
> affected by this change, and share your thoughts.
>
> Thanks,
> Wenchen
>
> On Sat, Dec 7, 2019 at 1:58 PM Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Oh, looks nice. Thanks for the sharing, Dongjoon
>>
>> Bests,
>> Takeshi
>>
>> On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> I want to share the following change to the community.
>>>
>>> SPARK-30098 Use default datasource as provider for CREATE TABLE
>>> syntax
>>>
>>> This is merged today and now Spark's `CREATE TABLE` is using Spark's
>>> default data sources instead of `hive` provider. This is a good and big
>>> improvement for Apache Spark 3.0, but this might surprise someone. (Please
>>> note that there is a fallback option for them.)
>>>
>>> Thank you, Yi, Wenchen, Xiao.
>>>
>>> Cheers,
>>> Dongjoon.
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

--
Ryan Blue
Software Engineer
Netflix