Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Wenchen Fan Tue, 01 Dec 2020 10:46:50 -0800

I'm reviving this thread because this feature was reverted before the 3.0
release, and now we are trying to add it back since the CREATE TABLE syntax
is unified.

The benefits are pretty clear: CREATE TABLE by default (without USING or
STORED AS) should create native tables that work best with Spark. You can
see all the benefits listed in https://github.com/apache/spark/pull/30554.

I'm sending this email to collect feedback about the risks. AFAIK
the broken use cases are:
1. A user issues `CREATE TABLE ... LOCATION ...`. After some table
insertions, the user wants to read the data files directly from the table
location. Because the file format changes from Hive text to Parquet, this
use case may break.
2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE`
or `LOAD DATA`. Both are Hive-specific commands and don't work with
Spark native tables.
3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions
with different serdes to this table. Spark doesn't allow a native
partitioned table to have partitions with different formats.
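
As a rough sketch of cases 1 and 2 (the table name, columns, and paths
below are made up for illustration, and the exact errors depend on the
Spark version):

```sql
-- Case 1: no USING or STORED AS, so the provider comes from the default.
-- Hypothetical table and location.
CREATE TABLE events (id INT, name STRING) LOCATION '/tmp/events';
INSERT INTO events VALUES (1, 'a');
-- A job that reads the files under /tmp/events as delimited text would
-- find Parquet files there once the default becomes a native table.

-- Case 2: Hive-specific commands issued against the same table.
-- Per the list above, these are expected to fail on a Spark native table.
ALTER TABLE events SET SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
LOAD DATA INPATH '/tmp/staging/events.csv' INTO TABLE events;

-- Stating the format explicitly avoids depending on the default.
CREATE TABLE events_hive (id INT, name STRING) STORED AS TEXTFILE;
CREATE TABLE events_native (id INT, name STRING) USING parquet;
```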

From my personal experience, Hive text tables are usually used to
import CSV-like data. It's very likely that people will create Hive text
tables explicitly, as they need the Hive syntax to specify options like
the delimiter. Besides, I'm not sure how many Spark users are using this
feature, as the native CSV data source can do the same job.
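
For comparison, a minimal sketch of the two ways to declare a
comma-delimited table (table and column names are hypothetical):

```sql
-- Hive text table: the delimiter is given with Hive syntax, so the
-- table is already created explicitly as a Hive table.
CREATE TABLE csv_hive (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Native CSV data source: the delimiter is passed as a data source option.
CREATE TABLE csv_native (id INT, name STRING)
USING csv
OPTIONS (sep ',', header 'false');
```

Neither statement relies on the default provider, so neither should be
affected by the proposed change.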

I'd consider it a bad user experience if a simple `CREATE TABLE` gives
users a very slow table. Changing it to create a native Parquet table
doesn't seem likely to break many people, but I could be wrong.

Please reply to this thread if you know more use cases that may be affected
by this change, and share your thoughts.

Thanks,
Wenchen

On Sat, Dec 7, 2019 at 1:58 PM Takeshi Yamamuro <linguin....@gmail.com>
wrote:

> Oh, looks nice. Thanks for sharing, Dongjoon
>
> Bests,
> Takeshi
>
> On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I want to share the following change to the community.
>>
>>     SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
>>
>> This is merged today and now Spark's `CREATE TABLE` is using Spark's
>> default data sources instead of `hive` provider. This is a good and big
>> improvement for Apache Spark 3.0, but this might surprise someone. (Please
>> note that there is a fallback option for them.)
>>
>> Thank you, Yi, Wenchen, Xiao.
>>
>> Cheers,
>> Dongjoon.
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
