Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

Ryan Blue Tue, 01 Dec 2020 13:07:25 -0800

Wenchen, could you start a new thread? Many people have probably already
muted this one, and it isn't really on topic.


The question that needs to be discussed is whether this is a safe change
for the 3.1 release, and reusing an old thread is not a great way to get
people's attention about something potentially harmful like that.

On Tue, Dec 1, 2020 at 10:46 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I'm reviving this thread because this feature was reverted before the 3.0
> release, and now we are trying to add it back since the CREATE TABLE syntax
> is unified.
>
> The benefits are pretty clear: CREATE TABLE by default (without USING or
> STORED AS) should create native tables that work best with Spark. You can
> see all the benefits listed in https://github.com/apache/spark/pull/30554.
>
> I'm sending this email to collect feedback about the risks. AFAIK
> the broken use cases are:
> 1. A user issues `CREATE TABLE ... LOCATION ...`. After some table
> insertions, they want to read the data files directly from the table location.
> Because the file format is changed from Hive text to Parquet, this use case
> may be broken.
> 2. A user issues `CREATE TABLE ...` and then runs `ALTER TABLE SET SERDE`
> or `LOAD DATA`. These two are Hive-specific commands and don't work with
> Spark native tables.
> 3. A user issues `CREATE TABLE ...` and then uses Hive to add partitions
> with different serdes to this table. Spark doesn't allow a native
> partitioned table to have partitions with different formats.
>
> From my personal experience, Hive text tables are usually used to import
> CSV-like data. It's very likely that people will create Hive text tables
> explicitly, as they need the Hive syntax to specify options like the
> delimiter. Besides, I'm not sure how many Spark users are using this
> feature, as the native CSV data source can do the same job.
>
> I'd consider it a bad user experience if a simple `CREATE TABLE` gives
> users a very slow table. Changing it to return a native Parquet table
> doesn't seem to break many people, but I could be wrong.
>
> Please reply to this thread if you know more use cases that may be
> affected by this change, and share your thoughts.
>
> Thanks,
> Wenchen
>
> On Sat, Dec 7, 2019 at 1:58 PM Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Oh, looks nice. Thanks for sharing, Dongjoon.
>>
>> Bests,
>> Takeshi
>>
>> On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> I want to share the following change with the community.
>>>
>>>     SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
>>>
>>> This was merged today, and Spark's `CREATE TABLE` now uses Spark's
>>> default data source instead of the `hive` provider. This is a big
>>> improvement for Apache Spark 3.0, but it might surprise some users. (Please
>>> note that there is a fallback option for them.)
>>>
>>> Thank you, Yi, Wenchen, Xiao.
>>>
>>> Cheers,
>>> Dongjoon.
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
