Re: [DISCUSS] Default values and data sources

Wenchen Fan Wed, 19 Dec 2018 18:39:42 -0800

The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN
column_name datatype [DEFAULT value];


If the DEFAULT statement is not specified, then the default value is null.
If we are going to change the behavior and say the default value is decided
by the underlying data source, we should use a new SQL syntax(I don't have
a proposal in mind), instead of reusing the existing syntax, to be SQL
compatible.

Personally I don't like re-invent wheels. It's better to just implement the
SQL standard ADD COLUMN command, which means the default value is decided
by the end-users.

On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <[email protected]> wrote:

> Wenchen, can you give more detail about the different ADD COLUMN syntax?
> That sounds confusing to end users to me.
>
> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <[email protected]> wrote:
>
>> Note that the design we make here will affect both data source developers
>> and end-users. It's better to provide reliable behaviors to end-users,
>> instead of asking them to read the spec of the data source and know which
>> value will be used for missing columns, when they write data.
>>
>> If we do want to go with the "data source decides default value"
>> approach, we should create a new SQL syntax for ADD COLUMN, as its behavior
>> is different from the SQL standard ADD COLUMN command.
>>
>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2
>>> is for a unified way of dealing with missing columns? I feel like that
>>> probably should be left up to the underlying datasource implementation. For
>>> example if you have missing columns with a database the Datasource can
>>> choose a value based on the Database's metadata if such a thing exists, I
>>> don't think Spark should really have a this level of detail but I've also
>>> missed out on all of these meetings (sorry it's family dinner time :) ) so
>>> I may be missing something.
>>>
>>> So my tldr is, Let a datasource report whether or not missing columns
>>> are OK and let the Datasource deal with the missing data based on it's
>>> underlying storage.
>>>
>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <[email protected]> wrote:
>>>
>>>> I agree that we should not rewrite existing parquet files when a new
>>>> column is added, but we should also try out best to make the behavior same
>>>> as RDBMS/SQL standard.
>>>>
>>>> 1. it should be the user who decides the default value of a column, by
>>>> CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>> 2. When adding a new column, the default value should be effective for
>>>> all the existing data, and newly written data.
>>>> 3. When altering an existing column and change the default value, it
>>>> should be effective for newly written data only.
>>>>
>>>> A possible implementation:
>>>> 1. a columnn has 2 default values: the initial one and the latest one.
>>>> 2. when adding a column with a default value, set both the initial one
>>>> and the latest one to this value. But do not update existing data.
>>>> 3. when reading data, fill the missing column with the initial default
>>>> value
>>>> 4. when writing data, fill the missing column with the latest default
>>>> value
>>>> 5. when altering a column to change its default value, only update the
>>>> latest default value.
>>>>
>>>> This works because:
>>>> 1. new files will be written with the latest default value, nothing we
>>>> need to worry about at read time.
>>>> 2. old files will be read with the initial default value, which returns
>>>> expected result.
>>>>
>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> This thread is a follow-up to a discussion that we started in the DSv2
>>>>> community sync last week.
>>>>>
>>>>> The problem I’m trying to solve is that the format I’m using DSv2 to
>>>>> integrate supports schema evolution. Specifically, adding a new optional
>>>>> column so that rows without that column get a default value (null for
>>>>> Iceberg). The current validation rule for an append in DSv2 fails a write
>>>>> if it is missing a column, so adding a column to an existing table will
>>>>> cause currently-scheduled jobs that insert data to start failing. Clearly,
>>>>> schema evolution shouldn't break existing jobs that produce valid data.
>>>>>
>>>>> To fix this problem, I suggested option 1: adding a way for Spark to
>>>>> check whether to fail when an optional column is missing. Other
>>>>> contributors in the sync thought that Spark should go with option 2:
>>>>> Spark’s schema should have defaults and Spark should handle filling in
>>>>> defaults the same way across all sources, like other databases.
>>>>>
>>>>> I think we agree that option 2 would be ideal. The problem is that it
>>>>> is very hard to implement.
>>>>>
>>>>> A source might manage data stored in millions of immutable Parquet
>>>>> files, so adding a default value isn’t possible. Spark would need to fill
>>>>> in defaults for files written before the column was added at read time (it
>>>>> could fill in defaults in new files at write time). Filling in defaults at
>>>>> read time would require Spark to fill in defaults for only some of the
>>>>> files in a scan, so Spark would need different handling for each task
>>>>> depending on the schema of that task. Tasks would also be required to
>>>>> produce a consistent schema, so a file without the new column couldn’t be
>>>>> combined into a task with a file that has the new column. This adds quite 
>>>>> a
>>>>> bit of complexity.
>>>>>
>>>>> Other sources may not need Spark to fill in the default at all. A JDBC
>>>>> source would be capable of filling in the default values itself, so Spark
>>>>> would need some way to communicate the default to that source. If the
>>>>> source had a different policy for default values (write time instead of
>>>>> read time, for example) then behavior could still be inconsistent.
>>>>>
>>>>> I think that this complexity probably isn’t worth consistency in
>>>>> default values across sources, if that is even achievable.
>>>>>
>>>>> In the sync we thought it was a good idea to send this out to the
>>>>> larger group to discuss. Please reply with comments!
>>>>>
>>>>> rb
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Default values and data sources

Reply via email to