Wenchen, can you give more detail about the different ADD COLUMN syntax?
That sounds like it would be confusing to end users.

On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Note that the design we choose here will affect both data source developers
> and end users. It's better to provide reliable behavior to end users than to
> ask them to read each data source's spec to find out which value will be used
> for missing columns when they write data.
>
> If we do want to go with the "data source decides the default value"
> approach, we should create new SQL syntax for ADD COLUMN, since its behavior
> would be different from the SQL standard ADD COLUMN command.
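>
> To make the difference concrete, this is roughly what SQL-standard ADD COLUMN
> semantics look like, written as spark-shell calls (illustrative only: the
> table name is made up, and whether this DDL is accepted depends on the source
> and Spark version):
>
>   // spark-shell: `spark` is the active SparkSession
>   spark.sql("CREATE TABLE t (id BIGINT, name STRING)")
>   spark.sql("INSERT INTO t VALUES (1, 'a')")
>
>   // SQL-standard ADD COLUMN with a default: the default is visible for
>   // existing rows as well as rows written later.
>   spark.sql("ALTER TABLE t ADD COLUMN score INT DEFAULT 0")
>   spark.sql("SELECT * FROM t").show()   // expected: 1, a, 0
>
>   // A source that instead picks its own value for missing columns would not
>   // honor this user-specified default, which is why different syntax (or at
>   // least clearly different semantics) would be needed.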
>
> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2) is
>> to have a unified way of dealing with missing columns? I feel like that
>> should probably be left up to the underlying data source implementation. For
>> example, if columns are missing when writing to a database, the data source
>> can choose a value based on the database's metadata, if such a thing exists.
>> I don't think Spark should really be involved at this level of detail, but
>> I've also missed all of these meetings (sorry, it's family dinner time :) ),
>> so I may be missing something.
>>
>> So my tl;dr is: let a data source report whether or not missing columns are
>> OK, and let the data source deal with the missing data based on its
>> underlying storage.
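>>
>> A rough sketch of what "let the source report it" could look like (the trait
>> and method names here are hypothetical, not part of the current DSv2 API):
>>
>>   import org.apache.spark.sql.types.StructType
>>
>>   // Hypothetical capability a table implementation could mix in so that
>>   // Spark's append validation knows the source can handle an omitted column.
>>   trait AcceptsMissingColumns {
>>     // True if the source can fill in `column` itself (e.g. from its own
>>     // metadata or storage-level defaults) when written data omits it.
>>     def canFillMissingColumn(column: String): Boolean
>>   }
>>
>>   // Sketch of the analyzer-side check: fail the write only if a missing
>>   // column cannot be handled by the source.
>>   object AppendValidation {
>>     def validate(tableSchema: StructType,
>>                  querySchema: StructType,
>>                  table: AnyRef): Unit = {
>>       val missing = tableSchema.fieldNames.toSet -- querySchema.fieldNames.toSet
>>       missing.foreach { name =>
>>         val handled = table match {
>>           case t: AcceptsMissingColumns => t.canFillMissingColumn(name)
>>           case _ => false
>>         }
>>         if (!handled) {
>>           throw new IllegalArgumentException(s"Cannot write: missing column '$name'")
>>         }
>>       }
>>     }
>>   }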
>>
>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I agree that we should not rewrite existing Parquet files when a new
>>> column is added, but we should also try our best to make the behavior the
>>> same as the RDBMS/SQL standard.
>>>
>>> 1. It should be the user who decides the default value of a column, via
>>> CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>> 2. When adding a new column, the default value should be effective for all
>>> existing data as well as newly written data.
>>> 3. When altering an existing column to change its default value, the change
>>> should be effective for newly written data only.
>>>
>>> A possible implementation (a rough sketch follows below):
>>> 1. A column has two default values: the initial one and the latest one.
>>> 2. When adding a column with a default value, set both the initial and the
>>> latest default to this value, but do not update existing data.
>>> 3. When reading data, fill the missing column with the initial default
>>> value.
>>> 4. When writing data, fill the missing column with the latest default
>>> value.
>>> 5. When altering a column to change its default value, only update the
>>> latest default value.
>>>
>>> This works because:
>>> 1. New files will be written with the latest default value, so there is
>>> nothing to worry about at read time.
>>> 2. Old files will be read with the initial default value, which returns the
>>> expected result.
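>>>
>>> A minimal sketch of that bookkeeping, with made-up names (only the metadata
>>> handling; the read and write paths would consult it as described above):
>>>
>>>   // Per-column metadata: the default recorded when the column was added
>>>   // (initial) and the most recently declared default (latest).
>>>   case class ColumnDefaults(initial: Any, latest: Any)
>>>
>>>   case class TableDefaults(defaults: Map[String, ColumnDefaults]) {
>>>
>>>     // ADD COLUMN c DEFAULT v: record v as both defaults; existing data is
>>>     // left untouched.
>>>     def addColumn(name: String, value: Any): TableDefaults =
>>>       copy(defaults = defaults + (name -> ColumnDefaults(value, value)))
>>>
>>>     // ALTER COLUMN c SET DEFAULT v: update only the latest default, so old
>>>     // files keep reading back the initial one.
>>>     def alterDefault(name: String, value: Any): TableDefaults =
>>>       defaults.get(name) match {
>>>         case Some(d) => copy(defaults = defaults + (name -> d.copy(latest = value)))
>>>         case None    => this
>>>       }
>>>
>>>     // Read path: fill a column missing from an old file with the initial default.
>>>     def readFill(name: String): Option[Any] = defaults.get(name).map(_.initial)
>>>
>>>     // Write path: fill a column missing from the query with the latest default.
>>>     def writeFill(name: String): Option[Any] = defaults.get(name).map(_.latest)
>>>   }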
>>>
>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> This thread is a follow-up to a discussion that we started in the DSv2
>>>> community sync last week.
>>>>
>>>> The problem I’m trying to solve is that the format I’m integrating with
>>>> DSv2 supports schema evolution: specifically, adding a new optional column
>>>> so that rows without that column get a default value (null for Iceberg).
>>>> The current validation rule for an append in DSv2 fails a write if it is
>>>> missing a column, so adding a column to an existing table will cause
>>>> currently-scheduled jobs that insert data to start failing. Clearly, schema
>>>> evolution shouldn't break existing jobs that produce valid data.
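>>>>
>>>> To make the failure mode concrete, this is roughly the scenario (the table
>>>> and column names are made up, and the exact write call depends on the
>>>> Iceberg/Spark versions in use):
>>>>
>>>>   // spark-shell: a scheduled job that has been appending (id, value) rows.
>>>>   import spark.implicits._
>>>>   val df = Seq((1L, "a"), (2L, "b")).toDF("id", "value")
>>>>
>>>>   // Someone evolves the table with ALTER TABLE db.events ADD COLUMN
>>>>   // category STRING (optional, null default in Iceberg). The job's
>>>>   // DataFrame still has only (id, value), so DSv2 append validation now
>>>>   // rejects the write for the missing `category` column, even though the
>>>>   // data it produces is still valid.
>>>>   df.write.format("iceberg").mode("append").save("db.events")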
>>>>
>>>> To fix this problem, I suggested option 1: adding a way for Spark to
>>>> check whether to fail when an optional column is missing. Other
>>>> contributors in the sync thought that Spark should go with option 2:
>>>> Spark’s schema should have defaults and Spark should handle filling in
>>>> defaults the same way across all sources, like other databases.
>>>>
>>>> I think we agree that option 2 would be ideal. The problem is that it
>>>> is very hard to implement.
>>>>
>>>> A source might manage data stored in millions of immutable Parquet files,
>>>> so writing a default value into existing data isn’t possible. Spark would
>>>> need to fill in defaults at read time for files written before the column
>>>> was added (it could fill in defaults in new files at write time). Filling in defaults at
>>>> read time would require Spark to fill in defaults for only some of the
>>>> files in a scan, so Spark would need different handling for each task
>>>> depending on the schema of that task. Tasks would also be required to
>>>> produce a consistent schema, so a file without the new column couldn’t be
>>>> combined into a task with a file that has the new column. This adds quite a
>>>> bit of complexity.
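>>>>
>>>> For illustration, a very simplified sketch of the per-file handling this
>>>> would imply (the types here are made up; a real implementation would work
>>>> on InternalRow and the file reader's schema):
>>>>
>>>>   import org.apache.spark.sql.types.StructType
>>>>
>>>>   // Simplified model: each file carries the schema it was written with,
>>>>   // and a row is just a map from column name to value.
>>>>   case class DataFile(schema: StructType, rows: Seq[Map[String, Any]])
>>>>
>>>>   object ReadTimeDefaults {
>>>>     // Work out which table columns this particular file is missing and
>>>>     // splice in defaults for those columns only. Files written after the
>>>>     // column was added need no filling, which is why each task ends up
>>>>     // needing schema handling that depends on the files it was assigned.
>>>>     def readWithDefaults(tableSchema: StructType,
>>>>                          defaults: Map[String, Any],
>>>>                          file: DataFile): Seq[Map[String, Any]] = {
>>>>       val missing = tableSchema.fieldNames.filterNot(file.schema.fieldNames.contains)
>>>>       val fill = missing.flatMap(n => defaults.get(n).map(v => n -> v)).toMap
>>>>       file.rows.map(_ ++ fill)
>>>>     }
>>>>   }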
>>>>
>>>> Other sources may not need Spark to fill in the default at all. A JDBC
>>>> source would be capable of filling in the default values itself, so Spark
>>>> would need some way to communicate the default to that source. If the
>>>> source had a different policy for default values (write time instead of
>>>> read time, for example) then behavior could still be inconsistent.
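>>>>
>>>> For a JDBC source the default naturally lives in the database itself, e.g.
>>>> (illustrative only: connection details, table, and dialect are made up):
>>>>
>>>>   import java.sql.DriverManager
>>>>
>>>>   // After this DDL the database, not Spark, fills in `category` for any
>>>>   // insert that omits it, following its own policy for when the default
>>>>   // applies. That policy is exactly what Spark cannot control.
>>>>   val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
>>>>   try {
>>>>     val stmt = conn.createStatement()
>>>>     stmt.executeUpdate("ALTER TABLE events ADD COLUMN category TEXT DEFAULT 'unknown'")
>>>>     stmt.close()
>>>>   } finally {
>>>>     conn.close()
>>>>   }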
>>>>
>>>> I think that this complexity probably isn’t worth consistency in
>>>> default values across sources, if that is even achievable.
>>>>
>>>> In the sync we thought it was a good idea to send this out to the
>>>> larger group to discuss. Please reply with comments!
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix
