Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds like it would be confusing to end users.
On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Note that the design we make here will affect both data source developers
> and end users. It's better to provide reliable behavior to end users,
> instead of asking them to read the spec of the data source and know which
> value will be used for missing columns when they write data.
>
> If we do want to go with the "data source decides default value" approach,
> we should create a new SQL syntax for ADD COLUMN, as its behavior is
> different from the SQL standard ADD COLUMN command.
>
> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2)
>> is for a unified way of dealing with missing columns? I feel like that
>> should probably be left up to the underlying data source implementation.
>> For example, if you have missing columns when writing to a database, the
>> data source can choose a value based on the database's metadata, if such
>> a thing exists. I don't think Spark should really have this level of
>> detail, but I've also missed out on all of these meetings (sorry, it's
>> family dinner time :) ), so I may be missing something.
>>
>> So my tl;dr is: let a data source report whether or not missing columns
>> are OK, and let the data source deal with the missing data based on its
>> underlying storage.
>>
>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I agree that we should not rewrite existing Parquet files when a new
>>> column is added, but we should also try our best to make the behavior
>>> the same as the RDBMS/SQL standard.
>>>
>>> 1. It should be the user who decides the default value of a column, via
>>> CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>> 2. When adding a new column, the default value should be effective for
>>> all existing data and newly written data.
>>> 3. When changing the default value of an existing column, the new
>>> default should be effective for newly written data only.
>>>
>>> A possible implementation:
>>> 1. A column has two default values: the initial one and the latest one.
>>> 2. When adding a column with a default value, set both the initial and
>>> the latest default to this value, but do not update existing data.
>>> 3. When reading data, fill the missing column with the initial default
>>> value.
>>> 4. When writing data, fill the missing column with the latest default
>>> value.
>>> 5. When altering a column to change its default value, only update the
>>> latest default value.
>>>
>>> This works because:
>>> 1. New files are written with the latest default value, so there is
>>> nothing to worry about at read time.
>>> 2. Old files are read with the initial default value, which returns the
>>> expected result.
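A minimal sketch of the two-default-value scheme described above, assuming a simple in-memory metadata model; the names ColumnDefaults, TableDefaults, fillOnRead, and fillOnWrite are illustrative only, not Spark APIs:

    // A column keeps two defaults: the initial one (applied when reading old
    // files that predate the column) and the latest one (applied when a
    // write omits the column).
    case class ColumnDefaults(initial: Any, latest: Any)

    case class TableDefaults(defaults: Map[String, ColumnDefaults]) {

      // ADD COLUMN ... DEFAULT v: both defaults start as v; existing files
      // are left untouched.
      def addColumn(col: String, value: Any): TableDefaults =
        copy(defaults = defaults + (col -> ColumnDefaults(value, value)))

      // ALTER COLUMN ... SET DEFAULT v: only future writes see the new default.
      def alterDefault(col: String, value: Any): TableDefaults =
        copy(defaults = defaults + (col -> defaults(col).copy(latest = value)))

      // Read path: fill columns missing from an old file with the initial default.
      def fillOnRead(row: Map[String, Any]): Map[String, Any] =
        defaults.foldLeft(row) { case (r, (col, d)) =>
          if (r.contains(col)) r else r + (col -> d.initial)
        }

      // Write path: fill columns missing from the incoming row with the latest default.
      def fillOnWrite(row: Map[String, Any]): Map[String, Any] =
        defaults.foldLeft(row) { case (r, (col, d)) =>
          if (r.contains(col)) r else r + (col -> d.latest)
        }
    }

    // Example: add "score" with default 0, then change the default to -1.
    val t = TableDefaults(Map.empty).addColumn("score", 0).alterDefault("score", -1)
    t.fillOnRead(Map("id" -> 1))   // Map(id -> 1, score -> 0): old file, initial default
    t.fillOnWrite(Map("id" -> 2))  // Map(id -> 2, score -> -1): new write, latest default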
>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> This thread is a follow-up to a discussion that we started in the DSv2
>>>> community sync last week.
>>>>
>>>> The problem I'm trying to solve is that the format I'm using DSv2 to
>>>> integrate supports schema evolution. Specifically, adding a new
>>>> optional column so that rows without that column get a default value
>>>> (null for Iceberg). The current validation rule for an append in DSv2
>>>> fails a write if it is missing a column, so adding a column to an
>>>> existing table will cause currently-scheduled jobs that insert data to
>>>> start failing. Clearly, schema evolution shouldn't break existing jobs
>>>> that produce valid data.
>>>>
>>>> To fix this problem, I suggested option 1: adding a way for Spark to
>>>> check whether to fail when an optional column is missing. Other
>>>> contributors in the sync thought that Spark should go with option 2:
>>>> Spark's schema should have defaults, and Spark should handle filling in
>>>> defaults the same way across all sources, like other databases.
>>>>
>>>> I think we agree that option 2 would be ideal. The problem is that it
>>>> is very hard to implement.
>>>>
>>>> A source might manage data stored in millions of immutable Parquet
>>>> files, so adding a default value to those files isn't possible. Spark
>>>> would need to fill in defaults at read time for files written before
>>>> the column was added (it could fill in defaults in new files at write
>>>> time). Filling in defaults at read time would require Spark to fill in
>>>> defaults for only some of the files in a scan, so Spark would need
>>>> different handling for each task depending on the schema of that task.
>>>> Tasks would also be required to produce a consistent schema, so a file
>>>> without the new column couldn't be combined into a task with a file
>>>> that has the new column. This adds quite a bit of complexity.
>>>>
>>>> Other sources may not need Spark to fill in the default at all. A JDBC
>>>> source would be capable of filling in the default values itself, so
>>>> Spark would need some way to communicate the default to that source. If
>>>> the source had a different policy for default values (write time
>>>> instead of read time, for example), then behavior could still be
>>>> inconsistent.
>>>>
>>>> I think that this complexity probably isn't worth consistency in
>>>> default values across sources, if that is even achievable.
>>>>
>>>> In the sync we thought it was a good idea to send this out to the
>>>> larger group to discuss. Please reply with comments!
>>>>
>>>> rb
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
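To make option 1 from the original message concrete, here is a rough sketch of the capability check it implies, assuming the source reports whether it can supply values for missing optional columns; WritableTable, acceptsMissingOptionalColumns, and AppendValidation are hypothetical names, not the actual DSv2 interfaces:

    // Hypothetical shape of option 1: append validation only fails on a
    // missing column if the column is required, or if the source cannot
    // supply a value for a missing optional column itself.
    case class Field(name: String, nullable: Boolean)

    trait WritableTable {
      def schema: Seq[Field]
      // True if the source fills in its own value (e.g. null for Iceberg)
      // for optional columns that are absent from the written data.
      def acceptsMissingOptionalColumns: Boolean
    }

    object AppendValidation {
      def validate(table: WritableTable, writeSchema: Seq[Field]): Either[String, Unit] = {
        val written = writeSchema.map(_.name).toSet
        val missing = table.schema.filterNot(f => written.contains(f.name))
        val fatal = missing.filter(f => !f.nullable || !table.acceptsMissingOptionalColumns)
        if (fatal.isEmpty) Right(())
        else Left(s"Cannot append: missing columns ${fatal.map(_.name).mkString(", ")}")
      }
    }

With a check along these lines, adding an optional column to a table would not break a scheduled job that keeps writing the old schema, which is the failure mode described in the original message.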