Alessandro, yes. This was one of the use cases that motivated the
capability API I proposed.

After this discussion, I think we probably need a couple of capabilities.
First, a capability that indicates reads will fill in some default value
for missing columns; that way, Spark can allow writes to continue when they
do not include a column that was recently added. Second, I think we need a
capability to indicate that the source accepts non-null default values in
DDL. These are distinct because some sources can return non-null defaults,
but sources like C* and Iceberg don't support them. In Iceberg, the default
value for an optional column is always null, so it supports missing columns
but not non-null defaults.
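
A rough sketch of what those two capabilities could look like, assuming a
string-based capability check like the one in the proposal; the names, the
capabilities method, and the traits below are hypothetical, not actual API:

object TableCapabilities {
  // Reads fill in a default (null for Iceberg) for columns missing from
  // existing data, so writes may omit recently added optional columns.
  val FILL_MISSING_COLUMNS_WITH_DEFAULT = "fill-missing-columns-with-default"

  // The source accepts non-null DEFAULT values in DDL.
  val ACCEPTS_NON_NULL_DEFAULTS_IN_DDL = "accepts-non-null-defaults-in-ddl"
}

trait SupportsCapabilities {
  def capabilities: Set[String]
}

// An Iceberg-like source: missing columns read as null, but non-null
// defaults in DDL are not supported, so only the first capability is set.
class IcebergLikeTable extends SupportsCapabilities {
  override def capabilities: Set[String] =
    Set(TableCapabilities.FILL_MISSING_COLUMNS_WITH_DEFAULT)
}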

On Thu, Dec 20, 2018 at 11:40 PM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello,
> I agree that Spark should check whether the underlying datasource
> supports default values or not, and adjust its behavior accordingly.
>
> If we follow this direction, do you see the default-values capability
> in scope of the "DataSourceV2 capability API"?
>
> Best regards,
> Alessandro
>
> On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <cloud0...@gmail.com> wrote:
> >
> > Hi Ryan,
> >
> > That's a good point. Since in this case Spark is just a channel to pass
> the user's action to the data source, we should think about what actions
> the data source supports.
> >
> > Following this direction, it makes more sense to delegate everything to
> data sources.
> >
> > As a first step, maybe we should not add DDL commands to change the
> schema of a data source, but just use the capability API to let the data
> source decide what to do when the input schema doesn't match the table
> schema during a write. Users can use the native client of the data source
> to change the schema.
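
A minimal sketch of the write-side check this suggests, assuming a
string-based capability set as above; the capability name, the check, and
its placement are hypothetical, not actual Spark code:

import org.apache.spark.sql.types.StructType

// Hypothetical write-side validation: instead of always failing when the
// input schema is missing a table column, consult a capability and let the
// source fill in the default when it says it can.
def validateWriteSchema(
    tableSchema: StructType,
    writeSchema: StructType,
    capabilities: Set[String]): Unit = {
  val missing = tableSchema.fieldNames.toSet -- writeSchema.fieldNames.toSet
  val sourceFillsDefaults =
    capabilities.contains("fill-missing-columns-with-default")
  if (missing.nonEmpty && !sourceFillsDefaults) {
    throw new IllegalArgumentException(
      s"Cannot write to table: missing columns ${missing.mkString(", ")}")
  }
  // Otherwise the write proceeds and the source decides what missing
  // columns mean (for example, null or a source-defined default).
}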
> >
> > On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <rb...@netflix.com> wrote:
> >>
> >> I think it is good to know that not all sources support default values.
> That makes me think that we should delegate this behavior to the source and
> have a way for sources to signal that they accept default values in DDL (a
> capability) and assume that they do not in most cases.
> >>
> >> On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
> >>>
> >>> I guess my question is: why is this a Spark-level behavior? Say the
> user has an underlying source with a different behavior at the source
> level. In Spark they set a new default behavior and it's added to the
> catalogue: is the source expected to propagate this? Or does the user have
> to be aware that their own source settings may be different for a client
> connecting via Spark versus via a native driver?
> >>>
> >>> For example, say I'm using C* (sorry, but obviously I'm always
> thinking about C*), and I add a new column to the database. When I connect
> to the database with a non-Spark application, I expect to be able to
> insert into the table as long as I satisfy the required columns. If in
> Spark someone sets the column as having a default value (there is no such
> feature in C*), then depending on how I connect to the source I get two
> different behaviors: if I insert from the native app I get empty cells,
> and if I insert from Spark I get a default value. That sounds more
> confusing to an end-user than having consistent behavior between native
> clients and Spark clients. This is why I asked whether the goal was just
> to have a common "Spark" behavior, because I don't think it makes sense if
> you consider multiple interaction points for a source.
> >>>
> >>> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0...@gmail.com>
> wrote:
> >>>>
> >>>> So you agree with my proposal that we should follow the RDBMS/SQL
> standard regarding the behavior?
> >>>>
> >>>> > pass the default through to the underlying data source
> >>>>
> >>>> This is one way to implement the behavior.
> >>>>
> >>>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:
> >>>>>
> >>>>> I don't think we have to change the syntax. Isn't the right thing
> (for option 1) to pass the default through to the underlying data source?
> Sources that don't support defaults would throw an exception.
> >>>>>
> >>>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD
> COLUMN column_name datatype [DEFAULT value];
> >>>>>>
> >>>>>> If the DEFAULT clause is not specified, then the default value is
> null. If we are going to change the behavior and say that the default
> value is decided by the underlying data source, we should use a new SQL
> syntax (I don't have a proposal in mind) instead of reusing the existing
> syntax, in order to stay SQL-compatible.
> >>>>>>
> >>>>>> Personally, I don't like re-inventing the wheel. It's better to
> just implement the SQL standard ADD COLUMN command, which means the
> default value is decided by the end user.
> >>>>>>
> >>>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com>
> wrote:
> >>>>>>>
> >>>>>>> Wenchen, can you give more detail about the different ADD COLUMN
> syntax? To me, that sounds confusing for end users.
> >>>>>>>
> >>>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> Note that the design we choose here will affect both data source
> developers and end users. It's better to provide reliable behavior to
> end users than to ask them to read the data source's spec to know which
> value will be used for missing columns when they write data.
> >>>>>>>>
> >>>>>>>> If we do want to go with the "data source decides default value"
> approach, we should create a new SQL syntax for ADD COLUMN, as its behavior
> is different from the SQL standard ADD COLUMN command.
> >>>>>>>>
> >>>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we
> want 2) is to have a unified way of dealing with missing columns? I feel
> like that probably should be left up to the underlying datasource
> implementation. For example, if you have missing columns with a database,
> the datasource can choose a value based on the database's metadata, if
> such a thing exists. I don't think Spark should really handle this level
> of detail, but I've also missed out on all of these meetings (sorry, it's
> family dinner time :) ), so I may be missing something.
> >>>>>>>>>
> >>>>>>>>> So my tl;dr is: let a datasource report whether or not missing
> columns are OK, and let the datasource deal with the missing data based on
> its underlying storage.
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com>
> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I agree that we should not rewrite existing parquet files when
> a new column is added, but we should also try our best to make the
> behavior the same as the RDBMS/SQL standard.
> >>>>>>>>>>
> >>>>>>>>>> 1. It should be the user who decides the default value of a
> column, via CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER
> COLUMN.
> >>>>>>>>>> 2. When adding a new column, the default value should be
> effective for all existing data and newly written data.
> >>>>>>>>>> 3. When altering an existing column and changing the default
> value, it should be effective for newly written data only.
> >>>>>>>>>>
> >>>>>>>>>> A possible implementation:
> >>>>>>>>>> 1. A column has 2 default values: the initial one and the
> latest one.
> >>>>>>>>>> 2. When adding a column with a default value, set both the
> initial one and the latest one to this value, but do not update existing
> data.
> >>>>>>>>>> 3. When reading data, fill the missing column with the initial
> default value.
> >>>>>>>>>> 4. When writing data, fill the missing column with the latest
> default value.
> >>>>>>>>>> 5. When altering a column to change its default value, only
> update the latest default value.
> >>>>>>>>>>
> >>>>>>>>>> This works because:
> >>>>>>>>>> 1. New files will be written with the latest default value, so
> there is nothing to worry about at read time.
> >>>>>>>>>> 2. Old files will be read with the initial default value, which
> returns the expected result.
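
A minimal sketch of this two-default bookkeeping; the types below are
hypothetical and not part of Spark or any source, just an illustration of
steps 1-5 above:

// Each column tracks the default used for old data (initial) and the
// default applied to newly written data (latest).
case class ColumnDefaults(initial: Any, latest: Any)

class DefaultTracker {
  private var defaults = Map.empty[String, ColumnDefaults]

  // Adding a column sets both defaults; existing data is not rewritten.
  def addColumn(name: String, default: Any): Unit =
    defaults += name -> ColumnDefaults(default, default)

  // Altering a column's default only updates the latest value.
  def alterDefault(name: String, newDefault: Any): Unit =
    defaults.get(name).foreach(d => defaults += name -> d.copy(latest = newDefault))

  // Reads fill missing columns with the initial default (matches old files).
  def readFill(name: String): Option[Any] = defaults.get(name).map(_.initial)

  // Writes fill missing columns with the latest default.
  def writeFill(name: String): Option[Any] = defaults.get(name).map(_.latest)
}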
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue
> <rb...@netflix.com.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> This thread is a follow-up to a discussion that we started in
> the DSv2 community sync last week.
> >>>>>>>>>>>
> >>>>>>>>>>> The problem I’m trying to solve is that the format I’m
> integrating with DSv2 supports schema evolution: specifically, adding a
> new optional column so that rows without that column get a default value
> (null for Iceberg). The current validation rule for an append in DSv2
> fails a write if it is missing a column, so adding a column to an existing
> table will cause currently-scheduled jobs that insert data to start
> failing. Clearly, schema evolution shouldn't break existing jobs that
> produce valid data.
> >>>>>>>>>>>
> >>>>>>>>>>> To fix this problem, I suggested option 1: adding a way for
> Spark to check whether to fail when an optional column is missing. Other
> contributors in the sync thought that Spark should go with option 2:
> Spark’s schema should have defaults and Spark should handle filling in
> defaults the same way across all sources, like other databases.
> >>>>>>>>>>>
> >>>>>>>>>>> I think we agree that option 2 would be ideal. The problem is
> that it is very hard to implement.
> >>>>>>>>>>>
> >>>>>>>>>>> A source might manage data stored in millions of immutable
> Parquet files, so rewriting those files to add a default value isn’t
> possible. Spark would need to fill in defaults at read time for files
> written before the column was added (it could fill in defaults in new
> files at write time). Filling in defaults at read time would require Spark
> to fill in defaults for only some of the files in a scan, so Spark would
> need different handling for each task depending on the schema of that
> task. Tasks would also be required to produce a consistent schema, so a
> file without the new column couldn’t be combined into a task with a file
> that has the new column. This adds quite a bit of complexity.
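
To make the per-file handling concrete, a rough sketch only (this is not
how Spark's scan planning works; the helper and its use of a constant
column are assumptions for illustration):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Files written before the column existed need the default projected in at
// read time; newer files already contain the column.
def readFileWithDefault(
    spark: SparkSession,
    path: String,
    newColumn: String,
    default: Any): DataFrame = {
  val df = spark.read.parquet(path)
  if (df.columns.contains(newColumn)) {
    df // newer file: the column is already present
  } else {
    df.withColumn(newColumn, lit(default)) // older file: fill the constant
  }
}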
> >>>>>>>>>>>
> >>>>>>>>>>> Other sources may not need Spark to fill in the default at
> all. A JDBC source would be capable of filling in the default values
> itself, so Spark would need some way to communicate the default to that
> source. If the source had a different policy for default values (write time
> instead of read time, for example) then behavior could still be
> inconsistent.
> >>>>>>>>>>>
> >>>>>>>>>>> I think that this complexity probably isn’t worth consistency
> in default values across sources, if that is even achievable.
> >>>>>>>>>>>
> >>>>>>>>>>> In the sync we thought it was a good idea to send this out to
> the larger group to discuss. Please reply with comments!
> >>>>>>>>>>>
> >>>>>>>>>>> rb
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Ryan Blue
> >>>>>>>>>>> Software Engineer
> >>>>>>>>>>> Netflix
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Ryan Blue
> >>>>>>> Software Engineer
> >>>>>>> Netflix
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Ryan Blue
> >>>>> Software Engineer
> >>>>> Netflix
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
