Hello,

I agree that Spark should check whether the underlying data source supports default values or not, and adjust its behavior accordingly.

If we follow this direction, do you see the default-values capability as being in scope of the "DataSourceV2 capability API"?

Best regards,
Alessandro
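As a concrete illustration of that direction, a minimal Scala sketch of what a write-time capability check could look like; the DefaultValueSupport trait, supportsMissingColumns method, and WriteValidation object are hypothetical names for the idea discussed in this thread, not an existing DataSourceV2 API.

    // Hypothetical sketch: a capability mixin a data source could implement to tell
    // Spark that it can handle writes whose schema omits optional columns.
    import org.apache.spark.sql.types.StructType

    trait DefaultValueSupport {
      // true if the source fills in its own defaults (or nulls) for missing columns
      def supportsMissingColumns: Boolean
    }

    object WriteValidation {
      // Sketch of the append validation rule: fail on a missing column only when the
      // column is required, or when the source does not declare the capability.
      def validate(tableSchema: StructType, inputSchema: StructType, source: AnyRef): Unit = {
        val missing = tableSchema.filterNot(f => inputSchema.fieldNames.contains(f.name))
        val canSkip = source match {
          case d: DefaultValueSupport => d.supportsMissingColumns
          case _                      => false
        }
        missing.foreach { f =>
          if (!f.nullable || !canSkip) {
            throw new IllegalArgumentException(s"Cannot write: input is missing column '${f.name}'")
          }
        }
      }
    }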
On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi Ryan,
>
> That's a good point. Since in this case Spark is just a channel to pass the user's action to the data source, we should think of what actions the data source supports.
>
> Following this direction, it makes more sense to delegate everything to data sources.
>
> As the first step, maybe we should not add DDL commands to change the schema of a data source, but just use the capability API to let the data source decide what to do when the input schema doesn't match the table schema during writing. Users can use the native client of the data source to change the schema.
>
> On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>> I think it is good to know that not all sources support default values. That makes me think that we should delegate this behavior to the source and have a way for sources to signal that they accept default values in DDL (a capability), and assume that they do not in most cases.
>>
>> On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>
>>> I guess my question is: why is this a Spark-level behavior? Say the user has an underlying source where they have a different behavior at the source level. In Spark they set a new default behavior and it's added to the catalogue; is the source expected to propagate this? Or does the user have to be aware that their own source settings may be different for a client connecting via Spark or via a native driver?
>>>
>>> For example, say I'm using C* (sorry, but obviously I'm always thinking about C*), and I add a new column to the database. When I connect to the database with a non-Spark application, I expect to be able to insert into the table given that I satisfy the required columns. In Spark someone sets the column as having a default value (there is no such feature in C*), so now, depending on how I connect to the source, I have two different behaviors. If I insert from the native app I get empty cells; if I insert from Spark I get a default value inserted. That sounds more confusing to an end-user than having a consistent behavior between native clients and Spark clients. This is why I asked if the goal was to just have a common "Spark" behavior, because I don't think it makes sense if you consider multiple interaction points for a source.
>>>
>>> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> So you agree with my proposal that we should follow the RDBMS/SQL standard regarding the behavior?
>>>>
>>>> > pass the default through to the underlying data source
>>>>
>>>> This is one way to implement the behavior.
>>>>
>>>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> I don't think we have to change the syntax. Isn't the right thing (for option 1) to pass the default through to the underlying data source? Sources that don't support defaults would throw an exception.
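A minimal Scala sketch of the "pass the default through" idea; the AddColumn case class and SupportsAlterTable trait below are hypothetical names for this thread's proposal, not existing Spark classes. The source receives the DEFAULT clause and decides for itself, throwing if it has no notion of defaults.

    // Hypothetical sketch: Spark forwards the DDL's DEFAULT clause to the source and
    // lets the source decide; a source without default-value support simply rejects it.
    import org.apache.spark.sql.types.DataType

    case class AddColumn(name: String, dataType: DataType, defaultValue: Option[String])

    trait SupportsAlterTable {
      def alterTable(change: AddColumn): Unit
    }

    class NoDefaultsSource extends SupportsAlterTable {
      override def alterTable(change: AddColumn): Unit = change.defaultValue match {
        case Some(_) => throw new UnsupportedOperationException(
          s"This source does not support DEFAULT values (column '${change.name}')")
        case None    => // apply the plain ADD COLUMN via the source's own catalog/metadata
      }
    }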
>>>>>
>>>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN column_name datatype [DEFAULT value];
>>>>>>
>>>>>> If the DEFAULT statement is not specified, then the default value is null. If we are going to change the behavior and say the default value is decided by the underlying data source, we should use a new SQL syntax (I don't have a proposal in mind), instead of reusing the existing syntax, to be SQL compatible.
>>>>>>
>>>>>> Personally I don't like re-inventing wheels. It's better to just implement the SQL standard ADD COLUMN command, which means the default value is decided by the end-users.
>>>>>>
>>>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>> Wenchen, can you give more detail about the different ADD COLUMN syntax? That sounds confusing to end users to me.
>>>>>>>
>>>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Note that the design we make here will affect both data source developers and end-users. It's better to provide reliable behaviors to end-users, instead of asking them to read the spec of the data source and know which value will be used for missing columns when they write data.
>>>>>>>>
>>>>>>>> If we do want to go with the "data source decides default value" approach, we should create a new SQL syntax for ADD COLUMN, as its behavior is different from the SQL standard ADD COLUMN command.
>>>>>>>>
>>>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want 2 is for a unified way of dealing with missing columns? I feel like that probably should be left up to the underlying data source implementation. For example, if you have missing columns with a database, the data source can choose a value based on the database's metadata if such a thing exists. I don't think Spark should really have this level of detail, but I've also missed out on all of these meetings (sorry, it's family dinner time :) ), so I may be missing something.
>>>>>>>>>
>>>>>>>>> So my tl;dr is: let a data source report whether or not missing columns are OK, and let the data source deal with the missing data based on its underlying storage.
>>>>>>>>>
>>>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I agree that we should not rewrite existing Parquet files when a new column is added, but we should also try our best to make the behavior the same as the RDBMS/SQL standard.
>>>>>>>>>>
>>>>>>>>>> 1. It should be the user who decides the default value of a column, by CREATE TABLE, or ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER COLUMN.
>>>>>>>>>> 2. When adding a new column, the default value should be effective for all the existing data, and newly written data.
>>>>>>>>>> 3. When altering an existing column and changing the default value, it should be effective for newly written data only.
>>>>>>>>>>
>>>>>>>>>> A possible implementation:
>>>>>>>>>> 1. A column has 2 default values: the initial one and the latest one.
>>>>>>>>>> 2. When adding a column with a default value, set both the initial one and the latest one to this value. But do not update existing data.
>>>>>>>>>> 3. When reading data, fill the missing column with the initial default value.
>>>>>>>>>> 4. When writing data, fill the missing column with the latest default value.
>>>>>>>>>> 5. When altering a column to change its default value, only update the latest default value.
>>>>>>>>>>
>>>>>>>>>> This works because:
>>>>>>>>>> 1. New files will be written with the latest default value, so there is nothing we need to worry about at read time.
>>>>>>>>>> 2. Old files will be read with the initial default value, which returns the expected result.
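A minimal Scala sketch of the two-default-values bookkeeping described above; ColumnDefaults and TableMeta are hypothetical helper types, and rows are modeled as plain maps purely for illustration.

    // Sketch of the "initial vs. latest default" scheme: old files are filled with the
    // initial default at read time, new files with the latest default at write time,
    // and altering a column's default only moves the latest value.
    case class ColumnDefaults(initial: Any, latest: Any)

    case class TableMeta(defaults: Map[String, ColumnDefaults]) {

      // ALTER TABLE ADD COLUMN c DEFAULT v: both defaults start as v; existing data is untouched.
      def addColumn(name: String, default: Any): TableMeta =
        copy(defaults = defaults + (name -> ColumnDefaults(default, default)))

      // ALTER TABLE ALTER COLUMN c SET DEFAULT v: only the latest default changes.
      def alterDefault(name: String, newDefault: Any): TableMeta =
        copy(defaults = defaults.map {
          case (n, d) if n == name => n -> d.copy(latest = newDefault)
          case other               => other
        })

      // Read path: a row from an old file that lacks the column gets the initial default.
      def fillOnRead(row: Map[String, Any]): Map[String, Any] =
        defaults.foldLeft(row) { case (r, (n, d)) => if (r.contains(n)) r else r + (n -> d.initial) }

      // Write path: an incoming row that omits the column gets the latest default.
      def fillOnWrite(row: Map[String, Any]): Map[String, Any] =
        defaults.foldLeft(row) { case (r, (n, d)) => if (r.contains(n)) r else r + (n -> d.latest) }
    }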
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> This thread is a follow-up to a discussion that we started in the DSv2 community sync last week.
>>>>>>>>>>>
>>>>>>>>>>> The problem I'm trying to solve is that the format I'm using DSv2 to integrate supports schema evolution. Specifically, adding a new optional column so that rows without that column get a default value (null for Iceberg). The current validation rule for an append in DSv2 fails a write if it is missing a column, so adding a column to an existing table will cause currently-scheduled jobs that insert data to start failing. Clearly, schema evolution shouldn't break existing jobs that produce valid data.
>>>>>>>>>>>
>>>>>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark to check whether to fail when an optional column is missing. Other contributors in the sync thought that Spark should go with option 2: Spark's schema should have defaults and Spark should handle filling in defaults the same way across all sources, like other databases.
>>>>>>>>>>>
>>>>>>>>>>> I think we agree that option 2 would be ideal. The problem is that it is very hard to implement.
>>>>>>>>>>>
>>>>>>>>>>> A source might manage data stored in millions of immutable Parquet files, so adding a default value isn't possible. Spark would need to fill in defaults for files written before the column was added at read time (it could fill in defaults in new files at write time). Filling in defaults at read time would require Spark to fill in defaults for only some of the files in a scan, so Spark would need different handling for each task depending on the schema of that task. Tasks would also be required to produce a consistent schema, so a file without the new column couldn't be combined into a task with a file that has the new column. This adds quite a bit of complexity.
>>>>>>>>>>>
>>>>>>>>>>> Other sources may not need Spark to fill in the default at all. A JDBC source would be capable of filling in the default values itself, so Spark would need some way to communicate the default to that source. If the source had a different policy for default values (write time instead of read time, for example), then behavior could still be inconsistent.
>>>>>>>>>>>
>>>>>>>>>>> I think that this complexity probably isn't worth consistency in default values across sources, if that is even achievable.
>>>>>>>>>>>
>>>>>>>>>>> In the sync we thought it was a good idea to send this out to the larger group to discuss. Please reply with comments!
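A small Scala sketch of the read-time fill described above, using plain DataFrame operations; the paths, the "rating" column, and the default of 0 are made-up examples, assuming files written before the column was added simply lack it.

    // Sketch of read-time default filling for immutable Parquet files: scans of old
    // files get the default projected in before being combined with new files.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    object ReadTimeDefaults {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("read-time-defaults").master("local[*]").getOrCreate()

        // Hypothetical layout around ALTER TABLE ... ADD COLUMN rating INT DEFAULT 0
        val oldFiles = spark.read.parquet("/data/events/before_add_column") // no "rating" column
        val newFiles = spark.read.parquet("/data/events/after_add_column")  // has "rating" column

        // Fill the missing column with the initial default for old files only,
        // so both sides expose a consistent schema before being combined.
        val oldWithDefault = oldFiles.withColumn("rating", lit(0))

        val unified = oldWithDefault.unionByName(newFiles)
        unified.show()
        spark.stop()
      }
    }

Since unionByName resolves columns by name, the old-file scan only needs the default projected in before the two sides are combined.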
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix