Hello,
I agree that Spark should check whether the underlying data source
supports default values or not, and adjust its behavior accordingly.

If we follow this direction, do you see the default-values capability
as falling within the scope of the "DataSourceV2 capability API"?
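
For concreteness, here is a minimal sketch of what such a capability check
could look like on the Spark side. The SupportsDefaultValues trait and the
supportsDefaultValue() method below are hypothetical names for illustration
only, not part of the existing DataSourceV2 API:

import org.apache.spark.sql.types.DataType

// Hypothetical marker trait a source's Table implementation could mix in
// to declare that it accepts DEFAULT clauses in DDL.
trait SupportsDefaultValues {
  def supportsDefaultValue(dataType: DataType): Boolean
}

// Sketch of the check Spark could run before executing
// ALTER TABLE ... ADD COLUMN ... DEFAULT ... against a source.
def validateAddColumnDefault(table: AnyRef, column: String, dataType: DataType): Unit =
  table match {
    case t: SupportsDefaultValues if t.supportsDefaultValue(dataType) =>
      () // supported: pass the DEFAULT expression through to the source's DDL path
    case _ =>
      throw new UnsupportedOperationException(
        s"Data source does not support default values; cannot add column '$column' with DEFAULT")
  }

A source that does not expose the capability would then fail fast at DDL
time instead of silently diverging from its native clients.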

Best regards,
Alessandro

On Fri, 21 Dec 2018 at 03:31, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi Ryan,
>
> That's a good point. Since in this case Spark is just a channel to pass the
> user's action to the data source, we should think about which actions the
> data source supports.
>
> Following this direction, it makes more sense to delegate everything to data 
> sources.
>
> As a first step, maybe we should not add DDL commands to change the schema of
> a data source, but just use the capability API to let the data source decide
> what to do when the input schema doesn't match the table schema during a
> write. Users can use the native client of the data source to change the
> schema.
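
To make this concrete, here is a rough, hypothetical sketch of such a
write-time capability check (AcceptsMissingColumns and the surrounding names
are made up for illustration, not an existing Spark API):

import org.apache.spark.sql.types.StructType

// Hypothetical write-path capability: the source declares that it can
// handle an input schema that is missing some of the table's columns.
trait AcceptsMissingColumns

// Sketch of the validation Spark could apply before an append.
def validateWriteSchema(table: AnyRef, tableSchema: StructType, inputSchema: StructType): Unit = {
  val missing = tableSchema.fieldNames.toSet -- inputSchema.fieldNames.toSet
  if (missing.nonEmpty && !table.isInstanceOf[AcceptsMissingColumns]) {
    throw new IllegalArgumentException(
      s"Cannot write: input is missing columns ${missing.mkString(", ")} " +
        "and the data source does not accept missing columns")
  }
  // otherwise, let the source decide how to fill the missing columns
}
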
>
> On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>> I think it is good to know that not all sources support default values. That 
>> makes me think that we should delegate this behavior to the source and have 
>> a way for sources to signal that they accept default values in DDL (a 
>> capability) and assume that they do not in most cases.
>>
>> On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> 
>> wrote:
>>>
>>> I guess my question is: why is this a Spark-level behavior? Say the user has
>>> an underlying source with different behavior at the source level. In Spark
>>> they set a new default behavior and it's added to the catalog; is the source
>>> expected to propagate this? Or does the user have to be aware that their own
>>> source settings may differ depending on whether a client connects via Spark
>>> or via a native driver?
>>>
>>> For example, say I'm using C* (sorry, but obviously I'm always thinking
>>> about C*), and I add a new column to the database. When I connect to the
>>> database with a non-Spark application, I expect to be able to insert into
>>> the table as long as I satisfy the required columns. In Spark, someone sets
>>> the column as having a default value (there is no such feature in C*), and
>>> now, depending on how I connect to the source, I have two different
>>> behaviors. If I insert from the native app I get empty cells; if I insert
>>> from Spark I get a default value inserted. That sounds more confusing to an
>>> end user than having consistent behavior between native clients and Spark
>>> clients. This is why I asked whether the goal was just to have a common
>>> "Spark" behavior, because I don't think it makes sense if you consider
>>> multiple interaction points for a source.
>>>
>>> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> So you agree with my proposal that we should follow the RDBMS/SQL standard
>>>> regarding the behavior?
>>>>
>>>> > pass the default through to the underlying data source
>>>>
>>>> This is one way to implement the behavior.
>>>>
>>>> On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>> I don't think we have to change the syntax. Isn't the right thing (for 
>>>>> option 1) to pass the default through to the underlying data source? 
>>>>> Sources that don't support defaults would throw an exception.
>>>>>
>>>>> On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD COLUMN 
>>>>>> column_name datatype [DEFAULT value];
>>>>>>
>>>>>> If the DEFAULT clause is not specified, then the default value is null.
>>>>>> If we are going to change the behavior and say the default value is
>>>>>> decided by the underlying data source, we should use new SQL syntax (I
>>>>>> don't have a proposal in mind), instead of reusing the existing syntax,
>>>>>> in order to stay SQL-compatible.
>>>>>>
>>>>>> Personally I don't like reinventing the wheel. It's better to just
>>>>>> implement the SQL-standard ADD COLUMN command, which means the default
>>>>>> value is decided by the end users.
>>>>>>
>>>>>> On Thu, Dec 20, 2018 at 12:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>> Wenchen, can you give more detail about the different ADD COLUMN
>>>>>>> syntax? To me, that sounds confusing to end users.
>>>>>>>
>>>>>>> On Wed, Dec 19, 2018 at 7:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Note that the design we choose here will affect both data source
>>>>>>>> developers and end users. It's better to provide reliable behavior to
>>>>>>>> end users, instead of asking them to read the data source's spec to
>>>>>>>> figure out which value will be used for missing columns when they
>>>>>>>> write data.
>>>>>>>>
>>>>>>>> If we do want to go with the "data source decides default value" 
>>>>>>>> approach, we should create a new SQL syntax for ADD COLUMN, as its 
>>>>>>>> behavior is different from the SQL standard ADD COLUMN command.
>>>>>>>>
>>>>>>>> On Wed, Dec 19, 2018 at 10:58 PM Russell Spitzer 
>>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm not sure why 1) wouldn't be fine. I'm guessing the reason we want
>>>>>>>>> 2) is to have a unified way of dealing with missing columns? I feel
>>>>>>>>> like that should probably be left up to the underlying data source
>>>>>>>>> implementation. For example, if you have missing columns with a
>>>>>>>>> database, the data source can choose a value based on the database's
>>>>>>>>> metadata, if such a thing exists. I don't think Spark should really
>>>>>>>>> deal with this level of detail, but I've also missed out on all of
>>>>>>>>> these meetings (sorry, it's family dinner time :) ), so I may be
>>>>>>>>> missing something.
>>>>>>>>>
>>>>>>>>> So my tl;dr is: let a data source report whether or not missing
>>>>>>>>> columns are OK, and let the data source deal with the missing data
>>>>>>>>> based on its underlying storage.
>>>>>>>>>
>>>>>>>>> On Wed, Dec 19, 2018 at 8:23 AM Wenchen Fan <cloud0...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I agree that we should not rewrite existing Parquet files when a new
>>>>>>>>>> column is added, but we should also try our best to make the
>>>>>>>>>> behavior the same as the RDBMS/SQL standard.
>>>>>>>>>>
>>>>>>>>>> 1. It should be the user who decides the default value of a column,
>>>>>>>>>> via CREATE TABLE, ALTER TABLE ADD COLUMN, or ALTER TABLE ALTER
>>>>>>>>>> COLUMN.
>>>>>>>>>> 2. When adding a new column, the default value should be effective
>>>>>>>>>> for all existing data as well as newly written data.
>>>>>>>>>> 3. When altering an existing column and changing its default value,
>>>>>>>>>> the new default should be effective for newly written data only.
>>>>>>>>>>
>>>>>>>>>> A possible implementation:
>>>>>>>>>> 1. A column has 2 default values: the initial one and the latest
>>>>>>>>>> one.
>>>>>>>>>> 2. When adding a column with a default value, set both the initial
>>>>>>>>>> one and the latest one to this value, but do not update existing
>>>>>>>>>> data.
>>>>>>>>>> 3. When reading data, fill the missing column with the initial
>>>>>>>>>> default value.
>>>>>>>>>> 4. When writing data, fill the missing column with the latest
>>>>>>>>>> default value.
>>>>>>>>>> 5. When altering a column to change its default value, only update
>>>>>>>>>> the latest default value.
>>>>>>>>>>
>>>>>>>>>> This works because:
>>>>>>>>>> 1. New files will be written with the latest default value, so there
>>>>>>>>>> is nothing to worry about at read time.
>>>>>>>>>> 2. Old files will be read with the initial default value, which
>>>>>>>>>> returns the expected result.
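
As a rough illustration of the initial/latest bookkeeping described above
(all names below are hypothetical, just to make the read/write split
concrete):

// Hypothetical per-column metadata carrying both default values: the
// initial one, fixed when the column was added, and the latest one,
// updated by ALTER COLUMN ... SET DEFAULT.
case class ColumnDefaults(initial: Any, latest: Any)

// Reading: old files that miss the column are filled with the *initial*
// default, so existing data returns the value in effect when the column
// was added.
def fillOnRead(row: Map[String, Any], column: String, d: ColumnDefaults): Any =
  row.getOrElse(column, d.initial)

// Writing: rows that omit the column are persisted with the *latest*
// default, so a later change of the default only affects new data.
def fillOnWrite(row: Map[String, Any], column: String, d: ColumnDefaults): Any =
  row.getOrElse(column, d.latest)

// Altering the default only touches the latest value; the initial value
// (and therefore the interpretation of old files) is untouched.
def alterDefault(d: ColumnDefaults, newDefault: Any): ColumnDefaults =
  d.copy(latest = newDefault)
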
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 19, 2018 at 8:39 AM Ryan Blue 
>>>>>>>>>> <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> This thread is a follow-up to a discussion that we started in the 
>>>>>>>>>>> DSv2 community sync last week.
>>>>>>>>>>>
>>>>>>>>>>> The problem I’m trying to solve is that the format I’m using DSv2 
>>>>>>>>>>> to integrate supports schema evolution. Specifically, adding a new 
>>>>>>>>>>> optional column so that rows without that column get a default 
>>>>>>>>>>> value (null for Iceberg). The current validation rule for an append 
>>>>>>>>>>> in DSv2 fails a write if it is missing a column, so adding a column 
>>>>>>>>>>> to an existing table will cause currently-scheduled jobs that 
>>>>>>>>>>> insert data to start failing. Clearly, schema evolution shouldn't 
>>>>>>>>>>> break existing jobs that produce valid data.
>>>>>>>>>>>
>>>>>>>>>>> To fix this problem, I suggested option 1: adding a way for Spark 
>>>>>>>>>>> to check whether to fail when an optional column is missing. Other 
>>>>>>>>>>> contributors in the sync thought that Spark should go with option 
>>>>>>>>>>> 2: Spark’s schema should have defaults and Spark should handle 
>>>>>>>>>>> filling in defaults the same way across all sources, like other 
>>>>>>>>>>> databases.
>>>>>>>>>>>
>>>>>>>>>>> I think we agree that option 2 would be ideal. The problem is that 
>>>>>>>>>>> it is very hard to implement.
>>>>>>>>>>>
>>>>>>>>>>> A source might manage data stored in millions of immutable Parquet
>>>>>>>>>>> files, so adding a default value by rewriting existing data isn't
>>>>>>>>>>> possible. Spark would need to fill in defaults at read time for
>>>>>>>>>>> files written before the column was added (it could fill in
>>>>>>>>>>> defaults in new files at write time). Filling in defaults at read
>>>>>>>>>>> time would require Spark to fill in defaults for only some of the
>>>>>>>>>>> files in a scan, so Spark would need different handling for each
>>>>>>>>>>> task depending on the schema of that task. Tasks would also be
>>>>>>>>>>> required to produce a consistent schema, so a file without the new
>>>>>>>>>>> column couldn't be combined into a task with a file that has the
>>>>>>>>>>> new column. This adds quite a bit of complexity.
>>>>>>>>>>>
>>>>>>>>>>> Other sources may not need Spark to fill in the default at all. A 
>>>>>>>>>>> JDBC source would be capable of filling in the default values 
>>>>>>>>>>> itself, so Spark would need some way to communicate the default to 
>>>>>>>>>>> that source. If the source had a different policy for default 
>>>>>>>>>>> values (write time instead of read time, for example) then behavior 
>>>>>>>>>>> could still be inconsistent.
>>>>>>>>>>>
>>>>>>>>>>> I think that this complexity probably isn’t worth consistency in 
>>>>>>>>>>> default values across sources, if that is even achievable.
>>>>>>>>>>>
>>>>>>>>>>> In the sync we thought it was a good idea to send this out to the 
>>>>>>>>>>> larger group to discuss. Please reply with comments!
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
