[DISCUSS] Default values and data sources

Ryan Blue Tue, 18 Dec 2018 16:40:19 -0800

Hi everyone,

This thread is a follow-up to a discussion that we started in the DSv2
community sync last week.


The problem I’m trying to solve is that the format I’m using DSv2 to
integrate supports schema evolution. Specifically, adding a new optional
column so that rows without that column get a default value (null for
Iceberg). The current validation rule for an append in DSv2 fails a write
if it is missing a column, so adding a column to an existing table will
cause currently-scheduled jobs that insert data to start failing. Clearly,
schema evolution shouldn't break existing jobs that produce valid data.

To fix this problem, I suggested option 1: adding a way for Spark to check
whether to fail when an optional column is missing. Other contributors in
the sync thought that Spark should go with option 2: Spark’s schema should
have defaults and Spark should handle filling in defaults the same way
across all sources, like other databases.

I think we agree that option 2 would be ideal. The problem is that it is
very hard to implement.

A source might manage data stored in millions of immutable Parquet files,
so adding a default value isn’t possible. Spark would need to fill in
defaults for files written before the column was added at read time (it
could fill in defaults in new files at write time). Filling in defaults at
read time would require Spark to fill in defaults for only some of the
files in a scan, so Spark would need different handling for each task
depending on the schema of that task. Tasks would also be required to
produce a consistent schema, so a file without the new column couldn’t be
combined into a task with a file that has the new column. This adds quite a
bit of complexity.

Other sources may not need Spark to fill in the default at all. A JDBC
source would be capable of filling in the default values itself, so Spark
would need some way to communicate the default to that source. If the
source had a different policy for default values (write time instead of
read time, for example) then behavior could still be inconsistent.

I think that this complexity probably isn’t worth consistency in default
values across sources, if that is even achievable.

In the sync we thought it was a good idea to send this out to the larger
group to discuss. Please reply with comments!

rb
-- 
Ryan Blue
Software Engineer
Netflix

[DISCUSS] Default values and data sources

Reply via email to