"If there is no way to report a feature (e.g., able to read missing as
null) then there is no way for Spark to take advantage of it in the first
place"

Consider this (just a hypothetical scenario): in the future we add
"supports-decimal", because we see that a lot of data sources don't support
decimal and we want more graceful error handling. That would break all
existing data sources, because none of them would report the new capability.
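
To make the concern concrete, here's a rough sketch of the kind of check
that would break old sources (the capability name, querySchema, and the
check itself are all made up for illustration):

// Hypothetical future Spark code: reject writes with decimal columns up
// front unless the source reports the new "supports-decimal" capability.
val table = catalog.load(identifier)
val hasDecimal = querySchema.exists(_.dataType.isInstanceOf[DecimalType])
if (hasDecimal && !table.isSupported("supports-decimal")) {
  // Every source compiled before this capability existed lands here,
  // even if it handles decimals perfectly well.
  throw new AnalysisException("table does not support decimal columns")
}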

You can say we would never add any "existing" features to the feature list
in the future, as a requirement for the feature list. But then I'm
wondering how much it really gives you, beyond telling data sources to
throw exceptions when they don't support a specific operation.


On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue <rb...@netflix.com> wrote:

> Do you have an example in mind where we might add a capability and break
> old versions of data sources?
>
> These are really for being able to tell what features a data source has.
> If there is no way to report a feature (e.g., able to read missing as null)
> then there is no way for Spark to take advantage of it in the first place.
> For the uses I've proposed, forward compatibility isn't a concern. When we
> add a capability, we add handling for it that old versions wouldn't be able
> to use anyway. The advantage is that we don't have to treat all sources the
> same.
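>
> For example, here's a sketch of how I'd expect a new capability to be
> wired in (the validator names are made up): the check gates the new code
> path, and old sources simply keep today's behavior.
>
> // New, optional behavior is taken only when the source opts in; sources
> // that predate the capability fall through to the old default path.
> if (table.isSupported("read-missing-columns-as-null")) {
>   validateAppendAllowingMissingOptionalColumns(query, table)  // new path
> } else {
>   validateAppendWithExactSchema(query, table)                 // old default
> }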
>
> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin <r...@databricks.com> wrote:
>
>> How do we deal with forward compatibility? Consider: Spark adds a new
>> “property”. An existing data source actually supports that property, but
>> since it was never explicitly declared, the new version of Spark would
>> consider that data source as not supporting the property, and thus throw
>> an exception.
>>
>>
>> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I'd have two places. First, a class that defines properties supported
>>> and identified by Spark, like the SQLConf definitions. Second, in
>>> documentation for the v2 table API.
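>>>
>>> Roughly like this (a sketch of the definition class; the constants are
>>> just the capabilities discussed in this thread):
>>>
>>> object TableCapabilities {
>>>   // Each capability string is defined once here and documented,
>>>   // the same way SQLConf centralizes config keys.
>>>   val CONTINUOUS_STREAMING = "continuous-streaming"
>>>   val READ_MISSING_COLUMNS_AS_NULL = "read-missing-columns-as-null"
>>> }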
>>>
>>> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> One question is where will the list of capability strings be defined?
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Ryan Blue <rb...@netflix.com.invalid>
>>>> *Sent:* Thursday, November 8, 2018 2:09 PM
>>>> *To:* Reynold Xin
>>>> *Cc:* Spark Dev List
>>>> *Subject:* Re: DataSourceV2 capability API
>>>>
>>>>
>>>> Yes, we currently use traits that have methods. Something like
>>>> “supports reading missing columns” doesn’t need to carry any methods. The
>>>> other example is where we don’t have an object to test for a trait (
>>>> scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
>>>> done. Building that Scan could be expensive, so we can use a capability
>>>> to fail faster.
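>>>>
>>>> Roughly, the difference looks like this (the scan-builder call is a
>>>> sketch of the current v2 shape, and the capability string is made up):
>>>>
>>>> // Today: we only learn the answer after pushdown has produced a Scan.
>>>> val scan = table.newScanBuilder(options).build()
>>>> val supportsBatchViaTrait = scan.isInstanceOf[SupportsBatch]
>>>>
>>>> // Proposed: a cheap metadata check, with no pushdown work needed.
>>>> val supportsBatchViaCapability = table.isSupported("batch-read")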
>>>>
>>>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> This is currently accomplished by having traits that data sources can
>>>>> extend, as well as runtime exceptions, right? It's hard to argue one way
>>>>> vs. another without knowing how things will evolve (e.g., how many
>>>>> different capabilities there will be).
>>>>>
>>>>>
>>>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I’d like to propose an addition to DataSourceV2 tables, a capability
>>>>>> API. This API would allow Spark to query a table to determine whether it
>>>>>> supports a capability or not:
>>>>>>
>>>>>> val table = catalog.load(identifier)
>>>>>> val supportsContinuous = table.isSupported("continuous-streaming")
>>>>>>
>>>>>> There are a couple of use cases for this. First, we want to be able
>>>>>> to fail fast when a user tries to stream a table that doesn’t support it.
>>>>>> The design of our read implementation doesn’t necessarily support this.
>>>>>> If we want to share the same “scan” across streaming and batch, then we
>>>>>> need to “branch” in the API after that point, but that is at odds with
>>>>>> failing fast. We could use capabilities to fail fast and not worry about
>>>>>> that concern in the read design.
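>>>>>>
>>>>>> For example, the streaming path could do something like this before
>>>>>> any scan is constructed (just a sketch):
>>>>>>
>>>>>> // Fail fast at analysis time instead of after pushdown.
>>>>>> if (!table.isSupported("continuous-streaming")) {
>>>>>>   throw new AnalysisException("table does not support continuous streaming")
>>>>>> }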
>>>>>>
>>>>>> I also want to use capabilities to change the behavior of some
>>>>>> validation rules. The rule that validates appends, for example, doesn’t
>>>>>> allow a write that is missing an optional column. That’s because the
>>>>>> current v1 sources don’t support reading when columns are missing. But
>>>>>> Iceberg does support reading a missing column as nulls, so that users can
>>>>>> add a column to a table without breaking a scheduled job that populates
>>>>>> the table. To fix this problem, I would use a table capability, like
>>>>>> read-missing-columns-as-null.
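>>>>>>
>>>>>> As a sketch, the append rule could branch on the capability (the
>>>>>> failAnalysis helper is a stand-in, not the real rule code):
>>>>>>
>>>>>> // Columns in the table schema that the write query doesn't provide.
>>>>>> val missing = table.schema.filterNot(f => query.output.exists(_.name == f.name))
>>>>>> // Allow the write only if the table can read the missing columns as null.
>>>>>> val missingOk = missing.forall(_.nullable) &&
>>>>>>   table.isSupported("read-missing-columns-as-null")
>>>>>> if (missing.nonEmpty && !missingOk) {
>>>>>>   failAnalysis(s"Cannot write, missing columns: ${missing.map(_.name)}")
>>>>>> }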
>>>>>>
>>>>>> Any comments on this approach?
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
