I'd define them in two places. First, in a class that defines the properties Spark supports and recognizes, like the SQLConf definitions. Second, in the documentation for the v2 table API.
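To make the first point concrete, here is a minimal sketch of what such a central definitions class could look like, loosely modeled on how SQLConf collects its config entries in one place. All names here (TableCapabilities, the capability strings' constant names, isKnown) are hypothetical, not actual Spark API:

```scala
// Hypothetical central registry of the capability strings Spark recognizes,
// analogous to how SQLConf defines its configuration entries in one class.
object TableCapabilities {
  // Table supports continuous streaming reads.
  val CONTINUOUS_STREAMING = "continuous-streaming"

  // Table can read a column missing from stored data as nulls.
  val READ_MISSING_COLUMNS_AS_NULL = "read-missing-columns-as-null"

  // All capabilities Spark knows about, so unknown strings can be
  // caught early instead of silently returning false.
  val all: Set[String] = Set(CONTINUOUS_STREAMING, READ_MISSING_COLUMNS_AS_NULL)

  def isKnown(capability: String): Boolean = all.contains(capability)
}
```

Keeping the strings in one class avoids typos drifting between Spark's checks and source implementations, and gives the v2 table API documentation a single list to reference.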
On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

> One question is where will the list of capability strings be defined?
>
> ------------------------------
> *From:* Ryan Blue <rb...@netflix.com.invalid>
> *Sent:* Thursday, November 8, 2018 2:09 PM
> *To:* Reynold Xin
> *Cc:* Spark Dev List
> *Subject:* Re: DataSourceV2 capability API
>
> Yes, we currently use traits that have methods. Something like "supports
> reading missing columns" doesn't need to deliver methods. The other example
> is where we don't have an object to test for a trait
> (scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
> done. That could be expensive, so we can use a capability to fail faster.
>
> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com> wrote:
>
>> This is currently accomplished by having traits that data sources can
>> extend, as well as runtime exceptions, right? It's hard to argue one way
>> vs. another without knowing how things will evolve (e.g. how many
>> different capabilities there will be).
>>
>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to propose an addition to DataSourceV2 tables: a capability
>>> API. This API would allow Spark to query a table to determine whether it
>>> supports a capability or not:
>>>
>>> val table = catalog.load(identifier)
>>> val supportsContinuous = table.isSupported("continuous-streaming")
>>>
>>> There are a couple of use cases for this. First, we want to be able to
>>> fail fast when a user tries to stream a table that doesn't support it.
>>> The design of our read implementation doesn't necessarily support this.
>>> If we want to share the same "scan" across streaming and batch, then we
>>> need to "branch" in the API after that point, but that is at odds with
>>> failing fast. We could use capabilities to fail fast and not worry about
>>> that concern in the read design.
>>>
>>> I also want to use capabilities to change the behavior of some
>>> validation rules. The rule that validates appends, for example, doesn't
>>> allow a write that is missing an optional column. That's because the
>>> current v1 sources don't support reading when columns are missing. But
>>> Iceberg does support reading a missing column as nulls, so that users
>>> can add a column to a table without breaking a scheduled job that
>>> populates the table. To fix this problem, I would use a table
>>> capability, like read-missing-columns-as-null.
>>>
>>> Any comments on this approach?
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
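The fail-fast use case discussed in the thread can be sketched roughly as follows. This is an illustrative sketch only: the Table trait, isSupported method, and checkStreamingRead helper here are hypothetical stand-ins, not the actual DataSourceV2 interfaces.

```scala
// Hypothetical minimal table interface exposing a capability query.
trait Table {
  def isSupported(capability: String): Boolean
}

// Example table that supports batch reads but not continuous streaming.
class BatchOnlyTable extends Table {
  private val capabilities = Set("batch-read")
  override def isSupported(capability: String): Boolean =
    capabilities.contains(capability)
}

// Fail fast at analysis time, before any (possibly expensive) pushdown
// produces a Scan to test with scan.isInstanceOf[SupportsBatch].
def checkStreamingRead(table: Table): Unit = {
  if (!table.isSupported("continuous-streaming")) {
    throw new UnsupportedOperationException(
      "Table does not support continuous streaming reads")
  }
}
```

The point of the check is ordering: the capability is queried on the table itself, so an unsupported streaming query can be rejected before Spark commits to building a shared batch/streaming scan.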