Another solution to the decimal case is to use the capability API itself: a capability that signals the table knows about `supports-decimal`. So before checking decimal support, Spark would first check `table.isSupported("type-capabilities")`.
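As a rough sketch of that check (hypothetical names: the `Table` interface, `isSupported`, and the capability strings follow the examples in this thread, not a released Spark API):

```java
import java.util.Set;

// Hypothetical sketch: Table, isSupported, and the capability strings
// mirror the thread's examples, not a real Spark interface.
interface Table {
    Set<String> capabilities();

    default boolean isSupported(String capability) {
        return capabilities().contains(capability);
    }
}

class DecimalCheck {
    // A source written before the capability API reports no capabilities,
    // so the "type-capabilities" meta-capability tells Spark whether the
    // absence of "supports-decimal" is meaningful.
    static boolean supportsDecimal(Table table) {
        if (!table.isSupported("type-capabilities")) {
            // Legacy source: assume decimal support instead of breaking it.
            return true;
        }
        return table.isSupported("supports-decimal");
    }
}
```

This is one way to address the forward-compatibility concern below: only sources that opt into capability reporting are held to the new check.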
On Fri, Nov 9, 2018 at 12:45 PM Ryan Blue <rb...@netflix.com> wrote:

> For that case, I think we would have a property that defines whether
> supports-decimal is assumed or checked with the capability.
>
> Wouldn't we have this problem no matter what the capability API is? If we
> used a trait to signal decimal support, then we would have to deal with
> sources that were written before the trait was introduced. That doesn't
> change the need for some way to signal support for specific capabilities
> like the ones I've suggested.
>
> On Fri, Nov 9, 2018 at 12:38 PM Reynold Xin <r...@databricks.com> wrote:
>
>> "If there is no way to report a feature (e.g., able to read missing as
>> null) then there is no way for Spark to take advantage of it in the
>> first place"
>>
>> Consider this (just a hypothetical scenario): we add "supports-decimal"
>> in the future, because we see a lot of data sources don't support
>> decimal and we want more graceful error handling. That would break all
>> existing data sources.
>>
>> You can say we would never add any "existing" features to the feature
>> list in the future, as a requirement for the feature list. But then I'm
>> wondering how much it really gives you, beyond telling data sources to
>> throw exceptions when they don't support a specific operation.
>>
>> On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Do you have an example in mind where we might add a capability and
>>> break old versions of data sources?
>>>
>>> These are really for being able to tell what features a data source
>>> has. If there is no way to report a feature (e.g., able to read missing
>>> as null) then there is no way for Spark to take advantage of it in the
>>> first place. For the uses I've proposed, forward compatibility isn't a
>>> concern. When we add a capability, we add handling for it that old
>>> versions wouldn't be able to use anyway. The advantage is that we don't
>>> have to treat all sources the same.
>>>
>>> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> How do we deal with forward compatibility? Consider: Spark adds a new
>>>> "property". In the past the data source supported that property, but
>>>> since it was not explicitly defined, in the new version of Spark that
>>>> data source would be considered as not supporting that property, and
>>>> would thus throw an exception.
>>>>
>>>> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I'd have two places. First, a class that defines the properties
>>>>> supported and identified by Spark, like the SQLConf definitions.
>>>>> Second, the documentation for the v2 table API.
>>>>>
>>>>> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung
>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> One question is where will the list of capability strings be
>>>>>> defined?
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Ryan Blue <rb...@netflix.com.invalid>
>>>>>> *Sent:* Thursday, November 8, 2018 2:09 PM
>>>>>> *To:* Reynold Xin
>>>>>> *Cc:* Spark Dev List
>>>>>> *Subject:* Re: DataSourceV2 capability API
>>>>>>
>>>>>> Yes, we currently use traits that have methods. Something like
>>>>>> "supports reading missing columns" doesn't need to deliver methods.
>>>>>> The other example is where we don't have an object to test for a
>>>>>> trait (`scan.isInstanceOf[SupportsBatch]`) until we have a Scan with
>>>>>> pushdown done. That could be expensive, so we can use a capability
>>>>>> to fail faster.
>>>>>>
>>>>>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is currently accomplished by having traits that data sources
>>>>>>> can extend, as well as runtime exceptions, right? It's hard to
>>>>>>> argue one way vs. another without knowing how things will evolve
>>>>>>> (e.g. how many different capabilities there will be).
>>>>>>>
>>>>>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue
>>>>>>> <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'd like to propose an addition to DataSourceV2 tables: a
>>>>>>>> capability API. This API would allow Spark to query a table to
>>>>>>>> determine whether it supports a capability or not:
>>>>>>>>
>>>>>>>>     val table = catalog.load(identifier)
>>>>>>>>     val supportsContinuous = table.isSupported("continuous-streaming")
>>>>>>>>
>>>>>>>> There are a couple of use cases for this. First, we want to be
>>>>>>>> able to fail fast when a user tries to stream a table that doesn't
>>>>>>>> support it. The design of our read implementation doesn't
>>>>>>>> necessarily support this. If we want to share the same "scan"
>>>>>>>> across streaming and batch, then we need to "branch" in the API
>>>>>>>> after that point, but that is at odds with failing fast. We could
>>>>>>>> use capabilities to fail fast and not worry about that concern in
>>>>>>>> the read design.
>>>>>>>>
>>>>>>>> I also want to use capabilities to change the behavior of some
>>>>>>>> validation rules. The rule that validates appends, for example,
>>>>>>>> doesn't allow a write that is missing an optional column. That's
>>>>>>>> because the current v1 sources don't support reading when columns
>>>>>>>> are missing. But Iceberg does support reading a missing column as
>>>>>>>> nulls, so that users can add a column to a table without breaking
>>>>>>>> a scheduled job that populates the table. To fix this problem, I
>>>>>>>> would use a table capability, like `read-missing-columns-as-null`.
>>>>>>>>
>>>>>>>> Any comments on this approach?
>>>>>>>>
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
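For reference, the two use cases in the original proposal can be sketched as follows. This is a hypothetical sketch: the `Table` interface, `isSupported`, and the capability strings mirror the examples in the thread, not a released Spark API.

```java
import java.util.Set;

// Hypothetical sketch of the proposal's two use cases: failing fast for
// streaming, and relaxing append validation. Not a real Spark interface.
interface Table {
    Set<String> capabilities();

    default boolean isSupported(String capability) {
        return capabilities().contains(capability);
    }
}

class CapabilityChecks {
    // Use case 1: fail before any scan/pushdown work is done, instead of
    // discovering a missing trait later in planning.
    static void validateStreamingRead(Table table) {
        if (!table.isSupported("continuous-streaming")) {
            throw new UnsupportedOperationException(
                "Table does not support continuous streaming reads");
        }
    }

    // Use case 2: allow an append that omits an optional column only when
    // the table can read the missing column back as nulls (e.g. Iceberg).
    static boolean allowMissingColumnInAppend(Table table) {
        return table.isSupported("read-missing-columns-as-null");
    }
}
```

Either check only consults the table's declared capability strings, so it can run at analysis time, before a Scan is built.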