One question: where will the list of capability strings be defined?

________________________________
From: Ryan Blue <rb...@netflix.com.invalid>
Sent: Thursday, November 8, 2018 2:09 PM
To: Reynold Xin
Cc: Spark Dev List
Subject: Re: DataSourceV2 capability API


Yes, we currently use traits that have methods. But a capability like “supports 
reading missing columns” doesn’t need to deliver any methods. The other case is 
where we don’t have an object to test for a trait 
(scan.isInstanceOf[SupportsBatch]) until pushdown has produced a Scan. Building 
that Scan could be expensive, so a capability lets us fail faster.
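To make the contrast concrete, here is a minimal Scala sketch. The trait and
method names (SupportsBatch, buildScan, and the "batch-read" capability string)
mirror the discussion above but are illustrative stubs, not Spark's actual
DataSourceV2 interfaces:

```scala
// Illustrative stubs only -- not Spark's real API.
trait Scan
trait SupportsBatch extends Scan

trait Table {
  def isSupported(capability: String): Boolean
  def buildScan(): Scan // assume pushdown happens here, so it can be expensive
}

def canBatchReadViaTrait(table: Table): Boolean = {
  // Trait check: we must build the Scan (running pushdown) before we can test it.
  table.buildScan().isInstanceOf[SupportsBatch]
}

def canBatchReadViaCapability(table: Table): Boolean = {
  // Capability check: answered by the table directly, before any pushdown work.
  table.isSupported("batch-read")
}
```

The capability check lets Spark reject an unsupported query plan up front,
instead of paying for pushdown first and discovering the limitation afterward.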

On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin 
<r...@databricks.com<mailto:r...@databricks.com>> wrote:
This is currently accomplished by having traits that data sources can extend, 
as well as runtime exceptions, right? It's hard to argue one way vs. another 
without knowing how things will evolve (e.g., how many different capabilities 
there will be).


On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Hi everyone,

I’d like to propose an addition to DataSourceV2 tables, a capability API. This 
API would allow Spark to query a table to determine whether it supports a 
capability or not:

val table = catalog.load(identifier)
val supportsContinuous = table.isSupported("continuous-streaming")


There are a couple of use cases for this. First, we want to be able to fail 
fast when a user tries to stream a table that doesn’t support streaming. The 
design of our read implementation doesn’t necessarily support this: if we want 
to share the same “scan” across streaming and batch, then we need to “branch” 
in the API after that point, but branching that late is at odds with failing 
fast. Capabilities would let us fail fast without building that concern into 
the read design.

I also want to use capabilities to change the behavior of some validation 
rules. The rule that validates appends, for example, doesn’t allow a write that 
is missing an optional column. That’s because the current v1 sources don’t 
support reading when columns are missing. But Iceberg does support reading a 
missing column as nulls, so that users can add a column to a table without 
breaking a scheduled job that populates the table. To fix this problem, I would 
use a table capability, like read-missing-columns-as-null.
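As a sketch of how a validation rule might consult that capability: the 
read-missing-columns-as-null string comes from the proposal above, but the rule 
shape, types, and helper names here are hypothetical, not Spark's actual 
analyzer code:

```scala
// Hypothetical append-validation rule that consults a table capability.
case class Column(name: String, nullable: Boolean)

trait CapabilityTable {
  def schema: Seq[Column]
  def isSupported(capability: String): Boolean
}

def validateAppend(table: CapabilityTable,
                   writeColumns: Set[String]): Either[String, Unit] = {
  val missing = table.schema.filterNot(c => writeColumns.contains(c.name))
  if (missing.isEmpty) {
    Right(()) // the write covers every table column
  } else if (table.isSupported("read-missing-columns-as-null") &&
             missing.forall(_.nullable)) {
    Right(()) // the table reads the missing optional columns as nulls
  } else {
    Left(s"Cannot append: missing columns ${missing.map(_.name).mkString(", ")}")
  }
}
```

With a rule like this, adding an optional column to a table would no longer 
break a previously scheduled append job, because the source declares it can 
tolerate the gap.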

Any comments on this approach?

rb

--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix
