This is currently accomplished by having traits that data sources can
extend, plus runtime exceptions when an unsupported operation is attempted,
right? It's hard to argue one way vs. the other without knowing how things
will evolve (e.g., how many different capabilities there will be).
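
For comparison, here is a rough sketch of the trait-based style I mean.
The trait and method names are hypothetical, not the actual DataSourceV2
interfaces:

// Today: a capability is a marker trait the source extends, checked by
// pattern matching, with a runtime exception as the fallback.
trait Table
trait SupportsContinuousStreaming extends Table

def checkStreamable(table: Table): Unit = table match {
  case _: SupportsContinuousStreaming => () // ok, plan the streaming scan
  case _ => throw new UnsupportedOperationException(
    "Table does not support continuous streaming")
}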


On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi everyone,
>
> I’d like to propose an addition to DataSourceV2 tables, a capability API.
> This API would allow Spark to query a table to determine whether it
> supports a capability or not:
>
> val table = catalog.load(identifier)
> val supportsContinuous = table.isSupported("continuous-streaming")
>
> There are a couple of use cases for this. First, we want to be able to
> fail fast when a user tries to stream a table that doesn’t support it. The
> design of our read implementation doesn’t necessarily support this. If we
> want to share the same “scan” across streaming and batch, then we need to
> “branch” in the API after that point, but that is at odds with failing
> fast. We could use capabilities to fail fast and not worry about that
> concern in the read design.
>
> I also want to use capabilities to change the behavior of some validation
> rules. The rule that validates appends, for example, doesn’t allow a write
> that is missing an optional column. That’s because the current v1 sources
> don’t support reading when columns are missing. But Iceberg does support
> reading a missing column as nulls, so that users can add a column to a
> table without breaking a scheduled job that populates the table. To fix
> this problem, I would use a table capability, like
> read-missing-columns-as-null.
>
> Any comments on this approach?
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>
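
To make the fail-fast and validation cases above concrete, checks against
the proposed API might look roughly like this. The capability strings are
taken from the proposal; the surrounding code is only a sketch, not a final
API:

val table = catalog.load(identifier)

// Fail fast at analysis time, before any scan is planned or shared
// between batch and streaming.
if (!table.isSupported("continuous-streaming")) {
  sys.error(s"Table $identifier does not support continuous streaming")
}

// Relax the append-validation rule only for tables that can read a
// missing optional column as nulls.
val allowMissingColumns = table.isSupported("read-missing-columns-as-null")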
