I'd define them in two places. First, in a class that defines the properties Spark supports and recognizes, like the SQLConf definitions. Second, in the documentation for the v2 table API.
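To make the first point concrete, here is a minimal sketch of what such a central definitions class could look like, loosely modeled on how SQLConf collects its config entries in one place. All names here (TableCapabilities, the capability strings' constant names, isKnown) are hypothetical, not actual Spark API:

```scala
// Hypothetical central registry of the capability strings Spark recognizes,
// analogous to how SQLConf defines its configuration entries in one class.
object TableCapabilities {
  // Table supports continuous streaming reads.
  val CONTINUOUS_STREAMING = "continuous-streaming"

  // Table can read a column missing from stored data as nulls.
  val READ_MISSING_COLUMNS_AS_NULL = "read-missing-columns-as-null"

  // All capabilities Spark knows about, so unknown strings can be
  // caught early instead of silently returning false.
  val all: Set[String] = Set(CONTINUOUS_STREAMING, READ_MISSING_COLUMNS_AS_NULL)

  def isKnown(capability: String): Boolean = all.contains(capability)
}
```

Keeping the strings in one class avoids typos drifting between Spark's checks and source implementations, and gives the v2 table API documentation a single list to reference.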
On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung <felixcheun...@hotmail.com> wrote:

> One question is where will the list of capability strings be defined?
>
> ------------------------------
> *From:* Ryan Blue <rb...@netflix.com.invalid>
> *Sent:* Thursday, November 8, 2018 2:09 PM
> *To:* Reynold Xin
> *Cc:* Spark Dev List
> *Subject:* Re: DataSourceV2 capability API
>
> Yes, we currently use traits that have methods. Something like "supports
> reading missing columns" doesn't need to deliver methods. The other example
> is where we don't have an object to test for a trait
> (scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
> done. That could be expensive, so we can use a capability to fail faster.
>
> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com> wrote:
>
>> This is currently accomplished by having traits that data sources can
>> extend, as well as runtime exceptions, right? It's hard to argue one way
>> vs. another without knowing how things will evolve (e.g. how many
>> different capabilities there will be).
>>
>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to propose an addition to DataSourceV2 tables: a capability
>>> API. This API would allow Spark to query a table to determine whether it
>>> supports a capability or not:
>>>
>>> val table = catalog.load(identifier)
>>> val supportsContinuous = table.isSupported("continuous-streaming")
>>>
>>> There are a couple of use cases for this. First, we want to be able to
>>> fail fast when a user tries to stream a table that doesn't support it.
>>> The design of our read implementation doesn't necessarily support this.
>>> If we want to share the same "scan" across streaming and batch, then we
>>> need to "branch" in the API after that point, but that is at odds with
>>> failing fast. We could use capabilities to fail fast and not worry about
>>> that concern in the read design.
>>>
>>> I also want to use capabilities to change the behavior of some
>>> validation rules. The rule that validates appends, for example, doesn't
>>> allow a write that is missing an optional column. That's because the
>>> current v1 sources don't support reading when columns are missing. But
>>> Iceberg does support reading a missing column as nulls, so that users
>>> can add a column to a table without breaking a scheduled job that
>>> populates the table. To fix this problem, I would use a table
>>> capability, like read-missing-columns-as-null.
>>>
>>> Any comments on this approach?
>>>
>>> rb
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
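The fail-fast use case discussed in the thread can be sketched roughly as follows. This is an illustrative sketch only: the Table trait, isSupported method, and checkStreamingRead helper here are hypothetical stand-ins, not the actual DataSourceV2 interfaces.

```scala
// Hypothetical minimal table interface exposing a capability query.
trait Table {
  def isSupported(capability: String): Boolean
}

// Example table that supports batch reads but not continuous streaming.
class BatchOnlyTable extends Table {
  private val capabilities = Set("batch-read")
  override def isSupported(capability: String): Boolean =
    capabilities.contains(capability)
}

// Fail fast at analysis time, before any (possibly expensive) pushdown
// produces a Scan to test with scan.isInstanceOf[SupportsBatch].
def checkStreamingRead(table: Table): Unit = {
  if (!table.isSupported("continuous-streaming")) {
    throw new UnsupportedOperationException(
      "Table does not support continuous streaming reads")
  }
}
```

The point of the check is ordering: the capability is queried on the table itself, so an unsupported streaming query can be rejected before Spark commits to building a shared batch/streaming scan.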