"If there is no way to report a feature (e.g., able to read missing as null) then there is no way for Spark to take advantage of it in the first place"
Consider this (just a hypothetical scenario): we add "supports-decimal" in the future, because we see that a lot of data sources don't support decimal and we want more graceful error handling. That would break all existing data sources.

You could say that, as a requirement for the feature list, we would never add any "existing" features to it in the future. But then I wonder how much it really gives you, beyond telling data sources to throw exceptions when they don't support a specific operation.

On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue <rb...@netflix.com> wrote:

> Do you have an example in mind where we might add a capability and break
> old versions of data sources?
>
> These are really for being able to tell what features a data source has.
> If there is no way to report a feature (e.g., able to read missing as
> null), then there is no way for Spark to take advantage of it in the
> first place. For the uses I've proposed, forward compatibility isn't a
> concern. When we add a capability, we add handling for it that old
> versions wouldn't be able to use anyway. The advantage is that we don't
> have to treat all sources the same.
>
> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin <r...@databricks.com> wrote:
>
>> How do we deal with forward compatibility? Consider: Spark adds a new
>> "property". In the past the data source supported that property, but
>> since it was not explicitly defined, in the new version of Spark that
>> data source would be considered not to support that property, and would
>> thus throw an exception.
>>
>> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I'd have two places. First, a class that defines the properties
>>> supported and identified by Spark, like the SQLConf definitions.
>>> Second, the documentation for the v2 table API.
>>>
>>> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> One question: where will the list of capability strings be defined?
>>>>
>>>> ------------------------------
>>>> *From:* Ryan Blue <rb...@netflix.com.invalid>
>>>> *Sent:* Thursday, November 8, 2018 2:09 PM
>>>> *To:* Reynold Xin
>>>> *Cc:* Spark Dev List
>>>> *Subject:* Re: DataSourceV2 capability API
>>>>
>>>> Yes, we currently use traits that have methods. Something like
>>>> "supports reading missing columns" doesn't need to deliver methods.
>>>> The other example is where we don't have an object to test for a trait
>>>> (scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
>>>> done. That could be expensive, so we can use a capability to fail
>>>> faster.
>>>>
>>>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> This is currently accomplished by having traits that data sources can
>>>>> extend, as well as runtime exceptions, right? It's hard to argue one
>>>>> way vs. another without knowing how things will evolve (e.g., how
>>>>> many different capabilities there will be).
>>>>>
>>>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'd like to propose an addition to DataSourceV2 tables: a capability
>>>>>> API. This API would allow Spark to query a table to determine
>>>>>> whether it supports a capability or not:
>>>>>>
>>>>>> val table = catalog.load(identifier)
>>>>>> val supportsContinuous = table.isSupported("continuous-streaming")
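
(A minimal sketch of how such a check could serve the fail-faster case
mentioned above, instead of testing scan.isInstanceOf[SupportsBatch] after
pushdown. The Table trait below is hypothetical scaffolding for
illustration; only isSupported reflects the proposal, and none of this is
the actual DataSourceV2 interface.)

    // Hypothetical stand-in for the proposed table interface.
    trait Table {
      def name: String
      def isSupported(capability: String): Boolean
    }

    // Fail before any pushdown or Scan construction happens, rather than
    // discovering a missing trait on the Scan after it is built.
    def checkCapability(table: Table, capability: String): Unit = {
      if (!table.isSupported(capability)) {
        throw new UnsupportedOperationException(
          s"Table ${table.name} does not support $capability")
      }
    }

    // e.g. checkCapability(table, "continuous-streaming")
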
>>>>>> There are a couple of use cases for this. First, we want to be able
>>>>>> to fail fast when a user tries to stream a table that doesn't
>>>>>> support it. The design of our read implementation doesn't
>>>>>> necessarily support this. If we want to share the same "scan" across
>>>>>> streaming and batch, then we need to "branch" in the API after that
>>>>>> point, but that is at odds with failing fast. We could use
>>>>>> capabilities to fail fast and not worry about that concern in the
>>>>>> read design.
>>>>>>
>>>>>> I also want to use capabilities to change the behavior of some
>>>>>> validation rules. The rule that validates appends, for example,
>>>>>> doesn't allow a write that is missing an optional column. That's
>>>>>> because the current v1 sources don't support reading when columns
>>>>>> are missing. But Iceberg does support reading a missing column as
>>>>>> nulls, so that users can add a column to a table without breaking a
>>>>>> scheduled job that populates the table. To fix this problem, I would
>>>>>> use a table capability, like read-missing-columns-as-null.
>>>>>>
>>>>>> Any comments on this approach?
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
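
(Sketching the second use case concretely: a rough, hypothetical version of
an append-validation rule that consults the read-missing-columns-as-null
capability, reusing the illustrative Table trait sketched earlier in this
thread. This is not Spark's actual analyzer code; all names here are made up
for illustration.)

    case class Column(name: String, nullable: Boolean)

    // Allow an append that omits optional columns only when the table
    // reports that it can read missing columns as null.
    def validateAppend(
        table: Table,             // hypothetical trait sketched above
        tableSchema: Seq[Column],
        writeSchema: Seq[Column]): Unit = {
      val written = writeSchema.map(_.name).toSet
      val missing = tableSchema.filterNot(c => written.contains(c.name))
      val missingRequired = missing.filterNot(_.nullable)

      // A write may never omit a required (non-nullable) column.
      if (missingRequired.nonEmpty) {
        throw new IllegalArgumentException("Missing required columns: " +
          missingRequired.map(_.name).mkString(", "))
      }
      // Optional columns may be omitted only if the source can read them
      // back as null.
      if (missing.nonEmpty &&
          !table.isSupported("read-missing-columns-as-null")) {
        throw new IllegalArgumentException(
          "Table cannot read missing columns as null: " +
          missing.map(_.name).mkString(", "))
      }
    }

With a v1-style source the second check simply fails, preserving today's
behavior, while a source like Iceberg that reports the capability would
accept the write.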