Another solution to the decimal case is to use the capability API itself: a capability that signals the table knows about `supports-decimal`. So before checking decimal support, Spark would first check `table.isSupported("type-capabilities")`.
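As a rough sketch of that check (hypothetical names: the `Table` interface, `isSupported`, and the capability strings follow the examples in this thread, not a released Spark API):

```java
import java.util.Set;

// Hypothetical sketch: Table, isSupported, and the capability strings
// mirror the thread's examples, not a real Spark interface.
interface Table {
    Set<String> capabilities();

    default boolean isSupported(String capability) {
        return capabilities().contains(capability);
    }
}

class DecimalCheck {
    // A source written before the capability API reports no capabilities,
    // so the "type-capabilities" meta-capability tells Spark whether the
    // absence of "supports-decimal" is meaningful.
    static boolean supportsDecimal(Table table) {
        if (!table.isSupported("type-capabilities")) {
            // Legacy source: assume decimal support instead of breaking it.
            return true;
        }
        return table.isSupported("supports-decimal");
    }
}
```

This is one way to address the forward-compatibility concern below: only sources that opt into capability reporting are held to the new check.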
On Fri, Nov 9, 2018 at 12:45 PM Ryan Blue <rb...@netflix.com> wrote:

> For that case, I think we would have a property that defines whether
> supports-decimal is assumed or checked with the capability.
>
> Wouldn't we have this problem no matter what the capability API is? If we
> used a trait to signal decimal support, then we would have to deal with
> sources that were written before the trait was introduced. That doesn't
> change the need for some way to signal support for specific capabilities
> like the ones I've suggested.
>
> On Fri, Nov 9, 2018 at 12:38 PM Reynold Xin <r...@databricks.com> wrote:
>
>> "If there is no way to report a feature (e.g., able to read missing as
>> null) then there is no way for Spark to take advantage of it in the
>> first place"
>>
>> Consider this (just a hypothetical scenario): we add "supports-decimal"
>> in the future, because we see a lot of data sources don't support
>> decimal and we want more graceful error handling. That would break all
>> existing data sources.
>>
>> You can say we would never add any "existing" features to the feature
>> list in the future, as a requirement for the feature list. But then I'm
>> wondering how much it really gives you, beyond telling data sources to
>> throw exceptions when they don't support a specific operation.
>>
>> On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> Do you have an example in mind where we might add a capability and
>>> break old versions of data sources?
>>>
>>> These are really for being able to tell what features a data source
>>> has. If there is no way to report a feature (e.g., able to read missing
>>> as null) then there is no way for Spark to take advantage of it in the
>>> first place. For the uses I've proposed, forward compatibility isn't a
>>> concern. When we add a capability, we add handling for it that old
>>> versions wouldn't be able to use anyway. The advantage is that we don't
>>> have to treat all sources the same.
>>>
>>> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> How do we deal with forward compatibility? Consider: Spark adds a new
>>>> "property". In the past the data source supported that property, but
>>>> since it was not explicitly defined, in the new version of Spark that
>>>> data source would be considered as not supporting that property, and
>>>> would thus throw an exception.
>>>>
>>>> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> I'd have two places. First, a class that defines the properties
>>>>> supported and identified by Spark, like the SQLConf definitions.
>>>>> Second, the documentation for the v2 table API.
>>>>>
>>>>> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung
>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> One question is where will the list of capability strings be
>>>>>> defined?
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Ryan Blue <rb...@netflix.com.invalid>
>>>>>> *Sent:* Thursday, November 8, 2018 2:09 PM
>>>>>> *To:* Reynold Xin
>>>>>> *Cc:* Spark Dev List
>>>>>> *Subject:* Re: DataSourceV2 capability API
>>>>>>
>>>>>> Yes, we currently use traits that have methods. Something like
>>>>>> "supports reading missing columns" doesn't need to deliver methods.
>>>>>> The other example is where we don't have an object to test for a
>>>>>> trait (`scan.isInstanceOf[SupportsBatch]`) until we have a Scan with
>>>>>> pushdown done. That could be expensive, so we can use a capability
>>>>>> to fail faster.
>>>>>>
>>>>>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is currently accomplished by having traits that data sources
>>>>>>> can extend, as well as runtime exceptions, right? It's hard to
>>>>>>> argue one way vs. another without knowing how things will evolve
>>>>>>> (e.g. how many different capabilities there will be).
>>>>>>>
>>>>>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue
>>>>>>> <rb...@netflix.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I'd like to propose an addition to DataSourceV2 tables: a
>>>>>>>> capability API. This API would allow Spark to query a table to
>>>>>>>> determine whether it supports a capability or not:
>>>>>>>>
>>>>>>>>     val table = catalog.load(identifier)
>>>>>>>>     val supportsContinuous = table.isSupported("continuous-streaming")
>>>>>>>>
>>>>>>>> There are a couple of use cases for this. First, we want to be
>>>>>>>> able to fail fast when a user tries to stream a table that doesn't
>>>>>>>> support it. The design of our read implementation doesn't
>>>>>>>> necessarily support this. If we want to share the same "scan"
>>>>>>>> across streaming and batch, then we need to "branch" in the API
>>>>>>>> after that point, but that is at odds with failing fast. We could
>>>>>>>> use capabilities to fail fast and not worry about that concern in
>>>>>>>> the read design.
>>>>>>>>
>>>>>>>> I also want to use capabilities to change the behavior of some
>>>>>>>> validation rules. The rule that validates appends, for example,
>>>>>>>> doesn't allow a write that is missing an optional column. That's
>>>>>>>> because the current v1 sources don't support reading when columns
>>>>>>>> are missing. But Iceberg does support reading a missing column as
>>>>>>>> nulls, so that users can add a column to a table without breaking
>>>>>>>> a scheduled job that populates the table. To fix this problem, I
>>>>>>>> would use a table capability, like `read-missing-columns-as-null`.
>>>>>>>>
>>>>>>>> Any comments on this approach?
>>>>>>>>
>>>>>>>> rb
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix
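For reference, the two use cases in the original proposal can be sketched as follows. This is a hypothetical sketch: the `Table` interface, `isSupported`, and the capability strings mirror the examples in the thread, not a released Spark API.

```java
import java.util.Set;

// Hypothetical sketch of the proposal's two use cases: failing fast for
// streaming, and relaxing append validation. Not a real Spark interface.
interface Table {
    Set<String> capabilities();

    default boolean isSupported(String capability) {
        return capabilities().contains(capability);
    }
}

class CapabilityChecks {
    // Use case 1: fail before any scan/pushdown work is done, instead of
    // discovering a missing trait later in planning.
    static void validateStreamingRead(Table table) {
        if (!table.isSupported("continuous-streaming")) {
            throw new UnsupportedOperationException(
                "Table does not support continuous streaming reads");
        }
    }

    // Use case 2: allow an append that omits an optional column only when
    // the table can read the missing column back as nulls (e.g. Iceberg).
    static boolean allowMissingColumnInAppend(Table table) {
        return table.isSupported("read-missing-columns-as-null");
    }
}
```

Either check only consults the table's declared capability strings, so it can run at analysis time, before a Scan is built.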