Hi Hamel,

Sorry for the slow reply. Do you mind writing down the thoughts in a document, with API sketches? I think the devil is in the details of the API for this one.
If we can design an API that is type-safe, supports all languages, and can also remain stable, then it sounds like a great idea.

On Sat, Feb 27, 2016 at 10:12 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote:

> Thanks for the flags Reynold.
>
> 1. For the 4+ languages, these are just on the consumption side (i.e. you
> can't write a data source in Python or SQL), right? If this is correct, and
> you can only write data sources in the JVM languages, then that makes this
> story a lot easier. On the DataSource side we just require that the
> configuration object is JSON-deserializable.
>
> Then on the consumption side (i.e. from sqlContext.read):
> - From Java/Scala these objects can be passed through to the DataSource
> natively, since it's in the same JVM and people have access to the concrete
> parameter classes.
> - On the Python side this object can be passed over via JSON, which is
> deserialized and could be forced to generate explicit serialization
> failures when insufficient options are provided. The datasource provider
> could even (optionally) provide a Python object which performs validation
> on the Python side to make this easier for consumers.
> - In the SQL case, since these objects are JSON-serializable, we can
> alter the OPTIONS keyword to allow nested maps to create the JSON object.
>
> In all of these cases the proposed solution at worst degrades to
> something equivalent to the Map[String, String] (except that it has nesting
> support), but in the best case we have POJOs and optionally provided
> Python objects which facilitate this in a first-class fashion.
>
> 2. Yeah, agreed, this is a big problem, which is why I flagged it in the
> initial email. I'll put some more thought into how this can be done in a
> reasonable fashion (although any suggestions would be greatly
> appreciated).
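[Editor's note: a minimal sketch of the JSON-serializable configuration object described in point 1 above, shown from the Python consumption side. The class name, its fields, and the validation scheme are all hypothetical illustrations, not any real or proposed Spark API.]

```python
import json

# Hypothetical typed parameter object for an imagined "events" data source.
# The object can cross the Python -> JVM boundary as JSON, supports nested
# options (unlike Map[String, String]), and fails loudly on deserialization
# when required options are missing.
class EventSourceConfig:
    REQUIRED = ("path", "schema_version")

    def __init__(self, path, schema_version, partitions=None):
        self.path = path
        self.schema_version = schema_version
        self.partitions = partitions or {}  # nested options, not just strings

    def to_json(self):
        # Serialized form that could be handed to the JVM-side data source.
        return json.dumps({
            "path": self.path,
            "schema_version": self.schema_version,
            "partitions": self.partitions,
        })

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        # Generate an explicit failure when insufficient options are
        # provided, rather than letting the data source discover the gap.
        missing = [k for k in cls.REQUIRED if k not in data]
        if missing:
            raise ValueError("missing required options: " + ", ".join(missing))
        return cls(data["path"], data["schema_version"], data.get("partitions"))
```

Round-tripping through `to_json`/`from_json` preserves the nested options, and deserializing a payload without `schema_version` raises immediately, which is the "explicit serialization failure" behavior described above.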
> With the above answer to #1, and contingent on finding a solution to the
> API stability part of it, would you be supportive of a change to do this?
> If so, I'll submit a JIRA first and solicit/brainstorm some ideas on how
> to do #2 in a more sane way.
>
> On Fri, Feb 26, 2016 at 5:02 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Thanks for the email. This sounds great in theory, but might run into two
>> major problems:
>>
>> 1. Need to support 4+ programming languages (SQL, Python, Java, Scala)
>>
>> 2. API stability (both backward and forward)
>>
>> On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari <hamelkoth...@gmail.com>
>> wrote:
>>
>>> Hi devs,
>>>
>>> Has there been any discussion around changing the DataSource parameters
>>> argument to something more sophisticated than Map[String, String]? As
>>> you write more complex DataSources, there are likely to be a variety of
>>> parameters in varying formats, and having to coerce them all to strings
>>> becomes suboptimal pretty fast.
>>>
>>> Quite often I see this worked around by people specifying parameters
>>> which take in JSON strings and then parsing them into the parameter
>>> objects they actually need. Unfortunately, having people write JSON
>>> strings can be a really error-prone process, so to ensure compile-time
>>> safety people write convenience functions which take in actual POJOs as
>>> parameters, serialize them to JSON so they can be passed into the data
>>> source API, and then deserialize them in the constructors of their data
>>> sources. There's also no real story around discoverability of options
>>> with the current Map[String, String] setup, other than looking at the
>>> source code of the datasource and hoping that constants were specified
>>> somewhere.
>>>
>>> Rather than doing all of the above, we could adapt the DataSource API
>>> so that RelationProviders are parameterized on a parameter class which
>>> could be provided to the createRelation call.
>>> On the user's side, they could just create the appropriate
>>> configuration object and provide it to the DataFrameReader.parameters
>>> call, and it would be possible to guarantee that enough parameters were
>>> provided to construct a DataFrame in that case.
>>>
>>> The key challenge I see with this approach is that I'm not sure how to
>>> make the above changes in a backwards-compatible way that doesn't
>>> involve duplicating a bunch of methods.
>>>
>>> Do people have thoughts regarding this approach? I'm happy to file a
>>> JIRA and have the discussion there if it makes sense.
>>>
>>> Best,
>>> Hamel
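[Editor's note: a toy, language-agnostic analogue of the proposal quoted above. The actual proposal concerns Scala's RelationProvider; here the same idea, each provider declares the parameter class it expects, and the reader constructs and validates that object before the provider runs, is sketched in Python. Every name below (TypedProvider, CsvConfig, the read helper) is hypothetical.]

```python
# Instead of every provider receiving an untyped map of strings, each
# provider declares its parameter class, so malformed or missing options
# fail before any relation is constructed.
class TypedProvider:
    config_class = None  # subclasses declare their parameter class

    def create_relation(self, config):
        raise NotImplementedError

class CsvConfig:
    def __init__(self, path, delimiter=","):
        if not path:
            raise ValueError("path is required")
        self.path = path
        self.delimiter = delimiter

class CsvProvider(TypedProvider):
    config_class = CsvConfig

    def create_relation(self, config):
        # No string parsing here: the fields arrive already validated
        # and correctly typed.
        return "csv relation over %s (delimiter %r)" % (config.path, config.delimiter)

def read(provider, **options):
    # The reader builds the provider's declared parameter class up front;
    # unknown or missing options raise immediately, and the valid option
    # names are discoverable from the class itself rather than from
    # constants buried in the datasource's source code.
    config = provider.config_class(**options)
    return provider.create_relation(config)
```

With this shape, `read(CsvProvider(), path="/tmp/x.csv")` succeeds, while omitting `path` or passing a misspelled option name fails at the call site, which is the discoverability and safety benefit argued for in the thread.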