Hi Hamel,

Sorry for the slow reply. Do you mind writing down the thoughts in a document, with API sketches? I think the devil is in the details of the API for this one.
If we can design an API that is type-safe, supports all languages, and can also remain stable, then it sounds like a great idea.

On Sat, Feb 27, 2016 at 10:12 AM, Hamel Kothari <hamelkoth...@gmail.com> wrote:

> Thanks for the flags Reynold.
>
> 1. For the 4+ languages, these are just on the consumption side (i.e. you
> can't write a data source in Python or SQL), right? If this is correct, and
> you can only write data sources in the JVM languages, then that makes this
> story a lot easier. On the DataSource side we just require that the
> configuration object is JSON-deserializable.
>
> Then on the consumption side (i.e. from sqlContext.read):
> - From Java/Scala these objects can be passed through to the DataSource
> natively, since it's in the same JVM and people have access to the concrete
> parameter classes.
> - On the Python side this object can be passed over via JSON, which is
> deserialized and could be forced to generate explicit serialization
> failures when insufficient options are provided. The datasource provider
> could even (optionally) provide a Python object which performs validation
> on the Python side to make this easier for consumers.
> - In the SQL case, since these objects are JSON-serializable, we can
> alter the OPTIONS keyword to allow nested maps to create the JSON object.
>
> In all of these cases the proposed solution at worst degrades to
> something equivalent to the Map[String, String] (except that it has nesting
> support), but in the best case we have POJOs and optionally provided
> Python objects which facilitate this in a first-class fashion.
>
> 2. Yeah, agreed, this is a big problem, which is why I flagged it in the
> initial email. I'll put some more thought into how this can be done in a
> reasonable fashion (although any suggestions would be greatly
> appreciated).
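[Editor's note: a minimal sketch of the JSON-serializable configuration object described in point 1 above, shown from the Python consumption side. The class name, its fields, and the validation scheme are all hypothetical illustrations, not any real or proposed Spark API.]

```python
import json

# Hypothetical typed parameter object for an imagined "events" data source.
# The object can cross the Python -> JVM boundary as JSON, supports nested
# options (unlike Map[String, String]), and fails loudly on deserialization
# when required options are missing.
class EventSourceConfig:
    REQUIRED = ("path", "schema_version")

    def __init__(self, path, schema_version, partitions=None):
        self.path = path
        self.schema_version = schema_version
        self.partitions = partitions or {}  # nested options, not just strings

    def to_json(self):
        # Serialized form that could be handed to the JVM-side data source.
        return json.dumps({
            "path": self.path,
            "schema_version": self.schema_version,
            "partitions": self.partitions,
        })

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        # Generate an explicit failure when insufficient options are
        # provided, rather than letting the data source discover the gap.
        missing = [k for k in cls.REQUIRED if k not in data]
        if missing:
            raise ValueError("missing required options: " + ", ".join(missing))
        return cls(data["path"], data["schema_version"], data.get("partitions"))
```

Round-tripping through `to_json`/`from_json` preserves the nested options, and deserializing a payload without `schema_version` raises immediately, which is the "explicit serialization failure" behavior described above.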
> With the above answer to #1, and contingent on finding a solution to the
> API stability part of it, would you be supportive of a change to do this?
> If so, I'll submit a JIRA first and solicit/brainstorm some ideas on how
> to do #2 in a more sane way.
>
> On Fri, Feb 26, 2016 at 5:02 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Thanks for the email. This sounds great in theory, but might run into two
>> major problems:
>>
>> 1. Need to support 4+ programming languages (SQL, Python, Java, Scala)
>>
>> 2. API stability (both backward and forward)
>>
>> On Fri, Feb 26, 2016 at 8:44 AM, Hamel Kothari <hamelkoth...@gmail.com>
>> wrote:
>>
>>> Hi devs,
>>>
>>> Has there been any discussion around changing the DataSource parameters
>>> argument to something more sophisticated than Map[String, String]? As
>>> you write more complex DataSources, there are likely to be a variety of
>>> parameters in varying formats, and having to coerce them all to strings
>>> becomes suboptimal pretty fast.
>>>
>>> Quite often I see this worked around by people specifying parameters
>>> which take in JSON strings and then parsing them into the parameter
>>> objects they actually need. Unfortunately, having people write JSON
>>> strings can be a really error-prone process, so to ensure compile-time
>>> safety people write convenience functions which take in actual POJOs as
>>> parameters, serialize them to JSON so they can be passed into the data
>>> source API, and then deserialize them in the constructors of their data
>>> sources. There's also no real story around discoverability of options
>>> with the current Map[String, String] setup, other than looking at the
>>> source code of the datasource and hoping that constants were specified
>>> somewhere.
>>>
>>> Rather than doing all of the above, we could adapt the DataSource API
>>> so that RelationProviders are parameterized on a parameter class which
>>> could be provided to the createRelation call.
>>> On the user's side, they could just create the appropriate
>>> configuration object and provide it to the DataFrameReader.parameters
>>> call, and it would be possible to guarantee that enough parameters were
>>> provided to construct a DataFrame in that case.
>>>
>>> The key challenge I see with this approach is that I'm not sure how to
>>> make the above changes in a backwards-compatible way that doesn't
>>> involve duplicating a bunch of methods.
>>>
>>> Do people have thoughts regarding this approach? I'm happy to file a
>>> JIRA and have the discussion there if it makes sense.
>>>
>>> Best,
>>> Hamel
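[Editor's note: a toy, language-agnostic analogue of the proposal quoted above. The actual proposal concerns Scala's RelationProvider; here the same idea, each provider declares the parameter class it expects, and the reader constructs and validates that object before the provider runs, is sketched in Python. Every name below (TypedProvider, CsvConfig, the read helper) is hypothetical.]

```python
# Instead of every provider receiving an untyped map of strings, each
# provider declares its parameter class, so malformed or missing options
# fail before any relation is constructed.
class TypedProvider:
    config_class = None  # subclasses declare their parameter class

    def create_relation(self, config):
        raise NotImplementedError

class CsvConfig:
    def __init__(self, path, delimiter=","):
        if not path:
            raise ValueError("path is required")
        self.path = path
        self.delimiter = delimiter

class CsvProvider(TypedProvider):
    config_class = CsvConfig

    def create_relation(self, config):
        # No string parsing here: the fields arrive already validated
        # and correctly typed.
        return "csv relation over %s (delimiter %r)" % (config.path, config.delimiter)

def read(provider, **options):
    # The reader builds the provider's declared parameter class up front;
    # unknown or missing options raise immediately, and the valid option
    # names are discoverable from the class itself rather than from
    # constants buried in the datasource's source code.
    config = provider.config_class(**options)
    return provider.create_relation(config)
```

With this shape, `read(CsvProvider(), path="/tmp/x.csv")` succeeds, while omitting `path` or passing a misspelled option name fails at the call site, which is the discoverability and safety benefit argued for in the thread.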