Reynold,

One thing I'd like worked into the public portion of the API is the JSON schema-inference logic that creates a Set[(String, StructType)] out of a Map[String, Any]. SPARK-5260 addresses this so that I can use Accumulators to infer my schema instead of forcing a map/reduce phase on an RDD just to obtain the final schema. Do you (or anyone else) see a path forward for exposing this to users? A utility class, perhaps?
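Concretely, the utility I have in mind might look something like the sketch below. To be clear, the names and the merge rules here are hypothetical illustrations, not the actual internals behind SPARK-5260; the sketch assumes the data types in their new 1.3 home, org.apache.spark.sql.types:

```scala
// Hypothetical public schema-inference utility (illustrative names only).
import org.apache.spark.sql.types._

object SchemaInference {
  // Infer a StructType for a single parsed JSON record.
  def inferRecord(record: Map[String, Any]): StructType =
    StructType(record.toSeq.sortBy(_._1).map { case (name, value) =>
      StructField(name, inferType(value), nullable = true)
    })

  private def inferType(value: Any): DataType = value match {
    case null         => NullType
    case _: String    => StringType
    case _: Boolean   => BooleanType
    case _: Long      => LongType
    case _: Double    => DoubleType
    case m: Map[_, _] => inferRecord(m.asInstanceOf[Map[String, Any]])
    case _            => StringType // fall back for anything unrecognized
  }

  // Merge two inferred schemas field by field -- the "add" operation an
  // Accumulator-based approach would fold over all records.
  def merge(a: StructType, b: StructType): StructType = {
    val fromA = a.fields.map(f => f.name -> f.dataType).toMap
    val fromB = b.fields.map(f => f.name -> f.dataType).toMap
    val names = (fromA.keySet ++ fromB.keySet).toSeq.sorted
    StructType(names.map { n =>
      val t = (fromA.get(n), fromB.get(n)) match {
        case (Some(x), Some(y)) if x == y => x
        case (Some(x), None)              => x
        case (None, Some(y))              => y
        case _                            => StringType // conflict: widen
      }
      StructField(n, t, nullable = true)
    })
  }
}
```

With merge as the combine step and StructType(Nil) as the zero, this has exactly the shape an AccumulatorParam needs, which is the point of SPARK-5260.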
On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:

> Alex,
>
> I didn't communicate properly. By "private", I simply meant the expectation
> that it is not a public API. The plan is still to omit it from the
> scaladoc/javadoc generation, but no language visibility modifier will be
> applied.
>
> After 1.3, you will likely no longer need to use things in the sql.catalyst
> package directly. Programmatically constructing SchemaRDDs is going to be a
> first-class public API. The data types have already been moved out of the
> sql.catalyst package and now live in sql.types; they are becoming stable
> public APIs. When the "data frame" patch is submitted, you will also see a
> public expression library. There will be few reasons for end users or
> library developers to hook into things in sql.catalyst. The bravest and
> most advanced can still use them, with the expectation that they are
> subject to change.
>
>
> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> Reynold,
>>
>> Thanks for the heads up. In general, I strongly oppose the use of
>> "private" to restrict access to certain parts of the API, the reason being
>> that I might find the need to use some of the internals of a library from
>> my own project. I find that a @DeveloperAPI annotation serves the same
>> purpose as "private" without imposing unnecessary restrictions: it
>> discourages people from using the annotated API and reserves the right for
>> the core developers to change it suddenly in backwards-incompatible ways.
>>
>> In particular, I would like to express the desire that the APIs for
>> programmatically constructing SchemaRDDs from an RDD[Row] and a StructType
>> remain public. All the Spark SQL data type objects should be exposed by
>> the API, and the jekyll build should not hide the docs as it does now.
>>
>> Thanks,
>> Alex
>>
>> On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Hi Spark devs,
>>>
>>> Given the growing number of developers that are building on Spark SQL, we
>>> would like to stabilize the API in 1.3 so users and developers can be
>>> confident building on it. This also gives us a chance to improve the API.
>>>
>>> In particular, we are proposing the following major changes. These should
>>> have no impact for most users (i.e., those running SQL through the JDBC
>>> client or the SQLContext.sql method).
>>>
>>> 1. Everything in the sql.catalyst package is private to the project.
>>>
>>> 2. Redesign the SchemaRDD DSL (SPARK-5097): We initially added the DSL for
>>> SchemaRDD and logical plans in order to construct test cases. We have
>>> received feedback from a lot of users that the DSL can be incredibly
>>> powerful. In 1.3, we'd like to refactor the DSL to make it suitable not
>>> only for constructing test cases, but also for everyday data pipelines.
>>> The new SchemaRDD API is inspired by the data frame concept in Pandas
>>> and R.
>>>
>>> 3. Reconcile the Java and Scala APIs (SPARK-5193): We would like to expose
>>> one set of APIs that works for both Java and Scala. The current Java API
>>> (sql.api.java) does not share any common ancestor with the Scala API,
>>> which has led to a high maintenance burden for us as Spark developers and
>>> for library developers. We propose to eliminate the Java-specific API and
>>> simply rework the existing Scala API to make it also usable from Java.
>>> This will make Java a first-class citizen like Scala. It effectively
>>> means that all public classes should be usable from both Scala and Java,
>>> including SQLContext, HiveContext, SchemaRDD, the data types, and the
>>> aforementioned DSL.
>>>
>>> Again, this should have no impact on most users, since the existing DSL
>>> is rarely used by end users.
>>> However, library developers might need to change their import statements
>>> because we are moving certain classes around. We will keep you posted as
>>> patches are merged.
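P.S. For concreteness, here is the construction path I'd like to see remain public, sketched against the 1.2 API with the 1.3 import locations. Caveats: applySchema may well be renamed in the redesign, and the symbol-based DSL shown at the end is the 1.2 form that point 2 proposes to rework, so treat this as a sketch rather than a stable recipe:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
// 1.3 location; before 1.3 these lived under org.apache.spark.sql.catalyst:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sc = new SparkContext("local", "schema-demo")
val sqlContext = new SQLContext(sc)

// Build the schema programmatically rather than inferring it.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))

val rows = sc.parallelize(Seq(Row("alice", 34), Row("bob", 21)))

// The API I'd like to stay public: RDD[Row] + StructType => SchemaRDD.
val people = sqlContext.applySchema(rows, schema)

// The symbol-based DSL from point 2, in its 1.2 form:
import sqlContext._
val adults = people.where('age >= 30).select('name)
```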