Reynold,

One thing I'd like worked into the public portion of the API is the JSON schema-inference logic that creates a Set[(String, StructType)] out of a Map[String, Any]. SPARK-5260 addresses this so that I can use Accumulators to infer my schema instead of forcing a map/reduce phase on an RDD just to obtain the final schema. Do you (or anyone else) see a path forward for exposing this to users? A utility class, perhaps?
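Concretely, the utility I have in mind might look something like the sketch below. To be clear, the names and the merge rules here are hypothetical illustrations, not the actual internals behind SPARK-5260; the sketch assumes the data types in their new 1.3 home, org.apache.spark.sql.types:

```scala
// Hypothetical public schema-inference utility (illustrative names only).
import org.apache.spark.sql.types._

object SchemaInference {
  // Infer a StructType for a single parsed JSON record.
  def inferRecord(record: Map[String, Any]): StructType =
    StructType(record.toSeq.sortBy(_._1).map { case (name, value) =>
      StructField(name, inferType(value), nullable = true)
    })

  private def inferType(value: Any): DataType = value match {
    case null         => NullType
    case _: String    => StringType
    case _: Boolean   => BooleanType
    case _: Long      => LongType
    case _: Double    => DoubleType
    case m: Map[_, _] => inferRecord(m.asInstanceOf[Map[String, Any]])
    case _            => StringType // fall back for anything unrecognized
  }

  // Merge two inferred schemas field by field -- the "add" operation an
  // Accumulator-based approach would fold over all records.
  def merge(a: StructType, b: StructType): StructType = {
    val fromA = a.fields.map(f => f.name -> f.dataType).toMap
    val fromB = b.fields.map(f => f.name -> f.dataType).toMap
    val names = (fromA.keySet ++ fromB.keySet).toSeq.sorted
    StructType(names.map { n =>
      val t = (fromA.get(n), fromB.get(n)) match {
        case (Some(x), Some(y)) if x == y => x
        case (Some(x), None)              => x
        case (None, Some(y))              => y
        case _                            => StringType // conflict: widen
      }
      StructField(n, t, nullable = true)
    })
  }
}
```

With merge as the combine step and StructType(Nil) as the zero, this has exactly the shape an AccumulatorParam needs, which is the point of SPARK-5260.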
On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:

> Alex,
>
> I didn't communicate properly. By "private", I simply meant the expectation
> that it is not a public API. The plan is still to omit it from the
> scaladoc/javadoc generation, but no language visibility modifier will be
> applied.
>
> After 1.3, you will likely no longer need to use things in the sql.catalyst
> package directly. Programmatically constructing SchemaRDDs is going to be a
> first-class public API. The data types have already been moved out of the
> sql.catalyst package and now live in sql.types; they are becoming stable
> public APIs. When the "data frame" patch is submitted, you will also see a
> public expression library. There will be few reasons for end users or
> library developers to hook into things in sql.catalyst. The bravest and
> most advanced can still use them, with the expectation that they are
> subject to change.
>
>
> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> Reynold,
>>
>> Thanks for the heads up. In general, I strongly oppose the use of
>> "private" to restrict access to certain parts of the API, the reason being
>> that I might find the need to use some of the internals of a library from
>> my own project. I find that a @DeveloperAPI annotation serves the same
>> purpose as "private" without imposing unnecessary restrictions: it
>> discourages people from using the annotated API and reserves the right for
>> the core developers to change it suddenly in backwards-incompatible ways.
>>
>> In particular, I would like to express the desire that the APIs for
>> programmatically constructing SchemaRDDs from an RDD[Row] and a StructType
>> remain public. All the Spark SQL data type objects should be exposed by
>> the API, and the jekyll build should not hide the docs as it does now.
>>
>> Thanks,
>> Alex
>>
>> On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Hi Spark devs,
>>>
>>> Given the growing number of developers that are building on Spark SQL, we
>>> would like to stabilize the API in 1.3 so users and developers can be
>>> confident building on it. This also gives us a chance to improve the API.
>>>
>>> In particular, we are proposing the following major changes. These should
>>> have no impact for most users (i.e., those running SQL through the JDBC
>>> client or the SQLContext.sql method).
>>>
>>> 1. Everything in the sql.catalyst package is private to the project.
>>>
>>> 2. Redesign the SchemaRDD DSL (SPARK-5097): We initially added the DSL for
>>> SchemaRDD and logical plans in order to construct test cases. We have
>>> received feedback from a lot of users that the DSL can be incredibly
>>> powerful. In 1.3, we'd like to refactor the DSL to make it suitable not
>>> only for constructing test cases, but also for everyday data pipelines.
>>> The new SchemaRDD API is inspired by the data frame concept in Pandas
>>> and R.
>>>
>>> 3. Reconcile the Java and Scala APIs (SPARK-5193): We would like to expose
>>> one set of APIs that works for both Java and Scala. The current Java API
>>> (sql.api.java) does not share any common ancestor with the Scala API,
>>> which has led to a high maintenance burden for us as Spark developers and
>>> for library developers. We propose to eliminate the Java-specific API and
>>> simply rework the existing Scala API to make it also usable from Java.
>>> This will make Java a first-class citizen like Scala. It effectively
>>> means that all public classes should be usable from both Scala and Java,
>>> including SQLContext, HiveContext, SchemaRDD, the data types, and the
>>> aforementioned DSL.
>>>
>>> Again, this should have no impact on most users, since the existing DSL
>>> is rarely used by end users.
>>> However, library developers might need to change their import statements
>>> because we are moving certain classes around. We will keep you posted as
>>> patches are merged.
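P.S. For concreteness, here is the construction path I'd like to see remain public, sketched against the 1.2 API with the 1.3 import locations. Caveats: applySchema may well be renamed in the redesign, and the symbol-based DSL shown at the end is the 1.2 form that point 2 proposes to rework, so treat this as a sketch rather than a stable recipe:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
// 1.3 location; before 1.3 these lived under org.apache.spark.sql.catalyst:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sc = new SparkContext("local", "schema-demo")
val sqlContext = new SQLContext(sc)

// Build the schema programmatically rather than inferring it.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))

val rows = sc.parallelize(Seq(Row("alice", 34), Row("bob", 21)))

// The API I'd like to stay public: RDD[Row] + StructType => SchemaRDD.
val people = sqlContext.applySchema(rows, schema)

// The symbol-based DSL from point 2, in its 1.2 form:
import sqlContext._
val adults = people.where('age >= 30).select('name)
```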