Re: Spark SQL API changes and stabilization

2015-01-16 Thread Reynold Xin
That's a good idea. We didn't break the doc generation intentionally: it
fails for Catalyst because we use Scala macros, and we haven't had time to
investigate a fix yet.

If you have a minute to investigate and put together a fix, I can merge it
in as soon as possible.
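
For anyone picking this up, one plausible starting point is to filter only
the macro-heavy catalyst project out of unidoc rather than all of Spark SQL.
A minimal sketch against the sbt-unidoc plugin of this era; "catalyst" here
stands for whatever SparkBuild.scala names the sql/catalyst subproject:

// SparkBuild.scala sketch (sbt-unidoc 0.x). Assumption: `catalyst` is the
// Project reference for sql/catalyst.
import sbtunidoc.Plugin._
import sbtunidoc.Plugin.UnidocKeys._

// Build unified scaladoc for everything except the one project whose
// quasiquote-heavy sources scaladoc cannot compile.
unidocProjectFilter in (ScalaUnidoc, unidoc) :=
  inAnyProject -- inProjects(catalyst)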

On Fri, Jan 16, 2015 at 2:11 PM, Alessandro Baretta wrote:


Re: Spark SQL API changes and stabilization

2015-01-16 Thread Alessandro Baretta
Reynold,

Your clarification is much appreciated. One issue, though, that I would
strongly encourage you to work on is making sure that the Scaladoc CAN be
generated manually if needed (a "use at your own risk" clause would be
perfectly legitimate here). The reason I say this is that currently even
hacking SparkBuild.scala to include SparkSQL in the unidoc target doesn't
help: scaladoc itself fails with errors such as these:

[error] /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:359: polymorphic expression cannot be instantiated to expected type;
[error]  found   : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
[error]  required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
[error]   implicit def functionToUdfBuilder[T: TypeTag](func: Function22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
[error] ^
[error] /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:147: value q is not a member of StringContext
[error]  Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
[error] q"""
[error] ^
[error] /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:181: value q is not a member of StringContext
[error] q"""
[error] ^
[error] /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:198: value q is not a member of StringContext

While I understand your desire to discourage users from relying on the
internal "private" APIs, there is no reason to prevent people from gaining
a better understanding of how things work by allowing them, with some
effort, to get to the docs.
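
For context on the "value q is not a member of StringContext" failures: in
Scala 2.10, quasiquotes ship as a separate artifact and need the
macro-paradise compiler plugin, and scaladoc runs the same compiler front
end, so it needs both as well. A sketch of the sbt fragments involved, with
illustrative version numbers:

// Hypothetical build fragment for the quasiquote support catalyst relies
// on; without these on the doc classpath, q"..." fails exactly as above.
libraryDependencies += "org.scalamacros" %% "quasiquotes" % "2.0.1"
addCompilerPlugin("org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)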

Thanks,

Alex

On Thu, Jan 15, 2015 at 10:33 AM, Reynold Xin  wrote:


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
We can look into some sort of util class in sql.types for general type
inference. Many of the methods in JsonRDD might be useful enough to
extract. Those will probably be marked as DeveloperAPI, with weaker
stability guarantees.
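
To give a sense of the shape such a util class might take, here is a
hypothetical sketch of the kind of type-widening helper JsonRDD implements
internally. This is not the actual JsonRDD code; widen is an invented name,
and the import assumes the new sql.types location:

import org.apache.spark.sql.types._

// Hypothetical helper: pick the narrowest type that can represent values
// of both observed types, falling back to StringType on a conflict.
def widen(a: DataType, b: DataType): DataType = (a, b) match {
  case (x, y) if x == y => x
  case (NullType, x) => x
  case (x, NullType) => x
  case (IntegerType, LongType) | (LongType, IntegerType) => LongType
  case (LongType, DoubleType) | (DoubleType, LongType) => DoubleType
  case _ => StringType
}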

On Thu, Jan 15, 2015 at 12:16 PM, Corey Nolet  wrote:



Re: Spark SQL API changes and stabilization

2015-01-15 Thread Corey Nolet
Reynold,

One thing I'd like worked into the public portion of the API is the JSON
inference logic that creates a Set[(String, StructType)] out of a
Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators
to infer my schema instead of forcing a map/reduce phase over an RDD just
to get the final schema. Do you (or anyone else) see a path forward for
exposing this to users? A utility class, perhaps?
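
To make the idea concrete, here is a rough, self-contained sketch of
inference-via-accumulator with the stock 1.x accumulator API. Everything in
it (SchemaFields, inferType, the sample records) is invented for
illustration, and it collects flat (String, DataType) pairs rather than the
full Set[(String, StructType)] that SPARK-5260 deals with:

import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}
import org.apache.spark.sql.types._

object SchemaAccumulatorSketch {
  // Merges the field observations from every partition into one set,
  // which Spark delivers to the driver as a side effect of a normal pass.
  object SchemaFields extends AccumulatorParam[Set[(String, DataType)]] {
    def zero(initial: Set[(String, DataType)]) = Set.empty
    def addInPlace(a: Set[(String, DataType)], b: Set[(String, DataType)]) = a ++ b
  }

  // Toy per-value inference; real logic would recurse into nested maps.
  def inferType(v: Any): DataType = v match {
    case _: Int | _: Long => LongType
    case _: Double        => DoubleType
    case _: Boolean       => BooleanType
    case _                => StringType
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("schema-sketch").setMaster("local[2]"))
    val fields = sc.accumulator(Set.empty[(String, DataType)])(SchemaFields)

    val records = sc.parallelize(Seq(
      Map[String, Any]("id" -> 1, "name" -> "a"),
      Map[String, Any]("id" -> 2, "score" -> 0.5)))

    // Piggy-back schema inference on a pass the job is already making,
    // instead of running a dedicated reduce over the RDD.
    records.foreach { rec =>
      fields += rec.map { case (k, v) => (k, inferType(v)) }.toSet
    }

    val schema = StructType(fields.value.toSeq.sortBy(_._1).map {
      case (name, dt) => StructField(name, dt, nullable = true)
    })
    println(schema)
    sc.stop()
  }
}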

On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin  wrote:



Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
Alex,

I didn't communicate properly. By "private", I simply meant that it is not
expected to be a public API. The plan is still to omit it from the
scaladoc/javadoc generation, but no language-level visibility modifier will
be applied.

After 1.3, you will likely no longer need to use things in the sql.catalyst
package directly. Programmatically constructing SchemaRDDs is going to be a
first-class public API. Data types have already been moved out of the
sql.catalyst package and now live in sql.types; they are becoming stable
public APIs. When the "data frame" patch is submitted, you will also see a
public expression library. There will be few reasons for end users or
library developers to hook into things in sql.catalyst. The bravest and
most advanced can still use them, with the expectation that everything
there is subject to change.
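
For a concrete reference point, programmatic construction already works
along these lines today via applySchema. A minimal, self-contained sketch,
assuming the post-move sql.types import path described above; the table and
column names are invented:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ApplySchemaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("apply-schema-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // No case classes anywhere: both the rows and the schema are built at
    // runtime, which is what makes fully dynamic pipelines possible.
    val rows = sc.parallelize(Seq(Row(1L, "alpha"), Row(2L, "beta")))
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // Pair the RDD[Row] with the StructType to get a queryable SchemaRDD.
    val items = sqlContext.applySchema(rows, schema)
    items.registerTempTable("items")
    sqlContext.sql("SELECT name FROM items WHERE id = 1").collect().foreach(println)

    sc.stop()
  }
}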

On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta wrote:



Re: Spark SQL API changes and stabilization

2015-01-15 Thread Alessandro Baretta
Reynold,

Thanks for the heads up. In general, I strongly oppose the use of "private"
to restrict access to certain parts of the API, the reason being that I
might need to use some of the internals of a library from my own project. I
find that a @DeveloperAPI annotation serves the same purpose as "private"
without imposing unnecessary restrictions: it discourages people from using
the annotated API while reserving the right for the core developers to
change it suddenly in backwards-incompatible ways.
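
For reference, Spark already ships such an annotation, spelled
DeveloperApi. A sketch of the convention, with an invented class name:

import org.apache.spark.annotation.DeveloperApi

/**
 * :: DeveloperApi ::
 * Kept public for advanced users, but may change or be removed between
 * minor releases without a deprecation cycle.
 */
@DeveloperApi
class LowLevelHook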

In particular, I would like to express the desire that the APIs to
programmatically construct SchemaRDDs from an RDD[Row] and a StructType
remain public. All the SparkSQL data type objects should be exposed by the
API, and the Jekyll build should not hide the docs as it does now.

Thanks.

Alex

On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin  wrote:

> Hi Spark devs,
>
> Given the growing number of developers building on Spark SQL, we would
> like to stabilize the API in 1.3 so users and developers can build on it
> with confidence. This also gives us a chance to improve the API.
>
> In particular, we are proposing the following major changes. This should
> have no impact for most users (i.e. those running SQL through the JDBC
> client or SQLContext.sql method).
>
> 1. Everything in sql.catalyst package is private to the project.
>
> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
> SchemaRDD and logical plans in order to construct test cases. We have
> received feedback from a lot of users that the DSL can be incredibly
> powerful. In 1.3, we’d like to refactor the DSL to make it suitable not
> only for constructing test cases but also for everyday data pipelines. The
> new SchemaRDD API is inspired by the data frame concept in Pandas and R.
>
> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose one
> set of APIs that will work for both Java and Scala. The current Java API
> (sql.api.java) does not share any common ancestor with the Scala API. This
> has led to a high maintenance burden for us as Spark developers and for
> library developers. We propose to eliminate the Java-specific API and
> simply work on the existing Scala API to make it also usable from Java.
> This will make Java a first-class citizen alongside Scala. This
> effectively means that all public classes should be usable from both Scala
> and Java, including SQLContext, HiveContext, SchemaRDD, data types, and
> the aforementioned DSL.
>
>
> Again, this should have no impact on most users, since the existing DSL is
> rarely used by end users. However, library developers might need to change
> their import statements because we are moving certain classes around. We
> will keep you posted as patches are merged.
>
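
To ground item 2 of the quoted proposal: the 1.2-era DSL slated for
redesign already supports language-integrated queries along these lines. A
sketch, assuming a SchemaRDD named people with age and name columns and the
implicit conversions brought in by import sqlContext._:

// 1.2-style SchemaRDD DSL (the API SPARK-5097 proposes to redesign).
// Equivalent to: SELECT name FROM people WHERE age >= 13 AND age <= 19
val teenagers = people.where('age >= 13).where('age <= 19).select('name)
teenagers.collect().foreach(println)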