[ https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerard Maas updated SPARK-24202:
--------------------------------
    Description: 
The current implementation of the implicits in SparkSession passes the currently 
active SQLContext to the SQLImplicits class. This means that any use of these 
(extremely helpful) implicits requires the prior creation of a SparkSession 
instance.

Usage is typically done as follows:

 
{code:java}
val sparkSession = SparkSession.builder()
  ...
  .getOrCreate()
import sparkSession.implicits._
{code}
 

This is fine in user code, but it burdens the creation of library code that uses 
Spark, where static imports for _Encoder_ support are required.

A simple example would be:

 
{code:java}
abstract class SparkTransformation[In: Encoder, Out: Encoder] {
  def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
 

Attempting to compile code that instantiates such a class, without a 
SparkSession's implicits in scope, fails with the following error:

{noformat}
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
{noformat}
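
Today, library users must work around this by creating the session before the 
transformation is instantiated. A minimal sketch of that workaround (the case 
classes and names here are illustrative):

{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

case class Click(userId: Long, url: String)
case class Visit(userId: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._  // instance-bound: brings Encoder[Click] and Encoder[Visit] into scope

// Only now does the library class from above instantiate cleanly.
val toVisits = new SparkTransformation[Click, Visit] {
  def transform(ds: Dataset[Click]): Dataset[Visit] =
    ds.map(c => Visit(c.userId))
}
{code}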

The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
utilities that turn an _RDD_ or a local collection into a _Dataset_.

These are 2 of the 46 implicit conversions offered by this class.
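
Notably, the encoder machinery itself is already session-free; for example, the 
public _Encoders_ factory derives a Product encoder without any SQLContext:

{code:java}
import org.apache.spark.sql.{Encoder, Encoders}

case class Point(x: Double, y: Double)

// No SQLContext or SparkSession is needed to derive an encoder.
val pointEncoder: Encoder[Point] = Encoders.product[Point]
{code}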

The request is to move the two implicit methods that depend on the instance into 
a separate class:
{code:java}
// SQLImplicits, lines 214-229
/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}
{code}
By separating the static methods from these two _sqlContext_-dependent methods 
into separate classes, we could provide static imports for all the other 
functionality and require the instance-bound implicits only for the RDD and 
local-collection support (which is an uncommon use case these days).
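
A minimal sketch of what such a split could look like. The class names here are 
hypothetical, not from the Spark codebase; like SQLImplicits itself, this would 
have to live in org.apache.spark.sql, since DatasetHolder's constructor is 
private[sql]:

{code:java}
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DatasetHolder, Encoder, Encoders, SQLContext}

// Hypothetical static part: encoder implicits that need no SQLContext.
// Library code could `import StaticSQLImplicits._` with no session at all.
abstract class StaticSQLImplicits {
  implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
    Encoders.product[T]
  implicit def newIntEncoder: Encoder[Int] = Encoders.scalaInt
  // ... the remaining session-free implicit conversions ...
}

object StaticSQLImplicits extends StaticSQLImplicits

// Hypothetical instance-bound part: only the two conversions that
// actually need a SQLContext stay tied to the session.
class SessionImplicits(_sqlContext: SQLContext) extends StaticSQLImplicits {
  implicit def rddToDatasetHolder[T: Encoder](rdd: RDD[T]): DatasetHolder[T] =
    DatasetHolder(_sqlContext.createDataset(rdd))

  implicit def localSeqToDatasetHolder[T: Encoder](s: Seq[T]): DatasetHolder[T] =
    DatasetHolder(_sqlContext.createDataset(s))
}
{code}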

As this potentially breaks the current interface, it might be a candidate for 
Spark 3.0, although nothing stops us from creating a separate hierarchy for the 
static encoders already.
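
With such a static object in place, the library example from above would work 
with no session at definition or instantiation time (again using the 
hypothetical name):

{code:java}
import org.apache.spark.sql.Dataset
import StaticSQLImplicits._  // static import; no SparkSession required

case class Click(userId: Long, url: String)
case class Visit(userId: Long)

// The encoders for Click and Visit resolve from the static import alone.
val toVisits = new SparkTransformation[Click, Visit] {
  def transform(ds: Dataset[Click]): Dataset[Visit] =
    ds.map(c => Visit(c.userId))
}
{code}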

> Separate SQLContext dependencies from SparkSession.implicits
> ------------------------------------------------------------
>
>                 Key: SPARK-24202
>                 URL: https://issues.apache.org/jira/browse/SPARK-24202
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Gerard Maas
>            Priority: Major
>


