[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468833#comment-15468833 ]

Jakob Odersky edited comment on SPARK-17368 at 9/6/16 10:57 PM:
----------------------------------------------------------------

So I thought about this a bit more, and although it is possible to support 
value classes, I currently see two main issues that make it cumbersome:

1. Catalyst (the engine behind Datasets) generates and compiles code at 
runtime that represents the actual computation. Since this generated code is 
Java, and value classes have no runtime representation (they are erased to 
their underlying type, which is why the report below complains "Couldn't find 
v on int"), supporting them will require changes to the implementation of 
Encoders (see my experimental branch 
[here|https://github.com/apache/spark/compare/master...jodersky:value-classes]).

2. The bigger of the two problems is how encoders for value classes would be 
made accessible. Currently, encoders are exposed as type classes, and there is 
unfortunately no way to create a type class instance only for classes 
extending AnyVal (you could create an encoder for all AnyVals, however that 
would also apply to every primitive type and you would get implicit resolution 
conflicts; see the sketch after this list). Requiring explicit encoders for 
value classes may work, however you would still have no compile-time safety: 
accessing a value class' inner val occurs at runtime and may hence fail if it 
is not encodable.
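
To illustrate the overlap in (2), here is a minimal sketch; the blanket 
{{anyValEncoder}} is hypothetical, not an actual Spark API:

{noformat}
object AnyValOverlap {
  case class FeatureId(v: Int) extends AnyVal

  // Hypothetical blanket instance -- NOT part of Spark's API:
  //   implicit def anyValEncoder[T <: AnyVal]: Encoder[T] = ...
  // The bound T <: AnyVal cannot single out user-defined value classes,
  // because every primitive type satisfies it as well:
  implicitly[FeatureId <:< AnyVal] // compiles
  implicitly[Int <:< AnyVal]       // compiles too
  // so anyValEncoder[Int] would compete with the Int encoder that
  // spark.implicits._ already provides.
}
{noformat}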

The cleanest solution would be to use metaprogramming: it would guarantee 
"encodability" at compile time and could easily complement the current API. 
Unfortunately, I don't think it could be included in Spark in the near future, 
as the current metaprogramming solutions in Scala are either too new 
(scala.meta) or on their way to being deprecated (the current experimental 
Scala macros). (I have been wanting to experiment with meta encoders for a 
while though, so maybe I'll try putting together an external library for that.)
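
To be concrete about what such meta encoders could look like at the call 
site, here is a purely illustrative sketch; {{deriveEncoder}} is hypothetical 
and exists in no library today:

{noformat}
import org.apache.spark.sql.{Dataset, Encoder}

// Purely hypothetical: a compile-time derivation that inspects FeatureId,
// fails compilation if any member is not encodable, and otherwise
// generates the unwrapping/re-wrapping code:
implicit val featureIdEncoder: Encoder[FeatureId] = deriveEncoder[FeatureId]

val ds: Dataset[FeatureId] = spark.createDataset(Seq(FeatureId(1), FeatureId(2)))
{noformat}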

In the meantime, how inconvenient would it be to extract the wrapped value 
before creating a dataset and to re-wrap your final results?
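
Concretely, I mean something like the following, assuming the {{FeatureId}} 
class and the {{spark}} session from the report below, with 
{{spark.implicits._}} in scope:

{noformat}
// Unwrap to the underlying Int at the boundary, keep the Dataset typed on
// the primitive, and re-wrap only when the results come back:
val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
val ds: Dataset[Int] = spark.createDataset(seq.map(_.v))
val doubled: Seq[FeatureId] = ds.map(_ * 2).collect().toSeq.map(FeatureId(_))
{noformat}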



> Scala value classes create encoder problems and break at runtime
> ----------------------------------------------------------------
>
>                 Key: SPARK-17368
>                 URL: https://issues.apache.org/jira/browse/SPARK-17368
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>            Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.x.
> This simple Spark 2 application demonstrates that the code compiles but 
> breaks at runtime with the error below. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>    +- assertnotnull(input[0, int, true], top level non-flat input object)
>       +- input[0, int, true]".
>         at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
>         at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
>         at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
>     val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
>     val spark = SparkSession.builder.getOrCreate()
>     import spark.implicits._
>     spark.sparkContext.setLogLevel("warn")
>     val ds: Dataset[FeatureId] = spark.createDataset(seq)
>     println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}


