[ https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468833#comment-15468833 ]
Jakob Odersky edited comment on SPARK-17368 at 9/6/16 10:57 PM:
----------------------------------------------------------------

So I thought about this a bit more, and although it is possible to support value classes, I currently see two main issues that make it cumbersome:

1. Catalyst (the engine behind Datasets) generates and compiles code at runtime that represents the actual computation. Because this generated code is Java, and value classes have no runtime representation, supporting them would require changes to the implementation of Encoders (see my experimental branch [here|https://github.com/apache/spark/compare/master...jodersky:value-classes]).

2. The larger problem is how encoders for value classes would be made accessible. Currently, encoders are exposed as type classes, and there is unfortunately no way to create type-class instances for classes extending AnyVal (you could create an encoder for AnyVals, but it would also apply to every primitive type and you would get implicit resolution conflicts). Requiring explicit encoders for value classes might work, but you would still have no compile-time safety: accessing a value class's inner val happens at runtime and may therefore fail if the value is not encodable.

The cleanest solution would be to use metaprogramming: it would guarantee "encodability" at compile time and could easily complement the current API. Unfortunately, I don't think it could be included in Spark in the near future, as the current metaprogramming solutions in Scala are either too new (scala.meta) or on their way to deprecation (the current experimental Scala macros). (I have been wanting to experiment with meta encoders for a while, though, so maybe I'll try putting together an external library for that.)

How inconvenient is it to extract the wrapped value before creating a dataset and re-wrap your final results?
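As a minimal sketch of that unwrap/re-wrap pattern (pure Scala, using a plain Seq to stand in for a Dataset so it runs without Spark; with a real SparkSession you would call spark.createDataset on the unwrapped Ints, which already have a built-in encoder):

```scala
// Mirrors the value class from the bug report.
case class FeatureId(v: Int) extends AnyVal

val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))

// Unwrap before handing the data to Spark: Int is encodable,
// so spark.createDataset(raw) works where createDataset(seq) fails.
val raw: Seq[Int] = seq.map(_.v)

// Stand-in for the Dataset computation (doubling, as a placeholder).
val computed: Seq[Int] = raw.map(_ * 2)

// Re-wrap the final results into the value class.
val results: Seq[FeatureId] = computed.map(FeatureId.apply)
```

The boilerplate is two extra `map` calls at the boundaries; the value class stays in all the surrounding application code.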
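Point 1 ("value classes don't have runtime representations") can be observed with plain JVM reflection, no Spark needed; `Demo` and `twice` are made-up names for illustration. A method that takes the value class erases to a method taking the bare underlying type, which is why runtime-generated code sees only an int (compare the reporter's "Couldn't find v on int" error):

```scala
object Demo {
  case class FeatureId(v: Int) extends AnyVal
  // Declared to take FeatureId...
  def twice(f: FeatureId): FeatureId = FeatureId(f.v * 2)
}

// ...but at runtime the parameter type is the primitive int:
val param = Demo.getClass
  .getMethods
  .find(_.getName == "twice")
  .get
  .getParameterTypes
  .head
```

`param` here is `classOf[Int]`, not a `FeatureId` class: the wrapper exists only at compile time.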
> Scala value classes create encoder problems and break at runtime
> ----------------------------------------------------------------
>
>                 Key: SPARK-17368
>                 URL: https://issues.apache.org/jira/browse/SPARK-17368
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.6.2, 2.0.0
>        Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>            Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but
> will break at runtime with the error. The value class is of course
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding:
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
> +- assertnotnull(input[0, int, true], top level non-flat input object)
> +- input[0, int, true]".
> 	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> 	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> 	at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
>
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>
>   def main(args: Array[String]): Unit = {
>     val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
>     val spark = SparkSession.builder.getOrCreate()
>     import spark.implicits._
>     spark.sparkContext.setLogLevel("warn")
>     val ds: Dataset[FeatureId] = spark.createDataset(seq)
>     println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)