[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353969#comment-15353969 ]
Sean Zhong edited comment on SPARK-14083 at 6/29/16 12:17 AM:
--------------------------------------------------------------

A typed operation like map first deserializes each InternalRow to an instance of T, applies the operation, and then serializes the result back to an InternalRow. An untyped operation like select("column") operates directly on the InternalRow. If the end user defines a custom serializer such as Kryo, the typed operation can no longer be mapped to an untyped one, because the row is stored as a single opaque binary value rather than as named columns.

{code}
scala> case class C(c: Int)

scala> val ds: Dataset[C] = Seq(C(1)).toDS

scala> ds.select("c")   // <- Returns the correct result with the default encoder.
res1: org.apache.spark.sql.DataFrame = [c: int]

scala> implicit val encoder: Encoder[C] = Encoders.kryo[C]   // <- Define a Kryo encoder

scala> val ds2: Dataset[C] = Seq(C(1)).toDS
ds2: org.apache.spark.sql.Dataset[C] = [value: binary]   // <- Row is encoded as binary by the Kryo encoder

scala> ds2.select("c")   // <- Fails even though "c" is an existing field of class C!
org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input columns: [value]; ...
{code}
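Note that the asymmetry only affects untyped, column-based operations. A typed operation such as map should still work under the Kryo encoder, since it deserializes the binary value back to C before applying the closure. A sketch, assuming the same spark-shell session as above:

```scala
scala> ds2.map(_.c)   // <- Typed operation still works: the row is deserialized to C first.
```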
> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
>                 Key: SPARK-14083
>                 URL: https://issues.apache.org/jira/browse/SPARK-14083
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> One big advantage of the Dataset API is type safety, at the cost of
> performance due to heavy reliance on user-defined closures/lambdas. These
> closures are typically slower than expressions because we have more
> flexibility to optimize expressions (known data types, no virtual function
> calls, etc.). In many cases, it is actually not going to be very difficult
> to look into the bytecode of these closures and figure out what they are
> trying to do. If we can understand them, then we can turn them directly
> into Catalyst expressions for more optimized execution.
> Some examples are:
> {code}
> df.map(_.name)        // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)      // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for bytecode
> analysis and use it to convert closures/lambdas into Catalyst expressions
> in order to speed up Dataset execution. It is a little bit futuristic, but
> I believe it is very doable. The framework should be easy to reason about
> (e.g. similar to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch
> should be rejected if it is too complicated or difficult to reason about.
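To make the proposed rewrite concrete, here is a toy sketch (these are hypothetical stand-in classes, not Spark's actual Catalyst API) of the kind of expression tree a bytecode analyzer could emit for a closure like `_.age > 18`:

```scala
// Toy expression tree, evaluated against a row modeled as a Map.
// Col, Lit, and GreaterThan mimic the shape of Catalyst expressions
// but are illustrative stand-ins only.
sealed trait Expr { def eval(row: Map[String, Any]): Any }
case class Col(name: String) extends Expr {
  def eval(row: Map[String, Any]): Any = row(name)
}
case class Lit(value: Any) extends Expr {
  def eval(row: Map[String, Any]): Any = value
}
case class GreaterThan(left: Expr, right: Expr) extends Expr {
  def eval(row: Map[String, Any]): Any =
    left.eval(row).asInstanceOf[Int] > right.eval(row).asInstanceOf[Int]
}

object Demo extends App {
  // The closure `_.age > 18` would be rewritten to this expression tree,
  // which the optimizer can inspect, unlike an opaque JVM lambda:
  val expr = GreaterThan(Col("age"), Lit(18))
  println(expr.eval(Map("age" -> 25)))  // true
  println(expr.eval(Map("age" -> 10)))  // false
}
```

Because the tree is plain data, an optimizer can pattern-match on it (e.g. for predicate pushdown), which is exactly what is lost when the predicate stays inside an opaque closure.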
-- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org