[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299426#comment-15299426
 ] 

Takeshi Yamamuro commented on SPARK-14083:
------------------------------------------

To check the feasibility to support other types except for Row, I moved Josh's 
and viirya's prototype forward: 
https://github.com/maropu/spark/compare/master...expression-analysis3
Seems we could easily support these types based on the prototype; see test 
codes 
https://github.com/maropu/spark/compare/master...expression-analysis3#diff-46c2cd76bdf4dc90045f44f70cb33e15R28
I think these types has no null-handling issue discussed above and so it is 
good as a first step to support this feature.


> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
>                 Key: SPARK-14083
>                 URL: https://issues.apache.org/jira/browse/SPARK-14083
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18)
> df.map(_.id + 1)  // equivalent to Add(col("age"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to