[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219488#comment-15219488
 ] 

Josh Rosen commented on SPARK-14083:
------------------------------------

I spent a chunk of time investigating this last weekend and was able to make 
some decent progress.

In a nutshell, my approach iterates over the closure's bytecode instruction by 
instruction while simulating what happens to the stack. Similar to the trick 
from the paper I linked, we define variables that correspond to positions on 
the stack; at each instruction we update those variables to hold either 
Catalyst expressions corresponding to the partial value computed so far, or 
pointers to input arguments (e.g. the input row to the UDF) for use in 
resolving the targets of method invocations.
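To make the stack-simulation idea concrete, here is a minimal toy sketch (in Java, for illustration only; the actual prototype is Scala against real Catalyst nodes). The class name, method names, and string-based expression encoding are all hypothetical: each simulated "instruction" pushes or pops symbolic expression strings instead of runtime values, so the final stack top is the expression for the whole closure body.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: an abstract stack whose slots hold symbolic
// expressions (encoded as strings here) rather than concrete values.
public class SymbolicStackDemo {
    private final Deque<String> stack = new ArrayDeque<>();

    // Stand-ins for handling real opcodes like GETFIELD, BIPUSH, IADD, IF_ICMPGT.
    public void loadField(String name) { stack.push("col(\"" + name + "\")"); }
    public void pushConst(int v)       { stack.push("lit(" + v + ")"); }
    public void add() {
        String r = stack.pop(), l = stack.pop();
        stack.push("Add(" + l + ", " + r + ")");
    }
    public void gt() {
        String r = stack.pop(), l = stack.pop();
        stack.push("GreaterThan(" + l + ", " + r + ")");
    }
    public String result() { return stack.peek(); }

    public static void main(String[] args) {
        // Simulates the bytecode of `_.age > 18`: load field, push constant, compare.
        SymbolicStackDemo sim = new SymbolicStackDemo();
        sim.loadField("age");
        sim.pushConst(18);
        sim.gt();
        System.out.println(sim.result()); // prints GreaterThan(col("age"), lit(18))
    }
}
```

The key design point is that the interpreter never needs concrete values: every stack slot carries either a partial expression tree or a reference back to a UDF input.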

If we hit a branch instruction, we copy the current state, recursively analyze 
each branch, and then use the results of the branches to emit a CASE WHEN 
expression. In principle, this approach risks infinite looping, but I hope to 
avoid that by using Javassist to compute the graph of basic blocks and 
aborting if the graph contains a cycle. We also have to worry about recursive 
method invocations; to handle those, I propose aborting the analysis after a 
fixed number of steps / budget of instructions.
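The cycle check above can be sketched as a standard DFS back-edge test over the basic-block graph. This is a hypothetical standalone illustration (in the real prototype the graph would be derived via Javassist's control-flow analysis; here it is just adjacency lists keyed by block index):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: refuse to analyze closures whose basic-block graph
// contains a cycle, since the branch-copying strategy would loop forever.
public class CycleCheckDemo {
    public static boolean hasCycle(Map<Integer, List<Integer>> cfg) {
        Set<Integer> done = new HashSet<>();       // fully explored blocks
        Set<Integer> inProgress = new HashSet<>(); // blocks on the current DFS path
        for (Integer block : cfg.keySet())
            if (dfs(block, cfg, done, inProgress)) return true;
        return false;
    }

    private static boolean dfs(int b, Map<Integer, List<Integer>> cfg,
                               Set<Integer> done, Set<Integer> inProgress) {
        if (inProgress.contains(b)) return true;   // back edge: loop in the closure
        if (done.contains(b)) return false;
        inProgress.add(b);
        for (int succ : cfg.getOrDefault(b, List.of()))
            if (dfs(succ, cfg, done, inProgress)) return true;
        inProgress.remove(b);
        done.add(b);
        return false;
    }

    public static void main(String[] args) {
        // A diamond (if/else that rejoins) is fine; a 2 -> 0 edge makes a loop.
        Map<Integer, List<Integer>> diamond =
            Map.of(0, List.of(1, 2), 1, List.of(3), 2, List.of(3));
        Map<Integer, List<Integer>> loop =
            Map.of(0, List.of(1), 1, List.of(2), 2, List.of(0));
        System.out.println(hasCycle(diamond)); // false
        System.out.println(hasCycle(loop));    // true
    }
}
```

A diamond-shaped graph (plain if/else) passes the check and gets translated to CASE WHEN; anything with a back edge (a loop) is rejected and the closure falls back to ordinary execution.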

The Catalyst expressions that result from this process are much more complex 
than is ideal, but I'm hoping we'll be able to simplify them easily given a 
few new expression optimization rules (such as rules for pushing projections 
beneath CASE WHEN expressions).
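As one example of such a rule, here is a hypothetical sketch (again illustrative Java, not real Catalyst classes) of pushing a field extraction beneath a CASE WHEN: `GetField(CaseWhen(c, a, b))` becomes `CaseWhen(c, GetField(a), GetField(b))`, letting later rules simplify each branch independently. The toy `Expr` tree and rule name are assumptions for the sketch:

```java
import java.util.List;

// Hypothetical sketch of one simplification rule over a toy expression tree.
public class PushDownDemo {
    public record Expr(String op, List<Expr> children) {
        public Expr(String op, Expr... kids) { this(op, List.of(kids)); }
        @Override public String toString() {
            return children.isEmpty() ? op
                 : op + "(" + String.join(", ",
                       children.stream().map(Expr::toString).toList()) + ")";
        }
    }

    // GetField(CaseWhen(cond, t, e)) -> CaseWhen(cond, GetField(t), GetField(e))
    public static Expr pushBelowCaseWhen(Expr expr) {
        if (expr.op().equals("GetField") && expr.children().size() == 1) {
            Expr child = expr.children().get(0);
            if (child.op().equals("CaseWhen") && child.children().size() == 3) {
                Expr cond = child.children().get(0);
                Expr thenPart = new Expr("GetField", child.children().get(1));
                Expr elsePart = new Expr("GetField", child.children().get(2));
                return new Expr("CaseWhen", cond, thenPart, elsePart);
            }
        }
        return expr; // rule does not apply; leave the tree unchanged
    }

    public static void main(String[] args) {
        Expr before = new Expr("GetField",
            new Expr("CaseWhen", new Expr("cond"), new Expr("structA"), new Expr("structB")));
        System.out.println(pushBelowCaseWhen(before));
        // prints CaseWhen(cond, GetField(structA), GetField(structB))
    }
}
```

Applied bottom-up across a tree, rules of this shape should collapse much of the redundancy that mechanical bytecode translation introduces.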

In case anyone's interested, here's my _very hacky_ work-in-progress spaghetti 
code (this code is basically unedited and was written mostly in a day):

https://github.com/apache/spark/compare/master...JoshRosen:expression-analysis?diff=unified&name=expression-analysis

This works for a single test case in DatasetSuite and can handle simple filter 
and arithmetic expressions. I hope that this small snippet is a decent 
illustration of my basic proposed approach.

I'm hoping to pick up this line of work once I get some other stuff off my 
plate, but in the meantime I'd be interested in collaborating with anyone who 
wants to run with these ideas.

> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
>                 Key: SPARK-14083
>                 URL: https://issues.apache.org/jira/browse/SPARK-14083
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
