[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939625#comment-15939625 ] Liang-Chi Hsieh commented on SPARK-14083:

[~maropu] Thanks! That's great!

> Analyze JVM bytecode and turn closures into Catalyst expressions
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
>
> One big advantage of the Dataset API is type safety, which comes at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc.). In many cases, it's actually not going to be very difficult to look into the bytecode of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized execution.
> Some examples are:
> {code}
> df.map(_.name)        // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)      // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for bytecode analysis and use it to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch should be rejected if it is too complicated or difficult to reason about.
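[Editor's note] A quick way to see the overhead the ticket describes is to compare the physical plans of the closure and Column versions of the same filter. This is an illustrative spark-shell sketch (it assumes spark.implicits._ is in scope, as it is in the shell); the exact plan output varies by Spark version:

{code}
case class Person(name: String, age: Int)
val ds = Seq(Person("a", 20), Person("b", 15)).toDS()

// Closure version: the plan deserializes each InternalRow into a Person,
// invokes the lambda, then filters -- the round trip this ticket wants to remove.
ds.filter(_.age > 18).explain()

// Column/expression version: Catalyst filters on the age column directly,
// with no deserialization and full visibility for the optimizer.
ds.filter($"age" > 18).explain()
{code}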
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937908#comment-15937908 ] Liang-Chi Hsieh commented on SPARK-14083:

Yeah. Maybe I misread your comment above; I thought you were going to submit a WIP PR right away.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937907#comment-15937907 ] Kazuaki Ishizaki commented on SPARK-14083:

I agree with you. I do not think the branch in its current state is ready to be a WIP PR. We have several things to do to make it better, e.g. address the TODOs, add test suites, and check the results of the major Dataset test suites (DatasetSuite and so on).
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937888#comment-15937888 ] Liang-Chi Hsieh commented on SPARK-14083:

Hmm, I am not sure the branch in its current state should become a PR, even a WIP one. Should we get it into better shape before submitting it?
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937886#comment-15937886 ] Kazuaki Ishizaki commented on SPARK-14083:

[~viirya] For now, I will keep updating the branch https://github.com/kiszk/spark/tree/expression-analysis4 until I submit a WIP PR.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937883#comment-15937883 ] Kazuaki Ishizaki commented on SPARK-14083:

[~maropu] Thanks.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937865#comment-15937865 ] Liang-Chi Hsieh commented on SPARK-14083:

[~kiszk] Can you update the branch, so that I can send a PR against it?
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937810#comment-15937810 ] Takeshi Yamamuro commented on SPARK-14083:

I could also join this work (code reviews, discussion, and so on).
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937796#comment-15937796 ] Liang-Chi Hsieh commented on SPARK-14083:

Great. I am not sure whether it is ready for a (WIP) PR yet. Maybe we can collaborate on this; if there is progress, I can send a PR against your branch.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937783#comment-15937783 ] Kazuaki Ishizaki commented on SPARK-14083:

[~viirya] Thank you for your comment. Good to hear. I will also work on making it feasible by implementing the TODOs and other items, and on exploring more optimization opportunities in the bytecode analysis. After that, I am thinking about submitting it as a WIP PR.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937754#comment-15937754 ] Liang-Chi Hsieh commented on SPARK-14083:

[~kiszk] Thanks for rebasing it; that makes it more convenient to continue the work. I will look into it to see where it can be improved, and will publish a new branch if there is any progress.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904460#comment-15904460 ] Kazuaki Ishizaki commented on SPARK-14083:

I rebased this onto master: https://github.com/kiszk/spark/tree/expression-analysis4
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896452#comment-15896452 ] Kazuaki Ishizaki commented on SPARK-14083:

Is anyone moving forward with this? If not, I will continue to work on it. Recently, I noticed that Spark already uses the ASM framework, which provides features similar to Javassist.
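[Editor's note] For anyone who wants to inspect the raw input such an analysis consumes, here is a minimal hedged sketch using ASM to dump a class's bytecode as text. It assumes the unshaded org.objectweb.asm artifacts (core plus asm-util) are on the classpath; note that Spark itself ships a shaded copy of ASM under a different package name:

{code}
import java.io.PrintWriter
import org.objectweb.asm.ClassReader
import org.objectweb.asm.util.TraceClassVisitor

// Print the bytecode of a named class to stdout. For a closure, you would
// pass the name of the compiled anonymous function class.
def dumpBytecode(className: String): Unit = {
  val reader = new ClassReader(className) // loads the .class file from the classpath
  reader.accept(new TraceClassVisitor(new PrintWriter(System.out)), 0)
}

dumpBytecode("java.lang.Runnable") // arbitrary example target
{code}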
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15779288#comment-15779288 ] Kazuaki Ishizaki commented on SPARK-14083:

[Here|https://github.com/apache/spark/pull/16391#discussion_r93788919] is another motivation for applying this optimization. cc: [~cloud_fan]
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353969#comment-15353969 ] Sean Zhong commented on SPARK-14083:

A typed operation like map first deserializes the InternalRow to type T, applies the operation, and then serializes T back to an InternalRow. An un-typed operation like select("column") operates on the InternalRow directly. If the end user defines a custom serializer like Kryo, it is not possible to map a typed operation to an un-typed one.

{code}
scala> case class C(c: Int)

scala> val ds: Dataset[C] = Seq(C(1)).toDS

scala> ds.select("c") // <- Returns the correct result when using the default encoder.
res1: org.apache.spark.sql.DataFrame = [c: int]

scala> implicit val encoder: Encoder[C] = Encoders.kryo[C] // <- Define a Kryo encoder

scala> val ds2: Dataset[C] = Seq(C(1)).toDS
ds2: org.apache.spark.sql.Dataset[C] = [value: binary] // <- Rows are encoded as binary by the Kryo encoder

scala> ds2.select("c") // <- Fails even though "c" is an existing field of class C!
org.apache.spark.sql.AnalysisException: cannot resolve '`c`' given input columns: [value];
...
{code}
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299426#comment-15299426 ] Takeshi Yamamuro commented on SPARK-14083:

To check the feasibility of supporting types other than Row, I moved Josh's and viirya's prototype forward: https://github.com/maropu/spark/compare/master...expression-analysis3

It seems we could easily support these types based on the prototype; see the test code at https://github.com/maropu/spark/compare/master...expression-analysis3#diff-46c2cd76bdf4dc90045f44f70cb33e15R28

These types have none of the null-handling issues discussed above, so they are a good first step toward supporting this feature.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245427#comment-15245427 ] Liang-Chi Hsieh commented on SPARK-14083:

Based on [~joshrosen]'s code, I added some comments and a few more Java opcodes: https://github.com/apache/spark/compare/master...viirya:expression-analysis2
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223010#comment-15223010 ] Josh Rosen commented on SPARK-14083:

Here's one example of how we might aim to preserve Java/Scala closure API null behavior for field accesses. Consider the following closure:

{code}
val ds = Seq[(String, Integer)](("a", 1), ("b", 2), ("c", 3), (null, null)).toDF()
ds.filter(r => r.getInt(1) == 2).collect()
{code}

This code will fail with a NullPointerException in the getInt() call (per its contract). This closure's bytecode looks like this:

{code}
aload_1
iconst_1
invokeinterface #22 = Method org.apache.spark.sql.Row.getInt((I)I)
iconst_2
if_icmpne 15
iconst_1
goto 16
iconst_0
ireturn
{code}

My most recent prototype converts this into

{code}
cast(if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

where {{npeonnull}} is a new non-SQL expression which throws a NullPointerException on null input. If we trust our nullability-analysis optimization rules, then we could add a trivial optimizer rule to eliminate {{npeonnull}} calls when their children are non-nullable.

If a user wanted the SQL filter semantics here, they could rewrite their closure as

{code}
ds.filter(r => !r.isNullAt(1) && r.getInt(1) == 2)
{code}

My prototype translates this closure into

{code}
cast(if (isnull(_2#3)) 0 else if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

Again, I think that this could be easily simplified given some new optimizer rules:
- We can propagate the negation of the {{if}} condition into the attributes of the else branch.
- Therefore, we can conclude that column 2 is not null when analyzing the else case and can strip out the {{npeonnull}} check.
- After both optimizations plus cast pushdown, constant folding, and an optimization for replacing {{if}} expressions that have non-nullable conditions with the condition expression itself, I think we could produce exactly the same {{filter _2#3 = 2}} expression that the Catalyst expression DSL would have given us.
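[Editor's note] As a reader's aid, here is a hedged sketch of what such an {{npeonnull}} expression might look like in Catalyst. This is not Josh's actual prototype code: the class name is illustrative, it targets the Spark 2.x-era expression API, and it mixes in {{CodegenFallback}} to avoid writing code generation by hand:

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.DataType

// Hypothetical expression: forwards non-null input unchanged and throws a
// NullPointerException on null, mirroring the JVM semantics of Row.getInt
// on a null column. Because it never *returns* null, nullable is false,
// which is what would let an optimizer rule remove it when its child is
// known to be non-nullable.
case class NPEOnNull(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = child.dataType
  override def nullable: Boolean = false
  override def eval(input: InternalRow): Any = {
    val v = child.eval(input)
    if (v == null) throw new NullPointerException("null where the closure required a value")
    v
  }
}
{code}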
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222648#comment-15222648 ] Liang-Chi Hsieh commented on SPARK-14083:

I think this optimization currently only considers analysis of JVM bytecode, right? If so, it is better to follow JVM behavior: users should not be aware of the optimization, and they expect the results to come from their closures. Different behavior might cause confusion.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222634#comment-15222634 ] Kazuaki Ishizaki commented on SPARK-14083:

Good point. On the other hand, should we follow JVM behavior exactly? What we need to follow is the semantics of Dataset, and in my opinion, since Dataset is still experimental, we could update those semantics. Dataset programs may be written in other languages such as Python, too. From my experience implementing a Java just-in-time compiler, overly strict semantics can lose optimization opportunities, while throwing an exception is useful for efficient debugging.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222611#comment-15222611 ] Josh Rosen commented on SPARK-14083:

Null-handling is going to present a major design challenge here. If we want to exactly preserve the behavior of the Java closure, then we need to ensure that our translation does not add implicit null-handling that differs from the JVM's own handling. For example:
- What happens today if a user calls .getInt(..) on a column which is null? We need to preserve the current behavior.
- What if a user calls .getString(...).equals("foo") on a row where the string column is null? Today the user's code will throw a NullPointerException. To preserve this behavior, we might need to add an expression which throws exceptions on null values.
- In Scala, casting null to a numeric type (e.g. null.asInstanceOf[Int]) yields the zero value of that type, whereas SQL casts preserve nulls.

I think that we don't have a choice but to faithfully preserve Java's null semantics. If we didn't, then subtle differences in Java closures or in compilers' emitted bytecode could alter the results of queries.
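[Editor's note] For concreteness, the JVM-side behaviors in question can be checked in a plain Scala REPL (no Spark involved); this is just an illustration of the semantics the translation would have to reproduce:

{code}
scala> val s: String = null
s: String = null

scala> s.equals("foo") // method call on a null receiver
java.lang.NullPointerException
  ...

scala> null.asInstanceOf[Int] // Scala cast of null to a primitive: zero value, no error
res1: Int = 0

// SQL, by contrast, propagates NULL: CAST(NULL AS INT) evaluates to NULL.
{code}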
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220055#comment-15220055 ] Kazuaki Ishizaki commented on SPARK-14083:

Thank you, that makes sense to me. It seems that this translator performs the minimum analysis needed to ensure a correct transformation (e.g., avoiding infinite loops), and we then rely on Catalyst for further optimizations.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220001#comment-15220001 ] Josh Rosen commented on SPARK-14083:

I'm aware of Soot but chose not to use it for a few reasons:
1. Its LGPL license likely prevents us from using it in Spark.
2. It doesn't publish releases to Maven Central.
3. While Soot provides higher-level analysis capabilities, I think that many of its features aren't necessary for the relatively simple types of analysis that we need to do, so I'd be slightly concerned that Soot might introduce an undesirable performance / complexity penalty.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219702#comment-15219702 ] Liang-Chi Hsieh commented on SPARK-14083:

Very interested in this work too. I would like to collaborate on this if possible.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219637#comment-15219637 ] Kazuaki Ishizaki commented on SPARK-14083:
Looks interesting. I am also interested in collaborating on this. I have one question: what is the reason to use Javassist? Soot (https://sable.github.io/soot/) looks more powerful for performing bytecode analysis and applying optimizations.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219488#comment-15219488 ] Josh Rosen commented on SPARK-14083:
I spent a chunk of time investigating this last weekend and was able to make some decent progress.

In a nutshell, my approach is based on iterating over the closure's bytecode instruction by instruction while simulating what happens to the stack. Similar to the trick from the paper I linked, we define variables that correspond to positions on the stack; at each instruction we update those variables to hold either Catalyst expressions corresponding to the partial value computed up to that point, or pointers to input arguments (e.g. the input row to the UDF) for use in resolving the targets of method invocations. If we hit a branch instruction, we copy the current state, recursively investigate each branch, and then use the results of the branches to emit a CASE WHEN expression.

In principle, this approach risks infinite looping, but I hope to avoid that by using Javassist to compute the graph of basic blocks and aborting if the graph contains a cycle. We also have to worry about recursive method invocations; to handle those, I propose to abort the analysis after a fixed number of steps / budget of instructions.

The Catalyst expressions which result from this process are much more complex than is ideal, but I'm hoping that we'll be able to simplify them easily given a few new expression optimization rules (such as rules for pushing projections beneath CASE WHEN statements).

In case anyone's interested, here's my _very hacky_ work-in-progress spaghetti code (basically unedited and written mostly in a day): https://github.com/apache/spark/compare/master...JoshRosen:expression-analysis?diff=unified&name=expression-analysis

This works for a single test case in DatasetSuite and can handle simple filter and arithmetic expressions. I hope that this small snippet is a decent illustration of my basic proposed approach. I'm hoping to pick up this line of work once I get some other stuff off my plate, but in the meantime I'd be interested in collaborating with anyone who wants to run with these ideas.
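To make the stack-simulation idea concrete, here is a minimal, self-contained sketch (this is not the WIP patch linked above; the {{Instr}} ADT, {{interpret}}, and the field/opcode names are hypothetical stand-ins for real JVM opcodes, and the branch instruction is shown with both arms already structured, whereas the real analysis would have to recover the arms from the basic-block graph):
{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Add, CaseWhen, Expression, GreaterThan, Literal}

object ClosureToExprSketch {
  // Hypothetical, simplified stand-ins for JVM opcodes.
  sealed trait Instr
  case class GetField(name: String) extends Instr   // getter call on the input row
  case class Const(value: Int) extends Instr        // iconst_* / bipush / ldc
  case object AddOp extends Instr                   // iadd
  case class IfGt(ifTrue: List[Instr], ifFalse: List[Instr]) extends Instr // if_icmpgt, arms pre-structured

  // Each stack slot holds the Catalyst expression for the partial value
  // computed so far; a branch forks the state, interprets each arm, and
  // merges the two results into a CASE WHEN expression.
  def interpret(instrs: List[Instr], stack: List[Expression]): Expression = instrs match {
    case Nil => stack.head                          // top of stack is the closure's result
    case GetField(n) :: rest => interpret(rest, UnresolvedAttribute(n) :: stack)
    case Const(v) :: rest    => interpret(rest, Literal(v) :: stack)
    case AddOp :: rest =>
      val right :: left :: tail = stack             // pop two operands, push their sum
      interpret(rest, Add(left, right) :: tail)
    case IfGt(t, f) :: rest =>
      val right :: left :: tail = stack             // pop the two comparison operands
      val cond = GreaterThan(left, right)
      CaseWhen(Seq(cond -> interpret(t ::: rest, tail)), Some(interpret(f ::: rest, tail)))
  }

  // ds.map(x => if (x.age > 18) x.id + 1 else 0) would correspond roughly to:
  val expr: Expression = interpret(
    List(GetField("age"), Const(18),
         IfGt(List(GetField("id"), Const(1), AddOp), List(Const(0)))),
    Nil)
  // => CASE WHEN (age > 18) THEN (id + 1) ELSE 0 END
}
{code}
And a sketch of the loop check mentioned above, using Javassist's {{ControlFlow}} API to build the basic-block graph and bail out when it contains a cycle (assuming the method's entry block is the first element of {{basicBlocks()}}; {{hasLoop}} is a hypothetical helper, not part of the patch):
{code}
import javassist.ClassPool
import javassist.bytecode.analysis.ControlFlow

object LoopCheckSketch {
  // Returns true if the method's control-flow graph contains a cycle, in
  // which case the bytecode-to-expression analysis would abort.
  def hasLoop(className: String, methodName: String): Boolean = {
    val method = ClassPool.getDefault.get(className).getDeclaredMethod(methodName)
    val blocks = new ControlFlow(method).basicBlocks()
    var visited = Set.empty[Int]
    // DFS over block indices; an edge back into the current path is a back
    // edge, i.e. a loop.
    def dfs(idx: Int, path: Set[Int]): Boolean = {
      if (path.contains(idx)) return true
      if (visited.contains(idx)) return false
      visited += idx
      val block = blocks(idx)
      (0 until block.exits()).exists(i => dfs(block.exit(i).index(), path + idx))
    }
    dfs(blocks(0).index(), Set.empty)
  }
}
{code}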
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219471#comment-15219471 ] Reynold Xin commented on SPARK-14083:
We could do that, but it's probably lower priority. I am aware of that paper, but unfortunately it is very light on details.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219470#comment-15219470 ] Takeshi Yamamuro commented on SPARK-14083:
Could the goal of this ticket be easily extended to simple Hive UDFs? It seems that a component that analyzes JVM bytecode could also be applied to compiled UDFs for better performance. FYI: this seems similar to the approach proposed in http://www.vldb.org/pvldb/vol8/p1466-crotty.pdf.
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213959#comment-15213959 ] Kazuaki Ishizaki commented on SPARK-14083:
Another possible bytecode analysis library is WALA: https://github.com/wala/WALA/wiki
[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213910#comment-15213910 ] Reynold Xin commented on SPARK-14083:
A relevant paper suggested by [~joshrosen]: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7708