[
https://issues.apache.org/jira/browse/DRILL-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467749#comment-17467749
]
ASF GitHub Bot commented on DRILL-8088:
---------------------------------------
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1003831613
Hi @luocooong, looks like you're looking at the expression and operator
code. I wonder, is there anything you're trying to improve? Execution
performance, maybe?
As you know, Drill is very complicated. Drill uses code generation for
expression evaluation. The code generation goes through a path that made sense
for Java 5 (when Drill was written), but is now a bit awkward. We do have a way
to use the native Java tools, which worked faster several years ago; that path
is probably even faster now.
Operator setup (another of your PRs) is impacted by code gen cost. Drill
generates code for each fragment. If your query has 20 fragments, we generate
code 20 times. The reason we must do that is that, in theory, every fragment
can see a different schema, so the generated code could differ. By comparison,
Spark generates code once, then pushes that code to all its executors.
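To make the per-fragment cost concrete, here is a minimal sketch of the alternative: generate once per distinct schema and reuse the result across fragments. This is not Drill code. `SchemaKeyedCodeCache`, `codeFor`, and `generationCount` are made-up names, the schema is modeled as a plain list of strings, and the "generated code" is just a string produced by a caller-supplied function.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; none of these names are Drill classes.
final class SchemaKeyedCodeCache {
    // Schema (modeled as a list of "name:type" strings) -> generated code.
    private final Map<List<String>, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger generations = new AtomicInteger();
    private final Function<List<String>, String> generator;

    SchemaKeyedCodeCache(Function<List<String>, String> generator) {
        this.generator = generator;
    }

    // Returns code for the schema, running the generator only the first
    // time a given schema is seen.
    String codeFor(List<String> schema) {
        return cache.computeIfAbsent(schema, s -> {
            generations.incrementAndGet();
            return generator.apply(s);
        });
    }

    // How many times the (expensive) generator actually ran.
    int generationCount() {
        return generations.get();
    }
}
```

With 20 fragments that all see the same schema, `codeFor` would run the generator once; a fragment that sees a different schema triggers one more generation. Whether this fits Drill depends on how often schemas really differ across fragments.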
The generated code itself can be rather awkward for large queries: the code
tries to inline everything, which is great for small functions but causes
optimization problems as code blocks get larger.
The mechanism to generate code, especially in the PROJECT operator, is
vastly overly complex and could use a good re-think. It is so complex that it
is hard to optimize because of the many assumptions and other issues embedded
in the code.
The generated code is meant to be small. But, over time, some operators
added lots of "standard" code to the code generation path. The result is more
work for the compiler and "byte code optimizer" that adds no per-query value.
We've taken several passes at refactoring to pull that code out of the code gen
path, but there is more to do.
Drill was designed to allow vector operations (hence Value Vectors), but the
code was never written, in part because there are no CPU vector instructions
that work with SQL nullable data. Arrow is supposed to have figured out
solutions (Gandiva, is it?), which perhaps we could consider (but probably only
for non-nullable data).
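A rough sketch of why nullable data complicates a vectorized loop, and of one commonly used workaround: compute every value unconditionally and combine the validity flags in a separate, equally branch-free step. Plain arrays stand in for value vectors here. `NullableAddSketch` and its methods are hypothetical names, not Drill or Arrow APIs, and this says nothing about how Gandiva actually does it.

```java
// Hypothetical illustration; plain arrays stand in for value vectors.
final class NullableAddSketch {
    // Non-nullable columns: a tight loop with no per-row branch.
    static void add(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Nullable columns: values are added whether or not they are valid,
    // and validity is combined separately. A row is valid only if both
    // inputs are valid; out[i] is meaningless where outValid[i] is false.
    static void addNullable(int[] a, boolean[] aValid,
                            int[] b, boolean[] bValid,
                            int[] out, boolean[] outValid) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
            outValid[i] = aValid[i] & bValid[i];
        }
    }
}
```

The nullable version does twice the memory traffic, and every consumer has to consult `outValid` before trusting a value, which is part of why non-nullable data is the easier target.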
Anyway, there are many areas we can improve. I can give you more details if
I know what you're trying to accomplish.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
> Improve expression evaluation performance
> -----------------------------------------
>
> Key: DRILL-8088
> URL: https://issues.apache.org/jira/browse/DRILL-8088
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Codegen
> Reporter: wtf
> Assignee: wtf
> Priority: Minor
>
> Found an unnecessary map copy during expression evaluation. It slows down
> codegen when the query includes many "case when" expressions or avg/stddev
> calls (the reduced expressions include "case when"). In our case, the query
> includes 314 avg calls, and it takes 3+ seconds to generate the projector
> expressions (Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz, 32 cores).
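
For readers unfamiliar with the pattern, a minimal sketch of why a map copy per nested scope gets expensive, and of one possible alternative. This is an illustration under assumptions, not the code touched by the PR: `ChainedScope` and `entriesCopiedByCopying` are hypothetical names, and the assumption is that each nested expression opens a scope that must see its parents' entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Drill's expression evaluation code.
final class ChainedScope {
    private final ChainedScope parent;
    private final Map<String, Integer> local = new HashMap<>();

    ChainedScope(ChainedScope parent) {
        this.parent = parent;
    }

    void put(String key, int value) {
        local.put(key, value);
    }

    // Looks in this scope first, then walks up the parents. Entering a
    // new scope costs O(1) because nothing is copied.
    Integer get(String key) {
        for (ChainedScope s = this; s != null; s = s.parent) {
            Integer v = s.local.get(key);
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    // For comparison: counts the entries copied when each nested scope
    // starts as a full copy of everything visible in its parent.
    static long entriesCopiedByCopying(int depth, int entriesPerScope) {
        long copied = 0;
        Map<String, Integer> visible = new HashMap<>();
        for (int level = 0; level < depth; level++) {
            Map<String, Integer> child = new HashMap<>(visible); // O(size) copy
            copied += visible.size();
            for (int i = 0; i < entriesPerScope; i++) {
                child.put(level + ":" + i, i);
            }
            visible = child;
        }
        return copied;
    }
}
```

With the copying approach the total work grows quadratically with nesting depth, while the chained version trades that for a slightly slower lookup that walks the parents.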
--
This message was sent by Atlassian Jira
(v8.20.1#820001)