paul-rogers commented on issue #1944: DRILL-7503: Refactor the project operator
URL: https://github.com/apache/drill/pull/1944#issuecomment-570465345
 
 
   
   Longer term, it seems we might want to restructure how we generate code. 
Today, if we run a query across, say, 20 minor fragments in the same Drillbit, 
and they all see the same schema (the normal case), then all 20 will generate 
exactly the same code. Then, down in the compiler layer, the first thread will 
compile the code and put it into the code cache. Threads 2 through 20 will get 
the compiled code from the cache.
   
   But, today, the process is **very** inefficient. Each thread:
   
   * Performs the semantic expression analysis (function lookup, type 
resolution, etc.)
   * Generates the entire Java source.
   * Rewrites the Java source to remove the varying bits (the generated class 
name).
   * Hashes the entire source to get a hash code to look up in the code cache.
   * If a match is found, compares the entire code block byte-by-byte to 
verify the match.
   * If the code is new, compiles it, caches it, and uses the full source text 
(which can be huge) as the cache key.
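   The scheme above boils down to something like the following sketch (class 
and method names are illustrative, not Drill's actual code-cache classes): the 
full normalized source text is both the hash input and the equality key, so 
every thread pays for full code gen, normalization, hashing, and a 
byte-by-byte compare just to probe the cache.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Hypothetical sketch of the current source-text-keyed cache.
   class SourceKeyedCache {
     private final Map<String, Object> cache = new HashMap<>();

     Object getOrCompile(String generatedSource) {
       // Normalize away the varying generated class name (simplified here).
       String normalized = generatedSource.replaceAll("Gen\\d+", "Gen");
       // The map hashes the entire string, then on a bucket hit compares it
       // character-by-character (String.equals) -- both costs scale with the
       // size of the generated source.
       return cache.computeIfAbsent(normalized, src -> compile(src));
     }

     private Object compile(String src) {
       return new Object(); // stand-in for the compiled class
     }
   }
   ```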
   
   The only real benefit of this approach is that it has worked all these 
years. 
   
   The better approach is to:
   
   * Create a parameterized "descriptor" object that holds all the factors 
needed to generate the code. (Input schema, expressions, relevant session 
options.)
   * Use that descriptor as a key into the code lookup table. If found, reuse 
the compiled code without any code gen.
   * If not found, only then tell the descriptor to generate the needed code, 
which will then be shared by all fragments.
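   A minimal sketch of that descriptor idea, assuming hypothetical names (none 
of these are existing Drill classes): equality and hashing run over the small 
descriptor rather than over megabytes of source, and code gen happens only on 
a miss.

   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.Objects;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical descriptor: the factors that determine the generated code.
   final class CodeGenDescriptor {
     final List<String> inputSchema;     // e.g. "name:type" pairs
     final List<String> expressions;     // projected expressions
     final Map<String, String> options;  // session options affecting code gen

     CodeGenDescriptor(List<String> schema, List<String> exprs,
                       Map<String, String> opts) {
       this.inputSchema = schema;
       this.expressions = exprs;
       this.options = opts;
     }

     // Cheap equality over the descriptor, not over the generated source.
     @Override public boolean equals(Object o) {
       if (!(o instanceof CodeGenDescriptor)) return false;
       CodeGenDescriptor d = (CodeGenDescriptor) o;
       return inputSchema.equals(d.inputSchema)
           && expressions.equals(d.expressions)
           && options.equals(d.options);
     }
     @Override public int hashCode() {
       return Objects.hash(inputSchema, expressions, options);
     }
   }

   final class DescriptorKeyedCache {
     private final Map<CodeGenDescriptor, Object> cache = new ConcurrentHashMap<>();

     Object getOrGenerate(CodeGenDescriptor key) {
       // Only a cache miss triggers code gen; later fragments with the same
       // descriptor reuse the compiled code without generating any source.
       return cache.computeIfAbsent(key, d -> compile(generateSource(d)));
     }

     private String generateSource(CodeGenDescriptor d) { return "class Gen {}"; }
     private Object compile(String src) { return new Object(); }
   }
   ```

   One design point: the descriptor must capture *every* input that can change 
the generated code, or two fragments could wrongly share code; that is the 
price of dropping the byte-by-byte source comparison.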
   
   The work I did back in the managed sort, and that I'm starting here, at 
least splits code gen from the operator.
   
   I suspect one (very long term) improvement would be to introduce another 
layer of abstraction like we had in Impala. The code gen code tries to do the 
kind of logical type analysis normally done in a planner. But, because the goal 
is Java code gen, it tends to mix SQL type analysis with Java implementation 
details, leading to overly complex code. (That's what I'm fighting with the 
typeof/UNION issue.)
   
   Such an approach would be doubly useful as we roll out the schema 
improvements your team has been doing. If we know the schema (types) at plan 
time, we can work out all the type conversion stuff at that time. In fact, we 
can even play the Spark trick: generate Java once in the planner and send it to 
the workers.
   
   I have only vague ideas here; have not spent much time on it. Sounds like 
you've looked at this some. What do you suggest we do?
