paul-rogers commented on issue #1944: DRILL-7503: Refactor the project operator
URL: https://github.com/apache/drill/pull/1944#issuecomment-570465345

Longer term, it seems we might want to restructure how we generate code. Today, if we run a query across, say, 20 minor fragments in the same Drillbit, and they all see the same schema (the normal case), then all 20 will generate exactly the same code. Then, down in the compiler layer, the first thread will compile the code and put it into the code cache. Threads 2 through 20 will get the compiled code from the cache.

But, today, the process is **very** inefficient. Each thread:

* Does the semantic expression analysis (function lookup, type resolution, etc.)
* Generates the entire Java code.
* Rewrites the Java code to remove the varying bits (the generated class name).
* Hashes the entire code to get a hash code to look up in the code cache.
* If a match is found, compares the entire code block byte-by-byte to verify the match.
* If new, generates the code, caches it, and uses the source code (which can be huge) as the cache key.

The only real benefit of this approach is that it has worked all these years. The better approach is to:

* Create a parameterized "descriptor" object that holds all the factors needed to generate the code (input schema, expressions, relevant session options). A rough sketch appears at the end of this comment.
* Use that descriptor as the key into the code lookup table. If found, reuse the compiled code without any code gen.
* If not found, only then tell the descriptor to generate the needed code, which will then be shared by all fragments.

The work I did back in the managed sort, and that I'm starting here, at least splits code gen from the operator.

I suspect one (very long term) improvement would be to introduce another layer of abstraction like we had in Impala. The code gen code tries to do the kind of logical type analysis normally done in a planner. But, because the goal is Java code gen, it tends to mix SQL type analysis with Java implementation details, leading to overly complex code. (That's what I'm fighting with in the typeof/UNION issue.)

Such an approach would be doubly useful as we roll out the schema improvements your team has been doing. If we know the schema (types) at plan time, we can work out all the type conversion stuff at that time. In fact, we could even play the Spark trick: generate the Java once in the planner and send it to the workers.

I have only vague ideas here; I have not spent much time on it. Sounds like you've looked at this some. What do you suggest we do?
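To make the descriptor idea a bit more concrete, here is a minimal sketch. All names here (`CompiledCodeCache`, `CodeGenDescriptor`, the signature fields) are hypothetical, not existing Drill classes; the point is only that the cache key is a small, cheap-to-hash descriptor, so only the first fragment that misses pays for expression analysis, code generation, and compilation.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class CompiledCodeCache {

  /**
   * Hypothetical descriptor: everything that determines the generated code,
   * in canonical string form (input schema, expressions, and only the
   * session options that actually affect codegen).
   */
  public static final class CodeGenDescriptor {
    private final String inputSchemaSignature;
    private final String expressionSignature;
    private final String optionsSignature;

    public CodeGenDescriptor(String schema, String exprs, String options) {
      this.inputSchemaSignature = schema;
      this.expressionSignature = exprs;
      this.optionsSignature = options;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof CodeGenDescriptor)) {
        return false;
      }
      CodeGenDescriptor other = (CodeGenDescriptor) o;
      return inputSchemaSignature.equals(other.inputSchemaSignature)
          && expressionSignature.equals(other.expressionSignature)
          && optionsSignature.equals(other.optionsSignature);
    }

    @Override
    public int hashCode() {
      return Objects.hash(inputSchemaSignature, expressionSignature, optionsSignature);
    }

    /** Called only on a cache miss: generate the Java source and compile it. */
    public Class<?> generateAndCompile() {
      // Expression analysis and code generation would happen here,
      // once per distinct descriptor rather than once per minor fragment.
      throw new UnsupportedOperationException("codegen not shown in this sketch");
    }
  }

  private final Map<CodeGenDescriptor, Class<?>> cache = new ConcurrentHashMap<>();

  /** Fragments 2..N reuse the compiled class without generating any source. */
  public Class<?> getOrCompile(CodeGenDescriptor descriptor) {
    return cache.computeIfAbsent(descriptor, CodeGenDescriptor::generateAndCompile);
  }
}
```

The contrast with today's flow is that the descriptor is built before any source text exists, so the hash/compare step works on a few short strings instead of the full generated class, and cache hits skip code generation entirely.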