[ 
https://issues.apache.org/jira/browse/DRILL-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009058#comment-17009058
 ] 

ASF GitHub Bot commented on DRILL-7503:
---------------------------------------

paul-rogers commented on issue #1944: DRILL-7503: Refactor the project operator
URL: https://github.com/apache/drill/pull/1944#issuecomment-571241220
 
 
   One more thought about the CG. In working on another issue, it slowly dawned 
on me a couple of other worthwhile goals in addition to those mentioned earlier.
   
   We mentioned above that CD combines logical planning and code gen, creating 
a very complex bit of code. Again, some of us are not smart enough to be able 
to quickly understand the combined logic, so it would make sense to separate 
out the two steps so that average folks can more easily understand and modify 
them.
   
   Then, we need to think about debug and testing. As originally implemented, 
CG was direct path from operator to running byte codes. The original authors 
were smart enough to be able to debug the result by stepping through the 
compiled byte codes. Most of us are not that smart. So,  few years back we 
added the "plain Java" and "save code for debugging" modes which made it 
possible for us newbies to view and step through the Java code.
   
   Still, in order to debug, we need to run Drill, find a useful unit test, run 
the query, and then debug CG in the context of running a query. We are limited 
in what we can test based on what queries we can find or create to set up the 
scenarios of interest.
   
   Better would be to apply unit testing principles: separate out CG so we can 
set up a set of inputs, run CG, and diff the resulting code against a golden 
copy. This is how we debugged the Impala planner and it worked very well. This 
way, we can easily set up every data type without needing to use a zillion 
different input file formats. That is:
   
   ```
   (operator config, schema, options) --> CG --> (Java code, exec plan)
   ```
   
   Where the "exec plan" would be things like the knick-lacks that the 
`VectorState` currently wrap. That is, rather than having CG set up vectors, 
maybe CG creates a plan that says, "here are the vectors you will need", then 
the operator creates the vectors. Still pretty hand-wavey at this point.
   
   This refactoring can be seen as a naive grasping toward that goal. We need 
to pull out CG, but it is not clear yet the final form. The same can be said of 
the earlier refactoring of the external sort to separate out CG. Sometimes, 
just by taking a step, we can get a bit better understanding of where we need 
to go.
   
   All that said, the goal here is just to take a step, not to implement a 
final solution. I realize, however, that makes this PR hard to review since it 
is incomplete: it takes a step, but does not clearly state "a step to where?"
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor project operator
> -------------------------
>
>                 Key: DRILL-7503
>                 URL: https://issues.apache.org/jira/browse/DRILL-7503
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>             Fix For: 1.18.0
>
>
> Work on another ticket revealed that the Project operator ("record batch") 
> has grown quite complex. The setup phase lives in the operator as one huge 
> function. The function combines the "logical" tasks of working out the 
> projection expressions and types, the code gen for those expressions, and the 
> physical setup of vectors.
> The refactoring breaks up the logic so that it is easier to focus on the 
> specific bits of interest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to