paul-rogers commented on a change in pull request #1944: DRILL-7503: Refactor the project operator URL: https://github.com/apache/drill/pull/1944#discussion_r362702109
########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectMemoryManager.java ########## @@ -42,307 +44,310 @@ import java.util.Map; /** - * - * ProjectMemoryManager(PMM) is used to estimate the size of rows produced by ProjectRecordBatch. - * The PMM works as follows: - * - * Setup phase: As and when ProjectRecordBatch creates or transfers a field, it registers the field with PMM. - * If the field is a variable width field, PMM records the expression that produces the variable - * width field. The expression is a tree of LogicalExpressions. The PMM walks this tree of LogicalExpressions - * to produce a tree of OutputWidthExpressions. The widths of Fixed width fields are just accumulated into a single - * total. Note: The PMM, currently, cannot handle new complex fields, it just uses a hard-coded estimate for such fields. - * - * - * Execution phase: Just before a batch is processed by Project, the PMM walks the tree of OutputWidthExpressions - * and converts them to FixedWidthExpressions. It uses the RecordBatchSizer and the function annotations to do this conversion. - * See OutputWidthVisitor for details. + * ProjectMemoryManager(PMM) is used to estimate the size of rows produced by + * ProjectRecordBatch. The PMM works as follows: + * <p> + * Setup phase: As and when ProjectRecordBatch creates or transfers a field, it + * registers the field with PMM. If the field is a variable width field, PMM + * records the expression that produces the variable width field. The expression + * is a tree of LogicalExpressions. The PMM walks this tree of + * LogicalExpressions to produce a tree of OutputWidthExpressions. The widths of + * Fixed width fields are just accumulated into a single total. Note: The PMM, + * currently, cannot handle new complex fields, it just uses a hard-coded + * estimate for such fields. + * <p> + * Execution phase: Just before a batch is processed by Project, the PMM walks + * the tree of OutputWidthExpressions and converts them to + * FixedWidthExpressions. It uses the RecordBatchSizer and the function + * annotations to do this conversion. See OutputWidthVisitor for details. */ public class ProjectMemoryManager extends RecordBatchMemoryManager { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ProjectMemoryManager.class); - - public RecordBatch getIncomingBatch() { - return incomingBatch; + static final Logger logger = LoggerFactory.getLogger(ProjectMemoryManager.class); + + private RecordBatch incomingBatch; + private ProjectRecordBatch outgoingBatch; + + private int rowWidth; + private final Map<String, ColumnWidthInfo> outputColumnSizes; + // Number of variable width columns in the batch + private int variableWidthColumnCount; + // Number of fixed width columns in the batch + private int fixedWidthColumnCount; + // Number of complex columns in the batch + private int complexColumnsCount; + + // Holds sum of all fixed width column widths + private int totalFixedWidthColumnWidth; + // Holds sum of all complex column widths + // Currently, this is just a guess + private int totalComplexColumnWidth; + + private enum WidthType { + FIXED, + VARIABLE + } Review comment: Done. I tried to limit the "blast radius" by gritting my teeth and leaving this class as it was. This seems to be some offshoot of my "temporary" `BatchSizer`. Just for background, some time ago we realized that not all Drill rows are of the same width. (Who'd have guessed?) So, we added "temporary" code to work out sizes in each operator. This was supposed to be temporary until batches carried their own size info. (Once we figure it out for a vector, we don't have to throw it away and figure it out again for each operator.) In fact, I deliberately chose a goofy name "BatchSizer" to remind folks that this was a temporary hack to so I could focus on the "real work" of fixing the external sort. Sigh... But, the temporary solution seems to have become semi-permanent and has grown odd variations such as this. The "master plan" was to not try to predict batch sizes as we are doing here. Instead, the `ResultSetLoader` is intended to just let the operator write to the batch until we hit the desired limit. All the calcs are already done for the reader case. The goal was to use the same mechanism in other places were we don't know widths ahead of time. Project is the classic case: for all we know, the user is doing some silly function like repeating a big `VARCHAR` 100 times. So, if we can ever get there, all this temporary stuff will be swept away. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services