paul-rogers commented on a change in pull request #1944: DRILL-7503: Refactor
the project operator
URL: https://github.com/apache/drill/pull/1944#discussion_r362702109
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectMemoryManager.java
##########
@@ -42,307 +44,310 @@
import java.util.Map;
/**
- *
- * ProjectMemoryManager(PMM) is used to estimate the size of rows produced by
ProjectRecordBatch.
- * The PMM works as follows:
- *
- * Setup phase: As and when ProjectRecordBatch creates or transfers a field,
it registers the field with PMM.
- * If the field is a variable width field, PMM records the expression that
produces the variable
- * width field. The expression is a tree of LogicalExpressions. The PMM walks
this tree of LogicalExpressions
- * to produce a tree of OutputWidthExpressions. The widths of Fixed width
fields are just accumulated into a single
- * total. Note: The PMM, currently, cannot handle new complex fields, it just
uses a hard-coded estimate for such fields.
- *
- *
- * Execution phase: Just before a batch is processed by Project, the PMM walks
the tree of OutputWidthExpressions
- * and converts them to FixedWidthExpressions. It uses the RecordBatchSizer
and the function annotations to do this conversion.
- * See OutputWidthVisitor for details.
+ * ProjectMemoryManager(PMM) is used to estimate the size of rows produced by
+ * ProjectRecordBatch. The PMM works as follows:
+ * <p>
+ * Setup phase: As and when ProjectRecordBatch creates or transfers a field, it
+ * registers the field with PMM. If the field is a variable width field, PMM
+ * records the expression that produces the variable width field. The
expression
+ * is a tree of LogicalExpressions. The PMM walks this tree of
+ * LogicalExpressions to produce a tree of OutputWidthExpressions. The widths
of
+ * Fixed width fields are just accumulated into a single total. Note: The PMM,
+ * currently, cannot handle new complex fields, it just uses a hard-coded
+ * estimate for such fields.
+ * <p>
+ * Execution phase: Just before a batch is processed by Project, the PMM walks
+ * the tree of OutputWidthExpressions and converts them to
+ * FixedWidthExpressions. It uses the RecordBatchSizer and the function
+ * annotations to do this conversion. See OutputWidthVisitor for details.
*/
public class ProjectMemoryManager extends RecordBatchMemoryManager {
- static final org.slf4j.Logger logger =
org.slf4j.LoggerFactory.getLogger(ProjectMemoryManager.class);
-
- public RecordBatch getIncomingBatch() {
- return incomingBatch;
+ static final Logger logger =
LoggerFactory.getLogger(ProjectMemoryManager.class);
+
+ private RecordBatch incomingBatch;
+ private ProjectRecordBatch outgoingBatch;
+
+ private int rowWidth;
+ private final Map<String, ColumnWidthInfo> outputColumnSizes;
+ // Number of variable width columns in the batch
+ private int variableWidthColumnCount;
+ // Number of fixed width columns in the batch
+ private int fixedWidthColumnCount;
+ // Number of complex columns in the batch
+ private int complexColumnsCount;
+
+ // Holds sum of all fixed width column widths
+ private int totalFixedWidthColumnWidth;
+ // Holds sum of all complex column widths
+ // Currently, this is just a guess
+ private int totalComplexColumnWidth;
+
+ private enum WidthType {
+ FIXED,
+ VARIABLE
+ }
Review comment:
Done. I tried to limit the "blast radius" by gritting my teeth and leaving
this class as it was. This seems to be some offshoot of my "temporary"
`BatchSizer`. Just for background, some time ago we realized that not all Drill
rows are of the same width. (Who'd have guessed?) So, we added "temporary" code
to work out sizes in each operator. This was supposed to be temporary until
batches carried their own size info. (Once we figure it out for a vector, we
don't have to throw it away and figure it out again for each operator.)
In fact, I deliberately chose a goofy name "BatchSizer" to remind folks that
this was a temporary hack to so I could focus on the "real work" of fixing the
external sort. Sigh...
But, the temporary solution seems to have become semi-permanent and has
grown odd variations such as this.
The "master plan" was to not try to predict batch sizes as we are doing
here. Instead, the `ResultSetLoader` is intended to just let the operator write
to the batch until we hit the desired limit. All the calcs are already done for
the reader case. The goal was to use the same mechanism in other places were we
don't know widths ahead of time. Project is the classic case: for all we know,
the user is doing some silly function like repeating a big `VARCHAR` 100 times.
So, if we can ever get there, all this temporary stuff will be swept away.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services