[GitHub] [drill] paul-rogers commented on a change in pull request #1944: DRILL-7503: Refactor the project operator

GitBox Thu, 02 Jan 2020 20:54:27 -0800

paul-rogers commented on a change in pull request #1944: DRILL-7503: Refactor 
the project operator
URL: https://github.com/apache/drill/pull/1944#discussion_r362702109


 ##########
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectMemoryManager.java
 ##########
 @@ -42,307 +44,310 @@
 import java.util.Map;
 
 /**
- *
- * ProjectMemoryManager(PMM) is used to estimate the size of rows produced by 
ProjectRecordBatch.
- * The PMM works as follows:
- *
- * Setup phase: As and when ProjectRecordBatch creates or transfers a field, 
it registers the field with PMM.
- * If the field is a variable width field, PMM records the expression that 
produces the variable
- * width field. The expression is a tree of LogicalExpressions. The PMM walks 
this tree of LogicalExpressions
- * to produce a tree of OutputWidthExpressions. The widths of Fixed width 
fields are just accumulated into a single
- * total. Note: The PMM, currently, cannot handle new complex fields, it just 
uses a hard-coded estimate for such fields.
- *
- *
- * Execution phase: Just before a batch is processed by Project, the PMM walks 
the tree of OutputWidthExpressions
- * and converts them to FixedWidthExpressions. It uses the RecordBatchSizer 
and the function annotations to do this conversion.
- * See OutputWidthVisitor for details.
+ * ProjectMemoryManager(PMM) is used to estimate the size of rows produced by
+ * ProjectRecordBatch. The PMM works as follows:
+ * <p>
+ * Setup phase: As and when ProjectRecordBatch creates or transfers a field, it
+ * registers the field with PMM. If the field is a variable width field, PMM
+ * records the expression that produces the variable width field. The 
expression
+ * is a tree of LogicalExpressions. The PMM walks this tree of
+ * LogicalExpressions to produce a tree of OutputWidthExpressions. The widths 
of
+ * Fixed width fields are just accumulated into a single total. Note: The PMM,
+ * currently, cannot handle new complex fields, it just uses a hard-coded
+ * estimate for such fields.
+ * <p>
+ * Execution phase: Just before a batch is processed by Project, the PMM walks
+ * the tree of OutputWidthExpressions and converts them to
+ * FixedWidthExpressions. It uses the RecordBatchSizer and the function
+ * annotations to do this conversion. See OutputWidthVisitor for details.
  */
 public class ProjectMemoryManager extends RecordBatchMemoryManager {
 
-    static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ProjectMemoryManager.class);
-
-    public RecordBatch getIncomingBatch() {
-        return incomingBatch;
+  static final Logger logger = 
LoggerFactory.getLogger(ProjectMemoryManager.class);
+
+  private RecordBatch incomingBatch;
+  private ProjectRecordBatch outgoingBatch;
+
+  private int rowWidth;
+  private final Map<String, ColumnWidthInfo> outputColumnSizes;
+  // Number of variable width columns in the batch
+  private int variableWidthColumnCount;
+  // Number of fixed width columns in the batch
+  private int fixedWidthColumnCount;
+  // Number of complex columns in the batch
+  private int complexColumnsCount;
+
+  // Holds sum of all fixed width column widths
+  private int totalFixedWidthColumnWidth;
+  // Holds sum of all complex column widths
+  // Currently, this is just a guess
+  private int totalComplexColumnWidth;
+
+  private enum WidthType {
+      FIXED,
+      VARIABLE
+  }
 
 Review comment:
   Done. I tried to limit the "blast radius" by gritting my teeth and leaving 
this class as it was. This seems to be some offshoot of my "temporary" 
`BatchSizer`. Just for background, some time ago we realized that not all Drill 
rows are of the same width. (Who'd have guessed?) So, we added "temporary" code 
to work out sizes in each operator. This was supposed to be temporary until 
batches carried their own size info. (Once we figure it out for a vector, we 
don't have to throw it away and figure it out again for each operator.)
   
   In fact, I deliberately chose a goofy name "BatchSizer" to remind folks that 
this was a temporary hack to so I could focus on the "real work" of fixing the 
external sort. Sigh...
   
   But, the temporary solution seems to have become semi-permanent and has 
grown odd variations such as this.
   
   The "master plan" was to not try to predict batch sizes as we are doing 
here. Instead, the `ResultSetLoader` is intended to just let the operator write 
to the batch until we hit the desired limit. All the calcs are already done for 
the reader case. The goal was to use the same mechanism in other places were we 
don't know widths ahead of time. Project is the classic case: for all we know, 
the user is doing some silly function like repeating a big `VARCHAR` 100 times. 
So, if we can ever get there, all this temporary stuff will be swept away.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on a change in pull request #1944: DRILL-7503: Refactor the project operator

Reply via email to