[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-04-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1181


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445213
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -463,4 +488,20 @@ protected boolean setupNewSchema() throws 
SchemaChangeException {
 }
 return exprs;
   }
+
+  private void updateStats() {
+stats.setLongStat(Metric.NUM_INCOMING_BATCHES, 
flattenMemoryManager.getNumIncomingBatches());
+stats.setLongStat(Metric.AVG_INPUT_BATCH_SIZE, 
flattenMemoryManager.getAvgInputBatchSize());
+stats.setLongStat(Metric.AVG_INPUT_ROW_WIDTH, 
flattenMemoryManager.getAvgInputRowWidth());
+stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, 
flattenMemoryManager.getTotalInputRecords());
+stats.setLongStat(Metric.NUM_OUTGOING_BATCHES, 
flattenMemoryManager.getNumOutgoingBatches());
+stats.setLongStat(Metric.AVG_OUTPUT_BATCH_SIZE, 
flattenMemoryManager.getAvgOutputBatchSize());
+stats.setLongStat(Metric.AVG_OUTPUT_ROW_WIDTH, 
flattenMemoryManager.getAvgOutputRowWidth());
+stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, 
flattenMemoryManager.getTotalOutputRecords());
+  }
+
+  @Override
+  public void close() {
--- End diff --

Seems resolved.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445205
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,7 +100,23 @@ private void clear() {
 }
   }
 
-  private class FlattenMemoryManager extends 
AbstractRecordBatchMemoryManager {
+  public enum Metric implements MetricDef {
+INPUT_BATCH_COUNT,
+AVG_INPUT_BATCH_BYTES,
+AVG_INPUT_ROW_BYTES,
+TOTAL_INPUT_RECORDS,
+OUTPUT_BATCH_COUNT,
+AVG_OUTPUT_BATCH_BYTES,
+AVG_OUTPUT_ROW_BYTES,
+TOTAL_OUTPUT_RECORDS;
--- End diff --

Very nice, these labels should be quite readable in the UI. (We should have 
an enum-to-label lookup table, but we don't...)

This pass I noticed the inconsistency between `INPUT_BATCH_COUNT` and 
`TOTAL_INPUT_RECORDS`. Perhaps change the latter to `INPUT_RECORD_COUNT`.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445267
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -573,4 +610,21 @@ private LogicalExpression 
materializeExpression(LogicalExpression expression, It
 }
 return materializedExpr;
   }
+
+  private void updateStats() {
+stats.setLongStat(MergeJoinBatch.Metric.LEFT_INPUT_BATCH_COUNT, 
mergeJoinMemoryManager.getNumIncomingBatches(LEFT_INDEX));
+stats.setLongStat(MergeJoinBatch.Metric.LEFT_AVG_INPUT_BATCH_BYTES, 
mergeJoinMemoryManager.getAvgInputBatchSize(LEFT_INDEX));
+stats.setLongStat(MergeJoinBatch.Metric.LEFT_AVG_INPUT_ROW_BYTES, 
mergeJoinMemoryManager.getAvgInputRowWidth(LEFT_INDEX));
+stats.setLongStat(MergeJoinBatch.Metric.LEFT_TOTAL_INPUT_RECORDS, 
mergeJoinMemoryManager.getTotalInputRecords(LEFT_INDEX));
+
+stats.setLongStat(MergeJoinBatch.Metric.RIGHT_INPUT_BATCH_COUNT, 
mergeJoinMemoryManager.getNumIncomingBatches(RIGHT_INDEX));
+stats.setLongStat(MergeJoinBatch.Metric.RIGHT_AVG_INPUT_BATCH_BYTES, 
mergeJoinMemoryManager.getAvgInputBatchSize(RIGHT_INDEX));
+stats.setLongStat(MergeJoinBatch.Metric.RIGHT_AVG_INPUT_ROW_BYTES, 
mergeJoinMemoryManager.getAvgInputRowWidth(RIGHT_INDEX));
+stats.setLongStat(MergeJoinBatch.Metric.RIGHT_TOTAL_INPUT_RECORDS, 
mergeJoinMemoryManager.getTotalInputRecords(RIGHT_INDEX));
+
+stats.setLongStat(MergeJoinBatch.Metric.OUTPUT_BATCH_COUNT, 
mergeJoinMemoryManager.getNumOutgoingBatches());
+stats.setLongStat(MergeJoinBatch.Metric.AVG_OUTPUT_BATCH_BYTES, 
mergeJoinMemoryManager.getAvgOutputBatchSize());
+stats.setLongStat(MergeJoinBatch.Metric.AVG_OUTPUT_ROW_BYTES, 
mergeJoinMemoryManager.getAvgOutputRowWidth());
+stats.setLongStat(MergeJoinBatch.Metric.TOTAL_OUTPUT_RECORDS, 
mergeJoinMemoryManager.getTotalOutputRecords());
--- End diff --

As above, maybe also write the information to the log for easier debugging.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445230
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -109,12 +110,37 @@
   private static final String LEFT_INPUT = "LEFT INPUT";
   private static final String RIGHT_INPUT = "RIGHT INPUT";
 
-  private class MergeJoinMemoryManager extends 
AbstractRecordBatchMemoryManager {
+  private static final int numInputs = 2;
+  private static final int LEFT_INDEX = 0;
+  private static final int RIGHT_INDEX = 1;
+
+  public enum Metric implements MetricDef {
+LEFT_INPUT_BATCH_COUNT,
+LEFT_AVG_INPUT_BATCH_BYTES,
+LEFT_AVG_INPUT_ROW_BYTES,
+LEFT_TOTAL_INPUT_RECORDS,
--- End diff --

`LEFT_INPUT_RECORD_COUNT`?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445224
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -463,4 +489,21 @@ protected boolean setupNewSchema() throws 
SchemaChangeException {
 }
 return exprs;
   }
+
+  private void updateStats() {
+stats.setLongStat(Metric.INPUT_BATCH_COUNT, 
flattenMemoryManager.getNumIncomingBatches());
+stats.setLongStat(Metric.AVG_INPUT_BATCH_BYTES, 
flattenMemoryManager.getAvgInputBatchSize());
+stats.setLongStat(Metric.AVG_INPUT_ROW_BYTES, 
flattenMemoryManager.getAvgInputRowWidth());
+stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, 
flattenMemoryManager.getTotalInputRecords());
+stats.setLongStat(Metric.OUTPUT_BATCH_COUNT, 
flattenMemoryManager.getNumOutgoingBatches());
+stats.setLongStat(Metric.AVG_OUTPUT_BATCH_BYTES, 
flattenMemoryManager.getAvgOutputBatchSize());
+stats.setLongStat(Metric.AVG_OUTPUT_ROW_BYTES, 
flattenMemoryManager.getAvgOutputRowWidth());
+stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, 
flattenMemoryManager.getTotalOutputRecords());
--- End diff --

Not critical, but I found it super helpful to dump this information into 
the log as well as as metrics. That way, when some issue arises, all the 
information is in the logs, I didn't need to track down both the logs and the 
query profile.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445244
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -127,17 +153,20 @@
 @Override
 public void update(int inputIndex) {
   switch(inputIndex) {
-case 0:
-  leftSizer = new RecordBatchSizer(left);
-  leftRowWidth = leftSizer.netRowWidth();
+case LEFT_INDEX:
+  setRecordBatchSizer(inputIndex, new RecordBatchSizer(left));
+  leftRowWidth = getRecordBatchSizer(inputIndex).netRowWidth();
+  logger.debug("mergejoin left incoming batch sizer : {}", 
getRecordBatchSizer(inputIndex));
--- End diff --

Maybe "Left incoming batch size"? (The log will tell you it is the merge 
join. The "sizer" is an implementation detail.)


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-31 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r178445262
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -153,16 +182,22 @@ public void update(int inputIndex) {
   // calculate memory used so far based on previous outgoing row width 
and how many rows we already processed.
   final long memoryUsed = status.getOutPosition() * 
getOutgoingRowWidth();
   // This is the remaining memory.
-  final long remainingMemory = 
Math.max(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR - memoryUsed, 0);
+  final long remainingMemory = Math.max(outputBatchSize - memoryUsed, 
0);
   // These are number of rows we can fit in remaining memory based on 
new outgoing row width.
   final int numOutputRowsRemaining = 
RecordBatchSizer.safeDivide(remainingMemory, newOutgoingRowWidth);
 
-  status.setTargetOutputRowCount(status.getOutPosition() + 
numOutputRowsRemaining);
+  
status.setTargetOutputRowCount(adjustOutputRowCount(status.getOutPosition() + 
numOutputRowsRemaining));
   setOutgoingRowWidth(newOutgoingRowWidth);
+
+  logger.debug("outputBatchSize : {}, avgOutgoingRowWidth : {}, 
outputRowCount : {}",
--- End diff --

Since this is a label, not a symbol, maybe "output batch size: {}"


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-28 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177825111
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
+AVG_INPUT_ROW_WIDTH,
--- End diff --

Parallel here simply means the same name: "INCOMING"/"INPUT" --> "INPUT", 
"OUTGOING"/"OUTPUT" --> "OUTPUT". Same name in each term.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-28 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177810092
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
+AVG_INPUT_ROW_WIDTH,
--- End diff --

@paul-rogers Paul, I did not understand what you mean by parallel here and 
below. Do you mean they should be adjacent columns in the web UI ?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617185
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
--- End diff --

Not clear what units `SIZE` is in. Maybe `AVG_INPUT_BATCH_BYTES` (if it is 
bytes)?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617922
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java
 ---
@@ -29,6 +29,50 @@
   private int outgoingRowWidth;
   private RecordBatchSizer sizer;
 
+  /**
+   * operator metric stats
+   */
+  private long numIncomingBatches;
+  private long sumInputBatchSizes;
+  private long sumInputRowWidths;
+  private long totalInputRecords;
+  private long numOutgoingBatches;
+  private long sumOutputBatchSizes;
+  private long sumOutputRowWidths;
+  private long totalOutputRecords;
+
+  public long getNumIncomingBatches() {
+return numIncomingBatches;
+  }
+
+  public long getTotalInputRecords() {
+return totalInputRecords;
+  }
+
+  public long getNumOutgoingBatches() {
+return numOutgoingBatches;
+  }
+
+  public long getTotalOutputRecords() {
+return totalOutputRecords;
+  }
+
+  public long getAvgInputBatchSize() {
+return RecordBatchSizer.safeDivide(sumInputBatchSizes, 
numIncomingBatches);
+  }
+
+  public long getAvgInputRowWidth() {
+return RecordBatchSizer.safeDivide(sumInputRowWidths, 
numIncomingBatches);
--- End diff --

See below regarding the average calculation.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617262
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
+AVG_INPUT_ROW_WIDTH,
+TOTAL_INPUT_RECORDS,
+NUM_OUTGOING_BATCHES,
--- End diff --

`OUTPUT_BATCH_COUNT` as per comments above?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617387
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -463,4 +488,20 @@ protected boolean setupNewSchema() throws 
SchemaChangeException {
 }
 return exprs;
   }
+
+  private void updateStats() {
+stats.setLongStat(Metric.NUM_INCOMING_BATCHES, 
flattenMemoryManager.getNumIncomingBatches());
+stats.setLongStat(Metric.AVG_INPUT_BATCH_SIZE, 
flattenMemoryManager.getAvgInputBatchSize());
+stats.setLongStat(Metric.AVG_INPUT_ROW_WIDTH, 
flattenMemoryManager.getAvgInputRowWidth());
+stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, 
flattenMemoryManager.getTotalInputRecords());
+stats.setLongStat(Metric.NUM_OUTGOING_BATCHES, 
flattenMemoryManager.getNumOutgoingBatches());
+stats.setLongStat(Metric.AVG_OUTPUT_BATCH_SIZE, 
flattenMemoryManager.getAvgOutputBatchSize());
+stats.setLongStat(Metric.AVG_OUTPUT_ROW_WIDTH, 
flattenMemoryManager.getAvgOutputRowWidth());
+stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, 
flattenMemoryManager.getTotalOutputRecords());
+  }
+
+  @Override
+  public void close() {
--- End diff --

Is `super.close()` needed?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617131
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
--- End diff --

Also, maybe `INPUT_BATCH_COUNT` to be parallel with `AVG_INPUT_BATCH_SIZE`?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617860
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java
 ---
@@ -29,6 +29,50 @@
   private int outgoingRowWidth;
   private RecordBatchSizer sizer;
 
+  /**
+   * operator metric stats
+   */
+  private long numIncomingBatches;
+  private long sumInputBatchSizes;
+  private long sumInputRowWidths;
+  private long totalInputRecords;
+  private long numOutgoingBatches;
+  private long sumOutputBatchSizes;
+  private long sumOutputRowWidths;
+  private long totalOutputRecords;
+
+  public long getNumIncomingBatches() {
+return numIncomingBatches;
+  }
+
+  public long getTotalInputRecords() {
+return totalInputRecords;
+  }
+
+  public long getNumOutgoingBatches() {
+return numOutgoingBatches;
+  }
+
+  public long getTotalOutputRecords() {
+return totalOutputRecords;
+  }
+
+  public long getAvgInputBatchSize() {
+return RecordBatchSizer.safeDivide(sumInputBatchSizes, 
numIncomingBatches);
+  }
+
+  public long getAvgInputRowWidth() {
+return RecordBatchSizer.safeDivide(sumInputRowWidths, 
numIncomingBatches);
+  }
+
+  public long getAvgOutputBatchSize() {
+return RecordBatchSizer.safeDivide(sumOutputBatchSizes, 
numOutgoingBatches);
+  }
+
+  public long getAvgOutputRowWidth() {
+return RecordBatchSizer.safeDivide(sumOutputRowWidths, 
numOutgoingBatches);
--- End diff --

Not sure if it really matters, but this calculation is not accurate. This 
is an unweighted average. The actual width requires a weighted average. I have 
one batch with a row of 1 MB in size, and another batch of 1K rows of 1K each. 
The average row width is actually ~2K, not ~500K.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177616893
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
--- End diff --

Strangely, these enum names show up as labels in the web UI. Maybe 
`INCOMING_BATCH_COUNT`? The `AVG_` and `TOTAL_` names seem readable enough...


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-27 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177617047
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+NUM_INCOMING_BATCHES,
+AVG_INPUT_BATCH_SIZE,
+AVG_INPUT_ROW_WIDTH,
--- End diff --

`AVG_INPUT_ROW_WIDTH` to be parallel with `AVG_OUTPUT_ROW_WIDTH`?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-21 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1181

DRILL-6284: Add operator metrics for batch sizing for flatten

@kkhatua please review.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6284

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1181


commit f0b7bed20aef64cdc9e025a5ca209e1ad6220aa6
Author: Padma Penumarthy 
Date:   2018-03-20T20:44:50Z

DRILL-6284: Add operator metrics for batch sizing for flatten




---