[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/1181 ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445213 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -463,4 +488,20 @@ protected boolean setupNewSchema() throws SchemaChangeException { } return exprs; } + + private void updateStats() { +stats.setLongStat(Metric.NUM_INCOMING_BATCHES, flattenMemoryManager.getNumIncomingBatches()); +stats.setLongStat(Metric.AVG_INPUT_BATCH_SIZE, flattenMemoryManager.getAvgInputBatchSize()); +stats.setLongStat(Metric.AVG_INPUT_ROW_WIDTH, flattenMemoryManager.getAvgInputRowWidth()); +stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, flattenMemoryManager.getTotalInputRecords()); +stats.setLongStat(Metric.NUM_OUTGOING_BATCHES, flattenMemoryManager.getNumOutgoingBatches()); +stats.setLongStat(Metric.AVG_OUTPUT_BATCH_SIZE, flattenMemoryManager.getAvgOutputBatchSize()); +stats.setLongStat(Metric.AVG_OUTPUT_ROW_WIDTH, flattenMemoryManager.getAvgOutputRowWidth()); +stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, flattenMemoryManager.getTotalOutputRecords()); + } + + @Override + public void close() { --- End diff -- Seems resolved. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445205 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,7 +100,23 @@ private void clear() { } } - private class FlattenMemoryManager extends AbstractRecordBatchMemoryManager { + public enum Metric implements MetricDef { +INPUT_BATCH_COUNT, +AVG_INPUT_BATCH_BYTES, +AVG_INPUT_ROW_BYTES, +TOTAL_INPUT_RECORDS, +OUTPUT_BATCH_COUNT, +AVG_OUTPUT_BATCH_BYTES, +AVG_OUTPUT_ROW_BYTES, +TOTAL_OUTPUT_RECORDS; --- End diff -- Very nice, these labels should be quite readable in the UI. (We should have an enum-to-label lookup table, but we don't...) This pass I noticed the inconsistency between `INPUT_BATCH_COUNT` and `TOTAL_INPUT_RECORDS`. Perhaps change the latter to `INPUT_RECORD_COUNT`. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445267 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -573,4 +610,21 @@ private LogicalExpression materializeExpression(LogicalExpression expression, It } return materializedExpr; } + + private void updateStats() { +stats.setLongStat(MergeJoinBatch.Metric.LEFT_INPUT_BATCH_COUNT, mergeJoinMemoryManager.getNumIncomingBatches(LEFT_INDEX)); +stats.setLongStat(MergeJoinBatch.Metric.LEFT_AVG_INPUT_BATCH_BYTES, mergeJoinMemoryManager.getAvgInputBatchSize(LEFT_INDEX)); +stats.setLongStat(MergeJoinBatch.Metric.LEFT_AVG_INPUT_ROW_BYTES, mergeJoinMemoryManager.getAvgInputRowWidth(LEFT_INDEX)); +stats.setLongStat(MergeJoinBatch.Metric.LEFT_TOTAL_INPUT_RECORDS, mergeJoinMemoryManager.getTotalInputRecords(LEFT_INDEX)); + +stats.setLongStat(MergeJoinBatch.Metric.RIGHT_INPUT_BATCH_COUNT, mergeJoinMemoryManager.getNumIncomingBatches(RIGHT_INDEX)); +stats.setLongStat(MergeJoinBatch.Metric.RIGHT_AVG_INPUT_BATCH_BYTES, mergeJoinMemoryManager.getAvgInputBatchSize(RIGHT_INDEX)); +stats.setLongStat(MergeJoinBatch.Metric.RIGHT_AVG_INPUT_ROW_BYTES, mergeJoinMemoryManager.getAvgInputRowWidth(RIGHT_INDEX)); +stats.setLongStat(MergeJoinBatch.Metric.RIGHT_TOTAL_INPUT_RECORDS, mergeJoinMemoryManager.getTotalInputRecords(RIGHT_INDEX)); + +stats.setLongStat(MergeJoinBatch.Metric.OUTPUT_BATCH_COUNT, mergeJoinMemoryManager.getNumOutgoingBatches()); +stats.setLongStat(MergeJoinBatch.Metric.AVG_OUTPUT_BATCH_BYTES, mergeJoinMemoryManager.getAvgOutputBatchSize()); +stats.setLongStat(MergeJoinBatch.Metric.AVG_OUTPUT_ROW_BYTES, mergeJoinMemoryManager.getAvgOutputRowWidth()); +stats.setLongStat(MergeJoinBatch.Metric.TOTAL_OUTPUT_RECORDS, mergeJoinMemoryManager.getTotalOutputRecords()); --- End diff -- As above, maybe also write the information to the log for easier debugging. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445230 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -109,12 +110,37 @@ private static final String LEFT_INPUT = "LEFT INPUT"; private static final String RIGHT_INPUT = "RIGHT INPUT"; - private class MergeJoinMemoryManager extends AbstractRecordBatchMemoryManager { + private static final int numInputs = 2; + private static final int LEFT_INDEX = 0; + private static final int RIGHT_INDEX = 1; + + public enum Metric implements MetricDef { +LEFT_INPUT_BATCH_COUNT, +LEFT_AVG_INPUT_BATCH_BYTES, +LEFT_AVG_INPUT_ROW_BYTES, +LEFT_TOTAL_INPUT_RECORDS, --- End diff -- `LEFT_INPUT_RECORD_COUNT`? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445224 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -463,4 +489,21 @@ protected boolean setupNewSchema() throws SchemaChangeException { } return exprs; } + + private void updateStats() { +stats.setLongStat(Metric.INPUT_BATCH_COUNT, flattenMemoryManager.getNumIncomingBatches()); +stats.setLongStat(Metric.AVG_INPUT_BATCH_BYTES, flattenMemoryManager.getAvgInputBatchSize()); +stats.setLongStat(Metric.AVG_INPUT_ROW_BYTES, flattenMemoryManager.getAvgInputRowWidth()); +stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, flattenMemoryManager.getTotalInputRecords()); +stats.setLongStat(Metric.OUTPUT_BATCH_COUNT, flattenMemoryManager.getNumOutgoingBatches()); +stats.setLongStat(Metric.AVG_OUTPUT_BATCH_BYTES, flattenMemoryManager.getAvgOutputBatchSize()); +stats.setLongStat(Metric.AVG_OUTPUT_ROW_BYTES, flattenMemoryManager.getAvgOutputRowWidth()); +stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, flattenMemoryManager.getTotalOutputRecords()); --- End diff -- Not critical, but I found it super helpful to dump this information into the log as well as as metrics. That way, when some issue arises, all the information is in the logs, I didn't need to track down both the logs and the query profile. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445244 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -127,17 +153,20 @@ @Override public void update(int inputIndex) { switch(inputIndex) { -case 0: - leftSizer = new RecordBatchSizer(left); - leftRowWidth = leftSizer.netRowWidth(); +case LEFT_INDEX: + setRecordBatchSizer(inputIndex, new RecordBatchSizer(left)); + leftRowWidth = getRecordBatchSizer(inputIndex).netRowWidth(); + logger.debug("mergejoin left incoming batch sizer : {}", getRecordBatchSizer(inputIndex)); --- End diff -- Maybe "Left incoming batch size"? (The log will tell you it is the merge join. The "sizer" is an implementation detail.) ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r178445262 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -153,16 +182,22 @@ public void update(int inputIndex) { // calculate memory used so far based on previous outgoing row width and how many rows we already processed. final long memoryUsed = status.getOutPosition() * getOutgoingRowWidth(); // This is the remaining memory. - final long remainingMemory = Math.max(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR - memoryUsed, 0); + final long remainingMemory = Math.max(outputBatchSize - memoryUsed, 0); // These are number of rows we can fit in remaining memory based on new outgoing row width. final int numOutputRowsRemaining = RecordBatchSizer.safeDivide(remainingMemory, newOutgoingRowWidth); - status.setTargetOutputRowCount(status.getOutPosition() + numOutputRowsRemaining); + status.setTargetOutputRowCount(adjustOutputRowCount(status.getOutPosition() + numOutputRowsRemaining)); setOutgoingRowWidth(newOutgoingRowWidth); + + logger.debug("outputBatchSize : {}, avgOutgoingRowWidth : {}, outputRowCount : {}", --- End diff -- Since this is a label, not a symbol, maybe "output batch size: {}" ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177825111 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, +AVG_INPUT_BATCH_SIZE, +AVG_INPUT_ROW_WIDTH, --- End diff -- Parallel here simply means the same name: "INCOMING"/"INPUT" --> "INPUT", "OUTGOING"/"OUTPUT" --> "OUTPUT". Same name in each term. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177810092 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, +AVG_INPUT_BATCH_SIZE, +AVG_INPUT_ROW_WIDTH, --- End diff -- @paul-rogers Paul, I did not understand what you mean by parallel here and below. Do you mean they should be adjacent columns in the web UI ? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617185 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, +AVG_INPUT_BATCH_SIZE, --- End diff -- Not clear what units `SIZE` is in. Maybe `AVG_INPUT_BATCH_BYTES` (if it is bytes)? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617922 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java --- @@ -29,6 +29,50 @@ private int outgoingRowWidth; private RecordBatchSizer sizer; + /** + * operator metric stats + */ + private long numIncomingBatches; + private long sumInputBatchSizes; + private long sumInputRowWidths; + private long totalInputRecords; + private long numOutgoingBatches; + private long sumOutputBatchSizes; + private long sumOutputRowWidths; + private long totalOutputRecords; + + public long getNumIncomingBatches() { +return numIncomingBatches; + } + + public long getTotalInputRecords() { +return totalInputRecords; + } + + public long getNumOutgoingBatches() { +return numOutgoingBatches; + } + + public long getTotalOutputRecords() { +return totalOutputRecords; + } + + public long getAvgInputBatchSize() { +return RecordBatchSizer.safeDivide(sumInputBatchSizes, numIncomingBatches); + } + + public long getAvgInputRowWidth() { +return RecordBatchSizer.safeDivide(sumInputRowWidths, numIncomingBatches); --- End diff -- See below regarding the average calculation. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617262 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, +AVG_INPUT_BATCH_SIZE, +AVG_INPUT_ROW_WIDTH, +TOTAL_INPUT_RECORDS, +NUM_OUTGOING_BATCHES, --- End diff -- `OUTPUT_BATCH_COUNT` as per comments above? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617387 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -463,4 +488,20 @@ protected boolean setupNewSchema() throws SchemaChangeException { } return exprs; } + + private void updateStats() { +stats.setLongStat(Metric.NUM_INCOMING_BATCHES, flattenMemoryManager.getNumIncomingBatches()); +stats.setLongStat(Metric.AVG_INPUT_BATCH_SIZE, flattenMemoryManager.getAvgInputBatchSize()); +stats.setLongStat(Metric.AVG_INPUT_ROW_WIDTH, flattenMemoryManager.getAvgInputRowWidth()); +stats.setLongStat(Metric.TOTAL_INPUT_RECORDS, flattenMemoryManager.getTotalInputRecords()); +stats.setLongStat(Metric.NUM_OUTGOING_BATCHES, flattenMemoryManager.getNumOutgoingBatches()); +stats.setLongStat(Metric.AVG_OUTPUT_BATCH_SIZE, flattenMemoryManager.getAvgOutputBatchSize()); +stats.setLongStat(Metric.AVG_OUTPUT_ROW_WIDTH, flattenMemoryManager.getAvgOutputRowWidth()); +stats.setLongStat(Metric.TOTAL_OUTPUT_RECORDS, flattenMemoryManager.getTotalOutputRecords()); + } + + @Override + public void close() { --- End diff -- Is `super.close()` needed? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617131 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, --- End diff -- Also, maybe `INPUT_BATCH_COUNT` to be parallel with `AVG_INPUT_BATCH_SIZE`? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617860 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java --- @@ -29,6 +29,50 @@ private int outgoingRowWidth; private RecordBatchSizer sizer; + /** + * operator metric stats + */ + private long numIncomingBatches; + private long sumInputBatchSizes; + private long sumInputRowWidths; + private long totalInputRecords; + private long numOutgoingBatches; + private long sumOutputBatchSizes; + private long sumOutputRowWidths; + private long totalOutputRecords; + + public long getNumIncomingBatches() { +return numIncomingBatches; + } + + public long getTotalInputRecords() { +return totalInputRecords; + } + + public long getNumOutgoingBatches() { +return numOutgoingBatches; + } + + public long getTotalOutputRecords() { +return totalOutputRecords; + } + + public long getAvgInputBatchSize() { +return RecordBatchSizer.safeDivide(sumInputBatchSizes, numIncomingBatches); + } + + public long getAvgInputRowWidth() { +return RecordBatchSizer.safeDivide(sumInputRowWidths, numIncomingBatches); + } + + public long getAvgOutputBatchSize() { +return RecordBatchSizer.safeDivide(sumOutputBatchSizes, numOutgoingBatches); + } + + public long getAvgOutputRowWidth() { +return RecordBatchSizer.safeDivide(sumOutputRowWidths, numOutgoingBatches); --- End diff -- Not sure if it really matters, but this calculation is not accurate. This is an unweighted average. The actual width requires a weighted average. I have one batch with a row of 1 MB in size, and another batch of 1K rows of 1K each. The average row width is actually ~2K, not ~500K. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177616893 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, --- End diff -- Strangely, these enum names show up as labels in the web UI. Maybe `INCOMING_BATCH_COUNT`? The `AVG_` and `TOTAL_` names seem readable enough... ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177617047 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, +AVG_INPUT_BATCH_SIZE, +AVG_INPUT_ROW_WIDTH, --- End diff -- `AVG_INPUT_ROW_WIDTH` to be parallel with `AVG_OUTPUT_ROW_WIDTH`? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1181 DRILL-6284: Add operator metrics for batch sizing for flatten @kkhatua please review. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6284 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1181.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1181 commit f0b7bed20aef64cdc9e025a5ca209e1ad6220aa6 Author: Padma PenumarthyDate: 2018-03-20T20:44:50Z DRILL-6284: Add operator metrics for batch sizing for flatten ---