Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1228#discussion_r183264629

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
    @@ -277,18 +286,29 @@ public boolean isRepeatedList() {
         /**
          * This is the average per entry width, used for vector allocation.
          */
    -    public int getEntryWidth() {
    +    private int getEntryWidthForAlloc() {
           int width = 0;
           if (isVariableWidth) {
    -        width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
    +        width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
             // Subtract out the bits (is-set) vector width
    -        if (metadata.getDataMode() == DataMode.OPTIONAL) {
    +        if (isOptional) {
               width -= BIT_VECTOR_WIDTH;
             }
    +
    +        if (isRepeated && getValueCount() == 0) {
    +          return (safeDivide(width, STD_REPETITION_FACTOR));
    --- End diff --

    If the value count is zero, but the row count is non-zero, then a very low repetition rate is more realistic than 10. In earlier drafts, I found the repetition rate had to be a float since, in some data, the rate is something like 0.7 or 1.4. Rounding to an integer caused quite an error when multiplying by, say, 50K rows.

    Any reason we can't use the actual computed amount here? If we really have 1 or 2 rows, then a guess of 10 is fine. But, if we have 60K rows, with an actual estimate of 0, then guessing 10 will allocate 600K values when we probably needed close to 0. (Unless I'm missing something.)
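    To make the rounding concern concrete, here is a minimal sketch of the two estimation strategies. It is a hypothetical illustration: the names RepetitionRateExample, estimateWithFloatRate, and estimateWithIntRate are not from the Drill source; the numbers mirror the cases in the comment above.

        // Hypothetical illustration only: these names are not part of
        // RecordBatchSizer; they exist to show the arithmetic above.
        public class RepetitionRateExample {

          // Float rate preserved until the final multiply:
          // 0.7 entries/row over 50K rows -> 35,000 entries.
          static int estimateWithFloatRate(double ratePerRow, int rowCount) {
            return (int) Math.ceil(ratePerRow * rowCount);
          }

          // Rate rounded to an int first: 0.7 rounds up to 1, giving
          // 50,000 entries (a ~43% over-allocation); truncating instead
          // would give 0 entries (a fatal under-allocation).
          static int estimateWithIntRate(double ratePerRow, int rowCount) {
            return (int) Math.round(ratePerRow) * rowCount;
          }

          public static void main(String[] args) {
            System.out.println(estimateWithFloatRate(0.7, 50_000)); // 35000
            System.out.println(estimateWithIntRate(0.7, 50_000));   // 50000

            // The zero-value-count case: a fixed guess of 10 over 60K rows
            // allocates 600K values when close to 0 were needed.
            System.out.println(estimateWithIntRate(10.0, 60_000));  // 600000
          }
        }

    Keeping the rate as a float until the final multiply bounds the rounding error to at most one entry in total, whereas rounding the rate first scales the error by the row count.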
---