[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187249#comment-16187249 ] ASF GitHub Bot commented on DRILL-5758: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/932 > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Labels: ready-to-commit > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187235#comment-16187235 ] ASF GitHub Bot commented on DRILL-5758: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/932 Squashed commits. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Labels: ready-to-commit > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187233#comment-16187233 ] ASF GitHub Bot commented on DRILL-5758: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/932 Rebased onto master. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Labels: ready-to-commit > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187231#comment-16187231 ] ASF GitHub Bot commented on DRILL-5758: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/932#discussion_r142017748 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -74,53 +74,52 @@ public final int estSize; /** - * Number of times the value here (possibly repeated) appears in - * the record batch. + * Number of occurrences of the value in the batch. This is trivial + * for top-level scalars: it is the record count. For a top-level + * repeated vector, this is the number of arrays, also the record + * count. For a value nested inside a repeated map, it is the + * total number of values across all maps, and may be less than, + * greater than (but unlikely) same as the row count. */ public final int valueCount; /** - * The number of elements in the value vector. Consider two cases. - * A required or nullable vector has one element per row, so the - * entryCount is the same as the valueCount (which, - * in turn, is the same as the row count.) But, if this vector is an - * array, then the valueCount is the number of columns, while - * entryCount is the total number of elements in all the arrays - * that make up the columns, so entryCount will be different than - * the valueCount (normally larger, but possibly smaller if most - * arrays are empty. - * - * Finally, the column may be part of another list. In this case, the above - * logic still applies, but the valueCount is the number of entries - * in the outer array, not the row count. + * Total number of elements for a repeated type, or 1 if this is + * a non-repeated type. That is, a batch of 100 rows may have an + * array with 10 elements per row. In this case, the element count + * is 1000. */ -public int entryCount; +public final int elementCount; --- End diff -- Good point. However, a single batch of greater than 2 GB is far more than the sort can handle, so we'd not even get this far if the batch was this large. Still, the point is valid, so a new commit changes batch size variables from int to long. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Labels: ready-to-commit > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181629#comment-16181629 ] ASF GitHub Bot commented on DRILL-5758: --- Github user amansinha100 commented on the issue: https://github.com/apache/drill/pull/932 Changes LGTM. +1 > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181626#comment-16181626 ] ASF GitHub Bot commented on DRILL-5758: --- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/932#discussion_r141191139 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -74,53 +74,52 @@ public final int estSize; /** - * Number of times the value here (possibly repeated) appears in - * the record batch. + * Number of occurrences of the value in the batch. This is trivial + * for top-level scalars: it is the record count. For a top-level + * repeated vector, this is the number of arrays, also the record + * count. For a value nested inside a repeated map, it is the + * total number of values across all maps, and may be less than, + * greater than (but unlikely) same as the row count. */ public final int valueCount; /** - * The number of elements in the value vector. Consider two cases. - * A required or nullable vector has one element per row, so the - * entryCount is the same as the valueCount (which, - * in turn, is the same as the row count.) But, if this vector is an - * array, then the valueCount is the number of columns, while - * entryCount is the total number of elements in all the arrays - * that make up the columns, so entryCount will be different than - * the valueCount (normally larger, but possibly smaller if most - * arrays are empty. - * - * Finally, the column may be part of another list. In this case, the above - * logic still applies, but the valueCount is the number of entries - * in the outer array, not the row count. + * Total number of elements for a repeated type, or 1 if this is + * a non-repeated type. That is, a batch of 100 rows may have an + * array with 10 elements per row. In this case, the element count + * is 1000. */ -public int entryCount; +public final int elementCount; --- End diff -- Not related to elementCount per-se but I see that netBatchSize and accountedMemorySize are integers. These could overflow depending on number of columns. Should they be longs ? > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178455#comment-16178455 ] ASF GitHub Bot commented on DRILL-5758: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/932 Rebased onto latest master. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178075#comment-16178075 ] ASF GitHub Bot commented on DRILL-5758: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/932 Added a fix for repeated columns that have a low cardinality. If, say, one row in ten has an array entry, then average cardinality (values per row) is 0.1. This was represented by an int, rounded to 0 and caused a zero-length vector to be allocated. Drill then tried to double the length, which resulted in 0, which was doubled again, and so on forever. The fix has three parts: * The "record batch sizer" uses floats to allow fractional cardinality. * The vector initializer now works with fractional cardinality. * If all else fails, if a fixed-width vector is asked to double from zero, it sizes the vector to 256 bytes. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160517#comment-16160517 ] ASF GitHub Bot commented on DRILL-5758: --- Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/932#discussion_r137963963 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java --- @@ -680,9 +680,11 @@ public MergeTask consolidateBatches(long allocMemory, int inMemCount, int spille // spilled run. // Maximum spill batches that fit into available memory. +// Use the maximum buffer size since spill batches seem to +// be read with almost 50% internal fragmentation. int memMergeLimit = (int) ((mergeMemoryLimit - allocMemory) / -spillBatchSize.expectedBufferSize); +spillBatchSize.maxBufferSize); memMergeLimit = Math.max(0, memMergeLimit); --- End diff -- This line of code is not needed ( Math.max() ), as the code above (around line 675) ensures that mergeMemoryLimit >= allocMemory . > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160158#comment-16160158 ] ASF GitHub Bot commented on DRILL-5758: --- Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/932 @Ben-Zvi, added another fix. Please take another look. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151080#comment-16151080 ] Paul Rogers commented on DRILL-5758: With most issues now resolved (or close to it), we have enabled the "managed" sort by default. The boot-time option is removed. The system/session option continues to exist to disable the managed sort if desired. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151079#comment-16151079 ] ASF GitHub Bot commented on DRILL-5758: --- GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/932 DRILL-5758: Fix for repeated columns; enable managed sort by default Fix for DRILL-5443. Basically, the “record batch sizer” did not handle repeated columns correctly. Enabled managed sort by default. You can merge this pull request into a Git repository by running: $ git pull https://github.com/paul-rogers/drill DRILL-5758 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/932.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #932 commit 2c9b2249fbad1bee2cb290a9c04fb82df5ecf2e8 Author: Paul Rogers Date: 2017-09-01T00:24:43Z Fix for DRILL-5443. See DRILL-5758 for details. Basically, the “record batch sizer” did not handle repeated columns correctly. Enabled managed sort by default > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149825#comment-16149825 ] Paul Rogers commented on DRILL-5758: Turns out the {{RecordBatchSizer}} contained a bug for repeated elements. Consider the original output: {code} rms.mapvalue.col2(type: REPEATED BIGINT, count: 1, total entries: 1, per-array: 1, std size: 8, actual size: 52, data size: 52) ... Records: 4096, Total size: 1441792, Data size: 376615, Gross row width: 352, Net row width: 92, Density: 27} {code} In the above, {{col2}} is repeated, but the entries per array is set at 1. Output after the fix: {code} rms.mapvalue.col2(type: REPEATED BIGINT, count: 4096, elements: 12288, per-array: 3, std size: 8, actual size: 28, data size: 114688) ... Records: 4096, Total size: 1441792, Data size: 1136848, Gross row width: 352, Net row width: 278, Density: 79} {code} Note that the (average) elements per-array is now 3 and the estimated "net" row width has grown from 92 to 278. The result is much better vector size estimates and no vector reallocations: {code} Initial output batch allocation: 811008 bytes, 3771 records Took 4438 us to merge 3771 records, consuming 811008 bytes of memory {code} And now the sort completes: {code} Results: 4,000,000 records, 63 batches {code} > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA
[ https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149708#comment-16149708 ] Paul Rogers commented on DRILL-5758: The external sort memory manager works by anticipating the allocation size of each batch: input, spill, merge, and so on. This is done using the "record batch sizer" that figures out data sizes by observing input vectors. The sizer then generates a set of allocation "hints" used to allocate proper-size vectors for the various batches. If the memory calcs are wrong, then a batch might become larger than expected, causing OOM errors. One way to check if batches are under-estimated is to check if the sort code ends up needing to double batch sizes. This does, in fact, occur: {code} Spilling 42 batches, into spill batches of 11397 rows, to /tmp/drill/spill/... Initial output batch allocation: 2392064 bytes vector.BigIntVector - Reallocating vector [$data$(BIGINT:REQUIRED)]. # of bytes: [91176] -> [182352] vector.UInt4Vector - Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [45592] -> [91184] vector.UInt4Vector - Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [45592] -> [91184] vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: [11397] -> [22794] vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: [11397] -> [22794] vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: [11397] -> [22794] vector.BigIntVector - Reallocating vector [c(BIGINT:OPTIONAL)]. # of bytes: [91176] -> [182352] vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: [11397] -> [22794] vector.Float8Vector - Reallocating vector [d(FLOAT8:OPTIONAL)]. # of bytes: [91176] -> [182352] vector.BigIntVector - Reallocating vector [$data$(BIGINT:REQUIRED)]. # of bytes: [182352] -> [364704] Took 130655 us to merge 11397 records, consuming 3244032 bytes of memory {code} The above tells us that the estimates are off by no more than 50% (else the vectors would be double more than once.) But, the estimates are off and must be corrected. * Original estimate for the batch size (from elsewhere in the logs): 1,572,786 * Actual initial allocation size: 2,392,064 * Final actual allocation size: 3,244,032 This tells us that the calculations are wrong somewhere. > Rollup of external sort fixes to issues found by QA > --- > > Key: DRILL-5758 > URL: https://issues.apache.org/jira/browse/DRILL-5758 > Project: Apache Drill > Issue Type: Task >Affects Versions: 1.12.0 >Reporter: Paul Rogers >Assignee: Paul Rogers > Fix For: 1.12.0 > > > Tracking JIRA to used for the PR that combines fixes for various JIRA > entries. Bugs fixed in this task are given by the linked issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)