[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346224#comment-16346224 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1090

> Sort incorrectly limits batch size to 65535 records rather than 65536
> ---------------------------------------------------------------------
>
>                 Key: DRILL-6080
>                 URL: https://issues.apache.org/jira/browse/DRILL-6080
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>              Labels: ready-to-commit
>             Fix For: 1.13.0
>
>
> Drill places an upper limit of 64K on the number of rows in a batch. That is
> 65,536 decimal. When we index records, the indexes run from 0 to 64K-1, or 0
> to 65,535.
> The sort code incorrectly uses {{Character.MAX_VALUE}} as the maximum row
> count. So, if an incoming batch uses the full 64K size, sort ends up
> splitting batches unnecessarily.
> The fix is to instead use the correct constant {{ValueVector.MAX_ROW_COUNT}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16343613#comment-16343613 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on the issue:

    https://github.com/apache/drill/pull/1090

+1
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342362#comment-16342362 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1090

Squashed and rebased. Reran the unit tests; they passed up to the Mongo tests, which failed for an unrelated reason.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16342334#comment-16342334 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r164282435

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff --

Thanks Vlad. I double-checked everything.

`testLargeBatch()` passes to `runJumboBatchTest()` the number of records per batch. It passes the maximum, which is `ValueVector.MAX_ROW_COUNT`:

```
runJumboBatchTest(fixture, ValueVector.MAX_ROW_COUNT);
```

Then, `runJumboBatchTest()` sets up a test of *n* rows with *m* rows per batch. (This allows testing multiple batches, each with a small number of rows.) Here, we want one batch with the maximum number of rows, so we pass:

```
DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
DataValidator validator = new DataValidator(rowCount, ValueVector.MAX_ROW_COUNT);
```

That is, a row count of 64K (passed in) in batches of 64K. The result is a single batch of the maximum possible row count.

Next, `DataGenerator` records these two numbers. It was a bit forgiving, adjusting the requested batch size down to the allowed maximum:

```
this.batchSize = Math.min(batchSize, ValueVector.MAX_ROW_COUNT);
```

I see that this causes confusion, and is probably a bit too lenient for a test. So, I've enforced the limit instead:

```
Preconditions.checkArgument(batchSize > 0 && batchSize <= ValueVector.MAX_ROW_COUNT);
this.batchSize = batchSize;
```

I made the same change to the `DataValidator` class.

We also discussed the SV4 class. What may not be entirely obvious there is that the length of an SV4 can often exceed the batch size limit. We have two kinds of indirection vectors (AKA "selection vectors"). The SV2 is an indirection on top of a single batch (and so has a length limited to `ValueVector.MAX_RECORD_COUNT`). But the SV4 "glues together" multiple batches, and so has a length limited only by the maximum vector length. (Hard coded at 2 GB, but ideally limited to 16 MB.)

Further, as your work showed, we probably don't want a huge SV4 for other reasons. A primary use of the SV4 is to merge a collection of batches to produce a set of merged output batches. You showed that merge performance suffers above a certain input width, and that width is far below the theoretical maximum.

So, I *think* I've got everything covered. If I'm overlooking anything, please do point it out so I can fix it.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340616#comment-16340616 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on the issue:

    https://github.com/apache/drill/pull/1090

LGTM, please rebase + squash to resolve conflicts.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340615#comment-16340615 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r164040082

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff --

It is more of a code-maintenance question: changing the maximum value in DataGenerator/DataValidator requires a similar change here.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340473#comment-16340473 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r164022284

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff --

That is true. But what advantage is there to passing an arbitrarily large number that doesn't represent a valid Drill batch size? The reader would see it as an error until they go and look at the code to see that the potential error is not really an error.

The question is: is this a bug that needs to be fixed, or just a coding preference that we can let go for now?
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340469#comment-16340469 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r164021818

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java ---
@@ -31,8 +31,9 @@
   private int length;

   public SelectionVector4(ByteBuf vector, int recordCount, int batchRecordCount) throws SchemaChangeException {
-    if (recordCount > Integer.MAX_VALUE /4) {
-      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. You requested an allocation of %d bytes.", recordCount * 4));
+    if (recordCount > Integer.MAX_VALUE / 4) {
+      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. " +
+          "You requested an allocation of %d bytes.", recordCount * 4));
--- End diff --

Sounds like two issues. First, while I pointed out opportunities to make this code consistent with work elsewhere, the code as it stands has worked for the last two years. Second, if it helps move this PR ahead for @ilooner, I can back out the formatting changes to this file so that it drops out of the PR.

That said, our general policy has been to include code cleanup within other commits rather than incurring the cost and delay of two commits for each bit of work (one for code cleanup, another for substantive changes).

Besides this issue, anything else needed?
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340405#comment-16340405 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r164015707

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff --

Correct, but won't DataGenerator/DataValidator apply `Math.min(batchSize, ValueVector.MAX_ROW_COUNT)` and limit the size to the maximum possible anyway?
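The two validation styles debated in this thread can be contrasted in a standalone sketch. The helper below is hypothetical (it is not the actual `DataGenerator`); only the constant value 65536 comes from the discussion. `Math.min` silently clamps an out-of-range batch size, while a precondition-style check fails fast and surfaces the caller's mistake.

```java
// Hypothetical helper contrasting lenient clamping with strict checking.
public class BatchSizeCheck {
    static final int MAX_ROW_COUNT = 1 << 16; // 65536, as ValueVector.MAX_ROW_COUNT

    static int clamped(int batchSize) {
        // Lenient: any oversized request is quietly reduced to the maximum.
        return Math.min(batchSize, MAX_ROW_COUNT);
    }

    static int checked(int batchSize) {
        // Strict: an invalid request is the caller's bug, so surface it.
        if (batchSize <= 0 || batchSize > MAX_ROW_COUNT) {
            throw new IllegalArgumentException("batchSize: " + batchSize);
        }
        return batchSize;
    }

    public static void main(String[] args) {
        System.out.println(clamped(Integer.MAX_VALUE)); // 65536, silently adjusted
        System.out.println(checked(65536));             // 65536, accepted as-is
    }
}
```

The clamp answers vrozov's question (yes, the size is limited either way), but only the strict form makes a bad test argument visible.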
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340290#comment-16340290 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163999199

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java ---
@@ -31,8 +31,9 @@
   private int length;

   public SelectionVector4(ByteBuf vector, int recordCount, int batchRecordCount) throws SchemaChangeException {
-    if (recordCount > Integer.MAX_VALUE /4) {
-      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. You requested an allocation of %d bytes.", recordCount * 4));
+    if (recordCount > Integer.MAX_VALUE / 4) {
+      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. " +
+          "You requested an allocation of %d bytes.", recordCount * 4));
--- End diff --

Do you mean `Integer.MAX_VALUE / 4 + 1`? I'd prefer to avoid format-only changes, especially if the original code is broken.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340282#comment-16340282 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user ilooner commented on the issue:

    https://github.com/apache/drill/pull/1090

@vrozov @paul-rogers Is there a way we can accelerate this PR? I keep hitting the DRILL-6086 issue in my builds.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336950#comment-16336950 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163456760

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java ---
@@ -31,8 +31,9 @@
   private int length;

   public SelectionVector4(ByteBuf vector, int recordCount, int batchRecordCount) throws SchemaChangeException {
-    if (recordCount > Integer.MAX_VALUE /4) {
-      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. You requested an allocation of %d bytes.", recordCount * 4));
+    if (recordCount > Integer.MAX_VALUE / 4) {
+      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. " +
+          "You requested an allocation of %d bytes.", recordCount * 4));
--- End diff --

This code was original; I only cleaned it up.

It is true that the `ValueVector` code will fail for an allocation above 2 GB. But that code is not used for the SV4. (Strangely, a selection *vector* is not a value *vector*...) So the SV4 must impose its own limit. The limit is, essentially, 32K batches * 64K records per batch. (Given how we do the math, it could be 64K batches, since we mask off the high bits after the shift in generated code.)

I'm guessing the original thinking was: we can allocate vectors up to 2 GB, so we should only allow SV4s of the same size. With 4 bytes per SV4 entry, we can allocate up to `(Integer.MAX_VALUE + 1) / 4` entries. Code like this is hard to reason about because 1) it is pre-existing and 2) it is broken.

As we know, any allocation above 16 MB may fail due to memory fragmentation, so allowing a 2 GB size is wildly optimistic on a busy system, or on one that has been running long enough that all direct memory resides on the Netty free list.
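The shift-and-mask addressing described in the comment above can be sketched in isolation. These are hypothetical helpers, not Drill's actual `SelectionVector4` API; the sketch only illustrates the stated scheme of a 4-byte entry holding a batch index in its high 16 bits and a record index in its low 16 bits, which is why the addressable space is on the order of 32K-64K batches of 64K rows each.

```java
// Hypothetical sketch of a 4-byte SV4 entry: high 16 bits select a batch,
// low 16 bits select a record within that batch.
public class Sv4EntryDemo {
    static int encode(int batchIndex, int recordIndex) {
        return (batchIndex << 16) | (recordIndex & 0xFFFF);
    }
    static int batchOf(int entry)  { return entry >>> 16; }  // shift, as in generated code
    static int recordOf(int entry) { return entry & 0xFFFF; } // mask off the high bits

    public static void main(String[] args) {
        int entry = encode(3, 65535);        // batch 3, last row of a full 64K batch
        System.out.println(batchOf(entry));  // 3
        System.out.println(recordOf(entry)); // 65535
    }
}
```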
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336952#comment-16336952 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163457037

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java ---
@@ -100,8 +101,8 @@ public boolean next() {
       return false;
     }
-    start = start+length;
-    int newEnd = Math.min(start+length, recordCount);
+    start = start + length;
--- End diff --

You may be right. I just added some spaces...
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336951#comment-16336951 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163456045

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff --

Well... As it turns out, `ValueVector.MAX_ROW_COUNT` is 64K, which is the maximum size an SV2 can address. (An SV2 entry is 16 bits wide.) `Integer.MAX_VALUE` is 2^31 - 1, which would require a 32-bit SV2, which we don't have. So using `Integer.MAX_VALUE` would cause the test to fail, as the sorter cannot sort batches larger than 64K...

Previously we used `Character.MAX_VALUE`, but it is not intuitively obvious that our batch size should be correlated with the size of Java's UTF-16 character encoding... And, in fact, the original bug is that they are not correlated: `Character.MAX_VALUE` is 65535, while `ValueVector.MAX_ROW_COUNT` is 65536. As a result, we were not testing the full-batch corner case.
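The off-by-one at the heart of DRILL-6080 fits in two constants. This standalone snippet (not Drill code) shows that `Character.MAX_VALUE` is the largest 16-bit *index* (65535), while the row *count* of a full 64K batch is 65536; capping row count at the former makes a genuinely full batch look oversized, which is exactly why sort split such batches.

```java
// Two constants that differ by one: max index vs. max count.
public class OffByOneDemo {
    public static void main(String[] args) {
        int maxIndex = Character.MAX_VALUE;  // 65535: highest addressable index
        int maxRowCount = 1 << 16;           // 65536: value of ValueVector.MAX_ROW_COUNT
        // A full 64K batch exceeds a cap expressed as the max index:
        System.out.println(maxRowCount > maxIndex); // true -> batch gets split
    }
}
```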
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336949#comment-16336949 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163455571

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
+    DataValidator validator = new DataValidator(rowCount, ValueVector.MAX_ROW_COUNT);
     runLargeSortTest(fixture, dataGen, validator);
-    System.out.println(timer.elapsed(TimeUnit.MILLISECONDS));
+    //System.out.println(timer.elapsed(TimeUnit.MILLISECONDS));
--- End diff --

Removed all the timing & debugging code to avoid the need for commented-out lines. Logging in tests is a no-op; we simply discard the logs.
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335119#comment-16335119 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r162772482

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java ---
@@ -100,8 +101,8 @@ public boolean next() {
       return false;
     }
-    start = start+length;
-    int newEnd = Math.min(start+length, recordCount);
+    start = start + length;
--- End diff --

This code does not look right to me. It tries to enforce the invariant that `start + length <= recordCount`, but based on the check on line 96, the invariant is not enforced in other places, so it is not clear why it needs to be enforced here. If the invariant does need to be enforced, would it be better to use:

```
start += length;
length = Math.min(length, recordCount - start);
```
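The paging fix suggested in the review above can be exercised in a minimal stand-in class. `WindowPager` is hypothetical (it is not Drill's `SelectionVector4`); it only demonstrates the proposed `next()` logic, in which each call advances the window and clamps its length so that `start + length` never exceeds `recordCount`.

```java
// Hypothetical pager demonstrating the suggested next() fix.
public class WindowPager {
    final int recordCount;
    int start = 0;
    int length;

    WindowPager(int recordCount, int window) {
        this.recordCount = recordCount;
        this.length = Math.min(window, recordCount);
    }

    boolean next() {
        if (start + length >= recordCount) {
            return false;                               // already at the last window
        }
        start += length;                                // advance to the next window
        length = Math.min(length, recordCount - start); // clamp the final window
        return true;
    }

    public static void main(String[] args) {
        WindowPager p = new WindowPager(10, 4);
        do {
            System.out.println(p.start + ".." + (p.start + p.length));
        } while (p.next());
        // windows: 0..4, 4..8, 8..10 -- the invariant holds throughout
    }
}
```

For 10 records in windows of 4, the final window shrinks to 2 records instead of overrunning the record count.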
[jira] [Commented] (DRILL-6080) Sort incorrectly limits batch size to 65535 records rather than 65536
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335121#comment-16335121 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163093639

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
@@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
   public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
     timer.reset();
-    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
-    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
+    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
--- End diff --

Would it be better to use

```
DataGenerator dataGen = new DataGenerator(fixture, rowCount, Integer.MAX_VALUE);
DataValidator validator = new DataValidator(rowCount, Integer.MAX_VALUE);
```

My understanding is that the idea is to use the maximum batch size in the test.
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335120#comment-16335120 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r162771968

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java ---
    @@ -31,8 +31,9 @@
       private int length;

       public SelectionVector4(ByteBuf vector, int recordCount, int batchRecordCount) throws SchemaChangeException {
    -    if (recordCount > Integer.MAX_VALUE /4) {
    -      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. You requested an allocation of %d bytes.", recordCount * 4));
    +    if (recordCount > Integer.MAX_VALUE / 4) {
    +      throw new SchemaChangeException(String.format("Currently, Drill can only support allocations up to 2gb in size. "
    +          + "You requested an allocation of %d bytes.", recordCount * 4));
    --- End diff --

    My understanding is that Java will use `int` to compute `recordCount * 4` and overflow.
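The overflow vrozov points out is a general Java pitfall: `recordCount * 4` is evaluated in 32-bit `int` arithmetic before any widening, so for a `recordCount` just past the guard it wraps silently. A minimal sketch (variable names are illustrative, not Drill's actual code):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // One past the guard value used in the SelectionVector4 check.
        int recordCount = Integer.MAX_VALUE / 4 + 1; // 536,870,912

        // int * int overflows before assignment to long...
        long wrong = recordCount * 4;  // wraps to Integer.MIN_VALUE
        // ...while widening one operand first yields the true byte count.
        long right = recordCount * 4L;

        System.out.println(wrong); // -2147483648
        System.out.println(right); // 2147483648
    }
}
```

So even the error message's `%d bytes` value would print a negative number unless one operand is widened to `long` first.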
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335118#comment-16335118 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r163093794

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/managed/TestSortImpl.java ---
    @@ -466,10 +469,10 @@ public void runLargeSortTest(OperatorFixture fixture, DataGenerator dataGen,
       public void runJumboBatchTest(OperatorFixture fixture, int rowCount) {
         timer.reset();
    -    DataGenerator dataGen = new DataGenerator(fixture, rowCount, Character.MAX_VALUE);
    -    DataValidator validator = new DataValidator(rowCount, Character.MAX_VALUE);
    +    DataGenerator dataGen = new DataGenerator(fixture, rowCount, ValueVector.MAX_ROW_COUNT);
    +    DataValidator validator = new DataValidator(rowCount, ValueVector.MAX_ROW_COUNT);
         runLargeSortTest(fixture, dataGen, validator);
    -    System.out.println(timer.elapsed(TimeUnit.MILLISECONDS));
    +    //System.out.println(timer.elapsed(TimeUnit.MILLISECONDS));
    --- End diff --

    Please remove or convert to logging.
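The "convert to logging" suggestion would look roughly like the sketch below. Drill itself uses slf4j; `java.util.logging` stands in here only so the snippet runs with no extra dependencies, and the message text and `elapsedMs` placeholder are illustrative:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class TimerLogging {
    private static final Logger logger =
        Logger.getLogger(TimerLogging.class.getName());

    public static void main(String[] args) {
        // Placeholder for timer.elapsed(TimeUnit.MILLISECONDS) in the test.
        long elapsedMs = 1234;

        // Before: System.out.println(elapsedMs) always prints, cluttering test output.
        // After: emitted only when the logger is configured for FINE or lower.
        logger.log(Level.FINE, "Jumbo batch sort took {0} ms", elapsedMs);
        System.out.println("done");
    }
}
```

The advantage over a commented-out `println` is that the timing stays available when wanted, without dead code in the test.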
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329799#comment-16329799 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1090

    Pushed a new commit that fixes an indexing error. Also removes the "unsafe" methods in favor of the "set" methods in DrillBuf as Vlad requested.

    @vrozov, please take another look.
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325385#comment-16325385 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1090#discussion_r161384235

    --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
    @@ -851,48 +851,52 @@ public void print(StringBuilder sb, int indent, Verbosity verbosity) {
        * Write an integer to the buffer at the given byte index, without
        * bounds checks.
        *
    -   * @param offset byte (not int) offset of the location to write
    +   * @param bufOffset byte (not int) offset of the location to write
        * @param value the value to write
        */
    -  public void unsafePutInt(int offset, int value) {
    -    PlatformDependent.putInt(addr + offset, value);
    +  public void unsafePutInt(int bufOffset, int value) {
    +    assert unsafeCheckIndex(bufOffset, 4);
    --- End diff --

    I don't think that `unsafePutXXX()` provides any performance benefit over the existing `setXXX()` variants after DRILL-6004 was merged, except that the unsafe variants use `assert` while the existing ones use a `final static boolean` flag that is functionally and performance-wise equivalent to a Java `assert`. It is necessary to agree on a single mechanism to enable bounds checking and use it in all methods.
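The two bounds-check mechanisms being compared can be sketched side by side. This is a standalone illustration over a plain `int[]` rather than Drill's off-heap `DrillBuf`, and the flag's system-property name is hypothetical:

```java
public class BoundsCheck {
    // Flag-style check: read once at class load; with a constant false,
    // the JIT can eliminate the branch entirely (property name is illustrative).
    static final boolean BOUNDS_CHECKING =
        Boolean.getBoolean("example.bounds.checking");

    static void putIntAsserted(int[] buf, int offset, int value) {
        // assert-style check: active only when the JVM runs with -ea.
        assert offset >= 0 && offset < buf.length : "index " + offset;
        buf[offset] = value;
    }

    static void putIntFlagged(int[] buf, int offset, int value) {
        // flag-style check: compiled away when BOUNDS_CHECKING is false.
        if (BOUNDS_CHECKING && (offset < 0 || offset >= buf.length)) {
            throw new IndexOutOfBoundsException("index " + offset);
        }
        buf[offset] = value;
    }

    public static void main(String[] args) {
        int[] buf = new int[4];
        putIntAsserted(buf, 2, 42);
        putIntFlagged(buf, 3, 7);
        System.out.println(buf[2] + " " + buf[3]); // 42 7
    }
}
```

Since both approaches cost nothing when disabled, the review's point stands: the choice between them is a consistency question, not a performance one.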
[ https://issues.apache.org/jira/browse/DRILL-6080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324927#comment-16324927 ]

ASF GitHub Bot commented on DRILL-6080:
---------------------------------------

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/1090

    DRILL-6080: Sort incorrectly limits batch size to 65535 records

    Sort incorrectly limits batch size to 65535 records rather than 65536. This PR also includes a few code cleanup items.

    Also fixes DRILL-6086: Overflow in offset vector in row set writer.

    @vrozov, please review.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-6080

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1090.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1090

----

commit c1d3402a619f3355e47e845aae245fd0f96e2189
Author: Paul Rogers
Date:   2018-01-11T00:04:53Z

    DRILL-6080: Sort incorrectly limits batch size to 65535 records

    Sort incorrectly limits batch size to 65535 records rather than 65536. This PR also includes a few code cleanup items.

    Fix for overflow in offset vector in row set writer