[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187249#comment-16187249
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/932


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187235#comment-16187235
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/932
  
Squashed commits.


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187233#comment-16187233
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/932
  
Rebased onto master.


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187231#comment-16187231
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/932#discussion_r142017748
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -74,53 +74,52 @@
 public final int estSize;
 
 /**
- * Number of times the value here (possibly repeated) appears in
- * the record batch.
+ * Number of occurrences of the value in the batch. This is trivial
+ * for top-level scalars: it is the record count. For a top-level
+ * repeated vector, this is the number of arrays, also the record
+ * count. For a value nested inside a repeated map, it is the
+ * total number of values across all maps, and may be less than,
+ * greater than (but unlikely) same as the row count.
  */
 
 public final int valueCount;
 
 /**
- * The number of elements in the value vector. Consider two cases.
- * A required or nullable vector has one element per row, so the
- * entryCount is the same as the valueCount (which,
- * in turn, is the same as the row count.) But, if this vector is an
- * array, then the valueCount is the number of columns, while
- * entryCount is the total number of elements in all the 
arrays
- * that make up the columns, so entryCount will be different 
than
- * the valueCount (normally larger, but possibly smaller if 
most
- * arrays are empty.
- * 
- * Finally, the column may be part of another list. In this case, the 
above
- * logic still applies, but the valueCount is the number of 
entries
- * in the outer array, not the row count.
+ * Total number of elements for a repeated type, or 1 if this is
+ * a non-repeated type. That is, a batch of 100 rows may have an
+ * array with 10 elements per row. In this case, the element count
+ * is 1000.
  */
 
-public int entryCount;
+public final int elementCount;
--- End diff --

Good point. However, a single batch of greater than 2 GB is far more than 
the sort can handle, so we'd not even get this far if the batch was this large.

Still, the point is valid, so a new commit changes batch size variables 
from int to long.


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181629#comment-16181629
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user amansinha100 commented on the issue:

https://github.com/apache/drill/pull/932
  
Changes LGTM.  +1


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181626#comment-16181626
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/932#discussion_r141191139
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -74,53 +74,52 @@
 public final int estSize;
 
 /**
- * Number of times the value here (possibly repeated) appears in
- * the record batch.
+ * Number of occurrences of the value in the batch. This is trivial
+ * for top-level scalars: it is the record count. For a top-level
+ * repeated vector, this is the number of arrays, also the record
+ * count. For a value nested inside a repeated map, it is the
+ * total number of values across all maps, and may be less than,
+ * greater than (but unlikely) same as the row count.
  */
 
 public final int valueCount;
 
 /**
- * The number of elements in the value vector. Consider two cases.
- * A required or nullable vector has one element per row, so the
- * entryCount is the same as the valueCount (which,
- * in turn, is the same as the row count.) But, if this vector is an
- * array, then the valueCount is the number of columns, while
- * entryCount is the total number of elements in all the 
arrays
- * that make up the columns, so entryCount will be different 
than
- * the valueCount (normally larger, but possibly smaller if 
most
- * arrays are empty.
- * 
- * Finally, the column may be part of another list. In this case, the 
above
- * logic still applies, but the valueCount is the number of 
entries
- * in the outer array, not the row count.
+ * Total number of elements for a repeated type, or 1 if this is
+ * a non-repeated type. That is, a batch of 100 rows may have an
+ * array with 10 elements per row. In this case, the element count
+ * is 1000.
  */
 
-public int entryCount;
+public final int elementCount;
--- End diff --

Not related to elementCount per-se but I see that netBatchSize and 
accountedMemorySize are integers.  These could overflow depending on number of 
columns.  Should they be longs ? 


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178455#comment-16178455
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/932
  
Rebased onto latest master.


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178075#comment-16178075
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/932
  
Added a fix for repeated columns that have a low cardinality. If, say, one 
row in ten has an array entry, then average cardinality (values per row) is 
0.1. This was represented by an int, rounded to 0 and caused a zero-length 
vector to be allocated. Drill then tried to double the length, which resulted 
in 0, which was doubled again, and so on forever.

The fix has three parts:

* The "record batch sizer" uses floats to allow fractional cardinality.
* The vector initializer now works with fractional cardinality.
* If all else fails, if a fixed-width vector is asked to double from zero, 
it sizes the vector to 256 bytes.


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160517#comment-16160517
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/932#discussion_r137963963
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/managed/SortMemoryManager.java
 ---
@@ -680,9 +680,11 @@ public MergeTask consolidateBatches(long allocMemory, 
int inMemCount, int spille
 // spilled run.
 
 // Maximum spill batches that fit into available memory.
+// Use the maximum buffer size since spill batches seem to
+// be read with almost 50% internal fragmentation.
 
 int memMergeLimit = (int) ((mergeMemoryLimit - allocMemory) /
-spillBatchSize.expectedBufferSize);
+spillBatchSize.maxBufferSize);
 memMergeLimit = Math.max(0, memMergeLimit);
--- End diff --

This line of code is not needed ( Math.max() ), as the code above 
(around line 675) ensures that mergeMemoryLimit >= allocMemory .



> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160158#comment-16160158
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/932
  
@Ben-Zvi, added another fix. Please take another look.


> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-01 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151080#comment-16151080
 ] 

Paul Rogers commented on DRILL-5758:


With most issues now resolved (or close to it), we have enabled the "managed" 
sort by default. The boot-time option is removed. The system/session option 
continues to exist to disable the managed sort if desired.

> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-09-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16151079#comment-16151079
 ] 

ASF GitHub Bot commented on DRILL-5758:
---

GitHub user paul-rogers opened a pull request:

https://github.com/apache/drill/pull/932

DRILL-5758: Fix for repeated columns; enable managed sort by default

Fix for DRILL-5443. Basically, the “record batch sizer” did not handle 
repeated columns correctly.

Enabled managed sort by default.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/paul-rogers/drill DRILL-5758

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #932


commit 2c9b2249fbad1bee2cb290a9c04fb82df5ecf2e8
Author: Paul Rogers 
Date:   2017-09-01T00:24:43Z

Fix for DRILL-5443. See DRILL-5758 for details. Basically, the “record
batch sizer” did not handle repeated columns correctly.

Enabled managed sort by default




> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-08-31 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149825#comment-16149825
 ] 

Paul Rogers commented on DRILL-5758:


Turns out the {{RecordBatchSizer}} contained a bug for repeated elements. 
Consider the original output:

{code}
  rms.mapvalue.col2(type: REPEATED BIGINT, count: 1, total entries: 1, 
per-array: 1, std size: 8, actual size: 52, data size: 52)
...
  Records: 4096, Total size: 1441792, Data size: 376615, Gross row width: 352, 
Net row width: 92, Density: 27}
{code}

In the above, {{col2}} is repeated, but the entries per array is set at 1.

Output after the fix:

{code}
  rms.mapvalue.col2(type: REPEATED BIGINT, count: 4096, elements: 12288, 
per-array: 3, std size: 8, actual size: 28, data size: 114688)
...
  Records: 4096, Total size: 1441792, Data size: 1136848, Gross row width: 352, 
Net row width: 278, Density: 79}
{code}

Note that the (average) elements per-array is now 3 and the estimated "net" row 
width has grown from 92 to 278.

The result is much better vector size estimates and no vector reallocations:

{code}
Initial output batch allocation: 811008 bytes, 3771 records

Took 4438 us to merge 3771 records, consuming 811008 bytes of memory
{code}

And now the sort completes:
{code}
Results: 4,000,000 records, 63 batches
{code}

> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5758) Rollup of external sort fixes to issues found by QA

2017-08-31 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149708#comment-16149708
 ] 

Paul Rogers commented on DRILL-5758:


The external sort memory manager works by anticipating the allocation size of 
each batch: input, spill, merge, and so on. This is done using the "record 
batch sizer" that figures out data sizes by observing input vectors. The sizer 
then generates a set of allocation "hints" used to allocate proper-size vectors 
for the various batches. If the memory calcs are wrong, then a batch might 
become larger than expected, causing OOM errors. One way to check if batches 
are under-estimated is to check if the sort code ends up needing to double 
batch sizes. This does, in fact, occur:

{code}
Spilling 42 batches, into spill batches of 11397 rows, to /tmp/drill/spill/...
Initial output batch allocation: 2392064 bytes
vector.BigIntVector - Reallocating vector [$data$(BIGINT:REQUIRED)]. # of 
bytes: [91176] -> [182352]
vector.UInt4Vector - Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of 
bytes: [45592] -> [91184]
vector.UInt4Vector - Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of 
bytes: [45592] -> [91184]
vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: 
[11397] -> [22794]
vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: 
[11397] -> [22794]
vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: 
[11397] -> [22794]
vector.BigIntVector - Reallocating vector [c(BIGINT:OPTIONAL)]. # of bytes: 
[91176] -> [182352]
vector.UInt1Vector - Reallocating vector [$bits$(UINT1:REQUIRED)]. # of bytes: 
[11397] -> [22794]
vector.Float8Vector - Reallocating vector [d(FLOAT8:OPTIONAL)]. # of bytes: 
[91176] -> [182352]
vector.BigIntVector - Reallocating vector [$data$(BIGINT:REQUIRED)]. # of 
bytes: [182352] -> [364704]
Took 130655 us to merge 11397 records, consuming 3244032 bytes of memory
{code}

The above tells us that the estimates are off by no more than 50% (else the 
vectors would be double more than once.) But, the estimates are off and must be 
corrected.

* Original estimate for the batch size (from elsewhere in the logs): 1,572,786
* Actual initial allocation size: 2,392,064
* Final actual allocation size: 3,244,032

This tells us that the calculations are wrong somewhere.

> Rollup of external sort fixes to issues found by QA
> ---
>
> Key: DRILL-5758
> URL: https://issues.apache.org/jira/browse/DRILL-5758
> Project: Apache Drill
>  Issue Type: Task
>Affects Versions: 1.12.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> Tracking JIRA to used for the PR that combines fixes for various JIRA 
> entries. Bugs fixed in this task are given by the linked issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)