Paul Rogers created DRILL-7301:
----------------------------------
Summary: Assertion failure in HashAgg with mem prediction off
Key: DRILL-7301
URL: https://issues.apache.org/jira/browse/DRILL-7301
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Boaz Ben-Zvi
DRILL-6951 revised the mock data source to use the new "EVF". A side effect is
that the new version minimizes internal fragmentation within batches (which is
a good thing). As it turns out, the {{TestHashAggrSpill}} unit tests based
their spill thresholds on total memory, including the wasted internal
fragmentation. After the upgrade to the mock data source, some of the
{{TestHashAggrSpill}} tests failed because they no longer spilled.
The revised mock limits batch sizes to 10 MB by default, and the code ensures
that the largest vector, likely the one for {{empid_s17}}, is nearly 100% full.
Experimentation showed that doubling the row count produced enough memory use
to cause the operator to spill as requested (a rough sketch of the memory math
follows the test code below). However, one test now fails with an assertion
error:
{code:java}
/**
 * Test with "needed memory" prediction turned off
 * (i.e., exercise code paths that catch OOMs from the Hash Table and recover)
 */
@Test
public void testNoPredictHashAggrSpill() throws Exception {
  testSpill(58_000_000, 16, 2, 2, false, false /* no prediction */, null,
      DEFAULT_ROW_COUNT, 1, 1, 1);
}
{code}
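For scale, here is a back-of-envelope sketch of the memory math behind the
row-count doubling. Every number below is an illustrative assumption, not a
measured value:
{code:java}
// Rough estimate for one batch of a VarChar column such as empid_s17.
// All numbers here are assumptions for illustration only.
int rowsPerBatch = 32_768;   // hypothetical rows per batch
int avgWidth = 17;           // "s17" suggests 17-byte string values
long dataBytes = (long) rowsPerBatch * avgWidth;   // value bytes
long offsetBytes = (long) (rowsPerBatch + 1) * 4;  // 4-byte offset vector
long totalBytes = dataBytes + offsetBytes;
System.out.printf("~%,d bytes for this column per batch%n", totalBytes);
{code}
Doubling the row count roughly doubles this footprint, which is what pushed
the operator past its spill threshold.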
Partial stack:
{noformat}
    at org.apache.drill.exec.physical.impl.common.HashTableTemplate.outputKeys(HashTableTemplate.java:910) ~[classes/:na]
    at org.apache.drill.exec.test.generated.HashAggregatorGen0.outputCurrentBatch(HashAggTemplate.java:1184) ~[na:na]
    at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:267) ~[classes/:na]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186) ~[classes/:na]
{noformat}
Failure line:
{code:java}
@Override
public boolean outputKeys(int batchIdx, VectorContainer outContainer,
    int numRecords) {
  assert batchIdx < batchHolders.size(); // <-- Fails here
  return batchHolders.get(batchIdx).outputKeys(outContainer, numRecords);
}
{code}
Perhaps the increase in row count forced the operator into an operating range
with insufficient memory. If so, the test should have failed with some kind of
OOM rather than an index assertion.
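If that theory holds, a guard along the following lines (a sketch only, not
the actual Drill fix) would surface the problem as a descriptive error rather
than a bare assertion:
{code:java}
@Override
public boolean outputKeys(int batchIdx, VectorContainer outContainer,
    int numRecords) {
  // Sketch: replace the bare assert with an explicit check that reports
  // context, so a memory-pressure bug fails with a meaningful message.
  if (batchIdx >= batchHolders.size()) {
    throw new IllegalStateException(String.format(
        "Batch index %d out of bounds: only %d batch holders exist; " +
        "possibly a batch was lost under memory pressure",
        batchIdx, batchHolders.size()));
  }
  return batchHolders.get(batchIdx).outputKeys(outContainer, numRecords);
}
{code}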
To test the low-memory theory, the memory limit was increased to
{{60_000_000}}. Now the code failed at a different point:
{noformat}
    at org.apache.drill.exec.physical.impl.common.HashTableTemplate.put(HashTableTemplate.java:678) ~[classes/:na]
    at org.apache.drill.exec.test.generated.HashAggregatorGen0.checkGroupAndAggrValues(HashAggTemplate.java:1337) ~[na:na]
    at org.apache.drill.exec.test.generated.HashAggregatorGen0.doWork(HashAggTemplate.java:606) ~[na:na]
    at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:296) ~[classes/:na]
{noformat}
Code line:
{code:java}
@Override
public PutStatus put(int incomingRowIdx, IndexPointer htIdxHolder,
    int hashCode, int targetBatchRowCount)
    throws SchemaChangeException, RetryAfterSpillException {
  ...
  for (int currentIndex = startIdx;
      ... ) {
    // remember the current link, which would be the last when the next
    // link is empty
    lastEntryBatch = batchHolders.get((currentIndex >>> 16) & BATCH_MASK); // <-- Here
{code}
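The index arithmetic on the failing line treats {{currentIndex}} as a
composite value: the upper bits select a batch holder and the low 16 bits
select a slot within it. A minimal decode sketch (the {{0xFFFF}} mask value
is an assumption inferred from the shift by 16):
{code:java}
// Sketch of the composite-index convention implied by the failing line.
// Assumption: BATCH_MASK is 0xFFFF, i.e., 16 bits of in-batch offset.
static final int BATCH_MASK = 0xFFFF;

static int batchOf(int compositeIndex) {
  return (compositeIndex >>> 16) & BATCH_MASK; // upper 16 bits: batch holder
}

static int offsetOf(int compositeIndex) {
  return compositeIndex & BATCH_MASK;          // lower 16 bits: row in batch
}
{code}
A stale composite index that survives a spill could make {{batchOf()}} point
past the live {{batchHolders}} list, which would be consistent with both
stack traces above.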
Increasing memory to {{62_000_000}} produced this error:
{noformat}
    at org.apache.drill.exec.physical.impl.common.HashTableTemplate.outputKeys(HashTableTemplate.java:910) ~[classes/:na]
    at org.apache.drill.exec.test.generated.HashAggregatorGen0.outputCurrentBatch(HashAggTemplate.java:1184) ~[na:na]
    at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:267) ~[classes/:na]
{noformat}
This is the same failure line as in the first exception.
Increasing memory to {{64_000_000}} triggered the second error again.
Increasing memory to {{66_000_000}} triggered the first error again.
The two errors kept alternating as the memory limit was raised (later in
jumps of 20 MB) up to 140 MB, at which point the test failed for a different
reason: the query ran but did not spill. At a limit of 130 MB the query still
fails with one of the errors above.
At {{135_000_000}} the query works, but the returned row count is wrong:
{noformat}
java.lang.AssertionError: expected:<2400000> but was:<2334465>
{noformat}
The failing assertion:
{code:java}
private void runAndDump(...
  assertEquals(expectedRows, summary.recordCount());
{code}
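One suggestive detail (an observation from the numbers, not a confirmed
diagnosis): the shortfall is exactly one 16-bit batch's worth of rows, which
lines up with the {{BATCH_MASK}} arithmetic sketched earlier:
{code:java}
// The missing-row count equals 0xFFFF, the assumed BATCH_MASK value above.
int expected = 2_400_000;
int actual = 2_334_465;
System.out.println(expected - actual);            // 65535
System.out.println(expected - actual == 0xFFFF);  // true
{code}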
There does seem to be something wrong with this code path. All other tests
run fine with the new mock data source (and adjusted row counts). I have
disabled the offending test until this bug can be fixed.