Paul Rogers created DRILL-7301:
----------------------------------

             Summary: Assertion failure in HashAgg with mem prediction off
                 Key: DRILL-7301
                 URL: https://issues.apache.org/jira/browse/DRILL-7301
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.16.0
            Reporter: Paul Rogers
            Assignee: Boaz Ben-Zvi


DRILL-6951 revised the mock data source to use the new "EVF". A side effect is 
that the new version minimizes batch internal fragmentation (which is a good 
thing). As it turns out, the {{TestHashAggrSpill}} unit tests based their 
spilling tests on total memory, including wasted internal fragmentation. After 
the upgrade to the mock data source, some of the {{TestHashAggrSpill}} tests 
failed because they no longer spilled.

The revised mock limits batch sizes to 10 MB by default. The code ensures that 
the largest vector, likely the one for {{empid_s17}}, is near 100% full.
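
For illustration only (invented numbers, not measured from Drill): internal 
fragmentation is the gap between a vector's allocated capacity and the bytes 
actually written into it, which is what the old memory-based spill thresholds 
implicitly counted on.

{code:java}
// Illustrative sketch only: allocation is rounded up, so a batch can hold
// more memory than the data written. The EVF-based mock keeps the largest
// vector near full, shrinking this gap.
long allocatedBytes = 16 * 1024 * 1024; // e.g., capacity rounded up to 16 MB
long writtenBytes   = 10 * 1024 * 1024; // e.g., the 10 MB batch-size target
double fragmentation = 1.0 - (double) writtenBytes / allocatedBytes;
System.out.printf("wasted: %.0f%%%n", fragmentation * 100); // wasted: 38%
{code}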

Experimentation showed that doubling the row count provided sufficient memory 
usage to cause the operator to spill as requested. But one test now fails with 
an assertion error:

{code:java}
  /**
   * Test with "needed memory" prediction turned off
   * (i.e., exercise code paths that catch OOMs from the Hash Table and recover)
   */
  @Test
  public void testNoPredictHashAggrSpill() throws Exception {
    testSpill(58_000_000, 16, 2, 2, false, false /* no prediction */, null,
        DEFAULT_ROW_COUNT, 1, 1, 1);
  }
{code}

Partial stack:

{noformat}
        at org.apache.drill.exec.physical.impl.common.HashTableTemplate.outputKeys(HashTableTemplate.java:910) ~[classes/:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen0.outputCurrentBatch(HashAggTemplate.java:1184) ~[na:na]
        at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:267) ~[classes/:na]
        at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186) ~[classes/:na]
{noformat}

Failure line:

{code:java}
  @Override
  public boolean outputKeys(int batchIdx, VectorContainer outContainer, int numRecords) {
    assert batchIdx < batchHolders.size(); // <-- Fails here
    return batchHolders.get(batchIdx).outputKeys(outContainer, numRecords);
  }
{code}

Perhaps the increase in row count forced the operator into an operating range 
with insufficient memory. If so, the test should have failed with some kind of 
OOM rather than an index assertion.
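
If the low-memory theory holds, a defensive check along these lines (a sketch 
only, not a proposed patch) would report the real condition instead of 
tripping a bare assertion:

{code:java}
// Sketch only: replace the bare assert with a descriptive failure so that an
// out-of-range batch index under memory pressure surfaces as such.
if (batchIdx >= batchHolders.size()) {
  throw new IllegalStateException(String.format(
      "batchIdx %d out of range: only %d batch holders exist; "
          + "possibly a missing BatchHolder after an OOM recovery path",
      batchIdx, batchHolders.size()));
}
{code}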

To test the low-memory theory, the memory limit was increased to 
{{60_000_000}}. Now the code failed at a different point:

{noformat}
        at org.apache.drill.exec.physical.impl.common.HashTableTemplate.put(HashTableTemplate.java:678) ~[classes/:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen0.checkGroupAndAggrValues(HashAggTemplate.java:1337) ~[na:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen0.doWork(HashAggTemplate.java:606) ~[na:na]
        at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:296) ~[classes/:na]
{noformat}

Code line:

{code:java}
  @Override
  public PutStatus put(int incomingRowIdx, IndexPointer htIdxHolder, int hashCode, int targetBatchRowCount) throws SchemaChangeException, RetryAfterSpillException {
    ...
    for ( int currentIndex = startIdx;
         ... {
      // remember the current link, which would be the last when the next link is empty
      lastEntryBatch = batchHolders.get((currentIndex >>> 16) & BATCH_MASK); // <-- Here
{code}
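
For context, the masking on the flagged line implies that the hash table packs 
a batch number and an in-batch row offset into a single int: the upper bits 
select the {{BatchHolder}}, the low 16 bits select the row within it. A 
minimal sketch of that encoding (hypothetical helper names; the real constants 
live in {{HashTableTemplate}}):

{code:java}
// Hypothetical sketch of the composite-index scheme implied by
// (currentIndex >>> 16) & BATCH_MASK: high bits pick the batch holder,
// low 16 bits pick the row within that batch.
class CompositeIndex {
  static final int BATCH_MASK = 0xFFFF;

  static int encode(int batchIdx, int rowIdx) {
    return (batchIdx << 16) | (rowIdx & BATCH_MASK);
  }

  static int batchOf(int composite) {
    return (composite >>> 16) & BATCH_MASK;
  }

  static int rowOf(int composite) {
    return composite & BATCH_MASK;
  }
}
{code}

Both observed failures would fit corrupted bookkeeping in this scheme: an 
index whose batch part points past {{batchHolders.size()}} (the first stack), 
or a stale link decoded on the flagged line (the second).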

Increasing memory to {{62_000_000}} produced this error:

{noformat}
        at org.apache.drill.exec.physical.impl.common.HashTableTemplate.outputKeys(HashTableTemplate.java:910) ~[classes/:na]
        at org.apache.drill.exec.test.generated.HashAggregatorGen0.outputCurrentBatch(HashAggTemplate.java:1184) ~[na:na]
        at org.apache.drill.exec.physical.impl.aggregate.HashAggBatch.innerNext(HashAggBatch.java:267) ~[classes/:na]
{noformat}

This is the same line shown for the first exception.

Increasing memory to {{64_000_000}} triggered the second error again.

Increasing memory to {{66_000_000}} triggered the first error again.

The errors recurred as the memory limit was raised (later jumping by 20 MB at 
a time) up to 140 MB, at which point the test failed because the query ran but 
did not spill. The query still fails with a memory limit of 130 MB.
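
The sweep was done by hand, editing the limit in the test; it is roughly 
equivalent to the following loop over the test's own {{testSpill}} helper (a 
sketch of the experiment, not committed test code; parameters copied from 
{{testNoPredictHashAggrSpill}}, and the exact values of the larger 20 MB steps 
are approximate):

{code:java}
// Sketch of the manual experiment only: rerun the no-prediction spill test
// inside TestHashAggrSpill at increasing memory limits and note which
// failure mode appears at each step.
int[] limits = {58_000_000, 60_000_000, 62_000_000, 64_000_000, 66_000_000,
    80_000_000, 100_000_000, 120_000_000, 140_000_000};
for (int limit : limits) {
  testSpill(limit, 16, 2, 2, false, false /* no prediction */, null,
      DEFAULT_ROW_COUNT, 1, 1, 1);
}
{code}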

At {{135_000_000}} the query works, but the returned row count is wrong:

{noformat}
java.lang.AssertionError: expected:<2400000> but was:<2334465>
{noformat}

At:

{code:java}
  private void runAndDump(...
      assertEquals(expectedRows, summary.recordCount());
{code}

There does seem to be something wrong with this code path. All other tests run 
fine with the new mock data source (and adjusted row counts).

I have disabled the offending test until this bug can be fixed.



