Hi,

   Using version 1.5, direct memory is currently set to 32 GB and heap to
8 GB. We have been trying to perform multiple aggregations in one query (see
below) on 40 billion+ rows stored across 13 nodes, in Parquet format.
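
For reference, those limits are the ones set in conf/drill-env.sh on each
drillbit (values as on our cluster):

DRILL_MAX_DIRECT_MEMORY="32G"
DRILL_HEAP="8G"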

We keep getting "OutOfMemoryException: Failure allocating buffer." on a
query that looks like this:

create table hdfs.`test1234` as
(
select string_field1,
       string_field2,
       min(int_field3),
       max(int_field4),
       count(1),
       count(distinct int_field5),
       count(distinct int_field6),
       count(distinct string_field7)
  from hdfs.`/data/`
 group by string_field1, string_field2
);

The documentation states:
"Currently, hash-based operations do not spill to disk as needed."

and

"If the hash-based operators run out of memory during execution, the query
fails. If large hash operations do not fit in memory on your system, you
can disable these operations. When disabled, Drill creates alternative
plans that allow spilling to disk."

My understanding is that Drill will then fall back to streaming
aggregation, which requires a sort.

but

"As of Drill 1.5, ... the sort operator (in queries that ran successfully
in previous releases) may not have enough memory, resulting in a failed
query"

And indeed, when we disabled hash agg and hash join, the query failed with
a memory leak error.
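
For the record, we disabled them with the standard session options:

alter session set `planner.enable_hashagg` = false;
alter session set `planner.enable_hashjoin` = false;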

So it looks like increasing direct memory is our only option.

Is there a plan for hash aggregation to spill to disk in the next
release?


Thanks for your feedback.
