Disabling hash aggregation makes Drill fall back to streaming aggregation
plus a sort. This lets you handle larger data and spill to disk if necessary.
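
For reference, the switch itself is a standard Drill option; disabling it
for the current session only (rather than system-wide, which is the other
choice) looks like:

  ALTER SESSION SET `planner.enable_hashagg` = false;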

As stated in the documentation, starting from Drill 1.5 the default
memory limit of the sort operator may not be enough to process large data,
but you can bump it up by increasing planner.memory.max_query_memory_per_node
(defaults to 2GB) and, if necessary, reducing planner.width.max_per_node
(defaults to 75% of the number of cores).
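
For example (the numbers below are placeholders to illustrate the syntax,
not recommendations; tune them for your cluster):

  -- raise the per-node query memory budget to 8GB (value is in bytes)
  ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 8589934592;
  -- cap per-node parallelism if needed
  ALTER SESSION SET `planner.width.max_per_node` = 8;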

You said disabling hash aggregate and hash join causes a memory leak. Can
you give more details about the error? The query may fail with an
out-of-memory error, but it shouldn't leak.

On Fri, Mar 11, 2016 at 10:53 PM, John Omernik <j...@omernik.com> wrote:

> I've had some luck disabling multi-phase aggregations on some queries where
> memory was an issue.
>
> https://drill.apache.org/docs/guidelines-for-optimizing-aggregation/
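>
> For reference, the knob described on that page is a standard Drill option;
> turning it off for the session is a one-liner:
>
>   ALTER SESSION SET `planner.enable_multiphase_agg` = false;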
>
> After I try that, then I typically look at the hash aggregation as you have
> done:
>
>
> https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/
>
> I've had limited success with changing max_query_memory_per_node and
> max_width; sometimes it's a weird combination of settings that works.
>
> https://drill.apache.org/docs/troubleshooting/#memory-issues
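>
> One thing that helps me untangle that "weird combination" is to dump the
> current planner settings first and tweak from there, e.g.:
>
>   SELECT * FROM sys.options WHERE name LIKE 'planner.%';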
>
> Back to your spill question: if you disable hash aggregation, do you know
> whether your spill directories are set up? That may be part of the issue;
> I am not sure what Drill's default spill directory setup is.
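>
> If they aren't, the external sort spill location can be pointed somewhere
> explicit in drill-override.conf, along these lines (the path below is just
> an example, not a recommendation; check the keys against your Drill
> version's drill-module.conf):
>
>   drill.exec.sort.external.spill: {
>     directories: [ "/tmp/drill/spill" ],
>     fs: "file:///"
>   }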
>
>
>
> On Fri, Mar 11, 2016 at 2:17 PM, François Méthot <fmetho...@gmail.com>
> wrote:
>
> > Hi,
> >
> >    Using version 1.5, DirectMemory is currently set at 32GB, heap is at
> > 8GB. We have been trying to perform multiple aggregations in one query
> > (see below) on 40+ billion rows stored on 13 nodes. We are using the
> > Parquet format.
> >
> > We keep getting OutOfMemoryException: Failure allocating buffer..
> >
> > on a query that looks like this:
> >
> > create table hdfs.`test1234` as
> > (
> > select string_field1,
> >           string_field2,
> >           min ( int_field3 ),
> >           max ( int_field4 ),
> >           count(1),
> >           count ( distinct int_field5 ),
> >           count ( distinct int_field6 ),
> >           count ( distinct string_field7 )
> >     from hdfs.`/data/`
> >     group by string_field1, string_field2
> > );
> >
> > The documentation states:
> > "Currently, hash-based operations do not spill to disk as needed."
> >
> > and
> >
> > "If the hash-based operators run out of memory during execution, the
> query
> > fails. If large hash operations do not fit in memory on your system, you
> > can disable these operations. When disabled, Drill creates alternative
> > plans that allow spilling to disk."
> >
> > My understanding is that it will fall back to streaming aggregation,
> > which requires sorting.
> >
> > but
> >
> > "As of Drill 1.5, ... the sort operator (in queries that ran successfully
> > in previous releases) may not have enough memory, resulting in a failed
> > query"
> >
> > And indeed, disabling hash agg and hash join resulted in a memory leak
> > error.
> >
> > So it looks like increasing direct memory is our only option.
> >
> > Is there a plan to have hash aggregation spill to disk in the next
> > release?
> >
> >
> > Thanks for your feedback
> >
>



-- 

Abdelhakim Deneche

Software Engineer
