Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Reynold Xin
What's the plan if you run explain? In 1.5 the default should be TungstenAggregate, which does spill (switching from hash-based aggregation to sort-based aggregation). On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah wrote: > Hi everyone, > > I’m debugging some slowness and

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
Cc: "dev@spark.apache.org" <dev@spark.apache.org>, Mingyu Kim <m...@palantir.com>, Peter Faiman <peterfai...@palantir.com> Subject: Re: DataFrames Aggregate does not spill? What's the plan if you run explain? In 1.5 the default should be TungstenAggregate, whi

DataFrames Aggregate does not spill?

2015-09-21 Thread Matt Cheah
Hi everyone, I'm debugging some slowness and apparent memory pressure + GC issues after I ported some workflows from raw RDDs to DataFrames. In particular, I'm looking into an aggregation workflow that computes many aggregations per key at once. My workflow before was doing a fairly