We've done this with reduce - that definitely works. I've reworked the logic to use accumulators because, when it works, it's 5-10x faster.
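As a rough check of the memory estimate in the quoted question below, here is a minimal sketch in plain Python (no Spark involved). It assumes "1.3m" means about 1.3 MB per accumulator copy, and it assumes the client would need only the running total plus one incoming copy if results were merged as they arrive. The function names are hypothetical and only illustrate the arithmetic; this is not how Spark itself merges accumulators.

```python
# Assumption: "1.3m" in the quoted question means roughly 1.3 MB per copy.
BYTES_PER_COPY = 1.3 * 1024 * 1024
NUM_TASKS = 400  # task count given in the quoted question


def peak_bytes_all_at_once(num_tasks, bytes_per_copy):
    """Peak client memory if every task's copy is held before merging."""
    return num_tasks * bytes_per_copy


def peak_bytes_incremental(bytes_per_copy):
    """Peak client memory if each copy is merged on arrival.

    Only the running total and the one incoming copy are alive at once.
    """
    return 2 * bytes_per_copy


def merge_incrementally(partials):
    """Element-wise sum of equal-length arrays, one partial at a time."""
    total = None
    for part in partials:
        if total is None:
            total = list(part)  # copy the first partial as the running total
        else:
            for i, value in enumerate(part):
                total[i] += value
    return total


if __name__ == "__main__":
    mb = 1024 * 1024
    print("all at once (MB):", peak_bytes_all_at_once(NUM_TASKS, BYTES_PER_COPY) / mb)
    print("incremental (MB):", peak_bytes_incremental(BYTES_PER_COPY) / mb)
    print("merged:", merge_incrementally(iter([[1, 2], [3, 4], [5, 6]])))
```

Under those assumptions the all-at-once figure comes out near the "half a gig" mentioned in the question, while merging on arrival would stay at a few megabytes.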

On Fri, Nov 21, 2014 at 4:44 AM, Sean Owen <so...@cloudera.com> wrote:

> This sounds more like a use case for reduce? or fold? it sounds like
> you're kind of cobbling together the same function on accumulators,
> when reduce/fold are simpler and have the behavior you suggest.
>
> On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld
> <nkronenf...@oculusinfo.com> wrote:
> > I think I understand what is going on here, but I was hoping someone could
> > confirm (or explain reality if I don't) what I'm seeing.
> >
> > We are collecting data using a rather sizable accumulator - essentially, an
> > array of tens of thousands of entries. All told, about 1.3m of data.
> >
> > If I understand things correctly, it looks to me like, when our job is done,
> > a copy of this array is retrieved from each individual task, all at once,
> > for combination on the client - which means, with 400 tasks to the job, each
> > collection is using up half a gig of memory on the client.
> >
> > Is this true? If so, does anyone know a way to get accumulators to
> > accumulate as results collect, rather than all at once at the end, so we
> > only have to hold a few in memory at a time, rather than all 400?
> >
> > Thanks,
> > -Nathan
> >
> >
> > --
> > Nathan Kronenfeld
> > Senior Visualization Developer
> > Oculus Info Inc
> > 2 Berkeley Street, Suite 600,
> > Toronto, Ontario M5A 4J5
> > Phone: +1-416-203-3003 x 238
> > Email: nkronenf...@oculusinfo.com
>

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com