I'm not sure if it's an exact match, or just very close :-)

I don't think our problem is the workload on the driver; I think it's just
memory. So while the solution proposed there would work, I believe it would
also be sufficient for our purposes simply to clear each block as soon as
it's added into the canonical version, and to do that as early as possible.
I could be misunderstanding some of the timing, though; I'm still
investigating.
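
As a rough illustration of the difference (plain Python, not Spark's actual
driver code; every function name here is hypothetical, and only the 400 tasks
and the roughly 1.3 MB per copy come from this thread), merging each block as
it arrives and then dropping it keeps one block alive at a time instead of
all of them:

```python
# Illustrative only: plain Python, not Spark's actual driver code.
TASKS = 400        # task count mentioned in the thread
BLOCK_MB = 1.3     # approximate size of one accumulator copy


def make_block(task_id, size):
    """Hypothetical stand-in for the accumulator copy one task returns."""
    return [task_id] * size


def merge_into(total, block):
    """Add one task's block into the canonical copy, element by element."""
    for i, value in enumerate(block):
        total[i] += value


def merge_all_at_once(tasks, size):
    """Gather every block first, then combine: all blocks are alive at once."""
    blocks = [make_block(t, size) for t in range(tasks)]
    total = [0] * size
    for block in blocks:
        merge_into(total, block)
    return total, len(blocks)


def merge_eagerly(tasks, size):
    """Merge each block as it arrives and drop it before taking the next."""
    total = [0] * size
    peak_held = 0
    for t in range(tasks):
        block = make_block(t, size)
        peak_held = max(peak_held, 1)  # only the current block is held
        merge_into(total, block)
        del block                      # clear it once it is merged
    return total, peak_held


def peak_mb(blocks_held, block_mb=BLOCK_MB):
    """Rough peak memory for the blocks held at once, in MB."""
    return blocks_held * block_mb
```

Both functions produce the same totals; with the figures above, holding all
400 blocks comes to roughly 520 MB, against about 1.3 MB when only the
current block is held.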

Combining on the worker before returning, as he suggests, would probably be
even better, though.
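
A minimal sketch of that idea, again in plain Python rather than Spark (the
helper names and the worker count of 8 are assumptions for illustration, not
anything taken from the ticket): each worker folds the blocks of its own
tasks together, so the driver receives one block per worker instead of one
per task.

```python
# Illustrative only: simulates worker-side combining without Spark.
def combine_per_worker(task_blocks, worker_of):
    """Fold each task's block into a single block per worker.

    task_blocks: dict mapping task id -> list of numbers
    worker_of:   hypothetical function mapping task id -> worker id
    """
    per_worker = {}
    for task_id, block in task_blocks.items():
        worker = worker_of(task_id)
        if worker not in per_worker:
            per_worker[worker] = list(block)   # first block seen on this worker
        else:
            combined = per_worker[worker]
            for i, value in enumerate(block):
                combined[i] += value
    return per_worker


def driver_merge(per_worker, size):
    """Merge the (much fewer) per-worker blocks on the driver."""
    total = [0] * size
    for block in per_worker.values():
        for i, value in enumerate(block):
            total[i] += value
    return total


if __name__ == "__main__":
    size, tasks, workers = 4, 400, 8           # worker count is an assumption
    task_blocks = {t: [t] * size for t in range(tasks)}
    per_worker = combine_per_worker(task_blocks, lambda t: t % workers)
    print(len(per_worker), "blocks reach the driver instead of", tasks)
    print(driver_merge(per_worker, size))
```

The driver then only ever sees as many blocks as there are workers, however
many tasks the job has.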

On Fri, Nov 21, 2014 at 6:08 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Nathan,
>
> It sounds like what you're asking for has already been filed as
> https://issues.apache.org/jira/browse/SPARK-664  Does that ticket match
> what you're proposing?
>
> Andrew
>
> On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
>> We've done this with reduce - that definitely works.
>>
>> I've reworked the logic to use accumulators because, when it works, it's
>> 5-10x faster.
>>
>> On Fri, Nov 21, 2014 at 4:44 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> This sounds more like a use case for reduce? Or fold? It sounds like
>>> you're kind of cobbling together the same function on accumulators,
>>> when reduce/fold are simpler and have the behavior you suggest.
>>>
>>> On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld
>>> <nkronenf...@oculusinfo.com> wrote:
>>> > I think I understand what is going on here, but I was hoping someone
>>> > could confirm (or explain reality if I don't) what I'm seeing.
>>> >
>>> > We are collecting data using a rather sizable accumulator -
>>> > essentially, an array of tens of thousands of entries.  All told,
>>> > about 1.3 MB of data.
>>> >
>>> > If I understand things correctly, it looks to me like, when our job
>>> > is done, a copy of this array is retrieved from each individual task,
>>> > all at once, for combination on the client - which means, with 400
>>> > tasks to the job, each collection is using up half a gig of memory on
>>> > the client.
>>> >
>>> > Is this true?  If so, does anyone know a way to get accumulators to
>>> > accumulate as results collect, rather than all at once at the end, so
>>> > we only have to hold a few in memory at a time, rather than all 400?
>>> >
>>> > Thanks,
>>> >               -Nathan
>>> >
>>> >
>>> > --
>>> > Nathan Kronenfeld
>>> > Senior Visualization Developer
>>> > Oculus Info Inc
>>> > 2 Berkeley Street, Suite 600,
>>> > Toronto, Ontario M5A 4J5
>>> > Phone:  +1-416-203-3003 x 238
>>> > Email:  nkronenf...@oculusinfo.com
>>>
>>
>
>


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
