On Sun, May 2, 2010 at 5:45 PM, Sean Owen <sro...@gmail.com> wrote:

> Not surprising indeed; that won't scale at some point.
> What is the stage that needs everything in memory? Maybe describing
> that would help suggest solutions.
>
The algorithm is simple. For each point read into the mapper:
  - find the canopy it is closest to (in the in-memory List<Canopy>) and, if it
    is within the threshold, add the point to that canopy;
  - else, if the distance to the closest canopy is greater than the threshold
    t1, create a new canopy (and append it to the in-memory List<Canopy>).
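
Roughly, the per-point loop looks like this (just a sketch, not the actual
Mahout code; "Canopy", "DistanceMeasure" and the t1 value here are stand-ins
for whatever the real job wires in):

import java.util.ArrayList;
import java.util.List;

public class CanopyAssignmentSketch {

  /** Placeholder for whatever distance measure the job is configured with. */
  interface DistanceMeasure {
    double distance(double[] a, double[] b);
  }

  /** Placeholder canopy: a center plus the points assigned to it. */
  static class Canopy {
    final double[] center;
    final List<double[]> points = new ArrayList<double[]>();
    Canopy(double[] center) {
      this.center = center;
      this.points.add(center);
    }
  }

  /** Called once per input point; "canopies" is the in-memory List<> that grows. */
  static void addPointToCanopies(double[] point, List<Canopy> canopies,
                                 DistanceMeasure measure, double t1) {
    Canopy closest = null;
    double minDist = Double.MAX_VALUE;
    for (Canopy canopy : canopies) {          // scan every canopy held in memory
      double d = measure.distance(point, canopy.center);
      if (d < minDist) {
        minDist = d;
        closest = canopy;
      }
    }
    if (closest != null && minDist <= t1) {
      closest.points.add(point);              // close enough: join the nearest canopy
    } else {
      canopies.add(new Canopy(point));        // farther than t1: start a new canopy
    }
  }
}

The List<Canopy> is the part that has to fit in memory, and it grows with every
point that lands farther than t1 from everything seen so far, which is exactly
where the out-of-memory errors on Reuters come from when the thresholds are off.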



> The typical reason for this, in my experience back in the day, was
> needing to look up data infrequently in a key-value way.
> "Side-loading" off HDFS (well, GFS via Bigtable) was reasonable. For
> whatever reason I cannot get any reasonable performance out of MapFile
> in this regard.
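
In case it helps anyone reading along, the MapFile side-loading Sean describes
would look roughly like the sketch below (the "side.data.path" parameter and
the Text key/value types are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideLoadingMapper extends Mapper<Text, Text, Text, Text> {

  private MapFile.Reader sideData;
  private final Text lookedUp = new Text();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // "side.data.path" is a hypothetical job parameter pointing at the MapFile directory.
    sideData = new MapFile.Reader(fs, conf.get("side.data.path"), conf);
  }

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // Random-access lookup against the side data per input record; this is the
    // call whose performance is the sticking point.
    if (sideData.get(key, lookedUp) != null) {
      context.write(key, lookedUp);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    sideData.close();
  }
}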
>
> Another common pattern seems to be that you need two or more kinds of
> values for a key in order to perform a computation (for example, in
> recommendations I'd need both user vectors and matrix rows). The
> natural solution is to load one of them into memory and map the others
> into the computation.
>
> Instead I very much like Ankur's trick(s) for this situation: use two
> mappers, which Hadoop allows. They output different value types
> though, V1 and V2. So create a sort of "V1OrV2Writable" that can hold
> one or the other. It's simple to tell them apart in the reducer.
>
> There are even further tricks to ensure you get V1 or V2 first if needed.
>
> Don't know if that helps but might inspire ideas.
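
This isn't Ankur's actual code, but the union-writable idea might look
something like the sketch below, with Text and IntWritable standing in for the
real V1 and V2 types (Hadoop's GenericWritable is a ready-made variant of the
same idea):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** Carries either a "V1" (here: Text) or a "V2" (here: IntWritable), plus a tag byte. */
public class V1OrV2Writable implements Writable {

  private static final byte V1_TAG = 1;
  private static final byte V2_TAG = 2;

  private byte tag;
  private final Text v1 = new Text();                // stands in for the real V1 type
  private final IntWritable v2 = new IntWritable();  // stands in for the real V2 type

  public void setV1(Text value) { tag = V1_TAG; v1.set(value); }
  public void setV2(int value)  { tag = V2_TAG; v2.set(value); }

  public boolean isV1() { return tag == V1_TAG; }
  public Text getV1() { return v1; }
  public IntWritable getV2() { return v2; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeByte(tag);                 // write the tag so the reader knows what follows
    if (tag == V1_TAG) {
      v1.write(out);
    } else {
      v2.write(out);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    tag = in.readByte();                // read the tag, then the matching value
    if (tag == V1_TAG) {
      v1.readFields(in);
    } else {
      v2.readFields(in);
    }
  }
}

The reducer then checks isV1() on each value to tell the two streams apart, and
presumably the "further tricks" amount to folding the tag into the sort order
so that one type reliably arrives before the other.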
>
>
>
> On Sun, May 2, 2010 at 12:14 PM, Robin Anil <robin.a...@gmail.com> wrote:
> > Keeping all canopies in memory is what keeps this from scaling. I frequently
> > run into out-of-memory errors on Reuters when the distance thresholds are
> > not well chosen. Any ideas on optimizing this?
> >
> > Robin
> >
>
