Not surprising, indeed; that won't scale at some point.
Which stage is it that needs everything in memory? Maybe describing
that will help us imagine solutions.

The typical reason for this, in my experience back in the day, was
needing to look up data infrequently in a key-value way.
"Side-loading" it off HDFS (well, GFS via Bigtable, in my case) was
reasonable. For whatever reason I cannot get any reasonable
performance out of MapFile for this kind of lookup.
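
To make that concrete, here is roughly the kind of lookup I mean
(just a sketch; the MapFile path and the IntWritable / Mahout
VectorWritable key and value types are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.mahout.math.VectorWritable;

// Minimal sketch of a side-loaded key-value lookup against a MapFile
// in HDFS. The path and key/value types are placeholders.
public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] is the MapFile directory, args[1] the key to look up
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));
    VectorWritable value = new VectorWritable();
    if (reader.get(key, value) != null) {
      System.out.println("Found: " + value.get());
    }
    reader.close();
  }
}

In a real job I'd open the reader once in setup() and reuse it across
map() calls rather than per lookup.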

Another common pattern seems to be that you need two or more kinds of
values for a key in order to perform a computation. (For example, in
recommendations I'd need both user vectors and matrix rows.) The
natural solution is to load one of them into memory and map over the
other as part of the computation.
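
A rough sketch of that, assuming (hypothetically) that the in-memory
side is a SequenceFile of vectors keyed by int, with its location
passed as a "vectorsPath" job parameter:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical mapper: loads one data set (vectors keyed by int) into
// memory in setup(), then joins against it while mapping the other.
public class InMemoryJoinMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private final Map<Integer, Vector> sideData = new HashMap<Integer, Vector>();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path path = new Path(conf.get("vectorsPath")); // assumed job parameter
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();
    while (true) {
      // allocate a fresh value each pass so the map doesn't end up
      // holding references to a single reused object
      VectorWritable value = new VectorWritable();
      if (!reader.next(key, value)) {
        break;
      }
      sideData.put(key.get(), value.get());
    }
    reader.close();
  }

  @Override
  protected void map(IntWritable key, VectorWritable row, Context context)
      throws IOException, InterruptedException {
    Vector other = sideData.get(key.get());
    if (other != null) {
      // the real computation combining 'row' and 'other' goes here;
      // passing the row through unchanged is just a placeholder
      context.write(key, row);
    }
  }
}

Which is of course exactly the part that stops scaling once the
in-memory side gets big, hence the alternative below.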

Instead, I very much like Ankur's trick(s) for this situation: use two
mappers, which Hadoop allows. They output different value types,
though, say V1 and V2, so create a sort of "V1OrV2Writable" that can
hold one or the other. It's then simple to tell them apart in the
reducer.
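
Something like this rough sketch, with VectorWritable and IntWritable
standing in for whatever V1 and V2 really are:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.VectorWritable;

// Rough sketch of a "one or the other" Writable. A single flag records
// which value type is present; the reducer checks it to tell them apart.
public class V1OrV2Writable implements Writable {

  private boolean isV1;
  private VectorWritable v1 = new VectorWritable(); // stand-in for the real V1
  private IntWritable v2 = new IntWritable();       // stand-in for the real V2

  public void setV1(VectorWritable value) {
    isV1 = true;
    v1 = value;
  }

  public void setV2(IntWritable value) {
    isV1 = false;
    v2 = value;
  }

  public boolean isV1() {
    return isV1;
  }

  public VectorWritable getV1() {
    return v1;
  }

  public IntWritable getV2() {
    return v2;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeBoolean(isV1);
    if (isV1) {
      v1.write(out);
    } else {
      v2.write(out);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    isV1 = in.readBoolean();
    if (isV1) {
      v1.readFields(in);
    } else {
      v2.readFields(in);
    }
  }
}

Both mappers declare V1OrV2Writable as their output value class (wired
up with, say, MultipleInputs from the old mapred API), and the reducer
just switches on isV1().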

There are even further tricks to ensure the reducer sees all the V1
values before the V2 values (or vice versa), if you need that ordering.
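
The way I'd usually do that (a sketch, not necessarily Ankur's exact
approach) is a secondary sort: fold a tag byte into the key, sort on
(key, tag), but partition and group on the key alone:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: the real (int) key plus a tag byte, 0 for
// V1 and 1 for V2. Sorting on (key, tag) while partitioning and grouping
// on the key alone means each reduce group sees its V1s before its V2s.
public class TaggedKey implements WritableComparable<TaggedKey> {

  private int key;
  private byte tag;

  public void set(int key, byte tag) {
    this.key = key;
    this.tag = tag;
  }

  public int getKey() {
    return key;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(key);
    out.writeByte(tag);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    key = in.readInt();
    tag = in.readByte();
  }

  @Override
  public int compareTo(TaggedKey other) {
    if (key != other.key) {
      return key < other.key ? -1 : 1;
    }
    return tag - other.tag;
  }
}

Pair that with a Partitioner that hashes only getKey(), and a grouping
comparator (setGroupingComparatorClass in the new API,
setOutputValueGroupingComparator on JobConf in the old) that also
compares only getKey(), so both tags for a key end up in the same
reduce call, with the V1 values first.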

Don't know if that helps but might inspire ideas.



On Sun, May 2, 2010 at 12:14 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Keeping all canopies in memory is not making things scale. I frequently run
> into out of memory errors when the distance thresholds are not good on
> reuters. Any ideas on optimizing this?
>
> Robin
>
