Not surprising indeed; that won't scale at some point. What is the stage that needs everything in memory? Maybe describing that helps imagine solutions.
The typical reason for this, in my experience back in the day, was needing to look up data infrequently in a key-value way. "Side-loading" off HDFS (well, GFS via Bigtable) was reasonable. For whatever reason I cannot get any reasonable performance out of MapFile in this regard.

Another common pattern seems to be that you need two or more kinds of values for a key in order to perform a computation. (For example, in recommendations I'd need both user vectors and matrix rows.) The natural solution is to load one of them into memory and map the other into the computation.

Instead I very much like Ankur's trick(s) for this situation: use two mappers, which Hadoop allows. They output different value types, though: V1 and V2. So create a sort of "V1OrV2Writable" that can hold one or the other (rough sketch below the quoted message). It's simple to tell them apart in the reducer. There are even further tricks to ensure you get V1 or V2 first if needed.

Don't know if that helps but might inspire ideas.

On Sun, May 2, 2010 at 12:14 PM, Robin Anil <robin.a...@gmail.com> wrote:
> Keeping all canopies in memory is not making things scale. I frequently run
> into out of memory errors when the distance thresholds are not good on
> reuters. Any ideas on optimizing this?
>
> Robin
>
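
Roughly, the wrapper idea looks like this. This is just a minimal sketch: Text and IntWritable stand in for whatever V1 and V2 actually are, and V1OrV2Writable is an illustrative name, not an existing class.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/**
 * Holds either a "V1" or a "V2" value, so two mappers emitting different
 * value types can share a single map output value class.
 */
public class V1OrV2Writable implements Writable {

  private boolean isV1;                        // tag: which value is held
  private Text v1 = new Text();                // stand-in for V1
  private IntWritable v2 = new IntWritable();  // stand-in for V2

  public void setV1(Text value) {
    isV1 = true;
    v1 = value;
  }

  public void setV2(IntWritable value) {
    isV1 = false;
    v2 = value;
  }

  public boolean isV1() {
    return isV1;
  }

  public Text getV1() {
    return v1;
  }

  public IntWritable getV2() {
    return v2;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeBoolean(isV1);  // write the tag first...
    if (isV1) {
      v1.write(out);         // ...then only the value that is actually held
    } else {
      v2.write(out);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    isV1 = in.readBoolean();
    if (isV1) {
      v1.readFields(in);
    } else {
      v2.readFields(in);
    }
  }
}

Each mapper wraps its output in this one type, and the reducer calls isV1() to separate the two streams for a given key. One way to guarantee you see all V1s before V2s is a secondary sort on the tag, but that part is left out of this sketch.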