On Sun, May 2, 2010 at 5:45 PM, Sean Owen <sro...@gmail.com> wrote:

> Not surprising indeed, that won't scale at some point.
> What is the stage that needs everything in memory? Maybe describing
> that helps imagine solutions.

The algorithm is simple. For each point read into the mapper, find the
canopy it is closest to (in the in-memory List<>) and add the point to
that canopy; if the distance to the closest canopy is greater than the
threshold t1, create a new canopy instead (and add it to the in-memory
List<>).
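
A minimal sketch of that per-point step, assuming a hypothetical Canopy
class and plain Euclidean distance standing in for whatever
DistanceMeasure is configured (these names are illustrative, not the
actual Mahout classes):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the actual Mahout canopy code.
public class CanopyAssignmentSketch {

  // A canopy keeps its center and the points bound to it.
  static class Canopy {
    final double[] center;
    final List<double[]> points = new ArrayList<double[]>();

    Canopy(double[] center) {
      this.center = center;
      this.points.add(center);
    }
  }

  // Euclidean distance as a stand-in for the configured DistanceMeasure.
  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }

  // The per-point logic described above: add the point to the closest
  // canopy, or create a new canopy if every canopy is farther than t1.
  static void assign(double[] point, List<Canopy> canopies, double t1) {
    Canopy closest = null;
    double closestDist = Double.MAX_VALUE;
    for (Canopy c : canopies) {
      double d = distance(point, c.center);
      if (d < closestDist) {
        closestDist = d;
        closest = c;
      }
    }
    if (closest != null && closestDist <= t1) {
      closest.points.add(point);
    } else {
      canopies.add(new Canopy(point)); // the in-memory list grows here
    }
  }
}

The out-of-memory behaviour comes from that growing list: with a poor
t1 nearly every point becomes a new canopy center, and every center is
held for the lifetime of the mapper.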
> The typical reason for this, in my experience back in the day, was
> needing to look up data infrequently in a key-value way.
> "Side-loading" off HDFS (well, GFS via Bigtable) was reasonable. For
> whatever reason I cannot get any reasonable performance out of MapFile
> in this regard.
>
> Another common pattern seems to be that you need two or more kinds of
> values for a key in order to perform a computation. (For example, in
> recommendations I'd need user vectors and matrix rows, both.) The
> natural solution is to load one of them into memory and map the others
> into the computation.
>
> Instead I very much like Ankur's trick(s) for this situation: use two
> mappers, which Hadoop allows. They output different value types,
> though, V1 and V2. So create a sort of "V1OrV2Writable" that can hold
> one or the other (sketched after the quoted thread below). It's simple
> to tell them apart in the mapper.
>
> There are even further tricks to ensure you get V1 or V2 first if needed.
>
> Don't know if that helps but might inspire ideas.
>
> On Sun, May 2, 2010 at 12:14 PM, Robin Anil <robin.a...@gmail.com> wrote:
> > Keeping all canopies in memory is not making things scale. I frequently
> > run into out of memory errors when the distance thresholds are not good
> > on reuters. Any ideas on optimizing this?
> >
> > Robin
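
A rough sketch of the "V1OrV2Writable" trick Sean describes above,
assuming Hadoop's Writable interface. The class name, the tag byte, and
the concrete payload types (Text and IntWritable as stand-ins for V1
and V2) are illustrative, not an existing Mahout class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Holds exactly one of two value types so the outputs of two different
// mappers can share a single value class.
public class V1OrV2Writable implements Writable {

  private static final byte V1_TAG = 1;
  private static final byte V2_TAG = 2;

  private byte tag;
  private final Text v1 = new Text();               // stands in for V1
  private final IntWritable v2 = new IntWritable(); // stands in for V2

  public void setV1(Text value) {
    tag = V1_TAG;
    v1.set(value);
  }

  public void setV2(IntWritable value) {
    tag = V2_TAG;
    v2.set(value.get());
  }

  public boolean isV1() {
    return tag == V1_TAG;
  }

  public Text getV1() {
    return v1;
  }

  public IntWritable getV2() {
    return v2;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeByte(tag); // a single tag byte says which payload follows
    if (tag == V1_TAG) {
      v1.write(out);
    } else {
      v2.write(out);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    tag = in.readByte();
    if (tag == V1_TAG) {
      v1.readFields(in);
    } else {
      v2.readFields(in);
    }
  }
}

Downstream, isV1() tells the two kinds of values apart; if one kind
really must arrive first for a key, a secondary sort on the tag is one
common way to arrange that.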