On Mon, Dec 14, 2009 at 4:22 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> On Mon, Dec 14, 2009 at 1:39 AM, Sean Owen <sro...@gmail.com> wrote:
>
>> I get it. My concern is that I suspect (but don't know) it's
>> disk-bound right now, as it seems to be bogging down loading all those
>> vectors. But I think your approach addresses it.
>>
>
> I'm not sure if you're disk-bound - but in each map task you run, you're
> opening up a new SequenceFile.Reader to zip through the entire
> co-occurrence matrix, and this is causing you to lose all localization,
> really: that entire matrix would need to get streamed over HDFS to
> the mapper node! So I'd imagine that in a truly distributed setup the
> performance would be even worse (in comparison to the other tasks
> in the job) than what you're seeing now.
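(As an aside, the pattern being described looks roughly like the sketch below. The path and the IntWritable / Mahout VectorWritable row type are assumptions for illustration, not the actual job code; the point is just that every mapper re-opens the file and streams the whole matrix, so none of it is local to the task.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public class CooccurrenceScan {
  // Called from every map task: no data locality, the whole
  // co-occurrence matrix crosses the wire to each mapper node.
  static void scanAllRows(Configuration conf, Path cooccurrencePath) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, cooccurrencePath, conf);
    try {
      IntWritable itemID = new IntWritable();
      VectorWritable row = new VectorWritable();
      while (reader.next(itemID, row)) {
        // ... accumulate partial recommendation scores from this row ...
      }
    } finally {
      reader.close();
    }
  }
}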
Update to the latest code -- I'm using MapFiles now, which is more
reasonable for the random-access nature of the lookups involved in
computing the recommendation by using the columns rather than every row.

> Yeah, I'm digging into the code (I get what's happening in it now),
> but as I'm away from the office, I'm running on a poor laptop too,
> so I can only test pseudo-distributed mode myself. But that's better
> than nothing! I'll see if I can try out the approach I outlined when
> I get a chance.

You're welcome to commit something and let me test it out. I'm quite
set up to run this against big data.
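For illustration, a minimal sketch of what the MapFile-based lookup amounts to -- the directory name, the IntWritable item-ID key and the VectorWritable column type are assumptions, not necessarily what's committed. The point is that a MapFile keeps an index alongside the sorted data, so fetching one column is a seek rather than a scan of the whole matrix.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.mahout.math.VectorWritable;

public class CooccurrenceColumnLookup {

  private final MapFile.Reader reader;

  CooccurrenceColumnLookup(Configuration conf, String cooccurrenceDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // MapFile = sorted data file + index file, so get() can seek
    // instead of streaming the entire matrix.
    this.reader = new MapFile.Reader(fs, cooccurrenceDir, conf);
  }

  /** Fetch the co-occurrence column for one item; returns null if absent. */
  VectorWritable lookupColumn(int itemID) throws IOException {
    VectorWritable column = new VectorWritable();
    return (VectorWritable) reader.get(new IntWritable(itemID), column);
  }

  void close() throws IOException {
    reader.close();
  }
}

A mapper could open one of these in its setup and then look up only the columns for the items the current user has preferences for, rather than touching every row.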