On Mon, Dec 14, 2009 at 4:22 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> On Mon, Dec 14, 2009 at 1:39 AM, Sean Owen <sro...@gmail.com> wrote:
>
>> I get it. My concern is that I suspect (but don't know) that it's
>> disk-bound right now, as it seems to be bogging down loading all those
>> vectors. But I think your approach addresses it.
>>
>
> I'm not sure you're disk-bound - but in each map task you run, you're
> opening up a new SequenceFile reader to zip through the entire
> co-occurrence matrix, and this causes you to lose all data locality:
> that entire matrix has to be streamed over HDFS to the mapper node!
> So I'd imagine that in a truly distributed setup the performance
> would be even worse (relative to the other tasks in the job) than
> what you're seeing now.

Update to the latest code -- I'm using MapFiles now, which better suits
the random-access nature of the lookups involved: the recommendation is
computed from just the columns that are needed, rather than by scanning
every row.
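To make the change concrete, here's a minimal sketch of the two access
patterns. It assumes the matrix is keyed by IntWritable item IDs with
VectorWritable rows (the package may differ by Mahout version) and uses
hypothetical paths "cooccurrence/part-00000" and "cooccurrence-mapfile";
substitute whatever types and paths the job actually uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
IntWritable itemID = new IntWritable();
VectorWritable row = new VectorWritable();

// Before: every map task streamed the whole co-occurrence matrix,
// so the full file crossed HDFS to each mapper regardless of locality.
SequenceFile.Reader seq =
    new SequenceFile.Reader(fs, new Path("cooccurrence/part-00000"), conf);
try {
  while (seq.next(itemID, row)) {
    // ... touches every row, even ones this mapper never needs
  }
} finally {
  seq.close();
}

// After: a MapFile keeps a sorted index alongside the data, so a
// mapper can seek straight to the columns it actually needs.
MapFile.Reader lookup = new MapFile.Reader(fs, "cooccurrence-mapfile", conf);
try {
  itemID.set(42); // hypothetical item of interest
  if (lookup.get(itemID, row) != null) {
    // ... use just this column in the recommendation computation
  }
} finally {
  lookup.close();
}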

> Yeah, I'm digging into the code (I get what's happening in it now),
> but as I'm away from the office, I'm running on a poor laptop too,
> so I can only test in pseudo-distributed mode myself.  But that's
> better than nothing!  I'll see if I can try out the approach I
> outlined when I get a chance.

You're welcome to commit something and let me test it out. I'm quite
set up to run this against big data.
