PS. Just another detail. Most of the time I don't actually run into long vectors. But occasionally, among millions of them, I may run into a 10-20 or even 50 million-long one. Which means that not only do I have to adjust algorithm settings to allow 400m for prebuffering (which is all I have), I actually have to allow for the worst case, i.e. what happens in only 0.1% of the cases.
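As a quick sanity check on the sizes discussed in this thread, here is a tiny back-of-the-envelope estimate. This is a sketch only; it assumes one 8-byte double per dense element and ignores JVM object headers and serialization overhead, which the thread's figures do not specify.

// Back-of-the-envelope check of the vector sizes quoted in this thread.
// Assumption (mine, not from the thread): 8 bytes per dense double element,
// no JVM object or serialization overhead counted.
public class VectorMemoryEstimate {
  public static void main(String[] args) {
    long[] counts = {1_000_000L, 50_000_000L, 100_000_000L};
    for (long n : counts) {
      long bytes = n * Double.BYTES;  // 8 bytes per dense double element
      System.out.printf("%,d elements -> ~%,d MB%n", n, bytes / (1024 * 1024));
    }
    // 1M elements   ->   ~7 MB  (the "8MB just to load it" case quoted below)
    // 50M elements  -> ~381 MB  (the ~400m prebuffering worst case above)
    // 100M elements -> ~762 MB  (the "~800MB to load a vector" case quoted below)
  }
}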
I can't accept the answer "use -Xmx1G" in map processes, for reasons that I can't change. Absent that solution, I realistically don't see how I can go without a push technique for accessing the vectors (a rough sketch of what such push access could look like follows the quoted thread below). You are welcome to suggest an alternative solution to this conundrum, but I see nothing in the current code that would help fix it without introducing some (very small amount of) new code.

On Mon, Dec 13, 2010 at 11:19 AM, Dmitriy Lyubimov <[email protected]> wrote:

> Ted,
>
> Like I commented in the issue, 100m dense data elements means ~800MB to load
> a vector. If (suppose) I have 1G in a mapper (which is not terribly common;
> not in our case anyway, we only run jobs with ~500M), should I then require
> 80% of RAM for prebuffering (i.e. just being able to get access to the data)
> and leave 20% for the algorithm? Setting aside other considerations, doesn't
> this strike you as being off-balance in requirements? It does to me, and
> unfortunately, with the hardware constraints I've got, I won't be able to
> afford that luxury in my practical application. So it is a problem for my
> task at the moment, believe it or not. At this point even 100MB for
> prebuffering the data is prohibitively costly in my app.
>
> -d
>
> On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[email protected]> wrote:
>
>> I really don't see this as a big deal even with crazy big vectors.
>>
>> Looking at web scale, for instance, the most-linked Wikipedia article only
>> has 10 million in-links or so. On the web, the most massive web site is
>> unlikely to have >100 million in-links. Both of these fit in very modest
>> amounts of memory.
>>
>> Where's the rub?
>>
>> On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > Jake,
>> > No, I was trying exactly what you were proposing some time ago on the
>> > list: I am trying to make long vectors not occupy a lot of memory.
>> >
>> > E.g. a 1m-long dense vector would require 8MB just to load it. And I am
>> > saying, hey, there are a lot of sequential techniques that could provide
>> > a handler that inspects the vector element by element without having to
>> > preallocate 8MB.
>> >
>> > For 1 million-long vectors that isn't too scary, but it starts to be at
>> > default Hadoop memory settings in the area of 50-100MB (or 5-10 million
>> > non-zero elements). Stochastic SVD will survive that, but it means less
>> > memory for blocking, and the more blocks you have, the more CPU it
>> > requires (although CPU demand is only linear in the number of blocks,
>> > and only in a significantly smaller part of the computation, so that
>> > only an insignificant share of the total CPU flops depends on the number
>> > of blocks; but there is a part that does, still).
>> >
>> > Like I said, it would also address the case where rows don't fit in
>> > memory (hence no memory bound for n of A), but the most immediate
>> > benefit is to the speed, scalability, and memory requirements of SSVD in
>> > most practical LSI cases.
>> >
>> > -Dmitriy
>> >
>> > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[email protected]>
>> > wrote:
>> >
>> > > Hey Dmitriy,
>> > >
>> > > I've also been playing around with a VectorWritable format which is
>> > > backed by a SequenceFile, but I've been focused on the case where it's
>> > > essentially the entire matrix, and the rows don't fit into memory.
>> > > This seems different from your current use case, however - you just
>> > > want (relatively) small vectors to load faster, right?
>> > >
>> > > -jake
>> > >
>> > > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <[email protected]>
>> > > wrote:
>> > >
>> > > > Interesting idea.
>> > > >
>> > > > Would this introduce a new vector type that only allows iterating
>> > > > through the elements once?
>> > > >
>> > > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I would like to submit a patch to VectorWritable that allows for
>> > > > > streaming access to vector elements without having to prebuffer
>> > > > > all of them first (the current code allows for the latter only).
>> > > > >
>> > > > > That patch would strike down one of the memory usage issues in the
>> > > > > current Stochastic SVD implementation and effectively open the
>> > > > > memory bound for n of the SVD work. (The value I see is not so
>> > > > > much to open up the bound as to be more efficient in memory use,
>> > > > > thus essentially speeding up the computation.)
>> > > > >
>> > > > > If it's ok, I would like to create a JIRA issue and provide a
>> > > > > patch for it.
>> > > > >
>> > > > > Another issue is to provide an SSVD patch that depends on that
>> > > > > patch for VectorWritable.
>> > > > >
>> > > > > Thank you.
>> > > > > -Dmitriy
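To make the proposal above concrete, here is a minimal, hypothetical sketch of the kind of push-style element access being discussed. The StreamingVectorSketch class, the ElementHandler callback, and the trivial framing (an int element count followed by that many doubles) are illustrative assumptions of mine; this is not Mahout's actual VectorWritable wire format, nor the patch itself.

import java.io.DataInput;
import java.io.IOException;

public final class StreamingVectorSketch {

  /** Callback invoked once per element as it is decoded from the stream. */
  public interface ElementHandler {
    void onElement(int index, double value);
  }

  /**
   * Reads a trivially framed dense vector (an int element count followed by
   * that many doubles) and pushes each element to the handler as it is read,
   * so memory use stays O(1) instead of O(vector length).
   */
  public static void readDense(DataInput in, ElementHandler handler) throws IOException {
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      handler.onElement(i, in.readDouble());
    }
  }

  private StreamingVectorSketch() {}
}

With something along these lines, a consumer such as an SSVD mapper could fold each pushed (index, value) pair into its running block computation as it arrives, so peak memory would no longer have to scale with the longest row.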
