PS. Just another detail.

Most of the time I don't actually run into long vectors. But occasionally,
among millions of them, I may hit one that is 10-20 or even 50 million
elements long (a 50 million-element dense row is roughly 400MB of doubles).
Which means that not only do I have to adjust the algorithm settings to
allow 400MB for prebuffering (which is all I have), I actually have to
provision for the worst case, i.e. for what happens in only 0.1% of cases.

I can't accept the answer "use -Xmx1G" for the map processes, for reasons
that I can't change.

Absent such a solution, I realistically don't see how I can get by without
a push technique for accessing the vectors. You are welcome to suggest an
alternative solution to this conundrum, but I see nothing in the current
code that would fix it without introducing some (very small amount of) new
code.
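
To make concrete what I mean by a push technique, here is a minimal sketch.
The names and the trivial dense encoding (an int length followed by that
many doubles) are made up for illustration; this is not the actual patch
and not VectorWritable's real on-disk format. The idea is just that the
reader hands each (index, value) pair to a callback as it comes off the
stream, so the consumer never holds more than one element at a time:

import java.io.DataInput;
import java.io.IOException;

/**
 * Minimal sketch of a "push"-style vector reader (illustrative only; the
 * encoding below is an assumption, not the real VectorWritable format).
 */
public final class StreamingVectorSketch {

  /** Callback invoked once per element; return false to stop early. */
  public interface ElementHandler {
    boolean handle(int index, double value) throws IOException;
  }

  /** Pushes elements to the handler straight off the stream, one at a time. */
  public static void readDense(DataInput in, ElementHandler handler)
      throws IOException {
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      if (!handler.handle(i, in.readDouble())) {
        break; // the consumer has seen enough
      }
    }
  }

  private StreamingVectorSketch() {}
}

An algorithm that only needs a running aggregate (a dot product, a norm, a
row sum) could then consume even a 50 million-element row with constant
extra memory instead of the ~400MB it takes to materialize it first.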



On Mon, Dec 13, 2010 at 11:19 AM, Dmitriy Lyubimov <[email protected]> wrote:

> Ted,
>
> As I commented in the issue, 100m dense data elements means ~800M just to
> load a vector. If (suppose) I have 1G in a mapper (which is not terribly
> common; not in our case, we only run jobs with ~500M), should I then
> require 80% of the RAM for prebuffering (i.e. just to be able to get at
> the data) and leave 20% for the algorithm? Setting aside other
> considerations, doesn't that strike you as an off-balance requirement? It
> does to me, and unfortunately, with the hardware constraints I've got in
> my practical application, I won't be able to afford that luxury. So it is
> a problem for my task at the moment, believe it or not. Right now even
> 100MB for prebuffering the data is prohibitively costly in my app.
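
(For concreteness, the back-of-the-envelope arithmetic behind that ~800M
figure, with the assumptions spelled out: 8 bytes per primitive double,
per-object and VectorWritable overhead ignored, and 1G taken as 10^9 bytes.)

public class DenseRowFootprint {
  public static void main(String[] args) {
    // Assumptions: dense row of primitive doubles (8 bytes each); object
    // headers and VectorWritable overhead ignored; 1G treated as 10^9 bytes.
    long elements = 100000000L;        // 100m dense elements
    long bytes = elements * 8L;        // 800,000,000 bytes, i.e. ~800M
    double shareOfHeap = bytes / 1e9;  // ~0.8 of a 1G mapper heap
    System.out.printf("%,d bytes = %.0f%% of a 1G heap%n",
        bytes, shareOfHeap * 100);
  }
}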
>
> -d
>
>
> On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[email protected]> wrote:
>
>> I really don't see this as a big deal even with crazy big vectors.
>>
>> Looking at web scale, for instance, the most-linked Wikipedia article has
>> only 10 million in-links or so.  On the web, the most massive web site is
>> unlikely to have >100 million in-links.  Both of these fit in very modest
>> amounts of memory.
>>
>> Where's the rub?
>>
>> On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <[email protected]>
>> wrote:
>>
>> > Jake,
>> > No, I was trying exactly what you were proposing some time ago on the
>> > list: I am trying to make long vectors not occupy a lot of memory.
>> >
>> > E.g. a 1m-long dense vector would require 8MB just to load it. And I am
>> > saying, hey, there are a lot of sequential techniques that could work
>> > through a handler that inspects the vector element-by-element without
>> > having to preallocate that 8MB.
>> >
>> > For 1 million-long vectors it isn't too scary, but with the default
>> > Hadoop memory settings it starts to be around 50-100MB (or 5-10 million
>> > non-zero elements). Stochastic SVD will survive that, but it means less
>> > memory for blocking, and the more blocks you have, the more CPU it
>> > requires (although CPU demand is only linear in the number of blocks,
>> > and only for a significantly smaller part of the computation, so only an
>> > insignificant share of the total CPU flops depends on the # of blocks;
>> > still, there is a part that does).
>> >
>> > Like I said, it would also address the case where rows don't fit in
>> > memory (hence no memory bound on n of A), but the most immediate benefit
>> > is to the speed, scalability and memory requirements of SSVD in most
>> > practical LSI cases.
>> >
>> > -Dmitriy
>> >
>> > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[email protected]>
>> > wrote:
>> >
>> > > Hey Dmitriy,
>> > >
>> > >  I've also been playing around with a VectorWritable format which is
>> > > backed by a SequenceFile, but I've been focused on the case where it's
>> > > essentially the entire matrix, and the rows don't fit into memory.
>> > > This seems different from your current use case, however - you just
>> > > want (relatively) small vectors to load faster, right?
>> > >
>> > >  -jake
>> > >
>> > > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <[email protected]>
>> > > wrote:
>> > >
>> > > > Interesting idea.
>> > > >
>> > > > Would this introduce a new vector type that only allows iterating
>> > > > through the elements once?
>> > > >
>> > > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I would like to submit a patch to VectorWritable that allows
>> > > > > streaming access to vector elements without having to prebuffer
>> > > > > all of them first (the current code supports only the prebuffered
>> > > > > form).
>> > > > >
>> > > > > That patch would strike down one of the memory usage issues in
>> > > > > the current Stochastic SVD implementation and effectively lift the
>> > > > > memory bound on n for the SVD work. (The value I see is not so much
>> > > > > in lifting the bound as in being more efficient with memory, thus
>> > > > > essentially speeding up the computation.)
>> > > > >
>> > > > > If it's OK, I would like to create a JIRA issue and provide a
>> > > > > patch for it.
>> > > > >
>> > > > > Another issue would be to provide an SSVD patch that depends on
>> > > > > that VectorWritable patch.
>> > > > >
>> > > > > Thank you.
>> > > > > -Dmitriy
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
