Re: Sequential access to VectorWritable content proposal.

Dmitriy Lyubimov Mon, 13 Dec 2010 11:20:00 -0800

Ted,

like i commented in the issue, 100m dense data elements means ~800M to load
a vector. If (suppose) i have 1G in a mapper (which is not terribly common,
not in our case, we only run jobs with -500M), then require 80% of RAM for
prebuffering (i.e. just being able to get access to the data) and leave 20%
for algorithm? Setting aside other considerations, doesn't this strike you
as being off-balance in requirements ? it does to me and unfortunately  in
my practical application with the hardware constraints i got i won't be able
to afford that luxury. So it is a problem for my task for the moment,
believe it or not. At this moment even 100Mb for prebuffering the data is
prohibitively costly in my app.


-d

On Mon, Dec 13, 2010 at 11:12 AM, Ted Dunning <[email protected]> wrote:

> I really don't see this as a big deal even with crazy big vectors.
>
> Looking at web scale, for instance, the most linked wikipedia article only
> has 10 million in-links or so.  On the web, the most massive web site is
> unlikely to have >100 million in-links.  Both of these fit in very modest
> amounts of memory.
>
> Where's the rub?
>
> On Mon, Dec 13, 2010 at 11:05 AM, Dmitriy Lyubimov <[email protected]
> >wrote:
>
> > Jake,
> > No i was trying exactly what you were proposing some time ago on the
> list.
> > I
> > am trying to make long vectors not to occupy a lot of memory.
> >
> > E.g. a 1m-long dense vector would require 8Mb just to load it. And i am
> > saying, hey, there's a lot of sequential techniques that can provide a
> > hander that would inspect vector element-by-element without having to
> > preallocate 8Mb.
> >
> > for 1 million-long vectors it doesn't scary too much but starts being so
> > for
> > default hadoop memory settings at the area of 50-100Mb (or 5-10 million
> > non-zero elements). Stochastic SVD will survive that, but it means less
> > memory for blocking, and the more blocks you have, the more CPU it
> requires
> > (although CPU demand is only linear to the number of blocks and only in
> > signficantly smaller part of computation, so that only insigificant part
> of
> > total CPU flops depends on # of blocks, but there is part that does,
> still.
> > )
> >
> > Like i said, it also would address the case when rows don't fit in the
> > memory (hence no memory bound for n of A) but the most immediate benefit
> is
> > to speed/ scalability/memory req of SSVD in most practical LSI cases.
> >
> > -Dmitriy
> >
> > On Mon, Dec 13, 2010 at 10:24 AM, Jake Mannix <[email protected]>
> > wrote:
> >
> > > Hey Dmitriy,
> > >
> > >  I've also been playing around with a VectorWritable format which is
> > backed
> > > by a
> > > SequenceFile, but I've been focussed on the case where it's essentially
> > the
> > > entire
> > > matrix, and the rows don't fit into memory.  This seems different than
> > your
> > > current
> > > use case, however - you just want (relatively) small vectors to load
> > > faster,
> > > right?
> > >
> > >  -jake
> > >
> > > On Mon, Dec 13, 2010 at 10:18 AM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > Interesting idea.
> > > >
> > > > Would this introduce a new vector type that only allows iterating
> > through
> > > > the elements once?
> > > >
> > > > On Mon, Dec 13, 2010 at 9:49 AM, Dmitriy Lyubimov <[email protected]
> >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to submit a patch to VectorWritable that allows for
> > > > streaming
> > > > > access to vector elements without having to prebuffer all of them
> > > first.
> > > > > (current code allows for the latter only).
> > > > >
> > > > > That patch would allow to strike down one of the memory usage
> issues
> > in
> > > > > current Stochastic SVD implementation and effectively open memory
> > bound
> > > > for
> > > > > n of the SVD work. (The value i see is not to open up the the bound
> > > > though
> > > > > but just be more efficient in memory use, thus essentially speeding
> u
> > p
> > > > the
> > > > > computation. )
> > > > >
> > > > > If it's ok, i would like to create a JIRA issue and provide a patch
> > for
> > > > it.
> > > > >
> > > > > Another issue is to provide an SSVD patch that depends on that
> patch
> > > for
> > > > > VectorWritable.
> > > > >
> > > > > Thank you.
> > > > > -Dmitriy
> > > > >
> > > >
> > >
> >
>

Re: Sequential access to VectorWritable content proposal.

Reply via email to