I see. Thank you, Sean.

On Tue, Dec 14, 2010 at 12:17 AM, Sean Owen <[email protected]> wrote:
> Yah I have the same sort of problem with recommenders. Once in a while you
> have a very, very popular item. In those cases I use some smart sampling to
> simply throw out some data. For my use cases, that has virtually no effect
> on the result, so it's valid. Not sure if you can use the same approaches.
>
>
> On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Sean,
> >
> > Absolutely agree. There's no such thing as a 1GB HDFS block. The max I
> > ever saw is 128MB, but most commonly 64MB.
> >
> > Imagine for a second loading 1GB records into a mapper. Assuming a
> > sufficiently large number of nodes, the expected amount of collocated
> > data in this case is approximately 32MB; everything else comes from some
> > other node.
> >
> > So we start our job by moving ~97% of the data around just to make it
> > available to the mappers.
> >
> > There are two considerations, however, that somewhat alleviated that
> > concern in my particular case:
> >
> > 1) Not every vector is 1GB. In fact, only 0.1% of vectors are perhaps
> > 100-150MB at most. But I still have to yield 50% of mapper RAM to the
> > VectorWritable just to make sure that 0.1% of cases goes through. So the
> > case I am making is not for 1GB vectors; I am saying that the problem is
> > already quite detrimental to algorithm running time at much smaller sizes.
> >
> > 2) Stochastic SVD is CPU bound. I did not actually run it with 1GB
> > vectors, nor do I plan to. But if I did, I strongly suspect it is not I/O
> > that would be the bottleneck.
> >
> > On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <[email protected]> wrote:
> >
> > > This may be a late or ill-informed comment but --
> > >
> > > I don't believe there's any issue with VectorWritable per se, no. Hadoop
> > > most certainly assumes that one Writable can fit into RAM. A 1GB Writable
> > > is just completely incompatible with how Hadoop works. The algorithm
> > > would have to be parallelized more, then.
> > >
> > > Yes, that may mean re-writing the code to deal with small Vectors and
> > > such. That probably doesn't imply a change to VectorWritable but to the
> > > jobs using it.
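
A minimal sketch of the kind of per-item down-sampling Sean describes above,
assuming interactions are screened one at a time by item ID. The class name,
the cap of 10,000 interactions per item, and the hash-map bookkeeping are
assumptions made for illustration, not Mahout's actual sampling code:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * Sketch of per-item down-sampling: interactions for an item are always kept
 * up to MAX_PER_ITEM; beyond that, each new interaction is kept with
 * probability MAX_PER_ITEM / count, so the retained sample stays roughly
 * bounded. Class name and threshold are illustrative only.
 */
public final class PopularItemSampler {

  private static final int MAX_PER_ITEM = 10000; // hypothetical cap

  private final Map<Long, Integer> itemCounts = new HashMap<Long, Integer>();
  private final Random random = new Random();

  /** Returns true if an interaction with this item should be kept. */
  public boolean accept(long itemID) {
    Integer previous = itemCounts.get(itemID);
    int count = (previous == null) ? 1 : previous + 1;
    itemCounts.put(itemID, count);
    if (count <= MAX_PER_ITEM) {
      return true;
    }
    // Reservoir-style acceptance keeps an approximately uniform sample
    // of the item's interactions once it passes the cap.
    return random.nextInt(count) < MAX_PER_ITEM;
  }
}

The reservoir-style acceptance keeps the expected number of retained
interactions per item near the cap, which is one way to get the "virtually no
effect on the result" behavior Sean mentions for very popular items.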
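And a rough sketch of what "deal with small Vectors" could look like on the
writing side: one very wide row stored as several modest (block index, slice)
pairs in a SequenceFile rather than a single huge VectorWritable. The slice
size, class name, and key scheme are assumptions for illustration, not the
stochastic SVD job's actual layout:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Sketch only: writes one very wide row as several modest slices keyed by
 * block index, so no single Writable has to hold the whole row. Slice size
 * and key scheme are made up for illustration.
 */
public final class RowSliceWriter {

  private static final int SLICE_SIZE = 1 << 20; // elements per slice (assumed)

  private RowSliceWriter() {
  }

  public static void writeSlices(Configuration conf, Path out, Vector row)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);
    try {
      int n = row.size();
      for (int offset = 0, block = 0; offset < n; offset += SLICE_SIZE, block++) {
        int length = Math.min(SLICE_SIZE, n - offset);
        // Each slice is a small VectorWritable; viewPart avoids copying the row.
        writer.append(new IntWritable(block),
                      new VectorWritable(row.viewPart(offset, length)));
      }
    } finally {
      writer.close();
    }
  }
}

The consuming job then sees many small records it can spread across mappers
(and reassemble or process per slice), instead of one record that has to fit
entirely in a single mapper's heap.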
