I see. Thank you, Sean.

On Tue, Dec 14, 2010 at 12:17 AM, Sean Owen <[email protected]> wrote:
> Yah I have the same sort of problem with recommenders. Once in a while you
> have a very, very popular item. In those cases I use some smart sampling to
> simply throw out some data. For my use cases, that has virtually no effect
> on the result, so it's valid. Not sure if you can use the same approaches.
>
>
> On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Sean,
> >
> > Absolutely agree. There's no such thing as a 1GB HDFS block. The max I
> > ever saw is 128MB, but most commonly 64MB.
> >
> > Imagine for a second loading 1GB records into a mapper. Assuming a
> > sufficiently large number of nodes, the expected amount of collocated
> > data in this case is approximately 32MB; everything else comes from some
> > other node.
> >
> > So we start our job by moving ~97% of the data around just to make it
> > available to the mappers.
> >
> > There are two considerations, however, that somewhat alleviated that
> > concern in my particular case:
> >
> > 1) Not every vector is 1GB. In fact, only 0.1% of vectors are perhaps
> > 100-150MB at most. But I still have to yield 50% of mapper RAM to the
> > VectorWritable just to make sure that 0.1% of cases goes through. So the
> > case I am making is not for 1GB vectors; I am saying that the problem is
> > already quite detrimental to algorithm running time at much smaller sizes.
> >
> > 2) Stochastic SVD is CPU bound. I did not actually run it with 1GB
> > vectors, nor do I plan to. But if I did, I strongly suspect it is not I/O
> > that would be the bottleneck.
> >
> > On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <[email protected]> wrote:
> >
> > > This may be a late or ill-informed comment but --
> > >
> > > I don't believe there's any issue with VectorWritable per se, no. Hadoop
> > > most certainly assumes that one Writable can fit into RAM. A 1GB Writable
> > > is just completely incompatible with how Hadoop works. The algorithm
> > > would have to be parallelized more, then.
> > >
> > > Yes, that may mean re-writing the code to deal with small Vectors and
> > > such. That probably doesn't imply a change to VectorWritable but to the
> > > jobs using it.
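
A minimal sketch of the kind of per-item down-sampling Sean describes above,
assuming interactions are screened one at a time by item ID. The class name,
the cap of 10,000 interactions per item, and the hash-map bookkeeping are
assumptions made for illustration, not Mahout's actual sampling code:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * Sketch of per-item down-sampling: interactions for an item are always kept
 * up to MAX_PER_ITEM; beyond that, each new interaction is kept with
 * probability MAX_PER_ITEM / count, so the retained sample stays roughly
 * bounded. Class name and threshold are illustrative only.
 */
public final class PopularItemSampler {

  private static final int MAX_PER_ITEM = 10000; // hypothetical cap

  private final Map<Long, Integer> itemCounts = new HashMap<Long, Integer>();
  private final Random random = new Random();

  /** Returns true if an interaction with this item should be kept. */
  public boolean accept(long itemID) {
    Integer previous = itemCounts.get(itemID);
    int count = (previous == null) ? 1 : previous + 1;
    itemCounts.put(itemID, count);
    if (count <= MAX_PER_ITEM) {
      return true;
    }
    // Reservoir-style acceptance keeps an approximately uniform sample
    // of the item's interactions once it passes the cap.
    return random.nextInt(count) < MAX_PER_ITEM;
  }
}

The reservoir-style acceptance keeps the expected number of retained
interactions per item near the cap, which is one way to get the "virtually no
effect on the result" behavior Sean mentions for very popular items.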
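And a rough sketch of what "deal with small Vectors" could look like on the
writing side: one very wide row stored as several modest (block index, slice)
pairs in a SequenceFile rather than a single huge VectorWritable. The slice
size, class name, and key scheme are assumptions for illustration, not the
stochastic SVD job's actual layout:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Sketch only: writes one very wide row as several modest slices keyed by
 * block index, so no single Writable has to hold the whole row. Slice size
 * and key scheme are made up for illustration.
 */
public final class RowSliceWriter {

  private static final int SLICE_SIZE = 1 << 20; // elements per slice (assumed)

  private RowSliceWriter() {
  }

  public static void writeSlices(Configuration conf, Path out, Vector row)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, VectorWritable.class);
    try {
      int n = row.size();
      for (int offset = 0, block = 0; offset < n; offset += SLICE_SIZE, block++) {
        int length = Math.min(SLICE_SIZE, n - offset);
        // Each slice is a small VectorWritable; viewPart avoids copying the row.
        writer.append(new IntWritable(block),
                      new VectorWritable(row.viewPart(offset, length)));
      }
    } finally {
      writer.close();
    }
  }
}

The consuming job then sees many small records it can spread across mappers
(and reassemble or process per slice), instead of one record that has to fit
entirely in a single mapper's heap.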
