Sometimes sampling works. Sometimes it doesn't. Sampling helps in recommendations with time even more than it does with space, because co-occurrence counting requires time quadratic in the number of occurrences of an item.
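To make that concrete, here is a minimal sketch of that kind of down-sampling; the cap and all names here are made up for illustration and are not anything in Mahout itself. The idea is just to keep at most a fixed number of interactions per very popular item before counting co-occurrences, which bounds the quadratic per-item cost.

// Illustrative sketch only: cap the number of interactions kept per item
// before co-occurrence counting. The cap and class/method names are invented
// for this example.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class InteractionSampler {

  private static final int MAX_USERS_PER_ITEM = 500; // illustrative cap
  private final Random random = new Random(42);

  // Returns the user list unchanged if it is small enough, otherwise a random
  // subset of size MAX_USERS_PER_ITEM. Per-item co-occurrence work is then
  // bounded by O(MAX_USERS_PER_ITEM^2) instead of O(n^2) for a popular item.
  List<Long> sampleUsersForItem(List<Long> userIds) {
    if (userIds.size() <= MAX_USERS_PER_ITEM) {
      return userIds;
    }
    List<Long> copy = new ArrayList<Long>(userIds);
    Collections.shuffle(copy, random);
    return new ArrayList<Long>(copy.subList(0, MAX_USERS_PER_ITEM));
  }
}

With a cap C the per-item work drops from O(n^2) to O(C^2), which is why throwing away a little data on the most popular items changes running time a lot while barely changing the result.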
In fact, it is possible to do the block decomposition we were talking about using interleaved sampling. That would have an interesting and often positive effect in many algorithms. But sampling is, sadly, not an option for many algorithms. (A rough sketch of the block-splitting side of that idea is at the bottom of this message.)

On Tue, Dec 14, 2010 at 12:26 AM, Dmitriy Lyubimov <[email protected]> wrote:

> I see. Thank you, Sean.
>
> On Tue, Dec 14, 2010 at 12:17 AM, Sean Owen <[email protected]> wrote:
>
> > Yah, I have the same sort of problem with recommenders. Once in a while
> > you have a very, very popular item. In those cases I use some smart
> > sampling to simply throw out some data. For my use cases, that has
> > virtually no effect on the result, so it's valid. Not sure if you can
> > use the same approaches.
> >
> > On Tue, Dec 14, 2010 at 1:21 AM, Dmitriy Lyubimov <[email protected]>
> > wrote:
> >
> > > Sean,
> > >
> > > Absolutely agree. There's no such thing as a 1GB HDFS block. The max I
> > > ever saw is 128MB, but most commonly 64MB.
> > >
> > > Imagine for a second loading 1GB records into a mapper. Assuming a
> > > sufficiently large number of nodes, the expected amount of collocated
> > > data in this case is roughly half a block, approximately 32MB;
> > > everything else comes from some other node.
> > >
> > > So we start our job with moving ~97% of the data around just to make
> > > it available to the mappers.
> > >
> > > There are, however, two considerations that somewhat alleviated that
> > > concern in my particular case:
> > >
> > > 1) Not every vector is 1GB. In fact, only 0.1% of vectors are perhaps
> > > 100-150MB at most. But I still have to yield 50% of mapper RAM to the
> > > VectorWritable just to make sure that 0.1% of cases goes through. So
> > > the case I am making is not for 1GB vectors; I am saying that the
> > > problem is already quite detrimental to algorithm running time at much
> > > smaller sizes.
> > >
> > > 2) Stochastic SVD is CPU bound. I did not actually run it with 1GB
> > > vectors, nor do I plan to. But if I did, I strongly suspect it is not
> > > I/O that would be the bottleneck.
> > >
> > > On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <[email protected]> wrote:
> > >
> > > > This may be a late or ill-informed comment, but --
> > > >
> > > > I don't believe there's any issue with VectorWritable per se, no.
> > > > Hadoop most certainly assumes that one Writable can fit into RAM. A
> > > > 1GB Writable is just completely incompatible with how Hadoop works.
> > > > The algorithm would have to be parallelized more, then.
> > > >
> > > > Yes, that may mean re-writing the code to deal with small Vectors
> > > > and such. That probably doesn't imply a change to VectorWritable but
> > > > to the jobs using it.
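For reference, below is a rough sketch of the "small Vectors" splitting Sean describes above. It is only an illustration under assumptions: the block size and helper class are made up here, the (row, block) keying a real job would need is omitted, and this is not how the stochastic SVD code actually partitions its input.

// Rough sketch only: instead of one huge VectorWritable record, emit several
// records, each holding a block-sized slice of the row. BLOCK_SIZE is an
// assumption for this example; a real job would also key each slice by
// (rowIndex, blockIndex), which is omitted here.
import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

final class VectorBlockSplitter {

  private static final int BLOCK_SIZE = 100000; // columns per slice, illustrative

  // Splits one (possibly very large) row vector into slices that each fit
  // comfortably in a mapper's heap and in a single HDFS block.
  static List<VectorWritable> split(Vector row) {
    List<VectorWritable> slices = new ArrayList<VectorWritable>();
    for (int offset = 0; offset < row.size(); offset += BLOCK_SIZE) {
      int length = Math.min(BLOCK_SIZE, row.size() - offset);
      // viewPart exposes the window [offset, offset + length) of the original vector
      slices.add(new VectorWritable(row.viewPart(offset, length)));
    }
    return slices;
  }
}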
