Sean,

Absolutely agree. There's no such thing as a 1GB HDFS block; the largest block size I have ever seen is 128MB, and 64MB is by far the most common.

Imagine for a second loading 1GB records into a mapper. Assuming a sufficiently large number of nodes, the expected amount of collocated data in this case is approximately 32MB (on average, only half of the 64MB block the record starts in is node-local); everything else comes from some other node. So we start the job by moving ~97% of the data around just to make it available to the mappers.
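Just to make that arithmetic explicit, here is a back-of-the-envelope sketch (assuming 64MB blocks, a record start offset uniformly distributed within its block, and every non-local block being fetched over the network):

// Rough locality estimate for a single 1GB record under the assumptions above.
public class LocalityEstimate {
  public static void main(String[] args) {
    double blockMb = 64.0;
    double recordMb = 1024.0;                // one 1GB record
    // On average only half of the block the record starts in is node-local.
    double expectedLocalMb = blockMb / 2.0;  // ~32MB
    double shippedFraction = (recordMb - expectedLocalMb) / recordMb;
    System.out.printf("local ~%.0fMB, shipped over the network ~%.1f%%%n",
        expectedLocalMb, shippedFraction * 100.0);  // ~96.9%, i.e. roughly 97%
  }
}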
There are two considerations in my case, however, that somewhat alleviate that concern, at least for my particular workload:

1) Not every vector is 1GB. In fact, only about 0.1% of the vectors are perhaps 100-150MB at most. But I still have to yield 50% of mapper RAM to the VectorWritable just to make sure that 0.1% of cases goes through. So the case I am making is not about 1GB vectors; I am saying that the problem is already quite detrimental to the algorithm's running time at much smaller sizes.

2) Stochastic SVD is CPU-bound. I did not actually run it with 1GB vectors, nor do I plan to. But if I did, I strongly suspect that I/O would not be the bottleneck.

On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <[email protected]> wrote:
> This may be a late or ill-informed comment but --
>
> I don't believe there's any issue with VectorWritable per se, no. Hadoop
> most certainly assumes that one Writable can fit into RAM. A 1GB Writable is
> just completely incompatible with how Hadoop works. The algorithm would have
> to be parallelized more then.
>
> Yes that may mean re-writing the code to deal with small Vectors and such.
> That probably doesn't imply a change to VectorWritable but to the jobs using
> it.
>
