Sean,

Absolutely agree. There's no such thing as a 1GB HDFS block; the largest block size I have ever seen is 128MB, and 64MB is by far the most common.

Imagine for a second loading 1GB records into a mapper. Assuming a sufficiently large number of nodes, the expected amount of collocated data in this case is approximately 32MB (on average, only half of the 64MB block the record starts in is node-local); everything else comes from some other node. So we start the job by moving ~97% of the data around just to make it available to the mappers.
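Just to make that arithmetic explicit, here is a back-of-the-envelope sketch (assuming 64MB blocks, a record start offset uniformly distributed within its block, and every non-local block being fetched over the network):

// Rough locality estimate for a single 1GB record under the assumptions above.
public class LocalityEstimate {
  public static void main(String[] args) {
    double blockMb = 64.0;
    double recordMb = 1024.0;                // one 1GB record
    // On average only half of the block the record starts in is node-local.
    double expectedLocalMb = blockMb / 2.0;  // ~32MB
    double shippedFraction = (recordMb - expectedLocalMb) / recordMb;
    System.out.printf("local ~%.0fMB, shipped over the network ~%.1f%%%n",
        expectedLocalMb, shippedFraction * 100.0);  // ~96.9%, i.e. roughly 97%
  }
}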
There are two considerations in my case, however, that somewhat alleviate that concern, at least for my particular workload:

1) Not every vector is 1GB. In fact, only about 0.1% of the vectors are perhaps 100-150MB at most. But I still have to yield 50% of mapper RAM to the VectorWritable just to make sure that 0.1% of cases goes through. So the case I am making is not about 1GB vectors; I am saying that the problem is already quite detrimental to the algorithm's running time at much smaller sizes.

2) Stochastic SVD is CPU-bound. I did not actually run it with 1GB vectors, nor do I plan to. But if I did, I strongly suspect that I/O would not be the bottleneck.

On Mon, Dec 13, 2010 at 5:01 PM, Sean Owen <[email protected]> wrote:
> This may be a late or ill-informed comment but --
>
> I don't believe there's any issue with VectorWritable per se, no. Hadoop
> most certainly assumes that one Writable can fit into RAM. A 1GB Writable is
> just completely incompatible with how Hadoop works. The algorithm would have
> to be parallelized more then.
>
> Yes that may mean re-writing the code to deal with small Vectors and such.
> That probably doesn't imply a change to VectorWritable but to the jobs using
> it.
>
