Hi,

I wanted to trace how nodes are chosen for RDD replication. I see that all
the blocks are being replicated to the same node, whereas I was expecting
each block to be replicated to a different node. I have a humble three-node
Spark cluster :).

Below is a trace of the replicate() method, produced with print statements:



Before calling the getPeers
null

After calling: WrappedArray(BlockManagerId(1, s1, 47511, 0))

Inside the forloop

host: s1  port: 47511  execID: 1  netty: 0

Try to replicate BlockId rdd_1_4 once; The size of the data is 38722395
Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Before calling the getPeers:
WrappedArray(BlockManagerId(1, s1, 47511, 0))

Inside the forloop

host: s1  port: 47511 execID: 1  netty: 0

Try to replicate BlockId rdd_1_0 once; The size of the data is 139496007
Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Before calling the getPeers
WrappedArray(BlockManagerId(1, s1, 47511, 0))

Inside the forloop

host: s1  port: 47511 execID: 1  netty: 0

Try to replicate BlockId rdd_1_1 once; The size of the data is 139495994
Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Before calling the getPeers
WrappedArray(BlockManagerId(1, s1, 47511, 0))

Inside the forloop

host: s1  port: 47511 execID: 1 netty: 0

Try to replicate BlockId rdd_1_2 once; The size of the data is 139496003
Bytes. To node: BlockManagerId(1, s1, 47511, 0).
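
To make the flow in the trace easier to follow, here is a minimal Python sketch of what it appears to show. This is not Spark's actual code: the class ReplicationSketch and its method and field names are hypothetical, chosen only for illustration. It assumes the peer list is fetched once, cached, and then reused for every later block, which is what the "null" before the first getPeers call and the populated array before the later calls suggest.

```python
from typing import List, NamedTuple, Optional


class BlockManagerId(NamedTuple):
    # Field names mirror the values printed in the trace.
    exec_id: str
    host: str
    port: int
    netty_port: int


class ReplicationSketch:
    """Toy model of the flow seen in the trace (hypothetical, not Spark's code)."""

    def __init__(self, peers_from_master: List[BlockManagerId]) -> None:
        # Assumption: whatever the master would return when asked for peers.
        self._peers_from_master = peers_from_master
        self.cached_peers: Optional[List[BlockManagerId]] = None
        self.get_peers_calls = 0

    def get_peers(self) -> List[BlockManagerId]:
        # Stand-in for the getPeers call; counts how often it is invoked.
        self.get_peers_calls += 1
        return list(self._peers_from_master)

    def replicate(self, block_id: str) -> List[BlockManagerId]:
        # The peer list is looked up only while the cache is still empty.
        if self.cached_peers is None:
            self.cached_peers = self.get_peers()
        targets = []
        for peer in self.cached_peers:
            # In the real system the block bytes would be sent to `peer` here.
            targets.append(peer)
        return targets


s1 = BlockManagerId("1", "s1", 47511, 0)
sketch = ReplicationSketch([s1])

block_ids = ["rdd_1_4", "rdd_1_0", "rdd_1_1", "rdd_1_2"]
targets_by_block = {b: sketch.replicate(b) for b in block_ids}

for block_id, targets in targets_by_block.items():
    print(block_id, "->", [f"{t.host}:{t.port}" for t in targets])
```

Under these assumptions the peer lookup happens once and every block ends up with the same single target, s1, which matches the trace. The sketch only models the caching; it says nothing about why the lookup returned just one peer in the first place.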

Can someone please tell me why this is happening?
Why is the entire RDD replicated to a single node?

Thank you
-Karthik.
