Hi, I wanted to see how nodes get allocated for RDD replication. I see that all the blocks are getting replicated to the same node. I was expecting each block to be replicated to a different node. I have a humble three-node Spark cluster :).
Below is the trace of the replicate() method, produced with print statements:

Before calling the getPeers null
After calling: WrappedArray(BlockManagerId(1, s1, 47511, 0))
Inside the forloop host: s1 port: 47511 execID: 1 netty: 0
Try to replicate BlockId rdd_1_4 once; The size of the data is 38722395 Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Before calling the getPeers: WrappedArray(BlockManagerId(1, s1, 47511, 0))
Inside the forloop host: s1 port: 47511 exeID: 1 netty: 0
Try to replicate BlockId rdd_1_0 once; The size of the data is 139496007 Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Before calling the getPeers WrappedArray(BlockManagerId(1, s1, 47511, 0))
Inside the forloop host: s1 port: 47511 execID: 1 netty: 0
Try to replicate BlockId rdd_1_1 once; The size of the data is 139495994 Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Before calling the getPeers WrappedArray(BlockManagerId(1, s1, 47511, 0))
Inside the forloop host: s1 port: 47511 execID: 1 netty: 0
Try to replicate BlockId rdd_1_2 once; The size of the data is 139496003 Bytes. To node: BlockManagerId(1, s1, 47511, 0)

Can someone please tell me why this is happening? Why is the entire RDD replicated to a single node?

Thank you,
Karthik
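
P.S. To check that I am reading the trace correctly, here is a small Python model of what it seems to show. This is not Spark's actual code: the BlockManagerId tuple, the pick_replica_targets function, and the "take the first (replication - 1) peers" rule are my own assumptions for illustration. The point is only that if the peer list contains a single entry every time, every block has to end up on that one peer.

```python
from collections import namedtuple

# Hypothetical stand-in for Spark's BlockManagerId; fields mirror the trace.
BlockManagerId = namedtuple("BlockManagerId", ["exec_id", "host", "port", "netty"])


def pick_replica_targets(peers, replication=2):
    """Assumed rule: use the first (replication - 1) peers as targets."""
    return list(peers[: replication - 1])


# The peer list printed in the trace: one entry, identical on every call.
peers = [BlockManagerId("1", "s1", 47511, 0)]
blocks = ["rdd_1_4", "rdd_1_0", "rdd_1_1", "rdd_1_2"]

# Choose targets for each block independently, as the trace suggests.
placement = {block: pick_replica_targets(peers) for block in blocks}

for block, targets in placement.items():
    print(block, "->", [t.host for t in targets])
```

With only one peer available, the model has no other node to choose, so what I really want to understand is why getPeers returns a single BlockManagerId on a three-node cluster.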