Hello,

I've been trying to add a new data center to our Cassandra 1.1.10 cluster
for the last few days, but I've been unable to successfully rebuild the
nodes on the new DC due to streaming problems.

I have followed the procedure described in
http://www.datastax.com/docs/1.1/cluster_management#adding-capacity (section
"Adding a Data Center to a Cluster"), but some of the new nodes hang
forever during the "nodetool rebuild" operation. A "nodetool netstats |
grep -v 0%" on the node with a frozen rebuild (205.229.68.48) will show:

Mode: NORMAL
Not sending any streams.
*Streaming from: /253.126.57.150*
Pool Name                    Active   Pending      Completed
Commands                        n/a         0         686479
Responses                       n/a         0        7533195

But if I connect to the source node 253.126.57.150 and issue a "nodetool
netstats", it will show:

Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0      879835258
Responses                       n/a         0      611936734
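
The mismatch between the two outputs above (the destination still lists a "Streaming from" peer while the source reports "Not sending any streams.") can be checked mechanically. The following is only a minimal sketch, assuming the netstats text format shown in this message; the class and method names are hypothetical and are not part of Cassandra or nodetool, and it only looks at the first source a destination reports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares "nodetool netstats" text captured on both nodes.
class StalledStreamCheck {
    // Matches the peer address in a line such as "Streaming from: /253.126.57.150".
    private static final Pattern STREAMING_FROM =
            Pattern.compile("Streaming from: /([0-9.]+)");

    /** Returns the first IP the destination reports streaming from, or null if none. */
    static String streamingSource(String destinationNetstats) {
        Matcher m = STREAMING_FROM.matcher(destinationNetstats);
        return m.find() ? m.group(1) : null;
    }

    /** True when the destination still expects a stream but the source reports sending nothing. */
    static boolean isOrphaned(String destinationNetstats, String sourceNetstats) {
        return streamingSource(destinationNetstats) != null
                && sourceNetstats.contains("Not sending any streams.");
    }

    public static void main(String[] args) {
        String dest = "Mode: NORMAL\nNot sending any streams.\n"
                + "Streaming from: /253.126.57.150\n";
        String src = "Mode: NORMAL\nNot sending any streams.\n"
                + "Not receiving any streams.\n";
        System.out.println("source seen by destination: " + streamingSource(dest));
        System.out.println("orphaned stream: " + isOrphaned(dest, src));
    }
}
```

With the two outputs quoted above, isOrphaned returns true, which is the state the frozen rebuild is stuck in.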

If I check the C* logs of this machine (253.126.57.150), I find:

2013 Sep  9 17:50:40 ip-10-177-14-80 ERROR [Streaming to /205.229.68.48:8]
2013-09-09 17:50:40,689 AbstractCassandraDaemon.java (line 135) Exception
in thread Thread[Streaming to /205.229.68.48,5,main]
2013 Sep  9 17:50:40 ip-10-177-14-80 java.lang.RuntimeException:
*java.net.SocketTimeoutException:
Read timed out*
[...]

So, it seems the streaming source (253.126.57.150) waits for a response
from the streaming destination (205.229.68.48), but the
*streaming_socket_timeout_in_ms* is reached (*60s* in our case) and the
source stops streaming. However, *the tricky part is that the destination
node never times out*, so it never retries streaming the problematic
file and hangs there forever, never completing the rebuild operation.

I'd appreciate it if anyone could give me a hand with this. Could it be a
bug, or is there some configuration tuning that could help? I will try
increasing the *streaming_socket_timeout_in_ms* property, but if the
problem is on the destination this won't help (since the socket only times
out at the source).

I've investigated the source code of IncomingStreamReader
and StreamInSession, but didn't notice any blocking operation apart from
the DataInputStream.read, which should time out after
*streaming_socket_timeout_in_ms*, but it doesn't.
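
For reference, this is the plain-Java behaviour I would expect on the receiving side: a read on a socket's DataInputStream only raises SocketTimeoutException if SO_TIMEOUT was set on that particular socket with setSoTimeout; with the default of 0 the read blocks indefinitely. The sketch below is not Cassandra code; the class name, the silent local peer, and the 200 ms value are assumptions made purely for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo, unrelated to Cassandra's streaming classes.
class SoTimeoutDemo {
    /**
     * Connects to a local peer that never sends data and attempts one read.
     * Returns true if the read raised SocketTimeoutException.
     * timeoutMs must be greater than 0: a value of 0 means "wait forever".
     */
    static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket silentPeer = new ServerSocket(0);
             Socket receiver = new Socket("127.0.0.1", silentPeer.getLocalPort());
             Socket accepted = silentPeer.accept()) {
            // Without this call SO_TIMEOUT stays at 0 and read() never times out.
            receiver.setSoTimeout(timeoutMs);
            DataInputStream in = new DataInputStream(receiver.getInputStream());
            try {
                in.read(); // the peer is silent, so this blocks until the timeout
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

If the destination's socket behaved like this, the stalled read would fail after the configured timeout instead of hanging, which is why I suspect the timeout is either not applied to the incoming socket or the thread is blocked somewhere else.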

Any help would be very much appreciated,

Thanks,

Paulo
