[ 
https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701415#comment-15701415
 ] 

Benjamin Roth commented on CASSANDRA-12905:
-------------------------------------------

Maybe it's not correct to say "a hint delivery has its own thread", but it is not 
as multiplexed as a usual write request, and the usual WTE in the write path has a 
different semantic meaning.

As I understand it:
- a WTE should more or less indicate whether a write request was able to meet the 
desired consistency within a limited time frame.
- and it is a constraint so that many clients do not completely block the 
coordinator for an unlimited time if there is "something wrong" in the write 
path, right?

But in the case of hint delivery this is different:
1. There is not an arbitrary number of clients. Max clients == #nodes - 1
2. There is only 1 connection for node x > node y
3. If there is a timeout between node x > node y, the hint will be retried and 
retried and retried anyway, which will probably put even more pressure on the 
target node than just retrying lock acquisition until the lock can be acquired. I 
have observed that many times.
4. In the worst case (also observed), if a lot of WTEs occur during hint delivery, 
no hints will be delivered successfully at all, and hints will pile up more and 
more until the pressure is relieved manually by pausing hint delivery on all 
nodes but 1 or 2.
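
To make the contrast in points 3 and 4 concrete, here is a minimal sketch of the two strategies. It is not Cassandra's actual code: the class, the method names and SketchTimeoutException are hypothetical, and a plain ReentrantLock stands in for the MV lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; none of these names exist in Cassandra.
final class HintLockSketch {

    /** Stand-in for a write timeout; NOT Cassandra's WriteTimeoutException. */
    static class SketchTimeoutException extends RuntimeException {
        SketchTimeoutException(String message) { super(message); }
    }

    /** Strategy A: one bounded wait, then fail, as for a client-facing write. */
    static void acquireOrTimeout(Lock lock, long timeoutMillis) throws InterruptedException {
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new SketchTimeoutException("lock not acquired within " + timeoutMillis + " ms");
        }
    }

    /** Strategy B: retry until the lock is acquired; returns the number of attempts. */
    static int acquireWithRetry(Lock lock, long attemptMillis) throws InterruptedException {
        int attempts = 1;
        while (!lock.tryLock(attemptMillis, TimeUnit.MILLISECONDS)) {
            attempts++;
        }
        return attempts;
    }

    /** Test helper: holds the lock on another thread; returns once it is held. */
    static Thread holdInBackground(ReentrantLock lock, long holdMillis) throws InterruptedException {
        CountDownLatch held = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                held.countDown();
                Thread.sleep(holdMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.start();
        held.await();
        return holder;
    }

    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        Thread holder = holdInBackground(lock, 200);
        try {
            acquireOrTimeout(lock, 20);
            lock.unlock();
        } catch (SketchTimeoutException e) {
            System.out.println("strategy A failed: " + e.getMessage());
        }
        int attempts = acquireWithRetry(lock, 20);
        System.out.println("strategy B acquired the lock after " + attempts + " attempt(s)");
        lock.unlock();
        holder.join();
    }
}
```

With strategy A the caller has to resend the whole hint after every failure, while strategy B only waits longer on the target node. That is the trade-off I am arguing about, under the assumption that the number of concurrent hint senders is bounded as described in points 1 and 2.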

Any feedback on this?

> Retry acquire MV lock on failure instead of throwing WTE on streaming
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-12905
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: centos 6.7 x86_64
>            Reporter: Nir Zilka
>            Priority: Critical
>             Fix For: 3.9
>
>
> Hello,
> I performed two upgrades on the current cluster (currently 15 nodes, 1 DC, 
> private VLAN).
> First it was 2.2.5.1, and repair worked flawlessly.
> The second upgrade was to 3.0.9 (with upgradesstables), and repair also worked 
> well.
> Then I upgraded to 3.9 two weeks ago, and the repair problems started.
> There are several types of errors in system.log (on different nodes):
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation 
> timed out - received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> ----
> I use the 3.9 default configuration with cluster-specific adjustments (3 
> seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (86400000).
> I'm afraid of consistency problems while I'm not performing repair.
> Any ideas?
> Thanks,
> Nir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
