[ 
https://issues.apache.org/jira/browse/CASSANDRA-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-3533:
----------------------------------------

    Attachment: 3533.txt

bq. That's great, but in Cassandra we have DSnitch which can mark nodes down too.

We can't actually mark a node down with the dsnitch; we can only choose not to 
use it (for a while).

bq. Will it make sense for us to poll just before we mark the node up (To 
double check)? 

I looked at doing this, and frankly, integrating a check here is pretty scary 
and messy.  For instance, new nodes are reported in handleMajorStateChange, 
which sends the onJoin events that _cause_ the initial connection.  To poll 
first we'd have to change that or make an extra connection; neither is very 
desirable to put in Gossiper, and both are scary to put in a minor release, in 
my opinion.  Furthermore, in the case of natural, _temporary_ partitions of 
this kind, there are some things we still want to retry instead of failing 
fast, like streaming.

Instead, in the attached patch, I took a different, more coordinator-based 
approach: before a read/write is attempted, the FD must report the node as 
alive and there must be a live outbound connection to the destination; 
otherwise UE (UnavailableException) is thrown.
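
To make the idea concrete, here is a rough, standalone sketch of that kind of 
coordinator-side check.  It is not the code in 3533.txt: every class, 
interface, and method name below is hypothetical, the two interfaces are 
stand-ins for the FD and for the coordinator's outbound connection state, and 
counting usable replicas against a required number is an assumption about how 
the check would be applied.

```java
import java.util.List;

// Hypothetical sketch only; none of these names come from the patch or from Cassandra.
final class CoordinatorLivenessCheck {

    /** Stand-in for the failure detector's view of a node. */
    interface FailureDetectorView {
        boolean isAlive(String endpoint);
    }

    /** Stand-in for this coordinator's own outbound connection state. */
    interface OutboundConnections {
        boolean isConnected(String endpoint);
    }

    /** Stand-in for the UE thrown when too few replicas are usable. */
    static final class UnavailableException extends Exception {
        UnavailableException(String message) {
            super(message);
        }
    }

    private final FailureDetectorView failureDetector;
    private final OutboundConnections connections;

    CoordinatorLivenessCheck(FailureDetectorView failureDetector, OutboundConnections connections) {
        this.failureDetector = failureDetector;
        this.connections = connections;
    }

    /** A replica is usable only if the FD says alive AND we hold a live outbound connection. */
    boolean isUsable(String endpoint) {
        return failureDetector.isAlive(endpoint) && connections.isConnected(endpoint);
    }

    /** Fails fast, before any request is sent, instead of waiting for the RPC timeout. */
    void assureSufficientLiveNodes(List<String> replicas, int required) throws UnavailableException {
        long usable = replicas.stream().filter(this::isUsable).count();
        if (usable < required) {
            throw new UnavailableException(
                "usable replicas: " + usable + ", required: " + required);
        }
    }
}
```

The point of combining the two checks is that gossip can keep a node marked 
alive (because some other node can reach it) while this particular coordinator 
cannot, which is exactly the firewall case in the description below.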
                
> TimeoutException when there is a firewall issue.
> ------------------------------------------------
>
>                 Key: CASSANDRA-3533
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3533
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Vijay
>            Assignee: Brandon Williams
>            Priority: Minor
>             Fix For: 1.1.4
>
>         Attachments: 3533.txt
>
>
> When one node in the cluster is not able to talk to the other DC/RAC due to 
> a firewall or network-related issue (StorageProxy calls fail), and the nodes 
> are NOT marked down because at least one node in the cluster can talk to the 
> other DC/RAC, we get a TimeoutException instead of an 
> UnavailableException.
> The problem with this:
> 1) It is hard to monitor/identify these errors.
> 2) It is hard to differentiate from the client between a bad node and a bad 
> query.
> 3) When this issue happens we have to wait for at least the RPC timeout 
> to know that the query won't succeed.
> Possible Solution: when marking a node down we might want to check if the 
> node is actually alive by trying to communicate with it, so we can be sure 
> that the node is actually alive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
