[ https://issues.apache.org/jira/browse/CASSANDRA-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637036#comment-13637036 ]

Arya Goudarzi edited comment on CASSANDRA-5432 at 4/20/13 12:10 AM:
--------------------------------------------------------------------

Hey Vijay,

Good to see you here. Sorry if my analysis is unclear. Here is my take:

> The first time we start the communication to a node we try to Initiate 
> communications we use the public IP and eventually once we have the private 
> IP we will switch back to local ip's.

Has this always been the case? If you are using public IPs (not public DNS 
names), there have to be explicit security rules on the public IPs to allow 
this. Otherwise, if your security groups open the ports to machines in the 
same group by referencing the security group name, traffic is only allowed on 
their private IPs, so this won't work.
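For context, here is my mental model of that public-IP-first behavior as a 
sketch. To be clear, this is not the actual Cassandra code, just an 
illustration of the pattern I understand the Ec2MultiRegionSnitch to follow: 
dial the broadcast (public) address first, then switch to the private address 
once gossip supplies it:

    import java.net.InetAddress;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch, not Cassandra's real classes: track which address
    // we dial for each peer, starting with its public (broadcast) address and
    // switching to the private one once gossip tells us we share a region.
    public class ReconnectSketch {
        private final Map<InetAddress, InetAddress> preferred =
                new ConcurrentHashMap<InetAddress, InetAddress>();

        // Before any gossip state arrives, the public address is all we know.
        public InetAddress endpointFor(InetAddress publicAddr) {
            InetAddress privateAddr = preferred.get(publicAddr);
            return privateAddr != null ? privateAddr : publicAddr;
        }

        // Called once gossip delivers the peer's internal IP and we see that
        // the peer is in the same region as us.
        public void onSameRegionPeer(InetAddress publicAddr, InetAddress privateAddr) {
            preferred.put(publicAddr, privateAddr);
            // The real code would also reset the existing connection pool so
            // new messages go out over the private address.
        }
    }

If that model is right, the catch is the window before the switch: when the 
plain storage port is not open on the public address, that first contact never 
succeeds, so the node never learns it can use the private address.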

We use Priam (your awesome tooling), and as you know, it opens up only the SSL 
port on the public IPs for cross-region communication. From the operator's 
perspective, that is the correct thing to do: I only have the SSL port open on 
public IPs and don't want to open the non-SSL port for security reasons. All 
the other ports (non-SSL storage, JMX, etc.) are opened the way I described, 
using security group names, which allows traffic only on the private IPs. That 
is just how AWS works. So, within the same region, trying to connect to any 
machine via its public IP won't work.
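To make that concrete, the rules look roughly like this (illustrative notation 
and values, not actual AWS syntax; 7000 is the plain storage port, 7001 the 
SSL port, 7199 JMX):

    # group-name rules: members can reach each other, but only via PRIVATE IPs
    allow tcp 7000 from security-group our-cassandra-group
    allow tcp 7199 from security-group our-cassandra-group

    # CIDR rules: the only kind that admits traffic to a PUBLIC IP; Priam adds
    # one per peer's public address, for the SSL port only
    allow tcp 7001 from <each peer public IP>/32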

Here is how I reproduced the scenario above; I believe it all comes back to 
your statement that machines connect to the public IPs first.

Set up a cluster as I described in my previous comment; it can be a single 
region. Restart all the machines at the same time. Each machine then sees only 
itself as UP; everyone else is reported DOWN in nodetool ring. My guess is 
that the nodes are trying to send gossip to the public IPs, but only the SSL 
port is open there, and the cluster is configured to use SSL only across 
datacenters/regions, not within the same region. So now I am left with a bunch 
of nodes that only see themselves in the ring. I go to my AWS console and open 
up the non-SSL port on every single public IP in that security group. Now all 
the nodes see each other.
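For reference, the encryption setup I am describing lives in cassandra.yaml; 
ours looks roughly like this (paths and passwords are placeholders; the 
relevant part is internode_encryption: dc, which encrypts only traffic between 
datacenters):

    server_encryption_options:
        internode_encryption: dc    # SSL between DCs only; same-DC traffic
                                    # stays on the plain storage_port
        keystore: conf/.keystore
        keystore_password: <keystore password>
        truststore: conf/.truststore
        truststore_password: <truststore password>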

By now I had a theory that the nodes want to communicate through the public 
IP, which is not possible, so I moved on to troubleshooting repairs. I knew 
that with the opened-up settings the repair would succeed. Since the nodes see 
each other now, I go back to the security groups and remove the non-SSL public 
IP rules I added in the previous step. I start the repair and end up with the 
log message quoted above. The public IP mentioned in the log belongs to the 
node that owns the log and is running the repair, so it tried to communicate 
with itself using its own public IP.
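For anyone who wants to verify the reachability part independently of 
Cassandra, a throwaway check like the one below does it (my own snippet, 
nothing from the codebase; pass the node's IP and the port to test):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) {
            String host = args[0];                // public or private IP of a node
            int port = Integer.parseInt(args[1]); // 7000 (storage) or 7001 (SSL)
            Socket s = new Socket();
            try {
                // 3 second timeout so it fails fast instead of hanging the
                // way the repair session does
                s.connect(new InetSocketAddress(host, port), 3000);
                System.out.println(host + ":" + port + " is reachable");
            } catch (Exception e) {
                System.out.println(host + ":" + port + " is NOT reachable: " + e);
            } finally {
                try { s.close(); } catch (Exception ignored) { }
            }
        }
    }

Running it against a node's own public IP on port 7000 with those rules 
removed should show the same connect failure the repair runs into.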

Did that make sense? I can call and describe it over the phone, but basically 
this setup used to work on 1.1.10 and does not work on 1.2.4. I have attached 
a debugger to a node and am trying to trace the code. I'll let you know if I 
find something new.

> Repair Freeze/Gossip Invisibility Issues 1.2.4
> ----------------------------------------------
>
>                 Key: CASSANDRA-5432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.4
>         Environment: Ubuntu 10.04.1 LTS
> C* 1.2.3
> Sun Java 6 u43
> JNA Enabled
> Not using VNodes
>            Reporter: Arya Goudarzi
>            Assignee: Vijay
>            Priority: Critical
>
> Read comment 6. This description summarizes the repair issue only, but I 
> believe there is a bigger problem going on with networking as described on 
> that comment. 
> Since I upgraded our sandbox cluster, I have been unable to run repair on 
> any node, and I am reaching our gc_grace_seconds this weekend. Please help. 
> So far, I have tried the following suggestions:
> - nodetool scrub
> - offline scrub
> - running repair on each CF separately. Didn't matter. All got stuck the same 
> way.
> The repair command just gets stuck and the machine is idling. Only the 
> following logs are printed for repair job:
>  INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) 
> Starting repair command #4, repairing 1 ranges for keyspace 
> cardspring_production
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java 
> (line 652) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] new session: will 
> sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range 
> (1808575600,42535295865117307932921825930779602032] for 
> keyspace_production.[comma separated list of CFs]
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java 
> (line 858) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle 
> trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, 
> /X.X.X.190])
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java 
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle 
> tree for ColumnFamilyName from /X.X.X.43
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java 
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle 
> tree for ColumnFamilyName from /X.X.X.56
> Please advise. 
