[jira] [Comment Edited] (CASSANDRA-5432) Repair Freeze/Gossip Invisibility Issues 1.2.4

Arya Goudarzi (JIRA) Thu, 18 Apr 2013 22:33:18 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636082#comment-13636082
 ]


Arya Goudarzi edited comment on CASSANDRA-5432 at 4/19/13 5:31 AM:
-------------------------------------------------------------------

> non-ssl on the private IP within the same one [region]

OK, a little more digging, and I found the root cause which I believe is a bug, 
so I am re-opening this.

See this log snippet for a repair session I triggered on a node in a single
region in AWS:

 INFO [AntiEntropySessions:1] 2013-04-19 04:28:16,587 AntiEntropyService.java 
(line 651) [repair #8e59b7c0-a8a9-11e2-ba85-d39d57f66b97] new session: will 
sync /54.242.X.YYY, /54.224.XX.YYY, /50.17.XXX.YYY on range 
(99249023685273718510150927169407637270,127605887595351923798765477788721654890]
 for cardspring_production.[App]
 INFO [AntiEntropySessions:1] 2013-04-19 04:28:16,591 AntiEntropyService.java 
(line 857) [repair #8e59b7c0-a8a9-11e2-ba85-d39d57f66b97] requesting merkle 
trees for App (to [/54.224.XX.YYY, /50.17.XXX.YYY, /54.242.X.YYY])
DEBUG [WRITE-/50.17.159.210] 2013-04-19 04:28:16,592 OutboundTcpConnection.java 
(line 260) attempting to connect to /10.170.XX.YYY
DEBUG [WRITE-/54.224.36.214] 2013-04-19 04:28:16,593 OutboundTcpConnection.java 
(line 260) attempting to connect to /10.121.XX.YYY
DEBUG [WRITE-/54.242.1.111] 2013-04-19 04:28:16,593 OutboundTcpConnection.java 
(line 260) attempting to connect to /54.242.X.YYY

Notice the last line. It is the public IP of the node running repair, whereas the
connections to the other two nodes go to their private 10.x addresses. Why does
the node pick its own public IP address when it sends the tree request to itself?
This is the source of the problem. In AWS you cannot communicate through a public
IP address when security group rules are defined by group name within the same
region, which is a common use case. Hence the tree request gets stuck when the
node tries to send it to itself.
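
To make the suspected failure mode concrete, here is a minimal Java sketch of
per-endpoint address selection. It assumes (this is not taken from the Cassandra
source) that each node gossips its public IP as its broadcast address, that a
peer in the same region is reached through a public-to-private mapping learned
from gossip, and that the local node has no entry in that mapping. The class and
method names and all addresses are hypothetical placeholders, and resolveFixed
shows only one possible remedy, not necessarily the fix the project adopted.

```java
import java.util.Map;

// Hypothetical sketch: not Cassandra's OutboundTcpConnection logic, only an
// illustration of the address-selection behavior described in the comment.
public class ConnectAddressSketch {

    // Suspected behavior: only peers learned through gossip have a
    // public-to-private mapping, so a request addressed to the local node
    // falls back to its own public (broadcast) address.
    public static String resolveBuggy(String endpoint, Map<String, String> privateByPublic) {
        return privateByPublic.getOrDefault(endpoint, endpoint);
    }

    // One possible remedy (assumption): recognize the local broadcast address
    // and dial the local private address instead.
    public static String resolveFixed(String endpoint, String localPublic, String localPrivate,
                                      Map<String, String> privateByPublic) {
        if (endpoint.equals(localPublic)) {
            return localPrivate;
        }
        return privateByPublic.getOrDefault(endpoint, endpoint);
    }

    public static void main(String[] args) {
        // Placeholder documentation/private addresses, not the real cluster IPs.
        String localPublic = "203.0.113.1";
        String localPrivate = "10.0.0.1";
        Map<String, String> peers = Map.of(
                "203.0.113.2", "10.0.0.2",
                "203.0.113.3", "10.0.0.3");

        for (String ep : new String[] {"203.0.113.2", "203.0.113.3", localPublic}) {
            System.out.println(ep + " -> buggy " + resolveBuggy(ep, peers)
                    + ", fixed " + resolveFixed(ep, localPublic, localPrivate, peers));
        }
    }
}
```

For the two peers both variants dial a 10.x address; for the local node the buggy
variant dials 203.0.113.1 (the public IP) while the fixed one dials 10.0.0.1.
Another plausible design is to skip the socket entirely for messages addressed to
the local node and hand them straight to the local handler, so self-delivery does
not depend on security group rules at all.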




                
> Repair Freeze/Gossip Invisibility Issues 1.2.4
> ----------------------------------------------
>
>                 Key: CASSANDRA-5432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.4
>         Environment: Ubuntu 10.04.1 LTS
> C* 1.2.3
> Sun Java 6 u43
> JNA Enabled
> Not using VNodes
>            Reporter: Arya Goudarzi
>            Priority: Critical
>
> Read comment 6. This description summarizes the repair issue only, but I 
> believe there is a bigger networking problem going on, as described in 
> that comment. 
> Since I upgraded our sandbox cluster, I have been unable to run repair on any 
> node, and we hit our gc_grace_seconds limit this weekend. Please help. So 
> far, I have tried the following suggestions:
> - nodetool scrub
> - offline scrub
> - running repair on each CF separately. It didn't help; every repair got stuck 
> the same way.
> The repair command just gets stuck and the machine sits idle. Only the 
> following logs are printed for the repair job:
>  INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) 
> Starting repair command #4, repairing 1 ranges for keyspace 
> cardspring_production
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java 
> (line 652) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] new session: will 
> sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range 
> (1808575600,42535295865117307932921825930779602032] for 
> keyspace_production.[comma separated list of CFs]
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java 
> (line 858) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle 
> trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, 
> /X.X.X.190])
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java 
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle 
> tree for ColumnFamilyName from /X.X.X.43
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java 
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle 
> tree for ColumnFamilyName from /X.X.X.56
> Please advise. 
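
The log above shows merkle trees requested from three endpoints but received from
only two. A small helper can list the endpoints a repair coordinator is still
waiting on. This is a hypothetical Java sketch, not a Cassandra tool: it assumes
each log entry sits on a single line (unlike the wrapped lines above), matches only
the two message formats shown in this log, and tracks endpoints for a single
column family, ignoring the column family name and session ID.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic: reports endpoints asked for a merkle tree that have
// not yet answered, based only on the two log formats quoted above.
public class PendingMerkleTrees {
    private static final Pattern REQUESTED =
            Pattern.compile("requesting merkle trees for \\S+ \\(to \\[([^\\]]*)\\]\\)");
    private static final Pattern RECEIVED =
            Pattern.compile("Received merkle tree for \\S+ from (/\\S+)");

    public static Set<String> pending(List<String> lines) {
        Set<String> waiting = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher req = REQUESTED.matcher(line);
            if (req.find()) {
                // Record every endpoint listed in the request.
                for (String ep : req.group(1).split(",")) {
                    waiting.add(ep.trim());
                }
                continue;
            }
            Matcher rec = RECEIVED.matcher(line);
            if (rec.find()) {
                // An endpoint that answered is no longer pending.
                waiting.remove(rec.group(1));
            }
        }
        return waiting;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle trees for "
                        + "BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, /X.X.X.190])",
                "[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle tree for "
                        + "ColumnFamilyName from /X.X.X.43",
                "[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle tree for "
                        + "ColumnFamilyName from /X.X.X.56");
        // prints: still waiting on: [/X.X.X.190]
        System.out.println("still waiting on: " + pending(log));
    }
}
```

Run against the entries above, it reports /X.X.X.190 as the only endpoint that
never returned a tree, which matches the stuck session.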

