[ 
https://issues.apache.org/jira/browse/CASSANDRA-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641178#comment-13641178
 ] 

Arya Goudarzi edited comment on CASSANDRA-5432 at 4/24/13 11:39 PM:
--------------------------------------------------------------------

So, I rolled back CASSANDRA-5171 and pushed the build to my test cluster. The 
gossip issue, where nodes didn't see each other after a restart, is fixed. 
Repair still tries to connect to the machine running the repair (self) on its 
public IP when requesting the MerkleTree, and gets stuck there, so that issue 
remains. Some behavior did change, though: OutboundTcpConnection no longer 
reported connecting to the other 2 replicas when requesting MerkleTrees, so the 
only connection message I saw was the one for self. Here is the snippet: 

 INFO [Thread-458] 2013-04-24 23:21:16,543 StorageService.java (line 2407) Starting repair command #1, repairing 1 ranges for keyspace app_production
DEBUG [Thread-458] 2013-04-24 23:21:16,580 StorageService.java (line 2547) computing ranges for 1808575600, 7089215977519551322153637656637080005, 14178431955039102644307275311465584410, 42535295865117307932921825930779602030, 49624511842636859255075463585608106435, 56713727820156410577229101240436610840, 85070591730234615865843651859750628460, 92159807707754167187997289514579132865, 99249023685273718510150927169407637270, 127605887595351923798765477788721654890, 134695103572871475120919115443550159295, 141784319550391026443072753098378663700
 INFO [AntiEntropySessions:1] 2013-04-24 23:21:16,587 AntiEntropyService.java (line 651) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] new session: will sync /YYY.XX.98.11, /YY.XXX.107.137, /YY.XXX.133.163 on range (99249023685273718510150927169407637270,127605887595351923798765477788721654890] for cardspring_production.[App]
 INFO [AntiEntropySessions:1] 2013-04-24 23:21:16,598 AntiEntropyService.java (line 857) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] requesting merkle trees for App (to [/XX.YYY.107.137, /XX.YYY.133.163, /XXX.YY.98.11])
DEBUG [WRITE-/107.20.98.11] 2013-04-24 23:21:16,601 OutboundTcpConnection.java (line 260) attempting to connect to /XXX.YY.98.11
 INFO [AntiEntropyStage:1] 2013-04-24 23:21:19,111 AntiEntropyService.java (line 213) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] Received merkle tree for App from /XX.YYY.133.163
DEBUG [ScheduledTasks:1] 2013-04-24 23:21:19,409 GCInspector.java (line 121) GC for ParNew: 54 ms for 1 collections, 669806384 used; max is 4211081216
 INFO [AntiEntropyStage:1] 2013-04-24 23:21:20,408 AntiEntropyService.java (line 213) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] Received merkle tree for App from /XX.YYY.107.137

See the debug line with OutboundTcpConnection. It is trying to connect to the 
public IP of self (XXX.YY.98.11), which is still an issue. Before this line I 
expected to see two more consecutive lines, as before, showing 
OutboundTcpConnection trying to connect to the other nodes as well. Those log 
lines did not show up, even though the other nodes returned their MerkleTrees, 
so the connections to the other nodes were somehow made successfully. 
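
The comparison described above (which endpoints were asked for a merkle tree, 
which ones answered, and which connection attempts were logged) can be sketched 
with a small script. This is only an illustration and not part of Cassandra: 
the function name summarize_repair and the regular expressions are assumptions 
based on the message wording in the snippet, it expects one log entry per 
line, and it does not separate concurrent repair sessions by their id.

```python
import re

# Sample entries copied from the snippet above, one log entry per line.
LOG = """\
 INFO [AntiEntropySessions:1] 2013-04-24 23:21:16,598 AntiEntropyService.java (line 857) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] requesting merkle trees for App (to [/XX.YYY.107.137, /XX.YYY.133.163, /XXX.YY.98.11])
DEBUG [WRITE-/107.20.98.11] 2013-04-24 23:21:16,601 OutboundTcpConnection.java (line 260) attempting to connect to /XXX.YY.98.11
 INFO [AntiEntropyStage:1] 2013-04-24 23:21:19,111 AntiEntropyService.java (line 213) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] Received merkle tree for App from /XX.YYY.133.163
 INFO [AntiEntropyStage:1] 2013-04-24 23:21:20,408 AntiEntropyService.java (line 213) [repair #a9a87e40-ad35-11e2-945a-050d956ff11b] Received merkle tree for App from /XX.YYY.107.137
"""

# Patterns are assumptions derived from the message text in the snippet.
REQUESTED = re.compile(r"requesting merkle trees for (\S+) \(to \[([^\]]*)\]\)")
RECEIVED = re.compile(r"Received merkle tree for (\S+) from (\S+)")
CONNECT = re.compile(r"attempting to connect to (\S+)")


def summarize_repair(log_text):
    """Return endpoints still owing a merkle tree and the logged connect attempts."""
    requested, received, attempts = set(), set(), []
    for line in log_text.splitlines():
        m = REQUESTED.search(line)
        if m:
            # Split the bracketed endpoint list into individual addresses.
            requested.update(ep.strip() for ep in m.group(2).split(",") if ep.strip())
            continue
        m = RECEIVED.search(line)
        if m:
            received.add(m.group(2))
            continue
        m = CONNECT.search(line)
        if m:
            attempts.append(m.group(1))
    return {"pending": sorted(requested - received), "connect_attempts": attempts}


if __name__ == "__main__":
    print(summarize_repair(LOG))
```

On the sample entries, the only endpoint left in "pending" is /XXX.YY.98.11, 
the public IP of self, and it is also the only logged connection attempt. 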
                
> Repair Freeze/Gossip Invisibility Issues 1.2.4
> ----------------------------------------------
>
>                 Key: CASSANDRA-5432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.4
>         Environment: Ubuntu 10.04.1 LTS
> C* 1.2.3
> Sun Java 6 u43
> JNA Enabled
> Not using VNodes
>            Reporter: Arya Goudarzi
>            Assignee: Vijay
>            Priority: Critical
>
> Read comment 6. This description summarizes the repair issue only, but I 
> believe there is a bigger networking problem going on, as described in that 
> comment. 
> Since I upgraded our sandbox cluster, I have been unable to run repair on any 
> node, and we reach our gc_grace_seconds this weekend. Please help. So far, I 
> have tried the following suggestions:
> - nodetool scrub
> - offline scrub
> - running repair on each CF separately. Didn't matter. All got stuck the same 
> way.
> The repair command just gets stuck and the machine sits idle. Only the 
> following logs are printed for the repair job:
>  INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) Starting repair command #4, repairing 1 ranges for keyspace cardspring_production
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java (line 652) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] new session: will sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range (1808575600,42535295865117307932921825930779602032] for keyspace_production.[comma separated list of CFs]
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java (line 858) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, /X.X.X.190])
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle tree for ColumnFamilyName from /X.X.X.43
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle tree for ColumnFamilyName from /X.X.X.56
> Please advise. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
