Oleg Kibirev created CASSANDRA-5456:
---------------------------------------

             Summary: Large number of bootstrapping nodes cause gossip to stop working
                 Key: CASSANDRA-5456
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5456
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.1.10
            Reporter: Oleg Kibirev


A long-running section of code in PendingRangeCalculatorService is synchronized on bootstrapTokens. When a large number of nodes (hundreds in our case) are bootstrapping, gossip stops working because it waits for the same lock. Consequently, the whole cluster becomes non-functional.

I experimented with the following change in PendingRangeCalculatorService.java, and it resolved the problem in our case. The prior code held the synchronized block around the entire for loop; the change holds the lock only long enough to copy the map and then iterates over the copy.

    synchronized(bootstrapTokens) {
        bootstrapTokens = new LinkedHashMap<Token, InetAddress>(bootstrapTokens);
    }

    for (Map.Entry<Token, InetAddress> entry : bootstrapTokens.entrySet())
    {
        InetAddress endpoint = entry.getValue();

        allLeftMetadata.updateNormalToken(entry.getKey(), endpoint);
        for (Range<Token> range : strategy.getAddressRanges(allLeftMetadata).get(endpoint))
            pendingRanges.put(range, endpoint);
        allLeftMetadata.removeEndpoint(endpoint);
    }
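
For readers without the Cassandra source at hand, the idea behind the change can be sketched in isolation: take the lock only to copy the shared map, then do the slow per-entry work on the private copy. The sketch below is a stand-in, not Cassandra code. The class SnapshotUnderLock, its methods, and the use of String for tokens and endpoints are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not Cassandra code: copy a shared map while holding
// its lock, then run the slow loop on the copy with the lock released.
class SnapshotUnderLock {
    // Shared state; every access in this sketch synchronizes on the map itself.
    private final Map<String, String> bootstrapTokens = new LinkedHashMap<>();

    // Writer path (plays the role of gossip): needs the lock only briefly.
    void addToken(String token, String endpoint) {
        synchronized (bootstrapTokens) {
            bootstrapTokens.put(token, endpoint);
        }
    }

    // Hold the lock just long enough to take a private copy.
    Map<String, String> snapshot() {
        synchronized (bootstrapTokens) {
            return new LinkedHashMap<>(bootstrapTokens);
        }
    }

    // The per-entry work runs on the copy, so writers are not blocked by it.
    int calculate() {
        Map<String, String> copy = snapshot();
        int processed = 0;
        for (Map.Entry<String, String> entry : copy.entrySet()) {
            // Placeholder for the expensive per-endpoint range calculation.
            // Writing to the shared map here is safe because we iterate the copy.
            addToken(entry.getKey() + "-seen", entry.getValue());
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        SnapshotUnderLock s = new SnapshotUnderLock();
        s.addToken("t1", "10.0.0.1");
        s.addToken("t2", "10.0.0.2");
        System.out.println("entries processed: " + s.calculate());
    }
}
```

The trade-off is that the loop works on a point-in-time view: tokens added after the copy is taken are not seen until the next calculation. Whether that is acceptable for pending range calculation is a question for the actual fix, not something this sketch settles.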