[ 
https://issues.apache.org/jira/browse/CASSANDRA-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jai Bheemsen Rao Dhanwada updated CASSANDRA-19633:
--------------------------------------------------
    Description: 
Hello,

 

I am running into an issue wherein a node that is replacing a dead (non-seed) 
node appears to be stuck calculating ranges. It does eventually succeed, but 
the time taken to calculate the ranges is not constant: I sometimes see it 
take 24 hours to calculate the ranges for each keyspace. I have attached a 
flame graph of the Cassandra process taken during this time, which points to 
the code below. 

 

 
{code:java}
Multimap<InetAddressAndPort, Range<Token>> getRangeFetchMapForNonTrivialRanges()
{
    //Get the graph with edges between ranges and their source endpoints
    MutableCapacityGraph<Vertex, Integer> graph = getGraph();
    //Add source and destination vertex and edges
    addSourceAndDestination(graph, getDestinationLinkCapacity(graph));

    int flow = 0;
    MaximumFlowAlgorithmResult<Integer, CapacityEdge<Vertex, Integer>> result = null;

    //We might not be working on all ranges
    while (flow < getTotalRangeVertices(graph))
    {
        if (flow > 0)
        {
            //We could not find a path with previous graph. Bump the capacity b/w endpoint vertices and destination by 1
            incrementCapacity(graph, 1);
        }

        MaximumFlowAlgorithm fordFulkerson = FordFulkersonAlgorithm.getInstance(DFSPathFinder.getInstance());
        result = fordFulkerson.calc(graph, sourceVertex, destinationVertex, IntegerNumberSystem.getInstance());

        int newFlow = result.calcTotalFlow();
        assert newFlow > flow; //We are not making progress which should not happen
        flow = newFlow;
    }

    return getRangeFetchMapFromGraphResult(graph, result);
}
{code}
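To make the shape of that loop easier to reason about, here is a minimal standalone sketch of the same idea: run a maximum-flow computation, and if not every range could be assigned, raise each endpoint's capacity by 1 and run it again. It is not the Cassandra implementation. The class `FetchMapLoopSketch`, its methods, and the array-based graph are hypothetical, and recomputing the flow from scratch on every round is an assumption about what the quoted `calc` call does.

```java
import java.util.Arrays;

class FetchMapLoopSketch
{
    // DFS for an augmenting path: place `range` on an endpoint with spare capacity,
    // moving already-placed ranges to other endpoints when that frees a slot.
    static boolean augment(int range, int[][] sources, int capacity,
                           int[] assigned, int[] load, boolean[] visited)
    {
        for (int endpoint : sources[range])
        {
            if (visited[endpoint])
                continue;
            visited[endpoint] = true;

            if (load[endpoint] < capacity)
            {
                assigned[range] = endpoint;
                load[endpoint]++;
                return true;
            }

            // Endpoint is full: try to re-route one of its ranges elsewhere.
            for (int other = 0; other < assigned.length; other++)
            {
                if (assigned[other] == endpoint
                    && augment(other, sources, capacity, assigned, load, visited))
                {
                    // `other` now lives on a different endpoint, so its old slot is reused.
                    assigned[range] = endpoint;
                    return true;
                }
            }
        }
        return false;
    }

    // One full max-flow computation for a fixed per-endpoint capacity (assumed to restart from zero).
    static int maxFlow(int[][] sources, int endpoints, int capacity, int[] assigned)
    {
        Arrays.fill(assigned, -1);
        int[] load = new int[endpoints];
        int flow = 0;
        for (int range = 0; range < sources.length; range++)
            if (augment(range, sources, capacity, assigned, load, new boolean[endpoints]))
                flow++;
        return flow;
    }

    // Mirrors the quoted while loop: bump capacity by 1 and recompute until every range is assigned.
    static int roundsUntilAllAssigned(int[][] sources, int endpoints, int startCapacity, int[] assigned)
    {
        int capacity = startCapacity;
        int flow = 0;
        int rounds = 0;
        while (flow < sources.length)
        {
            if (flow > 0)
                capacity++;
            int newFlow = maxFlow(sources, endpoints, capacity, assigned);
            rounds++;
            if (newFlow <= flow)
                throw new IllegalStateException("no progress, e.g. a range without any source");
            flow = newFlow;
        }
        return rounds;
    }

    public static void main(String[] args)
    {
        // Hypothetical input: ranges 0-2 exist only on endpoint 0, range 3 on endpoints 0 and 1.
        int[][] sources = { {0}, {0}, {0}, {0, 1} };
        int[] assigned = new int[sources.length];
        int rounds = roundsUntilAllAssigned(sources, 2, 2, assigned);
        System.out.println("rounds=" + rounds + " assignment=" + Arrays.toString(assigned));
    }
}
```

In this sketch the number of rounds grows with how unevenly the ranges are spread over their source endpoints, and every round pays for a complete flow computation. If the real loop behaves similarly, that could be one reason the runtime varies between nodes in the same cluster, but that is a guess and not something the logs confirm.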
Digging through the logs, I see the following line for the keyspace 
`system_auth`:
{code:java}
INFO [main] 2024-05-10 17:35:02,489 RangeStreamer.java:330 - Bootstrap: range Full(/10.135.56.214:7000,(5080189126057290696,5081324396311791613]) exists on Full(/10.135.56.157:7000,(5080189126057290696,5081324396311791613]) for keyspace system_auth
{code}
 

The corresponding code is:
{code:java}
for (Map.Entry<Replica, Replica> entry : fetchMap.flattenEntries())
    logger.info("{}: range {} exists on {} for keyspace {}", description, entry.getKey(), entry.getValue(), keyspaceName);
{code}
However, I do not see the line below for the same keyspace:
{code:java}
RangeStreamer.java:606 - Output from RangeFetchMapCalculator for keyspace{code}
This suggests that the code is stuck in `getRangeFetchMap()`:
{code:java}
Multimap<InetAddressAndPort, Range<Token>> rangeFetchMapMap = calculator.getRangeFetchMap();
logger.info("Output from RangeFetchMapCalculator for keyspace {}", keyspace);
{code}
Here is the cluster topology:
 * Cassandra version: 4.0.12
 * # of nodes: 190
 * Tokens (vnodes): 128

My initial hypothesis was that the graph calculation takes longer because of 
the combination of nodes, tokens, and tables, but in the same cluster I have 
seen one node join without any issues. 
I am wondering whether I am hitting a bug that lets it work sometimes but 
sends it into an infinite loop at other times.
Please let me know if you need any other details. I would appreciate any 
pointers for debugging this further.



> Replaced node is stuck in a loop calculating ranges
> ---------------------------------------------------
>
>                 Key: CASSANDRA-19633
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19633
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jai Bheemsen Rao Dhanwada
>            Priority: Normal
>         Attachments: result1.html
>
>


