Hello again,
Looking into system.peers, we found that some nodes contain entries for
themselves with null values. Not sure if this could be an issue; has anyone
seen something similar? This state is already present before we include the
funky DC (DC7) in replication.
peer            | <IP address>
data_center     | null
host_id         | null
preferred_ip    | 192.168.104.111
rack            | null
release_version | null
rpc_address     | null
schema_version  | null
tokens          | null
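A minimal sketch of how such rows could be flagged across many nodes, assuming the system.peers rows have been exported as dictionaries keyed by column name. The helper name, the row shape and the sample addresses are hypothetical, for illustration only:

```python
# Hypothetical helper: flag system.peers rows that look broken.
# Assumes each row is a dict keyed by the system.peers column names.
REQUIRED_COLUMNS = ("data_center", "host_id", "rack", "release_version",
                    "rpc_address", "schema_version", "tokens")


def suspicious_peer_rows(rows, local_addresses):
    """Return a list of (peer, reasons) for rows that deserve a closer look."""
    flagged = []
    for row in rows:
        reasons = []
        # A node should not normally appear in its own system.peers table.
        if row.get("peer") in local_addresses:
            reasons.append("node lists itself")
        # Null metadata columns are the symptom described above.
        missing = [c for c in REQUIRED_COLUMNS if row.get(c) is None]
        if missing:
            reasons.append("null columns: " + ", ".join(missing))
        if reasons:
            flagged.append((row.get("peer"), reasons))
    return flagged


# Made-up sample data: the first row mimics the entry shown above.
rows = [
    {"peer": "10.0.0.1", "data_center": None, "host_id": None,
     "preferred_ip": "192.168.104.111", "rack": None,
     "release_version": None, "rpc_address": None,
     "schema_version": None, "tokens": None},
    {"peer": "10.0.0.2", "data_center": "DC1", "host_id": "h2",
     "preferred_ip": None, "rack": "r1", "release_version": "3.11.4",
     "rpc_address": "10.0.0.2", "schema_version": "s1",
     "tokens": {"1", "2"}},
]

for peer, reasons in suspicious_peer_rows(rows, {"10.0.0.1"}):
    print(peer, reasons)
```

Running this against each node's own system.peers output, with that node's addresses passed as local_addresses, would list the nodes that carry a self-entry.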
Have a wonderful day 😊
Gediminas
From: Gediminas Blazys <[email protected]>
Sent: Monday, May 4, 2020 10:09
To: [email protected]
Subject: RE: [EXTERNAL] Re: Adding new DC results in clients failing to connect
Hello,
Thanks for the reply.
Following your advice we took a look at system.local for seed nodes and
compared that data with nodetool ring. Both sources contain the same tokens for
these specific hosts. Will continue looking into system.peers.
We have enabled more verbosity on the C# driver and this is the message that we
get now:
ControlConnection: 05/03/2020 14:28:42.346 +03:00 : Updating keyspaces metadata
ControlConnection: 05/03/2020 14:28:42.377 +03:00 : Rebuilding token map
ControlConnection: 05/03/2020 14:29:03.837 +03:00 : Finished building TokenMap for 7 keyspaces and 210 hosts. It took 19403 milliseconds.
ControlConnection: 05/03/2020 14:29:03.901 +03:00 ALARMA: ENDPOINT: <<IPADDRESS>>:9042 EXCEPTION: System.ArgumentException: The source argument contains duplicate keys.
   at System.Collections.Concurrent.ConcurrentDictionary`2.InitializeFromCollection(IEnumerable`1 collection)
   at System.Collections.Concurrent.ConcurrentDictionary`2..ctor(IEnumerable`1 collection, IEqualityComparer`1 comparer)
   at System.Collections.Concurrent.ConcurrentDictionary`2..ctor(IEnumerable`1 collection)
   at Cassandra.TokenMap..ctor(TokenFactory factory, IReadOnlyDictionary`2 tokenToHostsByKeyspace, List`1 ring, IReadOnlyDictionary`2 primaryReplicas, IReadOnlyDictionary`2 keyspaceTokensCache, IReadOnlyDictionary`2 datacenters, Int32 numberOfHostsWithTokens)
   at Cassandra.TokenMap.Build(String partitioner, ICollection`1 hosts, ICollection`1 keyspaces)
   at Cassandra.Metadata.<RebuildTokenMapAsync>d__59.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
   at Cassandra.Connections.ControlConnection.<Connect>d__44.MoveNext()
The exception is thrown in the Cassandra.TokenMap constructor, while it builds
a ConcurrentDictionary. We are analyzing the objects that the driver
initializes during token map creation, but we have yet to find the dictionary
with the duplicated keys.
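One check that might help narrow this down, under the unconfirmed assumption that the duplicates come from the cluster's token data rather than from some other dictionary in the driver, is to look for tokens claimed by more than one host. A minimal sketch, where the function name and the host-to-tokens input shape are hypothetical:

```python
from collections import defaultdict


def tokens_owned_by_multiple_hosts(tokens_by_host):
    """Map each token claimed by more than one host to its sorted owner list."""
    owners = defaultdict(set)
    for host, tokens in tokens_by_host.items():
        # Tolerate hosts whose token set is null, as in the system.peers rows.
        for token in tokens or ():
            owners[token].add(host)
    return {token: sorted(hosts)
            for token, hosts in owners.items() if len(hosts) > 1}


# Made-up sample: token "2" is claimed by both host "a" and host "b".
sample = {"a": {"1", "2"}, "b": {"2", "3"}, "c": None}
print(tokens_owned_by_multiple_hosts(sample))
```

An empty result would suggest that the duplicated keys come from somewhere other than overlapping tokens.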
Just to note, once this new DC is added to replication, the Python driver is
unable to establish a connection either. cqlsh, though, seems to be OK. It is
hard to say for sure, but for now at least, this issue seems to be pointing to
Cassandra.
Gediminas
From: Jorge Bay Gondra
<[email protected]>
Sent: Thursday, April 30, 2020 11:45
To: [email protected]
Subject: [EXTERNAL] Re: Adding new DC results in clients failing to connect
Hi,
You can enable logging in the driver to see what's happening under the hood:
https://docs.datastax.com/en/developer/csharp-driver/3.14/faq/#how-can-i-enable-logging-in-the-driver
With logging information, it should be easy to track the issue down.
Can you query system.local and system.peers on a seed node / contact point to
see whether the node list and token info are what you expect? You can compare
them to the nodetool ring info.
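The comparison could be sketched roughly as follows, assuming both views have already been collected into host-to-tokens mappings (for example from cqlsh output and from parsed nodetool ring output). The function name and input shape are hypothetical:

```python
def diff_token_views(from_system_tables, from_nodetool_ring):
    """Report, per host, the tokens that appear in only one of the two views."""
    report = {}
    all_hosts = set(from_system_tables) | set(from_nodetool_ring)
    for host in sorted(all_hosts):
        # A missing host or a null token set is treated as an empty set.
        a = set(from_system_tables.get(host) or ())
        b = set(from_nodetool_ring.get(host) or ())
        if a != b:
            report[host] = {
                "only_in_system_tables": sorted(a - b),
                "only_in_nodetool_ring": sorted(b - a),
            }
    return report


# Made-up sample: host "n2" has null tokens in the system tables.
system_view = {"n1": {"10", "20"}, "n2": None}
ring_view = {"n1": {"10", "20"}, "n2": {"30"}}
print(diff_token_views(system_view, ring_view))
```

Hosts that agree in both views are left out of the report, so an empty result means the two sources match.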
Not directly related: 256 vnodes is probably more than you want.
Thanks,
Jorge
On Thu, Apr 30, 2020 at 9:48 AM Gediminas Blazys
<[email protected]>
wrote:
Hello,
We have run into a very interesting issue and maybe some of you have
encountered it or just have an idea where to look.
We are working towards adding new DCs to our cluster. Here's the current
topology:
DC1 - 18 nodes
DC2 - 18 nodes
DC3 - 18 nodes
DC4 - 18 nodes
DC5 - 18 nodes
Recently we introduced a new DC6 (60 nodes) into our cluster. The joining and
rebuilding of DC6 went smoothly, and clients are using it without issue. This
is how the topology looked after DC6 joined:
DC1 - 18 nodes
DC2 - 18 nodes
DC3 - 18 nodes
DC4 - 18 nodes
DC5 - 18 nodes
DC6 - 60 nodes
Next we wanted to add another DC, DC7 (also 60 nodes), for a total of 210
nodes in the cluster. Joining the new nodes went smoothly, but once we changed
the replication of the user-defined keyspaces to include DC7, no clients were
able to connect to Cassandra (regardless of which DC they addressed). They
throw the exception that I have provided at the end of this email.
Cassandra version 3.11.4.
C# driver version 3.12.0. Also tested with 3.14.0. We use the DC-aware
round-robin policy and update ring metadata for connecting clients.
Number of vnodes per node: 256
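For a sense of scale, the number of tokens in the ring follows directly from the topology and vnode count given above. A small worked calculation, using only figures from this email (the variable names are illustrative):

```python
# Node counts per DC as listed in this email, including the new DC7.
nodes_per_dc = {"DC1": 18, "DC2": 18, "DC3": 18, "DC4": 18, "DC5": 18,
                "DC6": 60, "DC7": 60}
vnodes_per_node = 256

total_nodes = sum(nodes_per_dc.values())
total_tokens = total_nodes * vnodes_per_node

# The same figures for the cluster before DC7 joined.
nodes_without_dc7 = total_nodes - nodes_per_dc["DC7"]
tokens_without_dc7 = nodes_without_dc7 * vnodes_per_node

print(total_nodes, total_tokens)              # 210 53760
print(nodes_without_dc7, tokens_without_dc7)  # 150 38400
```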
The stack trace starts with the exception 'The source argument contains
duplicate keys.'. Do you know what kind of data this dictionary holds, and
which of it could be duplicated?
Clients are unable to connect until the moment we remove DC7 from replication.
Once replication is adjusted to exclude DC7, clients can connect normally.
Cassandra.NoHostAvailableException: All hosts tried for query failed (tried <<IPaddress>>:9042: ArgumentException 'The source argument contains duplicate keys.') 2020/04/29 10:19:27.51410636
   at Cassandra.Connections.ControlConnection.<Connect>d__39.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.Connections.ControlConnection.<InitAsync>d__36.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.Tasks.TaskHelper.<WaitToCompleteAsync>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.Cluster.<Cassandra-SessionManagement-IInternalCluster-OnInitializeAsync>d__50.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.ClusterLifecycleManager.<InitializeAsync>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.Cluster.<Cassandra-SessionManagement-IInternalCluster-ConnectAsync>d__47`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Cassandra.Cluster.<ConnectAsync>d__46.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Cassandra.Tasks.TaskHelper.WaitToComplete(Task task, Int32 timeout)
   at Cassandra.Cluster.Connect()
We would really appreciate your input, big thanks in advance.
Gediminas