[jira] [Updated] (CASSANDRA-19361) fix node info NPE when ClusterMetadata is null

Ling Mao (Jira) Sun, 04 Feb 2024 19:37:05 -0800


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-19361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ling Mao updated CASSANDRA-19361:
---------------------------------
    Description: 
h3. How

 
I created a three-node cluster, which worked well, and then added a fourth node. 
After that, running nodetool info fails with the following exception:
{code:java}
➜  bin ./nodetool info

java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.operationMode(StorageService.java:3744)
    at org.apache.cassandra.service.StorageService.isBootstrapFailed(StorageService.java:3810)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)

➜  bin ./nodetool info

WARN  [InternalResponseStage:152] 2024-02-02 11:45:15,731 RemoteProcessor.java:213 - Got error from /127.0.0.4:7000: TIMEOUT when sending TCM_COMMIT_REQ, retrying on CandidateIterator{candidates=[/127.0.0.4:7000], checkLive=true}
error: null
-- StackTrace --
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.getLocalHostId(StorageService.java:1904)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:260)
{code}
Server 1 cannot run nodetool info or the CQL shell, while servers 2 and 3 can. 
I tried querying the system-prefixed tables and attach the error stack trace for 
further debugging. I could not find a way to recover. Only after deleting the 
data (losing all of it) and restarting did everything work again.
{code:java}
➜  bin ./nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load  Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.2  ?     16      51.2%             6d194555-f6eb-41d0-c000-000000000002  rack1
DN  127.0.0.4  ?     16      48.8%             6d194555-f6eb-41d0-c000-000000000001  rack1
{code}
h3. When

 
The regression was introduced by the CEP-21 patch. In any case, a null check is 
needed so that the NPE cannot propagate to callers.
{code:java}
Implementation of Transactional Cluster Metadata as described in CEP-21
Hash: ae084237
 
code diff:
 
     public String getLocalHostId()
     {
-        UUID id = getLocalHostUUID();
-        return id != null ? id.toString() : null;
+        return getLocalHostUUID().toString();
     }
 
     public UUID getLocalHostUUID()
     {
-        UUID id = getTokenMetadata().getHostId(FBUtilities.getBroadcastAddressAndPort());
-        if (id != null)
-            return id;
-        // this condition is to prevent accessing the tables when the node is not started yet, and in particular,
-        // when it is not going to be started at all (e.g. when running some unit tests or client tools).
-        else if ((DatabaseDescriptor.isDaemonInitialized() || DatabaseDescriptor.isToolInitialized()) && CommitLog.instance.isStarted())
-            return SystemKeyspace.getLocalHostId();
-
-        return null;
+        // Metadata collector requires using local host id, and flush of IndexInfo may race with
+        // creation and initialization of cluster metadata service. Metadata collector does accept
+        // null localhost ID values, it's just that TokenMetadata was created earlier.
+        ClusterMetadata metadata = ClusterMetadata.currentNullable();
+        if (metadata == null || metadata.directory.peerId(getBroadcastAddressAndPort()) == null)
+            return null;
+        return metadata.directory.peerId(getBroadcastAddressAndPort()).toUUID();
     }
{code}
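
The diff above shows that getLocalHostUUID() can still return null while 
getLocalHostId() now calls toString() on the result unconditionally. Below is a 
minimal, standalone sketch of the kind of null guard the description asks for. 
It is not the actual patch: the class NullSafeHostId, the method localHostId and 
the Supplier parameter are hypothetical stand-ins for StorageService and 
getLocalHostUUID(), used only so the example runs on its own.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical stand-in for the relevant part of StorageService.
class NullSafeHostId
{
    // The supplier plays the role of getLocalHostUUID(), which may return null
    // (for example while cluster metadata is not yet available).
    static String localHostId(Supplier<UUID> localHostUuid)
    {
        UUID id = localHostUuid.get();
        // Guard restored from the pre-CEP-21 code: never call toString() on null.
        return id != null ? id.toString() : null;
    }

    public static void main(String[] args)
    {
        // Metadata not available yet: the caller gets null instead of an NPE.
        System.out.println(localHostId(() -> null));

        // Metadata available: the host ID is returned as a string.
        UUID known = UUID.fromString("6d194555-f6eb-41d0-c000-000000000001");
        System.out.println(localHostId(() -> known));
    }
}
```

With a guard like this, a caller such as nodetool info would have to handle a 
null host ID itself, but the exception would no longer surface from inside the 
JMX call.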


> fix node info NPE when ClusterMetadata is null
> ----------------------------------------------
>
>                 Key: CASSANDRA-19361
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19361
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tool/nodetool, Transactional Cluster Metadata
>            Reporter: Ling Mao
>            Assignee: Ling Mao
>            Priority: Normal
>             Fix For: 5.0.x
>
>


