[ https://issues.apache.org/jira/browse/CASSANDRA-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442754#comment-17442754 ]
Aleksey Yeschenko commented on CASSANDRA-17049:
-----------------------------------------------

This is a very rare one. An example stack trace of an NPE:

{code}
ERROR 2021-10-19T08:30:16,692 [HintsWriteExecutor:1] org.apache.cassandra.service.CassandraDaemon:599 - Exception in thread Thread[HintsWriteExecutor:1,5,main]
java.lang.NullPointerException: null
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) ~[?:?]
    at org.apache.cassandra.hints.HintsCatalog.get(HintsCatalog.java:102) ~[cie-cassandra-4.0.0.35.jar:4.0.0.35]
    at com.google.common.collect.Iterables$5.lambda$forEach$0(Iterables.java:704) ~[guava-27.0-jre.jar:?]
    at com.google.common.collect.Iterables$5.lambda$forEach$0(Iterables.java:704) ~[guava-27.0-jre.jar:?]
    at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
    at com.google.common.collect.Iterables$5.forEach(Iterables.java:704) ~[guava-27.0-jre.jar:?]
    at com.google.common.collect.Iterables$5.forEach(Iterables.java:704) ~[guava-27.0-jre.jar:?]
    at org.apache.cassandra.hints.HintsWriteExecutor$PartiallyFlushBufferPoolTask.run(HintsWriteExecutor.java:188) ~[cie-cassandra-4.0.0.35.jar:4.0.0.35]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.58.Final.jar:4.1.58.Final]
    at java.lang.Thread.run(Thread.java:834) [?:?]
{code}

> Fix rare NPE caused by batchlog replay / node decomission races
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-17049
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17049
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Batch Log, Consistency/Hints
>            Reporter: Aleksey Yeschenko
>            Assignee: Aleksey Yeschenko
>            Priority: Low
>             Fix For: 3.0.x, 3.11.x, 4.0.x, 4.x
>
>
> Batchlog replay process collects addresses of the hosts that have been hinted
> to, so it can flush hints for them to disk before confirming deletion of the
> replayed batches. If a node has been decommissioned during replay, however,
> when the time comes to flush the hints at the very end of replay,
> {{StorageService.getHostIdForEndpoint()}} will return {{null}} for its
> address, which will, down the line, cause {{HintsCatalog::get()}} to be
> invoked with a {{null}} host id argument, causing an NPE.
> The simple fix is to check returned host ids for addresses for nulls, and
> collect hinted host ids instead of hinted addresses.
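
The fix described in the issue can be illustrated with a small, self-contained sketch. This is not the actual Cassandra patch: the class {{HintedHostsSketch}}, its two maps, and the methods {{recordHintedHost}} and {{flushHints}} are hypothetical stand-ins, and {{getHostIdForEndpoint}} here only mimics the behaviour the description attributes to {{StorageService.getHostIdForEndpoint()}} (returning {{null}} for a node that has left). The sketch resolves each address to a host id at the moment a hint is recorded, skips {{null}} results, and later flushes by host id, so a {{ConcurrentHashMap}}-backed catalog is never asked for a {{null}} key.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class HintedHostsSketch {
    // Hypothetical stand-in for the address -> host id mapping.
    static final Map<String, UUID> hostIdsByAddress = new ConcurrentHashMap<>();

    // Hypothetical stand-in for the hints catalog, keyed by host id.
    // ConcurrentHashMap.get(null) throws NullPointerException, which is
    // the failure seen at the top of the stack trace.
    static final Map<UUID, String> catalog = new ConcurrentHashMap<>();

    // Mimics the lookup from the description: null once the node is gone.
    static UUID getHostIdForEndpoint(String address) {
        return hostIdsByAddress.get(address);
    }

    // Collect the host id, not the address, and drop null lookups.
    static void recordHintedHost(Set<UUID> hintedHostIds, String address) {
        UUID hostId = getHostIdForEndpoint(address);
        if (hostId != null)
            hintedHostIds.add(hostId);
    }

    // Flush by host id; every key is non-null, so catalog.get() cannot NPE.
    // Returns the number of hint stores found for the collected ids.
    static int flushHints(Set<UUID> hintedHostIds) {
        int flushed = 0;
        for (UUID hostId : hintedHostIds)
            if (catalog.get(hostId) != null)
                flushed++;
        return flushed;
    }

    public static void main(String[] args) {
        UUID live = UUID.randomUUID();
        UUID leaving = UUID.randomUUID();
        hostIdsByAddress.put("10.0.0.1", live);
        hostIdsByAddress.put("10.0.0.2", leaving);
        catalog.put(live, "store-1");
        catalog.put(leaving, "store-2");

        Set<UUID> hinted = new HashSet<>();
        recordHintedHost(hinted, "10.0.0.1");
        recordHintedHost(hinted, "10.0.0.2");

        // The second node is decommissioned in the middle of replay.
        hostIdsByAddress.remove("10.0.0.2");
        recordHintedHost(hinted, "10.0.0.2"); // lookup is now null, skipped

        System.out.println("stores flushed: " + flushHints(hinted));
    }
}
```

With address-based collection, the same scenario would call the lookup only at flush time, receive {{null}} for the departed node, and pass it straight into the catalog.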