[ https://issues.apache.org/jira/browse/CASSANDRA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994779#comment-14994779 ]
Paulo Motta commented on CASSANDRA-10485:
-----------------------------------------

I implemented an alternative approach which is a bit cleaner and more deterministic.

The basic idea is to have a new method {{TokenMetadata.isMemberOrPending()}}, and to only submit hints to endpoints that are ring members or pending membership. This avoids fetching null host IDs for removed pending endpoints while the new pending ranges are being calculated.

In order to support the {{TokenMetadata.isMemberOrPending()}} method, {{TokenMetadata}} maintains a new {{livePendingEndpoints}} set, which is populated every time new pending ranges are set. When endpoints are removed from {{TokenMetadata}} via the {{removeEndpoint}} method, they are also removed from the {{livePendingEndpoints}} set, so {{TokenMetadata.isMemberOrPending()}} returns false once the endpoint is evicted from the ring. Since both {{removeEndpoint}} and {{setPendingRanges}} update this set, they share a write lock. {{TokenMetadata.isMemberOrPending()}} uses a read lock, similar to other methods such as {{isMember()}} or {{getHostId()}}.

Merging the solution from 2.1 to 2.2/3.0 was a bit tricky because the pending ranges calculation was moved from the {{PendingRangeCalculatorService}} to {{TokenMetadata}}, where it runs within a read lock. I had to separate the actual calculation (within a read lock) from the actual assignment of the {{pendingRanges}} via the {{setPendingRanges}} method, which uses a write lock. On 3.0, the hints submission part is slightly different (even simpler) due to the new hints implementation.

It's still not ideal, but I think it is better than the previous approach. I will add a link from this ticket to CASSANDRA-6061 so we can take this ticket into account when refactoring {{TokenMetadata}}.
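
To make the locking scheme described above easier to follow, here is a minimal sketch of the idea. It is not the code from the branches: {{TokenMetadataSketch}}, {{addMember}} and the use of plain strings for endpoints are assumptions made for illustration, and {{setPendingRanges}} here simply takes the set of pending endpoints instead of real per-keyspace range maps. It also assumes the set is replaced, rather than extended, on each call.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for TokenMetadata; endpoints are plain
// strings here, whereas Cassandra uses InetAddress.
class TokenMetadataSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> members = new HashSet<>();
    private final Set<String> livePendingEndpoints = new HashSet<>();

    // Assumed helper: registers a normal ring member.
    void addMember(String endpoint) {
        lock.writeLock().lock();
        try {
            members.add(endpoint);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Assignment step: takes the write lock and repopulates the pending set.
    void setPendingRanges(Collection<String> pendingEndpoints) {
        lock.writeLock().lock();
        try {
            livePendingEndpoints.clear();
            livePendingEndpoints.addAll(pendingEndpoints);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Removal shares the same write lock and also evicts the endpoint
    // from the pending set.
    void removeEndpoint(String endpoint) {
        lock.writeLock().lock();
        try {
            members.remove(endpoint);
            livePendingEndpoints.remove(endpoint);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Read-locked check a hint writer could consult before looking up a host ID.
    boolean isMemberOrPending(String endpoint) {
        lock.readLock().lock();
        try {
            return members.contains(endpoint) || livePendingEndpoints.contains(endpoint);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this shape, a hint writer would call {{isMemberOrPending(endpoint)}} first and skip the hint when it returns false, instead of asserting on a missing host ID.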

Below are the new branches and test results:

||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485-v3]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10485-v3]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10485-v3]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10485-v3]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-v3-testall/lastCompletedBuild/testReport/]|
|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-v3-dtest/lastCompletedBuild/testReport/]|

> Missing host ID on hinted handoff write
> ---------------------------------------
>
>                 Key: CASSANDRA-10485
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10485
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>             Fix For: 2.1.x, 2.2.x, 3.0.x
>
>
> when I restart one of them I receive the error "Missing host ID":
> {noformat}
> WARN [SharedPool-Worker-1] 2015-10-08 13:15:33,882 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: Missing host ID for 63.251.156.141
>     at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:978) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:950) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:2235) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>     at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.3.jar:2.1.3]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> If I made nodetool status, the problematic node has ID:
> {noformat}
> UN 10.10.10.12 1.3 TB 1 ? 4d5c8fd2-a909-4f09-a23c-4cd6040f338a rack3
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)