[ 
https://issues.apache.org/jira/browse/CASSANDRA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994779#comment-14994779
 ] 

Paulo Motta commented on CASSANDRA-10485:
-----------------------------------------

I implemented an alternative approach which is a bit cleaner and more 
deterministic. The basic idea is to add a new method, 
{{TokenMetadata.isMemberOrPending()}}, and only submit hints to endpoints that 
are ring members or pending members. This avoids fetching null host IDs for 
removed pending endpoints while the new pending ranges are being calculated.

In order to support the {{TokenMetadata.isMemberOrPending()}} method, the 
{{TokenMetadata}} maintains a new {{livePendingEndpoints}} set which is 
populated every time new pending ranges are set. When endpoints are removed 
from {{TokenMetadata}} via the {{removeEndpoint}} method, they're also removed 
from the {{livePendingEndpoints}} set, so {{TokenMetadata.isMemberOrPending()}} 
returns false once the endpoint is evicted from the ring. Since both 
{{removeEndpoint}} and {{setPendingRanges}} update this set, they share a write 
lock. {{TokenMetadata.isMemberOrPending()}} takes the read lock, as other 
methods such as {{isMember()}} and {{getHostId()}} do.
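
To make the locking scheme above concrete, here is a minimal sketch. It is not the code from the branches: {{TokenMetadataSketch}}, {{addMember}} and {{shouldSubmitHint}} are hypothetical names, endpoints are plain strings instead of {{InetAddress}}, and it assumes that {{setPendingRanges}} rebuilds {{livePendingEndpoints}} from scratch on every call.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for TokenMetadata (endpoints are plain strings).
class TokenMetadataSketch
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> members = new HashSet<>();
    private final Set<String> livePendingEndpoints = new HashSet<>();

    // Hypothetical helper so the sketch can register ring members.
    void addMember(String endpoint)
    {
        lock.writeLock().lock();
        try { members.add(endpoint); }
        finally { lock.writeLock().unlock(); }
    }

    // Assumption: the pending set is rebuilt each time new pending ranges are set.
    void setPendingRanges(Collection<String> pendingEndpoints)
    {
        lock.writeLock().lock();
        try
        {
            livePendingEndpoints.clear();
            livePendingEndpoints.addAll(pendingEndpoints);
        }
        finally { lock.writeLock().unlock(); }
    }

    // Removal also evicts the endpoint from the pending set, under the same write lock.
    void removeEndpoint(String endpoint)
    {
        lock.writeLock().lock();
        try
        {
            members.remove(endpoint);
            livePendingEndpoints.remove(endpoint);
        }
        finally { lock.writeLock().unlock(); }
    }

    // Readers only need the read lock.
    boolean isMemberOrPending(String endpoint)
    {
        lock.readLock().lock();
        try { return members.contains(endpoint) || livePendingEndpoints.contains(endpoint); }
        finally { lock.readLock().unlock(); }
    }

    // Hypothetical guard for the hint path: skip endpoints that left the ring.
    static boolean shouldSubmitHint(TokenMetadataSketch metadata, String target)
    {
        return metadata.isMemberOrPending(target);
    }
}
```

With this shape, a hint for an endpoint that was pending and then removed is skipped instead of reaching the host ID lookup that trips the assertion.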

Merging the solution from 2.1 to 2.2/3.0 was a bit tricky because the pending 
ranges calculation was moved from the {{PendingRangeCalculatorService}} into 
{{TokenMetadata}}, where it runs within a read lock. I had to separate the 
actual calculation (within a read lock) from the actual assignment of the 
{{pendingRanges}} via the {{setPendingRanges}} method, which uses a write lock. 
On 3.0, the hints submission part is slightly different (even simpler) due to 
the new hints implementation.
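
The split between calculation and assignment can be sketched as follows. This is an illustration under assumptions, not the patch itself: {{PendingRangesSketch}}, {{addBootstrapToken}} and {{updatePendingRanges}} are hypothetical names, and the "calculation" is a placeholder. One reason such a split is needed with {{java.util.concurrent.locks.ReentrantReadWriteLock}} is that it does not support upgrading a held read lock to a write lock, so the write lock can only be taken after the read lock has been released.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: compute under the read lock, assign under the write lock.
class PendingRangesSketch
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> bootstrapTokens = new HashMap<>(); // token -> endpoint
    private Map<String, String> pendingRanges = new HashMap<>();         // range -> endpoint

    // Hypothetical helper to give the calculation some input.
    void addBootstrapToken(String token, String endpoint)
    {
        lock.writeLock().lock();
        try { bootstrapTokens.put(token, endpoint); }
        finally { lock.writeLock().unlock(); }
    }

    // Step 1: the calculation only reads shared state, so the read lock is enough.
    private Map<String, String> calculatePendingRanges()
    {
        lock.readLock().lock();
        try
        {
            Map<String, String> result = new HashMap<>();
            // Placeholder for the real range calculation.
            for (Map.Entry<String, String> e : bootstrapTokens.entrySet())
                result.put("range-ending-at-" + e.getKey(), e.getValue());
            return result;
        }
        finally { lock.readLock().unlock(); }
    }

    // Step 2: the assignment mutates shared state, so it takes the write lock.
    private void setPendingRanges(Map<String, String> newRanges)
    {
        lock.writeLock().lock();
        try { pendingRanges = newRanges; }
        finally { lock.writeLock().unlock(); }
    }

    // The read lock is already released when the write lock is requested.
    void updatePendingRanges()
    {
        setPendingRanges(calculatePendingRanges());
    }

    Map<String, String> getPendingRanges()
    {
        lock.readLock().lock();
        try { return new HashMap<>(pendingRanges); }
        finally { lock.readLock().unlock(); }
    }
}
```

Note that in this sketch the ring state may change between the two steps; it does not attempt to handle that window.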

It's still not ideal, but I guess it's better than the previous approach. I 
will add a link from this ticket to CASSANDRA-6061 so we can take this ticket 
into account when refactoring {{TokenMetadata}}.

Below are the new branches and test results:
||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485-v3]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10485-v3]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10485-v3]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10485-v3]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-v3-testall/lastCompletedBuild/testReport/]|
|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-v3-dtest/lastCompletedBuild/testReport/]|


> Missing host ID on hinted handoff write
> ---------------------------------------
>
>                 Key: CASSANDRA-10485
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10485
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>             Fix For: 2.1.x, 2.2.x, 3.0.x
>
>
> when I restart one of them I receive the error "Missing host ID":
> {noformat}
> WARN  [SharedPool-Worker-1] 2015-10-08 13:15:33,882 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: Missing host ID for 63.251.156.141
>         at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:978) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:950) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:2235) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>         at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.3.jar:2.1.3]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> If I run nodetool status, the problematic node has an ID:
> {noformat}
> UN  10.10.10.12  1.3 TB     1       ?       4d5c8fd2-a909-4f09-a23c-4cd6040f338a  rack3
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
