[ 
https://issues.apache.org/jira/browse/CASSANDRA-19097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809833#comment-17809833
 ] 

Berenguer Blasi commented on CASSANDRA-19097:
---------------------------------------------

We seem to have a legit problem. I have found instances in Butler from 4.0 and 
4.1, as was already mentioned in the ticket. I have amended fixVersions 
accordingly.

The problem seems to be a race revolving around nodes coming up, the gossiper, 
StorageService and TokenMetadata updates, recalculations, state changes, 
caches, token collisions, etc. The logs are too sparse to see in detail what is 
going on. Maybe somebody with deep knowledge of this area could pin it down, 
but without a reproduction this is hard.

I have attached logs for a dtest-offheap failure example and a working example 
under folder 'works' for a dtests-novnode. Searching for StreamSession across 
all nodes gives an idea of the difference in behavior. Following how nodes 2 
and 3 come up in both node1 logs then makes it clear that in the failing run 
state changes are ignored 'because it is not a member in token metadata'. 
The problem seems to lie in how nodes come up, stream, recalculate tokens, 
etc.
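
To make that comparison repeatable, here is a minimal sketch of the log search 
in Python. The two marker strings are the ones quoted above. The function 
names, the '*.log' file pattern and the directory layout are assumptions made 
for illustration, not the actual layout of jenkinslogs.zip.

```python
from collections import Counter
from pathlib import Path

# Marker strings quoted in the comment; everything else here is illustrative.
MARKERS = ("StreamSession", "because it is not a member in token metadata")


def count_markers(lines):
    """Count how many of the given log lines mention each marker."""
    counts = Counter({marker: 0 for marker in MARKERS})
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def summarize(log_root):
    """Return {log file path: marker counts} for every *.log under log_root.

    The '*.log' pattern is an assumption about how the logs are named.
    """
    summary = {}
    for path in sorted(Path(log_root).rglob("*.log")):
        with path.open(errors="replace") as handle:
            summary[str(path)] = count_markers(handle)
    return summary
```

Running summarize() once on the failing logs and once on the 'works' folder 
should show, per node, how often streaming sessions appear and how often state 
changes were ignored.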

Given this is present in other versions I don't know if it should be a 5.0 
blocker. On the other hand, given its severity and that we don't have enough 
history to tell if it's a recent failure or if it has been there for ages, I 
have mixed thoughts. At least in Jira there is only one similar ticket, 
CASSANDRA-10072, and it looks unrelated. So this could be a recent thing...

I'll keep digging a bit more.

> Test Failure: bootstrap_test.TestBootstrap.*
> --------------------------------------------
>
>                 Key: CASSANDRA-19097
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19097
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Michael Semb Wever
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0-rc, 5.x
>
>         Attachments: jenkinslogs.zip
>
>
> test_killed_wiped_node_cannot_join
> test_read_from_bootstrapped_node
> test_shutdown_wiped_node_cannot_join
> Seen in dtests_offheap in CASSANDRA-19034
> https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/258/workflows/cea7d697-ca33-40bb-8914-fb9fc662820a/jobs/21162/parallel-runs/38
> {noformat}
> self = <bootstrap_test.TestBootstrap object at 0x7fc471171d50>
>     def test_killed_wiped_node_cannot_join(self):
> >       self._wiped_node_cannot_join_test(gently=False)
> bootstrap_test.py:608: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> self = <bootstrap_test.TestBootstrap object at 0x7fc471171d50>, gently = False
>     def _wiped_node_cannot_join_test(self, gently):
>         """
>         @jira_ticket CASSANDRA-9765
>         Test that if we stop a node and wipe its data then the node cannot join
>         when it is not a seed. Test both a nice shutdown or a forced shutdown, via
>         the gently parameter.
>         """
>         cluster = self.cluster
>         cluster.set_environment_variable('CASSANDRA_TOKEN_PREGENERATION_DISABLED', 'True')
>         cluster.populate(3)
>         cluster.start()
>     
>         stress_table = 'keyspace1.standard1'
>     
>         # write some data
>         node1 = cluster.nodelist()[0]
>         node1.stress(['write', 'n=10K', 'no-warmup', '-rate', 'threads=8'])
>     
>         session = self.patient_cql_connection(node1)
>         original_rows = list(session.execute("SELECT * FROM {}".format(stress_table,)))
>     
>         # Add a new node, bootstrap=True ensures that it is not a seed
>         node4 = new_node(cluster, bootstrap=True)
>         node4.start(wait_for_binary_proto=True)
>     
>         session = self.patient_cql_connection(node4)
> >       assert original_rows == list(session.execute("SELECT * FROM {}".format(stress_table,)))
> E       assert [Row(key=b'PP...e9\xbb'), ...] == [Row(key=b'PP...e9\xbb'), 
> ...]
> E         At index 587 diff: Row(key=b'OP2656L630', 
> C0=b"E02\xd2\x8clBv\tr\n\xe3\x01\xdd\xf2\x8a\x91\x7f-\x9dm'\xa5\xe7PH\xef\xc1xlO\xab+d",
>  
> C1=b"\xb2\xc0j\xff\xcb'\xe3\xcc\x0b\x93?\x18@\xc4\xc7tV\xb7q\xeeF\x82\xa4\xd3\xdcFl\xd9\x87
>  \x9a\xde\xdc\xa3", 
> C2=b'\xed\xf8\x8d%\xa4\xa6LPs;\x98f\xdb\xca\x913\xba{M\x8d6XW\x01\xea-\xb5<J\x1eo\xa0F\xbe',
>  
> C3=b'\x9ec\xcf\xc7\xec\xa5\x85Z]\xa6\x19\xeb\xc4W\x1d%lyZj\xb9\x94I\x90\xebZ\xdba\xdd\xdc\x9e\x82\x95\x1c',
>  
> C4=b'\xab\x9e\x13\x8b\xc6\x15D\x9b\xccl\xdcX\xb23\xd0\x8b\xa3\xba7\xc1c\xf7F\x1d\xf8e\xbd\x89\xcb\xd8\xd1)f\xdd')
>  != Row(key=b'4LN78NONP0', 
> C0=b"\xdf\x90\xb3/u\xc9/C\xcdOYG3\x070@#\xc3k\xaa$M'\x19\xfb\xab\xc0\x10]\xa6\xac\x1d\x81\xad",
>  
> C1=b'\x8a\xb7j\x95\xf9\xbd?&\x11\xaaH\xcd\x87\xaa\xd2\x85\x08X\xea9\x94\xae8U\x92\xad\xb0\x1b9\xff\x87Z\xe81',
>  
> C2=b'6\x1d\xa1-\xf77\xc7\xde+`\xb7\x89\xaa\xcd\xb5_\xe5\xb3\x04\xc7\xb1\x95e\x81s\t1\x8b\x16sc\x0eMm',
>  
> C3=b'\xfbi\x08;\xc9\x94\x15}r\xfe\x1b\xae5\xf6v\x83\xae\xff\x82\x9b`J\xc2D\xa6k+\xf3\xd3\xff{C\xd0;',
>  
> C4=b'\x8f\x87\x18\x0f\xfa\xadK"\x9e\x96\x87:tiu\xa5\x99\xe1_Ax\xa3\x12\xb4Z\xc9v\xa5\xad\xb8{\xc0\xa3\x93')
> E         Left contains 2830 more items, first extra item: 
> Row(key=b'5N7N172K30', 
> C0=b'Y\x81\xa6\x02\x89\xa0hyp\x00O\xe9kFp$\x86u\xea\n\x7fK\x99\xe1\xf6G\xf77\xf7\xd7\xe1\xc7L\x...0\x87a\x03\xee',
>  
> C4=b'\xe8\xd8\x17\xf3\x14\x16Q\x9d\\jb\xde=\x81\xc1B\x9c;T\xb1\xa2O-\x87zF=\x04`\x04\xbd\xc9\x95\xad')
> E         Full diff:
> E           [
> …
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

