[ 
https://issues.apache.org/jira/browse/CASSANDRA-9100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511858#comment-14511858
 ] 

Jason Brown edited comment on CASSANDRA-9100 at 4/27/15 2:27 AM:
-----------------------------------------------------------------

TL;DR I think some dtests/ccm are the way to go for now.

Last summer, I built a simulator for our gossip so I could better understand 
the convergence properties and see where it starts to break down. It took me 
about 2.5 weeks just to pull apart the gossip components from the rest of the 
system so I could run them in isolation - meaning, have more than one Gossiper 
executing in a single JVM. The changes included a series of hacks that broke 
many other components, like MessagingService (but that was acceptable for the 
simulator), and I'm not sure the rest of Cassandra was totally legit with the 
hacks, either (except Gossiper, of course). I did have a workable simulator 
after the effort, but didn't have much time to invest in it beyond that (other 
than maybe prep work for my various gossip talks).

This being said, I think it's an incredibly non-trivial effort to tease gossip 
out for testing due to all the singletons, as [~brandon.williams] mentioned. I 
think some good wins, however, could be gained by adding in some dtests - but 
then, the question is "what to monitor for indications of success/failure?". 
I'm not sure there's a fantastic answer here. The (limited) possibilities 
include nodetool output, log file scraping, and ... ? I'd lean toward nodetool 
output, but we already scrape log files in dtests (I think), so that's not 
without precedent; it also depends on what is being tested.
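
As a rough sketch of the nodetool route (not actual dtest code), a test could 
parse `nodetool status` output and assert on which nodes a given node sees as 
up or down. The function names and the sample output below are hypothetical, 
and the sketch assumes the usual two-letter status/state prefix on each node 
row (UN, DN, UJ, and so on).

```python
import re

# Matches node rows in `nodetool status` output, e.g. "UN  127.0.0.1  ...".
# First letter: U(p) or D(own); second: N(ormal), L(eaving), J(oining), M(oving).
NODE_ROW = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_nodetool_status(output):
    """Return {address: (status, state)} for each node row in the output."""
    nodes = {}
    for line in output.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            status, state, address = match.groups()
            nodes[address] = (status, state)
    return nodes


def down_nodes(output):
    """Return the sorted addresses that the reporting node considers down."""
    parsed = parse_nodetool_status(output)
    return sorted(addr for addr, (status, _) in parsed.items() if status == "D")


# Illustrative sample only; host IDs and loads are made up.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID   Rack
UN  127.0.0.1  47.66 KB  1       33.3%  hostid-1  rack1
UN  127.0.0.2  47.67 KB  1       33.3%  hostid-2  rack1
DN  127.0.0.3  47.70 KB  1       33.3%  hostid-3  rack1
"""

print(down_nodes(SAMPLE))
```

A test would run nodetool against each node in turn and compare the views, 
since the interesting gossip failures are the ones where nodes disagree.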

Thinking on it more, if it's even possible, it might be neat to script some 
iptables manipulation into dtests to block IPs/ports from communicating, then 
observe that gossip behaves as expected. Think of it as "mini-Jepsen"; testing 
gossip in the face of network partitions seems like an apropos place for that 
kind of testing.
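
A minimal sketch of what the iptables scripting might look like, under several 
assumptions: nodes talk on Cassandra's default storage port (7000), a node's 
outbound connections originate from its own address, and the test process is 
allowed to run iptables (normally root). The helper names are hypothetical and 
nothing like this exists in dtests today, as far as I know.

```python
import subprocess
from contextlib import contextmanager

STORAGE_PORT = 7000  # Cassandra's default inter-node (storage) port


def drop_rule(action, src_ip, dst_ip, port=STORAGE_PORT):
    """Build an iptables command; action is "-A" (append) or "-D" (delete)."""
    return ["iptables", action, "INPUT",
            "-s", src_ip, "-d", dst_ip,
            "-p", "tcp", "--dport", str(port),
            "-j", "DROP"]


@contextmanager
def one_way_partition(src_ip, dst_ip, run=subprocess.check_call):
    """Drop src -> dst storage traffic for the duration of the with-block.

    `run` is injectable so the logic can be exercised without root.
    """
    run(drop_rule("-A", src_ip, dst_ip))
    try:
        yield
    finally:
        # Always remove the rule, even if the test body raised.
        run(drop_rule("-D", src_ip, dst_ip))


# Dry run: record the commands instead of executing them.
recorded = []
with one_way_partition("127.0.0.2", "127.0.0.1", run=recorded.append):
    pass  # a real test would wait here, then check nodetool output or logs
for command in recorded:
    print(" ".join(command))
```

Nesting two of these with the addresses swapped would give a bi-directional 
partition; a single one gives the uni-directional case.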




> Gossip is inadequately tested
> -----------------------------
>
>                 Key: CASSANDRA-9100
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9100
>             Project: Cassandra
>          Issue Type: Test
>          Components: Core
>            Reporter: Ariel Weisberg
>
> We found a few unit tests, but nothing that exercises Gossip under 
> challenging conditions. Maybe consider a long test that hooks up some 
> gossipers over a fake network and then do fault injection on that fake 
> network. Uni-directional and bi-directional partitions, delayed delivery, out 
> of order delivery if that is something that they can see in practice. 
> Connects/disconnects.
> Also play with bad clocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
