[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426085#comment-13426085 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

+1 then.

nit: i'd also change sleep(delay) in the MigrationManager loop to sleep(1000), or even sleep(100)

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-4.txt, 4427-5.txt, 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
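The nit above amounts to polling in short steps rather than sleeping for the whole delay in one call. A minimal sketch of that idea follows. `SchemaWaiter`, `waitUntil` and the readiness check are hypothetical names chosen for illustration and are not taken from MigrationManager.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not Cassandra code: polls a condition in short steps
// instead of sleeping for the full delay in a single call.
final class SchemaWaiter {
    /**
     * Returns true as soon as ready reports true, or false once maxWaitMillis
     * has elapsed (or the thread is interrupted). pollMillis plays the role of
     * the sleep(100) suggested in the comment above.
     */
    static boolean waitUntil(BooleanSupplier ready, long maxWaitMillis, long pollMillis) {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (!ready.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // gave up: the condition never became true
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

With a short poll interval the caller returns shortly after the condition becomes true, while the overall bound still caps the wait.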
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426075#comment-13426075 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

bq. Don't you still want the full ring delay to make sure you know about everyone in the cluster (so if you are picking a "balanced" token it does the Right Thing)?

Well, if we got any non-empty schema, a full gossip round has occurred so we should be good to go at that point, since it will have also populated our knowledge of the ring.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-4.txt, 4427-5.txt, 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426074#comment-13426074 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

Don't you still want the full ring delay to make sure you know about everyone in the cluster (so if you are picking a "balanced" token it does the Right Thing)?

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-4.txt, 4427-5.txt, 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425991#comment-13425991 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

+1

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-4.txt, 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425920#comment-13425920 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

Here's the real problem:

{noformat}
 INFO 16:49:57,531 Starting up server gossip
 INFO 16:49:57,547 Enqueuing flush of Memtable-LocationInfo@1547338589(126/157 serialized/live bytes, 3 ops)
 INFO 16:49:57,548 Writing Memtable-LocationInfo@1547338589(126/157 serialized/live bytes, 3 ops)
 INFO 16:49:57,586 Completed flushing /var/lib/cassandra/data/system/LocationInfo/system-LocationInfo-he-1-Data.db (234 bytes) for commitlog position ReplayPosition(segmentId=10938112371080118, position=595)
 INFO 16:49:57,616 Starting Messaging Service on port 7000
 INFO 16:49:59,634 Saved token not found. Using 113427455640312821154458202477256070484 from configuration
 INFO 16:49:59,636 Enqueuing flush of Memtable-LocationInfo@1088940267(53/66 serialized/live bytes, 2 ops)
 INFO 16:49:59,636 Writing Memtable-LocationInfo@1088940267(53/66 serialized/live bytes, 2 ops)
 INFO 16:49:59,652 Completed flushing /var/lib/cassandra/data/system/LocationInfo/system-LocationInfo-he-2-Data.db (163 bytes) for commitlog position ReplayPosition(segmentId=10938112371080118, position=776)
 INFO 16:49:59,655 Node cassandra-3/10.179.111.137 state jump to normal
 INFO 16:49:59,656 Bootstrap/Replace/Move completed! Now serving reads.
 INFO 16:49:59,690 Binding thrift service to cassandra-3/10.179.111.137:9160
 INFO 16:49:59,694 Using TFastFramedTransport with a max frame size of 15728640 bytes.
 INFO 16:49:59,698 Using synchronous/threadpool thrift server on cassandra-3/10.179.111.137 : 9160
 INFO 16:49:59,699 Listening for thrift clients...
 INFO 16:49:59,873 Node /10.179.64.227 is now part of the cluster
 INFO 16:49:59,874 InetAddress /10.179.64.227 is now UP
 INFO 16:49:59,876 Enqueuing flush of Memtable-LocationInfo@1301257077(35/43 serialized/live bytes, 1 ops)
 INFO 16:49:59,877 Writing Memtable-LocationInfo@1301257077(35/43 serialized/live bytes, 1 ops)
 INFO 16:49:59,892 Completed flushing /var/lib/cassandra/data/system/LocationInfo/system-LocationInfo-he-3-Data.db (89 bytes) for commitlog position ReplayPosition(segmentId=10938112371080118, position=874)
 INFO 16:49:59,894 Node /10.179.65.102 is now part of the cluster
 INFO 16:49:59,894 InetAddress /10.179.65.102 is now UP
{noformat}

Gossip hasn't quite discovered any other nodes yet when the schema check fires.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425905#comment-13425905 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

59adb24e-f3cd-3e02-97f0-5b395827453f is emptyVersion, so from that snippet it looks like it's working as designed.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424595#comment-13424595 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

bq. Added a quick fix for this case. If the cluster is so new that there is no SCHEMA state, then there's no actual schema info either.

LGTM.

bq. Granted, but surely two rounds is a better measure than the zero we had before. (Which apparently worked most of the time...) Remember, our goal is to avoid the full RING_DELAY sleep when we don't need to bootstrap.

I know. It's a situation with no perfect solution unfortunately (but I agree 2 > 0 ;)

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424590#comment-13424590 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

bq. This doesn't quite work, because we're looking for the SCHEMA app state, which at startup won't always exist

Added a quick fix for this case. If the cluster is so new that there is no SCHEMA state, then there's no actual schema info either.

bq. It's possible that you could have 3 seeds and all but one could be down, thus 2 gossip rounds doesn't guarantee you'll have any appstates

Granted, but surely two rounds is a better measure than the zero we had before. (Which apparently worked most of the time...) Remember, our goal is to avoid the full RING_DELAY sleep when we don't need to bootstrap.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424580#comment-13424580 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

Note: that was with autobootstrap disabled.

But, I'm also not convinced that waiting two gossiper rounds is sufficient either (alert the ring_delay police!) It's possible that you could have 3 seeds and all but one could be down, thus 2 gossip rounds doesn't guarantee you'll have any appstates.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424563#comment-13424563 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

This doesn't quite work, because we're looking for the SCHEMA app state, which at startup won't exist since the gossiper isn't even started yet:

{noformat}
ERROR [main] 2012-07-29 01:08:28,476 CassandraDaemon.java (line 335) Exception encountered during startup
java.lang.NullPointerException
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:527)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:475)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:366)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:228)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:318)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:361)
{noformat}

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424311#comment-13424311 ]

Sylvain Lebresne commented on CASSANDRA-4427:
---------------------------------------------

lgtm, +1

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424177#comment-13424177 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

bq. I believe the schemaPresent condition shouldn't be negated

Right, fix pushed to same github branch.

bq. I would have put the initialization of Schema.emptyVersion in a static block to make it explicit that it's a one time initialization

I thought you couldn't declare emptyVersion final that way... I was wrong, the compiler is smart enough to recognize the static block. Also fixed.

bq. it could be nice to also log whether we're going to bootstrap or not and why in the other case.

Added a debug line.

bq. exclude ourselves when we check for schemaPresent

Done. (Since we can't have one ourselves unless another does too -- or unless we already joined the ring successfully -- there is no loss of correctness.)

bq. this feels a bit bigger than what I'm plainly comfortable pushing in 1.0 at this point

+1, let's leave it as a known issue in 1.0.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
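The static-block point in the comment above can be shown in isolation: Java accepts a `static final` field that is assigned exactly once inside a static initializer. The sketch below uses a hypothetical `SchemaVersions` class, and the value it computes is only a stand-in. It is not claimed to be how Schema.emptyVersion is derived in Cassandra.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical class for illustration only; not the real Schema class.
final class SchemaVersions {
    // Declared final without an initializer: the compiler requires exactly one
    // assignment, which the static block below provides.
    static final UUID EMPTY_VERSION;

    static {
        // Stand-in computation (an assumption): a name-based UUID over a fixed marker.
        EMPTY_VERSION = UUID.nameUUIDFromBytes("empty-schema".getBytes(StandardCharsets.UTF_8));
    }

    private SchemaVersions() {}
}
```

The static block runs once when the class is initialized, which makes the one-time nature of the setup explicit to the reader.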
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423909#comment-13423909 ]

Sylvain Lebresne commented on CASSANDRA-4427:
---------------------------------------------

I suspect the test failures are due to the removal of the seeds special case and because our tests are not fully realistic. Namely, in the tests, while localhost is a seed, it gets a schema loaded before joinTokenRing is called, and so it ends up with schemaPresent = true and tries to bootstrap (even though it's the only node). That shouldn't happen in real life, but at least in the short term fixing the tests themselves is more work than it is worth, so maybe we can:
* Either bring back the isSeed test
* Or exclude ourselves when we check for schemaPresent

Any preference?

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.0.11, 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423777#comment-13423777 ]

Sylvain Lebresne commented on CASSANDRA-4427:
---------------------------------------------

In the check for bootstrap:

{noformat}
if (DatabaseDescriptor.isAutoBootstrap()
    && (SystemTable.bootstrapInProgress() || (!SystemTable.bootstrapComplete() && !schemaPresent)))
{noformat}

I believe the schemaPresent condition shouldn't be negated. We want to skip bootstrap if there is no schema, but bootstrap if there is one.

Even with that fixed, this breaks some of the unit tests (BootStrapperTest, EmbeddedCassandraServiceTest, StreamingTransferTest and AntiEntropyServiceStandardTest). Namely:

{noformat}
[junit] java.lang.RuntimeException: No other nodes seen! Unable to bootstrap.If you intended to start a single-node cluster, you should make sure your broadcast_address (or listen_address) is listed as a seed. Otherwise, you need to determine why the seed being contacted has no knowledge of the rest of the cluster. Usually, this can be solved by giving all nodes the same seed list.
[junit]     at org.apache.cassandra.dht.BootStrapper.getBootstrapSource(BootStrapper.java:127)
[junit]     at org.apache.cassandra.dht.BootStrapper.getBalancedToken(BootStrapper.java:109)
[junit]     at org.apache.cassandra.dht.BootStrapper.getBootstrapToken(BootStrapper.java:104)
[junit]     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:629)
[junit]     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:526)
[junit]     at org.apache.cassandra.dht.BootStrapperTest.testTokenRoundtrip(BootStrapperTest.java:50)
{noformat}

On committing to 1.0, I'm not sure what the intention was, but this feels a bit bigger than what I'm plainly comfortable pushing in 1.0 at this point, and it feels like we can tell people on 1.0 to wipe the data dir on a failed bootstrap before retrying. That's not a strong opposition though, more an opinion.

Nits:
* Instead of calculateEmptySchema(), I would have put the initialization of Schema.emptyVersion in a static block to make it explicit that it's a one time initialization. Though if you made that on purpose because you don't like static blocks, that's good enough for me.
* We log when we detect a bootstrap failure, but it could be nice to also log whether we're going to bootstrap or not and why in the other case.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.0.11, 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
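To make the requested change to the bootstrap check concrete, here is a minimal sketch of the condition with schemaPresent no longer negated. The four booleans stand in for DatabaseDescriptor.isAutoBootstrap(), SystemTable.bootstrapInProgress(), SystemTable.bootstrapComplete() and the schemaPresent variable from the snippet above. The `BootstrapDecision` class itself is hypothetical and is not taken from any of the attached patches.

```java
// Hypothetical, dependency-free restatement of the bootstrap check under review.
final class BootstrapDecision {
    /**
     * Bootstrap when auto-bootstrap is on and either a previous attempt was left
     * unfinished, or this node has never completed a bootstrap and a schema
     * exists (that is, there is an established cluster to stream from).
     */
    static boolean shouldBootstrap(boolean autoBootstrap,
                                   boolean bootstrapInProgress,
                                   boolean bootstrapComplete,
                                   boolean schemaPresent) {
        return autoBootstrap
            && (bootstrapInProgress || (!bootstrapComplete && schemaPresent));
    }
}
```

Under this form a node restarted after a failed bootstrap bootstraps again instead of joining immediately, while a node in a brand-new cluster with no schema skips bootstrap.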
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420736#comment-13420736 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

bq. there is still one behavior that the patch changes, that is it will always bootstrap non seeds node

You're right.

Okay, take five: https://github.com/jbellis/cassandra/tree/4427-5

4 patches here on top of Brandon's work. The main ones are the 1st and 4th.

In the first, I remove the seed special case since it's a subset of the empty schema case. (Unless you're Doing It Wrong and adding seed nodes directly to an active cluster, which always surprises people when it burns them. So I say good riddance.) The first also adds a 2-gossip-round sleep so that (always assuming seeds are set correctly) we eliminate the risk of thinking schema is empty incorrectly due to a race w/ gossip.

The fourth patch follows this up by making the schema check based on other peers' schema uuids instead of local data. Which is unlikely to be a problem today, but it is still a race-y approach and the correct alternative was straightforward.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.0.11, 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
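A rough sketch of a schema check driven by peers' schema UUIDs, as described for the fourth patch, might look like the following. It assumes the gossiped state has already been flattened into a plain map from endpoint to schema version, with null meaning that no SCHEMA state has been seen for that endpoint. The `SchemaPresence` class and that map are hypothetical and do not reflect the Gossiper API. Skipping the local endpoint and treating a missing SCHEMA state as "no schema" follow points raised in other comments on this issue.

```java
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: decides whether any *other* node advertises a real schema.
final class SchemaPresence {
    static boolean schemaPresent(Map<String, UUID> schemaVersionByEndpoint,
                                 String localEndpoint,
                                 UUID emptyVersion) {
        for (Map.Entry<String, UUID> entry : schemaVersionByEndpoint.entrySet()) {
            if (entry.getKey().equals(localEndpoint)) {
                continue; // our own schema must not count
            }
            UUID version = entry.getValue();
            if (version == null) {
                continue; // no SCHEMA state gossiped yet: treat as no schema
            }
            if (!version.equals(emptyVersion)) {
                return true; // a peer holds a non-empty schema
            }
        }
        return false;
    }
}
```

The null branch is the important part for startup: looking up a state that has not been gossiped yet should lead to "no schema", not to a NullPointerException.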
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417016#comment-13417016 ]

Brandon Williams commented on CASSANDRA-4427:
---------------------------------------------

bq. The fact is that recording that bootstrap is in progress (along with the system table check) would allow to fix the instajoin while keeping the current behavior unchanged otherwise, and I do feel that recording the info is not a bad idea in itself, so that would have my preference.

I tend to agree that having an explicit, persisted flag feels a lot less fragile than the current logic, and being able to indicate a failure to the user seems like a good improvement.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.0.11, 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416945#comment-13416945 ]

Sylvain Lebresne commented on CASSANDRA-4427:
---------------------------------------------

bq. Adding "bootstrap in progress" concept does nothing for this one way or the other.

You're right, brain fart, sorry.

Anyway, there is still one behavior that the patch changes, that is it will always bootstrap non-seed nodes, while previously the system table check was making sure we never bootstrapped a node in a new cluster, independently of whether it was a seed or not. It is clearly not a bad idea when you start a new cluster to set all those nodes as seeds, but I just want to point out that the behavior is changed and I'm not sure everyone always sets all of their initial nodes as seeds today. I'll also note that bootstrapping some of the nodes in an initial cluster doesn't break anything, it just makes the nodes start much less quickly than they would otherwise.

I'm not sure how I feel about changing that behavior, especially in a minor release. The fact is that recording that bootstrap is in progress (along with the system table check) would allow us to fix the instajoin while keeping the current behavior unchanged otherwise, and I do feel that recording the info is not a bad idea in itself, so that would have my preference. But that is not an extremely strong preference either.

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.0.11, 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416921#comment-13416921 ]

Jonathan Ellis commented on CASSANDRA-4427:
-------------------------------------------

bq. I believe this simpler fix doesn't handle the case of bootstrapping multiple nodes into an existing cluster.

We've never tried to prevent this, except by saying "thou shalt space bootstraps apart two minutes," because the only way to stop it is to drop the "balanced" token picking altogether. Adding "bootstrap in progress" concept does nothing for this one way or the other.

bq. Namely, in that case, that will have a schema and so the node will have a system table by the time it checks for it and we'll end up picking the same token for multiple nodes.

This is exactly how it's supposed to work: if there's a schema, we use "existing cluster mode" and pick a token to divide the range of the heaviest node (and cross our fingers that the user is spacing things out enough between node additions). If there's no schema, we use "new cluster mode" and pick a random token.

Let the record show that back in CASSANDRA-3219 I said this was confusing behavior and we should add explicit initial_token modes instead of trying to make it magical. :)

> Restarting a failed bootstrap instajoins the ring
> -------------------------------------------------
>
> Key: CASSANDRA-4427
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4427
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.0
> Reporter: Brandon Williams
> Assignee: Jonathan Ellis
> Fix For: 1.0.11, 1.1.3
>
> Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt
>
>
> I think when we made auto_bootstrap = true the default, we broke the check
> for the bootstrap flag, creating a dangerous situation.
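The two token-picking modes described in this comment can be sketched roughly as below. Everything here is an illustrative assumption: the ring is modelled as the integers in [0, 2^127), the heaviest node's range is passed in as a (left, right] pair, and `TokenPicker` is a hypothetical class unrelated to BootStrapper's actual code.

```java
import java.math.BigInteger;
import java.util.Random;

// Hypothetical sketch of "existing cluster mode" versus "new cluster mode".
final class TokenPicker {
    // Assumed ring size for illustration.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127);

    /** Midpoint of the range (left, right], following the ring past zero if needed. */
    static BigInteger midpoint(BigInteger left, BigInteger right) {
        BigInteger width = right.subtract(left).mod(RING_SIZE);
        if (width.signum() == 0) {
            width = RING_SIZE; // a single node owns the whole ring
        }
        return left.add(width.shiftRight(1)).mod(RING_SIZE);
    }

    /**
     * Schema present: split the heaviest node's range in half.
     * No schema: pick a uniformly random token on the ring.
     */
    static BigInteger pick(boolean schemaPresent,
                           BigInteger heaviestLeft,
                           BigInteger heaviestRight,
                           Random random) {
        if (schemaPresent) {
            return midpoint(heaviestLeft, heaviestRight);
        }
        return new BigInteger(127, random);
    }
}
```

Because the balanced pick is deterministic for a given ring state, two nodes that bootstrap before either is visible to the other compute the same token, which is the reason for spacing bootstraps apart.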
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416041#comment-13416041 ] Sylvain Lebresne commented on CASSANDRA-4427: - I believe this simpler fix doesn't handle the case of bootstrapping multiple nodes into an existing cluster. Namely, in that case the cluster will have a schema, so the node will have a system table by the time it checks for it, and we'll end up picking the same token for multiple nodes. Also, I think checking for the existence of system tables is fairly fragile and I would prefer moving away from it. It is way too easy to screw that up by having something (anything) written to those system tables. Typically, I don't know if that fix works for multiple nodes started in a brand new cluster (with not all being seeds), because without careful checking I don't know if we can end up writing some info in the system tables before checking for getBootstrapToken. Overall I do like the idea of registering that the bootstrap is in progress, because on top of (I think) fixing the problem in a non-fragile way, it also allows us better reporting. Even outside of the problem of generating tokens, I think it is reassuring for a user who restarts a node that failed to bootstrap to have the software acknowledge that it understands and correctly handles the situation. > Restarting a failed bootstrap instajoins the ring > - > > Key: CASSANDRA-4427 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4427 > Project: Cassandra > Issue Type: Bug > Components: Core >Affects Versions: 1.0.0 >Reporter: Brandon Williams >Assignee: Jonathan Ellis > Fix For: 1.0.11, 1.1.3 > > Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt > > > I think when we made auto_bootstrap = true the default, we broke the check > for the bootstrap flag, creating a dangerous situation.
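The idea of registering that a bootstrap is in progress, instead of inferring it from whether system tables exist, might look roughly like the sketch below. Everything here is an assumption for illustration: the BootstrapTracker class, its State enum, the "bootstrap_state" key, and the map standing in for durable storage are hypothetical and are not taken from the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: record bootstrap progress explicitly so that a restart
// after a failed bootstrap is recognised instead of joining the ring directly.
final class BootstrapTracker {
    enum State { NEEDS_BOOTSTRAP, IN_PROGRESS, COMPLETED }

    private static final String KEY = "bootstrap_state";

    // Stand-in for durable storage that survives a process restart (assumption).
    private final Map<String, String> store;

    BootstrapTracker(Map<String, String> store) {
        this.store = store;
    }

    /** Returns the recorded state, or NEEDS_BOOTSTRAP if nothing was ever recorded. */
    State current() {
        String raw = store.get(KEY);
        return raw == null ? State.NEEDS_BOOTSTRAP : State.valueOf(raw);
    }

    void markInProgress() {
        store.put(KEY, State.IN_PROGRESS.name());
    }

    void markCompleted() {
        store.put(KEY, State.COMPLETED.name());
    }

    /** True unless an earlier bootstrap ran to completion. */
    boolean mustBootstrap() {
        return current() != State.COMPLETED;
    }

    /** Startup message that acknowledges a previously failed bootstrap. */
    String startupMessage() {
        switch (current()) {
            case IN_PROGRESS:
                return "Detected an incomplete bootstrap; bootstrapping again";
            case COMPLETED:
                return "Bootstrap already completed; joining the ring";
            default:
                return "No bootstrap recorded; starting bootstrap";
        }
    }

    public static void main(String[] args) {
        Map<String, String> disk = new HashMap<>();

        // First start: bootstrap begins and then the process dies.
        BootstrapTracker firstStart = new BootstrapTracker(disk);
        firstStart.markInProgress();

        // Restart: the recorded state survives, so the node does not instajoin.
        BootstrapTracker restart = new BootstrapTracker(disk);
        System.out.println(restart.startupMessage());
        System.out.println("mustBootstrap = " + restart.mustBootstrap());
    }
}
```

The point of the explicit state is that unrelated writes to the system tables cannot change the answer; only a completed bootstrap can.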
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415714#comment-13415714 ] Jonathan Ellis commented on CASSANDRA-4427: --- Started trying to improve the comments and got stuck on the schema check: it's basically a no-op (except for the purposes of screwing up a partial bootstrap like this), since we perform the check before waiting for gossip to fill in the schema. Simpler fix at https://github.com/jbellis/cassandra/tree/4427-4 to move the schema check into getBootstrapToken. > Restarting a failed bootstrap instajoins the ring > - > > Key: CASSANDRA-4427 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4427 > Project: Cassandra > Issue Type: Bug > Components: Core >Affects Versions: 1.0.0 >Reporter: Brandon Williams >Assignee: Brandon Williams > Fix For: 1.0.11, 1.1.3 > > Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt > > > I think when we made auto_bootstrap = true the default, we broke the check > for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415631#comment-13415631 ] Jonathan Ellis commented on CASSANDRA-4427: --- +1 nit: worth adding a comment to explain wtf all the clauses of that if statement are, so we don't have to dig through ticket history next time > Restarting a failed bootstrap instajoins the ring > - > > Key: CASSANDRA-4427 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4427 > Project: Cassandra > Issue Type: Bug > Components: Core >Affects Versions: 1.0.0 >Reporter: Brandon Williams >Assignee: Brandon Williams > Fix For: 1.0.11, 1.1.3 > > Attachments: 4427-v2.txt, 4427-v3.txt, 4427.txt > > > I think when we made auto_bootstrap = true the default, we broke the check > for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413931#comment-13413931 ] Jonathan Ellis commented on CASSANDRA-4427: --- I think you're misreading the original seed logic... !(isBootstrapped || isSeed) expands to !isBootstrapped && !isSeed. Still need that so that single-node clusters don't try to bootstrap. > Restarting a failed bootstrap instajoins the ring > - > > Key: CASSANDRA-4427 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4427 > Project: Cassandra > Issue Type: Bug > Components: Core >Affects Versions: 1.0.0 >Reporter: Brandon Williams >Assignee: Brandon Williams > Fix For: 1.0.11, 1.1.3 > > Attachments: 4427-v2.txt, 4427.txt > > > I think when we made auto_bootstrap = true the default, we broke the check > for the bootstrap flag, creating a dangerous situation.
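The expansion in the comment above is De Morgan's law, and it can be checked with a four-row truth table. This is a minimal sketch: the SeedLogicCheck class and its method names are hypothetical and only mirror the two expressions quoted in the comment, not the actual Cassandra source.

```java
// Hypothetical illustration of the two equivalent forms of the seed check.
final class SeedLogicCheck {
    /** The form quoted as the original logic. */
    static boolean shouldBootstrapOriginal(boolean isBootstrapped, boolean isSeed) {
        return !(isBootstrapped || isSeed);
    }

    /** The expanded form after applying De Morgan's law. */
    static boolean shouldBootstrapExpanded(boolean isBootstrapped, boolean isSeed) {
        return !isBootstrapped && !isSeed;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean isBootstrapped : values) {
            for (boolean isSeed : values) {
                boolean original = shouldBootstrapOriginal(isBootstrapped, isSeed);
                boolean expanded = shouldBootstrapExpanded(isBootstrapped, isSeed);
                // Both columns agree on every row of the truth table.
                System.out.printf("isBootstrapped=%b isSeed=%b original=%b expanded=%b%n",
                        isBootstrapped, isSeed, original, expanded);
            }
        }
    }
}
```

Only the row with both flags false yields true. Assuming a single-node cluster lists itself as a seed, isSeed is true there, so that node never tries to bootstrap.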
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413244#comment-13413244 ] Brandon Williams commented on CASSANDRA-4427: - bq. Indeed, in 1.0.0 we decided to draw this line based on whether a schema had been created or not This seems more dangerous than it was worth, since you can easily receive even partial schema within a couple of seconds, realize you made some sort of mistake (forgot to mount the data dir, etc) and restart it, possibly wrecking your production app. (The seed check still seems strange regardless) > Restarting a failed bootstrap instajoins the ring > - > > Key: CASSANDRA-4427 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4427 > Project: Cassandra > Issue Type: Bug > Components: Core >Affects Versions: 1.0.0 >Reporter: Brandon Williams >Assignee: Brandon Williams > Fix For: 1.0.11, 1.1.3 > > Attachments: 4427.txt > > > I think when we made auto_bootstrap = true the default, we broke the check > for the bootstrap flag, creating a dangerous situation.
[jira] [Commented] (CASSANDRA-4427) Restarting a failed bootstrap instajoins the ring
[ https://issues.apache.org/jira/browse/CASSANDRA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413145#comment-13413145 ] Jonathan Ellis commented on CASSANDRA-4427: --- Here's what we were trying to address there: bq. Now there is an actual new problem with 1.0.0. That problem is that when you start an initial cluster, i.e., when in 0.8 you would start nodes with auto-bootstrap=false, you often end up starting nodes simultaneously. That is why older versions used a random token when auto-bootstrap was false. This problem does need to be fixed for 1.0.0 because it is a serious regression. However, my argument is that even though we now default to auto-bootstrap=true, that doesn't mean that there is no difference between setting up the initial nodes of a cluster and the later bootstrapping of nodes to add capacity to an existing cluster. Indeed, in 1.0.0 we decided to draw this line based on whether a schema had been created or not (we call the bootstrap() method based on that). Imho, this means that we have no bootstrap option and "I have no schema" is the old auto-bootstrap=false. So we should use a random token in that case and a balanced one otherwise, the same way we are doing it in 0.8. > Restarting a failed bootstrap instajoins the ring > - > > Key: CASSANDRA-4427 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4427 > Project: Cassandra > Issue Type: Bug > Components: Core >Affects Versions: 1.0.0 >Reporter: Brandon Williams >Assignee: Brandon Williams > Fix For: 1.0.11, 1.1.3 > > Attachments: 4427.txt > > > I think when we made auto_bootstrap = true the default, we broke the check > for the bootstrap flag, creating a dangerous situation.
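The rule proposed in the quoted comment (random token when there is no schema, balanced token otherwise) can be sketched as below. This is only an illustration under loud assumptions: the TokenChooser class, the small RING_SIZE, and the "midpoint of the widest gap" definition of a balanced token are all hypothetical simplifications and are not how Cassandra computes tokens.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.TreeSet;

// Hypothetical sketch of "random token without a schema, balanced token with one".
final class TokenChooser {
    // Assumed toy ring size; real partitioners use a much larger token space.
    static final long RING_SIZE = 1L << 32;

    static long chooseToken(boolean hasSchema, TreeSet<Long> ringTokens, Random rng) {
        if (!hasSchema || ringTokens.isEmpty()) {
            // Initial cluster setup: nodes often start together, so pick at random.
            return randomToken(rng);
        }
        // Adding capacity to an existing cluster: split the widest range.
        return balancedToken(ringTokens);
    }

    static long randomToken(Random rng) {
        return Math.floorMod(rng.nextLong(), RING_SIZE);
    }

    /** Simplified stand-in: midpoint of the widest gap between neighbouring tokens. */
    static long balancedToken(TreeSet<Long> ringTokens) {
        long previous = ringTokens.last(); // the ring wraps around
        long bestStart = previous;
        long bestGap = -1;
        for (long token : ringTokens) {
            long gap = Math.floorMod(token - previous, RING_SIZE);
            if (gap == 0) {
                gap = RING_SIZE; // a single token owns the whole ring
            }
            if (gap > bestGap) {
                bestGap = gap;
                bestStart = previous;
            }
            previous = token;
        }
        return Math.floorMod(bestStart + bestGap / 2, RING_SIZE);
    }

    public static void main(String[] args) {
        TreeSet<Long> ring = new TreeSet<>(Arrays.asList(0L, 1000L));
        // Two nodes that see the same ring compute the same balanced token.
        System.out.println(chooseToken(true, ring, new Random(1)));
        System.out.println(chooseToken(true, ring, new Random(2)));
        // Without a schema, each node draws its own random token.
        System.out.println(chooseToken(false, ring, new Random(1)));
        System.out.println(chooseToken(false, ring, new Random(2)));
    }
}
```

The balanced choice is deterministic for a given ring view, which is why nodes started simultaneously can collide on the same token, while the random choice avoids that at the cost of balance.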