Figured it out. The problem was that I have BackupPC set to run 8 nightlies at once (which usually takes 12 or more hours), but it was ending up in a state where only one was running at a time.
This may be the longest, most detailed debugging writeup I've ever done in 15 years of being a computer professional; I hope y'all appreciate it. :) I had to do this to hold all the relevant state in my head. It turns out that the issue occurs when the 24-hour-ly nightlies job is already running, and you do sudo -u backuppc BackupPC_serverMesg BackupPC_nightly run which I've been doing a lot. Deciding to queue new nightly jobs goes like this: while ( $CmdJob eq "" && @CmdQueue > 0 && $RunNightlyWhenIdle != 1 || @CmdQueue > 0 && $RunNightlyWhenIdle == 2 && $bpc->isAdminJob($CmdQueue[0]->{host}) ) { We'll be coming back to this a lot. "isAdminJob" matches nightly jobs only AFAICT. CmdQueue State: Empty CmdJob: Empty RunNightlyWhenIdle: 0 While State: False, since @CmdQueue <= 0 Running Job State: Empty Event: Normal nightly run occurs. RunNightlyWhenIdle is set to 1, which triggers all the nightly jobs getting added to the queue, and RunNightlyWhenIdle getting set to 0 CmdQueue State: 8 nightly jobs CmdJob: Empty RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True While State: True, via Branch 2 Running Job State: Empty Event: Nightly jobs get kicked off, all 8 of them. CmdQueue State: Empty CmdJob: non-empty; "admin7" or similar RunNightlyWhenIdle: 2 isAdminJob Matches First Job: False While State: False, since @CmdQueue <= 0 Running Job State: 8 nightly jobs Event: A backup finishes, and queues up a BackupPC_link job. This happens several times, since the nightly jobs take 8+ hours, even split into 8 parts, on my machine (4+TiB of backups per backup machine). CmdQueue State: Several link jobs CmdJob: non-empty; "admin7" or similar RunNightlyWhenIdle: 2 isAdminJob Matches First Job: False; link jobs don't match While State: False Running Job State: 8 nightly jobs Event: User runs "sudo -u backuppc BackupPC_serverMesg BackupPC_nightly run". This causes RunNightlyWhenIdle to be set to 1, but before that hits the while, the jobs are actually queued, *USING unshift*, which puts them at the front of the queue. This is where things start to go horribly wrong. CmdQueue State: 8 nightly jobs, *THEN* Several link jobs CmdJob: non-empty; "admin7" or similar RunNightlyWhenIdle: 2 isAdminJob Matches First Job: *TRUE* While State: True, branch 2 Running Job State: 8 nightly jobs Event: *Pop* a job from the queue. This means that even though the *test* is for the job from the *front* of the queue, the job that actually gets handled is the job at the *end* of the queue. So, the last job on the queue, a link job, gets run. THIS SHOULD NEVER HAPPEN, as I understand it, because nightly jobs (the first set) are still running. The link job sets CmdJob, but that doesn't matter because we're going through the *second* branch of the while, which doesn't care about CmdJob. So, it happily launches another link job: CmdQueue State: 8 nightly jobs, then N-1 link jobs CmdJob: hostname non-empty from the last link job RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True While State: True, branch 2 Running Job State: 8 nightly jobs, 1 link job Event: Runs the next link job. And all the others. We end up with *all* link jobs running at once. THIS SHOULD NEVER HAPPEN; CmdQueue is supposed to be one at a time. But wait, it gets better! When each link job starts, it sets $CmdJob to its own host name; this means that at the end of the run through the queue, it's set to the last link job that ran, like this: CmdQueue State: 8 nightly jobs CmdJob: hostname from the last link job RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True While State: True, branch 2 Running Job State: 8 nightly jobs, N link jobs Event: So, from here, it tries to run the last (remember, pop) nightly job, but it can't, because any given nightly segment can only run one at a time, because they are named to prevent duplicates (leading to the "Botch on admin job for admin7 : already in use!!" log messages: that means the 8th nightly job is running, so you can't start it again). Having failed to run the nightly job, it unshifts it onto the front of the CmdQueue. It runs through all the queued nightly jobs in this way. Eventually, a link job finishes. CmdQueue State: 8 nightly jobs CmdJob: hostname from the last link job RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True While State: True, branch 2 Running Job State: 8 nightly jobs, N-1 link jobs Event: When each link job finshes, this test runs: if ( $CmdJob eq $host || $bpc->isAdminJob($host) ) { This will only match the last link job. If it doesn't match, the host is tested for whether it needs linking, and if so, it requeues the (already completed) link, *at the front of the queue*. This happens for all but the last link job. This means the thing at the front of the queue isn't a nightly job, so branch 2 of the while fails, which would restore some kind of order. The kicker is that the last job, the one that passes that if above, *clears CmdJob*, which means the first branch *passes*. CmdQueue State: N-1 link jobs, 8 nightly jobs CmdJob: *EMPTY* RunNightlyWhenIdle: 2 isAdminJob Matches First Job: *FALSE* While State: True, *BRANCH ONE* Running Job State: 8 nightly jobs Event: It pops *the last job*, a nightly, which fails, gets requeud *at the front*, without setting CmdJob, and now we're back to Branch 2 passing. This happens with all the nightly jobs, and then it's ready to re-run all the (already completed once) link jobs: CmdQueue State: 8 nightly jobs, N-1 link jobs CmdJob: Empty RunNightlyWhenIdle: 2 isAdminJob Matches First Job: *TRUE* While State: True, *BRANCH TWO* Running Job State: 8 nightly jobs Event: Runs a link job. CmdQueue State: 8 nightly jobs, N-2 link jobs CmdJob: Empty RunNightlyWhenIdle: 2 isAdminJob Matches First Job: *TRUE* While State: True, *BRANCH TWO* Running Job State: 8 nightly jobs, 1 link job Event: This keeps happening until all the link jobs have been completed (meaning that the last link job to finish has been run a total of N times): CmdQueue State: 8 nightly jobs CmdJob: Empty RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True Running Job State: 8 nightly jobs While State: True, branch 2 Here's a log snippet of all this craziness: 2010-11-24 07:29:45 Running BackupPC_link host00002 (pid=26391) 2010-11-24 07:29:45 Running BackupPC_link host00005 (pid=26392) 2010-11-24 07:29:45 Running BackupPC_link hostother1 (pid=26393) 2010-11-24 07:29:45 Running BackupPC_link host00006 (pid=26394) 2010-11-24 07:29:45 Running BackupPC_link host00011 (pid=26395) 2010-11-24 07:29:45 Running BackupPC_link host00000 (pid=26396) 2010-11-24 07:29:45 Running BackupPC_link host00003 (pid=26397) 2010-11-24 07:29:45 Running BackupPC_link host00004 (pid=26398) 2010-11-24 07:29:45 Botch on admin job for admin7 : already in use!! 2010-11-24 07:29:45 Finished host00004 (BackupPC_link host00004) 2010-11-24 07:29:45 Botch on admin job for admin : already in use!! 2010-11-24 07:29:47 Botch on admin job for admin1 : already in use!! 2010-11-24 07:29:47 Botch on admin job for admin2 : already in use!! 2010-11-24 07:29:47 Botch on admin job for admin3 : already in use!! 2010-11-24 07:29:47 Botch on admin job for admin4 : already in use!! 2010-11-24 07:29:47 Botch on admin job for admin5 : already in use!! 2010-11-24 07:29:57 Botch on admin job for admin6 : already in use!! 2010-11-24 07:29:57 Botch on admin job for admin7 : already in use!! 2010-11-24 07:29:57 Running BackupPC_link host00002 (pid=26411) 2010-11-24 07:29:57 Running BackupPC_link host00005 (pid=26412) 2010-11-24 07:29:57 Running BackupPC_link hostother1 (pid=26413) 2010-11-24 07:29:57 Running BackupPC_link host00006 (pid=26414) 2010-11-24 07:29:57 Running BackupPC_link host00011 (pid=26415) 2010-11-24 07:29:57 Running BackupPC_link host00003 (pid=26416) 2010-11-24 07:29:57 Running BackupPC_link host00000 (pid=26417) 2010-11-24 07:29:57 Botch on admin job for admin : already in use!! 2010-11-24 07:29:57 Finished host00000 (BackupPC_link host00000) 2010-11-24 07:29:57 Botch on admin job for admin1 : already in use!! 2010-11-24 07:29:57 Botch on admin job for admin2 : already in use!! 2010-11-24 07:29:59 Botch on admin job for admin3 : already in use!! 2010-11-24 07:29:59 Botch on admin job for admin4 : already in use!! 2010-11-24 07:29:59 Botch on admin job for admin5 : already in use!! 2010-11-24 07:29:59 Botch on admin job for admin6 : already in use!! 2010-11-24 07:29:59 Botch on admin job for admin7 : already in use!! 2010-11-24 07:30:18 Botch on admin job for admin : already in use!! 2010-11-24 07:30:18 Running BackupPC_link host00002 (pid=26453) 2010-11-24 07:30:18 Running BackupPC_link host00005 (pid=26454) 2010-11-24 07:30:18 Running BackupPC_link hostother1 (pid=26455) 2010-11-24 07:30:18 Running BackupPC_link host00006 (pid=26456) 2010-11-24 07:30:18 Running BackupPC_link host00011 (pid=26457) 2010-11-24 07:30:18 Botch on admin job for admin1 : already in use!! 2010-11-24 07:30:18 Running BackupPC_link host00003 (pid=26459) 2010-11-24 07:30:18 Botch on admin job for admin2 : already in use!! 2010-11-24 07:30:18 Botch on admin job for admin3 : already in use!! 2010-11-24 07:30:18 Finished host00003 (BackupPC_link host00003) None of this explains why we end up with only one nightly job running at a time, but it's background on how very badly messed up things can get if you do "BackupPC_nightly run". Here's a repeat of the while, to save you some scrolling: while ( $CmdJob eq "" && @CmdQueue > 0 && $RunNightlyWhenIdle != 1 || @CmdQueue > 0 && $RunNightlyWhenIdle == 2 && $bpc->isAdminJob($CmdQueue[0]->{host}) ) { Ok, so, the current state is this: CmdQueue State: 8 nightly jobs CmdJob: Empty RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True Running Job State: 8 nightly jobs While State: True, branch 2 So it keep running through the queue and spewing "Botch" messages to the logs. Then one of the nightlies *actually finishes*. This means one of the *other* nightlies can actually run, which sets CmdJob, which is our prelude to disaster. CmdQueue State: *SEVEN* nightly jobs CmdJob: *NON-EMPTY*; "adminN", for some N RunNightlyWhenIdle: 2 isAdminJob Matches First Job: True Running Job State: 8 nightly jobs While State: True, branch 2 And it goes back to cycling through the nightlies and "Botch"ing. But eventually an actual backup finishes, and a link job gets queued *to the front*. The while is now false: CmdQueue State: 1 link job, 7 nightly jobs CmdJob: "adminN" RunNightlyWhenIdle: 2 isAdminJob Matches First Job: *FALSE* Running Job State: 8 nightly jobs While State: *FALSE*, since isAdminJob is false and $CmdJob is not "" At this point, the CmdQueue will not get run until something changes, which is the correct behaviour when nightlies are running. Note that the link job that cause the while to go false does not, itself, actually get run. The nightlies continue to complete. Finishing a nightly DOES NOT chang CmdJob, so this state continues until every nightly job currently running has completed. Note that what's currently running is 7 of the original nightly jobs, and one from the second round. When there are 0 nightly jobs running, BackupPC considers the run of nightlies to be complete. At this point, it sets RunNightlyWhenIdle to 0 (i.e. "run normally") and sets CmdJob to "", so we have this: CmdQueue State: 7 nightly jobs CmdJob: "" RunNightlyWhenIdle: 0 isAdminJob Matches First Job: *TRUE* Running Job State: *EMPTY* While State: *TRUE*, *BRANCH ONE* (CmdJob empty, and suddenly RunNightlyWhenIdle isn't 2 anymore) Since the while is true, it launches the last job on the queue, a nightly job, which leads to this state: CmdQueue State: 6 nightly jobs CmdJob: "adminN" RunNightlyWhenIdle: 0 isAdminJob Matches First Job: True Running Job State: 1 nightly job While State: *FALSE*; RunNightlyWhenIdle isn't 2, so branch 2 can't be true, and CmdJob isn't "", so branch 1 can't be true No more CmdQueue checking, until the current nightly job ends. At that point, there are 0 nightly jobs running, which makes BackupPC thing everything is done, so it reports like everything is done, and sets CmdJob to "", which puts us back to one state above. Repeat until all nightly jobs are *actually* done, which leaves us in a proper state. The really bad part: *NO LINK JOBS RUN IN THIS STATE*. On my large hosts, each nightly job, run by itself, takes *ten or more hours*. This means that once we've entered this state, we've got *at least* 70 hours with no link jobs. A host will not run a new backup until the link from the last backup finishes. This means that it's very easy for hosts to become 2 or 3 days out of date. The daily nightlies run tries to fix it, but it doesn't actually work. When the daily nightly run kicks off, we end up with nightlies running, nightlies in the queue, and RunNightlyWhenIdle set to 2, which is the state we basically started in way above; it's only a matter of time before it gets stuck again. In my experience, in that case, it actually gets stuck again within a matter of seconds: 2010-11-25 01:00:00 Running BackupPC_nightly 224 255 (pid=25383) 2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 2 ; new BackupPCNightlyLock counter: 2 2010-11-25 01:00:00 Running BackupPC_nightly -m 0 31 (pid=25384) 2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 3 ; new BackupPCNightlyLock counter: 3 2010-11-25 01:00:00 Running BackupPC_nightly 32 63 (pid=25385) 2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 4 ; new BackupPCNightlyLock counter: 4 2010-11-25 01:00:00 Running BackupPC_nightly 64 95 (pid=25386) 2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 5 ; new BackupPCNightlyLock counter: 5 2010-11-25 01:00:00 Running BackupPC_nightly 96 127 (pid=25387) 2010-11-25 01:00:00 New BackupPCNightlyJobs counter: 6 ; new BackupPCNightlyLock counter: 6 2010-11-25 01:00:00 Running BackupPC_link mc00-s00106 (pid=25388) 2010-11-25 01:00:00 Running BackupPC_link mc00-s00009-prod-shared-redbubble (pid=25389) 2010-11-25 01:00:00 Running BackupPC_link mavenlink--tm25-e00053--tm25-s00106---db-backup (pid=25390) 2010-11-25 01:00:00 Running BackupPC_link provolvesolutions--tm25-e00080--tm25-s00157---data (pid=25391) 2010-11-25 01:00:00 Running BackupPC_link tm25-s00090__nfs (pid=25392) 2010-11-25 01:00:00 Running BackupPC_link provolvesolutions--tm25-e00079--tm25-s00155---db-backup (pid=25393) 2010-11-25 01:00:00 Running BackupPC_link mavenlink--tm25-e00054--tm25-s00107---data (pid=25394) 2010-11-25 01:00:00 Running BackupPC_link howcast--tm25-e00076--tm25-s00149---db-backup (pid=25395) 2010-11-25 01:00:00 Running BackupPC_link brighthouse--tm25-e00039--tm25-s00083---db-backup (pid=25396) 2010-11-25 01:00:00 Running BackupPC_link howcast--tm25-e00071--tm25-s00134---db-backup (pid=25397) 2010-11-25 01:00:00 Botch on admin job for admin : already in use!! 2010-11-25 01:00:00 Next wakeup is 2010-11-25 02:00:00 2010-11-25 01:00:00 Botch on admin job for admin1 : already in use!! 2010-11-25 01:00:01 Botch on admin job for admin2 : already in use!! 2010-11-25 01:00:02 Finished howcast--tm25-e00071--tm25-s00134---db-backup (BackupPC_link howcast--tm25-e00071--tm25-s00134---db-backup) 2010-11-25 01:00:02 Botch on admin job for admin3 : already in use!! 2010-11-25 01:00:07 Running BackupPC_nightly 128 159 (pid=25414) 2010-11-25 01:00:07 New BackupPCNightlyJobs counter: 7 ; new BackupPCNightlyLock counter: 7 So *some of* the nightlies ran, and set RunNightlyWhenIdle to 2, which caused all the links to kick off for reasons explained above, leaving the queue with just the nightly jobs. The non-last links finish, leaving a link job at the front of the queue. Then a link finishes, leaving CmdJob empty. Then another nightly job runs, setting CmdJob, so we end up with: CmdQueue State: some number of nightly jobs CmdJob: "adminN" RunNightlyWhenIdle: 2 isAdminJob Matches First Job: *FALSE* Running Job State: some number of nightly job While State: *FALSE* And round and round she goes. Total progress in 24 hours from BackupPC's point of view: *one link job*. It can keep doing this *forever*, as each 24-hour-ly run of nightlies restarts the problem, unless all the sequential nightlies can finish within 24 hours, which is *most definitely* not the case here. SOLUTION(S). TOTALLY UNTESTED. 1. Don't run a non-nightly job from the CmdQueue when there are nightly job running, *EVER*. (my copy is locally modified with debugging stuff to try to figure out these issues, so if this diff doesn't patch, that's why, sorry). --- /var/tmp/BackupPC 2010-11-25 01:52:30.000000000 +0000 +++ /usr/local/bin/BackupPC 2010-11-25 02:00:03.000000000 +0000 @@ -522,6 +522,15 @@ $req = pop(@CmdQueue); $host = $req->{host}; + + if( $BackupPCNightlyJobs > 0 && ! $bpc->isAdminJob($host) ) + { + print(LOG $bpc->timeStamp, "Tried to run ".$req->{cmd}." on $host when there are nightlies running. That's bad.\n"); + + unshift(@CmdQueue, $req); + return; + } + if ( defined($Jobs{$host}) ) { print(LOG $bpc->timeStamp, "Botch on admin job for $host: already in use!!\n"); 2. That means that if nightlies are running, and you do a "BackupPC_nightly run", it'll queue them, 8 at a time total, behind the previous ones, which means the whole run will take about twice as long as it would have otherwise. I don't know about you, but that's certainly not what *I* expect when I launch a nightly run; I expect all the old ones to die in favour of new ones: @@ -1362,7 +1371,19 @@ } elsif ( $cmd =~ /^backup all$/ ) { QueueAllPCs(); } elsif ( $cmd =~ /^BackupPC_nightly run$/ ) { + print(LOG $bpc->timeStamp, + "Running nightlies at user request.\n" ); + foreach my $host (keys %Jobs) { + if( $bpc->isAdminJob( $host ) ) { + my $pid = $Jobs{$host}{pid}; + kill($bpc->sigName2num("INT"), $pid); + print(LOG $bpc->timeStamp, + "Killing nightly job $host with PID $pid to make way for manual run.\n"); + } + } $RunNightlyWhenIdle = 1; + $BackupPCNightlyJobs = 0; + $BackupPCNightlyLock = 0; } elsif ( $cmd =~ /^backup (\S+)\s+(\S+)\s+(\S+)\s+(\S+)/ ) { my $hostIP = $1; $host = $2; -Robin -- http://singinst.org/ : Our last, best hope for a fantastic future. Lojban (http://www.lojban.org/): The language in which "this parrot is dead" is "ti poi spitaki cu morsi", but "this sentence is false" is "na nei". My personal page: http://www.digitalkingdom.org/rlp/ ------------------------------------------------------------------------------ Increase Visibility of Your 3D Game App & Earn a Chance To Win $500! Tap into the largest installed PC base & get more eyes on your game by optimizing for Intel(R) Graphics Technology. Get started today with the Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs. http://p.sf.net/sfu/intelisp-dev2dev _______________________________________________ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List: https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki: http://backuppc.wiki.sourceforge.net Project: http://backuppc.sourceforge.net/