Hi all,

my biggest problem currently is that after some days the whole system stalls, with a couple of hundred jobs all in this state:
> 699 Full MoveToTapeInternal.2008-02-05_17.55.29 is waiting on max Storage jobs

At that point nothing at all is processed: the SD is completely idle, without any jobs, and all new jobs are queued up either in the above "max Storage" state or in this one:

> 698 Full DBLogsInternal1.2008-02-05_17.55.28 is waiting execution

Now one can try to begin cancelling jobs in this heap, but that does not make them disappear. They stay in the "stat" list of the DIR; only the state changes to "has been cancelled". Occasionally one cancels some specific job in the heap, and then a whole bunch of the already cancelled ones goes away! But in the end about 10 jobs always remain in the list as "has been cancelled"; they never go away, and any new backup jobs again stall with "waiting on max Storage". Restore jobs still work at that point. The only remedy is to restart the DIR.

How to reproduce: I start a migration from disk storage to tape storage. This brings up maybe 100 migration jobs (most of them working on empty filesets, but nevertheless bringing up migration jobs). While this migration is running, the scheduler continuously creates new regular backup jobs according to the schedules. Interestingly (and I do not know why), these new jobs are *not* just put at the end of the queue, to be processed after the whole migration; instead they are executed right after the current migration job. It seems as if the migration runs at a somewhat lower priority, which would make sense from a practical viewpoint, except that it does not work. As long as it goes that way, everything is fine. But the migration works through different storage pools, and consequently it requests various tape volumes as targets, one after another.
If I happen to be asleep or watching video, and do *not* change tapes immediately on request, then the newly created regular backup jobs pile up; they are *not* put in front and executed, and instead everything seems to wait for that tape change. (This is one of the problems I have with insufficient parallelism; I still have to work on these.) After the tape change is done, it does some more work, and then the described deadlock appears!

And I cannot find out where this "max Storage jobs" is configured. I initially set up all the various "concurrent jobs" settings with values that seem to make sense from a practical viewpoint, not too high and not too low: mostly between 2 and 5 (except for some special jobs that create locks and must never run in parallel).

I fear I must now go the long and hard way: set all concurrency back to 1, see if it works that way, and then bring it back in step by step while understanding the precise implications... (There are already quite a few facets of Bacula which I had to approach that way during the last days, and often it turned out that my initial understanding of the concepts was somewhat unsatisfactory...)

rgds,
PMc
--
..having tossed away Microsoft(tm) already in 1991!