Hi all,

  My biggest problem currently is that after a few days the whole
system stalls, with a couple of hundred jobs all in this state:

> 699 Full MoveToTapeInternal.2008-02-05_17.55.29 is waiting on max Storage jobs

  At that point nothing at all is processed, the SD is perfectly idle
without any jobs, and all new jobs are queued up with either the
above "max Storage" state or with this one:

> 698 Full DBLogsInternal1.2008-02-05_17.55.28 is waiting execution

  Now one can try to start cancelling jobs in this heap, but this will
not make the jobs disappear - they stay in the "stat" list of the DIR;
only the state changes to "has been cancelled".
Occasionally one cancels some specific job in the heap - and
then a whole bunch of the already cancelled ones go away!
  But in the end about 10 jobs always remain in the list as
"has been cancelled", and they never go away, and any new backup
jobs again stall with "waiting on max Storage". Restore jobs do
still work at that point.

  The only remedy is to restart the DIR.


How to reproduce:

  I start a migration from disk storage to tape storage. This
brings up maybe 100 migration jobs (most of them working on empty
filesets, but nevertheless each one shows up as a migration job).
Now while this migration is running, the scheduler continuously
creates new regular backup jobs according to their schedules.

  Interestingly (and I do not know why) these new jobs are *not*
just put at the end of the queue, to be processed after the whole
migration; instead they are executed right after the current
migration job (it seems as if the migration is running at a somewhat
lower priority, which would make sense from a practical viewpoint -
except that it does not work).

  And as long as it goes that way, everything is fine.
But the migration works through different storage pools, and
consequently it will request various tape volumes as targets,
one after another.

  If I happen to be asleep or watching a video, and do *not* change
tapes immediately on request, then the newly created regular backup
jobs pile up; they are *not* put in front and executed, instead
everything seems to wait for that tape change. (This is one of the
problems I have with insufficient parallelism - I still have to work
on these.)
And after the tape change is done, it does some more work, and
then the described deadlock appears!


  And I cannot find out where this "max Storage jobs" is configured.
I have initially set up all the various "concurrent jobs" settings
with values that seem to make sense from a practical viewpoint - not
too high and not too low - mostly between 2 and 5 (except some special
jobs that create locks and must never run in parallel).
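
  To make clear which settings I mean: the directive is "Maximum
Concurrent Jobs", which as far as I can see can be given in several
resources. Below is a rough sketch of the places where I have set it.
The resource names and the values are made up for illustration (this
is not my real config), and whether the Storage resource in the DIR
config is the one behind the "max Storage jobs" message is only my
assumption.

```
# bacula-dir.conf (excerpt; names and values are illustrative only,
# all other required directives are omitted)
Director {
  Name = example-dir
  Maximum Concurrent Jobs = 5
}
Storage {
  Name = ExampleTapeStorage
  Maximum Concurrent Jobs = 2    # assumed to be the "max Storage jobs" limit
}
Client {
  Name = example-fd
  Maximum Concurrent Jobs = 2
}
Job {
  Name = "ExampleLockingJob"
  Maximum Concurrent Jobs = 1    # a job that must never run in parallel
}

# bacula-sd.conf (excerpt)
Storage {
  Name = example-sd
  Maximum Concurrent Jobs = 5
}
```

Since the limit exists at several levels, a job presumably has to fit
under every one of them before it starts, which is why I want to
reduce them all to 1 and raise them one at a time.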

I fear I must now go the long and hard way: set all concurrency
back to 1, see if it works that way, and then raise it again step
by step while understanding the precise implications...
(There are already quite a few facets of Bacula which I had to
handle that way during the last few days, and it often turned out
that my initial understanding of the concepts was somewhat
unsatisfactory...)

rgds,
PMc
-- 
..having tossedaway Microsoft(tm) already in 1991!

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users