On 4/29/24 19:17, Bill Arlofski via Bacula-users wrote:

Hello and thanks a lot for your time and attention.





My first guess (without seeing any logs or configurations) is that there is a `MaximumConcurrentJobs` setting set too low, causing the bottleneck.

I don't think so, otherwise it would never work (as opposed to sometimes working, sometimes not).





Can you show a `status director` output,

An excerpt:
Running Jobs:
Console connected using TLS at 01-May-24 13:04
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
   255  Back Full          0         0  BackupCatalog     is waiting for higher priority jobs to finish
   256  Back Full          0         0  aaaaaaaaaa        is waiting on Storage "my-sd-private"
   259  Back Incr          0         0  bbbbbbbbbbbbbbb   is waiting on max Storage jobs
   260  Back Full     48,860    250.9 G cccccccccc        is running
   261  Back Incr          0         0  dddddddddd        is waiting on Storage "my-sd-private"
   262  Back Incr          0         0  eeeeeee           is waiting on Storage "my-sd-private"
   263  Back Incr          0         0  fffff             is waiting for its start time (01-May 18:08)
   264  Back Incr          0         0  ggggggggggggggg   is waiting on Storage "my-sd-private"
   265  Back Incr          0         0  hhhhhhhhhhhhhhh   is waiting on Storage "my-sd-in"
   266  Back Full          0         0  iiiiiiiii         is waiting on Storage "my-sd-in"
   267  Back Incr          0         0  jjjjjjjjj         is waiting on Storage "my-sd-in"
   269  Back Incr          0         0  kkkkkkk           is waiting on Storage "my-sd-in"
   271  Back Full          0         0  lllllllllllllll   is waiting for its start time (01-May 17:37)
   273  Back Incr          0         0  mmmmm             is waiting on Storage "my-sd-private"

So cccccccccc is running (using storage my-sd-in).
That's obviously blocking hhhhhhhhhhhhhhh, iiiiiiiii, jjjjjjjjj and kkkkkkk, as they're waiting to use the same device. bbbbbbbbbbbbbbb is also waiting on "my-sd-in" (possibly due to "Maximum Concurrent Jobs", which was set to 5 and is now commented out, but maybe I didn't restart?).

However, I see no reason why aaaaaaaaaa, dddddddddd, eeeeeee, ggggggggggggggg and mmmmm should be stuck waiting on a different device, where no job is running.
Or am I missing something?
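To make the grouping above easier to see, here is a minimal sketch that sorts "Running Jobs" lines by what each job is waiting on. It assumes the wrapped status lines have been rejoined into one line per job, and it only uses a few sample lines from the listing; the names `STATUS` and `group_by_status` are made up for this illustration and are not part of Bacula.

```python
import re
from collections import defaultdict

# Sample lines taken from the "Running Jobs" listing (wrapped lines rejoined).
STATUS = """\
   256  Back Full          0         0  aaaaaaaaaa        is waiting on Storage "my-sd-private"
   259  Back Incr          0         0  bbbbbbbbbbbbbbb   is waiting on max Storage jobs
   260  Back Full     48,860    250.9 G cccccccccc        is running
   261  Back Incr          0         0  dddddddddd        is waiting on Storage "my-sd-private"
   265  Back Incr          0         0  hhhhhhhhhhhhhhh   is waiting on Storage "my-sd-in"
"""


def group_by_status(text):
    """Map each storage name (or plain status text) to the job names in that state."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if " is " not in line:
            continue  # header, separator or blank line
        left, status = line.split(" is ", 1)
        fields = left.split()
        if not fields or not fields[0].isdigit():
            continue  # first column must be a JobId
        # Use the storage name as the key when the job waits on a Storage,
        # otherwise fall back to the status text itself.
        match = re.search(r'waiting on Storage "([^"]+)"', status)
        key = match.group(1) if match else status.strip()
        groups[key].append(fields[-1])  # job name is the last column before " is "
    return dict(groups)


if __name__ == "__main__":
    for key, jobs in group_by_status(STATUS).items():
        print(f"{key}: {', '.join(jobs)}")
```

This only summarizes what the Director reports; it cannot tell why the SD is not handing out the "Private" device.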

I'm pretty sure jobs would start running in parallel again if I restarted the SD. I don't want to stop the running job now, though, since it's very long and I might lose a time window.





your configurations (sanitized)

You mean SD config?
Here it is:
Storage {                             # definition of myself
  Name=my-sd
  SDPort=9103
  WorkingDirectory = "/var/db/bacula"
  Pid Directory = "/var/run"
  Plugin Directory = "/usr/local/lib"
  Maximum Concurrent Jobs=20
  Encryption Command = "/usr/local/share/bacula/key-manager.py getkey"
}

Director {
  Name=my-dir
  Password = "............................................"
}
Director {
  Name=nagios
  Password=".........."
  Monitor = yes
}
Device {
  Name=In
  Media Type=File
  Archive Device=/backup/in
  LabelMedia = yes;                   # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
  Requires Mount = no;
#  Maximum Concurrent Jobs=5
}
Device {
  Name=DMZ
  Media Type=File
  Archive Device=/backup/dmz
  LabelMedia = yes;                   # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
  Requires Mount = no;
#  Maximum Concurrent Jobs=5
}
Device {
  Name=Private
  Media Type=File
  Archive Device=/backup/private
  LabelMedia = yes;                   # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
  Requires Mount = no;
#  Maximum Concurrent Jobs=5
}

#
# Send all messages to the Director,
# mount messages also are sent to the email address
#
Messages {
  Name = Standard
  director = my-dir = all
}




and some job logs of jobs waiting on something in the `status director` "Running Jobs" output?

Not sure what you are asking for.
I cancelled job aaaaaaaaaa in order to get its full log by mail and it's here.
01-May 10:36 my-dir JobId 256: Rescheduled Job aaaaaaaaaa.2024-05-01_09.30.00_45 at 01-May-2024 10:36 to re-run in 3600 seconds (01-May-2024 11:36).
01-May 10:38 my-dir JobId 256: Job aaaaaaaaaa.2024-05-01_09.30.00_45 waiting 3480 seconds for scheduled start time.
01-May 11:39 my-dir JobId 256: Start Backup JobId 256, Job=aaaaaaaaaa.2024-05-01_09.30.00_45
01-May 11:39 my-dir JobId 256: Connected to Storage "my-private" at bacula.private.xxxxxxxxxxxxxxx.org:9103 with TLS
01-May 17:36 my-dir JobId 256: Storage daemon "my-private" didn't accept Device "Private" because: 3924 Device "Private" not in SD Device resources or no matching Media Type or is disabled.
01-May 17:36 my-dir JobId 256: Fatal error: Failed to start job on the storage: my-private
01-May 17:36 my-dir JobId 256: Bacula my-dir 15.0.2 (21Mar24):
  Build OS:               amd64-portbld-freebsd14.0 freebsd 14.0-RELEASE-p5
  JobId:                  256
  Job:                    aaaaaaaaaa.2024-05-01_09.30.00_45
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "aaaaaaaaaa-fd" 15.0.2 (21Mar24) Windows 7 Professional Professional (build 7601), 64-bit,Cross-compile,Win64
  FileSet:                "windows_dati" 2024-04-24 13:17:07
  Pool:                   "aaaaaaaaaaFull" (From Job FullPool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "my-private" (From Job resource)
  Scheduled time:         01-May-2024 09:30:00
  Start time:             01-May-2024 11:39:34
  End time:               01-May-2024 17:36:36
  Elapsed time:           5 hours 57 mins 2 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  Comm Line Compression:  None
  Snapshot/VSS:           no
  Encryption:             no
  Accurate:               yes
  Volume name(s):
  Volume Session Id:      44
  Volume Session Time:    1714468554
  Last Volume Bytes:      0 (0 B)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Canceled
  SD termination status:
  Termination:            Backup Canceled


Notice:
01-May 17:36 my-dir JobId 256: Storage daemon "my-private" didn't accept Device "Private" because: 3924 Device "Private" not in SD Device resources or no matching Media Type or is disabled.
01-May 17:36 my-dir JobId 256: Fatal error: Failed to start job on the storage: my-private

What does this mean???
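The 3924 message names three possible causes: the device name is not among the SD Device resources, the Media Type does not match, or the device is disabled. A minimal sketch for checking the first two against the SD configuration follows. The Director's Storage resource is not shown in this thread, so the values `DIR_DEVICE` and `DIR_MEDIA_TYPE` are hypothetical placeholders, and the helper names `sd_devices` and `matches` are invented for this example. The exact string comparison is also an assumption, not a statement of how the SD compares names.

```python
import re

# Device resources abbreviated from the SD configuration quoted earlier.
SD_CONF = """
Device {
  Name=In
  Media Type=File
  Archive Device=/backup/in
}
Device {
  Name=Private
  Media Type=File
  Archive Device=/backup/private
}
"""


def sd_devices(conf):
    """Return {device name: media type} for every top-level Device { ... } block."""
    devices = {}
    for body in re.findall(r"^Device\s*\{(.*?)^\}", conf, flags=re.S | re.M):
        fields = {}
        for line in body.splitlines():
            # Drop trailing comments and the optional ';' terminator.
            # (Naive: a '#' inside a quoted value would be cut off too.)
            line = line.split("#", 1)[0].strip().rstrip(";")
            if "=" in line:
                key, value = line.split("=", 1)
                # Directive keywords are compared without case or spaces.
                fields[key.replace(" ", "").lower()] = value.strip().strip('"')
        devices[fields.get("name")] = fields.get("mediatype")
    return devices


def matches(devices, device, media_type):
    """True if the SD defines `device` with exactly this media type."""
    return devices.get(device) == media_type


# Hypothetical values: what the Director's Storage resource might request.
DIR_DEVICE, DIR_MEDIA_TYPE = "Private", "File"

if __name__ == "__main__":
    print(matches(sd_devices(SD_CONF), DIR_DEVICE, DIR_MEDIA_TYPE))
```

If the name and Media Type do match, as they would with the placeholder values here, the remaining condition in the message (device disabled) or something else in the SD's state would have to be investigated separately; this sketch does not cover that.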





One thing comes to mind: fffff and lllllllllllllll are being rescheduled (since they are probably powered off now) and they'll be using my-sd-private.
This should not hold the Device, should it?
In any case, cancelling them both did not let any other job start.





 bye & Thanks
        av.


