Another thing I just noticed is that all of the failed jobs had an Elapsed time right around 30 minutes, some just under 30 but all under 31. I did rule out the (external) firewall timeout, which is 240 minutes, and am looking through OS and application timeouts for anything around 30 minutes.
Josh On Tue, Jul 23, 2024 at 11:28 AM Joshua Myles <jamy...@mtu.edu> wrote: > One of our Virtual Full jobs has been failing every day during Always > Incremental consolidation, and I'm having trouble figuring out why. > > *list joblog jobid=18065 > Automatically selected Catalog: MyCatalog > Using Catalog "MyCatalog" > 2024-07-23 09:01:22 bareos-dir-prod JobId 18065: Start Virtual Backup > JobId 18065, Job=node0057-AI.2024-07-23_09.00.04_41 > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Bootstrap records > written to /var/lib/bareos/bareos-dir-prod.restore.45.bsr > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Consolidating JobIds > 17646,13356,13422,13488,13556,13625,13694 containing 2025684 files > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Connected Storage daemon > at bareos-sd-t1-prod.foo.bar.edu:9103, encryption: > TLS_CHACHA20_POLY1305_SHA256 TLSv1.3 > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Encryption: > TLS_CHACHA20_POLY1305_SHA256 TLSv1.3 > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Using Device > "FileStorage5" to read. > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Max configured use > duration=72,000 sec. exceeded. Marking Volume > "node0057-AI-Consolidated-12250" as Used. > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Created new Volume > "node0057-AI-Consolidated-12276" in catalog. > 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Using Device > "FileStorageConsolidated5" to write. > 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Labeled new Volume > "node0057-AI-Consolidated-12276" on device "FileStorageConsolidated5" > (/var/lib/bareos/storage). > 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Wrote label to > prelabeled Volume "node0057-AI-Consolidated-12276" on device > "FileStorageConsolidated5" (/var/lib/bareos/storage) > 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Ready to read from > volume "node0057-AI-Consolidated-6560" on device "FileStorage5" > (/var/lib/bareos/storage). > 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Forward spacing Volume > "node0057-AI-Consolidated-6560" to file:block 0:274. > 2024-07-23 09:02:15 bareos-dir-prod JobId 18065: Insert of attributes > batch table with 800001 entries start > 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Insert of attributes > batch table done > 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Fatal error: Director's > comm line to SD dropped. > 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Insert of attributes > batch table with 3303 entries start > 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Insert of attributes > batch table done > 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Replicating deleted > files from jobids 17646,13356,13422,13488,13556,13625,13694 to jobid 18065 > 2024-07-23 09:03:03 bareos-dir-prod JobId 18065: Error: Bareos > bareos-dir-prod 22.1.4 (28Feb24): > Build OS: Red Hat Enterprise Linux release 9.1 (Plow) > JobId: 18065 > Job: node0057-AI.2024-07-23_09.00.04_41 > Backup Level: Virtual Full > Client: "node0057.foo.bar.edu-fd" 22.1.5 (04Jun24) Red > Hat Enterprise Linux Server release 7.9 (Maipo),redhat > FileSet: "LinuxAll" 2023-10-22 13:39:16 > Pool: "node0057-AI-Consolidated" (From Job Pool's > NextPool resource) > Catalog: "MyCatalog" (From Client resource) > Storage: "bareos-sd-t1-prod-Consolidated" (From Storage > from Pool's NextPool resource) > Scheduled time: 23-Jul-2024 09:00:04 > Start time: 21-May-2024 02:00:01 > End time: 21-May-2024 02:30:31 > Elapsed time: 30 mins 30 secs > Priority: 10 > SD Files Written: 0 > SD Bytes Written: 110,846,184 (110.8 MB) > Rate: 60.6 KB/s > Volume name(s): node0057-AI-Consolidated-12276 > Volume Session Id: 19 > Volume Session Time: 1721645047 > Last Volume Bytes: 275 (275 B) > SD Errors: 0 > SD termination status: Error > Accurate: yes > Bareos binary info: Bareos subscription release > Job triggered by: User > Termination: *** Backup Error *** > > * > > Director and SD are separate hosts, and this issue seems to persist only > with jobs from this client, node0057. I enabled debug tracing on the SD but > haven't seen anything that makes sense to me. > > ... > bareos-sd-t1-prod (200): stored/mac.cc:195-18065 before write JobId=18065 > FI=804499 SessId=19 Strm=1998 len=65 > bareos-sd-t1-prod (200): stored/mac.cc:195-18065 before write JobId=18065 > FI=804499 SessId=19 Strm=MD5 len=16 > bareos-sd-t1-prod (100): stored/mac.cc:655-18065 ok=0 > bareos-sd-t1-prod (130): stored/label.cc:627-18065 session_label > record=fc052df8 > bareos-sd-t1-prod (150): stored/label.cc:652-18065 Write sesson_label > record JobId=18065 FI=EOS_LABEL SessId=19 Strm=18065 len=234 remainder=0 > bareos-sd-t1-prod (150): stored/label.cc:660-18065 Leave WriteSessionLabel > Block=390293886d File=0d > bareos-sd-t1-prod (100): stored/block.cc:567-18065 return WriteBlockToDev, > job is canceled > bareos-sd-t1-prod (100): stored/mac.cc:684-18065 Set ok=FALSE after > WriteBlockToDevice. > bareos-sd-t1-prod (200): stored/mac.cc:687-18065 Flush block to device pos > 0:390293886 > bareos-sd-t1-prod (100): stored/acquire.cc:538-18065 releasing device > "FileStorageConsolidated5" (/var/lib/bareos/storage) > bareos-sd-t1-prod (100): stored/acquire.cc:560-18065 There are 0 writers > in ReleaseDevice > bareos-sd-t1-prod (50): stored/askdir.cc:366-18065 >dird UpdCat > Job=node0057-AI.2024-07-23_09.00.04_41 FileAttributes bareos-sd-t1-prod > (50): stored/askdir.cc:369-18065 create_jobmedia error BnetRecv > bareos-sd-t1-prod (200): stored/mac.cc:229-18062 bareos-sd-t1-prod (200): > stored/acquire.cc:568-18065 ===== Wrote block new pos 2:4028935146 > bareos-sd-t1-prod (50): stored/askdir.cc:298-18065 Update cat > VolBytes=390293887 > bareos-sd-t1-prod (50): stored/askdir.cc:317-18065 >dird bareos-sd-t1-prod > (200): stored/acquire.cc:587-18065 dir_update_vol_info. Release > vol=node0057-AI-Consolidated-12276 dev="FileStorageConsolidated5" > (/var/lib/bareos/storage) > bareos-sd-t1-prod (150): stored/vol_mgr.cc:695-18065 === clear in_use > vol=node0057-AI-Consolidated-12276 > bareos-sd-t1-prod (150): stored/vol_mgr.cc:712-18065 === set not reserved > vol=node0057-AI-Consolidated-12276 num_writers=0 dev_reserved=0 > dev="FileStorageConsolidated5" (/var/lib/bareos/storage) > bareos-sd-t1-prod (150): stored/vol_mgr.cc:740-18065 === clear in_use > vol=node0057-AI-Consolidated-12276 > bareos-sd-t1-prod (150): stored/vol_mgr.cc:751-18065 === remove volume > node0057-AI-Consolidated-12276 dev="FileStorageConsolidated5" > (/var/lib/bareos/storage) > ... > > I can provide client/job/pool/storage configurations if they seem relevant > here, and am continuing to poke at this myself. Thanks for any thoughts on > troubleshooting. > > Josh > > -- > You received this message because you are subscribed to the Google Groups > "bareos-users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to bareos-users+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/bareos-users/4a802f7a-69e6-4cff-9756-f6f089c8aa3en%40googlegroups.com > <https://groups.google.com/d/msgid/bareos-users/4a802f7a-69e6-4cff-9756-f6f089c8aa3en%40googlegroups.com?utm_medium=email&utm_source=footer> > . > -- Joshua Myles he/him Senior Storage Administrator Michigan Tech IT We can help. mtu.edu/it 906-369-1870 -- You received this message because you are subscribed to the Google Groups "bareos-users" group. To unsubscribe from this group and stop receiving emails from it, send an email to bareos-users+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/bareos-users/CAE3eABeUCC_arNSNQ5FPsdR9yRHp2AZJOxx%2B69z7kNa-TiuxKg%40mail.gmail.com.