Please open an issue on GitHub; that will be the best place to track it down.

On Thursday 8 August 2024 at 20:06:08 UTC+2 Paul Simmons wrote:

Addendum: below is the error log taken from the FD. It mentions a segmentation violation during the data stream from FD to SD, which appears to be the result of a malformed response from Ceph's perf_stats.py on the storage side. The while statement in the Bareos append.cc code does not seem to account for interrupts in the data stream caused by malformed responses, and so it ends in a segmentation violation. This makes the Job fail, and the entire Job has to be rescheduled and rerun. This may be a bug in Bareos; should I move this ticket over to the bug tracker on the Bareos GitHub page?

These are the log messages from the FD during the Full backup Job, taken from the STDOUT posted in my original comment:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512

Error message from the Ceph manager; pebbles01 is one of the storage servers within the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem:

Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

Could be relevant to this issue: https://tracker.ceph.com/issues/65073
This can happen when FSPerfStats.re_register_queries is called before mgr/stats can process a single mds report.

These lines from perf_stats.py show that a malformed response could potentially be sent to Bareos:

    def re_register_queries(self, rank0_gid, ua_last_updated):
        #reregister queries if the metrics are the latest. Otherwise reschedule the timer and
        #wait for the empty metrics
        with self.lock:
            if self.mx_last_updated >= ua_last_updated:
                self.log.debug("reregistering queries...")
                self.module.reregister_mds_perf_queries()
                self.prev_rank0_gid = rank0_gid
            else:
                #reschedule the timer
                self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                     self.re_register_queries,
                                     args=(rank0_gid, ua_last_updated,))
                self.rqtimer.start()
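For illustration only (this is not the upstream Ceph fix, just a minimal sketch over the same code): guarding the attribute access would let a call that races ahead of the first MDS report fall through to the reschedule branch instead of raising the AttributeError seen in the manager log.

    # Sketch only, not the upstream fix: guard the attribute access so a call
    # that arrives before the first MDS report is processed reschedules the
    # timer instead of raising AttributeError.
    def re_register_queries(self, rank0_gid, ua_last_updated):
        with self.lock:
            # getattr() returns None until the first metrics update has set
            # mx_last_updated on the FSPerfStats instance.
            mx_last_updated = getattr(self, 'mx_last_updated', None)
            if mx_last_updated is not None and mx_last_updated >= ua_last_updated:
                self.log.debug("reregistering queries...")
                self.module.reregister_mds_perf_queries()
                self.prev_rank0_gid = rank0_gid
            else:
                # reschedule the timer and wait for the first metrics update
                self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                     self.re_register_queries,
                                     args=(rank0_gid, ua_last_updated,))
                self.rqtimer.start()

The real fix belongs upstream in Ceph (see the tracker issue above); this is only meant to show where the race sits.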
- Paul Simmons

On Wednesday, August 7, 2024 at 3:43:46 PM UTC-7 Paul Simmons wrote:

Hello,

I manage and configure my organization's Bareos backup system, which backs up millions of files totaling ~350 TB of data from an NFS share mounted on the Bareos server and stores the data in Volumes on a CephFS filesystem, also mounted on the Bareos server. The Volumes are on disk-based storage, unlike the tape library we used previously.

Over the last several months, bareos-sd has been encountering recurring errors during Incremental and Full Jobs in which the Jobs fail with a fatal SD error and a non-fatal FD error. We upgraded Bareos from v21 to v23 a month ago, but that hasn't resolved the errors. Below are the errors from one of the joblogs:

2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.
2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"
2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"
2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.
2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.
2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start
2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available
2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available
2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).
2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second
2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.
2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start
2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26

Any assistance in troubleshooting this is greatly appreciated. I can provide any configurations and other info as necessary, minus any IP addresses or other confidential info.

- Paul Simmons
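To make the SD side of the failure concrete, here is a small sketch; it is illustrative only and not the actual Bareos C++ code from stored/append.cc. It shows why a storage daemon reading length-prefixed records can only abort the Job once the client process dies mid-stream: the next data-header read returns nothing, which matches the "Error reading data header from FD. ERR=No data available" line in the joblog above.

    # Illustrative sketch only, not Bareos code: an SD-style loop that reads
    # length-prefixed records from the FD connection.
    import socket
    import struct

    def recv_exact(conn: socket.socket, n: int) -> bytes:
        """Read exactly n bytes; returns b'' if the peer closes before n bytes arrive."""
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:          # peer (the FD) hung up mid-stream
                return b""
            buf += chunk
        return buf

    def append_records(conn: socket.socket) -> None:
        while True:
            header = recv_exact(conn, 4)          # data header: record length
            if not header:
                # The FD crashed or closed the connection; there is no record
                # boundary left to resynchronise on, so the Job must be aborted.
                raise ConnectionError("Error reading data header from FD")
            (length,) = struct.unpack("!I", header)
            payload = recv_exact(conn, length)    # record body
            if len(payload) < length:
                raise ConnectionError("FD hung up mid-record")
            # ... write payload to the Volume ...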