Please open an issue on GitHub; that will be the best place to track it down.

On Thursday 8 August 2024 at 20:06:08 UTC+2 Paul Simmons wrote:

Addendum: below is the error log taken from the FD. It mentions a segmentation violation during the data stream from FD to SD, which appears to be the result of a malformed response from Ceph's perf_stats.py on the storage side. The while statement in the Bareos append.cc code does not seem to account for interrupts in the data stream caused by malformed responses, and so it ends in a segmentation violation. This makes the Job fail, and the entire Job has to be rescheduled and rerun. This may be a bug in Bareos; should I move this ticket over to the bug tracker on the Bareos GitHub page?

These are the log messages from the FD during the Full backup Job, taken from the STDOUT posted in my original comment:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512

Error message from the Ceph manager; pebbles01 is one of the storage servers within the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem:

Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:   File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:     if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

Could be relevant to this issue: https://tracker.ceph.com/issues/65073
This can happen when FSPerfStats.re_register_queries is called before mgr/stats can process a single mds report.

These lines from perf_stats.py show that a malformed response could potentially be sent to Bareos:

    def re_register_queries(self, rank0_gid, ua_last_updated):
        #reregister queries if the metrics are the latest. Otherwise reschedule the timer and
        #wait for the empty metrics
        with self.lock:
            if self.mx_last_updated >= ua_last_updated:
                self.log.debug("reregistering queries...")
                self.module.reregister_mds_perf_queries()
                self.prev_rank0_gid = rank0_gid
            else:
                #reschedule the timer
                self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                     self.re_register_queries,
                                     args=(rank0_gid, ua_last_updated,))
                self.rqtimer.start()
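For illustration only (this is not the upstream Ceph fix, just a minimal sketch over the same code): guarding the attribute access would let a call that races ahead of the first MDS report fall through to the reschedule branch instead of raising the AttributeError seen in the manager log.

    # Sketch only, not the upstream fix: guard the attribute access so a call
    # that arrives before the first MDS report is processed reschedules the
    # timer instead of raising AttributeError.
    def re_register_queries(self, rank0_gid, ua_last_updated):
        with self.lock:
            # getattr() returns None until the first metrics update has set
            # mx_last_updated on the FSPerfStats instance.
            mx_last_updated = getattr(self, 'mx_last_updated', None)
            if mx_last_updated is not None and mx_last_updated >= ua_last_updated:
                self.log.debug("reregistering queries...")
                self.module.reregister_mds_perf_queries()
                self.prev_rank0_gid = rank0_gid
            else:
                # reschedule the timer and wait for the first metrics update
                self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                     self.re_register_queries,
                                     args=(rank0_gid, ua_last_updated,))
                self.rqtimer.start()

The real fix belongs upstream in Ceph (see the tracker issue above); this is only meant to show where the race sits.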
- Paul Simmons

On Wednesday, August 7, 2024 at 3:43:46 PM UTC-7 Paul Simmons wrote:

Hello,

I manage and configure my organization's Bareos backup system, which backs up millions of files totaling ~350 TB of data from an NFS share mounted on the Bareos server and stores the data in Volumes on a CephFS filesystem, also mounted on the Bareos server. The Volumes are on disk-based storage, unlike the tape library we used previously.

Over the last several months, bareos-sd has been encountering recurring errors during Incremental and Full Jobs in which the Jobs fail with a fatal SD error and a non-fatal FD error. We upgraded Bareos from v21 to v23 a month ago, but that hasn't resolved the errors. Below are the errors from one of the joblogs:

2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.
2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"
2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"
2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.
2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.
2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start
2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available
2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available
2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).
2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second
2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.
2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start
2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done
2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26

Any assistance in troubleshooting this is greatly appreciated. I can provide any configurations and other info as necessary, minus any IP addresses or other confidential info.

- Paul Simmons
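To make the SD side of the failure concrete, here is a small sketch; it is illustrative only and not the actual Bareos C++ code from stored/append.cc. It shows why a storage daemon reading length-prefixed records can only abort the Job once the client process dies mid-stream: the next data-header read returns nothing, which matches the "Error reading data header from FD. ERR=No data available" line in the joblog above.

    # Illustrative sketch only, not Bareos code: an SD-style loop that reads
    # length-prefixed records from the FD connection.
    import socket
    import struct

    def recv_exact(conn: socket.socket, n: int) -> bytes:
        """Read exactly n bytes; returns b'' if the peer closes before n bytes arrive."""
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:          # peer (the FD) hung up mid-stream
                return b""
            buf += chunk
        return buf

    def append_records(conn: socket.socket) -> None:
        while True:
            header = recv_exact(conn, 4)          # data header: record length
            if not header:
                # The FD crashed or closed the connection; there is no record
                # boundary left to resynchronise on, so the Job must be aborted.
                raise ConnectionError("Error reading data header from FD")
            (length,) = struct.unpack("!I", header)
            payload = recv_exact(conn, length)    # record body
            if len(payload) < length:
                raise ConnectionError("FD hung up mid-record")
            # ... write payload to the Volume ...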