Hi Philippe,

May I suggest revisiting the reasoning behind this sentence:

> I'm using the bareos-fd-postgresql plugin to backup the director's database
You will not be able to use that backup without a working Bareos instance and the whole plugin environment. I would really advise also keeping a native dump of the catalog (rough sketch of such a job below), so you can extract it from a volume and/or restore it with almost nothing, which makes disaster recovery as efficient as possible. Having a read-only slave is basically a must.

Beside that, regarding the plugin: the failure you are seeing means that, for whatever reason, the new WAL file is not yet present in that directory (not yet flushed, or not yet placed there). Your min_wal_size is 80MB. I won't exclude that there may be an issue with the plugin code in certain cases :-) If it fails often enough, you may want to run that job with setdebug level 150 (example below).

Does this still happen if you comment out the following postgresql.conf line?

> archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_archive %r'

As far as I can tell, this setting is documented for use on a standby (slave) cluster.
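If you want to test that, the change is simply commenting the line out and reloading PostgreSQL (it is a reloadable setting, no restart needed):

--%snip%--
#archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_archive %r'
--%snip%--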
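To see whether archiving itself is keeping up while the job runs, pg_stat_archiver on the server is a quick check, for example:

--%snip%--
SELECT last_archived_wal, last_archived_time,
       failed_count, last_failed_wal, last_failed_time
  FROM pg_stat_archiver;
--%snip%--

If failed_count keeps growing, the archive_command itself is the problem rather than the plugin timeout.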
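Coming back to the native dump of the catalog: a minimal sketch of a dedicated job could look like the following. The job, fileset and file names, the dump path and the catalog database name "bareos" are only illustrative, and it assumes the user the script runs as can reach the database over the local socket; adjust to your setup.

--%snip%--
Job {
  Name = backup-mydirector-catalog-dump
  Client = mydirector
  JobDefs = DefaultJob
  FileSet = catalog-dump
  # dump the catalog to a plain SQL file right before the backup, drop it afterwards
  Run Before Job = "pg_dump --file=/var/lib/bareos/bareos-catalog.sql bareos"
  Run After Job = "rm -f /var/lib/bareos/bareos-catalog.sql"
}

FileSet {
  Name = catalog-dump
  Include {
    Options {
      Signature = XXH128
      Compression = LZ4HC
    }
    File = /var/lib/bareos/bareos-catalog.sql
  }
}
--%snip%--

If I remember correctly, the Bareos packages also ship a BackupCatalog job example with matching catalog backup scripts that does essentially this, which you could take as a starting point instead.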
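And for the debug run, something along these lines in bconsole (client name taken from your config):

--%snip%--
setdebug level=150 client=mydirector trace=1
run job=backup-mydirector-postgres yes
setdebug level=0 client=mydirector trace=0
--%snip%--

With trace=1 the debug output should end up in a *.trace file in the file daemon's working directory on the client.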
On Thursday 18 July 2024 at 15:09:52 UTC+2 Philippe wrote:
> Hi all,
>
> I'm using the bareos-fd-postgresql plugin to backup the director's database.
>
> The config is:
>
> --%snip%--
> Job {
>   Name = backup-mydirector-postgres
>   Client = mydirector
>   JobDefs = postgres
>   Storage = File-mystorage
>   Maximum Concurrent Jobs = 1
> }
>
> JobDefs {
>   Name = postgres
>   JobDefs = DefaultJob
>   FileSet = postgres
> }
>
> FileSet {
>   Name = postgres
>   Description = "Fileset for postgres"
>   Include {
>     Options {
>       Signature = XXH128
>       Compression = LZ4HC
>     }
>     Plugin = "python3"
>              ":module_name=bareos-fd-postgresql"
>              ":db_host=/run/postgresql"
>              ":wal_archive_dir=/var/lib/pgsql/wal_archive"
>              ":switch_wal_timeout=180"
>   }
> }
> --%snip%--
>
> The dbms is configured as follows:
>
> --%snip%--
> max_wal_size = 1GB
> min_wal_size = 80MB
> archive_mode = on
> archive_command = 'install -D %p /var/lib/pgsql/wal_archive/%f'
> restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'
> archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_archive %r'
> --%snip%--
>
> There is no replication slave.
>
> From time to time I get the following error:
>
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Got last_backup_stop_time 1721215228 from restore object of job 44528
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Got last_lsn 17/85000000 from restore object of job 44528
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Got pg major version 13 from restore object of job 44528
> 18-Jul 11:20 mydirector JobId 44591: Using Device "File-mystorage" to write.
> 18-Jul 11:20 mydirector JobId 44591: Extended attribute support is enabled
> 18-Jul 11:20 mydirector JobId 44591: ACL support is enabled
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: python: 3.9.18 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3.0.1)] | pg8000: 1.31.2
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Connected to PostgreSQL version 130014
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Current LSN 17/87538B18, last LSN: 17/85000000
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: A difference was found, between current_lsn 17/87538B18 and last LSN: 17/85000000
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Current LSN 17/880001A8, last LSN: 17/85000000
> 18-Jul 11:23 mydirector JobId 44591: Fatal error: python3-fd-mod: Timeout waiting 180 sec. for wal file 000000010000001700000088 to be archived
> 18-Jul 11:23 mydirector JobId 44591: Fatal error: filed/fd_plugins.cc:673 PluginSave: Command plugin "python3:module_name=bareos-fd-postgresql:db_host=/run/postgresql:wal_archive_dir=/var/lib/pgsql/wal_archive:switch_wal_timeout=180" requested, but job is already cancelled.
> 18-Jul 11:23 mydirector JobId 44591: python3-fd-mod: Database connection closed.
> 18-Jul 11:20 mystorage JobId 44591: Connected File Daemon at 192.168.1.5:9102, encryption: TLS_AES_256_GCM_SHA384 TLSv1.3
> 18-Jul 11:23 mydirector JobId 44591: Fatal error: Director's comm line to SD dropped
>
> As you can see, I already increased the default value of 60s for switch_wal_timeout to 180s, but this error still shows up.
>
> The database is stored on an nvme, with no performance bottlenecks (ram, cpu).
>
> Does anyone have an idea of how to get this fixed?
>
> Thanks & kind regards,
>
> Philippe