On Thursday 23 February 2012 11:11:54 Silver Salonen wrote: > On Wednesday 22 February 2012 15:20:10 Silver Salonen wrote: > > On Wed, 22 Feb 2012 12:33:49 +0100, Hugo Letemplier wrote: > > > I think you can try to configure the Heartbeat Interval directive on > > > your various daemons. > > > > Hi. > > > > My SD already had Heartbeat Interval set to 60. I now tried it on one > > FD too, but the job still failed with the same error. > > > > Other FD's on both FreeBSD and Linux are able to run jobs for hours and > > complete them successfully. > > > > -- > > Silver > > > > > 2012/2/22 Silver Salonen <sil...@serverock.ee>: > > >> Hi. > > >> > > >> > > >> > > >> Recently we changed the network connection for our backup server > > >> which is > > >> Bacula 5.2.3 on FreeBSD 9.0. > > >> > > >> > > >> > > >> After that many jobs running across WAN started failing with various > > >> "broken > > >> pipe" errors. Some examples: > > >> > > >> > > >> > > >> 21-Feb 22:42 fbsd1-fd JobId 57779: Error: bsock.c:398 Wrote 32151 > > >> bytes to > > >> Storage daemon:backupsrv.url:9103, but only 16384 accepted. > > >> > > >> 21-Feb 22:42 fbsd1-fd JobId 57779: Fatal error: backup.c:1024 > > >> Network send > > >> error to SD. ERR=Broken pipe > > >> > > >> 21-Feb 22:42 fbsd1-fd JobId 57779: Error: bsock.c:339 Socket has > > >> errors=1 on > > >> call to Storage daemon:backupsrv.url:9103 > > >> > > >> > > >> > > >> (this one runs from the same backup-network to another SD) > > >> > > >> 22-Feb 00:14 backupsrv-dir JobId 57852: Fatal error: Network error > > >> with FD > > >> during Backup: ERR=Broken pipe > > >> > > >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: JobId=57852 > > >> Job="linux1-userdata.2012-02-21_23.05.02_02" marked to be canceled. > > >> > > >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: Job write elapsed time = > > >> 00:33:20, > > >> Transfer rate = 591.5 K Bytes/second > > >> > > >> 22-Feb 00:14 backupsrv-sd2 JobId 57852: Error: bsock.c:529 Read > > >> expected > > >> 65568 got 1448 from client:123.45.67.81:36643 > > >> > > >> 22-Feb 00:14 backupsrv-dir JobId 57852: Fatal error: No Job status > > >> returned > > >> from FD. > > >> > > >> > > >> > > >> 22-Feb 00:16 backupsrv-dir JobId 57821: Fatal error: Network error > > >> with FD > > >> during Backup: ERR=Broken pipe > > >> > > >> 22-Feb 00:16 backupsrv-sd JobId 57821: Job write elapsed time = > > >> 00:57:00, > > >> Transfer rate = 26.69 K Bytes/second > > >> > > >> 22-Feb 00:16 backupsrv-dir JobId 57821: Fatal error: No Job status > > >> returned > > >> from FD. > > >> > > >> > > >> > > >> 22-Feb 00:24 winsrv1-fd JobId 57784: Error: > > >> /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending > > >> 9363 > > >> bytes to Storage daemon:backupsrv.url:9103: ERR=Input/output error > > >> > > >> 22-Feb 00:24 winsrv1-fd JobId 57784: Fatal error: > > >> /home/kern/bacula/k/bacula/src/filed/backup.c:1024 Network send > > >> error to SD. > > >> ERR=Input/output error > > >> > > >> 22-Feb 00:26 winsrv1-fd JobId 57784: Error: > > >> /home/kern/bacula/k/bacula/src/lib/bsock.c:339 Socket has errors=1 > > >> on call > > >> to Storage daemon:backupsrv.url:9103 > > >> > > >> > > >> > > >> (this one runs from the same backup-network to another SD) > > >> > > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Socket error on > > >> ClientRunBeforeJob command: ERR=Broken pipe > > >> > > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Client > > >> "winsrv2-fd" > > >> RunScript failed. > > >> > > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: Network error > > >> with FD > > >> during Backup: ERR=Broken pipe > > >> > > >> 22-Feb 01:33 backupsrv-sd2 JobId 57872: JobId=57872 > > >> Job="winsrv2.2012-02-22_01.00.00_27" marked to be canceled. > > >> > > >> 22-Feb 01:33 backupsrv-dir JobId 57872: Fatal error: No Job status > > >> returned > > >> from FD. > > >> > > >> > > >> > > >> 22-Feb 01:51 fbsd2-fd JobId 57806: Error: bsock.c:398 Wrote 61750 > > >> bytes to > > >> Storage daemon:backupsrv.url:9103, but only 16384 accepted. > > >> > > >> 22-Feb 01:51 fbsd2-fd JobId 57806: Fatal error: backup.c:1024 > > >> Network send > > >> error to SD. ERR=Broken pipe > > >> > > >> 22-Feb 01:51 fbsd2-fd JobId 57806: Error: bsock.c:339 Socket has > > >> errors=1 on > > >> call to Storage daemon:backupsrv.url:9103 > > >> > > >> > > >> > > >> 22-Feb 02:15 backupsrv-dir JobId 57819: Fatal error: Network error > > >> with FD > > >> during Backup: ERR=Connection reset by peer > > >> > > >> 22-Feb 02:15 backupsrv-dir JobId 57819: Fatal error: No Job status > > >> returned > > >> from FD. > > >> > > >> > > >> > > >> > > >> > > >> These jobs have been failing every day for a week now. Meanwhile > > >> other jobs > > >> complete just fine, and it seems not to about jobs' size or scripts > > >> to be > > >> run before jobs on clients etc. > > >> > > >> > > >> > > >> Any idea what could be wrong? > > What's also interesting about these failures are these lines (similar in all > these failing jobs): > FD Files Written: 381 > SD Files Written: 0 > FD Bytes Written: 391,430,239 (391.4 MB) > SD Bytes Written: 0 (0 B) > Last Volume Bytes: 260 (260 B) > > And the actual volume file seems to contain all the data (its size is 373MB). > > What can we conclude from that? > Does the failure/timeout/whatever occur after the FD--SD connection, eg. when > SD tries to communicate with DIR about the end of the job or smth?
I look at these lines: 28-Feb 00:04 backupsrv-dir JobId 58414: Sending Accurate information. 28-Feb 00:07 backupsrv-sd2 JobId 58414: Recycled volume "fbsd1-userdata-incr-6106" on device "device-fbsd1" (/mnt/backup/bacula/fbsd1), all previous data lost. 28-Feb 00:07 backupsrv-dir JobId 58414: Max Volume jobs=1 exceeded. Marking Volume "fbsd1-userdata-incr-6106" as Used. 28-Feb 00:40 backupsrv-dir JobId 58414: Fatal error: Network error with FD during Backup: ERR=Broken pipe 28-Feb 00:40 backupsrv-sd2 JobId 58414: JobId=58414 Job="fbsd1-userdata.2012-02-27_23.05.02_42" marked to be canceled. 28-Feb 00:40 backupsrv-dir JobId 58414: Fatal error: No Job status returned from FD. 28-Feb 00:40 backupsrv-dir JobId 58414: Error: Bacula backupsrv-dir 5.2.3 (16Dec11): ...and wonder why does DIR get a "Network error" after 33 minutes. What does DIR talk about with FD after backing up? Does DIR take up a new network session for that? It wouldn't be possible otherwise, would it? If I understand it correctly, communications go smth like that: 1) DIR --> FD (auth, SD info etc) 2) DIR --> SD (auth, ...) 3) FD --> SD (data, attributes) 4) SD --> DIR (attributes) 5) DIR --> FD (??) - this is the point where "Broken pipe" occurs. What step is that? -- Silver ------------------------------------------------------------------------------ Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users