long timeouts for very small DLEs
Hi,

Both client and server are at 3.3.0. I now see this all the time on this particular server-client pair: amanda schedules, say, an incremental level 1 (a few MB usually) for a DLE, and when the chunker seems to have done its job, it just sits there doing nothing, sometimes for hours, before it completes. I have DLEs of a few GB that complete faster than a 50MB one... Suffice to say, this completely disrupts the backups. The amanda runs now take more than 2 days to complete (51 DLEs in total) :(

Any help welcomed. Thanks, jf

Here are a few logfiles. I can provide more info if needed.

--- amstatus:

gustav:/raid/dagher3 152m dumping 52m ( 99.97%) (15:23:32)

--- amdump logfile shows:

pondering gustav:/raid/dagher3... next_level0 6 last_level 1
(not due for a full dump, picking an incr level)
gustav /raid/dagher3 pri 2 lev 1 nsize 54160 csize 54160
DUMP gustav 9efefbff0f /raid/dagher3 2027201101 2 1 2011:11:24:1:11:1 54160 54160 143 378 "Skipping: a full is not planned, so can't dump in degraded mode"
driver: send-cmd time 69151.276 to chunker1: PORT-WRITE 00-00070 /holddisk/top/2027201101/gustav._raid_dagher3.1 gustav 9efefbff0f /raid/dagher3 1 2011:11:24:1:11:1 9007199254740896 APPLICATION 54240 |;auth=BSDTCP;index;
driver: send-cmd time 69151.282 to dumper1: PORT-DUMP 00-00070 36699 gustav 9efefbff0f /raid/dagher3 NODEVICE 1 2011:11:24:1:11:1 amgtar "" "" "" "" BSDTCP AMANDA 127.0.0.1:36700 |" BSDTCP\n YES\n YES\n AMANDA\n \n amgtar\n\n ignore\n :_Directory_is_new$\n :_Directory_has_been_renamed\n :_file_changed_as_we_read_it$\n :_socket_ignored$\n :_Cannot_stat:_No_such_file_or_directory$\n \n\n check-device\n NO\n\n \n"
driver: state time 69151.282 free kps: 1023574 space: 3896191960 taper: idle idle-dumpers: 9 qlen tapeq: 1 runq: 16 roomq: 0 wakeup: 0 driver-idle: no-dumpers
driver: interface-state time 69151.282 if default: free 1023574
driver: hdisk-state time 69151.282 hdisk 0: free 3896191960 dumpers 3

--- chunker debug file: chunker.2028152332.debug

/var/tmp/amanda/server/top/chunker.2028152332.debug:Mon Nov 28 15:23:32 2011: chunker: getcmd: PORT-WRITE 00-00070 /holddisk/top/2027201101/gustav._raid_dagher3.1 gustav 9efefbff0f /raid/dagher3 1 2011:11:24:1:11:1 9007199254740896 APPLICATION 54240 |;auth=BSDTCP;index;
/var/tmp/amanda/server/top/chunker.2028152332.debug:Mon Nov 28 15:23:32 2011: chunker: Building type FILE header of 32768-32768 bytes with name='gustav' disk='/raid/dagher3' dumplevel=1 and blocksize=32768

--- client amandad debug file: amandad.2028152755.debug

[...]
Mon Nov 28 15:27:55 2011: amandad: sending ACK pkt: < >
Mon Nov 28 15:27:55 2011: amandad: tcpm_send_token: data is still flowing
Mon Nov 28 15:27:55 2011: amandad: security_streaminit(stream=0x235cd00, driver=0x7f1440cfe380 (BSDTCP))
Mon Nov 28 15:27:55 2011: amandad: security_streaminit(stream=0x2364d60, driver=0x7f1440cfe380 (BSDTCP))
Mon Nov 28 15:27:55 2011: amandad: security_streaminit(stream=0x236cdc0, driver=0x7f1440cfe380 (BSDTCP))
Mon Nov 28 15:27:55 2011: amandad: sending REP pkt: < CONNECT DATA 49 MESG 48 INDEX 47 OPTIONS features=9efefbff0f; >
Mon Nov 28 15:27:55 2011: amandad: received ACK pkt: < >
Mon Nov 28 15:27:55 2011: amandad: opening security stream for fd 50
Mon Nov 28 15:27:55 2011: amandad: opening security stream for fd 51
Mon Nov 28 15:27:55 2011: amandad: opening security stream for fd 52
Mon Nov 28 15:27:55 2011: amandad: security_close(handle=0x23438c0, driver=0x7f1440cfe380 (BSDTCP))
Mon Nov 28 15:27:55 2011: amandad: security_stream_close(0x2343a90)
Mon Nov 28 15:31:00 2011: amandad: tcpm_send_token: data is still flowing
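
Note the gap of more than three minutes between the 15:27:55 entries and the next "tcpm_send_token: data is still flowing" at 15:31:00; that is where everything just sits. One way to check whether any bytes are actually moving during such a stall is to watch the amandad connection from the server while a small DLE is stuck "dumping". A minimal sketch, assuming the standard amanda port 10080 and an interface named eth0 (both may differ on your setup):

  # With bsdtcp auth the DATA/MESG/INDEX streams are multiplexed over the
  # single TCP connection to the client's amandad, so a stalled dump should
  # show up as a silent connection here. 'eth0' and port 10080 are
  # assumptions; substitute your server's interface and configured port.
  sudo tcpdump -n -i eth0 host gustav and tcp port 10080

If the capture goes completely quiet while amstatus still reports the DLE as dumping, the holdup is likely in the data path or on the client rather than on the holding disk.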
Can anyone explain this log error message?
Our setup:

- AMANDA server: Debian 'Stable' release, 2.6.1p2-3 direct from Debian packages;
- AMANDA clients: various, Debian 'Stable' with the same version as above.

We upgraded from Debian 'oldstable' to 'stable' in the past few months, and since upgrading the client we occasionally see the following on one of our servers, named 'athena', relating to the /home volume, which is currently approximately 1.7TB in size. For example, from this morning:

FAILURE DUMP SUMMARY:
  athena /home lev 0 FAILED "/usr/lib/amanda/runtar exited with status 1: see /var/log/amanda/client/Daily/sendsize.2018234504.debug"

The pertinent part of /var/log/amanda/client/Daily/sendsize.2018234504.debug is:

1321659908.212265: sendsize: waiting for any estimate child: 1 running
1321659908.212332: sendsize: calculating for amname /home, dirname /home, spindle -1 GNUTAR
1321659908.212399: sendsize: getting size via gnutar for /home level 0
1321659908.213588: sendsize: pipespawnv: stdoutfd is 3
1321659908.213715: sendsize: Spawning "/usr/lib/amanda/runtar runtar Daily /bin/tar --create --file /dev/null --numeric-owner --directory /home --one-file-system --listed-incremental /var/lib/amanda/gnutar-lists/athena_home_0.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._home.2018234508.exclude ." in pipeline
1321660065.551101: sendsize: Total bytes written: 1831488286720 (1.7TiB, 11GiB/s)
1321660065.551698: sendsize: .
1321660065.551733: sendsize: estimate time for /home level 0: 157.338
1321660065.551742: sendsize: estimate size for /home level 0: 1788562780 KB
1321660065.551754: sendsize: waiting for runtar "/home" child
1321660065.570903: sendsize: after runtar /home wait
1321660065.592913: sendsize: getting size via gnutar for /home level 4
1321660065.701616: sendsize: pipespawnv: stdoutfd is 3
1321660065.711458: sendsize: Spawning "/usr/lib/amanda/runtar runtar Daily /bin/tar --create --file /dev/null --numeric-owner --directory /home --one-file-system --listed-incremental /var/lib/amanda/gnutar-lists/athena_home_4.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._home.2018234745.exclude ." in pipeline
1321660183.361325: sendsize: Total bytes written: 8443156480 (7.9GiB, 69MiB/s)
1321660183.361910: sendsize: .
1321660183.361935: sendsize: estimate time for /home level 4: 117.660
1321660183.361944: sendsize: estimate size for /home level 4: 8245270 KB
1321660183.361952: sendsize: waiting for runtar "/home" child
1321660183.387790: sendsize: after runtar /home wait
1321660183.409511: sendsize: getting size via gnutar for /home level 5
1321660183.513086: sendsize: pipespawnv: stdoutfd is 3
1321660183.513274: sendsize: Spawning "/usr/lib/amanda/runtar runtar Daily /bin/tar --create --file /dev/null --numeric-owner --directory /home --one-file-system --listed-incremental /var/lib/amanda/gnutar-lists/athena_home_5.new --sparse --ignore-failed-read --totals --exclude-from /tmp/amanda/sendsize._home.2018234943.exclude ." in pipeline
1321660292.407967: sendsize: /bin/tar: ./heather/CEU_Work/csrc/bbquaire/mwlv_cron/nutauto.chk: File removed before we read it
1321660298.633046: sendsize: Total bytes written: 4815687680 (4.5GiB, 41MiB/s)
1321660298.633493: sendsize: .
1321660298.633518: sendsize: estimate time for /home level 5: 115.120
1321660298.633526: sendsize: estimate size for /home level 5: 4702820 KB
1321660298.633533: sendsize: waiting for runtar "/home" child
1321660298.659697: sendsize: after runtar /home wait
1321660298.681181: sendsize: errmsg is /usr/lib/amanda/runtar exited with status 1: see /var/log/amanda/client/Daily/sendsize.2018234504.debug
1321660298.681237: sendsize: done with amname /home dirname /home spindle -1

I note that there were changes on the filesystem during the backup process: this is normal (cron jobs etc.) and has previously caused nothing more than a 'STRANGE' note in the report.

I also note that the error relates to the level 5 estimate, but that a level 0 backup was actually carried out, apparently successfully:

athena /home 0 1788562460 532736530 29.8 746:27 11894.8 137:06 64766.4

I believe this is the appropriate runtar.*.debug corresponding to the above, based on the timestamps:

1321660183.516316: runtar: pid 2400 ruid 34 euid 0 version 2.6.1p2: start at Fri Nov 18 23:49:43 2011
1321660183.516378: runtar: version 2.6.1p2
1321660183.519739: runtar: /bin/tar version: tar (GNU tar) 1.23
1321660183.519872: runtar: config: Daily
1321660183.520622: runtar: pid 2400 ruid 0 euid 0 version 2.6.1p2: rename at Fri Nov 18
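
For what it's worth, runtar appears to pass /bin/tar's exit status straight through, and GNU tar documents exit status 1 as the non-fatal "some files differ" case (which covers a file being removed or changed while tar reads it) and status 2 as a fatal error. The "File removed before we read it" message during the level 5 estimate would therefore account for the status-1 failure even though the level 0 dump itself succeeded. A minimal sketch of checking this by hand; the tar invocation is a shortened, illustrative version of the estimate command in the sendsize log above, not Amanda's actual code:

  # GNU tar exit codes: 0 = success, 1 = "some files differ" (non-fatal,
  # e.g. a file removed or changed during the run), 2 = fatal error.
  tar --create --file /dev/null --one-file-system --totals --directory /home .
  case $? in
    0) echo "estimate ran clean" ;;
    1) echo "non-fatal: the filesystem changed while tar was reading it" ;;
    2) echo "fatal tar error" ;;
  esac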