So,

On the client, in /tmp/amanda/amandad/, I see the debug files for the estimates and for each of the DLEs except for the one that failed. No entry for it from 5/8 or 5/9 (after midnight). Did a more on `ls *0508* *0509* | sort` and scanned through that with find.

On the server, in /tmp/amanda/server/daily/, I also don't see any references to the DLE "/data" in any of the dumper debug files from 5/8 or 5/9. I do see some sections where the "timeout waiting for ACK" appear, but I don't know what to make of them. They don't seem to offer me much in the way of diagnostics or the assurance that they are talking about the "/data" DLE.

Here is one of those pieces with the successful items before and after:

dumper: Building type 4 (FILE) header of size 32768 using:
dumper: Contents of *(dumpfile_t *)28894:
dumper:     type             = 4 (FILE)
dumper:     datestamp        = '20130508224500'
dumper:     dumplevel        = 1
dumper:     compressed       = 1
dumper:     encrypted        = 0
dumper:     comp_suffix      = '.gz'
dumper:     encrypt_suffix   = 'N'
dumper:     name             = 'metzi1.physics.umass.edu'
dumper:     disk             = '/usr/local'
dumper:     program          = '/bin/tar'
dumper:     srvcompprog      = ''
dumper:     clntcompprog     = ''
dumper:     srv_encrypt      = ''
dumper:     clnt_encrypt     = ''
dumper:     recover_cmd      = '/bin/gzip -dc |/bin/tar -f - ...'
dumper:     uncompress_cmd   = ''
dumper:     encrypt_cmd      = ''
dumper:     decrypt_cmd      = ''
dumper:     srv_decrypt_opt  = ''
dumper:     clnt_decrypt_opt = ''
dumper:     cont_filename    = ''
dumper:     is_partial       = 0
dumper:     partnum          = 0
dumper:     totalparts       = 0
dumper:     blocksize        = 32768
security_stream_seterr(57cd0, EOF)
security_stream_close(57cd0)
security_stream_seterr(47c50, EOF)
security_stream_close(47c50)
security_stream_seterr(4fc90, EOF)
security_stream_close(4fc90)
dumper: connect_port: Try  port 1025: Available   -
dumper: connected to 127.0.0.1.36800
dumper: our side is 0.0.0.0.1025
dumper: try_socksize: send buffer size is 65536
security_getdriver(name=ssh) returns ff316638
security_handleinit(handle=335a0, driver=ff316638 (SSH))
security_streaminit(stream=3fc10, driver=ff316638 (SSH))
security_seterror(handle=335a0, driver=ff316638 (SSH) error=timeout waiting for 
ACK)
security_close(handle=335a0, driver=ff316638 (SSH))
security_stream_close(3fc10)
dumper: connect_port: Try  port 1025: Available   -
dumper: connected to 127.0.0.1.36806
dumper: our side is 0.0.0.0.1025
dumper: try_socksize: send buffer size is 65536
security_getdriver(name=ssh) returns ff316638
security_handleinit(handle=335a0, driver=ff316638 (SSH))
security_streaminit(stream=3fc10, driver=ff316638 (SSH))
security_seterror(handle=335a0, driver=ff316638 (SSH) error=timeout waiting for 
ACK)
security_close(handle=335a0, driver=ff316638 (SSH))
security_stream_close(3fc10)
dumper: connect_port: Try  port 1025: Available   -
dumper: connected to 127.0.0.1.36810
dumper: our side is 0.0.0.0.1025
dumper: try_socksize: send buffer size is 65536
security_getdriver(name=ssh) returns ff316638
security_handleinit(handle=33398, driver=ff316638 (SSH))
security_streaminit(stream=3fc10, driver=ff316638 (SSH))
security_streaminit(stream=47c50, driver=ff316638 (SSH))
security_streaminit(stream=4fc90, driver=ff316638 (SSH))
security_streaminit(stream=57cd0, driver=ff316638 (SSH))
security_close(handle=33398, driver=ff316638 (SSH))
security_stream_close(3fc10)
dumper: Building type 4 (FILE) header of size 32768 using:
dumper: Contents of *(dumpfile_t *)28894:
dumper:     type             = 4 (FILE)
dumper:     datestamp        = '20130508224500'
dumper:     dumplevel        = 1
dumper:     compressed       = 1
dumper:     encrypted        = 0
dumper:     comp_suffix      = '.gz'
dumper:     encrypt_suffix   = 'N'
dumper:     name             = 'rassilon.crc.isb.nsm'
dumper:     disk             = '/var/radmind'
dumper:     program          = '/bin/tar'
dumper:     srvcompprog      = ''
dumper:     clntcompprog     = ''
dumper:     srv_encrypt      = ''
dumper:     clnt_encrypt     = ''
dumper:     recover_cmd      = '/bin/gzip -dc |/bin/tar -f - ...'
dumper:     uncompress_cmd   = ''
dumper:     encrypt_cmd      = ''
dumper:     decrypt_cmd      = ''
dumper:     srv_decrypt_opt  = ''
dumper:     clnt_decrypt_opt = ''
dumper:     cont_filename    = ''
dumper:     is_partial       = 0
dumper:     partnum          = 0
dumper:     totalparts       = 0
dumper:     blocksize        = 32768




On 5/9/13 3:24 PM, Jean-Louis Martineau wrote:
Chris,

look in the amandad debug files on the client and the dumper debug file on the 
server.

Jean-Louis

On 05/09/2013 03:06 PM, Chris Hoogendyk wrote:
On one of my older setups that is running Amanda 2.5.1p3, I'm getting patterns of failures that I can't make sense of. Some days everything works just fine. Other days I get everything from a couple of DLEs failing to a half dozen. They are typically the same three servers, which are in another building. However, there is a server that has the largest DLEs in that same building that does not exhibit failures. There are also several servers in the same building as the Amanda backup server that don't show any failures. I even made a spreadsheet that shows a 0, 1 or 2 for backup levels of successes and a red x for failures. Can't see any pattern.

I've looked at a bunch of things and pored over the log files to no avail.

The errors show up in the Amanda reports as:

  anise.nsm.umass.edu        /home     lev 1  FAILED [cannot read header: got 0 
instead of 32768]
  metzi1.physics.umass.edu   /data     lev 1  FAILED [cannot read header: got 0 
instead of 32768]
  anise.nsm.umass.edu        /home     lev 1  FAILED [cannot read header: got 0 
instead of 32768]
anise.nsm.umass.edu /home lev 1 FAILED [too many dumper retry: "[request failed: timeout waiting for ACK]"] metzi1.physics.umass.edu /data lev 1 FAILED [too many dumper retry: "[request failed: timeout waiting for ACK]"]
  metzi1.physics.umass.edu   /data     lev 1  FAILED [cannot read header: got 0 
instead of 32768]

The interesting thing is that if I go to metzi1, into /tmp/amanda/client/daily/ and do an `ls *0508*` I can see the debug logs from last night. If I do a `grep '/data' *0508*` I can see any entries that mention the DLE /data. I see no instances of sendbackup. I see only those runtar debug files that correspond to the size estimates for 0, 1 and 2 level backups. There is no runtar that would correspond to an actual dump.

If I do the same thing with `grep '/home' *0508*` (the DLE /home was successfully backed up), then I see all the runtar debug files for the estimates as well as a runtar debug file for the actual backup. I also see several lines in the sendbackup debug file for /home.

I've also looked through /var/log/syslog, /var/log/auth.log, etc. on the client (which is Ubuntu 12.04 LTS), and I've looked through Amanda debug logs, /var/adm/messages, /var/adm/authlog, etc. on the server (which is Solaris 10). I don't see any logged errors for dropped connections or failures of any sort. The Amanda logs just don't mention /data on metzi1. A couple of the other servers that are being backed up are Ubuntu 12.04 and several are Ubuntu 10.04. None of them have been tweeked for sshd_config. All have tcpkeepalive turned on.

I tried bumping up the timeouts in amanda.conf (by a factor of 5). That seems a bit much, and it didn't seem to make any difference.

What should I be looking for? Where would Amanda log what is going on? (Or, why would it not be logging it?)


Thank you,




--
---------------

Chris Hoogendyk

-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst

<hoogen...@bio.umass.edu>

---------------

Erdös 4

Reply via email to