Re: [Veritas-bu] same job keeps hanging
For the win!!! I disabled compression and TIR and the backup ran fine for the first time last night. Go figure. Thanks for all your help.

-Aaron

Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]
___
Veritas-bu maillist - Veritas-bu@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
Re: [Veritas-bu] same job keeps hanging
Aaron,

Looks like compression may be the killer ... clear compression (and TIR) in the policy and crash the car again. The idea is to simplify the processing as much as possible, since compression (and TIR) add overhead. Since you're running this job on the master to a local tape drive (which also does compression), I don't see any gain from doing client compression.

If it still fails, then create this empty file on the system where bpbkar runs:

# touch /usr/openv/netbackup/bpbkar_path_tr

It will add a message to the bpbkar log at the start of processing for each file, via a SelectFile message. VERBOSE = 5 in bp.conf triggers the bpbkar PrintFile messages indicating it's done handling the file. Make sure you have bpbrm logging enabled, too.

When the job ends with status 41, look in the bpbrm log for the timestamp when your 3600 second CLIENT_READ_TIMEOUT expires. Then take a look at the bpbkar log and see what files were being handled around that time. Look for time gaps between SelectFile and PrintFile, then see if there is something special about that file (big, open, locked, active database, sparse, etc).

When you ran the interactive bpbkar to /dev/null, you weren't doing compression, and it completed in just over an hour, while your scheduled run with compression was nearly 5 hours. It's the -Z on the bpbkar call that tells bpbkar to do compression.

Compression is a double-edged sword. I personally prefer to let the tape drive deal with it in most situations, although I might enable client compression to an undersized disk or disk staging storage unit.

--- TTFN
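For anyone who would rather not eyeball the bpbkar log, the gap hunt between SelectFile and PrintFile could be scripted roughly like this. It is only a sketch: the PrintFile line shape is copied from the log excerpt posted in this thread, but the SelectFile line shape is an assumption (nobody posted one), and the sample lines, find_gaps and the 60 second threshold are all made up for illustration.

```python
import re
from datetime import datetime

# Assumed line shape, modelled on the PrintFile line quoted in this thread:
#   04:52:42.061 [29233] 4 bpbkar PrintFile: /foo/bar/somefile.csv
# The SelectFile shape is a guess; adjust the regex to match the real log.
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[(\d+)\] \d+ bpbkar (SelectFile|PrintFile): (.*)$"
)


def parse_time(stamp):
    """Turn an HH:MM:SS.mmm log timestamp into a datetime (date part is a dummy)."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def find_gaps(lines, threshold_seconds=60.0):
    """Return (start, end, seconds, detail) for each slow SelectFile -> PrintFile pair.

    A SelectFile that never gets a matching PrintFile is reported with
    end=None and seconds=None; that is the likely hang candidate.
    """
    results = []
    pending = {}  # pid -> (timestamp string, text) of the last unmatched SelectFile
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a SelectFile/PrintFile line
        stamp, pid, kind, text = match.groups()
        if kind == "SelectFile":
            pending[pid] = (stamp, text)
        elif pid in pending:
            started, detail = pending.pop(pid)
            elapsed = (parse_time(stamp) - parse_time(started)).total_seconds()
            if elapsed < 0:
                elapsed += 86400  # the pair straddled midnight
            if elapsed >= threshold_seconds:
                results.append((started, stamp, elapsed, detail))
    for started, detail in pending.values():
        results.append((started, None, None, detail))
    return results


# Hypothetical sample lines, not taken from a real log.
SAMPLE = [
    "02:40:00.000 [29233] 4 bpbkar SelectFile: /foo/bar/a.csv",
    "02:40:01.500 [29233] 4 bpbkar PrintFile: /foo/bar/a.csv",
    "02:41:00.000 [29233] 4 bpbkar SelectFile: /foo/bar/big.csv",
    "02:45:30.000 [29233] 4 bpbkar PrintFile: /foo/bar/big.csv",
    "02:45:31.000 [29233] 4 bpbkar SelectFile: /foo/bar/stuck.csv",
]

if __name__ == "__main__":
    for start, end, seconds, detail in find_gaps(SAMPLE):
        print(start, end, seconds, detail)
```

To use it on a real log, pass an open file object instead of SAMPLE. Anything reported with no end time is a file that was selected but never finished, which is where I would look first.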
Re: [Veritas-bu] same job keeps hanging
I gave a test run a shot. Ran

bpbkar -nocont -nfsok /foo/bar /dev/null

This took roughly an hour and 18 minutes and completed successfully. The actual policy ran last night and the same problem occurred as before - the backup stalls and then bpbrm logs a timeout.

From bpbkar... Backup starts at 00:15:23:

00:15:23.804 [29233] **LOCALE ERROR** locale en_US.ISO8859-1 not found in file /usr/openv/msg/.conf
00:15:23.804 [29233] 4 bpbkar main: real locales /C/en_US.ISO8859-15/en_US.ISO8859-15/en_US.ISO8859-15/C/en_US.ISO8859-1
00:15:23.804 [29233] 4 bpbkar main: standardized locales - lc_messages C lc_ctype en_US.ISO8859-1 lc_time C lc_collate C lc_numeric C
00:15:23.806 [29233] 2 logparams: bpbkar -r 8035200 -ru root -dt 0 -to 36000 -clnt foo.bar.com -class inbound -sched ftpif -st FULL -bpstart_to 300 -bpend_to 300 -read_to 3600 -tir -tir_plus -nfsok -Z -b foo.bar.com_1184307320 -kl 5 -shm

Then the thing fails at 04:52:42 with zero errors:

04:52:42.060 [29233] 4 bpbkar compress_file: INF - Compression: 82% /foo/bar/somefile.csv
04:52:42.061 [29233] 4 bpbkar PrintFile: /foo/bar/somefile.csv

That's it. No errors. Nothing. bpbrm then fails with a backup timeout from the client. Could this have anything to do with other jobs being queued on the master server? Is there any known issue with doing nfs backups on a master server or something? I'm pretty much out of ideas.

-Aaron

Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]

-Original Message-
From: ankur kumar [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 10, 2007 12:14 PM
To: veritas-bu@mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] same job keeps hanging
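A quick way to pull the interesting bits out of that logparams line and to see how long the scheduled run lasted before it went quiet. This is a rough sketch: the command line and the two timestamps are copied from the log excerpt above, -Z meaning compression is confirmed elsewhere in this thread, but reading -tir as the TIR option and -read_to as the read timeout in seconds is my assumption. summarize_flags and elapsed_seconds are just illustrative helper names.

```python
from datetime import datetime

# Copied from the logparams line in the bpbkar log excerpt.
LOGPARAMS = (
    "bpbkar -r 8035200 -ru root -dt 0 -to 36000 -clnt foo.bar.com -class inbound "
    "-sched ftpif -st FULL -bpstart_to 300 -bpend_to 300 -read_to 3600 -tir "
    "-tir_plus -nfsok -Z -b foo.bar.com_1184307320 -kl 5 -shm"
)


def summarize_flags(cmdline):
    """Report whether -Z and -tir are present and what -read_to is set to."""
    tokens = cmdline.split()
    read_to = None
    if "-read_to" in tokens:
        idx = tokens.index("-read_to")
        if idx + 1 < len(tokens):
            read_to = int(tokens[idx + 1])
    return {
        "compression": "-Z" in tokens,  # -Z = client compression, per this thread
        "tir": "-tir" in tokens,        # assumed to be the TIR policy option
        "read_to": read_to,             # assumed to be the read timeout in seconds
    }


def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS.mmm stamps, allowing for one midnight wrap."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() % 86400


if __name__ == "__main__":
    print(summarize_flags(LOGPARAMS))
    seconds = elapsed_seconds("00:15:23.804", "04:52:42.061")
    print(f"{seconds:.3f} s, about {seconds / 3600:.2f} hours")
```

With the two timestamps above this works out to a bit over four and a half hours, against roughly an hour and 18 minutes for the interactive run without compression.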
Re: [Veritas-bu] same job keeps hanging.
If you are backing up files on a UNIX system, check if there is a hung nfs mount. I have had backups hang because the ls -l command hangs on the mount point. - Just a thought..

Carl Stehman
IT Distributed Services Team
Pepco Holdings, Inc.
202-331-6619
Pager 301-765-2703
[EMAIL PROTECTED]

Aaron Mills [EMAIL PROTECTED]
Sent by: [EMAIL PROTECTED]
07/09/2007 04:39 PM
To Liddle, Stuart [EMAIL PROTECTED], veritas-bu@mailman.eng.auburn.edu
cc
Subject Re: [Veritas-bu] same job keeps hanging.
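The hung-mount check (does ls -l on the mount point ever come back?) can be wrapped in a timeout so the check itself doesn't hang your shell. A minimal sketch, assuming a Unix box with ls on the PATH; path_responds and the 10 second default are made-up names and values, and on a hard-mounted NFS share the child process may stay stuck even after the kill.

```python
import subprocess


def path_responds(path, timeout_seconds=10.0):
    """Return True if `ls -l path` finishes within the timeout, False if it hangs.

    A non-zero exit from ls (e.g. missing path) still counts as "responds";
    the only thing tested here is whether the call comes back at all.
    """
    proc = subprocess.Popen(
        ["ls", "-l", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a dead NFS server may ignore this.
        proc.kill()
        return False


if __name__ == "__main__":
    # /na270/pub/inbound is one of the paths from the bpbrm log in this thread.
    for mount_point in ["/", "/na270/pub/inbound"]:
        print(mount_point, path_responds(mount_point))
```

It deliberately does not wait for the child after killing it, since that wait is exactly what would block on a dead mount.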
Re: [Veritas-bu] same job keeps hanging.
I've experienced the same.

Paul
--

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Rock
Sent: July 9, 2007 5:12 PM
To: veritas-bu@mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] same job keeps hanging.

2. client on the other side of a firewall

What was happening in our case was the backup would start, one hour into the backup, the firewall would decide since it didn't see any traffic coming from the client to the master server, it would drop the entry in the state table. Then, one hour later, the client would try to send a keepalive packet through the now-defunct connection, fail, retry several times, and then finally give up and die, taking the backup with it.
Re: [Veritas-bu] same job keeps hanging
1. check out the bpbkar logs
2. Add VERBOSE = 5 to the /usr/openv/netbackup/bp.conf on the client. Create the /usr/openv/netbackup/logs/bpbkar directory if it doesn't already exist. Run

/usr/openv/netbackup/bin/bpbkar -nocont DIRECTORY /dev/null

against the directory where the file resides.

The above command will cause bpbkar to read the directory and write the output to /dev/null instead of disk or tape. Running bpbkar manually is a good method to verify whether bpbkar can read a file without doing an actual backup of the client. Any errors will be logged to the bpbkar log directory.

Calculus
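The setup part of step 2 (VERBOSE = 5 in bp.conf plus the bpbkar log directory) could be scripted along these lines. Treat it as a sketch under assumptions: ensure_verbose and ensure_log_dir are hypothetical helpers, the paths are passed in rather than hard-coded so you can try it on a copy first, and it assumes bp.conf entries are plain "NAME = value" lines as shown in this thread. Back up the real bp.conf before pointing anything at it.

```python
import os
import re

# Matches an existing VERBOSE entry, whatever its current value.
VERBOSE_RE = re.compile(r"^\s*VERBOSE\s*=")


def ensure_verbose(conf_path, level=5):
    """Make sure the file has exactly one 'VERBOSE = <level>' line.

    Returns the resulting list of lines. Other entries are left untouched.
    """
    wanted = f"VERBOSE = {level}"
    lines = []
    if os.path.exists(conf_path):
        with open(conf_path) as handle:
            lines = handle.read().splitlines()
    result = []
    replaced = False
    for line in lines:
        if VERBOSE_RE.match(line):
            if not replaced:
                result.append(wanted)  # rewrite the first VERBOSE entry in place
                replaced = True
            # any further VERBOSE lines are dropped as duplicates
        else:
            result.append(line)
    if not replaced:
        result.append(wanted)
    with open(conf_path, "w") as handle:
        handle.write("\n".join(result) + "\n")
    return result


def ensure_log_dir(log_dir):
    """Create the log directory if it does not exist yet and return its path."""
    os.makedirs(log_dir, exist_ok=True)
    return log_dir


if __name__ == "__main__":
    # Paths as given in the message above; run as a user allowed to write them.
    ensure_verbose("/usr/openv/netbackup/bp.conf", level=5)
    ensure_log_dir("/usr/openv/netbackup/logs/bpbkar")
```

The bpbkar test run itself is left as a manual step, since you will want to watch it.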
[Veritas-bu] same job keeps hanging.
Hi all,

I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've got a ~126GB backup that kicks off weekly, but hangs within a few hours every time - the error I get is always "media manager terminated by parent process" but the logs don't seem to show anything odd. No other backups hang like this. This is also the only job that runs on the server itself.

bptm gives me:

03:28:45.470 [4999] 2 io_ioctl: command (1)MTFSF 1 from (bptm.c.8307) on drive index 1
03:28:45.530 [4999] 2 io_close: closing /usr/openv/netbackup/db/media/tpreq/AK6503, from bptm.c.8310
03:28:45.530 [4999] 2 catch_signal: EXITING with status 82

so I check bpbrm:

02:05:33.882 [4992] 2 bpbrm spawn_child: /usr/openv/netbackup/bin/bptm bptm -w -c foo.bar.com -den 17 -rt 6 -rn 0 -stunit Spectra2 -cl inbound -bt 1183968330 -b foo.bar.com _1183968330 -st 0 -cj 1 -p inbound -hostname foo.bar.com -ru root -rclnt foo.bar.com -rclnthostname foo.bar.com -rl 5 -rp 8035200 -sl ftpif -ct 0 -maxfrag 1048576 -tir -v -Z -mediasvr foo.bar.com -jobid 117926 -jobgrpid 117926 -masterversion 51 -shm
02:05:33.884 [4992] 2 bpbrm write_continue_backup: wrote CONTINUE BACKUP on COMM_SOCK 4
02:05:33.884 [4992] 2 bpbrm main: wrote /na270/pub/inbound on COMM_SOCK
02:05:33.884 [4992] 2 bpbrm main: wrote /na270/pub/ftp on COMM_SOCK
02:05:33.884 [4992] 2 bpbrm main: wrote CONTINUE on COMM_SOCK
02:05:33.885 [4992] 2 bpbrm main: ESTIMATE -1 -1 nbu0 foo.bar.com _1183968330
02:09:44.763 [4992] 2 bpbrm mm_sig: received ready signal from media manager
02:09:44.763 [4992] 2 bpbrm readline: retrying partial read from fgets ::
03:27:22.261 [4992] 2 bpbrm sighandler: signal 14 caught by bpbrm
03:27:22.272 [4992] 2 bpbrm sighandler: bpbrm timeout after 3600 seconds
03:27:22.287 [4992] 2 clear_held_signals: clearing signal mask stack, mask_stack_depth = 0
03:27:22.287 [4992] 2 bpbrm kill_child_process: start
03:27:22.287 [4992] 2 bpbrm wait_for_child: start
03:28:48.546 [4992] 2 bpbrm wait_for_child: child exit_status = 82 signal_status = 0
03:28:48.557 [4992] 2 inform_client_of_status: INF - Server status = 41

but I can't seem to figure out why there was a timeout. I checked all the related logs - bpbkar just shows file writing stopping at 2:42am - like the process just hangs there, no errors though. Looking right now, the bpbrm and bpbkar processes for this backup are still running, but nothing is happening. The job shows as active and everything is queueing up behind it. I've also adjusted the CLIENT_READ_TIMEOUT in /usr/openv/netbackup/bp.conf to no avail.

Can anyone point me in the right direction as to what I'm missing? I'm guessing there's something I'm not seeing in one of the logs.

-Aaron

Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]
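One thing the bpbrm log does give you is the moment the timer fired and its length, so you can work backwards to when it must have been started. A small sketch of that arithmetic follows. The regex is modelled on the "bpbrm timeout after 3600 seconds" line above, timer_window is a made-up helper, and the idea that the timer restarts each time bpbrm hears from the client is my assumption, not something the log states.

```python
import re
from datetime import datetime, timedelta

# Modelled on this line from the bpbrm log in the post above:
#   03:27:22.272 [4992] 2 bpbrm sighandler: bpbrm timeout after 3600 seconds
TIMEOUT_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3}) \[\d+\] \d+ bpbrm sighandler: "
    r"bpbrm timeout after (\d+) seconds"
)


def timer_window(line):
    """Return (armed, fired) as HH:MM:SS strings, or None if the line doesn't match.

    'armed' is simply 'fired' minus the timeout length; if the timer is reset
    on every read from the client (an assumption), that is roughly when bpbrm
    last heard anything.
    """
    match = TIMEOUT_RE.match(line.strip())
    if match is None:
        return None
    fired = datetime.strptime(match.group(1), "%H:%M:%S.%f")
    armed = fired - timedelta(seconds=int(match.group(2)))
    return armed.strftime("%H:%M:%S"), fired.strftime("%H:%M:%S")


if __name__ == "__main__":
    sample = "03:27:22.272 [4992] 2 bpbrm sighandler: bpbrm timeout after 3600 seconds"
    print(timer_window(sample))
```

The armed time is worth comparing against the point where the bpbkar log goes quiet. If the two don't line up, that in itself is a clue about which side stopped talking first.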
Re: [Veritas-bu] same job keeps hanging.
So, are you trying to back up a filesystem with lots and lots of small files? If so, remember that NetBackup will try to enumerate all of the files that you are trying to back up. We had a similar situation where we were trying to back up a filesystem with 3.5 million files in 50,000 directories. It took hours to do a filelist of all of that... consequently, it timed out. Symantec told us the best solution for that particular directory was NDMP (since the timeouts are much longer).

OR...I suppose you could up the timeout value to more than 3600 seconds and see what happens.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Aaron Mills
Sent: Monday, July 09, 2007 9:58 AM
To: veritas-bu@mailman.eng.auburn.edu
Subject: [Veritas-bu] same job keeps hanging.
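To get a feel for whether file count is even in the right ballpark, a plain directory walk with a timer is enough. This is just a sketch: count_entries is a made-up helper, and it measures a Python os.walk, not whatever NetBackup does internally when it builds its file list, so read the timing as a rough lower bound at best.

```python
import os
import time


def count_entries(root):
    """Walk `root` and return (files, directories, seconds taken)."""
    files = 0
    dirs = 0
    start = time.monotonic()
    for _dirpath, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs, time.monotonic() - start


if __name__ == "__main__":
    # /na270/pub/inbound is one of the backup paths from the bpbrm log in this thread.
    files, dirs, seconds = count_entries("/na270/pub/inbound")
    print(f"{files} files in {dirs} directories, walked in {seconds:.1f} s")
```

If the walk finishes in seconds, enumeration alone is unlikely to explain an hour-long silence.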
Re: [Veritas-bu] same job keeps hanging.
On Mon, 9 Jul 2007, Aaron Mills wrote:

> Hi all, I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've got a ~126GB backup that kicks off weekly, but hangs within a few hours every time - the error I get is always "media manager terminated by parent process" but the logs don't seem to show anything odd. No other backups hang like this. This is also the only job that runs on the server itself.

Can you backup other directories, /etc, /var with no problem?
Re: [Veritas-bu] same job keeps hanging.
Well, we have incremental backups that run on this filesystem - they seem to run fine. I wonder if it isn't a number of files issue as Stuart Liddle suggested...

Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]

-Original Message-
From: Justin Piszcz [mailto:[EMAIL PROTECTED]
Sent: Monday, July 09, 2007 11:31 AM
To: Aaron Mills
Cc: veritas-bu@mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] same job keeps hanging.
Re: [Veritas-bu] same job keeps hanging.
Actually, I've done that before with the same results. We upped the timeouts to around 10,000 seconds to no avail. It's as though at some point the backups just hang for no good reason. A quick find shows that I'm backing up roughly 29,000 files - that shouldn't take too long to enumerate, should it?

Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]

From: Liddle, Stuart [mailto:[EMAIL PROTECTED]
Sent: Monday, July 09, 2007 11:14 AM
To: Aaron Mills; veritas-bu@mailman.eng.auburn.edu
Subject: RE: [Veritas-bu] same job keeps hanging.
Re: [Veritas-bu] same job keeps hanging.
* Aaron Mills [EMAIL PROTECTED] [2007-07-09 16:39]:

> Hi all, I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've got a ~126GB backup that kicks off weekly, but hangs within a few hours every time - the error I get is always "media manager terminated by parent process" but the logs don't seem to show anything odd. No other backups hang like this. This is also the only job that runs on the server itself.

When you say runs on the server itself, what do you actually mean?

We saw an odd timeout that always happened at the same time into the backup, but the specific circumstances were:

1. a bpbackup command running on a client system
2. client on the other side of a firewall

What was happening in our case was the backup would start; one hour into the backup, the firewall would decide that since it didn't see any traffic coming from the client to the master server, it would drop the entry in the state table. Then, one hour later, the client would try to send a keepalive packet through the now-defunct connection, fail, retry several times, and then finally give up and die, taking the backup with it.

This may not be anything like what you are dealing with, but it is a pretty good example of how things other than NBU can cause weird things to happen and make it look like NBU is the cause.

Does your job always die at the same time, or does it vary from attempt to attempt?

--
David Rock
[EMAIL PROTECTED]
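To illustrate the firewall case: the connection only survives if something crosses the firewall more often than its idle timeout, so keepalives have to start before the state entry is dropped. The snippet below shows the idea on an ordinary TCP socket. It is a generic illustration only, not a NetBackup setting; enable_keepalive and the 1800 second figure are made up for the example, and the TCP_KEEP* constants are platform dependent (hence the getattr guard). For NBU itself the equivalent knob would be the OS-level keepalive tuning on the hosts involved.

```python
import socket


def enable_keepalive(sock, idle_seconds, interval_seconds=60, probes=5):
    """Turn on TCP keepalive and, where the platform allows, set its timing.

    idle_seconds should be comfortably below the firewall's idle timeout,
    e.g. 1800 for a firewall that drops state after one hour.
    Returns a dict of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": True}
    timing = (
        ("TCP_KEEPIDLE", idle_seconds),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_seconds),  # gap between probes
        ("TCP_KEEPCNT", probes),              # failed probes before giving up
    )
    for name, value in timing:
        option = getattr(socket, name, None)  # not defined on every platform
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        print(enable_keepalive(sock, idle_seconds=1800))
```

The one-hour-then-one-hour pattern described above is what you would expect when the first keepalive is only sent after two idle hours and the firewall gives up after one.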
Re: [Veritas-bu] same job keeps hanging.
Anecdotally - it doesn't always die at the same time, but roughly an hour or two into the job. I never actually looked to see if it was within a few minutes, but the symptom is always the same: daemon terminated by parent process, bpbrm timeout after 3600 seconds. Something seems to be causing the client process to get stuck, for lack of a better word.

As to the server - the job runs on the NBU server itself. I have an NFS mount hanging off it that I'm backing up. I've checked /var/adm/messages and I don't see anything weird happening at the time the backup fails (mount going stale, etc.), either.

Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]

-Original Message-
From: David Rock [mailto:[EMAIL PROTECTED]
Sent: Monday, July 09, 2007 3:12 PM
To: veritas-bu@mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] same job keeps hanging.