Re: [Veritas-bu] same job keeps hanging

2007-07-17 Thread Aaron Mills
For the win!!! I disabled compression and TIR and the backup ran fine
for the first time last night. Go figure.

 

Thanks for all your help.

 

-Aaron

 

 

Aaron Mills

Systems Administrator

Return Path, Inc.

http://www.returnpath.net

[EMAIL PROTECTED]

 

 

___
Veritas-bu maillist  -  Veritas-bu@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu


Re: [Veritas-bu] same job keeps hanging

2007-07-14 Thread rarmstr0
Aaron,
Looks like compression may be the killer ... clear compression (and TIR) in the 
policy and crash the car again.   The idea being to simplify the processing as 
much as possible as compression (and TIR) add overhead.  Since you're running 
this job on the master to a local tape drive (which also does compression), I 
don't see anywhere for any gain by doing client compression.

If it still fails, then create this empty file on the system where bpbkar runs
  # touch /usr/openv/netbackup/bpbkar_path_tr
It will add a message to the bpbkar log for the start of processing for each 
file via a SelectFile message.

bp.conf VERBOSE = 5 triggers the bpbkar PrintFile messages indicating it's done 
handling the file.

Make sure you have bpbrm logging enabled, too.
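
The setup steps above could be scripted. What follows is a minimal Python sketch, not an official tool: the helper name `enable_bpbkar_tracing` and its `root` parameter are made up for illustration, and it assumes the default /usr/openv/netbackup install root and that legacy debug logging for a process is switched on by creating its directory under logs/. It rewrites bp.conf in place, so keep a copy of the file first.

```python
import os
import re


def enable_bpbkar_tracing(root="/usr/openv/netbackup"):
    """Apply the debug settings described above under `root` (hypothetical helper)."""
    # Empty flag file: asks bpbkar to log a SelectFile message for each file.
    open(os.path.join(root, "bpbkar_path_tr"), "a").close()

    # Ensure bp.conf carries VERBOSE = 5, replacing any existing VERBOSE line.
    conf = os.path.join(root, "bp.conf")
    lines = []
    if os.path.exists(conf):
        with open(conf) as fh:
            lines = fh.read().splitlines()
    lines = [ln for ln in lines if not re.match(r"\s*VERBOSE\s*=", ln)]
    lines.append("VERBOSE = 5")
    with open(conf, "w") as fh:
        fh.write("\n".join(lines) + "\n")

    # Assumption: a per-process directory under logs/ enables that debug log.
    for proc in ("bpbkar", "bpbrm"):
        os.makedirs(os.path.join(root, "logs", proc), exist_ok=True)
```

Passing a scratch directory as `root` lets you check the result before touching a real install.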

When the job ends with status 41, then look in the bpbrm log for the timestamp 
when your 3600 second CLIENT_READ_TIMEOUT expires.  Then take a look at the bpbkar 
log and see what files were being handled around that time.  Look for time gaps 
between SelectFile and PrintFile, then see if there is something special about 
that file (big, open, locked, active database, sparse, etc).
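
The gap hunt described above can be automated once the bpbkar log is on hand. This is a rough Python sketch under stated assumptions: the PrintFile layout is taken from the excerpt posted in this thread, while the SelectFile layout is only ASSUMED to mirror it, so adjust the regular expression to whatever your log actually shows. The function name and the 60-second threshold are illustrative.

```python
import re
from datetime import datetime, timedelta

# PrintFile layout as quoted in this thread:
#   04:52:42.061 [29233] 4 bpbkar PrintFile: /foo/bar/somefile.csv
# ASSUMPTION: SelectFile lines have the same shape.
LINE = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d{3}) \[\d+\] \d+ bpbkar (SelectFile|PrintFile): (.*)$"
)


def slow_files(log_lines, threshold_s=60.0):
    """Return (path, seconds) for files whose SelectFile-to-PrintFile gap
    exceeds threshold_s. Files selected but never finished get seconds=None."""
    started = {}
    slow = []
    for line in log_lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        kind, path = m.group(2), m.group(3).strip()
        if kind == "SelectFile":
            started[path] = stamp
        elif path in started:
            gap = stamp - started.pop(path)
            if gap < timedelta(0):
                gap += timedelta(days=1)  # log rolled past midnight
            if gap.total_seconds() > threshold_s:
                slow.append((path, gap.total_seconds()))
    # Anything still open is where bpbkar was when the log went quiet.
    slow.extend((path, None) for path in started)
    return slow
```

An entry with `None` for the duration is the most interesting one: it is a file that was started and never reported as done.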

When you ran the interactive bpbkar to /dev/null, you weren't doing 
compression, and it completed in just over an hour, while your scheduled run 
with compression was nearly 5 hours.  It's the -Z on the bpbkar call that 
tells bpbkar to do compression.  Compression is a double-edged sword.  I 
personally prefer to let the tape drive deal with it in most situations, 
although I might enable client compression to an undersized disk or disk 
staging storage unit.
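
To confirm whether a given run had client compression or TIR switched on, the logparams line that bpbkar writes at startup can be inspected. A small Python sketch, assuming the line has been unwrapped to a single line (mail clients wrap it); the function name and returned keys are illustrative, not part of NetBackup.

```python
import shlex


def policy_flags(logparams_line):
    """Pull compression, TIR and read-timeout settings out of a bpbkar logparams line."""
    # Everything after "logparams:" is the bpbkar command line as logged.
    args = shlex.split(logparams_line.split("logparams:", 1)[1])
    read_to = None
    if "-read_to" in args:
        read_to = int(args[args.index("-read_to") + 1])
    return {
        "compression": "-Z" in args,  # -Z is the client-compression switch
        "tir": "-tir" in args,
        "read_timeout_s": read_to,
    }
```

Run against the logparams line posted in this thread, it reports compression and TIR both enabled with a 3600 second read timeout.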

--- TTFN


Re: [Veritas-bu] same job keeps hanging

2007-07-13 Thread Aaron Mills
I gave a test run a shot.

Ran bpbkar -nocont -nfsok /foo/bar > /dev/null

This took roughly an hour and 18 minutes and completed successfully.
The actual policy ran last night and the same problem occurred as before:
the backup stalls and then bpbrm logs a timeout.

From bpbkar...

Backup starts at 00:15:23:

00:15:23.804 [29233] **LOCALE ERROR** locale en_US.ISO8859-1 not found
in file /usr/openv/msg/.conf
00:15:23.804 [29233] 4 bpbkar main: real locales
/C/en_US.ISO8859-15/en_US.ISO8859-15/en_US.ISO8859-15/C/en_US.ISO8859-1

00:15:23.804 [29233] 4 bpbkar main: standardized locales - lc_messages
C lc_ctype en_US.ISO8859-1 lc_time C lc_collate C lc_numeric
 C
00:15:23.806 [29233] 2 logparams: bpbkar -r 8035200 -ru root -dt 0 -to
36000 -clnt foo.bar.com -class inbound -sched ftpif -st
FULL -bpstart_to 300 -bpend_to 300 -read_to 3600 -tir -tir_plus -nfsok
-Z -b foo.bar.com_1184307320 -kl 5 -shm

Then the thing fails at 04:52:42 with zero errors:

04:52:42.060 [29233] 4 bpbkar compress_file: INF - Compression:   82%
/foo/bar/somefile.csv
04:52:42.061 [29233] 4 bpbkar PrintFile: /foo/bar/somefile.csv

That's it. No errors. Nothing. bpbrm then fails with a backup timeout
from the client.
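
For comparison with the manual run, the elapsed time of the scheduled run can be worked out from the two bpbkar timestamps quoted above. A small Python sketch; the helper name is illustrative, and the 1 hour 18 minute figure is the approximate one given for the test run.

```python
from datetime import datetime, timedelta


def elapsed(start, end):
    """Elapsed time between two HH:MM:SS.mmm log stamps from the same night."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # run crossed midnight
    return delta


scheduled = elapsed("00:15:23.804", "04:52:42.061")  # first and last bpbkar entries
manual = timedelta(hours=1, minutes=18)              # approximate, from the test run
print(scheduled)           # how long the scheduled run was active before going quiet
print(scheduled / manual)  # ratio of scheduled to manual run time
```

The scheduled run was active for a little over four and a half hours, roughly three and a half times as long as the manual read of the same directory.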

Could this have anything to do with other jobs being queued on the
master server? Is there any known issue with doing nfs backups on a
master server or something? I'm pretty much out of ideas.

-Aaron





Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]



Re: [Veritas-bu] same job keeps hanging.

2007-07-10 Thread ckstehman
If you are backing up files on a UNIX system, check whether there is a hung NFS 
mount.  I have had backups hang because the ls -l command hangs on the mount 
point.  Just a thought.

Carl Stehman
IT Distributed Services Team
Pepco Holdings, Inc.
202-331-6619
Pager 301-765-2703
[EMAIL PROTECTED]




Re: [Veritas-bu] same job keeps hanging.

2007-07-10 Thread Paul Keating
I've experienced the same.

Paul

-- 


 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf 
 Of David Rock
 Sent: July 9, 2007 5:12 PM
 To: veritas-bu@mailman.eng.auburn.edu
 Subject: Re: [Veritas-bu] same job keeps hanging.

 2. client on the other side of a firewall
 
 What was happening in our case was the backup would start, 
 one hour into
 the backup, the firewall would decide since it didn't see any traffic
 coming from the client to the master server, it would drop 
 the entry in
 the state table.  Then, one hour later, the client would try to send a
 keepalive packet through the now-defunct connection, fail, 
 retry several
 times, and then finally give up and die, taking the backup with it.




Re: [Veritas-bu] same job keeps hanging

2007-07-10 Thread ankur kumar


1. Check the bpbkar logs.
2. Add  VERBOSE = 5  to /usr/openv/netbackup/bp.conf on the client.
3. Create the /usr/openv/netbackup/logs/bpbkar directory
   if it doesn't already exist.
4. Run /usr/openv/netbackup/bin/bpbkar -nocont DIRECTORY > /dev/null
   against the directory where the file resides.

The above command will cause bpbkar to read the
directory and write the output to /dev/null instead of
disk or tape. Running bpbkar manually is a good method
to verify that bpbkar can read a file without doing an
actual backup of the client.  Any errors will be
logged to the bpbkar log directory.
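
If the read test needs to be repeated or timed, it can be wrapped in a small script. A minimal Python sketch, assuming bpbkar lives at the path given above; the function names are made up for illustration, and `-nfsok` is included only as an option because it is the flag used elsewhere in this thread for an NFS-mounted path.

```python
import subprocess
import time

BPBKAR = "/usr/openv/netbackup/bin/bpbkar"


def dry_run_command(directory, nfs=False):
    """Build the read-only bpbkar invocation described above."""
    cmd = [BPBKAR, "-nocont"]
    if nfs:
        cmd.append("-nfsok")  # as used in this thread for an NFS-mounted path
    cmd.append(directory)
    return cmd


def timed_dry_run(directory, nfs=False):
    """Run bpbkar with its output discarded; return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    proc = subprocess.run(dry_run_command(directory, nfs), stdout=subprocess.DEVNULL)
    return proc.returncode, time.monotonic() - start
```

Discarding stdout here plays the same role as sending the output to /dev/null on the command line.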

Calculus


   



[Veritas-bu] same job keeps hanging.

2007-07-09 Thread Aaron Mills
Hi all,

 

I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've
got a ~126GB backup that kicks off weekly, but hangs within a few hours
every time - the error I get is always "media manager terminated by
parent process" but the logs don't seem to show anything odd. No other
backups hang like this. This is also the only job that runs on the
server itself.

 

bptm gives me:

 

03:28:45.470 [4999] 2 io_ioctl: command (1)MTFSF 1 from (bptm.c.8307)
on drive index 1

03:28:45.530 [4999] 2 io_close: closing
/usr/openv/netbackup/db/media/tpreq/AK6503, from bptm.c.8310

03:28:45.530 [4999] 2 catch_signal: EXITING with status 82

 

so I check bpbrm:

 

02:05:33.882 [4992] 2 bpbrm spawn_child: /usr/openv/netbackup/bin/bptm
bptm -w -c foo.bar.com -den 17 -rt 6 -rn 0 -stunit Spectra2 -cl inbound
-bt 1183968330 -b foo.bar.com _1183968330 -st 0 -cj 1 -p inbound
-hostname foo.bar.com -ru root -rclnt foo.bar.com -rclnthostname
foo.bar.com -rl 5 -rp 8035200 -sl ftpif -ct 0 -maxfrag 1048576 -tir -v
-Z -mediasvr foo.bar.com -jobid 117926 -jobgrpid 117926 -masterversion
51 -shm

02:05:33.884 [4992] 2 bpbrm write_continue_backup: wrote CONTINUE
BACKUP on COMM_SOCK 4

02:05:33.884 [4992] 2 bpbrm main: wrote /na270/pub/inbound on
COMM_SOCK

02:05:33.884 [4992] 2 bpbrm main: wrote /na270/pub/ftp on COMM_SOCK

02:05:33.884 [4992] 2 bpbrm main: wrote CONTINUE on COMM_SOCK

02:05:33.885 [4992] 2 bpbrm main: ESTIMATE -1 -1 nbu0 foo.bar.com
_1183968330

02:09:44.763 [4992] 2 bpbrm mm_sig: received ready signal from media
manager

02:09:44.763 [4992] 2 bpbrm readline: retrying partial read from fgets
::

03:27:22.261 [4992] 2 bpbrm sighandler: signal 14 caught by bpbrm

03:27:22.272 [4992] 2 bpbrm sighandler: bpbrm timeout after 3600
seconds

03:27:22.287 [4992] 2 clear_held_signals: clearing signal mask stack,
mask_stack_depth = 0

03:27:22.287 [4992] 2 bpbrm kill_child_process: start

03:27:22.287 [4992] 2 bpbrm wait_for_child: start

03:28:48.546 [4992] 2 bpbrm wait_for_child: child exit_status = 82
signal_status = 0

03:28:48.557 [4992] 2 inform_client_of_status: INF - Server status =
41

 

but I can't seem to figure out why there was a timeout. I checked all
the related logs - bpbkar just shows file writing stopping at 2:42am -
like the process just hangs there, no errors though. Looking right now,
the bpbrm and bpbkar processes for this backup are still running, but
nothing is happening. The job shows as active and everything is queueing
up behind it.  I've also adjusted the CLIENT_READ_TIMEOUT in
/usr/openv/netbackup/bp.conf to no avail.
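
The bpbrm excerpt above can also be scanned for the timeout line mechanically, which gives a time window to line up against the bpbkar log. A hedged Python sketch: it assumes unwrapped log lines, and it assumes the timer restarts whenever data arrives from the client, so that subtracting the timeout from the time it fired approximates when bpbrm last heard anything. The function name is illustrative.

```python
import re
from datetime import datetime, timedelta

# Matches the sighandler line quoted above, e.g.
#   03:27:22.272 [4992] 2 bpbrm sighandler: bpbrm timeout after 3600 seconds
TIMEOUT = re.compile(
    r"^(\d\d:\d\d:\d\d)\.\d{3} \[\d+\] \d+ bpbrm sighandler: bpbrm timeout after (\d+)"
)


def silent_since(bpbrm_lines):
    """Return (fired_at, timeout_s, window_start) for the first timeout line, or None."""
    for line in bpbrm_lines:
        m = TIMEOUT.match(line.strip())
        if m:
            fired = datetime.strptime(m.group(1), "%H:%M:%S")
            timeout_s = int(m.group(2))
            # ASSUMPTION: the timer counts from the last data received.
            start = fired - timedelta(seconds=timeout_s)
            return fired.strftime("%H:%M:%S"), timeout_s, start.strftime("%H:%M:%S")
    return None
```

The returned window start is the point from which the bpbkar log is worth reading closely.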

 

Can anyone point me in the right direction as to what I'm missing? I'm
guessing there's something I'm not seeing in one of the logs.

 

-Aaron

 

Aaron Mills

Systems Administrator

Return Path, Inc.

http://www.returnpath.net

[EMAIL PROTECTED]

 

 



Re: [Veritas-bu] same job keeps hanging.

2007-07-09 Thread Liddle, Stuart
So, are you trying to back up a filesystem with lots and lots of small
files?  If so, remember that NetBackup will try to enumerate all of the
files that you are trying to back up.  We had a similar situation where we
were trying to back up a filesystem with 3.5 million files in 50,000
directories.  It took hours to do a filelist of all of that... consequently,
it timed out. 

 

 

Symantec told us the best solution for that particular directory was NDMP
(since the timeouts are much longer).

 

 

OR...I suppose you could up the timeout value to more than 3600 seconds and
see what happens.

 



Re: [Veritas-bu] same job keeps hanging.

2007-07-09 Thread Justin Piszcz

On Mon, 9 Jul 2007, Aaron Mills wrote:

 Hi all,



 I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've
 got a ~126GB backup that kicks off weekly, but hangs within a few hours
 every time - the error I get is always media manager terminated by
 parent process but the logs don't seem to show anything odd. No other
 backups hang like this. This is also the only job that runs on the
 server itself.


Can you back up other directories, such as /etc and /var, with no problem?



Re: [Veritas-bu] same job keeps hanging.

2007-07-09 Thread Aaron Mills
Well, we have incremental backups that run on this filesystem - they
seem to run fine. I wonder if it isn't a number of files issue as Stuart
Liddle suggested...


Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]
 



Re: [Veritas-bu] same job keeps hanging.

2007-07-09 Thread Aaron Mills
Actually, I've done that before with the same results. We upped the
timeouts to around 10,000 seconds to no avail. It's as though at some
point the backups just hang for no good reason. 

 

A quick find shows that I'm backing up roughly 29,000 files - that
shouldn't take too long to enumerate, should it? 

 

 

Aaron Mills

Systems Administrator

Return Path, Inc.

http://www.returnpath.net

[EMAIL PROTECTED]

 





Re: [Veritas-bu] same job keeps hanging.

2007-07-09 Thread David Rock
* Aaron Mills [EMAIL PROTECTED] [2007-07-09 16:39]:
 Hi all,
 
 I'm hoping someone's seen this before. I'm running 5.1MP6 w/ AIT3 - I've
 got a ~126GB backup that kicks off weekly, but hangs within a few hours
 every time - the error I get is always media manager terminated by
 parent process but the logs don't seem to show anything odd. No other
 backups hang like this. This is also the only job that runs on the
 server itself.

When you say "runs on the server itself", what do you actually mean?  We
saw an odd timeout that always happened at the same time into the
backup, but the specific circumstances were:

1. a bpbackup command running on a client system
2. client on the other side of a firewall

What was happening in our case was the backup would start, one hour into
the backup, the firewall would decide since it didn't see any traffic
coming from the client to the master server, it would drop the entry in
the state table.  Then, one hour later, the client would try to send a
keepalive packet through the now-defunct connection, fail, retry several
times, and then finally give up and die, taking the backup with it.
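
The firewall behaviour described above can be modelled in a few lines to see when a quiet connection would lose its state entry. This is a toy Python sketch, not a description of any particular firewall: the function name, the 3600 second idle timeout and the packet times in the usage note are all illustrative assumptions.

```python
def first_state_drop(packet_times_s, idle_timeout_s=3600):
    """Return the time (seconds from start) at which an idle-timeout firewall
    would drop the connection's state entry, or None if no gap between
    observed packets exceeds the timeout.

    packet_times_s: sorted times at which the firewall saw traffic.
    """
    for prev, nxt in zip(packet_times_s, packet_times_s[1:]):
        if nxt - prev > idle_timeout_s:
            # Entry expires idle_timeout_s after the last packet it saw.
            return prev + idle_timeout_s
    return None
```

With traffic at 0 and 10 seconds and the next packet (say, a keepalive) at 7200 seconds, the entry is gone at 3610 seconds, so the keepalive arrives on a connection the firewall no longer knows about.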

This may not be anything like what you are dealing with, but it is a
pretty good example of how things other than NBU can cause weird things
to happen and make it look like NBU is the cause.  Does your job always
die at the same time, or does it vary from attempt to attempt?

-- 
David Rock
[EMAIL PROTECTED]


Re: [Veritas-bu] same job keeps hanging.

2007-07-09 Thread Aaron Mills
Anecdotally - it doesn't always die at the same time, but roughly an
hour or two into the job. I never actually looked to see if it was
within a few minutes, but the symptom is always the same: "daemon
terminated by parent process", "bpbrm timeout after 3600 seconds".

Something seems to be causing the client process to get stuck, for lack
of a better word.

As to the server - the job runs on the NBU server itself. I have an NFS
mount hanging off it that I'm backing up. I've checked /var/adm/messages
and I don't see anything weird happening at the time the backup fails
(mount going stale, etc.), either. 


Aaron Mills
Systems Administrator
Return Path, Inc.
http://www.returnpath.net
[EMAIL PROTECTED]
 
