Re: [Veritas-bu] Bpjobd and other failures.

2010-01-21 Thread Lightner, Jeff
My coworker here modified one of the key files for NetBackup 6 once but
didn't bother to bounce NetBackup to see what effect his change had.

 

Later he went on vacation and I bounced NetBackup while dealing with a
fairly minor problem.  It didn't come back up, and after spending 24
hours on the phone with NetBackup support I figured out what the problem
was.  Support never suggested looking at this file - I just happened
to notice the file date was newer than the previous NetBackup bounce
while I was on one of my interminable holds with them.  What was worse
is they couldn't tell me what the fields in the file he changed were for
even after I discovered it.  Luckily I'd taught him at a previous job
to always save originals when editing files for quick back-out, so I was
able to revert.
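
A minimal sketch of that "save the original first" habit, in Python. The
helper names (save_original, revert) and the timestamped .orig suffix are
illustrative assumptions, not anything NetBackup provides, and the path in
the usage note is hypothetical:

```python
import shutil
import time
from pathlib import Path


def save_original(path, suffix=None):
    """Copy `path` to a sibling backup before editing it; return the backup path."""
    src = Path(path)
    if suffix is None:
        # Assumed naming scheme: <file>.orig-YYYYMMDD-HHMMSS
        suffix = time.strftime(".orig-%Y%m%d-%H%M%S")
    backup = src.with_name(src.name + suffix)
    if backup.exists():
        raise FileExistsError(f"refusing to overwrite existing backup {backup}")
    shutil.copy2(src, backup)  # copy2 also keeps the original modification time
    return backup


def revert(path, backup):
    """Put the saved original back in place of the edited file."""
    shutil.copy2(backup, path)
```

Typical use would be backup = save_original("/path/to/some.conf") before
the edit and revert("/path/to/some.conf", backup) if the next bounce goes
badly.  Because copy2 preserves the timestamp, the backup also shows when
the file was last known good.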

 

Now any time he says "I'm going to..." and it is heading into a weekend
I'm on call I tell him "No, you're not!"   He's a pretty smart guy but I
figure he should have the ...um...joy of dealing with the results of his
experiments.  

 




Re: [Veritas-bu] Bpjobd and other failures.

2010-01-21 Thread WEAVER, Simon (external)
Jeff
Good idea about not touching anything again...
 
One colleague, a few years back, did a firmware upgrade on a tape
library on a Friday and went on a few weeks' vacation.

On the Monday, a new person, totally unaware of the backups, found every
single one was failing.  It took 2 weeks to resolve.

"Never make big changes on the last day of the week, or before you go on
vacation" springs to mind :-)
Enjoy the break !
 
Simon




Re: [Veritas-bu] Bpjobd and other failures.

2010-01-20 Thread Jeff Cleverley
Justin,

Thanks for the reply.  For whatever reason things seem to have magically
started working again.  All I did was shut down Veritas (again), turn up
the verbosity in bp.conf, and restart it.  When it first started I still
didn't have bpdbm, bpjobd, etc. running.  The vnetd log had a lot of
errors.  When I ran bpdbjobs from the command line, nothing came back.

While looking through the bpdbm log I found no errors but a lot of entries
that looked like it was doing backups.  About 10 minutes later I ran
bpdbjobs again and everything showed up and some jobs were running.  I
think these are restarts of some failed jobs, so we'll see how they do.
So far 4 of them have completed successfully.

Since I leave the country on vacation tomorrow morning I don't plan on
touching anything else on it today :-)

Thanks again for the help.

Jeff



-- 
Jeff Cleverley
Unix Systems Administrator
4380 Ziegler Road
Fort Collins, Colorado 80525
970-288-4611
___
Veritas-bu maillist  -  Veritas-bu@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu


Re: [Veritas-bu] Bpjobd and other failures.

2010-01-20 Thread Justin Piszcz
Hi,

Taking a shot in the dark here: for the TCP issues, try adding

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1

to your /etc/sysctl.conf, then reboot.

For vnetd, check your /etc/xinetd.d/vnetd*
Also check the logs to make sure xinetd is not throttling connections;
that can happen if too many servers are trying to back up too fast.
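
Before and after changing /etc/sysctl.conf it can help to see what the
running kernel actually has.  A rough Python sketch that reads the two
settings from /proc/sys; the function names are made up for illustration.
Note that tcp_tw_recycle was removed in Linux 4.12, so on newer kernels
the sketch reports it as not available rather than failing:

```python
from pathlib import Path

# The two settings suggested above, as dotted sysctl names.
SETTINGS = ("net.ipv4.tcp_tw_reuse", "net.ipv4.tcp_tw_recycle")


def read_sysctl(name, proc_sys="/proc/sys"):
    """Return the current value of a sysctl as a string, or None if it is not exposed."""
    # net.ipv4.tcp_tw_reuse maps to /proc/sys/net/ipv4/tcp_tw_reuse
    path = Path(proc_sys).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report(proc_sys="/proc/sys"):
    """Map each setting name to its current value (None when unavailable)."""
    return {name: read_sysctl(name, proc_sys) for name in SETTINGS}


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name} = {value if value is not None else 'not available'}")
```

The proc_sys parameter exists only so the functions can be pointed at a
test directory; on a real host the default is what you want.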

Justin.



Re: [Veritas-bu] Bpjobd and other failures.

2010-01-19 Thread Jeff Cleverley
Greetings,

While continuing to work on this, it seems there may be issues with vnetd.
Running netstat -a | grep vnet shows this:

tcp        0      0 *:vnetd                      *:*                          LISTEN
tcp        0      0 sgpbkp04.sgp.avagotec:35781  agt604.sgp.avagotech.:vnetd  ESTABLISHED
tcp        0      0 sgpbkp04.sgp.avagotec:35720  sgpbkp04.sgp.avagotec:vnetd  ESTABLISHED
tcp        0      0 sgpbkp04.sgp.avagotec:vnetd  sgpbkp04.sgp.avagotec:35720  ESTABLISHED
tcp        0      0 localhost.localdomain:vnetd  sgpbkp04.sgp.avagotec:35846  TIME_WAIT
tcp        0      0 localhost.localdomain:vnetd  sgpbkp04.sgp.avagotec:35853  TIME_WAIT
tcp        0      0 localhost.localdomain:vnetd  sgpbkp04.sgp.avagotec:35839  TIME_WAIT
unix  2      [ ACC ]     STREAM     LISTENING     146403 /usr/openv/var/vnetd/vmd.uds
unix  2      [ ACC ]     STREAM     LISTENING     145874 /usr/openv/var/vnetd/bpcompatd.uds
unix  2      [ ACC ]     STREAM     LISTENING     146786 /usr/openv/var/vnetd/tldcd.uds
unix  3      [ ]         STREAM     CONNECTED     152574 /usr/openv/var/vnetd/bpcompatd.uds

The TIME_WAIT entries seem to stick around a long time.  I've restarted
xinetd on the system and we have rebooted, but things are still wedged.
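
To watch whether those TIME_WAIT entries are really piling up, rather
than just being the normal leftovers of closed connections, one option is
to count states over time.  A small Python sketch, assuming the usual
six-column TCP rows of netstat -a (proto, Recv-Q, Send-Q, local address,
foreign address, state); the function name is made up for illustration:

```python
from collections import Counter


def count_tcp_states(netstat_output, service="vnetd"):
    """Count TCP states for rows of `netstat -a` output that involve `service`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue  # skips unix-socket rows, headers and blank lines
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(":" + service) or foreign.endswith(":" + service):
            states[state] += 1
    return states
```

Feeding it the saved output of netstat -a every minute or so and
comparing the TIME_WAIT count between runs would show whether the number
keeps growing or stays roughly flat.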

Thanks,

Jeff



-- 
Jeff Cleverley
Unix Systems Administrator
4380 Ziegler Road
Fort Collins, Colorado 80525
970-288-4611


[Veritas-bu] Bpjobd and other failures.

2010-01-19 Thread Jeff Cleverley
Greetings,

Our environment is NB 6.5.1 on a RHEL4 server.  It also has an HP-UX SAN
media server.  All other clients are backed up over the network.  Most are
RHEL 4.x.

The tape library in our Singapore office failed over the weekend and caused
a lot of things to fail and stay wedged.  Some jobs seemed to have run, but
some failed with errors 13, 63, and 233.  This varied across policies.  I
decided to try and restart all processes and get things cleaned up.  This
hasn't worked well.

When I start everything using service netbackup start or
/etc/init.d/netbackup start, everything looks OK.  But when I check with
bpps -a, I notice that bpjobd isn't running anymore.  When I try to
start it manually it fails, saying "File size limit exceeded".  bpdbjobs
returns no output.  I haven't been able to figure out which file it is
complaining about.
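
"File size limit exceeded" is the text the shell prints when a process is
killed by SIGXFSZ, which is sent when it tries to grow a file past its
file size limit (RLIMIT_FSIZE, what ulimit -f shows).  A hedged Python
sketch for narrowing down the file: it prints the current soft limit and
lists unusually large files under a directory.  The directory and the
1 GiB threshold in the example are assumptions chosen for illustration,
not values from NetBackup documentation:

```python
import os
import resource
import stat


def file_size_limit():
    """Return the soft RLIMIT_FSIZE in bytes, or None when it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return None if soft == resource.RLIM_INFINITY else soft


def large_files(root, threshold):
    """Yield (size, path) for regular files under `root` of at least `threshold` bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_size >= threshold:
                yield st.st_size, path


if __name__ == "__main__":
    print("soft file size limit:", file_size_limit())
    # Assumed location and threshold; adjust both for the system at hand.
    for size, path in sorted(large_files("/usr/openv/netbackup/db", 2**30), reverse=True):
        print(f"{size:>14}  {path}")
```

If the limit is unlimited, a file sitting close to 2 GiB would be the
next thing to look for, on the guess that the daemon cannot write past
that size.  That is only one possible cause.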

I'm sure I have a lot of things that need to be cleaned up.  There are a lot
of files in the restart and trylogs.  I was thinking it was safe to move
those out of the way but wanted to make sure.

Any help on tracking the bpjobd error along with advice on cleaning up all
the restart and trylogs would be appreciated.  Naturally I'm leaving on
vacation Thursday so I need to help clean this up before I go.  I won't be
doing any replies to this after Wednesday night because of that.

Thanks,

Jeff

-- 
Jeff Cleverley
Unix Systems Administrator
4380 Ziegler Road
Fort Collins, Colorado 80525
970-288-4611