One quick answer ==> Remove the nidlog and stdout directories from the
replicated partition.
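
As a rough illustration of why this could matter (a sketch only, not
OpenSAF code): if the thread that answers the health check also writes to
a partition that is temporarily unavailable, the answer is delayed until
the write returns. The timeout value, the stall time, and all function
names below are hypothetical and chosen only to make the effect visible.

```python
import queue
import threading
import time

HEALTHCHECK_TIMEOUT = 0.2  # hypothetical deadline, scaled down from seconds
BLOCKED_IO_TIME = 0.5      # hypothetical stall while the partition is unavailable


def blocking_write(record):
    """Stand-in for a write to a partition that is temporarily unavailable."""
    time.sleep(BLOCKED_IO_TIME)


def respond_in_time(handle_record):
    """Handle one record on the dispatch thread, then check the deadline."""
    start = time.monotonic()
    handle_record("log line")
    # The health check response could only be sent at this point.
    return (time.monotonic() - start) <= HEALTHCHECK_TIMEOUT


def inline_io(record):
    # I/O performed directly on the dispatch thread.
    blocking_write(record)


io_queue = queue.Queue()


def io_worker():
    # A separate thread drains the queue and absorbs the stall.
    while True:
        record = io_queue.get()
        if record is None:
            break
        blocking_write(record)


def queued_io(record):
    # The dispatch thread only enqueues, so it returns immediately.
    io_queue.put(record)


worker = threading.Thread(target=io_worker, daemon=True)
worker.start()

inline_ok = respond_in_time(inline_io)  # misses the deadline
queued_ok = respond_in_time(queued_io)  # meets the deadline

io_queue.put(None)  # tell the worker to stop
worker.join()

print("inline I/O answered in time:", inline_ok)
print("queued I/O answered in time:", queued_ok)
```

Moving the directories to a local partition avoids the stall altogether;
the worker thread above only shows the other way to keep the dispatch
thread responsive.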

One quick question ==> How are you managing DRBD switchover without
PDRBD?

-Nagendra

> -----Original Message-----
> From: Hans Feldt [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 31, 2007 5:13 PM
> To: Hans Feldt; Kumar Nagendra-G20235; [email protected]
> Subject: RE: [Users] Help with controller fail-over issues?
> 
> Could the problem be because in my system I have the nidlog & 
> stdout directories on the replicated partition?
> I could change that and give it a try.
> 
> /Hans
> 
> > -----Original Message-----
> > From: Hans Feldt
> > Sent: den 31 oktober 2007 12:39
> > To: Hans Feldt; Kumar Nagendra-G20235; [email protected]
> > Subject: RE: [Users] Help with controller fail-over issues?
> > 
> > Yep. I extended the healthcheck period and duration for MQD a lot
> > (20s and 10s), and fail-over works.
> > 
> > But that is not good, so which MQD thread is executing the healthcheck
> > dispatch, and does it do I/O in the same thread?
> > 
> > If my theory is right, no I/O can be performed in that thread.
> > 
> > Thanks,
> > Hans
> > 
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED] 
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> > > Sent: den 31 oktober 2007 12:14
> > > To: Kumar Nagendra-G20235; [email protected]
> > > Subject: Re: [Users] Help with controller fail-over issues?
> > > 
> > > Thanks, see below.
> > > 
> > > /Hans
> > > 
> > > > -----Original Message-----
> > > > From: Kumar Nagendra-G20235 [mailto:[EMAIL PROTECTED]
> > > 
> > > >        The log string "faulted due to 6 -rcvr=9" translates to:
> > > > 6 being AVND_ERR_SRC_CBK_HC_TIMEOUT (AMF health check callback
> > > > timed out) and 9 being AVSV_ERR_RCVR_SU_FAILOVER. It says that
> > > > since the component was not able to respond to the health check,
> > > > the SU of that component failed. This will only happen when the
> > > > system is heavily loaded and MAS and other components are not
> > > > getting time to respond within 2 sec (the health check value of
> > > > the MAS and MQD components), especially in a PC environment. You
> > > > need to fine-tune these components in the BOM file as per your
> > > > configuration. It may happen during failover.
> > > 
> > > But what if the component is "hanging" in an I/O operation?
> > > An operation that will not succeed before the healthcheck timeout,
> > > since the replicated partition is not available.
> > > 
> > > I could try to increase the healthcheck timeout for MQD a lot and
> > > see what happens.
> > > 
> > > > There shouldn't be any connection between DRBD and OpenSAF
> > > > failover timings.
> > > 
> > > But the replicated partition is unavailable for some time!
> > > OpenSAF configuration (pssv_store) and logs are stored there.
> > > A lot of writing to the replicated partition will take place
> > > during fail-over, I assume. I think there is a connection between
> > > DRBD and OpenSAF fail-over.
> > > 
> > > > Actually, OpenSAF has control over DRBD for failover using PDRBD.
> > > 
> > > Not currently in our setup.
> > > 
> > > > Send all the logs using the script attached.
> > > 
> > > I have no interesting logs to send. The DTS logs are empty from
> > > the time of failover (since the replicated partition is unavailable?).
> > > 
> > > > I would like to know what is your approach to test failover.
> > > 
> > > I just did 'pkill cpd' on the active controller.
> > > 
> > > _______________________________________________
> > > Users mailing list
> > > [email protected]
> > > http://list.opensaf.org/maillist/listinfo/users
> > > 
> 