One quick answer ==> Remove nidlog and stdout from the replicated partition.
One quick question ==>
1. How are you managing DRBD switchover without PDRBD?

-Nagendra

> -----Original Message-----
> From: Hans Feldt [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 31, 2007 5:13 PM
> To: Hans Feldt; Kumar Nagendra-G20235; [email protected]
> Subject: RE: [Users] Help with controller fail-over issues?
>
> Could the problem be because in my system I have the nidlog &
> stdout directories on the replicated partition?
> I could change that and give it a try.
>
> /Hans
>
> > -----Original Message-----
> > From: Hans Feldt
> > Sent: 31 October 2007 12:39
> > To: Hans Feldt; Kumar Nagendra-G20235; [email protected]
> > Subject: RE: [Users] Help with controller fail-over issues?
> >
> > Yep. I extended the healthcheck period and duration for MQD a lot
> > (20s and 10s), and fail-over works.
> >
> > But that is not good, so which MQD thread executes the healthcheck
> > dispatch, and does it do I/O in the same thread?
> >
> > If my theory is right, no I/O can be performed in that thread.
> >
> > Thanks,
> > Hans
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> > > Sent: 31 October 2007 12:14
> > > To: Kumar Nagendra-G20235; [email protected]
> > > Subject: Re: [Users] Help with controller fail-over issues?
> > >
> > > Thanks, see below.
> > >
> > > /Hans
> > >
> > > > -----Original Message-----
> > > > From: Kumar Nagendra-G20235 [mailto:[EMAIL PROTECTED]
> > > >
> > > > The log string "faulted due to 6 -rcvr=9" translates to
> > > > - 6 being AVND_ERR_SRC_CBK_HC_TIMEOUT (AMF health check callback
> > > > times out), 9 being AVSV_ERR_RCVR_SU_FAILOVER. It says that since
> > > > the component was not able to respond to the health check, the SU
> > > > of that component failed. This will only happen when the system is
> > > > heavily loaded and MAS and other components are not getting time
> > > > to respond in 2 sec (health check value of MAS, MQD components),
> > > > especially in a PC environment. You need to fine-tune these
> > > > components in the BOM file as per your configuration. It may
> > > > happen during failover.
> > >
> > > But what if the component is "hanging" in an I/O operation?
> > > An operation that will not succeed before the healthcheck timeout
> > > since the replicated partition is not available.
> > >
> > > I could try to increase the healthcheck timeout for MQD a lot and
> > > see what happens.
> > >
> > > > There shouldn't be any connection between DRBD and OpenSAF
> > > > failover timings.
> > >
> > > But the replicated partition is unavailable for some time!
> > > OpenSAF configuration (pssv_store) and logs are stored there.
> > > A lot of writing to the replicated partition will take place during
> > > fail-over I assume. I think there is a connection between DRBD and
> > > OpenSAF fail-over.
> > >
> > > > Actually, OpenSAF has control over DRBD for failover using PDRBD.
> > >
> > > Not currently in our setup.
> > >
> > > > Send all the logs using the script attached.
> > >
> > > I have no interesting logs to send. The DTS logs are empty from the
> > > time of failover (since the replicated partition is unavailable?).
> > >
> > > > I would like to know what is your approach to test failover.
> > >
> > > I just did 'pkill cpd' on the active controller.
> > >
> > > _______________________________________________
> > > Users mailing list
> > > [email protected]
> > > http://list.opensaf.org/maillist/listinfo/users
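
For anyone scanning logs for the same symptom, here is a minimal sketch of decoding the "faulted due to 6 -rcvr=9" string discussed above. It is not OpenSAF code. The helper name decode_fault and the two lookup tables are hypothetical, and the tables contain only the two codes explained in this thread (6 and 9); any other value is reported as unknown rather than guessed.

```python
import re

# Partial, illustrative tables: only the codes named in this thread.
# Other values exist in OpenSAF but are deliberately not listed here.
ERROR_SOURCES = {6: "AVND_ERR_SRC_CBK_HC_TIMEOUT"}  # AMF health check callback timed out
RECOVERIES = {9: "AVSV_ERR_RCVR_SU_FAILOVER"}       # SU failover

# Matches the fragment quoted in the thread: faulted due to <src> -rcvr=<rcvr>
_FAULT_RE = re.compile(r"faulted due to\s+(\d+)\s+-rcvr=(\d+)")


def decode_fault(line):
    """Return (error_source, recovery) names for a log line, or None if no match."""
    match = _FAULT_RE.search(line)
    if match is None:
        return None
    src = int(match.group(1))
    rcvr = int(match.group(2))
    return (
        ERROR_SOURCES.get(src, "unknown error source %d" % src),
        RECOVERIES.get(rcvr, "unknown recovery %d" % rcvr),
    )


if __name__ == "__main__":
    print(decode_fault("component faulted due to 6 -rcvr=9"))
```

A line that does not contain the fragment returns None, so the helper can be applied to every line of a log file and the non-matching ones filtered out.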
