Yep. I extended the healthcheck period and duration for MQD a lot (20s and 10s), and fail-over works.
But that is not a good solution. So which MQD thread executes the healthcheck dispatch, and does it do I/O in the same thread? If my theory is right, no I/O can be performed in that thread.

Thanks,
Hans

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Hans Feldt
> Sent: 31 October 2007 12:14
> To: Kumar Nagendra-G20235; [email protected]
> Subject: Re: [Users] Help with controller fail-over issues?
>
> Thanks, see below.
>
> /Hans
>
> > -----Original Message-----
> > From: Kumar Nagendra-G20235 [mailto:[EMAIL PROTECTED]
> >
> > The log string "faulted due to 6 -rcvr=9" translates to
> > - 6 being AVND_ERR_SRC_CBK_HC_TIMEOUT (AMF health check callback
> > times out), 9 being AVSV_ERR_RCVR_SU_FAILOVER. It says that since
> > the component was not able to respond to the health check, the SU
> > of that component failed. This will only happen when the system is
> > heavily loaded and MAS and other components are not getting time to
> > respond within 2 sec (the health check value of the MAS and MQD
> > components), especially in a PC environment. You need to fine-tune
> > these components in the BOM file as per your configuration. It may
> > happen during failover.
>
> But what if the component is "hanging" in an I/O operation?
> An operation that will not succeed before the healthcheck
> timeout since the replicated partition is not available.
>
> I could try to increase the healthcheck timeout for MQD a lot
> and see what happens.
>
> > There shouldn't be any connection between DRBD and OpenSAF
> > failover timings.
>
> But the replicated partition is unavailable for some time!
> OpenSAF configuration (pssv_store) and logs are stored there.
> A lot of writing to the replicated partition will take place
> during fail-over, I assume. I think there is a connection
> between DRBD and OpenSAF fail-over.
>
> > Actually, OpenSAF has control over DRBD for failover using PDRBD.
>
> Not currently in our setup.
>
> > Send all the logs using the script attached.
>
> I have no interesting logs to send. The DTS logs are empty
> from the time of failover (since the replicated partition is
> unavailable?).
>
> > I would like to know what is your approach to test failover.
>
> I just did 'pkill cpd' on the active controller.
>
> _______________________________________________
> Users mailing list
> [email protected]
> http://list.opensaf.org/maillist/listinfo/users
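The theory raised in this thread, that one thread both dispatches health check callbacks and performs I/O that can hang, can be illustrated with a small sketch. It does not use the OpenSAF or AMF API and says nothing about how MQD is actually implemented. All names and values (`HEALTHCHECK_TIMEOUT`, `BLOCKED_IO_TIME`, `blocking_write`, `time_to_answer_healthcheck`) are hypothetical, and the timings are scaled down so the example runs quickly. It only shows that a health check queued behind a blocking write in the same thread is answered late, while handing the write to a worker thread keeps the answer within the timeout.

```python
import threading
import time

# Hypothetical, scaled-down values (not OpenSAF defaults).
HEALTHCHECK_TIMEOUT = 0.05  # stands in for the health check timeout
BLOCKED_IO_TIME = 0.20      # stands in for a write that hangs during fail-over


def blocking_write():
    """Stand-in for a write to a partition that is temporarily unavailable."""
    time.sleep(BLOCKED_IO_TIME)


def time_to_answer_healthcheck(offload_io):
    """Process a write event followed by a health check event in one
    dispatch loop and return how long the health check waited."""
    workers = []
    answered_after = None
    start = time.monotonic()
    for event in ("write", "healthcheck"):
        if event == "write":
            if offload_io:
                # Hand the blocking I/O to a worker; the dispatch loop moves on.
                worker = threading.Thread(target=blocking_write)
                worker.start()
                workers.append(worker)
            else:
                # Blocking I/O runs inline and stalls the dispatch loop.
                blocking_write()
        else:
            # The health check is answered only when the loop reaches it.
            answered_after = time.monotonic() - start
    for worker in workers:
        worker.join()
    return answered_after


if __name__ == "__main__":
    for offload in (False, True):
        latency = time_to_answer_healthcheck(offload)
        verdict = "timeout" if latency > HEALTHCHECK_TIMEOUT else "ok"
        print(f"offload_io={offload}: answered after {latency:.3f}s ({verdict})")
```

If the real component behaves like the inline case, raising the health check timeout only hides the stall. Moving the I/O out of the dispatch thread would remove it.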
