Re: [SunRay-Users] Restart of one server in failover group causes the whole group to have a downtime...

Jens Langner Tue, 03 Mar 2009 06:11:47 -0800

Hi Bob,

Bob Doolittle wrote:


[...]
>> Here I can perfectly reproduce the problem by shutting down one server
>> in a failover group. Afterwards on all other servers in the group the
>> "utgstatus" command returns the following information:
>>
>> r...@saturn:~# utgstatus
>> Error: Could not get gstatus information from server saturn
>>
>> Unfortunately, I haven't had time yet to debug this any further by
>> increasing the debug level. But I would like to ask here whether anyone
>> else has the same trouble, whether this is a known issue, and whether
>> there is a fix for this strange behaviour.
>>
> 
> Do you use card registration? Do you delete registrations regularly?
> 
> We saw an issue like this some time ago when a large site did frequent
> registration deletions. The Sun Ray Data Store's data got fragmented and
> the database indexing became poor. This meant that when a server was
> shut down, and all Sun Rays connected to it attempted to connect to the
> remaining servers, there was a large volume of connections which
> resulted in a large volume of SRDS lookups, which took a long time due
> to the poor indexing. The Sun Rays eventually timed out, then
> reconnected, making the problem worse by adding more lookups to the
> queue. IIRC the lookups may even have starved heartbeat processing,
> causing other Sun Rays to disconnect and attempt to reconnect.
> 
> This is CR 6540012: "SRDS DBM files need periodic reindexing". Since
> we've never encountered the problem again this CR hasn't gotten a high
> priority. It would be good to know if this is your problem.
> 
> Please let us know if you do card registration and if you frequently
> delete old registrations. If so, I can send you a procedure that was
> used at the time to re-index SRDS and we can see if that resolves your
> problem. If it does we can investigate re-adjusting the priority on that
> defect report.

No, we are currently not using any card or token registration here.
Several months ago we had that enabled but since we switched all our
servers to 4.1 and are also largely using kiosk mode we haven't
reenabled any card/token registration yet.
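
Just to make sure I understand the failure mode you describe: below is a
toy model of it, a sketch only. It does not reflect how SRDS or the Sun Ray
firmware actually work; the function name, the FIFO queue and all numbers
are made up for illustration. Each client sends one lookup, retries when
its timeout expires, and the server still has to work through the stale
requests left in the queue.

```python
from collections import deque


def simulate_failover(clients, lookups_per_tick, timeout_ticks, max_ticks=10_000):
    """Toy model (hypothetical): return (ticks_until_all_connected, total_lookups_queued).

    ticks_until_all_connected is None if some clients are still waiting
    after max_ticks.
    """
    # Every client sends one lookup at tick 0: (client id, tick sent).
    queue = deque((c, 0) for c in range(clients))
    # Tick of the latest request for each client that is still waiting.
    last_sent = {c: 0 for c in range(clients)}
    queued = clients

    for tick in range(1, max_ticks + 1):
        # The server answers a fixed number of lookups per tick, in FIFO order.
        for _ in range(min(lookups_per_tick, len(queue))):
            client, sent = queue.popleft()
            # Only the answer to a client's latest request connects it;
            # answers to requests that already timed out are wasted work.
            if last_sent.get(client) == sent:
                del last_sent[client]
        if not last_sent:
            return tick, queued
        # Clients whose latest request has timed out send a new lookup.
        for client, sent in list(last_sent.items()):
            if tick - sent >= timeout_ticks:
                last_sent[client] = tick
                queue.append((client, tick))
                queued += 1
    return None, queued


# Fast lookups: everybody reconnects before any timeout fires.
print(simulate_failover(clients=100, lookups_per_tick=50, timeout_ticks=5))
# Slow lookups: retries keep refilling the queue with requests that go stale.
print(simulate_failover(clients=100, lookups_per_tick=10, timeout_ticks=5, max_ticks=100))
```

In this model the slow case never finishes: the clients that miss the first
timeout keep retrying, and the server spends each round answering requests
that have already gone stale.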

Any other ideas about what might cause all servers to lose their group
connections as soon as one server in a failover group fails? I got a
private email from Terry Mayer in which he suggested having a look at the
multicast functionality in our switches. However, we haven't touched our
switches here, so I can't really tell whether this might be the cause.

Any further suggestions welcome.

cheers,
jens
-- 
Jens Langner                                         Ph: +49-351-2602757
Forschungszentrum Dresden-Rossendorf e.V.
Institute of Radiopharmacy - PET Center                 [email protected]
Germany                                               http://www.fzd.de/
_______________________________________________
SunRay-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/sunray-users
