[Linux-HA] Why does o2cb RA remove module ocfs2?

2014-02-05 Thread Ulrich Windl
Hi!

I had a problem where O2CB stop fenced the node that was shut down:
I had updated the kernel, and then rebooted. As part of shutdown, the cluster 
stack was stopped. In turn, the O2CB resource was stopped.
Unfortunately this caused an error like (SLES11 SP3):

---
modprobe: FATAL: Could not load /lib/modules/3.0.101-0.8-xen/modules.dep: No 
such file or directory
o2cb(prm_O2CB)[19908]: ERROR: Unable to unload module: ocfs2
---

This in turn caused a node fence, which ruined the clean reboot.

So why is the RA messing with the kernel module on stop?

Regards,
Ulrich


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Why does o2cb RA remove module ocfs2?

2014-02-05 Thread Lars Marowsky-Bree
On 2014-02-05T12:24:00, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I had a problem where O2CB stop fenced the node that was shut down:
 I had updated the kernel, and then rebooted. As part of shutdown, the cluster 
 stack was stopped. In turn, the O2CB resource was stopped.
 Unfortunately this caused an error like (SLES11 SP3):
 
 ---
 modprobe: FATAL: Could not load /lib/modules/3.0.101-0.8-xen/modules.dep: No 
 such file or directory
 o2cb(prm_O2CB)[19908]: ERROR: Unable to unload module: ocfs2
 ---
 
 This in turn caused a node fence, which ruined the clean reboot.
 
 So why is the RA messing with the kernel module on stop?

Because customers complained about the new module not being picked up
when they upgraded ocfs2-kmp and restarted the cluster stack on a node.
It's incredibly hard to please everyone, alas ...

In any case, the right way to update a cluster node is this:

1. Stop the cluster stack
2. Update/upgrade/reboot as needed
3. Restart the cluster stack

This would avoid this error too, as would keeping multiple kernel versions
installed in parallel (which also helps if a kernel update no longer boots
for some reason). Removing the running kernel package is usually not a
great idea; I prefer to remove old kernels only after having successfully
rebooted, because you *never* know if you may have to reload a module.
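
To illustrate the last point, a check along the following lines could be
run before stopping the cluster stack, to warn when the module tree of the
running kernel has already been removed. This is only a sketch:
have_module_tree is a made-up helper name, not part of the o2cb RA, and it
assumes modules live under /lib/modules/<release>, as in the error message
quoted above.

```shell
# Sketch: succeed only if modules.dep exists for the given kernel release.
# have_module_tree is a hypothetical helper, not part of any package.
# $1: kernel release (defaults to the running kernel)
# $2: module root directory (defaults to /lib/modules)
have_module_tree() {
    hmt_release="${1:-$(uname -r)}"
    hmt_root="${2:-/lib/modules}"
    [ -e "$hmt_root/$hmt_release/modules.dep" ]
}
```

Calling "have_module_tree || echo 'module tree of running kernel is gone'"
before a shutdown would flag the situation in which modprobe can no longer
find modules.dep.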


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde



[Linux-HA] Antw: Re: Why does o2cb RA remove module ocfs2?

2014-02-05 Thread Ulrich Windl
 Lars Marowsky-Bree l...@suse.com wrote on 05.02.2014 at 12:36 in message
20140205113649.gn13...@suse.de:
 On 2014-02-05T12:24:00, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:
 
 I had a problem where O2CB stop fenced the node that was shut down:
 I had updated the kernel, and then rebooted. As part of shutdown, the 
 cluster stack was stopped. In turn, the O2CB resource was stopped.
 Unfortunately this caused an error like (SLES11 SP3):
 
 ---
 modprobe: FATAL: Could not load /lib/modules/3.0.101-0.8-xen/modules.dep: No
 such file or directory
 o2cb(prm_O2CB)[19908]: ERROR: Unable to unload module: ocfs2
 ---
 
 This in turn caused a node fence, which ruined the clean reboot.
 
 So why is the RA messing with the kernel module on stop?
 
 Because customers complained about the new module not being picked up
 when they upgraded ocfs2-kmp and restarted the cluster stack on a node.
 It's incredibly hard to please everyone, alas ...

I think the proper way would be this:
Stop your OCFS2 resources, rmmod the module, [modprobe the module to re-insert
the new version], start your OCFS2 resources.
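
Spelled out as commands, that sequence might look like the sketch below.
The function only prints the steps so they can be reviewed before anything
is run. ocfs2_reload_plan and the resource name passed to it are
hypothetical; "crm resource stop/start" assumes the crm shell is in use.
Note that stopping a resource is asynchronous, so the rmmod must wait until
the OCFS2 resources are really stopped.

```shell
# Sketch: print the steps for reloading the ocfs2 module by hand.
# ocfs2_reload_plan is a hypothetical helper; $1 is the name of the
# cluster resource (or group/clone) that holds the OCFS2 filesystems.
ocfs2_reload_plan() {
    orp_rsc="$1"
    echo "crm resource stop $orp_rsc"    # wait until it is really stopped
    echo "rmmod ocfs2"                   # unload the old module
    echo "modprobe ocfs2"                # optional: load the new version
    echo "crm resource start $orp_rsc"
}
```

For example, "ocfs2_reload_plan cln_ocfs2" prints the four commands for a
resource called cln_ocfs2 (an invented name).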

I guess a kernel update is more common than just an ocfs2-kmp update.

 
 In any case, the right way to update a cluster node is this:
 
 1. Stop the cluster stack
 2. Update/upgrade/reboot as needed
 3. Restart the cluster stack
 
 This would avoid this error too, as would keeping multiple kernel versions
 installed in parallel (which also helps if a kernel update no longer boots
 for some reason). Removing the running kernel package is usually not a
 great idea; I prefer to remove old kernels only after having successfully
 rebooted, because you *never* know if you may have to reload a module.

There's another way (as HP-UX learned to do): defer changes to the
running kernel until shutdown/reboot.

 
 



Re: [Linux-HA] Antw: Re: Why does o2cb RA remove module ocfs2?

2014-02-05 Thread Lars Marowsky-Bree
On 2014-02-05T15:06:47, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I guess a kernel update is more common than just an ocfs2-kmp update.

Well, some customers do apply updates in the recommended way, and thus
don't encounter this ;-) In any case, since at this time the cluster
services are already stopped, at least the service impact is minimal.

  This would avoid this error too, as would keeping multiple kernel versions
  installed in parallel (which also helps if a kernel update no longer boots
  for some reason). Removing the running kernel package is usually not a
  great idea; I prefer to remove old kernels only after having successfully
  rebooted, because you *never* know if you may have to reload a module.
 
 There's another way (as HP-UX learned to do): defer changes to the
 running kernel until shutdown/reboot.

True. Hence: activate multi-versions for the kernel in
/etc/zypp/zypp.conf and only remove the old kernel after the reboot. I
do that manually, but I do think we even have a script for that
somewhere. I honestly don't remember where though; I like to keep
several kernels around for testing anyway.

I think this is the default going forward, but as always: zypper gained
this ability during the SLE 11 cycle, and we couldn't just change
existing behaviour in a simple update, so it has to be manually
activated.
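
A sketch of what the activation in /etc/zypp/zypp.conf can look like is
shown below. Treat the values as an assumption to verify against the
comments in the installed zypp.conf; in particular, whether
multiversion.kernels is honoured depends on the zypper/libzypp version.

```
# Allow several kernel packages to be installed side by side
multiversion = provides:multiversion(kernel)

# Which kernels to keep when old ones are cleaned up
multiversion.kernels = latest,latest-1,running
```

With "running" in the list, the kernel that is currently booted is not
removed by the cleanup, which avoids the missing modules.dep seen above.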


Regards,
Lars

