Here is the bug I'm trying to fix: https://bugs.launchpad.net/fuel/+bug/1538587 .
In VMs (set up with fuel-virtualbox) a kernel panic occurs every time you delete a node; the stack trace shows an error in the ext4 driver [1], the same as in the bug. Here is a patch: https://review.openstack.org/297669 . I've checked it on VirtualBox VMs and it works fine. I also propose that we not reboot nodes on kernel panic, so that we can catch possible errors, but maybe that's too dangerous before the release.

[1]
[13607.545119] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
[13608.157968] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
[13608.780695] EXT4-fs error (device dm-0) in ext4_reserve_inode_write:4928: IO failure
[13609.471245] Aborting journal on device dm-0-8.
[13609.478549] EXT4-fs error (device dm-0) in ext4_dirty_inode:5047: IO failure
[13610.069244] EXT4-fs error (device dm-0) in ext4_dirty_inode:5047: IO failure
[13610.698915] Kernel panic - not syncing: EXT4-fs (device dm-0): panic forced after error
[13610.698915]
[13611.060673] CPU: 0 PID: 8676 Comm: systemd-udevd Not tainted 3.13.0-83-generic #127-Ubuntu
[13611.236566] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[13611.887198] 00000000fffffffb ffff88003b6e9a08 ffffffff81725992 ffffffff81a77878
[13612.527154] ffff88003b6e9a80 ffffffff8171e80b ffffffff00000010 ffff88003b6e9a90
[13613.037061] ffff88003b6e9a30 ffff88003b6e9a50 ffff8800367f2ad0 0000000000000040
[13613.717119] Call Trace:
[13613.927162] [<ffffffff81725992>] dump_stack+0x45/0x56
[13614.306858] [<ffffffff8171e80b>] panic+0xc8/0x1e1
[13614.767154] [<ffffffff8125e7c6>] ext4_handle_error.part.187+0xa6/0xb0
[13615.187201] [<ffffffff8125eddb>] __ext4_std_error+0x7b/0x100
[13615.627960] [<ffffffff81244c64>] ext4_reserve_inode_write+0x44/0xa0
[13616.007943] [<ffffffff81247f80>] ? ext4_dirty_inode+0x40/0x60
[13616.448084] [<ffffffff81244d04>] ext4_mark_inode_dirty+0x44/0x1f0
[13616.917611] [<ffffffff8126f7f9>] ? __ext4_journal_start_sb+0x69/0xe0
[13617.367730] [<ffffffff81247f80>] ext4_dirty_inode+0x40/0x60
[13617.747567] [<ffffffff811e858a>] __mark_inode_dirty+0x10a/0x2d0
[13618.088060] [<ffffffff811d94e1>] update_time+0x81/0xd0
[13618.467965] [<ffffffff811d96f0>] file_update_time+0x80/0xd0
[13618.977649] [<ffffffff811511f0>] __generic_file_aio_write+0x180/0x3d0
[13619.467993] [<ffffffff81151498>] generic_file_aio_write+0x58/0xa0
[13619.978080] [<ffffffff8123c712>] ext4_file_write+0xa2/0x3f0
[13620.467624] [<ffffffff81158066>] ? free_hot_cold_page_list+0x46/0xa0
[13621.038045] [<ffffffff8115d400>] ? release_pages+0x80/0x210
[13621.408080] [<ffffffff811bdf5a>] do_sync_write+0x5a/0x90
[13621.818155] [<ffffffff810e52f6>] do_acct_process+0x4e6/0x5c0
[13622.278005] [<ffffffff810e5a91>] acct_process+0x71/0xa0
[13622.597617] [<ffffffff8106a3cf>] do_exit+0x80f/0xa50
[13622.968015] [<ffffffff811c041e>] ? ____fput+0xe/0x10
[13623.337738] [<ffffffff8106a68f>] do_group_exit+0x3f/0xa0
[13623.738020] [<ffffffff8106a704>] SyS_exit_group+0x14/0x20
[13624.137447] [<ffffffff8173659d>] system_call_fastpath+0x1a/0x1f
[13624.518044] Rebooting in 10 seconds..

On Tue, Mar 22, 2016 at 1:07 PM, Dmitry Guryanov <dgurya...@mirantis.com> wrote:
> Hello,
>
> Here is the start of the discussion:
> http://lists.openstack.org/pipermail/openstack-dev/2015-December/083021.html
> I subscribed to this mailing list later, so I can't reply in that thread.
>
> Currently we clear a node's disks in two places: first, before rebooting
> into the bootstrap image [0], and second, just before provisioning in
> fuel-agent [1].
>
> There are two problems that erasing the first megabyte of disk data is
> meant to solve: the node should not boot from HDD after reboot, and the
> new partitioning scheme should overwrite the previous one.
>
> The first problem can be solved by zeroing the first 512 bytes of each
> disk (not partition).
> Even 446 bytes, to be precise, because the last 66 bytes hold the
> partition table; see
> https://wiki.archlinux.org/index.php/Master_Boot_Record .
>
> The second problem should be solved only after rebooting into bootstrap,
> because if we bring a new node into the cluster from somewhere else and
> boot it with the bootstrap image, it may already have disks with
> partitions, md devices, and LVM volumes. All of these entities should be
> correctly cleared before provisioning, not before reboot, and fuel-agent
> already does this in [1].
>
> I propose to remove the erasing of the first 1M of each partition, because
> it can lead to errors in FS kernel drivers and kernel panics. The existing
> workaround, rebooting on kernel panic, is bad because the panic may occur
> just after clearing the first partition of the first disk; after the
> reboot, the BIOS will read the MBR of the second disk and boot from it
> instead of from the network. Let's just clear the first 446 bytes of each
> disk.
>
> [0] https://github.com/openstack/fuel-astute/blob/master/mcagents/erase_node.rb#L162-L174
> [1] https://github.com/openstack/fuel-agent/blob/master/fuel_agent/manager.py#L194-L221
>
> --
> Dmitry Guryanov

--
Dmitry Guryanov
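For what it's worth, the 446-byte wipe proposed in the quoted message could be sketched as follows. This is a minimal illustration, not actual fuel-agent or fuel-astute code; the function name `erase_boot_code` is hypothetical. The offsets come from the MBR layout referenced above: 446 bytes of boot code, then a 64-byte partition table and a 2-byte signature, which stay untouched.

```python
import os

# MBR layout: 446 bytes boot code + 64 bytes partition table + 2-byte signature.
BOOT_CODE_SIZE = 446

def erase_boot_code(device_path):
    """Zero only the boot-code area of the disk (hypothetical sketch).

    The node then stops booting from HDD, while the partition table
    and the 0x55AA signature at the end of the first sector survive.
    """
    with open(device_path, "r+b") as disk:
        disk.write(b"\x00" * BOOT_CODE_SIZE)
        disk.flush()
        os.fsync(disk.fileno())
```

In practice `device_path` would be a whole-disk device such as `/dev/sda`, never a partition, matching the proposal's "each disk (not partition)" wording.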
__________________________________________________________________________ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev