I have long-standing performance problems on Lucid when handling large files.
I notice this on several servers, but here is a detailed example of a scenario I encountered yesterday.

The server (stilgar) is a quad-core with 8 GB RAM and three disks. One disk contains the operating system; the other two are an mdadm RAID0 with LVM on top. I need to recreate the RAID manually[1] on most boots (see the note at the end of this mail), but otherwise it works fine. (Before there are any heart attacks from reading 'RAID0': the data on it is NOT important, and only meant for testing.)

The server runs 4 virtual machines (KVM):
- 2 Lucid servers on qcow images, residing on the local (non-RAID) disk.
- 1 Lucid server on an fstab-mounted NFSv4 share.
- 1 Windows desktop on a logical volume.

I have an NFS-mounted backup disk. When I restore the Windows image (60 GB) from the backup, I encounter bug 658131[2]. All running virtual machines start showing errors like in bug 522014[3] in their logs (hung_task_timeout_secs) and services on them are no longer reachable. The load on the server can climb to over 30. Libvirt can no longer shut down the virtual machines; nothing can be done but a reboot of the whole machine.

From the bug report it looks like this might be NFS-related, but I'm not convinced: if I first copy the image off the NFS share and then restore it, the load also climbs insanely high, the virtual machines end up on the verge of crashing, and services become temporarily unavailable. The software used is qemu-img or dd, and in all cases I run the commands with 'ionice -c 3' (a sketch of the exact commands is at the end of this mail).

This is only one example; any high I/O (e.g. rsync with large files) can crash Lucid servers. But what should I do? Sometimes it is necessary to copy large files, and that should be possible without taking down the entire server. Any thoughts on the matter?

Links:
[1] https://bugs.launchpad.net/bugs/27037
[2] https://bugs.launchpad.net/bugs/658131
[3] https://bugs.launchpad.net/bugs/522014

Example from /var/log/messages (kernel) on the server:

kvm           D 0000000000000000     0  9632      1 0x00000000
 ffff8801a4269ca8 0000000000000086 0000000000015bc0 0000000000015bc0
 ffff8802004fdf38 ffff8801a4269fd8 0000000000015bc0 ffff8802004fdb80
 0000000000015bc0 ffff8801a4269fd8 0000000000015bc0 ffff8802004fdf38
Call Trace:
 [<ffffffff815596b7>] __mutex_lock_slowpath+0x107/0x190
 [<ffffffff815590b3>] mutex_lock+0x23/0x50
 [<ffffffff810f5899>] generic_file_aio_write+0x59/0xe0
 [<ffffffff811d7879>] ext4_file_write+0x39/0xb0
 [<ffffffff81143a8a>] do_sync_write+0xfa/0x140
 [<ffffffff81084380>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81252316>] ? security_file_permission+0x16/0x20
 [<ffffffff81143d88>] vfs_write+0xb8/0x1a0
 [<ffffffff81144722>] sys_pwrite64+0x82/0xa0
 [<ffffffff810121b2>] system_call_fastpath+0x16/0x1b

kdmflush      D 0000000000000002     0   396      2 0x00000000
 ffff88022eeb3d10 0000000000000046 0000000000015bc0 0000000000015bc0
 ffff88022f489a98 ffff88022eeb3fd8 0000000000015bc0 ffff88022f4896e0
 0000000000015bc0 ffff88022eeb3fd8 0000000000015bc0 ffff88022f489a98
Call Trace:
 [<ffffffff815589a7>] io_schedule+0x47/0x70
 [<ffffffff81435383>] dm_wait_for_completion+0xa3/0x160
 [<ffffffff81059b90>] ? default_wake_function+0x0/0x20
 [<ffffffff81435d47>] ? __split_and_process_bio+0x127/0x190
 [<ffffffff81435dda>] dm_flush+0x2a/0x70
 [<ffffffff81435e6c>] dm_wq_work+0x4c/0x1c0
 [<ffffffff81435e20>] ? dm_wq_work+0x0/0x1c0
 [<ffffffff8107f7e7>] run_workqueue+0xc7/0x1a0
 [<ffffffff8107f963>] worker_thread+0xa3/0x110
 [<ffffffff81084380>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8107f8c0>] ? worker_thread+0x0/0x110
 [<ffffffff81084006>] kthread+0x96/0xa0
 [<ffffffff810131ea>] child_rip+0xa/0x20
 [<ffffffff81083f70>] ? kthread+0x0/0xa0
 [<ffffffff810131e0>] ? child_rip+0x0/0x20
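
Note on the RAID: "recreate manually" after boot means roughly the following (the device names here are illustrative, not the actual ones):

  # reassemble the array by hand, then reactivate the LVM volume group
  mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
  vgchange -ay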
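
For completeness, the restore step boils down to something like this (again, paths and device names are illustrative):

  # raw image restore straight onto the logical volume,
  # in the idle I/O scheduling class
  ionice -c 3 dd if=/mnt/backup/windows.img of=/dev/vg0/windows bs=1M

  # or, when restoring from a qcow2 backup:
  ionice -c 3 qemu-img convert -O raw /mnt/backup/windows.qcow2 /dev/vg0/windows

Both variants trigger the behaviour described above.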
