On Thursday 03 February 2011 15:01:37 Serge E. Hallyn wrote:
> Quoting Alvin ([email protected]):
> > I have long-standing performance problems on Lucid when handling
> > large files.
> >
> > I notice this on several servers, but here is a detailed example of
> > a scenario I encountered yesterday.
> >
> > Server (stilgar) is a quad-core with 8 GB RAM. The server has 3
> > disks. One disk contains the operating system. The other two are
> > mdadm RAID0 with LVM. I need to recreate the RAID manually[1] on
> > most boots, but otherwise it is working fine. (Before there are any
> > heart attacks from reading 'RAID0': the data on it is NOT important,
> > and only meant for testing.)
> > The server runs 4 virtual machines (KVM):
> > - 2 Lucid servers on qcow, residing on the local (non-RAID) disk.
> > - 1 Lucid server on an fstab-mounted NFSv4 share.
> > - 1 Windows desktop on a logical volume.
> >
> > I have an NFS-mounted backup disk. When I restore the Windows image
> > from the backup (60 GB), I encounter bug 658131[2]. All running
> > virtual machines will start showing errors like in bug 522014[3] in
> > their logs (hung_task_timeout_secs) and services on them will no
> > longer be reachable. The load on the server can climb to >30.
> > Libvirt will no longer be able to
>
> Is it possible for you to use CIFS instead of NFS?
>
> It's been a few years, but when I had my NAS at home I found CIFS far
> more stable and reliable than NFS.
Yes. I know NFS is somewhat neglected in Ubuntu, but why use MS Windows
file sharing between Linux machines? That makes no sense, and NFS is
easier to set up. In short: I could try CIFS, but in order to exclude
the network share from this issue I copied the image file locally
first. It is true that NFS (maybe CIFS too) has an impact on this; the
load gets even higher when using it.

> > shutdown the virtual machines. Nothing can be done other than
> > rebooting the whole machine.
> >
> > From the bug report, it looks like this might be NFS-related, but
> > I'm not convinced. If I copy the image first and then restore it,
> > the load also climbs insanely high and the virtual machines will be
> > on the verge of crashing. Services will be temporarily unavailable.
>
> (Not trying to be critical) What do you expect to happen? I.e. what do
> you think is the bug there? Is it that ionice seems to be
> insufficient? I'm asking in particular about the conversion by itself,
> not the copy, as I agree the copy pinning CPU must be a (kernel) bug.

Well, I expect a performance hit, but no hung tasks. Especially when
using ionice.

> > The software used is qemu-img or dd. In all cases I'm running the
> > commands with 'ionice -c 3'.
> >
> > This is only an example. Any high IO (e.g. rsync with large files)
> > can crash Lucid servers,
>
> Over NFS, or any rsync?

Both. In the example, NFS/rsync was not used. I only mentioned them
because I've had the same trouble when using them on other servers.

> For that matter, rsync tries to be smart and slice and dice the file
> to minimize network traffic. What about a simple ftp/scp?
>
> > but what should I do? Sometimes it is necessary to copy large files.
> > That should be something that can be done without taking down the
> > entire server. Any thoughts on the matter?
>
> It might be worth testing other IO schedulers.
>
> It also might be worth testing a more current kernel. The kernel team
> does produce backports of newer kernels to lucid which, while surely
> not officially supported, should work and may fix these issues.

I might try those.

I see you found my new bug report[1]. You're on to something there! I
didn't remove a USB drive, but there are similar troubles I had not
linked to this before:
- mdadm does not auto-assemble [2]
- I have an LVM snapshot present on that system! Even worse, the
  snapshot is 100% full and thus corrupt.

I hadn't thought of the snapshot. The presence of an LVM snapshot is a
huge IO performance hit, so that explains the extreme load. In my
example I was reading the raw image from its parent volume. Because of
your comment I also found a blog post[3] about the issue: "Non-existent
Device Mapper Volumes Causing I/O Errors?"

So, I will first contact all users and find a moment to take the server
offline for some testing. Then I'll post my findings in the bug report.
(Rough notes on what I plan to try are in the P.S. below.) Thanks for
the tips.

Links:
[1] https://bugs.launchpad.net/bugs/712392
[2] https://bugs.launchpad.net/bugs/27037
[3] http://slated.org/device_mapper_weirdness

-- 
Alvin
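P.S. For the archives, this is roughly what I intend to try during the
maintenance window. The volume group, logical volume, and disk names
below (raid0vg, windows, windows-snap, /dev/sdb, /dev/sdc, the backup
path) are only placeholders for my setup, so treat this as a sketch
rather than a recipe:

  # List logical volumes; the Snap% column shows how full a snapshot is.
  # A snapshot that has reached 100% is invalid anyway.
  sudo lvs

  # Drop the full (and therefore useless) snapshot.
  sudo lvremove raid0vg/windows-snap

  # Reassemble the RAID0 array by hand if it again fails to
  # auto-assemble at boot.
  sudo mdadm --assemble --scan

  # Check which IO scheduler the member disks are using, and try
  # deadline instead of the default cfq.
  cat /sys/block/sdb/queue/scheduler
  echo deadline | sudo tee /sys/block/sdb/queue/scheduler
  echo deadline | sudo tee /sys/block/sdc/queue/scheduler

  # Re-run the restore with idle IO priority and watch whether the load
  # and hung tasks come back once the snapshot is gone.
  sudo ionice -c 3 dd if=/backup/windows.img of=/dev/raid0vg/windows bs=4M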
