Hi all,

So it seems this is not the same issue Chris is seeing: I have now run kvm_stat for my hanging domain, and the counters are not all zero. The CPU is still stuck at 100%, I can still log in, and I again saw the time jump in the dmesg output (this guest was started with acpi but still used kvmclock; the other guest had used both acpi and the acpi_pm clock).
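For anyone who wants to check their own guests for the same symptom, a jump like this can be found by comparing consecutive dmesg timestamps. This is only a rough sketch: the function name find_time_jumps and the 60 second threshold are my own choices for illustration, not anything provided by kvm or the kernel.

```python
import re

# Matches the leading "[  123.456789]" timestamp that printk puts on each line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def find_time_jumps(lines, threshold=60.0):
    """Return (previous, current, gap) for each forward jump larger than
    `threshold` seconds between consecutive timestamped lines.

    `threshold` is an arbitrary illustrative default, not a kernel value.
    """
    jumps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # lines without a timestamp are skipped
        current = float(match.group(1))
        if previous is not None and current - previous > threshold:
            jumps.append((previous, current, current - previous))
        previous = current
    return jumps
```

Feeding it the lines of a saved dmesg log (for example an open file object) returns one tuple per suspicious gap, and an empty list if the timestamps advance normally.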
I'm switching this guest to acpi_pm as well now. dmesg in the guest again shows that nice time jump:

[ 3.968415] Uniform CD-ROM driver Revision: 3.20
[ 4.122814] vda: vda1 vda2 < vda5 >
[ 4.620176] kjournald starting. Commit interval 5 seconds
[ 4.626042] EXT3-fs: mounted filesystem with ordered data mode.
[ 5.641077] udevd version 125 started
[ 6.548858] input: Power Button (FF) as /class/input/input1
[ 6.568119] ACPI: Power Button (FF) [PWRF]
[ 6.847566] piix4_smbus 0000:00:01.3: Found 0000:00:01.3 device
[ 7.065429] input: PC Speaker as /class/input/input2
[ 7.229412] input: ImExPS/2 Generic Explorer Mouse as /class/input/input3
[ 7.269947] Error: Driver 'pcspkr' is already registered, aborting...
[ 7.277315] udev: renamed network interface eth0 to eth1
[ 8.674616] Adding 489940k swap on /dev/vda5. Priority:-1 extents:1 across:489940k
[ 8.767526] EXT3 FS on vda1, internal journal
[ 10.270122] loop: module loaded
[ 10.461641] device-mapper: uevent: version 1.0.3
[ 10.475258] device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: [EMAIL PROTECTED]
[ 11.221794] NET: Registered protocol family 10
[ 11.224276] lo: Disabled Privacy Extensions
[ 19.770963] warning: `ntpd' uses 32-bit capabilities (legacy support in use)
[ 21.420169] eth1: no IPv6 routers present
[1266862591.699790] BUG: soft lockup - CPU#0 stuck for 1179853412s! [logcheck:4056]
[1266862591.699790] Modules linked in: video output ac battery ipv6 dm_snapshot dm_mirror dm_log dm_mod loop virtio_net virtio_balloon snd_pcsp serio_raw psmouse snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core button evdev ext3 jbd mbcache virtio_blk ide_cd_mod cdrom ata_generic libata scsi_mod dock ide_pci_generic floppy virtio_pci uhci_hcd usbcore piix ide_core thermal processor fan thermal_sys
[1266862591.699790]
[1266862591.699790] Pid: 4056, comm: logcheck Not tainted (2.6.26-1-486 #1)
[1266862591.699790] EIP: 0060:[<c0115324>] EFLAGS: 00000202 CPU: 0
[1266862591.699790] EIP is at ptep_set_access_flags+0x3e/0x6e
[1266862591.699790] EAX: 19070067 EBX: 09661cc0 ECX: ddb0d984 EDX: 09661cc0
[1266862591.699790] ESI: ddb0d984 EDI: 00000001 EBP: ddb0541c ESP: dededeb0
[1266862591.699790] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[1266862591.699791] CR0: 8005003b CR2: 09661cc0 CR3: 1dc49000 CR4: 00000690
[1266862591.699791] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[1266862591.699791] DR6: ffff0ff0 DR7: 00000400
[1266862591.699791] [<c0154c38>] ? do_wp_page+0x3db/0x434
[1266862591.699791] [<c011314b>] ? pvclock_clocksource_read+0x4b/0xd0
[1266862591.699791] [<c011314b>] ? pvclock_clocksource_read+0x4b/0xd0
[1266862591.699791] [<c0155da9>] ? handle_mm_fault+0x55a/0x5d2
[1266862591.699791] [<c0116b87>] ? __dequeue_entity+0x1f/0x71
[1266862591.699791] [<c0113ac2>] ? do_page_fault+0x294/0x5ea
[1266862591.699791] [<c011f275>] ? __do_softirq+0x3e/0x87
[1266862591.699791] [<c011382e>] ? do_page_fault+0x0/0x5ea
[1266862591.699791] [<c02a6a1a>] ? error_code+0x6a/0x70
[1266862591.699791] =======================

kvm_stat output for the hanging guest (one row per sample, columns in the order of the header; the last sample was cut off):

efer_relo exits fpu_reloa halt_exit halt_wake host_stat hypercall
insn_emul insn_emul invlpg io_exits irq_exits irq_injec irq_windo
kvm_reque largepage mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u
mmu_pte_w mmu_recyc mmu_shado mmu_unsyn nmi_injec nmi_windo pf_fixed
pf_guest remote_tl request_n signal_ex tlb_flush
0 1848 0 0 0 5 0 948 0 0 0 646 653 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 150
0 1852 0 0 0 5 0 949 0 0 0 649 654 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 149
0 1848 0 0 0 5 0 949 0 0 0 649 649 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 151
0 1843 0 0 0 6 0 951 0 0 0 649 645 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 149
0 1825 0 0 0 6 0 946 0 0 0 649 625 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 150
0 1832 0 0 0 6 0 948 0 0 0

I've tried a

dd if=/dev/zero of=/tmp/zero.file bs=10M count=100

to test IO in the hanging guest, and now the console hangs there. Doing an strace shows:

select(17, [4 7 9 10 11 12 14 16], [], [], {1, 0}) = 2 (in [12 14], left {1, 0})
read(12, 0x7fff94b745e0, 8) = -1 EIO (Input/output error)
write(15, "\1\0\0\0\0\0\0\0"..., 8) = 8
clock_gettime(CLOCK_MONOTONIC, {263807, 676112562}) = 0
clock_gettime(CLOCK_MONOTONIC, {263807, 676175416}) = 0
clock_gettime(CLOCK_MONOTONIC, {263807, 676237712}) = 0
timer_gettime(0, {it_interval={0, 0}, it_value={0, 9550245}}) = 0
read(14, "\2\0\0\0\0\0\0\0"..., 8) = 8
select(17, [4 7 9 10 11 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 992000})
read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x405980, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7fe58c0aaa80}, 8) = 0
write(5, "\0"..., 1) = 1
read(16, 0x7fff94b74950, 128) = -1 EAGAIN (Resource temporarily unavailable)
select(17, [4 7 9 10 11 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0})
read(4, "\0"..., 512) = 1
read(4, 0x7fff94b747e0, 512) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {263807, 686388502}) = 0
clock_gettime(CLOCK_MONOTONIC, {263807, 686449959}) = 0
clock_gettime(CLOCK_MONOTONIC, {263807, 686511137}) = 0
clock_gettime(CLOCK_MONOTONIC, {263807, 686572035}) = 0

Should I start a separate thread for this issue, so as not to mix things up with Chris's problem?

+rl

On Fri, Nov 21, 2008 at 8:32 PM, Marcelo Tosatti <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 20, 2008 at 09:10:57AM -0800, [EMAIL PROTECTED] wrote:
>> On Wed, Nov 19, 2008 at 02:43:42PM -0800, [EMAIL PROTECTED] wrote:
>> > Thanks for the responses,
>> >
>> > I'm not sure if my problem is the same as Roland's, but it definitely
>> > sounds plausible. I had been running ntpdate in the host to synchronize
>> > time every hour (in a cron job), so it sounds as if we could be seeing
>> > the same issue.
>> >
>>
>> Actually, with ntpdate taken out of crontab, I'm still seeing periodic
>> hangs, so it's either a different problem or I'm hitting it in a
>> different manner.
>>
>> OK, I installed kvm-79 and kernel 2.6.27.6, and here's the the kvm-stat
>> output with 1 guest hung and 3 more operational:
>
> <snip>
>
>> If I shut down the 3 operational guests leaving just the hung guest, the
>> kvm-stat output is all 0s:
>>
>> efer_relo exits fpu_reloa halt_exit halt_wake host_stat hypercall
>> insn_emul insn_emul invlpg io_exits irq_exits irq_windo largepage
>> mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u mmu_pte_w mmu_recyc
>> mmu_shado nmi_windo pf_fixed pf_guest remote_tl request_i signal_ex
>> tlb_flush
>> 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0
>> 0 0 0 0 0 0 0
>> 0
>
> So the guest is not actually running here, which means its
> QEMU that its hanging at.
>
>> The hung guest in this case was run with this command:
>>
>> sudo /usr/local/bin/qemu-system-x86_64 \
>> -daemonize \
>> -no-kvm-irqchip \
>> -hda Imgs/ndev_root.img \
>> -m 1024 \
>> -cdrom ISOs/ubuntu-8.10-server-amd64.iso \
>> -vnc :4 \
>> -net nic,macaddr=DE:AD:BE:EF:04:04,model=e1000 \
>> -net tap,ifname=tap4,script=/home/chris/kvm/qemu-ifup.sh \
>> >>& Logs/ndev_run.log
>>
>>
>> I should also mention that when the guest is hung, I can still switch
>> to the monitor with ctrl-alt 2. So, at least it's a little bit alive.
>
> In coma perhaps.
>
>> I've also noticed that the behavior with the hung guest is slightly
>> different on kvm-79 than it was earlier. When the guest hangs, the kvm
>> process in the host doesn't spin at 100% busy any longer - the guest is
>> just unresponsive at both the network and VNC console.
>
>> Also, I've noticed that if I reset the guest from the monitor, the
>> guest will boot up again, and I can get through to it on the network,
>> but strangely, the mouse and keyboard will still be hung at the
>> VNC console (except that I can still switch back and forth to the
>> monitor).
>>
>> Hope some of this helps, let me know if you need to me to provide any
>> other troubleshooting info.
>
> $ gdb -p pid-of-qemu
>
> (gdb) info threads
>
> Print the backtrace for every thread with:
>
> (gdb) thread N
> (gdb) bt
>
>

-- 
Roland Lammel
QuikIT - IT Lösungen - flexibel und schnell
Web: http://www.quikit.at
Email: [EMAIL PROTECTED]

"Enjoy your job, make lots of money, work within the law. Choose any two."