Re: reboot after panic: vm_page_unwire: invalid wire count: 0
On Nov 14, 2007, at 10:13 AM, Vivek Khera wrote:
> I'm running 6.2-REL. The old kernel was -p5; now, without the zero copy sockets, I'm running -p8. I'll know in a couple of days if this is our solution.

For the archives: removing zero copy sockets seems to have fixed the issue. There hasn't been a single panic on that box since, and it used to panic within 3-4 days under the load it has.

___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
On Nov 13, 2007, at 7:49 PM, Kris Kennaway wrote:
>> notification. In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.
> There is a chance this was a recent regression; previously, in 7.0, they were believed to work.

I'm running 6.2-REL. The old kernel was -p5; now, without the zero copy sockets, I'm running -p8. I'll know in a couple of days if this is our solution.
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
Vivek Khera wrote:
> I'm running 6.2-REL. The old kernel was -p5; now, without the zero copy sockets, I'm running -p8. I'll know in a couple of days if this is our solution.

According to alc, if the page is being wired by something else, then ZCS (zero copy sockets) has never worked properly.

Kris
reboot after panic: vm_page_unwire: invalid wire count: 0
I've got a Dell 1750 box that was rock-solid stable running 4.11 for a couple of years, operating a pretty busy website backend. A month or so ago we wiped it clean and repurposed it to run a different website, running Drupal with a Varnish front-end cache on FreeBSD 6.2-RELEASE-p5. The system is i386 and has 1GB of RAM.

Uname output:

FreeBSD mb.kcilink.com 6.2-RELEASE-p5 FreeBSD 6.2-RELEASE-p5 #0: Wed Jun 27 10:47:15 EDT 2007 [EMAIL PROTECTED]:/n/lorax1/usr6/obj.i386/n/lorax1/usr6/src/sys/KCI32SMP i386

For the last week or so, it has been crashing regularly: sometimes twice per day, and sometimes it runs for two days without a problem. I finally managed to make it dump a crashlog and core, and discovered that the panic was:

reboot after panic: vm_page_unwire: invalid wire count: 0

I googled around and found one old PR, #33637, which had a patch, but that was for FreeBSD 4.5. I also found two other mentions of this panic: one on the mailing lists with no responses, and a PR from 6.1-PRERELEASE, PR #94578, which has no comments on it.

According to the http and varnish logs, we're not being hit particularly hard when the panic happens, but I don't know whether we lose some log data during the panic.

I have the core and the kernel.debug. I'm not sure what info to extract from it beyond the backtrace. The watchdog timer fired and dropped me to DDB, so I just typed "watchdog" and "c" and let it finish dumping.

Here's the backtrace, and bt full output.

# kgdb kernel.debug /var/crash/vmcore.0
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol ps_pglobal_lookup]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
Unread portion of the kernel message buffer:
panic: vm_page_unwire: invalid wire count: 0
cpuid = 1
KDB: stack backtrace:
kdb_backtrace(100,c5a76000,c0e88ab0,0,d90d82c8,...) at kdb_backtrace+0x29
panic(c06b011f,0,c0e88ab0,efe80900,c057b96a,...) at panic+0x114
vm_page_unwire(c0e88ab0,0) at vm_page_unwire+0x68
vfs_vmio_release(d90d82c8) at vfs_vmio_release+0xa2
getnewbuf(0,0,4000,4000) at getnewbuf+0x2bc
getblk(c6f81550,4f5,0,4000,0,...) at getblk+0x360
ffs_balloc_ufs2(c6f81550,13d4000,0,fa,c4f32780,...) at ffs_balloc_ufs2+0x1606
ffs_write(efe80bec) at ffs_write+0x2ec
VOP_WRITE_APV(c06e06a0,efe80bec) at VOP_WRITE_APV+0xce
vn_write(c59c8000,efe80cbc,c51cf400,0,c5a76000) at vn_write+0x1ee
dofilewrite(c5a76000,c,c59c8000,efe80cbc,,...) at dofilewrite+0x77
kern_writev(c5a76000,c,efe80cbc,821bba3,fa,...) at kern_writev+0x3b
write(c5a76000,efe80d04) at write+0x45
syscall(3b,809003b,bfbf003b,0,bfbfeaa4,...) at syscall+0x2bf
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x483d732f, esp = 0xbfbfe9dc, ebp = 0xbfbfea08 ---
Uptime: 1d20h51m58s
Dumping 1023 MB (2 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 1023MB (261872 pages) 1007 991 975 959 943 927 911 895 879 863 847 831 815 799 783 767 751 735 719 703 687 671 655 639 623 607 591 575 559 543 527 511 495 479 463 447 431 415 399 383 367 351 335 319 303 287 271 255 239 223 207 191 175 159 143 127 111 95
interrupt total
irq4: sio0 21758
irq15: ata1 1
irq16: bge0 4544565
irq17: bge1 17684238
irq18: amr0 588223
cpu0: timer 323148326
cpu2: timer 323148294
cpu1: timer 323148331
cpu3: timer 323148344
Total 1315432158
KDB: stack backtrace:
kdb_backtrace(c069ec5d,4e67e6de,0,c06ea170,c06e9818,...) at kdb_backtrace+0x29
watchdog_fire(c07120e0,c8,efe80634,c065c821,efe8063c,...) at watchdog_fire+0x9d
hardclock(efe8063c) at hardclock+0x115
lapic_handle_timer(0) at lapic_handle_timer+0x51
Xtimerint(c4fe6000,1,efe806a8,c066d57b,c4fe6000,...) at Xtimerint+0x30
getit(c4fe6000,c4fe6000,4,efe806c0,c0496f97,...) at getit+0x88
DELAY(1) at DELAY+0x3b
amr_quartz_poll_command1(c4fe6000,c51fbff0,0,0,1000,...) at amr_quartz_poll_command1+0x1af
amr_setup_polled_dmamap(c51fbff0,c4fef800,1,0) at amr_setup_polled_dmamap+0x94
bus_dmamap_load(c4ffe380,0,c0c22000,1,c0496cd4,c51fbff0,1) at bus_dmamap_load+0x4b5
amr_quartz_poll_command(c51fbff0) at amr_quartz_poll_command+0x51
amr_dump_blocks(c4fe6000,0,4cb25e,c0c22000,80) at amr_dump_blocks+0x5f
amrd_dump(c515b700,c0c22000,0,9964bc00,0,1) at amrd_dump+0x7c
cb_dumpdata(c0711a48,1,c06f44a0) at cb_dumpdata+0x100
foreach_chunk(c0655a78,c06f44a0,c06f44a0) at foreach_chunk+0x23
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
On Nov 13, 2007, at 4:50 PM, Vlad GALU wrote:
>> vmio = 1
>> offset = Unhandled dwarf expression opcode 0x93
>> (kgdb)
> Do you happen to have ZERO_COPY_SOCKETS in your kernel config?

Yes, I do. Are they known to be bad under certain loads, or just in general? I don't have this issue with any other web server running the same kernel config, but those are mostly amd64 boxes.
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
On 11/13/07, Vivek Khera [EMAIL PROTECTED] wrote:
> Yes, I do. Are they known to be bad under certain loads, or just in general? I don't have this issue with any other web server running the same kernel config, but those are mostly amd64 boxes.

Remove, retry :) This thing bit me hard in the past too; see the freebsd-fs@ archives.

-- 
Mahnahmahnah!
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
Unfortunately, ZERO_COPY_SOCKETS have long been a known source of problems. I also think that when a page is copied as part of COW (copy-on-write), the new page is unwired (see pmap_copy et al.); this could lead to socow_iodone unwiring, after the send, a page that was not wired. An added issue is that parts of the VM assume that COW and wired are mutually exclusive, which the socow code violates.

At some point in the near future I may be adding support for doing zero copy send without COW for blocking sockets. The one downside of this approach is that if you have multiple threads in your process, it widens the window during which they can stomp on data that you're sending. Nonetheless, this would be a bug in the application code. More complicated would be zero-copy non-COW send on non-blocking sockets, as it would require an extension to kevent for completion notification.

In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.

-Kip
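One way the sequence Kip describes could play out can be sketched as a toy C model. This is not FreeBSD code: struct page, struct mapping, toy_send(), toy_cow_fault() and the other names are hypothetical. It assumes a send path that wires the page currently mapped and marks it COW, and a completion path that unwires whatever page the sending address maps at completion time. If a COW fault has swapped in a fresh, unwired copy in between, the completion decrements a wire count that is already 0, the same condition vm_page_unwire() reports as "invalid wire count: 0". The real socow_iodone() and pmap_copy() paths differ in detail.

```c
#include <stdio.h>

/* Toy model only, not FreeBSD code: a physical page with a wire count. */
struct page {
    int id;
    unsigned int wire_count;
};

/* One virtual page in the sending process: what it maps, and whether it
   is currently copy-on-write. */
struct mapping {
    struct page *pg;
    int cow;
};

/* Mirrors the check behind "vm_page_unwire: invalid wire count: 0",
   but returns -1 instead of panicking. */
int toy_unwire(struct page *m)
{
    if (m->wire_count == 0) {
        fprintf(stderr, "page %d: invalid wire count: 0\n", m->id);
        return -1;
    }
    m->wire_count--;
    return 0;
}

/* Zero-copy send: wire the page currently mapped and mark it COW. */
void toy_send(struct mapping *map)
{
    map->pg->wire_count++;
    map->cow = 1;
}

/* The process writes to the buffer before the send completes: COW installs
   a fresh copy, and the copy starts out unwired. */
void toy_cow_fault(struct mapping *map, struct page *fresh)
{
    if (map->cow) {
        fresh->wire_count = 0;
        map->pg = fresh;
        map->cow = 0;
    }
}

/* Buggy completion: unwires whatever the address maps now. */
int toy_send_done_buggy(struct mapping *map)
{
    return toy_unwire(map->pg);
}

/* Paired completion: unwires exactly the page wired at send time. */
int toy_send_done(struct page *sent)
{
    return toy_unwire(sent);
}
```

In this toy, toy_send_done_buggy() hits the zero wire count on the copy while the original page stays wired forever; toy_send_done() keeps each wire paired with an unwire on the same page object.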
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
On Nov 13, 2007, at 5:13 PM, Kip Macy wrote:
> In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.

Thanks for the info. I'm putting the new kernel in place and will see what happens and report back.
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
Kip Macy wrote:
> In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.

There is a chance this was a recent regression; previously, in 7.0, they were believed to work.

Kris
Re: reboot after panic: vm_page_unwire: invalid wire count: 0
Various calls that downgrade permissions or virtually copy a pmap in pmap.c now remove PG_W (and did not 6 months ago). This may be the cause of the regression. It would probably be better (and faster) if the pages were held instead of wired.

-Kip
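Kip's point about holding rather than wiring can be illustrated with another toy C model (hypothetical names throughout; this is not FreeBSD's pmap or vm_page code). Here wiring is recorded in a PG_W-like bit on the mapping, which a permission downgrade clears, so the later release finds its bookkeeping gone. A hold, by contrast, is a counter on the page object itself that the sender takes and releases; mapping changes never touch it. FreeBSD's struct vm_page does carry a separate hold_count, but this sketch ignores its locking and page-queue handling.

```c
/* Toy model only: not FreeBSD's pmap or vm_page code. */

/* A physical page with a hold count that only holders modify. */
struct page {
    unsigned int hold_count;
};

/* A toy page-table entry; `wired` stands in for the PG_W bit. */
struct pte {
    struct page *pg;
    int writable;
    int wired;
};

/* Toy permission downgrade: like the pmap.c calls described above, it
   also clears the wired bit. It never touches the page's hold count. */
void toy_pmap_protect_ro(struct pte *pte)
{
    pte->writable = 0;
    pte->wired = 0;
}

/* Wiring recorded in the mapping: the release trusts the PG_W-like bit. */
void toy_pte_wire(struct pte *pte)
{
    pte->wired = 1;
}

/* Returns -1 if a pmap operation cleared the bit behind the sender's back. */
int toy_pte_unwire(struct pte *pte)
{
    if (!pte->wired)
        return -1;
    pte->wired = 0;
    return 0;
}

/* Holding: take a reference on the page itself and return that page,
   so the release does not depend on any mapping state. */
struct page *toy_hold(struct pte *pte)
{
    pte->pg->hold_count++;
    return pte->pg;
}

/* Returns -1 on an unbalanced release, 0 otherwise. */
int toy_unhold(struct page *m)
{
    if (m->hold_count == 0)
        return -1;
    m->hold_count--;
    return 0;
}
```

The design choice the toy highlights: a reference that lives on the page, released through the pointer returned at hold time, stays balanced no matter how the pmap rewrites or copies the mapping in the meantime.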