Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-12-13 Thread Vivek Khera


On Nov 14, 2007, at 10:13 AM, Vivek Khera wrote:

I'm running 6.2-REL.  The old kernel was -p5; now, without the zero
copy sockets, I'm running -p8.  I'll know in a couple of days if
this is our solution.


For the archives:

Removing zero copy sockets seems to have fixed the issue.  Not a  
single panic on that box since, and it used to panic within 3-4 days  
under the load it has.


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-14 Thread Vivek Khera


On Nov 13, 2007, at 7:49 PM, Kris Kennaway wrote:


notification.
In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.


There is a chance this was a recent regression; previously, in 7.0,
they were believed to work.




I'm running 6.2-REL.  The old kernel was -p5; now, without the zero
copy sockets, I'm running -p8.  I'll know in a couple of days if this
is our solution.



Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-14 Thread Kris Kennaway

Vivek Khera wrote:


On Nov 13, 2007, at 7:49 PM, Kris Kennaway wrote:


notification.
In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.


There is a chance this was a recent regression; previously, in 7.0, they
were believed to work.




I'm running 6.2-REL.  The old kernel was -p5; now, without the zero copy
sockets, I'm running -p8.  I'll know in a couple of days if this is our
solution.





According to alc, if the page is being wired by something else then ZCS 
has never worked properly.


Kris


reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Vivek Khera
I've got a Dell 1750 box that was rock-solid running 4.11 for a
couple of years, operating a pretty busy website backend.  A month
or so ago we wiped it clean and repurposed it to run a different
website running Drupal with a Varnish front-end cache, using FreeBSD
6.2-RELEASE-p5.  The system is i386 and has 1 GB of RAM.


Uname output: FreeBSD mb.kcilink.com 6.2-RELEASE-p5 FreeBSD 6.2-RELEASE-p5 #0: Wed Jun 27 10:47:15 EDT 2007 [EMAIL PROTECTED]:/n/lorax1/usr6/obj.i386/n/lorax1/usr6/src/sys/KCI32SMP  i386



For the last week or so, it has been crashing regularly: sometimes
twice per day, and sometimes it runs for two days without a problem.
I finally managed to make it dump a crashlog and core, and discovered
that the panic was:


 reboot after panic: vm_page_unwire: invalid wire count: 0

I googled around and found one old PR, #33637, which had a patch, but
that was for FreeBSD 4.5.  I have also found two other mentions of this
panic: one on the mailing lists with no responses, and another in a
PR from 6.1-PRERELEASE, PR #94578, which has no comments on it.


According to the http and varnish logs, we're not being hit
particularly hard when the panic happens, but I don't know if we lose
some log data during the panic.


I have the core and the kernel.debug.  I'm not sure what info to
extract from it beyond the backtrace.  The watchdog timer fired and
dropped me to DDB, so I just typed "watchdog" and "c" and let it
finish dumping.


Here's the backtrace, and bt full output.


# kgdb kernel.debug /var/crash/vmcore.0
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol ps_pglobal_lookup]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".

Unread portion of the kernel message buffer:
panic: vm_page_unwire: invalid wire count: 0
cpuid = 1
KDB: stack backtrace:
kdb_backtrace(100,c5a76000,c0e88ab0,0,d90d82c8,...) at kdb_backtrace+0x29
panic(c06b011f,0,c0e88ab0,efe80900,c057b96a,...) at panic+0x114
vm_page_unwire(c0e88ab0,0) at vm_page_unwire+0x68
vfs_vmio_release(d90d82c8) at vfs_vmio_release+0xa2
getnewbuf(0,0,4000,4000) at getnewbuf+0x2bc
getblk(c6f81550,4f5,0,4000,0,...) at getblk+0x360
ffs_balloc_ufs2(c6f81550,13d4000,0,fa,c4f32780,...) at ffs_balloc_ufs2+0x1606
ffs_write(efe80bec) at ffs_write+0x2ec
VOP_WRITE_APV(c06e06a0,efe80bec) at VOP_WRITE_APV+0xce
vn_write(c59c8000,efe80cbc,c51cf400,0,c5a76000) at vn_write+0x1ee
dofilewrite(c5a76000,c,c59c8000,efe80cbc,,...) at dofilewrite+0x77
kern_writev(c5a76000,c,efe80cbc,821bba3,fa,...) at kern_writev+0x3b
write(c5a76000,efe80d04) at write+0x45
syscall(3b,809003b,bfbf003b,0,bfbfeaa4,...) at syscall+0x2bf
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (4, FreeBSD ELF32, write), eip = 0x483d732f, esp = 0xbfbfe9dc, ebp = 0xbfbfea08 ---

Uptime: 1d20h51m58s
Dumping 1023 MB (2 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 1023MB (261872 pages) 1007 991 975 959 943 927 911 895 879
863 847 831 815 799 783 767 751 735 719 703 687 671 655 639 623 607
591 575 559 543 527 511 495 479 463 447 431 415 399 383 367 351 335
319 303 287 271 255 239 223 207 191 175 159 143 127 111 95

interrupt           total
irq4: sio0          21758
irq15: ata1             1
irq16: bge0       4544565
irq17: bge1      17684238
irq18: amr0        588223
cpu0: timer     323148326
cpu2: timer     323148294
cpu1: timer     323148331
cpu3: timer     323148344
Total          1315432158
KDB: stack backtrace:
kdb_backtrace(c069ec5d,4e67e6de,0,c06ea170,c06e9818,...) at kdb_backtrace+0x29
watchdog_fire(c07120e0,c8,efe80634,c065c821,efe8063c,...) at watchdog_fire+0x9d
hardclock(efe8063c) at hardclock+0x115
lapic_handle_timer(0) at lapic_handle_timer+0x51
Xtimerint(c4fe6000,1,efe806a8,c066d57b,c4fe6000,...) at Xtimerint+0x30
getit(c4fe6000,c4fe6000,4,efe806c0,c0496f97,...) at getit+0x88
DELAY(1) at DELAY+0x3b
amr_quartz_poll_command1(c4fe6000,c51fbff0,0,0,1000,...) at amr_quartz_poll_command1+0x1af
amr_setup_polled_dmamap(c51fbff0,c4fef800,1,0) at amr_setup_polled_dmamap+0x94
bus_dmamap_load(c4ffe380,0,c0c22000,1,c0496cd4,c51fbff0,1) at bus_dmamap_load+0x4b5
amr_quartz_poll_command(c51fbff0) at amr_quartz_poll_command+0x51
amr_dump_blocks(c4fe6000,0,4cb25e,c0c22000,80) at amr_dump_blocks+0x5f
amrd_dump(c515b700,c0c22000,0,9964bc00,0,1) at amrd_dump+0x7c
cb_dumpdata(c0711a48,1,c06f44a0) at cb_dumpdata+0x100
foreach_chunk(c0655a78,c06f44a0,c06f44a0) at foreach_chunk+0x23

Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Vlad GALU
On 11/13/07, Vivek Khera [EMAIL PROTECTED] wrote:
 [snip]

Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Vivek Khera


On Nov 13, 2007, at 4:50 PM, Vlad GALU wrote:


   vmio = 1
   offset = Unhandled dwarf expression opcode 0x93
(kgdb)



   Do you happen to have ZERO_COPY_SOCKETS in your kernel config?




Yes, I do.  Are they known to be bad under certain loads, or just in
general?  I don't have this issue with any other web server running
the same kernel config, but those are mostly amd64 boxes.
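
As an aside on the question above, here is a minimal sketch of how one might check whether a kernel config file enables the option. It assumes the usual "options NAME" line syntax with "#" starting a comment; the helper name option_enabled and the sample config text are hypothetical and are not taken from this thread.

```python
def option_enabled(config_text, option):
    """Return True if an uncommented 'options <option>' line is present."""
    for raw_line in config_text.splitlines():
        # Drop anything after '#', which starts a comment in kernel configs.
        fields = raw_line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "options":
            # Options may carry a value, e.g. 'options FOO=1'.
            if fields[1].split("=", 1)[0] == option:
                return True
    return False


# Hypothetical config text, for illustration only.
sample = """
ident   EXAMPLE
options SMP
options ZERO_COPY_SOCKETS   # the option discussed in this thread
"""
print(option_enabled(sample, "ZERO_COPY_SOCKETS"))
```

To use it on a real system, pass in the text of whichever kernel config file the running kernel was built from. Removing or commenting out the line and rebuilding the kernel is the workaround suggested in this thread.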




Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Vlad GALU
On 11/13/07, Vivek Khera [EMAIL PROTECTED] wrote:

 On Nov 13, 2007, at 4:50 PM, Vlad GALU wrote:

 vmio = 1
 offset = Unhandled dwarf expression opcode 0x93
  (kgdb)
 
 
 Do you happen to have ZERO_COPY_SOCKETS in your kernel config?
 
 

 Yes, I do.  Are they known to be bad under certain loads, or just in
 general?  I don't have this issue with any other web server running
 the same kernel config, but those are mostly amd64 boxes.

Remove, retry :) This thing bit me hard in the past too, see the
freebsd-fs@ archives.





-- 
Mahnahmahnah!


Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Kip Macy
Unfortunately, ZERO_COPY_SOCKETS have long been a known source of
problems. I also think that when a page is copied as part of COW, the
new page is unwired (see pmap_copy et al.); this could lead to
socow_iodone unwiring, after the send, a page that was not wired. An
added issue is that parts of the VM assume that COW and wired are
mutually exclusive, which the socow code violates.
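
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's code: the Page class and the functions wire, unwire, cow_copy and zero_copy_send are hypothetical stand-ins, and the sequence of events follows the hypothesis in this message rather than a confirmed trace.

```python
class Page:
    """Toy stand-in for a VM page; it only tracks a wire count."""

    def __init__(self, name):
        self.name = name
        self.wire_count = 0


def wire(page):
    page.wire_count += 1


def unwire(page):
    # Unwiring a page whose count is already zero is the invalid state
    # that the panic message in this thread reports.
    if page.wire_count == 0:
        raise RuntimeError("vm_page_unwire: invalid wire count: 0")
    page.wire_count -= 1


def cow_copy(page):
    # Assumption from the message above: the copy starts out unwired.
    return Page(page.name + "-copy")


def zero_copy_send(fault_during_send):
    page = Page("userbuf")
    wire(page)  # the send path wires the page it lends out
    current = page
    if fault_during_send:
        # A write fault while the send is in flight swaps in a fresh copy.
        current = cow_copy(page)
    try:
        unwire(current)  # the completion handler unwires what it finds
        return "ok"
    except RuntimeError as exc:
        return str(exc)


print(zero_copy_send(fault_during_send=False))
print(zero_copy_send(fault_during_send=True))
```

In this model the wire and unwire calls only balance when no copy happens in between, which is the mismatch the message attributes to the socow code.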

At some point in the near future I may be adding support for doing
zero copy send without COW for blocking sockets. The one downside of
this approach is that if you have multiple threads in your process it
widens the window during which they can stomp on data that you're
sending. Nonetheless, this would be a bug in the application code.
More complicated would be zero-copy non-COW send on non-blocking
sockets as it would require an extension to kevent for completion
notification.

In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.


 -Kip





Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Vivek Khera


On Nov 13, 2007, at 5:13 PM, Kip Macy wrote:


In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.


Thanks for the info.  I'm putting the new kernel in place and will see  
what happens and report back.




Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Kris Kennaway

Kip Macy wrote:

Unfortunately, ZERO_COPY_SOCKETS have long been a known source of
problems. I also think that when a page is copied as part of COW, the
new page is unwired (see pmap_copy et al.); this could lead to
socow_iodone unwiring, after the send, a page that was not wired. An
added issue is that parts of the VM assume that COW and wired are
mutually exclusive, which the socow code violates.

At some point in the near future I may be adding support for doing
zero copy send without COW for blocking sockets. The one downside of
this approach is that if you have multiple threads in your process it
widens the window during which they can stomp on data that you're
sending. Nonetheless, this would be a bug in the application code.
More complicated would be zero-copy non-COW send on non-blocking
sockets as it would require an extension to kevent for completion
notification.

In the meantime, your best bet is to disable ZERO_COPY_SOCKETS.


There is a chance this was a recent regression; previously, in 7.0, they
were believed to work.


Kris






Re: reboot after panic: vm_page_unwire: invalid wire count: 0

2007-11-13 Thread Kip Macy
Various calls that downgrade permissions or virtually copy a pmap in
pmap.c now remove PG_W (they did not six months ago). This may be the
cause of the regression. It would probably be better (and faster) if
the pages were held instead of wired.

-Kip

On Nov 13, 2007 4:49 PM, Kris Kennaway [EMAIL PROTECTED] wrote:
 There is a chance this was a recent regression, previously in 7.0 they
 were believed to work.

 Kris