Setting up netconsole does not require a reboot. The idea is to capture the oops
trace at the moment the oops happens; without that trace, we are flying blind.
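In case it helps, a minimal sketch of such a netconsole setup (the IP
addresses, UDP ports, interface name, MAC address, and log file name below are
placeholders; substitute your own):

  # On the node that oopses (sender): stream kernel messages over UDP.
  # Parameter format: netconsole=src-port@src-ip/dev,dst-port@dst-ip/dst-mac
  modprobe netconsole netconsole=6665@10.0.0.10/eth0,6666@10.0.0.20/00:11:22:33:44:55

  # On the logging host (receiver): capture whatever arrives on that port.
  # Some netcat builds want "nc -u -l 6666" instead of "-l -p 6666".
  nc -u -l -p 6666 | tee netconsole-web03.log

  # Back on the sender, raise the console loglevel so the full oops goes out.
  dmesg -n 8

Nothing above touches the disk or the cluster configuration, so it should be
safe on a production node.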
mike wrote:
> Since these are production systems I can't do much.
>
> But I did get an error (it's not happening as much, but it still blips
> here and there).
>
> Notice that /dev/sdb (my iSCSI target using OCFS2) hits 0.00%
> utilization 3 seconds before my proxy says "hey, timeout" - every
> other second there is -always- some utilization going on.
>
> What could be the steps to figure out this issue? Using debugfs.ocfs2
> or something?
>
> It's mounted as:
> /dev/sdb1 on /home type ocfs2 (rw,_netdev,noatime,data=writeback,heartbeat=local)
>
> I know I'm not being much help, but I'm willing to try almost anything
> as long as it doesn't cause downtime or require cluster-wide changes
> (since those require downtime...). I want to try going back to
> 2.6.24-16 with data=writeback and see if that fixes the crashing
> issue, but if I'm already having issues like this, perhaps I should
> resolve this before moving up.
>
> [EMAIL PROTECTED] ~]# cat /root/web03-iostat.txt
>
> Time: 02:11:46 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            3.71    0.00   27.23    8.91    0.00   60.15
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00   54.46   0.00  309.90    0.00  2914.85     9.41    23.08   74.47   0.93  28.71
> sdb       12.87    0.00  17.82    0.00  245.54     0.00    13.78     0.33   17.78  18.33  32.67
>
> Time: 02:11:47 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.25    0.00   26.24    2.23    0.00   71.29
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb        5.94    0.00  22.77    0.99  228.71     0.99     9.67     0.42   17.92  17.08  40.59
>
> Time: 02:11:48 PM  <- THIS HAS THE ISSUE
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00   25.99    0.00    0.00   74.01
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00   10.89   0.00    2.97    0.00   110.89    37.33     0.00    0.00   0.00   0.00
> sdb        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> Time: 02:11:49 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.25    0.00   14.85    0.99    0.00   83.91
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb        0.99    0.00   2.97    0.99   30.69     0.99     8.00     0.07   17.50  17.50   6.93
>
> Time: 02:11:50 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.74    0.00    1.24    1.73    0.00   96.29
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb        0.99    0.00   5.94    0.00   55.45     0.00     9.33     0.07   11.67  11.67   6.93
>
> Time: 02:11:51 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    1.24   16.34    0.00   82.43
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00  153.47   0.00  494.06    0.00  5156.44    10.44    55.62  107.23   1.16  57.43
> sdb        2.97    0.00  11.88    0.99  117.82     0.99     9.23     0.26   13.08  20.00  25.74
>
> Time: 02:11:52 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.00    0.00    0.25    3.22    0.00   96.53
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00   16.83    0.00   158.42     9.41     0.13  164.71   1.18   1.98
> sdb        1.98    0.00   2.97    0.00   39.60     0.00    13.33     0.13   73.33  43.33  12.87
>
> Time: 02:11:53 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.50    0.00    0.25    4.70    0.00   94.55
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb        5.94    0.00  11.88    0.99  141.58     0.99    11.08     0.20   15.38  15.38  19.80
>
> Time: 02:11:54 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            3.96    0.00   10.15    0.74    0.00   85.15
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sda        0.00   20.79   0.00    4.95    0.00   205.94    41.60     0.00    0.00   0.00   0.00
> sdb        4.95    0.00   5.94    0.00   87.13     0.00    14.67     0.07   11.67  11.67   6.93
>
> On 4/21/08, Sunil Mushran <[EMAIL PROTECTED]> wrote:
>> Do you have the panic output... the kernel stack trace? We'll need
>> that to figure this out. Without that, we can only speculate.
>>
>> mike wrote:
>>> On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote:
>>>> mike wrote:
>>>>> I have changed my kernel back to 2.6.22-14-server, and now I don't
>>>>> get the kernel panics. It seems like an issue with 2.6.24-16, and
>>>>> some i/o made it crash...
>>>>
>>>> OK, so it seems that it is a bug in the ocfs2 kernel code, not in
>>>> ocfs2-tools. :)
>>>> Then could you please describe in more detail how the kernel panic
>>>> happens?
>>>
>>> Yeah, this specific issue seems like a kernel issue.
>>>
>>> I don't know; these are production systems and I am already getting
>>> angry customers. I can't really test anymore. Both are standard
>>> Ubuntu kernels.
>>>
>>> Okay: 2.6.22-14-server (I think still minor file access issues)
>>> Breaks under load: 2.6.24-16-server
>>>
>>>>> However, I am still getting file access timeouts once in a while.
>>>>> I am nervous about putting more load on the setup.
>>>>
>>>> Also, please provide more details about that.
>>>
>>> I am using nginx for the frontend load balancer, and nginx for the
>>> webservers as well. This doesn't seem to be related to the webserver
>>> at all, though; it was happening before this.
>>>
>>> lvs01 proxies traffic in to web01, web02, and web03 (currently using
>>> nginx; before that I was using LVS/ipvsadm).
>>>
>>> Every so often, one of the webservers sends me back
>>>
>>>>> [EMAIL PROTECTED] .batch]# cat /etc/default/o2cb
>>>>>
>>>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>>>> O2CB_ENABLED=true
>>>>>
>>>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>>>> O2CB_BOOTCLUSTER=mycluster
>>>>>
>>>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>>>> O2CB_HEARTBEAT_THRESHOLD=7
>>>>
>>>> This value is a little small, so how did you build your shared disk
>>>> (iSCSI or ...)? The most common value I have heard of is 61, which
>>>> is about 120 secs. I don't know the reason; maybe Sunil can tell
>>>> you. ;)
>>>> You can also refer to
>>>> http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
>>>
>>>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
>>>>> O2CB_IDLE_TIMEOUT_MS=10000
>>>>>
>>>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
>>>>> O2CB_KEEPALIVE_DELAY_MS=5000
>>>>>
>>>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>
>>>>> On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote:
>>>>>> Hi Mike,
>>>>>> Are you sure it is caused by the update of ocfs2-tools?
>>>>>> AFAIK, ocfs2-tools only includes tools like mkfs, fsck, tunefs, etc.
>>>>>> So if you don't make any change to the disk (using these new
>>>>>> tools), they shouldn't cause a kernel panic, since they are all
>>>>>> user-space tools.
>>>>>> Then there is only one other possibility. Have you modified
>>>>>> /etc/sysconfig/o2cb (this is the location on RHEL; I'm not sure
>>>>>> where it lives on Ubuntu)? I have checked the rpm package for
>>>>>> RHEL: it updates /etc/sysconfig/o2cb, and that file has some
>>>>>> timeouts defined in it.
>>>>>> So do you have a backup of this file? If yes, please restore it
>>>>>> and see whether that helps (I can't say for sure).
>>>>>> If not, do you remember the old values of the timeouts you set
>>>>>> for ocfs2? If yes, you can use "o2cb configure" to set them
>>>>>> yourself.
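On the "o2cb configure" suggestion above, a rough sketch of what checking
and restoring the timeouts could look like (61 is only the commonly cited
value from this thread, not a recommendation; whatever you set must match
on every node in the cluster):

  # Inspect the current settings (Ubuntu keeps them in /etc/default/o2cb,
  # RHEL in /etc/sysconfig/o2cb).
  cat /etc/default/o2cb

  # Interactively reconfigure; this prompts for the heartbeat dead
  # threshold, idle timeout, keepalive delay, and reconnect delay.
  /etc/init.d/o2cb configure

Note that new values only take effect once the cluster stack is restarted
on every node, which means unmounting the OCFS2 volumes first - that is a
cluster-wide change, so it needs a maintenance window.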