Hi Florian,

florian.engelm...@bt.com wrote:
> Hi Tao,
>
>> Hi Florian,
>>
>> florian.engelm...@bt.com wrote:
>>> Hi Tao,
>>>
>>>> Hi florian,
>>>>
>>>> florian.engelm...@bt.com wrote:
>>>>>> Florian,
>>>>>> the problem here seems to be with the network. The nodes are running into
>>>>>> network heartbeat timeout and hence the second node is getting fenced. Do
>>>>>> you see the o2net thread consuming 100% CPU on any node? If not, then
>>>>>> probably check your network.
>>>>>> thanks,
>>>>>> --Srini
>>>>> I forgot to post my /etc/ocfs2/cluster.conf:
>>>>>
>>>>> node:
>>>>>         ip_port = 7777
>>>>>         ip_address = 192.168.0.101
>>>>>         number = 0
>>>>>         name = defr1elcbtd01
>>>>>         cluster = ocfs2
>>>>>
>>>>> node:
>>>>>         ip_port = 7777
>>>>>         ip_address = 192.168.0.102
>>>>>         number = 1
>>>>>         name = defr1elcbtd02
>>>>>         cluster = ocfs2
>>>>>
>>>>> cluster:
>>>>>         node_count = 2
>>>>>         name = ocfs2
>>>>>
>>>>> 192.168.0.10x is eth3 on both nodes, connected with a crossover
>>>>> cable. No active network component is involved here.
>>>>>
>>>>> defr1elcbtd02:~# traceroute 192.168.0.101
>>>>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte packets
>>>>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
>>>>> defr1elcbtd02:~#
>>>>>
>>>>> The error message looks like a network problem, but why should there be a
>>>>> network problem if I shut down an FC port? I tested it about 20 times and
>>>>> got about 16 kernel panics, all starting with the same error message:
>>>>>
>>>>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
>>>>> 192.168.0.101:7777
>>>> It isn't an error message, just a status report that we can't connect to
>>>> that node now. That node may have rebooted or something else may have
>>>> happened, but this node doesn't know; it only knows the connection is down.
>>> But node defr1elcbtd01 was never down, and the network link (eth3)
>>> wasn't down either. I was able to ping from each node to the other.
>>> Node 1 is hosting all services and was never faulted while I was
>>> testing.
>>>
>>> All I have to do to panic node 2 is to disable one of the two fibre channel
>>> ports, pull one fibre channel cable, or delete node 2 from the Cisco
>>> SAN zoning.
>>> If I apply one of those three "errors" I get the message that o2net is no
>>> longer connected to node 1, and 60 seconds later the second node panics
>>> because of ocfs2 fencing (but this happens only in about 80% of cases -
>>> in the other 20% of cases o2net does not disconnect and there are no
>>> messages about ocfs2 at all - as it should be...).
>>> Everything else is working fine in these 60 seconds. The filesystem is
>>> still writable from both nodes and both nodes can ping each other (via
>>> the cluster interconnect).
>> I just checked your log. The reason node 2 gets the message is that
>> node 1 saw that node 2 had stopped disk heartbeating for quite a
>> long time, so it dropped the connection intentionally. That is why node 2
>> gets this message.
>>
>> See the log on node 1:
>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1, 1 total
>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382 node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0 -> -107
>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node 1, 1 total
>> Jun 8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node defr1elcbtd02 (num 1) at 192.168.0.102:7777
>>
>> And I guess the reason you see this log only sometimes (80%) is the
>> timing: the ocfs2 disk heartbeat checks every 2 secs, so sometimes
>> node 2 panics before node 1 calls o2quo_hb_down and sometimes
>> node 2 panics after node 1 calls o2quo_hb_down (which puts something
>> like "no longer..." in node 2's log).
>>
>> So would you please give your timeout configuration (o2cb)?
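[Editor's note] The timeout comparison Tao asks for here can be scripted. A minimal POSIX sh sketch; the /tmp file names are illustrative, and the heredoc contents simply reuse the values reported elsewhere in this thread as stand-ins for each node's real /etc/default/o2cb (Debian) or /etc/sysconfig/o2cb:

```sh
#!/bin/sh
# Sketch: check that two nodes run with identical o2cb timeout settings.
# In practice, copy each node's /etc/default/o2cb here first; the values
# below are the ones quoted in this thread.

cat > /tmp/o2cb.node1 <<'EOF'
O2CB_HEARTBEAT_THRESHOLD=7
O2CB_IDLE_TIMEOUT_MS=12000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
EOF

cat > /tmp/o2cb.node2 <<'EOF'
O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=60000
O2CB_KEEPALIVE_DELAY_MS=4000
O2CB_RECONNECT_DELAY_MS=4000
EOF

for key in O2CB_HEARTBEAT_THRESHOLD O2CB_IDLE_TIMEOUT_MS \
           O2CB_KEEPALIVE_DELAY_MS O2CB_RECONNECT_DELAY_MS; do
    v1=$(grep "^$key=" /tmp/o2cb.node1 | cut -d= -f2)
    v2=$(grep "^$key=" /tmp/o2cb.node2 | cut -d= -f2)
    if [ "$v1" = "$v2" ]; then
        echo "$key: OK ($v1)"
    else
        echo "$key: MISMATCH (node1=$v1 node2=$v2)"
    fi
done
```

With the two configurations from this thread, every key prints MISMATCH, which is exactly the situation Tao flags below: both nodes must agree before node 2 may join the domain.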
> These are my settings on node 2:
>
> O2CB_HEARTBEAT_THRESHOLD=61
> O2CB_IDLE_TIMEOUT_MS=60000
> O2CB_KEEPALIVE_DELAY_MS=4000
> O2CB_RECONNECT_DELAY_MS=4000

ocfs2 can't allow 2 nodes to have different timeouts. So if node 1 and node 2
don't have the same configuration, node 2 won't be allowed to join the
domain and mount the same volume.

Yeah, these parameters look much better. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#TIMEOUT
for details.
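[Editor's note] Per the ocfs2 1.2 FAQ linked above, the heartbeat thread stamps the disk every 2 seconds, and a node is declared dead after O2CB_HEARTBEAT_THRESHOLD - 1 missed iterations, i.e. the effective disk heartbeat timeout is (threshold - 1) * 2 seconds. A small sketch of that arithmetic for the two configurations in this thread:

```sh
#!/bin/sh
# Effective disk heartbeat timeout in seconds, per the ocfs2 1.2 FAQ:
# timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
hb_timeout() {
    threshold=$1
    echo $(( (threshold - 1) * 2 ))
}

echo "node 1 (threshold 7):  $(hb_timeout 7) s"    # 12 s
echo "node 2 (threshold 61): $(hb_timeout 61) s"   # 120 s
```

With the old default of 7, a node is fenced after only 12 seconds without disk access, which is shorter than a typical multipath trespass/failover; the tuned threshold of 61 gives 120 seconds to ride out a single path failure.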
I just went through the whole thread, and the panic because of the storage
failure is deliberate behavior: with no disk access, there is no need to
survive. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#QUORUM

Regards,
Tao

> On node 1 I still have the old settings because there was no downtime to
> reboot this system until today. Is there any way to change the values
> without a reboot? The system is mission critical and I can only reboot
> on Sundays.
>
> Settings on node 1 are the default settings that came with the Debian
> package. I changed them to match the settings of node 2, taking effect
> after the next reboot - so I can only guess they currently are:
>
> O2CB_HEARTBEAT_THRESHOLD=7
> O2CB_IDLE_TIMEOUT_MS=12000
> O2CB_KEEPALIVE_DELAY_MS=2000
> O2CB_RECONNECT_DELAY_MS=2000
>
> Regards,
> Florian
>
>> Regards,
>> Tao
>>
>>> Here are the logs with debug logging:
>>>
>>> Node 2:
>>>
>>> Jun 8 09:46:11 defr1elcbtd02 kernel: qla2xxx 0000:04:00.0: LOOP DOWN detected (2).
>>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:11 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = -11 >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page 
>>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 >>> [mag 64087 len 0 typ 0 
stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = -11 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd02 kernel: 
(3463,0):o2net_advance_rx:1170 >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:21 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = -11 >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_state_change:477 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] state_change to 8 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572 > [sc >>> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 
pg_off 0] shutting down >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: > (3463,0):o2net_set_nn_state:382 >>> node 0 sc: ffff81007c2f0800 -> 0000000000000000, valid 1 -> 0, err 0 > -> >>> -107 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_conn_err:296 > node >>> 0, 1 total >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_set_hold:162 > node >>> 0, 1 total >>> Jun 8 09:46:26 defr1elcbtd02 kernel: o2net: no longer connected to > node >>> defr1elcbtd01 (num 0) at 192.168.0.101:7777 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: >>> (3463,0):o2net_complete_nodes_nsw:255 completed 0 messages for node > 0 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: > (3463,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 0 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: > (3463,0):o2net_rx_until_empty:1215 >>> [sc ffff81007c2f0800 
refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] saw error 0, closing >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572 > [sc >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] shutting down >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_kref_release:274 > [sc >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] releasing >>> Jun 8 09:46:46 defr1elcbtd02 kernel: scsi 2:0:1:1: SCSI error: > return >>> code = 0x00010000 >>> Jun 8 09:46:46 defr1elcbtd02 kernel: end_request: I/O error, dev > sdd, >>> sector 1672 >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath: > Failing >>> path 8:48. >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath: > Failing >>> path 8:16. 
>>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: > long >>> trespass command will be send >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: >>> honor reservation bit will not be set (default) >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: ioctl: error > adding >>> target to table >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: > long >>> trespass command will be send >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: >>> honor reservation bit will not be set (default) >>> Jun 8 09:46:48 defr1elcbtd02 kernel: device-mapper: multipath emc: >>> emc_pg_init: sending switch-over command >>> >>> >>> >>> >>> Node 1: >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (16528,0):o2net_data_ready:452 > [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] data_ready hit >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (16528,0):sc_get:294 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] get >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1170 >>> [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:21 defr1elcbtd01 kernel: >>> (3505,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 24] put >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 24] get >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007ddf4400 refs 4 
sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] ret = 1 >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] ret = -11 >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 > node 1, >>> 1 total >>> Jun 8 09:46:26 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0 > -> >>> -107 >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 > node >>> 1, 1 total >>> Jun 8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to > node >>> defr1elcbtd02 (num 1) at 192.168.0.102:7777 >>> Jun 8 09:46:26 defr1elcbtd01 kernel: >>> (3804,0):o2net_complete_nodes_nsw:255 completed 0 messages for node > 1 >>> Jun 8 09:46:26 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_get:294 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):o2net_shutdown_sc:572 > [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] shutting down >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 
defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 2 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_kref_release:274 > [sc >>> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] releasing >>> Jun 8 09:46:26 defr1elcbtd01 kernel: >>> (16541,0):ocfs2_replay_journal:1184 Recovering node 1 from slot 0 on >>> device (253,7) >>> Jun 8 09:46:28 defr1elcbtd01 kernel: kjournald starting. Commit >>> interval 5 seconds >>> Jun 8 09:46:48 defr1elcbtd01 kernel: Please double check your >>> configuration values for 'O2CB_HEARTBEAT_THRESHOLD' >>> Jun 8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node > 1, 2 >>> total >>> Jun 8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162 > node >>> 1, 1 total >>> Jun 8 09:46:48 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err > -107 >>> -> 0 >>> Jun 8 09:46:48 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:46:58 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 > -> >>> -107 >>> Jun 8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2quo_conn_err:296 > node >>> 1, 1 total >>> Jun 8 09:46:58 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_hb_still_up:242 >>> node 1 >>> Jun 8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171 > node >>> 1, 0 total >>> Jun 8 09:47:16 defr1elcbtd01 kernel: (8,0):o2quo_make_decision:105 >>> heartbeating: 2, connected: 1, lowest: 0 (reachable) >>> Jun 8 09:47:44 defr1elcbtd01 kernel: 
(3804,0):o2quo_hb_down:224 > node 1, >>> 1 total >>> Jun 8 09:47:44 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err > -107 >>> -> -107 >>> Jun 8 09:47:44 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:48:33 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Down >>> Jun 8 09:48:45 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Up 1000 Mbps Full Duplex >>> Jun 8 09:50:05 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Down >>> Jun 8 09:50:08 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Up 1000 Mbps Full Duplex >>> Jun 8 09:50:20 defr1elcbtd01 kernel: Please double check your >>> configuration values for 'O2CB_HEARTBEAT_THRESHOLD' >>> Jun 8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node > 1, 2 >>> total >>> Jun 8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162 > node >>> 1, 1 total >>> Jun 8 09:50:20 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err > -107 >>> -> 0 >>> Jun 8 09:50:20 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (0,0):o2net_listen_data_ready:1670 >>> bytes: 0 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_alloc:320 [sc >>> ffff81007dddc400 refs 1 sock 0000000000000000 node 1 page >>> 0000000000000000 pg_off 0] alloced >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> ffff81007dddc400, valid 0 -> 0, err 0 > -> >>> 0 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 1 sock ffff81005232e700 node 1 page >>> ffff81007f272590 
pg_off 0] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 2 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] put >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] receiving >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 16] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:382 >>> node 1 sc: ffff81007dddc400 -> ffff81007dddc400, valid 0 -> 1, err 0 > -> >>> 0 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_conn_up:267 > node 1, >>> 2 total >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171 > node >>> 1, 0 total >>> Jun 8 09:50:22 defr1elcbtd01 kernel: o2net: accepted connection > from >>> node defr1elcbtd02 (num 1) at 192.168.0.102:7777 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] ret = 16 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] receiving >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] ret = -11 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> 
ffff81007f272590 pg_off 0] put
>>>
>>>> If there is some problem with the FC port, the node will reboot and the
>>>> other node will get this message.
>>>>> The cluster is running fine if there is no problem with the SAN
>>>>> connection.
>>>>>
>>>>> How do I enable verbose logging with ocfs2?
>>>> debugfs.ocfs2 -l will show the current logging status.
>>>>
>>>> If you want to enable some log, use e.g.
>>>>     debugfs.ocfs2 -l DISK_ALLOC allow
>>>> and
>>>>     debugfs.ocfs2 -l DISK_ALLOC off
>>>> will turn it off.
>>>>
>>>> Regards,
>>>> Tao
>>>>> Regards,
>>>>> Florian
>>>>>
>>>>>> florian.engelm...@bt.com wrote:
>>>>>>> Hello,
>>>>>>> our Debian etch cluster nodes are panicking because of ocfs2 fencing if
>>>>>>> one SAN path fails.
>>>>>>>
>>>>>>> modinfo ocfs2
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      ocfs2_dlm,ocfs2_nodemanager,jbd
>>>>>>> srcversion:   0798424846E68F10172C203
>>>>>>>
>>>>>>> modinfo ocfs2_dlmfs
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 DLMFS 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      ocfs2_dlm,ocfs2_nodemanager
>>>>>>> srcversion:   E3780E12396118282B3C1AD
>>>>>>>
>>>>>>> defr1elcbtd02:~# modinfo ocfs2_dlm
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 DLM 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      ocfs2_nodemanager
>>>>>>> srcversion:   7DC395EA08AE4CE826C5B92
>>>>>>>
>>>>>>> modinfo ocfs2_nodemanager
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 Node Manager 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      configfs
>>>>>>> srcversion:   C4C9871302E1910B78DAE40
>>>>>>>
>>>>>>> modinfo qla2xxx
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
>>>>>>> author:       QLogic Corporation
>>>>>>> description:  QLogic Fibre Channel HBA Driver
>>>>>>> license:      GPL
>>>>>>> version:      8.01.07-k1
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      scsi_mod,scsi_transport_fc,firmware_class
>>>>>>> alias:        pci:v00001077d00002100sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002200sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002300sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002312sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002322sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00006312sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00006322sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002422sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002432sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00005422sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00005432sv*sd*bc*sc*i*
>>>>>>> srcversion:   B8E1608E257391DCAFD9224
>>>>>>> parm:         ql2xfdmienable:Enables FDMI registratons Default is 0 - no FDMI. 1 - perfom FDMI. (int)
>>>>>>> parm:         extended_error_logging:Option to enable extended error logging, Default is 0 - no logging. 1 - log errors. (int)
>>>>>>> parm:         ql2xallocfwdump:Option to enable allocation of memory for a firmware dump during HBA initialization. Memory allocation requirements vary by ISP type. Default is 1 - allocate memory. (int)
>>>>>>> parm:         ql2xloginretrycount:Specify an alternate value for the NVRAM login retry count. (int)
>>>>>>> parm:         ql2xplogiabsentdevice:Option to enable PLOGI to devices that are not present after a Fabric scan. This is needed for several broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI. (int)
>>>>>>> parm:         qlport_down_retry:Maximum number of command retries to a port that returns a PORT-DOWN status. (int)
>>>>>>> parm:         ql2xlogintimeout:Login timeout value in seconds. (int)
>>>>>>>
>>>>>>> modinfo dm_multipath
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-multipath.ko
>>>>>>> description:  device-mapper multipath target
>>>>>>> author:       Sistina Software <dm-de...@redhat.com>
>>>>>>> license:      GPL
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      dm-mod
>>>>>>>
>>>>>>> modinfo dm_mod
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-mod.ko
>>>>>>> description:  device-mapper driver
>>>>>>> author:       Joe Thornber <dm-de...@redhat.com>
>>>>>>> license:      GPL
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:
>>>>>>> parm:         major:The major number of the device mapper (uint)
>>>>>>>
>>>>>>> modinfo dm_round_robin
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-round-robin.ko
>>>>>>> description:  device-mapper round-robin multipath path selector
>>>>>>> author:       Sistina Software <dm-de...@redhat.com>
>>>>>>> license:      GPL
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      dm-multipath
>>>>>>>
>>>>>>> There is no self-compiled software; just the official repository was
>>>>>>> used.
>>>>>>> The nodes are connected to our two independent SANs. The storage systems
>>>>>>> are EMC Clariion CX3-20f and EMC Clariion CX500.
>>>>>>>
>>>>>>> multipath.conf:
>>>>>>> defaults {
>>>>>>>         rr_min_io               1000
>>>>>>>         polling_interval        2
>>>>>>>         no_path_retry           5
>>>>>>>         user_friendly_names     yes
>>>>>>> }
>>>>>>>
>>>>>>> blacklist {
>>>>>>>         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
>>>>>>>         devnode "^hd[a-z][[0-9]*]"
>>>>>>>         devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
>>>>>>>         device {
>>>>>>>                 vendor  "DGC"
>>>>>>>                 product "LUNZ"      # EMC Management LUN
>>>>>>>         }
>>>>>>>         device {
>>>>>>>                 vendor  "ATA"       # We do not need multipathing for local drives
>>>>>>>                 product "*"
>>>>>>>         }
>>>>>>>         device {
>>>>>>>                 vendor  "AMI"       # No multipathing for SUN Virtual devices
>>>>>>>                 product "*"
>>>>>>>         }
>>>>>>>         device {
>>>>>>>                 vendor  "HITACHI"   # No multipathing for local scsi disks
>>>>>>>                 product "H101414SCSUN146G"
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> devices {
>>>>>>>         ## Device attributes for EMC CLARiiON
>>>>>>>         device {
>>>>>>>                 vendor                  "DGC"
>>>>>>>                 product                 "*"
>>>>>>>                 path_grouping_policy    group_by_prio
>>>>>>>                 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
>>>>>>>                 prio_callout            "/sbin/mpath_prio_emc /dev/%n"
>>>>>>>                 hardware_handler        "1 emc"
>>>>>>>                 features                "1 queue_if_no_path"
>>>>>>>                 no_path_retry           fail
>>>>>>>                 path_checker            emc_clariion
>>>>>>>                 path_selector           "round-robin 0"
>>>>>>>                 failback                immediate
>>>>>>>                 user_friendly_names     yes
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> multipaths {
>>>>>>>         multipath {
>>>>>>>                 wwid    3600601603ac511001c7c92fec775dd11
>>>>>>>                 alias   stosan01_lun070
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> multipath -ll:
>>>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID 5
>>>>>>> [size=133G][features=0][hwhandler=1 emc]
>>>>>>> \_ round-robin 0 [prio=2][active]
>>>>>>>  \_ 0:0:1:1 sdd 8:48   [active][ready]
>>>>>>>  \_ 1:0:1:1 sdh 8:112  [active][ready]
>>>>>>> \_ round-robin 0 [prio=0][enabled]
>>>>>>>  \_ 0:0:0:1 sdb 8:16   [active][ready]
>>>>>>>  \_ 1:0:0:1 sdf 8:80   [active][ready]
>>>>>>>
>>>>>>> As we use lvm2, we added /dev/sd* to the filter:
>>>>>>> filter = [ "r|/dev/cdrom|", "r|/dev/sd.*|" ]
>>>>>>>
>>>>>>> Here is what happened, and what we did to reconstruct the situation to find a solution:
>>>>>>>
>>>>>>> On 02.06.2009 we did something wrong with the zoning on one of our two SANs and all servers (about 40) lost one path to the SAN. Only two servers crashed. Those two are our Debian etch heartbeat cluster described above.
>>>>>>> The console showed a kernel panic because ocfs2 was fencing both nodes.
>>>>>>>
>>>>>>> This was the message:
>>>>>>> O2hb_write_timeout:165 ERROR: Heartbeat write timeout to device dm-7 after 12000 milliseconds
>>>>>>>
>>>>>>> So we decided to change the o2cb settings to:
>>>>>>> O2CB_HEARTBEAT_THRESHOLD=31
>>>>>>> O2CB_IDLE_TIMEOUT_MS=30000
>>>>>>> O2CB_KEEPALIVE_DELAY_MS=2000
>>>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>>>
>>>>>>> We switched all cluster resources to the 1st node to test the new settings on the second node. We removed the 2nd node from the zoning (we also tested shutting down the port, with the same result) and got a different error, but still ended up with a kernel panic:
>>>>>>>
>>>>>>> Jun 4 16:41:05 defr1elcbtd02 kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at 192.168.0.101:7777
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: rport-0:0-0: blocked FC remote port time out: removing target and saving binding
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: rport-0:0-1: blocked FC remote port time out: removing target and saving binding
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: sd 0:0:1:1: SCSI error: return code = 0x00010000
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: end_request: I/O error, dev sdd, sector 1672
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath: Failing path 8:48.
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath: Failing path 8:16.
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: scsi 0:0:1:1: rejecting I/O to device being removed
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: long trespass command will be send
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: honor reservation bit will not be set (default)
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: table: 253:7: multipath: error getting device
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: ioctl: error adding target to table
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: long trespass command will be send
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: honor reservation bit will not be set (default)
>>>>>>> Jun 4 16:41:29 defr1elcbtd02 kernel: device-mapper: multipath emc: emc_pg_init: sending switch-over command
>>>>>>> Jun 4 16:42:01 defr1elcbtd02 kernel: (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
>>>>>>> Jun 4 16:42:01 defr1elcbtd02 kernel: (10751,1):dlm_wait_for_node_death:374 5EE89BC01EFC405E9197C198DEEAE678: waiting 5000ms for notification of death of node 0
>>>>>>> Jun 4 16:42:07 defr1elcbtd02 kernel: (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
>>>>>>> Jun 4 16:42:07 defr1elcbtd02 kernel: (10751,1):dlm_wait_for_node_death:374 5EE89BC01EFC405E9197C198DEEAE678: waiting 5000ms for notification of death of node 0
>>>>>>> [...]
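A small aside on the dlm lines above: kernel functions report failures as negative errno values, and on Linux 107 is ENOTCONN ("Transport endpoint is not connected"). A minimal sketch, with the Linux value hardcoded for clarity (this helper is illustrative only, not part of any OCFS2 tooling):

```python
# Decode the "status = -107" seen in the dlm messages above.
# The mapping is hardcoded with the Linux errno value; on other platforms
# the numeric value of ENOTCONN differs.
LINUX_ERRNO_NAMES = {107: "ENOTCONN"}  # "Transport endpoint is not connected"

def decode_kernel_status(status: int) -> str:
    """Map a negative kernel return code to its Linux errno name."""
    return LINUX_ERRNO_NAMES.get(-status, f"unknown ({status})")

print(decode_kernel_status(-107))  # ENOTCONN
```

In other words, the lock-convert requests to node 0 fail because the o2net connection is already gone, and the dlm retries while waiting for node 0 to be declared dead.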
>>>>>>> After 60 seconds:
>>>>>>>
>>>>>>> (8,0):o2quo_make_decision:143 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
>>>>>>>
>>>>>>> multipath -ll changed to:
>>>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID 5
>>>>>>> [size=133G][features=0][hwhandler=1 emc]
>>>>>>> \_ round-robin 0 [prio=1][active]
>>>>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
>>>>>>> \_ round-robin 0 [prio=0][enabled]
>>>>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
>>>>>>>
>>>>>>> The ocfs2 filesystem is still mounted and writable. Even if I enable the zoning (or the FC port) again within the 60 seconds, ocfs2 does not reconnect to node 1 and panics the kernel after 60 seconds, while multipath -ll shows both paths again.
>>>>>>>
>>>>>>> I do not understand at all what the Ethernet heartbeat connection of ocfs2 has to do with the SAN connection.
>>>>>>>
>>>>>>> The strangest thing of all is that this does not always happen! After some reboots the system keeps running stable even if I shut down an FC port and enable it again many times. There is no consistent behaviour... It happens most of the time, but in about 10% of cases it does not happen and everything works as intended.
>>>>>>>
>>>>>>> Any explanations or ideas what causes this behaviour?
>>>>>>>
>>>>>>> I will test this on Debian lenny to see if the Debian version makes a difference.
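As a rough cross-check of the timeouts quoted above: under the commonly documented o2cb semantics, the disk heartbeat thread writes every 2 seconds and a node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) iterations. A sketch of that arithmetic (the helper is illustrative, not part of the o2cb tools):

```python
# Hedged sketch of the documented o2cb conversion: the disk heartbeat
# interval is 2 seconds, so the effective dead time is approximately
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2000 ms.
def o2cb_disk_dead_time_ms(threshold: int, interval_ms: int = 2000) -> int:
    """Approximate time before a node is declared dead on disk."""
    return (threshold - 1) * interval_ms

# Old default threshold of 7 matches the "12000 milliseconds" write
# timeout quoted earlier; the new setting of 31 gives 60000 ms.
print(o2cb_disk_dead_time_ms(7), o2cb_disk_dead_time_ms(31))  # 12000 60000
```

With the threshold set to 31 the disk dead time is 60 seconds, which lines up with the node panicking roughly 60 seconds after the o2net disconnect message.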
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Florian

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users