Hi Florian,

florian.engelm...@bt.com wrote:
> Hi Tao,
>
>> Hi Florian,
>>
>> florian.engelm...@bt.com wrote:
>>> Hi Tao,
>>>
>>>> Hi florian,
>>>>
>>>> florian.engelm...@bt.com wrote:
>>>>>> Florian,
>>>>>> the problem here seems to be with the network. The nodes are running into
>>>>>> network heartbeat timeout and hence the second node is getting fenced. Do
>>>>>> you see the o2net thread consuming 100% CPU on any node? If not, then
>>>>>> probably check your network.
>>>>>> thanks,
>>>>>> --Srini
>>>>> I forgot to post my /etc/ocfs2/cluster.conf:
>>>>>
>>>>> node:
>>>>>         ip_port = 7777
>>>>>         ip_address = 192.168.0.101
>>>>>         number = 0
>>>>>         name = defr1elcbtd01
>>>>>         cluster = ocfs2
>>>>>
>>>>> node:
>>>>>         ip_port = 7777
>>>>>         ip_address = 192.168.0.102
>>>>>         number = 1
>>>>>         name = defr1elcbtd02
>>>>>         cluster = ocfs2
>>>>>
>>>>> cluster:
>>>>>         node_count = 2
>>>>>         name = ocfs2
>>>>>
>>>>> 192.168.0.10x is eth3 on both nodes, connected with a crossover
>>>>> cable. No active network component is involved here.
>>>>>
>>>>> defr1elcbtd02:~# traceroute 192.168.0.101
>>>>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte packets
>>>>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
>>>>> defr1elcbtd02:~#
>>>>>
>>>>> The error message looks like a network problem, but why should there be a
>>>>> network problem if I shut down an FC port? I tested it about 20 times and
>>>>> got about 16 kernel panics, all starting with the same error message:
>>>>>
>>>>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
>>>>> 192.168.0.101:7777
>>>> It isn't an error message, just a status report that we can't connect to
>>>> that node now. That node may have rebooted or something else may have
>>>> happened, but this node doesn't know; it only knows the connection is down.
>>> But node defr1elcbtd01 was never down, and the network link (eth3)
>>> wasn't down either. I was able to ping from each node to the other.
>>> Node 1 is hosting all services and was never faulted while I was
>>> testing.
>>>
>>> All I have to do to panic node 2 is to disable one of the two fibre channel
>>> ports, pull one fibre channel cable, or delete node 2 from the Cisco
>>> SAN zoning.
>>> If I apply one of those three "errors" I get the message that o2net is no
>>> longer connected to node 1, and 60 seconds later the second node panics
>>> because of ocfs2 fencing (but this happens only in about 80% of cases -
>>> in the other 20% of cases o2net does not disconnect and there are no
>>> messages about ocfs2 at all - as it should be...).
>>> Everything else is working fine in these 60 seconds. The filesystem is
>>> still writable from both nodes and both nodes can ping each other (via
>>> the cluster interconnect).
>> I just checked your log. The reason node 2 gets the message is that
>> node 1 saw that node 2 had stopped disk heartbeating for quite a
>> long time, so it dropped the connection intentionally. That is why node 2
>> gets this message.
>>
>> See the log on node 1:
>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1, 1 total
>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382 node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0 -> -107
>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node 1, 1 total
>> Jun 8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node defr1elcbtd02 (num 1) at 192.168.0.102:7777
>>
>> And I guess the reason you see this log only sometimes (80%) is the
>> timing: the ocfs2 disk heartbeat checks every 2 secs, so sometimes
>> node 2 panics before node 1 calls o2quo_hb_down and sometimes
>> node 2 panics after node 1 calls o2quo_hb_down (which puts something
>> like "no longer..." in node 2's log).
>>
>> So would you please give your timeout configuration (o2cb)?
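[Editor's note] The timeout comparison Tao asks for here can be scripted. A minimal POSIX sh sketch; the /tmp file names are illustrative, and the heredoc contents simply reuse the values reported elsewhere in this thread as stand-ins for each node's real /etc/default/o2cb (Debian) or /etc/sysconfig/o2cb:

```sh
#!/bin/sh
# Sketch: check that two nodes run with identical o2cb timeout settings.
# In practice, copy each node's /etc/default/o2cb here first; the values
# below are the ones quoted in this thread.

cat > /tmp/o2cb.node1 <<'EOF'
O2CB_HEARTBEAT_THRESHOLD=7
O2CB_IDLE_TIMEOUT_MS=12000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
EOF

cat > /tmp/o2cb.node2 <<'EOF'
O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=60000
O2CB_KEEPALIVE_DELAY_MS=4000
O2CB_RECONNECT_DELAY_MS=4000
EOF

for key in O2CB_HEARTBEAT_THRESHOLD O2CB_IDLE_TIMEOUT_MS \
           O2CB_KEEPALIVE_DELAY_MS O2CB_RECONNECT_DELAY_MS; do
    v1=$(grep "^$key=" /tmp/o2cb.node1 | cut -d= -f2)
    v2=$(grep "^$key=" /tmp/o2cb.node2 | cut -d= -f2)
    if [ "$v1" = "$v2" ]; then
        echo "$key: OK ($v1)"
    else
        echo "$key: MISMATCH (node1=$v1 node2=$v2)"
    fi
done
```

With the two configurations from this thread, every key prints MISMATCH, which is exactly the situation Tao flags below: both nodes must agree before node 2 may join the domain.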
> These are my settings on node 2:
>
> O2CB_HEARTBEAT_THRESHOLD=61
> O2CB_IDLE_TIMEOUT_MS=60000
> O2CB_KEEPALIVE_DELAY_MS=4000
> O2CB_RECONNECT_DELAY_MS=4000

ocfs2 can't allow 2 nodes to have different timeouts. So if node 1 and node 2
don't have the same configuration, node 2 won't be allowed to join the
domain and mount the same volume.

Yeah, these parameters look much better. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#TIMEOUT
for details.
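[Editor's note] Per the ocfs2 1.2 FAQ linked above, the heartbeat thread stamps the disk every 2 seconds, and a node is declared dead after O2CB_HEARTBEAT_THRESHOLD - 1 missed iterations, i.e. the effective disk heartbeat timeout is (threshold - 1) * 2 seconds. A small sketch of that arithmetic for the two configurations in this thread:

```sh
#!/bin/sh
# Effective disk heartbeat timeout in seconds, per the ocfs2 1.2 FAQ:
# timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
hb_timeout() {
    threshold=$1
    echo $(( (threshold - 1) * 2 ))
}

echo "node 1 (threshold 7):  $(hb_timeout 7) s"    # 12 s
echo "node 2 (threshold 61): $(hb_timeout 61) s"   # 120 s
```

With the old default of 7, a node is fenced after only 12 seconds without disk access, which is shorter than a typical multipath trespass/failover; the tuned threshold of 61 gives 120 seconds to ride out a single path failure.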
I just went through the whole thread, and the panic because of the storage
failure is deliberate behavior: with no disk access, there is no need to
survive. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#QUORUM

Regards,
Tao

> On node 1 I still have the old settings because there was no downtime to
> reboot this system until today. Is there any way to change the values
> without a reboot? The system is mission critical and I can only reboot
> on Sundays.
>
> Settings on node 1 are the default settings that came with the Debian
> package. I changed them to match the settings of node 2, taking effect
> after the next reboot - so I can only guess they currently are:
>
> O2CB_HEARTBEAT_THRESHOLD=7
> O2CB_IDLE_TIMEOUT_MS=12000
> O2CB_KEEPALIVE_DELAY_MS=2000
> O2CB_RECONNECT_DELAY_MS=2000
>
> Regards,
> Florian
>
>> Regards,
>> Tao
>>
>>> Here are the logs with debug logging:
>>>
>>> Node 2:
>>>
>>> Jun 8 09:46:11 defr1elcbtd02 kernel: qla2xxx 0000:04:00.0: LOOP DOWN detected (2).
>>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:11 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = -11 >>> Jun 8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page 
>>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170 >>> [mag 64087 len 0 typ 0 
stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] put >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = -11 >>> Jun 8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd02 kernel: 
(3463,0):o2net_advance_rx:1170 >>> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:21 defr1elcbtd02 kernel: >>> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 24] get >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 1 >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = -11 >>> Jun 8 09:46:21 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_state_change:477 > [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] state_change to 8 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] data_ready hit >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572 > [sc >>> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 
pg_off 0] shutting down >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 5 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: > (3463,0):o2net_set_nn_state:382 >>> node 0 sc: ffff81007c2f0800 -> 0000000000000000, valid 1 -> 0, err 0 > -> >>> -107 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_conn_err:296 > node >>> 0, 1 total >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2quo_set_hold:162 > node >>> 0, 1 total >>> Jun 8 09:46:26 defr1elcbtd02 kernel: o2net: no longer connected to > node >>> defr1elcbtd01 (num 0) at 192.168.0.101:7777 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: >>> (3463,0):o2net_complete_nodes_nsw:255 completed 0 messages for node > 0 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: > (3463,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 4 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 3 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 > [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] receiving >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 > [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] ret = 0 >>> Jun 8 09:46:26 defr1elcbtd02 kernel: > (3463,0):o2net_rx_until_empty:1215 >>> [sc ffff81007c2f0800 
refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] saw error 0, closing >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 2 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):o2net_shutdown_sc:572 > [sc >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] shutting down >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd02 kernel: (3463,0):sc_kref_release:274 > [sc >>> ffff81007c2f0800 refs 1 sock ffff8100694138c0 node 0 page >>> ffff81007fafbb00 pg_off 0] releasing >>> Jun 8 09:46:46 defr1elcbtd02 kernel: scsi 2:0:1:1: SCSI error: > return >>> code = 0x00010000 >>> Jun 8 09:46:46 defr1elcbtd02 kernel: end_request: I/O error, dev > sdd, >>> sector 1672 >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath: > Failing >>> path 8:48. >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath: > Failing >>> path 8:16. 
>>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: > long >>> trespass command will be send >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: >>> honor reservation bit will not be set (default) >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: ioctl: error > adding >>> target to table >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: > long >>> trespass command will be send >>> Jun 8 09:46:46 defr1elcbtd02 kernel: device-mapper: multipath emc: >>> honor reservation bit will not be set (default) >>> Jun 8 09:46:48 defr1elcbtd02 kernel: device-mapper: multipath emc: >>> emc_pg_init: sending switch-over command >>> >>> >>> >>> >>> Node 1: >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (16528,0):o2net_data_ready:452 > [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] data_ready hit >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (16528,0):sc_get:294 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] get >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1170 >>> [mag 64087 len 0 typ 0 stat 0 sys_stat 0 key 00000000 num 0] at > page_off >>> 24 >>> Jun 8 09:46:21 defr1elcbtd01 kernel: >>> (3505,0):o2net_process_message:1015 [mag 64087 len 0 typ 0 stat 0 >>> sys_stat 0 key 00000000 num 0] processing message >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 24] put >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 24] get >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007ddf4400 refs 4 
sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] ret = 1 >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] receiving >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] ret = -11 >>> Jun 8 09:46:21 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 > node 1, >>> 1 total >>> Jun 8 09:46:26 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: ffff81007ddf4400 -> 0000000000000000, valid 1 -> 0, err 0 > -> >>> -107 >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 > node >>> 1, 1 total >>> Jun 8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to > node >>> defr1elcbtd02 (num 1) at 192.168.0.102:7777 >>> Jun 8 09:46:26 defr1elcbtd01 kernel: >>> (3804,0):o2net_complete_nodes_nsw:255 completed 0 messages for node > 1 >>> Jun 8 09:46:26 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_get:294 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] get >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3804,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 4 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):o2net_shutdown_sc:572 > [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] shutting down >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 3 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 
defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 2 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] put >>> Jun 8 09:46:26 defr1elcbtd01 kernel: (3505,0):sc_kref_release:274 > [sc >>> ffff81007ddf4400 refs 1 sock ffff810002977d00 node 1 page >>> ffff81007f75b3e8 pg_off 0] releasing >>> Jun 8 09:46:26 defr1elcbtd01 kernel: >>> (16541,0):ocfs2_replay_journal:1184 Recovering node 1 from slot 0 on >>> device (253,7) >>> Jun 8 09:46:28 defr1elcbtd01 kernel: kjournald starting. Commit >>> interval 5 seconds >>> Jun 8 09:46:48 defr1elcbtd01 kernel: Please double check your >>> configuration values for 'O2CB_HEARTBEAT_THRESHOLD' >>> Jun 8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node > 1, 2 >>> total >>> Jun 8 09:46:48 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162 > node >>> 1, 1 total >>> Jun 8 09:46:48 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err > -107 >>> -> 0 >>> Jun 8 09:46:48 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:46:58 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 > -> >>> -107 >>> Jun 8 09:46:58 defr1elcbtd01 kernel: (3505,0):o2quo_conn_err:296 > node >>> 1, 1 total >>> Jun 8 09:46:58 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_hb_still_up:242 >>> node 1 >>> Jun 8 09:47:16 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171 > node >>> 1, 0 total >>> Jun 8 09:47:16 defr1elcbtd01 kernel: (8,0):o2quo_make_decision:105 >>> heartbeating: 2, connected: 1, lowest: 0 (reachable) >>> Jun 8 09:47:44 defr1elcbtd01 kernel: 
(3804,0):o2quo_hb_down:224 > node 1, >>> 1 total >>> Jun 8 09:47:44 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err > -107 >>> -> -107 >>> Jun 8 09:47:44 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:48:33 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Down >>> Jun 8 09:48:45 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Up 1000 Mbps Full Duplex >>> Jun 8 09:50:05 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Down >>> Jun 8 09:50:08 defr1elcbtd01 kernel: e1000: eth3: e1000_watchdog: > NIC >>> Link is Up 1000 Mbps Full Duplex >>> Jun 8 09:50:20 defr1elcbtd01 kernel: Please double check your >>> configuration values for 'O2CB_HEARTBEAT_THRESHOLD' >>> Jun 8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_hb_up:199 node > 1, 2 >>> total >>> Jun 8 09:50:20 defr1elcbtd01 kernel: (3804,0):o2quo_set_hold:162 > node >>> 1, 1 total >>> Jun 8 09:50:20 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err > -107 >>> -> 0 >>> Jun 8 09:50:20 defr1elcbtd01 kernel: > (3804,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (0,0):o2net_listen_data_ready:1670 >>> bytes: 0 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_alloc:320 [sc >>> ffff81007dddc400 refs 1 sock 0000000000000000 node 1 page >>> 0000000000000000 pg_off 0] alloced >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:382 >>> node 1 sc: 0000000000000000 -> ffff81007dddc400, valid 0 -> 0, err 0 > -> >>> 0 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:431 >>> queueing conn attempt in 0 jiffies >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 1 sock ffff81005232e700 node 1 page >>> ffff81007f272590 
pg_off 0] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 2 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] put >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] receiving >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_get:294 [sc >>> ffff81007dddc400 refs 3 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 16] get >>> Jun 8 09:50:22 defr1elcbtd01 kernel: > (3505,0):o2net_set_nn_state:382 >>> node 1 sc: ffff81007dddc400 -> ffff81007dddc400, valid 0 -> 1, err 0 > -> >>> 0 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_conn_up:267 > node 1, >>> 2 total >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2quo_clear_hold:171 > node >>> 1, 0 total >>> Jun 8 09:50:22 defr1elcbtd01 kernel: o2net: accepted connection > from >>> node defr1elcbtd02 (num 1) at 192.168.0.102:7777 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] ret = 16 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1129 > [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] receiving >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):o2net_advance_rx:1196 > [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> ffff81007f272590 pg_off 0] ret = -11 >>> Jun 8 09:50:22 defr1elcbtd01 kernel: (3505,0):sc_put:289 [sc >>> ffff81007dddc400 refs 4 sock ffff81005232e700 node 1 page >>> 
ffff81007f272590 pg_off 0] put
>>>
>>>> If there is some problem with the FC port, the node will reboot and the
>>>> other node will get this message.
>>>>> The cluster is running fine if there is no problem with the SAN
>>>>> connection.
>>>>>
>>>>> How do I enable verbose logging with ocfs2?
>>>> debugfs.ocfs2 -l will show the current logging status.
>>>>
>>>> If you want to enable some log, use e.g.
>>>>     debugfs.ocfs2 -l DISK_ALLOC allow
>>>> and
>>>>     debugfs.ocfs2 -l DISK_ALLOC off
>>>> will turn it off.
>>>>
>>>> Regards,
>>>> Tao
>>>>> Regards,
>>>>> Florian
>>>>>
>>>>>> florian.engelm...@bt.com wrote:
>>>>>>> Hello,
>>>>>>> our Debian etch cluster nodes are panicking because of ocfs2 fencing if
>>>>>>> one SAN path fails.
>>>>>>>
>>>>>>> modinfo ocfs2
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      ocfs2_dlm,ocfs2_nodemanager,jbd
>>>>>>> srcversion:   0798424846E68F10172C203
>>>>>>>
>>>>>>> modinfo ocfs2_dlmfs
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 DLMFS 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      ocfs2_dlm,ocfs2_nodemanager
>>>>>>> srcversion:   E3780E12396118282B3C1AD
>>>>>>>
>>>>>>> defr1elcbtd02:~# modinfo ocfs2_dlm
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 DLM 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      ocfs2_nodemanager
>>>>>>> srcversion:   7DC395EA08AE4CE826C5B92
>>>>>>>
>>>>>>> modinfo ocfs2_nodemanager
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>>>>>> author:       Oracle
>>>>>>> license:      GPL
>>>>>>> description:  OCFS2 Node Manager 1.3.3
>>>>>>> version:      1.3.3
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      configfs
>>>>>>> srcversion:   C4C9871302E1910B78DAE40
>>>>>>>
>>>>>>> modinfo qla2xxx
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
>>>>>>> author:       QLogic Corporation
>>>>>>> description:  QLogic Fibre Channel HBA Driver
>>>>>>> license:      GPL
>>>>>>> version:      8.01.07-k1
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      scsi_mod,scsi_transport_fc,firmware_class
>>>>>>> alias:        pci:v00001077d00002100sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002200sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002300sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002312sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002322sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00006312sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00006322sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002422sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00002432sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00005422sv*sd*bc*sc*i*
>>>>>>> alias:        pci:v00001077d00005432sv*sd*bc*sc*i*
>>>>>>> srcversion:   B8E1608E257391DCAFD9224
>>>>>>> parm:         ql2xfdmienable:Enables FDMI registratons Default is 0 - no FDMI. 1 - perfom FDMI. (int)
>>>>>>> parm:         extended_error_logging:Option to enable extended error logging, Default is 0 - no logging. 1 - log errors. (int)
>>>>>>> parm:         ql2xallocfwdump:Option to enable allocation of memory for a firmware dump during HBA initialization. Memory allocation requirements vary by ISP type. Default is 1 - allocate memory. (int)
>>>>>>> parm:         ql2xloginretrycount:Specify an alternate value for the NVRAM login retry count. (int)
>>>>>>> parm:         ql2xplogiabsentdevice:Option to enable PLOGI to devices that are not present after a Fabric scan. This is needed for several broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI. (int)
>>>>>>> parm:         qlport_down_retry:Maximum number of command retries to a port that returns a PORT-DOWN status. (int)
>>>>>>> parm:         ql2xlogintimeout:Login timeout value in seconds. (int)
>>>>>>>
>>>>>>> modinfo dm_multipath
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-multipath.ko
>>>>>>> description:  device-mapper multipath target
>>>>>>> author:       Sistina Software <dm-de...@redhat.com>
>>>>>>> license:      GPL
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      dm-mod
>>>>>>>
>>>>>>> modinfo dm_mod
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-mod.ko
>>>>>>> description:  device-mapper driver
>>>>>>> author:       Joe Thornber <dm-de...@redhat.com>
>>>>>>> license:      GPL
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:
>>>>>>> parm:         major:The major number of the device mapper (uint)
>>>>>>>
>>>>>>> modinfo dm_round_robin
>>>>>>> filename:     /lib/modules/2.6.18-6-amd64/kernel/drivers/md/dm-round-robin.ko
>>>>>>> description:  device-mapper round-robin multipath path selector
>>>>>>> author:       Sistina Software <dm-de...@redhat.com>
>>>>>>> license:      GPL
>>>>>>> vermagic:     2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>>>>>> depends:      dm-multipath
>>>>>>>
>>>>>>> There is no self-compiled software; just the official repository was
>>>>>>> used.
>>>>>>> The nodes are connected to our two independent SANs. The storage systems
>>>>>>> are EMC Clariion CX3-20f and EMC Clariion CX500.
>>>>>>>
>>>>>>> multipath.conf:
>>>>>>> defaults {
>>>>>>>         rr_min_io               1000
>>>>>>>         polling_interval        2
>>>>>>>         no_path_retry           5
>>>>>>>         user_friendly_names     yes
>>>>>>> }
>>>>>>>
>>>>>>> blacklist {
>>>>>>>         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
>>>>>>>         devnode "^hd[a-z][[0-9]*]"
>>>>>>>         devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
>>>>>>>         device {
>>>>>>>                 vendor  "DGC"
>>>>>>>                 product "LUNZ"      # EMC Management LUN
>>>>>>>         }
>>>>>>>         device {
>>>>>>>                 vendor  "ATA"       # We do not need multipathing for local drives
>>>>>>>                 product "*"
>>>>>>>         }
>>>>>>>         device {
>>>>>>>                 vendor  "AMI"       # No multipathing for SUN Virtual devices
>>>>>>>                 product "*"
>>>>>>>         }
>>>>>>>         device {
>>>>>>>                 vendor  "HITACHI"   # No multipathing for local scsi disks
>>>>>>>                 product "H101414SCSUN146G"
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> devices {
>>>>>>>         ## Device attributes for EMC CLARiiON
>>>>>>>         device {
>>>>>>>                 vendor                  "DGC"
>>>>>>>                 product                 "*"
>>>>>>>                 path_grouping_policy    group_by_prio
>>>>>>>                 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
>>>>>>>                 prio_callout            "/sbin/mpath_prio_emc /dev/%n"
>>>>>>>                 hardware_handler        "1 emc"
>>>>>>>                 features                "1 queue_if_no_path"
>>>>>>>                 no_path_retry           fail
>>>>>>>                 path_checker            emc_clariion
>>>>>>>                 path_selector           "round-robin 0"
>>>>>>>                 failback                immediate
>>>>>>>                 user_friendly_names     yes
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> multipaths {
>>>>>>>         multipath {
>>>>>>>                 wwid    3600601603ac511001c7c92fec775dd11
>>>>>>>                 alias   stosan01_lun070
>>>>>>>         }
>>>>>>> }
>>>>>>>
>>>>>>> multipath -ll:
>>>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID 5
>>>>>>> [size=133G][features=0][hwhandler=1 emc]
>>>>>>> \_ round-robin 0 [prio=2][active]
>>>>>>>  \_ 0:0:1:1 sdd 8:48   [active][ready]
>>>>>>>  \_ 1:0:1:1 sdh 8:112  [active][ready]
>>>>>>> \_ round-robin 0 [prio=0][enabled]
>>>>>>>  \_ 0:0:0:1 sdb 8:16   [active][ready]
>>>>>>>  \_ 1:0:0:1 sdf 8:80   [active][ready]
>>>>>>>
>>>>>>> As we use lvm2, we added /dev/sd* to the filter:
>>>>>>> filter = [ "r|/dev/cdrom|", "r|/dev/sd.*|" ]
>>>>>>>
>>>>>>> Here is what happened, and what we did to reconstruct the situation to find a solution:
>>>>>>>
>>>>>>> On 02.06.2009 we did something wrong with the zoning on one of our two SANs and all servers (about 40) lost one path to the SAN. Only two servers crashed. Those two are our Debian etch heartbeat cluster described above.
>>>>>>> The console showed a kernel panic because ocfs2 was fencing both nodes.
>>>>>>>
>>>>>>> This was the message:
>>>>>>> O2hb_write_timeout:165 ERROR: Heartbeat write timeout to device dm-7 after 12000 milliseconds
>>>>>>>
>>>>>>> So we decided to change the o2cb settings to:
>>>>>>> O2CB_HEARTBEAT_THRESHOLD=31
>>>>>>> O2CB_IDLE_TIMEOUT_MS=30000
>>>>>>> O2CB_KEEPALIVE_DELAY_MS=2000
>>>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>>>
>>>>>>> We switched all cluster resources to the 1st node to test the new settings on the second node. We removed the 2nd node from the zoning (we also tested shutting down the port, with the same result) and got a different error, but still ended up with a kernel panic:
>>>>>>>
>>>>>>> Jun 4 16:41:05 defr1elcbtd02 kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at 192.168.0.101:7777
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: rport-0:0-0: blocked FC remote port time out: removing target and saving binding
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: rport-0:0-1: blocked FC remote port time out: removing target and saving binding
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: sd 0:0:1:1: SCSI error: return code = 0x00010000
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: end_request: I/O error, dev sdd, sector 1672
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath: Failing path 8:48.
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath: Failing path 8:16.
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: scsi 0:0:1:1: rejecting I/O to device being removed
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: long trespass command will be send
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: honor reservation bit will not be set (default)
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: table: 253:7: multipath: error getting device
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: ioctl: error adding target to table
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: long trespass command will be send
>>>>>>> Jun 4 16:41:27 defr1elcbtd02 kernel: device-mapper: multipath emc: honor reservation bit will not be set (default)
>>>>>>> Jun 4 16:41:29 defr1elcbtd02 kernel: device-mapper: multipath emc: emc_pg_init: sending switch-over command
>>>>>>> Jun 4 16:42:01 defr1elcbtd02 kernel: (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
>>>>>>> Jun 4 16:42:01 defr1elcbtd02 kernel: (10751,1):dlm_wait_for_node_death:374 5EE89BC01EFC405E9197C198DEEAE678: waiting 5000ms for notification of death of node 0
>>>>>>> Jun 4 16:42:07 defr1elcbtd02 kernel: (10751,1):dlm_send_remote_convert_request:395 ERROR: status = -107
>>>>>>> Jun 4 16:42:07 defr1elcbtd02 kernel: (10751,1):dlm_wait_for_node_death:374 5EE89BC01EFC405E9197C198DEEAE678: waiting 5000ms for notification of death of node 0
>>>>>>> [...]
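A small aside on the dlm lines above: kernel functions report failures as negative errno values, and on Linux 107 is ENOTCONN ("Transport endpoint is not connected"). A minimal sketch, with the Linux value hardcoded for clarity (this helper is illustrative only, not part of any OCFS2 tooling):

```python
# Decode the "status = -107" seen in the dlm messages above.
# The mapping is hardcoded with the Linux errno value; on other platforms
# the numeric value of ENOTCONN differs.
LINUX_ERRNO_NAMES = {107: "ENOTCONN"}  # "Transport endpoint is not connected"

def decode_kernel_status(status: int) -> str:
    """Map a negative kernel return code to its Linux errno name."""
    return LINUX_ERRNO_NAMES.get(-status, f"unknown ({status})")

print(decode_kernel_status(-107))  # ENOTCONN
```

In other words, the lock-convert requests to node 0 fail because the o2net connection is already gone, and the dlm retries while waiting for node 0 to be declared dead.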
>>>>>>> After 60 seconds:
>>>>>>>
>>>>>>> (8,0):o2quo_make_decision:143 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
>>>>>>>
>>>>>>> multipath -ll changed to:
>>>>>>> stosan01_lun070 (3600601603ac511001c7c92fec775dd11) dm-7 DGC,RAID 5
>>>>>>> [size=133G][features=0][hwhandler=1 emc]
>>>>>>> \_ round-robin 0 [prio=1][active]
>>>>>>>  \_ 0:0:1:1 sdd 8:48  [active][ready]
>>>>>>> \_ round-robin 0 [prio=0][enabled]
>>>>>>>  \_ 0:0:0:1 sdb 8:16  [active][ready]
>>>>>>>
>>>>>>> The ocfs2 filesystem is still mounted and writable. Even if I enable the zoning (or the FC port) again within the 60 seconds, ocfs2 does not reconnect to node 1 and panics the kernel after 60 seconds, while multipath -ll shows both paths again.
>>>>>>>
>>>>>>> I do not understand at all what the Ethernet heartbeat connection of ocfs2 has to do with the SAN connection.
>>>>>>>
>>>>>>> The strangest thing of all is that this does not always happen! After some reboots the system keeps running stable even if I shut down an FC port and enable it again many times. There is no consistent behaviour... It happens most of the time, but in about 10% of cases it does not happen and everything works as intended.
>>>>>>>
>>>>>>> Any explanations or ideas what causes this behaviour?
>>>>>>>
>>>>>>> I will test this on Debian lenny to see if the Debian version makes a difference.
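As a rough cross-check of the timeouts quoted above: under the commonly documented o2cb semantics, the disk heartbeat thread writes every 2 seconds and a node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) iterations. A sketch of that arithmetic (the helper is illustrative, not part of the o2cb tools):

```python
# Hedged sketch of the documented o2cb conversion: the disk heartbeat
# interval is 2 seconds, so the effective dead time is approximately
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2000 ms.
def o2cb_disk_dead_time_ms(threshold: int, interval_ms: int = 2000) -> int:
    """Approximate time before a node is declared dead on disk."""
    return (threshold - 1) * interval_ms

# Old default threshold of 7 matches the "12000 milliseconds" write
# timeout quoted earlier; the new setting of 31 gives 60000 ms.
print(o2cb_disk_dead_time_ms(7), o2cb_disk_dead_time_ms(31))  # 12000 60000
```

With the threshold set to 31 the disk dead time is 60 seconds, which lines up with the node panicking roughly 60 seconds after the o2net disconnect message.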
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Florian

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users