Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread Sunil Mushran
florian.engelm...@bt.com wrote:
> We tried to use ocfs2 with Vserver, clustered with Heartbeat, but
> Vservers need barrier=1. That did not work on our shared storage with
> ocfs2, but I guess this is not an ocfs2 problem but a device mapper
> problem, since we need to use multipath and LVM2 - right?
> Device mapper does not support barriers, does it?
>   

Yes, ocfs2 has barrier support. Last I checked, dm does not.

> So the only way we could do it is to disable multipathing for one shared
> storage LUN and mount only one path!?
>
> Is there any better solution?
>   
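For what it's worth, a minimal sketch of the single-path workaround asked
about above, assuming the LUN's WWID is known (the WWID, map name and
device names below are placeholders): blacklist just that LUN in
/etc/multipath.conf so only one underlying path is used, then mount that
path with barriers enabled:

# /etc/multipath.conf - exclude only the shared ocfs2 LUN from multipathing
blacklist {
    wwid "36006016012345678901234567890abcd"
}

# remove the existing map for that LUN (placeholder map name), then
# mount one of its underlying paths directly with barriers on
multipath -f mpath0
mount -t ocfs2 -o barrier=1 /dev/sdX /mnt/ocfs2

The obvious trade-off is that this path is no longer redundant, so a
failure of that one HBA port or cable takes the volume down.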




Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread florian.engelmann
Hi Tao and Sunil,
thank you VERY much for your great help! I will reboot node 1 on Sunday
and that should fix my problem (I hope so).

Great filesystem and great support - keep it up!

I have one last question that is a little OT:

We tried to use ocfs2 with Vserver, clustered with Heartbeat, but
Vservers need barrier=1. That did not work on our shared storage with
ocfs2, but I guess this is not an ocfs2 problem but a device mapper
problem, since we need to use multipath and LVM2 - right?
Device mapper does not support barriers, does it?

So the only way we could do it is to disable multipathing for one shared
storage LUN and mount only one path!?

Is there any better solution?


Best regards,
Florian

> 
> Hi Florian,
> 
> florian.engelm...@bt.com wrote:
> > Hi Tao,
> >
> >> Hi Florian,
> >>
> >> florian.engelm...@bt.com wrote:
> >>> Hi Tao,
> >>>
>  Hi florian,
> 
>  florian.engelm...@bt.com wrote:
> >> Florian,
> >> the problem here seems to be with network. The nodes are
running
> >>> into
> >> network heartbeat timeout and hence second node is getting
> > fenced.
> >>> Do
> >> you see o2net thread consuming 100% cpu on any node? if not
then
> >> probably check your network
> >> thanks,
> >> --Srini
> > I forgot to post my /etc/ocfs2/cluster.conf
> > node:
> > ip_port = 
> > ip_address = 192.168.0.101
> > number = 0
> > name = defr1elcbtd01
> > cluster = ocfs2
> >
> > node:
> > ip_port = 
> > ip_address = 192.168.0.102
> > number = 1
> > name = defr1elcbtd02
> > cluster = ocfs2
> >
> > cluster:
> > node_count = 2
> > name = ocfs2
> >
> >
> > 192.168.0.10x is eth3 on both nodes and connected with a cross
> > over
> > cable. No active network component is involved here.
> >
> > defr1elcbtd02:~# traceroute 192.168.0.101
> > traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52
byte
> > packets
> >  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> > defr1elcbtd02:~#
> >
> > The error message looks like a network problem but why should
> > there
> >>> be a
> > network problem if I shutdown a FC port?! I testet it about 20
> > times
> >>> and
> > got about 16 kernel panics starting with the same error message:
> >
> > kernel: o2net: no longer connected to node defr1elcbtd01 (num 0)
> > at
> > 192.168.0.101:
>  It isn't an error message, just a status report that we can't
> > connect
> >>> to
>  that node now. That node may be rebooted or something else, but
> > this
>  node don't know, and it only knows the connection is down.
> >>> But node defr1elcbtd01 was never down and also the network link
> > (eth3)
> >>> wasn't down. I was able to ping from each node to the other.
> >>> Node 1 is hosting all services and never was faulted while I was
> >>> testing.
> >>>
> >>> All I have to do to panic node 2 is to disable one of two fibre
> > channel
> >>> ports or pull one fibre channel cable or delete node 2 from the
> > cisco
> >>> SAN zoning.
> >>> If I apply one of those 3 "errors" I get the message about o2net
is
> > no
> >>> longer connected to node 1 and 60 seconds later the 2nd node
panics
> >>> because of ocfs2 fencing (but this happens only in about 80% of
> > cases -
> >>> in the other 20% of cases o2net does not disconnect and there are
no
> >>> messages about ocfs2 at all - like it should be...).
> >>> Everything else is working fine in these 60 seconds. The
filesystem
> > is
> >>> still writable from both nodes and both nodes can ping each other
> > (via
> >>> the cluster interconnect).
> >> I just checked your log. The error why node 2 get the message is
that
> >> node 1 get the message that node 2 stopped disk heartbeat for quite
a
> >> long time so it stop the connection intentionally. So node 2 get
this
> >> message.
> >>
> >> See the log in node 1:
> >> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224
node
> > 1,
> >> 1 total
> >> Jun  8 09:46:26 defr1elcbtd01 kernel:
(3804,0):o2net_set_nn_state:382
> >> node 1 sc: 81007ddf4400 -> , valid 1 -> 0, err
0
> > ->
> >> -107
> >> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296
node
> >> 1, 1 total
> >> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
> > node
> >> defr1elcbtd02 (num 1) at 192.168.0.102:
> >>
> >> And I guess the reason why you see this log sometimes(80%) is that
the
> >> time interval. You know ocfs2 disk heartbeat try every 2 secs so
> >> sometimes node 2 panic before node 1 call o2quo_hb_down and
sometimes
> >> node2 panic after node 1 call o2quo_hb_down(which will put
something
> >> like "no longer..." in node 2's log).
> >>
> >> So would you please give your timeout configuration(o2cb)?
> >
> > These are my setting on node 2:
> >
> > O2CB_HEARTBEAT

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread Tao Ma
Hi Florian,

florian.engelm...@bt.com wrote:
> Hi Tao,
> 
>> Hi Florian,
>>
>> florian.engelm...@bt.com wrote:
>>> Hi Tao,
>>>
 Hi florian,

 florian.engelm...@bt.com wrote:
>> Florian,
>> the problem here seems to be with network. The nodes are running
>>> into
>> network heartbeat timeout and hence second node is getting
> fenced.
>>> Do
>> you see o2net thread consuming 100% cpu on any node? if not then
>> probably check your network
>> thanks,
>> --Srini
> I forgot to post my /etc/ocfs2/cluster.conf
> node:
> ip_port = 
> ip_address = 192.168.0.101
> number = 0
> name = defr1elcbtd01
> cluster = ocfs2
>
> node:
> ip_port = 
> ip_address = 192.168.0.102
> number = 1
> name = defr1elcbtd02
> cluster = ocfs2
>
> cluster:
> node_count = 2
> name = ocfs2
>
>
> 192.168.0.10x is eth3 on both nodes and connected with a cross
> over
> cable. No active network component is involved here.
>
> defr1elcbtd02:~# traceroute 192.168.0.101
> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
> packets
>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> defr1elcbtd02:~#
>
> The error message looks like a network problem but why should
> there
>>> be a
> network problem if I shutdown a FC port?! I testet it about 20
> times
>>> and
> got about 16 kernel panics starting with the same error message:
>
> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0)
> at
> 192.168.0.101:
 It isn't an error message, just a status report that we can't
> connect
>>> to
 that node now. That node may be rebooted or something else, but
> this
 node don't know, and it only knows the connection is down.
>>> But node defr1elcbtd01 was never down and also the network link
> (eth3)
>>> wasn't down. I was able to ping from each node to the other.
>>> Node 1 is hosting all services and never was faulted while I was
>>> testing.
>>>
>>> All I have to do to panic node 2 is to disable one of two fibre
> channel
>>> ports or pull one fibre channel cable or delete node 2 from the
> cisco
>>> SAN zoning.
>>> If I apply one of those 3 "errors" I get the message about o2net is
> no
>>> longer connected to node 1 and 60 seconds later the 2nd node panics
>>> because of ocfs2 fencing (but this happens only in about 80% of
> cases -
>>> in the other 20% of cases o2net does not disconnect and there are no
>>> messages about ocfs2 at all - like it should be...).
>>> Everything else is working fine in these 60 seconds. The filesystem
> is
>>> still writable from both nodes and both nodes can ping each other
> (via
>>> the cluster interconnect).
>> I just checked your log. The error why node 2 get the message is that
>> node 1 get the message that node 2 stopped disk heartbeat for quite a
>> long time so it stop the connection intentionally. So node 2 get this
>> message.
>>
>> See the log in node 1:
>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node
> 1,
>> 1 total
>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
>> node 1 sc: 81007ddf4400 -> , valid 1 -> 0, err 0
> ->
>> -107
>> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
>> 1, 1 total
>> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
> node
>> defr1elcbtd02 (num 1) at 192.168.0.102:
>>
>> And I guess the reason why you see this log sometimes(80%) is that the
>> time interval. You know ocfs2 disk heartbeat try every 2 secs so
>> sometimes node 2 panic before node 1 call o2quo_hb_down and sometimes
>> node2 panic after node 1 call o2quo_hb_down(which will put something
>> like "no longer..." in node 2's log).
>>
>> So would you please give your timeout configuration(o2cb)?
> 
> These are my setting on node 2:
> 
> O2CB_HEARTBEAT_THRESHOLD=61
> O2CB_IDLE_TIMEOUT_MS=6
> O2CB_KEEPALIVE_DELAY_MS=4000
> O2CB_RECONNECT_DELAY_MS=4000
ocfs2 does not allow two nodes to have different timeouts. If node 1 and
node 2 don't have the same configuration, node 2 won't be allowed to join
the domain and mount the same volume.
That said, the new settings on node 2 look much better. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#TIMEOUT
for details.
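As a quick sanity check that both nodes really use the same o2cb
timeouts, assuming ssh access from node 2 to node 1, something like this
works:

ssh defr1elcbtd01 cat /etc/default/o2cb | diff /etc/default/o2cb -

Note that this only compares the files on disk; the values the running
cluster stack was started with can still differ until the node is
restarted, which is exactly the situation on node 1 described below.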

I just went through the whole thread, and the panic on storage failure is
deliberate behavior: a node that has lost disk access has no reason to
survive. See
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.2/ocfs2_faq.html#QUORUM

Regards,
Tao
> 
> On node 1 I still got the old setting because there was no downtime to
> reboot this system till today. Is there any way to change the values
> without a reboot? The system is mission critical and I can only reboot
> on Sundays.
> 
> Settings on node 1 are the default s

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread florian.engelmann
Hi Tao,

> 
> Hi Florian,
> 
> florian.engelm...@bt.com wrote:
> > Hi Tao,
> >
> >> Hi florian,
> >>
> >> florian.engelm...@bt.com wrote:
>  Florian,
>  the problem here seems to be with network. The nodes are running
> > into
>  network heartbeat timeout and hence second node is getting
fenced.
> > Do
>  you see o2net thread consuming 100% cpu on any node? if not then
>  probably check your network
>  thanks,
>  --Srini
> >>> I forgot to post my /etc/ocfs2/cluster.conf
> >>> node:
> >>> ip_port = 
> >>> ip_address = 192.168.0.101
> >>> number = 0
> >>> name = defr1elcbtd01
> >>> cluster = ocfs2
> >>>
> >>> node:
> >>> ip_port = 
> >>> ip_address = 192.168.0.102
> >>> number = 1
> >>> name = defr1elcbtd02
> >>> cluster = ocfs2
> >>>
> >>> cluster:
> >>> node_count = 2
> >>> name = ocfs2
> >>>
> >>>
> >>> 192.168.0.10x is eth3 on both nodes and connected with a cross
over
> >>> cable. No active network component is involved here.
> >>>
> >>> defr1elcbtd02:~# traceroute 192.168.0.101
> >>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
> >>> packets
> >>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> >>> defr1elcbtd02:~#
> >>>
> >>> The error message looks like a network problem but why should
there
> > be a
> >>> network problem if I shutdown a FC port?! I testet it about 20
times
> > and
> >>> got about 16 kernel panics starting with the same error message:
> >>>
> >>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0)
at
> >>> 192.168.0.101:
> >> It isn't an error message, just a status report that we can't
connect
> > to
> >> that node now. That node may be rebooted or something else, but
this
> >> node don't know, and it only knows the connection is down.
> >
> > But node defr1elcbtd01 was never down and also the network link
(eth3)
> > wasn't down. I was able to ping from each node to the other.
> > Node 1 is hosting all services and never was faulted while I was
> > testing.
> >
> > All I have to do to panic node 2 is to disable one of two fibre
channel
> > ports or pull one fibre channel cable or delete node 2 from the
cisco
> > SAN zoning.
> > If I apply one of those 3 "errors" I get the message about o2net is
no
> > longer connected to node 1 and 60 seconds later the 2nd node panics
> > because of ocfs2 fencing (but this happens only in about 80% of
cases -
> > in the other 20% of cases o2net does not disconnect and there are no
> > messages about ocfs2 at all - like it should be...).
> > Everything else is working fine in these 60 seconds. The filesystem
is
> > still writable from both nodes and both nodes can ping each other
(via
> > the cluster interconnect).
> I just checked your log. The error why node 2 get the message is that
> node 1 get the message that node 2 stopped disk heartbeat for quite a
> long time so it stop the connection intentionally. So node 2 get this
> message.
> 
> See the log in node 1:
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node
1,
> 1 total
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
> node 1 sc: 81007ddf4400 -> , valid 1 -> 0, err 0
->
> -107
> Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
> 1, 1 total
> Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to
node
> defr1elcbtd02 (num 1) at 192.168.0.102:
> 
> And I guess the reason why you see this log sometimes(80%) is that the
> time interval. You know ocfs2 disk heartbeat try every 2 secs so
> sometimes node 2 panic before node 1 call o2quo_hb_down and sometimes
> node2 panic after node 1 call o2quo_hb_down(which will put something
> like "no longer..." in node 2's log).
> 
> So would you please give your timeout configuration(o2cb)?

These are my settings on node 2:

O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=6
O2CB_KEEPALIVE_DELAY_MS=4000
O2CB_RECONNECT_DELAY_MS=4000

On node 1 I still have the old settings because there has been no downtime
to reboot that system so far. Is there any way to change the values
without a reboot? The system is mission critical and I can only reboot
on Sundays.

The settings on node 1 are the defaults that came with the Debian
package. I have already changed them to match node 2, but that will only
take effect after the next reboot - so I can only guess the currently
active values are:

O2CB_HEARTBEAT_THRESHOLD=7
O2CB_IDLE_TIMEOUT_MS=12000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
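For reference, per the ocfs2 FAQ (linked elsewhere in this thread), the
disk heartbeat threshold works out to roughly
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds before a node is declared
dead, so the two nodes were running very different windows:

node 2: (61 - 1) * 2 = 120 seconds disk heartbeat timeout
node 1: ( 7 - 1) * 2 =  12 seconds disk heartbeat timeout, with a
        12000 ms network idle timeout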

Regards,
Florian



> 
> Regards,
> Tao
> 
> 
> 
> >
> > Here are the logs with debug logging:
> >
> > Node 2:
> >
> > Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx :04:00.0: LOOP
DOWN
> > detected (2).
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> > 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
> > 81007fafbb00 pg_off 0] put
> > Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> > 

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread Tao Ma
Hi Florian,

florian.engelm...@bt.com wrote:
> Hi Tao,
> 
>> Hi florian,
>>
>> florian.engelm...@bt.com wrote:
 Florian,
 the problem here seems to be with network. The nodes are running
> into
 network heartbeat timeout and hence second node is getting fenced.
> Do
 you see o2net thread consuming 100% cpu on any node? if not then
 probably check your network
 thanks,
 --Srini
>>> I forgot to post my /etc/ocfs2/cluster.conf
>>> node:
>>> ip_port = 
>>> ip_address = 192.168.0.101
>>> number = 0
>>> name = defr1elcbtd01
>>> cluster = ocfs2
>>>
>>> node:
>>> ip_port = 
>>> ip_address = 192.168.0.102
>>> number = 1
>>> name = defr1elcbtd02
>>> cluster = ocfs2
>>>
>>> cluster:
>>> node_count = 2
>>> name = ocfs2
>>>
>>>
>>> 192.168.0.10x is eth3 on both nodes and connected with a cross over
>>> cable. No active network component is involved here.
>>>
>>> defr1elcbtd02:~# traceroute 192.168.0.101
>>> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
>>> packets
>>>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
>>> defr1elcbtd02:~#
>>>
>>> The error message looks like a network problem but why should there
> be a
>>> network problem if I shutdown a FC port?! I testet it about 20 times
> and
>>> got about 16 kernel panics starting with the same error message:
>>>
>>> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
>>> 192.168.0.101:
>> It isn't an error message, just a status report that we can't connect
> to
>> that node now. That node may be rebooted or something else, but this
>> node don't know, and it only knows the connection is down.
> 
> But node defr1elcbtd01 was never down and also the network link (eth3)
> wasn't down. I was able to ping from each node to the other.
> Node 1 is hosting all services and never was faulted while I was
> testing.
> 
> All I have to do to panic node 2 is to disable one of two fibre channel
> ports or pull one fibre channel cable or delete node 2 from the cisco
> SAN zoning.
> If I apply one of those 3 "errors" I get the message about o2net is no
> longer connected to node 1 and 60 seconds later the 2nd node panics
> because of ocfs2 fencing (but this happens only in about 80% of cases -
> in the other 20% of cases o2net does not disconnect and there are no
> messages about ocfs2 at all - like it should be...).
> Everything else is working fine in these 60 seconds. The filesystem is
> still writable from both nodes and both nodes can ping each other (via
> the cluster interconnect).
I just checked your log. The reason node 2 gets that message is that
node 1 noticed that node 2 had stopped its disk heartbeat for quite a
long time, so node 1 dropped the connection intentionally. That is why
node 2 sees the message.

See the log in node 1:
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_hb_down:224 node 1,
1 total
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2net_set_nn_state:382
node 1 sc: 81007ddf4400 -> , valid 1 -> 0, err 0 ->
-107
Jun  8 09:46:26 defr1elcbtd01 kernel: (3804,0):o2quo_conn_err:296 node
1, 1 total
Jun  8 09:46:26 defr1elcbtd01 kernel: o2net: no longer connected to node
defr1elcbtd02 (num 1) at 192.168.0.102:

And I guess the reason you only see this log sometimes (80%) is the
timing. The ocfs2 disk heartbeat fires every 2 seconds, so sometimes
node 2 panics before node 1 calls o2quo_hb_down and sometimes node 2
panics after node 1 calls o2quo_hb_down (which is what puts the
"no longer connected..." line in node 2's log).

So would you please give your timeout configuration(o2cb)?

Regards,
Tao



> 
> Here are the logs with debug logging:
> 
> Node 2:
> 
> Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx :04:00.0: LOOP DOWN
> detected (2).
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
> 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
> 81007fafbb00 pg_off 0] put
> Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
> 81007c2f0800 refs 2 sock 8100694138c0 node 0 page
> 81007fafbb00 pg_off 0] data_ready hit
> Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
> 81007c2f0800 refs 2 sock 8100694138c0 node 0 page
> 81007fafbb00 pg_off 0] get
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
> 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
> 81007fafbb00 pg_off 0] receiving
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
> [mag 64088 len 0 typ 0 stat 0 sys_stat 0 key  num 0] at page_off
> 24
> Jun  8 09:46:11 defr1elcbtd02 kernel:
> (3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
> sys_stat 0 key  num 0] processing message
> Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
> 81007c2f0800 refs 3 sock 8100694138c0 node 0 page
> 81007fafbb00 pg_off 24] get
> Jun  8 09:46:1

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread florian.engelmann
Hi Tao,

> Hi florian,
> 
> florian.engelm...@bt.com wrote:
> >> Florian,
> >> the problem here seems to be with network. The nodes are running
into
> >> network heartbeat timeout and hence second node is getting fenced.
Do
> >> you see o2net thread consuming 100% cpu on any node? if not then
> >> probably check your network
> >> thanks,
> >> --Srini
> >
> > I forgot to post my /etc/ocfs2/cluster.conf
> > node:
> > ip_port = 
> > ip_address = 192.168.0.101
> > number = 0
> > name = defr1elcbtd01
> > cluster = ocfs2
> >
> > node:
> > ip_port = 
> > ip_address = 192.168.0.102
> > number = 1
> > name = defr1elcbtd02
> > cluster = ocfs2
> >
> > cluster:
> > node_count = 2
> > name = ocfs2
> >
> >
> > 192.168.0.10x is eth3 on both nodes and connected with a cross over
> > cable. No active network component is involved here.
> >
> > defr1elcbtd02:~# traceroute 192.168.0.101
> > traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
> > packets
> >  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> > defr1elcbtd02:~#
> >
> > The error message looks like a network problem but why should there
be a
> > network problem if I shutdown a FC port?! I testet it about 20 times
and
> > got about 16 kernel panics starting with the same error message:
> >
> > kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
> > 192.168.0.101:
> It isn't an error message, just a status report that we can't connect
to
> that node now. That node may be rebooted or something else, but this
> node don't know, and it only knows the connection is down.

But node defr1elcbtd01 was never down, and the network link (eth3)
wasn't down either. I was able to ping from each node to the other.
Node 1 is hosting all services and never faulted while I was testing.

All I have to do to panic node 2 is disable one of the two fibre channel
ports, pull one fibre channel cable, or delete node 2 from the Cisco
SAN zoning.
If I apply one of those three "errors" I get the message that o2net is no
longer connected to node 1, and 60 seconds later the 2nd node panics
because of ocfs2 fencing (but this happens only in about 80% of cases -
in the other 20% of cases o2net does not disconnect and there are no
messages about ocfs2 at all, which is how it should be...).
Everything else works fine during those 60 seconds. The filesystem is
still writable from both nodes and both nodes can ping each other (via
the cluster interconnect).

Here are the logs with debug logging:

Node 2:

Jun  8 09:46:11 defr1elcbtd02 kernel: qla2xxx :04:00.0: LOOP DOWN
detected (2).
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
81007c2f0800 refs 3 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] put
Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
81007c2f0800 refs 2 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] data_ready hit
Jun  8 09:46:11 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
81007c2f0800 refs 2 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] get
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
81007c2f0800 refs 3 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] receiving
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1170
[mag 64088 len 0 typ 0 stat 0 sys_stat 0 key  num 0] at page_off
24
Jun  8 09:46:11 defr1elcbtd02 kernel:
(3463,0):o2net_process_message:1015 [mag 64088 len 0 typ 0 stat 0
sys_stat 0 key  num 0] processing message
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_get:294 [sc
81007c2f0800 refs 3 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 24] get
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
81007c2f0800 refs 4 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] ret = 1
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1129 [sc
81007c2f0800 refs 4 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] receiving
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):o2net_advance_rx:1196 [sc
81007c2f0800 refs 4 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] ret = -11
Jun  8 09:46:11 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
81007c2f0800 refs 4 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] put
Jun  8 09:46:16 defr1elcbtd02 kernel: (3463,0):sc_put:289 [sc
81007c2f0800 refs 3 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] put
Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
81007c2f0800 refs 2 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] data_ready hit
Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):sc_get:294 [sc
81007c2f0800 refs 2 sock 8100694138c0 node 0 page
81007fafbb00 pg_off 0] get
Jun  8 09:46:16 defr1elcbtd02 kernel: (0,0):o2net_data_ready:452 [sc
81007c2f0800 refs 3 sock 8100694138c

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread Tao Ma
Hi Florian,

florian.engelm...@bt.com wrote:
>> Florian,
>> the problem here seems to be with network. The nodes are running into
>> network heartbeat timeout and hence second node is getting fenced. Do
>> you see o2net thread consuming 100% cpu on any node? if not then
>> probably check your network
>> thanks,
>> --Srini
> 
> I forgot to post my /etc/ocfs2/cluster.conf
> node:
> ip_port = 
> ip_address = 192.168.0.101
> number = 0
> name = defr1elcbtd01
> cluster = ocfs2
> 
> node:
> ip_port = 
> ip_address = 192.168.0.102
> number = 1
> name = defr1elcbtd02
> cluster = ocfs2
> 
> cluster:
> node_count = 2
> name = ocfs2
> 
> 
> 192.168.0.10x is eth3 on both nodes and connected with a cross over
> cable. No active network component is involved here.
> 
> defr1elcbtd02:~# traceroute 192.168.0.101
> traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
> packets
>  1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
> defr1elcbtd02:~#
> 
> The error message looks like a network problem but why should there be a
> network problem if I shutdown a FC port?! I testet it about 20 times and
> got about 16 kernel panics starting with the same error message:
> 
> kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
> 192.168.0.101: 
It isn't an error message, just a status report that we can't connect to
that node right now. That node may have rebooted or something else may
have happened; this node doesn't know - it only knows the connection is
down.

If there is a problem with an FC port, the affected node will reboot and
the other node will see this message.
> 
> The cluster is running fine if there is no problem with the SAN
> connection.
> 
> How to enable verbose logging with ofcs2?
debugfs.ocfs2 -l will show the current logging status.

If you want to enable a particular log, use e.g.
debugfs.ocfs2 -l DISK_ALLOC allow
and to turn it off again:
debugfs.ocfs2 -l DISK_ALLOC off
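For the symptoms in this thread, the network and heartbeat paths are
probably the interesting ones. A hedged example, assuming these mask
names exist in your ocfs2 build (they mirror the kernel masklog bits):

debugfs.ocfs2 -l TCP allow
debugfs.ocfs2 -l HEARTBEAT allow
debugfs.ocfs2 -l QUORUM allow
# reproduce the FC path failure, watch /var/log/messages, then:
debugfs.ocfs2 -l TCP off
debugfs.ocfs2 -l HEARTBEAT off
debugfs.ocfs2 -l QUORUM off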

Regards,
Tao
> 
> Regards,
> Florian
> 
>> florian.engelm...@bt.com wrote:
>>> Hello,
>>> our Debian etch cluster nodes are panicing because of ocfs2 fencing
> if
>>> one SAN path fails.
>>>
>>> modinfo ocfs2
>>> filename:   /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
>>> author: Oracle
>>> license:GPL
>>> description:OCFS2 1.3.3
>>> version:1.3.3
>>> vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>> depends:ocfs2_dlm,ocfs2_nodemanager,jbd
>>> srcversion: 0798424846E68F10172C203
>>>
>>> modinfo ocfs2_dlmfs
>>> filename:
>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
>>> author: Oracle
>>> license:GPL
>>> description:OCFS2 DLMFS 1.3.3
>>> version:1.3.3
>>> vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>> depends:ocfs2_dlm,ocfs2_nodemanager
>>> srcversion: E3780E12396118282B3C1AD
>>>
>>> defr1elcbtd02:~# modinfo ocfs2_dlm
>>> filename:
>>> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>> author: Oracle
>>> license:GPL
>>> description:OCFS2 DLM 1.3.3
>>> version:1.3.3
>>> vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>> depends:ocfs2_nodemanager
>>> srcversion: 7DC395EA08AE4CE826C5B92
>>>
>>> modinfo ocfs2_nodemanager
>>> filename:
>>>
> /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>> author: Oracle
>>> license:GPL
>>> description:OCFS2 Node Manager 1.3.3
>>> version:1.3.3
>>> vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>> depends:configfs
>>> srcversion: C4C9871302E1910B78DAE40
>>>
>>> modinfo qla2xxx
>>> filename:
>>> /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
>>> author: QLogic Corporation
>>> description:QLogic Fibre Channel HBA Driver
>>> license:GPL
>>> version:8.01.07-k1
>>> vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
>>> depends:scsi_mod,scsi_transport_fc,firmware_class
>>> alias:  pci:v1077d2100sv*sd*bc*sc*i*
>>> alias:  pci:v1077d2200sv*sd*bc*sc*i*
>>> alias:  pci:v1077d2300sv*sd*bc*sc*i*
>>> alias:  pci:v1077d2312sv*sd*bc*sc*i*
>>> alias:  pci:v1077d2322sv*sd*bc*sc*i*
>>> alias:  pci:v1077d6312sv*sd*bc*sc*i*
>>> alias:  pci:v1077d6322sv*sd*bc*sc*i*
>>> alias:  pci:v1077d2422sv*sd*bc*sc*i*
>>> alias:  pci:v1077d2432sv*sd*bc*sc*i*
>>> alias:  pci:v1077d5422sv*sd*bc*sc*i*
>>> alias:  pci:v1077d5432sv*sd*bc*sc*i*
>>> srcversion: B8E1608E257391DCAFD9224
>>> parm:   ql2xfdmienable:Enables FDMI registratons Default is
> 0 -
>>> no FDMI. 1 - perfom FDMI. (int)
>>> parm:   extended_error_logging:Option to enable extended
> error
>>> logging, Default is 0 - no logging. 1 - log errors. (int)
>>> parm:   ql2xall

Re: [Ocfs2-users] ocfs2 fencing with multipath and dual channel HBA

2009-06-08 Thread florian.engelmann

> Florian,
> the problem here seems to be with network. The nodes are running into
> network heartbeat timeout and hence second node is getting fenced. Do
> you see o2net thread consuming 100% cpu on any node? if not then
> probably check your network
> thanks,
> --Srini
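A quick way to check that on each node is to look at the o2net kernel
thread's CPU usage, for example:

ps -eo pid,pcpu,comm | grep o2net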

I forgot to post my /etc/ocfs2/cluster.conf
node:
ip_port = 
ip_address = 192.168.0.101
number = 0
name = defr1elcbtd01
cluster = ocfs2

node:
ip_port = 
ip_address = 192.168.0.102
number = 1
name = defr1elcbtd02
cluster = ocfs2

cluster:
node_count = 2
name = ocfs2


192.168.0.10x is eth3 on both nodes, connected with a crossover cable.
No active network component is involved here.

defr1elcbtd02:~# traceroute 192.168.0.101
traceroute to 192.168.0.101 (192.168.0.101), 30 hops max, 52 byte
packets
 1  node1 (192.168.0.101)  0.220 ms  0.142 ms  0.223 ms
defr1elcbtd02:~#

The error message looks like a network problem, but why should there be
a network problem if I shut down an FC port?! I tested it about 20 times
and got about 16 kernel panics, all starting with the same error message:

kernel: o2net: no longer connected to node defr1elcbtd01 (num 0) at
192.168.0.101: 

The cluster runs fine as long as there is no problem with the SAN
connection.

How do I enable verbose logging with ocfs2?

Regards,
Florian

> 
> florian.engelm...@bt.com wrote:
> > Hello,
> > our Debian etch cluster nodes are panicing because of ocfs2 fencing
if
> > one SAN path fails.
> >
> > modinfo ocfs2
> > filename:   /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/ocfs2.ko
> > author: Oracle
> > license:GPL
> > description:OCFS2 1.3.3
> > version:1.3.3
> > vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
> > depends:ocfs2_dlm,ocfs2_nodemanager,jbd
> > srcversion: 0798424846E68F10172C203
> >
> > modinfo ocfs2_dlmfs
> > filename:
> > /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlmfs.ko
> > author: Oracle
> > license:GPL
> > description:OCFS2 DLMFS 1.3.3
> > version:1.3.3
> > vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
> > depends:ocfs2_dlm,ocfs2_nodemanager
> > srcversion: E3780E12396118282B3C1AD
> >
> > defr1elcbtd02:~# modinfo ocfs2_dlm
> > filename:
> > /lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
> > author: Oracle
> > license:GPL
> > description:OCFS2 DLM 1.3.3
> > version:1.3.3
> > vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
> > depends:ocfs2_nodemanager
> > srcversion: 7DC395EA08AE4CE826C5B92
> >
> > modinfo ocfs2_nodemanager
> > filename:
> >
/lib/modules/2.6.18-6-amd64/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
> > author: Oracle
> > license:GPL
> > description:OCFS2 Node Manager 1.3.3
> > version:1.3.3
> > vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
> > depends:configfs
> > srcversion: C4C9871302E1910B78DAE40
> >
> > modinfo qla2xxx
> > filename:
> > /lib/modules/2.6.18-6-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> > author: QLogic Corporation
> > description:QLogic Fibre Channel HBA Driver
> > license:GPL
> > version:8.01.07-k1
> > vermagic:   2.6.18-6-amd64 SMP mod_unload gcc-4.1
> > depends:scsi_mod,scsi_transport_fc,firmware_class
> > alias:  pci:v1077d2100sv*sd*bc*sc*i*
> > alias:  pci:v1077d2200sv*sd*bc*sc*i*
> > alias:  pci:v1077d2300sv*sd*bc*sc*i*
> > alias:  pci:v1077d2312sv*sd*bc*sc*i*
> > alias:  pci:v1077d2322sv*sd*bc*sc*i*
> > alias:  pci:v1077d6312sv*sd*bc*sc*i*
> > alias:  pci:v1077d6322sv*sd*bc*sc*i*
> > alias:  pci:v1077d2422sv*sd*bc*sc*i*
> > alias:  pci:v1077d2432sv*sd*bc*sc*i*
> > alias:  pci:v1077d5422sv*sd*bc*sc*i*
> > alias:  pci:v1077d5432sv*sd*bc*sc*i*
> > srcversion: B8E1608E257391DCAFD9224
> > parm:   ql2xfdmienable:Enables FDMI registratons Default is
0 -
> > no FDMI. 1 - perfom FDMI. (int)
> > parm:   extended_error_logging:Option to enable extended
error
> > logging, Default is 0 - no logging. 1 - log errors. (int)
> > parm:   ql2xallocfwdump:Option to enable allocation of
memory
> > for a firmware dump during HBA initialization.  Memory allocation
> > requirements vary by ISP type.  Default is 1 - allocate memory.
(int)
> > parm:   ql2xloginretrycount:Specify an alternate value for
the
> > NVRAM login retry count. (int)
> > parm:   ql2xplogiabsentdevice:Option to enable PLOGI to
devices
> > that are not present after a Fabric scan.  This is needed for
several
> > broken switches. Default is 0 - no PLOGI. 1 - perfom PLOGI. (int)
> > parm:   qlport_down_retry:Maximum number of command retries
to a
> > port that returns a PORT-DOWN status. (int)
>