Joe,

Thanks again for the reply.

Your theory makes sense to me, but I'm still not seeing a solution from here... Can you (or anyone else) help me to:

1. Determine why it's trying to connect to some server via RDMA (my 
nfs-server.vol config seems like an obvious suspect, but I'm not sure), and 
which server it is (see the grep sketch after this list),

2. Determine why that connection is failing (was this part of the RDMA bug in 
3.5.3?),

3. Correct the bit of configuration causing 1) and 2) above.

4. Explain whether there are any (significant) pros or cons to using the RDMA 
transport versus the TCP transport (assuming both function over a 20Gb 
InfiniBand connection).
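
A minimal diagnostic sketch for 1), assuming the default nfs volfile path that 
Rafi mentions later in this thread (adjust the path if GlusterFS was built from 
source):

# Look for any rdma transport references left in the generated NFS volfile
grep -n -i rdma /var/lib/glusterd/nfs/nfs-server.vol

# The protocol/client translators in that file name the servers and ports it dials
grep -n -A 3 protocol/client /var/lib/glusterd/nfs/nfs-server.vol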

Thanks again!

Regards,
Jon Heese

On Mar 26, 2015, at 4:20 PM, "Joe Julian" <j...@julianfamily.org> wrote:

Every 3 seconds implies, to me, that it's trying to reconnect to a server.

On 03/26/2015 01:12 PM, Jonathan Heese wrote:

Joe,


Hmmm.... But every 3 seconds for all eternity? Seems a bit much for a 
"warning", doesn't it?


Did you see my last reply? My nfs-server.vol file seems to indicate that RDMA 
is still in use in some capacity... Is this normal? If not, how can I reconcile 
this?


Thanks.


Regards,

Jon Heese


________________________________
From: gluster-users-boun...@gluster.org on behalf of Joe Julian 
<j...@julianfamily.org>
Sent: Thursday, March 26, 2015 4:08 PM
To: gluster-users@gluster.org
Subject: Re: [Gluster-users] I/O error on replicated volume

The RDMA warnings are not relevant if you don't use RDMA. It's simply pointing 
out that it tried to register and it couldn't, which would be expected if your 
system doesn't support it.

On 03/23/2015 12:29 AM, Mohammed Rafi K C wrote:

On 03/23/2015 11:28 AM, Jonathan Heese wrote:
On Mar 23, 2015, at 1:20 AM, "Mohammed Rafi K C" <rkavu...@redhat.com> wrote:


On 03/21/2015 07:49 PM, Jonathan Heese wrote:

Mohammed,


I have completed the steps you suggested (unmount all, stop the volume, set the 
config.transport to tcp, start the volume, mount, etc.), and the behavior has 
indeed changed.


[root@duke ~]# gluster volume info

Volume Name: gluster_disk
Type: Replicate
Volume ID: 2307a5a8-641e-44f4-8eaf-7cc2b704aafd
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: duke-ib:/bricks/brick1
Brick2: duchess-ib:/bricks/brick1
Options Reconfigured:
config.transport: tcp

[root@duke ~]# gluster volume status
Status of volume: gluster_disk
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick duke-ib:/bricks/brick1                            49152   Y       16362
Brick duchess-ib:/bricks/brick1                         49152   Y       14155
NFS Server on localhost                                 2049    Y       16374
Self-heal Daemon on localhost                           N/A     Y       16381
NFS Server on duchess-ib                                2049    Y       14167
Self-heal Daemon on duchess-ib                          N/A     Y       14174

Task Status of Volume gluster_disk
------------------------------------------------------------------------------
There are no active volume tasks


I am no longer seeing the I/O errors during prolonged periods of write I/O that 
I was seeing when the transport was set to rdma. However, I am seeing this 
message on both nodes every 3 seconds (almost exactly):


==> /var/log/glusterfs/nfs.log <==
[2015-03-21 14:17:40.379719] W [rdma.c:1076:gf_rdma_cm_event_handler] 
0-gluster_disk-client-1: cma event RDMA_CM_EVENT_REJECTED, error 8 
(me:10.10.10.1:1023 peer:10.10.10.2:49152)


Is this something to worry about?

If you are not using nfs to export the volumes, there is nothing to worry about.

I'm using the native glusterfs FUSE component to mount the volume locally on 
both servers -- I assume that you're referring to the standard NFS protocol 
stuff, which I'm not using here.

Incidentally, I would like to keep my logs from filling up with junk if 
possible.  Is there something I can do to get rid of these (useless?) error 
messages?

If I understand correctly, you are getting this flood of log messages from the 
nfs log only, and all the other logs are fine now, right? If that is the case, 
and you are not using nfs to export the volume at all, as a workaround you can 
disable nfs for your volume or cluster (gluster v set nfs.disable on). This 
will turn off the gluster nfs server, and you will no longer get those log 
messages.
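
For example, with the volume name used earlier in this thread, that would look 
something like the following (a sketch; run it on any node in the trusted pool):

# Disable the Gluster NFS server for this volume; the nfs.log flood should stop
gluster volume set gluster_disk nfs.disable on

# To bring the Gluster NFS server back later:
gluster volume set gluster_disk nfs.disable off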



Any idea why there are rdma pieces in play when I've set my transport to tcp?

There should not be any piece of rdma left. If possible, can you paste the 
volfile for the nfs server? You can find the volfile at 
/var/lib/glusterd/nfs/nfs-server.vol or 
/usr/local/var/lib/glusterd/nfs/nfs-server.vol.

I will get this for you when I can.  Thanks.

If you can provide it, that would be a great help in understanding the problem.


Rafi KC


Regards,
Jon Heese

Rafi KC

The actual I/O appears to be handled properly and I've seen no further errors 
in the testing I've done so far.


Thanks.


Regards,

Jon Heese


________________________________
From: gluster-users-boun...@gluster.org on behalf of Jonathan Heese 
<jhe...@inetu.net>
Sent: Friday, March 20, 2015 7:04 AM
To: Mohammed Rafi K C
Cc: gluster-users
Subject: Re: [Gluster-users] I/O error on replicated volume

Mohammed,

Thanks very much for the reply.  I will try that and report back.

Regards,
Jon Heese

On Mar 20, 2015, at 3:26 AM, "Mohammed Rafi K C" 
<rkavu...@redhat.com<mailto:rkavu...@redhat.com>> wrote:


On 03/19/2015 10:16 PM, Jonathan Heese wrote:
Hello all,

Does anyone else have any further suggestions for troubleshooting this?

To sum up: I have a 2 node 2 brick replicated volume, which holds a handful of 
iSCSI image files which are mounted and served up by tgtd (CentOS 6) to a 
handful of devices on a dedicated iSCSI network.  The most important iSCSI 
clients (initiators) are four VMware ESXi 5.5 hosts that use the iSCSI volumes 
as backing for their datastores for virtual machine storage.

After a few minutes of sustained writing to the volume, I am seeing a massive 
flood (over 1500 per second at times) of this error in 
/var/log/glusterfs/mnt-gluster-disk.log:
[2015-03-16 02:24:07.582801] W [fuse-bridge.c:2242:fuse_writev_cbk] 
0-glusterfs-fuse: 635358: WRITE => -1 (Input/output error)

When this happens, the ESXi box fails its write operation and returns an error 
to the effect of “Unable to write data to datastore”.  I don’t see anything 
else in the supporting logs to explain the root cause of the i/o errors.

Any and all suggestions are appreciated.  Thanks.


From the mount logs, I assume that your volume transport type is rdma. There 
are some known issues with rdma in 3.5.3, and the patches to address those 
issues have already been sent upstream [1]. From the logs alone it is hard to 
tell whether this problem is related to the rdma transport or not. To make sure 
that the tcp transport works well in this scenario, can you try to reproduce 
the same issue using a tcp-type volume, if possible? You can change the 
transport type of the volume with the following steps (not recommended in 
normal use):

1) unmount every client
2) stop the volume
3) run gluster volume set volname config.transport tcp
4) start the volume again
5) mount the clients
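
Concretely, with the volume name from this thread, those steps might look like 
the following (a sketch; the mount point and mount source here are illustrative 
assumptions, not taken from the setup above):

# 1) On every client: unmount the volume
umount /mnt/gluster-disk

# 2-4) On one server: stop the volume, switch the transport, start it again
gluster volume stop gluster_disk
gluster volume set gluster_disk config.transport tcp
gluster volume start gluster_disk

# 5) On every client: mount again, now over tcp
mount -t glusterfs duke-ib:/gluster_disk /mnt/gluster-disk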

[1] : http://goo.gl/2PTL61

Regards
Rafi KC

Jon Heese
Systems Engineer
INetU Managed Hosting
P: 610.266.7441 x 261
F: 610.266.7434
www.inetu.net
** This message contains confidential information, which also may be 
privileged, and is intended only for the person(s) addressed above. Any 
unauthorized use, distribution, copying or disclosure of confidential and/or 
privileged information is strictly prohibited. If you have received this 
communication in error, please erase all copies of the message and its 
attachments and notify the sender immediately via reply e-mail. **

From: Jonathan Heese
Sent: Tuesday, March 17, 2015 12:36 PM
To: 'Ravishankar N'; gluster-users@gluster.org
Subject: RE: [Gluster-users] I/O error on replicated volume

Ravi,

The last lines in the mount log before the massive vomit of I/O errors are from 
22 minutes prior, and seem innocuous to me:

[2015-03-16 01:37:07.126340] E 
[client-handshake.c:1760:client_query_portmap_cbk] 0-gluster_disk-client-0: 
failed to get the port number for remote subvolume. Please run 'gluster volume 
status' on server to see if brick process is running.
[2015-03-16 01:37:07.126587] W [rdma.c:4273:gf_rdma_disconnect] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995] 
(-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
 [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-0: disconnect called 
(peer:10.10.10.1:24008)
[2015-03-16 01:37:07.126687] E 
[client-handshake.c:1760:client_query_portmap_cbk] 0-gluster_disk-client-1: 
failed to get the port number for remote subvolume. Please run 'gluster volume 
status' on server to see if brick process is running.
[2015-03-16 01:37:07.126737] W [rdma.c:4273:gf_rdma_disconnect] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995] 
(-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
 [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-1: disconnect called 
(peer:10.10.10.2:24008)
[2015-03-16 01:37:10.730165] I [rpc-clnt.c:1729:rpc_clnt_reconfig] 
0-gluster_disk-client-0: changing port to 49152 (from 0)
[2015-03-16 01:37:10.730276] W [rdma.c:4273:gf_rdma_disconnect] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995] 
(-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
 [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-0: disconnect called 
(peer:10.10.10.1:24008)
[2015-03-16 01:37:10.739500] I [rpc-clnt.c:1729:rpc_clnt_reconfig] 
0-gluster_disk-client-1: changing port to 49152 (from 0)
[2015-03-16 01:37:10.739560] W [rdma.c:4273:gf_rdma_disconnect] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x13f) [0x7fd9c557bccf] 
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x7fd9c557a995] 
(-->/usr/lib64/glusterfs/3.5.3/xlator/protocol/client.so(client_query_portmap_cbk+0x1ea)
 [0x7fd9c0d8fb9a]))) 0-gluster_disk-client-1: disconnect called 
(peer:10.10.10.2:24008)
[2015-03-16 01:37:10.741883] I 
[client-handshake.c:1677:select_server_supported_programs] 
0-gluster_disk-client-0: Using Program GlusterFS 3.3, Num (1298437), Version 
(330)
[2015-03-16 01:37:10.744524] I [client-handshake.c:1462:client_setvolume_cbk] 
0-gluster_disk-client-0: Connected to 10.10.10.1:49152, attached to remote 
volume '/bricks/brick1'.
[2015-03-16 01:37:10.744537] I [client-handshake.c:1474:client_setvolume_cbk] 
0-gluster_disk-client-0: Server and Client lk-version numbers are not same, 
reopening the fds
[2015-03-16 01:37:10.744566] I [afr-common.c:4267:afr_notify] 
0-gluster_disk-replicate-0: Subvolume 'gluster_disk-client-0' came back up; 
going online.
[2015-03-16 01:37:10.744627] I 
[client-handshake.c:450:client_set_lk_version_cbk] 0-gluster_disk-client-0: 
Server lk version = 1
[2015-03-16 01:37:10.753037] I 
[client-handshake.c:1677:select_server_supported_programs] 
0-gluster_disk-client-1: Using Program GlusterFS 3.3, Num (1298437), Version 
(330)
[2015-03-16 01:37:10.755657] I [client-handshake.c:1462:client_setvolume_cbk] 
0-gluster_disk-client-1: Connected to 10.10.10.2:49152, attached to remote 
volume '/bricks/brick1'.
[2015-03-16 01:37:10.755676] I [client-handshake.c:1474:client_setvolume_cbk] 
0-gluster_disk-client-1: Server and Client lk-version numbers are not same, 
reopening the fds
[2015-03-16 01:37:10.761945] I [fuse-bridge.c:5016:fuse_graph_setup] 0-fuse: 
switched to graph 0
[2015-03-16 01:37:10.762144] I 
[client-handshake.c:450:client_set_lk_version_cbk] 0-gluster_disk-client-1: 
Server lk version = 1
[2015-03-16 01:37:10.762279] I [fuse-bridge.c:3953:fuse_init] 0-glusterfs-fuse: 
FUSE inited with protocol versions: glusterfs 7.22 kernel 7.14
[2015-03-16 01:59:26.098670] W [fuse-bridge.c:2242:fuse_writev_cbk] 
0-glusterfs-fuse: 292084: WRITE => -1 (Input/output error)
…

I’ve seen no indication of split-brain on any files at any point in this (ever 
since downgrading from 3.6.2 to 3.5.3, which is when this particular issue 
started):
[root@duke gfapi-module-for-linux-target-driver-]# gluster v heal gluster_disk 
info
Brick duke.jonheese.local:/bricks/brick1/
Number of entries: 0

Brick duchess.jonheese.local:/bricks/brick1/
Number of entries: 0

Thanks.

Jon Heese
Systems Engineer
INetU Managed Hosting
P: 610.266.7441 x 261
F: 610.266.7434
www.inetu.net

From: Ravishankar N [mailto:ravishan...@redhat.com]
Sent: Tuesday, March 17, 2015 12:35 AM
To: Jonathan Heese; gluster-users@gluster.org
Subject: Re: [Gluster-users] I/O error on replicated volume


On 03/17/2015 02:14 AM, Jonathan Heese wrote:
Hello,

So I resolved my previous issue with split-brains and the lack of self-healing 
by dropping my installed glusterfs* packages from 3.6.2 to 3.5.3, but now I've 
picked up a new issue, which actually makes normal use of the volume 
practically impossible.

A little background for those not already paying close attention:
I have a 2 node 2 brick replicating volume whose purpose in life is to hold 
iSCSI target files, primarily for use to provide datastores to a VMware ESXi 
cluster.  The plan is to put a handful of image files on the Gluster volume, 
mount them locally on both Gluster nodes, and run tgtd on both, pointed to the 
image files on the mounted gluster volume. Then the ESXi boxes will use 
multipath (active/passive) iSCSI to connect to the nodes, with automatic 
failover in case of planned or unplanned downtime of the Gluster nodes.
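
For reference, the tgtd piece of that plan is a minimal sketch along these 
lines in /etc/tgt/targets.conf (the IQN and image path are hypothetical 
placeholders, not my actual config):

<target iqn.2015-03.local.jonheese:gluster-disk>
    # iSCSI image file living on the FUSE-mounted gluster volume
    backing-store /mnt/gluster-disk/datastore1.img
</target>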

In my most recent round of testing with 3.5.3, I'm seeing a massive failure to 
write data to the volume after about 5-10 minutes, so I've simplified the 
scenario a bit (to minimize the variables) to: both Gluster nodes up, only one 
node (duke) mounted and running tgtd, and just regular (single path) iSCSI from 
a single ESXi server.

About 5-10 minutes into migrating a VM onto the test datastore, 
/var/log/messages on duke gets blasted with a ton of messages exactly like this:
Mar 15 22:24:06 duke tgtd: bs_rdwr_request(180) io error 0x1781e00 2a -1 512 
22971904, Input/output error

And /var/log/glusterfs/mnt-gluster_disk.log gets blasted with a ton of messages 
exactly like this:
[2015-03-16 02:24:07.572279] W [fuse-bridge.c:2242:fuse_writev_cbk] 
0-glusterfs-fuse: 635299: WRITE => -1 (Input/output error)


Are there any messages in the mount log from AFR about split-brain just before 
the above line appears?
Does `gluster v heal <VOLNAME> info` show any files? Performing I/O on files 
that are in split-brain fails with EIO.

-Ravi

And the write operation from VMware's side fails as soon as these messages 
start.

I don't see any other errors (in the log files I know of) indicating the root 
cause of these i/o errors.  I'm sure that this is not enough information to 
tell what's going on, but can anyone help me figure out what to look at next to 
figure this out?

I've also considered using Dan Lambright's libgfapi gluster module for tgtd (or 
something similar) to avoid going through FUSE, but I'm not sure whether that 
would be irrelevant to this problem, since I'm not 100% sure if it lies in FUSE 
or elsewhere.

Thanks!

Jon Heese
Systems Engineer
INetU Managed Hosting
P: 610.266.7441 x 261
F: 610.266.7434
www.inetu.net
















_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
