Hi Boris,

Thank you for providing the logs.

The problem here is caused by the "auth.allow: 127.0.0.1" setting on the volume. When you add a new brick, the replication module internally creates a temporary mount of the volume and uses it to set some metadata on the existing bricks, marking the pending heal towards the new brick. Because of the auth.allow setting, that temporary mount gets permission errors, as seen in the logs below, and the add-brick fails.
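(As a rough illustration of the metadata involved: the command below is only a sketch, and the exact trusted.afr.* attribute names depend on the volume and client index, so your output may differ. You can inspect the replication changelog attributes on an existing brick root like this:)

    # Run on a host that carries an existing brick of the volume; shows the
    # AFR changelog xattrs that the temporary mount updates to mark pending
    # heal for the newly added brick. Illustrative only.
    sudo getfattr -d -m trusted.afr -e hex /data/gluster/dockervols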
From data-gluster-dockervols.log-webserver9:

[2019-04-15 14:00:34.226838] I [addr.c:55:compare_addr_and_update] 0-/data/gluster/dockervols: allowed = "127.0.0.1", received addr = "192.168.200.147"
[2019-04-15 14:00:34.226895] E [MSGID: 115004] [authenticate.c:224:gf_authenticate] 0-auth: no authentication module is interested in accepting remote-client (null)
[2019-04-15 14:00:34.227129] E [MSGID: 115001] [server-handshake.c:848:server_setvolume] 0-dockervols-server: Cannot authenticate client from webserver8.cast.org-55674-2019/04/15-14:00:20:495333-dockervols-client-2-0-0 3.12.2 [Permission denied]

From dockervols-add-brick-mount.log:

[2019-04-15 14:00:20.672033] W [MSGID: 114043] [client-handshake.c:1109:client_setvolume_cbk] 0-dockervols-client-2: failed to set the volume [Permission denied]
[2019-04-15 14:00:20.672102] W [MSGID: 114007] [client-handshake.c:1138:client_setvolume_cbk] 0-dockervols-client-2: failed to get 'process-uuid' from reply dict [Invalid argument]
[2019-04-15 14:00:20.672129] E [MSGID: 114044] [client-handshake.c:1144:client_setvolume_cbk] 0-dockervols-client-2: SETVOLUME on remote-host failed: Authentication failed [Permission denied]
[2019-04-15 14:00:20.672151] I [MSGID: 114049] [client-handshake.c:1258:client_setvolume_cbk] 0-dockervols-client-2: sending AUTH_FAILED event

This is a known issue and we are planning to fix it. For the time being there is a workaround (a concrete command sketch follows below):

- Before adding the brick, set the auth.allow option back to its default, i.e. "*". You can do this by running "gluster v reset <volname> auth.allow".
- Add the brick.
- After it succeeds, set the auth.allow option back to the previous value.
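For example, with the volume and brick from this thread the workaround would look something like this (only a sketch; please double check the current auth.allow value on your setup before restoring it):

    # 1. Temporarily reset auth.allow to its default ("*")
    sudo gluster volume reset dockervols auth.allow

    # 2. Add the new brick, increasing the replica count
    sudo gluster volume add-brick dockervols replica 4 webserver8:/data/gluster/dockervols force

    # 3. Once the add-brick succeeds, restore the earlier restriction
    sudo gluster volume set dockervols auth.allow 127.0.0.1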
Regards,
Karthik

On Tue, Apr 16, 2019 at 5:20 PM Boris Goldowsky <bgoldow...@cast.org> wrote:

> OK, log files attached.
>
> Boris
>
> *From: *Karthik Subrahmanya <ksubr...@redhat.com>
> *Date: *Tuesday, April 16, 2019 at 2:52 AM
> *To: *Atin Mukherjee <atin.mukherje...@gmail.com>, Boris Goldowsky <bgoldow...@cast.org>
> *Cc: *Gluster-users <gluster-users@gluster.org>
> *Subject: *Re: [Gluster-users] Volume stuck unable to add a brick
>
> On Mon, Apr 15, 2019 at 9:43 PM Atin Mukherjee <atin.mukherje...@gmail.com> wrote:
>
> +Karthik Subrahmanya <ksubr...@redhat.com>
>
> Didn't we fix this problem recently? "Failed to set extended attribute" indicates that the temp mount is failing and we don't have a quorum number of bricks up.
>
> We had two fixes which handle two kinds of add-brick scenarios:
>
> [1] Fails add-brick when increasing the replica count if any of the bricks is down, to avoid data loss. This can be overridden by using the force option.
>
> [2] Allows add-brick to set the extended attributes through the temp mount if the volume is already mounted (has clients).
>
> They are on version 3.12.2, so patch [1] is present there. But since they are using the force option it should not be a problem even if a brick is down. The error message they are getting is also different, so I guess it is not because of any brick being down.
>
> Patch [2] is not present in 3.12.2, but this is not a conversion from a plain distribute to a replicate volume, so that scenario is different here.
>
> It seems like they are hitting some other issue.
>
> @Boris,
>
> Can you attach the add-brick's temp mount log? The file name should look something like "dockervols-add-brick-mount.log". Can you also provide all the brick logs of that volume from around that time?
>
> [1] https://review.gluster.org/#/c/glusterfs/+/16330/
> [2] https://review.gluster.org/#/c/glusterfs/+/21791/
>
> Regards,
> Karthik
>
> Boris - what gluster version are you using?
>
> On Mon, Apr 15, 2019 at 7:35 PM Boris Goldowsky <bgoldow...@cast.org> wrote:
>
> Atin, thank you for the reply. Here are all of those pieces of information:
>
> [bgoldowsky@webserver9 ~]$ gluster --version
> glusterfs 3.12.2
> (same on all nodes)
>
> [bgoldowsky@webserver9 ~]$ sudo gluster peer status
> Number of Peers: 3
>
> Hostname: webserver11.cast.org
> Uuid: c2b147fd-cab4-4859-9922-db5730f8549d
> State: Peer in Cluster (Connected)
>
> Hostname: webserver1.cast.org
> Uuid: 4b918f65-2c9d-478e-8648-81d1d6526d4c
> State: Peer in Cluster (Connected)
> Other names:
> 192.168.200.131
> webserver1
>
> Hostname: webserver8.cast.org
> Uuid: be2f568b-61c5-4016-9264-083e4e6453a2
> State: Peer in Cluster (Connected)
> Other names:
> webserver8
>
> [bgoldowsky@webserver1 ~]$ sudo gluster v info
>
> Volume Name: dockervols
> Type: Replicate
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/dockervols
> Brick2: webserver11:/data/gluster/dockervols
> Brick3: webserver9:/data/gluster/dockervols
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> auth.allow: 127.0.0.1
>
> Volume Name: testvol
> Type: Replicate
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/testvol
> Brick2: webserver9:/data/gluster/testvol
> Brick3: webserver11:/data/gluster/testvol
> Brick4: webserver8:/data/gluster/testvol
> Options Reconfigured:
> transport.address-family: inet
> nfs.disable: on
>
> [bgoldowsky@webserver8 ~]$ sudo gluster v info
>
> Volume Name: dockervols
> Type: Replicate
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/dockervols
> Brick2: webserver11:/data/gluster/dockervols
> Brick3: webserver9:/data/gluster/dockervols
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> auth.allow: 127.0.0.1
>
> Volume Name: testvol
> Type: Replicate
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/testvol
> Brick2: webserver9:/data/gluster/testvol
> Brick3: webserver11:/data/gluster/testvol
> Brick4: webserver8:/data/gluster/testvol
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
>
> [bgoldowsky@webserver9 ~]$ sudo gluster v info
>
> Volume Name: dockervols
> Type: Replicate
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/dockervols
> Brick2: webserver11:/data/gluster/dockervols
> Brick3: webserver9:/data/gluster/dockervols
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> auth.allow: 127.0.0.1
>
> Volume Name: testvol
> Type: Replicate
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/testvol
> Brick2: webserver9:/data/gluster/testvol
> Brick3: webserver11:/data/gluster/testvol
> Brick4: webserver8:/data/gluster/testvol
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
>
> [bgoldowsky@webserver11 ~]$ sudo gluster v info
>
> Volume Name: dockervols
> Type: Replicate
> Volume ID: 6093a9c6-ec6c-463a-ad25-8c3e3305b98a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 3 = 3
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/dockervols
> Brick2: webserver11:/data/gluster/dockervols
> Brick3: webserver9:/data/gluster/dockervols
> Options Reconfigured:
> auth.allow: 127.0.0.1
> transport.address-family: inet
> nfs.disable: on
>
> Volume Name: testvol
> Type: Replicate
> Volume ID: 4d5f00f5-00ea-4dcf-babf-1a76eca55332
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 4 = 4
> Transport-type: tcp
> Bricks:
> Brick1: webserver1:/data/gluster/testvol
> Brick2: webserver9:/data/gluster/testvol
> Brick3: webserver11:/data/gluster/testvol
> Brick4: webserver8:/data/gluster/testvol
> Options Reconfigured:
> transport.address-family: inet
> nfs.disable: on
>
> [bgoldowsky@webserver9 ~]$ sudo gluster volume add-brick dockervols replica 4 webserver8:/data/gluster/dockervols force
> volume add-brick: failed: Commit failed on webserver8.cast.org. Please check log file for details.
>
> Webserver8 glusterd.log:
>
> [2019-04-15 13:55:42.338197] I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
> The message "I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 2 times between [2019-04-15 13:55:42.338197] and [2019-04-15 13:55:42.341618]
> [2019-04-15 14:00:20.445011] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215) [0x7fe697764215] -->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d) [0x7fe69780de9d] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fe6a2d16ea5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=dockervols --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
> [2019-04-15 14:00:20.445148] I [MSGID: 106578] [glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management: replica-count is set 4
> [2019-04-15 14:00:20.445184] I [MSGID: 106578] [glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
> [2019-04-15 14:00:20.672347] E [MSGID: 106054] [glusterd-utils.c:13863:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
> [2019-04-15 14:00:20.693491] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /tmp/mntmvdFGq [Transport endpoint is not connected]
> [2019-04-15 14:00:20.693597] E [MSGID: 106074] [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
> [2019-04-15 14:00:20.693637] E [MSGID: 106123] [glusterd-mgmt.c:312:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.
> [2019-04-15 14:00:20.693667] E [MSGID: 106123] [glusterd-mgmt-handler.c:616:glusterd_handle_commit_fn] 0-management: commit failed on operation Add brick
>
> Webserver11 log file:
>
> [2019-04-15 13:56:29.563270] I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req
> The message "I [MSGID: 106488] [glusterd-handler.c:1559:__glusterd_handle_cli_get_volume] 0-management: Received get vol req" repeated 2 times between [2019-04-15 13:56:29.563270] and [2019-04-15 13:56:29.566209]
> [2019-04-15 14:00:33.996866] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0x3a215) [0x7f36de924215] -->/usr/lib64/glusterfs/3.12.2/xlator/mgmt/glusterd.so(+0xe3e9d) [0x7f36de9cde9d] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f36e9ed6ea5] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=dockervols --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
> [2019-04-15 14:00:33.996979] I [MSGID: 106578] [glusterd-brick-ops.c:1354:glusterd_op_perform_add_bricks] 0-management: replica-count is set 4
> [2019-04-15 14:00:33.997004] I [MSGID: 106578] [glusterd-brick-ops.c:1364:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
> [2019-04-15 14:00:34.013789] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: nfs already stopped
> [2019-04-15 14:00:34.013849] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: nfs service is stopped
> [2019-04-15 14:00:34.017535] I [MSGID: 106568] [glusterd-proc-mgmt.c:88:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 6087
> [2019-04-15 14:00:35.018783] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: glustershd service is stopped
> [2019-04-15 14:00:35.018952] I [MSGID: 106567] [glusterd-svc-mgmt.c:211:glusterd_svc_start] 0-management: Starting glustershd service
> [2019-04-15 14:00:35.028306] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: bitd already stopped
> [2019-04-15 14:00:35.028408] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: bitd service is stopped
> [2019-04-15 14:00:35.028601] I [MSGID: 106132] [glusterd-proc-mgmt.c:84:glusterd_proc_stop] 0-management: scrub already stopped
> [2019-04-15 14:00:35.028645] I [MSGID: 106568] [glusterd-svc-mgmt.c:243:glusterd_svc_stop] 0-management: scrub service is stopped
>
> Thank you for taking a look!
>
> Boris
>
> *From: *Atin Mukherjee <atin.mukherje...@gmail.com>
> *Date: *Friday, April 12, 2019 at 1:10 PM
> *To: *Boris Goldowsky <bgoldow...@cast.org>
> *Cc: *Gluster-users <gluster-users@gluster.org>
> *Subject: *Re: [Gluster-users] Volume stuck unable to add a brick
>
> On Fri, 12 Apr 2019 at 22:32, Boris Goldowsky <bgoldow...@cast.org> wrote:
>
> I’ve got a replicated volume with three bricks (“1x3=3”), the idea is to have a common set of files that are locally available on all the machines (Scientific Linux 7, which is essentially CentOS 7) in a cluster.
> I tried to add on a fourth machine, so used a command like this:
>
> sudo gluster volume add-brick dockervols replica 4 webserver8:/data/gluster/dockervols force
>
> but the result is:
>
> volume add-brick: failed: Commit failed on webserver1. Please check log file for details.
> Commit failed on webserver8. Please check log file for details.
> Commit failed on webserver11. Please check log file for details.
>
> Tried: removing the new brick (this also fails) and trying again.
> Tried: checking the logs. The log files are not enlightening to me – I don’t know what’s normal and what’s not.
>
> From webserver8 & webserver11 could you attach glusterd log files?
>
> Also please share the following:
> - gluster version? (gluster --version)
> - Output of ‘gluster peer status’
> - Output of ‘gluster v info’ from all 4 nodes.
>
> Tried: deleting the brick directory from previous attempt, so that it’s not in the way.
> Tried: restarting gluster services
> Tried: rebooting
> Tried: setting up a new volume, replicated to all four machines. This works, so I’m assuming it’s not a networking issue. But still fails with this existing volume that has the critical data in it.
>
> Running out of ideas. Any suggestions? Thank you!
>
> Boris
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
> --Atin
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users