Re: [Gluster-users] geo replication issue

2018-10-24 Thread Krishna Verma
Hi Sunny,

Thanks for your response. Yes, '/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py' was missing at the slave.

I installed the glusterfs-geo-replication.x86_64 RPM and the session is now Active.

But now I am struggling with an indexing issue. Files larger than 5 GB on the master
volume are not getting synced to the slave. I have to delete the geo-replication
session and erase the indexing as shown below; only after creating a new session
do the large files start syncing to the slave.

How can we avoid this Gluster behaviour in geo-replication? Also, can we monitor
the real-time data sync between master and slave with any GUI?

I was also searching for implementation docs on "geo-replication over the
internet for a distributed volume", but I can't find any. Do you have one?

Any help is appreciated.

 

# gluster volume geo-replication gv1 sj-gluster01::gv1 delete
Deleting geo-replication session between gv1 & sj-gluster01::gv1 has been successful
# gluster volume set gv1 geo-replication.indexing off
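
(For reference, a minimal sketch of the full sequence implied above, reusing the volume and slave names from the commands; the create/start steps and the status check are assumptions about how the session is re-established and monitored, not commands quoted from the original mail.)

# tear down the existing session and drop the changelog index
gluster volume geo-replication gv1 sj-gluster01::gv1 stop
gluster volume geo-replication gv1 sj-gluster01::gv1 delete
gluster volume set gv1 geo-replication.indexing off

# re-create and start the session so the large files start syncing again
gluster volume geo-replication gv1 sj-gluster01::gv1 create push-pem force
gluster volume geo-replication gv1 sj-gluster01::gv1 start

# no GUI, but the CLI can show per-brick sync state and last-synced time
gluster volume geo-replication gv1 sj-gluster01::gv1 status detail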

/Krishna

-Original Message-
From: Sunny Kumar  
Sent: Wednesday, October 24, 2018 6:33 PM
To: Krishna Verma 
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] geo replication issue

EXTERNAL MAIL


Hi Krishna,

Please check whether the file
'/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py' exists at the slave.

- Sunny
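
(For example, a quick check could be run from the master; the slave host name sj-gluster01, root SSH access and the RPM package name are assumptions based on the session shown below.)

ssh root@sj-gluster01 'ls -l /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py'
ssh root@sj-gluster01 'rpm -q glusterfs-geo-replication'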
On Wed, Oct 24, 2018 at 4:36 PM Krishna Verma  wrote:
>
>
>
>
>
>
>
> Hi Everyone,
>
>
>
> I have created a 4x4 distributed Gluster volume, but when I start the
> geo-replication session it fails with the errors below.
>
>
>
> [2018-10-24 10:02:03.857861] I [gsyncdstatus(monitor):245:set_worker_status] 
> GeorepStatus: Worker Status Change status=Initializing...
>
> [2018-10-24 10:02:03.858133] I [monitor(monitor):155:monitor] Monitor: 
> starting gsyncd worker   brick=/gfs1/brick1/gv1  slave_node=sj-gluster02
>
> [2018-10-24 10:02:03.954746] I [gsyncd(agent /gfs1/brick1/gv1):297:main] 
> : Using session config file   
> path=/var/lib/glusterd/geo-replication/gv1_sj-gluster01_gv1/gsyncd.conf
>
> [2018-10-24 10:02:03.956724] I [changelogagent(agent 
> /gfs1/brick1/gv1):72:__init__] ChangelogAgent: Agent listining...
>
> [2018-10-24 10:02:03.958110] I [gsyncd(worker /gfs1/brick1/gv1):297:main] 
> : Using session config file  
> path=/var/lib/glusterd/geo-replication/gv1_sj-gluster01_gv1/gsyncd.conf
>
> [2018-10-24 10:02:03.975778] I [resource(worker 
> /gfs1/brick1/gv1):1377:connect_remote] SSH: Initializing SSH connection 
> between master and slave...
>
> [2018-10-24 10:02:07.413379] E [syncdutils(worker 
> /gfs1/brick1/gv1):305:log_raise_exception] : connection to peer is broken
>
> [2018-10-24 10:02:07.414144] E [syncdutils(worker 
> /gfs1/brick1/gv1):801:errlog] Popen: command returned error   cmd=ssh 
> -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i 
> /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S 
> /tmp/gsyncd-aux-ssh-OE_W1C/cf9a66dce686717c4a5ef9a7c3a7f8be.sock sj-gluster01 
> /nonexistent/gsyncd slave gv1 sj-gluster01::gv1 --master-node noida-gluster01 
> --master-node-id 08925454-9fea-4b24-8f82-9d7ad917b870 --master-brick 
> /gfs1/brick1/gv1 --local-node sj-gluster02 --local-node-id 
> f592c041-dcae-493c-b5a0-31e376a5be34 --slave-timeout 120 --slave-log-level 
> INFO --slave-gluster-log-level INFO --slave-gluster-command-dir 
> /usr/local/sbin/  error=2
>
> [2018-10-24 10:02:07.414386] E [syncdutils(worker 
> /gfs1/brick1/gv1):805:logerr] Popen: ssh> /usr/bin/python2: can't open file 
> '/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py': [Errno 2] No such file 
> or directory
>
> [2018-10-24 10:02:07.422688] I [repce(agent 
> /gfs1/brick1/gv1):80:service_loop] RepceServer: terminating on reaching EOF.
>
> [2018-10-24 10:02:07.422842] I [monitor(monitor):266:monitor] Monitor: worker 
> died before establishing connection   brick=/gfs1/brick1/gv1
>
> [2018-10-24 10:02:07.435054] I [gsyncdstatus(monitor):245:set_worker_status] 
> GeorepStatus: Worker Status Change status=Faulty
>
>
>
>
>
> MASTER NODE  MASTER VOLMASTER BRICKSLAVE USERSLAVE
> SLAVE NODESTATUSCRAWL STATUSLAST_SYNCED
>
> 
>
> noida-gluster01  gv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
> noida-gluster02  gv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
> gluster-poc-noidagv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
> noi-poc-gluster  gv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
>
>
>
>
> Could someone please help?
>
>
>
> /Krishna
>
> ___

[Gluster-users] Log-file rotation on a Disperse Volume while a failed brick results in files that cannot be healed.

2018-10-24 Thread Jeff Byers
Hello,

Regarding the issue:

  Bug 1642638 - Log-file rotation on a Disperse Volume while a failed
brick results in files that cannot be healed.
  https://bugzilla.redhat.com/show_bug.cgi?id=1642638

Could anybody that has GlusterFS 4.1.x installed see if this problem
exists there?

If anyone knows of a fix, or a good way to avoid these "Not able to
heal" situations requiring manual intervention,
that would be greatly appreciated. Thanks!
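
(Not a fix, but for anyone reproducing this, a hedged sketch of the commands typically used to inspect and re-trigger healing on the affected volume; <VOLNAME> is a placeholder.)

gluster volume heal <VOLNAME> info
gluster volume heal <VOLNAME> info split-brain
gluster volume heal <VOLNAME> full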

~ Jeff Byers ~
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] nfs volume usage

2018-10-24 Thread Oğuz Yarımtepe
Hi,

How can I use the NFS exports from my storage as a peer's replicated
volume? Any tips?

Regards.

-- 
Oğuz Yarımtepe
http://about.me/oguzy
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] geo replication session status faulty

2018-10-24 Thread Christos Tsalidis
Hi all,

I am testing the geo-replication service with Gluster version 3.10.12 on
CentOS Linux release 7.5.1804, and my session remains in the Faulty state.
On Gluster 3.12 the following command can be set to solve the problem.

gluster vol geo-replication mastervol geoaccount@servere::slavevol config
access_mount true

Do you know whether there is an equivalent option on 3.10.12?
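
(For reference, a hedged way to see which geo-replication options a given build exposes is to run the session config command with no option name, which lists all configurable values; volume and user names are taken from the command above.)

gluster volume geo-replication mastervol geoaccount@servere::slavevol config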

Here are the geo-replication logs

[2018-10-24 17:54:09.613987] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.838430] I [cli.c:759:main] 0-cli: Started running
/usr/sbin/gluster with version 3.10.12
[2018-10-24 17:54:09.614087] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.838471] I [cli.c:642:cli_rpc_init] 0-cli: Connecting to remote
glusterd at localhost
[2018-10-24 17:54:09.614211] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.845996] I [socket.c:4208:socket_init] 0-glusterfs: SSL support for
glusterd is ENABLED
[2018-10-24 17:54:09.614345] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.846805] E [socket.c:4288:socket_init] 0-glusterfs: failed to open
/etc/ssl/dhparam.pem, DH ciphers are disabled
[2018-10-24 17:54:09.614475] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.864811] I [socket.c:348:ssl_setup_connection] 0-glusterfs: peer CN
= servere
[2018-10-24 17:54:09.614582] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.865488] I [socket.c:351:ssl_setup_connection] 0-glusterfs: SSL
verification succeeded (client: )
[2018-10-24 17:54:09.614722] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.865676] I [socket.c:4208:socket_init] 0-glusterfs: SSL support for
glusterd is ENABLED
[2018-10-24 17:54:09.614826] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:08.865807] E [socket.c:4288:socket_init] 0-glusterfs: failed to open
/etc/ssl/dhparam.pem, DH ciphers are disabled
[2018-10-24 17:54:09.614919] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:09.066460] I [MSGID: 101190]
[event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2018-10-24 17:54:09.615006] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:09.067076] I [socket.c:2426:socket_event_handler] 0-transport:
EPOLLERR - disconnecting now
[2018-10-24 17:54:09.615093] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:09.067893] I [cli-rpc-ops.c:7024:gf_cli_getwd_cbk] 0-cli: Received
resp to getwd
[2018-10-24 17:54:09.615226] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:09.067953] I [input.c:31:cli_batch] 0-: Exiting with: 0
[2018-10-24 17:54:09.615494] I
[syncdutils(/bricks/brick-a1/brick):238:finalize] : exiting.
[2018-10-24 17:54:09.616787] I
[repce(/bricks/brick-a1/brick):92:service_loop] RepceServer: terminating on
reaching EOF.
[2018-10-24 17:54:09.617005] I
[syncdutils(/bricks/brick-a1/brick):238:finalize] : exiting.
[2018-10-24 17:54:09.617331] I [monitor(monitor):347:monitor] Monitor:
worker(/bricks/brick-a1/brick) died before establishing connection
[2018-10-24 17:54:19.811722] I [monitor(monitor):275:monitor] Monitor:
starting gsyncd worker(/bricks/brick-a1/brick). Slave node:
ssh://geoaccount@servere:gluster://localhost:slavevol
[2018-10-24 17:54:20.90926] I
[changelogagent(/bricks/brick-a1/brick):73:__init__] ChangelogAgent: Agent
listining...
[2018-10-24 17:54:21.431653] E
[syncdutils(/bricks/brick-a1/brick):270:log_raise_exception] :
connection to peer is broken
[2018-10-24 17:54:21.432003] E
[resource(/bricks/brick-a1/brick):234:errlog] Popen: command "ssh
-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i
/var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S
/tmp/gsyncd-aux-ssh-rJaCZW/05b8d7b5dab75575689c0e1a2ec33b3f.sock
geoaccount@servere /nonexistent/gsyncd --session-owner
4d94d1ea-6818-450a-8fa8-645a7d9d36b8 --local-id
.%2Fbricks%2Fbrick-a1%2Fbrick --local-node servera -N --listen --timeout
120 gluster://localhost:slavevol" returned with 1, saying:
[2018-10-24 17:54:21.432122] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:20.609121] I [cli.c:759:main] 0-cli: Started running
/usr/sbin/gluster with version 3.10.12
[2018-10-24 17:54:21.432220] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:20.609156] I [cli.c:642:cli_rpc_init] 0-cli: Connecting to remote
glusterd at localhost
[2018-10-24 17:54:21.432312] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:20.615402] I [socket.c:4208:socket_init] 0-glusterfs: SSL support for
glusterd is ENABLED
[2018-10-24 17:54:21.432401] E
[resource(/bricks/brick-a1/brick):238:logerr] Popen: ssh> [2018-10-24
17:54:20.616145] E [socket.c:4288:socket_init] 0-glusterfs: failed to open
/etc

[Gluster-users] How to use system.affinity/distributed.migrate-data on distributed/replicated volume?

2018-10-24 Thread Ingo Fischer
Hi,

I have set up a GlusterFS volume gv0 as distributed-replicated:

root@pm1:~# gluster volume info gv0

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 64651501-6df2-4106-b330-fdb3e1fbcdf4
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 192.168.178.50:/gluster/brick1/gv0
Brick2: 192.168.178.76:/gluster/brick1/gv0
Brick3: 192.168.178.50:/gluster/brick2/gv0
Brick4: 192.168.178.81:/gluster/brick1/gv0
Brick5: 192.168.178.50:/gluster/brick3/gv0
Brick6: 192.168.178.82:/gluster/brick1/gv0
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet


root@pm1:~# gluster volume status
Status of volume: gv0
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick 192.168.178.50:/gluster/brick1/gv049152 0  Y
1665
Brick 192.168.178.76:/gluster/brick1/gv049152 0  Y
26343
Brick 192.168.178.50:/gluster/brick2/gv049153 0  Y
1666
Brick 192.168.178.81:/gluster/brick1/gv049152 0  Y
1161
Brick 192.168.178.50:/gluster/brick3/gv049154 0  Y
1679
Brick 192.168.178.82:/gluster/brick1/gv049152 0  Y
1334
Self-heal Daemon on localhost   N/A   N/AY
5022
Self-heal Daemon on 192.168.178.81  N/A   N/AY
935
Self-heal Daemon on 192.168.178.82  N/A   N/AY
1057
Self-heal Daemon on pm2.fritz.box   N/A   N/AY
1651


I use the filesystem to store VM files, so not many files, but big ones.

The distribution has now put four big files on one brick set and only one file
on another. This means that the one brick set is "overcommitted" as soon as
all VMs use their maximum space. So I would like to manually redistribute the
files a bit better.

After a lot of googling I found that the following should work:
setfattr -n 'system.affinity' -v $location $filepath
setfattr -n 'distribute.migrate-data' -v 'force' $filepath

But I am having problems with it because it either gives errors or does nothing at all.

The mounting looks like:
192.168.178.50:gv0 on /mnt/pve/glusterfs type fuse.glusterfs
(rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)


Here is what I tried for the first xattr:

root@pm1:~# setfattr -n 'system.affinity' -v 'gv0-client-5'
/mnt/pve/glusterfs/201/imagesvm201.qcow2
setfattr: /mnt/pve/glusterfs/201/imagesvm201.qcow2: Operation not supported

So I found on Google that trusted.affinity should be used instead, and yes,
that works. I'm just not sure whether the location "gv0-client-5" is correct
to move the file to "Brick 5" from "gluster volume info gv0" ... or how this
location name is built.
Commit Message from http://review.gluster.org/#/c/glusterfs/+/5233/ says
> The value is the internal client or AFR brick name where you want the
file to be.

So what do I need to set there? Maybe I need the "afr" name because the volume
is replicated? But where do I get that name from?
I also tried entering other client or replicate names like "gv0-replicate-0",
which seems more fitting for a replicated volume, but the result is the same.
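
(For what it's worth, two hedged ways to look up the internal subvolume names, assuming the default /var/lib/glusterd layout on one of the servers and that the pathinfo xattr is queried on the FUSE mount.)

# list the client/replicate subvolume names from the generated client volfile
grep '^volume gv0-' /var/lib/glusterd/vols/gv0/trusted-gv0.tcp-fuse.vol

# show which bricks (and which replicate subvolume) currently hold the file
getfattr -n trusted.glusterfs.pathinfo /mnt/pve/glusterfs/201/imagesvm201.qcow2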


For the second command I get:
root@pm1:~# setfattr -n 'distribute.migrate-data' -v 'force'
/mnt/pve/glusterfs/201/imagesvm201.qcow2
setfattr: /mnt/pve/glusterfs/images/201/vm-201-disk-0.qcow2: Operation
not supported
root@pm1:~# setfattr -n 'trusted.distribute.migrate-data' -v 'force'
/mnt/pve/glusterfs/201/imagesvm201.qcow2
setfattr: /mnt/pve/glusterfs/images/201/vm-201-disk-0.qcow2: File exists

I also experimented with other "names" than "gv0-client-5" above, but it is
always the same.

I saw that instead of the second command I could start a rebalance with force,
but this also did nothing; it ended after at most one second and moved nothing.
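
(For reference, a minimal sketch of the rebalance commands referred to above, assuming the volume name gv0.)

gluster volume rebalance gv0 start force
gluster volume rebalance gv0 status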

Can someone please advise how to do this right?


Another idea was to enable NUFA and kind of "re-copy" the files on the
GlusterFS volume, but here it seems that the documentation is wrong:
gluster volume set gv0 cluster.nufa enable on

Is

gluster volume set gv0 cluster.nufa 1

correct?

Thank you very much!

Ingo
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] geo replication issue

2018-10-24 Thread Sunny Kumar
Hi Krishna,

Please check whether the file
'/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py' exists at the slave.

- Sunny
On Wed, Oct 24, 2018 at 4:36 PM Krishna Verma  wrote:
>
>
>
>
>
>
>
> Hi Everyone,
>
>
>
> I have created a 4x4 distributed Gluster volume, but when I start the
> geo-replication session it fails with the errors below.
>
>
>
> [2018-10-24 10:02:03.857861] I [gsyncdstatus(monitor):245:set_worker_status] 
> GeorepStatus: Worker Status Change status=Initializing...
>
> [2018-10-24 10:02:03.858133] I [monitor(monitor):155:monitor] Monitor: 
> starting gsyncd worker   brick=/gfs1/brick1/gv1  slave_node=sj-gluster02
>
> [2018-10-24 10:02:03.954746] I [gsyncd(agent /gfs1/brick1/gv1):297:main] 
> : Using session config file   
> path=/var/lib/glusterd/geo-replication/gv1_sj-gluster01_gv1/gsyncd.conf
>
> [2018-10-24 10:02:03.956724] I [changelogagent(agent 
> /gfs1/brick1/gv1):72:__init__] ChangelogAgent: Agent listining...
>
> [2018-10-24 10:02:03.958110] I [gsyncd(worker /gfs1/brick1/gv1):297:main] 
> : Using session config file  
> path=/var/lib/glusterd/geo-replication/gv1_sj-gluster01_gv1/gsyncd.conf
>
> [2018-10-24 10:02:03.975778] I [resource(worker 
> /gfs1/brick1/gv1):1377:connect_remote] SSH: Initializing SSH connection 
> between master and slave...
>
> [2018-10-24 10:02:07.413379] E [syncdutils(worker 
> /gfs1/brick1/gv1):305:log_raise_exception] : connection to peer is broken
>
> [2018-10-24 10:02:07.414144] E [syncdutils(worker 
> /gfs1/brick1/gv1):801:errlog] Popen: command returned error   cmd=ssh 
> -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i 
> /var/lib/glusterd/geo-replication/secret.pem -p 22 -oControlMaster=auto -S 
> /tmp/gsyncd-aux-ssh-OE_W1C/cf9a66dce686717c4a5ef9a7c3a7f8be.sock sj-gluster01 
> /nonexistent/gsyncd slave gv1 sj-gluster01::gv1 --master-node noida-gluster01 
> --master-node-id 08925454-9fea-4b24-8f82-9d7ad917b870 --master-brick 
> /gfs1/brick1/gv1 --local-node sj-gluster02 --local-node-id 
> f592c041-dcae-493c-b5a0-31e376a5be34 --slave-timeout 120 --slave-log-level 
> INFO --slave-gluster-log-level INFO --slave-gluster-command-dir 
> /usr/local/sbin/  error=2
>
> [2018-10-24 10:02:07.414386] E [syncdutils(worker 
> /gfs1/brick1/gv1):805:logerr] Popen: ssh> /usr/bin/python2: can't open file 
> '/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py': [Errno 2] No such file 
> or directory
>
> [2018-10-24 10:02:07.422688] I [repce(agent 
> /gfs1/brick1/gv1):80:service_loop] RepceServer: terminating on reaching EOF.
>
> [2018-10-24 10:02:07.422842] I [monitor(monitor):266:monitor] Monitor: worker 
> died before establishing connection   brick=/gfs1/brick1/gv1
>
> [2018-10-24 10:02:07.435054] I [gsyncdstatus(monitor):245:set_worker_status] 
> GeorepStatus: Worker Status Change status=Faulty
>
>
>
>
>
> MASTER NODE  MASTER VOLMASTER BRICKSLAVE USERSLAVE
> SLAVE NODESTATUSCRAWL STATUSLAST_SYNCED
>
> 
>
> noida-gluster01  gv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
> noida-gluster02  gv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
> gluster-poc-noidagv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
> noi-poc-gluster  gv1   /gfs1/brick1/gv1root  
> sj-gluster01::gv1N/A   FaultyN/A N/A
>
>
>
>
>
> Could someone please help?
>
>
>
> /Krishna
>
> ___
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] GlusterFS 4.1.x deb packages missing for Debian 8 (jessie)

2018-10-24 Thread mabi
Anyone?

I would really like to be able to install GlusterFS 4.1.x on Debian 8 (jessie).
This version of Debian is still widely in use, and IMHO there should be a
GlusterFS package for it.

Many thanks in advance for your consideration.


‐‐‐ Original Message ‐‐‐
On Friday, October 19, 2018 10:58 PM, mabi  wrote:

> Hello,
>
> I just upgraded all my Debian 9 (stretch) GlusterFS servers from 3.12.14 to
> 4.1.5, but unfortunately my GlusterFS clients are all Debian 8 (jessie)
> machines and there is no GlusterFS 4.1.x package available for Debian 8,
> as I found out here:
>
> https://download.gluster.org/pub/gluster/glusterfs/4.1/4.1.5/Debian/
>
> May I kindly ask the GlusterFS packaging team or the person responsible for 
> this task to please also provide the packages for Debian 8?
>
> Right now I am running GlusterFS 4.1.5 servers with GlusterFS 3.12.14
> clients (FUSE mount). Could this create any problems, or is it safe? I have
> not upgraded the op-version on the servers yet.
>
> Thank you very much in advance.
>
> Best regards,
> Mabi


___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing ou

2018-10-24 Thread Ravishankar N



On 10/24/2018 05:16 PM, Hoggins! wrote:

Thank you, it's working as expected.

I guess it's only safe to put cluster.data-self-heal back on when I get
an updated version of GlusterFS?
Yes, correct. Also, you would still need to restart shd whenever you hit
this issue, until you upgrade.

-Ravi


     Hoggins!

Le 24/10/2018 à 11:53, Ravishankar N a écrit :

On 10/24/2018 02:38 PM, Hoggins! wrote:

Thanks, that's helping a lot, I will do that.

One more question: should the glustershd restart be performed on the
arbiter only, or on each node of the cluster?

If you do a 'gluster volume start volname force' it will restart the
shd on all nodes.
-Ravi

Thanks!

     Hoggins!

Le 24/10/2018 à 02:55, Ravishankar N a écrit :

On 10/23/2018 10:01 PM, Hoggins! wrote:

Hello there,

I'm stumbling upon the *exact same issue*, and unfortunately setting the
server.tcp-user-timeout to 42 does not help.
Any other suggestion?

I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
node) suggested by Sam works, but it often doesn't.

You may ask how I got into this, well it's simple: I needed to replace
my brick 1 and brick 2 with two brand new machines, so here's what I did:
     - add brick 3 and brick 4 into the cluster (gluster peer probe,
gluster volume add-brick, etc., with the issue regarding the arbiter
node that has to be first removed from the cluster before being able to
add bricks 3 and 4)
     - wait for all the files on my volumes to heal. It took a few days.
     - remove bricks 1 and 2
     - after having "reset" the arbiter, re-add the arbiter into the cluster

And now it's intermittently hanging on writing *on existing files*.
There is *no problem for writing new files* on the volumes.

Hi,

There was an arbiter volume hang issue that was fixed [1] recently.
The fix has been back-ported to all release branches.

One workaround to overcome hangs is to (1)turn off  'testvol
cluster.data-self-heal', remount the clients *and* (2) restart
glustershd (via volume start force). The hang is observed due to an
unreleased lock from self-heal. There are other ways to release the
stale lock via gluster clear-locks command or tweaking
features.locks-revocation-secs but restarting shd whenever you see the
issue is the easiest and safest way.

-Ravi

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
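
(For reference, a minimal sketch of that workaround as CLI commands, assuming the volume is called testvol as in the example above; remounting the clients still has to be done on each client host.)

# (1) disable data self-heal on the volume, then remount the clients
gluster volume set testvol cluster.data-self-heal off

# (2) restart the self-heal daemon on all nodes ("volume start force")
gluster volume start testvol force

# (the stale lock can also be released via "gluster volume clear-locks"
#  or by tweaking features.locks-revocation-secs, as noted above)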



I'm lost here, thanks for your inputs!

     Hoggins!

Le 14/09/2018 à 04:16, Amar Tumballi a écrit :

On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod mailto:mailingli...@smcleod.net>> wrote:

 I apologise for this being posted twice - I'm not sure if that was
 user error or a bug in the mailing list, but the list wasn't
 showing my post after quite some time so I sent a second email
 which near immediately showed up - that's mailing lists I guess...

 Anyway, if anyone has any input, advice or abuse I'm welcome any
 input!


We got a little late getting back on this. But after running tests
internally, we found that a possibly missing volume option is the reason
for this:

Try

gluster volume set  server.tcp-user-timeout 42
on your volume. Let us know if this helps.
(Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
  


 --
 Sam McLeod
 https://smcleod.net
 https://twitter.com/s_mcleod


 On 3 Sep 2018, at 1:20 pm, Sam McLeod mailto:mailingli...@smcleod.net>> wrote:

 We've got an odd problem where clients are blocked from writing
 to Gluster volumes until the first node of the Gluster cluster is
 rebooted.

 I suspect I've either configured something incorrectly with the
 arbiter / replica configuration of the volumes, or there is some
 sort of bug in the gluster client-server connection that we're
 triggering.

 I was wondering if anyone has seen this or could point me in the
 right direction?


 *Environment:*

   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
 metadata only).
   * Version: Client and Servers both running 4.1.3, both on
 CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
 networked SSD storage backing them, XFS.
   * Client: Native Gluster FUSE client mounting via the
 kubernetes provider


 *Problem:*

   * Seemingly randomly some clients will be blocked / are unable
 to write to what should be a highly available gluster volume.
   * The client gluster logs show it failing to do new file
 operations across various volumes and all three nodes of the
 gluster.
   * The server gluster (or OS) logs do not show any warnings or
 errors.
   * The client recovers and is able to write to volumes again
 after the first node of the gluster cluster is rebooted.
   * Until the first node of the gluster cluster is rebooted, the
 client fails to write to the volume that is (or should be)
 available on the

Re: [Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing ou

2018-10-24 Thread Hoggins!
Thank you, it's working as expected.

I guess it's only safe to put cluster.data-self-heal back on when I get
an updated version of GlusterFS?

    Hoggins!

Le 24/10/2018 à 11:53, Ravishankar N a écrit :
>
> On 10/24/2018 02:38 PM, Hoggins! wrote:
>> Thanks, that's helping a lot, I will do that.
>>
>> One more question: should the glustershd restart be performed on the
>> arbiter only, or on each node of the cluster?
> If you do a 'gluster volume start volname force' it will restart the
> shd on all nodes.
> -Ravi
>> Thanks!
>>
>>     Hoggins!
>>
>> Le 24/10/2018 à 02:55, Ravishankar N a écrit :
>>> On 10/23/2018 10:01 PM, Hoggins! wrote:
 Hello there,

 I'm stumbling upon the *exact same issue*, and unfortunately setting the
 server.tcp-user-timeout to 42 does not help.
 Any other suggestion?

 I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
 version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
 node) suggested by Sam works, but it often doesn't.

 You may ask how I got into this, well it's simple: I needed to replace
 my brick 1 and brick 2 with two brand new machines, so here's what I did:
     - add brick 3 and brick 4 into the cluster (gluster peer probe,
 gluster volume add-brick, etc., with the issue regarding the arbiter
 node that has to be first removed from the cluster before being able to
 add bricks 3 and 4)
     - wait for all the files on my volumes to heal. It took a few days.
     - remove bricks 1 and 2
     - after having "reset" the arbiter, re-add the arbiter into the cluster

 And now it's intermittently hanging on writing *on existing files*.
 There is *no problem for writing new files* on the volumes.
>>> Hi,
>>>
>>> There was an arbiter volume hang issue that was fixed [1] recently.
>>> The fix has been back-ported to all release branches.
>>>
>>> One workaround to overcome hangs is to (1)turn off  'testvol
>>> cluster.data-self-heal', remount the clients *and* (2) restart
>>> glustershd (via volume start force). The hang is observed due to an
>>> unreleased lock from self-heal. There are other ways to release the
>>> stale lock via gluster clear-locks command or tweaking
>>> features.locks-revocation-secs but restarting shd whenever you see the
>>> issue is the easiest and safest way.
>>>
>>> -Ravi
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>>>
>>>
 I'm lost here, thanks for your inputs!

     Hoggins!

 Le 14/09/2018 à 04:16, Amar Tumballi a écrit :
> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod  > wrote:
>
> I apologise for this being posted twice - I'm not sure if that was
> user error or a bug in the mailing list, but the list wasn't
> showing my post after quite some time so I sent a second email
> which near immediately showed up - that's mailing lists I guess...
>
> Anyway, if anyone has any input, advice or abuse I'm welcome any
> input!
>
>
> We got a little late getting back on this. But after running tests
> internally, we found that a possibly missing volume option is the reason
> for this:
>
> Try 
>
> gluster volume set  server.tcp-user-timeout 42
> on your volume. Let us know if this helps.
> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
>  
>
> --
> Sam McLeod
> https://smcleod.net
> https://twitter.com/s_mcleod
>
>> On 3 Sep 2018, at 1:20 pm, Sam McLeod > > wrote:
>>
>> We've got an odd problem where clients are blocked from writing
>> to Gluster volumes until the first node of the Gluster cluster is
>> rebooted.
>>
>> I suspect I've either configured something incorrectly with the
>> arbiter / replica configuration of the volumes, or there is some
>> sort of bug in the gluster client-server connection that we're
>> triggering.
>>
>> I was wondering if anyone has seen this or could point me in the
>> right direction?
>>
>>
>> *Environment:*
>>
>>   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
>> metadata only).
>>   * Version: Client and Servers both running 4.1.3, both on
>> CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
>> networked SSD storage backing them, XFS.
>>   * Client: Native Gluster FUSE client mounting via the
>> kubernetes provider
>>
>>
>> *Problem:*
>>
>>   * Seemingly randomly some clients will be blocked / are unable
>> to write to what should be a highly available gluster volume.
>>   * The client gluster logs show it failing to do new file
>> operations across various volumes and all th

[Gluster-users] geo replication issue

2018-10-24 Thread Krishna Verma



Hi Everyone,

I have created a 4x4 distributed Gluster volume, but when I start the
geo-replication session it fails with the errors below.

[2018-10-24 10:02:03.857861] I [gsyncdstatus(monitor):245:set_worker_status] 
GeorepStatus: Worker Status Change status=Initializing...
[2018-10-24 10:02:03.858133] I [monitor(monitor):155:monitor] Monitor: starting 
gsyncd worker   brick=/gfs1/brick1/gv1  slave_node=sj-gluster02
[2018-10-24 10:02:03.954746] I [gsyncd(agent /gfs1/brick1/gv1):297:main] : 
Using session config file   
path=/var/lib/glusterd/geo-replication/gv1_sj-gluster01_gv1/gsyncd.conf
[2018-10-24 10:02:03.956724] I [changelogagent(agent 
/gfs1/brick1/gv1):72:__init__] ChangelogAgent: Agent listining...
[2018-10-24 10:02:03.958110] I [gsyncd(worker /gfs1/brick1/gv1):297:main] 
: Using session config file  
path=/var/lib/glusterd/geo-replication/gv1_sj-gluster01_gv1/gsyncd.conf
[2018-10-24 10:02:03.975778] I [resource(worker 
/gfs1/brick1/gv1):1377:connect_remote] SSH: Initializing SSH connection between 
master and slave...
[2018-10-24 10:02:07.413379] E [syncdutils(worker 
/gfs1/brick1/gv1):305:log_raise_exception] : connection to peer is broken
[2018-10-24 10:02:07.414144] E [syncdutils(worker /gfs1/brick1/gv1):801:errlog] 
Popen: command returned error   cmd=ssh -oPasswordAuthentication=no 
-oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 
22 -oControlMaster=auto -S 
/tmp/gsyncd-aux-ssh-OE_W1C/cf9a66dce686717c4a5ef9a7c3a7f8be.sock sj-gluster01 
/nonexistent/gsyncd slave gv1 sj-gluster01::gv1 --master-node noida-gluster01 
--master-node-id 08925454-9fea-4b24-8f82-9d7ad917b870 --master-brick 
/gfs1/brick1/gv1 --local-node sj-gluster02 --local-node-id 
f592c041-dcae-493c-b5a0-31e376a5be34 --slave-timeout 120 --slave-log-level INFO 
--slave-gluster-log-level INFO --slave-gluster-command-dir /usr/local/sbin/  
error=2
[2018-10-24 10:02:07.414386] E [syncdutils(worker /gfs1/brick1/gv1):805:logerr] 
Popen: ssh> /usr/bin/python2: can't open file 
'/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py': [Errno 2] No such file or 
directory
[2018-10-24 10:02:07.422688] I [repce(agent /gfs1/brick1/gv1):80:service_loop] 
RepceServer: terminating on reaching EOF.
[2018-10-24 10:02:07.422842] I [monitor(monitor):266:monitor] Monitor: worker 
died before establishing connection   brick=/gfs1/brick1/gv1
[2018-10-24 10:02:07.435054] I [gsyncdstatus(monitor):245:set_worker_status] 
GeorepStatus: Worker Status Change status=Faulty


MASTER NODE          MASTER VOL    MASTER BRICK        SLAVE USER    SLAVE                SLAVE NODE    STATUS    CRAWL STATUS    LAST_SYNCED
----------------------------------------------------------------------------------------------------------------------------------------------
noida-gluster01      gv1           /gfs1/brick1/gv1    root          sj-gluster01::gv1    N/A           Faulty    N/A             N/A
noida-gluster02      gv1           /gfs1/brick1/gv1    root          sj-gluster01::gv1    N/A           Faulty    N/A             N/A
gluster-poc-noida    gv1           /gfs1/brick1/gv1    root          sj-gluster01::gv1    N/A           Faulty    N/A             N/A
noi-poc-gluster      gv1           /gfs1/brick1/gv1    root          sj-gluster01::gv1    N/A           Faulty    N/A             N/A


Could someone please help?

/Krishna
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing ou

2018-10-24 Thread Ravishankar N


On 10/24/2018 02:38 PM, Hoggins! wrote:

Thanks, that's helping a lot, I will do that.

One more question: should the glustershd restart be performed on the
arbiter only, or on each node of the cluster?
If you do a 'gluster volume start volname force' it will restart the shd 
on all nodes.

-Ravi


Thanks!

     Hoggins!

Le 24/10/2018 à 02:55, Ravishankar N a écrit :

On 10/23/2018 10:01 PM, Hoggins! wrote:

Hello there,

I'm stumbling upon the *exact same issue*, and unfortunately setting the
server.tcp-user-timeout to 42 does not help.
Any other suggestion?

I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
node) suggested by Sam works, but it often doesn't.

You may ask how I got into this, well it's simple: I needed to replace
my brick 1 and brick 2 with two brand new machines, so here's what I did:
     - add brick 3 and brick 4 into the cluster (gluster peer probe,
gluster volume add-brick, etc., with the issue regarding the arbiter
node that has to be first removed from the cluster before being able to
add bricks 3 and 4)
     - wait for all the files on my volumes to heal. It took a few days.
     - remove bricks 1 and 2
     - after having "reset" the arbiter, re-add the arbiter into the cluster

And now it's intermittently hanging on writing *on existing files*.
There is *no problem for writing new files* on the volumes.

Hi,

There was an arbiter volume hang issue that was fixed [1] recently.
The fix has been back-ported to all release branches.

One workaround to overcome hangs is to (1)turn off  'testvol
cluster.data-self-heal', remount the clients *and* (2) restart
glustershd (via volume start force). The hang is observed due to an
unreleased lock from self-heal. There are other ways to release the
stale lock via gluster clear-locks command or tweaking
features.locks-revocation-secs but restarting shd whenever you see the
issue is the easiest and safest way.

-Ravi

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802



I'm lost here, thanks for your inputs!

     Hoggins!

Le 14/09/2018 à 04:16, Amar Tumballi a écrit :

On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod mailto:mailingli...@smcleod.net>> wrote:

 I apologise for this being posted twice - I'm not sure if that was
 user error or a bug in the mailing list, but the list wasn't
 showing my post after quite some time so I sent a second email
 which near immediately showed up - that's mailing lists I guess...

 Anyway, if anyone has any input, advice or abuse I'm welcome any
 input!


We got a little late getting back on this. But after running tests
internally, we found that a possibly missing volume option is the reason
for this:

Try

gluster volume set  server.tcp-user-timeout 42
on your volume. Let us know if this helps.
(Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
  


 --
 Sam McLeod
 https://smcleod.net
 https://twitter.com/s_mcleod


 On 3 Sep 2018, at 1:20 pm, Sam McLeod mailto:mailingli...@smcleod.net>> wrote:

 We've got an odd problem where clients are blocked from writing
 to Gluster volumes until the first node of the Gluster cluster is
 rebooted.

 I suspect I've either configured something incorrectly with the
 arbiter / replica configuration of the volumes, or there is some
 sort of bug in the gluster client-server connection that we're
 triggering.

 I was wondering if anyone has seen this or could point me in the
 right direction?


 *Environment:*

   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
 metadata only).
   * Version: Client and Servers both running 4.1.3, both on
 CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
 networked SSD storage backing them, XFS.
   * Client: Native Gluster FUSE client mounting via the
 kubernetes provider


 *Problem:*

   * Seemingly randomly some clients will be blocked / are unable
 to write to what should be a highly available gluster volume.
   * The client gluster logs show it failing to do new file
 operations across various volumes and all three nodes of the
 gluster.
   * The server gluster (or OS) logs do not show any warnings or
 errors.
   * The client recovers and is able to write to volumes again
 after the first node of the gluster cluster is rebooted.
   * Until the first node of the gluster cluster is rebooted, the
 client fails to write to the volume that is (or should be)
 available on the second node (a replica) and third node (an
 arbiter only node).


 *What 'fixes' the issue:*

   * Although the clients (kubernetes hosts) connect to all 3
 nodes of the Gluster cluster - restarting the first gluster
 node always unblocks the IO and allows the client to continue
 writing.
   * Stopping and s

Re: [Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing ou

2018-10-24 Thread Hoggins!
Thanks, that's helping a lot, I will do that.

One more question: should the glustershd restart be performed on the
arbiter only, or on each node of the cluster?

Thanks!

    Hoggins!

Le 24/10/2018 à 02:55, Ravishankar N a écrit :
>
> On 10/23/2018 10:01 PM, Hoggins! wrote:
>> Hello there,
>>
>> I'm stumbling upon the *exact same issue*, and unfortunately setting the
>> server.tcp-user-timeout to 42 does not help.
>> Any other suggestion?
>>
>> I'm running a replica 3 arbiter 1 GlusterFS cluster, all nodes running
>> version 4.1.5 (Fedora 28), and /sometimes/ the workaround (rebooting a
>> node) suggested by Sam works, but it often doesn't.
>>
>> You may ask how I got into this, well it's simple: I needed to replace
>> my brick 1 and brick 2 with two brand new machines, so here's what I did:
>>     - add brick 3 and brick 4 into the cluster (gluster peer probe,
>> gluster volume add-brick, etc., with the issue regarding the arbiter
>> node that has to be first removed from the cluster before being able to
>> add bricks 3 and 4)
>>     - wait for all the files on my volumes to heal. It took a few days.
>>     - remove bricks 1 and 2
>>     - after having "reset" the arbiter, re-add the arbiter into the cluster
>>
>> And now it's intermittently hanging on writing *on existing files*.
>> There is *no problem for writing new files* on the volumes.
> Hi,
>
> There was an arbiter volume hang issue that was fixed [1] recently.
> The fix has been back-ported to all release branches.
>
> One workaround to overcome hangs is to (1)turn off  'testvol
> cluster.data-self-heal', remount the clients *and* (2) restart
> glustershd (via volume start force). The hang is observed due to an
> unreleased lock from self-heal. There are other ways to release the
> stale lock via gluster clear-locks command or tweaking
> features.locks-revocation-secs but restarting shd whenever you see the
> issue is the easiest and safest way.
>
> -Ravi
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1637802
>
>
>> I'm lost here, thanks for your inputs!
>>
>>     Hoggins!
>>
>> Le 14/09/2018 à 04:16, Amar Tumballi a écrit :
>>> On Mon, Sep 3, 2018 at 3:41 PM, Sam McLeod >> > wrote:
>>>
>>> I apologise for this being posted twice - I'm not sure if that was
>>> user error or a bug in the mailing list, but the list wasn't
>>> showing my post after quite some time so I sent a second email
>>> which near immediately showed up - that's mailing lists I guess...
>>>
>>> Anyway, if anyone has any input, advice or abuse I'm welcome any
>>> input!
>>>
>>>
>>> We got a little late getting back on this. But after running tests
>>> internally, we found that a possibly missing volume option is the reason
>>> for this:
>>>
>>> Try 
>>>
>>> gluster volume set  server.tcp-user-timeout 42
>>> on your volume. Let us know if this helps.
>>> (Ref: https://review.gluster.org/#/c/glusterfs/+/21170/)
>>>  
>>>
>>> --
>>> Sam McLeod
>>> https://smcleod.net
>>> https://twitter.com/s_mcleod
>>>
 On 3 Sep 2018, at 1:20 pm, Sam McLeod >>> > wrote:

 We've got an odd problem where clients are blocked from writing
 to Gluster volumes until the first node of the Gluster cluster is
 rebooted.

 I suspect I've either configured something incorrectly with the
 arbiter / replica configuration of the volumes, or there is some
 sort of bug in the gluster client-server connection that we're
 triggering.

 I was wondering if anyone has seen this or could point me in the
 right direction?


 *Environment:*

   * Topology: 3 node cluster, replica 2, arbiter 1 (third node is
 metadata only).
   * Version: Client and Servers both running 4.1.3, both on
 CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast
 networked SSD storage backing them, XFS.
   * Client: Native Gluster FUSE client mounting via the
 kubernetes provider


 *Problem:*

   * Seemingly randomly some clients will be blocked / are unable
 to write to what should be a highly available gluster volume.
   * The client gluster logs show it failing to do new file
 operations across various volumes and all three nodes of the
 gluster.
   * The server gluster (or OS) logs do not show any warnings or
 errors.
   * The client recovers and is able to write to volumes again
 after the first node of the gluster cluster is rebooted.
   * Until the first node of the gluster cluster is rebooted, the
 client fails to write to the volume that is (or should be)
 available on the second node (a replica) and third node (an
 arbiter only node).


 *What 'fixes' the issue:*

   

[Gluster-users] Gluster Errors and configuration status

2018-10-24 Thread Vrgotic, Marko
Dear Gluster team,

Since January 2018 I have been running GlusterFS with 4 nodes.
The storage is attached to an oVirt system and has been running happily so far.

I have three volumes:
Gv0_she – triple-replicated volume for the oVirt SelfHostedEngine (it's a requirement)
Gv1_vmpool – distributed volume across all four nodes for guest VMs
Gv2_vmpool – distributed-replicated volume across all four nodes (this one is in use and is to be a replacement for Gv1_vmpool) – created 4 weeks ago

Questions:


  1.  What is the best (recommended) way to monitor performance and watch for issues on the volumes and their images? (See the CLI monitoring sketch after the log excerpt below.)
  2.  From the log files, I see the glusterd log files and per-brick/volume log files; so far it seems the main focus should be on the brick/volume logs.
  3.  Some of the errors I saw, which I do not yet know enough about to judge their criticality or how to solve them:


[2018-10-23 10:49:26.747985] W [MSGID: 113096] 
[posix-handle.c:770:posix_handle_hard] 0-gv2_vmpool-posix: link 
/gluster/brick2/gv2/.shard/766dbebd-336e-4925-89c6-5a429fa9607c.15 -> 
/gluster/brick2/gv2/.glusterfs/22/e4/22e4eff4-2c0c-4f99-9672-052fbb1f431efailed 
 [File exists]

[2018-10-23 10:49:26.748020] E [MSGID: 113020] [posix.c:1485:posix_mknod] 
0-gv2_vmpool-posix: setting gfid on 
/gluster/brick2/gv2/.shard/766dbebd-336e-4925-89c6-5a429fa9607c.15 failed

[2018-10-23 10:49:26.747989] W [MSGID: 113096] 
[posix-handle.c:770:posix_handle_hard] 0-gv2_vmpool-posix: link 
/gluster/brick2/gv2/.shard/766dbebd-336e-4925-89c6-5a429fa9607c.15 -> 
/gluster/brick2/gv2/.glusterfs/22/e4/22e4eff4-2c0c-4f99-9672-052fbb1f431efailed 
 [File exists]

[2018-10-23 10:50:48.075821] W [MSGID: 113096] 
[posix-handle.c:770:posix_handle_hard] 0-gv2_vmpool-posix: link 
/gluster/brick2/gv2/.shard/52185a97-ae8d-4925-8be6-b1afa90b5116.5 -> 
/gluster/brick2/gv2/.glusterfs/d7/fb/d7fb430a-e6c6-4bbb-b3b9-5f0691ad68bafailed 
 [File exists]

[2018-10-23 10:50:48.075866] E [MSGID: 113020] [posix.c:1485:posix_mknod] 
0-gv2_vmpool-posix: setting gfid on 
/gluster/brick2/gv2/.shard/52185a97-ae8d-4925-8be6-b1afa90b5116.5 failed

[2018-10-23 10:51:00.885479] W [MSGID: 113096] 
[posix-handle.c:770:posix_handle_hard] 0-gv2_vmpool-posix: link 
/gluster/brick2/gv2/.shard/52185a97-ae8d-4925-8be6-b1afa90b5116.12 -> 
/gluster/brick2/gv2/.glusterfs/91/ed/91ede536-e9e7-4371-8c4a-08b41f9a5e15failed 
 [File exists]

[2018-10-23 10:51:00.885491] E [MSGID: 113020] [posix.c:1485:posix_mknod] 
0-gv2_vmpool-posix: setting gfid on 
/gluster/brick2/gv2/.shard/52185a97-ae8d-4925-8be6-b1afa90b5116.12 failed

[2018-10-23 10:51:00.885480] W [MSGID: 113096] 
[posix-handle.c:770:posix_handle_hard] 0-gv2_vmpool-posix: link 
/gluster/brick2/gv2/.shard/52185a97-ae8d-4925-8be6-b1afa90b5116.12 -> 
/gluster/brick2/gv2/.glusterfs/91/ed/91ede536-e9e7-4371-8c4a-08b41f9a5e15failed 
 [File exists]
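
(Regarding question 1, a minimal sketch of CLI-level monitoring; there is no GUI here, the volume names are taken from this mail, and profiling adds some overhead while it is enabled.)

# per-brick health, ports, PIDs and capacity
gluster volume status gv2_vmpool detail

# enable and read I/O statistics (latency and FOP counts) per brick
gluster volume profile gv2_vmpool start
gluster volume profile gv2_vmpool info

# most active files and operations on the volume
gluster volume top gv2_vmpool read
gluster volume top gv2_vmpool write

# pending self-heal entries on the replicated volumes
gluster volume heal gv0_he info
gluster volume heal gv2_vmpool info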

Gluster volume status:
Status of volume: gv0_he
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick aws-gfs-01.awesome.lan:/gluster/brick
1/gv0   49152 0  Y   20938
Brick aws-gfs-02.awesome.lan:/gluster/brick
2/gv0   49152 0  Y   30787
Brick aws-gfs-03.awesome.lan:/gluster/brick
3/gv0   49152 0  Y   24685
Self-heal Daemon on localhost   N/A   N/AY   25808
Self-heal Daemon on aws-gfs-04.awesome.lan  N/A   N/AY   27130
Self-heal Daemon on aws-gfs-02.awesome.lan  N/A   N/AY   2672
Self-heal Daemon on aws-gfs-03.awesome.lan  N/A   N/AY   29368

Task Status of Volume gv0_he
--
There are no active volume tasks

Status of volume: gv1_vmpool
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick aws-gfs-01.awesome.lan:/gluster/brick
1/gv1   49153 0  Y   2066
Brick aws-gfs-02.awesome.lan:/gluster/brick
2/gv1   49153 0  Y   1933
Brick aws-gfs-03.awesome.lan:/gluster/brick
3/gv1   49153 0  Y   2027
Brick aws-gfs-04.awesome.lan:/gluster/brick
4/gv1   49152 0  Y   1870

Task Status of Volume gv1_vmpool
--
There are no active volume tasks

Status of volume: gv2_vmpool
Gluster process TCP Port  RDMA Port  Online  Pid
--
Brick aws-gfs-01.awesome.lan:/gluster/brick
1/gv2   49154 0  Y   25787
Bric