Re: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does not work Now: Upgraded to 4.1.3 geo node Faulty
Hi Marcus,

Geo-rep had a few important fixes in 4.1.3. Is it possible to upgrade and check whether the issue is still seen?

Thanks,
Kotresh HR

On Sat, Sep 1, 2018 at 5:08 PM, Marcus Pedersén wrote:
> Hi again,
>
> I found another problem on the other master node.
>
> The node toggles Active/Faulty and it is the same error over and over again.
>
> [2018-09-01 11:23:02.94080] E [repce(worker /urd-gds/gluster):197:__call__] RepceClient: call failed call=1226:139955262510912:1535800981.24 method=entry_ops error=GsyncdError
> [2018-09-01 11:23:02.94214] E [syncdutils(worker /urd-gds/gluster):300:log_raise_exception] : execution of "gluster" failed with ENOENT (No such file or directory)
> [2018-09-01 11:23:02.106194] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.
> [2018-09-01 11:23:02.12] I [gsyncdstatus(monitor):244:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
>
> I have also found a python error as well; I have only seen this once though.
>
> [2018-09-01 11:16:45.907660] I [master(worker /urd-gds/gluster):1536:crawl] _GMaster: slave's time stime=(1524101534, 0)
> [2018-09-01 11:16:47.364109] E [syncdutils(worker /urd-gds/gluster):332:log_raise_exception] : FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 362, in twrap
>     tf(*aargs)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1939, in syncjob
>     po = self.sync_engine(pb, self.log_err)
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1442, in rsync
>     rconf.ssh_ctl_args + \
> AttributeError: 'NoneType' object has no attribute 'split'
> [2018-09-01 11:16:47.384531] I [repce(agent /urd-gds/gluster):80:service_loop] RepceServer: terminating on reaching EOF.
> [2018-09-01 11:16:48.362987] I [monitor(monitor):279:monitor] Monitor: worker died in startup phase brick=/urd-gds/gluster
> [2018-09-01 11:16:48.370701] I [gsyncdstatus(monitor):244:set_worker_status] GeorepStatus: Worker Status Change status=Faulty
> [2018-09-01 11:16:58.390548] I [monitor(monitor):158:monitor] Monitor: starting gsyncd worker brick=/urd-gds/gluster slave_node=urd-gds-geo-000
>
> I attach the logs as well.
>
> Many thanks!
>
> Best regards
> Marcus Pedersén
>
> --
> *From:* gluster-users-boun...@gluster.org on behalf of Marcus Pedersén
> *Sent:* 31 August 2018 16:09
> *To:* khire...@redhat.com
> *Cc:* gluster-users@gluster.org
> *Subject:* Re: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does not work Now: Upgraded to 4.1.3 geo node Faulty
>
> I really apologize, third try to make the mail smaller.
>
> /Marcus
>
> --
> *From:* Marcus Pedersén
> *Sent:* 31 August 2018 16:03
> *To:* Kotresh Hiremath Ravishankar
> *Cc:* gluster-users@gluster.org
> *Subject:* RE: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does not work Now: Upgraded to 4.1.3 geo node Faulty
>
> Sorry, resend due to too large mail.
>
> /Marcus
>
> --
> *From:* Marcus Pedersén
> *Sent:* 31 August 2018 15:19
> *To:* Kotresh Hiremath Ravishankar
> *Cc:* gluster-users@gluster.org
> *Subject:* RE: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does not work Now: Upgraded to 4.1.3 geo node Faulty
>
> Hi Kotresh,
>
> Please find attached logs; only logs from today.
>
> The python error was repeated over and over again until I disabled selinux. After that the node became active again.
>
> The return code 23 seems to be repeated over and over again.
>
> rsync version 3.1.2
>
> Thanks a lot!
>
> Best regards
> Marcus
>
> --
> *From:* Kotresh Hiremath Ravishankar
> *Sent:* 31 August 2018 11:09
> *To:* Marcus Pedersén
> *Cc:* gluster-users@gluster.org
> *Subject:* Re: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does not work Now: Upgraded to 4.1.3 geo node Faulty
>
> Hi Marcus,
>
> Could you attach the full logs? Is the same traceback happening repeatedly? It would be helpful if you attach the corresponding mount log as well. What's the rsync version you are using?
>
> Thanks,
> Kotresh HR
>
> On Fri, Aug 31, 2018 at 12:16 PM, Marcus Pedersén wrote:
>
>> Hi all,
>>
>> I had problems with stopping sync after the upgrade to 4.1.2.
>>
>> I upgraded to 4.1.3 and it ran fine for one day, but now one of the master nodes shows faulty.
>>
>> Most of the sync jobs have return code 23; how do I resolve this?
>>
>> I see messages like:
>>
>> _GMaster: Sucessfully fixed all entry ops with gfid mismatch
>>
>> Will this resolve error code 23?
>>
>> There is also a python error. The python error was a selinux problem; turning off selinux made the node go to active again.
>>
>> See log below.
>>
>> CentOS 7, installed through SIG Gluster (OS updated to latest at
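For reference on the "return code 23" mentioned above: rsync's exit codes are documented in rsync(1), and code 23 is a partial transfer rather than a total failure. The helper below is an illustrative sketch only (the function name is made up, not part of gluster), with the code meanings taken from the rsync manual:

```shell
# Hypothetical helper: map the rsync exit codes seen in geo-replication logs
# to their meanings from rsync(1). Code 23 usually points at per-file errors
# (permissions, vanished files, etc.) rather than a failure of the whole run.
describe_rsync_exit() {
  case "$1" in
    0)  echo "success" ;;
    12) echo "error in rsync protocol data stream" ;;
    23) echo "partial transfer due to error" ;;
    24) echo "partial transfer due to vanished source files" ;;
    30) echo "timeout in data send/receive" ;;
    *)  echo "unknown rsync exit code: $1" ;;
  esac
}

describe_rsync_exit 23   # -> partial transfer due to error
```

With SELinux denying gsyncd's rsync calls, per-file errors (and therefore code 23) are exactly what you would expect to see repeated.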
Re: [Gluster-users] Gluster 3.12.12: performance during heal and in general
On Fri, Aug 31, 2018 at 1:18 PM Hu Bert wrote:

> Hi Pranith,
>
> I just wanted to ask if you were able to get any feedback from your colleagues :-)

Sorry, I didn't get a chance to. I am working on a customer issue which is taking away cycles from any other work. Let me get back to you once I get time this week.

> By the way: we migrated some stuff (static resources, small files) to an NFS server that we actually wanted to replace with glusterfs. Load and CPU usage have gone down a bit, but are still asymmetric on the 3 gluster servers.
>
> 2018-08-28 9:24 GMT+02:00 Hu Bert :
> > Hm, I noticed that in the shared.log (volume log file) on gluster11 and gluster12 (but not on gluster13) I now see these warnings:
> >
> > [2018-08-28 07:18:57.224367] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 3054593291
> > [2018-08-28 07:19:17.733625] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 2595205890
> > [2018-08-28 07:19:27.950355] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 3105728076
> > [2018-08-28 07:19:42.519010] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 3740415196
> > [2018-08-28 07:19:48.194774] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 2922795043
> > [2018-08-28 07:19:52.506135] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 2841655539
> > [2018-08-28 07:19:55.466352] W [MSGID: 109011] [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for hash (value) = 3049465001
> >
> > Don't know if that could be related.
> >
> > 2018-08-28 8:54 GMT+02:00 Hu Bert :
> >> A little update after about 2 hours of uptime: still/again high CPU usage by one of the brick processes; server load >30.
> >>
> >> gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
> >> gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
> >> gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
> >>
> >> The process for brick bricksdd1 consumes almost all 12 cores. Interestingly there are more threads for the bricksdd1 process than for the other bricks. Counted with "ps huH p <pid> | wc -l":
> >>
> >> gluster11: bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads, bricksdd1 85 threads
> >> gluster12: bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads, bricksdd1_new 58 threads
> >> gluster13: bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads, bricksdd1_new 82 threads
> >>
> >> Don't know if that could be relevant.
> >>
> >> 2018-08-28 7:04 GMT+02:00 Hu Bert :
> >>> Good morning,
> >>>
> >>> Today I updated + rebooted all gluster servers: kernel update to 4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the gluster servers (gluster13) one of the bricks came up at the beginning but then lost connection.
> >>>
> >>> OK:
> >>>
> >>> Status of volume: shared
> >>> Gluster process                                TCP Port  RDMA Port  Online  Pid
> >>> ------------------------------------------------------------------------------
> >>> [...]
> >>> Brick gluster11:/gluster/bricksdd1/shared      49155     0          Y       2506
> >>> Brick gluster12:/gluster/bricksdd1_new/shared  49155     0          Y       2097
> >>> Brick gluster13:/gluster/bricksdd1_new/shared  49155     0          Y       2136
> >>>
> >>> Lost connection:
> >>>
> >>> Brick gluster11:/gluster/bricksdd1/shared      49155     0          Y       2506
> >>> Brick gluster12:/gluster/bricksdd1_new/shared  49155     0          Y       2097
> >>> Brick gluster13:/gluster/bricksdd1_new/shared  N/A       N/A        N       N/A
> >>>
> >>> gluster volume heal shared info:
> >>> Brick gluster13:/gluster/bricksdd1_new/shared
> >>> Status: Transport endpoint is not connected
> >>> Number of entries: -
> >>>
> >>> The reboot was at 06:15:39; the brick then worked for a short period, but then somehow disconnected.
> >>>
> >>> From gluster13:/var/log/glusterfs/glusterd.log:
> >>>
> >>> [2018-08-28 04:27:36.944608] I [MSGID: 106005] [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management: Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from glusterd.
> >>> [2018-08-28 04:28:57.869666] I [glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a fresh brick process for brick /gluster/bricksdd1_new/shared
> >>> [2018-08-28 04:35:20.732666] I [MSGID: 106143] [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick /gluster/bricksdd1_new/shared on port 49157
> >>>
> >>> After 'gluster volume start shared force' (then with new
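The thread count described above can be reproduced with standard procps flags. A runnable sketch (the `pgrep` pattern for finding the busy brick's PID is an assumption about how the glusterfsd command line looks on these hosts; `$$`, this shell, stands in so the snippet runs anywhere):

```shell
# Count the threads of a process with BSD-style ps flags:
#   h = suppress header, u = user-oriented format, H = one line per thread
# On the servers above you would use the glusterfsd PID of the busy brick,
# e.g.: pid=$(pgrep -f 'glusterfsd.*bricksdd1' | head -n 1)
pid=$$
ps huH p "$pid" | wc -l
```

Comparing this count across bricks, as done above, is a quick way to spot one brick process spinning up far more worker threads than its siblings.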
Re: [Gluster-users] Transport endpoint is not connected : issue
Hey,

We need some more information to debug this. I think you forgot to send the output of 'gluster volume info '. Can you also provide the bricks, shd and glfsheal logs as well? How many peers are present in the setup? You also mentioned that "one of the file servers have two processes for each of the volumes instead of one per volume" - which process are you talking about here?

Regards,
Karthik

On Sat, Sep 1, 2018 at 12:10 AM Johnson, Tim wrote:

> Thanks for the reply.
>
> I have attached the gluster.log file from the host that it is happening to at this time. It does change which host it does this on.
>
> Thanks.
>
> *From:* Atin Mukherjee
> *Date:* Friday, August 31, 2018 at 1:03 PM
> *To:* "Johnson, Tim"
> *Cc:* Karthik Subrahmanya , Ravishankar N <ravishan...@redhat.com>, "gluster-users@gluster.org" <gluster-users@gluster.org>
> *Subject:* Re: [Gluster-users] Transport endpoint is not connected : issue
>
> Can you please pass all the gluster log files from the server where the transport end point not connected error is reported? As restarting glusterd didn't solve this issue, I believe this isn't a stale port problem but something else. Also please provide the output of 'gluster v info '
>
> (@cc Ravi, Karthik)
>
> On Fri, 31 Aug 2018 at 23:24, Johnson, Tim wrote:
>
> Hello all,
>
> We have gluster replicate (with arbiter) volumes on which we are getting "Transport endpoint is not connected", on a rotating basis, from each of the two file servers and a third host that has the arbiter bricks on it. This happens when trying to run a heal on all the volumes on the gluster hosts. When I get the status of all the volumes, all looks good.
>
> This behavior seems to be a foreshadowing of the gluster volumes becoming unresponsive to our VM cluster, as well as one of the file servers having two processes for each of the volumes instead of one per volume. Eventually the affected file server will drop off the listed peers. Restarting glusterd/glusterfsd on the affected file server does not take care of the issue; we have to bring down both file servers due to the volumes not being seen by the VM cluster after the errors start occurring. I had seen bug reports about "Transport endpoint is not connected" on earlier versions of Gluster, however I had thought that it had been addressed.
>
> Dmesg did have some entries for "a possible syn flood on port *", for which we changed the sysctl to "net.ipv4.tcp_max_syn_backlog = 2048". That seemed to help the syn flood messages but not the underlying volume issues.
>
> I have put the versions of all the Gluster packages installed below as well as the "Heal" and "Status" commands showing the volumes are
>
> This has just started happening, but I cannot definitively say whether it started occurring after an update or not.
>
> Thanks for any assistance.
>
> Running Heal:
>
> gluster volume heal ovirt_engine info
> Brick 1.rrc.local:/bricks/brick0/ovirt_engine
> Status: Connected
> Number of entries: 0
>
> Brick 3.rrc.local:/bricks/brick0/ovirt_engine
> Status: Transport endpoint is not connected
> Number of entries: -
>
> Brick *3.rrc.local:/bricks/arb-brick/ovirt_engine
> Status: Transport endpoint is not connected
> Number of entries: -
>
> Running status:
>
> gluster volume status ovirt_engine
> Status of volume: ovirt_engine
> Gluster process                                TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick *.rrc.local:/bricks/brick0/ovirt_engine          49152  0     Y       5521
> Brick fs2-tier3.rrc.local:/bricks/brick0/ovirt_engine  49152  0     Y       6245
> Brick .rrc.local:/bricks/arb-brick/ovirt_engine        49152  0     Y       3526
> Self-heal Daemon on localhost                  N/A       N/A        Y       5509
> Self-heal Daemon on ***.rrc.local              N/A       N/A        Y       6218
> Self-heal Daemon on ***.rrc.local              N/A       N/A        Y       3501
> Self-heal Daemon on .rrc.local                 N/A       N/A        Y       3657
> Self-heal Daemon on *.rrc.local                N/A       N/A        Y       3753
> Self-heal Daemon on .rrc.local                 N/A       N/A        Y       17284
>
> Task Status of Volume ovirt_engine
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> /etc/glusterd.vol:
>
> volume management
>     type mgmt/glusterd
>     option working-directory /var/lib/glusterd
>     option transport-type socket,rdma
>     option transport.socket.keepalive-time 10
>     option transport.socket.keepalive-interval 2
>
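The SYN-flood mitigation mentioned above (raising `net.ipv4.tcp_max_syn_backlog`) can be checked and persisted as follows; the `sysctl.d` file name is arbitrary, and the persist step is commented out since it needs root:

```shell
# Inspect the current SYN backlog limit via procfs
cat /proc/sys/net/ipv4/tcp_max_syn_backlog

# To persist the value used in the mail above across reboots (needs root;
# the file name under /etc/sysctl.d is arbitrary):
# echo 'net.ipv4.tcp_max_syn_backlog = 2048' > /etc/sysctl.d/90-syn-backlog.conf
# sysctl -p /etc/sysctl.d/90-syn-backlog.conf
```

As the thread notes, this silences the dmesg warnings but treats a symptom; a genuine flood of connection attempts from clients usually points at the underlying brick/volume problem.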
Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
Hi Krishna,

Indexing is the feature used by the hybrid crawl, and it only makes the crawl faster. It has nothing to do with missing data sync. Could you please share the complete log file of the session where the issue is encountered?

Thanks,
Kotresh HR

On Mon, Sep 3, 2018 at 9:33 AM, Krishna Verma wrote:

> Hi Kotresh/Support,
>
> Request your help to get this fixed. My slave is not getting synced with the master. Only when I restart the session after turning indexing off does the file show up at the slave, and even then it is blank with zero size.
>
> At master: file size is 5.8 GB.
>
> [root@gluster-poc-noida distvol]# du -sh 17.10.v001.20171023-201021_17020_GPLV3.tar.gz
> 5.8G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
> [root@gluster-poc-noida distvol]#
>
> But at the slave, after doing the "indexing off", restarting the session and then waiting for 2 days, it shows only 4.9 GB copied.
>
> [root@gluster-poc-sj distvol]# du -sh 17.10.v001.20171023-201021_17020_GPLV3.tar.gz
> 4.9G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
> [root@gluster-poc-sj distvol]#
>
> Similarly, I tested with a small file of size 1.2 GB that is still showing "0" size at the slave after days of waiting.
>
> At Master:
>
> [root@gluster-poc-noida distvol]# du -sh rflowTestInt18.08-b001.t.Z
> 1.2G    rflowTestInt18.08-b001.t.Z
> [root@gluster-poc-noida distvol]#
>
> At Slave:
>
> [root@gluster-poc-sj distvol]# du -sh rflowTestInt18.08-b001.t.Z
> 0       rflowTestInt18.08-b001.t.Z
> [root@gluster-poc-sj distvol]#
>
> Below is my distributed volume info:
>
> [root@gluster-poc-noida distvol]# gluster volume info glusterdist
>
> Volume Name: glusterdist
> Type: Distribute
> Volume ID: af5b2915-7170-4b5e-aee8-7e68757b9bf1
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2
> Transport-type: tcp
> Bricks:
> Brick1: gluster-poc-noida:/data/gluster-dist/distvol
> Brick2: noi-poc-gluster:/data/gluster-dist/distvol
> Options Reconfigured:
> changelog.changelog: on
> geo-replication.ignore-pid-check: on
> geo-replication.indexing: on
> transport.address-family: inet
> nfs.disable: on
> [root@gluster-poc-noida distvol]#
>
> Please help to fix this; I believe it's not normal behavior for gluster rsync.
>
> /Krishna
>
> *From:* Krishna Verma
> *Sent:* Friday, August 31, 2018 12:42 PM
> *To:* 'Kotresh Hiremath Ravishankar'
> *Cc:* Sunny Kumar ; Gluster Users <gluster-users@gluster.org>
> *Subject:* RE: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
>
> Hi Kotresh,
>
> I have tested geo-replication over distributed volumes with a 2*2 gluster setup.
>
> [root@gluster-poc-noida ~]# gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status
>
> MASTER NODE          MASTER VOL     MASTER BRICK                  SLAVE USER    SLAVE                          SLAVE NODE         STATUS    CRAWL STATUS       LAST_SYNCED
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> gluster-poc-noida    glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj     Active    Changelog Crawl    2018-08-31 10:28:19
> noi-poc-gluster      glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj2    Active    History Crawl      N/A
> [root@gluster-poc-noida ~]#
>
> Now at the client I copied an 848MB file from local disk to the master mounted volume, and it took only 1 minute and 15 seconds. That's great….
>
> But even after waiting for 2 hrs I was unable to see that file at the slave site. Then I again erased the indexing by doing "gluster volume set glusterdist indexing off" and restarted the session. Magically, I received the file instantly at the slave after doing this.
>
> Why do I need to do "indexing off" every time for data to reflect at the slave site? Is there any fix/workaround for it?
>
> /Krishna
>
> *From:* Kotresh Hiremath Ravishankar <khire...@redhat.com>
> *Sent:* Friday, August 31, 2018 10:10 AM
> *To:* Krishna Verma <kve...@cadence.com>
> *Cc:* Sunny Kumar <sunku...@redhat.com>; Gluster Users <gluster-users@gluster.org>
> *Subject:* Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
>
> EXTERNAL MAIL
>
> On Thu, Aug 30, 2018 at 3:51 PM, Krishna Verma <kve...@cadence.com> wrote:
>
> Hi Kotresh,
>
> Yes, this includes the time taken to write the 1GB file to the master. geo-rep was not stopped while the data was copying to the master.
>
> This way, you can't really measure how much time geo-rep took.
>
> But now I am in trouble. My putty session timed out while copying data to the master while geo-replication was active. After I restarted the putty session, my master data is not syncing with the slave. Its Last_synced time is 1 hr behind the current time.
>
> I restarted geo-rep and also deleted and
Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
Hi Kotresh/Support,

Request your help to get this fixed. My slave is not getting synced with the master. Only when I restart the session after turning indexing off does the file show up at the slave, and even then it is blank with zero size.

At master: file size is 5.8 GB.

[root@gluster-poc-noida distvol]# du -sh 17.10.v001.20171023-201021_17020_GPLV3.tar.gz
5.8G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
[root@gluster-poc-noida distvol]#

But at the slave, after doing the "indexing off", restarting the session and then waiting for 2 days, it shows only 4.9 GB copied.

[root@gluster-poc-sj distvol]# du -sh 17.10.v001.20171023-201021_17020_GPLV3.tar.gz
4.9G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
[root@gluster-poc-sj distvol]#

Similarly, I tested with a small file of size 1.2 GB that is still showing "0" size at the slave after days of waiting.

At Master:

[root@gluster-poc-noida distvol]# du -sh rflowTestInt18.08-b001.t.Z
1.2G    rflowTestInt18.08-b001.t.Z
[root@gluster-poc-noida distvol]#

At Slave:

[root@gluster-poc-sj distvol]# du -sh rflowTestInt18.08-b001.t.Z
0       rflowTestInt18.08-b001.t.Z
[root@gluster-poc-sj distvol]#

Below is my distributed volume info:

[root@gluster-poc-noida distvol]# gluster volume info glusterdist

Volume Name: glusterdist
Type: Distribute
Volume ID: af5b2915-7170-4b5e-aee8-7e68757b9bf1
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: gluster-poc-noida:/data/gluster-dist/distvol
Brick2: noi-poc-gluster:/data/gluster-dist/distvol
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: on
[root@gluster-poc-noida distvol]#

Please help to fix this; I believe it's not normal behavior for gluster rsync.

/Krishna

From: Krishna Verma
Sent: Friday, August 31, 2018 12:42 PM
To: 'Kotresh Hiremath Ravishankar'
Cc: Sunny Kumar ; Gluster Users <gluster-users@gluster.org>
Subject: RE: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

Hi Kotresh,

I have tested geo-replication over distributed volumes with a 2*2 gluster setup.

[root@gluster-poc-noida ~]# gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status

MASTER NODE          MASTER VOL     MASTER BRICK                  SLAVE USER    SLAVE                          SLAVE NODE         STATUS    CRAWL STATUS       LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------
gluster-poc-noida    glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj     Active    Changelog Crawl    2018-08-31 10:28:19
noi-poc-gluster      glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj2    Active    History Crawl      N/A
[root@gluster-poc-noida ~]#

Now at the client I copied an 848MB file from local disk to the master mounted volume, and it took only 1 minute and 15 seconds. That's great….

But even after waiting for 2 hrs I was unable to see that file at the slave site. Then I again erased the indexing by doing "gluster volume set glusterdist indexing off" and restarted the session. Magically, I received the file instantly at the slave after doing this.

Why do I need to do "indexing off" every time for data to reflect at the slave site? Is there any fix/workaround for it?

/Krishna

From: Kotresh Hiremath Ravishankar <khire...@redhat.com>
Sent: Friday, August 31, 2018 10:10 AM
To: Krishna Verma <kve...@cadence.com>
Cc: Sunny Kumar <sunku...@redhat.com>; Gluster Users <gluster-users@gluster.org>
Subject: Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

EXTERNAL MAIL

On Thu, Aug 30, 2018 at 3:51 PM, Krishna Verma <kve...@cadence.com> wrote:

Hi Kotresh,

Yes, this includes the time taken to write the 1GB file to the master. geo-rep was not stopped while the data was copying to the master.

This way, you can't really measure how much time geo-rep took.

But now I am in trouble. My putty session timed out while copying data to the master while geo-replication was active. After I restarted the putty session, my master data is not syncing with the slave. Its Last_synced time is 1 hr behind the current time.

I restarted geo-rep and also deleted and again created the session, but its "LAST_SYNCED" time is the same.

Unless geo-rep is Faulty, it would be processing/syncing. You should check the logs for any errors.

Please help with this. ….

It's better if the gluster volume has a higher distribute count, like 3*3 or 4*3 :-

Are you referring to creating a distributed volume with 3 master nodes and 3 slave nodes?

Yes, that's correct. Please do the test with this. I recommend you run the actual workload for which you are planning to use gluster instead of copying a 1GB file and testing.

/krishna

From: Kotresh Hiremath Ravishankar <khire...@redhat.com>
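The "indexing off" workaround described in this thread, written out as a command sequence. This is an operational sketch only (it needs a live master cluster; volume and slave names are the ones from this thread), and gluster may refuse to change `geo-replication.indexing` while a session is active, hence the stop first:

```shell
# Stop the session, clear indexing, restart -- the sequence described above.
gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist stop
gluster volume set glusterdist indexing off
gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist start
# Watch CRAWL STATUS / LAST_SYNCED to confirm the changelog crawl resumes:
gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status
```

As Kotresh's replies make clear, needing this repeatedly is a symptom worth debugging from the session logs, not a supported mode of operation.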
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out fr
We've got an odd problem where clients are blocked from writing to Gluster volumes until the first node of the Gluster cluster is rebooted. I suspect I've either configured something incorrectly with the arbiter / replica configuration of the volumes, or there is some sort of bug in the gluster client-server connection that we're triggering. I was wondering if anyone has seen this or could point me in the right direction?

Environment:

Topology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only).
Version: Client and servers both running 4.1.3, both on CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast networked SSD storage backing them, XFS.
Client: Native Gluster FUSE client mounting via the kubernetes provider.

Problem:

Seemingly at random, some clients are blocked / unable to write to what should be a highly available gluster volume. The client gluster logs show it failing to do new file operations across various volumes and all three nodes of the gluster. The server gluster (or OS) logs do not show any warnings or errors. The client recovers and is able to write to volumes again after the first node of the gluster cluster is rebooted. Until the first node of the gluster cluster is rebooted, the client fails to write to the volume that is (or should be) available on the second node (a replica) and third node (an arbiter-only node).

What 'fixes' the issue:

Although the clients (kubernetes hosts) connect to all 3 nodes of the Gluster cluster, restarting the first gluster node always unblocks the IO and allows the client to continue writing. Stopping and starting the glusterd service on the gluster server is not enough to fix the issue, nor is restarting its networking. This suggests to me that the volume is unavailable for writing for some reason, and restarting the first node in the cluster either clears some sort of TCP session between the client-server or between the server-server replication.
Expected behaviour:

If the first gluster node / server had failed or was blocked from performing operations for some reason (which it doesn't seem it is), I'd expect the clients to access data from the second gluster node and write metadata to the third gluster node as well, since it's an arbiter / metadata-only node. If for some reason a gluster node was not able to serve connections to clients, I'd expect to see errors in the volume, glusterd or brick log files (there are none on the first gluster node). If the first gluster node was for some reason blocking IO on a volume, I'd expect that node either to show as unhealthy or unavailable in the gluster peer status or gluster volume status.

Client gluster errors:

staging_static in this example is a volume name. You can see the client trying to connect to the second and third nodes of the gluster cluster and failing (unsure as to why?). The server-side logs on the first gluster node do not show any errors or problems, but the second / third nodes show errors in glusterd.log when trying to 'unlock' the 0-management volume on the first node.

On a gluster client (a kubernetes host using the kubernetes connector, which uses the native fuse client) when it's blocked from writing but the gluster appears healthy (other than the errors mentioned later):

[2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 1800 for :49154
[2018-09-02 15:33:22.750989] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
[2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751.
timeout = 1800 for :49154
[2018-09-02 16:03:23.097988] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
[2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 1800 for :49154
[2018-09-02 16:33:23.439282] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
[2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 16:33:23.455171. timeout = 1800 for :49154
[2018-09-02 17:03:23.786971] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
[2018-09-02 17:33:24.160607] E
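When INODELK calls bail out like this, a useful next step is to see which client still holds the lock that everyone else is queued behind. An operational sketch (needs a live cluster; the dump location shown is gluster's default, and the volume name is the one from this thread):

```shell
# Ask the bricks of the affected volume to dump their state, locks included
gluster volume statedump staging_static

# On each brick host, the dump files land under /var/run/gluster/ by default;
# look for granted/blocked inodelk entries to identify the lock holder
grep -B2 -A4 'inodelk' /var/run/gluster/*.dump.*
```

If the granted lock belongs to a client that has since gone away, that would explain why only restarting the first node (and thereby resetting its brick connections) clears the blockage.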
[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out fr
We've got an odd problem where clients are blocked from writing to Gluster volumes until the first node of the Gluster cluster is rebooted. I suspect I've either configured something incorrectly with the arbiter / replica configuration of the volumes, or there is some sort of bug in the gluster client-server connection that we're triggering. I was wondering if anyone has seen this or could point me in the right direction? Environment: Typology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only). Version: Client and Servers both running 4.1.3, both on CentOS 7, kernel 4.18.x, (Xen) VMs with relatively fast networked SSD storage backing them, XFS. Client: Native Gluster FUSE client mounting via the kubernetes provider Problem: Seemingly randomly some clients will be blocked / are unable to write to what should be a highly available gluster volume. The client gluster logs show it failing to do new file operations across various volumes and all three nodes of the gluster. The server gluster (or OS) logs do not show any warnings or errors. The client recovers and is able to write to volumes again after the first node of the gluster cluster is rebooted. Until the first node of the gluster cluster is rebooted, the client fails to write to the volume that is (or should be) available on the second node (a replica) and third node (an arbiter only node). What 'fixes' the issue: Although the clients (kubernetes hosts) connect to all 3 nodes of the Gluster cluster - restarting the first gluster node always unblocks the IO and allows the client to continue writing. Stopping and starting the glusterd service on the gluster server is not enough to fix the issue, nor is restarting its networking. This suggests to me that the volume is unavailable for writing for some reason and restarting the first node in the cluster either clears some sort of TCP sessions between the client-server or between the server-server replication. 
Expected behaviour:
If the first gluster node / server had failed or was blocked from performing operations for some reason (which it doesn't seem it is), I'd expect the clients to access data from the second gluster node and write metadata to the third gluster node as well, as it's an arbiter / metadata-only node.
If for some reason a gluster node was not able to serve connections to clients, I'd expect to see errors in the volume, glusterd or brick log files (there are none on the first gluster node).
If the first gluster node was for some reason blocking IO on a volume, I'd expect that node to show as unhealthy or unavailable in gluster peer status or gluster volume status.

Client gluster errors:
staging_static in this example is a volume name. You can see the client trying to connect to the second and third nodes of the gluster cluster and failing (unsure as to why?). The server-side logs on the first gluster node do not show any errors or problems, but the second / third nodes show errors in glusterd.log when trying to 'unlock' the 0-management volume on the first node.

On a gluster client (a kubernetes host using the kubernetes connector, which uses the native FUSE client) when it is blocked from writing but the gluster appears healthy (other than the errors mentioned later):

[2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 1800 for :49154
[2018-09-02 15:33:22.750989] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
[2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout = 1800 for :49154
[2018-09-02 16:03:23.097988] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
[2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 1800 for :49154
[2018-09-02 16:33:23.439282] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: remote operation failed [Transport endpoint is not connected]
[2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 16:33:23.455171. timeout = 1800 for :49154
[2018-09-02 17:03:23.786971] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: remote operation failed [Transport endpoint is not connected]
[2018-09-02 17:33:24.160607] E
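The repeating call_bail pattern in logs like the ones above can be summarised per client translator with a quick grep. This is a minimal sketch: /tmp/fuse-client.log and the excerpt in it are stand-ins for the real client log under /var/log/glusterfs/ on the kubernetes host.

```shell
# Count bailed-out INODELK frames per client translator in a FUSE client log.
# The sample file below is a made-up stand-in for the real log.
cat > /tmp/fuse-client.log <<'EOF'
[2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29))
[2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29))
[2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) op(INODELK(29))
EOF
grep call_bail /tmp/fuse-client.log \
  | grep -o '0-staging_static-client-[0-9]*' \
  | sort | uniq -c
# prints a count per translator, e.g. 1 for client-1 and 2 for client-2
```

A count that only ever grows for the same one or two translators would point at which bricks' locks are never being granted.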
[Gluster-users] /var/run/glusterd.socket permissions for non-root geo-replication (4.1.3)
Hi all,

We're investigating geo-replication and noticed that when using non-root geo-replication, the sync user cannot access various gluster commands. For example, one of the session commands ends up running this on the slave:

Popen: command returned error cmd=/usr/sbin/gluster --remote-host=localhost system:: mount geosync user-map-root=geosync aux-gfid-mount acl log-level=INFO log-file=/var/log/glusterfs/geo-replication-slaves/snip/snip.log volfile-server=localhost volfile-id=shared client-pid=-1 error=1
Popen: /usr/sbin/gluster> 2 : failed with this errno (No such file or directory)

The underlying cause is that the gluster command cannot write to the socket file /var/run/glusterd.socket. If I change the group to my geo-replication group and add group write, the command succeeds and geo-replication becomes active. The problem is that every time the server/service restarts, the socket comes back up as root:root:

srwxr-xr-x. 1 root root 0 Sep 3 02:17 /var/run/glusterd.socket

So a couple of questions:
1) Should the geo-replication non-root user be able to do what it needs without changing those permissions?
2) If it does need write permission, is there a config option to tell the service to set the correct permissions on the file when it starts, so that the non-root user can write to it?

Thanks,
Andy

___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
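One way to keep the manual chgrp/chmod workaround described above from being lost on restart is a systemd drop-in that re-applies the permissions after glusterd starts. This is only a sketch of that workaround, not an official glusterd option; 'geosync' is the non-root geo-replication group from the example.

```shell
# Workaround sketch: re-apply socket group/mode after each glusterd start.
# 'geosync' is the geo-replication group from the example above; this is
# an assumption/workaround, not a documented glusterd configuration knob.
mkdir -p /etc/systemd/system/glusterd.service.d
cat > /etc/systemd/system/glusterd.service.d/socket-perms.conf <<'EOF'
[Service]
ExecStartPost=/bin/chgrp geosync /var/run/glusterd.socket
ExecStartPost=/bin/chmod g+w /var/run/glusterd.socket
EOF
systemctl daemon-reload
```

Whether the non-root user should need socket write access at all (question 1) is better answered by the official mountbroker setup for non-root geo-replication.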
Re: [Gluster-users] blocking process on FUSE mount in directory which is using quota
Hello,

I wanted to report that this morning I had a similar issue on another server, where a few PHP-FPM processes got blocked on a different GlusterFS volume mounted through a FUSE mount. This GlusterFS volume has no quota enabled, so it might not be quota related after all.

Here is the Linux kernel stack trace:

[Sun Sep 2 06:47:47 2018] INFO: task php5-fpm:25880 blocked for more than 120 seconds.
[Sun Sep 2 06:47:47 2018] Not tainted 3.16.0-4-amd64 #1
[Sun Sep 2 06:47:47 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sun Sep 2 06:47:47 2018] php5-fpm D 88017ee12f40 0 25880 1 0x0004
[Sun Sep 2 06:47:47 2018] 880101688b60 0282 00012f40 880059ca3fd8
[Sun Sep 2 06:47:47 2018] 00012f40 880101688b60 8801093b51b0 8801067ec800
[Sun Sep 2 06:47:47 2018] 880059ca3cc0 8801093b5290 8801093b51b0 880059ca3e80
[Sun Sep 2 06:47:47 2018] Call Trace:
[Sun Sep 2 06:47:47 2018] [] ? __fuse_request_send+0xbd/0x270 [fuse]
[Sun Sep 2 06:47:47 2018] [] ? prepare_to_wait_event+0xf0/0xf0
[Sun Sep 2 06:47:47 2018] [] ? fuse_send_write+0xd0/0x100 [fuse]
[Sun Sep 2 06:47:47 2018] [] ? fuse_perform_write+0x26f/0x4b0 [fuse]
[Sun Sep 2 06:47:47 2018] [] ? fuse_file_write_iter+0x1dd/0x2b0 [fuse]
[Sun Sep 2 06:47:47 2018] [] ? new_sync_write+0x74/0xa0
[Sun Sep 2 06:47:47 2018] [] ? vfs_write+0xb2/0x1f0
[Sun Sep 2 06:47:47 2018] [] ? vfs_read+0xed/0x170
[Sun Sep 2 06:47:47 2018] [] ? SyS_write+0x42/0xa0
[Sun Sep 2 06:47:47 2018] [] ? SyS_lseek+0x7e/0xa0
[Sun Sep 2 06:47:47 2018] [] ? system_call_fast_compare_end+0x10/0x15

Did anyone already have time to look at the statedump file I sent around 3 weeks ago? I never saw this type of problem in the past; it started to appear since I upgraded to GlusterFS 3.12.12.

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On August 15, 2018 9:21 AM, mabi wrote:

> Great, you will then find attached here the statedump of the client using the FUSE glusterfs mount right after two processes have blocked.
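Tasks hung like the php5-fpm ones in the trace above sit in uninterruptible sleep (state D). A quick, generic way to spot any such processes on a Linux box (standard procps ps, nothing Gluster-specific):

```shell
# List processes currently in uninterruptible sleep (state "D", possibly
# shown as "D+" etc.), the state the hung php5-fpm tasks above are stuck in.
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ {print $2, $3}'
```

On a healthy box this usually prints nothing; a process that stays in the list across several runs is stuck on I/O (here, on the FUSE mount) and will trigger the 120-second hung-task warning.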
>
> Two notes here regarding the "path=" entries in this statedump file:
> - I have renamed all the "path=" entries which have the problematic directory to "path=PROBLEMATIC_DIRECTORY_HERE".
> - All the other "path=" entries I have renamed to "path=REMOVED_FOR_PRIVACY".
>
> Note also that, funnily enough, the number of "path=" entries for that problematic directory sums up to exactly 5000. Coincidence, or a hint to the problem maybe?
>
> ‐‐‐ Original Message ‐‐‐
> On August 15, 2018 5:21 AM, Raghavendra Gowdappa wrote:
>
>> On Tue, Aug 14, 2018 at 7:23 PM, mabi wrote:
>>
>>> Bad news: the blocked process happened again, this time with another directory of another user which is NOT over his quota but which also has quota enabled.
>>>
>>> The symptoms on the Linux side are the same:
>>>
>>> [Tue Aug 14 15:30:33 2018] INFO: task php5-fpm:14773 blocked for more than 120 seconds.
>>> [Tue Aug 14 15:30:33 2018] Not tainted 3.16.0-4-amd64 #1
>>> [Tue Aug 14 15:30:33 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [Tue Aug 14 15:30:33 2018] php5-fpm D 8801fea13200 0 14773 729 0x
>>> [Tue Aug 14 15:30:33 2018] 880100bbe0d0 0282 00013200 880129bcffd8
>>> [Tue Aug 14 15:30:33 2018] 00013200 880100bbe0d0 880153ed0d68 880129bcfee0
>>> [Tue Aug 14 15:30:33 2018] 880153ed0d6c 880100bbe0d0 880153ed0d70
>>> [Tue Aug 14 15:30:33 2018] Call Trace:
>>> [Tue Aug 14 15:30:33 2018] [] ? schedule_preempt_disabled+0x25/0x70
>>> [Tue Aug 14 15:30:33 2018] [] ? __mutex_lock_slowpath+0xd3/0x1d0
>>> [Tue Aug 14 15:30:33 2018] [] ? write_inode_now+0x93/0xc0
>>> [Tue Aug 14 15:30:33 2018] [] ? mutex_lock+0x1b/0x2a
>>> [Tue Aug 14 15:30:33 2018] [] ? fuse_flush+0x8f/0x1e0 [fuse]
>>> [Tue Aug 14 15:30:33 2018] [] ? vfs_read+0x93/0x170
>>> [Tue Aug 14 15:30:33 2018] [] ? filp_close+0x2a/0x70
>>> [Tue Aug 14 15:30:33 2018] [] ? SyS_close+0x1f/0x50
>>> [Tue Aug 14 15:30:33 2018] [] ? system_call_fast_compare_end+0x10/0x15
>>>
>>> and if I check this process it has state "D", which is "D = uninterruptible sleep".
>>>
>>> Now I also managed to take a statedump file as recommended, but I see in its content, under "[io-cache.inode]", a "path=" which I would need to remove as it contains filenames, for privacy reasons. Can I remove every "path=" line and still send you the statedump file for analysis?
>>
>> Yes. Removing path is fine and the statedumps will still be useful for debugging the issue.
>>
>>> Thank you.
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On August 14, 2018 10:48 AM, Nithya Balachandran wrote:
>>>
>>> Thanks for letting us know. Sanoj, can you take a look at this?
>>> Thanks,
>>> Nithya
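The "path=" redaction described in this thread is easy to script. A minimal sketch, assuming the statedump is a plain text file with one path= entry per line (the sample file below is made up for illustration):

```shell
# Redact every path= line in a statedump before sharing it for analysis.
# /tmp/statedump.sample is a made-up stand-in for the real statedump file.
cat > /tmp/statedump.sample <<'EOF'
[io-cache.inode]
path=/data/users/alice/private.txt
size=4096
path=/data/users/bob/photo.jpg
EOF
# GNU sed in-place edit: replace the whole line after "path=".
sed -i 's|^path=.*|path=REMOVED_FOR_PRIVACY|' /tmp/statedump.sample
grep -c '^path=' /tmp/statedump.sample   # → 2 (same line count, now anonymised)
```

Keeping the number of path= lines intact (rather than deleting them) preserves the per-inode counts, such as the suspicious 5000 entries mentioned above.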