Re: [Gluster-users] Transport endpoint is not connected : issue
On Mon, Sep 3, 2018 at 11:17 AM Karthik Subrahmanya wrote:

> Hey,
>
> We need some more information to debug this.
> I think you missed sending the output of 'gluster volume info'.
> Can you also provide the brick, shd and glfsheal logs as well?
> How many peers are present in the setup? You also mentioned that "one of
> the file servers has two processes for each of the volumes instead of one
> per volume"; which processes are you talking about here?
> Also provide the "ps aux | grep gluster" output.
>
> Regards,
> Karthik
>
> On Sat, Sep 1, 2018 at 12:10 AM Johnson, Tim wrote:
>
>> Thanks for the reply.
>>
>> I have attached the gluster.log file from the host it is currently
>> happening on. The host it happens on does change.
>>
>> Thanks.
>>
>> From: Atin Mukherjee
>> Date: Friday, August 31, 2018 at 1:03 PM
>> To: "Johnson, Tim"
>> Cc: Karthik Subrahmanya, Ravishankar N <ravishan...@redhat.com>,
>> "gluster-users@gluster.org" <gluster-users@gluster.org>
>> Subject: Re: [Gluster-users] Transport endpoint is not connected : issue
>>
>> Can you please pass along all the gluster log files from the server where
>> the "transport endpoint is not connected" error is reported? As restarting
>> glusterd didn't solve this issue, I believe this isn't a stale port problem
>> but something else. Also please provide the output of 'gluster v info'.
>>
>> (@cc Ravi, Karthik)
>>
>> On Fri, 31 Aug 2018 at 23:24, Johnson, Tim wrote:
>>
>> Hello all,
>>
>> We have gluster replicate (with arbiter) volumes that are returning
>> "Transport endpoint is not connected" on a rotating basis from each of the
>> two file servers, and from a third host that holds the arbiter bricks.
>> This happens when trying to run a heal on all the volumes on the gluster
>> hosts. When I get the status of all the volumes, everything looks good.
>>
>> This behavior seems to be a foreshadowing of the gluster volumes becoming
>> unresponsive to our VM cluster. In addition, one of the file servers has
>> two processes for each of the volumes instead of one per volume.
>> Eventually the affected file server will drop off the list of peers.
>> Restarting glusterd/glusterfsd on the affected file server does not take
>> care of the issue; we have to bring down both file servers because the
>> volumes are not seen by the VM cluster after the errors start occurring.
>> I had seen bug reports about "Transport endpoint is not connected" on
>> earlier versions of Gluster, but thought it had been addressed.
>>
>> Dmesg did have some entries for "a possible SYN flood on port *", so we
>> set "net.ipv4.tcp_max_syn_backlog = 2048" via sysctl, which seemed to
>> quiet the SYN flood messages but not the underlying volume issues.
>>
>> I have put the versions of all the Gluster packages installed below, as
>> well as the "Heal" and "Status" output showing the state of the volumes.
>>
>> This has just started happening, but I cannot definitively say whether it
>> started occurring after an update or not.
>>
>> Thanks for any assistance.
>>
>> Running Heal:
>>
>> gluster volume heal ovirt_engine info
>> Brick 1.rrc.local:/bricks/brick0/ovirt_engine
>> Status: Connected
>> Number of entries: 0
>>
>> Brick 3.rrc.local:/bricks/brick0/ovirt_engine
>> Status: Transport endpoint is not connected
>> Number of entries: -
>>
>> Brick *3.rrc.local:/bricks/arb-brick/ovirt_engine
>> Status: Transport endpoint is not connected
>> Number of entries: -
>>
>> Running status:
>>
>> gluster volume status ovirt_engine
>> Status of volume: ovirt_engine
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick *.rrc.local:/bricks/brick0/ov
>> irt_engine                                  49152     0          Y       5521
>> Brick fs2-tier3.rrc.local:/bricks/brick0/ov
>> irt_engine                                  49152     0          Y       6245
>> Brick *.rrc.local:/bricks/arb-b
>> rick/ovirt_engine                           49152     0          Y       3526
>> Self-heal Daemon on localhost               N/A       N/A        Y       5509
>> Self-heal Daemon on ***.rrc.local           N/A       N/A        Y       6218
>> Self-heal Daemon on ***.rrc.local           N/A       N/A        Y       3501
>> Self-heal Daemon on .rrc.local              N/A       N/A        Y       3657
>> Self-heal Daemon on *.rrc.local             N/A       N/A        Y       3753
>> Self-heal Daemon on .rrc.local              N/A       N/A        Y       17284
>>
>> Task Status of Volume ovirt_engine
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
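
For reference, the diagnostics requested in this thread can be gathered
roughly as follows. This is a sketch assuming default RPM log locations;
<volname> is a placeholder:

    # Volume definition and peer view
    gluster volume info
    gluster pool list

    # One glusterfsd process per brick is expected; duplicates would show here
    ps aux | grep gluster

    # Logs requested: bricks, self-heal daemon (shd), and glfsheal
    ls /var/log/glusterfs/bricks/
    ls /var/log/glusterfs/glustershd.log
    ls /var/log/glusterfs/glfsheal-<volname>.log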
Re: [Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out
I apologise for this being posted twice - I'm not sure if that was user error or a bug in the mailing list, but the list wasn't showing my post after quite some time, so I sent a second email, which near-immediately showed up. That's mailing lists, I guess... Anyway, if anyone has any input, advice or abuse, I welcome it!

--
Sam McLeod
https://smcleod.net
https://twitter.com/s_mcleod

> On 3 Sep 2018, at 1:20 pm, Sam McLeod wrote:
>
> We've got an odd problem where clients are blocked from writing to Gluster
> volumes until the first node of the Gluster cluster is rebooted.
>
> I suspect I've either configured something incorrectly with the arbiter /
> replica configuration of the volumes, or there is some sort of bug in the
> gluster client-server connection that we're triggering.
>
> I was wondering if anyone has seen this or could point me in the right
> direction?
>
> Environment:
> Topology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only).
> Version: Client and servers both running 4.1.3, both on CentOS 7, kernel
> 4.18.x, (Xen) VMs with relatively fast networked SSD storage backing them,
> XFS.
> Client: Native Gluster FUSE client mounting via the Kubernetes provider.
>
> Problem:
> Seemingly at random, some clients are blocked / unable to write to what
> should be a highly available gluster volume.
> The client gluster logs show it failing to do new file operations across
> various volumes and all three nodes of the cluster.
> The server gluster (or OS) logs do not show any warnings or errors.
> The client recovers and is able to write to volumes again after the first
> node of the gluster cluster is rebooted.
> Until the first node of the gluster cluster is rebooted, the client fails
> to write to the volume that is (or should be) available on the second node
> (a replica) and third node (an arbiter-only node).
>
> What 'fixes' the issue:
> Although the clients (Kubernetes hosts) connect to all 3 nodes of the
> Gluster cluster, restarting the first gluster node always unblocks the IO
> and allows the client to continue writing.
> Stopping and starting the glusterd service on the gluster server is not
> enough to fix the issue, nor is restarting its networking.
> This suggests to me that the volume is unavailable for writing for some
> reason, and restarting the first node in the cluster clears some sort of
> TCP session between the client and server or between the server-server
> replication.
>
> Expected behaviour:
> If the first gluster node / server had failed or was blocked from
> performing operations for some reason (which it doesn't seem it is), I'd
> expect the clients to access data from the second gluster node and write
> metadata to the third gluster node as well, as it's an arbiter / metadata-
> only node.
> If for some reason a gluster node was not able to serve connections to
> clients, I'd expect to see errors in the volume, glusterd or brick log
> files (there are none on the first gluster node).
> If the first gluster node was for some reason blocking IO on a volume, I'd
> expect that node to show as unhealthy or unavailable in gluster peer status
> or gluster volume status.
>
> Client gluster errors:
> staging_static in this example is a volume name.
> You can see the client trying to connect to the second and third nodes of
> the gluster cluster and failing (unsure as to why?).
> The server-side logs on the first gluster node do not show any errors or
> problems, but the second / third nodes show errors in glusterd.log when
> trying to 'unlock' the 0-management volume on the first node.
>
> On a gluster client (a Kubernetes host using the Kubernetes connector,
> which uses the native FUSE client), when it's blocked from writing but the
> gluster appears healthy (other than the errors mentioned later):
>
> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail]
> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1)
> op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout =
> 1800 for :49154
> [2018-09-02 15:33:22.750989] E [MSGID: 114031]
> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2:
> remote operation failed [Transport endpoint is not connected]
> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail]
> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1)
> op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout =
> 1800 for :49154
> [2018-09-02 16:03:23.097988] E [MSGID: 114031]
> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1:
> remote operation failed [Transport endpoint is not connected]
> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail]
> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1)
> op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout =
> 1800 for :49154
> [2018-09-02 16:33:23.439
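
The repeated INODELK bail-outs above suggest a lock held server-side that is
never released; the 1800-second timeout in the bail messages matches the
default network.frame-timeout of 30 minutes. A general way to check for, and
if necessary clear, stale inode locks is a brick statedump plus the
clear-locks command. A sketch only, not a confirmed fix for this case; the
volume name is taken from the thread and the path is a placeholder:

    # Dump brick state; dump files land in /var/run/gluster by default
    gluster volume statedump staging_static

    # Look for granted/blocked inode locks in the dump files
    grep -A 5 inodelk /var/run/gluster/*.dump.*

    # If a lock is held by a client that no longer exists, it can be
    # cleared per path, e.g. blocked inode locks on the volume root:
    gluster volume clear-locks staging_static / kind blocked inode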
Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
Hi Krishna,

I see no errors in the shared logs. The only error messages I see are during
geo-rep stop, and those are expected. Could you share the steps you used to
create the geo-rep setup?

Thanks,
Kotresh HR

On Mon, Sep 3, 2018 at 1:02 PM, Krishna Verma wrote:

> Hi Kotresh,
>
> Below is the cat output of the gsyncd.log file generated on my master
> server. And I am using version 4.1.3 on all my gluster nodes.
>
> [root@gluster-poc-noida distvol]# gluster --version | grep glusterfs
> glusterfs 4.1.3
>
> [root@gluster-poc-noida distvol]# cat /var/log/glusterfs/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.log
> [2018-09-03 04:01:52.424609] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 04:01:52.526323] I [gsyncd(status):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:55:41.326411] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:55:49.676120] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:55:50.406042] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:56:52.847537] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:57:03.778448] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:57:25.86958] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:57:25.855273] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:58:09.294239] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:59:39.255487] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:59:39.355753] I [gsyncd(status):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:00:26.311767] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:03:29.205226] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:03:30.131258] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:10:34.679677] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:10:35.653928] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:26:24.438854] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:26:25.495117] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.159113] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.216475] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.932451] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.988286] I [gsyncd(config-get):297:main] : Using session config file path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
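
For reference, the complete session logs Kotresh is asking for live under the
<mastervol>_<slavehost>_<slavevol> directory pattern visible above. A sketch
of what to collect, assuming default paths:

    # Master side: full gsyncd log for the session
    cat /var/log/glusterfs/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.log

    # Slave side: the corresponding slave-end logs
    ls /var/log/glusterfs/geo-replication-slaves/

    # Session status and effective configuration
    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status detail
    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist config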
Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
Hi Krishna,

The log is not complete. If you are re-trying, could you please try it out on
4.1.3 and share the logs.

Thanks,
Kotresh HR

On Mon, Sep 3, 2018 at 12:42 PM, Krishna Verma wrote:

> Hi Kotresh,
>
> Please find the log files attached.
> Request you to please have a look.
>
> /Krishna
>
> From: Kotresh Hiremath Ravishankar
> Sent: Monday, September 3, 2018 10:19 AM
> To: Krishna Verma
> Cc: Sunny Kumar; Gluster Users <gluster-users@gluster.org>
> Subject: Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
>
> EXTERNAL MAIL
>
> Hi Krishna,
>
> Indexing is the feature used by the hybrid crawl, and it only makes the
> crawl faster. It has nothing to do with missing data on sync.
>
> Could you please share the complete log file of the session where the issue
> is encountered?
>
> Thanks,
> Kotresh HR
>
> On Mon, Sep 3, 2018 at 9:33 AM, Krishna Verma wrote:
>
> Hi Kotresh/Support,
>
> Requesting your help to get this fixed. My slave is not getting synced with
> the master. Only when I restart the session after turning indexing off does
> the file show up at the slave, and even then it is blank with zero size.
>
> At master, the file size is 5.8 GB:
>
> [root@gluster-poc-noida distvol]# du -sh 17.10.v001.20171023-201021_17020_GPLV3.tar.gz
> 5.8G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
>
> But at the slave, after doing "indexing off", restarting the session, and
> then waiting for 2 days, it shows only 4.9 GB copied:
>
> [root@gluster-poc-sj distvol]# du -sh 17.10.v001.20171023-201021_17020_GPLV3.tar.gz
> 4.9G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
>
> Similarly, I tested with a small file of only 1.2 GB, which is still showing
> "0" size at the slave after days of waiting.
>
> At master:
>
> [root@gluster-poc-noida distvol]# du -sh rflowTestInt18.08-b001.t.Z
> 1.2G    rflowTestInt18.08-b001.t.Z
>
> At slave:
>
> [root@gluster-poc-sj distvol]# du -sh rflowTestInt18.08-b001.t.Z
> 0       rflowTestInt18.08-b001.t.Z
>
> Below is my distributed volume info:
>
> [root@gluster-poc-noida distvol]# gluster volume info glusterdist
>
> Volume Name: glusterdist
> Type: Distribute
> Volume ID: af5b2915-7170-4b5e-aee8-7e68757b9bf1
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2
> Transport-type: tcp
> Bricks:
> Brick1: gluster-poc-noida:/data/gluster-dist/distvol
> Brick2: noi-poc-gluster:/data/gluster-dist/distvol
> Options Reconfigured:
> changelog.changelog: on
> geo-replication.ignore-pid-check: on
> geo-replication.indexing: on
> transport.address-family: inet
> nfs.disable: on
>
> Please help to fix this; I believe it is not normal gluster rsync behavior.
>
> /Krishna
>
> From: Krishna Verma
> Sent: Friday, August 31, 2018 12:42 PM
> To: 'Kotresh Hiremath Ravishankar'
> Cc: Sunny Kumar; Gluster Users <gluster-users@gluster.org>
> Subject: RE: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work
>
> Hi Kotresh,
>
> I have tested geo-replication over distributed volumes with a 2*2 gluster
> setup.
>
> [root@gluster-poc-noida ~]# gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status
>
> MASTER NODE          MASTER VOL     MASTER BRICK                  SLAVE USER    SLAVE                          SLAVE NODE         STATUS    CRAWL STATUS       LAST_SYNCED
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> gluster-poc-noida    glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj     Active    Changelog Crawl    2018-08-31 10:28:19
> noi-poc-gluster      glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj2    Active    History Crawl      N/A
> [root@gluster-poc-noida ~]#
>
> Now at the client I copied an 848 MB file from local disk to the master
> mounted volume, and it took only 1 minute and 15 seconds. That's great...
>
> But even after waiting for 2 hrs I was unable to see that file at the slave
> site. Then I again erased the indexing by doing "gluster volume set
> glusterdist indexing off" and restarted the session. Magically, I received
> the file at the slave instantly after doing this.
>
> Why do I need to do "indexing off" every time for data to show up at the
> slave site? Is there any fix or workaround for it?
>
> /Krishna
>
> From: Kotresh Hiremath Ravishankar
> Sent: Friday, August 31, 2018 10:10 AM
> To: Krishna Verma
> Cc: Sunny Kumar; Gluster Users <gluster-users@gluster.org>
> Subject: Re: [Gluster-users] Upgrade to 4.1.2 geo-replication
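
As a general aside for debugging partial syncs like the ones in this thread:
du sizes can be misleading while a transfer is in flight, so comparing
checksums on both mounts, and watching the crawl status while the file lands,
gives a more reliable picture. A sketch using the hostnames and volume from
this thread:

    # On the master mount: checksum the file
    md5sum 17.10.v001.20171023-201021_17020_GPLV3.tar.gz

    # On the slave mount: sizes can match before sync completes,
    # so compare the checksum instead of du output
    md5sum 17.10.v001.20171023-201021_17020_GPLV3.tar.gz

    # Watch the session; LAST_SYNCED advances as changelogs are processed
    watch -n 30 'gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status detail'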