Re: [Gluster-users] Transport endpoint is not connected : issue

2018-09-03 Thread Karthik Subrahmanya
On Mon, Sep 3, 2018 at 11:17 AM Karthik Subrahmanya 
wrote:

> Hey,
>
> We need some more information to debug this.
> I think you missed sending the output of 'gluster volume info '.
> Can you also provide the brick, shd and glfsheal logs as well?
> How many peers are present in the setup? You also mentioned that "one of
> the file servers have two processes for each of the volumes instead of one
> per volume", which processes are you referring to here?
>
Also provide the "ps aux | grep gluster" output.
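
For completeness, the information requested above can usually be gathered along these lines (a hedged sketch; the exact log file names vary with the volume names in your setup):

    # Volume layout and peer overview
    gluster volume info
    gluster pool list

    # Brick, self-heal daemon (shd) and glfsheal logs normally live under /var/log/glusterfs/
    ls /var/log/glusterfs/bricks/
    ls /var/log/glusterfs/glustershd.log /var/log/glusterfs/glfsheal-*.log

    # Check whether any brick really has two glusterfsd processes
    ps aux | grep gluster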

>
> Regards,
> Karthik
>
> On Sat, Sep 1, 2018 at 12:10 AM Johnson, Tim  wrote:
>
>> Thanks for the reply.
>>
>>
>>
>>I have attached the gluster.log file from the host that it is
>> happening to at this time.
>>
>> It does change which host it does this on.
>>
>>
>>
>> Thanks.
>>
>>
>>
>> *From: *Atin Mukherjee 
>> *Date: *Friday, August 31, 2018 at 1:03 PM
>> *To: *"Johnson, Tim" 
>> *Cc: *Karthik Subrahmanya , Ravishankar N <
>> ravishan...@redhat.com>, "gluster-users@gluster.org" <
>> gluster-users@gluster.org>
>> *Subject: *Re: [Gluster-users] Transport endpoint is not connected :
>> issue
>>
>>
>>
>> Can you please pass along all the gluster log files from the server where
>> the “transport endpoint is not connected” error is reported? Since
>> restarting glusterd didn’t solve this, I believe this isn’t a stale port
>> problem but something else. Also please provide the output of ‘gluster v
>> info ’
>>
>>
>>
>> (@cc Ravi, Karthik)
>>
>>
>>
>> On Fri, 31 Aug 2018 at 23:24, Johnson, Tim  wrote:
>>
>> Hello all,
>>
>>
>>
>>   We have gluster replicate (with arbiter) volumes on which we are
>> getting “Transport endpoint is not connected” on a rotating basis from
>> each of the two file servers and from a third host that holds the arbiter
>> bricks.
>>
>> This happens when trying to run a heal on all the volumes on the gluster
>> hosts. When I get the status of all the volumes, everything looks good.
>>
>> This behavior seems to foreshadow the gluster volumes becoming
>> unresponsive to our VM cluster; in addition, one of the file servers has
>> two processes for each of the volumes instead of one per volume.
>> Eventually the affected file server will drop off the listed peers.
>>
>> Restarting glusterd/glusterfsd on the affected file server does not take
>> care of the issue; we have to bring down both file servers because the
>> volumes are no longer seen by the VM cluster after the errors start
>> occurring. I had seen bug reports about “Transport endpoint is not
>> connected” on earlier versions of Gluster, but thought it had been
>> addressed.
>>
>>  Dmesg did have some entries for “a possible syn flood on port *”, so we
>> set the sysctl “net.ipv4.tcp_max_syn_backlog = 2048”, which seemed to quiet
>> the syn flood messages but did not resolve the underlying volume issues.
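
For reference, the backlog change Tim describes would typically be applied roughly like this (a hedged sketch of the sysctl tuning only; the drop-in file name is an arbitrary example, and this does not address the volume problem itself):

    # Persist the larger SYN backlog and load it
    echo 'net.ipv4.tcp_max_syn_backlog = 2048' > /etc/sysctl.d/99-syn-backlog.conf
    sysctl -p /etc/sysctl.d/99-syn-backlog.conf

    # Confirm the running value
    sysctl net.ipv4.tcp_max_syn_backlog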
>>
>> I have put the versions of all the Gluster packages installed below
>> as well as the   “Heal” and “Status” commands showing the volumes are
>>
>>
>>
>> This has just started happening, but I cannot definitively say whether it
>> started occurring after an update or not.
>>
>>
>>
>>
>>
>> Thanks for any assistance.
>>
>>
>>
>>
>>
>> Running Heal  :
>>
>>
>>
>> gluster volume heal ovirt_engine info
>>
>> Brick 1.rrc.local:/bricks/brick0/ovirt_engine
>>
>> Status: Connected
>>
>> Number of entries: 0
>>
>>
>>
>> Brick 3.rrc.local:/bricks/brick0/ovirt_engine
>>
>> Status: Transport endpoint is not connected
>>
>> Number of entries: -
>>
>>
>>
>> Brick *3.rrc.local:/bricks/arb-brick/ovirt_engine
>>
>> Status: Transport endpoint is not connected
>>
>> Number of entries: -
>>
>>
>>
>>
>>
>> Running status :
>>
>>
>>
>> gluster volume status ovirt_engine
>>
>> Status of volume: ovirt_engine
>>
>> Gluster process                                         TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick *.rrc.local:/bricks/brick0/ovirt_engine           49152     0          Y       5521
>> Brick fs2-tier3.rrc.local:/bricks/brick0/ovirt_engine   49152     0          Y       6245
>> Brick .rrc.local:/bricks/arb-brick/ovirt_engine         49152     0          Y       3526
>> Self-heal Daemon on localhost                           N/A       N/A        Y       5509
>> Self-heal Daemon on ***.rrc.local                       N/A       N/A        Y       6218
>> Self-heal Daemon on ***.rrc.local                       N/A       N/A        Y       3501
>> Self-heal Daemon on .rrc.local                          N/A       N/A        Y       3657
>> Self-heal Daemon on *.rrc.local                         N/A       N/A        Y       3753
>> Self-heal Daemon on .rrc.local                          N/A       N/A        Y       17284
>>
>> Task Status of Volume ovirt_engine
>> ------------------------------------------------------------------------------
>> There are no active volume 
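
One way to check the stale-port theory Atin raised is to compare the port glusterd advertises for the disconnected brick with what that host is actually listening on (a hedged sketch; port 49152 is taken from the status output above):

    # Port advertised for each brick
    gluster volume status ovirt_engine

    # On the host whose brick shows "Transport endpoint is not connected":
    ss -tlnp | grep 49152                           # is anything listening on the advertised port?
    ps aux | grep glusterfsd | grep ovirt_engine    # one brick process expected per brick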

Re: [Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing ou

2018-09-03 Thread Sam McLeod
I apologise for this being posted twice - I'm not sure if that was user error
or a bug in the mailing list, but the list wasn't showing my post after quite
some time, so I sent a second email, which showed up almost immediately - that's
mailing lists, I guess...

Anyway, if anyone has any input, advice or even abuse, I'd welcome it!

--
Sam McLeod
https://smcleod.net
https://twitter.com/s_mcleod

> On 3 Sep 2018, at 1:20 pm, Sam McLeod  wrote:
> 
> We've got an odd problem where clients are blocked from writing to Gluster 
> volumes until the first node of the Gluster cluster is rebooted.
> 
> I suspect I've either configured something incorrectly with the arbiter / 
> replica configuration of the volumes, or there is some sort of bug in the 
> gluster client-server connection that we're triggering.
> 
> I was wondering if anyone has seen this or could point me in the right 
> direction?
> 
> 
> Environment:
> Topology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only).
> Version: Client and Servers both running 4.1.3, both on CentOS 7, kernel 
> 4.18.x, (Xen) VMs with relatively fast networked SSD storage backing them, 
> XFS.
> Client: Native Gluster FUSE client mounting via the kubernetes provider
> 
> Problem:
> Seemingly at random, some clients will be blocked / unable to write to what 
> should be a highly available gluster volume.
> The client gluster logs show it failing to perform new file operations across 
> various volumes and all three nodes of the gluster cluster.
> The server gluster (or OS) logs do not show any warnings or errors.
> The client recovers and is able to write to volumes again after the first 
> node of the gluster cluster is rebooted.
> Until the first node of the gluster cluster is rebooted, the client fails to 
> write to the volume that is (or should be) available on the second node (a 
> replica) and third node (an arbiter only node).
> 
> What 'fixes' the issue:
> Although the clients (kubernetes hosts) connect to all 3 nodes of the Gluster 
> cluster - restarting the first gluster node always unblocks the IO and allows 
> the client to continue writing.
> Stopping and starting the glusterd service on the gluster server is not 
> enough to fix the issue, nor is restarting its networking.
> This suggests to me that the volume is unavailable for writing for some 
> reason, and that restarting the first node in the cluster clears some sort of 
> stale TCP session state, either between client and server or between the 
> servers for replication.
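
When a client hangs like this while the servers look healthy, it can help to confirm which brick connections each side still believes it holds (a hedged sketch; staging_static is the volume named later in this mail):

    # On any gluster server: per-brick list of connected clients
    gluster volume status staging_static clients

    # On the blocked client: are the TCP sessions to the brick ports still established?
    ss -tnp | grep gluster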
> 
> Expected behaviour:
> 
> If the first gluster node / server had failed or was blocked from performing 
> operations for some reason (which it doesn't seem to be), I'd expect the 
> clients to access data from the second gluster node and write metadata to the 
> third gluster node as well, since it's an arbiter / metadata-only node.
> If for some reason a gluster node was not able to serve connections to 
> clients, I'd expect to see errors in the volume, glusterd or brick log files 
> (there are none on the first gluster node).
> If the first gluster node was for some reason blocking IO on a volume, I'd 
> expect that node either to show as unhealthy or unavailable in the gluster 
> peer status or gluster volume status.
> 
> 
> Client gluster errors:
> 
> staging_static in this example is a volume name.
> You can see the client trying to connect to the second and third nodes of the 
> gluster cluster and failing (unsure as to why).
> The server-side logs on the first gluster node do not show any errors or 
> problems, but the second / third nodes show errors in glusterd.log when 
> trying to 'unlock' the 0-management volume on the first node.
> 
> 
> On a gluster client (a kubernetes host using the kubernetes connector which 
> uses the native fuse client) when its blocked from writing but the gluster 
> appears healthy (other than the errors mentioned later):
> 
> [2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 
> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) 
> op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 
> 1800 for :49154
> [2018-09-02 15:33:22.750989] E [MSGID: 114031] 
> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: 
> remote operation failed [Transport endpoint is not connected]
> [2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 
> 0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) 
> op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout = 
> 1800 for :49154
> [2018-09-02 16:03:23.097988] E [MSGID: 114031] 
> [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: 
> remote operation failed [Transport endpoint is not connected]
> [2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 
> 0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) 
> op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 
> 1800 for :49154
> [2018-09-02 
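
The call_bail entries above are INODELK (inode lock) requests timing out after the 1800-second frame timeout, which usually points at a lock that is held on a brick and never released. A statedump of the volume can show which locks are granted or blocked (a hedged sketch; /var/run/gluster is the usual default dump location and may differ on some builds):

    # Ask the bricks of the affected volume to dump their state
    gluster volume statedump staging_static

    # Dumps normally land under /var/run/gluster on each brick host
    grep -A2 inodelk /var/run/gluster/*dump*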

Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

2018-09-03 Thread Kotresh Hiremath Ravishankar
Hi Krishna,

I see no errors in the shared logs. The only error messages I see are during
geo-rep stop, and that is expected.
Could you share the steps you used to create the geo-rep setup?

Thanks,
Kotresh HR
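
For reference, creating such a session normally follows roughly this sequence (a hedged outline using the volume and host names that appear later in this thread; prerequisites such as passwordless SSH to the slave host are assumed):

    # On a master node, after SSH keys to the slave host are in place
    gluster system:: execute gsec_create
    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist create push-pem
    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist start
    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status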

On Mon, Sep 3, 2018 at 1:02 PM, Krishna Verma  wrote:

> Hi Kotesh,
>
>
>
> Below is the output of the gsyncd.log file generated on my master server.
>
>
>
> And I am using version 4.1.3 on all my gluster nodes.
>
> [root@gluster-poc-noida distvol]# gluster --version | grep glusterfs
>
> glusterfs 4.1.3
>
>
>
>
>
> [root@gluster-poc-noida distvol]# cat /var/log/glusterfs/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.log
>
> [2018-09-03 04:01:52.424609] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 04:01:52.526323] I [gsyncd(status):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:55:41.326411] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:55:49.676120] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:55:50.406042] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:56:52.847537] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:57:03.778448] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:57:25.86958] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:57:25.855273] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:58:09.294239] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:59:39.255487] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 06:59:39.355753] I [gsyncd(status):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:00:26.311767] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:03:29.205226] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:03:30.131258] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:10:34.679677] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:10:35.653928] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:26:24.438854] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:26:25.495117] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.159113] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.216475] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.932451] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_gluster-poc-sj_glusterdist/gsyncd.conf
> [2018-09-03 07:27:26.988286] I [gsyncd(config-get):297:main] : Using session config file  path=/var/lib/glusterd/geo-replication/glusterdist_

Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

2018-09-03 Thread Kotresh Hiremath Ravishankar
Hi Krishna,

The log is not complete. If you are re-trying, could you please try it out
on 4.1.3 and share the logs.

Thanks,
Kotresh HR
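
For reference, the geo-replication logs being asked for are usually found in these locations (a hedged sketch; the master-side session directory matches the gsyncd.log path quoted earlier in this thread):

    # Master side: per-session logs
    ls /var/log/glusterfs/geo-replication/glusterdist_gluster-poc-sj_glusterdist/

    # Slave side: the corresponding slave logs
    ls /var/log/glusterfs/geo-replication-slaves/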

On Mon, Sep 3, 2018 at 12:42 PM, Krishna Verma  wrote:

> Hi Kotresh,
>
>
>
> Please find the log files attached.
>
>
>
> Request you to please have a look.
>
>
>
> /Krishna
>
>
>
>
>
>
>
> *From:* Kotresh Hiremath Ravishankar 
> *Sent:* Monday, September 3, 2018 10:19 AM
>
> *To:* Krishna Verma 
> *Cc:* Sunny Kumar ; Gluster Users <
> gluster-users@gluster.org>
> *Subject:* Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not
> work
>
>
>
> Hi Krishna,
>
> Indexing is the feature used by the hybrid crawl, which only makes the crawl
> faster. It has nothing to do with the missing data sync.
>
> Could you please share the complete log file of the session where the
> issue is encountered?
>
> Thanks,
>
> Kotresh HR
>
>
>
> On Mon, Sep 3, 2018 at 9:33 AM, Krishna Verma  wrote:
>
> Hi Kotresh/Support,
>
>
>
> Requesting your help to get this fixed. My slave is not getting synced with the
> master. Only when I restart the session after turning indexing off does the file
> show up at the slave, and even then it is blank, with zero size.
>
>
>
> At master: file size is 5.8 GB.
>
>
>
> [root@gluster-poc-noida distvol]# du -sh 17.10.v001.20171023-201021_
> 17020_GPLV3.tar.gz
>
> 5.8G17.10.v001.20171023-201021_17020_GPLV3.tar.gz
>
> [root@gluster-poc-noida distvol]#
>
>
>
> But at the slave, after turning indexing off, restarting the session, and
> then waiting for 2 days, it shows only 4.9 GB copied.
>
>
>
> [root@gluster-poc-sj distvol]# du -sh 17.10.v001.20171023-201021_
> 17020_GPLV3.tar.gz
>
> 4.9G17.10.v001.20171023-201021_17020_GPLV3.tar.gz
>
> [root@gluster-poc-sj distvol]#
>
>
>
> Similarly, I tested with a small file of only 1.2 GB; it is still showing a
> size of 0 at the slave after days of waiting.
>
>
>
> At Master:
>
>
>
> [root@gluster-poc-noida distvol]# du -sh rflowTestInt18.08-b001.t.Z
>
> 1.2GrflowTestInt18.08-b001.t.Z
>
> [root@gluster-poc-noida distvol]#
>
>
>
> At Slave:
>
>
>
> [root@gluster-poc-sj distvol]# du -sh rflowTestInt18.08-b001.t.Z
>
> 0   rflowTestInt18.08-b001.t.Z
>
> [root@gluster-poc-sj distvol]#
>
>
>
> Below is my distributed volume info :
>
>
>
> [root@gluster-poc-noida distvol]# gluster volume info glusterdist
>
>
>
> Volume Name: glusterdist
>
> Type: Distribute
>
> Volume ID: af5b2915-7170-4b5e-aee8-7e68757b9bf1
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 2
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: gluster-poc-noida:/data/gluster-dist/distvol
>
> Brick2: noi-poc-gluster:/data/gluster-dist/distvol
>
> Options Reconfigured:
>
> changelog.changelog: on
>
> geo-replication.ignore-pid-check: on
>
> geo-replication.indexing: on
>
> transport.address-family: inet
>
> nfs.disable: on
>
> [root@gluster-poc-noida distvol]#
>
>
>
> Please help to fix this; I believe this is not normal behavior for the gluster rsync.
>
>
>
> /Krishna
>
> *From:* Krishna Verma
> *Sent:* Friday, August 31, 2018 12:42 PM
> *To:* 'Kotresh Hiremath Ravishankar' 
> *Cc:* Sunny Kumar ; Gluster Users <
> gluster-users@gluster.org>
> *Subject:* RE: [Gluster-users] Upgrade to 4.1.2 geo-replication does not
> work
>
>
>
> Hi Kotresh,
>
>
>
> I have tested geo-replication over distributed volumes with a 2x2 gluster
> setup.
>
>
>
> [root@gluster-poc-noida ~]# gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status
>
> MASTER NODE        MASTER VOL   MASTER BRICK                SLAVE USER  SLAVE                        SLAVE NODE       STATUS  CRAWL STATUS     LAST_SYNCED
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> gluster-poc-noida  glusterdist  /data/gluster-dist/distvol  root        gluster-poc-sj::glusterdist  gluster-poc-sj   Active  Changelog Crawl  2018-08-31 10:28:19
> noi-poc-gluster    glusterdist  /data/gluster-dist/distvol  root        gluster-poc-sj::glusterdist  gluster-poc-sj2  Active  History Crawl    N/A
>
> [root@gluster-poc-noida ~]#
>
>
>
> Now at the client I copied an 848 MB file from the local disk to the master
> mounted volume, and it took only 1 minute and 15 seconds. That's great….
>
>
>
> But even after waiting for 2 hrs I was unable to see that file at the slave
> site. Then I again disabled indexing by doing “gluster volume set
> glusterdist indexing off” and restarted the session. Magically, I received
> the file at the slave instantly after doing this.
>
>
>
> Why do I need to do “indexing off” every time for data to appear at the slave
> site? Is there any fix/workaround for this?
>
>
>
> /Krishna
>
>
>
>
>
> *From:* Kotresh Hiremath Ravishankar 
> *Sent:* Friday, August 31, 2018 10:10 AM
> *To:* Krishna Verma 
> *Cc:* Sunny Kumar ; Gluster Users <
> gluster-users@gluster.org>
> *Subject:* Re: [Gluster-users] Upgrade to 4.1.2 geo-replication