Re: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does not work Now: Upgraded to 4.1.3 geo node Faulty

2018-09-02 Thread Kotresh Hiremath Ravishankar
Hi Marcus,

Geo-rep had a few important fixes in 4.1.3. Is it possible to upgrade and
check whether the issue is still seen?
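
If it helps, the installed version on each node and the state of the geo-rep
session can be confirmed with something like the following (volume and host
names below are placeholders for your session):

    rpm -qa | grep -i glusterfs
    gluster --version
    gluster volume geo-replication <mastervol> <slavehost>::<slavevol> status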

Thanks,
Kotresh HR

On Sat, Sep 1, 2018 at 5:08 PM, Marcus Pedersén 
wrote:

> Hi again,
>
> I found another problem on the other master node.
>
> The node toggles Active/Faulty and it is the same error over and over
> again.
>
>
> [2018-09-01 11:23:02.94080] E [repce(worker /urd-gds/gluster):197:__call__]
> RepceClient: call failed   call=1226:139955262510912:1535800981.24
> method=entry_ops   error=GsyncdError
> [2018-09-01 11:23:02.94214] E [syncdutils(worker 
> /urd-gds/gluster):300:log_raise_exception]
> : execution of "gluster" failed with ENOENT (No such file or directory)
> [2018-09-01 11:23:02.106194] I [repce(agent /urd-gds/gluster):80:service_loop]
> RepceServer: terminating on reaching EOF.
> [2018-09-01 11:23:02.12] I [gsyncdstatus(monitor):244:set_worker_status]
> GeorepStatus: Worker Status Change status=Faulty
>
>
> I have also found a python error as well, I have only seen this once
> though.
>
>
> [2018-09-01 11:16:45.907660] I [master(worker
> /urd-gds/gluster):1536:crawl] _GMaster: slave's time
> stime=(1524101534, 0)
> [2018-09-01 11:16:47.364109] E [syncdutils(worker
> /urd-gds/gluster):332:log_raise_exception] : FAIL:
> Traceback (most recent call last):
>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line
> 362, in twrap
> tf(*aargs)
>   File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1939,
> in syncjob
> po = self.sync_engine(pb, self.log_err)
>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1442,
> in rsync
> rconf.ssh_ctl_args + \
> AttributeError: 'NoneType' object has no attribute 'split'
> [2018-09-01 11:16:47.384531] I [repce(agent /urd-gds/gluster):80:service_loop]
> RepceServer: terminating on reaching EOF.
> [2018-09-01 11:16:48.362987] I [monitor(monitor):279:monitor] Monitor:
> worker died in startup phase brick=/urd-gds/gluster
> [2018-09-01 11:16:48.370701] I [gsyncdstatus(monitor):244:set_worker_status]
> GeorepStatus: Worker Status Change status=Faulty
> [2018-09-01 11:16:58.390548] I [monitor(monitor):158:monitor] Monitor:
> starting gsyncd worker   brick=/urd-gds/gluster  slave_node=urd-gds-geo-000
>
>
> I attach the logs as well.
>
>
> Many thanks!
>
>
> Best regards
>
> Marcus Pedersén
>
>
>
>
> --
> *From:* gluster-users-boun...@gluster.org on behalf of Marcus Pedersén
> *Sent:* 31 August 2018 16:09
> *To:* khire...@redhat.com
>
> *Cc:* gluster-users@gluster.org
> *Subject:* Re: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does
> not work Now: Upgraded to 4.1.3 geo node Faulty
>
>
> I really apologize, third try to make the mail smaller.
>
>
> /Marcus
>
>
> --
> *From:* Marcus Pedersén
> *Sent:* 31 August 2018 16:03
> *To:* Kotresh Hiremath Ravishankar
> *Cc:* gluster-users@gluster.org
> *Subject:* SV: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does
> not work Now: Upgraded to 4.1.3 geo node Faulty
>
>
> Sorry, resending because the mail was too large.
>
>
> /Marcus
> --
> *From:* Marcus Pedersén
> *Sent:* 31 August 2018 15:19
> *To:* Kotresh Hiremath Ravishankar
> *Cc:* gluster-users@gluster.org
> *Subject:* SV: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does
> not work Now: Upgraded to 4.1.3 geo node Faulty
>
>
> Hi Kotresh,
>
> Please find attached logs, only logs from today.
>
> The python error was repeated over and over again until I disabled SELinux.
>
> After that the node became active again.
>
> The return code 23 seems to be repeated over and over again.
>
>
> rsync version 3.1.2
>
>
> Thanks a lot!
>
>
> Best regards
>
> Marcus
>
>
> --
> *From:* Kotresh Hiremath Ravishankar
> *Sent:* 31 August 2018 11:09
> *To:* Marcus Pedersén
> *Cc:* gluster-users@gluster.org
> *Subject:* Re: [Gluster-users] Was: Upgrade to 4.1.2 geo-replication does
> not work Now: Upgraded to 4.1.3 geo node Faulty
>
> Hi Marcus,
>
> Could you attach the full logs? Is the same traceback happening repeatedly?
> It will be helpful if you attach the corresponding mount log as well.
> What's the rsync version you are using?
>
> Thanks,
> Kotresh HR
>
> On Fri, Aug 31, 2018 at 12:16 PM, Marcus Pedersén 
> wrote:
>
>> Hi all,
>>
>> I had problems with stopping sync after upgrade to 4.1.2.
>>
>> I upgraded to 4.1.3 and it ran fine for one day, but now one of the
>> master nodes shows faulty.
>>
>> Most of the sync jobs have return code 23, how do I resolve this?
>>
>> I see messages like:
>>
>> _GMaster: Sucessfully fixed all entry ops with gfid mismatch
>>
>> Will this resolve error code 23?
>>
>> There is also a python error.
>>
>> The python error was an SELinux problem; turning off SELinux made the node
>> go active again.
>>
>> See log below.
>>
>>
>> CentOS 7, installed through SIG Gluster (OS updated to latest at 

Re: [Gluster-users] Gluster 3.12.12: performance during heal and in general

2018-09-02 Thread Pranith Kumar Karampuri
On Fri, Aug 31, 2018 at 1:18 PM Hu Bert  wrote:

> Hi Pranith,
>
> i just wanted to ask if you were able to get any feedback from your
> colleagues :-)
>

Sorry, I didn't get a chance to. I am working on a customer issue which is
taking away cycles from any other work. Let me get back to you once I get
time this week.


>
> btw.: we migrated some stuff (static resources, small files) to an nfs
> server that we actually wanted to replace with glusterfs. Load and cpu
> usage have gone down a bit, but are still asymmetric on the 3 gluster
> servers.
>
>
> 2018-08-28 9:24 GMT+02:00 Hu Bert :
> > Hm, i noticed that in the shared.log (volume log file) on gluster11
> > and gluster12 (but not on gluster13) i now see these warnings:
> >
> > [2018-08-28 07:18:57.224367] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 3054593291
> > [2018-08-28 07:19:17.733625] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 2595205890
> > [2018-08-28 07:19:27.950355] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 3105728076
> > [2018-08-28 07:19:42.519010] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 3740415196
> > [2018-08-28 07:19:48.194774] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 2922795043
> > [2018-08-28 07:19:52.506135] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 2841655539
> > [2018-08-28 07:19:55.466352] W [MSGID: 109011]
> > [dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
> > hash (value) = 3049465001
> >
> > Don't know if that could be related.
> >
> >
> > 2018-08-28 8:54 GMT+02:00 Hu Bert :
> >> a little update after about 2 hours of uptime: still/again high cpu
> >> usage by one brick process. server load >30.
> >>
> >> gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
> >> gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change
> /dev/sdd
> >> gluster13: high cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
> >>
> >> The process for brick bricksdd1 consumes almost all 12 cores.
> >> Interestingly there are more threads for the bricksdd1 process than
> >> for the other bricks. Counted with "ps huH p  | wc
> >> -l"
> >>
> >> gluster11:
> >> bricksda1 59 threads, bricksdb1 65 threads, bricksdc1 68 threads,
> >> bricksdd1 85 threads
> >> gluster12:
> >> bricksda1 65 threads, bricksdb1 60 threads, bricksdc1 61 threads,
> >> bricksdd1_new 58 threads
> >> gluster13:
> >> bricksda1 61 threads, bricksdb1 60 threads, bricksdc1 61 threads,
> >> bricksdd1_new 82 threads
> >>
> >> Don't know if that could be relevant.
> >>
> >> 2018-08-28 7:04 GMT+02:00 Hu Bert :
> >>> Good Morning,
> >>>
> >>> today i updated + rebooted all gluster servers, kernel updated to
> >>> 4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
> >>> gluster servers (gluster13) one of the bricks did come up at the
> >>> beginning but then lost connection.
> >>>
> >>> OK:
> >>>
> >>> Status of volume: shared
> >>> Gluster process TCP Port  RDMA Port
> Online  Pid
> >>>
> --
> >>> [...]
> >>> Brick gluster11:/gluster/bricksdd1/shared       49155   0   Y   2506
> >>> Brick gluster12:/gluster/bricksdd1_new/shared    49155   0   Y   2097
> >>> Brick gluster13:/gluster/bricksdd1_new/shared    49155   0   Y   2136
> >>>
> >>> Lost connection:
> >>>
> >>> Brick gluster11:/gluster/bricksdd1/shared       49155   0     Y   2506
> >>> Brick gluster12:/gluster/bricksdd1_new/shared    49155   0     Y   2097
> >>> Brick gluster13:/gluster/bricksdd1_new/shared    N/A     N/A   N   N/A
> >>>
> >>> gluster volume heal shared info:
> >>> Brick gluster13:/gluster/bricksdd1_new/shared
> >>> Status: Transport endpoint is not connected
> >>> Number of entries: -
> >>>
> >>> reboot was at 06:15:39; brick then worked for a short period, but then
> >>> somehow disconnected.
> >>>
> >>> from gluster13:/var/log/glusterfs/glusterd.log:
> >>>
> >>> [2018-08-28 04:27:36.944608] I [MSGID: 106005]
> >>> [glusterd-handler.c:6071:__glusterd_brick_rpc_notify] 0-management:
> >>> Brick gluster13:/gluster/bricksdd1_new/shared has disconnected from
> >>> glusterd.
> >>> [2018-08-28 04:28:57.869666] I
> >>> [glusterd-utils.c:6056:glusterd_brick_start] 0-management: starting a
> >>> fresh brick process for brick /gluster/bricksdd1_new/shared
> >>> [2018-08-28 04:35:20.732666] I [MSGID: 106143]
> >>> [glusterd-pmap.c:295:pmap_registry_bind] 0-pmap: adding brick
> >>> /gluster/bricksdd1_new/shared on port 49157
> >>>
> >>> After 'gluster volume start shared force' (then with new 

Re: [Gluster-users] Transport endpoint is not connected : issue

2018-09-02 Thread Karthik Subrahmanya
Hey,

We need some more information to debug this.
I think you missed sending the output of 'gluster volume info <volname>'.
Can you also provide the brick, shd and glfsheal logs?
How many peers are present in the setup? You also mentioned that "one of
the file servers have two processes for each of the volumes instead of one
per volume"; which processes are you referring to here?

Regards,
Karthik

On Sat, Sep 1, 2018 at 12:10 AM Johnson, Tim  wrote:

> Thanks for the reply.
>
>
>
>I have attached the gluster.log file from the host that it is happening
> to at this time.
>
> It does change which host it does this on.
>
>
>
> Thanks.
>
>
>
> *From: *Atin Mukherjee 
> *Date: *Friday, August 31, 2018 at 1:03 PM
> *To: *"Johnson, Tim" 
> *Cc: *Karthik Subrahmanya , Ravishankar N <
> ravishan...@redhat.com>, "gluster-users@gluster.org" <
> gluster-users@gluster.org>
> *Subject: *Re: [Gluster-users] Transport endpoint is not connected : issue
>
>
>
> Can you please pass all the gluster log files from the server where the
> “Transport endpoint is not connected” error is reported? As restarting glusterd
> didn’t solve this issue, I believe this isn’t a stale port problem but
> something else. Also please provide the output of ‘gluster v info <volname>’
>
>
>
> (@cc Ravi, Karthik)
>
>
>
> On Fri, 31 Aug 2018 at 23:24, Johnson, Tim  wrote:
>
> Hello all,
>
>
>
>   We have gluster replicate (with arbiter) volumes that are reporting
> “Transport endpoint is not connected” on a rotating basis from each of the
> two file servers, and from a third host that has the arbiter bricks on it.
>
> This is happening when trying to run a heal on all the volumes on the
> gluster hosts. When I get the status of all the volumes, all looks good.
>
> This behavior seems to be a foreshadowing of the gluster volumes
> becoming unresponsive to our VM cluster. In addition, one of the file
> servers has two processes for each of the volumes instead of one per
> volume. Eventually the affected file server will drop off the listed peers.
> Restarting glusterd/glusterfsd on the affected file server does not take
> care of the issue; we have to bring down both file servers because the
> volumes are no longer seen by the VM cluster after the errors start
> occurring. I had seen that there were bug reports about the “Transport
> endpoint is not connected” error on earlier versions of Gluster, however I
> had thought that it had been addressed.
>
> Dmesg did have some entries for “a possible syn flood on port *”, so we
> changed the sysctl to “net.ipv4.tcp_max_syn_backlog = 2048”, which seemed
> to help with the syn flood messages but not the underlying volume issues.
>
> I have put the versions of all the Gluster packages installed below, as
> well as the “Heal” and “Status” output for the volumes.
>
>
>
> This has just started happening, but I cannot definitively say whether it
> started occurring after an update or not.
>
>
>
>
>
> Thanks for any assistance.
>
>
>
>
>
> Running Heal  :
>
>
>
> gluster volume heal ovirt_engine info
>
> Brick 1.rrc.local:/bricks/brick0/ovirt_engine
>
> Status: Connected
>
> Number of entries: 0
>
>
>
> Brick 3.rrc.local:/bricks/brick0/ovirt_engine
>
> Status: Transport endpoint is not connected
>
> Number of entries: -
>
>
>
> Brick *3.rrc.local:/bricks/arb-brick/ovirt_engine
>
> Status: Transport endpoint is not connected
>
> Number of entries: -
>
>
>
>
>
> Running status :
>
>
>
> gluster volume status ovirt_engine
>
> Status of volume: ovirt_engine
>
> Gluster process TCP Port  RDMA Port  Online
> Pid
>
>
> --
>
> Brick*.rrc.local:/bricks/brick0/ov
>
> irt_engine  49152 0  Y
> 5521
>
> Brick fs2-tier3.rrc.local:/bricks/brick0/ov
>
> irt_engine  49152 0  Y
> 6245
>
> Brick .rrc.local:/bricks/arb-b
>
> rick/ovirt_engine   49152 0  Y
> 3526
>
> Self-heal Daemon on localhost   N/A   N/AY
> 5509
>
> Self-heal Daemon on ***.rrc.local N/A   N/AY   6218
>
> Self-heal Daemon on ***.rrc.local   N/A   N/AY   3501
>
> Self-heal Daemon on .rrc.local N/A   N/AY   3657
>
> Self-heal Daemon on *.rrc.local   N/A   N/AY   3753
>
> Self-heal Daemon on .rrc.local N/A   N/AY   17284
>
>
>
> Task Status of Volume ovirt_engine
>
>
> --
>
> There are no active volume tasks
>
>
>
>
>
>
>
>
>
> /etc/glusterd.vol.   :
>
>
>
>
>
> volume management
>
> type mgmt/glusterd
>
> option working-directory /var/lib/glusterd
>
> option transport-type socket,rdma
>
> option transport.socket.keepalive-time 10
>
> option transport.socket.keepalive-interval 2
>
> 

Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

2018-09-02 Thread Kotresh Hiremath Ravishankar
Hi Krishna,

Indexing is the feature used by the hybrid crawl, and it only makes the crawl
faster. It has nothing to do with the missing data sync.
Could you please share the complete log file of the session where the issue
is encountered?
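
For completeness, the geo-rep session logs are usually found under the
following paths (assuming default log locations; <mastervol>, <slavehost>
and <slavevol> are placeholders for your session):

    /var/log/glusterfs/geo-replication/<mastervol>_<slavehost>_<slavevol>/   (on the master nodes)
    /var/log/glusterfs/geo-replication-slaves/                               (on the slave nodes)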

Thanks,
Kotresh HR

On Mon, Sep 3, 2018 at 9:33 AM, Krishna Verma  wrote:

> Hi Kotresh/Support,
>
>
>
> Request your help to get it fixed. My slave is not getting synced with the
> master. Only when I restart the session after turning indexing off does it
> show the file at the slave, and even then it is blank with zero size.
>
>
>
> At master: file size is 5.8 GB.
>
>
>
> [root@gluster-poc-noida distvol]# du -sh 17.10.v001.20171023-201021_
> 17020_GPLV3.tar.gz
>
> 5.8G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
>
> [root@gluster-poc-noida distvol]#
>
>
>
> But at the slave, after doing the “indexing off”, restarting the session and
> then waiting for 2 days, it shows only 4.9 GB copied.
>
>
>
> [root@gluster-poc-sj distvol]# du -sh 17.10.v001.20171023-201021_
> 17020_GPLV3.tar.gz
>
> 4.9G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
>
> [root@gluster-poc-sj distvol]#
>
>
>
> Similarly, I tested with a small file of only 1.2 GB; that is still
> showing “0” size at the slave after days of waiting.
>
>
>
> At Master:
>
>
>
> [root@gluster-poc-noida distvol]# du -sh rflowTestInt18.08-b001.t.Z
>
> 1.2G    rflowTestInt18.08-b001.t.Z
>
> [root@gluster-poc-noida distvol]#
>
>
>
> At Slave:
>
>
>
> [root@gluster-poc-sj distvol]# du -sh rflowTestInt18.08-b001.t.Z
>
> 0   rflowTestInt18.08-b001.t.Z
>
> [root@gluster-poc-sj distvol]#
>
>
>
> Below is my distributed volume info :
>
>
>
> [root@gluster-poc-noida distvol]# gluster volume info glusterdist
>
>
>
> Volume Name: glusterdist
>
> Type: Distribute
>
> Volume ID: af5b2915-7170-4b5e-aee8-7e68757b9bf1
>
> Status: Started
>
> Snapshot Count: 0
>
> Number of Bricks: 2
>
> Transport-type: tcp
>
> Bricks:
>
> Brick1: gluster-poc-noida:/data/gluster-dist/distvol
>
> Brick2: noi-poc-gluster:/data/gluster-dist/distvol
>
> Options Reconfigured:
>
> changelog.changelog: on
>
> geo-replication.ignore-pid-check: on
>
> geo-replication.indexing: on
>
> transport.address-family: inet
>
> nfs.disable: on
>
> [root@gluster-poc-noida distvol]#
>
>
>
> Please help to fix this; I believe it's not normal behavior for gluster rsync.
>
>
>
> /Krishna
>
> *From:* Krishna Verma
> *Sent:* Friday, August 31, 2018 12:42 PM
> *To:* 'Kotresh Hiremath Ravishankar' 
> *Cc:* Sunny Kumar ; Gluster Users <
> gluster-users@gluster.org>
> *Subject:* RE: [Gluster-users] Upgrade to 4.1.2 geo-replication does not
> work
>
>
>
> Hi Kotresh,
>
>
>
> I have tested the geo replication over distributed volumes with 2*2
> gluster setup.
>
>
>
> [root@gluster-poc-noida ~]# gluster volume geo-replication glusterdist
> gluster-poc-sj::glusterdist status
>
>
>
> MASTER NODE          MASTER VOL     MASTER BRICK                  SLAVE USER    SLAVE                          SLAVE NODE         STATUS    CRAWL STATUS       LAST_SYNCED
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> gluster-poc-noida    glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj     Active    Changelog Crawl    2018-08-31 10:28:19
> noi-poc-gluster      glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj2    Active    History Crawl      N/A
>
> [root@gluster-poc-noida ~]#
>
>
>
> Now at the client I copied an 848MB file from the local disk to the master
> mounted volume and it took only 1 minute and 15 seconds. It's great….
>
>
>
> But even after waiting for 2 hrs I was unable to see that file at the slave
> site. Then I again erased the indexing by doing “gluster volume set
> glusterdist indexing off” and restarted the session. Magically, I received
> the file instantly at the slave after doing this.
>
>
>
> Why do I need to do “indexing off” every time for data to be reflected at
> the slave site? Is there any fix/workaround for it?
>
>
>
> /Krishna
>
>
>
>
>
> *From:* Kotresh Hiremath Ravishankar 
> *Sent:* Friday, August 31, 2018 10:10 AM
> *To:* Krishna Verma 
> *Cc:* Sunny Kumar ; Gluster Users <
> gluster-users@gluster.org>
> *Subject:* Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not
> work
>
>
>
>
>
>
>
>
> On Thu, Aug 30, 2018 at 3:51 PM, Krishna Verma  wrote:
>
> Hi Kotresh,
>
>
>
> Yes, this includes the time taken to write the 1GB file to the master. geo-rep
> was not stopped while the data was copying to the master.
>
>
>
> This way, you can't really measure how much time geo-rep took.
>
>
>
>
>
> But now I am in trouble. My PuTTY session timed out while data was copying
> to the master and geo-replication was active. After I restarted the PuTTY
> session, my master data is not syncing with the slave. Its Last_synced time
> is 1 hr behind the current time.
>
>
>
> I restart the geo rep and also delete and 

Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

2018-09-02 Thread Krishna Verma
Hi Kotresh/Support,

Request your help to get it fixed. My slave is not getting synced with the
master. Only when I restart the session after turning indexing off does it show
the file at the slave, and even then it is blank with zero size.

At master: file size is 5.8 GB.

[root@gluster-poc-noida distvol]# du -sh 
17.10.v001.20171023-201021_17020_GPLV3.tar.gz
5.8G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
[root@gluster-poc-noida distvol]#

But at the slave, after doing the “indexing off”, restarting the session and
then waiting for 2 days, it shows only 4.9 GB copied.

[root@gluster-poc-sj distvol]# du -sh 
17.10.v001.20171023-201021_17020_GPLV3.tar.gz
4.9G    17.10.v001.20171023-201021_17020_GPLV3.tar.gz
[root@gluster-poc-sj distvol]#

Similarly, I tested with a small file of only 1.2 GB; that is still showing
“0” size at the slave after days of waiting.

At Master:

[root@gluster-poc-noida distvol]# du -sh rflowTestInt18.08-b001.t.Z
1.2G    rflowTestInt18.08-b001.t.Z
[root@gluster-poc-noida distvol]#

At Slave:

[root@gluster-poc-sj distvol]# du -sh rflowTestInt18.08-b001.t.Z
0   rflowTestInt18.08-b001.t.Z
[root@gluster-poc-sj distvol]#

Below is my distributed volume info :

[root@gluster-poc-noida distvol]# gluster volume info glusterdist

Volume Name: glusterdist
Type: Distribute
Volume ID: af5b2915-7170-4b5e-aee8-7e68757b9bf1
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: gluster-poc-noida:/data/gluster-dist/distvol
Brick2: noi-poc-gluster:/data/gluster-dist/distvol
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: on
[root@gluster-poc-noida distvol]#

Please help to fix this; I believe it's not normal behavior for gluster rsync.
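
(In case it helps with verifying sync instead of toggling indexing: I
understand a checkpoint can be set on the session and its completion then
watched in the status output, along these lines, using the same session
names as above:

    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist config checkpoint now
    gluster volume geo-replication glusterdist gluster-poc-sj::glusterdist status detail

but please correct me if that is not the recommended way.)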

/Krishna
From: Krishna Verma
Sent: Friday, August 31, 2018 12:42 PM
To: 'Kotresh Hiremath Ravishankar' 
Cc: Sunny Kumar ; Gluster Users 
Subject: RE: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work

Hi Kotresh,

I have tested the geo replication over distributed volumes with 2*2 gluster 
setup.

[root@gluster-poc-noida ~]# gluster volume geo-replication glusterdist 
gluster-poc-sj::glusterdist status

MASTER NODE          MASTER VOL     MASTER BRICK                  SLAVE USER    SLAVE                          SLAVE NODE         STATUS    CRAWL STATUS       LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gluster-poc-noida    glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj     Active    Changelog Crawl    2018-08-31 10:28:19
noi-poc-gluster      glusterdist    /data/gluster-dist/distvol    root          gluster-poc-sj::glusterdist    gluster-poc-sj2    Active    History Crawl      N/A
[root@gluster-poc-noida ~]#

Now at the client I copied an 848MB file from the local disk to the master
mounted volume and it took only 1 minute and 15 seconds. It's great….

But even after waiting for 2 hrs I was unable to see that file at the slave
site. Then I again erased the indexing by doing “gluster volume set glusterdist
indexing off” and restarted the session. Magically, I received the file
instantly at the slave after doing this.

Why do I need to do “indexing off” every time for data to be reflected at the
slave site? Is there any fix/workaround for it?

/Krishna


From: Kotresh Hiremath Ravishankar <khire...@redhat.com>
Sent: Friday, August 31, 2018 10:10 AM
To: Krishna Verma <kve...@cadence.com>
Cc: Sunny Kumar <sunku...@redhat.com>; Gluster Users <gluster-users@gluster.org>
Subject: Re: [Gluster-users] Upgrade to 4.1.2 geo-replication does not work



On Thu, Aug 30, 2018 at 3:51 PM, Krishna Verma <kve...@cadence.com> wrote:
Hi Kotresh,

Yes, this includes the time taken to write the 1GB file to the master. geo-rep
was not stopped while the data was copying to the master.

This way, you can't really measure how much time geo-rep took.


But now I am in trouble. My PuTTY session timed out while data was copying to
the master and geo-replication was active. After I restarted the PuTTY session,
my master data is not syncing with the slave. Its Last_synced time is 1 hr
behind the current time.

I restarted the geo-rep and also deleted and re-created the session, but its
“LAST_SYNCED” time is the same.

Unless geo-rep is Faulty, it would be processing/syncing. You should check the
logs for any errors.


Please help in this.

…. It's better if the gluster volume has a higher distribute count like 3*3 or
4*3 :- Are you referring to creating a distributed volume with 3 master nodes
and 3 slave nodes?

Yes, that's correct. Please do the test with this. I recommend you run the
actual workload for which you are planning to use gluster instead of copying a
1GB file and testing.



/krishna

From: Kotresh Hiremath Ravishankar <khire...@redhat.com>

[Gluster-users] Gluster clients intermittently hang until first gluster server in a Replica 1 Arbiter 1 cluster is rebooted, server error: 0-management: Unlocking failed & client error: bailing out fr

2018-09-02 Thread Sam McLeod
We've got an odd problem where clients are blocked from writing to Gluster 
volumes until the first node of the Gluster cluster is rebooted.

I suspect I've either configured something incorrectly with the arbiter / 
replica configuration of the volumes, or there is some sort of bug in the 
gluster client-server connection that we're triggering.

I was wondering if anyone has seen this or could point me in the right 
direction?


Environment:
Topology: 3 node cluster, replica 2, arbiter 1 (third node is metadata only).
Version: Client and Servers both running 4.1.3, both on CentOS 7, kernel 
4.18.x, (Xen) VMs with relatively fast networked SSD storage backing them, XFS.
Client: Native Gluster FUSE client mounting via the kubernetes provider
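
(For reference, the volumes are arbiter volumes, i.e. created with something
along the lines of the following - hostnames and brick paths here are
placeholders rather than our exact layout, and gluster counts the arbiter as
the third replica in the create syntax:

    gluster volume create staging_static replica 3 arbiter 1 \
        node1:/bricks/staging_static \
        node2:/bricks/staging_static \
        node3:/bricks/arbiter/staging_static

so only the third node is metadata-only.)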

Problem:
Seemingly randomly some clients will be blocked / are unable to write to what 
should be a highly available gluster volume.
The client gluster logs show it failing to do new file operations across 
various volumes and all three nodes of the gluster.
The server gluster (or OS) logs do not show any warnings or errors.
The client recovers and is able to write to volumes again after the first node 
of the gluster cluster is rebooted.
Until the first node of the gluster cluster is rebooted, the client fails to 
write to the volume that is (or should be) available on the second node (a 
replica) and third node (an arbiter only node).

What 'fixes' the issue:
Although the clients (kubernetes hosts) connect to all 3 nodes of the Gluster 
cluster - restarting the first gluster node always unblocks the IO and allows 
the client to continue writing.
Stopping and starting the glusterd service on the gluster server is not enough 
to fix the issue, nor is restarting its networking.
This suggests to me that the volume is unavailable for writing for some reason 
and restarting the first node in the cluster either clears some sort of TCP 
sessions between the client-server or between the server-server replication.
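
(Side note: since the client errors below are INODELK calls timing out, I
assume the next thing to check - not yet verified on our setup - is whether a
stale inode lock is being held on the first node, e.g. by taking brick
statedumps and, if one shows up, clearing it instead of rebooting:

    gluster volume statedump staging_static
    gluster volume clear-locks staging_static / kind all inode 0,0-0

The statedumps land under /var/run/gluster/ on each brick host by default.)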

Expected behaviour:

If the first gluster node / server had failed or was blocked from performing
operations for some reason (which it doesn't seem it is), I'd expect the
clients to access data from the second gluster node and write metadata to the
third gluster node as well, since it's an arbiter / metadata-only node.
If for some reason a gluster node was not able to serve connections to
clients, I'd expect to see errors in the volume, glusterd or brick log files
(there are none on the first gluster node).
If the first gluster node was for some reason blocking IO on a volume, I'd 
expect that node either to show as unhealthy or unavailable in the gluster peer 
status or gluster volume status.


Client gluster errors:

staging_static in this example is a volume name.
You can see the client trying to connect to the second and third nodes of the 
gluster cluster and failing (unsure as to why?)
The server side logs on the first gluster node do not show any errors or 
problems, but the second / third node show errors in the glusterd.log when 
trying to 'unlock' the 0-management volume on the first node.


On a gluster client (a kubernetes host using the kubernetes connector which 
uses the native fuse client) when its blocked from writing but the gluster 
appears healthy (other than the errors mentioned later):

[2018-09-02 15:33:22.750874] E [rpc-clnt.c:184:call_bail] 
0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) 
op(INODELK(29)) xid = 0x1cce sent = 2018-09-02 15:03:22.417773. timeout = 1800 
for :49154
[2018-09-02 15:33:22.750989] E [MSGID: 114031] 
[client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: 
remote operation failed [Transport endpoint is not connected]
[2018-09-02 16:03:23.097905] E [rpc-clnt.c:184:call_bail] 
0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) 
op(INODELK(29)) xid = 0x2e21 sent = 2018-09-02 15:33:22.765751. timeout = 1800 
for :49154
[2018-09-02 16:03:23.097988] E [MSGID: 114031] 
[client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: 
remote operation failed [Transport endpoint is not connected]
[2018-09-02 16:33:23.439172] E [rpc-clnt.c:184:call_bail] 
0-staging_static-client-2: bailing out frame type(GlusterFS 4.x v1) 
op(INODELK(29)) xid = 0x1d4b sent = 2018-09-02 16:03:23.098133. timeout = 1800 
for :49154
[2018-09-02 16:33:23.439282] E [MSGID: 114031] 
[client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-2: 
remote operation failed [Transport endpoint is not connected]
[2018-09-02 17:03:23.786858] E [rpc-clnt.c:184:call_bail] 
0-staging_static-client-1: bailing out frame type(GlusterFS 4.x v1) 
op(INODELK(29)) xid = 0x2ee7 sent = 2018-09-02 16:33:23.455171. timeout = 1800 
for :49154
[2018-09-02 17:03:23.786971] E [MSGID: 114031] 
[client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-staging_static-client-1: 
remote operation failed [Transport endpoint is not connected]
[2018-09-02 17:33:24.160607] E 

[Gluster-users] /var/run/glusterd.socket permissions for non-root geo-replication (4.1.3)

2018-09-02 Thread Andy Coates
Hi all,

We're investigating geo-replication and noticed that when using non-root
geo-replication, the sync user cannot access various gluster commands, e.g.
one of the session commands ends up running this on the slave:

Popen: command returned error   cmd=/usr/sbin/gluster
--remote-host=localhost system:: mount geosync user-map-root=geosync
aux-gfid-mount acl log-level=INFO
log-file=/var/log/glusterfs/geo-replication-slaves/snip/snip.log
volfile-server=localhost volfile-id=shared client-pid=-1  error=1

Popen: /usr/sbin/gluster> 2 : failed with this errno (No such file or
directory)

The underlying cause of this is the gluster command not being able to write
to the socket file /var/run/glusterd.socket - if I change the group to my
geo-replication group and add group write, the command succeeds and
geo-replication becomes active.

The problem is that every time the server/service restarts, it comes back up as
root:root:

srwxr-xr-x. 1 root root 0 Sep  3 02:17 /var/run/glusterd.socket

So a couple of questions:
1) Should the geo-replication non-root user be able to do what it needs
without changing those permissions?
2) If it does need write permission, is there a config option to tell the
service to set the correct permissions on the file when it starts so that
the non-root user can write to it?
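
(In the meantime I assume something like a systemd drop-in could re-apply the
permissions after each glusterd start - untested sketch, and "geosync" is just
our geo-rep group:

    # /etc/systemd/system/glusterd.service.d/socket-perms.conf
    [Service]
    ExecStartPost=/usr/bin/chgrp geosync /var/run/glusterd.socket
    ExecStartPost=/usr/bin/chmod g+w /var/run/glusterd.socket

but a supported config option would obviously be preferable.)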

Thanks.
Andy
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] blocking process on FUSE mount in directory which is using quota

2018-09-02 Thread mabi
Hello,

I wanted to report that this morning I had a similar issue on another server
where a few PHP-FPM processes got blocked on a different GlusterFS volume
mounted through a FUSE mount. This GlusterFS volume has no quota enabled, so
it might not be quota-related after all.

Here is the Linux kernel stack trace:

[Sun Sep  2 06:47:47 2018] INFO: task php5-fpm:25880 blocked for more than 120 
seconds.
[Sun Sep  2 06:47:47 2018]   Not tainted 3.16.0-4-amd64 #1
[Sun Sep  2 06:47:47 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[Sun Sep  2 06:47:47 2018] php5-fpmD 88017ee12f40 0 25880  
1 0x0004
[Sun Sep  2 06:47:47 2018]  880101688b60 0282 00012f40 
880059ca3fd8
[Sun Sep  2 06:47:47 2018]  00012f40 880101688b60 8801093b51b0 
8801067ec800
[Sun Sep  2 06:47:47 2018]  880059ca3cc0 8801093b5290 8801093b51b0 
880059ca3e80
[Sun Sep  2 06:47:47 2018] Call Trace:
[Sun Sep  2 06:47:47 2018]  [] ? 
__fuse_request_send+0xbd/0x270 [fuse]
[Sun Sep  2 06:47:47 2018]  [] ? 
prepare_to_wait_event+0xf0/0xf0
[Sun Sep  2 06:47:47 2018]  [] ? fuse_send_write+0xd0/0x100 
[fuse]
[Sun Sep  2 06:47:47 2018]  [] ? 
fuse_perform_write+0x26f/0x4b0 [fuse]
[Sun Sep  2 06:47:47 2018]  [] ? 
fuse_file_write_iter+0x1dd/0x2b0 [fuse]
[Sun Sep  2 06:47:47 2018]  [] ? new_sync_write+0x74/0xa0
[Sun Sep  2 06:47:47 2018]  [] ? vfs_write+0xb2/0x1f0
[Sun Sep  2 06:47:47 2018]  [] ? vfs_read+0xed/0x170
[Sun Sep  2 06:47:47 2018]  [] ? SyS_write+0x42/0xa0
[Sun Sep  2 06:47:47 2018]  [] ? SyS_lseek+0x7e/0xa0
[Sun Sep  2 06:47:47 2018]  [] ? 
system_call_fast_compare_end+0x10/0x15

Did anyone already have time to have a look at the statedump file I sent around 
3 weeks ago?
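
(For reference, I believe a fresh statedump of the FUSE client can be produced
by sending SIGUSR1 to the glusterfs client process of the affected mount, with
the dump written under /var/run/gluster/ by default - the mount point below is
just a placeholder:

    kill -USR1 $(pgrep -f 'glusterfs.*/mnt/myvolume')

so I can generate another one if that helps.)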

I never saw this type of problem in the past; it started to appear after I
upgraded to GlusterFS 3.12.12.

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On August 15, 2018 9:21 AM, mabi  wrote:

> Great, you will then find attached here the statedump of the client using the 
> FUSE glusterfs mount right after two processes have blocked.
>
> Two notes here regarding the "path=" in this statedump file:
> - I have renamed all the "path=" which has the problematic directory as 
> "path=PROBLEMATIC_DIRECTORY_HERE
> - All the other "path=" I have renamed them to "path=REMOVED_FOR_PRIVACY".
>
> Note also that funnily enough the number of "path=" for that problematic 
> directory sums up to exactly 5000 entries. Coincidence or hint to the problem 
> maybe?
>
> ‐‐‐ Original Message ‐‐‐
> On August 15, 2018 5:21 AM, Raghavendra Gowdappa  wrote:
>
>> On Tue, Aug 14, 2018 at 7:23 PM, mabi  wrote:
>>
>>> Bad news: the process blocked happened again this time with another 
>>> directory of another user which is NOT over his quota but which also has 
>>> quota enabled.
>>>
>>> The symptoms on the Linux side are the same:
>>>
>>> [Tue Aug 14 15:30:33 2018] INFO: task php5-fpm:14773 blocked for more than 
>>> 120 seconds.
>>> [Tue Aug 14 15:30:33 2018]   Not tainted 3.16.0-4-amd64 #1
>>> [Tue Aug 14 15:30:33 2018] "echo 0 > 
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [Tue Aug 14 15:30:33 2018] php5-fpmD 8801fea13200 0 14773   
>>>  729 0x
>>> [Tue Aug 14 15:30:33 2018]  880100bbe0d0 0282 
>>> 00013200 880129bcffd8
>>> [Tue Aug 14 15:30:33 2018]  00013200 880100bbe0d0 
>>> 880153ed0d68 880129bcfee0
>>> [Tue Aug 14 15:30:33 2018]  880153ed0d6c 880100bbe0d0 
>>>  880153ed0d70
>>> [Tue Aug 14 15:30:33 2018] Call Trace:
>>> [Tue Aug 14 15:30:33 2018]  [] ? 
>>> schedule_preempt_disabled+0x25/0x70
>>> [Tue Aug 14 15:30:33 2018]  [] ? 
>>> __mutex_lock_slowpath+0xd3/0x1d0
>>> [Tue Aug 14 15:30:33 2018]  [] ? write_inode_now+0x93/0xc0
>>> [Tue Aug 14 15:30:33 2018]  [] ? mutex_lock+0x1b/0x2a
>>> [Tue Aug 14 15:30:33 2018]  [] ? fuse_flush+0x8f/0x1e0 
>>> [fuse]
>>> [Tue Aug 14 15:30:33 2018]  [] ? vfs_read+0x93/0x170
>>> [Tue Aug 14 15:30:33 2018]  [] ? filp_close+0x2a/0x70
>>> [Tue Aug 14 15:30:33 2018]  [] ? SyS_close+0x1f/0x50
>>> [Tue Aug 14 15:30:33 2018]  [] ? 
>>> system_call_fast_compare_end+0x10/0x15
>>>
>>> and if I check this process it has state "D" which is "D = uninterruptible 
>>> sleep".
>>>
>>> Now I also managed to take a statedump file as recommended but I see in its 
>>> content under the "[io-cache.inode]" a "path=" which I would need to remove 
>>> as it contains filenames for privacy reasons. Can I remove every "path=" 
>>> line and still send you the statedump file for analysis?
>>
>> Yes. Removing path is fine and statedumps will still be useful for debugging 
>> the issue.
>>
>>> Thank you.
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On August 14, 2018 10:48 AM, Nithya Balachandran  
>>> wrote:
>>>
 Thanks for letting us know. Sanoj, can you take a look at this?

 Thanks.
 Nithya