Re: [Gluster-users] sometimes entry remains in "gluster v heal vol-name info" until visit it from mnt

2018-09-28 Thread Karthik Subrahmanya
Hey,

Please provide the glustershd logs from all the nodes and the client log from the
node where you did the lookup on the file to resolve this issue.
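On typical installations these are (an assumption; adjust if your logs live
elsewhere):

/var/log/glusterfs/glustershd.log      # self-heal daemon log, on each server node
/var/log/glusterfs/mnt-services.log    # fuse client log for a mount at /mnt/services
                                       # (file name is usually derived from the mount path)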

Regards,
Karthik

On Fri, Sep 28, 2018 at 5:27 PM Ravishankar N 
wrote:

> + gluster-users.
>
> Adding Karthik to see if he has some cycles to look into this.
>
> -Ravi
>
> On 09/28/2018 12:07 PM, Zhou, Cynthia (NSB - CN/Hangzhou) wrote:
>
> Hi, glusterfs expert
>
> When I test with glusterfs version 3.12.3, I quite often find that an entry
> remains in the gluster volume heal info output for a long time; *it does not
> disappear until you visit it from the mount point. Is this normal?*
>
>
>
>
>
> [root@sn-0:/root]
>
> # gluster v heal services info
>
> Brick sn-0.local:/mnt/bricks/services/brick
>
> Status: Connected
>
> Number of entries: 0
>
>
>
> Brick sn-1.local:/mnt/bricks/services/brick
>
> Status: Connected
>
> Number of entries: 0
>
>
>
> Brick sn-2.local:/mnt/bricks/services/brick
>
> /fstest_88402c989256d6e39e50208c90c1e85d  // this entry remains in the output
> until you touch /mnt/services/fstest_88402c989256d6e39e50208c90c1e85d
>
> Status: Connected
>
> Number of entries: 1
>
>
>
> [root@sn-0:/root]
>
> # ssh sn-2.local
>
> Warning: Permanently added 'sn-2.local' (RSA) to the list of known hosts.
>
>
>
> USAGE OF THE ROOT ACCOUNT AND THE FULL BASH IS RECOMMENDED ONLY FOR
> LIMITED USE. PLEASE USE A NON-ROOT ACCOUNT AND THE SCLI SHELL (fsclish)
> AND/OR LIMITED BASH SHELL.
>
>
>
> Read /opt/nokia/share/security/readme_root.txt for more details.
>
>
>
> [root@sn-2:/root]
>
> # cd /mnt/bricks/services/brick/.glusterfs/indices/xattrop/
>
> [root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
>
> # ls
>
> 9138e315-efd6-46e0-8a3a-db535078c781
> xattrop-dfcd7e67-8c2d-4ef1-93e2-c180073c8d87
>
> [root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
>
> # getfattr -m . -d -e hex
> /mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/
>
> getfattr: Removing leading '/' from absolute path names
>
> # file: mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/
>
> trusted.afr.services-client-1=0x00010001
>
> trusted.gfid=0x9138e315efd646e08a3adb535078c781
>
> trusted.glusterfs.dht=0x0001
>
>
>
> [root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
>
> # getfattr -m . -d -e hex
> /mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/fstest_4cf1be62e0b12d3d65fac8eacb523ef3/
>
> getfattr: Removing leading '/' from absolute path names
>
> # file:
> mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/fstest_4cf1be62e0b12d3d65fac8eacb523ef3/
>
> trusted.gfid=0x0ccb5c1f96064e699f62fdc72cf036f5
>
>
>
>
>
>
>
> “fstest_88402c989256d6e39e50208c90c1e85d” is only seen from the sn-2 mount
> point and the sn-2 services brick; there is no such entry if you ls
> /mnt/services on sn-0 or sn-1.
>
> [root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
>
> # cd /mnt/services/
>
> [root@sn-2:/mnt/services]
>
> # ls
>
> backup   db
> fstest_88402c989256d6e39e50208c90c1e85d  LCM  NE3SAgent
> _nokrcpautoremoteuser  PM9  RCP_Backup  SS_AlLightProcessor  SymptomDataUpl
>
> commoncollector  EventCorrelationEngine  hypertracer  Log  netserv  ODS  ptp  rcpha  SWM
>
> [root@sn-2:/mnt/services]
>
>
>
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Halo Replication usage

2018-09-28 Thread Nathan Barry
Hello All,

Halo Replication as contributed by Facebook seems to fit a significant
use-case for us, but the documentation is very sparse and leaves out key
details. I'm going to work on setting up a test environment to see how it
behaves, but can anybody answer a few questions?

1. Is cluster.halo-enabled a global setting? Or does it only apply to
certain volumes in the TSP?

2. cluster.halo-min-replicas: How does this interact with a traditional
gluster volume create command? What replica levels are recommended? Does
it work with, or does it need, Arbiter?

3. Related to the above: assuming you have 5 nodes per site (if you were doing
distributed/replicate with Arbiter previously), what halo-min-replicas
setting is recommended? (See the example sketch after these questions.)
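For reference, a hedged sketch of how these options are typically applied, per
volume, with "gluster volume set" (the volume name "halovol" and the latency
threshold are placeholders, not values taken from any documentation):

gluster volume set halovol cluster.halo-enabled yes
gluster volume set halovol cluster.halo-max-latency 10   # latency threshold (assumed to be in ms) for membership in the write halo
gluster volume set halovol cluster.halo-min-replicas 2   # minimum number of replicas that must accept a write

Whether and how this interacts with an arbiter brick is exactly what question 2
asks, so treat the values above as illustrative only.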

 

 



___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version

2018-09-28 Thread Ashish Pandey


- Original Message -

From: "Mauro Tridici"  
To: "Ashish Pandey"  
Cc: "Gluster Users"  
Sent: Friday, September 28, 2018 9:08:52 PM 
Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume 
based on 3.12.14 version 

Thank you, Ashish. 

I will study and try your solution on my virtual env. 
How can I detect the process of a brick on a gluster server? 

Many Thanks, 
Mauro 


gluster v status <volname> will give you the list of bricks and the respective 
process id. 
Also, you can use "ps aux | grep glusterfs" to see all the gluster processes on a 
node, but I think the above step does the same. 
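A minimal sketch, assuming the volume discussed in this thread (tier2) and one
of its brick paths:

gluster v status tier2                                 # lists every brick with its port and PID
ps aux | grep glusterfsd | grep /gluster/mnt3/brick    # brick processes run as glusterfsd; filter by brick path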

--- 
Ashish 



On Fri, 28 Sep 2018 at 16:39, Ashish Pandey < aspan...@redhat.com > wrote: 






From: "Mauro Tridici" < mauro.trid...@cmcc.it > 
To: "Ashish Pandey" < aspan...@redhat.com > 
Cc: "gluster-users" < gluster-users@gluster.org > 
Sent: Friday, September 28, 2018 7:08:41 PM 
Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume 
based on 3.12.14 version 


Dear Ashish, 

please excuse me, I'm very sorry for the misunderstanding. 
Before contacting you during the last few days, we checked all network devices 
(10GbE switch, cables, NICs, server ports, and so on), operating system versions 
and settings, network bonding configuration, gluster package versions, tuning 
profiles, etc., but everything seems to be ok. The first 3 servers (and the volume) 
operated without problems for one year. After we added the 3 new servers we 
noticed something wrong. 
Fortunately, yesterday you gave me a hand in understanding where the problem is 
(or could be). 

At this moment, after we re-launched the remove-brick command, it seems that 
the rebalance is going ahead without errors, but it is only scanning the files. 
Some errors may still appear during the future data movement. 

For this reason, it could be useful to know how to proceed in case of a new 
failure: insist on approach n.1 or change the strategy? 
We are thinking of trying to complete the running remove-brick procedure and 
making a decision based on the outcome. 

Question: could we start approach n.2 also after having successfully removed 
the V1 subvolume?! 

>>> Yes, we can do that. My idea is to use the replace-brick command. 
We will kill "ONLY" one brick process on s06 and format that brick. Then we use 
the replace-brick command to replace a brick of a volume on s05 with this 
formatted brick. 
Heal will be triggered and the data of the respective volume will be placed on 
this brick. 

Now, we can format the brick which got freed up on s05 and use it to replace the 
brick which we killed on s06. 
During this process, we have to make sure heal has completed before trying any 
other replace/kill brick. 

It is tricky but looks doable. Think about it and try to perform it on your 
virtual environment first before trying it on production. 
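A hedged sketch of that sequence (the volume name tier2 is from this thread; 
the brick paths and PID are placeholders): 

kill <pid-of-s06-brick>    # kill ONLY the chosen brick process on s06 (PID from 'gluster v status tier2')
# wipe/format that brick so it is empty, then move the corresponding s05 brick onto it:
gluster volume replace-brick tier2 s05-stg:/gluster/mntX/brick s06-stg:/gluster/mntY/brick commit force
gluster volume heal tier2 info    # wait until heal finishes before the next kill/replace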
--- 

If it is still possible, could you please illustrate approach n.2 even if I 
don't have free disks? 
I would like to start thinking about it and test it on a virtual environment. 

Thank you in advance for your help and patience. 
Regards, 
Mauro 






On 28 Sep 2018, at 14:36, Ashish Pandey < aspan...@redhat.com > wrote: 


We could have taken approach 2 even if you did not have free disks. You should 
have told me why you were opting for approach 1, or perhaps I should have asked. 
I was wondering about approach 1 because sometimes re-balance takes time, 
depending upon the data size. 

Anyway, I hope the whole setup is stable, I mean it is not in the middle of 
something which we cannot stop. 
If free disks are the only concern, I will give you some more steps to deal with 
that and follow approach 2. 

Let me know once you think everything is fine with the system and there is 
nothing to heal. 

--- 
Ashish 


From: "Mauro Tridici" < mauro.trid...@cmcc.it > 
To: "Ashish Pandey" < aspan...@redhat.com > 
Cc: "gluster-users" < gluster-users@gluster.org > 
Sent: Friday, September 28, 2018 4:21:03 PM 
Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume 
based on 3.12.14 version 


Hi Ashish, 

as I said in my previous message, we adopted the first approach you suggested 
(setting the network.ping-timeout option to 0). 
This choice was due to the absence of an empty brick to be used as indicated in 
the second approach. 

So, we launched the remove-brick command on the first subvolume (V1, bricks 
1,2,3,4,5,6 on server s04). 
Rebalance started moving the data across the other bricks but, after about 3TB 
of moved data, the rebalance speed slowed down and some transfer errors appeared 
in the rebalance.log of server s04. 
At this point, since the remaining 1.8TB still needs to be moved in order to 
complete the step, we decided to stop the remove-brick execution and start it 
again (I hope it doesn't stop again before the rebalance completes). 

Now rebalance is not moving data, it's only scanning files (please take a look 
at the following output) 

[root@s01 ~]# gluster volume 

Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version

2018-09-28 Thread Mauro Tridici
Thank you, Ashish.

I will study and try your solution on my virtual env.
How can I detect the process of a brick on a gluster server?

Many Thanks,
Mauro



On Fri, 28 Sep 2018 at 16:39, Ashish Pandey  wrote:

>
>
> --
> *From: *"Mauro Tridici" 
> *To: *"Ashish Pandey" 
> *Cc: *"gluster-users" 
> *Sent: *Friday, September 28, 2018 7:08:41 PM
> *Subject: *Re: [Gluster-users] Rebalance failed on Distributed Disperse
> volumebased on 3.12.14 version
>
>
> Dear Ashish,
>
> please excuse me, I'm very sorry for misunderstanding.
> Before contacting you during last days, we checked all network devices
> (switch 10GbE, cables, NICs, servers ports, and so on), operating systems
> version and settings, network bonding configuration, gluster packages
> versions, tuning profiles, etc. but everything seems to be ok. The first 3
> servers (and volume) operated without problem for one year. After we added
> the new 3 servers we noticed something wrong.
> Fortunately, yesterday you gave me an hand to understand where is (or
> could be) the problem.
>
> At this moment, after we re-launched the remove-brick command, it seems
> that the rebalance is going ahead without errors, but it is only scanning
> the files.
> May be that during the future data movement some errors could appear.
>
> For this reason, it could be useful to know how to proceed in case of a
> new failure: insist with approach n.1 or change the strategy?
> We are thinking to try to complete the running remove-brick procedure and
>  make a decision based on the outcome.
>
> Question: could we start approach n.2 also after having successfully
> removed the V1 subvolume?!
>
> >>> Yes, we can do that. My idea is to use replace-brick command.
> We will kill "ONLY" one brick process on s06. We will format this brick.
> Then use replace-brick command to replace brick of a volume on s05 with
> this formatted brick.
> heal will be triggered and data of the respective volume will be placed on
> this brick.
>
> Now, we can format the brick which got freed up on s05 and replace the
> brick which we killed on s06 to s05.
> During this process, we have to make sure heal completed before trying any
> other replace/kill brick.
>
> It is tricky but looks doable. Think about it and try to perform it on
> your virtual environment first before trying on production.
> ---
>
> If it is still possible, could you please illustrate the approach n.2 even
> if I dont have free disks?
> I would like to start thinking about it and test it on a virtual
> environment.
>
> Thank you in advance for your help and patience.
> Regards,
> Mauro
>
>
>
> On 28 Sep 2018, at 14:36, Ashish Pandey  wrote:
>
>
> We could have taken approach -2 even if you did not have free disks. You
> should have told me why are you
> opting Approach-1 or perhaps I should have asked.
> I was wondering for approach 1 because sometimes re-balance takes time
> depending upon the data size.
>
> Anyway, I hope whole setup is stable, I mean it is not in the middle of
> something which we can not stop.
> If free disks are the only concern I will give you some more steps to deal
> with it and follow the approach 2.
>
> Let me know once you think everything is fine with the system and there is
> nothing to heal.
>
> ---
> Ashish
>
> --
> *From: *"Mauro Tridici" 
> *To: *"Ashish Pandey" 
> *Cc: *"gluster-users" 
> *Sent: *Friday, September 28, 2018 4:21:03 PM
> *Subject: *Re: [Gluster-users] Rebalance failed on Distributed Disperse
> volume based on 3.12.14 version
>
>
> Hi Ashish,
>
> as I said in my previous message, we adopted the first approach you
> suggested (setting network.ping-timeout option to 0).
> This choice was due to the absence of empty brick to be used as indicated
> in the second approach.
>
> So, we launched remove-brick command on the first subvolume (V1, bricks
> 1,2,3,4,5,6 on server s04).
> Rebalance started moving the data across the other bricks, but, after
> about 3TB of moved data, rebalance speed slowed down and some transfer
> errors appeared in the rebalance.log of server s04.
> At this point, since remaining 1,8TB need to be moved in order to complete
> the step, we decided to stop the remove-brick execution and start it again
> (I hope it doesn’t stop again before complete the rebalance)
>
> Now rebalance is not moving data, it’s only scanning files (please, take a
> look to the following output)
>
> [root@s01 ~]# gluster volume remove-brick tier2
> s04-stg:/gluster/mnt1/brick s04-stg:/gluster/mnt2/brick
> s04-stg:/gluster/mnt3/brick s04-stg:/gluster/mnt4/brick
> s04-stg:/gluster/mnt5/brick s04-stg:/gluster/mnt6/brick status
> Node   Rebalanced-files   size   scanned   failures   skipped   status   run time in h:m:s
> ----   ----------------   ----   -------   --------   -------   ------   ------------------

[Gluster-users] Found anomalies in ganesha-gfapi.log

2018-09-28 Thread Renaud Fortier
Hi,
I have a lot of these lines in ganesha-gfapi.log. What are they, and should I 
be worried about them?
[2018-09-28 14:26:46.296375] I [MSGID: 109063] 
[dht-layout.c:693:dht_layout_normalize] 0-testing-dht: Found anomalies in 
(null) (gfid = 4efad4fd-fc7f-4c06-90e0-f882ca74b9a5). Holes=1 overlaps=0

OS : Debian stretch
Gluster : v4.1.5, type : replicated, 3 bricks
Ganesha : 2.6.0
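If the anomalies correspond to real holes in a directory layout (an assumption;
the "(null)" path only shows a gfid, so the affected directory is not identified
here), the usual repair is a fix-layout rebalance on the volume named in the
message ("0-testing-dht" suggests the volume is called "testing"):

gluster volume rebalance testing fix-layout start
gluster volume rebalance testing status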

Thank you
___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version

2018-09-28 Thread Mauro Tridici

Dear Ashish,

please excuse me, I'm very sorry for the misunderstanding.
Before contacting you during the last few days, we checked all network devices 
(10GbE switch, cables, NICs, server ports, and so on), operating system versions 
and settings, network bonding configuration, gluster package versions, tuning 
profiles, etc., but everything seems to be ok. The first 3 servers (and the volume) 
operated without problems for one year. After we added the 3 new servers we 
noticed something wrong.
Fortunately, yesterday you gave me a hand in understanding where the problem is 
(or could be). 

At this moment, after we re-launched the remove-brick command, it seems that 
the rebalance is going ahead without errors, but it is only scanning the files.
Some errors may still appear during the future data movement.

For this reason, it could be useful to know how to proceed in case of a new 
failure: insist on approach n.1 or change the strategy?
We are thinking of trying to complete the running remove-brick procedure and 
making a decision based on the outcome.

Question: could we start approach n.2 also after having successfully removed 
the V1 subvolume?!

If it is still possible, could you please illustrate approach n.2 even if I 
don't have free disks?
I would like to start thinking about it and test it on a virtual environment.

Thank you in advance for your help and patience.
Regards,
Mauro



> On 28 Sep 2018, at 14:36, Ashish Pandey  wrote:
> 
> 
> We could have taken approach -2 even if you did not have free disks. You 
> should have told me why are you
> opting Approach-1 or perhaps I should have asked.
> I was wondering for approach 1 because sometimes re-balance takes time 
> depending upon the data size.
> 
> Anyway, I hope whole setup is stable, I mean it is not in the middle of 
> something which we can not stop.
> If free disks are the only concern I will give you some more steps to deal 
> with it and follow the approach 2.
> 
> Let me know once you think everything is fine with the system and there is 
> nothing to heal.
> 
> ---
> Ashish
> 
> From: "Mauro Tridici" 
> To: "Ashish Pandey" 
> Cc: "gluster-users" 
> Sent: Friday, September 28, 2018 4:21:03 PM
> Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume 
> based on 3.12.14 version
> 
> 
> Hi Ashish,
> 
> as I said in my previous message, we adopted the first approach you suggested 
> (setting network.ping-timeout option to 0).
> This choice was due to the absence of empty brick to be used as indicated in 
> the second approach.
> 
> So, we launched remove-brick command on the first subvolume (V1, bricks 
> 1,2,3,4,5,6 on server s04).
> Rebalance started moving the data across the other bricks, but, after about 
> 3TB of moved data, rebalance speed slowed down and some transfer errors 
> appeared in the rebalance.log of server s04.
> At this point, since remaining 1,8TB need to be moved in order to complete 
> the step, we decided to stop the remove-brick execution and start it again (I 
> hope it doesn’t stop again before complete the rebalance)
> 
> Now rebalance is not moving data, it’s only scanning files (please, take a 
> look to the following output)
> 
> [root@s01 ~]# gluster volume remove-brick tier2 s04-stg:/gluster/mnt1/brick 
> s04-stg:/gluster/mnt2/brick s04-stg:/gluster/mnt3/brick 
> s04-stg:/gluster/mnt4/brick s04-stg:/gluster/mnt5/brick 
> s04-stg:/gluster/mnt6/brick status
> Node      Rebalanced-files   size     scanned   failures   skipped   status        run time in h:m:s
> -------   ----------------   ------   -------   --------   -------   -----------   -----------------
> s04-stg   0                  0Bytes   182008    0          0         in progress   3:08:09
> Estimated time left for rebalance to complete :  442:45:06
> 
> If I’m not wrong, remove-brick rebalances the entire cluster each time it starts.
> Is there a way to speed up this procedure? Do you have some other suggestion 
> that, in this particular case, could be useful to reduce errors (I know that 
> they are related to the current volume configuration) and improve rebalance 
> performance while avoiding rebalancing the entire cluster?
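> (A hedged aside: rebalance speed can sometimes be influenced with the
> cluster.rebal-throttle option; the commands below are illustrative, not a
> recommendation for this volume:)
>
> gluster volume set tier2 cluster.rebal-throttle aggressive   # accepted values: lazy, normal, aggressive
> gluster volume set tier2 cluster.rebal-throttle normal       # restore the default afterwards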
> 
> Thank you in advance,
> Mauro
> 
> On 27 Sep 2018, at 13:14, Ashish Pandey  wrote:
> 
> 
> Yes, you can.
> If not me others may also reply.
> 
> ---
> Ashish
> 
> From: "Mauro Tridici" mailto:mauro.trid...@cmcc.it>>
> To: "Ashish Pandey" mailto:aspan...@redhat.com>>
> Cc: "gluster-users"  >
> Sent: Thursday, September 27, 2018 4:24:12 PM
> Subject: Re: [Gluster-users] Rebalance failed on Distributed Disperse volume  
>   based on 3.12.14 version
> 
> 
> Dear Ashish,
> 
> I can not thank you enough!
> Your procedure and 

Re: [Gluster-users] sometimes entry remains in "gluster v heal vol-name info" until visit it from mnt

2018-09-28 Thread Ravishankar N

+ gluster-users.

Adding Karthik to see if he has some cycles to look into this.

-Ravi


On 09/28/2018 12:07 PM, Zhou, Cynthia (NSB - CN/Hangzhou) wrote:


Hi, glusterfs expert

When I test with glusterfs version 3.12.3, I quite often find that an entry 
remains in the gluster volume heal info output for a long time; *it does not 
disappear until you visit it from the mount point. Is this normal?*
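A hedged sketch of two ways such an entry is usually cleared without walking the
whole mount (assuming the volume is mounted at /mnt/services on a client):

stat /mnt/services/fstest_88402c989256d6e39e50208c90c1e85d   # a named lookup from any client triggers the heal
gluster volume heal services                                 # or ask the self-heal daemon to process the pending index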


[root@sn-0:/root]

# gluster v heal services info

Brick sn-0.local:/mnt/bricks/services/brick

Status: Connected

Number of entries: 0

Brick sn-1.local:/mnt/bricks/services/brick

Status: Connected

Number of entries: 0

Brick sn-2.local:/mnt/bricks/services/brick

/fstest_88402c989256d6e39e50208c90c1e85d  // this entry remains in the output 
until you touch /mnt/services/fstest_88402c989256d6e39e50208c90c1e85d


Status: Connected

Number of entries: 1

[root@sn-0:/root]

# ssh sn-2.local

Warning: Permanently added 'sn-2.local' (RSA) to the list of known hosts.

USAGE OF THE ROOT ACCOUNT AND THE FULL BASH IS RECOMMENDED ONLY FOR 
LIMITED USE. PLEASE USE A NON-ROOT ACCOUNT AND THE SCLI SHELL 
(fsclish) AND/OR LIMITED BASH SHELL.


Read /opt/nokia/share/security/readme_root.txt for more details.

[root@sn-2:/root]

# cd /mnt/bricks/services/brick/.glusterfs/indices/xattrop/

[root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]

# ls

9138e315-efd6-46e0-8a3a-db535078c781 
xattrop-dfcd7e67-8c2d-4ef1-93e2-c180073c8d87


[root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]

# getfattr -m . -d -e hex 
/mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/


getfattr: Removing leading '/' from absolute path names

# file: mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/

trusted.afr.services-client-1=0x00010001

trusted.gfid=0x9138e315efd646e08a3adb535078c781

trusted.glusterfs.dht=0x0001

[root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]

# getfattr -m . -d -e hex 
/mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/fstest_4cf1be62e0b12d3d65fac8eacb523ef3/


getfattr: Removing leading '/' from absolute path names

# file: 
mnt/bricks/services/brick/fstest_88402c989256d6e39e50208c90c1e85d/fstest_4cf1be62e0b12d3d65fac8eacb523ef3/


trusted.gfid=0x0ccb5c1f96064e699f62fdc72cf036f5

“fstest_88402c989256d6e39e50208c90c1e85d” is only seen from the sn-2 mount 
point and the sn-2 services brick; there is no such entry if you ls 
/mnt/services on sn-0 or sn-1.
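A hedged way to check whether the directory exists at all on the sn-0/sn-1
bricks is to look for its gfid link under .glusterfs (path layout assumed from
the gfid shown above; for a directory this is a symlink):

ls -ld /mnt/bricks/services/brick/.glusterfs/91/38/9138e315-efd6-46e0-8a3a-db535078c781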


[root@sn-2:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]

# cd /mnt/services/

[root@sn-2:/mnt/services]

# ls

backup db fstest_88402c989256d6e39e50208c90c1e85d  LCM NE3SAgent  
_nokrcpautoremoteuser  PM9  RCP_Backup SS_AlLightProcessor  SymptomDataUpl


commoncollector EventCorrelationEngine  hypertracer  Log  
netserv    ODS    ptp rcpha   SWM


[root@sn-2:/mnt/services]



___
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Rebalance failed on Distributed Disperse volume based on 3.12.14 version

2018-09-28 Thread Nithya Balachandran
Hi Mauro,


Please send the rebalance logs from s04-stg. I will take a look and get
back.


Regards,
Nithya

On 28 September 2018 at 16:21, Mauro Tridici  wrote:

>
> Hi Ashish,
>
> as I said in my previous message, we adopted the first approach you
> suggested (setting network.ping-timeout option to 0).
> This choice was due to the absence of empty brick to be used as indicated
> in the second approach.
>
> So, we launched remove-brick command on the first subvolume (V1, bricks
> 1,2,3,4,5,6 on server s04).
> Rebalance started moving the data across the other bricks, but, after
> about 3TB of moved data, rebalance speed slowed down and some transfer
> errors appeared in the rebalance.log of server s04.
> At this point, since remaining 1,8TB need to be moved in order to complete
> the step, we decided to stop the remove-brick execution and start it again
> (I hope it doesn’t stop again before complete the rebalance)
>
> Now rebalance is not moving data, it’s only scanning files (please, take a
> look to the following output)
>
> [root@s01 ~]# gluster volume remove-brick tier2
> s04-stg:/gluster/mnt1/brick s04-stg:/gluster/mnt2/brick
> s04-stg:/gluster/mnt3/brick s04-stg:/gluster/mnt4/brick
> s04-stg:/gluster/mnt5/brick s04-stg:/gluster/mnt6/brick status
> Node      Rebalanced-files   size     scanned   failures   skipped   status        run time in h:m:s
> -------   ----------------   ------   -------   --------   -------   -----------   -----------------
> s04-stg   0                  0Bytes   182008    0          0         in progress   3:08:09
> Estimated time left for rebalance to complete :  442:45:06
>
> If I’m not wrong, remove-brick rebalances entire cluster each time it
> start.
> Is there a way to speed up this procedure? Do you have some other
> suggestion that, in this particular case, could be useful to reduce errors
> (I know that they are related to the current volume configuration) and
> improve rebalance performance avoiding to rebalance the entire cluster?
>
> Thank you in advance,
> Mauro
>
> On 27 Sep 2018, at 13:14, Ashish Pandey  wrote:
>
>
> Yes, you can.
> If not me others may also reply.
>
> ---
> Ashish
>
> --
> *From: *"Mauro Tridici" 
> *To: *"Ashish Pandey" 
> *Cc: *"gluster-users" 
> *Sent: *Thursday, September 27, 2018 4:24:12 PM
> *Subject: *Re: [Gluster-users] Rebalance failed on Distributed Disperse
> volumebased on 3.12.14 version
>
>
> Dear Ashish,
>
> I cannot thank you enough!
> Your procedure and description are very detailed.
> I think I will follow the first approach after setting the network.ping-timeout
> option to 0 (if I’m not wrong, “0" means “infinite”... I noticed that this
> value reduced rebalance errors).
> After the fix I will set the network.ping-timeout option back to its default value.
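> (For reference, a hedged sketch of setting and later restoring the option,
> assuming the volume is tier2; 42 seconds is the usual default:)
>
> gluster volume set tier2 network.ping-timeout 0
> gluster volume reset tier2 network.ping-timeout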
>
> Could I contact you again if I need some kind of suggestion?
>
> Thank you very much again.
> Have a good day,
> Mauro
>
>
> On 27 Sep 2018, at 12:38, Ashish Pandey  wrote:
>
>
> Hi Mauro,
>
> We can divide the 36 newly added bricks into 6 sets of 6 bricks each,
> starting from brick37.
> That means there are 6 EC subvolumes, and we have to deal with one subvolume
> at a time.
> I have named them V1 to V6.
>
> Problem:
> Take the case of V1.
> The best configuration/setup would be to have all 6 bricks of V1 on 6
> different nodes.
> However, in your case you have added 3 new nodes. So, at least we should
> have 2 bricks on each of the 3 newly added nodes.
> This way, in a 4+2 EC configuration, even if one node goes down you will
> still have 4 other bricks of that volume and the data on that volume will be
> accessible.
> In the current setup, if s04-stg goes down you will lose all the data on V1
> and V2, as all their bricks will be down. We want to avoid and correct that.
>
> Now, we can have two approaches to correct/modify this setup.
>
> *Approach 1*
> We have to remove all the newly added bricks in sets of 6 bricks. This
> will trigger re-balance and move the whole data to other subvolumes.
> Repeat the above step, and once all the bricks are removed, add those
> bricks again in sets of 6 bricks, this time with 2 bricks from each of the
> 3 newly added nodes.
>
> While this is a valid and working approach, I personally think that this
> will take a long time and also require a lot of data movement (a sketch of
> one remove/add cycle is shown below).
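> (A hedged sketch of one remove/add cycle for Approach 1; the six V1 brick
> arguments are placeholders:)
>
> gluster volume remove-brick tier2 <the 6 bricks of V1> start
> gluster volume remove-brick tier2 <the 6 bricks of V1> status   # wait for "completed"
> gluster volume remove-brick tier2 <the 6 bricks of V1> commit
> gluster volume add-brick tier2 s04-stg:/<b1> s04-stg:/<b2> s05-stg:/<b3> s05-stg:/<b4> s06-stg:/<b5> s06-stg:/<b6>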
>
> *Approach 2*
>
> In this approach we can use the heal process. We have to deal with all the
> volumes (V1 to V6) one by one. Following are the steps for V1-
>
> *Step 1 - *
> Use the replace-brick command to move the following bricks to the *s05-stg* node *one
> by one (heal should be completed after every replace-brick command)*
>
>
> *Brick39: s04-stg:/gluster/mnt3/brick to s05-stg/*
>
> *Brick40: s04-stg:/gluster/mnt4/brick to s05-stg/ free>*
>
> Command :
> gluster v