Re: [Gluster-devel] Cleaning up Jenkins

2017-06-20 Thread Nigel Babu
On Thu, Apr 20, 2017 at 10:57:53AM +0530, Nigel Babu wrote:
> Hello folks,
>
> As I was testing the Jenkins upgrade, I realized we store quite a lot of old
> builds on Jenkins that don't seem to be useful. I'm going to start cleaning
> them slowly in anticipation of moving Jenkins over to a CentOS 7 server in the
> not-so-distant future.
>
> * Old and disabled jobs will be deleted completely.
> * Discard regression logs older than 90 days.
> * Discard smoke and dev RPM logs older than 30 days.
> * Discard post-build RPM jobs older than 10 days.
> * Release job will be unaffected. We'll store all logs.
>
> If we want to archive the old regression logs, I might look at storing them
> some place that's not the Jenkins machine. If you have concerns or comments,
> please let me know.

I've made the changes today. All jobs (except release jobs and regression
jobs) will be deleted after 30 days. Regression logs will be kept for 90 days
so we can debug intermittent failures.
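
For illustration: a minimal sketch of the retention policy above as a
standalone script, assuming a plain directory layout and mtime-based ageing.
Jenkins itself enforces this via each job's "discard old builds" setting; the
paths and job names here are made up.

    import os, shutil, time

    # Announced retention, in days; release jobs are untouched (all logs kept).
    POLICY_DAYS = {
        "regression": 90,  # regression logs
        "smoke": 30,       # smoke and dev RPM logs
        "rpm": 10,         # post-build RPM jobs
    }

    def prune(jobs_root="/var/lib/jenkins/jobs"):
        now = time.time()
        for job, days in POLICY_DAYS.items():
            builds = os.path.join(jobs_root, job, "builds")
            if not os.path.isdir(builds):
                continue
            for build in os.listdir(builds):
                path = os.path.join(builds, build)
                if (now - os.path.getmtime(path)) / 86400 > days:
                    shutil.rmtree(path, ignore_errors=True)

    prune()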

--
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Nithya Balachandran
On 21 June 2017 at 10:26, Pranith Kumar Karampuri wrote:

>
>
> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran wrote:
>
>>
>> On 20 June 2017 at 20:38, Aravinda  wrote:
>>
>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>
>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
>>> go with the format Aravinda suggested for now; in future we want some
>>> more changes for dht to detect which subvolume went down and came back up,
>>> and at that time we will revisit the solution suggested by Xavi.
>>>
>>> Susanth is doing the dht changes
>>> Aravinda is doing geo-rep changes
>>>
>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>>
>>>
>> The proposed changes to the node-uuid behaviour (while good) are going to
>> break tiering. Tiering changes will take a little more time to be coded
>> and tested.
>>
>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we
>> go back to the original node-uuid behaviour for now so as to unblock the
>> release and target the proposed changes for the next 3.11 releases.
>>
>
> Let me see if I understand the changes correctly. We are restoring the
> behavior of node-uuid xattr and adding a new xattr for parallel rebalance
> for both afr and ec, correct?
>

Yes, this is what I understand as well. So geo-rep behaviour does not
change (node-uuid) and rebalance uses the new xattr. :)



> Otherwise that is one more regression. If yes, we will also wait for
> Xavi's inputs. Jeff accidentally merged the afr patch yesterday which does
> these changes. If everyone is in agreement, we will leave it as is and add
> similar changes in ec as well. If we are not in agreement, then we will let
> the discussion progress :-)
>
>
>>
>>
>> Regards,
>> Nithya
>>
>>> --
>>> Aravinda
>>>
>>>
>>>
>>> Thanks to all of you guys for the discussions!
>>>
>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez wrote:
>>>
 Hi Aravinda,

 On 20/06/17 12:42, Aravinda wrote:

> I think the following format can be easily adopted by all components
>
> UUIDs of a subvolume are separated by space and subvolumes are
> separated
> by comma
>
> For example, node1 and node2 are replica with U1 and U2 UUIDs
> respectively and
> node3 and node4 are replica with U3 and U4 UUIDs respectively
>
> node-uuid can return "U1 U2,U3 U4"
>

 While this is ok for current implementation, I think this can be
 insufficient if there are more layers of xlators that need to indicate
 some sort of grouping. Some representation that can represent hierarchy
 would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
 as a separator).


> Geo-rep can split by "," and then split by space and take first UUID
> DHT can split the value by space or comma and get unique UUIDs list
>

 This doesn't solve the problem I described in the previous email. Some
 more logic will need to be added to avoid more than one node from each
 replica-set to be active. If we have some explicit hierarchy information in
 the node-uuid value, more decisions can be taken.

 An initial proposal I made was this:

 DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))

 This is harder to parse, but gives a lot of information: DHT with 2
 subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
 also easily extensible with any new xlator that changes the layout.

 However maybe this is not the moment to do this, and probably we could
 implement this in a new xattr with a better name.

 Xavi



> Another question is about the behavior when a node is down, existing
> node-uuid xattr will not return that UUID if a node is down. What is
> the
> behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>
>> Hi Xavi,
>>
>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>
>>> Hi Aravinda,
>>>
>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>
 Adding more people to get a consensus about this.

 On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


 regards
 Aravinda VK


 On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

 Hi Pranith,

 adding gluster-devel, Kotresh and Aravinda,

 On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



 On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez
 

[Gluster-devel] Reducing the time to test patches which don't modify code

2017-06-20 Thread Amar Tumballi
Today, any change to the glusterfs code base (other than 'doc/') triggers the
regression runs when +1 Verified is voted. But we noticed that patches
which only change 'extras/' or just update a README file need not run
regressions.

So, Nigel proposed the idea of a .testignore file (like .gitignore)[1]. It
lists the paths of files to be ignored for testing. If all the files in a
patch are listed there, the tests won't be triggered.

If you send a patch in future that needs to add a new file, check whether you
need to update the .testignore file too.
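
For illustration, a sketch of the trigger decision described above, assuming
.testignore lists one relative path per line (the actual implementation lives
in the review linked below):

    def should_run_regression(changed_files, testignore_path=".testignore"):
        # Run the tests unless every file touched by the patch is ignored.
        with open(testignore_path) as f:
            ignored = {line.strip() for line in f if line.strip()}
        return not all(path in ignored for path in changed_files)

    # e.g. a patch touching only files listed in .testignore returns False,
    # so no regression run is triggered for it.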

[1] - https://review.gluster.org/17522

Regards,
Amar

-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Pranith Kumar Karampuri
On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran wrote:

>
> On 20 June 2017 at 20:38, Aravinda  wrote:
>
>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>
>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
>> with the format Aravinda suggested for now; in future we want some
>> more changes for dht to detect which subvolume went down and came back up,
>> and at that time we will revisit the solution suggested by Xavi.
>>
>> Susanth is doing the dht changes
>> Aravinda is doing geo-rep changes
>>
>> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>>
>>
> The proposed changes to the node-uuid behaviour (while good) are going to
> break tiering. Tiering changes will take a little more time to be coded
> and tested.
>
> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we go
> back to the original node-uuid behaviour for now so as to unblock the
> release and target the proposed changes for the next 3.11 releases.
>

Let me see if I understand the changes correctly. We are restoring the
behavior of node-uuid xattr and adding a new xattr for parallel rebalance
for both afr and ec, correct? Otherwise that is one more regression. If
yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr
patch yesterday which does these changes. If everyone is in agreement, we
will leave it as is and add similar changes in ec as well. If we are not in
agreement, then we will let the discussion progress :-)


>
>
> Regards,
> Nithya
>
>> --
>> Aravinda
>>
>>
>>
>> Thanks to all of you guys for the discussions!
>>
>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
>> wrote:
>>
>>> Hi Aravinda,
>>>
>>> On 20/06/17 12:42, Aravinda wrote:
>>>
 I think the following format can be easily adopted by all components

 UUIDs of a subvolume are separated by space and subvolumes are separated
 by comma

 For example, node1 and node2 are replica with U1 and U2 UUIDs
 respectively and
 node3 and node4 are replica with U3 and U4 UUIDs respectively

 node-uuid can return "U1 U2,U3 U4"

>>>
>>> While this is ok for current implementation, I think this can be
>>> insufficient if there are more layers of xlators that need to indicate
>>> some sort of grouping. Some representation that can represent hierarchy
>>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>>> as a separator).
>>>
>>>
 Geo-rep can split by "," and then split by space and take first UUID
 DHT can split the value by space or comma and get unique UUIDs list

>>>
>>> This doesn't solve the problem I described in the previous email. Some
>>> more logic will need to be added to avoid more than one node from each
>>> replica-set to be active. If we have some explicit hierarchy information in
>>> the node-uuid value, more decisions can be taken.
>>>
>>> An initial proposal I made was this:
>>>
>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))
>>>
>>> This is harder to parse, but gives a lot of information: DHT with 2
>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>>> also easily extensible with any new xlator that changes the layout.
>>>
>>> However maybe this is not the moment to do this, and probably we could
>>> implement this in a new xattr with a better name.
>>>
>>> Xavi
>>>
>>>
>>>
 Another question is about the behavior when a node is down, existing
 node-uuid xattr will not return that UUID if a node is down. What is the
 behavior with the proposed xattr?

 Let me know your thoughts.

 regards
 Aravinda VK

 On 06/20/2017 03:06 PM, Aravinda wrote:

> Hi Xavi,
>
> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>
>> Hi Aravinda,
>>
>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>
>>> Adding more people to get a consensus about this.
>>>
>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:
>>>
>>>
>>> regards
>>> Aravinda VK
>>>
>>>
>>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>>
>>> Hi Pranith,
>>>
>>> adding gluster-devel, Kotresh and Aravinda,
>>>
>>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>>
>>>
>>>
>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:
>>>
>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>
>>> The way geo-replication works is:
>>> On each machine, it does getxattr of node-uuid
>>> and
>>> check 

Re: [Gluster-devel] brick multiplexing and memory consumption

2017-06-20 Thread Amar Tumballi
On Wed, Jun 21, 2017 at 9:53 AM, Raghavendra Talur wrote:

>
>
> On 21-Jun-2017 9:45 AM, "Jeff Darcy"  wrote:
>
>
>
>
> On Tue, Jun 20, 2017, at 03:38 PM, Raghavendra Talur wrote:
>
> Each process takes 795MB of virtual memory and resident memory is 10MB
> each.
>
>
> Wow, that's even better than I thought.  I was seeing about a 3x
> difference per brick (plus the fixed cost of a brick process) during
> development.  Your numbers suggest more than 10x.  Almost makes it seem
> worth the effort.  ;)
>
>
> :)
>
>
> Just to be clear, I am not saying that brick multiplexing isn't working.
> The aim is to prevent the glusterfsd process from getting OOM killed
> because 200 bricks when multiplexed consume 20GB of virtual memory.
>
>
> Yes, the OOM killer is more dangerous with multiplexing.  It likes to take
> out the process that is the whole machine's reason for existence, which is
> pretty darn dumb.  Perhaps we should use oom_adj/OOM_DISABLE to make it a
> bit less dumb?
>
>
> This is not so easy for container deployment models.
>
>
> If it is found that the additional usage of 75MB of virtual memory per
> every brick attach can't be removed/reduced, then the only solution would
> be to fix issue 151 [1] by limiting multiplexed bricks.
> [1] https://github.com/gluster/glusterfs/issues/151
>
>
> This is another reason why limiting the number of brick processes is
> preferable to limiting the number of bricks per process.  When we limit
> bricks per process and wait until one is "full" before starting another,
> then that first brick process remains a prime target for the OOM killer.
> By "striping" bricks across N processes (where N ~= number of cores), none
> of them become targets until we're approaching our system-wide brick limit
> anyway.
>
>
> +1, I now understand the reasoning behind limiting number of processes. I
> was in favor of limiting bricks per process before.
>
>
Makes sense. +1 on this approach from me too. Lets get going with this IMO.

-Amar


> Thanks,
> Raghavendra Talur
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Nithya Balachandran
On 20 June 2017 at 20:38, Aravinda  wrote:

> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>
> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
> with the format Aravinda suggested for now; in future we want some
> more changes for dht to detect which subvolume went down and came back up,
> and at that time we will revisit the solution suggested by Xavi.
>
> Susanth is doing the dht changes
> Aravinda is doing geo-rep changes
>
> Done. Geo-rep patch sent for review https://review.gluster.org/17582
>
>
The proposed changes to the node-uuid behaviour (while good) are going to
break tiering. Tiering changes will take a little more time to be coded
and tested.

As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we go
back to the original node-uuid behaviour for now so as to unblock the
release and target the proposed changes for the next 3.11 releases.


Regards,
Nithya

> --
> Aravinda
>
>
>
> Thanks to all of you guys for the discussions!
>
> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
> wrote:
>
>> Hi Aravinda,
>>
>> On 20/06/17 12:42, Aravinda wrote:
>>
>>> I think the following format can be easily adopted by all components
>>>
>>> UUIDs of a subvolume are separated by space and subvolumes are separated
>>> by comma
>>>
>>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>>> respectively and
>>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>>
>>> node-uuid can return "U1 U2,U3 U4"
>>>
>>
>> While this is ok for current implementation, I think this can be
>> insufficient if there are more layers of xlators that need to indicate
>> some sort of grouping. Some representation that can represent hierarchy
>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
>> as a separator).
>>
>>
>>> Geo-rep can split by "," and then split by space and take first UUID
>>> DHT can split the value by space or comma and get unique UUIDs list
>>>
>>
>> This doesn't solve the problem I described in the previous email. Some
>> more logic will need to be added to avoid more than one node from each
>> replica-set to be active. If we have some explicit hierarchy information in
>> the node-uuid value, more decisions can be taken.
>>
>> An initial proposal I made was this:
>>
>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))
>>
>> This is harder to parse, but gives a lot of information: DHT with 2
>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
>> also easily extensible with any new xlator that changes the layout.
>>
>> However maybe this is not the moment to do this, and probably we could
>> implement this in a new xattr with a better name.
>>
>> Xavi
>>
>>
>>
>>> Another question is about the behavior when a node is down, existing
>>> node-uuid xattr will not return that UUID if a node is down. What is the
>>> behavior with the proposed xattr?
>>>
>>> Let me know your thoughts.
>>>
>>> regards
>>> Aravinda VK
>>>
>>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>>
 Hi Xavi,

 On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

> Hi Aravinda,
>
> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>
>> Adding more people to get a consensus about this.
>>
>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:
>>
>>
>> regards
>> Aravinda VK
>>
>>
>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>>
>> Hi Pranith,
>>
>> adding gluster-devel, Kotresh and Aravinda,
>>
>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>
>>
>>
>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:
>>
>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>
>> The way geo-replication works is:
>> On each machine, it does getxattr of node-uuid and
>> check if its
>> own uuid
>> is present in the list. If it is present then it
>> will consider
>> it active
>> otherwise it will be considered passive. With this
>> change we are
>> giving
>> all uuids instead of first-up subvolume. So all
>> machines think
>> they are
>> ACTIVE which is bad apparently. So that is the
>> reason. Even I
>> felt bad
>> that we are doing this change.
>>
>>
>> And what about 

Re: [Gluster-devel] brick multiplexing and memory consumption

2017-06-20 Thread Raghavendra Talur
On 21-Jun-2017 9:45 AM, "Jeff Darcy"  wrote:




On Tue, Jun 20, 2017, at 03:38 PM, Raghavendra Talur wrote:

Each process takes 795MB of virtual memory and resident memory is 10MB each.


Wow, that's even better than I thought.  I was seeing about a 3x difference
per brick (plus the fixed cost of a brick process) during development.
Your numbers suggest more than 10x.  Almost makes it seem worth the effort.
 ;)


:)


Just to be clear, I am not saying that brick multiplexing isn't working.
The aim is to prevent the glusterfsd process from getting OOM killed
because 200 bricks when multiplexed consume 20GB of virtual memory.


Yes, the OOM killer is more dangerous with multiplexing.  It likes to take
out the process that is the whole machine's reason for existence, which is
pretty darn dumb.  Perhaps we should use oom_adj/OOM_DISABLE to make it a
bit less dumb?


This is not so easy for container deployment models.


If it is found that the additional usage of 75MB of virtual memory per
every brick attach can't be removed/reduced, then the only solution would
be to fix issue 151 [1] by limiting multiplexed bricks.
[1] https://github.com/gluster/glusterfs/issues/151


This is another reason why limiting the number of brick processes is
preferable to limiting the number of bricks per process.  When we limit
bricks per process and wait until one is "full" before starting another,
then that first brick process remains a prime target for the OOM killer.
By "striping" bricks across N processes (where N ~= number of cores), none
of them become targets until we're approaching our system-wide brick limit
anyway.


+1, I now understand the reasoning behind limiting number of processes. I
was in favor of limiting bricks per process before.

Thanks,
Raghavendra Talur
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] brick multiplexing and memory consumption

2017-06-20 Thread Jeff Darcy



On Tue, Jun 20, 2017, at 03:38 PM, Raghavendra Talur wrote:
> Each process takes 795MB of virtual memory and resident memory is
> 10MB each.
Wow, that's even better than I thought.  I was seeing about a 3x
difference per brick (plus the fixed cost of a brick process) during
development.  Your numbers suggest more than 10x.  Almost makes it seem
worth the effort.  ;)
> Just to be clear, I am not saying that brick multiplexing isn't
> working. The aim is to prevent the glusterfsd process from getting
> OOM killed because 200 bricks when multiplexed consume 20GB of
> virtual memory.
Yes, the OOM killer is more dangerous with multiplexing.  It likes to
take out the process that is the whole machine's reason for existence,
which is pretty darn dumb.  Perhaps we should use oom_adj/OOM_DISABLE to
make it a bit less dumb?
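
(For reference, a sketch of that idea using the modern /proc interface; doing
this automatically from glusterd is only a suggestion at this point.)

    def shield_from_oom_killer(pid, score=-1000):
        # -1000 is OOM_SCORE_ADJ_MIN: it effectively disables OOM-killing
        # for the process. oom_score_adj is the current replacement for the
        # older oom_adj knob; writing it needs root/CAP_SYS_RESOURCE.
        with open("/proc/%d/oom_score_adj" % pid, "w") as f:
            f.write(str(score))
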
> If it is found that the additional usage of 75MB of virtual memory per
> every brick attach can't be removed/reduced, then the only solution
> would be to fix issue 151 [1] by limiting multiplexed bricks.
> [1] https://github.com/gluster/glusterfs/issues/151

This is another reason why limiting the number of brick processes is
preferable to limiting the number of bricks per process.  When we limit
bricks per process and wait until one is "full" before starting another,
then that first brick process remains a prime target for the OOM killer.
By "striping" bricks across N processes (where N ~= number of cores),
none of them become targets until we're approaching our system-wide
brick limit anyway.
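
(A minimal sketch of that "striping" placement, with invented names; the real
attach logic in glusterd is of course more involved.)

    import itertools

    def stripe_bricks(bricks, n_procs):
        # Spread bricks round-robin across N processes instead of filling
        # one multiplexed process before starting the next, so no single
        # glusterfsd becomes an outsized OOM-killer target.
        procs = [[] for _ in range(n_procs)]
        for proc, brick in zip(itertools.cycle(procs), bricks):
            proc.append(brick)
        return procs

    # 14 bricks across 4 processes -> sizes [4, 4, 3, 3]
    print([len(p) for p in stripe_bricks(["brick%d" % i for i in range(14)], 4)])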
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] brick multiplexing and memory consumption

2017-06-20 Thread Raghavendra Talur
On Tue, Jun 20, 2017 at 8:13 PM, Jeff Darcy  wrote:

>
>
>
> On Tue, Jun 20, 2017, at 08:45 AM, Raghavendra Talur wrote:
>
> Here is the data I gathered while debugging the considerable increase in
> memory consumption by brick process when brick multiplexing is on.
>
> before adding 14th brick to it: 3163 MB
> before glusterfs_graph_init is called   3171 (8  MB increase)
> io-stats init   3180 (9  MB increase)
> index  init 3181 (1  MB increase)
> bitrot-stub init3182 (1  MB increase)
> changelog  init 3206 (24 MB increase)
> posix  init 3230 (24 MB increase)
> glusterfs_autoscale_threads 3238 (8  MB increase)
> end of glusterfs_handle_attach
>
> Every brick attach is taking about 75 MB of virtual memory and it is
> consistent. Need help from respective xlator owners to confirm if init of
> those xlators really takes that much memory.
>
> This is all Virtual memory data, resident memory is very nicely at 40 MB
> after 14 bricks.
>
>
> Do you have the equivalent numbers for memory consumption of 14 bricks
> *without* multiplexing?
>


Each process takes 795MB of virtual memory and resident memory is 10MB each.

Just to be clear, I am not saying that brick multiplexing isn't working.
The aim is to prevent the glusterfsd process from getting OOM killed
because 200 bricks when multiplexed consume 20GB of virtual memory.

If it is found that the additional usage of 75MB of virtual memory per
every brick attach can't be removed/reduced, then the only solution would
be to fix issue 151 [1] by limiting multiplexed bricks.
[1] https://github.com/gluster/glusterfs/issues/151



>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda

On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to
go with the format Aravinda suggested for now; in future we want
some more changes for dht to detect which subvolume went down and came
back up, and at that time we will revisit the solution suggested by Xavi.


Susanth is doing the dht changes
Aravinda is doing geo-rep changes

Done. Geo-rep patch sent for review https://review.gluster.org/17582

--
Aravinda



Thanks to all of you guys for the discussions!

On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez wrote:


Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think the following format can be easily adopted by all components

UUIDs of a subvolume are separated by space and subvolumes are
separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"


While this is ok for current implementation, I think this can be
insufficient if there are more layers of xlators that need to
indicate some sort of grouping. Some representation that can
represent hierarchy would be better. For example: "(U1 U2) (U3
U4)" (we can use spaces or comma as a separator).


Geo-rep can split by "," and then split by space and take
first UUID
DHT can split the value by space or comma and get unique UUIDs
list


This doesn't solve the problem I described in the previous email.
Some more logic will need to be added to avoid more than one node
from each replica-set to be active. If we have some explicit
hierarchy information in the node-uuid value, more decisions can
be taken.

An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))

This is harder to parse, but gives a lot of information: DHT with
2 subvolumes, each subvolume is an AFR with replica 2 and no
arbiters. It's also easily extensible with any new xlator that
changes the layout.

However maybe this is not the moment to do this, and probably we
could implement this in a new xattr with a better name.

Xavi



Another question is about the behavior when a node is down,
existing
node-uuid xattr will not return that UUID if a node is down.
What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri
wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:



Re: [Gluster-devel] brick multiplexing and memory consumption

2017-06-20 Thread Jeff Darcy



On Tue, Jun 20, 2017, at 08:45 AM, Raghavendra Talur wrote:
> Here is the data I gathered while debugging the considerable increase
> in memory consumption by brick process when brick multiplexing is on.
>
> before adding 14th brick to it: 3163 MB
> before glusterfs_graph_init is called   3171 (8  MB increase)
> io-stats init   3180 (9  MB increase)
> index  init 3181 (1  MB increase)
> bitrot-stub init3182 (1  MB increase)
> changelog  init 3206 (24 MB increase)
> posix  init 3230 (24 MB increase)
> glusterfs_autoscale_threads 3238 (8  MB increase)
> end of glusterfs_handle_attach
>
> Every brick attach is taking about 75 MB of virtual memory and it is
> consistent. Need help from respective xlator owners to confirm if
> init of those xlators really takes that much memory.
>
> This is all Virtual memory data, resident memory is very nicely at 40
> MB after 14 bricks.
Do you have the equivalent numbers for memory consumption of 14 bricks
*without* multiplexing?

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Release 3.11.1: Scheduled for 20th of June

2017-06-20 Thread Shyam

Hi,

Release tagging has been postponed by a day to accommodate a fix for a 
regression that has been introduced between 3.11.0 and 3.11.1 (see [1] 
for details).


As a result 3.11.1 will be tagged on the 21st June as of now (further 
delays will be notified to the lists appropriately).


Thanks,
Shyam

[1] Bug awaiting fix: https://bugzilla.redhat.com/show_bug.cgi?id=1463250

"Releases are made better together"

On 06/06/2017 09:24 AM, Shyam wrote:

Hi,

It's time to prepare the 3.11.1 release, which falls on the 20th of
each month [4], and hence would be June-20th-2017 this time around.

This mail is to call out the following,

1) Are there any pending *blocker* bugs that need to be tracked for
3.11.1? If so mark them against the provided tracker [1] as blockers
for the release, or at the very least post them as a response to this
mail

2) Pending reviews in the 3.11 dashboard will be part of the release,
*iff* they pass regressions and have the review votes, so use the
dashboard [2] to check on the status of your patches to 3.11 and get
these going

3) Empty release notes are posted here [3], if there are any specific
call outs for 3.11 beyond bugs, please update the review, or leave a
comment in the review, for us to pick it up

Thanks,
Shyam/Kaushal

[1] Release bug tracker:
https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.11.1

[2] 3.11 review dashboard:
https://review.gluster.org/#/projects/glusterfs,dashboards/dashboard:3-11-dashboard


[3] Release notes WIP: https://review.gluster.org/17480

[4] Release calendar: https://www.gluster.org/community/release-schedule/
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 3.11.1: Scheduled for 20th of June

2017-06-20 Thread Shyam

On 06/20/2017 08:41 AM, Pranith Kumar Karampuri wrote:



On Tue, Jun 6, 2017 at 6:54 PM, Shyam wrote:

Hi,

It's time to prepare the 3.11.1 release, which falls on the 20th of
each month [4], and hence would be June-20th-2017 this time around.

This mail is to call out the following,

1) Are there any pending *blocker* bugs that need to be tracked for
3.11.1? If so mark them against the provided tracker [1] as blockers
for the release, or at the very least post them as a response to this
mail


I added https://bugzilla.redhat.com/show_bug.cgi?id=1463250 as blocker
just now for this release. We just completed the discussion about
solution on gluster-devel. We are hoping to get the patch in by EOD
tomorrow IST. This is a geo-rep regression we introduced because of
changing node-uuid behavior. My mistake :-(


I am postponing tagging the release till this regression is fixed, and 
from the looks of it, tagging will hence be done tomorrow.






2) Pending reviews in the 3.11 dashboard will be part of the release,
*iff* they pass regressions and have the review votes, so use the
dashboard [2] to check on the status of your patches to 3.11 and get
these going

3) Empty release notes are posted here [3], if there are any specific
call outs for 3.11 beyond bugs, please update the review, or leave a
comment in the review, for us to pick it up

Thanks,
Shyam/Kaushal

[1] Release bug tracker:
https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.11.1


[2] 3.11 review dashboard:

https://review.gluster.org/#/projects/glusterfs,dashboards/dashboard:3-11-dashboard



[3] Release notes WIP: https://review.gluster.org/17480


[4] Release calendar:
https://www.gluster.org/community/release-schedule/

___
maintainers mailing list
maintain...@gluster.org 
http://lists.gluster.org/mailman/listinfo/maintainers





--
Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 3.11.1: Scheduled for 20th of June

2017-06-20 Thread Shyam

On 06/20/2017 06:13 AM, Amar Tumballi wrote:



On Mon, Jun 19, 2017 at 11:01 PM, Shyam wrote:

3.11.1 Release tagging is tomorrow (20th June, 2017).

Here are some key things to do before we tag the release,

1) Regression failures: (@pranith, @maintainers)
  - Overall regression failures status on 3.11 since .0 release can
be seen here [1]
  - Tests of concern are:
- ./tests/basic/afr/add-brick-self-heal.t [2]
  @pranith, this seems to be failing more often now, do we know
why or any pointers?

- ./tests/encryption/crypt.t [3]
  @maintainers? This seems to have a higher incident of failures
recently on master (and just one instance on 3.11 branch). All are
cores, so possibly some other change is causing this. Any updates
from anyone on this?

2) Pending review queue: [4] (@poornima, @csaba, @soumya, @ravishankar)
  - There are some reviews that do not have CentOS (or NetBSD) votes
yet, and are present to be committed for over a week, I have kicked
off rechecks as appropriate for some. Patch owners please keep a
watch out for the same.

3) Backport status: (IOW, things backported to older released
branches should be present in the later ones (in this case ported to
3.8/3.10 to be present in 3.11))
  - This is clean as of today, pending merge of
https://review.gluster.org/17512 


Please consider https://review.gluster.org/#/c/17569/ and
https://review.gluster.org/#/c/17573/

This is requested from Kubernetes integration.


These are merged now. Considering this is an STM release, I am taking
the feature into a minor release.


Request you or Csaba post a release note as a comment @ 
https://review.gluster.org/#/c/17480/




Thanks.
Amar


Thanks,
Shyam

"Releases are made better together"

[1] All regression failures for 3.11.1 :

https://fstat.gluster.org/summary?start_date=2017-06-01_date=2017-06-20=release-3.11



[2] add-brick-self-heal.t failures:

https://fstat.gluster.org/failure/2?start_date=2017-06-01_date=2017-06-20=release-3.11



[3] crypt.t failure(s) on all branches:

https://fstat.gluster.org/failure/62?start_date=2017-06-01_date=2017-06-20=all



[4] Pending reviews needing attention:
https://review.gluster.org/#/q/status:open+starredby:srangana%2540redhat.com




On 06/06/2017 09:24 AM, Shyam wrote:

Hi,

It's time to prepare the 3.11.1 release, which falls on the 20th of
each month [4], and hence would be June-20th-2017 this time around.

This mail is to call out the following,

1) Are there any pending *blocker* bugs that need to be tracked for
3.11.1? If so mark them against the provided tracker [1] as blockers
for the release, or at the very least post them as a response to
this
mail

2) Pending reviews in the 3.11 dashboard will be part of the
release,
*iff* they pass regressions and have the review votes, so use the
dashboard [2] to check on the status of your patches to 3.11 and get
these going

3) Empty release notes are posted here [3], if there are any
specific
call outs for 3.11 beyond bugs, please update the review, or leave a
comment in the review, for us to pick it up

Thanks,
Shyam/Kaushal

[1] Release bug tracker:
https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.11.1


[2] 3.11 review dashboard:

https://review.gluster.org/#/projects/glusterfs,dashboards/dashboard:3-11-dashboard




[3] Release notes WIP: https://review.gluster.org/17480


[4] Release calendar:
https://www.gluster.org/community/release-schedule/


___
maintainers mailing list
maintain...@gluster.org 
http://lists.gluster.org/mailman/listinfo/maintainers





--
Amar Tumballi (amarts)

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Coverity covscan for 2017-06-20-0a8dac38 (master branch)

2017-06-20 Thread staticanalysis
GlusterFS Coverity covscan results are available from
http://download.gluster.org/pub/gluster/glusterfs/static-analysis/master/glusterfs-coverity/2017-06-20-0a8dac38
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] brick multiplexing and memory consumption

2017-06-20 Thread Raghavendra Talur
Hi,

Here is the data I gathered while debugging the considerable increase in
memory consumption by brick process when brick multiplexing is on.

before adding 14th brick to it: 3163 MB
before glusterfs_graph_init is called   3171 (8  MB increase)
io-stats init   3180 (9  MB increase)
index  init 3181 (1  MB increase)
bitrot-stub init3182 (1  MB increase)
changelog  init 3206 (24 MB increase)
posix  init 3230 (24 MB increase)
glusterfs_autoscale_threads 3238 (8  MB increase)
end of glusterfs_handle_attach

Every brick attach is taking about 75 MB of virtual memory and it is
consistent. Need help from respective xlator owners to confirm if init of
those xlators really takes that much memory.

This is all Virtual memory data, resident memory is very nicely at 40 MB
after 14 bricks.
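
(A rough back-of-the-envelope extrapolation of these numbers, assuming the
~75 MB growth per attach stays linear; it lands in the same ballpark as the
20 GB figure mentioned for 200 bricks.)

    # 3163 MB was measured just before the 14th attach, so treat everything
    # beyond 13 bricks' worth of growth as the fixed base cost.
    PER_BRICK_MB = 75
    BASE_MB = 3163 - 13 * PER_BRICK_MB

    for n in (14, 100, 200):
        print("%3d bricks -> ~%.1f GB virtual"
              % (n, (BASE_MB + n * PER_BRICK_MB) / 1024.0))
    # 200 bricks -> ~16.8 GB virtual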

Thanks,
Raghavendra Talur
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] New 'experimental' branch created for validating your ideas

2017-06-20 Thread Amar Tumballi
On Tue, Jun 20, 2017 at 5:42 PM, Hari Gowtham  wrote:

> What about the patches that are already posted against master and are
> yet to be reviewed?
>

If it's not merged in master, it is fine to merge it to experimental.


> Is it fine to post them here and continue with the work further here,
> or wait for these patches to get into master
> and then come to the experimental branch?
>

If a patch lands in the 'master' branch, I don't think there is any need to
send it to 'experimental'. The idea of the 'experimental' branch is to push
features/changes which don't yet have enough confidence to make it to the
master branch, and get them tested in the regression suite over time.

If it's not merged, send it to the experimental branch, get it merged, and if
anything extra needs to be done, work on it; once stable, send the
aggregated patch to the 'master' branch.


> Asking this question because the patch is there for a very long time.
>
Got it. And this is the very reason for getting the 'experimental' branch done.


> And it would be better to know what is the frequency in rebasing the
> master and the experimental branch.
>

'master' branch rebase to experimental will be done once in 6 months,
officially. I will try rebasing once a month, and if there are no
conflicts, I will force-push the branch; but for now I am thinking of a
6-month frequency. There has been a suggestion to keep a 3-month window
(similar to that of the release branch cut-off). Yet to finalize on that.


> And how has the permission to merge the patches in the experimental branch.
>
>
Assuming it is 'who' and not 'how': currently I (@amarts) have the
permission to merge patches.

Regards,
Amar


>
> On Tue, Jun 20, 2017 at 4:05 PM, Amar Tumballi 
> wrote:
> > All,
> >
> > As proposed earlier [1], the 'experimental' branch is now created and
> > active. Any submission to this branch is going to be accepted without too
> > much detailed review, and the focus will be to make sure overall design
> is
> > fine. I have put a deadline of a week max for reviewing and merging a
> patch
> > if there are no significant problems with the patch.
> >
> > I welcome everyone to use this branch as a test bed to validate your
> ideas,
> > so your patch gets merged, and the nightly regressions would run on it
> for
> > some time. Also we are planning to give out RPMs from this branch every
> > week, so some features which gets completed in experimental can be
> tested by
> > wider user-base, and once validated, can land in master branch and
> > subsequently next release.
> >
> > A note of caution: Getting a patch merged in experimental is in no way
> > guarantee to get your patch in any of the upstream glusterfs releases.
> The
> > author needs to submit the changes to 'master' branch to get the feature
> in
> > releases.
> >
> > The branch already has some experimental features like:
> >
> > metrics on 'fops' in every xlator layer, instead of only getting it with
> > io-stats.
> > latency related information is default.
> > few more metrics added in memory allocation checks.
> > All the above can be seen with sending SIGUSR2 signal to GlusterFS
> process.
> > (@ /tmp/glusterfs.* )
> >
> >
> > Regards,
> > Amar
> >
> > [1] - http://lists.gluster.org/pipermail/maintainers/2017-
> May/002644.html
> >
> > --
> > Amar Tumballi (amarts)
> >
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
> --
> Regards,
> Hari Gowtham.
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 3.11.1: Scheduled for 20th of June

2017-06-20 Thread Pranith Kumar Karampuri
On Tue, Jun 6, 2017 at 6:54 PM, Shyam  wrote:

> Hi,
>
> It's time to prepare the 3.11.1 release, which falls on the 20th of
> each month [4], and hence would be June-20th-2017 this time around.
>
> This mail is to call out the following,
>
> 1) Are there any pending *blocker* bugs that need to be tracked for
> 3.11.1? If so mark them against the provided tracker [1] as blockers
> for the release, or at the very least post them as a response to this
> mail
>

I added https://bugzilla.redhat.com/show_bug.cgi?id=1463250 as blocker just
now for this release. We just completed the discussion about solution on
gluster-devel. We are hoping to get the patch in by EOD tomorrow IST. This
is a geo-rep regression we introduced because of changing node-uuid
behavior. My mistake :-(


>
> 2) Pending reviews in the 3.11 dashboard will be part of the release,
> *iff* they pass regressions and have the review votes, so use the
> dashboard [2] to check on the status of your patches to 3.11 and get
> these going
>
> 3) Empty release notes are posted here [3], if there are any specific
> call outs for 3.11 beyond bugs, please update the review, or leave a
> comment in the review, for us to pick it up
>
> Thanks,
> Shyam/Kaushal
>
> [1] Release bug tracker: https://bugzilla.redhat.com/sh
> ow_bug.cgi?id=glusterfs-3.11.1
>
> [2] 3.11 review dashboard: https://review.gluster.org/#/p
> rojects/glusterfs,dashboards/dashboard:3-11-dashboard
>
> [3] Release notes WIP: https://review.gluster.org/17480
>
> [4] Release calendar: https://www.gluster.org/community/release-schedule/
> ___
> maintainers mailing list
> maintain...@gluster.org
> http://lists.gluster.org/mailman/listinfo/maintainers
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Pranith Kumar Karampuri
Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to go
with the format Aravinda suggested for now; in future we want some
more changes for dht to detect which subvolume went down and came back up,
and at that time we will revisit the solution suggested by Xavi.

Susanth is doing the dht changes
Aravinda is doing geo-rep changes

Thanks to all of you guys for the discussions!

On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez 
wrote:

> Hi Aravinda,
>
> On 20/06/17 12:42, Aravinda wrote:
>
>> I think the following format can be easily adopted by all components
>>
>> UUIDs of a subvolume are separated by space and subvolumes are separated
>> by comma
>>
>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>> respectively and
>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>
>> node-uuid can return "U1 U2,U3 U4"
>>
>
> While this is ok for current implementation, I think this can be
> insufficient if there are more layers of xlators that need to indicate
> some sort of grouping. Some representation that can represent hierarchy
> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or comma
> as a separator).
>
>
>> Geo-rep can split by "," and then split by space and take first UUID
>> DHT can split the value by space or comma and get unique UUIDs list
>>
>
> This doesn't solve the problem I described in the previous email. Some
> more logic will need to be added to avoid more than one node from each
> replica-set to be active. If we have some explicit hierarchy information in
> the node-uuid value, more decisions can be taken.
>
> An initial proposal I made was this:
>
> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))
>
> This is harder to parse, but gives a lot of information: DHT with 2
> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's
> also easily extensible with any new xlator that changes the layout.
>
> However maybe this is not the moment to do this, and probably we could
> implement this in a new xattr with a better name.
>
> Xavi
>
>
>
>> Another question is about the behavior when a node is down, existing
>> node-uuid xattr will not return that UUID if a node is down. What is the
>> behavior with the proposed xattr?
>>
>> Let me know your thoughts.
>>
>> regards
>> Aravinda VK
>>
>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>>> Hi Xavi,
>>>
>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>
 Hi Aravinda,

 On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

> Adding more people to get a consensus about this.
>
> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:
>
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
> Hi Pranith,
>
> adding gluster-devel, Kotresh and Aravinda,
>
> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>
>
>
> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:
>
> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>
> The way geo-replication works is:
> On each machine, it does getxattr of node-uuid and
> check if its
> own uuid
> is present in the list. If it is present then it
> will consider
> it active
> otherwise it will be considered passive. With this
> change we are
> giving
> all uuids instead of first-up subvolume. So all
> machines think
> they are
> ACTIVE which is bad apparently. So that is the
> reason. Even I
> felt bad
> that we are doing this change.
>
>
> And what about changing the content of node-uuid to
> include some
> sort of hierarchy ?
>
> for example:
>
> a single brick:
>
> NODE()
>
> AFR/EC:
>
> AFR[2](NODE(), NODE())
> EC[3,1](NODE(), NODE(), NODE())
>
> DHT:
>
> DHT[2](AFR[2](NODE(), NODE()),
> AFR[2](NODE(),
> NODE()))
>
> This gives a lot of information that can be used to
> take the
> appropriate decisions.
>
>
> I guess that is not backward compatible. Shall I CC
> 

Re: [Gluster-devel] New 'experimental' branch created for validating your ideas

2017-06-20 Thread Hari Gowtham
What about the patches that are already posted against master and are
yet to be reviewed?
Is it fine to post them here and continue with the work further here,
or wait for these patches to get into master
and then come to the experimental branch?
Asking this question because the patch is there for a very long time.

And it would be better to know what is the frequency in rebasing the
master and the experimental branch.
And how has the permission to merge the patches in the experimental branch.


On Tue, Jun 20, 2017 at 4:05 PM, Amar Tumballi  wrote:
> All,
>
> As proposed earlier [1], the 'experimental' branch is now created and
> active. Any submission to this branch is going to be accepted without too
> much detailed review, and the focus will be to make sure overall design is
> fine. I have put a deadline of a week max for reviewing and merging a patch
> if there are no significant problems with the patch.
>
> I welcome everyone to use this branch as a test bed to validate your ideas,
> so your patch gets merged, and the nightly regressions would run on it for
> some time. Also we are planning to give out RPMs from this branch every
> week, so some features which gets completed in experimental can be tested by
> wider user-base, and once validated, can land in master branch and
> subsequently next release.
>
> A note of caution: Getting a patch merged in experimental is in no way
> guarantee to get your patch in any of the upstream glusterfs releases. The
> author needs to submit the changes to 'master' branch to get the feature in
> releases.
>
> The branch already has some experimental features like:
>
> metrics on 'fops' in every xlator layer, instead of only getting it with
> io-stats.
> latency related information is default.
> few more metrics added in memory allocation checks.
> All the above can be seen with sending SIGUSR2 signal to GlusterFS process.
> (@ /tmp/glusterfs.* )
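
(Sketch of pulling such a dump, assuming a running glusterfs process and that
the experimental build writes its dump under /tmp/glusterfs.* on SIGUSR2.)

    import glob, os, signal, subprocess

    # Signal every glusterfs process on this node to dump its metrics
    # (fop counts, latency, allocation stats), then list the dump files.
    pids = subprocess.check_output(["pidof", "glusterfs"]).decode().split()
    for pid in pids:
        os.kill(int(pid), signal.SIGUSR2)
    print(glob.glob("/tmp/glusterfs.*"))
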
>
>
> Regards,
> Amar
>
> [1] - http://lists.gluster.org/pipermail/maintainers/2017-May/002644.html
>
> --
> Amar Tumballi (amarts)
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
Regards,
Hari Gowtham.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Aravinda,

On 20/06/17 12:42, Aravinda wrote:

I think following format can be easily adopted by all components

UUIDs of a subvolume are separated by space and subvolumes are separated
by comma

For example, node1 and node2 are replica with U1 and U2 UUIDs
respectively and
node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"


While this is ok for current implementation, I think this can be 
insufficient if there are more layers of xlators that need to 
indicate some sort of grouping. Some representation that can represent 
hierarchy would be better. For example: "(U1 U2) (U3 U4)" (we can use 
spaces or comma as a separator).




Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list


This doesn't solve the problem I described in the previous email. Some 
more logic will need to be added to avoid more than one node from each 
replica-set to be active. If we have some explicit hierarchy information 
in the node-uuid value, more decisions can be taken.


An initial proposal I made was this:

DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))

This is harder to parse, but gives a lot of information: DHT with 2 
subvolumes, each subvolume is an AFR with replica 2 and no arbiters. 
It's also easily extensible with any new xlator that changes the layout.
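
(To make "harder to parse" concrete: a small recursive-descent parser sketch
for this proposed syntax; the function name and the nested-tuple output shape
are invented for illustration.)

    import re

    def parse_layout(s):
        # Returns nested (name, params, children) tuples; NODE leaves keep
        # their UUID string as the only child.
        pos = 0

        def node():
            nonlocal pos
            m = re.match(r"\s*([A-Z]+)(?:\[([\d,]+)\])?\(", s[pos:])
            if not m:
                raise ValueError("bad layout near offset %d" % pos)
            name = m.group(1)
            params = [int(p) for p in m.group(2).split(",")] if m.group(2) else []
            pos += m.end()
            children = []
            if name == "NODE":
                end = s.index(")", pos)
                children.append(s[pos:end].strip())
                pos = end + 1
            else:
                while True:
                    children.append(node())
                    sep = re.match(r"\s*([,)])", s[pos:])
                    pos += sep.end()
                    if sep.group(1) == ")":
                        break
            return (name, params, children)

        return node()

    print(parse_layout(
        "DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U3), NODE(U4)))"))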


However maybe this is not the moment to do this, and probably we could 
implement this in a new xattr with a better name.


Xavi



Another question is about the behavior when a node is down, existing
node-uuid xattr will not return that UUID if a node is down. What is the
behavior with the proposed xattr?

Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE()

AFR/EC:

AFR[2](NODE(), NODE())
EC[3,1](NODE(), NODE(), NODE())

DHT:

DHT[2](AFR[2](NODE(), NODE()),
AFR[2](NODE(),
NODE()))

This gives a lot of information that can be used to
take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to the Geo-rep code as well if the
format is changed, since all these are built/shipped together.

Geo-rep uses node-id as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id exists in active_node_uuids
else False
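
(The pseudocode above made runnable, side by side with the split logic
proposed for the "U1 U2,U3 U4" format; names are illustrative, and the
all-zeros UUID for a down node follows the behaviour Karthik describes
elsewhere in the thread.)

    DOWN = "00000000-0000-0000-0000-000000000000"  # a down node reports zeros

    def is_active_flat(node_uuid_value, my_uuid):
        # Current geo-rep logic: membership in the space-separated list.
        return my_uuid in node_uuid_value.split()

    def is_active_grouped(node_uuid_value, my_uuid):
        # Proposed format: subvolumes separated by ",", replica members by
        # space; only the first UUID of each replica-set counts as ACTIVE.
        firsts = [grp.split()[0]
                  for grp in node_uuid_value.split(",") if grp.split()]
        return my_uuid in firsts

    def dht_unique_uuids(node_uuid_value):
        # DHT side: ignore the grouping and collect distinct live UUIDs.
        return sorted(set(node_uuid_value.replace(",", " ").split()) - {DOWN})

    v = "U1 U2,U3 U4"
    assert is_active_grouped(v, "U1") and not is_active_grouped(v, "U2")
    assert dht_unique_uuids("U1 %s,U3 U4" % DOWN) == ["U1", "U3", "U4"]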


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A
replicated volume is created using the following command:

gluster volume create test replica 2 server1:/brick1 server2:/brick1
server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2

In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

Old AFR 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Sunil Kumar Heggodu Gopala Acharya
EC also sends all zeros if the node is down.

Regards,

Sunil kumar Acharya

Senior Software Engineer

Red Hat



T: +91-8067935170 


TRIED. TESTED. TRUSTED. 
On Tue, Jun 20, 2017 at 4:27 PM, Karthik Subrahmanya 
wrote:

>
>
> On Tue, Jun 20, 2017 at 4:12 PM, Aravinda  wrote:
>
>> I think the following format can be easily adopted by all components
>>
>> UUIDs of a subvolume are separated by space and subvolumes are separated
>> by comma
>>
>> For example, node1 and node2 are replica with U1 and U2 UUIDs
>> respectively and
>> node3 and node4 are replica with U3 and U4 UUIDs respectively
>>
>> node-uuid can return "U1 U2,U3 U4"
>>
>> Geo-rep can split by "," and then split by space and take first UUID
>> DHT can split the value by space or comma and get unique UUIDs list
>>
>> Another question is about the behavior when a node is down, existing
>> node-uuid xattr will not return that UUID if a node is down.
>
> After the change [1], if a node is down we send all zeros as the uuid for
> that node, in the list of node uuids.
>
> [1] https://review.gluster.org/#/c/17084/
>
> Regards,
> Karthik
>
>> What is the behavior with the proposed xattr?
>>
>> Let me know your thoughts.
>>
>> regards
>> Aravinda VK
>>
>>
>> On 06/20/2017 03:06 PM, Aravinda wrote:
>>
>>> Hi Xavi,
>>>
>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>
 Hi Aravinda,

 On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

> Adding more people to get a consensus about this.
>
> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:
>
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
> Hi Pranith,
>
> adding gluster-devel, Kotresh and Aravinda,
>
> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>
>
>
> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:
>
> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>
> The way geo-replication works is:
> On each machine, it does getxattr of node-uuid and
> check if its
> own uuid
> is present in the list. If it is present then it
> will consider
> it active
> otherwise it will be considered passive. With this
> change we are
> giving
> all uuids instead of first-up subvolume. So all
> machines think
> they are
> ACTIVE which is bad apparently. So that is the
> reason. Even I
> felt bad
> that we are doing this change.
>
>
> And what about changing the content of node-uuid to
> include some
> sort of hierarchy ?
>
> for example:
>
> a single brick:
>
> NODE()
>
> AFR/EC:
>
> AFR[2](NODE(), NODE())
> EC[3,1](NODE(), NODE(), NODE())
>
> DHT:
>
> DHT[2](AFR[2](NODE(), NODE()),
> AFR[2](NODE(),
> NODE()))
>
> This gives a lot of information that can be used to
> take the
> appropriate decisions.
>
>
> I guess that is not backward compatible. Shall I CC
> gluster-devel and
> Kotresh/Aravinda?
>
>
> Is the change we did backward compatible ? if we only require
> the first field to be a GUID to support backward compatibility,
> we can use something like this:
>
> No. But the necessary change can be made to Geo-rep code as well if
> format is changed, since all these are built/shipped together.
>
> Geo-rep uses node-id as follows,
>
> list = listxattr(node-uuid)
> active_node_uuids = list.split(SPACE)
> active_node_flag = True if self.node_id exists in active_node_uuids
> else False
>

 How was this case solved ?

 suppose we have three servers and 2 bricks in each server. A replicated
 volume is created using the following command:

 gluster volume create test replica 2 server1:/brick1 server2:/brick1
 server2:/brick2 

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Karthik Subrahmanya
On Tue, Jun 20, 2017 at 4:12 PM, Aravinda  wrote:

> I think the following format can be easily adopted by all components
>
> UUIDs of a subvolume are separated by space and subvolumes are separated
> by comma
>
> For example, node1 and node2 are replica with U1 and U2 UUIDs respectively
> and
> node3 and node4 are replica with U3 and U4 UUIDs respectively
>
> node-uuid can return "U1 U2,U3 U4"
>
> Geo-rep can split by "," and then split by space and take first UUID
> DHT can split the value by space or comma and get unique UUIDs list
>
> Another question is about the behavior when a node is down, existing
> node-uuid xattr will not return that UUID if a node is down.

After the change [1], if a node is down we send all zeros as the uuid for
that node, in the list of node uuids.

[1] https://review.gluster.org/#/c/17084/
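
A minimal sketch of what this means for a consumer, assuming the
all-zero uuid convention described above (the sample uuids below are
hypothetical):

NULL_UUID = "00000000-0000-0000-0000-000000000000"

# Hypothetical getxattr result for a 3-node volume where node 2 is down.
uuids = ["uuid-of-node1", NULL_UUID, "uuid-of-node3"]
live_uuids = [u for u in uuids if u != NULL_UUID]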

Regards,
Karthik

> What is the behavior with the proposed xattr?
>
> Let me know your thoughts.
>
> regards
> Aravinda VK
>
>
> On 06/20/2017 03:06 PM, Aravinda wrote:
>
>> Hi Xavi,
>>
>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>
>>> Hi Aravinda,
>>>
>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>
 Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


 regards
 Aravinda VK


 On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

 Hi Pranith,

 adding gluster-devel, Kotresh and Aravinda,

 On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

 On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

 The way geo-replication works is:
 On each machine, it does getxattr of node-uuid and
 check if its
 own uuid
 is present in the list. If it is present then it
 will consider
 it active
 otherwise it will be considered passive. With this
 change we are
 giving
 all uuids instead of first-up subvolume. So all
 machines think
 they are
 ACTIVE which is bad apparently. So that is the
 reason. Even I
 felt bad
 that we are doing this change.


 And what about changing the content of node-uuid to
 include some
 sort of hierarchy ?

 for example:

 a single brick:

NODE(<guid>)

AFR/EC:

AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.


 I guess that is not backward compatible. Shall I CC
 gluster-devel and
 Kotresh/Aravinda?


 Is the change we did backward compatible ? if we only require
 the first field to be a GUID to support backward compatibility,
 we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, since all these are built/shipped together.

Geo-rep uses node-uuid as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id in active_node_uuids else False

>>>
>>> How was this case solved ?
>>>
>>> suppose we have three servers and 2 bricks in each server. A replicated
>>> volume is created using the following command:
>>>
>>> gluster volume create test replica 2 server1:/brick1 server2:/brick1
>>> server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2
>>>
>>> In this case we have three replica-sets:
>>>
>>> * server1:/brick1 server2:/brick1
>>> * server2:/brick2 server3:/brick1
>>> * server3:/brick2 server1:/brick2
>>>
>>> Old AFR implementation for node-uuid always returned the uuid of the
>>> node of the first brick, so in this case we will get the uuid of the three
>>> nodes because all of them are the first brick of a replica-set.
>>>
>>> Does this mean that with this configuration all nodes are active ? Is
>>> this a problem ? Is there any other check to avoid this situation if it's not good ?

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda

I think following format can be easily adopted by all components

UUIDs of a subvolume are separated by space and subvolumes are separated
by comma


For example, node1 and node2 are replica with U1 and U2 UUIDs 
respectively and

node3 and node4 are replica with U3 and U4 UUIDs respectively

node-uuid can return "U1 U2,U3 U4"

Geo-rep can split by "," and then split by space and take first UUID
DHT can split the value by space or comma and get unique UUIDs list
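
A short Python sketch of the two parsing rules above, using the
"U1 U2,U3 U4" example (U1..U4 stand in for real UUIDs):

import re

value = "U1 U2,U3 U4"  # one comma-separated group per subvolume

# Geo-rep: first UUID of each subvolume decides the Active worker.
first_uuids = [subvol.split()[0] for subvol in value.split(",")]
# -> ['U1', 'U3']

# DHT: only needs the unique UUID list, so split on space or comma.
unique_uuids = sorted(set(re.split(r"[ ,]+", value)))
# -> ['U1', 'U2', 'U3', 'U4']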

Another question is about the behavior when a node is down, existing 
node-uuid xattr will not return that UUID if a node is down. What is the 
behavior with the proposed xattr?


Let me know your thoughts.

regards
Aravinda VK

On 06/20/2017 03:06 PM, Aravinda wrote:

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE(<guid>)

AFR/EC:

AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, since all these are built/shipped together.

Geo-rep uses node-uuid as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id in active_node_uuids else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A 
replicated volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?
Yes, all Geo-rep workers will become Active and participate in syncing.
Since changelogs in replica bricks carry the same information, this
will lead to duplicate syncing and wasted network bandwidth.


Node-uuid based Active worker selection is the default configuration in
Geo-rep till now. Geo-rep also has Meta Volume based synchronization for
Active workers using lock files. (This can be opted into via Geo-rep
configuration; with this config node-uuid will not be used.)


Kotresh proposed a solution to configure which worker becomes Active.
This will give more control to the Admin to choose Active workers; this
will become the default configuration from 3.12.

https://github.com/gluster/glusterfs/issues/244

--
Aravinda



Xavi

[Gluster-devel] New 'experimental' branch created for validating your ideas

2017-06-20 Thread Amar Tumballi
All,

As proposed earlier [1], the 'experimental' branch is now created and
active. Any submission to this branch is going to be accepted without too
much detailed review, and the focus will be to make sure overall design is
fine. I have put a deadline of a week max for reviewing and merging a patch
if there are no significant problems with the patch.

I welcome everyone to use this branch as a test bed to validate your
ideas: your patch gets merged, and the nightly regressions run on it for
some time. Also, we are planning to give out RPMs from this branch every
week, so features which get completed in experimental can be tested by a
wider user base and, once validated, can land in the master branch and
subsequently in the next release.

A note of caution: Getting a patch merged in experimental is in no way a
guarantee that your patch will be in any of the upstream glusterfs
releases. The author needs to submit the changes to the 'master' branch
to get the feature into releases.

The branch already has some experimental features like:

   - metrics on 'fops' in every xlator layer, instead of only getting
   them with io-stats.
   - latency-related information enabled by default.
   - a few more metrics added to memory allocation checks.
   - All of the above can be seen by sending the SIGUSR2 signal to a
   GlusterFS process (dumped at /tmp/glusterfs.*); see the sketch below.
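
A small sketch of how such a dump could be triggered, assuming a Linux
host with pidof available and the /tmp/glusterfs.* output path mentioned
above (Python):

import glob
import os
import signal
import subprocess

# Send SIGUSR2 to every running glusterfs process ...
for pid in subprocess.check_output(["pidof", "glusterfs"]).split():
    os.kill(int(pid), signal.SIGUSR2)

# ... then list the dumps written under /tmp.
print(glob.glob("/tmp/glusterfs.*"))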


Regards,
Amar

[1] - http://lists.gluster.org/pipermail/maintainers/2017-May/002644.html

-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 3.11.1: Scheduled for 20th of June

2017-06-20 Thread Amar Tumballi
On Mon, Jun 19, 2017 at 11:01 PM, Shyam  wrote:

> 3.11.1 Release tagging is tomorrow (20th June, 2017).
>
> Here are some key things to do before we tag the release,
>
> 1) Regression failures: (@pranith, @maintainers)
>   - Overall regression failures status on 3.11 since .0 release can be
> seen here [1]
>   - Tests of concern are:
> - ./tests/basic/afr/add-brick-self-heal.t [2]
>   @pranith, this seems to be failing more often now, do we know why or
> any pointers?
>
> - ./tests/encryption/crypt.t [3]
>   @maintainers? This seems to have a higher incident of failures
> recently on master (and just one instance on 3.11 branch). All are cores,
> so possibly some other change is causing this. Any updates from anyone on
> this?
>
> 2) Pending review queue: [4] (@poornima, @csaba, @soumya, @ravishankar)
>   - There are some reviews that do not have CentOS (or NetBSD) votes yet
> and have been waiting to be committed for over a week; I have kicked off
> rechecks as appropriate for some. Patch owners, please keep a watch out
> for the same.
>
> 3) Backport status: (IOW, things backported to older released branches
> should be present in the later ones (in this case ported to 3.8/3.10 to be
> present in 3.11))
>   - This is clean as of today, pending merge of
> https://review.gluster.org/17512
>
>
Please consider https://review.gluster.org/#/c/17569/ and
https://review.gluster.org/#/c/17573/

This is requested by the Kubernetes integration.

Thanks.
Amar


> Thanks,
> Shyam
>
> "Releases are made better together"
>
> [1] All regression failures for 3.11.1 : https://fstat.gluster.org/summary?start_date=2017-06-01&end_date=2017-06-20&branch=release-3.11
>
> [2] add-brick-self-heal.t failures: https://fstat.gluster.org/failure/2?start_date=2017-06-01&end_date=2017-06-20&branch=release-3.11
>
> [3] crypt.t failure(s) on all branches: https://fstat.gluster.org/failure/62?start_date=2017-06-01&end_date=2017-06-20&branch=all
>
> [4] Pending reviews needing attention: https://review.gluster.org/#/q/status:open+starredby:srangana%2540redhat.com
>
>
> On 06/06/2017 09:24 AM, Shyam wrote:
>
>> Hi,
>>
>> It's time to prepare the 3.11.1 release, which falls on the 20th of
>> each month [4], and hence would be June-20th-2017 this time around.
>>
>> This mail is to call out the following,
>>
>> 1) Are there any pending *blocker* bugs that need to be tracked for
>> 3.11.1? If so mark them against the provided tracker [1] as blockers
>> for the release, or at the very least post them as a response to this
>> mail
>>
>> 2) Pending reviews in the 3.11 dashboard will be part of the release,
>> *iff* they pass regressions and have the review votes, so use the
>> dashboard [2] to check on the status of your patches to 3.11 and get
>> these going
>>
>> 3) Empty release notes are posted here [3], if there are any specific
>> call outs for 3.11 beyond bugs, please update the review, or leave a
>> comment in the review, for us to pick it up
>>
>> Thanks,
>> Shyam/Kaushal
>>
>> [1] Release bug tracker:
>> https://bugzilla.redhat.com/show_bug.cgi?id=glusterfs-3.11.1
>>
>> [2] 3.11 review dashboard:
>> https://review.gluster.org/#/projects/glusterfs,dashboards/dashboard:3-11-dashboard
>>
>>
>> [3] Release notes WIP: https://review.gluster.org/17480
>>
>> [4] Release calendar: https://www.gluster.org/community/release-schedule/
>>
> ___
> maintainers mailing list
> maintain...@gluster.org
> http://lists.gluster.org/mailman/listinfo/maintainers
>



-- 
Amar Tumballi (amarts)
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda

Hi Xavi,

On 06/20/2017 02:51 PM, Xavier Hernandez wrote:

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE(<guid>)

AFR/EC:

AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, since all these are built/shipped together.

Geo-rep uses node-uuid as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id in active_node_uuids else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A 
replicated volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?
Yes, all Geo-rep workers will become Active and participate in syncing.
Since changelogs in replica bricks carry the same information, this
will lead to duplicate syncing and wasted network bandwidth.


Node-uuid based Active worker selection is the default configuration in
Geo-rep till now. Geo-rep also has Meta Volume based synchronization for
Active workers using lock files. (This can be opted into via Geo-rep
configuration; with this config node-uuid will not be used.)


Kotresh proposed a solution to configure which worker becomes Active.
This will give more control to the Admin to choose Active workers; this
will become the default configuration from 3.12.

https://github.com/gluster/glusterfs/issues/244

--
Aravinda



Xavi





Bricks:

<guid>

AFR/EC:
<guid>(<guid>, <guid>)

DHT:
<guid>((<guid>, ...), (<guid>, ...))

In this case, AFR and EC would return the same <guid> they
returned before the patch, but between '(' and ')' they put the
full list of guid's of all nodes. The first <guid> can be used
by geo-replication. The list after the first <guid> can be used
for rebalance.

Not sure if there's any user of node-uuid above DHT.

Xavi




Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez wrote:

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Aravinda,

On 20/06/17 11:05, Pranith Kumar Karampuri wrote:

Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda wrote:


regards
Aravinda VK


On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and
check if its
own uuid
is present in the list. If it is present then it
will consider
it active
otherwise it will be considered passive. With this
change we are
giving
all uuids instead of first-up subvolume. So all
machines think
they are
ACTIVE which is bad apparently. So that is the
reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to
include some
sort of hierarchy ?

for example:

a single brick:

NODE(<guid>)

AFR/EC:

AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC
gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require
the first field to be a GUID to support backward compatibility,
we can use something like this:

No. But the necessary change can be made to Geo-rep code as well if
format is changed, since all these are built/shipped together.

Geo-rep uses node-uuid as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id in active_node_uuids else False


How was this case solved ?

suppose we have three servers and 2 bricks in each server. A replicated 
volume is created using the following command:


gluster volume create test replica 2 server1:/brick1 server2:/brick1 
server2:/brick2 server3:/brick1 server3:/brick2 server1:/brick2


In this case we have three replica-sets:

* server1:/brick1 server2:/brick1
* server2:/brick2 server3:/brick1
* server3:/brick2 server1:/brick2

Old AFR implementation for node-uuid always returned the uuid of the 
node of the first brick, so in this case we will get the uuid of the 
three nodes because all of them are the first brick of a replica-set.


Does this mean that with this configuration all nodes are active ? Is 
this a problem ? Is there any other check to avoid this situation if 
it's not good ?


Xavi





Bricks:

<guid>

AFR/EC:
<guid>(<guid>, <guid>)

DHT:
<guid>((<guid>, ...), (<guid>, ...))

In this case, AFR and EC would return the same <guid> they
returned before the patch, but between '(' and ')' they put the
full list of guid's of all nodes. The first <guid> can be used
by geo-replication. The list after the first <guid> can be used
for rebalance.

Not sure if there's any user of node-uuid above DHT.

Xavi




Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez wrote:

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Pranith Kumar Karampuri
Adding more people to get a consensus about this.

On Tue, Jun 20, 2017 at 1:49 PM, Aravinda  wrote:

>
> regards
> Aravinda VK
>
>
> On 06/20/2017 01:26 PM, Xavier Hernandez wrote:
>
>> Hi Pranith,
>>
>> adding gluster-devel, Kotresh and Aravinda,
>>
>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote:
>>
>>>
>>>
>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:
>>>
>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote:
>>>
>>> The way geo-replication works is:
>>> On each machine, it does getxattr of node-uuid and check if its
>>> own uuid
>>> is present in the list. If it is present then it will consider
>>> it active
>>> otherwise it will be considered passive. With this change we are
>>> giving
>>> all uuids instead of first-up subvolume. So all machines think
>>> they are
>>> ACTIVE which is bad apparently. So that is the reason. Even I
>>> felt bad
>>> that we are doing this change.
>>>
>>>
>>> And what about changing the content of node-uuid to include some
>>> sort of hierarchy ?
>>>
>>> for example:
>>>
>>> a single brick:
>>>
>>> NODE(<guid>)
>>>
>>> AFR/EC:
>>>
>>> AFR[2](NODE(<guid>), NODE(<guid>))
>>> EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>
>>> DHT:
>>>
>>> DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))
>>>
>>> This gives a lot of information that can be used to take the
>>> appropriate decisions.
>>>
>>>
>>> I guess that is not backward compatible. Shall I CC gluster-devel and
>>> Kotresh/Aravinda?
>>>
>>
>> Is the change we did backward compatible ? if we only require the first
>> field to be a GUID to support backward compatibility, we can use something
>> like this:
>>
> No. But the necessary change can be made to Geo-rep code as well if format
> is changed, since all these are built/shipped together.
>
> Geo-rep uses node-uuid as follows,
>
> list = listxattr(node-uuid)
> active_node_uuids = list.split(SPACE)
> active_node_flag = True if self.node_id in active_node_uuids else False
>
>
>
>> Bricks:
>>
>> <guid>
>>
>> AFR/EC:
>> <guid>(<guid>, <guid>)
>>
>> DHT:
>> <guid>((<guid>, ...), (<guid>, ...))
>>
>> In this case, AFR and EC would return the same <guid> they returned
>> before the patch, but between '(' and ')' they put the full list of guid's
>> of all nodes. The first <guid> can be used by geo-replication. The list
>> after the first <guid> can be used for rebalance.
>>
>> Not sure if there's any user of node-uuid above DHT.
>>
>> Xavi
>>
>>
>>>
>>>
>>> Xavi
>>>
>>>
>>> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez wrote:
>>>
>>> Hi Pranith,
>>>
>>> On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>>>
>>> hi Xavi,
>>>    We all made the mistake of not sending a mail about changing the
>>> behavior of the node-uuid xattr so that rebalance can use multiple
>>> nodes for doing rebalance. Because of this, on geo-rep all the workers
>>> are becoming active instead of one per EC/AFR subvolume. So we are
>>> frantically trying to restore the functionality of node-uuid and
>>> introduce a new xattr for the new behavior. Sunil will be sending out
>>> a patch for this.
>>>
>>>
>>> Wouldn't it be better to change geo-rep behavior to use the new data?
>>> I think it's better as it is now, since it gives more information to
>>> upper layers so that they can take more accurate decisions.
>>>
>>> Xavi
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>
>>
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] about deduplication feature

2017-06-20 Thread Pranith Kumar Karampuri
On Tue, Jun 20, 2017 at 7:29 AM, Li, Dan  wrote:

> Hi, all
>
> We are using GlusterFS to construct our distributed filesystem.
> Does GlusterFS have the deduplication feature on volumes?
>
> Will you support it in the future?
>

hi Lidan,
  At the moment, GlusterFS doesn't have any deduplication feature.
There were plans to do it some time back with reflinks, but nothing concrete
happened. So we don't know if it will be supported in the future or not. Some
people use other software at different layers underneath to achieve this.


> Thanks,
>
> Lidan
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Aravinda


regards
Aravinda VK

On 06/20/2017 01:26 PM, Xavier Hernandez wrote:

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and check if its
own uuid
is present in the list. If it is present then it will consider
it active
otherwise it will be considered passive. With this change we are
giving
all uuids instead of first-up subvolume. So all machines think
they are
ACTIVE which is bad apparently. So that is the reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to include some
sort of hierarchy ?

for example:

a single brick:

NODE(<guid>)

AFR/EC:

AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.


I guess that is not backward compatible. Shall I CC gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require the 
first field to be a GUID to support backward compatibility, we can use 
something like this:
No. But the necessary change can be made to Geo-rep code as well if 
format is changed, since all these are built/shipped together.


Geo-rep uses node-uuid as follows,

list = listxattr(node-uuid)
active_node_uuids = list.split(SPACE)
active_node_flag = True if self.node_id in active_node_uuids else False




Bricks:

<guid>

AFR/EC:
<guid>(<guid>, <guid>)

DHT:
<guid>((<guid>, ...), (<guid>, ...))

In this case, AFR and EC would return the same <guid> they returned
before the patch, but between '(' and ')' they put the full list of
guid's of all nodes. The first <guid> can be used by geo-replication.
The list after the first <guid> can be used for rebalance.


Not sure if there's any user of node-uuid above DHT.

Xavi





Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez wrote:

Hi Pranith,

On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

hi Xavi,
   We all made the mistake of not sending about 
changing

behavior of
node-uuid xattr so that rebalance can use multiple nodes
for doing
rebalance. Because of this on geo-rep all the workers
are becoming
active instead of one per EC/AFR subvolume. So we are
frantically trying
to restore the functionality of node-uuid and introduce
a new
xattr for
the new behavior. Sunil will be sending out a patch for
this.


Wouldn't it be better to change geo-rep behavior to use the new data?
I think it's better as it is now, since it gives more information to
upper layers so that they can take more accurate decisions.


Xavi


--
Pranith





--
Pranith





--
Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] geo-rep regression because of node-uuid change

2017-06-20 Thread Xavier Hernandez

Hi Pranith,

adding gluster-devel, Kotresh and Aravinda,

On 20/06/17 09:45, Pranith Kumar Karampuri wrote:



On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez wrote:

On 20/06/17 09:31, Pranith Kumar Karampuri wrote:

The way geo-replication works is:
On each machine, it does getxattr of node-uuid and check if its
own uuid
is present in the list. If it is present then it will consider
it active
otherwise it will be considered passive. With this change we are
giving
all uuids instead of first-up subvolume. So all machines think
they are
ACTIVE which is bad apparently. So that is the reason. Even I
felt bad
that we are doing this change.


And what about changing the content of node-uuid to include some
sort of hierarchy ?

for example:

a single brick:

NODE(<guid>)

AFR/EC:

AFR[2](NODE(<guid>), NODE(<guid>))
EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))

DHT:

DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), AFR[2](NODE(<guid>), NODE(<guid>)))

This gives a lot of information that can be used to take the
appropriate decisions.
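
As an illustration only (not an existing GlusterFS API), the proposed
descriptors could be composed like this, with G1..G4 as placeholder
guids:

def node(guid):
    return "NODE(%s)" % guid

def afr(children):
    return "AFR[%d](%s)" % (len(children), ", ".join(children))

def dht(children):
    return "DHT[%d](%s)" % (len(children), ", ".join(children))

print(dht([afr([node("G1"), node("G2")]),
           afr([node("G3"), node("G4")])]))
# DHT[2](AFR[2](NODE(G1), NODE(G2)), AFR[2](NODE(G3), NODE(G4)))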


I guess that is not backward compatible. Shall I CC gluster-devel and
Kotresh/Aravinda?


Is the change we did backward compatible ? if we only require the first 
field to be a GUID to support backward compatibility, we can use 
something like this:


Bricks:

<guid>

AFR/EC:
<guid>(<guid>, <guid>)

DHT:
<guid>((<guid>, ...), (<guid>, ...))

In this case, AFR and EC would return the same <guid> they returned
before the patch, but between '(' and ')' they put the full list of
guid's of all nodes. The first <guid> can be used by geo-replication.
The list after the first <guid> can be used for rebalance.
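
A rough sketch of how such a value could be built and read back,
assuming the reconstructed format above (G1..G4 are placeholder guids;
this is an illustration, not the actual patch):

def compose(first, children):
    # "<first>(<child>, <child>, ...)" -- the leading token stays a bare
    # guid, so pre-patch consumers that read only the first field work.
    return "%s(%s)" % (first, ", ".join(children))

afr1 = compose("G1", ["G1", "G2"])             # "G1(G1, G2)"
afr2 = compose("G3", ["G3", "G4"])             # "G3(G3, G4)"
dht = compose("G1", ["(G1, G2)", "(G3, G4)"])  # "G1((G1, G2), (G3, G4))"

legacy_guid = dht.split("(", 1)[0]             # "G1" -- what geo-rep reads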


Not sure if there's any user of node-uuid above DHT.

Xavi





Xavi


On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez wrote:

Hi Pranith,

On 20/06/17 07:53, Pranith Kumar Karampuri wrote:

hi Xavi,
   We all made the mistake of not sending a mail about changing the
behavior of the node-uuid xattr so that rebalance can use multiple nodes
for doing rebalance. Because of this, on geo-rep all the workers are
becoming active instead of one per EC/AFR subvolume. So we are
frantically trying to restore the functionality of node-uuid and
introduce a new xattr for the new behavior. Sunil will be sending out a
patch for this.


Wouldn't it be better to change geo-rep behavior to use the new data?
I think it's better as it is now, since it gives more information to
upper layers so that they can take more accurate decisions.

Xavi


--
Pranith





--
Pranith





--
Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Self-heal on read-only volumes

2017-06-20 Thread Xavier Hernandez

Hi Karthik,

thanks for the information.

Xavi

On 16/06/17 13:25, Karthik Subrahmanya wrote:

Hi Xavi,

In my opinion it cannot be called a bug; it is more of an
improvement to the read-only and WORM translators.
The solution to this is to identify the internal FOPs and allow them
to pass even when the read-only or WORM options are enabled.
The patch [1] from Kotresh resolves this issue, which is currently under
review.

[1] https://review.gluster.org/#/c/16855/

Regards,
Karthik

On Fri, Jun 16, 2017 at 4:26 PM, Pranith Kumar Karampuri wrote:

I remember either Kotresh or Karthik recently sent patches to do
something similar. Adding them to check if they know something about this.

On Fri, Jun 16, 2017 at 1:25 PM, Xavier Hernandez wrote:

Hi,

currently it seems that a read-only replica 2 volume cannot be
healed because all attempts to make changes by the self-heal
daemon on the damaged brick will fail with EROFS.

It's true that no regular writes are allowed, so there's no
possibility to cause damage by partial writes or similar things.
However a read-only brick can still fail because of disk errors
and some files could get corrupted or the entire disk will need
to be replaced.

Is this a bug or the only way to solve this problem is to make
the volume read-write until self-heal finishes ?

Thanks,

Xavi
___
Gluster-devel mailing list
Gluster-devel@gluster.org 
http://lists.gluster.org/mailman/listinfo/gluster-devel





--
Pranith




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-20 Thread Krutika Dhananjay
Apologies. Pressed 'send' even before I was done.

On Tue, Jun 20, 2017 at 11:39 AM, Krutika Dhananjay 
wrote:

> Some update on this topic:
>
> I ran fio again, this time with Raghavendra's epoll-rearm patch @
> https://review.gluster.org/17391
>
> The IOPs increased to ~50K (from 38K).
> Avg READ latency as seen by the io-stats translator that sits above
> client-io-threads came down to 963us (from 1666us).
> ∆ (2,3) is down to 804us.
> The disk utilization didn't improve.
>

From code reading, it appears there is some serialization between POLLIN,
POLLOUT and POLLERR events for a given socket because of
socket_private->lock which they all contend for.

Discussed the same with Raghavendra G.
(I think he already alluded to the same point in this thread earlier.)
Let me make some quick dirty changes to see if fixing this serialization
improves performance further and I'll update the thread accordingly.

-Krutika


>
>
> On Sat, Jun 10, 2017 at 12:47 AM, Manoj Pillai  wrote:
>
>> So comparing the key latency, ∆ (2,3), in the two cases:
>>
>> iodepth=1: 171 us
>> iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
>> wonder if that relation roughly holds up for other values of iodepth).
>>
>> This data doesn't conclusively establish that the problem is in gluster.
>> You'd see similar results if the network were saturated, like Vijay
>> suggested. But from what I remember of this test, the throughput here is
>> far too low for that to be the case.
>>
>> -- Manoj
>>
>>
>> On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay 
>> wrote:
>>
>>> Indeed the latency on the client side dropped with iodepth=1. :)
>>> I ran the test twice and the results were consistent.
>>>
>>> Here are the exact numbers:
>>>
>>> *Translator Position*   *Avg Latency of READ fop as
>>> seen by this translator*
>>>
>>> 1. parent of client-io-threads        437us
>>>
>>> ∆ (1,2) = 69us
>>>
>>> 2. parent of protocol/client-0        368us
>>>
>>> ∆ (2,3) = 171us
>>>
>>> - end of client stack -
>>> - beginning of brick stack --
>>>
>>> 3. child of protocol/server   197us
>>>
>>> ∆ (3,4) = 4us
>>>
>>> 4. parent of io-threads               193us
>>>
>>> ∆ (4,5) = 32us
>>>
>>> 5. child-of-io-threads  161us
>>>
>>> ∆ (5,6) = 11us
>>>
>>> 6. parent of storage/posix   150us
>>> ...
>>>  end of brick stack 
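
For reference, the ∆ values above are just pairwise differences of the
per-position averages; a quick Python sketch with the numbers quoted in
this mail:

# (io-stats position, avg READ latency in microseconds)
lat = [("parent of client-io-threads", 437),
       ("parent of protocol/client-0", 368),
       ("child of protocol/server",    197),
       ("parent of io-threads",        193),
       ("child of io-threads",         161),
       ("parent of storage/posix",     150)]

for (a, x), (b, y) in zip(lat, lat[1:]):
    print("delta(%s -> %s) = %dus" % (a, b, x - y))
# -> 69us, 171us, 4us, 32us, 11us, matching the deltas above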
>>>
>>> Will continue reading code and get back when I find sth concrete.
>>>
>>> -Krutika
>>>
>>>
>>> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai 
>>> wrote:
>>>
 Thanks. So I was suggesting a repeat of the test but this time with
 iodepth=1 in the fio job. If reducing the no. of concurrent requests
 drastically reduces the high latency you're seeing from the client side,
that would strengthen the hypothesis that serialization/contention among
 concurrent requests at the n/w layers is the root cause here.

 -- Manoj


 On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay  wrote:

> Hi,
>
> This is what my job file contains:
>
> [global]
> ioengine=libaio
> #unified_rw_reporting=1
> randrepeat=1
> norandommap=1
> group_reporting
> direct=1
> runtime=60
> thread
> size=16g
>
>
> [workload]
> bs=4k
> rw=randread
> iodepth=8
> numjobs=1
> file_service_type=random
> filename=/perf5/iotest/fio_5
> filename=/perf6/iotest/fio_6
> filename=/perf7/iotest/fio_7
> filename=/perf8/iotest/fio_8
>
> I have 3 vms reading from one mount, and each of these vms is running
> the above job in parallel.
>
> -Krutika
>
> On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai wrote:
>
>>
>>
>> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay <
>> kdhan...@redhat.com> wrote:
>>
>>> Hi,
>>>
>>> As part of identifying performance bottlenecks within gluster stack
>>> for VM image store use-case, I loaded io-stats at multiple points on the
>>> client and brick stack and ran randrd test using fio from within the 
>>> hosted
>>> vms in parallel.
>>>
>>> Before I get to the results, a little bit about the configuration ...
>>>
>>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>>> direct-io.
>>> 3 FUSE clients, one per node in the cluster (which implies reads are
>>> served from the replica that is local to the client).
>>>
>>> io-stats was loaded at the following places:
>>> On the client stack: Above client-io-threads and above
>>> protocol/client-0 (the first child of AFR).
>>> On the brick stack: Below protocol/server, above and below io-threads
>>> and just above storage/posix.

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-20 Thread Krutika Dhananjay
Some update on this topic:

I ran fio again, this time with Raghavendra's epoll-rearm patch @
https://review.gluster.org/17391

The IOPs increased to ~50K (from 38K).
Avg READ latency as seen by the io-stats translator that sits above
client-io-threads came down to 963us (from 1666us).
∆ (2,3) is down to 804us.
The disk utilization didn't improve.



On Sat, Jun 10, 2017 at 12:47 AM, Manoj Pillai  wrote:

> So comparing the key latency, ∆ (2,3), in the two cases:
>
> iodepth=1: 171 us
> iodepth=8: 1453 us (in the ballpark of 171*8=1368). That's not good! (I
> wonder if that relation roughly holds up for other values of iodepth).
>
> This data doesn't conclusively establish that the problem is in gluster.
> You'd see similar results if the network were saturated, like Vijay
> suggested. But from what I remember of this test, the throughput here is
> far too low for that to be the case.
>
> -- Manoj
>
>
> On Thu, Jun 8, 2017 at 6:37 PM, Krutika Dhananjay 
> wrote:
>
>> Indeed the latency on the client side dropped with iodepth=1. :)
>> I ran the test twice and the results were consistent.
>>
>> Here are the exact numbers:
>>
>> *Translator Position*   *Avg Latency of READ fop as
>> seen by this translator*
>>
>> 1. parent of client-io-threads        437us
>>
>> ∆ (1,2) = 69us
>>
>> 2. parent of protocol/client-0        368us
>>
>> ∆ (2,3) = 171us
>>
>> - end of client stack -
>> - beginning of brick stack --
>>
>> 3. child of protocol/server   197us
>>
>> ∆ (3,4) = 4us
>>
>> 4. parent of io-threads               193us
>>
>> ∆ (4,5) = 32us
>>
>> 5. child-of-io-threads  161us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix   150us
>> ...
>>  end of brick stack 
>>
>> Will continue reading code and get back when I find sth concrete.
>>
>> -Krutika
>>
>>
>> On Thu, Jun 8, 2017 at 12:22 PM, Manoj Pillai  wrote:
>>
>>> Thanks. So I was suggesting a repeat of the test but this time with
>>> iodepth=1 in the fio job. If reducing the no. of concurrent requests
>>>  drastically reduces the high latency you're seeing from the client side,
>>> that would strengthen the hypothesis that serialization/contention among
>>> concurrent requests at the n/w layers is the root cause here.
>>>
>>> -- Manoj
>>>
>>>
>>> On Thu, Jun 8, 2017 at 11:46 AM, Krutika Dhananjay 
>>> wrote:
>>>
 Hi,

 This is what my job file contains:

 [global]
 ioengine=libaio
 #unified_rw_reporting=1
 randrepeat=1
 norandommap=1
 group_reporting
 direct=1
 runtime=60
 thread
 size=16g


 [workload]
 bs=4k
 rw=randread
 iodepth=8
 numjobs=1
 file_service_type=random
 filename=/perf5/iotest/fio_5
 filename=/perf6/iotest/fio_6
 filename=/perf7/iotest/fio_7
 filename=/perf8/iotest/fio_8

 I have 3 vms reading from one mount, and each of these vms is running
 the above job in parallel.

 -Krutika

 On Tue, Jun 6, 2017 at 9:14 PM, Manoj Pillai 
 wrote:

>
>
> On Tue, Jun 6, 2017 at 5:05 PM, Krutika Dhananjay wrote:
>
>> Hi,
>>
>> As part of identifying performance bottlenecks within gluster stack
>> for VM image store use-case, I loaded io-stats at multiple points on the
>> client and brick stack and ran randrd test using fio from within the 
>> hosted
>> vms in parallel.
>>
>> Before I get to the results, a little bit about the configuration ...
>>
>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>> direct-io.
>> 3 FUSE clients, one per node in the cluster (which implies reads are
>> served from the replica that is local to the client).
>>
>> io-stats was loaded at the following places:
>> On the client stack: Above client-io-threads and above
>> protocol/client-0 (the first child of AFR).
>> On the brick stack: Below protocol/server, above and below io-threads
>> and just above storage/posix.
>>
>> Based on a 60-second run of randrd test and subsequent analysis of
>> the stats dumped by the individual io-stats instances, the following is
>> what I found:
>>
>> *Translator Position*   *Avg Latency of READ
>> fop as seen by this translator*
>>
>> 1. parent of client-io-threads        1666us
>>
>> ∆ (1,2) = 50us
>>
>> 2. parent of protocol/client-0        1616us
>>
>> ∆ (2,3) = 1453us
>>
>> - end of client stack -
>> - beginning of brick stack ---
>>
>> 3. child of protocol/server