Xavi, Now that the change has been reverted, we can resume this discussion and decide on the exact format that considers, tier, dht, afr, ec. People working geo-rep/dht/afr/ec had an internal discussion and we all agreed that this proposal would be a good way forward. I think once we agree on the format and decide on the initial encoding/decoding functions of the xattr and this change is merged, we can send patches on afr/ec/dht and geo-rep to take it to closure.
Could you propose the new format you have in mind that considers all of the xlators? On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya <ksubr...@redhat.com> wrote: > > > On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez <xhernan...@datalab.es> > wrote: > >> That's ok. I'm currently unable to write a patch for this on ec. > > Sunil is working on this patch. > > ~Karthik > >> If no one can do it, I can try to do it in 6 - 7 hours... >> >> Xavi >> >> >> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri < >> pkara...@redhat.com> wrote: >> >> >> >> >> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez <xhernan...@datalab.es> >> wrote: >>> >>> I'm ok with reverting node-uuid content to the previous format and >>> create a new xattr for the new format. Currently, only rebalance will use >>> it. >>> >>> Only thing to consider is what can happen if we have a half upgraded >>> cluster where some clients have this change and some not. Can rebalance >>> work in this situation ? if so, could there be any issue ? >> >> >> I think there shouldn't be any problem, because this is in-memory xattr >> so layers below afr/ec will only see node-uuid xattr. >> This also gives us a chance to do whatever we want to do in future with >> this xattr without any problems about backward compatibility. >> >> You can check https://review.gluster.org/#/c >> /17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507 for how karthik >> implemented this in AFR (this got merged accidentally yesterday, but looks >> like this is what we are settling on) >> >> >>> >>> Xavi >>> >>> >>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri < >>> pkara...@redhat.com> wrote: >>> >>> >>> >>> >>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran < >>> nbala...@redhat.com> wrote: >>>> >>>> >>>> On 20 June 2017 at 20:38, Aravinda <avish...@redhat.com> wrote: >>>>> >>>>> On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote: >>>>> >>>>> Xavi, Aravinda and I had a discussion on #gluster-dev and we agreed to >>>>> go with the format Aravinda suggested for now and in future we wanted some >>>>> more changes for dht to detect which subvolume went down came back up, at >>>>> that time we will revisit the solution suggested by Xavi. >>>>> >>>>> Susanth is doing the dht changes >>>>> Aravinda is doing geo-rep changes >>>>> >>>>> Done. Geo-rep patch sent for review https://review.gluster.org/17582 >>>>> >>>>> >>>> >>>> The proposed changes to the node-uuid behaviour (while good) are going >>>> to break tiering . Tiering changes will take a little more time to be coded >>>> and tested. >>>> >>>> As this is a regression for 3.11 and a blocker for 3.11.1, I suggest we >>>> go back to the original node-uuid behaviour for now so as to unblock the >>>> release and target the proposed changes for the next 3.11 releases. >>>> >>> >>> Let me see if I understand the changes correctly. We are restoring the >>> behavior of node-uuid xattr and adding a new xattr for parallel rebalance >>> for both afr and ec, correct? Otherwise that is one more regression. If >>> yes, we will also wait for Xavi's inputs. Jeff accidentally merged the afr >>> patch yesterday which does these changes. If everyone is in agreement, we >>> will leave it as is and add similar changes in ec as well. If we are not in >>> agreement, then we will let the discussion progress :-) >>> >>> >>>> >>>> >>>> Regards, >>>> Nithya >>>> >>>>> -- >>>>> Aravinda >>>>> >>>>> >>>>> >>>>> Thanks to all of you guys for the discussions! >>>>> >>>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez < >>>>> xhernan...@datalab.es> wrote: >>>>>> >>>>>> Hi Aravinda, >>>>>> >>>>>> On 20/06/17 12:42, Aravinda wrote: >>>>>>> >>>>>>> I think following format can be easily adopted by all components >>>>>>> >>>>>>> UUIDs of a subvolume are seperated by space and subvolumes are >>>>>>> separated >>>>>>> by comma >>>>>>> >>>>>>> For example, node1 and node2 are replica with U1 and U2 UUIDs >>>>>>> respectively and >>>>>>> node3 and node4 are replica with U3 and U4 UUIDs respectively >>>>>>> >>>>>>> node-uuid can return "U1 U2,U3 U4" >>>>>> >>>>>> >>>>>> While this is ok for current implementation, I think this can be >>>>>> insufficient if there are more layers of xlators that require to indicate >>>>>> some sort of grouping. Some representation that can represent hierarchy >>>>>> would be better. For example: "(U1 U2) (U3 U4)" (we can use spaces or >>>>>> comma >>>>>> as a separator). >>>>>> >>>>>>> >>>>>>> >>>>>>> Geo-rep can split by "," and then split by space and take first UUID >>>>>>> DHT can split the value by space or comma and get unique UUIDs list >>>>>> >>>>>> >>>>>> This doesn't solve the problem I described in the previous email. >>>>>> Some more logic will need to be added to avoid more than one node from >>>>>> each >>>>>> replica-set to be active. If we have some explicit hierarchy information >>>>>> in >>>>>> the node-uuid value, more decisions can be taken. >>>>>> >>>>>> An initial proposal I made was this: >>>>>> >>>>>> DHT[2](AFR[2,0](NODE(U1), NODE(U2)), AFR[2,0](NODE(U1), NODE(U2))) >>>>>> >>>>>> This is harder to parse, but gives a lot of information: DHT with 2 >>>>>> subvolumes, each subvolume is an AFR with replica 2 and no arbiters. It's >>>>>> also easily extensible with any new xlator that changes the layout. >>>>>> >>>>>> However maybe this is not the moment to do this, and probably we >>>>>> could implement this in a new xattr with a better name. >>>>>> >>>>>> Xavi >>>>>> >>>>>>> >>>>>>> >>>>>>> Another question is about the behavior when a node is down, existing >>>>>>> node-uuid xattr will not return that UUID if a node is down. What is >>>>>>> the >>>>>>> behavior with the proposed xattr? >>>>>>> >>>>>>> Let me know your thoughts. >>>>>>> >>>>>>> regards >>>>>>> Aravinda VK >>>>>>> >>>>>>> On 06/20/2017 03:06 PM, Aravinda wrote: >>>>>>>> >>>>>>>> Hi Xavi, >>>>>>>> >>>>>>>> On 06/20/2017 02:51 PM, Xavier Hernandez wrote: >>>>>>>>> >>>>>>>>> Hi Aravinda, >>>>>>>>> >>>>>>>>> On 20/06/17 11:05, Pranith Kumar Karampuri wrote: >>>>>>>>>> >>>>>>>>>> Adding more people to get a consensus about this. >>>>>>>>>> >>>>>>>>>> On Tue, Jun 20, 2017 at 1:49 PM, Aravinda <avish...@redhat.com >>>>>>>>>> <mailto:avish...@redhat.com>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> regards >>>>>>>>>> Aravinda VK >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 06/20/2017 01:26 PM, Xavier Hernandez wrote: >>>>>>>>>> >>>>>>>>>> Hi Pranith, >>>>>>>>>> >>>>>>>>>> adding gluster-devel, Kotresh and Aravinda, >>>>>>>>>> >>>>>>>>>> On 20/06/17 09:45, Pranith Kumar Karampuri wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Jun 20, 2017 at 1:12 PM, Xavier Hernandez >>>>>>>>>> <xhernan...@datalab.es <mailto:xhernan...@datalab.es> >>>>>>>>>> <mailto:xhernan...@datalab.es >>>>>>>>>> <mailto:xhernan...@datalab.es>>> wrote: >>>>>>>>>> >>>>>>>>>> On 20/06/17 09:31, Pranith Kumar Karampuri wrote: >>>>>>>>>> >>>>>>>>>> The way geo-replication works is: >>>>>>>>>> On each machine, it does getxattr of >>>>>>>>>> node-uuid and >>>>>>>>>> check if its >>>>>>>>>> own uuid >>>>>>>>>> is present in the list. If it is present then >>>>>>>>>> it >>>>>>>>>> will consider >>>>>>>>>> it active >>>>>>>>>> otherwise it will be considered passive. With >>>>>>>>>> this >>>>>>>>>> change we are >>>>>>>>>> giving >>>>>>>>>> all uuids instead of first-up subvolume. So >>>>>>>>>> all >>>>>>>>>> machines think >>>>>>>>>> they are >>>>>>>>>> ACTIVE which is bad apparently. So that is the >>>>>>>>>> reason. Even I >>>>>>>>>> felt bad >>>>>>>>>> that we are doing this change. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> And what about changing the content of node-uuid >>>>>>>>>> to >>>>>>>>>> include some >>>>>>>>>> sort of hierarchy ? >>>>>>>>>> >>>>>>>>>> for example: >>>>>>>>>> >>>>>>>>>> a single brick: >>>>>>>>>> >>>>>>>>>> NODE(<guid>) >>>>>>>>>> >>>>>>>>>> AFR/EC: >>>>>>>>>> >>>>>>>>>> AFR[2](NODE(<guid>), NODE(<guid>)) >>>>>>>>>> EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>)) >>>>>>>>>> >>>>>>>>>> DHT: >>>>>>>>>> >>>>>>>>>> DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)), >>>>>>>>>> AFR[2](NODE(<guid>), >>>>>>>>>> NODE(<guid>))) >>>>>>>>>> >>>>>>>>>> This gives a lot of information that can be used >>>>>>>>>> to >>>>>>>>>> take the >>>>>>>>>> appropriate decisions. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I guess that is not backward compatible. Shall I CC >>>>>>>>>> gluster-devel and >>>>>>>>>> Kotresh/Aravinda? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Is the change we did backward compatible ? if we only >>>>>>>>>> require >>>>>>>>>> the first field to be a GUID to support backward >>>>>>>>>> compatibility, >>>>>>>>>> we can use something like this: >>>>>>>>>> >>>>>>>>>> No. But the necessary change can be made to Geo-rep code as >>>>>>>>>> well if >>>>>>>>>> format is changed, Since all these are built/shipped together. >>>>>>>>>> >>>>>>>>>> Geo-rep uses node-id as follows, >>>>>>>>>> >>>>>>>>>> list = listxattr(node-uuid) >>>>>>>>>> active_node_uuids = list.split(SPACE) >>>>>>>>>> active_node_flag = True if self.node_id exists in >>>>>>>>>> active_node_uuids >>>>>>>>>> else False >>>>>>>>> >>>>>>>>> >>>>>>>>> How was this case solved ? >>>>>>>>> >>>>>>>>> suppose we have three servers and 2 bricks in each server. A >>>>>>>>> replicated volume is created using the following command: >>>>>>>>> >>>>>>>>> gluster volume create test replica 2 server1:/brick1 >>>>>>>>> server2:/brick1 >>>>>>>>> server2:/brick2 server3:/brick1 server3:/brick1 server1:/brick2 >>>>>>>>> >>>>>>>>> In this case we have three replica-sets: >>>>>>>>> >>>>>>>>> * server1:/brick1 server2:/brick1 >>>>>>>>> * server2:/brick2 server3:/brick1 >>>>>>>>> * server3:/brick2 server2:/brick2 >>>>>>>>> >>>>>>>>> Old AFR implementation for node-uuid always returned the uuid of >>>>>>>>> the >>>>>>>>> node of the first brick, so in this case we will get the uuid of >>>>>>>>> the >>>>>>>>> three nodes because all of them are the first brick of a >>>>>>>>> replica-set. >>>>>>>>> >>>>>>>>> Does this mean that with this configuration all nodes are active ? >>>>>>>>> Is >>>>>>>>> this a problem ? Is there any other check to avoid this situation >>>>>>>>> if >>>>>>>>> it's not good ? >>>>>>>> >>>>>>>> Yes all Geo-rep workers will become Active and participate in >>>>>>>> syncing. >>>>>>>> Since changelogs will have the same information in replica bricks >>>>>>>> this >>>>>>>> will lead to duplicate syncing and consuming network bandwidth. >>>>>>>> >>>>>>>> Node-uuid based Active worker is the default configuration in >>>>>>>> Geo-rep >>>>>>>> till now, Geo-rep also has Meta Volume based syncronization for >>>>>>>> Active >>>>>>>> worker using lock files.(Can be opted using Geo-rep configuration, >>>>>>>> with this config node-uuid will not be used) >>>>>>>> >>>>>>>> Kotresh proposed a solution to configure which worker to become >>>>>>>> Active. This will give more control to Admin to choose Active >>>>>>>> workers, >>>>>>>> This will become default configuration from 3.12 >>>>>>>> https://github.com/gluster/glusterfs/issues/244 >>>>>>>> >>>>>>>> -- >>>>>>>> Aravinda >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Xavi >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Bricks: >>>>>>>>>> >>>>>>>>>> <guid> >>>>>>>>>> >>>>>>>>>> AFR/EC: >>>>>>>>>> <guid>(<guid>, <guid>) >>>>>>>>>> >>>>>>>>>> DHT: >>>>>>>>>> <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...)) >>>>>>>>>> >>>>>>>>>> In this case, AFR and EC would return the same <guid> they >>>>>>>>>> returned before the patch, but between '(' and ')' they >>>>>>>>>> put the >>>>>>>>>> full list of guid's of all nodes. The first <guid> can be >>>>>>>>>> used >>>>>>>>>> by geo-replication. The list after the first <guid> can >>>>>>>>>> be used >>>>>>>>>> for rebalance. >>>>>>>>>> >>>>>>>>>> Not sure if there's any user of node-uuid above DHT. >>>>>>>>>> >>>>>>>>>> Xavi >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Xavi >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Jun 20, 2017 at 12:46 PM, Xavier >>>>>>>>>> Hernandez >>>>>>>>>> <xhernan...@datalab.es >>>>>>>>>> <mailto:xhernan...@datalab.es> >>>>>>>>>> <mailto:xhernan...@datalab.es >>>>>>>>>> <mailto:xhernan...@datalab.es>> >>>>>>>>>> <mailto:xhernan...@datalab.es >>>>>>>>>> <mailto:xhernan...@datalab.es> >>>>>>>>>> <mailto:xhernan...@datalab.es >>>>>>>>>> <mailto:xhernan...@datalab.es>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Pranith, >>>>>>>>>> >>>>>>>>>> On 20/06/17 07:53, Pranith Kumar Karampuri >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> hi Xavi, >>>>>>>>>> We all made the mistake of not >>>>>>>>>> sending about changing >>>>>>>>>> behavior of >>>>>>>>>> node-uuid xattr so that rebalance can >>>>>>>>>> use >>>>>>>>>> multiple nodes >>>>>>>>>> for doing >>>>>>>>>> rebalance. Because of this on geo-rep >>>>>>>>>> all >>>>>>>>>> the workers >>>>>>>>>> are becoming >>>>>>>>>> active instead of one per EC/AFR >>>>>>>>>> subvolume. >>>>>>>>>> So we are >>>>>>>>>> frantically trying >>>>>>>>>> to restore the functionality of >>>>>>>>>> node-uuid >>>>>>>>>> and introduce >>>>>>>>>> a new >>>>>>>>>> xattr for >>>>>>>>>> the new behavior. Sunil will be >>>>>>>>>> sending out >>>>>>>>>> a patch for >>>>>>>>>> this. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Wouldn't it be better to change geo-rep >>>>>>>>>> behavior >>>>>>>>>> to use the >>>>>>>>>> new data >>>>>>>>>> ? I think it's better as it's now, since >>>>>>>>>> it >>>>>>>>>> gives more >>>>>>>>>> information >>>>>>>>>> to upper layers so that they can take more >>>>>>>>>> accurate decisions. >>>>>>>>>> >>>>>>>>>> Xavi >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Pranith >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Pranith >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Pranith >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Pranith >>>>>>>>> >>>>>>>>> >>>>> >>>>> >>>>> -- >>>>> Pranith >>>>> >>>>> >>> >>> >>> -- >>> Pranith >>> >>> >>> >>> >> >> >> >> -- >> Pranith >> >> >> >> > > -- Pranith
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org http://lists.gluster.org/mailman/listinfo/gluster-devel