On Tue, Jul 4, 2017 at 2:26 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
> Hi Pranith,
>
> On 03/07/17 08:33, Pranith Kumar Karampuri wrote:
>> Xavi,
>>       Now that the change has been reverted, we can resume this
>> discussion and decide on the exact format that considers tier, dht,
>> afr and ec. The people working on geo-rep/dht/afr/ec had an internal
>> discussion and we all agreed that this proposal would be a good way
>> forward. I think once we agree on the format, decide on the initial
>> encoding/decoding functions of the xattr and merge this change, we
>> can send patches on afr/ec/dht and geo-rep to take it to closure.
>>
>> Could you propose the new format you have in mind that considers all
>> of the xlators?
>
> My idea was to create a new xattr not bound to any particular function,
> but which could give enough information to be used in many places.
>
> Currently we have another attribute called glusterfs.pathinfo that
> returns hierarchical information about the location of a file. Maybe we
> can extend this to unify all these attributes into a single feature
> that could be used for multiple purposes.
>
> Since we have time to discuss it, I would like to design it with more
> information than we have already talked about.
>
> First of all, the amount of information that this attribute can contain
> is quite big if we expect to have volumes with thousands of bricks.
> Even in the simplest case of returning only a UUID, we can easily go
> beyond the limit of 64KB.
>
> Consider also, for example, what shard should return when pathinfo is
> requested for a file. It should probably return a list of shards, each
> one with all of its associated pathinfo. We are talking about big
> amounts of data here.
>
> I think this kind of information doesn't fit very well in an extended
> attribute. Another thing to consider is that most probably the
> requester of the data only needs a fragment of it, so we are generating
> big amounts of data only for it to be parsed and reduced later,
> dismissing most of it.
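[Editor's note: for a rough sense of the 64KB concern above, here is a back-of-the-envelope sketch. It assumes the 36-character textual UUID form plus a one-byte separator per brick — an assumption for illustration, not Gluster's actual encoding:]

```python
# Rough size estimate for an xattr value that lists one textual UUID
# per brick, separated by single-byte separators. The 36-char UUID
# form and 1-byte separator are assumptions for illustration only.
UUID_LEN = 36  # e.g. "3f2504e0-4f89-11d3-9a0c-0305e82c3301"
XATTR_LIMIT = 64 * 1024

def xattr_value_size(num_bricks: int) -> int:
    # num_bricks UUIDs plus (num_bricks - 1) separators
    return num_bricks * UUID_LEN + max(num_bricks - 1, 0)

# A volume with 2000 bricks already exceeds a 64KB xattr value.
assert xattr_value_size(1000) < XATTR_LIMIT
assert xattr_value_size(2000) > XATTR_LIMIT
```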
> What do you think about using a very special virtual file to manage
> all this information? It could be read using normal read fops, so it
> could manage big amounts of data easily. Also, by accessing only some
> parts of the file we could go directly where we want, avoiding the
> read of all the remaining data.
>
> A very basic idea could be this:
>
> Each xlator would have a reserved area of the file. We can reserve up
> to 4GB per xlator (32 bits). The remaining 32 bits of the offset would
> indicate the xlator we want to access.
>
> At offset 0 we have generic information about the volume. One of the
> things this information should include is a basic hierarchy of the
> whole volume and the offset for each xlator.
>
> After reading this, the user will seek to the desired offset and read
> the information related to the xlator it is interested in.
>
> All the information should be stored in an easily extensible format
> that will be kept compatible even if new information is added in the
> future (for example by doing special mappings of the 32-bit offsets
> reserved for each xlator).
>
> For example, we can reserve the first megabyte of each xlator's area
> for a mapping of attributes to their respective offsets.
>
> I think that using a binary format would simplify all this a lot.
>
> Do you think this is a way to explore, or should I stop wasting time
> here?

I think this just became a very big feature :-). Shall we just live
with it the way it is now?

> Xavi
>
>> On Wed, Jun 21, 2017 at 2:08 PM, Karthik Subrahmanya
>> <ksubr...@redhat.com> wrote:
>>
>>     On Wed, Jun 21, 2017 at 1:56 PM, Xavier Hernandez
>>     <xhernan...@datalab.es> wrote:
>>
>>         That's ok. I'm currently unable to write a patch for this
>>         on ec.
>>
>>     Sunil is working on this patch.
>>
>>     ~Karthik
>>
>>         If no one can do it, I can try to do it in 6-7 hours...
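[Editor's note: the 64-bit offset scheme Xavi describes — high 32 bits selecting the xlator, low 32 bits addressing up to 4GB within its area — could be sketched as below. The function names are illustrative, not an actual Gluster API:]

```python
# Sketch of the proposed virtual-file addressing: the high 32 bits of
# a 64-bit read offset select an xlator, the low 32 bits address a
# byte within that xlator's 4GB reserved area. Names are illustrative.
XLATOR_SHIFT = 32
LOCAL_MASK = (1 << 32) - 1

def make_offset(xlator_id: int, local_offset: int) -> int:
    # pack xlator id and intra-area offset into one 64-bit file offset
    assert 0 <= xlator_id <= LOCAL_MASK and 0 <= local_offset <= LOCAL_MASK
    return (xlator_id << XLATOR_SHIFT) | local_offset

def split_offset(offset: int) -> tuple:
    # recover (xlator_id, local_offset) from a 64-bit file offset
    return offset >> XLATOR_SHIFT, offset & LOCAL_MASK

# Offset 0 (xlator 0, local offset 0) would hold the generic volume
# information and the per-xlator directory, as proposed above.
assert split_offset(make_offset(0, 0)) == (0, 0)
assert split_offset(make_offset(5, 0x100000)) == (5, 0x100000)
```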
>> Xavi
>>
>> On Wednesday, June 21, 2017 09:48 CEST, Pranith Kumar Karampuri
>> <pkara...@redhat.com> wrote:
>>
>>> On Wed, Jun 21, 2017 at 1:00 PM, Xavier Hernandez
>>> <xhernan...@datalab.es> wrote:
>>>
>>>     I'm ok with reverting the node-uuid content to the previous
>>>     format and creating a new xattr for the new format. Currently,
>>>     only rebalance will use it.
>>>
>>>     The only thing to consider is what can happen if we have a
>>>     half-upgraded cluster where some clients have this change and
>>>     some don't. Can rebalance work in this situation? If so, could
>>>     there be any issue?
>>>
>>> I think there shouldn't be any problem, because this is an
>>> in-memory xattr, so layers below afr/ec will only see the node-uuid
>>> xattr. This also gives us a chance to do whatever we want to do
>>> with this xattr in the future without any problems with backward
>>> compatibility.
>>>
>>> You can check
>>> https://review.gluster.org/#/c/17576/3/xlators/cluster/afr/src/afr-inode-read.c@1507
>>> for how Karthik implemented this in AFR (this got merged
>>> accidentally yesterday, but it looks like this is what we are
>>> settling on).
>>>
>>> Xavi
>>>
>>> On Wednesday, June 21, 2017 06:56 CEST, Pranith Kumar Karampuri
>>> <pkara...@redhat.com> wrote:
>>>
>>>> On Wed, Jun 21, 2017 at 10:07 AM, Nithya Balachandran
>>>> <nbala...@redhat.com> wrote:
>>>>
>>>>     On 20 June 2017 at 20:38, Aravinda <avish...@redhat.com>
>>>>     wrote:
>>>>
>>>>         On 06/20/2017 06:02 PM, Pranith Kumar Karampuri wrote:
>>>>
>>>>>         Xavi, Aravinda and I had a discussion on #gluster-dev and
>>>>>         we agreed to go with the format Aravinda suggested for
>>>>>         now, and in future we wanted some more changes for dht to
>>>>>         detect which
>>>>>         subvolume went down and came back up; at that time we
>>>>>         will revisit the solution suggested by Xavi.
>>>>>
>>>>>         Susanth is doing the dht changes.
>>>>>         Aravinda is doing the geo-rep changes.
>>>>
>>>>         Done. Geo-rep patch sent for review:
>>>>         https://review.gluster.org/17582
>>>>
>>>>     The proposed changes to the node-uuid behaviour (while good)
>>>>     are going to break tiering. Tiering changes will take a little
>>>>     more time to be coded and tested.
>>>>
>>>>     As this is a regression for 3.11 and a blocker for 3.11.1, I
>>>>     suggest we go back to the original node-uuid behaviour for now
>>>>     so as to unblock the release, and target the proposed changes
>>>>     for the next 3.11 releases.
>>>>
>>>> Let me see if I understand the changes correctly. We are restoring
>>>> the behavior of the node-uuid xattr and adding a new xattr for
>>>> parallel rebalance, for both afr and ec, correct? Otherwise that
>>>> is one more regression. If yes, we will also wait for Xavi's
>>>> inputs. Jeff accidentally merged the afr patch yesterday, which
>>>> does these changes. If everyone is in agreement, we will leave it
>>>> as is and add similar changes in ec as well. If we are not in
>>>> agreement, then we will let the discussion progress :-)
>>>>
>>>>     Regards,
>>>>     Nithya
>>>>
>>>>     --
>>>>     Aravinda
>>>>
>>>>>         Thanks to all of you guys for the discussions!
>>>>> On Tue, Jun 20, 2017 at 5:05 PM, Xavier Hernandez
>>>>> <xhernan...@datalab.es> wrote:
>>>>>
>>>>>     Hi Aravinda,
>>>>>
>>>>>     On 20/06/17 12:42, Aravinda wrote:
>>>>>
>>>>>         I think the following format can be easily adopted by all
>>>>>         components.
>>>>>
>>>>>         UUIDs of a subvolume are separated by spaces and
>>>>>         subvolumes are separated by commas.
>>>>>
>>>>>         For example, node1 and node2 are replicas with UUIDs U1
>>>>>         and U2 respectively, and node3 and node4 are replicas
>>>>>         with UUIDs U3 and U4 respectively.
>>>>>
>>>>>         node-uuid can return "U1 U2,U3 U4".
>>>>>
>>>>>     While this is ok for the current implementation, I think it
>>>>>     can be insufficient if there are more layers of xlators that
>>>>>     need to indicate some sort of grouping. Some representation
>>>>>     that can express hierarchy would be better. For example:
>>>>>     "(U1 U2) (U3 U4)" (we can use spaces or commas as
>>>>>     separators).
>>>>>
>>>>>         Geo-rep can split by "," and then split by space and take
>>>>>         the first UUID.
>>>>>         DHT can split the value by space or comma and get the
>>>>>         unique UUID list.
>>>>>
>>>>>     This doesn't solve the problem I described in the previous
>>>>>     email. Some more logic will need to be added to avoid more
>>>>>     than one node from each replica-set being active. If we have
>>>>>     some explicit hierarchy information in the node-uuid value,
>>>>>     more decisions can be taken.
>>>>>
>>>>>     An initial proposal I made was this:
>>>>>
>>>>>         DHT[2](AFR[2,0](NODE(U1), NODE(U2)),
>>>>>                AFR[2,0](NODE(U3), NODE(U4)))
>>>>>
>>>>>     This is harder to parse, but gives a lot of information: DHT
>>>>>     with 2 subvolumes, each subvolume an AFR with replica 2 and
>>>>>     no arbiters. It's also easily extensible with any new xlator
>>>>>     that changes the layout.
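[Editor's note: for concreteness, the two parsing rules Aravinda describes could look like this — a sketch against the proposed "U1 U2,U3 U4" encoding, not actual Gluster code:]

```python
# Parse the proposed node-uuid format "U1 U2,U3 U4": UUIDs within a
# subvolume are space-separated, subvolumes are comma-separated.

def georep_active_uuids(value: str) -> list:
    # geo-rep: split by "," then by space and take the first UUID of
    # each subvolume; those nodes become the Active workers
    return [group.split()[0] for group in value.split(",")]

def dht_unique_uuids(value: str) -> list:
    # DHT: split by space or comma and collect the unique UUID list,
    # preserving first-seen order
    seen, result = set(), []
    for uuid in value.replace(",", " ").split():
        if uuid not in seen:
            seen.add(uuid)
            result.append(uuid)
    return result

value = "U1 U2,U3 U4"
assert georep_active_uuids(value) == ["U1", "U3"]
assert dht_unique_uuids(value) == ["U1", "U2", "U3", "U4"]
```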
>>>>> However, maybe this is not the moment to do this, and probably
>>>>> we could implement it in a new xattr with a better name.
>>>>>
>>>>> Xavi
>>>>>
>>>>>     Another question is about the behavior when a node is down:
>>>>>     the existing node-uuid xattr will not return that UUID if a
>>>>>     node is down. What is the behavior with the proposed xattr?
>>>>>
>>>>>     Let me know your thoughts.
>>>>>
>>>>>     regards
>>>>>     Aravinda VK
>>>>>
>>>>>     On 06/20/2017 03:06 PM, Aravinda wrote:
>>>>>
>>>>>         Hi Xavi,
>>>>>
>>>>>         On 06/20/2017 02:51 PM, Xavier Hernandez wrote:
>>>>>
>>>>>             Hi Aravinda,
>>>>>
>>>>>             On 20/06/17 11:05, Pranith Kumar Karampuri wrote:
>>>>>
>>>>>                 Adding more people to get a consensus about this.
>>>>>
>>>>>                 On Tue, Jun 20, 2017 at 1:49 PM, Aravinda
>>>>>                 <avish...@redhat.com> wrote:
>>>>>
>>>>>                     regards
>>>>>                     Aravinda VK
>>>>>
>>>>>                     On 06/20/2017 01:26 PM, Xavier Hernandez
>>>>>                     wrote:
>>>>>
>>>>>                         Hi Pranith,
>>>>>
>>>>>                         adding gluster-devel, Kotresh and
>>>>>                         Aravinda,
>>>>>
>>>>>                         On 20/06/17 09:45, Pranith Kumar
>>>>>                         Karampuri wrote:
>>>>>
>>>>>                             On Tue, Jun 20, 2017 at 1:12 PM,
>>>>>                             Xavier Hernandez
>>>>>                             <xhernan...@datalab.es> wrote:
>>>>>
>>>>>                                 On 20/06/17 09:31, Pranith Kumar
>>>>>                                 Karampuri wrote:
>>>>>
>>>>> The way geo-replication works is: on each machine, it does a
>>>>> getxattr of node-uuid and checks if its own uuid is present in
>>>>> the list.
>>>>> If it is present then it will consider itself active; otherwise
>>>>> it will be considered passive. With this change we are giving all
>>>>> uuids instead of the first-up subvolume, so all machines think
>>>>> they are ACTIVE, which is bad apparently. So that is the reason.
>>>>> Even I felt bad that we are doing this change.
>>>>>
>>>>> And what about changing the content of node-uuid to include some
>>>>> sort of hierarchy? For example:
>>>>>
>>>>> a single brick:
>>>>>
>>>>>     NODE(<guid>)
>>>>>
>>>>> AFR/EC:
>>>>>
>>>>>     AFR[2](NODE(<guid>), NODE(<guid>))
>>>>>     EC[3,1](NODE(<guid>), NODE(<guid>), NODE(<guid>))
>>>>>
>>>>> DHT:
>>>>>
>>>>>     DHT[2](AFR[2](NODE(<guid>), NODE(<guid>)),
>>>>>            AFR[2](NODE(<guid>), NODE(<guid>)))
>>>>>
>>>>> This gives a lot of information that can be used to take the
>>>>> appropriate decisions.
>>>>>
>>>>> I guess that is not backward compatible. Shall I CC gluster-devel
>>>>> and Kotresh/Aravinda?
>>>>>
>>>>> Is the change we did backward compatible? If we only require the
>>>>> first field to be a GUID to support backward compatibility, we
>>>>> can use something like this:
>>>>>
>>>>> No. But the necessary change can be made to the geo-rep code as
>>>>> well if the format is changed, since all of these are
>>>>> built/shipped together.
>>>>>
>>>>> Geo-rep uses node-uuid as follows:
>>>>>
>>>>>     list = listxattr(node-uuid)
>>>>>     active_node_uuids = list.split(SPACE)
>>>>>     active_node_flag = True if self.node_id exists in
>>>>>                        active_node_uuids else False
>>>>>
>>>>> How was this case solved? Suppose we have three servers and 2
>>>>> bricks in each server.
>>>>> A replicated volume is created using the following command:
>>>>>
>>>>>     gluster volume create test replica 2 server1:/brick1
>>>>>     server2:/brick1 server2:/brick2 server3:/brick1
>>>>>     server3:/brick2 server1:/brick2
>>>>>
>>>>> In this case we have three replica-sets:
>>>>>
>>>>>     * server1:/brick1 server2:/brick1
>>>>>     * server2:/brick2 server3:/brick1
>>>>>     * server3:/brick2 server1:/brick2
>>>>>
>>>>> The old AFR implementation of node-uuid always returned the uuid
>>>>> of the node of the first brick, so in this case we will get the
>>>>> uuids of all three nodes, because each of them hosts the first
>>>>> brick of a replica-set.
>>>>>
>>>>> Does this mean that with this configuration all nodes are active?
>>>>> Is this a problem? Is there any other check to avoid this
>>>>> situation if it's not good?
>>>>>
>>>>> Yes, all geo-rep workers will become Active and participate in
>>>>> syncing. Since the changelogs will have the same information in
>>>>> replica bricks, this will lead to duplicate syncing and consume
>>>>> network bandwidth.
>>>>>
>>>>> Node-uuid-based Active worker selection has been the default
>>>>> configuration in geo-rep till now. Geo-rep also has meta-volume
>>>>> based synchronization for Active workers using lock files. (This
>>>>> can be opted into via geo-rep configuration; with this config,
>>>>> node-uuid is not used.)
>>>>>
>>>>> Kotresh proposed a solution to configure which worker becomes
>>>>> Active.
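[Editor's note: modeling the check quoted earlier — split node-uuid by space, test membership of our own uuid — against this three-server example shows why every worker becomes Active under the old behaviour. This is a simplified sketch, not the actual geo-rep code:]

```python
# Simplified model of the geo-rep Active/Passive check described in
# this thread. Old AFR behaviour: node-uuid returns the UUID of the
# node hosting the first brick of each replica-set. U1..U3 stand for
# the UUIDs of server1..server3 (illustrative values).

def node_uuid_old(replica_sets) -> str:
    # one UUID per replica-set: the node hosting its first brick
    return " ".join(bricks[0] for bricks in replica_sets)

def is_active(my_uuid: str, xattr_value: str) -> bool:
    # geo-rep: split by space and check whether our own UUID is listed
    return my_uuid in xattr_value.split(" ")

# Xavi's example: three servers, two bricks each, replica 2.
replica_sets = [
    ("U1", "U2"),  # server1:/brick1 server2:/brick1
    ("U2", "U3"),  # server2:/brick2 server3:/brick1
    ("U3", "U1"),  # server3:/brick2 server1:/brick2
]

value = node_uuid_old(replica_sets)
# Every server hosts the first brick of some replica-set, so every
# geo-rep worker considers itself Active — the duplicate-sync problem.
assert value == "U1 U2 U3"
assert all(is_active(u, value) for u in ("U1", "U2", "U3"))
```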
>>>>> This will give more control to the admin to choose the Active
>>>>> workers, and will become the default configuration from 3.12:
>>>>> https://github.com/gluster/glusterfs/issues/244
>>>>>
>>>>> --
>>>>> Aravinda
>>>>>
>>>>> Xavi
>>>>>
>>>>>     Bricks:
>>>>>
>>>>>         <guid>
>>>>>
>>>>>     AFR/EC:
>>>>>
>>>>>         <guid>(<guid>, <guid>)
>>>>>
>>>>>     DHT:
>>>>>
>>>>>         <guid>(<guid>(<guid>, ...), <guid>(<guid>, ...))
>>>>>
>>>>>     In this case, AFR and EC would return the same <guid> they
>>>>>     returned before the patch, but between '(' and ')' they put
>>>>>     the full list of guids of all nodes. The first <guid> can be
>>>>>     used by geo-replication. The list after the first <guid> can
>>>>>     be used for rebalance.
>>>>>
>>>>>     Not sure if there's any user of node-uuid above DHT.
>>>>>
>>>>>     Xavi
>>>>>
>>>>> On Tue, Jun 20, 2017 at 12:46 PM, Xavier Hernandez
>>>>> <xhernan...@datalab.es> wrote:
>>>>>
>>>>>     Hi Pranith,
>>>>>
>>>>>     On 20/06/17 07:53, Pranith Kumar Karampuri wrote:
>>>>>
>>>>>         hi Xavi,
>>>>>              We all made the mistake of not sending a note about
>>>>>         changing the behavior of the node-uuid xattr so that
>>>>>         rebalance can use multiple nodes for doing the rebalance.
>>>>>         Because of this, on geo-rep all the workers are becoming
>>>>>         active instead of one per EC/AFR subvolume. So we are
>>>>>         frantically trying to restore the functionality of
>>>>>         node-uuid and introduce a new xattr for the new behavior.
>>>>>         Sunil will be sending out a patch for this.
>>>>>
>>>>>     Wouldn't it be better to change the geo-rep behavior to use
>>>>>     the new data? I think it's better as it is now, since it
>>>>>     gives more information to the upper layers so that they can
>>>>>     take more accurate decisions.
>>>>>
>>>>>     Xavi
>>>>>
>>>>>     --
>>>>>     Pranith

--
Pranith
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel