Re: [Gluster-devel] regarding inode-unref on root inode
On Tuesday 24 June 2014 08:17 PM, Pranith Kumar Karampuri wrote:
> Does anyone know why inode_unref is a no-op for the root inode? I see the following code in inode.c:
>
>     static inode_t *
>     __inode_unref (inode_t *inode)
>     {
>             if (!inode)
>                     return NULL;
>
>             if (__is_root_gfid (inode->gfid))
>                     return inode;
>             ...
>     }

I think it's done with the intention that the root inode should *never* get removed from the active inodes list (not even accidentally), so an unref on the root inode is a no-op. I don't know whether there are any other reasons.

Regards,
Raghavendra Bhat

> Pranith
Re: [Gluster-devel] [Gluster-users] glusterfs-3.5.1 released
On 06/24/2014 03:45 PM, Gluster Build System wrote:
> SRC: http://bits.gluster.org/pub/gluster/glusterfs/src/glusterfs-3.5.1.tar.gz
>
> This release is made off jenkins-release-73
>
> -- Gluster Build System

RPMs for el5-7 (RHEL, CentOS, etc.) are available at download.gluster.org [1].

[1] http://download.gluster.org/pub/gluster/glusterfs/LATEST/

Thanks,
Lala
Re: [Gluster-devel] regarding inode-unref on root inode
On 06/25/2014 11:52 AM, Raghavendra Bhat wrote:
> On Tuesday 24 June 2014 08:17 PM, Pranith Kumar Karampuri wrote:
>> Does anyone know why inode_unref is a no-op for the root inode? I see the following code in inode.c:
>>
>>     static inode_t *
>>     __inode_unref (inode_t *inode)
>>     {
>>             if (!inode)
>>                     return NULL;
>>
>>             if (__is_root_gfid (inode->gfid))
>>                     return inode;
>>             ...
>>     }
>
> I think it's done with the intention that the root inode should *never* get removed from the active inodes list (not even accidentally), so an unref on the root inode is a no-op. I don't know whether there are any other reasons.

Thanks, that helps.

Pranith

> Regards,
> Raghavendra Bhat
>
>> Pranith
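The pattern behind this is easy to see in isolation. Below is a minimal, self-contained C sketch - purely illustrative, not the actual GlusterFS inode-table code; the names node_t and node_unref are invented for this example - of why short-circuiting unref for one special object means it can never be retired, even if some caller unrefs it too many times:

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative refcounted node: retired when the count drops to zero,
     * except for the "root" node, which is deliberately pinned. */
    typedef struct node {
            int  refcount;
            bool is_root;
    } node_t;

    static node_t *
    node_unref (node_t *node)
    {
            if (!node)
                    return NULL;

            if (node->is_root)      /* same idea as the __is_root_gfid() check: */
                    return node;    /* the root's count never drops             */

            if (--node->refcount == 0) {
                    printf ("node retired\n");  /* would leave the active list */
                    return NULL;
            }
            return node;
    }

    int
    main (void)
    {
            node_t root  = { .refcount = 1, .is_root = true  };
            node_t other = { .refcount = 1, .is_root = false };

            /* Even an unbalanced extra unref cannot retire the root ... */
            node_unref (&root);
            node_unref (&root);
            assert (root.refcount == 1);

            /* ... while an ordinary node goes away as usual. */
            node_unref (&other);
            return 0;
    }

Because the pinned object's count can never reach zero, the purge/LRU paths never see it, which matches the intent described above for the root inode.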
Re: [Gluster-devel] 3.5.1-beta2 Problems with suid and sgid bits on directories
On 2014-06-24 22:26, Shyamsundar Ranganathan wrote:

- Original Message -
From: Anders Blomdell anders.blomd...@control.lth.se
To: Niels de Vos nde...@redhat.com
Cc: Shyamsundar Ranganathan srang...@redhat.com, Gluster Devel gluster-devel@gluster.org, Susant Palai spa...@redhat.com
Sent: Tuesday, June 24, 2014 4:09:52 AM
Subject: Re: [Gluster-devel] 3.5.1-beta2 Problems with suid and sgid bits on directories

On 2014-06-23 12:03, Niels de Vos wrote:

On Tue, Jun 17, 2014 at 11:49:26AM -0400, Shyamsundar Ranganathan wrote:

You may be looking at the problem being fixed here, [1]. On a lookup, an attribute mismatch was not being healed across directories, and this patch attempts to address the same. Currently the version of the patch does not heal the S_ISUID and S_ISGID bits, which is work in progress (but easy enough to incorporate and test based on the patch at [1]). On a separate note, add-brick just adds a brick to the cluster; the lookup is where the heal (or creation of the directory across all subvolumes in the DHT xlator) is done.

I assume that this is not a regression between 3.5.0 and 3.5.1? If that is the case, we can pull the fix in 3.5.2, because 3.5.1 really should not get delayed much longer.

No, it does not work in 3.5.0 either :-(

I ran these tests using your scripts and observed similar behavior, and need to dig into this a little further to understand how to make this work reliably. This might be a root cause, and probably should be resolved first: https://bugzilla.redhat.com/show_bug.cgi?id=1113050

The proposed patch does not work as intended. With the following hierarchy:

    755  0:0        /mnt/gluster
    2777 0:1000     /mnt/gluster/test
    2755 1000:1000  /mnt/gluster/test/dir1
    2755 1000:1000  /mnt/gluster/test/dir1/dir2

in the (approx 25%) of cases where my test-script does trigger a self-heal on disk2, 10% end up with one of the following (giving an access error on the client):

    0    0:0        /data/disk2/gluster/test
    755  1000:1000  /data/disk2/gluster/test/dir1
    755  1000:1000  /data/disk2/gluster/test/dir1/dir2

or

    2777 0:1000     /data/disk2/gluster/test
    0    0:0        /data/disk2/gluster/test/dir1
    755  1000:1000  /data/disk2/gluster/test/dir1/dir2

or

    2777 0:1000     /data/disk2/gluster/test
    2755 1000:1000  /data/disk2/gluster/test/dir1
    0    0:0        /data/disk2/gluster/test/dir1/dir2

and 73% end up with either partially healed directories (/data/disk2/gluster/test/dir1/dir2 or /data/disk2/gluster/test/dir1 missing) or the sgid bit [randomly] set on some of the directories. Since I don't even understand how to reliably trigger a self-heal of the directories, I'm currently clueless as to the reason for this behaviour.

Soo, I think that the comment from susant in http://review.gluster.org/#/c/6983/3/xlators/cluster/dht/src/dht-common.c:

    susant palai    Jun 13 9:04 AM
    I think we dont have to worry about that. Rebalance does not
    interfere with directory SUID/GID/STICKY bits.

unfortunately is wrong :-(, and I'm in too deep water to understand how to fix this at the moment.

Currently in the test case rebalance is not run, so the above comment in relation to rebalance is sort of different from what is observed. Just a note.

I stand corrected :-) So far only self-heal has interfered.
N.B.: with 00777 flags on the /mnt/gluster/test directory I have not been able to trigger any unreadable directories.

/Anders

Thanks,
Niels

Shyam

[1] http://review.gluster.org/#/c/6983/

- Original Message -
From: Anders Blomdell anders.blomd...@control.lth.se
To: Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, June 17, 2014 10:53:52 AM
Subject: [Gluster-devel] 3.5.1-beta2 Problems with suid and sgid bits on directories

With a glusterfs-3.5.1-0.3.beta2.fc20.x86_64 with a reverted 3dc56cbd16b1074d7ca1a4fe4c5bf44400eb63ff (due to local lack of IPv4 addresses), I get weird behavior if I:

1. Create a directory with suid/sgid/sticky bits set (/mnt/gluster/test)
2. Make a subdirectory of #1 (/mnt/gluster/test/dir1)
3. Do an add-brick

Before add-brick:

    755   /mnt/gluster
    7775  /mnt/gluster/test
    2755  /mnt/gluster/test/dir1

After add-brick:

    755   /mnt/gluster
    1775  /mnt/gluster/test
    755   /mnt/gluster/test/dir1

On the server it looks like this:

    7775  /data/disk1/gluster/test
    2755  /data/disk1/gluster/test/dir1
    1775  /data/disk2/gluster/test
    755   /data/disk2/gluster/test/dir1

Filed as bug: https://bugzilla.redhat.com/show_bug.cgi?id=1110262

If somebody can point me to where the logic of add-brick is placed, I can give it a shot (a find/grep on mkdir didn't immediately point me to the right place).

/Anders

--
Anders Blomdell                  Email: anders.blomd...@control.lth.se
Department of Automatic Control
Lund University                  Phone: +46 46 222 4625
P.O.
Re: [Gluster-devel] Data classification proposal
- Original Message -
>> For the short-term, wouldn't it be OK to disallow adding a number of bricks that is not a multiple of the group size?
>
> In the *very* short term, yes. However, I think that will quickly become an issue for users who try to deploy erasure coding, because those group sizes will be quite large. As soon as we implement tiering, our very next task - perhaps even before tiering gets into a release - should be to implement automatic brick splitting. That will bring other benefits as well, such as variable replication levels to handle the sanlock case, or overlapping replica sets to spread a failed brick's load over more peers.

OK. Do you have some initial ideas on how we could 'split' bricks? I ask this to see if I can work on splitting bricks while the data classification format is being ironed out.

thanks,
Krish
Re: [Gluster-devel] Data classification proposal
On Wednesday 25 June 2014 08:35:05 Jeff Darcy wrote:
>> For the short-term, wouldn't it be OK to disallow adding a number of bricks that is not a multiple of the group size?
>
> In the *very* short term, yes. However, I think that will quickly become an issue for users who try to deploy erasure coding, because those group sizes will be quite large. As soon as we implement tiering, our very next task - perhaps even before tiering gets into a release - should be to implement automatic brick splitting. That will bring other benefits as well, such as variable replication levels to handle the sanlock case, or overlapping replica sets to spread a failed brick's load over more peers.

If I understand correctly the proposed data-classification architecture, each server will have a number of bricks that will be dynamically modified as needed: as more data-classifying conditions are defined, a new layer of translators will be added (a new DHT or AFR, or something else) and some or all existing bricks will be split to accommodate the new and, maybe, overlapping condition.

How will space be allocated to each new sub-brick? Some sort of thin provisioning, or will it be distributed evenly on each split? If using thin provisioning, it will be hard to determine the real available space. If using a fixed amount, we can get into scenarios where a file cannot be written even though there seems to be enough free space. This can already happen today when using very big files on almost-full bricks, and I think brick splitting can accentuate it.

Also, the addition of multiple layered DHT translators, as it's implemented today, could add a lot more latency, especially on directory listings.

Another problem I see is that splitting bricks will require a rebalance, which is a costly operation. It doesn't seem right to require such an expensive operation every time you add a new condition on an already created volume.

Maybe I've missed something important?

Thanks,
Xavi
Re: [Gluster-devel] Data classification proposal
> If I understand correctly the proposed data-classification architecture, each server will have a number of bricks that will be dynamically modified as needed: as more data-classifying conditions are defined, a new layer of translators will be added (a new DHT or AFR, or something else) and some or all existing bricks will be split to accommodate the new and, maybe, overlapping condition.

Correct.

> How will space be allocated to each new sub-brick? Some sort of thin provisioning, or will it be distributed evenly on each split?

That's left to the user. The latest proposal, based on discussion of the first, is here:

https://docs.google.com/presentation/d/1e8tuh9DKNi9eCMrdt5vetppn1D3BiJSmfR7lDW2wRvA/edit?usp=sharing

That has an example of assigning percentages to the sub-bricks created by a rule (i.e. a subvolume in a potentially multi-tiered configuration). Other possibilities include relative weights used to determine percentages, or total thin provisioning where sub-bricks compete freely for available space. It's certainly a fruitful area for discussion.

> If using thin provisioning, it will be hard to determine the real available space. If using a fixed amount, we can get into scenarios where a file cannot be written even though there seems to be enough free space. This can already happen today when using very big files on almost-full bricks, and I think brick splitting can accentuate it.

Is this really common outside of test environments, given the sizes of modern disks and files? Even in cases where it might happen, doesn't striping address it?

We have a whole bunch of problems in this area. If multiple bricks are on the same local file system, their capacity will be double-counted. If a second local file system is mounted over part of a brick, the additional space won't be counted at all. We do need a general solution to this, but I don't think that solution needs to be part of data classification unless there's a specific real-world scenario that DC makes worse.

> Also, the addition of multiple layered DHT translators, as it's implemented today, could add a lot more latency, especially on directory listings.

With http://review.gluster.org/#/c/7702/ this should be less of a problem. Also, lookups across multiple tiers are likely to be rare in most use cases. For example, for the name-based filtering (sanlock) case, a given file should only *ever* be in one tier, so only that tier would need to be searched. For the activity-based tiering case, the vast majority of lookups will be for hot files which are (not accidentally) in the first tier. The only real problem is with *failed* lookups, e.g. during create. We can address that by adding stubs (similar to linkfiles) in the upper tier, but I'd still want to wait until it's proven necessary.

What I would truly resist is any solution that involves building tier awareness directly into (one instance of) DHT. Besides requiring a much larger development effort in the present, it would throw away the benefit of modularity and hamper other efforts in the future. We need tiering and brick splitting *now*, especially as a complement to erasure coding, which many won't be able to use otherwise. As far as I can tell, stacking translators is the fastest way to get there.

> Another problem I see is that splitting bricks will require a rebalance, which is a costly operation. It doesn't seem right to require such an expensive operation every time you add a new condition on an already created volume.
Yes, rebalancing is expensive, but that's no different for split bricks than for whole ones. Any time you change the definition of what should go where, you'll have to move some data into compliance, and that's expensive. However, such operations are likely to be very rare. It's highly likely that most uses of this feature will consist of a simple two-tier setup defined when the volume is created and never changed thereafter, so the only rebalancing would be within a tier - i.e. the exact same thing we do today in homogeneous volumes (maybe even slightly better). The only use case I can think of that would involve *frequent* tier-config changes is multi-tenancy, but adding a new tenant should only affect new data and not require migration of old data.
[Gluster-devel] Weekly GlusterFS Community meeting minutes
Thanks to everyone who attended and participated in our Weekly Community Meeting today. :)

The points which stand out as most important/interesting are:

* GlusterFS 3.4.5 beta1 tarball and rpms to be created soon (next few days likely). This is pretty much GlusterFS 3.4.4-2 with the memory leak fix backported by Martin Svec.
* GlusterFS 3.6 feature freeze date is not far away now - 5th July.
* Automatic NetBSD build testing is almost working.
* Automatic FreeBSD build testing will be set up in the next few weeks.
* James Shubin has been working on btrfs pieces for puppet-gluster. He'd really like people to test it out: https://github.com/purpleidea/puppet-gluster/tree/feat/btrfs :)

Meeting minutes:
http://meetbot.fedoraproject.org/gluster-meeting/2014-06-25/gluster-meeting.2014-06-25-15.10.html

Full meeting logs:
http://meetbot.fedoraproject.org/gluster-meeting/2014-06-25/gluster-meeting.2014-06-25-15.10.log.html

Regards and best wishes,
Justin Clift

--
GlusterFS - http://www.gluster.org
An open source, distributed file system scaling to several petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
[Gluster-devel] Reviewing patches early
Justin asked me, as the group's official Grumpy Old Man, to send a note reminding people about the importance of reviewing patches early. Here it is.

As I see it, we've historically had two problems with reviews:

(1) Patches that don't get reviewed at all.
(2) Patches that have to be re-worked continually due to late reviews.

We've made a lot of progress on (1), especially with the addition of more maintainers, so this is about (2). As a patch gets older, it becomes increasingly likely that it will be rebased and regression tests will have to be re-run because of merge conflicts. This isn't a problem for features to which Red Hat has graciously assigned more than one developer, as they review each other's work and the patch gets merged quickly (sometimes before other interested parties have even had a chance to see it in the queue, but that's a different problem). However, it creates a problem for *every other patch*, which might now have to be rebased etc. - even those that are older, more important to users, and up against tighter deadlines.

This priority inversion can often be avoided if people who intend to review a patch would do so sooner, so that all of the review re-work can be done before new merge conflicts are created. Given the differences in time zones throughout our group, each round of such unnecessary work can cost an entire day, leading to even more potential for further merge conflicts. It's a vicious cycle that we need to break. Please, get all of those complaints about tabs and spaces and variable names in *early*, and help us keep the improvements flowing to our users.
[Gluster-devel] glusterfs-release-3.4 released
SRC: http://bits.gluster.org/pub/gluster/glusterfs/src/glusterfs-release-3.4.tar.gz

This release is made off jenkins-release-78

-- Gluster Build System
Re: [Gluster-devel] 3.5.1-beta2 Problems with suid and sgid bits on directories
Hi Anders,

There are multiple problems that I see in the test provided; here is an answer to one of them and the reason why it occurs. It gets into the code and functions a bit, but the bottom line is that on one code path the setattr that DHT does misses setting the SGID bit, causing the problem observed.

- When a directory is healed on a newly added brick it loses the SGID mode bit

This is happening for two reasons. First, mkdir does not honor the SGID mode bit [1]. So when initially creating the directory when there is a single brick, an strace of the mkdir command shows an fchmod which actually changes the mode of the directory to add the SGID bit to it.

In DHT we get into dht_lookup_dir_cbk as part of the lookup when creating the new directory .../dir2, as the graph has changed due to a brick addition (otherwise we would have gone into the revalidate path where the previous fix was made). Here we call the function dht_selfheal_directory, which would create the missing directories with the expected attributes.

DHT winds a call to mkdir as part of dht_selfheal_directory (in dht_selfheal_dir_mkdir, where it winds a call to mkdir for all subvolumes that are missing the directory) with the right mode bits (in this case with the SGID bit). As the POSIX layer on the brick calls mkdir, the SGID bit is not set for the newly created directory due to [1].

After calling mkdir, DHT winds a setattr to set the mode bits straight, but it ends up using the mode bits returned in the iatt (stat) information by the just-concluded mkdir wind, which has the SGID bit missing, because mkdir returns the stat information from posix_mkdir by doing a stat post mkdir. Hence we never end up setting the SGID bit in the setattr part of DHT.

Rectification of the problem would be in (need to close out some more analysis) dht_selfheal_dir_mkdir_cbk, where we need to pass to the subsequent dht_selfheal_dir_setattr the right mode bits to set on the directories.

I will provide a patch for the above issue, post testing it with the provided script, possibly tomorrow. This would make the directory equal on all the bricks, and further discrepancies from the mount point or on the backend should not be seen.

One of the other problems seems to stem from which stat information we pick in DHT to return to the mount; the above fix would take care of that issue as well, but it is still something that needs some understanding and possible correction.

[1] see NOTES in man 2 mkdir

Shyam
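To see reason [1] in isolation, here is a small stand-alone C sketch - independent of GlusterFS, using an arbitrary test path - showing that on Linux mkdir(2) does not apply the S_ISGID bit passed in its mode argument (see NOTES in man 2 mkdir), so an explicit chmod(2)/fchmod(2) afterwards is required. That follow-up chmod is exactly the step the DHT selfheal setattr needs to perform with the intended mode, rather than the mode reported back by the mkdir call:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int
    main (void)
    {
            struct stat  st;
            const char  *path = "/tmp/sgid-demo";   /* arbitrary test path */

            /* Request rwxr-sr-x (02755). Per the NOTES in man 2 mkdir, Linux
             * ignores the S_ISGID bit in the mode argument; the bit only
             * appears if it is inherited from the parent directory. */
            if (mkdir (path, 02755) != 0) {
                    perror ("mkdir");
                    return 1;
            }
            stat (path, &st);
            printf ("after mkdir: %04o\n", (unsigned) (st.st_mode & 07777));
            /* typically prints 0755 (with a 022 umask), not 2755 */

            /* The explicit chmod is what mkdir(1) does behind the scenes,
             * and what the selfheal setattr must do with the *intended*
             * mode bits. */
            if (chmod (path, 02755) != 0) {
                    perror ("chmod");
                    return 1;
            }
            stat (path, &st);
            printf ("after chmod: %04o\n", (unsigned) (st.st_mode & 07777));
            /* now prints 2755 */

            rmdir (path);
            return 0;
    }

Note that if the demo is run under a parent directory that already has the setgid bit, the bit will show up after mkdir anyway because of the inheritance rule, not because of the mode argument.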
Re: [Gluster-devel] Bug#751888: glusterfs-server: creating symlinks generates errors
On 20/06/2014, at 2:32 PM, Matteo Checcucci wrote:
> On 06/20/2014 03:05 PM, Ravishankar N wrote:
>> Yes, just sent a patch for review on master: http://review.gluster.org/#/c/8135/
>> Once it gets accepted, will back-port it to the 3.5 branch
>
> I am looking forward to seeing it back-ported and integrated in the debian package.

Btw, the backport of this was merged into the release-3.5 branch yesterday. It'll be in GlusterFS 3.5.2.

Hope that helps. :)

Regards and best wishes,
Justin Clift
Re: [Gluster-devel] Ignore the rackspace-regression-2GB-triggered queue in Jenkins for now
On 25/06/2014, at 11:19 PM, Justin Clift wrote:
> There's a new rackspace-regression-2GB-triggered queue in Jenkins on build.gluster.org. Please ignore it for now. I'm just experimenting with having Gerrit automatically trigger regression tests.

This seems to be working ok, so I've enabled it. Failure/success now IS going into Gerrit.

If this turns out to work ok, we won't need to manually trigger new regression tests. :)

If any weirdness seems to happen, feel free to disable this and manually trigger regression tests like normal. :)

Regards and best wishes,
Justin Clift
Re: [Gluster-devel] Reviewing patches early
On 26/06/2014, at 1:40 AM, Pranith Kumar Karampuri wrote:
<snip>
> While I agree with everything you said, complaining about tabs/spaces should be done by a script - something like http://review.gluster.com/#/c/5404

+1

And we can use a git trigger to reject future patches that have tabs in them.

For bonus points, we should put info on the wiki on how to configure our editors to do spaces properly. eg .vimrc settings, and that kind of thing.

+ Justin
Re: [Gluster-devel] Reviewing patches early
On 26/06/2014, at 2:12 AM, Pranith Kumar Karampuri wrote:
> On 06/26/2014 06:19 AM, Justin Clift wrote:
>> On 26/06/2014, at 1:40 AM, Pranith Kumar Karampuri wrote:
>> <snip>
>>> While I agree with everything you said, complaining about tabs/spaces should be done by a script - something like http://review.gluster.com/#/c/5404
>>
>> +1
>>
>> And we can use a git trigger to reject future patches that have tabs in them.
>
> We can probably do it at the time of './rfc.sh'. It probably is easier as well? Have the script in the repo and run it against the patches that are to be submitted.

Whatever works. :)

+ Justin