Re: [Gluster-devel] rpc throttling causing ping timer expiries while running iozone
On Tue, Apr 22, 2014 at 11:49 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: Hi, When iozone is in progress and the number of blocking inodelks is greater than the threshold number of rpc requests allowed for that client (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT), subsequent requests from that client will not be read until all the outstanding requests are processed and replied to. But because no more requests are read from that client, the unlocks on the already granted locks never arrive, so the number of outstanding requests never comes down. This leads to a ping-timeout on the client. I am wondering whether the proper fix for this is to not count INODELK/ENTRYLK/LK calls toward throttling. I made such a change in the codebase and tested it, and it works. Please let me know if this is acceptable or whether it needs to be fixed differently. Do you know why there were 64 outstanding inodelk requests? What does iozone do to result in this kind of locking pattern? ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
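To make the proposal concrete, here is a minimal, self-contained sketch of exempting lock fops from the outstanding-request count. All struct, constant and function names below are illustrative stand-ins, not the actual glusterfs rpcsvc code.

/* Illustrative sketch only: exempt lock-class procedures
 * (INODELK/ENTRYLK/LK) from the outstanding-request count so that
 * blocked lock requests cannot starve the unlock that would release
 * them. Names and constants are hypothetical. */
#include <stdbool.h>

#define OUTSTANDING_RPC_LIMIT 64   /* stand-in for RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT */

enum fop_type { FOP_WRITE, FOP_READ, FOP_INODELK, FOP_ENTRYLK, FOP_LK, FOP_OTHER };

struct client_conn {
        int outstanding;           /* requests received but not yet replied to */
};

static bool
is_lock_fop (enum fop_type fop)
{
        return fop == FOP_INODELK || fop == FOP_ENTRYLK || fop == FOP_LK;
}

/* Called when a request is read from the socket: returns true if the
 * transport should stop reading further requests from this client. */
static bool
throttle_on_request (struct client_conn *conn, enum fop_type fop)
{
        if (is_lock_fop (fop))
                return false;      /* lock fops do not count toward the limit */
        conn->outstanding++;
        return conn->outstanding >= OUTSTANDING_RPC_LIMIT;
}

/* Called when the reply for a counted request is submitted. */
static void
throttle_on_reply (struct client_conn *conn, enum fop_type fop)
{
        if (!is_lock_fop (fop) && conn->outstanding > 0)
                conn->outstanding--;
}

With accounting like this, blocked INODELK requests can pile up past the limit without stopping the transport from reading the unlock that would eventually release them, which is exactly the starvation described above.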
Re: [Gluster-devel] [Gluster-users] Status on Gluster on OS X (10.9)
I did now. I'd recommend adding a check for libintl.h in configure.ac and failing gracefully with a suggestion to install gettext. Thanks On Fri, Apr 4, 2014 at 10:59 PM, Dennis Schafroth den...@schafroth.dk wrote: On 05 Apr 2014, at 07:38 , Anand Avati av...@gluster.org wrote: And here: ./gf-error-codes.h:12:10: fatal error: 'libintl.h' file not found I guess I was wrong that gettext / libintl.h was not required. It seems to be in use in logging.c. Until I figure out whether this is the case, I would suggest installing gettext. cheers, :-Dennis On Fri, Apr 4, 2014 at 10:15 PM, Dennis Schafroth den...@schafroth.dk wrote: Pushed a fix to make it work without the gettext / libintl header. I compiled without the CFLAGS and LDFLAGS Hmm. Apparently not. cheers, :-Dennis On 05 Apr 2014, at 07:04 , Dennis Schafroth den...@schafroth.dk wrote: Bummer. That is from gettext, which I thought was only optional. I got it using either Homebrew (http://brew.sh/) or MacPorts. Homebrew seems quite good these days; I would probably recommend that. It installs with a one-liner into /usr/local, but requires sudo along the way to set the rights: brew install gettext It will require setting some CFLAGS / LDFLAGS when running ./configure: LDFLAGS=-L/usr/local/opt/gettext/lib CPPFLAGS=-I/usr/local/opt/gettext/include cheers, :-Dennis On 05 Apr 2014, at 06:56 , Anand Avati av...@gluster.org wrote: Build fails for me: Making all in libglusterfs Making all in src CC libglusterfs_la-dict.lo CC libglusterfs_la-xlator.lo CC libglusterfs_la-logging.lo logging.c:26:10: fatal error: 'libintl.h' file not found #include <libintl.h> ^ 1 error generated. make[4]: *** [libglusterfs_la-logging.lo] Error 1 make[3]: *** [all] Error 2 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 How did you get libintl.h on your system? Also, please add a check for it in configure.ac and report the missing package. Thanks, On Fri, Apr 4, 2014 at 6:08 PM, Dennis Schafroth den...@schafroth.dk wrote: It's been quiet on this topic, but actually Harshavardhana and I have been quite busy off-line working on this. Since my initial success we have been able to get it to compile with clang (almost as clean as with gcc) and actually run. The latter was a bit tricky because clang has a stricter policy about exporting inline functions, which ended with many runs failing on missing functions. So right now I can run everything, but there is a known issue with NFS/NLM4; this should not matter for people trying to run the client with OSX FUSE. Anyone brave enough to try the *client* can check it out. You still need Xcode + command line tools (clang, make) and an installed OSXFUSE (FUSE for OS X). $ git clone g...@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs.git $ cd osx-glusterfs Either $ ./configure.osx Or - $ ./autogen.sh (requires aclocal, autoconf, automake) - $ ./configure $ make $ sudo make install You should be able to mount using: sudo glusterfs --volfile=<your vol file>.vol <mount point> And yes, this is very much bleeding edge. My Mac had a kernel panic yesterday when it was running both client and server. I would really like to get feedback from anyone trying this out.
cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
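For reference, one possible shape of the graceful fallback discussed in this thread, assuming configure.ac gains an AC_CHECK_HEADERS([libintl.h]) check that defines HAVE_LIBINTL_H. The fallback macro is a sketch, not the actual change that was pushed.

/* Sketch of making the libintl.h dependency optional in a source file
 * such as logging.c. HAVE_LIBINTL_H is assumed to come from a
 * configure check; the fallback macro is an assumption for this
 * example, not the actual glusterfs change. */
#if defined(HAVE_LIBINTL_H)
#include <libintl.h>
#else
/* degrade gracefully: no message translation */
#define gettext(msgid) (msgid)
#endif

#include <stdio.h>

int
main (void)
{
        printf ("%s\n", gettext ("hello from a gettext-optional build"));
        return 0;
}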
Re: [Gluster-devel] [Gluster-users] Status on Gluster on OS X (10.9)
Build fails for me: Making all in libglusterfs Making all in src CC libglusterfs_la-dict.lo CC libglusterfs_la-xlator.lo CC libglusterfs_la-logging.lo logging.c:26:10: fatal error: 'libintl.h' file not found #include <libintl.h> ^ 1 error generated. make[4]: *** [libglusterfs_la-logging.lo] Error 1 make[3]: *** [all] Error 2 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 How did you get libintl.h on your system? Also, please add a check for it in configure.ac and report the missing package. Thanks, On Fri, Apr 4, 2014 at 6:08 PM, Dennis Schafroth den...@schafroth.dk wrote: It's been quiet on this topic, but actually Harshavardhana and I have been quite busy off-line working on this. Since my initial success we have been able to get it to compile with clang (almost as clean as with gcc) and actually run. The latter was a bit tricky because clang has a stricter policy about exporting inline functions, which ended with many runs failing on missing functions. So right now I can run everything, but there is a known issue with NFS/NLM4; this should not matter for people trying to run the client with OSX FUSE. Anyone brave enough to try the *client* can check it out. You still need Xcode + command line tools (clang, make) and an installed OSXFUSE (FUSE for OS X). $ git clone g...@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs.git $ cd osx-glusterfs Either $ ./configure.osx Or - $ ./autogen.sh (requires aclocal, autoconf, automake) - $ ./configure $ make $ sudo make install You should be able to mount using: sudo glusterfs --volfile=<your vol file>.vol <mount point> And yes, this is very much bleeding edge. My Mac had a kernel panic yesterday when it was running both client and server. I would really like to get feedback from anyone trying this out. cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Status on Gluster on OS X (10.9)
And here: ./gf-error-codes.h:12:10: fatal error: 'libintl.h' file not found On Fri, Apr 4, 2014 at 10:15 PM, Dennis Schafroth den...@schafroth.dk wrote: Pushed a fix to make it work without the gettext / libintl header. I compiled without the CFLAGS and LDFLAGS cheers, :-Dennis On 05 Apr 2014, at 07:04 , Dennis Schafroth den...@schafroth.dk wrote: Bummer. That is from gettext, which I thought was only optional. I got it using either Homebrew (http://brew.sh/) or MacPorts. Homebrew seems quite good these days; I would probably recommend that. It installs with a one-liner into /usr/local, but requires sudo along the way to set the rights: brew install gettext It will require setting some CFLAGS / LDFLAGS when running ./configure: LDFLAGS=-L/usr/local/opt/gettext/lib CPPFLAGS=-I/usr/local/opt/gettext/include cheers, :-Dennis On 05 Apr 2014, at 06:56 , Anand Avati av...@gluster.org wrote: Build fails for me: Making all in libglusterfs Making all in src CC libglusterfs_la-dict.lo CC libglusterfs_la-xlator.lo CC libglusterfs_la-logging.lo logging.c:26:10: fatal error: 'libintl.h' file not found #include <libintl.h> ^ 1 error generated. make[4]: *** [libglusterfs_la-logging.lo] Error 1 make[3]: *** [all] Error 2 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 How did you get libintl.h on your system? Also, please add a check for it in configure.ac and report the missing package. Thanks, On Fri, Apr 4, 2014 at 6:08 PM, Dennis Schafroth den...@schafroth.dk wrote: It's been quiet on this topic, but actually Harshavardhana and I have been quite busy off-line working on this. Since my initial success we have been able to get it to compile with clang (almost as clean as with gcc) and actually run. The latter was a bit tricky because clang has a stricter policy about exporting inline functions, which ended with many runs failing on missing functions. So right now I can run everything, but there is a known issue with NFS/NLM4; this should not matter for people trying to run the client with OSX FUSE. Anyone brave enough to try the *client* can check it out. You still need Xcode + command line tools (clang, make) and an installed OSXFUSE (FUSE for OS X). $ git clone g...@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs.git $ cd osx-glusterfs Either $ ./configure.osx Or - $ ./autogen.sh (requires aclocal, autoconf, automake) - $ ./configure $ make $ sudo make install You should be able to mount using: sudo glusterfs --volfile=<your vol file>.vol <mount point> And yes, this is very much bleeding edge. My Mac had a kernel panic yesterday when it was running both client and server. I would really like to get feedback from anyone trying this out. cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] glfs_futimens function is not implemented
Because futimes() is not a POSIX (or any standard) call. We can have #ifdefs and call futimes(), but it hasn't been a priority (you're welcome to send a patch). Thanks Avati On Mon, Mar 31, 2014 at 10:00 AM, Thiago da Silva thi...@redhat.com wrote: Hi, While testing libgfapi I noticed that glfs_futimens was returning -1 with errno set to ENOSYS. Digging a little deeper shows that it was never implemented. Here's the current function definition: https://github.com/gluster/glusterfs/blob/master/xlators/storage/posix/src/posix.c#L463 Does anybody know if there's a particular reason as to why it was never implemented? Thanks, Thiago ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
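A rough sketch of what an #ifdef-based implementation could look like, using futimens() where available and falling back to futimes(). This is illustrative only, not the actual posix xlator patch; HAVE_FUTIMENS and HAVE_FUTIMES are assumed to come from configure checks.

/* Sketch of an #ifdef'd fallback for setting timestamps on an open fd. */
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <errno.h>

static int
set_fd_times (int fd, const struct timespec ts[2])
{
#if defined(HAVE_FUTIMENS)
        return futimens (fd, ts);          /* POSIX.1-2008 */
#elif defined(HAVE_FUTIMES)
        struct timeval tv[2];
        tv[0].tv_sec  = ts[0].tv_sec;
        tv[0].tv_usec = ts[0].tv_nsec / 1000;
        tv[1].tv_sec  = ts[1].tv_sec;
        tv[1].tv_usec = ts[1].tv_nsec / 1000;
        return futimes (fd, tv);           /* BSD/Linux extension */
#else
        errno = ENOSYS;                    /* platform offers neither call */
        return -1;
#endif
}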
Re: [Gluster-devel] release 3.4.3?
On Mon, Mar 24, 2014 at 9:50 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 03/24/2014 08:35 AM, Kaleb S. KEITHLEY wrote: I've been begging for an additional (+2) review for http://review.gluster.org/#/c/6737. It has two +1 reviews, but the matching fixes for release-3.5 and master have three +1 reviews and one +1 review respectively and have not been merged, so I'm reluctant to merge this. Progress. http://review.gluster.org/#/c/6737 has received a +2 (thanks Jeff), but the matching fixes for release-3.5 and master, http://review.gluster.org/6736 and http://review.gluster.org/5075 respectively, also still need +2. I am still reluctant to take this fix into release-3.4 unless I'm certain the corresponding fixes will also be taken into release-3.5 and master. Please give me some time to review the changes to master. I'll do it ASAP. ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] How to compile gluster 3.4 on Mac OS X 10.8.4?
(moving to gluster-devel@) That is great progress! Please keep posting the intermediate work upstream (into gerrit) as you move along. Regarding the hang: do you have cli.log printing anything at all (typically /var/log/glusterfs/cli.log)? Avati On Wed, Mar 19, 2014 at 5:07 PM, Dennis Schafroth den...@schafroth.dk wrote: I now have a branch of HEAD compiling under OS X 10.9, when I disable the qemu-block and fusermount options. Still having a build issue with libtool and libspl, which I have only hacked my way around. Actually both glusterd and gluster run, but using gluster (OS X) hangs on both pool list and peer probe <other server>. However, probing glusterd from a Linux host succeeds. But glusterd's log does indicate some issue. cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal: GlusterFS Quattro
On Fri, Mar 7, 2014 at 8:13 AM, Jeff Darcy jda...@redhat.com wrote: As a counterpoint to the current GlusterFS proposal, I've written up a bunch of ideas that I'm collectively calling GlusterFS Quattro. It's in Google Docs so that people can comment. Please do. ;) http://goo.gl/yE3O4j Thanks for sharing this, Jeff. Towards the end of my visit to the Bangalore Red Hat office this time (from which I just returned a couple days ago) we got to discuss the 4.x proposal at a high level (less about specifics, more in general). A concern raised by many was that if a new release is too radical (the analogy given was samba4 vs samba3 - coincidentally the same major number), it would result in way too much confusion and overhead (e.g. lots of people want to stick with 3.x as 4.x is not yet stable, and this results in 3.x getting more stable and being a negative incentive to move over to 4.x, especially where distributions/ISVs are concerned). The conclusion was that the 4.x proposal would be downsized to only have the management layer changes, while the data layer (DHT, stripe, etc.) changes would be introduced piece by piece (as they get ready) independent of whether the current master is for 3.x or 4.x. Given the background, it only makes sense to retain the guiding principles of the feedback, reconcile the changes proposed to the management layer in the two proposals, and limit the scope of 4.x to management changes. Thoughts? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal: GlusterFS Quattro
On Fri, Mar 7, 2014 at 11:56 AM, Jeff Darcy jda...@redhat.com wrote: Given the background, it only makes sense to retain the guiding principles of the feedback, and reconcile the changes proposed to management layer in the two proposals and retain the scope of 4.x to management changes. Thoughts? I think we need to take a more careful look at dependencies between various items before we decide what should be in 4.0 vs. earlier/later. For example, several other features depend on being able to subdivide storage that the user gives us into smaller units. That feature itself depends on multiplexing those smaller units (whether we call them s-bricks or something else) onto fewer daemons/ports. So which one is the 4.0 feature? If we have a clear idea of which parts are independent and which ones must be done sequentially, then I think we'll be better able to draw a line which separates 3.x from 4.x at the most optimal point. The brick model is probably the borderline item which touches upon both management layer and data layer to some extent. Decreasing the number of processes/ports in general is a good thing, and to that end we need our brick processes to be more flexible/dynamic (able to switch a graph on the fly, add a new export directory on the fly etc.) - which is completely lacking today. I think, by covering this piece (brick model) we should be mostly able to classify rest of the changes into management vs data path in a more clear way. That being said we still need a low level design of how to make the brick process more dynamic (though it is mostly a matter of just getting it done) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Barrier design issues wrt volume snapshot
On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi kpart...@redhat.com wrote: - Original Message - On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur vbel...@redhat.com wrote: Adding gluster-devel. On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote: All, In recent discussions around design (and implementation) of the barrier feature, couple of things came to light. 1) changelog xlator needs barrier xlator to block unlink and rename FOPs in the call path. This is apart from the current list of FOPs that are blocked in their call back path. This is to make sure that the changelog has a bounded queue of unlink and rename FOPs, from the time barriering is enabled, to be drained, committed to changelog file and published. Why is this necessary? The only consumer of changelog today, georeplication, can't tolerate missing unlink/rename entries from changelog, even with the initial xsync based crawl, until changelog entries are available for the master volume. So, changelog xlator needs to ensure that the last rotated (publishable) changelog should have entries for all the unlink(s)/rename(s) that made it to the snapshot. For this, changelog needs barrier xlator to block unlink/rename FOPs in the call path too. Hope that helps. This sounds like a very changelog specific requirement. This is best addressed in the changelog translator itself. If unlink/rmdir/renames should not be in progress during a snapshot, then we need to hold off new ops in the call path, trigger a log rotation and the rotation should wait for completion of ongoing fops anyways. 2) It is possible in a pure distribute volume that the following sequence of FOPs could result in snapshots of bricks disagreeing on inode type for a file or directory. t1: snap b1 t2: unlink /a t3: mkdir /a t4: snap b2 where, b1 and b2 are bricks of a pure distribute volume V. The above sequence can happen with the current barrier xlator design, since we allow unlink FOPs to go through to the disk and only block their acknowledgement to the application. This implies a concurrent mkdir on the same name could succeed, since DHT doesn't serialize unlink and mkdir FOPs, unlike AFR. Avati, I hear that you have a solution for problem 2). Could you please start the discussion on this thread? It would help us to decide how to go about with the barrier xlator implementation. The solution is really a long pending implementation of dentry serialization in the resolver of protocol server. Today we allow multiple FOPs to happen in parallel which modify the same dentry. This results in hairy races (including non atomicity of rename) and has been kept open for a while now. Implementing the dentry serialization in the resolver will solve 2 as a side effect. Hence that is a better approach than making changes in the barrier translator. I am not sure I understood how this works from the brief introduction above. Could you explain a bit? By dentry serialization, I mean we should have only one operation modifying a pargfid/bname at a given time. This needs changes in the resolver of protocol server and possibly some changes in the inode table. This is really for solving rare races, and I think is something we need to work on independent of the snapshot requirements. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
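To illustrate the dentry serialization idea in isolation, here is a toy, self-contained sketch that allows only one in-flight operation per (parent-gfid, basename) pair at a time. The real change would live in the protocol/server resolver and the inode table; the table layout, names and sizes below are assumptions for the example.

/* Toy sketch of "dentry serialization": one in-flight op per
 * (pargfid, basename) pair. Illustrative only. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct dentry_lock {
        char                pargfid[37];   /* uuid string + NUL */
        char                name[256];
        struct dentry_lock *next;
};

static struct dentry_lock *in_flight;
static pthread_mutex_t     table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t      table_cond = PTHREAD_COND_INITIALIZER;

static struct dentry_lock *
find_locked (const char *pargfid, const char *name)
{
        struct dentry_lock *dl;
        for (dl = in_flight; dl; dl = dl->next)
                if (!strcmp (dl->pargfid, pargfid) && !strcmp (dl->name, name))
                        return dl;
        return NULL;
}

/* Block until no other op is modifying the same dentry, then claim it. */
void
dentry_serialize_enter (const char *pargfid, const char *name)
{
        pthread_mutex_lock (&table_lock);
        while (find_locked (pargfid, name))
                pthread_cond_wait (&table_cond, &table_lock);
        struct dentry_lock *dl = calloc (1, sizeof (*dl));
        strncpy (dl->pargfid, pargfid, sizeof (dl->pargfid) - 1);
        strncpy (dl->name, name, sizeof (dl->name) - 1);
        dl->next = in_flight;
        in_flight = dl;
        pthread_mutex_unlock (&table_lock);
}

/* Release the dentry and wake up any waiter queued behind it. */
void
dentry_serialize_exit (const char *pargfid, const char *name)
{
        pthread_mutex_lock (&table_lock);
        struct dentry_lock **pp = &in_flight;
        while (*pp) {
                if (!strcmp ((*pp)->pargfid, pargfid) &&
                    !strcmp ((*pp)->name, name)) {
                        struct dentry_lock *dl = *pp;
                        *pp = dl->next;
                        free (dl);
                        break;
                }
                pp = &(*pp)->next;
        }
        pthread_cond_broadcast (&table_cond);
        pthread_mutex_unlock (&table_lock);
}

With something like this in the resolver, a concurrent unlink and mkdir on the same name would be forced into a definite order, which removes the brick-snapshot disagreement described in problem 2.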
Re: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Xavi, Getting such a caching mechanism has several aspects. First of all we need the framework pieces implemented (particularly server-originated messages to the client for invalidation and revokes) in a well designed way - particularly how we address a specific translator in a message originating from the server. Some of the recent changes to client_t allow server-side translators to get a handle (the client_t object) on which messages can be submitted back to the client. Such a framework (of having server-originated messages) is also necessary for implementing oplocks (and possibly leases) - particularly interesting for the Samba integration. As Jeff already mentioned, this is an area gluster has not focused on, given the targeted use case. However, extending this to internal use cases (to avoid per-operation inodelks) can benefit many modules - encryption/crypt, afr, etc. It seems possible to have a common framework for delegating locks to clients, and build caching coherency protocols / oplocks / inodelk avoidance on top of it. Feel free to share a more detailed proposal if you have a plan - I'm sure the Samba folks (Ira copied) would be interested too. Thanks! Avati On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 04.02.2014 17:18, Jeff Darcy wrote: The only synchronization point needed is to make sure that all bricks agree on the inode state and which client owns it. This can be achieved without locking using a method similar to what I implemented in the DFC translator. Besides the lock-less architecture, the main advantage is that much more aggressive caching strategies can be implemented very near to the final user, considerably increasing the throughput of the file system. Special care has to be taken with things that can fail on background writes (basically brick space and user access rights). Those should be handled appropriately on the client side to guarantee future success of writes. Of course this is only a high level overview. A deeper analysis should be done to see what to do in each special case. What do you think? I think this is a great idea for where we can go - and need to go - in the long term. However, it's important to recognize that it *is* the long term. We had to solve almost exactly the same problems in MPFS long ago. Whether the synchronization uses locks or not *locally* is meaningless, because all of the difficult problems have to do with recovering the *distributed* state. What happens when a brick fails while holding an inode in any state but I? How do we recognize it, what do we do about it, how do we handle the case where it comes back and needs to re-acquire its previous state? How do we make sure that a brick can successfully flush everything it needs to before it yields a lock/lease/whatever? That's going to require some kind of flow control, which is itself a pretty big project. It's not impossible, but it took multiple people some years for MPFS, and ditto for every other project (e.g. Ceph or XtreemFS) which adopted similar approaches. GlusterFS's historical avoidance of this complexity certainly has some drawbacks, but it has also been key to us making far more progress in other areas.
Well, it's true that there will be a lot of tricky cases that will need to be handled to be sure that data integrity and system responsiveness are guaranteed; however, I think they are not more difficult than what can happen currently if a client dies or loses communication while it holds a lock on a file. Anyway, I think there is great potential in this mechanism because it can allow the implementation of powerful caches, even SSD-based ones, that could improve performance a lot. Of course, there is a lot of work in solving all potential failures and designing the right thing. An important consideration is that all these methods try to solve a problem that is seldom found (i.e. having more than one client modifying the same file at the same time). So a solution that has almost 0 overhead for the normal case and allows the implementation of aggressive caching mechanisms seems a big win. To move forward on this, I think we need a *much* more detailed idea of how we're going to handle the nasty cases. Would some sort of online collaboration - e.g. Hangouts - make more sense than continuing via email? Of course, we can talk on IRC or anywhere else if you prefer. Xavi ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] empty xlator
The nop xlator by itself seems OK. Have you tried the stripe config with the nop xlator on top, or even without the nop xlator? Avati On Wed, Feb 5, 2014 at 3:26 PM, Lluís Pàmies i Juárez llpam...@pamies.cat wrote: Hello, As a proof of concept I'm trying to write an xlator that does nothing; I call it nop. The code for nop.c is simply: #include "config.h" #include "call-stub.h" struct xlator_fops fops = {}; struct xlator_cbks cbks = {}; struct xlator_dumpops dumpops = {}; struct volume_options options[] = {{.key={NULL}},}; int32_t init (xlator_t *this){return 0;} int fini (xlator_t *this){return 0;} And I compile it with: $ gcc -fPIC -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -DGF_LINUX_HOST_OS -shared -nostartfiles -lglusterfs -lpthread -I${GFS} -I${GFS}/libglusterfs/src -I${GFS}/contrib/uuid nop.c -o nop.so Then, if I try a test.vol file like this: volume test-posix type storage/posix option directory /home/llpamies/Projects/gluster/test-split/node0-data end-volume volume test-nop type features/nop subvolumes test-posix end-volume volume test-cache type performance/io-cache subvolumes test-nop end-volume and mount it with: $ glusterfs --debug -f test.vol /mount/point It seems to work fine, doing nothing. However, when used together with the stripe xlator as follows: volume test-posix0 type storage/posix option directory /home/llpamies/Projects/gluster/test-split/node0-data end-volume volume test-posix1 type storage/posix option directory /home/llpamies/Projects/gluster/test-split/node1-data end-volume volume test-nop0 type features/nop subvolumes test-posix0 end-volume volume test-nop1 type features/nop subvolumes test-posix1 end-volume volume test-stripe type cluster/stripe subvolumes test-nop0 test-nop1 end-volume glusterfs hangs during the first fuse lookup for .Trash, and /mount/point looks unmounted with permissions etc. Does it look like some bug in the stripe xlator or is there something fundamentally wrong with the nop xlator? Thank you, -- Lluís ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS 4.0 round two?
I agree. I will send out a spin of the 2nd draft soon. Have been caught up in a bunch of other stuff. Avati On Wed, Jan 29, 2014 at 10:08 PM, Amar Tumballi ama...@gmail.com wrote: On Thu, Jan 30, 2014 at 3:30 AM, Jeffrey Darcy jda...@redhat.com wrote: I know we're all busy with other things, but it has been a little over a month since this discussion started. There are a lot of really good comments on the Google Docs version (http://goo.gl/qLw3Vz) and we're at risk of losing our place if we don't try to keep things going. In particular, the issue of how these plans relate to 3.6 feature planning, which also needs to conclude soon. To pick a couple of examples: * There's a 3.6 item to make glusterd more scalable, but there are many more scalability issues that need to be addressed and the later 4.0 proposal tries to tackle a few. Should we even try to address scalability in the 3.x series, or just leave it entirely to 4.x? If we try to do both, how should we resolve the incompatibilities that the second proposal introduces relative to the first? * One of the hottest 3.6 items is tiering, data classification, whatever you want to call it. I say it's hot because everyone else - e.g. Ceph, HDFS, Swift - has recognized this as an important feature and they're all making significant moves here. Again, the 4.0 proposal contains some ideas that touch on this, not always compatible with earlier ideas. Which should we work on, and how should we address their differences? If we don't complete the discussions about 4.0, we won't be able to reach any reasonable conclusions about when/how it should diverge from 3.x. Should we set a deadline for a second draft and/or an IRC meeting to discuss the comments we've already collected? +1 It is very important to keep momentum for this, otherwise, the amount of work planned for 4.0 would never be 'done'. Also, would be very important to track the deadlines as an action item in every weekly IRC meeting. Regards, Amar ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Gerrit doesn't use HTTPS
On Sat, Dec 14, 2013 at 5:58 AM, James purplei...@gmail.com wrote: On Sat, Dec 14, 2013 at 3:28 AM, Vijay Bellur vbel...@redhat.com wrote: On 12/13/2013 04:05 AM, James wrote: I just noticed that the Gluster Gerrit [1] doesn't use HTTPS! Can this be fixed ASAP? Configured now, thanks! Thanks for looking into this promptly! Please check and let us know if you encounter any problems with https. 1) None of the CN information (name, location, etc) has been filled in... Either that or I'm hitting a MITM (less likely). 2) Ideally the certificate would be signed. If it's not signed, you should at least publish the correct signature somewhere we trust. If you need help wrangling any of the SSL, I'm happy to help! IIRC we should be having a CA signed cert for *.gluster.org. Copying JM. Avati -Vijay Thanks! James ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Bug in locks deletion upon disconnet
I have a fix in the works for this (with more cleanups to the locks xlator). Avati On Fri, Dec 13, 2013 at 2:14 AM, Raghavendra Bhat rab...@redhat.com wrote: Hi, There seems to be a bug in the ltable cleanup when disconnect is received in 3.5 and master. It's easy to reproduce. Just create a replicate volume. Start running dbench on the mount point, and do graph changes. The brick processes will crash while doing the ltable cleanup. Pranith and I looked at the code and found the issues below. static void ltable_delete_locks (struct _lock_table *ltable) { struct _locker *locker = NULL; struct _locker *tmp = NULL; list_for_each_entry_safe (locker, tmp, &ltable->inodelk_lockers, lockers) { if (locker->fd) pl_del_locker (ltable, locker->volume, &locker->loc, locker->fd, &locker->owner, GF_FOP_INODELK); GF_FREE (locker->volume); GF_FREE (locker); } list_for_each_entry_safe (locker, tmp, &ltable->entrylk_lockers, lockers) { if (locker->fd) pl_del_locker (ltable, locker->volume, &locker->loc, locker->fd, &locker->owner, GF_FOP_ENTRYLK); GF_FREE (locker->volume); GF_FREE (locker); } GF_FREE (ltable); } In the above function, the list of inodelks and entrylks is traversed and pl_del_locker is called for each lock with an fd. But in pl_del_locker, we are collecting all the locks with the same volume and owner as the arguments and deleting them at once (that too without unlocking them). But for locks without an fd, we are directly freeing up the objects without deleting them from the list (and without holding the ltable lock). This is the bug logged for the issue. https://bugzilla.redhat.com/show_bug.cgi?id=1042764 Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Mechanisms for automatic management of Gluster
James, This is the right way to think about the problem. I have more specific comments in the script, but just wanted to let you know this is a great start. Thanks! On Wed, Nov 27, 2013 at 7:42 AM, James purplei...@gmail.com wrote: Hi, This is along the lines of tools for sysadmins. I plan on using these algorithms for puppet-gluster, but will try to maintain them separately as a standalone tool. The problem: Given a set of bricks and servers, if they have a logical naming convention, can an algorithm decide the ideal order. This could allow parameters such as replica count, and chained=true/false/offset#. The second problem: Given a set of bricks in a volume, if someone adds X bricks and removes Y bricks, is this valid, and what is the valid sequence of add/remove brick commands. I've written some code with test cases to try and figure this all out. I've left out a lot of corner cases, but the boilerplate is there to make it happen. Hopefully it's self explanatory. (gluster.py) Read and run it. Once this all works, the puppet-gluster use case is magic. It will be able to take care of these operations for you (if you want). For non puppet users, this will give admins the confidence to know what commands they should _probably_ run in what order. I say probably because we assume that if there's an error, they'll stop and inspect first. I haven't yet tried to implement the chained cases, or anything involving striping. There are also some corner cases with some of the current code. Once you add chaining and striping, etc, I realized it was time to step back and ask for help :) I hope this all makes sense. Comments, code, test cases are appreciated! Cheers, James @purpleidea (irc/twitter) https://ttboj.wordpress.com/ ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
[Gluster-devel] GlusterFS 4.0 plan
Hello all, Here is a working draft of the plan for 4.0. It has pretty significant changes from the current model. Sending it out for early review/feedback. Further revisions will follow over time. https://gist.github.com/avati/af04f1030dcf52e16535#file-plan-md Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] important change to syncop infra
The decision of doing away with -O2 is beyond our control, and we shouldn't have code which depends on optimization being disabled in order to behave properly. Representing -errno as the return value is the cleanest fix (that's how other projects which use setcontext/getcontext behave too). If there are any further issues which arise from setcontext/getcontext, I'm tempted to change the internal implementation to use a vanilla pthread pool. Avati On Wed, Dec 11, 2013 at 10:29 PM, Anand Subramanian ansub...@redhat.com wrote: Is doing away with -O2 an option that was ever considered, or is it that we simply must have O2 on? (I understand that turning off O2 can open some so-far-unexposed can of worms and a lot of soaking may be required, and also that we may have had a good set of perf-related reasons to have settled on -O2 in the first place, but wanted to understand nevertheless...) Anand On 12/11/2013 02:21 PM, Pranith Kumar Karampuri wrote: Hi, We found a day-1 bug when the syncop_xxx() infra is used inside a synctask with compilation optimization (CFLAGS -O2). This bug has been dormant for at least 2 years. There are around ~400 (rebalance, replace-brick, bd, self-heal-daemon, quota, fuse lock/fd migration) places where syncop is used in the code base, all of which are potential candidates which can take the hit. I sent a first round of the patch at http://review.gluster.com/6475 to catch regressions upstream. These are the files that are affected by the changes I introduced to fix this: api/src/glfs-fops.c | 36 ++ api/src/glfs-handleops.c | 15 ++ api/src/glfs-internal.h | 7 +++ api/src/glfs-resolve.c | 10 ++ libglusterfs/src/syncop.c | 117 - xlators/cluster/afr/src/afr-self-heald.c | 45 +- xlators/cluster/afr/src/pump.c | 12 ++-- xlators/cluster/dht/src/dht-helper.c | 24 +++ xlators/cluster/dht/src/dht-rebalance.c | 168 ++-- - xlators/cluster/dht/src/dht-selfheal.c | 6 -- xlators/features/locks/src/posix.c | 3 ++- xlators/features/qemu-block/src/bdrv-xlator.c | 15 -- xlators/features/qemu-block/src/qb-coroutines.c | 14 ++ xlators/mount/fuse/src/fuse-bridge.c | 16 ++- Please review your respective component for these changes in gerrit. Thanks Pranith. Detailed explanation of the root cause: We found the bug in 'gf_defrag_migrate_data' in the rebalance operation. Let's look at the interesting parts of the function: int gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc, dict_t *migrate_data) { . code section - [ Loop ] while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL, entries)) != 0) { . code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a thread) /* Need to keep track of ENOENT errno, that means, there is no need to send more readdirp() */ readdir_operrno = errno; . code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread) ret = syncop_getxattr (this, entry_loc, dict, GF_XATTR_LINKINFO_KEY); code section - [ ERRNO-2 ] (checking for failures of syncop_getxattr(). This may not always be executed in the same thread which executed [SYNCOP-1]) if (ret < 0) { if (errno != ENODATA) { loglevel = GF_LOG_ERROR; defrag->total_failures += 1; . } The function above could be executed by thread (t1) till [SYNCOP-1], and the code from [ERRNO-2] can be executed by a different thread (t2) because of the way the syncop infra schedules the tasks.
when the code is compiled with -O2 optimization this is the assembly code that is generated: [ERRNO-1] 1165readdir_operrno = errno; errno gets expanded as *(__errno_location()) 0x7fd149d48b60 +496:callq 0x7fd149d410c0 __errno_location@plt 0x7fd149d48b72 +514:mov%rax,0x50(%rsp) -- Address returned by __errno_location() is stored in a special location in stack for later use. 0x7fd149d48b77 +519:mov(%rax),%eax 0x7fd149d48b79 +521:
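A self-contained illustration of the calling convention being adopted (return -errno instead of -1 plus thread-local errno), with do_op() standing in for any syncop_xxx() call; this is a sketch of the idea, not the actual patch.

/* Sketch: carry the error code in the return value so a caller that
 * resumes on a different thread still sees the right error. */
#include <errno.h>
#include <stdio.h>

/* stand-in for a syncop: returns 0 on success, -errno on failure */
static int
do_op (int fail_with)
{
        if (fail_with)
                return -fail_with;      /* e.g. -ENODATA, -ENOENT, ... */
        return 0;
}

int
main (void)
{
        int ret = do_op (ENODATA);

        /* the error travels in the return value, not in errno */
        if (ret < 0 && -ret != ENODATA)
                fprintf (stderr, "operation failed: %d\n", -ret);
        else if (ret < 0)
                printf ("ignoring ENODATA, as the rebalance code above does\n");
        return 0;
}

Checking errno after the call, as the pre-fix code does, is exactly what breaks when the task resumes on another thread.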
Re: [Gluster-devel] Standardizing interfaces for BD xlator
Mohan, It would be better to approach this problem by defining the FOPs and behavior at the xlator levels, and let gfapi calls be simple wrappers around the FOPs. We can introduce new FOPs splice() and reflink(), and discuss more on the best MERGE semantics. Avati On Mon, Dec 9, 2013 at 11:48 PM, M. Mohan Kumar mo...@in.ibm.com wrote: Hello, BD xlator provides certain features such as server offloaded copy, snapshot etc. But there is no standard way of invoking these operations due to the limitation in fops and system call interfaces. One has to issue setxattr interface to achieve these offload operations. Using setxattr interface in GlusterFS for all non standard operations becomes ugly and complicated. We are looking for adding new FOPs to cover these operations. glfs interfaces for BD xlator: --- We are looking for adding interfaces to libgfapi to facilitate consuming BD xlator features seamlessly. As of now one has to create a posix file and then issue setxattr/fsetxattr call to create a LV and map that LV to the posix file. For offload operations they have to get the gfid of the destination file and pass that gfid in {f}setxattr interface. Typical users of BD xlator will be qemu-img utility. To create a BD backed file on a GlusterFS volume, qemu-img has to issue glfs_create and glfs_fsetxattr, but it doesn't look elegant. Idea is to provide a single glfs call to create a posix file, BD and map that BD to the posix file. /* SYNOPSIS glfs_bd_creat: Create a posix file, BD and maps the posix file to BD in a BD GlusterFS volume. DESCRIPTION This function creates a posix file BD and maps them. This interface takes care of the transaction consistency case where posix file creation succeeded but BD creation failed for whatever reason, created posix file is deleted to make sure that file is not dangling. PARAMETERS @fs: The 'virtual mount' object to be initialized. @path: Path of the posix file within the virtual mount. @mode: Permission of the file to be created. @flags: Create flags. See open(2). O_EXCL is supported. RETURN VALUES NULL : Failure. @errno will be set with the type of failure. @errno: EOPNOTSUPP if underlying volume is not BD capable. Others : Pointer to the opened glfs_fd_t. */ struct glfs_fd * glfs_bd_create(struct glfs *fs, const char *path, int flags, mode_t mode); Also planning to provide glfs interfaces for other offload features of BD such as snapshot, clone and merge. This API can be used to abstract the steps involved in getting the gfid of the destination file and passing it to the setfattr interface (optionally mode parameter can be used to specify if the destination file has to be created, as of now bd xlator code expects the destination file to exist for offload operations). /* SYNOPSIS glfs_copy: Offloads copy operation between two files. DESCRIPTION This function optionally creates destination posix file and initiates server offloaded copy between them. Optionally based on the mode it could create destination file and issue glfs_{f}setxattr interface to do actual offload operation. PARAMETERS @fs: The 'virtual mount' object to be initialized. @source: Path of the source file within the virtual mount. @dest: Path of the destination file within the virtual mount. @flag: Specifies if destination file need to be created or not. @mode: Permission of the destination file to be created. RETURN VALUES -1 : Failure. @errno will be set with the type of failure. 
0 : Success */ int glfs_copy(struct glfs *fs, const char *source, const char *dest, int mode); Similarly int glfs_snapshot(struct glfs *fs, const char *source, const char *dest, int mode); int glfs_merge(struct glfs *fs, const char *snapshot); Upstream effort for server-offloaded copy and copy on write: --- Clone - offloaded copy: The FS community has already started discussing the interfaces for supporting server-offloaded copy. Initially it started with adding a new syscall 'copy_range' [https://patchwork.kernel.org/patch/2568761/], and later the plan became to use the existing splice system call itself to extend copy between two regular files [http://article.gmane.org/gmane.linux.kernel/1560133]. So is it safe to assume that splice is the way for copy offload, add these FOPs to GlusterFS (and XFS, FUSE as well), and support it in the BD xlator? Snapshot - reflink: Also, there is an upstream effort to provide interfaces for creating copy-on-write files (i.e. snapshots in LVM terminology) using the reflink syscall interface, but it is not merged upstream [http://lwn.net/Articles/331808/]. This snapshot feature is supported by BTRFS and OCFS2 through an ioctl interface. Can we assume it's the way for the snapshot interface and add FOPs similar
Re: [Gluster-devel] Questions on dht rebalance
On Sun, Nov 24, 2013 at 8:42 PM, Muralidhar Balcha muralidh...@gmail.com wrote: Hi, I have a couple of questions on rebalance functionality 1. When I ran the rebalance command, the gluster daemon that is responsible for migrating data terminated itself in the middle of the data migration. I may be missing something. How do I make the daemon wait until the data migration is complete? Under normal circumstances, the rebalance daemon terminates when it believes it has completed transferring all files it is supposed to. Do you have any logs with more details? 2. This question may not be related to rebalance: adding or removing a brick from a volume. If I add a new brick to a volume that is in use, how do existing clients learn about the new brick? Is there a notification mechanism by which each client refreshes its volinfo? Yes, clients poll glusterd waiting for config changes, and will swap to a new xlator graph at runtime if necessary. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Daily Coverity runs for GlusterFS?
Lala, It would be ideal if we hooked this into Jenkins and run along with regression test in parallel. How feasible is it to set it up that way? Avati On Fri, Nov 22, 2013 at 11:51 AM, Lalatendu Mohanty lmoha...@redhat.comwrote: Hi Gluster Ants, There is a way, we can automate the Coverity scan runs and we will get a email like below, which will tell us if any code issues introduced in to the code base. Will it be helpful/good if we run Coverity scan daily with the latest code base and send the results to gluster-devel@nongnu.org? I think it would be helpful. But wanted to take a feed back from you all before doing it. Feedback/ Thoughts? Thanks, Lala Original Message Subject: New Defects reported by Coverity Scan for GlusterFS Date: Thu, 21 Nov 2013 14:14:04 -0800 From: scan-ad...@coverity.com Hi, Please find the latest report on new defect(s) introduced to GlusterFS found with Coverity Scan. Defect(s) Reported-by: Coverity Scan Showing 7 of 10 defect(s) ** CID 1130760: Sizeof not portable (SIZEOF_MISMATCH) /xlators/encryption/crypt/src/data.c: 487 in set_config_avec_data() ** CID 1130759: Sizeof not portable (SIZEOF_MISMATCH) /xlators/encryption/crypt/src/data.c: 566 in set_config_avec_hole() ** CID 1130756: Unchecked return value (CHECKED_RETURN) /xlators/encryption/crypt/src/crypt.c: 2627 in crypt_fsetxattr() ** CID 1130755: Unchecked return value (CHECKED_RETURN) /xlators/encryption/crypt/src/crypt.c: 2649 in crypt_setxattr() ** CID 1130758: Dereference after null check (FORWARD_NULL) /xlators/encryption/crypt/src/crypt.c: 3298 in linkop_grab_local() ** CID 1130757: Null pointer dereference (FORWARD_NULL) /api/src/glfs-fops.c: 718 in glfs_preadv_async() /api/src/glfs-fops.c: 718 in glfs_preadv_async() ** CID 1124349: Unchecked return value (CHECKED_RETURN) /xlators/mgmt/glusterd/src/glusterd-volume-set.c: 120 in validate_cache_max_min_size() /xlators/mgmt/glusterd/src/glusterd-volume-set.c: 129 in validate_cache_max_min_size() /xlators/mgmt/glusterd/src/glusterd-volume-set.c: 130 in validate_cache_max_min_size() To view the defects in Coverity Scan visit, http://scan.coverity.com To unsubscribe from the email notification for new defects, http://scan5.coverity.com/cgi-bin/unsubscribe.py ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] How to find the current offset in an open file through libgfapi
You can: curr_offset = glfs_lseek (glfd, 0, SEEK_CUR) Avati On Wed, Nov 20, 2013 at 10:26 AM, Brad Childs b...@redhat.com wrote: Hello list, I'm trying to find the current offset of an open file through libgfapi. I have the glfs_fd_t of the file, I've done some reading and want to know the current location. Can I achieve this without manually keeping track after every read or seek? -bc ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
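A small usage sketch of the glfs_lseek() approach; the volume name, server and file path are placeholders, and error checking is omitted for brevity.

/* Query the current offset of a libgfapi fd without tracking it manually. */
#include <glusterfs/api/glfs.h>
#include <fcntl.h>
#include <stdio.h>

int
main (void)
{
        glfs_t *fs = glfs_new ("testvol");                      /* placeholder volume */
        glfs_set_volfile_server (fs, "tcp", "server1", 24007);  /* placeholder server */
        glfs_init (fs);

        glfs_fd_t *fd = glfs_open (fs, "/some/file", O_RDONLY); /* placeholder path */
        char buf[128];
        glfs_read (fd, buf, sizeof (buf), 0);

        /* current offset after the read, courtesy of SEEK_CUR */
        off_t curr = glfs_lseek (fd, 0, SEEK_CUR);
        printf ("current offset: %lld\n", (long long) curr);

        glfs_close (fd);
        glfs_fini (fs);
        return 0;
}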
Re: [Gluster-devel] Translator test harness
On Tue, Nov 19, 2013 at 5:35 AM, Jeff Darcy jda...@redhat.com wrote: On 11/18/2013 11:32 PM, Anand Avati wrote: It might be interesting to build a test harness using libgfapi (especially the handle-based APIs) to load a graph with the xlator to be tested (on top of posix) and use gfapi calls to bombard it with fops, notifications and callbacks from multiple threads spawned by the testing app/framework. Along with fault injection, we also need a pedantic verifier translator (loaded both on top of and below the testing xlator) which inspects all params of all calls and callbacks coming out of the xlator to check that they conform to the rules (e.g. lookup_cbk op_ret is either -1 or 0 ONLY, op_errno is one of the known standard values ONLY, struct stat does not have mtime/ctime/atime from too far ahead into the future, mkdir_cbk's struct stat has ia_type set to IA_IFDIR ONLY, etc.) Sounds like we have another volunteer. ;) Certainly. I have a half-done pedantic translator (from a few years ago) lying around somewhere and am trying to dig it out. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Translator test harness
On Tue, Nov 19, 2013 at 6:48 PM, Luis Pabon lpa...@redhat.com wrote: I'm definitely up for it, not just for the translators, but as Avati pointed out, a test harness for the GlusterFS system. I think, if possible, that the translator test harness is really a subclass a GlusterFS unit/functional test environment. I am currently in the process of qualifying some C Unit test frameworks (specifically those that provide mock frameworks -- Cmock, cmockery) to propose to the GlusterFS community as a foundation to the unit/functional tests environment. What I would like to see, and I still have a hard time finding, is a source coverage tool for C. Anyone know of one? gcov (+lcov) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] fallocate
On Sat, Nov 16, 2013 at 4:45 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Anand Avati av...@gluster.org wrote: If you call fallocate() over an existing region with data it shouldn't be wiped with 0s. You can also call fallocate() on a hole (in case the file was ftruncate()d to a large size) and that region should get allocated (i.e. a future write to an fallocate()d region should NOT fail with ENOSPC). It seems it can be emulated; should it be atomic? I am not aware of any app which depends on it being atomic (though Linux implementations probably are). BTW, does NetBSD have the equivalent of the open_by_handle[_at]() and name_to_handle[_at]() system calls? That is extended API set 2. With the exception of fexecve(2), I implemented them in NetBSD-current, which means they will be available in NetBSD-7.0. Are they also mandatory in glusterfs-3.5? If they are, then emulating fallocate() in userland is useless; I would be better off working on it in the kernel for the next release. Oh that's interesting, can I get pointers to see how NetBSD implements open_by_handle() and name_to_handle()? Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Translator test harness
On Mon, Nov 18, 2013 at 8:23 PM, Shyamsundar Ranganathan srang...@redhat.com wrote: - Original Message - From: Jeff Darcy jda...@redhat.com To: gluster-dev Gluster Devel gluster-devel@nongnu.org Sent: Monday, November 18, 2013 8:04:27 PM Subject: [Gluster-devel] Translator test harness Last week, Luis and I had a discussion about unit testing translator code. Unfortunately, the structure of a translator - a plugin with many entry points which interact in complex and instance-specific ways - is one that is notoriously challenging. Really, the only way to do it is to have some sort of a task-specific harness, with at least the following parts: * Code above to inject requests. * Code below to provide mocked replies to the translator's own requests. * Code on the side to track things like resources or locks acquired and released. Interesting. KP (Krishnan P) and I were discussing a fault injection translator (beyond the error injection that already exists in the code base), and were trying to narrow down some faults that we could inject, to check and see if it makes sense to add such a translator. This would be an ambitious undertaking, but not so ambitious that it's beyond reason. The benefits should be obvious. At this point, what I'm most interested in is volunteers to help define the requirements and scope so that we can propose this as a feature or task for some future GlusterFS release. Who's up for it? I would be interested in pitching in on this, and also hearing about extending this effort to cover fault injections if it makes sense. It might be interesting to build a test harness using libgfapi (especially the handle-based APIs) to load a graph with the xlator to be tested (on top of posix) and use gfapi calls to bombard it with fops, notifications and callbacks from multiple threads spawned by the testing app/framework. Along with fault injection, we also need a pedantic verifier translator (loaded both on top of and below the testing xlator) which inspects all params of all calls and callbacks coming out of the xlator to check that they conform to the rules (e.g. lookup_cbk op_ret is either -1 or 0 ONLY, op_errno is one of the known standard values ONLY, struct stat does not have mtime/ctime/atime from too far ahead into the future, mkdir_cbk's struct stat has ia_type set to IA_IFDIR ONLY, etc.) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
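As a flavour of what the pedantic verifier could assert, here is a standalone sketch with stand-in types (this is not an actual translator); the real thing would run checks like these in every _cbk it intercepts.

/* Sketch of "pedantic" callback checks, using stand-in types. */
#include <assert.h>
#include <time.h>

typedef enum { IA_INVAL = 0, IA_IFREG, IA_IFDIR, IA_IFLNK } ia_type_t;

struct iatt_stub {
        ia_type_t ia_type;
        time_t    ia_mtime;
};

static void
verify_mkdir_cbk (int op_ret, int op_errno, const struct iatt_stub *buf)
{
        /* op_ret must be exactly 0 or -1, nothing else */
        assert (op_ret == 0 || op_ret == -1);

        if (op_ret == -1) {
                /* op_errno must be a known, sane errno value */
                assert (op_errno > 0 && op_errno < 4096);
        } else {
                /* a successful mkdir must describe a directory ... */
                assert (buf && buf->ia_type == IA_IFDIR);
                /* ... whose mtime is not from far in the future */
                assert (buf->ia_mtime <= time (NULL) + 5);
        }
}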
Re: [Gluster-devel] fallocate
If you call fallocate() over an existing region with data it shouldn't be wiped with 0s. You can also call fallocate() on a hole (in case the file was ftruncate()d to a large size) and that region should get allocated (i.e. a future write to an fallocate()d region should NOT fail with ENOSPC). BTW, does NetBSD have the equivalent of the open_by_handle[_at]() and name_to_handle[_at]() system calls? On Sat, Nov 16, 2013 at 12:40 PM, Emmanuel Dreyfus m...@netbsd.org wrote: I note that the glusterfs-3.5 branch requires fallocate(). That one does not exist in NetBSD yet. I wonder if it can be emulated in userspace: this is just about writing zeros to the new size, right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
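For what it's worth, a naive, non-atomic userland emulation along the lines discussed could look like the sketch below: it grows the file if needed and then rewrites every block in the range so the filesystem has to allocate it, preserving any existing data. This is an assumption-laden sketch, not a drop-in replacement for fallocate().

/* Naive userland fallocate() emulation: not atomic, not crash-safe. */
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>
#include <unistd.h>

#define EMUL_BLOCK 4096

int
emulate_fallocate (int fd, off_t offset, off_t len)
{
        struct stat st;
        if (fstat (fd, &st) < 0)
                return -1;

        off_t end = offset + len;
        /* extend the file first if the range goes past EOF */
        if (end > st.st_size && ftruncate (fd, end) < 0)
                return -1;
        off_t file_end = (st.st_size > end) ? st.st_size : end;

        char buf[EMUL_BLOCK];
        off_t pos = offset - (offset % EMUL_BLOCK);   /* align down */

        for (; pos < end; pos += EMUL_BLOCK) {
                size_t n = EMUL_BLOCK;
                if ((off_t) n > file_end - pos)
                        n = (size_t) (file_end - pos);
                memset (buf, 0, n);
                ssize_t rd = pread (fd, buf, n, pos);
                if (rd < 0)
                        return -1;
                /* writing the block back (zeros where it was a hole)
                 * forces allocation without clobbering existing data */
                if (pwrite (fd, buf, n, pos) < 0)
                        return -1;
        }
        return 0;
}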
Re: [Gluster-devel] [Gluster-users] Fencing FOPs on data-split-brained files
Ravi, We should not mix up data and entry operation domains; if a file is in data split-brain, that should not stop a user from rename/link/unlink operations on the file. Regarding your concern about complications while healing - we should change our manual fixing instructions to: - go to the backend, access through the gfid path or normal path - rmxattr the afr changelogs - truncate the file to 0 bytes (like filename) Accessing the path through the gfid and truncating to 0 bytes addresses your concerns about hardlinks/renames. Avati On Wed, Nov 13, 2013 at 3:01 AM, Ravishankar N ravishan...@redhat.com wrote: Hi, Currently in glusterfs, when there is a data split-brain (only) on a file, we disallow the following operations from the mount-point by returning EIO to the application: - Writes to the file (truncate, dd, echo, cp etc) - Reads to the file (cat) - Reading extended attributes (getfattr) [1] However we do permit the following operations: -creating hardlinks -creating symlinks -mv -setattr -chmod -chown -touch -ls -stat While it makes sense to allow `ls` and `stat`, is it okay to add checks in the FOPS to disallow the other operations? Allowing creation of links and changing file attributes only seems to complicate things before the admin can go to the backend bricks and resolve the split-brain (by deleting all but the healthy copy of the file, including hardlinks). More so if the file is renamed before addressing the split-brain. Please share your thoughts. Thanks, Ravi [1] http://review.gluster.org/#/c/5988/ ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
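Purely as an illustration of those manual steps, the sketch below does the reset with plain syscalls against a file on the brick; admins would normally use setfattr -x and truncate from a shell instead, and the "trusted.afr." prefix is assumed to be the AFR changelog xattr naming in use.

/* Illustrative reset of a bad split-brain copy on the brick:
 * drop the AFR changelog xattrs and truncate the file to 0 bytes. */
#include <sys/types.h>
#include <sys/xattr.h>
#include <string.h>
#include <unistd.h>

int
reset_split_brain_copy (const char *path_on_brick)
{
        char list[4096];
        ssize_t len = listxattr (path_on_brick, list, sizeof (list));
        if (len < 0)
                return -1;

        /* 1. remove the AFR changelog xattrs (assumed "trusted.afr." prefix) */
        for (ssize_t off = 0; off < len; off += strlen (list + off) + 1) {
                const char *name = list + off;
                if (strncmp (name, "trusted.afr.", 12) == 0)
                        removexattr (path_on_brick, name);
        }

        /* 2. truncate the copy to 0 bytes so self-heal rebuilds it */
        return truncate (path_on_brick, 0);
}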
Re: [Gluster-devel] Rebalance Query
Both rebalance and self-healing actually run at lower priority by default. You can disable this low priority with: # gluster volume set <name> performance.enable-least-priority off Avati On Mon, Nov 11, 2013 at 8:55 PM, Paul Cuzner pcuz...@redhat.com wrote: Hi, I was asked today about the relative priority of rebalance. From what I understand, gluster does not perform any rate-limiting on rebalance or even geo-rep. Is this the case? Also, assuming that we don't rate-limit rebalance, can you confirm whether rebalance fops are lower priority than client I/O and whether there are any mechanisms to influence the priority of the rebalance? Again, from what I can see this isn't possible. Cheers, Paul C ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] glfs_readdir_r is painful
Eric, Thanks for the insights. I have posted a patch at http://review.gluster.org/6201 which clarifies the usage of glfs_readdir_r() and also introduce glfs_readdir(). Thanks, Avati On Wed, Oct 30, 2013 at 11:05 AM, Eric Blake ebl...@redhat.com wrote: On 10/30/2013 11:18 AM, Eric Blake wrote: The only safe way to use readdir_r is to know the maximum d_name that can possibly be returned, but there is no glfs_fpathconf() for determining that information. Your example usage of glfs_readdir_r() suggests that 512 bytes is large enough: https://forge.gluster.org/glusterfs-core/glusterfs/blobs/f44ada6cd9bcc5ab98ca66bedde4fe23dd1c3f05/api/examples/glfsxmp.c but I don't know if that is true. Okay, after a bit more investigation, I see: gf_dirent_to_dirent (gf_dirent_t *gf_dirent, struct dirent *dirent) { dirent-d_ino = gf_dirent-d_ino; #ifdef _DIRENT_HAVE_D_OFF dirent-d_off = gf_dirent-d_off; #endif #ifdef _DIRENT_HAVE_D_TYPE dirent-d_type = gf_dirent-d_type; #endif #ifdef _DIRENT_HAVE_D_NAMLEN dirent-d_namlen = strlen (gf_dirent-d_name); #endif strncpy (dirent-d_name, gf_dirent-d_name, 256); } I also discovered that 'getconf NAME_MAX /path/to/xfs/mount' is 255, so it looks like you got lucky (although strncpy is generally unsafe because it fails to write a NUL terminator if you truncate the string, it looks like you are guaranteed by XFS to never have a string that needs truncation). You _do_ have the advantage that since every brick backing a glusterfs volume is using an xfs file system, then you only have to worry about the NAME_MAX of xfs - but I don't know that value off the top of my head. Again, my research shows it is 255. Can you please let me know how big I should make my struct dirent to avoid buffer overflow, and properly document this in glusterfs/api/glfs.h? Furthermore, can you please provide a much saner glfs_readdir() so I don't have to worry about contortions of using a broken-by-design function? These requests are still in force. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
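Given the gf_dirent_to_dirent() copy of up to 256 bytes quoted above and the NAME_MAX of 255 on XFS, a caller can at least size its buffer so that glfs_readdir_r() always has room for a 255-byte name plus the NUL terminator. A minimal sketch, assuming the glfs_opendir()/glfs_readdir_r()/glfs_closedir() signatures from api/glfs.h (the header path may differ between installs):

    #include <dirent.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <glusterfs/api/glfs.h>

    /* Oversize the dirent so d_name can hold 255 chars + NUL, regardless
     * of how small the platform's struct dirent happens to be. */
    static int list_dir (glfs_t *fs, const char *path)
    {
            union {
                    struct dirent entry;
                    char pad[offsetof (struct dirent, d_name) + 255 + 1];
            } buf;
            struct dirent *result = NULL;
            glfs_fd_t     *fd;

            fd = glfs_opendir (fs, path);
            if (!fd)
                    return -1;

            while (glfs_readdir_r (fd, &buf.entry, &result) == 0 && result)
                    printf ("%s\n", result->d_name);

            glfs_closedir (fd);
            return 0;
    }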
Re: [Gluster-devel] can glfs_fini ever succeed?
This was fixed upstream, and backported to release-3.4 as well. The fix will be part of 3.4.2. Avati On Wed, Oct 30, 2013 at 2:49 PM, Eric Blake ebl...@redhat.com wrote: I'm trying to use glusterfs-api, but ran into some questions on usage (currently targetting Fedora 19's glusterfs-api-devel-3.4.1-1.fc19.x86_64). It looks like glfs_fini() starts with 'ret = -1' and never assigns ret to any other value. This in turn leads to odd error messages; I explicitly coded my application to warn about a negative return value, but see: warning : virStorageBackendGlusterClose:52 : shutdown of gluster failed with errno 0 which contradicts the docs that say errno will be set on failure. Is this a bug where I should just ignore the return value as useless? -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
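Until running a build with that fix, callers may want to treat the return value defensively, only reporting failure when errno was actually set; a small sketch of that workaround (the wrapper name is made up):

    #include <errno.h>
    #include <stdio.h>
    #include <glusterfs/api/glfs.h>

    /* Pre-fix glfs_fini() always returns -1 but leaves errno at 0;
     * treat that combination as a spurious failure. */
    static void shutdown_fs (glfs_t *fs)
    {
            errno = 0;
            if (glfs_fini (fs) != 0 && errno != 0)
                    fprintf (stderr, "glfs_fini failed: errno=%d\n", errno);
    }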
Re: [Gluster-devel] glfs_readdir_r is painful
On Wed, Oct 30, 2013 at 3:31 PM, Eric Blake ebl...@redhat.com wrote: On 10/30/2013 04:08 PM, Anand Avati wrote: Eric, Thanks for the insights. I have posted a patch at http://review.gluster.org/6201 which clarifies the usage of glfs_readdir_r() and also introduce glfs_readdir(). Thanks for starting that. I see an off-by-one in that patch; pre-patch you did: strncpy (dirent-d_name, gf_dirent-d_name, 256); but post-patch, you have: strncpy (dirent-d_name, gf_dirent-d_name, GF_NAME_MAX); with GF_NAME_MAX set to either NAME_MAX or 255. This is a bug; you MUST strncpy at least 1 byte more than the maximum name if you are to guarantee a NUL-terminated d_name for the user. The buffer is guaranteed to be 0-inited, and strncpy with 255 is now guaranteed to have a NULL terminated string no matter how big the name was (which wasn't the case before, in case the name was 255 bytes). Oh, and NAME_MAX is not guaranteed to be defined as 255; if it is larger than 255 you are wasting memory compared to XFS, if it is less than 255 [although unlikely], you have made it impossible to return valid file names to the user. You may be better off just hard-coding GF_NAME_MAX to 255 regardless of what the system has for its NAME_MAX. Hmm, I don't think so.. strncpy of 255 bytes on to a buffer guaranteed to be 256 or higher and also guaranteed to be 0-memset'ed cannot return an invalid file name. No? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] glfs_readdir_r is painful
On Wed, Oct 30, 2013 at 3:54 PM, Eric Blake ebl...@redhat.com wrote: Hmm, I don't think so.. strncpy of 255 bytes on to a buffer guaranteed to be 256 or higher and also guaranteed to be 0-memset'ed cannot return an invalid file name. No? The fact that your internal glfs_readdir buffer is memset means you are safe there for a 255-byte filename; but that safety does not extend to glfs_readdir_r for a user buffer. Right! Fixed - http://review.gluster.org/#/c/6201/2 ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
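In isolation, the copy pattern the thread converges on: with GF_NAME_MAX fixed at 255, copying into a caller-supplied (possibly non-zeroed) buffer needs an explicit terminator, since strncpy() does not add one when the source is 255 bytes or longer. A small illustrative sketch, not the actual patch:

    #include <string.h>

    #define GF_NAME_MAX 255

    /* dst must be at least GF_NAME_MAX + 1 bytes. A caller-supplied
     * buffer, unlike an internal pre-zeroed one, carries no guarantee of
     * a trailing NUL after strncpy(), so set it explicitly. */
    static void copy_dname (char *dst, const char *src)
    {
            strncpy (dst, src, GF_NAME_MAX);
            dst[GF_NAME_MAX] = '\0';
    }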
Re: [Gluster-devel] Weirdness in glfs_mkdir()?
glfs_mkdir() (in glfs.h) accepts three params - @fs, @path, @mode. gfapi.py uses raw ctypes to provide python APIs - and therefore the bug of not accepting and passing @mode in the mkdir() method in gfapi.py is translating into a junk value getting received by glfs_mkdir (and random modes getting set for various dirs). You just witnessed the woe of a typeless system :) Avati On Mon, Oct 28, 2013 at 1:31 PM, Justin Clift jcl...@redhat.com wrote: Hi Avati, When creating directories through glfs_mkdir() - called through Python - the directories have inconsistent mode permissions. Is this expected? Here's the super simple code running directly in a Python 2.7.5 shell, on F19. It's a simple single brick volume, XFS underneath. Gluster compiled from git master head over the weekend: vol.mkdir('asdf') 0 vol.mkdir('asdf/111') 0 vol.mkdir('asdf/112') 0 vol.mkdir('asdf/113') 0 vol.mkdir('asdf/114') 0 vol.mkdir('asdf/115') 0 vol.mkdir('asdf/116') 0 vol.mkdir('asdf/117') 0 Looks ok from here, but ls -la shows the strangeness of the subdirs: $ sudo ls -la asdf/ total 0 dr-x-w. 9 root root 76 Oct 28 20:22 . drwxr-xr-x. 8 root root 114 Oct 28 20:22 .. d-w--w---T. 2 root root 6 Oct 28 20:22 111 d--x--x--T. 2 root root 6 Oct 28 20:22 112 dr--rw---T. 2 root root 6 Oct 28 20:22 113 drwx--x---. 2 root root 6 Oct 28 20:22 114 dr--rT. 2 root root 6 Oct 28 20:22 115 dr-x-w. 2 root root 6 Oct 28 20:22 116 drwx--x---. 2 root root 6 Oct 28 20:22 117 Easily worked around using chmod() after each mkdir(), but I'm not sure if this is a bug or not. ? Regards and best wishes, Justin Clift -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Weirdness in glfs_mkdir()?
On Mon, Oct 28, 2013 at 6:28 PM, Jay Vyas jayunit...@gmail.com wrote: wow. im surprised. this caught my eye, checked into the mode: glfs_mkdir (struct glfs *fs, const char *path, mode_t mode) So, somehow, the python API is capable of sending a mode which doesnt correspond to anything enumerated as part of the mode_t, but the C method still manages to write the file with a garbage mode ? That sounds like a bug not in python. Not in gluster... but in C ! :) [if im understanding this correctyle, which i might not be] Not sure whether its a bug or feature of C. C runtime is typeless. Python ctypes uses dlopen/dlsym which do symbol lookups - doesn't care whether the looked up symbol is a data structure or a function name - let alone do type matching! Avati On Mon, Oct 28, 2013 at 7:19 PM, Anand Avati av...@gluster.org wrote: glfs_mkdir() (in glfs.h) accepts three params - @fs, @path, @mode. gfapi.py uses raw ctypes to provide python APIs - and therefore the bug of not accepting and passing @mode in the mkdir() method in gfapi.py is translating into a junk value getting received by glfs_mkdir (and random modes getting set for various dirs). You just witnessed the woe of a typeless system :) Avati On Mon, Oct 28, 2013 at 1:31 PM, Justin Clift jcl...@redhat.com wrote: Hi Avati, When creating directories through glfs_mkdir() - called through Python - the directories have inconsistent mode permissions. Is this expected? Here's the super simple code running directly in a Python 2.7.5 shell, on F19. It's a simple single brick volume, XFS underneath. Gluster compiled from git master head over the weekend: vol.mkdir('asdf') 0 vol.mkdir('asdf/111') 0 vol.mkdir('asdf/112') 0 vol.mkdir('asdf/113') 0 vol.mkdir('asdf/114') 0 vol.mkdir('asdf/115') 0 vol.mkdir('asdf/116') 0 vol.mkdir('asdf/117') 0 Looks ok from here, but ls -la shows the strangeness of the subdirs: $ sudo ls -la asdf/ total 0 dr-x-w. 9 root root 76 Oct 28 20:22 . drwxr-xr-x. 8 root root 114 Oct 28 20:22 .. d-w--w---T. 2 root root 6 Oct 28 20:22 111 d--x--x--T. 2 root root 6 Oct 28 20:22 112 dr--rw---T. 2 root root 6 Oct 28 20:22 113 drwx--x---. 2 root root 6 Oct 28 20:22 114 dr--rT. 2 root root 6 Oct 28 20:22 115 dr-x-w. 2 root root 6 Oct 28 20:22 116 drwx--x---. 2 root root 6 Oct 28 20:22 117 Easily worked around using chmod() after each mkdir(), but I'm not sure if this is a bug or not. ? Regards and best wishes, Justin Clift -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel -- Jay Vyas http://jayunit100.blogspot.com ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
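For comparison, calling the C API directly with an explicit mode behaves deterministically; the Python-side fix is simply for gfapi.py's mkdir() to accept a mode argument and pass it through as the third ctypes parameter. A small sketch against the glfs_mkdir() signature quoted above (the fs handle is assumed to be already initialised and connected; the header path may vary):

    #include <glusterfs/api/glfs.h>

    /* Recreate the directories from the example above, this time passing
     * @mode explicitly instead of letting a junk value reach glfs_mkdir(). */
    static int make_dirs (glfs_t *fs)
    {
            if (glfs_mkdir (fs, "asdf", 0755) != 0)
                    return -1;
            if (glfs_mkdir (fs, "asdf/111", 0755) != 0)
                    return -1;
            return 0;
    }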
Re: [Gluster-devel] Change in glusterfs[master]: Transparent data encryption and metadata authentication in t...
On Thu, Oct 24, 2013 at 1:18 PM, Edward Shishkin edw...@redhat.com wrote: Hi all, So, here is the all-in-one-translator version represented by the Patch Set #2 at review.gluster.org/4667 Everything has been addressed except encryption in NFS mounts (see next mail for details). That is: . New design of EOF (end-of-file) handling; . No oplock translator on the server side; . All locks are acquired/released by the crypt translator; . Now we can encrypt srtiped and(or) replicated volumes. Common comments. In the new design all files on the server are padded, whereas the real file size is stored as xattr. So we introduce a special layer in the crypt translator, which performs file size translations: every time when any callback returns struct iatt, we update its ia_size with the real (non-padded) value. The most unpleasant thing in this new design is FOP-readdirp_cbk(): in this case we need N translations, i.e. N calls to the server (N is number of directory entries). To perform translations we spawn N children. We need a valid list of dirents after returning from FOP-readdirp_cbk() of previous translator, but we don't want to create a copy of this list (which can be large enough). For this reason we introduce a reference counter in struct gf_dirent_t and allocate dynamic structures gf_dirent_t (instead of on-stack ones), see respective changes in ./libglusterfs/src/gf-dirent.c ./libglusterfs/src/gf-dirent.h ./xlators/cluster/dht/src/dht-common.c ./xlators/protocol/client/src/client-rpc-fops.c [pasting from internal email reply] I had a look at the way you are handling readdirplus. I think it is overly complex. FOP-readdirplus() already has a parameter @xdata in which you can request per-entry xattr replies. So in crypt_readdirp() you need to: dict_set(xdata, FSIZE_XATTR_PREFIX, 0); Once you do that, in crypt_readdirp_cbk, you can expect each gf_dirent_t to have its dirent-dict set with FSIZE_XATTR_PREFIX. So you just need to iterate over replies in crypt_readdirp_cbk, update each dirent-d_stat.ia_size with value from dict_get_uint64(dirent-xdata, FSIZE_XATTR_PREFIX) Please look at how posix-acl does something very similar (loading per-entry ACLs into respective inodes via xattrs returned in readdirplus) Avati Thanks, Edward. On Mon, 14 Oct 2013 14:27:01 -0700 Anand Avati av...@redhat.com wrote: Edward, It looks like this patch requires a higher version of openssl (I recall you have mentioned before that that dependency was on version 1.0.1c? I checked yum update on the build server and the latest available version is 1.0.0-27. Is there a clean way to get the right version of openssl to a RHEL/CENTOS-6.x server? Also note that the previous submission of the patch was at http://review.gluster.org/4667. The recent on (http://review.gluster.org/6086) has a different Change-Id: in the commit log. It will be good if you can re-submit the patch with the old Change-Id (and abandon #6086) so that we can maintain the history of resubmission and the old work on records. Thanks! Avati On 10/14/2013 07:26 AM, Edward Shishkin (Code Review) wrote: Edward Shishkin has uploaded a new change for review. http://review.gluster.org/6086 Change subject: Transparent data encryption and metadata authentication in the systems with non-trusted server (take II) .. Transparent data encryption and metadata authentication in the systems with non-trusted server (take II) This new functionality can be useful in various cloud technologies. 
It is implemented via a special encryption/crypt translator, which works on the client side and performs encryption and authentication; 1. Class of supported algorithms The crypt translator can support any atomic symmetric block cipher algorithms (which require to pad plain/cipher text before performing encryption/decryption transform (see glossary in atom.c for definitions). In particular, it can support algorithms with the EOF issue (which require to pad the end of file by extra-data). Crypt translator performs translations user - (offset, size) - (aligned-offset, padded-size) -server (and backward), and resolves individual FOPs (write(), truncate(), etc) to read-modify-write sequences. A volume can contain files encrypted by different algorithms of the mentioned class. To change some option value just reconfigure the volume. Currently only one algorithm is supported: AES_XTS. Example of algorithms, which can not be supported by the crypt translator: 1. Asymmetric block cipher algorithms, which inflate data, e.g. RSA; 2. Symmetric block cipher algorithms with inline MACs for data authentication. 2. Implementation notes. a) Atomic algorithms Since any process
Re: [Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.
http://review.gluster.org/#/c/6031/ (patch to remove replace-brick data migration) is slated for merge before 3.5. Review comments (on gerrit) welcome. Thanks, Avati On Thu, Oct 3, 2013 at 9:27 AM, Anand Avati av...@gluster.org wrote: On Thu, Oct 3, 2013 at 8:57 AM, KueiHuan Chen kueihuan.c...@gmail.comwrote: Hi, Avati In your chained configuration, how to replace whole h1 without replace-brick ? Is there has a better way than replace brick in this situation ? h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 want to replace old h1.) You have a couple of options, A) replace-brick h1:/b1 h3:/b1 replace-brick h1:/b2 h3:/b2 and let self-heal bring the disks up to speed, or B) add-brick replica 2 h3:/b1 h2:/b2a add-brick replica 2 h3:/b2 h0:/b1a remove-brick h0:/b1 h1:/b2 start .. commit remove-brick h2:/b2 h1:/b1 start .. commit Let me know if you still have questions. Avati Thanks. Best Regards, KueiHuan-Chen Synology Incorporated. Email: khc...@synology.com Tel: +886-2-25521814 ext.827 2013/9/30 Anand Avati av...@gluster.org: On Fri, Sep 27, 2013 at 1:56 AM, James purplei...@gmail.com wrote: On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote: Hello all, Hey, Interesting timing for this post... I've actually started working on automatic brick addition/removal. (I'm planning to add this to puppet-gluster of course.) I was hoping you could help out with the algorithm. I think it's a bit different if there's no replace-brick command as you are proposing. Here's the problem: Given a logically optimal initial volume: volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2 suppose I know that I want to add/remove bricks such that my new volume (if I had created it new) looks like: volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2 What is the optimal algorithm for determining the correct sequence of transforms that are needed to accomplish this task. Obviously there are some simpler corner cases, but I'd like to solve the general case. The transforms are obviously things like running the add-brick {...} and remove-brick {...} commands. Obviously we have to take into account that it's better to add bricks and rebalance before we remove bricks and risk the file system if a replica is missing. The algorithm should work for any replica N. We want to make sure the new layout makes sense to replicate the data on different servers. In many cases, this will require creating a circular chain of bricks as illustrated in the bottom of this image: http://joejulian.name/media/uploads/images/replica_expansion.png for example. I'd like to optimize for safety first, and then time, I imagine. Many thanks in advance. I see what you are asking. First of all, when running a 2-replica volume you almost pretty much always want to have an even number of servers, and add servers in even numbers. Ideally the two sides of the replicas should be placed in separate failures zones - separate racks with separate power supplies or separate AZs in the cloud. Having an odd number of servers with an 2 replicas is a very odd configuration. In all these years I am yet to come across a customer who has a production cluster with 2 replicas and an odd number of servers. And setting up replicas in such a chained manner makes it hard to reason about availability, especially when you are trying recover from a disaster. Having clear and separate pairs is definitely what is recommended. 
That being said, nothing prevents one from setting up a chain like above as long as you are comfortable with the complexity of the configuration. And phasing out replace-brick in favor of add-brick/remove-brick does not make the above configuration impossible either. Let's say you have a chained configuration of N servers, with pairs formed between every: h(i):/b1 h((i+1) % N):/b2 | i := 0 - N-1 Now you add N+1th server. Using replace-brick, you have been doing thus far: 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was part of a previous brick 2. replace-brick h0:/b2 hN:/b2 start ... commit In case you are doing an add-brick/remove-brick approach, you would now instead do: 1. add-brick h(N-1):/b1a hN:/b2 2. add-brick hN:/b1 h0:/b2a 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit You will not be left with only 1 copy of a file at any point in the process, and achieve the same end result as you were with replace-brick. As mentioned before, I once again request you to consider if you really want to deal with the configuration complexity of having chained replication, instead of just adding servers in pairs. Please ask if there are any more questions or concerns. Avati James Some comments below, although I'm a bit tired so I hope I said it all right. DHT's remove
Re: [Gluster-devel] glusterfs 3.4.0 vs 3.4.1 potential packaging problem?
[2013-10-08 17:33:36.662549] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.1 (/usr/sbin/glusterd --debug) ... [2013-10-08 17:33:36.664191] W [xlator.c:185:xlator_dynload] 0-xlator: /usr/lib64/glusterfs/3.4.0/xlator/mgmt/glusterd.so: cannot open shared object file: No such file or directory I think the issue can be summarized with the above two log lines. glusterd binary is version 3.4.1 (PACKAGE_VERSION of glusterfsd is 3.4.1) but libglusterfs is trying to open .../3.4.0/...glusterd.so (i.e PACKAGE_VERSION during build of libglusterfs.so is 3.4.0). The reality in code today is that glusterfsd and libglusterfs must be built from the same version of the source tree (for reasons like above), and this needs to be captured in the packaging. I see that the glusterfs.spec.in in glusterfs.git has: Requires: %{name}-libs = %{version}-%{release} for the glusterfs-server RPM. That should have forced your glusterfs-libs to be updated to 3.4.1 as well. Kaleb, Can you confirm that the Fedora RPMs also have this internal dependency between packages? If it already does, I'm not sure how Jeff ended up with: glusterfs-libs-3.4.0-8.fc19.x86_64 glusterfs-3.4.1-1.fc19.x86_64 without doing a --force and/or --nodeps install. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.
On Thu, Oct 3, 2013 at 8:57 AM, KueiHuan Chen kueihuan.c...@gmail.comwrote: Hi, Avati In your chained configuration, how to replace whole h1 without replace-brick ? Is there has a better way than replace brick in this situation ? h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 want to replace old h1.) You have a couple of options, A) replace-brick h1:/b1 h3:/b1 replace-brick h1:/b2 h3:/b2 and let self-heal bring the disks up to speed, or B) add-brick replica 2 h3:/b1 h2:/b2a add-brick replica 2 h3:/b2 h0:/b1a remove-brick h0:/b1 h1:/b2 start .. commit remove-brick h2:/b2 h1:/b1 start .. commit Let me know if you still have questions. Avati Thanks. Best Regards, KueiHuan-Chen Synology Incorporated. Email: khc...@synology.com Tel: +886-2-25521814 ext.827 2013/9/30 Anand Avati av...@gluster.org: On Fri, Sep 27, 2013 at 1:56 AM, James purplei...@gmail.com wrote: On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote: Hello all, Hey, Interesting timing for this post... I've actually started working on automatic brick addition/removal. (I'm planning to add this to puppet-gluster of course.) I was hoping you could help out with the algorithm. I think it's a bit different if there's no replace-brick command as you are proposing. Here's the problem: Given a logically optimal initial volume: volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2 suppose I know that I want to add/remove bricks such that my new volume (if I had created it new) looks like: volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2 What is the optimal algorithm for determining the correct sequence of transforms that are needed to accomplish this task. Obviously there are some simpler corner cases, but I'd like to solve the general case. The transforms are obviously things like running the add-brick {...} and remove-brick {...} commands. Obviously we have to take into account that it's better to add bricks and rebalance before we remove bricks and risk the file system if a replica is missing. The algorithm should work for any replica N. We want to make sure the new layout makes sense to replicate the data on different servers. In many cases, this will require creating a circular chain of bricks as illustrated in the bottom of this image: http://joejulian.name/media/uploads/images/replica_expansion.png for example. I'd like to optimize for safety first, and then time, I imagine. Many thanks in advance. I see what you are asking. First of all, when running a 2-replica volume you almost pretty much always want to have an even number of servers, and add servers in even numbers. Ideally the two sides of the replicas should be placed in separate failures zones - separate racks with separate power supplies or separate AZs in the cloud. Having an odd number of servers with an 2 replicas is a very odd configuration. In all these years I am yet to come across a customer who has a production cluster with 2 replicas and an odd number of servers. And setting up replicas in such a chained manner makes it hard to reason about availability, especially when you are trying recover from a disaster. Having clear and separate pairs is definitely what is recommended. That being said, nothing prevents one from setting up a chain like above as long as you are comfortable with the complexity of the configuration. And phasing out replace-brick in favor of add-brick/remove-brick does not make the above configuration impossible either. 
Let's say you have a chained configuration of N servers, with pairs formed between every: h(i):/b1 h((i+1) % N):/b2 | i := 0 - N-1 Now you add N+1th server. Using replace-brick, you have been doing thus far: 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was part of a previous brick 2. replace-brick h0:/b2 hN:/b2 start ... commit In case you are doing an add-brick/remove-brick approach, you would now instead do: 1. add-brick h(N-1):/b1a hN:/b2 2. add-brick hN:/b1 h0:/b2a 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit You will not be left with only 1 copy of a file at any point in the process, and achieve the same end result as you were with replace-brick. As mentioned before, I once again request you to consider if you really want to deal with the configuration complexity of having chained replication, instead of just adding servers in pairs. Please ask if there are any more questions or concerns. Avati James Some comments below, although I'm a bit tired so I hope I said it all right. DHT's remove-brick + rebalance has been enhanced in the last couple of releases to be quite sophisticated. It can handle graceful decommissioning of bricks, including open file descriptors and hard links. Sweet This in a way is a feature overlap
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Tue, Oct 1, 2013 at 4:49 AM, Emmanuel Dreyfus m...@netbsd.org wrote: Justin Clift jcl...@redhat.com wrote: Towards this we need some extensions to gfapi that can handle object based operations. Meaning, instead of using full paths or relative paths from cwd, it is required that we can work with APIs, like the *at POSIX variants, to be able to create, lookup, open etc. files and directories. snip Any idea if this would impact our *BSD compatibility? :) NetBSD 6.1 only has partial linkat(2). NetBSD-current (will-be NetBSD-7.0) has all of extended API set 2, except fexecve(2) and O_EXEC, for which no consensus was reached on how to implement them securely. In a nutshell, switching to *at() kills NetBSD compatibility until the next major release, but I already know it will be restored at that time. The context here is the POSIX-like style of API exposed by GFAPI, and not dependent on what syscalls the platform provides. Good to know (separately) that the *at() syscalls will be supported in NetBSD sometime. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
in place from the glfs_resolve_inode implementation as suggested earlier, but good to check. 4) Renames In the case of renames, the inode remains the same, hence all handed out object handles still are valid and will operate on the right object per se. 5) unlinks and recreation of the same _named_ object in the background Example being, application gets an handle for an object, say named a.txt, and in the background (or via another application/client) this is deleted and recreated. This will return ENOENT as the GFID would have changed for the previously held object to the new one, even though the names are the same. This seems like the right behaviour, and does not change in the case of a 1:1 of an N:1 object handle to inode mapping. So bottom line, I see the object handles like an fd with the noted difference above. Having them in a 1:1 relationship or as a N:1 relationship does not seem to be an issue from what I understand, what am I missing here? The issue is this. From what I understand, the usage of glfs_object in the FSAL is not like a per-operation handle, but something stored long term (many minutes, hours, days) in the per-inode context of the NFS Ganesha layer. Now NFS Ganesha may be doing the right thing by not re-looking up an already looked up name and therefore avoiding a leak (I'm not so sure, it still needs to verify every so often if the mapping is still valid). From NFS Ganesha's point of view the handle is changing on every lookup. Now consider what happens in case of READDIRPLUS. A list of names and handles are returned to the client. The list of names can possibly include names which were previously looked up as well. Both are supposed to represent the same gfid, but here will be returning new glfs_objects. When a client performs an operation on a GFID, on which glfs_object will the operation be performed at the gfapi layer? This part seems very ambiguous and not clear. What would really help is if you can tell what a glfs_object is supposed to represent? - an on disk inode (i.e GFID)? an in memory per-graph inode (i.e inode_t)? A dentry? A per-operation handle to an on disk inode? A per-operation handle to an in memory per-graph inode? A per operation handle to a dentry? In the current form, it does not seem to fit any of the these categories. Avati Shyam -- *From: *Anand Avati av...@gluster.org *To: *Shyamsundar Ranganathan srang...@redhat.com *Cc: *Gluster Devel gluster-devel@nongnu.org *Sent: *Monday, September 30, 2013 10:35:05 AM *Subject: *Re: RFC/Review: libgfapi object handle based extensions I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t? Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past. We need a stronger relationship between the two with a clearer relationship. It is probably necessary for a glfs_object to represent mulitple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call. Avati On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. 
By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. Shyam. - Original Message - From: Anand Avati av...@gluster.org To: Shyamsundar Ranganathan srang...@redhat.com Cc: Gluster Devel gluster-devel@nongnu.org Sent: Wednesday, September 18, 2013 11:39:27 AM Subject: Re: RFC/Review: libgfapi
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Mon, Sep 30, 2013 at 9:34 AM, Anand Avati av...@gluster.org wrote: On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Amar, Amar, Anand S and myself had a discussion on this comment and here is an answer to your queries the way I see it. Let me know if I am missing something here. (this is not a NFS Ganesha requirement, FYI. As Ganesha will only do a single lookup or preserve a single object handle per filesystem object in its cache) Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object. 1) The similarity and differences of object handles to fds The intention of multiple object handles is in lines with multiple fd's per file, an application using the library is free to lookup (and/or create (and its equivalents)) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such object in its view. So in essence one thread can have an object handle to perform, say attribute related operations, whereas another thread has the same object looked up to perform IO. So do you mean a glfs_object is meant to be a *per-operation* handle? If one thread wants to perform a chmod() and another thread wants to perform chown() and both attempt to resolve the same name and end up getting different handles, then both of them unref the glfs_handle right after their operation? Where the object handles depart from the notion of fds is when an unlink is performed. As POSIX defines that open fds are still _open_ for activities on the file, the life of an fd and the actual object that it points to is till the fd is closed. In the case of object handles though, the moment any handle is used to unlink the object (which BTW is done using the parent object handle and the name of the child), all handles pointing to the object are still valid pointers, but operations on then will result in ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem. Not always. If the file had hardlinks the handle should still be valid. And if there were no hardlinks and you unlinked the last link, further operations must return ESTALE. ENOENT is when a basename does not resolve to a handle (in entry operations) - for e.g when you try to unlink the same entry a second time. Whereas ESTALE is when a presented handle does not exist - for e.g when you try to operate (read, chmod) a handle which got deleted. The departure from fds is considered valid in my perspective, as the handle points to an object, which has since been removed, and so there is no semantics here that needs it to be preserved for further operations as there is a reference to it held. The departure is only in the behavior of unlinked files. That is orthogonal to whether you want to return separate handles each time a component is looked up. I fail to see how the departure from fd behavior justifies creating new glfs_object per lookup? So in essence for each time an object handle is returned by the API, it has to be closed for its life to end. Additionally if the object that it points to is removed from the underlying system, the handle is pointing to an entry that does not exist any longer and returns ENOENT on operations using the same. 
2) The issue/benefit of having the same object handle irrespective of looking it up multiple times If we have an 1-1 relationship of object handles (i.e struct glfs_object) to inodes, then the caller gets the same pointer to the handle. Hence having multiple handles as per the caller, boils down to giving out ref counted glfs_object(s) for the same inode. Other than the memory footprint, this will still not make the object live past it's unlink time. The pointer handed out will be still valid till the last ref count is removed (i.e the object handle closed), at which point the object handle can be destroyed. If I understand what you say above correctly, you intend to solve the problem of unlinked files must return error at your API layer? That's wrong. The right way is to ref-count glfs_object and return them precisely because you should NOT make the decision about the end of life of an inode at that layer. A hardlink may have been created by another client and the glfs_object may therefore be still be valid. You are also returning separate glfs_object for different hardlinks of a file. Does that mean glfs_object is representing a dentry? or a per-operation reference to an inode? So again, as many handles were handed out for the same inode, they have to be closed, etc. 3) Graph switches In the case of graph switches, handles that are used in operations post the switch, get refreshed with an inode from the new graph, if we have an N:1 object to inode relationship. In the case of 1:1
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Mon, Sep 30, 2013 at 12:49 PM, Anand Avati av...@gluster.org wrote: On Mon, Sep 30, 2013 at 9:34 AM, Anand Avati av...@gluster.org wrote: On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Amar, Amar, Anand S and myself had a discussion on this comment and here is an answer to your queries the way I see it. Let me know if I am missing something here. (this is not a NFS Ganesha requirement, FYI. As Ganesha will only do a single lookup or preserve a single object handle per filesystem object in its cache) Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object. 1) The similarity and differences of object handles to fds The intention of multiple object handles is in lines with multiple fd's per file, an application using the library is free to lookup (and/or create (and its equivalents)) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such object in its view. So in essence one thread can have an object handle to perform, say attribute related operations, whereas another thread has the same object looked up to perform IO. So do you mean a glfs_object is meant to be a *per-operation* handle? If one thread wants to perform a chmod() and another thread wants to perform chown() and both attempt to resolve the same name and end up getting different handles, then both of them unref the glfs_handle right after their operation? Where the object handles depart from the notion of fds is when an unlink is performed. As POSIX defines that open fds are still _open_ for activities on the file, the life of an fd and the actual object that it points to is till the fd is closed. In the case of object handles though, the moment any handle is used to unlink the object (which BTW is done using the parent object handle and the name of the child), all handles pointing to the object are still valid pointers, but operations on then will result in ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem. Not always. If the file had hardlinks the handle should still be valid. And if there were no hardlinks and you unlinked the last link, further operations must return ESTALE. ENOENT is when a basename does not resolve to a handle (in entry operations) - for e.g when you try to unlink the same entry a second time. Whereas ESTALE is when a presented handle does not exist - for e.g when you try to operate (read, chmod) a handle which got deleted. The departure from fds is considered valid in my perspective, as the handle points to an object, which has since been removed, and so there is no semantics here that needs it to be preserved for further operations as there is a reference to it held. The departure is only in the behavior of unlinked files. That is orthogonal to whether you want to return separate handles each time a component is looked up. I fail to see how the departure from fd behavior justifies creating new glfs_object per lookup? So in essence for each time an object handle is returned by the API, it has to be closed for its life to end. Additionally if the object that it points to is removed from the underlying system, the handle is pointing to an entry that does not exist any longer and returns ENOENT on operations using the same. 
2) The issue/benefit of having the same object handle irrespective of looking it up multiple times If we have an 1-1 relationship of object handles (i.e struct glfs_object) to inodes, then the caller gets the same pointer to the handle. Hence having multiple handles as per the caller, boils down to giving out ref counted glfs_object(s) for the same inode. Other than the memory footprint, this will still not make the object live past it's unlink time. The pointer handed out will be still valid till the last ref count is removed (i.e the object handle closed), at which point the object handle can be destroyed. If I understand what you say above correctly, you intend to solve the problem of unlinked files must return error at your API layer? That's wrong. The right way is to ref-count glfs_object and return them precisely because you should NOT make the decision about the end of life of an inode at that layer. A hardlink may have been created by another client and the glfs_object may therefore be still be valid. You are also returning separate glfs_object for different hardlinks of a file. Does that mean glfs_object is representing a dentry? or a per-operation reference to an inode? So again, as many handles were handed out for the same inode, they have to be closed, etc. 3) Graph switches In the case of graph switches, handles that are used in operations post the switch, get refreshed with an inode from the new graph, if we
Re: [Gluster-devel] Phasing out replace-brick for data migration in favor of remove-brick.
On Fri, Sep 27, 2013 at 10:15 AM, Amar Tumballi ama...@gmail.com wrote: I plan to send out patches to remove all traces of replace-brick data migration code by 3.5 branch time. Thanks for the initiative, let me know if you need help. I could use help here, if you have free cycles to pick up this task? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
Now consider what happens in case of READDIRPLUS. A list of names and handles are returned to the client. The list of names can possibly include names which were previously looked up as well. Both are supposed to represent the same gfid, but here will be returning new glfs_objects. When a client performs an operation on a GFID, on which glfs_object will the operation be performed at the gfapi layer? This part seems very ambiguous and not clear. I should have made a note for readdirplus earlier, this would default to the fd based version of the same, not a handle/object based version of the same. So we would transition from an handle to an fd via glfs_h_opendir and then continue with the readdir variants. if I look at the POSIX *at routines, this seem about right, but of course we may have variances here. You would get an fd for the directory on which the READDIRPLUS is attempted. I was referring to the replies, where every entry needs to be returned with its own handle (on which operations can arrive without LOOKUP). Think of READDIRPLUS as bulk LOOKUP. What would really help is if you can tell what a glfs_object is supposed to represent? - an on disk inode (i.e GFID)? an in memory per-graph inode (i.e inode_t)? A dentry? A per-operation handle to an on disk inode? A per-operation handle to an in memory per-graph inode? A per operation handle to a dentry? In the current form, it does not seem to fit any of the these categories. Well I think of it as a handle to an file system object. Having said that, if we just returned the inode pointer as this handle, the graph switches can cause a problem, in which case we need to default to the (as per my understanding) the FUSE manner of working. keeping the handle 1:1 via other infrastructure does not seem beneficial ATM. I think you cover this in the subsequent mail so let us continue there. That is correct, using inode_t will force us to behave like FUSE. As mentioned in the other mail, we are probably better off fixing that and using inode_t in a cleaner way in both FUSE and gfapi. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Thu, Sep 26, 2013 at 3:55 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: - Original Message - From: Shyamsundar Ranganathan srang...@redhat.com To: gluster-devel@nongnu.org Cc: ana...@redhat.com Sent: Friday, September 13, 2013 1:48:19 PM Subject: RFC/Review: libgfapi object handle based extensions - We do need the APIs to extend themselves to do any ID based operations, say creating with a specific UID/GID rather than the running process UID/GID that can prove detrimental in a multi threaded, multi connection handling server protocol like the NFS Ganesha implementation In continuation of the original mail, we need to handle the one item above. Where we need to pass in the UID/GID to be used when performing the operations. Here is a suggestion for review on achieving the same, (for current code implementation of handle APIs look at, http://review.gluster.org/#/c/5936/) 1) Modify the handle based APIs to take in a opctx (operation context, concept borrowed from Ganesha) So, instead of, glfs_h_creat (struct glfs *fs, struct glfs_object *parent, const char *path, int flags, mode_t mode, struct stat *stat) it would be, glfs_h_creat (struct glfs *fs, struct glfs_optctx *opctx, struct glfs_object *parent, const char *path, int flags, mode_t mode, struct stat *stat) Where, struct glfs_optctx { uid_t caller_uid; gid_t caller_gid; } Later as needed this operation context can be extended for other needs like, client connection address or ID, supplementary groups, etc. 2) Internal to the glfs APIs (esp. handle based APIs), use this to set thread local variables (UID/GID) that the syncop frame creation can pick up in addition to the current probe of geteuid/egid. (as suggested by Avati) If the basic construct looks fine I will amend my current review with this change in the create API and syncop.h (etc.), and once reviewed extend it to other handle based APIs as appropriate. I am somewhat hesitant to expose a structure to be filled by the user, where the structure can grow over time. Providing APIs like glfs_setfsuid()/glfs_setfsgid()/glfs_setgroups(), which internally uses thread local variables to communicate the values to syncop_create_frame() is probably a cleaner approach. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
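A minimal sketch of what such setters could look like with thread-local variables; this is only an illustration of the suggestion, not the actual gfapi implementation, and syncop_create_frame() would have to consult these values in place of geteuid()/getegid():

    #include <sys/types.h>

    /* Per-thread credentials, defaulting to "unset". */
    static __thread uid_t glfs_thread_fsuid = (uid_t) -1;
    static __thread gid_t glfs_thread_fsgid = (gid_t) -1;

    uid_t glfs_setfsuid (uid_t fsuid)
    {
            uid_t old = glfs_thread_fsuid;

            glfs_thread_fsuid = fsuid;
            return old;
    }

    gid_t glfs_setfsgid (gid_t fsgid)
    {
            gid_t old = glfs_thread_fsgid;

            glfs_thread_fsgid = fsgid;
            return old;
    }

A consumer such as NFS-Ganesha would call these on the worker thread before each handle-based operation, so the frame created for the FOP carries the caller's credentials rather than those of the server process.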
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t? Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past. We need a stronger relationship between the two with a clearer relationship. It is probably necessary for a glfs_object to represent mulitple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call. Avati On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. Shyam. - Original Message - From: Anand Avati av...@gluster.org To: Shyamsundar Ranganathan srang...@redhat.com Cc: Gluster Devel gluster-devel@nongnu.org Sent: Wednesday, September 18, 2013 11:39:27 AM Subject: Re: RFC/Review: libgfapi object handle based extensions Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient. There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too) 1. glfs_validate_inode() has the assumption that object-inode deref is a guarded operation, but here we are doing an unguarded deref in the paramter glfs_resolve_base(). 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and inode can get resolved to a different graph. And in syncop_XX() we end up calling on graph1 with inode belonging to graph2. 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). 
Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason) I think the pattern should be similar to FD based fops which specifically address both the above problems. Here's an outline: glfs_h_(struct glfs *fs, glfs_object *object, ...) { xlator_t *subvol = NULL; inode_t *inode = NULL; __glfs_entry_fs (fs); subvol = glfs_active_subvol (fs); if (!subvol) { errno = EIO; ... goto out; } inode = glfs_resolve_inode (fs, object, subvol); if (!inode) { errno = ESTALE; ... goto out; } loc.inode = inode; ret = syncop_(subvol, loc, ...); } Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object, and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXX() is performed with matching subvol and inode. Also it returns the inode pointer so that no unsafe object-inode deref is done by the caller. Again, this is the same pattern followed by the fd based fops already. Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t. Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
Also note that the same glfs_object must be re-used in readdirplus (once we have a _h_ equivalent of the API) Avati On Sun, Sep 29, 2013 at 10:05 PM, Anand Avati av...@gluster.org wrote: I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t? Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past. We need a stronger relationship between the two with a clearer relationship. It is probably necessary for a glfs_object to represent mulitple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call. Avati On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. Shyam. - Original Message - From: Anand Avati av...@gluster.org To: Shyamsundar Ranganathan srang...@redhat.com Cc: Gluster Devel gluster-devel@nongnu.org Sent: Wednesday, September 18, 2013 11:39:27 AM Subject: Re: RFC/Review: libgfapi object handle based extensions Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient. There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too) 1. glfs_validate_inode() has the assumption that object-inode deref is a guarded operation, but here we are doing an unguarded deref in the paramter glfs_resolve_base(). 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and inode can get resolved to a different graph. And in syncop_XX() we end up calling on graph1 with inode belonging to graph2. 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). 
Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason) I think the pattern should be similar to FD based fops which specifically address both the above problems. Here's an outline: glfs_h_(struct glfs *fs, glfs_object *object, ...) { xlator_t *subvol = NULL; inode_t *inode = NULL; __glfs_entry_fs (fs); subvol = glfs_active_subvol (fs); if (!subvol) { errno = EIO; ... goto out; } inode = glfs_resolve_inode (fs, object, subvol); if (!inode) { errno = ESTALE; ... goto out; } loc.inode = inode; ret = syncop_(subvol, loc, ...); } Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object, and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXX() is performed with matching subvol and inode. Also it returns the inode pointer so that no unsafe object-inode deref is done by the caller. Again, this is the same pattern followed by the fd based fops already. Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t. Thanks, Avati ___ Gluster-devel mailing list Gluster-devel
Re: [Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.
On Fri, Sep 27, 2013 at 1:56 AM, James purplei...@gmail.com wrote: On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote: Hello all, Hey, Interesting timing for this post... I've actually started working on automatic brick addition/removal. (I'm planning to add this to puppet-gluster of course.) I was hoping you could help out with the algorithm. I think it's a bit different if there's no replace-brick command as you are proposing. Here's the problem: Given a logically optimal initial volume: volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2 suppose I know that I want to add/remove bricks such that my new volume (if I had created it new) looks like: volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2 What is the optimal algorithm for determining the correct sequence of transforms that are needed to accomplish this task. Obviously there are some simpler corner cases, but I'd like to solve the general case. The transforms are obviously things like running the add-brick {...} and remove-brick {...} commands. Obviously we have to take into account that it's better to add bricks and rebalance before we remove bricks and risk the file system if a replica is missing. The algorithm should work for any replica N. We want to make sure the new layout makes sense to replicate the data on different servers. In many cases, this will require creating a circular chain of bricks as illustrated in the bottom of this image: http://joejulian.name/media/uploads/images/replica_expansion.png for example. I'd like to optimize for safety first, and then time, I imagine. Many thanks in advance. I see what you are asking. First of all, when running a 2-replica volume you almost pretty much always want to have an even number of servers, and add servers in even numbers. Ideally the two sides of the replicas should be placed in separate failures zones - separate racks with separate power supplies or separate AZs in the cloud. Having an odd number of servers with an 2 replicas is a very odd configuration. In all these years I am yet to come across a customer who has a production cluster with 2 replicas and an odd number of servers. And setting up replicas in such a chained manner makes it hard to reason about availability, especially when you are trying recover from a disaster. Having clear and separate pairs is definitely what is recommended. That being said, nothing prevents one from setting up a chain like above as long as you are comfortable with the complexity of the configuration. And phasing out replace-brick in favor of add-brick/remove-brick does not make the above configuration impossible either. Let's say you have a chained configuration of N servers, with pairs formed between every: h(i):/b1 h((i+1) % N):/b2 | i := 0 - N-1 Now you add N+1th server. Using replace-brick, you have been doing thus far: 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was part of a previous brick 2. replace-brick h0:/b2 hN:/b2 start ... commit In case you are doing an add-brick/remove-brick approach, you would now instead do: 1. add-brick h(N-1):/b1a hN:/b2 2. add-brick hN:/b1 h0:/b2a 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit You will not be left with only 1 copy of a file at any point in the process, and achieve the same end result as you were with replace-brick. As mentioned before, I once again request you to consider if you really want to deal with the configuration complexity of having chained replication, instead of just adding servers in pairs. 
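To make the chained layout and the add-brick/remove-brick transition above concrete, here is a small stand-alone illustration (editorial sketch; the host/brick names are the placeholders used in the mail, with N = 4 chosen arbitrarily):

    #include <stdio.h>

    int main (void)
    {
            int N = 4;   /* current number of servers in the chain */
            int i;

            /* pairs formed between h(i):/b1 and h((i+1) % N):/b2 */
            for (i = 0; i < N; i++)
                    printf ("replica pair: h%d:/b1  h%d:/b2\n", i, (i + 1) % N);

            /* growing the chain to N+1 servers without replace-brick */
            printf ("1. add-brick h%d:/b1a h%d:/b2\n", N - 1, N);
            printf ("2. add-brick h%d:/b1 h%d:/b2a\n", N, 0);
            printf ("3. remove-brick h%d:/b1 h%d:/b2 start ... commit\n",
                    N - 1, 0);
            return 0;
    }

Printing the pairs before and after the three steps shows that every file keeps at least two live copies throughout, which is the guarantee claimed above.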
Please ask if there are any more questions or concerns. Avati James Some comments below, although I'm a bit tired so I hope I said it all right. DHT's remove-brick + rebalance has been enhanced in the last couple of releases to be quite sophisticated. It can handle graceful decommissioning of bricks, including open file descriptors and hard links. Sweet This in a way is a feature overlap with replace-brick's data migration functionality. Replace-brick's data migration is currently also used for planned decommissioning of a brick. Reasons to remove replace-brick (or why remove-brick is better): - There are two methods of moving data. It is confusing for the users and hard for developers to maintain. - If server being replaced is a member of a replica set, neither remove-brick nor replace-brick data migration is necessary, because self-healing itself will recreate the data (replace-brick actually uses self-heal internally) - In a non-replicated config if a server is getting replaced by a new one, add-brick new + remove-brick old start achieves the same goal as replace-brick old new start. - In a non-replicated config, replace-brick is NOT glitch free (applications witness ENOTCONN if they are accessing data) whereas add-brick new + remove-brick old is completely transparent. - Replace brick strictly requires a server with enough
Re: [Gluster-devel] Finalizing interfaces for snapshot and clone creation in BD xlator
Adding Brian Foster (and gluster-devel) for the discussion of unified UI for snapshotting. Mohan, I must have missed your comment. Can you please point to the specific patch where you posted your comment? Avati On Tue, Sep 24, 2013 at 9:29 AM, M. Mohan Kumar mohankuma...@gmail.comwrote: Hi Avati, I am ready with V5 of BD xlator patches (I consolidated the patches to 5). Before posting them I wanted your opinion about the interfaces I use for creating clone and snapshot. I posted them on Gerrit few days back. Could you please respond to that? -- Regards, Mohan. ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Possible memory leak in gluster samba vfs
On Tue, Sep 24, 2013 at 6:37 PM, haiwei.xie-soulinfo haiwei@soulinfo.com wrote: hi, Our patch for this bug, running looks good. smbd will not exit with oom-kill. But it's not correct method. git version: release-3.4/886021a31bdac83c2124d08d64b81f22d82039d6 diff --git a/api/src/glfs-fops.c b/api/src/glfs-fops.c index 66e7d69..535ee53 100644 --- a/api/src/glfs-fops.c +++ b/api/src/glfs-fops.c @@ -713,7 +713,9 @@ glfs_pwritev (struct glfs_fd *glfd, const struct iovec *iovec, int iovcnt, } size = iov_length (iovec, iovcnt); - +#define MIN_LEN 8 * 1024 + if (size MIN_LEN) + size = MIN_LEN; iobuf = iobuf_get2 (subvol-ctx-iobuf_pool, size); if (!iobuf) { ret = -1; Ah, looks like we need to tune the page_size/num_pages table in libglusterfs/src/iobuf.c. The table is allowing for too small pages. We should probably remove entries for page size less than 4KB. Just doing that might fix your issue: diff --git a/libglusterfs/src/iobuf.c b/libglusterfs/src/iobuf.c index a89e962..0269004 100644 --- a/libglusterfs/src/iobuf.c +++ b/libglusterfs/src/iobuf.c @@ -24,9 +24,7 @@ /* Make sure this array is sorted based on pagesize */ struct iobuf_init_config gf_iobuf_init_config[] = { /* { pagesize, num_pages }, */ -{128, 1024}, -{512, 512}, -{2 * 1024, 512}, +{4 * 1024, 256}, {8 * 1024, 128}, {32 * 1024, 64}, {128 * 1024, 32}, Avati On 09/13/2013 06:03 PM, kane wrote: Hi We use samba gluster vfs in IO test, but meet with gluster server smbd oom killer, The smbd process spend over 15g RES with top command show, in the end is our simple test code: gluster server vfs -- smbd -- client mount dir /mnt/vfs-- execute vfs test program $ ./vfs 1000 then we can watch gluster server smbd RES with top command. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4000 soul 20 0 5486m 4.9g 10m R 108.4 31.5 111:07.07 smbd 3447 root 20 0 1408m 44m 2428 S 44.4 0.3 59:11.55 glusterfsd io test code: === #define _LARGEFILE64_SOURCE #include stdio.h #include unistd.h #include string.h #include pthread.h #include stdlib.h #include fcntl.h #include sys/types.h int WT = 1; #define RND(x) ((x0)?(genrand() % (x)):0) extern unsigned long genrand(); extern void sgenrand(); /* Period parameters */ #define N 624 #define M 397 #define MATRIX_A 0x9908b0df /* constant vector a */ #define UPPER_MASK 0x8000 /* most significant w-r bits */ #define LOWER_MASK 0x7fff /* least significant r bits */ /* Tempering parameters */ #define TEMPERING_MASK_B 0x9d2c5680 #define TEMPERING_MASK_C 0xefc6 #define TEMPERING_SHIFT_U(y) (y 11) #define TEMPERING_SHIFT_S(y) (y 7) #define TEMPERING_SHIFT_T(y) (y 15) #define TEMPERING_SHIFT_L(y) (y 18) static unsigned long mt[N]; /* the array for the state vector */ static int mti=N+1; /* mti==N+1 means mt[N] is not initialized */ /* Initializing the array with a seed */ void sgenrand(seed) unsigned long seed; { int i; for (i=0;iN;i++) { mt[i] = seed 0x; seed = 69069 * seed + 1; mt[i] |= (seed 0x) 16; seed = 69069 * seed + 1; } mti = N; } unsigned long genrand() { unsigned long y; static unsigned long mag01[2]={0x0, MATRIX_A}; /* mag01[x] = x * MATRIX_A for x=0,1 */ if (mti = N) { /* generate N words at one time */ int kk; if (mti == N+1) /* if sgenrand() has not been called, */ sgenrand(4357); /* a default initial seed is used */ for (kk=0;kkN-M;kk++) { y = (mt[kk]UPPER_MASK)|(mt[kk+1]LOWER_MASK); mt[kk] = mt[kk+M] ^ (y 1) ^ mag01[y 0x1]; } for (;kkN-1;kk++) { y = (mt[kk]UPPER_MASK)|(mt[kk+1]LOWER_MASK); mt[kk] = mt[kk+(M-N)] ^ (y 1) ^ mag01[y 0x1]; } y = (mt[N-1]UPPER_MASK)|(mt[0]LOWER_MASK); mt[N-1] = mt[M-1] ^ (y 
1) ^ mag01[y 0x1]; mti = 0; } y = mt[mti++]; y ^= TEMPERING_SHIFT_U(y); y ^= TEMPERING_SHIFT_S(y) TEMPERING_MASK_B; y ^= TEMPERING_SHIFT_T(y) TEMPERING_MASK_C; y ^= TEMPERING_SHIFT_L(y); return y; } char *initialize_file_source(int size) { char *new_source; int i; if ((new_source=(char *)malloc(size))==NULL) /* allocate buffer */ fprintf(stderr,Error: failed to allocate source file of size %d\n,size); else for (i=0; isize; i++) /* file buffer with junk */ new_source[i]=32+RND(95); return(new_source); } void *tran_file(void *map) { int block_size = 512; char *read_buffer; /* temporary space for reading file data into */ int fd = open((char *)map, O_RDWR | O_CREAT | O_TRUNC, 0644); if(fd == -1) { perror(open); return ; } //read_buffer=(char *)malloc(block_size); //memset(read_buffer,
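For context on why both the 8KB clamp in the workaround above (if (size < MIN_LEN) size = MIN_LEN) and the suggested table change help: iobuf_get2() serves a request from the arena with the smallest configured page size that fits it, so dropping the sub-4KB rows makes tiny writes land in the 4KB arena instead of churning through very small pages. A rough model of that selection (editorial sketch, not the actual libglusterfs code; only the rows visible in the quoted diff are listed):

    struct iobuf_cfg { size_t pagesize; int num_pages; };

    static const struct iobuf_cfg cfg[] = {
            {4 * 1024, 256}, {8 * 1024, 128}, {32 * 1024, 64}, {128 * 1024, 32},
    };

    /* return the page size of the arena a request of 'required' bytes would
       be served from, or 0 if it is larger than every configured arena */
    static size_t
    pick_pagesize (size_t required)
    {
            size_t i;

            for (i = 0; i < sizeof (cfg) / sizeof (cfg[0]); i++)
                    if (required <= cfg[i].pagesize)
                            return cfg[i].pagesize;

            return 0;
    }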
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.com wrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/** 5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
Can you please confirm if http://review.gluster.org/5979 fixes the problem of #998967 for you? If so we will backport and include the patch in 3.4.1. Thanks, Avati On Fri, Sep 20, 2013 at 2:03 AM, Anand Avati av...@gluster.org wrote: I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.com wrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/** 5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
Please pick #2 resubmission, that is fine. Avati On Fri, Sep 20, 2013 at 2:48 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: Will take about 2 hours to setup test env, also build seems to be failed but does not seem to be caused by the patch :/ On Fri, Sep 20, 2013 at 11:38 AM, Anand Avati av...@gluster.org wrote: Can you please confirm if http://review.gluster.org/5979 fixes the problem of #998967 for you? If so we will backport and include the patch in 3.4.1. Thanks, Avati On Fri, Sep 20, 2013 at 2:03 AM, Anand Avati av...@gluster.org wrote: I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.comwrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/** 5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
Thanks Lukas. Copying Lubomir. Can you confirm that http://review.gluster.org/5693 is no more needed then? Also, can you please vote on http://review.gluster.org/5979? Thanks, Avati On Fri, Sep 20, 2013 at 4:39 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: I was unable to reproduce the issue with patch #2 from http://review.gluster.org/#/c/5979/ Thank you. On Fri, Sep 20, 2013 at 11:52 AM, Anand Avati av...@gluster.org wrote: Please pick #2 resubmission, that is fine. Avati On Fri, Sep 20, 2013 at 2:48 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: Will take about 2 hours to setup test env, also build seems to be failed but does not seem to be caused by the patch :/ On Fri, Sep 20, 2013 at 11:38 AM, Anand Avati av...@gluster.org wrote: Can you please confirm if http://review.gluster.org/5979 fixes the problem of #998967 for you? If so we will backport and include the patch in 3.4.1. Thanks, Avati On Fri, Sep 20, 2013 at 2:03 AM, Anand Avati av...@gluster.org wrote: I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.comwrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/ **5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. That should be OK (passing iatt as NULL to skip forced lookup) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
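A minimal sketch of the conditional force-lookup agreed on above, using the variable names from the discussion (the surrounding glfs_resolve_at() code is assumed, not quoted from the final patch):

    /* current behaviour: the terminal component is always force-looked-up */
    force_lookup = (reval || !next_component);

    /* proposed behaviour: force the lookup on the terminal name only when the
       caller wants the iatt filled, so callers such as glfs_h_unlink() can
       pass iatt as NULL and skip the extra syncop_lookup */
    force_lookup = (reval || (!next_component && iatt));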
Re: [Gluster-devel] [Gluster-users] samba-glusterfs-vfs does not build
On Thu, Sep 19, 2013 at 11:28 AM, Nux! n...@li.nux.ro wrote: On 18.09.2013 19:04, Nux! wrote: Hi, I'm trying to build and test samba-glusterfs-vfs, but problems appear from the start: http://fpaste.org/40562/**95274621/ http://fpaste.org/40562/95274621/ Any pointers? Anyone from devel has any ideas? Thanks, Lucian Have you ./configure'd in the samba tree? --with-samba-source must point to a built samba tree (not just extracted) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Mon, Sep 16, 2013 at 4:18 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: - Original Message - From: Anand Avati av...@gluster.org Sent: Friday, September 13, 2013 11:09:37 PM Shyam, Thanks for sending this out. Can you post your patches to review.gluster.org and link the URL in this thread? That would make things a lot more clear for feedback and review. Done, please find the same here, http://review.gluster.org/#/c/5936/ Shyam Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient. There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too) 1. glfs_validate_inode() has the assumption that object-inode deref is a guarded operation, but here we are doing an unguarded deref in the paramter glfs_resolve_base(). 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and inode can get resolved to a different graph. And in syncop_XX() we end up calling on graph1 with inode belonging to graph2. 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason) I think the pattern should be similar to FD based fops which specifically address both the above problems. Here's an outline: glfs_h_(struct glfs *fs, glfs_object *object, ...) { xlator_t *subvol = NULL; inode_t *inode = NULL; __glfs_entry_fs (fs); subvol = glfs_active_subvol (fs); if (!subvol) { errno = EIO; ... goto out; } inode = glfs_resolve_inode (fs, object, subvol); if (!inode) { errno = ESTALE; ... goto out; } loc.inode = inode; ret = syncop_(subvol, loc, ...); } Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object, and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXX() is performed with matching subvol and inode. Also it returns the inode pointer so that no unsafe object-inode deref is done by the caller. Again, this is the same pattern followed by the fd based fops already. Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t. Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Fwd: [Nfs-ganesha-devel] Announce: Push of next pre-2.0-dev_49
Anand, This is a great first step.. Looking forward for the integration to mature soon. This is a big step for supporting NFSv4 and pNFS for GlusterFS. Thanks! Avati On Sat, Sep 14, 2013 at 3:18 AM, Anand Subramanian ana...@redhat.comwrote: FYI, the FSAL (File System Abstraction Layer) for Gluster is now available in the upstream nfs-ganesha community (details of branch, tag and commit below). This enables users to export Gluster volumes through nfs-ganesha and for use by both nfs v3 and v4 clients. Please note that this is an on-going effort. More details wrt configuration, building etc. will follow. Anand - Forwarded Message - From: Jim Lieb jl...@panasas.com To: nfs-ganesha-de...@lists.sourceforge.net Sent: Fri, 13 Sep 2013 22:20:43 -0400 (EDT) Subject: [Nfs-ganesha-devel] Announce: Push of next pre-2.0-dev_49 Pushed to the project repo: git://github.com/nfs-ganesha/nfs-ganesha.git branch next Branch: next Tag: pre-2.0-dev_49 This week's merge is big. It also took a little extra effort to file and fit some of the pieces to get them to slide into place. The Red Hat Gluster FS team has submitted their fsal. I have built it but have not tested it. It requires the glfsapi library and a header which I can supply to anyone else who wants to play. They will be testing with us at BAT in Boston this month. It is built by default but the build will be disabled if the build cannot find the header or libary. IBM has also submitted the Protectier fsal. I have not built this but we expect a report from their team once they have tested the merge. Its build is off by default. The Pseudo filesystem handle for v4 has been reworked. This was done to get the necessary handle changes in for V2.0. Further work on pseudo file system infrastructure will build on this in 2.1. Frank and the IBM team submitted a large set of 1.5 to 2.0 bugfix ports. This is almost all of them. Frank has updated the port document reflecting current state. Please feel free to grab some patches and port them. As usual, there have been bugfixes in multiple places. We tried to get the 1.5 log rotation and compression code in but found some bugs that will take more than a few line fix to get working in 2.0. As a result, it has been reverted. Highlights: * FSAL_GLUSTER is a new fsal to export Gluster FS * FSAL_PT is a new fsal for the Protectier file system * Rework of the PseudoFS file handle format (NVFv4+ only) * More 1.5 to 2.0 bugfix ports * Lots of bugfixes Enjoy Jim -- Jim Lieb Linux Systems Engineer Panasas Inc. If ease of use was the only requirement, we would all be riding tricycles - Douglas Engelbart 1925–2013 Short log from pre-2.0-dev_47 -- commit b2a927948e627367d87af04892afbb031ed85d75 Author: Jeremy Bongio jbon...@us.ibm.com Don't access export in SAVEFH request when FH is for pseudofs and fix up references commit 03228228ab64f8d004b864ae7829b51707bfc068 Author: Jim Lieb jl...@panasas.com Revert Added support for rotation and compression of log files. commit 0f8690df03a57243d65f20d23c53f86a9e0b17cc Merge: cca7875 9483a7d Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'ffilz/porting-doc' into merge_next commit cca787542d85112cb3e0706caf5ae007b8cd5285 Merge: 2f0118d af03de5 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'martinetd/for_dev_49' into merge_next commit 9483a7d7ab54a5e6e6daf4521928b147fa7329b8 Author: Frank S. Filz ffilz...@mindspring.com Clean up porting doc commit d19cadcf4069976c299e968e890efc8d0ccf001a Author: Frank S. 
Filz ffilz...@mindspring.com Update porting doc for dev_49 commit 2f0118d2eb9a3f95cff08070ff3453ca7ce0d4a2 Merge: a75665a 9530440 Author: Jim Lieb jl...@panasas.com Merge branch 'glusterfs' into merge_next commit a75665ac75c01e767780cea023c2a8f74b46e2a0 Merge: 3c7578c 183e044 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'sachin/next' into merge_next commit 3c7578cde4d47344b0dac2264e9990de3b029ba6 Merge: c0aa16f 75d81d1 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'linuxbox2/next' into merge_next commit c0aa16f8ea25c3dae059b349302083291ea7af9d Author: Jim Lieb jl...@panasas.com Fixups to logging macros and display logic commit 183e0440d2d8a9f1ef0513807829fd7c15e568d4 Author: Sachin Bhamare sbham...@panasas.com Fix the order in which credentials are set in fsal_set_credentials(). commit 0af11c7592092825098215733fc9a14cbc9bcfe3 Author: Sachin Bhamare sbham...@panasas.com Fix bugs in FreeBSD version of setuser() and setgroup(). commit b9ca8bddbe140f90c216aeb6611465060607420e Merge: 9629e2a 5eeb095 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'ganltc/ibm_next_20' into merge_next commit 953044057566c7d9013b276a14879a3f226d6972 Author: Jim
Re: [Gluster-devel] Build broken on current head with F19?
I think you might need this - http://review.gluster.org/5896 Avati On Wed, Sep 11, 2013 at 1:34 PM, Justin Clift jcl...@redhat.com wrote: Hi all, Building on F19 with current Gluster master head seems broken atm. Looks related to the QEMU code. Have we added a new compilation dependency or something (that I'm obviously missing :)? + Justin * [snip] CC changelog-notifier.lo CCLD changelog.la Making all in lib Making all in src CC libgfchangelog_la-gf-changelog.lo CC libgfchangelog_la-gf-changelog-process.lo CC libgfchangelog_la-gf-changelog-helpers.lo CC libgfchangelog_la-clear.lo CC libgfchangelog_la-copy.lo CC libgfchangelog_la-gen_uuid.lo CC libgfchangelog_la-pack.lo CC libgfchangelog_la-parse.lo CC libgfchangelog_la-unparse.lo CC libgfchangelog_la-uuid_time.lo CC libgfchangelog_la-compare.lo CC libgfchangelog_la-isnull.lo CC libgfchangelog_la-unpack.lo CCLD libgfchangelog.la Making all in gfid-access Making all in src CC gfid-access.lo CCLD gfid-access.la Making all in glupy Making all in src CC glupy.lo CCLD glupy.la Making all in qemu-block Making all in src CC qemu-coroutine.lo CC qemu-coroutine-lock.lo CC qemu-coroutine-sleep.lo CC block.lo CC nop-symbols.lo CC aes.lo CC bitmap.lo CC bitops.lo CC cutils.lo In file included from ../../../../contrib/qemu/block.c:25:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ In file included from ../../../../contrib/qemu/include/qemu/bitops.h:15:0, from ../../../../contrib/qemu/util/bitops.c:14: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. compilation terminated. In file included from ../../../../contrib/qemu/util/aes.c:30:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/trace/generated-tracers.h:6:0, from ../../../../contrib/qemu/include/trace.h:4, from ../../../../contrib/qemu/qemu-coroutine.c:15: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/include/qemu/bitops.h:15:0, from ../../../../contrib/qemu/util/bitmap.c:12: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/qemu-coroutine-lock.c:25:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/include/qemu/timer.h:4:0, from ../../../../contrib/qemu/include/block/coroutine.h:20, from ../../../../contrib/qemu/qemu-coroutine-sleep.c:14: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/util/cutils.c:24:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. 
make[6]: *** [bitops.lo] Error 1 make[6]: *** Waiting for unfinished jobs make[6]: *** [block.lo] Error 1 make[6]: *** [aes.lo] Error 1 make[6]: *** [qemu-coroutine.lo] Error 1 make[6]: *** [bitmap.lo] Error 1 make[6]: *** [qemu-coroutine-lock.lo] Error 1 make[6]: *** [qemu-coroutine-sleep.lo] Error 1 make[6]: *** [cutils.lo] Error 1 make[5]: *** [all-recursive] Error 1 make[4]: *** [all-recursive] Error 1 make[3]: *** [all-recursive] Error 1 make[2]: *** [all-recursive] Error 1 make[1]: *** [all] Error 2 make[1]: Leaving directory `/home/jc/git_repos/glusterfs/extras/LinuxRPM/rpmbuild/BUILD/glusterfs-3git' error: Bad exit status from /var/tmp/rpm-tmp.uXvaGe (%build) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.uXvaGe (%build) make: *** [rpms] Error 1 * ___ Gluster-devel mailing list
Re: [Gluster-devel] Build broken on current head with F19?
Thanks for confirming Justin! Niels, do you know why rpm.t regression test is failing on your patch? Avati On Wed, Sep 11, 2013 at 1:49 PM, Justin Clift jcl...@redhat.com wrote: On Wed, 2013-09-11 at 13:37 -0700, Anand Avati wrote: I think you might need this - http://review.gluster.org/5896 Thanks Avati, that solved the build failure for me. :) + Justin ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Issues with fallocate, discard and zerofill
It is cleaner to implement it as a separate fop. The complexity of overloading writev() is unnecessary. There would be a whole bunch of new if/else condititions to be introduced in existing code, and modules like write-behind, stripe etc. where special action is taken in multiple places based on size (and offset into the buffer), would be very delicate error prone changes. That being said, I still believe the FOP interface should be similar to SCSI write_same, something like this: int fop_write_same (call_frame_t *frame, xlator_t *this, fd_t *fd, void *buf, size_t len, off_t offset, int repeat); and zerofill would be a gfapi wrapper around write_same: int zerofill (call_frame_t *frame, xlator_t *this, fd_t *fd, off_t offset, int len) { char zero[1] = {0}; return fop_write_same (frame, this, fd, zero, 1, offset, len); } Avati On Thu, Sep 5, 2013 at 10:28 PM, M. Mohan Kumar mo...@in.ibm.com wrote: Anand Avati anand.av...@gmail.com writes: Hi Shishir, Its possible to overload writev FOP for achieving zerofill functionality. Is there any open issues with this zerofill functionality even after overloading in writev? Shishir, Is this in reference to the dht open file rebalance (of replaying the operations to the destination server)? I am assuming so, as that is something which has to be handled. The other question is how should fallocate/discard be handled by self-heal in AFR. I'm not sure how important it is, but will be certainly good to bounce some ideas off here. Maybe we should implement a fiemap fop to query extents/holes and replay them in the other serverl? Avati On Tue, Aug 13, 2013 at 10:49 PM, Bharata B Rao bharata@gmail.com wrote: Hi Avati, Brian, During the recently held gluster meetup, Shishir mentioned about a potential problem (related to fd migration etc) in the zerofill implementation (http://review.gluster.org/#/c/5327/) and also mentioned that same/similar issues are present with fallocate and discard implementations. Since zerofill has been modelled on fallocate/discard, I was wondering if it would be possible to address these issues in fallocate/discard first so that we could potentially follow the same in zerofill implementation. Regards, Bharata. -- http://raobharata.wordpress.com/ ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
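As a rough illustration of the proposed write_same semantics, a brick-side handler could expand it into plain pwrite() calls of the repeated buffer. The function below is an editorial sketch (posix_write_same is not an existing GlusterFS function), matching the (buf, len, offset, repeat) arguments suggested above:

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    static ssize_t
    posix_write_same (int fd, const void *buf, size_t len, off_t offset,
                      int repeat)
    {
            ssize_t total = 0;
            int     i;

            for (i = 0; i < repeat; i++) {
                    ssize_t ret = pwrite (fd, buf, len,
                                          offset + (off_t) i * (off_t) len);
                    if (ret < 0)
                            return -1;   /* errno already set by pwrite */
                    total += ret;
            }
            return total;
    }

With this shape, the zerofill wrapper above degenerates into 'repeat' writes of a single zero byte, which a real implementation would batch into larger buffers or map to a lower-level zeroing primitive where one is available.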
Re: [Gluster-devel] [FEEDBACK] Governance of GlusterFS project
Good point Amar.. Noted. Avati On Fri, Sep 6, 2013 at 1:40 AM, Amar Tumballi ama...@gmail.com wrote: One of the other things we missed in this thread is how to handle bugs in bugzilla, and who should own the triage for high/urgent priority bugs. -Amar On Fri, Jul 26, 2013 at 10:56 PM, Anand Avati anand.av...@gmail.comwrote: Hello everyone, We are in the process of formalizing the governance model of the GlusterFS project. Historically, the governance of the project has been loosely structured. This is an invitation to all of you to participate in this discussion and provide your feedback and suggestions on how we should evolve a formal model. Feedback from this thread will be considered to the extent possible in formulating the draft (which will be sent out for review as well). Here are some specific topics to seed the discussion: - Core team formation - what are the qualifications for membership (e.g contributions of code, doc, packaging, support on irc/lists, how to quantify?) - what are the responsibilities of the group (e.g direction of the project, project roadmap, infrastructure, membership) - Roadmap - process of proposing features - process of selection of features for release - Release management - timelines and frequency - release themes - life cycle and support for releases - project management and tracking - Project maintainers - qualification for membership - process and evaluation There are a lot more topics which need to be discussed, I just named some to get started. I am sure our community has members who belong and participate (or at least are familiar with) other open source project communities. Your feedback will be valuable. Looking forward to hearing from you! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
[Gluster-devel] readdir() scalability (was Re: [RFC ] dictionary optimizations)
On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 04/09/13 18:10, Anand Avati wrote: On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 04/09/13 14:05, Jeff Darcy wrote: On 09/04/2013 04:27 AM, Xavier Hernandez wrote: I would also like to note that each node can store multiple elements. The current implementation creates a node for each byte in the key. In my implementation I only create a node if there is a prefix coincidence between 2 or more keys. This reduces the number of nodes and the number of indirections. Whatever we do, we should try to make sure that the changes are profiled against real usage. When I was making my own dict optimizations back in March of last year, I started by looking at how they're actually used. At that time, a significant majority of dictionaries contained just one item. That's why I only implemented a simple mechanism to pre-allocate the first data_pair instead of doing something more ambitious. Even then, the difference in actual performance or CPU usage was barely measurable. Dict usage has certainly changed since then, but I think you'd still be hard pressed to find a case where a single dict contains more than a handful of entries, and approaches that are optimized for dozens to hundreds might well perform worse than simple ones (e.g. because of cache aliasing or branch misprediction). If you're looking for other optimization opportunities that might provide even bigger bang for the buck then I suggest that stack-frame or frame-local allocations are a good place to start. Or string copying in places like loc_copy. Or the entire fd_ctx/inode_ctx subsystem. Let me know and I'll come up with a few more. To put a bit of a positive spin on things, the GlusterFS code offers many opportunities for improvement in terms of CPU and memory efficiency (though it's surprisingly still way better than Ceph in that regard). Yes. The optimizations on dictionary structures are not a big improvement in the overall performance of GlusterFS. I tried it in a real situation and the benefit was only marginal. However I didn't test new features like an atomic lookup-and-remove-if-found (because I would have had to review all the code). I think this kind of functionality could improve the results a bit more. However this is not the only reason to do these changes. While I've been writing code I've found that it's tedious to do some things just because there aren't such functions in dict_t. Some actions require multiple calls, having to check multiple errors, adding complexity and limiting readability of the code. Many of these situations could be solved using functions similar to what I proposed. On the other side, if dict_t must be truly considered a concurrent structure, there are a lot of race conditions that might appear when doing some operations. It would require a great effort to take care of all these possibilities everywhere. It would be better to pack most of these situations into functions inside dict_t itself, where it is easier to combine some operations. By the way, I've made some tests with multiple bricks and it seems that there is a clear speed loss on directory listings as the number of bricks increases. Since bricks should be independent and can work in parallel, I didn't expect such a big performance degradation. 
The likely reason is that, even though bricks are parallel for IO, readdir is essentially a sequential operation and DHT has a limitation that a readdir reply batch does not cross server boundaries. So if you have 10 files and 1 server, all 10 entries are returned in one call to the app/libc. If you have 10 files and 10 servers evenly distributed, the app/libc has to perform 10 calls and keeps getting one file at a time. This problem goes away when each server has enough files to fill up a readdir batch. It's only when you have too few files and too many servers that this dilution problem shows up. However, this is just a theory and your problem may be something else too. I didn't know that DHT was doing a sequential brick scan on readdir(p) (my fault). Why is that? Why can't it return entries crossing a server boundary? Is it due to a technical reason or only to the current implementation? I've made a test using only directories (50 directories with 50 subdirectories each). I started with one brick and measured the time to do a recursive 'ls'. Then I sequentially added an additional brick, up to 6 (all of them physically independent), and repeated the ls. The time increases linearly as the number of bricks grows. As more bricks were added, the rebalancing time was also growing linearly. I think this is a big problem for scalability. It can be partially hidden by using some caching or preloading mechanisms
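A toy model of the batching behaviour described above (editorial sketch, not the actual dht_readdirp code): each reply batch is filled from a single subvolume and stops at the server boundary, so with few files per brick every application-level readdir() call drains only one server.

    struct subvol { int nfiles; int next_index; };

    /* returns the number of entries placed into one reply batch */
    static int
    dht_fill_batch (struct subvol *subvols, int nsubvols, int *cur,
                    int batch_size)
    {
            int filled = 0;

            while (*cur < nsubvols) {
                    struct subvol *s = &subvols[*cur];

                    while (s->next_index < s->nfiles && filled < batch_size) {
                            s->next_index++;  /* copy one dirent into the batch */
                            filled++;
                    }
                    if (filled > 0)
                            return filled;    /* never mix entries from two subvols */
                    (*cur)++;                 /* this subvol is exhausted, move on */
            }
            return 0;                         /* end of directory */
    }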
Re: [Gluster-devel] Change in glusterfs[release-3.4]: call-stub: internal refactor
On 9/5/13 6:27 AM, Vijay Bellur (Code Review) wrote: Vijay Bellur has submitted this change and it was merged. Change subject: call-stub: internal refactor .. call-stub: internal refactor - re-structure members of call_stub_t with new simpler layout - easier to inspect call_stub_t contents in gdb now - fix a bunch of double unrefs and double frees in cbk stub - change all STACK_UNWIND to STACK_UNWIND_STRICT and thereby fixed a lot of bad params - implement new API call_unwind_error() which can even be called on fop_XXX_stub(), and not necessarily fop_XXX_cbk_stub() Change-Id: Idf979f14d46256af0afb9658915cc79de157b2d7 BUG: 846240 Signed-off-by: Anand Avati av...@redhat.com Reviewed-on: http://review.gluster.org/4520 Tested-by: Gluster Build System jenk...@build.gluster.com Reviewed-by: Jeff Darcy jda...@redhat.com Reviewed-on: http://review.gluster.org/5820 Reviewed-by: Raghavendra Bhat raghaven...@redhat.com --- M libglusterfs/src/call-stub.c M libglusterfs/src/call-stub.h M xlators/performance/write-behind/src/write-behind.c 3 files changed, 1,104 insertions(+), 2,994 deletions(-) Approvals: Raghavendra Bhat: Looks good to me, approved Gluster Build System: Verified Note that this backported patch had a bug in master and there was a follow-up patch http://review.gluster.org/4564. This needs to be backported too. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
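A hedged usage sketch of the call_unwind_error() API the commit message introduces; the signature shown (stub, op_ret, op_errno) and the placeholder names resume_fn and must_fail_now are assumptions based on the description, not lines from the patch itself:

    /* queue the operation for later resumption */
    stub = fop_writev_stub (frame, resume_fn, fd, vector, count, offset,
                            flags, iobref, xdata);
    if (!stub) {
            STACK_UNWIND_STRICT (writev, frame, -1, ENOMEM, NULL, NULL, NULL);
            return 0;
    }

    if (must_fail_now)
            /* the refactored API: fail an un-resumed fop stub directly,
               without first building a matching _cbk stub */
            call_unwind_error (stub, -1, EIO);
    else
            call_resume (stub);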
Re: [Gluster-devel] [Gluster-users] Enabling Apache Hadoop on GlusterFS: glusterfs-hadoop 2.1 released
On Thu, Sep 5, 2013 at 2:53 PM, Stephen Watt sw...@redhat.com wrote: Hi Folks We are pleased to announce a major update to the glusterfs-hadoop project with the release of version 2.1. The glusterfs-hadoop project, available at The glusterfs-hadoop project team, provides an Apache licensed Hadoop FileSystem plugin which enables Apache Hadoop 1.x and 2.x to run directly on top of GlusterFS. This release includes a re-architected plugin which now extends existing functionality within Hadoop to run on local and POSIX File Systems. -- Overview -- Apache Hadoop has a pluggable FileSystem Architecture. This means that if you have a filesystem or object store that you would like to use with Hadoop, you can create a Hadoop FileSystem plugin for it which will act as a mediator between the generic Hadoop FileSystem interface and your filesystem of choice. A popular example would be that over a million Hadoop clusters are spun up on Amazon every year, a lot of which use Amazon S3 as the Hadoop FileSystem. In order to configure the plugin, a specific deployment configuration is required. Firstly, it is required that the Hadoop JobTracker and TaskTrackers (or the Hadoop 2.x equivalents) are installed on servers within the gluster trusted storage pool for a given gluster volume. The JobTracker uses the plugin to query the extended attributes for job input files in gluster to ascertain file placement as well as the distribution of file replicas across the cluster. The TaskTrackers use the plugin to leverage a local fuse mount of the gluster volume in order to access the data required for the tasks. When the JobTracker receives a Hadoop job, it uses the locality information it ascertains via the plugin to send the tasks for the Hadoop Job to Hadoop TaskTrackers on servers that have the data required for the task within their local bricks. This ensures data is read from disk and not over the network. Please see the attached diagram which provides an overview of the entire solution for a Hadoop 1.x deployment. The community project, along with the documentation and available releases, is hosted within the Gluster Forge at http://forge.gluster.org/hadoop. The glusterfs-hadoop project will also be available within the Fedora 20 release later this year, alongside fellow Fedora newcomer Apache Hadoop and the already available gluster project. The glusterfs-hadoop project team welcomes contributions and participation from the broader community. Stay tuned for upcoming posts around GlusterFS integration into the Apache Ambari and Fedora projects. Regards The glusterfs-hadoop project team ___ Announce mailing list annou...@gluster.org http://supercolony.gluster.org/mailman/listinfo/announce ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users Congratulations! This is great news!! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Change in glusterfs[master]: bd: posix/multi-brick support to BD xlator
On 09/01/2013 11:26 AM, M. Mohan Kumar (Code Review) wrote: Hello Anand Avati, Gluster Build System, I'd like you to reexamine a change. Please visit http://review.gluster.org/4809 to look at the new patch set (#4). Change subject: bd: posix/multi-brick support to BD xlator .. bd: posix/multi-brick support to BD xlator Current BD xlator (block backend) has a few limitations such as * Creation of directories not supported * Supports only single brick * Does not use extended attributes (and client gfid) like posix xlator * Creation of special files (symbolic links, device nodes etc) not supported Basic limitation of not allowing directory creation is blocking oVirt/VDSM to consume BD xlator as part of Gluster domain since VDSM creates multi-level directories when GlusterFS is used as storage backend for storing VM images. To overcome these limitations a new BD xlator with following improvements is suggested. * New hybrid BD xlator that handles both regular files and block device files * The volume will have both POSIX and BD bricks. Regular files are created on POSIX bricks, block devices are created on the BD brick (VG) * BD xlator leverages exiting POSIX xlator for most POSIX calls and hence sits above the POSIX xlator * Block device file is differentiated from regular file by an extended attribute * The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a posix file to Logical Volume (LV). * When a client sends a request to set BD_XATTR on a posix file, a new LV is created and mapped to posix file. So every block device will have a representative file in POSIX brick with 'user.glusterfs.bd' (BD_XATTR) and 'user.glusterfs.bd.size' (BD_XATTR_SIZE) set. * Here after all operations on this file results in LV related operations. New BD xlator code is placed in xlators/storage/bd directory. For example opening a file that has BD_XATTR_PATH set results in opening the LV block device, reading results in reading the corresponding LV block device. When BD xlator gets request to set BD_XATTR via setxattr call, it creates a LV and information about this LV is placed in the xattr of the posix file. xattr user.glusterfs.bd, user.glusterfs.bd.size used to identify that posix file is mapped to BD. Usage: Server side: [root@host1 ~]# gluster volume create bdvol device vg host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2 It creates a distributed gluster volume 'bdvol' with Volume Group vg1 using posix brick /storage/vg1_info in host1 and Volume Group vg2 using /storage/vg2_info in host2. [root@host1 ~]# gluster volume start bdvol Client side: [root@node ~]# mount -t glusterfs host1:/bdvol /media [root@node ~]# touch /media/posix It creates regular posix file 'posix' in either host1:/vg1 or host2:/vg2 brick [root@node ~]# mkdir /media/image [root@node ~]# touch /media/image/lv1 It also creates regular posix file 'lv1' in either host1:/vg1 or host2:/vg2 brick [root@node ~]# setfattr -n user.glusterfs.bd -v lv /media/image/lv1 [root@node ~]# Above setxattr results in creating a new LV in corresponding brick's VG and it sets 'user.glusterfs.bd' with value 'lv' and 'user.glusterfs.size' with default extent size. [root@node ~]# truncate -s5G /media/image/lv1 It results in resizig LV 'lv1'to 5G Changes from previous version V3: * Added support in FUSE to support full/linked clone * Added support to merge snapshots and provide information about origin * bd_map xlator removed * iatt structure used in inode_ctx. 
iatt is cached and updated during fsync/flush * aio support * Type and capabilities of volume are exported through getxattr Changes from version 2: * Used inode_context for caching BD size and to check if loc/fd is BD or not. * Added GlusterFS server offloaded copy and snapshot through setfattr FOP. As part of this libgfapi is modified. * BD xlator supports stripe * During unlinking if a LV file is already opened, its added to delete list and bd_del_thread tries to delete from this list when a last reference to that file is closed. Changes from previous version: * gfid is used as name of LV * ? is used to specify VG name for creating BD volume in volume create, add-brick. gluster volume create volname host:/path?vg * open-behind issue is fixed * A replicate brick can be added dynamically and LVs from source brick are replicated to destination brick * A distribute brick can be added dynamically and rebalance operation distributes existing LVs/files to the new brick * Thin provisioning support added. * bd_map xlator support retained * setfattr -n user.glusterfs.bd -v lv creates a regular LV and setfattr -n user.glusterfs.bd -v thin creates thin LV * Capability and backend information added to gluster volume info (and --xml) so that management tools can exploit BD xlator. * tracing support for bd xlator added TODO: * Add support to display snapshots for a given LV
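The same flow as the shell usage above, driven from a C program on a client mount (editorial sketch; the path is the illustrative one from the example, and plain setxattr(2)/truncate(2) are used just as setfattr and truncate do):

    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    int main (void)
    {
            const char *path = "/media/image/lv1";

            /* create the posix file that will represent the block device */
            int fd = open (path, O_CREAT | O_WRONLY, 0644);
            if (fd < 0) { perror ("open"); return 1; }
            close (fd);

            /* map it to a logical volume ("thin" instead of "lv" for a thin LV) */
            if (setxattr (path, "user.glusterfs.bd", "lv", 2, 0) < 0) {
                    perror ("setxattr");
                    return 1;
            }

            /* resize the backing LV to 5G, equivalent to `truncate -s5G` */
            if (truncate (path, 5LL * 1024 * 1024 * 1024) < 0) {
                    perror ("truncate");
                    return 1;
            }
            return 0;
    }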
Re: [Gluster-devel] [RFC ] dictionary optimizations
On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez xhernan...@datalab.eswrote: Al 04/09/13 14:05, En/na Jeff Darcy ha escrit: On 09/04/2013 04:27 AM, Xavier Hernandez wrote: I would also like to note that each node can store multiple elements. Current implementation creates a node for each byte in the key. In my implementation I only create a node if there is a prefix coincidence between 2 or more keys. This reduces the number of nodes and the number of indirections. Whatever we do, we should try to make sure that the changes are profiled against real usage. When I was making my own dict optimizations back in March of last year, I started by looking at how they're actually used. At that time, a significant majority of dictionaries contained just one item. That's why I only implemented a simple mechanism to pre-allocate the first data_pair instead of doing something more ambitious. Even then, the difference in actual performance or CPU usage was barely measurable. Dict usage has certainly changed since then, but I think you'd still be hard pressed to find a case where a single dict contains more than a handful of entries, and approaches that are optimized for dozens to hundreds might well perform worse than simple ones (e.g. because of cache aliasing or branch misprediction). If you're looking for other optimization opportunities that might provide even bigger bang for the buck then I suggest that stack-frame or frame-local allocations are a good place to start. Or string copying in places like loc_copy. Or the entire fd_ctx/inode_ctx subsystem. Let me know and I'll come up with a few more. To put a bit of a positive spin on things, the GlusterFS code offers many opportunities for improvement in terms of CPU and memory efficiency (though it's surprisingly still way better than Ceph in that regard). Yes. The optimizations on dictionary structures are not a big improvement in the overall performance of GlusterFS. I tried it on a real situation and the benefit was only marginal. However I didn't test new features like an atomic lookup and remove if found (because I would have had to review all the code). I think this kind of functionalities could improve a bit more the results I obtained. However this is not the only reason to do these changes. While I've been writing code I've found that it's tedious to do some things just because there isn't such functions in dict_t. Some actions require multiple calls, having to check multiple errors and adding complexity and limiting readability of the code. Many of these situations could be solved using functions similar to what I proposed. On the other side, if dict_t must be truly considered a concurrent structure, there are a lot of race conditions that might appear when doing some operations. It would require a great effort to take care of all these possibilities everywhere. It would be better to pack most of these situations into functions inside the dict_t itself where it is easier to combine some operations. By the way, I've made some tests with multiple bricks and it seems that there is a clear speed loss on directory listings as the number of bricks increases. Since bricks should be independent and they can work in parallel, I didn't expected such a big performance degradation. The likely reason is that, even though bricks are parallel for IO, readdir is essentially a sequential operation and DHT has a limitation that a readdir reply batch does not cross server boundaries. 
So if you have 10 files and 1 server, all 10 entries are returned in one call to the app/libc. If you have 10 files and 10 servers evenly distributed, the app/libc has to perform 10 calls and keeps getting one file at a time. This problem goes away when each server has enough files to fill up a readdir batch. It's only when you have too few files and too many servers that this dilution problem shows up. However, this is just a theory and your problem may be something else too.. Note that Brian Foster's readdir-ahead patch should address this problem to a large extent. When loaded on top of DHT, the prefiller effectively collapses the smaller chunks returned by DHT into a larger chunk requested by the app/libc. Avati However the tests have not been exhaustive nor made in best conditions so they might be misleading. Anyway it seems to me that there might be a problem with some mutexes that force too much serialization of requests (though I have no real proves it's only a feeling). Maybe some more asynchronousity on calls between translators could help. Only some thoughts... Best regards, Xavi __**_ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/**mailman/listinfo/gluster-develhttps://lists.nongnu.org/mailman/listinfo/gluster-devel __**_ Gluster-devel mailing list
Re: [Gluster-devel] [RFC ] dictionary optimizations
On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez xhernan...@datalab.eswrote: Hi, dict_t structures are widely used in glusterfs. I've some ideas that could improve its performance. * On delete operations, return the current value if it exists. This is very useful when we want to get a value and remove it from the dictionary. This way it can be done accessing and locking the dict_t only once (and it is atomic). Makes sense. * On add operations, return the previous value if it existed. This avoids to use a lookup and a conditional add (and it is atomic). Do you mean dict_set()? If so, how do you propose we differentiate between failure and previous value did not exist? Do you propose setting the previous value into a pointer to pointer, and retain the return value as is today? * Always return the data_pair_t structure instead of data_t or the data itself. This can be useful to avoid future lookups or other operations on the same element. Macros can be created to simplify writing code to access the actual value. The use case is not clear. A more concrete example will help.. * Use a trie instead of a hash. A trie structure is a bit more complex than a hash, but only processes the key once and does not need to compute the hash. A test implementation I made with a trie shows a significant improvement in dictionary operations. There is already an implementation of trie in libglusterfs/src/trie.c. Though it does not compact (collapse) single-child nodes upwards into the parent. In any case, let's avoid having two implementations of tries. * Implement dict_foreach() as a macro (similar to kernel's list_for_each()). This gives more control and avoids the need of helper functions. This makes sense too, but there are quite a few users of dict_foreach in the existing style. Moving them all over might be a pain. Additionally, I think it's possible to redefine structures to reduce the number of allocations and pointers used for each element (actual data, data_t, data_pair_t and key). This is highly desirable. There was some effort from Amar in the past ( http://review.gluster.org/3910) but it has been in need of attention for some time. It would be intersting to know if you were thinking along similar lines? Avati What do you think ? Best regards, Xavi __**_ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/**mailman/listinfo/gluster-develhttps://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
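For the dict_foreach()-as-a-macro idea above, a minimal sketch in the style of the kernel's list_for_each(); it assumes the existing members_list/next fields of dict_t/data_pair_t and leaves locking to the caller, both of which are simplifications:

    #define dict_for_each(dict, pair) \
            for ((pair) = (dict)->members_list; (pair); (pair) = (pair)->next)

    /* usage:
     *         data_pair_t *pair = NULL;
     *
     *         LOCK (&dict->lock);
     *         dict_for_each (dict, pair)
     *                 gf_log ("demo", GF_LOG_TRACE, "key=%s", pair->key);
     *         UNLOCK (&dict->lock);
     */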
Re: [Gluster-devel] [RFC ] dictionary optimizations
On Tue, Sep 3, 2013 at 1:42 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 03/09/13 09:33, Anand Avati wrote: On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez xhernan...@datalab.es wrote: Hi, dict_t structures are widely used in glusterfs. I have some ideas that could improve their performance. * On delete operations, return the current value if it exists. This is very useful when we want to get a value and remove it from the dictionary. This way it can be done accessing and locking the dict_t only once (and it is atomic). Makes sense. * On add operations, return the previous value if it existed. This avoids using a lookup and a conditional add (and it is atomic). Do you mean dict_set()? If so, how do you propose we differentiate between failure and 'previous value did not exist'? Do you propose setting the previous value into a pointer to pointer, and retaining the return value as it is today? Yes, I'm thinking of something similar to dict_set() (by the way, I would remove the dict_add() function). dict_add() is used in unserialization routines where dict_set() for a big set of keys guaranteed not to repeat is very expensive (unserializing would otherwise have a quadratic function as its asymptote). What is the reason you intend to remove it? What you propose would be the simplest solution right now. However, I think it would be interesting to change the return value to an error code (this would supply more detailed information in case of failure, and we could use EEXIST to know if the value already existed; in fact, I think it would be interesting to progressively change the -1 return code of many functions to an error code). The pointer-to-pointer argument could be NULL if the previous value is not needed. Of course this would change the function signature, breaking a lot of existing code. Another possibility could be to create a dict_replace() function, and possibly make it fail if the value didn't exist. It is best we do not change the meaning of existing APIs, and just add new APIs instead. The new API can be: int dict_replace (dict_t *dict, const char *key, data_t *newval, data_t **oldval); .. and leave dict_set() as is. * Always return the data_pair_t structure instead of data_t or the data itself. This can be useful to avoid future lookups or other operations on the same element. Macros can be created to simplify writing code to access the actual value. The use case is not clear. A more concrete example would help. Having a data_pair_t could help to navigate from an existing element (getting the next or previous element; this is really interesting if dict were implemented using a sorted structure like a trie, since it would allow processing a set of similar entries very fast, like the trusted.afr.brick values for example), or removing or replacing it without needing another lookup (a more detailed analysis would be needed to see how to handle race conditions). By the way, is the dict_t structure really used concurrently? I haven't analyzed all the code deeply, but it seems to me that every dict_t is only accessed from a single place at once. There have been instances of dict_t getting used concurrently, when used as xdata and in xattrop (by AFR). There have been bugs in the past with concurrent dict access. * Use a trie instead of a hash. A trie structure is a bit more complex than a hash, but only processes the key once and does not need to compute the hash. A test implementation I made with a trie shows a significant improvement in dictionary operations.
There is already an implementation of trie in libglusterfs/src/trie.c. Though it does not compact (collapse) single-child nodes upwards into the parent. In any case, let's avoid having two implementations of tries. I know. The current implementation wastes a lot of memory because it uses an array of 256 pointers, and in some places it needs to traverse the array. Not a big deal, but if it is done many times it could be noticeable. In my test I used a trie with 4 child pointers (with collapsing of single-child nodes) that runs a bit faster than the 256 implementation and uses much less memory. I tried with 2, 4, 16 and 256 children per node, and 4 seems to be the best (at least for dictionary structures), though there is very little difference between 4 and 16 in terms of speed. The 256 child pointers give you constant-time lookup for the next-level child with just an offset indirection. With smaller fan-out, do you search through the list? Can you show an example of this? Collapsing single-child nodes upwards is badly needed though. I agree that it is not good to maintain two implementations of the same thing. Maybe we could change the trie implementation. It should be transparent. Yes, I believe the current API can accommodate such internal changes. * Implement dict_foreach() as a macro (similar to kernel's list_for_each
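Regarding the dict_foreach()-as-a-macro idea quoted above, here is a toy sketch of what a list_for_each()-style macro could look like; the field names (members_list, next, key) are illustrative of a linked-list-backed dict, not a claim about the exact current libglusterfs layout:

/* Toy, self-contained sketch of a list_for_each()-style iteration
 * macro: no helper callback is needed, unlike today's dict_foreach(). */
#include <stdio.h>

typedef struct _toy_pair {
        struct _toy_pair *next;
        const char       *key;
        const char       *value;
} toy_pair_t;

typedef struct {
        toy_pair_t *members_list;
} toy_dict_t;

#define toy_dict_for_each(dict, pair) \
        for ((pair) = (dict)->members_list; (pair); (pair) = (pair)->next)

int
main (void)
{
        toy_pair_t p2 = { NULL, "trusted.afr.brick-1", "0" };
        toy_pair_t p1 = { &p2,  "trusted.afr.brick-0", "1" };
        toy_dict_t d  = { &p1 };
        toy_pair_t *pair = NULL;

        toy_dict_for_each (&d, pair)
                printf ("%s = %s\n", pair->key, pair->value);
        return 0;
}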
Re: [Gluster-devel] Change in glusterfs[master]: libglusterfs: use safer symbol resolution strategy with our ...
[cc'ing gluster-devel on the final consensus] On 08/30/2013 01:19 PM, Brian Foster wrote: On 08/30/2013 04:01 PM, Anand Avati wrote: On 8/30/13 7:50 AM, Brian Foster wrote: On 08/30/2013 09:51 AM, Brian Foster wrote: On 08/29/2013 09:08 PM, Emmanuel Dreyfus wrote: Anand Avati (Code Review) rev...@dev.gluster.org wrote: ... TBH, I'm not totally sure how much this impacts things that aren't cross DSO conflicts, so I ran a quick test. If I define a function in an executable and a dso and call the function from both points, the library always invokes the local version if the library is loaded via dlopen(). In fact, if I remove the local version from the library, I hit an undefined symbol error when I attempt to invoke the library call. Note that the library does invoke the executable version of the function if a compile time link dependency is made (i.e., even if the library is still referenced via dlopen()/dlsym(), but not loaded by that method). I suspect there is something going on here at compile/link time that determines whether the executable exports the symbol, but that's an initial guess based on observed behavior. FYI, an elf dump of both executables shows that they differ in whether my duplicate symbol is listed in the executable .dynsym (dynamic symbol table) section or not. A dump of my locally installed qemu-kvm executable shows a couple exported block driver symbols: bdrv_aio_readv and bdrv_aio_writev. Brian Given that, perhaps the best thing to do is hold off on the RTLD_DEEPBIND bits until we understand the behavior a bit more conclusively here and evaluate whether it's really an issue in the weird qemu case. Thoughts? (FWIW, I think the change should at least hold with regard to the RTLD_LOCAL bits. A translator is already intended to be a black box with a few specially named interfaces. I don't see any reason to pollute the global namespace with all kinds of extra symbols from different translators). NetBSD has different OS-specific flags, but I am not sure wether they overlap or not: Appears to define a similar behavior, but this apparently applies to an explicit search of a symbol in a library as opposed to how the external dependencies of the library are resolved. Brian P.S., All tests run on Linux. The following special handle values may be used with dlsym(): (...) RTLD_SELF The search for symbol is limited to the shared object issuing the call to dlsym() and those shared objects which were loaded after it that are visible. Using RTLD_LOCAL solves one half of the confusion, and I think there is no disagreement that RTLD_GLOBAL must be changed to RTLD_LOCAL. This solves the problem of qemu-block translator's symbols not getting picked up by qemu when libgfapi for the hypervisor's calls. Agreed. The other half of the problem - functions in qemu getting called instead of same named functions in qemu-block translator is the open concern. Brian, you report above that a DSO always uses the version which is available in the same DSO. If that is the case (which sounds good), I don't understand why RTLD_DEEPBIND is required. Not always... I suspect there are two conditions here. 1.) the executable includes the symbol in its dynamic symbol table. 2.) the dso is compiled/linked to look for a symbol through the dynamic symbol tables (I suspect via PLT entries). The experiment I outlined before effectively toggles #1. By linking the library directly or not, I'm controlling whether the executable exports the dependent symbol. 
It appears #2 can be controlled by something like the -Bsymbolic linker option. For example, objdump a library with and without that option and you can see the call sites to a particular function either refer to library (relative?) offsets or PLT entries. Using this option, the library always uses the local version regardless of whether the symbol is exported from the executable. I am concerned about bugs like 994314, where a symbol defined in qemu (uuid_is_null()) got picked up instead of the one in libglusterfs.so, for a call made from libglusterfs.so. It may be the case that the problem occurred here because libglusterfs.so was not dlopen()ed, but dynamically linked as -lglusterfs at build time. It might not have mattered in that particular case. If the symbol was exported by the executable (i.e., condition #1), it probably/possibly could get bound to the unexpected symbol. What was the ultimate fix for that bug? While it appears that right now qemu does not export many bdrv_* symbols, it still seems like it could be a problem if we used those symbols or the nature of the executable changed in the future (i.e., #1 is not under our control). For that reason, I'd suggest we use something like -Bsymbolic for qemu-block to address the second half of the problem (assuming it doesn't break anything else :P). I mentioned this on #gluster-dev btw, and Kaleb pointed out
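For illustration, a minimal sketch of the loader flags under discussion. This is not the actual xlator-loading code: the translator path below is made up, and the flags shown are just the generic dlopen() ones (RTLD_LOCAL, plus the glibc-specific RTLD_DEEPBIND whose use is still being debated here):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int
main (void)
{
        void *handle = NULL;

        /* hypothetical translator path, for illustration only */
        const char *xl = "/usr/lib/glusterfs/xlator/example.so";

        /* RTLD_LOCAL keeps the translator's symbols out of the global
         * namespace; RTLD_DEEPBIND (where available) makes the DSO
         * prefer its own definitions over same-named symbols that the
         * executable (e.g. qemu) already exports. */
        handle = dlopen (xl, RTLD_NOW | RTLD_LOCAL
#ifdef RTLD_DEEPBIND
                              | RTLD_DEEPBIND
#endif
                        );
        if (!handle) {
                fprintf (stderr, "dlopen failed: %s\n", dlerror ());
                return 1;
        }
        dlclose (handle);
        return 0;
}

The -Bsymbolic route mentioned above works at link time rather than load time, so the two approaches address different halves of the same binding problem.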
Re: [Gluster-devel] [Gluster-users] GlusterFS 3.4.1 planning
For those interested in what the possible patches are, here is a short list of commits which are available in master but not yet backported to release-3.4 (note: the full list runs to over 500 commits; this is a short list of patches which fix some kind of an issue - crash, leak, incorrect behavior, failure) http://www.gluster.org/community/documentation/index.php/Release_341_backport_candidates Some of the patches towards the end fix some nasty issues. Many of them are covered in the bugs listed in http://www.gluster.org/community/documentation/index.php/Backport_Wishlist. If there are specific patches you would like to see backported, please copy/paste those lines from the Release_341_backport_candidates page into the Backport_Wishlist page. For the others, we will be using a best judgement call based on severity and patch impact. Avati On Fri, Aug 9, 2013 at 2:23 AM, Vijay Bellur vbel...@redhat.com wrote: Hi All, We are considering 3.4.1 to be released in the last week of this month. If you are interested in seeing specific bugs addressed or patches included in 3.4.1, can you please update them here: http://www.gluster.org/community/documentation/index.php/Backport_Wishlist Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] disabling caching and other optimizations for internal fops
Setting a key in xdata has the benefit of getting propagated to the server side without a change in the protocol. However, that being said, dict_t in its current form is not the most efficient data structure for storing a lot of key/values (the biggest concern being too many small allocations). It will be good to revive http://review.gluster.org/3910 so that such use of xdata will be of lesser concern. Avati On Tue, Aug 27, 2013 at 12:12 AM, Raghavendra Bhat rab...@redhat.com wrote: Hi, As of now, the performance xlators cache data and perform some optimizations for all the fops, irrespective of whether the fop is generated by the application or by an internal xlator. I think performance xlators should come into the picture only for the fops generated by applications. Imagine the situation where a graph change happens and the fuse xlator sends open calls on the fds to migrate them to the new graph. But the open call might not reach posix if open-behind unwinds success to the fuse xlator. It can be done in 2 ways. 1) Set a key in the dictionary if the call is generated internally, OR 2) Set a flag in the callstack itself, indicating whether the fop is an internal fop or generated by the application. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
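As a rough sketch of option (1) above, assuming the usual glusterfs xlator headers and APIs (dict_new, dict_set_int32, dict_get, STACK_WIND); the key name "glusterfs.internal-fop" is made up for illustration, not an agreed-upon constant, and this is a fragment rather than a complete xlator:

/* caller side (e.g. the graph-switch open that migrates fds) */
dict_t *xdata = dict_new ();
if (xdata)
        dict_set_int32 (xdata, "glusterfs.internal-fop", 1);
/* ... wind the open down with this xdata ... */

/* performance xlator side (e.g. open-behind's open fop) */
if (xdata && dict_get (xdata, "glusterfs.internal-fop")) {
        /* internally generated fop: skip the caching/short-circuit
         * path and wind it straight down so it reaches posix */
        STACK_WIND (frame, default_open_cbk, FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->open, loc, flags, fd, xdata);
        return 0;
}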
Re: [Gluster-devel] The return of the all-null pending matrix
On Mon, Aug 26, 2013 at 6:12 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Anand Avati anand.av...@gmail.com wrote: This is a tricky problem. I have thought about this quite a bit and couldn't come up with a theory which can lead to this behavior. I am suspecting this might be a 32/64bit compatibility issue. Can you try placing all bricks on the same architecture and see if this can be reproduced? How long does it take to reproduce this problem? Now we know this is probably the root of the bug, do you want to track it further, or shall we call that kind of setup unsupported? It is certainly an interesting issue, and it will be good to fix in the long run (supporting heterogeneous bricks can open up very interesting use cases). Can you please file a bug in http://bugzilla.redhat.com with all the logs as a low-priority bug so that we can track it and fix it sometime, rather than ignore it intentionally? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Fwd: FileSize changing in GlusterNodes
On Sun, Aug 25, 2013 at 11:23 PM, Vijay Bellur vbel...@redhat.com wrote: File size as reported on the mount point and the bricks can vary because of this code snippet in iatt_from_stat(): { uint64_t maxblocks; maxblocks = (iatt->ia_size + 511) / 512; if (iatt->ia_blocks > maxblocks) iatt->ia_blocks = maxblocks; } This snippet was brought in to improve accounting behaviour for quota, which would fail with disk file systems that perform speculative pre-allocation. If this aids only specific use cases, I think we should make the behaviour configurable. Thoughts? -Vijay This is very unlikely to be the problem. st_blocks field values do not influence md5sum behavior in any way. The file size (st_size) would, but both du -k and the above code snippet only deal with st_blocks. Bobby, it would help if you can identify the mismatching file and inspect it to see what the difference between the two files is. Avati Original Message Subject: [Gluster-users] FileSize changing in GlusterNodes Date: Wed, 21 Aug 2013 05:35:40 + From: Bobby Jacob bobby.ja...@alshaya.com To: gluster-us...@gluster.org Hi, When I upload files into the gluster volume, it replicates all the files to both gluster nodes. But the file size varies slightly (by 4-10KB), which changes the md5sum of the file. Command to check file size: du –k *. I'm using glusterFS 3.3.1 with CentOS 6.4. This is creating inconsistency between the files on both the bricks. What is the reason for this changed file size and how can it be avoided? Thanks Regards, *Bobby Jacob* ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Fwd: FileSize changing in GlusterNodes
On Mon, Aug 26, 2013 at 9:40 AM, Vijay Bellur vbel...@redhat.com wrote: On 08/26/2013 10:04 PM, Anand Avati wrote: On Sun, Aug 25, 2013 at 11:23 PM, Vijay Bellur vbel...@redhat.com wrote: File size as reported on the mount point and the bricks can vary because of this code snippet in iatt_from_stat(): { uint64_t maxblocks; maxblocks = (iatt->ia_size + 511) / 512; if (iatt->ia_blocks > maxblocks) iatt->ia_blocks = maxblocks; } This snippet was brought in to improve accounting behaviour for quota, which would fail with disk file systems that perform speculative pre-allocation. If this aids only specific use cases, I think we should make the behaviour configurable. Thoughts? -Vijay This is very unlikely to be the problem. st_blocks field values do not influence md5sum behavior in any way. The file size (st_size) would, but both du -k and the above code snippet only deal with st_blocks. I was referring to du -k as seen on the bricks and the mount point. I was certainly not referring to the md5sum difference. -Vijay I thought he was comparing du -k between the two bricks (the sentence felt that way). In any case, the above code snippet should do something meaningful only when the file is still held open. XFS should discard the extra allocations after close() anyway. Bobby, it would help if you can identify the mismatching file and inspect it to see what the difference between the two files is. Avati Original Message Subject: [Gluster-users] FileSize changing in GlusterNodes Date: Wed, 21 Aug 2013 05:35:40 + From: Bobby Jacob bobby.ja...@alshaya.com To: gluster-us...@gluster.org Hi, When I upload files into the gluster volume, it replicates all the files to both gluster nodes. But the file size varies slightly (by 4-10KB), which changes the md5sum of the file. Command to check file size: du –k *. I'm using glusterFS 3.3.1 with CentOS 6.4. This is creating inconsistency between the files on both the bricks. What is the reason for this changed file size and how can it be avoided? Thanks Regards, *Bobby Jacob* ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
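For reference, the quoted snippet restated as a standalone function with a worked example; the 1 MiB / 4 MiB numbers are made up to show how the clamp masks speculative pre-allocation in the reported block count while leaving st_size (and hence md5sum) untouched:

#include <stdint.h>
#include <stdio.h>

/* Same logic as the iatt_from_stat() snippet quoted above: never
 * report more 512-byte blocks than the file size itself implies. */
static uint64_t
clamp_blocks (uint64_t ia_size, uint64_t ia_blocks)
{
        uint64_t maxblocks = (ia_size + 511) / 512;

        if (ia_blocks > maxblocks)
                ia_blocks = maxblocks;
        return ia_blocks;
}

int
main (void)
{
        /* 1 MiB file, but the brick file system has speculatively
         * pre-allocated 4 MiB while the file is still held open */
        uint64_t size   = 1048576;
        uint64_t blocks = 8192;     /* 4 MiB in 512-byte blocks */

        printf ("reported blocks: %llu (was %llu)\n",
                (unsigned long long) clamp_blocks (size, blocks),
                (unsigned long long) blocks);
        /* prints 2048, i.e. ceil(size / 512) */
        return 0;
}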
Re: [Gluster-devel] The return of the all-null pending matrix
This is a tricky problem. I have thought about this quite a bit and couldn't come up with a theory which can lead to this behavior. I am suspecting this might be a 32/64bit compatibility issue. Can you try placing all bricks on the same architecture and see if this can be reproduced? How long does it take to reproduce this problem? Avati On Mon, Aug 19, 2013 at 9:30 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Hi Did you have a look at it? Any idea on what is going on? On Wed, Aug 14, 2013 at 07:21:13AM +0200, Emmanuel Dreyfus wrote: Anand Avati anand.av...@gmail.com wrote: I was going through your log files again. Correct me if I'm wrong, the issue in the log is with the file tparm.po, right? Yes, this one raises a split brain. We have other all-zero pending matrices in the log on other files in the minutes leading to the problem, though. But at least they do not raise an error. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Issues with fallocate, discard and zerofill
Shishir, Is this in reference to the dht open file rebalance (of replaying the operations to the destination server)? I am assuming so, as that is something which has to be handled. The other question is how fallocate/discard should be handled by self-heal in AFR. I'm not sure how important it is, but it will certainly be good to bounce some ideas off here. Maybe we should implement a fiemap fop to query extents/holes and replay them on the other server? Avati On Tue, Aug 13, 2013 at 10:49 PM, Bharata B Rao bharata@gmail.com wrote: Hi Avati, Brian, During the recently held gluster meetup, Shishir mentioned a potential problem (related to fd migration etc.) in the zerofill implementation (http://review.gluster.org/#/c/5327/) and also mentioned that same/similar issues are present with the fallocate and discard implementations. Since zerofill has been modelled on fallocate/discard, I was wondering if it would be possible to address these issues in fallocate/discard first so that we could potentially follow the same in the zerofill implementation. Regards, Bharata. -- http://raobharata.wordpress.com/ ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
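One possible shape of the "query extents/holes and replay them" idea, sketched with SEEK_DATA/SEEK_HOLE (Linux >= 3.1) rather than an actual FIEMAP ioctl or a new fop; only the enumeration half is shown, and how a healer would replay each range (fallocate/discard/zerofill/plain writes) on the sink is left out:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Walk a file and print its data extents; the implicit hole at EOF
 * terminates the last extent. */
static void
dump_extents (int fd)
{
        off_t end  = lseek (fd, 0, SEEK_END);
        off_t data = 0, hole = 0;

        for (data = lseek (fd, 0, SEEK_DATA); data >= 0 && data < end;
             data = lseek (fd, hole, SEEK_DATA)) {
                hole = lseek (fd, data, SEEK_HOLE);
                if (hole < 0)
                        break;
                printf ("data extent: [%lld, %lld)\n",
                        (long long) data, (long long) hole);
                /* a healer would replay this range on the sink here */
        }
}

int
main (int argc, char *argv[])
{
        int fd;

        if (argc < 2)
                return 1;
        fd = open (argv[1], O_RDONLY);
        if (fd < 0)
                return 1;
        dump_extents (fd);
        close (fd);
        return 0;
}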
Re: [Gluster-devel] Proposal for Gluster 3.5: New test framework
Justin, Thanks for firing up this thread. Are there notable projects which use these frameworks? Do you have any info on what other distributed storage projects use for their automated testing? Thanks, Avati On Mon, Aug 12, 2013 at 10:07 AM, Justin Clift jcl...@redhat.com wrote: Hi all, For Gluster 3.5, I'd like to propose we get some kind of *multi-node* testing framework in place for Gluster. The existing test framework is single node only, which doesn't fit well for a distributed file system. I've recently looked into Autotest in depth, but ruled it out since it's: * Linux only (ugh) * Very hard to figure out for newbies * Close to zero documentation * Opaque/unreadable source * Painful to work with :( Potentially we could use STAF (staf.sourceforge.net). I've not investigated this in depth yet, but from its website it's: * Cross Platform http://staf.sourceforge.net/current/STAFFAQ.htm#d0e36 * Seems like extensive documentation, and usable with several languages: http://staf.sourceforge.net/current/STAFPython.htm * Seems like a _reasonably_ active Community http://sourceforge.net/p/staf/mailman/staf-users/ I'll put this info into a Feature Page if people think it's worth writing up and taking further? Regards and best wishes, Justin Clift -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 12:25 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 07/29/2013 12:18 AM, Anand Avati wrote: On Sun, Jul 28, 2013 at 11:18 AM, Vijay Bellur vbel...@redhat.com wrote: Hi All, There was a recent thread on fedora-devel about bloated glusterfs dependency for qemu: https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html As of today, we have the following packages and respective primary constituents: 1. glusterfs - contains all the common xlators, libglusterfs, glusterfsd binary, glusterfs symlink to glusterfsd. 2. glusterfs-rdma - rdma shared library 3. glusterfs-geo-replication - geo-rep related objects 4. glusterfs-fuse - fuse xlator 5. glusterfs-server - server side xlators, config files 6. glusterfs-api - libgfapi shared library 7. glusterfs-resource-agents - OCF resource agents 8. glusterfs-devel - Header files for libglusterfs 9. glusterfs-api-devel - Header files for gfapi As far as qemu is concerned, qemu depends on glusterfs-api which in turn is dependent on glusterfs. Much of the apparent bloat is coming from the glusterfs package, and one proposal for reducing the dependency footprint of consumers of libgfapi could be the following: a) Move glusterfsd and the glusterfs symlink from 'glusterfs' to 'glusterfs-server' b) Package the glusterfsd binary and glusterfs symlink in 'glusterfs-fuse' Does that mean glusterfsd is in glusterfs-server or glusterfs-fuse? It is probably sufficient to leave glusterfs-fuse with just fuse.so and mount.glusterfs.in Another model can be: 0. glusterfs-libs.rpm - libglusterfs.so libgfrpc.so libgfxdr.so 1. glusterfs (depends on glusterfs-libs) - glusterfsd binary, glusterfs symlink, all common xlators 2. glusterfs-rdma (depends on glusterfs) - rdma shared library 3. glusterfs-geo-replication (depends on glusterfs) - geo-rep related objects 4. glusterfs-fuse (depends on glusterfs) - fuse xlator, mount.glusterfs 5. glusterfs-server (depends on glusterfs) - server side xlators, config files 6. glusterfs-api (depends on glusterfs-libs) - libgfapi.so and api.so 7. glusterfs-resource-agents (depends on glusterfs) 8. glusterfs-devel (depends on glusterfs-libs) - header files for libglusterfs 9. glusterfs-api-devel (depends on glusterfs-api) - header files for gfapi This way qemu will only pick up libgfapi.so, libglusterfs.so, libgfrpc.so and libgfxdr.so (the bare minimum to just execute) for the binary to load at run time. Those who want to store VM images natively on gluster must also do a 'yum install glusterfs' to make gfapi 'useful'. This way Fedora qemu users who do not plan to use gluster will not get any of the xlator cruft. Looks like even after the re-packaging, the original problem is still there! Post re-structuring (I am on F19 with the updates-testing repo enabled) glusterfs-api has a dep on -libs and glusterfs. So when the user installs glusterfs-api, it pulls in -libs and glusterfs. This is correct, since w/o the glusterfs rpm we won't have a working qemu gluster backend. Actually this *wasn't* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. Just allowing qemu to execute by way of installing -libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend, things will fail unless the user has installed the glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded.
That a user would need to install glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? Avati So today ... yum install glusterfs-api brings in glusterfs-libs and glusterfs which sounds correct to get a working system with qemu gluster backend. Later... yum remove glusterfs removes glusterfs-api which has a reverse dep on qemu, hence libvirt hence the entire virt stack goes down which was the original problem reported in the fedora devel list @ https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html and that unfortunately is still there, even after -libs was created as a separate rpm as part of this effort! thanx, deepak ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal for Gluster 3.5: Better peer identification
On Tue, Aug 13, 2013 at 4:05 AM, Kaushal M kshlms...@gmail.com wrote: Hi all, We recently had a mailing list discussion about the current problems with peer identification and handling multiple networks. This proposal is regarding better identification of peers. Currently, the way we identify peers is not consistent all through the gluster code. We use uuids internally and hostnames externally. This setup works pretty well when all the peers are on a single network, have one address, and are referred to in all the gluster commands with the same address. But once we start mixing up addresses in the commands (ip, shortnames, fqdn) and bring in multiple networks, we have problems. The problems were discussed in the following mailing list threads and some solutions were proposed. - How do we identify peers? [1] - RFC - Connection Groups concept [2] The solution to the multi-network problem is dependent on the solution to the peer identification problem. So it'll be good to target fixing the peer identification problem asap, i.e. in 3.5, and take up the networks problem later. Thoughts? Thanks for the proposal Kaushal. This is a welcome change. It will be great to have all internal identification of peers happen through UUIDs and get translated into a host/IP only at the most superficial layer. There are open issues around node crash + re-install with the same IP (but a new UUID) which need to be addressed in this effort. Avati - Kaushal -- [1] http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00067.html [2] http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00069.html ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
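Not glusterd code, just a sketch of the data shape this proposal implies (field names and sizes are made up): the UUID is the one durable identity, and any number of addresses merely map back to it at the outermost layer.

#include <uuid/uuid.h>

#define MAX_PEER_ADDRS 8

struct peer_addr {
        char hostname[256];          /* IP, shortname or FQDN as given in CLI */
};

struct peer {
        uuid_t           uuid;       /* the only internal identifier          */
        struct peer_addr addrs[MAX_PEER_ADDRS];
        int              naddrs;
        /* the open issue mentioned above: a re-installed node shows up
         * with the same address but a new uuid, so some explicit
         * replace/readmit step keyed on the address would be needed. */
};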
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 1:40 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 08/14/2013 01:37 PM, Anand Avati wrote: On Wed, Aug 14, 2013 at 12:25 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 07/29/2013 12:18 AM, Anand Avati wrote: On Sun, Jul 28, 2013 at 11:18 AM, Vijay Bellur vbel...@redhat.comwrote: Hi All, There was a recent thread on fedora-devel about bloated glusterfs dependency for qemu: https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html As of today, we have the following packages and respective primary constituents: 1. glusterfs - contains all the common xlators, libglusterfs, glusterfsd binary glusterfs symlink to glusterfsd. 2. glusterfs-rdma- rdma shared library 3. glusterfs-geo-replication - geo-rep related objects 4. glusterfs-fuse- fuse xlator 5. glusterfs-server - server side xlators, config files 6. glusterfs-api - libgfapi shared library 7. glusterfs-resource-agents - OCF resource agents 8. glusterfs-devel - Header files for libglusterfs 9. glusterfs-api-devel - Header files for gfapi As far as qemu is concerned, qemu depends on glusterfs-api which in turn is dependent on glusterfs. Much of the apparent bloat is coming from glusterfs package and one proposal for reducing the dependency footprint of consumers of libgfapi could be the following: a) Move glusterfsd and glusterfs symlink from 'glusterfs' to 'glusterfs-server' b) Package glusterfsd binary and glusterfs symlink in 'glusterfs-fuse' Does that mean glusterfsd is in glusterfs-server or glusterfs-fuse? It is probably sufficient to leave glusterfs-fuse just have fuse.so and mount.glusterfs.in Another model can be: 0. glusterfs-libs.rpm - libglusterfs.so libgfrpc.so libgfxdr.so 1. glusterfs (depends on glusterfs-libs) - glusterfsd binary, glusterfs symlink, all common xlators 2. glusterfs-rdma (depends on glusterfs) - rdma shared library 3. glusterfs-geo-replication (depends on glusterfs) - geo-rep related objects 4. glusterfs-fuse (depends on glusterfs) - fuse xlator, mount.glusterfs 5. glusterfs-server (depends on glusterfs) - server side xlators, config files 6. glusterfs-api (depends on glusterfs-libs) - libgfapi.so and api.so 7. glusterfs-resource-agents (depends on glusterfs) 8. glusterfs-devel (depends on glusterfs-libs) - header files for libglusterfs 9. glusterfs-api-devel (depends on glusterfs-api) - header files for gfapi This way qemu will only pick up libgfapi.so libglusterfs.so libgfrpc.so and libgfxdr.so (the bare minimum to just execute) for the binary to load at run time. Those who want to store vm images natively on gluster must also do a 'yum install glusterfs' to make gfapi 'useful'. This way Fedora qemu users who do not plan to use gluster will not get any of the xlator cruft. Looks like even after the re-packaging.. the original problem is still there ! Post re-strucuring ( i am on F19 with updates-testing repo enabled) gluserfs-api has dep on -libs and glusterfs So when User install glusterfs-api, it pulls in -libs and glusterfs This is correct, since w/o glusterfs rpm we won't have a working qemu gluster backend. Actually this *wasnt* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. Just allowing qemu to execute by way of installing-libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend.. 
things will fail unless the user has installed the glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded. That a user would need to install the glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? Putting on a user's hat... I think it's a problem. IIUC, what you are saying is that the user must be aware that he/she needs to install glusterfs in order to use the qemu gluster backend. The user may argue: why didn't you install glusterfs as part of the qemu yum install itself? Expecting the user (who may or may not be gluster/virt aware) to install an additional rpm to use qemu with gluster might not always work. Who will inform the user to install glusterfs when things fail at runtime? Your view is in direct contradiction with the view of those who objected to the dependency to start with :-) I think this question needs to be reconciled with the initial reporters. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 1:54 AM, Harshavardhana har...@harshavardhana.netwrote: Actually this *wasnt* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. rpm uses 'ldd' command to get dependencies for 'glusterfs-api' to 'glusterfs-libs' - automatically. You don't need a forced specification. Specifying runtime time dependency is done this way %package api Summary: Clustered file-system api library Group:System Environment/Daemons Requires: %{name} = %{version}-%{release} --- Install-time dependency. Just allowing qemu to execute by way of installing-libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend.. things will fail unless User has installed glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded. That a user would need to install glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? The problem here is user awareness - it generates additional cycles of communication. In this case 'qemu' should have a direct dependency on 'glusterfs.rpm' and 'glusterfs-api' when provided with gfapi support - wouldn't this solve the problem? This would solve your version of the problem. But the original concern raised was that the whole shebang of glusterfs translators and transports get installed for someone who wants libvirt/qemu and doesn't care what glusterfs even is. Your version of the problem is in direct contradiction with the initially reported problem for which the restructuring was proposed. ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 2:16 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 08/14/2013 02:23 PM, Anand Avati wrote: On Wed, Aug 14, 2013 at 1:40 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 08/14/2013 01:37 PM, Anand Avati wrote: On Wed, Aug 14, 2013 at 12:25 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 07/29/2013 12:18 AM, Anand Avati wrote: On Sun, Jul 28, 2013 at 11:18 AM, Vijay Bellur vbel...@redhat.comwrote: Hi All, There was a recent thread on fedora-devel about bloated glusterfs dependency for qemu: https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html As of today, we have the following packages and respective primary constituents: 1. glusterfs - contains all the common xlators, libglusterfs, glusterfsd binary glusterfs symlink to glusterfsd. 2. glusterfs-rdma- rdma shared library 3. glusterfs-geo-replication - geo-rep related objects 4. glusterfs-fuse- fuse xlator 5. glusterfs-server - server side xlators, config files 6. glusterfs-api - libgfapi shared library 7. glusterfs-resource-agents - OCF resource agents 8. glusterfs-devel - Header files for libglusterfs 9. glusterfs-api-devel - Header files for gfapi As far as qemu is concerned, qemu depends on glusterfs-api which in turn is dependent on glusterfs. Much of the apparent bloat is coming from glusterfs package and one proposal for reducing the dependency footprint of consumers of libgfapi could be the following: a) Move glusterfsd and glusterfs symlink from 'glusterfs' to 'glusterfs-server' b) Package glusterfsd binary and glusterfs symlink in 'glusterfs-fuse' Does that mean glusterfsd is in glusterfs-server or glusterfs-fuse? It is probably sufficient to leave glusterfs-fuse just have fuse.so and mount.glusterfs.in Another model can be: 0. glusterfs-libs.rpm - libglusterfs.so libgfrpc.so libgfxdr.so 1. glusterfs (depends on glusterfs-libs) - glusterfsd binary, glusterfs symlink, all common xlators 2. glusterfs-rdma (depends on glusterfs) - rdma shared library 3. glusterfs-geo-replication (depends on glusterfs) - geo-rep related objects 4. glusterfs-fuse (depends on glusterfs) - fuse xlator, mount.glusterfs 5. glusterfs-server (depends on glusterfs) - server side xlators, config files 6. glusterfs-api (depends on glusterfs-libs) - libgfapi.so and api.so 7. glusterfs-resource-agents (depends on glusterfs) 8. glusterfs-devel (depends on glusterfs-libs) - header files for libglusterfs 9. glusterfs-api-devel (depends on glusterfs-api) - header files for gfapi This way qemu will only pick up libgfapi.so libglusterfs.so libgfrpc.so and libgfxdr.so (the bare minimum to just execute) for the binary to load at run time. Those who want to store vm images natively on gluster must also do a 'yum install glusterfs' to make gfapi 'useful'. This way Fedora qemu users who do not plan to use gluster will not get any of the xlator cruft. Looks like even after the re-packaging.. the original problem is still there ! Post re-strucuring ( i am on F19 with updates-testing repo enabled) gluserfs-api has dep on -libs and glusterfs So when User install glusterfs-api, it pulls in -libs and glusterfs This is correct, since w/o glusterfs rpm we won't have a working qemu gluster backend. Actually this *wasnt* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. 
Just allowing qemu to execute by way of installing -libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend, things will fail unless the user has installed the glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded. That a user would need to install the glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? Putting on a user's hat... I think it's a problem. IIUC, what you are saying is that the user must be aware that he/she needs to install glusterfs in order to use the qemu gluster backend. The user may argue: why didn't you install glusterfs as part of the qemu yum install itself? Expecting the user (who may or may not be gluster/virt aware) to install an additional rpm to use qemu with gluster might not always work. Who will inform the user to install glusterfs when things fail at runtime? Your view is in direct contradiction with the view of those who objected to the dependency to start with :-) I think this question needs to be reconciled with the initial reporters. One more point to note here is that... even if we go with the way you suggested, it solves the original problem but brings in another, as I stated
Re: [Gluster-devel] Proposal for Gluster 3.5: New test framework
On Wed, Aug 14, 2013 at 5:36 AM, Justin Clift jcl...@redhat.com wrote: On 14/08/2013, at 7:43 AM, Anand Avati wrote: Justin, Thanks for firing up this thread. Are there notable projects which use these frameworks? Autotest is used by the Linux kernel (its main claim to fame), and is also used by KVM. STAF seems to have originally been an IBM internal project that was open sourced. Seems to have been around for years. Haven't yet looked at further alternatives, as I was mostly expecting Autotest to be ok. Wrongly it turns out. :( It would be wise to do some proper investigation/shortlisting of potential frameworks before immediately jumping into an investigation of STAF. Do you have any info on what other distributed storage projects use for their automated testing? The Ceph project used Autotest some time ago as well, but it didn't meet their needs so they created their own: Teuthology https://github.com/ceph/teuthology Their historical Autotest stuff https://github.com/ceph/autotest https://github.com/ceph/ceph-autotests I looked over Teuthology quickly, and it seems decent but it's very Ceph oriented/optimised. Not a general purpose thing we could pick up and use without extensive modification. :( Regards and best wishes, Justin Clift An important factor is going to be support for integration with Gerrit for pre-commit tests. Or they should at least be configurable behind Jenkins. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] QEMU (and other libgfapi client?) crashes on add-brick / replace-brick
On Wed, Jul 31, 2013 at 9:48 AM, Guido De Rosa guido.der...@vemarsas.itwrote: Well, there's another problem, possibly related, that I didn't notice: I'm unable to mount! (Although libgfapi is meant to bypass fuse-mount, I've read that you need FUSE working in the same machine where you issue a replace-brick command [1]) $ git clone ssh://guidoder...@git.gluster.org/glusterfs (master branch @ a960f92) $ ./autogen.sh $ ./configure --enable-debug make sudo make install # /etc/init.d/glusterd start # gluster volume create gv transport tcp 192.168.232.179: /var/export/gluster/gv volume create: gv: success: please start the volume to access data # gluster volume start gv volume start: gv: success And here the problem: # mount -t glusterfs localhost:/gv /mnt/gv Mount failed. Please check the log file for more details. The log to check is /usr/local/var/log/glusterfs/mnt-gv.log Please check what the error was in that log. Avati The same issue holds if I apply the patches you suggested, then clean sources, rebuild reinstall. (If there's some relation..,) $ git pull ssh://guidoder...@git.gluster.org/glusterfsrefs/changes/07/5407/2 etc. Here are the relevant logs: /usr/local/var/log/glusterfs/bricks/var-export-gluster-gv.log: Final graph: +--+ 1: volume gv-posix 2: type storage/posix 3: option glusterd-uuid 42ff1e51-7c77-4c70-9e1b-3e6207935bee 4: option directory /var/export/gluster/gv 5: option volume-id a562cb7c-0edf-4efa-afc6-80ea4e3fe978 6: end-volume 7: 8: volume gv-changelog 9: type features/changelog 10: option changelog-brick /var/export/gluster/gv 11: option changelog-dir /var/export/gluster/gv/.glusterfs/changelogs 12: subvolumes gv-posix 13: end-volume 14: 15: volume gv-access-control 16: type features/access-control 17: subvolumes gv-changelog 18: end-volume 19: 20: volume gv-locks 21: type features/locks 22: subvolumes gv-access-control 23: end-volume 24: 25: volume gv-io-threads 26: type performance/io-threads 27: subvolumes gv-locks 28: end-volume 29: 30: volume gv-index 31: type features/index 32: option index-base /var/export/gluster/gv/.glusterfs/indices 33: subvolumes gv-io-threads 34: end-volume 35: 36: volume gv-marker 37: type features/marker 38: option volume-uuid a562cb7c-0edf-4efa-afc6-80ea4e3fe978 39: option timestamp-file /var/lib/glusterd/vols/gv/marker.tstamp 40: option xtime off 41: option quota off 42: subvolumes gv-index 43: end-volume 44: 45: volume /var/export/gluster/gv 46: type debug/io-stats 47: option latency-measurement off 48: option count-fop-hits off 49: subvolumes gv-marker 50: end-volume 51: 52: volume gv-server 53: type protocol/server 54: option transport.socket.listen-port 49152 55: option rpc-auth.auth-glusterfs on 56: option rpc-auth.auth-unix on 57: option rpc-auth.auth-null on 58: option transport-type tcp 59: option auth.login./var/export/gluster/gv.allow ae4ffb2b-75fb-4b5a-b9d3-6c9e390fee03 60: option auth.login.ae4ffb2b-75fb-4b5a-b9d3-6c9e390fee03.password 041ee2e7-e8cf-4ecd-bba6-655348721610 61: option auth.addr./var/export/gluster/gv.allow * 62: subvolumes /var/export/gluster/gv 63: end-volume 64: /usr/local/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log: Final graph: +--+ 1: volume management 2: type mgmt/glusterd 3: option rpc-auth.auth-glusterfs on 4: option rpc-auth.auth-unix on 5: option rpc-auth.auth-null on 6: option transport.socket.listen-backlog 128 7: option transport.socket.read-fail-log off 8: option transport.socket.keepalive-interval 2 9: option transport.socket.keepalive-time 10 10: option transport-type rdma 11: 
option working-directory /var/lib/glusterd 12: end-volume 13: +--+ Thanks, Guido --- [1] This is for older versions and I'm not sure the same holds for 3.4 http://www.gluster.org/wp-content/uploads/2012/05/Gluster_File_System-3.3.0-Administration_Guide-en-US.pdf Sec 7.4 ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] The return of the all-null pending matrix
Emmanuel, I was going through your log files again. Correct me if I'm wrong, the issue in the log is with the file tparm.po, right? Avati On Tue, Aug 13, 2013 at 9:39 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Hi I am back on this problem. I would like to debug it but I need some suggestions on what to look at. We know it disappears if eager locks are disabled. How do they work, and how could they turn bad? Emmanuel Dreyfus m...@netbsd.org wrote: Vijay Bellur vbel...@redhat.com wrote: I have not been able to re-create the problem in my setup. I think it would be a good idea to track this bug and address it. For now, can we not use the volume set mechanism to disable eager-locking? Our exchanges have gone off-list after this message. I repost here the last 100k lines of the log with debug mode: http://ftp.espci.fr/shadow/manu/log -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Mon, Aug 5, 2013 at 4:17 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 08/05/2013 05:42 AM, Deepak C Shetty wrote: On 08/05/2013 02:41 PM, Niels de Vos wrote: On Mon, Aug 05, 2013 at 10:59:32AM +0530, Deepak C Shetty wrote: IIUC, per the previous threads, the glusterfs package has a dep on glusterfs-libs glusterfs does not have a dependency (i.e. a Requires: clause) on glusterfs-libs. This is intentional. Does it not? It should. glusterfs-libs would contain libglusterfs.so, libgfrpc.so and libgfxdr.so - which are required by the glusterfs package (which contains all the xlators). Did I miss something? Avati vdsm/qemu-kvm/oVirt packages need to change their dependency from glusterfs to glusterfs-libs. -- Kaleb ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On Wed, Jul 31, 2013 at 5:11 AM, Jeff Darcy jda...@redhat.com wrote: On 07/31/2013 07:35 AM, Amar Tumballi wrote: I was trying to fire some regression builds on very minor patches today, and noticed (always known, but faced pain of 'waiting' today) that we can fire regression build on only one patch (or a patchset if its submitted with dependency added while submitting). And each regression run takes approx 30mins. With this model, we can at max take only ~45 patches in a day, which won't scale up if we want to grow with more people participating in code contribution. Would be great to have an option to submit regression run with multiple patch numbers, (technically they should be applicable one top of other in any order if not dependent), and it should work fine. That way, we can handle more review load in future. Maybe my brain has been baked too much by the sun, but I thought I'd seen cases where a regression run on a patch with dependencies automatically validated everything in the stack. Not so? That still places a burden on patch submitters to make sure dependencies are specified (shouldn't be a problem since the current tendency is to *over*specify dependencies) and on the person starting the run to pick the top of the stack, but it does allow us to kill multiple birds with one stone. As for scaling, isn't the basic solution to add more worker machines? That would multiply the daily throughput by the number of workers, and decrease latency for simultaneously submitted runs proportionally. The flip side of having too many patches regression-tested in parallel is that, since the regression test applies the patch in question on top of the current git HEAD _at the time of test execution_, we lose out on testing the combined effect of those multiple patches. This can result in master branch being in broken state even though every patch is tested (in isolation). And the breakage will be visible much later - when an unrelated patch is tested after the patches get (successfully tested and) merged independently. This has happened before too, even with the current test one patch at a time model. E.g: 1 - Patch A is tested [success] 2 - Patch B is tested [success] 3 - Patch A is merged 4 - Patch B is merged git HEAD is broken now 5 - Patch C is tested [failure, because combined effect of A + B is tested only now] The serial nature of today's testing limits such delays to some extent, as tested patches keep getting merged before regression test of new patches start. Parallelizing tests too much could potentially increase this danger window. On the other hand, to guarantee master is never broken, test + merge must be a strictly serial operation (i.e do not even start new regression job until the previous patch is tested and merged). That is even worse, for sure. In the end we probably need a combination of the two strategies - Ability to test multiple patches at the same time (solves regression throughput to some extent and increases integrated testing of patches for their combined effect. - Ability to run tests in parallel (of the patch sets) where testing patch sets can be formed such that the two groups are really independent and there is very less chance of their combined effect to result in a regression (e.g one patch set for a bunch of patches in glusterd and another patch set for a bunch of patches in data path). Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] new glusterfs logging framework
On Tue, Jul 30, 2013 at 11:39 PM, Balamurugan Arumugam barum...@redhat.comwrote: - Original Message - From: Joe Julian j...@julianfamily.org To: Pablo paa.lis...@gmail.com, Balamurugan Arumugam b...@gluster.com Cc: gluster-us...@gluster.org, gluster-devel@nongnu.org Sent: Tuesday, July 30, 2013 9:26:55 PM Subject: Re: [Gluster-users] new glusterfs logging framework Configuration files should be under /etc per FSH standards. Move the logger.conf to /etc/glusterfs. This will be done. I, personally, like json logs since I'm shipping to logstash. :-) My one suggestion would be to ensure the timestamps are in rfc3164. rsyslog supports rfc3339 (a profile of ISO8601) and we use this. Let me know your thoughts on continue using it. Yes, those are complex steps, but the rpm/deb packaging should take care of dependencies and setting up logical defaults. Yes. I am planning to add rsyslog configuration for gluster at install time. IMHO, since this is a departure from the way it's been before now, the config file should enable this new behavior, not disable it, to avoid breaking existing monitoring installations. Do you mean to continue current logging in addition to syslog way? This means unless explicitly configured with syslog, by default we should be logging to gluster logs as before. Avati Regards, Bala Pablo paa.lis...@gmail.com wrote: I think that adding all that 'rsyslog' configuration only to see logs is too much. (I admit it, I don't know how to configure rsyslog at that level so that may influence my opinion) Regards, El 30/07/2013 06:29 a.m., Balamurugan Arumugam escribió: Hi All, Recently new logging framework was introduced [1][2][3] in glusterfs master branch. You could read more about this on doc/logging.txt. In brief, current log target is moved to syslog and user has an option to this new logging at compile time (passing '--disable-syslog' to ./configure or '--without syslog' to rpmbuild) and run time (having a file /var/log/glusterd/logger.conf and restarting gluster services). As rsyslog is used as syslog server in Fedora and CentOS/RHEL and default configuration of rsyslog does not have any rule specific to gluster logs, you see all logs are in /var/log/messages in JSON format. Below is the way to make them neat and clean. For fedora users: 1. It requires to install rsyslog-mmjsonparse rpm (yum -y install rsyslog-mmjsonparse) 2. Place below configuration under /etc/rsyslog.d/gluster.conf file. #$RepeatedMsgReduction on $ModLoad mmjsonparse *.* :mmjsonparse: template (name=GlusterLogFile type=string string=/var/log/gluster/%app-name%.log) template (name=GlusterPidLogFile type=string string=/var/log/gluster/%app-name%-%procid%.log) template(name=GLFS_template type=list) { property(name=$!mmcount) constant(value=/) property(name=syslogfacility-text caseConversion=upper) constant(value=/) property(name=syslogseverity-text caseConversion=upper) constant(value= ) constant(value=[) property(name=timereported dateFormat=rfc3339) constant(value=] ) constant(value=[) property(name=$!gf_code) constant(value=] ) constant(value=[) property(name=$!gf_message) constant(value=] ) property(name=$!msg) constant(value=\n) } if $app-name == 'gluster' or $app-name == 'glusterd' then { action(type=omfile DynaFile=GlusterLogFile Template=GLFS_template) stop } if $app-name contains 'gluster' then { action(type=omfile DynaFile=GlusterPidLogFile Template=GLFS_template) stop } 3. Restart rsyslog (service rsyslog restart) 4. Done. 
All gluster process-specific logs are separated into the /var/log/gluster/ directory. Note for Fedora 19 users: there is a bug in rsyslog on Fedora 19 [4], so it is required to recompile the rsyslog source rpm downloaded from the Fedora repository ('rpmbuild --rebuild rsyslog-7.2.6-1.fc19.src.rpm' works fine) and use the generated rsyslog and rsyslog-mmjsonparse binary rpms. For CentOS/RHEL users: the rsyslog currently available in CentOS/RHEL does not have JSON support. I have added the support, which requires some testing. I will update once done. TODO: 1. need to add a volume:brick specific tag to logging so that those logs can be separated out by more than just the pid. 2. enable gfapi to use this logging framework I would like to get feedback/suggestions about this logging framework Regards, Bala [1] http://review.gluster.org/4977 [2] http://review.gluster.org/5002 [3] http://review.gluster.org/4915 [4] https://bugzilla.redhat.com/show_bug.cgi?id=989886 ___ Gluster-users mailing list
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On 7/31/13 4:35 AM, Amar Tumballi wrote: Hi, I was trying to fire some regression builds on very minor patches today, and noticed (always known, but faced pain of 'waiting' today) that we can fire regression build on only one patch (or a patchset if its submitted with dependency added while submitting). And each regression run takes approx 30mins. With this model, we can at max take only ~45 patches in a day, which won't scale up if we want to grow with more people participating in code contribution. Would be great to have an option to submit regression run with multiple patch numbers, (technically they should be applicable one top of other in any order if not dependent), and it should work fine. That way, we can handle more review load in future. Regards, Amar Amar, This thought has crossed my mind before. It needs some scripting in the Jenkins 'regression' job. Can you give it a shot and send out the change for review? If not I can look into it a few days. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On Wed, Jul 31, 2013 at 4:47 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 07/31/2013 07:35 AM, Amar Tumballi wrote: Hi, I was trying to fire some regression builds on very minor patches today, and noticed (always known, but I felt the pain of 'waiting' today) that we can fire a regression build on only one patch (or a patchset, if it is submitted with a dependency added while submitting). And each regression run takes approx 30 mins. With this model, we can take at most ~45 patches in a day, which won't scale if we want to grow with more people participating in code contribution. It would be great to have an option to submit a regression run with multiple patch numbers (technically they should be applicable one on top of the other in any order if not dependent), and it should work fine. That way, we can handle more review load in the future. When a regression fails, how do you know who to blame? I'd rather see more build machines (multiple VMs on a big build.gluster.org replacement box?) instead, to get more concurrency. We already face that ambiguity when a patch has a dependent patch. Multiple VMs will solve the problem, but I guess we need to figure out how to get a bigger box etc. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On Wed, Jul 31, 2013 at 5:09 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 07/31/2013 07:51 AM, Anand Avati wrote: On Wed, Jul 31, 2013 at 4:47 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 07/31/2013 07:35 AM, Amar Tumballi wrote: Hi, I was trying to fire some regression builds on very minor patches today, and noticed (always known, but I felt the pain of 'waiting' today) that we can fire a regression build on only one patch (or a patchset, if it is submitted with a dependency added while submitting). And each regression run takes approx 30 mins. With this model, we can take at most ~45 patches in a day, which won't scale if we want to grow with more people participating in code contribution. It would be great to have an option to submit a regression run with multiple patch numbers (technically they should be applicable one on top of the other in any order if not dependent), and it should work fine. That way, we can handle more review load in the future. When a regression fails, how do you know who to blame? I'd rather see more build machines (multiple VMs on a big build.gluster.org replacement box?) instead, to get more concurrency. We already face that ambiguity when a patch has a dependent patch. That's a bit of a special case. The dependent patch is often owned by the same person, right? I would not want to make this harder for people in the general case. Multiple VMs will solve the problem, but I guess we need to figure out how to get a bigger box etc. Can the slave build machines be behind a firewall? I'm working on getting the old Sunnyvale lab machines online in our new lab. Can we use some of those? That should work, I think! Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
[Gluster-devel] REVERT: Change in glusterfs[master]: fuse: auxiliary gfid mount support
On 7/19/13 1:14 AM, Vijay Bellur (Code Review) wrote: Vijay Bellur has submitted this change and it was merged. Change subject: fuse: auxiliary gfid mount support .. fuse: auxiliary gfid mount support * Files can be accessed directly through their gfid and not just through their paths. For example, if the gfid of a file is f3142503-c75e-45b1-b92a-463cf4c01f99, that file can be accessed as gluster-mount/.gfid/f3142503-c75e-45b1-b92a-463cf4c01f99. .gfid is a virtual directory used to separate out the namespace for accessing files through their gfid, so that we do not conflict with filenames which could themselves be valid uuids. * A new file/directory/symlink can be created with a pre-specified gfid. A setxattr on the parent directory, with the key glusterfs.gfid.newfile and a fuse_auxgfid_newfile_args_t structure (initialized with the appropriate fields) as the value, results in the entry parent/bname whose gfid is set to args.gfid. The contents of the structure should be in network byte order.

struct auxfuse_symlink_in {
    char linkpath[]; /* linkpath is a null terminated string */
} __attribute__ ((__packed__));

struct auxfuse_mknod_in {
    unsigned int mode;
    unsigned int rdev;
    unsigned int umask;
} __attribute__ ((__packed__));

struct auxfuse_mkdir_in {
    unsigned int mode;
    unsigned int umask;
} __attribute__ ((__packed__));

typedef struct {
    unsigned int uid;
    unsigned int gid;
    char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid string
                                             * in canonical form. */
    unsigned int st_mode;
    char bname[]; /* bname is a null terminated string */
    union {
        struct auxfuse_mkdir_in mkdir;
        struct auxfuse_mknod_in mknod;
        struct auxfuse_symlink_in symlink;
    } __attribute__ ((__packed__)) args;
} __attribute__ ((__packed__)) fuse_auxgfid_newfile_args_t;

An initial consumer of this feature would be geo-replication, to create files on the slave mount with the same gfids as on the master. It will also help gsyncd access files directly through their gfids: gsyncd in its newer version will be consuming a changelog (of the master) containing operations on gfids and will sync the corresponding files to the slave. * Also, bring in support to heal gfids with a specific value. fuse-bridge sends across a gfid during a lookup, which storage translators assign to an inode (file/directory etc.) if there is no gfid already associated with it. This patch brings in support to specify that gfid value from an application, instead of relying on the random gfid generated by fuse-bridge. gfids can be healed through the setxattr interface. The setxattr should be done on the parent directory. The key used is glusterfs.gfid.heal and the value should be the following structure, whose contents should be in network byte order.

typedef struct {
    char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid
                                             * string in canonical form */
    char bname[]; /* a null terminated basename */
} __attribute__((__packed__)) fuse_auxgfid_heal_args_t;

This feature can be used for upgrading older geo-rep setups, where the gfids of files are different on master and slave, to newer setups where they should be the same. One can delete the gfids on the slave (using setxattr -x and .glusterfs) and issue a stat on all the files with the gfids from the master. Thanks to Amar Tumballi ama...@redhat.com and Csaba Henk cs...@redhat.com for their inputs.
Signed-off-by: Raghavendra G rgowd...@redhat.com Change-Id: Ie8ddc0fb3732732315c7ec49eab850c16d905e4e BUG: 952029 Reviewed-on: http://review.gluster.com/#/c/4702 Reviewed-by: Amar Tumballi ama...@redhat.com Tested-by: Amar Tumballi ama...@redhat.com Reviewed-on: http://review.gluster.org/4702 Reviewed-by: Xavier Hernandez xhernan...@datalab.es Tested-by: Gluster Build System jenk...@build.gluster.com Reviewed-by: Vijay Bellur vbel...@redhat.com --- M glusterfsd/src/glusterfsd.c M glusterfsd/src/glusterfsd.h M libglusterfs/src/glusterfs.h M libglusterfs/src/inode.c M libglusterfs/src/inode.h M xlators/cluster/dht/src/dht-common.c M xlators/mount/fuse/src/Makefile.am M xlators/mount/fuse/src/fuse-bridge.c M xlators/mount/fuse/src/fuse-bridge.h M xlators/mount/fuse/src/fuse-helpers.c A xlators/mount/fuse/src/glfs-fuse-bridge.h M xlators/mount/fuse/utils/mount.glusterfs.in M xlators/storage/posix/src/posix.c 13 files changed, 1,317 insertions(+), 136 deletions(-) Approvals: Xavier Hernandez: Looks good to me, but someone else must approve Amar Tumballi: Looks good to me, approved
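As a rough illustration of the interface described in the commit message, a consumer such as gsyncd could create a regular file with a pre-specified gfid by packing the newfile arguments by hand and issuing a setxattr on the parent directory inside the FUSE mount. This is only a sketch: the mount point, gfid and file name are made up, the byte layout is transcribed from the structures quoted above rather than from the shipped glfs-fuse-bridge.h header, and since this change was reverted the interface may not exist in a given build.

/* Hypothetical example: create parent/newfile with a chosen gfid by writing
 * the glusterfs.gfid.newfile xattr on the parent directory (mknod case).
 * All multi-byte fields are packed in network byte order. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>

#define UUID_CANONICAL_FORM_LEN 36   /* e.g. f3142503-c75e-45b1-b92a-463cf4c01f99 */

static size_t pack_u32(char *dst, unsigned int v)
{
    unsigned int n = htonl(v);
    memcpy(dst, &n, sizeof(n));
    return sizeof(n);
}

int main(void)
{
    const char *parent = "/mnt/gluster/dir";                      /* assumed mount  */
    const char *gfid   = "f3142503-c75e-45b1-b92a-463cf4c01f99";  /* desired gfid   */
    const char *bname  = "newfile";                               /* new entry name */
    char buf[512];
    size_t off = 0;

    off += pack_u32(buf + off, 0);                         /* uid              */
    off += pack_u32(buf + off, 0);                         /* gid              */
    memcpy(buf + off, gfid, UUID_CANONICAL_FORM_LEN + 1);  /* gfid + NUL       */
    off += UUID_CANONICAL_FORM_LEN + 1;
    off += pack_u32(buf + off, S_IFREG | 0644);            /* st_mode          */
    memcpy(buf + off, bname, strlen(bname) + 1);           /* bname + NUL      */
    off += strlen(bname) + 1;
    off += pack_u32(buf + off, 0644);                      /* args.mknod.mode  */
    off += pack_u32(buf + off, 0);                         /* args.mknod.rdev  */
    off += pack_u32(buf + off, 022);                       /* args.mknod.umask */

    if (setxattr(parent, "glusterfs.gfid.newfile", buf, off, 0) == -1) {
        perror("setxattr(glusterfs.gfid.newfile)");
        return 1;
    }
    printf("created %s/%s with gfid %s\n", parent, bname, gfid);
    return 0;
}

Healing an existing entry works the same way with the glusterfs.gfid.heal key, whose value is just the null-terminated gfid string followed by the null-terminated basename.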
Re: [Gluster-devel] [Gluster-users] uWSGI plugin and some question
On Mon, Jul 29, 2013 at 10:55 PM, Anand Avati anand.av...@gmail.com wrote: On Mon, Jul 29, 2013 at 8:36 AM, Roberto De Ioris robe...@unbit.it wrote: Hi everyone, I have just committed a plugin for the uWSGI application server for exposing glusterfs filesystems using the new native api: https://github.com/unbit/uwsgi-docs/blob/master/GlusterFS.rst Currently it is very simple, but it works really well. I have studied the whole api, and I have two questions: why is there no glfs_stat_async()? If I understand the code correctly, even stat() is a blocking operation. Can you show some code in uwsgi which makes use of asynchronous stat calls? Adding an async stat call in gfapi is not hard, but the use case hasn't been clear. My objective is to avoid the use of threads and processes and to use the uWSGI async api to implement a non-blocking approach (mixable with other engines like gevent or Coro::AnyEvent). Are there any requirements that the callback happen only in specific threads? That is typically a common requirement, and the async callbacks would end up requiring special wiring to bring the callbacks to the desired threads. But I guess that wiring would already be done with the IO callbacks anyway in your case. Do you have some prototype of the module using gfapi out somewhere? I'm hoping to understand the use case of gfapi and see if something can be done to make it integrate with Coro::AnyEvent more naturally. I am assuming the module in question is this - https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c. I see that you are not using the async variants of any of the glfs calls so far. I also believe you would like these synchronous calls to play nicely with Coro:: by yielding in a compatible way (and getting woken up when the response arrives in a compatible way) - rather than implementing an explicit glfs_stat_async(). The ->request() method does not seem to naturally allow the use of explicitly asynchronous calls within. Can you provide some details of the event/request management in use? If possible, I would like to provide hooks for yield and wakeup primitives in gfapi (which you can wire with Coro:: or anything else) such that these seemingly synchronous calls (glfs_open, glfs_stat etc.) don't starve the app thread without yielding. I can see those hooks having a benefit in the qemu gfapi driver too, removing a bit of code there which integrates callbacks into the event loop using pipes. Avati Another thing is the bison/yacc name clash. uWSGI allows you to load various external libraries, and the use of the default 'yy' prefix causes name clashes with common libraries (like matheval). I understand that matheval too should choose a better approach, but why not prefix it with something like glusterfsyy? This would avoid headaches, especially when people start using the library in higher level languages. Currently I have tried the YFLAGS env var hack for ./configure but it did not work (I am using bison): YFLAGS=-Dapi.prefix=glusterfsyy -d ./configure --prefix=/opt/glusterfs/ Hmm, this is nice to get fixed. Do you already have a patch which you have used (other than just the technique shown above)? Thanks! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
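For reference, the synchronous gfapi calls under discussion look roughly like the sketch below in a standalone client. The volume name, server and path are made-up values; the point is that glfs_stat(), glfs_open() and glfs_read() each block the calling thread until the reply arrives, which is exactly what the yield/wakeup hooks proposed above are meant to address.

/* Minimal synchronous gfapi sketch (assumed volume "testvol" on host "server1"). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("testvol");
    glfs_set_volfile_server(fs, "tcp", "server1", 24007);
    glfs_set_logging(fs, "/tmp/gfapi.log", 7);
    if (glfs_init(fs) != 0) {
        perror("glfs_init");
        return 1;
    }

    struct stat st;
    if (glfs_stat(fs, "/index.html", &st) == 0)              /* blocks until reply */
        printf("size: %lld\n", (long long) st.st_size);

    glfs_fd_t *fd = glfs_open(fs, "/index.html", O_RDONLY);  /* blocks */
    if (fd) {
        char buf[128 * 1024];                                /* 128KB buffer */
        ssize_t n = glfs_read(fd, buf, sizeof(buf), 0);      /* blocks */
        printf("read %zd bytes\n", n);
        glfs_close(fd);
    }

    glfs_fini(fs);
    return 0;
}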
Re: [Gluster-devel] [Gluster-users] [FEEDBACK] Governance of GlusterFS project
On Mon, Jul 29, 2013 at 2:17 PM, Joe Julian j...@julianfamily.org wrote: As one of the guys supporting this software, I agree that I would like bugfix releases to happen more often. Critical and security bugs should trigger an immediate test release. Other bug fixes should go out on a reasonable schedule (monthly?). The relatively new CI testing should make this a lot more feasible. Joe, we will certainly be increasing the frequency of releases to push out bug fixes sooner. Though this has been a consistent theme in everybody's comments, your feedback in particular weighs in heavily because of your level of involvement in guiding our users :-) Avati If there weren't hundreds of bugs to examine between releases, I would happily participate in the evaluation process. On 07/26/2013 05:16 PM, Bryan Whitehead wrote: I would really like to see releases happen regularly and more aggressively. So maybe this plan needs a community QA guy, or the release manager needs to take up the responsibility of saying this code is good for inclusion in the next version. (Maybe this falls under process and evaluation?) For example, I think the ext4 patches had long been available but they just took forever to get pushed out into an official release. I'm in favor of closing some bugs and risking introducing new bugs for the sake of releases happening often. On Fri, Jul 26, 2013 at 10:26 AM, Anand Avati anand.av...@gmail.com wrote: Hello everyone, We are in the process of formalizing the governance model of the GlusterFS project. Historically, the governance of the project has been loosely structured. This is an invitation to all of you to participate in this discussion and provide your feedback and suggestions on how we should evolve a formal model. Feedback from this thread will be considered to the extent possible in formulating the draft (which will be sent out for review as well). Here are some specific topics to seed the discussion: - Core team formation - what are the qualifications for membership (e.g. contributions of code, doc, packaging, support on irc/lists; how to quantify?) - what are the responsibilities of the group (e.g. direction of the project, project roadmap, infrastructure, membership) - Roadmap - process of proposing features - process of selection of features for release - Release management - timelines and frequency - release themes - life cycle and support for releases - project management and tracking - Project maintainers - qualification for membership - process and evaluation There are a lot more topics which need to be discussed; I just named some to get started. I am sure our community has members who belong to and participate in (or are at least familiar with) other open source project communities. Your feedback will be valuable. Looking forward to hearing from you! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] [FEEDBACK] Governance of GlusterFS project
On Sun, Jul 28, 2013 at 11:32 PM, Bryan Whitehead dri...@megahappy.net wrote: Weekend activities kept me away from watching this thread; I wanted to add in more of my 2 cents... :) Major releases happening more often would be great - but keeping current releases more current is really what I was talking about. For example, 3.3.0 was a pretty solid release; some annoying bugs got fixed and 3.3.1 felt reasonably quick to come. But that release seemed to be a step back for rdma (forgive me if I was wrong - but I think it wasn't even possible to fuse-mount over rdma with 3.3.1, while 3.3.0 worked). But the 3.3.2 release took a pretty long time to come and fix that regression. I think I also recall seeing a bunch of nfs fixes coming and regressing (but since I don't use gluster/nfs I don't follow closely). Bryan - yes, point well taken. I believe a dedicated release maintainer role will help in this case. I would like to hear other suggestions or thoughts on how you/others think this can be implemented. What I'd like to see: on the -devel mailing list right now I see someone showing that add-brick / replace-brick in 3.4.0 is causing a segfault in apps using libgfapi (in this case qemu/libvirt) to get at gluster volumes. It looks like some patches were provided to fix the issue. Assuming those patches work, I think a 3.4.1 release might be worth pushing out. Basic stuff like that, on something that a lot of people are going to care about (qemu/libvirt integration - or plain libgfapi). So if there were a scheduled release every, say, 1-3 months, then I think that might be worth doing. Ref: http://lists.gnu.org/archive/html/gluster-devel/2013-07/msg00089.html Right, thanks for highlighting. These fixes will be backported. I have already submitted the backport of one of them for review at http://review.gluster.org/5427. The other will be backported once reviewed and accepted in master. Thanks again! Avati The front page of gluster.org says 3.4.0 has Virtual Machine Image Storage improvements. If, 1-3 months from now, more traction with CloudStack/OpenStack or just straight-up libvirtd/qemu with gluster gets going, I'd much rather tell someone "make sure to use 3.4.1" than "be careful when doing an add-brick - all your VMs will segfault". On Sun, Jul 28, 2013 at 5:10 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Harshavardhana har...@harshavardhana.net wrote: What is good for GlusterFS as a whole is highly debatable - since there are no module owners/subsystem maintainers as of yet, at least on paper. Just my two cents on that: you need to make clear whether a module maintainer is a dictator or a steward for the module: does he have the last word on anything touching his module, or is there some higher instance to settle discussions that do not reach consensus? IMO the first approach creates two problems: - having just one responsible person for a module is a huge bet that this person will have good judgment. Be careful to leave a maintainer position open instead of assigning it to the wrong person. - having many different dictators each ruling over a module can create difficult situations when a proposed change impacts many modules. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] uWSGI plugin and some question
On Tue, Jul 30, 2013 at 7:47 AM, Roberto De Ioris robe...@unbit.it wrote: On Mon, Jul 29, 2013 at 10:55 PM, Anand Avati anand.av...@gmail.com wrote: I am assuming the module in question is this - https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c . I see that you are not using the async variants of any of the glfs calls so far. I also believe you would like these synchronous calls to play nicely with Coro:: by yielding in a compatible way (and getting woken up when the response arrives in a compatible way) - rather than implementing an explicit glfs_stat_async(). The ->request() method does not seem to naturally allow the use of explicitly asynchronous calls within. Can you provide some details of the event/request management in use? If possible, I would like to provide hooks for yield and wakeup primitives in gfapi (which you can wire with Coro:: or anything else) such that these seemingly synchronous calls (glfs_open, glfs_stat etc.) don't starve the app thread without yielding. I can see those hooks having a benefit in the qemu gfapi driver too, removing a bit of code there which integrates callbacks into the event loop using pipes. Avati This is a prototype of the async way: https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c#L43 Basically, once the async request is sent, the uWSGI core (it can be a coroutine, a greenthread or another callback) waits for a signal of the callback's completion (via a pipe; it could be eventfd() on Linux): https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c#L78 The problem is that this approach is racy with respect to the uwsgi_glusterfs_async_io structure. It is probably OK since you are waiting for the completion of the AIO request before issuing the next. One question I have about your usage is: who is draining the \1 written to the pipe in uwsgi_glusterfs_read_async_cb()? Since the same pipe is re-used for the next read chunk, won't you get an immediate wake-up if you tried polling on the pipe without draining? Can I assume that after glfs_close() all of the pending callbacks are cleared? With the way you are using the _async() calls, you do have that guarantee - because you are waiting for the completion of each AIO request right after issuing it. The enhancement to gfapi I was proposing was to expose hooks at the yield() and wake() points for external consumers to wire in their own ways of switching out of the stack. This is still a half-baked idea, but it will let you use only glfs_read(), glfs_stat() etc. (and NOT the explicit async variants), and the hooks will let you do wait_read_hook() and write(pipefd, '\1') respectively in a generic way, independent of the actual call. That way I could simply deallocate it (now it is on the stack) at the end of the request. You probably need to do all that only if you want to have multiple outstanding AIOs at the same time. From what I see, you just need co-operative waiting until call completion. Also note that the ideal block size for performing IO is 128KB. 8KB is too little for a distributed filesystem. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
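The completion-pipe pattern and the draining question above can be sketched roughly as follows. This is not the actual uWSGI plugin code: the names are invented, the blocking read() on the pipe stands in for uWSGI's async wait-read hook, and the io-callback signature shown matches the gfapi of this era (later releases changed it), so check the installed glfs.h before relying on it.

/* Sketch: async read with a reused pipe for completion signalling.
 * The callback writes one byte; the request loop drains exactly that
 * byte before reusing the pipe for the next 128KB chunk. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <glusterfs/api/glfs.h>

struct aio_ctx {
    int     pipefd[2];
    ssize_t ret;
};

/* runs in a gfapi thread: record the result and wake the waiter */
static void read_done(glfs_fd_t *fd, ssize_t ret, void *data)
{
    struct aio_ctx *ctx = data;
    ctx->ret = ret;
    (void) write(ctx->pipefd[1], "\1", 1);
}

static int read_file(glfs_t *fs, const char *path)
{
    glfs_fd_t *fd = glfs_open(fs, path, O_RDONLY);
    if (!fd)
        return -1;

    struct aio_ctx ctx;
    if (pipe(ctx.pipefd) == -1) {
        glfs_close(fd);
        return -1;
    }

    char buf[128 * 1024];   /* 128KB, as suggested above */
    off_t offset = 0;
    char c;

    for (;;) {
        glfs_pread_async(fd, buf, sizeof(buf), offset, 0, read_done, &ctx);
        /* here a uWSGI core would suspend until the pipe is readable;
         * this sketch simply blocks */
        (void) read(ctx.pipefd[0], &c, 1);   /* drain the completion byte */
        if (ctx.ret <= 0)
            break;                           /* EOF or error */
        offset += ctx.ret;
        /* ... hand buf / ctx.ret to the response writer here ... */
    }

    close(ctx.pipefd[0]);
    close(ctx.pipefd[1]);
    glfs_close(fd);
    return ctx.ret < 0 ? -1 : 0;
}

Draining one byte per completed request is what keeps the reused pipe from producing a spurious wake-up on the next iteration.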
Re: [Gluster-devel] [Gluster-users] Status of the SAMBA module
On Mon, Jul 29, 2013 at 1:56 AM, Nux! n...@li.nux.ro wrote: On 29.07.2013 07:16, Daniel Müller wrote: But you need to have gluster installed!? Which version? Samba 4.1 does not compile with the latest glusterfs 3.4 on CentOS 6.4. From what JM said, it builds against EL6 Samba (3.6) and it has also been added to upstream. You will need the latest Samba 4.1 release (or git HEAD), and the glusterfs-api-devel RPM (with its deps) installed at the time of building Samba, to get the vfs module built. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] QEMU (and other libgfapi client?) crashes on add-brick / replace-brick
On Mon, Jul 29, 2013 at 1:51 AM, Guido De Rosa guido.der...@vemarsas.it wrote: Apparently the problem isn't fixed... even when qemu doesn't crash, the guest raises many I/O errors and becomes unusable, just like a real machine would if you physically removed the hard drive, I guess... I'm doing more tests anyway and will post a much more detailed report as soon as I can. Thanks for now. Please do get back with the logs. It still might be a privileged port issue. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel