Re: [Gluster-devel] rpc throttling causing ping timer expiries while running iozone
On Tue, Apr 22, 2014 at 11:49 PM, Pranith Kumar Karampuri pkara...@redhat.com wrote: Hi, When iozone is in progress and the number of blocking inodelks is greater than the threshold number of rpc requests allowed for that client (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT), subsequent requests from that client will not be read until all the outstanding requests are processed and replied to. But because no more requests are read from that client, the unlocks on the already granted locks never arrive, so the number of outstanding requests never comes down. This leads to a ping-timeout on the client. I am wondering whether the proper fix for this is to not count INODELK/ENTRYLK/LK calls toward throttling. I made such a change in the codebase and tested it, and it works. Please let me know if this is acceptable or whether it needs to be fixed differently. Do you know why there were 64 outstanding inodelk requests? What does iozone do to result in this kind of locking pattern? ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
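To make the proposal concrete, here is a minimal, self-contained sketch of exempting lock fops from the outstanding-request count. All struct, constant and function names below are illustrative stand-ins, not the actual glusterfs rpcsvc code.

/* Illustrative sketch only: exempt lock-class procedures
 * (INODELK/ENTRYLK/LK) from the outstanding-request count so that
 * blocked lock requests cannot starve the unlock that would release
 * them. Names and constants are hypothetical. */
#include <stdbool.h>

#define OUTSTANDING_RPC_LIMIT 64   /* stand-in for RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT */

enum fop_type { FOP_WRITE, FOP_READ, FOP_INODELK, FOP_ENTRYLK, FOP_LK, FOP_OTHER };

struct client_conn {
        int outstanding;           /* requests received but not yet replied to */
};

static bool
is_lock_fop (enum fop_type fop)
{
        return fop == FOP_INODELK || fop == FOP_ENTRYLK || fop == FOP_LK;
}

/* Called when a request is read from the socket: returns true if the
 * transport should stop reading further requests from this client. */
static bool
throttle_on_request (struct client_conn *conn, enum fop_type fop)
{
        if (is_lock_fop (fop))
                return false;      /* lock fops do not count toward the limit */
        conn->outstanding++;
        return conn->outstanding >= OUTSTANDING_RPC_LIMIT;
}

/* Called when the reply for a counted request is submitted. */
static void
throttle_on_reply (struct client_conn *conn, enum fop_type fop)
{
        if (!is_lock_fop (fop) && conn->outstanding > 0)
                conn->outstanding--;
}

With accounting like this, blocked INODELK requests can pile up past the limit without stopping the transport from reading the unlock that would eventually release them, which is exactly the starvation described above.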
Re: [Gluster-devel] [Gluster-users] Status on Gluster on OS X (10.9)
I did now. I'd recommend adding a check for libintl.h in configure.ac and failing gracefully with a suggestion to install gettext. Thanks On Fri, Apr 4, 2014 at 10:59 PM, Dennis Schafroth den...@schafroth.dk wrote: On 05 Apr 2014, at 07:38 , Anand Avati av...@gluster.org wrote: And here: ./gf-error-codes.h:12:10: fatal error: 'libintl.h' file not found I guess I was wrong that gettext / libintl.h was not required. It seems to be in use in logging.c. Until I figure out whether this is the case, I would suggest installing gettext. cheers, :-Dennis On Fri, Apr 4, 2014 at 10:15 PM, Dennis Schafroth den...@schafroth.dk wrote: Pushed a fix to make it work without the gettext / libintl header. I compiled without the CFLAGS and LDFLAGS Hmm. Apparently not. cheers, :-Dennis On 05 Apr 2014, at 07:04 , Dennis Schafroth den...@schafroth.dk wrote: Bummer. That is from gettext, which I thought was only optional. I got it using either Homebrew (http://brew.sh/) or MacPorts. Homebrew seems quite good these days; I would probably recommend that. It installs with a one-liner into /usr/local, but requires sudo along the way to set the rights: brew install gettext It will require setting some CFLAGS / LDFLAGS when running ./configure: LDFLAGS=-L/usr/local/opt/gettext/lib CPPFLAGS=-I/usr/local/opt/gettext/include cheers, :-Dennis On 05 Apr 2014, at 06:56 , Anand Avati av...@gluster.org wrote: Build fails for me: Making all in libglusterfs Making all in src CC libglusterfs_la-dict.lo CC libglusterfs_la-xlator.lo CC libglusterfs_la-logging.lo logging.c:26:10: fatal error: 'libintl.h' file not found #include <libintl.h> ^ 1 error generated. make[4]: *** [libglusterfs_la-logging.lo] Error 1 make[3]: *** [all] Error 2 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 How did you get libintl.h on your system? Also, please add a check for it in configure.ac and report the missing package. Thanks, On Fri, Apr 4, 2014 at 6:08 PM, Dennis Schafroth den...@schafroth.dk wrote: It's been quiet on this topic, but actually Harshavardhana and I have been quite busy off-line working on this. Since my initial success we have been able to get it to compile with clang (almost as clean as with gcc) and actually run. The latter was a bit tricky because clang has a stricter policy about exporting inline functions, which ended with many runs failing on missing functions. So right now I can run everything, but there is a known issue with NFS/NLM4; this should not matter for people trying to run the client with OSX FUSE. Anyone brave enough to try the *client* can check it out. You still need Xcode + command line tools (clang, make) and an installed OSXFUSE (FUSE for OS X). $ git clone g...@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs.git $ cd osx-glusterfs Either $ ./configure.osx Or - $ ./autogen.sh (requires aclocal, autoconf, automake) - $ ./configure $ make $ sudo make install You should be able to mount using: sudo glusterfs --volfile=<your vol file>.vol <mount point> And yes, this is very much bleeding edge. My Mac had a kernel panic yesterday when it was running both client and server. I would really like to get feedback from anyone trying this out.
cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
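For reference, one possible shape of the graceful fallback discussed in this thread, assuming configure.ac gains an AC_CHECK_HEADERS([libintl.h]) check that defines HAVE_LIBINTL_H. The fallback macro is a sketch, not the actual change that was pushed.

/* Sketch of making the libintl.h dependency optional in a source file
 * such as logging.c. HAVE_LIBINTL_H is assumed to come from a
 * configure check; the fallback macro is an assumption for this
 * example, not the actual glusterfs change. */
#if defined(HAVE_LIBINTL_H)
#include <libintl.h>
#else
/* degrade gracefully: no message translation */
#define gettext(msgid) (msgid)
#endif

#include <stdio.h>

int
main (void)
{
        printf ("%s\n", gettext ("hello from a gettext-optional build"));
        return 0;
}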
Re: [Gluster-devel] [Gluster-users] Status on Gluster on OS X (10.9)
Build fails for me: Making all in libglusterfs Making all in src CC libglusterfs_la-dict.lo CC libglusterfs_la-xlator.lo CC libglusterfs_la-logging.lo logging.c:26:10: fatal error: 'libintl.h' file not found #include <libintl.h> ^ 1 error generated. make[4]: *** [libglusterfs_la-logging.lo] Error 1 make[3]: *** [all] Error 2 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 How did you get libintl.h on your system? Also, please add a check for it in configure.ac and report the missing package. Thanks, On Fri, Apr 4, 2014 at 6:08 PM, Dennis Schafroth den...@schafroth.dk wrote: It's been quiet on this topic, but actually Harshavardhana and I have been quite busy off-line working on this. Since my initial success we have been able to get it to compile with clang (almost as clean as with gcc) and actually run. The latter was a bit tricky because clang has a stricter policy about exporting inline functions, which ended with many runs failing on missing functions. So right now I can run everything, but there is a known issue with NFS/NLM4; this should not matter for people trying to run the client with OSX FUSE. Anyone brave enough to try the *client* can check it out. You still need Xcode + command line tools (clang, make) and an installed OSXFUSE (FUSE for OS X). $ git clone g...@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs.git $ cd osx-glusterfs Either $ ./configure.osx Or - $ ./autogen.sh (requires aclocal, autoconf, automake) - $ ./configure $ make $ sudo make install You should be able to mount using: sudo glusterfs --volfile=<your vol file>.vol <mount point> And yes, this is very much bleeding edge. My Mac had a kernel panic yesterday when it was running both client and server. I would really like to get feedback from anyone trying this out. cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Status on Gluster on OS X (10.9)
And here: ./gf-error-codes.h:12:10: fatal error: 'libintl.h' file not found On Fri, Apr 4, 2014 at 10:15 PM, Dennis Schafroth den...@schafroth.dk wrote: Pushed a fix to make it work without the gettext / libintl header. I compiled without the CFLAGS and LDFLAGS cheers, :-Dennis On 05 Apr 2014, at 07:04 , Dennis Schafroth den...@schafroth.dk wrote: Bummer. That is from gettext, which I thought was only optional. I got it using either Homebrew (http://brew.sh/) or MacPorts. Homebrew seems quite good these days; I would probably recommend that. It installs with a one-liner into /usr/local, but requires sudo along the way to set the rights: brew install gettext It will require setting some CFLAGS / LDFLAGS when running ./configure: LDFLAGS=-L/usr/local/opt/gettext/lib CPPFLAGS=-I/usr/local/opt/gettext/include cheers, :-Dennis On 05 Apr 2014, at 06:56 , Anand Avati av...@gluster.org wrote: Build fails for me: Making all in libglusterfs Making all in src CC libglusterfs_la-dict.lo CC libglusterfs_la-xlator.lo CC libglusterfs_la-logging.lo logging.c:26:10: fatal error: 'libintl.h' file not found #include <libintl.h> ^ 1 error generated. make[4]: *** [libglusterfs_la-logging.lo] Error 1 make[3]: *** [all] Error 2 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 How did you get libintl.h on your system? Also, please add a check for it in configure.ac and report the missing package. Thanks, On Fri, Apr 4, 2014 at 6:08 PM, Dennis Schafroth den...@schafroth.dk wrote: It's been quiet on this topic, but actually Harshavardhana and I have been quite busy off-line working on this. Since my initial success we have been able to get it to compile with clang (almost as clean as with gcc) and actually run. The latter was a bit tricky because clang has a stricter policy about exporting inline functions, which ended with many runs failing on missing functions. So right now I can run everything, but there is a known issue with NFS/NLM4; this should not matter for people trying to run the client with OSX FUSE. Anyone brave enough to try the *client* can check it out. You still need Xcode + command line tools (clang, make) and an installed OSXFUSE (FUSE for OS X). $ git clone g...@forge.gluster.org:~schafdog/glusterfs-core/osx-glusterfs.git $ cd osx-glusterfs Either $ ./configure.osx Or - $ ./autogen.sh (requires aclocal, autoconf, automake) - $ ./configure $ make $ sudo make install You should be able to mount using: sudo glusterfs --volfile=<your vol file>.vol <mount point> And yes, this is very much bleeding edge. My Mac had a kernel panic yesterday when it was running both client and server. I would really like to get feedback from anyone trying this out. cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] glfs_futimens function is not implemented
Because futimes() is not a POSIX (or any standard) call. We can have #ifdefs and call futimes(), but it hasn't been a priority (you're welcome to send a patch). Thanks Avati On Mon, Mar 31, 2014 at 10:00 AM, Thiago da Silva thi...@redhat.com wrote: Hi, While testing libgfapi I noticed that glfs_futimens was returning -1 with errno set to ENOSYS. Digging a little deeper shows that it was never implemented. Here's the current function definition: https://github.com/gluster/glusterfs/blob/master/xlators/storage/posix/src/posix.c#L463 Does anybody know if there's a particular reason as to why it was never implemented? Thanks, Thiago ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
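A rough sketch of what an #ifdef-based implementation could look like, using futimens() where available and falling back to futimes(). This is illustrative only, not the actual posix xlator patch; HAVE_FUTIMENS and HAVE_FUTIMES are assumed to come from configure checks.

/* Sketch of an #ifdef'd fallback for setting timestamps on an open fd. */
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>
#include <errno.h>

static int
set_fd_times (int fd, const struct timespec ts[2])
{
#if defined(HAVE_FUTIMENS)
        return futimens (fd, ts);          /* POSIX.1-2008 */
#elif defined(HAVE_FUTIMES)
        struct timeval tv[2];
        tv[0].tv_sec  = ts[0].tv_sec;
        tv[0].tv_usec = ts[0].tv_nsec / 1000;
        tv[1].tv_sec  = ts[1].tv_sec;
        tv[1].tv_usec = ts[1].tv_nsec / 1000;
        return futimes (fd, tv);           /* BSD/Linux extension */
#else
        errno = ENOSYS;                    /* platform offers neither call */
        return -1;
#endif
}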
Re: [Gluster-devel] release 3.4.3?
On Mon, Mar 24, 2014 at 9:50 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 03/24/2014 08:35 AM, Kaleb S. KEITHLEY wrote: I've been begging for an additional (+2) review for http://review.gluster.org/#/c/6737. It has two +1 reviews, but the matching fixes for release-3.5 and master have three +1 reviews and one +1 review respectively and have not been merged, so I'm reluctant to merge this. Progress. http://review.gluster.org/#/c/6737 has received a +2 (thanks Jeff), but the matching fixes for release-3.5 and master, http://review.gluster.org/6736 and http://review.gluster.org/5075 respectively, also still need +2. I am still reluctant to take this fix into release-3.4 unless I'm certain the corresponding fixes will also be taken into release-3.5 and master. Please give me some time to review the changes to master. I'll do it ASAP. ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] How to compile gluster 3.4 on Mac OS X 10.8.4?
(moving to gluster-devel@) That is great progress! Please keep posting the intermediate work upstream (into gerrit) as you move along. Regarding the hang: do you have cli.log printing anything at all (typically /var/log/glusterfs/cli.log)? Avati On Wed, Mar 19, 2014 at 5:07 PM, Dennis Schafroth den...@schafroth.dk wrote: I now have a branch of HEAD compiling under OS X 10.9, when I disable the qemu-block and fusermount options. Still having a build issue with libtool and libspl, which I have only hacked my way around. Actually both glusterd and gluster run, but using gluster (OS X) hangs on both pool list and peer probe <other server>. However, probing glusterd from a Linux host succeeds. But glusterd's log does indicate some issue. cheers, :-Dennis Schafroth ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal: GlusterFS Quattro
On Fri, Mar 7, 2014 at 8:13 AM, Jeff Darcy jda...@redhat.com wrote: As a counterpoint to the current GlusterFS proposal, I've written up a bunch of ideas that I'm collectively calling GlusterFS Quattro. It's in Google Docs so that people can comment. Please do. ;) http://goo.gl/yE3O4j Thanks for sharing this, Jeff. Towards the end of my visit to the Bangalore Red Hat office this time (from which I just returned a couple days ago) we got to discuss the 4.x proposal at a high level (less about specifics, more in general). A concern raised by many was that if a new release is too radical (the analogy given was samba4 vs samba3 - coincidentally the same major number), it would result in way too much confusion and overhead (e.g. lots of people want to stick with 3.x as 4.x is not yet stable, and this results in 3.x getting more stable and being a negative incentive to move over to 4.x, especially where distributions/ISVs are concerned). The conclusion was that the 4.x proposal would be downsized to only have the management layer changes, while the data layer (DHT, stripe, etc.) changes would be introduced piece by piece (as they get ready) independent of whether the current master is for 3.x or 4.x. Given the background, it only makes sense to retain the guiding principles of the feedback, reconcile the changes proposed to the management layer in the two proposals, and limit the scope of 4.x to management changes. Thoughts? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal: GlusterFS Quattro
On Fri, Mar 7, 2014 at 11:56 AM, Jeff Darcy jda...@redhat.com wrote: Given the background, it only makes sense to retain the guiding principles of the feedback, and reconcile the changes proposed to management layer in the two proposals and retain the scope of 4.x to management changes. Thoughts? I think we need to take a more careful look at dependencies between various items before we decide what should be in 4.0 vs. earlier/later. For example, several other features depend on being able to subdivide storage that the user gives us into smaller units. That feature itself depends on multiplexing those smaller units (whether we call them s-bricks or something else) onto fewer daemons/ports. So which one is the 4.0 feature? If we have a clear idea of which parts are independent and which ones must be done sequentially, then I think we'll be better able to draw a line which separates 3.x from 4.x at the most optimal point. The brick model is probably the borderline item which touches upon both management layer and data layer to some extent. Decreasing the number of processes/ports in general is a good thing, and to that end we need our brick processes to be more flexible/dynamic (able to switch a graph on the fly, add a new export directory on the fly etc.) - which is completely lacking today. I think, by covering this piece (brick model) we should be mostly able to classify rest of the changes into management vs data path in a more clear way. That being said we still need a low level design of how to make the brick process more dynamic (though it is mostly a matter of just getting it done) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Barrier design issues wrt volume snapshot
On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi kpart...@redhat.com wrote: - Original Message - On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur vbel...@redhat.com wrote: Adding gluster-devel. On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote: All, In recent discussions around design (and implementation) of the barrier feature, couple of things came to light. 1) changelog xlator needs barrier xlator to block unlink and rename FOPs in the call path. This is apart from the current list of FOPs that are blocked in their call back path. This is to make sure that the changelog has a bounded queue of unlink and rename FOPs, from the time barriering is enabled, to be drained, committed to changelog file and published. Why is this necessary? The only consumer of changelog today, georeplication, can't tolerate missing unlink/rename entries from changelog, even with the initial xsync based crawl, until changelog entries are available for the master volume. So, changelog xlator needs to ensure that the last rotated (publishable) changelog should have entries for all the unlink(s)/rename(s) that made it to the snapshot. For this, changelog needs barrier xlator to block unlink/rename FOPs in the call path too. Hope that helps. This sounds like a very changelog specific requirement. This is best addressed in the changelog translator itself. If unlink/rmdir/renames should not be in progress during a snapshot, then we need to hold off new ops in the call path, trigger a log rotation and the rotation should wait for completion of ongoing fops anyways. 2) It is possible in a pure distribute volume that the following sequence of FOPs could result in snapshots of bricks disagreeing on inode type for a file or directory. t1: snap b1 t2: unlink /a t3: mkdir /a t4: snap b2 where, b1 and b2 are bricks of a pure distribute volume V. The above sequence can happen with the current barrier xlator design, since we allow unlink FOPs to go through to the disk and only block their acknowledgement to the application. This implies a concurrent mkdir on the same name could succeed, since DHT doesn't serialize unlink and mkdir FOPs, unlike AFR. Avati, I hear that you have a solution for problem 2). Could you please start the discussion on this thread? It would help us to decide how to go about with the barrier xlator implementation. The solution is really a long pending implementation of dentry serialization in the resolver of protocol server. Today we allow multiple FOPs to happen in parallel which modify the same dentry. This results in hairy races (including non atomicity of rename) and has been kept open for a while now. Implementing the dentry serialization in the resolver will solve 2 as a side effect. Hence that is a better approach than making changes in the barrier translator. I am not sure I understood how this works from the brief introduction above. Could you explain a bit? By dentry serialization, I mean we should have only one operation modifying a pargfid/bname at a given time. This needs changes in the resolver of protocol server and possibly some changes in the inode table. This is really for solving rare races, and I think is something we need to work on independent of the snapshot requirements. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
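To illustrate the dentry serialization idea in isolation, here is a toy, self-contained sketch that allows only one in-flight operation per (parent-gfid, basename) pair at a time. The real change would live in the protocol/server resolver and the inode table; the table layout, names and sizes below are assumptions for the example.

/* Toy sketch of "dentry serialization": one in-flight op per
 * (pargfid, basename) pair. Illustrative only. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct dentry_lock {
        char                pargfid[37];   /* uuid string + NUL */
        char                name[256];
        struct dentry_lock *next;
};

static struct dentry_lock *in_flight;
static pthread_mutex_t     table_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t      table_cond = PTHREAD_COND_INITIALIZER;

static struct dentry_lock *
find_locked (const char *pargfid, const char *name)
{
        struct dentry_lock *dl;
        for (dl = in_flight; dl; dl = dl->next)
                if (!strcmp (dl->pargfid, pargfid) && !strcmp (dl->name, name))
                        return dl;
        return NULL;
}

/* Block until no other op is modifying the same dentry, then claim it. */
void
dentry_serialize_enter (const char *pargfid, const char *name)
{
        pthread_mutex_lock (&table_lock);
        while (find_locked (pargfid, name))
                pthread_cond_wait (&table_cond, &table_lock);
        struct dentry_lock *dl = calloc (1, sizeof (*dl));
        strncpy (dl->pargfid, pargfid, sizeof (dl->pargfid) - 1);
        strncpy (dl->name, name, sizeof (dl->name) - 1);
        dl->next = in_flight;
        in_flight = dl;
        pthread_mutex_unlock (&table_lock);
}

/* Release the dentry and wake up any waiter queued behind it. */
void
dentry_serialize_exit (const char *pargfid, const char *name)
{
        pthread_mutex_lock (&table_lock);
        struct dentry_lock **pp = &in_flight;
        while (*pp) {
                if (!strcmp ((*pp)->pargfid, pargfid) &&
                    !strcmp ((*pp)->name, name)) {
                        struct dentry_lock *dl = *pp;
                        *pp = dl->next;
                        free (dl);
                        break;
                }
                pp = &(*pp)->next;
        }
        pthread_cond_broadcast (&table_cond);
        pthread_mutex_unlock (&table_lock);
}

With something like this in the resolver, a concurrent unlink and mkdir on the same name would be forced into a definite order, which removes the brick-snapshot disagreement described in problem 2.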
Re: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Xavi, Getting such a caching mechanism has several aspects. First of all we need the framework pieces implemented (particularly server-originated messages to the client for invalidation and revokes) in a well designed way - particularly how we address a specific translator in a message originating from the server. Some of the recent changes to client_t allow server-side translators to get a handle (the client_t object) on which messages can be submitted back to the client. Such a framework (of having server-originated messages) is also necessary for implementing oplocks (and possibly leases) - particularly interesting for the Samba integration. As Jeff already mentioned, this is an area gluster has not focused on, given the targeted use case. However, extending this to internal use cases (to avoid per-operation inodelks) can benefit many modules - encryption/crypt, afr, etc. It seems possible to have a common framework for delegating locks to clients, and build caching coherency protocols / oplocks / inodelk avoidance on top of it. Feel free to share a more detailed proposal if you have a plan - I'm sure the Samba folks (Ira copied) would be interested too. Thanks! Avati On Wed, Feb 5, 2014 at 11:27 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 04.02.2014 17:18, Jeff Darcy wrote: The only synchronization point needed is to make sure that all bricks agree on the inode state and which client owns it. This can be achieved without locking using a method similar to what I implemented in the DFC translator. Besides the lock-less architecture, the main advantage is that much more aggressive caching strategies can be implemented very near to the final user, considerably increasing the throughput of the file system. Special care has to be taken with things that can fail on background writes (basically brick space and user access rights). Those should be handled appropriately on the client side to guarantee future success of writes. Of course this is only a high level overview. A deeper analysis should be done to see what to do in each special case. What do you think? I think this is a great idea for where we can go - and need to go - in the long term. However, it's important to recognize that it *is* the long term. We had to solve almost exactly the same problems in MPFS long ago. Whether the synchronization uses locks or not *locally* is meaningless, because all of the difficult problems have to do with recovering the *distributed* state. What happens when a brick fails while holding an inode in any state but I? How do we recognize it, what do we do about it, how do we handle the case where it comes back and needs to re-acquire its previous state? How do we make sure that a brick can successfully flush everything it needs to before it yields a lock/lease/whatever? That's going to require some kind of flow control, which is itself a pretty big project. It's not impossible, but it took multiple people some years for MPFS, and ditto for every other project (e.g. Ceph or XtreemFS) which adopted similar approaches. GlusterFS's historical avoidance of this complexity certainly has some drawbacks, but it has also been key to us making far more progress in other areas.
Well, it's true that there will be a lot of tricky cases that will need to be handled to be sure that data integrity and system responsiveness are guaranteed; however, I think they are not more difficult than what can happen currently if a client dies or loses communication while it holds a lock on a file. Anyway, I think there is great potential in this mechanism because it can allow the implementation of powerful caches, even SSD-based ones, that could improve performance a lot. Of course, there is a lot of work in solving all potential failures and designing the right thing. An important consideration is that all these methods try to solve a problem that is seldom found (i.e. having more than one client modifying the same file at the same time). So a solution that has almost 0 overhead for the normal case and allows the implementation of aggressive caching mechanisms seems a big win. To move forward on this, I think we need a *much* more detailed idea of how we're going to handle the nasty cases. Would some sort of online collaboration - e.g. Hangouts - make more sense than continuing via email? Of course, we can talk on IRC or anywhere else if you prefer. Xavi ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] empty xlator
The nop xlator by itself seems OK. Have you tried the stripe config with the nop xlator on top, or even without the nop xlator? Avati On Wed, Feb 5, 2014 at 3:26 PM, Lluís Pàmies i Juárez llpam...@pamies.cat wrote: Hello, As a proof of concept I'm trying to write an xlator that does nothing; I call it nop. The code for nop.c is simply: #include "config.h" #include "call-stub.h" struct xlator_fops fops = {}; struct xlator_cbks cbks = {}; struct xlator_dumpops dumpops = {}; struct volume_options options[] = {{.key={NULL}},}; int32_t init (xlator_t *this){return 0;} int fini (xlator_t *this){return 0;} And I compile it with: $ gcc -fPIC -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -DGF_LINUX_HOST_OS -shared -nostartfiles -lglusterfs -lpthread -I${GFS} -I${GFS}/libglusterfs/src -I${GFS}/contrib/uuid nop.c -o nop.so Then, if I try a test.vol file like this: volume test-posix type storage/posix option directory /home/llpamies/Projects/gluster/test-split/node0-data end-volume volume test-nop type features/nop subvolumes test-posix end-volume volume test-cache type performance/io-cache subvolumes test-nop end-volume and mount it with: $ glusterfs --debug -f test.vol /mount/point It seems to work fine, doing nothing. However, when used together with the stripe xlator as follows: volume test-posix0 type storage/posix option directory /home/llpamies/Projects/gluster/test-split/node0-data end-volume volume test-posix1 type storage/posix option directory /home/llpamies/Projects/gluster/test-split/node1-data end-volume volume test-nop0 type features/nop subvolumes test-posix0 end-volume volume test-nop1 type features/nop subvolumes test-posix1 end-volume volume test-stripe type cluster/stripe subvolumes test-nop0 test-nop1 end-volume glusterfs hangs during the first fuse lookup for .Trash, and /mount/point looks unmounted with permissions etc. Does it look like some bug in the stripe xlator or is there something fundamentally wrong with the nop xlator? Thank you, -- Lluís ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS 4.0 round two?
I agree. I will send out a spin of the 2nd draft soon. Have been caught up in a bunch of other stuff. Avati On Wed, Jan 29, 2014 at 10:08 PM, Amar Tumballi ama...@gmail.com wrote: On Thu, Jan 30, 2014 at 3:30 AM, Jeffrey Darcy jda...@redhat.com wrote: I know we're all busy with other things, but it has been a little over a month since this discussion started. There are a lot of really good comments on the Google Docs version (http://goo.gl/qLw3Vz) and we're at risk of losing our place if we don't try to keep things going. In particular, the issue of how these plans relate to 3.6 feature planning, which also needs to conclude soon. To pick a couple of examples: * There's a 3.6 item to make glusterd more scalable, but there are many more scalability issues that need to be addressed and the later 4.0 proposal tries to tackle a few. Should we even try to address scalability in the 3.x series, or just leave it entirely to 4.x? If we try to do both, how should we resolve the incompatibilities that the second proposal introduces relative to the first? * One of the hottest 3.6 items is tiering, data classification, whatever you want to call it. I say it's hot because everyone else - e.g. Ceph, HDFS, Swift - has recognized this as an important feature and they're all making significant moves here. Again, the 4.0 proposal contains some ideas that touch on this, not always compatible with earlier ideas. Which should we work on, and how should we address their differences? If we don't complete the discussions about 4.0, we won't be able to reach any reasonable conclusions about when/how it should diverge from 3.x. Should we set a deadline for a second draft and/or an IRC meeting to discuss the comments we've already collected? +1 It is very important to keep momentum for this, otherwise, the amount of work planned for 4.0 would never be 'done'. Also, would be very important to track the deadlines as an action item in every weekly IRC meeting. Regards, Amar ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Gerrit doesn't use HTTPS
On Sat, Dec 14, 2013 at 5:58 AM, James purplei...@gmail.com wrote: On Sat, Dec 14, 2013 at 3:28 AM, Vijay Bellur vbel...@redhat.com wrote: On 12/13/2013 04:05 AM, James wrote: I just noticed that the Gluster Gerrit [1] doesn't use HTTPS! Can this be fixed ASAP? Configured now, thanks! Thanks for looking into this promptly! Please check and let us know if you encounter any problems with https. 1) None of the CN information (name, location, etc) has been filled in... Either that or I'm hitting a MITM (less likely). 2) Ideally the certificate would be signed. If it's not signed, you should at least publish the correct signature somewhere we trust. If you need help wrangling any of the SSL, I'm happy to help! IIRC we should be having a CA signed cert for *.gluster.org. Copying JM. Avati -Vijay Thanks! James ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Bug in locks deletion upon disconnet
I have a fix in the works for this (with more cleanups to the locks xlator). Avati On Fri, Dec 13, 2013 at 2:14 AM, Raghavendra Bhat rab...@redhat.com wrote: Hi, There seems to be a bug in the ltable cleanup when disconnect is received in 3.5 and master. It's easy to reproduce. Just create a replicate volume. Start running dbench on the mount point, and do graph changes. The brick processes will crash while doing the ltable cleanup. Pranith and I looked at the code and found the issues below. static void ltable_delete_locks (struct _lock_table *ltable) { struct _locker *locker = NULL; struct _locker *tmp = NULL; list_for_each_entry_safe (locker, tmp, &ltable->inodelk_lockers, lockers) { if (locker->fd) pl_del_locker (ltable, locker->volume, &locker->loc, locker->fd, &locker->owner, GF_FOP_INODELK); GF_FREE (locker->volume); GF_FREE (locker); } list_for_each_entry_safe (locker, tmp, &ltable->entrylk_lockers, lockers) { if (locker->fd) pl_del_locker (ltable, locker->volume, &locker->loc, locker->fd, &locker->owner, GF_FOP_ENTRYLK); GF_FREE (locker->volume); GF_FREE (locker); } GF_FREE (ltable); } In the above function, the list of inodelks and entrylks is traversed and pl_del_locker is called for each lock with an fd. But in pl_del_locker, we are collecting all the locks with the same volume and owner as the arguments and deleting them at once (that too without unlocking them). But for locks without an fd, we are directly freeing up the objects without deleting them from the list (and without holding the ltable lock). This is the bug logged for the issue. https://bugzilla.redhat.com/show_bug.cgi?id=1042764 Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Mechanisms for automatic management of Gluster
James, This is the right way to think about the problem. I have more specific comments in the script, but just wanted to let you know this is a great start. Thanks! On Wed, Nov 27, 2013 at 7:42 AM, James purplei...@gmail.com wrote: Hi, This is along the lines of tools for sysadmins. I plan on using these algorithms for puppet-gluster, but will try to maintain them separately as a standalone tool. The problem: Given a set of bricks and servers, if they have a logical naming convention, can an algorithm decide the ideal order. This could allow parameters such as replica count, and chained=true/false/offset#. The second problem: Given a set of bricks in a volume, if someone adds X bricks and removes Y bricks, is this valid, and what is the valid sequence of add/remove brick commands. I've written some code with test cases to try and figure this all out. I've left out a lot of corner cases, but the boilerplate is there to make it happen. Hopefully it's self explanatory. (gluster.py) Read and run it. Once this all works, the puppet-gluster use case is magic. It will be able to take care of these operations for you (if you want). For non puppet users, this will give admins the confidence to know what commands they should _probably_ run in what order. I say probably because we assume that if there's an error, they'll stop and inspect first. I haven't yet tried to implement the chained cases, or anything involving striping. There are also some corner cases with some of the current code. Once you add chaining and striping, etc, I realized it was time to step back and ask for help :) I hope this all makes sense. Comments, code, test cases are appreciated! Cheers, James @purpleidea (irc/twitter) https://ttboj.wordpress.com/ ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
[Gluster-devel] GlusterFS 4.0 plan
Hello all, Here is a working draft of the plan for 4.0. It has pretty significant changes from the current model. Sending it out for early review/feedback. Further revisions will follow over time. https://gist.github.com/avati/af04f1030dcf52e16535#file-plan-md Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] important change to syncop infra
The decision of doing away with -O2 is beyond our control, and we shouldn't have code which depends on optimization being disabled in order to behave properly. Representing -errno as the return value is the cleanest fix (that's how other projects which use setcontext/getcontext behave too). If there are any further issues which arise from setcontext/getcontext, I'm tempted to change the internal implementation to use a vanilla pthread pool. Avati On Wed, Dec 11, 2013 at 10:29 PM, Anand Subramanian ansub...@redhat.com wrote: Is doing away with -O2 an option that was ever considered, or is it that we simply must have O2 on? (I understand that turning off O2 can open some so-far-unexposed can of worms and a lot of soaking may be required, and also that we may have had a good set of perf-related reasons to have settled on -O2 in the first place, but wanted to understand nevertheless...) Anand On 12/11/2013 02:21 PM, Pranith Kumar Karampuri wrote: Hi, We found a day-1 bug when the syncop_xxx() infra is used inside a synctask with compilation optimization (CFLAGS -O2). This bug has been dormant for at least 2 years. There are around ~400 (rebalance, replace-brick, bd, self-heal-daemon, quota, fuse lock/fd migration) places where syncop is used in the code base, all of which are potential candidates which can take the hit. I sent a first round of the patch at http://review.gluster.com/6475 to catch regressions upstream. These are the files that are affected by the changes I introduced to fix this: api/src/glfs-fops.c | 36 ++ api/src/glfs-handleops.c | 15 ++ api/src/glfs-internal.h | 7 +++ api/src/glfs-resolve.c | 10 ++ libglusterfs/src/syncop.c | 117 - xlators/cluster/afr/src/afr-self-heald.c | 45 +- xlators/cluster/afr/src/pump.c | 12 ++-- xlators/cluster/dht/src/dht-helper.c | 24 +++ xlators/cluster/dht/src/dht-rebalance.c | 168 ++-- - xlators/cluster/dht/src/dht-selfheal.c | 6 -- xlators/features/locks/src/posix.c | 3 ++- xlators/features/qemu-block/src/bdrv-xlator.c | 15 -- xlators/features/qemu-block/src/qb-coroutines.c | 14 ++ xlators/mount/fuse/src/fuse-bridge.c | 16 ++- Please review your respective component for these changes in gerrit. Thanks Pranith. Detailed explanation of the root cause: We found the bug in 'gf_defrag_migrate_data' in the rebalance operation. Let's look at the interesting parts of the function: int gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc, dict_t *migrate_data) { . code section - [ Loop ] while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL, entries)) != 0) { . code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a thread) /* Need to keep track of ENOENT errno, that means, there is no need to send more readdirp() */ readdir_operrno = errno; . code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread) ret = syncop_getxattr (this, entry_loc, dict, GF_XATTR_LINKINFO_KEY); code section - [ ERRNO-2 ] (checking for failures of syncop_getxattr(). This may not always be executed in the same thread which executed [SYNCOP-1]) if (ret < 0) { if (errno != ENODATA) { loglevel = GF_LOG_ERROR; defrag->total_failures += 1; . } The function above could be executed by thread (t1) till [SYNCOP-1], and the code from [ERRNO-2] can be executed by a different thread (t2) because of the way the syncop infra schedules the tasks.
when the code is compiled with -O2 optimization this is the assembly code that is generated: [ERRNO-1] 1165readdir_operrno = errno; errno gets expanded as *(__errno_location()) 0x7fd149d48b60 +496:callq 0x7fd149d410c0 __errno_location@plt 0x7fd149d48b72 +514:mov%rax,0x50(%rsp) -- Address returned by __errno_location() is stored in a special location in stack for later use. 0x7fd149d48b77 +519:mov(%rax),%eax 0x7fd149d48b79 +521:
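A self-contained illustration of the calling convention being adopted (return -errno instead of -1 plus thread-local errno), with do_op() standing in for any syncop_xxx() call; this is a sketch of the idea, not the actual patch.

/* Sketch: carry the error code in the return value so a caller that
 * resumes on a different thread still sees the right error. */
#include <errno.h>
#include <stdio.h>

/* stand-in for a syncop: returns 0 on success, -errno on failure */
static int
do_op (int fail_with)
{
        if (fail_with)
                return -fail_with;      /* e.g. -ENODATA, -ENOENT, ... */
        return 0;
}

int
main (void)
{
        int ret = do_op (ENODATA);

        /* the error travels in the return value, not in errno */
        if (ret < 0 && -ret != ENODATA)
                fprintf (stderr, "operation failed: %d\n", -ret);
        else if (ret < 0)
                printf ("ignoring ENODATA, as the rebalance code above does\n");
        return 0;
}

Checking errno after the call, as the pre-fix code does, is exactly what breaks when the task resumes on another thread.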
Re: [Gluster-devel] Standardizing interfaces for BD xlator
Mohan, It would be better to approach this problem by defining the FOPs and behavior at the xlator levels, and let gfapi calls be simple wrappers around the FOPs. We can introduce new FOPs splice() and reflink(), and discuss more on the best MERGE semantics. Avati On Mon, Dec 9, 2013 at 11:48 PM, M. Mohan Kumar mo...@in.ibm.com wrote: Hello, BD xlator provides certain features such as server offloaded copy, snapshot etc. But there is no standard way of invoking these operations due to the limitation in fops and system call interfaces. One has to issue setxattr interface to achieve these offload operations. Using setxattr interface in GlusterFS for all non standard operations becomes ugly and complicated. We are looking for adding new FOPs to cover these operations. glfs interfaces for BD xlator: --- We are looking for adding interfaces to libgfapi to facilitate consuming BD xlator features seamlessly. As of now one has to create a posix file and then issue setxattr/fsetxattr call to create a LV and map that LV to the posix file. For offload operations they have to get the gfid of the destination file and pass that gfid in {f}setxattr interface. Typical users of BD xlator will be qemu-img utility. To create a BD backed file on a GlusterFS volume, qemu-img has to issue glfs_create and glfs_fsetxattr, but it doesn't look elegant. Idea is to provide a single glfs call to create a posix file, BD and map that BD to the posix file. /* SYNOPSIS glfs_bd_creat: Create a posix file, BD and maps the posix file to BD in a BD GlusterFS volume. DESCRIPTION This function creates a posix file BD and maps them. This interface takes care of the transaction consistency case where posix file creation succeeded but BD creation failed for whatever reason, created posix file is deleted to make sure that file is not dangling. PARAMETERS @fs: The 'virtual mount' object to be initialized. @path: Path of the posix file within the virtual mount. @mode: Permission of the file to be created. @flags: Create flags. See open(2). O_EXCL is supported. RETURN VALUES NULL : Failure. @errno will be set with the type of failure. @errno: EOPNOTSUPP if underlying volume is not BD capable. Others : Pointer to the opened glfs_fd_t. */ struct glfs_fd * glfs_bd_create(struct glfs *fs, const char *path, int flags, mode_t mode); Also planning to provide glfs interfaces for other offload features of BD such as snapshot, clone and merge. This API can be used to abstract the steps involved in getting the gfid of the destination file and passing it to the setfattr interface (optionally mode parameter can be used to specify if the destination file has to be created, as of now bd xlator code expects the destination file to exist for offload operations). /* SYNOPSIS glfs_copy: Offloads copy operation between two files. DESCRIPTION This function optionally creates destination posix file and initiates server offloaded copy between them. Optionally based on the mode it could create destination file and issue glfs_{f}setxattr interface to do actual offload operation. PARAMETERS @fs: The 'virtual mount' object to be initialized. @source: Path of the source file within the virtual mount. @dest: Path of the destination file within the virtual mount. @flag: Specifies if destination file need to be created or not. @mode: Permission of the destination file to be created. RETURN VALUES -1 : Failure. @errno will be set with the type of failure. 
0 : Success */ int glfs_copy(struct glfs *fs, const char *source, const char *dest, int mode); Similarly int glfs_snapshot(struct glfs *fs, const char *source, const char *dest, int mode); int glfs_merge(struct glfs *fs, const char *snapshot); Upstream effort for server-offloaded copy and copy on write: --- Clone - offloaded copy: The FS community has already started discussing the interfaces for supporting server-offloaded copy. Initially it started with adding a new syscall 'copy_range' [https://patchwork.kernel.org/patch/2568761/], and later the plan became to use the existing splice system call itself to extend copy between two regular files [http://article.gmane.org/gmane.linux.kernel/1560133]. So is it safe to assume that splice is the way for copy offload, add these FOPs to GlusterFS (and XFS, FUSE as well), and support it in the BD xlator? Snapshot - reflink: Also, there is an upstream effort to provide interfaces for creating copy-on-write files (i.e. snapshots in LVM terminology) using the reflink syscall interface, but it is not merged upstream [http://lwn.net/Articles/331808/]. This snapshot feature is supported by BTRFS and OCFS2 through an ioctl interface. Can we assume it's the way for the snapshot interface and add FOPs similar
Re: [Gluster-devel] Questions on dht rebalance
On Sun, Nov 24, 2013 at 8:42 PM, Muralidhar Balcha muralidh...@gmail.com wrote: Hi, I have a couple of questions on rebalance functionality 1. When I ran the rebalance command, the gluster daemon that is responsible for migrating data terminated itself in the middle of the data migration. I may be missing something. How do I make the daemon wait until the data migration is complete? Under normal circumstances, the rebalance daemon terminates when it believes it has completed transferring all files it is supposed to. Do you have any logs with more details? 2. This question may not be related to rebalance: adding or removing a brick from a volume. If I add a new brick to a volume that is in use, how do existing clients learn about the new brick? Is there a notification mechanism by which each client refreshes its volinfo? Yes, clients poll glusterd waiting for config changes, and will swap to a new xlator graph at runtime if necessary. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Daily Coverity runs for GlusterFS?
Lala, It would be ideal if we hooked this into Jenkins and run along with regression test in parallel. How feasible is it to set it up that way? Avati On Fri, Nov 22, 2013 at 11:51 AM, Lalatendu Mohanty lmoha...@redhat.comwrote: Hi Gluster Ants, There is a way, we can automate the Coverity scan runs and we will get a email like below, which will tell us if any code issues introduced in to the code base. Will it be helpful/good if we run Coverity scan daily with the latest code base and send the results to gluster-devel@nongnu.org? I think it would be helpful. But wanted to take a feed back from you all before doing it. Feedback/ Thoughts? Thanks, Lala Original Message Subject: New Defects reported by Coverity Scan for GlusterFS Date: Thu, 21 Nov 2013 14:14:04 -0800 From: scan-ad...@coverity.com Hi, Please find the latest report on new defect(s) introduced to GlusterFS found with Coverity Scan. Defect(s) Reported-by: Coverity Scan Showing 7 of 10 defect(s) ** CID 1130760: Sizeof not portable (SIZEOF_MISMATCH) /xlators/encryption/crypt/src/data.c: 487 in set_config_avec_data() ** CID 1130759: Sizeof not portable (SIZEOF_MISMATCH) /xlators/encryption/crypt/src/data.c: 566 in set_config_avec_hole() ** CID 1130756: Unchecked return value (CHECKED_RETURN) /xlators/encryption/crypt/src/crypt.c: 2627 in crypt_fsetxattr() ** CID 1130755: Unchecked return value (CHECKED_RETURN) /xlators/encryption/crypt/src/crypt.c: 2649 in crypt_setxattr() ** CID 1130758: Dereference after null check (FORWARD_NULL) /xlators/encryption/crypt/src/crypt.c: 3298 in linkop_grab_local() ** CID 1130757: Null pointer dereference (FORWARD_NULL) /api/src/glfs-fops.c: 718 in glfs_preadv_async() /api/src/glfs-fops.c: 718 in glfs_preadv_async() ** CID 1124349: Unchecked return value (CHECKED_RETURN) /xlators/mgmt/glusterd/src/glusterd-volume-set.c: 120 in validate_cache_max_min_size() /xlators/mgmt/glusterd/src/glusterd-volume-set.c: 129 in validate_cache_max_min_size() /xlators/mgmt/glusterd/src/glusterd-volume-set.c: 130 in validate_cache_max_min_size() To view the defects in Coverity Scan visit, http://scan.coverity.com To unsubscribe from the email notification for new defects, http://scan5.coverity.com/cgi-bin/unsubscribe.py ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] How to find the current offset in an open file through libgfapi
You can: curr_offset = glfs_lseek (glfd, 0, SEEK_CUR) Avati On Wed, Nov 20, 2013 at 10:26 AM, Brad Childs b...@redhat.com wrote: Hello list, I'm trying to find the current offset of an open file through libgfapi. I have the glfs_fd_t of the file, I've done some reading and want to know the current location. Can I achieve this without manually keeping track after every read or seek? -bc ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
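A small usage sketch of the glfs_lseek() approach; the volume name, server and file path are placeholders, and error checking is omitted for brevity.

/* Query the current offset of a libgfapi fd without tracking it manually. */
#include <glusterfs/api/glfs.h>
#include <fcntl.h>
#include <stdio.h>

int
main (void)
{
        glfs_t *fs = glfs_new ("testvol");                      /* placeholder volume */
        glfs_set_volfile_server (fs, "tcp", "server1", 24007);  /* placeholder server */
        glfs_init (fs);

        glfs_fd_t *fd = glfs_open (fs, "/some/file", O_RDONLY); /* placeholder path */
        char buf[128];
        glfs_read (fd, buf, sizeof (buf), 0);

        /* current offset after the read, courtesy of SEEK_CUR */
        off_t curr = glfs_lseek (fd, 0, SEEK_CUR);
        printf ("current offset: %lld\n", (long long) curr);

        glfs_close (fd);
        glfs_fini (fs);
        return 0;
}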
Re: [Gluster-devel] Translator test harness
On Tue, Nov 19, 2013 at 5:35 AM, Jeff Darcy jda...@redhat.com wrote: On 11/18/2013 11:32 PM, Anand Avati wrote: It might be interesting to build a test harness using libgfapi (especially the handle-based APIs) to load a graph with the xlator to be tested (on top of posix) and use gfapi calls to bombard it with fops, notifications and callbacks from multiple threads spawned by the testing app/framework. Along with fault injection, we also need a pedantic verifier translator (loaded both on top of and below the testing xlator) which inspects all params of all calls and callbacks coming out of the xlator to check that they conform to the rules (e.g. lookup_cbk op_ret is either -1 or 0 ONLY, op_errno is one of the known standard values ONLY, struct stat does not have mtime/ctime/atime from too far ahead into the future, mkdir_cbk's struct stat has ia_type set to IA_IFDIR ONLY, etc.) Sounds like we have another volunteer. ;) Certainly. I have a half-done pedantic translator (from a few years ago) lying around somewhere and am trying to dig it out. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Translator test harness
On Tue, Nov 19, 2013 at 6:48 PM, Luis Pabon lpa...@redhat.com wrote: I'm definitely up for it, not just for the translators, but as Avati pointed out, a test harness for the GlusterFS system. I think, if possible, that the translator test harness is really a subclass a GlusterFS unit/functional test environment. I am currently in the process of qualifying some C Unit test frameworks (specifically those that provide mock frameworks -- Cmock, cmockery) to propose to the GlusterFS community as a foundation to the unit/functional tests environment. What I would like to see, and I still have a hard time finding, is a source coverage tool for C. Anyone know of one? gcov (+lcov) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] fallocate
On Sat, Nov 16, 2013 at 4:45 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Anand Avati av...@gluster.org wrote: If you call fallocate() over an existing region with data it shouldn't be wiped with 0s. You can also call fallocate() on a hole (in case the file was ftruncate()d to a large size) and that region should get allocated (i.e. a future write to an fallocate()d region should NOT fail with ENOSPC). It seems it can be emulated; should it be atomic? I am not aware of any app which depends on it being atomic (though Linux implementations probably are). BTW, does NetBSD have the equivalent of the open_by_handle[_at]() and name_to_handle[_at]() system calls? That is extended API set 2. With the exception of fexecve(2), I implemented them in NetBSD-current, which means they will be available in NetBSD-7.0. Are they also mandatory in glusterfs-3.5? If they are, then emulating fallocate() in userland is useless; I would be better off working on it in the kernel for the next release. Oh that's interesting, can I get pointers to see how NetBSD implements open_by_handle() and name_to_handle()? Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Translator test harness
On Mon, Nov 18, 2013 at 8:23 PM, Shyamsundar Ranganathan srang...@redhat.com wrote: - Original Message - From: Jeff Darcy jda...@redhat.com To: gluster-dev Gluster Devel gluster-devel@nongnu.org Sent: Monday, November 18, 2013 8:04:27 PM Subject: [Gluster-devel] Translator test harness Last week, Luis and I had a discussion about unit testing translator code. Unfortunately, the structure of a translator - a plugin with many entry points which interact in complex and instance-specific ways - is one that is notoriously challenging. Really, the only way to do it is to have some sort of a task-specific harness, with at least the following parts: * Code above to inject requests. * Code below to provide mocked replies to the translator's own requests. * Code on the side to track things like resources or locks acquired and released. Interesting. KP (Krishnan P) and I were discussing a fault injection translator (beyond the error injection that already exists in the code base), and were trying to narrow down some faults that we could inject, to check and see if it makes sense to add such a translator. This would be an ambitious undertaking, but not so ambitious that it's beyond reason. The benefits should be obvious. At this point, what I'm most interested in is volunteers to help define the requirements and scope so that we can propose this as a feature or task for some future GlusterFS release. Who's up for it? I would be interested in pitching in on this, and also hearing about extending this effort to cover fault injections if it makes sense. It might be interesting to build a test harness using libgfapi (especially the handle-based APIs) to load a graph with the xlator to be tested (on top of posix) and use gfapi calls to bombard it with fops, notifications and callbacks from multiple threads spawned by the testing app/framework. Along with fault injection, we also need a pedantic verifier translator (loaded both on top of and below the testing xlator) which inspects all params of all calls and callbacks coming out of the xlator to check that they conform to the rules (e.g. lookup_cbk op_ret is either -1 or 0 ONLY, op_errno is one of the known standard values ONLY, struct stat does not have mtime/ctime/atime from too far ahead into the future, mkdir_cbk's struct stat has ia_type set to IA_IFDIR ONLY, etc.) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
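As a flavour of what the pedantic verifier could assert, here is a standalone sketch with stand-in types (this is not an actual translator); the real thing would run checks like these in every _cbk it intercepts.

/* Sketch of "pedantic" callback checks, using stand-in types. */
#include <assert.h>
#include <time.h>

typedef enum { IA_INVAL = 0, IA_IFREG, IA_IFDIR, IA_IFLNK } ia_type_t;

struct iatt_stub {
        ia_type_t ia_type;
        time_t    ia_mtime;
};

static void
verify_mkdir_cbk (int op_ret, int op_errno, const struct iatt_stub *buf)
{
        /* op_ret must be exactly 0 or -1, nothing else */
        assert (op_ret == 0 || op_ret == -1);

        if (op_ret == -1) {
                /* op_errno must be a known, sane errno value */
                assert (op_errno > 0 && op_errno < 4096);
        } else {
                /* a successful mkdir must describe a directory ... */
                assert (buf && buf->ia_type == IA_IFDIR);
                /* ... whose mtime is not from far in the future */
                assert (buf->ia_mtime <= time (NULL) + 5);
        }
}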
Re: [Gluster-devel] fallocate
If you call fallocate() over an existing region with data it shouldn't be wiped with 0s. You can also call fallocate() on a hole (in case the file was ftruncate()d to a large size) and that region should get allocated (i.e. a future write to an fallocate()d region should NOT fail with ENOSPC). BTW, does NetBSD have the equivalent of the open_by_handle[_at]() and name_to_handle[_at]() system calls? On Sat, Nov 16, 2013 at 12:40 PM, Emmanuel Dreyfus m...@netbsd.org wrote: I note that the glusterfs-3.5 branch requires fallocate(). That one does not exist in NetBSD yet. I wonder if it can be emulated in userspace: this is just about writing zeros to the new size, right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
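For what it's worth, a naive, non-atomic userland emulation along the lines discussed could look like the sketch below: it grows the file if needed and then rewrites every block in the range so the filesystem has to allocate it, preserving any existing data. This is an assumption-laden sketch, not a drop-in replacement for fallocate().

/* Naive userland fallocate() emulation: not atomic, not crash-safe. */
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>
#include <unistd.h>

#define EMUL_BLOCK 4096

int
emulate_fallocate (int fd, off_t offset, off_t len)
{
        struct stat st;
        if (fstat (fd, &st) < 0)
                return -1;

        off_t end = offset + len;
        /* extend the file first if the range goes past EOF */
        if (end > st.st_size && ftruncate (fd, end) < 0)
                return -1;
        off_t file_end = (st.st_size > end) ? st.st_size : end;

        char buf[EMUL_BLOCK];
        off_t pos = offset - (offset % EMUL_BLOCK);   /* align down */

        for (; pos < end; pos += EMUL_BLOCK) {
                size_t n = EMUL_BLOCK;
                if ((off_t) n > file_end - pos)
                        n = (size_t) (file_end - pos);
                memset (buf, 0, n);
                ssize_t rd = pread (fd, buf, n, pos);
                if (rd < 0)
                        return -1;
                /* writing the block back (zeros where it was a hole)
                 * forces allocation without clobbering existing data */
                if (pwrite (fd, buf, n, pos) < 0)
                        return -1;
        }
        return 0;
}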
Re: [Gluster-devel] [Gluster-users] Fencing FOPs on data-split-brained files
Ravi, We should not mix up data and entry operation domains; if a file is in data split-brain, that should not stop a user from rename/link/unlink operations on the file. Regarding your concern about complications while healing - we should change our manual fixing instructions to: - go to the backend, access through the gfid path or normal path - rmxattr the afr changelogs - truncate the file to 0 bytes (like filename) Accessing the path through the gfid and truncating to 0 bytes addresses your concerns about hardlinks/renames. Avati On Wed, Nov 13, 2013 at 3:01 AM, Ravishankar N ravishan...@redhat.com wrote: Hi, Currently in glusterfs, when there is a data split-brain (only) on a file, we disallow the following operations from the mount-point by returning EIO to the application: - Writes to the file (truncate, dd, echo, cp etc) - Reads to the file (cat) - Reading extended attributes (getfattr) [1] However we do permit the following operations: -creating hardlinks -creating symlinks -mv -setattr -chmod -chown -touch -ls -stat While it makes sense to allow `ls` and `stat`, is it okay to add checks in the FOPS to disallow the other operations? Allowing creation of links and changing file attributes only seems to complicate things before the admin can go to the backend bricks and resolve the split-brain (by deleting all but the healthy copy of the file, including hardlinks). More so if the file is renamed before addressing the split-brain. Please share your thoughts. Thanks, Ravi [1] http://review.gluster.org/#/c/5988/ ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
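Purely as an illustration of those manual steps, the sketch below does the reset with plain syscalls against a file on the brick; admins would normally use setfattr -x and truncate from a shell instead, and the "trusted.afr." prefix is assumed to be the AFR changelog xattr naming in use.

/* Illustrative reset of a bad split-brain copy on the brick:
 * drop the AFR changelog xattrs and truncate the file to 0 bytes. */
#include <sys/types.h>
#include <sys/xattr.h>
#include <string.h>
#include <unistd.h>

int
reset_split_brain_copy (const char *path_on_brick)
{
        char list[4096];
        ssize_t len = listxattr (path_on_brick, list, sizeof (list));
        if (len < 0)
                return -1;

        /* 1. remove the AFR changelog xattrs (assumed "trusted.afr." prefix) */
        for (ssize_t off = 0; off < len; off += strlen (list + off) + 1) {
                const char *name = list + off;
                if (strncmp (name, "trusted.afr.", 12) == 0)
                        removexattr (path_on_brick, name);
        }

        /* 2. truncate the copy to 0 bytes so self-heal rebuilds it */
        return truncate (path_on_brick, 0);
}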
Re: [Gluster-devel] Rebalance Query
Both rebalance and self-healing actually run at lower priority by default. You can disable this low priority with: # gluster volume set <name> performance.enable-least-priority off Avati On Mon, Nov 11, 2013 at 8:55 PM, Paul Cuzner pcuz...@redhat.com wrote: Hi, I was asked today about the relative priority of rebalance. From what I understand, gluster does not perform any rate-limiting on rebalance or even geo-rep. Is this the case? Also, assuming that we don't rate-limit rebalance, can you confirm whether rebalance fops are lower priority than client I/O and whether there are any mechanisms to influence the priority of the rebalance? Again, from what I can see this isn't possible. Cheers, Paul C ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] glfs_readdir_r is painful
Eric, Thanks for the insights. I have posted a patch at http://review.gluster.org/6201 which clarifies the usage of glfs_readdir_r() and also introduce glfs_readdir(). Thanks, Avati On Wed, Oct 30, 2013 at 11:05 AM, Eric Blake ebl...@redhat.com wrote: On 10/30/2013 11:18 AM, Eric Blake wrote: The only safe way to use readdir_r is to know the maximum d_name that can possibly be returned, but there is no glfs_fpathconf() for determining that information. Your example usage of glfs_readdir_r() suggests that 512 bytes is large enough: https://forge.gluster.org/glusterfs-core/glusterfs/blobs/f44ada6cd9bcc5ab98ca66bedde4fe23dd1c3f05/api/examples/glfsxmp.c but I don't know if that is true. Okay, after a bit more investigation, I see: gf_dirent_to_dirent (gf_dirent_t *gf_dirent, struct dirent *dirent) { dirent-d_ino = gf_dirent-d_ino; #ifdef _DIRENT_HAVE_D_OFF dirent-d_off = gf_dirent-d_off; #endif #ifdef _DIRENT_HAVE_D_TYPE dirent-d_type = gf_dirent-d_type; #endif #ifdef _DIRENT_HAVE_D_NAMLEN dirent-d_namlen = strlen (gf_dirent-d_name); #endif strncpy (dirent-d_name, gf_dirent-d_name, 256); } I also discovered that 'getconf NAME_MAX /path/to/xfs/mount' is 255, so it looks like you got lucky (although strncpy is generally unsafe because it fails to write a NUL terminator if you truncate the string, it looks like you are guaranteed by XFS to never have a string that needs truncation). You _do_ have the advantage that since every brick backing a glusterfs volume is using an xfs file system, then you only have to worry about the NAME_MAX of xfs - but I don't know that value off the top of my head. Again, my research shows it is 255. Can you please let me know how big I should make my struct dirent to avoid buffer overflow, and properly document this in glusterfs/api/glfs.h? Furthermore, can you please provide a much saner glfs_readdir() so I don't have to worry about contortions of using a broken-by-design function? These requests are still in force. -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
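Given the gf_dirent_to_dirent() copy of up to 256 bytes quoted above and the NAME_MAX of 255 on XFS, a caller can at least size its buffer so that glfs_readdir_r() always has room for a 255-byte name plus the NUL terminator. A minimal sketch, assuming the glfs_opendir()/glfs_readdir_r()/glfs_closedir() signatures from api/glfs.h (the header path may differ between installs):

    #include <dirent.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <glusterfs/api/glfs.h>

    /* Oversize the dirent so d_name can hold 255 chars + NUL, regardless
     * of how small the platform's struct dirent happens to be. */
    static int list_dir (glfs_t *fs, const char *path)
    {
            union {
                    struct dirent entry;
                    char pad[offsetof (struct dirent, d_name) + 255 + 1];
            } buf;
            struct dirent *result = NULL;
            glfs_fd_t     *fd;

            fd = glfs_opendir (fs, path);
            if (!fd)
                    return -1;

            while (glfs_readdir_r (fd, &buf.entry, &result) == 0 && result)
                    printf ("%s\n", result->d_name);

            glfs_closedir (fd);
            return 0;
    }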
Re: [Gluster-devel] can glfs_fini ever succeed?
This was fixed upstream, and backported to release-3.4 as well. The fix will be part of 3.4.2. Avati On Wed, Oct 30, 2013 at 2:49 PM, Eric Blake ebl...@redhat.com wrote: I'm trying to use glusterfs-api, but ran into some questions on usage (currently targetting Fedora 19's glusterfs-api-devel-3.4.1-1.fc19.x86_64). It looks like glfs_fini() starts with 'ret = -1' and never assigns ret to any other value. This in turn leads to odd error messages; I explicitly coded my application to warn about a negative return value, but see: warning : virStorageBackendGlusterClose:52 : shutdown of gluster failed with errno 0 which contradicts the docs that say errno will be set on failure. Is this a bug where I should just ignore the return value as useless? -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
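Until running a build with that fix, callers may want to treat the return value defensively, only reporting failure when errno was actually set; a small sketch of that workaround (the wrapper name is made up):

    #include <errno.h>
    #include <stdio.h>
    #include <glusterfs/api/glfs.h>

    /* Pre-fix glfs_fini() always returns -1 but leaves errno at 0;
     * treat that combination as a spurious failure. */
    static void shutdown_fs (glfs_t *fs)
    {
            errno = 0;
            if (glfs_fini (fs) != 0 && errno != 0)
                    fprintf (stderr, "glfs_fini failed: errno=%d\n", errno);
    }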
Re: [Gluster-devel] glfs_readdir_r is painful
On Wed, Oct 30, 2013 at 3:31 PM, Eric Blake ebl...@redhat.com wrote: On 10/30/2013 04:08 PM, Anand Avati wrote: Eric, Thanks for the insights. I have posted a patch at http://review.gluster.org/6201 which clarifies the usage of glfs_readdir_r() and also introduce glfs_readdir(). Thanks for starting that. I see an off-by-one in that patch; pre-patch you did: strncpy (dirent-d_name, gf_dirent-d_name, 256); but post-patch, you have: strncpy (dirent-d_name, gf_dirent-d_name, GF_NAME_MAX); with GF_NAME_MAX set to either NAME_MAX or 255. This is a bug; you MUST strncpy at least 1 byte more than the maximum name if you are to guarantee a NUL-terminated d_name for the user. The buffer is guaranteed to be 0-inited, and strncpy with 255 is now guaranteed to have a NULL terminated string no matter how big the name was (which wasn't the case before, in case the name was 255 bytes). Oh, and NAME_MAX is not guaranteed to be defined as 255; if it is larger than 255 you are wasting memory compared to XFS, if it is less than 255 [although unlikely], you have made it impossible to return valid file names to the user. You may be better off just hard-coding GF_NAME_MAX to 255 regardless of what the system has for its NAME_MAX. Hmm, I don't think so.. strncpy of 255 bytes on to a buffer guaranteed to be 256 or higher and also guaranteed to be 0-memset'ed cannot return an invalid file name. No? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] glfs_readdir_r is painful
On Wed, Oct 30, 2013 at 3:54 PM, Eric Blake ebl...@redhat.com wrote: Hmm, I don't think so.. strncpy of 255 bytes on to a buffer guaranteed to be 256 or higher and also guaranteed to be 0-memset'ed cannot return an invalid file name. No? The fact that your internal glfs_readdir buffer is memset means you are safe there for a 255-byte filename; but that safety does not extend to glfs_readdir_r for a user buffer. Right! Fixed - http://review.gluster.org/#/c/6201/2 ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
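In isolation, the copy pattern the thread converges on: with GF_NAME_MAX fixed at 255, copying into a caller-supplied (possibly non-zeroed) buffer needs an explicit terminator, since strncpy() does not add one when the source is 255 bytes or longer. A small illustrative sketch, not the actual patch:

    #include <string.h>

    #define GF_NAME_MAX 255

    /* dst must be at least GF_NAME_MAX + 1 bytes. A caller-supplied
     * buffer, unlike an internal pre-zeroed one, carries no guarantee of
     * a trailing NUL after strncpy(), so set it explicitly. */
    static void copy_dname (char *dst, const char *src)
    {
            strncpy (dst, src, GF_NAME_MAX);
            dst[GF_NAME_MAX] = '\0';
    }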
Re: [Gluster-devel] Weirdness in glfs_mkdir()?
glfs_mkdir() (in glfs.h) accepts three params - @fs, @path, @mode. gfapi.py uses raw ctypes to provide python APIs - and therefore the bug of not accepting and passing @mode in the mkdir() method in gfapi.py is translating into a junk value getting received by glfs_mkdir (and random modes getting set for various dirs). You just witnessed the woe of a typeless system :) Avati On Mon, Oct 28, 2013 at 1:31 PM, Justin Clift jcl...@redhat.com wrote: Hi Avati, When creating directories through glfs_mkdir() - called through Python - the directories have inconsistent mode permissions. Is this expected? Here's the super simple code running directly in a Python 2.7.5 shell, on F19. It's a simple single brick volume, XFS underneath. Gluster compiled from git master head over the weekend: vol.mkdir('asdf') 0 vol.mkdir('asdf/111') 0 vol.mkdir('asdf/112') 0 vol.mkdir('asdf/113') 0 vol.mkdir('asdf/114') 0 vol.mkdir('asdf/115') 0 vol.mkdir('asdf/116') 0 vol.mkdir('asdf/117') 0 Looks ok from here, but ls -la shows the strangeness of the subdirs: $ sudo ls -la asdf/ total 0 dr-x-w. 9 root root 76 Oct 28 20:22 . drwxr-xr-x. 8 root root 114 Oct 28 20:22 .. d-w--w---T. 2 root root 6 Oct 28 20:22 111 d--x--x--T. 2 root root 6 Oct 28 20:22 112 dr--rw---T. 2 root root 6 Oct 28 20:22 113 drwx--x---. 2 root root 6 Oct 28 20:22 114 dr--rT. 2 root root 6 Oct 28 20:22 115 dr-x-w. 2 root root 6 Oct 28 20:22 116 drwx--x---. 2 root root 6 Oct 28 20:22 117 Easily worked around using chmod() after each mkdir(), but I'm not sure if this is a bug or not. ? Regards and best wishes, Justin Clift -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Weirdness in glfs_mkdir()?
On Mon, Oct 28, 2013 at 6:28 PM, Jay Vyas jayunit...@gmail.com wrote: wow. im surprised. this caught my eye, checked into the mode: glfs_mkdir (struct glfs *fs, const char *path, mode_t mode) So, somehow, the python API is capable of sending a mode which doesnt correspond to anything enumerated as part of the mode_t, but the C method still manages to write the file with a garbage mode ? That sounds like a bug not in python. Not in gluster... but in C ! :) [if im understanding this correctyle, which i might not be] Not sure whether its a bug or feature of C. C runtime is typeless. Python ctypes uses dlopen/dlsym which do symbol lookups - doesn't care whether the looked up symbol is a data structure or a function name - let alone do type matching! Avati On Mon, Oct 28, 2013 at 7:19 PM, Anand Avati av...@gluster.org wrote: glfs_mkdir() (in glfs.h) accepts three params - @fs, @path, @mode. gfapi.py uses raw ctypes to provide python APIs - and therefore the bug of not accepting and passing @mode in the mkdir() method in gfapi.py is translating into a junk value getting received by glfs_mkdir (and random modes getting set for various dirs). You just witnessed the woe of a typeless system :) Avati On Mon, Oct 28, 2013 at 1:31 PM, Justin Clift jcl...@redhat.com wrote: Hi Avati, When creating directories through glfs_mkdir() - called through Python - the directories have inconsistent mode permissions. Is this expected? Here's the super simple code running directly in a Python 2.7.5 shell, on F19. It's a simple single brick volume, XFS underneath. Gluster compiled from git master head over the weekend: vol.mkdir('asdf') 0 vol.mkdir('asdf/111') 0 vol.mkdir('asdf/112') 0 vol.mkdir('asdf/113') 0 vol.mkdir('asdf/114') 0 vol.mkdir('asdf/115') 0 vol.mkdir('asdf/116') 0 vol.mkdir('asdf/117') 0 Looks ok from here, but ls -la shows the strangeness of the subdirs: $ sudo ls -la asdf/ total 0 dr-x-w. 9 root root 76 Oct 28 20:22 . drwxr-xr-x. 8 root root 114 Oct 28 20:22 .. d-w--w---T. 2 root root 6 Oct 28 20:22 111 d--x--x--T. 2 root root 6 Oct 28 20:22 112 dr--rw---T. 2 root root 6 Oct 28 20:22 113 drwx--x---. 2 root root 6 Oct 28 20:22 114 dr--rT. 2 root root 6 Oct 28 20:22 115 dr-x-w. 2 root root 6 Oct 28 20:22 116 drwx--x---. 2 root root 6 Oct 28 20:22 117 Easily worked around using chmod() after each mkdir(), but I'm not sure if this is a bug or not. ? Regards and best wishes, Justin Clift -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel -- Jay Vyas http://jayunit100.blogspot.com ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
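For comparison, calling the C API directly with an explicit mode behaves deterministically; the Python-side fix is simply for gfapi.py's mkdir() to accept a mode argument and pass it through as the third ctypes parameter. A small sketch against the glfs_mkdir() signature quoted above (the fs handle is assumed to be already initialised and connected; the header path may vary):

    #include <glusterfs/api/glfs.h>

    /* Recreate the directories from the example above, this time passing
     * @mode explicitly instead of letting a junk value reach glfs_mkdir(). */
    static int make_dirs (glfs_t *fs)
    {
            if (glfs_mkdir (fs, "asdf", 0755) != 0)
                    return -1;
            if (glfs_mkdir (fs, "asdf/111", 0755) != 0)
                    return -1;
            return 0;
    }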
Re: [Gluster-devel] Change in glusterfs[master]: Transparent data encryption and metadata authentication in t...
On Thu, Oct 24, 2013 at 1:18 PM, Edward Shishkin edw...@redhat.com wrote: Hi all, So, here is the all-in-one-translator version represented by the Patch Set #2 at review.gluster.org/4667 Everything has been addressed except encryption in NFS mounts (see next mail for details). That is: . New design of EOF (end-of-file) handling; . No oplock translator on the server side; . All locks are acquired/released by the crypt translator; . Now we can encrypt srtiped and(or) replicated volumes. Common comments. In the new design all files on the server are padded, whereas the real file size is stored as xattr. So we introduce a special layer in the crypt translator, which performs file size translations: every time when any callback returns struct iatt, we update its ia_size with the real (non-padded) value. The most unpleasant thing in this new design is FOP-readdirp_cbk(): in this case we need N translations, i.e. N calls to the server (N is number of directory entries). To perform translations we spawn N children. We need a valid list of dirents after returning from FOP-readdirp_cbk() of previous translator, but we don't want to create a copy of this list (which can be large enough). For this reason we introduce a reference counter in struct gf_dirent_t and allocate dynamic structures gf_dirent_t (instead of on-stack ones), see respective changes in ./libglusterfs/src/gf-dirent.c ./libglusterfs/src/gf-dirent.h ./xlators/cluster/dht/src/dht-common.c ./xlators/protocol/client/src/client-rpc-fops.c [pasting from internal email reply] I had a look at the way you are handling readdirplus. I think it is overly complex. FOP-readdirplus() already has a parameter @xdata in which you can request per-entry xattr replies. So in crypt_readdirp() you need to: dict_set(xdata, FSIZE_XATTR_PREFIX, 0); Once you do that, in crypt_readdirp_cbk, you can expect each gf_dirent_t to have its dirent-dict set with FSIZE_XATTR_PREFIX. So you just need to iterate over replies in crypt_readdirp_cbk, update each dirent-d_stat.ia_size with value from dict_get_uint64(dirent-xdata, FSIZE_XATTR_PREFIX) Please look at how posix-acl does something very similar (loading per-entry ACLs into respective inodes via xattrs returned in readdirplus) Avati Thanks, Edward. On Mon, 14 Oct 2013 14:27:01 -0700 Anand Avati av...@redhat.com wrote: Edward, It looks like this patch requires a higher version of openssl (I recall you have mentioned before that that dependency was on version 1.0.1c? I checked yum update on the build server and the latest available version is 1.0.0-27. Is there a clean way to get the right version of openssl to a RHEL/CENTOS-6.x server? Also note that the previous submission of the patch was at http://review.gluster.org/4667. The recent on (http://review.gluster.org/6086) has a different Change-Id: in the commit log. It will be good if you can re-submit the patch with the old Change-Id (and abandon #6086) so that we can maintain the history of resubmission and the old work on records. Thanks! Avati On 10/14/2013 07:26 AM, Edward Shishkin (Code Review) wrote: Edward Shishkin has uploaded a new change for review. http://review.gluster.org/6086 Change subject: Transparent data encryption and metadata authentication in the systems with non-trusted server (take II) .. Transparent data encryption and metadata authentication in the systems with non-trusted server (take II) This new functionality can be useful in various cloud technologies. 
It is implemented via a special encryption/crypt translator, which works on the client side and performs encryption and authentication; 1. Class of supported algorithms The crypt translator can support any atomic symmetric block cipher algorithms (which require to pad plain/cipher text before performing encryption/decryption transform (see glossary in atom.c for definitions). In particular, it can support algorithms with the EOF issue (which require to pad the end of file by extra-data). Crypt translator performs translations user - (offset, size) - (aligned-offset, padded-size) -server (and backward), and resolves individual FOPs (write(), truncate(), etc) to read-modify-write sequences. A volume can contain files encrypted by different algorithms of the mentioned class. To change some option value just reconfigure the volume. Currently only one algorithm is supported: AES_XTS. Example of algorithms, which can not be supported by the crypt translator: 1. Asymmetric block cipher algorithms, which inflate data, e.g. RSA; 2. Symmetric block cipher algorithms with inline MACs for data authentication. 2. Implementation notes. a) Atomic algorithms Since any process
Re: [Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.
http://review.gluster.org/#/c/6031/ (patch to remove replace-brick data migration) is slated for merge before 3.5. Review comments (on gerrit) welcome. Thanks, Avati On Thu, Oct 3, 2013 at 9:27 AM, Anand Avati av...@gluster.org wrote: On Thu, Oct 3, 2013 at 8:57 AM, KueiHuan Chen kueihuan.c...@gmail.comwrote: Hi, Avati In your chained configuration, how to replace whole h1 without replace-brick ? Is there has a better way than replace brick in this situation ? h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 want to replace old h1.) You have a couple of options, A) replace-brick h1:/b1 h3:/b1 replace-brick h1:/b2 h3:/b2 and let self-heal bring the disks up to speed, or B) add-brick replica 2 h3:/b1 h2:/b2a add-brick replica 2 h3:/b2 h0:/b1a remove-brick h0:/b1 h1:/b2 start .. commit remove-brick h2:/b2 h1:/b1 start .. commit Let me know if you still have questions. Avati Thanks. Best Regards, KueiHuan-Chen Synology Incorporated. Email: khc...@synology.com Tel: +886-2-25521814 ext.827 2013/9/30 Anand Avati av...@gluster.org: On Fri, Sep 27, 2013 at 1:56 AM, James purplei...@gmail.com wrote: On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote: Hello all, Hey, Interesting timing for this post... I've actually started working on automatic brick addition/removal. (I'm planning to add this to puppet-gluster of course.) I was hoping you could help out with the algorithm. I think it's a bit different if there's no replace-brick command as you are proposing. Here's the problem: Given a logically optimal initial volume: volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2 suppose I know that I want to add/remove bricks such that my new volume (if I had created it new) looks like: volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2 What is the optimal algorithm for determining the correct sequence of transforms that are needed to accomplish this task. Obviously there are some simpler corner cases, but I'd like to solve the general case. The transforms are obviously things like running the add-brick {...} and remove-brick {...} commands. Obviously we have to take into account that it's better to add bricks and rebalance before we remove bricks and risk the file system if a replica is missing. The algorithm should work for any replica N. We want to make sure the new layout makes sense to replicate the data on different servers. In many cases, this will require creating a circular chain of bricks as illustrated in the bottom of this image: http://joejulian.name/media/uploads/images/replica_expansion.png for example. I'd like to optimize for safety first, and then time, I imagine. Many thanks in advance. I see what you are asking. First of all, when running a 2-replica volume you almost pretty much always want to have an even number of servers, and add servers in even numbers. Ideally the two sides of the replicas should be placed in separate failures zones - separate racks with separate power supplies or separate AZs in the cloud. Having an odd number of servers with an 2 replicas is a very odd configuration. In all these years I am yet to come across a customer who has a production cluster with 2 replicas and an odd number of servers. And setting up replicas in such a chained manner makes it hard to reason about availability, especially when you are trying recover from a disaster. Having clear and separate pairs is definitely what is recommended. 
That being said, nothing prevents one from setting up a chain like above as long as you are comfortable with the complexity of the configuration. And phasing out replace-brick in favor of add-brick/remove-brick does not make the above configuration impossible either. Let's say you have a chained configuration of N servers, with pairs formed between every: h(i):/b1 h((i+1) % N):/b2 | i := 0 - N-1 Now you add N+1th server. Using replace-brick, you have been doing thus far: 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was part of a previous brick 2. replace-brick h0:/b2 hN:/b2 start ... commit In case you are doing an add-brick/remove-brick approach, you would now instead do: 1. add-brick h(N-1):/b1a hN:/b2 2. add-brick hN:/b1 h0:/b2a 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit You will not be left with only 1 copy of a file at any point in the process, and achieve the same end result as you were with replace-brick. As mentioned before, I once again request you to consider if you really want to deal with the configuration complexity of having chained replication, instead of just adding servers in pairs. Please ask if there are any more questions or concerns. Avati James Some comments below, although I'm a bit tired so I hope I said it all right. DHT's remove
Re: [Gluster-devel] glusterfs 3.4.0 vs 3.4.1 potential packaging problem?
[2013-10-08 17:33:36.662549] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.1 (/usr/sbin/glusterd --debug) ... [2013-10-08 17:33:36.664191] W [xlator.c:185:xlator_dynload] 0-xlator: /usr/lib64/glusterfs/3.4.0/xlator/mgmt/glusterd.so: cannot open shared object file: No such file or directory I think the issue can be summarized with the above two log lines. glusterd binary is version 3.4.1 (PACKAGE_VERSION of glusterfsd is 3.4.1) but libglusterfs is trying to open .../3.4.0/...glusterd.so (i.e PACKAGE_VERSION during build of libglusterfs.so is 3.4.0). The reality in code today is that glusterfsd and libglusterfs must be built from the same version of the source tree (for reasons like above), and this needs to be captured in the packaging. I see that the glusterfs.spec.in in glusterfs.git has: Requires: %{name}-libs = %{version}-%{release} for the glusterfs-server RPM. That should have forced your glusterfs-libs to be updated to 3.4.1 as well. Kaleb, Can you confirm that the Fedora RPMs also have this internal dependency between packages? If it already does, I'm not sure how Jeff ended up with: glusterfs-libs-3.4.0-8.fc19.x86_64 glusterfs-3.4.1-1.fc19.x86_64 without doing a --force and/or --nodeps install. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.
On Thu, Oct 3, 2013 at 8:57 AM, KueiHuan Chen kueihuan.c...@gmail.comwrote: Hi, Avati In your chained configuration, how to replace whole h1 without replace-brick ? Is there has a better way than replace brick in this situation ? h0:/b1 h1:/b2 h1:/b1 h2:/b2 h2:/b1 h0:/b2 (A new h3 want to replace old h1.) You have a couple of options, A) replace-brick h1:/b1 h3:/b1 replace-brick h1:/b2 h3:/b2 and let self-heal bring the disks up to speed, or B) add-brick replica 2 h3:/b1 h2:/b2a add-brick replica 2 h3:/b2 h0:/b1a remove-brick h0:/b1 h1:/b2 start .. commit remove-brick h2:/b2 h1:/b1 start .. commit Let me know if you still have questions. Avati Thanks. Best Regards, KueiHuan-Chen Synology Incorporated. Email: khc...@synology.com Tel: +886-2-25521814 ext.827 2013/9/30 Anand Avati av...@gluster.org: On Fri, Sep 27, 2013 at 1:56 AM, James purplei...@gmail.com wrote: On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote: Hello all, Hey, Interesting timing for this post... I've actually started working on automatic brick addition/removal. (I'm planning to add this to puppet-gluster of course.) I was hoping you could help out with the algorithm. I think it's a bit different if there's no replace-brick command as you are proposing. Here's the problem: Given a logically optimal initial volume: volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2 suppose I know that I want to add/remove bricks such that my new volume (if I had created it new) looks like: volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2 What is the optimal algorithm for determining the correct sequence of transforms that are needed to accomplish this task. Obviously there are some simpler corner cases, but I'd like to solve the general case. The transforms are obviously things like running the add-brick {...} and remove-brick {...} commands. Obviously we have to take into account that it's better to add bricks and rebalance before we remove bricks and risk the file system if a replica is missing. The algorithm should work for any replica N. We want to make sure the new layout makes sense to replicate the data on different servers. In many cases, this will require creating a circular chain of bricks as illustrated in the bottom of this image: http://joejulian.name/media/uploads/images/replica_expansion.png for example. I'd like to optimize for safety first, and then time, I imagine. Many thanks in advance. I see what you are asking. First of all, when running a 2-replica volume you almost pretty much always want to have an even number of servers, and add servers in even numbers. Ideally the two sides of the replicas should be placed in separate failures zones - separate racks with separate power supplies or separate AZs in the cloud. Having an odd number of servers with an 2 replicas is a very odd configuration. In all these years I am yet to come across a customer who has a production cluster with 2 replicas and an odd number of servers. And setting up replicas in such a chained manner makes it hard to reason about availability, especially when you are trying recover from a disaster. Having clear and separate pairs is definitely what is recommended. That being said, nothing prevents one from setting up a chain like above as long as you are comfortable with the complexity of the configuration. And phasing out replace-brick in favor of add-brick/remove-brick does not make the above configuration impossible either. 
Let's say you have a chained configuration of N servers, with pairs formed between every: h(i):/b1 h((i+1) % N):/b2 | i := 0 - N-1 Now you add N+1th server. Using replace-brick, you have been doing thus far: 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was part of a previous brick 2. replace-brick h0:/b2 hN:/b2 start ... commit In case you are doing an add-brick/remove-brick approach, you would now instead do: 1. add-brick h(N-1):/b1a hN:/b2 2. add-brick hN:/b1 h0:/b2a 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit You will not be left with only 1 copy of a file at any point in the process, and achieve the same end result as you were with replace-brick. As mentioned before, I once again request you to consider if you really want to deal with the configuration complexity of having chained replication, instead of just adding servers in pairs. Please ask if there are any more questions or concerns. Avati James Some comments below, although I'm a bit tired so I hope I said it all right. DHT's remove-brick + rebalance has been enhanced in the last couple of releases to be quite sophisticated. It can handle graceful decommissioning of bricks, including open file descriptors and hard links. Sweet This in a way is a feature overlap
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Tue, Oct 1, 2013 at 4:49 AM, Emmanuel Dreyfus m...@netbsd.org wrote: Justin Clift jcl...@redhat.com wrote: Towards this we need some extensions to gfapi that can handle object based operations. Meaning, instead of using full paths or relative paths from cwd, it is required that we can work with APIs, like the *at POSIX variants, to be able to create, lookup, open etc. files and directories. snip Any idea if this would impact our *BSD compatibility? :) NetBSD 6.1 only has partial linkat(2). NetBSD-current (will-be NetBSD-7.0) has all of extended API set 2, except fexecve(2) and O_EXEC, for which no consensus was reached on how to implement them securely. In a nutshell, switching to *at() kills NetBSD compatibility until the next major release, but I already know it will be restored at that time. The context here is the POSIX-like style of API exposed by GFAPI, and not dependent on what syscalls the platform provides. Good to know (separately) that the *at() syscalls will be supported in NetBSD sometime. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
in place from the glfs_resolve_inode implementation as suggested earlier, but good to check. 4) Renames In the case of renames, the inode remains the same, hence all handed out object handles still are valid and will operate on the right object per se. 5) unlinks and recreation of the same _named_ object in the background Example being, application gets an handle for an object, say named a.txt, and in the background (or via another application/client) this is deleted and recreated. This will return ENOENT as the GFID would have changed for the previously held object to the new one, even though the names are the same. This seems like the right behaviour, and does not change in the case of a 1:1 of an N:1 object handle to inode mapping. So bottom line, I see the object handles like an fd with the noted difference above. Having them in a 1:1 relationship or as a N:1 relationship does not seem to be an issue from what I understand, what am I missing here? The issue is this. From what I understand, the usage of glfs_object in the FSAL is not like a per-operation handle, but something stored long term (many minutes, hours, days) in the per-inode context of the NFS Ganesha layer. Now NFS Ganesha may be doing the right thing by not re-looking up an already looked up name and therefore avoiding a leak (I'm not so sure, it still needs to verify every so often if the mapping is still valid). From NFS Ganesha's point of view the handle is changing on every lookup. Now consider what happens in case of READDIRPLUS. A list of names and handles are returned to the client. The list of names can possibly include names which were previously looked up as well. Both are supposed to represent the same gfid, but here will be returning new glfs_objects. When a client performs an operation on a GFID, on which glfs_object will the operation be performed at the gfapi layer? This part seems very ambiguous and not clear. What would really help is if you can tell what a glfs_object is supposed to represent? - an on disk inode (i.e GFID)? an in memory per-graph inode (i.e inode_t)? A dentry? A per-operation handle to an on disk inode? A per-operation handle to an in memory per-graph inode? A per operation handle to a dentry? In the current form, it does not seem to fit any of the these categories. Avati Shyam -- *From: *Anand Avati av...@gluster.org *To: *Shyamsundar Ranganathan srang...@redhat.com *Cc: *Gluster Devel gluster-devel@nongnu.org *Sent: *Monday, September 30, 2013 10:35:05 AM *Subject: *Re: RFC/Review: libgfapi object handle based extensions I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t? Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past. We need a stronger relationship between the two with a clearer relationship. It is probably necessary for a glfs_object to represent mulitple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call. Avati On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. 
By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. Shyam. - Original Message - From: Anand Avati av...@gluster.org To: Shyamsundar Ranganathan srang...@redhat.com Cc: Gluster Devel gluster-devel@nongnu.org Sent: Wednesday, September 18, 2013 11:39:27 AM Subject: Re: RFC/Review: libgfapi
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Mon, Sep 30, 2013 at 9:34 AM, Anand Avati av...@gluster.org wrote: On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Amar, Amar, Anand S and myself had a discussion on this comment and here is an answer to your queries the way I see it. Let me know if I am missing something here. (this is not a NFS Ganesha requirement, FYI. As Ganesha will only do a single lookup or preserve a single object handle per filesystem object in its cache) Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object. 1) The similarity and differences of object handles to fds The intention of multiple object handles is in lines with multiple fd's per file, an application using the library is free to lookup (and/or create (and its equivalents)) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such object in its view. So in essence one thread can have an object handle to perform, say attribute related operations, whereas another thread has the same object looked up to perform IO. So do you mean a glfs_object is meant to be a *per-operation* handle? If one thread wants to perform a chmod() and another thread wants to perform chown() and both attempt to resolve the same name and end up getting different handles, then both of them unref the glfs_handle right after their operation? Where the object handles depart from the notion of fds is when an unlink is performed. As POSIX defines that open fds are still _open_ for activities on the file, the life of an fd and the actual object that it points to is till the fd is closed. In the case of object handles though, the moment any handle is used to unlink the object (which BTW is done using the parent object handle and the name of the child), all handles pointing to the object are still valid pointers, but operations on then will result in ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem. Not always. If the file had hardlinks the handle should still be valid. And if there were no hardlinks and you unlinked the last link, further operations must return ESTALE. ENOENT is when a basename does not resolve to a handle (in entry operations) - for e.g when you try to unlink the same entry a second time. Whereas ESTALE is when a presented handle does not exist - for e.g when you try to operate (read, chmod) a handle which got deleted. The departure from fds is considered valid in my perspective, as the handle points to an object, which has since been removed, and so there is no semantics here that needs it to be preserved for further operations as there is a reference to it held. The departure is only in the behavior of unlinked files. That is orthogonal to whether you want to return separate handles each time a component is looked up. I fail to see how the departure from fd behavior justifies creating new glfs_object per lookup? So in essence for each time an object handle is returned by the API, it has to be closed for its life to end. Additionally if the object that it points to is removed from the underlying system, the handle is pointing to an entry that does not exist any longer and returns ENOENT on operations using the same. 
2) The issue/benefit of having the same object handle irrespective of looking it up multiple times If we have an 1-1 relationship of object handles (i.e struct glfs_object) to inodes, then the caller gets the same pointer to the handle. Hence having multiple handles as per the caller, boils down to giving out ref counted glfs_object(s) for the same inode. Other than the memory footprint, this will still not make the object live past it's unlink time. The pointer handed out will be still valid till the last ref count is removed (i.e the object handle closed), at which point the object handle can be destroyed. If I understand what you say above correctly, you intend to solve the problem of unlinked files must return error at your API layer? That's wrong. The right way is to ref-count glfs_object and return them precisely because you should NOT make the decision about the end of life of an inode at that layer. A hardlink may have been created by another client and the glfs_object may therefore be still be valid. You are also returning separate glfs_object for different hardlinks of a file. Does that mean glfs_object is representing a dentry? or a per-operation reference to an inode? So again, as many handles were handed out for the same inode, they have to be closed, etc. 3) Graph switches In the case of graph switches, handles that are used in operations post the switch, get refreshed with an inode from the new graph, if we have an N:1 object to inode relationship. In the case of 1:1
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Mon, Sep 30, 2013 at 12:49 PM, Anand Avati av...@gluster.org wrote: On Mon, Sep 30, 2013 at 9:34 AM, Anand Avati av...@gluster.org wrote: On Mon, Sep 30, 2013 at 3:40 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Amar, Amar, Anand S and myself had a discussion on this comment and here is an answer to your queries the way I see it. Let me know if I am missing something here. (this is not a NFS Ganesha requirement, FYI. As Ganesha will only do a single lookup or preserve a single object handle per filesystem object in its cache) Currently a glfs_object is an opaque pointer to an object (it is a _handle_ to the object). The object itself contains a ref'd inode, which is the actual pointer to the object. 1) The similarity and differences of object handles to fds The intention of multiple object handles is in lines with multiple fd's per file, an application using the library is free to lookup (and/or create (and its equivalents)) and acquire as many object handles as it wants for a particular object, and can hence determine the lifetime of each such object in its view. So in essence one thread can have an object handle to perform, say attribute related operations, whereas another thread has the same object looked up to perform IO. So do you mean a glfs_object is meant to be a *per-operation* handle? If one thread wants to perform a chmod() and another thread wants to perform chown() and both attempt to resolve the same name and end up getting different handles, then both of them unref the glfs_handle right after their operation? Where the object handles depart from the notion of fds is when an unlink is performed. As POSIX defines that open fds are still _open_ for activities on the file, the life of an fd and the actual object that it points to is till the fd is closed. In the case of object handles though, the moment any handle is used to unlink the object (which BTW is done using the parent object handle and the name of the child), all handles pointing to the object are still valid pointers, but operations on then will result in ENOENT, as the actual object has since been unlinked and removed by the underlying filesystem. Not always. If the file had hardlinks the handle should still be valid. And if there were no hardlinks and you unlinked the last link, further operations must return ESTALE. ENOENT is when a basename does not resolve to a handle (in entry operations) - for e.g when you try to unlink the same entry a second time. Whereas ESTALE is when a presented handle does not exist - for e.g when you try to operate (read, chmod) a handle which got deleted. The departure from fds is considered valid in my perspective, as the handle points to an object, which has since been removed, and so there is no semantics here that needs it to be preserved for further operations as there is a reference to it held. The departure is only in the behavior of unlinked files. That is orthogonal to whether you want to return separate handles each time a component is looked up. I fail to see how the departure from fd behavior justifies creating new glfs_object per lookup? So in essence for each time an object handle is returned by the API, it has to be closed for its life to end. Additionally if the object that it points to is removed from the underlying system, the handle is pointing to an entry that does not exist any longer and returns ENOENT on operations using the same. 
2) The issue/benefit of having the same object handle irrespective of looking it up multiple times If we have an 1-1 relationship of object handles (i.e struct glfs_object) to inodes, then the caller gets the same pointer to the handle. Hence having multiple handles as per the caller, boils down to giving out ref counted glfs_object(s) for the same inode. Other than the memory footprint, this will still not make the object live past it's unlink time. The pointer handed out will be still valid till the last ref count is removed (i.e the object handle closed), at which point the object handle can be destroyed. If I understand what you say above correctly, you intend to solve the problem of unlinked files must return error at your API layer? That's wrong. The right way is to ref-count glfs_object and return them precisely because you should NOT make the decision about the end of life of an inode at that layer. A hardlink may have been created by another client and the glfs_object may therefore be still be valid. You are also returning separate glfs_object for different hardlinks of a file. Does that mean glfs_object is representing a dentry? or a per-operation reference to an inode? So again, as many handles were handed out for the same inode, they have to be closed, etc. 3) Graph switches In the case of graph switches, handles that are used in operations post the switch, get refreshed with an inode from the new graph, if we
Re: [Gluster-devel] Phasing out replace-brick for data migration in favor of remove-brick.
On Fri, Sep 27, 2013 at 10:15 AM, Amar Tumballi ama...@gmail.com wrote: I plan to send out patches to remove all traces of replace-brick data migration code by 3.5 branch time. Thanks for the initiative, let me know if you need help. I could use help here, if you have free cycles to pick up this task? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
Now consider what happens in case of READDIRPLUS. A list of names and handles are returned to the client. The list of names can possibly include names which were previously looked up as well. Both are supposed to represent the same gfid, but here will be returning new glfs_objects. When a client performs an operation on a GFID, on which glfs_object will the operation be performed at the gfapi layer? This part seems very ambiguous and not clear. I should have made a note for readdirplus earlier, this would default to the fd based version of the same, not a handle/object based version of the same. So we would transition from an handle to an fd via glfs_h_opendir and then continue with the readdir variants. if I look at the POSIX *at routines, this seem about right, but of course we may have variances here. You would get an fd for the directory on which the READDIRPLUS is attempted. I was referring to the replies, where every entry needs to be returned with its own handle (on which operations can arrive without LOOKUP). Think of READDIRPLUS as bulk LOOKUP. What would really help is if you can tell what a glfs_object is supposed to represent? - an on disk inode (i.e GFID)? an in memory per-graph inode (i.e inode_t)? A dentry? A per-operation handle to an on disk inode? A per-operation handle to an in memory per-graph inode? A per operation handle to a dentry? In the current form, it does not seem to fit any of the these categories. Well I think of it as a handle to an file system object. Having said that, if we just returned the inode pointer as this handle, the graph switches can cause a problem, in which case we need to default to the (as per my understanding) the FUSE manner of working. keeping the handle 1:1 via other infrastructure does not seem beneficial ATM. I think you cover this in the subsequent mail so let us continue there. That is correct, using inode_t will force us to behave like FUSE. As mentioned in the other mail, we are probably better off fixing that and using inode_t in a cleaner way in both FUSE and gfapi. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Thu, Sep 26, 2013 at 3:55 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: - Original Message - From: Shyamsundar Ranganathan srang...@redhat.com To: gluster-devel@nongnu.org Cc: ana...@redhat.com Sent: Friday, September 13, 2013 1:48:19 PM Subject: RFC/Review: libgfapi object handle based extensions - We do need the APIs to extend themselves to do any ID based operations, say creating with a specific UID/GID rather than the running process UID/GID that can prove detrimental in a multi threaded, multi connection handling server protocol like the NFS Ganesha implementation In continuation of the original mail, we need to handle the one item above. Where we need to pass in the UID/GID to be used when performing the operations. Here is a suggestion for review on achieving the same, (for current code implementation of handle APIs look at, http://review.gluster.org/#/c/5936/) 1) Modify the handle based APIs to take in a opctx (operation context, concept borrowed from Ganesha) So, instead of, glfs_h_creat (struct glfs *fs, struct glfs_object *parent, const char *path, int flags, mode_t mode, struct stat *stat) it would be, glfs_h_creat (struct glfs *fs, struct glfs_optctx *opctx, struct glfs_object *parent, const char *path, int flags, mode_t mode, struct stat *stat) Where, struct glfs_optctx { uid_t caller_uid; gid_t caller_gid; } Later as needed this operation context can be extended for other needs like, client connection address or ID, supplementary groups, etc. 2) Internal to the glfs APIs (esp. handle based APIs), use this to set thread local variables (UID/GID) that the syncop frame creation can pick up in addition to the current probe of geteuid/egid. (as suggested by Avati) If the basic construct looks fine I will amend my current review with this change in the create API and syncop.h (etc.), and once reviewed extend it to other handle based APIs as appropriate. I am somewhat hesitant to expose a structure to be filled by the user, where the structure can grow over time. Providing APIs like glfs_setfsuid()/glfs_setfsgid()/glfs_setgroups(), which internally uses thread local variables to communicate the values to syncop_create_frame() is probably a cleaner approach. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
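A minimal sketch of what such setters could look like with thread-local variables; this is only an illustration of the suggestion, not the actual gfapi implementation, and syncop_create_frame() would have to consult these values in place of geteuid()/getegid():

    #include <sys/types.h>

    /* Per-thread credentials, defaulting to "unset". */
    static __thread uid_t glfs_thread_fsuid = (uid_t) -1;
    static __thread gid_t glfs_thread_fsgid = (gid_t) -1;

    uid_t glfs_setfsuid (uid_t fsuid)
    {
            uid_t old = glfs_thread_fsuid;

            glfs_thread_fsuid = fsuid;
            return old;
    }

    gid_t glfs_setfsgid (gid_t fsgid)
    {
            gid_t old = glfs_thread_fsgid;

            glfs_thread_fsgid = fsgid;
            return old;
    }

A consumer such as NFS-Ganesha would call these on the worker thread before each handle-based operation, so the frame created for the FOP carries the caller's credentials rather than those of the server process.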
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t? Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past. We need a stronger relationship between the two with a clearer relationship. It is probably necessary for a glfs_object to represent mulitple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call. Avati On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. Shyam. - Original Message - From: Anand Avati av...@gluster.org To: Shyamsundar Ranganathan srang...@redhat.com Cc: Gluster Devel gluster-devel@nongnu.org Sent: Wednesday, September 18, 2013 11:39:27 AM Subject: Re: RFC/Review: libgfapi object handle based extensions Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient. There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too) 1. glfs_validate_inode() has the assumption that object-inode deref is a guarded operation, but here we are doing an unguarded deref in the paramter glfs_resolve_base(). 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and inode can get resolved to a different graph. And in syncop_XX() we end up calling on graph1 with inode belonging to graph2. 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). 
Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason) I think the pattern should be similar to FD based fops which specifically address both the above problems. Here's an outline: glfs_h_(struct glfs *fs, glfs_object *object, ...) { xlator_t *subvol = NULL; inode_t *inode = NULL; __glfs_entry_fs (fs); subvol = glfs_active_subvol (fs); if (!subvol) { errno = EIO; ... goto out; } inode = glfs_resolve_inode (fs, object, subvol); if (!inode) { errno = ESTALE; ... goto out; } loc.inode = inode; ret = syncop_(subvol, loc, ...); } Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object, and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXX() is performed with matching subvol and inode. Also it returns the inode pointer so that no unsafe object-inode deref is done by the caller. Again, this is the same pattern followed by the fd based fops already. Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t. Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
Also note that the same glfs_object must be re-used in readdirplus (once we have a _h_ equivalent of the API) Avati On Sun, Sep 29, 2013 at 10:05 PM, Anand Avati av...@gluster.org wrote: I see a pretty core issue - lifecycle management of 'struct glfs_object'. What is the structure representing? When is it created? When is it destroyed? How does it relate to inode_t? Looks like for every lookup() we are creating a new glfs_object, even if the looked up inode was already looked up before (in the cache) and had a glfs_object created for it in the recent past. We need a stronger relationship between the two with a clearer relationship. It is probably necessary for a glfs_object to represent mulitple inode_t's at different points in time depending on graph switches, but for a given inode_t we need only one glfs_object. We definitely must NOT have a new glfs_object per lookup call. Avati On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. Shyam. - Original Message - From: Anand Avati av...@gluster.org To: Shyamsundar Ranganathan srang...@redhat.com Cc: Gluster Devel gluster-devel@nongnu.org Sent: Wednesday, September 18, 2013 11:39:27 AM Subject: Re: RFC/Review: libgfapi object handle based extensions Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient. There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too) 1. glfs_validate_inode() has the assumption that object-inode deref is a guarded operation, but here we are doing an unguarded deref in the paramter glfs_resolve_base(). 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and inode can get resolved to a different graph. And in syncop_XX() we end up calling on graph1 with inode belonging to graph2. 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). 
Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason) I think the pattern should be similar to FD based fops which specifically address both the above problems. Here's an outline: glfs_h_(struct glfs *fs, glfs_object *object, ...) { xlator_t *subvol = NULL; inode_t *inode = NULL; __glfs_entry_fs (fs); subvol = glfs_active_subvol (fs); if (!subvol) { errno = EIO; ... goto out; } inode = glfs_resolve_inode (fs, object, subvol); if (!inode) { errno = ESTALE; ... goto out; } loc.inode = inode; ret = syncop_(subvol, loc, ...); } Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object, and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXX() is performed with matching subvol and inode. Also it returns the inode pointer so that no unsafe object-inode deref is done by the caller. Again, this is the same pattern followed by the fd based fops already. Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t. Thanks, Avati ___ Gluster-devel mailing list Gluster-devel
Re: [Gluster-devel] [Gluster-users] Phasing out replace-brick for data migration in favor of remove-brick.
On Fri, Sep 27, 2013 at 1:56 AM, James purplei...@gmail.com wrote: On Fri, 2013-09-27 at 00:35 -0700, Anand Avati wrote: Hello all, Hey, Interesting timing for this post... I've actually started working on automatic brick addition/removal. (I'm planning to add this to puppet-gluster of course.) I was hoping you could help out with the algorithm. I think it's a bit different if there's no replace-brick command as you are proposing. Here's the problem: Given a logically optimal initial volume: volA: rep=2; h1:/b1 h2:/b1 h3:/b1 h4:/b1 h1:/b2 h2:/b2 h3:/b2 h4:/b2 suppose I know that I want to add/remove bricks such that my new volume (if I had created it new) looks like: volB: rep=2; h1:/b1 h3:/b1 h4:/b1 h5:/b1 h6:/b1 h1:/b2 h3:/b2 h4:/b2 h5:/b2 h6:/b2 What is the optimal algorithm for determining the correct sequence of transforms that are needed to accomplish this task. Obviously there are some simpler corner cases, but I'd like to solve the general case. The transforms are obviously things like running the add-brick {...} and remove-brick {...} commands. Obviously we have to take into account that it's better to add bricks and rebalance before we remove bricks and risk the file system if a replica is missing. The algorithm should work for any replica N. We want to make sure the new layout makes sense to replicate the data on different servers. In many cases, this will require creating a circular chain of bricks as illustrated in the bottom of this image: http://joejulian.name/media/uploads/images/replica_expansion.png for example. I'd like to optimize for safety first, and then time, I imagine. Many thanks in advance. I see what you are asking. First of all, when running a 2-replica volume you almost pretty much always want to have an even number of servers, and add servers in even numbers. Ideally the two sides of the replicas should be placed in separate failures zones - separate racks with separate power supplies or separate AZs in the cloud. Having an odd number of servers with an 2 replicas is a very odd configuration. In all these years I am yet to come across a customer who has a production cluster with 2 replicas and an odd number of servers. And setting up replicas in such a chained manner makes it hard to reason about availability, especially when you are trying recover from a disaster. Having clear and separate pairs is definitely what is recommended. That being said, nothing prevents one from setting up a chain like above as long as you are comfortable with the complexity of the configuration. And phasing out replace-brick in favor of add-brick/remove-brick does not make the above configuration impossible either. Let's say you have a chained configuration of N servers, with pairs formed between every: h(i):/b1 h((i+1) % N):/b2 | i := 0 - N-1 Now you add N+1th server. Using replace-brick, you have been doing thus far: 1. add-brick hN:/b1 h0:/b2a # because h0:/b2 was part of a previous brick 2. replace-brick h0:/b2 hN:/b2 start ... commit In case you are doing an add-brick/remove-brick approach, you would now instead do: 1. add-brick h(N-1):/b1a hN:/b2 2. add-brick hN:/b1 h0:/b2a 3. remove-brick h(N-1):/b1 h0:/b2 start ... commit You will not be left with only 1 copy of a file at any point in the process, and achieve the same end result as you were with replace-brick. As mentioned before, I once again request you to consider if you really want to deal with the configuration complexity of having chained replication, instead of just adding servers in pairs. 
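To make the chained layout and the add-brick/remove-brick transition above concrete, here is a small stand-alone illustration (editorial sketch; the host/brick names are the placeholders used in the mail, with N = 4 chosen arbitrarily):

    #include <stdio.h>

    int main (void)
    {
            int N = 4;   /* current number of servers in the chain */
            int i;

            /* pairs formed between h(i):/b1 and h((i+1) % N):/b2 */
            for (i = 0; i < N; i++)
                    printf ("replica pair: h%d:/b1  h%d:/b2\n", i, (i + 1) % N);

            /* growing the chain to N+1 servers without replace-brick */
            printf ("1. add-brick h%d:/b1a h%d:/b2\n", N - 1, N);
            printf ("2. add-brick h%d:/b1 h%d:/b2a\n", N, 0);
            printf ("3. remove-brick h%d:/b1 h%d:/b2 start ... commit\n",
                    N - 1, 0);
            return 0;
    }

Printing the pairs before and after the three steps shows that every file keeps at least two live copies throughout, which is the guarantee claimed above.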
Please ask if there are any more questions or concerns. Avati James Some comments below, although I'm a bit tired so I hope I said it all right. DHT's remove-brick + rebalance has been enhanced in the last couple of releases to be quite sophisticated. It can handle graceful decommissioning of bricks, including open file descriptors and hard links. Sweet This in a way is a feature overlap with replace-brick's data migration functionality. Replace-brick's data migration is currently also used for planned decommissioning of a brick. Reasons to remove replace-brick (or why remove-brick is better): - There are two methods of moving data. It is confusing for the users and hard for developers to maintain. - If server being replaced is a member of a replica set, neither remove-brick nor replace-brick data migration is necessary, because self-healing itself will recreate the data (replace-brick actually uses self-heal internally) - In a non-replicated config if a server is getting replaced by a new one, add-brick new + remove-brick old start achieves the same goal as replace-brick old new start. - In a non-replicated config, replace-brick is NOT glitch free (applications witness ENOTCONN if they are accessing data) whereas add-brick new + remove-brick old is completely transparent. - Replace brick strictly requires a server with enough
Re: [Gluster-devel] Finalizing interfaces for snapshot and clone creation in BD xlator
Adding Brian Foster (and gluster-devel) for the discussion of unified UI for snapshotting. Mohan, I must have missed your comment. Can you please point to the specific patch where you posted your comment? Avati On Tue, Sep 24, 2013 at 9:29 AM, M. Mohan Kumar mohankuma...@gmail.comwrote: Hi Avati, I am ready with V5 of BD xlator patches (I consolidated the patches to 5). Before posting them I wanted your opinion about the interfaces I use for creating clone and snapshot. I posted them on Gerrit few days back. Could you please respond to that? -- Regards, Mohan. ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Possible memory leak in gluster samba vfs
On Tue, Sep 24, 2013 at 6:37 PM, haiwei.xie-soulinfo haiwei@soulinfo.com wrote: hi, Our patch for this bug, running looks good. smbd will not exit with oom-kill. But it's not correct method. git version: release-3.4/886021a31bdac83c2124d08d64b81f22d82039d6 diff --git a/api/src/glfs-fops.c b/api/src/glfs-fops.c index 66e7d69..535ee53 100644 --- a/api/src/glfs-fops.c +++ b/api/src/glfs-fops.c @@ -713,7 +713,9 @@ glfs_pwritev (struct glfs_fd *glfd, const struct iovec *iovec, int iovcnt, } size = iov_length (iovec, iovcnt); - +#define MIN_LEN 8 * 1024 + if (size MIN_LEN) + size = MIN_LEN; iobuf = iobuf_get2 (subvol-ctx-iobuf_pool, size); if (!iobuf) { ret = -1; Ah, looks like we need to tune the page_size/num_pages table in libglusterfs/src/iobuf.c. The table is allowing for too small pages. We should probably remove entries for page size less than 4KB. Just doing that might fix your issue: diff --git a/libglusterfs/src/iobuf.c b/libglusterfs/src/iobuf.c index a89e962..0269004 100644 --- a/libglusterfs/src/iobuf.c +++ b/libglusterfs/src/iobuf.c @@ -24,9 +24,7 @@ /* Make sure this array is sorted based on pagesize */ struct iobuf_init_config gf_iobuf_init_config[] = { /* { pagesize, num_pages }, */ -{128, 1024}, -{512, 512}, -{2 * 1024, 512}, +{4 * 1024, 256}, {8 * 1024, 128}, {32 * 1024, 64}, {128 * 1024, 32}, Avati On 09/13/2013 06:03 PM, kane wrote: Hi We use samba gluster vfs in IO test, but meet with gluster server smbd oom killer, The smbd process spend over 15g RES with top command show, in the end is our simple test code: gluster server vfs -- smbd -- client mount dir /mnt/vfs-- execute vfs test program $ ./vfs 1000 then we can watch gluster server smbd RES with top command. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4000 soul 20 0 5486m 4.9g 10m R 108.4 31.5 111:07.07 smbd 3447 root 20 0 1408m 44m 2428 S 44.4 0.3 59:11.55 glusterfsd io test code: === #define _LARGEFILE64_SOURCE #include stdio.h #include unistd.h #include string.h #include pthread.h #include stdlib.h #include fcntl.h #include sys/types.h int WT = 1; #define RND(x) ((x0)?(genrand() % (x)):0) extern unsigned long genrand(); extern void sgenrand(); /* Period parameters */ #define N 624 #define M 397 #define MATRIX_A 0x9908b0df /* constant vector a */ #define UPPER_MASK 0x8000 /* most significant w-r bits */ #define LOWER_MASK 0x7fff /* least significant r bits */ /* Tempering parameters */ #define TEMPERING_MASK_B 0x9d2c5680 #define TEMPERING_MASK_C 0xefc6 #define TEMPERING_SHIFT_U(y) (y 11) #define TEMPERING_SHIFT_S(y) (y 7) #define TEMPERING_SHIFT_T(y) (y 15) #define TEMPERING_SHIFT_L(y) (y 18) static unsigned long mt[N]; /* the array for the state vector */ static int mti=N+1; /* mti==N+1 means mt[N] is not initialized */ /* Initializing the array with a seed */ void sgenrand(seed) unsigned long seed; { int i; for (i=0;iN;i++) { mt[i] = seed 0x; seed = 69069 * seed + 1; mt[i] |= (seed 0x) 16; seed = 69069 * seed + 1; } mti = N; } unsigned long genrand() { unsigned long y; static unsigned long mag01[2]={0x0, MATRIX_A}; /* mag01[x] = x * MATRIX_A for x=0,1 */ if (mti = N) { /* generate N words at one time */ int kk; if (mti == N+1) /* if sgenrand() has not been called, */ sgenrand(4357); /* a default initial seed is used */ for (kk=0;kkN-M;kk++) { y = (mt[kk]UPPER_MASK)|(mt[kk+1]LOWER_MASK); mt[kk] = mt[kk+M] ^ (y 1) ^ mag01[y 0x1]; } for (;kkN-1;kk++) { y = (mt[kk]UPPER_MASK)|(mt[kk+1]LOWER_MASK); mt[kk] = mt[kk+(M-N)] ^ (y 1) ^ mag01[y 0x1]; } y = (mt[N-1]UPPER_MASK)|(mt[0]LOWER_MASK); mt[N-1] = mt[M-1] ^ (y 
1) ^ mag01[y 0x1]; mti = 0; } y = mt[mti++]; y ^= TEMPERING_SHIFT_U(y); y ^= TEMPERING_SHIFT_S(y) TEMPERING_MASK_B; y ^= TEMPERING_SHIFT_T(y) TEMPERING_MASK_C; y ^= TEMPERING_SHIFT_L(y); return y; } char *initialize_file_source(int size) { char *new_source; int i; if ((new_source=(char *)malloc(size))==NULL) /* allocate buffer */ fprintf(stderr,Error: failed to allocate source file of size %d\n,size); else for (i=0; isize; i++) /* file buffer with junk */ new_source[i]=32+RND(95); return(new_source); } void *tran_file(void *map) { int block_size = 512; char *read_buffer; /* temporary space for reading file data into */ int fd = open((char *)map, O_RDWR | O_CREAT | O_TRUNC, 0644); if(fd == -1) { perror(open); return ; } //read_buffer=(char *)malloc(block_size); //memset(read_buffer,
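For context on why both the 8KB clamp in the workaround above (if (size < MIN_LEN) size = MIN_LEN) and the suggested table change help: iobuf_get2() serves a request from the arena with the smallest configured page size that fits it, so dropping the sub-4KB rows makes tiny writes land in the 4KB arena instead of churning through very small pages. A rough model of that selection (editorial sketch, not the actual libglusterfs code; only the rows visible in the quoted diff are listed):

    struct iobuf_cfg { size_t pagesize; int num_pages; };

    static const struct iobuf_cfg cfg[] = {
            {4 * 1024, 256}, {8 * 1024, 128}, {32 * 1024, 64}, {128 * 1024, 32},
    };

    /* return the page size of the arena a request of 'required' bytes would
       be served from, or 0 if it is larger than every configured arena */
    static size_t
    pick_pagesize (size_t required)
    {
            size_t i;

            for (i = 0; i < sizeof (cfg) / sizeof (cfg[0]); i++)
                    if (required <= cfg[i].pagesize)
                            return cfg[i].pagesize;

            return 0;
    }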
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.com wrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/** 5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
Can you please confirm if http://review.gluster.org/5979 fixes the problem of #998967 for you? If so we will backport and include the patch in 3.4.1. Thanks, Avati On Fri, Sep 20, 2013 at 2:03 AM, Anand Avati av...@gluster.org wrote: I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.com wrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/** 5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
Please pick #2 resubmission, that is fine. Avati On Fri, Sep 20, 2013 at 2:48 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: Will take about 2 hours to setup test env, also build seems to be failed but does not seem to be caused by the patch :/ On Fri, Sep 20, 2013 at 11:38 AM, Anand Avati av...@gluster.org wrote: Can you please confirm if http://review.gluster.org/5979 fixes the problem of #998967 for you? If so we will backport and include the patch in 3.4.1. Thanks, Avati On Fri, Sep 20, 2013 at 2:03 AM, Anand Avati av...@gluster.org wrote: I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.comwrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/** 5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] glusterfs-3.4.1qa2 released
Thanks Lukas. Copying Lubomir. Can you confirm that http://review.gluster.org/5693 is no more needed then? Also, can you please vote on http://review.gluster.org/5979? Thanks, Avati On Fri, Sep 20, 2013 at 4:39 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: I was unable to reproduce the issue with patch #2 from http://review.gluster.org/#/c/5979/ Thank you. On Fri, Sep 20, 2013 at 11:52 AM, Anand Avati av...@gluster.org wrote: Please pick #2 resubmission, that is fine. Avati On Fri, Sep 20, 2013 at 2:48 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: Will take about 2 hours to setup test env, also build seems to be failed but does not seem to be caused by the patch :/ On Fri, Sep 20, 2013 at 11:38 AM, Anand Avati av...@gluster.org wrote: Can you please confirm if http://review.gluster.org/5979 fixes the problem of #998967 for you? If so we will backport and include the patch in 3.4.1. Thanks, Avati On Fri, Sep 20, 2013 at 2:03 AM, Anand Avati av...@gluster.org wrote: I have a theory for #998967 (that posix-acl is not doing the right thing after chmod/setattr). Preparing a patch, will appreciate if you can test it quickly. Avati On Fri, Sep 20, 2013 at 1:26 AM, Lukáš Bezdička lukas.bezdi...@gooddata.com wrote: No, I see issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=998967 which is probably related to BZ#991035. On Thu, Sep 19, 2013 at 7:40 PM, Vijay Bellur vbel...@redhat.comwrote: On 09/18/2013 02:45 PM, Lukáš Bezdička wrote: Tested with glusterfs-3.4.1qa2-1.el6.x86_**64 issue with ACL is still there, unless one applies patch from http://review.gluster.org/#/c/ **5693/ http://review.gluster.org/#/c/5693/ which shoots through the caches and takes ACLs from server or sets entry-timeout=0 it returns wrong values. This is probably because ACL mask being applied incorrectly in posix_acl_inherit_mode, but I'm no C expert to say so :( Checking again. Are you seeing issues reported in both BZ#991035 and BZ#990830 with 3.4.1qa2? Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Thu, Sep 19, 2013 at 5:13 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: Avati, Please find the updated patch set for review at gerrit. http://review.gluster.org/#/c/5936/ Changes made to address the points (1) (2) and (3) below. By the usage of the suggested glfs_resolve_inode approach. I have not yet changes glfs_h_unlink to use the glfs_resolve_at. (more on this a little later). So currently, the review request is for all APIs other than, glfs_h_unlink, glfs_h_extract_gfid, glfs_h_create_from_gfid glfs_resolve_at: Using this function the terminal name will be a force look up anyway (as force_lookup will be passed as 1 based on !next_component). We need to avoid this _extra_ lookup in the unlink case, which is why all the inode_grep(s) etc. were added to the glfs_h_lookup in the first place. Having said the above, we should still leverage glfs_resolve_at anyway, as there seem to be other corner cases where the resolved inode and subvol maybe from different graphs. So I think I want to modify glfs_resolve_at to make a conditional force_lookup, based on iatt being NULL or not. IOW, change the call to glfs_resolve_component with the conditional as, (reval || (!next_component iatt)). So that callers that do not want the iatt filled, can skip the syncop_lookup. Request comments on the glfs_resolve_at proposal. That should be OK (passing iatt as NULL to skip forced lookup) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
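A minimal sketch of the conditional force-lookup agreed on above, using the variable names from the discussion (the surrounding glfs_resolve_at() code is assumed, not quoted from the final patch):

    /* current behaviour: the terminal component is always force-looked-up */
    force_lookup = (reval || !next_component);

    /* proposed behaviour: force the lookup on the terminal name only when the
       caller wants the iatt filled, so callers such as glfs_h_unlink() can
       pass iatt as NULL and skip the extra syncop_lookup */
    force_lookup = (reval || (!next_component && iatt));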
Re: [Gluster-devel] [Gluster-users] samba-glusterfs-vfs does not build
On Thu, Sep 19, 2013 at 11:28 AM, Nux! n...@li.nux.ro wrote: On 18.09.2013 19:04, Nux! wrote: Hi, I'm trying to build and test samba-glusterfs-vfs, but problems appear from the start: http://fpaste.org/40562/**95274621/ http://fpaste.org/40562/95274621/ Any pointers? Anyone from devel has any ideas? Thanks, Lucian Have you ./configure'd in the samba tree? --with-samba-source must point to a built samba tree (not just extracted) Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RFC/Review: libgfapi object handle based extensions
On Mon, Sep 16, 2013 at 4:18 AM, Shyamsundar Ranganathan srang...@redhat.com wrote: - Original Message - From: Anand Avati av...@gluster.org Sent: Friday, September 13, 2013 11:09:37 PM Shyam, Thanks for sending this out. Can you post your patches to review.gluster.org and link the URL in this thread? That would make things a lot more clear for feedback and review. Done, please find the same here, http://review.gluster.org/#/c/5936/ Shyam Minor comments are made in gerrit. Here is a larger (more important) comment for which email is probably more convenient. There is a problem in the general pattern of the fops, for example glfs_h_setattrs() (and others too) 1. glfs_validate_inode() has the assumption that object-inode deref is a guarded operation, but here we are doing an unguarded deref in the paramter glfs_resolve_base(). 2. A more important issue, glfs_active_subvol() and glfs_validate_inode() are not atomic. glfs_active_subvol() can return an xlator from one graph, but by the time glfs_validate_inode() is called, a graph switch could have happened and inode can get resolved to a different graph. And in syncop_XX() we end up calling on graph1 with inode belonging to graph2. 3. ESTALE_RETRY is a fundamentally wrong thing to do with handle based operations. The ESTALE_RETRY macro exists for path based FOPs where the resolved handle could have turned stale by the time we perform the FOP (where resolution and FOP are non-atomic). Over here, the handle is predetermined, and it does not make sense to retry on ESTALE (notice that FD based fops in glfs-fops.c also do not have ESTALE_RETRY for this same reason) I think the pattern should be similar to FD based fops which specifically address both the above problems. Here's an outline: glfs_h_(struct glfs *fs, glfs_object *object, ...) { xlator_t *subvol = NULL; inode_t *inode = NULL; __glfs_entry_fs (fs); subvol = glfs_active_subvol (fs); if (!subvol) { errno = EIO; ... goto out; } inode = glfs_resolve_inode (fs, object, subvol); if (!inode) { errno = ESTALE; ... goto out; } loc.inode = inode; ret = syncop_(subvol, loc, ...); } Notice the signature of glfs_resolve_inode(). What it does: given a glfs_object, and a subvol, it returns an inode_t which is resolved on that subvol. This way the syncop_XXX() is performed with matching subvol and inode. Also it returns the inode pointer so that no unsafe object-inode deref is done by the caller. Again, this is the same pattern followed by the fd based fops already. Also, as mentioned in one of the comments, please consider using glfs_resolve_at() and avoiding manual construction of loc_t. Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Fwd: [Nfs-ganesha-devel] Announce: Push of next pre-2.0-dev_49
Anand, This is a great first step.. Looking forward for the integration to mature soon. This is a big step for supporting NFSv4 and pNFS for GlusterFS. Thanks! Avati On Sat, Sep 14, 2013 at 3:18 AM, Anand Subramanian ana...@redhat.comwrote: FYI, the FSAL (File System Abstraction Layer) for Gluster is now available in the upstream nfs-ganesha community (details of branch, tag and commit below). This enables users to export Gluster volumes through nfs-ganesha and for use by both nfs v3 and v4 clients. Please note that this is an on-going effort. More details wrt configuration, building etc. will follow. Anand - Forwarded Message - From: Jim Lieb jl...@panasas.com To: nfs-ganesha-de...@lists.sourceforge.net Sent: Fri, 13 Sep 2013 22:20:43 -0400 (EDT) Subject: [Nfs-ganesha-devel] Announce: Push of next pre-2.0-dev_49 Pushed to the project repo: git://github.com/nfs-ganesha/nfs-ganesha.git branch next Branch: next Tag: pre-2.0-dev_49 This week's merge is big. It also took a little extra effort to file and fit some of the pieces to get them to slide into place. The Red Hat Gluster FS team has submitted their fsal. I have built it but have not tested it. It requires the glfsapi library and a header which I can supply to anyone else who wants to play. They will be testing with us at BAT in Boston this month. It is built by default but the build will be disabled if the build cannot find the header or libary. IBM has also submitted the Protectier fsal. I have not built this but we expect a report from their team once they have tested the merge. Its build is off by default. The Pseudo filesystem handle for v4 has been reworked. This was done to get the necessary handle changes in for V2.0. Further work on pseudo file system infrastructure will build on this in 2.1. Frank and the IBM team submitted a large set of 1.5 to 2.0 bugfix ports. This is almost all of them. Frank has updated the port document reflecting current state. Please feel free to grab some patches and port them. As usual, there have been bugfixes in multiple places. We tried to get the 1.5 log rotation and compression code in but found some bugs that will take more than a few line fix to get working in 2.0. As a result, it has been reverted. Highlights: * FSAL_GLUSTER is a new fsal to export Gluster FS * FSAL_PT is a new fsal for the Protectier file system * Rework of the PseudoFS file handle format (NVFv4+ only) * More 1.5 to 2.0 bugfix ports * Lots of bugfixes Enjoy Jim -- Jim Lieb Linux Systems Engineer Panasas Inc. If ease of use was the only requirement, we would all be riding tricycles - Douglas Engelbart 1925–2013 Short log from pre-2.0-dev_47 -- commit b2a927948e627367d87af04892afbb031ed85d75 Author: Jeremy Bongio jbon...@us.ibm.com Don't access export in SAVEFH request when FH is for pseudofs and fix up references commit 03228228ab64f8d004b864ae7829b51707bfc068 Author: Jim Lieb jl...@panasas.com Revert Added support for rotation and compression of log files. commit 0f8690df03a57243d65f20d23c53f86a9e0b17cc Merge: cca7875 9483a7d Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'ffilz/porting-doc' into merge_next commit cca787542d85112cb3e0706caf5ae007b8cd5285 Merge: 2f0118d af03de5 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'martinetd/for_dev_49' into merge_next commit 9483a7d7ab54a5e6e6daf4521928b147fa7329b8 Author: Frank S. Filz ffilz...@mindspring.com Clean up porting doc commit d19cadcf4069976c299e968e890efc8d0ccf001a Author: Frank S. 
Filz ffilz...@mindspring.com Update porting doc for dev_49 commit 2f0118d2eb9a3f95cff08070ff3453ca7ce0d4a2 Merge: a75665a 9530440 Author: Jim Lieb jl...@panasas.com Merge branch 'glusterfs' into merge_next commit a75665ac75c01e767780cea023c2a8f74b46e2a0 Merge: 3c7578c 183e044 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'sachin/next' into merge_next commit 3c7578cde4d47344b0dac2264e9990de3b029ba6 Merge: c0aa16f 75d81d1 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'linuxbox2/next' into merge_next commit c0aa16f8ea25c3dae059b349302083291ea7af9d Author: Jim Lieb jl...@panasas.com Fixups to logging macros and display logic commit 183e0440d2d8a9f1ef0513807829fd7c15e568d4 Author: Sachin Bhamare sbham...@panasas.com Fix the order in which credentials are set in fsal_set_credentials(). commit 0af11c7592092825098215733fc9a14cbc9bcfe3 Author: Sachin Bhamare sbham...@panasas.com Fix bugs in FreeBSD version of setuser() and setgroup(). commit b9ca8bddbe140f90c216aeb6611465060607420e Merge: 9629e2a 5eeb095 Author: Jim Lieb jl...@panasas.com Merge remote-tracking branch 'ganltc/ibm_next_20' into merge_next commit 953044057566c7d9013b276a14879a3f226d6972 Author: Jim
Re: [Gluster-devel] Build broken on current head with F19?
I think you might need this - http://review.gluster.org/5896 Avati On Wed, Sep 11, 2013 at 1:34 PM, Justin Clift jcl...@redhat.com wrote: Hi all, Building on F19 with current Gluster master head seems broken atm. Looks related to the QEMU code. Have we added a new compilation dependency or something (that I'm obviously missing :)? + Justin * [snip] CC changelog-notifier.lo CCLD changelog.la Making all in lib Making all in src CC libgfchangelog_la-gf-changelog.lo CC libgfchangelog_la-gf-changelog-process.lo CC libgfchangelog_la-gf-changelog-helpers.lo CC libgfchangelog_la-clear.lo CC libgfchangelog_la-copy.lo CC libgfchangelog_la-gen_uuid.lo CC libgfchangelog_la-pack.lo CC libgfchangelog_la-parse.lo CC libgfchangelog_la-unparse.lo CC libgfchangelog_la-uuid_time.lo CC libgfchangelog_la-compare.lo CC libgfchangelog_la-isnull.lo CC libgfchangelog_la-unpack.lo CCLD libgfchangelog.la Making all in gfid-access Making all in src CC gfid-access.lo CCLD gfid-access.la Making all in glupy Making all in src CC glupy.lo CCLD glupy.la Making all in qemu-block Making all in src CC qemu-coroutine.lo CC qemu-coroutine-lock.lo CC qemu-coroutine-sleep.lo CC block.lo CC nop-symbols.lo CC aes.lo CC bitmap.lo CC bitops.lo CC cutils.lo In file included from ../../../../contrib/qemu/block.c:25:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ In file included from ../../../../contrib/qemu/include/qemu/bitops.h:15:0, from ../../../../contrib/qemu/util/bitops.c:14: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. compilation terminated. In file included from ../../../../contrib/qemu/util/aes.c:30:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/trace/generated-tracers.h:6:0, from ../../../../contrib/qemu/include/trace.h:4, from ../../../../contrib/qemu/qemu-coroutine.c:15: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/include/qemu/bitops.h:15:0, from ../../../../contrib/qemu/util/bitmap.c:12: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/qemu-coroutine-lock.c:25:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/include/qemu/timer.h:4:0, from ../../../../contrib/qemu/include/block/coroutine.h:20, from ../../../../contrib/qemu/qemu-coroutine-sleep.c:14: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. In file included from ../../../../contrib/qemu/util/cutils.c:24:0: ../../../../contrib/qemu/include/qemu-common.h:43:25: fatal error: glib-compat.h: No such file or directory #include glib-compat.h ^ compilation terminated. 
make[6]: *** [bitops.lo] Error 1 make[6]: *** Waiting for unfinished jobs make[6]: *** [block.lo] Error 1 make[6]: *** [aes.lo] Error 1 make[6]: *** [qemu-coroutine.lo] Error 1 make[6]: *** [bitmap.lo] Error 1 make[6]: *** [qemu-coroutine-lock.lo] Error 1 make[6]: *** [qemu-coroutine-sleep.lo] Error 1 make[6]: *** [cutils.lo] Error 1 make[5]: *** [all-recursive] Error 1 make[4]: *** [all-recursive] Error 1 make[3]: *** [all-recursive] Error 1 make[2]: *** [all-recursive] Error 1 make[1]: *** [all] Error 2 make[1]: Leaving directory `/home/jc/git_repos/glusterfs/extras/LinuxRPM/rpmbuild/BUILD/glusterfs-3git' error: Bad exit status from /var/tmp/rpm-tmp.uXvaGe (%build) RPM build errors: Bad exit status from /var/tmp/rpm-tmp.uXvaGe (%build) make: *** [rpms] Error 1 * ___ Gluster-devel mailing list
Re: [Gluster-devel] Build broken on current head with F19?
Thanks for confirming Justin! Niels, do you know why rpm.t regression test is failing on your patch? Avati On Wed, Sep 11, 2013 at 1:49 PM, Justin Clift jcl...@redhat.com wrote: On Wed, 2013-09-11 at 13:37 -0700, Anand Avati wrote: I think you might need this - http://review.gluster.org/5896 Thanks Avati, that solved the build failure for me. :) + Justin ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Issues with fallocate, discard and zerofill
It is cleaner to implement it as a separate fop. The complexity of overloading writev() is unnecessary. There would be a whole bunch of new if/else condititions to be introduced in existing code, and modules like write-behind, stripe etc. where special action is taken in multiple places based on size (and offset into the buffer), would be very delicate error prone changes. That being said, I still believe the FOP interface should be similar to SCSI write_same, something like this: int fop_write_same (call_frame_t *frame, xlator_t *this, fd_t *fd, void *buf, size_t len, off_t offset, int repeat); and zerofill would be a gfapi wrapper around write_same: int zerofill (call_frame_t *frame, xlator_t *this, fd_t *fd, off_t offset, int len) { char zero[1] = {0}; return fop_write_same (frame, this, fd, zero, 1, offset, len); } Avati On Thu, Sep 5, 2013 at 10:28 PM, M. Mohan Kumar mo...@in.ibm.com wrote: Anand Avati anand.av...@gmail.com writes: Hi Shishir, Its possible to overload writev FOP for achieving zerofill functionality. Is there any open issues with this zerofill functionality even after overloading in writev? Shishir, Is this in reference to the dht open file rebalance (of replaying the operations to the destination server)? I am assuming so, as that is something which has to be handled. The other question is how should fallocate/discard be handled by self-heal in AFR. I'm not sure how important it is, but will be certainly good to bounce some ideas off here. Maybe we should implement a fiemap fop to query extents/holes and replay them in the other serverl? Avati On Tue, Aug 13, 2013 at 10:49 PM, Bharata B Rao bharata@gmail.com wrote: Hi Avati, Brian, During the recently held gluster meetup, Shishir mentioned about a potential problem (related to fd migration etc) in the zerofill implementation (http://review.gluster.org/#/c/5327/) and also mentioned that same/similar issues are present with fallocate and discard implementations. Since zerofill has been modelled on fallocate/discard, I was wondering if it would be possible to address these issues in fallocate/discard first so that we could potentially follow the same in zerofill implementation. Regards, Bharata. -- http://raobharata.wordpress.com/ ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
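As a rough illustration of the proposed write_same semantics, a brick-side handler could expand it into plain pwrite() calls of the repeated buffer. The function below is an editorial sketch (posix_write_same is not an existing GlusterFS function), matching the (buf, len, offset, repeat) arguments suggested above:

    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    static ssize_t
    posix_write_same (int fd, const void *buf, size_t len, off_t offset,
                      int repeat)
    {
            ssize_t total = 0;
            int     i;

            for (i = 0; i < repeat; i++) {
                    ssize_t ret = pwrite (fd, buf, len,
                                          offset + (off_t) i * (off_t) len);
                    if (ret < 0)
                            return -1;   /* errno already set by pwrite */
                    total += ret;
            }
            return total;
    }

With this shape, the zerofill wrapper above degenerates into 'repeat' writes of a single zero byte, which a real implementation would batch into larger buffers or map to a lower-level zeroing primitive where one is available.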
Re: [Gluster-devel] [FEEDBACK] Governance of GlusterFS project
Good point Amar.. Noted. Avati On Fri, Sep 6, 2013 at 1:40 AM, Amar Tumballi ama...@gmail.com wrote: One of the other things we missed in this thread is how to handle bugs in bugzilla, and who should own the triage for high/urgent priority bugs. -Amar On Fri, Jul 26, 2013 at 10:56 PM, Anand Avati anand.av...@gmail.comwrote: Hello everyone, We are in the process of formalizing the governance model of the GlusterFS project. Historically, the governance of the project has been loosely structured. This is an invitation to all of you to participate in this discussion and provide your feedback and suggestions on how we should evolve a formal model. Feedback from this thread will be considered to the extent possible in formulating the draft (which will be sent out for review as well). Here are some specific topics to seed the discussion: - Core team formation - what are the qualifications for membership (e.g contributions of code, doc, packaging, support on irc/lists, how to quantify?) - what are the responsibilities of the group (e.g direction of the project, project roadmap, infrastructure, membership) - Roadmap - process of proposing features - process of selection of features for release - Release management - timelines and frequency - release themes - life cycle and support for releases - project management and tracking - Project maintainers - qualification for membership - process and evaluation There are a lot more topics which need to be discussed, I just named some to get started. I am sure our community has members who belong and participate (or at least are familiar with) other open source project communities. Your feedback will be valuable. Looking forward to hearing from you! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
[Gluster-devel] readdir() scalability (was Re: [RFC ] dictionary optimizations)
On Fri, Sep 6, 2013 at 1:46 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 04/09/13 18:10, Anand Avati wrote: On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 04/09/13 14:05, Jeff Darcy wrote: On 09/04/2013 04:27 AM, Xavier Hernandez wrote: I would also like to note that each node can store multiple elements. The current implementation creates a node for each byte in the key. In my implementation I only create a node if there is a prefix coincidence between 2 or more keys. This reduces the number of nodes and the number of indirections. Whatever we do, we should try to make sure that the changes are profiled against real usage. When I was making my own dict optimizations back in March of last year, I started by looking at how they're actually used. At that time, a significant majority of dictionaries contained just one item. That's why I only implemented a simple mechanism to pre-allocate the first data_pair instead of doing something more ambitious. Even then, the difference in actual performance or CPU usage was barely measurable. Dict usage has certainly changed since then, but I think you'd still be hard pressed to find a case where a single dict contains more than a handful of entries, and approaches that are optimized for dozens to hundreds might well perform worse than simple ones (e.g. because of cache aliasing or branch misprediction). If you're looking for other optimization opportunities that might provide even bigger bang for the buck then I suggest that stack-frame or frame-local allocations are a good place to start. Or string copying in places like loc_copy. Or the entire fd_ctx/inode_ctx subsystem. Let me know and I'll come up with a few more. To put a bit of a positive spin on things, the GlusterFS code offers many opportunities for improvement in terms of CPU and memory efficiency (though it's surprisingly still way better than Ceph in that regard). Yes. The optimizations on dictionary structures are not a big improvement in the overall performance of GlusterFS. I tried it in a real situation and the benefit was only marginal. However I didn't test new features like an atomic lookup-and-remove-if-found (because I would have had to review all the code). I think this kind of functionality could improve the results a bit more. However this is not the only reason to do these changes. While I've been writing code I've found that it's tedious to do some things just because there aren't such functions in dict_t. Some actions require multiple calls, having to check multiple errors, adding complexity and limiting readability of the code. Many of these situations could be solved using functions similar to what I proposed. On the other side, if dict_t must be truly considered a concurrent structure, there are a lot of race conditions that might appear when doing some operations. It would require a great effort to take care of all these possibilities everywhere. It would be better to pack most of these situations into functions inside dict_t itself, where it is easier to combine some operations. By the way, I've made some tests with multiple bricks and it seems that there is a clear speed loss on directory listings as the number of bricks increases. Since bricks should be independent and can work in parallel, I didn't expect such a big performance degradation. 
The likely reason is that, even though bricks are parallel for IO, readdir is essentially a sequential operation and DHT has a limitation that a readdir reply batch does not cross server boundaries. So if you have 10 files and 1 server, all 10 entries are returned in one call to the app/libc. If you have 10 files and 10 servers evenly distributed, the app/libc has to perform 10 calls and keeps getting one file at a time. This problem goes away when each server has enough files to fill up a readdir batch. It's only when you have too few files and too many servers that this dilution problem shows up. However, this is just a theory and your problem may be something else too. I didn't know that DHT was doing a sequential brick scan on readdir(p) (my fault). Why is that? Why can't it return entries crossing a server boundary? Is it due to a technical reason or only to the current implementation? I've made a test using only directories (50 directories with 50 subdirectories each). I started with one brick and measured the time to do a recursive 'ls'. Then I sequentially added an additional brick, up to 6 (all of them physically independent), and repeated the ls. The time increases linearly as the number of bricks grows. As more bricks were added, the rebalancing time was also growing linearly. I think this is a big problem for scalability. It can be partially hidden by using some caching or preloading mechanisms
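A toy model of the batching behaviour described above (editorial sketch, not the actual dht_readdirp code): each reply batch is filled from a single subvolume and stops at the server boundary, so with few files per brick every application-level readdir() call drains only one server.

    struct subvol { int nfiles; int next_index; };

    /* returns the number of entries placed into one reply batch */
    static int
    dht_fill_batch (struct subvol *subvols, int nsubvols, int *cur,
                    int batch_size)
    {
            int filled = 0;

            while (*cur < nsubvols) {
                    struct subvol *s = &subvols[*cur];

                    while (s->next_index < s->nfiles && filled < batch_size) {
                            s->next_index++;  /* copy one dirent into the batch */
                            filled++;
                    }
                    if (filled > 0)
                            return filled;    /* never mix entries from two subvols */
                    (*cur)++;                 /* this subvol is exhausted, move on */
            }
            return 0;                         /* end of directory */
    }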
Re: [Gluster-devel] Change in glusterfs[release-3.4]: call-stub: internal refactor
On 9/5/13 6:27 AM, Vijay Bellur (Code Review) wrote: Vijay Bellur has submitted this change and it was merged. Change subject: call-stub: internal refactor .. call-stub: internal refactor - re-structure members of call_stub_t with new simpler layout - easier to inspect call_stub_t contents in gdb now - fix a bunch of double unrefs and double frees in cbk stub - change all STACK_UNWIND to STACK_UNWIND_STRICT and thereby fixed a lot of bad params - implement new API call_unwind_error() which can even be called on fop_XXX_stub(), and not necessarily fop_XXX_cbk_stub() Change-Id: Idf979f14d46256af0afb9658915cc79de157b2d7 BUG: 846240 Signed-off-by: Anand Avati av...@redhat.com Reviewed-on: http://review.gluster.org/4520 Tested-by: Gluster Build System jenk...@build.gluster.com Reviewed-by: Jeff Darcy jda...@redhat.com Reviewed-on: http://review.gluster.org/5820 Reviewed-by: Raghavendra Bhat raghaven...@redhat.com --- M libglusterfs/src/call-stub.c M libglusterfs/src/call-stub.h M xlators/performance/write-behind/src/write-behind.c 3 files changed, 1,104 insertions(+), 2,994 deletions(-) Approvals: Raghavendra Bhat: Looks good to me, approved Gluster Build System: Verified Note that this backported patch had a bug in master and there was a follow-up patch http://review.gluster.org/4564. This needs to be backported too. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
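A hedged usage sketch of the call_unwind_error() API the commit message introduces; the signature shown (stub, op_ret, op_errno) and the placeholder names resume_fn and must_fail_now are assumptions based on the description, not lines from the patch itself:

    /* queue the operation for later resumption */
    stub = fop_writev_stub (frame, resume_fn, fd, vector, count, offset,
                            flags, iobref, xdata);
    if (!stub) {
            STACK_UNWIND_STRICT (writev, frame, -1, ENOMEM, NULL, NULL, NULL);
            return 0;
    }

    if (must_fail_now)
            /* the refactored API: fail an un-resumed fop stub directly,
               without first building a matching _cbk stub */
            call_unwind_error (stub, -1, EIO);
    else
            call_resume (stub);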
Re: [Gluster-devel] [Gluster-users] Enabling Apache Hadoop on GlusterFS: glusterfs-hadoop 2.1 released
On Thu, Sep 5, 2013 at 2:53 PM, Stephen Watt sw...@redhat.com wrote: Hi Folks We are pleased to announce a major update to the glusterfs-hadoop project with the release of version 2.1. The glusterfs-hadoop project, available at The glusterfs-hadoop project team, provides an Apache licensed Hadoop FileSystem plugin which enables Apache Hadoop 1.x and 2.x to run directly on top of GlusterFS. This release includes a re-architected plugin which now extends existing functionality within Hadoop to run on local and POSIX File Systems. -- Overview -- Apache Hadoop has a pluggable FileSystem Architecture. This means that if you have a filesystem or object store that you would like to use with Hadoop, you can create a Hadoop FileSystem plugin for it which will act as a mediator between the generic Hadoop FileSystem interface and your filesystem of choice. A popular example would be that over a million Hadoop clusters are spun up on Amazon every year, a lot of which use Amazon S3 as the Hadoop FileSystem. In order to configure the plugin, a specific deployment configuration is required. Firstly, it is required that the Hadoop JobTracker and TaskTrackers (or the Hadoop 2.x equivalents) are installed on servers within the gluster trusted storage pool for a given gluster volume. The JobTracker uses the plugin to query the extended attributes for job input files in gluster to ascertain file placement as well as the distribution of file replicas across the cluster. The TaskTrackers use the plugin to leverage a local fuse mount of the gluster volume in order to access the data required for the tasks. When the JobTracker receives a Hadoop job, it uses the locality information it ascertains via the plugin to send the tasks for the Hadoop Job to Hadoop TaskTrackers on servers that have the data required for the task within their local bricks. This ensures data is read from disk and not over the network. Please see the attached diagram which provides an overview of the entire solution for a Hadoop 1.x deployment. The community project, along with the documentation and available releases, is hosted within the Gluster Forge at http://forge.gluster.org/hadoop. The glusterfs-hadoop project will also be available within the Fedora 20 release later this year, alongside fellow Fedora newcomer Apache Hadoop and the already available gluster project. The glusterfs-hadoop project team welcomes contributions and participation from the broader community. Stay tuned for upcoming posts around GlusterFS integration into the Apache Ambari and Fedora projects. Regards The glusterfs-hadoop project team ___ Announce mailing list annou...@gluster.org http://supercolony.gluster.org/mailman/listinfo/announce ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users Congratulations! This is great news!! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Change in glusterfs[master]: bd: posix/multi-brick support to BD xlator
On 09/01/2013 11:26 AM, M. Mohan Kumar (Code Review) wrote: Hello Anand Avati, Gluster Build System, I'd like you to reexamine a change. Please visit http://review.gluster.org/4809 to look at the new patch set (#4). Change subject: bd: posix/multi-brick support to BD xlator .. bd: posix/multi-brick support to BD xlator Current BD xlator (block backend) has a few limitations such as * Creation of directories not supported * Supports only single brick * Does not use extended attributes (and client gfid) like posix xlator * Creation of special files (symbolic links, device nodes etc) not supported Basic limitation of not allowing directory creation is blocking oVirt/VDSM to consume BD xlator as part of Gluster domain since VDSM creates multi-level directories when GlusterFS is used as storage backend for storing VM images. To overcome these limitations a new BD xlator with following improvements is suggested. * New hybrid BD xlator that handles both regular files and block device files * The volume will have both POSIX and BD bricks. Regular files are created on POSIX bricks, block devices are created on the BD brick (VG) * BD xlator leverages exiting POSIX xlator for most POSIX calls and hence sits above the POSIX xlator * Block device file is differentiated from regular file by an extended attribute * The xattr 'user.glusterfs.bd' (BD_XATTR) plays a role in mapping a posix file to Logical Volume (LV). * When a client sends a request to set BD_XATTR on a posix file, a new LV is created and mapped to posix file. So every block device will have a representative file in POSIX brick with 'user.glusterfs.bd' (BD_XATTR) and 'user.glusterfs.bd.size' (BD_XATTR_SIZE) set. * Here after all operations on this file results in LV related operations. New BD xlator code is placed in xlators/storage/bd directory. For example opening a file that has BD_XATTR_PATH set results in opening the LV block device, reading results in reading the corresponding LV block device. When BD xlator gets request to set BD_XATTR via setxattr call, it creates a LV and information about this LV is placed in the xattr of the posix file. xattr user.glusterfs.bd, user.glusterfs.bd.size used to identify that posix file is mapped to BD. Usage: Server side: [root@host1 ~]# gluster volume create bdvol device vg host1:/storage/vg1_info?vg1 host2:/storage/vg2_info?vg2 It creates a distributed gluster volume 'bdvol' with Volume Group vg1 using posix brick /storage/vg1_info in host1 and Volume Group vg2 using /storage/vg2_info in host2. [root@host1 ~]# gluster volume start bdvol Client side: [root@node ~]# mount -t glusterfs host1:/bdvol /media [root@node ~]# touch /media/posix It creates regular posix file 'posix' in either host1:/vg1 or host2:/vg2 brick [root@node ~]# mkdir /media/image [root@node ~]# touch /media/image/lv1 It also creates regular posix file 'lv1' in either host1:/vg1 or host2:/vg2 brick [root@node ~]# setfattr -n user.glusterfs.bd -v lv /media/image/lv1 [root@node ~]# Above setxattr results in creating a new LV in corresponding brick's VG and it sets 'user.glusterfs.bd' with value 'lv' and 'user.glusterfs.size' with default extent size. [root@node ~]# truncate -s5G /media/image/lv1 It results in resizig LV 'lv1'to 5G Changes from previous version V3: * Added support in FUSE to support full/linked clone * Added support to merge snapshots and provide information about origin * bd_map xlator removed * iatt structure used in inode_ctx. 
iatt is cached and updated during fsync/flush * aio support * Type and capabilities of volume are exported through getxattr Changes from version 2: * Used inode_context for caching BD size and to check if loc/fd is BD or not. * Added GlusterFS server offloaded copy and snapshot through setfattr FOP. As part of this libgfapi is modified. * BD xlator supports stripe * During unlinking if a LV file is already opened, its added to delete list and bd_del_thread tries to delete from this list when a last reference to that file is closed. Changes from previous version: * gfid is used as name of LV * ? is used to specify VG name for creating BD volume in volume create, add-brick. gluster volume create volname host:/path?vg * open-behind issue is fixed * A replicate brick can be added dynamically and LVs from source brick are replicated to destination brick * A distribute brick can be added dynamically and rebalance operation distributes existing LVs/files to the new brick * Thin provisioning support added. * bd_map xlator support retained * setfattr -n user.glusterfs.bd -v lv creates a regular LV and setfattr -n user.glusterfs.bd -v thin creates thin LV * Capability and backend information added to gluster volume info (and --xml) so that management tools can exploit BD xlator. * tracing support for bd xlator added TODO: * Add support to display snapshots for a given LV
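The same flow as the shell usage above, driven from a C program on a client mount (editorial sketch; the path is the illustrative one from the example, and plain setxattr(2)/truncate(2) are used just as setfattr and truncate do):

    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    int main (void)
    {
            const char *path = "/media/image/lv1";

            /* create the posix file that will represent the block device */
            int fd = open (path, O_CREAT | O_WRONLY, 0644);
            if (fd < 0) { perror ("open"); return 1; }
            close (fd);

            /* map it to a logical volume ("thin" instead of "lv" for a thin LV) */
            if (setxattr (path, "user.glusterfs.bd", "lv", 2, 0) < 0) {
                    perror ("setxattr");
                    return 1;
            }

            /* resize the backing LV to 5G, equivalent to `truncate -s5G` */
            if (truncate (path, 5LL * 1024 * 1024 * 1024) < 0) {
                    perror ("truncate");
                    return 1;
            }
            return 0;
    }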
Re: [Gluster-devel] [RFC ] dictionary optimizations
On Wed, Sep 4, 2013 at 6:37 AM, Xavier Hernandez xhernan...@datalab.eswrote: Al 04/09/13 14:05, En/na Jeff Darcy ha escrit: On 09/04/2013 04:27 AM, Xavier Hernandez wrote: I would also like to note that each node can store multiple elements. Current implementation creates a node for each byte in the key. In my implementation I only create a node if there is a prefix coincidence between 2 or more keys. This reduces the number of nodes and the number of indirections. Whatever we do, we should try to make sure that the changes are profiled against real usage. When I was making my own dict optimizations back in March of last year, I started by looking at how they're actually used. At that time, a significant majority of dictionaries contained just one item. That's why I only implemented a simple mechanism to pre-allocate the first data_pair instead of doing something more ambitious. Even then, the difference in actual performance or CPU usage was barely measurable. Dict usage has certainly changed since then, but I think you'd still be hard pressed to find a case where a single dict contains more than a handful of entries, and approaches that are optimized for dozens to hundreds might well perform worse than simple ones (e.g. because of cache aliasing or branch misprediction). If you're looking for other optimization opportunities that might provide even bigger bang for the buck then I suggest that stack-frame or frame-local allocations are a good place to start. Or string copying in places like loc_copy. Or the entire fd_ctx/inode_ctx subsystem. Let me know and I'll come up with a few more. To put a bit of a positive spin on things, the GlusterFS code offers many opportunities for improvement in terms of CPU and memory efficiency (though it's surprisingly still way better than Ceph in that regard). Yes. The optimizations on dictionary structures are not a big improvement in the overall performance of GlusterFS. I tried it on a real situation and the benefit was only marginal. However I didn't test new features like an atomic lookup and remove if found (because I would have had to review all the code). I think this kind of functionalities could improve a bit more the results I obtained. However this is not the only reason to do these changes. While I've been writing code I've found that it's tedious to do some things just because there isn't such functions in dict_t. Some actions require multiple calls, having to check multiple errors and adding complexity and limiting readability of the code. Many of these situations could be solved using functions similar to what I proposed. On the other side, if dict_t must be truly considered a concurrent structure, there are a lot of race conditions that might appear when doing some operations. It would require a great effort to take care of all these possibilities everywhere. It would be better to pack most of these situations into functions inside the dict_t itself where it is easier to combine some operations. By the way, I've made some tests with multiple bricks and it seems that there is a clear speed loss on directory listings as the number of bricks increases. Since bricks should be independent and they can work in parallel, I didn't expected such a big performance degradation. The likely reason is that, even though bricks are parallel for IO, readdir is essentially a sequential operation and DHT has a limitation that a readdir reply batch does not cross server boundaries. 
So if you have 10 files and 1 server, all 10 entries are returned in one call to the app/libc. If you have 10 files and 10 servers evenly distributed, the app/libc has to perform 10 calls and keeps getting one file at a time. This problem goes away when each server has enough files to fill up a readdir batch. It's only when you have too few files and too many servers that this dilution problem shows up. However, this is just a theory and your problem may be something else too.. Note that Brian Foster's readdir-ahead patch should address this problem to a large extent. When loaded on top of DHT, the prefiller effectively collapses the smaller chunks returned by DHT into a larger chunk requested by the app/libc. Avati However the tests have not been exhaustive nor made in best conditions so they might be misleading. Anyway it seems to me that there might be a problem with some mutexes that force too much serialization of requests (though I have no real proves it's only a feeling). Maybe some more asynchronousity on calls between translators could help. Only some thoughts... Best regards, Xavi __**_ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/**mailman/listinfo/gluster-develhttps://lists.nongnu.org/mailman/listinfo/gluster-devel __**_ Gluster-devel mailing list
Re: [Gluster-devel] [RFC ] dictionary optimizations
On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez xhernan...@datalab.eswrote: Hi, dict_t structures are widely used in glusterfs. I've some ideas that could improve its performance. * On delete operations, return the current value if it exists. This is very useful when we want to get a value and remove it from the dictionary. This way it can be done accessing and locking the dict_t only once (and it is atomic). Makes sense. * On add operations, return the previous value if it existed. This avoids to use a lookup and a conditional add (and it is atomic). Do you mean dict_set()? If so, how do you propose we differentiate between failure and previous value did not exist? Do you propose setting the previous value into a pointer to pointer, and retain the return value as is today? * Always return the data_pair_t structure instead of data_t or the data itself. This can be useful to avoid future lookups or other operations on the same element. Macros can be created to simplify writing code to access the actual value. The use case is not clear. A more concrete example will help.. * Use a trie instead of a hash. A trie structure is a bit more complex than a hash, but only processes the key once and does not need to compute the hash. A test implementation I made with a trie shows a significant improvement in dictionary operations. There is already an implementation of trie in libglusterfs/src/trie.c. Though it does not compact (collapse) single-child nodes upwards into the parent. In any case, let's avoid having two implementations of tries. * Implement dict_foreach() as a macro (similar to kernel's list_for_each()). This gives more control and avoids the need of helper functions. This makes sense too, but there are quite a few users of dict_foreach in the existing style. Moving them all over might be a pain. Additionally, I think it's possible to redefine structures to reduce the number of allocations and pointers used for each element (actual data, data_t, data_pair_t and key). This is highly desirable. There was some effort from Amar in the past ( http://review.gluster.org/3910) but it has been in need of attention for some time. It would be intersting to know if you were thinking along similar lines? Avati What do you think ? Best regards, Xavi __**_ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/**mailman/listinfo/gluster-develhttps://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
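For the dict_foreach()-as-a-macro idea above, a minimal sketch in the style of the kernel's list_for_each(); it assumes the existing members_list/next fields of dict_t/data_pair_t and leaves locking to the caller, both of which are simplifications:

    #define dict_for_each(dict, pair) \
            for ((pair) = (dict)->members_list; (pair); (pair) = (pair)->next)

    /* usage:
     *         data_pair_t *pair = NULL;
     *
     *         LOCK (&dict->lock);
     *         dict_for_each (dict, pair)
     *                 gf_log ("demo", GF_LOG_TRACE, "key=%s", pair->key);
     *         UNLOCK (&dict->lock);
     */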
Re: [Gluster-devel] [RFC ] dictionary optimizations
On Tue, Sep 3, 2013 at 1:42 AM, Xavier Hernandez xhernan...@datalab.es wrote: On 03/09/13 09:33, Anand Avati wrote: On Mon, Sep 2, 2013 at 7:24 AM, Xavier Hernandez xhernan...@datalab.es wrote: Hi, dict_t structures are widely used in glusterfs. I have some ideas that could improve their performance. * On delete operations, return the current value if it exists. This is very useful when we want to get a value and remove it from the dictionary. This way it can be done accessing and locking the dict_t only once (and it is atomic). Makes sense. * On add operations, return the previous value if it existed. This avoids using a lookup and a conditional add (and it is atomic). Do you mean dict_set()? If so, how do you propose we differentiate between failure and 'previous value did not exist'? Do you propose setting the previous value into a pointer to pointer, and retaining the return value as it is today? Yes, I'm thinking of something similar to dict_set() (by the way, I would remove the dict_add() function). dict_add() is used in unserialization routines where dict_set() for a big set of keys guaranteed not to repeat is very expensive (unserializing would otherwise have a quadratic function as its asymptote). What is the reason you intend to remove it? What you propose would be the simplest solution right now. However, I think it would be interesting to change the return value to an error code (this would supply more detailed information in case of failure, and we could use EEXIST to know if the value already existed; in fact, I think it would be interesting to progressively change the -1 return code of many functions to an error code). The pointer-to-pointer argument could be NULL if the previous value is not needed. Of course this would change the function signature, breaking a lot of existing code. Another possibility could be to create a dict_replace() function, and possibly make it fail if the value didn't exist. It is best we do not change the meaning of existing APIs, and just add new APIs instead. The new API can be: int dict_replace (dict_t *dict, const char *key, data_t *newval, data_t **oldval); .. and leave dict_set() as is. * Always return the data_pair_t structure instead of data_t or the data itself. This can be useful to avoid future lookups or other operations on the same element. Macros can be created to simplify writing code to access the actual value. The use case is not clear. A more concrete example would help. Having a data_pair_t could help to navigate from an existing element (getting the next or previous element; this is really interesting if dict were implemented using a sorted structure like a trie, since it would allow processing a set of similar entries very fast, like the trusted.afr.brick values for example), or removing or replacing it without needing another lookup (a more detailed analysis would be needed to see how to handle race conditions). By the way, is the dict_t structure really used concurrently? I haven't analyzed all the code deeply, but it seems to me that every dict_t is only accessed from a single place at once. There have been instances of dict_t getting used concurrently, when used as xdata and in xattrop (by AFR). There have been bugs in the past with concurrent dict access. * Use a trie instead of a hash. A trie structure is a bit more complex than a hash, but only processes the key once and does not need to compute the hash. A test implementation I made with a trie shows a significant improvement in dictionary operations.
There is already an implementation of trie in libglusterfs/src/trie.c. Though it does not compact (collapse) single-child nodes upwards into the parent. In any case, let's avoid having two implementations of tries. I know. The current implementation wastes a lot of memory because it uses an array of 256 pointers, and in some places it needs to traverse the array. Not a big deal, but if it is done many times it could be noticeable. In my test I used a trie with 4 child pointers (with collapsing of single-child nodes) that runs a bit faster than the 256 implementation and uses much less memory. I tried with 2, 4, 16 and 256 children per node, and 4 seems to be the best (at least for dictionary structures), though there is very little difference between 4 and 16 in terms of speed. The 256 child pointers give you constant-time lookup for the next-level child with just an offset indirection. With smaller fan-out, do you search through the list? Can you show an example of this? Collapsing single-child nodes upwards is badly needed though. I agree that it is not good to maintain two implementations of the same thing. Maybe we could change the trie implementation. It should be transparent. Yes, I believe the current API can accommodate such internal changes. * Implement dict_foreach() as a macro (similar to kernel's list_for_each
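Regarding the dict_foreach()-as-a-macro idea quoted above, here is a toy sketch of what a list_for_each()-style macro could look like; the field names (members_list, next, key) are illustrative of a linked-list-backed dict, not a claim about the exact current libglusterfs layout:

/* Toy, self-contained sketch of a list_for_each()-style iteration
 * macro: no helper callback is needed, unlike today's dict_foreach(). */
#include <stdio.h>

typedef struct _toy_pair {
        struct _toy_pair *next;
        const char       *key;
        const char       *value;
} toy_pair_t;

typedef struct {
        toy_pair_t *members_list;
} toy_dict_t;

#define toy_dict_for_each(dict, pair) \
        for ((pair) = (dict)->members_list; (pair); (pair) = (pair)->next)

int
main (void)
{
        toy_pair_t p2 = { NULL, "trusted.afr.brick-1", "0" };
        toy_pair_t p1 = { &p2,  "trusted.afr.brick-0", "1" };
        toy_dict_t d  = { &p1 };
        toy_pair_t *pair = NULL;

        toy_dict_for_each (&d, pair)
                printf ("%s = %s\n", pair->key, pair->value);
        return 0;
}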
Re: [Gluster-devel] Change in glusterfs[master]: libglusterfs: use safer symbol resolution strategy with our ...
[cc'ing gluster-devel on the final consensus] On 08/30/2013 01:19 PM, Brian Foster wrote: On 08/30/2013 04:01 PM, Anand Avati wrote: On 8/30/13 7:50 AM, Brian Foster wrote: On 08/30/2013 09:51 AM, Brian Foster wrote: On 08/29/2013 09:08 PM, Emmanuel Dreyfus wrote: Anand Avati (Code Review) rev...@dev.gluster.org wrote: ... TBH, I'm not totally sure how much this impacts things that aren't cross DSO conflicts, so I ran a quick test. If I define a function in an executable and a dso and call the function from both points, the library always invokes the local version if the library is loaded via dlopen(). In fact, if I remove the local version from the library, I hit an undefined symbol error when I attempt to invoke the library call. Note that the library does invoke the executable version of the function if a compile time link dependency is made (i.e., even if the library is still referenced via dlopen()/dlsym(), but not loaded by that method). I suspect there is something going on here at compile/link time that determines whether the executable exports the symbol, but that's an initial guess based on observed behavior. FYI, an elf dump of both executables shows that they differ in whether my duplicate symbol is listed in the executable .dynsym (dynamic symbol table) section or not. A dump of my locally installed qemu-kvm executable shows a couple exported block driver symbols: bdrv_aio_readv and bdrv_aio_writev. Brian Given that, perhaps the best thing to do is hold off on the RTLD_DEEPBIND bits until we understand the behavior a bit more conclusively here and evaluate whether it's really an issue in the weird qemu case. Thoughts? (FWIW, I think the change should at least hold with regard to the RTLD_LOCAL bits. A translator is already intended to be a black box with a few specially named interfaces. I don't see any reason to pollute the global namespace with all kinds of extra symbols from different translators). NetBSD has different OS-specific flags, but I am not sure wether they overlap or not: Appears to define a similar behavior, but this apparently applies to an explicit search of a symbol in a library as opposed to how the external dependencies of the library are resolved. Brian P.S., All tests run on Linux. The following special handle values may be used with dlsym(): (...) RTLD_SELF The search for symbol is limited to the shared object issuing the call to dlsym() and those shared objects which were loaded after it that are visible. Using RTLD_LOCAL solves one half of the confusion, and I think there is no disagreement that RTLD_GLOBAL must be changed to RTLD_LOCAL. This solves the problem of qemu-block translator's symbols not getting picked up by qemu when libgfapi for the hypervisor's calls. Agreed. The other half of the problem - functions in qemu getting called instead of same named functions in qemu-block translator is the open concern. Brian, you report above that a DSO always uses the version which is available in the same DSO. If that is the case (which sounds good), I don't understand why RTLD_DEEPBIND is required. Not always... I suspect there are two conditions here. 1.) the executable includes the symbol in its dynamic symbol table. 2.) the dso is compiled/linked to look for a symbol through the dynamic symbol tables (I suspect via PLT entries). The experiment I outlined before effectively toggles #1. By linking the library directly or not, I'm controlling whether the executable exports the dependent symbol. 
It appears #2 can be controlled by something like the -Bsymbolic linker option. For example, objdump a library with and without that option and you can see the call sites to a particular function either refer to library (relative?) offsets or PLT entries. Using this option, the library always uses the local version regardless of whether the symbol is exported from the executable. I am concerned about bugs like 994314, where a symbol defined in qemu (uuid_is_null()) got picked up instead of the one in libglusterfs.so, for a call made from libglusterfs.so. It may be the case that the problem occurred here because libglusterfs.so was not dlopen()ed, but dynamically linked as -lglusterfs at build time. It might not have mattered in that particular case. If the symbol was exported by the executable (i.e., condition #1), it probably/possibly could get bound to the unexpected symbol. What was the ultimate fix for that bug? While it appears that right now qemu does not export many bdrv_* symbols, it still seems like it could be a problem if we used those symbols or the nature of the executable changed in the future (i.e., #1 is not under our control). For that reason, I'd suggest we use something like -Bsymbolic for qemu-block to address the second half of the problem (assuming it doesn't break anything else :P). I mentioned this on #gluster-dev btw, and Kaleb pointed out
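For illustration, a minimal sketch of the loader flags under discussion. This is not the actual xlator-loading code: the translator path below is made up, and the flags shown are just the generic dlopen() ones (RTLD_LOCAL, plus the glibc-specific RTLD_DEEPBIND whose use is still being debated here):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int
main (void)
{
        void *handle = NULL;

        /* hypothetical translator path, for illustration only */
        const char *xl = "/usr/lib/glusterfs/xlator/example.so";

        /* RTLD_LOCAL keeps the translator's symbols out of the global
         * namespace; RTLD_DEEPBIND (where available) makes the DSO
         * prefer its own definitions over same-named symbols that the
         * executable (e.g. qemu) already exports. */
        handle = dlopen (xl, RTLD_NOW | RTLD_LOCAL
#ifdef RTLD_DEEPBIND
                              | RTLD_DEEPBIND
#endif
                        );
        if (!handle) {
                fprintf (stderr, "dlopen failed: %s\n", dlerror ());
                return 1;
        }
        dlclose (handle);
        return 0;
}

The -Bsymbolic route mentioned above works at link time rather than load time, so the two approaches address different halves of the same binding problem.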
Re: [Gluster-devel] [Gluster-users] GlusterFS 3.4.1 planning
For those interested in what the possible patches are, here is a short list of commits which are available in master but not yet backported to release-3.4 (note: the full list runs to over 500 commits; this is a short list of patches which fix some kind of an issue - crash, leak, incorrect behavior, failure) http://www.gluster.org/community/documentation/index.php/Release_341_backport_candidates Some of the patches towards the end fix some nasty issues. Many of them are covered in the bugs listed in http://www.gluster.org/community/documentation/index.php/Backport_Wishlist. If there are specific patches you would like to see backported, please copy/paste those lines from the Release_341_backport_candidates page into the Backport_Wishlist page. For the others, we will be using a best judgement call based on severity and patch impact. Avati On Fri, Aug 9, 2013 at 2:23 AM, Vijay Bellur vbel...@redhat.com wrote: Hi All, We are considering 3.4.1 to be released in the last week of this month. If you are interested in seeing specific bugs addressed or patches included in 3.4.1, can you please update them here: http://www.gluster.org/community/documentation/index.php/Backport_Wishlist Thanks, Vijay ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] disabling caching and other optimizations for internal fops
Setting a key in xdata has the benefit of getting propagated to the server side without a change in the protocol. However, that being said, dict_t in its current form is not the most efficient data structure for storing a lot of key/values (the biggest concern being too many small allocations). It will be good to revive http://review.gluster.org/3910 so that such use of xdata will be of lesser concern. Avati On Tue, Aug 27, 2013 at 12:12 AM, Raghavendra Bhat rab...@redhat.com wrote: Hi, As of now, the performance xlators cache data and perform some optimizations for all the fops, irrespective of whether the fop is generated by the application or by an internal xlator. I think performance xlators should come into the picture only for the fops generated by applications. Imagine the situation where a graph change happens and the fuse xlator sends open calls on the fds to migrate them to the new graph. But the open call might not reach posix if open-behind unwinds success to the fuse xlator. It can be done in 2 ways. 1) Set a key in the dictionary if the call is generated internally, OR 2) Set a flag in the callstack itself, indicating whether the fop is an internal fop or generated by the application. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
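As a rough sketch of option (1) above, assuming the usual glusterfs xlator headers and APIs (dict_new, dict_set_int32, dict_get, STACK_WIND); the key name "glusterfs.internal-fop" is made up for illustration, not an agreed-upon constant, and this is a fragment rather than a complete xlator:

/* caller side (e.g. the graph-switch open that migrates fds) */
dict_t *xdata = dict_new ();
if (xdata)
        dict_set_int32 (xdata, "glusterfs.internal-fop", 1);
/* ... wind the open down with this xdata ... */

/* performance xlator side (e.g. open-behind's open fop) */
if (xdata && dict_get (xdata, "glusterfs.internal-fop")) {
        /* internally generated fop: skip the caching/short-circuit
         * path and wind it straight down so it reaches posix */
        STACK_WIND (frame, default_open_cbk, FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->open, loc, flags, fd, xdata);
        return 0;
}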
Re: [Gluster-devel] The return of the all-null pending matrix
On Mon, Aug 26, 2013 at 6:12 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Anand Avati anand.av...@gmail.com wrote: This is a tricky problem. I have thought about this quite a bit and couldn't come up with a theory which can lead to this behavior. I am suspecting this might be a 32/64bit compatibility issue. Can you try placing all bricks on the same architecture and see if this can be reproduced? How long does it take to reproduce this problem? Now we know this is probably the root of the bug, do you want to track it further, or shall we call that kind of setup unsupported? It is certainly an interesting issue, and it will be good to fix in the long run (supporting heterogeneous bricks can open up very interesting use cases). Can you please file a bug in http://bugzilla.redhat.com with all the logs as a low-priority bug so that we can track it and fix it sometime, rather than ignore it intentionally? Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Fwd: FileSize changing in GlusterNodes
On Sun, Aug 25, 2013 at 11:23 PM, Vijay Bellur vbel...@redhat.com wrote: File size as reported on the mount point and the bricks can vary because of this code snippet in iatt_from_stat(): { uint64_t maxblocks; maxblocks = (iatt->ia_size + 511) / 512; if (iatt->ia_blocks > maxblocks) iatt->ia_blocks = maxblocks; } This snippet was brought in to improve accounting behaviour for quota, which would fail with disk file systems that perform speculative pre-allocation. If this aids only specific use cases, I think we should make the behaviour configurable. Thoughts? -Vijay This is very unlikely to be the problem. st_blocks field values do not influence md5sum behavior in any way. The file size (st_size) would, but both du -k and the above code snippet only deal with st_blocks. Bobby, it would help if you can identify the mismatching file and inspect it to see what the difference between the two files is. Avati Original Message Subject: [Gluster-users] FileSize changing in GlusterNodes Date: Wed, 21 Aug 2013 05:35:40 + From: Bobby Jacob bobby.ja...@alshaya.com To: gluster-us...@gluster.org Hi, When I upload files into the gluster volume, it replicates all the files to both gluster nodes. But the file size varies slightly (by 4-10KB), which changes the md5sum of the file. Command to check file size: du –k *. I'm using glusterFS 3.3.1 with CentOS 6.4. This is creating inconsistency between the files on both the bricks. What is the reason for this changed file size and how can it be avoided? Thanks Regards, *Bobby Jacob* ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Fwd: FileSize changing in GlusterNodes
On Mon, Aug 26, 2013 at 9:40 AM, Vijay Bellur vbel...@redhat.com wrote: On 08/26/2013 10:04 PM, Anand Avati wrote: On Sun, Aug 25, 2013 at 11:23 PM, Vijay Bellur vbel...@redhat.com wrote: File size as reported on the mount point and the bricks can vary because of this code snippet in iatt_from_stat(): { uint64_t maxblocks; maxblocks = (iatt->ia_size + 511) / 512; if (iatt->ia_blocks > maxblocks) iatt->ia_blocks = maxblocks; } This snippet was brought in to improve accounting behaviour for quota, which would fail with disk file systems that perform speculative pre-allocation. If this aids only specific use cases, I think we should make the behaviour configurable. Thoughts? -Vijay This is very unlikely to be the problem. st_blocks field values do not influence md5sum behavior in any way. The file size (st_size) would, but both du -k and the above code snippet only deal with st_blocks. I was referring to du -k as seen on the bricks and the mount point. I was certainly not referring to the md5sum difference. -Vijay I thought he was comparing du -k between the two bricks (the sentence felt that way). In any case, the above code snippet should do something meaningful only when the file is still held open. XFS should discard the extra allocations after close() anyway. Bobby, it would help if you can identify the mismatching file and inspect it to see what the difference between the two files is. Avati Original Message Subject: [Gluster-users] FileSize changing in GlusterNodes Date: Wed, 21 Aug 2013 05:35:40 + From: Bobby Jacob bobby.ja...@alshaya.com To: gluster-us...@gluster.org Hi, When I upload files into the gluster volume, it replicates all the files to both gluster nodes. But the file size varies slightly (by 4-10KB), which changes the md5sum of the file. Command to check file size: du –k *. I'm using glusterFS 3.3.1 with CentOS 6.4. This is creating inconsistency between the files on both the bricks. What is the reason for this changed file size and how can it be avoided? Thanks Regards, *Bobby Jacob* ___ Gluster-users mailing list gluster-us...@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
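For reference, the quoted snippet restated as a standalone function with a worked example; the 1 MiB / 4 MiB numbers are made up to show how the clamp masks speculative pre-allocation in the reported block count while leaving st_size (and hence md5sum) untouched:

#include <stdint.h>
#include <stdio.h>

/* Same logic as the iatt_from_stat() snippet quoted above: never
 * report more 512-byte blocks than the file size itself implies. */
static uint64_t
clamp_blocks (uint64_t ia_size, uint64_t ia_blocks)
{
        uint64_t maxblocks = (ia_size + 511) / 512;

        if (ia_blocks > maxblocks)
                ia_blocks = maxblocks;
        return ia_blocks;
}

int
main (void)
{
        /* 1 MiB file, but the brick file system has speculatively
         * pre-allocated 4 MiB while the file is still held open */
        uint64_t size   = 1048576;
        uint64_t blocks = 8192;     /* 4 MiB in 512-byte blocks */

        printf ("reported blocks: %llu (was %llu)\n",
                (unsigned long long) clamp_blocks (size, blocks),
                (unsigned long long) blocks);
        /* prints 2048, i.e. ceil(size / 512) */
        return 0;
}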
Re: [Gluster-devel] The return of the all-null pending matrix
This is a tricky problem. I have thought about this quite a bit and couldn't come up with a theory which can lead to this behavior. I am suspecting this might be a 32/64bit compatibility issue. Can you try placing all bricks on the same architecture and see if this can be reproduced? How long does it take to reproduce this problem? Avati On Mon, Aug 19, 2013 at 9:30 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Hi Did you have a look at it? Any idea on what is going on? On Wed, Aug 14, 2013 at 07:21:13AM +0200, Emmanuel Dreyfus wrote: Anand Avati anand.av...@gmail.com wrote: I was going through your log files again. Correct me if I'm wrong, the issue in the log is with the file tparm.po, right? Yes, this one raises a split brain. We have other all-zero pending matrices in the log on other files in the minutes leading to the problem, though. But at least they do not raise an error. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Issues with fallocate, discard and zerofill
Shishir, Is this in reference to the dht open file rebalance (of replaying the operations to the destination server)? I am assuming so, as that is something which has to be handled. The other question is how fallocate/discard should be handled by self-heal in AFR. I'm not sure how important it is, but it will certainly be good to bounce some ideas off here. Maybe we should implement a fiemap fop to query extents/holes and replay them on the other server? Avati On Tue, Aug 13, 2013 at 10:49 PM, Bharata B Rao bharata@gmail.com wrote: Hi Avati, Brian, During the recently held gluster meetup, Shishir mentioned a potential problem (related to fd migration etc.) in the zerofill implementation (http://review.gluster.org/#/c/5327/) and also mentioned that same/similar issues are present with the fallocate and discard implementations. Since zerofill has been modelled on fallocate/discard, I was wondering if it would be possible to address these issues in fallocate/discard first so that we could potentially follow the same in the zerofill implementation. Regards, Bharata. -- http://raobharata.wordpress.com/ ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
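One possible shape of the "query extents/holes and replay them" idea, sketched with SEEK_DATA/SEEK_HOLE (Linux >= 3.1) rather than an actual FIEMAP ioctl or a new fop; only the enumeration half is shown, and how a healer would replay each range (fallocate/discard/zerofill/plain writes) on the sink is left out:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Walk a file and print its data extents; the implicit hole at EOF
 * terminates the last extent. */
static void
dump_extents (int fd)
{
        off_t end  = lseek (fd, 0, SEEK_END);
        off_t data = 0, hole = 0;

        for (data = lseek (fd, 0, SEEK_DATA); data >= 0 && data < end;
             data = lseek (fd, hole, SEEK_DATA)) {
                hole = lseek (fd, data, SEEK_HOLE);
                if (hole < 0)
                        break;
                printf ("data extent: [%lld, %lld)\n",
                        (long long) data, (long long) hole);
                /* a healer would replay this range on the sink here */
        }
}

int
main (int argc, char *argv[])
{
        int fd;

        if (argc < 2)
                return 1;
        fd = open (argv[1], O_RDONLY);
        if (fd < 0)
                return 1;
        dump_extents (fd);
        close (fd);
        return 0;
}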
Re: [Gluster-devel] Proposal for Gluster 3.5: New test framework
Justin, Thanks for firing up this thread. Are there notable projects which use these frameworks? Do you have any info on what other distributed storage projects use for their automated testing? Thanks, Avati On Mon, Aug 12, 2013 at 10:07 AM, Justin Clift jcl...@redhat.com wrote: Hi all, For Gluster 3.5, I'd like to propose we get some kind of *multi-node* testing framework in place for Gluster. The existing test framework is single node only, which doesn't fit well for a distributed file system. I've recently looked into Autotest in depth, but ruled it out since it's: * Linux only (ugh) * Very hard to figure out for newbies * Close to zero documentation * Opaque/unreadable source * Painful to work with :( Potentially we could use STAF (staf.sourceforge.net). I've not investigated this in depth yet, but from its website it's: * Cross Platform http://staf.sourceforge.net/current/STAFFAQ.htm#d0e36 * Seems like extensive documentation, and usable with several languages: http://staf.sourceforge.net/current/STAFPython.htm * Seems like a _reasonably_ active Community http://sourceforge.net/p/staf/mailman/staf-users/ I'll put this info into a Feature Page if people think it's worth writing up and taking further? Regards and best wishes, Justin Clift -- Open Source and Standards @ Red Hat twitter.com/realjustinclift ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 12:25 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 07/29/2013 12:18 AM, Anand Avati wrote: On Sun, Jul 28, 2013 at 11:18 AM, Vijay Bellur vbel...@redhat.com wrote: Hi All, There was a recent thread on fedora-devel about bloated glusterfs dependency for qemu: https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html As of today, we have the following packages and respective primary constituents: 1. glusterfs - contains all the common xlators, libglusterfs, glusterfsd binary, glusterfs symlink to glusterfsd. 2. glusterfs-rdma - rdma shared library 3. glusterfs-geo-replication - geo-rep related objects 4. glusterfs-fuse - fuse xlator 5. glusterfs-server - server side xlators, config files 6. glusterfs-api - libgfapi shared library 7. glusterfs-resource-agents - OCF resource agents 8. glusterfs-devel - Header files for libglusterfs 9. glusterfs-api-devel - Header files for gfapi As far as qemu is concerned, qemu depends on glusterfs-api which in turn is dependent on glusterfs. Much of the apparent bloat is coming from the glusterfs package, and one proposal for reducing the dependency footprint of consumers of libgfapi could be the following: a) Move glusterfsd and the glusterfs symlink from 'glusterfs' to 'glusterfs-server' b) Package the glusterfsd binary and glusterfs symlink in 'glusterfs-fuse' Does that mean glusterfsd is in glusterfs-server or glusterfs-fuse? It is probably sufficient to leave glusterfs-fuse with just fuse.so and mount.glusterfs.in Another model can be: 0. glusterfs-libs.rpm - libglusterfs.so libgfrpc.so libgfxdr.so 1. glusterfs (depends on glusterfs-libs) - glusterfsd binary, glusterfs symlink, all common xlators 2. glusterfs-rdma (depends on glusterfs) - rdma shared library 3. glusterfs-geo-replication (depends on glusterfs) - geo-rep related objects 4. glusterfs-fuse (depends on glusterfs) - fuse xlator, mount.glusterfs 5. glusterfs-server (depends on glusterfs) - server side xlators, config files 6. glusterfs-api (depends on glusterfs-libs) - libgfapi.so and api.so 7. glusterfs-resource-agents (depends on glusterfs) 8. glusterfs-devel (depends on glusterfs-libs) - header files for libglusterfs 9. glusterfs-api-devel (depends on glusterfs-api) - header files for gfapi This way qemu will only pick up libgfapi.so, libglusterfs.so, libgfrpc.so and libgfxdr.so (the bare minimum to just execute) for the binary to load at run time. Those who want to store VM images natively on gluster must also do a 'yum install glusterfs' to make gfapi 'useful'. This way Fedora qemu users who do not plan to use gluster will not get any of the xlator cruft. Looks like even after the re-packaging, the original problem is still there! Post re-structuring (I am on F19 with the updates-testing repo enabled) glusterfs-api has a dep on -libs and glusterfs. So when the user installs glusterfs-api, it pulls in -libs and glusterfs. This is correct, since w/o the glusterfs rpm we won't have a working qemu gluster backend. Actually this *wasn't* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. Just allowing qemu to execute by way of installing -libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend, things will fail unless the user has installed the glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded.
That a user would need to install glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? Avati So today ... yum install glusterfs-api brings in glusterfs-libs and glusterfs which sounds correct to get a working system with qemu gluster backend. Later... yum remove glusterfs removes glusterfs-api which has a reverse dep on qemu, hence libvirt hence the entire virt stack goes down which was the original problem reported in the fedora devel list @ https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html and that unfortunately is still there, even after -libs was created as a separate rpm as part of this effort! thanx, deepak ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal for Gluster 3.5: Better peer identification
On Tue, Aug 13, 2013 at 4:05 AM, Kaushal M kshlms...@gmail.com wrote: Hi all, We recently had a mailing list discussion about the current problems with peer identification and handling multiple networks. This proposal is regarding better identification of peers. Currently, the way we identify peers is not consistent all through the gluster code. We use uuids internally and hostnames externally. This setup works pretty well when all the peers are on a single network, have one address, and are referred to in all the gluster commands with the same address. But once we start mixing up addresses in the commands (ip, shortnames, fqdn) and bring in multiple networks, we have problems. The problems were discussed in the following mailing list threads and some solutions were proposed. - How do we identify peers? [1] - RFC - Connection Groups concept [2] The solution to the multi-network problem is dependent on the solution to the peer identification problem. So it'll be good to target fixing the peer identification problem asap, i.e. in 3.5, and take up the networks problem later. Thoughts? Thanks for the proposal Kaushal. This is a welcome change. It will be great to have all internal identification of peers happen through UUIDs and get translated into a host/IP only at the most superficial layer. There are open issues around node crash + re-install with the same IP (but a new UUID) which need to be addressed in this effort. Avati - Kaushal -- [1] http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00067.html [2] http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00069.html ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
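Not glusterd code, just a sketch of the data shape this proposal implies (field names and sizes are made up): the UUID is the one durable identity, and any number of addresses merely map back to it at the outermost layer.

#include <uuid/uuid.h>

#define MAX_PEER_ADDRS 8

struct peer_addr {
        char hostname[256];          /* IP, shortname or FQDN as given in CLI */
};

struct peer {
        uuid_t           uuid;       /* the only internal identifier          */
        struct peer_addr addrs[MAX_PEER_ADDRS];
        int              naddrs;
        /* the open issue mentioned above: a re-installed node shows up
         * with the same address but a new uuid, so some explicit
         * replace/readmit step keyed on the address would be needed. */
};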
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 1:40 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 08/14/2013 01:37 PM, Anand Avati wrote: On Wed, Aug 14, 2013 at 12:25 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 07/29/2013 12:18 AM, Anand Avati wrote: On Sun, Jul 28, 2013 at 11:18 AM, Vijay Bellur vbel...@redhat.comwrote: Hi All, There was a recent thread on fedora-devel about bloated glusterfs dependency for qemu: https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html As of today, we have the following packages and respective primary constituents: 1. glusterfs - contains all the common xlators, libglusterfs, glusterfsd binary glusterfs symlink to glusterfsd. 2. glusterfs-rdma- rdma shared library 3. glusterfs-geo-replication - geo-rep related objects 4. glusterfs-fuse- fuse xlator 5. glusterfs-server - server side xlators, config files 6. glusterfs-api - libgfapi shared library 7. glusterfs-resource-agents - OCF resource agents 8. glusterfs-devel - Header files for libglusterfs 9. glusterfs-api-devel - Header files for gfapi As far as qemu is concerned, qemu depends on glusterfs-api which in turn is dependent on glusterfs. Much of the apparent bloat is coming from glusterfs package and one proposal for reducing the dependency footprint of consumers of libgfapi could be the following: a) Move glusterfsd and glusterfs symlink from 'glusterfs' to 'glusterfs-server' b) Package glusterfsd binary and glusterfs symlink in 'glusterfs-fuse' Does that mean glusterfsd is in glusterfs-server or glusterfs-fuse? It is probably sufficient to leave glusterfs-fuse just have fuse.so and mount.glusterfs.in Another model can be: 0. glusterfs-libs.rpm - libglusterfs.so libgfrpc.so libgfxdr.so 1. glusterfs (depends on glusterfs-libs) - glusterfsd binary, glusterfs symlink, all common xlators 2. glusterfs-rdma (depends on glusterfs) - rdma shared library 3. glusterfs-geo-replication (depends on glusterfs) - geo-rep related objects 4. glusterfs-fuse (depends on glusterfs) - fuse xlator, mount.glusterfs 5. glusterfs-server (depends on glusterfs) - server side xlators, config files 6. glusterfs-api (depends on glusterfs-libs) - libgfapi.so and api.so 7. glusterfs-resource-agents (depends on glusterfs) 8. glusterfs-devel (depends on glusterfs-libs) - header files for libglusterfs 9. glusterfs-api-devel (depends on glusterfs-api) - header files for gfapi This way qemu will only pick up libgfapi.so libglusterfs.so libgfrpc.so and libgfxdr.so (the bare minimum to just execute) for the binary to load at run time. Those who want to store vm images natively on gluster must also do a 'yum install glusterfs' to make gfapi 'useful'. This way Fedora qemu users who do not plan to use gluster will not get any of the xlator cruft. Looks like even after the re-packaging.. the original problem is still there ! Post re-strucuring ( i am on F19 with updates-testing repo enabled) gluserfs-api has dep on -libs and glusterfs So when User install glusterfs-api, it pulls in -libs and glusterfs This is correct, since w/o glusterfs rpm we won't have a working qemu gluster backend. Actually this *wasnt* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. Just allowing qemu to execute by way of installing-libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend.. 
things will fail unless the user has installed the glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded. That a user would need to install the glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? Putting on a user's hat... I think it's a problem. IIUC, what you are saying is that the user must be aware that he/she needs to install glusterfs in order to use the qemu gluster backend. The user may argue: why didn't you install glusterfs as part of the qemu yum install itself? Expecting the user (who may or may not be gluster/virt aware) to install an additional rpm to use qemu with gluster might not always work. Who will inform the user to install glusterfs when things fail at runtime? Your view is in direct contradiction with the view of those who objected to the dependency to start with :-) I think this question needs to be reconciled with the initial reporters. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 1:54 AM, Harshavardhana har...@harshavardhana.netwrote: Actually this *wasnt* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. rpm uses 'ldd' command to get dependencies for 'glusterfs-api' to 'glusterfs-libs' - automatically. You don't need a forced specification. Specifying runtime time dependency is done this way %package api Summary: Clustered file-system api library Group:System Environment/Daemons Requires: %{name} = %{version}-%{release} --- Install-time dependency. Just allowing qemu to execute by way of installing-libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend.. things will fail unless User has installed glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded. That a user would need to install glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? The problem here is user awareness - it generates additional cycles of communication. In this case 'qemu' should have a direct dependency on 'glusterfs.rpm' and 'glusterfs-api' when provided with gfapi support - wouldn't this solve the problem? This would solve your version of the problem. But the original concern raised was that the whole shebang of glusterfs translators and transports get installed for someone who wants libvirt/qemu and doesn't care what glusterfs even is. Your version of the problem is in direct contradiction with the initially reported problem for which the restructuring was proposed. ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Wed, Aug 14, 2013 at 2:16 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 08/14/2013 02:23 PM, Anand Avati wrote: On Wed, Aug 14, 2013 at 1:40 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 08/14/2013 01:37 PM, Anand Avati wrote: On Wed, Aug 14, 2013 at 12:25 AM, Deepak C Shetty deepa...@linux.vnet.ibm.com wrote: On 07/29/2013 12:18 AM, Anand Avati wrote: On Sun, Jul 28, 2013 at 11:18 AM, Vijay Bellur vbel...@redhat.comwrote: Hi All, There was a recent thread on fedora-devel about bloated glusterfs dependency for qemu: https://lists.fedoraproject.org/pipermail/devel/2013-July/186484.html As of today, we have the following packages and respective primary constituents: 1. glusterfs - contains all the common xlators, libglusterfs, glusterfsd binary glusterfs symlink to glusterfsd. 2. glusterfs-rdma- rdma shared library 3. glusterfs-geo-replication - geo-rep related objects 4. glusterfs-fuse- fuse xlator 5. glusterfs-server - server side xlators, config files 6. glusterfs-api - libgfapi shared library 7. glusterfs-resource-agents - OCF resource agents 8. glusterfs-devel - Header files for libglusterfs 9. glusterfs-api-devel - Header files for gfapi As far as qemu is concerned, qemu depends on glusterfs-api which in turn is dependent on glusterfs. Much of the apparent bloat is coming from glusterfs package and one proposal for reducing the dependency footprint of consumers of libgfapi could be the following: a) Move glusterfsd and glusterfs symlink from 'glusterfs' to 'glusterfs-server' b) Package glusterfsd binary and glusterfs symlink in 'glusterfs-fuse' Does that mean glusterfsd is in glusterfs-server or glusterfs-fuse? It is probably sufficient to leave glusterfs-fuse just have fuse.so and mount.glusterfs.in Another model can be: 0. glusterfs-libs.rpm - libglusterfs.so libgfrpc.so libgfxdr.so 1. glusterfs (depends on glusterfs-libs) - glusterfsd binary, glusterfs symlink, all common xlators 2. glusterfs-rdma (depends on glusterfs) - rdma shared library 3. glusterfs-geo-replication (depends on glusterfs) - geo-rep related objects 4. glusterfs-fuse (depends on glusterfs) - fuse xlator, mount.glusterfs 5. glusterfs-server (depends on glusterfs) - server side xlators, config files 6. glusterfs-api (depends on glusterfs-libs) - libgfapi.so and api.so 7. glusterfs-resource-agents (depends on glusterfs) 8. glusterfs-devel (depends on glusterfs-libs) - header files for libglusterfs 9. glusterfs-api-devel (depends on glusterfs-api) - header files for gfapi This way qemu will only pick up libgfapi.so libglusterfs.so libgfrpc.so and libgfxdr.so (the bare minimum to just execute) for the binary to load at run time. Those who want to store vm images natively on gluster must also do a 'yum install glusterfs' to make gfapi 'useful'. This way Fedora qemu users who do not plan to use gluster will not get any of the xlator cruft. Looks like even after the re-packaging.. the original problem is still there ! Post re-strucuring ( i am on F19 with updates-testing repo enabled) gluserfs-api has dep on -libs and glusterfs So when User install glusterfs-api, it pulls in -libs and glusterfs This is correct, since w/o glusterfs rpm we won't have a working qemu gluster backend. Actually this *wasnt* what we discussed. glusterfs-api was supposed to depend on glusterfs-libs *ONLY*. This is because it has a linking (hard) relationship with glusterfs-libs, and glusterfs.rpm is only a run-time dependency - everything here is dlopen()ed. 
Just allowing qemu to execute by way of installing -libs and -api only won't help, since once qemu executes and someone tries qemu w/ gluster backend, things will fail unless the user has installed the glusterfs rpm (which has all the client xlators) I think this was exactly what we concluded. That a user would need to install the glusterfs rpm if they wanted to store VM images on gluster (independent of the fact that qemu was linked with glusterfs-api). Do you see a problem with this? Putting on a user's hat... I think it's a problem. IIUC, what you are saying is that the user must be aware that he/she needs to install glusterfs in order to use the qemu gluster backend. The user may argue: why didn't you install glusterfs as part of the qemu yum install itself? Expecting the user (who may or may not be gluster/virt aware) to install an additional rpm to use qemu with gluster might not always work. Who will inform the user to install glusterfs when things fail at runtime? Your view is in direct contradiction with the view of those who objected to the dependency to start with :-) I think this question needs to be reconciled with the initial reporters. One more point to note here is that... even if we go with the way you suggested, it solves the original problem but brings in another, as I stated
Re: [Gluster-devel] Proposal for Gluster 3.5: New test framework
On Wed, Aug 14, 2013 at 5:36 AM, Justin Clift jcl...@redhat.com wrote: On 14/08/2013, at 7:43 AM, Anand Avati wrote: Justin, Thanks for firing up this thread. Are there notable projects which use these frameworks? Autotest is used by the Linux kernel (its main claim to fame), and is also used by KVM. STAF seems to have originally been an IBM internal project that was open sourced. Seems to have been around for years. Haven't yet looked at further alternatives, as I was mostly expecting Autotest to be ok. Wrongly it turns out. :( It would be wise to do some proper investigation/shortlisting of potential frameworks before immediately jumping into an investigation of STAF. Do you have any info on what other distributed storage projects use for their automated testing? The Ceph project used Autotest some time ago as well, but it didn't meet their needs so they created their own: Teuthology https://github.com/ceph/teuthology Their historical Autotest stuff https://github.com/ceph/autotest https://github.com/ceph/ceph-autotests I looked over Teuthology quickly, and it seems decent but it's very Ceph oriented/optimised. Not a general purpose thing we could pick up and use without extensive modification. :( Regards and best wishes, Justin Clift An important factor is going to be support for integration with Gerrit for pre-commit tests. Or they should at least be configurable behind Jenkins. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] QEMU (and other libgfapi client?) crashes on add-brick / replace-brick
On Wed, Jul 31, 2013 at 9:48 AM, Guido De Rosa guido.der...@vemarsas.itwrote: Well, there's another problem, possibly related, that I didn't notice: I'm unable to mount! (Although libgfapi is meant to bypass fuse-mount, I've read that you need FUSE working in the same machine where you issue a replace-brick command [1]) $ git clone ssh://guidoder...@git.gluster.org/glusterfs (master branch @ a960f92) $ ./autogen.sh $ ./configure --enable-debug make sudo make install # /etc/init.d/glusterd start # gluster volume create gv transport tcp 192.168.232.179: /var/export/gluster/gv volume create: gv: success: please start the volume to access data # gluster volume start gv volume start: gv: success And here the problem: # mount -t glusterfs localhost:/gv /mnt/gv Mount failed. Please check the log file for more details. The log to check is /usr/local/var/log/glusterfs/mnt-gv.log Please check what the error was in that log. Avati The same issue holds if I apply the patches you suggested, then clean sources, rebuild reinstall. (If there's some relation..,) $ git pull ssh://guidoder...@git.gluster.org/glusterfsrefs/changes/07/5407/2 etc. Here are the relevant logs: /usr/local/var/log/glusterfs/bricks/var-export-gluster-gv.log: Final graph: +--+ 1: volume gv-posix 2: type storage/posix 3: option glusterd-uuid 42ff1e51-7c77-4c70-9e1b-3e6207935bee 4: option directory /var/export/gluster/gv 5: option volume-id a562cb7c-0edf-4efa-afc6-80ea4e3fe978 6: end-volume 7: 8: volume gv-changelog 9: type features/changelog 10: option changelog-brick /var/export/gluster/gv 11: option changelog-dir /var/export/gluster/gv/.glusterfs/changelogs 12: subvolumes gv-posix 13: end-volume 14: 15: volume gv-access-control 16: type features/access-control 17: subvolumes gv-changelog 18: end-volume 19: 20: volume gv-locks 21: type features/locks 22: subvolumes gv-access-control 23: end-volume 24: 25: volume gv-io-threads 26: type performance/io-threads 27: subvolumes gv-locks 28: end-volume 29: 30: volume gv-index 31: type features/index 32: option index-base /var/export/gluster/gv/.glusterfs/indices 33: subvolumes gv-io-threads 34: end-volume 35: 36: volume gv-marker 37: type features/marker 38: option volume-uuid a562cb7c-0edf-4efa-afc6-80ea4e3fe978 39: option timestamp-file /var/lib/glusterd/vols/gv/marker.tstamp 40: option xtime off 41: option quota off 42: subvolumes gv-index 43: end-volume 44: 45: volume /var/export/gluster/gv 46: type debug/io-stats 47: option latency-measurement off 48: option count-fop-hits off 49: subvolumes gv-marker 50: end-volume 51: 52: volume gv-server 53: type protocol/server 54: option transport.socket.listen-port 49152 55: option rpc-auth.auth-glusterfs on 56: option rpc-auth.auth-unix on 57: option rpc-auth.auth-null on 58: option transport-type tcp 59: option auth.login./var/export/gluster/gv.allow ae4ffb2b-75fb-4b5a-b9d3-6c9e390fee03 60: option auth.login.ae4ffb2b-75fb-4b5a-b9d3-6c9e390fee03.password 041ee2e7-e8cf-4ecd-bba6-655348721610 61: option auth.addr./var/export/gluster/gv.allow * 62: subvolumes /var/export/gluster/gv 63: end-volume 64: /usr/local/var/log/glusterfs/usr-local-etc-glusterfs-glusterd.vol.log: Final graph: +--+ 1: volume management 2: type mgmt/glusterd 3: option rpc-auth.auth-glusterfs on 4: option rpc-auth.auth-unix on 5: option rpc-auth.auth-null on 6: option transport.socket.listen-backlog 128 7: option transport.socket.read-fail-log off 8: option transport.socket.keepalive-interval 2 9: option transport.socket.keepalive-time 10 10: option transport-type rdma 11: 
option working-directory /var/lib/glusterd 12: end-volume 13: +--+ Thanks, Guido --- [1] This is for older versions and I'm not sure the same holds for 3.4 http://www.gluster.org/wp-content/uploads/2012/05/Gluster_File_System-3.3.0-Administration_Guide-en-US.pdf Sec 7.4 ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] The return of the all-null pending matrix
Emmanuel, I was going through your log files again. Correct me if I'm wrong, the issue in the log is with the file tparm.po, right? Avati On Tue, Aug 13, 2013 at 9:39 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Hi I am back on this problem. I would like to debug it but I need some suggestions on what to look at. We know it disappears if eager locks are disabled. How do they work, and how could they turn bad? Emmanuel Dreyfus m...@netbsd.org wrote: Vijay Bellur vbel...@redhat.com wrote: I have not been able to re-create the problem in my setup. I think it would be a good idea to track this bug and address it. For now, can we not use the volume set mechanism to disable eager-locking? Our exchanges have gone off-list after this message. I repost here the last 100k lines of the log with debug mode: http://ftp.espci.fr/shadow/manu/log -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] RPM re-structuring
On Mon, Aug 5, 2013 at 4:17 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 08/05/2013 05:42 AM, Deepak C Shetty wrote: On 08/05/2013 02:41 PM, Niels de Vos wrote: On Mon, Aug 05, 2013 at 10:59:32AM +0530, Deepak C Shetty wrote: IIUC, per the previous threads, the glusterfs package has a dep on glusterfs-libs glusterfs does not have a dependency (i.e. a Requires: clause) on glusterfs-libs. This is intentional. Does it not? It should. glusterfs-libs would contain libglusterfs.so, libgfrpc.so and libgfxdr.so - which are required by the glusterfs package (which contains all the xlators). Did I miss something? Avati vdsm/qemu-kvm/oVirt packages need to change their dependency from glusterfs to glusterfs-libs. -- Kaleb ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On Wed, Jul 31, 2013 at 5:11 AM, Jeff Darcy jda...@redhat.com wrote: On 07/31/2013 07:35 AM, Amar Tumballi wrote: I was trying to fire some regression builds on very minor patches today, and noticed (always known, but faced pain of 'waiting' today) that we can fire regression build on only one patch (or a patchset if its submitted with dependency added while submitting). And each regression run takes approx 30mins. With this model, we can at max take only ~45 patches in a day, which won't scale up if we want to grow with more people participating in code contribution. Would be great to have an option to submit regression run with multiple patch numbers, (technically they should be applicable one top of other in any order if not dependent), and it should work fine. That way, we can handle more review load in future. Maybe my brain has been baked too much by the sun, but I thought I'd seen cases where a regression run on a patch with dependencies automatically validated everything in the stack. Not so? That still places a burden on patch submitters to make sure dependencies are specified (shouldn't be a problem since the current tendency is to *over*specify dependencies) and on the person starting the run to pick the top of the stack, but it does allow us to kill multiple birds with one stone. As for scaling, isn't the basic solution to add more worker machines? That would multiply the daily throughput by the number of workers, and decrease latency for simultaneously submitted runs proportionally. The flip side of having too many patches regression-tested in parallel is that, since the regression test applies the patch in question on top of the current git HEAD _at the time of test execution_, we lose out on testing the combined effect of those multiple patches. This can result in master branch being in broken state even though every patch is tested (in isolation). And the breakage will be visible much later - when an unrelated patch is tested after the patches get (successfully tested and) merged independently. This has happened before too, even with the current test one patch at a time model. E.g: 1 - Patch A is tested [success] 2 - Patch B is tested [success] 3 - Patch A is merged 4 - Patch B is merged git HEAD is broken now 5 - Patch C is tested [failure, because combined effect of A + B is tested only now] The serial nature of today's testing limits such delays to some extent, as tested patches keep getting merged before regression test of new patches start. Parallelizing tests too much could potentially increase this danger window. On the other hand, to guarantee master is never broken, test + merge must be a strictly serial operation (i.e do not even start new regression job until the previous patch is tested and merged). That is even worse, for sure. In the end we probably need a combination of the two strategies - Ability to test multiple patches at the same time (solves regression throughput to some extent and increases integrated testing of patches for their combined effect. - Ability to run tests in parallel (of the patch sets) where testing patch sets can be formed such that the two groups are really independent and there is very less chance of their combined effect to result in a regression (e.g one patch set for a bunch of patches in glusterd and another patch set for a bunch of patches in data path). Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] new glusterfs logging framework
On Tue, Jul 30, 2013 at 11:39 PM, Balamurugan Arumugam barum...@redhat.comwrote: - Original Message - From: Joe Julian j...@julianfamily.org To: Pablo paa.lis...@gmail.com, Balamurugan Arumugam b...@gluster.com Cc: gluster-us...@gluster.org, gluster-devel@nongnu.org Sent: Tuesday, July 30, 2013 9:26:55 PM Subject: Re: [Gluster-users] new glusterfs logging framework Configuration files should be under /etc per FSH standards. Move the logger.conf to /etc/glusterfs. This will be done. I, personally, like json logs since I'm shipping to logstash. :-) My one suggestion would be to ensure the timestamps are in rfc3164. rsyslog supports rfc3339 (a profile of ISO8601) and we use this. Let me know your thoughts on continue using it. Yes, those are complex steps, but the rpm/deb packaging should take care of dependencies and setting up logical defaults. Yes. I am planning to add rsyslog configuration for gluster at install time. IMHO, since this is a departure from the way it's been before now, the config file should enable this new behavior, not disable it, to avoid breaking existing monitoring installations. Do you mean to continue current logging in addition to syslog way? This means unless explicitly configured with syslog, by default we should be logging to gluster logs as before. Avati Regards, Bala Pablo paa.lis...@gmail.com wrote: I think that adding all that 'rsyslog' configuration only to see logs is too much. (I admit it, I don't know how to configure rsyslog at that level so that may influence my opinion) Regards, El 30/07/2013 06:29 a.m., Balamurugan Arumugam escribió: Hi All, Recently new logging framework was introduced [1][2][3] in glusterfs master branch. You could read more about this on doc/logging.txt. In brief, current log target is moved to syslog and user has an option to this new logging at compile time (passing '--disable-syslog' to ./configure or '--without syslog' to rpmbuild) and run time (having a file /var/log/glusterd/logger.conf and restarting gluster services). As rsyslog is used as syslog server in Fedora and CentOS/RHEL and default configuration of rsyslog does not have any rule specific to gluster logs, you see all logs are in /var/log/messages in JSON format. Below is the way to make them neat and clean. For fedora users: 1. It requires to install rsyslog-mmjsonparse rpm (yum -y install rsyslog-mmjsonparse) 2. Place below configuration under /etc/rsyslog.d/gluster.conf file. #$RepeatedMsgReduction on $ModLoad mmjsonparse *.* :mmjsonparse: template (name=GlusterLogFile type=string string=/var/log/gluster/%app-name%.log) template (name=GlusterPidLogFile type=string string=/var/log/gluster/%app-name%-%procid%.log) template(name=GLFS_template type=list) { property(name=$!mmcount) constant(value=/) property(name=syslogfacility-text caseConversion=upper) constant(value=/) property(name=syslogseverity-text caseConversion=upper) constant(value= ) constant(value=[) property(name=timereported dateFormat=rfc3339) constant(value=] ) constant(value=[) property(name=$!gf_code) constant(value=] ) constant(value=[) property(name=$!gf_message) constant(value=] ) property(name=$!msg) constant(value=\n) } if $app-name == 'gluster' or $app-name == 'glusterd' then { action(type=omfile DynaFile=GlusterLogFile Template=GLFS_template) stop } if $app-name contains 'gluster' then { action(type=omfile DynaFile=GlusterPidLogFile Template=GLFS_template) stop } 3. Restart rsyslog (service rsyslog restart) 4. Done. 
All gluster process-specific logs are separated into the /var/log/gluster/ directory. Note for Fedora 19 users: there is a bug in rsyslog on Fedora 19 [4], so it is required to recompile the rsyslog source rpm downloaded from the Fedora repository ('rpmbuild --rebuild rsyslog-7.2.6-1.fc19.src.rpm' works fine) and use the generated rsyslog and rsyslog-mmjsonparse binary rpms. For CentOS/RHEL users: the rsyslog currently available in CentOS/RHEL does not have JSON support. I have added the support, which requires some testing. I will update once done. TODO: 1. need to add a volume:brick specific tag to logging so that those logs can be separated out by more than just the pid. 2. enable gfapi to use this logging framework I would like to get feedback/suggestions about this logging framework Regards, Bala [1] http://review.gluster.org/4977 [2] http://review.gluster.org/5002 [3] http://review.gluster.org/4915 [4] https://bugzilla.redhat.com/show_bug.cgi?id=989886 ___ Gluster-users mailing list
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On 7/31/13 4:35 AM, Amar Tumballi wrote: Hi, I was trying to fire some regression builds on very minor patches today, and noticed (always known, but faced pain of 'waiting' today) that we can fire regression build on only one patch (or a patchset if its submitted with dependency added while submitting). And each regression run takes approx 30mins. With this model, we can at max take only ~45 patches in a day, which won't scale up if we want to grow with more people participating in code contribution. Would be great to have an option to submit regression run with multiple patch numbers, (technically they should be applicable one top of other in any order if not dependent), and it should work fine. That way, we can handle more review load in future. Regards, Amar Amar, This thought has crossed my mind before. It needs some scripting in the Jenkins 'regression' job. Can you give it a shot and send out the change for review? If not I can look into it a few days. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On Wed, Jul 31, 2013 at 4:47 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 07/31/2013 07:35 AM, Amar Tumballi wrote: Hi, I was trying to fire some regression builds on very minor patches today, and noticed (always known, but I felt the pain of 'waiting' today) that we can fire a regression build on only one patch (or a patchset, if it is submitted with a dependency added while submitting). And each regression run takes approx 30 mins. With this model, we can take at most ~45 patches in a day, which won't scale if we want to grow with more people participating in code contribution. It would be great to have an option to submit a regression run with multiple patch numbers (technically they should be applicable one on top of the other in any order if not dependent), and it should work fine. That way, we can handle more review load in the future. When a regression fails, how do you know who to blame? I'd rather see more build machines (multiple VMs on a big build.gluster.org replacement box?) instead, to get more concurrency. We already face that ambiguity when a patch has a dependent patch. Multiple VMs will solve the problem, but I guess we need to figure out how to get a bigger box etc. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Feature request]: Regression to take more patches in single instance
On Wed, Jul 31, 2013 at 5:09 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 07/31/2013 07:51 AM, Anand Avati wrote: On Wed, Jul 31, 2013 at 4:47 AM, Kaleb S. KEITHLEY kkeit...@redhat.com wrote: On 07/31/2013 07:35 AM, Amar Tumballi wrote: Hi, I was trying to fire some regression builds on very minor patches today, and noticed (always known, but I felt the pain of 'waiting' today) that we can fire a regression build on only one patch (or a patchset, if it is submitted with a dependency added while submitting). And each regression run takes approx 30 mins. With this model, we can take at most ~45 patches in a day, which won't scale if we want to grow with more people participating in code contribution. It would be great to have an option to submit a regression run with multiple patch numbers (technically they should be applicable one on top of the other in any order if not dependent), and it should work fine. That way, we can handle more review load in the future. When a regression fails, how do you know who to blame? I'd rather see more build machines (multiple VMs on a big build.gluster.org replacement box?) instead, to get more concurrency. We already face that ambiguity when a patch has a dependent patch. That's a bit of a special case. The dependent patch is often owned by the same person, right? I would not want to make this harder for people in the general case. Multiple VMs will solve the problem, but I guess we need to figure out how to get a bigger box etc. Can the slave build machines be behind a firewall? I'm working on getting the old Sunnyvale lab machines online in our new lab. Can we use some of those? That should work, I think! Thanks, Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
[Gluster-devel] REVERT: Change in glusterfs[master]: fuse: auxiliary gfid mount support
On 7/19/13 1:14 AM, Vijay Bellur (Code Review) wrote: Vijay Bellur has submitted this change and it was merged. Change subject: fuse: auxiliary gfid mount support .. fuse: auxiliary gfid mount support * Files can be accessed directly through their gfid and not just through their paths. For example, if the gfid of a file is f3142503-c75e-45b1-b92a-463cf4c01f99, that file can be accessed as gluster-mount/.gfid/f3142503-c75e-45b1-b92a-463cf4c01f99. .gfid is a virtual directory used to separate out the namespace for accessing files through their gfid, so that we do not conflict with filenames which could themselves be valid uuids. * A new file/directory/symlink can be created with a pre-specified gfid. A setxattr on the parent directory, with the key glusterfs.gfid.newfile and a fuse_auxgfid_newfile_args_t structure (initialized with the appropriate fields) as the value, results in the entry parent/bname whose gfid is set to args.gfid. The contents of the structure should be in network byte order.

struct auxfuse_symlink_in {
    char linkpath[]; /* linkpath is a null terminated string */
} __attribute__ ((__packed__));

struct auxfuse_mknod_in {
    unsigned int mode;
    unsigned int rdev;
    unsigned int umask;
} __attribute__ ((__packed__));

struct auxfuse_mkdir_in {
    unsigned int mode;
    unsigned int umask;
} __attribute__ ((__packed__));

typedef struct {
    unsigned int uid;
    unsigned int gid;
    char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid string
                                             * in canonical form. */
    unsigned int st_mode;
    char bname[]; /* bname is a null terminated string */
    union {
        struct auxfuse_mkdir_in mkdir;
        struct auxfuse_mknod_in mknod;
        struct auxfuse_symlink_in symlink;
    } __attribute__ ((__packed__)) args;
} __attribute__ ((__packed__)) fuse_auxgfid_newfile_args_t;

An initial consumer of this feature would be geo-replication, to create files on the slave mount with the same gfids as on the master. It will also help gsyncd access files directly through their gfids: gsyncd in its newer version will be consuming a changelog (of the master) containing operations on gfids and will sync the corresponding files to the slave. * Also, bring in support to heal gfids with a specific value. fuse-bridge sends across a gfid during a lookup, which storage translators assign to an inode (file/directory etc.) if there is no gfid already associated with it. This patch brings in support to specify that gfid value from an application, instead of relying on the random gfid generated by fuse-bridge. gfids can be healed through the setxattr interface. The setxattr should be done on the parent directory. The key used is glusterfs.gfid.heal and the value should be the following structure, whose contents should be in network byte order.

typedef struct {
    char gfid[UUID_CANONICAL_FORM_LEN + 1]; /* a null terminated gfid
                                             * string in canonical form */
    char bname[]; /* a null terminated basename */
} __attribute__((__packed__)) fuse_auxgfid_heal_args_t;

This feature can be used for upgrading older geo-rep setups, where the gfids of files are different on master and slave, to newer setups where they should be the same. One can delete the gfids on the slave (using setxattr -x and .glusterfs) and issue a stat on all the files with the gfids from the master. Thanks to Amar Tumballi ama...@redhat.com and Csaba Henk cs...@redhat.com for their inputs.
Signed-off-by: Raghavendra G rgowd...@redhat.com Change-Id: Ie8ddc0fb3732732315c7ec49eab850c16d905e4e BUG: 952029 Reviewed-on: http://review.gluster.com/#/c/4702 Reviewed-by: Amar Tumballi ama...@redhat.com Tested-by: Amar Tumballi ama...@redhat.com Reviewed-on: http://review.gluster.org/4702 Reviewed-by: Xavier Hernandez xhernan...@datalab.es Tested-by: Gluster Build System jenk...@build.gluster.com Reviewed-by: Vijay Bellur vbel...@redhat.com --- M glusterfsd/src/glusterfsd.c M glusterfsd/src/glusterfsd.h M libglusterfs/src/glusterfs.h M libglusterfs/src/inode.c M libglusterfs/src/inode.h M xlators/cluster/dht/src/dht-common.c M xlators/mount/fuse/src/Makefile.am M xlators/mount/fuse/src/fuse-bridge.c M xlators/mount/fuse/src/fuse-bridge.h M xlators/mount/fuse/src/fuse-helpers.c A xlators/mount/fuse/src/glfs-fuse-bridge.h M xlators/mount/fuse/utils/mount.glusterfs.in M xlators/storage/posix/src/posix.c 13 files changed, 1,317 insertions(+), 136 deletions(-) Approvals: Xavier Hernandez: Looks good to me, but someone else must approve Amar Tumballi: Looks good to me, approved
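As a rough illustration of the interface described in the commit message, a consumer such as gsyncd could create a regular file with a pre-specified gfid by packing the newfile arguments by hand and issuing a setxattr on the parent directory inside the FUSE mount. This is only a sketch: the mount point, gfid and file name are made up, the byte layout is transcribed from the structures quoted above rather than from the shipped glfs-fuse-bridge.h header, and since this change was reverted the interface may not exist in a given build.

/* Hypothetical example: create parent/newfile with a chosen gfid by writing
 * the glusterfs.gfid.newfile xattr on the parent directory (mknod case).
 * All multi-byte fields are packed in network byte order. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>

#define UUID_CANONICAL_FORM_LEN 36   /* e.g. f3142503-c75e-45b1-b92a-463cf4c01f99 */

static size_t pack_u32(char *dst, unsigned int v)
{
    unsigned int n = htonl(v);
    memcpy(dst, &n, sizeof(n));
    return sizeof(n);
}

int main(void)
{
    const char *parent = "/mnt/gluster/dir";                      /* assumed mount  */
    const char *gfid   = "f3142503-c75e-45b1-b92a-463cf4c01f99";  /* desired gfid   */
    const char *bname  = "newfile";                               /* new entry name */
    char buf[512];
    size_t off = 0;

    off += pack_u32(buf + off, 0);                         /* uid              */
    off += pack_u32(buf + off, 0);                         /* gid              */
    memcpy(buf + off, gfid, UUID_CANONICAL_FORM_LEN + 1);  /* gfid + NUL       */
    off += UUID_CANONICAL_FORM_LEN + 1;
    off += pack_u32(buf + off, S_IFREG | 0644);            /* st_mode          */
    memcpy(buf + off, bname, strlen(bname) + 1);           /* bname + NUL      */
    off += strlen(bname) + 1;
    off += pack_u32(buf + off, 0644);                      /* args.mknod.mode  */
    off += pack_u32(buf + off, 0);                         /* args.mknod.rdev  */
    off += pack_u32(buf + off, 022);                       /* args.mknod.umask */

    if (setxattr(parent, "glusterfs.gfid.newfile", buf, off, 0) == -1) {
        perror("setxattr(glusterfs.gfid.newfile)");
        return 1;
    }
    printf("created %s/%s with gfid %s\n", parent, bname, gfid);
    return 0;
}

Healing an existing entry works the same way with the glusterfs.gfid.heal key, whose value is just the null-terminated gfid string followed by the null-terminated basename.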
Re: [Gluster-devel] [Gluster-users] uWSGI plugin and some question
On Mon, Jul 29, 2013 at 10:55 PM, Anand Avati anand.av...@gmail.com wrote: On Mon, Jul 29, 2013 at 8:36 AM, Roberto De Ioris robe...@unbit.it wrote: Hi everyone, I have just committed a plugin for the uWSGI application server for exposing glusterfs filesystems using the new native api: https://github.com/unbit/uwsgi-docs/blob/master/GlusterFS.rst Currently it is very simple, but it works really well. I have studied the whole api, and I have two questions: why is there no glfs_stat_async()? If I understand the code correctly, even stat() is a blocking operation. Can you show some code in uwsgi which makes use of asynchronous stat calls? Adding an async stat call in gfapi is not hard, but the use case hasn't been clear. My objective is to avoid the use of threads and processes and to use the uWSGI async api to implement a non-blocking approach (mixable with other engines like gevent or Coro::AnyEvent). Are there any requirements that the callback happen only in specific threads? That is typically a common requirement, and the async callbacks would end up requiring special wiring to bring the callbacks to the desired threads. But I guess that wiring would already be done with the IO callbacks anyway in your case. Do you have some prototype of the module using gfapi out somewhere? I'm hoping to understand the use case of gfapi and see if something can be done to make it integrate with Coro::AnyEvent more naturally. I am assuming the module in question is this - https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c. I see that you are not using the async variants of any of the glfs calls so far. I also believe you would like these synchronous calls to play nicely with Coro:: by yielding in a compatible way (and getting woken up when the response arrives in a compatible way) - rather than implementing an explicit glfs_stat_async(). The ->request() method does not seem to naturally allow the use of explicitly asynchronous calls within. Can you provide some details of the event/request management in use? If possible, I would like to provide hooks for yield and wakeup primitives in gfapi (which you can wire with Coro:: or anything else) such that these seemingly synchronous calls (glfs_open, glfs_stat etc.) don't starve the app thread without yielding. I can see those hooks having a benefit in the qemu gfapi driver too, removing a bit of code there which integrates callbacks into the event loop using pipes. Avati Another thing is the bison/yacc name clash. uWSGI allows you to load various external libraries, and the use of the default 'yy' prefix causes name clashes with common libraries (like matheval). I understand that matheval too should choose a better approach, but why not prefix it with something like glusterfsyy? This would avoid headaches, especially when people start using the library in higher level languages. Currently I have tried the YFLAGS env var hack for ./configure but it did not work (I am using bison): YFLAGS=-Dapi.prefix=glusterfsyy -d ./configure --prefix=/opt/glusterfs/ Hmm, this is nice to get fixed. Do you already have a patch which you have used (other than just the technique shown above)? Thanks! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
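For reference, the synchronous gfapi calls under discussion look roughly like the sketch below in a standalone client. The volume name, server and path are made-up values; the point is that glfs_stat(), glfs_open() and glfs_read() each block the calling thread until the reply arrives, which is exactly what the yield/wakeup hooks proposed above are meant to address.

/* Minimal synchronous gfapi sketch (assumed volume "testvol" on host "server1"). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("testvol");
    glfs_set_volfile_server(fs, "tcp", "server1", 24007);
    glfs_set_logging(fs, "/tmp/gfapi.log", 7);
    if (glfs_init(fs) != 0) {
        perror("glfs_init");
        return 1;
    }

    struct stat st;
    if (glfs_stat(fs, "/index.html", &st) == 0)              /* blocks until reply */
        printf("size: %lld\n", (long long) st.st_size);

    glfs_fd_t *fd = glfs_open(fs, "/index.html", O_RDONLY);  /* blocks */
    if (fd) {
        char buf[128 * 1024];                                /* 128KB buffer */
        ssize_t n = glfs_read(fd, buf, sizeof(buf), 0);      /* blocks */
        printf("read %zd bytes\n", n);
        glfs_close(fd);
    }

    glfs_fini(fs);
    return 0;
}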
Re: [Gluster-devel] [Gluster-users] [FEEDBACK] Governance of GlusterFS project
On Mon, Jul 29, 2013 at 2:17 PM, Joe Julian j...@julianfamily.org wrote: As one of the guys supporting this software, I agree that I would like bugfix releases to happen more often. Critical and security bugs should trigger an immediate test release. Other bug fixes should go out on a reasonable schedule (monthly?). The relatively new CI testing should make this a lot more feasible. Joe, we will certainly be increasing the frequency of releases to push out bug fixes sooner. Though this has been a consistent theme in everybody's comments, your feedback in particular weighs in heavily because of your level of involvement in guiding our users :-) Avati If there weren't hundreds of bugs to examine between releases, I would happily participate in the evaluation process. On 07/26/2013 05:16 PM, Bryan Whitehead wrote: I would really like to see releases happen regularly and more aggressively. So maybe this plan needs a community QA guy, or the release manager needs to take up the responsibility of saying this code is good for inclusion in the next version. (Maybe this falls under process and evaluation?) For example, I think the ext4 patches had long been available but they just took forever to get pushed out into an official release. I'm in favor of closing some bugs and risking introducing new bugs for the sake of releases happening often. On Fri, Jul 26, 2013 at 10:26 AM, Anand Avati anand.av...@gmail.com wrote: Hello everyone, We are in the process of formalizing the governance model of the GlusterFS project. Historically, the governance of the project has been loosely structured. This is an invitation to all of you to participate in this discussion and provide your feedback and suggestions on how we should evolve a formal model. Feedback from this thread will be considered to the extent possible in formulating the draft (which will be sent out for review as well). Here are some specific topics to seed the discussion: - Core team formation - what are the qualifications for membership (e.g. contributions of code, doc, packaging, support on irc/lists; how to quantify?) - what are the responsibilities of the group (e.g. direction of the project, project roadmap, infrastructure, membership) - Roadmap - process of proposing features - process of selection of features for release - Release management - timelines and frequency - release themes - life cycle and support for releases - project management and tracking - Project maintainers - qualification for membership - process and evaluation There are a lot more topics which need to be discussed; I just named some to get started. I am sure our community has members who belong to and participate in (or are at least familiar with) other open source project communities. Your feedback will be valuable. Looking forward to hearing from you! Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] [FEEDBACK] Governance of GlusterFS project
On Sun, Jul 28, 2013 at 11:32 PM, Bryan Whitehead dri...@megahappy.net wrote: Weekend activities kept me away from watching this thread; I wanted to add in more of my 2 cents... :) Major releases happening more often would be great - but keeping current releases more current is really what I was talking about. For example, 3.3.0 was a pretty solid release; some annoying bugs got fixed and 3.3.1 felt reasonably quick to come. But that release seemed to be a step back for rdma (forgive me if I was wrong - but I think it wasn't even possible to fuse-mount over rdma with 3.3.1, while 3.3.0 worked). But the 3.3.2 release took a pretty long time to come and fix that regression. I think I also recall seeing a bunch of nfs fixes coming and regressing (but since I don't use gluster/nfs I don't follow closely). Bryan - yes, point well taken. I believe a dedicated release maintainer role will help in this case. I would like to hear other suggestions or thoughts on how you/others think this can be implemented. What I'd like to see: on the -devel mailing list right now I see someone showing that add-brick / replace-brick in 3.4.0 is causing a segfault in apps using libgfapi (in this case qemu/libvirt) to get at gluster volumes. It looks like some patches were provided to fix the issue. Assuming those patches work, I think a 3.4.1 release might be worth pushing out. Basic stuff like that, on something that a lot of people are going to care about (qemu/libvirt integration - or plain libgfapi). So if there were a scheduled release every, say, 1-3 months, then I think that might be worth doing. Ref: http://lists.gnu.org/archive/html/gluster-devel/2013-07/msg00089.html Right, thanks for highlighting. These fixes will be backported. I have already submitted the backport of one of them for review at http://review.gluster.org/5427. The other will be backported once reviewed and accepted in master. Thanks again! Avati The front page of gluster.org says 3.4.0 has Virtual Machine Image Storage improvements. If, 1-3 months from now, more traction with CloudStack/OpenStack or just straight-up libvirtd/qemu with gluster gets going, I'd much rather tell someone "make sure to use 3.4.1" than "be careful when doing an add-brick - all your VMs will segfault". On Sun, Jul 28, 2013 at 5:10 PM, Emmanuel Dreyfus m...@netbsd.org wrote: Harshavardhana har...@harshavardhana.net wrote: What is good for GlusterFS as a whole is highly debatable - since there are no module owners/subsystem maintainers as of yet, at least on paper. Just my two cents on that: you need to make clear whether a module maintainer is a dictator or a steward for the module: does he have the last word on anything touching his module, or is there some higher instance to settle discussions that do not reach consensus? IMO the first approach creates two problems: - having just one responsible person for a module is a huge bet that this person will have good judgment. Be careful to leave a maintainer position open instead of assigning it to the wrong person. - having many different dictators each ruling over a module can create difficult situations when a proposed change impacts many modules. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] uWSGI plugin and some question
On Tue, Jul 30, 2013 at 7:47 AM, Roberto De Ioris robe...@unbit.it wrote: On Mon, Jul 29, 2013 at 10:55 PM, Anand Avati anand.av...@gmail.com wrote: I am assuming the module in question is this - https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c . I see that you are not using the async variants of any of the glfs calls so far. I also believe you would like these synchronous calls to play nicely with Coro:: by yielding in a compatible way (and getting woken up when the response arrives in a compatible way) - rather than implementing an explicit glfs_stat_async(). The ->request() method does not seem to naturally allow the use of explicitly asynchronous calls within. Can you provide some details of the event/request management in use? If possible, I would like to provide hooks for yield and wakeup primitives in gfapi (which you can wire with Coro:: or anything else) such that these seemingly synchronous calls (glfs_open, glfs_stat etc.) don't starve the app thread without yielding. I can see those hooks having a benefit in the qemu gfapi driver too, removing a bit of code there which integrates callbacks into the event loop using pipes. Avati This is a prototype of the async way: https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c#L43 Basically, once the async request is sent, the uWSGI core (it can be a coroutine, a greenthread or another callback) waits for a signal of the callback's completion (via a pipe; it could be eventfd() on Linux): https://github.com/unbit/uwsgi/blob/master/plugins/glusterfs/glusterfs.c#L78 The problem is that this approach is racy with respect to the uwsgi_glusterfs_async_io structure. It is probably OK since you are waiting for the completion of the AIO request before issuing the next. One question I have about your usage is: who is draining the \1 written to the pipe in uwsgi_glusterfs_read_async_cb()? Since the same pipe is re-used for the next read chunk, won't you get an immediate wake-up if you tried polling on the pipe without draining? Can I assume that after glfs_close() all of the pending callbacks are cleared? With the way you are using the _async() calls, you do have that guarantee - because you are waiting for the completion of each AIO request right after issuing it. The enhancement to gfapi I was proposing was to expose hooks at the yield() and wake() points for external consumers to wire in their own ways of switching out of the stack. This is still a half-baked idea, but it will let you use only glfs_read(), glfs_stat() etc. (and NOT the explicit async variants), and the hooks will let you do wait_read_hook() and write(pipefd, '\1') respectively in a generic way, independent of the actual call. That way I could simply deallocate it (now it is on the stack) at the end of the request. You probably need to do all that only if you want to have multiple outstanding AIOs at the same time. From what I see, you just need co-operative waiting until call completion. Also note that the ideal block size for performing IO is 128KB. 8KB is too little for a distributed filesystem. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
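The completion-pipe pattern and the draining question above can be sketched roughly as follows. This is not the actual uWSGI plugin code: the names are invented, the blocking read() on the pipe stands in for uWSGI's async wait-read hook, and the io-callback signature shown matches the gfapi of this era (later releases changed it), so check the installed glfs.h before relying on it.

/* Sketch: async read with a reused pipe for completion signalling.
 * The callback writes one byte; the request loop drains exactly that
 * byte before reusing the pipe for the next 128KB chunk. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <glusterfs/api/glfs.h>

struct aio_ctx {
    int     pipefd[2];
    ssize_t ret;
};

/* runs in a gfapi thread: record the result and wake the waiter */
static void read_done(glfs_fd_t *fd, ssize_t ret, void *data)
{
    struct aio_ctx *ctx = data;
    ctx->ret = ret;
    (void) write(ctx->pipefd[1], "\1", 1);
}

static int read_file(glfs_t *fs, const char *path)
{
    glfs_fd_t *fd = glfs_open(fs, path, O_RDONLY);
    if (!fd)
        return -1;

    struct aio_ctx ctx;
    if (pipe(ctx.pipefd) == -1) {
        glfs_close(fd);
        return -1;
    }

    char buf[128 * 1024];   /* 128KB, as suggested above */
    off_t offset = 0;
    char c;

    for (;;) {
        glfs_pread_async(fd, buf, sizeof(buf), offset, 0, read_done, &ctx);
        /* here a uWSGI core would suspend until the pipe is readable;
         * this sketch simply blocks */
        (void) read(ctx.pipefd[0], &c, 1);   /* drain the completion byte */
        if (ctx.ret <= 0)
            break;                           /* EOF or error */
        offset += ctx.ret;
        /* ... hand buf / ctx.ret to the response writer here ... */
    }

    close(ctx.pipefd[0]);
    close(ctx.pipefd[1]);
    glfs_close(fd);
    return ctx.ret < 0 ? -1 : 0;
}

Draining one byte per completed request is what keeps the reused pipe from producing a spurious wake-up on the next iteration.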
Re: [Gluster-devel] [Gluster-users] Status of the SAMBA module
On Mon, Jul 29, 2013 at 1:56 AM, Nux! n...@li.nux.ro wrote: On 29.07.2013 07:16, Daniel Müller wrote: But you need to have gluster installed!? Which version? Samba 4.1 does not compile with the latest glusterfs 3.4 on CentOS 6.4. From what JM said, it builds against EL6 Samba (3.6) and it has also been added to upstream. You will need the latest Samba 4.1 release (or git HEAD), and the glusterfs-api-devel RPM (with its deps) installed at the time of building Samba, to get the vfs module built. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] QEMU (and other libgfapi client?) crashes on add-brick / replace-brick
On Mon, Jul 29, 2013 at 1:51 AM, Guido De Rosa guido.der...@vemarsas.it wrote: Apparently the problem isn't fixed... even when qemu doesn't crash, the guest raises many I/O errors and becomes unusable, just like a real machine would if you physically removed the hard drive, I guess... I'm doing more tests anyway and will post a much more detailed report as soon as I can. Thanks for now. Please do get back with the logs. It still might be a privileged port issue. Avati ___ Gluster-devel mailing list Gluster-devel@nongnu.org https://lists.nongnu.org/mailman/listinfo/gluster-devel