Re: [Gluster-devel] AFR conservative merge portability

2014-12-15 Thread Emmanuel Dreyfus
Ravishankar N ravishan...@redhat.com wrote:

 The check can be done in metadata selfheal itself but I don't think AFR is
 to blame if the user space app in NetBSD sends a setattr on the parent
 dir.

It is not exactly the problem: when adding an entry, you must update the
parent directory's mtime/ctime. On NetBSD, the filesystem-independent code
(that is, above the VFS) in the kernel does it. On Linux, it seems it is the
responsibility of the filesystem (below the VFS) to do it.

Since this is in-kernel implementation behavior, no standard will tell which
OS is right: both are. The only blame we can put on AFR is that it assumes
the Linux behavior, hence the proposal to make it portable.
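
To make the proposal concrete, here is a minimal sketch of the kind of check
it implies, using plain struct stat rather than AFR's internal types; the
helper name is made up and this is not actual AFR code:

    /* Hypothetical helper: returns non-zero when two directory stat
     * results differ only in ctime/mtime, in which case the newer
     * timestamps could simply be propagated by metadata self-heal. */
    #include <sys/stat.h>

    static int
    only_times_differ (const struct stat *a, const struct stat *b)
    {
            return (a->st_mode == b->st_mode &&
                    a->st_uid  == b->st_uid  &&
                    a->st_gid  == b->st_gid  &&
                    (a->st_mtime != b->st_mtime ||
                     a->st_ctime != b->st_ctime));
    }

If that check passes, the split brain is only about timestamps and can be
resolved by taking the most recent values, which matches the end state of a
conservative merge.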


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] Portable filesystem ACLs?

2014-12-15 Thread J. Bruce Fields
On Sun, Dec 14, 2014 at 07:24:52PM +0530, Soumya Koduri wrote:
 
 
 On 12/12/2014 11:06 PM, Niels de Vos wrote:
 Hi,
 
 I started to look into getting some form of support for ACLs in gfapi.
 
 After a short discussion with Shyam, some investigation showed that our
 current implementation of ACLs is not very portable. There definitely
 seem to be issues with ACLs when a FUSE mount is used on FreeBSD, and
 the bricks are on a Linux system.
 
 Our current implementation of (POSIX) ACLs is very much focussed on the
 Linux behaviour. For example, there is the assumption that ACLs are
 stored in the system.posix_acl_access extended attribute. FreeBSD uses a
 system.posix1e.acl_access xattr, and other platforms likely use yet
 another variation. Also, the (binary) encoding of the contents most
 definitely differs per platform.
 
 In order to provide a good experience with ACLs on different platforms,
 we could introduce a solution like this:
 
  setfacl ...
|
v
  glusterfs client (like fuse)
|
v
  some API, possibly transparent in the posix-acl xlator
|   converts the client-platform specific ACL into
|   a Gluster ACL format
v
  Outgoing RPC procedure, a new SET_ACL, or as SETXATTR(gluster.acl)
|
v
  [network]
|
v
  Incoming RPC procedure on the brick (can be different platform)
|
v
  Conversion from Gluster/ACL format to platform specific, possibly in
|   the storage/posix xlator
v
  setfacl() syscall/library call to store the ACL on the filesystem
 
 
 Reading the ACL would be the same, just in reverse.
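 
 As a rough illustration of the first conversion step above, the client side
 would at minimum have to map the platform-specific xattr name to the common
 gluster.acl name from the diagram; a minimal sketch, in which the helper
 name is made up and the wire format itself is left out:
 
    /* Sketch only: maps the local platform's POSIX ACL xattr name to
     * the assumed common wire name ("gluster.acl"). */
    #define GLUSTER_ACL_XATTR "gluster.acl"

    static const char *
    platform_acl_xattr_name (void)
    {
    #if defined(__linux__)
            return "system.posix_acl_access";    /* Linux */
    #elif defined(__FreeBSD__)
            return "system.posix1e.acl_access";  /* FreeBSD */
    #else
            return NULL;                         /* unknown platform */
    #endif
    }
 
 The binary payload would of course also need re-encoding, not just the name.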
 
 It would be most welcome to have some kind of API that can get exposed
 in gfapi, so that NFS-Ganesha and other gfapi applications can get/set
 ACLs in a standardized way. One option is to use the (possibly platform
 dependent) structures defined by libacl or librichacl.
 From a quick look at the code, 'libacl' seems to support only the
 'system.posix_acl_access' xattr.
 
 IMO it may be good to have a standard 'acl_obj' structure in the RichACL
 format itself, which the Multi-protocol team (CCed) has been looking at
 and which you have been suggesting. This structure can be exposed to
 the applications (NFS-Ganesha/Samba) to fill in ACL data and then call
 'glfs_set_acl(..)', which in turn converts these 'acl_obj's to
 platform-dependent ACL xattrs and makes a syncop_setxattr call.
 
 And in future, if we get RichACL support, we could just
 extend/modify this routine to send those ACLs as-is to the
 back-end.
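 
 To illustrate the shape such an API could take, a hedged sketch; everything
 here (the names, the acl_obj container, the conversion step) is hypothetical
 and only follows the 'acl_obj' idea above, it is not an existing gfapi API:
 
    /* Hypothetical gfapi entry points for a platform-neutral ACL object. */
    struct glfs;              /* opaque handle from glfs.h */
    struct gf_acl_obj;        /* assumed RichACL-style ACL container */

    int glfs_set_acl (struct glfs *fs, const char *path,
                      const struct gf_acl_obj *acl);
    int glfs_get_acl (struct glfs *fs, const char *path,
                      struct gf_acl_obj *acl);
 
 glfs_set_acl() would convert the acl_obj into the platform/wire xattr and
 issue the setxattr, as described above.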

Translating automatically between FreeBSD and Linux ACLs makes sense, I
think, as long as the only differences are trivial encoding and naming
issues.

Translating between completely different ACL models could quickly get
more complicated than it's worth.

--b.


Re: [Gluster-devel] AFR conservative merge portability

2014-12-15 Thread Jeff Darcy
 Here is a proposal: we know that at the end of conservative merge, we
 should end up with the situation where the directory ctime/mtime is the
 ctime of the most recently added child.

Won't the directory mtime change as the result of a rename or unlink?
Neither of those would be reflected in the children's times (in the
unlink case the child no longer exists).

 And fortunately, as
 conservative merge happens, parent directory ctime/mtime are updated on
 each child addition, and we finish in the desired state.
 
 In other words, after conservative merge, parent directory metadata
 split brain for only ctime/mtime can just be cleared by AFR without any
 harm.

Is there *any* case, not even necessarily involving conservative merge,
where it would be harmful to propagate the latest ctime/mtime for any
replica of a directory?


[Gluster-devel] Readdir d_off encoding

2014-12-15 Thread Shyam

With the changes present in [1] and [2]:

A short explanation of the change: we encode the subvol ID in the d_off, 
losing n + 1 bits when the high-order n + 1 bits of the d_off returned by 
the underlying xlator are not free. (Best to read the commit 
message for [1] :) )
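
For readers who have not looked at [1], a much-reduced sketch of the
high-bit packing idea follows. The constants and helper names are
illustrative; the real scheme also spends a flag bit and handles the case
where the child's high bits are already in use:

    #include <stdint.h>

    #define SUBVOL_BITS 3                          /* up to 8 subvolumes */
    #define OFF_BITS    (64 - SUBVOL_BITS)
    #define OFF_MASK    ((1ULL << OFF_BITS) - 1)

    /* Only safe when the child's top SUBVOL_BITS bits are free. */
    static uint64_t
    doff_encode (uint64_t child_off, uint64_t subvol_id)
    {
            return (subvol_id << OFF_BITS) | (child_off & OFF_MASK);
    }

    static uint64_t
    doff_subvol (uint64_t d_off)     /* which child to continue readdir on */
    {
            return d_off >> OFF_BITS;
    }

    static uint64_t
    doff_child_off (uint64_t d_off)  /* offset to hand back to that child */
    {
            return d_off & OFF_MASK;
    }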


Although not related to the latest patch, here is something to consider 
for the future:


We now have DHT, AFR, EC(?), and DHT over DHT (Tier), all of which need 
subvol encoding in the returned readdir offset. Due to this, the loss of 
bits _may_ cause unwanted offset behavior when used in the current scheme, 
as we would end up eating more bits than we do at present.


IOW, we could be invalidating the assumption that both EXT4 and XFS are 
tolerant in terms of the accuracy of the value presented
back in seekdir(), i.e. that a seekdir(val) actually seeks to the entry 
which has the closest true offset.


Should we reconsider an in-memory _cookie_-like approach that can help 
in this case?


It would invalidate (some or all, depending on the implementation) the 
following constraints that the current design satisfies (from [1]):

- Nothing to remember in memory or evict old entries.
- Works fine across NFS server reboots and also NFS head failover.
- Tolerant to seekdir() to arbitrary locations.

But it would provide a more reliable readdir offset for use (when valid 
and not evicted, say).
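
A minimal illustration of what such a cookie table could look like; the
names and layout here are assumptions, not a design:

    #include <stdint.h>

    /* readdir hands out small cookies as d_off values; seekdir resolves
     * them back to (subvol, true offset).  Eviction, locking and survival
     * across reboots/failover are exactly the open questions above. */
    struct doff_cookie {
            uint64_t true_off;   /* d_off as returned by the child xlator */
            int      subvol;     /* which subvolume produced it */
            int      in_use;
    };

    #define MAX_COOKIES 65536
    static struct doff_cookie cookie_table[MAX_COOKIES];

    static int64_t
    cookie_alloc (int subvol, uint64_t true_off)
    {
            for (int64_t i = 0; i < MAX_COOKIES; i++) {
                    if (!cookie_table[i].in_use) {
                            cookie_table[i].true_off = true_off;
                            cookie_table[i].subvol   = subvol;
                            cookie_table[i].in_use   = 1;
                            return i;    /* this value becomes the d_off */
                    }
            }
            return -1;                   /* full: needs an eviction policy */
    }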


How would NFS adapt to this? Does Ganesha need a better scheme when 
doing multi-head NFS failover?


Thoughts?

Shyam
[1] http://review.gluster.org/#/c/4711/
[2] http://review.gluster.org/#/c/8201/


Re: [Gluster-devel] Upcalls Infrastructure

2014-12-15 Thread Krishnan Parthasarathi
 
  - Is there a new connection from glusterfsd (upcall xlator) to
 a client accessing a file? If so, how does the upcall xlator reuse
 connections when the same client accesses multiple files, or does it?
 
 No. We are using the same connection which the client initiates to send in
 fops. Thanks for pointing me initially to the 'client_t'
 structure. As these connection details are available only in the server
 xlator, I am passing them to the upcall xlator by storing them in
 'frame->root->client'.
 
  - In the event of a network separation (i.e., a partition) between a client
 and a server, how does the client discover or detect that the server
 has 'freed' up its previously registered upcall notification?
 
 The rpc connection details of each client are stored based on its
 client-uid. So in case of a network partition, when the client comes back
 online, IMO it re-initiates the connection (along with a new client-uid).

How would a client discover that a server has purged its upcall entries?
For instance, a client could assume that the server would notify it about
changes as before (while the server has purged the client's upcall entries)
and assume that it still holds the lease/lock. How would you avoid that?

 Please correct me if that's not the case. So there will be new entries
 created/added in this xlator. However, we still need to decide on how to
 clean up the timed-out and stale entries:
   * either clean up the entries as and when we find an expired or stale
 entry (in case notification fails),
   * or spawn a new thread which periodically scans through this
 list and cleans up those entries.

There are a couple of things to resource cleanup in this context:
1) When to clean up; e.g., on expiry of a timer.
2) The order of cleaning up; this involves clearly establishing the relationships
   among the inode, the upcall entry and the client_t(s). We should document this.
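
For the periodic-scan option, a rough sketch of what a reaper could look
like; every name here is a placeholder, not actual upcall xlator code:

    #include <pthread.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    struct upcall_entry {
            struct upcall_entry *next;
            time_t               last_access;
            /* ... client id, inode ref, etc. ... */
    };

    static struct upcall_entry *upcall_list;
    static pthread_mutex_t      upcall_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Walks the list and drops entries idle for longer than the timeout.
     * Dropping inode/client_t references in the right order is exactly
     * point (2) above. */
    static void *
    upcall_reaper (void *arg)
    {
            time_t timeout = *(time_t *) arg;

            for (;;) {
                    pthread_mutex_lock (&upcall_lock);
                    struct upcall_entry **pp = &upcall_list;
                    while (*pp) {
                            if (time (NULL) - (*pp)->last_access > timeout) {
                                    struct upcall_entry *dead = *pp;
                                    *pp = dead->next;
                                    /* release inode/client_t refs, then free */
                                    free (dead);
                            } else {
                                    pp = &(*pp)->next;
                            }
                    }
                    pthread_mutex_unlock (&upcall_lock);
                    sleep ((unsigned int) timeout);
            }
            return NULL;
    }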


Re: [Gluster-devel] Volume management proposal (4.0)

2014-12-15 Thread Krishnan Parthasarathi

 So . . . about that new functionality.  The core idea of data
 classification is to apply step 6c repeatedly, with variants of DHT that
 do tiering or various other kinds of intelligent placement instead of
 the hash-based random placement we do now.  NUFA and switch are
 already examples of this.  In fact, their needs drove some of the code
 structure that makes data classification (DC) possible.
 
 The trickiest question with DC has always been how the user specifies
 these complex placement policies, which we then turn into volfiles.  In
 the interests of maximizing compatibility with existing scripts and user
 habits, what I propose is that we do this by allowing the user to
 combine existing volumes into a new higher-level volume.  This is

I like the idea for its simplicity. Abstracting 'tiers' as volumes is
natural from a manageability point of view, for the following reason:
tiering is about matching resources to data based on the data's needs,
and this is done by moving data to its best-suited resource (e.g., a
particular kind of disk). The secondary-volumes approach allows us to
partition the resources and their management
together.
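
To make the composition concrete, a hypothetical volfile sketch is shown
below; the xlator types and names are purely illustrative, not what
glusterd would actually generate:

    volume fast-tier                 # secondary volume (e.g. SSD bricks)
        type cluster/distribute
        subvolumes fast-client-0 fast-client-1
    end-volume

    volume slow-tier                 # secondary volume (e.g. HDD bricks)
        type cluster/distribute
        subvolumes slow-client-0 slow-client-1
    end-volume

    volume primary                   # the volume the user mounts
        type cluster/tier            # placement policy instead of plain hashing
        subvolumes fast-tier slow-tier
    end-volume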

 
 (D) Secondary volumes may not be started and stopped by the user.
 Instead, a secondary volume is automatically started or stopped along
 with its primary.

Wouldn't it help in some cases to have secondary volumes running while
the primary is not running, e.g. for some form of maintenance activity?

 
 (E) The user must specify an explicit option to see the status of
 secondary volumes.  Without this option, secondary volumes are hidden
 and status for their constituent bricks will be shown as though they
 were (directly) part of the corresponding primary volume.
 
 As it turns out, most of the extra volfiles in step 8 above also
 have their own steps 6d and 7, so implementing step C will probably make
 those paths simpler as well.
 
 The one big remaining question is how this will work in terms of
 detecting and responding to volume configuration changes.  Currently we
 treat each volfile as a completely independent entity, and just compare
 whole graphs.  Instead, what we need to do is track dependencies between
 graphs (a graph of graphs?) so that a change to a secondary volume will
 ripple up to its primary where a new graph can be generated and
 compared to its predecessor.

IIUC, (E) describes that the primary volume file would be generated with all
secondary-volume references resolved. Wouldn't that preclude the possibility
of the respective processes discovering the dependencies?


[Gluster-devel] REMINDER: Gluster Community Bug Triage meeting today at 12:00 UTC

2014-12-15 Thread Niels de Vos
Hi all,

Later today we will have another Gluster Community Bug Triage meeting.

Meeting details:
- location: #gluster-meeting on Freenode IRC
- date: every Tuesday
- time: 12:00 UTC, 13:00 CET (in your terminal, run: date -d "12:00 UTC")
- agenda: https://public.pad.fsfe.org/p/gluster-bug-triage

Currently the following items are listed:
* Roll Call
* Status of last week's action items
* Group Triage
* Open Floor

The last two topics have space for additions. If you have a suitable bug
or topic to discuss, please add it to the agenda.

Thanks,
Niels

