[Gluster-devel] md-cache improvements

2016-08-10 Thread Dan Lambright

There have been recurring discussions within the gluster community about building 
on the existing support for md-cache and upcalls to help performance for small-file 
workloads. In certain cases, "lookup amplification" dominates data transfers, 
i.e. the cumulative round-trip times of multiple LOOKUPs from the client 
offset the benefits of faster backend storage. 

To tackle this problem, one suggestion is to use md-cache more aggressively 
than is currently done to cache inodes on the client. The inodes would be 
cached until they are invalidated by the server. 

Several gluster development engineers within the DHT, NFS, and Samba teams have 
been involved with related efforts, which have been underway for some time now. 
At this juncture, comments are requested from gluster developers. 

(1) .. help call out where additional upcalls would be needed to invalidate 
stale client cache entries (in particular, need feedback from DHT/AFR areas), 

(2) .. identify failure cases, when we cannot trust the contents of md-cache, 
e.g. when an upcall may have been dropped by the network

(3) .. point out additional improvements which md-cache needs. For example, it 
cannot be allowed to grow unbounded.
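
To make (2) and (3) a bit more concrete, below is a rough sketch in Python of the 
caching behavior being proposed: a bounded cache with LRU eviction, server-driven 
invalidation via upcalls, and a timeout as a safety net for dropped upcalls. It is 
purely illustrative; md-cache itself is a C translator and all names here are made up.

import time
from collections import OrderedDict

class MDCacheSketch:
    """Illustrative only: bounded, invalidation-aware client metadata cache."""

    def __init__(self, max_entries=1024, timeout=600):
        self.entries = OrderedDict()    # gfid -> (stat, time cached)
        self.max_entries = max_entries  # bound on cache growth (point 3)
        self.timeout = timeout          # safety net for lost upcalls (point 2)

    def store(self, gfid, stat):
        self.entries.pop(gfid, None)
        self.entries[gfid] = (stat, time.time())
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)    # evict least recently used

    def lookup(self, gfid):
        hit = self.entries.get(gfid)
        if hit is None:
            return None                         # miss: send a LOOKUP to the server
        stat, cached_at = hit
        if time.time() - cached_at > self.timeout:
            del self.entries[gfid]              # too old to trust; revalidate
            return None
        self.entries.move_to_end(gfid)          # refresh LRU position
        return stat

    def invalidate(self, gfid):
        self.entries.pop(gfid, None)            # driven by an upcall from the server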

Dan

- Original Message -
> From: "Raghavendra Gowdappa" 
> 
> List of areas where we need invalidation notification:
> 1. Any changes to xattrs used by xlators to store metadata (like dht layout
> xattr, afr xattrs etc).
> 2. Scenarios where an individual xlator feels it needs a lookup. For
> example, failed directory creation on the non-hashed subvol in dht during mkdir.
> Though dht succeeds the mkdir, it would be better not to cache this inode, as a
> subsequent lookup will heal the directory and make things better.
> 3. removing of files
> 4. writev on brick (to invalidate read cache on client)
> 
> Other questions:
> 5. Does md-cache have cache management, like an LRU or an upper limit on the cache?
> 6. Network disconnects and invalidating the cache. When a network disconnect
> happens we need to invalidate the cache for inodes present on that brick, as we
> might have missed some notifications. The current approach of purging the cache of
> all inodes might not be optimal, as it could roll back the benefits of caching.
> Also, please note that network disconnects are not rare events.
> 
> regards,
> Raghavendra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] md-cache improvements

2016-08-17 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Raghavendra G" 
> Cc: "Dan Lambright" , "Gluster Devel" 
> , "Csaba Henk"
> 
> Sent: Wednesday, August 17, 2016 4:49:41 AM
> Subject: Re: [Gluster-devel] md-cache improvements
> 
> On Wed, Aug 17, 2016 at 11:42:25AM +0530, Raghavendra G wrote:
> > On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G 
> > wrote:
> > 
> > >
> > >
> > > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G 
> > > wrote:
> > >
> > >> Couple of more areas to explore:
> > >> 1. purging kernel dentry and/or page-cache too. Because of patch [1],
> > >> upcall notification can result in a call to inode_invalidate, which
> > >> results
> > >> in an "invalidate" notification to fuse kernel module. While I am sure
> > >> that, this notification will purge page-cache from kernel, I am not sure
> > >> about dentries. I assume if an inode is invalidated, it should result in
> > >> a
> > >> lookup (from kernel to glusterfs). But nevertheless, we should look into
> > >> differences between entry_invalidation and inode_invalidation and
> > >> harness
> > >> them appropriately.
> 
> I do not think fuse handles upcall yet. I think there is a patch for
> that somewhere. It's been a while since I looked into that, but I think
> invalidating the affected dentries was straightforward.

Can the patch number be tracked down? I'd like to run some experiments with it 
plus tiering.


> 
> > >> 2. Granularity of invalidation. For eg., We shouldn't be purging
> > >> page-cache in kernel, because of a change in xattr used by an xlator
> > >> (eg.,
> > >> dht layout xattr). We have to make sure that [1] is handling this. We
> > >> need
> > >> to add more granularity into invalidation (like internal xattr
> > >> invalidation,
> > >> user xattr invalidation, entry invalidation in kernel, page-cache
> > >> invalidation in kernel, attribute/stat invalidation in kernel etc) and
> > >> use
> > >> them judiciously, while making sure other cached data remains to be
> > >> present.
> > >>
> > >
> > > To stress the importance of this point, it should be noted that with tier
> > > there can be constant migration of files, which can result in spurious
> > > (from perspective of application) invalidations, even though application
> > > is
> > > not doing any writes on files [2][3][4]. Also, even if application is
> > > writing to file, there is no point in invalidating dentry cache. We
> > > should
> > > explore more ways to solve [2][3][4].
> 
> Actually upcall tracks the client/inode combination, and only sends
> upcall events to clients that (recently/timeout?) accessed the inode.
> There should not be any upcalls for inodes that the client did not
> access. So, when promotion/demotion happens, only the process doing this
> should receive the event, not any of the other clients that did not
> access the inode.
> 
> > > 3. We've a long standing issue of spurious termination of fuse
> > > invalidation thread. Since after termination, the thread is not
> > > re-spawned,
> > > we would not be able to purge kernel entry/attribute/page-cache. This
> > > issue
> > > was touched upon during a discussion [5], though we didn't solve the
> > > problem then for lack of bandwidth. Csaba has agreed to work on this
> > > issue.
> > >
> > 
> > 4. Flooding of network with upcall notifications. Is it a problem? If yes,
> > does upcall infra already solves it? Would NFS/SMB leases help here?
> 
> I guess some form of flooding is possible when two or more clients do
> many directory operations in the same directory. Hmm, now I wonder if a
> client gets an upcall event for something it did itself. I guess that
> would (most often?) not be needed.
> 
> Niels
> 
> 
> > 
> > 
> > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7
> > > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8
> > > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9
> > > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/
> > > fuse/src/fuse-bridge.c
> > >
> > >
> > >>
> > >> [1] http://review.gluster.org/12951
> > >>
> > >>
> > >> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright 
> > &

Re: [Gluster-devel] md-cache improvements

2016-08-18 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Vijay Bellur" 
> Cc: "Poornima Gurusiddaiah" , "Dan Lambright" 
> , "Nithya Balachandran"
> , "Raghavendra Gowdappa" , "Soumya 
> Koduri" , "Pranith
> Kumar Karampuri" , "Gluster Devel" 
> 
> Sent: Thursday, August 18, 2016 9:32:34 AM
> Subject: Re: [Gluster-devel] md-cache improvements
> 
> On Mon, Aug 15, 2016 at 10:39:40PM -0400, Vijay Bellur wrote:
> > Hi Poornima, Dan -
> > 
> > Let us have a hangout/bluejeans session this week to discuss the planned
> > md-cache improvements, proposed timelines and sort out open questions if
> > any.
> > 
> > Would 11:00 UTC on Wednesday work for everyone in the To: list?
> 
> I'd appreciate it if someone could send the meeting minutes. It'll make
> it easier to follow up and we can provide better status details on the
> progress.

Adding to this thread the tracking bug for the feature - 1211863


> 
> In any case, one of the points that Poornima mentioned was that upcall
> events (when enabled) get cached in gfapi until the application handles
> them. NFS-Ganesha is the only application that (currently) is interested
> in these events. Other use-cases (like md-cache invalidation) would
> enable upcalls too, and then cause event caching even when not needed.
> 
> This change should address that, and I'm waiting for feedback on it.
> There should be a bug report about these unneeded and uncleared caches,
> but I could not find one...
> 
>   gfapi: do not cache upcalls if the application is not interested
>   http://review.gluster.org/15191
> 
> Thanks,
> Niels
> 
> 
> > 
> > Thanks,
> > Vijay
> > 
> > 
> > 
> > On 08/11/2016 01:04 AM, Poornima Gurusiddaiah wrote:
> > > 
> > > My comments inline.
> > > 
> > > Regards,
> > > Poornima
> > > 
> > > - Original Message -
> > > > From: "Dan Lambright" 
> > > > To: "Gluster Devel" 
> > > > Sent: Wednesday, August 10, 2016 10:35:58 PM
> > > > Subject: [Gluster-devel] md-cache improvements
> > > > 
> > > > 
> > > > There have been recurring discussions within the gluster community to
> > > > build
> > > > on existing support for md-cache and upcalls to help performance for
> > > > small
> > > > file workloads. In certain cases, "lookup amplification" dominates data
> > > > transfers, i.e. the cumulative round trip times of multiple LOOKUPs
> > > > from the
> > > > client mitigates benefits from faster backend storage.
> > > > 
> > > > To tackle this problem, one suggestion is to more aggressively utilize
> > > > md-cache to cache inodes on the client than is currently done. The
> > > > inodes
> > > > would be cached until they are invalidated by the server.
> > > > 
> > > > Several gluster development engineers within the DHT, NFS, and Samba
> > > > teams
> > > > have been involved with related efforts, which have been underway for
> > > > some
> > > > time now. At this juncture, comments are requested from gluster
> > > > developers.
> > > > 
> > > > (1) .. help call out where additional upcalls would be needed to
> > > > invalidate
> > > > stale client cache entries (in particular, need feedback from DHT/AFR
> > > > areas),
> > > > 
> > > > (2) .. identify failure cases, when we cannot trust the contents of
> > > > md-cache,
> > > > e.g. when an upcall may have been dropped by the network
> > > 
> > > Yes, this needs to be handled.
> > > It can happen only when there is a one way disconnect, where the server
> > > cannot
> > > reach client and notify fails. We can have a retry for the same until the
> > > cache
> > > expiry time.
> > > 
> > > > 
> > > > (3) .. point out additional improvements which md-cache needs. For
> > > > example,
> > > > it cannot be allowed to grow unbounded.
> > > 
> > > This is being worked on, and will be targeted for 3.9
> > > 
> > > > 
> > > > Dan
> > > > 
> > > > - Original Message -
> > > > > From: "Raghavendra Gowdappa" 
> > > > > 
> > > > > List of areas where we need invalida

Re: [Gluster-devel] [Gluster-users] CFP for Gluster Developer Summit

2016-08-23 Thread Dan Lambright
I posted this earlier to Amye's original Google form; not sure it got onto the 
discussion lists, so reposting now.

Challenges with Gluster and Persistent Memory

A discussion of the difficulties posed by persistent memory with Gluster and 
some short and long term steps to address them.

Persistent memory will significantly improve storage performance, but these 
benefits may be hard to realize in Gluster. Gains are mitigated by costly 
network overhead and Gluster's deep software stack. It is also likely that the high 
cost of persistent memory will limit deployments. This talk will discuss 
short- and long-term steps to take on those problems. Possible strategies 
include better incorporating high-speed networks such as InfiniBand, client-side 
caching of metadata, centralizing DHT's layouts, and of course tiering. The talk 
will include discussion and results from a range of experiments in software and 
hardware.


> 
> On 2016-08-12 at 15:48 -0400, Vijay Bellur wrote:
> > Hey All,
> > 
> > Gluster Developer Summit 2016 is fast approaching [1] on us. We are looking
> > to have talks and discussions related to the following themes in the
> > summit:
> > 
> > 1. Gluster.Next - focusing on features shaping the future of Gluster
> > 
> > 2. Experience - Description of real world experience and feedback from:
> >a> Devops and Users deploying Gluster in production
> >b> Developers integrating Gluster with other ecosystems
> > 
> > 3. Use cases  - focusing on key use cases that drive Gluster.today and
> > Gluster.Next
> > 
> > 4. Stability & Performance - focusing on current improvements to reduce our
> > technical debt backlog
> > 
> > 5. Process & infrastructure  - focusing on improving current workflow,
> > infrastructure to make life easier for all of us!
> > 
> > If you have a talk/discussion proposal that can be part of these themes,
> > please send out your proposal(s) by replying to this thread. Please clearly
> > mention the theme for which your proposal is relevant when you do so. We
> > will be ending the CFP by 12 midnight PDT on August 31st, 2016.
> > 
> > If you have other topics that do not fit in the themes listed, please feel
> > free to propose and we might be able to accommodate some of them as
> > lightening talks or something similar.
> > 
> > Please do reach out to me or Amye if you have any questions.
> > 
> > Thanks!
> > Vijay
> > 
> > [1] https://www.gluster.org/events/summit2016/
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> ___
> Gluster-users mailing list
> gluster-us...@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] md-cache changes and impact on tiering

2016-08-28 Thread Dan Lambright


- Original Message -
> From: "Poornima Gurusiddaiah" 
> To: "Dan Lambright" , "Nithya Balachandran" 
> 
> Cc: "Gluster Devel" 
> Sent: Tuesday, August 23, 2016 12:56:38 AM
> Subject: md-cache changes and impact on tiering
> 
> Hi,
> 
> The basic patches for md-cache and integrating it with cache-invalidation is
> merged in master. You could try master build and enable the following
> settings, to see if there is any impact on tiering performance at all:
> 
> # gluster volume set <volname> performance.stat-prefetch on
> # gluster volume set <volname> features.cache-invalidation on
> # gluster volume set <volname> performance.cache-samba-metadata on
> # gluster volume set <volname> performance.md-cache-timeout 600
> # gluster volume set <volname> features.cache-invalidation-timeout 600

On the tests I run, this cut the number of LOOKUPs by about three orders of 
magnitude. Each saved lookup removes a round trip over the network.

I'm running a "small file" performance test. It creates 16K 64-byte files in a 
seven-level directory tree, then reads each file twice. 

Configuration: hot tier 2 x 2 on ramdisk, cold tier 2 x (8 + 4) on disk; network 
is 1Mb/s with a 9000 MTU. The number of lookups is a function of the number of 
directories and subvolumes: on each I/O the file is re-opened and each directory 
is laboriously rechecked for existence/permissions. 

Without md-cache, these lookups are further propagated by DHT across each 
subvolume to obtain the entire layout, so it works out to something on the order 
of 16K*7*26 round trips across the network. 

The counts are all visible with gluster profile. 
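
A rough back-of-the-envelope for the numbers above (a simplified model, not exact 
gluster accounting; the figures are simply the ones quoted in the test description):

# Simplified model of lookup amplification for the test above.
files = 16 * 1024    # 16K small files
depth = 7            # directory levels revalidated on each access
subvols = 26         # subvolumes a fresh lookup can fan out to via DHT

round_trips = files * depth * subvols
print(round_trips)   # ~3.0 million LOOKUP round trips without client caching

With md-cache warm, most of those path revalidations are answered locally, which 
is where the roughly three-orders-of-magnitude drop in the profile comes from.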


> 
> Note: It has to be executed in the same order.
> 
> Tracker bug: https://bugzilla.redhat.com/show_bug.cgi?id=1211863
> Patches:
> http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1211863
> 
> Thanks,
> Poornima
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] counters in tiering / request for comments

2016-08-29 Thread Dan Lambright
Below is a write-up on tiering counters (bz 1275917). I give three options; I 
think options (1) and (3) are doable, while (2) is harder and would need more 
discussion.

Currently the counters give limited information on tiering behavior. They are just 
a raw count of the number of files moved in each direction. The overall feature is 
much less usable as a result.

Generally, counters should work with future tiering use cases, i.e. tiering 
according to location or some other policy.

$ gluster volume tier vol1 status
Node          Promoted files   Demoted files   Status
---------     --------------   -------------   -----------
localhost     20               30              in progress
172.17.60.18  0                0               in progress
172.17.60.19  0                0               in progress
172.17.60.20  0                0               in progress

(1)

Customers want to know the total number of files / MB on a tier at any one 
time. I propose we query the database on the bricks for each tier, to get a 
count of the number of files. 

$ gluster volume tier vol1 status
Node          Promoted files / hot count   Demoted files / cold count   Status
---------     --------------------------   --------------------------   -----------
localhost     20 / 500                     30 / 2000                    in progress
172.17.60.18  0                            0                            in progress
172.17.60.19  0                            0                            in progress
172.17.60.20  0                            0                            in progress
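
Roughly what the per-brick file count could look like (a sketch only; the database 
location and the gf_file_tb table name are assumptions from memory and would need 
to be checked against the CTR/gfdb code):

import sqlite3

def count_files_on_brick(brick_db_path):
    # brick_db_path is the CTR sqlite database on a brick; the exact path
    # and the gf_file_tb table name are assumptions in this sketch.
    conn = sqlite3.connect(brick_db_path)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM gf_file_tb").fetchone()
        return count
    finally:
        conn.close()

# Hypothetical brick database path:
print(count_files_on_brick("/rhgs/hotbricks/brick1/.glusterfs/vol1.db"))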

(2)

People need to know the ratio of I/Os served by the hot tier to the cold tier. 
For an administrator, if 90% of your I/Os go to the hot tier, this is good. If 
only 20% are served by the hot tier, this is bad, and there is a 
misconfiguration.

Something like this is what we want:

$ gluster volume tier vol1 status
Node          Promoted files   Demoted files   Read Hit Rate   Write Hit Rate   Status
---------     --------------   -------------   -------------   --------------   -----------
localhost     0                0               80%             75%              in progress

The difficulty is how to capture that. When we read a large file, it is broken 
up into multiple individual reads. Each piece is a single read FOP. Should we 
consider each FOP individually? Or does only the first "hit" to the hot tier 
count?  

Also, when an FOP comes in, it will first look on one tier and then on the other. 
The FOP's callback checks for success or failure; only when the file is found on 
none of the subvolumes does the FOP return an error. New code needs to deal with 
this complexity: if there is a failure on the cold tier but success on the hot 
tier, the "hit count" should be bumped.

We probably do not want to update the "hit rate" on all FOPs. 
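
A sketch of the bookkeeping I have in mind (the names are made up, not existing 
gluster counters): count a hit only when the first wind of a read or write FOP is 
served by the hot tier, so a failure on the cold tier followed by success on the 
hot tier still bumps the hit count, and internal retries are not double counted.

class TierHitStats:
    """Illustrative hit-rate bookkeeping for the tier xlator."""

    def __init__(self):
        self.read_hits = self.read_total = 0
        self.write_hits = self.write_total = 0

    def record_read(self, served_by_hot):
        self.read_total += 1
        if served_by_hot:
            self.read_hits += 1

    def record_write(self, served_by_hot):
        self.write_total += 1
        if served_by_hot:
            self.write_hits += 1

    def read_hit_rate(self):
        return 100.0 * self.read_hits / self.read_total if self.read_total else 0.0

    def write_hit_rate(self):
        return 100.0 * self.write_hits / self.write_total if self.write_total else 0.0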

(3)

A simpler new counter to implement is the number of MB promoted or demoted. I think 
that could be satisfied in a separate patch and could be done more quickly. 

This is the output with (2) and (3) combined:

$ gluster volume tier vol1 status
Node          Promoted files/MB   Demoted files/MB   Read Hit Rate   Write Hit Rate   Status
---------     -----------------   ----------------   -------------   --------------   -----------
localhost     120/2033MB          50/1044MB          80%             75%              in progress
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] md-cache changes and impact on tiering

2016-09-06 Thread Dan Lambright


- Original Message -
> From: "Dan Lambright" 
> To: "Poornima Gurusiddaiah" 
> Cc: "Nithya Balachandran" , "Gluster Devel" 
> 
> Sent: Sunday, August 28, 2016 10:01:36 AM
> Subject: Re: md-cache changes and impact on tiering
> 
> 
> 
> ----- Original Message -
> > From: "Poornima Gurusiddaiah" 
> > To: "Dan Lambright" , "Nithya Balachandran"
> > 
> > Cc: "Gluster Devel" 
> > Sent: Tuesday, August 23, 2016 12:56:38 AM
> > Subject: md-cache changes and impact on tiering
> > 
> > Hi,
> > 
> > The basic patches for md-cache and integrating it with cache-invalidation
> > is
> > merged in master. You could try master build and enable the following
> > settings, to see if there is any impact on tiering performance at all:
> > 
> > # gluster volume set <volname> performance.stat-prefetch on
> > # gluster volume set <volname> features.cache-invalidation on
> > # gluster volume set <volname> performance.cache-samba-metadata on
> > # gluster volume set <volname> performance.md-cache-timeout 600
> > # gluster volume set <volname> features.cache-invalidation-timeout 600
> 
> On the tests I run, this cut the number of LOOKUPs by about three orders of
> magnitude. Each saved lookup reduces a round trip over the network.
> 
> I'm running a "small file" performance test. It creates 16K 64 byte files in
> a seven level directory. It then reads each file twice.
> 
> Configuration is HOT: 2 x 2 ramdisk COLD: 2 x (8 + 4) disk, network is
> 1Mb/s 9000 mtu. The number of lookups is a factor of the number of
> directories and subvolumes. On each I/O the file is re-opened and each
> directory is laboriously rechecked for existence/permission.
> 
> Without using md-cache, these lookups used to be further propagated across
> each subvolume by DHT to obtain the entire layout. So it would be something
> like order of 16K*7*26 round trips across the network.
> 
> The counts are all visible with gluster profile.

I'm going to have to retract the above comments. The optimization does not work 
well for me yet. 

If I follow the traces, something odd happens when the client sends a LOOKUP. 
The server sends an invalidation from the upcall translator's lookup fop 
callback. At that point any future LOOKUPs for that entry are passed right 
through to the server again. This behavior defeats the purpose of using 
md-cache. Can you explain the reasoning behind it?

> 
> 
> > 
> > Note: It has to be executed in the same order.
> > 
> > Tracker bug: https://bugzilla.redhat.com/show_bug.cgi?id=1211863
> > Patches:
> > http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1211863
> > 
> > Thanks,
> > Poornima
> > 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] md-cache changes and impact on tiering

2016-10-08 Thread Dan Lambright

> - Original Message -
> > From: "Poornima Gurusiddaiah" 
> > To: "Dan Lambright" , "Nithya Balachandran"
> > 
> > Cc: "Gluster Devel" 
> > Sent: Tuesday, August 23, 2016 12:56:38 AM
> > Subject: md-cache changes and impact on tiering
> > 
> > Hi,
> > 
> > The basic patches for md-cache and integrating it with cache-invalidation
> > is
> > merged in master. You could try master build and enable the following
> > settings, to see if there is any impact on tiering performance at all:
> > 
> > # gluster volume set <volname> performance.stat-prefetch on
> > # gluster volume set <volname> features.cache-invalidation on
> > # gluster volume set <volname> performance.cache-samba-metadata on
> > # gluster volume set <volname> performance.md-cache-timeout 600
> > # gluster volume set <volname> features.cache-invalidation-timeout 600

To follow up on our discussions at the Berlin Gluster conference, I'll add one more 
important tunable to the above list:

# gluster v set vol1 network.inode-lru-limit 

In my case this was needed because the default setting was too small for my 
workload. I'll also share that there is a new "inode forget" counter in 
gluster profile, which makes it much easier to track cache utilization.

With this set of tunables, I more consistently see nice improvements with tiering 
on small-file workloads. I would imagine md-cache+upcall will help many scenarios 
where "lookup amplification" acts as a drag. I saw some encouraging results 
testing this with RDMA.

Some caveats to acknowledge
 
- client caching takes resources from the end user's machine

- the md-cache timeout does not yet have an "infinity" setting; entries still 
age out artificially

- I am running a very artificial workload using our "smallfile" workload 
generator [1]. It does open/read/close over a large set of files; I've not 
exercised other file operations. 

All that said, it sure seems like a big step forward to me. 

Great to see small file performance improvements with gluster ! :)

[1]
https://github.com/bengland2/smallfile


> 
> 
> > 
> > Note: It has to be executed in the same order.
> > 
> > Tracker bug: https://bugzilla.redhat.com/show_bug.cgi?id=1211863
> > Patches:
> > http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1211863
> > 
> > Thanks,
> > Poornima
> > 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tiering: emergency demotions

2016-10-13 Thread Dan Lambright

- Original Message -
> From: "Milind Changire" 
> To: gluster-devel@gluster.org
> Sent: Thursday, October 13, 2016 7:53:48 AM
> Subject: Re: [Gluster-devel] tiering: emergency demotions
> 
> Dilemma:
> *without* my patch, the demotions in degraded (hi-watermark breached)
> mode happen every 10 seconds by listing *all* files colder than the
> last 10 seconds and sorting them in ascending order w.r.t. the
> (write,read) access time ... so the existing query could take more than
> a minute to list files if there are millions of them
> 
> *with* my patch we currently select a random set of 20 files and demote
> them ... even if they are actively used ... so we either wait for more
> than a minute for the exact listing of cold files in the worst case or
> trade off by demoting hot files without imposing a file selection
> criteria for a quicker turnaround time
> 
> The exponential time window schema to select files discussed over Google
> Hangout has an issue with deciding the start time of the time window,
> although we know the end time being the current time
> 
> So, I think it would be either of the strategies discussed above with a
> trade-off in one way or the other.
> 
> Comments are requested regarding the approach to take for the
> implementation.

Reaching a full hot tier is a catastrophic event; the operator can no longer 
use the volume. If we find ourselves getting close to this situation we should 
take every means to get out of it as soon as possible. Performance is a 
secondary concern in this case.

Right now, a long database query will be O(n). It will therefore always take a 
long time (a minute or more) when there are large numbers of files (e.g. 
>10^6). This is only our current scheme and subject to change someday, but for 
now we must live with O(n).

On the other hand, it may or may not be true that the sample of files we choose 
to demote will include a file that is being actively accessed. We could avoid 
demoting "hot" files by skipping them in the approximate "sample" we take; the 
criterion for skipping could be an elastic window of time that grows to ensure we 
eventually demote enough data.

So I think the "approximate" solution is better, because the long query time 
(on the order of minutes) is something we cannot incur and must avoid, whereas 
the active-file issue is something we can manage.

Avoiding filling up storage units is very much a classic problem. As we know, DHT 
only partially solves it at the moment (write appends can fill up a subvolume). I 
am looking at how Ceph tackles this to see if there are any insights to borrow.
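
One way to sketch the "approximate" pass with an elastic window (the query helper 
below is a hypothetical stand-in for the CTR database query, not real code): start 
by considering only files that have been idle for a long time, and relax the 
threshold whenever too few candidates turn up, so hot files are skipped first but 
we still free enough space to drop below the hi-watermark.

def pick_demotion_candidates(query_idle_longer_than, wanted=20,
                             start_idle_secs=3600, min_idle_secs=10):
    # query_idle_longer_than(secs) is assumed to return files whose last
    # access is at least 'secs' old, coldest first.
    idle = start_idle_secs
    candidates = []
    while idle >= min_idle_secs:
        candidates = query_idle_longer_than(idle)
        if len(candidates) >= wanted:
            break
        idle //= 2          # relax the window: accept warmer files
    return candidates[:wanted]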

> 
> Rafi has also suggested to avoid file creation on the hot tier if the
> hot tier has hi-watermark breached to avoid further stress on storage
> capacity and eventual file migration to the cold tier.
> 
> Do we introduce demotion policies like "strict" and "approximate" to
> let user choose the demotion strategy ?
> 1. strict
> Choosing this strategy could mean we wait for the full and ordered
> query to complete and only then start demoting the coldest file first
> 
> 2. approximate
> Choosing this strategy could mean we choose the first available
> file from the database query and demote it even if it is hot and
> actively written to
> 
> 
> Milind
> 
> On 08/12/2016 08:25 PM, Milind Changire wrote:
> > Patch for review: http://review.gluster.org/15158
> >
> > Milind
> >
> > On 08/12/2016 07:27 PM, Milind Changire wrote:
> >> On 08/10/2016 12:06 PM, Milind Changire wrote:
> >>> Emergency demotions will be required whenever writes breach the
> >>> hi-watermark. Emergency demotions are required to avoid ENOSPC in case
> >>> of continuous writes that originate on the hot tier.
> >>>
> >>> There are two concerns in this area:
> >>>
> >>> 1. enforcing max-cycle-time during emergency demotions
> >>>max-cycle-time is the time the tiering daemon spends in promotions or
> >>>demotions
> >>>I tend to think that the tiering daemon skip this check for the
> >>>emergency situation and continue demotions until the watermark drops
> >>>below the hi-watermark
> >>
> >> Update:
> >> To keep matters simple and manageable, it has been decided to *enforce*
> >> max-cycle-time to yield the worker threads to attend to impending tier
> >> management tasks if the need arises.
> >>
> >>>
> >>> 2. file demotion policy
> >>>I tend to think that evicting the largest file with the most recent
> >>>*write* should be chosen for eviction when write-freq-threshold is
> >>>NON-ZERO.
> >>>Choosing a least written file is just going to delay file migration
> >>>of an active file which might consume hot tier disk space resulting
> >>>in a ENOSPC, in the worst case.
> >>>In cases where write-freq-threshold are ZERO, the most recently
> >>>*written* file can be chosen for eviction.
> >>>In the case of choosing the largest file within the
> >>>write-freq-threshold, a stat() on the files would be required

Re: [Gluster-devel] Possible race condition bug with tiered volume

2016-10-18 Thread Dan Lambright
Dustin,

What level code ? I often run smallfile on upstream code with tiered volumes 
and have not seen this.  

Sure, one of us will get back to you.

Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), and they 
overwhelm the boost in transfer speeds you get for small files. A presentation 
at the Berlin gluster summit evaluated this.  The expectation is md-cache will 
go a long way towards helping that, before too long.

Dan



- Original Message -
> From: "Dustin Black" 
> To: gluster-devel@gluster.org
> Cc: "Annette Clewett" 
> Sent: Tuesday, October 18, 2016 4:30:04 PM
> Subject: [Gluster-devel] Possible race condition bug with tiered volume
> 
> I have a 3x2 hot tier on NVMe drives with a 3x2 cold tier on RAID6 drives.
> 
> # gluster vol info 1nvme-distrep3x2
> Volume Name: 1nvme-distrep3x2
> Type: Tier
> Volume ID: 21e3fc14-c35c-40c5-8e46-c258c1302607
> Status: Started
> Number of Bricks: 12
> Transport-type: tcp
> Hot Tier :
> Hot Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick1: n5:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick2: n4:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick3: n3:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick4: n2:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick5: n1:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Brick6: n0:/rhgs/hotbricks/1nvme-distrep3x2-hot
> Cold Tier:
> Cold Tier Type : Distributed-Replicate
> Number of Bricks: 3 x 2 = 6
> Brick7: n0:/rhgs/coldbricks/1nvme-distrep3x2
> Brick8: n1:/rhgs/coldbricks/1nvme-distrep3x2
> Brick9: n2:/rhgs/coldbricks/1nvme-distrep3x2
> Brick10: n3:/rhgs/coldbricks/1nvme-distrep3x2
> Brick11: n4:/rhgs/coldbricks/1nvme-distrep3x2
> Brick12: n5:/rhgs/coldbricks/1nvme-distrep3x2
> Options Reconfigured:
> cluster.tier-mode: cache
> features.ctr-enabled: on
> performance.readdir-ahead: on
> 
> 
> I am attempting to run the 'smallfile' benchmark tool on this volume. The
> 'smallfile' tool creates a starting gate directory and files in a shared
> filesystem location. The first run (write) works as expected.
> 
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> /rhgs/client/1nvme-distrep3x2 --host-set
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> 
> For the second run (read), I believe that smallfile attempts first to 'rm
> -rf' the "network-sync-dir" path, which fails with ENOTEMPTY, causing the
> run to fail
> 
> # smallfile_cli.py --threads 12 --file-size 4096 --files 300 --top
> /rhgs/client/1nvme-distrep3x2 --host-set
> c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11 --prefix test1 --stonewall Y
> --network-sync-dir /rhgs/client/1nvme-distrep3x2/smf1 --operation create
> ...
> Traceback (most recent call last):
> File "/root/bin/smallfile_cli.py", line 280, in 
> run_workload()
> File "/root/bin/smallfile_cli.py", line 270, in run_workload
> return run_multi_host_workload(params)
> File "/root/bin/smallfile_cli.py", line 62, in run_multi_host_workload
> sync_files.create_top_dirs(master_invoke, True)
> File "/root/bin/sync_files.py", line 27, in create_top_dirs
> shutil.rmtree(master_invoke.network_dir)
> File "/usr/lib64/python2.7/shutil.py", line 256, in rmtree
> onerror(os.rmdir, path, sys.exc_info())
> File "/usr/lib64/python2.7/shutil.py", line 254, in rmtree
> os.rmdir(path)
> OSError: [Errno 39] Directory not empty: '/rhgs/client/1nvme-distrep3x2/smf1'
> 
> 
> From the client perspective, the directory is clearly empty.
> 
> # ls -a /rhgs/client/1nvme-distrep3x2/smf1/
> . ..
> 
> 
> And a quick search on the bricks shows that the hot tier on the last replica
> pair is the offender.
> 
> # for i in {0..5}; do ssh n$i "hostname; ls
> /rhgs/coldbricks/1nvme-distrep3x2/smf1 | wc -l; ls
> /rhgs/hotbricks/1nvme-distrep3x2-hot/smf1 | wc -l"; done
> rhosd0
> 0
> 0
> rhosd1
> 0
> 0
> rhosd2
> 0
> 0
> rhosd3
> 0
> 0
> rhosd4
> 0
> 1
> rhosd5
> 0
> 1
> 
> 
> (For the record, multiple runs of this reproducer show that it is
> consistently the hot tier that is to blame, but it is not always the same
> replica pair.)
> 
> 
> Can someone try recreating this scenario to see if the problem is consistent?
> Please reach out if you need me to provide any further details.
> 
> 
> Dustin Black, RHCA
> Senior Architect, Software-Defined Storage
> Red Hat, Inc.
> (o) +1.212.510.4138 (m) +1.215.821.7423
> dus...@redhat.com
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Possible race condition bug with tiered volume

2016-10-20 Thread Dan Lambright
Dustin,

Your python code looks fine to me... I've been in Ceph C++ weeds lately, I 
kinda miss python ;)

If I run back-to-back smallfile operation "create", then on the second 
smallfile run, I consistently see:  

0.00% of requested files processed, minimum is  70.00
at least one thread encountered error, test may be incomplete

Is this what you get? We can follow up off the mailing list.

Dan

glusterfs 3.7.15 built on Oct 20 2016, with two clients running small file 
against a tiered volume (using ram disk as hot tier, cold disks JBOD, copied 
below) on Fedora 23.

./smallfile_cli.py  --top /mnt/p66p67 --host-set gprfc066,gprfc067 --threads 8 
--files 5000 --file-size 64 --record-size 64 --fsync N --operation read

volume - 

Status: Started
Number of Bricks: 28
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: gprfs020:/home/ram 
Brick2: gprfs019:/home/ram 
Brick3: gprfs018:/home/ram 
Brick4: gprfs017:/home/ram 
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick5: gprfs017:/t0
Brick6: gprfs018:/t0
Brick7: gprfs019:/t0
Brick8: gprfs020:/t0
Brick9: gprfs017:/t1
Brick10: gprfs018:/t1
Brick11: gprfs019:/t1
Brick12: gprfs020:/t1
Brick13: gprfs017:/t2
Brick14: gprfs018:/t2
Brick15: gprfs019:/t2
Brick16: gprfs020:/t2
Brick17: gprfs017:/t3
Brick18: gprfs018:/t3
Brick19: gprfs019:/t3
Brick20: gprfs020:/t3
Brick21: gprfs017:/t4
Brick22: gprfs018:/t4
Brick23: gprfs019:/t4
Brick24: gprfs020:/t4
Brick25: gprfs017:/t5
Brick26: gprfs018:/t5
Brick27: gprfs019:/t5
Brick28: gprfs020:/t5
Options Reconfigured:
cluster.tier-mode: cache   
features.ctr-enabled: on   
performance.readdir-ahead: on


- Original Message -
> From: "Dustin Black" 
> To: "Dan Lambright" 
> Cc: "Milind Changire" , "Annette Clewett" 
> , gluster-devel@gluster.org
> Sent: Wednesday, October 19, 2016 3:23:04 PM
> Subject: Re: [Gluster-devel] Possible race condition bug with tiered volume
> 
> # gluster --version
> glusterfs 3.7.9 built on Jun 10 2016 06:32:42
> 
> 
> Try not to make fun of my python, but I was able to make a small
> modification to the to the sync_files.py script from smallfile and at least
> enable my team to move on with testing. It's terribly hacky and ugly, but
> works around the problem, which I am pretty convinced is a Gluster bug at
> this point.
> 
> 
> # diff bin/sync_files.py.orig bin/sync_files.py
> 6a7,8
> > import errno
> > import binascii
> 27c29,40
> < shutil.rmtree(master_invoke.network_dir)
> ---
> > try:
> > shutil.rmtree(master_invoke.network_dir)
> > except OSError as e:
> > err = e.errno
> > if err != errno.EEXIST:
> > # workaround for possible bug in Gluster
> > if err != errno.ENOTEMPTY:
> > raise e
> > else:
> > print('saw ENOTEMPTY on stonewall, moving shared
> directory')
> > ext = str(binascii.b2a_hex(os.urandom(15)))
> > shutil.move(master_invoke.network_dir,
> master_invoke.network_dir + ext)
> 
> 
> Dustin Black, RHCA
> Senior Architect, Software-Defined Storage
> Red Hat, Inc.
> (o) +1.212.510.4138  (m) +1.215.821.7423
> dus...@redhat.com
> 
> 
> On Tue, Oct 18, 2016 at 7:09 PM, Dustin Black  wrote:
> 
> > Dang. I always think I get all the detail and inevitably leave out
> > something important. :-/
> >
> > I'm mobile and don't have the exact version in front of me, but this is
> > recent if not latest RHGS on RHEL 7.2.
> >
> >
> > On Oct 18, 2016 7:04 PM, "Dan Lambright"  wrote:
> >
> >> Dustin,
> >>
> >> What level code ? I often run smallfile on upstream code with tiered
> >> volumes and have not seen this.
> >>
> >> Sure, one of us will get back to you.
> >>
> >> Unfortunately, gluster has a lot of protocol overhead (LOOKUPs), and they
> >> overwhelm the boost in transfer speeds you get for small files. A
> >> presentation at the Berlin gluster summit evaluated this.  The expectation
> >> is md-cache will go a long way towards helping that, before too long.
> >>
> >> Dan
> >>
> >>
> >>
> >> - Original Message -
> >> > From: "Dustin Black" 
> >> > To: gluster-devel@gluster.org
> >> > Cc: "Annette Clewett" 
> >> > Sent: Tuesday, October 18, 2016 4:30:04 PM
> >> > Subject: [Gluster-devel] Possible race condition bug with tiered volume
> &g

Re: [Gluster-devel] New commands for supporting add/remove brick and rebalance on tiered volume

2016-10-22 Thread Dan Lambright


- Original Message -
> From: "Hari Gowtham" 
> To: "Atin Mukherjee" 
> Cc: "gluster-users" , "gluster-devel" 
> 
> Sent: Friday, October 21, 2016 3:52:34 AM
> Subject: Re: [Gluster-devel] New commands for supporting add/remove brick and 
> rebalance on tiered volume
> 
> Hi,
> 
> Currently there are two suggested options for the syntax of add/remove brick:
> 
> 1) gluster v tier <volname> add-brick [replica <count>] [tier-type <type>]
> <brick> ...
> 
> This syntax shows that it is an add-brick operation on a tiered volume through an
> argument, instead of distinguishing it by the command. The separation of tier-type
> is done through parsing. When it comes to parsing [replica <count>]
> [tier-type <type>], we need to distinguish between the tier-type, the replica
> count and the bricks. All three variables make it complicated to parse out the
> replica count, tier-type and brick.
> 
> currently the parsing is like:
> w = str_getunamb (words[3], opwords_cl);
> if (!w) {
> type = GF_CLUSTER_TYPE_NONE;
> .
> .
> } else if ((strcmp (w, "replica")) == 0) {
> type = GF_CLUSTER_TYPE_REPLICATE;
> .
> .
> }
> } else if ((strcmp (w, "stripe")) == 0) {
> type = GF_CLUSTER_TYPE_STRIPE;
> .
> .
> } else {
> .
> .
> }
> 
> We can reuse the existing replica parsing as long as replica comes before
> tier-type in the syntax, and add the parsing for tier-type using words[4]
> instead of words[3] in the same way. If it is a plain distribute volume then
> tier-type arrives at words[3], so we have to parse it again by checking the
> word count; the word count influences the parsing to a great extent.
> Having the tier-type after replica looks a bit off, as tier-type is the more
> important piece here, so we could put tier-type before the replica count. That
> ordering would have to be maintained throughout, and separate parsing can make
> it work. Both of these choices also influence the brick_index used for parsing
> the bricks, making the switch on word_count somewhat unclear.
> This can be done, but it will add a lot of complication to the code.
> 
> 2) gluster v tier <volname> add-hot-brick/add-cold-brick [replica <count>]
> <brick> ...
> 
> In this syntax, we remove the tier-type from parsing and state the type in the
> command itself. The parsing remains the same as add-brick parsing, since
> differentiating between hot and cold bricks is done by the command:
> 
> if (!strcmp(words[1], "detach-tier")) {
> ret = do_cli_cmd_volume_detach_tier (state, word,
>  words, wordcount);
> goto out;
> 
> } else if (!strcmp(words[1], "attach-tier")) {
> ret = do_cli_cmd_volume_attach_tier (state, word,
>  words, wordcount);
> goto out;
> } else if (!strcmp(words[3], "add-hot-brick")) {
> 
> ret = do_cli_cmd_volume_add_hotbr_tier (state, word,
>  words, wordcount-1);
> goto out;
> } else if (!strcmp(words[3], "add-cold-brick")) {
> 
> ret = do_cli_cmd_volume_add_coldbr_tier (state, word,
>  words, wordcount-1);
> goto out;
> }
> 
> It gets differentiated here and is sent to the respective function, and the
> parsing remains the same.
> 
> Let me know which of the two is the better one to follow.


We might someday have more tiering "types" than "hot" and "cold". We may have an 
"archive" tier, for example, or time-based rather than automated tiers, 
secure/unsecure tiers, etc.  

We should not assume there is only one hot/cold pair. Someday there could be 
multiple hot/cold tiers, as in "vol1" below. How do we distinguish them? Numbering 
each tier according to the leaf's position in the graph and calling this the 
"tier-id" would be sensible. Option (1) seems a closer fit to that (we would 
change "tier-type" to "tier-id"). I'm not sure we are ready to make the jump to 
"tier-id" rather than "tier-type"; that's a different discussion. 

Overall, I prefer option (1); it seems to more easily keep future options open.

vol1
- unsecure
-- hot        T1
-- cold       T2
- secure
-- nearline
--- hot       T3
--- cold      T4
-- archive    T5

vol1
- hot         T1
- cold        T2
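
To make the "tier-id" idea concrete, here is a toy sketch (purely illustrative, 
not proposed code) that numbers the leaves of the example graph above in 
depth-first order:

def assign_tier_ids(node, ids=None):
    # A leaf of the tier graph is an actual tier; give it the next id.
    if ids is None:
        ids = {}
    children = node.get("children", [])
    if not children:
        ids[node["name"]] = "T%d" % (len(ids) + 1)
    for child in children:
        assign_tier_ids(child, ids)
    return ids

vol1 = {"name": "vol1", "children": [
    {"name": "unsecure", "children": [{"name": "unsecure-hot"},
                                      {"name": "unsecure-cold"}]},
    {"name": "secure", "children": [
        {"name": "nearline", "children": [{"name": "nearline-hot"},
                                          {"name": "nearline-cold"}]},
        {"name": "archive"}]}]}

print(assign_tier_ids(vol1))
# {'unsecure-hot': 'T1', 'unsecure-cold': 'T2', 'nearline-hot': 'T3',
#  'nearline-cold': 'T4', 'archive': 'T5'}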

> 
> - Original Message -
> > From: "Hari Gowtham" 
> > To: "Atin Mukherjee" 
> > Cc: "gluster-users" , "gluster-devel"
> > 
> > Sent: Monday, October 3, 2016 4:11:40 PM
> > Subject: Re: [Gluster-devel] New commands for supporting add/remove brick
> > and rebalance on tiered volume
> > 
> > Yes. this sounds better than having two separate commands for each tier.
> > If i don't get any other better solution will go with this one.
> > Thanks Atin.
> > 
> > - Original Message -
> > >

Re: [Gluster-devel] Release 3.10 feature proposal : Volume expansion on tiered volumes.

2016-12-08 Thread Dan Lambright


- Original Message -
> From: "Shyam" 
> To: "Hari Gowtham" , "gluster-devel" 
> 
> Sent: Thursday, December 8, 2016 7:35:27 AM
> Subject: Re: [Gluster-devel] Release 3.10 feature proposal : Volume expansion 
> on tiered volumes.
> 
> Hi Hari,
> 
> Thanks for posting this issue to be considered part of 3.10.
> 
> I have a few questions inline.
> 
> Shyam
> 
> On 12/08/2016 01:23 AM, Hari Gowtham wrote:
> > Hi,
> >
> > To support add/remove brick on tiered volumes we are planning to separate
> > the tier into a separate process in the service framework and add the
> > add/remove brick support. Later, users will be able to run rebalance
> > on tiered volumes (which is not possible today).
> 
> I assume tier as a separate process is from the rebalance daemon
> perspective, right? Or is it about separating the xlator code from DHT?
> 
> Also, Dan would like your comments as Tier maintainer, on the maturity
> of the below proposal for 3.10 inclusion? Could you also add the
> required labels [2] to the issue as you see fit, and if this passes your
> inspection, then let us know and I can mark it for 3.10 milestone in github.

The first part of this project "tier as a service" can probably get into 3.10. 
I will discuss a bit more with Hari and the glusterd team to confirm the entire 
feature will make it.

> 
> >
> > The following are the steps planed to be performed:
> >
> > *) tier as a service (final stages of code review)
> 
> Can we get links to the code, and also the design spec if available, for
> the above (and possibly as a whole)
> 
> > *) we are separating the attach tier from add brick and detach from
> >remove brick.
> > *) infra to support add/remove brick.
> > *) rebalance process on a tiered volume.
> > *) a few patches to take care of the issues that will be arising
> >eg: while adding a brick on a tiered volume, the tier process has to
> >be stopped as the graph switch occurs. and other issues like this.
> >
> > The whole volume expansion will be in an experimental state. while the
> > separation of tier into a separate service framework and attach/detach
> > tier separation from add/remove brick should be back to stable state before
> > the release of 3.10
> 
> What is the mitigation plan in case this does not get stable? Would you
> have all commits in ready but not merged state till it is stable?
> 
> This looks like a big change, and also something that has been going on
> for some time now, based on your comments above.
> 
> >
> > [1] https://github.com/gluster/glusterfs/issues/54
> [2] https://github.com/gluster/glusterfs/labels
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] gluster tiering maintainer

2017-01-24 Thread Dan Lambright
Dear Gluster Community,

2017 has arrived, and I have taken an opportunity which will require a new 
maintainer for gluster tiering to replace me. I will continue to be available 
to help with the feature as a contributor. 

As seen at this year's CES [1], new storage types are coming fast. 
Customers will likely use a mix of them, and tiering is a logical 
consideration. Going forward, the feature should support multiple tiers, better 
performance while migration is underway, and migration according to attributes 
other than hit rate. Under the hood, tiering should evolve with the rest of 
gluster (e.g. DHT2), the database versus other algorithms should be analyzed 
critically, and the codebase should somehow be detached from DHT. All 
significant and interesting challenges. 

Gluster has come a long way since I joined- and I think it will only get better.

Dan

[1]
http://www.theregister.co.uk/2017/01/04/optane_arrives_at_ces/
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures again

2015-07-10 Thread Dan Lambright


- Original Message -
> From: "Atin Mukherjee" 
> To: "Vijaikumar Mallikarjuna" 
> Cc: "Gluster Devel" 
> Sent: Wednesday, July 8, 2015 12:46:42 PM
> Subject: Re: [Gluster-devel] Spurious failures again
> 
> 
> 
> I think our linux regression is again unstable. I am seeing at least 10 such
> test cases ( if not more) which have failed. I think we should again start
> maintaining an etherpad page (probably the same earlier one) and keep track
> of them otherwise it will be difficult to track what is fixed and what's not
> if we have to go through mails.
> 
> Thoughts?


+2 to this, we worked very hard to fix a spurious problem in one of our tests 
and have held off merging it until it passes, but we keep hitting other 
spurious errors. 

> 
> -Atin
> Sent from one plus one
> On Jul 8, 2015 8:45 PM, "Vijaikumar M" < vmall...@redhat.com > wrote:
> 
> 
> 
> 
> On Wednesday 08 July 2015 03:53 PM, Vijaikumar M wrote:
> 
> 
> 
> 
> On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote:
> 
> 
> I've been hitting spurious failures in Linux regression runs for my change
> [1].
> 
> The following tests failed,
> ./tests/basic/afr/replace-brick-self-heal.t [2]
> ./tests/bugs/replicate/bug-1238508-self-heal.t [3]
> ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4]
> I will look into this issue
> Patch submitted: http://review.gluster.org/#/c/11583/
> 
> 
> 
> 
> 
> 
> 
> ./tests/bugs/quota/bug-1235182.t [5]
> I have submitted two patches to fix failures from 'bug-1235182.t'
> http://review.gluster.org/#/c/11561/
> http://review.gluster.org/#/c/11510/
> 
> 
> 
> ./tests/bugs/replicate/bug-977797.t [6]
> 
> Can AFR and quota owners look into this?
> 
> Thanks.
> 
> Kaushal
> 
> [1] https://review.gluster.org/11559
> [2]
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull
> [3]
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull
> [4]
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull
> [5]
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull
> [6]
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull
> 
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Fwd: Nightly build for 3.7

2015-08-18 Thread Dan Lambright
All,

I do not see a build for 3.7 after July 27th at
http://download.gluster.org/pub/gluster/glusterfs/nightly/glusterfs-3.7/
Is there an alternate location for the nightly build of 3.7?

Regards,
Dan & Vivek



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Roadmap for afr, ec

2015-09-18 Thread Dan Lambright


- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "fanghuang data" , "Gluster Devel" 
> , "Xavier Hernandez"
> , "Dan Lambright" 
> Sent: Friday, September 18, 2015 1:25:30 AM
> Subject: Re: [Gluster-devel] Roadmap for afr, ec
> 
> 
> 
> On 09/16/2015 03:42 PM, fanghuang.d...@yahoo.com wrote:
> > Hi Pranith,
> >
> > For the EC encoding/decoding algorithm, could we design a plug-in mechanism
> > so that users can choose their own algorithm or use a third-party library,
> > just like Ceph? And I am also curious why the IDA algorithm was originally
> > chosen instead of the commonly used Reed-Solomon algorithm?
Pluggability of algorithms is also in the plan. I never really bothered to
check which algorithm was used, and was under the impression that we are
using Reed-Solomon nonsystematic erasure codes, as told to me by Dan (CCed).

Reed-Solomon error correction is a general-purpose coding technique. It is used 
for scratched compact discs and noisy WANs, as well as for erasure coding. 

The way I read it, Rabin's IDA (information dispersal algorithm) describes a 
process for coding files over networks (distributed systems), but I do not 
think it mandates a particular coding algorithm. So you can plug Tornado 
codes, XOR Cauchy codes, etc. into the scheme.

So my interpretation would be that Xavi implemented nonsystematic IDA using 
Reed-Solomon encoding, and we would like to change the implementation to be 
systematic, with plug-in algorithms.

That is just my interpretation; I make no claims to be an expert.
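
As a toy illustration of the systematic versus nonsystematic distinction (simple 
XOR parity in Python; this is not what ec implements, just the idea that in a 
systematic code the data fragments are stored verbatim, so reads need no decoding 
unless a fragment is lost):

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_systematic(d0, d1):
    # k=2 data fragments plus one parity fragment, spread over three bricks.
    return [d0, d1, xor_bytes(d0, d1)]

def recover(fragments):
    # Any single missing fragment (None) can be rebuilt from the other two.
    d0, d1, parity = fragments
    if d0 is None:
        d0 = xor_bytes(d1, parity)
    if d1 is None:
        d1 = xor_bytes(d0, parity)
    return d0 + d1

fragments = encode_systematic(b"hell", b"o wo")
fragments[1] = None              # lose one brick
print(recover(fragments))        # b'hello wo'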

> 
> Pranith
> >   
> > Best Regards,
> > Fang Huang
> >
> >
> >> On Monday, 14 September 2015, 16:30, Pranith Kumar Karampuri
> >>  wrote:
> >>> hi,
> >> Here is a list of common improvements for both ec and afr planned over
> >> the next few months:
> >>
> >> 1) Granular entry self-heals.
> >>Both afr and ec at the moment do lot of readdirs and lookups to
> >> figure out the differences between the directories to perform heals.
> >> Kritika, Ravi, Anuradha and I are discussing about how to prevent this.
> >> The base algo is to store only the names that need heal in
> >> .glusterfs/indices/entry-changes// as links to base
> >> file in .glusterfs/indices/entry-changes of the bricks. So only the
> >> names that need to be healed will be going through name heals.
> >> We want to complete this for 3.8 definitely.
> >>
> >> 2) Granular data self-heals.
> >>At the moment even if a single byte changes in the file afr, ec
> >>    At the moment, even if a single byte changes in the file, afr and ec
> >> read the entire file to fix the problem. We are thinking of preventing
> >> attributes. There will be a new extended attribute on the file which
> >> represents a bit map of the changes and each bit represents a range that
> >> needs healing. This extended attribute will have a maximum size it can
> >> represent, the extra chunks will be represented like shards in
> >> .glusterfs/indices/data-changes/> extended
> >> attribute on
> >> this block will store ranges that need heals.
> >>
> >> For example: If we have extended attribute value maximum size as 4KB and
> >> each bit represents 128KB (i.e. first bit represents changes done from
> >> offset 0-128KB, 2nd bit 128KB+1-256KB etc.), In single extended
> >> attribute we can store changes happening to file upto 4GB (We are
> >> thinking of dynamically increasing the size represented by each bit from
> >> say 4k to 128k, but this is still in design). For changes that are
> >> happening from offset 4GB+1 - 8GB will be stored in extended attribute
> >> of .glusterfs/indices/data-changes/. Changes happening
> >> from offset 8GB+1 to 12GB will be stored in extended attribute of
> >> .glusterfs/indices/data-changes/, (please note that
> >> these files are empty, they will just contain extended attributes) etc.
> >> We want to complete this for 3.8 (stretch goal)
> >>
> >> 3) Performance & throttling improvements for self-heal:
> >>We are also looking into the multi-threaded self-heal daemon patch
> >> by Richard for inclusion in 3.8. We are waiting for the discussions by
> >> Raghavendra G on QoS to be over before coming to any decisions on
> >> throttling.
> >>
> >> After we have compound fops:
> >> Goal here is to come up with compound fops and prevent un-necessary
> >> round trips:
> >> 4) Transaction l

[Gluster-devel] spurious regression errors getting worse

2015-11-05 Thread Dan Lambright
It seems to have become more difficult in the last week to pass regression 
tests.

I've started recording the tests that seem to be failing the most:

bug-1221481-allow-fops-on-dir-split-brain.t
bug-1238706-daemons-stop-on-peer-cleanup.t
./tests/bugs/quota/bug-1235182.t
./tests/bugs/distribute/bug-1066798.t
./tests/bugs/snapshot/bug-1166197.t

In some cases regression must be run a half dozen times before finally passing.

Could the owners of those tests please look into these?
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] spurious regression errors getting worse

2015-11-09 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Atin Mukherjee" , "Rajesh Joseph" 
> 
> Cc: "Dan Lambright" , "Gluster Devel" 
> 
> Sent: Monday, November 9, 2015 7:39:28 AM
> Subject: Re: [Gluster-devel] spurious regression errors getting worse
> 
> On Fri, Nov 06, 2015 at 09:36:50AM +0530, Atin Mukherjee wrote:
> > 
> > 
> > On 11/06/2015 07:47 AM, Dan Lambright wrote:
> > > It seems to have become more difficult in the last week to pass
> > > regression tests.
> > > 
> > > I've started recording the tests that seem to be failing the most:
> > > 
> > > bug-1221481-allow-fops-on-dir-split-brain.t
> > > bug-1238706-daemons-stop-on-peer-cleanup.t
> > You shouldn't be worried about
> > bug-1238706-daemons-stop-on-peer-cleanup.t as its marked as bad. Also
> > respective failure links for all these tests would help component owners
> > to root cause the issues. Probably you could add all of them here [1]
> 
> We really need to file bugs for the problems, it allows us to get
> notifications about problems. Etherpads can be nice to dedicated work,
> but many of us do not check them regularly.

There does not seem to be a small core set of bugs that fails regularly. 
Rather, the set of spurious failures seems to be large, so we would be filing a 
lot of bugs. But it could be done, and it is probably more effective than a 
rarely consulted etherpad.


> 
> Rajesh, tests/bugs/snapshot/bug-1227646.t resulted in a core for this
> run:
> 
>   
> https://build.gluster.org/job/rackspace-regression-2GB-triggered/15664/consoleFull
> 
> I'm not sure if someone is looking into that already?
> 
> Thanks,
> Niels
> 
> > 
> > [1] https://public.pad.fsfe.org/p/gluster-spurious-failures
> > 
> > Thanks,
> > Atin
> > > ./tests/bugs/quota/bug-1235182.t
> > > ./tests/bugs/distribute/bug-1066798.t
> > > ./tests/bugs/snapshot/bug-1166197.t
> > > 
> > > In some cases regression must be run a half dozen times before finally
> > > passing.
> > > 
> > > Could the owners those tests please look into these?
> > > ___
> > > Gluster-devel mailing list
> > > Gluster-devel@gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-devel
> > > 
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] spurious regression errors getting worse

2015-11-09 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" 
> Cc: "Gluster Devel" 
> Sent: Monday, November 9, 2015 4:09:08 PM
> Subject: Re: [Gluster-devel] spurious regression errors getting worse
> 
> On Thu, Nov 05, 2015 at 09:17:28PM -0500, Dan Lambright wrote:
> > It seems to have become more difficult in the last week to pass regression
> > tests.
> > 
> > I've started recording the tests that seem to be failing the most:
> > 
> > bug-1221481-allow-fops-on-dir-split-brain.t
> > bug-1238706-daemons-stop-on-peer-cleanup.t
> > ./tests/bugs/quota/bug-1235182.t
> > ./tests/bugs/distribute/bug-1066798.t
> > ./tests/bugs/snapshot/bug-1166197.t
> > 
> > In some cases regression must be run a half dozen times before finally
> > passing.
> > 
> > Could the owners those tests please look into these?
> 
> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/11617/consoleFull
> failed on
> 
> [16:18:49] ./tests/basic/tier/fops-during-migration-pause.t ..
> not ok 19
> not ok 20
> Failed 2/20 subtests
> [16:18:49]
> 
> Please have a look. Thanks,

Hm. This one most certainly broke due to one of the fixes we merged over the 
weekend; it's spurious and snuck through.
Will fix it right away.
Thank you

> Niels
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] spurious regression errors getting worse

2015-11-11 Thread Dan Lambright


- Original Message -
> From: "Dan Lambright" 
> To: "Niels de Vos" 
> Cc: "Gluster Devel" 
> Sent: Monday, November 9, 2015 4:18:31 PM
> Subject: Re: [Gluster-devel] spurious regression errors getting worse
> 
> 
> 
> - Original Message -
> > From: "Niels de Vos" 
> > To: "Dan Lambright" 
> > Cc: "Gluster Devel" 
> > Sent: Monday, November 9, 2015 4:09:08 PM
> > Subject: Re: [Gluster-devel] spurious regression errors getting worse
> > 
> > On Thu, Nov 05, 2015 at 09:17:28PM -0500, Dan Lambright wrote:
> > > It seems to have become more difficult in the last week to pass
> > > regression
> > > tests.
> > > 
> > > I've started recording the tests that seem to be failing the most:
> > > 
> > > bug-1221481-allow-fops-on-dir-split-brain.t
> > > bug-1238706-daemons-stop-on-peer-cleanup.t
> > > ./tests/bugs/quota/bug-1235182.t
> > > ./tests/bugs/distribute/bug-1066798.t
> > > ./tests/bugs/snapshot/bug-1166197.t
> > > 
> > > In some cases regression must be run a half dozen times before finally
> > > passing.
> > > 
> > > Could the owners those tests please look into these?
> > 
> > https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/11617/consoleFull
> > failed on
> > 
> > [16:18:49] ./tests/basic/tier/fops-during-migration-pause.t ..
> > not ok 19
> > not ok 20
> > Failed 2/20 subtests
> > [16:18:49]
> > 
> > Please have a look. Thanks,

https://build.gluster.org/job/rackspace-regression-2GB-triggered/15785/consoleFull

failed on 

./tests/bugs/fuse/many-groups-for-acl.t: 1 new core files

[root@rhs-cli-11 glusterfs]# git blame 
./tests/bugs/fuse/many-groups-for-acl.t|grep Niels|wc -l
113


> 
> Hm. This one most certainly broke due to one of the fixes we merged over the
> weekend, its spurious and snuck through.
> Will fix it right away.
> Thank you
> 
> > Niels
> > 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [release-3.7] regression failure in ./tests/basic/tier/fops-during-migration-pause.t

2015-11-19 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: gluster-devel@gluster.org
> Cc: "Dan Lambright" , "Nithya Balachandran" 
> , "Joseph Fernandes"
> 
> Sent: Thursday, November 19, 2015 11:25:06 AM
> Subject: [release-3.7] regression failure in 
> ./tests/basic/tier/fops-during-migration-pause.t
> 
> Backports of bug fixes are getting regression failures, the current one
> failed due to ./tests/basic/tier/fops-during-migration-pause.t.
> 
> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/11853/consoleFull
> 
> Test Summary Report
> ---
> ./tests/basic/tier/fops-during-migration-pause.t (Wstat: 0 Tests: 20
> Failed: 3)
>   Failed tests:  18-20
> Files=1, Tests=20, 454 wallclock secs ( 0.04 usr  0.01 sys +  3.67 cusr
> 5.98 csys =  9.70 CPU)
> Result: FAIL
> 
> Was this fixed in the master branch already? I do not remember seeing
> this fail recently.
> 
> Please be so kind to send a backport to the relevant release-3.*
> branches when a bugfix gets merged in the master branch.

The fix in this case for 3.7 is change 12647. It was stalled because the 
automatic regression run was not kicked off. I triggered it manually (once I 
learned how that is done this morning) and it passed; that fix is now merged. 
I'll keep an eye on 3.7 regressions to confirm it fixes the problem.

> 
> Thanks,
> Niels
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] spurious regression error / bug-924726.t

2015-11-24 Thread Dan Lambright
This test case has failed several times for me in the last few days:

./tests/bugs/fuse/bug-924726.t

https://build.gluster.org/job/rackspace-regression-2GB-triggered/16141/consoleFull

I'll cc: the author (per git blame).

- Original Message -
> From: "Dan Lambright" 
> To: "Niels de Vos" 
> Cc: "Gluster Devel" 
> Sent: Wednesday, November 11, 2015 6:24:39 PM
> Subject: Re: [Gluster-devel] spurious regression errors getting worse
> 
> 
> 
> - Original Message -
> > From: "Dan Lambright" 
> > To: "Niels de Vos" 
> > Cc: "Gluster Devel" 
> > Sent: Monday, November 9, 2015 4:18:31 PM
> > Subject: Re: [Gluster-devel] spurious regression errors getting worse
> > 
> > 
> > 
> > - Original Message -
> > > From: "Niels de Vos" 
> > > To: "Dan Lambright" 
> > > Cc: "Gluster Devel" 
> > > Sent: Monday, November 9, 2015 4:09:08 PM
> > > Subject: Re: [Gluster-devel] spurious regression errors getting worse
> > > 
> > > On Thu, Nov 05, 2015 at 09:17:28PM -0500, Dan Lambright wrote:
> > > > It seems to have become more difficult in the last week to pass
> > > > regression
> > > > tests.
> > > > 
> > > > I've started recording the tests that seem to be failing the most:
> > > > 
> > > > bug-1221481-allow-fops-on-dir-split-brain.t
> > > > bug-1238706-daemons-stop-on-peer-cleanup.t
> > > > ./tests/bugs/quota/bug-1235182.t
> > > > ./tests/bugs/distribute/bug-1066798.t
> > > > ./tests/bugs/snapshot/bug-1166197.t
> > > > 
> > > > In some cases regression must be run a half dozen times before finally
> > > > passing.
> > > > 
> > > > Could the owners those tests please look into these?
> > > 
> > > https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/11617/consoleFull
> > > failed on
> > > 
> > > [16:18:49] ./tests/basic/tier/fops-during-migration-pause.t ..
> > > not ok 19
> > > not ok 20
> > > Failed 2/20 subtests
> > > [16:18:49]
> > > 
> > > Please have a look. Thanks,
> 
> https://build.gluster.org/job/rackspace-regression-2GB-triggered/15785/consoleFull
> 
> failed on
> 
> ./tests/bugs/fuse/many-groups-for-acl.t: 1 new core files
> 
> [root@rhs-cli-11 glusterfs]# git blame
> ./tests/bugs/fuse/many-groups-for-acl.t|grep Niels|wc -l
> 113
> 
> 
> > 
> > Hm. This one most certainly broke due to one of the fixes we merged over
> > the
> > weekend, its spurious and snuck through.
> > Will fix it right away.
> > Thank you
> > 
> > > Niels
> > > 
> >
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] ./tests/bugs/fuse/bug-924726.t failing regression

2015-12-01 Thread Dan Lambright

This test

./tests/bugs/fuse/bug-924726.t

has failed regression multiple times; I sent an email about it the other week. 
I'd like to put it to sleep (mark it as a known-bad test) for a bit. If there 
are no objections (or fixes), I'll do that tomorrow.

Dan



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] volume-snapshot.t failures

2015-12-21 Thread Dan Lambright
They fail on RHEL6 machines due to an issue with sqlite. The test has already 
been moved to the ignore list (change 13056).
I'd like to look into having tests that do not run on RHEL6, only on RHEL7 and 
later. I've been running the test in a loop on RHEL7 for the last few hours, 
successfully.

- Original Message -
> From: "Raghavendra Gowdappa" 
> To: "Rajesh Joseph" 
> Cc: "Gluster Devel" 
> Sent: Tuesday, December 22, 2015 1:42:13 AM
> Subject: Re: [Gluster-devel] volume-snapshot.t failures
> 
> Both these tests succeed on my local machine.
> 
> - Original Message -
> > From: "Raghavendra Gowdappa" 
> > To: "Rajesh Joseph" 
> > Cc: "Gluster Devel" 
> > Sent: Tuesday, December 22, 2015 12:05:24 PM
> > Subject: Re: [Gluster-devel] volume-snapshot.t failures
> > 
> > Seems like a snapshot failure on build machines. Found another failure:
> > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console
> > 
> > Test failed:
> > ./tests/basic/tier/tier-snapshot.t
> > 
> > Debug-msg:
> > ++ gluster --mode=script --wignore snapshot create snap2 patchy
> > no-timestamp
> > snapshot create: failed: Pre-validation failed on localhost. Please check
> > log
> > file for details
> > + test_footer
> > + RET=1
> > + local err=
> > + '[' 1 -eq 0 ']'
> > + echo 'not ok 11 '
> > not ok 11
> > + '[' x0 = x0 ']'
> > + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create
> > snap2
> > patchy no-timestamp'
> > FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2
> > patchy
> > no-timestamp
> > 
> > - Original Message -
> > > From: "Raghavendra Gowdappa" 
> > > To: "Rajesh Joseph" 
> > > Sent: Tuesday, December 22, 2015 10:05:22 AM
> > > Subject: volume-snapshot.t failures
> > > 
> > > Hi Rajesh
> > > 
> > > There is a failure of volume-snapshot.t on build machine:
> > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull
> > > 
> > > 
> > > However, on my local machine test succeeds always. Is it a known case of
> > > spurious failure?
> > > 
> > > regards,
> > > Raghavendra.
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> > 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tests/bugs/tier/bug-1286974.t failed and dropped a core

2016-01-12 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" , "Joseph Fernandes" 
> 
> Cc: gluster-devel@gluster.org
> Sent: Tuesday, January 12, 2016 3:52:51 PM
> Subject: tests/bugs/tier/bug-1286974.t failed and dropped a core
> 
> Hi guys,
> 
> could you please have a look at this regression test failure?

Pranith, could you or someone with EC expertise help us diagnose this problem? 

The test script does: 

TEST touch /mnt/glusterfs/0/file{1..100}; 

I see some number of errors such as:

[2016-01-12 20:14:26.888412] E [MSGID: 122063] 
[ec-common.c:943:ec_prepare_update_cbk] 0-patchy-disperse-0: Unable to get size 
xattr [No such file or directory]
[2016-01-12 20:14:26.888493] E [MSGID: 109031] 
[dht-linkfile.c:301:dht_linkfile_setattr_cbk] 0-patchy-tier-dht: Failed to set 
attr uid/gid on /file28 :  [No such file or directory]

.. right before the crash. The backtrace is in mnt-glusterfs-0.log; it failed 
in the EC function ec_manager_setattr(). 

It appears to be an assert, if I found the code right.

GF_ASSERT(ec_get_inode_size(fop, fop->locks[0].lock->loc.inode,
                            &cbk->iatt[0].ia_size));

/lib64/libc.so.6(+0x2b74e)[0x7f84c62a974e]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x7f84c62a9810]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x312f5)[0x7f84ba5ce2f5]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x14918)[0x7f84ba5b1918]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x10756)[0x7f84ba5ad756]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x1093c)[0x7f84ba5ad93c]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x2fbe0)[0x7f84ba5ccbe0]
/build/install/lib/glusterfs/3.8dev/xlator/cluster/disperse.so(+0x30ea9)[0x7f84ba5cdea9]
/build/install/lib/glusterfs/3.8dev/xlator/protocol/client.so(+0x1f706)[0x7f84ba854706]
/build/install/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x1b2)[0x7f84c74e542a]
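
As a stand-alone illustration of why that pattern cores (this is a sketch, not
the actual disperse code; get_inode_size() below is a stand-in for
ec_get_inode_size(), hard-wired to fail the way the "Unable to get size xattr"
errors above suggest): asserting on a lookup that can legitimately fail aborts
the process, while checking the return value keeps the failure recoverable.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for ec_get_inode_size(): pretend the size xattr is absent. */
static bool get_inode_size(unsigned long *size)
{
    (void)size;
    return false; /* the "No such file or directory" case seen in the logs */
}

int main(void)
{
    unsigned long size = 0;

    if (!get_inode_size(&size))
        fprintf(stderr, "size xattr unavailable; defer/heal instead of asserting\n");

    /* GF_ASSERT(get_inode_size(&size));   <-- the pattern that dumps core */
    return 0;
}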

> 
> 
> https://build.gluster.org/job/rackspace-regression-2GB-triggered/17475/consoleFull
> 
> [20:14:46] ./tests/bugs/tier/bug-1286974.t ..
> not ok 16
> Failed 1/24 subtests
> [20:14:46]
> 
> Test Summary Report
> ---
> ./tests/bugs/tier/bug-1286974.t (Wstat: 0 Tests: 24 Failed: 1)
>   Failed test:  16
> Files=1, Tests=24, 37 wallclock secs ( 0.03 usr  0.01 sys +  2.38 cusr
> 0.77 csys =  3.19 CPU)
> Result: FAIL
> ./tests/bugs/tier/bug-1286974.t: bad status 1
> ./tests/bugs/tier/bug-1286974.t: 1 new core files
> Ignoring failure from known-bad test ./tests/bugs/tier/bug-1286974.t
> 
> Failures are ignored as mentioned in the last line, but cores are not
> allowed. Please prevent this from happening :)
> 
> Thanks,
> Niels
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reverse brick order in tier volume- Why?

2016-01-22 Thread Dan Lambright


- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Ravishankar N" , "Gluster Devel" 
> , "Dan Lambright"
> , "Joseph Fernandes" , "Nithya 
> Balachandran" ,
> "Mohammed Rafi K C" 
> Sent: Friday, January 22, 2016 10:48:15 PM
> Subject: Re: [Gluster-devel] Reverse brick order in tier volume- Why?
> 
> 
> 
> On 01/22/2016 03:48 PM, Ravishankar N wrote:
> > On 01/19/2016 06:44 PM, Ravishankar N wrote:
> >>
> >> 1) Is there is a compelling reason as to why the bricks of hot-tier
> >> are in the reverse order ?
> >> 2) If there isn't one, should we spend time to fix it so that the
> >> bricks appear in the order in which they were given at the time of
> >> volume creaction/ attach-tier *OR*  just continue with the way things
> >> are currently because it is not that much of an issue?
> > Dan / Joseph - any pointers?

This order is an artifact of how the volume is created using legacy code and 
data structures in glusterd-volgen.c. Two volume graphs are built (one for the 
hot tier and one for the cold tier) and then combined into a single list. As 
far as I know, nobody has run into trouble with this. Refactoring the code to 
ease maintainability would be fine.


> +Nitya, Rafi as well.
> > -Ravi
> >
> >
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core on Linux

2016-03-04 Thread Dan Lambright


- Original Message -
> From: "Shyam" 
> To: "Krutika Dhananjay" , "Gluster Devel" 
> , "Rafi Kavungal Chundattu
> Parambil" , "Nithya Balachandran" , 
> "Joseph Fernandes"
> , "Dan Lambright" 
> Cc: "gluster-infra" 
> Sent: Friday, March 4, 2016 9:45:17 AM
> Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core 
> on Linux
> 
> Facing the same problem in the following runs as well,
> 
> 1)
> https://build.gluster.org/job/rackspace-regression-2GB-triggered/18767/console
> 2) https://build.gluster.org/job/regression-test-burn-in/546/console
> 3) https://build.gluster.org/job/regression-test-burn-in/547/console
> 4) https://build.gluster.org/job/regression-test-burn-in/549/console
> 
> Last successful burn-in was: 545 (but do not see the test having been
> run here, so this is inconclusive)
> 
> burn-in test 544 is hung on the same test here,
> https://build.gluster.org/job/regression-test-burn-in/544/console
> 
> (and at this point I am stopping the hunt for when this last succeeded :) )
> 
> Let's know if anyone is taking a peek at the cores.

hm. Not familiar with this test. Written by Pranith? I'll look.

> 
> Thanks,
> Shyam
> 
> 
> 
> On 03/04/2016 07:40 AM, Krutika Dhananjay wrote:
> > Could someone from tiering dev team please take a look?
> >
> > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18793/console
> >
> > -Krutika
> >
> >
> > ___
> > Gluster-infra mailing list
> > gluster-in...@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-infra
> >
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core on Linux

2016-03-05 Thread Dan Lambright


- Original Message -
> From: "Dan Lambright" 
> To: "Shyam" 
> Cc: "Krutika Dhananjay" , "Gluster Devel" 
> , "Rafi Kavungal Chundattu
> Parambil" , "Nithya Balachandran" , 
> "Joseph Fernandes"
> , "gluster-infra" 
> Sent: Friday, March 4, 2016 9:51:18 AM
> Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core 
> on Linux
> 
> 
> 
> - Original Message -
> > From: "Shyam" 
> > To: "Krutika Dhananjay" , "Gluster Devel"
> > , "Rafi Kavungal Chundattu
> > Parambil" , "Nithya Balachandran"
> > , "Joseph Fernandes"
> > , "Dan Lambright" 
> > Cc: "gluster-infra" 
> > Sent: Friday, March 4, 2016 9:45:17 AM
> > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping
> > core on Linux
> > 
> > Facing the same problem in the following runs as well,
> > 
> > 1)
> > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18767/console
> > 2) https://build.gluster.org/job/regression-test-burn-in/546/console
> > 3) https://build.gluster.org/job/regression-test-burn-in/547/console
> > 4) https://build.gluster.org/job/regression-test-burn-in/549/console
> > 
> > Last successful burn-in was: 545 (but do not see the test having been
> > run here, so this is inconclusive)
> > 
> > burn-in test 544 is hung on the same test here,
> > https://build.gluster.org/job/regression-test-burn-in/544/console
> > 
> > (and at this point I am stopping the hunt for when this last succeeded :) )
> > 
> > Let's know if anyone is taking a peek at the cores.
> 
> hm. Not familiar with this test. Written by Pranith? I'll look.

We are doing a lookup-everywhere, building up a dict of the extended attributes 
of a file as we traverse each subvolume across the hot and cold tiers. The 
length field of one of the EC keys is corrupted.

It is not clear why this is happening. I see no tiering relationship as of yet; 
it's possible the file is being demoted in parallel with the foreground script 
operation.

The test runs fine on my machines. Does this reproduce consistently on one of 
the Jenkins machines? If so, getting onto that machine would be the next step. 
I think that would be preferable to masking this test case.


> 
> > 
> > Thanks,
> > Shyam
> > 
> > 
> > 
> > On 03/04/2016 07:40 AM, Krutika Dhananjay wrote:
> > > Could someone from tiering dev team please take a look?
> > >
> > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/18793/console
> > >
> > > -Krutika
> > >
> > >
> > > ___
> > > Gluster-infra mailing list
> > > gluster-in...@gluster.org
> > > http://www.gluster.org/mailman/listinfo/gluster-infra
> > >
> > 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core on Linux

2016-03-08 Thread Dan Lambright


- Original Message -
> From: "Krutika Dhananjay" 
> To: "Pranith Karampuri" 
> Cc: "gluster-infra" , "Gluster Devel" 
> , "RHGS tiering mailing
> list" , "Dan Lambright" 
> Sent: Tuesday, March 8, 2016 12:15:20 AM
> Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core 
> on Linux
> 
> It has been failing rather frequently.
> Have reported a bug at https://bugzilla.redhat.com/show_bug.cgi?id=1315560
> For now, have moved it to bad tests here:
> http://review.gluster.org/#/c/13632/1
> 


Masking tests is a bad habit. It would be better to fix the problem, and it 
looks like a real bug.
The author of the test should help chase this down.

> -Krutika
> 
> On Mon, Mar 7, 2016 at 4:17 PM, Krutika Dhananjay 
> wrote:
> 
> > +Pranith
> >
> > -Krutika
> >
> >
> > On Sat, Mar 5, 2016 at 11:34 PM, Dan Lambright 
> > wrote:
> >
> >>
> >>
> >> - Original Message -
> >> > From: "Dan Lambright" 
> >> > To: "Shyam" 
> >> > Cc: "Krutika Dhananjay" , "Gluster Devel" <
> >> gluster-devel@gluster.org>, "Rafi Kavungal Chundattu
> >> > Parambil" , "Nithya Balachandran" <
> >> nbala...@redhat.com>, "Joseph Fernandes"
> >> > , "gluster-infra" 
> >> > Sent: Friday, March 4, 2016 9:51:18 AM
> >> > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t
> >> dumping core on Linux
> >> >
> >> >
> >> >
> >> > - Original Message -
> >> > > From: "Shyam" 
> >> > > To: "Krutika Dhananjay" , "Gluster Devel"
> >> > > , "Rafi Kavungal Chundattu
> >> > > Parambil" , "Nithya Balachandran"
> >> > > , "Joseph Fernandes"
> >> > > , "Dan Lambright" 
> >> > > Cc: "gluster-infra" 
> >> > > Sent: Friday, March 4, 2016 9:45:17 AM
> >> > > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t
> >> dumping
> >> > > core on Linux
> >> > >
> >> > > Facing the same problem in the following runs as well,
> >> > >
> >> > > 1)
> >> > >
> >> https://build.gluster.org/job/rackspace-regression-2GB-triggered/18767/console
> >> > > 2) https://build.gluster.org/job/regression-test-burn-in/546/console
> >> > > 3) https://build.gluster.org/job/regression-test-burn-in/547/console
> >> > > 4) https://build.gluster.org/job/regression-test-burn-in/549/console
> >> > >
> >> > > Last successful burn-in was: 545 (but do not see the test having been
> >> > > run here, so this is inconclusive)
> >> > >
> >> > > burn-in test 544 is hung on the same test here,
> >> > > https://build.gluster.org/job/regression-test-burn-in/544/console
> >> > >
> >> > > (and at this point I am stopping the hunt for when this last
> >> succeeded :) )
> >> > >
> >> > > Let's know if anyone is taking a peek at the cores.
> >> >
> >> > hm. Not familiar with this test. Written by Pranith? I'll look.
> >>
> >> We are doing lookup everywhere, and building up a dict of the extended
> >> attributes of a file as we traverse each sub volume across the hot and
> >> cold
> >> tiers. The length field of one of the EC keys is corrupted.
> >>
> >> Not clear why this is happening.. I see no tiering relationship as of
> >> yet, its possible the file is being demoted in parallel to the foreground
> >> script operation.
> >>
> >> The test runs fine on my machines.  Does this reproduce consistently on
> >> one of the Jenkins machines? If so, getting onto it would be the next
> >> step.
> >> I think that would be preferable to masking this test case.
> >>
> >>
> >> >
> >> > >
> >> > > Thanks,
> >> > > Shyam
> >> > >
> >> > >
> >> > >
> >> > > On 03/04/2016 07:40 AM, Krutika Dhananjay wrote:
> >> > > > Could someone from tiering dev team please take a look?
> >> > > >
> >> > > >
> >> https://build.gluster.org/job/rackspace-regression-2GB-triggered/18793/console
> >> > > >
> >> > > > -Krutika
> >> > > >
> >> > > >
> >> > > > ___
> >> > > > Gluster-infra mailing list
> >> > > > gluster-in...@gluster.org
> >> > > > http://www.gluster.org/mailman/listinfo/gluster-infra
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core on Linux

2016-03-08 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Krutika Dhananjay" , "Vijay Bellur" 
> 
> Cc: "Dan Lambright" , "Gluster Devel" 
> , "Pranith Karampuri"
> , "gluster-infra" , "RHGS 
> tiering mailing list"
> 
> Sent: Tuesday, March 8, 2016 8:37:14 AM
> Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core 
> on Linux
> 
> On Tue, Mar 08, 2016 at 07:00:07PM +0530, Krutika Dhananjay wrote:
> > I did talk to the author and he is going to look into the issue.
> > 3.7.9 is round the corner and we certainly don't want bad tests to block
> > patches that need to go in.
> 
> 3.7.9 has been delayed already. There is hardly a good reason for
> changes to delay the relase even more. 3.7.10 will be done in a few
> weeks at the end of March, anything non-regression should probably not
> be included at this point anymore.
> 
> Vijay is the release manager for 3.7.9 and .10, you'll need to come with
> extreme strong points for getting more patches included.

I've no desire to disrupt downstream releases, and if we are indeed out of time 
for 3.7.9, then that's life.
That said, obviously a test failure is a failure, and once it is masked it 
tends to be forgotten.
I suggest we do not mask this test upstream (does it happen there?) and give 
some urgency to chasing it down.


> 
> Niels
> 
> 
> > 
> > -Krutika
> > 
> > On Tue, Mar 8, 2016 at 6:50 PM, Dan Lambright  wrote:
> > 
> > >
> > >
> > > - Original Message -
> > > > From: "Krutika Dhananjay" 
> > > > To: "Pranith Karampuri" 
> > > > Cc: "gluster-infra" , "Gluster Devel" <
> > > gluster-devel@gluster.org>, "RHGS tiering mailing
> > > > list" , "Dan Lambright" 
> > > > Sent: Tuesday, March 8, 2016 12:15:20 AM
> > > > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t
> > > > dumping
> > > core on Linux
> > > >
> > > > It has been failing rather frequently.
> > > > Have reported a bug at
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1315560
> > > > For now, have moved it to bad tests here:
> > > > http://review.gluster.org/#/c/13632/1
> > > >
> > >
> > >
> > > Masking tests is a bad habit. It would be better to fix the problem, and
> > > it looks like a real bug.
> > > The author of the test should help chase this down.
> > >
> > > > -Krutika
> > > >
> > > > On Mon, Mar 7, 2016 at 4:17 PM, Krutika Dhananjay 
> > > > wrote:
> > > >
> > > > > +Pranith
> > > > >
> > > > > -Krutika
> > > > >
> > > > >
> > > > > On Sat, Mar 5, 2016 at 11:34 PM, Dan Lambright 
> > > > > wrote:
> > > > >
> > > > >>
> > > > >>
> > > > >> - Original Message -
> > > > >> > From: "Dan Lambright" 
> > > > >> > To: "Shyam" 
> > > > >> > Cc: "Krutika Dhananjay" , "Gluster Devel" <
> > > > >> gluster-devel@gluster.org>, "Rafi Kavungal Chundattu
> > > > >> > Parambil" , "Nithya Balachandran" <
> > > > >> nbala...@redhat.com>, "Joseph Fernandes"
> > > > >> > , "gluster-infra" 
> > > > >> > Sent: Friday, March 4, 2016 9:51:18 AM
> > > > >> > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t
> > > > >> dumping core on Linux
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > - Original Message -
> > > > >> > > From: "Shyam" 
> > > > >> > > To: "Krutika Dhananjay" , "Gluster Devel"
> > > > >> > > , "Rafi Kavungal Chundattu
> > > > >> > > Parambil" , "Nithya Balachandran"
> > > > >> > > , "Joseph Fernandes"
> > > > >> > > , "Dan Lambright" 
> > > > >> > > Cc: "gluster-infra" 
> > > > >> > > Sent: Friday, March 4, 2016 9:45:17 AM
> > > > >> > > Subject: Re: [Gluster-infra] tests/basic/ti

Re: [Gluster-devel] [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core on Linux

2016-03-08 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" 
> Cc: "Krutika Dhananjay" , "Vijay Bellur" 
> , "Gluster Devel"
> , "Pranith Karampuri" , 
> "gluster-infra" ,
> "RHGS tiering mailing list" 
> Sent: Tuesday, March 8, 2016 12:36:58 PM
> Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping core 
> on Linux
> 
> On Tue, Mar 08, 2016 at 08:45:12AM -0500, Dan Lambright wrote:
> > 
> > 
> > - Original Message -
> > > From: "Niels de Vos" 
> > > To: "Krutika Dhananjay" , "Vijay Bellur"
> > > 
> > > Cc: "Dan Lambright" , "Gluster Devel"
> > > , "Pranith Karampuri"
> > > , "gluster-infra" , "RHGS
> > > tiering mailing list"
> > > 
> > > Sent: Tuesday, March 8, 2016 8:37:14 AM
> > > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t dumping
> > > core on Linux
> > > 
> > > On Tue, Mar 08, 2016 at 07:00:07PM +0530, Krutika Dhananjay wrote:
> > > > I did talk to the author and he is going to look into the issue.
> > > > 3.7.9 is round the corner and we certainly don't want bad tests to
> > > > block
> > > > patches that need to go in.
> > > 
> > > 3.7.9 has been delayed already. There is hardly a good reason for
> > > changes to delay the relase even more. 3.7.10 will be done in a few
> > > weeks at the end of March, anything non-regression should probably not
> > > be included at this point anymore.
> > > 
> > > Vijay is the release manager for 3.7.9 and .10, you'll need to come with
> > > extreme strong points for getting more patches included.
> > 
> > I've no desire to disrupt downstream releases and if we are indeed out
> > of time for 3.7.9, then thats life.
> 
> 3.7.x releases are planned for each 30th of the month, 3.7.9 was
> supposed to be tagged for relaese the end of February. We are trying to
> get back to this schdule, and are failing hard :-(
> 
> Upstream has a very strict schedule (that we fail to follow because the
> release managers are occupied elsewhere), downstream (RHGS) should be
> aware of that as is it documented on
> https://www.gluster.org/community/release-schedule/ . Please pass that
> link on to anyone that expects changes to get merged in certain upstream
> releases. In general a release needs a few days of preparation, so
> patches for a release need to be ready at least ~5 working days in
> advance.
> 
> > That said, obviously a test failure is a failure, and once its masked
> > it tends to be forgotton.
> 
> When a test is marked as bad_test, we require a bug to be filed for that
> test. It should not be forgotten, the component maintainers are expected
> to track (and eventually fix) the open bugs for their components.
> 
> > I suggest we do not mask this test upstream (does it happen there?)
> > and give some urgency to chasing it down.
> 
> I assumed this is all for upstream. Not sure why an internal Red Hat
> list (rhgs-tiering) was put on CC.

-1

> 
> HTH,
> Niels
> 
> 
> > 
> > 
> > > 
> > > Niels
> > > 
> > > 
> > > > 
> > > > -Krutika
> > > > 
> > > > On Tue, Mar 8, 2016 at 6:50 PM, Dan Lambright 
> > > > wrote:
> > > > 
> > > > >
> > > > >
> > > > > - Original Message -
> > > > > > From: "Krutika Dhananjay" 
> > > > > > To: "Pranith Karampuri" 
> > > > > > Cc: "gluster-infra" , "Gluster Devel" <
> > > > > gluster-devel@gluster.org>, "RHGS tiering mailing
> > > > > > list" , "Dan Lambright"
> > > > > > 
> > > > > > Sent: Tuesday, March 8, 2016 12:15:20 AM
> > > > > > Subject: Re: [Gluster-infra] tests/basic/tier/tier-file-create.t
> > > > > > dumping
> > > > > core on Linux
> > > > > >
> > > > > > It has been failing rather frequently.
> > > > > > Have reported a bug at
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1315560
> > > > > > For now, have moved it to bad tests here:
> > > > > > http://review.gluster.org/#/c/13632/1
> > > > > >
> > > > >
> > > > >

Re: [Gluster-devel] 3.7.9 update

2016-03-15 Thread Dan Lambright


- Original Message -
> From: "Vijay Bellur" 
> To: "Gluster Devel" , "Niels de Vos" 
> , "Raghavendra Bhat"
> , "Dan Lambright" , "Nithya 
> Balachandran" 
> Sent: Sunday, March 13, 2016 10:50:42 PM
> Subject: 3.7.9 update
> 
> Hey All,
> 
> I have been running tests with the latest HEAD of release-3.7  on a 2x2
> distributed replicated volume. Here are some updates:
> 
> - Write Performance has seen an improvement as seen by running
> perf-test.sh [1]
> 
.
.
.
> 
> - Tiering has seen a lot of patches in 3.7.9. Dan, Nithya - can you
> please assist in preparation of release notes by summarizing the changes
> and providing inputs on the general readiness of tiering?

Here is what we put together:

https://docs.google.com/document/d/17nQXG0oradZ769Sw94n3CO39imBefruXzP8zZLJJgzM/edit?usp=sharing


> 
> Thanks,
> Vijay
> 
> [1] https://github.com/avati/perf-test/blob/master/perf-test.sh
> 
> [2] http://review.gluster.org/#/c/13689/
> 
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] "Samba and NFS-Ganesha support for tiered volumes" is at risk for 3.8

2016-04-01 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" , "Joseph Fernandes" 
> 
> Cc: gluster-devel@gluster.org
> Sent: Friday, April 1, 2016 4:09:11 AM
> Subject: "Samba and NFS-Ganesha support for tiered volumes" is at risk for 3.8
> 
> Hi,
> 
> the feature labelled "Samba and NFS-Ganesha support for tiered volumes"
> did not receive any status updates by pull request to the 3.8 roadmap.
> We have now moved this feature to the new "at risk" category on the
> page. If there is still an intention to include this feature with 3.8,
> we encourage you send an update for the roadmap soon. This can easily be
> done by clicking the "edit this page" link on the bottom of the roadmap:
> 
>   https://www.gluster.org/community/roadmap/3.8/
> 
> If there is no update within a week, we'll move the feature to the next
> release.

I do not foresee completion within the 3.8 timeframe. There is work underway, 
but it will take time.

> 
> Thanks,
> Jiffin and Niels
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] "Tiering Performance Enhancements" is at risk for 3.8

2016-04-01 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" , "Joseph Fernandes" 
> 
> Cc: gluster-devel@gluster.org
> Sent: Friday, April 1, 2016 4:08:25 AM
> Subject: "Tiering Performance Enhancements" is at risk for 3.8
> 
> Hi,
> 
> the feature labelled "Tiering Performance Enhancements" did not receive
> any status updates by pull request to the 3.8 roadmap. We have now moved
> this feature to the new "at risk" category on the page. If there is
> still an intention to include this feature with 3.8, we encourage you
> send an update for the roadmap soon. This can easily be done by clicking
> the "edit this page" link on the bottom of the roadmap:
> 
>   https://www.gluster.org/community/roadmap/3.8/
> 
> If there is no update within a week, we'll move the feature to the next
> release.

This is related to using EC (disperse) as a cold tier; we can rename the 
feature to clarify that.

One fix from Pranith helps; I'll confirm with him that it will be in 3.8.

Another fix from me is waiting on results from Manoj.

Once I hear back, I'll update the page.

> 
> Thanks,
> Jiffin and Niels
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] State of the 4.0 World

2016-06-10 Thread Dan Lambright


- Original Message -
> From: "Jeff Darcy" 
> To: "Gluster Devel" 
> Sent: Tuesday, May 3, 2016 11:50:30 AM
> Subject: [Gluster-devel] State of the 4.0 World
> 
> One of my recurring action items at community meetings is to report to
> the list on how 4.0 is going.  So, here we go.
> 
> The executive summary is that 4.0 is on life support.  Many features
> were proposed - some quite ambitious.  Many of those *never* had anyone
> available to work on them.  Of those that did, many have either been
> pulled forward into 3.8 (which is great) or lost what resources they had
> (which is bad).  Downstream priorities have been the biggest cause of
> those resource losses, though other factors such as attrition have also
> played a part.  Net result is that, with the singular exception of
> GlusterD 2.0, progress on 4.0 has all but stopped.  I'll provide more
> details below.  Meanwhile, I'd like to issue a bit of a call to action
> here, in two parts.
> 
>  * Many of the 4.0 sub-projects are still unstaffed.  Some of them are
>in areas of code where our combined expertise is thin.  For example,
>"glusterfsd" is where we need to make many brick- and
>daemon-management changes for 4.0, but it has no specific maintainer
>other than the project architects so nobody touches it.  Over the
>past year it has been touched by fewer than two patches per month,
>mostly side effects of patches which were primarily focused elsewhere
>(less than 400 lines changed).  It can be challenging to dive into
>such a "fallow" area, but it can also be an opportunity to make a big
>difference, show off one's skill, and not have to worry much about
>conflicts with other developers' changes.  Taking on projects like
>these is how people get from contributing to leading (FWIW it's how I
>did), so I encourage people to make the leap.
> 
>  * I've been told that some people have asked how 4.0 is going to affect
>existing components for which they are responsible.  Please note that
>only two components are being replaced - GlusterD and DHT.  The DHT2
>changes are going to affect storage/posix a lot, so that *might* be
>considered a third replacement.  JBR (formerly NSR) is *not* going to
>replace AFR or EC any time soon.  In fact, I'm making significant
>efforts to create common infrastructure that will also support
>running AFR/EC on the server side, with many potential benefits to
>them and their developers.  However, just about every other component
>is going to be affected to some degree, if only to use the 4.0
>CLI/volgen plugin interfaces instead of being hard-coded into their
>current equivalents.  4.0 tests are also expected to be based on
>Distaf rather than TAP (the .t infrastructure) so there's a lot of
>catch-up to be done there.  In other cases there are deeper issues to
>be resolved, and many of those discussions - e.g. regarding quota or
>georep - have already been ongoing.  There will eventually be a
>Gluster 4.0, even if it happens after I'm retired and looks nothing
>like what I describe below.  If you're responsible for any part of
>GlusterFS, you're also responsible for understanding how 4.0 will
>affect that part.
> 
> With all that said, I'm going to give item-by-item details of where we
> stand.  I'll use
> 
> http://www.gluster.org/community/documentation/index.php/Planning40
> 
> as a starting point, even though (as you'll see) in some ways it's out
> of date.
> 
> * GlusterD 2 is still making good progress, under Atin's and Kaushal's
>leadership.  There are designs for most of the important pieces, and
>a significant amount of code which we should be able to demo soon.
> 
>  * DHT2 had been making good progress for a while, but has been stalled
>recently as its lead developer (Shyam) has been unavailable.
>Hopefully we'll get him back soon, and progress will accelerate
>again.

DHT-2 will consolidate metadata on a server. This has the potential to help 
gluster's tiering implementation significantly, as it will no longer need to 
replicate directories on both the hot and cold tiers. After chatting with 
Shyam, there appear to be three work items related to tiering and DHT-2.

1. An unmodified tiering translator "should" work with DHT-2. But to realize 
DHT-2's benefits, the tiering translator would need to be modified so that 
metadata-related FOPs are directed only to the tier on which the metadata 
resides.

2. "Metadata" refers to directories but, per my understanding, could possibly 
include the file's inode as well. This is a design choice; whether or not to 
include the inode on the metadata server is a technical investigation to 
undertake.

3. The tier's database is currently SQLite, but it has been understood from day 
one that we may wish to move to a different database or algorithm. RocksDB is 
one attractive alternative. It is used in Ceph and 
gluste

Re: [Gluster-devel] GlusterFS and the logging framework

2014-04-30 Thread Dan Lambright
Hello,

In a previous job, an engineer in our storage group modified our I/O stack logs 
in a manner similar to your proposal #1 (except he did not tell anyone, and did 
it for DEBUG messages as well as ERRORS and WARNINGS, over the weekend). 
Developers came to work Monday and found over a thousand log message strings 
had been buried in a new header file, and any new logs required a new message 
id, along with a new string entry in the header file. 

This did render the code harder to read. The ensuing uproar closely mirrored 
the arguments (1) and (2) you listed. Logs are like comments: if you move them 
out of the source, the code is harder to follow. And you probably want fewer 
message IDs than comments.

The developer retracted his work. After some debate, his V2 solution resembled 
your "approach #2". Developers were once again free to use plain text strings 
directly in logs, but the notion of "classes" (message ID) was kept. We allowed 
multiple text strings to be used against a single class, and any new classes 
went in a master header file. The "debug" message ID class was a general 
purpose bucket and what most coders used day to day. 

So basically, your email sounded very familiar to me and I think your proposal 
#2 is on the right track.

Dan

- Original Message -
From: "Nithya Balachandran" 
To: gluster-devel@gluster.org
Cc: "gluster-users" 
Sent: Wednesday, April 30, 2014 3:06:26 AM
Subject: [Gluster-devel] GlusterFS and the logging framework

Hi,

I have attached some DHT files to demonstrate the 2 logging approaches (*_1 is 
the original approach, *_2 is the proposed approach). I personally think the 
2nd approach leads to better code readability and propose that we follow 
approach 2. Please let me know of any concerns with this.


To consolidate all the points raised in the earlier discussions:


What are we trying to solve?
Improving gluster logs to make end user debugging easier by providing a 
sufficient information and a consistent logging mechanism and message format .

The new logging framework already logs the function name and line, msgid and 
strerror, which improves the log messages and debug-ability. However, there are 
some potential issues with the way it is getting used. Please note - there are 
no changes being proposed to the underlying logging framework.


Current approach (approach 1):

Define message_ids for each log message (except Trace and Debug) and associate 
both id and string with a msg_id macro
Replace all calls to gf_log with gf_msg passing in the message_id for the 
message. This message_id will be printed as part of the log message.
Document each log string with details of what caused it/how to fix it.



Issues:
1. Code readability - It becomes difficult to figure out what the following is 
actually printing and can cause issues with incorrect params being passed or 
params being passed in the wrong order:
gf_msg ("dht", GF_LOG_ERROR, 0, dht_msg_23, param1, param2, param3);

2. Code Redundancy - multiple messages for the same thing differing in small 
details can potentially use up a large chunk of allocated ids as well as making 
it difficult for end users - they will need to search for multiple string 
formats/msgids as they could all refer to more or less the same thing. For 
example:

dht_msg_1   123, "Failed to get cached subvol for %s"
dht_msg_2   124, "Failed to get cached subvol for %s on %s"



3. Documentation redundancy -

The proposed format for documenting these messages is as follows:

Msg ID
Message format string
Cause
Recommended action

This could potentially lead to documentation like:

Msg ID : 123
Message format string : Failed to get cached subvol for 
Cause : The subvolume might not be reachable etc etc
Recommended action : Check network connection  etc etc

Msg ID : 124
Message format string : Failed to get cached subvol for  on 
Cause : The subvolume might not be reachable etc etc
Recommended action : Check network connection  etc etc

The end user now has to search for multiple msgids and string formats to find 
all instances of this error.

NOTE: It may be possible to consolidate all these strings into a single one, 
say, "Failed to get cached subvol for %s on %s" and mandate that it be used in 
all calls which are currently using variations of the string. However, this 
might not be possible in all scenarios - some params might not be available or 
might not be meaningful in a particular case or a developer might want to 
provide additional info in a particular scenario.



Proposed approach (approach 2):
Define meaningful macros for message_ids for a class of message (except Trace 
and Debug) without associating them to a message string. For example
#define DHT_CACHED_SUBVOL_GET_FAILED 123
#define DHT_MEM_ALLOC_FAILED 124


Replace all calls to gf_log with gf_msg but pass in the msg id and string 
separately. The string is defined by the developer based on an agreed upon 
format.

Define a log message format polic

Re: [Gluster-devel] Data classification proposal

2014-06-23 Thread Dan Lambright
A frustrating aspect of Linux is the complexity of /etc configuration file 
formats (rsyslog.conf, logrotate, cron, yum repo files, etc.). In that spirit, 
I would simplify the "select" in the data classification proposal (copied 
below) to accept only a list of bricks/sub-tiers with wildcards ('*'), rather 
than full-blown regular expressions or key/value pairs. I would drop the 
"unclaimed" keyword and not have the keywords "media-type" and "rack". It does 
not seem necessary to introduce new keys for the underlying block device type 
(SSD vs. disk) any more than we need to express the filesystem (XFS vs. ext4). 
In other words, I think tiering can be fully expressed in the configuration 
file while still abstracting the underlying storage. That said, the 
configuration file could be built up by a CLI or GUI, and richer 
expressibility could exist at that level.

example:

brick host1:/brick ssd-group0-1
brick host2:/brick ssd-group0-2
brick host3:/brick disk-group0-1

rule tier-1
  select ssd-group0*

rule tier-2
  select disk-group0

rule all
  select tier-1
  # use repeated "select" to establish order
  select tier-2
  type features/tiering
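
As a rough illustration of how little machinery the simplified "select" needs
(a sketch with the made-up brick names from the example above, not gluster
code): plain shell-style wildcards can be matched with fnmatch(3), with no
regular-expression engine required.

#include <fnmatch.h>
#include <stdio.h>

int main(void)
{
    const char *bricks[] = { "ssd-group0-1", "ssd-group0-2", "disk-group0-1" };
    const char *pattern  = "ssd-group0*";   /* the "select" line of rule tier-1 */

    /* Print every brick that the tier-1 rule would claim. */
    for (int i = 0; i < 3; i++)
        if (fnmatch(pattern, bricks[i], 0) == 0)
            printf("tier-1 claims brick %s\n", bricks[i]);

    return 0;
}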

The filtering option's regular expressions seem hard to avoid. Even if the name 
of the file alone satisfies most use cases (that we know of), I do not see a 
way to avoid regular expressions in the filter option. (Down the road, if we 
were to allow complete flexibility in how files can be distributed across 
subvolumes, the filtering problem may start to look like 90s-era packet 
classification, with a solution along the lines of the Berkeley Packet Filter.)

There may be different rules by which data is distributed at the "tiering" 
level. For example, one tiering policy could treat the fast tier (listed first) 
as a "cache" for the slow tier (listed second). I think the "option" keyword 
could handle that.

rule all
  select tier-1
  # use repeated "select" to establish order
  select tier-2
  type features/tiering
  option tier-cache, mode=writeback, dirty-watermark=80

Another example tiering policy could be based on compliance; when a file needs 
to become read-only, it moves from the first listed tier to the second.

rule all
  select tier-1
  # use repeated "select" to establish order
  select tier-2
  type features/tiering
  option tier-retention

- Original Message -
From: "Jeff Darcy" 
To: "Gluster Devel" 
Sent: Friday, May 23, 2014 3:30:39 PM
Subject: [Gluster-devel] Data classification proposal

One of the things holding up our data classification efforts (which include 
tiering but also other stuff as well) has been the extension of the same 
conceptual model from the I/O path to the configuration subsystem and 
ultimately to the user experience.  How does an administrator define a tiering 
policy without tearing their hair out?  How does s/he define a mixed 
replication/erasure-coding setup without wanting to rip *our* hair out?  The 
included Markdown document attempts to remedy this by proposing one out of many 
possible models and user interfaces.  It includes examples for some of the most 
common use cases, including the "replica 2.5" case we'e been discussing 
recently.  Constructive feedback would be greatly appreciated.



# Data Classification Interface

The data classification feature is extremely flexible, to cover use cases from
SSD/disk tiering to rack-aware placement to security or other policies.  With
this flexibility comes complexity.  While this complexity does not affect the
I/O path much, it does affect both the volume-configuration subsystem and the
user interface to set placement policies.  This document describes one possible
model and user interface.

The model we used is based on two kinds of information: brick descriptions and
aggregation rules.  Both are contained in a configuration file (format TBD)
which can be associated with a volume using a volume option.

## Brick Descriptions

A brick is described by a series of simple key/value pairs.  Predefined keys
include:

 * **media-type**  
   The underlying media type for the brick.  In its simplest form this might
   just be *ssd* or *disk*.  More sophisticated users might use something like
   *15krpm* to represent a faster disk, or *perc-raid5* to represent a brick
   backed by a RAID controller.

 * **rack** (and/or **row**)  
   The physical location of the brick.  Some policy rules might be set up to
   spread data across more than one rack.

User-defined keys are also allowed.  For example, some users might use a
*tenant* or *security-level* tag as the basis for their placement policy.

## Aggregation Rules

Aggregation rules are used to define how bricks should be combined into
subvolumes, and those potentially combined into higher-level subvolumes, and so
on until all of the bricks are accounted for.  Each aggregation rule consists
of t

Re: [Gluster-devel] Data classification proposal

2014-06-23 Thread Dan Lambright
Rather than using the keyword "unclaimed", my instinct was to explicitly list 
which bricks have not been "claimed". Perhaps you have something more subtle in 
mind; it is not apparent to me from your response. Can you provide an example 
of why the keyword is necessary and a list could not be provided in its place? 
If the list is somehow "difficult to figure out", due to a particularly complex 
setup or some such, I'd prefer that a CLI/GUI build that list rather than 
having sysadmins hand-edit this file.

The key-value piece seems like syntactic sugar - an "alias". If so, let the 
name itself be the alias. No notions of SSD or physical location need be 
inserted. Unless I am missing that it *is* necessary, I stand by that value 
judgement as a philosophy of not putting anything into the configuration file 
that you don't require. Can you provide an example of where it is necessary?

As to your point on filtering (which files go into which tier/group): I wrote a 
little further down in the email that I do not see a way around regular 
expressions within the filter-condition keyword. My understanding of your 
proposal is that the select statement did not do file-name filtering; the 
"filter-condition" option did. I'm OK with that.

As far as the "user stories" idea goes, that seems like a good next step.

- Original Message -
From: "Jeff Darcy" 
To: "Dan Lambright" 
Cc: "Gluster Devel" 
Sent: Monday, June 23, 2014 5:24:14 PM
Subject: Re: [Gluster-devel] Data classification proposal

> A frustrating aspect of Linux is the complexity of /etc configuration file's
> formats (rsyslog.conf, logrotate, cron, yum repo files, etc) In that spirit
> I would simplify the "select" in the data classification proposal (copied
> below) to only accept a list of bricks/sub-tiers with wild-cards '*', rather
> than full-blown regular expressions or key/value pairs.

Then how does *the user* specify which files should go into which tier/group?
If we don't let them specify that in configuration, then it can only be done
in code and we've taken a choice away from them.

> I would drop the
> "unclaimed" keyword

Then how do you specify any kind of default rule for files not matched
elsewhere?  If certain files can be placed only in certain locations due to
security or compliance considerations, how would they specify the location(s)
for files not subject to any such limitation?

> and not have keywords "media type", and "rack". It does
> not seem necessary to introduce new keys for the underlying block device
> type (SSD vs disk) any more than we need to express the filesystem (XFS vs
> ext4).

The idea is to let users specify whatever criteria matter *to them*; media
type and rack/row are just examples to get them started.

> In other words, I think tiering can be fully expressed in the
> configuration file while still abstracting the underlying storage.

Yes, *tiering* can be expressed using a simpler syntax.  I was trying for
something that could also support placement policies other than strict
linear "above" vs. "below" with only the migration policies we've written
into code.

> That
> said, the configuration file could be built up by a CLI or GUI, and richer
> expressibility could exist at that level.
> 
> example:
> 
> brick host1:/brick ssd-group0-1
> 
> brick host2:/brick ssd-group0-2
> 
> brick host3:/brick disk-group0-1
> 
> rule tier-1
>   select ssd-group0*
> 
> rule tier-2
>   select disk-group0
> 
> rule all
>   select tier-1
>   # use repeated "select" to establish order
>   select tier-2
>   type features/tiering
> 
> The filtering option's regular expressions seem hard to avoid. If just the
> name of the file satisfies most use cases (that we know of?) I do not think
> there is any way to avoid regular expressions in the option for filters.
> (Down the road, if we were to allow complete flexibility in how files can be
> distributed across subvolumes, the filtering problems may start to look
> similar to 90s-era packet classification with a solution along the lines of
> the Berkeley packet filter.)
> 
> There may be different rules by which data is distributed at the "tiering"
> level. For example, one tiering policy could be the fast tier (first
> listed). It would be a "cache" for the slow tier (second listed). I think
> the "option" keyword could handle that.
> 
> rule all
>   select tier-1
># use repeated "select" to establish order
>   select tier-2
>   type features/tiering
>   option tier-cache, mode=writeback, dirty-watermark=80
> 
> Another example 

Re: [Gluster-devel] Data classification proposal

2014-06-24 Thread Dan Lambright

It's possible to express your example using lists if their entries are allowed 
to overlap. I see that you wanted a way to express a matrix (overlapping rules) 
with gluster's tree-like syntax as a backdrop.

A polytree may be a better term than matrix (a DAG); i.e., when there are 
overlaps, a node in the graph gets multiple in-arcs.

Syntax aside, we seem to part on "where" to solve the problem: the config file 
or the UX. I prefer that the UX have the logic to build the configuration file, 
given how complex it can be. My preference would be for the config file to be 
mostly "read only" with extremely simple syntax.

I'll put some more thought into this; I believe this discussion has illuminated 
some good points.

Brick: host1:/SSD1  SSD1
Brick: host1:/SSD2  SSD2
Brick: host2:/SSD3  SSD3
Brick: host2:/SSD4  SSD4
Brick: host1:/DISK1 DISK1

rule rack4: 
  select SSD1, SSD2, DISK1

# some files should go on ssds in rack 4
rule A: 
  option filter-condition *.lock
  select SSD1, SSD2

# some files should go on ssds anywhere
rule B: 
  option filter-condition *.out
  select SSD1, SSD2, SSD3, SSD4

# some files should go anywhere in rack 4
rule C 
  option filter-condition *.c
  select rack4

# some files we just don't care
rule D
  option filter-condition *.h
  select SSD1, SSD2, SSD3, SSD4, DISK1

volume:
  option filter-condition A,B,C,D

- Original Message -
From: "Jeff Darcy" 
To: "Dan Lambright" 
Cc: "Gluster Devel" 
Sent: Monday, June 23, 2014 7:11:44 PM
Subject: Re: [Gluster-devel] Data classification proposal

> Rather than using the keyword "unclaimed", my instinct was to
> explicitly list which bricks have not been "claimed".  Perhaps you
> have something more subtle in mind, it is not apparent to me from your
> response. Can you provide an example of why it is necessary and a list
> could not be provided in its place? If the list is somehow "difficult
> to figure out", due to a particularly complex setup or some such, I'd
> prefer a CLI/GUI build that list rather than having sysadmins
> hand-edit this file.

It's not *difficult* to make sure every brick has been enumerated by
some rule, and that there are no overlaps, but it's certainly tedious
and error prone.  Imagine that a user has bricks in four
machines, using names like serv1-b1, serv1-b2, ..., serv4-b6.
Accordingly, they've set up rules to put serv1* into one set and
serv[234]* into another set (which is already more flexibility than I
think your proposal gave them).  Now when they add serv5 they need an
extra step to add it to the tiering config, which wouldn't have been
necessary if we supported defaults.  What percentage of users would
forget that step at least once?  I don't know for sure, but I'd guess
it's pretty high.

Having a CLI or GUI create configs just means that we have to add
support for defaults there instead.  We'd still have to implement the
same logic, they'd still have to specify the same thing.  That just
seems like moving the problem around instead of solving it.

> The key-value piece seems like syntactic sugar - an "alias". If so,
> let the name itself be the alias. No notions of SSD or physical
> location need be inserted. Unless I am missing that it *is* necessary,
> I stand by that value judgement as a philosophy of not putting
> anything into the configuration file that you don't require. Can you
> provide an example of where it is necessary?

OK...
-


Brick: SSD1
Brick: SSD2
Brick: SSD3
Brick: SSD4
Brick: DISK1

rack4: SSD1, SSD2, DISK1

filter A : SSD1, SSD2

filter B : SSD1,SSD2, SSD3, SSD4

filter C: rack4

filter D: SSD1, SSD2, SSD3, SSD4, DISK1

meta-filter: filter A, filter B, filter C, filter D

  * some files should go on ssds in rack 4

  * some files should go on ssds anywhere

  * some files should go anywhere in rack 4

  * some files we just don't care

Notice how the rules *overlap*.  We can't support that if our syntax
only allows the user to express a list (or list of lists).  If the list
is ordered by type, we can't also support location-based rules.  If the
list is ordered by location, we lose type-based rules instead.   Brick
properties create a matrix, with an unknown number of dimensions (e.g.
security level, tenant ID, and so on as well as type and location).  The
logical way to represent such a space for rule-matching purposes is to
let users define however many dimensions (keys) as they want and as many
values for each dimension as they want.
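
As a rough illustration, here is a minimal sketch in which each brick carries 
arbitrary key/value properties and a rule constrains only the dimensions it 
cares about (all names and structures are hypothetical, not proposed syntax or 
code):

#include <stddef.h>
#include <string.h>

struct kv {
        const char *key;    /* e.g. "type", "rack", "tenant" */
        const char *value;  /* e.g. "ssd",  "4",    "acme"   */
};

struct brick {
        const char      *path;
        const struct kv *props;
        size_t           nprops;
};

/* A rule matches a brick if every key/value pair the rule names is present
 * on the brick; dimensions the rule does not mention are unconstrained. */
static int
rule_matches (const struct kv *rule, size_t nrule, const struct brick *b)
{
        size_t i, j;

        for (i = 0; i < nrule; i++) {
                int found = 0;

                for (j = 0; j < b->nprops; j++) {
                        if (strcmp (rule[i].key, b->props[j].key) == 0 &&
                            strcmp (rule[i].value, b->props[j].value) == 0) {
                                found = 1;
                                break;
                        }
                }
                if (!found)
                        return 0;
        }
        return 1;
}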

Whether the exact string "type" or "unclaimed" appears anywhere isn't
the issue.  What matters is that the *semantics* of assigning properties
to a brick have to be more sophisticated than just assigning each a
position in a list, and we need a syntax that supports those semantics.
Otherw

Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Dan Lambright
Implementing brick splitting using LVM would allow you to treat each logical 
volume (split) as an independent brick. Each split would have its own 
.glusterfs subdirectory. I think this would help with taking snapshots as well.

- Original Message -
From: "Shyamsundar Ranganathan" 
To: "Krishnan Parthasarathi" 
Cc: "Gluster Devel" 
Sent: Thursday, June 26, 2014 11:13:48 AM
Subject: Re: [Gluster-devel] Data classification proposal

> > > For the short-term, wouldn't it be OK to disallow adding bricks that
> > > is not a multiple of group-size?
> > 
> > In the *very* short term, yes.  However, I think that will quickly
> > become an issue for users who try to deploy erasure coding because those
> > group sizes will be quite large.  As soon as we implement tiering, our
> > very next task - perhaps even before tiering gets into a release -
> > should be to implement automatic brick splitting.  That will bring other
> > benefits as well, such as variable replication levels to handle the
> > sanlock case, or overlapping replica sets to spread a failed brick's
> > load over more peers.
> > 
> 
> OK. Do you have some initial ideas on how we could 'split' bricks? I ask this
> to see if I can work on splitting bricks while the data classification format
> is
> being ironed out.

I see split bricks as creating a logical space for the new aggregate that the 
brick belongs to. This may not need data movement etc. but just a logical 
branching at the root of the brick for its membership. Are there counter 
examples to this?

Unless this changes the weightage of the brick across its aggregates, for 
example size-based weightage for layout assignments, if we are considering 
schemes of that nature.

So I can see this as follows,

THE_Brick: /data/bricka

Belongs to: aggregate 1 and aggregate 2, so get the following structure beneath 
it,

/data/bricka/agg_1_ID/
/data/bricka/agg_2_ID/

Future splits of the bricks add more aggregate-ID parents (not stating where or 
what this ID is, but assume it is something that distinguishes aggregates), 
and I would expect the xlator to send requests into its aggregate parent and 
not the root.

One issue that I see with this is that if we wanted to snap an aggregate, we 
would snap the entire brick.
Another is how we distinguish the .glusterfs space across the aggregates.

Shyam
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Data classification proposal

2014-06-26 Thread Dan Lambright
I don't think brick splitting implemented by LVM would affect directory 
browsing any more than adding an additional brick would.

- Original Message -
From: "Justin Clift" 
To: "Dan Lambright" 
Cc: "Shyamsundar Ranganathan" , "Gluster Devel" 

Sent: Thursday, June 26, 2014 12:01:16 PM
Subject: Re: [Gluster-devel] Data classification proposal

On 26/06/2014, at 4:54 PM, Dan Lambright wrote:
> Implementing brick splitting using LVM would allow you to treat each logical 
> volume (split) as an independent brick. Each split would have its own 
> .glusterfs subdirectory. I think this would help with taking snapshots as 
> well.


Would brick splitting make directory browsing latency even scarier?

+ Justin

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.6 Feature Freeze - move to mid next week?

2014-07-04 Thread Dan Lambright


- Original Message -
> From: "Jeff Darcy" 
> To: "Vijay Bellur" 
> Cc: "Gluster Devel" 
> Sent: Friday, July 4, 2014 10:14:32 AM
> Subject: Re: [Gluster-devel] 3.6 Feature Freeze - move to mid next week?
> 
> > Given the holiday weekend in US, I feel that it would be appropriate
> > to move the 3.6 feature freeze date to mid next week so that we can
> > have more reviews done & address review comments too. We can still
> > continue to track other milestones as per our release schedule [1].
> > What do you folks think?
> 
> I think the answer depends on what we can expect to change between now
> and then.  Since the gluster.org feature page never got updated to
> reflect the real feature set for 3.6, I took the list from email sent
> after the planning meeting.
> 
>   * Better SSL
> Two out of three patches merged, one still in review.
> 
>   * Data Classification
> Design barely begun.
> 
>   * Heterogeneous Bricks
> Patch has CR+1 V+1 but still stalled in review.
> 
>   * Trash
> Ancient one is still there, probably doesn't even work.
> 
>   * Disperse
> Patches still in very active review.
> 
>   * Persistent AFR Changelog Xattributes
> Patches merged.
> 
>   * Better Peer Identification
> Patch still in review (fails verification).
> 
>   * Gluster Volume Snapshot
> Tons of patches merged, tons more still to come.
> 
>   * AFRv2
> Jammed in long ago.
> 
>   * Policy Based Split-Brain Resolver (PBSBR)
> No patches, feature page still says in design.
> 
>   * RDMA Improvements
> No patches, feature page says work in progress.
> 
>   * Server-side Barrier Feature
> Patches merged.
> 
> 
> That leaves us with a very short list of items that are likely to change
> state.
> 
>   * Better SSL
> 
>   * Heterogeneous Bricks
> 
>   * Disperse
> 
>   * Better Peer Identification
> 
> Of those, I think only disperse is likely to benefit from an extension.
> The others just need people to step up and finish reviewing them, which
> could happen today if there were sufficient will.  The real question is
> what to do about disperse.  Some might argue that it's already complete
> enough to go in, so long as its limitations are documented
> appropriately.  Others might argue that it's still months away from
> being usable (especially wrt performance).  In a way it doesn't matter,
> because either way a few days won't make a difference.  We just need to
> make a collective decision based on its current state (or close to it).
> If we need to wait a few days before people can come together for that,
> so be it.

The reliability and performance of the erasure code translator is probably not 
at a level where we could guarantee the feature is bug-free and "ready". 
However, the feature could be added to the gluster code base for people to begin 
to experiment with; as suggested, we would need to document its limitations. The 
idea is that the more hands get on the code, the more bugs are found, the more 
suggestions are made for improvements, deeper integration, etc.  I do not believe 
the erasure code translator is called "disperse" any longer.


> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Feature proposal - FS-Cache support in FUSE

2014-09-02 Thread Dan Lambright
Hi Vimal,

One suggestion is to consider making the cache read only. A read cache would 
give you many of the performance benefits, without having to deal with some of 
the more difficult problems.

* you would not need a background "flush" process to move dirty data from the 
cache to persistent store.
* no need to distinguish "dirty" data from "clean" data
* the read cache would not need to be persistent, so if power went out, you 
would be fine.
* the code could be written such that a write cache could exist in the future.

Another suggestion is to look at memcached.org, which is a distributed memory 
object caching system. I know of a company that uses it to implement their 
caching layer. They use it across multiple clients, and each client sees the 
same "view" of cached data.

Dan


- Original Message -
> From: "Vimal A R" 
> To: fuse-de...@lists.sourceforge.net, linux-cach...@redhat.com, 
> gluster-devel@gluster.org, nde...@redhat.com
> Sent: Monday, September 1, 2014 9:07:02 AM
> Subject: [Gluster-devel] Feature proposal - FS-Cache support in FUSE
> 
> Hello fuse-devel / fs-cache / gluster-devel lists,
> 
> I would like to propose the idea of implementing FS-Cache support in the fuse
> kernel module, which I am planning to do as part of my UG university course.
> This proposal is by no means final, since I have just started to look into
> this.
> 
> There are several user-space filesystems which are based on the FUSE kernel
> module. As of now, if I understand correct, the only networked filesystems
> having FS-Cache support are NFS and AFS.
> 
> Implementing support hooks for fs-cache in the fuse module would provide
> networked filesystems such as GlusterFS the benefit of  a client-side
> caching mechanism, which should decrease the access times.
> 
> When enabled, FS-Cache would maintain a virtual indexing tree to cache the
> data or object-types per network FS. Indices in the tree are used by
> FS-Cache to find objects faster. The tree or index structure under the main
> network FS index depends on the filesystem. Cookies are used to represent
> the indices, the pages etc..
> 
> The tree structure would be as following:
> 
> a) The virtual index tree maintained by fs-cache would look like:
> 
> * FS-Cache master index -> The network-filesystem indice (NFS/AFS etc..) ->
> per-share indices -> File-handle indices -> Page indices
> 
> b) In case of FUSE-based filesystems, the tree would be similar to :
> 
> * FS-Cache master index -> FUSE indice -> Per FS indices -> file-handle
> indices -> page indices.
> 
> c) In case of FUSE based filesystems as GlusterFS, the tree would as :
> 
> * FS-Cache master index -> FUSE indice (fuse.glusterfs) -> GlusterFS volume
> ID (a UUID exists for each volume) - > GlusterFS file-handle indices (based
> on the GFID of a file) -> page indices.
> 
> The idea is to enable FUSE to work with the FS-Cache network filesystem API,
> which is documented at
> 'https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/netfs-api.txt'.
> 
> The implementation of FS-Cache support in NFS can be taken as a guideline to
> understand and start off.
> 
> I will reply to this mail with any other updates that would come up whilst
> pursuing this further. I request any sort of feedback/suggestions, ideas,
> any pitfalls etc.. that can help in taking this further.
> 
> Thank you,
> 
> Vimal
> 
> References:
> *
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/fscache.txt
> *
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/netfs-api.txt
> *
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/object.txt
> * http://people.redhat.com/dhowells/fscache/FS-Cache.pdf
> * http://people.redhat.com/steved/fscache/docs/HOWTO.txt
> * https://en.wikipedia.org/wiki/CacheFS
> * https://lwn.net/Articles/160122/
> * http://www.linux-mag.com/id/7378/
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)

2014-09-05 Thread Dan Lambright
One reason to use C++ could be to build components that we wish to share with 
Ceph (not that I know of any at this time). Also, C++11 (formerly C++0x) has 
improved the language.
But the more I hear about it, the more interesting Go sounds...

- Original Message -
> From: "Jeff Darcy" 
> To: "Justin Clift" 
> Cc: "Gluster Devel" 
> Sent: Friday, September 5, 2014 11:44:35 AM
> Subject: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)
> 
> > Does this mean we'll need to learn Go as well as C and Python?
> 
> As KP points out, the fact that consul is written in Go doesn't mean our
> code needs to be ... unless we need to contribute code upstream e.g. to
> add new features.  Ditto for etcd also being written in Go, ZooKeeper
> being written in Java, and so on.  It's probably more of an issue that
> these all require integration into our build/test environments.  At
> least Go, unlike Java, doesn't require any new *run time* support.
> Python kind of sits in between - it does require runtime support, but
> it's much less resource-intensive and onerous than Java (no GC-tuning
> hell).  Between that and the fact that it's almost always present
> already, it just doesn't seem to provoke the same kind of allergic
> reaction that Java does.
> 
> However, this is as good a time as any to think about what languages
> we're going to use for the project going forward.  While there are many
> good reasons for our I/O path to remain in Plain Old C (yes I'm
> deliberately avoiding the C++ issue), many of those reasons apply only
> weakly to other parts of the code - not only management code, but also
> "offline" processes like self heal and rebalancing.  Some people might
> already be aware that I've used Python for the reconciliation component
> of NSR, for example, and that version is in almost every way better than
> the C version it replaces.  When we need to interface with code written
> in other languages, or even interact with communities where other
> languages are spoken more fluently than C, it's pretty natural to
> consider using those languages ourselves.  Let's look at some of the
> alternatives.
> 
>  * C++
>Code is highly compatible with C, programming styles and idioms less
>so.  Not prominent in most areas we care about.
> 
>  * Java
>The "old standard" for a lot of distributed systems - e.g.  the
>entire Hadoop universe, Cassandra, etc.  Also a great burden as
>discussed previously.
> 
>  * Go
>Definitely the "up and comer" in distributed systems, for which it
>was (partly) designed.  Easy for C programmers to pick up, and also
>popular among (former?) Python folks.  Light on resources and
>dependencies.
> 
>  * JavaScript
>Ubiquitous.  Common in HTTP-ish "microservice" situations, but not so
>much in true distributed systems.
> 
>  * Ruby
>Much like JavaScript as far as we're concerned, but less ubiquitous.
> 
>  * Erlang
>Functional, designed for highly reliable distributed systems,
>significant use in related areas (e.g. Riak).
> 
> Obviously, there are many more, but issues of compatibility and talent
> availability weigh heavier for most than for Erlang (which barely made
> the list as it is despite its strengths).  Of these, the ones without
> serious drawbacks are JavaScript and Go.  As popular as JS is in other
> specialties, I just don't feel any positive "pull" to use it in anything
> we do.  As a language it's notoriously loose about many things (e.g.
> equality comparisons) and prone to the same "callback hell" from which
> we already suffer.
> 
> Go is an entirely different story.  We're already bumping up against
> other projects that use it, and that's no surprise considering how
> strong the uptake has been among other systems programmers.
> Language-wise, goroutines might help get us out of callback hell, and it
> has other features such as channels and "defer" that might also support
> a more productive style for our own code.  I know that several in the
> group are already eager to give it a try.  While we shouldn't do so for
> the "cool factor" alone, for new code that's not in the I/O path the
> potential productivity benefits make it an option well worth exploring.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)

2014-09-07 Thread Dan Lambright

Digging deeper into Go, I see there is a fascinating discussion in the language 
communities comparing Go with C++.

Go has no..
- classes (no inheritance), though it has interfaces (sets of methods) which 
remind me of things like gluster's struct xlator_fops {}
- polymorphism 
- pointer arithmetic
- generic programming
- etc. 

Here is a comparison of C++ with Go from Rob Pike himself (a Go author). 

http://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html 

And here are a few counter arguments.

http://lambda-the-ultimate.org/node/4554

Preference for Go seems to come down to how deeply you prefer the C++ 
object-oriented way of doing things (as Pike calls it, the "type-centric" focus 
on classes). If that's your cup of tea, you may find Go a letdown or a step 
backwards. Pike implies that coders invest a lot of time to master those 
techniques and are reluctant to ditch those skills.

But if you are a C or Python programmer, you may see Go as a way to have your 
cake (a modern, stripped-down language with lists, maps, packages, interfaces, 
no #includes) and eat it too (it compiles to a binary, no VM).

As Gluster is not beholden in any way to legacy C++, Go seems like a great fit. 
I'm looking forward to giving it a spin :)

- Original Message -
> From: "Dan Lambright" 
> To: "Jeff Darcy" 
> Cc: "Justin Clift" , "Gluster Devel" 
> 
> Sent: Friday, September 5, 2014 5:32:05 PM
> Subject: Re: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)
> 
> One reason to use c++ could be to build components that we wish to share with
> ceph. (Not that I know of any at this time). Also c++0x11 has improved the
> language.
> But the more I hear about it, the more interesting go sounds..
> 
> - Original Message -
> > From: "Jeff Darcy" 
> > To: "Justin Clift" 
> > Cc: "Gluster Devel" 
> > Sent: Friday, September 5, 2014 11:44:35 AM
> > Subject: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)
> > 
> > > Does this mean we'll need to learn Go as well as C and Python?
> > 
> > As KP points out, the fact that consul is written in Go doesn't mean our
> > code needs to be ... unless we need to contribute code upstream e.g. to
> > add new features.  Ditto for etcd also being written in Go, ZooKeeper
> > being written in Java, and so on.  It's probably more of an issue that
> > these all require integration into our build/test environments.  At
> > least Go, unlike Java, doesn't require any new *run time* support.
> > Python kind of sits in between - it does require runtime support, but
> > it's much less resource-intensive and onerous than Java (no GC-tuning
> > hell).  Between that and the fact that it's almost always present
> > already, it just doesn't seem to provoke the same kind of allergic
> > reaction that Java does.
> > 
> > However, this is as good a time as any to think about what languages
> > we're going to use for the project going forward.  While there are many
> > good reasons for our I/O path to remain in Plain Old C (yes I'm
> > deliberately avoiding the C++ issue), many of those reasons apply only
> > weakly to other parts of the code - not only management code, but also
> > "offline" processes like self heal and rebalancing.  Some people might
> > already be aware that I've used Python for the reconciliation component
> > of NSR, for example, and that version is in almost every way better than
> > the C version it replaces.  When we need to interface with code written
> > in other languages, or even interact with communities where other
> > languages are spoken more fluently than C, it's pretty natural to
> > consider using those languages ourselves.  Let's look at some of the
> > alternatives.
> > 
> >  * C++
> >Code is highly compatible with C, programming styles and idioms less
> >so.  Not prominent in most areas we care about.
> > 
> >  * Java
> >The "old standard" for a lot of distributed systems - e.g.  the
> >entire Hadoop universe, Cassandra, etc.  Also a great burden as
> >discussed previously.
> > 
> >  * Go
> >Definitely the "up and comer" in distributed systems, for which it
> >was (partly) designed.  Easy for C programmers to pick up, and also
> >popular among (former?) Python folks.  Light on resources and
> >dependencies.
> > 
> >  * JavaScript
> >Ubiquitous.  Common in HTTP-ish "microservice" situations, but not so
> >much in true distributed sy

Re: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)

2014-09-08 Thread Dan Lambright
I could see Go used for background-type jobs or test harnessing in the 
beginning, at the discretion of the developer. The question about garbage 
collection is an unknown and a good point. To me, it makes sense to get 
experience with Go before using it in the I/O path, particularly as the 
language is new.

Apparently Go does have a "kind of" inheritance. It does *not* have virtual 
functions. Here is a nice blog post.

https://geekwentfreak-raviteja.rhcloud.com/blog/2014/03/06/golang-inheritance-by-embedding/

- Original Message -
> From: "Jeff Darcy" 
> To: "Krishnan Parthasarathi" 
> Cc: "Dan Lambright" , "Gluster Devel" 
> 
> Sent: Monday, September 8, 2014 8:14:07 AM
> Subject: Re: [Gluster-devel] Languages (was Re: Proposal for GlusterD-2.0)
> 
> > Two characteristics of a language (tool chain) are important to me,
> > especially
> > when you spend a good part of your time debugging failures/bugs.
> > 
> > - Analysing core files.
> > - Ability to reason about space consumption. This becomes important in
> >   the case of garbage collected languages.
> > 
> > I have written a few toy programs in Go and have been following the
> > language
> > lately. Some of its features like channels and go routines catch my
> > attention
> > as we are aspiring to build reactive and scalable services. Its lack of
> > type-inference
> > and inheritance worries me a little. But, I shouldn't be complaining when
> > our default choice has been C thus far ;)
> 
> If there's going to be complaining, now's the time.  Justin's kind of
> right that we don't want to be adding languages willy-nilly.  If there's
> something about a language which is likely to preclude its use in
> certain contexts (e.g. GC languages in the I/O path) or impair our
> long-term productivity, then that's important to realize.
> Unfortunately, the list of such drawbacks for C isn't exactly
> zero-length either.
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reviewers needed for ec xlator

2014-09-10 Thread Dan Lambright
Yes, I'll take a look.

- Original Message -
> From: "Xavier Hernandez" 
> To: gluster-devel@gluster.org
> Sent: Wednesday, September 10, 2014 4:21:17 AM
> Subject: [Gluster-devel] Reviewers needed for ec xlator
> 
> Hi,
> 
> does anyone have some time to review these ec patches ?
> 
> http://review.gluster.org/8368/ - Fix spurious crash
> http://review.gluster.org/8369/ - Improve performance
> http://review.gluster.org/8413/ - Remove Intel's SSE2 dependency
> http://review.gluster.org/8420/ - Fix spurious crash
> 
> Thank you very much,
> 
> Xavi
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] how do you debug ref leaks?

2014-09-18 Thread Dan Lambright
If we could disable/enable ref tracking dynamically, it may only be "heavy 
weight" temporarily, while the customer is being observed.
You could get a statedump, or another idea is to take a core of the live 
process:   gcore $(pidof processname) 
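
A minimal sketch, assuming a per-object array of per-fop counters along the 
lines proposed in the quoted thread below; the global flag, the array bound, 
and the function names are all hypothetical, not existing gluster 
infrastructure:

#include <stdint.h>

#define TRK_FOP_MAX 64  /* hypothetical upper bound on fop identifiers */

/* Conceptually attached to an inode/fd/dict; one counter per fop. */
struct ref_tracker {
        int64_t refs[TRK_FOP_MAX];
};

/* Global switch that could be flipped at runtime (e.g. via a volume
 * option), so the overhead is only paid while a problem is observed. */
static volatile int ref_tracking_enabled;

static inline void
tracked_ref (struct ref_tracker *trk, int fop)
{
        if (ref_tracking_enabled && fop >= 0 && fop < TRK_FOP_MAX)
                __sync_fetch_and_add (&trk->refs[fop], 1);
}

static inline void
tracked_unref (struct ref_tracker *trk, int fop)
{
        if (ref_tracking_enabled && fop >= 0 && fop < TRK_FOP_MAX)
                __sync_fetch_and_sub (&trk->refs[fop], 1);
}

/* A statedump of these counters would narrow a leak down to the
 * xlator/fop pairs whose counts never return to zero. */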

- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Shyam" , gluster-devel@gluster.org
> Sent: Thursday, September 18, 2014 11:34:28 AM
> Subject: Re: [Gluster-devel] how do you debug ref leaks?
> 
> 
> On 09/18/2014 07:48 PM, Shyam wrote:
> > On 09/17/2014 10:13 PM, Pranith Kumar Karampuri wrote:
> >> hi,
> >>  Till now the only method I used to find ref leaks effectively is to
> >> find what operation is causing ref leaks and read the code to find if
> >> there is a ref-leak somewhere. Valgrind doesn't solve this problem
> >> because it is reachable memory from inode-table etc. I am just wondering
> >> if there is an effective way anyone else knows of. Do you guys think we
> >> need a better mechanism of finding refleaks? At least which decreases
> >> the search space significantly i.e. xlator y, fop f etc? It would be
> >> better if we can come up with ways to integrate statedump and this infra
> >> just like we did for mem-accounting.
> >>
> >> One way I thought was to introduce new apis called
> >> xl_fop_dict/inode/fd_ref/unref (). Each xl keeps an array of num_fops
> >> per inode/dict/fd and increments/decrements accordingly. Dump this info
> >> on statedump.
> >>
> >> I myself am not completely sure about this idea. It requires all xlators
> >> to change.
> >>
> >> Any ideas?
> >
> > On a debug build we can use backtrace information stashed per ref and
> > unref, this will give us history of refs taken and released. Which
> > will also give the code path where ref was taken and released.
> >
> > It is heavy weight, so not for non-debug setups, but if a problem is
> > reproducible this could be a quick way to check who is not releasing
> > the ref's or have a history of the refs and unrefs to dig better into
> > code.
> >
> Do you have any ideas for final builds also? Basically when users report
> leaks it should not take us too long to figure out the problem area. We
> should just ask them for statedump and should be able to figure out the
> problem.
> 
> Pranith
> > Shyam
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] if/else coding style :-)

2014-10-13 Thread Dan Lambright
+1 on choosing a single brace style [1] for the entire codebase. 

[1]
http://en.wikipedia.org/wiki/Indent_style
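
For reference, a minimal illustration of the two placements being debated in 
the quoted thread below; the check-formatting script mentioned there flags the 
first form and expects the second:

/* one-line form */
if (x) {
        /* code */
} else {
        /* code */
}

/* two-line form preferred by the commit quoted below */
if (x) {
        /* code */
}
else {
        /* code */
}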

- Original Message -
> From: "Shyam" 
> To: "Pranith Kumar Karampuri" , gluster-devel@gluster.org
> Sent: Monday, October 13, 2014 10:13:38 AM
> Subject: Re: [Gluster-devel] if/else coding style :-)
> 
> On 10/13/2014 10:08 AM, Pranith Kumar Karampuri wrote:
> >
> > On 10/13/2014 07:27 PM, Shyam wrote:
> >> On 10/13/2014 08:01 AM, Pranith Kumar Karampuri wrote:
> >>> hi,
> >>>   Why are we moving away from this coding style?:
> >>> if (x) {
> >>> /*code*/
> >>> } else {
> >>> /* code */
> >>> }
> >>
> >> This patch (in master) introduces the same and explains why,
> >>
> >> commit 0a8371bdfdd88e662d09def717cc0b822feb64e8
> >> Author: Jeff Darcy 
> >> Date:   Mon Sep 29 17:27:14 2014 -0400
> >>
> >> extras: reverse test for '}' vs. following 'else' placement
> >>
> >> The two-line form "}\nelse {" has been more common than the one-line
> >> form "} else {" in our code for years, and IMO for good reason (see
> >> the comment in the diff).
> > Will there be any objections to allow the previous way of writing this
> > if/else block? I just don't want to get any errors in 'check-formatting'
> > when I write the old way for this.
> > May be we can change it to warning?
> 
> I am going to state my experience/expectation :)
> 
> I actually got this _error_ when submitting a patch, and thought to
> myself "isn't the one-line form the right one?" then went to see why
> this check was in place and read the above. Going by the reason in the
> patch, I just adapted myself.
> 
> Now, coming to _allowing_ both forms with a warning, my personal call is
> _no_, we should allow one form so that the code is readable and there is
> little to no confusion for others on which form to use. So I would say
> no to your proposal.
> 
> Shyam
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Gluster tiering feature

2014-10-17 Thread Dan Lambright

Joseph Fernandes, others, and I have been working on a tiering feature for 
gluster. We are in the prototype phase and recently demoed some code internally 
to Red Hat. In its first incarnation it resembles the server side cache tier 
Ceph has, or dm-cache in the kernel. So, fast storage and slow storage are 
exposed as a single volume, data that is frequently used makes its way to fast 
storage, and the system responds dynamically to changing usage. Because 
migration between tiers is time consuming, a cache tier is a good fit for 
workloads where the set of hot data is stable. A cache tier can be added or 
removed at run-time. The tiering logic is very general-purpose infrastructure, 
and can be used for elaborate data placement graphs or other data migration 
features.

The design is in early stages.. but current thinking can be found on the 
feature page and the links below, and all the code is on the forge.

http://www.gluster.org/community/documentation/index.php/Features/data-classification

goo.gl/bkU5qv

Thanks,
Dan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Gluster tiering feature

2014-10-20 Thread Dan Lambright


- Original Message -
> From: "Lalatendu Mohanty" 
> To: "Dan Lambright" , "Gluster Devel" 
> 
> Sent: Monday, October 20, 2014 11:28:28 AM
> Subject: Re: [Gluster-devel] Gluster tiering feature
> 
> On 10/18/2014 01:11 AM, Dan Lambright wrote:
> > Myself, Joseph Fernandez, and others have been working on a tiering feature
> > for gluster. We are in the prototype phase and demoed some code recently
> > internally to Red Hat. In its first incarnation it resembles the server
> > side cache tier Ceph has, or dm-cache in the kernel. So, fast storage and
> > slow storage are exposed as a single volume, data that is frequently used
> > makes its way to fast storage, and the system responds dynamically to
> > changing usage. Because migration between tiers is time consuming, a cache
> > tier is a good fit for workloads where the set of hot data is stable. A
> > cache tier can be added or removed at run-time. The tiering logic is very
> > general-purpose infrastructure, and can be used for elaborate data
> > placement graphs or other data migration features.
> >
> > The design is in early stages.. but current thinking can be found on the
> > feature page and the links below, and all the code is on the forge.
> >
> > http://www.gluster.org/community/documentation/index.php/Features/data-classification
> >
> > goo.gl/bkU5qv
> >
> > Thanks,
> > Dan
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
> Looks awesome!! Thanks for putting these docs.
> 
> However I have a few questions on the following.
> /"Current thinking is a snapshot cannot be made of a volume with a cache
> tier, however a cache may be "detached" from a volume and then a
> snapshot could be made//"./
> 
> As far as I can understand, detaching cache tier will take a good amount
> of time (of course it depends on the amount of data), so detaching the
> cache tier and then taking snapshot will be time consuming and might not
> be used much (from a user point of view) . However I am wondering if
> pausing cache is less expensive then detaching cache? Also does pause
> cache migrates all data from hot to cold subvol, or just from that time
> onwards?
> 
> Thanks,
> Lala


Yes, the thinking is that "pausing" the cache would demote all the data off the 
fast tier.

I think getting snap to work with this is a stretch goal, at least in the 
beginning. I'm worried about thorny issues that may happen. For example, 
suppose someone snapped a tiered volume, then detached the fast tier, then tried 
to restore the volume. 


> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] data-classification feature

2014-11-03 Thread Dan Lambright


- Original Message -
> From: "Rudra Siva" 
> To: gluster-devel@gluster.org
> Sent: Saturday, November 1, 2014 11:17:06 AM
> Subject: [Gluster-devel] data-classification feature
> 
> Is this something that is available in some branch? 

It's on the forge, but under development, so not ready to use.

> for example can I
> create a setup with a hot-tier of cache on n-bricks as rambricks and
> attach it to a cold-tier of m-bricks (n usually > m but much smaller
> in size, replication for n would be typically 2, m could be higher).
> This would probably improve reads that looks very good. 

that configuration is possible, sure.

> Can the write
> get optimized by going to hot-tier first and at-least 1 cold-tier?

Yes, that is how it will work.

> 
> Thanks
> -Siva
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Discussion: Implications on geo-replication due to bitrot and tiering changes!!!!

2014-12-11 Thread Dan Lambright
Looks good to me.
Thank you

- Original Message -
> From: "Kotresh Hiremath Ravishankar" 
> To: "Venky Shankar" 
> Cc: "Joseph Fernandes" , "Gluster Devel" 
> , "Vijay Bellur"
> , "Dan Lambright" , "Nagaprasad 
> Sathyanarayana" ,
> "Vivek Agarwal" 
> Sent: Thursday, December 11, 2014 3:44:18 PM
> Subject: Re: [Gluster-devel] Discussion: Implications on geo-replication due 
> to bitrot and tiering changes
> 
> Hi All,
> 
> As per the discussions within Data Tiering, BitRot and Geo-Rep team,
> following things are discussed.
> 
> 1. For Data Tiering to use changelog, in memory LRU/LFU implementation is
> required to capture reads
>as changelog journal doesn't capture reads. But given the commitments each
>team has on for themselves,
>it might not be possible to implement in memory LRU/LFU implementation by
>3.7 time line.
>As per current testing done by Tiering team, feeding Database in I/O path
>is not hitting noticeable
>performance as crash consistency is not expected. Hence for 3.7, logic for
>feeding database will be
>in changelog translator or new crt translator. When LRU/LFU implementation
>is available down the line,
>database can be fed from LRU/LFU in only changelog translator.
> 
> 2. Since BitRot or any other consumers might decide to use database, query
> and initialization APIs are
>exposed as a library.
> 
> Please add if I have missed anything or any corrections.
> 
> Thanks and Regards,
> Kotresh H R
> 
> - Original Message -
> From: "Venky Shankar" 
> To: "Joseph Fernandes" 
> Cc: "Gluster Devel" , "Kotresh Hiremath
> Ravishankar" , "Vijay Bellur" ,
> "Dan Lambright" , "Ben England" ,
> "Ric Wheeler" , "Nagaprasad Sathyanarayana"
> , "Vivek Agarwal" 
> Sent: Saturday, December 6, 2014 1:53:16 PM
> Subject: Re: [Gluster-devel] Discussion: Implications on geo-replication due
> to bitrot and tiering changes
> 
> [snip]
> >
> > Well If you would recall the multiple internal discussion we had and we had
> > agreed upon on this long time from the beginning.(though not recorded)
> 
> Agreed. In that case changelog changes to feed an alternate data store
> is unneeded, correct?
> 
> > and as a result of the discussion we have the Approach for the
> > infra-structure https://gist.github.com/vshankar/346843ea529f3af35339
> > AFAIK, Though the doc doesn't speak of the above in details it was always
> > the plan to do it as above.
> 
> Absolutely, the document tries to solve things in a more generic way
> and does not cover data store feeding from the cache. Thinking about
> it more leads me to the point of feeding the data store at the time of
> cache expiry a neat approach.
> 
> > The use of the LRU/LFU is definitely the way to go both with or without
> > changelog recording as it boasts the performance for recording.
> > And the mention of this is in
> > https://gist.github.com/vshankar/346843ea529f3af35339 at the end. Well you
> > know the best as you are the author :)
> > (Kotresh and me contributed over discussions, though not recorded, thanks
> > for mentioning it in the gluster-devel mail :) )
> 
> Correct me here: if data store is fed from cache (on expiry), is the
> alternate feed from changelog (either inline or asynchronous to the
> data path) needed?
> 
> >
> > As I have mentioned the development of feeding the DB in the IO path is
> > still in work in progress. We (Dan & Me) are making it more and more
> > performant. We have
> > also taking guidance from Ben England on testing it in parallel with
> > development cycles so that we have the best approach &  implementation.
> > That is where we are getting the numbers from (This is recorded in mails I
> > will forward them to you). Plus we have kept Vijay Bellur in sync with the
> > approach we are taking on a weekly basis ( though not recorded :) )
> 
> That's nice. But, my previous comment is still a concern.
> 
> >
> > On the point of the discussion not recorded on gluster-devel, these
> > discussion happened more frequently and in more adhoc way. Well you the
> > best as you were part of all of them :).
> 
> Hmmm, not all.
> 
> >
> > As we move forward we will have more discussion internally for sure and
> > lets make sure that they are recorded so that lets not keep running
> > around the same bush again and again ;).
> >
> > And Thanks for all the he

Re: [Gluster-devel] Readdir d_off encoding

2014-12-16 Thread Dan Lambright
.. also keep in mind we may want more than two DHT layers if we spin up the 
data classification project in the future.

- Original Message -
> From: "Shyam" 
> To: "Anand Avati" , "Gluster Devel" 
> , "Soumya Koduri"
> 
> Sent: Tuesday, December 16, 2014 11:46:46 AM
> Subject: Re: [Gluster-devel] Readdir d_off encoding
> 
> On 12/15/2014 09:06 PM, Anand Avati wrote:
> > Replies inline
> >
> > On Mon Dec 15 2014 at 12:46:41 PM Shyam  > > wrote:
> >
> > With the changes present in [1] and [2],
> >
> > A short explanation of the change would be, we encode the subvol ID in
> > the d_off, losing 'n + 1' bits in case the high order n+1 bits of the
> > underlying xlator returned d_off is not free. (Best to read the commit
> > message for [1] :) )
> >
> > Although not related to the latest patch, here is something to consider
> > for the future:
> >
> > We now have DHT, AFR, EC(?), DHT over DHT (Tier) which need subvol
> > encoding in the returned readdir offset. Due to this, the loss in bits
> > _may_ cause unwanted offset behavior, when used in the current scheme.
> > As we would end up eating more bits than what we do at present.
> >
> > Or IOW, we could be invalidating the assumption "both EXT4/XFS are
> > tolerant in terms of the accuracy of the value presented
> > back in seekdir().
> >
> >
> > XFS has not been a problem, since it always returns 32bit d_off. With
> > Ext4, it has been noted that it is tolerant to sacrificing the lower
> > bits in accuracy.
> >
> > i.e, a seekdir(val) actually seeks to the entry which
> > has the "closest" true offset."
> >
> > Should we reconsider an in memory _cookie_ like approach that can help
> > in this case?
> >
> > It would invalidate (some or all based on the implementation) the
> > following constraints that the current design resolves, (from, [1])
> > - Nothing to "remember in memory" or evict "old entries".
> > - Works fine across NFS server reboots and also NFS head failover.
> > - Tolerant to seekdir() to arbitrary locations.
> >
> > But, would provide a more reliable readdir offset for use (when valid
> > and not evicted, say).
> >
> > How would NFS adapt to this? Does Ganesha need a better scheme when
> > doing multi-head NFS fail over?
> >
> >
> > Ganesha just offloads the responsibility to the FSAL layer to give
> > stable dir cookies (as it rightly should)
> >
> >
> > Thoughts?
> >
> >
> > I think we need to analyze the actual assumption/problem here.
> > Remembering things in memory comes with the limitations you note above,
> > and may after all, still not be necessary. Let's look at the two
> > approaches taken:
> >
> > - Small backend offsets: like XFS, the offsets fit in 32bits, and we are
> > left with another 32bits of freedom to encode what we want. There is no
> > problem here until our nested encoding requirements cross 32bits of
> > space. So let's ignore this for now.
> >
> > - Large backend offsets: Ext4 being the primary target. Here we observe
> > that the backend filesystem is tolerant to sacrificing the accuracy of
> > lower bits. So we overwrite the lower bits with our subvolume encoding
> > information, and the number of bits used to encode is implicit in the
> > subvolume cardinality of that translator. While this works fine with a
> > single transformation, it is clearly a problem when the transformation
> > is nested with the same algorithm. The reason is quite simple: while the
> > lower bits were disposable when the cookie was taken fresh from Ext4,
> > once transformed the same lower bits are now "holy" and cannot be
> > overwritten carelessly, at least without dire consequences. The higher
> > level xlators need to take up the "next higher bits", past the previous
> > transformation boundary, to encode the next subvolume information. Once
> > the d_off transformation algorithms are fixed to give such due "respect"
> > to the lower layer's transformation and use a different real estate, we
> > might actually notice that the problem may not need such a deep redesign
> > after all.
> 
> Agreed, my lack of understanding though is how may bits can be
> sacrificed for ext4? I do not have that data, any pointers there would
> help. (did go through https://lwn.net/Articles/544520/ but that does not
> have the tolerance information in it)
> 
> Here is what I have as the current bits lost based on the following
> volume configuration,
> - 2 Tiers (DHT over DHT)
> - 128 subvols per DHT
> - Each DHT instance is either AFR or EC subvolumes, with 2 replicas and
> say 6 bricks per EC instance
> 
> So EC side of the subvol needs log(2)6 (EC) + log(2)128 (DHT) + log(2)2
> (Tier) = 3 + 7 + 1, or 11 bits of the actual d_off used to encode the
> volume, +1 for the high order bit to denote the encoding. (AFR would
> have 1 bit less, so we can consider just the EC side of things for the
> maximum loss compu

Re: [Gluster-devel] GlusterFS 4.0 updates needed

2015-01-27 Thread Dan Lambright
The data classification feature page has been updated to better reflect cache 
tiering's current status. 

- Original Message -
> From: "Jeff Darcy" 
> To: "Gluster Devel" 
> Sent: Monday, January 26, 2015 7:53:37 AM
> Subject: [Gluster-devel] GlusterFS 4.0 updates needed
> 
> Ahead of Wednesday's community meeting, I'd like to get as much 4.0
> status together as possible.  Developers, please check out the
> sub-projects on this page:
> 
> http://www.gluster.org/community/documentation/index.php/Planning40
> 
> If you're involved in any of those, please update the "status" section
> of the corresponding feature page (near the bottom) as appropriate.
> Links to design notes, email threads, code review requests, etc. are
> most welcome.  Yes, even things that have been sent to this list before.
> I'll be setting up a survey to choose a time for an online (Hangout/IRC)
> "summit" in the first week of February.  If we can separate the items
> that are truly being worked on vs. those that just "seemed cool at the
> time" but haven't received one minute of anyone's attention since (e.g.
> sharding) then we can keep that meeting focused and productive.  Thanks!
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] RFC: d_off encoding at client/protocol layer

2015-02-02 Thread Dan Lambright
Hello,

We have a prototype of this working on the tiering forge site, and noticed that 
in this scheme, each client translator needs to "know" the total number of 
bricks in the volume. We can compute that number when the graph is created, 
and on a graph switch. But in some sense, a downside of this is that it runs 
against a Gluster design principle that a translator is only "aware" of the 
translators it is attached to. 

I can imagine an alternative: the d_off could set aside a fixed number of bits 
for the subvolume. The number would be as many as are needed for the maximum 
number of subvolumes we support. So, for example, if we support 2048 subvolumes, 
11 bits would be set aside in the d_off. This would mean the client translator 
would not need to know the number of bricks.

The disadvantage of this approach is that the more bits we use for the subvolume, 
the fewer remain available for the offset handed to us by the file system, and 
this can lead to a greater probability of losing track of which file in a 
directory we left off on. 

Overall, the arguments I've heard in support of keeping the current approach 
(a dynamic number of bits set aside for the subvolume) seem stronger, given 
that in most deployments the number of subvolumes occupies a much smaller 
number of bits.
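
To make the trade-off concrete, here is a simplified sketch of the fixed-width 
alternative described above (illustrative only; the encoding actually in the 
tree sizes the subvolume field dynamically and is more involved):

#include <stdint.h>

/* Hypothetical fixed-width scheme: always reserve the low SUBVOL_BITS bits
 * of the 64-bit d_off for a subvolume ID, overwriting that part of the
 * offset returned by the backend file system. As discussed in the earlier
 * d_off threads, ext4 tolerates the resulting loss of accuracy in the low
 * bits on seekdir(). */
#define SUBVOL_BITS 11                           /* enough for 2048 subvolumes */
#define SUBVOL_MASK ((1ULL << SUBVOL_BITS) - 1)

static inline uint64_t
doff_encode (uint64_t backend_off, uint32_t subvol_id)
{
        return (backend_off & ~SUBVOL_MASK) | ((uint64_t)subvol_id & SUBVOL_MASK);
}

static inline uint32_t
doff_subvol (uint64_t d_off)
{
        return (uint32_t)(d_off & SUBVOL_MASK);
}

static inline uint64_t
doff_backend_approx (uint64_t d_off)
{
        return d_off & ~SUBVOL_MASK;    /* the low bits are lost by design */
}

The fewer bits SUBVOL_BITS takes, the closer doff_backend_approx() stays to the 
true backend offset, which is exactly the trade-off described above.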

Dan

- Original Message -
> From: "Shyam" 
> To: "Gluster Devel" 
> Cc: "Dan Lambright" 
> Sent: Monday, January 26, 2015 8:59:14 PM
> Subject: RFC: d_off encoding at client/protocol layer
> 
> Hi,
> 
> Some parts of this topic has been discussed in the recent past here [1]
> 
> The current mechanism of each xlator encoding the subvol in the lower or
> higher bits has its pitfalls as discussed in the threads and in this
> review, here [2]
> 
> Here is a solution design from the one of the comments posted on this by
> Avati here, [3], as in,
> 
> "One example approach (not necessarily the best): Make every xlator
> knows the total number of leaf xlators (protocol/clients), and also the
> number of all leaf xlators from each of its subvolumes. This way, the
> protocol/client xlators (alone) do the encoding, by knowing its global
> brick# and total #of bricks. The cluster xlators blindly forward the
> readdir_cbk without any further transformations of the d_offs, and also
> route the next readdir(old_doff) request to the appropriate subvolume
> based on the weighted graph (of counts of protocol/clients in the
> subtrees) till it reaches the right protocol/client to resume the
> enumeration."
> 
> So the current proposed scheme that is being worked on is as follows,
> - encode the d_off with the client/protocol ID, which is generated as
> its leaf position/number
> - no further encoding in any other xlator
> - on receiving further readdir requests with the d_off, consult the,
> graph/or immediate children, on ID encoded in the d_off, and send the
> request down that subvol path
> 
> IOW, given a d_off and a common routine, pass the d_off with this (i.e
> current xlator) to get a subvol that the d_off belongs to. This routine
> would decode the d_off for the leaf ID as encoded in the client/protocol
> layer, and match its subvol relative to this and send that for further
> processing. (it may consult the graph or store the range of IDs that any
> subvol has w.r.t client/protocol and deliver the result appropriately).
> 
> Given the current situation of ext4 and xfs, and continuing with the ID
> encoding scheme, this seems to be the best manner of preventing multiple
> encoding of subvol stomping on each other, and also preserving (in a
> sense) further loss of bits. This scheme would also give AFR/EC the
> ability to load balance readdir requests across its subvols better, than
> have a static subvol to send to for a longer duration.
> 
> Thoughts/comments?
> 
> Shyam
> 
> [1] https://www.mail-archive.com/gluster-devel@gluster.org/msg02834.html
> [2] review.gluster.org/#/c/8201/4/xlators/cluster/afr/src/afr-dir-read.c
> [3] https://www.mail-archive.com/gluster-devel@gluster.org/msg02847.html
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] readdir vs readdirp

2015-02-06 Thread Dan Lambright
Hi All, I hope this is not an overly stupid question..

Does anyone know a good way to test readdir (not readdirplus) on gluster?

I see in :
bugs/replicate/886998/strict-readdir.t

.. that there is a switch "cluster.strict-readdir". It does not appear to work, 
or be referenced in the code - perhaps I am missing something.

I also tried mounting NFS with the "nordirplus" option; this did not produce 
the desired results either, surprisingly.

I suppose it's always possible to load older, pre-readdirp versions of Linux.

Thanks,
Dan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] readdir vs readdirp

2015-02-10 Thread Dan Lambright


- Original Message -
> From: "Vijay Bellur" 
> To: "Dan Lambright" , "Gluster Devel" 
> 
> Sent: Saturday, February 7, 2015 7:45:49 PM
> Subject: Re: [Gluster-devel] readdir vs readdirp
> 
> On 02/06/2015 02:03 PM, Dan Lambright wrote:
> > Hi All, I hope this is not an overly stupid question..
> >
> > Does anyone know a good way to test readdir (not readdirplus) on gluster?
> >
> > I see in :
> > bugs/replicate/886998/strict-readdir.t
> >
> > .. that there is a switch "cluster.strict-readdir". It does not appear to
> > work, or be referenced in the code - perhaps I am missing something.
> >
> > I also tried mounting nfs with the "nordirplus" option, this did not
> > produce the desired results either, surprisingly.
> 
> With fuse client, I suspect you would need to do the following to let
> readdir fop percolate through the stack:
> 
> 1. mount client with --use-readdirp=no. This will bypass readdirplus in
> fuse.
> 
> 2. Disable md-cache. md-cache by default converts all readdir operations
> to readdirplus for its functioning.

Thanks. This procedure worked, although rather than disabling md-cache, I used 
a switch to disable readdirp operations "*md-cache.force-readdirp=no".

glusterfs --volfile-id=/t --volfile-server=gprfs018 /mnt2 --use-readdirp=no 
--xlator-option "*dht.use-readdirp=no"  --xlator-option 
"*md-cache.force-readdirp=no"

> 
> 3. Prevent dht from doing readdirp by default. While mounting, you could
> pass --xlator-option "*dht.use-readdirp=no" to prevent readdir to
> readdirp conversions in dht.
> 
> -Vijay
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Data Classification: would it be possible to have a RAM-disk as caching tier?

2015-02-16 Thread Dan Lambright

The tiering code ought to work if the underlying volume is a RAM disk.  It's 
like any other Gluster brick.  But as mentioned, this would be unsafe for their 
data. If the power went out, anything in the RAM disk would be lost. I think a 
pure "read cache" may be a better option for RAM disks. This is the route 
companies like NimbleStorage and Infinio went. Read caches are something the 
feature may support in the future.

Dan

- Original Message -
> From: "Joseph Fernandes" 
> To: "Niels de Vos" 
> Cc: gluster-devel@gluster.org, "Dan Lambright" 
> Sent: Monday, February 16, 2015 5:54:15 AM
> Subject: Re: Data Classification: would it be possible to have a RAM-disk as 
> caching tier?
> 
> Hi Niels,
> 
> Well the idea is good, RAM-Disk would fastest and with no extra cost +
> We may have gluster brick from RAM-Disk[1]
> The one and the biggest challenge would be durability of data on RAM-Disks
> Using RAM for caching is good, as the cache will have only the copy of the
> original data
> but in case of tiering, the original data sits on the tier(not the copy).
> 
> Dan your thoughts.
> 
> Thanks,
> Joe
> 
> 1. https://lists.gnu.org/archive/html/gluster-devel/2013-05/msg00118.html
> 
> - Original Message -
> From: "Niels de Vos" 
> To: gluster-devel@gluster.org
> Cc: "Dan Lambright" , "Joseph Fernandes"
> 
> Sent: Monday, February 16, 2015 4:14:37 PM
> Subject: Data Classification: would it be possible to have a RAM-disk as
> caching tier?
> 
> Hi guys,
> 
> at FOSDEM one of our users spoke to me about their deployment and
> environment. It seems that they have a *very* good deal with their
> hardware vendor, which makes it possible to stuff their servers full
> with RAM for a minimal difference of the costs.
> 
> They expressed interest in having a RAM-disk as caching tier on the
> bricks. Would a configuration like this be possible with the
> data-classification feature [1]?
> 
> Thanks,
> Niels
> 
> 1.
> http://www.gluster.org/community/documentation/index.php/Features/data-classification
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] expanding tiered volumes

2015-02-16 Thread Dan Lambright

Hello,

An initial version of storage tiering is being developing for Gluster 3.7. It 
will allow you to add a set of fast bricks to a volume (such as SSDs) which can 
be used as a cache for slower bricks (HDD). It resembles a feature developed 
for Ceph last year. One of the hard parts has been integrating this new volume 
type with existing features, such as adding and removing bricks. Allowing the 
user to add and remove bricks to both the hot and cold tiers is a challenge on 
multiple levels. The CLI commands would have to specify which tier to 
add/remove bricks to. New phenomena could occur, such as rebalancing operations 
happening simultaneously on both the hot and cold tiers. In general, rebalancing 
while cache promotion/demotion is happening is a tricky problem; it 
requires some tweaks to DHT which are forthcoming. To manage the complexity and 
get this feature done "on time" (per Gluster's release schedule), we would like 
to only allow adding and removing bricks to the cold 
tier for the first version. I would like to see if there are any comments from 
the upstream community.
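
To make the discussion concrete, here is a rough sketch of how such a volume 
might be assembled. The volume name, hostnames, and brick paths are invented, 
and the attach-tier CLI shown follows the proposal discussed later in this 
thread, so the final syntax may differ:

# cold tier: ordinary (e.g. HDD-backed) bricks forming a replicated volume
gluster volume create tvol replica 2 server1:/hdd/brick1 server2:/hdd/brick1
gluster volume start tvol

# hot tier: attach a set of fast (e.g. SSD-backed) bricks to act as the cache
gluster volume attach-tier tvol replica 2 server1:/ssd/brick1 server2:/ssd/brick1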

Dan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] expanding tiered volumes

2015-02-23 Thread Dan Lambright


- Original Message -
> From: "Vijay Bellur" 
> To: "Dan Lambright" , "Gluster Devel" 
> 
> Sent: Monday, February 23, 2015 5:26:58 AM
> Subject: Re: [Gluster-devel] expanding tiered volumes
> 
> On 02/16/2015 08:20 PM, Dan Lambright wrote:
> 
> > One of the hard parts has been integrating this new volume type with
> > existing features, such as adding and removing bricks.
> > Allowing the user to add and remove bricks to both the hot and cold tiers
> > is a challenge on multiple levels.
> > The CLI commands would have to specify which tier to add/remove bricks to.
> 
> Could you please detail the CLI being considered for defining a hot/cold
> tier? I think that will be useful in understanding the problem better.
> 
> > New phenomena could occur such as rebalancing operations happening
> > simultanously on both the hot and cold tiers.
> > In general rebalancing at the same time cache promotion/demotion is
> > happening is a tricky problem, it requires some tweaks to DHT which are
> > forthcoming.
> > To manage the complexity and get this feature done "on time" (per Gluster's
> > release schedule) we would like to only allow adding and removing bricks
> > to the cold
> > tier for the first version. I would like to see if there are any comments
> > from the upstream community.
> 
> We can have some flexibility if the core feature lands by end of this
> week. Are there any thoughts on how either tier can be expanded or
> shrunk? If it is not very intrusive and seems appropriate to be pulled
> in as part of a bug fix, we can look at this beyond feature freeze and
> before code freeze.

Attaching a tier is akin to adding bricks:

volume attach-tier <VOLNAME> [<replica> <COUNT>] <NEW-BRICK> ... [force]

Adding bricks to the "cold tier" would be the same CLI as with normal volumes.

volume add-brick <VOLNAME> [<replica> <COUNT>] <NEW-BRICK> ... [force]

Adding bricks to the "hot tier" could be specified using a modified CLI, e.g. 
add the keyword "attached-tier". 

volume add-brick <VOLNAME> [<replica> <COUNT>] <NEW-BRICK> ... [force] [attached-tier]

Technically, the ability to expand both the hot and cold tiers is feasible 
given enough development time.  An additional change to DHT is expected from 
the DHT group to allow rebalancing while promotion and demotion are happening. 
That fix is the first step.
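
As a purely illustrative example of the difference, reusing the gprfs018 brick 
naming from the volume-info proposal elsewhere in this thread (the extra brick 
paths are invented, and the attached-tier keyword is only a proposal at this 
point):

# expanding the cold tier: the usual add-brick invocation
gluster volume add-brick t replica 2 gprfs018:/home/t7 gprfs018:/home/t8

# expanding the hot tier: proposed syntax, not planned for the first version
gluster volume add-brick t replica 2 gprfs018:/home/t9 gprfs018:/home/t10 attached-tier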

> 
> Thanks,
> Vijay
> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Object Quota feature proposal for GlusterFS-3.7

2015-02-24 Thread Dan Lambright
Prashanth,

Not sure exactly what you are trying to do, but the changetimerecorder 
translator records, in a database, the files that are created, modified, 
deleted, or read (or some subset of those operations). It is a new translator 
that is part of the tier feature, though it was written to be multipurpose and 
modular.

It uses an SQLite database as its backend, though a different backend can be 
plugged in, and it performs lazy updates.
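
If it helps to get a feel for what is recorded, the database can be inspected 
directly with the sqlite3 shell. The brick path below is made up, and the table 
names are from memory (check libgfdb's gfdb_sqlite3 sources for the 
authoritative definitions):

# the CTR database sits under the brick's .glusterfs directory; the file name varies per brick
DB=$(ls /bricks/b1/.glusterfs/*.db | head -n1)
sqlite3 "$DB" ".tables"
sqlite3 "$DB" "SELECT * FROM GF_FILE_TB LIMIT 5;"   # per-file write/read time records
sqlite3 "$DB" "SELECT * FROM GF_FLINK_TB LIMIT 5;"  # gfid-to-parent/basename link records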

Feel free to take a look at it as well or inquire about it (Joseph authored 
this feature), just FYI. 

Dan

- Original Message -
> From: "Vijay Bellur" 
> To: "Prashanth Pai" , "Luis Pabon" 
> Cc: "gluster-devel@gluster.org >> Gluster Devel" 
> Sent: Tuesday, February 24, 2015 2:11:07 AM
> Subject: Re: [Gluster-devel] Object Quota feature proposal for GlusterFS-3.7
> 
> On 02/24/2015 10:54 AM, Prashanth Pai wrote:
> > Hi Luis,
> >
> > Currently, even with storage policies, there is no mechanism to update the
> > container DBs with details of files created/modified/deleted over
> > filesystem interface.
> > Hence a GET on a container would list objects that may or may not exist on
> > disk. Also the metadata of container (bytes and object count) could be
> > outdated.
> > We need a mechanism to detect change in GlusterFS (have explored changelog
> > and inotify, both not feasible) and then lazily update Swift DBs so that
> > the container listing and metadata would eventually look right.
> >
> 
> What were the problems encountered while trying to use changelog?
> 
> Thanks,
> Vijay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Patch #10000

2015-03-26 Thread Dan Lambright

Patch 10000 itself is nothing special.
  
But the bug count is a milestone for Gluster; congrats to everyone in our 
re-energized community.

- Original Message -
> From: "Jeff Darcy" 
> To: "Gluster Devel" 
> Sent: Thursday, March 26, 2015 9:17:47 AM
> Subject: [Gluster-devel] Patch #10000
> 
> And the winner is ... Dan Lambright!
> 
> http://review.gluster.org/#/c/1
> 
> Congrats, Dan.
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Hangouts for 3.7 features

2015-03-27 Thread Dan Lambright
I think we could do one for Tiering on the Tuesday after next (4/7)?

- Original Message -
> From: "Vijay Bellur" 
> To: "Gluster Devel" 
> Sent: Friday, March 27, 2015 7:01:11 AM
> Subject: [Gluster-devel] Hangouts for 3.7 features
> 
> Hi All,
> 
> As we inch closer to 3.7.0, I think it might be a good idea to talk
> about new/improved features in 3.7 and do a demo of the features to our
> users over Google hangout sessions. With that in mind, I have created an
> etherpad with the list of prominent features in 3.7 at [1]. If you are a
> feature owner and interested in doing a hangout session to the
> community, can you please update your name and preferred time in the
> etherpad? Please feel free to update the etherpad if I have missed
> adding your feature to the list :).
> 
> Given the number of features that we have, we can possibly look at doing
> two hangouts per week - possibly on Tuesdays and Thursdays from the
> coming week. What do you folks think?
> 
> Cheers,
> Vijay
> 
> [1] https://public.pad.fsfe.org/p/gluster-3.7-hangouts
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] 3.7.0 update

2015-04-14 Thread Dan Lambright


- Original Message -
> From: "Emmanuel Dreyfus" 
> To: "Vijay Bellur" 
> Cc: "Gluster Devel" 
> Sent: Tuesday, April 14, 2015 1:07:45 PM
> Subject: Re: [Gluster-devel] 3.7.0 update
> 
> On Mon, Apr 13, 2015 at 10:43:01PM +0530, Vijay Bellur wrote:
> > Haven't heard any feedback here. Maintainers - can you please chime in with
> > ACK/NACK here for your components?
> 
> here is NetBSD regression status:
> 
> 1) The good (tests re-enabled)
> - tests/encryption/crypt.t was fixed
> - tests/basic/afr/read-subvol-entry.t was fixed
> 
> 2) The bad (tests remain disabled)
> - tests/features/trash.t had a bug fixed, but the test still fails for
>   timing reasons. I expect a positive outcome soon in
>   http://review.gluster.org/10215
> - tests/basic/afr/split-brain-resolution.t had a bug fix submitted, but
>   was voted down: http://review.gluster.org/10134
> - tests/basic/quota-anon-fd-nfs.t: Sachin Pandit got involved but no
>   fix yet
> - tests/basic/tier/tier.t: reliably broken and nobody is involved AFAIK

If I recall correctly, the problem with tier.t was that we need to replace how we 
flush the client's buffer cache with a method that is friendly to BSD. My 
understanding is that this amounts to just unmounting the volume. I'll take this 
up.


> - tests/basic/ec still exhibits spurious failure, with no work being done
>   on it, test will remain disabled
> 
> 3) The ugly (new regressions)
> - tests/geo-rep was introduced and it fails for now on NetBSD, hanging the
>   tests. I had to kill a handful of pending jobs (which got an undeserved
>   verified=-1, and unfortunately there is no easy way to batch-retrigger), and
>   I disabled the subdirectory until I have it working.
> 
> 
> --
> Emmanuel Dreyfus
> m...@netbsd.org
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] tiering demo Thursday

2015-04-14 Thread Dan Lambright

Hello folks,

We are scheduling a Hangout[1] session Thursday 1:30PM UTC regarding the 
upcoming "Gluster tiering" feature in GlusterFS. This session would include a 
preview of the feature, implementation details and quick demo.

Please plan to join the Hangout session and spread the word around.

[1] goo.gl/auENCG

Dan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tiering demo Thursday

2015-04-16 Thread Dan Lambright
Hello folks,

Our hangout session has concluded [1], and I expect we will do another next 
month from the US which hopefully will have better interactivity. 

In the meantime, below [2] is the "gluster volume info" display we are 
considering for tiered volumes. Let us know any feedback on how it looks.

[1]
goo.gl/auENCG

[2]
Proposal for gluster v info for the tier case:

Volume Name: t
Type: Tier
Volume ID: 320d4795-4eae-4d83-8e55-1681813c8549
Status: Created
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:

hot
Number of Bricks: 3 x 2 = 6
Type: Distribute Replicate
Brick1: gprfs018:/home/t6
Brick2: gprfs018:/home/t5

cold
Type: Distributed Replicate
Number of Bricks: 3 x 2 = 6
Brick3: gprfs018:/home/t1
Brick4: gprfs018:/home/t2
Brick5: gprfs018:/home/t3
Brick6: gprfs018:/home/t4

- Original Message -
> From: "Dan Lambright" 
> To: "Gluster Devel" 
> Sent: Wednesday, April 15, 2015 12:09:27 PM
> Subject: tiering demo Thursday
> 
> 
> Hello folks,
> 
> We are scheduling a Hangout[1] session Thursday 1:30PM UTC regarding the
> upcoming "Gluster tiering" feature in GlusterFS. This session would include
> a preview of the feature, implementation details and quick demo.
> 
> Please plan to join the Hangout session and spread the word around.
> 
> [1] goo.gl/auENCG
> 
> Dan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Upcall state + Data Tiering

2015-04-19 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" , "Joseph Fernandes" 
> 
> Cc: "gluster Devel" , "Soumya Koduri" 
> 
> Sent: Sunday, April 19, 2015 9:01:56 AM
> Subject: Re: [Gluster-devel] Upcall state + Data Tiering
> 
> On Thu, Apr 16, 2015 at 04:58:29PM +0530, Soumya Koduri wrote:
> > Hi Dan/Joseph,
> > 
> > As part of upcall support on the server-side, we maintain certain state to
> > notify clients of the cache-invalidation and recall-leaselk events.
> > 
> > We have certain known limitations with Rebalance and Self-Heal. Details in
> > the below link -
> > http://www.gluster.org/community/documentation/index.php/Features/Upcall-infrastructure#Limitations
> > 
> > In case of Cache-invalidation,
> > upcall state is not migrated and once the rebalance is finished, the file
> > is
> > deleted and we may falsely notify the client that file is deleted when in
> > reality it isn't.
> > 
> > In case of Lease-locks,
> > As the case with posix locks, we do not migrate lease-locks as well but
> > will
> > end-up recalling lease-lock.
> > 
> > Here rebalance is an admin driven job, but that is not the case with
> > respect
> > to Data tiering.
> > 
> > We would like to know when the files are moved from hot to cold tiers or
> > vice-versa or rather when a file is considered to be migrated from cold to
> > hot tier, where we see potential issues.
> > Is it the first fop which triggers it? and
> > where are the further fops processed - on hot tier or cold tier?

Data tiering's basic design has been to reuse DHT's data migration algorithms. 
The same problem therefore exists with DHT, but there it is a known limitation 
controlled by management operations. So (if I follow) the DHT team may not 
tackle this problem right away, and tiering may not be able to leverage their 
solution. It is of course desirable for data tiering to solve the problem in 
order to use the new upcall mechanisms.

Migration of a file is a multi-state process. I/O is accepted while migration 
is underway. I believe the upcall manager and the migration manager (for lack 
of better terms) would have to coordinate. The former subsystem understands 
locks, and the latter how to move files.

With that "coordination" in mind, a basic strategy might be something like this:

On the source, when a file is ready to be moved, the migration manager informs 
the upcall manager.

The upcall manager packages relevant lock information and returns it to the 
migrator.  The information reflects the state of posix or lease locks.

The migration manager moves the file.

The migration manager then sends the lock information as a virtual extended 
attribute. 

On the destination server, the upcall manager is invoked. It is passed the 
contents of the virtual attributes. The upcall manager rebuilds the lock state 
and puts the file into proper order. 

Only at that point does the setxattr RPC return, and only then is the file 
declared "migrated".

We would have to handle any changes to the lock state that occur while the file 
is in the middle of being migrated. Most likely the upcall manager would update 
the contents of the "package".

It is desirable to invent something that would work with both DHT and tiering 
(i.e. be implemented at the core dht-rebalance.c layer). And in fact, the 
mechanism I describe could be useful for other meta-data transfer applications. 

This is just a high-level sketch, to provoke discussion and to check whether this 
is the right direction. It would take much time to sort through the details. 
Other ideas are welcome.

> 
> My understanding is the following:
> 
> - when a file is "cold" and gets accessed, the 1st FOP will mark the
>   file for migration to the "hot" tier
> - migration is async, so the initial responses on FOPs would come from
>   the "cold" tier
> - upon migration (similar to rebalance) locking state and upcall
>   tracking is lost
> 
> I think this is a problem. There seems to be a window where a client can
> get (posix) locks while the file is on the "cold" tier. After migrating
> the file from "cold" to "hot", these locks would get lost. The same
> counts for tracking access in the upcall xlator.
> 
> > Please provide your inputs on the same. We may need to document the same or
> > provide suggestions to the customers while deploying this solution.
> 
> Some ideas on how this can get solved would be most welcome.
> 
> Thanks,
> Niels
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] tiering demo Thursday

2015-04-22 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Dan Lambright" 
> Cc: "Gluster Devel" , "gluster-us...@gluster.org 
> List" 
> Sent: Sunday, April 19, 2015 8:37:30 AM
> Subject: Re: [Gluster-devel] tiering demo Thursday
> 
> On Thu, Apr 16, 2015 at 10:10:43AM -0400, Dan Lambright wrote:
> > Hello folks,
> > 
> > Our hangout session has concluded [1], and I expect we will do another next
> > month from the US which hopefully will have better interactivity.
> > 
> > In the meantime, below [2] is the "gluster volume info" display we are
> > considering for tiered volumes. Let us know any feedback on how it looks.
> > 
> > [1]
> > goo.gl/auENCG
> > 
> > [2]
> > Proposal for gluster v info for the tier case:
> > 
> > Volume Name: t
> > Type: Tier
> > Volume ID: 320d4795-4eae-4d83-8e55-1681813c8549
> > Status: Created
> > Number of Bricks: 3 x 2 = 6
> > Transport-type: tcp
> > Bricks:
> > 
> > hot
> > Number of Bricks: 3 x 2 = 6
> > Type: Distribute Replicate
> > Brick1: gprfs018:/home/t6
> > Brick2: gprfs018:/home/t5
> > 
> > cold
> > Type: Distributed Replicate
> > Number of Bricks: 3 x 2 = 6
> > Brick3: gprfs018:/home/t1
> > Brick4: gprfs018:/home/t2
> > Brick5: gprfs018:/home/t3
> > Brick6: gprfs018:/home/t4
> 
> Have you thought about how this will get displayed when using the --xml
> option? Most (all?) management interfaces use "gluster --xml ..." and
> parse the output. If you have not done so yet, it might be worth talking
> to (for example) the oVirt people that use it.

This is a good point - I will set up a bug to work on it.

> 
> Thanks,
> Niels
> 
> > 
> > - Original Message -
> > > From: "Dan Lambright" 
> > > To: "Gluster Devel" 
> > > Sent: Wednesday, April 15, 2015 12:09:27 PM
> > > Subject: tiering demo Thursday
> > > 
> > > 
> > > Hello folks,
> > > 
> > > We are scheduling a Hangout[1] session Thursday 1:30PM UTC regarding the
> > > upcoming "Gluster tiering" feature in GlusterFS. This session would
> > > include
> > > a preview of the feature, implementation details and quick demo.
> > > 
> > > Please plan to join the Hangout session and spread the word around.
> > > 
> > > [1] goo.gl/auENCG
> > > 
> > > Dan
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD regression status upate

2015-04-27 Thread Dan Lambright


- Original Message -
> From: "Vijay Bellur" 
> To: "Emmanuel Dreyfus" , gluster-devel@gluster.org
> Sent: Monday, April 27, 2015 5:40:11 AM
> Subject: Re: [Gluster-devel] NetBSD regression status upate
> 
> On 04/26/2015 09:33 AM, Emmanuel Dreyfus wrote:
> > Hello
> >
> > Here is the status of NetBSD regression so far. It would be nice if
> > people could help with quota-anon-fd-nfs.t, mgmt_v3-locks.t and tier.t.
> >
> > The following tests are disabled for now:
> >
> > - tests/basic/afr/split-brain-resolution.t
> >Anuradha Talur is working on it, the change being still under review
> >http://review.gluster.org/10134
> >
> > - tests/basic/ec/
> >This works but with rare spurious failures. Nobody works on it.
> >
> > - tests/basic/quota-anon-fd-nfs.t
> >This test passed in the past and is now broken. Nobody works on it.
> >
> > - tests/basic/mgmt_v3-locks.t
> >This test passed in the past and is now broken. Nobody works on it
> >
> > - tests/basic/tier/tier.t
> >This test has always been broken on NetBSD. Nobody works on it
> >
> > - tests/bugs
> >Mostly uncharted territory
> >
> > - tests/geo-rep
> >All that tests have always failed.
> >I started investigating and awaits input from Kotresh Hiremath
> >Ravishankar.
> >
> > - tests/features/trash.t
> >Anoop C S, Jiffin Tony Thottan and I have been working on it. The
> >change are pending review:
> >http://review.gluster.org/10346
> >http://review.gluster.org/10360
> >http://review.gluster.org/10374
> >
> >
> 
> I think we need to add tests/features/glupy.t to the list as it seems to
> fail consistently.
> 
> Given that we consider regression test failures as a blocker for 3.7.0,
> can owners of:
> 
> - tests/basic/mgmt_v3-locks.t
> - tests/basic/tier/tier.t

Worked with Emmanuel over the weekend on tier.t and found bug 10395.

> - tests/features/glupy.t
> 
> try to address these problems over this week on NetBSD?
> 
> Manu - I think we can pick up tests/bugs/* post 3.7.0 for NetBSD. Would
> that work for you?
> 
> Thanks,
> Vijay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] NetBSD regression status

2015-06-03 Thread Dan Lambright


- Original Message -
> From: "Niels de Vos" 
> To: "Emmanuel Dreyfus" 
> Cc: "Gluster Devel" 
> Sent: Wednesday, June 3, 2015 9:43:56 AM
> Subject: Re: [Gluster-devel] NetBSD regression status
> 
> On Wed, Jun 03, 2015 at 01:30:31PM +, Emmanuel Dreyfus wrote:
> > Hi
> > 
> > Here is NetBSD regression status for master and release-3.7:
> > 
> > The following tests always fail:
> > ./tests/basic/ec/ec.t
> > ./tests/basic/ec/quota.t
> > ./tests/basic/ec/self-heal.t
> > ./tests/basic/quota-anon-fd-nfs.t
> > 
> > Uncharted territory:
> > ./tests/bugs
> > 
> > Fails, but never passed:
> > ./tests/geo-rep
> > 
> > 
> > Additionally, ./tests/basic/tier/tier.t passes, but:
> > Running tests in file ./tests/basic/tier/tier.t
> > [10:03:02] ./tests/basic/tier/tier.t .. 25/35 umount: unknown option -- l
> > Usage: umount [-fvFR] [-t fstypelist] special | node
> >  umount -a[fvF] [-h host] [-t fstypelist]
> > 
> > It is a bit nasty to use a Linux-only option where portability was
> > the goal:
> > # Check promotion on read to slow tier
> > ( cd $M0 ; umount -l $M0 ) # fail but drops kernel cache
> 
> I'm also not sure if this really drops the caches on Linux. Someone
> tested it (Ravi?) and did not see the same cache reduction as when using
> "echo 3 > /proc/sys/vm/drop_caches".
> 
> We probably need to introduce a function for this that handles both the
> NetBSD and Linux methods.

There was a suggestion to do this; see the comments on fix 10411. 

There are some subtle differences:

( cd $M0 ; umount -l $M0 )  

- drops only the kernel cache for that filesystem.
- does not drop FUSE's cache.

Whereas echo 3 > /proc/sys/vm/drop_caches drops everything on the system.
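
A rough sketch of what a portable helper for the regression tests could look 
like is below. The function name is made up, and the NetBSD branch (a plain 
unmount followed by a remount using the standard $H0/$V0 test variables) is an 
assumption about what would be acceptable there, not a tested implementation:

drop_client_cache () {
    local mnt=$1
    case "$(uname -s)" in
    Linux)
        # Linux-only trick tier.t uses today: a lazy unmount issued from inside
        # the mount point drops the kernel cache for that filesystem only
        ( cd "$mnt" ; umount -l "$mnt" )
        ;;
    *)
        # NetBSD umount has no -l; unmount and remount the client instead,
        # which also discards the kernel cache for the volume
        umount "$mnt"
        glusterfs --volfile-server="$H0" --volfile-id="/$V0" "$mnt"
        ;;
    esac
}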

> 
> Niels
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Amazon EFS (elastic file system)

2015-06-03 Thread Dan Lambright
Of interest to this group, AWS is previewing an "elastic file system" feature.

https://aws.amazon.com/efs/

- access by NFSv4.
- automatically scales as files added.
- metering based on storage used.
- integrated with their web interface, security groups, etc.


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] spurious failure with test-case ./tests/basic/tier/tier.t

2015-06-26 Thread Dan Lambright


- Original Message -
> From: "Raghavendra Bhat" 
> To: gluster-devel@gluster.org
> Sent: Friday, June 26, 2015 6:37:37 AM
> Subject: Re: [Gluster-devel] spurious failure with test-case  
> ./tests/basic/tier/tier.t
> 
> On 06/26/2015 04:00 PM, Ravishankar N wrote:
> >
> >
> > On 06/26/2015 03:57 PM, Vijaikumar M wrote:
> >> Hi
> >>
> >> Upstream regression failure with test-case ./tests/basic/tier/tier.t
> >>
> >> My patch #11315 failed regression twice with
> >> test-case ./tests/basic/tier/tier.t. Is anyone seeing this issue with
> >> other patches?
> >>
> >
> > Yes, one of my patches failed today too:
> > http://build.gluster.org/job/rackspace-regression-2GB-triggered/11461/consoleFull

Will take a look. Thanks.

> >
> > -Ravi
> 
> Even I had faced failure in tier.t couple of times.
> 
> Regards,
> Raghavendra Bhat
> 
> >> http://build.gluster.org/job/rackspace-regression-2GB-triggered/11396/consoleFull
> >>
> >> http://build.gluster.org/job/rackspace-regression-2GB-triggered/11456/consoleFull
> >>
> >>
> >>
> >> Thanks,
> >> Vijay
> >>
> >> ___
> >> Gluster-devel mailing list
> >> Gluster-devel@gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t

2015-07-02 Thread Dan Lambright
I'll check on this.

- Original Message -
> From: "Pranith Kumar Karampuri" 
> To: "Gluster Devel" , "Joseph Fernandes" 
> 
> Sent: Thursday, July 2, 2015 5:40:34 AM
> Subject: [Gluster-devel] Failure in   
> tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t
> 
> hi Joseph,
> Could you take a look at
> http://build.gluster.org/job/rackspace-regression-2GB-triggered/11842/consoleFull
> 
> Pranith
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel