Re: [Gluster-devel] Throttling xlator on the bricks

2016-03-08 Thread Pranith Kumar Karampuri
         subvol_healer_t    *healer;
         xlator_t           *readdir_xl;
         inode_t            *idx_inode;  /* inode ref for xattrop dir */
+        call_frame_t       *frame;
         unsigned int        entries_healed;
         unsigned int        entries_processed;
         unsigned int        already_healed;


Richard

From: Ravishankar N [ravishan...@redhat.com]
Sent: Sunday, February 07, 2016 11:15 PM
To: Shreyas Siravara
Cc: Richard Wareing; Vijay Bellur; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

Hello,

On 01/29/2016 06:51 AM, Shreyas Siravara wrote:

So the way our throttling works is (intentionally) very simplistic.

(1) When someone mounts an NFS share, we tag the frame with a 32 bit 
hash of the export name they were authorized to mount.
(2) io-stats keeps track of the "current rate" of fops we're seeing 
for that particular mount, using a sampling of fops and a moving 
average over a short period of time.
(3) Based on whether the share violated its allowed rate (which is 
defined in a config file), we tag the FOP as "least-pri". Of course 
this makes the assumption that all NFS endpoints are receiving 
roughly the same # of FOPs. The rate defined in the config file is a 
*per* NFS endpoint number. So if your cluster has 10 NFS endpoints, 
and you've pre-computed that it can do roughly 1000 FOPs per second, 
the rate in the config file would be 100.
(4) IO-Threads then shoves the FOP into the least-pri queue, rather 
than its default. The value is honored all the way down to the bricks.
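
[Editorial note: for illustration only, a minimal sketch of the per-mount moving-average check in steps (2) and (3) above, which decides when a FOP is demoted to least-pri. The structure and field names here are hypothetical, not the actual io-stats code.]

/* Hypothetical sketch: io-stats-style per-mount rate tracking.
 * mount_stats_t and the 0.75/0.25 weights are illustrative only. */
#include <stdint.h>
#include <time.h>

typedef struct {
        uint32_t export_hash;        /* 32-bit hash of the export name */
        double   allowed_rate;       /* per-NFS-endpoint rate from the config file */
        double   avg_fops;           /* moving average of fops/sec */
        uint64_t fops_since_sample;
        time_t   last_sample;
} mount_stats_t;

/* Returns 1 if this mount is over its allowed rate and the FOP should be
 * tagged least-pri before being wound down to the bricks. */
static int
mount_over_rate (mount_stats_t *ms)
{
        time_t now     = time (NULL);
        double elapsed = difftime (now, ms->last_sample);

        ms->fops_since_sample++;

        if (elapsed >= 1.0) {
                double rate  = ms->fops_since_sample / elapsed;
                ms->avg_fops = 0.75 * ms->avg_fops + 0.25 * rate;
                ms->fops_since_sample = 0;
                ms->last_sample       = now;
        }

        return ms->avg_fops > ms->allowed_rate;
}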


The code is actually complete, and I'll put it up for review after 
we iron out a few minor issues.

Did you get a chance to send the patch? Just wanted to run some tests
and see if this is all we need at the moment to regulate shd traffic,
especially with Richard's multi-threaded heal patch
https://urldefense.proofpoint.com/v2/url?u=http-3A__review.gluster.org_-23_c_13329_&d=CwIC-g&c=5VD0RTtNlTh3ycd41b3MUw&r=qJ8Lp7ySfpQklq3QZr44Iw&m=B873EiTlTeUXIjEcoutZ6Py5KL0bwXIVroPbpwaKD8s&s=fo86UTOQWXf0nQZvvauqIIhlwoZHpRlQMNfQd7Ubu7g&e= 
being revived and made ready for 3.8.


-Ravi

On Jan 27, 2016, at 9:48 PM, Ravishankar N  
wrote:


On 01/26/2016 08:41 AM, Richard Wareing wrote:
In any event, it might be worth having Shreyas detail his 
throttling feature (that can throttle any directory hierarchy no 
less) to illustrate how a simpler design can achieve similar 
results to these more complicated (and it follows, bug-prone) 
approaches.


Richard

Hi Shreyas,

Wondering if you can share the details of the throttling feature 
you're working on. Even if there's no code, a description of what 
it is trying to achieve and how will be great.


Thanks,
Ravi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-02-26 Thread Ravishankar N

Hey Shreyas,
I'll be starting on the TBF-based implementation next week, as this 
needs to be completed by 3.8. If you can send your patch, I'll see if we 
can leverage it too.

Thanks,
Ravi


On 02/13/2016 09:06 AM, Pranith Kumar Karampuri wrote:



On 02/13/2016 12:13 AM, Richard Wareing wrote:

Hey Ravi,

I'll ping Shreyas about this today.  There's also a patch we'll need 
for multi-threaded SHD to fix the least-pri queuing.  The PID of the 
process wasn't tagged correctly via the call frame in my original 
patch.  The patch below fixes this (for 3.6.3), I didn't see 
multi-threaded self heal on github/master yet so let me know what 
branch you need this patch on and I can come up with a clean patch.


Hi Richard,
 I reviewed the patch and found that the same needs to be done 
even for ec. So I am thinking of splitting it out into two different 
patches: one patch in syncop-utils which builds the parallelization 
functionality, and another patch which uses it in afr and ec. Do you mind 
if I give it a go? I can complete it by end of Wednesday.


Pranith


Richard


=


diff --git a/xlators/cluster/afr/src/afr-self-heald.c b/xlators/cluster/afr/src/afr-self-heald.c
index 028010d..b0f6248 100644
--- a/xlators/cluster/afr/src/afr-self-heald.c
+++ b/xlators/cluster/afr/src/afr-self-heald.c
@@ -532,6 +532,9 @@ afr_mt_process_entries_done (int ret, call_frame_t *sync_frame,
                 pthread_cond_signal (&mt_data->task_done);
         }
         pthread_mutex_unlock (&mt_data->lock);
+
+        if (task_ctx->frame)
+                AFR_STACK_DESTROY (task_ctx->frame);
         GF_FREE (task_ctx);
         return 0;
 }
@@ -787,6 +790,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         int                                ret = -1;
         afr_mt_process_entries_task_ctx_t *task_ctx;
         afr_mt_data_t                     *mt_data;
+        call_frame_t                      *frame = NULL;

         mt_data = &healer->mt_data;

@@ -799,6 +803,8 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         if (!task_ctx)
                 goto err;

+        task_ctx->frame = afr_frame_create (this);
+
         INIT_LIST_HEAD (&task_ctx->list);
         task_ctx->readdir_xl = this;
         task_ctx->healer = healer;
@@ -812,7 +818,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         // This returns immediately, and afr_mt_process_entries_done will
         // be called when the task is completed e.g. our queue is empty
         ret = synctask_new (this->ctx->env, afr_mt_process_entries_task,
-                            afr_mt_process_entries_done, NULL,
+                            afr_mt_process_entries_done, task_ctx->frame,
                             (void *)task_ctx);

         if (!ret) {
diff --git a/xlators/cluster/afr/src/afr-self-heald.h b/xlators/cluster/afr/src/afr-self-heald.h
index 817e712..1588fc8 100644
--- a/xlators/cluster/afr/src/afr-self-heald.h
+++ b/xlators/cluster/afr/src/afr-self-heald.h
@@ -74,6 +74,7 @@ typedef struct afr_mt_process_entries_task_ctx_ {
         subvol_healer_t    *healer;
         xlator_t           *readdir_xl;
         inode_t            *idx_inode;  /* inode ref for xattrop dir */
+        call_frame_t       *frame;
         unsigned int        entries_healed;
         unsigned int        entries_processed;
         unsigned int        already_healed;


Richard

From: Ravishankar N [ravishan...@redhat.com]
Sent: Sunday, February 07, 2016 11:15 PM
To: Shreyas Siravara
Cc: Richard Wareing; Vijay Bellur; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

Hello,

On 01/29/2016 06:51 AM, Shreyas Siravara wrote:

So the way our throttling works is (intentionally) very simplistic.

(1) When someone mounts an NFS share, we tag the frame with a 32 bit 
hash of the export name they were authorized to mount.
(2) io-stats keeps track of the "current rate" of fops we're seeing 
for that particular mount, using a sampling of fops and a moving 
average over a short period of time.
(3) Based on whether the share violated its allowed rate (which is 
defined in a config file), we tag the FOP as "least-pri". Of course 
this makes the assumption that all NFS endpoints are receiving 
roughly the same # of FOPs. The rate defined in the config file is a 
*per* NFS endpoint number. So if your cluster has 10 NFS endpoints, 
and you've pre-computed that it can do roughly 1000 FOPs per second, 
the rate in the config file would be 100.
(4) IO-Threads then shoves the FOP into the least-pri queue, rather 
than its default. The value is honored all the way down to the bricks.


The code is actually complete, and I'll put it up for review after 
we iron out a few minor issues.

Did you get a chance to send the patch?

Re: [Gluster-devel] Throttling xlator on the bricks

2016-02-12 Thread Pranith Kumar Karampuri



On 02/13/2016 12:13 AM, Richard Wareing wrote:

Hey Ravi,

I'll ping Shreyas about this today.  There's also a patch we'll need for 
multi-threaded SHD to fix the least-pri queuing.  The PID of the process wasn't 
tagged correctly via the call frame in my original patch.  The patch below 
fixes this (for 3.6.3), I didn't see multi-threaded self heal on github/master 
yet so let me know what branch you need this patch on and I can come up with a 
clean patch.


Hi Richard,
 I reviewed the patch and found that the same needs to be done 
even for ec. So I am thinking of splitting it out into two different 
patches: one patch in syncop-utils which builds the parallelization 
functionality, and another patch which uses it in afr and ec. Do you mind 
if I give it a go? I can complete it by end of Wednesday.


Pranith


Richard


=


diff --git a/xlators/cluster/afr/src/afr-self-heald.c b/xlators/cluster/afr/src/afr-self-heald.c
index 028010d..b0f6248 100644
--- a/xlators/cluster/afr/src/afr-self-heald.c
+++ b/xlators/cluster/afr/src/afr-self-heald.c
@@ -532,6 +532,9 @@ afr_mt_process_entries_done (int ret, call_frame_t *sync_frame,
                 pthread_cond_signal (&mt_data->task_done);
         }
         pthread_mutex_unlock (&mt_data->lock);
+
+        if (task_ctx->frame)
+                AFR_STACK_DESTROY (task_ctx->frame);
         GF_FREE (task_ctx);
         return 0;
 }
@@ -787,6 +790,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         int                                ret = -1;
         afr_mt_process_entries_task_ctx_t *task_ctx;
         afr_mt_data_t                     *mt_data;
+        call_frame_t                      *frame = NULL;

         mt_data = &healer->mt_data;

@@ -799,6 +803,8 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         if (!task_ctx)
                 goto err;

+        task_ctx->frame = afr_frame_create (this);
+
         INIT_LIST_HEAD (&task_ctx->list);
         task_ctx->readdir_xl = this;
         task_ctx->healer = healer;
@@ -812,7 +818,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         // This returns immediately, and afr_mt_process_entries_done will
         // be called when the task is completed e.g. our queue is empty
         ret = synctask_new (this->ctx->env, afr_mt_process_entries_task,
-                            afr_mt_process_entries_done, NULL,
+                            afr_mt_process_entries_done, task_ctx->frame,
                             (void *)task_ctx);

         if (!ret) {
diff --git a/xlators/cluster/afr/src/afr-self-heald.h b/xlators/cluster/afr/src/afr-self-heald.h
index 817e712..1588fc8 100644
--- a/xlators/cluster/afr/src/afr-self-heald.h
+++ b/xlators/cluster/afr/src/afr-self-heald.h
@@ -74,6 +74,7 @@ typedef struct afr_mt_process_entries_task_ctx_ {
         subvol_healer_t    *healer;
         xlator_t           *readdir_xl;
         inode_t            *idx_inode;  /* inode ref for xattrop dir */
+        call_frame_t       *frame;
         unsigned int        entries_healed;
         unsigned int        entries_processed;
         unsigned int        already_healed;


Richard

From: Ravishankar N [ravishan...@redhat.com]
Sent: Sunday, February 07, 2016 11:15 PM
To: Shreyas Siravara
Cc: Richard Wareing; Vijay Bellur; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

Hello,

On 01/29/2016 06:51 AM, Shreyas Siravara wrote:

So the way our throttling works is (intentionally) very simplistic.

(1) When someone mounts an NFS share, we tag the frame with a 32 bit hash of 
the export name they were authorized to mount.
(2) io-stats keeps track of the "current rate" of fops we're seeing for that 
particular mount, using a sampling of fops and a moving average over a short period of 
time.
(3) Based on whether the share violated its allowed rate (which is defined in a config 
file), we tag the FOP as "least-pri". Of course this makes the assumption that 
all NFS endpoints are receiving roughly the same # of FOPs. The rate defined in the 
config file is a *per* NFS endpoint number. So if your cluster has 10 NFS endpoints, and 
you've pre-computed that it can do roughly 1000 FOPs per second, the rate in the config 
file would be 100.
(4) IO-Threads then shoves the FOP into the least-pri queue, rather than its 
default. The value is honored all the way down to the bricks.

The code is actually complete, and I'll put it up for review after we iron out 
a few minor issues.

Did you get a chance to send the patch? Just wanted to run some tests
and see if this is all we need at the moment to regulate shd traffic,
especially with Richard's multi-threaded heal patch
https://urldefense.proofpoint.com/v2/url?u=http-3A__review.gluster.org_-2

Re: [Gluster-devel] Throttling xlator on the bricks

2016-02-12 Thread Richard Wareing
Hey Ravi,

I'll ping Shreyas about this today.  There's also a patch we'll need for 
multi-threaded SHD to fix the least-pri queuing.  The PID of the process wasn't 
tagged correctly via the call frame in my original patch.  The patch below 
fixes this (for 3.6.3), I didn't see multi-threaded self heal on github/master 
yet so let me know what branch you need this patch on and I can come up with a 
clean patch.

Richard


=


diff --git a/xlators/cluster/afr/src/afr-self-heald.c b/xlators/cluster/afr/src/afr-self-heald.c
index 028010d..b0f6248 100644
--- a/xlators/cluster/afr/src/afr-self-heald.c
+++ b/xlators/cluster/afr/src/afr-self-heald.c
@@ -532,6 +532,9 @@ afr_mt_process_entries_done (int ret, call_frame_t *sync_frame,
                 pthread_cond_signal (&mt_data->task_done);
         }
         pthread_mutex_unlock (&mt_data->lock);
+
+        if (task_ctx->frame)
+                AFR_STACK_DESTROY (task_ctx->frame);
         GF_FREE (task_ctx);
         return 0;
 }
@@ -787,6 +790,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         int                                ret = -1;
         afr_mt_process_entries_task_ctx_t *task_ctx;
         afr_mt_data_t                     *mt_data;
+        call_frame_t                      *frame = NULL;

         mt_data = &healer->mt_data;

@@ -799,6 +803,8 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         if (!task_ctx)
                 goto err;

+        task_ctx->frame = afr_frame_create (this);
+
         INIT_LIST_HEAD (&task_ctx->list);
         task_ctx->readdir_xl = this;
         task_ctx->healer = healer;
@@ -812,7 +818,7 @@ _afr_mt_create_process_entries_task (xlator_t *this,
         // This returns immediately, and afr_mt_process_entries_done will
         // be called when the task is completed e.g. our queue is empty
         ret = synctask_new (this->ctx->env, afr_mt_process_entries_task,
-                            afr_mt_process_entries_done, NULL,
+                            afr_mt_process_entries_done, task_ctx->frame,
                             (void *)task_ctx);

         if (!ret) {
diff --git a/xlators/cluster/afr/src/afr-self-heald.h b/xlators/cluster/afr/src/afr-self-heald.h
index 817e712..1588fc8 100644
--- a/xlators/cluster/afr/src/afr-self-heald.h
+++ b/xlators/cluster/afr/src/afr-self-heald.h
@@ -74,6 +74,7 @@ typedef struct afr_mt_process_entries_task_ctx_ {
         subvol_healer_t    *healer;
         xlator_t           *readdir_xl;
         inode_t            *idx_inode;  /* inode ref for xattrop dir */
+        call_frame_t       *frame;
         unsigned int        entries_healed;
         unsigned int        entries_processed;
         unsigned int        already_healed;


Richard

From: Ravishankar N [ravishan...@redhat.com]
Sent: Sunday, February 07, 2016 11:15 PM
To: Shreyas Siravara
Cc: Richard Wareing; Vijay Bellur; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

Hello,

On 01/29/2016 06:51 AM, Shreyas Siravara wrote:
> So the way our throttling works is (intentionally) very simplistic.
>
> (1) When someone mounts an NFS share, we tag the frame with a 32 bit hash of 
> the export name they were authorized to mount.
> (2) io-stats keeps track of the "current rate" of fops we're seeing for that 
> particular mount, using a sampling of fops and a moving average over a short 
> period of time.
> (3) Based on whether the share violated its allowed rate (which is defined in 
> a config file), we tag the FOP as "least-pri". Of course this makes the 
> assumption that all NFS endpoints are receiving roughly the same # of FOPs. 
> The rate defined in the config file is a *per* NFS endpoint number. So if 
> your cluster has 10 NFS endpoints, and you've pre-computed that it can do 
> roughly 1000 FOPs per second, the rate in the config file would be 100.
> (4) IO-Threads then shoves the FOP into the least-pri queue, rather than its 
> default. The value is honored all the way down to the bricks.
>
> The code is actually complete, and I'll put it up for review after we iron 
> out a few minor issues.

Did you get a chance to send the patch? Just wanted to run some tests
and see if this is all we need at the moment to regulate shd traffic,
especially with Richard's multi-threaded heal patch
https://urldefense.proofpoint.com/v2/url?u=http-3A__review.gluster.org_-23_c_13329_&d=CwIC-g&c=5VD0RTtNlTh3ycd41b3MUw&r=qJ8Lp7ySfpQklq3QZr44Iw&m=B873EiTlTeUXIjEcoutZ6Py5KL0bwXIVroPbpwaKD8s&s=fo86UTOQWXf0nQZvvauqIIhlwoZHpRlQMNfQd7Ubu7g&e=
  being revived and made ready for 3.8.

-Ravi

>
>> On Jan 27, 2016, at 9:48 PM, Ravishankar N  wrote:
>>
>> On 01/26/2016 08:41 AM, Richard Wareing wrote:
>>> In

Re: [Gluster-devel] Throttling xlator on the bricks

2016-02-07 Thread Ravishankar N

Hello,

On 01/29/2016 06:51 AM, Shreyas Siravara wrote:

So the way our throttling works is (intentionally) very simplistic.

(1) When someone mounts an NFS share, we tag the frame with a 32 bit hash of 
the export name they were authorized to mount.
(2) io-stats keeps track of the "current rate" of fops we're seeing for that 
particular mount, using a sampling of fops and a moving average over a short period of 
time.
(3) Based on whether the share violated its allowed rate (which is defined in a config 
file), we tag the FOP as "least-pri". Of course this makes the assumption that 
all NFS endpoints are receiving roughly the same # of FOPs. The rate defined in the 
config file is a *per* NFS endpoint number. So if your cluster has 10 NFS endpoints, and 
you've pre-computed that it can do roughly 1000 FOPs per second, the rate in the config 
file would be 100.
(4) IO-Threads then shoves the FOP into the least-pri queue, rather than its 
default. The value is honored all the way down to the bricks.

The code is actually complete, and I'll put it up for review after we iron out 
a few minor issues.


Did you get a chance to send the patch? Just wanted to run some tests 
and see if this is all we need at the moment to regulate shd traffic, 
especially with Richard's multi-threaded heal patch 
http://review.gluster.org/#/c/13329/ being revived and made ready for 3.8.


-Ravi




On Jan 27, 2016, at 9:48 PM, Ravishankar N  wrote:

On 01/26/2016 08:41 AM, Richard Wareing wrote:

In any event, it might be worth having Shreyas detail his throttling feature 
(that can throttle any directory hierarchy no less) to illustrate how a simpler 
design can achieve similar results to these more complicated (and it 
follows, bug-prone) approaches.

Richard

Hi Shreyas,

Wondering if you can share the details of the throttling feature you're working 
on. Even if there's no code, a description of what it is trying to achieve and 
how will be great.

Thanks,
Ravi


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-28 Thread Shreyas Siravara
So the way our throttling works is (intentionally) very simplistic. 

(1) When someone mounts an NFS share, we tag the frame with a 32 bit hash of 
the export name they were authorized to mount.
(2) io-stats keeps track of the "current rate" of fops we're seeing for that 
particular mount, using a sampling of fops and a moving average over a short 
period of time.
(3) Based on whether the share violated its allowed rate (which is defined in a 
config file), we tag the FOP as "least-pri". Of course this makes the 
assumption that all NFS endpoints are receiving roughly the same # of FOPs. The 
rate defined in the config file is a *per* NFS endpoint number. So if your 
cluster has 10 NFS endpoints, and you've pre-computed that it can do roughly 
1000 FOPs per second, the rate in the config file would be 100.
(4) IO-Threads then shoves the FOP into the least-pri queue, rather than its 
default. The value is honored all the way down to the bricks.

The code is actually complete, and I'll put it up for review after we iron out 
a few minor issues.

> On Jan 27, 2016, at 9:48 PM, Ravishankar N  wrote:
> 
> On 01/26/2016 08:41 AM, Richard Wareing wrote:
>> In any event, it might be worth having Shreyas detail his throttling feature 
>> (that can throttle any directory hierarchy no less) to illustrate how a 
>> simpler design can achieve similar results to these more complicated (and it 
>> follows, bug-prone) approaches.
>> 
>> Richard
> Hi Shreyas,
> 
> Wondering if you can share the details of the throttling feature you're 
> working on. Even if there's no code, a description of what it is trying to 
> achieve and how will be great.
> 
> Thanks,
> Ravi

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-28 Thread Jeff Darcy
> TBF isn't complicated at all - it's widely used for traffic shaping, cgroups,
> and UML to rate limit disk I/O.

It's not complicated and it's widely used, but that doesn't mean it's
the right fit for our needs.  Token buckets are good to create a
*ceiling* on resource utilization, but what if you want to set a floor
or allocate fair shares instead?  Even if what you want is a ceiling,
there's a problem of how many tokens should be entering the system.
Ideally that number should match the actual number of operations the
resource can handle per time quantum, but for networks and disks that
number can be pretty variable.  That's why network QoS is a poorly
solved problem and disk QoS is even worse.

To create a floor using token buckets, you have to chain buckets
together.  Each user/activity draws first from its own bucket, setting
the floor.  When that bucket is exhausted, it starts drawing from the
next bucket, eventually from an infinite "best effort" bucket at the end
of the chain.  To allocate fair shares (which is probably closest to
what we want in this case) you need active monitoring of how much work
the resource is actually doing.  As that number fluctuates, so does the
number of tokens, which are then divided *proportionally* between
buckets.  Hybrid approaches - e.g. low and high watermarks,
bucket-filling priorities - are also possible.
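
[Editorial note: purely as a sketch of the bucket-chaining idea described above, with made-up names — each consumer tries its own bucket first (its floor), then falls through to a shared best-effort bucket at the end of the chain.]

#include <stdint.h>

typedef struct tbf_bucket {
        int64_t             tokens;   /* refilled elsewhere at a steady rate */
        struct tbf_bucket  *next;     /* fallback; last one is "best effort" */
} tbf_bucket_t;

/* Take 'cost' tokens, walking down the chain; returns 0 on success,
 * -1 if every bucket in the chain is exhausted (caller queues the FOP). */
static int
tbf_chain_take (tbf_bucket_t *bucket, int64_t cost)
{
        for (; bucket != NULL; bucket = bucket->next) {
                if (bucket->tokens >= cost) {
                        bucket->tokens -= cost;
                        return 0;
                }
        }
        return -1;
}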

Then we get to the problem of how to distribute a resource fairly
*across nodes* when the limits are actually being applied locally on
each.  This is very similar to the problem we faced with quota over DHT,
and the same kind of approaches (e.g. a "balancing daemon") might apply.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-27 Thread Ravishankar N

On 01/26/2016 08:41 AM, Richard Wareing wrote:

In any event, it might be worth having Shreyas detail his throttling feature 
(that can throttle any directory hierarchy no less) to illustrate how a simpler 
design can achieve similar results to these more complicated (and it 
follows, bug-prone) approaches.

Richard

Hi Shreyas,

Wondering if you can share the details of the throttling feature you're 
working on. Even if there's no code, a description of what it is trying 
to achieve and how will be great.


Thanks,
Ravi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-27 Thread Raghavendra Bhat
There is already a patch submitted for moving TBF part to libglusterfs. It
is under review.
http://review.gluster.org/#/c/12413/


Regards,
Raghavendra

On Mon, Jan 25, 2016 at 2:26 AM, Venky Shankar  wrote:

> On Mon, Jan 25, 2016 at 11:06:26AM +0530, Ravishankar N wrote:
> > Hi,
> >
> > We are planning to introduce a throttling xlator on the server (brick)
> > process to regulate FOPS. The main motivation is to solve complaints
> about
> > AFR selfheal taking too much of CPU resources. (due to too many fops for
> > entry
> > self-heal, rchecksums for data self-heal etc.)
> >
> > The throttling is achieved using the Token Bucket Filter algorithm (TBF).
> > TBF
> > is already used by bitrot's bitd signer (which is a client process) in
> > gluster to regulate the CPU intensive check-sum calculation. By putting
> the
> > logic on the brick side, multiple clients- selfheal, bitrot, rebalance or
> > even the mounts themselves can avail the benefits of throttling.
>
>   [Providing current TBF implementation link for completeness]
>
>
> https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-tbf.c
>
> Also, it would be beneficial to have the core TBF implementation as part of
> libglusterfs so as to be consumable by the server side xlator component to
> throttle dispatched FOPs and for daemons to throttle anything that's
> outside
> "brick" boundary (such as cpu, etc..).
>
> >
> > The TBF algorithm in a nutshell is as follows: There is a bucket which is
> > filled
> > at a steady (configurable) rate with tokens. Each FOP will need a fixed
> > amount
> > of tokens to be processed. If the bucket has that many tokens, the FOP is
> > allowed and that many tokens are removed from the bucket. If not, the
> FOP is
> > queued until the bucket is filled.
> >
> > The xlator will need to reside above io-threads and can have different
> > buckets,
> > one per client. There has to be a communication mechanism between the
> client
> > and
> > the brick (IPC?) to tell what FOPS need to be regulated from it, and the
> no.
> > of
> > tokens needed etc. These need to be reconfigurable via appropriate
> > mechanisms.
> > Each bucket will have a token filler thread which will fill the tokens in
> > it.
> > The main thread will enqueue heals in a list in the bucket if there
> aren't
> > enough tokens. Once the token filler detects some FOPS can be serviced,
> it
> > will
> > send a cond-broadcast to a dequeue thread which will process (stack wind)
> > all
> > the FOPS that have the required no. of tokens from all buckets.
> >
> > This is just a high level abstraction: requesting feedback on any aspect
> of
> > this feature. what kind of mechanism is best between the client/bricks
> for
> > tuning various parameters? What other requirements do you foresee?
> >
> > Thanks,
> > Ravi
>
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Venky Shankar
On Tue, Jan 26, 2016 at 03:11:50AM +, Richard Wareing wrote:
> > If there is one bucket per client and one thread per bucket, it would be
> > difficult to scale as the number of clients increase. How can we do this
> > better?
> 
> On this note... consider that 10's of thousands of clients are not 
> unrealistic in production :).  Using a thread per bucket would also 
> be unwise..
> 
> On the idea in general, I'm just wondering if there's specific (real-world) 
> cases where this has even been an issue where least-prio queuing hasn't been 
> able to handle?  Or is this more of a theoretical concern?  I ask as I've not 
> really encountered situations where I wished I could give more FOPs to SHD vs 
> rebalance and such.
> 
> In any event, it might be worth having Shreyas detail his throttling feature 
> (that can throttle any directory hierarchy no less) to illustrate how a 
> simpler design can achieve similar results to these more complicated (and it 
> follows, bug-prone) approaches.

TBF isn't complicated at all - it's widely used for traffic shaping, cgroups, 
and UML to rate limit disk I/O.

But I won't hurry things along; I'll wait to hear from Shreyas regarding his 
throttling design.

> 
> Richard
> 
> 
> From: gluster-devel-boun...@gluster.org [gluster-devel-boun...@gluster.org] 
> on behalf of Vijay Bellur [vbel...@redhat.com]
> Sent: Monday, January 25, 2016 6:44 PM
> To: Ravishankar N; Gluster Devel
> Subject: Re: [Gluster-devel] Throttling xlator on the bricks
> 
> On 01/25/2016 12:36 AM, Ravishankar N wrote:
> > Hi,
> >
> > We are planning to introduce a throttling xlator on the server (brick)
> > process to regulate FOPS. The main motivation is to solve complaints about
> > AFR selfheal taking too much of CPU resources. (due to too many fops for
> > entry
> > self-heal, rchecksums for data self-heal etc.)
> 
> 
> I am wondering if we can re-use the same xlator for throttling
> bandwidth, iops etc. in addition to fops. Based on admin configured
> policies we could provide different upper thresholds to different
> clients/tenants and this could prove to be a useful feature in
> multitenant deployments to avoid starvation/noisy neighbor class of
> problems. Has any thought gone in this direction?
> 
> >
> > The throttling is achieved using the Token Bucket Filter algorithm
> > (TBF). TBF
> > is already used by bitrot's bitd signer (which is a client process) in
> > gluster to regulate the CPU intensive check-sum calculation. By putting the
> > logic on the brick side, multiple clients- selfheal, bitrot, rebalance or
> > even the mounts themselves can avail the benefits of throttling.
> >
> > The TBF algorithm in a nutshell is as follows: There is a bucket which
> > is filled
> > at a steady (configurable) rate with tokens. Each FOP will need a fixed
> > amount
> > of tokens to be processed. If the bucket has that many tokens, the FOP is
> > allowed and that many tokens are removed from the bucket. If not, the FOP is
> > queued until the bucket is filled.
> >
> > The xlator will need to reside above io-threads and can have different
> > buckets,
> > one per client. There has to be a communication mechanism between the
> > client and
> > the brick (IPC?) to tell what FOPS need to be regulated from it, and the
> > no. of
> > tokens needed etc. These need to be reconfigurable via appropriate
> > mechanisms.
> > Each bucket will have a token filler thread which will fill the tokens
> > in it.
> 
> If there is one bucket per client and one thread per bucket, it would be
> difficult to scale as the number of clients increase. How can we do this
> better?
> 
> > The main thread will enqueue heals in a list in the bucket if there aren't
> > enough tokens. Once the token filler detects some FOPS can be serviced,
> > it will
> > send a cond-broadcast to a dequeue thread which will process (stack
> > wind) all
> > the FOPS that have the required no. of tokens from all buckets.
> >
> > This is just a high level abstraction: requesting feedback on any aspect of
> > this feature. what kind of mechanism is best between the client/bricks for
> > tuning various parameters? What other requirements do you foresee?
> >
> 
> I am in favor of having administrator defined policies or templates
> (collection of policies) being used to provide the tuning parameter per
> client or a set of clients. We could even have a default template per
> use case etc. 

Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Joe Julian



On 01/25/16 20:36, Pranith Kumar Karampuri wrote:



On 01/26/2016 08:41 AM, Richard Wareing wrote:
If there is one bucket per client and one thread per bucket, it 
would be
difficult to scale as the number of clients increase. How can we do 
this

better?
On this note... consider that 10's of thousands of clients are not 
unrealistic in production :).  Using a thread per bucket would also 
be unwise..


There is only one thread and this solution is for internal 
processes (shd, rebalance, quota, etc.) not coming in the way of clients 
which do I/O.




On the idea in general, I'm just wondering if there's specific 
(real-world) cases where this has even been an issue where least-prio 
queuing hasn't been able to handle?  Or is this more of a theoretical 
concern?  I ask as I've not really encountered situations where I 
wished I could give more FOPs to SHD vs rebalance and such.


I have seen users resort to offline healing of the bricks whenever a 
brick is replaced, or a new brick is added to replication to increase 
replica count. When entry self-heal happens or big VM image data 
self-heals which do rchecksums, CPU spikes are seen and I/O becomes 
useless.
This is the recent thread where a user ran into similar problem (just 
yesterday) (This is a combination of client-side healing and 
healing-load):

http://www.gluster.org/pipermail/gluster-users/2016-January/025051.html

We can find more of such threads if we put some time to dig into the 
mailing list.
I personally have seen people even resort to things like, "we let 
gluster heal over the weekend or in the nights when none of us are 
working on the volumes" etc.


I get at least weekly complaints of such on the IRC channel. A lot of 
them are in virtual environments (aws).




There are people who complain healing is too slow too. We get both 
kinds of complaints :-). Your multi-threaded shd patch is going to 
help here. I somehow feel you guys are in this set of people :-).


+1




In any event, it might be worth having Shreyas detail his throttling 
feature (that can throttle any directory hierarchy no less) to 
illustrate how a simpler design can achieve similar results to these 
more complicated (and it follows, bug-prone) approaches.


The solution we came up with is about throttling internal I/O. And 
there are only 4/5 such processes (shd, rebalance, quota, bitd, etc.). 
What you are saying above about throttling any directory hierarchy 
seems a bit different than what we are trying to solve, from the looks 
of it (at least from the small description you gave above :-) ). 
Shreyas' mail detailing the feature would definitely help us 
understand what each of us are trying to solve. We want to GA both 
multi-threaded shd and this feature for 3.8.


Pranith


Richard


From: gluster-devel-boun...@gluster.org 
[gluster-devel-boun...@gluster.org] on behalf of Vijay Bellur 
[vbel...@redhat.com]

Sent: Monday, January 25, 2016 6:44 PM
To: Ravishankar N; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

On 01/25/2016 12:36 AM, Ravishankar N wrote:

Hi,

We are planning to introduce a throttling xlator on the server (brick)
process to regulate FOPS. The main motivation is to solve complaints 
about
AFR selfheal taking too much of CPU resources. (due to too many fops 
for

entry
self-heal, rchecksums for data self-heal etc.)


I am wondering if we can re-use the same xlator for throttling
bandwidth, iops etc. in addition to fops. Based on admin configured
policies we could provide different upper thresholds to different
clients/tenants and this could prove to be a useful feature in
multitenant deployments to avoid starvation/noisy neighbor class of
problems. Has any thought gone in this direction?


The throttling is achieved using the Token Bucket Filter algorithm
(TBF). TBF
is already used by bitrot's bitd signer (which is a client process) in
gluster to regulate the CPU intensive check-sum calculation. By 
putting the
logic on the brick side, multiple clients- selfheal, bitrot, 
rebalance or

even the mounts themselves can avail the benefits of throttling.

The TBF algorithm in a nutshell is as follows: There is a bucket which
is filled
at a steady (configurable) rate with tokens. Each FOP will need a fixed
amount
of tokens to be processed. If the bucket has that many tokens, the 
FOP is
allowed and that many tokens are removed from the bucket. If not, 
the FOP is

queued until the bucket is filled.

The xlator will need to reside above io-threads and can have different
buckets,
one per client. There has to be a communication mechanism between the
client and
the brick (IPC?) to tell what FOPS need to be regulated from it, and 
the

no. of
tokens needed etc. These need to be reconfigurable via appropriate
mechanisms.
Each bucket will have a token filler thread which will fill the tokens
in it.

If there is one bucket per client and on

Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Joe Julian



On 01/25/16 18:24, Ravishankar N wrote:


On 01/26/2016 01:22 AM, Shreyas Siravara wrote:
Just out of curiosity, what benefits do we think this throttling 
xlator would provide over the "enable-least-priority" option (where 
we put all the fops from SHD, etc into a least pri queue)?




For one, it could provide more granularity on the amount of throttling 
you want to do, for specific fops, from specific clients. If the only 
I/O going through the bricks was from the SHD, they would all be 
least-priority but yet consume an unfair % of the CPU. We could tweak 
`performance.least-rate-limit` to throttle but it would be a global 
option.


Right, because as it is now, when shd is the only client, it queues up 
so many iops that higher-priority ops are still getting delayed.





On Jan 25, 2016, at 12:29 AM, Venky Shankar  
wrote:


On Mon, Jan 25, 2016 at 01:08:38PM +0530, Ravishankar N wrote:

On 01/25/2016 12:56 PM, Venky Shankar wrote:
Also, it would be beneficial to have the core TBF implementation 
as part of
libglusterfs so as to be consumable by the server side xlator 
component to
throttle dispatched FOPs and for daemons to throttle anything 
that's outside

"brick" boundary (such as cpu, etc..).
That makes sense. We were initially thinking to overload 
posix_rchecksum()

to do the SHA256 sums for the signer.
That does have advantages by avoiding network roundtrips by computing 
SHA* locally.
TBF could still implement ->rchecksum and throttle that (on behalf 
of clients,
residing on the server - internal daemons). Placing the core 
implementation as

part of libglusterfs would still provide the flexibility.




___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.gluster.org_mailman_listinfo_gluster-2Ddevel&d=CwICAg&c=5VD0RTtNlTh3ycd41b3MUw&r=N7LE2BKIHDDBvkYkakYthA&m=9W9xtRg0TIEUvFL-8HpUCux8psoWKkUbEFiwqykRwH4&s=OVF0dZRXt8GFcIxsHlkbNjH-bjD9097q5hjVVHgOFkQ&e= 




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Pranith Kumar Karampuri



On 01/26/2016 08:41 AM, Richard Wareing wrote:

If there is one bucket per client and one thread per bucket, it would be
difficult to scale as the number of clients increase. How can we do this
better?

On this note... consider that 10's of thousands of clients are not unrealistic 
in production :).  Using a thread per bucket would also be unwise..


There is only one thread and this solution is for internal 
processes (shd, rebalance, quota, etc.) not coming in the way of clients 
which do I/O.




On the idea in general, I'm just wondering if there's specific (real-world) 
cases where this has even been an issue where least-prio queuing hasn't been 
able to handle?  Or is this more of a theoretical concern?  I ask as I've not 
really encountered situations where I wished I could give more FOPs to SHD vs 
rebalance and such.


I have seen users resort to offline healing of the bricks whenever a 
brick is replaced, or a new brick is added to replication to increase 
replica count. When entry self-heal happens or big VM image data 
self-heals which do rchecksums, CPU spikes are seen and I/O becomes useless.
This is the recent thread where a user ran into similar problem (just 
yesterday) (This is a combination of client-side healing and healing-load):

http://www.gluster.org/pipermail/gluster-users/2016-January/025051.html

We can find more of such threads if we put some time to dig into the 
mailing list.
I personally have seen people even resort to things like, "we let 
gluster heal over the weekend or in the nights when none of us are 
working on the volumes" etc.


There are people who complain healing is too slow too. We get both kinds 
of complaints :-). Your multi-threaded shd patch is going to help here. 
I somehow feel you guys are in this set of people :-).


In any event, it might be worth having Shreyas detail his throttling feature 
(that can throttle any directory hierarchy no less) to illustrate how a simpler 
design can achieve similar results to these more complicated (and it 
follows, bug-prone) approaches.


The solution we came up with is about throttling internal I/O. And there 
are only 4/5 such processes (shd, rebalance, quota, bitd, etc.). What you 
are saying above about throttling any directory hierarchy seems a bit 
different than what we are trying to solve, from the looks of it (at least 
from the small description you gave above :-) ). Shreyas' mail detailing 
the feature would definitely help us understand what each of us are 
trying to solve. We want to GA both multi-threaded shd and this feature 
for 3.8.


Pranith


Richard


From: gluster-devel-boun...@gluster.org [gluster-devel-boun...@gluster.org] on 
behalf of Vijay Bellur [vbel...@redhat.com]
Sent: Monday, January 25, 2016 6:44 PM
To: Ravishankar N; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

On 01/25/2016 12:36 AM, Ravishankar N wrote:

Hi,

We are planning to introduce a throttling xlator on the server (brick)
process to regulate FOPS. The main motivation is to solve complaints about
AFR selfheal taking too much of CPU resources. (due to too many fops for
entry
self-heal, rchecksums for data self-heal etc.)


I am wondering if we can re-use the same xlator for throttling
bandwidth, iops etc. in addition to fops. Based on admin configured
policies we could provide different upper thresholds to different
clients/tenants and this could prove to be a useful feature in
multitenant deployments to avoid starvation/noisy neighbor class of
problems. Has any thought gone in this direction?


The throttling is achieved using the Token Bucket Filter algorithm
(TBF). TBF
is already used by bitrot's bitd signer (which is a client process) in
gluster to regulate the CPU intensive check-sum calculation. By putting the
logic on the brick side, multiple clients- selfheal, bitrot, rebalance or
even the mounts themselves can avail the benefits of throttling.

The TBF algorithm in a nutshell is as follows: There is a bucket which
is filled
at a steady (configurable) rate with tokens. Each FOP will need a fixed
amount
of tokens to be processed. If the bucket has that many tokens, the FOP is
allowed and that many tokens are removed from the bucket. If not, the FOP is
queued until the bucket is filled.

The xlator will need to reside above io-threads and can have different
buckets,
one per client. There has to be a communication mechanism between the
client and
the brick (IPC?) to tell what FOPS need to be regulated from it, and the
no. of
tokens needed etc. These need to be reconfigurable via appropriate
mechanisms.
Each bucket will have a token filler thread which will fill the tokens
in it.

If there is one bucket per client and one thread per bucket, it would be
difficult to scale as the number of clients increase. How can we do this
better?


The main thread will enqueue heals in a list in the bucket if there aren't

Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Pranith Kumar Karampuri



On 01/26/2016 08:14 AM, Vijay Bellur wrote:

On 01/25/2016 12:36 AM, Ravishankar N wrote:

Hi,

We are planning to introduce a throttling xlator on the server (brick)
process to regulate FOPS. The main motivation is to solve complaints 
about

AFR selfheal taking too much of CPU resources. (due to too many fops for
entry
self-heal, rchecksums for data self-heal etc.)



I am wondering if we can re-use the same xlator for throttling 
bandwidth, iops etc. in addition to fops. Based on admin configured 
policies we could provide different upper thresholds to different 
clients/tenants and this could prove to be a useful feature in
multitenant deployments to avoid starvation/noisy neighbor class of 
problems. Has any thought gone in this direction?


Nope. It was mainly about internal processes at the moment.





The throttling is achieved using the Token Bucket Filter algorithm
(TBF). TBF
is already used by bitrot's bitd signer (which is a client process) in
gluster to regulate the CPU intensive check-sum calculation. By 
putting the
logic on the brick side, multiple clients- selfheal, bitrot, 
rebalance or

even the mounts themselves can avail the benefits of throttling.

The TBF algorithm in a nutshell is as follows: There is a bucket which
is filled
at a steady (configurable) rate with tokens. Each FOP will need a fixed
amount
of tokens to be processed. If the bucket has that many tokens, the 
FOP is
allowed and that many tokens are removed from the bucket. If not, the 
FOP is

queued until the bucket is filled.

The xlator will need to reside above io-threads and can have different
buckets,
one per client. There has to be a communication mechanism between the
client and
the brick (IPC?) to tell what FOPS need to be regulated from it, and the
no. of
tokens needed etc. These need to be reconfigurable via appropriate
mechanisms.
Each bucket will have a token filler thread which will fill the tokens
in it.


If there is one bucket per client and one thread per bucket, it would 
be difficult to scale as the number of clients increase. How can we do 
this better?


It is the same thread for all the buckets, because the number of internal 
clients at the moment is in single digits. The problem statement we have 
right now doesn't consider what you are looking for.




The main thread will enqueue heals in a list in the bucket if there 
aren't

enough tokens. Once the token filler detects some FOPS can be serviced,
it will
send a cond-broadcast to a dequeue thread which will process (stack
wind) all
the FOPS that have the required no. of tokens from all buckets.

This is just a high level abstraction: requesting feedback on any 
aspect of
this feature. what kind of mechanism is best between the 
client/bricks for

tuning various parameters? What other requirements do you foresee?



I am in favor of having administrator defined policies or templates 
(collection of policies) being used to provide the tuning parameter 
per client or a set of clients. We could even have a default template 
per use case etc. Is there a specific need to have this negotiation 
between clients and servers?


Thanks,
Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Richard Wareing
> If there is one bucket per client and one thread per bucket, it would be
> difficult to scale as the number of clients increase. How can we do this
> better?

On this note... consider that 10's of thousands of clients are not unrealistic 
in production :).  Using a thread per bucket would also be unwise..

On the idea in general, I'm just wondering if there's specific (real-world) 
cases where this has even been an issue where least-prio queuing hasn't been 
able to handle?  Or is this more of a theoretical concern?  I ask as I've not 
really encountered situations where I wished I could give more FOPs to SHD vs 
rebalance and such.

In any event, it might be worth having Shreyas detail his throttling feature 
(that can throttle any directory hierarchy no less) to illustrate how a simpler 
design can achieve similar results to these more complicated (and it 
follows, bug-prone) approaches.

Richard


From: gluster-devel-boun...@gluster.org [gluster-devel-boun...@gluster.org] on 
behalf of Vijay Bellur [vbel...@redhat.com]
Sent: Monday, January 25, 2016 6:44 PM
To: Ravishankar N; Gluster Devel
Subject: Re: [Gluster-devel] Throttling xlator on the bricks

On 01/25/2016 12:36 AM, Ravishankar N wrote:
> Hi,
>
> We are planning to introduce a throttling xlator on the server (brick)
> process to regulate FOPS. The main motivation is to solve complaints about
> AFR selfheal taking too much of CPU resources. (due to too many fops for
> entry
> self-heal, rchecksums for data self-heal etc.)


I am wondering if we can re-use the same xlator for throttling
bandwidth, iops etc. in addition to fops. Based on admin configured
policies we could provide different upper thresholds to different
> clients/tenants and this could prove to be a useful feature in
multitenant deployments to avoid starvation/noisy neighbor class of
problems. Has any thought gone in this direction?

>
> The throttling is achieved using the Token Bucket Filter algorithm
> (TBF). TBF
> is already used by bitrot's bitd signer (which is a client process) in
> gluster to regulate the CPU intensive check-sum calculation. By putting the
> logic on the brick side, multiple clients- selfheal, bitrot, rebalance or
> even the mounts themselves can avail the benefits of throttling.
>
> The TBF algorithm in a nutshell is as follows: There is a bucket which
> is filled
> at a steady (configurable) rate with tokens. Each FOP will need a fixed
> amount
> of tokens to be processed. If the bucket has that many tokens, the FOP is
> allowed and that many tokens are removed from the bucket. If not, the FOP is
> queued until the bucket is filled.
>
> The xlator will need to reside above io-threads and can have different
> buckets,
> one per client. There has to be a communication mechanism between the
> client and
> the brick (IPC?) to tell what FOPS need to be regulated from it, and the
> no. of
> tokens needed etc. These need to be reconfigurable via appropriate
> mechanisms.
> Each bucket will have a token filler thread which will fill the tokens
> in it.

If there is one bucket per client and one thread per bucket, it would be
difficult to scale as the number of clients increase. How can we do this
better?

> The main thread will enqueue heals in a list in the bucket if there aren't
> enough tokens. Once the token filler detects some FOPS can be serviced,
> it will
> send a cond-broadcast to a dequeue thread which will process (stack
> wind) all
> the FOPS that have the required no. of tokens from all buckets.
>
> This is just a high level abstraction: requesting feedback on any aspect of
> this feature. what kind of mechanism is best between the client/bricks for
> tuning various parameters? What other requirements do you foresee?
>

I am in favor of having administrator defined policies or templates
(collection of policies) being used to provide the tuning parameter per
client or a set of clients. We could even have a default template per
use case etc. Is there a specific need to have this negotiation between
clients and servers?

Thanks,
Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.gluster.org_mailman_listinfo_gluster-2Ddevel&d=CwICAg&c=5VD0RTtNlTh3ycd41b3MUw&r=qJ8Lp7ySfpQklq3QZr44Iw&m=aQHnnoxK50Ebw77QHtp3ykjC976mJIt2qrIUzpqEViQ&s=Jitbldlbjwye6QI8V33ZoKtVt6-B64p2_-5piVlfXMQ&e=
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Vijay Bellur

On 01/25/2016 12:36 AM, Ravishankar N wrote:

Hi,

We are planning to introduce a throttling xlator on the server (brick)
process to regulate FOPS. The main motivation is to solve complaints about
AFR selfheal taking too much of CPU resources. (due to too many fops for
entry
self-heal, rchecksums for data self-heal etc.)



I am wondering if we can re-use the same xlator for throttling 
bandwidth, iops etc. in addition to fops. Based on admin configured 
policies we could provide different upper thresholds to different 
clients/tenants and this could prove to be a useful feature in
multitenant deployments to avoid starvation/noisy neighbor class of 
problems. Has any thought gone in this direction?




The throttling is achieved using the Token Bucket Filter algorithm
(TBF). TBF
is already used by bitrot's bitd signer (which is a client process) in
gluster to regulate the CPU intensive check-sum calculation. By putting the
logic on the brick side, multiple clients- selfheal, bitrot, rebalance or
even the mounts themselves can avail the benefits of throttling.

The TBF algorithm in a nutshell is as follows: There is a bucket which
is filled
at a steady (configurable) rate with tokens. Each FOP will need a fixed
amount
of tokens to be processed. If the bucket has that many tokens, the FOP is
allowed and that many tokens are removed from the bucket. If not, the FOP is
queued until the bucket is filled.

The xlator will need to reside above io-threads and can have different
buckets,
one per client. There has to be a communication mechanism between the
client and
the brick (IPC?) to tell what FOPS need to be regulated from it, and the
no. of
tokens needed etc. These need to be reconfigurable via appropriate
mechanisms.
Each bucket will have a token filler thread which will fill the tokens
in it.


If there is one bucket per client and one thread per bucket, it would be 
difficult to scale as the number of clients increase. How can we do this 
better?



The main thread will enqueue heals in a list in the bucket if there aren't
enough tokens. Once the token filler detects some FOPS can be serviced,
it will
send a cond-broadcast to a dequeue thread which will process (stack
wind) all
the FOPS that have the required no. of tokens from all buckets.

This is just a high level abstraction: requesting feedback on any aspect of
this feature. what kind of mechanism is best between the client/bricks for
tuning various parameters? What other requirements do you foresee?
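
[Editorial note: a minimal sketch of the bucket / token-filler / dequeue scheme described above — illustrative only, with hypothetical names. The proposal enqueues FOPs and stack-winds them from a dequeue thread once the filler broadcasts; for brevity this sketch simply blocks the caller until enough tokens are available.]

#include <pthread.h>
#include <unistd.h>

typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  refill;
        long            tokens;
        long            fill_per_sec;   /* configurable fill rate */
        long            max_tokens;     /* bucket capacity */
} tbf_bucket_t;

/* Token filler: one thread can service several buckets; a single bucket
 * is shown here for brevity. */
static void *
tbf_filler (void *arg)
{
        tbf_bucket_t *b = arg;

        for (;;) {
                sleep (1);
                pthread_mutex_lock (&b->lock);
                b->tokens += b->fill_per_sec;
                if (b->tokens > b->max_tokens)
                        b->tokens = b->max_tokens;
                pthread_cond_broadcast (&b->refill);
                pthread_mutex_unlock (&b->lock);
        }
        return NULL;
}

/* Called before winding a FOP; waits until enough tokens are available. */
static void
tbf_throttle (tbf_bucket_t *b, long cost)
{
        pthread_mutex_lock (&b->lock);
        while (b->tokens < cost)
                pthread_cond_wait (&b->refill, &b->lock);
        b->tokens -= cost;
        pthread_mutex_unlock (&b->lock);
}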



I am in favor of having administrator defined policies or templates 
(collection of policies) being used to provide the tuning parameter per 
client or a set of clients. We could even have a default template per 
use case etc. Is there a specific need to have this negotiation between 
clients and servers?


Thanks,
Vijay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Ravishankar N


On 01/26/2016 01:22 AM, Shreyas Siravara wrote:

Just out of curiosity, what benefits do we think this throttling xlator would provide 
over the "enable-least-priority" option (where we put all the fops from SHD, 
etc into a least pri queue)?

  


For one, it could provide more granularity on the amount of throttling 
you want to do, for specific fops, from specific clients. If the only 
I/O going through the bricks was from the SHD, they would all be 
least-priority but yet consume an unfair % of the CPU. We could tweak 
`performance.least-rate-limit` to throttle but it would be a global option.
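
[Editorial note: for context, that global knob is an io-threads volume option applied volume-wide, e.g. "gluster volume set <VOLNAME> performance.least-rate-limit 1000" — the value is only an example; roughly, it caps how many least-priority fops are serviced per second across the whole volume.]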




On Jan 25, 2016, at 12:29 AM, Venky Shankar  wrote:

On Mon, Jan 25, 2016 at 01:08:38PM +0530, Ravishankar N wrote:

On 01/25/2016 12:56 PM, Venky Shankar wrote:

Also, it would be beneficial to have the core TBF implementation as part of
libglusterfs so as to be consumable by the server side xlator component to
throttle dispatched FOPs and for daemons to throttle anything that's outside
"brick" boundary (such as cpu, etc..).

That makes sense. We were initially thinking to overload posix_rchecksum()
to do the SHA256 sums for the signer.

That does have advantages by avoiding network roundtrips by computing SHA* 
locally.
TBF could still implement ->rchecksum and throttle that (on behalf of clients,
residing on the server - internal daemons). Placing the core implementation as
part of libglusterfs would still provide the flexibility.




___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.gluster.org_mailman_listinfo_gluster-2Ddevel&d=CwICAg&c=5VD0RTtNlTh3ycd41b3MUw&r=N7LE2BKIHDDBvkYkakYthA&m=9W9xtRg0TIEUvFL-8HpUCux8psoWKkUbEFiwqykRwH4&s=OVF0dZRXt8GFcIxsHlkbNjH-bjD9097q5hjVVHgOFkQ&e=



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Shreyas Siravara
Just out of curiosity, what benefits do we think this throttling xlator would 
provide over the "enable-least-priority" option (where we put all the fops from 
SHD, etc into a least pri queue)?

 
> On Jan 25, 2016, at 12:29 AM, Venky Shankar  wrote:
> 
> On Mon, Jan 25, 2016 at 01:08:38PM +0530, Ravishankar N wrote:
>> On 01/25/2016 12:56 PM, Venky Shankar wrote:
>>> Also, it would be beneficial to have the core TBF implementation as part of
>>> libglusterfs so as to be consumable by the server side xlator component to
>>> throttle dispatched FOPs and for daemons to throttle anything that's outside
>>> "brick" boundary (such as cpu, etc..).
>> That makes sense. We were initially thinking to overload posix_rchecksum()
>> to do the SHA256 sums for the signer.
> 
> That does have advantages by avoiding network roundtrips by computing SHA* 
> locally.
> TBF could still implement ->rchecksum and throttle that (on behalf of clients,
> residing on the server - internal daemons). Placing the core implementation as
> part of libglusterfs would still provide the flexibility.
> 
>> 
>> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.gluster.org_mailman_listinfo_gluster-2Ddevel&d=CwICAg&c=5VD0RTtNlTh3ycd41b3MUw&r=N7LE2BKIHDDBvkYkakYthA&m=9W9xtRg0TIEUvFL-8HpUCux8psoWKkUbEFiwqykRwH4&s=OVF0dZRXt8GFcIxsHlkbNjH-bjD9097q5hjVVHgOFkQ&e=
>  

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-25 Thread Venky Shankar
On Mon, Jan 25, 2016 at 01:08:38PM +0530, Ravishankar N wrote:
> On 01/25/2016 12:56 PM, Venky Shankar wrote:
> >Also, it would be beneficial to have the core TBF implementation as part of
> >libglusterfs so as to be consumable by the server side xlator component to
> >throttle dispatched FOPs and for daemons to throttle anything that's outside
> >"brick" boundary (such as cpu, etc..).
> That makes sense. We were initially thinking to overload posix_rchecksum()
> to do the SHA256 sums for the signer.

That does have the advantage of avoiding network round trips by computing SHA* 
locally.
TBF could still implement ->rchecksum and throttle that (on behalf of clients,
residing on the server - internal daemons). Placing the core implementation as
part of libglusterfs would still provide the flexibility.
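
As a rough sketch of the kind of interface libglusterfs could export for this,
something along the following lines might do (the header name, function names
and signatures below are illustrative assumptions, not existing code):

/* tbf.h - hypothetical shared token-bucket-filter interface */

#include <stdint.h>

typedef struct tbf tbf_t;   /* opaque throttle handle */

/* Create a bucket that fills at `rate' tokens/sec, capped at `max' tokens. */
tbf_t *tbf_init (uint64_t rate, uint64_t max);

/* Block (or queue the caller) until `tokens' tokens can be consumed;
 * usable both by brick-side FOP dispatch and by daemons such as the signer. */
int tbf_throttle (tbf_t *tbf, uint64_t tokens);

/* Adjust the fill rate at runtime, e.g. when a volume option is reconfigured. */
int tbf_set_rate (tbf_t *tbf, uint64_t rate);

void tbf_fini (tbf_t *tbf);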

> 
> 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-24 Thread Ravishankar N

On 01/25/2016 12:56 PM, Venky Shankar wrote:

Also, it would be beneficial to have the core TBF implementation as part of
libglusterfs so as to be consumable by the server side xlator component to
throttle dispatched FOPs and for daemons to throttle anything that's outside
"brick" boundary (such as cpu, etc..).
That makes sense. We were initially thinking to overload 
posix_rchecksum() to do the SHA256 sums for the signer.
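
As a minimal sketch of that idea, assuming OpenSSL is available on the brick,
the digest step could look roughly like this (the real posix_rchecksum()
prototype and the signer's xattr plumbing are omitted; strong_checksum is an
invented helper):

#include <stddef.h>
#include <stdio.h>
#include <openssl/sha.h>

/* Illustration only: digest a buffer the brick has already read for an
 * rchecksum-style request and render it as a hex string for the signer. */
static void
strong_checksum (const unsigned char *buf, size_t len,
                 char hex[SHA256_DIGEST_LENGTH * 2 + 1])
{
    unsigned char md[SHA256_DIGEST_LENGTH];

    SHA256 (buf, len, md);
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
        sprintf (hex + 2 * i, "%02x", md[i]);
}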



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Throttling xlator on the bricks

2016-01-24 Thread Venky Shankar
On Mon, Jan 25, 2016 at 11:06:26AM +0530, Ravishankar N wrote:
> Hi,
> 
> We are planning to introduce a throttling xlator on the server (brick)
> process to regulate FOPS. The main motivation is to solve complaints about
> AFR selfheal taking too much of CPU resources. (due to too many fops for
> entry
> self-heal, rchecksums for data self-heal etc.)
> 
> The throttling is achieved using the Token Bucket Filter algorithm (TBF).
> TBF
> is already used by bitrot's bitd signer (which is a client process) in
> gluster to regulate the CPU intensive check-sum calculation. By putting the
> logic on the brick side, multiple clients- selfheal, bitrot, rebalance or
> even the mounts themselves can avail the benefits of throttling.

  [Providing current TBF implementation link for completeness]

  
https://github.com/gluster/glusterfs/blob/master/xlators/features/bit-rot/src/bitd/bit-rot-tbf.c

Also, it would be beneficial to have the core TBF implementation as part of
libglusterfs so as to be consumable by the server-side xlator component to
throttle dispatched FOPs, and by daemons to throttle anything outside the
"brick" boundary (such as CPU, etc.).

> 
> The TBF algorithm in a nutshell is as follows: There is a bucket which is
> filled
> at a steady (configurable) rate with tokens. Each FOP will need a fixed
> amount
> of tokens to be processed. If the bucket has that many tokens, the FOP is
> allowed and that many tokens are removed from the bucket. If not, the FOP is
> queued until the bucket is filled.
> 
> The xlator will need to reside above io-threads and can have different
> buckets,
> one per client. There has to be a communication mechanism between the client
> and
> the brick (IPC?) to tell what FOPS need to be regulated from it, and the no.
> of
> tokens needed etc. These need to be re configurable via appropriate
> mechanisms.
> Each bucket will have a token filler thread which will fill the tokens in
> it.
> The main thread will enqueue heals in a list in the bucket if there aren't
> enough tokens. Once the token filler detects some FOPS can be serviced, it
> will
> send a cond-broadcast to a dequeue thread which will process (stack wind)
> all
> the FOPS that have the required no. of tokens from all buckets.
> 
> This is just a high level abstraction: requesting feedback on any aspect of
> this feature. what kind of mechanism is best between the client/bricks for
> tuning various parameters? What other requirements do you foresee?
> 
> Thanks,
> Ravi

> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Throttling xlator on the bricks

2016-01-24 Thread Ravishankar N

Hi,

We are planning to introduce a throttling xlator on the server (brick)
process to regulate FOPS. The main motivation is to solve complaints about
AFR self-heal consuming too much CPU (due to the many fops needed for entry
self-heal, rchecksums for data self-heal, etc.).

The throttling is achieved using the Token Bucket Filter (TBF) algorithm.
TBF is already used by bitrot's bitd signer (which is a client process) in
gluster to regulate the CPU-intensive checksum calculation. By putting the
logic on the brick side, multiple clients - self-heal, bitrot, rebalance, or
even the mounts themselves - can benefit from throttling.

The TBF algorithm in a nutshell: there is a bucket which is filled with
tokens at a steady (configurable) rate. Each FOP needs a fixed amount of
tokens to be processed. If the bucket has that many tokens, the FOP is
allowed and that many tokens are removed from the bucket. If not, the FOP is
queued until the bucket fills up again.
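
For illustration, a small self-contained model of that bucket logic in plain C
with POSIX threads follows (this only demonstrates the algorithm described
above; it is not the proposed xlator code):

#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    uint64_t        tokens;      /* tokens currently in the bucket */
    uint64_t        max_tokens;  /* bucket capacity */
    uint64_t        fill_rate;   /* tokens added per second */
} tbf_bucket_t;

/* Called periodically by the token filler thread. */
static void
bucket_fill (tbf_bucket_t *b)
{
    pthread_mutex_lock (&b->lock);
    b->tokens += b->fill_rate;
    if (b->tokens > b->max_tokens)
        b->tokens = b->max_tokens;
    pthread_mutex_unlock (&b->lock);
}

/* Returns 1 if the FOP may proceed now (tokens consumed), 0 if it must wait. */
static int
bucket_try_consume (tbf_bucket_t *b, uint64_t cost)
{
    int ok = 0;

    pthread_mutex_lock (&b->lock);
    if (b->tokens >= cost) {
        b->tokens -= cost;
        ok = 1;
    }
    pthread_mutex_unlock (&b->lock);
    return ok;
}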

The xlator will need to reside above io-threads and can have different
buckets, one per client. There has to be a communication mechanism between
the client and the brick (IPC?) to tell which FOPS need to be regulated for
it, the number of tokens needed, etc. These need to be reconfigurable via
appropriate mechanisms. Each bucket will have a token filler thread which
fills the tokens in it. The main thread will enqueue heals in a list in the
bucket if there aren't enough tokens. Once the token filler detects that some
FOPS can be serviced, it will send a cond-broadcast to a dequeue thread, which
will process (stack wind) all the FOPS that have the required number of
tokens, across all buckets.
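
A rough sketch of how the token filler and dequeue threads could interact,
reusing the bucket model above (queued_fop_t, the resume callback, and the
single global queue are simplifying assumptions; the real xlator would
stack-wind the saved call frame and keep one queue per bucket):

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Illustrative queued FOP; the xlator would stash the call frame here. */
typedef struct queued_fop {
    struct queued_fop *next;
    uint64_t           cost;                           /* tokens this FOP needs */
    void             (*resume) (struct queued_fop *);  /* e.g. stack wind */
} queued_fop_t;

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static queued_fop_t   *q_head;   /* FOPs waiting for tokens */
static tbf_bucket_t    bucket;   /* from the bucket sketch above */

/* Token filler thread: add tokens once a second and wake the dequeuer. */
static void *
token_filler (void *arg)
{
    (void) arg;
    for (;;) {
        sleep (1);
        bucket_fill (&bucket);
        pthread_mutex_lock (&q_lock);
        pthread_cond_broadcast (&q_cond);
        pthread_mutex_unlock (&q_lock);
    }
    return NULL;
}

/* Dequeue thread: resume every queued FOP that can now get its tokens. */
static void *
dequeuer (void *arg)
{
    (void) arg;
    pthread_mutex_lock (&q_lock);
    for (;;) {
        while (q_head && bucket_try_consume (&bucket, q_head->cost)) {
            queued_fop_t *fop = q_head;
            q_head = fop->next;
            pthread_mutex_unlock (&q_lock);
            fop->resume (fop);            /* wind the FOP down the graph */
            pthread_mutex_lock (&q_lock);
        }
        pthread_cond_wait (&q_cond, &q_lock);
    }
    return NULL;
}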

This is just a high-level abstraction; we are requesting feedback on any
aspect of this feature. What kind of mechanism is best between the clients
and bricks for tuning various parameters? What other requirements do you
foresee?

Thanks,
Ravi
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel