Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote:
> > The throttling of the virtual device must begin in
> > generic_make_request and last to ->endio.  You release the throttle
> > of the virtual device at the point you remap the bio to an
> > underlying device, which you have convinced yourself is ok, but it
> > is not.  You seem to miss the fact that whatever resources the
> > virtual device has allocated are no longer protected by the
> > throttle count *of the virtual device*, or you do not
>
> Because it is charged to another device.

Great.  You charged the resource to another device, but you did not 
limit the amount of resources that the first device can consume.  Which 
misses the whole point.

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Evgeniy Polyakov
On Tue, Aug 14, 2007 at 05:32:29AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> > On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips 
> ([EMAIL PROTECTED]) wrote:
> > > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > > And it will not solve the deadlock problem in general.  (Maybe
> > > > > it works for your virtual device, but I wonder...)  If the
> > > > > virtual device allocates memory during generic_make_request
> > > > > then the memory needs to be throttled.
> > > >
> > > > Daniel, if a device processes a bio by itself, it has a limit and
> > > > thus it will wait in generic_make_request()
> > >
> > > What will make it wait?
> >
> > generic_make_request() for the given block device.
> 
> Not good enough, that only makes one thread wait.  Look here:
> 
> http://lkml.org/lkml/2007/8/13/788
> 
> An unlimited number of threads can come in, each consuming resources of 
> the virtual device, and violating the throttling rules.
> 
> The throttling of the virtual device must begin in generic_make_request 
> and last to ->endio.  You release the throttle of the virtual device at 
> the point you remap the bio to an underlying device, which you have 
> convinced yourself is ok, but it is not.  You seem to miss the fact 
> that whatever resources the virtual device has allocated are no longer 
> protected by the throttle count *of the virtual device*, or you do not 

Because it is charged to another device. No matter how many of them are
chained, the limit is applied to the last device being used.
So, if you have an unlimited number of threads, each one allocates a
request and forwards it down to the low-level devices, and each one will
eventually sleep; but yes, each one _can_ allocate _one_ request before
it goes to sleep. This is done to allow fine-grained limits, since some
devices (like locally attached disks) do not require throttling.
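To make the recharging concrete, here is a minimal sketch of how a
blk_set_bdev() style helper could move the charge from one queue to
another (bio_queued is the counter from the patch under discussion;
bio_limit and throttle_wait are illustrative names, and a real version
must close the check/increment race):

static void blk_set_bdev(struct bio *bio, struct block_device *bdev)
{
        struct request_queue *old_q = bdev_get_queue(bio->bi_bdev);
        struct request_queue *new_q = bdev_get_queue(bdev);

        /* release the charge taken on the old (e.g. virtual) queue */
        atomic_dec(&old_q->bio_queued);
        wake_up(&old_q->throttle_wait);

        /* charge the new (e.g. physical) queue instead, sleeping in
         * the submitter's context if that queue is at its limit */
        wait_event(new_q->throttle_wait,
                   atomic_read(&new_q->bio_queued) < new_q->bio_limit);
        atomic_inc(&new_q->bio_queued);

        bio->bi_bdev = bdev;
}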

Here is an example with threads you mentioned:
http://article.gmane.org/gmane.linux.file-systems/17644

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > And it will not solve the deadlock problem in general.  (Maybe
> > > > it works for your virtual device, but I wonder...)  If the
> > > > virtual device allocates memory during generic_make_request
> > > > then the memory needs to be throttled.
> > >
> > > Daniel, if a device processes a bio by itself, it has a limit and
> > > thus it will wait in generic_make_request()
> >
> > What will make it wait?
>
> generic_make_request() for the given block device.

Not good enough, that only makes one thread wait.  Look here:

http://lkml.org/lkml/2007/8/13/788

An unlimited number of threads can come in, each consuming resources of 
the virtual device, and violating the throttling rules.

The throttling of the virtual device must begin in generic_make_request 
and last to ->endio.  You release the throttle of the virtual device at 
the point you remap the bio to an underlying device, which you have 
convinced yourself is ok, but it is not.  You seem to miss the fact 
that whatever resources the virtual device has allocated are no longer 
protected by the throttle count *of the virtual device*, or you do not 
see why that is a bad thing.  It is a very bad thing, roughly like 
leaving some shared data outside a spin_lock/unlock.
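
To make the requirement concrete, here is a minimal sketch of
begin-to-endio throttling: the charge taken against the originally
submitted queue is released only when the bio completes, so remapping
to an underlying device releases nothing (bio_queued, bio_limit and
throttle_wait are illustrative names, not from any posted patch, and
this assumes the 2.6.22-style three-argument bi_end_io prototype):

/* charge at submit time, in generic_make_request(): */
        wait_event(q->throttle_wait,
                   atomic_read(&q->bio_queued) < q->bio_limit);
        atomic_inc(&q->bio_queued);
        /* ...then stack a completion hook in front of bi_end_io... */

/* release only at completion: */
struct throttle_hook {
        struct request_queue *q;        /* queue originally charged */
        bio_end_io_t *real_end_io;
        void *real_private;
};

static int throttled_end_io(struct bio *bio, unsigned int done, int err)
{
        struct throttle_hook *h = bio->bi_private;

        if (bio->bi_size)
                return 1;               /* transfer not complete yet */

        atomic_dec(&h->q->bio_queued);  /* held from submit to here */
        wake_up(&h->q->throttle_wait);

        bio->bi_end_io = h->real_end_io;
        bio->bi_private = h->real_private;
        kfree(h);
        return bio->bi_end_io(bio, done, err);
}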

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Evgeniy Polyakov
On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > And it will not solve the deadlock problem in general.  (Maybe it
> > > works for your virtual device, but I wonder...)  If the virtual
> > > device allocates memory during generic_make_request then the memory
> > > needs to be throttled.
> >
> > Daniel, if a device processes a bio by itself, it has a limit and
> > thus it will wait in generic_make_request()
> 
> What will make it wait?

generic_make_request() for the given block device.

Example:

    virt_device -> do_smth_with_bio -> bio_endio()
       /    \
   phys0    phys1

Each of the three devices above works with bios; each one eventually
calls bio_endio(), and bio->bi_bdev will be one of the three devices
above.

Thus, when the system calls generic_make_request() with bio->bi_bdev ==
virt_device, one of the three limits will be charged, depending on
whether the virtual device forwards the bio to a physical device or not.
Actually, the virtual device's limit is charged first in any case, but
if the bio is forwarded, that portion is subtracted from the virtual
device's limit and charged to the underlying device instead.

Now, if the virtual device allocates a bio itself (like device mapper
does), then this new bio will be forwarded to the physical devices via
generic_make_request(), and thus it will sleep in the physical device's
queue if that queue is full.

So, if each of the three devices has a limit of 10 bios, then the
actual number of bios in flight is at most 3 * 10, since each device is
charged up to _its own_ maximum limit, not the limit of the first device
in the chain.

So, if you set a limit of 10 on the virtual device and it can process
bios itself (for example, send them to the network), then that is the
number of bios in flight which are processed by _this_ device and not
forwarded further. The actual number of bios you can flush into the
virtual device is its own limit plus the limits of all physical devices
attached to it.
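
As a back-of-the-envelope check of that bound, here is a tiny
user-space model (illustrative code, not the kernel patch) of the
charging rule: a bio is charged to exactly one device at a time, so the
worst case in flight is the sum of the per-device limits:

#include <stdio.h>

#define NDEV   3        /* one virtual + two physical devices */
#define LIMIT 10        /* per-device in-flight limit */

static int inflight[NDEV];

/* charge one bio to device d; 0 means the submitter would sleep */
static int charge(int d)
{
        if (inflight[d] >= LIMIT)
                return 0;
        inflight[d]++;
        return 1;
}

int main(void)
{
        /* bios the virtual device (0) processes itself stay charged
         * to it; forwarded bios end up charged to phys0/phys1 (1, 2)
         * because the remap moves the charge off the virtual device */
        while (charge(0))
                ;
        while (charge(1))
                ;
        while (charge(2))
                ;

        printf("max bios in flight: %d\n",
               inflight[0] + inflight[1] + inflight[2]); /* 3 * 10 = 30 */
        return 0;
}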

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > And it will not solve the deadlock problem in general.  (Maybe it
> > works for your virtual device, but I wonder...)  If the virtual
> > device allocates memory during generic_make_request then the memory
> > needs to be throttled.
>
> Daniel, if a device processes a bio by itself, it has a limit and
> thus it will wait in generic_make_request()

What will make it wait?



Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Evgeniy Polyakov
On Tue, Aug 14, 2007 at 04:13:10AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> > On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips 
> ([EMAIL PROTECTED]) wrote:
> > > Perhaps you never worried about the resources that the device
> > > mapper mapping function allocates to handle each bio and so did not
> > > consider this hole significant.  These resources can be
> > > significant, as is the case with ddsnap.  It is essential to close
> > > that window through which the virtual device's queue limit may be
> > > violated.  Not doing so will allow deadlock.
> >
> > This is not a bug, this is a special kind of calculation - the total
> > limit is the number of physical devices multiplied by their limits.
> > It was done _on purpose_ to allow different devices to have
> > different limits (for example, in the distributed storage project it
> > is possible to have both a remote and a local node in the same
> > device, where the local device should not have _any_ limit at all,
> > but the network one should).
> >
> > The virtual device essentially has _no_ limit. And that was done on
> > purpose.
> 
> And it will not solve the deadlock problem in general.  (Maybe it works 
> for your virtual device, but I wonder...)  If the virtual device 
> allocates memory during generic_make_request then the memory needs to 
> be throttled.

Daniel, if a device processes a bio by itself, it has a limit and thus
it will wait in generic_make_request(); if it queues the bio to a
different device, then the same logic applies there. If the virtual
device does not process bios, its limit will always be recharged to the
underlying devices, and the overall limit is equal to the number of
physical devices (or devices which do process bios) multiplied by their
limits. This does _work_, and I showed an example of how the limits are
processed and who will sleep, and where. This solution is not a narrow
fix; please check the examples I showed before.

> Regards,
> 
> Daniel

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > Perhaps you never worried about the resources that the device
> > mapper mapping function allocates to handle each bio and so did not
> > consider this hole significant.  These resources can be
> > significant, as is the case with ddsnap.  It is essential to close
> > that window through which the virtual device's queue limit may be
> > violated.  Not doing so will allow deadlock.
>
> This is not a bug, this is a special kind of calculation - the total
> limit is the number of physical devices multiplied by their limits.
> It was done _on purpose_ to allow different devices to have different
> limits (for example, in the distributed storage project it is possible
> to have both a remote and a local node in the same device, where the
> local device should not have _any_ limit at all, but the network one
> should).
>
> The virtual device essentially has _no_ limit. And that was done on
> purpose.

And it will not solve the deadlock problem in general.  (Maybe it works 
for your virtual device, but I wonder...)  If the virtual device 
allocates memory during generic_make_request then the memory needs to 
be throttled.

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Evgeniy Polyakov
On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> Perhaps you never worried about the resources that the device mapper 
> mapping function allocates to handle each bio and so did not consider 
> this hole significant.  These resources can be significant, as is the 
> case with ddsnap.  It is essential to close that window through which 
> the virtual device's queue limit may be violated.  Not doing so will 
> allow deadlock.

This is not a bug, this is a special kind of calculation - the total
limit is the number of physical devices multiplied by their limits. It
was done _on purpose_ to allow different devices to have different
limits (for example, in the distributed storage project it is possible
to have both a remote and a local node in the same device, where the
local device should not have _any_ limit at all, but the network one
should).

The virtual device essentially has _no_ limit. And that was done on
purpose.

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote:
> > Say you have a device mapper device with some physical device
> > sitting underneath, the classic use case for this throttle code. 
> > Say 8,000 threads each submit an IO in parallel.  The device mapper
> > mapping function will be called 8,000 times with associated
> > resource allocations, regardless of any throttling on the physical
> > device queue.
>
> Each thread will sleep in generic_make_request(); if the limit is
> specified correctly, then the number of allocated bios will be enough
> to make progress.

The problem is, the sleep does not occur before the virtual device 
mapping function is called.  Let's consider two devices, a physical 
device named pdev and a virtual device sitting on top of it called 
vdev.   vdev's throttle limit is just one element, but we will see that 
in spite of this, two bios can be handled by the vdev's mapping method 
before any IO completes, which violates the throttling rules. According 
to your patch it works like this:

 Thread 1                            Thread 2

 <sees vdev->q->bio_queued == 0>

 vdev->q->bio_queued++

 <vdev's mapping function runs,
  allocating resources>

 blk_set_bdev(bio, pdev)
     vdev->q->bio_queued--
                                     <sees vdev->q->bio_queued == 0>

                                     vdev->q->bio_queued++

                                     <vdev's mapping function runs,
                                      allocating resources>

                                     whoops!  Our virtual device mapping
                                     function has now allocated resources
                                     for two in-flight bios in spite of
                                     having its throttle limit set to 1.
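
The same window can be reproduced in user space.  Here is a toy pthread
model of the accounting above (illustrative only), where dropping the
charge at remap time lets a second submitter through a limit of one
before any IO completes:

#include <pthread.h>
#include <stdio.h>

static int bio_queued;     /* vdev in-flight count, limit = 1 */
static int mapped;         /* times the vdev mapping fn ran */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *submit(void *arg)
{
        pthread_mutex_lock(&m);
        if (bio_queued == 0) {     /* throttle check */
                bio_queued++;      /* charge the vdev */
                mapped++;          /* mapping fn allocates resources */
                bio_queued--;      /* remap to pdev: vdev charge is
                                    * dropped before any IO completes */
        }
        pthread_mutex_unlock(&m);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, submit, NULL);
        pthread_create(&t2, NULL, submit, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* prints 2: the mapping function ran twice despite limit 1 */
        printf("mapping fn ran %d times (limit was 1)\n", mapped);
        return 0;
}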

Perhaps you never worried about the resources that the device mapper 
mapping function allocates to handle each bio and so did not consider 
this hole significant.  These resources can be significant, as is the 
case with ddsnap.  It is essential to close that window through which 
the virtual device's queue limit may be violated.  Not doing so will 
allow deadlock.

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Evgeniy Polyakov
On Mon, Aug 13, 2007 at 05:18:14AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> > If limit is for 
> > 1gb of pending block io, and system has for example 2gbs of ram (or
> > any other reasonable parameters), then there is no way we can deadlock
> > in allocation, since it will not force page reclaim mechanism.
> 
> The problem is that sk_alloc (called from our block driver via 
> socket->write) would recurse into shrink_pages, which recursively 
> submits IO to our block driver and blocks on the throttle.  Subtle 
> indeed, and yet another demonstration of why vm recursion is a Bad 
> Thing.
>
> I will find a traceback for you tomorrow, which makes this deadlock much 
> clearer.

I see how it can happen, but device throttling is the solution we are
trying to complete, whose main aim _is_ to remove this problem.

Lower the per-device limit, so that the rest of the RAM is enough to
allocate all the needed data structures in the network path.
The above example has just 1gb of ram, which should be enough for skbs;
if it is not, decrease the limit to 500mb and so on, until the weighted
load of the system always allows forward progress.

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > > Oops, and there is also:
> > > >
> > > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > > itself deadlock.  Let me see if I can remember how it goes.
> > > >
> > > >   * generic_make_request puts a bio in flight
> > > >   * the bio gets past the throttle and initiates network IO
> > > >   * net calls sk_alloc->alloc_pages->shrink_caches
> > > >   * shrink_caches submits a bio recursively to our block device
> > > >   * this bio blocks on the throttle
> > > >   * net may never get the memory it needs, and we are wedged
> > >
> > > If the system is in such a condition, it is already broken - the
> > > throttle limit must be lowered (next time) so as not to allow such
> > > a situation.
> >
> > Agreed that the system is broken, however lowering the throttle
> > limit gives no improvement in this case.
>
> How is that ever possible? The whole idea of throttling is to remove
> such a situation, and now you say it cannot be solved.

It was solved by not throttling writeout that comes from shrink_caches.
Ugly.

> If limit is for 
> 1gb of pending block io, and system has for example 2gbs of ram (or
> any other reasonable parameters), then there is no way we can deadlock
> in allocation, since it will not force page reclaim mechanism.

The problem is that sk_alloc (called from our block driver via 
socket->write) would recurse into shrink_pages, which recursively 
submits IO to our block driver and blocks on the throttle.  Subtle 
indeed, and yet another demonstration of why vm recursion is a Bad 
Thing.

I will find a traceback for you tomorrow, which makes this deadlock much 
clearer.

Regards


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Evgeniy Polyakov
On Mon, Aug 13, 2007 at 04:18:03AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> > No. Since all requests for the virtual device end up in physical
> > devices, which have limits, this mechanism works. The virtual device
> > will essentially call either generic_make_request() for the new
> > physical device (and thus will sleep if its limit is over), or will
> > process bios directly, but in that case it will sleep in
> > generic_make_request() for the virtual device.
> 
> What can happen is, as soon as you unthrottle the previous queue, 
> another thread can come in and put another request on it.  Sure, that 
> thread will likely block on the physical throttle and so will the rest 
> of the incoming threads, but it still allows the higher level queue to 
> grow past any given limit, with the help of lots of threads.  JVM for 
> example?

No. You get one slot, so one thread will not be blocked, but all others
will. If the lucky thread wants to put in two requests, it will be
blocked on the second request, since the underlying physical device does
not accept requests anymore, and thus the caller will sleep.

> Say you have a device mapper device with some physical device sitting 
> underneath, the classic use case for this throttle code.  Say 8,000 
> threads each submit an IO in parallel.  The device mapper mapping 
> function will be called 8,000 times with associated resource 
> allocations, regardless of any throttling on the physical device queue.

Each thread will sleep in generic_make_request(); if the limit is
specified correctly, then the number of allocated bios will be enough to
make progress.

Here is an example:

let's say the system has 20,000 pages in RAM and 20,000 in swap, and we
have 8,000 threads; each one allocates a page, then the next page, and
so on. The system has one virtual device with two physical devices under
it, and each device gets half of the requests.

We set the limit to 4,000 per physical device.

All threads allocate a page and queue it to the devices, so all threads
succeed in their first allocation, and each device has its queue full.
The virtual device does not have a limit (or has one of 4,000 too, but
since it is recharged each time, it has zero blocks in flight).

A new thread tries to allocate a page; the page is allocated and queued
to one of the devices, but since that queue is full, the thread sleeps.
So will each of the others.

Thus we end up with 8,000 requests queued and 8,000 in flight, 16,000 in
total, which is smaller than the number of pages in RAM, so we are
happy.

Consider the above as a special kind of calculation, i.e. the number of
_allocated_ pages is always the number of physical devices multiplied by
each one's in-flight limit. By adjusting the in-flight limit and knowing
the number of devices, it is entirely possible to eliminate the vm
deadlock.

If you do not like such a calculation, the solution is trivial: we can
sleep _after_ ->make_request_fn() in generic_make_request() until the
number of in-flight bios is reduced by bio_endio().
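
In sketch form (bio_queued, bio_limit and throttle_wait are illustrative
names, as before), that variant would look roughly like this - the
submitter itself blocks after the mapping function has run, until
completions catch up:

static void generic_make_request_throttled(struct bio *bio)
{
        struct request_queue *q = bdev_get_queue(bio->bi_bdev);

        atomic_inc(&q->bio_queued);
        q->make_request_fn(q, bio);

        /* block the submitter until bio_endio() (which decrements
         * the counter and wakes us) brings the queue back under
         * its limit */
        wait_event(q->throttle_wait,
                   atomic_read(&q->bio_queued) <= q->bio_limit);
}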

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Evgeniy Polyakov
On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > Oops, and there is also:
> > >
> > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > itself deadlock.  Let me see if I can remember how it goes.
> > >
> > >   * generic_make_request puts a bio in flight
> > >   * the bio gets past the throttle and initiates network IO
> > >   * net calls sk_alloc->alloc_pages->shrink_caches
> > >   * shrink_caches submits a bio recursively to our block device
> > >   * this bio blocks on the throttle
> > >   * net may never get the memory it needs, and we are wedged
> >
> > If the system is in such a condition, it is already broken - the
> > throttle limit must be lowered (next time) so as not to allow such a
> > situation.
> 
> Agreed that the system is broken, however lowering the throttle limit 
> gives no improvement in this case.

How is that ever possible? The whole idea of throttling is to remove
such a situation, and now you say it cannot be solved. If the limit is
for 1gb of pending block io, and the system has for example 2gbs of ram
(or any other reasonable parameters), then there is no way we can
deadlock in allocation, since it will not force the page reclaim
mechanism.

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote:
> On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > (previous incomplete message sent accidentally)
> >
> > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> > >
> > > So, what did we decide? To bloat bio a bit (add a queue pointer)
> > > or to use physical device limits? The latter requires replacing
> > > all occurrences of bio->bi_bdev = something_new with
> > > blk_set_bdev(bio, something_new), where queue limits will be
> > > appropriately charged. So far I'm testing the second case, but I
> > > only changed DST for testing; I can change all other users if
> > > needed though.
> >
> > Adding a queue pointer to struct bio and using physical device
> > limits as in your posted patch both suffer from the same problem:
> > you release the throttling on the previous queue when the bio moves
> > to a new one, which is a bug because memory consumption on the
> > previous queue then becomes unbounded, or limited only by the
> > number of struct requests that can be allocated.  In other words,
> > it reverts to the same situation we have now as soon as the IO
> > stack has more than one queue.  (Just a shorter version of my
> > previous post.)
>
> No. Since all requests for the virtual device end up in physical
> devices, which have limits, this mechanism works. The virtual device
> will essentially call either generic_make_request() for the new
> physical device (and thus will sleep if its limit is over), or will
> process bios directly, but in that case it will sleep in
> generic_make_request() for the virtual device.

What can happen is, as soon as you unthrottle the previous queue, 
another thread can come in and put another request on it.  Sure, that 
thread will likely block on the physical throttle and so will the rest 
of the incoming threads, but it still allows the higher level queue to 
grow past any given limit, with the help of lots of threads.  JVM for 
example?

Say you have a device mapper device with some physical device sitting 
underneath, the classic use case for this throttle code.  Say 8,000 
threads each submit an IO in parallel.  The device mapper mapping 
function will be called 8,000 times with associated resource 
allocations, regardless of any throttling on the physical device queue.

Anyway, your approach is awfully close to being airtight; there is just 
a small hole.  I would be more than happy to be proved wrong about 
that, but the more I look, the more I see that hole.

> > 1) One throttle count per submitted bio is too crude a measure.  A
> > bio can carry as few as one page or as many as 256 pages.  If you
> > take only
>
> It does not matter - we can count bytes, pages, bio vectors or
> whatever we like; it's just a matter of the counter and can be changed
> without problem.

Quite true.  In some cases the simple inc/dec per bio works just fine.  
But the general case where finer granularity is required comes up in 
existing code, so there needs to be a plan.

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > Oops, and there is also:
> >
> > 3) The bio throttle, which is supposed to prevent deadlock, can
> > itself deadlock.  Let me see if I can remember how it goes.
> >
> >   * generic_make_request puts a bio in flight
> >   * the bio gets past the throttle and initiates network IO
> >   * net calls sk_alloc->alloc_pages->shrink_caches
> >   * shrink_caches submits a bio recursively to our block device
> >   * this bio blocks on the throttle
> >   * net may never get the memory it needs, and we are wedged
>
> If the system is in such a condition, it is already broken - the
> throttle limit must be lowered (next time) so as not to allow such a
> situation.

Agreed that the system is broken; however, lowering the throttle limit 
gives no improvement in this case.

This is not theoretical, but a testable, repeatable result.  
Instructions to reproduce should show up tomorrow.

This bug is now solved in a kludgy way.  Now, Peter's patch set offers a 
much cleaner way to fix this little problem, along with at least one 
other nasty that it already fixed.

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Evgeniy Polyakov
On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> (previous incomplete message sent accidentally)
> 
> On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> >
> > So, what did we decide? To bloat bio a bit (add a queue pointer) or
> > to use physical device limits? The latter requires replacing all
> > occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> > something_new), where queue limits will be appropriately charged. So
> > far I'm testing the second case, but I only changed DST for testing;
> > I can change all other users if needed though.
> 
> Adding a queue pointer to struct bio and using physical device limits as 
> in your posted patch both suffer from the same problem: you release the 
> throttling on the previous queue when the bio moves to a new one, which 
> is a bug because memory consumption on the previous queue then becomes 
> unbounded, or limited only by the number of struct requests that can be 
> allocated.  In other words, it reverts to the same situation we have 
> now as soon as the IO stack has more than one queue.  (Just a shorter 
> version of my previous post.)

No. Since all requests for the virtual device end up in physical
devices, which have limits, this mechanism works. The virtual device
will essentially call either generic_make_request() for the new physical
device (and thus will sleep if its limit is over), or will process bios
directly, but in that case it will sleep in generic_make_request() for
the virtual device.

> 1) One throttle count per submitted bio is too crude a measure.  A bio 
> can carry as few as one page or as many as 256 pages.  If you take only 

It does not matter - we can count bytes, pages, bio vectors or whatever
we like; it's just a matter of the counter and can be changed without
problem.

> 2) Exposing the per-block device throttle limits via sysfs or similar is 
> really not a good long term solution for system administration.  
> Imagine our help text: "just keep trying smaller numbers until your 
> system deadlocks".  We really need to figure this out internally and 
> get it correct.  I can see putting in a temporary userspace interface 
> just for experimentation, to help determine what really is safe, and 
> what size the numbers should be to approach optimal throughput in a 
> fully loaded memory state.

Well, we already have a number of such 'supposed-to-be-automatic'
variables exported to userspace, so this will not change the picture;
frankly, I do not care whether there will or will not be any
sysfs-exported tunable - eventually we can remove it, or not create it
at all.

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Evgeniy Polyakov
On Sun, Aug 12, 2007 at 11:44:00PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:
> On Sunday 12 August 2007 22:36, I wrote:
> > Note!  There are two more issues I forgot to mention earlier.
> 
> Oops, and there is also:
> 
> 3) The bio throttle, which is supposed to prevent deadlock, can itself 
> deadlock.  Let me see if I can remember how it goes.
> 
>   * generic_make_request puts a bio in flight
>   * the bio gets past the throttle and initiates network IO
>   * net calls sk_alloc->alloc_pages->shrink_caches
>   * shrink_caches submits a bio recursively to our block device
>   * this bio blocks on the throttle
>   * net may never get the memory it needs, and we are wedged

If the system is in such a condition, it is already broken - the
throttle limit must be lowered (next time) so as not to allow such a
situation.

-- 
Evgeniy Polyakov


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Sunday 12 August 2007 22:36, I wrote:
> Note!  There are two more issues I forgot to mention earlier.

Oops, and there is also:

3) The bio throttle, which is supposed to prevent deadlock, can itself 
deadlock.  Let me see if I can remember how it goes.

  * generic_make_request puts a bio in flight
  * the bio gets past the throttle and initiates network IO
  * net calls sk_alloc->alloc_pages->shrink_caches
  * shrink_caches submits a bio recursively to our block device
  * this bio blocks on the throttle
  * net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right; however, you 
can see the danger.  In ddsnap we kludge around this problem by not 
throttling any bio submitted in PF_MEMALLOC mode, which effectively 
increases our reserve requirement by the amount of IO that mm will 
submit to a given block device before deciding the device is congested 
and should be left alone.  This works, but is sloppy and disgusting.
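
The kludge itself is tiny - roughly the following check in front of the
throttle (a sketch, not the actual ddsnap code; bio_queued, bio_limit
and throttle_wait are illustrative names, and the completion side must
likewise skip the release for bios that were never charged):

static void throttle_bio(struct request_queue *q, struct bio *bio)
{
        /* reclaim-initiated IO (PF_MEMALLOC set) bypasses the
         * throttle entirely, so reclaim can never block on it;
         * the price is a larger memory reserve */
        if (current->flags & PF_MEMALLOC)
                return;

        wait_event(q->throttle_wait,
                   atomic_read(&q->bio_queued) < q->bio_limit);
        atomic_inc(&q->bio_queued);
}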

The right thing to do is to make sure that the mm knows about our 
throttle accounting in backing_dev_info so it will not push IO to our 
device when it knows that the IO will just block on congestion.  
Instead, shrink_caches will find some other less congested block device 
or give up, causing alloc_pages to draw from the memalloc reserve to 
satisfy the sk_alloc request.

The mm already uses backing_dev_info this way, we just need to set the 
right bits in the backing_dev_info state flags.  I think Peter posted a 
patch set that included this feature at some point.
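
Something along these lines, as a sketch (the BDI_write_congested bit
is real, but the exact helpers for setting and clearing it, and for
waking congestion_wait() sleepers, vary between kernel versions;
bio_queued and bio_limit are the illustrative names used above):

static void throttle_update_congestion(struct request_queue *q)
{
        struct backing_dev_info *bdi = &q->backing_dev_info;

        /* publish the throttle state so the vm's congestion checks
         * (bdi_write_congested() and friends) steer reclaim writeout
         * to a less congested device instead of blocking on us */
        if (atomic_read(&q->bio_queued) >= q->bio_limit)
                set_bit(BDI_write_congested, &bdi->state);
        else {
                clear_bit(BDI_write_congested, &bdi->state);
                /* a real version must also wake anyone sleeping in
                 * congestion_wait() here */
        }
}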

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
(previous incomplete message sent accidentally)

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires replacing all
> occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> something_new), where queue limits will be appropriately charged. So
> far I'm testing the second case, but I only changed DST for testing;
> I can change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as 
in your posted patch both suffer from the same problem: you release the 
throttling on the previous queue when the bio moves to a new one, which 
is a bug because memory consumption on the previous queue then becomes 
unbounded, or limited only by the number of struct requests that can be 
allocated.  In other words, it reverts to the same situation we have 
now as soon as the IO stack has more than one queue.  (Just a shorter 
version of my previous post.)

We can solve this by having the bio only point at the queue to which it 
was originally submitted, since throttling the top level queue 
automatically throttles all queues lower down the stack.  Alternatively 
the bio can point at the block_device or straight at the 
backing_dev_info, which is the per-device structure it actually needs 
to touch.
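
In sketch form (bi_bdi is a proposed field here, not an existing struct
bio member, and throttle_charge()/throttle_release() are hypothetical
helpers), the bio would simply remember where it was charged:

static void submit_throttled(struct bio *bio)
{
        /* remember the top-level throttle target across remaps */
        bio->bi_bdi = &bdev_get_queue(bio->bi_bdev)->backing_dev_info;
        throttle_charge(bio->bi_bdi);
        generic_make_request(bio);
}

static void end_throttled(struct bio *bio)
{
        /* release against the remembered target, no matter how many
         * times bi_bdev was remapped in between */
        throttle_release(bio->bi_bdi);
}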

Note!  There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure.  A bio 
can carry as few as one page or as many as 256 pages.  If you take only 
one throttle count per bio and that data will be transferred over the 
network then you have to assume that (a little more than) 256 pages of 
sk_alloc reserve will be needed for every bio, resulting in a grossly 
over-provisioned reserve.  The precise reserve calculation we want to 
do is per-block device, and you will find hooks like this already 
living in backing_dev_info.  We need to place our own fn+data there to 
calculate the throttle draw for each bio.  Unthrottling gets trickier 
with variable size throttle draw.  In ddsnap, we simply write the 
amount we drew from the throttle into (the private data of) bio for use 
later by unthrottle, thus avoiding the issue that the bio fields we 
used to calculate might have changed during the lifetime of the bio.  
This would translate into one more per-bio field.
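
As a sketch of that bookkeeping (bi_throttle is the proposed per-bio
field; the counter and wait-queue names are illustrative), with the
drawn amount saved so unthrottle returns exactly what throttle took:

static void throttle_bio(struct request_queue *q, struct bio *bio)
{
        /* per-device draw: could be sectors, pages or bytes */
        unsigned int draw = bio_sectors(bio);

        wait_event(q->throttle_wait,
                   atomic_read(&q->throttle_count) + draw
                           <= q->throttle_limit);
        atomic_add(draw, &q->throttle_count);
        bio->bi_throttle = draw;        /* saved for unthrottle */
}

static void unthrottle_bio(struct request_queue *q, struct bio *bio)
{
        /* use the saved draw: the bio's own fields may have changed
         * during its lifetime */
        atomic_sub(bio->bi_throttle, &q->throttle_count);
        wake_up(&q->throttle_wait);
}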

2) Exposing the per-block device throttle limits via sysfs or similar is 
really not a good long term solution for system administration.  
Imagine our help text: "just keep trying smaller numbers until your 
system deadlocks".  We really need to figure this out internally and 
get it correct.  I can see putting in a temporary userspace interface 
just for experimentation, to help determine what really is safe, and 
what size the numbers should be to approach optimal throughput in a 
fully loaded memory state.

Regards,

Daniel


Re: Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe 
([EMAIL PROTECTED]) wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires replacing all
> occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> something_new), where queue limits will be appropriately charged. So
> far I'm testing the second case, but I only changed DST for testing;
> I can change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as 
in your posted patch both suffer from the same problem: you release the 
throttling on the previous queue when the bio moves to a new one, which 
is a bug because memory consumption on the previous queue then becomes 
unbounded, or limited only by the number of struct requests that can be 
allocated.  In other words, it reverts to the same situation we have 
now as soon as the IO stack has more than one queue.  (Just a shorter 
version of my previous post.)

We can solve this by having the bio only point at the queue to which it 
was originally submitted, since throttling the top level queue 
automatically throttles all queues lower down the stack.  Alternatively 
the bio can point at the block_device or straight at the 
backing_dev_info, which is the per-device structure it actually needs 
to touch.

Note!  There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure.  A bio 
can carry as few as one page or as many as 256 pages.  If you take only 
one throttle count per bio and that data will be transferred over the 
network then you have to assume that (a little more than) 256 pages of 
sk_alloc reserve will be needed for every bio, resulting in a grossly 
over-provisioned reserve.  The precise reserve calculation we want to 
do is per-block device, and you will find hooks like this already 
living in backing_dev_info.  We need to place our own fn+data there to 
calculate the throttle draw for each bio.  Unthrottling gets trickier 
with variable size throttle draw.  In ddsnap, we simply write the 
amount we drew from the throttle into (the private data of) bio for use 
later by unthrottle, thus avoiding the issue that the bio fields we 
used to calculate might have changed during the lifetime of the bio.  
This would translate into one more per-bio field.



the throttling performs another function: keeping a reasonable amount of 
IO in flight for the device.  The definition of "reasonable" is 
complex.  For a hard disk it depends on the physical distance between 
sector addresses of the bios in flight.  In ddsnap we make a crude but 
workable approximation that 


 In general, a per block device 

The throttle count needs to cover 

Regards,

Daniel