Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote:
> > The throttling of the virtual device must begin in
> > generic_make_request and last to ->endio. You release the throttle
> > of the virtual device at the point you remap the bio to an
> > underlying device, which you have convinced yourself is ok, but it
> > is not. You seem to miss the fact that whatever resources the
> > virtual device has allocated are no longer protected by the
> > throttle count *of the virtual device*, or you do not
>
> Because it is charged to another device.

Great. You charged the resource to another device, but you did not limit the amount of resources that the first device can consume. Which misses the whole point.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 05:32:29AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> > On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > > And it will not solve the deadlock problem in general. (Maybe
> > > > > it works for your virtual device, but I wonder...) If the
> > > > > virtual device allocates memory during generic_make_request
> > > > > then the memory needs to be throttled.
> > > >
> > > > Daniel, if a device processes a bio by itself, it has a limit and thus
> > > > it will wait in generic_make_request()
> > >
> > > What will make it wait?
> >
> > generic_make_request() for the given block device.
>
> Not good enough, that only makes one thread wait. Look here:
>
> http://lkml.org/lkml/2007/8/13/788
>
> An unlimited number of threads can come in, each consuming resources of
> the virtual device, and violating the throttling rules.
>
> The throttling of the virtual device must begin in generic_make_request
> and last to ->endio. You release the throttle of the virtual device at
> the point you remap the bio to an underlying device, which you have
> convinced yourself is ok, but it is not. You seem to miss the fact
> that whatever resources the virtual device has allocated are no longer
> protected by the throttle count *of the virtual device*, or you do not

Because it is charged to another device. No matter how many of them are chained, the limit is applied to the last device being used.

So, if you have an unlimited number of threads, each one allocates a request and forwards it down to the low-level devices; each one will eventually sleep, but yes, each one _can_ allocate _one_ request before it goes to sleep. It is done this way to allow fine-grained limits, since some devices (like locally attached disks) do not require throttling at all.
Here is an example with threads you mentioned: http://article.gmane.org/gmane.linux.file-systems/17644

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > And it will not solve the deadlock problem in general. (Maybe
> > > > it works for your virtual device, but I wonder...) If the
> > > > virtual device allocates memory during generic_make_request
> > > > then the memory needs to be throttled.
> > >
> > > Daniel, if a device processes a bio by itself, it has a limit and thus
> > > it will wait in generic_make_request()
> >
> > What will make it wait?
>
> generic_make_request() for the given block device.

Not good enough, that only makes one thread wait. Look here:

http://lkml.org/lkml/2007/8/13/788

An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules.

The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not see why that is a bad thing. It is a very bad thing, roughly like leaving some shared data outside a spin_lock/unlock.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > And it will not solve the deadlock problem in general. (Maybe it
> > > works for your virtual device, but I wonder...) If the virtual
> > > device allocates memory during generic_make_request then the memory
> > > needs to be throttled.
> >
> > Daniel, if a device processes a bio by itself, it has a limit and thus it
> > will wait in generic_make_request()
>
> What will make it wait?

generic_make_request() for the given block device. Example:

    virt_device -> do_smth_with_bio -> bio_endio()
         |
        / \
    phys0   phys1

Each of the three devices above works with bios, each one eventually calls bio_endio(), and bio->bi_bdev will be one of the three devices above. Thus, when the system calls generic_make_request(bio->bi_bdev == virt_device), one of the three limits will be charged, depending on whether the virtual device forwards the bio to the physical devices or not. Actually the virtual device's limit will be charged first too, but if the bio is forwarded, its portion will be subtracted from the virtual device's limit.

Now, if the virtual device allocates a bio itself (like device mapper does), then this new bio will be forwarded to the physical devices via generic_make_request() and thus it will sleep in the physical device's queue, if that queue is filled.

So, if each of the three devices has a limit of 10 bios, then the actual number of bios in flight is at most 3 * 10, since each device will be charged up to _its_ maximum limit, not the limit of the first device in the chain.

So, if you set 10 for the virtual device and it can process bios itself (like send them to the network), then this is the number of bios in flight which are processed by _this_ device and not forwarded further. The actual number of bios you can flush into the virtual device is its own limit plus the limits of all physical devices attached to it.
--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > And it will not solve the deadlock problem in general. (Maybe it
> > works for your virtual device, but I wonder...) If the virtual
> > device allocates memory during generic_make_request then the memory
> > needs to be throttled.
>
> Daniel, if a device processes a bio by itself, it has a limit and thus it
> will wait in generic_make_request()

What will make it wait?
Re: Block device throttling [Re: Distributed storage.]
On Tue, Aug 14, 2007 at 04:13:10AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> > On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > Perhaps you never worried about the resources that the device
> > > mapper mapping function allocates to handle each bio and so did not
> > > consider this hole significant. These resources can be
> > > significant, as is the case with ddsnap. It is essential to close
> > > that window through which the virtual device's queue limit may be
> > > violated. Not doing so will allow deadlock.
> >
> > This is not a bug, this is a special kind of calculation - the total limit
> > is the number of physical devices multiplied by their limits. It was
> > done _on purpose_ to allow different devices to have different limits
> > (for example in the distributed storage project it is possible to have
> > both remote and local nodes in the same device, but the local device
> > should not have _any_ limit at all, while the network one should).
> >
> > The virtual device essentially has _no_ limit. And that was done on
> > purpose.
>
> And it will not solve the deadlock problem in general. (Maybe it works
> for your virtual device, but I wonder...) If the virtual device
> allocates memory during generic_make_request then the memory needs to
> be throttled.

Daniel, if a device processes a bio by itself, it has a limit and thus it will wait in generic_make_request(); if it queues the bio to a different device, then the same logic applies there.

If the virtual device does not process the bio, its limit will always be recharged to the underlying devices, and the overall limit is equal to the number of physical devices (or of devices which do process bios) multiplied by their limits.

This does _work_, and I showed an example of how the limits are processed and who will sleep, and where. This solution is not a narrow fix, please check the examples I showed before.
> Regards,
>
> Daniel

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Perhaps you never worried about the resources that the device
> > mapper mapping function allocates to handle each bio and so did not
> > consider this hole significant. These resources can be
> > significant, as is the case with ddsnap. It is essential to close
> > that window through which the virtual device's queue limit may be
> > violated. Not doing so will allow deadlock.
>
> This is not a bug, this is a special kind of calculation - the total limit
> is the number of physical devices multiplied by their limits. It was
> done _on purpose_ to allow different devices to have different limits
> (for example in the distributed storage project it is possible to have
> both remote and local nodes in the same device, but the local device
> should not have _any_ limit at all, while the network one should).
>
> The virtual device essentially has _no_ limit. And that was done on
> purpose.

And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> Perhaps you never worried about the resources that the device mapper
> mapping function allocates to handle each bio and so did not consider
> this hole significant. These resources can be significant, as is the
> case with ddsnap. It is essential to close that window through which
> the virtual device's queue limit may be violated. Not doing so will
> allow deadlock.

This is not a bug, this is a special kind of calculation - the total limit is the number of physical devices multiplied by their limits. It was done _on purpose_ to allow different devices to have different limits (for example in the distributed storage project it is possible to have both remote and local nodes in the same device, but the local device should not have _any_ limit at all, while the network one should).

The virtual device essentially has _no_ limit. And that was done on purpose.

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote:
> > Say you have a device mapper device with some physical device
> > sitting underneath, the classic use case for this throttle code.
> > Say 8,000 threads each submit an IO in parallel. The device mapper
> > mapping function will be called 8,000 times with associated
> > resource allocations, regardless of any throttling on the physical
> > device queue.
>
> Each thread will sleep in generic_make_request(); if the limit is
> specified correctly, then the allocated number of bios will be enough
> to make progress.

The problem is, the sleep does not occur before the virtual device mapping function is called. Let's consider two devices, a physical device named pdev and a virtual device sitting on top of it called vdev. vdev's throttle limit is just one element, but we will see that in spite of this, two bios can be handled by vdev's mapping method before any IO completes, which violates the throttling rules. According to your patch it works like this:

    Thread 1                          Thread 2

    <sees bio_queued is zero>
    vdev->q->bio_queued++
    blk_set_bdev(bio, pdev)
        vdev->bio_queued--
                                      <sees bio_queued is zero>
                                      vdev->q->bio_queued++

whoops!

Our virtual device mapping function has now allocated resources for two in-flight bios in spite of having its throttle limit set to 1.

Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close that window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 05:18:14AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > If the limit is for
> > 1gb of pending block io, and the system has for example 2gbs of ram (or
> > any other reasonable parameters), then there is no way we can deadlock
> > in allocation, since it will not force the page reclaim mechanism.
>
> The problem is that sk_alloc (called from our block driver via
> socket->write) would recurse into shrink_pages, which recursively
> submits IO to our block driver and blocks on the throttle. Subtle
> indeed, and yet another demonstration of why vm recursion is a Bad
> Thing.
>
> I will find a traceback for you tomorrow, which makes this deadlock much
> clearer.

I see how it can happen, but device throttling is a solution we are trying to complete, whose main aim _is_ to remove this problem. Lower the per-device limit, so that the rest of the RAM allows allocating all the data structures needed in the network path. The above example has just 1gb of ram, which should be enough for skbs; if it is not, decrease the limit to 500 mb and so on, until the weighted load of the system always allows forward progress.

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > > Oops, and there is also:
> > > >
> > > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > > itself deadlock. Let me see if I can remember how it goes.
> > > >
> > > > * generic_make_request puts a bio in flight
> > > > * the bio gets past the throttle and initiates network IO
> > > > * net calls sk_alloc->alloc_pages->shrink_caches
> > > > * shrink_caches submits a bio recursively to our block device
> > > > * this bio blocks on the throttle
> > > > * net may never get the memory it needs, and we are wedged
> > >
> > > If the system is in such a condition, it is already broken - the throttle
> > > limit must be lowered (next time) so as not to allow such a situation.
> >
> > Agreed that the system is broken, however lowering the throttle
> > limit gives no improvement in this case.
>
> How is it ever possible? The whole idea of throttling is to remove
> such situations, and now you say it cannot be solved.

It was solved, by not throttling writeout that comes from shrink_caches. Ugly.

> If the limit is for
> 1gb of pending block io, and the system has for example 2gbs of ram (or
> any other reasonable parameters), then there is no way we can deadlock
> in allocation, since it will not force the page reclaim mechanism.

The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing.

I will find a traceback for you tomorrow, which makes this deadlock much clearer.
Regards
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 04:18:03AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > No. Since all requests for the virtual device end up in physical devices,
> > which have limits, this mechanism works. The virtual device will
> > essentially either call generic_make_request() for the new physical
> > device (and thus will sleep if the limit is exceeded), or will process
> > bios directly, but in that case it will sleep in generic_make_request()
> > for the virtual device.
>
> What can happen is, as soon as you unthrottle the previous queue,
> another thread can come in and put another request on it. Sure, that
> thread will likely block on the physical throttle and so will the rest
> of the incoming threads, but it still allows the higher level queue to
> grow past any given limit, with the help of lots of threads. JVM for
> example?

No. You get one slot, and one thread will not be blocked; all others will. If a lucky thread wants to put in two requests, it will be blocked on the second request, since the underlying physical device does not accept requests anymore, and thus the caller will sleep.

> Say you have a device mapper device with some physical device sitting
> underneath, the classic use case for this throttle code. Say 8,000
> threads each submit an IO in parallel. The device mapper mapping
> function will be called 8,000 times with associated resource
> allocations, regardless of any throttling on the physical device queue.

Each thread will sleep in generic_make_request(); if the limit is specified correctly, then the allocated number of bios will be enough to make progress.

Here is an example: let's say the system has 20,000 pages in RAM and 20,000 in swap, and we have 8,000 threads, each one allocating a page, then the next page and so on. The system has one virtual device with two physical devices under it, and each device gets half of the requests. We set the limit to 4,000 per physical device.
All threads allocate a page and queue it to the devices, so all threads succeed in their first allocation, and each device has its queue full. The virtual device does not have a limit (or has a limit of 4,000 too, but since it was recharged each time, it has zero blocks in flight).

A new thread tries to allocate a page; it is allocated and queued to one of the devices, but since that queue is full, the thread sleeps. So will each of the others. Thus we end up with 8,000 requests allocated and queued, and 8,000 in flight, 16,000 in total, which is smaller than the number of pages in RAM, so we are happy.

Consider the above a special kind of calculation, i.e. the number of _allocated_ pages is always the number of physical devices multiplied by each one's in-flight limit. By adjusting the in-flight limit and knowing the number of devices it is completely possible to eliminate the vm deadlock.

If you do not like such a calculation, the solution is trivial: we can sleep _after_ ->make_request_fn() in generic_make_request() until the number of in-flight bios is reduced by bio_endio().

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > Oops, and there is also:
> > >
> > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > itself deadlock. Let me see if I can remember how it goes.
> > >
> > > * generic_make_request puts a bio in flight
> > > * the bio gets past the throttle and initiates network IO
> > > * net calls sk_alloc->alloc_pages->shrink_caches
> > > * shrink_caches submits a bio recursively to our block device
> > > * this bio blocks on the throttle
> > > * net may never get the memory it needs, and we are wedged
> >
> > If the system is in such a condition, it is already broken - the throttle
> > limit must be lowered (next time) so as not to allow such a situation.
>
> Agreed that the system is broken, however lowering the throttle limit
> gives no improvement in this case.

How is it ever possible? The whole idea of throttling is to remove such situations, and now you say it cannot be solved. If the limit is for 1gb of pending block io, and the system has for example 2gbs of ram (or any other reasonable parameters), then there is no way we can deadlock in allocation, since it will not force the page reclaim mechanism.

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote:
> On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > (previous incomplete message sent accidentally)
> >
> > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> >
> > > So, what did we decide? To bloat bio a bit (add a queue pointer)
> > > or to use physical device limits? The latter requires replacing
> > > every occurrence of bio->bi_bdev = something_new with
> > > blk_set_bdev(bio, something_new), where queue limits will be
> > > appropriately charged. So far I'm testing the second case, but I only
> > > changed DST for testing, can change all other users if needed
> > > though.
> >
> > Adding a queue pointer to struct bio and using physical device
> > limits as in your posted patch both suffer from the same problem:
> > you release the throttling on the previous queue when the bio moves
> > to a new one, which is a bug because memory consumption on the
> > previous queue then becomes unbounded, or limited only by the
> > number of struct requests that can be allocated. In other words,
> > it reverts to the same situation we have now as soon as the IO
> > stack has more than one queue. (Just a shorter version of my
> > previous post.)
>
> No. Since all requests for the virtual device end up in physical devices,
> which have limits, this mechanism works. The virtual device will
> essentially either call generic_make_request() for the new physical
> device (and thus will sleep if the limit is exceeded), or will process
> bios directly, but in that case it will sleep in generic_make_request()
> for the virtual device.

What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it.
Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example?

Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue.

Anyway, your approach is awfully close to being airtight, there is just a small hole. I would be more than happy to be proved wrong about that, but the more I look, the more I see that hole.

> > 1) One throttle count per submitted bio is too crude a measure. A
> > bio can carry as few as one page or as many as 256 pages. If you
> > take only
>
> It does not matter - we can count bytes, pages, bio vectors or
> whatever we like, it's just a matter of the counter and can be changed
> without problem.

Quite true. In some cases the simple inc/dec per bio works just fine. But the general case where finer granularity is required comes up in existing code, so there needs to be a plan.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > Oops, and there is also:
> >
> > 3) The bio throttle, which is supposed to prevent deadlock, can
> > itself deadlock. Let me see if I can remember how it goes.
> >
> > * generic_make_request puts a bio in flight
> > * the bio gets past the throttle and initiates network IO
> > * net calls sk_alloc->alloc_pages->shrink_caches
> > * shrink_caches submits a bio recursively to our block device
> > * this bio blocks on the throttle
> > * net may never get the memory it needs, and we are wedged
>
> If the system is in such a condition, it is already broken - the throttle
> limit must be lowered (next time) so as not to allow such a situation.

Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. This is not theoretical, but a testable, repeatable result. Instructions to reproduce should show up tomorrow.

This bug is now solved in a kludgy way. Peter's patch set offers a much cleaner way to fix this little problem, along with at least one other nasty that it already fixed.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> (previous incomplete message sent accidentally)
>
> On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> > So, what did we decide? To bloat bio a bit (add a queue pointer) or
> > to use physical device limits? The latter requires replacing all
> > occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> > something_new), where queue limits will be appropriately charged. So
> > far I'm testing the second case, but I only changed DST for testing, can
> > change all other users if needed though.
>
> Adding a queue pointer to struct bio and using physical device limits as
> in your posted patch both suffer from the same problem: you release the
> throttling on the previous queue when the bio moves to a new one, which
> is a bug because memory consumption on the previous queue then becomes
> unbounded, or limited only by the number of struct requests that can be
> allocated. In other words, it reverts to the same situation we have
> now as soon as the IO stack has more than one queue. (Just a shorter
> version of my previous post.)

No. Since all requests for the virtual device end up in physical devices, which have limits, this mechanism works. The virtual device will essentially either call generic_make_request() for the new physical device (and thus will sleep if the limit is exceeded), or will process bios directly, but in that case it will sleep in generic_make_request() for the virtual device.

> 1) One throttle count per submitted bio is too crude a measure. A bio
> can carry as few as one page or as many as 256 pages. If you take only

It does not matter - we can count bytes, pages, bio vectors or whatever we like, it's just a matter of the counter and can be changed without problem.

> 2) Exposing the per-block device throttle limits via sysfs or similar is
> really not a good long term solution for system administration.
> Imagine our help text: "just keep trying smaller numbers until your
> system deadlocks". We really need to figure this out internally and
> get it correct. I can see putting in a temporary userspace interface
> just for experimentation, to help determine what really is safe, and
> what size the numbers should be to approach optimal throughput in a
> fully loaded memory state.

Well, we already have a number of such 'supposed-to-be-automatic' variables exported to userspace, so this will not change the picture. Frankly, I do not care whether there will be any sysfs-exported tunable or not; eventually we can remove it, or not create it at all.

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Sun, Aug 12, 2007 at 11:44:00PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> On Sunday 12 August 2007 22:36, I wrote:
> > Note! There are two more issues I forgot to mention earlier.
>
> Oops, and there is also:
>
> 3) The bio throttle, which is supposed to prevent deadlock, can itself
> deadlock. Let me see if I can remember how it goes.
>
> * generic_make_request puts a bio in flight
> * the bio gets past the throttle and initiates network IO
> * net calls sk_alloc->alloc_pages->shrink_caches
> * shrink_caches submits a bio recursively to our block device
> * this bio blocks on the throttle
> * net may never get the memory it needs, and we are wedged

If the system is in such a condition, it is already broken - the throttle limit must be lowered (next time) so as not to allow such a situation.

--
Evgeniy Polyakov
Re: Block device throttling [Re: Distributed storage.]
On Sunday 12 August 2007 22:36, I wrote:
> Note! There are two more issues I forgot to mention earlier.

Oops, and there is also:

3) The bio throttle, which is supposed to prevent deadlock, can itself
deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you
can see the danger. In ddsnap we kludge around this problem by not
throttling any bio submitted in PF_MEMALLOC mode, which effectively
increases our reserve requirement by the amount of IO that the mm will
submit to a given block device before deciding the device is congested
and should be left alone. This works, but is sloppy and disgusting.
The right thing to do is to make sure that the mm knows about our
throttle accounting in backing_dev_info, so it will not push IO to our
device when it knows that the IO will just block on congestion.
Instead, shrink_caches will find some other, less congested block
device, or give up, causing alloc_pages to draw from the memalloc
reserve to satisfy the sk_alloc request. The mm already uses
backing_dev_info this way; we just need to set the right bits in the
backing_dev_info state flags. I think Peter posted a patch set that
included this feature at some point.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
(previous incomplete message sent accidentally)

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires replacing every
> occurrence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> something_new), where queue limits will be appropriately charged. So
> far I'm testing the second case, but I only changed DST for testing;
> I can change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits,
as in your posted patch, both suffer from the same problem: you release
the throttling on the previous queue when the bio moves to a new one,
which is a bug, because memory consumption on the previous queue then
becomes unbounded, or limited only by the number of struct requests
that can be allocated. In other words, it reverts to the same situation
we have now as soon as the IO stack has more than one queue. (Just a
shorter version of my previous post.)

We can solve this by having the bio only point at the queue to which it
was originally submitted, since throttling the top-level queue
automatically throttles all queues lower down the stack. Alternatively,
the bio can point at the block_device, or straight at the
backing_dev_info, which is the per-device structure it actually needs
to touch.

Note! There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure. A bio
can carry as few as one page or as many as 256 pages. If you take only
one throttle count per bio, and that data will be transferred over the
network, then you have to assume that (a little more than) 256 pages of
sk_alloc reserve will be needed for every bio, resulting in a grossly
over-provisioned reserve.
The precise reserve calculation we want to do is per block device, and
you will find hooks like this already living in backing_dev_info. We
need to place our own fn+data there to calculate the throttle draw for
each bio. Unthrottling gets trickier with a variable-size throttle
draw. In ddsnap, we simply write the amount we drew from the throttle
into (the private data of) the bio, for use later by unthrottle, thus
avoiding the issue that the bio fields we used in the calculation
might have changed during the lifetime of the bio. This would
translate into one more per-bio field.

2) Exposing the per-block-device throttle limits via sysfs or similar
is really not a good long-term solution for system administration.
Imagine our help text: "just keep trying smaller numbers until your
system deadlocks". We really need to figure this out internally and
get it correct. I can see putting in a temporary userspace interface
just for experimentation, to help determine what really is safe, and
what size the numbers should be to approach optimal throughput in a
fully loaded memory state.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> somthing_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits
as in your posted patch both suffer from the same problem: you release
the throttling on the previous queue when the bio moves to a new one,
which is a bug because memory consumption on the previous queue then
becomes unbounded, or limited only by the number of struct requests
that can be allocated. In other words, it reverts to the same situation
we have now as soon as the IO stack has more than one queue. (Just a
shorter version of my previous post.)

We can solve this by having the bio only point at the queue to which it
was originally submitted, since throttling the top level queue
automatically throttles all queues lower down the stack. Alternatively
the bio can point at the block_device or straight at the
backing_dev_info, which is the per-device structure it actually needs
to touch.

Note! There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure. A bio
can carry as few as one page or as many as 256 pages. If you take only
one throttle count per bio and that data will be transferred over the
network then you have to assume that (a little more than) 256 pages of
sk_alloc reserve will be needed for every bio, resulting in a grossly
over-provisioned reserve.

The precise reserve calculation we want to do is per-block device, and
you will find hooks like this already living in backing_dev_info. We
need to place our own fn+data there to calculate the throttle draw for
each bio. Unthrottling gets trickier with variable size throttle draw.
In ddsnap, we simply write the amount we drew from the throttle into
(the private data of) bio for use later by unthrottle, thus avoiding
the issue that the bio fields we used to calculate might have changed
during the lifetime of the bio. This would translate into one more
per-bio field.

the throttling performs another function: keeping a reasonable amount
of IO in flight for the device. The definition of "reasonable" is
complex. For a hard disk it depends on the physical distance between
sector addresses of the bios in flight. In ddsnap we make a crude but
workable approximation that

In general, a per block device

The throttle count needs to cover

Regards,

Daniel