Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages
On Friday 14 December 2007 14:51, I wrote:
> On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Note that false sharing of slab pages is still possible between two
> unrelated writeout processes, both of which obey rules for their own
> writeout path, but the pinned combination does not. This still
> leaves a hole through which a deadlock may slip.

Actually, no it doesn't. It in fact does not matter how many unrelated writeout processes, block devices, or whatevers share a slab cache. Sufficient reserve pages must have been made available (in a perfect world, by adding extra pages to the memalloc reserve on driver initialization; in the real world, just by having a big memalloc reserve) to populate the slab up to the sum of the required objects for all memalloc users sharing the cache.

So I think this slab technique of yours is fundamentally sound, that is to say, adding a new per-slab flag to keep unbounded numbers of slab objects with unbounded lifetimes from mixing with the bounded number of slab objects with bounded lifetimes.

Ponder. OK, here is another issue. Suppose a driver expands the memalloc reserve by the X pages it needs on initialization, and shrinks it by the same amount on removal, as is right and proper. The problem is that fewer than the number of slab pages pulled into the slab on behalf of the removed driver may be freed (or made freeable) back to the global reserve, due to page sharing with an unrelated user. In theory, the global reserve could be completely depleted by this slab fragmentation.

OK, that is like the case I mistakenly raised in the previous mail, though far less likely to occur, because driver removals are relatively rare, and it would take a fragmentation case so severe as to cause global reserve depletion. Even so, if this possibility bothers anybody, it is fairly easy to plug the hole: associate each slab with a given memalloc user instead of just having one bit to classify users.

So unrelated memalloc users would never share a slab, no false sharing, everybody happy. The cost: a new pointer field per slab and a few additional lines of code.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
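The "associate each slab with a given memalloc user" idea above can be sketched in a few lines. This is illustrative userland C, not the kernel slab allocator; the struct and function names are invented for the sketch. The point is that a per-slab owner pointer, rather than a single reserve bit, prevents unrelated reserve users from ever pinning each other's pages through false sharing.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a slab page: instead of one "reserve" bit, it
 * carries a pointer identifying the memalloc user it is dedicated to. */
struct slab_stub {
	const void *owner;   /* memalloc user this slab belongs to, or NULL */
};

/* Returns 1 if 'user' may take an object from 'slab': either the slab
 * is unclaimed (first reserve user claims it) or it already belongs to
 * the same user. Unrelated users never share a slab, so no false
 * sharing can pin a reserve page indefinitely. */
static int slab_may_serve(struct slab_stub *slab, const void *user)
{
	if (slab->owner == NULL) {
		slab->owner = user;
		return 1;
	}
	return slab->owner == user;
}
```

The cost matches what the mail estimates: one pointer per slab plus a comparison on the allocation path.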
Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to
> allocation contexts that are entitled to it. This is done to ensure
> reserve pages don't leak out and get consumed.

Tighter definitions of "leak out" and "get consumed" would be helpful here. As I see it, the chain of reasoning is:

* Any MEMALLOC mode allocation must have come from a properly throttled path and has a finite lifetime that must eventually produce writeout progress.

* Since the transaction that made the allocation was throttled and must have a finite lifetime, we know that it must eventually return the resources it consumed to the appropriate resource pool.

Now, I think what you mean by "get consumed" and "leak out" is: "become pinned by false sharing with other allocations that do not guarantee that they will be returned to the resource pool". We can say "pinned" for short. So you are attempting to prevent slab pages from becoming pinned by users that do not obey the reserve management rules, which I think your approach achieves. However...

Note that false sharing of slab pages is still possible between two unrelated writeout processes, both of which obey rules for their own writeout path, but the pinned combination does not. This still leaves a hole through which a deadlock may slip.

My original solution was simply to allocate a full page when drawing from the memalloc reserve, which may use a tad more reserve, but makes it possible to prove the algorithm correct.

Regards,

Daniel
Re: [PATCH 04/29] mm: kmem_estimate_pages()
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Provide a method to get the upper bound on the pages needed to
> allocate a given number of objects from a given kmem_cache.
>
> This lays the foundation for a generic reserve framework as presented
> in a later patch in this series. This framework needs to convert
> object demand (kmalloc() bytes, kmem_cache_alloc() objects) to pages.

And hence the big idea that all reserve accounting can be done in units of pages, allowing the use of a single global reserve that already exists. The other big idea here is that reserve accounting can be independent of the actual resource allocations. This is a powerful idea which we may not have explained clearly yet.

Daniel
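The object-demand-to-pages conversion that this patch provides can be sketched roughly as follows. This is a userland simplification, not the actual kmem_estimate_pages() from the patch series: it assumes single-page slabs and ignores per-slab metadata, alignment, and multi-page orders, all of which the real function must account for when computing a true upper bound.

```c
/* Worst-case page count for 'count' objects of 'objsize' bytes,
 * assuming one slab per page and no slab sharing between requests.
 * A real estimate must also cover slab management overhead and
 * higher-order slabs; this sketch shows only the core rounding. */
#define PAGE_SIZE 4096u

static unsigned estimate_pages(unsigned objsize, unsigned count)
{
	unsigned per_page = PAGE_SIZE / objsize;

	if (per_page == 0)      /* object larger than a page */
		per_page = 1;
	/* round up: partially filled slabs still cost a whole page */
	return (count + per_page - 1) / per_page;
}
```

Because the result is in pages, the demand from kmalloc() bytes and kmem_cache_alloc() objects can both be charged against the same page-based global reserve, which is exactly the point made above.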
Re: [PATCH 16/29] netvm: INET reserves.
Hi Peter,

sysctl_intvec_fragment, proc_dointvec_fragment, sysctl_intvec_fragment seem to suffer from cut-n-pastitis.

Regards,

Daniel
Re: [PATCH 00/29] Swap over NFS -v15
Hi Peter,

A major feature of this patch set is the network receive deadlock avoidance, but there is quite a bit of stuff bundled with it, the NFS support accounting for a big part of the patch by itself. Is it possible to provide a before-and-after demonstration case for just the network receive deadlock part, given a subset of the patch set and a user space recipe that anybody can try?

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote:
> On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> > Resubmitting a bio or submitting a dependent bio from
> > inside a block driver does not need to be throttled because all
> > resources required to guarantee completion must have been obtained
> > _before_ the bio was allowed to proceed into the block layer.
>
> I'm toying with the idea of keeping track of the maximum device stack
> depth for each stacked device, and only permitting it to increase in
> controlled circumstances.

Hi Alasdair,

What kind of circumstances did you have in mind?

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> Then, if of course you will want, which I doubt, you can reread
> previous mails and find that it was pointed to that race and
> possibilities to solve it way too long ago.

What still bothers me about your response is that, while you know the race exists and do not disagree with my example, you don't seem to see that the race can eventually lock up the block device by repeatedly losing throttle counts which are never recovered. What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
> > +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> >   */
> >  void generic_make_request(struct bio *bio)
> >  {
> > +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > +	if (q && q->metric) {
> > +		int need = bio->bi_reserved = q->metric(bio);
> > +		bio->queue = q;
>
> In case you have stacked device, this entry will be rewritten and you
> will lost all your account data.

It is a weakness all right. Well,

-	if (q && q->metric) {
+	if (q && q->metric && !bio->queue) {

which fixes that problem. Maybe there is a better fix possible. Thanks for the catch!

The original conception was that this block throttling would apply only to the highest level submission of the bio, the one that crosses the boundary between filesystem (or direct block device application) and block layer. Resubmitting a bio or submitting a dependent bio from inside a block driver does not need to be throttled, because all resources required to guarantee completion must have been obtained _before_ the bio was allowed to proceed into the block layer.

The other principle we are trying to satisfy is that the throttling should not be released until bio->endio, which I am not completely sure about with the patch as modified above.

Your earlier idea of having the throttle protection only cover the actual bio submission is interesting and may be effective in some cases; in fact it may cover the specific case of ddsnap. But we don't have to look any further than ddraid (distributed raid) to find a case it doesn't cover: the additional memory allocated to hold parity data has to be reserved until the parity data is deallocated, long after the submission completes. So while you manage to avoid some logistical difficulties, it also looks like you didn't solve the general problem.

Hopefully I will be able to report on whether my patch actually works soon, when I get back from vacation. The mechanism in ddsnap that this is supposed to replace is effective, it is just ugly and tricky to verify.

Regards,

Daniel
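The "charge only at the top-level submission" principle from this exchange can be illustrated with a small userland sketch. The struct and function names here are invented (they are not the kernel's bio or request_queue); the point is the once-only guard: the throttle is charged when the bio first enters the block layer, a resubmission from inside a stacked driver is a no-op, and the charge is released only at completion.

```c
#include <assert.h>
#include <stddef.h>

struct queue_stub { int available; };          /* remaining throttle capacity */

struct bio_stub {
	struct queue_stub *queue;  /* set on first submission, never rewritten */
	int reserved;              /* units charged against queue->available  */
};

/* Charge the throttle only on the highest-level submission. A real
 * implementation would sleep here when 'available' is short rather
 * than proceed; this sketch only tracks the accounting. */
static void submit(struct queue_stub *q, struct bio_stub *bio, int need)
{
	if (bio->queue == NULL) {      /* the !bio->queue guard */
		bio->queue = q;
		bio->reserved = need;
		q->available -= need;
	}
	/* resubmission from inside a driver: already charged, no-op */
}

/* Release the charge at completion time, not at remap time. */
static void endio(struct bio_stub *bio)
{
	if (bio->reserved) {
		bio->queue->available += bio->reserved;
		bio->reserved = 0;
	}
}
```

This keeps the charge alive from generic_make_request through ->endio, which is the second principle the mail states: resources (like ddraid's parity pages) stay covered until the bio actually completes, not merely until it is passed down the stack.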
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No. Please go and read the description of the race again. A count gets irretrievably lost because the write operation of the first decrement is overwritten by the second. Data gets lost. Atomic operations exist to prevent that sort of thing. You either need to use them, or have a deep understanding of SMP read and write ordering in order to preserve data integrity by some equivalent algorithm.

> Let's solve problems in order of their appearence. If bio structure
> will be allowed to grow, then the whole patches can be done better.

How about like the patch below. This throttles any block driver by implementing a throttle metric method so that each block driver can keep track of its own resource consumption in units of its choosing. As an (important) example, it implements a simple metric for device mapper devices. Other block devices will work as before, because they do not define any metric.

Short, sweet and untested, which is why I have not posted it until now. This patch originally kept its accounting info in backing_dev_info, however that structure seems to be in some flux, and it is just a part of struct queue anyway, so I lifted the throttle accounting up into struct queue. We should be able to report on the efficacy of this patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+	if (q && q->metric) {
+		int need = bio->bi_reserved = q->metric(bio);
+		bio->queue = q;
+		wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+		atomic_sub(need, &q->available);
+	}
+
 	if (current->bio_tail) {
 		/* make_request is active */
 		*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c	2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
 	return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+	return bio->bi_vcnt;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
 		goto bad1_free_minor;
 
 	md->queue->queuedata = md;
+	md->queue->metric = dm_metric;
+	atomic_set(&md->queue->available, md->queue->capacity = 1000);
+	init_waitqueue_head(&md->queue->throttle_wait);
+
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c	2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c	2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
 		bytes_done = bio->bi_size;
 	}
 
-	bio->bi_size -= bytes_done;
+	if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+		struct request_queue *q = bio->queue;
+		atomic_add(bio->bi_reserved, &q->available);
+		bio->bi_reserved = 0; /* just in case */
+		wake_up(&q->throttle_wait);
+	}
 	bio->bi_sector += (bytes_done >> 9);
 
 	if (bio->bi_end_io)
Re: [1/1] Block device throttling [Re: Distributed storage.]
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All operations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > +	q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1				cpu2
> >
> > read q->bio_queued
> > 				q->bio_queued--
> > write q->bio_queued - 1
> >
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since we are already in the
> grey area.

This is not just inaccurate, it is suicide. Keep leaking throttle counts and eventually all of them will be gone. No more IO on that block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue. Yes, I think we need a nice simple approach like that, and prove it is stable before worrying about the size cost.

Regards,

Daniel
Re: [1/1] Block device throttling [Re: Distributed storage.]
Say Evgeniy, something I was curious about but forgot to ask you earlier...

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> ...All operations are not atomic, since we do not care about precise
> number of bios, but a fact, that we are close or close enough to the
> limit.
> ... in bio->endio
> +	q->bio_queued--;

In your proposed patch, what prevents the race:

cpu1				cpu2

read q->bio_queued
				q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count.

Regards,

Daniel
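The fix Daniel is pressing for, replacing the racy read-modify-write with an atomic operation, can be demonstrated in userland with C11 stdatomic. This is not kernel code (the kernel equivalent would be atomic_dec(&q->bio_queued)); it is a sketch showing that when every increment and decrement is atomic, the pairing is exact and no throttle counts can ever be leaked, regardless of how many CPUs race.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 8
#define NOPS     100000

/* The throttle counter. With a plain int, two threads could both read
 * the same value and both write value-1, losing a count, which is the
 * race diagrammed above. atomic_fetch_add/sub cannot lose updates. */
static atomic_int bio_queued;

static void *submit_and_complete(void *arg)
{
	(void)arg;
	for (int i = 0; i < NOPS; i++) {
		atomic_fetch_add(&bio_queued, 1);   /* submit: charge throttle  */
		atomic_fetch_sub(&bio_queued, 1);   /* endio: release the charge */
	}
	return NULL;
}

/* Run NTHREADS racing workers; returns the final counter value, which
 * must be exactly zero since every charge is paired with a release. */
int run_throttle_demo(void)
{
	pthread_t t[NTHREADS];

	atomic_store(&bio_queued, 0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, submit_and_complete, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&bio_queued);
}
```

With the non-atomic version, the final value would drift upward over time as decrements are overwritten, which is exactly the slow starvation ("no more IO on that block device") described in the follow-up mail.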
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote:
> > The throttling of the virtual device must begin in
> > generic_make_request and last to ->endio. You release the throttle
> > of the virtual device at the point you remap the bio to an
> > underlying device, which you have convinced yourself is ok, but it
> > is not. You seem to miss the fact that whatever resources the
> > virtual device has allocated are no longer protected by the
> > throttle count *of the virtual device*, or you do not
>
> Because it is charged to another device.

Great. You charged the resource to another device, but you did not limit the amount of resources that the first device can consume. Which misses the whole point.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > And it will not solve the deadlock problem in general. (Maybe
> > > > it works for your virtual device, but I wonder...) If the
> > > > virtual device allocates memory during generic_make_request
> > > > then the memory needs to be throttled.
> > >
> > > Daniel, if device process bio by itself, it has a limit and thus
> > > it will wait in generic_make_request()
> >
> > What will make it wait?
>
> generic_make_request() for given block device.

Not good enough, that only makes one thread wait. Look here:

http://lkml.org/lkml/2007/8/13/788

An unlimited number of threads can come in, each consuming resources of the virtual device, and violating the throttling rules.

The throttling of the virtual device must begin in generic_make_request and last to ->endio. You release the throttle of the virtual device at the point you remap the bio to an underlying device, which you have convinced yourself is ok, but it is not. You seem to miss the fact that whatever resources the virtual device has allocated are no longer protected by the throttle count *of the virtual device*, or you do not see why that is a bad thing. It is a very bad thing, roughly like leaving some shared data outside a spin_lock/unlock.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > And it will not solve the deadlock problem in general. (Maybe it
> > works for your virtual device, but I wonder...) If the virtual
> > device allocates memory during generic_make_request then the memory
> > needs to be throttled.
>
> Daniel, if device process bio by itself, it has a limit and thus it
> will wait in generic_make_request()

What will make it wait?
Re: Block device throttling [Re: Distributed storage.]
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Perhaps you never worried about the resources that the device
> > mapper mapping function allocates to handle each bio and so did not
> > consider this hole significant. These resources can be
> > significant, as is the case with ddsnap. It is essential to close
> > the window through which the virtual device's queue limit may be
> > violated. Not doing so will allow deadlock.
>
> This is not a bug, this is special kind of calculation - total limit
> is number of physical devices multiplied by theirs limits. It was
> done _on purpose_ to allow different device to have different limits
> (for example in distributed storage project it is possible to have
> both remote and local node in the same device, but local device
> should not have _any_ limit at all, but network one should).
>
> Virtual device essentially has _no_ limit. And that was done on
> purpose.

And it will not solve the deadlock problem in general. (Maybe it works for your virtual device, but I wonder...) If the virtual device allocates memory during generic_make_request then the memory needs to be throttled.

Regards,

Daniel
Re: Distributed storage.
On Monday 13 August 2007 02:12, Jens Axboe wrote:
> > It is a system wide problem. Every block device needs throttling,
> > otherwise queues expand without limit. Currently, block devices
> > that use the standard request library get a slipshod form of
> > throttling for free in the form of limiting in-flight request
> > structs. Because the amount of IO carried by a single request can
> > vary by two orders of magnitude, the system behavior of this
> > approach is far from predictable.
>
> Is it? Consider just 10 standard sata disks. The next kernel revision
> will have sg chaining support, so that allows 32MiB per request. Even
> if we disregard reads (not so interesting in this discussion) and
> just look at potentially pinned dirty data in a single queue, that
> number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch.
>
> So I still think that this throttling needs to happen elsewhere, you
> cannot rely the block layer throttling globally or for a single
> device. It just doesn't make sense.

You are right, so long as the unit of throttle accounting remains one request. This is not what we do in ddsnap. Instead we inc/dec the throttle counter by the number of bvecs in each bio, which produces a nice steady data flow to the disk under a wide variety of loads, and provides the memory resource bound we require.

One throttle count per bvec will not be the right throttling metric for every driver. To customize this accounting metric for a given driver we already have the backing_dev_info structure, which provides per-device-instance accounting functions and instance data. Perfect! This allows us to factor the throttling mechanism out of the driver, so the only thing the driver has to do is define the throttle accounting if it needs a custom one.

We can avoid affecting the traditional behavior quite easily: for example, if backing_dev_info->throttle_fn (a new method) is null, then either do not throttle at all (and rely on the struct request in-flight limit), or move the in-flight request throttling logic into core as the default throttling method, simplifying the request library and not changing its behavior.

> > These deadlocks are first and foremost, block layer deficiencies.
> > Even the network becomes part of the problem only because it lies
> > in the block IO path.
>
> The block layer has NEVER guaranteed throttling, so it can - by
> definition - not be a block layer deficiency.

The block layer has always been deficient by not providing accurate throttling, or any throttling at all for some devices. We have practical proof that this causes deadlock and a good theoretical basis for describing exactly how it happens. To be sure, vm and net are co-conspirators, however the block layer really is the main actor in this little drama.

Regards,

Daniel
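The per-device metric scheme described above (one throttle count per bvec in ddsnap, but pluggable per driver) can be sketched in userland C. The structures and names here are invented stand-ins, not the kernel's bio or backing_dev_info; the sketch shows the division of labor: the driver supplies only a metric function, and generic code does the charge on submission and the release on completion.

```c
#include <assert.h>
#include <stddef.h>

struct bio_stub { int bi_vcnt; };   /* number of bvecs in the bio */

/* Per-device throttle state with a pluggable cost metric, analogous
 * to the proposed backing_dev_info->throttle_fn hook. */
struct throttle {
	int available;                        /* remaining capacity         */
	int (*metric)(struct bio_stub *bio);  /* device-chosen unit of cost */
};

/* ddsnap-style metric: one throttle count per bvec, so a 1-page bio
 * and a 256-page bio are charged proportionally to the memory they pin. */
static int bvec_metric(struct bio_stub *bio)
{
	return bio->bi_vcnt;
}

/* Generic charge on submission. Returns 0 on success, -1 if the bio
 * would exceed the device's bound; a real implementation would sleep
 * here until capacity is released instead of failing. */
static int throttle_charge(struct throttle *t, struct bio_stub *bio)
{
	int need = t->metric(bio);

	if (t->available < need)
		return -1;
	t->available -= need;
	return 0;
}

/* Generic release, called from the completion (->endio) path. */
static void throttle_release(struct throttle *t, struct bio_stub *bio)
{
	t->available += t->metric(bio);
}
```

A driver that is happy with the default (in-flight request counting) simply leaves the metric unset; only drivers with unusual resource profiles, like ddsnap, define their own.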
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote:
> > Say you have a device mapper device with some physical device
> > sitting underneath, the classic use case for this throttle code.
> > Say 8,000 threads each submit an IO in parallel. The device mapper
> > mapping function will be called 8,000 times with associated
> > resource allocations, regardless of any throttling on the physical
> > device queue.
>
> Each thread will sleep in generic_make_request(), if limit is
> specified correctly, then allocated number of bios will be enough to
> have a progress.

The problem is, the sleep does not occur before the virtual device mapping function is called. Let's consider two devices, a physical device named pdev and a virtual device sitting on top of it called vdev. vdev's throttle limit is just one element, but we will see that in spite of this, two bios can be handled by vdev's mapping method before any IO completes, which violates the throttling rules. According to your patch it works like this:

Thread 1				Thread 2

<vdev->q->bio_queued is zero>
vdev->q->bio_queued++
blk_set_bdev(bio, pdev)
    vdev->q->bio_queued--
					<vdev->q->bio_queued is zero>
					vdev->q->bio_queued++

Whoops! Our virtual device mapping function has now allocated resources for two in-flight bios, in spite of having its throttle limit set to 1.

Perhaps you never worried about the resources that the device mapper mapping function allocates to handle each bio and so did not consider this hole significant. These resources can be significant, as is the case with ddsnap. It is essential to close the window through which the virtual device's queue limit may be violated. Not doing so will allow deadlock.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > > Oops, and there is also:
> > > >
> > > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > > itself deadlock. Let me see if I can remember how it goes.
> > > >
> > > > * generic_make_request puts a bio in flight
> > > > * the bio gets past the throttle and initiates network IO
> > > > * net calls sk_alloc->alloc_pages->shrink_caches
> > > > * shrink_caches submits a bio recursively to our block device
> > > > * this bio blocks on the throttle
> > > > * net may never get the memory it needs, and we are wedged
> > >
> > > If system is in such condition, it is already broken - throttle
> > > limit must be lowered (next time) not to allow such situation.
> >
> > Agreed that the system is broken, however lowering the throttle
> > limit gives no improvement in this case.
>
> How is it ever possible? The whole idea of throttling is to remove
> such situation, and now you say it can not be solved.

It was solved, by not throttling writeout that comes from shrink_caches. Ugly.

> If limit is for
> 1gb of pending block io, and system has for example 2gbs of ram (or
> any other resonable parameters), then there is no way we can deadlock
> in allocation, since it will not force page reclaim mechanism.

The problem is that sk_alloc (called from our block driver via socket->write) would recurse into shrink_pages, which recursively submits IO to our block driver and blocks on the throttle. Subtle indeed, and yet another demonstration of why vm recursion is a Bad Thing. I will find a traceback for you tomorrow, which makes this deadlock much clearer.

Regards
Re: Distributed storage.
On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > This is not a very good solution, since it requires all users of
> > > the bios to know how to free it.
> >
> > No, only the specific ->endio needs to know that, which is set by
> > the bio owner, so this knowledge lies in exactly the right place.
> > A small handful of generic endios all with the same destructor are
> > used nearly everywhere.
>
> That is what I meant - there will be no way to just alloc a bio and
> put it, helpers for generic bio sets must be exported and each and
> every bi_end_io() must be changed to check reference counter and they
> must know how they were allocated.

There are fewer non-generic bio allocators than you think.

> Endio callback is of course quite rare and additional atomic
> reading will not kill the system, but why introduce another read?
> It is possible to provide a flag for endio callback that it is last,
> but it still requires to change every single callback - why do we
> want this?

We don't. Struct bio does not need to be shrunk. Jens wanted to talk about what fields could be eliminated if we wanted to shrink it. It is about time to let that lie, don't you think?

> So, I'm a bit lost...
>
> You say it is too big

Did not say that.

> and some parts can be removed or combined

True.

> and then that size does not matter.

Also true, backed up by numbers on real systems.

> Last/not-last checks in the code is
> not clear design, so I do not see why it is needed at all if not for
> size shrinking.

Not needed, indeed. Accurate throttling is needed. If the best way to throttle requires expanding struct bio a little then we should not let concerns about the cost of an int or two stand in the way. Like Jens, I am more concerned about the complexity cost, and that is minimized in my opinion by throttling in the generic code rather than with custom code in each specialized block driver.

Your patch does throttle in the generic code, great. Next thing is to be sure that it completely closes the window for reserve leakage, which is not yet clear.

Regards,

Daniel
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote: > On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > (previous incomplete message sent accidentally) > > > > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: > > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: > > > > > > So, what did we decide? To bloat bio a bit (add a queue pointer) > > > or to use physical device limits? The latter requires to replace > > > all occurence of bio->bi_bdev = something_new with > > > blk_set_bdev(bio, somthing_new), where queue limits will be > > > appropriately charged. So far I'm testing second case, but I only > > > changed DST for testing, can change all other users if needed > > > though. > > > > Adding a queue pointer to struct bio and using physical device > > limits as in your posted patch both suffer from the same problem: > > you release the throttling on the previous queue when the bio moves > > to a new one, which is a bug because memory consumption on the > > previous queue then becomes unbounded, or limited only by the > > number of struct requests that can be allocated. In other words, > > it reverts to the same situation we have now as soon as the IO > > stack has more than one queue. (Just a shorter version of my > > previous post.) > > No. Since all requests for virtual device end up in physical devices, > which have limits, this mechanism works. Virtual device will > essentially call either generic_make_request() for new physical > device (and thus will sleep is limit is over), or will process bios > directly, but in that case it will sleep in generic_make_request() > for virutal device. What can happen is, as soon as you unthrottle the previous queue, another thread can come in and put another request on it. 
Sure, that thread will likely block on the physical throttle and so will the rest of the incoming threads, but it still allows the higher level queue to grow past any given limit, with the help of lots of threads. JVM for example? Say you have a device mapper device with some physical device sitting underneath, the classic use case for this throttle code. Say 8,000 threads each submit an IO in parallel. The device mapper mapping function will be called 8,000 times with associated resource allocations, regardless of any throttling on the physical device queue. Anyway, your approach is awfully close to being airtight, there is just a small hole. I would be more than happy to be proved wrong about that, but the more I look, the more I see that hole. > > 1) One throttle count per submitted bio is too crude a measure. A > > bio can carry as few as one page or as many as 256 pages. If you > > take only > > It does not matter - we can count bytes, pages, bio vectors or > whatever we like, its just a matter of counter and can be changed > without problem. Quite true. In some cases the simple inc/dec per bio works just fine. But the general case where finer granularity is required comes up in existing code, so there needs to be a plan. Regards, Daniel
Re: Block device throttling [Re: Distributed storage.]
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote: > > Oops, and there is also: > > > > 3) The bio throttle, which is supposed to prevent deadlock, can > > itself deadlock. Let me see if I can remember how it goes. > > > > * generic_make_request puts a bio in flight > > * the bio gets past the throttle and initiates network IO > > * net calls sk_alloc->alloc_pages->shrink_caches > > * shrink_caches submits a bio recursively to our block device > > * this bio blocks on the throttle > > * net may never get the memory it needs, and we are wedged > > If system is in such condition, it is already broken - throttle limit > must be lowered (next time) not to allow such situation. Agreed that the system is broken, however lowering the throttle limit gives no improvement in this case. This is not theoretical, but a testable, repeatable result. Instructions to reproduce should show up tomorrow. This bug is now solved in a kludgy way. Now, Peter's patch set offers a much cleaner way to fix this little problem, along with at least one other nasty that it already fixed. Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 03:22, Jens Axboe wrote: > I never compared the bio to struct page, I'd obviously agree that > shrinking struct page was a worthy goal and that it'd be ok to uglify > some code to do that. The same isn't true for struct bio. I thought I just said that. Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 03:06, Jens Axboe wrote: > On Mon, Aug 13 2007, Daniel Phillips wrote: > > Of course not. Nothing I said stops endio from being called in the > > usual way as well. For this to work, endio just needs to know that > > one call means "end" and the other means "destroy", this is > > trivial. > > Sorry Daniel, but your suggestions would do nothing more than uglify > the code and design. Pretty much exactly what was said about shrinking struct page, ask Bill. The difference was, shrinking struct page actually mattered whereas shrinking struct bio does not, and neither does expanding it by a few bytes. Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote: > On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: > > > But that idea fails as well, since reference counts and IO > > > completion are two completely seperate entities. So unless end IO > > > just happens to be the last user holding a reference to the bio, > > > you cannot free it. > > > > That is not a problem. When bio_put hits zero it calls ->endio > > instead of the destructor. The ->endio sees that the count is zero > > and destroys the bio. > > This is not a very good solution, since it requires all users of the > bios to know how to free it. No, only the specific ->endio needs to know that, which is set by the bio owner, so this knowledge lies in exactly the right place. A small handful of generic endios all with the same destructor are used nearly everywhere. > Right now it is hidden. > And adds additional atomic check (although reading is quite fast) in > the end_io. Actual endio happens once in the lifetime of the transfer, this read will be entirely lost in the noise. > And for what purpose? To eat 8 bytes on 64bit platform? > This will not reduce its size noticebly, so the same number of bios > will be in the cache's page, so what is a gain? All this cleanups and > logic complicatins should be performed only if after size shring > increased number of bios can fit into cache's page, will it be done > after such cleanups? Well, exactly. My point from the beginning was that the size of struct bio is not even close to being a problem and adding a few bytes to it in the interest of doing the cleanest fix to a core kernel bug is just not a dominant issue. I suppose that leaving out the word "bloated" and skipping straight to the "doesn't matter" proof would have saved some bandwidth.
Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 02:13, Jens Axboe wrote: > On Mon, Aug 13 2007, Daniel Phillips wrote: > > On Monday 13 August 2007 00:45, Jens Axboe wrote: > > > On Mon, Aug 13 2007, Jens Axboe wrote: > > > > > You did not comment on the one about putting the bio > > > > > destructor in the ->endio handler, which looks dead simple. > > > > > The majority of cases just use the default endio handler and > > > > > the default destructor. Of the remaining cases, where a > > > > > specialized destructor is needed, typically a specialized > > > > > endio handler is too, so combining is free. There are few if > > > > > any cases where a new specialized endio handler would need to > > > > > be written. > > > > > > > > We could do that without too much work, I agree. > > > > > > But that idea fails as well, since reference counts and IO > > > completion are two completely seperate entities. So unless end IO > > > just happens to be the last user holding a reference to the bio, > > > you cannot free it. > > > > That is not a problem. When bio_put hits zero it calls ->endio > > instead of the destructor. The ->endio sees that the count is zero > > and destroys the bio. > > You can't be serious? You'd stall end io completion notification > because someone holds a reference to a bio. Of course not. Nothing I said stops endio from being called in the usual way as well. For this to work, endio just needs to know that one call means "end" and the other means "destroy", this is trivial. Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 00:45, Jens Axboe wrote: > On Mon, Aug 13 2007, Jens Axboe wrote: > > > You did not comment on the one about putting the bio destructor > > > in the ->endio handler, which looks dead simple. The majority of > > > cases just use the default endio handler and the default > > > destructor. Of the remaining cases, where a specialized > > > destructor is needed, typically a specialized endio handler is > > > too, so combining is free. There are few if any cases where a > > > new specialized endio handler would need to be written. > > > > We could do that without too much work, I agree. > > But that idea fails as well, since reference counts and IO completion > are two completely seperate entities. So unless end IO just happens > to be the last user holding a reference to the bio, you cannot free > it. That is not a problem. When bio_put hits zero it calls ->endio instead of the destructor. The ->endio sees that the count is zero and destroys the bio. Regards, Daniel
Re: Distributed storage.
On Monday 13 August 2007 00:28, Jens Axboe wrote: > On Sun, Aug 12 2007, Daniel Phillips wrote: > > Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can > > derive efficiently from BIO_POOL_IDX() provided the bio was > > allocated in the standard way. > > That would only be feasible, if we ruled that any bio in the system > must originate from the standard pools. Not at all. > > This leaves a little bit of clean up to do for bios not allocated > > from a standard pool. > > Please suggest how to do such a cleanup. Easy, use the BIO_POOL bits to know the bi_max_vecs, the same as for a bio from the standard pool. Just put the power of two size in the bits and map that number to the standard pool arrangement with a table lookup. > > On the other hand, vm writeout deadlock ranks smack dab at the top > > of the list, so that is where the patching effort must go for the > > forseeable future. Without bio throttling, the ddsnap load can go > > to 24 MB for struct bio alone. That definitely moves the needle. > > in short, we save 3,200 times more memory by putting decent > > throttling in place than by saving an int in struct bio. > > Then fix the damn vm writeout. I always thought it was silly to > depend on the block layer for any sort of throttling. If it's not a > system wide problem, then throttle the io count in the > make_request_fn handler of that problematic driver. It is a system wide problem. Every block device needs throttling, otherwise queues expand without limit. Currently, block devices that use the standard request library get a slipshod form of throttling for free in the form of limiting in-flight request structs. Because the amount of IO carried by a single request can vary by two orders of magnitude, the system behavior of this approach is far from predictable. > > You did not comment on the one about putting the bio destructor in > > the ->endio handler, which looks dead simple.
The majority of > > cases just use the default endio handler and the default > > destructor. Of the remaining cases, where a specialized destructor > > is needed, typically a specialized endio handler is too, so > > combining is free. There are few if any cases where a new > > specialized endio handler would need to be written. > > We could do that without too much work, I agree. OK, we got one and another is close to cracking, enough of that. > > As far as code stability goes, current kernels are horribly > > unstable in a variety of contexts because of memory deadlock and > > slowdowns related to the attempt to fix the problem via dirty > > memory limits. Accurate throttling of bio traffic is one of the > > two key requirements to fix this instability, the other is > > accurate writeout path reserve management, which is only partially > > addressed by BIO_POOL. > > Which, as written above and stated many times over the years on lkml, > is not a block layer issue imho. Whoever stated that was wrong, but this should be no surprise. There have been many wrong things said about this particular bug over the years. The one thing that remains constant is, Linux continues to deadlock under a variety of loads both with and without network involvement, making it effectively useless as a storage platform. These deadlocks are, first and foremost, block layer deficiencies. Even the network becomes part of the problem only because it lies in the block IO path. Regards, Daniel
Re: Block device throttling [Re: Distributed storage.]
On Sunday 12 August 2007 22:36, I wrote: > Note! There are two more issues I forgot to mention earlier. Oops, and there is also: 3) The bio throttle, which is supposed to prevent deadlock, can itself deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you can see the danger. In ddsnap we kludge around this problem by not throttling any bio submitted in PF_MEMALLOC mode, which effectively increases our reserve requirement by the amount of IO that mm will submit to a given block device before deciding the device is congested and should be left alone. This works, but is sloppy and disgusting. The right thing to do is to make sure that the mm knows about our throttle accounting in backing_dev_info so it will not push IO to our device when it knows that the IO will just block on congestion. Instead, shrink_caches will find some other less congested block device or give up, causing alloc_pages to draw from the memalloc reserve to satisfy the sk_alloc request. The mm already uses backing_dev_info this way, we just need to set the right bits in the backing_dev_info state flags. I think Peter posted a patch set that included this feature at some point. Regards, Daniel
Re: Block device throttling [Re: Distributed storage.]
(previous incomplete message sent accidentally) On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote: > > So, what did we decide? To bloat bio a bit (add a queue pointer) or > to use physical device limits? The latter requires to replace all > occurence of bio->bi_bdev = something_new with blk_set_bdev(bio, > somthing_new), where queue limits will be appropriately charged. So > far I'm testing second case, but I only changed DST for testing, can > change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. 
The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate might have changed during the lifetime of the bio. This would translate into one more per-bio field. 2) Exposing the per-block device throttle limits via sysfs or similar is really not a good long term solution for system administration. Imagine our help text: "just keep trying smaller numbers until your system deadlocks". We really need to figure this out internally and get it correct. I can see putting in a temporary userspace interface just for experimentation, to help determine what really is safe, and what size the numbers should be to approach optimal throughput in a fully loaded memory state. Regards, Daniel
Re: Block device throttling [Re: Distributed storage.]
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote: > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([EMAIL PROTECTED]) wrote: > > So, what did we decide? To bloat bio a bit (add a queue pointer) or > to use physical device limits? The latter requires to replace all > occurence of bio->bi_bdev = something_new with blk_set_bdev(bio, > somthing_new), where queue limits will be appropriately charged. So > far I'm testing second case, but I only changed DST for testing, can > change all other users if needed though. Adding a queue pointer to struct bio and using physical device limits as in your posted patch both suffer from the same problem: you release the throttling on the previous queue when the bio moves to a new one, which is a bug because memory consumption on the previous queue then becomes unbounded, or limited only by the number of struct requests that can be allocated. In other words, it reverts to the same situation we have now as soon as the IO stack has more than one queue. (Just a shorter version of my previous post.) We can solve this by having the bio only point at the queue to which it was originally submitted, since throttling the top level queue automatically throttles all queues lower down the stack. Alternatively the bio can point at the block_device or straight at the backing_dev_info, which is the per-device structure it actually needs to touch. Note! There are two more issues I forgot to mention earlier. 1) One throttle count per submitted bio is too crude a measure. A bio can carry as few as one page or as many as 256 pages. If you take only one throttle count per bio and that data will be transferred over the network then you have to assume that (a little more than) 256 pages of sk_alloc reserve will be needed for every bio, resulting in a grossly over-provisioned reserve. The precise reserve calculation we want to do is per-block device, and you will find hooks like this already living in backing_dev_info. 
We need to place our own fn+data there to calculate the throttle draw for each bio. Unthrottling gets trickier with variable size throttle draw. In ddsnap, we simply write the amount we drew from the throttle into (the private data of) bio for use later by unthrottle, thus avoiding the issue that the bio fields we used to calculate might have changed during the lifetime of the bio. This would translate into one more per-bio field. the throttling performs another function: keeping a reasonable amount of IO in flight for the device. The definition of "reasonable" is complex. For a hard disk it depends on the physical distance between sector addresses of the bios in flight. In ddsnap we make a crude but workable approximation that In general, a per block device The throttle count needs to cover Regards, Daniel
Re: Distributed storage.
On Tuesday 07 August 2007 13:55, Jens Axboe wrote: > I don't like structure bloat, but I do like nice design. Overloading > is a necessary evil sometimes, though. Even today, there isn't enough > room to hold bi_rw and bi_flags in the same variable on 32-bit archs, > so that concern can be scratched. If you read bio.h, that much is > obvious. Sixteen bits in bi_rw are consumed by queue priority. Is there a reason this lives in struct bio instead of struct request? > If you check up on the iommu virtual merging, you'll understand the > front and back size members. They may smell dubious to you, but > please take the time to understand why it looks the way it does. Virtual merging is only needed at the physical device, so why do these fields live in struct bio instead of struct request? > Changing the number of bvecs is integral to how bio buildup current > works. Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can derive efficiently from BIO_POOL_IDX() provided the bio was allocated in the standard way. This leaves a little bit of clean up to do for bios not allocated from a standard pool. Incidentally, why does the bvl need to be memset to zero on allocation? bi_vcnt already tells you which bvecs are valid and the only field in a bvec that can reasonably default to zero is the offset, which ought to be set every time a bvec is initialized anyway. > > bi_destructor could be combined. I don't see a lot of users of > > bi_idx, > > bi_idx is integral to partial io completions. Struct request has a remaining submission sector count so what does bi_idx do that is different? > > that looks like a soft target. See what happened to struct page > > when a couple of folks got serious about attacking it, some really > > deep hacks were done to pare off a few bytes here and there. But > > struct bio as a space waster is not nearly in the same ballpark.
> > So show some concrete patches and examples, hand waving and > assumptions is just a waste of everyones time. Average struct bio memory footprint ranks near the bottom of the list of things that suck most about Linux storage. At idle I see 8K in use (reserves); during updatedb it spikes occasionally to 50K; under a heavy load generated by ddsnap on a storage box it sometimes goes to 100K with bio throttling in place. Really not moving the needle. On the other hand, vm writeout deadlock ranks smack dab at the top of the list, so that is where the patching effort must go for the foreseeable future. Without bio throttling, the ddsnap load can go to 24 MB for struct bio alone. That definitely moves the needle. In short, we save 3,200 times more memory by putting decent throttling in place than by saving an int in struct bio. That said, I did a little analysis to get an idea of where the soft targets are in struct bio, and to get to know the bio layer a little better. Maybe these few hints will get somebody interested enough to look further. > > It would be interesting to see if bi_bdev could be made read only. > > Generally, each stage in the block device stack knows what the next > > stage is going to be, so why do we have to write that in the bio? > > For error reporting from interrupt context? Anyway, if Evgeniy > > wants to do the patch, I will happily unload the task of convincing > > you that random fields are/are not needed in struct bio :-) > > It's a trade off, otherwise you'd have to pass the block device > around a lot. Which costs very little, probably less than trashing an extra field's worth of cache. > And it's, again, a design issue. A bio contains > destination information, that means device/offset/size information. > I'm all for shaving structure bytes where it matters, but not for the > sake of sacrificing code stability or design. I consider struct bio > quite lean and have worked hard to keep it that way.
In fact, iirc, > the only addition to struct bio since 2001 is the iommu front/back > size members. And I resisted those for quite a while. You did not comment on the one about putting the bio destructor in the ->endio handler, which looks dead simple. The majority of cases just use the default endio handler and the default destructor. Of the remaining cases, where a specialized destructor is needed, typically a specialized endio handler is too, so combining is free. There are few if any cases where a new specialized endio handler would need to be written. As far as code stability goes, current kernels are horribly unstable in a variety of contexts because of memory deadlock and slowdowns related to the attempt to fix the problem via dirty memory limits. Accurate throttling of bio traffic is one of the two key requirements to fix this instability, the other is accurate writeout path reserve management, which is only partially addressed by BIO_POOL. Nice to see you jumping in Jens. Now it is over to the other side of the thread where Evgeniy has posted a
Re: [1/1] Block device throttling [Re: Distributed storage.]
Hi Evgeniy, Sorry for not getting back to you right away, I was on the road with limited email access. Incidentally, the reason my mails to you keep bouncing is, your MTA is picky about my mailer's IP reversing to a real hostname. I will take care of that pretty soon, but for now my direct mail to you is going to bounce and you will only see the lkml copy. On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote: > This throttling mechanism allows to limit maximum amount of queued > bios per physical device. By default it is turned off and old block > layer behaviour with unlimited number of bios is used. When turned on > (queue limit is set to something different than -1U via > blk_set_queue_limit()), generic_make_request() will sleep until there > is room in the queue. number of bios is increased in > generic_make_request() and reduced either in bio_endio(), when bio is > completely processed (bi_size is zero), and recharged from original > queue when new device is assigned to bio via blk_set_bdev(). All > oerations are not atomic, since we do not care about precise number > of bios, but a fact, that we are close or close enough to the limit. > > Tested on distributed storage device - with limit of 2 bios it works > slow :) it seems to me you need:

-	if (q) {
+	if (q && q->bio_limit != -1) {

This patch is short and simple, and will throttle more accurately than the current simplistic per-request allocation limit. However, it fails to throttle device mapper devices. This is because no request is allocated by the device mapper queue method, instead the mapping call goes straight through to the mapping function. If the mapping function allocates memory (typically the case) then this resource usage evades throttling and deadlock becomes a risk.
There are three obvious fixes:

1) Implement bio throttling in each virtual block device
2) Implement bio throttling generically in device mapper
3) Implement bio throttling for all block devices

Number 1 is the approach we currently use in ddsnap, but it is ugly and repetitious. Number 2 is a possibility, but I favor number 3 because it is a system-wide solution to a system-wide problem, does not need to be repeated for every block device that lacks a queue, heads in the direction of code subtraction, and allows system-wide reserve accounting. Your patch is close to the truth, but it needs to throttle at the top (virtual) end of each block device stack instead of the bottom (physical) end. It does indeed head in the direction of eliminating your deadlock risk; however, there are block devices it does not cover. Regards, Daniel
Re: Distributed storage.
On Tuesday 07 August 2007 05:05, Jens Axboe wrote: > On Sun, Aug 05 2007, Daniel Phillips wrote: > > A simple way to solve the stable accounting field issue is to add a > > new pointer to struct bio that is owned by the top level submitter > > (normally generic_make_request but not always) and is not affected > > by any recursive resubmission. Then getting rid of that field > > later becomes somebody's summer project, which is not all that > > urgent because struct bio is already bloated up with a bunch of > > dubious fields and is a transient structure anyway. > > Thanks for your insights. Care to detail what bloat and dubious > fields struct bio has? First obvious one I see is bi_rw separate from bi_flags. Front_size and back_size smell dubious. Is max_vecs really necessary? You could reasonably assume bi_vcnt rounded up to a power of two and bury the details of making that work behind wrapper functions to change the number of bvecs, if anybody actually needs that. Bi_endio and bi_destructor could be combined. I don't see a lot of users of bi_idx, that looks like a soft target. See what happened to struct page when a couple of folks got serious about attacking it, some really deep hacks were done to pare off a few bytes here and there. But struct bio as a space waster is not nearly in the same ballpark. It would be interesting to see if bi_bdev could be made read only. Generally, each stage in the block device stack knows what the next stage is going to be, so why do we have to write that in the bio? For error reporting from interrupt context? Anyway, if Evgeniy wants to do the patch, I will happily unload the task of convincing you that random fields are/are not needed in struct bio :-) Regards, Daniel
Re: Distributed storage.
On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote: > On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote: > > > DST original code worked as device mapper plugin too, but its two > > > additional allocations (io and clone) per block request ended up > > > for me as a show stopper. > > > > Ah, sorry, I misread. A show stopper in terms of efficiency, or in > > terms of deadlock? > > At least as in terms of efficiency. Device mapper lives in happy > world where memory does not end and allocations are fast. Are you saying that things are different for a network block device because it needs to do GFP_ATOMIC allocations? If so then that is just a misunderstanding. The global page reserve Peter and I use is available in interrupt context just like GFP_ATOMIC. Regards, Daniel
Re: Distributed storage.
On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote: > If we are sleeping in memory pool, then we already do not have memory > to complete previous requests, so we are in trouble. Not at all. Any requests in flight are guaranteed to get the resources they need to complete. This is guaranteed by the combination of memory reserve management and request queue throttling. In logical terms, reserve management plus queue throttling is necessary and sufficient to prevent these deadlocks. Conversely, the absence of either one allows deadlock. > This can work > for devices which do not require additional allocations (like usual > local storage), but not for network connected ones. It works for network devices too, and also for a fancy device like ddsnap, which is the moral equivalent of a filesystem implemented in user space. > If not in device, then at least it should say to block layer about > its limits. What about new function to register queue... Yes, a new internal API is needed eventually. However, no new api is needed right at the moment because we can just hard code the reserve sizes and queue limits and audit them by hand, which is not any more sloppy than several other kernel subsystems. The thing is, we need to keep any obfuscating detail out of the initial patches because these principles are hard enough to explain already without burying them in hundreds of lines of API fluff. That said, the new improved API should probably not be a new way to register, but a set of function calls you can use after the queue is created, which follows the pattern of the existing queue API. > ...which will get > maximum number of bios in flight and sleep in generic_make_request() > when new bio is going to be submitted and it is about to exceed the > limit? Exactly. This is what ddsnap currently does and it works. 
But we did not change generic_make_request for this driver, instead we throttled the driver from the time it makes a request to its user space server, until the reply comes back. We did it that way because it was easy and was the only segment of the request lifeline that could not be fixed by other means. A proper solution for all block devices will move the throttling up into generic_make_request, as you say below. > By default things will be like they are now, except additional > non-atomic increment and branch in generic_make_request() and > decrement and wake in bio_end_io()? ->endio is called in interrupt context, so the accounting needs to be atomic as far as I can see. We actually account the total number of bio pages in flight, otherwise you would need to assume the largest possible bio and waste a huge amount of reserve memory. A counting semaphore works fine for this purpose, with some slight inefficiency that is nigh on unmeasurable in the block IO path. What the semaphore does is make the patch small and easy to understand, which is important at this point. > I can cook up such a patch if idea worth efforts. It is. There are some messy details... You need a place to store the accounting variable/semaphore and need to be able to find that place again in ->endio. Trickier than it sounds, because of the unstructured way drivers rewrite ->bi_bdev. Peterz has already poked at this in a number of different ways, typically involving backing_dev_info, which seems like a good idea to me. A simple way to solve the stable accounting field issue is to add a new pointer to struct bio that is owned by the top level submitter (normally generic_make_request but not always) and is not affected by any recursive resubmission. Then getting rid of that field later becomes somebody's summer project, which is not all that urgent because struct bio is already bloated up with a bunch of dubious fields and is a transient structure anyway. 
Regards, Daniel
Re: Distributed storage.
On Saturday 04 August 2007 09:44, Evgeniy Polyakov wrote: > > On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: > > > * storage can be formed on top of remote nodes and be > > > exported simultaneously (iSCSI is peer-to-peer only, NBD requires > > > device mapper and is synchronous) > > > > In fact, NBD has nothing to do with device mapper. I use it as a > > physical target underneath ddraid (a device mapper plugin) just > > like I would use your DST if it proves out. > > I meant to create a storage on top of several nodes one needs to have > device mapper or something like that on top of NBD itself. To further > export resulted device one needs another userspace NBD application > and so on. DST simplifies that greatly. > > DST original code worked as device mapper plugin too, but its two > additional allocations (io and clone) per block request ended up for > me as a show stopper. Ah, sorry, I misread. A show stopper in terms of efficiency, or in terms of deadlock? Regards, Daniel
Re: Distributed storage.
On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote: > On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote: > > To be sure, I am not very proud of this throttling mechanism for > > various reasons, but the thing is, _any_ throttling mechanism no > > matter how sucky solves the deadlock problem. Over time I want to > > move the > > make_request_fn is always called in process context, Yes, as is submit_bio which calls it. The decision re where it is best to throttle, in submit_bio or in make_request_fn, has more to do with system factoring, that is, is throttling something that _every_ block device should have (yes I think) or is it a delicate, optional thing that needs a tweakable algorithm per block device type (no I think). The big worry I had was that by blocking on congestion in the submit_bio/make_request_fn I might stuff up system-wide mm writeout. But a while ago that part of the mm was tweaked (by Andrew if I recall correctly) to use a pool of writeout threads and understand the concept of one of them blocking on some block device, and not submit more writeout to the same block device until the first thread finishes its submission. Meanwhile, other mm writeout threads carry on with other block devices. > we can wait in it for memory in mempool. Although that means we > already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). By the way, ddsnap waits on a counting semaphore, not a mempool. That is because we draw our reserve memory from the global memalloc reserve, not from a mempool. 
And that is not only because it takes less code to do so, but mainly because global pools as opposed to lots of little special purpose pools seem like a good idea to me. Though I will admit that with our current scheme we need to allow for the total of the maximum reserve requirements for all memalloc users in the memalloc pool, so it does not actually save any memory vs dedicated pools. We could improve that if we wanted to, by having hard and soft reserve requirements: the global reserve actually only needs to be as big as the total of the hard requirements. With this idea, if by some unlucky accident every single pool user got itself maxed out at the same time, we would still not exceed our share of the global reserve. Under "normal" low memory situations, a block device would typically be free to grab reserve memory up to its soft limit, allowing it to optimize over a wider range of queued transactions. My little idea here is: allocating specific pages to a pool is kind of dumb, all we really want to do is account precisely for the number of pages we are allowed to draw from the global reserve. OK, I kind of digressed, but this all counts as explaining the details of what Peter and I have been up to for the last year (longer for me). At this point, we don't need to do the reserve accounting in the most absolutely perfect way possible, we just need to get something minimal in place to fix the current deadlock problems, then we can iteratively improve it. > I agree, any kind of high-boundary leveling must be implemented in > device itself, since block layer does not know what device is at the > end and what it will need to process given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio->endio(). Someone at OLS (James Bottomley?) 
suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. Device mapper in particular does not have any throttling itself: calling submit_bio on a device mapper device directly calls the device mapper bio dispatcher. Default initialized block device queues do provide a crude form of throttling based on limiting the number of requests. This is insufficiently precise to do a good job in the long run, but it works for now because the current gaggle of low level block drivers do not have a lot of resource requirements and tend to behave fairly predictably (except for some irritating issues re very slow devices working in parallel with very fast devices, but... worry about that later). Network block driv
Re: Distributed storage.
On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote: > On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote: > > I see bits that worry me, e.g.: > > > > + req = mempool_alloc(st->w->req_pool, GFP_NOIO); > > > > which seems to be callable in response to a local request, just the > > case where NBD deadlocks. Your mempool strategy can work reliably > > only if you can prove that the pool allocations of the maximum > > number of requests you can have in flight do not exceed the size of > > the pool. In other words, if you ever take the pool's fallback > > path to normal allocation, you risk deadlock. > > mempool should be allocated to be able to catch up with maximum > in-flight requests, in my tests I was unable to force block layer to > put more than 31 pages in sync, but in one bio. Each request is > essentially delayed bio processing, so this must handle maximum > number of in-flight bios (if they do not cover multiple nodes, if > they do, then each node requires own request). It depends on the characteristics of the physical and virtual block devices involved. Slow block devices can produce surprising effects. Ddsnap still qualifies as "slow" under certain circumstances (big linear write immediately following a new snapshot). Before we added throttling we would see as many as 800,000 bios in flight. Nice to know the system can actually survive this... mostly. But memory deadlock is a clear and present danger under those conditions and we did hit it (not to mention that read latency sucked beyond belief). Anyway, we added a simple counting semaphore to throttle the bio traffic to a reasonable number and behavior became much nicer, but most importantly, this satisfies one of the primary requirements for avoiding block device memory deadlock: a strictly bounded amount of bio traffic in flight. 
In fact, we allow some bounded number of non-memalloc bios *plus* however much traffic the mm wants to throw at us in memalloc mode, on the assumption that the mm knows what it is doing and imposes its own bound of in flight bios per device. This needs auditing obviously, but the mm either does that or is buggy. In practice, with this throttling in place we never saw more than 2,000 in flight no matter how hard we hit it, which is about the number we were aiming at. Since we draw our reserve from the main memalloc pool, we can easily handle 2,000 bios in flight, even under extreme conditions. See: http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c down(&info->throttle_sem); To be sure, I am not very proud of this throttling mechanism for various reasons, but the thing is, _any_ throttling mechanism no matter how sucky solves the deadlock problem. Over time I want to move the throttling up into bio submission proper, or perhaps incorporate it in device mapper's queue function, not quite as high up the food chain. Only some stupid little logistical issues stopped me from doing it one of those ways right from the start. I think Peter has also tried some things in this area. Anyway, that part is not pressing because the throttling can be done in the virtual device itself as we do it, even if it is not very pretty there. The point is: you have to throttle the bio traffic. The alternative is to die a horrible death under conditions that may be rare, but _will_ hit somebody. Regards, Daniel
Re: Distributed storage.
Hi Mike, On Thursday 02 August 2007 21:09, Mike Snitzer wrote: > But NBD's synchronous nature is actually an asset when coupled with > MD raid1 as it provides guarantees that the data has _really_ been > mirrored remotely. And bio completion doesn't? Regards, Daniel
Re: Distributed storage.
Hi Evgeniy, Nit alert: On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: > * storage can be formed on top of remote nodes and be exported > simultaneously (iSCSI is peer-to-peer only, NBD requires device > mapper and is synchronous) In fact, NBD has nothing to do with device mapper. I use it as a physical target underneath ddraid (a device mapper plugin) just like I would use your DST if it proves out. Regards, Daniel
Re: Distributed storage.
On Friday 03 August 2007 07:53, Peter Zijlstra wrote: > On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote: > > On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote: > > ...my main position is to > > allocate per socket reserve from socket's queue, and copy data > > there from main reserve, all of which are allocated either in > > advance (global one) or per sockoption, so that there would be no > > fairness issues what to mark as special and what to not. > > > > Say we have a page per socket, each socket can assign a reserve for > > itself from own memory, this accounts both tx and rx side. Tx is > > not interesting, it is simple, rx has global reserve (always > > allocated on startup or sometime way before reclaim/oom) where data > > is originally received (including skb, shared info and whatever is > > needed, page is just an example), then it is copied into per-socket > > reserve and reused for the next packet. Having per-socket reserve > > allows to have progress in any situation not only in cases where > > single action must be received/processed, and allows to be > > completely fair for all users, but not only special sockets, thus > > admin for example would be allowed to login, ipsec would work and > > so on... > > Ah, I think I understand now. Yes this is indeed a good idea! > > It would be quite doable to implement this on top of that I already > have. We would need to extend the socket with a sock_opt that would > reserve a specified amount of data for that specific socket. And then > on socket demux check if the socket has a non zero reserve and has > not yet exceeded said reserve. If so, process the packet. > > This would also quite neatly work for -rt where we would not want > incoming packet processing to be delayed by memory allocations. At this point we need "anything that works" in mainline as a starting point. 
By erring on the side of simplicity we can make this understandable for folks who haven't spent the last two years wallowing in it. The page per socket approach is about as simple as it gets. I therefore propose we save our premature optimizations for later. It will also help our cause if we keep any new internal APIs to strictly what is needed to make deadlock go away. Not a whole lot more than just the flag to mark a socket as part of the vm writeout path when you get right down to essentials. Regards, Daniel
Re: Distributed storage.
On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote: > ...rx has global reserve (always allocated on > startup or sometime way before reclaim/oom) where data is originally > received (including skb, shared info and whatever is needed, page is > just an example), then it is copied into per-socket reserve and > reused for the next packet. Having per-socket reserve allows to have > progress in any situation not only in cases where single action must > be received/processed, and allows to be completely fair for all > users, but not only special sockets, thus admin for example would be > allowed to login, ipsec would work and so on... And when the global reserve is entirely used up your system goes back to dropping vm writeout acknowledgements, not so good. I like your approach, and specifically the copying idea cuts out considerable complexity. But I believe the per-socket flag to mark a socket as part of the vm writeout path is not optional, and in this case it will be a better world if it is a slightly unfair world in favor of vm writeout traffic. Ssh will still work fine even with vm getting priority access to the pool. During memory crunches, non-vm ssh traffic may get bumped till after the crunch, but vm writeout is never supposed to hog the whole machine. If vm writeout hogs your machine long enough to delay an ssh login then that is a vm bug and should be fixed at that level. Regards, Daniel
Re: Distributed storage.
On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote: > Hi. > > I'm pleased to announce first release of the distributed storage > subsystem, which allows to form a storage on top of remote and local > nodes, which in turn can be exported to another storage as a node to > form tree-like storages. Excellent! This is precisely what the doctor ordered for the OCFS2-based distributed storage system I have been mumbling about for some time. In fact the dd in ddsnap and ddraid stands for "distributed data". The ddsnap/raid devices do not include an actual network transport, that is expected to be provided by a specialized block device, which up till now has been NBD. But NBD has various deficiencies as you note, in addition to its tendency to deadlock when accessed locally. Your new code base may be just the thing we always wanted. We (zumastor et al) will take it for a drive and see if anything breaks. Memory deadlock is a concern of course. From a cursory glance through, it looks like this code is pretty vm-friendly and you have thought quite a lot about it, however I respectfully invite peterz (obsessive/compulsive memory deadlock hunter) to help give it a good going over with me. I see bits that worry me, e.g.: + req = mempool_alloc(st->w->req_pool, GFP_NOIO); which seems to be callable in response to a local request, just the case where NBD deadlocks. Your mempool strategy can work reliably only if you can prove that the pool allocations of the maximum number of requests you can have in flight do not exceed the size of the pool. In other words, if you ever take the pool's fallback path to normal allocation, you risk deadlock. Anyway, if this is as grand as it seems then I would think we ought to factor out a common transfer core that can be used by all of NBD, iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own code those things have now. 
Regards, Daniel
Re: Network receive stall avoidance (was [PATCH 2/9] deadlock prevention core)
Andrew Morton wrote: handwaving - The mmap(MAP_SHARED)-the-whole-world scenario should be fixed by mm-tracking-shared-dirty-pages.patch. Please test it and if you are still able to demonstrate deadlocks, describe how, and why they are occurring. OK, but please see "atomic 0 order allocation failure" below. - We expect that the lots-of-dirty-anon-memory-over-swap-over-network scenario might still cause deadlocks. I assert that this can be solved by putting swap on local disks. Peter asserts that this isn't acceptable due to disk unreliability. I point out that local disk reliability can be increased via MD, all goes quiet. A good exposition which helps us to understand whether and why a significant proportion of the target user base still wishes to do swap-over-network would be useful. I don't care much about swap-over-network, Peter does. I care about reliable remote disks in general, and reliable nbd in particular. - Assuming that exposition is convincing, I would ask that you determine at what level of /proc/sys/vm/min_free_kbytes the swap-over-network deadlock is no longer demonstrable. Earlier, I wrote: "we need to actually fix the out of memory deadlock/performance bug right now so that remote storage works properly." It is actually far more likely that memory reclaim stuffups will cause TCP to drop block IO packets here and there than that we will hit some endless retransmit deadlock. But dropping packets and thus causing TCP stalls during block writeout is a very bad thing too, I do not think you could call a system that exhibits such behavior a particularly reliable one. So rather than just the word deadlock, let us add "or atomic 0 order alloc failure during TCP receive" to the challenge. Fair? On the client send side, Peter already showed an easily reproducible deadlock which we avoid by using elevator-noop. The bug is still there, it is just under a different rock now. 
Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Andrew Morton wrote: ...in my earlier emails I asked a number of questions regarding whether existing facilities, queued patches or further slight kernel changes could provide a sufficient solution to these problems. The answer may well be "no". But diligence requires that we be able to prove that, and handwaving doesn't provide that proof, does it? Hi Andrew, I missed those questions about queued patches. So far the closest we have seen to anything relevant is Evgeniy's proposed network sub-allocator, which just doesn't address the problem, as Peter and I have explained in some detail, and the dirty page throttling machinery which is obviously flawed in that it leaves a lot of room for corner cases that have to be handled with more special purpose logic. For example, what about kernel users that just go ahead and use lots of slab without getting involved in the dirty/mapped page accounting? I don't know, maybe you will handle that too, but then there will be another case that isn't handled, and so on. The dirty page throttling approach is just plain fragile by nature. We already know we need to do reserve accounting for struct bio, so what is the conceptual difficulty in extending that reasoning to struct sk_buff, which is very nearly the same kind of beastie, performing very nearly the same kind of function, and suffering very nearly the same kind of problems we had in the block layer before mingo's mempool machinery arrived? Did I miss some other queued patch that bears on this? And I am not sure what handwaving you are referring to. Note that we have not yet submitted this patch to the queue, just put it up for comment. This patch set is actually getting more elegant as various individual artists get their flames in. At some point it will acquire a before-and-after test case as well, then we can realistically compare it to the best of the alternate proposals. 
Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Andrew Morton wrote: Daniel Phillips wrote: Andrew Morton wrote: ...it's runtime configurable. So we default to "less than the best" because we are too lazy to fix the network starvation issue properly? Maybe we don't really need a mempool for struct bio either, isn't that one rather like the reserve pool we propose for sk_buffs? (And for that matter, isn't a bio rather more like an sk_buff than one would think from the code we have written?) unavailable for write caching just to get around the network receive starvation issue? No, it's mainly to avoid latency: to prevent tasks which want to allocate pages from getting stuck behind writeback. This seems like a great help for some common loads, but a regression for other not-so-rare loads. Regardless of whether or not marking 60% of memory as unusable for write caching is a great idea, it is not a compelling reason to fall for the hammer/nail trap. What happens if some in kernel user grabs 68% of kernel memory to do some very important thing, does this starvation avoidance scheme still work? Well something has to give way. The process might get swapped out a bit, or it might stall in the page allocator because of all the dirty memory. It is that "something has to give way" that makes me doubt the efficacy of our global write throttling scheme as a solution for network receive memory starvation. As it is, the thing that gives way may well be the network stack. Avoiding this hazard via a little bit of simple code is the whole point of our patch set. Yes, we did waltz off into some not-so-simple code at one point, but after discussing it, Peter and I agree that the simple approach actually works well enough, so we just need to trim down the patch set accordingly and thus prove that our approach really is nice, light and tight. Specifically, we do not actually need any fancy sk_buff sub-allocator (either ours or the one proposed by davem). 
We already got rid of the per device reserve accounting in order to finesse away the sk_buff->dev oddity and shorten the patch. So the next rev will be rather lighter than the original. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Daniel Phillips wrote: Andrew Morton wrote: Processes which are dirtying those pages throttle at /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill" memory with dirty pages. If the amount of physical memory which is dirty exceeds 40%: bug. So we make 400 MB of a 1 GB system unavailable for write caching just to get around the network receive starvation issue? Excuse me, 600 MB of a 1 GB system :-/ Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Andrew Morton wrote: Daniel Phillips <[EMAIL PROTECTED]> wrote: What happened to the case where we just fill memory full of dirty file pages backed by a remote disk? Processes which are dirtying those pages throttle at /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill" memory with dirty pages. If the amount of physical memory which is dirty exceeds 40%: bug. Hi Andrew, So we make 400 MB of a 1 GB system unavailable for write caching just to get around the network receive starvation issue? What happens if some in kernel user grabs 68% of kernel memory to do some very important thing, does this starvation avoidance scheme still work? Regards, Daniel
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Evgeniy Polyakov wrote: On Thu, Aug 17, 2006 at 09:15:14PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: I got openssh as example of situation when system does not know in advance, what sockets must be marked as critical. OpenSSH works with network and unix sockets in parallel, so you need to hack openssh code to be able to allow it to use reserve when there is not enough memory. OpenSSH or any other user-space program will never ever have the problem we are trying to solve. Nor could it be fixed the way we are solving it, user-space programs can be stalled indefinite. We are concerned with kernel services, and the continued well being of the kernel, not user-space. (well therefore indirectly also user-space of course) You limit your system here - it is possible that userspace should send some command when kernel agent requires some negotiation. And even for them it is possible to require ARP request and/or ICMP processing. Actually all sockets must be able to get data, since kernel can not differentiate between telnetd and a process which will receive an ack for swapped page or other significant information. Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI and AoE are all kernel services, not user-space. And it is the core of our work to provide this information to the kernel; to distinguish these few critical sockets. As I stated above it is not enough. And even if you will cover all kernel-only network allocations, which are used for your selected datapath, problem still there - admin is unable to connect although it can be critical connection too. So network must behave separately from main allocator in that period of time, but since it is possible that reserve can be not filled or has not enough space or something other, it must be preallocated in far advance and should be quite big, but then why network should use it at all, when being separated from main allocations solves the problem? 
You still need to guarantee data-paths to these services, and you need to make absolutely sure that your last bit of memory is used to service these critical sockets, not some random blocked user-space process. You cannot pre-allocate enough memory _ever_ to satisfy the total capacity of the network stack. You _can_ allot a certain amount of memory to the network stack (avoiding DoS), and drop packets once you exceed that. But still, you need to make sure these few critical _kernel_ services get their data. Feel free to implement any receiving policy inside _separated_ allocator to meet your needs, but if allocator depends on main system's memory conditions it is always possible that it will fail to make forward progress. Wrong. Our main allocator has a special reserve that can be accessed only by a task that has its PF_MEMALLOC flag set. This reserve exists in order to guarantee forward progress in just such situations as the network runs into when it is trying to receive responses from a remote disk. Anything otherwise is a bug. I do not argue that your approach is bad or does not solve the problem, I'm just trying to show that further evolution of that idea eventually ends up in separated allocator (as long as all most robust systems separate operations), which can improve things in a lot of other sides too. Not a separate allocator per-se, separate socket group, they are serviced by the kernel, they will never refuse to process data, and it is critical for the continued well-being of your kernel that they get their data. The memalloc reserve is indeed a separate reserve, however it is a reserve that already exists, and you are busy creating another separate reserve to further partition memory. Partitioning memory is not the direction we should be going, we should be trying to unify our reserves wherever possible, and find ways such as Andrew and others propose to implement "reserve on demand", or in other words, take advantage of "easily freeable" pages. 
If your allocation code is so much more efficient than slab then why don't you fix slab instead of replicating functionality that already exists elsewhere in the system, and has since day one? You do not know in advance which sockets must be separated (since only in the simplest situation it is the same as in NBD and is kernelspace-only), Yes we do, they are exactly those sockets that lie in the block IO path. The VM cannot deadlock on any others. you can not solve problem with ARP/ICMP/route changes and other control messages, netfilter, IPsec and compression which still can happen in your setup, If you bothered to read the patches you would know that ICMP is indeed handled. ARP I don't think so, we may need to do that since ARP can believably be required for remote disk interface failover. Anybody who designs ssh into remote disk failover is an idiot. Ssh for configuration, monitoring and administration is fine. Ssh for fencing or whatever is just plain stupid, unless the nodes run
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Evgeniy Polyakov wrote: Just for clarification - it will be completely impossible to log in using openssh or some other privilege separation protocol to the machine due to the nature of unix sockets. So you will be unable to manage your storage system just because it is in OOM - it is not what is expected from a reliable system. The system is not OOM, it is in reclaim, a transient condition that will be resolved in normal course by IO progress. However you raise an excellent point: if there is any remote management that we absolutely require to be available while remote IO is interrupted - manual failover for example - then we must supply a means of carrying out such remote administration, that is guaranteed not to deadlock on a normal mode memory request. This ends up as a new network stack feature I think, and probably a theoretical one for the time being since we don't actually know of any such mandatory login that must be carried out while remote disk IO is suspended. That is why you are not allowed to depend on the main system's allocator problems. That is why the network can have its own allocator. Please check your assumptions. Do you understand how PF_MEMALLOC works, and where it gets its reserve memory from? But really, if you expect to run reliable block IO to Zanzibar over an ssh tunnel through a firewall, then you might also consider taking up bungee jumping with the cord tied to your neck. Just pure openssh for control connection (admin should be able to log in). And the admin will be able to, but in the cluster stack itself we don't bless such stupidity as emailing an admin to ask for a login in order to break a tie over which node should take charge of DLM recovery. No, you can't, since openssh and any other privilege separation mechanisms use additional sockets to transfer data between their parts; unix sockets frequently require page-sized allocations, which will end up as 8k allocations in slab. You will not be able to log in using openssh. 
*** The system is not OOM, it is in reclaim, a transient condition *** Which Peter already told you! Please, let's get our facts straight, then we can look intelligently at whether your work is appropriate to solve the block IO network starvation problem that Peter and I are working on. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Evgeniy Polyakov wrote: On Mon, Aug 14, 2006 at 08:45:43AM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: Just pure openssh for control connection (admin should be able to log in). These periods of degraded functionality should be short and infrequent, albeit critical for machine recovery. Would you rather have a slower ssh login (the machine will recover) or drive/fly to Zanzibar to physically reboot the machine? It will not work, since you cannot mark openssh sockets as those which are able to get memory from the reserved pool. So the admin is unable to check the system status or do anything to bring the system back to life. This is incorrect, please see my previous email. Regards, Daniel
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Evgeniy Polyakov wrote: On Sun, Aug 13, 2006 at 01:16:15PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote: Indeed. The rest of the corner cases like netfilter, layered protocol and so on need to be handled, however they do not need to be handled right now in order to make remote storage on a lan work properly. The sane thing for the immediate future is to flag each socket as safe for remote block IO or not, then gradually widen the scope of what is safe. We need to set up an opt-in strategy for network block IO that views such network subsystems as ipfilter as not safe by default, until somebody puts in the work to make them safe. Just for clarification - it will be completely impossible to log in using openssh or some other privilege separation protocol to the machine due to the nature of unix sockets. So you will be unable to manage your storage system just because it is in OOM - it is not what is expected from a reliable system. The system is not OOM, it is in reclaim, a transient condition that will be resolved in normal course by IO progress. However you raise an excellent point: if there is any remote management that we absolutely require to be available while remote IO is interrupted - manual failover for example - then we must supply a means of carrying out such remote administration, that is guaranteed not to deadlock on a normal mode memory request. This ends up as a new network stack feature I think, and probably a theoretical one for the time being since we don't actually know of any such mandatory login that must be carried out while remote disk IO is suspended. But really, if you expect to run reliable block IO to Zanzibar over an ssh tunnel through a firewall, then you might also consider taking up bungee jumping with the cord tied to your neck. Just pure openssh for control connection (admin should be able to log in). 
And the admin will be able to, but in the cluster stack itself we don't bless such stupidity as emailing an admin to ask for a login in order to break a tie over which node should take charge of DLM recovery. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Andrew Morton wrote: What is a "socket wait queue" and how/why can it consume so much memory? Two things: 1) sk_buffs in flight between device receive interrupt and layer 3 protocol/socket identification. 2) sk_buffs queued onto a particular socket waiting for some task to come along and pull them off via read or equivalent. Case (1) probably can't consume an unbounded amount of memory, but I would not swear to that with my current reading knowledge of the network stack. The upper bound here is obscured by clever SMP device processing, netfilter options, softirq scheduling questions, probably other things. This needs a considered explanation from a network guru, or perhaps a pointer to documentation. Case (2) is the elephant under the rug. Some form of TCP memory throttling exists, but there is no organized way to correlate that with actual memory conditions, and it appears to be exposed to user control. Memory throttling seems to be entirely absent for non-TCP protocols, e.g., UDP. Can it be prevented from doing that? This patch set does that, and also provides an emergency reserve for network devices in order to prevent atomic allocation failures while trying to refill NIC DMA buffer rings. It is the moral equivalent of our bio reservation scheme, but with additional twists specific to the network stack. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Andrew Morton wrote: Peter Zijlstra <[EMAIL PROTECTED]> wrote: Testcase: Mount an NBD device as sole swap device and mmap > physical RAM, then loop through touching pages only once. Fix: don't try to swap over the network. Yes, there may be some scenarios where people have no local storage, but it's reasonable to expect anyone who is using Linux as an "enterprise storage platform" to stick a local disk on the thing for swap. That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix that, will it not? Hi Andrew, What happened to the case where we just fill memory full of dirty file pages backed by a remote disk? Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
David Miller wrote: I think there is more profitability from a solution that really does something about "network memory", and doesn't try to say "these devices are special" or "these sockets are special". Special cases generally suck. We already limit and control TCP socket memory globally in the system. If we do this for all socket and anonymous network buffer allocations, which is sort of implicitly in Evgeniy's network tree allocator design, we can solve this problem in a more reasonable way. This does sound promising, but... It is not possible to solve this problem entirely in the network layer. At minimum, throttling is needed in the nbd driver, or even better, in the submit_bio path as we have it now. We also need a way of letting the virtual block device declare its resource needs, which we also have in some form. We also need a mechanism to guarantee level 2 delivery so we can drop unrelated packets if necessary, which exists in the current patch set. So Evgeniy's work might well be a part of the solution, but it is not the whole solution. And here's the kick, there are other unrelated highly positive consequences to using Evgeniy's network tree allocator. Also good. But is it ready to use today? We need to actually fix the out of memory deadlock/performance bug right now so that remote storage works properly. It doesn't just solve the _one_ problem it was built for, it solves several problems. And that is the hallmark signature of good design. Great. But to solve the whole problem, the block IO system and the network layer absolutely must cooperate. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
David Miller wrote: From: Daniel Phillips <[EMAIL PROTECTED]> David Miller wrote: The reason is that there is no refcounting performed on these devices when they are attached to the skb, for performance reasons, and thus the device can be downed, the module for it removed, etc. long before the skb is freed up. The virtual block device can refcount the network device on virtual device create and un-refcount on virtual device delete. What if the packet is originally received on the device in question, and then gets redirected to another device by a packet scheduler traffic classifier action or a netfilter rule? It is necessary to handle the case where the device changes on the skb, and the skb gets freed up in a context and association different from when the skb was allocated (for example, different from the device attached to the virtual block device). This aspect of the patch became moot because of the change to a single reserve for all layer 2 delivery in Peter's more recent revisions. *However* maybe it is worth mentioning that I intended to provide a pointer from each sk_buff to a common accounting struct. This pointer is set by the device driver. If the driver knows nothing about memory accounting then the pointer is null, no accounting is done, and block IO over the interface will be dangerous. Otherwise, if the system is in reclaim (which we currently detect crudely when a normal allocation fails) then atomic ops will be done on the accounting structure. I planned to use the (struct sock *)sk_buff->sk field for this, which is unused during layer 2 delivery as far as I can see. The accounting struct can look somewhat like a struct sock if we like, size doesn't really matter, and it might make the code more robust. Or the field could become a union. Anyway, this bit doesn't matter any more, the single global packet delivery reserve is better and simpler. 
Regards, Daniel
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Evgeniy Polyakov wrote: One must receive a packet to determine if that packet must be dropped, unless tricky hardware with header-split capabilities or MMIO copying is used. Peter uses a special pool to get data from when the system is in OOM (at least in his latest patchset), so allocations are separated and thus network code is not affected by the OOM condition, which allows it to make forward progress. Nice executive summary. Crucial point: you want to say "in reclaim" not "in OOM". Yes, right from the beginning the patch set got its sk_buff memory from a special pool when the system is in reclaim, however the exact nature of the pool and how/where it is accounted has evolved... mostly forward. Regards, Daniel
Re: [RFC][PATCH 0/4] VM deadlock prevention -v4
Peter Zijlstra wrote: On Sat, 2006-08-12 at 20:16 +0200, Indan Zupancic wrote: What was missing or wrong in the old approach? Can't you use the new approach, but use alloc_pages() instead of SROG? Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-) I'm almost sorry I threw that code out... Good instinct :-) Lemme see what I can do to explain; what I need/want is: - single allocation group per packet - that is, when I free a packet and all its associated object I get my memory back. First, try to recast all your objects as pages, as Evgeniy Polyakov suggested. Then if there is some place where that just doesn't work (please point it out) put a mempool there and tweak the high level reservation setup accordingly. - not waste too much space managing the various objects If we waste a little space only where the network would have otherwise dropped a packet, that is still a big win. We just need to be sure the normal path does not become more wasteful. skb operations want to allocate various sk_buffs for the same data clones. Also, it wants to be able to break the COW or realloc the data. The trivial approach would be one page (or higher alloc page) per object, and that will work quite well, except that it'll waste a _lot_ of memory. High order allocations are just way too undependable without active defragmentation, which isn't even on the horizon at the moment. We just need to treat any network hardware that can't scatter/gather into single pages as too broken to use for network block io. As for sk_buff cow break, we need to look at which network paths do it (netfilter obviously, probably others) and decide whether we just want to declare that the feature breaks network block IO, or fix the feature so it plays well with reserve accounting. So I tried manual packing (parts of that you have seen in previous attempts). This gets hard when you want to do unlimited clones and COW breaks. To do either you need to go link several pages. 
You need to avoid the temptation to fix the entire world on the first attempt. Sure, there will be people who call for gobs of overengineering right from the start, but simple has always worked better for Linux than lathering on layers of complexity just to support some feature that may arguably be broken by design. For example, swapping through a firewall. Changing from per-interface to a global block IO reserve was a great simplification, we need more of those. Looking forward to -v5 ;-) Regards, Daniel
Re: rename *MEMALLOC flags
Peter Zijlstra wrote: Jeff Garzik in his infinite wisdom spake thusly: Peter Zijlstra wrote: Index: linux-2.6/include/linux/gfp.h === --- linux-2.6.orig/include/linux/gfp.h 2006-08-12 12:56:06.0 +0200 +++ linux-2.6/include/linux/gfp.h 2006-08-12 12:56:09.0 +0200 @@ -46,6 +46,7 @@ struct vm_area_struct; #define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */ #define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency reserves */ #define __GFP_HARDWALL ((__force gfp_t)0x2u) /* Enforce hardwall cpuset memory allocs */ +#define __GFP_MEMALLOC ((__force gfp_t)0x4u) /* Use emergency reserves */ This symbol name has nothing to do with its purpose. The entire area of code you are modifying could be described as having something to do with 'memalloc'. GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better symbol name. I recognize that it matches GFP_NOMEMALLOC, but that doesn't change the situation anyway. In fact, a cleanup patch to rename GFP_NOMEMALLOC would be nice. I'm rather bad at picking names, but here goes: PF_MEMALLOC -> PF_EMERGALLOC __GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC __GFP_MEMALLOC -> __GFP_EMERGALLOC Is that suitable and shall I prepare patches? Or do we want more ppl to chime in and have a few more rounds? MEMALLOC is the name Linus chose to name exactly the reserve from which we are allocating. Perhaps that was just Linus being denser than jgarzik and not realizing that he should have called it EMERGALLOC right from the start. BUT since Linus did call it MEMALLOC, we should too. Or just email Linus and tell him how much better EMERGALLOC rolls off the tongue, and could we please change all occurrences of MEMALLOC to EMERGALLOC. Then don't read your email for a week ;-) Inventing a new name for an existing thing is very poor taste on grounds of grepability alone. 
Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Rik van Riel wrote: Thomas Graf wrote: skb->dev is not guaranteed to still point to the "allocating" device once the skb is freed again so reserve/unreserve isn't symmetric. You'd need skb->alloc_dev or something. There's another consequence of this property of the network stack. Every network interface must be able to fall back to these MEMALLOC allocations, because the memory critical socket could be on another network interface. Hence, we cannot know which network interfaces should (not) be marked MEMALLOC. Good point. We do however know which interfaces should be marked capable of carrying block IO traffic: the ones that have been fixed, audited and tested. We might then allow the network block device to specify which interface(s) will actually carry the traffic. The advantage of being specific about which devices are carrying at least one block io socket is, we can skip the reserve accounting for the other interfaces. But is the extra layer of configuration gack a better idea than just doing the accounting for every device, provided the system is in reclaim? By the way, another way to avoid impact on the normal case is an experimental option such as CONFIG_PREVENT_NETWORK_BLOCKIO_DEADLOCK. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
David Miller wrote: From: Peter Zijlstra <[EMAIL PROTECTED]> Hmm, what does sk_buff::input_dev do? That seems to store the initial device? You can run grep on the tree just as easily as I can which is what I did to answer this question. It only takes a few seconds of your time to grep the source tree for things like "skb->input_dev", so would you please do that before asking more questions like this? It does store the initial device, but as Thomas tried so hard to explain to you guys these device pointers in the skb are transient and you cannot refer to them outside of packet receive processing. Thomas did a great job of explaining and without any flaming or ad hominem attacks. We have now formed a decent plan for doing the accounting in a stable way without adding new fields to sk_buff, thank you for the catch. The reason is that there is no refcounting performed on these devices when they are attached to the skb, for performance reasons, and thus the device can be downed, the module for it removed, etc. long before the skb is freed up. The virtual block device can refcount the network device on virtual device create and un-refcount on virtual device delete. We need to add that to the core patch and maybe package it nicely so the memalloc reserve/unreserve happens at the same time, in a tidy little library function to share with other virtual devices like iSCSI that also need some anti-deadlock lovin. Regards, Daniel
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Peter Zijlstra wrote: On Wed, 2006-08-09 at 16:54 -0700, David Miller wrote: People are doing I/O over IP exactly for its ubiquity and flexibility. It seems a major limitation of the design if you cancel out major components of this flexibility. We're not, that was a bit of my own frustration leaking out; I think this whole push to IP based storage is a bit silly. I'm just not going to help the admin whose server just hangs because his VPN key expired. Running critical resources remotely like this is tricky, and every hop/layer you put in between increases the risk of something going bad. The only setup I think even remotely sane is a dedicated network in the very same room - not unlike FC but cheaper (which I think is the whole push behind this, eth is cheap) Indeed. The rest of the corner cases like netfilter, layered protocol and so on need to be handled, however they do not need to be handled right now in order to make remote storage on a lan work properly. The sane thing for the immediate future is to flag each socket as safe for remote block IO or not, then gradually widen the scope of what is safe. We need to set up an opt-in strategy for network block IO that views such network subsystems as ipfilter as not safe by default, until somebody puts in the work to make them safe. But really, if you expect to run reliable block IO to Zanzibar over an ssh tunnel through a firewall, then you might also consider taking up bungee jumping with the cord tied to your neck. Regards, Daniel
Re: [RFC][PATCH 8/9] 3c59x driver conversion
David Miller wrote: I think he's saying that he doesn't think your code is yet a reasonable way to solve the problem, and therefore doesn't belong upstream. That is why it has not yet been submitted upstream. Respectfully, I do not think that jgarzik has yet put in the work to know if this anti-deadlock technique is reasonable or not, and he was only commenting on some superficial blemish. I still don't get his point, if there was one. He seems to be arguing in favor of a jump-off-the-cliff approach to driver conversion. If he wants to do the work and take the blame when some driver inevitably breaks because of being edited in a hurry then he is welcome to submit the necessary additional patches. Until then, there are about three NICs that actually matter to network storage at the moment, all of them GigE. The layer 2 blemishes can be fixed easily, including avoiding the atomic op stall and the ->dev volatility. Thank you for pointing those out. Regards, Daniel
Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD
Evgeniy Polyakov wrote: On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) wrote: http://lwn.net/Articles/144273/ "Kernel Summit 2005: Convergence of network and storage paths" We believe that an approach very much like today's patch set is necessary for NBD, iSCSI, AoE or the like ever to work reliably. We further believe that a properly working version of at least one of these subsystems is critical to the viability of Linux as a modern storage platform. There is another approach for that - do not use the slab allocator for network dataflow at all. It automatically has all your pros and if implemented correctly can have a lot of additional useful and high-performance features like full zero-copy and total fragmentation avoidance. Agreed. But probably more intrusive than davem would be happy with at this point. Regards, Daniel
Re: [RFC][PATCH 8/9] 3c59x driver conversion
Jeff Garzik wrote: Peter Zijlstra wrote: Update the driver to make use of the netdev_alloc_skb() API and the NETIF_F_MEMALLOC feature. NETIF_F_MEMALLOC does not exist in the upstream tree... nor should it, IMO. Elaborate please. Do you think that all drivers should be updated to fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant? If so, I trust you will help audit for it? netdev_alloc_skb() is in the tree, and that's fine.
Re: [RFC][PATCH 2/9] deadlock prevention core
David Miller wrote: From: Daniel Phillips <[EMAIL PROTECTED]> Can you please characterize the conditions under which skb->dev changes after the alloc? Are there writings on this subtlety? The packet scheduler and classifier can redirect packets to different devices, and so can the netfilter layer. The setting of skb->dev is wholly transient and you cannot rely upon it to be the same as when you set it on allocation. Even simple things like the bonding device change skb->dev on every receive. Thank you, this is easily fixed. I think you need to study the networking stack a little more before you continue to play in this delicate area :-) The VM deadlock is also delicate. Perhaps we can work together. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
David Miller wrote: From: Daniel Phillips <[EMAIL PROTECTED]> David Miller wrote: I think the new atomic operation that will seemingly occur on every device SKB free is unacceptable. Alternate suggestion? Sorry, I have none. But you're unlikely to get your changes considered seriously unless you can avoid any new overhead your patch has which is of this level. We just skip anything new unless the socket is actively carrying block IO traffic, in which case we pay a minuscule price to avoid severe performance artifacts or in the worst case, deadlock. So in this design the new atomic operation does not occur on every device SKB free. All atomic ops sit behind the cheap test: (dev->flags & IFF_MEMALLOC) or if any have escaped that is just an oversight. Peter? We're busy trying to make these data structures smaller, and eliminate atomic operations, as much as possible. Therefore anything which adds new datastructure elements and new atomic operations will be met with fierce resistance unless it results an equal or greater shrink of datastructures elsewhere or removes atomic operations elsewhere in the critical path. Right now we have a problem because our network stack cannot support block IO reliably. Without that, Linux is no enterprise storage platform. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Hi Dave, David Miller wrote: I think the new atomic operation that will seemingly occur on every device SKB free is unacceptable. Alternate suggestion? You also cannot modify netdev->flags in the lockless manner in which you do, it must be done with the appropriate locking, such as holding the RTNL semaphore. Thanks for the catch. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Thomas Graf wrote: > skb->dev is not guaranteed to still point to the "allocating" device once the skb is freed again so reserve/unreserve isn't symmetric. You'd need skb->alloc_dev or something. Can you please characterize the conditions under which skb->dev changes after the alloc? Are there writings on this subtlety? Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Stephen Hemminger wrote: How much of this is just building special case support for large allocations for jumbo frames? Wouldn't it make more sense to just fix those drivers to do scatter and add the support hooks for that? Short answer: none of it is. If it happens to handle jumbo frames nicely that is mainly a lucky accident, and we do need to check that they actually work. Minor rant: the whole skb_alloc family has degenerated into an unholy mess and could use some rethinking. I believe the current patch gets as far as three _'s at the beginning of a function; this shows it is high time to reroll the api. Regards, Daniel
Re: [RFC][PATCH 2/9] deadlock prevention core
Indan Zupancic wrote: Hello, Saw the patch on lkml, and wondered about some things. On Tue, August 8, 2006 21:33, Peter Zijlstra said: +static inline void dev_unreserve_skb(struct net_device *dev) +{ + if (atomic_dec_return(&dev->rx_reserve_used) < 0) + atomic_inc(&dev->rx_reserve_used); +} + This looks strange. Is it possible for rx_reserve_used to become negative? A quick look at the code says no, except in the one case where there isn't a "if (unlikely(dev_reserve_used(skb->dev)))" check: Yes, you can see what I'm trying to do there. I was short an atomic op to do it, so my ugly solution may well be... probably is racy. Let's rewrite it with something better. We want the atomic op that some people call "monus": decrement unless zero. @@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb) #endif kfree_skbmem(skb); + if (dev && (dev->flags & IFF_MEMALLOC)) + dev_unreserve_skb(dev); } So it seems that that < 0 check in dev_unreserve_skb() was only added to handle this case (though there seems to be a race between those two atomic ops). Isn't it better to remove that check and just do: if (dev && (dev->flags & IFF_MEMALLOC) && dev_reserve_used(dev)) Seems to me that unreserve is also called from each protocol handler. Agreed, the < 0 check is fairly sickening. The use of atomic seems a bit dubious. Either it's necessary, in which case changes depending on tests seem unsafe as they're not atomic and something crucial could change between the read and the check, or normal reads and writes combined with barriers would be sufficient. All in all it seems better to move that "if (unlikely(dev_reserve_used(skb->dev)))" check into dev_unreserve_skb(), and make the whole atomic if necessary. Then let dev_unreserve_skb() return whether rx_reserve_used was positive. Getting rid of dev_reserve_used() and using the atomic_read directly might be better, as it is set with the bare atomic instructions too and rarely used without dev_unreserve_skb(). 
Barriers should work for this reserve accounting, but is that better than an atomic op? I don't know, let's let the barrier mavens opine. IMHO the cleanest thing to do is code up "monus", in fact I dimly recall somebody already added something similar. Side note: please don't be shy, just reply-all in future so the discussion stays public. Part of what we do is try to share our development process so people see not only what we have done, but why we did it. (And sometimes so they can see what dumb mistakes we make, but I won't get into that...) I beg for forgiveness in advance having taken the liberty of CCing this reply to lkml. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/PATCH 1/2] in-kernel sockets API
Hi Harald, You wrote: On Tue, Jun 13, 2006 at 02:12:41PM -0700, I wrote: This has the makings of a nice stable internal kernel api. Why do we want to provide this nice stable internal api to proprietary modules? because there is IMHO legally nothing we can do about it anyway. Speaking as a former member of a "grey market" binary module vendor that came in from the cold I can assure you that the distinction between EXPORT and EXPORT_GPL _is_ meaningful. That tainted flag makes it extremely difficult to do deals with mainstream Linux companies and there is always the fear that it will turn into a legal problem. The latter bit tends to make venture capitalists nervous. That said, the EXPORT_GPL issue is not about black and white legal issues, it is about gentle encouragement. In this case we are offering a clumsy, on-the-metal, guaranteed-to-change-and-make-you-edit-code interface to non-GPL-compatible modules and a decent, stable (in the deserves to live sense) interface for the pure of heart. Gentle encouragement at exactly the right level. Did we settle the question of whether these particular exports should be EXPORT_SYMBOL_GPL? Regards, Daniel
Re: [RFC/PATCH 1/2] in-kernel sockets API
Chase Venters wrote: can you name some non-GPL non-proprietary modules we should be concerned about? You probably meant "non-GPL-compatible non-proprietary". If so, then by definition there are none. Regards, Daniel
Re: [RFC/PATCH 1/2] in-kernel sockets API
Brian F. G. Bidulock wrote: Stephen, On Tue, 13 Jun 2006, Stephen Hemminger wrote: @@ -2176,3 +2279,13 @@ EXPORT_SYMBOL(sock_wake_async); EXPORT_SYMBOL(sockfd_lookup); EXPORT_SYMBOL(kernel_sendmsg); EXPORT_SYMBOL(kernel_recvmsg); +EXPORT_SYMBOL(kernel_bind); +EXPORT_SYMBOL(kernel_listen); +EXPORT_SYMBOL(kernel_accept); +EXPORT_SYMBOL(kernel_connect); +EXPORT_SYMBOL(kernel_getsockname); +EXPORT_SYMBOL(kernel_getpeername); +EXPORT_SYMBOL(kernel_getsockopt); +EXPORT_SYMBOL(kernel_setsockopt); +EXPORT_SYMBOL(kernel_sendpage); +EXPORT_SYMBOL(kernel_ioctl); Don't we want to restrict this to GPL code with EXPORT_SYMBOL_GPL? These are direct derivatives of the BSD/POSIX system call interface. The protocol function pointers within the socket structure are not GPL only. Why make this wrappered access to them GPL only? It will only encourage the reverse of what they were intended to do: be used instead of the protocol function pointers within the socket structure, which currently carry no such restriction. This has the makings of a nice stable internal kernel api. Why do we want to provide this nice stable internal api to proprietary modules? Regards, Daniel
Re: [stable] [NET] Fix zero-size datagram reception
On Tuesday 08 November 2005 10:13, Greg KH wrote: > On Thu, Nov 03, 2005 at 07:55:38AM +1100, Herbert Xu wrote: > > The recent rewrite of skb_copy_datagram_iovec broke the reception of > > zero-size datagrams. This patch fixes it. > > > > Signed-off-by: Herbert Xu <[EMAIL PROTECTED]> > > > > Please apply it to both 2.6.15 and 2.6.14. > > queued to -stable. Blush. Is it worth putting the initialization after the exit on zero? int i, err, fraglen, end = 0; struct sk_buff *next; if (!len) return 0; next = skb_shinfo(skb)->frag_list; Regards, Daniel
[TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive (really revised)
Gah, this time the revised patch is included, not just the diffstat. datagram.c | 82 +++-- 1 files changed, 26 insertions(+), 56 deletions(-) diff -up --recursive 2.6.12.3.clean/net/core/datagram.c 2.6.12.3/net/core/datagram.c --- 2.6.12.3.clean/net/core/datagram.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/net/core/datagram.c2005-08-25 03:57:32.0 -0400 @@ -211,74 +211,44 @@ void skb_free_datagram(struct sock *sk, int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset, struct iovec *to, int len) { - int start = skb_headlen(skb); - int i, copy = start - offset; - - /* Copy header. */ - if (copy > 0) { - if (copy > len) - copy = len; - if (memcpy_toiovec(to, skb->data + offset, copy)) - goto fault; - if ((len -= copy) == 0) - return 0; - offset += copy; - } + int i, err, fraglen, end = 0; + struct sk_buff *next = skb_shinfo(skb)->frag_list; +next_skb: + fraglen = skb_headlen(skb); + i = -1; - /* Copy paged appendix. Hmm... why does this look so complicated? */ - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - int end; - - BUG_TRAP(start <= offset + len); + while (1) { + int start = end; - end = start + skb_shinfo(skb)->frags[i].size; - if ((copy = end - offset) > 0) { - int err; - u8 *vaddr; - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - struct page *page = frag->page; + if ((end += fraglen) > offset) { + int copy = end - offset, o = offset - start; if (copy > len) copy = len; - vaddr = kmap(page); - err = memcpy_toiovec(to, vaddr + frag->page_offset + -offset - start, copy); - kunmap(page); + if (i == -1) + err = memcpy_toiovec(to, skb->data + o, copy); + else { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = frag->page; + void *p = kmap(page) + frag->page_offset + o; + err = memcpy_toiovec(to, p, copy); + kunmap(page); + } if (err) goto fault; if (!(len -= copy)) return 0; offset += copy; } - start = end; - } - - if (skb_shinfo(skb)->frag_list) { - struct sk_buff *list = skb_shinfo(skb)->frag_list; - - for (; list; list = 
list->next) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + list->len; - if ((copy = end - offset) > 0) { - if (copy > len) - copy = len; - if (skb_copy_datagram_iovec(list, - offset - start, - to, copy)) - goto fault; - if ((len -= copy) == 0) - return 0; - offset += copy; - } - start = end; - } + if (++i >= skb_shinfo(skb)->nr_frags) + break; + fraglen = skb_shinfo(skb)->frags[i].size; + } + if (next) { + skb = next; + next = skb->next; + goto next_skb; } - if (!len) - return 0; - fault: return -EFAULT; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive
On Thursday 25 August 2005 03:30, David S. Miller wrote: > From: Daniel Phillips <[EMAIL PROTECTED]> > > As far as I can see, it is illegal for any but the first skb to have > > a non-null skb_shinfo(skb)->frag_list, is this correct? > > As currently used, yes. That's a relief. I updated the patch to express exactly those semantics. Now that the patch is (probably) correct I can just drop it if you like. After all, it is just a cleanup, and what use is that? Regards, Daniel
[TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive (revised)
The fragment list handling was wrong in the previous version, now correct I think. datagram.c | 82 +++-- 1 files changed, 26 insertions(+), 56 deletions(-) diff -up --recursive 2.6.12.3.clean/net/core/datagram.c 2.6.12.3/net/core/datagram.c --- 2.6.12.3.clean/net/core/datagram.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/net/core/datagram.c2005-08-25 01:39:56.0 -0400 @@ -211,74 +211,40 @@ void skb_free_datagram(struct sock *sk, int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset, struct iovec *to, int len) { - int start = skb_headlen(skb); - int i, copy = start - offset; - - /* Copy header. */ - if (copy > 0) { - if (copy > len) - copy = len; - if (memcpy_toiovec(to, skb->data + offset, copy)) - goto fault; - if ((len -= copy) == 0) - return 0; - offset += copy; - } + int i, err, fraglen, end = 0; +next_skb: + fraglen = skb_headlen(skb); + i = -1; - /* Copy paged appendix. Hmm... why does this look so complicated? */ - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - int end; - - BUG_TRAP(start <= offset + len); + while (1) { + int start = end; - end = start + skb_shinfo(skb)->frags[i].size; - if ((copy = end - offset) > 0) { - int err; - u8 *vaddr; - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - struct page *page = frag->page; + if ((end += fraglen) > offset) { + int copy = end - offset, o = offset - start; if (copy > len) copy = len; - vaddr = kmap(page); - err = memcpy_toiovec(to, vaddr + frag->page_offset + -offset - start, copy); - kunmap(page); + if (i == -1) + err = memcpy_toiovec(to, skb->data + o, copy); + else { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = frag->page; + void *p = kmap(page) + frag->page_offset + o; + err = memcpy_toiovec(to, p, copy); + kunmap(page); + } if (err) goto fault; if (!(len -= copy)) return 0; offset += copy; } - start = end; - } - - if (skb_shinfo(skb)->frag_list) { - struct sk_buff *list = skb_shinfo(skb)->frag_list; - - for (; list; list = list->next) { - int end; - - 
BUG_TRAP(start <= offset + len); - - end = start + list->len; - if ((copy = end - offset) > 0) { - if (copy > len) - copy = len; - if (skb_copy_datagram_iovec(list, - offset - start, - to, copy)) - goto fault; - if ((len -= copy) == 0) - return 0; - offset += copy; - } - start = end; - } + if (++i >= skb_shinfo(skb)->nr_frags) + break; + fraglen = skb_shinfo(skb)->frags[i].size; } - if (!len) - return 0; - + if ((skb = skb->next)) + goto next_skb; fault: return -EFAULT; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive
On Thursday 25 August 2005 02:44, David S. Miller wrote: > Frag lists cannot be deeper than one level of nesting, > and I think the recursive version is easier to understand, > so I really don't see the value of your change. Losing 34 lines of a 74 line function is the value. The real problem with my version is that it isn't right: it walks the ->next link from the original skb, whereas the original version gets the list from skb_shinfo(skb)->frag_list and walks the ->next links of that. This structure actually does allow for lists of lists, but it does not look like this was intended. As far as I can see, it is illegal for any but the first skb to have a non-null skb_shinfo(skb)->frag_list, is this correct? Regards, Daniel
[TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive
Hi, I noticed that skb_copy_datagram_iovec calls itself recursively to copy a fragment list. This isn't actually wrong or even inefficient, it is just somehow disturbing. Oh, and it uses an extra stack frame, and is hard to read. Once I got started straightening that out, I couldn't resist cleaning it up the rest of the way. * Recursion changed to nested loop * Three copy cases folded into one * Miscellaneous funky stuff simplified * Efficiency should be about the same * Comments gone, it's all perfectly obvious now ;-) I booted it, my nics still work, that's as far as I've tested it. Please try it. There are a couple of cosmetic things to do, like BUG on impossible offset. Why does the current code just kprint this and not BUG? datagram.c | 78 + 1 files changed, 22 insertions(+), 56 deletions(-) diff -up --recursive 2.6.12.3.clean/net/core/datagram.c 2.6.12.3/net/core/datagram.c --- 2.6.12.3.clean/net/core/datagram.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/net/core/datagram.c2005-08-25 01:39:56.0 -0400 @@ -211,74 +211,40 @@ void skb_free_datagram(struct sock *sk, int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset, struct iovec *to, int len) { - int start = skb_headlen(skb); - int i, copy = start - offset; - - /* Copy header. */ - if (copy > 0) { - if (copy > len) - copy = len; - if (memcpy_toiovec(to, skb->data + offset, copy)) - goto fault; - if ((len -= copy) == 0) - return 0; - offset += copy; - } + int i, err, fraglen, end = 0; +next_skb: + fraglen = skb_headlen(skb); + i = -1; - /* Copy paged appendix. Hmm... why does this look so complicated? 
*/ - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { - int end; - - BUG_TRAP(start <= offset + len); + while (1) { + int start = end; - end = start + skb_shinfo(skb)->frags[i].size; - if ((copy = end - offset) > 0) { - int err; - u8 *vaddr; - skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; - struct page *page = frag->page; + if ((end += fraglen) > offset) { + int copy = end - offset, o = offset - start; if (copy > len) copy = len; - vaddr = kmap(page); - err = memcpy_toiovec(to, vaddr + frag->page_offset + -offset - start, copy); - kunmap(page); + if (i == -1) + err = memcpy_toiovec(to, skb->data + o, copy); + else { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = frag->page; + void *p = kmap(page) + frag->page_offset + o; + err = memcpy_toiovec(to, p, copy); + kunmap(page); + } if (err) goto fault; if (!(len -= copy)) return 0; offset += copy; } - start = end; - } - - if (skb_shinfo(skb)->frag_list) { - struct sk_buff *list = skb_shinfo(skb)->frag_list; - - for (; list; list = list->next) { - int end; - - BUG_TRAP(start <= offset + len); - - end = start + list->len; - if ((copy = end - offset) > 0) { - if (copy > len) - copy = len; - if (skb_copy_datagram_iovec(list, - offset - start, - to, copy)) - goto fault; - if ((len -= copy) == 0) - return 0; - offset += copy; - } - start = end; - } + if (++i >= skb_shinfo(skb)->nr_frags) + break; + fraglen = skb_shinfo(skb)->frags[i].size; } - if (!len) - return 0; - + if ((skb = skb->next)) + goto next_skb; fault: return -EFAULT; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo inf
Re: Bridge MTU
On Thursday 18 August 2005 07:02, John Heffner wrote: > 1) Bridging occurs at the link layer (ethernet); fragmenting occurs at the > network layer (IP). > > 2) A lot of protocols set the Don't Fragment bit, so you can't always > fragment anyway. > > What you might want to do is set it up as an IP router rather than a > bridge. That way, it will be able to fragment, or send ICMP responses when > it can't. Gaahh. Thanks for filling in that gaping comprehension hole. In retrospect it is perfectly obvious that routing is what I need, and that bridging just comes close enough to confuse me ;-) Regards, Daniel
Re: [RFC] Net vm deadlock fix, version 6
Ahem: + } adjust_memalloc_reserve(-netdev->memalloc_pages); - } Regards, Daniel
[RFC] Net vm deadlock fix, version 6
Hi, This version corrects a couple of bugs previously noted and ties up some loose ends in the e1000 driver. Some versions of this driver support packet splitting into multiple pages, with just the protocol header in the skb itself. This is a very good thing because it avoids the high order page fragmentation problem. Though this is something that probably needs to be pushed down into generic skb allocation, for now the driver handles it explicitly with the help of a new memalloc_page function that just gets a page from the memalloc reserve if normal allocation fails. This does not need separate reserve accounting because such pages are allocated per-skb. While I was in there, I could not resist cleaning up some non-orthogonality in the 64K overlap handling (e1000 people, please check my work). The result is that with the new memalloc handling, the source stayed the same size. Code size is another question. I have added a number of new inlines. I suspect that inlining the skb allocation functions in general doesn't buy anything, but this needs to be checked. Anyway, we now have a pretty good picture of the full per-driver damage of closing this hole, and it is not much. I still haven't looked at the various hooks in the packet delivery path, but it's coming up pretty soon. There are other wrinkles too, like the fact that there can actually be many block devices mapped over the same network interface. I have to ponder what is best to do so they can't wedge each other, and so that the SOCK_MEMALLOC bit doesn't get cleared prematurely. Anyway, progress marches on. 
diff -up --recursive 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 2.6.12.3/drivers/net/e1000/e1000_main.c --- 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/drivers/net/e1000/e1000_main.c 2005-08-11 17:42:12.0 -0400 @@ -309,6 +309,16 @@ e1000_up(struct e1000_adapter *adapter) e1000_phy_reset(&adapter->hw); } + netdev->memalloc_pages = estimate_skb_pages(netdev->rx_reserve, + adapter->rx_buffer_len + NET_IP_ALIGN); + if (adapter->rx_ps) + netdev->memalloc_pages += PS_PAGE_BUFFERS * netdev->rx_reserve; + if ((err = adjust_memalloc_reserve(netdev->memalloc_pages))) { + DPRINTK(PROBE, ERR, + "Unable to allocate rx reserve Error: %d\n", err); + return err; + } + e1000_set_multi(netdev); e1000_restore_vlan(adapter); @@ -386,6 +396,7 @@ e1000_down(struct e1000_adapter *adapter mii_reg |= MII_CR_POWER_DOWN; e1000_write_phy_reg(&adapter->hw, PHY_CTRL, mii_reg); mdelay(1); + adjust_memalloc_reserve(-netdev->memalloc_pages); } } @@ -3116,34 +3127,29 @@ e1000_alloc_rx_buffers(struct e1000_adap buffer_info = &rx_ring->buffer_info[i]; while(!buffer_info->skb) { - skb = dev_alloc_skb(bufsz); + skb = dev_memalloc_skb(netdev, bufsz); - if(unlikely(!skb)) { + if(unlikely(!skb)) /* Better luck next round */ break; - } /* Fix for errata 23, can't cross 64kB boundary */ if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) { struct sk_buff *oldskb = skb; DPRINTK(RX_ERR, ERR, "skb align check failed: %u bytes " "at %p\n", bufsz, skb->data); - /* Try again, without freeing the previous */ - skb = dev_alloc_skb(bufsz); + /* Try again, then free previous */ + skb = dev_memalloc_skb(netdev, bufsz); + dev_memfree_skb(oldskb); + /* Failed allocation, critical failure */ - if (!skb) { - dev_kfree_skb(oldskb); + if (!skb) break; - } + /* give up */ if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) { - /* give up */ - dev_kfree_skb(skb); - dev_kfree_skb(oldskb); + dev_memfree_skb(skb); break; /* while !buffer_info->skb */ - } else { - /* Use new 
allocation */ - dev_kfree_skb(oldskb); } } /* Make buffer alignment 2 beyond a 16 byte boundary @@ -3152,8 +3158,6 @@ e1000_alloc_rx_buffers(struct e1000_adap */ skb_reserve(skb, NET_IP_ALIGN); - skb->dev = netdev; - buf
Re: [RFC] Net vm deadlock fix, version 5
Hi, A couple of goofs. First, the sysctl interface to min_free_kbytes could stomp on any in-kernel adjustments. Now there are two variables, summed in setup_per_zone_pages_min: min_free_kbytes and var_free_kbytes. adjust_memalloc_reserve operates only on the latter, so the user can freely twiddle the base reserve without interfering with in-kernel adjustments. Second, the skb resource estimation code was grossly wrong where it attempted to calculate ceiling(log2(size)), now fixed and tested this time. To save bandwidth, just the relevant parts of the patch included below. Regards, Daniel diff -up --recursive 2.6.12.3.clean/mm/page_alloc.c 2.6.12.3/mm/page_alloc.c --- 2.6.12.3.clean/mm/page_alloc.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/mm/page_alloc.c 2005-08-08 21:20:15.0 -0400 @@ -73,6 +73,7 @@ EXPORT_SYMBOL(zone_table); static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" }; int min_free_kbytes = 1024; +int var_free_kbytes; unsigned long __initdata nr_kernel_pages; unsigned long __initdata nr_all_pages; @@ -2029,7 +2030,8 @@ static void setup_per_zone_lowmem_reserv */ static void setup_per_zone_pages_min(void) { - unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); + unsigned pages_min = (min_free_kbytes + var_free_kbytes) + >> (PAGE_SHIFT - 10); unsigned long lowmem_pages = 0; struct zone *zone; unsigned long flags; @@ -2075,6 +2077,18 @@ static void setup_per_zone_pages_min(voi } } +int adjust_memalloc_reserve(int pages) +{ + int kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10)); + if (kbytes < 0) + return -EINVAL; + var_free_kbytes = kbytes; + setup_per_zone_pages_min(); + return 0; +} + +EXPORT_SYMBOL_GPL(adjust_memalloc_reserve); + /* * Initialise min_free_kbytes. 
* diff -up --recursive 2.6.12.3.clean/net/core/skbuff.c 2.6.12.3/net/core/skbuff.c --- 2.6.12.3.clean/net/core/skbuff.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/net/core/skbuff.c 2005-08-08 23:07:12.0 -0400 @@ -167,6 +168,15 @@ nodata: goto out; } +#define ceiling_log2(x) fls(x - 1) +unsigned estimate_skb_pages(unsigned num, unsigned size) +{ + int slab_pages = kmem_estimate_pages(skbuff_head_cache, num); + int data_space = num * (1 << ceiling_log2(size + 16)); + int data_pages = (data_space + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + return slab_pages + data_pages; +} + /** * alloc_skb_from_cache- allocate a network buffer * @cp: kmem_cache from which to allocate the data area
[RFC] Net vm deadlock fix, version 5
Hi, This version introduces the idea of having a network driver adjust the global memalloc reserve when it brings an interface up or down. The interface is: int adjust_memalloc_reserve(int pages) which is just a thin shell over the min_free_kbytes interface that already exists. The driver estimates the maximum number of pages it could ever need from the reserve and saves it in the net_device struct, so that exactly the same number of pages will be added to the reserve on bring-up and removed from it on bring-down. The tricky part is figuring out how many pages could ever be needed. I wrote some helper functions to estimate how many pages a given number of skbs would use. These functions are deficient in a number of ways, but today the point isn't to be exactly correct, but to give a sense of what the complete solution will look like. I am still continuously comparing the memalloc way of doing things to the mempool way. Today I had a look at the number of different places where mempools would be needed on the pre-protocol packet delivery path. It is scary. With the memalloc way, the interfaces are fairly lightweight, but the accounting isn't as precise. It doesn't need to be precise either, we really just have to be sure that the reservation is more than enough to handle a single block IO transaction. But actually, the code tries to arrange for about 50 in flight at once, which is a very big safety factor. Adjusting the global memalloc reserve accurately is more of a concern. If the adjustment is wildly low, then a lowly network driver could end up eating the whole global reserve and wedge the entire system. So my ongoing effort will be directed at accounting the reserve limits in pages, rather than skbs as currently. This gets tricky, because skbs are not allocated as pages, they are allocated from slab caches. Maybe skbs really should be allocated from pages, it would not be much different from the way block IO works. 
Or maybe I just need to roll up my sleeves and create an interface where I can find out exactly how many additional pages my skb allocation actually consumed. Anyway, I am now at least making an attempt to account for global memalloc pool usage sanely, whereas traditionally we have just made the memalloc reserve "big" and hoped for the best. (At this point, somebody stands up and says "use mempool" and I say, you fix the patch to use mempool. It's messier than you think.) The e1000 driver currently does not implement the deadlock-avoidance path if packet splitting is enabled, because other bits of memory start showing up in the packet path and the accounting gets more complex than I was prepared to tackle today. Maybe tomorrow. Regards, Daniel diff -up --recursive 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 2.6.12.3/drivers/net/e1000/e1000_main.c --- 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/drivers/net/e1000/e1000_main.c 2005-08-08 04:26:29.0 -0400 @@ -309,6 +309,14 @@ e1000_up(struct e1000_adapter *adapter) e1000_phy_reset(&adapter->hw); } + netdev->memalloc_pages = estimate_skb_pages(netdev->rx_reserve, + adapter->rx_buffer_len + NET_IP_ALIGN); + if ((err = adjust_memalloc_reserve(netdev->memalloc_pages))) { + DPRINTK(PROBE, ERR, + "Unable to allocate rx reserve Error: %d\n", err); + return err; + } + e1000_set_multi(netdev); e1000_restore_vlan(adapter); @@ -386,6 +394,7 @@ e1000_down(struct e1000_adapter *adapter mii_reg |= MII_CR_POWER_DOWN; e1000_write_phy_reg(&adapter->hw, PHY_CTRL, mii_reg); mdelay(1); + adjust_memalloc_reserve(-netdev->memalloc_pages); } } @@ -3242,7 +3251,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a cpu_to_le64(ps_page_dma->ps_page_dma[j]); } - skb = dev_alloc_skb(adapter->rx_ps_bsize0 + NET_IP_ALIGN); + skb = dev_memalloc_skb(netdev, adapter->rx_ps_bsize0 + NET_IP_ALIGN); if(unlikely(!skb)) break; @@ -3253,8 +3262,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a */ skb_reserve(skb, 
NET_IP_ALIGN); - skb->dev = netdev; - buffer_info->skb = skb; buffer_info->length = adapter->rx_ps_bsize0; buffer_info->dma = pci_map_single(pdev, skb->data, diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h 2.6.12.3/include/linux/gfp.h --- 2.6.12.3.clean/include/linux/gfp.h 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/include/linux/gfp.h2005-08-05 21:53:09.0 -0400 @@ -39,6 +39,7 @@ struct vm_area_struct; #define __GFP_COMP 0x4000u /* Add compound page metadata */ #define __GFP_ZERO 0x8000u /* Return zeroed page
[RFC] Net vm deadlock fix, version 4
Hi, This patch fills in some missing pieces: * Support v4 udp: same as v4 tcp, when in reserve, drop packets on noncritical sockets * Support v4 icmp: when in reserve, drop icmp traffic * Add reserve skb support to e1000 driver * API for dropping packets before delivery (dev_drop_skb) * Atomic_t for reserve accounting Now ready for proof-of-concept testing. High level API boilerplate will come later. Regards, Daniel diff -up --recursive 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 2.6.12.3/drivers/net/e1000/e1000_main.c --- 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/drivers/net/e1000/e1000_main.c 2005-08-06 16:46:13.0 -0400 @@ -3242,7 +3242,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a cpu_to_le64(ps_page_dma->ps_page_dma[j]); } - skb = dev_alloc_skb(adapter->rx_ps_bsize0 + NET_IP_ALIGN); + skb = dev_memalloc_skb(netdev, adapter->rx_ps_bsize0 + NET_IP_ALIGN); if(unlikely(!skb)) break; @@ -3253,8 +3253,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a */ skb_reserve(skb, NET_IP_ALIGN); - skb->dev = netdev; - buffer_info->skb = skb; buffer_info->length = adapter->rx_ps_bsize0; buffer_info->dma = pci_map_single(pdev, skb->data, diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h 2.6.12.3/include/linux/gfp.h --- 2.6.12.3.clean/include/linux/gfp.h 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/include/linux/gfp.h2005-08-05 21:53:09.0 -0400 @@ -39,6 +39,7 @@ struct vm_area_struct; #define __GFP_COMP 0x4000u /* Add compound page metadata */ #define __GFP_ZERO 0x8000u /* Return zeroed page on success */ #define __GFP_NOMEMALLOC 0x1u /* Don't use emergency reserves */ +#define __GFP_MEMALLOC 0x2u /* Use emergency reserves */ #define __GFP_BITS_SHIFT 20/* Room for 20 __GFP_FOO bits */ #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1) diff -up --recursive 2.6.12.3.clean/include/linux/netdevice.h 2.6.12.3/include/linux/netdevice.h --- 2.6.12.3.clean/include/linux/netdevice.h2005-07-15 17:18:57.0 -0400 +++ 
2.6.12.3/include/linux/netdevice.h 2005-08-06 16:37:14.0 -0400 @@ -371,6 +371,8 @@ struct net_device struct Qdisc*qdisc_ingress; struct list_headqdisc_list; unsigned long tx_queue_len; /* Max frames per queue allowed */ + int rx_reserve; + atomic_trx_reserve_used; /* ingress path synchronizer */ spinlock_t ingress_lock; @@ -662,6 +664,49 @@ static inline void dev_kfree_skb_any(str dev_kfree_skb(skb); } +/* + * Support for critical network IO under low memory conditions + */ +static inline int dev_reserve_used(struct net_device *dev) +{ + return atomic_read(&dev->rx_reserve_used); +} + +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev, + unsigned length, int gfp_mask) +{ + struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask); + if (skb) + goto done; + if (dev_reserve_used(dev) >= dev->rx_reserve) + return NULL; + if (!__dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)) + return NULL;; + atomic_inc(&dev->rx_reserve_used); +done: + skb->dev = dev; + return skb; +} + +static inline struct sk_buff *dev_memalloc_skb(struct net_device *dev, + unsigned length) +{ + return __dev_memalloc_skb(dev, length, GFP_ATOMIC); +} + +static inline void dev_unreserve(struct net_device *dev) +{ + if (atomic_dec_return(&dev->rx_reserve_used) < 0) + atomic_inc(&dev->rx_reserve_used); +} + +static inline void dev_drop_skb(struct sk_buff *skb) +{ + struct net_device *dev = skb->dev; + __kfree_skb(skb); + dev_unreserve(dev); +} + #define HAVE_NETIF_RX 1 extern int netif_rx(struct sk_buff *skb); extern int netif_rx_ni(struct sk_buff *skb); diff -up --recursive 2.6.12.3.clean/include/net/sock.h 2.6.12.3/include/net/sock.h --- 2.6.12.3.clean/include/net/sock.h 2005-07-15 17:18:57.0 -0400 +++ 2.6.12.3/include/net/sock.h 2005-08-05 21:53:09.0 -0400 @@ -382,6 +382,7 @@ enum sock_flags { SOCK_NO_LARGESEND, /* whether to sent large segments or not */ SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */ SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */ 
+ SOCK_MEMALLOC, /* protocol can use memalloc reserve */ }; static inline void sock_set_flag(struct sock *sk, enum sock_flags flag) @@ -399,6 +400,11 @@ static inline int sock_flag(struct sock return test_bit(flag, &sk->sk_flags); } +static inline int is_memalloc_sock(struct sock *
Re: kfree_skb questions
On Sunday 07 August 2005 06:26, Patrick McHardy wrote: > > Anyway, do we not want BUG_ON(!atomic_read(&skb->users)) at the beginning > > of kfree_skb, since we rely on it? > > Why do you care if skb->users is 0 or 1 in __kfree_skb()? Because I am a neatness freak and I like to check things that inattentive coders can easily get wrong. But the question above is not about that, it is about checking for possible calls where skb->users is already zero and thereby catching the double free early instead of letting it slide further into the innards of the machine. Regards, Daniel
kfree_skb questions
Hi,

The way I read this, __kfree_skb will sometimes be called with ->users = 1 and sometimes with ->users = 0, is that right?

	static inline void kfree_skb(struct sk_buff *skb)
	{
		if (likely(atomic_read(&skb->users) == 1))
			smp_rmb();
		else if (likely(!atomic_dec_and_test(&skb->users)))
			return;
		__kfree_skb(skb);
	}

If so, then why not just:

	static inline void kfree_skb(struct sk_buff *skb)
	{
		if (likely(atomic_read(&skb->users) == 1))
			smp_rmb();
		if (likely(!atomic_dec_and_test(&skb->users)))
			return;
		__kfree_skb(skb);
	}

so __kfree_skb can BUG_ON(atomic_read(&skb->users))?

Perhaps this has something to do with the smp_rmb; could somebody please explain to me why it is necessary here, and for which architectures?

Anyway, do we not want BUG_ON(!atomic_read(&skb->users)) at the beginning of kfree_skb, since we rely on it?

Regards, Daniel
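The behavior being asked about can be demonstrated with a small userspace model. This is a sketch, not the kernel code: `struct skb_model` and its `freed` flag are stand-ins, and a plain load replaces smp_rmb(); it shows that the fast path frees with the count still at 1, while the slow path decrements to 0 first — the two different values __kfree_skb can observe.

```c
/* Userspace model of the kfree_skb() refcounting under discussion.
 * skb_model is a stand-in; `freed` marks where __kfree_skb() would run. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct skb_model {
	atomic_int users;
	bool freed; /* stands in for __kfree_skb() having run */
};

static void kfree_skb_model(struct skb_model *skb)
{
	/* fast path: sole owner, skip the atomic RMW entirely */
	if (atomic_load(&skb->users) == 1)
		; /* kernel does smp_rmb() here */
	else if (atomic_fetch_sub(&skb->users, 1) - 1 != 0)
		return; /* another reference remains */
	skb->freed = true; /* __kfree_skb(skb) */
}
```

Note the asymmetry: after a fast-path free the count is still 1 (never decremented), while a slow-path free leaves it at 0 — which is exactly why a single BUG_ON on the count inside __kfree_skb does not work with the current form.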
Re: test
On Saturday 06 August 2005 18:40, David S. Miller wrote:
> From: Daniel Phillips <[EMAIL PROTECTED]>
> Date: Sat, 6 Aug 2005 04:52:07 +1000
>
> > So then there is no choice but to throttle the per-cpu ->input_pkt
> > queues.
>
> Make the driver support NAPI if you want device fairness.

This will co-exist happily with NAPI. It does not simply replicate NAPI functionality, because the reserve accounting also tells the receive path when it should discriminate in favor of vm traffic.

Regards, Daniel
Re: [RFC] Net vm deadlock fix (take two)
On Sunday 07 August 2005 02:07, Jeff Garzik wrote:
> > +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev,
> > +		unsigned length, int gfp_mask)
> > +{
> > +	struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask);
> > +	if (skb)
> > +		goto done;
> > +	if (dev->rx_reserve_used >= dev->rx_reserve)
> > +		return NULL;
> > +	if (!__dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC))
> > +		return NULL;;
> > +	dev->rx_reserve_used++;
>
> why bother with rx_reserve at all? Why not just let the second
> allocation fail, without the rx_reserve_used test?

Because that would allow unbounded reserve use, either because of a leak or because of a legitimate backup in the softnet queues. It is not worth the risk of wedging the whole system just to save this check. If we were using a mempool here, it would fail with a similar check.

> Additionally, I think the rx_reserve_used accounting is wrong, since I
> could simply free the skb

Good point. I should provide a kfree_skb variant that does the reserve accounting (dev_free_skb), in case some driver wants to do this. Anyway, if somebody does free an skb in the delivery path without doing the accounting, it is not a memory leak, but it might cause a non-blockio packet to be unnecessarily dropped later.

> -- but doing so would cause a rx_reserve_used
> leak in your code, since you only decrement the counter in the TCP IPv4
> path.

Reserve checks are needed not just on the IPv4 path but on every protocol path that is allowed to co-exist on the same wire as block IO. I will add udp and sctp to the patch next. If an unhandled protocol does get onto the wire, the consequences are not severe. There is just a risk that the entire reserve may be consumed (another reason we need the limit check above), and we just fall back to the old unreliable block IO behavior.
Eventually this needs to be enforced automatically, so that normal users don't have to worry about exactly which protocols they are running on an interface. Cluster users, though, will just take care to run only supported protocols; they can already benefit from this without fancy checking.

Regards, Daniel
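The allocation policy being defended here — dip into the reserve only while the per-device count is under its limit — can be sketched as a tiny userspace model. This is illustrative only: `memalloc_skb_model` and its arguments are stand-ins for __dev_memalloc_skb() and the outcome of the plain __dev_alloc_skb() call.

```c
/* Userspace model of the bounded-reserve fallback; not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct dev_model { int rx_reserve, rx_reserve_used; };

enum alloc_result { ALLOC_NORMAL, ALLOC_RESERVE, ALLOC_FAIL };

static enum alloc_result
memalloc_skb_model(struct dev_model *dev, bool normal_pool_empty)
{
	if (!normal_pool_empty)
		return ALLOC_NORMAL; /* plain allocation succeeded */
	if (dev->rx_reserve_used >= dev->rx_reserve)
		return ALLOC_FAIL;   /* reserve use stays bounded */
	dev->rx_reserve_used++;      /* dip into the reserve */
	return ALLOC_RESERVE;
}
```

However long the normal pool stays empty, the reserve is tapped at most rx_reserve times before further allocations fail — which is the whole point of the limit check Jeff questioned.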
Re: [PATCH] netpoll can lock up on low memory.
On Saturday 06 August 2005 12:32, Steven Rostedt wrote:
> > > If you need to really get the data out, then the design should be
> > > changed. Have some return value showing the failure, check for
> > > oops_in_progress or whatever, and try again after turning interrupts
> > > back on, and getting to a point where the system can free up memory
> > > (write to swap, etc). Just a busy loop without ever getting a skb is
> > > just bad.
> >
> > Why, pray tell, do you think there will be a second chance after
> > re-enabling interrupts? How does this work when we're panicking or
> > oopsing where we most care? How does this work when the netpoll client
> > is the kernel debugger and the machine is completely stopped because
> > we're tracing it?
>
> What I meant was to check for an oops and maybe then don't break out.
> Otherwise let the system try to reclaim memory. Since this is locked
> when the alloc_skb called with GFP_ATOMIC and fails.

You might want to take a look at my stupid little __GFP_MEMALLOC hack in the network block IO deadlock thread on netdev. It will let you use the memalloc reserve from atomic context. As long as you can be sure your usage will be bounded and you will eventually give it back, this should be ok.

Regards, Daniel
[RFC] Net vm deadlock fix (take two)
Hi,

This version does not do blatantly stupid things in hardware irq context, is more efficient, and... wow, the patch is smaller! (That never happens.)

I don't mark skbs as being allocated from reserve any more. That works, but it is slightly bogus, because it doesn't matter which skb came from reserve, it only matters that we put one back. So I just count them and don't mark them.

The tricky issue that had to be dealt with is the possibility that a massive number of skbs could in theory be queued by the hardware interrupt before the softnet softirq gets around to delivering them to the protocol. But we can only allocate a limited number of skbs from reserve memory. If we run out of reserve memory we have no choice but to fail skb allocations, and that can cause packets to be dropped. Since we don't know at that point which packets are block IO packets and which are not, we could be so unlucky as to always drop block IO packets and always let other junk through. The other junk won't get very far, though, because those packets will be dropped as soon as the protocol headers are decoded, which will reveal that they do not belong to a memalloc socket. This short circuit ought to help take away the cpu load that caused the softirq constipation in the first place.

What will actually happen is that a few block IO packets might be randomly dropped under such conditions, degrading the transport efficiency. Block IO progress will continue, unless we manage to accidentally drop every block IO packet and our softirqs continue to stay comatose, probably indicating a scheduler bug.

OK, we want to allocate skbs from reserve, but we cannot go infinitely into reserve. So we count how many packets a driver has allocated from reserve, in the net_device struct. If this goes over some limit, the skb allocation fails and the device driver may drop a packet because of that.
Note that the e1000 driver will just be trying to refill its rx-ring at this point, and it will try to refill it again as soon as the next packet arrives, so it is still some ways away from actually dropping a packet. Other drivers may immediately drop a packet at this point, c'est la vie. Remember, this can only happen if the softirqs are backed up a silly amount.

The thing is, we have got our block IO traffic moving, by virtue of dipping into the reserve, and most likely moving at near-optimal speed. Normal memory-consuming tasks are not moving because they are blocked on vm IO. The things that can mess us up are cpu hogs - a scheduler problem - and tons of unhelpful traffic sharing the network wire, which we are killing off early as mentioned above.

What happens when a packet arrives at the protocol handler is a little subtle. At this point, if the interface is into reserve, we can always decrement the reserve count, regardless of what type of packet it is. If it is a block IO packet, the packet is still accounted for within the block driver's throttling. We are sure that the packet's resources will be returned to the common pool in an organized way. If it is some other random kind of packet, we drop it right away, also returning the resources to the common pool. Either way, it is not the responsibility of the interface to account for it any more.

I'll just reiterate what I'm trying to accomplish here:

1) Guarantee network block io forward progress
2) Block IO throughput should not degrade much under low memory

This iteration of the patch addresses those goals nicely, I think. I have not yet shown how to drive this from the block IO layer, and I haven't shown how to be sure that all protocols on an interface (not just TCPv4, as here) can handle the reserve management semantics. I have ignored all transports besides IP, though not much changes for other transports. I have some accounting code that is very probably racy and needs to be rewritten with atomic_t.
I have ignored the many hooks that are possible in the protocol path. I have assumed that all receive skbs are the same size, and haven't accounted for the possibility that that size (MTU) might change. All these things need looking at, but the main point at the moment is to establish a solid sense of correctness and to get some real results on a vanilla delivery path. That in itself will be useful for cluster work, where configuration issues are kept under careful control.

As far as drivers are concerned, the new interface is dev_memalloc_skb, which is straightforward. It needs to know about the netdev for accounting purposes, so it takes it as a parameter and thoughtfully plugs it into the skb for you.

I am still using the global memory reserve, not mempool. But notice, now I am explicitly accounting and throttling how deep a driver dips into the global reserve. So GFP_MEMALLOC wins a point: the driver isn't just using the global reserve blindly, as has been traditional. The jury is still out, though.
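The "little subtle" delivery-path rule described above — always release the interface's reserve count at the protocol handler, and drop non-memalloc traffic early while under pressure — can be sketched as a userspace model. All names here are stand-ins for the real structures; the `memalloc_sock` flag models the result of the is_memalloc_sock() check after header decode.

```c
/* Userspace model of the protocol-handler accounting; not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct iface_model { int reserve_used; };

/* Returns true if the packet is delivered, false if dropped early. */
static bool deliver_model(struct iface_model *dev, bool memalloc_sock)
{
	bool into_reserve = dev->reserve_used > 0;

	if (into_reserve)
		dev->reserve_used--;   /* resources return to the pool either way */
	if (into_reserve && !memalloc_sock)
		return false;          /* unrelated junk is dropped right away */
	return true;
}
```

Either outcome returns the interface's share of the reserve: a block IO packet is handed off to the block driver's own throttling, and anything else is freed on the spot — which is why the interface need not track the packet beyond this point.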
Re: test
On Sunday 07 August 2005 03:54, Daniel Phillips wrote:
> On Saturday 06 August 2005 18:40, David S. Miller wrote:
> > From: Daniel Phillips <[EMAIL PROTECTED]>
> > Date: Sat, 6 Aug 2005 04:52:07 +1000
> >
> > > So then there is no choice but to throttle the per-cpu ->input_pkt
> > > queues.
> >
> > Make the driver support NAPI if you want device fairness.
>
> This will co-exist happily with NAPI. It does not simply replicate NAPI
> functionality because the reserve accounting also tells the receive path
> when it should discriminate in favor of vm traffic.

I forgot to mention, I rewrote the patch completely because of the stupid interrupt context mistake, and it no longer touches netif_rx. Please see "take two".

Regards, Daniel
Re: lockups with netconsole on e1000 on media insertion
On Saturday 06 August 2005 11:22, Matt Mackall wrote:
> On Sat, Aug 06, 2005 at 01:51:22AM +0200, Andi Kleen wrote:
> > > But why are we in a hurry to dump the backlog on the floor? Why are we
> > > worrying about the performance of netpoll without the cable plugged in
> > > at all? We shouldn't be optimizing the data loss case.
> >
> > Because a system shouldn't stall for minutes (or forever like right now)
> > at boot just because the network cable isn't plugged in.
>
> Using netconsole without a network cable could well be classified as a
> serious configuration error.

But please don't. An OS that slows to a crawl or crashes because a cable isn't plugged in is an OS that deserves to be ridiculed. Silly timeouts on boot are scary and a waste of the user's time.

Regards, Daniel
test
On Saturday 06 August 2005 02:33, David S. Miller wrote:
> You can't call into the networking packet input path from
> hardware interrupt context, it simply is not allowed.
>
> And that's the context in which netif_rx() gets called.

Duh. I assumed we already were in softirq context here (but with 20-20 hindsight, that would be stupid extra work).

> That is why netif_rx() just queues, and schedules a software
> interrupt in which the netif_receive_skb() call gets made.

So then there is no choice but to throttle the per-cpu ->input_pkt queues. The throttling code is already there, I just need to repurpose it slightly:

	if (queue->input_pkt_queue.qlen <= netdev_max_backlog)

Question: it looks to me as if a given network device interrupt is always routed to the same CPU. Can we rely on this? If so, I only have to check min(netdev_max_backlog, sk->dev->reserve_in_flight).
Re: Bypass softnet
On Saturday 06 August 2005 02:33, David S. Miller wrote:
> You can't call into the networking packet input path from
> hardware interrupt context, it simply is not allowed.
>
> And that's the context in which netif_rx() gets called.

Duh. I assumed we already were in softirq context here (but with 20-20 hindsight, that would be stupid extra work).

> That is why netif_rx() just queues, and schedules a software
> interrupt in which the netif_receive_skb() call gets made.

So then there is no choice but to throttle the per-cpu ->input_pkt queues. The throttling code is already there, I just need to repurpose it slightly. That is, netif_rx will be something like:

	if (queue->input_pkt_queue.qlen <= netdev_max_backlog
	    && reserve_available(skb->dev)) {
		...
	}

This will needlessly drop the very last packet allocated from reserve, a slight wastage, but obvious correctness is more important at this point. The only purpose of this extra throttling is to avoid having to include the entire netdev_max_backlog (default 300) in the network block IO packet reserve. That would be 1.2 MB per interface, perhaps a little too much to withdraw from the general memory pool.

I am thinking that even a single packet's worth of reserve is enough to make _probabilistic_ forward progress, but then we run a high risk of being DOSed by non-blockio traffic, and throughput will be crippled just when it hurts most. So the right reserve size is a compromise between the single element required to make (probabilistic) forward progress and the number that will almost never fill up the ->input_pkt queue under zero atomic memory conditions. I guess 50-100 packets in reserve is about right; reduce it for larger MTU, increase it for higher interface speed.

Question: it looks to me as if a given network device interrupt is always routed to the same CPU. Can I rely on that? If so, I can do the reserve throttling strictly per-cpu; if not, a slightly messier calculation is necessary.
Regards, Daniel
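The netif_rx-style throttle sketched in the mail above boils down to a two-part admission test. The following userspace model is a sketch under assumptions: `reserve_available` is the hypothetical helper named in the mail (not an existing kernel function), and `rxq_model` stands in for the per-cpu softnet queue.

```c
/* Userspace model of the proposed netif_rx() throttle; not kernel code. */
#include <assert.h>
#include <stdbool.h>

enum { NETDEV_MAX_BACKLOG = 300 }; /* models the netdev_max_backlog default */

struct rxq_model { int qlen; };

/* Admit a packet only while the backlog is under the limit AND the
 * device still has reserve headroom; otherwise model NET_RX_DROP. */
static bool netif_rx_admit(struct rxq_model *q, bool reserve_available)
{
	if (q->qlen > NETDEV_MAX_BACKLOG || !reserve_available)
		return false; /* NET_RX_DROP */
	q->qlen++;            /* enqueue for the softnet softirq */
	return true;          /* NET_RX_SUCCESS */
}
```

Either condition alone suffices to drop: a full backlog protects the softirq, and an exhausted reserve keeps the memalloc pool from being committed to packets that cannot yet be delivered.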
Re: argh... ;/
On Saturday 06 August 2005 03:49, Dave Jones wrote:
> On Fri, Aug 05, 2005 at 01:20:59PM -0400, John W. Linville wrote:
> > On Sat, Aug 06, 2005 at 02:41:30AM +1000, Daniel Phillips wrote:
> > > On Friday 05 August 2005 13:04, Mateusz Berezecki wrote:
> > > > I accidentaly posted the patches as MIME attachments... its 5:03 am
> > > > here
> > >
> > > Does anybody still care if patches are posted as attachments,
> > > particularly for review as opposed to submission?
> >
> > Yes. Opening attachments makes them harder to review.
>
> Indeed. Replying to such mails ends up looking like..

Hi, sorry about the misfired "test" mail... the _real_ issue for me was that Kmail was turning tabs into spaces for the longest time, forcing me either to switch mailers (and you know how appealing that is) or send patches as attachments. It's finally fixed, so I'll happily join in with dumping on the poor misguided folks who insist on posting their patches as attachments.

Regards, Daniel "the flexible" Phillips
Re: argh... ;/
On Friday 05 August 2005 13:04, Mateusz Berezecki wrote:
> I accidentaly posted the patches as MIME attachments... its 5:03 am here
> already. Sorry guys.
> I can resubmit if you want. I just dont want do that now and not trash
> your mailboxes

Does anybody still care if patches are posted as attachments, particularly for review as opposed to submission?

I will make the standard comment about C99 // comments, but add that I use them myself to mean "this comment will definitely be gone by the time it gets officially submitted", in other words, work in progress.

Heroic work! And nicely lindented at least. Some long lines, but in the same kinds of places I write them myself. Only commenting on the form, not substance. Best of luck. I have one of these things here, so when you declare it ready I will try it.

Regards, Daniel
Bypass softnet
Hi,

OK, I am still a network klutz. The attached patch changes netif_rx to call netif_receive_skb directly instead of going through softnet. It works with my e1000 here, but eventually oopses under moderate load. I see that a few drivers use netif_receive_skb directly, sometimes together with NAPI. But I must have missed some assumption, at least with the e1000 driver. Any suggestion?

The oops happens almost immediately if I add a printk. I need to put a serial cable on the box so I can see what the oops actually is.

Regards, Daniel

diff -up --recursive 2.6.12.3.clean/net/core/dev.c 2.6.12.3/net/core/dev.c
--- 2.6.12.3.clean/net/core/dev.c	2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/net/core/dev.c	2005-08-05 11:47:23.0 -0400
@@ -1456,6 +1456,11 @@ int netif_rx(struct sk_buff *skb)
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
 
+	if (1) {
+		netif_receive_skb(skb);
+		return NET_RX_SUCCESS;
+	}
+
 	if (!skb->stamp.tv_sec)
 		net_timestamp(&skb->stamp);
Re: [RFC] Net vm deadlock fix (preliminary)
Whoops:

	- return __dev_alloc_skb(length, gfp_mask | __GFP_MEMALLOC);
	+ return __dev_alloc_skb(length, GFP_ATOMIC|__GFP_MEMALLOC);

Regards, Daniel
Re: [RFC] Net vm deadlock fix (preliminary)
Hi,

I spent the last day mulling things over and doing research. It seems to me that the patch as first posted is correct and solves the deadlock, except that some uses of __GFP_MEMALLOC in __dev_alloc_skb may escape into contexts where the reserve is not guaranteed to be reclaimed. It may be that this does not actually happen, but there is enough different usage that I would rather err on the side of caution just now, and offer a variant called, e.g., dev_memalloc_skb, so that drivers will explicitly have to choose to use it (or supply the flag to __dev_alloc_skb). This is just a stack of inline functions, so there should be no extra object code. The dev_memalloc_skb variant can go away in time, but for now it does no harm.

A minor cleanup: somebody (Rik) complained about his bleeding eyes after reading my side-effecty alloc_skb expression, so that was rearranged in a way that should optimize to the same thing.

On the "first write the patch, then do the research" principle, the entire thread on this topic from the ksummit-2005-discuss mailing list is a good read:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/thread.html#186

Matt, you are close to the truth here:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000242.html

but this part isn't right: "it's important to note here that "making progress" may require M acknowledgements to N packets representing a single IO. So we need separate send and acknowledge pools for each SO_MEMALLOC socket so that we don't find ourselves wedged with M-1 available mempool slots when we're waiting on ACKs". This erroneously assumes that mempools throttle the block IO traffic. In fact, the throttling _must_ take place higher up, in the block IO stack. The block driver that submits the network IO must pre-account any per-request resources and block until sufficient resources become available.
So the accounting would include both space for transmit and acknowledge, and the network block IO protocol must be designed to obey that accounting. (I will wave my hands at the question of how we arrange for low-level components to communicate their resource needs to high-level throttlers, just for now.)

Andrea, also getting close:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000200.html

But there is no need to be random. Short of actually overflowing the input ring buffer, we can be precise about accepting all block IO packets and dropping non-blockio traffic as necessary.

Rik, not bad:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000218.html

particularly for deducing it from first principles without actually looking at the network code ;-) It is even the same socket flag name as I settled on (SO_MEMALLOC). But step 4 veers off course: out of order does not matter. And the conclusion that we can throttle here by dropping non-blockio packets is not right: the packets that we got from reserve can still live an arbitrary time in the protocol stack, so we could still exhaust the reserve and be back to the same bad old deadlock conditions.

Everybody noticed that dropping non-blockio packets is key, and everybody missed the fact that softnet introduces additional queues that need throttling (which can't be done sanely) or bypassing. Almost everybody noticed that throttling in the block IO submission path is non-optional.

Everybody thought that mempool is the one true way of reserving memory. I am not so sure, though I still intend to produce a mempool variant of the patch. One problem I see with mempool is that it not only reserves resources, but pins them. If the kernel is full of mempools pinning memory pages here and there, physical defragmentation gets that much harder and the buddy tree will fragment that much sooner. The __GFP_MEMALLOC interface does not have this problem, because pages stay in the global pool.
So the jury is still out on which method is better. Obviously, to do the job properly, __GFP_MEMALLOC would need a way of resizing the memalloc reserve as users are loaded and unloaded. Such an interface can be very lightweight. I will cook one up just to demonstrate this.

Now, the scheme in my patch does the job, and I think it does it in a way that works for all drivers, even e1000 (by method 1. in the thread above). But we could tighten this up a little by noticing that it doesn't actually matter which socket buffer we return to the pool, as long as we are sure to return the same amount of memory as we withdrew. Therefore, we could just account the number of pages alloc_skb withdraws and the number that freeing a packet returns. The e1000 driver would look at this number to determine whether to mark a packet as from_reserve or not. That way, the e1000 driver could set things in motion to release reserve resources sooner, rather than waiting for certain specially flagged skbs to work their way around the rx ring.
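The page-accounting idea at the end — count pages withdrawn from and returned to the reserve, instead of flagging individual skbs — can be sketched as a tiny userspace model. This is purely illustrative; `reserve_model` and both helpers are stand-ins, not a proposed kernel interface.

```c
/* Userspace model of counting-based reserve accounting; not kernel code.
 * Any free balances any allocation: only the page totals matter. */
#include <assert.h>

struct reserve_model { int pages_out; };

static void reserve_withdraw(struct reserve_model *r, int pages)
{
	r->pages_out += pages;
}

static void reserve_return(struct reserve_model *r, int pages)
{
	r->pages_out -= pages;
	if (r->pages_out < 0) /* returned more than withdrawn: clamp */
		r->pages_out = 0;
}
```

Because nothing ties a particular free to a particular allocation, a driver only needs to watch `pages_out` to decide whether it is still into reserve — which is what lets it release reserve resources sooner than the per-skb flagging scheme would.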