Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-15 Thread Daniel Phillips
On Friday 14 December 2007 14:51, I wrote:
> On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Note that false sharing of slab pages is still possible between two
> unrelated writeout processes, both of which obey rules for their own
> writeout path, but the pinned combination does not.  This still
> leaves a hole through which a deadlock may slip.

Actually, no it doesn't.  In fact it does not matter how many 
unrelated writeout processes, block devices, or whatevers share a slab 
cache.  Sufficient reserve pages must have been made available (in a 
perfect world, by adding extra pages to the memalloc reserve on driver 
initialization; in the real world, just by having a big memalloc 
reserve) to populate the slab cache up to the sum of the required 
objects for all memalloc users sharing it.
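For concreteness, here is roughly what that sizing rule looks like, 
using the kmem_estimate_pages() interface from patch 04 of this series 
(the list, loop and field names here are invented for illustration, 
and I am guessing at the exact signature):

        int total_objects = 0;
        struct memalloc_user *user;

        /* sum worst-case object demand over all users sharing the cache */
        list_for_each_entry(user, &cache_users, list)
                total_objects += user->max_objects;

        /* convert object demand to a page count and grow the reserve */
        reserve_pages += kmem_estimate_pages(cache, GFP_KERNEL, total_objects);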

So I think this slab technique of yours is fundamentally sound, that is 
to say, adding a new per-slab flag to keep unbounded numbers of slab 
objects with unbounded lifetimes from mixing with the bounded number of 
slab objects with bounded lifetimes.

Ponder.  OK, here is another issue.  Suppose a driver expands the 
memalloc reserve by the X pages it needs on initialization, and 
shrinks it by the same amount on removal, as is right and proper.  The 
problem is, fewer pages than were pulled into the slab on behalf of 
the removed driver may be freed (or made freeable) back to the global 
reserve, due to page sharing with an unrelated user.  In theory, the 
global reserve could be completely depleted by this slab 
fragmentation.

OK, that is like the case I mistakenly raised in the previous mail, 
though far less likely to occur: driver removals are relatively rare, 
and fragmentation severe enough to deplete the global reserve on 
removal would be rarer still.

Even so, if this possibility bothers anybody, it is fairly easy to plug 
the hole: associate each slab with a given memalloc user instead of 
just having one bit to classify users.  So unrelated memalloc users 
would never share a slab, no false sharing, everybody happy.  The cost: 
a new pointer field per slab and a few additional lines of code.
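A sketch of that fix (every name here is invented for illustration, 
none of this is code from the patch set):

        struct slab {
                /* ...existing fields... */
                void *reserve_owner;    /* memalloc user that owns this slab */
        };

        /* when allocating a reserve object, refuse to mix users */
        static int slab_usable_by(struct slab *slabp, void *user)
        {
                if (!slabp->reserve_owner)
                        slabp->reserve_owner = user;    /* claim fresh slab */
                return slabp->reserve_owner == user;
        }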

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/29] mm: slb: add knowledge of reserve pages

2007-12-14 Thread Daniel Phillips
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to
> allocation contexts that are entitled to it. This is done to ensure
> reserve pages don't leak out and get consumed.

Tighter definitions of "leak out" and "get consumed" would be helpful 
here.  As I see it, the chain of reasoning is:

  * Any MEMALLOC mode allocation must have come from a properly
    throttled path and has a finite lifetime that must eventually
    produce writeout progress.

  * Since the transaction that made the allocation was throttled and
    must have a finite lifetime, we know that it must eventually
    return the resources it consumed to the appropriate resource pool.

Now, I think what you mean by "get consumed" and "leak out" is: "become 
pinned by false sharing with other allocations that do not guarantee 
that they will be returned to the resource pool".  We can say "pinned" 
for short.

So you are attempting to prevent slab pages from becoming pinned by 
users that do not obey the reserve management rules, which I think your 
approach achieves.  However...

Note that false sharing of slab pages is still possible between two 
unrelated writeout processes, both of which obey rules for their own 
writeout path, but the pinned combination does not.  This still leaves 
a hole through which a deadlock may slip.

My original solution was simply to allocate a full page when drawing 
from the memalloc reserve, which may use a tad more reserve, but makes 
it possible to prove the algorithm correct.
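A minimal sketch of that approach, assuming the calling task runs with 
PF_MEMALLOC set so the page allocator may dip into the reserve 
(reserve_alloc_page() is an invented name):

        /* one whole page per reserve allocation: no slab, no false
         * sharing, so the page provably returns to the reserve intact */
        static void *reserve_alloc_page(void)
        {
                return (void *)__get_free_page(GFP_NOIO);
        }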

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/29] mm: kmem_estimate_pages()

2007-12-14 Thread Daniel Phillips
On Friday 14 December 2007 07:39, Peter Zijlstra wrote:
> Provide a method to get the upper bound on the pages needed to
> allocate a given number of objects from a given kmem_cache.
>
> This lays the foundation for a generic reserve framework as presented
> in a later patch in this series. This framework needs to convert
> object demand (kmalloc() bytes, kmem_cache_alloc() objects) to pages.

And hence the big idea that all reserve accounting can be done in units
of pages, allowing the use of a single global reserve that already 
exists.

The other big idea here is that reserve accounting can be independent of 
the actual resource allocations.  This is a powerful idea which we may 
not have explained clearly yet.
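The conversion is simple in principle.  A back-of-envelope version 
(the real kmem_estimate_pages() must also account for slab metadata 
and worst-case fragmentation; the parameter names are made up):

        /* upper bound on pages needed to satisfy an object demand */
        static unsigned estimate_pages(unsigned objects,
                                       unsigned objs_per_slab,
                                       unsigned pages_per_slab)
        {
                unsigned slabs = DIV_ROUND_UP(objects, objs_per_slab);
                return slabs * pages_per_slab;
        }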

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/29] netvm: INET reserves.

2007-12-14 Thread Daniel Phillips
Hi Peter,

sysctl_intvec_fragment, proc_dointvec_fragment, sysctl_intvec_fragment 
seem to suffer from cut-n-pastitis.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/29] Swap over NFS -v15

2007-12-14 Thread Daniel Phillips
Hi Peter,

A major feature of this patch set is the network receive deadlock 
avoidance, but there is quite a bit of stuff bundled with it, with 
the NFS support accounting for a big part of the patch by itself.

Is it possible to provide a before and after demonstration case for just 
the network receive deadlock part, given a subset of the patch set and 
a user space recipe that anybody can try?

Regards,

Daniel

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-09-01 Thread Daniel Phillips
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote:
> On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> > Resubmitting a bio or submitting a dependent bio from
> > inside a block driver does not need to be throttled because all
> > resources required to guarantee completion must have been obtained
> > _before_ the bio was allowed to proceed into the block layer.
>
> I'm toying with the idea of keeping track of the maximum device stack
> depth for each stacked device, and only permitting it to increase in
> controlled circumstances.

Hi Alasdair,

What kind of circumstances did you have in mind?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-30 Thread Daniel Phillips
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> Then, if of course you will want, which I doubt, you can reread
> previous mails and find that it was pointed to that race and
> possibilities to solve it way too long ago.

What still bothers me about your response is that, while you know the 
race exists and do not disagree with my example, you don't seem to see 
that that race can eventually lock up the block device by repeatedly 
losing throttle counts which are never recovered.  What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c  2007-07-08 16:32:17.0 -0700
> > +++ 2.6.22/block/ll_rw_blk.c  2007-08-24 12:07:16.0 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> >   */
> >  void generic_make_request(struct bio *bio)
> >  {
> > +   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > +   if (q && q->metric) {
> > +   int need = bio->bi_reserved = q->metric(bio);
> > +   bio->queue = q;
>
> In case you have stacked device, this entry will be rewritten and you
> will lost all your account data.

It is a weakness all right.  Well,

-   if (q && q->metric) {
+   if (q && q->metric && !bio->queue) {

which fixes that problem.  Maybe there is a better fix possible.  Thanks 
for the catch!

The original conception was that this block throttling would apply only 
to the highest level submission of the bio, the one that crosses the 
boundary between filesystem (or direct block device application) and 
block layer.  Resubmitting a bio or submitting a dependent bio from 
inside a block driver does not need to be throttled because all 
resources required to guarantee completion must have been obtained 
_before_ the bio was allowed to proceed into the block layer.

The other principle we are trying to satisfy is that the throttling 
should not be released until bio->endio, which I am not completely sure 
about with the patch as modified above.  Your earlier idea of having 
the throttle protection only cover the actual bio submission is 
interesting and may be effective in some cases, in fact it may cover 
the specific case of ddsnap.  But we don't have to look any further 
than ddraid (distributed raid) to find a case it doesn't cover - the 
additional memory allocated to hold parity data has to be reserved 
until parity data is deallocated, long after the submission completes.
So while you manage to avoid some logistical difficulties, it also looks 
like you didn't solve the general problem.

Hopefully I will be able to report on whether my patch actually works 
soon, when I get back from vacation.  The mechanism in ddsnap that 
this is supposed to replace is effective; it is just ugly and tricky 
to verify.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Daniel Phillips
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
> wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since were
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide.  Keep leaking throttle
> > counts and eventually all of them will be gone.  No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No.  Please go and read the description of the race again.  A count
gets irretrievably lost because the write operation of the first
decrement is overwritten by the second. Data gets lost.  Atomic 
operations exist to prevent that sort of thing.  You either need to use 
them or have a deep understanding of SMP read and write ordering in 
order to preserve data integrity by some equivalent algorithm.
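To spell out the difference, supposing q->bio_queued were converted to 
atomic_t:

        q->bio_queued--;                /* separate load, decrement, store:
                                         * two CPUs can load the same value
                                         * and one decrement is lost */

        atomic_dec(&q->bio_queued);     /* indivisible read-modify-write:
                                         * no count can ever be lost */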

> Let's solve problems in order of their appearence. If bio structure
> will be allowed to grow, then the whole patches can be done better.

How about like the patch below.  This throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices.  Other block devices will work as before, because
they do not define any metric.  Short, sweet and untested, which is
why I have not posted it until now.

This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into
struct queue.  We should be able to report on the efficacy of this
patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c  2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+   if (q && q->metric) {
+   int need = bio->bi_reserved = q->metric(bio);
+   bio->queue = q;
+   wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+   atomic_sub(need, &q->available);
+   }
+
if (current->bio_tail) {
/* make_request is active */
*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c2007-07-08 16:32:17.0 -0700
+++ 2.6.22/drivers/md/dm.c  2007-08-24 12:14:23.0 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+   return bio->bi_vcnt;
+}
+
 /*-
  * An IDR is used to keep track of allocated minor numbers.
  *---*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
goto bad1_free_minor;
 
md->queue->queuedata = md;
+   md->queue->metric = dm_metric;
+   atomic_set(&md->queue->available, md->queue->capacity = 1000);
+   init_waitqueue_head(&md->queue->throttle_wait);
+
md->queue->backing_dev_info.congested_fn = dm_any_congested;
md->queue->backing_dev_info.congested_data = md;
blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c   2007-07-08 16:32:17.0 -0700
+++ 2.6.22/fs/bio.c 2007-08-24 12:10:41.0 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
bytes_done = bio->bi_size;
}
 
-   bio->bi_size -= bytes_done;
+   if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+   struct request_queue *q = bio->queue;
+   atomic_add(bio->bi_reserved, &q->available);
+   bio->bi_reserved = 0; /* just in case */
+   wake_up(&q->throttle_wait);
+   }
bio->bi_sector += (bytes_done >> 9);
 
if (bio->bi_end_io)
--- 2.6.22

Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Daniel Phillips
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All oerations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > + q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1                        cpu2
> >
> > read q->bio_queued
> >                             q->bio_queued--
> > write q->bio_queued - 1
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since were are already in the
> grey area.

This is not just inaccurate, it is suicide.  Keep leaking throttle 
counts and eventually all of them will be gone.  No more IO
on that block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue.  Yes, I think we need a nice simple approach like
that, and prove it is stable before worrying about the size cost.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-27 Thread Daniel Phillips
Say Evgeniy, something I was curious about but forgot to ask you 
earlier...

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> ...All oerations are not atomic, since we do not care about precise
> number of bios, but a fact, that we are close or close enough to the
> limit. 
> ... in bio->endio
> + q->bio_queued--;

In your proposed patch, what prevents the race:

cpu1                        cpu2

read q->bio_queued
                            q->bio_queued--
write q->bio_queued - 1

Whoops! We leaked a throttle count.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote:
> > The throttling of the virtual device must begin in
> > generic_make_request and last to ->endio.  You release the throttle
> > of the virtual device at the point you remap the bio to an
> > underlying device, which you have convinced yourself is ok, but it
> > is not.  You seem to miss the fact that whatever resources the
> > virtual device has allocated are no longer protected by the
> > throttle count *of the virtual device*, or you do not
>
> Because it is charged to another device.

Great.  You charged the resource to another device, but you did not 
limit the amount of resources that the first device can consume.  Which 
misses the whole point.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > And it will not solve the deadlock problem in general.  (Maybe
> > > > it works for your virtual device, but I wonder...)  If the
> > > > virtual device allocates memory during generic_make_request
> > > > then the memory needs to be throttled.
> > >
> > > Daniel, if device process bio by itself, it has a limit and thus
> > > it will wait in generic_make_request()
> >
> > What will make it wait?
>
> gneric_make_request() for given block device.

Not good enough, that only makes one thread wait.  Look here:

http://lkml.org/lkml/2007/8/13/788

An unlimited number of threads can come in, each consuming resources of 
the virtual device, and violating the throttling rules.

The throttling of the virtual device must begin in generic_make_request 
and last to ->endio.  You release the throttle of the virtual device at 
the point you remap the bio to an underlying device, which you have 
convinced yourself is ok, but it is not.  You seem to miss the fact 
that whatever resources the virtual device has allocated are no longer 
protected by the throttle count *of the virtual device*, or you do not 
see why that is a bad thing.  It is a very bad thing, roughly like 
leaving some shared data outside a spin_lock/unlock.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > And it will not solve the deadlock problem in general.  (Maybe it
> > works for your virtual device, but I wonder...)  If the virtual
> > device allocates memory during generic_make_request then the memory
> > needs to be throttled.
>
> Daniel, if device process bio by itself, it has a limit and thus it
> will wait in generic_make_request()

What will make it wait?

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > Perhaps you never worried about the resources that the device
> > mapper mapping function allocates to handle each bio and so did not
> > consider this hole significant.  These resources can be
> > significant, as is the case with ddsnap.  It is essential to close
> > that window through which the virtual device's queue limit may be
> > violated.  Not doing so will allow deadlock.
>
> This is not a bug, this is special kind of calculation - total limit
> is number of physical devices multiplied by theirs limits. It was
> done _on purpose_ to allow different device to have different limits
> (for example in distributed storage project it is possible to have
> both remote and local node in the same device, but local device
> should not have _any_ limit at all, but network one should).
>
> Virtual device essentially has _no_ limit. And that as done on
> purpose.

And it will not solve the deadlock problem in general.  (Maybe it works 
for your virtual device, but I wonder...)  If the virtual device 
allocates memory during generic_make_request then the memory needs to 
be throttled.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 02:12, Jens Axboe wrote:
> > It is a system wide problem.  Every block device needs throttling,
> > otherwise queues expand without limit.  Currently, block devices
> > that use the standard request library get a slipshod form of
> > throttling for free in the form of limiting in-flight request
> > structs.  Because the amount of IO carried by a single request can
> > vary by two orders of magnitude, the system behavior of this
> > approach is far from predictable.
>
> Is it? Consider just 10 standard sata disks. The next kernel revision
> will have sg chaining support, so that allows 32MiB per request. Even
> if we disregard reads (not so interesting in this discussion) and
> just look at potentially pinned dirty data in a single queue, that
> number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch.
>
> So I still think that this throttling needs to happen elsewhere, you
> cannot rely the block layer throttling globally or for a single
> device. It just doesn't make sense.

You are right, so long as the unit of throttle accounting remains one 
request.  This is not what we do in ddsnap.  Instead we inc/dec the 
throttle counter by the number of bvecs in each bio, which produces a 
nice steady data flow to the disk under a wide variety of loads, and 
provides the memory resource bound we require.

One throttle count per bvec will not be the right throttling metric for 
every driver.  To customize this accounting metric for a given driver 
we already have the backing_dev_info structure, which provides 
per-device-instance accounting functions and instance data.  Perfect! 
This allows us to factor the throttling mechanism out of the driver, so 
the only thing the driver has to do is define the throttle accounting 
if it needs a custom one.

We can avoid affecting the traditional behavior quite easily: for 
example, if backing_dev_info->throttle_fn (new method) is null then we 
can either not throttle at all (relying on the struct request 
in-flight limit) or move the in-flight request throttling logic into 
the core as the default throttling method, simplifying the request 
library without changing its behavior.
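A sketch of that factoring (throttle_fn is the hypothetical new 
backing_dev_info method proposed above; nothing like it exists in 
mainline today):

        /* per-bio throttle cost, in driver-defined units */
        static unsigned bio_throttle_cost(struct backing_dev_info *bdi,
                                          struct bio *bio)
        {
                if (bdi->throttle_fn)
                        return bdi->throttle_fn(bio);   /* e.g. bio->bi_vcnt */
                return 1;       /* default: one count per bio, like requests */
        }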

> > These deadlocks are first and foremost, block layer deficiencies. 
> > Even the network becomes part of the problem only because it lies
> > in the block IO path.
>
> The block layer has NEVER guaranteed throttling, so it can - by
> definition - not be a block layer deficiency.

The block layer has always been deficient by not providing accurate 
throttling, or any throttling at all for some devices.  We have 
practical proof that this causes deadlock and a good theoretical basis 
for describing exactly how it happens.

To be sure, vm and net are co-conspirators, however the block layer 
really is the main actor in this little drama.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote:
> > Say you have a device mapper device with some physical device
> > sitting underneath, the classic use case for this throttle code. 
> > Say 8,000 threads each submit an IO in parallel.  The device mapper
> > mapping function will be called 8,000 times with associated
> > resource allocations, regardless of any throttling on the physical
> > device queue.
>
> Each thread will sleep in generic_make_request(), if limit is
> specified correctly, then allocated number of bios will be enough to
> have a progress.

The problem is, the sleep does not occur before the virtual device 
mapping function is called.  Let's consider two devices, a physical 
device named pdev and a virtual device sitting on top of it called 
vdev.   vdev's throttle limit is just one element, but we will see that 
in spite of this, two bios can be handled by the vdev's mapping method 
before any IO completes, which violates the throttling rules. According 
to your patch it works like this:

 Thread 1                           Thread 2

 <enter generic_make_request;
  vdev->q->bio_queued is zero>
 vdev->q->bio_queued++
 <vdev mapping method runs and
  allocates resources>
 blk_set_bdev(bio, pdev)
   vdev->q->bio_queued--
                                    <enter generic_make_request;
                                     vdev->q->bio_queued is
                                     zero>
                                    vdev->q->bio_queued++
                                    <vdev mapping method runs and
                                     allocates resources>

 whoops!  Our virtual device mapping function has now allocated
 resources for two in-flight bios in spite of having its throttle
 limit set to 1.

Perhaps you never worried about the resources that the device mapper 
mapping function allocates to handle each bio and so did not consider 
this hole significant.  These resources can be significant, as is the 
case with ddsnap.  It is essential to close that window through which 
the virtual device's queue limit may be violated.  Not doing so will 
allow deadlock.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > > Oops, and there is also:
> > > >
> > > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > > itself deadlock.  Let me see if I can remember how it goes.
> > > >
> > > >   * generic_make_request puts a bio in flight
> > > >   * the bio gets past the throttle and initiates network IO
> > > >   * net calls sk_alloc->alloc_pages->shrink_caches
> > > >   * shrink_caches submits a bio recursively to our block device
> > > >   * this bio blocks on the throttle
> > > >   * net may never get the memory it needs, and we are wedged
> > >
> > > If system is in such condition, it is already broken - throttle
> > > limit must be lowered (next time) not to allow such situation.
> >
> > Agreed that the system is broken, however lowering the throttle
> > limit gives no improvement in this case.
>
> How is it ever possible? The whole idea of throttling is to remove
> such situation, and now you say it can not be solved.

It was solved by not throttling writeout that comes from shrink_caches.
Ugly.

> If limit is for 
> 1gb of pending block io, and system has for example 2gbs of ram (or
> any other resonable parameters), then there is no way we can deadlock
> in allocation, since it will not force page reclaim mechanism.

The problem is that sk_alloc (called from our block driver via 
socket->write) would recurse into shrink_pages, which recursively 
submits IO to our block driver and blocks on the throttle.  Subtle 
indeed, and yet another demonstration of why vm recursion is a Bad 
Thing.

I will find a traceback for you tomorrow, which makes this deadlock much 
clearer.

Regards
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > > This is not a very good solution, since it requires all users of
> > > the bios to know how to free it.
> >
> > No, only the specific ->endio needs to know that, which is set by
> > the bio owner, so this knowledge lies in exactly the right place. 
> > A small handful of generic endios all with the same destructor are
> > used nearly everywhere.
>
> That is what I meant - there will be no way to just alloc a bio and
> put it, helpers for generic bio sets must be exported and each and
> every bi_end_io() must be changed to check reference counter and they
> must know how they were allocated.

There are fewer non-generic bio allocators than you think.

> Endio callback is of course quite rare and additional atomic
> reading will not kill the system, but why introduce another read?
> It is possible to provide a flag for endio callback that it is last,
> but it still requires to change every single callback - why do we
> want this?

We don't.  Struct bio does not need to be shrunk.  Jens wanted to talk 
about what fields could be eliminated if we wanted to shrink it.  It is 
about time to let that lie, don't you think?

> So, I'm a bit lost...
>
> You say it is too big 

Did not say that.

> and some parts can be removed or combined

True.

> and  then that size does not matter.

Also true, backed up by numbers on real systems.

> Last/not-last checks in the code is 
> not clear design, so I do not see why it is needed at all if not for
> size shrinking.

Not needed, indeed.  Accurate throttling is needed.  If the best way to 
throttle requires expanding struct bio a little then we should not let 
concerns about the cost  of an int or two stand in the way.  Like Jens, 
I am more concerned about the complexity cost, and that is minimized in 
my opinion by throttling in the generic code rather than with custom 
code in each specialized block driver.

Your patch does throttle in the generic code, great.  Next thing is to 
be sure that it completely closes the window for reserve leakage, which 
is not yet clear.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote:
> On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > (previous incomplete message sent accidentally)
> >
> > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> > >
> > > So, what did we decide? To bloat bio a bit (add a queue pointer)
> > > or to use physical device limits? The latter requires to replace
> > > all occurence of bio->bi_bdev = something_new with
> > > blk_set_bdev(bio, somthing_new), where queue limits will be
> > > appropriately charged. So far I'm testing second case, but I only
> > > changed DST for testing, can change all other users if needed
> > > though.
> >
> > Adding a queue pointer to struct bio and using physical device
> > limits as in your posted patch both suffer from the same problem:
> > you release the throttling on the previous queue when the bio moves
> > to a new one, which is a bug because memory consumption on the
> > previous queue then becomes unbounded, or limited only by the
> > number of struct requests that can be allocated.  In other words,
> > it reverts to the same situation we have now as soon as the IO
> > stack has more than one queue.  (Just a shorter version of my
> > previous post.)
>
> No. Since all requests for virtual device end up in physical devices,
> which have limits, this mechanism works. Virtual device will
> essentially call either generic_make_request() for new physical
> device (and thus will sleep is limit is over), or will process bios
> directly, but in that case it will sleep in generic_make_request()
> for virutal device.

What can happen is, as soon as you unthrottle the previous queue, 
another thread can come in and put another request on it.  Sure, that 
thread will likely block on the physical throttle and so will the rest 
of the incoming threads, but it still allows the higher level queue to 
grow past any given limit, with the help of lots of threads.  JVM for 
example?

Say you have a device mapper device with some physical device sitting 
underneath, the classic use case for this throttle code.  Say 8,000 
threads each submit an IO in parallel.  The device mapper mapping 
function will be called 8,000 times with associated resource 
allocations, regardless of any throttling on the physical device queue.

Anyway, your approach is awfully close to being airtight, there is just 
a small hole.  I would be more than happy to be proved wrong about 
that, but the more I look, the more I see that hole.

> > 1) One throttle count per submitted bio is too crude a measure.  A
> > bio can carry as few as one page or as many as 256 pages.  If you
> > take only
>
> It does not matter - we can count bytes, pages, bio vectors or
> whatever we like, its just a matter of counter and can be changed
> without problem.

Quite true.  In some cases the simple inc/dec per bio works just fine.  
But the general case where finer granularity is required comes up in 
existing code, so there needs to be a plan.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > Oops, and there is also:
> >
> > 3) The bio throttle, which is supposed to prevent deadlock, can
> > itself deadlock.  Let me see if I can remember how it goes.
> >
> >   * generic_make_request puts a bio in flight
> >   * the bio gets past the throttle and initiates network IO
> >   * net calls sk_alloc->alloc_pages->shrink_caches
> >   * shrink_caches submits a bio recursively to our block device
> >   * this bio blocks on the throttle
> >   * net may never get the memory it needs, and we are wedged
>
> If system is in such condition, it is already broken - throttle limit
> must be lowered (next time) not to allow such situation.

Agreed that the system is broken, however lowering the throttle limit 
gives no improvement in this case.

This is not theoretical, but a testable, repeatable result.  
Instructions to reproduce should show up tomorrow.

This bug is now solved in a kludgy way.  Now, Peter's patch set offers a 
much cleaner way to fix this little problem, along with at least one 
other nasty that it already fixed.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 03:22, Jens Axboe wrote:
> I never compared the bio to struct page, I'd obviously agree that
> shrinking struct page was a worthy goal and that it'd be ok to uglify
> some code to do that. The same isn't true for struct bio.

I thought I just said that.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 03:06, Jens Axboe wrote:
> On Mon, Aug 13 2007, Daniel Phillips wrote:
> > Of course not.  Nothing I said stops endio from being called in the
> > usual way as well.  For this to work, endio just needs to know that
> > one call means "end" and the other means "destroy", this is
> > trivial.
>
> Sorry Daniel, but your suggestions would do nothing more than uglify
> the code and design.

Pretty much exactly what was said about shrinking struct page, ask Bill.  
The difference was, shrinking struct page actually mattered whereas 
shrinking struct bio does not, and neither does expanding it by a few 
bytes.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > > But that idea fails as well, since reference counts and IO
> > > completion are two completely seperate entities. So unless end IO
> > > just happens to be the last user holding a reference to the bio,
> > > you cannot free it.
> >
> > That is not a problem.  When bio_put hits zero it calls ->endio
> > instead of the destructor.  The ->endio sees that the count is zero
> > and destroys the bio.
>
> This is not a very good solution, since it requires all users of the
> bios to know how to free it.

No, only the specific ->endio needs to know that, which is set by the 
bio owner, so this knowledge lies in exactly the right place.  A small 
handful of generic endios all with the same destructor are used nearly 
everywhere.

> Right now it is hidden. 
> And adds additional atomic check (although reading is quite fast) in
> the end_io.

Actual endio happens once in the lifetime of the transfer, this read 
will be entirely lost in the noise.

> And for what purpose? To eat 8 bytes on 64bit platform? 
> This will not reduce its size noticebly, so the same number of bios
> will be in the cache's page, so what is a gain? All this cleanups and
> logic complicatins should be performed only if after size shring
> increased number of bios can fit into cache's page, will it be done
> after such cleanups?

Well, exactly.  My point from the beginning was that the size of struct 
bio is not even close to being a problem and adding a few bytes to it 
in the interest of doing the cleanest fix to a core kernel bug is just 
not a dominant issue.

I suppose that leaving out the word "bloated" and skipping straight to 
the "doesn't matter" proof would have saved some bandwidth.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 02:13, Jens Axboe wrote:
> On Mon, Aug 13 2007, Daniel Phillips wrote:
> > On Monday 13 August 2007 00:45, Jens Axboe wrote:
> > > On Mon, Aug 13 2007, Jens Axboe wrote:
> > > > > You did not comment on the one about putting the bio
> > > > > destructor in the ->endio handler, which looks dead simple. 
> > > > > The majority of cases just use the default endio handler and
> > > > > the default destructor.  Of the remaining cases, where a
> > > > > specialized destructor is needed, typically a specialized
> > > > > endio handler is too, so combining is free.  There are few if
> > > > > any cases where a new specialized endio handler would need to
> > > > > be written.
> > > >
> > > > We could do that without too much work, I agree.
> > >
> > > But that idea fails as well, since reference counts and IO
> > > completion are two completely seperate entities. So unless end IO
> > > just happens to be the last user holding a reference to the bio,
> > > you cannot free it.
> >
> > That is not a problem.  When bio_put hits zero it calls ->endio
> > instead of the destructor.  The ->endio sees that the count is zero
> > and destroys the bio.
>
> You can't be serious? You'd stall end io completion notification
> because someone holds a reference to a bio.

Of course not.  Nothing I said stops endio from being called in the 
usual way as well.  For this to work, endio just needs to know that one 
call means "end" and the other means "destroy", this is trivial.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 00:45, Jens Axboe wrote:
> On Mon, Aug 13 2007, Jens Axboe wrote:
> > > You did not comment on the one about putting the bio destructor
> > > in the ->endio handler, which looks dead simple.  The majority of
> > > cases just use the default endio handler and the default
> > > destructor.  Of the remaining cases, where a specialized
> > > destructor is needed, typically a specialized endio handler is
> > > too, so combining is free.  There are few if any cases where a
> > > new specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
>
> But that idea fails as well, since reference counts and IO completion
> are two completely seperate entities. So unless end IO just happens
> to be the last user holding a reference to the bio, you cannot free
> it.

That is not a problem.  When bio_put hits zero it calls ->endio instead 
of the destructor.  The ->endio sees that the count is zero and 
destroys the bio.
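A sketch against the 2.6.22 bi_end_io signature (example_end_io and 
example_bio_set are invented names; the real change would fold this 
into bio_put() and the handful of generic endio handlers):

        void bio_put(struct bio *bio)
        {
                if (atomic_dec_and_test(&bio->bi_cnt))
                        bio->bi_end_io(bio, 0, 0);      /* final call: destroy */
        }

        static int example_end_io(struct bio *bio, unsigned int done, int err)
        {
                if (!atomic_read(&bio->bi_cnt)) {       /* count hit zero */
                        bio_free(bio, example_bio_set); /* old destructor job */
                        return 0;
                }
                /* normal completion notification goes here */
                return 0;
        }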

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 00:28, Jens Axboe wrote:
> On Sun, Aug 12 2007, Daniel Phillips wrote:
> > Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
> > derive efficiently from BIO_POOL_IDX() provided the bio was
> > allocated in the standard way.
>
> That would only be feasible, if we ruled that any bio in the system
> must originate from the standard pools.

Not at all.

> > This leaves a little bit of clean up to do for bios not allocated
> > from a standard pool.
>
> Please suggest how to do such a cleanup.

Easy: use the BIO_POOL bits to know bi_max_vecs, the same as for a 
bio from the standard pool.  Just put the power of two size in the 
bits and map that number to the standard pool arrangement with a table 
lookup.
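Something like this, assuming the standard bvec pool sizes from 
2.6-era fs/bio.c (the table contents here are from memory and would 
want checking against the real BIOVEC_* slabs):

        static const unsigned bvec_pool_size[] = { 1, 4, 16, 64, 128, 256 };

        static inline unsigned bio_max_vecs(struct bio *bio)
        {
                return bvec_pool_size[BIO_POOL_IDX(bio)];
        }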

> > On the other hand, vm writeout deadlock ranks smack dab at the top
> > of the list, so that is where the patching effort must go for the
> > foreseeable future.  Without bio throttling, the ddsnap load can go
> > to 24 MB for struct bio alone.  That definitely moves the needle.
> > In short, we save 3,200 times more memory by putting decent
> > throttling in place than by saving an int in struct bio.
>
> Then fix the damn vm writeout. I always thought it was silly to
> depend on the block layer for any sort of throttling. If it's not a
> system wide problem, then throttle the io count in the
> make_request_fn handler of that problematic driver.

It is a system wide problem.  Every block device needs throttling, 
otherwise queues expand without limit.  Currently, block devices that 
use the standard request library get a slipshod form of throttling for 
free in the form of limiting in-flight request structs.  Because the 
amount of IO carried by a single request can vary by two orders of 
magnitude, the system behavior of this approach is far from 
predictable.

> > You did not comment on the one about putting the bio destructor in
> > the ->endio handler, which looks dead simple.  The majority of
> > cases just use the default endio handler and the default
> > destructor.  Of the remaining cases, where a specialized destructor
> > is needed, typically a specialized endio handler is too, so
> > combining is free.  There are few if any cases where a new
> > specialized endio handler would need to be written.
>
> We could do that without too much work, I agree.

OK, we got one and another is close to cracking, enough of that.

> > As far as code stability goes, current kernels are horribly
> > unstable in a variety of contexts because of memory deadlock and
> > slowdowns related to the attempt to fix the problem via dirty
> > memory limits.  Accurate throttling of bio traffic is one of the
> > two key requirements to fix this instability; the other is
> > accurate writeout path reserve management, which is only partially
> > addressed by BIO_POOL.
>
> Which, as written above and stated many times over the years on lkml,
> is not a block layer issue imho.

Whoever stated that was wrong, but this should be no surprise.  There 
have been many wrong things said about this particular bug over the 
years.  The one thing that remains constant is, Linux continues to 
deadlock under a variety of loads both with and without network 
involvement, making it effectively useless as a storage platform.

These deadlocks are first and foremost, block layer deficiencies.  Even 
the network becomes part of the problem only because it lies in the 
block IO path.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Sunday 12 August 2007 22:36, I wrote:
> Note!  There are two more issues I forgot to mention earlier.

Oops, and there is also:

3) The bio throttle, which is supposed to prevent deadlock, can itself 
deadlock.  Let me see if I can remember how it goes.

  * generic_make_request puts a bio in flight
  * the bio gets past the throttle and initiates network IO
  * net calls sk_alloc->alloc_pages->shrink_caches
  * shrink_caches submits a bio recursively to our block device
  * this bio blocks on the throttle
  * net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you 
can see the danger.  In ddsnap we kludge around this problem by not 
throttling any bio submitted in PF_MEMALLOC mode, which effectively 
increases our reserve requirement by the amount of IO that mm will 
submit to a given block device before deciding the device is congested 
and should be left alone.  This works, but is sloppy and disgusting.

The right thing to do is to make sure that the mm knows about our 
throttle accounting in backing_dev_info so it will not push IO to our 
device when it knows that the IO will just block on congestion.  
Instead, shrink_caches will find some other less congested block device 
or give up, causing alloc_pages to draw from the memalloc reserve to 
satisfy the sk_alloc request.

The mm already uses backing_dev_info this way, we just need to set the 
right bits in the backing_dev_info state flags.  I think Peter posted a 
patch set that included this feature at some point.
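In terms of the 2.6.22 congestion helpers it would look roughly like 
this (sketch only; q->available and q->throttle_wait are from my 
throttle patch earlier in this thread):

        /* on submit, when the throttle fills: tell the vm to stay away */
        if (atomic_read(&q->available) < need)
                set_bdi_congested(&q->backing_dev_info, WRITE);

        /* on unthrottle in endio: open the gates again */
        clear_bdi_congested(&q->backing_dev_info, WRITE);
        wake_up(&q->throttle_wait);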

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
(previous incomplete message sent accidentally)

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> somthing_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as 
in your posted patch both suffer from the same problem: you release the 
throttling on the previous queue when the bio moves to a new one, which 
is a bug because memory consumption on the previous queue then becomes 
unbounded, or limited only by the number of struct requests that can be 
allocated.  In other words, it reverts to the same situation we have 
now as soon as the IO stack has more than one queue.  (Just a shorter 
version of my previous post.)

We can solve this by having the bio only point at the queue to which it 
was originally submitted, since throttling the top level queue 
automatically throttles all queues lower down the stack.  Alternatively 
the bio can point at the block_device or straight at the 
backing_dev_info, which is the per-device structure it actually needs 
to touch.

Note!  There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure.  A bio 
can carry as few as one page or as many as 256 pages.  If you take only 
one throttle count per bio and that data will be transferred over the 
network then you have to assume that (a little more than) 256 pages of 
sk_alloc reserve will be needed for every bio, resulting in a grossly 
over-provisioned reserve.  The precise reserve calculation we want to 
do is per-block device, and you will find hooks like this already 
living in backing_dev_info.  We need to place our own fn+data there to 
calculate the throttle draw for each bio.  Unthrottling gets trickier 
with variable size throttle draw.  In ddsnap, we simply write the 
amount we drew from the throttle into (the private data of) bio for use 
later by unthrottle, thus avoiding the issue that the bio fields we 
used to calculate might have changed during the lifetime of the bio.  
This would translate into one more per-bio field.

2) Exposing the per-block device throttle limits via sysfs or similar is 
really not a good long term solution for system administration.  
Imagine our help text: "just keep trying smaller numbers until your 
system deadlocks".  We really need to figure this out internally and 
get it correct.  I can see putting in a temporary userspace interface 
just for experimentation, to help determine what really is safe, and 
what size the numbers should be to approach optimal throughput in a 
fully loaded memory state.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe 
([EMAIL PROTECTED]) wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> somthing_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as 
in your posted patch both suffer from the same problem: you release the 
throttling on the previous queue when the bio moves to a new one, which 
is a bug because memory consumption on the previous queue then becomes 
unbounded, or limited only by the number of struct requests that can be 
allocated.  In other words, it reverts to the same situation we have 
now as soon as the IO stack has more than one queue.  (Just a shorter 
version of my previous post.)

We can solve this by having the bio only point at the queue to which it 
was originally submitted, since throttling the top level queue 
automatically throttles all queues lower down the stack.  Alternatively 
the bio can point at the block_device or straight at the 
backing_dev_info, which is the per-device structure it actually needs 
to touch.

Note!  There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure.  A bio 
can carry as few as one page or as many as 256 pages.  If you take only 
one throttle count per bio and that data will be transferred over the 
network then you have to assume that (a little more than) 256 pages of 
sk_alloc reserve will be needed for every bio, resulting in a grossly 
over-provisioned reserve.  The precise reserve calculation we want to 
do is per-block device, and you will find hooks like this already 
living in backing_dev_info.  We need to place our own fn+data there to 
calculate the throttle draw for each bio.  Unthrottling gets trickier 
with variable size throttle draw.  In ddsnap, we simply write the 
amount we drew from the throttle into (the private data of) bio for use 
later by unthrottle, thus avoiding the issue that the bio fields we 
used to calculate might have changed during the lifetime of the bio.  
This would translate into one more per-bio field.



the throttling performs another function: keeping a reasonable amount of 
IO in flight for the device.  The definition of "reasonable" is 
complex.  For a hard disk it depends on the physical distance between 
sector addresses of the bios in flight.  In ddsnap we make a crude but 
workable approximation that 


 In general, a per block device 

The throttle count needs to cover 

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-12 Thread Daniel Phillips
On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> I don't like structure bloat, but I do like nice design. Overloading
> is a necessary evil sometimes, though. Even today, there isn't enough
> room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> so that concern can be scratched. If you read bio.h, that much is
> obvious.

Sixteen bits in bi_rw are consumed by queue priority.  Is there a reason 
this lives in struct bio instead of struct request?

> If you check up on the iommu virtual merging, you'll understand the
> front and back size members. They may smell dubious to you, but
> please take the time to understand why it looks the way it does.

Virtual merging is only needed at the physical device, so why do these 
fields live in struct bio instead of struct request?

> Changing the number of bvecs is integral to how bio buildup current
> works.

Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can 
derive efficiently from BIO_POOL_IDX() provided the bio was allocated 
in the standard way.  This leaves a little bit of clean up to do for 
bios not allocated from a standard pool.

Incidentally, why does the bvl need to be memset to zero on allocation?  
bi_vcnt already tells you which bvecs are valid and the only field in a 
bvec that can reasonably default to zero is the offset, which ought to 
be set every time a bvec is initialized anyway.

> > bi_destructor could be combined.  I don't see a lot of users of
> > bi_idx,
>
> bi_idx is integral to partial io completions.

Struct request has a remaining submission sector count so what does 
bi_idx do that is different?

> > that looks like a soft target.  See what happened to struct page
> > when a couple of folks got serious about attacking it, some really
> > deep hacks were done to pare off a few bytes here and there.  But
> > struct bio as a space waster is not nearly in the same ballpark.
>
> So show some concrete patches and examples, hand waving and
> assumptions is just a waste of everyones time.

Average struct bio memory footprint ranks near the bottom of the list of 
things that suck most about Linux storage.  At idle I see 8K in use 
(reserves); during updatedb it spikes occasionally to 50K; under a 
heavy load generated by ddsnap on a storage box it sometimes goes to 
100K with bio throttling in place.  Really not moving the needle.

On the other hand, vm writeout deadlock ranks smack dab at the top of 
the list, so that is where the patching effort must go for the 
foreseeable future.  Without bio throttling, the ddsnap load can go to 
24 MB for struct bio alone.  That definitely moves the needle.  In 
short, we save 3,200 times more memory by putting decent throttling in 
place than by saving an int in struct bio.

That said, I did a little analysis to get an idea of where the soft 
targets are in struct bio, and to get to know the bio layer a little 
better.  Maybe these few hints will get somebody interested enough to 
look further.

> > It would be interesting to see if bi_bdev could be made read only.
> > Generally, each stage in the block device stack knows what the next
> > stage is going to be, so why do we have to write that in the bio? 
> > For error reporting from interrupt context?  Anyway, if Evgeniy
> > wants to do the patch, I will happily unload the task of convincing
> > you that random fields are/are not needed in struct bio :-)
>
> It's a trade off, otherwise you'd have to pass the block device
> around a lot.

Which costs very little, probably less than trashing an extra field's 
worth of cache.

> And it's, again, a design issue. A bio contains 
> destination information, that means device/offset/size information.
> I'm all for shaving structure bytes where it matters, but not for the
> sake of sacrificing code stability or design. I consider struct bio
> quite lean and have worked hard to keep it that way. In fact, iirc,
> the only addition to struct bio since 2001 is the iommu front/back
> size members. And I resisted those for quite a while.

You did not comment on the one about putting the bio destructor in 
the ->endio handler, which looks dead simple.  The majority of cases 
just use the default endio handler and the default destructor.  Of the 
remaining cases, where a specialized destructor is needed, typically a 
specialized endio handler is needed too, so combining is free.  There are few 
if any cases where a new specialized endio handler would need to be 
written.
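
A sketch of the combined case, against the current three-argument endio 
interface; my_dev, complete_my_io and my_bio_destructor are placeholders 
for whatever a driver already has:

static int my_bio_endio(struct bio *bio, unsigned int done, int error)
{
        struct my_dev *dev = bio->bi_private;

        if (bio->bi_size)               /* partial completion, not done yet */
                return 1;
        complete_my_io(dev, error);     /* the old ->bi_end_io work */
        my_bio_destructor(bio);         /* the old ->bi_destructor work, now
                                           just the tail of the endio handler */
        return 0;
}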

As far as code stability goes, current kernels are horribly unstable in 
a variety of contexts because of memory deadlock and slowdowns related 
to the attempt to fix the problem via dirty memory limits.  Accurate 
throttling of bio traffic is one of the two key requirements to fix 
this instability; the other is accurate writeout path reserve 
management, which is only partially addressed by BIO_POOL.

Nice to see you jumping in Jens.  Now it is over to the other side of 
the thread where Evgeniy has posted a...

Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
Hi Evgeniy,

Sorry for not getting back to you right away, I was on the road with 
limited email access.  Incidentally, the reason my mails to you keep 
bouncing is, your MTA is picky about my mailer's IP reversing to a real 
hostname.  I will take care of that pretty soon, but for now my direct 
mail to you is going to bounce and you will only see the lkml copy.

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> This throttling mechanism allows to limit maximum amount of queued
> bios per physical device. By default it is turned off and old block
> layer behaviour with unlimited number of bios is used. When turned on
> (queue limit is set to something different than -1U via
> blk_set_queue_limit()), generic_make_request() will sleep until there
> is room in the queue. number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when bio is
> completely processed (bi_size is zero), and recharged from original
> queue when new device is assigned to bio via blk_set_bdev(). All
> operations are not atomic, since we do not care about the precise number
> of bios, but the fact that we are close or close enough to the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

It seems to me you need:

-   if (q) {
+   if (q && q->bio_limit != -1) {

This patch is short and simple, and will throttle more accurately than 
the current simplistic per-request allocation limit.  However, it fails 
to throttle device mapper devices.  This is because no request is 
allocated by the device mapper queue method, instead the mapping call 
goes straight through to the mapping function.  If the mapping function 
allocates memory (typically the case) then this resource usage evades 
throttling and deadlock becomes a risk.

There are three obvious fixes:

   1) Implement bio throttling in each virtual block device
   2) Implement bio throttling generically in device mapper
   3) Implement bio throttling for all block devices

Number 1 is the approach we currently use in ddsnap, but it is ugly and 
repetitious.  Number 2 is a possibility, but I favor number 3 because 
it is a system-wide solution to a system-wide problem, does not need to 
be repeated for every block device that lacks a queue, heads in the 
direction of code subtraction, and allows system-wide reserve 
accounting. 

Your patch is close to the truth, but it needs to throttle at the top 
(virtual) end of each block device stack instead of the bottom 
(physical) end.  It does indeed head in the direction of eliminating 
your own deadlock risk; however, there are block devices it does not 
cover.
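
For concreteness, a sketch of what the top level check might look like, 
reusing your bio_limit field.  throttle_wait and bio_count are made-up 
queue fields; atomic_add_unless is the stock primitive:

static void submit_bio_throttled(struct request_queue *q, struct bio *bio)
{
        if (q->bio_limit != -1U)        /* throttling enabled on this queue */
                wait_event(q->throttle_wait,
                        atomic_add_unless(&q->bio_count, 1, q->bio_limit));
        generic_make_request(bio);
        /* the matching decrement and wake_up() belong in bio_endio(), and
           must be charged against this top level queue, not against
           whatever ->bi_bdev has been rewritten to by then */
}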

Regards,

Daniel


Re: Distributed storage.

2007-08-07 Thread Daniel Phillips
On Tuesday 07 August 2007 05:05, Jens Axboe wrote:
> On Sun, Aug 05 2007, Daniel Phillips wrote:
> > A simple way to solve the stable accounting field issue is to add a
> > new pointer to struct bio that is owned by the top level submitter
> > (normally generic_make_request but not always) and is not affected
> > by any recursive resubmission.  Then getting rid of that field
> > later becomes somebody's summer project, which is not all that
> > urgent because struct bio is already bloated up with a bunch of
> > dubious fields and is a transient structure anyway.
>
> Thanks for your insights. Care to detail what bloat and dubious
> fields struct bio has?

First obvious one I see is bi_rw separate from bi_flags.  Front_size and 
back_size smell dubious.  Is max_vecs really necessary?  You could 
reasonably assume bi_vcnt rounded up to a power of two and bury the 
details of making that work behind wrapper functions to change the 
number of bvecs, if anybody actually needs that.  Bi_endio and 
bi_destructor could be combined.  I don't see a lot of users of bi_idx, 
that looks like a soft target.  See what happened to struct page when a 
couple of folks got serious about attacking it, some really deep hacks 
were done to pare off a few bytes here and there.  But struct bio as a 
space waster is not nearly in the same ballpark.

It would be interesting to see if bi_bdev could be made read only.  
Generally, each stage in the block device stack knows what the next 
stage is going to be, so why do we have to write that in the bio?  For 
error reporting from interrupt context?  Anyway, if Evgeniy wants to do 
the patch, I will happily unload the task of convincing you that random 
fields are/are not needed in struct bio :-)

Regards,

Daniel


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote:
> On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote:
> > > DST original code worked as device mapper plugin too, but its two
> > > additional allocations (io and clone) per block request ended up
> > > for me as a show stopper.
> >
> > Ah, sorry, I misread.  A show stopper in terms of efficiency, or in
> > terms of deadlock?
>
> At least as in terms of efficiency. Device mapper lives in happy
> world where memory does not end and allocations are fast.

Are you saying that things are different for a network block device 
because it needs to do GFP_ATOMIC allocations?  If so then that is just 
a misunderstanding.  The global page reserve Peter and I use is 
available in interrupt context just like GFP_ATOMIC.

Regards,

Daniel


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote:
> If we are sleeping in memory pool, then we already do not have memory
> to complete previous requests, so we are in trouble.

Not at all.  Any requests in flight are guaranteed to get the resources 
they need to complete.  This is guaranteed by the combination of memory 
reserve management and request queue throttling.  In logical terms, 
reserve management plus queue throttling is necessary and sufficient to 
prevent these deadlocks.  Conversely, the absence of either one allows 
deadlock.

> This can work 
> for devices which do not require additional allocations (like usual
> local storage), but not for network connected ones.

It works for network devices too, and also for a fancy device like 
ddsnap, which is the moral equivalent of a filesystem implemented in 
user space.

> If not in device, then at least it should say to block layer about
> its limits. What about new function to register queue...

Yes, a new internal API is needed eventually.  However, no new API is 
needed right at the moment because we can just hard code the reserve 
sizes and queue limits and audit them by hand, which is not any more 
sloppy than several other kernel subsystems.  The thing is, we need to 
keep any obfuscating detail out of the initial patches because these 
principles are hard enough to explain already without burying them in 
hundreds of lines of API fluff.

That said, the new improved API should probably not be a new way to 
register, but a set of function calls you can use after the queue is 
created, which follows the pattern of the existing queue API.

> ...which will get 
> maximum number of bios in flight and sleep in generic_make_request()
> when new bio is going to be submitted and it is about to exceed the
> limit?

Exactly.  This is what ddsnap currently does and it works.  But we did 
not change generic_make_request for this driver, instead we throttled 
the driver from the time it makes a request to its user space server, 
until the reply comes back.  We did it that way because it was easy and 
was the only segment of the request lifeline that could not be fixed by 
other means.  A proper solution for all block devices will move the 
throttling up into generic_make_request, as you say below.

> By default things will be like they are now, except additional
> non-atomic increment and branch in generic_make_request() and
> decrement and wake in bio_end_io()?

->endio is called in interrupt context, so the accounting needs to be 
atomic as far as I can see.

We actually account the total number of bio pages in flight, otherwise 
you would need to assume the largest possible bio and waste a huge 
amount of reserve memory.  A counting semaphore works fine for this 
purpose, with some slight inefficiency that is nigh on unmeasurable in 
the block IO path.  What the semaphore does is make the patch small and 
easy to understand, which is important at this point.
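
In code terms the scheme amounts to something like this.  It is a 
simplification of what ddsnap actually does (the real thing is in 
dm-ddsnap.c); the struct and function names here are incidental:

/* throttle_sem is set up with sema_init(&info->throttle_sem, reserve_pages) */

static void throttle_bio(struct devinfo *info, struct bio *bio)
{
        int pages = bio->bi_vcnt;

        while (pages--)
                down(&info->throttle_sem);      /* one count per page */
}

static void unthrottle_bio(struct devinfo *info, unsigned pages)
{
        while (pages--)
                up(&info->throttle_sem);        /* fine from ->bi_end_io:
                                                   up() never sleeps */
}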

> I can cook up such a patch if idea worth efforts.

It is.  There are some messy details...  You need a place to store the 
accounting variable/semaphore and need to be able to find that place 
again in ->endio.  Trickier than it sounds, because of the unstructured 
way drivers rewrite ->bi_bdev.   Peterz has already poked at this in a 
number of different ways, typically involving backing_dev_info, which 
seems like a good idea to me.

A simple way to solve the stable accounting field issue is to add a new 
pointer to struct bio that is owned by the top level submitter 
(normally generic_make_request but not always) and is not affected by 
any recursive resubmission.  Then getting rid of that field later 
becomes somebody's summer project, which is not all that urgent because 
struct bio is already bloated up with a bunch of dubious fields and is 
a transient structure anyway.

Regards,

Daniel


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Saturday 04 August 2007 09:44, Evgeniy Polyakov wrote:
> > On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> > > * storage can be formed on top of remote nodes and be
> > > exported simultaneously (iSCSI is peer-to-peer only, NBD requires
> > > device mapper and is synchronous)
> >
> > In fact, NBD has nothing to do with device mapper.  I use it as a
> > physical target underneath ddraid (a device mapper plugin) just
> > like I would use your DST if it proves out.
>
> I meant to create a storage on top of several nodes one needs to have
> device mapper or something like that on top of NBD itself. To further
> export resulted device one needs another userspace NDB application
> and so on. DST simplifies that greatly.
>
> DST original code worked as device mapper plugin too, but its two
> additional allocations (io and clone) per block request ended up for
> me as a show stopper.

Ah, sorry, I misread.  A show stopper in terms of efficiency, or in 
terms of deadlock?

Regards,

Daniel


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote:
> On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote:
> > To be sure, I am not very proud of this throttling mechanism for
> > various reasons, but the thing is, _any_ throttling mechanism no
> > matter how sucky solves the deadlock problem.  Over time I want to
> > move the
>
> make_request_fn is always called in process context,

Yes, as is submit_bio which calls it.  The decision re where it is best 
to throttle, in submit_bio or in make_request_fn, has more to do with 
system factoring, that is, is throttling something that _every_ block 
device should have (yes I think) or is it a delicate, optional thing 
that needs a tweakable algorithm per block device type (no I think).

The big worry I had was that by blocking on congestion in the 
submit_bio/make_request_fn I might stuff up system-wide mm writeout.  
But a while ago that part of the mm was tweaked (by Andrew if I recall 
correctly) to use a pool of writeout threads and understand the concept 
of one of them blocking on some block device, and not submit more 
writeout to the same block device until the first thread finishes its 
submission.  Meanwhile, other mm writeout threads carry on with other 
block devices.

> we can wait in it for memory in mempool. Although that means we
> already in trouble. 

Not at all.  This whole block writeout path needs to be written to run 
efficiently even when normal system memory is completely gone.  All it 
means when we wait on a mempool is that the block device queue is as 
full as we are ever going to let it become, and that means the block 
device is working as hard as it can (subject to a small caveat: for 
some loads a device can work more efficiently if it can queue up larger 
numbers of requests down at the physical elevators).

By the way, ddsnap waits on a counting semaphore, not a mempool.  That 
is because we draw our reserve memory from the global memalloc reserve, 
not from a mempool.  And that is not only because it takes less code to 
do so, but mainly because global pools as opposed to lots of little 
special purpose pools seem like a good idea to me.  Though I will admit 
that with our current scheme we need to allow for the total of the 
maximum reserve requirements for all memalloc users in the memalloc 
pool, so it does not actually save any memory vs dedicated pools.  We 
could improve that if we wanted to, by having hard and soft reserve 
requirements: the global reserve actually only needs to be as big as 
the total of the hard requirements.  With this idea, if by some unlucky 
accident every single pool user got itself maxed out at the same time, 
we would still not exceed our share of the global reserve.  
Under "normal" low memory situations, a block device would typically be 
free to grab reserve memory up to its soft limit, allowing it to 
optimize over a wider range of queued transactions.   My little idea 
here is: allocating specific pages to a pool is kind of dumb, all we 
really want to do is account precisely for the number of pages we are 
allowed to draw from the global reserve.
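
Under that idea the "pool" degenerates into pure accounting, something 
like the following.  This is entirely hypothetical, none of it exists:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(reserve_lock);

struct mem_reserve {
        long hard;              /* guaranteed pages, always honored */
        long soft;              /* extra allowance, honored only under
                                   "normal" low memory conditions */
        long drawn;             /* pages currently drawn */
};

static int reserve_draw(struct mem_reserve *res, long pages, int tight)
{
        long limit = tight ? res->hard : res->hard + res->soft;
        int ok = 0;

        spin_lock(&reserve_lock);
        if (res->drawn + pages <= limit) {
                res->drawn += pages;
                ok = 1;
        }
        spin_unlock(&reserve_lock);
        return ok;
}

static void reserve_return(struct mem_reserve *res, long pages)
{
        spin_lock(&reserve_lock);
        res->drawn -= pages;
        spin_unlock(&reserve_lock);
}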

OK, I kind of digressed, but this all counts as explaining the details 
of what Peter and I have been up to for the last year (longer for me).  
At this point, we don't need to do the reserve accounting in the most 
absolutely perfect way possible, we just need to get something minimal 
in place to fix the current deadlock problems, then we can iteratively 
improve it.

> I agree, any kind of high-boundary leveling must be implemented in
> device itself, since block layer does not know what device is at the
> end and what it will need to process given block request.

I did not say the throttling has to be implemented in the device, only 
that we did it there because it was easiest to code that up and try it 
out (it worked).  This throttling really wants to live at a higher 
level, possibly submit_bio()...bio->endio().  Someone at OLS (James 
Bottomley?) suggested it would be better done at the request queue 
layer, but I do not immediately see why that should be.  I guess this 
is going to come down to somebody throwing out a patch for interested 
folks to poke at.  But this detail is a fine point.  The big point is 
to have _some_ throttling mechanism in place on the block IO path, 
always.

Device mapper in particular does not have any throttling itself: calling 
submit_bio on a device mapper device directly calls the device mapper 
bio dispatcher.  Default-initialized block device queues do provide a 
crude form of throttling based on limiting the number of requests.  
This is insufficiently precise to do a good job in the long run, but it 
works for now because the current gaggle of low level block drivers do 
not have a lot of resource requirements and tend to behave fairly 
predictably (except for some irritating issues re very slow devices 
working in parallel with very fast devices, but... worry about that 
later).  Network block drivers...

Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote:
> On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote:
> > I see bits that worry me, e.g.:
> >
> > +   req = mempool_alloc(st->w->req_pool, GFP_NOIO);
> >
> > which seems to be callable in response to a local request, just the
> > case where NBD deadlocks.  Your mempool strategy can work reliably
> > only if you can prove that the pool allocations of the maximum
> > number of requests you can have in flight do not exceed the size of
> > the pool.  In other words, if you ever take the pool's fallback
> > path to normal allocation, you risk deadlock.
>
> mempool should be allocated to be able to catch up with maximum
> in-flight requests, in my tests I was unable to force block layer to
> put more than 31 pages in sync, but in one bio. Each request is
> essentially delayed bio processing, so this must handle maximum
> number of in-flight bios (if they do not cover multiple nodes, if
> they do, then each node requires own request).

It depends on the characteristics of the physical and virtual block 
devices involved.  Slow block devices can produce surprising effects.  
Ddsnap still qualifies as "slow" under certain circumstances (big 
linear write immediately following a new snapshot). Before we added 
throttling we would see as many as 800,000 bios in flight.  Nice to 
know the system can actually survive this... mostly.  But memory 
deadlock is a clear and present danger under those conditions and we 
did hit it (not to mention that read latency sucked beyond belief). 

Anyway, we added a simple counting semaphore to throttle the bio traffic 
to a reasonable number and behavior became much nicer, but most 
importantly, this satisfies one of the primary requirements for 
avoiding block device memory deadlock: a strictly bounded amount of bio 
traffic in flight.  In fact, we allow some bounded number of 
non-memalloc bios *plus* however much traffic the mm wants to throw at 
us in memalloc mode, on the assumption that the mm knows what it is 
doing and imposes its own bound of in flight bios per device.   This 
needs auditing obviously, but the mm either does that or is buggy.  In 
practice, with this throttling in place we never saw more than 2,000 in 
flight no matter how hard we hit it, which is about the number we were 
aiming at.  Since we draw our reserve from the main memalloc pool, we 
can easily handle 2,000 bios in flight, even under extreme conditions.

See:
http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
down(&info->throttle_sem);

To be sure, I am not very proud of this throttling mechanism for various 
reasons, but the thing is, _any_ throttling mechanism no matter how 
sucky solves the deadlock problem.  Over time I want to move the 
throttling up into bio submission proper, or perhaps incorporate it in 
device mapper's queue function, not quite as high up the food chain.  
Only some stupid little logistical issues stopped me from doing it one 
of those ways right from the start.   I think Peter has also tried some 
things in this area.  Anyway, that part is not pressing because the 
throttling can be done in the virtual device itself as we do it, even 
if it is not very pretty there.  The point is: you have to throttle the 
bio traffic.  The alternative is to die a horrible death under 
conditions that may be rare, but _will_ hit somebody.

Regards,

Daniel


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
Hi Mike,

On Thursday 02 August 2007 21:09, Mike Snitzer wrote:
> But NBD's synchronous nature is actually an asset when coupled with
> MD raid1 as it provides guarantees that the data has _really_ been
> mirrored remotely.

And bio completion doesn't?

Regards,

Daniel


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
Hi Evgeniy,

Nit alert:

On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> * storage can be formed on top of remote nodes and be exported
>   simultaneously (iSCSI is peer-to-peer only, NBD requires device
>   mapper and is synchronous)

In fact, NBD has nothing to do with device mapper.  I use it as a 
physical target underneath ddraid (a device mapper plugin) just like I 
would use your DST if it proves out.

Regards,

Daniel


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
On Friday 03 August 2007 07:53, Peter Zijlstra wrote:
> On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote:
> > On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote:
> > ...my main position is to
> > allocate per socket reserve from socket's queue, and copy data
> > there from main reserve, all of which are allocated either in
> > advance (global one) or per sockoption, so that there would be no
> > fairness issues what to mark as special and what to not.
> >
> > Say we have a page per socket, each socket can assign a reserve for
> > itself from own memory, this accounts both tx and rx side. Tx is
> > not interesting, it is simple, rx has global reserve (always
> allocated on startup or sometime way before reclaim/oom) where data
> is originally received (including skb, shared info and whatever is
> needed, page is just an example), then it is copied into per-socket
> > reserve and reused for the next packet. Having per-socket reserve
> > allows to have progress in any situation not only in cases where
> > single action must be received/processed, and allows to be
> > completely fair for all users, but not only special sockets, thus
> > admin for example would be allowed to login, ipsec would work and
> > so on...
>
> Ah, I think I understand now. Yes this is indeed a good idea!
>
> It would be quite doable to implement this on top of that I already
> have. We would need to extend the socket with a sock_opt that would
> reserve a specified amount of data for that specific socket. And then
> on socket demux check if the socket has a non zero reserve and has
> not yet exceeded said reserve. If so, process the packet.
>
> This would also quite neatly work for -rt where we would not want
> incoming packet processing to be delayed by memory allocations.

At this point we need "anything that works" in mainline as a starting 
point.  By erring on the side of simplicity we can make this 
understandable for folks who haven't spent the last two years wallowing 
in it.  The page per socket approach is about as simple as it gets.  I 
therefore propose we save our premature optimizations for later.

It will also help our cause if we keep any new internal APIs to strictly 
what is needed to make deadlock go away.  Not a whole lot more than 
just the flag to mark a socket as part of the vm writeout path when you 
get right down to essentials.
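
In other words, the essential new API is about this big.  Names are 
approximate: Peter's patches use a SOCK_MEMALLOC style flag, and 
skb_emergency() here stands in for however a reserve-fed packet gets 
recognized:

#include <net/sock.h>

static inline void sk_set_memalloc(struct sock *sk)
{
        sock_set_flag(sk, SOCK_MEMALLOC);       /* vm writeout path socket */
}

/* at socket demux: a packet fed from the emergency reserve may only be
   delivered to a socket marked as part of the vm writeout path */
static int sock_check_emergency(struct sock *sk, struct sk_buff *skb)
{
        if (skb_emergency(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
                kfree_skb(skb);                 /* not entitled: drop */
                return -ENOMEM;
        }
        return 0;
}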

Regards,

Daniel


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote:
> ...rx has global reserve (always allocated on
> startup or sometime way before reclaim/oom) where data is originally
> received (including skb, shared info and whatever is needed, page is
> just an example), then it is copied into per-socket reserve and
> reused for the next packet. Having per-socket reserve allows to have
> progress in any situation not only in cases where single action must
> be received/processed, and allows to be completely fair for all
> users, but not only special sockets, thus admin for example would be
> allowed to login, ipsec would work and so on...

And when the global reserve is entirely used up your system goes back to 
dropping vm writeout acknowledgements, not so good.  I like your 
approach, and specifically the copying idea cuts out considerable 
complexity.  But I believe the per-socket flag to mark a socket as part 
of the vm writeout path is not optional, and in this case it will be a 
better world if it is a slightly unfair world in favor of vm writeout 
traffic.

Ssh will still work fine even with vm getting priority access to the 
pool.  During memory crunches, non-vm ssh traffic may get bumped till 
after the crunch, but vm writeout is never supposed to hog the whole 
machine.  If vm writeout hogs your machine long enough to delay an ssh 
login then that is a vm bug and should be fixed at that level.

Regards,

Daniel


Re: Distributed storage.

2007-08-02 Thread Daniel Phillips
On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce first release of the distributed storage
> subsystem, which allows to form a storage on top of remote and local
> nodes, which in turn can be exported to another storage as a node to
> form tree-like storages.

Excellent!  This is precisely what the doctor ordered for the 
OCFS2-based distributed storage system I have been mumbling about for 
some time.  In fact the dd in ddsnap and ddraid stands for "distributed 
data".  The ddsnap/raid devices do not include an actual network 
transport, that is expected to be provided by a specialized block 
device, which up till now has been NBD.  But NBD has various 
deficiencies as you note, in addition to its tendency to deadlock when 
accessed locally.  Your new code base may be just the thing we always 
wanted.  We (zumastor et al) will take it for a drive and see if 
anything breaks.

Memory deadlock is a concern of course.  From a cursory glance through, 
it looks like this code is pretty vm-friendly and you have thought 
quite a lot about it, however I respectfully invite peterz 
(obsessive/compulsive memory deadlock hunter) to help give it a good 
going over with me.

I see bits that worry me, e.g.:

+   req = mempool_alloc(st->w->req_pool, GFP_NOIO);

which seems to be callable in response to a local request, just the case 
where NBD deadlocks.  Your mempool strategy can work reliably only if 
you can prove that the pool allocations of the maximum number of 
requests you can have in flight do not exceed the size of the pool.  In 
other words, if you ever take the pool's fallback path to normal 
allocation, you risk deadlock.
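
Stated as code, the proof obligation is just that the pool is created 
at least as big as the throttle bound, so the fallback path is never 
taken.  MAX_IN_FLIGHT and the struct names below are placeholders, not 
your actual DST definitions:

#define MAX_IN_FLIGHT 256       /* must match the request throttle bound */

static int dst_pool_init(struct dst_worker *w)
{
        w->req_pool = mempool_create_kmalloc_pool(MAX_IN_FLIGHT,
                                        sizeof(struct dst_request));
        return w->req_pool ? 0 : -ENOMEM;
}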

Anyway, if this is as grand as it seems then I would think we ought to 
factor out a common transfer core that can be used by all of NBD, 
iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own 
code those things have now.

Regards,

Daniel


Re: Network receive stall avoidance (was [PATCH 2/9] deadlock prevention core)

2006-08-18 Thread Daniel Phillips

Andrew Morton wrote:

handwaving


- The mmap(MAP_SHARED)-the-whole-world scenario should be fixed by
  mm-tracking-shared-dirty-pages.patch.  Please test it and if you are
  still able to demonstrate deadlocks, describe how, and why they
  are occurring.


OK, but please see "atomic 0 order allocation failure" below.


- We expect that the lots-of-dirty-anon-memory-over-swap-over-network
  scenario might still cause deadlocks.  


  I assert that this can be solved by putting swap on local disks.  Peter
  asserts that this isn't acceptable due to disk unreliability.  I point
  out that local disk reliability can be increased via MD, all goes quiet.

  A good exposition which helps us to understand whether and why a
  significant proportion of the target user base still wishes to do
  swap-over-network would be useful.


I don't care much about swap-over-network, Peter does.  I care about
reliable remote disks in general, and reliable nbd in particular.


- Assuming that exposition is convincing, I would ask that you determine
  at what level of /proc/sys/vm/min_free_kbytes the swap-over-network
  deadlock is no longer demonstrable.


Earlier, I wrote: "we need to actually fix the out of memory
deadlock/performance bug right now so that remote storage works
properly."  It is actually far more likely that memory reclaim stuffups
will cause TCP to drop block IO packets here and there than that we
will hit some endless retransmit deadlock.  But dropping packets and
thus causing TCP stalls during block writeout is a very bad thing too,
I do not think you could call a system that exhibits such behavior a
particularly reliable one.

So rather than just the word deadlock, let us add "or atomic 0 order
alloc failure during TCP receive" to the challenge.  Fair?

On the client send side, Peter already showed an easily reproducible
deadlock which we avoid by using elevator-noop.  The bug is still
there, it is just under a different rock now.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-18 Thread Daniel Phillips

Andrew Morton wrote:

...in my earlier emails I asked a number of questions regarding
whether existing facilities, queued patches or further slight kernel
changes could provide a sufficient solution to these problems.  The answer
may well be "no".  But diligence requires that we be able to prove that,
and handwaving doesn't provide that proof, does it?


Hi Andrew,

I missed those questions about queued patches.  So far the closest we
have seen to anything relevant is Evgeniy's proposed network sub-allocator,
which just doesn't address the problem, as Peter and I have explained in
some detail, and the dirty page throttling machinery which is obviously
flawed in that it leaves a lot of room for corner cases that have to be
handled with more special purpose logic.  For example, what about kernel
users that just go ahead and use lots of slab without getting involved
in the dirty/mapped page accounting?  I don't know, maybe you will handle
that too, but then there will be another case that isn't handled, and so
on.  The dirty page throttling approach is just plain fragile by nature.

We already know we need to do reserve accounting for struct bio, so what
is the conceptual difficulty in extending that reasoning to struct
sk_buff, which is very nearly the same kind of beastie, performing very
nearly the same kind of function, and suffering very nearly the same kind
of problems we had in the block layer before mingo's mempool machinery
arrived?

Did I miss some other queued patch that bears on this?  And I am not sure
what handwaving you are referring to.

Note that we have not yet submitted this patch to the queue, just put
it up for comment.  This patch set is actually getting more elegant as
various individual artists get their flames in.  At some point it will
acquire a before-and-after test case as well, then we can realistically
compare it to the best of the alternate proposals.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-18 Thread Daniel Phillips

Andrew Morton wrote:

Daniel Phillips wrote:

Andrew Morton wrote:


...it's runtime configurable.


So we default to "less than the best" because we are too lazy to fix the
network starvation issue properly?  Maybe we don't really need a mempool for
struct bio either, isn't that one rather like the reserve pool we propose
for sk_buffs?  (And for that matter, isn't a bio rather more like an sk_buff
than one would think from the code we have written?)


unavailable for write caching just to get around the network receive
starvation issue?


No, it's mainly to avoid latency: to prevent tasks which want to allocate
pages from getting stuck behind writeback.


This seems like a great help for some common loads, but a regression for
other not-so-rare loads.  Regardless of whether or not marking 60% of memory
as unusable for write caching is a great idea, it is not a compelling reason
to fall for the hammer/nail trap.


What happens if some in kernel user grabs 68% of kernel memory to do some
very important thing, does this starvation avoidance scheme still work?


Well something has to give way.  The process might get swapped out a bit,
or it might stall in the page allocator because of all the dirty memory.


It is that "something has to give way" that makes me doubt the efficacy of
our global write throttling scheme as a solution for network receive memory
starvation.  As it is, the thing that gives way may well be the network
stack.  Avoiding this hazard via a little bit of simple code is the whole
point of our patch set.

Yes, we did waltz off into some not-so-simple code at one point, but after
discussing it, Peter and I agree that the simple approach actually works
well enough, so we just need to trim down the patch set accordingly and
thus prove that our approach really is nice, light and tight.

Specifically, we do not actually need any fancy sk_buff sub-allocator
(either ours or the one proposed by davem).  We already got rid of the per
device reserve accounting in order to finesse away the sk_buff->dev oddity
and shorten the patch.  So the next rev will be rather lighter than the
original.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-17 Thread Daniel Phillips

Daniel Phillips wrote:

Andrew Morton wrote:

Processes which are dirtying those pages throttle at
/proc/sys/vm/dirty_ratio% of memory dirty.  So it is not possible to "fill"
memory with dirty pages.  If the amount of physical memory which is dirty
exceeds 40%: bug.


So we make 400 MB of a 1 GB system unavailable for write caching just to
get around the network receive starvation issue?


Excuse me, 600 MB of a 1 GB system :-/

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-17 Thread Daniel Phillips

Andrew Morton wrote:

Daniel Phillips <[EMAIL PROTECTED]> wrote:

What happened to the case where we just fill memory full of dirty file
pages backed by a remote disk?


Processes which are dirtying those pages throttle at
/proc/sys/vm/dirty_ratio% of memory dirty.  So it is not possible to "fill"
memory with dirty pages.  If the amount of physical memory which is dirty
exceeds 40%: bug.


Hi Andrew,

So we make 400 MB of a 1 GB system unavailable for write caching just to
get around the network receive starvation issue?

What happens if some in kernel user grabs 68% of kernel memory to do some
very important thing, does this starvation avoidance scheme still work?

Regards,

Daniel


Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-17 Thread Daniel Phillips

Evgeniy Polyakov wrote:

On Thu, Aug 17, 2006 at 09:15:14PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
wrote:

I got openssh as example of situation when system does not know in 
advance, what sockets must be marked as critical.

OpenSSH works with network and unix sockets in parallel, so you need to
hack openssh code to be able to allow it to use reserve when there is 
not enough memory.


OpenSSH or any other user-space program will never ever have the problem
we are trying to solve. Nor could it be fixed the way we are solving it,
user-space programs can be stalled indefinitely. We are concerned with
kernel services, and the continued well being of the kernel, not
user-space. (well therefore indirectly also user-space of course)



You limit your system here - it is possible that userspace should send
some command when kernel agent requires some negotiation.
And even for them it is possible to require ARP request and/or ICMP
processing.



Actually all sockets must be able to get data, since
kernel can not differentiate between telnetd and a process which will 
receive an ack for swapped page or other significant information.


Oh, but it does, the kernel itself controls those sockets: NBD / iSCSI
and AoE are all kernel services, not user-space. And it is the core of
our work to provide this information to the kernel; to distinguish these
few critical sockets.



As I stated above it is not enough.
And even if you will cover all kernel-only network allocations, which are
used for your selected datapath, problem still there - admin is unable
to connect although it can be critical connection too.


So network must behave separately from main allocator in that period of 
time, but since it is possible that reserve can be not filled or has not 
enough space or something other, it must be preallocated in far advance 
and should be quite big, but then why network should use it at all, when 
being separated from main allocations solves the problem?


You still need to guarantee data-paths to these services, and you need
to make absolutely sure that your last bit of memory is used to service
these critical sockets, not some random blocked user-space process.

You cannot pre-allocate enough memory _ever_ to satisfy the total
capacity of the network stack. You _can_ allot a certain amount of
memory to the network stack (avoiding DoS), and drop packets once you
exceed that. But still, you need to make sure these few critical
_kernel_ services get their data.


Feel free to implement any receiving policy inside _separated_ allocator
to meet your needs, but if allocator depends on main system's memory
conditions it is always possible that it will fail to make forward
progress.


Wrong.  Our main allocator has a special reserve that can be accessed
only by a task that has its PF_MEMALLOC flag set.  This reserve exists
in order to guarantee forward progress in just such situations as the
network runs into when it is trying to receive responses from a remote
disk.  Anything otherwise is a bug.
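
For the record, reaching that reserve is a two line affair, which is 
why no separate allocator is needed.  Simplified: real callers save and 
restore the prior flag state rather than clearing it blindly:

static struct page *alloc_page_reserve(gfp_t gfp_mask)
{
        struct page *page;

        current->flags |= PF_MEMALLOC;  /* allocator now ignores watermarks
                                           for this task */
        page = alloc_page(gfp_mask);
        current->flags &= ~PF_MEMALLOC;
        return page;
}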


I do not argue that your approach is bad or does not solve the problem,
I'm just trying to show that further evolution of that idea eventually
ends up in separated allocator (as long as all most robust systems
separate operations), which can improve things in a lot of other sides
too.


Not a separate allocator per-se, separate socket group, they are
serviced by the kernel, they will never refuse to process data, and it
is critical for the continued well-being of your kernel that they get
their data.


The memalloc reserve is indeed a separate reserve, however it is a
reserve that already exists, and you are busy creating another separate
reserve to further partition memory.  Partitioning memory is not the
direction we should be going, we should be trying to unify our reserves
wherever possible, and find ways such as Andrew and others propose to
implement "reserve on demand", or in other words, take advantage of
"easily freeable" pages.

If your allocation code is so much more efficient than slab then why
don't you fix slab instead of replicating functionality that already
exists elsewhere in the system, and has since day one?


You do not know in advance which sockets must be separated (since only
in the simplest situation it is the same as in NBD and is
kernelspace-only),


Yes we do, they are exactly those sockets that lie in the block IO path.
The VM cannot deadlock on any others.


you can not solve problem with ARP/ICMP/route changes and other control
messages, netfilter, IPsec and compression which still can happen in your 
setup, 


If you bothered to read the patches you would know that ICMP is indeed
handled.  ARP I don't think so, we may need to do that since ARP can
believably be required for remote disk interface failover.  Anybody
who designs ssh into remote disk failover is an idiot.  Ssh for
configuration, monitoring and administration is fine.  Ssh for fencing
or whatever is just plain stupid, unless the nodes run

Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-17 Thread Daniel Phillips

Evgeniy Polyakov wrote:
Just for clarification - it will be completely impossible to login using 
openssh or some other privilege separation protocol to the machine due 
to the nature of unix sockets. So you will be unable to manage your
storage system just because it is in OOM - it is not what is expected
from reliable system.


The system is not OOM, it is in reclaim, a transient condition that will be
resolved in normal course by IO progress.  However you raise an excellent
point: if there is any remote management that we absolutely require to be
available while remote IO is interrupted - manual failover for example -
then we must supply a means of carrying out such remote administration, that
is guaranteed not to deadlock on a normal mode memory request.  This ends up
as a new network stack feature I think, and probably a theoretical one for
the time being since we don't actually know of any such mandatory login
that must be carried out while remote disk IO is suspended.


That is why you are not allowed to depend on main system's allocator
problems. That is why network can have it's own allocator.


Please check your assumptions.  Do you understand how PF_MEMALLOC works,
and where it gets its reserve memory from?


But really, if you expect to run reliable block IO to Zanzibar over an ssh
tunnel through a firewall, then you might also consider taking up bungee
jumping with the cord tied to your neck.

Just pure openssh for control connection (admin should be able to
login).


And the admin will be able to, but in the cluster stack itself we don't
bless such stupidity as emailing an admin to ask for a login in order to
break a tie over which node should take charge of DLM recovery.


No, you can't since openssh and any other privilege separation
mechanisms use additional sockets to transfer data between its parts,
unix sockets require page sized allocation frequently which will end up
with 8k allocation in slab.
You will not be able to login using openssh.


*** The system is not OOM, it is in reclaim, a transient condition ***

Which Peter already told you!  Please, let's get our facts straight,
then we can look intelligently at whether your work is appropriate to
solve the block IO network starvation problem that Peter and I are
working on.

Regards,

Daniel


Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-16 Thread Daniel Phillips

Evgeniy Polyakov wrote:

On Mon, Aug 14, 2006 at 08:45:43AM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
wrote:

Just pure openssh for control connection (admin should be able to
login).


These periods of degenerated functionality should be short and
infrequent albeit critical for machine recovery. Would you rather have a
slower ssh login (the machine will recover) or drive/fly to Zanzibar to
physically reboot the machine?


It will not work, since you can not mark openssh sockets as those which
are able to get memory from reserved pool. So admin unable to check the
system status and make anything to turn system's life on.


This is incorrect, please see my previous email.

Regards,

Daniel



Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-16 Thread Daniel Phillips

Evgeniy Polyakov wrote:

On Sun, Aug 13, 2006 at 01:16:15PM -0700, Daniel Phillips ([EMAIL PROTECTED]) 
wrote:

Indeed.  The rest of the corner cases like netfilter, layered protocol and
so on need to be handled, however they do not need to be handled right now
in order to make remote storage on a lan work properly.  The sane thing for
the immediate future is to flag each socket as safe for remote block IO or
not, then gradually widen the scope of what is safe.  We need to set up an
opt in strategy for network block IO that views such network subsystems as
ipfilter as not safe by default, until somebody puts in the work to make
them safe.


Just for clarification - it will be completely impossible to login using 
openssh or some other privilege separation protocol to the machine due 
to the nature of unix sockets. So you will be unable to manage your
storage system just because it is in OOM - it is not what is expected
from reliable system.


The system is not OOM, it is in reclaim, a transient condition that will be
resolved in normal course by IO progress.  However you raise an excellent
point: if there is any remote management that we absolutely require to be
available while remote IO is interrupted - manual failover for example -
then we must supply a means of carrying out such remote administration, that
is guaranteed not to deadlock on a normal mode memory request.  This ends up
as a new network stack feature I think, and probably a theoretical one for
the time being since we don't actually know of any such mandatory login
that must be carried out while remote disk IO is suspended.


But really, if you expect to run reliable block IO to Zanzibar over an ssh
tunnel through a firewall, then you might also consider taking up bungee
jumping with the cord tied to your neck.


Just pure openssh for control connection (admin should be able to
login).


And the admin will be able to, but in the cluster stack itself we don't
bless such stupidity as emailing an admin to ask for a login in order to
break a tie over which node should take charge of DLM recovery.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-16 Thread Daniel Phillips

Andrew Morton wrote:

What is a "socket wait queue" and how/why can it consume so much memory?


Two things:

  1) sk_buffs in flight between device receive interrupt and layer 3
 protocol/socket identification.

  2) sk_buffs queued onto a particular socket waiting for some task to
 come along and pull them off via read or equivalent.

Case (1) probably can't consume an unbounded amount of memory, but I
would not swear to that with my current reading knowledge of the network
stack.  The upper bound here is obscured by clever SMP device processing,
netfilter options, softirq scheduling questions, probably other things.
This needs a considered explanation from a network guru, or perhaps a
pointer to documentation.

Case (2) is the elephant under the rug.  Some form of TCP memory
throttling exists, but there is no organized way to correlate that with
actual memory conditions, and it appears to be exposed to user control.
Memory throttling seems to be entirely absent for non-TCP protocols,
e.g., UDP.


Can it be prevented from doing that?


This patch set does that, and also provides an emergency reserve for
network devices in order to prevent atomic allocation failures while
trying to refill NIC DMA buffer rings.  It is the moral equivalent of
our bio reservation scheme, but with additional twists specific to the
network stack.
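
The driver-visible shape of that reserve is roughly as follows. 
dev_alloc_skb is the stock API; alloc_skb_emergency is a made-up name 
for the reserve-backed fallback the patch set provides:

#include <linux/skbuff.h>

static struct sk_buff *refill_rx_ring(struct net_device *dev, unsigned len)
{
        struct sk_buff *skb = dev_alloc_skb(len);

        if (unlikely(!skb))
                skb = alloc_skb_emergency(len); /* draw from the reserve and
                                                   mark the skb so the pages
                                                   go back on free */
        return skb;
}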

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-16 Thread Daniel Phillips

Andrew Morton wrote:

Peter Zijlstra <[EMAIL PROTECTED]> wrote:

Testcase:

Mount an NBD device as sole swap device and mmap > physical RAM, then
loop through touching pages only once.


Fix: don't try to swap over the network.  Yes, there may be some scenarios
where people have no local storage, but it's reasonable to expect anyone
who is using Linux as an "enterprise storage platform" to stick a local
disk on the thing for swap.

That leaves MAP_SHARED, but mm-tracking-shared-dirty-pages.patch will fix
that, will it not?


Hi Andrew,

What happened to the case where we just fill memory full of dirty file
pages backed by a remote disk?

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-13 Thread Daniel Phillips

David Miller wrote:

I think there is more profitability from a solution that really does
something about "network memory", and doesn't try to say "these
devices are special" or "these sockets are special".  Special cases
generally suck.

We already limit and control TCP socket memory globally in the system.
If we do this for all socket and anonymous network buffer allocations,
which is sort of implicity in Evgeniy's network tree allocator design,
we can solve this problem in a more reasonable way.


This does sound promising, but...

It is not possible to solve this problem entirely in the network
layer.  At minimum, throttling is needed in the nbd driver, or even
better, in the submit_bio path as we have it now.  We also need a way
of letting the virtual block device declare its resource needs, which
we also have in some form.  We also need a mechanism to guarantee
level 2 delivery so we can drop unrelated packets if necessary, which
exists in the current patch set.

So Evgeniy's work might well be a part of the solution, but it is not
the whole solution.


And here's the kick, there are other unrelated highly positive
consequences to using Evgeniy's network tree allocator.


Also good.  But is it ready to use today?  We need to actually fix
the out of memory deadlock/performance bug right now so that remote
storage works properly.


It doesn't just solve the _one_ problem it was built for, it solves
several problems.  And that is the hallmark signature of good design.


Great.  But to solve the whole problem, the block IO system and the
network layer absolutely must cooperate.

Regards,

Daniel



Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-13 Thread Daniel Phillips

David Miller wrote:

From: Daniel Phillips <[EMAIL PROTECTED]>

David Miller wrote:


The reason is that there is no refcounting performed on these devices
when they are attached to the skb, for performance reasons, and thus
the device can be downed, the module for it removed, etc. long before
the skb is freed up.


The virtual block device can refcount the network device on virtual
device create and un-refcount on virtual device delete.


What if the packet is originally received on the device in question,
and then gets redirected to another device by a packet scheduler
traffic classifier action or a netfilter rule?

It is necessary to handle the case where the device changes on the
skb, and the skb gets freed up in a context and assosciation different
from when the skb was allocated (for example, different from the
device attached to the virtual block device).


This aspect of the patch became moot because of the change to a single
reserve for all layer 2 delivery in Peter's more recent revisions.

*However* maybe it is worth mentioning that I intended to provide a
pointer from each sk_buff to a common accounting struct.  This pointer
is set by the device driver.  If the driver knows nothing about memory
accounting then the pointer is null, no accounting is done, and block
IO over the interface will be dangerous.  Otherwise, if the system is
in reclaim (which we currently detect crudely when a normal allocation
fails) then atomic ops will be done on the accounting structure.

I planned to use the (struct sock *)sk_buff->sk field for this, which
is unused during layer 2 delivery as far as I can see.  The accounting
struct can look somewhat like a struct sock if we like, size doesn't
really matter, and it might make the code more robust.  Or the field
could become a union.
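
For the record, here is roughly what that would have looked like; all
of it is hypothetical, since the idea was dropped:

        /* common accounting object, set only by memalloc-aware drivers */
        struct memalloc_account {
                atomic_t reserve_used;  /* reserve pages in flight */
                int reserve;            /* pages granted at driver init */
        };

        /* driver rx path: NULL means no accounting, as described above */
        skb->sk = (struct sock *)&my_account;   /* or make ->sk a union */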

Anyway, this bit doesn't matter any more, the single global packet
delivery reserve is better and simpler.

Regards,

Daniel


Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-13 Thread Daniel Phillips

Evgeniy Polyakov wrote:

One must receive a packet to determine if that packet must be dropped,
unless tricky hardware with header split capabilities or MMIO copying is
used. Peter uses a special pool to get data from when the system is in OOM
(at least in his latest patchset), so allocations are separated and thus
network code is not affected by the OOM condition, which allows it to make
forward progress.


Nice executive summary.  Crucial point: you want to say "in reclaim"
not "in OOM".

Yes, right from the beginning the patch set got its sk_buff memory
from a special pool when the system is in reclaim, however the exact
nature of the pool and how/where it is accounted has evolved... mostly
forward.

Regards,

Daniel


Re: [RFC][PATCH 0/4] VM deadlock prevention -v4

2006-08-13 Thread Daniel Phillips

Peter Zijlstra wrote:

On Sat, 2006-08-12 at 20:16 +0200, Indan Zupancic wrote:

What was missing or wrong in the old approach? Can't you use the new
approach, but use alloc_pages() instead of SROG?

Sorry if I bug you so, but I'm also trying to increase my knowledge here. ;-)


I'm almost sorry I threw that code out...


Good instinct :-)


Lemme see what I can do to explain; what I need/want is:
 - single allocation group per packet - that is, when I free a packet
and all its associated objects I get my memory back.


First, try to recast all your objects as pages, as Evgeniy Polyakov
suggested.  Then if there is some place where that just doesn't work
(please point it out) put a mempool there and tweak the high level
reservation setup accordingly.
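
For example, a page-backed mempool at such a spot is only a few lines;
the sizing here is invented for illustration:

        #include <linux/mempool.h>

        static mempool_t *skb_data_pool;

        /* guarantee 64 order-0 pages for the path that can't use bare pages */
        static int __init skb_data_pool_init(void)
        {
                skb_data_pool = mempool_create_page_pool(64, 0);
                return skb_data_pool ? 0 : -ENOMEM;
        }

        /* then: page = mempool_alloc(skb_data_pool, GFP_ATOMIC);
         *       ...
         *       mempool_free(page, skb_data_pool); */

The 64 would of course be counted against the high level reservation
rather than guessed at.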


 - not waste too much space managing the various objects


If we waste a little space only where the network would have otherwise
dropped a packet, that is still a big win.  We just need to be sure the
normal path does not become more wasteful.


skb operations want to allocate various sk_buffs for the same data
clones. Also, it wants to be able to break the COW or realloc the data.

The trivial approach would be one page (or higher alloc page) per
object, and that will work quite well, except that it'll waste a _lot_
of memory. 


High order allocations are just way too undependable without active
defragmentation, which isn't even on the horizon at the moment.  We
just need to treat any network hardware that can't scatter/gather into
single pages as too broken to use for network block IO.

As for sk_buff cow break, we need to look at which network paths do it
(netfilter obviously, probably others) and decide whether we just want
to declare that the feature breaks network block IO, or fix the feature
so it plays well with reserve accounting.


So I tried manual packing (parts of that you have seen in previous
attempts). This gets hard when you want to do unlimited clones and COW
breaks. To do either you need to go link several pages.


You need to avoid the temptation to fix the entire world on the first
attempt.  Sure, there will be people who call for gobs of overengineering
right from the start, but simple has always worked better for Linux than
lathering on layers of complexity just to support some feature that may
arguably be broken by design.  For example, swapping through a firewall.

Changing from per-interface to a global block IO reserve was a great
simplification, we need more of those.

Looking forward to -v5 ;-)

Regards,

Daniel


Re: rename *MEMALLOC flags

2006-08-13 Thread Daniel Phillips

Peter Zijlstra wrote:

Jeff Garzik in his infinite wisdom spake thusly:

Peter Zijlstra wrote:


Index: linux-2.6/include/linux/gfp.h
===
--- linux-2.6.orig/include/linux/gfp.h  2006-08-12 12:56:06.0 +0200
+++ linux-2.6/include/linux/gfp.h   2006-08-12 12:56:09.0 +0200
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_ZERO  ((__force gfp_t)0x8000u)/* Return zeroed page on 
success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency 
reserves */
#define __GFP_HARDWALL   ((__force gfp_t)0x2u) /* Enforce hardwall cpuset 
memory allocs */
+#define __GFP_MEMALLOC  ((__force gfp_t)0x4u) /* Use emergency reserves */


This symbol name has nothing to do with its purpose.  The entire area of 
code you are modifying could be described as having something to do with 
'memalloc'.


GFP_EMERGENCY or GFP_USE_RESERVES or somesuch would be a far better 
symbol name.


I recognize that is matches with GFP_NOMEMALLOC, but that doesn't change 
the situation anyway.  In fact, a cleanup patch to rename GFP_NOMEMALLOC 
would be nice.


I'm rather bad at picking names, but here goes:

PF_MEMALLOC  -> PF_EMERGALLOC
__GFP_NOMEMALLOC -> __GFP_NOEMERGALLOC
__GFP_MEMALLOC   -> __GFP_EMERGALLOC

Is that suitable and shall I prepare patches? Or do we want more ppl to
chime in and have a few more rounds?


MEMALLOC is the name Linus chose for exactly the reserve from which we
are allocating.  Perhaps that was just Linus being denser than jgarzik and
not realizing that he should have called it EMERGALLOC right from the start.

BUT since Linus did call it MEMALLOC, we should too.  Or just email Linus
and tell him how much better EMERGALLOC rolls off the tongue, and could we
please change all occurrences of MEMALLOC to EMERGALLOC.  Then don't read
your email for a week ;-)

Inventing a new name for an existing thing is very poor taste on grounds of
grepability alone.

Regards,

Daniel



Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-13 Thread Daniel Phillips

Rik van Riel wrote:

Thomas Graf wrote:

skb->dev is not guaranteed to still point to the "allocating" device
once the skb is freed again so reserve/unreserve isn't symmetric.
You'd need skb->alloc_dev or something.


There's another consequence of this property of the network
stack.

Every network interface must be able to fall back to these
MEMALLOC allocations, because the memory critical socket
could be on another network interface.  Hence, we cannot
know which network interfaces should (not) be marked MEMALLOC.


Good point.  We do however know which interfaces should be marked
capable of carrying block IO traffic: the ones that have been fixed,
audited and tested.  We might then allow the network block device to
specify which interface(s) will actually carry the traffic.

The advantage of being specific about which devices are carrying at
least one block io socket is, we can skip the reserve accounting for
the other interfaces.  But is the extra layer of configuration gack a
better idea than just doing the accounting for every device, provided
the system is in reclaim?
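
Code-wise the specific marking is tiny; the virtual block device would
set the existing per-interface bit when it binds (a sketch, with locking
per Dave's earlier RTNL catch):

        rtnl_lock();
        dev->flags |= IFF_MEMALLOC;     /* audited block IO carrier */
        rtnl_unlock();

so the question is purely one of configuration burden, not code size.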

By the way, another way to avoid impact on the normal case is an
experimental option such as CONFIG_PREVENT_NETWORK_BLOCKIO_DEADLOCK.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-13 Thread Daniel Phillips

David Miller wrote:

From: Peter Zijlstra <[EMAIL PROTECTED]>

Hmm, what does sk_buff::input_dev do? That seems to store the initial
device?


You can run grep on the tree just as easily as I can which is what I
did to answer this question.  It only takes a few seconds of your
time to grep the source tree for things like "skb->input_dev", so
would you please do that before asking more questions like this?

It does store the initial device, but as Thomas tried so hard to
explain to you guys these device pointers in the skb are transient and
you cannot refer to them outside of packet receive processing.


Thomas did a great job of explaining and without any flaming or ad
hominem attacks.

We have now formed a decent plan for doing the accounting in a stable
way without adding new fields to sk_buff.  Thank you for the catch.


The reason is that there is no refcounting performed on these devices
when they are attached to the skb, for performance reasons, and thus
the device can be downed, the module for it removed, etc. long before
the skb is freed up.


The virtual block device can refcount the network device on virtual
device create and un-refcount on virtual device delete.  We need to
add that to the core patch and maybe package it nicely so the memalloc
reserve/unreserve happens at the same time, in a tidy little library
function to share with other virtual devices like iSCSI that also need
some anti-deadlock lovin.
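
A first stab at that library function, with invented names (the
adjust_memalloc_reserve interface is from the earlier patch series):

        /* pair the device pin with the reserve adjustment so neither can
         * be forgotten; hypothetical helper for virtual block devices */
        static int memalloc_bind_netdev(struct net_device *dev, int pages)
        {
                int err = adjust_memalloc_reserve(pages);
                if (err)
                        return err;
                dev_hold(dev);          /* keep the device until unbind */
                return 0;
        }

        static void memalloc_unbind_netdev(struct net_device *dev, int pages)
        {
                dev_put(dev);
                adjust_memalloc_reserve(-pages);
        }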

Regards,

Daniel


Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-13 Thread Daniel Phillips

Peter Zijlstra wrote:

On Wed, 2006-08-09 at 16:54 -0700, David Miller wrote:

People are doing I/O over IP exactly for its ubiquity and
flexibility.  It seems a major limitation of the design if you cancel
out major components of this flexibility.


We're not, that was a bit of my own frustration leaking out; I think 
this whole push to IP based storage is a bit silly. I'm just not going 
to help the admin whose server just hangs because his VPN key expired.


Running critical resources remotely like this is tricky, and every 
hop/layer you put in between increases the risk of something going bad.

The only setup I think even remotely sane is a dedicated network in the
very same room - not unlike FC but cheaper (which I think is the whole
push behind this, eth is cheap)


Indeed.  The rest of the corner cases like netfilter, layered protocol and
so on need to be handled, however they do not need to be handled right now
in order to make remote storage on a lan work properly.  The sane thing for
the immediate future is to flag each socket as safe for remote block IO or
not, then gradually widen the scope of what is safe.  We need to set up an
opt in strategy for network block IO that views such network subsystems as
ipfilter as not safe by default, until somebody puts in the work to make
them safe.
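
In terms of the current patches the opt-in is just one bit.  A sketch,
using helper names from earlier versions of the patch set:

        /* the virtual block device marks its transport socket critical: */
        sock_set_flag(sk, SOCK_MEMALLOC);

        /* and a supported protocol's receive path, while the reserve is
         * in use, sheds everything that is not so marked (illustrative): */
        if (dev_reserve_used(skb->dev) && !sock_flag(sk, SOCK_MEMALLOC)) {
                dev_drop_skb(skb);      /* frees skb and releases reserve */
                return 0;
        }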

But really, if you expect to run reliable block IO to Zanzibar over an ssh
tunnel through a firewall, then you might also consider taking up bungee
jumping with the cord tied to your neck.

Regards,

Daniel


Re: [RFC][PATCH 8/9] 3c59x driver conversion

2006-08-13 Thread Daniel Phillips

David Miller wrote:

I think he's saying that he doesn't think your code is yet a
reasonable way to solve the problem, and therefore doesn't belong
upstream.


That is why it has not yet been submitted upstream.  Respectfully, I
do not think that jgarzik has yet put in the work to know if this
anti-deadlock technique is reasonable or not, and he was only commenting
on some superficial blemish.  I still don't get his point, if there
was one.  He seems to be arguing in favor of a jump-off-the-cliff
approach to driver conversion.  If he wants to do the work and take
the blame when some driver inevitably breaks because of being edited
in a hurry then he is welcome to submit the necessary additional
patches.  Until then, there are about 3 nics that actually matter to
network storage at the moment, all of them GigE.

The layer 2 blemishes can be fixed easily, including avoiding the
atomic op stall and the ->dev volatility.  Thank you for pointing
those out.

Regards,

Daniel


Re: [RFC][PATCH 0/9] Network receive deadlock prevention for NBD

2006-08-08 Thread Daniel Phillips

Evgeniy Polyakov wrote:

On Tue, Aug 08, 2006 at 09:33:25PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
wrote:


  http://lwn.net/Articles/144273/
  "Kernel Summit 2005: Convergence of network and storage paths"

We believe that an approach very much like today's patch set is
necessary for NBD, iSCSI, AoE or the like ever to work reliably. 
We further believe that a properly working version of at least one of

these subsystems is critical to the viability of Linux as a modern
storage platform.


There is another approach for that - do not use slab allocator for
network dataflow at all. It automatically has all your pros and if
implemented correctly can have a lot of additional useful and
high-performance features like full zero-copy and total fragmentation
avoidance.


Agreed.  But probably more intrusive than davem would be happy with
at this point.

Regards,

Daniel


Re: [RFC][PATCH 8/9] 3c59x driver conversion

2006-08-08 Thread Daniel Phillips

Jeff Garzik wrote:

Peter Zijlstra wrote:

Update the driver to make use of the netdev_alloc_skb() API and the
NETIF_F_MEMALLOC feature.


NETIF_F_MEMALLOC does not exist in the upstream tree...  nor should it, 
IMO.


Elaborate please.  Do you think that all drivers should be updated to
fix the broken blockdev semantics, making NETIF_F_MEMALLOC redundant?
If so, I trust you will help audit for it?


netdev_alloc_skb() is in the tree, and that's fine.



Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-08 Thread Daniel Phillips

David Miller wrote:

From: Daniel Phillips <[EMAIL PROTECTED]>

 >>Can you please characterize the conditions under which skb->dev changes

after the alloc?  Are there writings on this subtlety?


The packet scheduler and classifier can redirect packets to different
devices, and can the netfilter layer.

The setting of skb->dev is wholly transient and you cannot rely upon
it to be the same as when you set it on allocation.

Even simple things like the bonding device change skb->dev on every
receive.


Thank you, this is easily fixed.


I think you need to study the networking stack a little more before
you continue to play in this delicate area :-)


The VM deadlock is also delicate.  Perhaps we can work together.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-08 Thread Daniel Phillips

David Miller wrote:

From: Daniel Phillips <[EMAIL PROTECTED]>

David Miller wrote:

I think the new atomic operation that will seemingly occur on every
device SKB free is unacceptable.


Alternate suggestion?


Sorry, I have none.  But you're unlikely to get your changes
considered seriously unless you can avoid any new overhead your patch
has which is of this level.


We just skip anything new unless the socket is actively carrying block
IO traffic, in which case we pay a minuscule price to avoid severe
performance artifacts or in the worst case, deadlock.  So in this design
the new atomic operation does not occur on every device SKB free.

All atomic ops sit behind the cheap test:

   (dev->flags & IFF_MEMALLOC)

or if any have escaped, that is just an oversight.  Peter?


We're busy trying to make these data structures smaller, and eliminate
atomic operations, as much as possible.  Therefore anything which adds
new datastructure elements and new atomic operations will be met with
fierce resistence unless it results an equal or greater shrink of
datastructures elsewhere or removes atomic operations elsewhere in
the critical path.


Right now we have a problem because our network stack cannot support
block IO reliably.  Without that, Linux is no enterprise storage
platform.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-08 Thread Daniel Phillips

Hi Dave,

David Miller wrote:

I think the new atomic operation that will seemingly occur on every
device SKB free is unacceptable.


Alternate suggestion?


You also cannot modify netdev->flags in the lockless manner in which
you do, it must be done with the appropriate locking, such as holding
the RTNL semaphore.


Thanks for the catch.

Regards,

Daniel





Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-08 Thread Daniel Phillips

Thomas Graf wrote:
 > skb->dev is not guaranteed to still point to the "allocating" device

once the skb is freed again so reserve/unreserve isn't symmetric.
You'd need skb->alloc_dev or something.


Can you please characterize the conditions under which skb->dev changes
after the alloc?  Are there writings on this subtlety?

Regards,

Daniel



Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-08 Thread Daniel Phillips

Stephen Hemminger wrote:

How much of this is just building special case support for large allocations
for jumbo frames? Wouldn't it make more sense to just fix those drivers to
do scatter and add the support hooks for that?


Short answer: none of it is.  If it happens to handle jumbo frames nicely
that is mainly a lucky accident, and we do need to check that they actually
work.

Minor rant: the whole skb_alloc family has degenerated into an unholy
mess and could use some rethinking.  I believe the current patch gets as
far as three _'s at the beginning of a function; this shows it is high
time to reroll the API.

Regards,

Daniel


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-08 Thread Daniel Phillips

Indan Zupancic wrote:

Hello,
Saw the patch on lkml, and wondered about some things.
On Tue, August 8, 2006 21:33, Peter Zijlstra said:


+static inline void dev_unreserve_skb(struct net_device *dev)
+{
+   if (atomic_dec_return(&dev->rx_reserve_used) < 0)
+   atomic_inc(&dev->rx_reserve_used);
+}
+


This looks strange. Is it possible for rx_reserve_used to become negative?
A quick look at the code says no, except in the one case where there isn't a
"if (unlikely(dev_reserve_used(skb->dev)))" check:


Yes, you can see what I'm trying to do there; I was short an atomic op to
do it.  My ugly solution may well be... probably is racy.  Let's rewrite
it with something better.  We want the atomic op that some people call
"monus": decrement unless zero.


@@ -389,6 +480,8 @@ void __kfree_skb(struct sk_buff *skb)
#endif

kfree_skbmem(skb);
+   if (dev && (dev->flags & IFF_MEMALLOC))
+   dev_unreserve_skb(dev);
}


So it seems that that < 0 check in dev_unreserve_skb() was only added to handle
this case (though there seems to be a race between those two atomic ops).
Isn't it better to remove that check and just do:

if (dev && (dev->flags & IFF_MEMALLOC) && dev_reserve_used(dev))


Seems to me that unreserve is also called from each protocol handler.
Agreed, the < 0 check is fairly sickening.


The use of atomic seems a bit dubious. Either it's necessary, in which case
changes depending on tests seem unsafe as they're not atomic and something
crucial could change between the read and the check, or normal reads and writes
combined with barriers would be sufficient. All in all it seems better to
move that "if (unlikely(dev_reserve_used(skb->dev)))" check into
dev_unreserve_skb(), and make the whole atomic if necessary. Then let
dev_unreserve_skb() return whether rx_reserve_used was positive.
Getting rid of dev_reserve_used() and using the atomic_read directly might be
better, as it is set with the bare atomic instructions too and rarely used
without dev_unreserve_skb().


Barriers should work for this reserve accounting, but is that better than
an atomic op?  I don't know, let's let the barrier mavens opine.

IMHO the cleanest thing to do is code up "monus", in fact I dimly recall
somebody already added something similar.

Side note: please don't be shy, just reply-all in future so the discussion
stays public.  Part of what we do is try to share our development process
so people see not only what we have done, but why we did it.  (And sometimes
so they can see what dumb mistakes we make, but I won't get into that...)

I beg forgiveness in advance for having taken the liberty of CCing this
reply to lkml.

Regards,

Daniel


Re: [RFC/PATCH 1/2] in-kernel sockets API

2006-06-14 Thread Daniel Phillips

Hi Harald,

You wrote:

On Tue, Jun 13, 2006 at 02:12:41PM -0700, I wrote:


This has the makings of a nice stable internal kernel api.  Why do we want
to provide this nice stable internal api to proprietary modules?


because there is IMHO legally nothing we can do about it anyway.


Speaking as a former member of a "grey market" binary module vendor that
came in from the cold I can assure you that the distinction between EXPORT
and EXPORT_GPL _is_ meaningful.  That tainted flag makes it extremely
difficult to do deals with mainstream Linux companies and there is always
the fear that it will turn into a legal problem.  The latter bit tends to
make venture capitalists nervous.

That said, the EXPORT_GPL issue is not about black and white legal issues,
it is about gentle encouragement.  In this case we are offering a clumsy,
on-the-metal, guaranteed-to-change-and-make-you-edit-code interface to
non-GPL-compatible modules and a decent, stable (in the deserves to live
sense) interface for the pure of heart.  Gentle encouragement at exactly
the right level.

Did we settle the question of whether these particular exports should be
EXPORT_SYMBOL_GPL?

Regards,

Daniel


Re: [RFC/PATCH 1/2] in-kernel sockets API

2006-06-13 Thread Daniel Phillips

Chase Venters wrote:

can you name some non-GPL non-proprietary modules we should be concerned
about?


You probably meant "non-GPL-compatible non-proprietary".  If so, then by
definition there are none.

Regards,

Daniel


Re: [RFC/PATCH 1/2] in-kernel sockets API

2006-06-13 Thread Daniel Phillips

Brian F. G. Bidulock wrote:

Stephen,

On Tue, 13 Jun 2006, Stephen Hemminger wrote:



@@ -2176,3 +2279,13 @@ EXPORT_SYMBOL(sock_wake_async);
EXPORT_SYMBOL(sockfd_lookup);
EXPORT_SYMBOL(kernel_sendmsg);
EXPORT_SYMBOL(kernel_recvmsg);
+EXPORT_SYMBOL(kernel_bind);
+EXPORT_SYMBOL(kernel_listen);
+EXPORT_SYMBOL(kernel_accept);
+EXPORT_SYMBOL(kernel_connect);
+EXPORT_SYMBOL(kernel_getsockname);
+EXPORT_SYMBOL(kernel_getpeername);
+EXPORT_SYMBOL(kernel_getsockopt);
+EXPORT_SYMBOL(kernel_setsockopt);
+EXPORT_SYMBOL(kernel_sendpage);
+EXPORT_SYMBOL(kernel_ioctl);


Don't we want to restrict this to GPL code with EXPORT_SYMBOL_GPL?



These are direct derivatives of the BSD/POSIX system call
interface.  The protocol function pointers within the socket
structure are not GPL only.  Why make this wrappered access to
them GPL only?  It will only encourage the reverse of what they
were intended to do: be used instead of the protocol function
pointers within the socket structure, that currently carry no
such restriction.


This has the makings of a nice stable internal kernel api.  Why do we want
to provide this nice stable internal api to proprietary modules?

Regards,

Daniel


Re: [stable] [NET] Fix zero-size datagram reception

2005-11-08 Thread Daniel Phillips
On Tuesday 08 November 2005 10:13, Greg KH wrote:
> On Thu, Nov 03, 2005 at 07:55:38AM +1100, Herbert Xu wrote:
> > The recent rewrite of skb_copy_datagram_iovec broke the reception of
> > zero-size datagrams.  This patch fixes it.
> >
> > Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>
> >
> > Please apply it to both 2.6.15 and 2.6.14.
>
> queued to -stable.

Blush.  Is it worth putting the initialization after the exit on zero?

int i, err, fraglen, end = 0;
struct sk_buff *next;

if (!len)
return 0;

next = skb_shinfo(skb)->frag_list;

Regards,

Daniel


[TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive (really revised)

2005-08-25 Thread Daniel Phillips
Gah, this time the revised patch is included, not just the diffstat.

 datagram.c |   82 +++--
 1 files changed, 26 insertions(+), 56 deletions(-)

diff -up --recursive 2.6.12.3.clean/net/core/datagram.c 
2.6.12.3/net/core/datagram.c
--- 2.6.12.3.clean/net/core/datagram.c  2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/net/core/datagram.c2005-08-25 03:57:32.0 -0400
@@ -211,74 +211,44 @@ void skb_free_datagram(struct sock *sk, 
 int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
struct iovec *to, int len)
 {
-   int start = skb_headlen(skb);
-   int i, copy = start - offset;
-
-   /* Copy header. */
-   if (copy > 0) {
-   if (copy > len)
-   copy = len;
-   if (memcpy_toiovec(to, skb->data + offset, copy))
-   goto fault;
-   if ((len -= copy) == 0)
-   return 0;
-   offset += copy;
-   }
+   int i, err, fraglen, end = 0;
+   struct sk_buff *next = skb_shinfo(skb)->frag_list;
+next_skb:
+   fraglen = skb_headlen(skb);
+   i = -1;
 
-   /* Copy paged appendix. Hmm... why does this look so complicated? */
-   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-   int end;
-
-   BUG_TRAP(start <= offset + len);
+   while (1) {
+   int start = end;
 
-   end = start + skb_shinfo(skb)->frags[i].size;
-   if ((copy = end - offset) > 0) {
-   int err;
-   u8  *vaddr;
-   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-   struct page *page = frag->page;
+   if ((end += fraglen) > offset) {
+   int copy = end - offset, o = offset - start;
 
if (copy > len)
copy = len;
-   vaddr = kmap(page);
-   err = memcpy_toiovec(to, vaddr + frag->page_offset +
-offset - start, copy);
-   kunmap(page);
+   if (i == -1)
+   err = memcpy_toiovec(to, skb->data + o, copy);
+   else {
+   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   struct page *page = frag->page;
+   void *p = kmap(page) + frag->page_offset + o;
+   err = memcpy_toiovec(to, p, copy);
+   kunmap(page);
+   }
if (err)
goto fault;
if (!(len -= copy))
return 0;
offset += copy;
}
-   start = end;
-   }
-
-   if (skb_shinfo(skb)->frag_list) {
-   struct sk_buff *list = skb_shinfo(skb)->frag_list;
-
-   for (; list; list = list->next) {
-   int end;
-
-   BUG_TRAP(start <= offset + len);
-
-   end = start + list->len;
-   if ((copy = end - offset) > 0) {
-   if (copy > len)
-   copy = len;
-   if (skb_copy_datagram_iovec(list,
-   offset - start,
-   to, copy))
-   goto fault;
-   if ((len -= copy) == 0)
-   return 0;
-   offset += copy;
-   }
-   start = end;
-   }
+   if (++i >= skb_shinfo(skb)->nr_frags)
+   break;
+   fraglen = skb_shinfo(skb)->frags[i].size;
+   }
+   if (next) {
+   skb = next;
+   next = skb->next;
+   goto next_skb;
}
-   if (!len)
-   return 0;
-
 fault:
return -EFAULT;
 }


Re: [TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive

2005-08-25 Thread Daniel Phillips
On Thursday 25 August 2005 03:30, David S. Miller wrote:
> From: Daniel Phillips <[EMAIL PROTECTED]>
> > As far as I can see, it is illegal for any but the first skb to have
> > a non-null skb_shinfo(skb)->frag_list, is this correct?
>
> As currently used, yes.

That's a relief.  I updated the patch to express exactly those semantics.  Now 
that the patch is (probably) correct I can just drop it if you like.  After 
all, it is just a cleanup, and what use is that?

Regards,

Daniel



[TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive (revised)

2005-08-25 Thread Daniel Phillips
The fragment list handling was wrong in the previous version; now correct, I
think.

datagram.c |   82 +++--
1 files changed, 26 insertions(+), 56 deletions(-)

diff -up --recursive 2.6.12.3.clean/net/core/datagram.c 
2.6.12.3/net/core/datagram.c
--- 2.6.12.3.clean/net/core/datagram.c  2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/net/core/datagram.c2005-08-25 01:39:56.0 -0400
@@ -211,74 +211,40 @@ void skb_free_datagram(struct sock *sk, 
 int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
struct iovec *to, int len)
 {
-   int start = skb_headlen(skb);
-   int i, copy = start - offset;
-
-   /* Copy header. */
-   if (copy > 0) {
-   if (copy > len)
-   copy = len;
-   if (memcpy_toiovec(to, skb->data + offset, copy))
-   goto fault;
-   if ((len -= copy) == 0)
-   return 0;
-   offset += copy;
-   }
+   int i, err, fraglen, end = 0;
+next_skb:
+   fraglen = skb_headlen(skb);
+   i = -1;
 
-   /* Copy paged appendix. Hmm... why does this look so complicated? */
-   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-   int end;
-
-   BUG_TRAP(start <= offset + len);
+   while (1) {
+   int start = end;
 
-   end = start + skb_shinfo(skb)->frags[i].size;
-   if ((copy = end - offset) > 0) {
-   int err;
-   u8  *vaddr;
-   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-   struct page *page = frag->page;
+   if ((end += fraglen) > offset) {
+   int copy = end - offset, o = offset - start;
 
if (copy > len)
copy = len;
-   vaddr = kmap(page);
-   err = memcpy_toiovec(to, vaddr + frag->page_offset +
-offset - start, copy);
-   kunmap(page);
+   if (i == -1)
+   err = memcpy_toiovec(to, skb->data + o, copy);
+   else {
+   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   struct page *page = frag->page;
+   void *p = kmap(page) + frag->page_offset + o;
+   err = memcpy_toiovec(to, p, copy);
+   kunmap(page);
+   }
if (err)
goto fault;
if (!(len -= copy))
return 0;
offset += copy;
}
-   start = end;
-   }
-
-   if (skb_shinfo(skb)->frag_list) {
-   struct sk_buff *list = skb_shinfo(skb)->frag_list;
-
-   for (; list; list = list->next) {
-   int end;
-
-   BUG_TRAP(start <= offset + len);
-
-   end = start + list->len;
-   if ((copy = end - offset) > 0) {
-   if (copy > len)
-   copy = len;
-   if (skb_copy_datagram_iovec(list,
-   offset - start,
-   to, copy))
-   goto fault;
-   if ((len -= copy) == 0)
-   return 0;
-   offset += copy;
-   }
-   start = end;
-   }
+   if (++i >= skb_shinfo(skb)->nr_frags)
+   break;
+   fraglen = skb_shinfo(skb)->frags[i].size;
}
-   if (!len)
-   return 0;
-
+   if ((skb = skb->next))
+   goto next_skb;
 fault:
return -EFAULT;
 }


Re: [TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive

2005-08-25 Thread Daniel Phillips
On Thursday 25 August 2005 02:44, David S. Miller wrote:
> Frag lists cannot be deeper than one level of nesting,
> and I think the recursive version is easier to understand,
> so I really don't see the value of your change.

Losing 34 lines of a 74 line function is the value.

The real problem with my version is that it isn't right: it walks the ->next link
from the original skb, whereas the original version gets the list from 
skb_shinfo(skb)->frag_list and walks the ->next links of that.  This 
structure actually does allow for lists of lists, but it does not look like 
this was intended.

As far as I can see, it is illegal for any but the first skb to have a 
non-null skb_shinfo(skb)->frag_list, is this correct?

Regards,

Daniel


[TESTME][PATCH] Make skb_copy_datagram_iovec nonrecursive

2005-08-24 Thread Daniel Phillips
Hi,

I noticed that skb_copy_datagram_iovec calls itself recursively to copy a
fragment list.  This isn't actually wrong or even inefficient, it is just
somehow disturbing.  Oh, and it uses an extra stack frame, and is hard to
read.  Once I got started straightening that out, I couldn't resist cleaning
it up the rest of the way.

  * Recursion changed to nested loop
  * Three copy cases folded into one
  * Miscellaneous funky stuff simplified
  * Efficiency should be about the same
  * Comments gone, it's all perfectly obvious now ;-)

I booted it, my nics still work, that's as far as I've tested it.  Please try 
it.

There are a couple of cosmetic things to do, like BUG on impossible offset.  Why
does the current code just printk this and not BUG?

datagram.c |   78 +
1 files changed, 22 insertions(+), 56 deletions(-)

diff -up --recursive 2.6.12.3.clean/net/core/datagram.c 
2.6.12.3/net/core/datagram.c
--- 2.6.12.3.clean/net/core/datagram.c  2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/net/core/datagram.c2005-08-25 01:39:56.0 -0400
@@ -211,74 +211,40 @@ void skb_free_datagram(struct sock *sk, 
 int skb_copy_datagram_iovec(const struct sk_buff *skb, int offset,
struct iovec *to, int len)
 {
-   int start = skb_headlen(skb);
-   int i, copy = start - offset;
-
-   /* Copy header. */
-   if (copy > 0) {
-   if (copy > len)
-   copy = len;
-   if (memcpy_toiovec(to, skb->data + offset, copy))
-   goto fault;
-   if ((len -= copy) == 0)
-   return 0;
-   offset += copy;
-   }
+   int i, err, fraglen, end = 0;
+next_skb:
+   fraglen = skb_headlen(skb);
+   i = -1;
 
-   /* Copy paged appendix. Hmm... why does this look so complicated? */
-   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-   int end;
-
-   BUG_TRAP(start <= offset + len);
+   while (1) {
+   int start = end;
 
-   end = start + skb_shinfo(skb)->frags[i].size;
-   if ((copy = end - offset) > 0) {
-   int err;
-   u8  *vaddr;
-   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-   struct page *page = frag->page;
+   if ((end += fraglen) > offset) {
+   int copy = end - offset, o = offset - start;
 
if (copy > len)
copy = len;
-   vaddr = kmap(page);
-   err = memcpy_toiovec(to, vaddr + frag->page_offset +
-offset - start, copy);
-   kunmap(page);
+   if (i == -1)
+   err = memcpy_toiovec(to, skb->data + o, copy);
+   else {
+   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   struct page *page = frag->page;
+   void *p = kmap(page) + frag->page_offset + o;
+   err = memcpy_toiovec(to, p, copy);
+   kunmap(page);
+   }
if (err)
goto fault;
if (!(len -= copy))
return 0;
offset += copy;
}
-   start = end;
-   }
-
-   if (skb_shinfo(skb)->frag_list) {
-   struct sk_buff *list = skb_shinfo(skb)->frag_list;
-
-   for (; list; list = list->next) {
-   int end;
-
-   BUG_TRAP(start <= offset + len);
-
-   end = start + list->len;
-   if ((copy = end - offset) > 0) {
-   if (copy > len)
-   copy = len;
-   if (skb_copy_datagram_iovec(list,
-   offset - start,
-   to, copy))
-   goto fault;
-   if ((len -= copy) == 0)
-   return 0;
-   offset += copy;
-   }
-   start = end;
-   }
+   if (++i >= skb_shinfo(skb)->nr_frags)
+   break;
+   fraglen = skb_shinfo(skb)->frags[i].size;
}
-   if (!len)
-   return 0;
-
+   if ((skb = skb->next))
+   goto next_skb;
 fault:
return -EFAULT;
 }

Re: Bridge MTU

2005-08-17 Thread Daniel Phillips
On Thursday 18 August 2005 07:02, John Heffner wrote:
> 1) Bridging occurs at the link layer (ethernet); fragmenting occurs at the
> network layer (IP).
>
> 2) A lot of protocols set the Don't Fragment bit, so you can't always
> fragment anyway.
>
> What you might want to do is set it up as an IP router rather than a
> bridge. That way, it will be able to fragment, or send ICMP responses when
> it can't.

Gaahh.  Thanks for filling in that gaping comprehension hole.  In retrospect 
it is perfectly obvious that routing is what I need, and that bridging just 
comes close enough to confuse me ;-)

Regards,

Daniel


Re: [RFC] Net vm deadlock fix, version 6

2005-08-11 Thread Daniel Phillips
Ahem:

+   }
adjust_memalloc_reserve(-netdev->memalloc_pages);
-   }

Regards,

Daniel


[RFC] Net vm deadlock fix, version 6

2005-08-11 Thread Daniel Phillips
Hi,

This version corrects a couple of bugs previously noted and ties up some loose
ends in the e1000 driver.  Some versions of this driver support packet
splitting into multiple pages, with just the protocol header in the skb
itself.  This is a very good thing because it avoids the high order page
fragmentation problem.  Though this is something that probably needs to be
pushed down into generic skb allocation, for now the driver handles it
explicitly with the help of a new memalloc_page function that just gets a page
from the memalloc reserve if normal allocation fails.  This does not need
separate reserve accounting because such pages are allocated per-skb.
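
The function itself is not visible in the hunks below, so here is a
sketch of its shape as described:

        /* fall back to the memalloc reserve only when a normal allocation
         * fails; per-skb pages need no separate reserve accounting */
        static inline struct page *memalloc_page(int gfp_mask)
        {
                struct page *page = alloc_page(gfp_mask);
                if (!page)
                        page = alloc_page(gfp_mask | __GFP_MEMALLOC);
                return page;
        }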

While I was in there, I could not resist cleaning up some non-orthogonality in
the 64K overlap handling (e1000 people, please check my work).  The result is
that with the new memalloc handling, the source stayed the same size.  Code
size is another question.  I have added a number of new inlines.  I suspect
that inlining the skb allocation functions in general doesn't buy anything, 
but this needs to be checked.  Anyway, we now have a pretty good picture of 
the full per-driver damage of closing this hole, and it is not much.

I still haven't looked at the various hooks in the packet delivery path, but 
it's coming up pretty soon.  There are other wrinkles too, like the fact that 
there can actually be many block devices mapped over the same network 
interface.  I have to ponder what is best to do so they can't wedge each
other, and so that the SOCK_MEMALLOC bit doesn't get cleared prematurely.

Anyway, progress marches on.

diff -up --recursive 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 
2.6.12.3/drivers/net/e1000/e1000_main.c
--- 2.6.12.3.clean/drivers/net/e1000/e1000_main.c   2005-07-15 
17:18:57.0 -0400
+++ 2.6.12.3/drivers/net/e1000/e1000_main.c 2005-08-11 17:42:12.0 
-0400
@@ -309,6 +309,16 @@ e1000_up(struct e1000_adapter *adapter)
e1000_phy_reset(&adapter->hw);
}
 
+   netdev->memalloc_pages = estimate_skb_pages(netdev->rx_reserve,
+   adapter->rx_buffer_len + NET_IP_ALIGN);
+   if (adapter->rx_ps)
+   netdev->memalloc_pages += PS_PAGE_BUFFERS * netdev->rx_reserve;
+   if ((err = adjust_memalloc_reserve(netdev->memalloc_pages))) {
+   DPRINTK(PROBE, ERR,
+   "Unable to allocate rx reserve Error: %d\n", err);
+   return err;
+   }
+
e1000_set_multi(netdev);
 
e1000_restore_vlan(adapter);
@@ -386,6 +396,7 @@ e1000_down(struct e1000_adapter *adapter
mii_reg |= MII_CR_POWER_DOWN;
e1000_write_phy_reg(&adapter->hw, PHY_CTRL, mii_reg);
mdelay(1);
+   adjust_memalloc_reserve(-netdev->memalloc_pages);
}
 }
 
@@ -3116,34 +3127,29 @@ e1000_alloc_rx_buffers(struct e1000_adap
buffer_info = &rx_ring->buffer_info[i];
 
while(!buffer_info->skb) {
-   skb = dev_alloc_skb(bufsz);
+   skb = dev_memalloc_skb(netdev, bufsz);
 
-   if(unlikely(!skb)) {
+   if(unlikely(!skb))
/* Better luck next round */
break;
-   }
 
/* Fix for errata 23, can't cross 64kB boundary */
if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
struct sk_buff *oldskb = skb;
DPRINTK(RX_ERR, ERR, "skb align check failed: %u bytes "
 "at %p\n", bufsz, skb->data);
-   /* Try again, without freeing the previous */
-   skb = dev_alloc_skb(bufsz);
+   /* Try again, then free previous */
+   skb = dev_memalloc_skb(netdev, bufsz);
+   dev_memfree_skb(oldskb);
+
/* Failed allocation, critical failure */
-   if (!skb) {
-   dev_kfree_skb(oldskb);
+   if (!skb)
break;
-   }
 
+   /* give up */
if (!e1000_check_64k_bound(adapter, skb->data, bufsz)) {
-   /* give up */
-   dev_kfree_skb(skb);
-   dev_kfree_skb(oldskb);
+   dev_memfree_skb(skb);
break; /* while !buffer_info->skb */
-   } else {
-   /* Use new allocation */
-   dev_kfree_skb(oldskb);
}
}
/* Make buffer alignment 2 beyond a 16 byte boundary
@@ -3152,8 +3158,6 @@ e1000_alloc_rx_buffers(struct e1000_adap
 */
skb_reserve(skb, NET_IP_ALIGN);
 
-   skb->dev = netdev;
-
buf

Re: [RFC] Net vm deadlock fix, version 5

2005-08-08 Thread Daniel Phillips
Hi,

A couple of goofs.  First, the sysctl interface to min_free_kbytes could stomp
on any in-kernel adjustments.  Now there are two variables, summed in
setup_per_zone_pages_min: min_free_kbytes and var_free_kbytes.  The
adjust_memalloc_reserve operates only on the latter, so the user can freely
twiddle the base reserve without interfering with in-kernel adjustments.

Second, the skb resource estimation code was grossly wrong where it attempted
to calculate ceiling(log2(size)), now fixed and tested this time.

To save bandwidth, just the relevant parts of the patch included below.

Regards,

Daniel

diff -up --recursive 2.6.12.3.clean/mm/page_alloc.c 2.6.12.3/mm/page_alloc.c
--- 2.6.12.3.clean/mm/page_alloc.c  2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/mm/page_alloc.c2005-08-08 21:20:15.0 -0400
@@ -73,6 +73,7 @@ EXPORT_SYMBOL(zone_table);
 
 static char *zone_names[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };
 int min_free_kbytes = 1024;
+int var_free_kbytes;
 
 unsigned long __initdata nr_kernel_pages;
 unsigned long __initdata nr_all_pages;
@@ -2029,7 +2030,8 @@ static void setup_per_zone_lowmem_reserv
  */
 static void setup_per_zone_pages_min(void)
 {
-   unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
+   unsigned pages_min = (min_free_kbytes + var_free_kbytes)
+   >> (PAGE_SHIFT - 10);
unsigned long lowmem_pages = 0;
struct zone *zone;
unsigned long flags;
@@ -2075,6 +2077,18 @@ static void setup_per_zone_pages_min(voi
}
 }
 
+int adjust_memalloc_reserve(int pages)
+{
+   int kbytes = var_free_kbytes + (pages << (PAGE_SHIFT - 10));
+   if (kbytes < 0)
+   return -EINVAL;
+   var_free_kbytes = kbytes;
+   setup_per_zone_pages_min();
+   return 0;
+}
+
+EXPORT_SYMBOL_GPL(adjust_memalloc_reserve);
+
 /*
  * Initialise min_free_kbytes.
  *
diff -up --recursive 2.6.12.3.clean/net/core/skbuff.c 2.6.12.3/net/core/skbuff.c
--- 2.6.12.3.clean/net/core/skbuff.c2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/net/core/skbuff.c  2005-08-08 23:07:12.0 -0400
@@ -167,6 +168,15 @@ nodata:
goto out;
 }
 
+#define ceiling_log2(x) fls(x - 1)
+unsigned estimate_skb_pages(unsigned num, unsigned size)
+{
+   int slab_pages = kmem_estimate_pages(skbuff_head_cache, num);
+   int data_space = num * (1 << ceiling_log2(size + 16));
+   int data_pages = (data_space + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+   return slab_pages + data_pages;
+}
+
 /**
  * alloc_skb_from_cache-   allocate a network buffer
  * @cp: kmem_cache from which to allocate the data area


[RFC] Net vm deadlock fix, version 5

2005-08-08 Thread Daniel Phillips
Hi,

This version introduces the idea of having a network driver adjust the global 
memalloc reserve when it brings an interface up or down.  The interface is:

int adjust_memalloc_reserve(int pages)

which is just a thin shell over the min_free_kbytes interface that already 
exists.

The driver estimates the maximum number of pages it could ever need from the 
reserve and saves it in the net_device struct, so that exactly the same 
number of pages will be added to the reserve on bring-up and removed from it 
on bring-down.

The tricky part is figuring out how many pages could ever be needed.  I wrote
some helper functions to estimate how many pages a given number of skbs
would use.  These functions are deficient in a number of ways, but today the
point isn't to be exactly correct, but to give a sense of what the complete
solution will look like.

I am still continuously comparing the memalloc way of doing things to the 
mempool way.  Today I had a look at the number of different places where 
mempools would be needed on the pre-protocol packet delivery path.  It is 
scary.

With the memalloc way, the interfaces are fairly lightweight, but the 
accounting isn't as precise.  It doesn't need to be precise either, we really 
just have to be sure that the reservation is more than enough to handle a 
single block IO transaction.  But actually, the code tries to arrange for 
about 50 in flight at once, which is a very big safety factor.

Adjusting the global memalloc reserve accurately is more of a concern.  If the 
adjustment is wildly low, then a lowly network driver could end up eating the 
whole global reserve and wedge the entire system.  So my ongoing effort will 
be directed at accounting the reserve limits in pages, rather than skbs as 
currently.  This gets tricky, because skbs are not allocated for pages, they 
are allocated from slab caches.  Maybe skbs really should be allocated from 
pages, it would not be much different from the way block IO works.  Or maybe 
I just need to roll up my sleeves and create an interface where I can find 
out exactly how many additional pages my skb allocation actually consumed.

Anyway, I am now at least making an attempt to account for global memalloc 
pool usage sanely, whereas traditionally we have just made the memalloc 
reserve "big" and hoped for the best.  (At this point, somebody stands up and 
says "use mempool" and I say, you fix the patch to use mempool.  It's messier
than you think.)

The e1000 driver currently does not implement the deadlock-avoidance path if 
packet splitting is enabled, because other bits of memory start showing up in 
the packet path and the accounting gets more complex than I was prepared to 
tackle today.  Maybe tomorrow.

Regards,

Daniel

diff -up --recursive 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 
2.6.12.3/drivers/net/e1000/e1000_main.c
--- 2.6.12.3.clean/drivers/net/e1000/e1000_main.c   2005-07-15 
17:18:57.0 -0400
+++ 2.6.12.3/drivers/net/e1000/e1000_main.c 2005-08-08 04:26:29.0 
-0400
@@ -309,6 +309,14 @@ e1000_up(struct e1000_adapter *adapter)
e1000_phy_reset(&adapter->hw);
}
 
+   netdev->memalloc_pages = estimate_skb_pages(netdev->rx_reserve,
+   adapter->rx_buffer_len + NET_IP_ALIGN);
+   if ((err = adjust_memalloc_reserve(netdev->memalloc_pages))) {
+   DPRINTK(PROBE, ERR,
+   "Unable to allocate rx reserve Error: %d\n", err);
+   return err;
+   }
+
e1000_set_multi(netdev);
 
e1000_restore_vlan(adapter);
@@ -386,6 +394,7 @@ e1000_down(struct e1000_adapter *adapter
mii_reg |= MII_CR_POWER_DOWN;
e1000_write_phy_reg(&adapter->hw, PHY_CTRL, mii_reg);
mdelay(1);
+   adjust_memalloc_reserve(-netdev->memalloc_pages);
}
 }
 
@@ -3242,7 +3251,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
cpu_to_le64(ps_page_dma->ps_page_dma[j]);
}
 
-   skb = dev_alloc_skb(adapter->rx_ps_bsize0 + NET_IP_ALIGN);
+   skb = dev_memalloc_skb(netdev, adapter->rx_ps_bsize0 + 
NET_IP_ALIGN);
 
if(unlikely(!skb))
break;
@@ -3253,8 +3262,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
 */
skb_reserve(skb, NET_IP_ALIGN);
 
-   skb->dev = netdev;
-
buffer_info->skb = skb;
buffer_info->length = adapter->rx_ps_bsize0;
buffer_info->dma = pci_map_single(pdev, skb->data,
diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h 
2.6.12.3/include/linux/gfp.h
--- 2.6.12.3.clean/include/linux/gfp.h  2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/include/linux/gfp.h2005-08-05 21:53:09.0 -0400
@@ -39,6 +39,7 @@ struct vm_area_struct;
 #define __GFP_COMP 0x4000u /* Add compound page metadata */
 #define __GFP_ZERO 0x8000u /* Return zeroed page 

[RFC] Net vm deadlock fix, version 4

2005-08-06 Thread Daniel Phillips
Hi,

This patch fills in some missing pieces:

   * Support v4 udp: same as v4 tcp, when in reserve, drop packets on
 noncritical sockets

   * Support v4 icmp: when in reserve, drop icmp traffic

   * Add reserve skb support to e1000 driver

   * API for dropping packets before delivery (dev_drop_skb)

   * Atomic_t for reserve accounting

Now ready for proof-of-concept testing.  High level API boilerplate will come
later.

Regards,

Daniel

diff -up --recursive 2.6.12.3.clean/drivers/net/e1000/e1000_main.c 
2.6.12.3/drivers/net/e1000/e1000_main.c
--- 2.6.12.3.clean/drivers/net/e1000/e1000_main.c   2005-07-15 
17:18:57.0 -0400
+++ 2.6.12.3/drivers/net/e1000/e1000_main.c 2005-08-06 16:46:13.0 
-0400
@@ -3242,7 +3242,7 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
cpu_to_le64(ps_page_dma->ps_page_dma[j]);
}
 
-   skb = dev_alloc_skb(adapter->rx_ps_bsize0 + NET_IP_ALIGN);
+   skb = dev_memalloc_skb(netdev, adapter->rx_ps_bsize0 + 
NET_IP_ALIGN);
 
if(unlikely(!skb))
break;
@@ -3253,8 +3253,6 @@ e1000_alloc_rx_buffers_ps(struct e1000_a
 */
skb_reserve(skb, NET_IP_ALIGN);
 
-   skb->dev = netdev;
-
buffer_info->skb = skb;
buffer_info->length = adapter->rx_ps_bsize0;
buffer_info->dma = pci_map_single(pdev, skb->data,
diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h 
2.6.12.3/include/linux/gfp.h
--- 2.6.12.3.clean/include/linux/gfp.h  2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/include/linux/gfp.h2005-08-05 21:53:09.0 -0400
@@ -39,6 +39,7 @@ struct vm_area_struct;
 #define __GFP_COMP 0x4000u /* Add compound page metadata */
 #define __GFP_ZERO 0x8000u /* Return zeroed page on success */
 #define __GFP_NOMEMALLOC 0x1u /* Don't use emergency reserves */
+#define __GFP_MEMALLOC  0x2u /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
diff -up --recursive 2.6.12.3.clean/include/linux/netdevice.h 
2.6.12.3/include/linux/netdevice.h
--- 2.6.12.3.clean/include/linux/netdevice.h2005-07-15 17:18:57.0 
-0400
+++ 2.6.12.3/include/linux/netdevice.h  2005-08-06 16:37:14.0 -0400
@@ -371,6 +371,8 @@ struct net_device
struct Qdisc*qdisc_ingress;
struct list_headqdisc_list;
unsigned long   tx_queue_len;   /* Max frames per queue allowed 
*/
+   int rx_reserve;
+   atomic_trx_reserve_used;
 
/* ingress path synchronizer */
spinlock_t  ingress_lock;
@@ -662,6 +664,49 @@ static inline void dev_kfree_skb_any(str
dev_kfree_skb(skb);
 }
 
+/*
+ * Support for critical network IO under low memory conditions
+ */
+static inline int dev_reserve_used(struct net_device *dev)
+{
+   return atomic_read(&dev->rx_reserve_used);
+}
+
+static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev,
+   unsigned length, int gfp_mask)
+{
+   struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask);
+   if (skb)
+   goto done;
+   if (dev_reserve_used(dev) >= dev->rx_reserve)
+   return NULL;
+   if (!(skb = __dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)))
+   return NULL;
+   atomic_inc(&dev->rx_reserve_used);
+done:
+   skb->dev = dev;
+   return skb;
+}
+
+static inline struct sk_buff *dev_memalloc_skb(struct net_device *dev,
+   unsigned length)
+{
+   return __dev_memalloc_skb(dev, length, GFP_ATOMIC);
+}
+
+static inline void dev_unreserve(struct net_device *dev)
+{
+   if (atomic_dec_return(&dev->rx_reserve_used) < 0)
+   atomic_inc(&dev->rx_reserve_used);
+}
+
+static inline void dev_drop_skb(struct sk_buff *skb)
+{
+   struct net_device *dev = skb->dev;
+   __kfree_skb(skb);
+   dev_unreserve(dev);
+}
+
 #define HAVE_NETIF_RX 1
 extern int netif_rx(struct sk_buff *skb);
 extern int netif_rx_ni(struct sk_buff *skb);
diff -up --recursive 2.6.12.3.clean/include/net/sock.h 
2.6.12.3/include/net/sock.h
--- 2.6.12.3.clean/include/net/sock.h   2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/include/net/sock.h 2005-08-05 21:53:09.0 -0400
@@ -382,6 +382,7 @@ enum sock_flags {
SOCK_NO_LARGESEND, /* whether to sent large segments or not */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+   SOCK_MEMALLOC, /* protocol can use memalloc reserve */
 };
 
 static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
@@ -399,6 +400,11 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int is_memalloc_sock(struct sock *

Re: kfree_skb questions

2005-08-06 Thread Daniel Phillips
On Sunday 07 August 2005 06:26, Patrick McHardy wrote:
> > Anyway, do we not want BUG_ON(!atomic_read(&skb->users)) at the beginning
> > of kfree_skb, since we rely on it?
>
> Why do you care if skb->users is 0 or 1 in __kfree_skb()?

Because I am a neatness freak and I like to check things that inattentive 
coders can easily get wrong.  But the question above is not about that, it is 
about checking for possible calls where skb->users is already zero and 
thereby catching the double free early instead of letting it slide further 
into the innards of the machine.

Regards,

Daniel


kfree_skb questions

2005-08-06 Thread Daniel Phillips
Hi,

The way I read this, __kfree_skb will sometimes be called with ->users = 1 and 
sometimes with ->users = 0, is that right?  

static inline void kfree_skb(struct sk_buff *skb)
{
if (likely(atomic_read(&skb->users) == 1))
smp_rmb();
else if (likely(!atomic_dec_and_test(&skb->users)))
return;
__kfree_skb(skb);
}

If so, then why not just:

static inline void kfree_skb(struct sk_buff *skb)
{
if (likely(atomic_read(&skb->users) == 1))
smp_rmb();
if (likely(!atomic_dec_and_test(&skb->users)))
return;
__kfree_skb(skb);
}

so __kfree_skb can BUG_ON(atomic_read(&skb->users))?  Perhaps this has 
something to do with the smp_rmb; could somebody please explain to me why it 
is necessary here, and for which architectures?

Anyway, do we not want BUG_ON(!atomic_read(&skb->users)) at the beginning of 
kfree_skb, since we rely on it?

Regards,

Daniel


Re: test

2005-08-06 Thread Daniel Phillips
On Saturday 06 August 2005 18:40, David S. Miller wrote:
> From: Daniel Phillips <[EMAIL PROTECTED]>
> Date: Sat, 6 Aug 2005 04:52:07 +1000
>
> > So then there is no choice but to throttle the per-cpu ->input_pkt
> > queues.
>
> Make the driver support NAPI if you want device fairness.

This will co-exist happily with NAPI.  It does not simply replicate NAPI 
functionality because the reserve accounting also tells the receive path when 
it should discriminate in favor of vm traffic.

Regards,

Daniel


Re: [RFC] Net vm deadlock fix (take two)

2005-08-06 Thread Daniel Phillips
On Sunday 07 August 2005 02:07, Jeff Garzik wrote:
> > +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev,
> > +   unsigned length, int gfp_mask)
> > +{
> > +   struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask);
> > +   if (skb)
> > +   goto done;
> > +   if (dev->rx_reserve_used >= dev->rx_reserve)
> > +   return NULL;
> > +   if (!__dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC))
> > +   return NULL;
> > +   dev->rx_reserve_used++;
>
> why bother with rx_reserve at all?  Why not just let the second
> allocation fail, without the rx_reserve_used test?

Because that would allow unbounded reserve use, either because of a leak or 
because of a legitimate backup in the softnet queues.  It is not worth it to 
run the risk of wedging the whole system just to save this check.  If we were 
using a mempool here, it would fail with a similar check.

> Additionally, I think the rx_reserve_used accounting is wrong, since I
> could simply free the skb

Good point, I should provide a kfree_skb variant that does the reserve 
accounting (dev_free_skb) in case some driver wants to do this.  Anyway, if 
somebody does free an skb in the delivery path without doing the accounting 
it is not a memory leak, but might cause a non-blockio packet to be 
unnecessarily dropped later.
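
For concreteness, a minimal sketch of such a variant, assuming the atomic 
rx_reserve_used counter from a later revision (dev_free_skb is only proposed 
above, it is not yet in the posted patch):

static inline void dev_free_skb(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;

	__kfree_skb(skb);
	/* give back this packet's reserve unit, never dropping below zero */
	if (atomic_dec_return(&dev->rx_reserve_used) < 0)
		atomic_inc(&dev->rx_reserve_used);
}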

> -- but doing so would cause a rx_reserve_used 
> leak in your code, since you only decrement the counter in the TCP IPv4
> path.

Reserve checks are needed not just on the IPv4 path but on every protocol path 
that is allowed to co-exist on the same wire as block IO.  I will add udp and 
sctp to the patch next.

If an unhandled protocol does get onto the wire, the consequences are not 
severe.  There is just a risk that the entire reserve may be consumed 
(another reason we need the limit check above) and we just fall back to the 
old unreliable block IO behavior.

Eventually this needs to be enforced automatically so that normal users don't 
have to worry about exactly what protocols they are running on an interface.  
Cluster users, meanwhile, will just take care to run only supported 
protocols; they can already benefit from this without fancy checking.

Regards,

Daniel


Re: [PATCH] netpoll can lock up on low memory.

2005-08-06 Thread Daniel Phillips
On Saturday 06 August 2005 12:32, Steven Rostedt wrote:
> > > If you need to really get the data out, then the design should be
> > > changed.  Have some return value showing the failure, check for
> > > oops_in_progress or whatever, and try again after turning interrupts
> > > back on, and getting to a point where the system can free up memory
> > > (write to swap, etc).  Just a busy loop without ever getting a skb is
> > > just bad.
> >
> > Why, pray tell, do you think there will be a second chance after
> > re-enabling interrupts? How does this work when we're panicking or
> > oopsing where we most care? How does this work when the netpoll client
> > is the kernel debugger and the machine is completely stopped because
> > we're tracing it?
>
> What I meant was to check for an oops and maybe then don't break out.
> Otherwise let the system try to reclaim memory. Since this is locked
> when the alloc_skb called with GFP_ATOMIC and fails.

You might want to take a look at my stupid little __GFP_MEMALLOC hack in the 
network block IO deadlock thread on netdev.  It will let you use the memalloc 
reserve from atomic context.  As long as you can be sure your usage will be 
bounded and you will eventually give it back, this should be ok.
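
The usage amounts to no more than this (a sketch; the length and the error 
handling are whatever your netpoll path needs):

	struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC | __GFP_MEMALLOC);
	if (!skb)
		return NULL;	/* even the reserve is exhausted; caller must cope */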

Regards,

Daniel


[RFC] Net vm deadlock fix (take two)

2005-08-06 Thread Daniel Phillips
Hi,

This version does not do blatantly stupid things in hardware irq context, is
more efficient, and... wow the patch is smaller!  (That never happens.)

I don't mark skbs as being allocated from reserve any more.  That works, but
it is slightly bogus, because it doesn't matter which skb came from reserve,
it only matters that we put one back.  So I just count them and don't mark
them.

The tricky issue that had to be dealt with is the possibility that a massive 
number of skbs could in theory be queued by the hardware interrupt before the 
softnet softirq gets around to delivering them to the protocol.  But we can 
only allocate a limited number of skbs from reserve memory.  If we run out of 
reserve memory we have no choice but to fail skb allocations, and that can 
cause packets to be dropped.  Since we don't know which packets are blockio 
packets and which are not at that point, we could be so unlucky as to always 
drop block IO packets and always let other junk through.  The other junk 
won't get very far though, because those packets will be dropped as soon as 
the protocol headers are decoded, which will reveal that they do not belong 
to a memalloc socket.  This short circuit ought to help take away the cpu 
load that caused the softirq constipation in the first place.

What is actually going to happen is that a few block IO packets might be 
randomly dropped under such conditions, degrading the transport efficiency.  
Block IO progress will continue, unless we manage to accidentally drop every 
block IO packet and our softirqs continue to stay comatose, probably 
indicating a scheduler bug.

OK, we want to allocate skbs from reserve, but we cannot go infinitely into 
reserve.  So we count how many packets a driver has allocated from reserve, 
in the net_device struct.  If this goes over some limit, the skb 
allocation fails and the device driver may drop a packet because of that.  
Note that the e1000 driver will just be trying to refill its rx-ring at this 
point, and it will try to refill it again as soon as the next packet arrives, 
so it is still some ways away from actually dropping a packet.  Other drivers 
may immediately drop a packet at this point, c'est la vie.  Remember, this 
can only happen if the softirqs are backed up a silly amount.

The thing is, we have got our block IO traffic moving, by virtue of dipping 
into the reserve, and most likely moving at near-optimal speed.  Normal 
memory-consuming tasks are not moving because they are blocked on vm IO.  The 
things that can mess us up are cpu hogs - a scheduler problem - and tons of 
unhelpful traffic sharing the network wire, which we are killing off early as 
mentioned above.

What happens when a packet arrives at the protocol handler is a little subtle.  
At this point, if the interface is into reserve, we can always decrement the 
reserve count, regardless of what type of packet it is.  If it is a block IO 
packet, the packet is still accounted for within the block driver's 
throttling.  We are sure that the packet's resources will be returned to the 
common pool in an organized way.  If it is some other random kind of packet, 
we drop it right away, also returning the resources to the common pool.  
Either way, it is not the responsibility of the interface to account for it 
any more.
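
In code terms, the delivery-time logic above comes down to something like 
this at the top of a protocol receive handler (a sketch built from the 
patch's helpers; the exact placement within the handler is illustrative):

	if (dev_reserve_used(skb->dev)) {
		/* always return the reserve unit once delivery is reached */
		dev_unreserve(skb->dev);
		if (!is_memalloc_sock(sk)) {
			/* reserve memory, but not block IO: drop it now */
			kfree_skb(skb);
			return 0;
		}
	}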

I'll just reiterate what I'm trying to accomplish here:

   1) Guarantee network block io forward progress
   2) Block IO throughput should not degrade much under low memory

This iteration of the patch addresses those goals nicely, I think.  I have not 
yet shown how to drive this from the block IO layer, and I haven't shown how 
to be sure that all protocols on an interface (not just TCPv4, as here) can 
handle the reserve management semantics.  I have ignored all transports 
besides IP, though not much changes for other transports.  I have some 
accounting code that is very probably racy and needs to be rewritten with 
atomic_t.  I have ignored the many hooks that are possible in the protocol 
path.  I have assumed that all receive skbs are the same size, and haven't 
accounted for the possibility that that size (MTU) might change.  All these 
things need looking at, but the main point at the moment is to establish a 
solid sense of correctness and to get some real results on a vanilla delivery 
path.  That in itself will be useful for cluster work, where configuration
issues are kept under careful control.

As far as drivers are concerned, the new interface is dev_memalloc_skb, which 
is straightforward.  It needs to know about the netdev for accounting 
purposes, so it takes it as a parameter and thoughtfully plugs it into the 
skb for you.
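
In a driver's rx-ring refill path the call looks roughly like this (a 
sketch; only dev_memalloc_skb itself is from the patch, the ring helpers 
stand in for whatever the driver already does):

	while (ring_low(adapter)) {
		struct sk_buff *skb = dev_memalloc_skb(netdev, bufsize);
		if (!skb)
			break;	/* reserve limit hit; retry on the next interrupt */
		/* skb->dev is already set for us */
		ring_add(adapter, skb);
	}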

I am still using the global memory reserve, not mempool.  But notice, now I am 
explicitly accounting and throttling how deep a driver dips into the global 
reserve.  So GFP_MEMALLOC wins a point: the driver isn't just using the 
global reserve blindly, as has been traditional.  The jury is still out 
though.

Re: test

2005-08-06 Thread Daniel Phillips
On Sunday 07 August 2005 03:54, Daniel Phillips wrote:
> On Saturday 06 August 2005 18:40, David S. Miller wrote:
> > From: Daniel Phillips <[EMAIL PROTECTED]>
> > Date: Sat, 6 Aug 2005 04:52:07 +1000
> >
> > > So then there is no choice but to throttle the per-cpu ->input_pkt
> > > queues.
> >
> > Make the driver support NAPI if you want device fairness.
>
> This will co-exist happily with NAPI.  It does not simply replicate NAPI
> functionality because the reserve accounting also tells the receive path
> when it should discriminate in favor of vm traffic.

I forgot to mention, I rewrote the patch completely because of the stupid 
interrupt context mistake, and it no longer touches netif_rx.  Please see 
"take two".

Regards,

Daniel


Re: lockups with netconsole on e1000 on media insertion

2005-08-05 Thread Daniel Phillips
On Saturday 06 August 2005 11:22, Matt Mackall wrote:
> On Sat, Aug 06, 2005 at 01:51:22AM +0200, Andi Kleen wrote:
> > > But why are we in a hurry to dump the backlog on the floor? Why are we
> > > worrying about the performance of netpoll without the cable plugged in
> > > at all? We shouldn't be optimizing the data loss case.
> >
> > Because a system shouldn't stall for minutes (or forever like right now)
> > at boot just because the network cable isn't plugged in.
>
> Using netconsole without a network cable could well be classified as a
> serious configuration error.

But please don't.  An OS that slows to a crawl or crashes because a cable 
isn't plugged in is an OS that deserves to be ridiculed.  Silly timeouts on 
boot are scary and a waste of the user's time.

Regards,

Daniel


test

2005-08-05 Thread Daniel Phillips
On Saturday 06 August 2005 02:33, David S. Miller wrote:
> You can't call into the networking packet input path from
> hardware interrupt context, it simply is not allowed.
>
> And that's the context in which netif_rx() gets called.

Duh.  I assumed we already were in softirq context here (but with 20-20 
hindsight, that would be stupid extra work).

> That is why netif_rx() just queues, and schedules a software
> interrupt in which the netif_receive_skb() call gets made.

So then there is no choice but to throttle the per-cpu ->input_pkt queues.  
The throttling code is already there, I just need to repurpose it slightly.

if (queue->input_pkt_queue.qlen <= netdev_max_backlog)

Question: it looks to me as if a given network device interrupt is always 
routed to the same CPU.  Can we rely on this?  If so, I only have to check 
the 

   min(netdev_max_backlog, sk->dev->reserve_in_flight)




Re: Bypass softnet

2005-08-05 Thread Daniel Phillips
On Saturday 06 August 2005 02:33, David S. Miller wrote:
> You can't call into the networking packet input path from
> hardware interrupt context, it simply is not allowed.
>
> And that's the context in which netif_rx() gets called.

Duh.  I assumed we already were in softirq context here (but with 20-20 
hindsight, that would be stupid extra work).

> That is why netif_rx() just queues, and schedules a software
> interrupt in which the netif_receive_skb() call gets made.

So then there is no choice but to throttle the per-cpu ->input_pkt queues.  
The throttling code is already there, I just need to repurpose it slightly.  
That is, netif_rx will be something like:

	if (queue->input_pkt_queue.qlen <= netdev_max_backlog &&
			reserve_available(skb->dev)) {
		/* enqueue and raise the softirq, as netif_rx does today */
		...
	}


This will needlessly drop the very last packet allocated from reserve, a 
slight wastage, but obvious correctness is more important at this point.

The only purpose of this extra throttling is to avoid having to include the 
entire netdev_max_backlog (default 300) in the network block IO packet 
reserve.  That would be 1.2 MB per interface (300 packets at one 4 KB page 
each), perhaps a little too much to withdraw from the general memory pool.  I 
am thinking that even a single packet's worth of reserve is enough to make 
_probabilistic_ forward progress, but then we run a high risk of being DOSed 
by non-blockio traffic, and throughput will be crippled just when it hurts 
most.  So the right reserve size is a compromise between the single element 
required to make (probabilistic) forward progress and the number that will 
almost never fill up the ->input_pkt queue under zero atomic memory 
conditions.  I guess 50-100 packets in reserve is about right; reduce it for 
larger MTU, increase it for higher interface speed.

Question: it looks to me as if a given network device interrupt is always 
routed to the same CPU.  Can I rely on that?  If so, I can do the reserve 
throttling strictly per-cpu, if not, then a slightly messier calculation is 
necessary.

Regards,

Daniel


Re: argh... ;/

2005-08-05 Thread Daniel Phillips
On Saturday 06 August 2005 03:49, Dave Jones wrote:
> On Fri, Aug 05, 2005 at 01:20:59PM -0400, John W. Linville wrote:
>  > On Sat, Aug 06, 2005 at 02:41:30AM +1000, Daniel Phillips wrote:
>  > > On Friday 05 August 2005 13:04, Mateusz Berezecki wrote:
>  > > > I accidentaly posted the patches as MIME attachments... its 5:03 am
>  > > > here
>  > >
>  > > Does anybody still care if patches are posted as attachments,
>  > > particularly for review as opposed to submission?
>  >
>  > Yes.  Opening attachments makes them harder to review.
>
> Indeed. Replying to such mails ends up looking like..

>

Hi, sorry about the misfired "test" mail... the _real_ issue for me was that 
Kmail was turning tabs into spaces for the longest time, forcing me either to 
switch mailers (and you know how appealing that is) or to send patches as 
attachments.  It's finally fixed, so I'll happily join in with dumping on the 
poor misguided folks who insist on posting their patches as attachments.

Regards,

Daniel "the flexible" Phillips


Re: argh... ;/

2005-08-05 Thread Daniel Phillips
On Friday 05 August 2005 13:04, Mateusz Berezecki wrote:
> I accidentaly posted the patches as MIME attachments... its 5:03 am here
> already. Sorry guys.
> I can resubmit if you want. I just dont want do that now and not trash
> your mailboxes

Does anybody still care if patches are posted as attachments, particularly for 
review as opposed to submission?

I will make the standard comment about C99 // comments, but add that I use 
them myself to mean "this comment will definitely be gone by the time it gets 
officially submitted", in other words, work in progress.

Heroic work!  And nicely lindented at least.  Some long lines, but in the same 
kinds of places I write them myself.

Only commenting on the form, not substance.  Best of luck.  I have one of 
these things here so when you declare it ready I will try it.

Regards,

Daniel


Bypass softnet

2005-08-05 Thread Daniel Phillips
Hi,

OK, I am still a network klutz.  The attached patch changes netif_rx to call 
netif_receive_skb directly instead of going through softnet.  It works with 
my e1000 here, but eventually oopses under moderate load.  I see that a few 
drivers use netif_receive_skb directly, sometimes together with NAPI.  But I 
must have missed some assumption, at least with the e1000 driver.  Any 
suggestion?  The oops happens almost immediately if I add a printk.

I need to put a serial cable on the box so I can see what the oops actually 
is.

Regards,

Daniel
diff -up --recursive 2.6.12.3.clean/net/core/dev.c 2.6.12.3/net/core/dev.c
--- 2.6.12.3.clean/net/core/dev.c	2005-07-15 17:18:57.0 -0400
+++ 2.6.12.3/net/core/dev.c	2005-08-05 11:47:23.0 -0400
@@ -1456,6 +1456,11 @@ int netif_rx(struct sk_buff *skb)
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
 
+	if (1) {
+		netif_receive_skb(skb);
+		return NET_RX_SUCCESS;
+	}
+
 	if (!skb->stamp.tv_sec)
 		net_timestamp(&skb->stamp);
 


Re: [RFC] Net vm deadlock fix (preliminary)

2005-08-04 Thread Daniel Phillips
Whoops:

-   return __dev_alloc_skb(length, gfp_mask | __GFP_MEMALLOC);
+   return __dev_alloc_skb(length, GFP_ATOMIC|__GFP_MEMALLOC);

Regards,

Daniel


Re: [RFC] Net vm deadlock fix (preliminary)

2005-08-04 Thread Daniel Phillips
Hi,

I spent the last day mulling things over and doing research.  It seems to me 
that the patch as first posted is correct and solves the deadlock, except 
that some uses of __GFP_MEMALLOC in __dev_alloc_skb may escape into contexts 
where the reserve is not guaranteed to be reclaimed.  It may be that this 
does not actually happen, but there is enough different usage that I would 
rather err on the side of caution just now, and offer a variant called, e.g., 
dev_memalloc_skb so that drivers will explicitly have to choose to use it (or 
supply the flag to __dev_alloc_skb).  This is just a stack of inline 
functions so there should be no extra object code.  The dev_memalloc_skb 
variant can go away in time, but for now it does no harm.

A minor cleanup: somebody (Rik) complained about his bleeding eyes after 
reading my side-effecty alloc_skb expression, so that was rearranged in a way 
that should optimize to the same thing.

On the "first write the patch, then do the research" principle, the entire 
thread on this topic from the ksummit-2005-discuss mailing list is a good 
read:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/thread.html#186

Matt, you are close to the truth here:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000242.html

but this part isn't right: "it's important to note here that "making progress" 
may require M acknowledgements to N packets representing a single IO. So we 
need separate send and acknowledge pools for each SO_MEMALLOC socket so that 
we don't find ourselves wedged with M-1 available mempool slots when we're 
waiting on ACKs".  This erroneously assumes that mempools throttle the block 
IO traffic.  In fact, the throttling _must_ take place higher up, in the 
block IO stack.  The block driver that submits the network IO must 
pre-account any per-request resources and block until sufficient resources 
become available.  So the accounting would include both space for transmit 
and acknowledge, and the network block IO protocol must be designed to obey 
that accounting.  (I will wave my hands at the question of how we arrange for 
low-level components to communicate their resource needs to high-level 
throttlers, just for now.)
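
To sketch what that submit-side throttle might look like (every name here is 
hypothetical; nothing like this appears in the posted patch):

/*
 * Pre-account both transmit and acknowledge resources before letting a
 * network block IO request go out; block until the pool can cover the
 * whole request.
 */
static int nbio_submit(struct nbio_device *dev, struct request *req)
{
	int units = nbio_reserve_units(req);	/* tx + expected acks */

	while (units--)
		down(&dev->reserve_sem);	/* counting semaphore */
	return nbio_send(dev, req);
}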

Andrea, also getting close:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000200.html

But there is no need to be random.  Short of actually overflowing the input 
ring buffer, we can be precise about accepting all block IO packets and 
dropping non-blockio traffic as necessary.

Rik, not bad:

http://thunker.thunk.org/pipermail/ksummit-2005-discuss/2005-March/000218.html

particularly for deducing it from first principles without actually looking at 
the network code ;-)  It is even the same socket flag name as I settled on 
(SO_MEMALLOC).  But step 4 veers off course: out of order does not matter.  
And the conclusion that we can throttle here by dropping non-blockio packets 
is not right: the packets that we got from reserve still can live an 
arbitrary time in the protocol stack, so we could still exhaust the reserve 
and be back to the same bad old deadlock conditions.

Everybody noticed that dropping non-blockio packets is key, and everybody 
missed the fact that softnet introduces additional queues that need 
throttling (which can't be done sanely) or bypassing.  Almost everybody 
noticed that throttling in the block IO submission path is non-optional.

Everybody thought that mempool is the one true way of reserving memory.  I am 
not so sure, though I still intend to produce a mempool variant of the patch.  
One problem I see with mempool is that it not only reserves resources, but 
pins them.  If the kernel is full of mempools pinning memory pages here and 
there, physical defragmentation gets that much harder and the buddy tree will 
fragment that much sooner.  The __GFP_MEMALLOC interface does not have this 
problem because pages stay in the global pool.  So the jury is still out on 
which method is better.

Obviously, to do the job properly, __GFP_MEMALLOC would need a way of resizing 
the memalloc reserve as users are loaded and unloaded.  Such an interface can 
be very lightweight.  I will cook one up just to demonstrate this.
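
Roughly this light (a sketch; the function name and the min_free_kbytes 
plumbing are assumptions, not posted code):

/* Grow or shrink the reserve as memalloc users load and unload. */
void adjust_memalloc_reserve(int pages)
{
	min_free_kbytes += pages * (PAGE_SIZE / 1024);
	setup_per_zone_pages_min();	/* fold the change into zone watermarks */
}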

Now, the scheme in my patch does the job and I think it does it in a way that 
works for all drivers, even e1000 (by method 1. in the thread above).  But we 
could tighten this up a little by noticing that it doesn't actually matter 
which socket buffer we return to the pool as long as we are sure to return 
the same amount of memory as we withdrew.  Therefore, we could just account 
the number of pages alloc_skb withdraws, and the number that freeing a packet 
returns.  The e1000 driver would look at this number to determine whether to 
mark a packet as from_reserve or not.  That way, the e1000 driver could set 
things in motion to release reserve resources sooner, rather than waiting for 
certain specially flagged skbs to work their way around the
