Re: [RFC] Net vm deadlock fix (take two)

2005-08-06 Thread Daniel Phillips
On Sunday 07 August 2005 02:07, Jeff Garzik wrote:
> > +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev,
> > +   unsigned length, int gfp_mask)
> > +{
> > +   struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask);
> > +   if (skb)
> > +   goto done;
> > +   if (dev->rx_reserve_used >= dev->rx_reserve)
> > +   return NULL;
> > +   if (!(skb = __dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)))
> > +   return NULL;
> > +   dev->rx_reserve_used++;
>
> why bother with rx_reserve at all?  Why not just let the second
> allocation fail, without the rx_reserve_used test?

Because that would allow unbounded reserve use, either because of a leak or 
because of a legitimate backup in the softnet queues.  It is not worth it to 
run the risk of wedging the whole system just to save this check.  If we were 
using a mempool here, it would fail with a similar check.

> Additionally, I think the rx_reserve_used accounting is wrong, since I
> could simply free the skb

Good point.  I should provide a kfree_skb variant that does the reserve 
accounting (dev_free_skb) in case some driver wants to do this.  Anyway, if 
somebody does free an skb in the delivery path without doing the accounting, 
it is not a memory leak; it just means a non-blockio packet might be 
unnecessarily dropped later.
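
Roughly what I have in mind is something like this (just a sketch; dev_free_skb
is not in the posted patch and the exact form may change):

	static inline void dev_free_skb(struct sk_buff *skb)
	{
		struct net_device *dev = skb->dev;

		if (dev && dev->rx_reserve_used)
			dev->rx_reserve_used--;
		kfree_skb(skb);
	}

A driver that drops a received skb before handing it to the stack would call
this instead of kfree_skb, so the interface's reserve count stays balanced.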

> -- but doing so would cause a rx_reserve_used 
> leak in your code, since you only decrement the counter in the TCP IPv4
> path.

Reserve checks are needed not just on the IPv4 path but on every protocol path 
that is allowed to co-exist on the same wire as block IO.  I will add UDP and 
SCTP to the patch next.

If an unhandled protocol does get onto the wire, the consequences are not 
severe.  There is just a risk that the entire reserve may be consumed 
(another reason we need the limit check above), in which case we fall back to 
the old, unreliable block IO behavior.

Eventually this needs to be enforced automatically, so that normal users don't 
have to worry about exactly which protocols they are running on an interface.  
But cluster users will simply take care to run only supported protocols, so 
they can already benefit from this without any fancy checking.

Regards,

Daniel


Re: [RFC] Net vm deadlock fix (take two)

2005-08-06 Thread Jeff Garzik
On Sat, Aug 06, 2005 at 05:22:23PM +1000, Daniel Phillips wrote:
> Daniel
> 
> diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h 2.6.12.3/include/linux/gfp.h
> --- 2.6.12.3.clean/include/linux/gfp.h	2005-07-15 17:18:57.000000000 -0400
> +++ 2.6.12.3/include/linux/gfp.h	2005-08-05 21:53:09.000000000 -0400
> @@ -39,6 +39,7 @@ struct vm_area_struct;
>  #define __GFP_COMP   0x4000u /* Add compound page metadata */
>  #define __GFP_ZERO   0x8000u /* Return zeroed page on success */
>  #define __GFP_NOMEMALLOC 0x10000u /* Don't use emergency reserves */
> +#define __GFP_MEMALLOC  0x20000u /* Use emergency reserves */
>  
>  #define __GFP_BITS_SHIFT 20  /* Room for 20 __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
> diff -up --recursive 2.6.12.3.clean/include/linux/netdevice.h 2.6.12.3/include/linux/netdevice.h
> --- 2.6.12.3.clean/include/linux/netdevice.h	2005-07-15 17:18:57.000000000 -0400
> +++ 2.6.12.3/include/linux/netdevice.h	2005-08-06 01:06:18.000000000 -0400
> @@ -371,6 +371,8 @@ struct net_device
>   struct Qdisc*qdisc_ingress;
>   struct list_headqdisc_list;
>   unsigned long   tx_queue_len;   /* Max frames per queue allowed 
> */
> + int rx_reserve;
> + int rx_reserve_used;
>  
>   /* ingress path synchronizer */
>   spinlock_t  ingress_lock;
> @@ -929,6 +931,28 @@ extern void  net_disable_timestamp(void)
>  extern char *net_sysctl_strdup(const char *s);
>  #endif
>  
> +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev,
> + unsigned length, int gfp_mask)
> +{
> + struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask);
> + if (skb)
> + goto done;
> + if (dev->rx_reserve_used >= dev->rx_reserve)
> + return NULL;
> + if (!(skb = __dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)))
> + return NULL;
> + dev->rx_reserve_used++;

why bother with rx_reserve at all?  Why not just let the second
allocation fail, without the rx_reserve_used test?

Additionally, I think the rx_reserve_used accounting is wrong, since I
could simply free the skb -- but doing so would cause a rx_reserve_used
leak in your code, since you only decrement the counter in the TCP IPv4
path.


> +done:
> + skb->dev = dev;
> + return skb;
> +}
> +
> +static inline struct sk_buff *dev_alloc_skb_reserve(struct net_device *dev,
> + unsigned length)
> +{
> + return __dev_memalloc_skb(dev, length, GFP_ATOMIC);
> +}

unused function


> +
>  #endif /* __KERNEL__ */
>  
>  #endif   /* _LINUX_DEV_H */
> diff -up --recursive 2.6.12.3.clean/include/net/sock.h 2.6.12.3/include/net/sock.h
> --- 2.6.12.3.clean/include/net/sock.h	2005-07-15 17:18:57.000000000 -0400
> +++ 2.6.12.3/include/net/sock.h	2005-08-05 21:53:09.000000000 -0400
> @@ -382,6 +382,7 @@ enum sock_flags {
>   SOCK_NO_LARGESEND, /* whether to sent large segments or not */
>   SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
>   SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
> + SOCK_MEMALLOC, /* protocol can use memalloc reserve */
>  };
>  
>  static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
> @@ -399,6 +400,11 @@ static inline int sock_flag(struct sock 
>   return test_bit(flag, &sk->sk_flags);
>  }
>  
> +static inline int is_memalloc_sock(struct sock *sk)
> +{
> + return sock_flag(sk, SOCK_MEMALLOC);
> +}
> +
>  static inline void sk_acceptq_removed(struct sock *sk)
>  {
>   sk->sk_ack_backlog--;
> diff -up --recursive 2.6.12.3.clean/mm/page_alloc.c 2.6.12.3/mm/page_alloc.c
> --- 2.6.12.3.clean/mm/page_alloc.c	2005-07-15 17:18:57.000000000 -0400
> +++ 2.6.12.3/mm/page_alloc.c	2005-08-05 21:53:09.000000000 -0400
> @@ -802,8 +802,8 @@ __alloc_pages(unsigned int __nocast gfp_
>  
>   /* This allocation should allow future memory freeing. */
>  
> - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> - && !in_interrupt()) {
> + if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> + && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) {
>   if (!(gfp_mask & __GFP_NOMEMALLOC)) {
>   /* go through the zonelist yet again, ignoring mins */
>   for (i = 0; (z = zones[i]) != NULL; i++) {
> diff -up --recursive 2.6.12.3.clean/net/ethernet/eth.c 2.6.12.3/net/ethernet/eth.c
> --- 2.6.12.3.clean/net/ethernet/eth.c	2005-07-15 17:18:57.000000000 -0400
> +++ 2.6.12.3/net/ethernet/eth.c	2005-08-06 02:32:02.000000000 -0400
> @@ -281,6 +281,7 @@ void ether_setup(struct net_device *dev)
>   dev->mtu= 1500; /* eth_mtu */
>   dev->addr_len   = ETH_ALEN;
>   dev->tx_queue_len   = 1000; /* Ethernet wants good queues */
> + dev-

[RFC] Net vm deadlock fix (take two)

2005-08-06 Thread Daniel Phillips
Hi,

This version does not do blatantly stupid things in hardware irq context, is
more efficient, and... wow the patch is smaller!  (That never happens.)

I don't mark skbs as being allocated from reserve any more.  That works, but
it is slightly bogus, because it doesn't matter which skb came from reserve,
it only matters that we put one back.  So I just count them and don't mark
them.

The tricky issue that had to be dealt with is the possibility that a massive 
number of skbs could in theory be queued by the hardware interrupt before the 
softnet softirq gets around to delivering them to the protocol.  But we can 
only allocate a limited number of skbs from reserve memory.  If we run out of 
reserve memory we have no choice but to fail skb allocations, and that can 
cause packets to be dropped.  Since we don't know which packets are blockio 
packets and which are not at that point, we could be so unlucky as to always 
drop block IO packets and always let other junk through.  The other junk 
won't get very far though, because those packets will be dropped as soon as 
the protocol headers are decoded, which will reveal that they do not belong 
to a memalloc socket.  This short circuit ought to help take away the cpu 
load that caused the softirq constipation in the first place.

What is actually going to happen is, a few block IO packets might be randomly 
dropped under such conditions, degrading the transport efficiency.  Block IO 
progress will continue, unless we manage to accidentally drop every block IO 
packet and our softirqs continue to stay comatose, probably indicating a 
scheduler bug.

OK, we want to allocate skbs from reserve, but we cannot go infinitely into 
reserve.  So we count how many packets a driver has allocated from reserve, 
in the net_device struct.  If this count reaches the rx_reserve limit, the skb 
allocation fails and the device driver may drop a packet because of that.  
Note that the e1000 driver will just be trying to refill its rx-ring at this 
point, and it will try to refill it again as soon as the next packet arrives, 
so it is still some ways away from actually dropping a packet.  Other drivers 
may immediately drop a packet at this point, c'est la vie.  Remember, this 
can only happen if the softirqs are backed up a silly amount.

The thing is, we have got our block IO traffic moving, by virtue of dipping 
into the reserve, and most likely moving at near-optimal speed.  Normal 
memory-consuming tasks are not moving because they are blocked on vm IO.  The 
things that can mess us up are cpu hogs - a scheduler problem - and tons of 
unhelpful traffic sharing the network wire, which we are killing off early as 
mentioned above.

What happens when a packet arrives at the protocol handler is a little subtle.  
At this point, if the interface has dipped into reserve, we can always decrement the 
reserve count, regardless of what type of packet it is.  If it is a block IO 
packet, the packet is still accounted for within the block driver's 
throttling.  We are sure that the packet's resources will be returned to the 
common pool in an organized way.  If it is some other random kind of packet, 
we drop it right away, also returning the resources to the common pool.  
Either way, it is not the responsibility of the interface to account for it 
any more.
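
In code, the protocol-side step would look roughly like this (a sketch of the
idea only, with a made-up helper name, not the exact hunk from the patch):

	/* Call once the headers are decoded and we know the owning socket. */
	static inline int skb_settle_reserve(struct sk_buff *skb, struct sock *sk)
	{
		struct net_device *dev = skb->dev;

		if (!dev->rx_reserve_used)
			return 0;		/* no reserve skbs outstanding */

		dev->rx_reserve_used--;		/* interface accounting is done */

		if (!is_memalloc_sock(sk)) {
			kfree_skb(skb);		/* random junk: drop it early */
			return -ENOMEM;
		}
		return 0;			/* block IO: the block driver's
						   throttling takes over */
	}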

I'll just reiterate what I'm trying to accomplish here:

   1) Guarantee network block IO forward progress
   2) Block IO throughput should not degrade much under low memory

This iteration of the patch addresses those goals nicely, I think.  I have not 
yet shown how to drive this from the block IO layer, and I haven't shown how 
to be sure that all protocols on an interface (not just TCPv4, as here) can 
handle the reserve management semantics.  I have ignored all transports 
besides IP, though not much changes for other transports.  I have some 
accounting code that is very probably racy and needs to be rewritten with 
atomic_t.  I have ignored the many hooks that are possible in the protocol 
path.  I have assumed that all receive skbs are the same size, and haven't 
accounted for the possibility that that size (MTU) might change.  All these 
things need looking at, but the main point at the moment is to establish a 
solid sense of correctness and to get some real results on a vanilla delivery 
path.  That in itself will be useful for cluster work, where configuration
issues are kept under careful control.
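
For what it's worth, the accounting will probably end up looking something
like this once it is made race-free (a sketch only, not part of this patch;
rx_reserve_used becomes an atomic_t):

	static inline int rx_reserve_get(struct net_device *dev)
	{
		if (atomic_inc_return(&dev->rx_reserve_used) > dev->rx_reserve) {
			atomic_dec(&dev->rx_reserve_used);
			return 0;	/* over the limit, do not dip into reserve */
		}
		return 1;
	}

	static inline void rx_reserve_put(struct net_device *dev)
	{
		atomic_dec(&dev->rx_reserve_used);
	}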

As far as drivers are concerned, the new interface is dev_memalloc_skb, which 
is straightforward.  It needs to know about the netdev for accounting 
purposes, so it takes it as a parameter and thoughtfully plugs it into the 
skb for you.
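
For illustration, a typical rx ring refill would change to something roughly
like this (ring_has_room and give_skb_to_hw stand in for whatever the driver
already does; this is not taken from an actual driver conversion):

	while (ring_has_room(ring)) {
		struct sk_buff *skb = dev_memalloc_skb(dev, bufsize);
		if (!skb)
			break;	/* normal and reserve allocation both failed,
				   or the reserve limit was hit; retry later */
		/* skb->dev is already filled in for us */
		give_skb_to_hw(ring, skb);
	}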

I am still using the global memory reserve, not mempool.  But notice, now I am 
explicitly accounting and throttling how deep a driver dips into the global 
reserve.  So GFP_MEMALLOC wins a point: the driver isn't just using the 
global reserve blindly, as has been traditional.  The jury is still out, 
though.