Re: [RFC] Net vm deadlock fix (take two)
On Sunday 07 August 2005 02:07, Jeff Garzik wrote: > > +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev, > > + unsigned length, int gfp_mask) > > +{ > > + struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask); > > + if (skb) > > + goto done; > > + if (dev->rx_reserve_used >= dev->rx_reserve) > > + return NULL; > > + if (!__dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)) > > + return NULL;; > > + dev->rx_reserve_used++; > > why bother with rx_reserve at all? Why not just let the second > allocation fail, without the rx_reserve_used test? Because that would allow unbounded reserve use, either because of a leak or because of a legitimate backup in the softnet queues. It is not worth it to run the risk of wedging the whole system just to save this check. If we were using a mempool here, it would fail with a similar check. > Additionally, I think the rx_reserve_used accounting is wrong, since I > could simply free the skb Good point, I should provide a kfree_skb variant that does the reserve accounting (dev_free_skb) in case some driver wants to do this. Anyway, if somebody does free an skb in the delivery path without doing the accounting it is not a memory leak, but might cause a non-blockio packet to be unnecessarily dropped later. > -- but doing so would cause a rx_reserve_used > leak in your code, since you only decrement the counter in the TCP IPv4 > path. Reserve checks are needed not just on the IPv4 path but on every protocol path that is allowed to co-exist on the same wire as block IO. I will add udp and sctp to the patch next. If an unhandled protocol does get onto the wire, the consequences are not severe. There is just a risk that the entire reserve may be consumed (another reason we need the limit check above) and we just fall back to the old unreliable block IO behavior. Eventually this needs to be enforced automatically so that normal users don't have to worry about exactly what protocols they are running on an interface, but cluster users will just take care to run only supported protocols, they can already benefit from this without fancy checking. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Net vm deadlock fix (take two)
On Sat, Aug 06, 2005 at 05:22:23PM +1000, Daniel Phillips wrote: > Daniel > > diff -up --recursive 2.6.12.3.clean/include/linux/gfp.h > 2.6.12.3/include/linux/gfp.h > --- 2.6.12.3.clean/include/linux/gfp.h2005-07-15 17:18:57.0 > -0400 > +++ 2.6.12.3/include/linux/gfp.h 2005-08-05 21:53:09.0 -0400 > @@ -39,6 +39,7 @@ struct vm_area_struct; > #define __GFP_COMP 0x4000u /* Add compound page metadata */ > #define __GFP_ZERO 0x8000u /* Return zeroed page on success */ > #define __GFP_NOMEMALLOC 0x1u /* Don't use emergency reserves */ > +#define __GFP_MEMALLOC 0x2u /* Use emergency reserves */ > > #define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */ > #define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1) > diff -up --recursive 2.6.12.3.clean/include/linux/netdevice.h > 2.6.12.3/include/linux/netdevice.h > --- 2.6.12.3.clean/include/linux/netdevice.h 2005-07-15 17:18:57.0 > -0400 > +++ 2.6.12.3/include/linux/netdevice.h2005-08-06 01:06:18.0 > -0400 > @@ -371,6 +371,8 @@ struct net_device > struct Qdisc*qdisc_ingress; > struct list_headqdisc_list; > unsigned long tx_queue_len; /* Max frames per queue allowed > */ > + int rx_reserve; > + int rx_reserve_used; > > /* ingress path synchronizer */ > spinlock_t ingress_lock; > @@ -929,6 +931,28 @@ extern void net_disable_timestamp(void) > extern char *net_sysctl_strdup(const char *s); > #endif > > +static inline struct sk_buff *__dev_memalloc_skb(struct net_device *dev, > + unsigned length, int gfp_mask) > +{ > + struct sk_buff *skb = __dev_alloc_skb(length, gfp_mask); > + if (skb) > + goto done; > + if (dev->rx_reserve_used >= dev->rx_reserve) > + return NULL; > + if (!__dev_alloc_skb(length, gfp_mask|__GFP_MEMALLOC)) > + return NULL;; > + dev->rx_reserve_used++; why bother with rx_reserve at all? Why not just let the second allocation fail, without the rx_reserve_used test? Additionally, I think the rx_reserve_used accounting is wrong, since I could simply free the skb -- but doing so would cause a rx_reserve_used leak in your code, since you only decrement the counter in the TCP IPv4 path. > +done: > + skb->dev = dev; > + return skb; > +} > + > +static inline struct sk_buff *dev_alloc_skb_reserve(struct net_device *dev, > + unsigned length) > +{ > + return __dev_memalloc_skb(dev, length, GFP_ATOMIC); > +} unused function > + > #endif /* __KERNEL__ */ > > #endif /* _LINUX_DEV_H */ > diff -up --recursive 2.6.12.3.clean/include/net/sock.h > 2.6.12.3/include/net/sock.h > --- 2.6.12.3.clean/include/net/sock.h 2005-07-15 17:18:57.0 -0400 > +++ 2.6.12.3/include/net/sock.h 2005-08-05 21:53:09.0 -0400 > @@ -382,6 +382,7 @@ enum sock_flags { > SOCK_NO_LARGESEND, /* whether to sent large segments or not */ > SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */ > SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */ > + SOCK_MEMALLOC, /* protocol can use memalloc reserve */ > }; > > static inline void sock_set_flag(struct sock *sk, enum sock_flags flag) > @@ -399,6 +400,11 @@ static inline int sock_flag(struct sock > return test_bit(flag, &sk->sk_flags); > } > > +static inline int is_memalloc_sock(struct sock *sk) > +{ > + return sock_flag(sk, SOCK_MEMALLOC); > +} > + > static inline void sk_acceptq_removed(struct sock *sk) > { > sk->sk_ack_backlog--; > diff -up --recursive 2.6.12.3.clean/mm/page_alloc.c 2.6.12.3/mm/page_alloc.c > --- 2.6.12.3.clean/mm/page_alloc.c2005-07-15 17:18:57.0 -0400 > +++ 2.6.12.3/mm/page_alloc.c 2005-08-05 21:53:09.0 -0400 > @@ -802,8 +802,8 @@ __alloc_pages(unsigned int __nocast gfp_ > > /* This allocation should allow future memory freeing. */ > > - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE))) > - && !in_interrupt()) { > + if p->flags & PF_MEMALLOC) || > unlikely(test_thread_flag(TIF_MEMDIE))) > + && !in_interrupt()) || (gfp_mask & __GFP_MEMALLOC)) { > if (!(gfp_mask & __GFP_NOMEMALLOC)) { > /* go through the zonelist yet again, ignoring mins */ > for (i = 0; (z = zones[i]) != NULL; i++) { > diff -up --recursive 2.6.12.3.clean/net/ethernet/eth.c > 2.6.12.3/net/ethernet/eth.c > --- 2.6.12.3.clean/net/ethernet/eth.c 2005-07-15 17:18:57.0 -0400 > +++ 2.6.12.3/net/ethernet/eth.c 2005-08-06 02:32:02.0 -0400 > @@ -281,6 +281,7 @@ void ether_setup(struct net_device *dev) > dev->mtu= 1500; /* eth_mtu */ > dev->addr_len = ETH_ALEN; > dev->tx_queue_len = 1000; /* Ethernet wants good queues */ > + dev-
[RFC] Net vm deadlock fix (take two)
Hi, This version does not do blatantly stupid things in hardware irq context, is more efficient, and... wow the patch is smaller! (That never happens.) I don't mark skbs as being allocated from reserve any more. That works, but it is slightly bogus, because it doesn't matter which skb came from reserve, it only matters that we put one back. So I just count them and don't mark them. The tricky issue that had to be dealt with is the possibility that a massive number of skbs could in theory be queued by the hardware interrupt before the softnet softirq gets around to delivering them to the protocol. But we can only allocate a limited number of skbs from reserve memory. If we run out of reserve memory we have no choice but to fail skb allocations, and that can cause packets to be dropped. Since we don't know which packets are blockio packets and which are not at that point, we could be so unlucky as to always drop block IO packets and always let other junk through. The other junk won't get very far though, because those packets will be dropped as soon as the protocol headers are decoded, which will reveal that they do not belong to a memalloc socket. This short circuit ought to help take away the cpu load that caused the softirq constipation in the first place. What is actually going to happen is, a few block IO packets might be randomly dropped under such conditions, degrading the transport efficiency. Block IO progress will continue, unless we manage to accidently drop every block IO packet and our softirqs continue to stay comatose, probably indicating a scheduler bug. OK, we want to allocate skbs from reserve, but we cannot go infinitely into reserve. So we count how many packets a driver has allocated from reserve, in the net_device struct. If this goes over some limit, the skb allocation fails and the device driver may drop a packet because of that. Note that the e1000 driver will just be trying to refill its rx-ring at this point, and it will try to refill it again as soon as the next packet arrives, so it is still some ways away from actually dropping a packet. Other drivers may immediately drop a packet at this point, c'est la vie. Remember, this can only happen if the softirqs are backed up a silly amount. The thing is, we have got our block IO traffic moving, by virtue of dipping into the reserve, and most likely moving at near-optimal speed. Normal memory-consuming tasks are not moving because they are blocked on vm IO. The things that can mess us up are cpu hogs - a scheduler problem - and tons of unhelpful traffic sharing the network wire, which we are killing off early as mentioned above. What happens when a packet arrives at the protocol handler is a little subtle. At this point, if the interface is into reserve, we can always decrement the reserve count, regardless of what type of packet it is. If it is a block IO packet, the packet is still accounted for within the block driver's throttling. We are sure that the packet's resources will be returned to the common pool in an organized way. If it is some other random kind of packet, we drop it right away, also returning the resources to the common pool. Either way, it is not the responsibility of the interface to account for it any more. I'll just reiterate what I'm trying to accomplish here: 1) Guarantee network block io forward progress 2) Block IO throughput should not degrade much under low memory This iteration of the patch addresses those goals nicely, I think. I have not yet shown how to drive this from the block IO layer, and I haven't shown how to be sure that all protocols on an interface (not just TCPv4, as here) can handle the reserve management semantics. I have ignored all transports besides IP, though not much changes for other transports. I have some accounting code that is very probably racy and needs to be rewritten with atomic_t. I have ignored the many hooks that are possible in the protocol path. I have assumed that all receive skbs are the same size, and haven't accounted for the possibility that that size (MTU) might change. All these things need looking at, but the main point at the moment is to establish a solid sense of correctness and to get some real results on a vanilla delivery path. That in itself will be useful for cluster work, where configuration issues are kept under careful control. As far as drivers are concerned, the new interface is dev_memalloc_skb, which is straightforward. It needs to know about the netdev for accounting purposes, so it takes it as a parameter and thoughtfully plugs it into the skb for you. I am still using the global memory reserve, not mempool. But notice, now I am explicitly accounting and throttling how deep a driver dips into the global reserve. So GFP_MEMALLOC wins a point: the driver isn't just using the global reserve blindly, as has been traditional. The jury is still out thoug