Re: [PATCH] alloc_percpu() fails to allocate percpu data
On Friday 22 February 2008 09:26, Peter Zijlstra wrote: On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote: Some oprofile results obtained while using tbench on a 2x2 CPU machine were very surprising. For example, the loopback_xmit() function was using a high number of CPU cycles to perform the statistics updates, which are supposed to be really cheap since they use percpu data:

	pcpu_lstats = netdev_priv(dev);
	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
	lb_stats->packets++; /* HERE : serious contention */
	lb_stats->bytes += skb->len;

struct pcpu_lstats is a small structure containing two longs. It appears that on my 32-bit platform, alloc_percpu(8) allocates a single cache line, instead of giving each CPU a separate cache line. Using the following patch gave me an impressive boost in various benchmarks (> 6% in tbench). (All percpu_counters hit this bug too.) The long-term fix (i.e. >= 2.6.26) would be to let each CPU allocate its own block of memory, so that we don't need to round up sizes to L1_CACHE_BYTES, or merge the SGI stuff of course... Note: SLUB vs SLAB is important here to *show* the improvement, since they don't have the same minimum allocation sizes (8 bytes vs 32 bytes). This could very well explain regressions some people reported when they switched to SLUB. I've complained about this false sharing as well, so until we get the new and improved percpu allocators... What I don't understand is why the slab allocators have something like this in them:

	if ((flags & SLAB_HWCACHE_ALIGN) && size > cache_line_size() / 2)
		return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't understand why they think they know better than the caller. Things like this are just going to lead to performance problems that are very difficult to track down, and possibly correctness problems in rare cases. There could be another flag for "maybe align".
-- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: lockless pagecache Cassini regression
On Thu, Jan 03, 2008 at 07:32:45PM -0800, David Miller wrote: Nick, I think the following changeset: commit fa4f0774d7c6cccb4d1fda76b91dd8eddcb2dd6a [CASSINI]: dont touch page_count Remove page refcount manipulations from cassini driver by using another field in struct page. Needed for lockless pagecache. Signed-off-by: Nick Piggin [EMAIL PROTECTED] Signed-off-by: David S. Miller [EMAIL PROTECTED] broke the Cassini driver. While it is true that, within the cassini driver, you converted the counter bumps and tests accurately, this changeset does not account for the page count bumps that are going to occur in other contexts when the SKBs being constructed are freed up (and thus the skb_frag_struct pages get liberated). Dang :( Basically what these drivers do is allocate a page, give it to the network card, and the card decides how to chop up the page based upon the size of arriving packets. Therefore we don't know ahead of time how many references to the page we will generate while attaching bits of the page to receive packet SKBs that get built. As incoming packets are constructed by the card using chunks of these system pages, it indicates in the descriptor which parts of which pages were used. We then use this to attach the page parts onto the SKB. Once the SKB is constructed it is passed up to the networking stack; later on the SKB is freed and the page references are dropped by kfree_skbmem(). And it is these page reference counts that the Cassini driver needs to test to be correct, not the driver-local ones we are now using. I do something similar to what Cassini was doing in the NIU driver (drivers/net/niu.c). I think we need to restore Cassini to something similar to what NIU is doing. The only tricky bit is the page count checks, and frankly those can just be transformed into references to compound_head(page)->_count or similar.
Actually, page_count() does exactly that: it is implemented by checking atomic_read(&compound_head(page)->_count), and thus I think the best thing to do is revert your changes. These pages are freshly allocated pages, not stuff in the page cache or similar, so I think this is safe and the way to move forward. Also, these changes added lots of new atomics to the driver. It's already doing get_page() etc. as needed. What do you think Nick? Thanks for the detailed explanation as always, Dave. And I hope I didn't waste too much of your time debugging this... It is a regression so clearly needs to be reverted. It would actually break with the lockless pagecache, however the lockless pagecache is not merged anyway, so reverting is a no-brainer at this point. Just for interest, the lockless pagecache actually makes page->_count unstable for all pages that _have ever_ been pagecache pages (since the last quiescent rcu state, anyway). Basically, it looks up and takes a ref on the struct page without ever having a prior pin or reference on that page. It can do this because it knows the struct page won't actually get freed. After taking the ref, it rechecks that it has got the right page... Anyway, that's the short story but probably gives you an idea of why it hasn't been merged yet ;) I could avoid that whole hassle by rcu freeing pagecache pages, but that would add other overheads. [CASSINI]: Revert 'dont touch page_count'. This reverts changeset fa4f0774d7c6cccb4d1fda76b91dd8eddcb2dd6a ([CASSINI]: dont touch page_count) because it breaks the driver. The local page counting added by this changeset did not account for the asynchronous page count changes done by kfree_skb() and friends. The change adds extra atomics and on top of it all appears to be totally unnecessary as well. Signed-off-by: David S.
Miller [EMAIL PROTECTED] Acked-by: Nick Piggin [EMAIL PROTECTED]

diff --git a/drivers/net/cassini.c b/drivers/net/cassini.c
index 9030ca5..ff957f2 100644
--- a/drivers/net/cassini.c
+++ b/drivers/net/cassini.c
@@ -336,30 +336,6 @@ static inline void cas_mask_intr(struct cas *cp)
 		cas_disable_irq(cp, i);
 }
 
-static inline void cas_buffer_init(cas_page_t *cp)
-{
-	struct page *page = cp->buffer;
-	atomic_set((atomic_t *)&page->lru.next, 1);
-}
-
-static inline int cas_buffer_count(cas_page_t *cp)
-{
-	struct page *page = cp->buffer;
-	return atomic_read((atomic_t *)&page->lru.next);
-}
-
-static inline void cas_buffer_inc(cas_page_t *cp)
-{
-	struct page *page = cp->buffer;
-	atomic_inc((atomic_t *)&page->lru.next);
-}
-
-static inline void cas_buffer_dec(cas_page_t *cp)
-{
-	struct page *page = cp->buffer;
-	atomic_dec((atomic_t *)&page->lru.next);
-}
-
 static void cas_enable_irq(struct cas *cp, const int ring)
 {
 	if (ring == 0) { /* all but TX_DONE */
@@ -497,7 +473,6 @@ static int cas_page_free(struct cas *cp, cas_page_t *page)
 {
 	pci_unmap_page(cp->pdev, page->dma_addr, cp
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Wednesday 14 November 2007 17:37, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] I'm doing some oprofile runs now to see if I can get any more info. OK, in vanilla kernels, the page allocator definitely shows higher in the results (than with Herbert's patch reverted):

27516  2.7217  get_page_from_freelist
21677  2.1442  __rmqueue_smallest
20513  2.0290  __free_pages_ok
18725  1.8522  get_pageblock_flags_group

Just these account for nearly 10% of cycles. __alloc_skb shows up higher too. free_hot_cold_page() shows a lot lower though, which might indicate that there is actually more higher-order allocation activity (I'll check that next).

SLUB, avg throughput 1548
CPU: AMD64 family10, speed 1900 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10
samples  %       symbol name
94636    9.3609  copy_user_generic_string
38932    3.8509  ipt_do_table
34746    3.4369  tcp_v4_rcv
29539    2.9218  skb_release_data
27516    2.7217  get_page_from_freelist
26046    2.5763  tcp_sendmsg
24482    2.4216  local_bh_enable
22910    2.2661  ip_queue_xmit
22113    2.1873  ktime_get
21677    2.1442  __rmqueue_smallest
20513    2.0290  __free_pages_ok
18725    1.8522  get_pageblock_flags_group
18580    1.8378  tcp_recvmsg
18108    1.7911  __napi_schedule
17593    1.7402  schedule
16998    1.6813  tcp_ack
16102    1.5927  dev_hard_start_xmit
15751    1.5580  system_call
15707    1.5536  net_rx_action
15150    1.4986  __switch_to
14988    1.4825  tcp_transmit_skb
13921    1.3770  kmem_cache_free
13398    1.3253  __mod_timer
13243    1.3099  tcp_rcv_established
13109    1.2967  __tcp_select_window
11022    1.0902  __tcp_push_pending_frames
10732    1.0615  set_normalized_timespec
10561    1.0446  netif_rx
8840     0.8744  netif_receive_skb
7816     0.7731  nf_iterate
7300     0.7221  __update_rq_clock
6683     0.6610  _read_lock_bh
6504     0.6433  sys_recvfrom
6283     0.6215  nf_hook_slow
6188     0.6121  release_sock
6172     0.6105  loopback_xmit
5927     0.5863  __alloc_skb
5707     0.5645  tcp_cleanup_rbuf
5538     0.5478  tcp_event_data_recv
5517     0.5457  tcp_v4_do_rcv
5516     0.5456  process_backlog

SLUB, net patch reverted. Avg throughput 1933MB/s
Counted CPU_CLK_UNHALTED, count 10
samples  %       symbol name
95895    9.5094  copy_user_generic_string
50259    4.9839  ipt_do_table
39408    3.9079  skb_release_data
37296    3.6984  tcp_v4_rcv
31309    3.1047  ip_queue_xmit
31308    3.1046  local_bh_enable
24052    2.3851  net_rx_action
23786    2.3587  __napi_schedule
21426    2.1247  tcp_recvmsg
21075    2.0899  schedule
20938    2.0763  dev_hard_start_xmit
20222    2.0053  tcp_sendmsg
19775    1.9610  tcp_ack
19717    1.9552  system_call
19495    1.9332  set_normalized_timespec
18584    1.8429  __switch_to
17022    1.6880  tcp_rcv_established
14655    1.4533  tcp_transmit_skb
14466    1.4345  __mod_timer
13820    1.3705  loopback_xmit
13776    1.3661  get_page_from_freelist
13288    1.3177  netif_receive_skb
9718     0.9637  _read_lock_bh
9625     0.9545  nf_iterate
9440     0.9361  netif_rx
9148     0.9072  free_hot_cold_page
8633     0.8561  __update_rq_clock
7668     0.7604  sys_recvfrom
7578     0.7515  __tcp_push_pending_frames
7311     0.7250  find_pid_ns
7178     0.7118  nf_hook_slow
6655     0.6599  sysret_check
6313     0.6260  release_sock
6290     0.6237  tcp_cleanup_rbuf
6263     0.6211  __tcp_select_window
6235     0.6183  process_backlog
5920     0.5871  ip_local_deliver_finish
5651     0.5604  ip_rcv
5239     0.5195  ip_finish_output
5058     0.5016  kmem_cache_free
5016     0.4974  thread_return

Here are some other things you can play around with: 1) Monitor the values of skb->len and skb->data_len for packets going over loopback. OK, I've taken a log2 histogram of ->len and ->data_len sizes. I'll attach the plots (x axis is the byte value of the most significant bit set, y axis is the number of occurrences, log scale on X so 0 isn't there :( ). If you want to see the patch or raw data, let me know. len looks very similar for both kernels, although the reverted kernel has significantly higher frequency at most points. tbench unfortunately can only be run time-based, so this is roughly expected. data_len has a spike that shifts up to 512-1023, from 256-511, when reverting Herbert's patch.
Again, I believe the magnitudes of the spikes should actually be pretty close if we were doing an equal amount of work. I can't see that these numbers show much useful, unfortunately. 2) Try removing NETIF_F_SG in drivers/net/loopback.c's dev->features setting. Will try that now.

[attachment: data.png] [attachment: len.png]
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Wednesday 14 November 2007 09:27, Nick Piggin wrote: 2) Try removing NETIF_F_SG in drivers/net/loopback.c's dev->features setting. Will try that now. Doesn't help (with vanilla kernel -- Herbert's patch applied). The data_len histogram drops to 0 and goes to len (I guess that's not surprising). Performance is pretty similar (i.e. not good). I'll look at allocator patterns next.
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Wednesday 14 November 2007 22:10, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 14 Nov 2007 09:27:39 +1100 OK, in vanilla kernels, the page allocator definitely shows higher in the results (than with Herbert's patch reverted). ... I can't see that these numbers show much useful, unfortunately. Thanks for all of this data Nick. So the thing that's being affected here in TCP is net/ipv4/tcp.c:select_size(), specifically the else branch:

	int tmp = tp->mss_cache;
	...
	else {
		int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);

		if (tmp >= pgbreak &&
		    tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
			tmp = pgbreak;
	}

This is deciding, in 'tmp', how much linear sk_buff space to allocate. 'tmp' is initially set to the path MSS, which for loopback is 16K minus the space necessary for packet headers. The SKB_MAX_HEAD() value has changed as a result of Herbert's bug fix. I suspect this 'if' test is passing both with and without the patch. But pgbreak is now smaller, and thus the skb->data linear data area size we choose to use is smaller as well. OK, that makes sense. BTW, are you taking advantage of kmalloc's quantization into slabs WRT the linear data area? I wonder if it would be at all useful... You can test if this is precisely what is causing the performance regression by using the old calculation just here in select_size(). Add something like this local to net/ipv4/tcp.c:

#define OLD_SKB_WITH_OVERHEAD(X) \
	(((X) - sizeof(struct skb_shared_info)) & \
	 ~(SMP_CACHE_BYTES - 1))
#define OLD_SKB_MAX_ORDER(X, ORDER) \
	OLD_SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
#define OLD_SKB_MAX_HEAD(X)	(OLD_SKB_MAX_ORDER((X), 0))

And then use OLD_SKB_MAX_HEAD() in select_size(). That brings performance back up! I wonder why it isn't causing a problem for SLAB...
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Wednesday 14 November 2007 22:48, Herbert Xu wrote: On Wed, Nov 14, 2007 at 03:10:22AM -0800, David Miller wrote: So the thing that's being affected here in TCP is net/ipv4/tcp.c:select_size(), specifically the else branch: Thanks for the pointer. Indeed there is a bug in that area. I'm not sure whether it's causing the problem at hand but it's certainly suboptimal. [TCP]: Fix size calculation in sk_stream_alloc_pskb This looks like it fixes the problem! Still interested to know why SLAB didn't see the same thing... We round up the header size in sk_stream_alloc_pskb so that TSO packets get zero tail room. Unfortunately this rounding up is not coordinated with the select_size() function used by TCP to calculate the second parameter of sk_stream_alloc_pskb. As a result, we may allocate more than a page of data in the non-TSO case when exactly one page is desired. In fact, rounding up the head room is detrimental in the non-TSO case because it makes memory that would otherwise be available to the payload head room. TSO doesn't need this either, all it wants is the guarantee that there is no tail room. So this patch fixes this by adjusting the skb_reserve call so that exactly the requested amount (which all callers have calculated in a precise way) is made available as tail room. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Cheers,
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Thursday 15 November 2007 10:46, David Miller wrote: From: Herbert Xu [EMAIL PROTECTED] Date: Wed, 14 Nov 2007 19:48:44 +0800 Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied and I'll queue it up for -stable too. Good result. Thanks, everyone!
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Thursday 15 November 2007 12:11, Herbert Xu wrote: On Wed, Nov 14, 2007 at 05:03:25PM -0800, Christoph Lameter wrote: Well this is likely the result of the SLUB regression. If you allocate an order-1 page then the zone locks need to be taken. SLAB queues a couple of higher-order pages and can so serve a couple of requests without going into the page allocator, whereas SLUB has to go directly to the page allocator for allocate and free. I guess that needs fixing in the page allocator. Or do I need to add a mechanism to buffer higher-order page allocations in SLUB? Yeah, it appears this is what happened. The lockless page allocator fastpath appears in the profile and the slowpaths disappear after Herbert's patches. SLAB is doing its own thing, so it avoids that pitfall. Actually this serves to discourage people from using high-order allocations, which IMHO is a good thing :) Yeah, I completely agree. The right fix is in the caller... The bug / suboptimal allocation would not have been found in TCP if not for this ;)
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Tuesday 13 November 2007 06:44, Christoph Lameter wrote: On Sat, 10 Nov 2007, Nick Piggin wrote: BTW, your size-2048 kmalloc cache is order-1 in the default setup, whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations. And SLAB also uses order-0 for size-2048. It would be nice if SLUB did the same... You can try to see the effect that order 0 would have by booting with slub_max_order=0 Yeah, that didn't help much, but in general I think it would give more consistent and reliable behaviour from slub.
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Wednesday 14 November 2007 12:58, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Tue, 13 Nov 2007 22:41:58 +1100 On Tuesday 13 November 2007 06:44, Christoph Lameter wrote: On Sat, 10 Nov 2007, Nick Piggin wrote: BTW, your size-2048 kmalloc cache is order-1 in the default setup, whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations. And SLAB also uses order-0 for size-2048. It would be nice if SLUB did the same... You can try to see the effect that order 0 would have by booting with slub_max_order=0 Yeah, that didn't help much, but in general I think it would give more consistent and reliable behaviour from slub. Just a note that I'm not ignoring this issue, I just don't have time to get to it yet. No problem. I would like to have helped more, but it's slow going given my lack of network stack knowledge. If I get any more interesting data, I'll send it. I suspect the issue is about having a huge skb->data linear area for TCP sends over loopback. We're likely getting a much smaller skb->data linear data area after the patch in question, the rest using the sk_buff scatterlist pages which are a little bit more expensive to process. It didn't seem to be noticeable at 1 client. Unless scatterlist processing is going to cause cacheline bouncing, I don't see why this hurts more as you add CPUs?
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Wednesday 14 November 2007 17:12, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 14 Nov 2007 04:36:24 +1100 On Wednesday 14 November 2007 12:58, David Miller wrote: I suspect the issue is about having a huge skb->data linear area for TCP sends over loopback. We're likely getting a much smaller skb->data linear data area after the patch in question, the rest using the sk_buff scatterlist pages which are a little bit more expensive to process. It didn't seem to be noticeable at 1 client. Unless scatterlist processing is going to cause cacheline bouncing, I don't see why this hurts more as you add CPUs? Is your test system using HIGHMEM? That's one thing the page vector in the sk_buff can do a lot of: kmaps. No, it's an x86-64, so no highmem. What's also interesting is that SLAB apparently doesn't have this condition. The first thing that sprung to mind is that SLAB caches order-0 allocations, while SLUB does not. However if anything, that should actually favour the SLUB numbers if the network stack is avoiding order-0 allocations. I'm doing some oprofile runs now to see if I can get any more info.
Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench
On Saturday 10 November 2007 12:29, Nick Piggin wrote: cc'ed linux-netdev Err, make that 'netdev' :P On Saturday 10 November 2007 10:46, Christoph Lameter wrote: commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0 seems to cause a drop in SLUB tbench performance: 8p x86_64 system: 2.6.24-rc2: 1260.80 MB/sec After reverting the patch: 2350.04 MB/sec SLAB performance (which is at 2435.58 MB/sec, ~3% better than SLUB) is not affected by the patch. Ah, I didn't realise this was a regression. Thanks for bisecting it. Since this is an alignment change it seems that tbench performance is sensitive to the data layout? SLUB packs data more tightly than SLAB, so 8-byte allocations could result in cacheline contention if adjacent objects are allocated from different CPUs. SLAB's minimum size is 32 bytes, so the cacheline contention is likely more limited. Maybe we need to allocate a minimum of one cacheline to the skb head? Or pad it out to a full cacheline? The data should already be cacheline aligned. It is kmalloced, and with a minimum size of somewhere around 200 bytes on a 64-bit machine, so it will hit a cacheline-aligned kmalloc slab AFAIKS -- cacheline interference is probably not the problem. (To verify, I built slub with the minimum kmalloc size set to 32 like slab and it made no real difference.) But I can't see why restricting the allocation to PAGE_SIZE would help either. Maybe the macros are used in some other areas. BTW, your size-2048 kmalloc cache is order-1 in the default setup, whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations. And SLAB also uses order-0 for size-2048. It would be nice if SLUB did the same...

commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0
Author: Herbert Xu [EMAIL PROTECTED]
Date: Sun Oct 21 16:27:46 2007 -0700

[NET]: Fix SKB_WITH_OVERHEAD calculation

The calculation in SKB_WITH_OVERHEAD is incorrect in that it can cause an overflow across a page boundary, which is what it's meant to prevent.
In particular, the header length (X) should not be lumped together with skb_shared_info. The latter needs to be aligned properly while the header has no choice but to sit in front of wherever the payload is. Therefore the correct calculation is to take away the aligned size of skb_shared_info, and then subtract the header length. The resulting quantity L satisfies the following inequality:

	SKB_DATA_ALIGN(L + X) + sizeof(struct skb_shared_info) <= PAGE_SIZE

This is the quantity used by alloc_skb to do the actual allocation.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f93f22b..369f60a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,8 +41,7 @@
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
				 ~(SMP_CACHE_BYTES - 1))
 #define SKB_WITH_OVERHEAD(X)	\
-	(((X) - sizeof(struct skb_shared_info)) & \
-	 ~(SMP_CACHE_BYTES - 1))
+	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 #define SKB_MAX_ORDER(X, ORDER) \
	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)	(SKB_MAX_ORDER((X), 0))
Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote: On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote: On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Allow PF_MEMALLOC to be set in softirq context. When running softirqs from a borrowed context save current->flags; ksoftirqd will have its own task_struct. What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess that some networking thing must be picking it up in a subsequent patch, but I'm too lazy to look!)... Again, can you have more of a rationale in your patch headers, or ref the patch that uses it... thanks Right, I knew I was forgetting something in these changelogs. The network stack does quite a bit of packet processing from softirq context. Once you start swapping over network, some of the packets want to be processed under PF_MEMALLOC. Hmm... what about processing from interrupt context?
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote: On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote: On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation contexts that are entitled to it. Care is taken to only touch the SLUB slow path. This is done to ensure reserve pages don't leak out and get consumed. I think this is generally a good idea (to prevent slab allocators from stealing reserve). However I naively think the implementation is a bit overengineered and thus has a few holes. Humour me, what was the problem with failing the slab allocation (actually, not fail but just call into the page allocator to do correct waiting / reclaim) in the slowpath if the process fails the watermark checks? Ah, we actually need slabs below the watermarks. Right, I'd still allow those guys to allocate slabs. Provided they have the right allocation context, right? It's just that once I allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC I don't want allocation contexts that do not have rights to those pages to walk off with objects. And I'd prevent these ones from doing so. Without keeping track of reserve pages, which doesn't feel too clean.
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote: On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote: And I'd prevent these ones from doing so. Without keeping track of reserve pages, which doesn't feel too clean. The problem with that is that once a slab was allocated with the right allocation context, anybody can get objects from these slabs. [snip] I understand that. So we either reserve a page per object, which for 32-byte objects is a large waste, or we stop anybody who doesn't have the right permissions from obtaining objects. I took the latter approach. What I'm saying is that the slab allocator slowpath should always just check watermarks against the current task, instead of this ->reserve stuff.
Re: [PATCH 00/33] Swap over NFS -v14
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Hi, Another posting of the full swap over NFS series. Hi, Is it really worth all the added complexity of making swap over NFS files work, given that you could use a network block device instead? Also, have you ensured that page_file_index, page_file_mapping and page_offset are only ever used on anonymous pages when the page is locked? (otherwise PageSwapCache could change)
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation contexts that are entitled to it. Care is taken to only touch the SLUB slow path. This is done to ensure reserve pages don't leak out and get consumed. I think this is generally a good idea (to prevent slab allocators from stealing reserve). However I naively think the implementation is a bit overengineered and thus has a few holes. Humour me, what was the problem with failing the slab allocation (actually, not fail but just call into the page allocator to do correct waiting / reclaim) in the slowpath if the process fails the watermark checks?
Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Allow the mempool to use the memalloc reserves when all else fails and the allocation context would otherwise allow it. I don't see what this is for. The whole point of my earlier fix to *not* use the memalloc reserves was that processes which were otherwise allowed to use those reserves were draining them through mempool allocations, and they should not.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/mempool.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c
+++ linux-2.6/mm/mempool.c
@@ -14,6 +14,7 @@
 #include <linux/mempool.h>
 #include <linux/blkdev.h>
 #include <linux/writeback.h>
+#include "internal.h"
 
 static void add_element(mempool_t *pool, void *element)
 {
@@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
 	void *element;
 	unsigned long flags;
 	wait_queue_t wait;
-	gfp_t gfp_temp;
+	gfp_t gfp_temp, gfp_orig = gfp_mask;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -228,6 +229,15 @@ repeat_alloc:
 	}
 	spin_unlock_irqrestore(&pool->lock, flags);
 
+	/* if we really had right to the emergency reserves try those */
+	if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
+		if (gfp_temp & __GFP_NOMEMALLOC) {
+			gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
+			goto repeat_alloc;
+		} else
+			gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
+	}
+
 	/* We must not sleep in the GFP_ATOMIC case */
 	if (!(gfp_mask & __GFP_WAIT))
 		return NULL;
Re: [PATCH 05/33] mm: kmem_estimate_pages()
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Provide a method to get the upper bound on the pages needed to allocate a given number of objects from a given kmem_cache. Fair enough, but just to make it a bit easier, can you provide a little rationale for why in this patch (or reference the patch number where you use it, or put it together with the patch where you use it, etc.). Thanks,
Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Change ALLOC_NO_WATERMARK page allocation such that the reserves are system wide - which they are per setup_per_zone_pages_min() - when we scrape the barrel, do it properly. IIRC it's actually not too uncommon to have allocations coming here via page reclaim. It's not exactly clear that you want to break mempolicies at this point.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 mm/page_alloc.c | 6 ++
 1 file changed, 6 insertions(+)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -1638,6 +1638,12 @@ restart:
 rebalance:
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 nofail_alloc:
+		/*
+		 * break out of mempolicy boundaries
+		 */
+		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+			gfp_zone(gfp_mask);
+
 		/* go through the zonelist yet again, ignoring mins */
 		page = get_page_from_freelist(gfp_mask, order, zonelist,
 				ALLOC_NO_WATERMARKS);
Re: [PATCH 00/33] Swap over NFS -v14
On Wednesday 31 October 2007 15:37, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 31 Oct 2007 14:26:32 +1100 Is it really worth all the added complexity of making swap over NFS files work, given that you could use a network block device instead? Don't be misled. Swapping over NFS is just a scarecrow for the seemingly real impetus behind these changes, which is network storage stuff like iSCSI. Oh, I'm OK with the network reserves stuff (not the actual patch, which I'm not really qualified to review, but at least the idea of it...). And also I'm not as such against the idea of swap over network. What I'm less sure about, however, is specifically the change to make swapfiles work through the filesystem layer (ATM it goes straight to the block layer, modulo some initialisation stuff which uses block filesystem-specific calls). I mean, I assume that anybody trying to swap over network *today* has to be using a network block device anyway, so the idea of just being able to transparently improve that case seems better than adding new complexities for seemingly not much gain.
Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Allow PF_MEMALLOC to be set in softirq context. When running softirqs from a borrowed context, save current->flags; ksoftirqd will have its own task_struct. What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess that some networking thing must be picking it up in a subsequent patch, but I'm too lazy to look!)... Again, can you have more of a rationale in your patch headers, or ref the patch that uses it... thanks
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: #define atomic_read_volatile(v) \ ({ \ forget((v)->counter); \ ((v)->counter); \ }) where: *vomit* :) Not only do I hate the keyword volatile, but the barrier is only a one-sided affair so it's probable this is going to have slightly different allowed reorderings than a real volatile access. Also, why would you want to make these insane accessors for atomic_t types? Just make sure everybody knows the basics of barriers, and they can apply that knowledge to atomic_t and all other lockless memory accesses as well. #define forget(a) __asm__ __volatile__ ("" : "=m" (a) : "m" (a)) I like order(x) better, but it's not the most perfect name either. -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Stefan Richter wrote: Nick Piggin wrote: I don't know why people would assume volatile of atomics. AFAIK, most of the documentation is pretty clear that all the atomic stuff can be reordered etc. except for those that modify and return a value. Which documentation is there? Documentation/atomic_ops.txt For driver authors, there is LDD3. It doesn't specifically cover effects of optimization on accesses to atomic_t. For architecture port authors, there is Documentation/atomic_ops.txt. Driver authors also can learn something from that document, as it indirectly documents the atomic_t and bitops APIs. Semantics and Behavior of Atomic and Bitmask Operations is pretty direct :) Sure, it says that it's for arch maintainers, but there is no reason why users can't make use of it. Prompted by this thread, I reread this document, and indeed, the sentence Unlike the above routines, it is required that explicit memory barriers are performed before and after [atomic_{inc,dec}_return] indicates that atomic_read (one of the above routines) is very different from all other atomic_t accessors that return values. This is strange. Why is it that atomic_read stands out that way? It is not just atomic_read of course; it is atomic_add, sub, inc, dec, and set too. IMO this API imbalance is quite unexpected by many people. Wouldn't it be beneficial to change the atomic_read API to behave the same like all other atomic_t accessors that return values? It is very consistent and well defined. Operations which both modify the data _and_ return something are defined to have full barriers before and after. What do you want to add to the other atomic accessors? Full memory barriers? Only compiler barriers? It's quite likely that if you think some barriers will fix bugs, then there are other bugs lurking there anyway. Just use spinlocks if you're not absolutely clear about potential races and memory ordering issues -- they're pretty cheap and simple.
OK, it is also different from the other accessors that return data in so far as it doesn't modify the data. But as a driver author, i.e. a user of the API, I can't see much use of an atomic_read that can be reordered and, more importantly, can be optimized away by the compiler. It will return to you an atomic snapshot of the data (loaded from memory at some point since the last compiler barrier). All you have to be aware of is compiler barriers and the Linux SMP memory ordering model, which should be a given if you are writing lockless code. Sure, now that I learned of these properties I can start to audit code and insert barriers where I believe they are needed, but this simply means that almost all occurrences of atomic_read will get barriers (unless there already are implicit but more or less obvious barriers like msleep). You might find that these places that appear to need barriers are buggy for other reasons anyway. Can you point to some in-tree code we can have a look at? -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Herbert Xu wrote: On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote: BTW, the sort of missing barriers that triggered this thread aren't that subtle. It'll result in a simple lock-up if the loop condition holds upon entry. At which point it's fairly straightforward to find the culprit. Not necessarily. A barrier-less buggy code such as below: atomic_set(v, 0); ... /* some initial code */ while (atomic_read(v)) ; ... /* code that MUST NOT be executed unless v becomes non-zero */ (where v->counter has no volatile access semantics) could be generated by the compiler to simply *elide* or *do away with* the loop itself, thereby causing the: /* code that MUST NOT be executed unless v becomes non-zero */ to be executed even when v is zero! That is subtle indeed, and causes no hard lockups. Then I presume you mean while (!atomic_read(v)) ; Which is just the same old infinite loop bug solved with cpu_relax(). These are pretty trivial to audit and fix, and also to debug, I would think. Granted, the above IS buggy code. But, the stated objective is to avoid heisenbugs. Anyway, why are you making up code snippets that are buggy in other ways in order to support this assertion being made that lots of kernel code supposedly depends on volatile semantics? Just reference the actual code. And we have driver / subsystem maintainers such as Stefan coming up and admitting that often a lot of code that's written to use atomic_read() does assume the read will not be elided by the compiler. So these are broken on i386 and x86-64? Are they definitely safe on SMP and weakly ordered machines with just a simple compiler barrier there? Because I would not be surprised if there are a lot of developers who don't really know what to assume when it comes to memory ordering issues.
This is not a dig at driver writers: we still have memory ordering problems in the VM too (and probably most of the subtle bugs in lockless VM code are memory ordering ones). Let's not make up a false sense of security and hope that sprinkling volatile around will allow people to write bug-free lockless code. If a writer can't be bothered reading API documentation and learning the Linux memory model, they can still be productive writing safely locked code. -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Sure, now that I learned of these properties I can start to audit code and insert barriers where I believe they are needed, but this simply means that almost all occurrences of atomic_read will get barriers (unless there already are implicit but more or less obvious barriers like msleep). You might find that these places that appear to need barriers are buggy for other reasons anyway. Can you point to some in-tree code we can have a look at? Such code was mentioned elsewhere (query nodemgr_host_thread in cscope) that managed to escape the requirement for a barrier only because of some completely un-obvious compilation-unit-scope thing. But I find such a non-explicit barrier in quite bad taste. Stefan, do consider plunking an explicit call to barrier() there. It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. The unobvious thing is that you wanted to know how the compiler knows a function is a barrier -- the answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. If the whole msleep call chain including the scheduler were defined static in the current compilation unit, then it would still be a barrier because it would actually be able to see the barriers in schedule(), if nothing else.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Also, why would you want to make these insane accessors for atomic_t types? Just make sure everybody knows the basics of barriers, and they can apply that knowledge to atomic_t and all other lockless memory accesses as well. Code that looks like: while (!atomic_read(v)) { ... cpu_relax_no_barrier(); forget(v.counter); } would be uglier. Also think about code such as: I think they would both be equally ugly, but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation. And it would be more ugly than introducing an order(x) statement for all memory operations, and adding an order_atomic() wrapper for it for atomic types. a = atomic_read(); if (!a) do_something(); forget(); a = atomic_read(); ... /* some code that depends on value of a, obviously */ forget(); a = atomic_read(); ... So much explicit sprinkling of forget() looks ugly. Firstly, why is it ugly? It's nice because of those nice explicit statements there that give us a good heads up and would have some comments attached to them (also, lack of the word volatile is always a plus). Secondly, what sort of code would do such a thing? In most cases, it is probably riddled with bugs anyway (unless it is doing a really specific sequence of interrupts or something, but in that case it is very likely to either require locking or busy waits anyway - ie. barriers). on the other hand, looks neater. The _volatile() suffix makes it also no less explicit than an explicit barrier-like macro that this primitive is something special, for code clarity purposes. Just don't use the word volatile, and have barriers both before and after the memory operation, and I'm OK with it. I don't see the point though, when you could just have a single order(x) barrier function defined for all memory locations, rather than this odd thing that only works for atomics (and would have to be duplicated for atomic_set).
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. Probably you didn't mean that, but no, schedule() is not a barrier because it sleeps. It's a barrier because it's invisible. Where did I say it is a barrier because it sleeps? It is always a barrier because, at the lowest level, schedule() (and thus anything that sleeps) is defined to always be a barrier. Regardless of whatever obscure means the compiler might need to infer the barrier. In other words, you can ignore those obscure details because schedule() is always going to have an explicit barrier in it. The unobvious thing is that you wanted to know how the compiler knows a function is a barrier -- the answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. True, that's clearly what happens here. But you're definitely joking that this is obvious in terms of code clarity, right? No. If you accept that barrier() is implemented correctly, and you know that sleeping is defined to be a barrier, then it's perfectly clear. You don't have to know how the compiler knows that some function contains a barrier. Just 5 minutes back you mentioned elsewhere you like seeing lots of explicit calls to barrier() (with comments, no less, hmm? :-) Sure, but there are well-known primitives which contain barriers, and trivially recognisable code sequences for which you don't need comments. Waiting-loops using sleeps or cpu_relax() are prime examples.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: I think they would both be equally ugly, You think both these are equivalent in terms of looks: | while (!atomic_read(v)) { | while (!atomic_read_xxx(v)) { ... | ... cpu_relax_no_barrier(); | cpu_relax_no_barrier(); order_atomic(v); | } } | (where order_atomic() is an atomic_t specific wrapper as you mentioned below) ? I think the LHS is better if your atomic_read_xxx primitive is using the crazy one-sided barrier, because with the LHS code you immediately know what barriers are happening, and with the RHS you have to look at the atomic_read_xxx definition. If your atomic_read_xxx implementation was more intuitive, then both are pretty well equal. More lines != ugly code. but the atomic_read_volatile variant would be more prone to subtle bugs because of the weird implementation. What bugs? You can't think for yourself? Your atomic_read_volatile contains a compiler barrier to the atomic variable before the load. 2 such reads from different locations look like this: asm volatile("" : "+m" (v1)); atomic_read(v1); asm volatile("" : "+m" (v2)); atomic_read(v2); Which implies that the load of v1 can be reordered to occur after the load of v2. Bet you didn't expect that? Secondly, what sort of code would do such a thing? See the nodemgr_host_thread() that does something similar, though not exactly the same. I'm sorry, all this waffling about made-up code which might do this and that is just a waste of time. Seriously, the thread is bloated enough and never going to get anywhere with all this handwaving. If someone is saving up all the really real and actually good arguments for why we must have a volatile here, now is the time to use them. and have barriers both before and after the memory operation, How could that lead to bugs? (if you can point to existing code, but just some testcase / sample code would be fine as well). See above. As I said, barrier() is too heavy-handed. Typo.
I meant: defined for a single memory location (ie. order(x)).
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Because they should be thinking about them in terms of barriers, over which the compiler / CPU is not to reorder accesses or cache memory operations, rather than special volatile accesses. This is obviously just a taste thing. Whether to have that forget(x) barrier as something author should explicitly sprinkle appropriately in appropriate places in the code by himself or use a primitive that includes it itself. That's not obviously just taste to me. Not when the primitive has many (perhaps, the majority) of uses that do not require said barriers. And this is not solely about the code generation (which, as Paul says, is relatively minor even on x86). I prefer people to think explicitly about barriers in their lockless code. I'm not saying taste matters aren't important (they are), but I'm really skeptical if most folks would find the former tasteful. So I /do/ have better taste than most folks? Thanks! :-) And by the way, the point is *also* about the fact that cpu_relax(), as of today, implies a full memory clobber, which is not what a lot of such loops want. (due to stuff mentioned elsewhere, summarized in that summary) That's not the point, That's definitely the point, why not. This is why barrier(), being heavy-handed, is not the best option. That is _not_ the point (of why a volatile atomic_read is good) because there has already been an alternative posted that better conforms with Linux barrier API and is much more widely useful and more usable. If you are so worried about barrier() being too heavyweight, then you're off to a poor start by wanting to add a few K of kernel text by making atomic_read volatile. because as I also mentioned, the logical extention to Linux's barrier API to handle this is the order(x) macro. Again, not special volatile accessors. Sure, that forget(x) macro _is_ proposed to be made part of the generic API. 
Doesn't explain why not to define/use primitives that have volatility semantics in themselves, though (taste matters apart). If you follow the discussion... You were thinking of a reason why the semantics *should* be changed or added, and I was rebutting your argument that it must be used when a full barrier() is too heavy (ie. by pointing out that order() has superior semantics anyway). Why do I keep repeating the same things? I'll not continue bloating this thread until a new valid point comes up...
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: On Fri, 17 Aug 2007, Nick Piggin wrote: Satyam Sharma wrote: It is very obvious. msleep calls schedule() (ie. sleeps), which is always a barrier. Probably you didn't mean that, but no, schedule() is not a barrier because it sleeps. It's a barrier because it's invisible. Where did I say it is a barrier because it sleeps? Just below. What you wrote: It is always a barrier because, at the lowest level, schedule() (and thus anything that sleeps) is defined to always be a barrier. It is always a barrier because, at the lowest level, anything that sleeps is defined to always be a barrier... because it must call schedule and schedule is a barrier. Regardless of whatever obscure means the compiler might need to infer the barrier. In other words, you can ignore those obscure details because schedule() is always going to have an explicit barrier in it. I didn't quite understand what you said here, so I'll tell you what I think: * foo() is a compiler barrier if the definition of foo() is invisible to the compiler at a callsite. * foo() is also a compiler barrier if the definition of foo() includes a barrier, and it is inlined at the callsite. If the above is wrong, or if there's something else at play as well, do let me know. Right. The unobvious thing is that you wanted to know how the compiler knows a function is a barrier -- the answer is that if it does not *know* it is not a barrier, it must assume it is a barrier. True, that's clearly what happens here. But you're definitely joking that this is obvious in terms of code clarity, right? No. If you accept that barrier() is implemented correctly, and you know that sleeping is defined to be a barrier, Curiously, that's the second time you've said sleeping is defined to be a (compiler) barrier. _In Linux,_ sleeping is defined to be a compiler barrier. How does the compiler even know if foo() is a function that sleeps?
Do compilers have some notion of sleeping to ensure they automatically assume a compiler barrier whenever such a function is called? Or are you saying that the compiler can see the barrier() inside said function ... nopes, you're saying quite the opposite below. You're getting too worried about the compiler implementation. Start by assuming that it does work ;) then it's perfectly clear. You don't have to know how the compiler knows that some function contains a barrier. I think I do, why not? Would appreciate if you could elaborate on this. If a function is not completely visible to the compiler (so it can't determine whether a barrier could be in it or not), then it must always assume it will contain a barrier, so it always does the right thing.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Segher Boessenkool wrote: Part of the motivation here is to fix heisenbugs. If I knew where they By the same token we should probably disable optimisations altogether since that too can create heisenbugs. Almost everything is a tradeoff; and so is this. I don't believe most people would find disabling all compiler optimisations an acceptable price to pay for some peace of mind. So why is this a good tradeoff? It certainly is better than disabling all compiler optimisations! It's easy to be better than something really stupid :) So i386 and x86-64 don't have volatiles there, and it saves them a few K of kernel text. What you need to justify is why it is a good tradeoff to make them volatile (which btw, is much harder to go the other way after we let people make those assumptions). I also think that just adding things to APIs in the hope it might fix up some bugs isn't really a good road to go down. Where do you stop? I look at it the other way: keeping the volatile semantics in atomic_XXX() (or adding them to it, whatever) helps _prevent_ bugs; Yeah, but we could add lots of things to help prevent bugs, and they would never be included. I would also contend that it helps _hide_ bugs and encourages people to be lazy when thinking about these things. Also, you dismiss the fact that we'd actually be *adding* volatile semantics back to the 2 most widely tested architectures (in terms of test time, number of testers, variety of configurations, and coverage of driver code). This is a very important difference from just keeping volatile semantics, because it is basically a one-way API change. certainly most people expect that behaviour, and also that behaviour is *needed* in some places and no other interface provides that functionality. I don't know that most people would expect that behaviour. Is there any documentation anywhere that would suggest this? Also, barrier() most definitely provides the required functionality.
It is overkill in some situations, but volatile is overkill in _most_ situations. If that's what you're worried about, we should add a new ordering primitive. [some confusion about barriers wrt atomics snipped] What were you confused about? -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Chris Snook wrote: Herbert Xu wrote: On Thu, Aug 16, 2007 at 03:48:54PM -0400, Chris Snook wrote: Can you find an actual atomic_read code snippet there that is broken without the volatile modifier? A whole bunch of atomic_read uses will be broken without the volatile modifier once we start removing barriers that aren't needed if volatile behavior is guaranteed. Could you please cite the file/function names so we can see whether removing the barrier makes sense? Thanks, At a glance, several architectures' implementations of smp_call_function() have one or more legitimate atomic_read() busy-waits that shouldn't be using cpu_relax(). Some of them do work in the loop. sh looks like the only one there that would be broken (and that's only because they don't have a cpu_relax there, but it should be added anyway). Sure, if we removed volatile from other architectures, it would be wise to audit arch code because arch maintainers do sometimes make assumptions about their implementation details... however we can assume most generic code is safe without volatile. -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Paul E. McKenney wrote: On Thu, Aug 16, 2007 at 06:42:50PM +0800, Herbert Xu wrote: In fact, volatile doesn't guarantee that the memory gets read anyway. You might be reading some stale value out of the cache. Granted this doesn't happen on x86 but when you're coding for the kernel you can't make such assumptions. So the point here is that if you don't mind getting a stale value from the CPU cache when doing an atomic_read, then surely you won't mind getting a stale value from the compiler cache. Absolutely disagree. An interrupt/NMI/SMI handler running on the CPU will see the same value (whether in cache or in store buffer) that the mainline code will see. In this case, we don't care about CPU misordering, only about compiler misordering. It is easy to see other uses that combine communication with handlers on the current CPU with communication among CPUs -- again, see prior messages in this thread. I still don't agree with the underlying assumption that everybody (or lots of kernel code) treats atomic accesses as volatile. Nobody that does has managed to explain my logic problem either: loads and stores to long and ptr have always been considered to be atomic, test_bit is atomic; so why should atomic_t be another special subclass of atomic loads and stores? (and yes, it is perfectly legitimate to want a non-volatile read for a data type that you also want to do atomic RMW operations on) Why are people making these undocumented and just plain false assumptions about atomic_t? If they're using lockless code (ie. which they must be if using atomics), then they actually need to be thinking much harder about memory ordering issues. If that is too much for them, then they can just use locks. So, the architecture guys can implement atomic_read however they want -- as long as it cannot be optimized away. They can implement it however they want as long as it stays atomic. Precisely. And volatility is a key property of atomic. Let's please not throw it away.
It isn't, though (at least not since i386 and x86-64 don't have it). _Adding_ it is trivial, and can be done any time. Throwing it away (ie. making the API weaker) is _hard_. So let's not add it without really good reasons. It most definitely results in worse code generation in practice. I don't know why people would assume volatile of atomics. AFAIK, most of the documentation is pretty clear that all the atomic stuff can be reordered etc. except for those that modify and return a value. It isn't even intuitive: `*lp = value` is like the most fundamental atomic operation in Linux. -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Paul Mackerras wrote: Nick Piggin writes: So i386 and x86-64 don't have volatiles there, and it saves them a few K of kernel text. What you need to justify is why it is a good I'm really surprised it's as much as a few K. I tried it on powerpc and it only saved 40 bytes (10 instructions) for a G5 config. Paul. I'm surprised too. Numbers were from the ...use asm() like the other atomic operations already do thread. According to them:

   text    data    bss     dec     hex filename
3434150  249176 176128 3859454  3ae3fe atomic_normal/vmlinux
3436203  249176 176128 3861507  3aec03 atomic_volatile/vmlinux

The first one is a stock kernel, the second is with atomic_read/set cast to volatile. gcc-4.1 -- maybe if you have an earlier gcc it won't optimise as much? -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Paul Mackerras wrote: Nick Piggin writes: Why are people making these undocumented and just plain false assumptions about atomic_t? Well, it has only been false since December 2006. Prior to that atomics *were* volatile on all platforms. Hmm, although I don't think it has ever been guaranteed by the API documentation (concede documentation is often not treated as the authoritative source here, but for atomic it is actually very good and obviously indispensable as the memory ordering reference). If they're using lockless code (ie. which they must be if using atomics), then they actually need to be thinking much harder about memory ordering issues. Indeed. I believe that most uses of atomic_read other than in polling loops or debug printk statements are actually racy. In some cases the race doesn't seem to matter, but I'm sure there are cases where it does. If that is too much for them, then they can just use locks. Why use locks when you can just sprinkle magic fix-the-races dust (aka atomic_t) over your code? :) :) I agree with your skepticism of a lot of lockless code. But I think a lot of the more subtle race problems will not be fixed with volatile. The big, dumb infinite loop bugs would be fixed, but they're pretty trivial to debug and even audit for. Precisely. And volatility is a key property of atomic. Let's please not throw it away. It isn't, though (at least not since i386 and x86-64 don't have it). Conceptually it is, because atomic_t is specifically for variables which are liable to be modified by other CPUs, and volatile _means_ liable to be changed by mechanisms outside the knowledge of the compiler. Usually that is the case, yes. But also most of the time we don't care that it has been changed and don't mind it being reordered or eliminated. One of the only places we really care about that at all is for variables that are modified by the *same* CPU. _Adding_ it is trivial, and can be done any time. Throwing it away (ie. 
making the API weaker) is _hard_. So let's not add it without Well, in one sense it's not that hard - Linus did it just 8 months ago in commit f9e9dcb3. :) Well it would have been harder if the documentation also guaranteed that atomic_read/atomic_set was ordered. Or it would have been harder for _me_ to make such a change, anyway ;) really good reasons. It most definitely results in worse code generation in practice. 0.0008% increase in kernel text size on powerpc according to my measurement. :) I don't think you're making a bad choice by keeping it volatile on powerpc and waiting for others to shake out more of the bugs. You get to fix everybody else's memory ordering bugs :) I don't know why people would assume volatile of atomics. AFAIK, most By making something an atomic_t you're saying other CPUs are going to be modifying this, so treat it specially. It's reasonable to assume that special treatment extends to reading and setting it. But I don't actually know what that special treatment is. Well actually, I do know that operations will never result in a partial modification being exposed. I also know that the operators that do not modify and return are not guaranteed to have any sort of ordering constraints. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/24] make atomic_read() behave consistently on frv
Paul E. McKenney wrote: On Tue, Aug 14, 2007 at 03:34:25PM +1000, Nick Piggin wrote: Maybe it is the safe way to go, but it does obscure cases where there is a real need for barriers. I prefer burying barriers into other primitives. When they should naturally be there, eg. locking or the RCU primitives, I agree. I don't like having them scattered in various "just in case" places, because it makes both the users and the APIs hard to understand and change. Especially since several big architectures don't have volatile in their atomic_get and _set, I think it would be a step backwards to add them in as a "just in case" thing now (unless there is a better reason). Many atomic operations are allowed to be reordered between CPUs, so I don't have a good idea for the rationale to order them within the CPU (also loads and stores to long and ptr types are not ordered like this, although we do consider those to be atomic operations too). barrier() in a way is like enforcing sequential memory ordering between process and interrupt context, whereas volatile is just enforcing coherency of a single memory location (and as such is cheaper). barrier() is useful, but it has the very painful side-effect of forcing the compiler to dump temporaries. So we do need something that is not quite so global in effect. Yep. What do you think of this crazy idea?

/* Enforce a compiler barrier for only operations to location X.
 * Call multiple times to provide an ordering between multiple
 * memory locations. Other memory operations can be assumed by
 * the compiler to remain unchanged and may be reordered */
#define order(x) asm volatile("" : "+m" (x))

There was something very similar discussed earlier in this thread, with quite a bit of debate as to exactly what the "m" constraint should look like. I suggested something similar named ACCESS_ONCE in the context of RCU (http://lkml.org/lkml/2007/7/11/664): Oh, I missed that earlier debate. Will go have a look.
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The nice thing about this is that it works for both loads and stores. Not clear that order() above does this -- I get compiler errors when I try something like b = order(a) or order(a) = 1 using gcc 4.1.2. As Arnd pointed out, order() is not supposed to be an lvalue, but a statement like the rest of our memory ordering API. As to your followup question of why to use it over ACCESS_ONCE: I guess, aside from consistency with the rest of the barrier APIs, you can use it in other primitives when you don't actually know what the caller is going to do or if it even will make an access. You could also use it between calls to _other_ primitives, etc... it just seems more flexible to me, but I haven't actually used such a thing in real code... ACCESS_ONCE doesn't seem as descriptive. What it results in is the memory location being loaded or stored (presumably once exactly), but I think the more general underlying idea is a barrier point. -- SUSE Labs, Novell Inc.
Re: [PATCH 6/24] make atomic_read() behave consistently on frv
Segher Boessenkool wrote: Please check the definition of cache coherence. Which of the twelve thousand such definitions? :-) Every definition I have seen says that writes to a single memory location have a serial order as seen by all CPUs, and that a read will return the most recent write in the sequence (with a bit of extra mumbo jumbo to cover store queues and store forwarding). Within such a definition, I don't see how it would be allowed for a single CPU to execute reads out of order and have the second read return an earlier write than the first read. -- SUSE Labs, Novell Inc.
Re: [PATCH 6/24] make atomic_read() behave consistently on frv
Paul E. McKenney wrote: On Wed, Aug 15, 2007 at 11:30:05PM +1000, Nick Piggin wrote: Especially since several big architectures don't have volatile in their atomic_get and _set, I think it would be a step backwards to add them in as a "just in case" thing now (unless there is a better reason). Good point, except that I would expect gcc's optimization to continue to improve. I would like the kernel to be able to take advantage of improved optimization, which means that we are going to have to make a few changes. Adding volatile to atomic_get() and atomic_set() is IMHO one of those changes. What optimisations? gcc already does most of the things you need a barrier/volatile for, like reordering non-dependent loads and stores, and eliminating mem ops completely by caching in registers. As to your followup question of why to use it over ACCESS_ONCE: I guess, aside from consistency with the rest of the barrier APIs, you can use it in other primitives when you don't actually know what the caller is going to do or if it even will make an access. You could also use it between calls to _other_ primitives, etc... it just seems more flexible to me, but I haven't actually used such a thing in real code... ACCESS_ONCE doesn't seem as descriptive. What it results in is the memory location being loaded or stored (presumably once exactly), but I think the more general underlying idea is a barrier point. OK, first, I am not arguing that ACCESS_ONCE() can replace all current uses of barrier(). OK. Well I also wasn't saying that ACCESS_ONCE should not be implemented. But if we want something like it, then it would make sense to have an equivalent barrier statement as well (ie. order()). -- SUSE Labs, Novell Inc.
Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
Segher Boessenkool wrote: Part of the motivation here is to fix heisenbugs. If I knew where they By the same token we should probably disable optimisations altogether since that too can create heisenbugs. Almost everything is a tradeoff; and so is this. I don't believe most people would find disabling all compiler optimisations an acceptable price to pay for some peace of mind. So why is this a good tradeoff? I also think that just adding things to APIs in the hope it might fix up some bugs isn't really a good road to go down. Where do you stop? On the actual proposal to make atomic_* operators volatile: I think the better approach in the long term, for both maintainability of the code and education of coders, is to make the use of barriers _more_ explicit rather than sprinkling these "just in case" ones around. You may get rid of a few atomic_read heisenbugs (in the noise when compared to all bugs), but if the coder was using a regular atomic load, or a test_bit (which is also atomic), etc. then they're going to have problems. It would be better for Linux if everyone was to have better awareness of barriers than to hide some of the cases where they're required. A pretty large number of bugs I see in lock-free code in the VM is due to memory ordering problems. It's hard to find those bugs, or even be aware when you're writing buggy code, if you don't have some feel for barriers. -- SUSE Labs, Novell Inc.
Re: [PATCH 6/24] make atomic_read() behave consistently on frv
Paul E. McKenney wrote: On Mon, Aug 13, 2007 at 01:15:52PM +0800, Herbert Xu wrote: Paul E. McKenney [EMAIL PROTECTED] wrote: On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote: Chris Snook [EMAIL PROTECTED] wrote: cpu_relax() contains a barrier, so it should do the right thing. For non-smp architectures, I'm concerned about interacting with interrupt handlers. Some drivers do use atomic_* operations. What problems with interrupt handlers? Access to int/long must be atomic or we're in big trouble anyway. Reordering due to compiler optimizations. CPU reordering does not affect interactions with interrupt handlers on a given CPU, but reordering due to compiler code-movement optimization does. Since volatile can in some cases suppress code-movement optimizations, it can affect interactions with interrupt handlers. If such reordering matters, then you should use one of the *mb macros or barrier() rather than relying on a possibly hidden volatile cast. If communicating among CPUs, sure. However, when communicating between mainline and interrupt/NMI handlers on the same CPU, the barrier() and most especially the *mb() macros are gross overkill. So there really truly is a place for volatile -- not a large place, to be sure, but a place nonetheless. I really would like all volatile users to go away and be replaced by explicit barriers. It makes things nicer and more explicit... for the atomic_t type there probably aren't many optimisations that can be made which volatile would disallow (in actual kernel code), but for others (eg. bitops, maybe atomic ops in UP kernels), there would be. Maybe it is the safe way to go, but it does obscure cases where there is a real need for barriers. Many atomic operations are allowed to be reordered between CPUs, so I don't have a good idea for the rationale to order them within the CPU (also loads and stores to long and ptr types are not ordered like this, although we do consider those to be atomic operations too).
barrier() in a way is like enforcing sequential memory ordering between process and interrupt context, whereas volatile is just enforcing coherency of a single memory location (and as such is cheaper). What do you think of this crazy idea?

/* Enforce a compiler barrier for only operations to location X.
 * Call multiple times to provide an ordering between multiple
 * memory locations. Other memory operations can be assumed by
 * the compiler to remain unchanged and may be reordered */
#define order(x) asm volatile("" : "+m" (x))

-- SUSE Labs, Novell Inc.
Re: [PATCH 6/24] make atomic_read() behave consistently on frv
Chris Snook wrote: David Howells wrote: Chris Snook [EMAIL PROTECTED] wrote: cpu_relax() contains a barrier, so it should do the right thing. For non-smp architectures, I'm concerned about interacting with interrupt handlers. Some drivers do use atomic_* operations. I'm not sure that actually answers my question. Why not smp_rmb()? David I would assume because we want to waste time efficiently even on non-smp architectures, rather than frying the CPU or draining the battery. Certain looping execution patterns can cause the CPU to operate above thermal design power. I have fans on my workstation that only ever come on when running LINPACK, and that's generally memory bandwidth-bound. Just imagine what happens when you're executing the same few non-serializing instructions in a tight loop without ever stalling on memory fetches, or being scheduled out. If there's another reason, I'd like to hear it too, because I'm just guessing here. Well if there is only one memory location involved, then smp_rmb() isn't going to really do anything anyway, so it would be incorrect to use it. Consider that smp_rmb basically will do anything from flushing the pipeline to invalidating loads speculatively executed out of order. AFAIK it will not control the visibility of stores coming from other CPUs (that is up to the cache coherency). -- SUSE Labs, Novell Inc.
Re: [PATCH 3/3] AFS: Implement basic file write support
David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: Why do you call SetPageUptodate when the page is not up to date? That leaks uninitialised data, AFAIKS. It only seems that way. If afs_prepare_write() is called, but doesn't return an error, then afs_commit_write() will be called, and it seems that the copy in of the data will be guaranteed not to fail by the caller. Not only does it seem that way, it is that way :) PG_uptodate is being set when the page is not uptodate, isn't it? Furthermore, afs_prepare_page() will have filled in the missing bits. And whilst all that is going on, the page lock will be held by the caller, so that no-one else can access the partially complete page. When a page is uptodate in pagecache, the generic read and nopage functions do not take the page lock. So how are you excluding those? -- SUSE Labs, Novell Inc.
Re: [PATCH 3/3] AFS: Implement basic file write support
David Howells wrote:

+/*
+ * prepare a page for being written to
+ */
+static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
+			    struct key *key, unsigned offset, unsigned to)
+{
+	unsigned eof, tail, start, stop, len;
+	loff_t i_size, pos;
+	void *p;
+	int ret;
+
+	_enter("");
+
+	if (offset == 0 && to == PAGE_SIZE)
+		return 0;
+
+	p = kmap(page);
+
+	i_size = i_size_read(&vnode->vfs_inode);
+	pos = (loff_t) page->index << PAGE_SHIFT;
+	if (pos >= i_size) {
+		/* partial write, page beyond EOF */
+		_debug("beyond");
+		if (offset > 0)
+			memset(p, 0, offset);
+		if (to < PAGE_SIZE)
+			memset(p + to, 0, PAGE_SIZE - to);
+		kunmap(page);
+		return 0;
+	}
+
+	if (i_size - pos >= PAGE_SIZE) {
+		/* partial write, page entirely before EOF */
+		_debug("before");
+		tail = eof = PAGE_SIZE;
+	} else {
+		/* partial write, page overlaps EOF */
+		eof = i_size - pos;
+		_debug("overlap %u", eof);
+		tail = max(eof, to);
+		if (tail < PAGE_SIZE)
+			memset(p + tail, 0, PAGE_SIZE - tail);
+		if (offset > eof)
+			memset(p + eof, 0, PAGE_SIZE - eof);
+	}
+
+	kunmap(page);
+
+	ret = 0;
+	if (offset > 0 || eof > to) {
+		/* need to fill one or two bits that aren't going to be written
+		 * (cover both fillers in one read if there are two) */
+		start = (offset > 0) ? 0 : to;
+		stop = (eof > to) ?
+			eof : offset;
+		len = stop - start;
+		_debug("wr=%u-%u av=0-%u [EMAIL PROTECTED]",
+		       offset, to, eof, start, len);
+		ret = afs_fill_page(vnode, key, start, len, page);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * prepare to perform part of a write to a page
+ * - the caller holds the page locked, preventing it from being written out or
+ *   modified by anyone else
+ */
+int afs_prepare_write(struct file *file, struct page *page,
+		      unsigned offset, unsigned to)
+{
+	struct afs_writeback *candidate, *wb;
+	struct afs_vnode *vnode = AFS_FS_I(file->f_dentry->d_inode);
+	struct key *key = file->private_data;
+	pgoff_t index;
+	int ret;
+
+	_enter("{%x:%u},{%lx},%u,%u",
+	       vnode->fid.vid, vnode->fid.vnode, page->index, offset, to);
+
+	candidate = kzalloc(sizeof(*candidate), GFP_KERNEL);
+	if (!candidate)
+		return -ENOMEM;
+	candidate->vnode = vnode;
+	candidate->first = candidate->last = page->index;
+	candidate->offset_first = offset;
+	candidate->to_last = to;
+	candidate->usage = 1;
+	candidate->state = AFS_WBACK_PENDING;
+	init_waitqueue_head(&candidate->waitq);
+
+	if (!PageUptodate(page)) {
+		_debug("not up to date");
+		ret = afs_prepare_page(vnode, page, key, offset, to);
+		if (ret < 0) {
+			kfree(candidate);
+			_leave(" = %d [prep]", ret);
+			return ret;
+		}
+		SetPageUptodate(page);
+	}

Why do you call SetPageUptodate when the page is not up to date? That leaks uninitialised data, AFAIKS. -- SUSE Labs, Novell Inc.
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tue, Mar 06, 2007 at 02:20:55PM -0800, David Miller wrote: From: Robert Olsson [EMAIL PROTECTED] Date: Tue, 6 Mar 2007 14:26:04 +0100 David Miller writes: Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow. AFAIK the equilibrium is a resizing function as well, but using a fixed hash table. So can we do without equilibrium resizing if tables are dynamic? I think so. With the hash data structure we could monitor the average chain length or just size and resize the hash after that. I'm not so sure, it may be a mistake to eliminate the equilibrium logic. One error I think it does have is the usage of chain length. Even a nearly perfect hash has small lumps in distribution, and we should not penalize entries which fall into these lumps. Let us call T the threshold at which we would grow the routing hash table. As we approach T we start to GC. Let's assume the hash table has shift = 2, and T (with the T=N+(N>>1) algorithm) would therefore be 6.

TABLE: [0] DST1, DST2
       [1] DST3, DST4, DST5

DST6 arrives, what should we do? If we just accept it and don't GC some existing entries, we will grow the hash table. This is the wrong thing to do if our true working set is smaller than 6 entries and thus some of the existing entries are unlikely to be reused and thus could be purged to keep us from hitting T. If they are all active, growing is the right thing to do. This is the crux of the whole routing cache problem. I guess this is similar to our problems with bdev and filesystem caches as well. What we do in that case (as you would know), is to let the caches expand to the size of memory, and the problem just becomes balancing their relative importance.
We just try to go with a reasonable default, and provide a knob or two for fine tuning. I am of the opinion that LRU, for routes not attached to sockets, is probably the best thing to do here. Furthermore at high packet rates, the current rt_may_expire() logic probably is not very effective since its granularity is limited to jiffies. We can quite easily create 100,000 or more entries per jiffie when HZ=100 during rDOS, for example. So perhaps some global LRU algorithm using ktime is more appropriate. Global LRU is not easy without touching a lot of memory. But I'm sure some clever trick can be discovered by someone :) Well we do a pseudo LRU in most vm/vfs caches, where we just set a bit if it has been touched since last checked. Add another working set list to promote popular routes into, and you have something like our pagecache active/inactive reclaim. I don't know if that really applies here, but it might if you decide to try hooking this cache into the slab shrinker thingy...
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Mon, Mar 05, 2007 at 08:26:32PM -0800, David Miller wrote: This is essentially a port of Nick Piggin's dcache hash table patches to the routing cache. It solves the locking issues during table grow/shrink that I couldn't handle properly last time I tried to code up a patch like this. But one of the core issues of this kind of change still remains. There is a conflict between the desire of routing cache garbage collection to reach a state of equilibrium and the hash table grow code's desire to match the table size to the current state of affairs. Actually, more accurately, the conflict exists in how this GC logic is implemented. The core issue is that hash table size guides the GC processing, and hash table growth therefore modifies those GC goals. So with the patch below we'll just keep growing the hash table instead of giving GC some time to try to keep the working set in equilibrium before doing the hash grow. One idea is to put the hash grow check in the garbage collector, and put the hash shrink check in rt_del(). In fact, it would be a good time to perhaps hack up some entirely new passive GC logic for the routing cache. BTW, another thing that plays into this is that Robert's TRASH work could make this patch not necessary :-) Finally, I know that (due to some of Nick's helpful comments the other day) that I'm missing some rcu_assign_pointer()'s in here. Fixes in this area are most welcome. Cool! I have some fixes for the rcu barrier issues, with some C-style comments and questions :) I was going to send you a fix first for the rcu barriers, then a second to convert the read-side to a barrier-less one that I described, however considering that your patch is a WIP in progress anyway, I won't worry too much about the normal protocol. I _think_ my reasoning regarding the rcu barriers and grace periods is correct. I'll keep thinking about it though. (Paul cc'ed). 
I'm not so familiar with this code, so I have sprinkled around a lot of comments that could be pure crap ;) They are mainly just to help you ensure that you cover all bases... compile tested only at this stage.

--

Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -311,6 +311,8 @@ static void rthash_free(struct rt_hash_b
 static unsigned int rt_hash_code(struct rt_hash *hashtable,
				 u32 daddr, u32 saddr)
 {
+	/* BUG_ON(!rcu_read_protected()) */
+
	return (jhash_2words(daddr, saddr, rt_hash_rnd) & hashtable->mask);
 }
@@ -343,11 +345,16 @@ static void rt_hash_resize_work(struct w
	old_rt_hash = rt_hash;
	/*
-	 * ensure that if the reader sees the new dentry_hash,
-	 * then they will also see the old_dentry_hash assignment,
-	 * above.
+	 * ensure that if the reader sees the new rt_hash, then they will also
+	 * see the old_rt_hash assignment, above. synchronize_rcu() is used
+	 * rather than smp_wmb(), in order to avoid the smp_rmb() in the
+	 * read-side. However synchronize_rcu() also implies a smp_wmb(), so
+	 * that also means we can skip rcu_assign_pointer().
+	 *
+	 * The readers can then also skip rcu_dereference, because a grace
+	 * period implies that all readers have performed memory barriers.
	 */
-	smp_wmb();
+	synchronize_rcu();
	rt_hash = new_hash;
	synchronize_rcu();
@@ -1100,6 +1107,8 @@ static int rt_intern_hash(struct rt_hash
	int chain_length;
	int attempts = !in_softirq();
+	/* BUG_ON(!rcu_read_protected()) */
+
 restart:
	chain_length = 0;
	min_score = ~(u32)0;
@@ -1286,6 +1295,8 @@ static void rt_del(struct rt_hash *h, un
 {
	struct rtable **rthp;
+	/* BUG_ON(!rcu_read_protected()) */
+
	spin_lock_bh(rt_hash_lock_addr(hash));
	ip_rt_put(rt);
	for (rthp = &h->table[hash].chain; *rthp;
@@ -1328,12 +1339,24 @@ void ip_rt_redirect(__be32 old_gw, __be3
	for (i = 0; i < 2; i++) {
		for (k = 0; k < 2; k++) {
-			struct rt_hash *h = rt_hash;
-			unsigned hash = rt_hashfn(h, daddr, skeys[i], ikeys[k]);
+			struct rt_hash *h;
+			unsigned hash;
+
+			/*
+			 * rcu_read_lock() must cover the load of rt_hash, in
+			 * order to satisfy our RCU protected dynamic hash
+			 * sizing scheme; and it must also cover the hash list
+			 * traversal, to satisfy our RCU protected lockless
+			 * hash entry lookups.
+			 *
+			 * This note applies throughout the file.
+			 */
+			rcu_read_lock();
+			h =
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tue, Mar 06, 2007 at 01:17:06AM -0800, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Tue, 6 Mar 2007 10:11:12 +0100

@@ -1449,6 +1472,12 @@ static struct dst_entry *ipv4_negative_a
			"%u.%u.%u.%u/%02x dropped\n",
			NIPQUAD(rt->rt_dst), rt->fl.fl4_tos);
 #endif
+			/* XXX:
+			 * What if rt does not exist in rt_hash, but is in
+			 * old_rt_hash? Don't we have to also check there?
+			 * Similar questions for a couple of other places that
+			 * look at rt_hash, but not old_rt_hash.
+			 */
			rt_del(h, hash, rt);
			ret = NULL;
		}

For the cases like ip_rt_redirect() I made the decision that we'll just not add the complexity of having to look in the old_rt_hash table. In these kinds of cases it's OK to miss events, they will just happen again. It's one of the nice things about the routing cache: if you lose information it's OK, because we'll just cook up a new entry from the persistent backing store that is the real routing tables. And for events like redirects, if we miss it, we'll just send the packet to the wrong next-hop again and receive another redirect message which we'll (hopefully) propagate to a routing cache entry. Ah, that's a very neat trick. OK, so with that question out of the way, there _may_ just be a few other places where you're working with an rt_hash table outside an rcu read critical section. I tried to fix up a couple of obvious ones, and I've just put in the BUG_ON assertions for you to verify the rest. Thanks for your feedback patch Nick, I'll process it tomorrow hopefully after getting some sleep. No problem. Thanks.
Re: [RFC PATCH]: Dynamically sized routing cache hash table.
On Tue, Mar 06, 2007 at 10:23:44AM +0100, Eric Dumazet wrote: On Tuesday 06 March 2007 10:11, Nick Piggin wrote: Cool! I have some fixes for the rcu barrier issues, with some C-style comments and questions :) I was going to send you a fix first for the rcu barriers, then a second to convert the read-side to a barrier-less one that I described, however considering that your patch is a WIP in progress anyway, I won't worry too much about the normal protocol. I _think_ my reasoning regarding the rcu barriers and grace periods is correct. I'll keep thinking about it though. (Paul cc'ed). I'm not so familiar with this code, so I have sprinkled around a lot of comments that could be pure crap ;) They are mainly just to help you ensure that you cover all bases... compile tested only at this stage. I think we missed:

+static void rt_hash_resize_work(struct work_struct *work)
...
+				*head = rth->u.dst.rt_next;
+
+				hash = rt_hashfn(rt_hash,
+						 rth->fl.fl4_dst,
+						 rth->fl.fl4_src,
+						 iface);
+				rth->u.dst.rt_next = rt_hash->table[hash].chain;
+				rt_hash->table[hash].chain = rth;

This really needs some ..._del_rcu()/..._add_rcu() primitives, no? Or else a reader might be very confused... I'm not sure... this code really depends on the hash table management, rather than the management of the hash tables, if you understand me ;) From what I can _see_, this is similar to how rt_intern_hash does it. I don't know exactly why rt_intern_hash can get away without using rcu_assign_pointer in some cases, however: Note that we don't need an rcu_assign_pointer for this, because the memory operations that initialized the entry have already been ordered when it was first inserted.
Re: [PATCH] mm: fix page_mkclean_one
Mark Groves wrote: Hi, I have been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Your issue is not related, firstly because the page_mkclean bug did not exist before 2.6.19 kernels. Anyway, I had a look at your bugzilla test-case and managed to slim it down to something that easily shows what the problem is (available on request) -- the problem is that the recipient of the sendfile is seeing modifications that occur to the source file _after_ the sender has completed the sendfile, because the file pages are not copied but queued. I think the usual approach to what you are trying to do is to set TCP_CORK, then write(2) the header into the socket, then sendfile directly from the file you want. Another approach I guess is to implement an ack in your userland protocol so you do not modify the sendfile source file until the client acks that it has all the data. I'm not sure if there are any other usual ways to do this (ie. a barrier for sendfile, to ensure it will not pick up future modifications to the file). netdev cc'ed, someone there might have additional comments. Please close this bug if/when you are satisfied it is not a kernel problem. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
Evgeniy Polyakov wrote: On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote: Isn't the provided solution just an in-kernel variant of the SCHED_FIFO set from userspace? Why should the kernel be able to mark some users as having higher priority? What if the workload of the system is targeted not at maximum TCP performance, but at maximum other-task performance, which will be broken by the provided patch? David's line of thinking for a solution sounds better to me. This patch does not prevent the process from being preempted (for potentially a long time), by any means. -- SUSE Labs, Novell Inc.
Re: rename *MEMALLOC flags
Paul Jackson wrote: Daniel wrote: Inventing a new name for an existing thing is very poor taste on grounds of grepability alone. I wouldn't say 'very poor taste' -- just something that should be done infrequently, with good reason, and with reasonable consensus, especially from the key maintainers in the affected area. Good names are good taste, in my book. But stable naming is good too. I wonder what Nick thinks of this? Looks like he added __GFP_NOMEMALLOC a year ago, following the naming style of PF_MEMALLOC. I added him to the cc list. __GFP_NOMEMALLOC was added to prevent mempool backed allocations from accessing the emergency reserve. Because that would just shift deadlocks from mempool safe sites to those which have not been converted. PF_MEMALLOC is a good name: PF_MEMALLOC says that the task is currently allocating memory. It does not say anything about the actual allocator implementation details to handle this (1. don't recurse into reclaim; 2. allow access to reserves), but that is a good thing. __GFP_NOMEMALLOC and __GFP_MEMALLOC are poorly named (I take the blame). It isn't that the task is suddenly no longer allocating in the context of an allocation, it is just that you want to allow or deny access to the reserve. __GFP_NOMEMALLOC should be something like __GFP_EMERG_NEVER and __GFP_MEMALLOC should be _ALWAYS. Or something like that. NOMEMALLOC is specific enough that I don't mind a rename at this stage. Renaming PF_MEMALLOC would be wrong, however. -- SUSE Labs, Novell Inc.
Re: [PATCH v2 4/7] AMSO1100 Memory Management.
Tom Tucker wrote: On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote: On Wed, 07 Jun 2006 15:06:55 -0500 Steve Wise [EMAIL PROTECTED] wrote:

+void c2_free(struct c2_alloc *alloc, u32 obj)
+{
+	spin_lock(&alloc->lock);
+	clear_bit(obj, alloc->table);
+	spin_unlock(&alloc->lock);
+}

The spinlock is unneeded here. Good point. Really? clear_bit does not give you any memory ordering, so you can have the situation where another CPU sees the bit cleared, but this CPU still has stores pending to whatever it is being freed. Or any number of other nasty memory ordering badness. I'd just use the spinlocks, and prepend the clear_bit with a double underscore (so you get the non-atomic version), if that is appropriate. The spinlocks nicely handle all the memory ordering issues, and serve to document concurrency issues. If you need every last bit of performance and scalability, that's OK, but you need comments and I suspect you'd need more memory barriers. -- SUSE Labs, Novell Inc.
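A userspace sketch of the pattern suggested above: keep the lock and do a plain (non-atomic) bit clear inside it, since the lock's acquire/release already provides the memory ordering that a bare clear_bit does not. The spinlock here is a hypothetical stand-in built from a C11 atomic_flag; the names are illustrative, not the AMSO1100 code:

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel objects; illustrative only. */
struct bit_alloc {
	atomic_flag lock;	/* crude spinlock: acquire/release ordering */
	unsigned long table;	/* one word of allocation bits */
};

static void bit_lock(struct bit_alloc *a)
{
	while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
		;	/* spin */
}

static void bit_unlock(struct bit_alloc *a)
{
	atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

/* Allocate the lowest clear bit; returns -1 when full. */
static int bit_alloc_get(struct bit_alloc *a)
{
	int obj = -1;
	bit_lock(a);
	for (int i = 0; i < (int)(8 * sizeof(unsigned long)); i++) {
		if (!(a->table & (1UL << i))) {
			a->table |= 1UL << i;	/* plain RMW: the lock orders it */
			obj = i;
			break;
		}
	}
	bit_unlock(a);
	return obj;
}

/* Free: the __clear_bit()-under-lock idea from above. The unlock's
 * release barrier makes prior stores to the freed object visible
 * before another CPU can observe the bit as clear. */
static void bit_alloc_put(struct bit_alloc *a, int obj)
{
	bit_lock(a);
	a->table &= ~(1UL << obj);
	bit_unlock(a);
}
```

The point is exactly Nick's: the atomicity of the bit operation buys nothing here, the ordering comes from the lock, so a non-atomic clear under the lock is both cheaper and correct.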
Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh
Ravikiran G Thirumalai wrote: On Thu, Mar 09, 2006 at 07:14:26PM +1100, Nick Piggin wrote: Ravikiran G Thirumalai wrote: Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches. This keeps local_t unsigned long. (We can change it to signed value along with other arches later in one go I guess) Why not just keep naming and structure of interfaces consistent with atomic_t? That would be signed and 32-bit. You then also have a local64_t. No, local_t is supposed to be 64-bits on 64bits arches and 32 bit on 32 bit arches. x86_64 was the only exception, so this patch fixes that. Right. If it wasn't I wouldn't have proposed the change. Considering that local_t has been broken so that basically nobody is using it, now is a great time to rethink the types before it gets fixed and people start using it. And modelling the type on the atomic types would make the most sense because everyone already knows them. -- SUSE Labs, Novell Inc.
Re: [PATCH] avoid atomic op on page free
Benjamin LaHaise wrote: Hello Andrew et al, The patch below adds a fast path that avoids the atomic dec and test operation and spinlock acquire/release on page free. This is especially important to the network stack which uses put_page() to free user buffers. Removing these atomic ops helps improve netperf on the P4 from ~8126Mbit/s to ~8199Mbit/s (although that number fluctuates quite a bit with some runs getting 8243Mbit/s). There are probably better workloads to see an improvement from this on, but removing 3 atomics and an irq save/restore is good. -ben You can't do this because you can't test PageLRU like that. Have a look in the lkml archives a few months back, where I proposed a way to do this for __free_pages(). You can't do it for put_page. BTW I have quite a large backlog of patches in -mm which should end up avoiding an atomic or two around these parts.
Re: [PATCH] avoid atomic op on page free
Benjamin LaHaise wrote: On Mon, Mar 06, 2006 at 04:50:39PM -0800, Andrew Morton wrote: Am a bit surprised at those numbers. Because userspace has to do peculiar things to get its pages taken off the LRU. What exactly was that application doing? It's just a simple send() and recv() pair of processes. Networking uses pages for the buffer on user transmits. Those pages tend to be freed in irq context on transmit or in the receiver if the traffic is local. The patch adds slight overhead to the common case while providing improvement to what I suspect is a very uncommon case? At least on any modern CPU with branch prediction, the test is essentially free (2 memory reads that pipeline well, iow 1 cycle, maybe 2). The upside is that you get to avoid the atomic (~17 cycles on a P4 with a simple test program, the penalty doubles if there is one other instruction that operates on memory in the loop), disabling interrupts (~20 cycles? I don't remember), another atomic for the spinlock, another atomic for TestClearPageLRU() and the pushf/popf (expensive as they rely on whatever instruction that might still be in flight to complete and add the penalty for changing irq state). That's at least 70 cycles without including the memory barrier side effects which can cost 100 cycles+. Add in the costs for the cacheline bouncing of the lru_lock and we're talking *expensive*. My patches in -mm avoid the lru_lock and disabling/enabling interrupts if the page is not on lru too, btw. So, a 1-2 cycle cost for a case that normally takes from 17 to 100+ cycles? I think that's worth it given the benefits. Also, I think the common case (page cache read / map) is something that should be done differently, as those atomics really do add up to major pain. Using rcu for page cache reads would be truly wonderful, but that will take some time. It is not very difficult to implement (and is something I intend to look at after I finish my lockless pagecache).
But it has quite a lot of problems, including a potentially big (temporal) increase of cache footprint to process the pages, more CPU time in general to traverse the lists, increased over / underflows in the per cpu pagelists. Possibly even worse would be the increased overhead on the RCU infrastructure and potential OOM conditions. Not to mention the extra logic involved to either retry, or fall back to get/put in the case that the userspace target page is not resident. I'd say it will turn out to be more trouble than it's worth, for the miserly cost of avoiding one atomic_inc and one atomic_dec_and_test on page-local data that will be in L1 cache. I'd never turn my nose up at anyone just having a go though :)
Re: [PATCH] avoid atomic op on page free
Benjamin LaHaise wrote: On Tue, Mar 07, 2006 at 12:53:27PM +1100, Nick Piggin wrote: You can't do this because you can't test PageLRU like that. Have a look in the lkml archives a few months back, where I proposed a way to do this for __free_pages(). You can't do it for put_page. Even if we know that we are the last user of the page (the count is 1)? Who can bump the page's count then? Yes. vmscan. Your page_count and PageLRU tests have no synchronisation between them, which is the problem AFAIKS. Anything can happen between them and they can probably also be executed out of order (the loads). BTW I have quite a large backlog of patches in -mm which should end up avoiding an atomic or two around these parts. That certainly looks like it will help. Not taking the spinlock unconditionally gets rid of quite a bit of the cost. Cool.
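To make the objection concrete, here is a minimal userspace sketch (toy types, not the kernel's struct page) of the safe pattern: a single atomic dec-and-test decides which CPU frees, rather than two separate, unsynchronised loads of the count and the LRU bit:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page; illustrative only. */
struct toy_page {
	atomic_int count;
	bool freed;
};

static void toy_free_page(struct toy_page *p)
{
	p->freed = true;
}

/* Safe release: one atomic dec-and-test picks the last putter.
 * A "fast path" that first loads the count and then tests an LRU
 * flag performs two unrelated loads with no ordering between them,
 * so another CPU (e.g. vmscan) can change either value in between,
 * which is exactly the race described above. */
static void toy_put_page(struct toy_page *p)
{
	if (atomic_fetch_sub_explicit(&p->count, 1, memory_order_acq_rel) == 1)
		toy_free_page(p);
}
```

With this shape there is exactly one point of agreement between CPUs (the atomic RMW), so the page can be freed exactly once.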
Re: [PATCH] avoid atomic op on page free
Benjamin LaHaise wrote: On Tue, Mar 07, 2006 at 01:04:36PM +1100, Nick Piggin wrote: I'd say it will turn out to be more trouble than it's worth, for the miserly cost of avoiding one atomic_inc and one atomic_dec_and_test on page-local data that will be in L1 cache. I'd never turn my nose up at anyone just having a go though :) The cost is anything but miserly. Consider that every lock instruction is a memory barrier which takes your OoO CPU with lots of instructions in flight to ramp down to just 1 for the time it takes that instruction to execute. That synchronization is what makes the atomic expensive. Yeah x86(-64) is a _little_ worse off in that regard because its locks imply rmbs. But I'm saying the cost is miserly compared to the likely overheads of using RCU-ed page freeing, when taken as impact on the system as a whole. Though definitely if we can get rid of atomic ops for free in any low level page handling functions in mm/ then we want to do that. In the case of netperf, I ended up with a 2.5Gbit/s (~30%) performance improvement through nothing but microoptimizations. There is method to my madness. ;-) Well... it was wrong too ;) But as you can see, I'm not against microoptimisations either and I'm glad others, like yourself, are looking at the problem too. The 30% number is very impressive. I'd be interested to see what the stuff currently in -mm is worth. -- SUSE Labs, Novell Inc.
Re: [PATCH] avoid atomic op on page free
Chen, Kenneth W wrote: Nick Piggin wrote on Monday, March 06, 2006 6:05 PM My patches in -mm avoid the lru_lock and disabling/enabling interrupts if the page is not on lru too, btw. Can you put the spin lock/unlock inside TestClearPageLRU()? The difference is subtle though. That's the idea, but you just need to do a little bit more so as not to introduce a race. http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc2/2.6.16-rc2-mm1/broken-out/mm-never-clearpagelru-released-pages.patch - Ken

--- ./mm/swap.c.orig	2006-03-06 19:25:10.680967542 -0800
+++ ./mm/swap.c	2006-03-06 19:27:02.334286487 -0800
@@ -210,14 +210,16 @@ int lru_add_drain_all(void)
 void fastcall __page_cache_release(struct page *page)
 {
 	unsigned long flags;
-	struct zone *zone = page_zone(page);
+	struct zone *zone;
 
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (TestClearPageLRU(page))
+	if (TestClearPageLRU(page)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		del_page_from_lru(zone, page);
-	if (page_count(page) != 0)
-		page = NULL;
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
+		if (page_count(page) != 0)
+			page = NULL;
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	}
 	if (page)
 		free_hot_page(page);
 }

-- SUSE Labs, Novell Inc.
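The reordering in the patch can be sketched in userspace terms (toy types; a C11 atomic exchange stands in for TestClearPageLRU and a counter stands in for the spinlock; all names are illustrative): the zone lock is only touched when the atomic test-and-clear actually saw the bit set:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins; illustrative only. */
struct toy_zone {
	int lock_taken;			/* counts lock acquisitions for the sketch */
};

struct toy_page {
	atomic_bool lru;		/* stand-in for PageLRU */
	struct toy_zone *zone;
	bool on_list;
	bool freed;
};

/* Release path shaped like the patch above: the lock is only taken
 * when TestClearPageLRU() (here an atomic exchange) succeeds, so
 * pages that were never on the LRU skip the lock and the irq
 * disable/enable entirely. */
static void toy_page_cache_release(struct toy_page *page)
{
	if (atomic_exchange(&page->lru, false)) {	/* TestClearPageLRU */
		page->zone->lock_taken++;		/* spin_lock_irqsave */
		page->on_list = false;			/* del_page_from_lru */
		/* spin_unlock_irqrestore */
	}
	page->freed = true;				/* free_hot_page */
}
```

The atomic exchange is the "little bit more" that avoids the race: only the winner of the test-and-clear proceeds to touch the LRU state under the lock.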
Re: [PATCH] NET : No need to update last_rx in loopback driver
Eric Dumazet wrote: Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

--- a/include/linux/netdevice.h	2006-02-07 11:55:42.0 +0100
+++ b/include/linux/netdevice.h	2006-02-09 09:23:15.0 +0100
@@ -649,6 +649,19 @@
 	return test_bit(__LINK_STATE_START, &dev->state);
 }

One minor suggestion I would have is to just load jiffies once (it is volatile). But it gets a little messy if you want to be tricky and still load it zero times in the case that the first test evaluates true. Maybe not worth the fuss.

{
	unsigned long now;
#ifndef CONFIG_SMP
	now = jiffies;
#else
	if (!(dev->flags & IFF_SLAVE) || ((now = jiffies) == dev->last_rx))
		return;
#endif
	dev->last_rx = now;
}

+static inline void netif_last_rx_update(struct net_device *dev)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * In order to avoid cache line ping pongs on last_rx, we check
+	 * if the device is a slave,
+	 * and if last_rx really has to be modified
+	 */
+	if (!(dev->flags & IFF_SLAVE) || (dev->last_rx == jiffies))
+		return;
+#endif
+	dev->last_rx = jiffies;
+}

-- SUSE Labs, Novell Inc.
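A userspace sketch of the single-load variant (toy device type; a volatile counter stands in for jiffies and the IFF_SLAVE value is illustrative): the volatile tick is read once into a local and reused for both the comparison and the store, and non-slave devices never dirty the cache line holding last_rx:

```c
/* Toy stand-ins; illustrative only. */
static volatile unsigned long toy_jiffies;

#define TOY_IFF_SLAVE 0x800

struct toy_dev {
	unsigned int flags;
	unsigned long last_rx;
};

/* SMP-style update: one volatile load of the tick, then only slave
 * devices whose stamp is stale write last_rx, avoiding cache line
 * ping-pong on the hot receive path. */
static void toy_last_rx_update(struct toy_dev *dev)
{
	unsigned long now = toy_jiffies;	/* single volatile load */

	if (!(dev->flags & TOY_IFF_SLAVE) || dev->last_rx == now)
		return;
	dev->last_rx = now;
}
```

Caching the volatile read in `now` guarantees the test and the assignment see the same tick value, which a second direct read of a volatile cannot.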
Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism
David S. Miller wrote: From: Matt Mackall [EMAIL PROTECTED] Date: Wed, 14 Dec 2005 21:02:50 -0800 There needs to be two rules: iff global memory critical flag is set - allocate from the global critical receive pool on receive - return packet to global pool if not destined for a socket with an attached send mempool This shuts off a router and/or firewall just because iSCSI or NFS peed in its pants. Not really acceptable. But that should only happen (shut off a router and/or firewall) in cases where we now completely deadlock and never recover, including shutting off the router and firewall, because they don't have enough memory to recv packets either. I think this will provide the desired behavior It's not desirable. What if iSCSI is protected by IPSEC, and the key management daemon has to process a security association expiration and negotiate a new one in order for iSCSI to further communicate with its peer when this memory shortage occurs? It needs to send packets back and forth with the remote key management daemon in order to do this, but since you cut it off with this critical receive pool, the negotiation will never succeed. I guess IPSEC would be a critical socket too, in that case. Sure there is nothing we can do if the daemon insists on allocating lots of memory... This stuff won't work. It's not a generic solution and that's why it has more holes than swiss cheese. :-) True it will have holes. I think something that is complementary and would be desirable is to simply limit the amount of in-flight writeout that things like NFS allows (or used to allow; haven't checked for a while, and there were noises about it getting better). -- SUSE Labs, Novell Inc.
Re: [PATCH] struct file cleanup : the very large file_ra_state is now allocated only on demand.
Andi Kleen wrote: I would just set the ra pointer to a single global structure if the allocation fails. Then you can avoid all the other checks. It will slow down things and trash some state, but not fail and nobody should expect good performance after out of memory anyways. The only check still needed would be on freeing. You don't want to always have bad performance though, so you could attempt to allocate if either the pointer is null _or_ it points to the global structure?
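A minimal sketch of the combined idea (hypothetical names; plain malloc/calloc stands in for the kernel allocator): fall back to a shared global structure on allocation failure, but retry a private allocation whenever the pointer is still aimed at the global, so only the free path needs an extra check:

```c
#include <stdlib.h>

/* Hypothetical readahead-state stand-in; illustrative only. */
struct toy_ra_state {
	int pattern;
};

/* Shared fallback: always valid to use, but its state may be trashed
 * by concurrent users, so performance (not correctness) suffers. */
static struct toy_ra_state toy_global_ra;

/* Retry a private allocation while the slot is NULL or still points
 * at the global fallback; fail_alloc simulates an OOM allocation. */
static struct toy_ra_state *toy_get_ra(struct toy_ra_state **slot, int fail_alloc)
{
	if (*slot == NULL || *slot == &toy_global_ra) {
		struct toy_ra_state *ra = fail_alloc ? NULL
						     : calloc(1, sizeof(*ra));
		*slot = ra ? ra : &toy_global_ra;
	}
	return *slot;
}

/* Freeing needs the one remaining check: never free the global. */
static void toy_put_ra(struct toy_ra_state **slot)
{
	if (*slot && *slot != &toy_global_ra)
		free(*slot);
	*slot = NULL;
}
```

Every user always gets a usable pointer, the OOM path never fails, and a transient allocation failure does not condemn the file to degraded readahead forever.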
Re: [PATCH] struct file cleanup : the very large file_ra_state is now allocated only on demand.
Andi Kleen wrote: You don't want to always have bad performance though, so you could attempt to allocate if either the pointer is null _or_ it points to the global structure? Remember it's after a GFP_KERNEL OOM. If that fails most likely you have deadlocked somewhere else already because Linux's handling of that is so bad. Suboptimal readahead somewhere is the smallest of your problems. True. And in practice it may not even be able to happen at the moment if the page allocator still doesn't fail small order allocs. But I guess the dream one day is to robustly handle any OOM :\