Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Nick Piggin
On Friday 22 February 2008 09:26, Peter Zijlstra wrote:
 On Thu, 2008-02-21 at 19:00 +0100, Eric Dumazet wrote:
  Some oprofile results obtained while using tbench on a 2x2 cpu machine
  were very surprising.
 
  For example, the loopback_xmit() function was using a high number of cpu
  cycles to perform the statistic updates, which are supposed to be really
  cheap since they use percpu data
 
  pcpu_lstats = netdev_priv(dev);
  lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
  lb_stats->packets++;  /* HERE : serious contention */
  lb_stats->bytes += skb->len;
 
 
  struct pcpu_lstats is a small structure containing two longs. It
  appears that on my 32-bit platform, alloc_percpu(8) allocates a single
  cache line, instead of giving each cpu a separate cache line.
 
  Using the following patch gave me an impressive boost in various
  benchmarks (> 6 % in tbench) (all percpu_counters hit this bug too)
 
  Long term fix (ie >= 2.6.26) would be to let each CPU allocate its
  own block of memory, so that we don't need to round up sizes to
  L1_CACHE_BYTES, or merging the SGI stuff of course...
 
  Note : SLUB vs SLAB is important here to *show* the improvement, since
  they don't have the same minimum allocation sizes (8 bytes vs 32
  bytes). This could very well explain regressions some guys reported
  when they switched to SLUB.
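
For illustration: one way to get the layout described above is to pad the
per-cpu structure itself out to a cache line, so that per_cpu_ptr() results
for different CPUs can never share a line. A minimal sketch reusing the
pcpu_lstats fields quoted earlier (the explicit alignment attribute is the
illustrative part, not Eric's actual patch):

	struct pcpu_lstats {
		unsigned long packets;
		unsigned long bytes;
	} ____cacheline_aligned_in_smp;	/* one cache line per CPU's stats */

The allocation itself stays an ordinary alloc_percpu() of the (now padded)
structure.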

 I've complained about this false sharing as well, so until we get the
 new and improved percpu allocators,

What I don't understand is why the slab allocators have something like
this in them:

if ((flags & SLAB_HWCACHE_ALIGN) &&
		size > cache_line_size() / 2)
	return max_t(unsigned long, align, cache_line_size());

If you ask for HWCACHE_ALIGN, then you should get it. I don't
understand why they think they know better than the caller.
Things like this are just going to lead to very difficult to track
performance problems. Possibly correctness problems in rare cases.

There could be another flag for maybe align.
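
As an aside, a caller that really needs the alignment can ask for it
explicitly rather than via SLAB_HWCACHE_ALIGN, since the align argument to
kmem_cache_create() is, as far as I can tell, honoured as a minimum. A
hedged sketch with a made-up cache (not from any driver):

	struct foo_stats {
		unsigned long packets;
		unsigned long bytes;
	};

	static struct kmem_cache *foo_cache;

	static int __init foo_init(void)
	{
		/* request cache line alignment explicitly instead of
		 * relying on SLAB_HWCACHE_ALIGN being honoured */
		foo_cache = kmem_cache_create("foo_stats",
					      sizeof(struct foo_stats),
					      L1_CACHE_BYTES, 0, NULL);
		return foo_cache ? 0 : -ENOMEM;
	}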



Re: lockless pagecache Cassini regression

2008-01-04 Thread Nick Piggin
On Thu, Jan 03, 2008 at 07:32:45PM -0800, David Miller wrote:
 
 Nick, I think the following changeset:
 
 commit fa4f0774d7c6cccb4d1fda76b91dd8eddcb2dd6a
 
 [CASSINI]: dont touch page_count
 
 Remove page refcount manipulations from cassini driver by using
 another field in struct page. Needed for lockless pagecache.
 
 Signed-off-by: Nick Piggin [EMAIL PROTECTED]
 Signed-off-by: David S. Miller [EMAIL PROTECTED]
 
 Broke the Cassini driver.
 
 While it is true that, within the cassini driver, you converted
 the counter bumps and tests accurately, this changeset does not
 account for the page count bumps that are going to occur
 in other contexts when the SKBs being constructed are freed
 up (and thus the skb_frag_struct pages get liberated).

Dang :(

 
 Basically what these drivers do is allocate a page, give it
 to the network card, and the card decides how to chop up the
 page based upon the size of arriving packets.  Therefore we
 don't know ahead of time how many references to the page we
 will generate while attaching bits of the page to receive
 packet SKBs that get built.
 
 As incoming packets are constructed by the card using chunks of these
 system pages, it indicates in the descriptor which parts of which
 pages were used.  We then use this to attach the page parts onto the
 SKB.  Once the SKB is constructed it is passed up to the networking,
 later on the SKB is freed and the page references are dropped by
 kfree_skbmem().
 
 And it is these page reference counts that the Cassini driver needs to
 test to be correct, not the driver local ones we are now using.
 
 I do something similar to what Cassini was doing in the NIU driver
 (drivers/net/niu.c).  I think we need to restore Cassini to something
 similar to what NIU is doing.
 
 The only tricky bit is the page count checks, and frankly those can
 just be transformed into references to compound_head(page)->_count or
 similar.
 
 Actually, page_count() does exactly that, it is implemented by
 checking atomic_read(&compound_head(page)->_count) and thus I
 think the best thing to do is revert your changes.
 
 These pages are freshly allocated pages, not stuff in the page
 cache or similar, so I think this is safe and the way to move
 forward.
 
 Also, these changes added lots of new atomics to the driver.
 It's already doing get_page() etc. as needed.
 
 What do you think Nick?

Thanks for the detailed explanation as always, Dave. And I hope I didn't
waste too much of your time debugging this...

It is a regression so clearly needs to be reverted. It would actually break
with the lockless pagecache, however the lockless pagecache is not merged
anyway, so reverting is a no-brainer at this point. 

Just for interest, the lockless pagecache actually makes page->_count unstable
for all pages that _have ever_ been pagecache pages (since the last quiescent
rcu state, anyway). Basically, it looks up and takes a ref on the struct page
without ever having a prior pin or reference on that page. It can do this
because it knows the struct page won't actually get freed. After taking the
ref, it rechecks that it has got the right page...

Anyway, that's the short story but probably gives you an idea of why it
hasn't been merged yet ;) I could avoid that whole hassle by rcu freeing
pagecache pages, but that would add other overheads.
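
For the curious, the lookup pattern described above looks roughly like the
sketch below. It is only an illustration of the speculative-reference idea,
written against generic primitives (radix_tree_lookup(),
get_page_unless_zero()); it is not the actual lockless pagecache code:

	static struct page *speculative_find_page(struct address_space *mapping,
						  pgoff_t index)
	{
		struct page *page;

	again:
		rcu_read_lock();
		page = radix_tree_lookup(&mapping->page_tree, index);
		/* take a reference without holding tree_lock; this fails
		 * if the page is already on its way to being freed */
		if (page && !get_page_unless_zero(page))
			page = NULL;
		rcu_read_unlock();

		if (page && (page->mapping != mapping || page->index != index)) {
			/* the struct page got reused while we looked:
			 * drop our speculative ref and retry */
			page_cache_release(page);
			goto again;
		}
		return page;
	}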


 [CASSINI]: Revert 'dont touch page_count'.
 
 This reverts changeset fa4f0774d7c6cccb4d1fda76b91dd8eddcb2dd6a
 ([CASSINI]: dont touch page_count) because it breaks the driver.
 
 The local page counting added by this changeset did not account
 for the asynchronous page count changes done by kfree_skb()
 and friends.
 
 The change adds extra atomics and on top of it all appears to be
 totally unnecessary as well.
 
 Signed-off-by: David S. Miller [EMAIL PROTECTED]

Acked-by: Nick Piggin [EMAIL PROTECTED]


 diff --git a/drivers/net/cassini.c b/drivers/net/cassini.c
 index 9030ca5..ff957f2 100644
 --- a/drivers/net/cassini.c
 +++ b/drivers/net/cassini.c
 @@ -336,30 +336,6 @@ static inline void cas_mask_intr(struct cas *cp)
   cas_disable_irq(cp, i);
  }
  
 -static inline void cas_buffer_init(cas_page_t *cp)
 -{
 -	struct page *page = cp->buffer;
 -	atomic_set((atomic_t *)&page->lru.next, 1);
 -}
 -
 -static inline int cas_buffer_count(cas_page_t *cp)
 -{
 -	struct page *page = cp->buffer;
 -	return atomic_read((atomic_t *)&page->lru.next);
 -}
 -
 -static inline void cas_buffer_inc(cas_page_t *cp)
 -{
 -	struct page *page = cp->buffer;
 -	atomic_inc((atomic_t *)&page->lru.next);
 -}
 -
 -static inline void cas_buffer_dec(cas_page_t *cp)
 -{
 -	struct page *page = cp->buffer;
 -	atomic_dec((atomic_t *)&page->lru.next);
 -}
 -
  static void cas_enable_irq(struct cas *cp, const int ring)
  {
   if (ring == 0) { /* all but TX_DONE */
 @@ -497,7 +473,6 @@ static int cas_page_free(struct cas *cp, cas_page_t *page)
  {
   pci_unmap_page(cp->pdev, page->dma_addr, cp

Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 17:37, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]

  I'm doing some oprofile runs now to see if I can get any more info.

OK, in vanilla kernels, the page allocator definitely shows higher
in the results (than with Herbert's patch reverted).

27516 2.7217  get_page_from_freelist
21677 2.1442  __rmqueue_smallest
20513 2.0290  __free_pages_ok   
18725 1.8522  get_pageblock_flags_group 

Just these account for nearly 10% of cycles. __alloc_skb shows up
higher too. free_hot_cold_page() shows a lot lower though, which
might indicate that actually there is more higher order allocation
activity (I'll check that next).


SLUB, avg throughput 1548 MB/s

CPU: AMD64 family10, speed 1900 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 10
samples  %symbol name
94636 9.3609  copy_user_generic_string
38932 3.8509  ipt_do_table
34746 3.4369  tcp_v4_rcv
29539 2.9218  skb_release_data
27516 2.7217  get_page_from_freelist
26046 2.5763  tcp_sendmsg
24482 2.4216  local_bh_enable
22910 2.2661  ip_queue_xmit
22113 2.1873  ktime_get
21677 2.1442  __rmqueue_smallest
20513 2.0290  __free_pages_ok
18725 1.8522  get_pageblock_flags_group
18580 1.8378  tcp_recvmsg
18108 1.7911  __napi_schedule
17593 1.7402  schedule
16998 1.6813  tcp_ack
16102 1.5927  dev_hard_start_xmit
15751 1.5580  system_call
15707 1.5536  net_rx_action
15150 1.4986  __switch_to
14988 1.4825  tcp_transmit_skb
13921 1.3770  kmem_cache_free
13398 1.3253  __mod_timer
13243 1.3099  tcp_rcv_established
13109 1.2967  __tcp_select_window
11022 1.0902  __tcp_push_pending_frames
10732 1.0615  set_normalized_timespec
10561 1.0446  netif_rx
8840  0.8744  netif_receive_skb
7816  0.7731  nf_iterate
7300  0.7221  __update_rq_clock
6683  0.6610  _read_lock_bh
6504  0.6433  sys_recvfrom
6283  0.6215  nf_hook_slow
6188  0.6121  release_sock
6172  0.6105  loopback_xmit
5927  0.5863  __alloc_skb
5707  0.5645  tcp_cleanup_rbuf
5538  0.5478  tcp_event_data_recv
5517  0.5457  tcp_v4_do_rcv
5516  0.5456  process_backlog

SLUB, net patch reverted. Avg throughput 1933MB/s
Counted CPU_CLK_UNHALTED , count 10
samples  %symbol name
95895 9.5094  copy_user_generic_string
50259 4.9839  ipt_do_table
39408 3.9079  skb_release_data
37296 3.6984  tcp_v4_rcv
31309 3.1047  ip_queue_xmit
31308 3.1046  local_bh_enable
24052 2.3851  net_rx_action
23786 2.3587  __napi_schedule
21426 2.1247  tcp_recvmsg
21075 2.0899  schedule
20938 2.0763  dev_hard_start_xmit
20222 2.0053  tcp_sendmsg
19775 1.9610  tcp_ack
19717 1.9552  system_call
19495 1.9332  set_normalized_timespec
18584 1.8429  __switch_to
17022 1.6880  tcp_rcv_established
14655 1.4533  tcp_transmit_skb
14466 1.4345  __mod_timer
13820 1.3705  loopback_xmit
13776 1.3661  get_page_from_freelist
13288 1.3177  netif_receive_skb
9718  0.9637  _read_lock_bh
9625  0.9545  nf_iterate
9440  0.9361  netif_rx
9148  0.9072  free_hot_cold_page
8633  0.8561  __update_rq_clock
7668  0.7604  sys_recvfrom
7578  0.7515  __tcp_push_pending_frames
7311  0.7250  find_pid_ns
7178  0.7118  nf_hook_slow
6655  0.6599  sysret_check
6313  0.6260  release_sock
6290  0.6237  tcp_cleanup_rbuf
6263  0.6211  __tcp_select_window
6235  0.6183  process_backlog
5920  0.5871  ip_local_deliver_finish
5651  0.5604  ip_rcv
5239  0.5195  ip_finish_output
5058  0.5016  kmem_cache_free
5016  0.4974  thread_return


 Here are some other things you can play around with:

 1) Monitor the values of skb->len and skb->data_len for packets
going over loopback.

OK, I've taken a log2 histogram of ->len and ->data_len sizes.
I'll attach the plots (x axis is the byte value of the most significant
bit set, y axis is number of occurrences, logscale on X so 0 isn't
there :( ). If you want to see the patch or raw data, let me know.

len looks very similar for both kernels, although the reverted
kernel has significantly higher frequency at most points. tbench
unfortunately can only be run time-based, so this is roughly
expected.

data_len has a spike that shifts up to 512-1023, from 256-511,
when reverting Herbert's patch. Again, I believe the magnitudes
of the spikes should actually be pretty close if we were doing
an equal amount of work.

I can't see that these numbers show much useful, unfortunately.


 2) Try removing NETIF_F_SG in drivers/net/loopback.c's dev->features
setting.

Will try that now.
[attachments: data.png, len.png]

Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 09:27, Nick Piggin wrote:

  2) Try removing NETIF_F_SG in drivers/net/loopback.c's dev->features
 setting.

 Will try that now.

Doesn't help (with vanilla kernel -- Herbert's patch applied).

data_len histogram drops to 0 and goes to len (I guess that's not
surprising).

Performance is pretty similar (ie. not good).

I'll look at allocator patterns next.


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 22:10, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 14 Nov 2007 09:27:39 +1100

  OK, in vanilla kernels, the page allocator definitely shows higher
  in the results (than with Herbert's patch reverted).

  ...

  I can't see that these numbers show much useful, unfortunately.

 Thanks for all of this data Nick.

 So the thing that's being effected here in TCP is
 net/ipv4/tcp.c:select_size(), specifically the else branch:

   int tmp = tp->mss_cache;
  ...
   else {
   int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);

   if (tmp >= pgbreak &&
       tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
   tmp = pgbreak;
   }

 This is deciding, in 'tmp', how much linear sk_buff space to
 allocate.  'tmp' is initially set to the path MSS, which
 for loopback is 16K - the space necessary for packet headers.

 The SKB_MAX_HEAD() value has changed as a result of Herbert's
 bug fix.   I suspect this 'if' test is passing both with and
 without the patch.

 But pgbreak is now smaller, and thus the skb->data linear
 data area size we choose to use is smaller as well.

OK, that makes sense. BTW, are you taking advantage of kmalloc's
quantization into slabs WRT the linear data area? I wonder if
it would be at all useful...
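
For what it's worth, a rough sketch of what that could look like in the skb
allocation path: after kmalloc() returns, ask the allocator how big an
object it actually handed back and let the linear area grow to fill it.
This is only an illustration, not a real patch:

	u8 *data;

	/* kmalloc rounds up to a slab object size anyway; use ksize()
	 * to find out how much we really got and give the slack to the
	 * skb's linear area instead of wasting it */
	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
	if (data)
		size = ksize(data) - sizeof(struct skb_shared_info);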


 You can test if this is precisely what is causing the performance
 regression by using the old calculation just here in select_size().

 Add something like this local to net/ipv4/tcp.c:

 #define OLD_SKB_WITH_OVERHEAD(X)  \
   (((X) - sizeof(struct skb_shared_info)) & \
~(SMP_CACHE_BYTES - 1))
 #define OLD_SKB_MAX_ORDER(X, ORDER) \
   OLD_SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define OLD_SKB_MAX_HEAD(X)   (OLD_SKB_MAX_ORDER((X), 0))

 And then use OLD_SKB_MAX_HEAD() in select_size().

That brings performance back up!

I wonder why it isn't causing a problem for SLAB...


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-14 Thread Nick Piggin
On Wednesday 14 November 2007 22:48, Herbert Xu wrote:
 On Wed, Nov 14, 2007 at 03:10:22AM -0800, David Miller wrote:
  So the thing that's being effected here in TCP is
  net/ipv4/tcp.c:select_size(), specifically the else branch:

 Thanks for the pointer.  Indeed there is a bug in that area.
 I'm not sure whether it's causing the problem at hand but it's
 certainly suboptimal.

 [TCP]: Fix size calculation in sk_stream_alloc_pskb

This looks like it fixes the problem!

Still interested to know why SLAB didn't see the same thing...


 We round up the header size in sk_stream_alloc_pskb so that
 TSO packets get zero tail room.  Unfortunately this rounding
 up is not coordinated with the select_size() function used by
 TCP to calculate the second parameter of sk_stream_alloc_pskb.

 As a result, we may allocate more than a page of data in the
 non-TSO case when exactly one page is desired.

 In fact, rounding up the head room is detrimental in the non-TSO
 case because it makes memory that would otherwise be available to
 the payload head room.  TSO doesn't need this either, all it wants
 is the guarantee that there is no tail room.

 So this patch fixes this by adjusting the skb_reserve call so that
 exactly the requested amount (which all callers have calculated in
 a precise way) is made available as tail room.

 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

 Cheers,


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-14 Thread Nick Piggin
On Thursday 15 November 2007 10:46, David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Wed, 14 Nov 2007 19:48:44 +0800

  Signed-off-by: Herbert Xu [EMAIL PROTECTED]

 Applied and I'll queue it up for -stable too.

Good result. Thanks, everyone!


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-14 Thread Nick Piggin
On Thursday 15 November 2007 12:11, Herbert Xu wrote:
 On Wed, Nov 14, 2007 at 05:03:25PM -0800, Christoph Lameter wrote:
  Well this is likely the result of the SLUB regression. If you allocate an
  order 1 page then the zone locks need to be taken. SLAB queues a

Yeah, it appears this is what happened. The lockless page allocator
fastpath appears on the list and the slowpaths disappear after
Herbert's patches. SLAB is doing its own thing, so it avoids that
pitfall.


  couple of higher order pages and can so serve a couple of requests
  without going into the page allocator whereas SLUB has to go directly to
  the page allocator for allocate and free. I guess that needs fixing in
  the page allocator. Or do I need to add a mechanism to buffer higher
  order page allocations to SLUB?

 Actually this serves to discourage people from using high-order
 allocations which IMHO is a good thing :)

Yeah I completely agree. The right fix is in the caller...
The bug / suboptimal allocation would not have been found in tcp
if not for this ;)



Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-13 Thread Nick Piggin
On Tuesday 13 November 2007 06:44, Christoph Lameter wrote:
 On Sat, 10 Nov 2007, Nick Piggin wrote:
  BTW. your size-2048 kmalloc cache is order-1 in the default setup,
  whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations. And
  SLAB also uses order-0 for size-2048. It would be nice if SLUB did the
  same...

 You can try to see the effect that order 0 would have by booting with

 slub_max_order=0

Yeah, that didn't help much, but in general I think it would give
more consistent and reliable behaviour from slub.


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-13 Thread Nick Piggin
On Wednesday 14 November 2007 12:58, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Tue, 13 Nov 2007 22:41:58 +1100

  On Tuesday 13 November 2007 06:44, Christoph Lameter wrote:
   On Sat, 10 Nov 2007, Nick Piggin wrote:
BTW. your size-2048 kmalloc cache is order-1 in the default setup,
whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations.
And SLAB also uses order-0 for size-2048. It would be nice if SLUB
did the same...
  
   You can try to see the effect that order 0 would have by booting with
  
   slub_max_order=0
 
  Yeah, that didn't help much, but in general I think it would give
  more consistent and reliable behaviour from slub.

 Just a note that I'm not ignoring this issue, I just don't have time
 to get to it yet.

No problem. I would like to have helped more, but it's slow going given
my lack of network stack knowledge. If I get any more interesting data,
I'll send it.


 I suspect the issue is about having a huge skb->data linear area for
 TCP sends over loopback.  We're likely getting a much smaller
 skb->data linear data area after the patch in question, the rest using
 the sk_buff scatterlist pages which are a little bit more expensive to
 process.

It didn't seem to be noticeable at 1 client. Unless scatterlist
processing is going to cause cacheline bouncing, I don't see why this
hurts more as you add CPUs?


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-13 Thread Nick Piggin
On Wednesday 14 November 2007 17:12, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 14 Nov 2007 04:36:24 +1100

  On Wednesday 14 November 2007 12:58, David Miller wrote:
   I suspect the issue is about having a huge skb->data linear area for
   TCP sends over loopback.  We're likely getting a much smaller
   skb->data linear data area after the patch in question, the rest using
   the sk_buff scatterlist pages which are a little bit more expensive to
   process.
 
  It didn't seem to be noticeable at 1 client. Unless scatterlist
  processing is going to cause cacheline bouncing, I don't see why this
  hurts more as you add CPUs?

 Is your test system using HIGHMEM?

 That's one thing the page vector in the sk_buff can do a lot,
 kmaps.

No, it's an x86-64, so no highmem.

What's also interesting is that SLAB apparently doesn't have this
problem. The first thing that sprung to mind is that SLAB caches
order > 0 allocations, while SLUB does not. However if anything,
that should actually favour the SLUB numbers if the network stack is
avoiding order > 0 allocations.

I'm doing some oprofile runs now to see if I can get any more info.


Re: 2.6.24-rc2: Network commit causes SLUB performance regression with tbench

2007-11-09 Thread Nick Piggin
On Saturday 10 November 2007 12:29, Nick Piggin wrote:
 cc'ed linux-netdev

Err, make that 'netdev' :P

 On Saturday 10 November 2007 10:46, Christoph Lameter wrote:
  commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0 seems to cause a drop
  in SLUB tbench performance:
 
  8p x86_64 system:
 
  2.6.24-rc2:
  1260.80 MB/sec
 
  After reverting the patch:
  2350.04 MB/sec
 
  SLAB performance (which is at 2435.58 MB/sec, ~3% better than SLUB) is
  not affected by the patch.

 Ah, I didn't realise this was a regression. Thanks for bisecting it.

  Since this is an alignment change it seems that tbench performance is
  sensitive to the data layout? SLUB packs data more tightly than SLAB. So
  8 byte allocations could result in cacheline contention if adjacent
  objects are allocated from different cpus. SLAB's minimum size is 32
  bytes so the cacheline contention is likely more limited.
 
  Maybe we need to allocate a minimum of one cacheline to the skb head? Or
  pad it out to a full cacheline?

 The data should already be cacheline aligned. It is kmalloced, and
 with a minimum size of somewhere around 200 bytes on a 64-bit machine.
 So it will hit a cacheline aligned kmalloc slab AFAIKS  -- cacheline
 interference is probably not the problem. (To verify, I built slub with
 minimum kmalloc size set to 32 like slab and it's no real difference)

 But I can't see why restricting the allocation to PAGE_SIZE would help
 either. Maybe the macros are used in some other areas.

 BTW. your size-2048 kmalloc cache is order-1 in the default setup,
 whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations. And
 SLAB also uses order-0 for size-2048. It would be nice if SLUB did the
 same...

  commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0
  Author: Herbert Xu [EMAIL PROTECTED]
  Date:   Sun Oct 21 16:27:46 2007 -0700
 
  [NET]: Fix SKB_WITH_OVERHEAD calculation
 
  The calculation in SKB_WITH_OVERHEAD is incorrect in that it can
  cause an overflow across a page boundary which is what it's meant to
  prevent. In particular, the header length (X) should not be lumped
  together with skb_shared_info.  The latter needs to be aligned properly
  while the header has no choice but to sit in front of wherever the
  payload is.
 
  Therefore the correct calculation is to take away the aligned size of
  skb_shared_info, and then subtract the header length.  The resulting
  quantity L satisfies the following inequality:
 
   SKB_DATA_ALIGN(L + X) + sizeof(struct skb_shared_info) <=
  PAGE_SIZE
 
  This is the quantity used by alloc_skb to do the actual allocation.
  Signed-off-by: Herbert Xu [EMAIL PROTECTED]
  Signed-off-by: David S. Miller [EMAIL PROTECTED]
 
  diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
  index f93f22b..369f60a 100644
  --- a/include/linux/skbuff.h
  +++ b/include/linux/skbuff.h
  @@ -41,8 +41,7 @@
   #define SKB_DATA_ALIGN(X)  (((X) + (SMP_CACHE_BYTES - 1)) & \
   ~(SMP_CACHE_BYTES - 1))
   #define SKB_WITH_OVERHEAD(X)   \
  -   (((X) - sizeof(struct skb_shared_info)) & \
  -~(SMP_CACHE_BYTES - 1))
  +   ((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
   #define SKB_MAX_ORDER(X, ORDER) \
  SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
   #define SKB_MAX_HEAD(X)(SKB_MAX_ORDER((X), 0))


Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
 On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote:
  On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
   Allow PF_MEMALLOC to be set in softirq context. When running softirqs
  from a borrowed context save current->flags, ksoftirqd will have its
   own task_struct.
 
  What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
  that some networking thing must be picking it up in a subsequent patch,
  but I'm too lazy to look!)... Again, can you have more of a rationale in
  your patch headers, or ref the patch that uses it... thanks

 Right, I knew I was forgetting something in these changelogs.

 The network stack does quite a bit of packet processing from softirq
 context. Once you start swapping over network, some of the packets want
 to be processed under PF_MEMALLOC.

Hmm... what about processing from interrupt context?
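
For readers following along, the save/restore the changelog describes boils
down to something like the sketch below. Everything here is illustrative
rather than the actual patch, and __do_softirq_work() is a made-up stand-in
for the real packet processing:

	unsigned long pflags = current->flags;

	/* let the processing dip into the memalloc reserves ... */
	current->flags |= PF_MEMALLOC;
	__do_softirq_work();		/* hypothetical stand-in */
	/* ... then put the borrowed task's PF_MEMALLOC state back */
	current->flags = (current->flags & ~PF_MEMALLOC) |
			 (pflags & PF_MEMALLOC);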


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
 On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote:
  On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
   Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
   contexts that are entitled to it.
  
   Care is taken to only touch the SLUB slow path.
  
   This is done to ensure reserve pages don't leak out and get consumed.
 
  I think this is generally a good idea (to prevent slab allocators
  from stealing reserve). However I naively think the implementation
  is a bit overengineered and thus has a few holes.
 
  Humour me, what was the problem with failing the slab allocation
  (actually, not fail but just call into the page allocator to do
  correct waiting  / reclaim) in the slowpath if the process fails the
  watermark checks?

 Ah, we actually need slabs below the watermarks.

Right, I'd still allow those guys to allocate slabs. Provided they
have the right allocation context, right?


 It's just that once I
 allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC I don't want
 allocation contexts that do not have rights to those pages to walk off
 with objects.

And I'd prevent these ones from doing so.

Without keeping track of reserve pages, which doesn't feel
too clean.


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote:
 On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote:

  And I'd prevent these ones from doing so.
 
  Without keeping track of reserve pages, which doesn't feel
  too clean.

 The problem with that is that once a slab was allocated with the right
 allocation context, anybody can get objects from these slabs.

[snip]

I understand that.


 So we either reserve a page per object, which for 32 byte objects is a
 large waste, or we stop anybody who doesn't have the right permissions
 from obtaining objects. I took the latter approach.

What I'm saying is that the slab allocator slowpath should always
just check watermarks against the current task, instead of this
->reserve stuff.
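
Roughly, that alternative would look something like the sketch below in the
slow path. This is purely illustrative; gfp_to_alloc_flags() and
ALLOC_NO_WATERMARKS are the same page-allocator internals the mempool patch
in this series uses:

	/* sketch only: re-run the page allocator's own entitlement test
	 * for the current allocation context, instead of tracking which
	 * slab pages came from the reserves */
	static inline int may_take_reserve_object(gfp_t gfpflags)
	{
		return gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS;
	}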


Re: [PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Hi,

 Another posting of the full swap over NFS series.

Hi,

Is it really worth all the added complexity of making swap
over NFS files work, given that you could use a network block
device instead?

Also, have you ensured that page_file_index, page_file_mapping
and page_offset are only ever used on anonymous pages when the
page is locked? (otherwise PageSwapCache could change)


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
 contexts that are entitled to it.

 Care is taken to only touch the SLUB slow path.

 This is done to ensure reserve pages don't leak out and get consumed.

I think this is generally a good idea (to prevent slab allocators
from stealing reserve). However I naively think the implementation
is a bit overengineered and thus has a few holes.

Humour me, what was the problem with failing the slab allocation
(actually, not fail but just call into the page allocator to do
correct waiting  / reclaim) in the slowpath if the process fails the
watermark checks?


Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Allow the mempool to use the memalloc reserves when all else fails and
 the allocation context would otherwise allow it.

I don't see what this is for. The whole point of my earlier fix to make
mempool *not* use the memalloc reserves was that processes which were
otherwise allowed to use those reserves were doing so here. They should not.



 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  mm/mempool.c |   12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)

 Index: linux-2.6/mm/mempool.c
 ===
 --- linux-2.6.orig/mm/mempool.c
 +++ linux-2.6/mm/mempool.c
 @@ -14,6 +14,7 @@
   #include <linux/mempool.h>
   #include <linux/blkdev.h>
   #include <linux/writeback.h>
  +#include "internal.h"

  static void add_element(mempool_t *pool, void *element)
  {
 @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
   void *element;
   unsigned long flags;
   wait_queue_t wait;
 - gfp_t gfp_temp;
 + gfp_t gfp_temp, gfp_orig = gfp_mask;

   might_sleep_if(gfp_mask & __GFP_WAIT);

 @@ -228,6 +229,15 @@ repeat_alloc:
   }
   spin_unlock_irqrestore(&pool->lock, flags);

  + /* if we really had right to the emergency reserves try those */
  + if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
  +	if (gfp_temp & __GFP_NOMEMALLOC) {
  +		gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
  +		goto repeat_alloc;
  +	} else
  +		gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
  + }
 +
   /* We must not sleep in the GFP_ATOMIC case */
   if (!(gfp_mask & __GFP_WAIT))
   return NULL;

 --


Re: [PATCH 05/33] mm: kmem_estimate_pages()

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Provide a method to get the upper bound on the pages needed to allocate
 a given number of objects from a given kmem_cache.


Fair enough, but just to make it a bit easier, can you provide a
little rationale for it in this patch (or reference the patch number
where you use it, or put it together with the patch where you use
it, etc.).

Thanks,


Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
 wide - which they are per setup_per_zone_pages_min(), when we scrape the
 barrel, do it properly.


IIRC it's actually not too uncommon to have allocations coming here via
page reclaim. It's not exactly clear that you want to break mempolicies
at this point.


 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  mm/page_alloc.c |6 ++
  1 file changed, 6 insertions(+)

 Index: linux-2.6/mm/page_alloc.c
 ===
 --- linux-2.6.orig/mm/page_alloc.c
 +++ linux-2.6/mm/page_alloc.c
 @@ -1638,6 +1638,12 @@ restart:
  rebalance:
   if (alloc_flags & ALLOC_NO_WATERMARKS) {
  nofail_alloc:
 + /*
 +  * break out of mempolicy boundaries
 +  */
  + zonelist = NODE_DATA(numa_node_id())->node_zonelists +
 + gfp_zone(gfp_mask);
 +
   /* go through the zonelist yet again, ignoring mins */
   page = get_page_from_freelist(gfp_mask, order, zonelist,
   ALLOC_NO_WATERMARKS);

 --



Re: [PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 15:37, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 14:26:32 +1100

  Is it really worth all the added complexity of making swap
  over NFS files work, given that you could use a network block
  device instead?

 Don't be misled.  Swapping over NFS is just a scarecrow for the
 seemingly real impetus behind these changes which is network storage
 stuff like iSCSI.

Oh, I'm OK with the network reserves stuff (not the actual patch,
which I'm not really qualified to review, but at least the idea
of it...).

And also I'm not as such against the idea of swap over network.

However, what I question is specifically the change to make swapfiles
work through the filesystem layer (ATM it goes straight to the block layer,
modulo some initialisation stuff which uses block filesystem-
specific calls).

I mean, I assume that anybody trying to swap over network *today*
has to be using a network block device anyway, so the idea of
just being able to transparently improve that case seems better
than adding new complexities for seemingly not much gain.


Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
 a borrowed context save current->flags, ksoftirqd will have its own
 task_struct.


What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
that some networking thing must be picking it up in a subsequent patch,
but I'm too lazy to look!)... Again, can you have more of a rationale in
your patch headers, or ref the patch that uses it... thanks



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


#define atomic_read_volatile(v)		\
({					\
	forget((v)->counter);		\
	((v)->counter);			\
})

where:


*vomit* :)

Not only do I hate the keyword volatile, but the barrier is only a
one-sided affair so it's probable this is going to have slightly
different allowed reorderings than a real volatile access.

Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they
can apply that knowledge to atomic_t and all other lockless memory
accesses as well.



#define forget(a)	__asm__ __volatile__ ("" : "=m" (a) : "m" (a))


I like order(x) better, but it's not the most perfect name either.
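
As a concrete example of the plain-barriers approach argued for above: a
simple wait loop needs nothing more than cpu_relax(), which already implies
a compiler barrier and so forces the atomic_read() to be redone on every
iteration. The flag name below is made up for illustration:

	static atomic_t done_flag = ATOMIC_INIT(0);

	static void wait_for_done(void)
	{
		while (!atomic_read(&done_flag))
			cpu_relax();	/* compiler barrier: re-load the flag */
	}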

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Stefan Richter wrote:

Nick Piggin wrote:


I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.



Which documentation is there?


Documentation/atomic_ops.txt



For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.



"Semantics and Behavior of Atomic and Bitmask Operations" is
pretty direct :)

Sure, it says that it's for arch maintainers, but there is no
reason why users can't make use of it.



Prompted by this thread, I reread this document, and indeed, the
sentence "Unlike the above routines, it is required that explicit memory
barriers are performed before and after [atomic_{inc,dec}_return]"
indicates that atomic_read (one of the above routines) is very
different from all other atomic_t accessors that return values.

This is strange.  Why is it that atomic_read stands out that way?  IMO


It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set.



this API imbalance is quite unexpected by many people.  Wouldn't it be
beneficial to change the atomic_read API to behave the same like all
other atomic_t accessors that return values?


It is very consistent and well defined. Operations which both modify
the data _and_ return something are defined to have full barriers
before and after.

What do you want to add to the other atomic accessors? Full memory
barriers? Only compiler barriers? It's quite likely that if you think
some barriers will fix bugs, then there are other bugs lurking there
anyway.

Just use spinlocks if you're not absolutely clear about potential
races and memory ordering issues -- they're pretty cheap and simple.



OK, it is also different from the other accessors that return data in so
far as it doesn't modify the data.  But as driver author, i.e. user of
the API, I can't see much use of an atomic_read that can be reordered
and, more importantly, can be optimized away by the compiler.


It will return to you an atomic snapshot of the data (loaded from
memory at some point since the last compiler barrier). All you have
to be aware of is compiler barriers and the Linux SMP memory ordering
model, which should be a given if you are writing lockless code.



Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).


You might find that these places that appear to need barriers are
buggy for other reasons anyway. Can you point to some in-tree code
we can have a look at?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


On Fri, 17 Aug 2007, Herbert Xu wrote:



On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote:

BTW, the sort of missing barriers that triggered this thread
aren't that subtle.  It'll result in a simple lock-up if the
loop condition holds upon entry.  At which point it's fairly
straightforward to find the culprit.



Not necessarily. A barrier-less buggy code such as below:

atomic_set(&v, 0);

... /* some initial code */

while (atomic_read(&v))
;

... /* code that MUST NOT be executed unless v becomes non-zero */

(where v->counter has no volatile access semantics)

could be generated by the compiler to simply *elide* or *do away* with
the loop itself, thereby making the:

/* code that MUST NOT be executed unless v becomes non-zero */

to be executed even when v is zero! That is subtle indeed, and causes
no hard lockups.


Then I presume you mean

while (!atomic_read(&v))
;

Which is just the same old infinite loop bug solved with cpu_relax().
These are pretty trivial to audit and fix, and also to debug, I would
think.



Granted, the above IS buggy code. But, the stated objective is to avoid
heisenbugs.


Anyway, why are you making up code snippets that are buggy in other
ways in order to support this assertion being made that lots of kernel
code supposedly depends on volatile semantics. Just reference the
actual code.



And we have driver / subsystem maintainers such as Stefan
coming up and admitting that often a lot of code that's written to use
atomic_read() does assume the read will not be elided by the compiler.


So these are broken on i386 and x86-64?

Are they definitely safe on SMP and weakly ordered machines with
just a simple compiler barrier there? Because I would not be
surprised if there are a lot of developers who don't really know
what to assume when it comes to memory ordering issues.

This is not a dig at driver writers: we still have memory ordering
problems in the VM too (and probably most of the subtle bugs in
lockless VM code are memory ordering ones). Let's not make up a
false sense of security and hope that sprinkling volatile around
will allow people to write bug-free lockless code. If a writer
can't be bothered reading API documentation and learning the Linux
memory model, they can still be productive writing safely locked
code.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


On Fri, 17 Aug 2007, Nick Piggin wrote:



Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).


You might find that these places that appear to need barriers are
buggy for other reasons anyway. Can you point to some in-tree code
we can have a look at?



Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
that managed to escape the requirement for a barrier only because of
some completely un-obvious compilation-unit-scope thing. But I find such
a non-explicit barrier in quite bad taste. Stefan, do consider plunking an
explicit call to barrier() there.


It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.

The unobvious thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier. If the whole msleep call chain
including the scheduler were defined static in the current compilation
unit, then it would still be a barrier because it would actually be able
to see the barriers in schedule(), if nothing else.
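
A tiny illustration of that rule: in the fragment below the compiler cannot
see into do_sleep(), so it must assume the call may touch shared_flag and
has to re-load it on every iteration. Both names are invented for the
example:

	extern int shared_flag;		/* modified by another thread/CPU */
	void do_sleep(void);		/* defined in another compilation unit */

	void wait_for_flag(void)
	{
		while (!shared_flag)	/* re-loaded across each opaque call */
			do_sleep();
	}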


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:


On Fri, 17 Aug 2007, Nick Piggin wrote:



Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they
can apply that knowledge to atomic_t and all other lockless memory
accesses as well.



Code that looks like:

while (!atomic_read(&v)) {
...
cpu_relax_no_barrier();
forget(v.counter);

}

would be uglier. Also think about code such as:


I think they would both be equally ugly, but the atomic_read_volatile
variant would be more prone to subtle bugs because of the weird
implementation.

And it would be more ugly than introducing an order(x) statement for
all memory operations, and adding an order_atomic() wrapper for it
for atomic types.



a = atomic_read();
if (!a)
do_something();

forget();
a = atomic_read();
... /* some code that depends on value of a, obviously */

forget();
a = atomic_read();
...

So much explicit sprinkling of forget() looks ugly.


Firstly, why is it ugly? It's nice because of those nice explicit
statements there that give us a good heads up and would have some
comments attached to them (also, lack of the word volatile is
always a plus).

Secondly, what sort of code would do such a thing? In most cases,
it is probably riddled with bugs anyway (unless it is doing a
really specific sequence of interrupts or something, but in that
case it is very likely to either require locking or busy waits
anyway - ie. barriers).



on the other hand, looks neater. The _volatile() suffix makes it also
no less explicit than an explicit barrier-like macro that this primitive
is something special, for code clarity purposes.


Just don't use the word volatile, and have barriers both before
and after the memory operation, and I'm OK with it. I don't see
the point though, when you could just have a single barrier(x)
barrier function defined for all memory locations, rather than
this odd thing that only works for atomics (and would have to
be duplicated for atomic_set).
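
Such a single-location primitive could be as small as the sketch below:
essentially the forget() macro from earlier in the thread, with a
read/write constraint so it orders both the store and the reload of that
one object. The name order() follows the discussion; it is not an existing
kernel API:

	/* compiler-only ordering for a single object: force x to be
	 * written back and re-loaded around this point, without
	 * clobbering all of memory the way barrier() does */
	#define order(x)	__asm__ __volatile__("" : "+m" (x))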


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:



Satyam Sharma wrote:

It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.



Probably you didn't mean that, but no, schedule() is not barrier because
it sleeps. It's a barrier because it's invisible.



Where did I say it is a barrier because it sleeps?

It is always a barrier because, at the lowest level, schedule() (and thus
anything that sleeps) is defined to always be a barrier. Regardless of
whatever obscure means the compiler might need to infer the barrier.

In other words, you can ignore those obscure details because schedule() is
always going to have an explicit barrier in it.



The unobvious thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier.



True, that's clearly what happens here. But you're definitely joking
that this is obvious in terms of code-clarity, right?



No. If you accept that barrier() is implemented correctly, and you know
that sleeping is defined to be a barrier, then its perfectly clear. You
don't have to know how the compiler knows that some function contains
a barrier.



Just 5 minutes back you mentioned elsewhere you like seeing lots of
explicit calls to barrier() (with comments, no less, hmm? :-)



Sure, but there are well known primitives which contain barriers, and
trivial recognisable code sequences for which you don't need comments.
waiting-loops using sleeps or cpu_relax() are prime examples.


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:


I think they would both be equally ugly,



You think both these are equivalent in terms of looks:

|
    while (!atomic_read(&v)) {      |    while (!atomic_read_xxx(&v)) {
        ...                         |        ...
        cpu_relax_no_barrier();     |        cpu_relax_no_barrier();
        order_atomic(&v);           |    }
    }                               |

(where order_atomic() is an atomic_t
specific wrapper as you mentioned below)

?



I think the LHS is better if your atomic_read_xxx primitive is using the
crazy one-sided barrier, because with the LHS code you immediately know
what barriers are happening, and with the RHS you have to look at the
atomic_read_xxx definition.

If your atomic_read_xxx implementation was more intuitive, then both are
pretty well equal. More lines != ugly code.



but the atomic_read_volatile
variant would be more prone to subtle bugs because of the weird
implementation.



What bugs?



You can't think for yourself? Your atomic_read_volatile contains a compiler
barrier to the atomic variable before the load. 2 such reads from different
locations look like this:

asm volatile("" : "+m" (v1));
atomic_read(&v1);
asm volatile("" : "+m" (v2));
atomic_read(&v2);

Which implies that the load of v1 can be reordered to occur after the load
of v2. Bet you didn't expect that?


Secondly, what sort of code would do such a thing?



See the nodemgr_host_thread() that does something similar, though not
exactly same.



I'm sorry, all this waffling about made up code which might do this and
that is just a waste of time. Seriously, the thread is bloated enough
and never going to get anywhere with all this handwaving. If someone is
saving up all the really real and actually good arguments for why we
must have a volatile here, now is the time to use them.


and have barriers both before and after the memory operation,



How could that lead to bugs? (if you can point to existing code,
but just some testcase / sample code would be fine as well).



See above.


As I said, barrier() is too heavy-handed.



Typo. I meant: defined for a single memory location (ie. order(x)).



Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:



Because they should be thinking about them in terms of barriers, over
which the compiler / CPU is not to reorder accesses or cache memory
operations, rather than special volatile accesses.



This is obviously just a taste thing. Whether to have that forget(x)
barrier as something author should explicitly sprinkle appropriately
in appropriate places in the code by himself or use a primitive that
includes it itself.



That's not obviously just taste to me. Not when the primitive has many
(perhaps, the majority) of uses that do not require said barriers. And
this is not solely about the code generation (which, as Paul says, is
relatively minor even on x86). I prefer people to think explicitly
about barriers in their lockless code.



I'm not saying taste matters aren't important (they are), but I'm really
skeptical if most folks would find the former tasteful.



So I /do/ have better taste than most folks? Thanks! :-)



And by the way, the point is *also* about the fact that cpu_relax(), as
of today, implies a full memory clobber, which is not what a lot of such
loops want. (due to stuff mentioned elsewhere, summarized in that summary)


That's not the point,



That's definitely the point, why not. This is why barrier(), being
heavy-handed, is not the best option.



That is _not_ the point (of why a volatile atomic_read is good) because
there has already been an alternative posted that better conforms with
Linux barrier API and is much more widely useful and more usable. If you
are so worried about barrier() being too heavyweight, then you're off to
a poor start by wanting to add a few K of kernel text by making
atomic_read volatile.



because as I also mentioned, the logical extention
to Linux's barrier API to handle this is the order(x) macro. Again, not
special volatile accessors.



Sure, that forget(x) macro _is_ proposed to be made part of the generic
API. Doesn't explain why not to define/use primitives that has volatility
semantics in itself, though (taste matters apart).



If you follow the discussion, you were thinking of a reason why the
semantics *should* be changed or added, and I was rebutting your argument
that it must be used when a full barrier() is too heavy (ie. by pointing
out that order() has superior semantics anyway).

Why do I keep repeating the same things? I'll not continue bloating this
thread until a new valid point comes up...


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-17 Thread Nick Piggin

Satyam Sharma wrote:



On Fri, 17 Aug 2007, Nick Piggin wrote:



Satyam Sharma wrote:


On Fri, 17 Aug 2007, Nick Piggin wrote:


Satyam Sharma wrote:

It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.


Probably you didn't mean that, but no, schedule() is not barrier because
it sleeps. It's a barrier because it's invisible.


Where did I say it is a barrier because it sleeps?



Just below. What you wrote:



It is always a barrier because, at the lowest level, schedule() (and thus
anything that sleeps) is defined to always be a barrier.



It is always a barrier because, at the lowest level, anything that sleeps
is defined to always be a barrier.



... because it must call schedule and schedule is a barrier.



Regardless of
whatever obscure means the compiler might need to infer the barrier.

In other words, you can ignore those obscure details because schedule() is
always going to have an explicit barrier in it.



I didn't quite understand what you said here, so I'll tell what I think:

* foo() is a compiler barrier if the definition of foo() is invisible to
 the compiler at a callsite.

* foo() is also a compiler barrier if the definition of foo() includes
 a barrier, and it is inlined at the callsite.

If the above is wrong, or if there's something else at play as well,
do let me know.



Right.



The unobvious thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier.


True, that's clearly what happens here. But you're definitely joking
that this is obvious in terms of code-clarity, right?


No. If you accept that barrier() is implemented correctly, and you know
that sleeping is defined to be a barrier,



Curiously, that's the second time you've said sleeping is defined to
be a (compiler) barrier.



_In Linux,_ sleeping is defined to be a compiler barrier.


How does the compiler even know if foo() is
a function that sleeps? Do compilers have some notion of sleeping
to ensure they automatically assume a compiler barrier whenever such
a function is called? Or are you saying that the compiler can see the
barrier() inside said function ... nopes, you're saying quite the
opposite below.



You're getting too worried about the compiler implementation. Start
by assuming that it does work ;)



then its perfectly clear. You
don't have to know how the compiler knows that some function contains
a barrier.



I think I do, why not? Would appreciate if you could elaborate on this.



If a function is not completely visible to the compiler (so it can't
determine whether a barrier could be in it or not), then it must always
assume it will contain a barrier so it always does the right thing.
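
A minimal sketch of that rule, assuming GCC-style inline asm and a
hypothetical opaque() function whose body lives in another translation unit:

#define barrier() __asm__ __volatile__("" : : : "memory")

extern void opaque(void);	/* definition not visible at this callsite */

static int flag;

static void wait_for_flag(void)
{
	while (!flag) {
		/* The compiler cannot see into opaque(), so it must assume
		 * opaque() may write to 'flag' (or anything else), and must
		 * therefore reload 'flag' on every iteration -- exactly as
		 * if an explicit barrier() were placed here. */
		opaque();
	}
}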
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-16 Thread Nick Piggin

Segher Boessenkool wrote:
Part of the motivation here is to fix heisenbugs.  If I knew where 
they




By the same token we should probably disable optimisations
altogether since that too can create heisenbugs.


Almost everything is a tradeoff; and so is this.  I don't
believe most people would find disabling all compiler
optimisations an acceptable price to pay for some peace
of mind.



So why is this a good tradeoff?



It certainly is better than disabling all compiler optimisations!


It's easy to be better than something really stupid :)

So i386 and x86-64 don't have volatiles there, and it saves them a
few K of kernel text. What you need to justify is why it is a good
tradeoff to make them volatile (which btw, is much harder to go
the other way after we let people make those assumptions).



I also think that just adding things to APIs in the hope it might fix
up some bugs isn't really a good road to go down. Where do you stop?



I look at it the other way: keeping the volatile semantics in
atomic_XXX() (or adding them to it, whatever) helps _prevent_ bugs;


Yeah, but we could add lots of things to help prevent bugs and
would never be included. I would also contend that it helps _hide_
bugs and encourages people to be lazy when thinking about these
things.

Also, you dismiss the fact that we'd actually be *adding* volatile
semantics back to the 2 most widely tested architectures (in terms
of test time, number of testers, variety of configurations, and
coverage of driver code). This is a very important difference from
just keeping volatile semantics because it is basically a one-way
API change.



certainly most people expect that behaviour, and also that behaviour
is *needed* in some places and no other interface provides that
functionality.


I don't know that most people would expect that behaviour. Is there
any documentation anywhere that would suggest this?

Also, barrier() most definitely provides the required functionality.
It is overkill in some situations, but volatile is overkill in _most_
situations. If that's what you're worried about, we should add a new
ordering primitive.



[some confusion about barriers wrt atomics snipped]


What were you confused about?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-16 Thread Nick Piggin

Chris Snook wrote:

Herbert Xu wrote:


On Thu, Aug 16, 2007 at 03:48:54PM -0400, Chris Snook wrote:


Can you find an actual atomic_read code snippet there that is
broken without the volatile modifier?


A whole bunch of atomic_read uses will be broken without the volatile 
modifier once we start removing barriers that aren't needed if 
volatile behavior is guaranteed.



Could you please cite the file/function names so we can
see whether removing the barrier makes sense?

Thanks,



At a glance, several architectures' implementations of 
smp_call_function() have one or more legitimate atomic_read() busy-waits 
that shouldn't be using cpu_relax().  Some of them do work in the loop.


sh looks like the only one there that would be broken (and that's only
because they don't have a cpu_relax there, but it should be added anyway).
Sure, if we removed volatile from other architectures, it would be wise
to audit arch code because arch maintainers do sometimes make assumptions
about their implementation details... however we can assume most generic
code is safe without volatile.
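
For illustration, a sketch of the kind of busy-wait being talked about here
(hypothetical names, not taken from any particular architecture's
smp_call_function()):

static atomic_t acks;

static void wait_for_acks(int ncpus)
{
	/* cpu_relax() contains a compiler barrier, so 'acks' is genuinely
	 * re-read each time around the loop even on architectures whose
	 * atomic_read() carries no volatile semantics. */
	while (atomic_read(&acks) != ncpus)
		cpu_relax();
}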

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-16 Thread Nick Piggin

Paul E. McKenney wrote:

On Thu, Aug 16, 2007 at 06:42:50PM +0800, Herbert Xu wrote:



In fact, volatile doesn't guarantee that the memory gets
read anyway.  You might be reading some stale value out
of the cache.  Granted this doesn't happen on x86 but
when you're coding for the kernel you can't make such
assumptions.

So the point here is that if you don't mind getting a stale
value from the CPU cache when doing an atomic_read, then
surely you won't mind getting a stale value from the compiler
cache.



Absolutely disagree.  An interrupt/NMI/SMI handler running on the CPU
will see the same value (whether in cache or in store buffer) that
the mainline code will see.  In this case, we don't care about CPU
misordering, only about compiler misordering.  It is easy to see
other uses that combine communication with handlers on the current
CPU with communication among CPUs -- again, see prior messages in
this thread.


I still don't agree with the underlying assumption that everybody
(or lots of kernel code) treats atomic accesses as volatile.

Nobody that does has managed to explain my logic problem either:
loads and stores to long and ptr have always been considered to be
atomic, test_bit is atomic; so why are these treated as another special
subclass of atomic loads and stores? (and yes, it is perfectly legitimate to
want a non-volatile read for a data type that you also want to do
atomic RMW operations on)

Why are people making these undocumented and just plain false
assumptions about atomic_t? If they're using lockless code (ie.
which they must be if using atomics), then they actually need to be
thinking much harder about memory ordering issues. If that is too
much for them, then they can just use locks.



So, the architecture guys can implement atomic_read however they want
--- as long as it cannot be optimized away.*


They can implement it however they want as long as it stays
atomic.



Precisely.  And volatility is a key property of atomic.  Let's please
not throw it away.


It isn't, though (at least not since i386 and x86-64 don't have it).
_Adding_ it is trivial, and can be done any time. Throwing it away
(ie. making the API weaker) is _hard_. So let's not add it without
really good reasons. It most definitely results in worse code
generation in practice.

I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.

It isn't even intuitive: `*lp = value` is like the most fundamental
atomic operation in Linux.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-16 Thread Nick Piggin

Paul Mackerras wrote:

Nick Piggin writes:



So i386 and x86-64 don't have volatiles there, and it saves them a
few K of kernel text. What you need to justify is why it is a good



I'm really surprised it's as much as a few K.  I tried it on powerpc
and it only saved 40 bytes (10 instructions) for a G5 config.

Paul.



I'm surprised too. Numbers were from the "... use asm() like the other
atomic operations already do" thread. According to them,

   text    data     bss     dec     hex filename
3434150  249176  176128 3859454  3ae3fe atomic_normal/vmlinux
3436203  249176  176128 3861507  3aec03 atomic_volatile/vmlinux

The first one is a stock kernel, the second is with atomic_read/set
cast to volatile. gcc-4.1 -- maybe if you have an earlier gcc it
won't optimise as much?
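
For reference, the kind of change being measured there is roughly the
following (an assumed sketch of the experiment, not a quote of the actual
patch):

typedef struct { int counter; } atomic_t;	/* simplified */

/* stock kernel: a plain load the compiler may cache in a register */
#define atomic_read(v)		((v)->counter)

/* "cast to volatile" variant: every use forces a real load from memory,
 * which is where the extra ~2K of text in the numbers above comes from */
#define atomic_read_volatile(v)	(*(volatile int *)&(v)->counter)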

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-16 Thread Nick Piggin

Paul Mackerras wrote:

Nick Piggin writes:



Why are people making these undocumented and just plain false
assumptions about atomic_t?



Well, it has only been false since December 2006.  Prior to that
atomics *were* volatile on all platforms.


Hmm, although I don't think it has ever been guaranteed by the
API documentation (concede documentation is often not treated
as the authoritative source here, but for atomic it is actually
very good and obviously indispensable as the memory ordering
reference).



If they're using lockless code (ie.
which they must be if using atomics), then they actually need to be
thinking much harder about memory ordering issues.



Indeed.  I believe that most uses of atomic_read other than in polling
loops or debug printk statements are actually racy.  In some cases the
race doesn't seem to matter, but I'm sure there are cases where it
does.



If that is too
much for them, then they can just use locks.



Why use locks when you can just sprinkle magic fix-the-races dust (aka
atomic_t) over your code? :) :)


I agree with your skepticism of a lot of lockless code. But I think
a lot of the more subtle race problems will not be fixed with volatile.
The big, dumb infinite loop bugs would be fixed, but they're pretty
trivial to debug and even audit for.



Precisely.  And volatility is a key property of atomic.  Let's please
not throw it away.


It isn't, though (at least not since i386 and x86-64 don't have it).



Conceptually it is, because atomic_t is specifically for variables
which are liable to be modified by other CPUs, and volatile _means_
liable to be changed by mechanisms outside the knowledge of the
compiler.


Usually that is the case, yes. But also most of the time we don't
care that it has been changed and don't mind it being reordered or
eliminated.

One of the only places we really care about that at all is for
variables that are modified by the *same* CPU.



_Adding_ it is trivial, and can be done any time. Throwing it away
(ie. making the API weaker) is _hard_. So let's not add it without



Well, in one sense it's not that hard - Linus did it just 8 months ago
in commit f9e9dcb3. :)


Well it would have been harder if the documentation also guaranteed
that atomic_read/atomic_set was ordered. Or it would have been harder
for _me_ to make such a change, anyway ;)



really good reasons. It most definitely results in worse code
generation in practice.



0.0008% increase in kernel text size on powerpc according to my
measurement. :)


I don't think you're making a bad choice by keeping it volatile on
powerpc and waiting for others to shake out more of the bugs. You
get to fix everybody else's memory ordering bugs :)



I don't know why people would assume volatile of atomics. AFAIK, most



By making something an atomic_t you're saying other CPUs are going to
be modifying this, so treat it specially.  It's reasonable to assume
that special treatment extends to reading and setting it.


But I don't actually know what that special treatment is. Well
actually, I do know that operations will never result in a partial
modification being exposed. I also know that the operators that
do not modify and return are not guaranteed to have any sort of
ordering constraints.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/24] make atomic_read() behave consistently on frv

2007-08-15 Thread Nick Piggin

Paul E. McKenney wrote:

On Tue, Aug 14, 2007 at 03:34:25PM +1000, Nick Piggin wrote:



Maybe it is the safe way to go, but it does obscure cases where there
is a real need for barriers.



I prefer burying barriers into other primitives.


When they should naturally be there, eg. locking or the RCU primitives,
I agree.

I don't like having them scattered in various just in case places,
because it makes both the users and the APIs hard to understand and
change.

Especially since several big architectures don't have volatile in their
atomic_get and _set, I think it would be a step backwards to add them in
as a "just in case" thing now (unless there is a better reason).



Many atomic operations are allowed to be reordered between CPUs, so
I don't have a good idea for the rationale to order them within the
CPU (also loads and stores to long and ptr types are not ordered like
this, although we do consider those to be atomic operations too).

barrier() in a way is like enforcing sequential memory ordering
between process and interrupt context, whereas volatile is just
enforcing coherency of a single memory location (and as such is
cheaper).



barrier() is useful, but it has the very painful side-effect of forcing
the compiler to dump temporaries.  So we do need something that is
not quite so global in effect.


Yep.



What do you think of this crazy idea?

/* Enforce a compiler barrier for only operations to location X.
 * Call multiple times to provide an ordering between multiple
 * memory locations. Other memory operations can be assumed by
 * the compiler to remain unchanged and may be reordered
 */
#define order(x) asm volatile("" : "+m" (x))



There was something very similar discussed earlier in this thread,
with quite a bit of debate as to exactly what the m flag should
look like.  I suggested something similar named ACCESS_ONCE in the
context of RCU (http://lkml.org/lkml/2007/7/11/664):


Oh, I missed that earlier debate. Will go have a look.



#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The nice thing about this is that it works for both loads and stores.
Not clear that order() above does this -- I get compiler errors when
I try something like b = order(a) or order(a) = 1 using gcc 4.1.2.


As Arnd ponted out, order() is not supposed to be an lvalue, but a
statement like the rest of our memory ordering API.

As to your followup question of why to use it over ACCESS_ONCE. I
guess, aside from consistency with the rest of the barrier APIs, you
can use it in other primitives when you don't actually know what the
caller is going to do or if it even will make an access. You could
also use it between calls to _other_ primitives, etc... it just
seems more flexible to me, but I haven't actually used such a thing
in real code...

ACCESS_ONCE doesn't seem as descriptive. What it results in is the
memory location being loaded or stored (presumably once exactly),
but I think the more general underlying idea is a barrier point.
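
To make the difference concrete, a sketch of how the two styles would read in
a simple polling loop (assuming the order() and ACCESS_ONCE() definitions
quoted above, and a hypothetical 'done' flag):

static int done;

static void poll_statement_style(void)
{
	/* order() is a statement: a compiler barrier for this one location,
	 * so 'done' cannot be cached in a register across iterations */
	while (!done) {
		order(done);
		cpu_relax();
	}
}

static void poll_expression_style(void)
{
	/* ACCESS_ONCE() is an expression: it forces each individual load */
	while (!ACCESS_ONCE(done))
		cpu_relax();
}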

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/24] make atomic_read() behave consistently on frv

2007-08-15 Thread Nick Piggin

Segher Boessenkool wrote:

Please check the definition of cache coherence.



Which of the twelve thousand such definitions?  :-)


Every definition I have seen says that writes to a single memory
location have a serial order as seen by all CPUs, and that a read
will return the most recent write in the sequence (with a bit of
extra mumbo jumbo to cover store queues and store forwarding).

Within such a definition, I don't see how it would be allowed for a
single CPU to execute reads out of order and have the second read
return an earlier write than the first read.
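
In litmus-test form, the single-location case being discussed looks like this
(x initially 0; r1 and r2 are local values on the second CPU):

	/* CPU 0 */			/* CPU 1 */
	x = 1;				r1 = x;
					r2 = x;

	/* Under any coherence definition of the kind described above, the
	 * outcome r1 == 1 && r2 == 0 is forbidden: once a read of x has
	 * returned the newer write, a later read of the same location
	 * cannot return the older one. */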

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/24] make atomic_read() behave consistently on frv

2007-08-15 Thread Nick Piggin

Paul E. McKenney wrote:

On Wed, Aug 15, 2007 at 11:30:05PM +1000, Nick Piggin wrote:



Especially since several big architectures don't have volatile in their
atomic_get and _set, I think it would be a step backwards to add them in
as a just in case thin now (unless there is a better reason).



Good point, except that I would expect gcc's optimization to continue
to improve.  I would like the kernel to be able to take advantage of
improved optimization, which means that we are going to have to make
a few changes.  Adding volatile to atomic_get() and atomic_set() is
IMHO one of those changes.


What optimisations? gcc already does most of the things you need a
barrier/volatile for, like reordering non-dependant loads and stores,
and eliminating mem ops completely by caching in registers.



As to your followup question of why to use it over ACCESS_ONCE. I
guess, aside from consistency with the rest of the barrier APIs, you
can use it in other primitives when you don't actually know what the
caller is going to do or if it even will make an access. You could
also use it between calls to _other_ primitives, etc... it just
seems more flexible to me, but I haven't actually used such a thing
in real code...

ACCESS_ONCE doesn't seem as descriptive. What it results in is the
memory location being loaded or stored (presumably once exactly),
but I think the more general underlying idea is a barrier point.



OK, first, I am not arguing that ACCESS_ONCE() can replace all current
uses of barrier().


OK. Well I also wasn't saying that ACCESS_ONCE should not be
implemented. But if we want something like it, then it would make
sense to have an equivalent barrier statement as well (ie. order()).

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-15 Thread Nick Piggin

Segher Boessenkool wrote:

Part of the motivation here is to fix heisenbugs.  If I knew where they



By the same token we should probably disable optimisations
altogether since that too can create heisenbugs.



Almost everything is a tradeoff; and so is this.  I don't
believe most people would find disabling all compiler
optimisations an acceptable price to pay for some peace
of mind.


So why is this a good tradeoff?

I also think that just adding things to APIs in the hope it might fix
up some bugs isn't really a good road to go down. Where do you stop?

On the actual proposal to make atomic_operators volatile: I think the
better approach in the long term, for both maintainability of the
code and education of coders, is to make the use of barriers _more_
explicit rather than sprinkling these just in case ones around.

You may get rid of a few atomic_read heisenbugs (in noise when
compared to all bugs), but if the coder was using a regular atomic
load, or a test_bit (which is also atomic), etc. then they're going
to have problems.

It would be better for Linux if everyone was to have better awareness
of barriers than to hide some of the cases where they're required.
A pretty large number of bugs I see in lock free code in the VM is
due to memory ordering problems. It's hard to find those bugs, or
even be aware when you're writing buggy code if you don't have some
feel for barriers.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/24] make atomic_read() behave consistently on frv

2007-08-13 Thread Nick Piggin

Paul E. McKenney wrote:

On Mon, Aug 13, 2007 at 01:15:52PM +0800, Herbert Xu wrote:


Paul E. McKenney [EMAIL PROTECTED] wrote:


On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote:


Chris Snook [EMAIL PROTECTED] wrote:

cpu_relax() contains a barrier, so it should do the right thing.  For 
non-smp architectures, I'm concerned about interacting with interrupt 
handlers.  Some drivers do use atomic_* operations.


What problems with interrupt handlers? Access to int/long must
be atomic or we're in big trouble anyway.


Reordering due to compiler optimizations.  CPU reordering does not
affect interactions with interrupt handlers on a given CPU, but
reordering due to compiler code-movement optimization does.  Since
volatile can in some cases suppress code-movement optimizations,
it can affect interactions with interrupt handlers.


If such reordering matters, then you should use one of the
*mb macros or barrier() rather than relying on possibly
hidden volatile cast.



If communicating among CPUs, sure.  However, when communicating between
mainline and interrupt/NMI handlers on the same CPU, the barrier() and
most expecially the *mb() macros are gross overkill.  So there really
truly is a place for volatile -- not a large place, to be sure, but a
place nonetheless.


I really would like all volatile users to go away and be replaced
by explicit barriers. It makes things nicer and more explicit... for
atomic_t type there probably aren't many optimisations that can be
made which volatile would disallow (in actual kernel code), but for
others (eg. bitops, maybe atomic ops in UP kernels), there would be.

Maybe it is the safe way to go, but it does obscure cases where there
is a real need for barriers.

Many atomic operations are allowed to be reordered between CPUs, so
I don't have a good idea for the rationale to order them within the
CPU (also loads and stores to long and ptr types are not ordered like
this, although we do consider those to be atomic operations too).

barrier() in a way is like enforcing sequential memory ordering
between process and interrupt context, whereas volatile is just
enforcing coherency of a single memory location (and as such is
cheaper).

What do you think of this crazy idea?

/* Enforce a compiler barrier for only operations to location X.
 * Call multiple times to provide an ordering between multiple
 * memory locations. Other memory operations can be assumed by
 * the compiler to remain unchanged and may be reordered
 */
#define order(x) asm volatile("" : "+m" (x))

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/24] make atomic_read() behave consistently on frv

2007-08-13 Thread Nick Piggin

Chris Snook wrote:

David Howells wrote:


Chris Snook [EMAIL PROTECTED] wrote:

cpu_relax() contains a barrier, so it should do the right thing.  For non-smp
architectures, I'm concerned about interacting with interrupt handlers.  Some
drivers do use atomic_* operations.



I'm not sure that actually answers my question.  Why not smp_rmb()?

David



I would assume because we want to waste time efficiently even on non-smp 
architectures, rather than frying the CPU or draining the battery.  
Certain looping execution patterns can cause the CPU to operate above 
thermal design power.  I have fans on my workstation that only ever come 
on when running LINPACK, and that's generally memory bandwidth-bound.  
Just imagine what happens when you're executing the same few 
non-serializing instructions in a tight loop without ever stalling on 
memory fetches, or being scheduled out.


If there's another reason, I'd like to hear it too, because I'm just 
guessing here.


Well if there is only one memory location involved, then smp_rmb() isn't
going to really do anything anyway, so it would be incorrect to use it.

Consider that smp_rmb basically will do anything from flushing the
pipeline to invalidating loads speculatively executed out of order. AFAIK
it will not control the visibility of stores coming from other CPUs (that
is up to the cache coherency).
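
A small sketch of the distinction, assuming a hypothetical flag/data pair
where the writer stores data, issues smp_wmb(), and then sets flag:

static int data, flag;

extern void consume(int value);

static void consumer(void)
{
	/* Polling a single location: cpu_relax()'s compiler barrier is what
	 * keeps the load inside the loop; smp_rmb() would add nothing here. */
	while (!flag)
		cpu_relax();

	/* Two different locations: this is where smp_rmb() earns its keep,
	 * ordering the read of 'flag' before the read of 'data'. */
	smp_rmb();
	consume(data);
}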

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] AFS: Implement basic file write support

2007-05-11 Thread Nick Piggin

David Howells wrote:

Nick Piggin [EMAIL PROTECTED] wrote:



Why do you call SetPageUptodate when the page is not up to date?
That leaks uninitialised data, AFAIKS.



It only seems that way.  If afs_prepare_write() is called, but doesn't return
an error, then afs_commit_write() will be called, and it seems that the copy
in of the data will be guaranteed not to fail by the caller.


Not only does it seem that way, it is that way :) PG_uptodate is being set
when the page is not uptodate, isn't it?



Furthermore, afs_prepare_page() will have filled in the missing bits.

And whilst all that is going on, the page lock will be held by the caller, so
that no-one else can access the partially complete page.


When a page is uptodate in pagecache, the generic read and nopage functions
do not take the page lock. So how are you excluding those?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] AFS: Implement basic file write support

2007-05-09 Thread Nick Piggin

David Howells wrote:


+/*
+ * prepare a page for being written to
+ */
+static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
+			    struct key *key, unsigned offset, unsigned to)
+{
+	unsigned eof, tail, start, stop, len;
+	loff_t i_size, pos;
+	void *p;
+	int ret;
+
+	_enter("");
+
+	if (offset == 0 && to == PAGE_SIZE)
+		return 0;
+
+	p = kmap(page);
+
+	i_size = i_size_read(&vnode->vfs_inode);
+	pos = (loff_t) page->index << PAGE_SHIFT;
+	if (pos >= i_size) {
+		/* partial write, page beyond EOF */
+		_debug("beyond");
+		if (offset > 0)
+			memset(p, 0, offset);
+		if (to < PAGE_SIZE)
+			memset(p + to, 0, PAGE_SIZE - to);
+		kunmap(page);
+		return 0;
+	}
+
+	if (i_size - pos >= PAGE_SIZE) {
+		/* partial write, page entirely before EOF */
+		_debug("before");
+		tail = eof = PAGE_SIZE;
+	} else {
+		/* partial write, page overlaps EOF */
+		eof = i_size - pos;
+		_debug("overlap %u", eof);
+		tail = max(eof, to);
+		if (tail < PAGE_SIZE)
+			memset(p + tail, 0, PAGE_SIZE - tail);
+		if (offset > eof)
+			memset(p + eof, 0, PAGE_SIZE - eof);
+	}
+
+	kunmap(p);
+
+	ret = 0;
+	if (offset > 0 || eof > to) {
+		/* need to fill one or two bits that aren't going to be written
+		 * (cover both fillers in one read if there are two) */
+		start = (offset > 0) ? 0 : to;
+		stop = (eof > to) ? eof : offset;
+		len = stop - start;
+		_debug("wr=%u-%u av=0-%u [EMAIL PROTECTED]",
+		       offset, to, eof, start, len);
+		ret = afs_fill_page(vnode, key, start, len, page);
+	}
+
+	_leave(" = %d", ret);
+	return ret;
+}
+
+/*
+ * prepare to perform part of a write to a page
+ * - the caller holds the page locked, preventing it from being written out or
+ *   modified by anyone else
+ */
+int afs_prepare_write(struct file *file, struct page *page,
+		      unsigned offset, unsigned to)
+{
+	struct afs_writeback *candidate, *wb;
+	struct afs_vnode *vnode = AFS_FS_I(file->f_dentry->d_inode);
+	struct key *key = file->private_data;
+	pgoff_t index;
+	int ret;
+
+	_enter("{%x:%u},{%lx},%u,%u",
+	       vnode->fid.vid, vnode->fid.vnode, page->index, offset, to);
+
+	candidate = kzalloc(sizeof(*candidate), GFP_KERNEL);
+	if (!candidate)
+		return -ENOMEM;
+	candidate->vnode = vnode;
+	candidate->first = candidate->last = page->index;
+	candidate->offset_first = offset;
+	candidate->to_last = to;
+	candidate->usage = 1;
+	candidate->state = AFS_WBACK_PENDING;
+	init_waitqueue_head(&candidate->waitq);
+
+	if (!PageUptodate(page)) {
+		_debug("not up to date");
+		ret = afs_prepare_page(vnode, page, key, offset, to);
+		if (ret < 0) {
+			kfree(candidate);
+			_leave(" = %d [prep]", ret);
+			return ret;
+		}
+		SetPageUptodate(page);
+	}



Why do you call SetPageUptodate when the page is not up to date?
That leaks uninitialised data, AFAIKS.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-07 Thread Nick Piggin
On Tue, Mar 06, 2007 at 02:20:55PM -0800, David Miller wrote:
 From: Robert Olsson [EMAIL PROTECTED]
 Date: Tue, 6 Mar 2007 14:26:04 +0100
 
  David Miller writes:
   
Actually, more accurately, the conflict exists in how this GC
logic is implemented.  The core issue is that hash table size
guides the GC processing, and hash table growth therefore
modifies those GC goals.  So with the patch below we'll just
keep growing the hash table instead of giving GC some time to
try to keep the working set in equilibrium before doing the
hash grow.
   
   AFAIK the equilibrium is a resizing function as well, but using a fixed 
   hash table. So can we do without equilibrium resizing if tables 
   are dynamic?  I think so
  
   With the hash data structure we could monitor the average chain 
   length or just size and resize hash after that.
 
 I'm not so sure, it may be a mistake to eliminate the equilibrium
 logic.  One error I think it does have is the usage of chain length.
 
 Even a nearly perfect hash has small lumps in distribution, and we
 should not penalize entries which fall into these lumps.
 
 Let us call T the threshold at which we would grow the routing hash
 table.  As we approach T we start to GC.  Let's assume hash table
 has shift = 2, and T would (with the T=N+(N>>1) algorithm) therefore be
 6.
 
 TABLE:[0] DST1, DST2
   [1] DST3, DST4, DST5
 
 DST6 arrives, what should we do?
 
 If we just accept it and don't GC some existing entries, we
 will grow the hash table.  This is the wrong thing to do if
 our true working set is smaller than 6 entries and thus some
 of the existing entries are unlikely to be reused and thus
 could be purged to keep us from hitting T.
 
 If they are all active, growing is the right thing to do.
 
 This is the crux of the whole routing cache problem.

I guess this is similar to our problems with bdev and filesystem
caches as well.

What we do in that case (as you would know), is to let the caches
expand to the size of memory, and the problem just becomes balancing
their relative importance. We just try to go with a reasonable default,
and provide a knob or two for fine tuning.


 I am of the opinion that LRU, for routes not attached to sockets, is
 probably the best thing to do here.
 
 Furthermore at high packet rates, the current rt_may_expire() logic
 probably is not very effective since its granularity is limited to
 jiffies.  We can quite easily create 100,000 or more entries per
 jiffie when HZ=100 during rDOS, for example.  So perhaps some global
 LRU algorithm using ktime is more appropriate.
 
 Global LRU is not easy without touching a lot of memory.  But I'm
 sure some clever trick can be discovered by someone :)

Well we do a pseudo LRU in most vm/vfs caches, where we just set a
bit if it has been touched since last checked. Add another working
set list to promote popular routes into, and you have something
like our pagecache active/inactive reclaim.
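
A rough sketch of that scheme, with hypothetical structure and function names
(nothing here is the actual pagecache code):

#include <linux/list.h>

struct cache_entry {
	struct list_head	lru;
	unsigned long		referenced;	/* set on access, cleared by scan */
};

static void entry_touch(struct cache_entry *e)
{
	e->referenced = 1;		/* cheap: no list manipulation on access */
}

/* scanner: entries referenced since the last pass get promoted to the
 * "active" list, the rest become reclaim candidates -- roughly the
 * active/inactive idea */
static int entry_scan(struct cache_entry *e, struct list_head *active)
{
	if (e->referenced) {
		e->referenced = 0;
		list_move(&e->lru, active);	/* promote */
		return 0;			/* keep */
	}
	return 1;				/* reclaim candidate */
}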

I don't know if that really applies here, but it might if you decide
to try hooking this cache into the slab shrinker thingy...
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Nick Piggin
On Mon, Mar 05, 2007 at 08:26:32PM -0800, David Miller wrote:
 
 This is essentially a port of Nick Piggin's dcache hash table
 patches to the routing cache.  It solves the locking issues
 during table grow/shrink that I couldn't handle properly last
 time I tried to code up a patch like this.
 
 But one of the core issues of this kind of change still remains.
 There is a conflict between the desire of routing cache garbage
 collection to reach a state of equilibrium and the hash table
 grow code's desire to match the table size to the current state
 of affairs.
 
 Actually, more accurately, the conflict exists in how this GC
 logic is implemented.  The core issue is that hash table size
 guides the GC processing, and hash table growth therefore
 modifies those GC goals.  So with the patch below we'll just
 keep growing the hash table instead of giving GC some time to
 try to keep the working set in equilibrium before doing the
 hash grow.
 
 One idea is to put the hash grow check in the garbage collector,
 and put the hash shrink check in rt_del().
 
 In fact, it would be a good time to perhaps hack up some entirely
 new passive GC logic for the routing cache.
 
 BTW, another thing that plays into this is that Robert's TRASH work
 could make this patch not necessary :-)
 
 Finally, I know that (due to some of Nick's helpful comments the
 other day) that I'm missing some rcu_assign_pointer()'s in here.
 Fixes in this area are most welcome.

Cool! I have some fixes for the rcu barrier issues, with some C-style
comments and questions :)

I was going to send you a fix first for the rcu barriers, then a
second to convert the read-side to a barrier-less one that I described,
however considering that your patch is a WIP in progress anyway, I
won't worry too much about the normal protocol.

I _think_ my reasoning regarding the rcu barriers and grace periods
is correct. I'll keep thinking about it though. (Paul cc'ed).

I'm not so familiar with this code, so I have sprinkled around a lot
of comments that could be pure crap ;) They are mainly just to help
you ensure that you cover all bases... compile tested only at this
stage.

--
Index: linux-2.6/net/ipv4/route.c
===
--- linux-2.6.orig/net/ipv4/route.c
+++ linux-2.6/net/ipv4/route.c
@@ -311,6 +311,8 @@ static void rthash_free(struct rt_hash_b
 static unsigned int rt_hash_code(struct rt_hash *hashtable,
 u32 daddr, u32 saddr)
 {
+   /* BUG_ON(!rcu_read_protected()) */
+
	return (jhash_2words(daddr, saddr, rt_hash_rnd)
		& hashtable->mask);
 }
@@ -343,11 +345,16 @@ static void rt_hash_resize_work(struct w
 
old_rt_hash = rt_hash;
/*
-* ensure that if the reader sees the new dentry_hash,
-* then they will also see the old_dentry_hash assignment,
-* above.
+* ensure that if the reader sees the new rt_hash, then they will also
+* see the old_rt_hash assignment, above. synchronize_rcu() is used
+* rather than smp_wmb(), in order to avoid the smp_rmb() in the
+* read-side. However synchronize_rcu() also implies a smp_wmb(), so
+* that also means we can skip rcu_assign_pointer().
+*
+* The readers can then also skip rcu_dereference, because a grace
+* period implies that all readers have performed memory barriers.
 */
-   smp_wmb();
+   synchronize_rcu();
rt_hash = new_hash;
synchronize_rcu();
 
@@ -1100,6 +1107,8 @@ static int rt_intern_hash(struct rt_hash
int chain_length;
int attempts = !in_softirq();
 
+   /* BUG_ON(!rcu_read_protected()) */
+
 restart:
chain_length = 0;
min_score = ~(u32)0;
@@ -1286,6 +1295,8 @@ static void rt_del(struct rt_hash *h, un
 {
struct rtable **rthp;
 
+   /* BUG_ON(!rcu_read_protected()) */
+
spin_lock_bh(rt_hash_lock_addr(hash));
ip_rt_put(rt);
	for (rthp = &h->table[hash].chain; *rthp;
@@ -1328,12 +1339,24 @@ void ip_rt_redirect(__be32 old_gw, __be3
 
	for (i = 0; i < 2; i++) {
		for (k = 0; k < 2; k++) {
-   struct rt_hash *h = rt_hash;
-   unsigned hash = rt_hashfn(h, daddr, skeys[i], ikeys[k]);
+   struct rt_hash *h;
+   unsigned hash;
+
+   /*
+* rcu_read_lock() must cover the load of rt_hash, in
+* order to satisfy our RCU protected dynamic hash
+* sizing scheme; and it must also cover the hash list
+* traversal, to satisfy our RCU protected lockless
+* hash entry lookups.
+*
+* This note applies throughout the file.
+*/
+   rcu_read_lock();
+   h = 

Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Nick Piggin
On Tue, Mar 06, 2007 at 01:17:06AM -0800, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Tue, 6 Mar 2007 10:11:12 +0100
 
  @@ -1449,6 +1472,12 @@ static struct dst_entry *ipv4_negative_a
%u.%u.%u.%u/%02x dropped\n,
  NIPQUAD(rt-rt_dst), rt-fl.fl4_tos);
   #endif
  +   /* XXX:
  +* What if rt does not exist in rt_hash, but is in
  +* old_rt_hash? Don't we have to also check there?
  +* Similar questions for a couple of other places that
  +* look at rt_hash, but not old_rt_hash.
  +*/
  rt_del(h, hash, rt);
  ret = NULL;
  }
 
 For the cases like ip_rt_redirect() I made the decision that we'll
 just not add the complexity of having to look in the old_rt_hash
 table.
 
 In these kinds of cases it's OK to miss events, they will just happen
 again.
 
 It's one of the nice things about the routing cache, if you lose
 information it's OK because we'll just cook up a new entry from
 the persistent backing store that is the real routing tables.
 And for events like redirects, if we miss it, we'll just send the
 packet to the wrong next-hop again and receive another redirect
 message which we'll (hopefully) propagate to a routine cache
 entry.

Ah, that's a very neat trick. OK so with that question out of the
way, there _may_ just be a few other places where you're working
with an rt_hash table outside an rcu read critical section.

I tried to fix up a couple of obvious ones, and I've just put in
the BUG_ON assertions for you to verify the rest.

 Thanks for your feedback patch Nick, I'll process it tomorrow
 hopefully after getting some sleep.

No problem. Thanks.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH]: Dynamically sized routing cache hash table.

2007-03-06 Thread Nick Piggin
On Tue, Mar 06, 2007 at 10:23:44AM +0100, Eric Dumazet wrote:
 On Tuesday 06 March 2007 10:11, Nick Piggin wrote:
 
  Cool! I have some fixes for the rcu barrier issues, with some C-style
  comments and questions :)
 
  I was going to send you a fix first for the rcu barriers, then a
  second to convert the read-side to a barrier-less one that I described,
  however considering that your patch is a WIP in progress anyway, I
  won't worry too much about the normal protocol.
 
  I _think_ my reasoning regarding the rcu barriers and grace periods
  is correct. I'll keep thinking about it though. (Paul cc'ed).
 
  I'm not so familiar with this code, so I have sprinkled around a lot
  of comments that could be pure crap ;) They are mainly just to help
  you ensure that you cover all bases... compile tested only at this
  stage.
 
 I think we missed :
 
 +static void rt_hash_resize_work(struct work_struct *work)
 
 +
 + *head = rth->u.dst.rt_next;
 +
 + hash = rt_hashfn(rt_hash,
 +  rth->fl.fl4_dst,
 +  rth->fl.fl4_src,
 +  iface);
 + rth->u.dst.rt_next = rt_hash->table[hash].chain;
 + rt_hash->table[hash].chain = rth;
 
 This really needs some ..._del_rcu()/..._add_rcu()_ ... primitives, no ?
 Or else a reader might be very confused...

I'm not sure... this code really depends on the hash table management,
rather than the management of the hash tables, if you understand me ;)

From what I can _see_, this is similar to how rt_intern_hash does it.
I don't know exactly why rt_intern_hash can get away without using
rcu_assign_pointer in some cases, however:

Note that we don't need an rcu_assign_pointer for this, because the
memory operations that initialized the entry have already been ordered
when it was first inserted.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Nick Piggin

Mark Groves wrote:

Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Your issue is not related, firstly because the page_mkclean bug did not
exist before 2.6.19 kernels.

Anyway, I had a look at your bugzilla test-case and managed to slim it
down to something that easily shows what the problem is (available on
request) -- the problem is that recipient of the sendfile is seeing
modifications that occur to the source file _after_ the sender has
completed the sendfile, because the file pages are not copied but
queued.

I think the usual approach to what you are trying to do is to set TCP_CORK,
then write(2) the header into the socket, then sendfile directly from the
file you want.
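
In userspace terms, that approach looks something like the sketch below
(error handling omitted; 'sock', 'file_fd', the header buffer and the lengths
are assumed to come from your application):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

static void send_with_header(int sock, int file_fd,
			     const void *hdr, size_t hdr_len, size_t file_len)
{
	int on = 1, off = 0;
	off_t offset = 0;

	/* cork the socket so the header and file data go out in full frames */
	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	write(sock, hdr, hdr_len);			/* the header */
	sendfile(sock, file_fd, &offset, file_len);	/* file pages, no copy */
	/* uncork to flush whatever is left queued */
	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}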

Another approach I guess is to implement an ack in your userland protocol
so you do not modify the sendfile source file until the client acks that
it has all the data.

I'm not sure if there are any other usual ways to do this (ie. a barrier
for sendfile, to ensure it will not pick up future modifications to the
file). netdev cc'ed, someone there might have additional comments.

Please close this bug if/when you are satisfied it is not a kernel problem.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Nick Piggin

Evgeniy Polyakov wrote:

On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote:



Isn't the provided solution just an in-kernel variant of the
SCHED_FIFO set from userspace? Why should the kernel be able to mark some
users as having higher priority?
What if the workload of the system is targeted not at maximum TCP
performance, but at maximum other-task performance, which will be broken
by the provided patch?


David's line of thinking for a solution sounds better to me. This patch
does not prevent the process from being preempted (for potentially a long
time), by any means.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rename *MEMALLOC flags

2006-08-13 Thread Nick Piggin

Paul Jackson wrote:

Daniel wrote:


Inventing a new name for an existing thing is very poor taste on grounds of
grepability alone.



I wouldn't say 'very poor taste' -- just something that should be
done infrequently, with good reason, and with reasonable concensus,
especially from the key maintainers in the affected area.

Good names are good taste, in my book.  But stable naming is good too.

I wonder what Nick thinks of this?  Looks like he added
__GFP_NOMEMALLOC a year ago, following the naming style of PF_MEMALLOC.

I added him to the cc list.



__GFP_NOMEMALLOC was added to prevent mempool backed allocations from
accessing the emergency reserve. Because that would just shift deadlocks
from mempool safe sites to those which have not been converted.

PF_MEMALLOC is a good name: PF_MEMALLOC says that the task is currently
allocating memory. It does not say anything about the actual allocator
implementation details to handle this (1. don't recurse into reclaim; 2.
allow access to reserves), but that is a good thing.

__GFP_NOMEMALLOC and __GFP_MEMALLOC are poorly named (I take the blame).
It isn't that the task is suddenly no longer allocating in the context
of an allocation, it is just that you want to allow or deny access to
the reserve.

__GFP_NOMEMALLOC should be something like __GFP_EMERG_NEVER and
__GFP_MEMALLOC should be _ALWAYS. Or something like that.

NOMEMALLOC is specific enough that I don't mind a rename at this stage.
Renaming PF_MEMALLOC would be wrong, however.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/7] AMSO1100 Memory Management.

2006-06-16 Thread Nick Piggin

Tom Tucker wrote:

On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote:


On Wed, 07 Jun 2006 15:06:55 -0500
Steve Wise [EMAIL PROTECTED] wrote:



+void c2_free(struct c2_alloc *alloc, u32 obj)
+{
+	spin_lock(&alloc->lock);
+	clear_bit(obj, alloc->table);
+	spin_unlock(&alloc->lock);
+}


The spinlock is unneeded here.



Good point.


Really?

clear_bit does not give you any memory ordering, so you can have
the situation where another CPU sees the bit cleared, but this
CPU still has stores pending to whatever it is being freed. Or
any number of other nasty memory ordering badness.

I'd just use the spinlocks, and prepend the clear_bit with a
double underscore (so you get the non-atomic version), if that
is appropriate.

The spinlocks nicely handle all the memory ordering issues, and
serve to document concurrency issues. If you need every last bit
of performance and scalability, that's OK, but you need comments
and I suspect you'd need more memory barriers.
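
Concretely, something like the following sketch is what I mean (assuming the
c2_alloc structure from the quoted patch): keep the lock, and use the
non-atomic bitop under it, since the lock already provides the atomicity and
the ordering.

void c2_free(struct c2_alloc *alloc, u32 obj)
{
	spin_lock(&alloc->lock);
	__clear_bit(obj, alloc->table);	/* non-atomic is fine under the lock */
	spin_unlock(&alloc->lock);
}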

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] net: percpufy frequently used vars -- add percpu_counter_mod_bh

2006-03-09 Thread Nick Piggin

Ravikiran G Thirumalai wrote:

On Thu, Mar 09, 2006 at 07:14:26PM +1100, Nick Piggin wrote:


Ravikiran G Thirumalai wrote:



Here's a patch making x86_64 local_t to 64 bits like other 64 bit arches.
This keeps local_t unsigned long.  (We can change it to signed value 
along with other arches later in one go I guess) 



Why not just keep naming and structure of interfaces consistent with
atomic_t?

That would be signed and 32-bit. You then also have a local64_t.



No, local_t is supposed to be 64-bits on 64bits arches and 32 bit on 32 bit
arches.  x86_64 was the only exception, so this patch fixes that.




Right. If it wasn't I wouldn't have proposed the change.

Considering that local_t has been broken so that basically nobody
is using it, now is a great time to rethink the types before it
gets fixed and people start using it.

And modelling the type on the atomic types would make the most
sense because everyone already knows them.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Nick Piggin

Benjamin LaHaise wrote:


Hello Andrew et al,

The patch below adds a fast path that avoids the atomic dec and test 
operation and spinlock acquire/release on page free.  This is especially 
important to the network stack which uses put_page() to free user 
buffers.  Removing these atomic ops helps improve netperf on the P4 
from ~8126Mbit/s to ~8199Mbit/s (although that number fluctuates quite a 
bit with some runs getting 8243Mbit/s).  There are probably better 
workloads to see an improvement from this on, but removing 3 atomics and 
an irq save/restore is good.


-ben



You can't do this because you can't test PageLRU like that.

Have a look in the lkml archives a few months back, where I proposed
a way to do this for __free_pages(). You can't do it for put_page.

BTW I have quite a large backlog of patches in -mm which should end
up avoiding an atomic or two around these parts.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Nick Piggin

Benjamin LaHaise wrote:


On Mon, Mar 06, 2006 at 04:50:39PM -0800, Andrew Morton wrote:


Am a bit surprised at those numbers.




Because userspace has to do peculiar things to get its pages taken off the
LRU.  What exactly was that application doing?



It's just a simple send() and recv() pair of processes.  Networking uses 
pages for the buffer on user transmits.  Those pages tend to be freed 
in irq context on transmit or in the receiver if the traffic is local.




The patch adds slight overhead to the common case while providing
improvement to what I suspect is a very uncommon case?



At least on any modern CPU with branch prediction, the test is essentially 
free (2 memory reads that pipeline well, iow 1 cycle, maybe 2).  The 
upside is that you get to avoid the atomic (~17 cycles on a P4 with a 
simple test program, the penalty doubles if there is one other instruction 
that operates on memory in the loop), disabling interrupts (~20 cycles?, I 
don't remember) another atomic for the spinlock, another atomic for 
TestClearPageLRU() and the pushf/popf (expensive as they rely on whatever 
instruction that might still be in flight to complete and add the penalty 
for changing irq state).  That's at least 70 cycles without including the 
memory barrier side effects which can cost 100 cycles+.  Add in the costs 
for the cacheline bouncing of the lru_lock and we're talking *expensive*.





My patches in -mm avoid the lru_lock and disabling/enabling interrupts
if the page is not on lru too, btw.

So, a 1-2 cycle cost for a case that normally takes from 17 to 100+ cycles?  
I think that's worth it given the benefits.


Also, I think the common case (page cache read / map) is something that 
should be done differently, as those atomics really do add up to major 
pain.  Using rcu for page cache reads would be truely wonderful, but that 
will take some time.





It is not very difficult to implement (and is something I intend to look
at after I finish my lockless pagecache). But it has quite a lot of
problems, including a potentially big (temporal) increase of cache footprint
to process the pages, more CPU time in general to traverse the lists, and
increased over/underflows in the per cpu pagelists. Possibly even worse
would be the increased overhead on the RCU infrastructure and potential
OOM conditions.

Not to mention the extra logic involved to either retry, or fall back to
get/put in the case that the userspace target page is not resident.

I'd say it will turn out to be more trouble than it's worth, for the miserly
cost of avoiding one atomic_inc, and one atomic_dec_and_test on page-local
data that will be in L1 cache. I'd never turn my nose up at anyone just
having a go though :)


--

Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Nick Piggin

Benjamin LaHaise wrote:


On Tue, Mar 07, 2006 at 12:53:27PM +1100, Nick Piggin wrote:


You can't do this because you can't test PageLRU like that.

Have a look in the lkml archives a few months back, where I proposed
a way to do this for __free_pages(). You can't do it for put_page.



Even if we know that we are the last user of the page (the count is 1)?  
Who can bump the page's count then?





Yes. vmscan.

Your page_count and PageLRU tests have no synchronisation between
them, which is the problem AFAIKS. Anything can happen between them
and they can probably also be executed out of order (the loads).


BTW I have quite a large backlog of patches in -mm which should end
up avoiding an atomic or two around these parts.



That certainly looks like it will help.  Not taking the spinlock 
unconditionally gets rid of quite a bit of the cost.





Cool.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Nick Piggin

Benjamin LaHaise wrote:

On Tue, Mar 07, 2006 at 01:04:36PM +1100, Nick Piggin wrote:

I'd say it will turn out to be more trouble than it's worth, for the miserly
cost of avoiding one atomic_inc, and one atomic_dec_and_test on page-local
data that will be in L1 cache. I'd never turn my nose up at anyone just
having a go though :)



The cost is anything but miserly.  Consider that every lock instruction is 
a memory barrier which takes your OoO CPU with lots of instructions in flight 
to ramp down to just 1 for the time it takes that instruction to execute.  
That synchronization is what makes the atomic expensive.




Yeah x86(-64) is a _little_ worse off in that regard because its locks
imply rmbs.

But I'm saying the cost is miserly compared to the likely overheads
of using RCU-ed page freeing, when taken as impact on the system as a
whole.

Though definitely if we can get rid of atomic ops for free in any low
level page handling functions in mm/ then we want to do that.

In the case of netperf, I ended up with a 2.5Gbit/s (~30%) performance 
improvement through nothing but microoptimizations.  There is method to 
my madness. ;-)




Well... it was wrong too ;)

But as you can see, I'm not against microoptimisations either and I'm
glad others, like yourself, are looking at the problem too.

The 30% number is very impressive. I'd be interested to see what the
stuff currently in -mm is worth.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] avoid atomic op on page free

2006-03-06 Thread Nick Piggin

Chen, Kenneth W wrote:

Nick Piggin wrote on Monday, March 06, 2006 6:05 PM


My patches in -mm avoid the lru_lock and disabling/enabling interrupts
if the page is not on lru too, btw.



Can you put the spin lock/unlock inside TestClearPageLRU()?  The
difference is subtle though.



That's the idea, but you just need to do a little bit more so as not to
introduce a race.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc2/2.6.16-rc2-mm1/broken-out/mm-never-clearpagelru-released-pages.patch


- Ken


--- ./mm/swap.c.orig	2006-03-06 19:25:10.680967542 -0800
+++ ./mm/swap.c	2006-03-06 19:27:02.334286487 -0800
@@ -210,14 +210,16 @@ int lru_add_drain_all(void)
 void fastcall __page_cache_release(struct page *page)
 {
 	unsigned long flags;
-	struct zone *zone = page_zone(page);
+	struct zone *zone;
 
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	if (TestClearPageLRU(page))
+	if (TestClearPageLRU(page)) {
+		zone = page_zone(page);
+		spin_lock_irqsave(&zone->lru_lock, flags);
 		del_page_from_lru(zone, page);
-	if (page_count(page) != 0)
-		page = NULL;
-	spin_unlock_irqrestore(&zone->lru_lock, flags);
+		if (page_count(page) != 0)
+			page = NULL;
+		spin_unlock_irqrestore(&zone->lru_lock, flags);
+	}
 	if (page)
 		free_hot_page(page);
 }





--
SUSE Labs, Novell Inc.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET : No need to update last_rx in loopback driver

2006-02-09 Thread Nick Piggin

Eric Dumazet wrote:


Signed-off-by: Eric Dumazet [EMAIL PROTECTED]




--- a/include/linux/netdevice.h	2006-02-07 11:55:42.0 +0100
+++ b/include/linux/netdevice.h	2006-02-09 09:23:15.0 +0100
@@ -649,6 +649,19 @@
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 


One minor suggestion I would have is to just load jiffies once
(it is volatile). But it gets a little messy if you want to be
tricky and still load it zero times in the case that the first
test evaluates true. Maybe not worth the fuss.

{
	unsigned long now;

#ifndef CONFIG_SMP
	now = jiffies;
#else
	if (!(dev->flags & IFF_SLAVE) || ((now = jiffies) == dev->last_rx))
		return;
#endif
	dev->last_rx = now;
}


+static inline void netif_last_rx_update(struct net_device *dev)
+{
+#ifdef CONFIG_SMP
+	/*
+	 * In order to avoid cache line ping pongs on last_rx, we check
+	 * if the device is a slave, and if last_rx really has to be
+	 * modified.
+	 */
+	if (!(dev->flags & IFF_SLAVE) || (dev->last_rx == jiffies))
+		return;
+#endif
+	dev->last_rx = jiffies;
+}
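For illustration only, a hypothetical call site in a driver's receive path
would simply replace the unconditional store (this is my sketch of the usage,
not part of Eric's patch):

	/* Before: every received packet dirties dev->last_rx on every CPU. */
	dev->last_rx = jiffies;

	/* After: only bonding slaves update it, and only when it would change. */
	netif_last_rx_update(dev);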


--
SUSE Labs, Novell Inc.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 0/3] TCP/IP Critical socket communication mechanism

2005-12-14 Thread Nick Piggin

David S. Miller wrote:

From: Matt Mackall [EMAIL PROTECTED]
Date: Wed, 14 Dec 2005 21:02:50 -0800



There need to be two rules:

iff the global memory critical flag is set
- allocate from the global critical receive pool on receive
- return the packet to the global pool if it is not destined for a
  socket with an attached send mempool



This shuts off a router and/or firewall just because iSCSI or NFS peed
in its pants.  Not really acceptable.



But that should only happen (shut off a router and/or firewall) in cases
where we now completely deadlock and never recover, including shutting off
the router and firewall, because they don't have enough memory to recv
packets either.




I think this will provide the desired behavior



It's not desirable.

What if iSCSI is protected by IPSEC, and the key management daemon has
to process a security association expiration and negotiate a new one
in order for iSCSI to further communicate with its peer when this
memory shortage occurs?  It needs to send packets back and forth with
the remote key management daemon in order to do this, but since you
cut it off with this critical receive pool, the negotiation will never
succeed.



I guess IPSEC would be a critical socket too, in that case. Sure
there is nothing we can do if the daemon insists on allocating lots
of memory...


This stuff won't work.  It's not a generic solution and that's
why it has more holes than swiss cheese. :-)


True, it will have holes. I think something that is complementary and
would be desirable is to simply limit the amount of in-flight writeout
that things like NFS allow (or used to allow; I haven't checked for a
while, and there were noises about it getting better).

--
SUSE Labs, Novell Inc.



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] struct file cleanup : the very large file_ra_state is now allocated only on demand.

2005-08-17 Thread Nick Piggin

Andi Kleen wrote:



I would just set the ra pointer to a single global structure if the allocation
fails. Then you can avoid all the other checks. It will slow down
things and trash some state, but not fail and nobody should expect good 
performance after out of memory anyways. The only check still

needed would be on freeing.




You don't want to always have bad performance though, so you
could attempt to allocate if either the pointer is null _or_ it
points to the global structure?
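Roughly, the combination of Andi's fallback and the retry suggested above might
look like the sketch below. It assumes the on-demand scheme under discussion,
where the readahead state hangs off the file as a pointer; the field and helper
names are hypothetical and this is not the posted patch:

/* One shared, always-valid readahead state used when the allocation fails.
 * Its contents get trashed by concurrent users, which only costs readahead
 * accuracy, never correctness. */
static struct file_ra_state global_ra_state;

static struct file_ra_state *file_get_ra_state(struct file *filp)
{
	struct file_ra_state *ra = filp->f_ra;	/* hypothetical pointer field */

	/* Retry the allocation if we never got one, or if an earlier
	 * attempt failed and left us pointing at the shared fallback. */
	if (!ra || ra == &global_ra_state) {
		ra = kzalloc(sizeof(*ra), GFP_KERNEL);
		if (!ra)
			return &global_ra_state;	/* degrade, don't fail */
		filp->f_ra = ra;			/* locking/cmpxchg omitted */
	}
	return ra;
}

static void file_put_ra_state(struct file *filp)
{
	/* The only extra check Andi mentions: never free the fallback. */
	if (filp->f_ra && filp->f_ra != &global_ra_state)
		kfree(filp->f_ra);
}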




-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] struct file cleanup : the very large file_ra_state is now allocated only on demand.

2005-08-17 Thread Nick Piggin

Andi Kleen wrote:


You don't want to always have bad performance though, so you
could attempt to allocate if either the pointer is null _or_ it
points to the global structure?



Remember it's after a GFP_KERNEL OOM. If that fails most likely
you have deadlocked somewhere else already because Linux's handling
of that is so bad. Suboptimal readahead somewhere is the smallest
of your problems.




True. And in practice it may not even be able to happen at the
moment if the page allocator still doesn't fail small order allocs.

But I guess the dream one day is to robustly handle any OOM :\




-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html