Re: Large stack usage in fs code (especially for PPC64)

2008-11-18 Thread Christoph Hellwig
On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
 On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
  Note, I was using a default config that had CONFIG_IRQSTACKS off and
  CONFIG_PPC_64K_PAGES on.
 
 For one, we definitely need to turn IRQSTACKS on by default ... In fact,
 I'm pondering just removing the option.

It shouldn't be a user-visible option.  It shouldn't be on x86 either, but
there it always gets into the ideological 4k-stacks flamewar, so people
usually gave up after a while instead of doing the sensible 8k +
irqstacks by default, 4k as an option.



Re: Large stack usage in fs code (especially for PPC64)

2008-11-18 Thread Nick Piggin
On Tuesday 18 November 2008 09:53, Paul Mackerras wrote:

 I'd love to be able to use a 4k base page size if I could still get
 the reduction in page faults and the expanded TLB reach that we get
 now with 64k pages.  If we could allocate the page cache for large
 files with order-4 allocations wherever possible that would be a good
 start.

That can still have nasty side-effects like fragmentation and 64k
granular reclaim. It also adds complexity to mapping the pages to
userspace.

Christoph's patchset IIRC also only did page cache, whereas I suppose
your kernbench workload is gaining mainly from anonymous page faults.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-18 Thread Nick Piggin
On Tuesday 18 November 2008 13:08, Linus Torvalds wrote:
 On Tue, 18 Nov 2008, Paul Mackerras wrote:
  Also, you didn't respond to my comments about the purely software
  benefits of a larger page size.

 I realize that there are benefits. It's just that the downsides tend to
 swamp the upsides.

 The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
 can do good TLB's with essentially gang lookups, giving almost effective
 page sizes of 32kB with hardly any of the downsides. Couple that with

It's much harder to do this with powerpc I think because they would need
to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
wouldn't they?


 low-latency fault handling (for not when you miss in the TLB, but when
 something really isn't in the page tables), and it seems to be seldom the
 biggest issue.

 (Don't get me wrong - TLB's are not unimportant on x86 either. But on x86,
 things are generally much better).

The per-page processing costs are interesting too, but IMO there is more
work that should be done to speed up order-0 pages. The patches I had to
remove the sync instruction for smp_mb() in unlock_page sped up pagecache
throughput (populate, write(2), reclaim) on my G5 by something really
crazy like 50% (most of that's in, but I'm still sitting on that fancy
unlock_page speedup to remove the final smp_mb).

I suspect some of the costs are also in powerpc specific code to insert
linux ptes into their hash table. I think some of the synchronisation for
those could possibly be shared with generic code so you don't need the
extra layer of locks there.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-18 Thread Ingo Molnar

* Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
  On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
   Note, I was using a default config that had CONFIG_IRQSTACKS off and
   CONFIG_PPC_64K_PAGES on.
  
  For one, we definitely need to turn IRQSTACKS on by default ... In fact,
  I'm pondering just removing the option.
 
 It shouldn't be a user-visible option.  It shouldn't be on x86 either, 
 but there it always gets into the ideological 4k-stacks flamewar, so 
 people usually gave up after a while instead of doing the sensible 
 8k + irqstacks by default, 4k as an option.

yep, i tend to agree that 8k + irqstacks on both 32-bit and 64-bit 
would be the sane default. There's just too much gcc and other noise 
for us not to have that cushion by default on 32-bit too. 64-bit is 
already there, on 32-bit we should decouple irqstacks from 4K stacks 
and just turn irqstacks on by default.

Life's too short to fight kernel stack overflows - and now we've got 
the stack-tracer that will record and show the frame of the worst-ever 
kernel stack situation the system was in since bootup. (see Steve's 
tracer output in this thread)

Ingo


Re: Large stack usage in fs code (especially for PPC64)

2008-11-18 Thread Paul Mackerras
Nick Piggin writes:

 It's much harder to do this with powerpc I think because they would need
 to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
 wouldn't they?

Yeah, the hashed page table sucks.  Film at 11, as they say. :)

Paul.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-18 Thread Linus Torvalds


On Tue, 18 Nov 2008, Nick Piggin wrote:
 
  The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
  can do good TLB's with essentially gang lookups, giving almost effective
  page sizes of 32kB with hardly any of the downsides. Couple that with
 
 It's much harder to do this with powerpc I think because they would need
 to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
 wouldn't they?

Oh, absolutely. It's why I despise hashed page tables. It's a broken 
concept.

 The per-page processing costs are interesting too, but IMO there is more
 work that should be done to speed up order-0 pages. The patches I had to
 remove the sync instruction for smp_mb() in unlock_page sped up pagecache
 throughput (populate, write(2), reclaim) on my G5 by something really
 crazy like 50% (most of that's in, but I'm still sitting on that fancy
 unlock_page speedup to remove the final smp_mb).
 
 I suspect some of the costs are also in powerpc specific code to insert
 linux ptes into their hash table. I think some of the synchronisation for
 those could possibly be shared with generic code so you don't need the
 extra layer of locks there.

Yeah, the hashed page tables get extra costs from the fact that it can't 
share the software page tables with the hardware ones, and the associated 
coherency logic. It's even worse at unmap time, I think.

Linus


Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

I've been hitting stack overflows on a PPC64 box, so I ran the ftrace 
stack_tracer and part of the problem with that box is that it can nest 
interrupts too deep. But what also worries me is that there's some heavy 
hitters of stacks in generic code. Namely the fs directory has some.

Here's the full dump of the stack (PPC64):

[EMAIL PROTECTED] ~ cat /debug/tracing/stack_trace 
Depth    Size   Location    (56 entries)
-----    ----   --------
  0)14032 112   ftrace_call+0x4/0x14
  1)13920 128   .sched_clock+0x20/0x60
  2)13792 128   .sched_clock_cpu+0x34/0x50
  3)13664 144   .cpu_clock+0x3c/0xa0
  4)13520 144   .get_timestamp+0x2c/0x50
  5)13376 192   .softlockup_tick+0x100/0x220
  6)13184 128   .run_local_timers+0x34/0x50
  7)13056 160   .update_process_times+0x44/0xb0
  8)12896 176   .tick_sched_timer+0x8c/0x120
  9)12720 160   .__run_hrtimer+0xd8/0x130
 10)12560 240   .hrtimer_interrupt+0x16c/0x220
 11)12320 160   .timer_interrupt+0xcc/0x110
 12)12160 240   decrementer_common+0xe0/0x100
 13)11920 576   0x80
 14)11344 160   .usb_hcd_irq+0x94/0x150
 15)11184 160   .handle_IRQ_event+0x80/0x120
 16)11024 160   .handle_fasteoi_irq+0xd8/0x1e0
 17)10864 160   .do_IRQ+0xbc/0x150
 18)10704 144   hardware_interrupt_entry+0x1c/0x3c
 19)10560 672   0x0
 20) 9888 144   ._spin_unlock_irqrestore+0x84/0xd0
 21) 9744 160   .scsi_dispatch_cmd+0x170/0x360
 22) 9584 208   .scsi_request_fn+0x324/0x5e0
 23) 9376 144   .blk_invoke_request_fn+0xc8/0x1b0
 24) 9232 144   .__blk_run_queue+0x48/0x60
 25) 9088 144   .blk_run_queue+0x40/0x70
 26) 8944 192   .scsi_run_queue+0x3a8/0x3e0
 27) 8752 160   .scsi_next_command+0x58/0x90
 28) 8592 176   .scsi_end_request+0xd4/0x130
 29) 8416 208   .scsi_io_completion+0x15c/0x500
 30) 8208 160   .scsi_finish_command+0x15c/0x190
 31) 8048 160   .scsi_softirq_done+0x138/0x1e0
 32) 7888 160   .blk_done_softirq+0xd0/0x100
 33) 7728 192   .__do_softirq+0xe8/0x1e0
 34) 7536 144   .do_softirq+0xa4/0xd0
 35) 7392 144   .irq_exit+0xb4/0xf0
 36) 7248 160   .do_IRQ+0x114/0x150
 37) 7088 752   hardware_interrupt_entry+0x1c/0x3c
 38) 6336 144   .blk_rq_init+0x28/0xc0
 39) 6192 208   .get_request+0x13c/0x3d0
 40) 5984 240   .get_request_wait+0x60/0x170
 41) 5744 192   .__make_request+0xd4/0x560
 42) 5552 192   .generic_make_request+0x210/0x300
 43) 5360 208   .submit_bio+0x168/0x1a0
 44) 5152 160   .submit_bh+0x188/0x1e0
 45)  4992    1280   .block_read_full_page+0x23c/0x430
 46)  3712    1280   .do_mpage_readpage+0x43c/0x740
 47) 2432 352   .mpage_readpages+0x130/0x1c0
 48) 2080 160   .ext3_readpages+0x50/0x80
 49) 1920 256   .__do_page_cache_readahead+0x1e4/0x340
 50) 1664 160   .do_page_cache_readahead+0x94/0xe0
 51) 1504 240   .filemap_fault+0x360/0x530
 52) 1264 256   .__do_fault+0xb8/0x600
 53) 1008 240   .handle_mm_fault+0x190/0x920
 54)  768 320   .do_page_fault+0x3d4/0x5f0
 55)  448 448   handle_page_fault+0x20/0x5c


Notice at line 45 and 46 the stack usage of block_read_full_page and 
do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start 
of these two:

int block_read_full_page(struct page *page, get_block_t *get_block)
{
struct inode *inode = page->mapping->host;
sector_t iblock, lblock;
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
unsigned int blocksize;
int nr, i;
int fully_mapped = 1;
[...]

static struct bio *
do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
sector_t *last_block_in_bio, struct buffer_head *map_bh,
unsigned long *first_logical_block, get_block_t get_block)
{
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
const unsigned blocksize = 1 << blkbits;
sector_t block_in_file;
sector_t last_block;
sector_t last_block_in_file;
sector_t blocks[MAX_BUF_PER_PAGE];
unsigned page_block;
unsigned first_hole = blocks_per_page;
struct block_device *bdev = NULL;
int length;
int fully_mapped = 1;
unsigned nblocks;
unsigned relative_block;


The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That is 
defined as: 

#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)

Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.

On PPC64 I'm told that the page size is 64K, which makes the above equal 
to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
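
To put numbers on that, here is a minimal user-space sketch (illustration
only, not kernel code; the constant and the 8-byte element size mirror the
ones quoted above) of how the on-stack array size scales with the page size:

#include <stdio.h>

/* arr[] in block_read_full_page() and blocks[] in do_mpage_readpage()
 * both hold 8-byte entries on a 64-bit kernel. */
#define MIN_BLOCK_SIZE 512
#define MAX_BUF_PER_PAGE(page_size) ((page_size) / MIN_BLOCK_SIZE)

int main(void)
{
	unsigned long page_sizes[] = { 4096, 16384, 65536 };

	for (unsigned int i = 0; i < sizeof(page_sizes) / sizeof(page_sizes[0]); i++) {
		unsigned long entries = MAX_BUF_PER_PAGE(page_sizes[i]);

		printf("%3luK pages: %4lu entries -> %5lu bytes per on-stack array\n",
		       page_sizes[i] >> 10, entries, entries * sizeof(void *));
	}
	return 0;
}

With 4K pages each array is 64 bytes; with 64K pages it is 1024 bytes, which
lines up with the ~1280-byte frames at lines 45 and 46 of the trace above.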

The problem with PPC64 is that the 

Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

On Mon, 17 Nov 2008, Steven Rostedt wrote:
 
 Note, I was using a default config that had CONFIG_IRQSTACKS off and
 CONFIG_PPC_64K_PAGES on.

Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
softirqs still use the same stack as the process.

[EMAIL PROTECTED] ~ cat /debug/tracing/stack_trace 
Depth    Size   Location    (59 entries)
-----    ----   --------
  0)12384 192   ftrace_call+0x4/0x14
  1)12192 128   .sched_clock+0x20/0x60
  2)12064 128   .sched_clock_cpu+0x34/0x50
  3)11936 144   .cpu_clock+0x3c/0xa0
  4)11792 144   .get_timestamp+0x2c/0x50
  5)11648 144   .__touch_softlockup_watchdog+0x3c/0x60
  6)11504 192   .softlockup_tick+0xe4/0x220
  7)11312 128   .run_local_timers+0x34/0x50
  8)11184 160   .update_process_times+0x44/0xb0
  9)11024 176   .tick_sched_timer+0x8c/0x120
 10)10848 160   .__run_hrtimer+0xd8/0x130
 11)10688 240   .hrtimer_interrupt+0x16c/0x220
 12)10448 160   .timer_interrupt+0xcc/0x110
 13)10288  96   decrementer_common+0xe0/0x100
 14)10192 784   .ftrace_test_stop_func+0x4c/0x70
 15) 9408 112   ftrace_call+0x4/0x14
 16) 9296 144   .init_buffer_head+0x20/0x60
 17) 9152 256   .cache_alloc_refill+0x5a0/0x740
 18) 8896 160   .kmem_cache_alloc+0xfc/0x140
 19) 8736 144   .alloc_buffer_head+0x3c/0xc0
 20) 8592 176   .alloc_page_buffers+0xb0/0x130
 21) 8416 160   .create_empty_buffers+0x44/0x1a0
 22)  8256    1280   .block_read_full_page+0x3b0/0x430
 23) 6976 144   .blkdev_readpage+0x38/0x60
 24) 6832 176   .read_cache_page_async+0x124/0x230
 25) 6656 160   .read_cache_page+0x50/0xe0
 26) 6496 160   .read_dev_sector+0x58/0x100
 27) 6336 272   .msdos_partition+0xa4/0x880
 28) 6064 240   .rescan_partitions+0x1f0/0x430
 29) 5824 208   .__blkdev_get+0x124/0x3d0
 30) 5616 144   .blkdev_get+0x3c/0x60
 31) 5472 192   .register_disk+0x194/0x1f0
 32) 5280 160   .add_disk+0xe8/0x160
 33) 5120 224   .sd_probe+0x2f0/0x440
 34) 4896 176   .driver_probe_device+0x14c/0x380
 35) 4720 144   .__device_attach+0x38/0x60
 36) 4576 192   .bus_for_each_drv+0xb8/0x120
 37) 4384 160   .device_attach+0x94/0xe0
 38) 4224 144   .bus_attach_device+0x78/0xb0
 39) 4080 240   .device_add+0x54c/0x730
 40) 3840 160   .scsi_sysfs_add_sdev+0x7c/0x2e0
 41) 3680 336   .scsi_probe_and_add_lun+0xb7c/0xc60
 42) 3344 192   .__scsi_add_device+0x11c/0x150
 43) 3152 176   .ata_scsi_scan_host+0x238/0x330
 44) 2976 208   .ata_host_register+0x320/0x3c0
 45) 2768 192   .ata_host_activate+0xf8/0x190
 46) 2576 224   .mv_pci_init_one+0x284/0x430
 47) 2352 160   .pci_device_probe+0x98/0xf0
 48) 2192 176   .driver_probe_device+0x14c/0x380
 49) 2016 160   .__driver_attach+0xd4/0x110
 50) 1856 192   .bus_for_each_dev+0xa8/0x100
 51) 1664 144   .driver_attach+0x40/0x60
 52) 1520 192   .bus_add_driver+0x268/0x340
 53) 1328 176   .driver_register+0x88/0x1d0
 54) 1152 176   .__pci_register_driver+0x90/0x100
 55)  976 144   .mv_init+0x38/0xa0
 56)  832 544   .do_one_initcall+0x78/0x240
 57)  288 192   .kernel_init+0x21c/0x2b0
 58)   96  96   .kernel_thread+0x54/0x70

This is still 12K. Kind of big even for a 16K stack.

-- Steve



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Linus Torvalds


On Mon, 17 Nov 2008, Steven Rostedt wrote:

  45)  4992    1280   .block_read_full_page+0x23c/0x430
  46)  3712    1280   .do_mpage_readpage+0x43c/0x740

Ouch.

 Notice at line 45 and 46 the stack usage of block_read_full_page and 
 do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start 
 of these two:
 
 int block_read_full_page(struct page *page, get_block_t *get_block)
 {
   struct inode *inode = page->mapping->host;
   sector_t iblock, lblock;
   struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];

Yeah, that's unacceptable.

Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry 
array), but as you say:

 On PPC64 I'm told that the page size is 64K, which makes the above equal 
  to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.

Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB 
pages are insane, and 16kB pages are borderline. I've told people so.

The ppc people run databases, and they don't care about sane people 
telling them the big pages suck. It's made worse by the fact that they 
also have horribly bad TLB fills on their broken CPU's, and years and 
years of telling people that the MMU on ppc's are sh*t has only been 
reacted to with "talk to the hand, we know better".

Quite frankly, 64kB pages are INSANE. But yes, in this case they actually 
cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses 
64 bytes, not 1kB.

I suspect the PPC people need to figure out some way to handle this in 
their broken setups (since I don't really expect them to finally admit 
that they were full of sh*t with their big pages), but since I think it's 
a ppc bug, I'm not at all interested in a fix that penalizes the _good_ 
case.

So either make it some kind of (clean) conditional dynamic non-stack 
allocation, or make it do some outer loop over the whole page that turns 
into a compile-time no-op when the page is sufficiently small to be done 
in one go.
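
As a rough sketch of that second option (user-space C, illustrative names and
constants only, not a patch against fs/): loop over the page in fixed-size
batches so the stack cost stays at the 4kB-page equivalent no matter how big
the page is.

#include <stdio.h>

#define SIM_PAGE_SIZE	65536			/* pretend 64kB pages */
#define BLOCKS_PER_PAGE	(SIM_PAGE_SIZE / 512)
#define BATCH		8			/* the 4kB-page case: 8 entries */

static void submit_batch(const int *blocks, int nr)
{
	/* stand-in for mapping and submitting nr buffer heads */
	printf("submitting blocks %d..%d\n", blocks[0], blocks[nr - 1]);
}

int main(void)
{
	int batch[BATCH];	/* fixed stack cost, independent of page size */

	for (int done = 0; done < BLOCKS_PER_PAGE; done += BATCH) {
		int nr = BLOCKS_PER_PAGE - done;

		if (nr > BATCH)
			nr = BATCH;
		for (int i = 0; i < nr; i++)
			batch[i] = done + i;
		submit_batch(batch, nr);
	}
	return 0;
}

When the page size is 4kB the outer loop runs exactly once, so the common case
pays nothing extra.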

Or perhaps say "if you have 64kB pages, you're a moron, and to counteract 
that moronic page size, you cannot do 512-byte granularity IO any more".

Of course, that would likely mean that FAT etc wouldn't work on ppc64, so 
I don't think that's a valid model either. But if the 64kB page size is 
just a database server crazy-people config option, then maybe it's 
acceptable.

Database people usually don't want to connect their cameras or mp3-players 
with their FAT12 filesystems.

Linus


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Pekka Enberg
On Mon, Nov 17, 2008 at 11:18 PM, Linus Torvalds
[EMAIL PROTECTED] wrote:
 I do wonder just _what_ it is that causes the stack frames to be so
 horrid. For example, you have

 18) 8896 160   .kmem_cache_alloc+0xfc/0x140

 and I'm looking at my x86-64 compile, and it has a stack frame of just 8
 bytes (!) for local variables plus the save/restore area (which looks like
 three registers plus frame pointer plus return address). IOW, if I'm
 looking at the code right (so big caveat: I did _not_ do a real stack
 dump!) the x86-64 stack cost for that same function is on the order of 48
 bytes. Not 160.

 Where does that factor-of-three+ difference come from? From the numbers, I
 suspect ppc64 has a 32-byte stack alignment, which may be part of it, and
 I guess the compiler is more eager to use all those extra registers and
 will happily have many more callee-saved regs that are actually used.

 But that still a _lot_ of extra stack.

 Of course, you may have things like spinlock debugging etc enabled. Some
 of our debugging options do tend to blow things up.

Note that kmem_cache_alloc() is likely to contain lots of inlined
functions for both SLAB and SLUB. Perhaps that blows up stack usage on
ppc?


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Andrew Morton
On Mon, 17 Nov 2008 13:23:23 -0800 (PST)
Linus Torvalds [EMAIL PROTECTED] wrote:

 
 
 On Mon, 17 Nov 2008, Andrew Morton wrote:
  
  Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT! :)
  
  I fixed this six years ago.  See http://lkml.org/lkml/2002/6/17/68
 
 Btw, in that thread I also said:
 
   "If we have 64kB pages, such architectures will have to have a bigger 
kernel stack. Which they will have, simply by virtue of having the very 
same bigger page. So that problem kind of solves itself."
 
 and that may still be the right solution - if somebody is so insane that 
 they want 64kB pages, then they might as well have a 64kB kernel stack as 
 well. 

I'd have thought so, but I'm sure we're about to hear how important an
optimisation the smaller stacks are ;)

 Trust me, the kernel stack isn't where you blow your memory with a 64kB 
 page. You blow all your memory on the memory fragmentation of your page 
 cache. I did the stats for the kernel source tree a long time ago, and I 
 think you wasted something like 4GB of RAM with a 64kB page size.
 

Yup.  That being said, the younger me did assert that this is a neater
implementation anyway.  If we can implement those loops without
needing those on-stack temporary arrays then things probably are better
overall.



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Linus Torvalds


On Mon, 17 Nov 2008, Andrew Morton wrote:
 
 Yup.  That being said, the younger me did assert that this is a neater
 implementation anyway.  If we can implement those loops without
 needing those on-stack temporary arrays then things probably are better
 overall.

Sure, if it actually ends up being nicer, I'll not argue with it. But from 
an L1 I$ standpoint (and I$ is often very important, especially for kernel 
loads where loops are fairly rare), it's often _much_ better to do two 
tight loops over two subsystems (filesystem and block layer) than it is 
to do one bigger loop that contains both. If the L1 can fit both subsystem 
paths, you're fine - but if not, you may get a lot more misses.

So it's often nice if you can stage things so that you do a cluster of 
calls to one area, followed by a cluster of calls to another, rather than 
mix it up. 

But numbers talk. And code cleanliness. If somebody has numbers that the 
code size actually goes down for example, or the code is just more 
readable, micro-optimizing cache patterns isn't worth it.

Linus


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Paul Mackerras
Steven Rostedt writes:

 Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
 softirqs still use the same stack as the process.

They shouldn't.  I don't see do_softirq in the trace, though.  Which
functions did you think would be run in a softirq?  It looks to me
like the deepest 10 or so functions are called at hard irq level,
within hrtimer_interrupt called from timer_interrupt.

 [EMAIL PROTECTED] ~ cat /debug/tracing/stack_trace 
 Depth    Size   Location    (59 entries)
 -----    ----   --------
   0)12384 192   ftrace_call+0x4/0x14
   1)12192 128   .sched_clock+0x20/0x60
   2)12064 128   .sched_clock_cpu+0x34/0x50
   3)11936 144   .cpu_clock+0x3c/0xa0
   4)11792 144   .get_timestamp+0x2c/0x50
   5)11648 144   .__touch_softlockup_watchdog+0x3c/0x60
   6)11504 192   .softlockup_tick+0xe4/0x220
   7)11312 128   .run_local_timers+0x34/0x50
   8)11184 160   .update_process_times+0x44/0xb0
   9)11024 176   .tick_sched_timer+0x8c/0x120
  10)10848 160   .__run_hrtimer+0xd8/0x130
  11)10688 240   .hrtimer_interrupt+0x16c/0x220
  12)10448 160   .timer_interrupt+0xcc/0x110
  13)10288  96   decrementer_common+0xe0/0x100

Paul.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Paul Mackerras
Linus Torvalds writes:

 The ppc people run databases, and they don't care about sane people 

And HPC apps, and all sorts of other things...

 telling them the big pages suck. It's made worse by the fact that they 
 also have horribly bad TLB fills on their broken CPU's, and years and 

Taking page faults at a 4k granularity also sucks.  A lot of the
performance improvement from using 64k pages comes just from executing
the page fault path (and other paths in the mm code) less frequently.
That's why we see a performance improvement from 64k pages even on
machines that don't support 64k pages in hardware (like the 21%
reduction in system time on a kernel compile that I reported earlier).
That has nothing to do with TLB miss times or anything to do with the
MMU.

I'd love to be able to use a 4k base page size if I could still get
the reduction in page faults and the expanded TLB reach that we get
now with 64k pages.  If we could allocate the page cache for large
files with order-4 allocations wherever possible that would be a good
start.  I think Christoph Lameter had some patches in that direction
but they didn't seem to get very far.

 years of telling people that the MMU on ppc's are sh*t has only been 
  reacted to with "talk to the hand, we know better".

Well, it's been reacted to with "AIX can use small pages where it
makes sense, and large pages where that makes sense, so why can't
Linux?"

 I suspect the PPC people need to figure out some way to handle this in 
 their broken setups (since I don't really expect them to finally admit 
 that they were full of sh*t with their big pages), but since I think it's 
 a ppc bug, I'm not at all interested in a fix that penalizes the _good_ 
 case.

Looks like we should make the stack a bit larger.

Paul.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Benjamin Herrenschmidt
On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
 Note, I was using a default config that had CONFIG_IRQSTACKS off and
 CONFIG_PPC_64K_PAGES on.

For one, we definitely need to turn IRQSTACKS on by default ... In fact,
I'm pondering just removing the option.

Cheers,
Ben.




Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Benjamin Herrenschmidt
On Mon, 2008-11-17 at 15:59 -0500, Steven Rostedt wrote:
 On Mon, 17 Nov 2008, Steven Rostedt wrote:
  
  Note, I was using a default config that had CONFIG_IRQSTACKS off and
  CONFIG_PPC_64K_PAGES on.
 
 Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
 softirqs still use the same stack as the process.

No, they should use their own, unless somebody changed the softirq code
in ways that break us... (or unless I missed something; we based our
implementation roughly on what x86 did back then).

Ben.

 [EMAIL PROTECTED] ~ cat /debug/tracing/stack_trace 
  Depth    Size   Location    (59 entries)
  -----    ----   --------
   0)12384 192   ftrace_call+0x4/0x14
   1)12192 128   .sched_clock+0x20/0x60
   2)12064 128   .sched_clock_cpu+0x34/0x50
   3)11936 144   .cpu_clock+0x3c/0xa0
   4)11792 144   .get_timestamp+0x2c/0x50
   5)11648 144   .__touch_softlockup_watchdog+0x3c/0x60
   6)11504 192   .softlockup_tick+0xe4/0x220
   7)11312 128   .run_local_timers+0x34/0x50
   8)11184 160   .update_process_times+0x44/0xb0
   9)11024 176   .tick_sched_timer+0x8c/0x120
  10)10848 160   .__run_hrtimer+0xd8/0x130
  11)10688 240   .hrtimer_interrupt+0x16c/0x220
  12)10448 160   .timer_interrupt+0xcc/0x110
  13)10288  96   decrementer_common+0xe0/0x100
  14)10192 784   .ftrace_test_stop_func+0x4c/0x70
  15) 9408 112   ftrace_call+0x4/0x14
  16) 9296 144   .init_buffer_head+0x20/0x60
  17) 9152 256   .cache_alloc_refill+0x5a0/0x740
  18) 8896 160   .kmem_cache_alloc+0xfc/0x140
  19) 8736 144   .alloc_buffer_head+0x3c/0xc0
  20) 8592 176   .alloc_page_buffers+0xb0/0x130
  21) 8416 160   .create_empty_buffers+0x44/0x1a0
  22)  8256    1280   .block_read_full_page+0x3b0/0x430
  23) 6976 144   .blkdev_readpage+0x38/0x60
  24) 6832 176   .read_cache_page_async+0x124/0x230
  25) 6656 160   .read_cache_page+0x50/0xe0
  26) 6496 160   .read_dev_sector+0x58/0x100
  27) 6336 272   .msdos_partition+0xa4/0x880
  28) 6064 240   .rescan_partitions+0x1f0/0x430
  29) 5824 208   .__blkdev_get+0x124/0x3d0
  30) 5616 144   .blkdev_get+0x3c/0x60
  31) 5472 192   .register_disk+0x194/0x1f0
  32) 5280 160   .add_disk+0xe8/0x160
  33) 5120 224   .sd_probe+0x2f0/0x440
  34) 4896 176   .driver_probe_device+0x14c/0x380
  35) 4720 144   .__device_attach+0x38/0x60
  36) 4576 192   .bus_for_each_drv+0xb8/0x120
  37) 4384 160   .device_attach+0x94/0xe0
  38) 4224 144   .bus_attach_device+0x78/0xb0
  39) 4080 240   .device_add+0x54c/0x730
  40) 3840 160   .scsi_sysfs_add_sdev+0x7c/0x2e0
  41) 3680 336   .scsi_probe_and_add_lun+0xb7c/0xc60
  42) 3344 192   .__scsi_add_device+0x11c/0x150
  43) 3152 176   .ata_scsi_scan_host+0x238/0x330
  44) 2976 208   .ata_host_register+0x320/0x3c0
  45) 2768 192   .ata_host_activate+0xf8/0x190
  46) 2576 224   .mv_pci_init_one+0x284/0x430
  47) 2352 160   .pci_device_probe+0x98/0xf0
  48) 2192 176   .driver_probe_device+0x14c/0x380
  49) 2016 160   .__driver_attach+0xd4/0x110
  50) 1856 192   .bus_for_each_dev+0xa8/0x100
  51) 1664 144   .driver_attach+0x40/0x60
  52) 1520 192   .bus_add_driver+0x268/0x340
  53) 1328 176   .driver_register+0x88/0x1d0
  54) 1152 176   .__pci_register_driver+0x90/0x100
  55)  976 144   .mv_init+0x38/0xa0
  56)  832 544   .do_one_initcall+0x78/0x240
  57)  288 192   .kernel_init+0x21c/0x2b0
  58)   96  96   .kernel_thread+0x54/0x70
 
 This is still 12K. Kind of big even for a 16K stack.
 
 -- Steve



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Benjamin Herrenschmidt

 Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry 
 array), but as you say:
 
  On PPC64 I'm told that the page size is 64K, which makes the above equal 
   to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
 
 Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB 
 pages are insane, and 16kB pages are borderline. I've told people so.
 
 The ppc people run databases, and they don't care about sane people 
 telling them the big pages suck.

Hehe :-)

Guess who is pushing for larger page sizes nowadays? Embedded
people :-) In fact, we have patches submitted on the list to offer the
option for ... 256K pages on some 44x embedded CPUs :-)

It makes some sort of sense I suppose on very static embedded workloads
with no swap nor demand paging.

 It's made worse by the fact that they 
 also have horribly bad TLB fills on their broken CPU's, and years and 
 years of telling people that the MMU on ppc's are sh*t has only been 
 reacted to with "talk to the hand, we know better".

Who are you talking about here precisely? I don't think either Paul or
I ever said anything along those lines ... Oh well.

But yeah, our existing server CPUs have pretty poor TLB refills and yes,
64K pages help. And yes, we would like things to be different, but they
aren't.

But there is also pressure to get larger page sizes from the small embedded
field, where CPUs have even poorer TLB refill (software loaded
basically) :-)

 Quite frankly, 64kB pages are INSANE. But yes, in this case they actually 
 cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses 
 64 bytes, not 1kB.

Come on, the code is crap to allocate that on the stack anyway :-)

 Of course, that would likely mean that FAT etc wouldn't work on ppc64, so 
 I don't think that's a valid model either. But if the 64kB page size is 
 just a database server crazy-people config option, then maybe it's 
 acceptable.

Well, as I said, embedded folks are wanting that too ...

Cheers,
Ben.



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Josh Boyer
On Tue, 18 Nov 2008 10:13:16 +1100
Benjamin Herrenschmidt [EMAIL PROTECTED] wrote:

 
  Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry 
  array), but as you say:
  
   On PPC64 I'm told that the page size is 64K, which makes the above equal 
    to: 64K / 512 = 128; multiply that by 8-byte words and we have 1024 bytes.
  
  Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB 
  pages are insane, and 16kB pages are borderline. I've told people so.
  
  The ppc people run databases, and they don't care about sane people 
  telling them the big pages suck.
 
 Hehe :-)
 
 Guess who is pushing for larger page sizes nowadays ? Embedded
 people :-) In fact, we have patches submited on the list to offer the
 option for ... 256K pages on some 44x embedded CPUs :-)

For clarification, that workload is very specific.  Namely, embedded 44x
CPUs used in RAID cards.  I'm not entirely convinced bringing 256K
pages into mainline is a good thing yet anyway.

64K pages, while seemingly insane for embedded boards that typically
have less than 512 MiB of DRAM, help for a bit larger set of
workloads.  A KVM host is the primary winner at the moment.  But
given the small number of TLB entries on these CPUs, it can pay off
elsewhere as well.

josh


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Linus Torvalds


On Tue, 18 Nov 2008, Benjamin Herrenschmidt wrote:
 
 Guess who is pushing for larger page sizes nowadays ? Embedded
 people :-) In fact, we have patches submited on the list to offer the
 option for ... 256K pages on some 44x embedded CPUs :-)
 
 It makes some sort of sense I suppose on very static embedded workloads
 with no swap nor demand paging.

It makes perfect sense for anything that doesn't use any MMU. 

The hugepage support seems to cover many of the relevant cases, ie 
databases and things like big static mappings (frame buffers etc).

  It's made worse by the fact that they 
  also have horribly bad TLB fills on their broken CPU's, and years and 
  years of telling people that the MMU on ppc's are sh*t has only been 
  reacted to with "talk to the hand, we know better".
 
 Who are you talking about here precisely ? I don't think either Paul or
 I every said something nearly around those lines ... Oh well.

Every single time I've complained about it, somebody from IBM has said 
".. but but AIX".

This time it was Paul. Sometimes it has been software people who agree, 
but point to hardware designers who know better. If it's not some insane 
database person, it's a Fortran program that runs for days.

 But there is also pressure to get larger page sizes from small embedded
 field, where CPUs have even poorer TLB refill (software loaded
 basically) :-)

Yeah, I agree that you _can_ have even worse MMU's. I'm not saying that 
PPC64 is absolutely pessimal and cannot be made worse. Software fill is 
indeed even worse from a performance angle, despite the fact that it's 
really nice from a conceptual angle.

Of course, of the software-fill users that remain, many do seem to be ppc.. It's 
like the architecture brings out the worst in hardware designers.

  Quite frankly, 64kB pages are INSANE. But yes, in this case they actually 
  cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses 
  64 bytes, not 1kB.
 
 Come on, the code is crap to allocate that on the stack anyway :-)

Why? We do actually expect to be able to use stack-space for small 
structures. We do it for a lot of things, including stuff like select() 
optimistically using arrays allocated on the stack for the common small 
case, just because it's, oh, about infinitely faster to do than to use 
kmalloc().
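
The idiom Linus is describing looks roughly like this (an illustrative
user-space sketch, not the actual select() code; a kernel version would use
kmalloc()/kfree() instead of malloc()/free()):

#include <stdlib.h>
#include <string.h>

/* Common small case lives on the stack; only the rare large case
 * pays for a heap allocation. */
int process_items(const int *items, size_t n)
{
	int stack_buf[8];
	int *buf = stack_buf;

	if (n > 8) {
		buf = malloc(n * sizeof(*buf));
		if (!buf)
			return -1;
	}
	memcpy(buf, items, n * sizeof(*buf));
	/* ... work on buf[0..n) ... */
	if (buf != stack_buf)
		free(buf);
	return 0;
}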

Many of the page cache functions also have the added twist that they get 
called from low-memory setups (eg write_whole_page()), and so try to 
minimize allocations for that reason too.

Linus


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

On Tue, 18 Nov 2008, Paul Mackerras wrote:

 Steven Rostedt writes:
 
  Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
  softirqs still use the same stack as the process.
 
 They shouldn't.  I don't see do_softirq in the trace, though.  Which
 functions did you think would be run in a softirq?  It looks to me
 like the deepest 10 or so functions are called at hard irq level,
 within hrtimer_interrupt called from timer_interrupt.
 
  [EMAIL PROTECTED] ~ cat /debug/tracing/stack_trace 
  Depth    Size   Location    (59 entries)
  -----    ----   --------
0)12384 192   ftrace_call+0x4/0x14
1)12192 128   .sched_clock+0x20/0x60
2)12064 128   .sched_clock_cpu+0x34/0x50
3)11936 144   .cpu_clock+0x3c/0xa0
4)11792 144   .get_timestamp+0x2c/0x50
5)11648 144   .__touch_softlockup_watchdog+0x3c/0x60
6)11504 192   .softlockup_tick+0xe4/0x220
7)11312 128   .run_local_timers+0x34/0x50
8)11184 160   .update_process_times+0x44/0xb0
9)11024 176   .tick_sched_timer+0x8c/0x120
   10)10848 160   .__run_hrtimer+0xd8/0x130
   11)10688 240   .hrtimer_interrupt+0x16c/0x220
   12)10448 160   .timer_interrupt+0xcc/0x110
   13)10288  96   decrementer_common+0xe0/0x100

Ah, you are right. I thought I saw softirq code in there, but must have 
been seeing things. I need to get my eyesight checked. The other day I 
totally botched a hand in cards because I thought I had 3 Aces of diamonds 
when I really only had 2 Aces of diamonds and an Ace of hearts.

-- Steve



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Benjamin Herrenschmidt
On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
 
 I've been hitting stack overflows on a PPC64 box, so I ran the ftrace 
 stack_tracer and part of the problem with that box is that it can nest 
 interrupts too deep. But what also worries me is that there's some heavy 
 hitters of stacks in generic code. Namely the fs directory has some.

Note that we shouldn't stack interrupts much in practice. The PIC will
not let same or lower prio interrupts in until we have completed one.
However, the timer/decrementer does not go through the PIC, so I think what
happens is: we get a hw IRQ; on the way back, just before returning from
do_IRQ (so we have completed the IRQ from the PIC standpoint), we go
into softirqs, at which point deep inside SCSI we get another HW IRQ
and we stack a decrementer interrupt on top of it.

Now, we should do stack switching for both HW IRQs and softirqs with
CONFIG_IRQSTACKS, which should significantly alleviate the problem.

Your second trace also shows how horrible the stack traces can be when
the device-model kicks in, ie, register -> probe -> register sub-device ->
etc... that isn't going to be nice on x86 with 4k stacks either.

I wonder if we should generally recommend for drivers of bus devices
not to register sub devices from their own probe() routine, but defer
that to a kernel thread... Because the stacking can be pretty bad, I
mean, nobody's done SATA over USB yet but heh :-)

Cheers,
Ben.



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Paul Mackerras
Linus Torvalds writes:

   It's made worse by the fact that they 
   also have horribly bad TLB fills on their broken CPU's, and years and 
   years of telling people that the MMU on ppc's are sh*t has only been 
   reacted to with "talk to the hand, we know better".
  
  Who are you talking about here precisely ? I don't think either Paul or
  I every said something nearly around those lines ... Oh well.
 
 Every single time I've complained about it, somebody from IBM has said 
 ".. but but AIX".
 
 This time it was Paul. Sometimes it has been software people who agree, 

Maybe I wasn't clear.  I was reporting the reaction I get from the
hardware designers, who often like to push off the hard problems to
software to fix.

Keep flaming, by the way.  It provides me with some amount of
assistance in my quest to get the hardware designers to do better. ;)

Also, you didn't respond to my comments about the purely software
benefits of a larger page size.

Paul.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

On Mon, 17 Nov 2008, Linus Torvalds wrote:
 
 I do wonder just _what_ it is that causes the stack frames to be so 
 horrid. For example, you have
 
18) 8896 160   .kmem_cache_alloc+0xfc/0x140
 
 and I'm looking at my x86-64 compile, and it has a stack frame of just 8 
 bytes (!) for local variables plus the save/restore area (which looks like 
 three registers plus frame pointer plus return address). IOW, if I'm 
 looking at the code right (so big caveat: I did _not_ do a real stack 
 dump!) the x86-64 stack cost for that same function is on the order of 48 
 bytes. Not 160.

Out of curiosity, I just ran stack_trace on the latest version of git 
(pulled sometime today) and ran it on my x86_64.

I have SLUB and SLUB debug defined, and here's what I found:

 11) 3592  64   kmem_cache_alloc+0x64/0xa3

64 bytes, still much lower than the 160 of PPC64.

Just to see where it got that number, I ran objdump on slub.o.

300e <kmem_cache_alloc_node>:
300e:   55  push   %rbp
300f:   48 89 e5mov%rsp,%rbp
3012:   41 57   push   %r15
3014:   41 56   push   %r14
3016:   41 55   push   %r13
3018:   41 54   push   %r12
301a:   53  push   %rbx
301b:   48 83 ec 08 sub$0x8,%rsp

Six words pushed, plus the 8 byte stack frame.

  6*8+8 = 56

But we also add the push of the function return address, which is another 8 
bytes, giving us 64 bytes.

Here's the complete dump:

[EMAIL PROTECTED] linux-compile.git]# cat /debug/tracing/stack_trace 
Depth    Size   Location    (54 entries)
-----    ----   --------
  0) 4504  48   __mod_zone_page_state+0x59/0x68
  1) 4456  64   __rmqueue_smallest+0xa0/0xd9
  2) 4392  80   __rmqueue+0x24/0x172
  3) 4312  96   rmqueue_bulk+0x57/0xa3
  4) 4216 224   get_page_from_freelist+0x371/0x6e1
  5) 3992 160   __alloc_pages_internal+0xe0/0x3f8
  6) 3832  16   __alloc_pages_nodemask+0xe/0x10
  7) 3816  48   alloc_pages_current+0xbe/0xc7
  8) 3768  16   alloc_slab_page+0x28/0x34
  9) 3752  64   new_slab+0x4a/0x1bb
 10) 3688  96   __slab_alloc+0x203/0x364
 11) 3592  64   kmem_cache_alloc+0x64/0xa3
 12) 3528  48   alloc_buffer_head+0x22/0x9d
 13) 3480  64   alloc_page_buffers+0x2f/0xd1
 14) 3416  48   create_empty_buffers+0x22/0xb5
 15) 3368 176   block_read_full_page+0x6b/0x25c
 16) 3192  16   blkdev_readpage+0x18/0x1a
 17) 3176  64   read_cache_page_async+0x85/0x11a
 18) 3112  32   read_cache_page+0x13/0x48
 19) 3080  48   read_dev_sector+0x36/0xaf
 20) 3032  96   read_lba+0x51/0xb0
 21) 2936 176   efi_partition+0x92/0x585
 22) 2760 128   rescan_partitions+0x173/0x308
 23) 2632  96   __blkdev_get+0x22a/0x2ea
 24) 2536  16   blkdev_get+0x10/0x12
 25) 2520  80   register_disk+0xe5/0x14a
 26) 2440  48   add_disk+0xbd/0x11f
 27) 2392  96   sd_probe+0x2c5/0x3ad
 28) 2296  48   driver_probe_device+0xc5/0x14c
 29) 2248  16   __device_attach+0xe/0x10
 30) 2232  64   bus_for_each_drv+0x56/0x8d
 31) 2168  48   device_attach+0x68/0x7f
 32) 2120  32   bus_attach_device+0x2d/0x5e
 33) 2088 112   device_add+0x45f/0x5d3
 34) 1976  64   scsi_sysfs_add_sdev+0xb7/0x246
 35) 1912 336   scsi_probe_and_add_lun+0x9f9/0xad7
 36) 1576  80   __scsi_add_device+0xb6/0xe5
 37) 1496  80   ata_scsi_scan_host+0x9e/0x1c6
 38) 1416  64   ata_host_register+0x238/0x252
 39) 1352  96   ata_pci_sff_activate_host+0x1a1/0x1ce
 40) 1256 256   piix_init_one+0x646/0x6b2
 41) 1000 144   pci_device_probe+0xc9/0x120
 42)  856  48   driver_probe_device+0xc5/0x14c
 43)  808  48   __driver_attach+0x67/0x91
 44)  760  64   bus_for_each_dev+0x54/0x84
 45)  696  16   driver_attach+0x21/0x23
 46)  680  64   bus_add_driver+0xba/0x204
 47)  616  64   driver_register+0x98/0x110
 48)  552  48   __pci_register_driver+0x6b/0xa4
 49)  504  16   piix_init+0x19/0x2c
 50)  488 272   _stext+0x5d/0x145
 51)  216  32   kernel_init+0x127/0x17a
 52)  184 184   child_rip+0xa/0x11

-- Steve



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Paul Mackerras
Steve,

 On Mon, 17 Nov 2008, Linus Torvalds wrote:
  
  I do wonder just _what_ it is that causes the stack frames to be so 
  horrid. For example, you have
  
   18) 8896 160   .kmem_cache_alloc+0xfc/0x140
  
  and I'm looking at my x86-64 compile, and it has a stack frame of just 8 
  bytes (!) for local variables plus the save/restore area (which looks like 
  three registers plus frame pointer plus return address). IOW, if I'm 
  looking at the code right (so big caveat: I did _not_ do a real stack 
  dump!) the x86-64 stack cost for that same function is on the order of 48 
  bytes. Not 160.
 
 Out of curiosity, I just ran stack_trace on the latest version of git 
 (pulled sometime today) and ran it on my x86_64.
 
 I have SLUB and SLUB debug defined, and here's what I found:
 
  11) 3592  64   kmem_cache_alloc+0x64/0xa3
 
 64 bytes, still much lower than the 160 of PPC64.

The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
area for called functions to store their parameters (64 bytes) plus 6
slots for saving stuff and for the compiler and linker to use if they
need to.  That's before any local variables are allocated.
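
As a sketch, the 112 bytes Paul describes break down like this (field names
are illustrative, not taken from kernel headers):

struct ppc64_min_frame {
	unsigned long back_chain;	/* pointer to the caller's frame */
	unsigned long cr_save;		/* condition register save word */
	unsigned long lr_save;		/* link register (return address) save */
	unsigned long compiler_dword;	/* reserved for the compiler */
	unsigned long linker_dword;	/* reserved for the linker */
	unsigned long toc_save;		/* TOC pointer (r2) save */
	unsigned long param_save[8];	/* parameter save area: 8 dwords */
};					/* 6*8 + 8*8 = 112 bytes */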

The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
nicer, at the expense of a much more complicated va_arg().

Paul.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

On Tue, 18 Nov 2008, Paul Mackerras wrote:
  
  64 bytes, still much lower than the 160 of PPC64.
 
 The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
 area for called functions to store their parameters (64 bytes) plus 6
 slots for saving stuff and for the compiler and linker to use if they
 need to.  That's before any local variables are allocated.
 
 The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
 nicer, at the expense of a much more complicated va_arg().

Here's a full dump that I got from my 32bit Powerbook:

[EMAIL PROTECTED]:~$ cat /debug/tracing/stack_trace 
Depth    Size   Location    (57 entries)
-----    ----   --------
  0) 3196  48   ftrace_call+0x4/0x48
  1) 3148  48   update_curr+0xa0/0x110
  2) 3100  48   enqueue_task_fair+0x54/0xfc
  3) 3052  32   enqueue_task+0x34/0x54
  4) 3020  16   activate_task+0x40/0x64
  5) 3004  48   try_to_wake_up+0x78/0x164
  6) 2956  16   default_wake_function+0x28/0x40
  7) 2940  48   __wake_up_common+0x54/0xac
  8) 2892  32   __wake_up_sync+0x58/0x94
  9) 2860  16   sock_def_readable+0x58/0xb8
 10) 2844  32   sock_queue_rcv_skb+0x10c/0x128
 11) 2812  32   __udp_queue_rcv_skb+0x2c/0xc4
 12) 2780  32   udp_queue_rcv_skb+0x1d8/0x27c
 13) 2748  96   __udp4_lib_rcv+0x514/0x77c
 14) 2652  16   udp_rcv+0x30/0x48
 15) 2636  32   ip_local_deliver_finish+0xf4/0x1cc
 16) 2604  16   ip_local_deliver+0x94/0xac
 17) 2588  64   ip_rcv_finish+0x2ec/0x31c
 18) 2524  32   ip_rcv+0x23c/0x278
 19) 2492  48   netif_receive_skb+0x48c/0x4c0
 20) 2444  48   process_backlog+0xb0/0x144
 21) 2396  48   net_rx_action+0x8c/0x1bc
 22) 2348  48   __do_softirq+0x84/0x10c
 23) 2300  16   do_softirq+0x50/0x6c
 24) 2284  16   local_bh_enable+0x90/0xc8
 25) 2268  32   dev_queue_xmit+0x564/0x5b0
 26) 2236  48   ip_finish_output+0x2b4/0x2f4
 27) 2188  32   ip_output+0xec/0x104
 28) 2156  16   ip_local_out+0x44/0x5c
 29) 2140  48   ip_push_pending_frames+0x340/0x3ec
 30) 2092  48   udp_push_pending_frames+0x2ac/0x348
 31) 2044 160   udp_sendmsg+0x458/0x5bc
 32) 1884  32   inet_sendmsg+0x70/0x88
 33) 1852 240   sock_sendmsg+0xd0/0xf8
 34) 1612  32   kernel_sendmsg+0x3c/0x58
 35) 1580  96   xs_send_kvec+0xb4/0xcc [sunrpc]
 36) 1484  64   xs_sendpages+0x94/0x1f0 [sunrpc]
 37) 1420  32   xs_udp_send_request+0x44/0x14c [sunrpc]
 38) 1388  32   xprt_transmit+0x100/0x228 [sunrpc]
 39) 1356  32   call_transmit+0x234/0x290 [sunrpc]
 40) 1324  48   __rpc_execute+0xcc/0x2d4 [sunrpc]
 41) 1276  32   rpc_execute+0x48/0x60 [sunrpc]
 42) 1244  32   rpc_run_task+0x74/0x90 [sunrpc]
 43) 1212  64   rpc_call_sync+0x6c/0xa0 [sunrpc]
 44) 1148  80   rpcb_register_call+0xb0/0x110 [sunrpc]
 45) 1068  96   rpcb_register+0xe8/0x100 [sunrpc]
 46)  972  80   svc_register+0x118/0x168 [sunrpc]
 47)  892  64   svc_setup_socket+0xa0/0x354 [sunrpc]
 48)  828 272   svc_create_socket+0x1e0/0x250 [sunrpc]
 49)  556  16   svc_udp_create+0x38/0x50 [sunrpc]
 50)  540  96   svc_create_xprt+0x1c8/0x2fc [sunrpc]
 51)  444  48   lockd_up+0xcc/0x27c [lockd]
 52)  396  96   write_ports+0xf4/0x2cc [nfsd]
 53)  300  32   nfsctl_transaction_write+0x78/0xbc [nfsd]
 54)  268  32   vfs_write+0xc8/0x178
 55)  236  48   sys_write+0x5c/0xa0
 56)  188 188   ret_from_syscall+0x0/0x40

And here's my i386 max stack:

[EMAIL PROTECTED] ~]# cat /debug/tracing/stack_trace 
Depth    Size   Location    (47 entries)
-----    ----   --------
  0) 2216 240   blk_recount_segments+0x39/0x51
  1) 1976  12   bio_phys_segments+0x16/0x1c
  2) 1964  20   blk_rq_bio_prep+0x29/0xae
  3) 1944  16   init_request_from_bio+0xc9/0xcd
  4) 1928  60   __make_request+0x2f6/0x3c1
  5) 1868 136   generic_make_request+0x36a/0x3a0
  6) 1732  56   submit_bio+0xcd/0xd8
  7) 1676  28   submit_bh+0xd1/0xf0
  8) 1648 112   block_read_full_page+0x299/0x2a9
  9) 1536   8   blkdev_readpage+0x14/0x16
 10) 1528  28   read_cache_page_async+0x7e/0x109
 11) 1500  16   read_cache_page+0x11/0x49
 12) 1484  32   read_dev_sector+0x3c/0x72
 13) 1452  48   read_lba+0x4d/0xaa
 14) 1404 168   efi_partition+0x85/0x61b
 15) 1236  84   rescan_partitions+0x14b/0x316
 16) 1152  40   __blkdev_get+0x1b2/0x258
 17) 1112   8   blkdev_get+0xf/0x11
 18) 1104  36   register_disk+0xbc/0x112
 19) 1068  32   add_disk+0x9f/0xf3
 20) 1036  48   sd_probe+0x2d4/0x394
 21)  988  20   driver_probe_device+0xa5/0x120
 22)

Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

On Mon, 17 Nov 2008, Steven Rostedt wrote:
 
 And here's my i386 max stack:
 
 [EMAIL PROTECTED] ~]# cat /debug/tracing/stack_trace 
 Depth    Size   Location    (47 entries)
 -----    ----   --------
   0) 2216 240   blk_recount_segments+0x39/0x51
   1) 1976  12   bio_phys_segments+0x16/0x1c
   2) 1964  20   blk_rq_bio_prep+0x29/0xae
   3) 1944  16   init_request_from_bio+0xc9/0xcd
   4) 1928  60   __make_request+0x2f6/0x3c1
   5) 1868 136   generic_make_request+0x36a/0x3a0
   6) 1732  56   submit_bio+0xcd/0xd8
   7) 1676  28   submit_bh+0xd1/0xf0
   8) 1648 112   block_read_full_page+0x299/0x2a9
   9) 1536   8   blkdev_readpage+0x14/0x16
  10) 1528  28   read_cache_page_async+0x7e/0x109
  11) 1500  16   read_cache_page+0x11/0x49
  12) 1484  32   read_dev_sector+0x3c/0x72
  13) 1452  48   read_lba+0x4d/0xaa
  14) 1404 168   efi_partition+0x85/0x61b
  15) 1236  84   rescan_partitions+0x14b/0x316
  16) 1152  40   __blkdev_get+0x1b2/0x258
  17) 1112   8   blkdev_get+0xf/0x11
  18) 1104  36   register_disk+0xbc/0x112
  19) 1068  32   add_disk+0x9f/0xf3
  20) 1036  48   sd_probe+0x2d4/0x394
  21)  988  20   driver_probe_device+0xa5/0x120
  22)  968   8   __device_attach+0xd/0xf
  23)  960  28   bus_for_each_drv+0x3e/0x67
  24)  932  24   device_attach+0x56/0x6a
  25)  908  16   bus_attach_device+0x26/0x4d
  26)  892  64   device_add+0x380/0x4aa
  27)  828  28   scsi_sysfs_add_sdev+0xa1/0x1c9
  28)  800 160   scsi_probe_and_add_lun+0x97e/0xa8f
  29)  640  36   __scsi_add_device+0x88/0xae
  30)  604  40   ata_scsi_scan_host+0x8b/0x1cd
  31)  564  28   ata_host_register+0x1da/0x1ea
  32)  536  24   ata_host_activate+0x98/0xb5
  33)  512 188   ahci_init_one+0xa23/0xa4f
  34)  324  20   pci_device_probe+0x3e/0x5e
  35)  304  20   driver_probe_device+0xa5/0x120
  36)  284  20   __driver_attach+0x51/0x70
  37)  264  32   bus_for_each_dev+0x40/0x62
  38)  232  12   driver_attach+0x19/0x1b
  39)  220  28   bus_add_driver+0x9c/0x1ac
  40)  192  28   driver_register+0x76/0xd2
  41)  164  20   __pci_register_driver+0x44/0x70
  42)  144   8   ahci_init+0x14/0x16
  43)  136  92   _stext+0x4f/0x116
  44)   44  16   kernel_init+0xf7/0x145
  45)   28  28   kernel_thread_helper+0x7/0x10
 
 Note, the i386 box had 4KSTACKS defined and the PPC did not have 
 IRQSTACKS defined. I'll compile with IRQSTACKS defined and post that 
 result.

Here's the PPC32 with IRQSTACKS on:

[EMAIL PROTECTED]:~$ cat /debug/tracing/stack_trace 
Depth    Size   Location    (42 entries)
-----    ----   --------
  0) 2940  48   ftrace_call+0x4/0x48
  1) 2892  16   ide_map_sg+0x4c/0x88
  2) 2876  32   ide_build_sglist+0x30/0xa0
  3) 2844  64   pmac_ide_dma_setup+0x90/0x2a0
  4) 2780  48   do_rw_taskfile+0x250/0x2a0
  5) 2732  96   ide_do_rw_disk+0x290/0x2f8
  6) 2636  16   ide_gd_do_request+0x30/0x48
  7) 2620 176   ide_do_request+0x8e8/0xa70
  8) 2444  16   do_ide_request+0x34/0x4c
  9) 2428  16   __generic_unplug_device+0x4c/0x64
 10) 2412  16   generic_unplug_device+0x4c/0x90
 11) 2396  16   blk_unplug+0x34/0x4c
 12) 2380  80   dm_table_unplug_all+0x54/0xc0 [dm_mod]
 13) 2300  16   dm_unplug_all+0x34/0x54 [dm_mod]
 14) 2284  16   blk_unplug+0x34/0x4c
 15) 2268  16   blk_backing_dev_unplug+0x28/0x40
 16) 2252  16   sync_buffer+0x60/0x80
 17) 2236  48   __wait_on_bit+0x78/0xd4
 18) 2188  80   out_of_line_wait_on_bit+0x88/0xa0
 19) 2108  16   __wait_on_buffer+0x40/0x58
 20) 2092  16   __bread+0xbc/0xf0
 21) 2076  48   ext3_get_branch+0x90/0x118 [ext3]
 22) 2028 208   ext3_get_blocks_handle+0xac/0xa10 [ext3]
 23) 1820  48   ext3_get_block+0xc4/0x114 [ext3]
 24) 1772 160   do_mpage_readpage+0x26c/0x6f4
 25) 1612 144   mpage_readpages+0xe8/0x148
 26) 1468  16   ext3_readpages+0x38/0x50 [ext3]
 27) 1452  80   __do_page_cache_readahead+0x19c/0x248
 28) 1372  32   do_page_cache_readahead+0x80/0x98
 29) 1340  80   filemap_fault+0x1b0/0x444
 30) 1260  80   __do_fault+0x68/0x61c
 31) 1180  64   handle_mm_fault+0x500/0xafc
 32) 1116 176   do_page_fault+0x2d4/0x450
 33)  940  84   handle_page_fault+0xc/0x80
 34)  856 140   0xdeaffa78
 35)  716 128   load_elf_binary+0x6a0/0x1048
 36)  588  64   search_binary_handler+0x10c/0x310
 37)  524 176   load_script+0x248/0x268
 38)  348  64   search_binary_handler+0x10c/0x310
 39)  284  64   

Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Linus Torvalds


On Tue, 18 Nov 2008, Paul Mackerras wrote:
 
 Also, you didn't respond to my comments about the purely software
 benefits of a larger page size.

I realize that there are benefits. It's just that the downsides tend to 
swamp the upsides.

The fact is, Intel (and to a lesser degree, AMD) has shown how hardware 
can do good TLB's with essentially gang lookups, giving almost effective 
page sizes of 32kB with hardly any of the downsides. Couple that with 
low-latency fault handling (for not when you miss in the TLB, but when 
something really isn't in the page tables), and it seems to be seldom the 
biggest issue.

(Don't get me wrong - TLB's are not unimportant on x86 either. But on x86, 
things are generally much better).

Yes, we could prefill the page tables and do other things, and ultimately 
if you don't need to - by virtue of big pages, some loads will always 
benefit from just making the page size larger.

But the people who advocate large pages seem to never really face the 
downsides. They talk about their single loads, and optimize for that and 
nothing else. They don't seem to even acknowledge the fact that a 64kB 
page size is simply NOT EVEN REMOTELY ACCEPTABLE for other loads!

That's what gets to me. These absolute -idiots- talk about how they win 5% 
on some (important, for them) benchmark by doing large pages, but then 
ignore the fact that on other real-world loads they lose by several 
HUNDRED percent because of the memory fragmentation costs.

(And btw, if they win more than 5%, it's because the hardware sucks really 
badly).

THAT is what irritates me.

What also irritates me is the ".. but AIX" argument. The fact is, the AIX 
memory management is very tightly tied to one particular broken MMU model. 
Linux supports something like thirty architectures, and while PPC may be 
one of the top ones, it is NOT EVEN CLOSE to be really relevant.

So ".. but AIX" simply doesn't matter. The Linux VM has other priorities.

And I _guarantee_ that in general, in the high-volume market (which is 
what drives things, like it or not), page sizes will not be growing. In 
that market, terabytes of RAM is not the primary case, and small files 
that want mmap are one _very_ common case.

To make things worse, the biggest performance market has another vendor 
that hasn't been saying ".. but AIX" for the last decade, and that 
actually listens to input. And, perhaps not incidentally, outperforms the 
highest-performance ppc64 chips mostly by a huge margin - while selling 
their chips for a fraction of the price.

I realize that this may be hard to accept for some people. But somebody 
who says "... but AIX" should be taking a damn hard look in the mirror, 
and asking themselves some really tough questions. Because quite frankly, the 
".. but AIX" market isn't the most interesting one.

Linus


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Steven Rostedt

On Mon, 17 Nov 2008, Steven Rostedt wrote:
 
 On Mon, 17 Nov 2008, Steven Rostedt wrote:
  
  Note, I was using a default config that had CONFIG_IRQSTACKS off and
  CONFIG_PPC_64K_PAGES on.
 
 Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that 
 softirqs still use the same stack as the process.
 

By-the-way, my box has been running stable ever since I switched to 
CONFIG_IRQSTACKS.

-- Steve



Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Paul Mackerras
Steven Rostedt writes:

 By-the-way, my box has been running stable ever since I switched to 
 CONFIG_IRQSTACKS.

Great.  We probably should remove the config option and just always
use irq stacks.

Paul.


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread David Miller
From: Paul Mackerras [EMAIL PROTECTED]
Date: Tue, 18 Nov 2008 13:36:16 +1100

 Steven Rostedt writes:
 
  By-the-way, my box has been running stable ever since I switched to 
  CONFIG_IRQSTACKS.
 
 Great.  We probably should remove the config option and just always
 use irq stacks.

That's what I did from the start on sparc64 when I added
irqstacks support.  It's pretty stupid to make it optional
when we know there are failure cases.

For example, is XFS dependent on IRQSTACKS on x86?  It should be, or
even more so with XFS and NFS both being enabled at the same time :-)


Re: Large stack usage in fs code (especially for PPC64)

2008-11-17 Thread Benjamin Herrenschmidt

  It makes some sort of sense I suppose on very static embedded workloads
  with no swap nor demand paging.
 
 It makes perfect sense for anything that doesn't use any MMU. 

To a certain extent. There are two different aspects to having an MMU, and
in embedded space it's useful to have one and not the other, ie,
protection & address-space isolation vs. paging.

Thus, it does make sense for those embedded devices with few files
(mostly a statically linked busybox, a few /dev entries and some app
stuff) to use large page sizes. Of course, as soon as they try to
populate their ramfs with lots of small files they lose... but mostly,
the idea is that the entire working set fits in the TLB and thus the
cost of a TLB miss becomes irrelevant.

Now, regarding the shortcomings of the powerpc server MMU, well, we know
them, we know your opinion and mostly share it, and until we can get the
HW to change we are stuck with it.

Cheers,
Ben.

