Re: Large stack usage in fs code (especially for PPC64)
On Tue, 18 Nov 2008, Nick Piggin wrote:
> >
> > The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> > can do good TLB's with essentially gang lookups, giving almost effective
> > page sizes of 32kB with hardly any of the downsides. Couple that with
>
> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Oh, absolutely. It's why I despise hashed page tables. It's a broken
concept.

> The per-page processing costs are interesting too, but IMO there is more
> work that should be done to speed up order-0 pages. The patches I had to
> remove the sync instruction for smp_mb() in unlock_page sped up pagecache
> throughput (populate, write(2), reclaim) on my G5 by something really
> crazy like 50% (most of that's in, but I'm still sitting on that fancy
> unlock_page speedup to remove the final smp_mb).
>
> I suspect some of the costs are also in powerpc specific code to insert
> linux ptes into their hash table. I think some of the synchronisation for
> those could possibly be shared with generic code so you don't need the
> extra layer of locks there.

Yeah, the hashed page tables get extra costs from the fact that they can't
share the software page tables with the hardware ones, plus the associated
coherency logic. It's even worse at unmap time, I think.

		Linus

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
Nick Piggin writes:

> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Yeah, the hashed page table sucks. Film at 11, as they say. :)

Paul.
Re: Large stack usage in fs code (especially for PPC64)
* Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> > > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > > CONFIG_PPC_64K_PAGES on.
> >
> > For one, we definitely need to turn IRQSTACKS on by default ... In fact,
> > I'm pondering just removing the option.
>
> It shouldn't be a user-visible option. It shouldn't on x86 either,
> but there it always gets into the ideological 4k stacks flamewar so
> people usually gave up after a while instead of doing the sensible
> 8k + irqstacks by default, 4k as an option..

yep, i tend to agree that 8k + irqstacks on both 32-bit and 64-bit would
be the sane default. There's just too much gcc and other noise for us not
to have that cushion by default on 32-bit too. 64-bit is already there;
on 32-bit we should decouple irqstacks from 4K stacks and just turn
irqstacks on by default.

Life's too short to fight kernel stack overflows - and now we've got the
stack-tracer that will record and show the frame of the worst-ever kernel
stack situation the system was in since bootup. (see Steve's tracer
output in this thread)

	Ingo
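The stack-tracer Ingo refers to is the ftrace CONFIG_STACK_TRACER facility. On kernels that include it, something along these lines reads out the worst-case stack; the exact paths vary by kernel version (the traces in this thread use a debugfs mounted on /debug rather than /sys/kernel/debug):

```shell
# Mount debugfs if it isn't already there (modern mount point shown).
mount -t debugfs nodev /sys/kernel/debug 2>/dev/null

# Enable the stack tracer (needs CONFIG_STACK_TRACER built in; on very
# old kernels this was done with the "stacktrace" boot parameter).
echo 1 > /proc/sys/kernel/stack_tracer_enabled

# Worst-case stack depth seen so far, and its frame-by-frame breakdown.
cat /sys/kernel/debug/tracing/stack_max_size
cat /sys/kernel/debug/tracing/stack_trace
```

These are privileged configuration commands rather than a runnable example; they need root and a tracer-enabled kernel.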
Re: Large stack usage in fs code (especially for PPC64)
On Tuesday 18 November 2008 13:08, Linus Torvalds wrote:
> On Tue, 18 Nov 2008, Paul Mackerras wrote:
> > Also, you didn't respond to my comments about the purely software
> > benefits of a larger page size.
>
> I realize that there are benefits. It's just that the downsides tend to
> swamp the upsides.
>
> The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> can do good TLB's with essentially gang lookups, giving almost effective
> page sizes of 32kB with hardly any of the downsides. Couple that with

It's much harder to do this with powerpc I think because they would need
to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
wouldn't they?

> low-latency fault handling (for not when you miss in the TLB, but when
> something really isn't in the page tables), and it seems to be seldom the
> biggest issue.
>
> (Don't get me wrong - TLB's are not unimportant on x86 either. But on
> x86, things are generally much better).

The per-page processing costs are interesting too, but IMO there is more
work that should be done to speed up order-0 pages. The patches I had to
remove the sync instruction for smp_mb() in unlock_page sped up pagecache
throughput (populate, write(2), reclaim) on my G5 by something really
crazy like 50% (most of that's in, but I'm still sitting on that fancy
unlock_page speedup to remove the final smp_mb).

I suspect some of the costs are also in powerpc specific code to insert
linux ptes into their hash table. I think some of the synchronisation for
those could possibly be shared with generic code so you don't need the
extra layer of locks there.
Re: Large stack usage in fs code (especially for PPC64)
On Tuesday 18 November 2008 09:53, Paul Mackerras wrote:
> I'd love to be able to use a 4k base page size if I could still get
> the reduction in page faults and the expanded TLB reach that we get
> now with 64k pages. If we could allocate the page cache for large
> files with order-4 allocations wherever possible that would be a good
> start.

That can still have nasty side-effects like fragmentation and 64k
granular reclaim. It also adds complexity to mapping the pages to
userspace.

Christoph's patchset IIRC also only did page cache, whereas I suppose
your kernbench workload is gaining mainly from anonymous page faults.
Re: Large stack usage in fs code (especially for PPC64)
On Tue, Nov 18, 2008 at 10:03:25AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
>
> For one, we definitely need to turn IRQSTACKS on by default ... In fact,
> I'm pondering just removing the option.

It shouldn't be a user-visible option. It shouldn't on x86 either,
but there it always gets into the ideological 4k stacks flamewar so
people usually gave up after a while instead of doing the sensible
8k + irqstacks by default, 4k as an option..
Re: Large stack usage in fs code (especially for PPC64)
> > It makes some sort of sense I suppose on very static embedded workloads
> > with no swap nor demand paging.
>
> It makes perfect sense for anything that doesn't use any MMU.

To a certain extent. There are two different aspects to having an MMU,
and in embedded space it's useful to have one and not the other, ie,
protection & address space isolation vs. paging.

Thus, it does make sense for those embedded devices with few files
(mostly a statically linked busybox, a few /dev entries and some app
stuff) to use large page sizes. Of course, as soon as they try to
populate their ramfs with lots of small files they lose... but mostly,
the idea is that the entire working set fits in the TLB and thus the
cost of a TLB miss becomes irrelevant.

Now, regarding the shortcomings of the powerpc server MMU, well, we know
them, we know your opinion and mostly share it, and until we can get the
HW to change we are stuck with it.

Cheers,
Ben.
Re: Large stack usage in fs code (especially for PPC64)
From: Paul Mackerras <[EMAIL PROTECTED]>
Date: Tue, 18 Nov 2008 13:36:16 +1100

> Steven Rostedt writes:
>
> > By-the-way, my box has been running stable ever since I switched to
> > CONFIG_IRQSTACKS.
>
> Great. We probably should remove the config option and just always
> use irq stacks.

That's what I did from the start on sparc64 when I added irqstacks
support.

It's pretty stupid to make it optional when we know there are failure
cases. For example, is XFS dependent on IRQSTACKS on x86? It should be,
or even more so XFS and NFS both being enabled at the same time :-)
Re: Large stack usage in fs code (especially for PPC64)
Steven Rostedt writes:

> By-the-way, my box has been running stable ever since I switched to
> CONFIG_IRQSTACKS.

Great. We probably should remove the config option and just always
use irq stacks.

Paul.
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> On Mon, 17 Nov 2008, Steven Rostedt wrote:
> >
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
>
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.

By-the-way, my box has been running stable ever since I switched to
CONFIG_IRQSTACKS.

-- Steve
Re: Large stack usage in fs code (especially for PPC64)
On Tue, 18 Nov 2008, Paul Mackerras wrote:
>
> Also, you didn't respond to my comments about the purely software
> benefits of a larger page size.

I realize that there are benefits. It's just that the downsides tend to
swamp the upsides.

The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
can do good TLB's with essentially gang lookups, giving almost effective
page sizes of 32kB with hardly any of the downsides. Couple that with
low-latency fault handling (for not when you miss in the TLB, but when
something really isn't in the page tables), and it seems to be seldom the
biggest issue.

(Don't get me wrong - TLB's are not unimportant on x86 either. But on
x86, things are generally much better).

Yes, we could prefill the page tables and do other things, and
ultimately, if you don't need to - by virtue of big pages - some loads
will always benefit from just making the page size larger.

But the people who advocate large pages seem to never really face the
downsides. They talk about their single loads, and optimize for that and
nothing else. They don't seem to even acknowledge the fact that a 64kB
page size is simply NOT EVEN REMOTELY ACCEPTABLE for other loads!

That's what gets to me. These absolute -idiots- talk about how they win
5% on some (important, for them) benchmark by doing large pages, but then
ignore the fact that on other real-world loads they lose by several
HUNDRED percent because of the memory fragmentation costs. (And btw, if
they win more than 5%, it's because the hardware sucks really badly).

THAT is what irritates me.

What also irritates me is the ".. but AIX" argument. The fact is, the AIX
memory management is very tightly tied to one particular broken MMU
model. Linux supports something like thirty architectures, and while PPC
may be one of the top ones, it is NOT EVEN CLOSE to be really relevant.
So ".. but AIX" simply doesn't matter. The Linux VM has other priorities.

And I _guarantee_ that in general, in the high-volume market (which is
what drives things, like it or not), page sizes will not be growing. In
that market, terabytes of RAM is not the primary case, and small files
that want mmap are one _very_ common case.

To make things worse, the biggest performance market has another vendor
that hasn't been saying ".. but AIX" for the last decade, and that
actually listens to input. And, perhaps not incidentally, outperforms the
highest-performance ppc64 chips mostly by a huge margin - while selling
their chips for a fraction of the price.

I realize that this may be hard to accept for some people. But somebody
who says "... but AIX" should be taking a damn hard look in the mirror,
and ask themselves some really tough questions. Because quite frankly,
the "..but AIX" market isn't the most interesting one.

		Linus
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
> And here's my i386 max stack:
>
> [EMAIL PROTECTED] ~]# cat /debug/tracing/stack_trace
>         Depth    Size   Location    (47 entries)
>         -----    ----   --------
>   0)  2216   240   blk_recount_segments+0x39/0x51
>   1)  1976    12   bio_phys_segments+0x16/0x1c
>   2)  1964    20   blk_rq_bio_prep+0x29/0xae
>   3)  1944    16   init_request_from_bio+0xc9/0xcd
>   4)  1928    60   __make_request+0x2f6/0x3c1
>   5)  1868   136   generic_make_request+0x36a/0x3a0
>   6)  1732    56   submit_bio+0xcd/0xd8
>   7)  1676    28   submit_bh+0xd1/0xf0
>   8)  1648   112   block_read_full_page+0x299/0x2a9
>   9)  1536     8   blkdev_readpage+0x14/0x16
>  10)  1528    28   read_cache_page_async+0x7e/0x109
>  11)  1500    16   read_cache_page+0x11/0x49
>  12)  1484    32   read_dev_sector+0x3c/0x72
>  13)  1452    48   read_lba+0x4d/0xaa
>  14)  1404   168   efi_partition+0x85/0x61b
>  15)  1236    84   rescan_partitions+0x14b/0x316
>  16)  1152    40   __blkdev_get+0x1b2/0x258
>  17)  1112     8   blkdev_get+0xf/0x11
>  18)  1104    36   register_disk+0xbc/0x112
>  19)  1068    32   add_disk+0x9f/0xf3
>  20)  1036    48   sd_probe+0x2d4/0x394
>  21)   988    20   driver_probe_device+0xa5/0x120
>  22)   968     8   __device_attach+0xd/0xf
>  23)   960    28   bus_for_each_drv+0x3e/0x67
>  24)   932    24   device_attach+0x56/0x6a
>  25)   908    16   bus_attach_device+0x26/0x4d
>  26)   892    64   device_add+0x380/0x4aa
>  27)   828    28   scsi_sysfs_add_sdev+0xa1/0x1c9
>  28)   800   160   scsi_probe_and_add_lun+0x97e/0xa8f
>  29)   640    36   __scsi_add_device+0x88/0xae
>  30)   604    40   ata_scsi_scan_host+0x8b/0x1cd
>  31)   564    28   ata_host_register+0x1da/0x1ea
>  32)   536    24   ata_host_activate+0x98/0xb5
>  33)   512   188   ahci_init_one+0xa23/0xa4f
>  34)   324    20   pci_device_probe+0x3e/0x5e
>  35)   304    20   driver_probe_device+0xa5/0x120
>  36)   284    20   __driver_attach+0x51/0x70
>  37)   264    32   bus_for_each_dev+0x40/0x62
>  38)   232    12   driver_attach+0x19/0x1b
>  39)   220    28   bus_add_driver+0x9c/0x1ac
>  40)   192    28   driver_register+0x76/0xd2
>  41)   164    20   __pci_register_driver+0x44/0x70
>  42)   144     8   ahci_init+0x14/0x16
>  43)   136    92   _stext+0x4f/0x116
>  44)    44    16   kernel_init+0xf7/0x145
>  45)    28    28   kernel_thread_helper+0x7/0x10
>
> Note, the i386 box had 4KSTACKS defined and the PPC did not have
> IRQSTACKS defined. I'll compile with IRQSTACKS defined and post that
> result.

Here's the PPC32 with IRQSTACKS on:

[EMAIL PROTECTED]:~$ cat /debug/tracing/stack_trace
        Depth    Size   Location    (42 entries)
        -----    ----   --------
  0)  2940    48   ftrace_call+0x4/0x48
  1)  2892    16   ide_map_sg+0x4c/0x88
  2)  2876    32   ide_build_sglist+0x30/0xa0
  3)  2844    64   pmac_ide_dma_setup+0x90/0x2a0
  4)  2780    48   do_rw_taskfile+0x250/0x2a0
  5)  2732    96   ide_do_rw_disk+0x290/0x2f8
  6)  2636    16   ide_gd_do_request+0x30/0x48
  7)  2620   176   ide_do_request+0x8e8/0xa70
  8)  2444    16   do_ide_request+0x34/0x4c
  9)  2428    16   __generic_unplug_device+0x4c/0x64
 10)  2412    16   generic_unplug_device+0x4c/0x90
 11)  2396    16   blk_unplug+0x34/0x4c
 12)  2380    80   dm_table_unplug_all+0x54/0xc0 [dm_mod]
 13)  2300    16   dm_unplug_all+0x34/0x54 [dm_mod]
 14)  2284    16   blk_unplug+0x34/0x4c
 15)  2268    16   blk_backing_dev_unplug+0x28/0x40
 16)  2252    16   sync_buffer+0x60/0x80
 17)  2236    48   __wait_on_bit+0x78/0xd4
 18)  2188    80   out_of_line_wait_on_bit+0x88/0xa0
 19)  2108    16   __wait_on_buffer+0x40/0x58
 20)  2092    16   __bread+0xbc/0xf0
 21)  2076    48   ext3_get_branch+0x90/0x118 [ext3]
 22)  2028   208   ext3_get_blocks_handle+0xac/0xa10 [ext3]
 23)  1820    48   ext3_get_block+0xc4/0x114 [ext3]
 24)  1772   160   do_mpage_readpage+0x26c/0x6f4
 25)  1612   144   mpage_readpages+0xe8/0x148
 26)  1468    16   ext3_readpages+0x38/0x50 [ext3]
 27)  1452    80   __do_page_cache_readahead+0x19c/0x248
 28)  1372    32   do_page_cache_readahead+0x80/0x98
 29)  1340    80   filemap_fault+0x1b0/0x444
 30)  1260    80   __do_fault+0x68/0x61c
 31)  1180    64   handle_mm_fault+0x500/0xafc
 32)  1116   176   do_page_fault+0x2d4/0x450
 33)   940    84   handle_page_fault+0xc/0x80
 34)   856   140   0xdeaffa78
 35)   716   128   load_elf_binary+0x6a0/0x1048
 36)   588    64   search_binary_handler+0x10c/0x310
 37)   524   176   load_script+0x248/0x268
 38)   348    64   search_binary_handler
Re: Large stack usage in fs code (especially for PPC64)
On Tue, 18 Nov 2008, Paul Mackerras wrote:
> >
> > 64 bytes, still much lower than the 160 of PPC64.
>
> The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
> area for called functions to store their parameters (64 bytes) plus 6
> slots for saving stuff and for the compiler and linker to use if they
> need to. That's before any local variables are allocated.
>
> The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
> nicer, at the expense of a much more complicated va_arg().

Here's a full dump that I got from my 32bit Powerbook:

[EMAIL PROTECTED]:~$ cat /debug/tracing/stack_trace
        Depth    Size   Location    (57 entries)
        -----    ----   --------
  0)  3196    48   ftrace_call+0x4/0x48
  1)  3148    48   update_curr+0xa0/0x110
  2)  3100    48   enqueue_task_fair+0x54/0xfc
  3)  3052    32   enqueue_task+0x34/0x54
  4)  3020    16   activate_task+0x40/0x64
  5)  3004    48   try_to_wake_up+0x78/0x164
  6)  2956    16   default_wake_function+0x28/0x40
  7)  2940    48   __wake_up_common+0x54/0xac
  8)  2892    32   __wake_up_sync+0x58/0x94
  9)  2860    16   sock_def_readable+0x58/0xb8
 10)  2844    32   sock_queue_rcv_skb+0x10c/0x128
 11)  2812    32   __udp_queue_rcv_skb+0x2c/0xc4
 12)  2780    32   udp_queue_rcv_skb+0x1d8/0x27c
 13)  2748    96   __udp4_lib_rcv+0x514/0x77c
 14)  2652    16   udp_rcv+0x30/0x48
 15)  2636    32   ip_local_deliver_finish+0xf4/0x1cc
 16)  2604    16   ip_local_deliver+0x94/0xac
 17)  2588    64   ip_rcv_finish+0x2ec/0x31c
 18)  2524    32   ip_rcv+0x23c/0x278
 19)  2492    48   netif_receive_skb+0x48c/0x4c0
 20)  2444    48   process_backlog+0xb0/0x144
 21)  2396    48   net_rx_action+0x8c/0x1bc
 22)  2348    48   __do_softirq+0x84/0x10c
 23)  2300    16   do_softirq+0x50/0x6c
 24)  2284    16   local_bh_enable+0x90/0xc8
 25)  2268    32   dev_queue_xmit+0x564/0x5b0
 26)  2236    48   ip_finish_output+0x2b4/0x2f4
 27)  2188    32   ip_output+0xec/0x104
 28)  2156    16   ip_local_out+0x44/0x5c
 29)  2140    48   ip_push_pending_frames+0x340/0x3ec
 30)  2092    48   udp_push_pending_frames+0x2ac/0x348
 31)  2044   160   udp_sendmsg+0x458/0x5bc
 32)  1884    32   inet_sendmsg+0x70/0x88
 33)  1852   240   sock_sendmsg+0xd0/0xf8
 34)  1612    32   kernel_sendmsg+0x3c/0x58
 35)  1580    96   xs_send_kvec+0xb4/0xcc [sunrpc]
 36)  1484    64   xs_sendpages+0x94/0x1f0 [sunrpc]
 37)  1420    32   xs_udp_send_request+0x44/0x14c [sunrpc]
 38)  1388    32   xprt_transmit+0x100/0x228 [sunrpc]
 39)  1356    32   call_transmit+0x234/0x290 [sunrpc]
 40)  1324    48   __rpc_execute+0xcc/0x2d4 [sunrpc]
 41)  1276    32   rpc_execute+0x48/0x60 [sunrpc]
 42)  1244    32   rpc_run_task+0x74/0x90 [sunrpc]
 43)  1212    64   rpc_call_sync+0x6c/0xa0 [sunrpc]
 44)  1148    80   rpcb_register_call+0xb0/0x110 [sunrpc]
 45)  1068    96   rpcb_register+0xe8/0x100 [sunrpc]
 46)   972    80   svc_register+0x118/0x168 [sunrpc]
 47)   892    64   svc_setup_socket+0xa0/0x354 [sunrpc]
 48)   828   272   svc_create_socket+0x1e0/0x250 [sunrpc]
 49)   556    16   svc_udp_create+0x38/0x50 [sunrpc]
 50)   540    96   svc_create_xprt+0x1c8/0x2fc [sunrpc]
 51)   444    48   lockd_up+0xcc/0x27c [lockd]
 52)   396    96   write_ports+0xf4/0x2cc [nfsd]
 53)   300    32   nfsctl_transaction_write+0x78/0xbc [nfsd]
 54)   268    32   vfs_write+0xc8/0x178
 55)   236    48   sys_write+0x5c/0xa0
 56)   188   188   ret_from_syscall+0x0/0x40

And here's my i386 max stack:

[EMAIL PROTECTED] ~]# cat /debug/tracing/stack_trace
        Depth    Size   Location    (47 entries)
        -----    ----   --------
  0)  2216   240   blk_recount_segments+0x39/0x51
  1)  1976    12   bio_phys_segments+0x16/0x1c
  2)  1964    20   blk_rq_bio_prep+0x29/0xae
  3)  1944    16   init_request_from_bio+0xc9/0xcd
  4)  1928    60   __make_request+0x2f6/0x3c1
  5)  1868   136   generic_make_request+0x36a/0x3a0
  6)  1732    56   submit_bio+0xcd/0xd8
  7)  1676    28   submit_bh+0xd1/0xf0
  8)  1648   112   block_read_full_page+0x299/0x2a9
  9)  1536     8   blkdev_readpage+0x14/0x16
 10)  1528    28   read_cache_page_async+0x7e/0x109
 11)  1500    16   read_cache_page+0x11/0x49
 12)  1484    32   read_dev_sector+0x3c/0x72
 13)  1452    48   read_lba+0x4d/0xaa
 14)  1404   168   efi_partition+0x85/0x61b
 15)  1236    84   rescan_partitions+0x14b/0x316
 16)  1152    40   __blkdev_get+0x1b2/0x258
 17)  1112     8   blkdev_get+0xf/0x11
 18)  1104    36   register_disk+0xbc/0x112
 19)  1068    32   add_disk+0x9f/0xf3
 20)  1036    48   sd_probe+0x2d4/0x394
 21)   988    20   driver_probe_device+0xa5/0x
Re: Large stack usage in fs code (especially for PPC64)
Steve,

> On Mon, 17 Nov 2008, Linus Torvalds wrote:
> >
> > I do wonder just _what_ it is that causes the stack frames to be so
> > horrid. For example, you have
> >
> >  18)  8896   160   .kmem_cache_alloc+0xfc/0x140
> >
> > and I'm looking at my x86-64 compile, and it has a stack frame of just 8
> > bytes (!) for local variables plus the save/restore area (which looks
> > like three registers plus frame pointer plus return address). IOW, if
> > I'm looking at the code right (so big caveat: I did _not_ do a real
> > stack dump!) the x86-64 stack cost for that same function is on the
> > order of 48 bytes. Not 160.
>
> Out of curiosity, I just ran stack_trace on the latest version of git
> (pulled sometime today) and ran it on my x86_64.
>
> I have SLUB and SLUB debug defined, and here's what I found:
>
>  11)  3592    64   kmem_cache_alloc+0x64/0xa3
>
> 64 bytes, still much lower than the 160 of PPC64.

The ppc64 ABI has a minimum stack frame of 112 bytes, due to having an
area for called functions to store their parameters (64 bytes) plus 6
slots for saving stuff and for the compiler and linker to use if they
need to. That's before any local variables are allocated.

The ppc32 ABI has a minimum stack frame of 16 bytes, which is much
nicer, at the expense of a much more complicated va_arg().

Paul.
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Linus Torvalds wrote:
>
> I do wonder just _what_ it is that causes the stack frames to be so
> horrid. For example, you have
>
>  18)  8896   160   .kmem_cache_alloc+0xfc/0x140
>
> and I'm looking at my x86-64 compile, and it has a stack frame of just 8
> bytes (!) for local variables plus the save/restore area (which looks like
> three registers plus frame pointer plus return address). IOW, if I'm
> looking at the code right (so big caveat: I did _not_ do a real stack
> dump!) the x86-64 stack cost for that same function is on the order of 48
> bytes. Not 160.

Out of curiosity, I just ran stack_trace on the latest version of git
(pulled sometime today) and ran it on my x86_64.

I have SLUB and SLUB debug defined, and here's what I found:

 11)  3592    64   kmem_cache_alloc+0x64/0xa3

64 bytes, still much lower than the 160 of PPC64.

Just to see where it got that number, I ran objdump on slub.o:

000000000000300e <kmem_cache_alloc>:
    300e:	55                   	push   %rbp
    300f:	48 89 e5             	mov    %rsp,%rbp
    3012:	41 57                	push   %r15
    3014:	41 56                	push   %r14
    3016:	41 55                	push   %r13
    3018:	41 54                	push   %r12
    301a:	53                   	push   %rbx
    301b:	48 83 ec 08          	sub    $0x8,%rsp

Six words pushed, plus the 8 byte stack frame: 6*8+8 = 56. But we also
add the push of the function return address, which is another 8 bytes,
giving us 64 bytes.

Here's the complete dump:

[EMAIL PROTECTED] linux-compile.git]# cat /debug/tracing/stack_trace
        Depth    Size   Location    (54 entries)
        -----    ----   --------
  0)  4504    48   __mod_zone_page_state+0x59/0x68
  1)  4456    64   __rmqueue_smallest+0xa0/0xd9
  2)  4392    80   __rmqueue+0x24/0x172
  3)  4312    96   rmqueue_bulk+0x57/0xa3
  4)  4216   224   get_page_from_freelist+0x371/0x6e1
  5)  3992   160   __alloc_pages_internal+0xe0/0x3f8
  6)  3832    16   __alloc_pages_nodemask+0xe/0x10
  7)  3816    48   alloc_pages_current+0xbe/0xc7
  8)  3768    16   alloc_slab_page+0x28/0x34
  9)  3752    64   new_slab+0x4a/0x1bb
 10)  3688    96   __slab_alloc+0x203/0x364
 11)  3592    64   kmem_cache_alloc+0x64/0xa3
 12)  3528    48   alloc_buffer_head+0x22/0x9d
 13)  3480    64   alloc_page_buffers+0x2f/0xd1
 14)  3416    48   create_empty_buffers+0x22/0xb5
 15)  3368   176   block_read_full_page+0x6b/0x25c
 16)  3192    16   blkdev_readpage+0x18/0x1a
 17)  3176    64   read_cache_page_async+0x85/0x11a
 18)  3112    32   read_cache_page+0x13/0x48
 19)  3080    48   read_dev_sector+0x36/0xaf
 20)  3032    96   read_lba+0x51/0xb0
 21)  2936   176   efi_partition+0x92/0x585
 22)  2760   128   rescan_partitions+0x173/0x308
 23)  2632    96   __blkdev_get+0x22a/0x2ea
 24)  2536    16   blkdev_get+0x10/0x12
 25)  2520    80   register_disk+0xe5/0x14a
 26)  2440    48   add_disk+0xbd/0x11f
 27)  2392    96   sd_probe+0x2c5/0x3ad
 28)  2296    48   driver_probe_device+0xc5/0x14c
 29)  2248    16   __device_attach+0xe/0x10
 30)  2232    64   bus_for_each_drv+0x56/0x8d
 31)  2168    48   device_attach+0x68/0x7f
 32)  2120    32   bus_attach_device+0x2d/0x5e
 33)  2088   112   device_add+0x45f/0x5d3
 34)  1976    64   scsi_sysfs_add_sdev+0xb7/0x246
 35)  1912   336   scsi_probe_and_add_lun+0x9f9/0xad7
 36)  1576    80   __scsi_add_device+0xb6/0xe5
 37)  1496    80   ata_scsi_scan_host+0x9e/0x1c6
 38)  1416    64   ata_host_register+0x238/0x252
 39)  1352    96   ata_pci_sff_activate_host+0x1a1/0x1ce
 40)  1256   256   piix_init_one+0x646/0x6b2
 41)  1000   144   pci_device_probe+0xc9/0x120
 42)   856    48   driver_probe_device+0xc5/0x14c
 43)   808    48   __driver_attach+0x67/0x91
 44)   760    64   bus_for_each_dev+0x54/0x84
 45)   696    16   driver_attach+0x21/0x23
 46)   680    64   bus_add_driver+0xba/0x204
 47)   616    64   driver_register+0x98/0x110
 48)   552    48   __pci_register_driver+0x6b/0xa4
 49)   504    16   piix_init+0x19/0x2c
 50)   488   272   _stext+0x5d/0x145
 51)   216    32   kernel_init+0x127/0x17a
 52)   184   184   child_rip+0xa/0x11

-- Steve
Re: Large stack usage in fs code (especially for PPC64)
Linus Torvalds writes:

> > > It's made worse by the fact that they
> > > also have horribly bad TLB fills on their broken CPU's, and years and
> > > years of telling people that the MMU on ppc's are sh*t has only been
> > > reacted to with "talk to the hand, we know better".
> >
> > Who are you talking about here precisely ? I don't think either Paul or
> > I ever said something nearly around those lines ... Oh well.
>
> Every single time I've complained about it, somebody from IBM has said
> ".. but but AIX".
>
> This time it was Paul. Sometimes it has been software people who agree,

Maybe I wasn't clear. I was reporting the reaction I get from the
hardware designers, who often like to push off the hard problems to
software to fix.

Keep flaming, by the way. It provides me with some amount of assistance
in my quest to get the hardware designers to do better. ;)

Also, you didn't respond to my comments about the purely software
benefits of a larger page size.

Paul.
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote:
>
> I've been hitting stack overflows on a PPC64 box, so I ran the ftrace
> stack_tracer and part of the problem with that box is that it can nest
> interrupts too deep. But what also worries me is that there's some heavy
> hitters of stacks in generic code. Namely the fs directory has some.

Note that we shouldn't stack interrupts much in practice. The PIC will
not let same or lower prio interrupts in until we have completed one.

However, the timer/decrementer is not going through the PIC, so I think
what happens is we get a hw IRQ; on the way back, just before returning
from do_IRQ (so we have completed the IRQ from the PIC standpoint), we
go into softirqs, at which point deep inside SCSI we get another HW IRQ
and we stack a decrementer interrupt on top of it.

Now, we should do stack switching for both HW IRQs and softirqs with
CONFIG_IRQSTACKS, which should significantly alleviate the problem.

Your second trace also shows how horrible the stack traces can be when
the device-model kicks in, ie, register -> probe -> register sub-device
-> etc... That isn't going to be nice on x86 with 4k stacks either. I
wonder if we should generally recommend for drivers of "bus" devices
not to register sub-devices from their own probe() routine, but defer
that to a kernel thread... because the stacking can be pretty bad. I
mean, nobody's done SATA over USB yet, but heh :-)

Cheers,
Ben.
Re: Large stack usage in fs code (especially for PPC64)
On Tue, 18 Nov 2008, Paul Mackerras wrote:
> Steven Rostedt writes:
>
> > Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> > softirqs still use the same stack as the process.
>
> They shouldn't. I don't see do_softirq in the trace, though. Which
> functions did you think would be run in a softirq? It looks to me
> like the deepest 10 or so functions are called at hard irq level,
> within hrtimer_interrupt called from timer_interrupt.
>
> > [EMAIL PROTECTED] ~> cat /debug/tracing/stack_trace
> >         Depth    Size   Location    (59 entries)
> >         -----    ----   --------
> >   0)  12384   192   ftrace_call+0x4/0x14
> >   1)  12192   128   .sched_clock+0x20/0x60
> >   2)  12064   128   .sched_clock_cpu+0x34/0x50
> >   3)  11936   144   .cpu_clock+0x3c/0xa0
> >   4)  11792   144   .get_timestamp+0x2c/0x50
> >   5)  11648   144   .__touch_softlockup_watchdog+0x3c/0x60
> >   6)  11504   192   .softlockup_tick+0xe4/0x220
> >   7)  11312   128   .run_local_timers+0x34/0x50
> >   8)  11184   160   .update_process_times+0x44/0xb0
> >   9)  11024   176   .tick_sched_timer+0x8c/0x120
> >  10)  10848   160   .__run_hrtimer+0xd8/0x130
> >  11)  10688   240   .hrtimer_interrupt+0x16c/0x220
> >  12)  10448   160   .timer_interrupt+0xcc/0x110
> >  13)  10288    96   decrementer_common+0xe0/0x100

Ah, you are right. I thought I saw softirq code in there, but must have
been seeing things. I need to get my eyesight checked. The other day I
totally botched a hand in cards because I thought I had 3 Aces of
diamonds when I really only had 2 Aces of diamonds and an Ace of hearts.

-- Steve
Re: Large stack usage in fs code (especially for PPC64)
On Tue, 18 Nov 2008, Benjamin Herrenschmidt wrote:
>
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submitted on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)
>
> It makes some sort of sense I suppose on very static embedded workloads
> with no swap nor demand paging.

It makes perfect sense for anything that doesn't use any MMU.

The hugepage support seems to cover many of the relevant cases, ie
databases and things like big static mappings (frame buffers etc).

> > It's made worse by the fact that they
> > also have horribly bad TLB fills on their broken CPU's, and years and
> > years of telling people that the MMU on ppc's are sh*t has only been
> > reacted to with "talk to the hand, we know better".
>
> Who are you talking about here precisely ? I don't think either Paul or
> I ever said something nearly around those lines ... Oh well.

Every single time I've complained about it, somebody from IBM has said
".. but but AIX".

This time it was Paul. Sometimes it has been software people who agree,
but point to hardware designers who "know better". If it's not some
insane database person, it's a Fortran program that runs for days.

> But there is also pressure to get larger page sizes from the small
> embedded field, where CPUs have even poorer TLB refill (software loaded
> basically) :-)

Yeah, I agree that you _can_ have even worse MMU's. I'm not saying that
PPC64 is absolutely pessimal and cannot be made worse. Software fill is
indeed even worse from a performance angle, despite the fact that it's
really "nice" from a conceptual angle.

Of course, of the sw fill users that remain, many do seem to be ppc..
It's like the architecture brings out the worst in hardware designers.

> > Quite frankly, 64kB pages are INSANE. But yes, in this case they
> > actually cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE]
> > thing uses 64 bytes, not 1kB.
>
> Come on, the code is crap to allocate that on the stack anyway :-)

Why? We do actually expect to be able to use stack-space for small
structures. We do it for a lot of things, including stuff like select()
optimistically using arrays allocated on the stack for the common small
case, just because it's, oh, about infinitely faster to do than to use
kmalloc().

Many of the page cache functions also have the added twist that they get
called from low-memory setups (eg write_whole_page()), and so try to
minimize allocations for that reason too.

		Linus
Re: Large stack usage in fs code (especially for PPC64)
On Tue, 18 Nov 2008 10:13:16 +1100
Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:
> 
> > Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry
> > array), but as you say:
> > 
> > > On PPC64 I'm told that the page size is 64K, which makes the above equal
> > > to: 64K / 512 = 128
> > > multiply that by 8 byte words, we have 1024 bytes.
> > 
> > Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
> > pages are insane, and 16kB pages are borderline. I've told people so.
> > 
> > The ppc people run databases, and they don't care about sane people
> > telling them the big pages suck.
> 
> Hehe :-)
> 
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submitted on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)

For clarification, that workload is very specific: namely, embedded 44x
CPUs used in RAID cards. I'm not entirely convinced bringing 256K pages
into mainline is a good thing yet anyway.

64K pages, while seemingly insane for embedded boards that typically
have less than 512 MiB of DRAM, help for a somewhat larger set of
workloads. A KVM host is the primary winner at the moment, but given the
small number of TLB entries on these CPUs, it can pay off elsewhere as
well.

josh
Re: Large stack usage in fs code (especially for PPC64)
> I'd have thought so, but I'm sure we're about to hear how important an > optimisation the smaller stacks are ;) Not sure, I tend to agree that it would make sense to bump our stack to 64K on 64K pages, it's not like we are saving anything and we are probably adding overhead in alloc/dealloc. I'll see what Paul thinks here. > Yup. That being said, the younger me did assert that "this is a neater > implementation anyway". If we can implement those loops without > needing those on-stack temporary arrays then things probably are better > overall. Amen. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
> Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry
> array), but as you say:
> 
> > On PPC64 I'm told that the page size is 64K, which makes the above equal
> > to: 64K / 512 = 128
> > multiply that by 8 byte words, we have 1024 bytes.
> 
> Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
> pages are insane, and 16kB pages are borderline. I've told people so.
> 
> The ppc people run databases, and they don't care about sane people
> telling them the big pages suck.

Hehe :-)

Guess who is pushing for larger page sizes nowadays ? Embedded
people :-) In fact, we have patches submitted on the list to offer the
option for ... 256K pages on some 44x embedded CPUs :-)

It makes some sort of sense I suppose on very static embedded workloads
with no swap nor demand paging.

> It's made worse by the fact that they
> also have horribly bad TLB fills on their broken CPU's, and years and
> years of telling people that the MMU on ppc's are sh*t has only been
> reacted to with "talk to the hand, we know better".

Who are you talking about here precisely ? I don't think either Paul or
I ever said anything along those lines ... Oh well.

But yeah, our existing server CPUs have pretty poor TLB refills and yes,
64K pages help. And yes, we would like things to be different, but they
aren't.

But there is also pressure to get larger page sizes from the small
embedded field, where CPUs have even poorer TLB refill (software loaded,
basically) :-)

> Quite frankly, 64kB pages are INSANE. But yes, in this case they actually
> cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses
> 64 bytes, not 1kB.

Come on, the code is crap to allocate that on the stack anyway :-)

> Of course, that would likely mean that FAT etc wouldn't work on ppc64, so
> I don't think that's a valid model either. But if the 64kB page size is
> just a "database server crazy-people config option", then maybe it's
> acceptable.

Well, as I said, embedded folks are wanting that too ...

Cheers,
Ben.
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 2008-11-17 at 15:59 -0500, Steven Rostedt wrote:
> On Mon, 17 Nov 2008, Steven Rostedt wrote:
> > 
> > Note, I was using a default config that had CONFIG_IRQSTACKS off and
> > CONFIG_PPC_64K_PAGES on.
> 
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.

No, they should use their own, unless somebody changed the softirq code
in ways that break us... (or unless I missed something; we based our
implementation roughly on what x86 did back then).

Ben.

> [EMAIL PROTECTED] ~> cat /debug/tracing/stack_trace
>         Depth    Size   Location    (59 entries)
>         -----    ----   --------
>   0)  12384   192  ftrace_call+0x4/0x14
>   1)  12192   128  .sched_clock+0x20/0x60
>   2)  12064   128  .sched_clock_cpu+0x34/0x50
>   3)  11936   144  .cpu_clock+0x3c/0xa0
>   4)  11792   144  .get_timestamp+0x2c/0x50
>   5)  11648   144  .__touch_softlockup_watchdog+0x3c/0x60
>   6)  11504   192  .softlockup_tick+0xe4/0x220
>   7)  11312   128  .run_local_timers+0x34/0x50
>   8)  11184   160  .update_process_times+0x44/0xb0
>   9)  11024   176  .tick_sched_timer+0x8c/0x120
>  10)  10848   160  .__run_hrtimer+0xd8/0x130
>  11)  10688   240  .hrtimer_interrupt+0x16c/0x220
>  12)  10448   160  .timer_interrupt+0xcc/0x110
>  13)  10288    96  decrementer_common+0xe0/0x100
>  14)  10192   784  .ftrace_test_stop_func+0x4c/0x70
>  15)   9408   112  ftrace_call+0x4/0x14
>  16)   9296   144  .init_buffer_head+0x20/0x60
>  17)   9152   256  .cache_alloc_refill+0x5a0/0x740
>  18)   8896   160  .kmem_cache_alloc+0xfc/0x140
>  19)   8736   144  .alloc_buffer_head+0x3c/0xc0
>  20)   8592   176  .alloc_page_buffers+0xb0/0x130
>  21)   8416   160  .create_empty_buffers+0x44/0x1a0
>  22)   8256  1280  .block_read_full_page+0x3b0/0x430
>  23)   6976   144  .blkdev_readpage+0x38/0x60
>  24)   6832   176  .read_cache_page_async+0x124/0x230
>  25)   6656   160  .read_cache_page+0x50/0xe0
>  26)   6496   160  .read_dev_sector+0x58/0x100
>  27)   6336   272  .msdos_partition+0xa4/0x880
>  28)   6064   240  .rescan_partitions+0x1f0/0x430
>  29)   5824   208  .__blkdev_get+0x124/0x3d0
>  30)   5616   144  .blkdev_get+0x3c/0x60
>  31)   5472   192  .register_disk+0x194/0x1f0
>  32)   5280   160  .add_disk+0xe8/0x160
>  33)   5120   224  .sd_probe+0x2f0/0x440
>  34)   4896   176  .driver_probe_device+0x14c/0x380
>  35)   4720   144  .__device_attach+0x38/0x60
>  36)   4576   192  .bus_for_each_drv+0xb8/0x120
>  37)   4384   160  .device_attach+0x94/0xe0
>  38)   4224   144  .bus_attach_device+0x78/0xb0
>  39)   4080   240  .device_add+0x54c/0x730
>  40)   3840   160  .scsi_sysfs_add_sdev+0x7c/0x2e0
>  41)   3680   336  .scsi_probe_and_add_lun+0xb7c/0xc60
>  42)   3344   192  .__scsi_add_device+0x11c/0x150
>  43)   3152   176  .ata_scsi_scan_host+0x238/0x330
>  44)   2976   208  .ata_host_register+0x320/0x3c0
>  45)   2768   192  .ata_host_activate+0xf8/0x190
>  46)   2576   224  .mv_pci_init_one+0x284/0x430
>  47)   2352   160  .pci_device_probe+0x98/0xf0
>  48)   2192   176  .driver_probe_device+0x14c/0x380
>  49)   2016   160  .__driver_attach+0xd4/0x110
>  50)   1856   192  .bus_for_each_dev+0xa8/0x100
>  51)   1664   144  .driver_attach+0x40/0x60
>  52)   1520   192  .bus_add_driver+0x268/0x340
>  53)   1328   176  .driver_register+0x88/0x1d0
>  54)   1152   176  .__pci_register_driver+0x90/0x100
>  55)    976   144  .mv_init+0x38/0xa0
>  56)    832   544  .do_one_initcall+0x78/0x240
>  57)    288   192  .kernel_init+0x21c/0x2b0
>  58)     96    96  .kernel_thread+0x54/0x70
> 
> This is still 12K. Kind of big even for a 16K stack.
> 
> -- Steve
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 2008-11-17 at 15:34 -0500, Steven Rostedt wrote: > Note, I was using a default config that had CONFIG_IRQSTACKS off and > CONFIG_PPC_64K_PAGES on. For one, we definitely need to turn IRQSTACKS on by default ... In fact, I'm pondering just removing the option. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
Linus Torvalds writes: > The ppc people run databases, and they don't care about sane people And HPC apps, and all sorts of other things... > telling them the big pages suck. It's made worse by the fact that they > also have horribly bad TLB fills on their broken CPU's, and years and Taking page faults at a 4k granularity also sucks. A lot of the performance improvement from using 64k pages comes just from executing the page fault path (and other paths in the mm code) less frequently. That's why we see a performance improvement from 64k pages even on machines that don't support 64k pages in hardware (like the 21% reduction in system time on a kernel compile that I reported earlier). That has nothing to do with TLB miss times or anything to do with the MMU. I'd love to be able to use a 4k base page size if I could still get the reduction in page faults and the expanded TLB reach that we get now with 64k pages. If we could allocate the page cache for large files with order-4 allocations wherever possible that would be a good start. I think Christoph Lameter had some patches in that direction but they didn't seem to get very far. > years of telling people that the MMU on ppc's are sh*t has only been > reacted to with "talk to the hand, we know better". Well, it's been reacted to with "AIX can use small pages where it makes sense, and large pages where that makes sense, so why can't Linux?" > I suspect the PPC people need to figure out some way to handle this in > their broken setups (since I don't really expect them to finally admit > that they were full of sh*t with their big pages), but since I think it's > a ppc bug, I'm not at all interested in a fix that penalizes the _good_ > case. Looks like we should make the stack a bit larger. Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
Steven Rostedt writes: > Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that > softirqs still use the same stack as the process. They shouldn't. I don't see do_softirq in the trace, though. Which functions did you think would be run in a softirq? It looks to me like the deepest 10 or so functions are called at hard irq level, within hrtimer_interrupt called from timer_interrupt. > [EMAIL PROTECTED] ~> cat /debug/tracing/stack_trace > Depth Size Location(59 entries) > - > 0)12384 192 ftrace_call+0x4/0x14 > 1)12192 128 .sched_clock+0x20/0x60 > 2)12064 128 .sched_clock_cpu+0x34/0x50 > 3)11936 144 .cpu_clock+0x3c/0xa0 > 4)11792 144 .get_timestamp+0x2c/0x50 > 5)11648 144 .__touch_softlockup_watchdog+0x3c/0x60 > 6)11504 192 .softlockup_tick+0xe4/0x220 > 7)11312 128 .run_local_timers+0x34/0x50 > 8)11184 160 .update_process_times+0x44/0xb0 > 9)11024 176 .tick_sched_timer+0x8c/0x120 > 10)10848 160 .__run_hrtimer+0xd8/0x130 > 11)10688 240 .hrtimer_interrupt+0x16c/0x220 > 12)10448 160 .timer_interrupt+0xcc/0x110 > 13)10288 96 decrementer_common+0xe0/0x100 Paul. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Andrew Morton wrote: > > Yup. That being said, the younger me did assert that "this is a neater > implementation anyway". If we can implement those loops without > needing those on-stack temporary arrays then things probably are better > overall. Sure, if it actually ends up being nicer, I'll not argue with it. But from an L1 I$ standpoint (and I$ is often very important, especially for kernel loads where loops are fairly rare), it's often _much_ better to do two "tight" loops over two subsystems (filesystem and block layer) than it is to do one bigger loop that contains both. If the L1 can fit both subsystem paths, you're fine - but if not, you may get a lot more misses. So it's often nice if you can "stage" things so that you do a cluster of calls to one area, followed by a cluster of calls to another, rather than mix it up. But numbers talk. And code cleanliness. If somebody has numbers that the code size actually goes down for example, or the code is just more readable, micro-optimizing cache patterns isn't worth it. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
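The "staging" Linus describes (clustering calls into one subsystem, then the other, so each pass stays within one subsystem's L1 I-cache footprint) can be illustrated with a toy sketch; `fs_map_block()` and `blk_submit()` are hypothetical stand-ins for calls into two different subsystems, not real kernel APIs:

```c
#define MAX_BLOCKS 16

/* Stand-in for filesystem-layer work: "map" a logical block number. */
static int fs_map_block(int logical) { return logical * 2; }

/* Stand-in for block-layer work: record a submitted physical block. */
static int submitted[MAX_BLOCKS];
static int nsubmitted;
static void blk_submit(int phys) { submitted[nsubmitted++] = phys; }

/* Staged version: one tight loop over the filesystem layer, then one
 * tight loop over the block layer, instead of one big loop that
 * alternates between both subsystems' code paths per block.
 * nblocks must be <= MAX_BLOCKS. */
void read_page_staged(int nblocks)
{
    int mapped[MAX_BLOCKS];

    for (int i = 0; i < nblocks; i++)     /* pass 1: filesystem layer */
        mapped[i] = fs_map_block(i);
    for (int i = 0; i < nblocks; i++)     /* pass 2: block layer */
        blk_submit(mapped[i]);
}
```

The trade-off is exactly the one the thread debates: the staged form needs a temporary array (stack space proportional to blocks per page), while the single mixed loop needs none but interleaves two subsystems' instruction footprints.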
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008 13:23:23 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Mon, 17 Nov 2008, Andrew Morton wrote: > > > > Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT! :) > > > > I fixed this six years ago. See http://lkml.org/lkml/2002/6/17/68 > > Btw, in that thread I also said: > > "If we have 64kB pages, such architectures will have to have a bigger >kernel stack. Which they will have, simply by virtue of having the very >same bigger page. So that problem kind of solves itself." > > and that may still be the "right" solution - if somebody is so insane that > they want 64kB pages, then they might as well have a 64kB kernel stack as > well. I'd have thought so, but I'm sure we're about to hear how important an optimisation the smaller stacks are ;) > Trust me, the kernel stack isn't where you blow your memory with a 64kB > page. You blow all your memory on the memory fragmentation of your page > cache. I did the stats for the kernel source tree a long time ago, and I > think you wasted something like 4GB of RAM with a 64kB page size. > Yup. That being said, the younger me did assert that "this is a neater implementation anyway". If we can implement those loops without needing those on-stack temporary arrays then things probably are better overall. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
On Mon, Nov 17, 2008 at 11:18 PM, Linus Torvalds <[EMAIL PROTECTED]> wrote: > I do wonder just _what_ it is that causes the stack frames to be so > horrid. For example, you have > > 18) 8896 160 .kmem_cache_alloc+0xfc/0x140 > > and I'm looking at my x86-64 compile, and it has a stack frame of just 8 > bytes (!) for local variables plus the save/restore area (which looks like > three registers plus frame pointer plus return address). IOW, if I'm > looking at the code right (so big caveat: I did _not_ do a real stack > dump!) the x86-64 stack cost for that same function is on the order of 48 > bytes. Not 160. > > Where does that factor-of-three+ difference come from? From the numbers, I > suspect ppc64 has a 32-byte stack alignment, which may be part of it, and > I guess the compiler is more eager to use all those extra registers and > will happily have many more callee-saved regs that are actually used. > > But that still a _lot_ of extra stack. > > Of course, you may have things like spinlock debugging etc enabled. Some > of our debugging options do tend to blow things up. Note that kmem_cache_alloc() is likely to contain lots of inlined functions for both SLAB and SLUB. Perhaps that blows up stack usage on ppc? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Andrew Morton wrote: > > Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT! :) > > I fixed this six years ago. See http://lkml.org/lkml/2002/6/17/68 Btw, in that thread I also said: "If we have 64kB pages, such architectures will have to have a bigger kernel stack. Which they will have, simply by virtue of having the very same bigger page. So that problem kind of solves itself." and that may still be the "right" solution - if somebody is so insane that they want 64kB pages, then they might as well have a 64kB kernel stack as well. Trust me, the kernel stack isn't where you blow your memory with a 64kB page. You blow all your memory on the memory fragmentation of your page cache. I did the stats for the kernel source tree a long time ago, and I think you wasted something like 4GB of RAM with a 64kB page size. Linus ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
> Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
> softirqs still use the same stack as the process.

Yes.

> This is still 12K. Kind of big even for a 16K stack.

And while that 1kB+ stack slot for block_read_full_page still stands out
like a sore thumb, I do agree that there's way too many other functions
too with big stack frames.

I do wonder just _what_ it is that causes the stack frames to be so
horrid. For example, you have

 18)   8896   160  .kmem_cache_alloc+0xfc/0x140

and I'm looking at my x86-64 compile, and it has a stack frame of just 8
bytes (!) for local variables plus the save/restore area (which looks
like three registers plus frame pointer plus return address). IOW, if
I'm looking at the code right (so big caveat: I did _not_ do a real
stack dump!) the x86-64 stack cost for that same function is on the
order of 48 bytes. Not 160.

Where does that factor-of-three+ difference come from? From the numbers,
I suspect ppc64 has a 32-byte stack alignment, which may be part of it,
and I guess the compiler is more eager to use all those extra registers
and will happily have many more callee-saved regs that are actually
used.

But that's still a _lot_ of extra stack.

Of course, you may have things like spinlock debugging etc enabled. Some
of our debugging options do tend to blow things up.

Linus
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008 15:34:13 -0500 (EST)
Steven Rostedt <[EMAIL PROTECTED]> wrote:
> 
> I've been hitting stack overflows on a PPC64 box, so I ran the ftrace
> stack_tracer and part of the problem with that box is that it can nest
> interrupts too deep. But what also worries me is that there's some heavy
> hitters of stacks in generic code. Namely the fs directory has some.
> 
> Here's the full dump of the stack (PPC64):
> 
> ...
> 
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start
> of these two:
> 
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	sector_t iblock, lblock;
> 	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> 	unsigned int blocksize;
> 	int nr, i;
> 	int fully_mapped = 1;
> [...]
> 
> static struct bio *
> do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
> 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
> 		unsigned long *first_logical_block, get_block_t get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	const unsigned blkbits = inode->i_blkbits;
> 	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
> 	const unsigned blocksize = 1 << blkbits;
> 	sector_t block_in_file;
> 	sector_t last_block;
> 	sector_t last_block_in_file;
> 	sector_t blocks[MAX_BUF_PER_PAGE];
> 	unsigned page_block;
> 	unsigned first_hole = blocks_per_page;
> 	struct block_device *bdev = NULL;
> 	int length;
> 	int fully_mapped = 1;
> 	unsigned nblocks;
> 	unsigned relative_block;
> 
> The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That is
> defined as:
> 
> #define MAX_BUF_PER_PAGE	(PAGE_CACHE_SIZE / 512)
> 
> Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.
> 
> On PPC64 I'm told that the page size is 64K, which makes the above equal
> to: 64K / 512 = 128
> multiply that by 8 byte words, we have 1024 bytes.
> 
> The problem with PPC64 is that the stack size is not equal to the page
> size. The stack size is only 16K not 64K.
> 
> The above stack trace was taken right after boot up and it was already at
> 14K, not too far from the 16k limit.
> 
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.

Far be it from me to apportion blame, but THIS IS ALL LINUS'S FAULT! :)

I fixed this six years ago. See http://lkml.org/lkml/2002/6/17/68

I still have the patch but for some reason it appears to get some
rejects. However I think the approach (whatever it was ;)) is still
usable. Perhaps some keen young thing has time to get down and redo it.

  This patch fixes some potential stack consumption problems. The
  storage for

      sector_t blocks[MAX_BUF_PER_PAGE];

  and

      struct buffer_head *arr[MAX_BUF_PER_PAGE];

  will consume a kilobyte with a 64k page and 64-bit sector_t. And on
  the path

      do_mpage_readpage -> block_read_full_page -> mpage_writepage

  they can be nested three-deep. 3k of stack gone.

  Presumably in this case the stack page would be 64k anyway, so that's
  not a problem. However if PAGE_SIZE=4k and PAGE_CACHE_SIZE=64k, we
  die. I've yet to see any reason for larger PAGE_CACHE_SIZE, but this
  is a neater implementation anyway. The patch removes MAX_BUF_PER_PAGE
  and instead walks the page's buffer ring to work out which buffers
  need to be submitted.

--- 2.5.22/fs/mpage.c~stack-space	Sun Jun 16 22:50:17 2002
+++ 2.5.22-akpm/fs/mpage.c	Sun Jun 16 22:50:17 2002
@@ -169,8 +169,9 @@ do_mpage_readpage(struct bio *bio, struc
 	const unsigned blocksize = 1 << blkbits;
 	struct bio_vec *bvec;
 	sector_t block_in_file;
-	sector_t last_block;
-	sector_t blocks[MAX_BUF_PER_PAGE];
+	sector_t last_file_block;
+	sector_t first_page_block = -1;
+	sector_t last_page_block = -1;
 	unsigned page_block;
 	unsigned first_hole = blocks_per_page;
 	struct block_device *bdev = NULL;
@@ -180,12 +181,12 @@ do_mpage_readpage(struct bio *bio, struc
 		goto confused;
 
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (inode->i_size + blocksize - 1) >> blkbits;
+	last_file_block = (inode->i_size + blocksize - 1) >> blkbits;
 
 	for (page_block = 0; page_block < blocks_per_page;
 				page_block++, block_in_file++) {
 		bh.b_state = 0;
-		if (block_in_file < last_block) {
+		if (block_in_file < last_file_block) {
 			if (get_block(inode, block_in_file, &bh, 0))
 				goto confused;
 		}
@@ -199,10 +200,14 @@ do_mpage_readpage(struct bio *bio, struc
 		if (first_hole != blocks_per_page)
 			goto confused;	/* hole -> non-hole *
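The "walk the page's buffer ring" idea from the (truncated) patch above can be sketched in plain C. `struct buffer_head` here is a stripped-down stand-in for the kernel structure, keeping only the circular `b_this_page` link, and `for_each_stale_buffer` is a hypothetical helper, not real kernel code:

```c
struct buffer_head {
    struct buffer_head *b_this_page; /* circular list of this page's buffers */
    int b_uptodate;                  /* nonzero if no read is needed */
};

/* Walk the page's circular buffer ring and handle each buffer that
 * still needs I/O as we go. Unlike the sector_t blocks[MAX_BUF_PER_PAGE]
 * approach, stack use here is constant no matter how many buffers a
 * page has. submit may be NULL when only counting stale buffers. */
int for_each_stale_buffer(struct buffer_head *head,
                          void (*submit)(struct buffer_head *))
{
    struct buffer_head *bh = head;
    int count = 0;

    do {
        if (!bh->b_uptodate) {
            if (submit)
                submit(bh);
            count++;
        }
        bh = bh->b_this_page;   /* ring: eventually returns to head */
    } while (bh != head);
    return count;
}
```

This is the design point of the patch: the per-buffer state lives in the buffer heads themselves (which already exist for the page), so nothing scales with `PAGE_SIZE / blocksize` on the stack.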
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
>  45)   4992  1280  .block_read_full_page+0x23c/0x430
>  46)   3712  1280  .do_mpage_readpage+0x43c/0x740

Ouch.

> Notice at line 45 and 46 the stack usage of block_read_full_page and
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start
> of these two:
> 
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	sector_t iblock, lblock;
> 	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];

Yeah, that's unacceptable. Well, it's not unacceptable on good CPU's
with 4kB blocks (just an 8-entry array), but as you say:

> On PPC64 I'm told that the page size is 64K, which makes the above equal
> to: 64K / 512 = 128
> multiply that by 8 byte words, we have 1024 bytes.

Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
pages are insane, and 16kB pages are borderline. I've told people so.

The ppc people run databases, and they don't care about sane people
telling them the big pages suck. It's made worse by the fact that they
also have horribly bad TLB fills on their broken CPU's, and years and
years of telling people that the MMU on ppc's are sh*t has only been
reacted to with "talk to the hand, we know better".

Quite frankly, 64kB pages are INSANE. But yes, in this case they
actually cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE]
thing uses 64 bytes, not 1kB.

I suspect the PPC people need to figure out some way to handle this in
their broken setups (since I don't really expect them to finally admit
that they were full of sh*t with their big pages), but since I think
it's a ppc bug, I'm not at all interested in a fix that penalizes the
_good_ case.

So either make it some kind of (clean) conditional dynamic non-stack
allocation, or make it do some outer loop over the whole page that turns
into a compile-time no-op when the page is sufficiently small to be done
in one go.

Or perhaps say "if you have 64kB pages, you're a moron, and to
counteract that moronic page size, you cannot do 512-byte granularity IO
any more". Of course, that would likely mean that FAT etc wouldn't work
on ppc64, so I don't think that's a valid model either. But if the 64kB
page size is just a "database server crazy-people config option", then
maybe it's acceptable. Database people usually don't want to connect
their cameras or mp3-players with their FAT12 filesystems.

Linus
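The arithmetic quoted above (MAX_BUF_PER_PAGE = PAGE_SIZE / 512, times 8 bytes per entry) can be made concrete with a small user-space sketch; `on_stack_array_bytes` is an illustrative helper, assuming a 64-bit `sector_t` or pointer per array slot:

```c
/* Stack cost of one on-stack array such as
 *     sector_t blocks[MAX_BUF_PER_PAGE];
 * as a function of page size, where
 *     MAX_BUF_PER_PAGE = PAGE_SIZE / 512
 * and each entry is 8 bytes on a 64-bit kernel. */
unsigned long on_stack_array_bytes(unsigned long page_size)
{
    unsigned long max_buf_per_page = page_size / 512;  /* MAX_BUF_PER_PAGE */
    return max_buf_per_page * 8;                       /* 8-byte entries */
}
```

This reproduces the numbers in the thread: 64 bytes per array with 4kB pages, 1024 bytes with 64kB pages, and since two such arrays can be live at once (block_read_full_page plus do_mpage_readpage), the 64kB case burns over 2kB of a 16kB stack.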
Re: Large stack usage in fs code (especially for PPC64)
On Mon, 17 Nov 2008, Steven Rostedt wrote:
> 
> Note, I was using a default config that had CONFIG_IRQSTACKS off and
> CONFIG_PPC_64K_PAGES on.

Here's my stack after boot up with CONFIG_IRQSTACKS set. Seems that
softirqs still use the same stack as the process.

[EMAIL PROTECTED] ~> cat /debug/tracing/stack_trace
        Depth    Size   Location    (59 entries)
        -----    ----   --------
  0)  12384   192  ftrace_call+0x4/0x14
  1)  12192   128  .sched_clock+0x20/0x60
  2)  12064   128  .sched_clock_cpu+0x34/0x50
  3)  11936   144  .cpu_clock+0x3c/0xa0
  4)  11792   144  .get_timestamp+0x2c/0x50
  5)  11648   144  .__touch_softlockup_watchdog+0x3c/0x60
  6)  11504   192  .softlockup_tick+0xe4/0x220
  7)  11312   128  .run_local_timers+0x34/0x50
  8)  11184   160  .update_process_times+0x44/0xb0
  9)  11024   176  .tick_sched_timer+0x8c/0x120
 10)  10848   160  .__run_hrtimer+0xd8/0x130
 11)  10688   240  .hrtimer_interrupt+0x16c/0x220
 12)  10448   160  .timer_interrupt+0xcc/0x110
 13)  10288    96  decrementer_common+0xe0/0x100
 14)  10192   784  .ftrace_test_stop_func+0x4c/0x70
 15)   9408   112  ftrace_call+0x4/0x14
 16)   9296   144  .init_buffer_head+0x20/0x60
 17)   9152   256  .cache_alloc_refill+0x5a0/0x740
 18)   8896   160  .kmem_cache_alloc+0xfc/0x140
 19)   8736   144  .alloc_buffer_head+0x3c/0xc0
 20)   8592   176  .alloc_page_buffers+0xb0/0x130
 21)   8416   160  .create_empty_buffers+0x44/0x1a0
 22)   8256  1280  .block_read_full_page+0x3b0/0x430
 23)   6976   144  .blkdev_readpage+0x38/0x60
 24)   6832   176  .read_cache_page_async+0x124/0x230
 25)   6656   160  .read_cache_page+0x50/0xe0
 26)   6496   160  .read_dev_sector+0x58/0x100
 27)   6336   272  .msdos_partition+0xa4/0x880
 28)   6064   240  .rescan_partitions+0x1f0/0x430
 29)   5824   208  .__blkdev_get+0x124/0x3d0
 30)   5616   144  .blkdev_get+0x3c/0x60
 31)   5472   192  .register_disk+0x194/0x1f0
 32)   5280   160  .add_disk+0xe8/0x160
 33)   5120   224  .sd_probe+0x2f0/0x440
 34)   4896   176  .driver_probe_device+0x14c/0x380
 35)   4720   144  .__device_attach+0x38/0x60
 36)   4576   192  .bus_for_each_drv+0xb8/0x120
 37)   4384   160  .device_attach+0x94/0xe0
 38)   4224   144  .bus_attach_device+0x78/0xb0
 39)   4080   240  .device_add+0x54c/0x730
 40)   3840   160  .scsi_sysfs_add_sdev+0x7c/0x2e0
 41)   3680   336  .scsi_probe_and_add_lun+0xb7c/0xc60
 42)   3344   192  .__scsi_add_device+0x11c/0x150
 43)   3152   176  .ata_scsi_scan_host+0x238/0x330
 44)   2976   208  .ata_host_register+0x320/0x3c0
 45)   2768   192  .ata_host_activate+0xf8/0x190
 46)   2576   224  .mv_pci_init_one+0x284/0x430
 47)   2352   160  .pci_device_probe+0x98/0xf0
 48)   2192   176  .driver_probe_device+0x14c/0x380
 49)   2016   160  .__driver_attach+0xd4/0x110
 50)   1856   192  .bus_for_each_dev+0xa8/0x100
 51)   1664   144  .driver_attach+0x40/0x60
 52)   1520   192  .bus_add_driver+0x268/0x340
 53)   1328   176  .driver_register+0x88/0x1d0
 54)   1152   176  .__pci_register_driver+0x90/0x100
 55)    976   144  .mv_init+0x38/0xa0
 56)    832   544  .do_one_initcall+0x78/0x240
 57)    288   192  .kernel_init+0x21c/0x2b0
 58)     96    96  .kernel_thread+0x54/0x70

This is still 12K. Kind of big even for a 16K stack.

-- Steve
Large stack usage in fs code (especially for PPC64)
I've been hitting stack overflows on a PPC64 box, so I ran the ftrace
stack_tracer and part of the problem with that box is that it can nest
interrupts too deep. But what also worries me is that there's some heavy
hitters of stacks in generic code. Namely the fs directory has some.

Here's the full dump of the stack (PPC64):

[EMAIL PROTECTED] ~> cat /debug/tracing/stack_trace
        Depth    Size   Location    (56 entries)
        -----    ----   --------
  0)  14032   112  ftrace_call+0x4/0x14
  1)  13920   128  .sched_clock+0x20/0x60
  2)  13792   128  .sched_clock_cpu+0x34/0x50
  3)  13664   144  .cpu_clock+0x3c/0xa0
  4)  13520   144  .get_timestamp+0x2c/0x50
  5)  13376   192  .softlockup_tick+0x100/0x220
  6)  13184   128  .run_local_timers+0x34/0x50
  7)  13056   160  .update_process_times+0x44/0xb0
  8)  12896   176  .tick_sched_timer+0x8c/0x120
  9)  12720   160  .__run_hrtimer+0xd8/0x130
 10)  12560   240  .hrtimer_interrupt+0x16c/0x220
 11)  12320   160  .timer_interrupt+0xcc/0x110
 12)  12160   240  decrementer_common+0xe0/0x100
 13)  11920   576  0x80
 14)  11344   160  .usb_hcd_irq+0x94/0x150
 15)  11184   160  .handle_IRQ_event+0x80/0x120
 16)  11024   160  .handle_fasteoi_irq+0xd8/0x1e0
 17)  10864   160  .do_IRQ+0xbc/0x150
 18)  10704   144  hardware_interrupt_entry+0x1c/0x3c
 19)  10560   672  0x0
 20)   9888   144  ._spin_unlock_irqrestore+0x84/0xd0
 21)   9744   160  .scsi_dispatch_cmd+0x170/0x360
 22)   9584   208  .scsi_request_fn+0x324/0x5e0
 23)   9376   144  .blk_invoke_request_fn+0xc8/0x1b0
 24)   9232   144  .__blk_run_queue+0x48/0x60
 25)   9088   144  .blk_run_queue+0x40/0x70
 26)   8944   192  .scsi_run_queue+0x3a8/0x3e0
 27)   8752   160  .scsi_next_command+0x58/0x90
 28)   8592   176  .scsi_end_request+0xd4/0x130
 29)   8416   208  .scsi_io_completion+0x15c/0x500
 30)   8208   160  .scsi_finish_command+0x15c/0x190
 31)   8048   160  .scsi_softirq_done+0x138/0x1e0
 32)   7888   160  .blk_done_softirq+0xd0/0x100
 33)   7728   192  .__do_softirq+0xe8/0x1e0
 34)   7536   144  .do_softirq+0xa4/0xd0
 35)   7392   144  .irq_exit+0xb4/0xf0
 36)   7248   160  .do_IRQ+0x114/0x150
 37)   7088   752  hardware_interrupt_entry+0x1c/0x3c
 38)   6336   144  .blk_rq_init+0x28/0xc0
 39)   6192   208  .get_request+0x13c/0x3d0
 40)   5984   240  .get_request_wait+0x60/0x170
 41)   5744   192  .__make_request+0xd4/0x560
 42)   5552   192  .generic_make_request+0x210/0x300
 43)   5360   208  .submit_bio+0x168/0x1a0
 44)   5152   160  .submit_bh+0x188/0x1e0
 45)   4992  1280  .block_read_full_page+0x23c/0x430
 46)   3712  1280  .do_mpage_readpage+0x43c/0x740
 47)   2432   352  .mpage_readpages+0x130/0x1c0
 48)   2080   160  .ext3_readpages+0x50/0x80
 49)   1920   256  .__do_page_cache_readahead+0x1e4/0x340
 50)   1664   160  .do_page_cache_readahead+0x94/0xe0
 51)   1504   240  .filemap_fault+0x360/0x530
 52)   1264   256  .__do_fault+0xb8/0x600
 53)   1008   240  .handle_mm_fault+0x190/0x920
 54)    768   320  .do_page_fault+0x3d4/0x5f0
 55)    448   448  handle_page_fault+0x20/0x5c

Notice at line 45 and 46 the stack usage of block_read_full_page and
do_mpage_readpage. They each use 1280 bytes of stack! Looking at the
start of these two:

int block_read_full_page(struct page *page, get_block_t *get_block)
{
	struct inode *inode = page->mapping->host;
	sector_t iblock, lblock;
	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
	unsigned int blocksize;
	int nr, i;
	int fully_mapped = 1;
[...]

static struct bio *
do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
		sector_t *last_block_in_bio, struct buffer_head *map_bh,
		unsigned long *first_logical_block, get_block_t get_block)
{
	struct inode *inode = page->mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
	const unsigned blocksize = 1 << blkbits;
	sector_t block_in_file;
	sector_t last_block;
	sector_t last_block_in_file;
	sector_t blocks[MAX_BUF_PER_PAGE];
	unsigned page_block;
	unsigned first_hole = blocks_per_page;
	struct block_device *bdev = NULL;
	int length;
	int fully_mapped = 1;
	unsigned nblocks;
	unsigned relative_block;

The thing that hits my eye on both is the MAX_BUF_PER_PAGE usage. That
is defined as:

#define MAX_BUF_PER_PAGE	(PAGE_CACHE_SIZE / 512)

Where PAGE_CACHE_SIZE is the same as PAGE_SIZE.

On PPC64 I'm told that the page size is 64K, which makes the above equal
to: 64K / 512 = 128
multiply that by 8 byte words, we have 1024 bytes.

The problem with PPC64 is