Re: Memory management bug
>Agreed, that's almost sure _not_ random memory corruption of the page >structure. It looks like a VM bug (if you can reproduce trivially I'd give a >try to test8 too since test8 is rock solid for me while test10 lockups in VM >core at the second bonnie if using emulated highmem). I was lucky. Somehow I managed to f**k up my disk in a way that the filesystem check triggers the bug in a reproducible way and always with the same page! I setup a "trace store into" to the page structure and logged who is changing the "struct page". Here is the log starting after page->mapping was set: address changed function 5c13a mapping add_to_page_cache_unique count=2, flags=PG_locked, age=2 5b14a next_hash __add_page_to_hash_queue 5b178 buffers __add_page_to_hash_queue 68440 flags lru_cache_add flags=PG_active|PG_locked 6846a lru lru_cache_add 68470 lru lru_cache_add 78fc6 virtual create_empty_buffers 78fda count create_empty_buffers count=3 6d9ce count __free_pages count=2 5c122 list __add_page_to_hash_queue 68464 lru lru_cache_add 77b16 flags end_buffer_io_async flags=PG_active|PG_uptodate|PG_locked 77b52 flags end_buffer_io_async flags=PG_active|PG_uptodate|PG_locked 77bc4 flags end_buffer_io_async flags=PG_active|PG_uptodate 67792 age age_page_up age=5 5c88c count __find_get_page count=3 559be count copy_page_range count=4 559be count copy_page_rage count=5 6d9ce count __free_pages count=4 6b55e lru refill_inactive_scan 6b4ac flags refill_inactive_scan flags=PG_active|PG_uptodate 6770c age age_page_down_ageonly age=2 6b570 lru refill_inactive_scan 6b576 lru refill_inactive_scan 6b56a lru refill_inactive_scan 6b55e lru refill_inactive_scan 6b4ac flags refill_inactive_scan flags=PG_active|PG_uptodate 6770c age age_page_down_ageonly age=1 6b570 lru refill_inactive_scan 6b576 lru refill_inactive_scan 6b56a lru refill_inactive_scan 6b55e lru refill_inactive_scan 6b4ac flags refill_inactive_scan flags=PG_active|PG_uptodate 6770c age age_page_down_ageonly age=0 6b570 lru refill_inactive_scan 6b576 lru refill_inactive_scan 6b56a lru refill_inactive_scan program check at 6e1e0 because of BUG() in line 60 of swap_state.c. Stack backtrace from there: 6e1e0 add_to_swap_cache 6900a try_to_swap_out 69408 swap_out_vma 69578 swap_out_mm 69838 swap_out 6b90a refill_inactive 6bab4 do_try_to_free_pages 6bbba kswapd age_page_down_ageonly was always called from refill_inactive_scan. So refill_inactive_scan lowers the age of the pages but does not deactivate the page when it reached age==0 (page->count to big). try_to_swap_out doesn't check for page->mapping and tries to swap out the page because the age is 0. Bang! blue skies, Martin P.S. by the way this test was done on linux-2.4.0-test11 Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
Agreed, that's almost sure _not_ random memory corruption of the page structure. It looks like a VM bug (if you can reproduce trivially I'd give a try to test8 too since test8 is rock solid for me while test10 lockups in VM core at the second bonnie if using emulated highmem). I was lucky. Somehow I managed to f**k up my disk in a way that the filesystem check triggers the bug in a reproducible way and always with the same page! I setup a "trace store into" to the page structure and logged who is changing the "struct page". Here is the log starting after page-mapping was set: address changed function 5c13a mapping add_to_page_cache_unique count=2, flags=PG_locked, age=2 5b14a next_hash __add_page_to_hash_queue 5b178 buffers __add_page_to_hash_queue 68440 flags lru_cache_add flags=PG_active|PG_locked 6846a lru lru_cache_add 68470 lru lru_cache_add 78fc6 virtual create_empty_buffers 78fda count create_empty_buffers count=3 6d9ce count __free_pages count=2 5c122 list __add_page_to_hash_queue 68464 lru lru_cache_add 77b16 flags end_buffer_io_async flags=PG_active|PG_uptodate|PG_locked 77b52 flags end_buffer_io_async flags=PG_active|PG_uptodate|PG_locked 77bc4 flags end_buffer_io_async flags=PG_active|PG_uptodate 67792 age age_page_up age=5 5c88c count __find_get_page count=3 559be count copy_page_range count=4 559be count copy_page_rage count=5 6d9ce count __free_pages count=4 6b55e lru refill_inactive_scan 6b4ac flags refill_inactive_scan flags=PG_active|PG_uptodate 6770c age age_page_down_ageonly age=2 6b570 lru refill_inactive_scan 6b576 lru refill_inactive_scan 6b56a lru refill_inactive_scan 6b55e lru refill_inactive_scan 6b4ac flags refill_inactive_scan flags=PG_active|PG_uptodate 6770c age age_page_down_ageonly age=1 6b570 lru refill_inactive_scan 6b576 lru refill_inactive_scan 6b56a lru refill_inactive_scan 6b55e lru refill_inactive_scan 6b4ac flags refill_inactive_scan flags=PG_active|PG_uptodate 6770c age age_page_down_ageonly age=0 6b570 lru refill_inactive_scan 6b576 lru refill_inactive_scan 6b56a lru refill_inactive_scan program check at 6e1e0 because of BUG() in line 60 of swap_state.c. Stack backtrace from there: 6e1e0 add_to_swap_cache 6900a try_to_swap_out 69408 swap_out_vma 69578 swap_out_mm 69838 swap_out 6b90a refill_inactive 6bab4 do_try_to_free_pages 6bbba kswapd age_page_down_ageonly was always called from refill_inactive_scan. So refill_inactive_scan lowers the age of the pages but does not deactivate the page when it reached age==0 (page-count to big). try_to_swap_out doesn't check for page-mapping and tries to swap out the page because the age is 0. Bang! blue skies, Martin P.S. by the way this test was done on linux-2.4.0-test11 Linux/390 Design Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, 17 Nov 2000, Andrea Arcangeli wrote: > Plus I add that the "if (!order) goto try_again" is an obvious > deadlock prone bug introduce in test9 that should be removed. 1) how would this cause deadlocks? 2) how would this somehow be worse than the unconditional 'goto try_again' we had before? This goto is ok because we have the OOM killer, which will select a process to kill when we run out of memory. Also, the goto will make sure that OTHER processes will survive while the "guilty" process will be killed. The guilty process will never get to the goto because it will have PF_MEMALLOC set. regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, 17 Nov 2000, Andrea Arcangeli wrote: > Actually memory balancing in 2.4.x doesn't get any information, > not even the information about which _classzone_ where to free > the memory (NOTE: both 2.2.x and 2.0.x _always_ got the > classzone where to free memory at least). This classzone missing > information causes resources wastage indeed and I just fixed it > several times, BTW. Interesting, I can't remember you sending me any patches... Also, the 2.4 VM (unlike the other VMs) doesn't actually FREE memory wrongly (with the exception of buffer cache pages from page_launder()) but just moves it to the inactive_clean list, from where it will be re-used by one of those 99% user level allocations that happen on a typical Linux system. But, as I said in Ottawa, I wouldn't mind any classzone stuff in the new VM, as long as it won't complicate the integration of _other_ memory organisations (like NUMA). regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, Nov 17, 2000 at 05:35:53PM +0100, [EMAIL PROTECTED] wrote: > I did a little closer investigation. The BUG was triggered by a page with > page->mapping pointing to an address space of a mapped ext2 file > (page->mapping->a_ops == _aops). The page had PG_locked, PG_uptodate, > PG_active and PG_swap_cache set. The stack backstrace showed that kswapd > called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm, > swap_out_vma, try_to_swap_out and add_to_swap_cache where BUG hit. The > registers look good, the struct page looks good. I don't think that this was > a random memory corruption. Agreed, that's almost sure _not_ random memory corruption of the page structure. It looks like a VM bug (if you can reproduce trivially I'd give a try to test8 too since test8 is rock solid for me while test10 lockups in VM core at the second bonnie if using emulated highmem). > I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not > the "if (something) BUG()" ifs. Ah ok :), see Linus's answer: in your case the "don't do that" means to implement the: #define SOFT_PAGE_SIZE (PAGE_SIZE<<2) thing we were talking about yesterday of course. Plus I add that the "if (!order) goto try_again" is an obvious deadlock prone bug introduce in test9 that should be removed. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, 17 Nov 2000 [EMAIL PROTECTED] wrote: > > >> Whats the reasoning behind these ifs ? > > > >To catch memory corruption or things running out of control in the kernel. > I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not > the "if (something) BUG()" ifs. Basically, if you try to wait for orders > 0, you may have to wait for a LOONG time. It actually works reasonably well on machines with big memories, because a buddy allocator _will_ try to coalesce memory allocations as much as possible. But it has nasty cases where you can be really unlucky. Feel free to run simulations to see, but basically if you have reasonably random allocation and free patterns and you want to get an order-X contiguous allocation, you may have to free up a noticeable portion of your memory before it succeeds. Sure, you could do "directed freeing", where you actually try to look at which pages would be worth freeing to find a large free area, but the complexity is not insignificant, and quite frankly the proper approach has always been "don't do that then". Don't rely on big contiguous chunks of memory. Having an mm that can guarantee contiguous chunks of physical memory would be cool, but I suspect strongly that it would have some serious downsides. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
>> before I hit yet another BUG in swap_state.c:60. > >The bug in swap_state:60 shows a kernel bug in the VM or random memory >corruption. Make sure you can reproduce on x86 to be sure it's not a s390 >that is randomly corrupting memory. If you read the oops after the BUG message >with asm at hand you will see in the registers the value of page->mapping and >you can guess if it's random memory corruption or bug in VM this way (for >example if `reg & 3 != 0' it's memory corruption for sure, you should also >if it's pointing to a suitable kernel-heap address). I did a little closer investigation. The BUG was triggered by a page with page->mapping pointing to an address space of a mapped ext2 file (page->mapping->a_ops == _aops). The page had PG_locked, PG_uptodate, PG_active and PG_swap_cache set. The stack backstrace showed that kswapd called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm, swap_out_vma, try_to_swap_out and add_to_swap_cache where BUG hit. The registers look good, the struct page looks good. I don't think that this was a random memory corruption. >> Whats the reasoning behind these ifs ? > >To catch memory corruption or things running out of control in the kernel. I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not the "if (something) BUG()" ifs. blue skies, Martin Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, Nov 17, 2000 at 11:41:58AM +0100, [EMAIL PROTECTED] wrote: > [..] But low on memory > does mean low on real memory + swap space, doesn't it ? [..] No. Low on memory here means that `grep MemFree enough swap space but it isn't using any of it when the BUG hits. I think This is normal. > the "if (!order)" statements before the "goto try_again" in __alloc_pages > have something to do with it. To test this assumption I removed the ifs and The right way to make allocation of order > 0 to work (when not impossible) is to pass the "order" information from allocator to memory balancing code so you don't waste resources by freeing and swapping pages that aren't physically consecutive. We'll need to teach the memory balancing about freeing only physically consecutive worthwhile pages. Actually memory balancing in 2.4.x doesn't get any information, not even the information about which _classzone_ where to free the memory (NOTE: both 2.2.x and 2.0.x _always_ got the classzone where to free memory at least). This classzone missing information causes resources wastage indeed and I just fixed it several times, BTW. > I didn't see any "__alloc_pages: %lu-order allocation failed." message So you probably didn't triggered the out-of-order-2-multipages problem, and the bug you triggered is going to be another one. But still the above order > 1 thoguths applies to both 2.2.x and 2.4.x since once you'll fix the other problem, you'll sure run into the failed order 2 allocations. > before I hit yet another BUG in swap_state.c:60. The bug in swap_state:60 shows a kernel bug in the VM or random memory corruption. Make sure you can reproduce on x86 to be sure it's not a s390 that is randomly corrupting memory. If you read the oops after the BUG message with asm at hand you will see in the registers the value of page->mapping and you can guess if it's random memory corruption or bug in VM this way (for example if `reg & 3 != 0' it's memory corruption for sure, you should also if it's pointing to a suitable kernel-heap address). > Whats the reasoning behind these ifs ? To catch memory corruption or things running out of control in the kernel. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
>> >> If they absolutely needs 4 pages for pmd pagetables due hardware constraints >> I'd recommend to use _four_ hardware pages for each softpage, not two. > >Yes. > >However, it definitely is an issue of making trade-offs. Most 64-bit MMU >models tend to have some flexibility in how you set up the page tables, >and it may be possible to just move bits around too (ie making both the >pmd and the pgd twice as large, and getting the expansion of 4 by doing >two expand-by-two's, for example, if the hardware has support for doing >things like that). Unluckly we don't have any flexibility. The segment index (pmd) has 11 bits, pointers are 8 byte. That makes 16K segment table. I have understood that this is a problem if the system is really low on memory. But low on memory does mean low on real memory + swap space, doesn't it ? The system has enough swap space but it isn't using any of it when the BUG hits. I think the "if (!order)" statements before the "goto try_again" in __alloc_pages have something to do with it. To test this assumption I removed the ifs and I didn't see any "__alloc_pages: %lu-order allocation failed." message before I hit yet another BUG in swap_state.c:60. Whats the reasoning behind these ifs ? blue skies, Martin Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
If they absolutely needs 4 pages for pmd pagetables due hardware constraints I'd recommend to use _four_ hardware pages for each softpage, not two. Yes. However, it definitely is an issue of making trade-offs. Most 64-bit MMU models tend to have some flexibility in how you set up the page tables, and it may be possible to just move bits around too (ie making both the pmd and the pgd twice as large, and getting the expansion of 4 by doing two expand-by-two's, for example, if the hardware has support for doing things like that). Unluckly we don't have any flexibility. The segment index (pmd) has 11 bits, pointers are 8 byte. That makes 16K segment table. I have understood that this is a problem if the system is really low on memory. But low on memory does mean low on real memory + swap space, doesn't it ? The system has enough swap space but it isn't using any of it when the BUG hits. I think the "if (!order)" statements before the "goto try_again" in __alloc_pages have something to do with it. To test this assumption I removed the ifs and I didn't see any "__alloc_pages: %lu-order allocation failed." message before I hit yet another BUG in swap_state.c:60. Whats the reasoning behind these ifs ? blue skies, Martin Linux/390 Design Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, Nov 17, 2000 at 11:41:58AM +0100, [EMAIL PROTECTED] wrote: [..] But low on memory does mean low on real memory + swap space, doesn't it ? [..] No. Low on memory here means that `grep MemFree /proc/meminfo' says you still have only a few mbytes. enough swap space but it isn't using any of it when the BUG hits. I think This is normal. the "if (!order)" statements before the "goto try_again" in __alloc_pages have something to do with it. To test this assumption I removed the ifs and The right way to make allocation of order 0 to work (when not impossible) is to pass the "order" information from allocator to memory balancing code so you don't waste resources by freeing and swapping pages that aren't physically consecutive. We'll need to teach the memory balancing about freeing only physically consecutive worthwhile pages. Actually memory balancing in 2.4.x doesn't get any information, not even the information about which _classzone_ where to free the memory (NOTE: both 2.2.x and 2.0.x _always_ got the classzone where to free memory at least). This classzone missing information causes resources wastage indeed and I just fixed it several times, BTW. I didn't see any "__alloc_pages: %lu-order allocation failed." message So you probably didn't triggered the out-of-order-2-multipages problem, and the bug you triggered is going to be another one. But still the above order 1 thoguths applies to both 2.2.x and 2.4.x since once you'll fix the other problem, you'll sure run into the failed order 2 allocations. before I hit yet another BUG in swap_state.c:60. The bug in swap_state:60 shows a kernel bug in the VM or random memory corruption. Make sure you can reproduce on x86 to be sure it's not a s390 that is randomly corrupting memory. If you read the oops after the BUG message with asm at hand you will see in the registers the value of page-mapping and you can guess if it's random memory corruption or bug in VM this way (for example if `reg 3 != 0' it's memory corruption for sure, you should also if it's pointing to a suitable kernel-heap address). Whats the reasoning behind these ifs ? To catch memory corruption or things running out of control in the kernel. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
before I hit yet another BUG in swap_state.c:60. The bug in swap_state:60 shows a kernel bug in the VM or random memory corruption. Make sure you can reproduce on x86 to be sure it's not a s390 that is randomly corrupting memory. If you read the oops after the BUG message with asm at hand you will see in the registers the value of page-mapping and you can guess if it's random memory corruption or bug in VM this way (for example if `reg 3 != 0' it's memory corruption for sure, you should also if it's pointing to a suitable kernel-heap address). I did a little closer investigation. The BUG was triggered by a page with page-mapping pointing to an address space of a mapped ext2 file (page-mapping-a_ops == ext2_aops). The page had PG_locked, PG_uptodate, PG_active and PG_swap_cache set. The stack backstrace showed that kswapd called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm, swap_out_vma, try_to_swap_out and add_to_swap_cache where BUG hit. The registers look good, the struct page looks good. I don't think that this was a random memory corruption. Whats the reasoning behind these ifs ? To catch memory corruption or things running out of control in the kernel. I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not the "if (something) BUG()" ifs. blue skies, Martin Linux/390 Design Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, 17 Nov 2000 [EMAIL PROTECTED] wrote: Whats the reasoning behind these ifs ? To catch memory corruption or things running out of control in the kernel. I was refering to the "if (!order) goto try_again" ifs in alloc_pages, not the "if (something) BUG()" ifs. Basically, if you try to wait for orders 0, you may have to wait for a LOONG time. It actually works reasonably well on machines with big memories, because a buddy allocator _will_ try to coalesce memory allocations as much as possible. But it has nasty cases where you can be really unlucky. Feel free to run simulations to see, but basically if you have reasonably random allocation and free patterns and you want to get an order-X contiguous allocation, you may have to free up a noticeable portion of your memory before it succeeds. Sure, you could do "directed freeing", where you actually try to look at which pages would be worth freeing to find a large free area, but the complexity is not insignificant, and quite frankly the proper approach has always been "don't do that then". Don't rely on big contiguous chunks of memory. Having an mm that can guarantee contiguous chunks of physical memory would be cool, but I suspect strongly that it would have some serious downsides. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, 17 Nov 2000, Andrea Arcangeli wrote: Actually memory balancing in 2.4.x doesn't get any information, not even the information about which _classzone_ where to free the memory (NOTE: both 2.2.x and 2.0.x _always_ got the classzone where to free memory at least). This classzone missing information causes resources wastage indeed and I just fixed it several times, BTW. Interesting, I can't remember you sending me any patches... Also, the 2.4 VM (unlike the other VMs) doesn't actually FREE memory wrongly (with the exception of buffer cache pages from page_launder()) but just moves it to the inactive_clean list, from where it will be re-used by one of those 99% user level allocations that happen on a typical Linux system. But, as I said in Ottawa, I wouldn't mind any classzone stuff in the new VM, as long as it won't complicate the integration of _other_ memory organisations (like NUMA). regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Fri, 17 Nov 2000, Andrea Arcangeli wrote: Plus I add that the "if (!order) goto try_again" is an obvious deadlock prone bug introduce in test9 that should be removed. 1) how would this cause deadlocks? 2) how would this somehow be worse than the unconditional 'goto try_again' we had before? This goto is ok because we have the OOM killer, which will select a process to kill when we run out of memory. Also, the goto will make sure that OTHER processes will survive while the "guilty" process will be killed. The guilty process will never get to the goto because it will have PF_MEMALLOC set. regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Thu, 16 Nov 2000, Andrea Arcangeli wrote: > > If they absolutely needs 4 pages for pmd pagetables due hardware constraints > I'd recommend to use _four_ hardware pages for each softpage, not two. Yes. However, it definitely is an issue of making trade-offs. Most 64-bit MMU models tend to have some flexibility in how you set up the page tables, and it may be possible to just move bits around too (ie making both the pmd and the pgd twice as large, and getting the expansion of 4 by doing two expand-by-two's, for example, if the hardware has support for doing things like that). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Thu, Nov 16, 2000 at 09:01:07AM -0800, Linus Torvalds wrote: > "Linux pages" be _two_ hardware pages, and make a Linux pte contain two If they absolutely needs 4 pages for pmd pagetables due hardware constraints I'd recommend to use _four_ hardware pages for each softpage, not two. The issue is that failing allocation at task creation (due 8k [or more] kernel stack) is trivial to handle, just have the syscall returning -ENOMEM and userspace will handle the allocation faliure gracefully. Also the parent of the servers will never fail that allocation after it's up and running (and it can try to fork childs later on). Failing allocation of a pagetable in some case can be solved only looping (deadlock prone) or killing the task hard without giving a chance to userspace to trap the fault (even SIGKILL signal handler may need that pmd pagetable to run). So being guaranteed to be able to allocate pagetables unless the machine is truly out of memory is quite necessary "feature" IMHO. We faced similar issues while thinking at possible ways for x86-64 pagetables, and we preferred not having to depend on the softpagesize framework in 2.4.x because it's very intrusive. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Thu, 16 Nov 2000 [EMAIL PROTECTED] wrote: > > Ok, the BUG() hit in get_pmd_slow: > > pmd_t * > get_pmd_slow(pgd_t *pgd, unsigned long offset) > { > pmd_t *pmd; > int i; > > pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2); You really need 4 pages? There's no way to reliably get 4 consecutive pages when you're even close to being low on memory. I would suggest just failing with a NULL return here. What is the architecture setup for this machine? I have no clue about S/390 memory management. Maybe you can modify the pmd layout? One potential fix for this is to just make the page size bigger. Make "Linux pages" be _two_ hardware pages, and make a Linux pte contain two "hardware pte's". That way the pmd would be an order-1 allocation instead of an order-2 one. Which is statistically _much_ more likely to be around (exponential distribution). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
>What happens if you just replace all places that would use a bad page >table with a BUG()? (Ie do _not_ add the bug to the place where you >added the test: by that time it's too late. I'm talking about the >places where the bad page tables are used, like in the error cases of >"get_pte_kernel_slow()" etc. Ok, the BUG() hit in get_pmd_slow: pmd_t * get_pmd_slow(pgd_t *pgd, unsigned long offset) { pmd_t *pmd; int i; pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2); if (pgd_none(*pgd)) { if (pmd) { for (i = 0; i < PTRS_PER_PMD; i++) pmd_clear(pmd+i); pgd_set(pgd, pmd); return pmd + offset; } BUG(); /* <--- this one hit */ pmd = (pmd_t *) get_bad_pmd_table(); pgd_set(pgd, pmd); return NULL; } free_pages((unsigned long)pmd,2); if (pgd_bad(*pgd)) BUG(); return (pmd_t *) pgd_page(*pgd) + offset; } The allocation of 4 consecutive pages for the page middle directory failed. This caused empty_bad_pmd_table to be used and clear_page_tables inserted it to the pmd quicklist. The important question is: why did __get_free_pages fail? blue skies, Martin Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
What happens if you just replace all places that would use a bad page table with a BUG()? (Ie do _not_ add the bug to the place where you added the test: by that time it's too late. I'm talking about the places where the bad page tables are used, like in the error cases of "get_pte_kernel_slow()" etc. Ok, the BUG() hit in get_pmd_slow: pmd_t * get_pmd_slow(pgd_t *pgd, unsigned long offset) { pmd_t *pmd; int i; pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2); if (pgd_none(*pgd)) { if (pmd) { for (i = 0; i PTRS_PER_PMD; i++) pmd_clear(pmd+i); pgd_set(pgd, pmd); return pmd + offset; } BUG(); /* --- this one hit */ pmd = (pmd_t *) get_bad_pmd_table(); pgd_set(pgd, pmd); return NULL; } free_pages((unsigned long)pmd,2); if (pgd_bad(*pgd)) BUG(); return (pmd_t *) pgd_page(*pgd) + offset; } The allocation of 4 consecutive pages for the page middle directory failed. This caused empty_bad_pmd_table to be used and clear_page_tables inserted it to the pmd quicklist. The important question is: why did __get_free_pages fail? blue skies, Martin Linux/390 Design Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Thu, 16 Nov 2000 [EMAIL PROTECTED] wrote: Ok, the BUG() hit in get_pmd_slow: pmd_t * get_pmd_slow(pgd_t *pgd, unsigned long offset) { pmd_t *pmd; int i; pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2); You really need 4 pages? There's no way to reliably get 4 consecutive pages when you're even close to being low on memory. I would suggest just failing with a NULL return here. What is the architecture setup for this machine? I have no clue about S/390 memory management. Maybe you can modify the pmd layout? One potential fix for this is to just make the page size bigger. Make "Linux pages" be _two_ hardware pages, and make a Linux pte contain two "hardware pte's". That way the pmd would be an order-1 allocation instead of an order-2 one. Which is statistically _much_ more likely to be around (exponential distribution). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Thu, Nov 16, 2000 at 09:01:07AM -0800, Linus Torvalds wrote: "Linux pages" be _two_ hardware pages, and make a Linux pte contain two If they absolutely needs 4 pages for pmd pagetables due hardware constraints I'd recommend to use _four_ hardware pages for each softpage, not two. The issue is that failing allocation at task creation (due 8k [or more] kernel stack) is trivial to handle, just have the syscall returning -ENOMEM and userspace will handle the allocation faliure gracefully. Also the parent of the servers will never fail that allocation after it's up and running (and it can try to fork childs later on). Failing allocation of a pagetable in some case can be solved only looping (deadlock prone) or killing the task hard without giving a chance to userspace to trap the fault (even SIGKILL signal handler may need that pmd pagetable to run). So being guaranteed to be able to allocate pagetables unless the machine is truly out of memory is quite necessary "feature" IMHO. We faced similar issues while thinking at possible ways for x86-64 pagetables, and we preferred not having to depend on the softpagesize framework in 2.4.x because it's very intrusive. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Thu, 16 Nov 2000, Andrea Arcangeli wrote: If they absolutely needs 4 pages for pmd pagetables due hardware constraints I'd recommend to use _four_ hardware pages for each softpage, not two. Yes. However, it definitely is an issue of making trade-offs. Most 64-bit MMU models tend to have some flexibility in how you set up the page tables, and it may be possible to just move bits around too (ie making both the pmd and the pgd twice as large, and getting the expansion of 4 by doing two expand-by-two's, for example, if the hardware has support for doing things like that). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
In article <[EMAIL PROTECTED]>, >After some trickery with some special hardware feature (storage >keys) I found out that empty_bad_pmd_table and empty_bad_pte_table have >been put to the page table quicklists multiple(!) times. This is definitely bad, and means that something else really bad is going on. In fact, I have this fairly strong suspicion that we should just get rid of the "bad" page tables altogether, and make the stuff that now uses them BUG() instead. The whole concept of "bad" page tables comes from very early on in Linux, when the way the page fault handler worked was that if it ran out of memory or something else really bad happened, it would insert a dummy page table entry that was guaranteed to let the CPU continue. That way the page fault handler was always "successful" from a hardware standpoint, even if it ended up trying to kill the process. This used to be required simply because a page fault in kernel space originally needed to let the process unwind sanely and cleanly. These days, the requirement that page faults always "succeed" is long long gone. The exception handling mechanism handles the cases where we validly can take a page fault, and in other cases we will just kill the process outright. As such, the bad page tables should no longer be needed, and are apparently just hiding some nasty bugs. What happens if you just replace all places that would use a bad page table with a BUG()? (Ie do _not_ add the bug to the place where you added the test: by that time it's too late. I'm talking about the places where the bad page tables are used, like in the error cases of "get_pte_kernel_slow()" etc. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
>> +extern pte_t empty_bad_pte_table[]; >> extern __inline__ void free_pte_fast(pte_t *pte) >> { >> + if (pte == empty_bad_pte_table) >> + return; > >I guess that should be BUG() instead of return, so that the callers can be >fixed. Not really. pte_free and pmd_free are called from the common mm code but the concept of empty_bad_{pte,pmd}_table is architecture dependent. The trouble starts in arch/???/mm/init.c where these special arrays are inserted into the paging tables. So the solution to the problem should be in architecture dependent files too. blue skies, Martin Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Wed, Nov 15, 2000 at 01:39:13PM +0100, [EMAIL PROTECTED] wrote: > +extern pte_t empty_bad_pte_table[]; > extern __inline__ void free_pte_fast(pte_t *pte) > { > + if (pte == empty_bad_pte_table) > + return; I guess that should be BUG() instead of return, so that the callers can be fixed. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
On Wed, Nov 15, 2000 at 01:39:13PM +0100, [EMAIL PROTECTED] wrote: +extern pte_t empty_bad_pte_table[]; extern __inline__ void free_pte_fast(pte_t *pte) { + if (pte == empty_bad_pte_table) + return; I guess that should be BUG() instead of return, so that the callers can be fixed. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
+extern pte_t empty_bad_pte_table[]; extern __inline__ void free_pte_fast(pte_t *pte) { + if (pte == empty_bad_pte_table) + return; I guess that should be BUG() instead of return, so that the callers can be fixed. Not really. pte_free and pmd_free are called from the common mm code but the concept of empty_bad_{pte,pmd}_table is architecture dependent. The trouble starts in arch/???/mm/init.c where these special arrays are inserted into the paging tables. So the solution to the problem should be in architecture dependent files too. blue skies, Martin Linux/390 Design Development, IBM Deutschland Entwicklung GmbH Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Memory management bug
In article [EMAIL PROTECTED], After some trickery with some special hardware feature (storage keys) I found out that empty_bad_pmd_table and empty_bad_pte_table have been put to the page table quicklists multiple(!) times. This is definitely bad, and means that something else really bad is going on. In fact, I have this fairly strong suspicion that we should just get rid of the "bad" page tables altogether, and make the stuff that now uses them BUG() instead. The whole concept of "bad" page tables comes from very early on in Linux, when the way the page fault handler worked was that if it ran out of memory or something else really bad happened, it would insert a dummy page table entry that was guaranteed to let the CPU continue. That way the page fault handler was always "successful" from a hardware standpoint, even if it ended up trying to kill the process. This used to be required simply because a page fault in kernel space originally needed to let the process unwind sanely and cleanly. These days, the requirement that page faults always "succeed" is long long gone. The exception handling mechanism handles the cases where we validly can take a page fault, and in other cases we will just kill the process outright. As such, the bad page tables should no longer be needed, and are apparently just hiding some nasty bugs. What happens if you just replace all places that would use a bad page table with a BUG()? (Ie do _not_ add the bug to the place where you added the test: by that time it's too late. I'm talking about the places where the bad page tables are used, like in the error cases of "get_pte_kernel_slow()" etc. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/