Re: Memory management bug

2000-11-21 Thread schwidefsky



>Agreed, that's almost sure _not_ random memory corruption of the page
>structure. It looks like a VM bug (if you can reproduce trivially I'd give a
>try to test8 too since test8 is rock solid for me while test10 lockups in VM
>core at the second bonnie if using emulated highmem).

I was lucky. Somehow I managed to f**k up my disk in a way that the filesystem
check triggers the bug in a reproducible way and always with the same page!
I set up a "trace store into" on the page structure and logged who is changing
the "struct page". Here is the log starting after page->mapping was set:

address  changed    function
5c13a    mapping    add_to_page_cache_unique
         count=2, flags=PG_locked, age=2
5b14a    next_hash  __add_page_to_hash_queue
5b178    buffers    __add_page_to_hash_queue
68440    flags      lru_cache_add
         flags=PG_active|PG_locked
6846a    lru        lru_cache_add
68470    lru        lru_cache_add
78fc6    virtual    create_empty_buffers
78fda    count      create_empty_buffers
         count=3
6d9ce    count      __free_pages
         count=2
5c122    list       __add_page_to_hash_queue
68464    lru        lru_cache_add
77b16    flags      end_buffer_io_async
         flags=PG_active|PG_uptodate|PG_locked
77b52    flags      end_buffer_io_async
         flags=PG_active|PG_uptodate|PG_locked
77bc4    flags      end_buffer_io_async
         flags=PG_active|PG_uptodate
67792    age        age_page_up
         age=5
5c88c    count      __find_get_page
         count=3
559be    count      copy_page_range
         count=4
559be    count      copy_page_range
         count=5
6d9ce    count      __free_pages
         count=4
6b55e    lru        refill_inactive_scan
6b4ac    flags      refill_inactive_scan
         flags=PG_active|PG_uptodate
6770c    age        age_page_down_ageonly
         age=2
6b570    lru        refill_inactive_scan
6b576    lru        refill_inactive_scan
6b56a    lru        refill_inactive_scan
6b55e    lru        refill_inactive_scan
6b4ac    flags      refill_inactive_scan
         flags=PG_active|PG_uptodate
6770c    age        age_page_down_ageonly
         age=1
6b570    lru        refill_inactive_scan
6b576    lru        refill_inactive_scan
6b56a    lru        refill_inactive_scan
6b55e    lru        refill_inactive_scan
6b4ac    flags      refill_inactive_scan
         flags=PG_active|PG_uptodate
6770c    age        age_page_down_ageonly
         age=0
6b570    lru        refill_inactive_scan
6b576    lru        refill_inactive_scan
6b56a    lru        refill_inactive_scan

program check at 6e1e0 because of BUG() in line 60 of swap_state.c.
Stack backtrace from there:
6e1e0 add_to_swap_cache
6900a try_to_swap_out
69408 swap_out_vma
69578 swap_out_mm
69838 swap_out
6b90a refill_inactive
6bab4 do_try_to_free_pages
6bbba kswapd

age_page_down_ageonly was always called from refill_inactive_scan. So
refill_inactive_scan lowers the age of the page but does not deactivate the
page when it reaches age==0 (page->count too big). try_to_swap_out doesn't
check page->mapping and tries to swap out the page because the age is
0. Bang!
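
The sequence in the trace can be replayed in a toy model. The sketch below is not kernel code: `toy_page`, `toy_refill_scan`, `toy_try_to_swap_out` and the count threshold are made-up names and values, chosen only to mirror the log (age going 5, 2, 1, 0 while count stays at 4). It assumes, as the discussion suggests, that the BUG() fires when a page that still has a mapping reaches the swap cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the struct page fields that show up in the trace. */
struct toy_page {
    int count;            /* reference count */
    int age;              /* page age, halved on every scan */
    bool active;          /* stands in for PG_active */
    const void *mapping;  /* non-NULL: page belongs to a file mapping */
};

/* Assumed limit: a page with more references than this stays active. */
enum { TOY_MAX_COUNT_FOR_DEACTIVATE = 1 };

/* One scan pass: age the page down, deactivate only if count allows it. */
static void toy_refill_scan(struct toy_page *p)
{
    assert(p->age >= 0);
    p->age /= 2;
    if (p->age == 0 && p->count <= TOY_MAX_COUNT_FOR_DEACTIVATE)
        p->active = false;
}

/* Swap-out decision as described in the mail: only the age is tested. */
static bool toy_try_to_swap_out(const struct toy_page *p)
{
    return p->age == 0;
}

/* The problem case: a page with a mapping is picked for swap-out anyway. */
static bool toy_would_hit_bug(const struct toy_page *p)
{
    return toy_try_to_swap_out(p) && p->mapping != NULL;
}
```

Starting from count=4, age=5 and a non-NULL mapping, three calls to `toy_refill_scan` leave the page active with age 0, and `toy_would_hit_bug` then returns true. Adding a mapping test to the swap-out decision is one way the toy model stops reporting the problem; whether that is the right fix for the real VM is a separate question.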

blue skies,
   Martin

P.S. by the way this test was done on linux-2.4.0-test11

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [EMAIL PROTECTED]


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Memory management bug

2000-11-17 Thread Rik van Riel

On Fri, 17 Nov 2000, Andrea Arcangeli wrote:

> Plus I add that the "if (!order) goto try_again" is an obvious
> deadlock prone bug introduced in test9 that should be removed.

1) how would this cause deadlocks?
2) how would this somehow be worse than the
   unconditional 'goto try_again' we had before?

This goto is ok because we have the OOM killer, which will select
a process to kill when we run out of memory. Also, the goto will
make sure that OTHER processes will survive while the "guilty"
process will be killed.

The guilty process will never get to the goto because it will
have PF_MEMALLOC set.
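
The retry rule being argued about can be written down as a small predicate. This is only a sketch of the behaviour as described in this thread, not the __alloc_pages source: `toy_should_retry` and `TOY_PF_MEMALLOC` are hypothetical names, and the flag value is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value; the real PF_MEMALLOC is defined by the kernel. */
#define TOY_PF_MEMALLOC 0x1u

/*
 * Retry decision as described in the thread: only order-0 requests go
 * back to try_again, and a task that already has the MEMALLOC flag set
 * never reaches the goto.
 */
static bool toy_should_retry(unsigned order, unsigned task_flags)
{
    if (task_flags & TOY_PF_MEMALLOC)
        return false;
    return order == 0;
}
```

Under this reading an order-2 request (such as the pmd allocation on S/390) fails instead of looping, while ordinary single-page requests keep retrying.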

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: Memory management bug

2000-11-17 Thread Rik van Riel

On Fri, 17 Nov 2000, Andrea Arcangeli wrote:

> Actually memory balancing in 2.4.x doesn't get any information,
> not even the information about which _classzone_ where to free
> the memory (NOTE: both 2.2.x and 2.0.x _always_ got the
> classzone where to free memory at least). This classzone missing
> information causes resources wastage indeed and I just fixed it
> several times, BTW.

Interesting, I can't remember you sending me any
patches...

Also, the 2.4 VM (unlike the other VMs) doesn't actually
FREE memory wrongly (with the exception of buffer cache
pages from page_launder()) but just moves it to the
inactive_clean list, from where it will be re-used by one
of those 99% user level allocations that happen on a typical
Linux system.

But, as I said in Ottawa, I wouldn't mind any classzone
stuff in the new VM, as long as it won't complicate the
integration of _other_ memory organisations (like NUMA).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/




Re: Memory management bug

2000-11-17 Thread Andrea Arcangeli

On Fri, Nov 17, 2000 at 05:35:53PM +0100, [EMAIL PROTECTED] wrote:
> I did a little closer investigation. The BUG was triggered by a page with
> page->mapping pointing to an address space of a mapped ext2 file
> (page->mapping->a_ops == &ext2_aops). The page had PG_locked, PG_uptodate,
> PG_active and PG_swap_cache set. The stack backtrace showed that kswapd
> called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm,
> swap_out_vma, try_to_swap_out and add_to_swap_cache where BUG hit.  The
> registers look good, the struct page looks good. I don't think that this was
> a random memory corruption.

Agreed, that's almost surely _not_ random memory corruption of the page
structure. It looks like a VM bug (if you can reproduce trivially I'd give
test8 a try too, since test8 is rock solid for me while test10 locks up in the
VM core at the second bonnie when using emulated highmem).

> I was referring to the "if (!order) goto try_again" ifs in alloc_pages, not
> the "if (something) BUG()" ifs.

Ah ok :), see Linus's answer: in your case the "don't do that" means to
implement the:

#define SOFT_PAGE_SIZE (PAGE_SIZE<<2)

thing we were talking about yesterday of course.

Plus I'd add that the "if (!order) goto try_again" is an obvious deadlock-prone
bug introduced in test9 that should be removed.

Andrea



Re: Memory management bug

2000-11-17 Thread Linus Torvalds



On Fri, 17 Nov 2000 [EMAIL PROTECTED] wrote:
> 
> >> Whats the reasoning behind these ifs ?
> >
> >To catch memory corruption or things running out of control in the kernel.
> I was referring to the "if (!order) goto try_again" ifs in alloc_pages, not
> the "if (something) BUG()" ifs.

Basically, if you try to wait for orders > 0, you may have to wait for a
LOONG time.

It actually works reasonably well on machines with big memories, because a
buddy allocator _will_ try to coalesce memory allocations as much as
possible. But it has nasty cases where you can be really unlucky. Feel
free to run simulations to see, but basically if you have reasonably
random allocation and free patterns and you want to get an order-X
contiguous allocation, you may have to free up a noticeable portion of
your memory before it succeeds.
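
A very small simulation along these lines is sketched below. It is a deliberate simplification and not a buddy allocator: memory starts completely in use, single pages are freed in pseudo-random order, and the function counts how many pages had to be freed before one naturally aligned block of 2^order pages became entirely free. `SIM_PAGES`, `sim_rand` and the other names are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SIM_PAGES = 4096 };   /* assumed memory size, in pages */

/* Small linear congruential generator so that runs are reproducible. */
static uint32_t sim_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* True if some naturally aligned block of 2^order pages is entirely free. */
static bool sim_has_free_block(const bool *is_free, unsigned order)
{
    size_t block = (size_t)1 << order;

    for (size_t base = 0; base + block <= SIM_PAGES; base += block) {
        size_t i = 0;
        while (i < block && is_free[base + i])
            i++;
        if (i == block)
            return true;
    }
    return false;
}

/*
 * Start with every page in use, free pages in random order, and return
 * how many pages were freed before an order-`order` block was available.
 */
static size_t sim_pages_freed_until(unsigned order, uint32_t seed)
{
    static bool is_free[SIM_PAGES];
    size_t freed = 0;

    assert(((size_t)1 << order) <= SIM_PAGES);
    for (size_t i = 0; i < SIM_PAGES; i++)
        is_free[i] = false;

    while (!sim_has_free_block(is_free, order)) {
        size_t victim = sim_rand(&seed) % SIM_PAGES;
        if (!is_free[victim]) {
            is_free[victim] = true;
            freed++;
        }
    }
    return freed;
}
```

Calling `sim_pages_freed_until(order, seed)` for order 0, 1, 2 and so on with the same seed shows how the number of pages that must be freed grows with the order. The exact figures depend on the seed and on the assumed memory size, so they say nothing precise about a real machine.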

Sure, you could do "directed freeing", where you actually try to look at
which pages would be worth freeing to find a large free area, but the
complexity is not insignificant, and quite frankly the proper approach has
always been "don't do that then". Don't rely on big contiguous chunks of
memory. Having an mm that can guarantee contiguous chunks of physical
memory would be cool, but I suspect strongly that it would have some
serious downsides.

Linus




Re: Memory management bug

2000-11-17 Thread schwidefsky



>> before I hit yet another BUG in swap_state.c:60.
>
>The bug in swap_state:60 shows a kernel bug in the VM or random memory
>corruption. Make sure you can reproduce on x86 to be sure it's not a s390
>that is randomly corrupting memory. If you read the oops after the BUG message
>with asm at hand you will see in the registers the value of page->mapping and
>you can guess if it's random memory corruption or bug in VM this way (for
>example if `reg & 3 != 0' it's memory corruption for sure, you should also
>check if it's pointing to a suitable kernel-heap address).

I did a little closer investigation. The BUG was triggered by a page with
page->mapping pointing to an address space of a mapped ext2 file
(page->mapping->a_ops == &ext2_aops). The page had PG_locked, PG_uptodate,
PG_active and PG_swap_cache set. The stack backtrace showed that kswapd
called do_try_to_free_pages, refill_inactive, swap_out, swap_out_mm,
swap_out_vma, try_to_swap_out and add_to_swap_cache, where the BUG hit.
The registers look good, the struct page looks good. I don't think that this
was a random memory corruption.

>> Whats the reasoning behind these ifs ?
>
>To catch memory corruption or things running out of control in the kernel.

I was referring to the "if (!order) goto try_again" ifs in alloc_pages, not
the "if (something) BUG()" ifs.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [EMAIL PROTECTED]





Re: Memory management bug

2000-11-17 Thread Andrea Arcangeli

On Fri, Nov 17, 2000 at 11:41:58AM +0100, [EMAIL PROTECTED] wrote:
> [..] But low on memory
> does mean low on real memory + swap space, doesn't it ? [..]

No. Low on memory here means that `grep MemFree /proc/meminfo' says
you still have only a few mbytes.

> enough swap space but it isn't using any of it when the BUG hits. I think

This is normal.

> the "if (!order)" statements before the "goto try_again" in __alloc_pages
> have something to do with it. To test this assumption I removed the ifs and

The right way to make allocations of order > 0 work (when not impossible) is
to pass the "order" information from the allocator to the memory balancing code
so you don't waste resources by freeing and swapping out pages that aren't
physically consecutive.  We'll need to teach the memory balancing to free only
worthwhile, physically consecutive pages.

Actually memory balancing in 2.4.x doesn't get any information, not even
which _classzone_ to free memory from (NOTE: both 2.2.x and 2.0.x _always_
got at least the classzone to free memory from). This missing classzone
information does waste resources, and I have already fixed it several
times, BTW.

> I didn't see any "__alloc_pages: %lu-order allocation failed." message

So you probably didn't trigger the out-of-order-2-multipages problem, and
the bug you triggered is going to be another one. But the above order > 1
thoughts still apply to both 2.2.x and 2.4.x, since once you fix the other
problem you'll surely run into the failed order 2 allocations.

> before I hit yet another BUG in swap_state.c:60.

The bug in swap_state:60 shows a kernel bug in the VM or random memory
corruption. Make sure you can reproduce on x86 to be sure it's not s390
that is randomly corrupting memory. If you read the oops after the BUG message
with the asm at hand you will see the value of page->mapping in the registers,
and you can guess this way whether it's random memory corruption or a VM bug
(for example if `(reg & 3) != 0' it's memory corruption for sure; you should
also check whether it's pointing to a suitable kernel-heap address).
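
The two sanity checks can be written out in C. Note that the comparison needs its own parentheses, because `!=` binds more tightly than `&`. This is a minimal sketch: `is_misaligned` and `plausible_mapping` are hypothetical helpers, and the heap bounds are supplied by the caller because the real kernel-heap range depends on the machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A word-aligned pointer has its two low bits clear. */
static bool is_misaligned(uintptr_t reg)
{
    return (reg & 3) != 0;   /* without the parentheses this tests reg & 1 */
}

/*
 * Rough plausibility test for a page->mapping value read from a register.
 * heap_start and heap_end are assumed bounds chosen by the caller.
 */
static bool plausible_mapping(uintptr_t reg, uintptr_t heap_start,
                              uintptr_t heap_end)
{
    assert(heap_start <= heap_end);
    if (reg == 0)
        return true;         /* NULL is a legal value for page->mapping */
    return !is_misaligned(reg) && reg >= heap_start && reg < heap_end;
}
```

A value that fails `plausible_mapping` points towards corruption; a value that passes proves nothing by itself, it only makes a VM bug the more likely explanation.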

> Whats the reasoning behind these ifs ?

To catch memory corruption or things running out of control in the kernel.

Andrea



Re: Memory management bug

2000-11-17 Thread schwidefsky



>>
>> If they absolutely needs 4 pages for pmd pagetables due hardware constraints
>> I'd recommend to use _four_ hardware pages for each softpage, not two.
>
>Yes.
>
>However, it definitely is an issue of making trade-offs. Most 64-bit MMU
>models tend to have some flexibility in how you set up the page tables,
>and it may be possible to just move bits around too (ie making both the
>pmd and the pgd twice as large, and getting the expansion of 4 by doing
>two expand-by-two's, for example, if the hardware has support for doing
>things like that).

Unluckily we don't have any flexibility. The segment index (pmd) has 11 bits,
pointers are 8 bytes. That makes a 16K segment table. I have understood that
this is a problem if the system is really low on memory. But low on memory
does mean low on real memory + swap space, doesn't it? The system has
enough swap space but it isn't using any of it when the BUG hits. I think
the "if (!order)" statements before the "goto try_again" in __alloc_pages
have something to do with it. To test this assumption I removed the ifs and

I didn't see any "__alloc_pages: %lu-order allocation failed." message
before I hit yet another BUG in swap_state.c:60.
What's the reasoning behind these ifs?
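
The segment table arithmetic above, spelled out. The index width and pointer size come from the mail; the 4K hardware page size is an assumption, consistent with "4 consecutive pages" being quoted elsewhere in this thread for the same table.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SEGMENT_INDEX_BITS = 11,    /* from the mail */
    POINTER_BYTES      = 8,     /* from the mail */
    HW_PAGE_BYTES      = 4096   /* assumption: 4K hardware pages */
};

/* 2^11 entries of 8 bytes each. */
static size_t segment_table_bytes(void)
{
    return ((size_t)1 << SEGMENT_INDEX_BITS) * POINTER_BYTES;
}

/* Number of contiguous hardware pages one segment table occupies. */
static size_t segment_table_pages(void)
{
    assert(segment_table_bytes() % HW_PAGE_BYTES == 0);
    return segment_table_bytes() / HW_PAGE_BYTES;
}
```

2048 entries times 8 bytes is 16384 bytes, i.e. four contiguous 4K pages, which is why the pmd needs an order-2 allocation.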

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [EMAIL PROTECTED]





Re: Memory management bug

2000-11-16 Thread Linus Torvalds



On Thu, 16 Nov 2000, Andrea Arcangeli wrote:
> 
> If they absolutely needs 4 pages for pmd pagetables due hardware constraints
> I'd recommend to use _four_ hardware pages for each softpage, not two.

Yes.

However, it definitely is an issue of making trade-offs. Most 64-bit MMU
models tend to have some flexibility in how you set up the page tables,
and it may be possible to just move bits around too (ie making both the
pmd and the pgd twice as large, and getting the expansion of 4 by doing
two expand-by-two's, for example, if the hardware has support for doing
things like that).

Linus




Re: Memory management bug

2000-11-16 Thread Andrea Arcangeli

On Thu, Nov 16, 2000 at 09:01:07AM -0800, Linus Torvalds wrote:
> "Linux pages" be _two_ hardware pages, and make a Linux pte contain two

If they absolutely need 4 pages for pmd pagetables due to hardware constraints,
I'd recommend using _four_ hardware pages for each softpage, not two.

The issue is that a failing allocation at task creation (due to the 8k [or more]
kernel stack) is trivial to handle: just have the syscall return -ENOMEM and
userspace will handle the allocation failure gracefully. Also the parent
of the servers will never fail that allocation after it's up and running
(and it can try to fork children later on).
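
From the userspace side, handling such a failure gracefully can look like the sketch below. `spawn_worker` is a hypothetical helper written for this example; it only shows the shape of the error path and is not taken from any real server.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork one worker and wait for it. Returns the child's exit status on
 * success, or -1 if fork() or waitpid() failed, with errno describing
 * the failure.
 */
static int spawn_worker(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        /* For example ENOMEM: report it and let the caller carry on. */
        int saved = errno;
        fprintf(stderr, "fork failed: %s\n", strerror(saved));
        errno = saved;
        return -1;
    }
    if (pid == 0)
        _exit(0);            /* child: a real worker would do its job here */

    assert(pid > 0);         /* only the parent gets this far */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A parent that gets -1 back keeps running and can try again later, which is the contrast with a pagetable allocation that fails in the middle of a fault.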

A failing allocation of a pagetable can in some cases be solved only by looping
(deadlock prone) or by killing the task hard without giving userspace a chance
to trap the fault (even a SIGKILL signal handler may need that pmd pagetable to
run). So being guaranteed to be able to allocate pagetables unless
the machine is truly out of memory is quite a necessary "feature" IMHO.

We faced similar issues while thinking about possible approaches for the x86-64
pagetables, and we preferred not to depend on the softpagesize framework in
2.4.x because it's very intrusive.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Memory management bug

2000-11-16 Thread Linus Torvalds



On Thu, 16 Nov 2000 [EMAIL PROTECTED] wrote:
> 
> Ok, the BUG() hit in get_pmd_slow:
> 
> pmd_t *
> get_pmd_slow(pgd_t *pgd, unsigned long offset)
> {
>         pmd_t *pmd;
>         int i;
> 
>         pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2);

You really need 4 pages?

There's no way to reliably get 4 consecutive pages when you're even close
to being low on memory. I would suggest just failing with a NULL return
here.

What is the architecture setup for this machine? I have no clue about
S/390 memory management. Maybe you can modify the pmd layout?

One potential fix for this is to just make the page size bigger. Make
"Linux pages" be _two_ hardware pages, and make a Linux pte contain two
"hardware pte's". That way the pmd would be an order-1 allocation instead
of an order-2 one. Which is statistically _much_ more likely to be around
(exponential distribution).
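
The effect of a larger "Linux page" on the allocation order can be checked with a few lines of C. This is a sketch under assumptions: `alloc_order` is a hypothetical helper, and the 16K table size and 4K hardware page size are the figures implied by the "4 pages" and order-2 allocation discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_bytes << order) covers `bytes`. */
static unsigned alloc_order(size_t bytes, size_t page_bytes)
{
    unsigned order = 0;

    assert(page_bytes > 0);
    while ((page_bytes << order) < bytes)
        order++;
    return order;
}
```

For a 16384-byte pmd, `alloc_order` gives 2 with 4096-byte pages, 1 when a Linux page is two hardware pages (8192 bytes), and 0 when it is four hardware pages (16384 bytes).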

Linus





Re: Memory management bug

2000-11-16 Thread schwidefsky



>What happens if you just replace all places that would use a bad page
>table with a BUG()? (Ie do _not_ add the bug to the place where you
>added the test: by that time it's too late.  I'm talking about the
>places where the bad page tables are used, like in the error cases of
>"get_pte_kernel_slow()" etc.

Ok, the BUG() hit in get_pmd_slow:

pmd_t *
get_pmd_slow(pgd_t *pgd, unsigned long offset)
{
        pmd_t *pmd;
        int i;

        pmd = (pmd_t *) __get_free_pages(GFP_KERNEL,2);
        if (pgd_none(*pgd)) {
                if (pmd) {
                        for (i = 0; i < PTRS_PER_PMD; i++)
                                pmd_clear(pmd+i);
                        pgd_set(pgd, pmd);
                        return pmd + offset;
                }
                BUG();  /* <--- this one hit */
                pmd = (pmd_t *) get_bad_pmd_table();
                pgd_set(pgd, pmd);
                return NULL;
        }
        free_pages((unsigned long)pmd,2);
        if (pgd_bad(*pgd))
                BUG();
        return (pmd_t *) pgd_page(*pgd) + offset;
}

The allocation of 4 consecutive pages for the page middle directory failed.
This caused empty_bad_pmd_table to be used, and clear_page_tables inserted
it into the pmd quicklist. The important question is: why did
__get_free_pages fail?

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [EMAIL PROTECTED]





Re: Memory management bug

2000-11-16 Thread Andrea Arcangeli

On Thu, Nov 16, 2000 at 09:01:07AM -0800, Linus Torvalds wrote:
> "Linux pages" be _two_ hardware pages, and make a Linux pte contain two

If they absolutely need 4 pages for pmd pagetables due to hardware constraints,
I'd recommend using _four_ hardware pages for each softpage, not two.
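
The arithmetic behind that recommendation can be sketched as follows. The only fact taken from the thread is that the pmd table spans four hardware pages; the hardware page size is a parameter, and the 4096 bytes used in the note below is an assumption. The helper name is hypothetical.

```c
#include <assert.h>

/*
 * Smallest allocation order such that (page_size << order) >= bytes,
 * i.e. how many doublings of the page size are needed to hold the table.
 */
static unsigned int order_for_size(unsigned long bytes, unsigned long page_size)
{
        unsigned int order = 0;

        assert(bytes > 0 && page_size > 0);
        while ((page_size << order) < bytes)
                order++;
        return order;
}
```

With an assumed 4096-byte hardware page, a four-page table is 16384 bytes: order 2 with one hardware page per softpage, order 1 with two, and order 0 with four. An order-0 request needs no physically contiguous run of pages at all.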

The issue is that a failed allocation at task creation (due to the 8k [or larger] kernel
stack) is trivial to handle: just have the syscall return -ENOMEM and
userspace will handle the allocation failure gracefully. Also the parent
of the servers will never fail that allocation after it's up and running
(and it can try to fork children later on).

A failed pagetable allocation can in some cases be handled only by looping
(deadlock prone) or by killing the task hard without giving userspace a chance
to trap the fault (even the SIGKILL signal handler may need that pmd pagetable to
run). So being guaranteed to be able to allocate pagetables unless
the machine is truly out of memory is quite a necessary "feature" IMHO.

We faced similar issues while thinking about possible designs for x86-64 pagetables,
and we preferred not to depend on the softpagesize framework in 2.4.x
because it's very intrusive.

Andrea



Re: Memory management bug

2000-11-16 Thread Linus Torvalds



On Thu, 16 Nov 2000, Andrea Arcangeli wrote:
> 
> If they absolutely needs 4 pages for pmd pagetables due hardware constraints
> I'd recommend to use _four_ hardware pages for each softpage, not two.

Yes.

However, it definitely is an issue of making trade-offs. Most 64-bit MMU
models tend to have some flexibility in how you set up the page tables,
and it may be possible to just move bits around too (ie making both the
pmd and the pgd twice as large, and getting the expansion of 4 by doing
two expand-by-two's, for example, if the hardware has support for doing
things like that).

Linus




Re: Memory management bug

2000-11-15 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
>After some trickery with some special hardware feature (storage
>keys) I found out that empty_bad_pmd_table and empty_bad_pte_table have
>been put to the page table quicklists multiple(!) times.

This is definitely bad, and means that something else really bad is
going on.
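
To see why a static table landing on a quicklist more than once is so damaging, here is a small user-space model. It assumes the quicklist is an intrusive singly linked list that keeps its link inside the freed table itself; the names (table_model, quicklist_model, ql_push, ql_pop) are made up for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* A freed table stores the list link in its own memory. */
struct table_model {
        struct table_model *next;
};

struct quicklist_model {
        struct table_model *head;
        size_t count;           /* number of pushes minus number of pops */
};

/* Put a table on the free list; nothing checks for duplicates. */
static void ql_push(struct quicklist_model *ql, struct table_model *t)
{
        assert(t != NULL);
        t->next = ql->head;
        ql->head = t;
        ql->count++;
}

/* Take a table off the free list, or NULL if the list is empty. */
static struct table_model *ql_pop(struct quicklist_model *ql)
{
        struct table_model *t = ql->head;

        if (t == NULL)
                return NULL;
        ql->head = t->next;
        ql->count--;
        return t;
}
```

Pushing the same table twice in a row makes it point at itself, so every later pop hands out that one table again while the head never moves on. Two unrelated users then share a single page table, and the list never drains. With another table pushed in between, the two entries form a cycle instead, which is no better.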

In fact, I have this fairly strong suspicion that we should just get rid
of the "bad" page tables altogether, and make the stuff that now uses
them BUG() instead. 

The whole concept of "bad" page tables comes from very early on in
Linux, when the way the page fault handler worked was that if it ran out
of memory or something else really bad happened, it would insert a dummy
page table entry that was guaranteed to let the CPU continue.  That way
the page fault handler was always "successful" from a hardware
standpoint, even if it ended up trying to kill the process. 

This used to be required simply because a page fault in kernel space
originally needed to let the process unwind sanely and cleanly.

These days, the requirement that page faults always "succeed" is long
long gone. The exception handling mechanism handles the cases where we
validly can take a page fault, and in other cases we will just kill the
process outright. As such, the bad page tables should no longer be
needed, and are apparently just hiding some nasty bugs.

What happens if you just replace all places that would use a bad page
table with a BUG()? (Ie do _not_ add the bug to the place where you
added the test: by that time it's too late.  I'm talking about the
places where the bad page tables are used, like in the error cases of
"get_pte_kernel_slow()" etc.)

Linus



Re: Memory management bug

2000-11-15 Thread schwidefsky



>> +extern pte_t empty_bad_pte_table[];
>>  extern __inline__ void free_pte_fast(pte_t *pte)
>>  {
>> +   if (pte == empty_bad_pte_table)
>> +   return;
>
>I guess that should be BUG() instead of return, so that the callers can be
>fixed.

Not really. pte_free and pmd_free are called from the common mm code, but
the concept of empty_bad_{pte,pmd}_table is architecture dependent. The
trouble starts in arch/???/mm/init.c, where these special arrays are
inserted into the paging tables. So the solution to the problem should be
in the architecture-dependent files too.

blue skies,
   Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [EMAIL PROTECTED]





Re: Memory management bug

2000-11-15 Thread Andi Kleen

On Wed, Nov 15, 2000 at 01:39:13PM +0100, [EMAIL PROTECTED] wrote:
> +extern pte_t empty_bad_pte_table[];
>  extern __inline__ void free_pte_fast(pte_t *pte)
>  {
> +   if (pte == empty_bad_pte_table)
> +   return;

I guess that should be BUG() instead of return, so that the callers can be 
fixed.


-Andi


