On Jan 9, 2007, at 9:02 AM, Gregory Shimansky wrote:

Geir Magnusson Jr. wrote:
I started a new thread because I think this is really important.
I've also added a page in the wiki to track this stuff, because I can't keep it in my head:
  http://wiki.apache.org/harmony/MegaSpawnThreadingBug
which you can get to from the home page via the "WhiteBoards" section, intended to be a place where we can work as a team on a whiteboard, with the intention that once the mini-project is over, we erase...
I think this is a scary scary problem :)

I've tried to analyze the MegaSpawn test on Windows and here's what I found out.

The OOME is thrown because the process's virtual size easily reaches 2 GB. This happens at roughly 1.5k simultaneously running threads. I think it happens because all of the process's virtual memory gets mapped for thread stacks.
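
A back-of-the-envelope check of those numbers, as a small sketch. It assumes each thread reserves a 1 MB stack, which is the usual Windows linker default; whether the VM actually uses that value here is an assumption, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Rough upper bound on the number of threads when every thread reserves
 * a fixed-size stack out of a limited user address space.
 * Both arguments are in bytes. Everything else in the process (Java heap,
 * code, DLLs) is ignored, so the real limit is lower than this bound. */
uint64_t max_threads_by_stack(uint64_t address_space_bytes,
                              uint64_t stack_reserve_bytes)
{
    if (stack_reserve_bytes == 0) {
        return 0; /* no meaningful answer; also avoids division by zero */
    }
    return address_space_bytes / stack_reserve_bytes;
}
```

With 2 GB of address space and 1 MB per stack the bound is 2048 threads, which is at least consistent with running out at about 1.5k threads once the heap and native code take their share.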

When virtual memory is exhausted, all kinds of problems may occur. In many places there are assertions that malloc returns non-NULL, and these assertions fail. In other places the result of malloc is not checked at all and the NULL pointer gets dereferenced, which also crashes the VM.


This is actually good news (I think), as I'd rather be running out of heap than trashing it.

This is also useful for hardening - we should spend some time finding places where we aren't checking mallocs and such.
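
As a sketch of what a hardened allocation site could look like: the failure is reported to the caller rather than asserted on or ignored. The `slot_table` type and both functions are hypothetical and only illustrate the pattern; they are not taken from the VM sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure, for illustration only. */
typedef struct {
    size_t length;
    int *slots;
} slot_table;

/* Returns 0 on success and -1 if the memory could not be obtained.
 * On failure the table is left empty, so nothing dereferences NULL. */
int slot_table_init(slot_table *t, size_t length)
{
    assert(t != NULL); /* a programming error, not a runtime condition */
    t->length = 0;
    t->slots = NULL;

    /* Reject zero and any length whose byte count would overflow size_t. */
    if (length == 0 || length > SIZE_MAX / sizeof *t->slots) {
        return -1;
    }

    int *slots = malloc(length * sizeof *slots);
    if (slots == NULL) {
        return -1; /* report out-of-memory instead of asserting */
    }

    for (size_t i = 0; i < length; i++) {
        slots[i] = 0;
    }
    t->slots = slots;
    t->length = length;
    return 0;
}

/* Releases the table and resets it to the empty state. */
void slot_table_destroy(slot_table *t)
{
    free(t->slots);
    t->slots = NULL;
    t->length = 0;
}
```

A caller that gets -1 back can then raise an OutOfMemoryError in a controlled way, and the assert is kept only for conditions that can never legitimately happen at run time.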


I tried watching the Sun implementation, and it looks like it maps smaller amounts of memory for thread stacks. Maybe it maps only the initial stack memory and allows it to grow later (although I don't quite understand how that is possible in a contiguous address space). When the Sun VM executed this test, it created up to ~6k simultaneously running threads, and the process size at that moment was still smaller than 2 GB.
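
On the contiguous address space question, one way this can work is to reserve the full range up front without backing it, and make pages accessible only as the stack grows. The sketch below shows that idea with POSIX mmap/mprotect; on Windows the analogous mechanism is VirtualAlloc with MEM_RESERVE followed by MEM_COMMIT. The `lazy_stack` type and functions are hypothetical, and nothing here claims to be how the Sun VM actually does it. For scale, 2 GB divided by 6k threads is roughly 350 KB per thread.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a stack region that is reserved as one
 * contiguous range but made accessible piece by piece. */
typedef struct {
    unsigned char *base; /* lowest address of the reserved range */
    size_t reserved;     /* total bytes reserved */
    size_t committed;    /* accessible bytes, counted down from the top */
} lazy_stack;

/* Reserve address space only: PROT_NONE pages are not usable yet.
 * reserve_bytes must be a non-zero multiple of the page size. */
int lazy_stack_reserve(lazy_stack *s, size_t reserve_bytes)
{
    assert(s != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (reserve_bytes == 0 || reserve_bytes % page != 0) {
        return -1;
    }
    void *p = mmap(NULL, reserve_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    s->base = p;
    s->reserved = reserve_bytes;
    s->committed = 0;
    return 0;
}

/* Grow the accessible part to total_bytes (a multiple of the page size).
 * Stacks grow downwards, so pages are enabled from the top of the range. */
int lazy_stack_commit(lazy_stack *s, size_t total_bytes)
{
    assert(s != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    if (total_bytes > s->reserved || total_bytes % page != 0) {
        return -1;
    }
    if (total_bytes <= s->committed) {
        return 0; /* already large enough */
    }
    unsigned char *new_low = s->base + (s->reserved - total_bytes);
    size_t grow = total_bytes - s->committed;
    if (mprotect(new_low, grow, PROT_READ | PROT_WRITE) != 0) {
        return -1;
    }
    s->committed = total_bytes;
    return 0;
}

/* Give the whole reserved range back to the OS. */
void lazy_stack_release(lazy_stack *s)
{
    munmap(s->base, s->reserved);
    s->base = NULL;
    s->reserved = 0;
    s->committed = 0;
}
```

Note that the reservation still consumes address space even though little memory is committed, so in a 32-bit process the per-thread reservation size remains the quantity that limits the thread count.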

I think the same problem may happen on Linux, because the test spills out OOMEs on Ubuntu as well.

If the test somehow doesn't crash on failed mallocs, it gets to the shutdown stage and hangs with 2 or more deadlocked threads. So far I haven't figured out how they lock each other.

Cool - thanks. If you have a free second, could you note this on the wiki page so we don't forget?

geir


--
Gregory

