[Pharo-dev] Re: [Pharo-vm] Stack overflow support?

Daniel Slomovits Mon, 13 Nov 2023 22:20:53 -0800

Hello Stephane,

A simple example like that (or really any case caused by a common type of
mistake, even if the method involved is more complex) wouldn't get to run
for a couple of seconds. It would error out after a small fraction of a
second, since the interrupt is issued immediately upon exceeding the limit.
By default that limit is (oops, correcting my earlier figure) 64k *slots*,
i.e. 256kB, which is at most ~10k stack frames and realistically fewer
because of args and temps. It just doesn't take that long to execute most
recursive methods 10k levels deep.
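
As a rough check of those figures, the arithmetic can be written as a
Playground snippet. It assumes 4-byte slots (32-bit VM) and a fixed overhead
of about six slots per frame before args and temps; the six-slot figure is an
assumption borrowed from the context layout mentioned earlier in the thread,
not something checked against Dolphin's frame format.

```
| slots bytesPerSlot slotsPerFrame |
slots := 64 * 1024.        "default limit, in slots"
bytesPerSlot := 4.         "32-bit VM"
slotsPerFrame := 6.        "assumed fixed frame overhead, before args/temps"
slots * bytesPerSlot.      "262144 bytes, i.e. 256kB"
slots // slotsPerFrame.    "10922 frames at most, i.e. ~10k"
```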


At this point Dolphin would raise the limit a little for that process and
signal a StackOverflow exception. In production this would likely mean
killing the process, but that's up to application code. In development you
get a walkback like any other exception. Because the stack is kept to a
*relatively* reasonable size, I've never had debugger performance be an
issue. It's noticeably more sluggish, but we're talking about taking
200-300ms to respond instead of an imperceptible delay. Similarly for GC
pressure: even if you have dozens of processes all maxed out on stack, this
would still be less than the memory used to start a base image.
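
To make that sequence concrete, here is a minimal sketch that simulates it in
plain Pharo with an explicit depth counter. Nothing here is Dolphin's or
Pharo's actual mechanism: the limit and headroom values are illustrative, they
count block activations rather than slots, and a plain Error stands in for
StackOverflow.

```
| limit headroom recurse result |
limit := 10000.      "illustrative soft limit, in activations"
headroom := 1000.    "assumed extra room granted once the limit is hit"
recurse := [:depth |
    depth > limit ifTrue: [
        "raise the limit first so the handler has room to run, then signal"
        limit := limit + headroom.
        Error signal: 'simulated stack overflow at depth ', depth printString].
    recurse value: depth + 1].
result := [recurse value: 1]
    on: Error
    do: [:ex | ex return: ex messageText].
result    "'simulated stack overflow at depth 10001'"
```

The handler block runs while the deep stack is still in place, which is why
the limit is raised before the exception is signaled rather than after.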

Certainly there are scenarios that can cause problems—an error while
printing an object will stop you from opening a debugger, yes. However a
StackOverflow in particular isn't any worse than any other exception—in all
cases you'll get a walkback, hit "Debug", and get another walkback instead
of a debugger. If you hit "Terminate" at that point, it might leave the
original process in a zombie state and/or a hidden debugger window, but you
can kill it from the Process Monitor/close it from the Window menu and be
fine. And even this is because Dolphin doesn't safeguard printing in the
development tools the way Pharo does—if Pharo adopted stack overflow
handling like Dolphin's, a stack overflow likely wouldn't even stop you
from opening a debugger, it would just make it a *little* slow, then
something would show up with "error printing: a StackOverflow" or the like.

Hope that helps.

Daniel

On Sun, Nov 12, 2023 at 4:53 AM stephane ducasse <stephane.duca...@inria.fr>
wrote:

> Hi Daniel
>
> Thanks for the feedback.
> Maybe you already wrote it, but I could not really understand.
>
> How does Dolphin handle
>
> ```
> A >> foo
>    ^ self foo
> ```
>
>
> Is it left to run for a couple of seconds?
> Did they kill the process?
>
> In Pharo we do have an interrupt.
>
> But it could happen that,
>  - the stack is so big that the debugger is very sluggish (best-case
> scenario)
>  - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
> even arrive at Pharo or the trigger
>  - if the recursion is hit when printing an object (which is more common
> than you could imagine), opening the debugger could trigger a new recursion
> and never give back the control to the user
>
>
> S
>
> On 11 Nov 2023, at 05:10, Daniel Slomovits <daniels...@gmail.com> wrote:
>
> I think this is a great idea. I've mostly used Dolphin Smalltalk, which is
> actually a strict stack machine under the hood (it has a context-like
> introspection API but the stack is explicitly the canonical form), so it's
> more-or-less forced to implement a limit of some kind. When I started using
> Pharo I triggered a couple stack overflows by mistake, and was frustrated
> by the fact that at first what happened was...nothing, everything seemed
> fine, my code just didn't work. And then half a minute later Pharo gets
> extremely slow and I notice it's using 2GB of memory and by then it's too
> late and I have to kill the image. Getting a more-or-less immediate error
> would be much more user-friendly IMO.
>
> A couple things to learn from Dolphin's implementation, I think:
>
>    1. When a stack overflow is detected, the resulting interrupt raises
>    that process' stack limit by a significant amount (though by less than the
>    original limit—IOW it doesn't double, but it's not just a couple more
>    frames either) before signaling the exception, precisely so that exception
>    handling can occur without triggering another stack-overflow event. A
>    further refinement could be that if a second stack overflow *is* detected,
>    we directly invoke more basic recovery—this could mean an emergency
>    evaluator, terminating the offending process and opening a post-mortem with
>    a textual stack dump (ugh! but at least it's predictable), etc.
>    2. I don't think we should worry too much about refining what exactly
>    the limit is. 10x as much stack as 99% of code will ever use, is still a
>    tiny amount compared to consuming all available memory with Contexts. At
>    least, if I'm understanding the graph/data correctly. That's 36kB of stack
>    space, right? Not 36k frames/contexts deep? With each context being six
>    slots plus args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack
>    representation—contexts add object-header overhead but we don't reify them
>    unless we have to). For reference, Dolphin's limit is 64kB, but that's a
>    32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo
>    can spill stack frames to the heap as contexts, the limit could easily be
>    1MB, or a fixed
>    number of frames designed to approximate that—still a tiny amount of memory
>    overall, and still will be hit near-instantly by true infinite recursion,
>    but lots of breathing room for most use cases.
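>
> As a rough check of the 500-750 figure, the arithmetic can be written as a
> Playground snippet. It assumes the 36k in the graph means 36kB, 8-byte
> slots, and between zero and three args/temps per frame (the last is an
> illustrative assumption):
>
> ```
> | bytes slots |
> bytes := 36 * 1024.
> slots := bytes // 8.    "4608 slots on a 64-bit VM"
> slots // 6.             "768 frames with no args/temps"
> slots // (6 + 3).       "512 frames with three args/temps each"
> ```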
>
> Actually, this did get me to thinking...the stack depth of a Pharo process
> is not necessarily easy/cheap to compute in the general case, without
> caching a lot of information on intermediate contexts. In most cases the
> context chain acts as a proper stack, but *not always*—methods like
> Process>>on:do:, and some even more esoteric ones I forget off the top of
> my head, make modifications far away from the top context and may splice
> context chains together in odd ways. Perhaps a more flexible limit would be
> better—one that is triggered by allocating more than a certain number of
> contexts *total*, and examines running Processes in detail to find the
> culprit at that point.
>
> On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <guillermopol...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> We started (with many interruptions over the last months) working a bit
>> with Stephane on understanding what is the (positive and negative) impact
>> of stack-overflow support in Pharo.
>> The key idea is that if a process consumes too much stack (potentially
>> because of an infinite recursion) then the process should stop with an
>> exception.
>>
>> ## Why we want better stack consumption control
>>
>> This idea comes up to solve issues that are pretty common and hit
>> especially newbies.
>> For example, imagine you accidentally write an accessor such as
>>
>> ```
>> A >> foo
>>    ^ self foo
>> ```
>>
>> Students do this all the time, and I’ve also seen it in experienced
>> people who go too fast :).
>> More importantly, such recursions could happen also with not-so-obvious
>> indirect recursions (a sends b, b sends c, c sends a), and these could hit
>> anybody.
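>>
>> For illustration, such an indirect cycle could look like this (the selector
>> names a, b and c are made up):
>>
>> ```
>> A >> a
>>    ^ self b
>>
>> A >> b
>>    ^ self c
>>
>> A >> c
>>    ^ self a
>> ```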
>>
>> This is aggravated because the current execution model allows us to have
>> infinite stacks —meaning: limited by available memory only.
>> This is indeed a nice feature for many use cases but it has its own
>> drawbacks when one of these kinds of recursion is hit:
>>  - code just loops forever taking space in the stack
>>  - when there is no more stack space, context objects are created and
>> moved to the heap
>>  - but those contexts are strongly held, so they are never GCed and take
>> up extra space
>>  - even worse! they are there adding more work to the GC every time and
>> making the GC run more often looking for space that is not there
>>
>> ## Why Ctrl-dot does not always work
>>
>> Of course, super users know there is this “Ctrl dot” hidden feature that
>> should help you recover from this.
>> First, let's take out of the equation that this is only known by super
>> users.
>> Now, in this situation, when Ctrl-dot is hit it will trigger a handler
>> that suspends the problematic process and opens a debugger on it.
>> But it could happen that,
>>  - the stack is so big that the debugger is very sluggish (best-case
>> scenario)
>>  - the VM is just flooded doing GCs so maybe the Ctrl dot event does not
>> even arrive at Pharo or the trigger
>>  - if the recursion is hit when printing an object (which is more common
>> than you could imagine), opening the debugger could trigger a new recursion
>> and never give back the control to the user
>>
>> ## What are we working on
>>
>> The main idea here is: Can we have a simple and efficient way to prevent
>> such kinds of situations?
>>
>> After many discussions around detecting recursion, we kinda arrived at
>> the simple solution of just detecting a stack overflow.
>> The solution is easy to understand (because it’s how other languages
>> work) and easy to implement because there is already support for that.
>> But this leaves open two questions:
>>  - what happens when people want to use the “infinite stack” feature?
>>  - when should a process stack overflow? What is a sensible default
>> value?
>>
>> Our draft implementation here
>> https://github.com/pharo-project/pharo-vm/pull/710 does the following to
>> cope with this:
>>  - we can now parametrize the size of the stack (of each stack page to be
>> more accurate) when the VM starts up
>>  - the stack overflow check can be disabled per process
>>
>> We are also running experiments to see what could be a sensible stack
>> size for our normal usages. Here, for example, we ran almost all test cases
>> in Pharo separately (one suite per line below), and we observed how many
>> tests broke (x-axis) with different stack sizes (y-axis).
>> Here we see that most test suites require at least 20-24k to run
>> properly, some go up to 36k of stack before converging (i.e., the number of
>> broken tests does not change).
>>
>> <ImagenPegada-10.tiff>
>>
>> You’ll notice in the graph that there are some scenarios that break all
>> the time. This is because exception handling itself is recursive and may
>> produce more stack overflows depending on the size of the stack between the
>> exception and the exception handler.
>> So some more work is still required, mostly changing Pharo libraries to
>> properly support this. For example:
>>  - should tests run in a fresh process with a fresh stack?
>>  - should the exception mechanism use less recursion?
>>  - resumable exceptions add stack pressure because they do not “unstack”
>> until the exception is finally handled, meaning that the stack used by
>> exception handling just adds up to the stack of the original code, can we
>> do better here?
>>
>> Probably there are more interesting questions here, that’s the “why”
>> behind this email.
>> I’m interested in opinions and scenarios you may come up with that should
>> be taken into account.
>>
>> Cheers,
>> Guille
>>
