By bisecting in production we determined that the problem is --flush_liftoff_code, which was enabled by default starting in 13.2. In our environment, this flag seems to leak memory that lives in the code cache and so affects newly-created isolates. I've filed a bug:

https://issues.chromium.org/issues/390075235
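Until that's resolved, it should be possible to simply turn the flag back off at embedder startup. A minimal sketch, assuming the embedder initializes V8 itself and applies flags before creating any isolates (this is an illustration, not code from the thread):

    // Sketch: turn Liftoff code flushing back off before any Isolate exists.
    // Assumes this runs before v8::V8::Initialize().
    #include <v8.h>

    void DisableLiftoffCodeFlushing() {
      // Equivalent to passing --no-flush-liftoff-code on the command line.
      v8::V8::SetFlagsFromString("--no-flush-liftoff-code");
    }

Passing the flag on the command line should have the same effect if the embedder forwards argv to V8 via SetFlagsFromCommandLine.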
-Kenton

On Tue, Jan 14, 2025 at 9:09 AM Kenton Varda <[email protected]> wrote:

> On Tue, Jan 14, 2025 at 5:59 AM Jakob Kummerow <[email protected]>
> wrote:
>
>> - from what you describe, perhaps it would be feasible to craft a
>> reproducer. It'd probably have to be a custom V8 embedder that, in a
>> loop, creates many fresh isolates and instantiates/runs the same (or
>> several?) demo Wasm module in them.
>
> I tried exactly that yesterday, and was able to see that "external
> memory" was indeed correlated across isolates, but after
> creating/destroying thousands of isolates it seemed to converge on a
> reasonable number rather than keep growing forever.
>
> But in prod we see something in external memory growing and growing.
>
>> - it could make sense to verify (with printfs in their destructors)
>> that both Isolates and NativeModules get destroyed as expected. It's
>> conceivable that the memory growth you're observing is intentional
>> caching (of generated code, or something?) because the WasmEngine
>> thinks that the cached data is still needed/useful.
>>
>> How/where exactly are you seeing this increased "external memory"?
>> I.e. what reporting system are you using to get memory consumption
>> numbers?
>>
>> On Tue, Jan 14, 2025 at 1:09 AM Kenton Varda <[email protected]>
>> wrote:
>>
>>> To add context here:
>>>
>>> The problem appears to show up only after running in production for
>>> an hour or two. During that time we will have created thousands of
>>> isolates to handle millions of requests.
>>>
>>> But the problem seems to affect *new* isolates, even when those
>>> isolates are loaded with applications that had been loaded into
>>> previous isolates without problems. Startup of an application should
>>> be 100% deterministic since we disallow any I/O during startup, but
>>> we're seeing that after the host has been running a while, new
>>> isolates are showing much higher "external memory" on startup.
>>> (E.g. 400MB external memory, but we enforce a 128MB limit on the
>>> whole isolate.)
>>>
>>> We observed that the wasm native module cache causes identical wasm
>>> modules to be shared across isolates, and that wasm lazy compilation
>>> causes memory usage of a wasm module -- as accounted by all isolates
>>> that have loaded it -- to change.
>>>
>>> Could it be that there is a memory leak in lazy compilation, such
>>> that these shared cached modules are gradually growing over time, to
>>> the point where new isolates that try to load these modules are being
>>> hit with extremely high "external memory" numbers right off the bat?
>>>
>>> -Kenton
>>>
>>> On Mon, Jan 13, 2025 at 5:31 PM Erik Corry <[email protected]>
>>> wrote:
>>>
>>>> It looks like it's related to shared objects between isolates. Is
>>>> there a newer document than
>>>> https://docs.google.com/document/d/18lYuaEsDSudzl2TDu-nc-0sVXW7WTGAs14k64GEhnFg/edit?usp=drivesdk
>>>> that describes how this works today? In particular cross-isolate
>>>> GCs?
>>>>
>>>> On Mon, 13 Jan 2025, 15:25 Jakob Kummerow, <[email protected]>
>>>> wrote:
>>>>
>>>>> Sounds like a bug, but without more details (or a repro) I don't
>>>>> have a more specific guess than that.
>>>>>
>>>>> If you're desperate, you could try to bisect it (even with a flaky
>>>>> repro).
>>>>> Or review the ~500 changes between those branches:
>>>>> https://chromium.googlesource.com/v8/v8/+log/branch-heads/13.1..branch-heads/13.2?n=10000
>>>>>
>>>>> On Mon, Jan 13, 2025 at 2:48 PM 'Dan Lapid' via v8-dev
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> In V8 13.2 and 13.3 we see wasm isolates' external memory usage
>>>>>> blowing up sometimes (up to gigabytes). Under V8 13.1 the same
>>>>>> code would never ever use more than 80-100MB.
>>>>>> The issue doesn't happen every time for the same wasm bytecode.
>>>>>> It doesn't even reproduce locally. But some significant
>>>>>> percentage of the time it does happen.
>>>>>> This has only started happening in 13.2; what are we missing?
>>>>>> Should we be enabling/disabling some flags?
>>>>>> It also seems that 13.3 is significantly worse in terms of error
>>>>>> rate.
>>>>>> The problem happens under "--liftoff-only".
>>>>>> We use pointer compression but not sandbox.
>>>>>> We've tried enabling --turboshaft-wasm in 13.1 and the problem
>>>>>> did not reproduce.
>>>>>> Has anything changed that we need to adapt to?
>>>>>> Would really appreciate your help!
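For reference, a bare-bones version of the isolate-churn reproducer Jakob suggested might look like the sketch below. It is only an illustration, not code from the thread: it assumes the V8 platform is already set up (v8::V8::InitializePlatform() / v8::V8::Initialize()), it instantiates a placeholder empty Wasm module where a real repro would need modules with actual functions to compile, and it prints per-isolate external memory via v8::HeapStatistics::external_memory(), which is one way of obtaining the kind of number discussed above.

    // Sketch of a repro harness: create and destroy many isolates, each of
    // which compiles and instantiates the same Wasm module, then report
    // that isolate's external memory. Assumes V8 is already initialized.
    #include <cstdio>
    #include <v8.h>

    void RunIsolateChurn(int iterations,
                         v8::ArrayBuffer::Allocator* allocator) {
      // JS that compiles and instantiates a trivial (empty) Wasm module;
      // a realistic harness would embed a module with real functions.
      const char* kWasmJs =
          "const bytes = new Uint8Array("
          "    [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);"
          "new WebAssembly.Instance(new WebAssembly.Module(bytes));";

      for (int i = 0; i < iterations; i++) {
        v8::Isolate::CreateParams params;
        params.array_buffer_allocator = allocator;
        v8::Isolate* isolate = v8::Isolate::New(params);
        {
          v8::Isolate::Scope isolate_scope(isolate);
          v8::HandleScope handle_scope(isolate);
          v8::Local<v8::Context> context = v8::Context::New(isolate);
          v8::Context::Scope context_scope(context);

          v8::Local<v8::String> source =
              v8::String::NewFromUtf8(isolate, kWasmJs).ToLocalChecked();
          v8::Script::Compile(context, source)
              .ToLocalChecked()
              ->Run(context)
              .ToLocalChecked();

          // Report per-isolate external memory after instantiation.
          v8::HeapStatistics stats;
          isolate->GetHeapStatistics(&stats);
          std::printf("isolate %d: external_memory=%zu\n", i,
                      stats.external_memory());
        }
        isolate->Dispose();
      }
    }

As Kenton noted above, a loop like this converged on a reasonable number for him rather than growing without bound, so reproducing the production behaviour presumably also requires code flushing and lazy compilation to kick in on larger, longer-lived modules shared through the NativeModule cache.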
