> Maybe it would be possible to prototype something and measure.

I am also interested in this topic. In some microbenchmarks I ran (just looping on shuffle), I noticed a ~15% improvement from using a "constant pool" mask rather than constructing the mask using immediate scalars and RIP-relative deduplication as is done currently. If someone here prototypes this, I believe we can run some benchmarks (e.g. certain MediaPipe models) and share the speed-ups. I originally had it on my to-do list to take Zhi's initial patch (https://chromium-review.googlesource.com/c/v8/v8/+/2149408) a bit further, but I would prefer to benchmark a more "official" prototype. Please keep me in the loop!
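
For readers less familiar with the shuffle masks being compared here, the following is a minimal scalar sketch of what a 128-bit pshufb mask does and how its 16 bytes could be split into two 64-bit immediates. It assumes "immediate scalars" means two 64-bit halves, which is an assumption for illustration and not a statement about V8's code generator. The names `Bytes16`, `ShuffleBytes`, and `PackMask` are hypothetical and do not exist in V8.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical helper type: 16 bytes, standing in for one 128-bit register.
using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of the 128-bit pshufb: each result byte is picked from `src`
// by the low 4 bits of the mask byte, or zeroed if the mask byte's top bit
// is set.
Bytes16 ShuffleBytes(const Bytes16& src, const Bytes16& mask) {
  Bytes16 out{};
  for (std::size_t i = 0; i < 16; ++i) {
    out[i] = (mask[i] & 0x80) ? std::uint8_t{0} : src[mask[i] & 0x0F];
  }
  return out;
}

// Packs a mask into two little-endian 64-bit values (low half first), the
// form in which it could be materialized from scalar immediates. This is an
// illustrative assumption, not V8's actual lowering.
std::pair<std::uint64_t, std::uint64_t> PackMask(const Bytes16& mask) {
  std::uint64_t lo = 0;
  std::uint64_t hi = 0;
  for (int i = 0; i < 8; ++i) {
    lo |= static_cast<std::uint64_t>(mask[i]) << (8 * i);
    hi |= static_cast<std::uint64_t>(mask[i + 8]) << (8 * i);
  }
  return {lo, hi};
}
```

A constant pool would instead store the same 16 bytes once in memory and pass them to the shuffle as an aligned memory operand, which is the variant the microbenchmark above favored.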

On Monday, March 22, 2021 at 9:51:02 AM UTC-7 Jakob Kummerow wrote:

> On Wed, Mar 17, 2021 at 11:27 PM Dan Weber <dwe...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I've summarized comments, questions, and responses at the top here with the effect of making this a little bit easier to read. My comments, questions, and responses are just below.
>>
>> Clemens:
>> - Currently, there is no kRootRegister in WASM SIMD (or at least x64).
>> - Any access to data through the Isolate heap would require a few indirections since there wouldn't be any way to calculate a consistent offset or displacement for a specific constant.
>> - An alternative to using the heap is to allocate data blocks somewhere that's PC-relative (or within 32 bits of RIP). If pages can be allocated in that range and they are not code pages, they're not executable by default. This helps alleviate any security concerns. If close by is not an option, we can use the code page allocator to allocate pages. However, if we do this, we should ensure that those pages are marked as not executable.
>> - Start putting together a design document and include Zhi and me on it.
>> - How would you use External References with the Heap?
>> Zhi:
>> - Have explored the possibility of using PC-relative/RIP relative addressing in https://docs.google.com/document/d/1uCYwyQYjgNAtaXDNgHusDGCV1m9YGhOWJx2eqzv2rdI/edit
>> - Proposal was specific to shuffles and abandoned when another solution could provide immediate performance benefits without the complexity of the constant pool.
>> - There is still interest in a constant pool and it warrants further investigation.
>> Jakob:
>> - Isolate is good for builtins and/or anything fixed and static in scope. This might not be a good use case for two reasons:
>> 1) Constants are likely limited in scope to the code using them and are unlikely to get benefit from sharing.
>> 2) If it's allocated with an isolate factory, it now requires a handle since the address can move if the GC moves it.
>> - A better alternative would be something like what Clemens is describing (a PC relative solution) since that will follow the same lifecycle as the code using it.
>> - If the implementation can be made to work in such a way that it's PC relative but not in code space, that's even better, since it alleviates security concerns.
>>
>> Clemens:
>> - With respect to External References, we've started using them quite a bit since they have some very nice properties. Regardless of address space, we can make any pointer address available with a movq, not just PC relative, and any other instruction (pandn/pshufb...) with the result as an aligned memory operand. My thought is that if we can find a way to ensure any given block of memory is deallocated after the code executes (or simply when the code itself falls out of scope), we can build a constant pool wherever whenever. In such a case, we could hypothetically have a std::set somewhere in heap space that could be used to deduplicate any/all constants we need and allow for their generation and use during the code generation process. The thought of using the Isolate Heap was appealing if kRootRegister existed and we could always generate a constant displacement -- thus eliminating the extra movq instruction.
>> - Generally speaking, I would love your help and am open to any solution that performs better than constant re-generation with shuffles. If the PC relative solution is viable and efficient, it's certainly worth testing.
>> - How would you like me to list you on the V8 design doc? Do I put you and Zhi as technical leads? I'm not sure what or whom to put in the LGTM column, and then what the next steps are. Do we talk offline and then submit it to v8-dev+design? Or is that where the dialog happens?
>> Zhi:
>> - This design doc and the prototype implementation are super helpful even if only for reference. Thanks.
>> - With respect to the prototype implementation, does it actually build a constant pool or just inline constants before they're used? I'm curious about anything/everything that's happening in this green block: https://chromium-review.googlesource.com/c/v8/v8/+/2149408/2/src/codegen/x64/assembler-x64.cc#431
>> Jakob:
>> - With respect to memory allocations that tie to the scope of the code, what options are available to us if isolate isn't good?
>
> We currently do custom memory management for Wasm code objects. So the easiest option is to have constant pools within the code object; I think the alternative would require building some new infrastructure to allow storing data for each WasmCode that has the same lifetime.
>
>> - If we could allocate pages for data that were never in the pathway of being set as executable that would be awesome. If we can then allocate from said pages each individual aligned constant, it would be really awesome.
>
> Everything is possible, it's "just" a question of effort...
> I think a viable path forward might be: use constant pools inside the code object (as mentioned elsewhere in this thread) to evaluate performance. If the performance results indicate that the feature should be productionized, evaluate options for where to store the data: is in-code-object good enough? What would the alternatives look like, and how much effort would they be? That'd be the stuff for a design doc :-)
>
>> Thanks!
>> Dan
>>
>> On Wed, Mar 17, 2021 at 4:32 PM Jakob Kummerow <jkum...@chromium.org> wrote:
>>
>>> I'd like to try to clear up the understanding of memory handling a bit:
>>> There is indeed the option to put stuff into the Isolate, so that root-relative addressing can be used to access it.
>>> This makes sense when the amount of data is fixed and statically known (for example: a list of all builtins that exist).
>>> It's also true that there's a 1:1 relation between Isolates and Heaps, but that's the managed heap! If you use the Factory to allocate an object, then it lives on the managed heap. It'll (potentially) move around on GC, you need Handles to refer to it from C++ code, and you can't use root-relative addressing to get to it.
>>>
>>> So if you want to store constants required by a specific compiled function (and not particularly likely to be shared by other functions), then their storage should be dynamic and have the same lifetime as the code that needs them. Putting them right inside the code is an easy way to achieve that (though with security downsides, as Clemens points out), and in fact that's what "constant pool" has historically meant in V8. Putting them elsewhere (non-executable) but "nearby" would be even better.
>>>
>>> I hope this helps at least a little bit, and that I'm not misunderstanding the whole idea (not familiar with SIMD).
>>>
>>> On Wed, Mar 17, 2021 at 5:16 PM Zhi An Ng <zh...@chromium.org> wrote:
>>>
>>>> We did a little exploration on this when trying to optimize shuffles, see relevant design doc (https://docs.google.com/document/d/1uCYwyQYjgNAtaXDNgHusDGCV1m9YGhOWJx2eqzv2rdI/edit).
>>>>
>>>> We ultimately decided against it because we found another way to improve shuffle performance without adding a constant pool.
>>>>
>>>> +Brown, Andrew fyi, since you have interest in exploring a 128-bit constant pool as well.
>>>>
>>>> On Wed, Mar 17, 2021 at 5:19 AM Clemens Backes <clem...@chromium.org> wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> that sounds like an interesting idea. In fact, I considered implementing a Wasm constant pool for floating point constants, but SIMD code might benefit even more.
>>>>> Note that in Wasm code we do not have a root register (currently), so any access to the isolate root is a few indirections away. I am also not sure I fully understand how you plan to use external references to refer to dynamically generated constants.
>>>>>
>>>>> An alternative to allocating on the heap might be allocating in (or close to) the Wasm code space, and using PC-relative addressing. For security reasons, we should try to make the constant pool non-executable, so if we decide to allocate *in* the Wasm code space (instead of close by) we should reserve a whole page and remove execute permissions from that page.
>>>>>
>>>>> I think a good next step would be starting a little design doc (see https://v8.dev/docs/design-review-guidelines). I would propose to initially include @Zhi An Ng and me, and we can add more people once we figure out a working solution.
>>>>>
>>>>> Thanks for working on this!
>>>>>
>>>>> -Clemens
>>>>>
>>>>> On Tue, Mar 16, 2021 at 11:07 PM Dan Weber <dwe...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to approach the group to understand the feasibility of creating a root relative constant pool for WASM to support reuse of intermediate constants generated during complex instruction sequences.
>>>>>>
>>>>>> Right now, when a complex instruction sequence like shuffle or swizzle operates and cannot find an architectural match for the requested operation, it generates in flight code to build shuffle masks which are passed to pshufb on x64 and tbl on a64. Due to the current nature of the code generator / assembler, these can be regenerated multiple times even if the input is constant (https://bugs.chromium.org/p/v8/issues/detail?id=11545).
>>>>>>
>>>>>> Two options exist for resolving this.
>>>>>>
>>>>>> 1) Generating a constant pool for the code to use at compile time and loading the data from there.
>>>>>> 2) Lifting sections of multi instruction code up to the graph for optimization and reduction.
>>>>>>
>>>>>> Each is a partial solution since both can benefit from the other.
>>>>>>
>>>>>> What I'd like to enquire about now is the first option -- the feasibility of implementing an isolate root relative constant pool.
>>>>>>
>>>>>> From what I can see, this might be an easy and effective solution covering address moves, security concerns, and alignment.
>>>>>>
>>>>>> Since isolate() has a single coherent heap that is garbage collected and moved (https://v8.dev/blog/embedded-builtins), one can allocate objects relative to the root. If you use it with the ExternalReference operand mechanism we've been using, it'll automatically generate all of the address constants relative to the root register (https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=124;bpv=1;bpt=1).
>>>>>>
>>>>>> This should preclude any concerns about addresses moving when the heap moves or gets reallocated. Likewise, isolate()->factory() provides mechanisms for aligning the pointers on each allocation. As such, if an external data structure such as a map or hash map is used to track the constants at code generation time, then each constant can be allocated individually on the isolate heap without respect to any other. If the heap persists the entire duration of the executed code and is deallocated at the end, then there are no memory management concerns. Lastly, there should be no security concerns since the data allocated on the isolate heap will not be executable by default.
>>>>>>
>>>>>> Is this correct?
>>>>>> If so, what's the appropriate process for submitting and reviewing a design proposal?
>>>>>>
>>>>>> Dan

--
v8-dev mailing list
v8-dev@googlegroups.com
http://groups.google.com/group/v8-dev
To view this discussion on the web visit https://groups.google.com/d/msgid/v8-dev/c81c01ee-905a-4382-8691-092f6052cfb7n%40googlegroups.com.
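
The thread repeatedly mentions tracking constants in a map (or std::set) at code generation time and addressing them PC-relative. A minimal sketch of that idea follows, assuming the pool is a single contiguous blob of 16-byte entries placed at a known offset from the code. `Simd128Constant`, `Simd128ConstantPool`, and `RipRelativeDisplacement` are hypothetical names for illustration only; none of this is V8's actual API, and it is not taken from the prototype patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical 128-bit constant, e.g. a pshufb mask. Not a V8 type.
using Simd128Constant = std::array<std::uint8_t, 16>;

// Hypothetical pool: deduplicates constants during code generation and hands
// out byte offsets into one contiguous blob of 16-byte entries.
class Simd128ConstantPool {
 public:
  // Returns the byte offset of `value` in the pool, adding it on first use.
  std::size_t Add(const Simd128Constant& value) {
    auto it = offsets_.find(value);
    if (it != offsets_.end()) return it->second;  // Already pooled: reuse.
    std::size_t offset = data_.size();
    assert(offset % 16 == 0);  // Entries stay 16-byte aligned within the blob.
    data_.insert(data_.end(), value.begin(), value.end());
    offsets_.emplace(value, offset);
    return offset;
  }

  std::size_t entry_count() const { return offsets_.size(); }
  const std::vector<std::uint8_t>& data() const { return data_; }

 private:
  std::map<Simd128Constant, std::size_t> offsets_;  // constant -> byte offset
  std::vector<std::uint8_t> data_;                  // raw pool contents
};

// x86-64 RIP-relative addressing is relative to the address of the *next*
// instruction, so the displacement is the entry address minus that address.
// All three arguments are offsets from the same base (an assumption here).
std::int32_t RipRelativeDisplacement(std::size_t pool_start,
                                     std::size_t entry_offset,
                                     std::size_t next_instruction_offset) {
  std::int64_t disp = static_cast<std::int64_t>(pool_start + entry_offset) -
                      static_cast<std::int64_t>(next_instruction_offset);
  assert(disp >= INT32_MIN && disp <= INT32_MAX);  // Must fit in 32 bits.
  return static_cast<std::int32_t>(disp);
}
```

Whether the blob then lives inside the code object or on a nearby non-executable page is exactly the open question from the discussion above; the deduplication logic is the same either way.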