> Maybe it would be possible to prototype something and measure. 

I am also interested in this topic; in some microbenchmarks I ran (just looping 
on shuffle) I noticed a ~15% improvement from using a "constant pool" mask rather 
than constructing the mask using immediate scalars and RIP-relative 
deduplication, as is done currently. If someone here prototypes this, I 
believe we can run some benchmarks (e.g. certain MediaPipe models) and 
share the speed-ups. I originally had it in my to-do list to take Zhi's 
initial patch (https://chromium-review.googlesource.com/c/v8/v8/+/2149408) 
a bit further but I would prefer to benchmark a more "official" prototype. 
Please keep me in the loop!
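
For anyone prototyping this, here is a minimal sketch of just the deduplication side of such a pool: each distinct 128-bit constant is stored once and identified by its byte offset. The names (Simd128, Simd128ConstantPool, Add) are hypothetical and are not V8's assembler API; a real prototype would also have to emit the pool and patch the loads.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical names throughout; this is not V8's assembler API.
using Simd128 = std::array<uint8_t, 16>;

class Simd128ConstantPool {
 public:
  // Returns the byte offset of `value` within the pool. The constant is
  // appended only if an identical one has not been added before.
  size_t Add(const Simd128& value) {
    auto [it, inserted] = offsets_.try_emplace(value, entries_.size() * 16);
    if (inserted) entries_.push_back(value);
    return it->second;
  }

  size_t size_in_bytes() const { return entries_.size() * 16; }
  const std::vector<Simd128>& entries() const { return entries_; }

 private:
  std::map<Simd128, size_t> offsets_;  // constant -> byte offset in the pool
  std::vector<Simd128> entries_;       // constants in emission order
};
```

Adding the same shuffle mask twice returns the same offset, so repeated shuffles with a constant mask would share one pool entry instead of rebuilding the mask each time.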

On Monday, March 22, 2021 at 9:51:02 AM UTC-7 Jakob Kummerow wrote:

> On Wed, Mar 17, 2021 at 11:27 PM Dan Weber <dwe...@gmail.com> wrote:
>
>> Hi everyone, 
>>
>> I've summarized comments, questions, and responses at the top here to 
>> make this a little bit easier to read.  My comments, questions, and 
>> responses are just below.
>>
>> Clemens:
>> - Currently, there is no kRootRegister in WASM SIMD (or at least x64). 
>> - Any access to data through the Isolate heap would require a few 
>> indirections since there wouldn't be any way to calculate a consistent 
>> offset or displacement for a specific constant.
>> - An alternative to using the heap is to allocate data blocks somewhere 
>> that's PC-relative (or within 32bits of RIP).  If pages can be allocated in 
>> that range and they are not code pages, they're not executable by default.  
>> This helps alleviate any security concerns.  If closeby is not an option, 
>> we can use the code page allocator to allocate pages.  However, if we do 
>> this, we should ensure that those pages are marked as not executable.
>> - Start putting together a design document and include Zhi and me on it.
>> - How would you use External References with the Heap?
>> Zhi:
>> - Have explored the possibility of using PC-relative/RIP relative 
>> addressing in 
>> https://docs.google.com/document/d/1uCYwyQYjgNAtaXDNgHusDGCV1m9YGhOWJx2eqzv2rdI/edit
>> - Proposal was specific to shuffles and abandoned when another solution 
>> could provide immediate performance benefits without the complexity of the 
>> constant pool.
>> - There is still interest in a constant pool and it warrants further 
>> investigation.
>> Jakob:
>> - Isolate is good for builtins and/or anything fixed and static in scope. 
>> This might not be a good use case for two reasons:
>> 1) Constants are likely limited in scope to the code using them and are 
>> unlikely to get benefit from sharing.
>> 2) If it's allocated with an isolate factory, it now requires a handle 
>> since the address can move if the GC moves it.
>> - A better alternative would be something like what Clemens is describing 
>> (a PC relative solution) since that will follow the same lifecycle as the 
>> code using it. 
>> - If the implementation can be made to work in such a way that it's PC 
>> relative but not in code space, that's even better, since it alleviates 
>> security concerns.
>>
>> Clemens:
>> - With respect to External References, we've started using them quite a 
>> bit since they have some very nice properties.  Regardless of address 
>> space, we can make any pointer address available with a movq, not just PC 
>> relative, and any other instruction (pandn/pshufb...) with the result as an 
>> aligned memory operand.  My thought is that if we can find a way to ensure 
>> any given block of memory is deallocated after the code executes (or simply 
>> when the code itself falls out of scope), we can build a constant pool 
>> wherever and whenever.  In such a case, we could hypothetically have a std::set 
>> somewhere in heap space that could be used to deduplicate any/all constants 
>> we need and allow for their generation and use during the code generation 
>> process.  The thought of using the Isolate Heap was appealing if 
>> kRootRegister existed and we could always generate a constant displacement 
>> -- thus eliminating the extra movq instruction.
>> - Generally speaking, I would love your help and am open to any solution 
>> that performs better than constant re-generation with shuffles.  If the PC 
>> relative solution is viable and efficient, it's certainly worth testing.
>> - How would you like me to list you on the V8 design doc?  Do I put you 
>> and Zhi as technical leads?  I'm not sure what or whom to put in the LGTM 
>> column, and then what the next steps are.  Do we talk offline and then 
>> submit it to v8-dev+design?  Or is that where the dialog happens?
>> Zhi:
>> - This design doc and the prototype implementation are super helpful even 
>> if only for reference.  Thanks.
>> - With respect to the prototype implementation, does it actually build a 
>> constant pool or just inline constants before they're used? I'm curious 
>> about anything/everything that's happening in this green block: 
>> https://chromium-review.googlesource.com/c/v8/v8/+/2149408/2/src/codegen/x64/assembler-x64.cc#431
>> Jakob:
>> - With respect to memory allocations that tie to the scope of the code, 
>> what options are available to us if isolate isn't good?
>>
>
> We currently do custom memory management for Wasm code objects. So the 
> easiest option is to have constant pools within the code object; I think 
> the alternative would require building some new infrastructure to allow 
> storing data for each WasmCode that has the same lifetime.
>  
>
>> - If we could allocate pages for data that were never in the pathway of 
>> being set as executable that would be awesome.  If we can then allocate 
>> from said pages each individual aligned constant, it would be really 
>> awesome.
>>
>
> Everything is possible, it's "just" a question of effort...
> I think a viable path forward might be: use constant pools inside the code 
> object (as mentioned elsewhere in this thread) to evaluate performance. If 
> the performance results indicate that the feature should be productionized, 
> evaluate options for where to store the data: is in-code-object good 
> enough? What would the alternatives look like, and how much effort would 
> they be? That'd be the stuff for a design doc :-)
>  
>
>>
>> Thanks!
>> Dan
>>
>>
>>
>> On Wed, Mar 17, 2021 at 4:32 PM Jakob Kummerow <jkum...@chromium.org> 
>> wrote:
>>
>>> I'd like to try to clear up the understanding of memory handling a bit:
>>> There is indeed the option to put stuff into the Isolate, so that 
>>> root-relative addressing can be used to access it. This makes sense when 
>>> the amount of data is fixed and statically known (for example: a list of 
>>> all builtins that exist).
>>> It's also true that there's a 1:1 relation between Isolates and Heaps, 
>>> but that's the managed heap! If you use the Factory to allocate an object, 
>>> then it lives on the managed heap. It'll (potentially) move around on GC, 
>>> you need Handles to refer to it from C++ code, and you can't use 
>>> root-relative addressing to get to it.
>>>
>>> So if you want to store constants required by a specific compiled 
>>> function (and not particularly likely to be shared by other functions), 
>>> then their storage should be dynamic and have the same lifetime as the code 
>>> that needs them. Putting them right inside the code is an easy way to 
>>> achieve that (though with security downsides, as Clemens points out), and 
>>> in fact that's what "constant pool" has historically meant in V8. Putting 
>>> them elsewhere (non-executable) but "nearby" would be even better.
>>>
>>> I hope this helps at least a little bit, and that I'm not 
>>> misunderstanding the whole idea (not familiar with SIMD).
>>>
>>>
>>> On Wed, Mar 17, 2021 at 5:16 PM Zhi An Ng <zh...@chromium.org> wrote:
>>>
>>>> We did a little exploration on this when trying to optimize shuffles, 
>>>> see relevant design doc (
>>>> https://docs.google.com/document/d/1uCYwyQYjgNAtaXDNgHusDGCV1m9YGhOWJx2eqzv2rdI/edit).
>>>>  
>>>> We ultimately decided against it because we found another way to improve 
>>>> shuffle performance without adding a constant pool.
>>>>
>>>> +Brown, Andrew fyi, since you have interest in exploring a 128-bit 
>>>> constant pool as well.
>>>>
>>>> On Wed, Mar 17, 2021 at 5:19 AM Clemens Backes <clem...@chromium.org> 
>>>> wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> that sounds like an interesting idea. In fact, I considered 
>>>>> implementing a Wasm constant pool for floating point constants, but SIMD 
>>>>> code might benefit even more.
>>>>> Note that in Wasm code we do not have a root register (currently), so 
>>>>> any access to the isolate root is a few indirections away. I am also not 
>>>>> sure I fully understand how you plan to use external references to 
>>>>> refer to dynamically generated constants.
>>>>>
>>>>> An alternative to allocating on the heap might be allocating in (or 
>>>>> close to) the Wasm code space and using PC-relative addressing. For 
>>>>> security reasons, we should try to make the constant pool non-executable, 
>>>>> so if we decide to allocate *in* the Wasm code space (instead of close by) 
>>>>> we should reserve a whole page and remove execute permissions from that 
>>>>> page.
>>>>>
>>>>> I think a good next step would be starting a little design doc (see 
>>>>> https://v8.dev/docs/design-review-guidelines). I would propose to 
>>>>> initially include @Zhi An Ng and me, and we can add more people once 
>>>>> we figure out a working solution.
>>>>>
>>>>> Thanks for working on this!
>>>>>
>>>>> -Clemens
>>>>>
>>>>> On Tue, Mar 16, 2021 at 11:07 PM Dan Weber <dwe...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to approach the group to understand the feasibility of 
>>>>>> creating a root relative constant pool for WASM to support reuse of 
>>>>>> intermediate constants generated during complex instruction sequences.
>>>>>>
>>>>>> Right now, when a complex instruction sequence like shuffle or 
>>>>>> swizzle operates and cannot find an architectural match for the 
>>>>>> requested operation, it generates in-flight code to build shuffle masks, 
>>>>>> which are passed to pshufb on x64 and tbl on a64. Due to the current 
>>>>>> nature of the code generator / assembler, these can be regenerated 
>>>>>> multiple times even if the input is constant (
>>>>>> https://bugs.chromium.org/p/v8/issues/detail?id=11545).
>>>>>>
>>>>>> Two options exist for resolving this. 
>>>>>>
>>>>>> 1) Generating a constant pool for the code to use at compile time and 
>>>>>> loading the data from there.
>>>>>> 2) Lifting sections of multi instruction code up to the graph for 
>>>>>> optimization and reduction.
>>>>>>
>>>>>> Each is a partial solution since both can benefit from the other.
>>>>>>
>>>>>> What I'd like to enquire about now is the first option -- the 
>>>>>> feasibility of implementing an isolate root relative constant pool.
>>>>>>
>>>>>> From what I can see, this might be an easy and effective solution 
>>>>>> covering address moves, security concerns, and alignment. 
>>>>>>
>>>>>> Since isolate() has a single coherent heap that is garbage collected 
>>>>>> and moved (https://v8.dev/blog/embedded-builtins), one can allocate 
>>>>>> objects relative to the root. If you use it with the ExternalReference 
>>>>>> operand mechanism we've been using, it'll automatically generate all of 
>>>>>> the address constants relative to the root register (
>>>>>> https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/x64/macro-assembler-x64.cc;l=124;bpv=1;bpt=1).
>>>>>> This should preclude any concerns about addresses moving when the heap 
>>>>>> moves or gets reallocated. Likewise, isolate()->factory() provides 
>>>>>> mechanisms for aligning the pointers on each allocation. As such, if an 
>>>>>> external data structure such as a map or hash map is used to track the 
>>>>>> constants at code generation time, then each constant can be allocated 
>>>>>> individually on the isolate heap without respect to any other. If the 
>>>>>> heap persists the entire duration of the executed code and is deallocated 
>>>>>> at the end, then there are no memory management concerns. Lastly, there 
>>>>>> should be no security concerns since the data allocated on the isolate 
>>>>>> heap will not be executable by default.
>>>>>>
>>>>>> Is this correct? If so, what's the appropriate process for submitting 
>>>>>> and reviewing a design proposal?
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>> --
>>>>>
>>>>>
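
The PC-relative placement discussed in this thread comes down to two small calculations: aligning the pool to 16 bytes, and checking that a constant is reachable with a signed 32-bit displacement. The sketch below illustrates both under the assumption that all offsets are measured from the start of one contiguous region holding the code and the pool; AlignUp and RipRelativeDisplacement are hypothetical helpers, not V8 functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>

// Rounds `offset` up to the next multiple of `alignment`, which must be a
// power of two.
constexpr size_t AlignUp(size_t offset, size_t alignment) {
  return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper. On x64 a RIP-relative operand encodes a signed 32-bit
// displacement from the address of the *next* instruction, so
// `next_instruction_offset` is the offset just past the instruction that
// loads the constant. Returns std::nullopt if the constant is out of range.
std::optional<int32_t> RipRelativeDisplacement(size_t next_instruction_offset,
                                               size_t constant_offset) {
  const int64_t delta = static_cast<int64_t>(constant_offset) -
                        static_cast<int64_t>(next_instruction_offset);
  if (delta < std::numeric_limits<int32_t>::min() ||
      delta > std::numeric_limits<int32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<int32_t>(delta);
}
```

For example, if the code ends at offset 0x123, the pool would start at AlignUp(0x123, 16) = 0x130, and a load whose next instruction begins at 0x40 would reach the second pool entry (offset 0x140) with a displacement of 0x100. The same range check applies whether the pool sits inside the code object or on a separate non-executable page nearby.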

-- 
-- 
v8-dev mailing list
v8-dev@googlegroups.com
http://groups.google.com/group/v8-dev
To view this discussion on the web visit 
https://groups.google.com/d/msgid/v8-dev/c81c01ee-905a-4382-8691-092f6052cfb7n%40googlegroups.com.
