rnk added inline comments.

================
Comment at: clang/include/clang/CodeGen/CGFunctionInfo.h:91
   bool InAllocaSRet : 1;    // isInAlloca()
+  bool InAllocaIndirect : 1;// isInAlloca()
   bool IndirectByVal : 1;   // isIndirect()
----------------
rjmccall wrote:
> rnk wrote:
> > rnk wrote:
> > > rjmccall wrote:
> > > > Would it be better to handle `inalloca` differently, maybe as a flag 
> > > > rather than as a top-level kind?  I'm concerned about gradually 
> > > > duplicating a significant amount of the expressivity of other kinds.
> > > In the past, I've drafted more than one unfinished design for how we 
> > > could remodel inalloca with tokens so that it can be per-argument instead 
> > > of something that applies to all argument memory. Unfortunately, I never 
> > > found the time to finish or implement one.
> > > 
> > > As I was working on this patch, I was thinking to myself that this could 
> > > be the moment to implement one of those designs, but it would be pretty 
> > > involved. Part of the issue is that, personally, I have very little 
> > > interest in improving x86_32 code quality, so a big redesign wouldn't 
> > > deliver much benefit. The benefits would all be code simplifications and 
> > > maintenance cost reductions, which are nice, but seem to only get me 
> > > through the prototype design stage.
> > > 
> > > I'll go dig up my last doc and try to share it, but for now, I think we 
> > > have to suffer the extra inalloca code in this patch.
> > Here's what I wrote, with some sketches of possible LLVM IR that could 
> > replace inalloca:
> > https://reviews.llvm.org/P8183
> > 
> > The basic idea is that we need a call setup instruction that forms a region 
> > with the call. During CodeGen, we can look forward (potentially across BBs) 
> > to determine how much argument stack memory to allocate, allocate it 
> > (perhaps in pieces as we go along), and then skip the normal call stack 
> > argument adjustment during regular call lowering.
> > 
> > Suggestions for names better than "argmem" are welcome.
> > 
> > The major complication comes from edges that exit the call setup region. 
> > These could be exceptional exits or normal exits with statement expressions 
> > and break, return, or goto. Along these paths we need to adjust SP, or risk 
> > leaking stack memory. Today, with inalloca, I think we leak stack memory.
> > In the past, I've drafted more than one unfinished design for how we 
> > could remodel inalloca with tokens so that it can be per-argument instead 
> > of something that applies to all argument memory. Unfortunately, I never 
> > found the time to finish or implement one.
> 
> Sorry!  I think it would be great to rethink `inalloca` to avoid the 
> duplication and so on, but I certainly didn't mean to suggest that we should 
> do that as part of this patch.  (I'll look at your proposal later.)  I was 
> trying to ask if it would make sense to change how `inalloca` arguments are 
> represented by `ABIInfo`, so that we could e.g. build a normal indirect 
> `ABIInfo` and then flag that it also needs to be written into an `inalloca` 
> buffer.
I see. I think implementing that would require a larger refactoring of 
ClangToLLVMArgMapping. We'd need a new place to store the inalloca struct field 
index, which currently lives in a union with other data, so that arrangement 
would no longer work.
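For what it's worth, the conflict can be sketched with a toy tagged union in 
the general style of a per-argument ABI record (the names below are 
hypothetical, not Clang's actual ABIArgInfo or ClangToLLVMArgMapping types): a 
kind tag selects which union member is live, so an "indirect" record has 
nowhere to also carry an inalloca field index:

```c
#include <assert.h>

/* Hypothetical sketch, not Clang's real data structures: a per-argument
   record with a kind tag and a union whose active member depends on the
   kind.  Folding inalloca into the indirect kind as a flag would need
   both union members live at once, which a union cannot express. */
enum ArgKind { DirectArg, IndirectArg, InAllocaArg };

struct ArgInfo {
    enum ArgKind kind;
    union {
        unsigned indirect_align;     /* live when kind == IndirectArg */
        unsigned inalloca_field_idx; /* live when kind == InAllocaArg */
    } u;
};

/* Helper: build an inalloca record naming a struct field index. */
static struct ArgInfo make_inalloca(unsigned field_idx) {
    struct ArgInfo info;
    info.kind = InAllocaArg;
    info.u.inalloca_field_idx = field_idx;
    return info;
}
```

An "indirect plus inalloca flag" representation would need a second slot for 
the field index outside the union, which is the refactoring referred to above.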


================
Comment at: clang/test/CodeGen/x86_32-arguments-win32.c:77
+// CHECK-LABEL: define dso_local void @receive_vec_256(<8 x float> inreg %x, <8 x float> inreg %y, <8 x float> inreg %z, <8 x float>* %0, <8 x float>* %1)
+// CHECK-LABEL: define dso_local void @receive_vec_512(<16 x float> inreg %x, <16 x float> inreg %y, <16 x float> inreg %z, <16 x float>* %0, <16 x float>* %1)
+// CHECK-LABEL: define dso_local void @receive_vec_1024(<32 x float>* %0, <32 x float>* %1, <32 x float>* %2, <32 x float>* %3, <32 x float>* %4)
----------------
AntonYudintsev wrote:
> rjmccall wrote:
> > AntonYudintsev wrote:
> > > rjmccall wrote:
> > > > AntonYudintsev wrote:
> > > > > rjmccall wrote:
> > > > > > rnk wrote:
> > > > > > > rjmccall wrote:
> > > > > > > > rnk wrote:
> > > > > > > > > craig.topper wrote:
> > > > > > > > > > What happens in the backend with inreg if 512-bit vectors 
> > > > > > > > > > aren't legal?
> > > > > > > > > LLVM splits the vector up using the largest legal vector 
> > > > > > > > > size. As many pieces as possible are passed in available 
> > > > > > > > > XMM/YMM registers, and the rest are passed in memory. MSVC, 
> > > > > > > > > of course, assumes the user wanted the larger vector size, 
> > > > > > > > > and uses whatever vector instructions it needs to move the 
> > > > > > > > > arguments around.
> > > > > > > > > 
> > > > > > > > > Previously, I advocated for a model where calling an Intel 
> > > > > > > > > intrinsic function had the effect of implicitly marking the 
> > > > > > > > > caller with the target attributes of the intrinsic. This 
> > > > > > > > > falls down if the user tries to write a single function that 
> > > > > > > > > conditionally branches between code that uses different 
> > > > > > > > > instruction set extensions. You can imagine the SSE2 codepath 
> > > > > > > > > accidentally using AVX instructions because the compiler 
> > > > > > > > > thinks they are better. I'm told that ICC models CPU 
> > > > > > > > > micro-architectural features in the CFG, but I don't ever 
> > > > > > > > > expect that LLVM will do that. If we're stuck with 
> > > > > > > > > per-function CPU feature settings, it seems like it would be 
> > > > > > > > > nice to try to do what the user asked by default, and warn 
> > > > > > > > > the user if we see them doing a cpuid check in a function 
> > > > > > > > > that has been implicitly blessed with some target attributes. 
> > > > > > > > > You could imagine doing a similar thing when large vector 
> > > > > > > > > type variables are used: if a large vector argument or local 
> > > > > > > > > is used, implicitly enable the appropriate target features to 
> > > > > > > > > move vectors of that size around.
> > > > > > > > > 
> > > > > > > > > This idea didn't get anywhere, and the current situation has 
> > > > > > > > > persisted.
> > > > > > > > > 
> > > > > > > > > ---
> > > > > > > > > 
> > > > > > > > > You know, maybe we should just keep clang the way it is, and 
> > > > > > > > > just set up a warning in the backend that says "hey, I split 
> > > > > > > > > your large vector. You probably didn't want that." And then 
> > > > > > > > > we just continue doing what we do now. Nobody likes backend 
> > > > > > > > > warnings, but it seems better than the current direction of 
> > > > > > > > > the frontend knowing every detail of x86 vector extensions.
> > > > > > > > If target attributes affect ABI, it seems really dangerous to 
> > > > > > > > implicitly set attributes based on what intrinsics are called.
> > > > > > > > 
> > > > > > > > The local CPU-testing problem seems similar to the problems 
> > > > > > > > with local `#pragma STDC FENV_ACCESS` blocks that the 
> > > > > > > > constrained-FP people are looking into.  They both have a "this 
> > > > > > > > operation is normally fully optimizable, but we might need to 
> > > > > > > > be more careful in specific functions" aspect to them.  I 
> > > > > > > > wonder if there's a reasonable way to unify the approaches, or 
> > > > > > > > at least benefit from lessons learned.
> > > > > > > I agree, we wouldn't want intrinsic usage to change ABI. But, 
> > > > > > > does anybody actually want the behavior that LLVM implements 
> > > > > > > today where large vectors get split across registers and memory? 
> > > > > > > I think most users who pass a 512 bit vector want it to either be 
> > > > > > > passed in ZMM registers or fail to compile. Why do we even 
> > > > > > > support passing 1024 bit vectors? Could we make that an error?
> > > > > > > 
> > > > > > > Anyway, major redesigns aside, should clang do something when 
> > > > > > > large vectors are passed? Maybe we should warn here? Passing by 
> > > > > > > address is usually the safest way to pass something, so that's an 
> > > > > > > option. Implementing splitting logic in clang doesn't seem worth 
> > > > > > > it.
> > > > > > > I agree, we wouldn't want intrinsic usage to change ABI. But, 
> > > > > > > does anybody actually want the behavior that LLVM implements 
> > > > > > > today where large vectors get split across registers and memory?
> > > > > > 
> > > > > > I take it you're implying that the actual (Windows-only?) platform 
> > > > > > ABI doesn't say anything about this because other compilers don't 
> > > > > > allow large vectors.  How large are the vectors we do have ABI 
> > > > > > rules for?  Do they have the same problem as the SysV ABI, where 
> > > > > > the ABI rules are sensitive to compiler flags?
> > > > > > 
> > > > > > Anyway, I didn't realize the i386 Windows ABI *ever* used registers 
> > > > > > for arguments.  (Whether you can convince LLVM to do so for a 
> > > > > > function signature that Clang isn't supposed to emit for 
> > > > > > ABI-conforming functions is a separate story.)  You're saying it 
> > > > > > uses them for vectors?  Presumably up to some limit, and only when 
> > > > > > they're non-variadic arguments?
> > > > > > You're saying it uses them for vectors? Presumably up to some 
> > > > > > limit, and only when they're non-variadic arguments?
> > > > > 
> > > > > Sorry to cut in (I am the one who reported the issue, and I'm 
> > > > > looking forward to this patch being merged).
> > > > > Yes, the MSVC x86 Windows ABI uses three registers for the first 
> > > > > three arguments.
> > > > > 
> > > > > https://godbolt.org/z/PZ3dBa
> > > > Interesting.  And anything that would cause the stack to be used is 
> > > > still banned: passing more than 3 vectors or passing them as variadic 
> > > > arguments.
> > > > 
> > > > I guess they chose not to implement stack realignment when they 
> > > > implemented this, and rather than being stuck with a suboptimal ABI, 
> > > > they just banned the cases that would have required it.  Technically 
> > > > that means that they haven't committed to an ABI, so even though LLVM 
> > > > is perfectly happy to realign the stack when required, we shouldn't 
> > > > actually take advantage of that here, and instead we should honor the 
> > > > same restriction.
> > > As I mentioned in https://reviews.llvm.org/D71915 (and in 
> > > https://bugs.llvm.org/show_bug.cgi?id=44395),
> > > 
> > > there is at least one particular case where they do align the stack for 
> > > aligned arguments, although it is not directly a function call.
> > Oh, I see.  Then they *have* in fact implemented stack realignment, 
> > sometime between MSVC v19.10 and v19.14, so the ABI is now settled; I 
> > should've checked a more recent compiler than the one you linked.  So this 
> > is now a fairly standard in-registers-up-to-a-limit ABI, except that the 
> > limit is only non-zero for vectors.  (Highly-aligned struct types are 
> > passed indirectly.)
> Sorry for not being clear enough :(.
> 
> Yes, newer MSVC versions seem to have implemented stack realignment 
> universally.
> 
> The older one (19.10) only did that in particular cases (like the one I 
> mentioned in my original patch set).
To clarify, it's not actually stack realignment: they pass highly aligned 
things indirectly to avoid having to implement stack realignment 
(https://godbolt.org/z/XwKuyC). Realigning the stack for arguments would be 
hard for MSVC, because the 32-bit compiler always pushes arguments into memory 
individually, adjusting the stack as it goes along. My patch implements that 
logic in the frontend, which is necessary for overaligned aggregates anyway, 
if not for vectors.
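As a rough illustration of the pass-indirectly strategy (plain portable C, not 
MSVC-specific codegen; the type and function names here are made up): the 
caller copies the overaligned value into a temporary it controls and passes 
only its address, so no bytes with alignment above the pointer-sized stack 
slot are ever pushed directly:

```c
#include <string.h>

/* Stand-in for an overaligned type (e.g. a 256-bit vector). */
typedef struct { float data[8]; } Vec256;

/* Callee side of an "indirect" convention: the argument arrives as a
   pointer rather than as bytes in the argument area. */
static float sum_indirect(const Vec256 *arg) {
    float s = 0.0f;
    for (int i = 0; i < 8; i++)
        s += arg->data[i];
    return s;
}

/* Caller side: materialize an aligned temporary and pass its address.
   Only a pointer lands in the argument area, so the pushes never need
   more than pointer alignment. */
static float call_with_indirect_copy(const Vec256 *v) {
    Vec256 tmp;
    memcpy(&tmp, v, sizeof tmp);
    return sum_indirect(&tmp);
}
```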


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D72114/new/

https://reviews.llvm.org/D72114



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
