[PATCH] D155850: [Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload

Ronan Keryell via Phabricator via cfe-commits Tue, 08 Aug 2023 16:44:56 -0700

keryell added inline comments.


================
Comment at: clang/lib/CodeGen/CGBuiltin.cpp:5542
+  if (!getLangOpts().HIPStdPar)
+    ErrorUnsupported(E, "builtin function");
 
----------------
AlexVlx wrote:
> efriedma wrote:
> > AlexVlx wrote:
> > > efriedma wrote:
> > > > This doesn't make sense; we can't just ignore bits of the source code.  
> > > > I guess this is related to "the decision on their validity is 
> > > > deferred", but I don't see how you expect this to work.
> > > This is one of the weirder parts, so let's consider the following example:
> > > 
> > > ```cpp
> > > void foo() { __builtin_ia32_pause(); }
> > > void bar() { __builtin_trap(); }
> > > 
> > > void baz(const vector<int>& v) {
> > >     return for_each(par_unseq, cbegin(v), cend(v), [](auto&& x) { if (x 
> > > == 42) bar(); });
> > > }
> > > ```
> > > 
> > > In the case above, what we'd offload to the accelerator, and ask the 
> > > target BE to lower, is the implementation of `for_each`, and `bar`, 
> > > because it is reachable from the latter. `foo` is not reachable by any 
> > > execution path on the accelerator side, however it includes a builtin 
> > > that is unsupported by the accelerator (unless said accelerator is x86, 
> > > which is not impossible, but not something we're dealing with at the 
> > > moment). If we were to actually error out early, in the FE, in these 
> > > cases, there's almost no appeal to what is being proposed, because 
> > > standard headers, as well as other libraries, are littered with various 
> > > target specific builtins that are not going to be supported. This all 
> > > builds on the core invariant of this model / extension / thingamabob, 
> > > which is that the algorithms, and only the algorithms, are targets for 
> > > offload. It thus follows that as long as code that is reachable from an 
> > > algorithm's implementation is safe, all is fine, but we cannot know this 
> > > in the FE / on an AST level, because we need the actual CFG. This part is 
> > > handled in LLVM in the `SelectAcceleratorCodePass` that's in the last 
> > > patch in this series.
> > > 
> > > Now, you will quite correctly observe that there's nothing preventing an 
> > > user from calling `foo` in the callable they pass to an algorithm; they 
> > > might read the docs / appreciate that this won't work, but even there 
> > > they are not safe, because there via some opaque call chain they might 
> > > end up touching some unsupported builtin. My intuition here, which is 
> > > reflected above in letting builtins just flow through, is that such cases 
> > > are better served with a compile time error, which is what will obtain 
> > > once the target BE chokes trying to lower an unsupported builtin. It's 
> > > not going to be a beautiful error, and we could probably prettify it 
> > > somewhat if we were to check after we've done the accelerator code 
> > > selection pass, but it will happen at compile time. Another solution 
> > > would be to emit these as traps (poison + trap for value returning ones), 
> > > but I am concerned that it would lead to really fascinating debug 
> > > journeys.
> > > 
> > > Having said this, if there's a better way to deal with these scenarios, 
> > > it would be rather nice. Similarly, if the above doesn't make sense, 
> > > please let me know.
> > > 
> > Oh, I see; you "optimistically" compile everything assuming it might run on 
> > the accelerator, then run LLVM IR optimizations, then determine late which 
> > bits of code will actually run on the accelerator, which then prunes the 
> > code which shouldn't run.
> > 
> > I'm not sure I really like this... would it be possible to infer which 
> > functions need to be run on the accelerator based on the AST?  I mean, if 
> > your API takes a lambda expression that runs on the accelerator, you can 
> > mark the lambda's body as "must be emitted for GPU", then recursively mark 
> > all the functions referred to by the lambda.
> > 
> > Emiting errors lazily from the backend means you get different diagnostics 
> > depending on the optimization level.
> > 
> > If you do go with this codegen-based approach, it's not clear to me how you 
> > detect that a forbidden builtin was called; if you skip the error handling, 
> > you just get a literal "undef".
> `I'm not sure I really like this...` - actually, I am not a big fan either, 
> however I think it's about the best one can do, given the constraints 
> (consume standard C++, no annotations on the user side etc.). Having tried a 
> few times in the past (and at least once in a different compiler), I don't 
> quite think this can be done on an AST level. It would add some fairly 
> awkward checking during template instantiation (no way to know earlier that a 
> `CallableFoo` was passed to an offloadable algorithm), and it's a bit 
> unwieldy to basically compute the CFG on the AST and mark reachable Callees 
> at that point. Ignoring those, the main reason for which we cannot do this is 
> that the interface is not constrained to only take lambdas, but callables in 
> general, and that includes pointers to function as well. We don't deal with 
> those today, but plan to, and there's a natural solution when operating on 
> IR, assuming closed / internalised Modules (which is the case at least for 
> AMDGPU at the moment). The final challenge pertains to the AST being per TU, 
> with no cross-TU visibility, whereas with IR you can either pre-link the BC 
> (implicitly or LTO) and then operate on the entire compilation. This is a 
> problem with cases where `foo` defined in TU0 is reachable from 
> `algorithm_bar_offloaded_impl` in TU1. So TL;DR, I think it would be more 
> complex to do this on the AST and would end up more brittle / less future 
> proof.
> 
> In what regards how to do deferred diagnostics, it think it can be done like 
> this (I crossed streams in my prior reply when discussing this part, so it's 
> actually nonsense): instead of emitting undef here, we can emit a builtin 
> with the same signature, but with the name suffixed with e.g. 
> (`__stdpar_unsupported`) or something similar. Then, when doing the 
> reachability computation later, if we stumble upon a node in the CFG that 
> contains a builtin suffixed with `__stdpar_unsupported` we error out, and can 
> provide nice diagnostics since we'd have the call-chain handy. Thoughts?
There is a lot of interesting design information in this discussion thread 
which will be lost forever after this is merged.
Is there a way to keep a summary as a comment somewhere to help the future 
readers/maintainers/historians?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155850/new/

https://reviews.llvm.org/D155850

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D155850: [Clang][CodeGen][RFC] Add codegen support for C++ Parallel Algorithm Offload

Reply via email to