mbs-octoml opened a new pull request, #11631: URL: https://github.com/apache/tvm/pull/11631
(See https://discuss.tvm.apache.org/t/byoc-supporting-cutlass-byoc-with-collage/12796/6 for context, which in turn is part of Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md).)

Currently CUTLASS has four entry points:
- The usual 'partition_for_cutlass' partitioning function, using the standard pattern table and pass machinery (see cutlass/build.py).
- A 'tune_cutlass_kernels' function which augments CUTLASS partition functions with the results of building and running test kernels (see cutlass/build.py).
- A 'relay.ext.cutlass' external codegen function which inspects the tuning results and generates a CSourceModule for each partition (see cutlass/codegen.cc).
- A 'build_cutlass_kernels_vm' function which runs 'export_library' with all the nvcc compiler options needed to build all the CSourceModules (see cutlass/build.py).

For Collage we'd like CUTLASS to have only two entry points: 'partition_for_cutlass', and 'relay.ext.cutlass' or equivalent. This makes the CUTLASS external codegen integration composable with other integrations, which in turn helps Collage avoid having to understand any external codegen APIs other than the global pattern table and the custom compilation function/pass. Collage also tends to end up requiring multiple partitions for the same backend, since it is more aggressive at mixing-and-matching smaller sub-graphs between backends. Thus we'd also like to make sure all tuning, generated code and compilation overhead is shared between all such CUTLASS partitions.

So, in this PR:
- We add all the CUTLASS-specific tuning and compilation options as new Target attributes for the 'external codegen' "cutlass" TargetKind (cutlass/target.cc). The user now has one place to provide those settings, and we've already done the legwork to plumb the target instance.
- We replace 'relay.ext.cutlass' with a 'RelayToTIR' custom pass hook 'CompileForCutlass' (see cutlass/codegen.cc).
This pass can see all the CUTLASS partitions in the IRModule, so we can now share tuning results between them all and can be sure to generate a single CSourceModule. The pass can also invoke the compiler to yield a StaticModule, which we've also already done the legwork to support. In this way all CUTLASS-specific steps are handled at once.
- For convenience we supply 'finalize_modules' and 'finalize_modules_vm' which invoke nvcc for final linking (using export_library as usual). However, there's now nothing CUTLASS-specific in those helpers other than their overriding of the 'compiler' to be nvcc.
- test_cutlass.py is updated to use the new API.

Though this is a breaking change for existing users of the CUTLASS integration, the change is pretty minor, as shown in test_cutlass.py.
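
To illustrate why a module-level pass allows tuning to be shared, here is a toy model in plain Python. It is not TVM code: `Partition`, `tune`, `compile_for_cutlass_toy` and the `options` dict (including the `sm` key) are hypothetical stand-ins for CUTLASS partitions in an IRModule, kernel profiling, the 'CompileForCutlass' pass and the "cutlass" Target attributes respectively. The only point it tries to make is that a pass which sees every partition can tune each distinct problem once and emit a single source module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for one CUTLASS partition in the IRModule."""
    name: str
    op: str        # e.g. "dense" or "conv2d"
    shape: tuple   # problem size, used here as the tuning key


def tune(op, shape, log):
    """Pretend to profile kernels; log every (expensive) tuning run."""
    log.append((op, shape))
    return f"{op}_kernel_{'x'.join(map(str, shape))}"


def compile_for_cutlass_toy(partitions, options):
    """Toy module-level pass: sees all partitions, so tuning is shared."""
    tuning_log = []
    cache = {}  # (op, shape) -> chosen kernel name
    lines = [f"// built for sm_{options['sm']}"]
    for p in partitions:
        key = (p.op, p.shape)
        if key not in cache:
            cache[key] = tune(p.op, p.shape, tuning_log)
        lines.append(f"void {p.name}() {{ {cache[key]}(); }}")
    # One source string for all partitions, mimicking a single CSourceModule.
    return "\n".join(lines), tuning_log


partitions = [
    Partition("cutlass_0", "dense", (64, 128)),
    Partition("cutlass_1", "conv2d", (1, 3, 224, 224)),
    Partition("cutlass_2", "dense", (64, 128)),  # same problem as cutlass_0
]
source, tuning_log = compile_for_cutlass_toy(partitions, {"sm": 80})
print(source)
print(len(tuning_log), "tuning runs for", len(partitions), "partitions")
```

In this sketch the third partition reuses the result tuned for the first. A per-partition codegen hook cannot easily do that, because it only ever sees one function at a time.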