cyx-6 opened a new pull request, #14608: URL: https://github.com/apache/tvm/pull/14608
In some models, the Q, K, and V inputs to attention ops come from a single stacked tensor: they are split and reshaped before the attention call, i.e. stacked_qkv -> split -> reshape -> attention. We can skip the split and reshape ops by manipulating the layout parameters in codegen. This PR adds such fused patterns for stacked attention in BYOC, so that we can codegen directly from stacked_qkv.
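As an illustrative sketch (plain NumPy, not the TVM API), the idea is that the split and reshape are pure layout operations, so Q, K, and V can be read straight out of the stacked buffer with offsets instead of materializing separate tensors. The shapes below are hypothetical:

```python
import numpy as np

batch, seq_len, num_heads, head_dim = 2, 4, 8, 16
stacked_qkv = np.random.rand(batch, seq_len, 3 * num_heads * head_dim)

# Path 1: explicit split -> reshape (the pattern this PR fuses away).
q, k, v = np.split(stacked_qkv, 3, axis=-1)
q = q.reshape(batch, seq_len, num_heads, head_dim)

# Path 2: read Q directly from the stacked buffer via an offset,
# with no intermediate split op -- the same data, just a different view.
q_direct = stacked_qkv[..., : num_heads * head_dim].reshape(
    batch, seq_len, num_heads, head_dim
)

assert np.array_equal(q, q_direct)
```

Matching the whole stacked_qkv -> split -> reshape -> attention chain as one pattern lets the BYOC backend hand the external attention kernel the stacked buffer plus the appropriate offsets/strides directly.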