cyx-6 opened a new pull request, #14608:
URL: https://github.com/apache/tvm/pull/14608

   In some models, the Q, K and V inputs of attention ops originate from a single stacked tensor, which is then split and reshaped before calling the attention op, like
   
   stacked_qkv -> split -> reshape -> attention.
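   
   For concreteness, a minimal sketch of what such a subgraph might look like in Relax TVMScript (the shapes, dtype, and split axis below are illustrative assumptions, not taken from any particular model):
   
   ```python
   # Illustrative sketch only: shapes, dtype, and split axis are assumptions.
   import tvm
   from tvm.script import relax as R
   
   @tvm.script.ir_module
   class StackedQKV:
       @R.function
       def main(
           stacked_qkv: R.Tensor((4, 16, 3072), "float16")
       ) -> R.Tensor((4, 16, 32, 32), "float16"):
           with R.dataflow():
               # Split the stacked tensor into Q, K and V along the last axis.
               qkv = R.split(stacked_qkv, indices_or_sections=3, axis=2)
               # Reshape each piece to (batch, seq_len, num_heads, head_dim).
               q = R.reshape(qkv[0], (4, 16, 32, 32))
               k = R.reshape(qkv[1], (4, 16, 32, 32))
               v = R.reshape(qkv[2], (4, 16, 32, 32))
               out = R.nn.attention(q, k, v)
               R.output(out)
           return out
   ```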
   
   We can actually skip the split and reshape ops by manipulating the layout parameters in codegen.
   
   This PR adds such fused patterns for stacked attention in BYOC, so that we can codegen directly from stacked_qkv.
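   
   A rough sketch of how such a pattern could be expressed with Relax's dataflow pattern language is shown below; the function name and structure are hypothetical and not necessarily what this PR implements verbatim:
   
   ```python
   # Hypothetical sketch of a BYOC pattern for stacked attention;
   # the exact pattern set added by this PR may differ.
   from tvm.relax.dpl import is_op, is_tuple_get_item, wildcard
   
   def stacked_attention_pattern():
       stacked_qkv = wildcard()
       # A single split produces Q, K and V from the stacked tensor.
       qkv = is_op("relax.split")(stacked_qkv)
       q = is_op("relax.reshape")(is_tuple_get_item(qkv, 0), wildcard())
       k = is_op("relax.reshape")(is_tuple_get_item(qkv, 1), wildcard())
       v = is_op("relax.reshape")(is_tuple_get_item(qkv, 2), wildcard())
       out = is_op("relax.nn.attention")(q, k, v)
       # Return the root pattern plus the annotated stacked input, so the
       # BYOC codegen can recover the layout of the original stacked tensor.
       return out, {"stacked_qkv": stacked_qkv}
   ```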

