masahi opened a new pull request, #16474: URL: https://github.com/apache/tvm/pull/16474
FlashAttention recently added support for loading from a paged KV cache in https://github.com/Dao-AILab/flash-attention/commit/54e80a3829c6d2337570d01e78ebd9529c02d342. The support was added to the Flash-Decoding kernel, which we haven't used so far. This PR lets us use Flash-Decoding with its paged KV cache support from TVM. We already use other kernels from FlashAttention via BYOC, but due to the specialized nature of this kernel, it is supported as a contrib kernel (similar to vLLM).

@vinx13 @sunggg
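For illustration, here is a minimal sketch of how a contrib kernel like this is typically invoked from Relax as a packed function; the registered function name `tvm.contrib.flash_attn.flash_decoding_with_paged_kvcache`, the argument order, and the tensor shapes below are assumptions for exposition, not the exact API added by this PR:

```python
# Hypothetical sketch of calling a contrib Flash-Decoding kernel via
# R.call_dps_packed. The function name, argument layout, and shapes are
# illustrative assumptions, not the exact interface introduced here.
from tvm.script import relax as R


@R.function
def decode_attention(
    q: R.Tensor((1, 1, 32, 128), "float16"),  # [batch, seqlen_q=1, heads, head_dim]
    k_cache: R.Tensor(("num_blocks", 16, 32, 128), "float16"),  # paged K cache blocks
    v_cache: R.Tensor(("num_blocks", 16, 32, 128), "float16"),  # paged V cache blocks
    block_tables: R.Tensor((1, "max_blocks_per_seq"), "int32"),  # page table per sequence
    context_lens: R.Tensor((1,), "int32"),  # current length of each sequence
) -> R.Tensor((1, 1, 32, 128), "float16"):
    # Dispatch to the externally registered kernel in destination-passing style.
    out = R.call_dps_packed(
        "tvm.contrib.flash_attn.flash_decoding_with_paged_kvcache",  # assumed name
        (q, k_cache, v_cache, block_tables, context_lens),
        out_sinfo=R.Tensor((1, 1, 32, 128), "float16"),
    )
    return out
```

Exposing the kernel this way, rather than through BYOC pattern matching, fits its specialized signature: the page table and per-sequence lengths are runtime arguments that don't map cleanly onto the attention patterns the BYOC path rewrites.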