MasterJH5574 opened a new pull request, #16692:
URL: https://github.com/apache/tvm/pull/16692

   This PR adds copy stream separation to PagedKVCache. Specifically, 
for the CUDA and ROCm backends, we create a standalone copy stream for copying 
auxiliary data structures from CPU to GPU. We also move the copy from 
BeginForward to Attention, so it is no longer executed eagerly; instead, it is 
executed lazily when the Attention computation needs it.
   
   With these changes, the auxiliary data copy (on the copy stream) can overlap 
with the model forward computation that happens before the first Attention, 
which hides some of the copy latency.
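
   The lazy-copy part of this change can be illustrated with a toy model. This 
is only a sketch: `ToyPagedKVCache`, `begin_forward`, `attention`, and 
`_sync_aux_if_needed` are hypothetical names, not the actual TVM API, and a 
plain list copy stands in for what would be an asynchronous host-to-device copy 
on the copy stream.

```python
class ToyPagedKVCache:
    """Toy model of deferring the auxiliary-data copy until attention needs it."""

    def __init__(self):
        self.host_aux = None    # auxiliary data prepared on the CPU side
        self.device_aux = None  # stand-in for the GPU-side copy
        self.dirty = False      # True while host_aux has not been copied yet
        self.num_copies = 0     # number of copies actually issued

    def begin_forward(self, page_table):
        # Only record the new auxiliary data; no copy is issued here.
        self.host_aux = list(page_table)
        self.dirty = True

    def _sync_aux_if_needed(self):
        # Assumption: in a real backend this would be an async copy on a
        # dedicated copy stream, with the compute stream waiting on it.
        if self.dirty:
            self.device_aux = list(self.host_aux)
            self.num_copies += 1
            self.dirty = False

    def attention(self, layer_id):
        # The first attention call after begin_forward triggers the copy;
        # later calls reuse the already-copied data.
        self._sync_aux_if_needed()
        return (layer_id, tuple(self.device_aux))


cache = ToyPagedKVCache()
cache.begin_forward([3, 7, 9])
copies_before_attention = cache.num_copies

outputs = [cache.attention(layer) for layer in range(4)]
copies_after_attention = cache.num_copies
```

   In this model, `begin_forward` issues no copy, and the copy happens once, on 
the first `attention` call, no matter how many layers follow.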
   
   This PR also bumps the FlashInfer version to pick up its copy stream support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
