Re: [PR] Update flash attention to integrate flash decoding with paged KV cache [tvm]

2024-01-29 Thread via GitHub
vinx13 commented on PR #16474: URL: https://github.com/apache/tvm/pull/16474#issuecomment-1915215166 We can move the GPU builder to a larger CPU instance if needed.

Re: [PR] Update flash attention to integrate flash decoding with paged KV cache [tvm]

2024-01-26 Thread via GitHub
masahi commented on PR #16474: URL: https://github.com/apache/tvm/pull/16474#issuecomment-1912756112 Oof. It's not surprising, since I added a new variant of the kernel (flash decoding) with yet another set of explicit template instantiations: https://github.com/tlc-pack/libflash_attn/pull/9
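For context: flash-attention-style libraries typically emit one explicit template instantiation per dtype / head-dim / variant combination, so adding a flash-decoding variant roughly doubles the number of kernels the builder has to compile. A minimal sketch of the pattern, using hypothetical names rather than the actual libflash_attn symbols:

```cpp
// Illustrative sketch only -- names and signatures are hypothetical,
// not the actual libflash_attn API.
#include <cstdio>

// One heavily templated kernel driver per (dtype, head-dim, variant) combination.
template <typename T, int kHeadDim, bool kIsFlashDecoding>
void run_mha_fwd() {
  // The real code would configure and launch a CUDA kernel here.
  std::printf("head_dim=%d, flash_decoding=%d\n", kHeadDim, int(kIsFlashDecoding));
}

// Each explicit instantiation is compiled separately into the library.
// Adding the flash-decoding variant (kIsFlashDecoding = true) doubles the
// number of instantiations, which is what drives CI compile time up.
template void run_mha_fwd<float, 64,  false>();
template void run_mha_fwd<float, 64,  true >();
template void run_mha_fwd<float, 128, false>();
template void run_mha_fwd<float, 128, true >();
```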

Re: [PR] Update flash attention to integrate flash decoding with paged KV cache [tvm]

2024-01-26 Thread via GitHub
vinx13 commented on PR #16474: URL: https://github.com/apache/tvm/pull/16474#issuecomment-1912754133 @masahi this is probably the case; it didn't happen for this kernel before, though.

Re: [PR] Update flash attention to integrate flash decoding with paged KV cache [tvm]

2024-01-26 Thread via GitHub
masahi commented on PR #16474: URL: https://github.com/apache/tvm/pull/16474#issuecomment-1912749542 @vinx13 Does this CI failure look like a compilation timeout? https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-gpu/detail/PR-16474/30/pipeline I remember you hit something like this before.

Re: [PR] Update flash attention to integrate flash decoding with paged KV cache [tvm]

2024-01-26 Thread via GitHub
masahi commented on PR #16474: URL: https://github.com/apache/tvm/pull/16474#issuecomment-1912677068

> LGTM! When passing paged KV cache, is there any assumption there? e.g., layout

Yes, I added shape and dtype requirements as comments.
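The authoritative constraints are the comments added in the PR itself; purely as an illustration of the kind of layout and dtype assumptions a paged KV cache entry point usually carries (names and shapes below are hypothetical, not the PR's actual contract):

```cpp
// Hypothetical illustration of paged-KV-cache layout assumptions; the
// authoritative shape/dtype requirements are the comments in the PR.
#include <cstdint>

struct PagedKVCacheArgs {
  // K/V caches stored as fixed-size pages:
  //   [num_pages, page_size, num_kv_heads, head_dim], fp16
  const void* k_cache;
  const void* v_cache;

  // Page table mapping each sequence to its pages:
  //   [batch_size, max_pages_per_seq], int32
  const int32_t* block_table;

  // Current KV length of each sequence: [batch_size], int32
  const int32_t* seqlens_k;
};
```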

[PR] Update flash attention to integrate flash decoding with paged KV cache [tvm]

2024-01-25 Thread via GitHub
masahi opened a new pull request, #16474: URL: https://github.com/apache/tvm/pull/16474 Flash attention recently added support for loading from paged KV cache in https://github.com/Dao-AILab/flash-attention/commit/54e80a3829c6d2337570d01e78ebd9529c02d342. The support was added to …
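At a high level, a paged KV cache stores keys and values in fixed-size pages and uses a per-sequence block table to translate a logical token position into a physical cache location, which is what lets the attention kernel read a non-contiguous cache. A minimal sketch of that indexing, assuming a [num_pages, page_size, num_kv_heads, head_dim] cache layout (illustrative only, not the kernel's actual code):

```cpp
// Illustrative indexing into a paged KV cache; the real kernel does this
// inside the CUDA attention loop, not as a standalone helper.
#include <cstdint>

// Returns the flat element offset of token `pos` of sequence `seq` and head
// `head` inside a K (or V) cache laid out as
// [num_pages, page_size, num_kv_heads, head_dim].
inline int64_t kv_cache_offset(const int32_t* block_table,  // [batch, max_pages_per_seq]
                               int64_t max_pages_per_seq,
                               int64_t page_size,
                               int64_t num_kv_heads,
                               int64_t head_dim,
                               int64_t seq, int64_t pos, int64_t head) {
  // Look up which physical page holds this logical position.
  const int64_t page   = block_table[seq * max_pages_per_seq + pos / page_size];
  const int64_t in_pos = pos % page_size;
  // Flatten (page, in_pos, head) into an element offset.
  return ((page * page_size + in_pos) * num_kv_heads + head) * head_dim;
}
```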