400Ping opened a new issue, #702:
URL: https://github.com/apache/mahout/issues/702

   ### Summary
   While benchmarking the Mahout QDP Disk → GPU pipeline, `cuStreamSynchronize` 
is dominating the CUDA API time and significantly increasing end-to-end 
latency. We are synchronizing too frequently on the stream, which prevents 
overlap between I/O, H2D copies, and GPU compute.
   
   According to NVIDIA’s CUDA best practices, stream and device synchronization 
should be used sparingly: a stream sync blocks the host thread until all prior 
work in that stream completes, and a device sync waits on every stream. Called 
in a tight loop, either can easily become a major performance bottleneck.
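
To make the concern concrete, here is a minimal sketch of one way to avoid a full stream sync per chunk. It is not the current QDP code: `upload_and_scale`, `scale_kernel`, the two-slot staging scheme, and the `memcpy` standing in for the disk read are all hypothetical. It uses the CUDA runtime API (`cudaStreamSynchronize` and `cudaEventSynchronize` correspond to the driver-level `cuStreamSynchronize` and `cuEventSynchronize`). The host waits only on the event guarding the staging buffer it is about to reuse, and does a full sync once at the end.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         cudaGetErrorString(err_));                    \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Hypothetical stand-in for the real GPU work: scales a chunk in place.
__global__ void scale_kernel(float* data, size_t n, float factor) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: streams `total` floats from `src` into the device
// buffer `d_out` (allocated by the caller) in chunks of `chunk` elements.
void upload_and_scale(const float* src, float* d_out,
                      size_t total, size_t chunk) {
    assert(chunk > 0);
    constexpr int kSlots = 2;      // double-buffered pinned staging
    float* h_stage[kSlots];
    cudaEvent_t copied[kSlots];    // signals "H2D copy from slot s finished"
    cudaStream_t copy_stream, compute_stream;
    CUDA_CHECK(cudaStreamCreate(&copy_stream));
    CUDA_CHECK(cudaStreamCreate(&compute_stream));
    for (int s = 0; s < kSlots; ++s) {
        CUDA_CHECK(cudaMallocHost(reinterpret_cast<void**>(&h_stage[s]),
                                  chunk * sizeof(float)));
        CUDA_CHECK(cudaEventCreateWithFlags(&copied[s],
                                            cudaEventDisableTiming));
    }

    const size_t n_chunks = (total + chunk - 1) / chunk;
    for (size_t c = 0; c < n_chunks; ++c) {
        const int s = static_cast<int>(c % kSlots);
        const size_t off = c * chunk;
        const size_t n = std::min(chunk, total - off);

        // Wait only until the previous copy out of this slot is done
        // (returns immediately if the event was never recorded).
        CUDA_CHECK(cudaEventSynchronize(copied[s]));

        // Stand-in for the disk read into the pinned staging buffer.
        std::memcpy(h_stage[s], src + off, n * sizeof(float));

        CUDA_CHECK(cudaMemcpyAsync(d_out + off, h_stage[s],
                                   n * sizeof(float),
                                   cudaMemcpyHostToDevice, copy_stream));
        CUDA_CHECK(cudaEventRecord(copied[s], copy_stream));

        // The GPU, not the host, enforces "kernel after its copy".
        CUDA_CHECK(cudaStreamWaitEvent(compute_stream, copied[s], 0));
        const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
        scale_kernel<<<blocks, 256, 0, compute_stream>>>(d_out + off, n, 2.0f);
        CUDA_CHECK(cudaGetLastError());
    }

    // One full synchronization per stream at the end, not one per chunk.
    CUDA_CHECK(cudaStreamSynchronize(copy_stream));
    CUDA_CHECK(cudaStreamSynchronize(compute_stream));

    for (int s = 0; s < kSlots; ++s) {
        CUDA_CHECK(cudaEventDestroy(copied[s]));
        CUDA_CHECK(cudaFreeHost(h_stage[s]));
    }
    CUDA_CHECK(cudaStreamDestroy(copy_stream));
    CUDA_CHECK(cudaStreamDestroy(compute_stream));
}
```

The staging buffers are pinned because `cudaMemcpyAsync` from pageable host memory is not fully asynchronous. The event wait still blocks the host, but only until one earlier copy has finished, so the next read can be staged while the previous chunk's kernel is still running. Whether this helps in practice would need to be confirmed by profiling the actual pipeline.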
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

