400Ping opened a new issue, #702: URL: https://github.com/apache/mahout/issues/702
### Summary

While benchmarking the Mahout QDP Disk → GPU pipeline, `cuStreamSynchronize` dominates the CUDA API time and significantly increases end-to-end latency. We synchronize on the stream too frequently, which prevents overlap between I/O, H2D copies, and GPU compute.

According to NVIDIA's CUDA best practices, stream and device synchronization should be used sparingly, because each sync blocks the host thread until all prior work in the stream completes. Called in a tight loop, this can easily become a major performance bottleneck.
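To make the cost of per-chunk synchronization concrete, here is a rough back-of-the-envelope model, not a measurement. It compares a loop that syncs after every chunk (so disk read, H2D copy, and compute run strictly one after another) with an idealized three-stage pipeline that syncs only once at the end. The function names and the stage durations are hypothetical and chosen only for illustration. The pipelined figure assumes enough staging buffers and that disk, copy engine, and compute can all run concurrently, so it is a best case rather than a prediction.

```python
def serialized_ms(n_chunks, io_ms, h2d_ms, compute_ms):
    """Sync after every chunk: no stage overlaps with any other."""
    return n_chunks * (io_ms + h2d_ms + compute_ms)


def pipelined_ms(n_chunks, io_ms, h2d_ms, compute_ms):
    """Ideal 3-stage pipeline with a single final sync.

    The first chunk pays the full latency of all three stages; every
    further chunk costs only as much as the slowest stage.
    """
    if n_chunks == 0:
        return 0.0
    fill = io_ms + h2d_ms + compute_ms
    bottleneck = max(io_ms, h2d_ms, compute_ms)
    return fill + (n_chunks - 1) * bottleneck


if __name__ == "__main__":
    # Hypothetical per-chunk costs in milliseconds (assumed, not benchmarked).
    n, io, h2d, comp = 100, 4.0, 2.0, 3.0
    s = serialized_ms(n, io, h2d, comp)
    p = pipelined_ms(n, io, h2d, comp)
    print(f"sync per chunk: {s:.1f} ms")
    print(f"pipelined:      {p:.1f} ms")
    print(f"speedup bound:  {s / p:.2f}x")
```

In this model the achievable gain is capped by the slowest stage, so a profile of the real pipeline is still needed to see which stage that is before restructuring the synchronization.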
