chrishkchris commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640075943


   > You are right. The waiting time cannot be included in the execution time 
of the operation. But for some operators that use two cuda streams, we 
determine which stream to record events based on the name of the operator. I 
think it's not an elegant scheme, any ideas about this?
   
   For time profiling, the ideal situation is that every buffered communicator 
operator uses only one CUDA stream. Two streams are a problem because one 
stream has to wait for the other. So I broke down most of the operations.
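
   To make the waiting-time point concrete, here is a toy host-side model in 
C++. It is not CUDA code and not what the PR does; `Stream`, `Launch`, 
`RecordEvent` and `WaitEvent` are hypothetical names that only imitate the 
ordering rules of streams and events. It is a sketch under the assumption that 
an event's timestamp is the moment all earlier work on its stream has finished.

```cpp
#include <algorithm>
#include <cassert>

// Toy model (hypothetical, not the CUDA API): a stream is a timeline whose
// clock is the time at which its last enqueued piece of work finishes.
struct Stream {
  double clock_ms = 0.0;
};

// Enqueue work that takes `ms` milliseconds on the stream.
inline void Launch(Stream& s, double ms) { s.clock_ms += ms; }

// An event recorded now fires once all earlier work on the stream is done.
inline double RecordEvent(const Stream& s) { return s.clock_ms; }

// The stream stalls until the event at time `t` has fired.
inline void WaitEvent(Stream& s, double t) {
  s.clock_ms = std::max(s.clock_ms, t);
}

// Two streams: the start event lands on s2 before s2 waits for s1,
// so the stall is counted as part of the operation.
inline double TimeWithTwoStreams(double pending_ms, double op_ms) {
  Stream s1, s2;
  Launch(s1, pending_ms);            // earlier work still running on s1
  double start = RecordEvent(s2);    // recorded before the wait
  WaitEvent(s2, RecordEvent(s1));    // s2 stalls until s1 is done
  Launch(s2, op_ms);                 // the operation we want to time
  double stop = RecordEvent(s2);
  return stop - start;               // wait + operation
}

// One stream: the start event is ordered after the earlier work,
// so only the operation itself is measured.
inline double TimeWithOneStream(double pending_ms, double op_ms) {
  Stream s;
  Launch(s, pending_ms);             // earlier work on the same stream
  double start = RecordEvent(s);
  Launch(s, op_ms);                  // the operation we want to time
  double stop = RecordEvent(s);
  return stop - start;               // operation only
}
```

   In this model, with 5 ms of earlier work and a 2 ms operation, the 
two-stream measurement includes the 5 ms stall, while the single-stream 
measurement reports only the 2 ms operation.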
   
   The only kernel I have not broken down yet is the sparse communication 
kernel. It is too long, so breaking it down is not included in this PR.
   
https://github.com/chrishkchris/singa/blob/SINGA-510_2/src/io/communicator.cc#L444
   
   My original plan for this PR was to record the fp32/fp16 communication time 
seamlessly. If breaking the large kernel down provides better time profiling 
for the sparse communication, it can be included in a future PR.
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

