This is an automated email from the ASF dual-hosted git repository.

richhuang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/mahout.git


The following commit(s) were added to refs/heads/main by this push:
     new d811c26b4 [QDP] Update NVTX workflow docs for new async pipeline 
capture (#939)
d811c26b4 is described below

commit d811c26b4091a633c12f7591a7e9c0c7b04ac9a5
Author: KUAN-HAO HUANG <[email protected]>
AuthorDate: Tue Jan 27 12:31:57 2026 +0800

    [QDP] Update NVTX workflow docs for new async pipeline capture (#939)
    
    * Update NVTX workflow docs for new async pipeline capture
    
    * remove redundant part
---
 qdp/docs/observability/NVTX_USAGE.md  | 87 ++++++++++++++---------------------
 qdp/qdp-core/examples/nvtx_profile.rs | 12 +++--
 qdp/qdp-core/src/gpu/pipeline.rs      | 15 ++++--
 3 files changed, 55 insertions(+), 59 deletions(-)

diff --git a/qdp/docs/observability/NVTX_USAGE.md 
b/qdp/docs/observability/NVTX_USAGE.md
index a4fe92ee1..179f1ce09 100644
--- a/qdp/docs/observability/NVTX_USAGE.md
+++ b/qdp/docs/observability/NVTX_USAGE.md
@@ -4,19 +4,14 @@
 
 NVTX (NVIDIA Tools Extension) provides performance markers visible in Nsight 
Systems. This project uses zero-cost macros that compile to no-ops when the 
`observability` feature is disabled.
 
-## Build with NVTX
+## Run the NVTX Example
 
-Default builds exclude NVTX for zero overhead. Enable profiling with:
+Default builds exclude NVTX for zero overhead. The example below uses the
+async pipeline workload (large input) to surface the new pipeline markers.
 
 ```bash
 cd mahout/qdp
-cargo build -p qdp-core --example nvtx_profile --features observability 
--release
-```
-
-## Run Example
-
-```bash
-./target/release/examples/nvtx_profile
+cargo run -p qdp-core --example nvtx_profile --features observability --release
 ```
 
 **Expected output:**
@@ -24,7 +19,7 @@ cargo build -p qdp-core --example nvtx_profile --features 
observability --releas
 === NVTX Profiling Example ===
 
 ✓ Engine initialized
-✓ Created test data: 1024 elements
+✓ Created test data: 262144 elements
 
 Starting encoding (NVTX markers will appear in Nsight Systems)...
 Expected NVTX markers:
@@ -32,9 +27,10 @@ Expected NVTX markers:
   - CPU::L2Norm
   - GPU::Alloc
   - GPU::H2DCopy
-  - GPU::KernelLaunch
-  - GPU::Synchronize
-  - DLPack::Wrap
+  - GPU::CopyEventRecord
+  - GPU::H2D_Stage
+  - GPU::Kernel
+  - GPU::ComputeSync
 
 ✓ Encoding succeeded
 ✓ DLPack pointer: 0x558114be6250
@@ -45,11 +41,15 @@ Expected NVTX markers:
 
 ## Profile with Nsight Systems
 
+Focus capture on the `Mahout::Encode` range (recommended):
+
 ```bash
-nsys profile --trace=cuda,nvtx -o report ./target/release/examples/nvtx_profile
+nsys profile --trace=cuda,nvtx --capture-range=nvtx \
+  --nvtx-capture=Mahout::Encode --force-overwrite=true -o nvtx-workflow \
+  cargo run -p qdp-core --example nvtx_profile --features observability 
--release
 ```
 
-This generates `report.nsys-rep` and `report.sqlite`.
+This generates `nvtx-workflow.nsys-rep` and `nvtx-workflow.sqlite`.
 
 ## Viewing Results
 
@@ -64,48 +64,18 @@ nsys-ui report.nsys-rep
 In the GUI timeline view, you will see:
 - Colored blocks for each NVTX marker
 - CPU timeline showing `CPU::L2Norm`
-- GPU timeline showing `GPU::Alloc`, `GPU::H2DCopy`, `GPU::Kernel`
+- GPU timeline showing `GPU::Alloc`, `GPU::H2DCopy`, `GPU::CopyEventRecord`, 
`GPU::H2D_Stage`, `GPU::Kernel`, `GPU::ComputeSync`
 - Overall workflow covered by `Mahout::Encode`
 
 ### Command Line Statistics
 
-View summary statistics:
+NVTX range summary:
 
 ```bash
-nsys stats report.nsys-rep
-```
-
-**Example NVTX Range Summary output:**
-```
-Time (%)  Total Time (ns)  Instances    Avg (ns)      Med (ns)     Min (ns)    
Max (ns)   StdDev (ns)   Style        Range
---------  ---------------  ---------  ------------  ------------  ----------  
----------  -----------  --------  --------------
-   50.0       11,207,505          1  11,207,505.0  11,207,505.0  11,207,505  
11,207,505          0.0  StartEnd  Mahout::Encode
-   48.0       10,759,758          1  10,759,758.0  10,759,758.0  10,759,758  
10,759,758          0.0  StartEnd  GPU::Alloc
-    1.8          413,753          1     413,753.0     413,753.0     413,753    
 413,753          0.0  StartEnd  CPU::L2Norm
-    0.1           15,873          1      15,873.0      15,873.0      15,873    
  15,873          0.0  StartEnd  GPU::H2DCopy
-    0.0              317          1         317.0         317.0         317    
     317          0.0  StartEnd  GPU::KernelLaunch
+nsys stats --report nvtx_sum nvtx-workflow.nsys-rep
 ```
 
-The output shows:
-- Time percentage for each operation
-- Total time in nanoseconds
-- Number of instances
-- Average, median, min, max execution times
-
-**CUDA API Summary** shows detailed CUDA call statistics:
-
- Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   
Max (ns)   StdDev (ns)          Name
- --------  ---------------  ---------  -----------  -----------  --------  
----------  -----------  --------------------
-     99.2       11,760,277          2  5,880,138.5  5,880,138.5     2,913  
11,757,364  8,311,652.0  cuMemAllocAsync
-      0.4           45,979          2     22,989.5     22,989.5     7,938      
38,041     21,286.0  cuMemcpyHtoDAsync_v2
-      0.1           14,722          1     14,722.0     14,722.0    14,722      
14,722          0.0  cuEventCreate
-      0.1           13,100          3      4,366.7      3,512.0       861      
 8,727      4,002.0  cuStreamSynchronize
-      0.1            9,468         11        860.7        250.0       114      
 4,671      1,453.3  cuCtxSetCurrent
-      0.1            6,479          1      6,479.0      6,479.0     6,479      
 6,479          0.0  cuEventDestroy_v2
-      0.0            4,599          2      2,299.5      2,299.5     1,773      
 2,826        744.6  cuMemFreeAsync
-- Memory allocation (`cuMemAllocAsync`)
-- Memory copies (`cuMemcpyHtoDAsync_v2`)
-- Stream synchronization (`cuStreamSynchronize`)
+Note: very short pipeline ranges may be easier to verify in the GUI timeline.
 
 ## NVTX Markers
 
@@ -115,9 +85,13 @@ The following markers are tracked:
 - `CPU::L2Norm` - L2 normalization on CPU
 - `GPU::Alloc` - GPU memory allocation
 - `GPU::H2DCopy` - Host-to-device memory copy
-- `GPU::KernelLaunch` - CPU-side kernel launch
-- `GPU::Synchronize` - CPU waiting for GPU completion
-- `DLPack::Wrap` - Conversion to DLPack pointer
+- `GPU::Kernel` - Kernel execution
+
+The following pipeline ranges are also used where applicable:
+
+- `GPU::CopyEventRecord` - Record copy completion event
+- `GPU::H2D_Stage` - Host staging copy into pinned buffer
+- `GPU::ComputeSync` - Compute stream synchronization
 
 ## Using Profiling Macros
 
@@ -146,3 +120,12 @@ Source code: `qdp-core/examples/nvtx_profile.rs`
 
 **nsys warnings:**
 Warnings about CPU sampling are normal and can be ignored. They do not affect 
NVTX marker recording.
+
+## Official Docs
+
+- CUDA Runtime Profiler Control:
+  https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__PROFILER.html
+- Nsight Systems User Guide (v2026.1):
+  https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+- Nsight Compute Profiling Guide (latest release):
+  https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html
diff --git a/qdp/qdp-core/examples/nvtx_profile.rs 
b/qdp/qdp-core/examples/nvtx_profile.rs
index 3e5c0c050..87ceeff2b 100644
--- a/qdp/qdp-core/examples/nvtx_profile.rs
+++ b/qdp/qdp-core/examples/nvtx_profile.rs
@@ -35,8 +35,11 @@ fn main() {
         }
     };
 
-    // Create test data
-    let data: Vec<f64> = (0..1024).map(|i| (i as f64) / 1024.0).collect();
+    // Create test data (large enough to trigger async pipeline)
+    let data_len: usize = 262_144; // 2MB of f64, exceeds async threshold
+    let data: Vec<f64> = (0..data_len)
+        .map(|i| (i as f64) / (data_len as f64))
+        .collect();
     println!("✓ Created test data: {} elements", data.len());
     println!();
 
@@ -46,11 +49,14 @@ fn main() {
     println!("  - CPU::L2Norm");
     println!("  - GPU::Alloc");
     println!("  - GPU::H2DCopy");
+    println!("  - GPU::CopyEventRecord");
+    println!("  - GPU::H2D_Stage");
     println!("  - GPU::Kernel");
+    println!("  - GPU::ComputeSync");
     println!();
 
     // Perform encoding (this will trigger NVTX markers)
-    match engine.encode(&data, 10, "amplitude") {
+    match engine.encode(&data, 18, "amplitude") {
         Ok(ptr) => {
             println!("✓ Encoding succeeded");
             println!("✓ DLPack pointer: {:p}", ptr);
diff --git a/qdp/qdp-core/src/gpu/pipeline.rs b/qdp/qdp-core/src/gpu/pipeline.rs
index 484cbc7cf..26874ab3b 100644
--- a/qdp/qdp-core/src/gpu/pipeline.rs
+++ b/qdp/qdp-core/src/gpu/pipeline.rs
@@ -125,6 +125,7 @@ impl PipelineContext {
     /// `slot` must refer to a live event created by this context, and the 
context must
     /// remain alive until the event is no longer used by any stream.
     pub unsafe fn record_copy_done(&self, slot: usize) -> Result<()> {
+        crate::profile_scope!("GPU::CopyEventRecord");
         validate_event_slot(&self.events_copy_done, slot)?;
 
         let ret = cudaEventRecord(
@@ -283,7 +284,10 @@ where
 
         // Acquire pinned staging buffer and populate it with the current chunk
         let mut pinned_buf = pinned_pool.acquire();
-        pinned_buf.as_slice_mut()[..chunk.len()].copy_from_slice(chunk);
+        {
+            crate::profile_scope!("GPU::H2D_Stage");
+            pinned_buf.as_slice_mut()[..chunk.len()].copy_from_slice(chunk);
+        }
 
         // Async copy: host to device (non-blocking, on specified stream)
         // Uses CUDA Runtime API (cudaMemcpyAsync) for true async copy
@@ -335,9 +339,12 @@ where
         unsafe {
             ctx.sync_copy_stream()?;
         }
-        device
-            .wait_for(&ctx.stream_compute)
-            .map_err(|e| MahoutError::Cuda(format!("Compute stream sync 
failed: {:?}", e)))?;
+        {
+            crate::profile_scope!("GPU::ComputeSync");
+            device
+                .wait_for(&ctx.stream_compute)
+                .map_err(|e| MahoutError::Cuda(format!("Compute stream sync 
failed: {:?}", e)))?;
+        }
     }
 
     // Buffers are dropped here (after sync), freeing GPU memory

Reply via email to