rich7420 commented on code in PR #751:
URL: https://github.com/apache/mahout/pull/751#discussion_r2658926866


##########
qdp/qdp-core/src/lib.rs:
##########
@@ -221,38 +228,13 @@ impl QdpEngine {
                     "Sample size cannot be zero".into(),
                 ));
             }
-            if sample_size > STAGE_SIZE_ELEMENTS {
-                return Err(MahoutError::InvalidInput(format!(
-                    "Sample size {} exceeds staging buffer capacity {} (elements)",
-                    sample_size, STAGE_SIZE_ELEMENTS
-                )));
-            }
-
-            // Reuse a single norm buffer across chunks to avoid per-chunk allocations.
-            //
-            // Important: the norm buffer must outlive the async kernels that consume it.
-            // Per-chunk allocation + drop can lead to use-after-free when the next chunk
-            // reuses the same device memory while the previous chunk is still running.

Review Comment:
   I think there is a potential problem in `qdp/qdp-core/src/lib.rs`. In the `encode_from_parquet()` function, there appears to be a use-after-free bug in the lifetime management of `norm_buffer`. The code allocates `norm_buffer` inside the `BatchEncode` scope at lines 331-339 and launches the `launch_l2_norm_batch` and `launch_amplitude_encode_batch` kernels at lines 343-375; both execute asynchronously on `ctx.stream_compute`. However, when the `BatchEncode` scope ends at line 376, `norm_buffer` is dropped immediately, and according to the comment in `pipeline.rs:336`, dropping a `CudaSlice` immediately calls `cudaFree` on the GPU memory. The problem is that `ctx.sync_copy_stream()` at line 378 only synchronizes the copy stream, not the compute stream, so `norm_buffer` may be freed while kernels on the compute stream are still executing, leaving those kernels accessing GPU memory that has already been freed. Even though kernels execute sequentially on the same stream, a long-running kernel can still outlive the drop of `norm_buffer`. In a loop processing multiple chunks, the first chunk's `norm_buffer` is dropped before its kernel completes; the second chunk's kernel will wait for the first to finish, but by then the first kernel has already been accessing freed memory. The old code's comment explicitly warned about this issue: the norm buffer must outlive the async kernels that consume it, and per-chunk allocation plus drop can lead to use-after-free. Please correct me if I'm wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
