On 10/28/25 6:50 AM, Chaitanya Kulkarni wrote:
> The ftrace blktrace path allocates buffers and writes trace events but
> was using the wrong recording function. After
> commit 4d8bc7bd4f73 ("blktrace: move ftrace blk_io_tracer to blk_io_trace2"),
> the ftrace interface was moved to use blk_io_trace2 format, but
> __blk_add_trace() still called record_blktrace_event() which writes in
> blk_io_trace (v1) format.
>
> This causes critical data corruption:
>
> - blk_io_trace (v1) has 32-bit 'action' field at offset 28
> - blk_io_trace2 (v2) has 32-bit 'pid' at offset 28 and 64-bit 'action'
>    at offset 32
> - When record_blktrace_event() writes to a v2 buffer:
>    * Writing pid (offset 32 in v1) corrupts the v2 action field
>    * Writing action (offset 28 in v1) corrupts the v2 pid field
>    * The 64-bit action is truncated to 32-bit via lower_32_bits()
>
> Fix by:
> 1. Adding version switch to select correct format (v1 vs v2)
> 2. Calling appropriate recording function based on version
> 3. Defaulting to v2 for ftrace (as intended by commit 4d8bc7bd4f73)
> 4. Adding WARN_ONCE for unexpected version values
>
> Without this patch :-
> linux-block (for-next) # sh reproduce_blktrace_bug.sh
>                dd-14242   [033] d..1.  3903.022308: Unknown action 36a2
>                dd-14242   [033] d..1.  3903.022333: Unknown action 36a2
>                dd-14242   [033] d..1.  3903.022365: Unknown action 36a2
>                dd-14242   [033] d..1.  3903.022366: Unknown action 36a2
>                dd-14242   [033] d..1.  3903.022369: Unknown action 36a2
>
> The action field is corrupted because:
>    - ftrace allocated blk_io_trace2 buffer (64 bytes)
>    - But called record_blktrace_event() (writes v1, 48 bytes)
>    - Field offsets don't match, causing corruption
>
> The hex value shown 0x30e3 is actually a PID, not an action code!
>
> linux-block (for-next) #
> linux-block (for-next) #
> linux-block (for-next) # sh reproduce_blktrace_bug.sh
> Trace output looks correct:
>
>                dd-2420    [019] d..1.    59.641742: 251,0    Q  RS 0 + 8 [dd]
>                dd-2420    [019] d..1.    59.641775: 251,0    G  RS 0 + 8 [dd]
>                dd-2420    [019] d..1.    59.641784: 251,0    P   N [dd]
>                dd-2420    [019] d..1.    59.641785: 251,0    U   N [dd] 1
>                dd-2420    [019] d..1.    59.641788: 251,0    D  RS 0 + 8 [dd]
>
> Fixes: 4d8bc7bd4f73 ("blktrace: move ftrace blk_io_tracer to blk_io_trace2")
> Signed-off-by: Chaitanya Kulkarni <[email protected]>
> ---
>   kernel/trace/blktrace.c | 59 +++++++++++++++++++++++++++++++++++++----
>   1 file changed, 54 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 1a83e03255ce..4097a288c235 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -384,16 +384,65 @@ static void __blk_add_trace(struct blk_trace *bt, 
> sector_t sector, int bytes,
>   
>               buffer = blk_tr->array_buffer.buffer;
>               trace_ctx = tracing_gen_ctx_flags(0);
> -             trace_len = sizeof(struct blk_io_trace2) + pdu_len + cgid_len;
> +             switch (bt->version) {
> +             case 1:
> +                     trace_len = sizeof(struct blk_io_trace);
> +                     break;
> +             case 2:
> +             default:
> +                     /*
> +                      * ftrace always uses v2 (blk_io_trace2) format.
> +                      *
> +                      * For sysfs-enabled tracing path (enabled via
> +                      * /sys/block/DEV/trace/enable), blk_trace_setup_queue()
> +                      * never initializes bt->version, leaving it 0 from
> +                      * kzalloc(). We must handle version==0 safely here.
> +                      *
> +                      * Fall through to default to ensure we never hit the
> +                      * old bug where default set trace_len=0, causing
> +                      * buffer underflow and memory corruption.
> +                      *
> +                      * Always use v2 format for ftrace and normalize
> +                      * bt->version to 2 when uninitialized.
> +                      */
> +                     trace_len = sizeof(struct blk_io_trace2);
> +                     if (bt->version == 0)
> +                             bt->version = 2;
> +                     break;
> +             }
> +             trace_len += pdu_len + cgid_len;
>               event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
>                                                 trace_len, trace_ctx);
>               if (!event)
>                       return;
>   
> -             record_blktrace_event(ring_buffer_event_data(event),
> -                                   pid, cpu, sector, bytes,
> -                                   what, bt->dev, error, cgid, cgid_len,
> -                                   pdu_data, pdu_len);
> +             switch (bt->version) {
> +             case 1:
> +                     record_blktrace_event(ring_buffer_event_data(event),
> +                                           pid, cpu, sector, bytes,
> +                                           what, bt->dev, error, cgid, 
> cgid_len,
> +                                           pdu_data, pdu_len);
> +                     break;
> +             case 2:
> +             default:
> +                     /*
> +                      * Use v2 recording function (record_blktrace_event2)
> +                      * which writes blk_io_trace2 structure with correct
> +                      * field layout:
> +                      *   - 32-bit pid at offset 28
> +                      *   - 64-bit action at offset 32
> +                      *
> +                      * Fall through to default handles version==0 case
> +                      * (from sysfs path), ensuring we always use correct
> +                      * v2 recording function to match the v2 buffer
> +                      * allocated above.
> +                      */
> +                     record_blktrace_event2(ring_buffer_event_data(event),
> +                                            pid, cpu, sector, bytes,
> +                                            what, bt->dev, error, cgid, 
> cgid_len,
> +                                            pdu_data, pdu_len);
> +                     break;
> +             }
>   
>               trace_buffer_unlock_commit(blk_tr, buffer, event, trace_ctx);
>               return;

Isn't a switch() kind of an overkill for something that can be done in a 
single if () {} else {} block?

Otherwise looks good,

Reviewed-by: Johannes Thumshirn <[email protected]>

Reply via email to