Public bug reported:

Running CephFS on Ubuntu 24.04 with the 6.8 kernel (Ubuntu
6.8.0-100.100-generic 6.8.12).

The enclosed script, reproduce-ceph-punch-hole-corruption.py, shows that
on recent kernels CephFS silently corrupts the 16KB of data immediately
before the requested hole when punching a hole in a file (the test uses
fallocate()). The corruption only occurs when the hole touches or
crosses a 4MB RADOS object boundary (4MB is the default stripe size).
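
For readers without the attachment, the kind of check described above can be sketched roughly as follows. This is not the enclosed script. It is a minimal illustration that assumes 64-bit Linux with glibc, and the helper names punch_hole and preceding_bytes_intact are made up for this example. It fills a file with 0xFF, punches a hole with fallocate(), and reads back the 16KB just before the hole. The real reproducer may take extra steps (for example around caching) that are needed to trigger the bug.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) without changing the file size."""
    # Assumption: 64-bit Linux, where off_t is 64 bits wide.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def preceding_bytes_intact(path, hole_start, hole_len, guard=16384, fill=0xFF):
    """Return True if the `guard` bytes before the hole still hold `fill`.

    Hypothetical helper; requires hole_start >= guard.
    """
    size = hole_start + hole_len + guard
    with open(path, "w+b") as f:
        f.write(bytes([fill]) * size)
        f.flush()
        os.fsync(f.fileno())
        punch_hole(f.fileno(), hole_start, hole_len)
        data = os.pread(f.fileno(), guard, hole_start - guard)
    return data == bytes([fill]) * guard
```

On a CephFS mount one might call, for example, preceding_bytes_intact("/Shared_DataStore/test.bin", 4190208, 8192), which mirrors the first failing case in the output below. The file name test.bin is only a placeholder.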

Execution shows the corruption:

root@EdgeOS-5HB6Q54:/home/eceuser# python3 ./reproduce-ceph-punch-hole-corruption.py /Shared_DataStore/
CephFS PUNCH_HOLE data corruption reproducer
============================================================
Mount point: /Shared_DataStore/
Object size: 4194304 (4 MiB)

Tests crossing 4MB object boundary (expect FAIL on buggy kernels):
------------------------------------------------------------
  FAIL  1 page before boundary, 2 pages
        hole=[4190208, 4198400)  checked=[4173824, 4190208)
        16384/16384 bytes read as 0x00 (expected 0xFF)
  FAIL  2 pages before boundary, 4 pages
        hole=[4186112, 4202496)  checked=[4169728, 4186112)
        16384/16384 bytes read as 0x00 (expected 0xFF)
  FAIL  4 pages before boundary, 8 pages
        hole=[4177920, 4210688)  checked=[4161536, 4177920)
        16384/16384 bytes read as 0x00 (expected 0xFF)
  FAIL  ends at boundary, 2 pages
        hole=[4186112, 4194304)  checked=[4169728, 4186112)
        16384/16384 bytes read as 0x00 (expected 0xFF)
  FAIL  ends at boundary, 1 page
        hole=[4190208, 4194304)  checked=[4173824, 4190208)
        16384/16384 bytes read as 0x00 (expected 0xFF)

Tests NOT crossing boundary (should always PASS):
------------------------------------------------------------
  PASS  within object 0
        hole=[4161536, 4169728)  checked=[4145152, 4161536)
  PASS  mid object 0
        hole=[1048576, 1056768)  checked=[1032192, 1048576)
  PASS  start of object 1
        hole=[4194304, 4202496)  checked=[4177920, 4194304)
  PASS  within object 1
        hole=[5242880, 5251072)  checked=[5226496, 5242880)

============================================================
Results: 4 passed, 5 failed out of 9

BUG CONFIRMED: This kernel has the CephFS PUNCH_HOLE corruption bug.
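
The pass/fail pattern above can be summarised with a small piece of arithmetic. The sketch below is inferred only from the hole and checked ranges printed by the script, not from the kernel source, and the function names are hypothetical. A hole [start, end) was affected when it ended at or crossed an object boundary. A hole that started exactly on a boundary was not affected. The checked window is the 16384 bytes just before the hole.

```python
OBJECT_SIZE = 4 * 1024 * 1024   # object size reported by the script (4 MiB)
GUARD_BYTES = 16 * 1024         # size of the region seen to be zeroed


def hole_reaches_boundary(start, end, object_size=OBJECT_SIZE):
    """True if the half-open hole [start, end) ends at or crosses a boundary.

    Matches the observed results: a hole that merely starts on a
    boundary and stays inside one object returns False.
    """
    return start // object_size != end // object_size


def checked_window(start, guard=GUARD_BYTES):
    """Half-open range before the hole that should keep its original data."""
    return (max(0, start - guard), start)


# Ranges taken from the test output above
assert hole_reaches_boundary(4190208, 4198400)      # crosses boundary: FAIL
assert hole_reaches_boundary(4186112, 4194304)      # ends at boundary: FAIL
assert not hole_reaches_boundary(4194304, 4202496)  # start of object 1: PASS
assert not hole_reaches_boundary(1048576, 1056768)  # mid object 0: PASS
assert checked_window(4190208) == (4173824, 4190208)
```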

Enclosed is a patch submission (AI created) detailing the issue:
0001-ceph-fix-data-corruption-from-short-read-on-punch-hole.patch

With the patch applied, the test script now passes:
root@EdgeOS-3CD6Q54:~# python3 /home/eceuser/reproduce-ceph-punch-hole-corruption.py /Shared_DataStore/
CephFS PUNCH_HOLE data corruption reproducer
============================================================
Mount point: /Shared_DataStore/
Object size: 4194304 (4 MiB)

Tests crossing 4MB object boundary (expect FAIL on buggy kernels):
------------------------------------------------------------
  PASS  1 page before boundary, 2 pages
        hole=[4190208, 4198400)  checked=[4173824, 4190208)
  PASS  2 pages before boundary, 4 pages
        hole=[4186112, 4202496)  checked=[4169728, 4186112)
  PASS  4 pages before boundary, 8 pages
        hole=[4177920, 4210688)  checked=[4161536, 4177920)
  PASS  ends at boundary, 2 pages
        hole=[4186112, 4194304)  checked=[4169728, 4186112)
  PASS  ends at boundary, 1 page
        hole=[4190208, 4194304)  checked=[4173824, 4190208)

Tests NOT crossing boundary (should always PASS):
------------------------------------------------------------
  PASS  within object 0
        hole=[4161536, 4169728)  checked=[4145152, 4161536)
  PASS  mid object 0
        hole=[1048576, 1056768)  checked=[1032192, 1048576)
  PASS  start of object 1
        hole=[4194304, 4202496)  checked=[4177920, 4194304)
  PASS  within object 1
        hole=[5242880, 5251072)  checked=[5226496, 5242880)

============================================================
Results: 9 passed, 0 failed out of 9

All tests passed. This kernel is not affected (or the fix is applied).

The following commit appears to have introduced the issue:
92b6cc5d1e7c ("netfs: Add iov_iters to (sub)requests to describe various
buffers") by David Howells, authored 2023-09-27, committed 2023-12-24,
merged in v6.8-rc1.

The issue is only present in 6.8 and 6.9 kernels. 6.10 rewrote this code
in ee4cdf7ba857 ("netfs: Speed up buffered reading", by David Howells,
2024-07-02, merged in v6.10), which no longer has this issue.

Requesting either a review of the enclosed patch for inclusion in
stable, or guidance on another or better way to fix this.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2144592

Title:
  Punching hole through CephFS hosted file causes corruption when
  crossing 4MB RADOS object boundary

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2144592/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
