On Thu, 13 Mar 2025, Marco Felsch wrote:
> Hi,
>
> sorry for the late reply but we had to run several tests and analyze
> multiple test outputs offline via a small self-written tool.
>
> On 25-02-14, Mikulas Patocka wrote:
> > Hi
> >
> > On Tue, 11 Feb 2025, Marco Felsch wrote:
> >
> > > Hi all,
> > >
> > > as written in the subject we do see an odd dm-integrity behaviour during
> > > the journal replay step.
> > >
> > > First things first, a short introduction to our setup:
> > > - Linux v6.8
> > > - We do have a dm-integrity+dm-crypt setup to provide an authenticated
> > > encrypted ext4-based rw data partition.
> > > - The setup is done with a custom script [1] since we are making use of
> > > the kernel trusted-keys infrastructure which isn't supported by LUKS2
> > > at the moment.
> > > - The device has no power failsafe, i.e. hard power cuts can occur.
> > > Therefore we use the dm-integrity J(ournal) mode.
> > > - The storage backend is an eMMC with 512Byte block size.
> >
> > Could you retest it with an eMMC or SDCARD from a different vendor, just
> > to test if it is hardware issue?
>
> We saw the issue on different eMMC devices from different manufacturers.
>
> > > - We use the dm-integrity 4K block size option to reduce the
> > > tag/metadata overhead.
> > >
> > > From time to time we do see "AEAD ERROR"s [2] while fsck tries to repair
> > > the filesystem, which of course aborts the fsck run.
> > > After a while within the rescue shell
> >
> > So, when you run fsck again from the rescue shell (without deactivating
> > and activating the device), the bug goes away? What is the time interval
> > after which the bug goes away?
>
> Yes, this happened from time to time, but only in rare cases and I'm not
> sure if our tester did something wrong.
>
> > Could you upload somewhere the image of the eMMC storage when the bug
> > happens and send me a link to it, so that I can look what kind of
> > corruption is there?
>
> Of course I need to align it with our customer but since the data is
> encrypted you could get the dm-integrity dump.
>
> > > and a following reboot the fsck
> > > run on the same file system doesn't trigger any "AEAD ERROR".
> > >
> > > The dm-integrity table is added twice [1] since we gather the
> > > provided_data_sectors information from the first call. I know that this
> > > isn't optimal. The provided_data_sectors should be stored and not
> > > gathered again but this shouldn't bother a system with a valid
> > > dm-integrity superblock already written.
> >
> > That should work fine - there is no problem with activating the device
> > twice.
>
> Also with activating it with different sizes as we do? I think there is
> no problem with that either. Our script doesn't differentiate between the
> initial setup (no superblock available) and the "normal" setup.
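For readers following along, here is a minimal sketch of how the
provided_data_sectors value could be extracted from the status line of an
already activated dm-integrity device. It is not taken from the script [1].
It assumes the documented dm-integrity status layout (mismatch count,
provided data sectors, recalculation position) following the usual
"start length target" prefix; the helper name and the sample values are
hypothetical.

```shell
# provided_data_sectors: read a "dmsetup status" line on stdin and print
# the provided data sectors field of a dm-integrity target.
# Assumed field layout: start length "integrity" mismatches provided recalc
provided_data_sectors() {
    awk '$3 == "integrity" { print $5 }'
}

# Sample status line with hypothetical values, standing in for the output
# of "dmsetup status <name>" on a real device.
echo "0 585728 integrity 0 585728 -" | provided_data_sectors
```

On a live system the input would come from something like
"dmsetup status <name> | provided_data_sectors", where <name> is whatever
name the integrity device was created with.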
>
> > > To debug the issue we uncommented the "#define DEBUG_PRINT" and noticed
> > > that the replay is happening twice [2] albeit this should be a
> > > synchronous operation and once the dm resume returned successfully the
> > > replay should be done.
> >
> > The replay happens twice because you activate it twice - this should be
> > harmless because each replay replays the same data - there is "replaying
> > 364 sections, starting at 143, commit seq 2" twice in the log.
>
> Yes, but we thought that after the first replay the code marks the
> entries as unused, so they don't have to be replayed a 2nd time. But
> as I said, there should be no issue with it.
>
> > > We also noticed that once we uncommented "#define DEBUG_PRINT" it was
> > > harder to trigger the issue.
> > >
> > > We also checked the eMMC bus to see if the CMD23 is sent correctly (with
> > > the reliable write bit set) in case of a FUA request which is the case.
> > > So now with the above knowledge we suspect the replay path to be not
> > > synchronous, the dm resume is returning too early, while not all of the
> > > writes kicked off by copy_from_journal having reached the storage.
> >
> > There is "struct journal_completion comp" in do_journal_write and
> > "wait_for_completion_io(&comp.comp);" at the end of do_journal_write -
> > that should wait until all I/O submitted by copy_from_journal finishes.
> >
> > > Maybe you guys have some pointers we could follow or have an idea of
> > > what may go wrong here.
> > >
> > > Regards,
> > > Marco
> >
> > I read through the dm-integrity source code and I didn't find anything.
> >
> > There are two dm-crypt bugs that could cause your problems - see the
> > upstream commits 8b8f8037765757861f899ed3a2bfb34525b5c065 and
> > 9fdbbdbbc92b1474a87b89f8b964892a63734492. Please, backport these commits
> > to your kernel and re-run the tests.
>
> This helped. There are also other dm-crypt fixes which looked closely
> related. Therefore we updated to 6.12.16 and ran the tests. Tests which
> could reproduce the issue quite reliably are now passing and it really
> looks like dm-crypt was the issue :/
>
> The issue could be reproduced by the following script which was started
> right after the boot:
>
> | #!/bin/bash
> |
> | set -x
> |
> | suffix=$(hexdump /dev/urandom -n4 -e '"%u"')
> | suffix=$((suffix%220))
> | testdir=/mnt/data/test
> | testfile=$testdir/test.$suffix
> |
> | # Make testdir
> | mkdir -p $testdir
> |
> | [ -f $testfile ] && rm $testfile
> |
> | dd if=/dev/mapper/data of=/dev/null && \
> | dd if=/dev/urandom of=$testfile bs=1M oflag=sync count=1 && \
> | dd if=/dev/mapper/data of=/dev/null && \
> | reboot -f
> |
> | echo "Something failed, please check!"
>
> Our data partition was quite small (300MB), which led to a journal size
> of up to ~2MB. Since we use ext4 on top of it, we used a size of
> 1MB. If reading the whole device before or after the write failed, the
> script stopped and we dumped the raw eMMC partition (mmcblkXpY).
>
> Afterwards we analyzed the dump with our small tool to see if we could
> find the AEAD sector within the journal, which was the case.
>
> Next we checked whether this particular sector (8 sectors, since we
> provide a 4K IO size to the upper layers) could be found in the data
> area, which was the case.
>
> Lastly we checked whether the metadata content of the journal entry
> could be found within the metadata area, which was the case too.
>
> Both the data and metadata belong to the same "area" so we do assume
> that the journal was correctly replayed but got the wrong
> <data><metadata> pair from the dm-crypt layer.
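The "is this 4K block present in the dump" check described above can be
approximated with standard tools. The following is only a rough sketch and
not the self-written tool mentioned in the mail: the function name and its
arguments are hypothetical, and it assumes the block of interest has
already been extracted from the journal into a file of exactly one block
size. It compares every block-aligned chunk of the dump against that file
and prints the byte offset of each match.

```shell
# find_block DUMP NEEDLE [BLOCK_SIZE]
# Print the byte offset of every BLOCK_SIZE-aligned block in DUMP whose
# content is identical to the file NEEDLE (NEEDLE must be exactly
# BLOCK_SIZE bytes long). BLOCK_SIZE defaults to 4096.
find_block() {
    dump=$1
    needle=$2
    bs=${3:-4096}
    size=$(wc -c < "$dump")
    blocks=$((size / bs))
    i=0
    while [ "$i" -lt "$blocks" ]; do
        # Read block i from the dump and compare it with the needle.
        if dd if="$dump" bs="$bs" skip="$i" count=1 2>/dev/null |
                cmp -s - "$needle"; then
            echo $((i * bs))
        fi
        i=$((i + 1))
    done
}
```

Dividing a printed offset by 512 gives the first of the 8 eMMC sectors that
make up the 4K block. One dd call per block is slow, so this is only
practical for small images such as the 300MB partition discussed here.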
>
> Please correct me if our conclusion is wrong, but after the kernel
> update the issues are gone for now (we do need more test samples).
It's nice to hear that the issue is fixed. Thanks for testing it.
> Thanks,
> Marco
Mikulas