On Fri, Nov 16, 2018 at 12:11 PM Jason King <jason.k...@joyent.com> wrote:

> For a small amount of background, I’ve been trying to help nail down some
> of the reported problems from users trying out the zfs crypto PR on illumos
> distros.  I have one instance here I’ve been digging into, and have some
> questions I’m hoping those who’ve spent far more time in this code than I
> have (so far) might be able to shed some insight.
>
> Specifically, so far it appears there is something surrounding zfs
> send/recv that is causing the checksums to become corrupt.  I’m still
> digging into this further, but the short version is that multiple zfs
> scrubs on the source show no issues, yet sending an incremental stream
> from an encrypted zvol to another system fairly easily causes a panic
> during reception:
>

That's awesome that you are able to reproduce a problem!


>
> panic[cpu1]/thread=fffffe000fe5cc20:
> assertion failed: (db = dbuf_hold(dn, blkid, FTAG)) != NULL, file:
> ../../common/fs/zfs/dmu.c, line: 1706
>
> fffffe000fe5caa0 genunix:process_type+16c14c ()
> fffffe000fe5cb10 zfs:dmu_assign_arcbuf_dnode+198 ()
> fffffe000fe5cb80 zfs:receive_write+140 ()
> fffffe000fe5cbc0 zfs:receive_process_record+123 ()
> fffffe000fe5cc00 zfs:receive_writer_thread+88 ()
> fffffe000fe5cc10 unix:thread_start+8 ()
>
> While not obvious from that panic message (it took a bit of dtrace and a
> few laps), that panic is because of the following scenario:
>
> dmu_assign_arcbuf_dnode
> dbuf_hold (returns NULL — as seen in the VERIFY violation)
> dbuf_hold_level (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_findbp (returns EIO)
> dbuf_read (returns EIO)
> dbuf_read_verify_dnode_crypt (returns EIO)
> arc_untransform (returns EIO)
> arc_buf_fill (returns ECKSUM)
> arc_fill_hdr_crypt (returns ECKSUM)
>
> Looking at the ZoL code, it would appear it would also panic if it tried
> to receive the same stream (the call stack for ZoL would be slightly
> different since the dbuf_hold args are heap allocated there instead of on
> the stack, but same general sequence).  It seems like zfs recv should be
> more resilient here: the receive should just error out instead of leaving
> a corrupted zvol on the destination.  zfs scrub on the _destination_ will
> show errors that it cannot fix, and removing the last sent snapshot clears
> the errors (both systems have ECC ram, no errors there).  Is this already a
> known issue?  I didn’t see anything obvious that looked like this, but then
> I might not be looking in the correct places.
>

I'm not aware of this issue (though hopefully it turns out to be related to
the one that Garnot has been hitting).  I agree that the zfs recv should be
more resilient here, and there's an additional problem of why we are
getting this error at all, right?  arc_fill_hdr_crypt() is probably failing
because its MAC doesn't match, but I think that shouldn't happen unless
someone is intentionally trying to attack us with a malicious send stream.
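
To make "more resilient" concrete, here is a minimal, self-contained sketch
of the pattern: a low-level checksum/MAC failure is reported upward as EIO,
the hold wrapper collapses it to NULL, and the write path returns an error
instead of asserting.  Every sk_* name and SK_ECKSUM is hypothetical and
only models the shape of the call chain quoted above; this is not the
actual ZFS code, and the real fix may look quite different.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the ECKSUM error code (assumption). */
#define SK_ECKSUM 1001

/* Hypothetical, heavily simplified buffer handle. */
typedef struct sk_dbuf {
	unsigned long blkid;
} sk_dbuf_t;

/* Lowest layer: pretend authentication fails for exactly one block id. */
static int
sk_fill_hdr_crypt(unsigned long blkid, unsigned long bad_blkid)
{
	return (blkid == bad_blkid ? SK_ECKSUM : 0);
}

/* Middle layer: any failure from below is reported to callers as EIO. */
static int
sk_hold_impl(unsigned long blkid, unsigned long bad_blkid,
    sk_dbuf_t *storage, sk_dbuf_t **dbp)
{
	assert(storage != NULL && dbp != NULL);
	if (sk_fill_hdr_crypt(blkid, bad_blkid) != 0) {
		*dbp = NULL;
		return (EIO);
	}
	storage->blkid = blkid;
	*dbp = storage;
	return (0);
}

/* Hold-style wrapper: the error code is lost and only NULL is returned. */
static sk_dbuf_t *
sk_hold(unsigned long blkid, unsigned long bad_blkid, sk_dbuf_t *storage)
{
	sk_dbuf_t *db;

	return (sk_hold_impl(blkid, bad_blkid, storage, &db) == 0 ? db : NULL);
}

/*
 * Write path: check for NULL and return an error rather than asserting,
 * so the caller can abort the receive cleanly.
 */
static int
sk_receive_write(unsigned long blkid, unsigned long bad_blkid)
{
	sk_dbuf_t storage;
	sk_dbuf_t *db = sk_hold(blkid, bad_blkid, &storage);

	if (db == NULL)
		return (EIO);
	return (0);
}
```

In this sketch, sk_receive_write(7, 7) returns EIO while
sk_receive_write(8, 7) returns 0, so a caller could fail the receive
instead of panicking.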


>
> I’m still digging more into the actual corruption itself, but I wanted to
> at least mention the above issue in the meantime.  Of course if the zfs
> send/recv issue is ringing bells with anyone, I’d love to know that as
> well.  Interestingly enough, it’s always the same blkid in the panic.
>

--matt

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/Ta7b88998bf99fe05-M6eb1dd5111e84360978115e8