On Fri, Nov 16, 2018 at 12:11 PM Jason King <jason.k...@joyent.com> wrote:
> For a small amount of background, I’ve been trying to help nail down some
> of the reported problems from users trying out the zfs crypto PR on illumos
> distros. I have one instance here I’ve been digging into, and have some
> questions that I’m hoping those who’ve spent far more time in this code
> than I have (so far) might be able to shed some light on.
>
> Specifically, so far it appears there is something surrounding zfs
> send/recv that is causing the checksums to become corrupt. I’m still
> digging into this further, but the short version is that multiple zfs
> scrubs on the source show no issues, yet sending an incremental stream
> from an encrypted zvol to another system fairly easily causes a panic
> during reception:
>

That's awesome that you are able to reproduce a problem!

> panic[cpu1]/thread=fffffe000fe5cc20:
> assertion failed: (db = dbuf_hold(dn, blkid, FTAG)) != NULL, file:
> ../../common/fs/zfs/dmu.c, line: 1706
>
> fffffe000fe5caa0 genunix:process_type+16c14c ()
> fffffe000fe5cb10 zfs:dmu_assign_arcbuf_dnode+198 ()
> fffffe000fe5cb80 zfs:receive_write+140 ()
> fffffe000fe5cbc0 zfs:receive_process_record+123 ()
> fffffe000fe5cc00 zfs:receive_writer_thread+88 ()
> fffffe000fe5cc10 unix:thread_start+8 ()
>
> While not obvious from that panic message (it took a bit of dtrace and a
> few laps), that panic is because of the following scenario:
>
> dmu_assign_arcbuf_dnode
> dbuf_hold (returns NULL, as seen in the VERIFY violation)
> dbuf_hold_level (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_hold_impl (returns EIO)
> dbuf_findbp (returns EIO)
> dbuf_read (returns EIO)
> dbuf_read_verify_dnode_crypt (returns EIO)
> arc_untransform (returns EIO)
> arc_buf_fill (returns ECKSUM)
> arc_fill_hdr_crypt (returns ECKSUM)
>
> Looking at the ZoL code, it would appear it would also panic if it tried
> to receive the same stream (the call stack for ZoL would be slightly
> different since the dbuf_hold args are heap allocated there instead of on
> the stack, but same general sequence). It seems like zfs recv should be
> more resilient here: the receive should just error out instead of leaving
> a corrupted zvol on the destination. A zfs scrub on the _destination_ will
> show errors that it cannot fix, and removing the last sent snapshot clears
> the errors (both systems have ECC RAM, no errors there). Is this already a
> known issue? I didn’t see anything obvious that looked like this, but then
> I might not be looking in the correct places.
>

I'm not aware of this issue (though hopefully it turns out to be related to
the one that Garnot has been hitting). I agree that zfs recv should be more
resilient here, and there's an additional problem of why we are getting
this error at all, right? arc_fill_hdr_crypt() is probably failing because
the MAC doesn't match, but I think that shouldn't happen unless someone is
intentionally trying to attack us with a malicious send stream.

>
> I’m still digging more into the actual corruption itself, but I wanted to
> at least mention the above issue in the meantime. Of course if the zfs
> send/recv issue is ringing bells with anyone, I’d love to know that as
> well. Interestingly enough, it’s always the same blkid in the panic.
>

--matt

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/Ta7b88998bf99fe05-M6eb1dd5111e84360978115e8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription