On 2016-08-01 11:46, Chris Murphy wrote:
OK, I've created a new volume that's sufficiently large that I can tell
whether the kernel workers doing the scrub are also being killed off.
First, I do a scrub without logging out, to get a time for an
uninterrupted scrub. Then I initiate a second scrub, which I start
timing, log out of the DE, and watch for when the kernel workers stop.
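
E.g., the baseline run can be timed like this; -B keeps 'btrfs scrub
start' in the foreground, so 'time' covers the whole scrub:

[root@localhost ~]# time btrfs scrub start -B /mnt/x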

- The kernel workers stop within ~5 seconds of the time an
uninterrupted scrub takes. The conclusion is that the scrub is still
being done by the kernel after logout.
This makes sense: systemd is killing based on session ID, and the kernel workers have an SID of 0 (I think; it should be whatever SID kthreadd (PID 2) has).
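
That's easy to confirm: all kernel threads inherit kthreadd's session, so they show up in ps with SID 0. For example:

[root@localhost ~]# ps -o pid,sid,comm -p 2
[root@localhost ~]# ps -eo pid,sid,comm | awk '$2 == 0' | head
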
- The btrfs process for the scrub isn't killed off either; it just
sits in state Z for the entire length of the scrub.
Z means the process is dead but nothing has called wait() or similar to collect its exit status. So it was killed; it's just that nothing has taken the body to the morgue yet.
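
The zombie is easy to spot; it shows state Z, and ps usually marks it <defunct>:

[root@localhost ~]# ps -o pid,ppid,stat,comm -C btrfs
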
- While this scrubbing is happening, issuing 'btrfs scrub status'
gets me consistently stale information. It's the same information as
at the moment the DE was logged out.
This makes sense, because the userspace component updates this info (and that's _all_ it does).
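
The numbers come from a plain status file that the userspace process rewrites as the scrub progresses (in btrfs-progs it's /var/lib/btrfs/scrub.status, if memory serves), so once that process is a zombie the file simply stops changing:

[root@localhost ~]# cat /var/lib/btrfs/scrub.status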

[root@localhost ~]# btrfs scrub status /mnt/x
scrub status for 9f9e5e1f-8d5a-44a0-8f69-8a393fb7ff3c
    scrub started at Mon Aug  1 09:29:59 2016, running for 00:00:15
    total bytes scrubbed: 3.06GiB with 0 errors

Even a minute later this information is the same.

Once the zombie btrfs process is finally reaped and the kernel workers
stop working, I get this bogus status information:

[root@localhost ~]# btrfs scrub status /mnt/x
scrub status for 9f9e5e1f-8d5a-44a0-8f69-8a393fb7ff3c
    scrub started at Mon Aug  1 09:29:59 2016, interrupted after 00:00:15, not running
    total bytes scrubbed: 3.06GiB with 0 errors


Only the user process was interrupted, not the scrub. It looks like
only the user process writes out the statistics and status, so once it
goes zombie there's no further accounting; the accounting isn't done
independently by the kernel (e.g. via sysfs).

Can I resume this scrub? Yes. But that's also bogus because there
really isn't anything to resume. All that work was done already, it
just hasn't been accounted for.
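
For reference, the resume just picks up from whatever position the status file last recorded:

[root@localhost ~]# btrfs scrub resume /mnt/x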

So whether you want to call this a bug or deeply suboptimal behavior,
I think that's splitting hairs. Neither mdadm nor LVM scrubs are
affected by this logout behavior and systemd killing off user
processes. I always get reliable scrub status information from either
'echo check > md/sync_action' or 'lvchange --syncaction check', both
before and after logging out of the DE from which the command was issued.
MD and DM RAID handle this by starting kernel threads to do the scrub. They then store the info about the scrub in the array itself, so you can query it externally. If you watch, neither of those commands runs longer than it takes to start the operation, so there's nothing for systemd to kill.
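
Concretely (device and LV names are just examples), both the trigger and the status live in kernel/array state, so they survive logout:

[root@localhost ~]# echo check > /sys/block/md0/md/sync_action
[root@localhost ~]# cat /proc/mdstat
[root@localhost ~]# lvchange --syncaction check vg/raid1lv
[root@localhost ~]# lvs -o +raid_sync_action,raid_mismatch_count vg/raid1lv
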

And it's even inconsistent with btrfs replace, which continues to
give me correct status information from a tty shell even though the
replace command was issued in a DE that I subsequently logged out of.
So 'btrfs scrub' is inconsistent no matter how you look at it. It's a
bug.
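
For reference, the replace workflow in question (device names are placeholders):

[root@localhost ~]# btrfs replace start /dev/sdb /dev/sdc /mnt/x
[root@localhost ~]# btrfs replace status /mnt/x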

Replace was implemented the way scrub should have been. It's done entirely in the kernel, and the userspace tools just start, stop and check status. We should just get rid of the whole scrub state file crap and have a way to query the last scrub status directly from the FS. That would fix this particular issue, and make scrub more consistent with everything else (and solve the stale scrub status bug too).