On 11.09.19 at 9:45, David Newall wrote:
> Hi All,
>
> I might have misunderstood how to report a problem. I registered for
> bugzilla and reported a bug
> (https://bugzilla.kernel.org/show_bug.cgi?id=204757), but, perhaps I
> should have sent this message to this mailing list, first. My apologies
> if I bungled it.
>
> I've been trying to track down a problem, intermittently, for a long
> time, and now need to reach out for advice. I apologise in advance for
> the quality of this report, which I feel includes more detail than
> needed, yet may be missing what's important. I'm trying my best.
>
> The brief summary is that my system hangs during SSH login while a
> backup is in progress. Sshd uses PAM authentication. The problem seems
> to be related to mounts as df and mount also hang.
>
> The longer details are: I'm running Ubuntu 16.04.5 on a 64-bit VM under
> KVM. I back up data using the following steps:
>
> 1. Take an LVM2 snapshot of the (non-root) ext2 file-system mounted as
> /data;
> 2. Mount a btrfs file system as /backup;
> 3. Mount the snapshot over an empty directory (may be a subvolume; does it
> make a difference?) on /backup/snapshot;
> 4. Rsync the snapshot (with --archive --one-file-system --hard-links
> --inplace --numeric-ids --delete) to a subvolume /backup/data (thus it
> always contains /data as at last backup);
> 5. Take a btrfs subvolume snapshot of /backup/data;
> 6. Unmount /backup/snapshot and /backup.
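
For reference when reproducing this, the sequence described above could be scripted roughly as follows. This is only a sketch under assumptions: the volume group, logical volume, snapshot name, snapshot size, btrfs device and the name of the btrfs snapshot are all hypothetical placeholders, since the report does not give them. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the backup sequence from the report.
# ASSUMPTIONS (not from the report): VG, LV, SNAP, BTRFS_DEV, the 5G snapshot
# size and the dated snapshot name are hypothetical placeholders.
set -eu

DRY_RUN=${DRY_RUN:-1}            # 1 = print commands only, 0 = execute them
VG=${VG:-vg0}
LV=${LV:-data}
SNAP=${SNAP:-data-snap}
BTRFS_DEV=${BTRFS_DEV:-/dev/vdb1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_data() {
    # LVM2 snapshot of the volume mounted as /data
    run lvcreate --snapshot --size 5G --name "$SNAP" "/dev/$VG/$LV"
    # Mount the btrfs file system, then the snapshot on top of it
    run mount "$BTRFS_DEV" /backup
    run mount "/dev/$VG/$SNAP" /backup/snapshot
    # Copy the snapshot into the /backup/data subvolume (flags as reported)
    run rsync --archive --one-file-system --hard-links --inplace \
        --numeric-ids --delete /backup/snapshot/ /backup/data/
    # btrfs snapshot of the freshly synced subvolume
    run btrfs subvolume snapshot /backup/data "/backup/data-$(date +%Y%m%d)"
    run umount /backup/snapshot
    run umount /backup
}

backup_data
```

Running it with DRY_RUN=0 (as root) would execute the steps; wrapping each `run` call with a timestamped log line would make it easier to tell which step is active when logins start to hang.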
>
> By the time I get called, SSH logins via PAM hang (but complete
> "immediately" if I re-configure sshd for UsePAM no). Sessions which are
> still logged in seem unaffected, except df and mount both hang. I don't
> know what else hangs.
>
> During all of these steps, /data is almost static, maybe even
> completely static.
>
> I've queried my user, carefully, to determine the exact step where it
> starts to hang, and am 90% confident in her answer, which indicates that
> the hang-condition starts during rsync.
>
> Processes that were hanging complete normally when subvolume snapshot
> finishes.
>
> There's a chance that processes complete when the snapshot or btrfs
> file-system is unmounted, but I think it's before then. I've tried
> running each step by hand and was unable to reproduce the problem,
> probably because the amount of data to rsync in real use is much larger
> than what I wrote during that test. At any rate, during that test I
> could log in between and during each step of the procedure.
>
> The only messages in dmesg are "mounting ext2 file system using the ext4
> subsystem" and "mounted filesystem without journal. Opts: (null)", which
> sounds right as I use "mount" instead of "mount -t ext2".
>
> When I tried running df under strace, strace's output was:
>
> open("/proc/self/mountinfo", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
> read(3, "18 24 0:17 / /sys rw,nosuid,node"..., 1024) = 1024
> read(3, "ystemd/systemd-cgroups-agent,nam"..., 1024) = 1024
> read(3, "t rw,nosuid,nodev,noexec,relatim"..., 1024) = 1024
> read(3,
>
> After the subvolume snapshot completed, strace continued producing output:
>
> "fs lxcfs rw,user_id=0,group_id=0"..., 1024) = 624
> --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=28055,
> si_uid=1000} ---
> read(3, "", 1024) = 0
> lseek(3, 0, SEEK_CUR) = 3696
> close(3) = 0
>
> I think the SIGCONT was because I suspended the parent, strace, using
> Ctrl-Z.
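
Since the strace output shows df blocking in read() on /proc/self/mountinfo, timing that read periodically may help pin down when the hang begins and ends. A minimal sketch follows; the function name is hypothetical, and the optional file argument exists only so the helper can be tried on an ordinary file.

```shell
#!/bin/sh
# probe_mountinfo is a hypothetical helper, not an existing tool.
# It logs a timestamp BEFORE the read, so that a probe which blocks still
# leaves a record of when it got stuck.
probe_mountinfo() {
    file=${1:-/proc/self/mountinfo}
    echo "$(date '+%F %T') reading $file"
    start=$(date +%s)
    cat "$file" > /dev/null
    end=$(date +%s)
    echo "$(date '+%F %T') done after $((end - start))s"
}

# Single probe; see the note below for running it in a loop.
probe_mountinfo
```

Started before the backup as, for example, `while :; do probe_mountinfo; sleep 10; done >> /var/tmp/mountinfo-probe.log`, the log should show a "reading" line with no matching "done" line for as long as the read is blocked. Comparing those timestamps with the backup's own timing would indicate which step triggers the hang.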
>
> I could just leave sshd doing non-PAM authentication but I think that's
> the wrong approach. How do I zero in on this problem?
When the issue manifests, run (as root):
echo w > /proc/sysrq-trigger
This should log a backtrace (visible in dmesg) for all threads which are
currently in uninterruptible sleep. If it's a deadlock due to btrfs being
stuck, we should see it. Also provide your exact kernel version.
>
> Thanks,
>
> David
>
>