Re: Can't list root directory
On 2024-02-01 02:37, Loren M. Lang wrote: On January 31, 2024 1:28:37 PM PST, hw wrote: On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote: On 2024-01-30 15:54, hw wrote: On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote: I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -). This has been happening intermittently for several months. I initially thought it might be related to failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening. [...] What happens when you put the device you replaced back? How could putting a known-failing device back in help? The problem existed before I replaced it and continues to exist after the replacement. It sounded like you were able to list the root directory (at least sometimes) before you did the replacement. Manually failing the device (perhaps after adding it back first) could make a difference. I've seen such indefinite hangs only when an NFS share has become unreachable after it had been mounted. You could use clonezilla to make a copy and then perhaps convert the file system to btrfs. Do you still have the problem when you remove one of the NVME storage things? Perhaps you have the equivivalent of a bad SATA cable or the mainboard doesn't like it when you access two of those at the same time, or something like that. Even simple network cables can behave very strangely, and NVME may be a bit more complicated than that. Running fsck on every boot to work around an issue like this is certainly a bad idea. Doesn't fsck report anything? If it really makes a difference in itself rather than creating some side effect that leads to the root directory being readable, it should report something. Perhaps you need to increase its verbosity. If there's no report then it would look like a side effect and raise the question what side effect it might be. 
Does fsck run before the RAID has been brought up or after? Is the RAID up when booting is completed? What does mdadm say about the device(s)? Can you still list the root directory when you manually fail either drive? What exactly are the circumstances under which you can and cannot list the root directory?

You need to do some investigating and ask questions like those ...

Also, instead of doing "ls -l /" which will stat() every child folder under root, try "/bin/ls -f /" and see if that is successful. That will only do a readdir() on root itself. Also, it might be interesting to get a log of "strace ls -l /" to confirm exactly where the hang happens.

-Loren

Thanks Loren. /bin/ls -l works. The strace shows the hang is on /keybase. The strace itself hung really badly - Ctrl-C wouldn't kill it. I've set the fsck count to 1 again, so I can reboot and take a look at it.
Re: Can't list root directory
On 2024-01-31 12:02, Max Nikulin wrote:
> On 29/01/2024 23:42, Gary Dale wrote:
> > "ls -l /" just hangs
>
> It may dereference symlinks, call stat, etc. to colorize output. Could it be that you have automount points or something related to network mounts?
>
> Does "echo /*" hang? Even the bash prompt may do some funny stuff. I would try it from "dash".
>
> Can you install strace? E.g. copy the files while booted from live media.

Thanks everyone for the suggestions. I'll retune the array to not fsck every boot and see if the problem recurs so I can try your suggestions.
Re: Can't list root directory
On January 31, 2024 1:28:37 PM PST, hw wrote: >On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote: >> On 2024-01-30 15:54, hw wrote: >> > On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote: >> > > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability >> > > to see the root directory even when I am logged in as root (su -). >> > > >> > > This has been happening intermittently for several months. I initially >> > > thought it might be related to failing NVME drive that was part of a >> > > RAID1 array that is mounted as "/" but I replaced the device and the >> > > problem is still happening. >> > > [...] >> > What happens when you put the device you replaced back? >> > >> How could putting a known-failing device back in help? The problem >> existed before I replaced it and continues to exist after the replacement. > >It sounded like you were able to list the root directory (at least >sometimes) before you did the replacement. Manually failing the >device (perhaps after adding it back first) could make a difference. > >I've seen such indefinite hangs only when an NFS share has become >unreachable after it had been mounted. You could use clonezilla to >make a copy and then perhaps convert the file system to btrfs. > >Do you still have the problem when you remove one of the NVME storage >things? Perhaps you have the equivivalent of a bad SATA cable or the >mainboard doesn't like it when you access two of those at the same >time, or something like that. Even simple network cables can behave >very strangely, and NVME may be a bit more complicated than that. > >Running fsck on every boot to work around an issue like this is >certainly a bad idea. Doesn't fsck report anything? If it really >makes a difference in itself rather than creating some side effect >that leads to the root directory being readable, it should report >something. Perhaps you need to increase its verbosity. 
> >If there's no report then it would look like a side effect and raise >the question what side effect it might be. Does fsck run before the >RAID has been brought up or after? Is the RAID up when booting is >completed? What does mdadm say about the device(s)? Can you still >list the root directory when you manually fail either drive? What >exactly are the circumstances under which you can and not list the >root directory? > >You need to do some investigating and ask questions like those ... > Also, instead of doing "ls -l /" which will stat() every child folder under root, try "/bin/ls -f /" and see if that is successful. That will only do a readdir() on root itself. Also, it might be interesting to get a log of "strace ls -l /" to confirm exactly where the hang happens. -Loren -- Sent from my Nexus 4 with K-9 Mail. Please excuse my brevity.
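The difference between "ls -f" (readdir() only) and "ls -l" (a stat per entry) can also be used to find which entry is the one that hangs. What follows is only a minimal sketch in Python, not something anyone in this thread ran: the function name probe_entries and the 5-second timeout are made up for illustration, and it assumes that reading the directory itself still works (if even that hangs, the script will hang too). Each entry is lstat()ed in a separate child process so that a stuck call does not block the probe.

```python
import os
import subprocess
import sys

# Child process body: lstat() a single path, roughly what "ls -l" does per entry.
_CHILD = "import os, sys; os.lstat(sys.argv[1])"


def probe_entries(directory="/", timeout=5.0):
    """Return {name: "ok" | "error" | "timeout"} for each entry in directory.

    probe_entries and its timeout default are hypothetical names/values
    chosen for this sketch.
    """
    results = {}
    # os.listdir() only reads the directory itself (like "ls -f");
    # it does not stat the individual entries.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        child = subprocess.Popen(
            [sys.executable, "-c", _CHILD, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        try:
            code = child.wait(timeout=timeout)
            results[name] = "ok" if code == 0 else "error"
        except subprocess.TimeoutExpired:
            # Deliberately no wait() after kill(): a child stuck in
            # uninterruptible sleep may never exit, and waiting for it
            # would hang the probe as well.
            child.kill()
            results[name] = "timeout"
    return results
```

Calling probe_entries("/") and printing the entries whose status is "timeout" should point at the problem mount point, much like the strace log would, but without the probing process itself getting stuck on it.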
Re: Can't list root directory
On Wed, 2024-01-31 at 09:27 -0500, Gary Dale wrote:
> On 2024-01-30 15:54, hw wrote:
> > On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote:
> > > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -).
> > >
> > > This has been happening intermittently for several months. I initially thought it might be related to a failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening.
> > > [...]
> > What happens when you put the device you replaced back?
> >
> How could putting a known-failing device back in help? The problem existed before I replaced it and continues to exist after the replacement.

It sounded like you were able to list the root directory (at least sometimes) before you did the replacement. Manually failing the device (perhaps after adding it back first) could make a difference.

I've seen such indefinite hangs only when an NFS share has become unreachable after it had been mounted. You could use clonezilla to make a copy and then perhaps convert the file system to btrfs.

Do you still have the problem when you remove one of the NVME storage things? Perhaps you have the equivalent of a bad SATA cable or the mainboard doesn't like it when you access two of those at the same time, or something like that. Even simple network cables can behave very strangely, and NVME may be a bit more complicated than that.

Running fsck on every boot to work around an issue like this is certainly a bad idea. Doesn't fsck report anything? If it really makes a difference in itself rather than creating some side effect that leads to the root directory being readable, it should report something. Perhaps you need to increase its verbosity.

If there's no report then it would look like a side effect and raise the question what side effect it might be. Does fsck run before the RAID has been brought up or after? Is the RAID up when booting is completed? What does mdadm say about the device(s)? Can you still list the root directory when you manually fail either drive? What exactly are the circumstances under which you can and cannot list the root directory?

You need to do some investigating and ask questions like those ...
Re: Can't list root directory
On 29/01/2024 23:42, Gary Dale wrote:
> "ls -l /" just hangs

It may dereference symlinks, call stat, etc. to colorize output. Could it be that you have automount points or something related to network mounts?

Does "echo /*" hang? Even the bash prompt may do some funny stuff. I would try it from "dash".

Can you install strace? E.g. copy the files while booted from live media.
Re: Can't list root directory
On 2024-01-29 at 11:42, Gary Dale wrote:
> I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -).
>
> This has been happening intermittently for several months. I initially thought it might be related to a failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening.
>
> I had been able to fix it by booting to SystemRescue and running an fsck on the device but it didn't work this time. The device checks out OK (even when using fsck /dev/mdx -f) but I still can't list the root. "ls -l /" just hangs, as do any attempts to see the root directory in a graphical file manager. In dolphin this means there is nothing in the folders - and since that is the default starting point I have to manually enter a folder name (e.g. /home/me) in the location bar to be able to see anything - but even then the folders panel remains empty.
>
> Even running commands like df -h hang because they can't access the root folder. However the system is otherwise running normally.

I'm not sure it'll help lead to anything, but out of curiosity and/or as a possible diagnostic: when the problem is manifesting, what happens if you run 'stat /'? Does it report data (similar to what you'd get from 'stat' on another directory), or does it hang, or give errors, or...?

My thought is that this will give information about the filesystem object that is the root directory, without trying to also access information about the *contents* of that directory. If the one succeeds where the other fails, that might help narrow down where the actual issue is.

--
The Wanderer

The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man. -- George Bernard Shaw
Re: Can't list root directory
On 2024-01-29 12:55, Hans wrote:
> Hi Gary,
>
> before losing any data, I suggest booting from a live Linux system. Please use a modern live system like Knoppix or Kali Linux. If it is not a BIOS problem, you should see the device again and be able to mount it. If /root is on a separate partition, you can run some filesystem checks, such as e2fsck.
>
> And most important: with a live system you can mount an external hard drive and back up all files. Thus, even if /dev/nvme*** has died or is partly broken, you may be able to restore /root on another partition.
>
> Second: please check the ACLs. Although I do not believe they are the reason, it is worth looking at. Maybe you or someone else changed them accidentally.
>
> Third idea: is the hard drive full? In the past I had the problem of not being able to do anything. The reason: my hard drive was completely full (a temporary file was the cause). Deleting this big file did the trick.
>
> Just some ideas, maybe they help.
>
> Good luck!
>
> Best
> Hans

There is no problem seeing the root folder when I boot from a live distro. fsck never finds any significant issue.

An ACL issue would be permanent. This comes and goes.

I actually doubled the size of the root device when I put in the new NVME drive. When I set up the RAID array, I'd bought a 500G second drive to mirror the 256G original drive. When I replaced the 256G drive, I was able to expand the array to 500G (less a small amount for the EFI partition). The partition has lots of free space.

As I said, running an fsck seems to fix the issue temporarily. I now run an fsck on every boot.
Re: Can't list root directory
On 2024-01-29 11:42, Gary Dale wrote: I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -). This has been happening intermittently for several months. I initially thought it might be related to failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening. I had been able to fix it by booting to SystemRescue and running an fsck on the device but it didn't work this time. The device checks out OK (even when using fsck -/dev/mdx -f) but I still can't list the root. "ls -l /" just hangs, as do any attempts to see the root directory in a graphical file manager. In dolphin this means there is nothing in the folders - and since that is the default starting point I have to manually enter a folder name (e.g. /home/me) in the location bar to be able to see anything - but even then the folders panel remains empty. Even running commands like df -h hang because they can't access the root folder. However the system is otherwise running normally. Strangely, in the past simply booting to a rescue shell then exiting would also work. I'd usually try to do an fsck on the raid device but that would always fail because it was mounted. The only thing I noticed that was unusual was I rebooted after installing the latest Trixie updates this morning. That took about 10 minutes to shut down - 6 of which were spent waiting for a drkonqi process to finish. There was also a systemd message really late in the shutdown about /dev/md0 but that's not the root device. I'm used to Linux taking its time to shutdown lately so I don't think this was related. The systemd shutdown just seems to be easily delayed. Any ideas on how I can restore my ability to see the root directory? OK, got it working again. I used tune2fs to do an fsck on every boot. This being an NVME device, it's barely noticeable.
Re: Can't list root directory
On 2024-01-30 15:54, hw wrote: On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote: I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -). This has been happening intermittently for several months. I initially thought it might be related to failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening. [...] What happens when you put the device you replaced back? How could putting a known-failing device back in help? The problem existed before I replaced it and continues to exist after the replacement.
Re: Can't list root directory
On Mon, 2024-01-29 at 11:42 -0500, Gary Dale wrote: > I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability > to see the root directory even when I am logged in as root (su -). > > This has been happening intermittently for several months. I initially > thought it might be related to failing NVME drive that was part of a > RAID1 array that is mounted as "/" but I replaced the device and the > problem is still happening. > [...] What happens when you put the device you replaced back?
Re: Can't list root directory
Hi Gary,

before losing any data, I suggest booting from a live Linux system. Please use a modern live system like Knoppix or Kali Linux. If it is not a BIOS problem, you should see the device again and be able to mount it. If /root is on a separate partition, you can run some filesystem checks, such as e2fsck.

And most important: with a live system you can mount an external hard drive and back up all files. Thus, even if /dev/nvme*** has died or is partly broken, you may be able to restore /root on another partition.

Second: please check the ACLs. Although I do not believe they are the reason, it is worth looking at. Maybe you or someone else changed them accidentally.

Third idea: is the hard drive full? In the past I had the problem of not being able to do anything. The reason: my hard drive was completely full (a temporary file was the cause). Deleting this big file did the trick.

Just some ideas, maybe they help.

Good luck!

Best
Hans
Re: Can't list root directory
On Mon, Jan 29, 2024 at 11:42:14AM -0500, Gary Dale wrote:
> I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -).
>
> This has been happening intermittently for several months. I initially thought it might be related to a failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening.

[...]

Anything mounted below / whose block device is taking its time? Maybe a network device? What does mount say?

Cheers
--
t
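As a rough illustration of the "what does mount say?" question: the kernel's mount table can be read from /proc/self/mounts, which should not need to touch the mounted filesystems themselves, so it is a plausible thing to check even while df hangs. The sketch below is in Python and rests on assumptions: the helper names (parse_mounts, suspect_mounts) and the list of "suspect" filesystem types are made up for illustration and are not exhaustive.

```python
import re

# Filesystem types that may block when their backend is unreachable.
# This list is an illustrative assumption, not an exhaustive one.
SUSPECT_TYPES = ("nfs", "nfs4", "cifs", "autofs")


def _unescape(field):
    # The mount table encodes spaces and similar characters as octal
    # escapes such as \040; turn them back into plain characters.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mounts(text):
    """Parse mount-table text into (device, mount_point, fs_type) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mounts.append((fields[0], _unescape(fields[1]), fields[2]))
    return mounts


def suspect_mounts(mounts):
    """Keep network, automount and FUSE mounts (but not the fusectl pseudo-fs)."""
    return [
        m for m in mounts
        if m[2] in SUSPECT_TYPES or m[2] == "fuse" or m[2].startswith("fuse.")
    ]
```

Usage would be along the lines of reading open("/proc/self/mounts").read(), passing it through parse_mounts() and then suspect_mounts(), and looking at whatever is mounted directly below "/".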
Can't list root directory
I'm running Debian/Trixie on an AMD64 workstation. I've lost the ability to see the root directory even when I am logged in as root (su -).

This has been happening intermittently for several months. I initially thought it might be related to a failing NVME drive that was part of a RAID1 array that is mounted as "/" but I replaced the device and the problem is still happening.

I had been able to fix it by booting to SystemRescue and running an fsck on the device but it didn't work this time. The device checks out OK (even when using fsck /dev/mdx -f) but I still can't list the root. "ls -l /" just hangs, as do any attempts to see the root directory in a graphical file manager. In dolphin this means there is nothing in the folders - and since that is the default starting point I have to manually enter a folder name (e.g. /home/me) in the location bar to be able to see anything - but even then the folders panel remains empty.

Even running commands like df -h hang because they can't access the root folder. However the system is otherwise running normally.

Strangely, in the past simply booting to a rescue shell then exiting would also work. I'd usually try to do an fsck on the raid device but that would always fail because it was mounted.

The only unusual thing I noticed was when I rebooted after installing the latest Trixie updates this morning. That took about 10 minutes to shut down - 6 of which were spent waiting for a drkonqi process to finish. There was also a systemd message really late in the shutdown about /dev/md0 but that's not the root device. I'm used to Linux taking its time to shut down lately so I don't think this was related. The systemd shutdown just seems to be easily delayed.

Any ideas on how I can restore my ability to see the root directory?