Fwd: Hurd shutdown problems
Further progress trying to track this down:

I don't have to shut down the system to have problems. "swapoff /dev/hd0s5" is enough to cause problems, once enough swap is in use. After a failed swapoff, I have an extra 98 storeio processes running!

I don't even have to swapoff to have "symptoms". The kernel debugger normally shows symbolic names, e.g.:

Stopped at machine_idle+0xe: leave
machine_idle(0,81a2c630,3806f64,0,9b448b38)+0xe
idle_thread_continue(9fcbdde0,81028b50,9c0c7fe4,0,9c3d5548)+0x2a

Once I've got enough swap in use, though, it stops doing this. Now I see:

Stopped at 0x81be: leave
0x81be(0,0,9fcc5990,0,9fb90b30)
0x810293fa(9fcbdde0,81028b50,99526fe4,0,9c3d5548)

When I see a kernel page fault, it's always in strcmp().

It doesn't matter if an ssh session is open or not (Riccardo Mottola's suggestion).

I can't task_terminate the auth server, as this typically does nothing once I've started having symptoms, but I can kill the auth server from the command line (just "kill 7") and that triggers a reboot that leaves the disk in a clean state.

I'm just learning Hurd. Any ideas?

agape
brent
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
Hello,

Christian Seiler, on Fri 05 Aug 2016 21:09:21 +0200, wrote:
> I've attached a patch that fixes this specific issue for me. I
> probably won't have time to look at the other issue I reported
> here, but with that I'd at least be able to have open-isns
> working on Hurd. (And the patch will likely also fix problems
> in other software.)
>
> It would be great if you could apply that patch in git.

Applied, thanks!

Samuel
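For readers unfamiliar with the flag in the subject line, here is a minimal sketch of what MSG_DONTWAIT means for a caller. It is not taken from the patch itself. It assumes a platform where Python's socket module exposes socket.MSG_DONTWAIT and the local-socket server honours it; the helper name try_recv_nonblocking is made up for illustration.

```python
import socket


def try_recv_nonblocking(sock, size=64):
    """Return received bytes, or None if the call would have blocked.

    MSG_DONTWAIT asks for non-blocking behaviour for this single call only,
    without switching the whole socket to non-blocking mode.
    """
    try:
        return sock.recv(size, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None


# A connected pair of local (AF_UNIX) stream sockets.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

empty = try_recv_nonblocking(a)   # nothing has been sent yet
b.sendall(b"ping")
ready = try_recv_nonblocking(a)   # data is now queued on the pair

a.close()
b.close()
```

If the flag is ignored by the server, the first call would block instead of returning, which is the kind of symptom a program relying on MSG_DONTWAIT would see.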
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
Richard Braun writes:

> On Mon, Aug 08, 2016 at 04:54:47PM +0200, Justus Winter wrote:
>> Richard Braun writes:
>> > Why not start the translator from the remapped environment too ?
>>
>> No reason, but this has to be implemented. I started working on a
>> library for writing such chrooting translators, then got side-tracked by
>> the complexity of the dir_lookup operations. Currently, remap has a
>> very naive lookup function, fakeroot's is better, but still not
>> sufficient. I made some patches towards unifying and refactoring the
>> logic used in libdiskfs and libnetfs, but these functions are still huge
>> :/
>
> No, I mean, here, in such a specific case, if the parent translator is
> itself running from the remap env, it should use the custom pflocal
> instance, right ?

No, that doesn't help, because binding a unix socket involves setting a passive translator, and that is still started by the filesystem "outside" the chrooted environment:

teythoon@hurdbox /tmp % touch 1
teythoon@hurdbox /tmp % remap /servers/socket/1 /tmp/1 -- /bin/bash
bash: cannot make pipe for command substitution: (ipc/mig) bad request message ID
teythoon@hurdbox:/tmp$ exit
/bin/settrans: fsys_goaway: (ipc/mig) server died

(eh, also it is tricky to set up, cannot use bash right away)

teythoon@hurdbox /tmp % remap /servers/socket/1 /tmp/1 -- /bin/sh
$ settrans -a 1 /hurd/pflocal
teythoon@hurdbox:/tmp$ python3
Python 3.5.2+ (default, Aug 5 2016, 08:07:14)
[GCC 6.1.1 20160705] on gnu0
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
>>> s.bind('/tmp/test.sock')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 1073741873] Cannot assign requested address
>>>
teythoon@hurdbox:/tmp$ showtrans test.sock
/hurd/ifsock

I firmly believe that the way to proceed is to teach such chrooting translators to detect that a node has a passive translator record, and instead of letting the filesystem start it, they must start the translator on their own. Not only does this give much stronger isolation, it is also necessary for correctness.

Justus
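The large number in "[Errno 1073741873]" is easier to read once split into fields. The sketch below assumes the usual Mach error-code layout (6-bit system, 12-bit subsystem, 14-bit code) and that Hurd errno values live in system 0x10; the function name split_mach_error is made up for illustration.

```python
def split_mach_error(value):
    """Split a Mach-style error value into (system, subsystem, code).

    Assumed layout: bits 26-31 system, bits 14-25 subsystem, bits 0-13 code.
    """
    system = (value >> 26) & 0x3F
    subsystem = (value >> 14) & 0xFFF
    code = value & 0x3FFF
    return system, subsystem, code


system, subsystem, code = split_mach_error(1073741873)
print(hex(system), subsystem, code)
```

Under these assumptions the value is 0x40000031, i.e. code 49 in system 0x10, which matches the BSD-style number for EADDRNOTAVAIL and the message "Cannot assign requested address".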
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
On Mon, Aug 08, 2016 at 04:54:47PM +0200, Justus Winter wrote:
> Richard Braun writes:
> > Why not start the translator from the remapped environment too ?
>
> No reason, but this has to be implemented. I started working on a
> library for writing such chrooting translators, then got side-tracked by
> the complexity of the dir_lookup operations. Currently, remap has a
> very naive lookup function, fakeroot's is better, but still not
> sufficient. I made some patches towards unifying and refactoring the
> logic used in libdiskfs and libnetfs, but these functions are still huge
> :/

No, I mean, here, in such a specific case, if the parent translator is itself running from the remap env, it should use the custom pflocal instance, right ?

-- 
Richard Braun
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
Richard Braun writes:

> On Mon, Aug 08, 2016 at 12:55:24PM +0200, Justus Winter wrote:
>> Right, I can see how this is a problem. The thing is, remap doesn't
>> quite do the job: 1/ it fails to remap relative paths, 2/ if one sets a
>> translator record on a node, and that translator is then started by the
>> filesystem, it is started "outside" of the remap environment. I believe
>> 2/ is what happens here.
>
> Why not start the translator from the remapped environment too ?

No reason, but this has to be implemented. I started working on a library for writing such chrooting translators, then got side-tracked by the complexity of the dir_lookup operations. Currently, remap has a very naive lookup function, fakeroot's is better, but still not sufficient. I made some patches towards unifying and refactoring the logic used in libdiskfs and libnetfs, but these functions are still huge :/

Justus
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
On Mon, Aug 08, 2016 at 12:55:24PM +0200, Justus Winter wrote:
> Right, I can see how this is a problem. The thing is, remap doesn't
> quite do the job: 1/ it fails to remap relative paths, 2/ if one sets a
> translator record on a node, and that translator is then started by the
> filesystem, it is started "outside" of the remap environment. I believe
> 2/ is what happens here.

Why not start the translator from the remapped environment too ?

-- 
Richard Braun
Re: Hurd shutdown problems
Hi,

Justus Winter wrote:
>Have you tried using halt-hurd instead of shutdown? As far as I can
>remember, halt-hurd has never caused file system corruption for me,
>but I'm pretty sure shutdown did way back when I was still trying
>to use it.

That is correct. halt-hurd is basically halt -f, which is safe on the Hurd, but skips the sysvinit shutdown. However, we need to figure out why this hangs every now and then.

In my personal experience, I had "hangs" when I had a telnet session open (I think also ssh; I shall try again). Usually all connected clients should get disconnected. If I power on the Hurd, then log in from the console and shut it down, it works reliably.

Riccardo
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
Christian Seiler writes:

>>>> Use the remap translator instead, which is one of the things the Hurd
>>>> design allows you to do easily.
>>>>
>>>> See /bin/remap to easily set one.
>>>
>>> remap doesn't work at all here, programs then complain
>>> that they can't assign requested address when doing any
>>> socket operation.
>>
>> Seems to work fine here:
>>
>> teythoon@hurdbox ~ % cd /tmp
>> teythoon@hurdbox /tmp % settrans -ac 1 /hurd/pflocal
>> teythoon@hurdbox /tmp % remap /servers/socket/1 /tmp/1 -- /bin/bash -c 'echo huhu world | wc'
>> 1 2 11
>
> For pipes yes, for named sockets (which is what open-isns
> uses): no.
>
> $ cd /tmp
> $ settrans -ac 1 /hurd/pflocal
> $ remap /servers/socket/1 /tmp/1 -- python3
> Python 3.5.2+ (default, Aug 5 2016, 08:07:14)
> [GCC 6.1.1 20160705] on gnu0
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import socket
> >>> s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
> >>> s.bind('/tmp/test.sock')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> OSError: [Errno 1073741873] Cannot assign requested address
>
> (Same also from C programs, Python is just easier to test.)
>
> The same python code works if you run it without remap.

Right, I can see how this is a problem. The thing is, remap doesn't quite do the job: 1/ it fails to remap relative paths, 2/ if one sets a translator record on a node, and that translator is then started by the filesystem, it is started "outside" of the remap environment. I believe 2/ is what happens here.

fakeroot has the same problem. For me, the lack of robust lightweight virtualization is the most pressing shortcoming of the Hurd, and I did some work to address this. AIUI remap/fakeroot must prevent the filesystem from starting the translator and do it themselves to make the translation more correct.

> Anyway, not terribly important to me, rebooting did work fine
> anyway, and I now have a working patch for open-isns that will
> make it work on Hurd once my other patch against pflocal's
> socket.c is merged.

Cool!

Cheers,
Justus
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
On 08/08/2016 12:18 PM, Justus Winter wrote:
>> [settrans -ck stuff]
> All in all this was just bad advice.

Ok, good to know. :)

>>> Use the remap translator instead, which is one of the things the Hurd
>>> design allows you to do easily.
>>>
>>> See /bin/remap to easily set one.
>>
>> remap doesn't work at all here, programs then complain
>> that they can't assign requested address when doing any
>> socket operation.
>
> Seems to work fine here:
>
> teythoon@hurdbox ~ % cd /tmp
> teythoon@hurdbox /tmp % settrans -ac 1 /hurd/pflocal
> teythoon@hurdbox /tmp % remap /servers/socket/1 /tmp/1 -- /bin/bash -c 'echo huhu world | wc'
> 1 2 11

For pipes yes, for named sockets (which is what open-isns uses): no.

$ cd /tmp
$ settrans -ac 1 /hurd/pflocal
$ remap /servers/socket/1 /tmp/1 -- python3
Python 3.5.2+ (default, Aug 5 2016, 08:07:14)
[GCC 6.1.1 20160705] on gnu0
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
>>> s.bind('/tmp/test.sock')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 1073741873] Cannot assign requested address

(Same also from C programs, Python is just easier to test.)

The same Python code works if you run it without remap.

Anyway, not terribly important to me, rebooting did work fine anyway, and I now have a working patch for open-isns that will make it work on Hurd once my other patch against pflocal's socket.c is merged.

Thanks,
Christian
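The interactive session above can be turned into a small script that is easier to run both with and without remap. This is only a sketch: the helper name try_bind and the use of a temporary directory are illustrative choices, not part of the original report.

```python
import contextlib
import os
import socket
import tempfile


def try_bind(path):
    """Bind an AF_UNIX stream socket to `path`.

    Returns None on success, or the errno of the OSError on failure.
    The socket node is removed afterwards if it was created.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(path)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        s.close()
        # The node may or may not exist, depending on where bind failed.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(path)


with tempfile.TemporaryDirectory() as tmpdir:
    result = try_bind(os.path.join(tmpdir, "test.sock"))
    print("bind ok" if result is None else "bind failed, errno %d" % result)
```

Run directly, the script should report a successful bind; started under remap as in the session above, it would presumably report the same errno as the traceback.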
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
Christian Seiler writes:

> (The following is not really important, rebooting does
> work, so it's not a showstopper.)
>
> On 08/07/2016 09:13 PM, Richard Braun wrote:
>> On Sun, Aug 07, 2016 at 08:44:56PM +0300, Esa Peuha wrote:
>>>> PS: Is there any way to sanely restart /hurd/pflocal without rebooting?
>>>
>>> Yes, the commands to do that are
>>>
>>> settrans -ck /servers/socket/1
>>> settrans -ck /servers/socket/1 /hurd/pflocal
>
> FYI: that's really weird: the translator appears to be
> replaced on my system (up to date Debian sid), but from
> the response of programs, the old one still appears to
> be used.

Yes, that's what the -k is for, it keeps the old translator running. Also, without specifying -a, settrans only stores the translator record, which does not change. -c creates the node, which already exists. All in all this was just bad advice.

>> Use the remap translator instead, which is one of the things the Hurd
>> design allows you to do easily.
>>
>> See /bin/remap to easily set one.
>
> remap doesn't work at all here, programs then complain
> that they can't assign requested address when doing any
> socket operation.

Seems to work fine here:

teythoon@hurdbox ~ % cd /tmp
teythoon@hurdbox /tmp % settrans -ac 1 /hurd/pflocal
teythoon@hurdbox /tmp % remap /servers/socket/1 /tmp/1 -- /bin/bash -c 'echo huhu world | wc'
1 2 11

Cheers,
Justus
Re: [PATCH] [hurd] pflocal/socket.c: Support MSG_DONTWAIT in pflocal send/recv
(The following is not really important, rebooting does work, so it's not a showstopper.)

On 08/07/2016 09:13 PM, Richard Braun wrote:
> On Sun, Aug 07, 2016 at 08:44:56PM +0300, Esa Peuha wrote:
>>> PS: Is there any way to sanely restart /hurd/pflocal without
>>> rebooting?
>>
>> Yes, the commands to do that are
>>
>> settrans -ck /servers/socket/1
>> settrans -ck /servers/socket/1 /hurd/pflocal

FYI: that's really weird: the translator appears to be replaced on my system (up to date Debian sid), but from the response of programs, the old one still appears to be used.

> Use the remap translator instead, which is one of the things the Hurd
> design allows you to do easily.
>
> See /bin/remap to easily set one.

remap doesn't work at all here, programs then complain that they can't assign requested address when doing any socket operation.

Regards,
Christian
Re: Hurd shutdown problems
"Brent W. Baccala" writes:

> On Sat, Aug 6, 2016 at 7:59 AM, Justus Winter wrote:
>
>>
>> To prevent filesystem damage, try the following. Break into the kernel
>> debugger, and kill the auth server using:
>>
>> !task_terminate($task5)
>>
>> Then continue using "c", and /hurd/startup should cleanly shut down the
>> system.
>>
>>
> The problem seems to be caused by a failure to swapoff the swap space.
> Since I've started paying attention to the swap space usage, I've always
> been able to cleanly shut down if no swap is in use. Once, when a small
> amount of swap was in use (7 MB), I was able to shut down cleanly. After a
> decent sized compile, however, with 100 MB or so of swap in use, I always
> get this:
>
> Deactivating swap...swapoff: /dev/hd0s5: 177152k swap space
> swapoff: /dev/hd0s5: (os/kern) failure
> failed.
> Unmounting weak filesystems...umount: /etc/mtab: Warning: duplicate entry
> for device /dev/hd0s1 (/dev/cons)
> done.
> mount: cannot remount /: Device or resource busy
> Will now halt.
>
> Now everything stops.

Interesting. There is a utility in the Hurd tree called 'vmallocate' that can be used to allocate and dirty large amounts of memory to trigger such issues. Unfortunately it isn't shipped with Debian iirc.

> What happens if I now try Justus's advice?
>
> Stopped at 0x81be: leave
> Kernel Page fault trap, eip 0x81029b4e
> Caught Page fault (14), code = 0, pc = 81029b4e

Well, your system seems to be in bad shape when entering the debugger: a kernel fault occurred. You cannot reasonably expect anything at this point.

But yes, it fails from time to time; usually when it fails I see the kernel rebooting as soon as I call the task_terminate function. I guess it is because one can, by chance, break into the debugger when the system is in an inconsistent state.

Cheers,
Justus
Re: Hurd shutdown problems
On Sat, Aug 6, 2016 at 7:59 AM, Justus Winter wrote:

>
> To prevent filesystem damage, try the following. Break into the kernel
> debugger, and kill the auth server using:
>
> !task_terminate($task5)
>
> Then continue using "c", and /hurd/startup should cleanly shut down the
> system.
>
>

The problem seems to be caused by a failure to swapoff the swap space. Since I've started paying attention to the swap space usage, I've always been able to cleanly shut down if no swap is in use. Once, when a small amount of swap was in use (7 MB), I was able to shut down cleanly. After a decent sized compile, however, with 100 MB or so of swap in use, I always get this:

Deactivating swap...swapoff: /dev/hd0s5: 177152k swap space
swapoff: /dev/hd0s5: (os/kern) failure
failed.
Unmounting weak filesystems...umount: /etc/mtab: Warning: duplicate entry for device /dev/hd0s1 (/dev/cons)
done.
mount: cannot remount /: Device or resource busy
Will now halt.

Now everything stops.

What happens if I now try Justus's advice?

Stopped at 0x81be: leave
Kernel Page fault trap, eip 0x81029b4e
Caught Page fault (14), code = 0, pc = 81029b4e
db> !task_terminate($task5)
Kernel Page fault trap, eip 0x81029b4e
Caught Page fault (14), code = 0, pc = 81029b4e
db> c

...and nothing. Break back into the debugger and nothing has changed. "show all tasks" still shows /hurd/auth running as ID 5.

agape
brent