Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
I have now got a patch that drains down the queue, and it does successfully stop sblock from being dereferenced etc. when we actually reload. However, sometimes thread 5 (the same one that would dereference sblock) seems to get stuck in vm_fault_continue (at least according to the kernel debugger), so I need to do some more debugging to see why.

James

On 19 Jul 2015, at 15:00, Richard Braun rbr...@sceen.net wrote:

> On Sun, Jul 19, 2015 at 02:25:14PM +0100, James Clarke wrote:
>> Yeah, I tried inhibiting both buckets, but the paging RPCs still got through, so my guess was that libports's inhibit/resume methods weren't able to deal with libpager's own threads. The thing is I don't think we currently keep track of any reference to the main/worker threads, as pager_start_workers just takes a bucket and returns void. Is there a way we can instead make the main thread and/or workers able to block ports_inhibit_X_rpcs like normal RPC handlers and be cancelled etc? If possible I think that would be a cleaner solution.
>
> To continue our discussion on IRC: No, it would definitely not be a cleaner solution, just an ugly hack. Since paging doesn't occur as part of an RPC, you just can't use RPC stuff to manage it. I suggest building rwlock-based synchronization functions specific to the pager workers.
>
> --
> Richard Braun

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Hello James :)

Quoting James Clarke (2015-07-15 22:20:57)
> I had a look today at what's happening, and it's that the *file* pager is trying to read from disk. Any thoughts?

There is another thing I forgot. libpager is special, it has its own demuxer (see libpager/demuxer.c) that writes requests into a queue, and a pool of workers that process requests from said queue.

The thing is, when we inhibit the pager RPCs, we merely prevent new ones from being enqueued, but we don't prevent the workers from processing already enqueued requests. So we indeed need to add functions to inhibit and restart paging to libpager that know about the queue.

Justus

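To make the point about the queue concrete, here is a minimal sketch of an inhibit/resume pair that knows about both queued and in-flight requests. It is not libpager code: `struct request_queue` and every `queue_*` function are hypothetical names invented for this illustration, and the real demuxer and worker structures differ. The sketch freezes the queue (workers stop claiming requests, and the inhibiting thread waits only for requests already being processed); draining the queue first, as the patch mentioned elsewhere in the thread does, would be an alternative policy.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request queue state; the real libpager structures differ. */
struct request_queue
{
  pthread_mutex_t lock;
  pthread_cond_t changed;
  int pending;      /* enqueued by the demuxer, not yet claimed */
  int in_flight;    /* claimed by a worker, still being processed */
  bool inhibited;   /* workers must not claim requests while set */
};

void queue_init (struct request_queue *q)
{
  pthread_mutex_init (&q->lock, NULL);
  pthread_cond_init (&q->changed, NULL);
  q->pending = 0;
  q->in_flight = 0;
  q->inhibited = false;
}

/* Demuxer side: record a new request and wake the workers. */
void queue_enqueue (struct request_queue *q)
{
  pthread_mutex_lock (&q->lock);
  q->pending++;
  pthread_cond_broadcast (&q->changed);
  pthread_mutex_unlock (&q->lock);
}

/* Worker side: block until a request may be claimed, then claim it. */
void queue_begin (struct request_queue *q)
{
  pthread_mutex_lock (&q->lock);
  while (q->inhibited || q->pending == 0)
    pthread_cond_wait (&q->changed, &q->lock);
  q->pending--;
  q->in_flight++;
  pthread_mutex_unlock (&q->lock);
}

/* Worker side: the claimed request has been fully processed. */
void queue_end (struct request_queue *q)
{
  pthread_mutex_lock (&q->lock);
  assert (q->in_flight > 0);
  q->in_flight--;
  pthread_cond_broadcast (&q->changed);
  pthread_mutex_unlock (&q->lock);
}

/* Stop workers from claiming requests, then wait for claimed ones. */
void queue_inhibit (struct request_queue *q)
{
  pthread_mutex_lock (&q->lock);
  q->inhibited = true;
  while (q->in_flight > 0)
    pthread_cond_wait (&q->changed, &q->lock);
  pthread_mutex_unlock (&q->lock);
}

/* Let workers claim queued requests again. */
void queue_resume (struct request_queue *q)
{
  pthread_mutex_lock (&q->lock);
  q->inhibited = false;
  pthread_cond_broadcast (&q->changed);
  pthread_mutex_unlock (&q->lock);
}
```

A worker loop would bracket each request with `queue_begin` and `queue_end`, and the reload path would bracket the window in which global state is invalid with `queue_inhibit` and `queue_resume`. Requests that were already enqueued stay in `pending` until the resume, which is exactly the case that inhibiting the RPCs alone does not cover.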
Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Yeah, I tried inhibiting both buckets, but the paging RPCs still got through, so my guess was that libports's inhibit/resume methods weren't able to deal with libpager's own threads. The thing is I don't think we currently keep track of any reference to the main/worker threads, as pager_start_workers just takes a bucket and returns void. Is there a way we can instead make the main thread and/or workers able to block ports_inhibit_X_rpcs like normal RPC handlers and be cancelled etc? If possible I think that would be a cleaner solution.

James

On 19 Jul 2015, at 13:50, Justus Winter 4win...@informatik.uni-hamburg.de wrote:

> Hello James :)
>
> Quoting James Clarke (2015-07-15 22:20:57)
>> I had a look today at what's happening, and it's that the *file* pager is trying to read from disk. Any thoughts?
>
> There is another thing I forgot. libpager is special, it has its own demuxer (see libpager/demuxer.c) that writes requests into a queue, and a pool of workers that process requests from said queue.
>
> The thing is, when we inhibit the pager RPCs, we merely prevent new ones from being enqueued, but we don't prevent the workers from processing already enqueued requests. So we indeed need to add functions to inhibit and restart paging to libpager that know about the queue.
>
> Justus

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
On Sun, Jul 19, 2015 at 02:25:14PM +0100, James Clarke wrote:
> Yeah, I tried inhibiting both buckets, but the paging RPCs still got through, so my guess was that libports's inhibit/resume methods weren't able to deal with libpager's own threads. The thing is I don't think we currently keep track of any reference to the main/worker threads, as pager_start_workers just takes a bucket and returns void. Is there a way we can instead make the main thread and/or workers able to block ports_inhibit_X_rpcs like normal RPC handlers and be cancelled etc? If possible I think that would be a cleaner solution.

To continue our discussion on IRC: No, it would definitely not be a cleaner solution, just an ugly hack. Since paging doesn't occur as part of an RPC, you just can't use RPC stuff to manage it. I suggest building rwlock-based synchronization functions specific to the pager workers.

--
Richard Braun

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
As discussed in IRC, this successfully stopped the disk pager from dereferencing sblock. However, it was still hanging at boot a lot of the time (seemingly if and only if I booted in normal mode, i.e. not recovery mode, but that's probably just a timing thing).

I had a look today at what's happening, and it's that the *file* pager is trying to read from disk. Any thoughts?

James

On 14 Jul 2015, at 20:54, Justus Winter 4win...@informatik.uni-hamburg.de wrote:

> Hi James :)
>
> you found a long-standing bug in ext2fs. Fixing it allows us to get rid of the ugly workaround in daemons/runsystem.sh (look for `XXX').
>
> Quoting Richard Braun (2015-07-13 10:16:14)
>> On Sun, Jul 12, 2015 at 12:56:31PM +0100, James Clarke wrote:
>>> That doesn’t seem to boot at all. I had tried changing it to inhibiting all RPCs (it looks like you’ve inhibited an extra class?), but it seems that paging is needed? Perhaps part of ext2fs gets paged out, and it needs to be paged in when remounting?
>>
>> Remounting can require paging out, yes. See diskfs_reload_global_state in ext2fs:
>>
>> diskfs_reload_global_state ()
>> {
>>   pokel_flush (global_pokel);
>>   pager_flush (diskfs_disk_pager, 1);
>
> So I guess we need to inhibit the RPCs here, not before calling diskfs_reload_global_state, then do:
>
>   get_hypermetadata ();
>   map_hypermetadata ();
>
> And reenable them here.
>
>   return 0;
> }
>
> I guess that means changing the diskfs API. James, do you want to give it a shot?
>
> In the meantime, enjoy my hacky workaround:
>
> http://nonmonolithic.org/ext2fs.static
>
> Cheers,
> Justus

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Quoting Richard Braun (2015-07-13 10:16:14)
> On Sun, Jul 12, 2015 at 12:56:31PM +0100, James Clarke wrote:
>> That doesn’t seem to boot at all. I had tried changing it to inhibiting all RPCs (it looks like you’ve inhibited an extra class?), but it seems that paging is needed? Perhaps part of ext2fs gets paged out, and it needs to be paged in when remounting?
>
> Remounting can require paging out, yes. See diskfs_reload_global_state in ext2fs:
>
> diskfs_reload_global_state ()
> {
>   pokel_flush (global_pokel);
>   pager_flush (diskfs_disk_pager, 1);
>   ...

Aha, but this is the disk pager, not the file pager which needs sblock.

Justus

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Hi James :)

you found a long-standing bug in ext2fs. Fixing it allows us to get rid of the ugly workaround in daemons/runsystem.sh (look for `XXX').

Quoting Richard Braun (2015-07-13 10:16:14)
> On Sun, Jul 12, 2015 at 12:56:31PM +0100, James Clarke wrote:
>> That doesn’t seem to boot at all. I had tried changing it to inhibiting all RPCs (it looks like you’ve inhibited an extra class?), but it seems that paging is needed? Perhaps part of ext2fs gets paged out, and it needs to be paged in when remounting?
>
> Remounting can require paging out, yes. See diskfs_reload_global_state in ext2fs:
>
> diskfs_reload_global_state ()
> {
>   pokel_flush (global_pokel);
>   pager_flush (diskfs_disk_pager, 1);

So I guess we need to inhibit the RPCs here, not before calling diskfs_reload_global_state, then do:

  get_hypermetadata ();
  map_hypermetadata ();

And reenable them here.

  return 0;
}

I guess that means changing the diskfs API. James, do you want to give it a shot?

In the meantime, enjoy my hacky workaround:

http://nonmonolithic.org/ext2fs.static

Cheers,
Justus

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
On Sun, Jul 12, 2015 at 12:56:31PM +0100, James Clarke wrote:
> That doesn’t seem to boot at all. I had tried changing it to inhibiting all RPCs (it looks like you’ve inhibited an extra class?), but it seems that paging is needed? Perhaps part of ext2fs gets paged out, and it needs to be paged in when remounting?

Remounting can require paging out, yes. See diskfs_reload_global_state in ext2fs:

diskfs_reload_global_state ()
{
  pokel_flush (global_pokel);
  pager_flush (diskfs_disk_pager, 1);
  ...

--
Richard Braun

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
That doesn’t seem to boot at all. I had tried changing it to inhibiting all RPCs (it looks like you’ve inhibited an extra class?), but it seems that paging is needed? Perhaps part of ext2fs gets paged out, and it needs to be paged in when remounting?

James

On 12 Jul 2015, at 00:27, Justus Winter 4win...@informatik.uni-hamburg.de wrote:

> Quoting James Clarke (2015-07-11 22:33:44)
>> I did some more digging around today. I think what’s happening is that ext2fs tries to handle a pager RPC while the disk is being remounted
>
> Sounds plausible. Could you try:
>
> http://darnassus.sceen.net/~teythoon/ext2fs.static
>
> I'll send the patch as follow-up.
>
> Justus

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Quoting James Clarke (2015-07-12 13:56:31)
> That doesn’t seem to boot at all.

Indeed :/

db> show all tasks
 ID TASK     NAME [THREADS]
  0 f2745f00 gnumach [8]
  1 f2745e40 ext2fs [12]
  2 f2745d80 exec [5]
  3 f2745cc0 (ext2fs) [1]
  4 f2745c00 /hurd/proc [4]
  5 f2745b40 /hurd/auth [5]
  6 f2745a80 /bin/sh(1) [2]
  7 f27459c0 /hurd/term(8) [5]
  8 f2745900 /hurd/pflocal(9) [7]
  9 f2745780 (/hurd/mach-defpager(10)) [6]
 10 f2745840 fsysopts(13) [2]

> I had tried changing it to inhibiting all RPCs (it looks like you’ve inhibited an extra class?), but it seems that paging is needed? Perhaps part of ext2fs gets paged out, and it needs to be paged in when remounting?

Perhaps.

Justus

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
I did some more digging around today. I think what’s happening is that ext2fs tries to handle a pager RPC while the disk is being remounted. We do call ports_inhibit_class_rpcs, which will wait until all RPCs for that class have finished. However, we call this with diskfs_protoid_class, which does *not* include the pager ports. These are added to _pager_class (libpager/priv.h) in pager_create (libpager/pager-create.c:32) and disk_pager_bucket (ext2fs/pager.c) in create_disk_pager (ext2fs/pager.c), and so as a result I believe we can get pager RPCs while remounting, leading to the call to ext2_getblk.

Below is the stack for the call to ext2_getblk that leads to dereferencing sblock when it is NULL:

0 ext2fs/getblk.c:253 (ext2_getblk)
1 ext2fs/pager.c:147 (find_block)
2 ext2fs/pager.c:244 (file_pager_read_page)
3 ext2fs/pager.c:550 (pager_read_page)
4 libpager/data-request.c:113 (_pager_S_memory_object_data_request)
5 libpager/memory_objectServer.c:443 (_Xmemory_object_data_request)
6 libpager/demuxer.c:215 (worker_func)
7 libpthread/pthread/pt-create.c:64 (entry_point)

James Clarke

On 27 Jun 2015, at 20:34, Richard Braun rbr...@sceen.net wrote:

> On Sat, Jun 27, 2015 at 03:39:58PM +0100, James Clarke wrote:
>> I have been suffering a lot from my Hurd system (running in VirtualBox) hanging at startup, just after Hurd server bootstrap... but before INIT: version 2.88 booting. I have been able to trace it back to getblk.c:248 (unsigned long addr_per_block = EXT2_ADDR_PER_BLOCK (sblock);) in ext2_getblk. It faults because sblock is NULL. I have traced the execution with debugging statements, and what seems to happen is as follows:
>>
>> 1. diskfs_remount is called (because root is remounted as rw)
>> 2. RPCs are inhibited
>> 3. diskfs_reload_global_state is called
>> 4. sblock is set to NULL
>> 5. While this is happening, ext2_getblk is called
>>
>> If you’re lucky, the superblock is read and sblock is set to point to this data before 5 (or at least before it gets to dereferencing sblock). If not, sblock is still NULL and thus a page fault is raised, causing the system to be stuck.
>>
>> Does anyone have an idea how this situation could be occurring?
>
> My initial thought would be "how could it not happen?". Despite diskfs_remount calling ports_inhibit_class_rpcs, other threads can very well be running to process previously received messages. There seems to be no other form of access synchronization such as locks in diskfs_reload_global_state.
>
> Can you get the call trace leading to ext2_getblk? I'm not sure about backtrace(3) in static executables but it might be worth trying.
>
> --
> Richard Braun

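The class mismatch suspected in the message above can be pictured with a toy model. This is deliberately not the libports data model: `struct toy_class`, `struct toy_port` and the functions below are hypothetical stand-ins, and the only thing the sketch shows is that marking one class as inhibited says nothing about ports that belong to a different class.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins, NOT the libports data structures. */
struct toy_class { const char *name; bool inhibited; };
struct toy_port  { struct toy_class *cls; };

/* Inhibiting a class only marks that one class. */
static void toy_inhibit_class (struct toy_class *c)
{
  c->inhibited = true;
}

/* A request is accepted unless the port's own class is inhibited. */
static bool toy_accepts_request (const struct toy_port *p)
{
  return !p->cls->inhibited;
}

/* Returns true if a pager port still accepts requests after the
   filesystem-protocol class has been inhibited. */
bool pager_request_slips_through (void)
{
  struct toy_class protoid_class = { "protoid", false };
  struct toy_class pager_class   = { "pager", false };
  struct toy_port file_port  = { &protoid_class };
  struct toy_port pager_port = { &pager_class };

  toy_inhibit_class (&protoid_class);

  /* Ordinary filesystem requests are held back... */
  assert (!toy_accepts_request (&file_port));
  /* ...but the pager port belongs to another class. */
  return toy_accepts_request (&pager_port);
}
```

In the model, `pager_request_slips_through` returns true, which corresponds to a paging request reaching ext2_getblk while the remount is in progress.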
Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Quoting James Clarke (2015-07-11 22:33:44)
> I did some more digging around today. I think what’s happening is that ext2fs tries to handle a pager RPC while the disk is being remounted

Sounds plausible. Could you try:

http://darnassus.sceen.net/~teythoon/ext2fs.static

I'll send the patch as follow-up.

Justus

VirtualBox Hangs Pre-Init Due To Ext2FS Fault
Hi,

I have been suffering a lot from my Hurd system (running in VirtualBox) hanging at startup, just after Hurd server bootstrap... but before INIT: version 2.88 booting. I have been able to trace it back to getblk.c:248 (unsigned long addr_per_block = EXT2_ADDR_PER_BLOCK (sblock);) in ext2_getblk. It faults because sblock is NULL. I have traced the execution with debugging statements, and what seems to happen is as follows:

1. diskfs_remount is called (because root is remounted as rw)
2. RPCs are inhibited
3. diskfs_reload_global_state is called
4. sblock is set to NULL
5. While this is happening, ext2_getblk is called

If you’re lucky, the superblock is read and sblock is set to point to this data before 5 (or at least before it gets to dereferencing sblock). If not, sblock is still NULL and thus a page fault is raised, causing the system to be stuck.

Does anyone have an idea how this situation could be occurring?

James Clarke

Re: VirtualBox Hangs Pre-Init Due To Ext2FS Fault
On Sat, Jun 27, 2015 at 03:39:58PM +0100, James Clarke wrote:
> I have been suffering a lot from my Hurd system (running in VirtualBox) hanging at startup, just after Hurd server bootstrap... but before INIT: version 2.88 booting. I have been able to trace it back to getblk.c:248 (unsigned long addr_per_block = EXT2_ADDR_PER_BLOCK (sblock);) in ext2_getblk. It faults because sblock is NULL. I have traced the execution with debugging statements, and what seems to happen is as follows:
>
> 1. diskfs_remount is called (because root is remounted as rw)
> 2. RPCs are inhibited
> 3. diskfs_reload_global_state is called
> 4. sblock is set to NULL
> 5. While this is happening, ext2_getblk is called
>
> If you’re lucky, the superblock is read and sblock is set to point to this data before 5 (or at least before it gets to dereferencing sblock). If not, sblock is still NULL and thus a page fault is raised, causing the system to be stuck.
>
> Does anyone have an idea how this situation could be occurring?

My initial thought would be "how could it not happen?". Despite diskfs_remount calling ports_inhibit_class_rpcs, other threads can very well be running to process previously received messages. There seems to be no other form of access synchronization such as locks in diskfs_reload_global_state.

Can you get the call trace leading to ext2_getblk? I'm not sure about backtrace(3) in static executables but it might be worth trying.

--
Richard Braun