I did some more digging around today. I think what’s happening is that ext2fs tries to handle a pager RPC while the disk is being remounted.
We do call ports_inhibit_class_rpcs, which waits until all RPCs for that class have finished. However, we call it with diskfs_protid_class, which does *not* include the pager ports. Those are added to _pager_class (libpager/priv.h) in pager_create (libpager/pager-create.c:32) and to disk_pager_bucket (ext2fs/pager.c) in create_disk_pager (ext2fs/pager.c). As a result, I believe we can still get pager RPCs while remounting, which leads to the call to ext2_getblk.

Below is the stack for the call to ext2_getblk that dereferences sblock while it is NULL:

0 ext2fs/getblk.c:253 (ext2_getblk)
1 ext2fs/pager.c:147 (find_block)
2 ext2fs/pager.c:244 (file_pager_read_page)
3 ext2fs/pager.c:550 (pager_read_page)
4 libpager/data-request.c:113 (_pager_S_memory_object_data_request)
5 libpager/memory_objectServer.c:443 (_Xmemory_object_data_request)
6 libpager/demuxer.c:215 (worker_func)
7 libpthread/pthread/pt-create.c:64 (entry_point)

James Clarke

> On 27 Jun 2015, at 20:34, Richard Braun <rbr...@sceen.net> wrote:
>
> On Sat, Jun 27, 2015 at 03:39:58PM +0100, James Clarke wrote:
>> I have been suffering a lot from my Hurd system (running in VirtualBox)
>> hanging at startup, just after "Hurd server bootstrap..." but before "INIT:
>> version 2.88 booting".
>>
>> I have been able to trace it back to getblk.c:248 (unsigned long
>> addr_per_block = EXT2_ADDR_PER_BLOCK (sblock);) in ext2_getblk. It faults
>> because sblock is NULL.
>>
>> I have traced the execution with debugging statements, and what seems to
>> happen is as follows:
>>
>> 1. diskfs_remount is called (because root is remounted as rw)
>> 2. RPCs are inhibited
>> 3. diskfs_reload_global_state is called
>> 4. sblock is set to NULL
>> 5. While this is happening, ext2_getblk is called
>>
>> If you’re lucky, the superblock is read and sblock is set to point to this
>> data before 5 (or at least before it gets to dereferencing sblock). If not,
>> sblock is still NULL and thus a page fault is raised, causing the system to
>> be stuck.
>>
>> Does anyone have an idea how this situation could be occurring?
>
> My initial thought would be "how could it not happen ?".
>
> Despite diskfs_remount calling ports_inhibit_class_rpcs, other threads
> can very well be running to process previously received messages. There
> seems to be no other form of access synchronization such as locks in
> diskfs_reload_global_state.
>
> Can you get the call trace leading to ext2_getblk ? I'm not sure about
> backtrace(3) in static executables but it might be worth trying.
>
> --
> Richard Braun
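
For illustration, here is a toy model of the class-scoping problem described at the top of this message: inhibiting one port class does nothing for ports that belong to a different class. Everything in it (toy_class, toy_port, toy_inhibit_class, toy_rpc_would_run, toy_remount_is_racy) is a hypothetical stand-in and not the libports or libpager API. It also does not model waiting for in-progress RPCs to finish, so treat it as a sketch of the reasoning rather than a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of this is the libports API.  */
struct toy_class
{
  const char *name;
  bool inhibited;               /* true while RPCs on this class are held off */
};

struct toy_port
{
  struct toy_class *cls;        /* each port belongs to exactly one class */
};

/* Hold off RPCs on one class; ports in other classes are unaffected.  */
void
toy_inhibit_class (struct toy_class *c)
{
  assert (c != NULL);
  c->inhibited = true;
}

/* Would an RPC arriving on this port be dispatched right now?  */
bool
toy_rpc_would_run (const struct toy_port *p)
{
  assert (p != NULL && p->cls != NULL);
  return !p->cls->inhibited;
}

/* Model the remount window.  Returns true if a pager RPC could still be
   dispatched while the (modelled) superblock pointer is NULL.  */
bool
toy_remount_is_racy (bool also_inhibit_pagers)
{
  struct toy_class protid_class = { "protid", false };
  struct toy_class pager_class = { "pager", false };
  struct toy_port pager_port = { &pager_class };
  const char *toy_sblock = "superblock";

  toy_inhibit_class (&protid_class);
  if (also_inhibit_pagers)
    toy_inhibit_class (&pager_class);

  toy_sblock = NULL;            /* reload has begun, superblock not yet re-read */

  /* A dispatched pager RPC would reach the getblk path and dereference NULL.  */
  return toy_sblock == NULL && toy_rpc_would_run (&pager_port);
}
```

In this model, toy_remount_is_racy (false) corresponds to inhibiting only the protid class and returns true, while toy_remount_is_racy (true) returns false. Whether inhibiting the pager class as well is the right fix for the real code is a separate question.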