>On 23/01/17(Mon) 17:03, Ted Unangst wrote:
>> Martin Pieuchot wrote:
>> > On 17/01/17(Tue) 00:24, Alexander Bluhm wrote:
>> > > On Mon, Jan 16, 2017 at 08:36:46PM +0100, cheek...@gmx.com wrote:
>> > > > kernel: protection fault trap, code=0
>> > > > Stopped at fd_getfile+0x20: testb $0x2,mptramp_gdt32_desc+0x1e(%rax)
>> > > > ddb{3}> fd_getfile() at fd_getfile+0x20
>> > > > sys_fstat() at sys_fstat+0x43
>> > > > syscall() at syscall+0x27b
>> > >
>> > > It crashes in fd_getfile() FILE_IS_USABLE(fp) as fdp->fd_ofiles has
>> > > been freed.
>> >
>> > Are you sure? The faulting instruction is:
>> >
>> > /sys/kern/kern_descrip.c:190
>> > 80: f6 40 40 02 testb $0x2,0x40(%rax)
>> > 84: 75 e7 jne 6d <fd_getfile+0xd>
>> >
>> > So %rax contains an incorrect value which is not NULL, are you
>> > suggesting that it is garbage due to free(9) poisoning the memory?
>> >
>> > If that's the case, the easiest fix in would be to do the allocations
>> > upfront. An alternative solution discussed here at a2k17 would be to
>> > use a SRP to guarantee that fd_ofiles is not freed while another CPU is
>> > still referencing it. That would also help for MP work.
>> >
>> > Or could it be another race that your lock is preventing?
>> >
>> > What about the diff below? Cheeky does it help?
>>
>> In that case, I think it's better to move the free down. This is more
>> idiomatic, free the old value and immediately replace.
>
>Sure I can do that, I still want to add a comment since it might bite us
>later.
>
>So a counter diff is like an ok?
My theory is that memory pressure causes one or two of these mallocs to wait, and a threading bug then lets another thread enter this code concurrently. It seems better to put all the sleeping mallocs together, so that no sleeping operations occur during the manipulation. So I favor mpi's diff. Are there any occurrences in which free sleeps?
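
To make the ordering I have in mind concrete, here is a rough userland sketch. It is not mpi's diff and not the kernel code: struct table, table_init() and table_grow() are made-up names, calloc() stands in for a sleeping malloc(9), and a pthread mutex stands in for whatever ends up protecting fd_ofiles. The point is only that the allocation which may sleep happens before the table is touched, the swap itself cannot sleep, and the old array is freed after the swap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a descriptor table; not the kernel's struct. */
struct table {
	pthread_mutex_t	  lock;
	void		**slots;
	size_t		  nslots;
};

static int
table_init(struct table *t, size_t n)
{
	t->slots = calloc(n, sizeof(*t->slots));
	if (t->slots == NULL)
		return -1;
	t->nslots = n;
	if (pthread_mutex_init(&t->lock, NULL) != 0) {
		free(t->slots);
		return -1;
	}
	return 0;
}

/* Grow the table to `want` slots; a no-op if it is already that large. */
static int
table_grow(struct table *t, size_t want)
{
	void **newslots, **oldslots;

	assert(want > 0);

	/* Step 1: every allocation that may sleep happens up front. */
	newslots = calloc(want, sizeof(*newslots));
	if (newslots == NULL)
		return -1;

	/* Step 2: manipulate the table; nothing in here can sleep. */
	pthread_mutex_lock(&t->lock);
	if (want <= t->nslots) {
		/* Someone else grew the table while we were allocating. */
		pthread_mutex_unlock(&t->lock);
		free(newslots);
		return 0;
	}
	memcpy(newslots, t->slots, t->nslots * sizeof(*newslots));
	oldslots = t->slots;
	t->slots = newslots;
	t->nslots = want;
	pthread_mutex_unlock(&t->lock);

	/* Step 3: release the old array only after the swap is visible. */
	free(oldslots);
	return 0;
}
```

This only helps if readers look at slots under the same lock (or something like the SRP idea mentioned above); a reader that fetched the old pointer without any protection can still end up dereferencing freed memory.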