On Tue, Aug 23, 2016 at 3:28 PM, Justin Cinkelj <justin.cink...@xlab.si> wrote:
> @nadav
> With the second RFC patch, osv_execve doesn't return thread_id==0 any more.
>
> (code after RFC-2 patch) The application::start_and_join() line
> sched::thread::current()->set_app_runtime(runtime()); has to finish, so
> that with_all_app_threads works as desired. I think that's in line with your
> explanation. If I add sleep(1) just before it, then the problem becomes 100%
> reproducible (it was 10-20% reproducible before).

Yes, so I think we now understand why the problem happens. The question is how to fix it.

I have some ideas on how it might be possible to fix it, i.e., delay osv_execve()'s return until the new thread has received its new app_runtime setting.

However, I started wondering whether we should fix anything in osv_execve() at all: osv_execve() promises to return the new thread id, and that it does (after my last patch). However, it doesn't promise anything about how far along this thread has gone: Did it load the executable? Start to run it? Did it even set the thread's app_runtime? We don't know. So if we start promising more, why the app_runtime thing? Isn't this just one of the arbitrary things that happen while the application starts - why promise that in particular?

If osv_execve() returned an app instead of a thread id, things would be somewhat different, but even then there is a question of whether it is fine for osv_execve() to return an app that no thread belongs to yet - or whether we should wait until at least the one thread we created belongs to this app.

So this got me thinking: what if we just decide that osv_execve()'s caller is not guaranteed that the app_runtime was set? The caller (your MPI code looking for threads to setaffinity for) would just need to loop checking find_by_id(tid)->app_runtime(), stopping if find_by_id can't find the thread (this means it has already exited!) or if app_runtime() has changed from what sched::thread::current()->app_runtime() was.
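To make the polling idea concrete, here is a rough sketch. It is not OSv code: mock_registry, runtime_of() and wait_for_new_app_runtime() are hypothetical stand-ins I made up so the example compiles on its own, with runtime_of() playing the role of find_by_id(tid)->app_runtime(). The max_polls bound and the 1 ms sleep are also my own assumptions, not something the plan above requires.

```cpp
#include <chrono>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for OSv's per-application runtime object.
struct app_runtime {};
using runtime_ptr = std::shared_ptr<app_runtime>;

// Hypothetical stand-in for the scheduler's thread table.
struct mock_registry {
    std::mutex mtx;
    std::map<long, runtime_ptr> threads;  // tid -> runtime the thread belongs to

    // Plays the role of find_by_id(tid)->app_runtime().
    // Sets `found` to false when the thread no longer exists.
    runtime_ptr runtime_of(long tid, bool& found) {
        std::lock_guard<std::mutex> lock(mtx);
        auto it = threads.find(tid);
        found = (it != threads.end());
        return found ? it->second : nullptr;
    }
    void set(long tid, runtime_ptr r) {
        std::lock_guard<std::mutex> lock(mtx);
        threads[tid] = std::move(r);
    }
    void erase(long tid) {
        std::lock_guard<std::mutex> lock(mtx);
        threads.erase(tid);
    }
};

enum class wait_result { new_runtime_set, thread_exited, timed_out };

// Poll until the thread returned by osv_execve() either has an app_runtime
// different from the caller's, or has exited. Gives up after max_polls
// iterations (an assumption of this sketch).
wait_result wait_for_new_app_runtime(mock_registry& reg, long tid,
                                     const runtime_ptr& callers_runtime,
                                     int max_polls = 1000) {
    for (int i = 0; i < max_polls; ++i) {
        bool found = false;
        runtime_ptr r = reg.runtime_of(tid, found);
        if (!found) {
            return wait_result::thread_exited;    // thread already finished
        }
        if (r != callers_runtime) {
            return wait_result::new_runtime_set;  // set_app_runtime() has run
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return wait_result::timed_out;
}
```

The caller would pass its own current app_runtime as callers_runtime and only go on to look for the new application's threads once the result is new_runtime_set.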
Alternatively, if the setaffinity code already works in a loop (since threads can be created after startup!), it's even simpler: just check whether find_by_id(tid)->app_runtime() is still the same as sched::thread::current()->app_runtime() - and if it is, don't do anything (a later iteration of the loop will find the threads).

Does this last plan make sense?