On Tue, Aug 23, 2016 at 3:28 PM, Justin Cinkelj <justin.cink...@xlab.si>
wrote:

>
> @nadav
> With the second RFC patch, osv_execve doesn't return thread_id==0 any more.
>
>
> (code after the RFC-2 patch) The application::start_and_join() line
> sched::thread::current()->set_app_runtime(runtime()); has to finish, so
> that with_all_app_threads works as desired. I think that's in line with your
> explanation. If I add sleep(1) just before it, then the problem becomes 100%
> reproducible (it was 10-20% reproducible before).
>

Yes, so I think we understand now why the problem happens. Now the question
is how to fix it.

I have some ideas on how it might be possible to fix it, i.e., to delay
osv_execve()'s return until the new thread has set its new app_runtime.

However, I started wondering whether we should fix anything in osv_execve():

osv_execve() promises to return the new thread id, and that it does (after
my last patch).
However, it doesn't promise anything about how far along this thread has
gotten: Did it load the executable? Start to run it? Did it even set the
thread's app_runtime? We don't know.
The question becomes: if we start promising more, why the app_runtime
thing? Isn't this just one of the arbitrary things that happen while the
application is starting - why promise that in particular?
If osv_execve() returned an app instead of a thread id, things would be
somewhat different, but even then, there is a question whether it is fine
for osv_execve() to return an app that no thread yet belongs to - or
whether we should wait until at least the one thread we created belongs to
this app.

So this got me thinking: what if we just decide that osv_execve()'s caller
is not guaranteed the app_runtime was set?
The caller (your MPI code looking for threads to setaffinity for) would
just need to loop, checking find_by_id(tid)->app_runtime() and stopping if
find_by_id can't find the thread (this means it has already exited!) or if
app_runtime() has changed from what sched::thread::current()->app_runtime() was.
Alternatively, if the setaffinity code already works in a loop (since
threads can be created after startup!), it's even simpler: Just check if
find_by_id(tid)->app_runtime() is still the same as current->app_runtime()
- and if it is, just don't do anything (a later iteration of the loop will
find the threads).
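
To make the two variants concrete, here is a rough sketch of what I mean. It is not OSv code: thread_table, app_runtime, check_once() and wait_until_ready() are hypothetical stand-ins I made up for illustration, with the table's lookup() playing the role of find_by_id(tid)->app_runtime(). The max_polls bound and the 1 ms sleep are also just assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for OSv's application runtime object.
struct app_runtime {};
using runtime_ptr = std::shared_ptr<app_runtime>;

// Hypothetical stand-in for the scheduler's thread list.
class thread_table {
public:
    // Register a thread, or overwrite the runtime of an existing one.
    void set(unsigned tid, runtime_ptr rt) {
        std::lock_guard<std::mutex> guard(_mtx);
        _threads[tid] = std::move(rt);
    }
    // Forget a thread, as if it had exited.
    void remove(unsigned tid) {
        std::lock_guard<std::mutex> guard(_mtx);
        _threads.erase(tid);
    }
    // Returns false if the thread is unknown; otherwise copies its runtime.
    bool lookup(unsigned tid, runtime_ptr& out) {
        std::lock_guard<std::mutex> guard(_mtx);
        auto it = _threads.find(tid);
        if (it == _threads.end()) {
            return false;
        }
        out = it->second;
        return true;
    }
private:
    std::mutex _mtx;
    std::map<unsigned, runtime_ptr> _threads;
};

enum class state { exited, still_starting, ready };

// Second variant: a single non-blocking check, for a caller that already
// runs in a loop and can simply retry on a later iteration.
state check_once(thread_table& table, unsigned tid,
                 const runtime_ptr& caller_rt) {
    runtime_ptr rt;
    if (!table.lookup(tid, rt)) {
        return state::exited;          // thread is gone already
    }
    // Still the caller's runtime means the new app has not set its own yet.
    return rt == caller_rt ? state::still_starting : state::ready;
}

// First variant: poll until the thread exits or its runtime changes.
// Gives up after max_polls attempts and reports still_starting.
state wait_until_ready(thread_table& table, unsigned tid,
                       const runtime_ptr& caller_rt, int max_polls) {
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        state s = check_once(table, tid, caller_rt);
        if (s != state::still_starting) {
            return s;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return state::still_starting;
}
```

In the setaffinity loop, a still_starting result would just mean "skip this tid for now", and an exited result would mean "drop it".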

Does this last plan make sense?
