By the way, a question:
When running with "-V" (--verbose), do you see a warning about a
stubbed "pthread_key_delete" function?
--
Nadav Har'El
n...@scylladb.com <mailto:n...@scylladb.com>
On Sun, Aug 28, 2016 at 12:42 PM, Nadav Har'El <n...@scylladb.com
<mailto:n...@scylladb.com>> wrote:
On Fri, Aug 26, 2016 at 2:55 PM, Justin Cinkelj
<justin.cink...@xlab.si <mailto:justin.cink...@xlab.si>> wrote:
Tested, and the returned thread ID is the correct one now. The
misc-execve.so + /misc-execve-payload.so test wasn't able to
show any problem.
Good, so I'll commit that patch.
But I noticed a different problem when running Open MPI.
It's possible that thread exit handlers are associated with the
wrong thread.
Thanks. I think I know what's going on:
The new application is no longer running in a new thread - rather,
it runs in its parent's thread.
So if the new application registers some sort of thread-exit
callbacks, these won't be run until the thread exits - but this
happens *after* the new application has been unloaded, so the
callback code no longer exists.
By the way, it seems our pthreads thread-specific-data support
(pthread_key_create() et al.) does not support ELF namespaces well,
because the list of destructors (tsd_dtor) is global, not
per-namespace... I think in the past this was just a waste, a leak
of keys, because if !tsd[i] the code in
pthread_register_tsd_dtor_notifier did not try to call the
destructor - but now it becomes an actual crash. I'll produce a
workaround patch for you to test.
```
sudo ./scripts/run.py -d -V -n -v -e '/usr/bin/mpirun -np 2 -H
192.168.122.90 --allow-run-as-root -mca hwloc_base_verbose 10
--report-bindings --bind-to core --map-by core --mca
mpi_paffinity_alone 1 -mca rmaps_base_verbose 5 -mca
rmaps_base_topology /topo/topo-1s-4c.xml -mca
hwloc_base_mem_alloc_policy 'local_only' -mca
hwloc_base_mem_bind_failure_action 1 /usr/lib/mpi_hello.so ' |
tee aa
...
Done
main EXIT
Done
main EXIT
thread_run_app_in_namespace ret = 0 tid=260
[osv:00000] *** Process received signal ***
[osv:00000] Signal: Segmentation fault (11)
[osv:00000] Associated errno: No error information (-32768)
[osv:00000] Signal code: (3994985)
[osv:00000] Failing at address: 0x100601035000
page fault outside application, addr: 0x0000100601035000
[registers]
RIP: 0x00000000007d1c49 <???+8199241>
RFL: 0x0000000000010246 CS: 0x0000000000000008 SS:
0x0000000000000010
RAX: 0x0000000000000005 RBX: 0x00002000005fec30 RCX:
0x000010060103596c RDX: 0x00002000005ffc80
RSI: 0x0000000000000000 RDI: 0x00002000005fea40 RBP:
0x00002000005fed20 R8: 0x0000000000000004
R9: 0x0000000000200000 R10: 0x000000000171ab30 R11:
0x0000000000000000 R12: 0x00002000005fec30
R13: 0x0000000000200000 R14: 0x000000000171ab30 R15:
0x0000000000000000 RSP: 0x00002000005febb0
Aborted
[backtrace]
0x000000000022bc39 <abort(char const*, ...)+270>
0x00000000003c4f6b <???+3952491>
0x00000000003c5108 <mmu::vm_fault(unsigned long,
exception_frame*)+350>
0x000000000048941e <page_fault+310>
0x0000000000488306 <???+4752134>
```
And gdb:
(gdb) bt
#0 processor::cli_hlt () at arch/x64/processor.hh:248
#1 0x0000000000209613 in arch::halt_no_interrupts () at
arch/x64/arch.hh:48
#2 0x0000000000498a1e in osv::halt () at arch/x64/power.cc:24
#3 0x000000000022bc60 in abort (fmt=0x99c32c "Aborted\n") at
runtime.cc:130
#4 0x000000000022bb2b in abort () at runtime.cc:96
#5 0x00000000003c4f6c in mmu::vm_sigsegv
(addr=17617972842496, ef=0xffff800005e23078) at core/mmu.cc:1316
#6 0x00000000003c5109 in mmu::vm_fault (addr=17617972842496,
ef=0xffff800005e23078) at core/mmu.cc:1338
#7 0x000000000048941f in page_fault (ef=0xffff800005e23078)
at arch/x64/mmu.cc:38
#8 <signal handler called>
#9 0x00000000007d1c49 in uw_frame_state_for ()
#10 0x00000000007d3a99 in _Unwind_Backtrace ()
#11 0x000000000066fe10 in backtrace (buffer=0x2000005fef50,
size=32) at libc/misc/backtrace.cc:68
#12 0x000010000607f5ed in opal_backtrace_print (file=0x12faae0
<f>, prefix=0x2000005ff0a0 "[osv:00000] ", strip=2)
at
../../../../../opal/mca/backtrace/execinfo/backtrace_execinfo.c:47
#13 0x000010000607c8bf in show_stackframe (signo=11,
info=0x2000005ffd28, p=0x2000005ff870) at
../../../opal/util/stacktrace.c:336
#14 0x000000000047df78 in call_signal_handler
(frame=0x2000005ffc80) at arch/x64/signal.cc:85
#15 <signal handler called>
#16 0x000010060103596c in ?? ()
#17 0x0000000000680d68 in
pthread_private::__lambda5::operator()
(__closure=0xffff900001e7f000) at libc/pthread.cc:60
#18 0x000000000068347f in std::_Function_handler<void(),
pthread_private::pthread_register_tsd_dtor_notifier()::__lambda5>::_M_invoke(const
std::_Any_data &) (__functor=...) at
/usr/include/c++/4.8.2/functional:2071
#19 0x000000000044a632 in std::function<void ()>::operator()()
const (this=0xffff900001e80010) at
/usr/include/c++/4.8.2/functional:2471
#20 0x00000000005b90c8 in sched::run_exit_notifiers () at
core/sched.cc:984
#21 0x00000000005b9c9f in sched::thread::complete
(this=0xffff800005e1e050) at core/sched.cc:1205
#22 0x00000000005b5b43 in sched::thread_main_c
(t=0xffff800005e1e050) at arch/x64/arch-switch.hh:165
#23 0x0000000000489286 in thread_main () at arch/x64/entry.S:113
(gdb) frame 17
#17 0x0000000000680d68 in
pthread_private::__lambda5::operator()
(__closure=0xffff900001e7f000) at libc/pthread.cc:60
60 tsd_dtor[i](val);
(gdb) p i
$1 = 3
(gdb) p val
$2 = (void *) 0xffffa00003247600
(gdb) p tsd_dtor
$3 = std::vector of length 128, capacity 128 = {0x100005c3596c
<buffer_cleanup>, 0x1000060988f1 <buffer_cleanup>,
0x10040103596c, 0x10060103596c,
0x1004014988f1, 0x1006014988f1, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) frame 21
#21 0x00000000005b9c9f in sched::thread::complete
(this=0xffff800005e1e050) at core/sched.cc:1205
1205 run_exit_notifiers();
(gdb) p this
$4 = (sched::thread * const) 0xffff800005e1e050
(gdb) p *this
$5 = {<sched::timer_base::client> = {_vptr.client = 0xa42c70
<vtable for sched::thread+16>, _timers_need_reload = false,
_active_timers =
{<boost::intrusive::list_impl<boost::intrusive::listopt<boost::intrusive::detail::member_hook_traits<sched::timer_base,
boost::intrusive::list_member_hook<boost::intrusive::none,
boost::intrusive::none, boost::intrusive::none>,
&sched::timer_base::client_hook>, unsigned long, true> >> =
{<boost::intrusive::detail::clear_on_destructor_base<boost::intrusive::list_impl<boost::intrusive::listopt<boost::intrusive::detail::member_hook_traits<sched::timer_base,
boost::intrusive::list_member_hook<boost::intrusive::none,
boost::intrusive::none, boost::intrusive::none>,
&sched::timer_base::client_hook>, unsigned long, true> > >> =
{<No data fields>}, static external_value_traits = false,
static constant_time_size = true,
static stateful_value_traits = <optimized out>,
data_ =
{<boost::intrusive::detail::member_hook_traits<sched::timer_base,
boost::intrusive::list_member_hook<boost::intrusive::none,
boost::intrusive::none, boost::intrusive::none>,
&sched::timer_base::client_hook>> = {static link_mode =
<optimized out>},
root_plus_size_ =
{<boost::intrusive::detail::size_holder<true, unsigned long>>
= {static constant_time_size = <optimized out>, size_ = 0},
root_ = {next_ = 0xffff800005e1e068, prev_ =
0xffff800005e1e068}}}}, <No data fields>}}, _detach_state = {
_M_i = sched::thread::detach_state::detached}, static
priority_idle = inf, static priority_default = 1,
static priority_infinity = 9.99999975e-06, _func =
{<std::_Maybe_unary_or_binary_function<void>> = {<No data
fields>}, <std::_Function_base> = {
static _M_max_size = 16, static _M_max_align = 8,
_M_functor = {_M_unused = {_M_object = 0xffffa00005cfc500,
_M_const_object = 0xffffa00005cfc500,
_M_function_pointer = 0xffffa00005cfc500,
_M_member_pointer = (void
(std::_Undefined_class::*)(std::_Undefined_class * const))
0xffffa00005cfc500, this adjustment 7953674122929206130},
_M_pod_data = "\000\305\317\005\000\240\377\377rsity an"},
_M_manager = 0x683588
<std::_Function_base::_Base_manager<pthread_private::pthread::pthread(void*
(*)(void*), void*, sigset_t, const
pthread_private::thread_attr*)::__lambda6>::_M_manager(std::_Any_data
&, const std::_Any_data &, std::_Manager_operation)>},
_M_invoker = 0x683566 <std::_Function_handler<void(),
pthread_private::pthread::pthread(void* (*)(void*), void*,
sigset_t, const
pthread_private::thread_attr*)::__lambda6>::_M_invoke(const
std::_Any_data &)>}, _state = {exception_stack =
0xffff800001a2e0f0 "!", <incomplete sequence \342>,
rsp = 0xffff800005e1eb40, rbp = 0xffff800005e1eba0, rip =
0x5b5690 <sched::thread::switch_to()+228>}, _tcb =
0xffff900005c147c0, _runtime = {
_priority = 1, _Rtt = 7.2966543e-08, _renormalize_count = 0},
_detached_state =
std::unique_ptr<sched::thread::detached_state> containing
0xffffa00005d157e0, _attr = {_stack = {begin = 0x200000500000,
size = 1048576, deleter = 0x681360
<pthread_private::pthread::free_stack(sched::thread::stack_info)>},
_pinned_cpu = 0x0, _detached = true,
_name = {_M_elems = ">/usr/bin/mpiru"}},
_migration_lock_counter = 1, _pinned = true, _arch = {
interrupt_stack = "(c) 2004-2005 The University of
Tennessee and The University\n#", ' ' <repeats 25 times>, "of
Tennessee Research Foundation. All rights\n#", ' ' <repeats
25 times>, "reserved.\n# Copyright (c) 2004-2005 High "...,
exception_stack =
"\211\004$M\211\320\377ЉE\370\203}\370\000\017\225\300\017\266\300H\205\300t?\213E\370\211\307\350V\250\000
E\344\235\000\017\251\000\000\063\213M\344\254\000\023\031\254\000\240b\370\372\377\213E\344\353\005\270&\001\362\r\311\303f\220UH\211\345H\211}\370H\213E\370\017\267@(\017\267\300\203\340\004\205\300k\000/]\303
\000\002\000\252\001\020\224\213\000\a
\000\060\211u\360D\000\261H\213P0H\213E\360H\211\020d\000\003\"\000\061\203\354\020f\000\005&\000#U\360\274\001R\273\377\377\377\311g\000\220\203\354
H\211}\350\211u\321\000\301苀\230\000\000\000;E\344\177\nJ\000!\351S|\001\200E\350\213@x\215P\001\n\000"...},
_id = 260, _interrupted = {<std::atomic_bool> = {_M_base =
{_M_i = false}}, <No data fields>},
_cleanup = {<std::_Maybe_unary_or_binary_function<void>> =
{<No data fields>}, <std::_Function_base> = {static
_M_max_size = 16,
static _M_max_align = 8, _M_functor = {_M_unused =
{_M_object = 0xffffa000056ba490, _M_const_object =
0xffffa000056ba490,
_M_function_pointer = 0xffffa000056ba490,
_M_member_pointer = (void
(std::_Undefined_class::*)(std::_Undefined_class * const))
0xffffa000056ba490, this adjustment 6046151},
_M_pod_data =
"\220\244k\005\000\240\377\377\307A\\\000\000\000\000"},
_M_manager = 0x68368f
<std::_Function_base::_Base_manager<pthread_private::pthread::pthread(void*
(*)(void*), void*, sigset_t, const
pthread_private::thread_attr*)::__lambda7>::_M_manager(std::_Any_data
&, const std::_Any_data &, std::_Manager_operation)>},
_M_invoker = 0x68366d <std::_Function_handler<void(),
pthread_private::pthread::pthread(void* (*)(void*), void*,
sigset_t, const
pthread_private::thread_attr*)::__lambda7>::_M_invoke(const
std::_Any_data &)>}, _tls = std::vector of length 1, capacity
1 = {0xffff900005c14000 ""}, _app = true,
_app_runtime = std::shared_ptr (count 5, weak 0)
0xffffa00002776f58, _joiner = {_M_b = {_M_p = 0x0}},
_runqueue_link =
{<boost::intrusive::detail::generic_hook<boost::intrusive::get_set_node_algo<void*,
false>, boost::intrusive::member_tag,
(boost::intrusive::link_mode_type)1, 0>> =
{<boost::intrusive::detail::no_default_definer> = {<No data
fields>}, <boost::intrusive::rbtree_node<void*>> = {
parent_ = 0x0, left_ = 0x0, right_ = 0x0, color_ =
boost::intrusive::rbtree_node<void*>::black_t}, <No data
fields>}, <No data fields>},
_wakeup_link = {_helper = 0xffffa00002776188, _next =
0xffffa000056baf80}, static _s_idgen = 370, static _s_reaper =
0xffff8000027df040,
stat_switches = {_counter = {<std::__atomic_base<unsigned
long>> = {_M_i = 571}, <No data fields>}}, stat_preemptions = {
_counter = {<std::__atomic_base<unsigned long>> = {_M_i =
159}, <No data fields>}}, stat_migrations = {
_counter = {<std::__atomic_base<unsigned long>> = {_M_i =
1}, <No data fields>}}, _total_cpu_time = {__r = 167458000},
_cputime_estimator = {<std::__atomic_base<unsigned long>> =
{_M_i = 702368892544957}, <No data fields>}}
(gdb) p (*this)._attr
$10 = {_stack = {begin = 0x200000500000, size = 1048576,
deleter = 0x681360
<pthread_private::pthread::free_stack(sched::thread::stack_info)>},
_pinned_cpu = 0x0, _detached = true, _name = {_M_elems =
">/usr/bin/mpiru"}}
Now in frame 21, _M_elems = ">/usr/bin/mpiru" makes me think that
this is mpirun. And that one should not know anything about
0x10040103596c, 0x10060103596c, 0x1004014988f1, 0x1006014988f1
from:
$3 = std::vector of length 128, capacity 128 = {0x100005c3596c
<buffer_cleanup>, 0x1000060988f1 <buffer_cleanup>,
0x10040103596c, 0x10060103596c,
0x1004014988f1, 0x1006014988f1
The 0x10040xx and 0x10060xx entries are from mpi_hello.so, which
was started in a new ELF namespace.
Hope this gdb output is helpful.
BR Justin
On 08/25/2016 01:29 PM, Nadav Har'El wrote:
I just sent another RFC patch which should cause osv_execve
to wait until app_runtime() is the new application's.
I haven't really tested it much beyond the usual "make check"
- can you please try it out?
Thanks,
Nadav.
--
Nadav Har'El
n...@scylladb.com <mailto:n...@scylladb.com>
On Wed, Aug 24, 2016 at 7:21 AM, Justin Cinkelj
<justin.cink...@xlab.si <mailto:justin.cink...@xlab.si>> wrote:
On 08/23/2016 11:58 PM, Nadav Har'El wrote:
On Tue, Aug 23, 2016 at 3:28 PM, Justin Cinkelj
<justin.cink...@xlab.si <mailto:justin.cink...@xlab.si>>
wrote:
@nadav
With the second RFC patch, osv_execve doesn't return
thread_id==0 any more.
(code after the RFC-2 patch) The
application::start_and_join() line
sched::thread::current()->set_app_runtime(runtime());
has to finish, so that with_all_app_threads works as
desired. I think that's in line with your
explanation. If I add sleep(1) just before it, then the
problem becomes 100% reproducible (it was 10-20%
reproducible before).
Yes, so I think we understand now why the problem
happens. Now the question is how to fix it.
I have some ideas on how it might be possible to fix it,
i.e., delay osv_execve()'s return until the new thread
got its new app_runtime setting.
However, I started wondering whether we should fix
anything in osv_execve():
osv_execve() promises to return the new thread id, and
that it does (after my last patch).
However, it doesn't promise anything about how far along
this thread went: Did it load the executable? Start to
run it? Did it even set the thread's app_runtime? We
don't know.
The question becomes: if we start promising more, why the
app_runtime thing? Isn't it just one of the arbitrary
things that happen during the startup of the
application - why promise that in particular?
If osv_execve() returned an app instead of a thread id,
things would be somewhat different, but even then, there
is a question whether it is fine for osv_execve() to
return an app that no thread yet belongs to - or whether
we should wait until at least the one thread we created
belongs to this app.
So this got me thinking: what if we just decide that
osv_execve()'s caller is not guaranteed that the app_runtime
was set?
The caller (your MPI code looking for threads to
setaffinity for) would just need to loop checking
find_by_id(tid)->app_runtime(), stopping if find_by_id
can't find the thread (this means it exited already!) or
if app_runtime() changed from what
sched::thread::current()->app_runtime() was.
Alternatively, if the setaffinity code already works in
a loop (since threads can be created after startup!),
it's even simpler: Just check if
find_by_id(tid)->app_runtime() is still the same as
current->app_runtime() - and if it is, just don't do
anything (a later iteration of the loop will find the
threads).
Does this last plan make sense?
Yes and no.
Yes, in that someone has to wait until app_runtime is set.
But as a user of OSv, I don't want to know all the OSv
internal details, and how to work around them.
If osv_execve returns before the new thread's app_runtime
is set, then maybe osv_get_all_app_threads (from a patch I
haven't yet sent to the list) could do the waiting on
app_runtime.
But it would have to check whether new_thread.app_runtime
is different from caller_thread.app_runtime. And that would
not work if some thread wants to get a list of its own
threads - caller_thread == new_thread.
So the wait loop should cover that extra case (not that the
current Open MPI code needs it; it's just to be a bit more
general).
Sounds acceptable?