[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495358#comment-17495358 ]

Yibo Cai commented on ARROW-15604:
--

Saw an arrow-dataset-scanner-test segfault in Travis CI today, probably the same issue.
https://app.travis-ci.com/github/apache/arrow/jobs/560510053#L3228

> [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
> ---
>
> Key: ARROW-15604
> URL: https://issues.apache.org/jira/browse/ARROW-15604
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Continuous Integration
> Reporter: Antoine Pitrou
> Assignee: Weston Pace
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> The error is a heap-use-after-free and involves an OpenTracing structure that
> was deleted by an atexit hook.
> https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843
> Summary:
> {code}
> Atomic write of size 4 at 0x7b08000136a8 by thread T2:
> [...]
> #10 opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 (libarrow.so.800+0x1e62ef7)
> #11 opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 (libarrow.so.800+0x1e70178)
> #12 opentelemetry::v1::context::Token::~Token() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 (libarrow.so.800+0x1e7012f)
> [...]
> {code}
> {code}
> Previous write of size 8 at 0x7b08000136a8 by main thread:
> #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e)
> [...]
> #7 opentelemetry::v1::nostd::shared_ptr::~shared_ptr() /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 (libarrow.so.800+0x1e62fb3)
> #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489126#comment-17489126 ]

Weston Pace commented on ARROW-15604:
-

We could work around it with some kind of guarded singleton class:

* The constructor instantiates the instance on the heap
* The accessor grabs a mutex (or a spinlock if we need to be signal safe) and, if the instance is null, returns an invalid status
* Note: this will require the accessor to return Result instead of T* like it does today; not sure if that will be a problem for OT

Then register an atexit handler that grabs the mutex/spinlock, deletes the instance, and sets the pointer to null.

Looking at the OT code more closely, though, I am a bit surprised we are encountering this. The {{END_SPAN_ON_FUTURE_COMPLETION}} macro uses {{Then}} and creates a new future. The new future should only be marked finished after the OT work is done.

If no one tackles this in the meantime I will investigate further on Friday.
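The guarded singleton proposed in the comment above could look roughly like the following. This is only a sketch under stated assumptions: {{ContextStorage}} and {{GuardedStorage}} are hypothetical names, not Arrow or OpenTelemetry types, and instead of returning a Result the accessor runs a callback while holding the lock and returns false once the instance is gone, so no caller can keep a pointer past teardown.

```cpp
#include <cstdlib>
#include <mutex>
#include <string>

// Hypothetical stand-in for the OT context storage (not a real OpenTelemetry type).
struct ContextStorage {
  std::string name = "default";
};

// Hypothetical guarded singleton: the instance lives on the heap and is
// deleted by an atexit handler that takes the same mutex as the accessor.
class GuardedStorage {
 public:
  // Runs fn on the instance while holding the lock.
  // Returns false (an "invalid status") if the instance was already torn down.
  template <typename Fn>
  static bool WithInstance(Fn&& fn) {
    State& state = GetState();
    std::lock_guard<std::mutex> lock(state.mutex);
    if (state.instance == nullptr) return false;
    fn(*state.instance);
    return true;
  }

  // Registered with std::atexit; safe to call more than once.
  static void Shutdown() {
    State& state = GetState();
    std::lock_guard<std::mutex> lock(state.mutex);
    delete state.instance;
    state.instance = nullptr;
  }

 private:
  struct State {
    std::mutex mutex;
    ContextStorage* instance;
    State() : instance(new ContextStorage()) {
      std::atexit(&GuardedStorage::Shutdown);
    }
  };

  static State& GetState() {
    // Intentionally leaked so the mutex itself is never destroyed at exit.
    static State* state = new State();
    return *state;
  }
};
```

A worker thread that runs after teardown then gets a false return value instead of touching freed memory. The callback must not call Shutdown itself, since the mutex is not recursive.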
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488727#comment-17488727 ]

Antoine Pitrou commented on ARROW-15604:

bq. I don't suppose there is any way to block the shutdown until the eternal thread pool is idle?

Which shutdown? The problem seems to be (as David diagnosed) that the OT context storage singleton is being destroyed before the Arrow CPU thread pool singleton. I don't know how we can change or work around that.
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488548#comment-17488548 ]

Weston Pace commented on ARROW-15604:
-

I have run into bugs like this before. I don't think the cause is really OT, but it seems to increase the likelihood of failure. Basically we have async tasks that do something like:

* Run task
* Mark future finished with result (at this point the main thread is free to exit and start shutdown)
* Cleanup task

If anything in the cleanup step accesses global state, we could get this error. In the past the problem was that a task was accessing the default memory pool in its cleanup (I don't recall why). A short-term fix is to update the test so it isn't using the eternal thread pool, or to call WaitForIdle on the CPU thread pool, but these feel more like hacks than real fixes, as a real customer would still have a segfault at shutdown.

In this case it seems the cleanup step is doing something with OT (which makes perfect sense).

I don't suppose there is any way to block the shutdown until the eternal thread pool is idle? It could probably be signal safe if we waited with a busy loop, but then I think you run the risk of shutdown delays.
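The ordering problem described in the comment above can be sketched with standard C++ threads. This is a hypothetical illustration, not Arrow's Future or thread pool code: {{EventLog}} and {{RunWithCleanupBeforeFinish}} are made-up names, and the log stands in for any global state the cleanup step touches. It shows the safe ordering, where cleanup runs before the future is marked finished, so the waiting thread cannot resume (and begin shutdown) while cleanup is still in flight.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical shared state touched by both the worker and the waiter.
struct EventLog {
  std::mutex mutex;
  std::vector<std::string> events;
  void Add(std::string event) {
    std::lock_guard<std::mutex> lock(mutex);
    events.push_back(std::move(event));
  }
};

// Runs the task, performs cleanup, and only then marks the future finished.
// Swapping the last two worker steps would recreate the hazard: the waiter
// could resume and tear down shared state while cleanup is still running.
std::vector<std::string> RunWithCleanupBeforeFinish() {
  EventLog log;
  std::promise<int> promise;
  std::future<int> future = promise.get_future();

  std::thread worker([&log, &promise] {
    log.Add("run task");
    log.Add("cleanup");     // touches shared state while the waiter is still blocked
    promise.set_value(42);  // mark finished last
  });

  future.get();  // the waiter resumes only after cleanup is done
  log.Add("waiter resumed");
  worker.join();
  return log.events;
}
```

With this ordering the recorded sequence is always "run task", "cleanup", "waiter resumed". It does not help when cleanup happens implicitly after the callback returns, for example in destructors of captured objects, which is the harder case discussed in this thread.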
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488349#comment-17488349 ]

David Li commented on ARROW-15604:
--

It also seems the main thread is being destroyed during/before the thread pools, so maybe this is a static destructor order pitfall…
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488347#comment-17488347 ]

David Li commented on ARROW-15604:
--

Hmm, I think I ran into something similar when I was working on my PR. https://github.com/apache/arrow/pull/11964#issuecomment-995043666

CC [~mbrobbel]

Should we disable OT in CI for now?
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488325#comment-17488325 ]

Antoine Pitrou commented on ARROW-15604:

So, basically, it seems using OpenTracing in an asynchronous setup where code may run after process teardown has started may be quite delicate.

[~lidavidm] [~westonpace]
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488321#comment-17488321 ]

Antoine Pitrou commented on ARROW-15604:

The "atexit hook" I mentioned simply seems to be a standard C++ exit hook that destroys global/static variables. Here, that is the static singleton stored in {{RuntimeContext::GetStorage}}.
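For readers unfamiliar with the exit hook mentioned above, the sketch below contrasts a function-local static, whose destructor the C++ runtime registers to run at process exit, with a heap-allocated singleton that is never destroyed. The names {{Storage}}, {{DestroyedAtExit}} and {{NeverDestroyed}} are hypothetical and this is not the OpenTelemetry implementation; it only illustrates the general mechanism.

```cpp
#include <string>

// Hypothetical stand-in for the context storage (not an OpenTelemetry type).
struct Storage {
  std::string name = "storage";
};

// Function-local static: constructed on first use, and its destructor is
// registered to run during exit. A thread still running at that point may
// then access an already destroyed object.
Storage& DestroyedAtExit() {
  static Storage storage;
  return storage;
}

// Heap-allocated and never deleted: only a raw pointer has static storage,
// so no destructor runs at exit and late callers still see a live object.
Storage& NeverDestroyed() {
  static Storage* storage = new Storage();
  return *storage;
}
```

The second form trades a deliberate one-time leak for safety against late accesses. Whether that is acceptable for the OT storage is a separate question that this sketch does not settle.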