[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-20 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495358#comment-17495358
 ] 

Yibo Cai commented on ARROW-15604:
--

Saw an arrow-dataset-scanner-test segfault in Travis CI today, probably due to the same issue.
[https://app.travis-ci.com/github/apache/arrow/jobs/560510053#L3228|https://app.travis-ci.com/github/apache/arrow/jobs/560510053#L3228]

> [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
> ---
>
> Key: ARROW-15604
> URL: https://issues.apache.org/jira/browse/ARROW-15604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The error is a heap-use-after-free and involves an OpenTracing structure that 
> was deleted by an atexit hook.
> https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843
> Summary:
> {code}
>   Atomic write of size 4 at 0x7b08000136a8 by thread T2:
>   [...]
>     #10 opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 (libarrow.so.800+0x1e62ef7)
>     #11 opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 (libarrow.so.800+0x1e70178)
>     #12 opentelemetry::v1::context::Token::~Token() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 (libarrow.so.800+0x1e7012f)
>   [...]
> {code}
> {code}
>   Previous write of size 8 at 0x7b08000136a8 by main thread:
>     #0 operator delete(void*)  (arrow-dataset-scanner-test+0x16a69e)
>   [...]
>     #7 opentelemetry::v1::nostd::shared_ptr::~shared_ptr() /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 (libarrow.so.800+0x1e62fb3)
>     #8 cxa_at_exit_wrapper(void*)  (arrow-dataset-scanner-test+0x11866f)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-08 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489126#comment-17489126
 ] 

Weston Pace commented on ARROW-15604:
-

We could work around it with some kind of guarded singleton class:
 * The constructor instantiates the instance on the heap
 * The accessor grabs a mutex (or a spinlock if we need to be signal safe) and, 
if the instance is null, returns an invalid status
 ** Note: this will require the accessor to return a Result instead of a T* like 
it does today; not sure if that will be a problem for OT

Then register an atexit handler that grabs the mutex/spinlock, deletes the 
instance, and sets the pointer to null.
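
The guarded singleton described above could be sketched roughly as follows. This is only an illustration under stated assumptions: {{ContextStorage}}, {{GuardedStorage}}, {{WithInstance}} and {{Shutdown}} are hypothetical names, not OpenTelemetry or Arrow APIs. Instead of returning a Result, the accessor here runs a callback while holding the mutex and returns false once the instance is gone, because a raw pointer handed out after the lock is released could still dangle.

```cpp
#include <cassert>
#include <cstdlib>
#include <mutex>

// Hypothetical stand-in for the OT context storage (not a real OT type).
struct ContextStorage {
  int attach_count = 0;
};

// Sketch of a guarded singleton: the instance lives on the heap and is
// deleted by an atexit handler that takes the same mutex as the accessor.
class GuardedStorage {
 public:
  // Runs `fn` on the instance while holding the lock. Returns false if the
  // instance has already been destroyed (the "invalid status" case).
  template <typename Fn>
  static bool WithInstance(Fn&& fn) {
    State& state = GetState();
    std::lock_guard<std::mutex> lock(state.mutex);
    if (state.instance == nullptr) return false;
    fn(*state.instance);
    return true;
  }

  // Exit handler: grab the lock, delete the instance, null the pointer.
  // Safe to call more than once.
  static void Shutdown() {
    State& state = GetState();
    std::lock_guard<std::mutex> lock(state.mutex);
    delete state.instance;
    state.instance = nullptr;
  }

 private:
  struct State {
    std::mutex mutex;
    ContextStorage* instance = new ContextStorage();
  };

  static State& GetState() {
    // The State (and its mutex) is intentionally never freed, so late
    // callers can still lock it after the instance has been deleted.
    static State* state = [] {
      State* s = new State();
      std::atexit(&GuardedStorage::Shutdown);
      return s;
    }();
    return *state;
  }
};
```

A caller would write something like {{GuardedStorage::WithInstance([](ContextStorage& s) { ++s.attach_count; })}} and treat a false return as "tracing is already shut down". The callback must not call {{WithInstance}} again, since the mutex in this sketch is not recursive.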

Looking at the OT code more closely, though, I am a bit surprised we are 
encountering this.  The {{END_SPAN_ON_FUTURE_COMPLETION}} macro uses {{Then}} 
and creates a new future.  The new future should only be marked finished after 
the OT work is done.

If no one tackles this in the meantime I will investigate further on Friday.



[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-08 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488727#comment-17488727
 ] 

Antoine Pitrou commented on ARROW-15604:


bq. I don't suppose there is any way to block the shutdown until the eternal 
thread pool is idle?

Which shutdown? The problem seems to be (as David diagnosed) that the OT 
context storage singleton is being destroyed before the Arrow CPU thread pool 
singleton. I don't know how we can change or work around that.



[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-07 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488548#comment-17488548
 ] 

Weston Pace commented on ARROW-15604:
-

I have run into bugs like this before.  I don't think the cause is really OT, 
but it seems to increase the likelihood of failure.  Basically, we have async 
tasks that do something like...

 * Run task
 * Mark future finished with result (at this point the main thread is free to 
exit and start shutdown)
 * Cleanup task

If anything in the cleanup task accesses global state, we could get this error.  
In the past the problem was that a task was accessing the default memory pool 
in its cleanup (I don't recall why).  A short-term fix is to update the test so 
it isn't using the eternal thread pool, or to call WaitForIdle on the CPU thread 
pool, but these feel more like hacks than real fixes, as a real customer would 
still have a segfault at shutdown.

In this case it seems the cleanup step is doing something with OT (which makes 
perfect sense).

I don't suppose there is any way to block the shutdown until the eternal thread 
pool is idle?  It could probably be signal safe if we waited with a busy loop 
but then I think you run the risk of shutdown delays.
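
To make the run / mark finished / cleanup sequence concrete, here is a toy C++ sketch using only the standard library. {{ToyTask}}, {{g_cleanup_count}} and {{RunToyTask}} are hypothetical names for illustration; the {{WaitForIdle}} method below is a stand-in that just joins one worker thread and is not Arrow's thread pool implementation.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <thread>

// Hypothetical stand-in for global state touched during cleanup
// (e.g. the default memory pool or the OT context storage).
std::atomic<int> g_cleanup_count{0};

class ToyTask {
 public:
  ~ToyTask() { WaitForIdle(); }

  std::future<int> Start() {
    std::future<int> future = promise_.get_future();
    worker_ = std::thread([this] {
      int result = 6 * 7;            // 1. Run task
      promise_.set_value(result);    // 2. Mark future finished; waiters may
                                     //    now proceed, possibly to shutdown
      g_cleanup_count.fetch_add(1);  // 3. Cleanup task touches global state
    });
    return future;
  }

  // Blocks until the worker has finished its cleanup step as well.
  void WaitForIdle() {
    if (worker_.joinable()) worker_.join();
  }

 private:
  std::promise<int> promise_;
  std::thread worker_;
};

// Waits for the result, then for the cleanup, before returning.
int RunToyTask() {
  ToyTask task;
  int result = task.Start().get();  // returns after step 2, maybe before 3
  task.WaitForIdle();               // only now is step 3 guaranteed done
  assert(g_cleanup_count.load() >= 1);
  return result;
}
```

Without the {{WaitForIdle}} call, the caller could observe the result and start tearing down global state while step 3 is still running on the worker thread, which is the window described above.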



[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-07 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488349#comment-17488349
 ] 

David Li commented on ARROW-15604:
--

It also seems the main thread is being destroyed during/before the thread 
pools, so maybe this is a static destructor order pitfall…



[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-07 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488347#comment-17488347
 ] 

David Li commented on ARROW-15604:
--

Hmm, I think I ran into something similar when I was working on my PR. 
https://github.com/apache/arrow/pull/11964#issuecomment-995043666

CC [~mbrobbel]

Should we disable OT in CI for now? 



[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488325#comment-17488325
 ] 

Antoine Pitrou commented on ARROW-15604:


So, basically, it seems using OpenTracing in an asynchronous setup where code 
may run after process teardown has started may be quite delicate. [~lidavidm] 
[~westonpace]



[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing

2022-02-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488321#comment-17488321
 ] 

Antoine Pitrou commented on ARROW-15604:


The "atexit hook" I mentioned simply seems to be a standard C++ exit hook that 
destroys global/static variables. Here it destroys the static singleton that's 
stored in {{RuntimeContext::GetStorage}}.
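
As an illustration of that mechanism, the following C++ sketch mimics the general shape of a function-local static singleton. {{Storage}} and the free function {{GetStorage}} are hypothetical stand-ins and are not copied from the OpenTelemetry sources. The first call constructs the static and registers its destructor to run at process exit, in the same way as a function registered with {{std::atexit}}.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>

// Hypothetical stand-in for the context storage type.
struct Storage {
  ~Storage() { std::puts("Storage destroyed by exit hook"); }
  int depth = 0;
};

// Function-local static: constructed on first call. Its destructor runs
// during normal process exit, after main() returns or std::exit() is called.
std::shared_ptr<Storage>& GetStorage() {
  static std::shared_ptr<Storage> storage = std::make_shared<Storage>();
  return storage;
}
```

Any thread that calls {{GetStorage()}} after that exit-time destructor has run is touching a destroyed {{shared_ptr}} and freed memory, which matches the {{~shared_ptr}} frame under {{cxa_at_exit_wrapper}} in the report.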


