[jira] [Commented] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.

2016-02-26 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170238#comment-15170238
 ] 

Joris Van Remoortere commented on MESOS-3271:
-

{code}
commit 16aa038949741f4dc6bf43423dc0340f869605ce
Author: Alexander Rojas 
Date:   Fri Feb 26 17:17:50 2016 -0800

Removed race condition from libevent based poll implementation.

Under certain circumstances, the future returned by poll is discarded
right after the event is triggered. This causes the event callback to be
called before the discard callback, which results in an abort signal
being raised by the libevent library.

Review: https://reviews.apache.org/r/43799/
{code}
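
The ordering problem described in the commit message can be illustrated with a minimal sketch. This is not the change from the review linked above, and it does not use libevent or libprocess. `PollState`, `eventTriggered`, `discardRequested` and `runEventThenDiscard` are hypothetical names, and a plain atomic flag stands in for the real event. It shows one common way to make sure that whichever of the two callbacks arrives second does nothing, so a late discard never touches an event that has already fired.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for the state shared between a poll event and
// the future handed back to the caller. Not libprocess or libevent API.
class PollState {
public:
  PollState(std::function<void()> onReady, std::function<void()> onDiscard)
    : onReady_(std::move(onReady)), onDiscard_(std::move(onDiscard)) {}

  // Called by the event loop when the file descriptor becomes ready.
  // Returns true if this call completed the poll.
  bool eventTriggered() {
    if (!claim()) {
      return false;
    }
    onReady_();
    return true;
  }

  // Called when the caller discards the future.
  // Returns true if this call cancelled the poll.
  bool discardRequested() {
    if (!claim()) {
      return false;
    }
    onDiscard_();
    return true;
  }

private:
  // Only the first caller wins; any later caller must leave the event alone.
  bool claim() { return !done_.exchange(true); }

  std::atomic<bool> done_{false};
  std::function<void()> onReady_;
  std::function<void()> onDiscard_;
};

// Replays the ordering from the commit message: the event fires first and
// the discard arrives right after. Returns how many handlers ran.
int runEventThenDiscard() {
  int handled = 0;
  PollState state([&] { ++handled; }, [&] { ++handled; });
  state.eventTriggered();
  state.discardRequested();
  return handled;
}
```

Calling `runEventThenDiscard()` runs exactly one handler: the late discard finds the flag already set and returns without acting on the event.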

> SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.
> ---
>
> Key: MESOS-3271
> URL: https://issues.apache.org/jira/browse/MESOS-3271
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Paul Brett
> Attachments: build.txt
>
>
> Test failure on Ubuntu 14 configured with {{--disable-java --disable-python 
> --enable-ssl --enable-libevent --enable-optimize --enable-network-isolation}}
> Commit: {{9b78b301469667b5a44f0a351de5f3a71edae499}}
> {code}
> [ RUN  ] SlaveRecoveryTest/0.NonCheckpointingFramework
> I0815 06:41:47.413602 17091 exec.cpp:133] Version: 0.24.0
> I0815 06:41:47.416780 17111 exec.cpp:207] Executor registered on slave 
> 20150815-064146-544909504-51064-12195-S0
> Registered executor on slave1-ubuntu12
> Starting task 044bd49e-2f38-4671-802a-ac6524d61a85
> Forked command at 17114
> sh -c 'sleep 1000'
> [err] event_active called on a non-initialized event 0x7f6b740232d0 (events: 
> 0x2, fd: 21, flags: 0x80)
> *** Aborted at 1439646107 (unix time) try "date -d @1439646107" if you are 
> using GNU date ***
> PC: @ 0x7f6ba512d0d5 (unknown)
> *** SIGABRT (@0x2fa3) received by PID 12195 (TID 0x7f6b9d613700) from PID 
> 12195; stack trace: ***
> @ 0x7f6ba54c4cb0 (unknown)
> @ 0x7f6ba512d0d5 (unknown)
> @ 0x7f6ba513083b (unknown)
> @ 0x7f6ba448e1ba (unknown)
> @ 0x7f6ba448e52b (unknown)
> @ 0x7f6ba447dcc9 (unknown)
> @   0x4c4033 process::internal::run<>()
> @ 0x7f6ba72642ab process::Future<>::discard()
> @ 0x7f6ba72643be process::internal::discard<>()
> @ 0x7f6ba7262298 
> _ZNSt17_Function_handlerIFvvEZNK7process6FutureImE9onDiscardISt5_BindIFPFvNS1_10WeakFutureIsEEES7_RKS3_OT_EUlvE_E9_M_invokeERKSt9_Any_data
> @   0x4c4033 process::internal::run<>()
> @   0x6fa0cb process::Future<>::discard()
> @ 0x7f6ba6fb5736 cgroups::event::Listener::finalize()
> @ 0x7f6ba728fb11 process::ProcessManager::resume()
> @ 0x7f6ba728fe0f process::internal::schedule()
> @ 0x7f6ba5c9d490 (unknown)
> @ 0x7f6ba54bce9a start_thread
> @ 0x7f6ba51ea38d (unknown)
> + /bin/true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.

2016-02-26 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170107#comment-15170107
 ] 

Joris Van Remoortere commented on MESOS-3271:
-

{code}
commit 2297a3cf8db2b88860bc839cf934894b1d09dbbc
Author: Alexander Rojas 
Date:   Fri Feb 26 14:38:05 2016 -0800

Removed race condition from libevent based poll implementation.

Under certain circumstances, the future returned by poll is discarded
right after the event is triggered. This causes the event callback to be
called before the discard callback, which results in an abort signal
being raised by the libevent library.

Review: https://reviews.apache.org/r/43799/
{code}






[jira] [Commented] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.

2016-02-17 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151149#comment-15151149
 ] 

Alexander Rojas commented on MESOS-3271:


I'm seeing the same {{event_active}} error when running {{MESOS_VERBOSE=1 sudo 
./bin/mesos-tests.sh 
--gtest_filter="MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery" 
--gtest_repeat=1000}} on CentOS 6.7 in VirtualBox.






[jira] [Commented] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.

2015-10-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955774#comment-14955774
 ] 

Benjamin Mahler commented on MESOS-3271:


[~jvanremoortere] this looks to be a bug in the libevent integration?

{noformat}
[err] event_active called on a non-initialized event 0x7f6b740232d0 (events: 
0x2, fd: 21, flags: 0x80)
{noformat}






[jira] [Commented] (MESOS-3271) SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.

2015-10-13 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954826#comment-14954826
 ] 

Benjamin Bannier commented on MESOS-3271:
-

I wasn't able to reproduce this at all in a Vagrant container (6 CPUs, 10 GB 
RAM) on an OS X host. Can you provide any guidance on how to increase the 
failure rate, [~pbrett]? What is the approximate failure rate you are seeing?



