[
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390152#comment-17390152
]
Charles Natali commented on MESOS-10226:
----------------------------------------
Hm, I can't reproduce the hang.
I changed the test to run the arm64 alpine image so that it fails the same
way it should be failing for you. It fails, but it doesn't hang:
```
# ./bin/mesos-tests.sh --gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand*
[ RUN ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0
sh: 1: hadoop: not found
Marked '/' as rslave
I0729 21:40:16.121507 434157 exec.cpp:164] Version: 1.12.0
I0729 21:40:16.136072 434156 exec.cpp:237] Executor registered on agent
48863f87-f283-42ab-bd93-f301fdfbd73b-S0
I0729 21:40:16.139089 434154 executor.cpp:190] Received SUBSCRIBED event
I0729 21:40:16.139974 434154 executor.cpp:194] Subscribed executor on thinkpad
I0729 21:40:16.140264 434154 executor.cpp:190] Received LAUNCH event
I0729 21:40:16.141703 434154 executor.cpp:722] Starting task
1461a266-1ead-4bdf-9165-9c0f6c5938b8
I0729 21:40:16.147071 434154 executor.cpp:740] Forked command at 434163
Preparing rootfs at
'/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634'
Changing root to
/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634
Failed to execute '/bin/ls': Exec format error
I0729 21:40:16.321754 434155 executor.cpp:1041] Command exited with status 1
(pid: 434163)
../../src/tests/containerizer/provisioner_docker_tests.cpp:785: Failure
Expected: TASK_FINISHED
To be equal to: statusFinished->state()
Which is: TASK_FAILED
I0729 21:40:16.333557 434157 exec.cpp:478] Executor asked to shutdown
I0729 21:40:16.334996 434158 executor.cpp:190] Received SHUTDOWN event
I0729 21:40:16.335037 434158 executor.cpp:843] Shutting down
[ FAILED ]
ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where
GetParam() = "arm64v8/alpine" (5851 ms)
```
Could you try running
```
./bin/mesos-tests.sh --gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* --verbose
```
to see whether it hangs, and post the output?
Worst case, we could ignore the hang and update the test to use the arm64
image so that it passes, but I'd like to understand why it hangs.
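To illustrate the "use the arm64 image" option, here is a rough Python sketch (not the C++ test code) of picking an image name from the host architecture. `IMAGES` and `image_for_host` are hypothetical names; `arm64v8/alpine` is the image from the run above, and using plain `alpine` for x86_64 is an assumption.

```python
import platform

# Hypothetical mapping from host machine names to image names.
# "arm64v8/alpine" comes from the test run above; "alpine" for x86_64
# is an assumption, not taken from the test source.
IMAGES = {
    "x86_64": "alpine",
    "amd64": "alpine",
    "aarch64": "arm64v8/alpine",
    "arm64": "arm64v8/alpine",
}


def image_for_host(machine=None):
    """Return an image name matching the host CPU architecture."""
    # platform.machine() reports e.g. "x86_64" or "aarch64" on Linux.
    machine = (machine or platform.machine()).lower()
    if machine not in IMAGES:
        raise ValueError("no image known for architecture: " + machine)
    return IMAGES[machine]
```

With a mapping like this the container binaries match the host, so the `Exec format error` should not occur. It would not explain the hang, though.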
> test suite hangs on ARM64
> -------------------------
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
> Issue Type: Bug
> Reporter: Charles Natali
> Assignee: Charles Natali
> Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021-2.txt,
> gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>
> {noformat}
> [ RUN ]
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630 32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512 31 exec.cpp:237] Executor registered on agent
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999 36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351 36 executor.cpp:194] Subscribed executor on
> martin-arm64
> I0726 11:59:17.832775 36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415 36 executor.cpp:722] Starting task
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910 36 executor.cpp:740] Forked command at 38
> Preparing rootfs at
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488 33 executor.cpp:1041] Command exited with status 1
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1111:
> Failure
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0xffffc28527f0, @0xffffa2cf3a60 136-byte
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00
> 03-00 00-00>)
> Expected: to be called twice
> Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401 37 process.cpp:935] Stopped the socket accept loop
> {noformat}
>
> I asked him to provide a gdb traceback and we can see the following:
>
> {noformat}
> Thread 1 (Thread 0xffffa3bc2c60 (LWP 173475)):
> #0 0x0000ffffa518db20 in __libc_open64 (file=0xaaab00f342e0
> "/tmp/7VXP3w/pipe", oflag=<optimized out>) at
> ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0x0000ffffa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0,
> filename=<optimized out>, posix_mode=<optimized out>, prot=prot@entry=438,
> read_write=8, is32not64=<optimized out>) at fileops.c:189
> #2 0x0000ffffa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0,
> filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=<optimized
> out>, mode@entry=0xaaaad762f3c8 "r", is32not64=is32not64@entry=1) at fileops.c:281
> #3 0x0000ffffa512e0dc in __fopen_internal (filename=0xaaab00f342e0
> "/tmp/7VXP3w/pipe", mode=0xaaaad762f3c8 "r", is32=1) at iofopen.c:75
> #4 0x0000aaaad54f5350 in os::read (path="/tmp/7VXP3w/pipe") at
> ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0x0000aaaad74f1c1c in
> mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody
> (this=0xaaab00f88f50) at ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>
>
> Basically the test uses a named pipe to synchronize with the task being
> started, and if the task fails to start - in this case because we're trying
> to launch an x86 container on an arm64 host - the test will just hang reading
> from the pipe.
> I sent Martin a tentative fix to test, and I'll open an MR if it works.
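The hang in the traceback comes from a blocking open of a FIFO that never gets a writer. As an illustration only, here is a minimal Python sketch of one way to bound that wait. It is not the Mesos C++ code and not the tentative fix mentioned above. It assumes Linux FIFO semantics, and `read_fifo_with_timeout` is a hypothetical helper.

```python
import os
import select


def read_fifo_with_timeout(path, timeout_s):
    """Read from a FIFO, giving up if no writer shows up in time.

    Returns the bytes read, or None when the timeout expires first.
    """
    # O_NONBLOCK makes open() return immediately even when no writer has
    # opened the FIFO yet; a plain blocking open would wait indefinitely.
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        # poll() takes milliseconds and returns an empty list on timeout.
        if not poller.poll(int(timeout_s * 1000)):
            return None
        # Returns b"" if a writer opened and closed without writing.
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```

A test using something like this could turn a `None` result into an ordinary failure when the task never starts, instead of blocking in the read.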