longfei created MESOS-9951:
------------------------------

             Summary: A likely STW problem in the master's GC routine
                 Key: MESOS-9951
                 URL: https://issues.apache.org/jira/browse/MESOS-9951
             Project: Mesos
          Issue Type: Bug
            Reporter: longfei
         Attachments: image-2019-08-22-14-00-16-298.png

I'm running a Mesos 1.7.3 master, which recently appeared to stall for about half a minute.
{code:java}
I0820 20:53:56.705075 4185864 registrar.cpp:487] Applied 1 operations in 1.163968ms; attempting to update the registry
I0820 20:53:56.705541 4185861 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 353
I0820 20:53:56.705739 4185875 replica.cpp:541] Replica received write request for position 353 from __req_res__(568)@10.10.23.74:5050
I0820 20:53:56.721997 4185859 master.cpp:8753] Executor 'mt:l00000000004115106217:1' of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24): exited with status 0
I0820 20:53:56.722085 4185859 master.cpp:11215] Removing executor 'mt:l00000000004115106217:1' with resources [] of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 on agent bd5550a6-4089-482d-aa96-3389bae5b0de-S179 at slave(1)@10.153.38.24:5051 (10.153.38.24)
I0820 20:53:56.742550 4185877 replica.cpp:695] Replica received learned notice for position 353 from log-network(1)@10.10.23.74:5050
I0820 20:53:56.784256 4185881 registrar.cpp:544] Successfully updated the registry in 79.105792ms
I0820 20:53:56.784489 4185857 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 354
I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request for position 354 from __req_res__(571)@10.10.23.74:5050
I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice for position 354 from log-network(1)@10.10.23.74:5050
I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable and 0 gone agents from the registry
I0820 20:54:34.798610 4185864 master.cpp:8510] Status update TASK_FINISHED (Status UUID: 6304aa62-2854-4d46-ad09-ffbf3347f24b) for task mt:l00000000004115107127:1 of framework a878e862-349c-4206-bfb8-3048c841e8ec-0002 from agent bd5550a6-4089-482d-aa96-3389bae5b0de-S138 at slave(1)@10.17.44.133:5051 (10.17.44.133)
{code}
Note that no log lines were produced between 20:53:56 and 20:54:34.
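
For anyone who wants to look for similar stalls in their own master log, here is a minimal Python sketch that scans glog-style lines and reports pauses longer than a threshold. The function name `find_gaps`, the 5-second threshold, and the year 2019 are assumptions made for illustration (glog timestamps carry no year, and the sketch does not handle logs that cross midnight at a year boundary). The sample lines are shortened copies of entries from the log above.

```python
import re
from datetime import datetime, timedelta

# glog prefix: severity letter, mmdd, hh:mm:ss.uuuuuu, thread id, file:line]
GLOG_PREFIX = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) +\d+ \S+:\d+\]"
)


def find_gaps(lines, threshold=timedelta(seconds=5), year=2019):
    """Return (start, end, gap) for consecutive entries further apart than threshold.

    `year` is an assumption: glog lines do not record the year.
    """
    gaps = []
    prev = None
    for line in lines:
        m = GLOG_PREFIX.match(line)
        if m is None:
            continue  # wrapped continuation line or non-glog text
        ts = datetime.strptime(
            f"{year}{m.group(1)} {m.group(2)}", "%Y%m%d %H:%M:%S.%f"
        )
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


# Shortened copies of log entries from this report.
SAMPLE = [
    "I0820 20:53:56.784641 4185890 replica.cpp:541] Replica received write request",
    "for position 354 from __req_res__(571)@10.10.23.74:5050",
    "I0820 20:53:56.825901 4185890 replica.cpp:695] Replica received learned notice",
    "I0820 20:54:34.798512 4185864 master.cpp:1978] Garbage collected 1 unreachable",
]

if __name__ == "__main__":
    for start, end, gap in find_gaps(SAMPLE):
        print(f"{start.time()} -> {end.time()}: {gap.total_seconds():.6f}s")
```

On the sample above this reports a single gap of roughly 38 seconds, between the last learned notice for position 354 and the "Garbage collected" entry.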

atop shows that one core (the one used by the master) was fully utilized during the STW (stop-the-world) period.

!image-2019-08-22-14-00-16-298.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
