[jira] [Assigned] (MESOS-9324) Resource fragmentation: frameworks may be starved of port resources in the presence of a large number of frameworks with quota.
[ https://issues.apache.org/jira/browse/MESOS-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meng Zhu reassigned MESOS-9324:
-------------------------------

    Assignee: Meng Zhu

> Resource fragmentation: frameworks may be starved of port resources in the presence of a large number of frameworks with quota.
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9324
>                 URL: https://issues.apache.org/jira/browse/MESOS-9324
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>            Reporter: Meng Zhu
>            Assignee: Meng Zhu
>            Priority: Major
>              Labels: mesosphere
>
> In our environment, where there are 1.5k frameworks and quota is heavily utilized, we experience a severe resource fragmentation issue. Specifically, we observed a large number of port-less offers circulating in the cluster. Frameworks that need port resources are thus unable to launch tasks even if their roles have quota (because currently we can only set quota for scalar resources, not port range resources).
> Most of the 1.5k frameworks do not suppress offers today, and we believe the situation will improve significantly once they do. Still, there are some improvements the Mesos allocator can make to help.
> h3. How resources become fragmented
> The origin of these port-less offers stems from quota chopping. Specifically, when chopping an agent to satisfy a role's quota, we also hand out resources that this role does not have quota for (as long as doing so does not break another role's quota). These "extra resources" include ALL the remaining port resources on the agent. After this offer, the agent is left with no port resources even though it still has CPUs, etc. Later, these leftover resources may be offered to other frameworks, but they are useless due to having no ports. Now we have some "bad offers" in the cluster.
> h3. How resource fragmentation is prolonged
> A resource offer, once it is declined (e.g. due to no ports), is recovered by the allocator and offered to other frameworks again. Before this happens, the offer might be merged with either the remaining resources or other declined resources on the same agent. However, it is not uncommon for the declined offer to be handed out again *as-is*. This is especially probable if the allocator makes offers faster than frameworks respond to them. As a result, we observe the circulation of bad offers across different frameworks. These bad offers exist for a long time before being consolidated again. For how long? *The longevity of a bad offer is roughly proportional to the number of active frameworks*. In the worst case, only once all the active frameworks have declined the bad offer (hopefully with long filter durations) does it have nowhere left to go and finally start to merge with other resources on that agent.
> Note that since allocator performance has greatly improved in the past several months, the scenario described here could become increasingly common. Also, as we introduce quota limits and hierarchical quota, there will be much more agent chopping, making resource fragmentation even worse.
> h3. Near-term mitigations
> As mentioned above, the longevity of a bad offer is proportional to the number of active frameworks, so framework suppression will certainly help. In addition, on the Mesos side, a couple of mitigation measures are worth considering (other than the long-term optimistic allocation strategy):
> 1. Add a periodic defragmentation interval to the allocator. For example, every minute or every dozen allocation cycles or so, we pause allocation, rescind all outstanding offers, and start allocating again. This essentially eliminates all the circulating bad offers by giving them a chance to be consolidated. Think of this as a periodic "reboot" of the allocator.
> 2. Consider chopping non-quota resources as well. Right now, resources such as ports (or any other resources that the role does not have quota for) are all allocated in a single offer. We could choose to chop these non-quota resources too. For example, port resources could be distributed proportionally to allocated CPU resources (see the sketch after this message).
> 3. Provide support for specifying port quantities. With this, we can utilize the existing quota or `min_allocatable_resources` APIs to guarantee a certain number of port resources.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
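To make mitigation 2 above concrete, here is a minimal, self-contained sketch of proportional port chopping. It uses plain standard-library types rather than Mesos' actual {{Resources}}/{{Value::Range}} classes, and {{chopPorts}} is a hypothetical helper, not an existing allocator function:

{code:cpp}
#include <cstdint>
#include <utility>
#include <vector>

// A port range [begin, end], inclusive, mirroring Value::Range in mesos.proto.
using Range = std::pair<uint64_t, uint64_t>;

// Chop off roughly `fraction` of the ports in `available`, walking the
// ranges in order and splitting a range where the budget runs out.
// Returns the chopped-off ranges; a real allocator would also subtract
// the result from the agent's remaining resources.
std::vector<Range> chopPorts(
    const std::vector<Range>& available, double fraction)
{
  // Total number of ports across all ranges.
  uint64_t total = 0;
  for (const Range& r : available) {
    total += r.second - r.first + 1;
  }

  // Number of ports to hand out in this offer.
  uint64_t budget = static_cast<uint64_t>(total * fraction);

  std::vector<Range> chopped;
  for (const Range& r : available) {
    if (budget == 0) {
      break;
    }

    uint64_t size = r.second - r.first + 1;
    if (size <= budget) {
      chopped.push_back(r); // Take the whole range.
      budget -= size;
    } else {
      // Split the range: take only the first `budget` ports.
      chopped.push_back({r.first, r.first + budget - 1});
      budget = 0;
    }
  }

  return chopped;
}

// Example: an agent has ports [31000-32000] left and the role being
// chopped is getting 2 of the agent's 8 CPUs, so it gets ~25% of the ports:
//
//   std::vector<Range> offered = chopPorts({{31000, 32000}}, 2.0 / 8.0);
//   // offered == {{31000, 31249}}, i.e. 250 of the 1001 ports.
{code}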
[jira] [Commented] (MESOS-8841) Flaky `MasterAllocatorTest/0.SingleFramework`
[ https://issues.apache.org/jira/browse/MESOS-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725277#comment-16725277 ]

Meng Zhu commented on MESOS-8841:
---------------------------------

We need more logs for further triaging. I haven't observed any flakiness since April, so I am closing this for now. We can reopen it if it flakes again with more logs.

> Flaky `MasterAllocatorTest/0.SingleFramework`
> ---------------------------------------------
>
>                 Key: MESOS-8841
>                 URL: https://issues.apache.org/jira/browse/MESOS-8841
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>        Environment: Fedora 25
>                     master/a1c6a7a3c5
>            Reporter: Andrei Budnik
>            Priority: Major
>              Labels: flaky-test
>
> {code:java}
> [ RUN      ] MasterAllocatorTest/0.SingleFramework
> F0426 08:31:29.775804  9701 hierarchical.cpp:586] Check failed: slaves.contains(slaveId)
> *** Check failure stack trace: ***
>     @     0x7f365e108fb8  google::LogMessage::Fail()
>     @     0x7f365e108f15  google::LogMessage::SendToLog()
>     @     0x7f365e10890f  google::LogMessage::Flush()
>     @     0x7f365e10b6d2  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f365c63b8d7  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
>     @     0x55728a500ac7  _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_7SlaveIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
>     @     0x55728a589908  _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_7SlaveIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
>     @     0x55728a586a0f  _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_7SlaveIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_
>     @     0x55728a5852b0  _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_7SlaveIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi1clIJSO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_
>     @     0x55728a584209  _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_7SlaveIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_JSB_St12_PlaceholderILi1EJSQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_
>     @     0x55728a583995  _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_7SlaveIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EJSR_EEEvOSG_DpOT0_
>     @     0x55728a581522  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_7SlaveIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_JSF_St12_PlaceholderILi1EEclEOS3_
>     @     0x7f365e0484c0  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
>     @     0x7f365e025760  process::ProcessBase::consume()
>     @     0x7f365e033abc  _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
>     @     0x55728a1cb6ea  process::ProcessBase::serve()
>     @     0x7f365e0225ed  process::ProcessManager::resume()
>     @     0x7f365e01e94c  _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
>     @     0x7f365e031080  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
>     @     0x7f365e030a34  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
>     @     0x7f365e030338  _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>     @     0x7f365478976f  (unknown)
>     @     0x7f3654e6973a  start_thread
>     @     0x7f3653eefe7f  __GI___clone
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-9494) Add a unit test for the interaction between request batching and response compression
Benno Evers created MESOS-9494:
----------------------------------

             Summary: Add a unit test for the interaction between request batching and response compression
                 Key: MESOS-9494
                 URL: https://issues.apache.org/jira/browse/MESOS-9494
             Project: Mesos
          Issue Type: Improvement
            Reporter: Benno Evers

As discussed in https://reviews.apache.org/r/69064/, we should try to add a unit test verifying that simultaneous requests with different `Accept-Encoding` headers produce different responses. It could look like this:

{noformat}
TEST_F(MasterLoadTest, AcceptEncoding)
{
  MockAuthorizer authorizer;
  prepareCluster(&authorizer);

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};
  Headers acceptRawHeaders = {{"Accept-Encoding", "raw"}};

  RequestDescriptor descriptor1;
  descriptor1.endpoint = "/state";
  descriptor1.headers = authHeaders + acceptGzipHeaders;

  RequestDescriptor descriptor2 = descriptor1;
  descriptor2.headers = authHeaders + acceptRawHeaders;

  auto responses = launchSimultaneousRequests({descriptor1, descriptor2});

  foreachpair (
      const RequestDescriptor& request,
      Future<http::Response>& response,
      responses) {
    AWAIT_READY(response);
    ASSERT_SOME(request.headers.get("Accept-Encoding"));

    if (request.headers.get("Accept-Encoding").get() == "gzip") {
      ASSERT_SOME(response->headers.get("Content-Encoding"));
      EXPECT_EQ(response->headers.get("Content-Encoding").get(), "gzip");
    } else {
      EXPECT_NONE(response->headers.get("Content-Encoding"));
    }
  }

  // Ensure that we actually hit the metrics code path while executing
  // the test.
  JSON::Object metrics = Metrics();
  ASSERT_TRUE(metrics.values["master/http_cache_hits"].is<JSON::Number>());
  ASSERT_GT(
      metrics.values["master/http_cache_hits"].as<JSON::Number>().as<uint64_t>(),
      0u);
}
{noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (MESOS-8782) Transition operations to OPERATION_GONE_BY_OPERATOR when marking an agent gone.
[ https://issues.apache.org/jira/browse/MESOS-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725230#comment-16725230 ]

Benno Evers commented on MESOS-8782:
------------------------------------

Review: https://reviews.apache.org/r/69575/

> Transition operations to OPERATION_GONE_BY_OPERATOR when marking an agent gone.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-8782
>                 URL: https://issues.apache.org/jira/browse/MESOS-8782
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Gastón Kleiman
>            Assignee: Benno Evers
>            Priority: Critical
>              Labels: foundations
>             Fix For: 1.8.0
>
> The master should transition operations to the state {{OPERATION_GONE_BY_OPERATOR}} when an agent is marked gone, sending an operation status update to the frameworks that created them.
> We should also remove them from {{Master::frameworks}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
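As a rough illustration of the intended behavior, here is a self-contained toy model of the transition. The struct names and {{markAgentGone}} are hypothetical stand-ins, not the actual {{Master}} internals:

{code:cpp}
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-ins for mesos::Operation and the master's per-framework state.
enum class OperationState {
  OPERATION_PENDING,
  OPERATION_GONE_BY_OPERATOR,
};

struct Operation {
  std::string uuid;
  std::string frameworkId;
  OperationState state = OperationState::OPERATION_PENDING;
};

struct Framework {
  // Operations tracked per framework (the analogue of the state that
  // should be cleaned out of `Master::frameworks`).
  std::unordered_map<std::string, Operation*> operations;

  // Models delivering an operation status update to the framework.
  void receiveStatusUpdate(const Operation& op) {
    std::cout << "framework " << op.frameworkId << ": operation "
              << op.uuid << " is now OPERATION_GONE_BY_OPERATOR\n";
  }
};

// When an agent is marked gone: transition each of its operations,
// notify the owning framework, and stop tracking the operation.
void markAgentGone(
    std::vector<Operation>& agentOperations,
    std::unordered_map<std::string, Framework>& frameworks)
{
  for (Operation& op : agentOperations) {
    op.state = OperationState::OPERATION_GONE_BY_OPERATOR;

    auto it = frameworks.find(op.frameworkId);
    if (it != frameworks.end()) {
      it->second.receiveStatusUpdate(op);
      it->second.operations.erase(op.uuid); // Remove from tracking.
    }
  }
}
{code}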
[jira] [Commented] (MESOS-9493) libprocess may skip gethostname when accepting connections.
[ https://issues.apache.org/jira/browse/MESOS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725172#comment-16725172 ]

Till Toenshoff commented on MESOS-9493:
---------------------------------------

Additionally, we can skip that {{gethostname}} altogether when the peer certificate verification is based purely on the IP address;
https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/openssl.cpp#L99-L103

> libprocess may skip gethostname when accepting connections.
> ------------------------------------------------------------
>
>                 Key: MESOS-9493
>                 URL: https://issues.apache.org/jira/browse/MESOS-9493
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>            Reporter: Till Toenshoff
>            Priority: Major
>
> When accepting incoming connections on SSL/libevent builds, libprocess attempts to retrieve the hostname for the peer address;
> https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L1158-L1168
> The motivation for that step is the peer certificate verification that possibly happens later in the process;
> https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L441
> The peer certificate verification, however, is optional and switched off by default:
> https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/openssl.cpp#L88-L97
> As an optimisation, we could skip the retrieval of the hostname when certificate verification is disabled.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (MESOS-9493) libprocess may skip gethostname when accepting connections.
Till Toenshoff created MESOS-9493:
-------------------------------------

             Summary: libprocess may skip gethostname when accepting connections.
                 Key: MESOS-9493
                 URL: https://issues.apache.org/jira/browse/MESOS-9493
             Project: Mesos
          Issue Type: Improvement
    Affects Versions: 1.8.0
            Reporter: Till Toenshoff

When accepting incoming connections on SSL/libevent builds, libprocess attempts to retrieve the hostname for the peer address;
https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L1158-L1168
The motivation for that step is the peer certificate verification that possibly happens later in the process;
https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp#L441
The peer certificate verification, however, is optional and switched off by default:
https://github.com/apache/mesos/blob/8344f303ffd6429ffa781e7fd7de5d00d9946d78/3rdparty/libprocess/src/openssl.cpp#L88-L97
As an optimisation, we could skip the retrieval of the hostname when certificate verification is disabled.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
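A minimal sketch of the proposed optimisation, including the IP-only shortcut suggested in the comment above. {{SSLFlags}}, {{resolveHostname}} and {{maybeResolvePeerHostname}} are hypothetical stand-ins for the actual libprocess flags and resolver, not the real API:

{code:cpp}
#include <optional>
#include <string>

// Stand-ins for libprocess internals: `SSLFlags` models the relevant
// openssl flags, and `resolveHostname` models the potentially slow
// reverse-DNS lookup performed when accepting a connection.
struct SSLFlags {
  bool verifyCert = false;  // Peer certificate verification (off by default).
  bool verifyIPAdd = false; // Verification can be satisfied by the IP alone.
};

std::optional<std::string> resolveHostname(const std::string& peerIP)
{
  return std::nullopt; // Stub for the real reverse-DNS lookup.
}

// Only pay for the reverse lookup when a hostname is actually needed,
// i.e. verification is enabled and cannot be satisfied by the IP alone.
std::optional<std::string> maybeResolvePeerHostname(
    const SSLFlags& flags, const std::string& peerIP)
{
  if (!flags.verifyCert) {
    return std::nullopt; // Verification disabled: hostname never inspected.
  }

  if (flags.verifyIPAdd) {
    return std::nullopt; // IP-based verification: hostname not required.
  }

  return resolveHostname(peerIP);
}
{code}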
[jira] [Comment Edited] (MESOS-9463) Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter
[ https://issues.apache.org/jira/browse/MESOS-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725038#comment-16725038 ]

Andrei Budnik edited comment on MESOS-9463 at 12/19/18 2:23 PM:
----------------------------------------------------------------

Since the GTEST filter [does not support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests] a boolean AND operator and does not support composition (to emulate an AND operator using De Morgan's laws), we should either:
1) Fix the Mesos containerizer and Mesos tests to support launching ROOT tests in parallel, or
2) Run all tests in sequential mode when GTEST_FILTER is specified.

was (Author: abudnik):
Since GTEST filter [does not support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests] boolean AND operator and does not support composition (to emulate AND operator using De Morgan's laws), we should either:
1) Fix mesos c'zer and mesos tests to support launching ROOT tests in parallel
2) when GTEST_FILTER is specified, run all tests in sequential mode

> Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter
> --------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9463
>                 URL: https://issues.apache.org/jira/browse/MESOS-9463
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Bannier
>            Priority: Major
>              Labels: parallel-tests, test
>
> Users expect to be able to select tests to run via {{make check}} with a {{GTEST_FILTER}} environment variable. The parallel test runner, on the other hand, also programmatically injects filter expressions to select tests to execute sequentially.
> This causes e.g. all {{*ROOT_*}} tests to be run in the sequential phase for superusers, even if a {{GTEST_FILTER}} was set.
> It seems that we need to handle a set {{GTEST_FILTER}} environment variable more carefully.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (MESOS-9463) Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter
[ https://issues.apache.org/jira/browse/MESOS-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725038#comment-16725038 ]

Andrei Budnik commented on MESOS-9463:
--------------------------------------

Since the GTEST filter [does not support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests] a boolean AND operator and does not support composition (to emulate an AND operator using De Morgan's laws), we should either:
1) Fix the Mesos containerizer and Mesos tests to support launching ROOT tests in parallel, or
2) Run all tests in sequential mode when GTEST_FILTER is specified (see the illustration after this message).

> Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter
> --------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9463
>                 URL: https://issues.apache.org/jira/browse/MESOS-9463
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Bannier
>            Priority: Major
>              Labels: parallel-tests, test
>
> Users expect to be able to select tests to run via {{make check}} with a {{GTEST_FILTER}} environment variable. The parallel test runner, on the other hand, also programmatically injects filter expressions to select tests to execute sequentially.
> This causes e.g. all {{*ROOT_*}} tests to be run in the sequential phase for superusers, even if a {{GTEST_FILTER}} was set.
> It seems that we need to handle a set {{GTEST_FILTER}} environment variable more carefully.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
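To illustrate the missing AND operator with concrete filter strings (example patterns, not taken from the runner itself):

{noformat}
# The parallel runner splits tests into two phases:
#   sequential phase:  --gtest_filter='*ROOT_*'
#   parallel phase:    --gtest_filter='*-*ROOT_*'
#
# Suppose a user sets GTEST_FILTER='*Docker*'. Negative patterns still
# compose, so the parallel phase could run:
#   --gtest_filter='*Docker*-*ROOT_*'   # Docker tests that are not ROOT.
#
# But the sequential phase would need "matches *Docker* AND matches
# *ROOT_*", and the gtest filter syntax has no AND operator with which
# the runner could mechanically intersect an arbitrary user pattern
# with its own phase pattern.
{noformat}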