[jira] [Updated] (MESOS-4691) Add a HierarchicalAllocator benchmark with reservation labels.
[ https://issues.apache.org/jira/browse/MESOS-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4691: Shepherd: Joris Van Remoortere (was: Michael Park) > Add a HierarchicalAllocator benchmark with reservation labels. > -- > > Key: MESOS-4691 > URL: https://issues.apache.org/jira/browse/MESOS-4691 > Project: Mesos > Issue Type: Task >Reporter: Michael Park >Assignee: Neil Conway > Labels: mesosphere > Fix For: 0.28.0 > > > With {{Labels}} being part of the {{ReservationInfo}}, we should ensure that > we don't observe a significant performance degradation in the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4816) Expose TaskInfo to Isolators
[ https://issues.apache.org/jira/browse/MESOS-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175065#comment-15175065 ] Guangya Liu commented on MESOS-4816: 1) When the Kubernetes framework registers, it creates an executor: https://github.com/kubernetes/kubernetes/blob/master/contrib/mesos/pkg/scheduler/service/service.go#L492-L499 2) Kubernetes then uses this executor to launch tasks: https://github.com/kubernetes/kubernetes/blob/master/contrib/mesos/pkg/scheduler/podtask/pod_task.go#L191-L198 So the executor is launched with the first task, and all later tasks reuse this executor, which means the isolator cannot get the other tasks' infos. > Expose TaskInfo to Isolators > > > Key: MESOS-4816 > URL: https://issues.apache.org/jira/browse/MESOS-4816 > Project: Mesos > Issue Type: Improvement > Components: modules, slave >Reporter: Connor Doyle > > Authors of custom isolator modules frequently require access to the TaskInfo > in order to read custom metadata in task labels. > Currently, it's possible to link containers to tasks within a module by > implementing both an isolator and the {{slaveRunTaskLabelDecorator}} hook, > and maintaining a shared map of containers to tasks. This way works, but > adds unnecessary complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4828) XFS disk quota isolator
[ https://issues.apache.org/jira/browse/MESOS-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175062#comment-15175062 ] James Peach commented on MESOS-4828: [~jieyu] and [~xujyan] volunteered to shepherd. > XFS disk quota isolator > --- > > Key: MESOS-4828 > URL: https://issues.apache.org/jira/browse/MESOS-4828 > Project: Mesos > Issue Type: Improvement > Components: isolation >Reporter: James Peach >Assignee: James Peach > > Implement a disk resource isolator using XFS project quotas. Compared to the > {{posix/disk}} isolator, this doesn't need to scan the filesystem > periodically, and applications receive an {{ENOSPC}} error instead of being > summarily killed. > This initial implementation only isolates sandbox directory resources, since > isolation doesn't have any visibility into the lifecycle of volumes, > which is needed to assign and track project IDs. > The build dependencies for this are the XFS headers (from xfsprogs-devel) and > libblkid. We need libblkid or the equivalent to map filesystem paths to block > devices in order to apply quotas. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4492) Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation
[ https://issues.apache.org/jira/browse/MESOS-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168583#comment-15168583 ] Fan Du edited comment on MESOS-4492 at 3/2/16 5:16 AM: --- Here goes the RR: (Discarded) https://reviews.apache.org/r/44058/ Updated RR with document fix and test code addon: https://reviews.apache.org/r/44255/ was (Author: fan.du): Here goes the RR: https://reviews.apache.org/r/44058/ > Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation > -- > > Key: MESOS-4492 > URL: https://issues.apache.org/jira/browse/MESOS-4492 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Fan Du >Assignee: Fan Du >Priority: Minor > > This ticket aims to enable user or operator to inspect operation statistics > such as RESERVE, UNRESERVE, CREATE and DESTROY, current implementation only > supports LAUNCH. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4816) Expose TaskInfo to Isolators
[ https://issues.apache.org/jira/browse/MESOS-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175016#comment-15175016 ] James Peach commented on MESOS-4816: {quote} it will not work for some cases such as Kubernetes and Mesos integration where one executor can manage many tasks. {quote} How does this work in Kubernetes? Can you point me to code or something? > Expose TaskInfo to Isolators > > > Key: MESOS-4816 > URL: https://issues.apache.org/jira/browse/MESOS-4816 > Project: Mesos > Issue Type: Improvement > Components: modules, slave >Reporter: Connor Doyle > > Authors of custom isolator modules frequently require access to the TaskInfo > in order to read custom metadata in task labels. > Currently, it's possible to link containers to tasks within a module by > implementing both an isolator and the {{slaveRunTaskLabelDecorator}} hook, > and maintaining a shared map of containers to tasks. This way works, but > adds unnecessary complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
[ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-4831: -- Assignee: Guangya Liu > Master sometimes sends two inverse offers after the agent goes into > maintenance. > > > Key: MESOS-4831 > URL: https://issues.apache.org/jira/browse/MESOS-4831 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Anand Mazumdar >Assignee: Guangya Liu > Labels: maintenance, mesosphere > > Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} > https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull > {code} > I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to > allocate! > I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for > slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns > I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework > fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > {code} > The ideal expected workflow for this test is something like: > - The framework receives offers from master. > - The framework updates its maintenance schedule. > - The current offer is rescinded. > - A new offer is received from the master with unavailability set. > - After the agent goes for maintenance, an inverse offer is sent. > For some reason, in the logs we see that the master is sending 2 inverse > offers. The test seems to pass as we just check for the initial inverse offer > being present. This can also be reproduced by a modified version of the > original test. 
> {code}
> // Test ensures that an offer will have an `unavailability` set if the
> // slave is scheduled to go down for maintenance.
> TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
> {
>   Try<PID<Master>> master = StartMaster();
>   ASSERT_SOME(master);
>
>   MockExecutor exec(DEFAULT_EXECUTOR_ID);
>
>   Try<PID<Slave>> slave = StartSlave();
>   ASSERT_SOME(slave);
>
>   auto scheduler = std::make_shared<MockV1HTTPScheduler>();
>
>   EXPECT_CALL(*scheduler, heartbeat(_))
>     .WillRepeatedly(Return()); // Ignore heartbeats.
>
>   Future<Nothing> connected;
>   EXPECT_CALL(*scheduler, connected(_))
>     .WillOnce(FutureSatisfy(&connected))
>     .WillRepeatedly(Return()); // Ignore future invocations.
>
>   scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);
>
>   AWAIT_READY(connected);
>
>   Future<Event::Subscribed> subscribed;
>   EXPECT_CALL(*scheduler, subscribed(_, _))
>     .WillOnce(FutureArg<1>(&subscribed));
>
>   Future<Event::Offers> normalOffers;
>   Future<Event::Offers> unavailabilityOffers;
>   Future<Event::Offers> inverseOffers;
>   EXPECT_CALL(*scheduler, offers(_, _))
>     .WillOnce(FutureArg<1>(&normalOffers))
>     .WillOnce(FutureArg<1>(&unavailabilityOffers))
>     .WillOnce(FutureArg<1>(&inverseOffers));
>
>   // The original offers should be rescinded when the unavailability is changed.
>   Future<Nothing> offerRescinded;
>   EXPECT_CALL(*scheduler, rescind(_, _))
>     .WillOnce(FutureSatisfy(&offerRescinded));
>
>   {
>     Call call;
>     call.set_type(Call::SUBSCRIBE);
>
>     Call::Subscribe* subscribe = call.mutable_subscribe();
>     subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);
>
>     mesos.send(call);
>   }
>
>   AWAIT_READY(subscribed);
>
>   v1::FrameworkID frameworkId(subscribed->framework_id());
>
>   AWAIT_READY(normalOffers);
>   EXPECT_NE(0, normalOffers->offers().size());
>
>   // Regular offers shouldn't have unavailability.
>   foreach (const v1::Offer& offer, normalOffers->offers()) {
>     EXPECT_FALSE(offer.has_unavailability());
>   }
>
>   // Schedule this slave for maintenance.
>   MachineID machine;
>   machine.set_hostname(maintenanceHostname);
>   machine.set_ip(stringify(slave.get().address.ip));
>
>   const Time start = Clock::now() + Seconds(60);
>   const Duration duration = Seconds(120);
>   const Unavailability unavailability = createUnavailability(start, duration);
>
>   // Post a valid schedule with one machine.
>   maintenance::Schedule schedule = createSchedule(
>       {createWindow({machine}, unavailability)});
>
>   // We have a few seconds between the first set of offers and the
>   // next allocation of offers. This should be enough time to perform
>   // a maintenance schedule update. This update will also trigger the
>   // rescinding of offers from the scheduled slave.
>   Future<http::Response> response = process::http::post(
>       master.get(),
>       "maintenance/schedule",
>       headers,
>
[jira] [Commented] (MESOS-4816) Expose TaskInfo to Isolators
[ https://issues.apache.org/jira/browse/MESOS-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174848#comment-15174848 ] Guangya Liu commented on MESOS-4816: I saw that MESOS-4500 enabled {{Expose ExecutorInfo and TaskInfo for isolators in prepare()}}, but as [~cdoyle] points out, this is not enough: {{prepare}} will only be invoked once per container executor, so it will not work for cases such as the Kubernetes/Mesos integration, where one executor can manage many tasks. Does it make sense to keep this ticket open and update the isolator API's {{update()}} to pass a list of {{TaskInfo}} to cover more cases? > Expose TaskInfo to Isolators > > > Key: MESOS-4816 > URL: https://issues.apache.org/jira/browse/MESOS-4816 > Project: Mesos > Issue Type: Improvement > Components: modules, slave >Reporter: Connor Doyle > > Authors of custom isolator modules frequently require access to the TaskInfo > in order to read custom metadata in task labels. > Currently, it's possible to link containers to tasks within a module by > implementing both an isolator and the {{slaveRunTaskLabelDecorator}} hook, > and maintaining a shared map of containers to tasks. This way works, but > adds unnecessary complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4836) Fix rmdir for windows
Vinod Kone created MESOS-4836: - Summary: Fix rmdir for windows Key: MESOS-4836 URL: https://issues.apache.org/jira/browse/MESOS-4836 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Alex Clemmer This is due to a bug in MESOS-4415 that landed for 0.27.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4796) Debug ability enhancement for unified container
[ https://issues.apache.org/jira/browse/MESOS-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174788#comment-15174788 ] Guangya Liu commented on MESOS-4796: Thanks [~jieyu], there are still a couple of patches needed to address this in the backends, isolators, etc. Shall I reopen this ticket to continue the patches? > Debug ability enhancement for unified container > --- > > Key: MESOS-4796 > URL: https://issues.apache.org/jira/browse/MESOS-4796 > Project: Mesos > Issue Type: Improvement >Reporter: Guangya Liu >Assignee: Guangya Liu > Fix For: 0.28.0 > > > The following are some starting points for what I want to do here after some > discussion with [~jieyu]; there will be more enhancements later. > docker/local_puller: > LocalPullerProcess::extractLayer: add some detail on how the layer is extracted. > LocalPullerProcess::pull: Update the message to add image info to the log. > docker/puller.cpp: > Puller::create: Clarify which puller is in use: local or registry. > docker/registry_puller.cpp: > RegistryPullerProcess::pull: Clarify which image is going to be pulled. > RegistryPullerProcess::__pull: Add some detail for roots, layerPath, tarpath, > JSON etc. when creating the layer path. > RegistryPullerProcess::fetchBlobs: Update the log message for > reference: stringify(reference). > backends/bind.cpp: > BindBackendProcess::provision: Add more detail for provision, such as the mount > point. > BindBackendProcess::destroy: Log which rootfs is being destroyed. > backends/copy.cpp: > CopyBackendProcess::destroy: Log which rootfs is being destroyed. > CopyBackendProcess::provision: Add more detail for provision info, such as the > rootfs. > mesos/isolators/docker/runtime.cpp: > Add some logs to clarify how > DockerRuntimeIsolatorProcess::prepare prepares the docker runtime > isolator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2840) MesosContainerizer support multiple image provisioners
[ https://issues.apache.org/jira/browse/MESOS-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174727#comment-15174727 ] Vinod Kone commented on MESOS-2840: --- IIUC, the MVP for this feature is complete? If yes, can you move the unresolved issues into a new epic and close this one? We also need a blurb for this in the CHANGELOG and user doc. > MesosContainerizer support multiple image provisioners > -- > > Key: MESOS-2840 > URL: https://issues.apache.org/jira/browse/MESOS-2840 > Project: Mesos > Issue Type: Epic > Components: containerization, docker >Affects Versions: 0.23.0 >Reporter: Marco Massenzio >Assignee: Timothy Chen > Labels: mesosphere, twitter > > We want to utilize the Appc integration interfaces to further make > MesosContainerizers to support multiple image formats. > This allows our future work on isolators to support any container image > format. > Design > [open to public comments] > https://docs.google.com/document/d/1oUpJNjJ0l51fxaYut21mKPwJUiAcAdgbdF7SAdAW2PA/edit?usp=sharing > [original document, requires permission] > https://docs.google.com/a/twitter.com/document/d/1Fx5TS0LytV7u5MZExQS0-g-gScX2yKCKQg9UPFzhp6U/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4343) Introduce the ability to assign network handles to mesos containers
[ https://issues.apache.org/jira/browse/MESOS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174725#comment-15174725 ] Vinod Kone commented on MESOS-4343: --- Can you add a blurb in the CHANGELOG describing this feature? This is one of the few epics going into the 0.28.0 release. Great to see that there is already a user doc for this. > Introduce the ability to assign network handles to mesos containers > --- > > Key: MESOS-4343 > URL: https://issues.apache.org/jira/browse/MESOS-4343 > Project: Mesos > Issue Type: Epic > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: containers, mesosphere > Fix For: 0.28.0 > > > Linux provides net_cls as a cgroup subsystem. A net_cls cgroup is associated > with a 16-bit major handle and a 16-bit minor handle. When a task is > associated with a net_cls cgroup, the kernel tags every packet being > generated by the task with the major and minor handle associated with the > net_cls cgroup. These tags are then used by network performance shaping and > firewall tools such as tc (traffic controller) and iptables. > Currently, mesos agents do not provide any isolator that can enable > mesos-containers in a net_cls cgroup, or assign network handles to a net_cls > cgroup. As part of this epic we plan to achieve the following: > a) Implement net_cls cgroup isolator for mesos agents. > b) Implement a manager for the net_cls handles. > c) Allow operators to set a major network handle when launching an agent. > d) Expose the net_cls network handle allocated to a container, to entities > such as operators and frameworks. > Once the above goals are met operators can learn about network handles > allocated to containers and apply them to tools such as tc and iptables to > enforce network policies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
[ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4831: -- Description: Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull {code} I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to allocate! I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) {code} The ideal expected workflow for this test is something like: - The framework receives offers from master. - The framework updates its maintenance schedule. - The current offer is rescinded. - A new offer is received from the master with unavailability set. - After the agent goes for maintenance, an inverse offer is sent. For some reason, in the logs we see that the master is sending 2 inverse offers. The test seems to pass as we just check for the initial inverse offer being present. This can also be reproduced by a modified version of the original test. {code} // Test ensures that an offer will have an `unavailability` set if the // slave is scheduled to go down for maintenance. 
TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
{
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);

  MockExecutor exec(DEFAULT_EXECUTOR_ID);

  Try<PID<Slave>> slave = StartSlave();
  ASSERT_SOME(slave);

  auto scheduler = std::make_shared<MockV1HTTPScheduler>();

  EXPECT_CALL(*scheduler, heartbeat(_))
    .WillRepeatedly(Return()); // Ignore heartbeats.

  Future<Nothing> connected;
  EXPECT_CALL(*scheduler, connected(_))
    .WillOnce(FutureSatisfy(&connected))
    .WillRepeatedly(Return()); // Ignore future invocations.

  scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);

  AWAIT_READY(connected);

  Future<Event::Subscribed> subscribed;
  EXPECT_CALL(*scheduler, subscribed(_, _))
    .WillOnce(FutureArg<1>(&subscribed));

  Future<Event::Offers> normalOffers;
  Future<Event::Offers> unavailabilityOffers;
  Future<Event::Offers> inverseOffers;
  EXPECT_CALL(*scheduler, offers(_, _))
    .WillOnce(FutureArg<1>(&normalOffers))
    .WillOnce(FutureArg<1>(&unavailabilityOffers))
    .WillOnce(FutureArg<1>(&inverseOffers));

  // The original offers should be rescinded when the unavailability is changed.
  Future<Nothing> offerRescinded;
  EXPECT_CALL(*scheduler, rescind(_, _))
    .WillOnce(FutureSatisfy(&offerRescinded));

  {
    Call call;
    call.set_type(Call::SUBSCRIBE);

    Call::Subscribe* subscribe = call.mutable_subscribe();
    subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);

    mesos.send(call);
  }

  AWAIT_READY(subscribed);

  v1::FrameworkID frameworkId(subscribed->framework_id());

  AWAIT_READY(normalOffers);
  EXPECT_NE(0, normalOffers->offers().size());

  // Regular offers shouldn't have unavailability.
  foreach (const v1::Offer& offer, normalOffers->offers()) {
    EXPECT_FALSE(offer.has_unavailability());
  }

  // Schedule this slave for maintenance.
  MachineID machine;
  machine.set_hostname(maintenanceHostname);
  machine.set_ip(stringify(slave.get().address.ip));

  const Time start = Clock::now() + Seconds(60);
  const Duration duration = Seconds(120);
  const Unavailability unavailability = createUnavailability(start, duration);

  // Post a valid schedule with one machine.
  maintenance::Schedule schedule = createSchedule(
      {createWindow({machine}, unavailability)});

  // We have a few seconds between the first set of offers and the
  // next allocation of offers. This should be enough time to perform
  // a maintenance schedule update. This update will also trigger the
  // rescinding of offers from the scheduled slave.
  Future<http::Response> response = process::http::post(
      master.get(),
      "maintenance/schedule",
      headers,
      stringify(JSON::protobuf(schedule)));

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);

  // The original offers should be rescinded when the unavailability
  // is changed.
  AWAIT_READY(offerRescinded);

  AWAIT_READY(unavailabilityOffers);
  EXPECT_NE(0, unavailabilityOffers->offers().size());

  // Make sure the new offers have the unavailability set.
  foreach (const v1::Offer& offer, unavailabilityOffers->offers()) {
    EXPECT_TRUE(offer.has_unavailability());
    EXPECT_EQ(
        unavailability.start().nanoseconds(),
        offer.unavailability().start().nanoseconds());
    EXPECT_EQ(
        unavailability.duration().nanoseconds(),
[jira] [Created] (MESOS-4834) Add 'file' fetcher plugin.
Jojy Varghese created MESOS-4834: Summary: Add 'file' fetcher plugin. Key: MESOS-4834 URL: https://issues.apache.org/jira/browse/MESOS-4834 Project: Mesos Issue Type: Task Components: containerization Reporter: Jojy Varghese Assignee: Jojy Varghese Add support for a "file"-based URI fetcher. This could be useful for container image provisioning from the local file system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4586) Resources clarification in Mesos UI
[ https://issues.apache.org/jira/browse/MESOS-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-4586: -- Affects Version/s: (was: 0.27.0) (was: 0.26.0) In the main page, "Used" should actually be called "Allocated" because it represents the resources allocated. "Offered" represents resources that are currently offered to framework(s) but frameworks haven't accepted/declined them. "Idle": Total - Used/Allocated - Offered. Note that even though a resource might be idle it might not be offered to framework(s) if there are filters set on it (e.g., declined by a framework for 1 day). > Resources clarification in Mesos UI > --- > > Key: MESOS-4586 > URL: https://issues.apache.org/jira/browse/MESOS-4586 > Project: Mesos > Issue Type: Improvement >Reporter: Craig W > > On the Mesos UI, under the "resources" section listing CPUs and Mem, the > value seems to be calculated by summing up every executor's cpu and memory > statistics, which would be <= the "allocated" resources. > On the page that displays information for a slave, it shows the CPUs and Mem > used and allocated. > When I look at the Mesos UI front page, I was looking at "Idle" resources as > the amount of resources I have available for offers. However, that's not the > case. It would be nice to have it show the amount of "free" or "available" > resources as well as "idle", so I can better determine how many resources I > actually have available for scheduling additional tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4825) Master's slave reregister logic does not update version field
[ https://issues.apache.org/jira/browse/MESOS-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174659#comment-15174659 ] Klaus Ma commented on MESOS-4825: - RR: https://reviews.apache.org/r/44236/ > Master's slave reregister logic does not update version field > - > > Key: MESOS-4825 > URL: https://issues.apache.org/jira/browse/MESOS-4825 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Joris Van Remoortere >Assignee: Klaus Ma >Priority: Blocker > Fix For: 0.28.0 > > > The master's logic for reregistering a slave does not update the version > field if the slave re-registers with a new version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
[ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4831: -- Description: Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull {code} I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to allocate! I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) {code} The ideal expected workflow for this test is something like: - The framework receives offers from master. - The framework updates its maintenance schedule. - The current offer is rescinded. - A new offer is received from the master with unavailability set. - After the agent goes for maintenance, an inverse offer is sent. For some reason, in the logs we see that the master is sending 2 inverse offers. The test seems to pass as we just check for the initial inverse offer being present. Also, unrelated, we need to clean up this test to not expect multiple offers i.e. remove {{numberOfOffers}} constant. was: Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} {code} I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to allocate! 
I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) {code} The ideal expected workflow for this test is something like: - The framework receives offers from master. - The framework updates its maintenance schedule. - The current offer is rescinded. - A new offer is received from the master with unavailability set. - After the agent goes for maintenance, an inverse offer is sent. For some reason, in the logs we see that the master is sending 2 inverse offers. The test seems to pass as we just check for the initial inverse offer being present. Also, unrelated, we need to clean up this test to not expect multiple offers i.e. remove {{numberOfOffers}} constant. > Master sometimes sends two inverse offers after the agent goes into > maintenance. > > > Key: MESOS-4831 > URL: https://issues.apache.org/jira/browse/MESOS-4831 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0 >Reporter: Anand Mazumdar > Labels: maintenance, mesosphere > > Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} > https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull > {code} > I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to > allocate! 
> I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for > slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns > I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework > fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to > framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) > {code} > The ideal expected workflow for this test is something like: > - The framework receives offers from master. > - The framework updates its maintenance schedule. > - The current offer is rescinded. > - A new offer is received from the master with unavailability set. > - After the agent goes for maintenance, an inverse offer is sent. > For some reason, in the logs we see that the master is sending 2 inverse > offers. The test seems to pass as we just check for the initial inverse offer > being present. > Also, unrelated, we need to
[jira] [Updated] (MESOS-4740) Improve metrics/snapshot performance
[ https://issues.apache.org/jira/browse/MESOS-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Robinson updated MESOS-4740: -- Description: [~drobinson] noticed that retrieving metrics/snapshot statistics could be very inefficient. {noformat} [user@server ~]$ time curl -s localhost:5050/metrics/snapshot real 0m35.654s user 0m0.019s sys 0m0.011s {noformat} MESOS-1287 introduced a timeout parameter for this query, but metric collectors like ours are not aware of such a URL-specific parameter, so we need to: 1) always have a timeout, with some sensible default; 2) investigate why metrics/snapshot can take so long to complete under load, since these statistics keep no history and the values are just atomic reads. was: David Robinson noticed that retrieving metrics/snapshot statistics could be very inefficient and cause the Mesos master to get stuck. {noformat} [root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot real 2m7.302s user 0m0.001s sys 0m0.004s {noformat} MESOS-1287 introduced a timeout parameter for this query, but observers like ours are not aware of such a URL-specific parameter, so we need to: 1) always have a timeout, with some sensible default; 2) investigate why metrics/snapshot can take so long to complete under load, since these statistics keep no history and the values are just atomic reads. > Improve metrics/snapshot performance > --- > > Key: MESOS-4740 > URL: https://issues.apache.org/jira/browse/MESOS-4740 > Project: Mesos > Issue Type: Task >Reporter: Cong Wang >Assignee: Cong Wang > > [~drobinson] noticed that retrieving metrics/snapshot statistics could be very > inefficient. 
> {noformat} > [user@server ~]$ time curl -s localhost:5050/metrics/snapshot > real 0m35.654s > user 0m0.019s > sys 0m0.011s > {noformat} > MESOS-1287 introduced a timeout parameter for this query, but > metric collectors like ours are not aware of such a URL-specific > parameter, so we need to: > 1) always have a timeout, with some sensible default; > 2) investigate why metrics/snapshot can take so long to complete > under load, since these statistics keep no history and the values > are just atomic reads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4415) Implement stout/os/windows/rmdir.hpp
[ https://issues.apache.org/jira/browse/MESOS-4415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174573#comment-15174573 ] Joris Van Remoortere commented on MESOS-4415: - https://reviews.apache.org/r/43907/ https://reviews.apache.org/r/43908/ > Implement stout/os/windows/rmdir.hpp > > > Key: MESOS-4415 > URL: https://issues.apache.org/jira/browse/MESOS-4415 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Joris Van Remoortere >Assignee: Alex Clemmer > Labels: mesosphere, windows > Fix For: 0.27.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4821) Introduce a port field in `ImageManifest` in order to set exposed ports for a container.
[ https://issues.apache.org/jira/browse/MESOS-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-4821: - Description: Networking isolators such as `network/cni` need to learn about ports that a container wishes to expose to the outside world. This can be achieved by adding a field to the `ImageManifest` protobuf and allowing the `ImageProvisioner` to set these fields to inform the isolator of the ports that the container wishes to expose. (was: Networking isolators such as `network/cni` need to learn about ports that a container wishes to expose to the outside world. This can be achieved by adding a field to the `ContainerConfig` protobuf and allowing the `Containerizer` or the framework to set these fields to inform the isolator of the ports that the container wishes to expose. ) Summary: Introduce a port field in `ImageManifest` in order to set exposed ports for a container. (was: Introduce a port field in `ContainerConfig` in order to set exposed ports for a container.) > Introduce a port field in `ImageManifest` in order to set exposed ports for a > container. > > > Key: MESOS-4821 > URL: https://issues.apache.org/jira/browse/MESOS-4821 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: linux >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: mesosphere > > Networking isolators such as `network/cni` need to learn about ports that a > container wishes to expose to the outside world. This can be achieved by > adding a field to the `ImageManifest` protobuf and allowing the > `ImageProvisioner` to set these fields to inform the isolator of the ports > that the container wishes to expose. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4780) Remove `user` and `rootfs` flags in Windows launcher.
[ https://issues.apache.org/jira/browse/MESOS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174440#comment-15174440 ] Joris Van Remoortere edited comment on MESOS-4780 at 3/1/16 10:42 PM: -- https://reviews.apache.org/r/43904/ https://reviews.apache.org/r/43905/ https://reviews.apache.org/r/40938/ https://reviews.apache.org/r/40939/ was (Author: jvanremoortere): https://reviews.apache.org/r/43904/ https://reviews.apache.org/r/43905/ > Remove `user` and `rootfs` flags in Windows launcher. > - > > Key: MESOS-4780 > URL: https://issues.apache.org/jira/browse/MESOS-4780 > Project: Mesos > Issue Type: Task >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4780) Remove `user` and `rootfs` flags in Windows launcher.
[ https://issues.apache.org/jira/browse/MESOS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174457#comment-15174457 ] Joris Van Remoortere edited comment on MESOS-4780 at 3/1/16 10:42 PM: -- {code} commit 9f1b115a67a1625a4807c2a7d4e1a41bca1af2a6 Author: Daniel Pravat Date: Tue Mar 1 14:18:41 2016 -0800 Stout: Marked `os::su` as deleted on Windows. Review: https://reviews.apache.org/r/40939/ commit a1f731746657b1cbcf136ddb2bf154ca3da271fc Author: Daniel Pravat Date: Tue Mar 1 14:16:08 2016 -0800 Stout: Marked `os::chroot` as deleted on Windows. Review: https://reviews.apache.org/r/40938/ commit a1a9cd5939d25f82214a5c533bde96a3493f81f3 Author: Alex Clemmer Date: Tue Mar 1 13:35:13 2016 -0800 Windows: Stout: Removed user based functions. Review: https://reviews.apache.org/r/43905/ commit b9de8c6a06f0d0246ea38ab5586de1d0b2478c38 Author: Alex Clemmer Date: Tue Mar 1 13:33:37 2016 -0800 Windows: Removed `user` launcher flag, preventing `su`. `su` does not exist on Windows. Unfortunately, the launcher also depends on it. In this commit, we remove Windows support for the launcher flag `user`, which controls whether we use `su` in the launcher. This allows us to divest ourselves of `su` altogether on Windows. Review: https://reviews.apache.org/r/43905/ {code} was (Author: jvanremoortere): {code} commit a1a9cd5939d25f82214a5c533bde96a3493f81f3 Author: Alex Clemmer Date: Tue Mar 1 13:35:13 2016 -0800 Windows: Stout: Removed user based functions. Review: https://reviews.apache.org/r/43905/ commit b9de8c6a06f0d0246ea38ab5586de1d0b2478c38 Author: Alex Clemmer Date: Tue Mar 1 13:33:37 2016 -0800 Windows: Removed `user` launcher flag, preventing `su`. `su` does not exist on Windows. Unfortunately, the launcher also depends on it. In this commit, we remove Windows support for the launcher flag `user`, which controls whether we use `su` in the launcher. This allows us to divest ourselves of `su` altogether on Windows. 
Review: https://reviews.apache.org/r/43905/ {code} > Remove `user` and `rootfs` flags in Windows launcher. > - > > Key: MESOS-4780 > URL: https://issues.apache.org/jira/browse/MESOS-4780 > Project: Mesos > Issue Type: Task >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
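The `os::su`/`os::chroot` commits above mark POSIX-only calls as deleted on Windows. A minimal sketch of that pattern (simplified signatures, not the actual stout code, which returns `Try<Nothing>`):

```cpp
#include <string>

// Sketch of the "deleted on Windows" pattern used by the commits above:
// a deleted declaration turns any Windows use of `su` into a
// compile-time error instead of a runtime failure.
#ifdef _WIN32
int su(const std::string& user) = delete;
#else
inline int su(const std::string& user)
{
  // A real implementation would look up `user` and call setuid()/setgid();
  // here we only model success/failure for illustration.
  return user.empty() ? -1 : 0;
}
#endif
```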
[jira] [Commented] (MESOS-4712) Remove 'force' field from the Subscribe Call in v1 Scheduler API
[ https://issues.apache.org/jira/browse/MESOS-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174542#comment-15174542 ] Vinod Kone commented on MESOS-4712: --- test > Remove 'force' field from the Subscribe Call in v1 Scheduler API > > > Key: MESOS-4712 > URL: https://issues.apache.org/jira/browse/MESOS-4712 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Vinod Kone > Fix For: 0.28.0 > > > We/I introduced the `force` field in the SUBSCRIBE call to deal with scheduler > partition cases. Having thought a bit more and discussing with a few other > folks ([~anandmazumdar], [~greggomann]), I think we can get away with not > having that field in the v1 API. The obvious advantage of removing the field > is that framework devs don't have to think about how/when to set the field > (the current semantics are a bit confusing). > The new workflow when a master receives a SUBSCRIBE call is that the master > always accepts this call and closes any existing connection (after sending > an ERROR event) from the same scheduler (identified by framework id). > The expectation from schedulers is that they must close the old subscribe > connection before resending a new SUBSCRIBE call. > Let's look at some tricky scenarios and see how this works and why it is safe. > 1) Connection disconnection @ the scheduler but not @ the master > > Scheduler sees the disconnection and sends a new SUBSCRIBE call. Master sends > ERROR on the old connection (won't be received by the scheduler because the > connection is already closed) and closes it. > 2) Connection disconnection @ master but not @ scheduler > Scheduler realizes this from lack of HEARTBEAT events. It then closes its > existing connection and sends a new SUBSCRIBE call. Master accepts the new > SUBSCRIBE call. There is no old connection to close on the master as it is > already closed. > 3) Scheduler failover but no disconnection @ master > Newly elected scheduler sends a SUBSCRIBE call. 
Master sends ERROR event and > closes the old connection (won't be received because the old scheduler failed > over). > 4) If Scheduler A got partitioned (but is alive and connected with master) > and Scheduler B got elected as new leader. > When Scheduler B sends SUBSCRIBE, master sends ERROR and closes the > connection from Scheduler A. Master accepts Scheduler B's connection. > Typically Scheduler A aborts after receiving ERROR and gets restarted. After > restart it won't become the leader because Scheduler B is already elected. > 5) Scheduler sends SUBSCRIBE, times out, closes the SUBSCRIBE connection (A) > and sends a new SUBSCRIBE (B). Master receives SUBSCRIBE (B) and then > receives SUBSCRIBE (A) but doesn't see A's disconnection yet. > Master first accepts SUBSCRIBE (B). After it receives SUBSCRIBE (A), it sends > ERROR to SUBSCRIBE (B) and closes that connection. When it accepts SUBSCRIBE > (A) and tries to send the SUBSCRIBED event, the connection closure is detected. > The scheduler retries the SUBSCRIBE connection after a backoff. I think this > race is rare enough that it won't happen continuously in a loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
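The workflow described above -- the master always accepts a new SUBSCRIBE and force-closes any existing connection for the same framework id -- can be sketched with hypothetical types (an illustration, not Mesos master code):

```cpp
#include <map>
#include <string>

// Illustrative sketch of the SUBSCRIBE workflow described above:
// the master always accepts the new connection and sends ERROR on
// (then closes) any existing connection for the same framework id.
struct Connection
{
  int id = 0;
};

struct Master
{
  std::map<std::string, Connection> subscribed;  // framework id -> connection

  // Returns true if an old connection had to be errored and closed.
  bool subscribe(const std::string& frameworkId, int connectionId)
  {
    bool closedOld = subscribed.count(frameworkId) > 0;
    // In cases (1) and (3) above the ERROR event on the old connection
    // is never actually received, because the scheduler side is gone.
    subscribed[frameworkId] = Connection{connectionId};
    return closedOld;
  }
};
```

Note how scenario (5) falls out of this: the later-arriving SUBSCRIBE (A) simply displaces (B), and the scheduler's retry-with-backoff resolves the race.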
[jira] [Issue Comment Deleted] (MESOS-4712) Remove 'force' field from the Subscribe Call in v1 Scheduler API
[ https://issues.apache.org/jira/browse/MESOS-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-4712: -- Comment: was deleted (was: test) > Remove 'force' field from the Subscribe Call in v1 Scheduler API > > > Key: MESOS-4712 > URL: https://issues.apache.org/jira/browse/MESOS-4712 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Vinod Kone > Fix For: 0.28.0 > > > We/I introduced the `force` field in the SUBSCRIBE call to deal with scheduler > partition cases. Having thought a bit more and discussing with a few other > folks ([~anandmazumdar], [~greggomann]), I think we can get away with not > having that field in the v1 API. The obvious advantage of removing the field > is that framework devs don't have to think about how/when to set the field > (the current semantics are a bit confusing). > The new workflow when a master receives a SUBSCRIBE call is that the master > always accepts this call and closes any existing connection (after sending > an ERROR event) from the same scheduler (identified by framework id). > The expectation from schedulers is that they must close the old subscribe > connection before resending a new SUBSCRIBE call. > Let's look at some tricky scenarios and see how this works and why it is safe. > 1) Connection disconnection @ the scheduler but not @ the master > > Scheduler sees the disconnection and sends a new SUBSCRIBE call. Master sends > ERROR on the old connection (won't be received by the scheduler because the > connection is already closed) and closes it. > 2) Connection disconnection @ master but not @ scheduler > Scheduler realizes this from lack of HEARTBEAT events. It then closes its > existing connection and sends a new SUBSCRIBE call. Master accepts the new > SUBSCRIBE call. There is no old connection to close on the master as it is > already closed. > 3) Scheduler failover but no disconnection @ master > Newly elected scheduler sends a SUBSCRIBE call. 
Master sends ERROR event and > closes the old connection (won't be received because the old scheduler failed > over). > 4) If Scheduler A got partitioned (but is alive and connected with master) > and Scheduler B got elected as new leader. > When Scheduler B sends SUBSCRIBE, master sends ERROR and closes the > connection from Scheduler A. Master accepts Scheduler B's connection. > Typically Scheduler A aborts after receiving ERROR and gets restarted. After > restart it won't become the leader because Scheduler B is already elected. > 5) Scheduler sends SUBSCRIBE, times out, closes the SUBSCRIBE connection (A) > and sends a new SUBSCRIBE (B). Master receives SUBSCRIBE (B) and then > receives SUBSCRIBE (A) but doesn't see A's disconnection yet. > Master first accepts SUBSCRIBE (B). After it receives SUBSCRIBE (A), it sends > ERROR to SUBSCRIBE (B) and closes that connection. When it accepts SUBSCRIBE > (A) and tries to send the SUBSCRIBED event, the connection closure is detected. > The scheduler retries the SUBSCRIBE connection after a backoff. I think this > race is rare enough that it won't happen continuously in a loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4820) Need to set `EXPOSED` ports from docker images into `ContainerConfig`
[ https://issues.apache.org/jira/browse/MESOS-4820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan reassigned MESOS-4820: Assignee: Avinash Sridharan > Need to set `EXPOSED` ports from docker images into `ContainerConfig` > - > > Key: MESOS-4820 > URL: https://issues.apache.org/jira/browse/MESOS-4820 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan >Priority: Critical > Labels: mesosphere > > Most docker images have an `EXPOSE` command associated with them. This tells > the container run-time the TCP ports that the micro-service "wishes" to > expose to the outside world. > With the `Unified Containerizer` project, since `MesosContainerizer` is going > to natively support Docker images, it is imperative that the Mesos container > runtime have a mechanism to expose ports listed in a Docker image. The first > step to achieve this is to extract this information from the Docker image > and set it in the `ContainerConfig`. The `ContainerConfig` can then be used to > pass this information to any isolator (e.g. the `network/cni` isolator) that > will install port forwarding rules to expose the desired ports. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
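The first step described above -- extracting the `EXPOSE`d ports from the Docker image -- might look roughly like this (a hypothetical helper, not the actual provisioner code; in an image manifest the entries have the form `"8080/tcp"`):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of extracting EXPOSEd ports from a Docker image
// manifest, where they appear as strings like "8080/tcp", so they can
// be stored in `ContainerConfig` for isolators such as `network/cni`.
std::vector<int> exposedPorts(const std::vector<std::string>& entries)
{
  std::vector<int> ports;
  for (const std::string& entry : entries) {
    // Each entry is "<port>/<protocol>"; std::stoi stops at the '/'.
    ports.push_back(std::stoi(entry));
  }
  return ports;
}
```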
[jira] [Commented] (MESOS-4832) DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes exits when the /tmp directory is bind-mounted
[ https://issues.apache.org/jira/browse/MESOS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174482#comment-15174482 ] Jie Yu commented on MESOS-4832: --- I think the problem is that we're trying to unmount the persistent volume twice: {noformat} I0226 03:17:28.127876 1114 docker.cpp:912] Unmounting volume for container 'bcc90102-163d-4ff6-a3fc-a1b2e3fc3b7c' I0226 03:17:28.127957 1114 docker.cpp:912] Unmounting volume for container 'bcc90102-163d-4ff6-a3fc-a1b2e3fc3b7c' {noformat} Looking at the code: {code}
Try<Nothing> unmountPersistentVolumes(const ContainerID& containerId)
{
  // We assume volumes are only supported on Linux, and also
  // the target path contains the containerId.
#ifdef __linux__
  Try<fs::MountInfoTable> table = fs::MountInfoTable::read();
  if (table.isError()) {
    return Error("Failed to get mount table: " + table.error());
  }

  foreach (const fs::MountInfoTable::Entry& entry,
           adaptor::reverse(table.get().entries)) {
    // TODO(tnachen): We assume there is only one docker container
    // running per container Id and no other mounts will have the
    // container Id name. We might need to revisit if this is no
    // longer true.
    if (strings::contains(entry.target, containerId.value())) {
      LOG(INFO) << "Unmounting volume for container '" << containerId << "'";

      Try<Nothing> unmount = fs::unmount(entry.target);
      if (unmount.isError()) {
        return Error("Failed to unmount volume '" + entry.target +
                     "': " + unmount.error());
      }
    }
  }
#endif // __linux__

  return Nothing();
}
{code} We rely on {noformat}if (strings::contains(entry.target, containerId.value())) {noformat} to discover persistent volume mounts. But with some system configurations, if the slave's work_dir is under a bind mount, and the parent of that bind mount is a 'shared' mount, the mount of a persistent volume will be propagated to another mount point. That means there will be two mounts in the mount table that contain the 'containerId'. There are two issues: 1) we should modify unmountPersistentVolumes to be more robust. 
One simple fix is to check whether 'entry.target' is under the slave's work_dir or not. 2) Ideally, we should do the same as we did in LinuxFilesystemIsolator and make the slave's work_dir a slave+shared mount. I'll add a TODO. > DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes exits > when the /tmp directory is bind-mounted > -- > > Key: MESOS-4832 > URL: https://issues.apache.org/jira/browse/MESOS-4832 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.27.0 > Environment: Seen on CentOS 7 & Debian 8. >Reporter: Joseph Wu >Assignee: Jie Yu > Labels: mesosphere, test > Fix For: 0.28.0 > > > If the {{/tmp}} directory (where Mesos tests create temporary directories) is > a bind mount, the test suite will exit here: > {code} > [ RUN ] > DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes > I0226 03:17:26.722806 1097 leveldb.cpp:174] Opened db in 12.587676ms > I0226 03:17:26.723496 1097 leveldb.cpp:181] Compacted db in 636999ns > I0226 03:17:26.723536 1097 leveldb.cpp:196] Created db iterator in 18271ns > I0226 03:17:26.723547 1097 leveldb.cpp:202] Seeked to beginning of db in > 1555ns > I0226 03:17:26.723554 1097 leveldb.cpp:271] Iterated through 0 keys in the > db in 363ns > I0226 03:17:26.723593 1097 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0226 03:17:26.724128 1117 recover.cpp:447] Starting replica recovery > I0226 03:17:26.724367 1117 recover.cpp:473] Replica is in EMPTY status > I0226 03:17:26.725237 1117 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (13810)@172.30.2.151:51934 > I0226 03:17:26.725744 1114 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0226 03:17:26.726356 master.cpp:376] Master > 5cc57c0e-f1ad-4107-893f-420ed1a1db1a (ip-172-30-2-151.mesosphere.io) started > on 172.30.2.151:51934 > I0226 03:17:26.726369 1118 recover.cpp:564] Updating replica status to > 
STARTING > I0226 03:17:26.726378 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/djHTVQ/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO"
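Fix (1) suggested in the comment above can be sketched as follows -- a hypothetical helper, not the actual patch: only treat a mount table entry as a persistent-volume mount when its target lives under the slave's work_dir, so a copy of the mount propagated to a sibling shared mount is skipped and the volume is not unmounted twice.

```cpp
#include <string>

// Sketch of fix (1) from the comment above: require that the mount
// target is under the slave's work_dir in addition to containing the
// container id, so a mount propagated elsewhere is not unmounted twice.
bool isPersistentVolumeMount(const std::string& target,
                             const std::string& workDir,
                             const std::string& containerId)
{
  return target.compare(0, workDir.size(), workDir) == 0 &&
         target.find(containerId) != std::string::npos;
}
```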
[jira] [Updated] (MESOS-4740) Improve metrics/snapshot performance
[ https://issues.apache.org/jira/browse/MESOS-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cong Wang updated MESOS-4740: - Description: David Robinson noticed retrieving metrics/snapshot statistics could be very inefficient and cause the Mesos master to get stuck. {noformat} [root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot real 2m7.302s user 0m0.001s sys 0m0.004s {noformat} MESOS-1287 introduces a timeout parameter for this query, but observers like ours are not aware of such a URL-specific parameter, so we need to: 1) Always have a timeout, with some sensible default value. 2) Investigate why metrics/snapshot can take such a long time to complete under load, since we don't use history for these statistics and the values are just atomic reads. was: David Robinson noticed retrieving metrics/snapshot statistics could be very inefficient and cause the Mesos master to get stuck. {noformat} [root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot real 2m7.302s user 0m0.001s sys 0m0.004s {noformat} From a quick glance at the code, this *seems* to be because we sort all the values saved in the time series when calculating percentiles. {noformat} foreach (const typename TimeSeries<T>::Value& value, values_) { values.push_back(value.data); } std::sort(values.begin(), values.end()); {noformat} > Improve metrics/snapshot performance > --- > > Key: MESOS-4740 > URL: https://issues.apache.org/jira/browse/MESOS-4740 > Project: Mesos > Issue Type: Task >Reporter: Cong Wang >Assignee: Cong Wang > > David Robinson noticed retrieving metrics/snapshot statistics could be very > inefficient and cause the Mesos master to get stuck. 
> {noformat} > [root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot > real 2m7.302s > user 0m0.001s > sys 0m0.004s > {noformat} > MESOS-1287 introduces a timeout parameter for this query, but observers > like ours are not aware of such a URL-specific parameter, so we need to: > 1) Always have a timeout, with some sensible default value. > 2) Investigate why metrics/snapshot can take such a long time to complete > under load, since we don't use history for these statistics and the values > are just atomic reads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
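The snippet quoted above sorts the full time series on every snapshot just to read a few percentiles, which is O(n log n). One cheaper direction (a sketch of the idea, not the actual Mesos change) is `std::nth_element`, which places only the k-th smallest value at its position in O(n):

```cpp
#include <algorithm>
#include <vector>

// Sketch of a cheaper percentile computation than sorting the whole
// time series: std::nth_element places the k-th smallest value at
// index k in O(n), leaving the rest of the vector partially ordered.
double percentile(std::vector<double> values, double p)
{
  size_t k = static_cast<size_t>(p * (values.size() - 1));
  std::nth_element(values.begin(), values.begin() + k, values.end());
  return values[k];
}
```

Taken by value on purpose: the caller's series is left untouched while the copy is reordered.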
[jira] [Updated] (MESOS-4832) DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes exits when the /tmp directory is bind-mounted
[ https://issues.apache.org/jira/browse/MESOS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4832: -- Fix Version/s: 0.28.0 > DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes exits > when the /tmp directory is bind-mounted > -- > > Key: MESOS-4832 > URL: https://issues.apache.org/jira/browse/MESOS-4832 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.27.0 > Environment: Seen on CentOS 7 & Debian 8. >Reporter: Joseph Wu >Assignee: Jie Yu > Labels: mesosphere, test > Fix For: 0.28.0 > > > If the {{/tmp}} directory (where Mesos tests create temporary directories) is > a bind mount, the test suite will exit here: > {code} > [ RUN ] > DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes > I0226 03:17:26.722806 1097 leveldb.cpp:174] Opened db in 12.587676ms > I0226 03:17:26.723496 1097 leveldb.cpp:181] Compacted db in 636999ns > I0226 03:17:26.723536 1097 leveldb.cpp:196] Created db iterator in 18271ns > I0226 03:17:26.723547 1097 leveldb.cpp:202] Seeked to beginning of db in > 1555ns > I0226 03:17:26.723554 1097 leveldb.cpp:271] Iterated through 0 keys in the > db in 363ns > I0226 03:17:26.723593 1097 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0226 03:17:26.724128 1117 recover.cpp:447] Starting replica recovery > I0226 03:17:26.724367 1117 recover.cpp:473] Replica is in EMPTY status > I0226 03:17:26.725237 1117 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (13810)@172.30.2.151:51934 > I0226 03:17:26.725744 1114 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0226 03:17:26.726356 master.cpp:376] Master > 5cc57c0e-f1ad-4107-893f-420ed1a1db1a (ip-172-30-2-151.mesosphere.io) started > on 172.30.2.151:51934 > I0226 03:17:26.726369 1118 recover.cpp:564] Updating replica status to > STARTING > I0226 03:17:26.726378 
master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/djHTVQ/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/djHTVQ/master" > --zk_session_timeout="10secs" > I0226 03:17:26.726605 master.cpp:423] Master only allowing > authenticated frameworks to register > I0226 03:17:26.726616 master.cpp:428] Master only allowing > authenticated slaves to register > I0226 03:17:26.726632 credentials.hpp:35] Loading credentials for > authentication from '/tmp/djHTVQ/credentials' > I0226 03:17:26.726860 master.cpp:468] Using default 'crammd5' > authenticator > I0226 03:17:26.726977 master.cpp:537] Using default 'basic' HTTP > authenticator > I0226 03:17:26.727092 master.cpp:571] Authorization enabled > I0226 03:17:26.727243 1118 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0226 03:17:26.727285 1116 whitelist_watcher.cpp:77] No whitelist given > I0226 03:17:26.728852 1114 master.cpp:1712] The newly elected leader is > master@172.30.2.151:51934 with id 5cc57c0e-f1ad-4107-893f-420ed1a1db1a > I0226 03:17:26.728876 1114 master.cpp:1725] Elected as the leading master! 
> I0226 03:17:26.728891 1114 master.cpp:1470] Recovering from registrar > I0226 03:17:26.728977 1117 registrar.cpp:307] Recovering registrar > I0226 03:17:26.731503 1112 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 4.977811ms > I0226 03:17:26.731539 1112 replica.cpp:320] Persisted replica status to > STARTING > I0226 03:17:26.731711 recover.cpp:473] Replica is in STARTING status > I0226 03:17:26.732501 1114 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (13812)@172.30.2.151:51934 > I0226 03:17:26.732862 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0226
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174464#comment-15174464 ] Erik Weathers commented on MESOS-4735: -- [~dma1982]: kind of correct. I'm saying that people who use file-downloading tools (e.g., curl, wget, every web browser) have the option of choosing the resulting filename of the download, e.g.: * {{curl -o bar-executor-binary.tgz http://somewebserver/bar-executor-binary.tgz.foobarbazblahblahblah}} * {{wget -O bar-executor-binary.tgz http://somewebserver/bar-executor-binary.tgz.foobarbazblahblahblah}} > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extraction of such files, > since the extraction logic simply looks at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue were > fixed, then I could work around the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4833) Poor allocator performance with labeled resources and/or persistent volumes
[ https://issues.apache.org/jira/browse/MESOS-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4833: Priority: Blocker (was: Critical) > Poor allocator performance with labeled resources and/or persistent volumes > --- > > Key: MESOS-4833 > URL: https://issues.apache.org/jira/browse/MESOS-4833 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway >Assignee: Neil Conway >Priority: Blocker > Labels: mesosphere, resources > Fix For: 0.28.0 > > > Modifying the {{HierarchicalAllocator_BENCHMARK_Test.ResourceLabels}} > benchmark from https://reviews.apache.org/r/43686/ to use distinct labels > between different slaves, performance regresses from ~2 seconds to ~3 > minutes. The culprit seems to be the way in which the allocator merges > together resources; reserved resource labels (or persistent volume IDs) > inhibit merging, which causes performance to be much worse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4833) Poor allocator performance with labeled resources and/or persistent volumes
[ https://issues.apache.org/jira/browse/MESOS-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-4833: --- Shepherd: Joris Van Remoortere > Poor allocator performance with labeled resources and/or persistent volumes > --- > > Key: MESOS-4833 > URL: https://issues.apache.org/jira/browse/MESOS-4833 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway >Assignee: Neil Conway >Priority: Critical > Labels: mesosphere, resources > Fix For: 0.28.0 > > > Modifying the {{HierarchicalAllocator_BENCHMARK_Test.ResourceLabels}} > benchmark from https://reviews.apache.org/r/43686/ to use distinct labels > between different slaves, performance regresses from ~2 seconds to ~3 > minutes. The culprit seems to be the way in which the allocator merges > together resources; reserved resource labels (or persistent volume IDs) > inhibit merging, which causes performance to be much worse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4833) Poor allocator performance with labeled resources and/or persistent volumes
Neil Conway created MESOS-4833: -- Summary: Poor allocator performance with labeled resources and/or persistent volumes Key: MESOS-4833 URL: https://issues.apache.org/jira/browse/MESOS-4833 Project: Mesos Issue Type: Bug Components: allocation Reporter: Neil Conway Priority: Critical Fix For: 0.28.0 Modifying the {{HierarchicalAllocator_BENCHMARK_Test.ResourceLabels}} benchmark from https://reviews.apache.org/r/43686/ to use distinct labels between different slaves, performance regresses from ~2 seconds to ~3 minutes. The culprit seems to be the way in which the allocator merges together resources; reserved resource labels (or persistent volume IDs) inhibit merging, which causes performance to be much worse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4833) Poor allocator performance with labeled resources and/or persistent volumes
[ https://issues.apache.org/jira/browse/MESOS-4833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway reassigned MESOS-4833: -- Assignee: Neil Conway > Poor allocator performance with labeled resources and/or persistent volumes > --- > > Key: MESOS-4833 > URL: https://issues.apache.org/jira/browse/MESOS-4833 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway >Assignee: Neil Conway >Priority: Critical > Labels: mesosphere, resources > Fix For: 0.28.0 > > > Modifying the {{HierarchicalAllocator_BENCHMARK_Test.ResourceLabels}} > benchmark from https://reviews.apache.org/r/43686/ to use distinct labels > between different slaves, performance regresses from ~2 seconds to ~3 > minutes. The culprit seems to be the way in which the allocator merges > together resources; reserved resource labels (or persistent volume IDs) > inhibit merging, which causes performance to be much worse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
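The merge behavior behind MESOS-4833 can be illustrated with a simplified addability check (hypothetical types, not the actual `mesos::Resources` implementation): two resource objects only combine when all their metadata, including reservation labels, matches exactly, so per-slave unique labels leave one un-mergeable object per slave for the allocator to iterate over.

```cpp
#include <map>
#include <string>

// Simplified illustration of why distinct reservation labels inhibit
// merging: resources combine only when their metadata matches exactly.
struct Resource
{
  std::string name;
  double scalar;
  std::map<std::string, std::string> labels;  // reservation labels
};

bool addable(const Resource& a, const Resource& b)
{
  // With per-slave unique labels this is false across slaves, so the
  // allocator carries (and repeatedly walks) one object per slave
  // instead of a single merged scalar.
  return a.name == b.name && a.labels == b.labels;
}
```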
[jira] [Comment Edited] (MESOS-4780) Remove `user` and `rootfs` flags in Windows launcher.
[ https://issues.apache.org/jira/browse/MESOS-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174440#comment-15174440 ] Joris Van Remoortere edited comment on MESOS-4780 at 3/1/16 9:31 PM: - https://reviews.apache.org/r/43904/ https://reviews.apache.org/r/43905/ was (Author: jvanremoortere): https://reviews.apache.org/r/43904/ > Remove `user` and `rootfs` flags in Windows launcher. > - > > Key: MESOS-4780 > URL: https://issues.apache.org/jira/browse/MESOS-4780 > Project: Mesos > Issue Type: Task >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows-mvp > Fix For: 0.28.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4832) DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes exits when the /tmp directory is bind-mounted
Joseph Wu created MESOS-4832: Summary: DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes exits when the /tmp directory is bind-mounted Key: MESOS-4832 URL: https://issues.apache.org/jira/browse/MESOS-4832 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 0.27.0 Environment: Seen on CentOS 7 & Debian 8. Reporter: Joseph Wu Assignee: Jie Yu If the {{/tmp}} directory (where Mesos tests create temporary directories) is a bind mount, the test suite will exit here: {code} [ RUN ] DockerContainerizerTest.ROOT_DOCKER_RecoverOrphanedPersistentVolumes I0226 03:17:26.722806 1097 leveldb.cpp:174] Opened db in 12.587676ms I0226 03:17:26.723496 1097 leveldb.cpp:181] Compacted db in 636999ns I0226 03:17:26.723536 1097 leveldb.cpp:196] Created db iterator in 18271ns I0226 03:17:26.723547 1097 leveldb.cpp:202] Seeked to beginning of db in 1555ns I0226 03:17:26.723554 1097 leveldb.cpp:271] Iterated through 0 keys in the db in 363ns I0226 03:17:26.723593 1097 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0226 03:17:26.724128 1117 recover.cpp:447] Starting replica recovery I0226 03:17:26.724367 1117 recover.cpp:473] Replica is in EMPTY status I0226 03:17:26.725237 1117 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (13810)@172.30.2.151:51934 I0226 03:17:26.725744 1114 recover.cpp:193] Received a recover response from a replica in EMPTY status I0226 03:17:26.726356 master.cpp:376] Master 5cc57c0e-f1ad-4107-893f-420ed1a1db1a (ip-172-30-2-151.mesosphere.io) started on 172.30.2.151:51934 I0226 03:17:26.726369 1118 recover.cpp:564] Updating replica status to STARTING I0226 03:17:26.726378 master.cpp:378] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/djHTVQ/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/djHTVQ/master" --zk_session_timeout="10secs" I0226 03:17:26.726605 master.cpp:423] Master only allowing authenticated frameworks to register I0226 03:17:26.726616 master.cpp:428] Master only allowing authenticated slaves to register I0226 03:17:26.726632 credentials.hpp:35] Loading credentials for authentication from '/tmp/djHTVQ/credentials' I0226 03:17:26.726860 master.cpp:468] Using default 'crammd5' authenticator I0226 03:17:26.726977 master.cpp:537] Using default 'basic' HTTP authenticator I0226 03:17:26.727092 master.cpp:571] Authorization enabled I0226 03:17:26.727243 1118 hierarchical.cpp:144] Initialized hierarchical allocator process I0226 03:17:26.727285 1116 whitelist_watcher.cpp:77] No whitelist given I0226 03:17:26.728852 1114 master.cpp:1712] The newly elected leader is master@172.30.2.151:51934 with id 5cc57c0e-f1ad-4107-893f-420ed1a1db1a I0226 03:17:26.728876 1114 master.cpp:1725] Elected as the leading master! 
I0226 03:17:26.728891 1114 master.cpp:1470] Recovering from registrar I0226 03:17:26.728977 1117 registrar.cpp:307] Recovering registrar I0226 03:17:26.731503 1112 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 4.977811ms I0226 03:17:26.731539 1112 replica.cpp:320] Persisted replica status to STARTING I0226 03:17:26.731711 recover.cpp:473] Replica is in STARTING status I0226 03:17:26.732501 1114 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (13812)@172.30.2.151:51934 I0226 03:17:26.732862 recover.cpp:193] Received a recover response from a replica in STARTING status I0226 03:17:26.733264 1117 recover.cpp:564] Updating replica status to VOTING I0226 03:17:26.733836 1118 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 388246ns I0226 03:17:26.733855 1118 replica.cpp:320] Persisted replica status to VOTING I0226 03:17:26.733979 1113 recover.cpp:578] Successfully joined the Paxos group I0226 03:17:26.734149 1113 recover.cpp:462] Recover process terminated I0226 03:17:26.734478 log.cpp:659] Attempting to start the
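The failing setup could in principle be detected before the suite starts. This is a Linux-specific sketch (assumed, not part of the actual test code) that consults the kernel mount table, since bind mounts on the same filesystem can fool the classic device-number heuristic used by `os.path.ismount()`:

```python
def is_mount_point(path):
    # In /proc/self/mountinfo, field 5 (index 4) is the mount point.
    # Reading the kernel's table directly catches bind mounts that the
    # st_dev heuristic misses.
    with open("/proc/self/mountinfo") as f:
        return any(line.split()[4] == path for line in f)

# If /tmp (where the tests create their work directories) is itself a
# (bind) mount, a guard like this could skip the test instead of letting
# the whole suite exit.
tmp_mounted = is_mount_point("/tmp")
root_mounted = is_mount_point("/")
```

`/` always appears as a mount point; `/tmp` only when it is mounted separately, as in the failing environment.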
[jira] [Updated] (MESOS-4824) "filesystem/linux" isolator does not unmount orphaned persistent volumes
[ https://issues.apache.org/jira/browse/MESOS-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-4824: - Sprint: Mesosphere Sprint 30 Priority: Blocker (was: Major) Fix Version/s: 0.28.0 > "filesystem/linux" isolator does not unmount orphaned persistent volumes > > > Key: MESOS-4824 > URL: https://issues.apache.org/jira/browse/MESOS-4824 > Project: Mesos > Issue Type: Bug > Components: isolation >Affects Versions: 0.24.0, 0.25.0, 0.26.0, 0.27.0 >Reporter: Joseph Wu >Assignee: Joseph Wu >Priority: Blocker > Labels: containerizer, mesosphere, persistent-volumes > Fix For: 0.28.0 > > > A persistent volume can be orphaned when: > # A framework registers with checkpointing enabled. > # The framework starts a task + a persistent volume. > # The agent exits. The task continues to run. > # Something wipes the agent's {{meta}} directory. This removes the > checkpointed framework info from the agent. > # The agent comes back and recovers. The framework for the task is not > found, so the task is considered orphaned now. > The agent currently does not unmount the persistent volume, saying (with > {{GLOG_v=1}}) > {code} > I0229 23:55:42.078940 5635 linux.cpp:711] Ignoring cleanup request for > unknown container: a35189d3-85d5-4d02-b568-67f675b6dc97 > {code} > Test implemented here: https://reviews.apache.org/r/44122/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
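The fix for MESOS-4824 can be modeled roughly as follows (illustrative Python only; the real change lives in the "filesystem/linux" isolator's cleanup path): cleanup must unmount a container's persistent-volume mounts even when the container is an unknown orphan, instead of ignoring the request.

```python
def cleanup_mounts(mounts, container_id, known_containers, unmount_orphans):
    # mounts: dict of container_id -> list of persistent-volume mount points.
    # With unmount_orphans=False (the current behavior) a cleanup request
    # for an unknown container is ignored and its mounts leak; with True
    # (the proposed fix) they are removed regardless.
    if container_id not in known_containers and not unmount_orphans:
        # "Ignoring cleanup request for unknown container: ..."
        return mounts
    remaining = dict(mounts)
    remaining.pop(container_id, None)
    return remaining

mounts = {"a35189d3": ["/mnt/vol1"]}
leaked = cleanup_mounts(mounts, "a35189d3", known_containers=set(), unmount_orphans=False)
fixed = cleanup_mounts(mounts, "a35189d3", known_containers=set(), unmount_orphans=True)
```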
[jira] [Updated] (MESOS-4708) Provide a Mesos build for Ubuntu 15.10 Wily
[ https://issues.apache.org/jira/browse/MESOS-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-4708: -- Affects Version/s: (was: 0.27.0) > Provide a Mesos build for Ubuntu 15.10 Wily > --- > > Key: MESOS-4708 > URL: https://issues.apache.org/jira/browse/MESOS-4708 > Project: Mesos > Issue Type: Wish >Reporter: Ludovic Claude > > Hello, > I am running Mesos on Ubuntu. Recently, I was using Ubuntu 15.04 but because > Docker no longer supports this version, I decided to upgrade to 15.10. > Then I realised - too late - that Mesos does not officially support Ubuntu > 15.10. Is there a way out? > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
Anand Mazumdar created MESOS-4831: - Summary: Master sometimes sends two inverse offers after the agent goes into maintenance. Key: MESOS-4831 URL: https://issues.apache.org/jira/browse/MESOS-4831 Project: Mesos Issue Type: Bug Affects Versions: 0.27.0 Reporter: Anand Mazumdar Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}} {code} I0229 11:08:57.027559 668 hierarchical.cpp:1437] No resources available to allocate! I0229 11:08:57.027745 668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns I0229 11:08:57.027757 675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.028586 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) I0229 11:08:57.029039 675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b- (default) {code} The ideal expected workflow for this test is something like: - The framework receives offers from master. - The framework updates its maintenance schedule. - The current offer is rescinded. - A new offer is received from the master with unavailability set. - After the agent goes for maintenance, an inverse offer is sent. For some reason, in the logs we see that the master is sending 2 inverse offers. The test seems to pass as we just check for the initial inverse offer being present. Also, unrelated, we need to clean up this test to not expect multiple offers i.e. remove {{numberOfOffers}} constant. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-4735: -- Affects Version/s: (was: 0.27.0) > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extracting of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could workaround the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
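The failure mode described in MESOS-4735 is easy to see with a short sketch (Python; the `target` parameter models the proposed explicit filename and is hypothetical, not an existing Mesos field):

```python
def fetched_filename(uri, target=None):
    # Today the saved name is effectively the last path segment of the raw
    # URI, so query parameters survive into the filename; an explicit
    # target, as proposed, would override that.
    return target if target is not None else uri.rsplit("/", 1)[-1]

def looks_extractable(filename):
    # The fetcher's extraction logic keys off the file suffix.
    return filename.endswith((".tar.gz", ".tgz", ".zip"))

name = fetched_filename("https://example.com/dl/app.tar.gz?sig=abc123")
# The query string ends up in the name, so extraction is skipped.
fixed = fetched_filename("https://example.com/dl/app.tar.gz?sig=abc123",
                         target="app.tar.gz")
```

This mirrors `wget -O <file>` and `curl -o <file>`, which let the caller pick the output name regardless of the URL.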
[jira] [Updated] (MESOS-4700) Allow agent to configure net_cls handle minor range.
[ https://issues.apache.org/jira/browse/MESOS-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-4700: -- Shepherd: Jie Yu > Allow agent to configure net_cls handle minor range. > > > Key: MESOS-4700 > URL: https://issues.apache.org/jira/browse/MESOS-4700 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Avinash Sridharan > Labels: mesosphere > Fix For: 0.28.0 > > > Bugs exist in some user libraries that prevent certain minor net_cls > handles from being used. It would be great if we could configure the minor > range through agent flags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4830) Bind docker runtime isolator with docker image provider.
[ https://issues.apache.org/jira/browse/MESOS-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-4830: Summary: Bind docker runtime isolator with docker image provider. (was: Bind docker runtime isolator with docker image provider) > Bind docker runtime isolator with docker image provider. > > > Key: MESOS-4830 > URL: https://issues.apache.org/jira/browse/MESOS-4830 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: Gilbert Song > Labels: containerizer, mesosphere > Fix For: 0.28.0 > > > If the image provider is specified as `docker` but the docker/runtime > isolator is not enabled, the container has no executables and cannot run. A > check should be added to make sure the docker runtime isolator is on when > docker is used as the image provider. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4830) Bind docker runtime isolator with docker image provider
Gilbert Song created MESOS-4830: --- Summary: Bind docker runtime isolator with docker image provider Key: MESOS-4830 URL: https://issues.apache.org/jira/browse/MESOS-4830 Project: Mesos Issue Type: Bug Components: containerization Reporter: Gilbert Song Assignee: Gilbert Song Fix For: 0.28.0 If the image provider is specified as `docker` but the docker/runtime isolator is not enabled, the container has no executables and cannot run. A check should be added to make sure the docker runtime isolator is on when docker is used as the image provider. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
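The check MESOS-4830 asks for could look roughly like this (illustrative Python; the names mirror the agent's `--image_providers` and `--isolation` flags, but the function itself is hypothetical):

```python
def validate_isolation_flags(image_providers, isolation):
    # Return an error message when the configuration is inconsistent:
    # using the docker image provider without the docker/runtime isolator
    # drops the image's entrypoint/cmd, leaving nothing to execute.
    if "docker" in image_providers and "docker/runtime" not in isolation:
        return ("'docker' image provider requires the 'docker/runtime' "
                "isolator to be enabled")
    return None

err = validate_isolation_flags({"docker"}, {"filesystem/linux"})
ok = validate_isolation_flags({"docker"}, {"filesystem/linux", "docker/runtime"})
```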
[jira] [Commented] (MESOS-4816) Expose TaskInfo to Isolators
[ https://issues.apache.org/jira/browse/MESOS-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174232#comment-15174232 ] Connor Doyle commented on MESOS-4816: - Thanks for the update James, my experience with this is also pre-0.28. As I understand it, {{prepare()}} gets invoked just once per container (for an executor's first task), so it might not be sufficient given a framework that launches multiple tasks per executor. However, if {{ContainerConfig}} covers most uses then maybe this issue can be dropped? > Expose TaskInfo to Isolators > > > Key: MESOS-4816 > URL: https://issues.apache.org/jira/browse/MESOS-4816 > Project: Mesos > Issue Type: Improvement > Components: modules, slave >Reporter: Connor Doyle > > Authors of custom isolator modules frequently require access to the TaskInfo > in order to read custom metadata in task labels. > Currently, it's possible to link containers to tasks within a module by > implementing both an isolator and the {{slaveRunTaskLabelDecorator}} hook, > and maintaining a shared map of containers to tasks. This way works, but > adds unnecessary complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4829) Remove `grace_period_seconds` field from Shutdown event v1 protobuf.
[ https://issues.apache.org/jira/browse/MESOS-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4829: -- Fix Version/s: 0.28.0 > Remove `grace_period_seconds` field from Shutdown event v1 protobuf. > > > Key: MESOS-4829 > URL: https://issues.apache.org/jira/browse/MESOS-4829 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar > Labels: mesosphere > Fix For: 0.28.0 > > > There are two ways in which a shutdown of executor can be triggered: > 1. If it receives an explicit `Shutdown` message from the agent. > 2. If the recovery timeout period has elapsed, and the executor still hasn’t > been able to (re-)connect with the agent. > Currently, the executor library relies on the field `grace_period_seconds` > having a default value of 5 seconds to handle the second scenario. > https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 > The driver used to trigger the grace period via a constant defined in > src/slave/constants.cpp. > https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 > The agent may want to force a shorter shutdown grace period (e.g. > oversubscription eviction may have shorter deadline) in the future. For now, > we can just read the value via an environment variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
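Reading the grace period from an environment variable, as MESOS-4829 suggests, might look like this (a minimal Python model; the variable name is an assumption for illustration, not an established Mesos interface):

```python
DEFAULT_SHUTDOWN_GRACE = 5.0  # seconds; mirrors the current hard-coded default

def shutdown_grace_period(env):
    # The agent could export the desired grace period; the executor
    # library falls back to the default when the variable is absent
    # or malformed.
    raw = env.get("MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD_SECONDS")
    try:
        return float(raw)
    except (TypeError, ValueError):
        return DEFAULT_SHUTDOWN_GRACE

default = shutdown_grace_period({})
custom = shutdown_grace_period({"MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD_SECONDS": "2"})
```

This would also let the agent force a shorter period later (e.g. for oversubscription eviction) without a protobuf change.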
[jira] [Assigned] (MESOS-4829) Remove `grace_period_seconds` field from Shutdown event v1 protobuf.
[ https://issues.apache.org/jira/browse/MESOS-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar reassigned MESOS-4829: - Assignee: Anand Mazumdar > Remove `grace_period_seconds` field from Shutdown event v1 protobuf. > > > Key: MESOS-4829 > URL: https://issues.apache.org/jira/browse/MESOS-4829 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > Fix For: 0.28.0 > > > There are two ways in which a shutdown of executor can be triggered: > 1. If it receives an explicit `Shutdown` message from the agent. > 2. If the recovery timeout period has elapsed, and the executor still hasn’t > been able to (re-)connect with the agent. > Currently, the executor library relies on the field `grace_period_seconds` > having a default value of 5 seconds to handle the second scenario. > https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 > The driver used to trigger the grace period via a constant defined in > src/slave/constants.cpp. > https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 > The agent may want to force a shorter shutdown grace period (e.g. > oversubscription eviction may have shorter deadline) in the future. For now, > we can just read the value via an environment variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4829) Remove `grace_period_seconds` field from Shutdown event v1 protobuf.
[ https://issues.apache.org/jira/browse/MESOS-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4829: -- Summary: Remove `grace_period_seconds` field from Shutdown event v1 protobuf. (was: Remove `grace_period_seconds` field from Shutdown event.) > Remove `grace_period_seconds` field from Shutdown event v1 protobuf. > > > Key: MESOS-4829 > URL: https://issues.apache.org/jira/browse/MESOS-4829 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar > Labels: mesosphere > > There are two ways in which a shutdown of executor can be triggered: > 1. If it receives an explicit `Shutdown` message from the agent. > 2. If the recovery timeout period has elapsed, and the executor still hasn’t > been able to (re-)connect with the agent. > Currently, the executor library relies on the field `grace_period_seconds` > having a default value of 5 seconds to handle the second scenario. > https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 > The driver used to trigger the grace period via a constant defined in > src/slave/constants.cpp. > https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 > The agent may want to force a shorter shutdown grace period (e.g. > oversubscription eviction may have shorter deadline) in the future. For now, > we can just read the value via an environment variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4829) Remove `grace_period_seconds` field from Shutdown event.
[ https://issues.apache.org/jira/browse/MESOS-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4829: -- Description: There are two ways in which a shutdown of executor can be triggered: 1. If it receives an explicit `Shutdown` message from the agent. 2. If the recovery timeout period has elapsed, and the executor still hasn’t been able to (re-)connect with the agent. Currently, the executor library relies on the field `grace_period_seconds` having a default value of 5 seconds to handle the second scenario. https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 The driver used to trigger the grace period via a constant defined in src/slave/constants.cpp. https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 The agent may want to force a shorter shutdown grace period (e.g. oversubscription eviction may have shorter deadline) in the future. For now, we can just read the value via an environment variable. was: There are two ways in which a shutdown of executor can be triggered: 1. If it receives an explicit `Shutdown` message from the agent. 2. If the recovery timeout period has elapsed, and the executor still hasn’t been able to (re-)connect with the agent. Currently, the executor library relies on the field `grace_period_seconds` having a default value of 5 seconds to handle the second scenario. https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 The driver used to trigger the grace period via a constant defined in src/slave/constants.cpp. https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 The agent may want to force a shorter shutdown grace period (e.g. oversubscription eviction may have shorter deadline). > Remove `grace_period_seconds` field from Shutdown event. 
> > > Key: MESOS-4829 > URL: https://issues.apache.org/jira/browse/MESOS-4829 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar > Labels: mesosphere > > There are two ways in which a shutdown of executor can be triggered: > 1. If it receives an explicit `Shutdown` message from the agent. > 2. If the recovery timeout period has elapsed, and the executor still hasn’t > been able to (re-)connect with the agent. > Currently, the executor library relies on the field `grace_period_seconds` > having a default value of 5 seconds to handle the second scenario. > https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 > The driver used to trigger the grace period via a constant defined in > src/slave/constants.cpp. > https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 > The agent may want to force a shorter shutdown grace period (e.g. > oversubscription eviction may have shorter deadline) in the future. For now, > we can just read the value via an environment variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4712) Remove 'force' field from the Subscribe Call in v1 Scheduler API
[ https://issues.apache.org/jira/browse/MESOS-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-4712: -- Fix Version/s: 0.28.0 > Remove 'force' field from the Subscribe Call in v1 Scheduler API > > > Key: MESOS-4712 > URL: https://issues.apache.org/jira/browse/MESOS-4712 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Vinod Kone > Fix For: 0.28.0 > > > We/I introduced the `force` field in the SUBSCRIBE call to deal with scheduler > partition cases. Having thought a bit more and discussed with a few other > folks ([~anandmazumdar], [~greggomann]), I think we can get away with not > having that field in the v1 API. The obvious advantage of removing the field > is that framework devs don't have to think about how/when to set the field > (the current semantics are a bit confusing). > The new workflow when a master receives a SUBSCRIBE call is that the master > always accepts this call and closes any existing connection (after sending > an ERROR event) from the same scheduler (identified by framework id). > The expectation from schedulers is that they must close the old subscribe > connection before resending a new SUBSCRIBE call. > Let's look at some tricky scenarios and see how this works and why it is safe. > 1) Connection disconnection @ the scheduler but not @ the master > > Scheduler sees the disconnection and sends a new SUBSCRIBE call. Master sends > ERROR on the old connection (won't be received by the scheduler because the > connection is already closed) and closes it. > 2) Connection disconnection @ master but not @ scheduler > Scheduler realizes this from lack of HEARTBEAT events. It then closes its > existing connection and sends a new SUBSCRIBE call. Master accepts the new > SUBSCRIBE call. There is no old connection to close on the master as it is > already closed. > 3) Scheduler failover but no disconnection @ master > Newly elected scheduler sends a SUBSCRIBE call.
Master sends ERROR event and > closes the old connection (won't be received because the old scheduler failed > over). > 4) If Scheduler A got partitioned (but is alive and connected with master) > and Scheduler B got elected as new leader. > When Scheduler B sends SUBSCRIBE, master sends ERROR and closes the > connection from Scheduler A. Master accepts Scheduler B's connection. > Typically Scheduler A aborts after receiving ERROR and gets restarted. After > restart it won't become the leader because Scheduler B is already elected. > 5) Scheduler sends SUBSCRIBE, times out, closes the SUBSCRIBE connection (A) > and sends a new SUBSCRIBE (B). Master receives SUBSCRIBE (B) and then > receives SUBSCRIBE (A) but doesn't see A's disconnection yet. > Master first accepts SUBSCRIBE (B). After it receives SUBSCRIBE (A), it sends > ERROR to SUBSCRIBE (B) and closes that connection. When it accepts SUBSCRIBE > (A) and tries to send the SUBSCRIBED event the connection closure is detected. > Scheduler retries the SUBSCRIBE connection after a backoff. I think this is a > rare enough race for it not to happen continuously in a loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
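The scenarios above can be condensed into a small state model (illustrative Python, not the master's actual implementation): on every SUBSCRIBE the master accepts the new connection, and sends ERROR on, then closes, any existing connection for the same framework id.

```python
class Master:
    # Minimal model of the proposed SUBSCRIBE handling without 'force'.
    def __init__(self):
        self.connections = {}  # framework_id -> connection
        self.log = []

    def subscribe(self, framework_id, conn):
        old = self.connections.get(framework_id)
        if old is not None and old is not conn:
            # The ERROR may never be delivered if the old connection is
            # already dead on the scheduler side; that is fine.
            self.log.append(("ERROR", old))
            self.log.append(("CLOSE", old))
        self.connections[framework_id] = conn
        self.log.append(("SUBSCRIBED", conn))

m = Master()
m.subscribe("fw-1", "connA")  # initial subscription
m.subscribe("fw-1", "connB")  # failover: connA gets ERROR + close, connB wins
```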
[jira] [Created] (MESOS-4829) Remove `grace_period_seconds` field from Shutdown event.
Anand Mazumdar created MESOS-4829: - Summary: Remove `grace_period_seconds` field from Shutdown event. Key: MESOS-4829 URL: https://issues.apache.org/jira/browse/MESOS-4829 Project: Mesos Issue Type: Task Reporter: Anand Mazumdar There are two ways in which a shutdown of executor can be triggered: 1. If it receives an explicit `Shutdown` message from the agent. 2. If the recovery timeout period has elapsed, and the executor still hasn’t been able to (re-)connect with the agent. Currently, the executor library relies on the field `grace_period_seconds` having a default value of 5 seconds to handle the second scenario. https://github.com/apache/mesos/blob/master/src/executor/executor.cpp#L608 The driver used to trigger the grace period via a constant defined in src/slave/constants.cpp. https://github.com/apache/mesos/blob/master/src/exec/exec.cpp#L92 The agent may want to force a shorter shutdown grace period (e.g. oversubscription eviction may have shorter deadline). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3525) Figure out how to enforce 64-bit builds on Windows.
[ https://issues.apache.org/jira/browse/MESOS-3525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174156#comment-15174156 ] Joris Van Remoortere commented on MESOS-3525: - https://reviews.apache.org/r/43692/ https://reviews.apache.org/r/43693/ https://reviews.apache.org/r/43694/ https://reviews.apache.org/r/43695/ https://reviews.apache.org/r/43689/ > Figure out how to enforce 64-bit builds on Windows. > --- > > Key: MESOS-3525 > URL: https://issues.apache.org/jira/browse/MESOS-3525 > Project: Mesos > Issue Type: Task > Components: build >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: build, cmake, mesosphere > Fix For: 0.28.0 > > > We need to make sure people don't try to compile Mesos on 32-bit > architectures. We don't want a Windows repeat of something like this: > https://issues.apache.org/jira/browse/MESOS-267 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4053: - Environment: CentOS 6.6 (was: CentOS 6.6, Ubuntu 14.04) > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann >Assignee: Benjamin Hindman > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems that mounted cgroups are not properly cleaned up after previous tests, > so multiple hierarchies are detected and thus an error is produced: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. > - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. 
> - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174054#comment-15174054 ] Greg Mann edited comment on MESOS-4053 at 3/1/16 6:30 PM: -- I also produced this error with 0.25.1-rc1 on Ubuntu 14.04 using gcc, with libevent and SSL enabled. Tests were run as root. However, rebooting and running {{sudo make check}} with the current master yields no test failures at all, so this doesn't seem to currently be an issue on Ubuntu 14.04. was (Author: greggomann): I also produced this error with 0.25.1-rc1 on Ubuntu 14.04 using gcc, with libevent and SSL enabled. Tests were run as root. > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6, Ubuntu 14.04 >Reporter: Greg Mann >Assignee: Benjamin Hindman > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems that mounted cgroups are not properly cleaned up after previous tests, > so multiple hierarchies are detected and thus an error is produced: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. 
> - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. > - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4816) Expose TaskInfo to Isolators
[ https://issues.apache.org/jira/browse/MESOS-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174136#comment-15174136 ] James Peach commented on MESOS-4816: The isolator that I have that consumes {{TaskInfo}} labels was written for Mesos 0.27. Since 0.28, {{prepare()}} gets a {{ContainerConfig}} which looks like it should have the {{TaskInfo}}. > Expose TaskInfo to Isolators > > > Key: MESOS-4816 > URL: https://issues.apache.org/jira/browse/MESOS-4816 > Project: Mesos > Issue Type: Improvement > Components: modules, slave >Reporter: Connor Doyle > > Authors of custom isolator modules frequently require access to the TaskInfo > in order to read custom metadata in task labels. > Currently, it's possible to link containers to tasks within a module by > implementing both an isolator and the {{slaveRunTaskLabelDecorator}} hook, > and maintaining a shared map of containers to tasks. This way works, but > adds unnecessary complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4828) XFS disk quota isolator
James Peach created MESOS-4828: -- Summary: XFS disk quota isolator Key: MESOS-4828 URL: https://issues.apache.org/jira/browse/MESOS-4828 Project: Mesos Issue Type: Improvement Components: isolation Reporter: James Peach Assignee: James Peach Implement a disk resource isolator using XFS project quotas. Compared to the {{posix/disk}} isolator, this doesn't need to scan the filesystem periodically, and applications receive an {{ENOSPC}} error instead of being summarily killed. This initial implementation only isolates sandbox directory resources, since isolation doesn't have any visibility into the lifecycle of volumes, which is needed to assign and track project IDs. The build dependencies for this are the XFS headers (from xfsprogs-devel) and libblkid. We need libblkid or the equivalent to map filesystem paths to block devices in order to apply quota. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4816) Expose TaskInfo to Isolators
[ https://issues.apache.org/jira/browse/MESOS-4816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174099#comment-15174099 ] Connor Doyle commented on MESOS-4816: - Hi [~gyliu], I agree the optional task info argument in the comment is awkward, but a list sounds pretty good. Existing isolators could continue to look only at the aggregated resources. This question came up during the isolation WG meeting last week. I and others have used this workaround while prototyping isolators for networking, but in general people tend to pass information to isolators via task labels before concepts become first-class in ContainerInfo or elsewhere. [~jamespeach] and [~idownes] may be able to fill in more details. > Expose TaskInfo to Isolators > > > Key: MESOS-4816 > URL: https://issues.apache.org/jira/browse/MESOS-4816 > Project: Mesos > Issue Type: Improvement > Components: modules, slave >Reporter: Connor Doyle > > Authors of custom isolator modules frequently require access to the TaskInfo > in order to read custom metadata in task labels. > Currently, it's possible to link containers to tasks within a module by > implementing both an isolator and the {{slaveRunTaskLabelDecorator}} hook, > and maintaining a shared map of containers to tasks. This way works, but > adds unnecessary complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
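The workaround described in the ticket (pairing a label-decorator hook with an isolator via a shared map of containers to tasks) can be sketched roughly as follows. All names here, such as {{SharedTaskState}} and {{recordTask}}, are invented for illustration and are not the actual Mesos module API:

```cpp
#include <map>
#include <string>

// Hypothetical sketch of the described workaround: a hook records each
// task's labels keyed by container ID, and the isolator looks them up
// later. Names are invented; this is not the actual Mesos module API.
using Labels = std::map<std::string, std::string>;

class SharedTaskState {
public:
  // Called from the slaveRunTaskLabelDecorator-style hook at launch.
  void recordTask(const std::string& containerId, const Labels& labels) {
    labels_[containerId] = labels;
  }

  // Called from the isolator for the same container; empty if unknown.
  Labels lookup(const std::string& containerId) const {
    auto it = labels_.find(containerId);
    return it == labels_.end() ? Labels{} : it->second;
  }

private:
  std::map<std::string, Labels> labels_;
};
```

The complexity the ticket objects to is visible even in this sketch: two separate module entry points must agree on a keying scheme and share mutable state, just to recover information the agent already has.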
[jira] [Commented] (MESOS-1796) Support multiple working paths
[ https://issues.apache.org/jira/browse/MESOS-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174095#comment-15174095 ] James Peach commented on MESOS-1796: This sounds like a duplicate or a subset of MESOS-1650. > Support multiple working paths > -- > > Key: MESOS-1796 > URL: https://issues.apache.org/jira/browse/MESOS-1796 > Project: Mesos > Issue Type: Wish > Components: slave >Reporter: Charles Allen >Priority: Minor > > As a framework developer, I would like the ability to have multiple working > paths as part of a slave reporting its resources. > Currently, if a slave (like an ec2 instance) has multiple disks, the disks > must be combined in a MD array or similar in order to be fully utilized in > Mesos. This ask is to allow multiple disks to be mounted on multiple paths, > and have the slave be able to support and report availability on these > various working paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4718) Add allocator metric for number of completed allocation runs
[ https://issues.apache.org/jira/browse/MESOS-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4718: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add allocator metric for number of completed allocation runs > > > Key: MESOS-4718 > URL: https://issues.apache.org/jira/browse/MESOS-4718 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4723) Add allocator metric for currently satisfied quotas
[ https://issues.apache.org/jira/browse/MESOS-4723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4723: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add allocator metric for currently satisfied quotas > --- > > Key: MESOS-4723 > URL: https://issues.apache.org/jira/browse/MESOS-4723 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4576) Introduce a stout helper for "which"
[ https://issues.apache.org/jira/browse/MESOS-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4576: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Introduce a stout helper for "which" > > > Key: MESOS-4576 > URL: https://issues.apache.org/jira/browse/MESOS-4576 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Joseph Wu >Assignee: Disha Singh > Labels: mesosphere > > We may want to add a helper to {{stout/os.hpp}} that will natively emulate > the functionality of the Linux utility {{which}}. i.e. > {code} > Option<string> which(const string& command) > { > Option<string> path = os::getenv("PATH"); > // Loop through path and return the first one for which os::exists(...). > return None(); > } > {code} > This helper may be useful: > * for test filters in {{src/tests/environment.cpp}} > * a few tests in {{src/tests/containerizer/port_mapping_tests.cpp}} > * the {{sha512}} utility in {{src/common/command_utils.cpp}} > * as runtime checks in the {{LogrotateContainerLogger}} > * etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
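A self-contained version of the sketch in the ticket, using standard-library equivalents ({{std::optional}} for stout's {{Option}}, POSIX {{access()}} for {{os::exists}}); this illustrates the proposed semantics and is not the actual stout implementation:

```cpp
#include <cstdlib>
#include <optional>
#include <sstream>
#include <string>
#include <unistd.h>

// Walk the colon-separated PATH list and return the first directory
// containing an executable entry for `command`.
std::optional<std::string> which(const std::string& command)
{
  const char* path = std::getenv("PATH");
  if (path == nullptr) {
    return std::nullopt;
  }

  std::stringstream directories(path);
  std::string directory;
  while (std::getline(directories, directory, ':')) {
    const std::string candidate = directory + "/" + command;
    if (access(candidate.c_str(), X_OK) == 0) {
      return candidate;
    }
  }

  return std::nullopt;
}
```

Checking {{X_OK}} rather than mere existence matches what {{which(1)}} does: a non-executable file on PATH is skipped rather than returned.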
[jira] [Updated] (MESOS-4720) Add allocator metric for current allocation breakdown
[ https://issues.apache.org/jira/browse/MESOS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4720: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add allocator metric for current allocation breakdown > - > > Key: MESOS-4720 > URL: https://issues.apache.org/jira/browse/MESOS-4720 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > > We likely want to expose allocated/available/total. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1571) Signal escalation timeout is not configurable.
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-1571: - Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Sprint 29) > Signal escalation timeout is not configurable. > -- > > Key: MESOS-1571 > URL: https://issues.apache.org/jira/browse/MESOS-1571 > Project: Mesos > Issue Type: Bug >Reporter: Niklas Quarfot Nielsen >Assignee: Alexander Rukletsov > Labels: mesosphere > > Even though the executor shutdown grace period is set to a larger interval, > the signal escalation timeout will still be 3 seconds. It should either be > configurable or dependent on EXECUTOR_SHUTDOWN_GRACE_PERIOD. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
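The behavior under discussion, sending SIGTERM and escalating to SIGKILL after a (currently hard-coded) timeout, can be sketched as follows. This is a hypothetical, self-contained illustration of the pattern with the timeout as a parameter; {{shutdownWithEscalation}} is an invented name, not the actual executor code:

```cpp
#include <csignal>
#include <ctime>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Send SIGTERM; if the process has not exited within gracePeriodSecs,
// escalate to SIGKILL. Returns true if the process exited (and was
// reaped) within the grace period; on escalation the caller reaps it.
bool shutdownWithEscalation(pid_t pid, int gracePeriodSecs)
{
  kill(pid, SIGTERM);

  time_t deadline = time(nullptr) + gracePeriodSecs;
  while (time(nullptr) < deadline) {
    int status;
    if (waitpid(pid, &status, WNOHANG) == pid) {
      return true;  // Exited within the grace period.
    }
    usleep(100 * 1000);  // Poll every 100ms.
  }

  kill(pid, SIGKILL);  // Grace period expired: escalate.
  return false;
}
```

Making the grace period an argument, as above, is exactly what the ticket asks for: deriving it from (or configuring it alongside) EXECUTOR_SHUTDOWN_GRACE_PERIOD instead of hard-coding 3 seconds.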
[jira] [Updated] (MESOS-4683) Document docker runtime isolator.
[ https://issues.apache.org/jira/browse/MESOS-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4683: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Document docker runtime isolator. > - > > Key: MESOS-4683 > URL: https://issues.apache.org/jira/browse/MESOS-4683 > Project: Mesos > Issue Type: Bug > Components: documentation >Reporter: Gilbert Song >Assignee: Gilbert Song > Labels: containerizer, documentation > > Should include the following information: > *What features are currently supported in the docker runtime isolator. > *How to use the docker runtime isolator (user manual). > *Compare the different semantics vs. the docker containerizer, and explain why. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog
[ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4233: - Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29) > Logging is too verbose for sysadmins / syslog > - > > Key: MESOS-4233 > URL: https://issues.apache.org/jira/browse/MESOS-4233 > Project: Mesos > Issue Type: Epic >Reporter: Cody Maloney >Assignee: Kapil Arya > Labels: mesosphere > Attachments: giant_port_range_logging > > > Currently mesos logs a lot. When launching a thousand tasks in the space of > 10 seconds it will print tens of thousands of log lines, overwhelming syslog > (there is a max rate at which a process can send stuff over a unix socket) > and not giving useful information to a sysadmin who cares about just the > high-level activity and when something goes wrong. > Note mesos also blocks writing to its log locations, so when writing a lot of > log messages, it can fill up the write buffer in the kernel, and be suspended > until the syslog agent catches up reading from the socket (GLOG does a > blocking fwrite to stderr). GLOG also has a big mutex around logging so only > one thing logs at a time. > While for "internal debugging" it is useful to see things like "message went > from internal component x to internal component y", from a sysadmin > perspective I only care about the high level actions taken (launched task for > framework x), sent offer to framework y, got task failed from host z. Note > those are what I'd expect at the "INFO" level. At the "WARNING" level I'd > expect very little to be logged / almost nothing in normal operation. Just > things like "WARN: Replicated log write took longer than expected". WARN > would also get things like backtraces on crashes and abnormal exits / abort. 
> When trying to launch 3k+ tasks inside a second, mesos logging currently > overwhelms syslog with 100k+ messages, many of which are thousands of bytes. > Sysadmins expect to be able to use syslog to monitor basic events in their > system. This is too much. > We can keep logging the messages to files, but the logging to stderr needs to > be reduced significantly (stderr gets picked up and forwarded to syslog / > central aggregation). > What I would like is if I can set the stderr logging level to be different / > independent from the file logging level (Syslog giving the "sysadmin" > aggregated overview, files useful for debugging in depth what happened in a > cluster). A lot of what mesos currently logs at info is really debugging info > / should show up as debug log level. > Some samples of mesos logging a lot more than a sysadmin would want / expect > are attached, and some are below: > - Every task gets printed multiple times for a basic launch: > {noformat} > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: > I1215 22:58:29.382644 1315 master.cpp:3248] Launching task > envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework > 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon) > Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: > I1215 22:58:29.382925 1315 master.hpp:176] Adding task > envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; > mem(*):16; ports(*):[14047-14047] > {noformat} > - Every task status update prints many log lines, successful ones are part > of normal operation and maybe should be logged at info / debug levels, but > not to a sysadmin (Just show when things fail, and maybe aggregate counters > to indicate the volume of work) > - No log messages should be really big / more than 1k characters (Would > prevent the giant port list attached, make that easily discoverable / bug > filable / fixable) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
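The request above, independent stderr and file log levels, can be illustrated with a minimal sketch. This is hypothetical code (not glog or the actual Mesos logging setup): two severity thresholds feed two sinks, so the syslog-bound stream only sees high-level events while the file keeps everything:

```cpp
#include <string>
#include <vector>

// Hypothetical logger with independent severity thresholds for the
// stderr stream (which gets forwarded to syslog) and the on-disk log
// file, so files can keep INFO while syslog only sees WARNING and up.
enum Severity { DEBUG = 0, INFO = 1, WARNING = 2, ERROR = 3 };

class SplitLogger {
public:
  SplitLogger(Severity stderrLevel, Severity fileLevel)
    : stderrLevel_(stderrLevel), fileLevel_(fileLevel) {}

  void log(Severity severity, const std::string& message) {
    // Each sink applies its own threshold independently.
    if (severity >= fileLevel_) fileSink.push_back(message);
    if (severity >= stderrLevel_) stderrSink.push_back(message);
  }

  // Captured output, standing in for the real file / stderr streams.
  std::vector<std::string> fileSink;
  std::vector<std::string> stderrSink;

private:
  Severity stderrLevel_;
  Severity fileLevel_;
};
```

With {{SplitLogger(WARNING, INFO)}}, per-task launch chatter stays in the files for deep debugging while syslog receives only warnings and errors, which is the split the ticket asks for.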
[jira] [Updated] (MESOS-4721) Add allocator metric for allocation duration
[ https://issues.apache.org/jira/browse/MESOS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4721: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add allocator metric for allocation duration > > > Key: MESOS-4721 > URL: https://issues.apache.org/jira/browse/MESOS-4721 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4684) Create base docker image for test suite.
[ https://issues.apache.org/jira/browse/MESOS-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4684: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Create base docker image for test suite. > > > Key: MESOS-4684 > URL: https://issues.apache.org/jira/browse/MESOS-4684 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: Gilbert Song > Labels: containerizer > > This should be widely used for unified containerizer testing. Should > basically include: > *at least one layer. > *repositories. > For each layer: > *root file system as a layer tar ball. > *docker image json (manifest). > *docker version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4719) Add allocator metric for number of offers each framework received
[ https://issues.apache.org/jira/browse/MESOS-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4719: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add allocator metric for number of offers each framework received > - > > Key: MESOS-4719 > URL: https://issues.apache.org/jira/browse/MESOS-4719 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4214) Introduce HTTP endpoint /weights for updating weight
[ https://issues.apache.org/jira/browse/MESOS-4214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4214: - Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29) > Introduce HTTP endpoint /weights for updating weight > > > Key: MESOS-4214 > URL: https://issues.apache.org/jira/browse/MESOS-4214 > Project: Mesos > Issue Type: Task >Reporter: Yongqiao Wang >Assignee: Yongqiao Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4381) Improve upgrade compatibility documentation.
[ https://issues.apache.org/jira/browse/MESOS-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4381: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Improve upgrade compatibility documentation. > > > Key: MESOS-4381 > URL: https://issues.apache.org/jira/browse/MESOS-4381 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Joerg Schad >Assignee: Joerg Schad > Labels: documentation, mesosphere > > Investigate and document upgrade compatibility for 0.27 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4748) Add Appc image fetcher tests.
[ https://issues.apache.org/jira/browse/MESOS-4748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4748: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add Appc image fetcher tests. > - > > Key: MESOS-4748 > URL: https://issues.apache.org/jira/browse/MESOS-4748 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Jojy Varghese >Assignee: Jojy Varghese > Labels: mesosphere, unified-containerizer-mvp > > Mesos now has support for fetching Appc images. Add tests that verify the > new component. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4691) Add a HierarchicalAllocator benchmark with reservation labels.
[ https://issues.apache.org/jira/browse/MESOS-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4691: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add a HierarchicalAllocator benchmark with reservation labels. > -- > > Key: MESOS-4691 > URL: https://issues.apache.org/jira/browse/MESOS-4691 > Project: Mesos > Issue Type: Task >Reporter: Michael Park >Assignee: Neil Conway > Labels: mesosphere > > With {{Labels}} being part of the {{ReservationInfo}}, we should ensure that > we don't observe a significant performance degradation in the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3945) Add operator documentation for /weight endpoint
[ https://issues.apache.org/jira/browse/MESOS-3945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-3945: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add operator documentation for /weight endpoint > --- > > Key: MESOS-3945 > URL: https://issues.apache.org/jira/browse/MESOS-3945 > Project: Mesos > Issue Type: Task >Reporter: James Wang >Assignee: Yongqiao Wang > > This JIRA ticket will update the related doc to apply to dynamic weights, and > add a new operator guide for dynamic weights which describes basic usage of > the /weights endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4633) Tests will dereference stack allocated agent objects upon assertion/expectation failure.
[ https://issues.apache.org/jira/browse/MESOS-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4633: - Sprint: Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 28, Mesosphere Sprint 29) > Tests will dereference stack allocated agent objects upon > assertion/expectation failure. > > > Key: MESOS-4633 > URL: https://issues.apache.org/jira/browse/MESOS-4633 > Project: Mesos > Issue Type: Bug >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: flaky, mesosphere, tech-debt, test > > Tests that use the {{StartSlave}} test helper are generally fragile when the > test fails an assert/expect in the middle of the test. This is because the > {{StartSlave}} helper takes raw pointer arguments, which may be > stack-allocated. > In case of an assert failure, the test immediately exits (destroying stack > allocated objects) and proceeds onto test cleanup. The test cleanup may > dereference some of these destroyed objects, leading to a test crash like: > {code} > [18:27:36][Step 8/8] F0204 18:27:35.981302 23085 logging.cpp:64] RAW: Pure > virtual method called > [18:27:36][Step 8/8] @ 0x7f7077055e1c google::LogMessage::Fail() > [18:27:36][Step 8/8] @ 0x7f707705ba6f google::RawLog__() > [18:27:36][Step 8/8] @ 0x7f70760f76c9 __cxa_pure_virtual > [18:27:36][Step 8/8] @ 0xa9423c > mesos::internal::tests::Cluster::Slaves::shutdown() > [18:27:36][Step 8/8] @ 0x1074e45 > mesos::internal::tests::MesosTest::ShutdownSlaves() > [18:27:36][Step 8/8] @ 0x1074de4 > mesos::internal::tests::MesosTest::Shutdown() > [18:27:36][Step 8/8] @ 0x1070ec7 > mesos::internal::tests::MesosTest::TearDown() > {code} > The {{StartSlave}} helper should take {{shared_ptr}} arguments instead. > This also means that we can remove the {{Shutdown}} helper from most of these > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
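The ownership problem can be illustrated outside of Mesos. In the sketch below, {{Agent}}, {{Cluster}} and {{startSlave}} are hypothetical stand-ins for the test helpers: holding a {{shared_ptr}} instead of a raw pointer means teardown can no longer dereference a destroyed stack object, because the cluster shares ownership of the agent:

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for the test helpers in the ticket. If
// Cluster held a raw `Agent*` to a caller's stack object, teardown
// after an early test exit would dereference a destroyed object; with
// shared_ptr the cluster co-owns the Agent, so teardown stays safe.
struct Agent {
  std::string id;
  std::string shutdown() { return "shutdown " + id; }
};

struct Cluster {
  std::shared_ptr<Agent> agent;

  // Safe even after the scope that created the Agent has exited.
  std::string teardown() { return agent->shutdown(); }
};

Cluster startSlave(const std::string& id) {
  // Ownership is shared with the returned Cluster; nothing dangles
  // when this function (or an assert-failing test body) returns.
  return Cluster{std::make_shared<Agent>(Agent{id})};
}
```

This is exactly the shape of the fix the ticket proposes: the helper takes (or creates) {{shared_ptr}} arguments so that the cleanup path in {{TearDown}} never touches freed stack memory.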
[jira] [Updated] (MESOS-4544) Propose design doc for agent partitioning behavior
[ https://issues.apache.org/jira/browse/MESOS-4544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4544: - Sprint: Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 28, Mesosphere Sprint 29) > Propose design doc for agent partitioning behavior > -- > > Key: MESOS-4544 > URL: https://issues.apache.org/jira/browse/MESOS-4544 > Project: Mesos > Issue Type: Task > Components: general >Reporter: Neil Conway >Assignee: Neil Conway > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3854) Finalize design for generalized Authorizer interface
[ https://issues.apache.org/jira/browse/MESOS-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-3854: - Sprint: Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29) > Finalize design for generalized Authorizer interface > > > Key: MESOS-3854 > URL: https://issues.apache.org/jira/browse/MESOS-3854 > Project: Mesos > Issue Type: Task > Components: security >Reporter: Bernd Mathiske >Assignee: Alexander Rojas > Labels: authorization, mesosphere > > Finalize the structure the interface and achieve consensus on the design doc > proposed in MESOS-2949. > https://docs.google.com/document/d/1-XARWJFUq0r_TgRHz_472NvLZNjbqE4G8c2JL44OSMQ/edit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4634) Tests will dereference stack allocated master objects upon assertion/expectation failure.
[ https://issues.apache.org/jira/browse/MESOS-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4634: - Sprint: Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 28, Mesosphere Sprint 29) > Tests will dereference stack allocated master objects upon > assertion/expectation failure. > - > > Key: MESOS-4634 > URL: https://issues.apache.org/jira/browse/MESOS-4634 > Project: Mesos > Issue Type: Bug >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: flaky, mesosphere, tech-debt, test > > Tests that use the {{StartMaster}} test helper are generally fragile when the > test fails an assert/expect in the middle of the test. This is because the > {{StartMaster}} helper takes raw pointer arguments, which may be > stack-allocated. > In case of an assert failure, the test immediately exits (destroying stack > allocated objects) and proceeds onto test cleanup. The test cleanup may > dereference some of these destroyed objects, leading to a test crash like: > {code} > [18:27:36][Step 8/8] F0204 18:27:35.981302 23085 logging.cpp:64] RAW: Pure > virtual method called > [18:27:36][Step 8/8] @ 0x7f7077055e1c google::LogMessage::Fail() > [18:27:36][Step 8/8] @ 0x7f707705ba6f google::RawLog__() > [18:27:36][Step 8/8] @ 0x7f70760f76c9 __cxa_pure_virtual > [18:27:36][Step 8/8] @ 0xa9423c > mesos::internal::tests::Cluster::Slaves::shutdown() > [18:27:36][Step 8/8] @ 0x1074e45 > mesos::internal::tests::MesosTest::ShutdownSlaves() > [18:27:36][Step 8/8] @ 0x1074de4 > mesos::internal::tests::MesosTest::Shutdown() > [18:27:36][Step 8/8] @ 0x1070ec7 > mesos::internal::tests::MesosTest::TearDown() > {code} > The {{StartMaster}} helper should take {{shared_ptr}} arguments instead. > This also means that we can remove the {{Shutdown}} helper from most of these > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4610) MasterContender/MasterDetector should be loadable as modules
[ https://issues.apache.org/jira/browse/MESOS-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4610: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > MasterContender/MasterDetector should be loadable as modules > > > Key: MESOS-4610 > URL: https://issues.apache.org/jira/browse/MESOS-4610 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Mark Cavage >Assignee: Mark Cavage > > Currently mesos depends on Zookeeper for leader election and notification to > slaves, although there is a C++ hierarchy in the code to support alternatives > (e.g., unit tests use an in-memory implementation). From an operational > perspective, many organizations/users do not want to take a dependency on > Zookeeper, and use an alternative solution to implementing leader election. > Our organization in particular, very much wants this, and as a reference > there have been several requests from the community (see referenced tickets) > to replace with etcd/consul/etc. > This ticket will serve as the work effort to modularize the > MasterContender/MasterDetector APIs such that integrators can build a > pluggable solution of their choice; this ticket will not fold in any > implementations such as etcd et al., but simply move this hierarchy to be > fully pluggable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4722) Add allocator metric for number of active offer filters
[ https://issues.apache.org/jira/browse/MESOS-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4722: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Add allocator metric for number of active offer filters > --- > > Key: MESOS-4722 > URL: https://issues.apache.org/jira/browse/MESOS-4722 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4673) Agent fails to shutdown after re-registering period timed-out.
[ https://issues.apache.org/jira/browse/MESOS-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-4673: - Sprint: Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Sprint 29) > Agent fails to shutdown after re-registering period timed-out. > -- > > Key: MESOS-4673 > URL: https://issues.apache.org/jira/browse/MESOS-4673 > Project: Mesos > Issue Type: Bug > Components: docker >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > Under certain conditions, when a Mesos agent loses connection to the master > for an extended period of time (say, a switch fails), the master will > de-register the agent, and then when the agent comes back up, refuse to let > it register: {{Slave asked to shut down by master@10.102.25.1:5050 because > 'Slave attempted to re-register after removal'}}. > The agent doesn't seem to be able to properly shut down and remove running > tasks as it should do to register as a new agent. Hence this message will > persist until it's resolved by manual intervention. > This seems to be caused by Docker tasks that couldn't shut down cleanly when > the agent is asked to shut down running tasks to be able to register as a new > agent with the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2317) Remove deprecated checkpoint=false code
[ https://issues.apache.org/jira/browse/MESOS-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Harutyunyan updated MESOS-2317: - Sprint: Mesosphere Q1 Sprint 6 - 4/3, Mesosphere Q1 Sprint 7 - 4/17, Mesosphere Q2 Sprint 8 - 5/1, Mesosphere Q1 Sprint 9 - 5/15, Mesosphere Sprint 10, Mesosphere Sprint 11, Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30 (was: Mesosphere Q1 Sprint 6 - 4/3, Mesosphere Q1 Sprint 7 - 4/17, Mesosphere Q2 Sprint 8 - 5/1, Mesosphere Q1 Sprint 9 - 5/15, Mesosphere Sprint 10, Mesosphere Sprint 11, Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29) > Remove deprecated checkpoint=false code > --- > > Key: MESOS-2317 > URL: https://issues.apache.org/jira/browse/MESOS-2317 > Project: Mesos > Issue Type: Epic >Affects Versions: 0.22.0 >Reporter: Adam B >Assignee: Joerg Schad > Labels: checkpoint, mesosphere > > Cody's plan from MESOS-444 was: > 1) -Make it so the flag can't be changed at the command line- > 2) -Remove the checkpoint variable entirely from slave/flags.hpp. This is a > fairly involved change since a number of unit tests depend on manually > setting the flag, as well as the default being non-checkpointing.- > 3) -Remove logic around checkpointing in the slave, remove logic inside the > master.- > 4) Drop the flag from the SlaveInfo struct (Will require a deprecation cycle). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4053: - Environment: CentOS 6.6, Ubuntu 14.04 (was: CentOS 6.6) > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6, Ubuntu 14.04 >Reporter: Greg Mann >Assignee: Benjamin Hindman > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems that mounted cgroups are not properly cleaned up after previous tests, > so multiple hierarchies are detected and thus an error is produced: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. > - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. 
> - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174054#comment-15174054 ] Greg Mann commented on MESOS-4053: -- I also produced this error with 0.25.1-rc1 on Ubuntu 14.04 using gcc, with libevent and SSL enabled. Tests were run as root. > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6, Ubuntu 14.04 >Reporter: Greg Mann >Assignee: Benjamin Hindman > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems that mounted cgroups are not properly cleaned up after previous tests, > so multiple hierarchies are detected and thus an error is produced: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Please unmount the corresponding (or all) subsystems. > - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. 
> Please unmount the corresponding (or all) subsystems. > - > ../../src/tests/mesos.cpp:932: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
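The "Multiple cgroups base hierarchies detected" failure above can usually be diagnosed from the mount table. A minimal diagnostic sketch, assuming a Linux host; the /tmp/mesos_test_cgroup path comes from the test log above, and the umount lines are deliberately left commented so nothing is torn down blindly:

```shell
# List every place a cgroup subsystem is mounted. If more than one base
# hierarchy shows up (e.g. both /cgroup and /tmp/mesos_test_cgroup), the
# tests above fail with the error quoted in this ticket.
grep cgroup /proc/mounts || true

# If a stale test hierarchy is still mounted alongside the system one,
# unmount its subsystems and then the hierarchy itself, e.g.:
#   umount /tmp/mesos_test_cgroup/perf_event
#   umount /tmp/mesos_test_cgroup
# (Run these only against the leftover test mount, never the system mounts.)
```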
[jira] [Updated] (MESOS-4827) Destroy Docker container from Marathon kills Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-4827: -- Labels: (was: newbie) > Destroy Docker container from Marathon kills Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi > > The details of this issue were originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > In short, the problem is that when we destroy/re-deploy a docker-containerized > task, the mesos-slave gets killed from time to time. It happened in our > production environment and I can't reproduce it. > Please refer to the StackOverflow post for the error message I got and > details of the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4811) Reusable/Cacheable Offer
[ https://issues.apache.org/jira/browse/MESOS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-4811: --- Assignee: Klaus Ma > Reusable/Cacheable Offer > > > Key: MESOS-4811 > URL: https://issues.apache.org/jira/browse/MESOS-4811 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Klaus Ma >Assignee: Klaus Ma > > Currently, resources are returned to the allocator when a task finishes, and > those resources are not allocated to a framework until the next allocation > cycle. Performance is poor for short-running tasks (MESOS-3078). The proposed > solution is to let the framework keep using the offer until the allocator > decides to rescind it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4811) Reusable/Cacheable Offer
[ https://issues.apache.org/jira/browse/MESOS-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-4811: Labels: tech-debt (was: ) > Reusable/Cacheable Offer > > > Key: MESOS-4811 > URL: https://issues.apache.org/jira/browse/MESOS-4811 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Klaus Ma >Assignee: Klaus Ma > Labels: tech-debt > > Currently, resources are returned to the allocator when a task finishes, and > those resources are not allocated to a framework until the next allocation > cycle. Performance is poor for short-running tasks (MESOS-3078). The proposed > solution is to let the framework keep using the offer until the allocator > decides to rescind it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173867#comment-15173867 ] Klaus Ma commented on MESOS-4735: - [~gyliu], I think [~erikdw] is talking about the type/extension of the downloaded file; for example, some URLs do not include a file name, so it's hard for the {{fetcher}} to determine the file type; the proposal of this JIRA is to add the file type to {{CommandInfo.URI}}, so the {{fetcher}} can use the right tool to unpack the downloaded file. [~erikdw], please correct me if I'm misunderstanding. > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extracting of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could workaround the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
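The query-parameter problem this ticket describes can be sketched in shell; the URL below is a made-up example, not taken from the ticket:

```shell
# A downloader that derives the local filename naively from the URL keeps
# the query string, which then defeats suffix-based extraction logic:
url='http://example.com/files/app.tar.gz?token=abc'
basename "$url"           # -> app.tar.gz?token=abc  (not recognized as .tar.gz)

# Tools like curl and wget let the caller choose the target name instead,
# which is the capability this ticket asks CommandInfo.URI to expose.
# Stripping the query string recovers a usable name:
basename "${url%%\?*}"    # -> app.tar.gz
```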
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173854#comment-15173854 ] Guangya Liu commented on MESOS-4735: Hi [~erikdw], can you please provide more detail about {{choosing the filename to save the downloaded file as}}? The curl fetcher now supports {{"http", "https", "ftp", "ftps"}}; what other kinds of files does the curl fetcher need to support? > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extracting of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could workaround the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4827) Destroy Docker container from Marathon kills Mesos slave
Zhenzhong Shi created MESOS-4827: Summary: Destroy Docker container from Marathon kills Mesos slave Key: MESOS-4827 URL: https://issues.apache.org/jira/browse/MESOS-4827 Project: Mesos Issue Type: Bug Components: docker, framework, slave Affects Versions: 0.25.0 Reporter: Zhenzhong Shi The details of this issue were originally [posted on StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. In short, the problem is that when we destroy/re-deploy a docker-containerized task, the mesos-slave gets killed from time to time. It happened in our production environment and I can't reproduce it. Please refer to the StackOverflow post for the error message I got and details of the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2858) FetcherCacheHttpTest.HttpMixed is flaky.
[ https://issues.apache.org/jira/browse/MESOS-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173700#comment-15173700 ] Bernd Mathiske commented on MESOS-2858: --- Thanks! Having looked through this log once, I have not found the culprit yet. According to the sandbox dumps, the 3 tasks run as intended, but somehow signaling the TASK_FINISHED status updates gets hung somewhere along the way to an AWAIT. Investigation to be continued... > FetcherCacheHttpTest.HttpMixed is flaky. > > > Key: MESOS-2858 > URL: https://issues.apache.org/jira/browse/MESOS-2858 > Project: Mesos > Issue Type: Bug > Components: fetcher, test >Reporter: Benjamin Mahler >Assignee: Bernd Mathiske > Labels: flaky-test, mesosphere > > From jenkins: > {noformat} > [ RUN ] FetcherCacheHttpTest.HttpMixed > Using temporary directory '/tmp/FetcherCacheHttpTest_HttpMixed_qfpOOC' > I0611 00:40:28.208909 26042 leveldb.cpp:176] Opened db in 3.831173ms > I0611 00:40:28.209951 26042 leveldb.cpp:183] Compacted db in 997319ns > I0611 00:40:28.210011 26042 leveldb.cpp:198] Created db iterator in 23917ns > I0611 00:40:28.210032 26042 leveldb.cpp:204] Seeked to beginning of db in > 2112ns > I0611 00:40:28.210043 26042 leveldb.cpp:273] Iterated through 0 keys in the > db in 392ns > I0611 00:40:28.210095 26042 replica.cpp:744] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0611 00:40:28.210741 26067 recover.cpp:449] Starting replica recovery > I0611 00:40:28.211144 26067 recover.cpp:475] Replica is in EMPTY status > I0611 00:40:28.212210 26074 replica.cpp:641] Replica in EMPTY status received > a broadcasted recover request > I0611 00:40:28.212728 26071 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I0611 00:40:28.213260 26069 recover.cpp:566] Updating replica status to > STARTING > I0611 00:40:28.214066 26073 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 590673ns > I0611 00:40:28.214095 26073 
replica.cpp:323] Persisted replica status to > STARTING > I0611 00:40:28.214350 26073 recover.cpp:475] Replica is in STARTING status > I0611 00:40:28.214774 26061 master.cpp:363] Master > 20150611-004028-1946161580-33349-26042 (658ddc752264) started on > 172.17.0.116:33349 > I0611 00:40:28.214800 26061 master.cpp:365] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" > --credentials="/tmp/FetcherCacheHttpTest_HttpMixed_qfpOOC/credentials" > --framework_sorter="drf" --help="false" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.23.0/_inst/share/mesos/webui" > --work_dir="/tmp/FetcherCacheHttpTest_HttpMixed_qfpOOC/master" > --zk_session_timeout="10secs" > I0611 00:40:28.215342 26061 master.cpp:410] Master only allowing > authenticated frameworks to register > I0611 00:40:28.215361 26061 master.cpp:415] Master only allowing > authenticated slaves to register > I0611 00:40:28.215397 26061 credentials.hpp:37] Loading credentials for > authentication from '/tmp/FetcherCacheHttpTest_HttpMixed_qfpOOC/credentials' > I0611 00:40:28.215589 26064 replica.cpp:641] Replica in STARTING status > received a broadcasted recover request > I0611 00:40:28.215770 26061 master.cpp:454] Using default 'crammd5' > authenticator > I0611 00:40:28.215934 26061 master.cpp:491] Authorization enabled > I0611 00:40:28.215932 26062 recover.cpp:195] Received a recover response from > a replica in STARTING status > I0611 00:40:28.216256 26070 whitelist_watcher.cpp:79] No whitelist given > I0611 00:40:28.216310 26066 
hierarchical.hpp:309] Initialized hierarchical > allocator process > I0611 00:40:28.216352 26067 recover.cpp:566] Updating replica status to VOTING > I0611 00:40:28.216909 26070 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 374189ns > I0611 00:40:28.216931 26070 replica.cpp:323] Persisted replica status to > VOTING > I0611 00:40:28.217052 26075 recover.cpp:580] Successfully joined the Paxos > group > I0611 00:40:28.217355 26063 master.cpp:1476] The newly elected leader is > master@172.17.0.116:33349 with id 20150611-004028-1946161580-33349-26042 > I0611 00:40:28.217512 26063 master.cpp:1489] Elected as the leading master! > I0611 00:40:28.217540 26063 master.cpp:1259] Recovering from registrar > I0611 00:40:28.217753 26070 registrar.cpp:313] Recovering
[jira] [Comment Edited] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173519#comment-15173519 ] Erik Weathers edited comment on MESOS-4735 at 3/1/16 9:57 AM: -- [~gyliu], MESOS-3367 fixes one of the issues I appended above, but not the other. This proposal is more general than either of those issues, providing some level of future-proofness against other unforeseen issues. Ignoring those other issues, the Mesos fetcher is acting as an HTTP downloader, and all of the utilities I use on a day-to-day basis for that already support the functionality this ticket is requesting: choosing the filename to save the downloaded file as. Browsers let you do that, as do {{curl}} and {{wget}}. So it's just something that should be added sooner or later to the Mesos fetcher, and the fact that this would allow for other various problems to be overcome by a framework author is just another benefit. was (Author: erikdw): [~gyliu] MESOS-3367 fixes one of the issues I appended above, but not the other. This proposal is more general than either of those issues, providing some level of future-proofness against other unforeseen issues. Ignoring those other issues, the Mesos fetcher is acting as an HTTP downloader, and all of the utilities I use on a day-to-day basis for that already support the functionality this ticket is requesting: choosing the filename to save the downloaded file as. Browsers let you do that, as do {{curl}} and {{wget}}. So it's just something that should be added sooner or later to the Mesos fetcher, and the fact that this would allow for other various problems to be overcome by a framework author is just another benefit. 
> CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extracting of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could workaround the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173519#comment-15173519 ] Erik Weathers commented on MESOS-4735: -- [~gyliu] MESOS-3367 fixes one of the issues I appended above, but not the other. This proposal is more general than either of those issues, providing some level of future-proofness against other unforeseen issues. Ignoring those other issues, the Mesos fetcher is acting as an HTTP downloader, and all of the utilities I use on a day-to-day basis for that already support the functionality this ticket is requesting: choosing the filename to save the downloaded file as. Browsers let you do that, as do {{curl}} and {{wget}}. So it's just something that should be added sooner or later to the Mesos fetcher, and the fact that this would allow for other various problems to be overcome by a framework author is just another benefit. > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extracting of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could workaround the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4709) Enable compiler optimization by default
[ https://issues.apache.org/jira/browse/MESOS-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-4709: -- Shepherd: Till Toenshoff > Enable compiler optimization by default > --- > > Key: MESOS-4709 > URL: https://issues.apache.org/jira/browse/MESOS-4709 > Project: Mesos > Issue Type: Improvement > Components: general >Reporter: Neil Conway >Assignee: Neil Conway > Labels: autoconf, configure, mesosphere > > At present, Mesos defaults to compiling with "-O0"; to enable compiler > optimizations, the user needs to specify "--enable-optimize" when running > {{configure}}. > We should change the default for the following reasons: > (1) The autoconf default for CFLAGS/CXXFLAGS is "-O2 -g". Anecdotally, > I think most software packages compile with a reasonable level of > optimizations enabled by default. > (2) I think we should make the default configure flags appropriate for > end-users (rather than Mesos developers): developers will be familiar > enough with Mesos to tune the configure flags according to their own > preferences. > (3) The performance consequences of not enabling compiler > optimizations can be pretty severe: 5x in a benchmark I just ran, and > we've seen between 2x and 30x (!) performance differences for some > real-world workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
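The configure behavior discussed in this ticket can be summarized as a build fragment; this is a sketch against the Mesos autotools build, with flag spellings taken from the ticket text:

```shell
# Before this change, an out-of-the-box build compiled at -O0;
# optimization had to be requested explicitly:
./configure --enable-optimize

# Autoconf's own default for CFLAGS/CXXFLAGS is "-O2 -g", which can also
# be passed directly (the exact flag values here are illustrative):
./configure CXXFLAGS='-O2 -g'
```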
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173462#comment-15173462 ] Guangya Liu commented on MESOS-4735: [~erikdw] I think that MESOS-3367 is going to fix the issue that you mentioned above, right? From your comment in MESOS-3367, it seems you filed this JIRA ticket because you want the URI to be able to specify some local files? > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extracting of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could workaround the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4826) Test helper function mesos::internal::tests::Metrics has name not following mesos style
Benjamin Bannier created MESOS-4826: --- Summary: Test helper function mesos::internal::tests::Metrics has name not following mesos style Key: MESOS-4826 URL: https://issues.apache.org/jira/browse/MESOS-4826 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Bannier Priority: Trivial The test helper function {{mesos::internal::tests::Metrics}} has a name not following mesos style. The expected name would have been {{metrics}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)