[ https://issues.apache.org/jira/browse/MESOS-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235013#comment-16235013 ]
Anand Mazumdar commented on MESOS-7867:
---------------------------------------

{noformat}
commit fa747ac3620a6695cfe29ac443f833fad6c991a2
Author: Ilya Pronin <ipro...@twopensource.com>
Date:   Wed Nov 1 17:29:58 2017 -0700

    Added SchedulerHttpApiTest.UpdateHttpToPidSchedulerAndBack test.

    This test verifies that we are able to upgrade from a PID based
    framework to an HTTP framework and then downgrade back without
    restarting the master.

    Review: https://reviews.apache.org/r/62241/

commit 5b93d6ac60725679399fe15233267c02cc9918df
Author: Ilya Pronin <ipro...@twopensource.com>
Date:   Wed Nov 1 17:29:49 2017 -0700

    Removed metrics removal from Master::failoverFramework().

    When a framework upgrades from a PID based driver to an HTTP based
    driver, the master removes its per-principal metrics. When the same
    framework downgrades back to a PID based driver, the master doesn't
    reinstate those metrics. This causes a crash when the master receives
    a message from the failed over framework and tries to increment its
    metrics.

    This patch fixes the issue by removing metrics removal from framework
    failover handling code. Note that it doesn't handle the case when the
    framework's principal change. This situation is being dealt with
    separately in MESOS-2842.

    Review: https://reviews.apache.org/r/62240/
{noformat}

> Master doesn't handle scheduler driver downgrade from HTTP based to PID based
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7867
>                 URL: https://issues.apache.org/jira/browse/MESOS-7867
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.0
>            Reporter: Ilya Pronin
>            Assignee: Ilya Pronin
>            Priority: Major
>             Fix For: 1.5.0
>
>
> When a framework upgrades from a PID based driver to an HTTP based driver,
> master removes its per-framework-principal metrics ({{messages_received}} and
> {{messages_processed}}) in {{Master::failoverFramework}}. When the same
> framework downgrades back to a PID based driver, the master doesn't reinstate
> those metrics. This causes a crash when the master receives a message from
> the failed over framework and increments {{messages_received}} counter in
> {{Master::visit(const MessageEvent&)}}.
> {noformat}
> I0807 18:17:45.713220 19095 master.cpp:2916] Framework 70822e80-ca38-4470-916e-e6da073a4742-0000 (TwitterScheduler) failed over
> F0807 18:18:20.725908 19079 master.cpp:1451] Check failed: metrics->frameworks.contains(principal.get())
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
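
The failure mode described in the issue can be illustrated with a small, self-contained C++ model. This is only a sketch and not the Mesos code: {{ToyMaster}}, {{PrincipalMetrics}}, {{upgradeToHttp}}, {{downgradeToPid}} and {{receiveMessage}} are hypothetical names invented for the illustration, and the real master aborts on a failed {{CHECK}} where this model throws an exception so that the outcome can be observed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the per-principal counters named in the issue.
struct PrincipalMetrics {
  std::uint64_t messages_received = 0;
  std::uint64_t messages_processed = 0;
};

// Toy model of the master's metrics bookkeeping. NOT the real Mesos class.
class ToyMaster {
public:
  // removeMetricsOnFailover == true models the behaviour before the fix.
  explicit ToyMaster(bool removeMetricsOnFailover)
    : removeMetricsOnFailover_(removeMetricsOnFailover) {}

  // A PID based framework subscribes; its principal gets metrics.
  void subscribePid(const std::string& principal) {
    metrics_.emplace(principal, PrincipalMetrics{});
  }

  // Failover from PID to HTTP. Before the fix, the metrics are dropped here.
  void upgradeToHttp(const std::string& principal) {
    if (removeMetricsOnFailover_) {
      metrics_.erase(principal);
    }
  }

  // Failover from HTTP back to PID. Nothing reinstates the metrics.
  void downgradeToPid(const std::string& /*principal*/) {}

  // A message arrives from the PID based framework. The real master has a
  // CHECK here; this model throws instead of aborting the process.
  void receiveMessage(const std::string& principal) {
    auto it = metrics_.find(principal);
    if (it == metrics_.end()) {
      throw std::logic_error("no metrics for principal " + principal);
    }
    ++it->second.messages_received;
  }

  // Returns the counter value, or 0 if the principal has no metrics.
  std::uint64_t messagesReceived(const std::string& principal) const {
    auto it = metrics_.find(principal);
    return it == metrics_.end() ? 0 : it->second.messages_received;
  }

private:
  bool removeMetricsOnFailover_;
  std::map<std::string, PrincipalMetrics> metrics_;
};
```

With {{ToyMaster(true)}}, the sequence subscribe, upgrade, downgrade, receive ends in the exception, which corresponds to the {{Check failed}} line in the log above. With {{ToyMaster(false)}}, which models the patch's approach of leaving the metrics in place during failover, the same sequence increments the counter normally.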