Ah, so the only issue there is that the fix version on the ticket is wrong. For some reason I thought 0.26.0 had been released much more recently, so (combined with the fix version on the ticket) I had assumed that a patch from November would definitely have been included.
At least that's one mystery solved, thanks.

From: haosdent [mailto:haosd...@gmail.com]
Sent: Thursday, January 21, 2016 8:31 PM
To: user
Subject: Re: Framework Id and upgrading mesos versions

> but I noticed that the code added to fix MESOS-3834 appears in the master branch in github, but not the 0.26.0 branch.

0.26.0-rc1 was cut on Nov 13, 2015, while this patch was submitted on Nov 24, 2015, so 0.26.0 does not contain this patch.

On Fri, Jan 22, 2016 at 7:19 AM, David Kesler <dkes...@yodle.com> wrote:

I'm attempting to test upgrading from our current version of Mesos (0.22.1) to the latest. Even when going only one minor version at a time, I'm running into issues due to the lack of a framework id in the FrameworkInfo. I've been able to replicate the issue reliably.

I started with a single master and slave, with a fresh install of Marathon 0.9.0 and Mesos 0.22.1, wiping out /tmp/mesos on the slave and /mesos and /marathon in ZooKeeper. I started up a task. At this point, I can look at `/tmp/mesos/meta/slaves/latest/frameworks/<my current marathon framework id>/framework.info` and verify that there is no framework id present in the file.

I then upgraded the master to Mesos 0.23.1, restarted it, then the slave to 0.23.1 and restarted it, then Marathon to 0.11.1 (which was built against Mesos 0.23) and restarted it. The slave came up and recovered just fine. However, the framework.info file never gets updated with the framework id.
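For anyone wanting to verify what's in that checkpoint: framework.info is a serialized FrameworkInfo protobuf, so you can check for the id field without the Mesos Python bindings by scanning the wire format directly. The sketch below is a hedged illustration, not Mesos code: it assumes the FrameworkInfo field numbering from mesos.proto at the time (user = 1, name = 2, optional id = 3), and the demo bytes are hand-built stand-ins rather than a real checkpoint.

```python
def has_field(data: bytes, target: int) -> bool:
    """Scan a protobuf wire-format message for a top-level field number."""
    i = 0
    while i < len(data):
        # Decode the varint key: (field number << 3) | wire type.
        key, shift = 0, 0
        while True:
            b = data[i]; i += 1
            key |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        field, wire = key >> 3, key & 7
        if field == target:
            return True
        if wire == 0:            # varint: skip continuation bytes
            while data[i] & 0x80:
                i += 1
            i += 1
        elif wire == 1:          # fixed64
            i += 8
        elif wire == 2:          # length-delimited: varint length, then payload
            length, shift = 0, 0
            while True:
                b = data[i]; i += 1
                length |= (b & 0x7F) << shift
                shift += 7
                if not (b & 0x80):
                    break
            i += length
        elif wire == 5:          # fixed32
            i += 4
        else:                    # unknown wire type: give up
            return False
    return False

# Hand-built stand-in for a pre-0.23 checkpoint: user="root", name="marathon",
# and no id field (field 3) -- the shape that trips the 0.24 recovery check.
no_id = b"\x0a\x04root\x12\x08marathon"
# The same message with a nested FrameworkID added as field 3.
with_id = no_id + b"\x1a\x06\x0a\x04abcd"
print(has_field(no_id, 3), has_field(with_id, 3))   # -> False True
```

In practice you'd feed it `open(path, "rb").read()`; if field 3 never shows up, that checkpoint is the kind that will fail the `frameworkInfo.has_id()` check on a 0.24+ slave.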
If I then proceed to upgrade the master to 0.24, restart it, then the slave to 0.24 and restart it, the slave fails to come up with the following error:

Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.409395  9527 main.cpp:187] Version: 0.24.1
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.409406  9527 main.cpp:190] Git tag: 0.24.1
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.409418  9527 main.cpp:194] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.513608  9527 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@716: Client environment:host.name=dev-sandbox-mesos-slave1
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-58-generic
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@725: Client environment:os.version=#97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.514710  9527 main.cpp:272] Starting Mesos slave
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.516090  9542 slave.cpp:190] Slave started on 1)@10.100.25.112:5051
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.516180  9542 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --initialize_driver_logging="true" --ip="10.100.25.112" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://dev-sandbox-mesos-zk1.nyc.dev.yodle.com:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.517006  9542 slave.cpp:354] Slave resources: cpus(*):2; mem(*):15025; disk(*):35818; ports(*):[31000-32000]
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.517315  9542 slave.cpp:384] Slave hostname: dev-sandbox-mesos-slave1.nyc.dev.yodle.com
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.517334  9542 slave.cpp:389] Slave checkpoint: true
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@741: Client environment:user.home=/root
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@753: Client environment:user.dir=/
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=dev-sandbox-mesos-zk1.nyc.dev.yodle.com:2181 sessionTimeout=10000 watcher=0x7f18dfac6610 sessionId=0 sessionPasswd=<null> context=0x7f18b8002180 flags=0
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.520829  9544 state.cpp:54] Recovering state from '/tmp/mesos/meta'
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,521:9527(0x7f18d2d8d700):ZOO_INFO@check_events@1703: initiated connection to server [10.100.25.111:2181]
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.524245  9542 slave.cpp:4157] Recovering framework 20160121-172941-1847157770-5050-4782-0000
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: F0121 17:54:46.524288  9542 slave.cpp:4175] Check failed: frameworkInfo.has_id()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: *** Check failure stack trace: ***
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dfe3091d  google::LogMessage::Fail()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dfe3275d  google::LogMessage::SendToLog()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,528:9527(0x7f18d2d8d700):ZOO_INFO@check_events@1750: session establishment complete on server [10.100.25.111:2181], sessionId=0x14ec1fa6d1a263d, negotiated timeout=10000
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.528326  9549 group.cpp:331] Group process (group(1)@10.100.25.112:5051) connected to ZooKeeper
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.528370  9549 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.528455  9549 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dfe3050c  google::LogMessage::Flush()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dfe33059  google::LogMessageFatal::~LogMessageFatal()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.532296  9549 detector.cpp:156] Detected a new leader: (id='2')
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.532524  9543 group.cpp:674] Trying to get '/mesos/info_0000000002' in ZooKeeper
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18df900ba8  mesos::internal::slave::Slave::recoverFramework()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: W0121 17:54:46.533833  9543 detector.cpp:444] Leading master master@10.100.25.110:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.534034  9543 detector.cpp:481] A new leading master (UPID=master@10.100.25.110:5050) is detected
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18df907193  mesos::internal::slave::Slave::recover()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18df938383  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dfde1681  process::ProcessManager::resume()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dfde197f  process::internal::schedule()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18dec6da40  (unknown)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18de48a182  start_thread
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     0x7f18de1b747d  (unknown)

With 0.23.1 running, I've tried restarting the mesos-slave multiple times, I've tried deploying new tasks, and I've tried waiting, but the framework.info file never seems to get updated, so I have no clue how I'm supposed to actually get past 0.23.1 as part of the upgrade.

Additionally, I saw https://issues.apache.org/jira/browse/MESOS-3834, which says it was fixed in 0.26.0 and resolved in November, so I tried going all the way to Mesos 0.26.0. (Yes, I'm aware that it's not recommended to skip versions, but I wanted to see if I could get around the framework id issue.) Not only did it fail the same way, but I noticed that the code added to fix MESOS-3834 appears in the master branch on GitHub, but not in the 0.26.0 branch.
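A quick way to settle the release-branch question is `git tag --contains <sha>`, which lists every tag whose history includes a given commit. The sketch below is a self-contained illustration of the technique (it builds a throwaway repo rather than cloning Mesos, and assumes `git` is on PATH); against a real Mesos clone you would run the final command with the MESOS-3834 fix commit's SHA.

```python
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(
        ["git", "-c", "user.name=t", "-c", "user.email=t@example.com", *args],
        cwd=repo, check=True, capture_output=True, text=True).stdout

tmp = tempfile.mkdtemp()
git(tmp, "init", "-q")
git(tmp, "commit", "-q", "--allow-empty", "-m", "base")
git(tmp, "tag", "0.26.0-rc1")       # release tag cut here (mirrors Nov 13)
git(tmp, "commit", "-q", "--allow-empty", "-m", "fix lands later (Nov 24)")
fix_sha = git(tmp, "rev-parse", "HEAD").strip()
# Tags containing the fix commit: empty, because the tag predates the fix.
containing = git(tmp, "tag", "--contains", fix_sha).strip()
print(repr(containing))   # -> '' : the rc tag does not include the later fix
```

The same logic explains the thread: a patch merged to master after the release candidate was tagged simply never reaches that release.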
One last thing I don't understand: our current dev/qa/master cluster slaves appear to be writing the framework id to the framework.info file, despite running Mesos 0.22.1 and Marathon 0.9.0 and being set up via Puppet just like the sandbox I've been testing in. So it's possible that there's some issue preventing the slave in the sandbox from writing the framework id to the file, but I can't find any difference in the setups that would cause that either.

Any help you can provide would be greatly appreciated.

--
Best Regards,
Haosdent Huang