Ah, so the only issue there is that the fix version on the ticket is wrong.  For 
some reason I thought 0.26.0 had been released much more recently, so (combined 
with the fix version on the ticket) I had assumed that a patch from November 
would definitely have been included.

At least that’s one mystery solved, thanks.

From: haosdent [mailto:haosd...@gmail.com]
Sent: Thursday, January 21, 2016 8:31 PM
To: user
Subject: Re: Framework Id and upgrading mesos versions

>but I noticed that the code added to fix Mesos-3834 appears in the master 
>branch in github, but not the 0.26.0 branch.
0.26.0-rc1 was cut on Nov 13, 2015, while this patch was submitted on Nov 24, 
2015, so 0.26.0 doesn't contain it.
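
If you want to double-check this in your own clone (assuming you have the 
commit hash of the MESOS-3834 patch, written here as the placeholder 
<patch-sha>, and that the release tags are fetched), something like this 
should show which releases contain it:

$ git fetch --tags
$ git tag --contains <patch-sha>
$ git merge-base --is-ancestor <patch-sha> 0.26.0 && echo included || echo "not included"

`git tag --contains` lists every release tag that already includes the 
commit, and the merge-base check tests one specific tag.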

On Fri, Jan 22, 2016 at 7:19 AM, David Kesler 
<dkes...@yodle.com> wrote:
I'm attempting to test upgrading from our current version of mesos (0.22.1) to 
the latest.  Even when going only one minor version at a time, I'm running into 
issues due to the lack of a framework id in the framework info.

I've been able to replicate the issue reliably.  I started with a single 
master and slave, with a fresh install of marathon 0.9.0 and mesos 0.22.1, 
wiping out /tmp/mesos on the slave and /mesos and /marathon in ZooKeeper.  I 
started up a task.  At this point, I can look at 
`/tmp/mesos/meta/slaves/latest/frameworks/<my current marathon framework 
id>/framework.info` and verify that there is no framework id present in the 
file.  I then upgraded the master to mesos 0.23.1, restarted it, then the 
slave to 0.23.1 and restarted it, then marathon to 0.11.1 (which was built 
against mesos 0.23) and restarted it.  The slave came up and recovered just 
fine.  However, the framework.info file never gets updated with the framework 
id.  If I then proceed to upgrade the master to 0.24, restart it, then the 
slave to 0.24 and restart it, the slave fails to come up with the following 
error:

Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.409395  9527 main.cpp:187] Version: 0.24.1
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.409406  9527 main.cpp:190] Git tag: 0.24.1
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.409418  9527 main.cpp:194] Git SHA: 
44873806c2bb55da37e9adbece938274d8cd7c48
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.513608  9527 containerizer.cpp:143] Using isolation: 
posix/cpu,posix/mem,filesystem/posix
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@716: Client 
environment:host.name=dev-sandbox-mesos-slave1
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@723: Client 
environment:os.name=Linux
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@724: Client 
environment:os.arch=3.13.0-58-generic
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@725: Client 
environment:os.version=#97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.514710  9527 main.cpp:272] Starting Mesos slave
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.516090  9542 slave.cpp:190] Slave started on 
1)@10.100.25.112:5051
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.516180  9542 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --initialize_driver_logging="true" --ip="10.100.25.112" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://dev-sandbox-mesos-zk1.nyc.dev.yodle.com:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.517006  9542 slave.cpp:354] Slave resources: cpus(*):2; mem(*):15025; 
disk(*):35818; ports(*):[31000-32000]
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.517315  9542 slave.cpp:384] Slave hostname: 
dev-sandbox-mesos-slave1.nyc.dev.yodle.com
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.517334  9542 slave.cpp:389] Slave checkpoint: true
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@733: Client 
environment:user.name=(null)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@741: Client 
environment:user.home=/root
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@753: Client 
environment:user.dir=/
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=dev-sandbox-mesos-zk1.nyc.dev.yodle.com:2181 sessionTimeout=10000 watcher=0x7f18dfac6610 sessionId=0 sessionPasswd=<null> context=0x7f18b8002180 flags=0
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.520829  9544 state.cpp:54] Recovering state from '/tmp/mesos/meta'
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,521:9527(0x7f18d2d8d700):ZOO_INFO@check_events@1703: initiated 
connection to server [10.100.25.111:2181]
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.524245  9542 slave.cpp:4157] Recovering framework 
20160121-172941-1847157770-5050-4782-0000
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: F0121 
17:54:46.524288  9542 slave.cpp:4175] Check failed: frameworkInfo.has_id()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: *** Check failure 
stack trace: ***
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dfe3091d  google::LogMessage::Fail()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dfe3275d  google::LogMessage::SendToLog()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 
17:54:46,528:9527(0x7f18d2d8d700):ZOO_INFO@check_events@1750: session 
establishment complete on server 
[10.100.25.111:2181], sessionId=0x14ec1fa6d1a263d, 
negotiated timeout=10000
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.528326  9549 group.cpp:331] Group process 
(group(1)@10.100.25.112:5051) connected to ZooKeeper
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.528370  9549 group.cpp:805] Syncing group operations: queue size 
(joins, cancels, datas) = (0, 0, 0)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.528455  9549 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dfe3050c  google::LogMessage::Flush()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dfe33059  google::LogMessageFatal::~LogMessageFatal()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.532296  9549 detector.cpp:156] Detected a new leader: (id='2')
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.532524  9543 group.cpp:674] Trying to get '/mesos/info_0000000002' in 
ZooKeeper
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18df900ba8  mesos::internal::slave::Slave::recoverFramework()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: W0121 
17:54:46.533833  9543 detector.cpp:444] Leading master 
master@10.100.25.110:5050 is using a Protobuf 
binary format when registering with ZooKeeper (info): this will be deprecated 
as of Mesos 0.24 (see MESOS-2340)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 
17:54:46.534034  9543 detector.cpp:481] A new leading master 
(UPID=master@10.100.25.110:5050) is detected
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18df907193  mesos::internal::slave::Slave::recover()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18df938383  
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dfde1681  process::ProcessManager::resume()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dfde197f  process::internal::schedule()
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18dec6da40  (unknown)
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18de48a182  start_thread
Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]:     @     
0x7f18de1b747d  (unknown)




With 0.23.1 running, I've tried restarting the mesos-slave multiple times, I've 
tried deploying new tasks, and I've tried waiting, but the framework.info file 
never seems to get updated, so I have no clue how I'm supposed to actually get 
past 0.23.1 as part of the upgrade.
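
In case it helps anyone reproduce this: as far as I can tell the 
framework.info file is just the checkpointed FrameworkInfo protobuf, so it 
can be dumped with protoc to confirm whether the id field is set.  A rough 
sketch, assuming a mesos source checkout matching the slave version (the 
include/mesos/mesos.proto path and the mesos.FrameworkInfo message name are 
my reading of the source tree, so treat them as approximate):

$ cd /path/to/mesos-source
$ protoc --proto_path=include --decode=mesos.FrameworkInfo \
      include/mesos/mesos.proto \
      < /tmp/mesos/meta/slaves/latest/frameworks/<framework id>/framework.info

If the decoded output has no `id { value: "..." }` block, the checkpoint 
really is missing the framework id, which is what the 0.24 slave's 
`Check failed: frameworkInfo.has_id()` trips over during recovery.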

Additionally, I saw https://issues.apache.org/jira/browse/MESOS-3834, which says 
it was fixed in 0.26.0 and resolved in November, so I tried going all the way 
to mesos 0.26.0.  (Yes, I'm aware that it's not recommended to skip versions, 
but I wanted to see if I could get around the framework id issue.)  Not only 
did it fail the same way, but I noticed that the code added to fix MESOS-3834 
appears in the master branch on GitHub, but not in the 0.26.0 branch.

One last thing I don't understand is that our current dev/qa/master cluster 
slaves appear to be writing the framework id to the framework.info file, 
despite running mesos 0.22.1 and marathon 0.9.0 and being set up via Puppet 
just like the sandbox I've been testing in.  So it's possible that there's some 
issue preventing the slave in the sandbox from writing the framework id to the 
file, but I can't find any difference between the setups that would cause that 
either.

Any help you can provide would be greatly appreciated.



--
Best Regards,
Haosdent Huang
