[ https://issues.apache.org/jira/browse/MESOS-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294461#comment-15294461 ]
Joseph Wu commented on MESOS-5427: ---------------------------------- Are you running on Ubuntu 10? (Typo?) I'm not sure if Mesos builds on that. Could you try the same setup/configuration with a more recent version of Mesos? The SASL-based authentication code has not change much. (It was moved around, and is now called the CRAM MD5 authenticator/ee.) > Mesos master locks up after slave fails to authenticate > ------------------------------------------------------- > > Key: MESOS-5427 > URL: https://issues.apache.org/jira/browse/MESOS-5427 > Project: Mesos > Issue Type: Bug > Components: master > Affects Versions: 0.20.1 > Environment: Linux XXXXXX-XXXXXXXXX 3.13.0-49-generic #81-Ubuntu SMP > Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux > Ubuntu 10.04.1 LTS > AWS/8cores/16GB > Reporter: analogue > Priority: Minor > > In a mesos master cluster with one leader and two backups, a single slave > attempting to authenticate with the leader locked up the master and resulted > in 2 CPU cores pegged at 100% CPU usage until restarted. > master > {noformat} > I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating > slave(1)@10.85.20.76:5051 > I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL > connection > I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > ... > W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out > I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication > request from slave(1)@10.85.20.76:5051 because authentication is still in > progress > (last long line repeats until mesos-master restarted) > {noformat} > slave > {noformat} > Log file created at: 2016/05/16 02:37:52 > Running on machine: 10-85-20-76-uswest2btestopia > Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg > I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started! > I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by > I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1 > I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1 > I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: > fe0a39112f3304283f970f1b08b322b1e970829d > I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: > cgroups/cpu,cgroups/mem > I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave > I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on > 1)@10.85.20.76:5051 > I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for > authentication from '/etc/seagull_mesos_creds' > W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential > file '/etc/seagull_mesos_creds' are too open. It is recommended that your > credential file is NOT accessible by others. > I0516 02:37:52.128968 10246 slave.cpp:265] Slave using credential for: > seagull_slave > I0516 02:37:52.129612 10246 slave.cpp:278] Slave resources: cpus(*):31; > disk(*):140000; mem(*):59382; ports(*):[31000-32000] > I0516 02:37:52.132064 10255 group.cpp:313] Group process > (group(1)@10.85.20.76:5051) connected to ZooKeeper > I0516 02:37:52.132086 10255 group.cpp:787] Syncing group operations: queue > size (joins, cancels, datas) = (0, 0, 0) > I0516 02:37:52.132097 10255 group.cpp:385] Trying to create path > '/mesos-releng' in ZooKeeper > I0516 02:37:52.200781 10246 slave.cpp:306] Slave hostname: > 10-85-20-76-uswest2btestopia.dev.yelpcorp.com > I0516 02:37:52.200804 10246 slave.cpp:307] Slave checkpoint: true > I0516 02:37:52.201323 10262 detector.cpp:138] Detected a new leader: > (id='4733') > I0516 02:37:52.201484 10258 group.cpp:658] Trying to get > '/mesos-releng/info_0000004733' in ZooKeeper > I0516 02:37:52.202786 10262 detector.cpp:426] A new leading master > (UPID=master@10.85.3.50:5050) is detected > I0516 02:37:52.203071 10257 state.cpp:33] Recovering state from > '/ephemeral/mesos-slave/meta' > I0516 02:37:52.203229 10253 status_update_manager.cpp:193] Recovering status > update manager > I0516 02:37:52.203358 10247 containerizer.cpp:252] Recovering containerizer > I0516 02:37:52.205095 10253 slave.cpp:3198] Finished recovery > I0516 02:37:52.206739 10264 slave.cpp:589] New master detected at > master@10.85.3.50:5050 > I0516 02:37:52.207399 10264 slave.cpp:663] Authenticating with master > master@10.85.3.50:5050 > I0516 02:37:52.207417 10253 status_update_manager.cpp:167] New master > detected at master@10.85.3.50:5050 > I0516 02:37:52.207474 10264 slave.cpp:636] Detecting new master > I0516 02:37:52.207484 10256 authenticatee.hpp:104] Initializing client SASL > I0516 02:37:53.521589 10256 authenticatee.hpp:128] Creating new client SASL > connection > W0516 02:37:57.210635 10257 slave.cpp:737] Authentication timed out > W0516 02:37:57.210777 10257 slave.cpp:701] Failed to authenticate with master > {noformat} > No idea why the timestamps don't match up between the master and the slave > but ntp logs to syslog indicate there was no time skew at play here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)