[ 
https://issues.apache.org/jira/browse/MESOS-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294461#comment-15294461
 ] 

Joseph Wu commented on MESOS-5427:
----------------------------------

Are you running on Ubuntu 10?  (Typo?)  I'm not sure if Mesos builds on that.

Could you try the same setup/configuration with a more recent version of Mesos?
The SASL-based authentication code has not changed much.  (It was moved around,
and is now called the CRAM MD5 authenticator/authenticatee.)

> Mesos master locks up after slave fails to authenticate
> -------------------------------------------------------
>
>                 Key: MESOS-5427
>                 URL: https://issues.apache.org/jira/browse/MESOS-5427
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.1
>         Environment: Linux XXXXXX-XXXXXXXXX 3.13.0-49-generic #81-Ubuntu SMP 
> Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
> Ubuntu 10.04.1 LTS
> AWS/8cores/16GB
>            Reporter: analogue
>            Priority: Minor
>
> In a mesos master cluster with one leader and two backups, a single slave 
> attempting to authenticate with the leader locked up the master and resulted 
> in 2 CPU cores pegged at 100% CPU usage until restarted.
> master
> {noformat}
> I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating 
> slave(1)@10.85.20.76:5051
> I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL 
> connection
> I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> ...
> W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
> I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication 
> request from slave(1)@10.85.20.76:5051 because authentication is still in 
> progress
> (last long line repeats until mesos-master restarted)
> {noformat}
> slave
> {noformat}
> Log file created at: 2016/05/16 02:37:52
> Running on machine: 10-85-20-76-uswest2btestopia
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
> I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
> I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
> I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
> I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: 
> fe0a39112f3304283f970f1b08b322b1e970829d
> I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: 
> cgroups/cpu,cgroups/mem
> I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
> I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 
> 1)@10.85.20.76:5051
> I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for 
> authentication from '/etc/seagull_mesos_creds'
> W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential 
> file '/etc/seagull_mesos_creds' are too open. It is recommended that your 
> credential file is NOT accessible by others.
> I0516 02:37:52.128968 10246 slave.cpp:265] Slave using credential for: 
> seagull_slave
> I0516 02:37:52.129612 10246 slave.cpp:278] Slave resources: cpus(*):31; 
> disk(*):140000; mem(*):59382; ports(*):[31000-32000]
> I0516 02:37:52.132064 10255 group.cpp:313] Group process 
> (group(1)@10.85.20.76:5051) connected to ZooKeeper
> I0516 02:37:52.132086 10255 group.cpp:787] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0516 02:37:52.132097 10255 group.cpp:385] Trying to create path 
> '/mesos-releng' in ZooKeeper
> I0516 02:37:52.200781 10246 slave.cpp:306] Slave hostname: 
> 10-85-20-76-uswest2btestopia.dev.yelpcorp.com
> I0516 02:37:52.200804 10246 slave.cpp:307] Slave checkpoint: true
> I0516 02:37:52.201323 10262 detector.cpp:138] Detected a new leader: 
> (id='4733')
> I0516 02:37:52.201484 10258 group.cpp:658] Trying to get 
> '/mesos-releng/info_0000004733' in ZooKeeper
> I0516 02:37:52.202786 10262 detector.cpp:426] A new leading master 
> (UPID=master@10.85.3.50:5050) is detected
> I0516 02:37:52.203071 10257 state.cpp:33] Recovering state from 
> '/ephemeral/mesos-slave/meta'
> I0516 02:37:52.203229 10253 status_update_manager.cpp:193] Recovering status 
> update manager
> I0516 02:37:52.203358 10247 containerizer.cpp:252] Recovering containerizer
> I0516 02:37:52.205095 10253 slave.cpp:3198] Finished recovery
> I0516 02:37:52.206739 10264 slave.cpp:589] New master detected at 
> master@10.85.3.50:5050
> I0516 02:37:52.207399 10264 slave.cpp:663] Authenticating with master 
> master@10.85.3.50:5050
> I0516 02:37:52.207417 10253 status_update_manager.cpp:167] New master 
> detected at master@10.85.3.50:5050
> I0516 02:37:52.207474 10264 slave.cpp:636] Detecting new master
> I0516 02:37:52.207484 10256 authenticatee.hpp:104] Initializing client SASL
> I0516 02:37:53.521589 10256 authenticatee.hpp:128] Creating new client SASL 
> connection
> W0516 02:37:57.210635 10257 slave.cpp:737] Authentication timed out
> W0516 02:37:57.210777 10257 slave.cpp:701] Failed to authenticate with master 
> {noformat}
> No idea why the timestamps don't match up between the master and the slave 
> but ntp logs to syslog indicate there was no time skew at play here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
