[ https://issues.apache.org/jira/browse/MESOS-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035631#comment-16035631 ]

Neil Conway commented on MESOS-1606:
------------------------------------

Perhaps a disk I/O error, e.g., due to a flaky disk?

> Slave failed to checkpoint on Mac OS X
> --------------------------------------
>
> Key: MESOS-1606
> URL: https://issues.apache.org/jira/browse/MESOS-1606
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Environment: Mac OS X, Darwin Kernel Version 13.3.0
> Reporter: Zuyu Zhang
>
> {noformat}
> This bug happens to test_framework and LowLevelSchedulerLibprocess as well.
> [ RUN ] ExamplesTest.LowLevelSchedulerPthread
> Using temporary directory '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al'
> Enabling authentication for the scheduler
> I0715 19:03:59.296200 2019271440 scheduler.cpp:132] Version: 0.20.0
> I0715 19:03:59.300429 2019271440 leveldb.cpp:176] Opened db in 1982us
> I0715 19:03:59.300900 2019271440 leveldb.cpp:183] Compacted db in 447us
> I0715 19:03:59.300946 2019271440 leveldb.cpp:198] Created db iterator in 27us
> I0715 19:03:59.300978 2019271440 leveldb.cpp:204] Seeked to beginning of db
> in 16us
> I0715 19:03:59.301007 2019271440 leveldb.cpp:273] Iterated through 0 keys in
> the db in 20us
> I0715 19:03:59.301053 2019271440 replica.cpp:741] Replica recovered with log
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0715 19:03:59.301713 222965760 recover.cpp:425] Starting replica recovery
> I0715 19:03:59.301914 222965760 recover.cpp:451] Replica is in EMPTY status
> I0715 19:03:59.302671 221892608 replica.cpp:638] Replica in EMPTY status
> received a broadcasted recover request
> I0715 19:03:59.302781 224575488 recover.cpp:188] Received a recover response
> from a replica in EMPTY status
> I0715 19:03:59.303050 225112064 recover.cpp:542] Updating replica status to
> STARTING
> I0715 19:03:59.303432 222965760 leveldb.cpp:306] Persisting metadata (8
> bytes) to leveldb took 298us
> I0715 19:03:59.303475 222965760 replica.cpp:320] Persisted replica status to
> STARTING
> I0715 19:03:59.303540 221356032 recover.cpp:451] Replica is in STARTING status
> I0715 19:03:59.303797 224575488 master.cpp:288] Master
> 20140715-190359-16777343-64313-60122 (localhost) started on 127.0.0.1:64313
> I0715 19:03:59.303848 224575488 master.cpp:325] Master only allowing
> authenticated frameworks to register
> I0715 19:03:59.303865 224575488 master.cpp:332] Master allowing
> unauthenticated slaves to register
> I0715 19:03:59.303884 224575488 credentials.hpp:36] Loading credentials for
> authentication from
> '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al/credentials'
> W0715 19:03:59.303961 224575488 credentials.hpp:51] Permissions on
> credentials file
> '/tmp/ExamplesTest_LowLevelSchedulerPthread_SCL6Al/credentials' are too open.
> It is recommended that your credentials file is NOT accessible by others.
> I0715 19:03:59.304028 224575488 master.cpp:359] Authorization enabled
> I0715 19:03:59.304379 223502336 replica.cpp:638] Replica in STARTING status
> received a broadcasted recover request
> I0715 19:03:59.304505 2019271440 containerizer.cpp:124] Using isolation:
> posix/cpu,posix/mem
> I0715 19:03:59.304666 223502336 recover.cpp:188] Received a recover response
> from a replica in STARTING status
> I0715 19:03:59.304805 223502336 recover.cpp:542] Updating replica status to
> VOTING
> I0715 19:03:59.305186 223502336 leveldb.cpp:306] Persisting metadata (8
> bytes) to leveldb took 214us
> I0715 19:03:59.305219 223502336 replica.cpp:320] Persisted replica status to
> VOTING
> I0715 19:03:59.305250 223502336 recover.cpp:556] Successfully joined the
> Paxos group
> I0715 19:03:59.305361 223502336 recover.cpp:440] Recover process terminated
> I0715 19:03:59.305927 224038912 slave.cpp:168] Slave started on
> 1)@127.0.0.1:64313
> I0715 19:03:59.306221 224038912 slave.cpp:279] Slave resources: cpus(*):4;
> mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.306234 2019271440 containerizer.cpp:124] Using isolation:
> posix/cpu,posix/mem
> I0715 19:03:59.306248 223502336 master.cpp:1128] The newly elected leader is
> master@127.0.0.1:64313 with id 20140715-190359-16777343-64313-60122
> I0715 19:03:59.306269 223502336 master.cpp:1141] Elected as the leading
> master!
> I0715 19:03:59.306293 223502336 master.cpp:959] Recovering from registrar
> I0715 19:03:59.306395 225112064 registrar.cpp:313] Recovering registrar
> I0715 19:03:59.306617 221892608 log.cpp:656] Attempting to start the writer
> I0715 19:03:59.306952 224575488 slave.cpp:168] Slave started on
> 2)@127.0.0.1:64313
> I0715 19:03:59.307158 224575488 slave.cpp:279] Slave resources: cpus(*):4;
> mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.307207 222965760 replica.cpp:474] Replica received implicit
> promise request with proposal 1
> I0715 19:03:59.307401 224038912 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.307459 224038912 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.307446 222965760 leveldb.cpp:306] Persisting metadata (8
> bytes) to leveldb took 232us
> I0715 19:03:59.307512 222965760 replica.cpp:342] Persisted promised to 1
> I0715 19:03:59.307615 224575488 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.307631 224575488 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.307802 222965760 coordinator.cpp:230] Coordinator attemping to
> fill missing position
> I0715 19:03:59.307924 223502336 state.cpp:33] Recovering state from
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta'
> I0715 19:03:59.308027 2019271440 containerizer.cpp:124] Using isolation:
> posix/cpu,posix/mem
> I0715 19:03:59.308171 222429184 status_update_manager.cpp:193] Recovering
> status update manager
> I0715 19:03:59.308205 225112064 state.cpp:33] Recovering state from
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/1/meta'
> I0715 19:03:59.308316 221892608 containerizer.cpp:287] Recovering
> containerizer
> I0715 19:03:59.308384 221356032 status_update_manager.cpp:193] Recovering
> status update manager
> I0715 19:03:59.308575 225112064 containerizer.cpp:287] Recovering
> containerizer
> I0715 19:03:59.309072 222429184 slave.cpp:3130] Finished recovery
> I0715 19:03:59.309079 223502336 slave.cpp:3130] Finished recovery
> F0715 19:03:59.309267 222429184 slave.cpp:3141]
> CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to checkpoint
> '1405473915' to
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta/boot_id':
> Failed to open file
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/0/meta/boot_id':
> No such file or directory
> *** Check failure stack trace: ***
> I0715 19:03:59.309270 221892608 replica.cpp:375] Replica received explicit
> promise request for position 0 with proposal 2
> I0715 19:03:59.309516 221892608 leveldb.cpp:343] Persisting action (8 bytes)
> to leveldb took 219us
> I0715 19:03:59.309502 223502336 slave.cpp:168] Slave started on
> 3)@127.0.0.1:64313
> I0715 19:03:59.309582 222965760 slave.cpp:603] New master detected at
> master@127.0.0.1:64313
> I0715 19:03:59.309588 221892608 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.309665 222965760 slave.cpp:639] No credentials provided.
> Attempting to register without authentication
> I0715 19:03:59.309685 225112064 status_update_manager.cpp:167] New master
> detected at master@127.0.0.1:64313
> I0715 19:03:59.309798 223502336 slave.cpp:279] Slave resources: cpus(*):4;
> mem(*):7168; disk(*):470714; ports(*):[31000-32000]
> I0715 19:03:59.310104 224038912 replica.cpp:508] Replica received write
> request for position 0
> I0715 19:03:59.310331 222965760 slave.cpp:652] Detecting new master
> I0715 19:03:59.310395 224038912 leveldb.cpp:438] Reading position from
> leveldb took 30us
> I0715 19:03:59.310642 223502336 slave.cpp:324] Slave hostname: localhost
> I0715 19:03:59.310657 223502336 slave.cpp:325] Slave checkpoint: true
> I0715 19:03:59.310689 224038912 leveldb.cpp:343] Persisting action (14 bytes)
> to leveldb took 227us
> I0715 19:03:59.310722 224038912 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.310936 222965760 replica.cpp:655] Replica received learned
> notice for position 0
> I0715 19:03:59.311103 222965760 leveldb.cpp:343] Persisting action (16 bytes)
> to leveldb took 160us
> @ 0x10b3d54f9 google::LogMessage::SendToLog()
> I0715 19:03:59.311158 221892608 state.cpp:33] Recovering state from
> '/var/folders/67/g567hfcj4bjcd_bm3gsqs54h0000gn/T/mesos-XXXXXX.FUk9AYoy/2/meta'
> I0715 19:03:59.311436 222965760 replica.cpp:676] Persisted action at 0
> I0715 19:03:59.311514 222965760 replica.cpp:661] Replica learned NOP action
> at position 0
> I0715 19:03:59.311544 221892608 status_update_manager.cpp:193] Recovering
> status update manager
> I0715 19:03:59.311612 221892608 containerizer.cpp:287] Recovering
> containerizer
> I0715 19:03:59.311643 222965760 log.cpp:672] Writer started with ending
> position 0
> @ 0x10b3d5a24 google::LogMessage::Flush()
> I0715 19:03:59.311983 225112064 slave.cpp:3130] Finished recovery
> @ 0x10b3d8b0f google::LogMessageFatal::~LogMessageFatal()
> I0715 19:03:59.312419 224038912 leveldb.cpp:438] Reading position from
> leveldb took 43us
> I0715 19:03:59.312515 222965760 slave.cpp:603] New master detected at
> master@127.0.0.1:64313
> I0715 19:03:59.312854 222965760 slave.cpp:639] No credentials provided.
> Attempting to register without authentication
> I0715 19:03:59.312891 222965760 slave.cpp:652] Detecting new master
> I0715 19:03:59.312924 222965760 status_update_manager.cpp:167] New master
> detected at master@127.0.0.1:64313
> @ 0x10b3d60f9 google::LogMessageFatal::~LogMessageFatal()
> @ 0x10ad381b3 _CheckFatal::~_CheckFatal()
> @ 0x10ad37a29 _CheckFatal::~_CheckFatal()
> @ 0x10af8371f mesos::internal::slave::Slave::__recover()
> @ 0x10b30df43 process::ProcessBase::visit()
> @ 0x10b304d44 process::ProcessManager::resume()
> @ 0x10b30488f process::schedule()
> @ 0x7fff907b0899 _pthread_body
> @ 0x7fff907b072a _pthread_start
> @ 0x7fff907b4fc9 thread_start
> ../../src/tests/script.cpp:85: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Abort trap: 6
> make[3]: *** [check-local] Segmentation fault: 11
> make[2]: *** [check-am] Error 2
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)