[ 
https://issues.apache.org/jira/browse/MESOS-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875294#comment-13875294
 ] 

Nikita Vetoshkin commented on MESOS-912:
----------------------------------------

Ah, sorry... I read the manual again and, yes, it says (rather implicitly) that 
after {{raise()}} the signal handler is executed on the same stack (i.e. not 
asynchronously). But what we see instead is very strange. It looks like a 
problem with glog: I get the same stack trace after sending the signal by hand.
Anyway, I've always thought that ignoring {{SIGPIPE}} is the first thing you 
should do when dealing with network services. The libev author suggests that 
too in the docs:
{quote}
So when you encounter spurious, unexplained daemon exits, make sure you ignore 
SIGPIPE (and maybe make sure you log the exit status of your daemon somewhere, 
as that would have given you a big clue).
{quote}
The sending code already handles send errors and logs them appropriately (or 
throws an exception with the {{strerror()}} message), so why bother using 
another mechanism? I can understand the hesitation from libprocess's point of 
view, though: it's a library, and maybe we shouldn't require the user to 
ignore SIGPIPE.
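
To illustrate the point, here is a minimal sketch of what ignoring {{SIGPIPE}} buys you. It is not taken from Mesos or libprocess. The helper names {{ignoreSigpipe()}} and {{writeToClosedPipe()}} are hypothetical, and a pipe with a closed read end stands in for a socket whose peer has gone away. With the signal ignored, {{write()}} fails with {{EPIPE}} and the caller's normal error path can deal with it.

```cpp
#include <cerrno>
#include <csignal>
#include <cstring>
#include <unistd.h>

// Hypothetical helper: ignore SIGPIPE process-wide so that writing to a
// closed peer fails with EPIPE instead of terminating the process.
// Returns true if the disposition was installed.
bool ignoreSigpipe()
{
  struct sigaction action;
  std::memset(&action, 0, sizeof(action));
  action.sa_handler = SIG_IGN;
  sigemptyset(&action.sa_mask);
  return sigaction(SIGPIPE, &action, nullptr) == 0;
}

// Hypothetical helper: write to a pipe whose read end is already closed
// (standing in for a disconnected socket). Returns the errno reported by
// write(), or 0 if the write unexpectedly succeeds.
int writeToClosedPipe()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return errno;
  }
  close(fds[0]);  // No reader is left on the other side.

  const char data[] = "x";
  ssize_t n = write(fds[1], data, sizeof(data));
  int error = (n == -1) ? errno : 0;

  close(fds[1]);
  return error;
}
```

Calling {{ignoreSigpipe()}} once at startup and then {{writeToClosedPipe()}} should yield {{EPIPE}}, which can be logged via {{strerror()}} like any other send error. Without the first call, the default disposition of {{SIGPIPE}} would terminate the process at the {{write()}}.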

> Slave sometimes crashes with SIGPIPE
> ------------------------------------
>
>                 Key: MESOS-912
>                 URL: https://issues.apache.org/jira/browse/MESOS-912
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>         Environment: OSX 10.8.5
>            Reporter: Vinod Kone
>             Fix For: 0.17.0
>
>
> ➜  build git:(vinod/vote) ✗ ./bin/mesos-slave.sh --master=127.0.0.1:5055
> I0115 12:15:19.846664 2096390528 main.cpp:118] Build: 2014-01-14 16:52:48 by 
> vinod
> I0115 12:15:19.847189 2096390528 main.cpp:120] Creating "process" isolator
> I0115 12:15:19.847462 2096390528 main.cpp:132] Starting Mesos slave
> I0115 12:15:19.847807 2096390528 slave.cpp:111] Slave started on 
> 1)@172.25.27.97:5051
> I0115 12:15:19.848068 2096390528 slave.cpp:211] Slave resources: cpus(*):4; 
> mem(*):7168; disk(*):481998; ports(*):[31000-32000]
> I0115 12:15:19.852408 175071232 state.cpp:33] Recovering state from 
> '/tmp/mesos/meta'
> I0115 12:15:19.853726 175071232 status_update_manager.cpp:188] Recovering 
> status update manager
> I0115 12:15:19.853798 175071232 process_isolator.cpp:317] Recovering isolator
> I0115 12:15:19.853883 175071232 slave.cpp:2769] Finished recovery
> I0115 12:15:19.854004 173998080 slave.cpp:500] New master detected at 
> [email protected]:5055
> I0115 12:15:19.854161 175607808 status_update_manager.cpp:162] New master 
> detected at [email protected]:5055
> I0115 12:15:19.854220 173998080 slave.cpp:525] Detecting new master
> I0115 12:15:19.854409 175607808 slave.cpp:1966] [email protected]:5055 exited
> W0115 12:15:19.854440 175607808 slave.cpp:1969] Master disconnected! Waiting 
> for a new master to be elected
> W0115 12:15:19.854440 2096390528 logging.cpp:69] RAW: Received signal 
> SIGPIPE; escalating to SIGABRT
> *** Aborted at 1389816919 (unix time) try "date -d @1389816919" if you are 
> using GNU date ***
> PC: @     0x7fff98586d46 __kill
> *** SIGABRT (@0x7fff98586d46) received by PID 21391 (TID 0x7fff7cf46180) 
> stack trace: ***
>     @     0x7fff960b190a _sigtramp
>     @     0x7fff7bf03588 std::string::_Rep::_S_empty_rep_storage
>     @     0x7fff960b190a _sigtramp
>     @                0x0 (unknown)
>     @        0x10956046b process::ProcessManager::wait()
>     @        0x109566e7d process::wait()
>     @        0x10924760a main
>     @     0x7fff947cc7e1 start
>     @                0x2 (unknown)
> [1]    21391 abort      ./bin/mesos-slave.sh --master=127.0.0.1:5055



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
