[ 
https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203825#comment-14203825
 ] 

ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user itaifrenkel commented on the pull request:

    https://github.com/apache/storm/pull/286#issuecomment-62295232
  
    @HeartSaVioR @clockfly  I think we need to keep the multilang protocol 
implementation as simple as possible. A full roundtrip of heartbeat messages is 
not that bad, as long as it does not add too much latency. If you would like to 
optimize the roundtrip messages, you could treat any emit as a 
heartbeat and trigger the heartbeat roundtrip only if there are not enough 
emits from the bolt. It makes the Java code more complicated :(, but achieves 
similar goals, and leaves the multilang implementation simpler :). All in all, I 
think this commit is good, and we could discuss various optimizations later on.
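
A minimal sketch of the emit-as-heartbeat idea, to make the trade-off concrete. 
This is not the code in the pull request: the class name HeartbeatTracker, its 
methods, and both thresholds are hypothetical. It assumes the Java side records 
every message read from the subprocess (emits included), sends an explicit 
heartbeat only after a quiet period, and treats a longer silence as a hang.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper, not part of Storm's ShellBolt.
 * Any message from the subprocess counts as proof of life.
 */
final class HeartbeatTracker {
    private final long idleMillis;     // silence after which an explicit heartbeat is sent
    private final long timeoutMillis;  // silence after which the subprocess is considered hung
    private final AtomicLong lastActivityMillis;

    HeartbeatTracker(long idleMillis, long timeoutMillis, long nowMillis) {
        if (idleMillis <= 0 || timeoutMillis <= idleMillis) {
            throw new IllegalArgumentException("need 0 < idleMillis < timeoutMillis");
        }
        this.idleMillis = idleMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = new AtomicLong(nowMillis);
    }

    /** Call for every message read from the subprocess: emit, ack, log, or heartbeat reply. */
    void onMessageFromSubprocess(long nowMillis) {
        lastActivityMillis.set(nowMillis);
    }

    /** True when the subprocess has been quiet long enough to justify a heartbeat roundtrip. */
    boolean shouldSendHeartbeat(long nowMillis) {
        return nowMillis - lastActivityMillis.get() >= idleMillis;
    }

    /** True when nothing has arrived within the timeout, so the subprocess is assumed hung. */
    boolean isHung(long nowMillis) {
        return nowMillis - lastActivityMillis.get() >= timeoutMillis;
    }
}
```

The caller passes the clock value in (for example System.currentTimeMillis() 
from a periodic timer), which keeps the class easy to test. A busy bolt never 
triggers the extra roundtrip, and the multilang side only has to answer a 
heartbeat, which is the simplicity argued for above.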


> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>             Fix For: 0.9.3-rc2
>
>
> If I'm understanding everything correctly with how ShellBolts work, the Java 
> ShellBolt executor is the part of the topology that sends heartbeats back to 
> Nimbus to let it know that a particular multilang bolt is still alive.  The 
> problem with this is that if the multilang subprocess/bolt severely hangs 
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java 
> ShellBolt does not seem to notice or care. Simply having the tuple get 
> replayed when it times out will not suffice either, because the subprocess 
> will still be stuck.
> The most obvious way to handle this seems to be to add heartbeating to the 
> multilang protocol itself, so that the ShellBolt expects a message of some 
> kind every {{timeout}} seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
