[ 
https://issues.apache.org/jira/browse/S4-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130390#comment-13130390
 ] 

Gavin Li commented on S4-3:
---------------------------

Kishore: in engineering practice, we usually have a daemon process 
monitoring the service processes. Whenever a service process exits for any 
reason (a crash, running out of memory, or an uncaught runtime exception), 
the daemon restarts it. I think this is also why we can sometimes simply 
exit the process as the handling for many exceptions. For example, the 
typical handling for a ZooKeeper session expiration exception is to exit the 
process so that it can restart. Otherwise, it is very hard to clean up all 
the resources the current session owns and redo the initialization. Simply 
exiting makes the process release all its existing resources and initialize 
again.
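
A minimal sketch of this exit-on-expiration handling, in Java. To stay 
self-contained it does not use the ZooKeeper client: SessionState is a 
hypothetical stand-in for the client's session states (the real client 
reports expiration through Watcher.Event.KeeperState.Expired), and 
FailFastSessionHandler, EXIT_SESSION_EXPIRED and the injected exit action are 
names invented for illustration. It assumes a supervising daemon restarts the 
process after it exits.

```java
import java.util.function.IntConsumer;

// Hypothetical stand-in for the ZooKeeper client's session states, so the
// sketch compiles without the client library on the classpath.
enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

// Fail-fast handler: on an unrecoverable session state, exit and rely on the
// monitoring daemon to restart the process with a clean slate.
final class FailFastSessionHandler {
    // Arbitrary non-zero exit code (an assumption, not an S4 convention).
    static final int EXIT_SESSION_EXPIRED = 2;

    // Injected so that tests can observe the exit instead of killing the JVM.
    private final IntConsumer exitAction;

    FailFastSessionHandler(IntConsumer exitAction) {
        this.exitAction = exitAction;
    }

    // Production wiring: really terminate the JVM.
    static FailFastSessionHandler exitingJvm() {
        return new FailFastSessionHandler(System::exit);
    }

    void onStateChange(SessionState state) {
        if (state == SessionState.EXPIRED) {
            // Everything tied to the old session is gone; restarting is
            // simpler than cleaning up and re-initializing in place.
            exitAction.accept(EXIT_SESSION_EXPIRED);
        }
        // CONNECTED / DISCONNECTED: the session may still recover, keep running.
    }
}
```

In production the handler would be built with exitingJvm(); a test can pass 
a lambda that records the exit code instead.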
                
> Sometimes one process node owns 2 tasks
> ---------------------------------------
>
>                 Key: S4-3
>                 URL: https://issues.apache.org/jira/browse/S4-3
>             Project: Apache S4
>          Issue Type: Bug
>            Reporter: Gavin Li
>            Assignee: Gavin Li
>         Attachments: s4_loscon_fix
>
>
> When using S4, we found that sometimes one process node ends up owning 2 
> tasks. I did some investigation, and it seems that the handling of 
> ConnectionLossException when creating the ephemeral node is problematic. 
> Sometimes, when the response from the ZooKeeper server times out, 
> zookeeper.create() fails with ConnectionLossException even though the 
> creation request might already have been sent to the server (see 
> http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java
>  line 830). From our logs, this is the case we ran into.
> Maybe we should handle it the way HBase does 
> (http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java?view=markup):
>  simply exit the process when that exception occurs, so that the whole 
> process restarts.
> To be clearer, what happened was: a process node called zookeeper.create() 
> to acquire a task, and the request was successfully sent to the ZooKeeper 
> server, but the ZooKeeper IO loop timed out before the response came back. 
> So zookeeper.create() failed with ConnectionLossException. The process node 
> then ignored this exception and tried to acquire another task, so it ended 
> up with 2 tasks.
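
The sequence described above can be reproduced with a toy in-memory model. 
This is a sketch under stated assumptions, not S4 or ZooKeeper code: 
FakeCoordinator, ConnectionLostException and TaskAcquirer are hypothetical 
names, and the fake server simply applies the create and then drops the 
reply once. It contrasts ignoring the exception with propagating it so that 
the caller can exit and restart.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ZooKeeper's ConnectionLossException.
class ConnectionLostException extends Exception {
    ConnectionLostException(String message) { super(message); }
}

// Toy in-memory "server": a create is applied even when its reply is lost.
class FakeCoordinator {
    final Map<String, String> owners = new LinkedHashMap<>();
    boolean dropNextResponse;

    void create(String taskPath, String owner) throws ConnectionLostException {
        if (owners.containsKey(taskPath)) {
            throw new IllegalStateException("task already owned: " + taskPath);
        }
        owners.put(taskPath, owner); // the request reached the server
        if (dropNextResponse) {
            dropNextResponse = false;
            throw new ConnectionLostException("response timed out for " + taskPath);
        }
    }

    List<String> tasksOwnedBy(String owner) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : owners.entrySet()) {
            if (e.getValue().equals(owner)) result.add(e.getKey());
        }
        return result;
    }
}

class TaskAcquirer {
    // Problematic handling: a lost reply is treated as "create failed",
    // so the node moves on and may end up owning two tasks.
    static String acquireIgnoringLoss(FakeCoordinator zk, String owner, List<String> tasks) {
        for (String task : tasks) {
            try {
                zk.create(task, owner);
                return task;
            } catch (ConnectionLostException e) {
                // Outcome unknown, but we carry on to the next task anyway.
            } catch (IllegalStateException e) {
                // Task taken by another node; try the next one.
            }
        }
        return null;
    }

    // Fail-fast handling: a lost reply propagates to the caller, which is
    // expected to exit the process and let the supervisor restart it.
    static String acquireFailFast(FakeCoordinator zk, String owner, List<String> tasks)
            throws ConnectionLostException {
        for (String task : tasks) {
            try {
                zk.create(task, owner);
                return task;
            } catch (IllegalStateException e) {
                // Task taken by another node; try the next one.
            }
        }
        return null;
    }
}
```

With the real service, an ephemeral node is removed once the session that 
created it ends, so exiting the process also releases the half-acquired task 
after the session is closed or times out.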

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
