[ 
https://issues.apache.org/jira/browse/S4-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127425#comment-13127425
 ] 

Gavin Li commented on S4-3:
---------------------------

Actually, I have a concern about checking the existence and owner of the 
znode. ConnectionLossException is thrown when the SendThread in the ZooKeeper 
client finds that the response has timed out. Say the znode creation 
(including the proposal, ack, and commit on the ZooKeeper servers) takes a 
long time: the SendThread considers the request timed out, zookeeper.create() 
fails with ConnectionLossException, and we then try to read that znode to see 
whether it exists and who its owner is. There is a chance that when the read 
request is served by the ZooKeeper server, the creation is still in progress, 
so the read does not find the znode, yet the creation succeeds after the read 
has been served. Since the read only checks the state on the ZooKeeper server 
the client is connected to and involves no consensus voting, it should 
complete faster than the in-flight creation. Do you think this can happen?
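
The race above can be sketched with a small, self-contained Java simulation 
(no real ZooKeeper involved; the executor stands in for the quorum commit and 
the timed get() stands in for the SendThread timeout, so the timings here are 
illustrative assumptions, not ZooKeeper's actual values):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class CreateReadRace {
    // Runs the race once; returns {clientTimedOut, readSawZnode, znodeFinallyExists}.
    static boolean[] run() throws Exception {
        AtomicBoolean znodeExists = new AtomicBoolean(false); // server-side state
        ExecutorService server = Executors.newSingleThreadExecutor();

        // The create request has reached the server, but the quorum commit
        // (proposal, ack, commit) takes 200 ms to finish.
        Future<?> create = server.submit(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
            znodeExists.set(true); // the creation eventually succeeds
        });

        boolean timedOut = false;
        try {
            create.get(50, TimeUnit.MILLISECONDS); // client gives up after 50 ms
        } catch (TimeoutException e) {
            timedOut = true; // stands in for ConnectionLossException
        }

        boolean seenByRead = znodeExists.get(); // the fast recovery read: not found
        create.get();                           // the commit completes afterwards
        server.shutdown();
        return new boolean[] { timedOut, seenByRead, znodeExists.get() };
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
        // The read reported "no znode", yet the znode exists in the end.
    }
}
```

The same interleaving is what the existence-and-owner check would hit: the 
read races the still-pending creation and loses.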

So I think closing the session may be safer. It is also a bit complicated to 
implement: besides calling zookeeper.close(), we need to construct a new 
instance of the ZooKeeper class to obtain a new session, which involves more 
code changes. I guess that's why both HBase and Hedwig choose to simply let 
the process exit and restart.

What do you think?
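
A rough sketch of the close-and-recreate alternative, using a hypothetical 
ZkHandle class as a stand-in for org.apache.zookeeper.ZooKeeper (the real API 
differs; this only illustrates the control flow of discarding the whole 
session instead of reading the znode back):

```java
// Hypothetical stand-in for org.apache.zookeeper.ZooKeeper; only what is
// needed to show "close the session, build a new handle, retry" is modeled.
class ZkHandle {
    private static int sessions = 0;
    final int sessionId = ++sessions;
    private boolean open = true;

    // Simulated create: the very first session fails as if the client
    // had hit ConnectionLossException; later sessions succeed.
    boolean tryCreateEphemeral() {
        if (!open) throw new IllegalStateException("handle closed");
        return sessionId > 1;
    }

    void close() { open = false; } // session ends; its ephemeral znodes vanish
}

public class AcquireTask {
    static ZkHandle acquire() {
        ZkHandle zk = new ZkHandle();
        while (!zk.tryCreateEphemeral()) {
            // ConnectionLoss path: the outcome of create() is unknown, so
            // discard the whole session rather than read the znode back.
            zk.close();
            zk = new ZkHandle(); // new instance => new session: the extra code
                                 // change mentioned above. HBase/Hedwig skip
                                 // this and just exit the process here.
        }
        return zk;
    }

    public static void main(String[] args) {
        ZkHandle zk = acquire();
        System.out.println("acquired on session " + zk.sessionId);
    }
}
```

Closing the session also releases any ephemeral znode the lost create() may 
have secretly succeeded in making, which is exactly why it avoids the 
two-tasks outcome.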
                
> Sometimes one process node owns 2 tasks
> ---------------------------------------
>
>                 Key: S4-3
>                 URL: https://issues.apache.org/jira/browse/S4-3
>             Project: Apache S4
>          Issue Type: Bug
>            Reporter: Gavin Li
>            Assignee: Gavin Li
>         Attachments: s4_loscon_fix
>
>
> When using S4, we found that it sometimes ends up with one process node 
> owning 2 tasks. I did some investigation, and it seems the handling of 
> ConnectionLossException when creating the ephemeral node is problematic. 
> Sometimes when the response from the ZooKeeper server times out, 
> zookeeper.create() fails with ConnectionLossException even though the 
> creation request may already have been sent to the server (see 
> http://svn.apache.org/viewvc/hadoop/zookeeper/trunk/src/java/main/org/apache/zookeeper/ClientCnxn.java
>  line 830). From our logs this is the case we ran into.
> Maybe we should handle it the way HBase does 
> (http://svn.apache.org/viewvc/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java?view=markup):
>  simply exit the process on that exception and let the whole process 
> restart.
> To be clearer, what happened was: a process node called zookeeper.create() 
> to acquire a task, and the request was successfully sent to the ZooKeeper 
> server, but the ZooKeeper IO loop timed out before the response arrived, so 
> zookeeper.create() failed with ConnectionLossException. The process node 
> then ignored this exception, tried to acquire another task, and ended up 
> with 2 tasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
