[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800424#comment-17800424
 ] 

yangzhenxing edited comment on ZOOKEEPER-2802 at 1/6/24 3:34 AM:
-----------------------------------------------------------------

Interesting, I just hit the same issue on an old version, 3.5.4.

The stack is the same as yours; it hung at wait_sync_completion.

I dug into it a little and found that the main thread do_io is not present in 
the dump.

So it looks like a deadlock caused by the client code:

1. We need to hold a mutex when we send a request through the zk client 
handle;

2. when the client sees the session is expired, it just TERMINATEs the main 
thread "do_io" and relies on the user to close the connection and re-init a 
new zk client handle;

3. the user needs to hold the same mutex to do the close and re-init.

 

The problem is, with the main thread do_io exited, the waiting sync request 
can never complete, but the thread that issued it is still holding the mutex...

The issue looks to be a small time window in which do_io exits just as a new 
request is pushed onto the "to_send" and "sent_requests" lists.

 

I wrote a small program that reproduces it easily; the master branch also has 
the same issue.

The main idea of the test program is to keep sending zk requests, with a new 
thread for each request, while repeatedly taking the network down and bringing 
it back up.

 

 



> Zookeeper C client hang @wait_sync_completion
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-2802
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2802
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.4.6
>         Environment: DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"
>            Reporter: yihao yang
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: zookeeper.out.2017.05.31-10.06.23
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I was using the zookeeper 3.4.6 c client to access one zookeeper server in a VM. 
> The VM environment is not stable and I get a lot of EXPIRED_SESSION_STATE 
> events. I create another session to ZK when I get an expired event. I also 
> have a read/write lock to protect session reads (get/list/... on zk) and 
> writes (connect, close, reconnect zhandle).
> The problem is the session got an EXPIRED_SESSION_STATE event, and when it 
> tried to hold the write lock and reconnect the session, it found another 
> thread was holding the read lock (which was doing a sync list on zk). See 
> the stack below:
> GDBStack:
> Thread 7 (Thread 0x7f838a43a700 (LWP 62845)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at 
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x0000000000636033 in  wait_sync_completion (sc=sc@entry=0x7f8344000af0) 
> at src/mt_adaptor.c:85
> #2 0x0000000000633248 in zoo_wget_children2_ (zh=<optimized out>, 
> path=0x7f83440677a8 "/dict/objects/__services/RLS-GSE/_static_nodes", 
> watcher=0x0, watcherCtx=0x13e6310, strings=0x7f838a4397b0, 
> stat=0x7f838a4398d0) at src/zookeeper.c:3630
> #3 0x000000000045e6ff in ZooKeeperContext::getChildren (this=0x13e6310, 
> path=..., children=children@entry=0x7f838a439890, 
> stat=stat@entry=0x7f838a4398d0) at zookeeper_context.cpp:xxx
> This sync list call didn't return ZINVALIDSTATE but hung. Anyone know the problem?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)