On Fri, Aug 8, 2025 at 3:52 AM Timpl, Markus <[email protected]> wrote:
> Hello,
>
> I have a test where multiple users (15 users join, wait 20 seconds,
> disconnect/rejoin, and so on) join the same connection (RDP in this
> case) in quick succession.
> This test reliably deadlocks guacd (1.6.0, unmodified). The two interesting
> threads are:
>
> #0 pthread_rwlock_rdlock from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 guac_rwlock_acquire_read_lock at rwlock.c:228
> #2 guac_display_layer_get_bounds at display-layer.c:51
> #3 guac_display_dup at display.c:259
> #4 guac_rdp_join_pending_handler at client.c:135
> #5 guac_client_promote_pending_users at client.c:178
> #6 guac_client_pending_users_thread at client.c:246
> #7 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
> #8 clone from /lib/x86_64-linux-gnu/libc.so.6
>
> --> has __pending_users_lock (which breaks adding/removing users), waits for
> pending_frame.lock, has read lock on last_frame.lock
>
>
> #0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 guac_rwlock_acquire_write_lock at rwlock.c:186
> #2 guac_display_end_multiple_frames at display-flush.c:323
> #3 guac_display_worker_thread at display-worker.c:461
> #4 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
> #5 clone from /lib/x86_64-linux-gnu/libc.so.6
>
> --> has pending_frame.lock, waits for last_frame.lock
>
> I tried to fix this by acquiring pending_frame.lock in guac_display_dup
> before getting the last_frame.lock.
> This makes things a lot better.
>
> But now I hit another issue reliably after about a minute.
> Somehow a thread is stuck in a socket write operation:
> #0 write from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 guac_socket_fd_write at socket-fd.c:109
> #2 guac_socket_fd_flush at socket-fd.c:189
> #3 guac_socket_fd_write_buffered at socket-fd.c:263
> #4 guac_socket_fd_write_handler at socket-fd.c:318
> #5 __guac_socket_write at socket.c:91
> #6 guac_socket_write at socket.c:107
> #7 __write_chunk_callback at socket-broadcast.c:135
> #8 guac_client_foreach_pending_user at client.c:560
> #9 __guac_socket_broadcast_write_handler at socket-broadcast.c:173
> #10 __guac_socket_write at socket.c:91
> #11 guac_socket_write at socket.c:107
> #12 guac_socket_flush_base64 at socket.c:341
> #13 guac_socket_write_base64 at socket.c:372
> #14 guac_protocol_send_blob at protocol.c:262
> #15 guac_png_flush_data at encode-png.c:79
> #16 guac_png_write_data at encode-png.c:114
> #17 guac_png_cairo_write_handler at encode-png.c:162
> #18 ?? from /usr/lib/x86_64-linux-gnu/libcairo.so.2
> #19 png_write_chunk_data from /usr/lib/x86_64-linux-gnu/libpng16.so.16
> #20 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
> #21 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
> #22 ?? from /usr/lib/x86_64-linux-gnu/libpng16.so.16
> #23 png_write_row from /usr/lib/x86_64-linux-gnu/libpng16.so.16
> #24 png_write_image from /usr/lib/x86_64-linux-gnu/libpng16.so.16
> #25 ?? from /usr/lib/x86_64-linux-gnu/libcairo.so.2
> #26 cairo_surface_write_to_png_stream from /usr/lib/x86_64-linux-gnu/libcairo.so.2
> #27 guac_png_cairo_write at encode-png.c:195
> #28 guac_png_write at encode-png.c:300
> #29 guac_client_stream_png at client.c:799
> #30 guac_display_dup at display.c:275
> #31 guac_rdp_join_pending_handler at client.c:135
> #32 guac_client_promote_pending_users at client.c:178
> #33 guac_client_pending_users_thread at client.c:246
> #34 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
> #35 clone from /lib/x86_64-linux-gnu/libc.so.6
>
> This thread does not exit anymore.
> It stays stuck even after all users have disconnected.
> No new users can join anymore because the thread holds the
> __pending_users_lock.
> Join threads:
> #0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 guac_rwlock_acquire_write_lock at rwlock.c:186
> #2 guac_client_add_pending_user at client.c:440
> #3 guac_client_add_user at client.c:479
> #4 guac_user_handle_connection at user-handshake.c:339
> #5 guacd_user_thread at proc.c:99
> #6 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
> #7 clone from /lib/x86_64-linux-gnu/libc.so.6
>
> Also, no existing users can be removed from the connection for the same
> reason.
> Remove threads:
> #0 pthread_rwlock_wrlock from /lib/x86_64-linux-gnu/libpthread.so.0
> #1 guac_rwlock_acquire_write_lock at rwlock.c:186
> #2 guac_client_remove_user at client.c:497
> #3 guac_user_handle_connection at user-handshake.c:364
> #4 guacd_user_thread at proc.c:99
> #5 start_thread from /lib/x86_64-linux-gnu/libpthread.so.0
> #6 clone from /lib/x86_64-linux-gnu/libc.so.6
>
> I am not sure how to fix this. Any ideas? Adding timeouts to the currently
> blocking socket call is the only solution I can come up with.
> But after discovering the discussion in
> https://lists.apache.org/thread/94xrxq9w3kd4otcpdn3fh0jwn603m4wp it seems
> like this might not be the preferred way to fix this.
>

Hello, Markus,

It seems like you're on the right trail of a deadlock somewhere in the guacd
code. I don't have any great suggestions except to keep tracking down where a
lock is acquired and never released, or where a race condition leaves a lock
held with no way to release it. I think we have a couple of these in the 1.6.0
version of the code, as reported by you and at least one other person on the
mailing lists and/or in Jira.

Maybe adding some guac_client_log and/or guac_user_log calls around the places
where locks are acquired and released would help track down which locks are
being acquired but not released?

-Nick
