Hello Guacamole Dev Team,

I am writing to report a persistent crash issue we are experiencing
with guacd under load. We have been working to debug this for a while and
have applied several fixes that have improved stability, but we are still
seeing one final, intermittent crash.

*Summary of the Issue*

guacd aborts with SIGABRT when handling a high volume of concurrent RDP
sessions (around 300). The signal is raised via
__pthread_kill_implementation() after glibc's malloc_printerr() detects
inconsistent heap metadata inside free(). The crash occurs in a generic
FreeRDP worker thread, which strongly suggests heap corruption caused by a
race condition or memory bug elsewhere in the application.

We are running on a 16-core system with 128 GB of RAM.

*Environment*

   - *Guacamole Server Version:* 1.6.0
   - *FreeRDP Version:* 2.11.0
   - *Operating System:* RHEL 9 on x86_64
   - *Build:* Custom build using GCC 12.

*Latest Crash Backtrace*

Here is the backtrace from the most recent crash. The crash location has
moved from the RDP disconnect logic to a generic worker thread after our
previous fixes.



Program terminated with signal SIGABRT, Aborted.
#0  0x00007f67e988bedc in __pthread_kill_implementation () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x7f646c598640 (LWP 1496945))]

=== bt ===

#0  0x00007f67e988bedc in __pthread_kill_implementation () from /usr/lib64/libc.so.6
#1  0x00007f67e983eb46 in raise () from /usr/lib64/libc.so.6
#2  0x00007f67e9828833 in abort () from /usr/lib64/libc.so.6
#3  0x00007f67e9829172 in __libc_message.cold () from /usr/lib64/libc.so.6
#4  0x00007f67e9895f87 in malloc_printerr () from /usr/lib64/libc.so.6
#5  0x00007f67e9897c70 in _int_free () from /usr/lib64/libc.so.6
#6  0x00007f67e989a2c5 in free () from /usr/lib64/libc.so.6
#7  0x00007f67e0465507 in BufferPool_Clear () from /opt/zscaler/lib64/libwinpr2.so.2
#8  0x00007f67e04656f6 in BufferPool_Free () from /opt/zscaler/lib64/libwinpr2.so.2
#9  0x00007f67e06bf71f in rfx_context_free () from /opt/zscaler/lib64/libfreerdp2.so.2
#10 0x00007f67e0640003 in codecs_free () from /opt/zscaler/lib64/libfreerdp2.so.2
#11 0x00007f67e0648c3d in rdp_client_disconnect () from /opt/zscaler/lib64/libfreerdp2.so.2
#12 0x00007f67e0639207 in freerdp_disconnect () from /opt/zscaler/lib64/libfreerdp2.so.2
#13 0x00007f67e07be54e in guac_rdp_handle_connection (client=0x7f67d4005870) at rdp.c:676
#14 guac_rdp_client_thread (data=0x7f67d4005870) at rdp.c:944
#15 0x00007f67e988a19a in start_thread () from /usr/lib64/libc.so.6
#16 0x00007f67e990f210 in clone3 () from /usr/lib64/libc.so.6



*Analysis and Troubleshooting Steps Taken*

Our investigation points towards a memory corruption issue, likely a race
condition exposed by the high rate of connection setup and teardown. The
logs around the time of the crash show many "Handshake failed, 'connect'
instruction was not received" errors, indicating this high churn.

We have progressively identified and fixed several bugs:

   1. *Incorrect Cleanup Order:* Initially, we found
   that freerdp_disconnect() was called before gdi_free(), which we corrected.
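
To make item 1 concrete, the following is a minimal sketch of the teardown
pattern, not the actual change to rdp.c. It assumes a hypothetical session
struct and helper names (session_new, session_teardown, session_free); none
of them exist in guacamole-server or FreeRDP. The gdi_state and transport
buffers merely stand in for whatever gdi_free() and freerdp_disconnect()
release. A mutex and a torn_down flag make the teardown run at most once,
even if two threads reach it concurrently.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-connection RDP session. None of these
 * names exist in guacamole-server or FreeRDP. */
typedef struct session {
    pthread_mutex_t lock;  /* serializes teardown across threads */
    bool torn_down;        /* set once teardown has run */
    void* gdi_state;       /* stands in for state released by gdi_free() */
    void* transport;       /* stands in for state released on disconnect */
    int step;              /* counter used only to record ordering */
    int gdi_freed_at;      /* step at which gdi_state was released (0 = never) */
    int disconnected_at;   /* step at which transport was released (0 = never) */
} session;

/* Allocates a session with two placeholder buffers. Returns NULL if the
 * session itself cannot be allocated. */
session* session_new(void) {
    session* s = calloc(1, sizeof(session));
    if (s == NULL)
        return NULL;
    pthread_mutex_init(&s->lock, NULL);
    s->gdi_state = malloc(64);
    s->transport = malloc(64);
    return s;
}

/* Releases both resources exactly once, GDI-like state first.
 * Returns true only for the caller that actually performed the teardown. */
bool session_teardown(session* s) {
    assert(s != NULL);
    bool did_work = false;

    pthread_mutex_lock(&s->lock);
    if (!s->torn_down) {
        /* Step 1: release the GDI-like state (the corrected order). */
        free(s->gdi_state);
        s->gdi_state = NULL;
        s->gdi_freed_at = ++s->step;

        /* Step 2: only then release the connection-like state. */
        free(s->transport);
        s->transport = NULL;
        s->disconnected_at = ++s->step;

        s->torn_down = true;
        did_work = true;
    }
    pthread_mutex_unlock(&s->lock);

    return did_work;
}

/* Tears down (if not already done) and frees the session itself. */
void session_free(session* s) {
    session_teardown(s);
    pthread_mutex_destroy(&s->lock);
    free(s);
}
```

Calling session_teardown() twice on the same session returns true the first
time and false the second, and the GDI stand-in is always released before
the transport stand-in, so a repeated or concurrent teardown cannot free
either buffer twice.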


*Request for Help*

We would greatly appreciate it if the community could review our analysis
and the suspected root cause.

   - Does the analysis of the race condition in print-job.c seem correct?
   - Are there any other known issues or areas of the code we should
   investigate that could cause this type of heap corruption under heavy load?

We are happy to provide more detailed logs, code snippets, or run further
tests as needed.

Thank you for your time and assistance.


Best Regards,
Dilip

