I have run across a repeatable crash in a program that uses ORBit2 on Windows. The crash is caused by code for the relatively new timeout functionality added by Jules Colding. For some reason I see this crash only in this one specific program (that I just recently started testing on Windows), not in other programs that also use ORBit2 heavily.
In a nutshell, the problem is that giop_thread_free() frees a GIOPThread while there still is an outstanding timeout added with giop_timeout_add() for a connection whose thread is that thread. This leads to heap corruption (references to freed memory). I added some debugging printouts to the linc2 and GIOP code. Here is an annotated dump of the output: First just some introductory logging. As you see the debugging output I added prints GIOPConnection (== a LinkConnection subclass) struct pointers, reference counts of it, and the tdata (GIOPThread*) pointer of the LinkConnection. link_connection_ref_T: cnx=062FB320 ref_count=1 tdata=00000000 link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000 link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000 link_connection_ref_T: cnx=062FB320 ref_count=2 tdata=00000000 link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000 link_connection_unref_unlock: cnx=062FB320 ref_count=2 tdata=00000000 link_connection_init: cnx=062FB370 link_connection_ref_T: cnx=062FB370 ref_count=1 tdata=00000000 link_connection_ref: cnx=062FB370 ref_count=2 tdata=00000000 link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=00000000 link_connection_ref: cnx=062FB370 ref_count=2 tdata=00000000 Here giop_timeout_add() is called for a GIOPConnection: giop_timeout_add:1432: lcnx=062FB370 tdata=06334710 tdata->lock=06335A88 link_connection_ref_T: cnx=062FB370 ref_count=3 tdata=06334710 link_connection_unref_unlock: cnx=062FB370 ref_count=4 tdata=06334710 link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=06334710 Now comes the interesting parts. giop_thread_free() gets called for the tdata of the GIOPConnection that was passed to giop_timeout_add() above. giop_thread_free:343: tdata=06334710 tdata->lock=06335A88 giop_thread_free:350: tdata=06334710 tdata->lock=00000000 giop_thread_free:363: FREEING tdata=06334710 Other stuff... link_connection_ref_T: cnx=062FB320 ref_count=1 tdata=00000000 link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000 link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000 link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000 giop_timeout_add:1432: lcnx=062FB320 tdata=06335910 tdata->lock=06335A88 link_connection_ref_T: cnx=062FB320 ref_count=3 tdata=06335910 link_connection_unref_unlock: cnx=062FB320 ref_count=4 tdata=06335910 link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=06335910 giop_thread_free:343: tdata=06335910 tdata->lock=06335A88 giop_thread_free:350: tdata=06335910 tdata->lock=00000000 giop_thread_free:363: FREEING tdata=06335910 Now then the timeout fires for connection 062FB370. The tdata pointer that gets passed to giop_timeout() is the one that was freed in the first giop_thread_free() above. So giop_timeout() is using freed memory here. This obviously causes problems sooner or later. (Without changes to the code, the memory happens to still be intact, and the crash comes from the tdata->lock pointer still being NULL after giop_thread_free() set it so.) giop_timeout:1373: cnx=062FB370 tdata=06334710 tdata->lock=00000000 link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=06334710 giop_timeout:1394: cnx=062FB370 tdata=06334710 tdata->lock=00000000 And then *crash* when g_mutex_lock() is called on the NULL pointer. How to solve this? I can think of some alternative approaches: - Keep a "back" pointer to the LinkConnection in the GIOPThread struct. When setting up a timeout for the connection, initialise this pointer. When freeing the thread before the timeout has expired, set the tdata pointer of the corresponding connection to NULL. Michael Meeks says this is ugly as it adds coupling between relatively unconnected concepts. And can we be sure that each thread has just one connection, no idea... - Make sure the timeout is always zero for local connections? The rationale for the timeout code in Jules's ChangeLog entry from 2006-12-05 talks about timeouts being necessary for remote servers only. In my case, the connection is a local one (using TCP, though, no Unix domain sockers on Windows obviously). This would probably solve the problem I see. But can we be sure that this problematic scenario can never happen for remote connections? - Make sure giop_thread_free() is not called on threads associated with connections for which timeouts are outstanding. Is this feasible? It sounds kinda like a clean and obvious solution, but I don't really understand the code that well... Jules, do you have any ideas? Have you ever seen this happen? (If somebody wants to run tests with the same debugging output, I can mail the svn diff.) --tml _______________________________________________ orbit-list mailing list [email protected] http://mail.gnome.org/mailman/listinfo/orbit-list
