Hi Tor, On 08/05/2008, at 17.15, Tor Lillqvist wrote:
> I have run across a repeatable crash in a program that uses ORBit2 on > Windows. The crash is caused by code for the relatively new timeout > functionality added by Jules Colding. For some reason I see this crash > only in this one specific program (that I just recently started > testing on Windows), not in other programs that also use ORBit2 > heavily. > > In a nutshell, the problem is that giop_thread_free() frees a > GIOPThread while there still is an outstanding timeout added with > giop_timeout_add() for a connection whose thread is that thread. This > leads to heap corruption (references to freed memory). > > I added some debugging printouts to the linc2 and GIOP code. Here is > an annotated dump of the output: > > First just some introductory logging. As you see the debugging output > I added prints GIOPConnection (== a LinkConnection subclass) struct > pointers, reference counts of it, and the tdata (GIOPThread*) pointer > of the LinkConnection. > > link_connection_ref_T: cnx=062FB320 ref_count=1 tdata=00000000 > link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000 > link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000 > link_connection_ref_T: cnx=062FB320 ref_count=2 tdata=00000000 > link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000 > link_connection_unref_unlock: cnx=062FB320 ref_count=2 tdata=00000000 > link_connection_init: cnx=062FB370 > link_connection_ref_T: cnx=062FB370 ref_count=1 tdata=00000000 > link_connection_ref: cnx=062FB370 ref_count=2 tdata=00000000 > link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=00000000 > link_connection_ref: cnx=062FB370 ref_count=2 tdata=00000000 > > Here giop_timeout_add() is called for a GIOPConnection: > > giop_timeout_add:1432: lcnx=062FB370 tdata=06334710 tdata- > >lock=06335A88 > > link_connection_ref_T: cnx=062FB370 ref_count=3 tdata=06334710 > link_connection_unref_unlock: cnx=062FB370 ref_count=4 tdata=06334710 > link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=06334710 > > Now comes the interesting parts. giop_thread_free() gets called for > the tdata of the GIOPConnection that was passed to giop_timeout_add() > above. > > giop_thread_free:343: tdata=06334710 tdata->lock=06335A88 > giop_thread_free:350: tdata=06334710 tdata->lock=00000000 > giop_thread_free:363: FREEING tdata=06334710 > > Other stuff... > > link_connection_ref_T: cnx=062FB320 ref_count=1 tdata=00000000 > link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000 > link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000 > link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000 > giop_timeout_add:1432: lcnx=062FB320 tdata=06335910 tdata- > >lock=06335A88 > link_connection_ref_T: cnx=062FB320 ref_count=3 tdata=06335910 > link_connection_unref_unlock: cnx=062FB320 ref_count=4 tdata=06335910 > link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=06335910 > giop_thread_free:343: tdata=06335910 tdata->lock=06335A88 > giop_thread_free:350: tdata=06335910 tdata->lock=00000000 > giop_thread_free:363: FREEING tdata=06335910 > > Now then the timeout fires for connection 062FB370. The tdata pointer > that gets passed to giop_timeout() is the one that was freed in the > first giop_thread_free() above. So giop_timeout() is using freed > memory here. This obviously causes problems sooner or later. (Without > changes to the code, the memory happens to still be intact, and the > crash comes from the tdata->lock pointer still being NULL after > giop_thread_free() set it so.) > > giop_timeout:1373: cnx=062FB370 tdata=06334710 tdata->lock=00000000 > link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=06334710 > giop_timeout:1394: cnx=062FB370 tdata=06334710 tdata->lock=00000000 > > And then *crash* when g_mutex_lock() is called on the NULL pointer. > > How to solve this? I can think of some alternative approaches: > > - Keep a "back" pointer to the LinkConnection in the GIOPThread > struct. When setting up a timeout for the connection, initialise this > pointer. When freeing the thread before the timeout has expired, set > the tdata pointer of the corresponding connection to NULL. Michael > Meeks says this is ugly as it adds coupling between relatively > unconnected concepts. And can we be sure that each thread has just one > connection, no idea... > > - Make sure the timeout is always zero for local connections? The > rationale for the timeout code in Jules's ChangeLog entry from > 2006-12-05 talks about timeouts being necessary for remote servers > only. In my case, the connection is a local one (using TCP, though, no > Unix domain sockers on Windows obviously). This would probably solve > the problem I see. But can we be sure that this problematic scenario > can never happen for remote connections? > > - Make sure giop_thread_free() is not called on threads associated > with connections for which timeouts are outstanding. Is this feasible? > It sounds kinda like a clean and obvious solution, but I don't really > understand the code that well... > > Jules, do you have any ideas? Have you ever seen this happen? I've never seen this happen before. I would agree that your proposal number 3 is the way to go, but I can not clearly see how to implement it. Michael? Best regards, jules _______________________________________________ orbit-list mailing list [email protected] http://mail.gnome.org/mailman/listinfo/orbit-list
