Problem with the timeout code

Tor Lillqvist Thu, 08 May 2008 08:15:52 -0700

I have run across a repeatable crash in a program that uses ORBit2 on
Windows. The crash is caused by code for the relatively new timeout
functionality added by Jules Colding. For some reason I see this crash
only in this one specific program (that I just recently started
testing on Windows), not in other programs that also use ORBit2
heavily.


In a nutshell, the problem is that giop_thread_free() frees a
GIOPThread while there still is an outstanding timeout added with
giop_timeout_add() for a connection whose thread is that thread. This
leads to heap corruption (references to freed memory).

I added some debugging printouts to the linc2 and GIOP code. Here is
an annotated dump of the output:

First just some introductory logging. As you see the debugging output
I added prints GIOPConnection (== a LinkConnection subclass) struct
pointers, reference counts of it, and the tdata (GIOPThread*) pointer
of the LinkConnection.

link_connection_ref_T: cnx=062FB320 ref_count=1 tdata=00000000
link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000
link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000
link_connection_ref_T: cnx=062FB320 ref_count=2 tdata=00000000
link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000
link_connection_unref_unlock: cnx=062FB320 ref_count=2 tdata=00000000
link_connection_init: cnx=062FB370
link_connection_ref_T: cnx=062FB370 ref_count=1 tdata=00000000
link_connection_ref: cnx=062FB370 ref_count=2 tdata=00000000
link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=00000000
link_connection_ref: cnx=062FB370 ref_count=2 tdata=00000000

Here giop_timeout_add() is called for a GIOPConnection:

giop_timeout_add:1432: lcnx=062FB370 tdata=06334710 tdata->lock=06335A88

link_connection_ref_T: cnx=062FB370 ref_count=3 tdata=06334710
link_connection_unref_unlock: cnx=062FB370 ref_count=4 tdata=06334710
link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=06334710

Now comes the interesting parts. giop_thread_free() gets called for
the tdata of the GIOPConnection that was passed to giop_timeout_add()
above.

giop_thread_free:343: tdata=06334710 tdata->lock=06335A88
giop_thread_free:350: tdata=06334710 tdata->lock=00000000
giop_thread_free:363: FREEING tdata=06334710

Other stuff...

link_connection_ref_T: cnx=062FB320 ref_count=1 tdata=00000000
link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000
link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=00000000
link_connection_ref: cnx=062FB320 ref_count=2 tdata=00000000
giop_timeout_add:1432: lcnx=062FB320 tdata=06335910 tdata->lock=06335A88
link_connection_ref_T: cnx=062FB320 ref_count=3 tdata=06335910
link_connection_unref_unlock: cnx=062FB320 ref_count=4 tdata=06335910
link_connection_unref_unlock: cnx=062FB320 ref_count=3 tdata=06335910
giop_thread_free:343: tdata=06335910 tdata->lock=06335A88
giop_thread_free:350: tdata=06335910 tdata->lock=00000000
giop_thread_free:363: FREEING tdata=06335910

Now then the timeout fires for connection 062FB370. The tdata pointer
that gets passed to giop_timeout() is the one that was freed in the
first giop_thread_free() above. So giop_timeout() is using freed
memory here. This obviously causes problems sooner or later. (Without
changes to the code, the memory happens to still be intact, and the
crash comes from the tdata->lock pointer still being NULL after
giop_thread_free() set it so.)

giop_timeout:1373: cnx=062FB370 tdata=06334710 tdata->lock=00000000
link_connection_unref_unlock: cnx=062FB370 ref_count=3 tdata=06334710
giop_timeout:1394: cnx=062FB370 tdata=06334710 tdata->lock=00000000

And then *crash* when g_mutex_lock() is called on the NULL pointer.

How to solve this? I can think of some alternative approaches:

- Keep a "back" pointer to the LinkConnection in the GIOPThread
struct. When setting up a timeout for the connection, initialise this
pointer. When freeing the thread before the timeout has expired, set
the tdata pointer of the corresponding connection to NULL. Michael
Meeks says this is ugly as it adds coupling between relatively
unconnected concepts. And can we be sure that each thread has just one
connection, no idea...

- Make sure the timeout is always zero for local connections? The
rationale for the timeout code in Jules's ChangeLog entry from
2006-12-05 talks about timeouts being necessary for remote servers
only. In my case, the connection is a local one (using TCP, though, no
Unix domain sockers on Windows obviously). This would probably solve
the problem I see. But can we be sure that this problematic scenario
can never happen for remote connections?

- Make sure giop_thread_free() is not called on threads associated
with connections for which timeouts are outstanding. Is this feasible?
It sounds kinda like a clean and obvious solution, but I don't really
understand the code that well...

Jules, do you have any ideas? Have you ever seen this happen?

(If somebody wants to run tests with the same debugging output, I can
mail the svn diff.)

--tml
_______________________________________________
orbit-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/orbit-list

Problem with the timeout code

Reply via email to