[ https://issues.apache.org/jira/browse/PROTON-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338377#comment-14338377 ]
Rafael H. Schloming commented on PROTON-829: -------------------------------------------- I added that code to get the python tests to pass without valgrind errors, so I'm surprised that removing it doesn't cause issues. I'll try to explain a little bit about what's going on. (I've commented this elsewhere in the code, but I missed this spot. I think andrew pointed out that this might be clearer with a helper function instead of/in addition to a comment.) There are definitely non-debug cases where it is useful to examine the refcount. In particular, a finalizer gets fired when the refcount hits zero, but the finalizer can sometimes run code that creates a new reference to the object that is being finalized. There are a few places where this pattern is very useful: - The finalizer can run some logic that decides whether or not the object should be pooled and if so sticking it in the pool adds a reference. - The finalizer might run some code that posts an event, thereby creating a reference from the collector to the object. - The connection/session/link/delivery/transport objects all have pointers to each other, and to just naively refcount all those pointers would result in circular references and memory leaks. In order to avoid circular references, we don't refcount internal pointers, e.g. the parent->child pointers, instead the finalizers of all the children have logic that runs and figures out if it is safe for the child to go away (basically does a tiny amount of incremental garbage collection). If the finalizer's "garbage collection" finds an internal reference, the parent of that object takes over the last external reference. In the latter case an external pointer to a child may be created again, and to account for this, the incref of all the child objects is overriden to remove the reference that the parent took over. In all these cases, the logic that may/may not create a new reference is run at the beginning of the finalizer. The finalizer then needs to check if a new reference has actually been created in order to figure out whether or not to actually free the resources used by the object or to return. This pattern could perhaps be formalized by distinguishing between finalizers (the part that runs application logic which can create new references) and destructors (the part that actually frees anything malloced in the initializer) in the object code. This should also explain why incref/decref is not necessarily a noop, it is basically retriggering the "gc" portion of the code in the finalizer that decides whether or not an object should really go away or not, and in the rare case that you toggle some state that might influence that gc logic, you need to get it to run again. Clearing the tp work queue is one such case. The reason it is guarded by a check on the refcount is that we also call that function from inside the delivery finalizer, and in that case if the refcount is zero we've already decided that the object should go away. Again, this gc part of the pattern could probably be formalized a bit, e.g. possibly having some sort of mark/sweep style thing that is in fact triggered by external references reaching zero, but probably not a good idea to attempt anytime soon since what complicates this picture is not so much the fact that it isn't formalized, but rather that the engine uses ad-hoc data structures that predate the development of the collections that are part of the object library and therefore has to manually account and search for references in a type specific way rather than using the automatic ref counting and/or search routines available in the collections. This manual/type specific accounting is what makes the "gc" logic a bit brittle, and could possibly be the source of the bug you are seeing. At some point relatively soon (probably after 0.9 but before the next release) I expect to have to update a significant chunk of those data structures to properly use the collections, but until then if you manage to pare this down to a minimal reproducer in terms of proton API calls, that would definitely suggest the bug is in proton and I can probably help debug more from there. As it is, it looks like it could possibly just be a plain old double free style bug in messaging, and removing the above code just masks it by introducing a memory leak. > Possible reference counting bug in pn_clear_tpwork > -------------------------------------------------- > > Key: PROTON-829 > URL: https://issues.apache.org/jira/browse/PROTON-829 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c > Affects Versions: 0.8 > Reporter: Alan Conway > Assignee: Alan Conway > Fix For: 0.9 > > > See QPID-6415 which describes a core dump in the qpid tests that appears when > using the current 0.9 proton master. The qpid tests pass OK with proton 0.8. > The valgrind output in QPID-6415 shows that a connection is deleted while it > is being finalized by a call from pn_connection_unbound to pn_clear_tpwork. > I do not yet understand the details, but removing the following strange code > fixes the problem and passes the proton test suite without valgrind errors: > {noformat} > --- a/proton-c/src/engine/engine.c > +++ b/proton-c/src/engine/engine.c > @@ -690,10 +690,10 @@ void pn_clear_tpwork(pn_delivery_t *delivery) > { > LL_REMOVE(connection, tpwork, delivery); > delivery->tpwork = false; > - if (pn_refcount(delivery) > 0) { > - pn_incref(delivery); > - pn_decref(delivery); > - } > } > } > {noformat} > The code is strange because > a) you should never examine a refcount except for debugging purposes > b) under normal refcounting semantics incref+decref is a no-op. > Is removing this code OK? -- This message was sent by Atlassian JIRA (v6.3.4#6332)