On Sun, 25 Oct 2015, Peter Hajdu wrote:
> I try to give it one more attempt with a bit more description about the 
> bug.
> 
> After calling dlclose in _dl_notify_unload_shlib_ group reference counts 
> are decreased by following the object's grpref-list.  Unfortunately the 
> references are removed from the list during the graph traversal.
> 
> dlclose will run all destructors of the unused objects and tries to 
> unload all objects reachable from the closed object's child and 
> grpref-list.  Since the grpref-list references were removed, the unused 
> destructed object stays in memory as garbage.  Next time when this 
> object is loaded it is found in the memory and crashes during load.
> 
> This patch unloads all unused objects instead of following the closed 
> object's child and grpref list.

Thank you for working on this.  After a long rumination, I'd like to 
propose a different diff, seen below.

Your diff changes dlclose() to switch from calling _dl_unload_shlib() to a 
new function _dl_unload_unused().  They both unload refcnt==0 objects: 
_dl_unload_shlib() finds them by walking the dependency tree from the 
selected object, while _dl_unload_unused() just scans the entire list.

So why is the former not sufficient?  As you describe, the problem occurs 
when a grpref is removed which is the last reference to an object. grprefs 
are used guarantee that entire load groups are unloaded all at once and 
not piecemeal.  If later dlopen() adds a dependency to a child in this 
load group, the entire group will be kept even if that child is the last 
real link.  When that dependency is added, the later object takes a grpref 
to the load_object, which is root of the load group.

As you note _dl_run_all_dtors() releases that grpref, but we still know 
where it had pointed: to the load_object of some object being released!  
So we can retain the behavior of _dl_unload_shlib(), but we need to add a 
check for whether our load_object is now unreferenced.  If so, it 
previously had a grpref which has been released, so we need to take down 
the entire load group.

Thus the diff below.  It works with your test setup (thanks for writing 
that!), passes regress/libexec/ld.so/, and chrome hasn't choked on it.  
Can someone who's familiar with the sdl problem case test it there?


Does that make sense?


Again, thank you for pushing on this.  If what I show here works, it's 
because your description triggered an "ahhh, wait a moment..." thought.


Philip


Index: library.c
===================================================================
RCS file: /data/src/openbsd/src/libexec/ld.so/library.c,v
retrieving revision 1.71
diff -u -p -r1.71 library.c
--- library.c   16 Jan 2015 16:18:07 -0000      1.71
+++ library.c   3 Nov 2015 09:09:15 -0000
@@ -59,9 +59,27 @@ void
 _dl_unload_shlib(elf_object_t *object)
 {
        struct dep_node *n;
+       elf_object_t *load_object = object->load_object;
+
+       /*
+        * If our load object has become unreferenced then we lost the
+        * last group reference to it, so the entire group should be taken
+        * down.  The current object is somewhere below load_object in
+        * the child_list tree, so it'll get cleaned up by the recursion.
+        * That means we can just switch here to the load object.
+        */
+       if (load_object != object && OBJECT_REF_CNT(load_object) == 0 &&
+           (load_object->status & STAT_UNLOADED) == 0) {
+               DL_DEB(("unload_shlib switched from %s to %s\n",
+                   object->load_name, load_object->load_name));
+               object = load_object;
+               goto unload;
+       }
+
        DL_DEB(("unload_shlib called on %s\n", object->load_name));
        if (OBJECT_REF_CNT(object) == 0 &&
            (object->status & STAT_UNLOADED) == 0) {
+unload:
                object->status |= STAT_UNLOADED;
                TAILQ_FOREACH(n, &object->child_list, next_sib)
                        _dl_unload_shlib(n->data);
Index: library_mquery.c
===================================================================
RCS file: /data/src/openbsd/src/libexec/ld.so/library_mquery.c,v
retrieving revision 1.49
diff -u -p -r1.49 library_mquery.c
--- library_mquery.c    22 Jan 2015 05:48:17 -0000      1.49
+++ library_mquery.c    3 Nov 2015 09:10:39 -0000
@@ -64,10 +64,27 @@ void
 _dl_unload_shlib(elf_object_t *object)
 {
        struct dep_node *n;
+       elf_object_t *load_object = object->load_object;
+
+       /*
+        * If our load object has become unreferenced then we lost the
+        * last group reference to it, so the entire group should be taken
+        * down.  The current object is somewhere below load_object in
+        * the child_list tree, so it'll get cleaned up by the recursion.
+        * That means we can just switch here to the load object.
+        */
+       if (load_object != object && OBJECT_REF_CNT(load_object) == 0 &&
+           (load_object->status & STAT_UNLOADED) == 0) {
+               DL_DEB(("unload_shlib switched from %s to %s\n",
+                   object->load_name, load_object->load_name));
+               object = load_object;
+               goto unload;
+       }
 
        DL_DEB(("unload_shlib called on %s\n", object->load_name));
        if (OBJECT_REF_CNT(object) == 0 &&
            (object->status & STAT_UNLOADED) == 0) {
+unload:
                object->status |= STAT_UNLOADED;
                TAILQ_FOREACH(n, &object->child_list, next_sib)
                        _dl_unload_shlib(n->data);

Reply via email to