Yes, that's good.

On 06/28/2015 07:00 AM, Ilia Mirkin wrote:
How about:

/* Use dupfd in hash table, to avoid errors if the original fd gets
* closed by its owner. The hash key needs to live at least as long as
* the screen.
*/


On Sun, Jun 28, 2015 at 12:57 AM, Mario Kleiner
<mario.kleiner...@gmail.com> wrote:
On 06/28/2015 06:25 AM, Ilia Mirkin wrote:

But that's the thing... in this case, the fd lifetime is owned by the
X server. In fact, nouveau doesn't touch that fd in mesa beyond
dup'ing it, and then exclusively using the dup'd fd.


True, it's not owned by nouveau, but instead by mesa core code? I think the
x-server itself is only involved in authenticating the fd under DRI2, after
mesa has opened it? Although under DRI3 the ready made fd comes from a
xcb_dri3_open() trip to the server.

See the open() call on the device file:
dri2CreateScreen() in mesa/src/glx/dri2_glx.c

xcb_dri3_open():
dri3_create_screen() in mesa/src/glx/dri3_glx.c

And the close() calls for the fd's are also inside those files.

Maybe those winsys functions we fixed here are also involved in things like
EGL for wayland etc.? Haven't checked this.

Other than that your new commit message is fine. Also i am possibly just
confused about this, and your commit message is an improvement in clarity
anyway :)

So i think it is fine to leave it as is.

thanks,
-mario



On Sun, Jun 28, 2015 at 12:23 AM, Mario Kleiner
<mario.kleiner...@gmail.com> wrote:

Ok, maybe one thing for the commit message, as i just made the same mixup
in
my reply: The fd's are not owned by the x-server, but by the mesa screens
(pipe screens? dri screens?) they represent, so if such a screen goes
away,
the fd goes away. Using "X server" would be confusing, especially under
dri3, although each such "mesa screen" corresponds to a x-screen.

-mario


On 06/28/2015 06:03 AM, Mario Kleiner wrote:


On 06/28/2015 05:41 AM, Ilia Mirkin wrote:


Oh duh. Thanks for the super-detailed explanation. To rephrase what
you said in a slightly shorter manner:

See bug https://bugs.freedesktop.org/show_bug.cgi?id=79823 for why we
need to dupfd (which we are already, although radeon might not be).


Radeon doesn't so far. My 2nd patch for radeon from earlier today adds
that.

Except instead of sticking the dup'd fd into the hash table, whose
lifetime matches that of the device, we stick the other one in, which
is effectively owned by the X server. As a result, those fstat's might
fail down the line, and all sorts of hell will break loose.


Yes. Essentially we should make sure the fd's we keep around have the
same lifetime as the data structure in which we need to use it. That was
the point, the server owned fd went away while we still needed it.

Would you be opposed to me rewriting your commit message to basically
reflect the above? Perhaps your original one did as well, but it
wasn't clear to me. I'd also like to throw some assert's in to make
sure the fstat's don't fail -- does that sound reasonable?


Oh please, yes! My commit messages often have this disease that they are
either too terse, when i think the problem is obvious, or too long and
somewhat convoluted when i'm going overboard in the other direction. Any
editing/shortening for clarity more than happily accepted :)

thanks,
-mario

Another solution, btw, is to stick <stat.st_dev, stat.st_ino,
stat.st_rdev> as the real key in the hash, although that'd involve
making pointers. Probably not worth it.

Cheers,

     -ilia

On Sat, Jun 27, 2015 at 11:13 PM, Mario Kleiner
<mario.kleiner...@gmail.com> wrote:


On 06/28/2015 03:48 AM, Ilia Mirkin wrote:



On Fri, Jun 5, 2015 at 9:36 AM, Mario Kleiner
<mario.kleiner...@gmail.com> wrote:



The dup'ed fd owned by the nouveau_screen for a device node
must also be used as key for the winsys hash table, instead
of using the original fd passed in for a screen, to make
multi-x-screen ZaphodHeads configurations work on nouveau.

This prevents the following crash scenario that was observed
when a dynamically loaded rendering plugin used OpenGL on a
ZaphodHeads setup, e.g., on a dual x-screen setup. At first
load the plugin worked, but after unloading and reloading it,
the next rendering operation crashed:

1. Client, e.g., a plugin, calls glXQueryVersion.

2. DRI screens for the x-screens 0 and 1 are created, one shared
       nouveau_screen is created for the shared device node of both
       screens, but the original fd of x-screen 0 is used as
identifying
       key in the hash table, instead of the dup()ed fd of x-screen 0
       which is owned by the nouveau_screen. nouveau_screen's
refcount
       is now 2.




See below, but it shouldn't matter which fd gets used.


3. Regular rendering happens by the client plugin, then the plugin
       gets unloaded.

4. XCloseDisplay(). x-screen 0 gets its DRI screen destroyed,
       nouveau_drm_screen_unref() drops the refcount to 1, calling
mesa
       code then closes the fd of x-screen 0, so now the fd which is
       used as key in the hash table is invalid. x-screen 1 gets
       destroyed, nouveau_drm_screen_unref() drops the refcount to 0,
       the nouveau_screen gets destroyed, but removal of its entry
       in the hash table fails, because the invalid fd in the hash
       table no longer matches anything (fstat() on the fd is used
       for hashing and key comparison, but fstat() on an already
closed
       fd fails and returns bogus results). x-screen 1 closes its fd.

       Now all fd's are closed, the nouveau_screen destroyed, but
       there is a dangling reference to the nouveau_screen in the
       hash table.

5. Some OpenGL client plugin gets loaded again and calls
       glXQueryVersion. Step 2 above repeats, but because a
       dangling reference with a matching fd is found in the winsys
       hash table, no new nouveau_screen is created this time.
Instead
       the invalid pointer to the old nouveau_screen is recycled,
       which points to nirvana -> Crash.

This problem is avoided by use of the dup()ed fd which is
owned by the nouveau_screen and has the same lifetime as
the nouveau_screen itself.




I need to think about this some more, but... this shouldn't happen :)

In fact, the whole dupfd thing was added there for ZaphodHeads
screens
in the first place. See
https://bugs.freedesktop.org/show_bug.cgi?id=79823 and commit
a59f2bb17bcc which fixed it.

Note that the hash has the following hash/eq functions:

static unsigned hash_fd(void *key)
{
        int fd = pointer_to_intptr(key);
        struct stat stat;
        fstat(fd, &stat);

        return stat.st_dev ^ stat.st_ino ^ stat.st_rdev;
}

static int compare_fd(void *key1, void *key2)
{
        int fd1 = pointer_to_intptr(key1);
        int fd2 = pointer_to_intptr(key2);
        struct stat stat1, stat2;
        fstat(fd1, &stat1);
        fstat(fd2, &stat2);

        return stat1.st_dev != stat2.st_dev ||
               stat1.st_ino != stat2.st_ino ||
               stat1.st_rdev != stat2.st_rdev;
}

so fd and dupfd should get hashed to the same thing. I suspect
there's
something else going on in your application...


My application is a set of dynamically loaded plugins running inside
another
host app (Psychtoolbox-3 inside GNU/Octave), so what happens often is
that
the OpenGL using plugin gets unloaded and reloaded at runtime, so a
after a
OpenGL session has ended with a XCloseDisplay() tearing the winsys
down, you
can easily have it restart at plugin reload with another XOpenDisplay
->
glXQueryVersion -> .... sequence, where the new call to
glXQueryVersion will
trigger a recreation of the winsys, which will find the stale entry
in the
hash table pointing to nowhere instead of the previously released
nouveau_screen -> use-after-free -> boom!

The reason this fails is because during destruction of the 2nd, 3rd
etc.
x-screen, the already closed fd associated with the 1st x-screen is
fed into
the compare_fd function, so fstat() errors out on the invalid fd. I
added
printf's etc. on both nouveau and now radeon to verify the fstat
gives me
EBADF errors. So the hash calculation goes wrong when trying to find
the
matching element in the hash table with a fd that has a matching hash
-> The
element which should be removed is ignored/not removed because it
contains
an already closed fd for which no proper hash can be calculated
anymore ->
hash comparison during search goes wrong.

This is because multiple x-screens, e.g., 2 x-screens are destroyed
in order
0, 1 by FreeScreenConfigs() as part of XCloseDisplay().
FreeScreenConfigs()
calls dri2DestroyScreen() in dri2.c or dri3_destroy_screen in dri3.c,
which
are essentially identical, e.g.,:

static void
dri2DestroyScreen(struct glx_screen *base)
{
      struct dri2_screen *psc = (struct dri2_screen *) base;

      /* Free the direct rendering per screen data */
      (*psc->core->destroyScreen) (psc->driScreen);

--> This core->destroyScreen calls eventually into the winsys, which
then
drops the refcount of the nouveau_screen or radeon winsys structure
by one,
and tries to release the struct and remove the hash table entry once
the
refcount drops to zero.

      driDestroyConfigs(psc->driver_configs);
      close(psc->fd);

-> This close() closes the fd of a screen immediately after closing
down
that screen. At the time screen 1 is closed down, the fd associated
with
screen 0 is already closed.

      free(psc);
}

So you have this sequence during creation:

glXQueryVersion() -> AllocAndFetchScreenConfigs():

create screen 0 and its fd, e.g., fd==5, dup(fd), e.g., fd==6 for the
nouveau_screen, but store x-screen 0's fd==5 *itself* in the hash
table as
key.

create screen 1. This now creates an fd 7

=> refcount is now 2.


And this sequence during destruction at XCloseDisplay();

FreeScreenConfigs():

free screen 0: core->destroyScreen() -> ... ->
nouveau_drm_screen_unref() =>
refcount-- is now 1
close(fd==5) of screen 0 and thereby the fd==5 stored inside the hash
table
as key. From now on using fcntl(fd==5) for hash calculation on that
element
will malfunction.

free screen 1: -> nouveau_drm_screen_unref -> refcount-- is now 0. ->
Try to
remove hash table entry => fails because fcntl(fd==5) on the already
closed
fd==5 of screen 0 fails with EBADF. => Hash table keeps its stale
element
referencing the struct nouveau_screen.

Free nouveau_screen struct (close dup()ed fd==6 for nouveau_screen,
close
down stuff, free the struct).
-> close(fd==7) associated with screen 1.

Now all fd's (5,6,7) are closed, the struct is gone, the hash table
contains
a stale element with a no longer existent fd==5 and a pointer to an
already
free()d struct nouveau_screen.

Now we reload the plugin and do glXQueryVersion() again, so
AllocAndFetchScreenConfigs() is executed again to create the winsys
etc. for
screens 0 and 1.

create screen 0 and its fd: Here the first unused fd in the
filedescriptor
table is recycled, which happens to be fd==5, just as in the first
session
with the plugin! Now suddenly fd=5 is again the valid fd for x-screen
0, and
it points to the same /dev/drm/card0 device file as before. That
means when
nouveau_drm_screen_create(fd==5) is called, it finds the stale
element in
the hash table from the previous session, because the previously
closed fd 5
in that element is now valid and open again and points to the same
device
file as in the previous session, ergo has the same hash etc.
nouveau_drm_screen_create() concludes that a fully initialized
nouveau_screen already exists for this session, except it doesn't
exist -
the pointer in the hash table points to invalid previously freed
memory.
When it tries to access that memory we have a use-after-free and the
application segfaults.

So it depends a bit on what happens in the applications memory
between the
two consecutive OpenGL sessions etc. how long it takes for an invalid
memory
access to happen, but eventually it hits invalid data. Maybe
occassionally
it gets lucky and the freed memory is still accessible and the struct
intact, so stuff works by chance.

If the application happened to open other files inbetween the 1st and
2nd
session, then the relevant fd in the 2nd session may not be exactly
the same
as in the first session, because recycling that fd didn't work, so we
are
only left with a dead but harmless entry in the hash table instead of
a
crasher. I assume something like that prevented my crashes on radeon
in the
past, because when i wasn't expecting/looking for trouble i didn't use
a
test sequence carefully made to trigger the bug for certain.

Similar things happen on radeon for the same reason, ergo the 2nd
patch with
a similar fix.

I don't have the gdb backtrace for nouveau anymore, but the traces for
radeon are attached to this mail to show you the flow of execution,
similar
to nouveau.

-mario
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Reply via email to