How about: /* Use dupfd in hash table, to avoid errors if the original fd gets * closed by its owner. The hash key needs to live at least as long as * the screen. */
On Sun, Jun 28, 2015 at 12:57 AM, Mario Kleiner <mario.kleiner...@gmail.com> wrote: > On 06/28/2015 06:25 AM, Ilia Mirkin wrote: >> >> But that's the thing... in this case, the fd lifetime is owned by the >> X server. In fact, nouveau doesn't touch that fd in mesa beyond >> dup'ing it, and then exclusively using the dup'd fd. >> > > True, it's not owned by nouveau, but instead by mesa core code? I think the > x-server itself is only involved in authenticating the fd under DRI2, after > mesa has opened it? Although under DRI3 the ready made fd comes from a > xcb_dri3_open() trip to the server. > > See the open() call on the device file: > dri2CreateScreen() in mesa/src/glx/dri2_glx.c > > xcb_dri3_open(): > dri3_create_screen() in mesa/src/glx/dri3_glx.c > > And the close() calls for the fd's are also inside those files. > > Maybe those winsys functions we fixed here are also involved in things like > EGL for wayland etc.? Haven't checked this. > > Other than that your new commit message is fine. Also i am possibly just > confused about this, and your commit message is an improvement in clarity > anyway :) > > So i think it is fine to leave it as is. > > thanks, > -mario > > > >> On Sun, Jun 28, 2015 at 12:23 AM, Mario Kleiner >> <mario.kleiner...@gmail.com> wrote: >>> >>> Ok, maybe one thing for the commit message, as i just made the same mixup >>> in >>> my reply: The fd's are not owned by the x-server, but by the mesa screens >>> (pipe screens? dri screens?) they represent, so if such a screen goes >>> away, >>> the fd goes away. Using "X server" would be confusing, especially under >>> dri3, although each such "mesa screen" corresponds to a x-screen. >>> >>> -mario >>> >>> >>> On 06/28/2015 06:03 AM, Mario Kleiner wrote: >>>> >>>> >>>> On 06/28/2015 05:41 AM, Ilia Mirkin wrote: >>>>> >>>>> >>>>> Oh duh. Thanks for the super-detailed explanation. To rephrase what >>>>> you said in a slightly shorter manner: >>>>> >>>>> See bug https://bugs.freedesktop.org/show_bug.cgi?id=79823 for why we >>>>> need to dupfd (which we are already, although radeon might not be). >>>>> >>>> >>>> Radeon doesn't so far. My 2nd patch for radeon from earlier today adds >>>> that. >>>> >>>>> Except instead of sticking the dup'd fd into the hash table, whose >>>>> lifetime matches that of the device, we stick the other one in, which >>>>> is effectively owned by the X server. As a result, those fstat's might >>>>> fail down the line, and all sorts of hell will break loose. >>>>> >>>> >>>> Yes. Essentially we should make sure the fd's we keep around have the >>>> same lifetime as the data structure in which we need to use it. That was >>>> the point, the server owned fd went away while we still needed it. >>>> >>>>> Would you be opposed to me rewriting your commit message to basically >>>>> reflect the above? Perhaps your original one did as well, but it >>>>> wasn't clear to me. I'd also like to throw some assert's in to make >>>>> sure the fstat's don't fail -- does that sound reasonable? >>>>> >>>> >>>> Oh please, yes! My commit messages often have this disease that they are >>>> either too terse, when i think the problem is obvious, or too long and >>>> somewhat convoluted when i'm going overboard in the other direction. Any >>>> editing/shortening for clarity more than happily accepted :) >>>> >>>> thanks, >>>> -mario >>>> >>>>> Another solution, btw, is to stick <stat.st_dev, stat.st_ino, >>>>> stat.st_rdev> as the real key in the hash, although that'd involve >>>>> making pointers. Probably not worth it. >>>>> >>>>> Cheers, >>>>> >>>>> -ilia >>>>> >>>>> On Sat, Jun 27, 2015 at 11:13 PM, Mario Kleiner >>>>> <mario.kleiner...@gmail.com> wrote: >>>>>> >>>>>> >>>>>> On 06/28/2015 03:48 AM, Ilia Mirkin wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 5, 2015 at 9:36 AM, Mario Kleiner >>>>>>> <mario.kleiner...@gmail.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> The dup'ed fd owned by the nouveau_screen for a device node >>>>>>>> must also be used as key for the winsys hash table, instead >>>>>>>> of using the original fd passed in for a screen, to make >>>>>>>> multi-x-screen ZaphodHeads configurations work on nouveau. >>>>>>>> >>>>>>>> This prevents the following crash scenario that was observed >>>>>>>> when a dynamically loaded rendering plugin used OpenGL on a >>>>>>>> ZaphodHeads setup, e.g., on a dual x-screen setup. At first >>>>>>>> load the plugin worked, but after unloading and reloading it, >>>>>>>> the next rendering operation crashed: >>>>>>>> >>>>>>>> 1. Client, e.g., a plugin, calls glXQueryVersion. >>>>>>>> >>>>>>>> 2. DRI screens for the x-screens 0 and 1 are created, one shared >>>>>>>> nouveau_screen is created for the shared device node of both >>>>>>>> screens, but the original fd of x-screen 0 is used as >>>>>>>> identifying >>>>>>>> key in the hash table, instead of the dup()ed fd of x-screen 0 >>>>>>>> which is owned by the nouveau_screen. nouveau_screen's >>>>>>>> refcount >>>>>>>> is now 2. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> See below, but it shouldn't matter which fd gets used. >>>>>>> >>>>>>>> >>>>>>>> 3. Regular rendering happens by the client plugin, then the plugin >>>>>>>> gets unloaded. >>>>>>>> >>>>>>>> 4. XCloseDisplay(). x-screen 0 gets its DRI screen destroyed, >>>>>>>> nouveau_drm_screen_unref() drops the refcount to 1, calling >>>>>>>> mesa >>>>>>>> code then closes the fd of x-screen 0, so now the fd which is >>>>>>>> used as key in the hash table is invalid. x-screen 1 gets >>>>>>>> destroyed, nouveau_drm_screen_unref() drops the refcount to 0, >>>>>>>> the nouveau_screen gets destroyed, but removal of its entry >>>>>>>> in the hash table fails, because the invalid fd in the hash >>>>>>>> table no longer matches anything (fstat() on the fd is used >>>>>>>> for hashing and key comparison, but fstat() on an already >>>>>>>> closed >>>>>>>> fd fails and returns bogus results). x-screen 1 closes its fd. >>>>>>>> >>>>>>>> Now all fd's are closed, the nouveau_screen destroyed, but >>>>>>>> there is a dangling reference to the nouveau_screen in the >>>>>>>> hash table. >>>>>>>> >>>>>>>> 5. Some OpenGL client plugin gets loaded again and calls >>>>>>>> glXQueryVersion. Step 2 above repeats, but because a >>>>>>>> dangling reference with a matching fd is found in the winsys >>>>>>>> hash table, no new nouveau_screen is created this time. >>>>>>>> Instead >>>>>>>> the invalid pointer to the old nouveau_screen is recycled, >>>>>>>> which points to nirvana -> Crash. >>>>>>>> >>>>>>>> This problem is avoided by use of the dup()ed fd which is >>>>>>>> owned by the nouveau_screen and has the same lifetime as >>>>>>>> the nouveau_screen itself. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> I need to think about this some more, but... this shouldn't happen :) >>>>>>> >>>>>>> In fact, the whole dupfd thing was added there for ZaphodHeads >>>>>>> screens >>>>>>> in the first place. See >>>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=79823 and commit >>>>>>> a59f2bb17bcc which fixed it. >>>>>>> >>>>>>> Note that the hash has the following hash/eq functions: >>>>>>> >>>>>>> static unsigned hash_fd(void *key) >>>>>>> { >>>>>>> int fd = pointer_to_intptr(key); >>>>>>> struct stat stat; >>>>>>> fstat(fd, &stat); >>>>>>> >>>>>>> return stat.st_dev ^ stat.st_ino ^ stat.st_rdev; >>>>>>> } >>>>>>> >>>>>>> static int compare_fd(void *key1, void *key2) >>>>>>> { >>>>>>> int fd1 = pointer_to_intptr(key1); >>>>>>> int fd2 = pointer_to_intptr(key2); >>>>>>> struct stat stat1, stat2; >>>>>>> fstat(fd1, &stat1); >>>>>>> fstat(fd2, &stat2); >>>>>>> >>>>>>> return stat1.st_dev != stat2.st_dev || >>>>>>> stat1.st_ino != stat2.st_ino || >>>>>>> stat1.st_rdev != stat2.st_rdev; >>>>>>> } >>>>>>> >>>>>>> so fd and dupfd should get hashed to the same thing. I suspect >>>>>>> there's >>>>>>> something else going on in your application... >>>>>>> >>>>>> >>>>>> My application is a set of dynamically loaded plugins running inside >>>>>> another >>>>>> host app (Psychtoolbox-3 inside GNU/Octave), so what happens often is >>>>>> that >>>>>> the OpenGL using plugin gets unloaded and reloaded at runtime, so a >>>>>> after a >>>>>> OpenGL session has ended with a XCloseDisplay() tearing the winsys >>>>>> down, you >>>>>> can easily have it restart at plugin reload with another XOpenDisplay >>>>>> -> >>>>>> glXQueryVersion -> .... sequence, where the new call to >>>>>> glXQueryVersion will >>>>>> trigger a recreation of the winsys, which will find the stale entry >>>>>> in the >>>>>> hash table pointing to nowhere instead of the previously released >>>>>> nouveau_screen -> use-after-free -> boom! >>>>>> >>>>>> The reason this fails is because during destruction of the 2nd, 3rd >>>>>> etc. >>>>>> x-screen, the already closed fd associated with the 1st x-screen is >>>>>> fed into >>>>>> the compare_fd function, so fstat() errors out on the invalid fd. I >>>>>> added >>>>>> printf's etc. on both nouveau and now radeon to verify the fstat >>>>>> gives me >>>>>> EBADF errors. So the hash calculation goes wrong when trying to find >>>>>> the >>>>>> matching element in the hash table with a fd that has a matching hash >>>>>> -> The >>>>>> element which should be removed is ignored/not removed because it >>>>>> contains >>>>>> an already closed fd for which no proper hash can be calculated >>>>>> anymore -> >>>>>> hash comparison during search goes wrong. >>>>>> >>>>>> This is because multiple x-screens, e.g., 2 x-screens are destroyed >>>>>> in order >>>>>> 0, 1 by FreeScreenConfigs() as part of XCloseDisplay(). >>>>>> FreeScreenConfigs() >>>>>> calls dri2DestroyScreen() in dri2.c or dri3_destroy_screen in dri3.c, >>>>>> which >>>>>> are essentially identical, e.g.,: >>>>>> >>>>>> static void >>>>>> dri2DestroyScreen(struct glx_screen *base) >>>>>> { >>>>>> struct dri2_screen *psc = (struct dri2_screen *) base; >>>>>> >>>>>> /* Free the direct rendering per screen data */ >>>>>> (*psc->core->destroyScreen) (psc->driScreen); >>>>>> >>>>>> --> This core->destroyScreen calls eventually into the winsys, which >>>>>> then >>>>>> drops the refcount of the nouveau_screen or radeon winsys structure >>>>>> by one, >>>>>> and tries to release the struct and remove the hash table entry once >>>>>> the >>>>>> refcount drops to zero. >>>>>> >>>>>> driDestroyConfigs(psc->driver_configs); >>>>>> close(psc->fd); >>>>>> >>>>>> -> This close() closes the fd of a screen immediately after closing >>>>>> down >>>>>> that screen. At the time screen 1 is closed down, the fd associated >>>>>> with >>>>>> screen 0 is already closed. >>>>>> >>>>>> free(psc); >>>>>> } >>>>>> >>>>>> So you have this sequence during creation: >>>>>> >>>>>> glXQueryVersion() -> AllocAndFetchScreenConfigs(): >>>>>> >>>>>> create screen 0 and its fd, e.g., fd==5, dup(fd), e.g., fd==6 for the >>>>>> nouveau_screen, but store x-screen 0's fd==5 *itself* in the hash >>>>>> table as >>>>>> key. >>>>>> >>>>>> create screen 1. This now creates an fd 7 >>>>>> >>>>>> => refcount is now 2. >>>>>> >>>>>> >>>>>> And this sequence during destruction at XCloseDisplay(); >>>>>> >>>>>> FreeScreenConfigs(): >>>>>> >>>>>> free screen 0: core->destroyScreen() -> ... -> >>>>>> nouveau_drm_screen_unref() => >>>>>> refcount-- is now 1 >>>>>> close(fd==5) of screen 0 and thereby the fd==5 stored inside the hash >>>>>> table >>>>>> as key. From now on using fcntl(fd==5) for hash calculation on that >>>>>> element >>>>>> will malfunction. >>>>>> >>>>>> free screen 1: -> nouveau_drm_screen_unref -> refcount-- is now 0. -> >>>>>> Try to >>>>>> remove hash table entry => fails because fcntl(fd==5) on the already >>>>>> closed >>>>>> fd==5 of screen 0 fails with EBADF. => Hash table keeps its stale >>>>>> element >>>>>> referencing the struct nouveau_screen. >>>>>> >>>>>> Free nouveau_screen struct (close dup()ed fd==6 for nouveau_screen, >>>>>> close >>>>>> down stuff, free the struct). >>>>>> -> close(fd==7) associated with screen 1. >>>>>> >>>>>> Now all fd's (5,6,7) are closed, the struct is gone, the hash table >>>>>> contains >>>>>> a stale element with a no longer existent fd==5 and a pointer to an >>>>>> already >>>>>> free()d struct nouveau_screen. >>>>>> >>>>>> Now we reload the plugin and do glXQueryVersion() again, so >>>>>> AllocAndFetchScreenConfigs() is executed again to create the winsys >>>>>> etc. for >>>>>> screens 0 and 1. >>>>>> >>>>>> create screen 0 and its fd: Here the first unused fd in the >>>>>> filedescriptor >>>>>> table is recycled, which happens to be fd==5, just as in the first >>>>>> session >>>>>> with the plugin! Now suddenly fd=5 is again the valid fd for x-screen >>>>>> 0, and >>>>>> it points to the same /dev/drm/card0 device file as before. That >>>>>> means when >>>>>> nouveau_drm_screen_create(fd==5) is called, it finds the stale >>>>>> element in >>>>>> the hash table from the previous session, because the previously >>>>>> closed fd 5 >>>>>> in that element is now valid and open again and points to the same >>>>>> device >>>>>> file as in the previous session, ergo has the same hash etc. >>>>>> nouveau_drm_screen_create() concludes that a fully initialized >>>>>> nouveau_screen already exists for this session, except it doesn't >>>>>> exist - >>>>>> the pointer in the hash table points to invalid previously freed >>>>>> memory. >>>>>> When it tries to access that memory we have a use-after-free and the >>>>>> application segfaults. >>>>>> >>>>>> So it depends a bit on what happens in the applications memory >>>>>> between the >>>>>> two consecutive OpenGL sessions etc. how long it takes for an invalid >>>>>> memory >>>>>> access to happen, but eventually it hits invalid data. Maybe >>>>>> occassionally >>>>>> it gets lucky and the freed memory is still accessible and the struct >>>>>> intact, so stuff works by chance. >>>>>> >>>>>> If the application happened to open other files inbetween the 1st and >>>>>> 2nd >>>>>> session, then the relevant fd in the 2nd session may not be exactly >>>>>> the same >>>>>> as in the first session, because recycling that fd didn't work, so we >>>>>> are >>>>>> only left with a dead but harmless entry in the hash table instead of >>>>>> a >>>>>> crasher. I assume something like that prevented my crashes on radeon >>>>>> in the >>>>>> past, because when i wasn't expecting/looking for trouble i didn't use >>>>>> a >>>>>> test sequence carefully made to trigger the bug for certain. >>>>>> >>>>>> Similar things happen on radeon for the same reason, ergo the 2nd >>>>>> patch with >>>>>> a similar fix. >>>>>> >>>>>> I don't have the gdb backtrace for nouveau anymore, but the traces for >>>>>> radeon are attached to this mail to show you the flow of execution, >>>>>> similar >>>>>> to nouveau. >>>>>> >>>>>> -mario _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev