Pushed. Thanks for the fix and for bearing with me!
On Sun, Jun 28, 2015 at 1:01 AM, Mario Kleiner <mario.kleiner...@gmail.com> wrote: > Yes, that's good. > > > On 06/28/2015 07:00 AM, Ilia Mirkin wrote: >> >> How about: >> >> /* Use dupfd in hash table, to avoid errors if the original fd gets >> * closed by its owner. The hash key needs to live at least as long as >> * the screen. >> */ >> >> >> On Sun, Jun 28, 2015 at 12:57 AM, Mario Kleiner >> <mario.kleiner...@gmail.com> wrote: >>> >>> On 06/28/2015 06:25 AM, Ilia Mirkin wrote: >>>> >>>> >>>> But that's the thing... in this case, the fd lifetime is owned by the >>>> X server. In fact, nouveau doesn't touch that fd in mesa beyond >>>> dup'ing it, and then exclusively using the dup'd fd. >>>> >>> >>> True, it's not owned by nouveau, but instead by mesa core code? I think >>> the >>> x-server itself is only involved in authenticating the fd under DRI2, >>> after >>> mesa has opened it? Although under DRI3 the ready made fd comes from a >>> xcb_dri3_open() trip to the server. >>> >>> See the open() call on the device file: >>> dri2CreateScreen() in mesa/src/glx/dri2_glx.c >>> >>> xcb_dri3_open(): >>> dri3_create_screen() in mesa/src/glx/dri3_glx.c >>> >>> And the close() calls for the fd's are also inside those files. >>> >>> Maybe those winsys functions we fixed here are also involved in things >>> like >>> EGL for wayland etc.? Haven't checked this. >>> >>> Other than that your new commit message is fine. Also i am possibly just >>> confused about this, and your commit message is an improvement in clarity >>> anyway :) >>> >>> So i think it is fine to leave it as is. >>> >>> thanks, >>> -mario >>> >>> >>> >>>> On Sun, Jun 28, 2015 at 12:23 AM, Mario Kleiner >>>> <mario.kleiner...@gmail.com> wrote: >>>>> >>>>> >>>>> Ok, maybe one thing for the commit message, as i just made the same >>>>> mixup >>>>> in >>>>> my reply: The fd's are not owned by the x-server, but by the mesa >>>>> screens >>>>> (pipe screens? dri screens?) they represent, so if such a screen goes >>>>> away, >>>>> the fd goes away. Using "X server" would be confusing, especially under >>>>> dri3, although each such "mesa screen" corresponds to a x-screen. >>>>> >>>>> -mario >>>>> >>>>> >>>>> On 06/28/2015 06:03 AM, Mario Kleiner wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 06/28/2015 05:41 AM, Ilia Mirkin wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Oh duh. Thanks for the super-detailed explanation. To rephrase what >>>>>>> you said in a slightly shorter manner: >>>>>>> >>>>>>> See bug https://bugs.freedesktop.org/show_bug.cgi?id=79823 for why we >>>>>>> need to dupfd (which we are already, although radeon might not be). >>>>>>> >>>>>> >>>>>> Radeon doesn't so far. My 2nd patch for radeon from earlier today adds >>>>>> that. >>>>>> >>>>>>> Except instead of sticking the dup'd fd into the hash table, whose >>>>>>> lifetime matches that of the device, we stick the other one in, which >>>>>>> is effectively owned by the X server. As a result, those fstat's >>>>>>> might >>>>>>> fail down the line, and all sorts of hell will break loose. >>>>>>> >>>>>> >>>>>> Yes. Essentially we should make sure the fd's we keep around have the >>>>>> same lifetime as the data structure in which we need to use it. That >>>>>> was >>>>>> the point, the server owned fd went away while we still needed it. >>>>>> >>>>>>> Would you be opposed to me rewriting your commit message to basically >>>>>>> reflect the above? Perhaps your original one did as well, but it >>>>>>> wasn't clear to me. I'd also like to throw some assert's in to make >>>>>>> sure the fstat's don't fail -- does that sound reasonable? >>>>>>> >>>>>> >>>>>> Oh please, yes! My commit messages often have this disease that they >>>>>> are >>>>>> either too terse, when i think the problem is obvious, or too long and >>>>>> somewhat convoluted when i'm going overboard in the other direction. >>>>>> Any >>>>>> editing/shortening for clarity more than happily accepted :) >>>>>> >>>>>> thanks, >>>>>> -mario >>>>>> >>>>>>> Another solution, btw, is to stick <stat.st_dev, stat.st_ino, >>>>>>> stat.st_rdev> as the real key in the hash, although that'd involve >>>>>>> making pointers. Probably not worth it. >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> -ilia >>>>>>> >>>>>>> On Sat, Jun 27, 2015 at 11:13 PM, Mario Kleiner >>>>>>> <mario.kleiner...@gmail.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 06/28/2015 03:48 AM, Ilia Mirkin wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 5, 2015 at 9:36 AM, Mario Kleiner >>>>>>>>> <mario.kleiner...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> The dup'ed fd owned by the nouveau_screen for a device node >>>>>>>>>> must also be used as key for the winsys hash table, instead >>>>>>>>>> of using the original fd passed in for a screen, to make >>>>>>>>>> multi-x-screen ZaphodHeads configurations work on nouveau. >>>>>>>>>> >>>>>>>>>> This prevents the following crash scenario that was observed >>>>>>>>>> when a dynamically loaded rendering plugin used OpenGL on a >>>>>>>>>> ZaphodHeads setup, e.g., on a dual x-screen setup. At first >>>>>>>>>> load the plugin worked, but after unloading and reloading it, >>>>>>>>>> the next rendering operation crashed: >>>>>>>>>> >>>>>>>>>> 1. Client, e.g., a plugin, calls glXQueryVersion. >>>>>>>>>> >>>>>>>>>> 2. DRI screens for the x-screens 0 and 1 are created, one shared >>>>>>>>>> nouveau_screen is created for the shared device node of >>>>>>>>>> both >>>>>>>>>> screens, but the original fd of x-screen 0 is used as >>>>>>>>>> identifying >>>>>>>>>> key in the hash table, instead of the dup()ed fd of >>>>>>>>>> x-screen 0 >>>>>>>>>> which is owned by the nouveau_screen. nouveau_screen's >>>>>>>>>> refcount >>>>>>>>>> is now 2. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> See below, but it shouldn't matter which fd gets used. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> 3. Regular rendering happens by the client plugin, then the plugin >>>>>>>>>> gets unloaded. >>>>>>>>>> >>>>>>>>>> 4. XCloseDisplay(). x-screen 0 gets its DRI screen destroyed, >>>>>>>>>> nouveau_drm_screen_unref() drops the refcount to 1, calling >>>>>>>>>> mesa >>>>>>>>>> code then closes the fd of x-screen 0, so now the fd which >>>>>>>>>> is >>>>>>>>>> used as key in the hash table is invalid. x-screen 1 gets >>>>>>>>>> destroyed, nouveau_drm_screen_unref() drops the refcount to >>>>>>>>>> 0, >>>>>>>>>> the nouveau_screen gets destroyed, but removal of its entry >>>>>>>>>> in the hash table fails, because the invalid fd in the hash >>>>>>>>>> table no longer matches anything (fstat() on the fd is used >>>>>>>>>> for hashing and key comparison, but fstat() on an already >>>>>>>>>> closed >>>>>>>>>> fd fails and returns bogus results). x-screen 1 closes its >>>>>>>>>> fd. >>>>>>>>>> >>>>>>>>>> Now all fd's are closed, the nouveau_screen destroyed, but >>>>>>>>>> there is a dangling reference to the nouveau_screen in the >>>>>>>>>> hash table. >>>>>>>>>> >>>>>>>>>> 5. Some OpenGL client plugin gets loaded again and calls >>>>>>>>>> glXQueryVersion. Step 2 above repeats, but because a >>>>>>>>>> dangling reference with a matching fd is found in the >>>>>>>>>> winsys >>>>>>>>>> hash table, no new nouveau_screen is created this time. >>>>>>>>>> Instead >>>>>>>>>> the invalid pointer to the old nouveau_screen is recycled, >>>>>>>>>> which points to nirvana -> Crash. >>>>>>>>>> >>>>>>>>>> This problem is avoided by use of the dup()ed fd which is >>>>>>>>>> owned by the nouveau_screen and has the same lifetime as >>>>>>>>>> the nouveau_screen itself. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I need to think about this some more, but... this shouldn't happen >>>>>>>>> :) >>>>>>>>> >>>>>>>>> In fact, the whole dupfd thing was added there for ZaphodHeads >>>>>>>>> screens >>>>>>>>> in the first place. See >>>>>>>>> https://bugs.freedesktop.org/show_bug.cgi?id=79823 and commit >>>>>>>>> a59f2bb17bcc which fixed it. >>>>>>>>> >>>>>>>>> Note that the hash has the following hash/eq functions: >>>>>>>>> >>>>>>>>> static unsigned hash_fd(void *key) >>>>>>>>> { >>>>>>>>> int fd = pointer_to_intptr(key); >>>>>>>>> struct stat stat; >>>>>>>>> fstat(fd, &stat); >>>>>>>>> >>>>>>>>> return stat.st_dev ^ stat.st_ino ^ stat.st_rdev; >>>>>>>>> } >>>>>>>>> >>>>>>>>> static int compare_fd(void *key1, void *key2) >>>>>>>>> { >>>>>>>>> int fd1 = pointer_to_intptr(key1); >>>>>>>>> int fd2 = pointer_to_intptr(key2); >>>>>>>>> struct stat stat1, stat2; >>>>>>>>> fstat(fd1, &stat1); >>>>>>>>> fstat(fd2, &stat2); >>>>>>>>> >>>>>>>>> return stat1.st_dev != stat2.st_dev || >>>>>>>>> stat1.st_ino != stat2.st_ino || >>>>>>>>> stat1.st_rdev != stat2.st_rdev; >>>>>>>>> } >>>>>>>>> >>>>>>>>> so fd and dupfd should get hashed to the same thing. I suspect >>>>>>>>> there's >>>>>>>>> something else going on in your application... >>>>>>>>> >>>>>>>> >>>>>>>> My application is a set of dynamically loaded plugins running inside >>>>>>>> another >>>>>>>> host app (Psychtoolbox-3 inside GNU/Octave), so what happens often >>>>>>>> is >>>>>>>> that >>>>>>>> the OpenGL using plugin gets unloaded and reloaded at runtime, so a >>>>>>>> after a >>>>>>>> OpenGL session has ended with a XCloseDisplay() tearing the winsys >>>>>>>> down, you >>>>>>>> can easily have it restart at plugin reload with another >>>>>>>> XOpenDisplay >>>>>>>> -> >>>>>>>> glXQueryVersion -> .... sequence, where the new call to >>>>>>>> glXQueryVersion will >>>>>>>> trigger a recreation of the winsys, which will find the stale entry >>>>>>>> in the >>>>>>>> hash table pointing to nowhere instead of the previously released >>>>>>>> nouveau_screen -> use-after-free -> boom! >>>>>>>> >>>>>>>> The reason this fails is because during destruction of the 2nd, 3rd >>>>>>>> etc. >>>>>>>> x-screen, the already closed fd associated with the 1st x-screen is >>>>>>>> fed into >>>>>>>> the compare_fd function, so fstat() errors out on the invalid fd. I >>>>>>>> added >>>>>>>> printf's etc. on both nouveau and now radeon to verify the fstat >>>>>>>> gives me >>>>>>>> EBADF errors. So the hash calculation goes wrong when trying to find >>>>>>>> the >>>>>>>> matching element in the hash table with a fd that has a matching >>>>>>>> hash >>>>>>>> -> The >>>>>>>> element which should be removed is ignored/not removed because it >>>>>>>> contains >>>>>>>> an already closed fd for which no proper hash can be calculated >>>>>>>> anymore -> >>>>>>>> hash comparison during search goes wrong. >>>>>>>> >>>>>>>> This is because multiple x-screens, e.g., 2 x-screens are destroyed >>>>>>>> in order >>>>>>>> 0, 1 by FreeScreenConfigs() as part of XCloseDisplay(). >>>>>>>> FreeScreenConfigs() >>>>>>>> calls dri2DestroyScreen() in dri2.c or dri3_destroy_screen in >>>>>>>> dri3.c, >>>>>>>> which >>>>>>>> are essentially identical, e.g.,: >>>>>>>> >>>>>>>> static void >>>>>>>> dri2DestroyScreen(struct glx_screen *base) >>>>>>>> { >>>>>>>> struct dri2_screen *psc = (struct dri2_screen *) base; >>>>>>>> >>>>>>>> /* Free the direct rendering per screen data */ >>>>>>>> (*psc->core->destroyScreen) (psc->driScreen); >>>>>>>> >>>>>>>> --> This core->destroyScreen calls eventually into the winsys, which >>>>>>>> then >>>>>>>> drops the refcount of the nouveau_screen or radeon winsys structure >>>>>>>> by one, >>>>>>>> and tries to release the struct and remove the hash table entry once >>>>>>>> the >>>>>>>> refcount drops to zero. >>>>>>>> >>>>>>>> driDestroyConfigs(psc->driver_configs); >>>>>>>> close(psc->fd); >>>>>>>> >>>>>>>> -> This close() closes the fd of a screen immediately after closing >>>>>>>> down >>>>>>>> that screen. At the time screen 1 is closed down, the fd associated >>>>>>>> with >>>>>>>> screen 0 is already closed. >>>>>>>> >>>>>>>> free(psc); >>>>>>>> } >>>>>>>> >>>>>>>> So you have this sequence during creation: >>>>>>>> >>>>>>>> glXQueryVersion() -> AllocAndFetchScreenConfigs(): >>>>>>>> >>>>>>>> create screen 0 and its fd, e.g., fd==5, dup(fd), e.g., fd==6 for >>>>>>>> the >>>>>>>> nouveau_screen, but store x-screen 0's fd==5 *itself* in the hash >>>>>>>> table as >>>>>>>> key. >>>>>>>> >>>>>>>> create screen 1. This now creates an fd 7 >>>>>>>> >>>>>>>> => refcount is now 2. >>>>>>>> >>>>>>>> >>>>>>>> And this sequence during destruction at XCloseDisplay(); >>>>>>>> >>>>>>>> FreeScreenConfigs(): >>>>>>>> >>>>>>>> free screen 0: core->destroyScreen() -> ... -> >>>>>>>> nouveau_drm_screen_unref() => >>>>>>>> refcount-- is now 1 >>>>>>>> close(fd==5) of screen 0 and thereby the fd==5 stored inside the >>>>>>>> hash >>>>>>>> table >>>>>>>> as key. From now on using fcntl(fd==5) for hash calculation on that >>>>>>>> element >>>>>>>> will malfunction. >>>>>>>> >>>>>>>> free screen 1: -> nouveau_drm_screen_unref -> refcount-- is now 0. >>>>>>>> -> >>>>>>>> Try to >>>>>>>> remove hash table entry => fails because fcntl(fd==5) on the already >>>>>>>> closed >>>>>>>> fd==5 of screen 0 fails with EBADF. => Hash table keeps its stale >>>>>>>> element >>>>>>>> referencing the struct nouveau_screen. >>>>>>>> >>>>>>>> Free nouveau_screen struct (close dup()ed fd==6 for nouveau_screen, >>>>>>>> close >>>>>>>> down stuff, free the struct). >>>>>>>> -> close(fd==7) associated with screen 1. >>>>>>>> >>>>>>>> Now all fd's (5,6,7) are closed, the struct is gone, the hash table >>>>>>>> contains >>>>>>>> a stale element with a no longer existent fd==5 and a pointer to an >>>>>>>> already >>>>>>>> free()d struct nouveau_screen. >>>>>>>> >>>>>>>> Now we reload the plugin and do glXQueryVersion() again, so >>>>>>>> AllocAndFetchScreenConfigs() is executed again to create the winsys >>>>>>>> etc. for >>>>>>>> screens 0 and 1. >>>>>>>> >>>>>>>> create screen 0 and its fd: Here the first unused fd in the >>>>>>>> filedescriptor >>>>>>>> table is recycled, which happens to be fd==5, just as in the first >>>>>>>> session >>>>>>>> with the plugin! Now suddenly fd=5 is again the valid fd for >>>>>>>> x-screen >>>>>>>> 0, and >>>>>>>> it points to the same /dev/drm/card0 device file as before. That >>>>>>>> means when >>>>>>>> nouveau_drm_screen_create(fd==5) is called, it finds the stale >>>>>>>> element in >>>>>>>> the hash table from the previous session, because the previously >>>>>>>> closed fd 5 >>>>>>>> in that element is now valid and open again and points to the same >>>>>>>> device >>>>>>>> file as in the previous session, ergo has the same hash etc. >>>>>>>> nouveau_drm_screen_create() concludes that a fully initialized >>>>>>>> nouveau_screen already exists for this session, except it doesn't >>>>>>>> exist - >>>>>>>> the pointer in the hash table points to invalid previously freed >>>>>>>> memory. >>>>>>>> When it tries to access that memory we have a use-after-free and the >>>>>>>> application segfaults. >>>>>>>> >>>>>>>> So it depends a bit on what happens in the applications memory >>>>>>>> between the >>>>>>>> two consecutive OpenGL sessions etc. how long it takes for an >>>>>>>> invalid >>>>>>>> memory >>>>>>>> access to happen, but eventually it hits invalid data. Maybe >>>>>>>> occassionally >>>>>>>> it gets lucky and the freed memory is still accessible and the >>>>>>>> struct >>>>>>>> intact, so stuff works by chance. >>>>>>>> >>>>>>>> If the application happened to open other files inbetween the 1st >>>>>>>> and >>>>>>>> 2nd >>>>>>>> session, then the relevant fd in the 2nd session may not be exactly >>>>>>>> the same >>>>>>>> as in the first session, because recycling that fd didn't work, so >>>>>>>> we >>>>>>>> are >>>>>>>> only left with a dead but harmless entry in the hash table instead >>>>>>>> of >>>>>>>> a >>>>>>>> crasher. I assume something like that prevented my crashes on radeon >>>>>>>> in the >>>>>>>> past, because when i wasn't expecting/looking for trouble i didn't >>>>>>>> use >>>>>>>> a >>>>>>>> test sequence carefully made to trigger the bug for certain. >>>>>>>> >>>>>>>> Similar things happen on radeon for the same reason, ergo the 2nd >>>>>>>> patch with >>>>>>>> a similar fix. >>>>>>>> >>>>>>>> I don't have the gdb backtrace for nouveau anymore, but the traces >>>>>>>> for >>>>>>>> radeon are attached to this mail to show you the flow of execution, >>>>>>>> similar >>>>>>>> to nouveau. >>>>>>>> >>>>>>>> -mario _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev