Re: inotify_rm_watch() user-space safety requirements?
Hi Heinrich, Thanks for looking into this. I've never encountered the aliasing via wraparound or otherwise, but I've tried to code against unexpected descriptor reuse anyway through more flexible workarounds. From a user standpoint, denser descriptor reassignment would be preferable to facilitate simple dense linear mappings of cache objects, but it's easy to see why things were designed as they were given the queue entry persistence. Getting current behavior accurately documented is my first hope, and fixing any gotcha cases my second, but being able to have inotify_add_watch/read look and feel more like open/epoll_ctl(EPOLL_CTL_ADD))/epoll_wait would be my ultimate wish. The asynchronous-feeling handle removal is not a great interface in general to deal with. Best regards, Jeff On Tue, May 27, 2014 at 2:32 PM, Heinrich Schuchardt wrote: > On 27.05.2014 19:25, Jeff Smith wrote: >> >> inotify's behavior concerning events from removed watches (they do >> happen) and watch descriptor reuse (beyond my knowledge) is currently >> undocumented. >> >> Although it mimics a standard multiplexing interface in most regards, >> writing a robust user-space handler is comparatively more complex due >> to the atypical delivery of "stale" wd events preceding an IN_IGNORE >> event and a lack of guarantees about how quickly a wd can be reused >> via inotify_add_watch(). Not being familiar with inotify/fsnotify >> internals, it's not trivially obvious to me how the fsnotify_group >> management is being done. Up to the present, I've maintained queues of >> "dead" wd wrappers (or at least a counter) to filter stale events, but >> I am clueless whether or not this is overkill. >> >> If removed descriptors are reserved until the IN_IGNORE event is >> drained from the read queue, could that be formally guaranteed? If >> it's not, is it functionality that could ever reasonably be expected >> to be added, short of some other form of new (optional?) >> queue-filter-on-rm functionality? It's my experience that the >> asynchronous handling of watch removals is a cost that seldom serves >> much user benefit. >> >> Regards, >> Jeff > > Hello Jeff, > > I tried to dive a bit into the code. This is what I understand: > > Function inotify_ignored_and_remove_idr is called after the mark has been > removed. This function puts an IN_IGNORED event onto the inotify queue and > removes the watch descriptor from the list of used watch descriptors using > function idr_remove. > > With a test program I could receive the IN_IGNORED event. This behavior is > currently not documented in the manpages (inotify.7 and inotify_rm_watch.2). > > When inotify_add_watch is called it uses function idr_alloc_cyclic to assign > a watch descriptor ID. This function starts looking for an unused id > starting with the id after the last assigned watch descriptor. > > This implies that in most cases inotify_add_watch will return a watch > descriptor different to the one released by a prior call to > inotify_rm_watch. But there is no guarantee. > > I consider this a bug. > > I CCed the maintainers of the inotify interface hoping that they can provide > a better solution. > > Until such a solution is provided I suggest you use the following > workaround. After calling inotify_rm_watch read from the inotify file > descriptor until you reach the matching IN_IGNORED event. > > Only thereafter you can safely call inotify_add_watch again. > > Best regards > > Heinrich Schuchardt > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
inotify_rm_watch() user-space safety requirements?
inotify's behavior concerning events from removed watches (they do happen) and watch descriptor reuse (beyond my knowledge) is currently undocumented. Although it mimics a standard multiplexing interface in most regards, writing a robust user-space handler is comparatively more complex due to the atypical delivery of "stale" wd events preceding an IN_IGNORE event and a lack of guarantees about how quickly a wd can be reused via inotify_add_watch(). Not being familiar with inotify/fsnotify internals, it's not trivially obvious to me how the fsnotify_group management is being done. Up to the present, I've maintained queues of "dead" wd wrappers (or at least a counter) to filter stale events, but I am clueless whether or not this is overkill. If removed descriptors are reserved until the IN_IGNORE event is drained from the read queue, could that be formally guaranteed? If it's not, is it functionality that could ever reasonably be expected to be added, short of some other form of new (optional?) queue-filter-on-rm functionality? It's my experience that the asynchronous handling of watch removals is a cost that seldom serves much user benefit. Regards, Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
inotify_rm_watch() user-space safety requirements?
inotify's behavior concerning events from removed watches (they do happen) and watch descriptor reuse (beyond my knowledge) is currently undocumented. Although it mimics a standard multiplexing interface in most regards, writing a robust user-space handler is comparatively more complex due to the atypical delivery of stale wd events preceding an IN_IGNORE event and a lack of guarantees about how quickly a wd can be reused via inotify_add_watch(). Not being familiar with inotify/fsnotify internals, it's not trivially obvious to me how the fsnotify_group management is being done. Up to the present, I've maintained queues of dead wd wrappers (or at least a counter) to filter stale events, but I am clueless whether or not this is overkill. If removed descriptors are reserved until the IN_IGNORE event is drained from the read queue, could that be formally guaranteed? If it's not, is it functionality that could ever reasonably be expected to be added, short of some other form of new (optional?) queue-filter-on-rm functionality? It's my experience that the asynchronous handling of watch removals is a cost that seldom serves much user benefit. Regards, Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: inotify_rm_watch() user-space safety requirements?
Hi Heinrich, Thanks for looking into this. I've never encountered the aliasing via wraparound or otherwise, but I've tried to code against unexpected descriptor reuse anyway through more flexible workarounds. From a user standpoint, denser descriptor reassignment would be preferable to facilitate simple dense linear mappings of cache objects, but it's easy to see why things were designed as they were given the queue entry persistence. Getting current behavior accurately documented is my first hope, and fixing any gotcha cases my second, but being able to have inotify_add_watch/read look and feel more like open/epoll_ctl(EPOLL_CTL_ADD))/epoll_wait would be my ultimate wish. The asynchronous-feeling handle removal is not a great interface in general to deal with. Best regards, Jeff On Tue, May 27, 2014 at 2:32 PM, Heinrich Schuchardt xypron.g...@gmx.de wrote: On 27.05.2014 19:25, Jeff Smith wrote: inotify's behavior concerning events from removed watches (they do happen) and watch descriptor reuse (beyond my knowledge) is currently undocumented. Although it mimics a standard multiplexing interface in most regards, writing a robust user-space handler is comparatively more complex due to the atypical delivery of stale wd events preceding an IN_IGNORE event and a lack of guarantees about how quickly a wd can be reused via inotify_add_watch(). Not being familiar with inotify/fsnotify internals, it's not trivially obvious to me how the fsnotify_group management is being done. Up to the present, I've maintained queues of dead wd wrappers (or at least a counter) to filter stale events, but I am clueless whether or not this is overkill. If removed descriptors are reserved until the IN_IGNORE event is drained from the read queue, could that be formally guaranteed? If it's not, is it functionality that could ever reasonably be expected to be added, short of some other form of new (optional?) queue-filter-on-rm functionality? It's my experience that the asynchronous handling of watch removals is a cost that seldom serves much user benefit. Regards, Jeff Hello Jeff, I tried to dive a bit into the code. This is what I understand: Function inotify_ignored_and_remove_idr is called after the mark has been removed. This function puts an IN_IGNORED event onto the inotify queue and removes the watch descriptor from the list of used watch descriptors using function idr_remove. With a test program I could receive the IN_IGNORED event. This behavior is currently not documented in the manpages (inotify.7 and inotify_rm_watch.2). When inotify_add_watch is called it uses function idr_alloc_cyclic to assign a watch descriptor ID. This function starts looking for an unused id starting with the id after the last assigned watch descriptor. This implies that in most cases inotify_add_watch will return a watch descriptor different to the one released by a prior call to inotify_rm_watch. But there is no guarantee. I consider this a bug. I CCed the maintainers of the inotify interface hoping that they can provide a better solution. Until such a solution is provided I suggest you use the following workaround. After calling inotify_rm_watch read from the inotify file descriptor until you reach the matching IN_IGNORED event. Only thereafter you can safely call inotify_add_watch again. Best regards Heinrich Schuchardt -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
I've got no real issues at this point, but could you perhaps elaborate a bit on the rough order of magnitude of "long" time and what cases would be slower? I'm pretty sure that some places I've worked that still have some remap_file_pages() logic in place aren't too religious about checking dmesg. --Jeff On Mon, May 26, 2014 at 8:47 AM, Kirill A. Shutemov wrote: > Jeff Smith wrote: >> OK, I misinterpreted "the overlapped part of the mapping(s) will be >> discarded" as discarding the -new- mappings. My objections about >> needing a replacement for remap_file_pages() are gone, but my concerns >> about existing code still remain. > > As I said, emulation will be there for long time. With warning in dmesg. > The emulation is interface-compatible, but slower in some cases (not > your's). > > -- > Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
OK, I misinterpreted "the overlapped part of the mapping(s) will be discarded" as discarding the -new- mappings. My objections about needing a replacement for remap_file_pages() are gone, but my concerns about existing code still remain. --Jeff On Mon, May 26, 2014 at 8:35 AM, Paolo Bonzini wrote: > Il 26/05/2014 15:24, Jeff Smith ha scritto: > >> Your addr2 mmap() call is a bit incorrect semantically and >> syntactically (you skipped the length arg). The addr2 request will >> fail because mmap() does not implicitly munmap() occupied virtual >> address space. > > > With MAP_FIXED it does. It is in the man page. > > Paolo > > >> Even if you did that, the following still has a race >> condition between the addr2 request and another thread grabbing the >> same virtual space, which nothing short of a lock on all threads' >> mmap()-ing logic can protect: > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
>> Mirrored mapping is absolutely required by several >> independent proprietary platforms I'm aware of, and remap_file_pages() >> has historically been the only sane way to accomplish this. (i.e., >> shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd >> half). > > Em.. What's wrong with shm_open() + two mmap()s to cover both halfs? > > fd = shm_open(); > addr1 = mmap(NULL, 2*SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); > addr2 = mmap(addr1 + SIZE, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, > 0); > > Is there a reason why it doens't work? Your addr2 mmap() call is a bit incorrect semantically and syntactically (you skipped the length arg). The addr2 request will fail because mmap() does not implicitly munmap() occupied virtual address space. Even if you did that, the following still has a race condition between the addr2 request and another thread grabbing the same virtual space, which nothing short of a lock on all threads' mmap()-ing logic can protect: addr1 = mmap(NULL, 2*SIZE, PROT_READ, MAP_SHARED, fd, 0); munmap(addr1 + SIZE, SIZE); /* race on virtual address space here, but n/a for remap_file_pages() ... */ addr2 = mmap(addr1, SIZE, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0); remap_file_pages() is not subject to this problem and allows the creation of considerably cleaner code. Protecting the address space corner cases with locks or arbitrarily bounded munmap()-and-retry loops is a substantial burden over the historically provided approach. >> but failing that, a reservation API would need >> to be created (possibly a MAP_RESERVE flag) that would set aside a >> region that could only be subsequently mapped via explicit >> address-requesting mmap() calls. > > I don't get this part. I'm proposing that a call along the lines of mmap(NULL, len, prot, MAP_RESERVED | ..., fd, offset) could return a virtual address block that is -not- actually mapped but -is- protected from other mmap() calls not explicitly requesting the space via their addr parameters. Unfortunately, you'd also need to define separate semantics to un-reserving not-mapped space, etc. The important issue is that users need to be able to trivially protect themselves from transient virtual address space congestion problems and only fail early on long-term exhaustion situations. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
Mirrored mapping is absolutely required by several independent proprietary platforms I'm aware of, and remap_file_pages() has historically been the only sane way to accomplish this. (i.e., shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd half). Em.. What's wrong with shm_open() + two mmap()s to cover both halfs? fd = shm_open(); addr1 = mmap(NULL, 2*SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); addr2 = mmap(addr1 + SIZE, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0); Is there a reason why it doens't work? Your addr2 mmap() call is a bit incorrect semantically and syntactically (you skipped the length arg). The addr2 request will fail because mmap() does not implicitly munmap() occupied virtual address space. Even if you did that, the following still has a race condition between the addr2 request and another thread grabbing the same virtual space, which nothing short of a lock on all threads' mmap()-ing logic can protect: addr1 = mmap(NULL, 2*SIZE, PROT_READ, MAP_SHARED, fd, 0); munmap(addr1 + SIZE, SIZE); /* race on virtual address space here, but n/a for remap_file_pages() ... */ addr2 = mmap(addr1, SIZE, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0); remap_file_pages() is not subject to this problem and allows the creation of considerably cleaner code. Protecting the address space corner cases with locks or arbitrarily bounded munmap()-and-retry loops is a substantial burden over the historically provided approach. but failing that, a reservation API would need to be created (possibly a MAP_RESERVE flag) that would set aside a region that could only be subsequently mapped via explicit address-requesting mmap() calls. I don't get this part. I'm proposing that a call along the lines of mmap(NULL, len, prot, MAP_RESERVED | ..., fd, offset) could return a virtual address block that is -not- actually mapped but -is- protected from other mmap() calls not explicitly requesting the space via their addr parameters. Unfortunately, you'd also need to define separate semantics to un-reserving not-mapped space, etc. The important issue is that users need to be able to trivially protect themselves from transient virtual address space congestion problems and only fail early on long-term exhaustion situations. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
OK, I misinterpreted the overlapped part of the mapping(s) will be discarded as discarding the -new- mappings. My objections about needing a replacement for remap_file_pages() are gone, but my concerns about existing code still remain. --Jeff On Mon, May 26, 2014 at 8:35 AM, Paolo Bonzini pbonz...@redhat.com wrote: Il 26/05/2014 15:24, Jeff Smith ha scritto: Your addr2 mmap() call is a bit incorrect semantically and syntactically (you skipped the length arg). The addr2 request will fail because mmap() does not implicitly munmap() occupied virtual address space. With MAP_FIXED it does. It is in the man page. Paolo Even if you did that, the following still has a race condition between the addr2 request and another thread grabbing the same virtual space, which nothing short of a lock on all threads' mmap()-ing logic can protect: -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
I've got no real issues at this point, but could you perhaps elaborate a bit on the rough order of magnitude of long time and what cases would be slower? I'm pretty sure that some places I've worked that still have some remap_file_pages() logic in place aren't too religious about checking dmesg. --Jeff On Mon, May 26, 2014 at 8:47 AM, Kirill A. Shutemov kirill.shute...@linux.intel.com wrote: Jeff Smith wrote: OK, I misinterpreted the overlapped part of the mapping(s) will be discarded as discarding the -new- mappings. My objections about needing a replacement for remap_file_pages() are gone, but my concerns about existing code still remain. As I said, emulation will be there for long time. With warning in dmesg. The emulation is interface-compatible, but slower in some cases (not your's). -- Kirill A. Shutemov -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
On Mon, May 19, 2014 at 9:38 AM, Christoph Hellwig wrote: > On Mon, May 19, 2014 at 05:35:40PM +0300, Kirill A. Shutemov wrote: >> >From functional POV, emulation *should* be identical to original >> remap_file_pages(), but slower. It would be nice, if you test it early. >> >> It's not clear yet how long emulation will be there. > > Stop right there. We found out about two real life users of > remap_file_pages() already, without even committing the patches to warn > about using it to any tree. > > I think at this point the whole idea of removing the API should be dead > on the floor, as we do not needlessly break userspace programs. > > If we can get rid of the ugly guts and provide a good enough emulation > that the user won't cry I'd love to get rid of this cruft, but even > that doesn't look certain yet. Sorry for being late to the party, but I just noticed this proposal via the LWN summary byline. I wanted to comment that Kenny's use case is (I believe) quite widespread. I've used the technique since ~2008, and I've come across other people in subsequent jobs who independently developed the same technique. Mirrored mapping is absolutely required by several independent proprietary platforms I'm aware of, and remap_file_pages() has historically been the only sane way to accomplish this. (i.e., shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd half). It may not be individuals who are involved in the kernel development scene to any great extent, but I am sure that remap_file_pages() being deprecated would seriously piss off a lot of individuals. The pattern has even had a section in the Wikipedia article for quite some time: http://en.wikipedia.org/wiki/Circular_buffer#Mirroring It would be most preferable from a user standpoint to keep the existing system intact, but failing that, a reservation API would need to be created (possibly a MAP_RESERVE flag) that would set aside a region that could only be subsequently mapped via explicit address-requesting mmap() calls. Thanks for any consideration of these concerns. --Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: remap_file_pages() use
On Mon, May 19, 2014 at 9:38 AM, Christoph Hellwig h...@infradead.org wrote: On Mon, May 19, 2014 at 05:35:40PM +0300, Kirill A. Shutemov wrote: From functional POV, emulation *should* be identical to original remap_file_pages(), but slower. It would be nice, if you test it early. It's not clear yet how long emulation will be there. Stop right there. We found out about two real life users of remap_file_pages() already, without even committing the patches to warn about using it to any tree. I think at this point the whole idea of removing the API should be dead on the floor, as we do not needlessly break userspace programs. If we can get rid of the ugly guts and provide a good enough emulation that the user won't cry I'd love to get rid of this cruft, but even that doesn't look certain yet. Sorry for being late to the party, but I just noticed this proposal via the LWN summary byline. I wanted to comment that Kenny's use case is (I believe) quite widespread. I've used the technique since ~2008, and I've come across other people in subsequent jobs who independently developed the same technique. Mirrored mapping is absolutely required by several independent proprietary platforms I'm aware of, and remap_file_pages() has historically been the only sane way to accomplish this. (i.e., shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd half). It may not be individuals who are involved in the kernel development scene to any great extent, but I am sure that remap_file_pages() being deprecated would seriously piss off a lot of individuals. The pattern has even had a section in the Wikipedia article for quite some time: http://en.wikipedia.org/wiki/Circular_buffer#Mirroring It would be most preferable from a user standpoint to keep the existing system intact, but failing that, a reservation API would need to be created (possibly a MAP_RESERVE flag) that would set aside a region that could only be subsequently mapped via explicit address-requesting mmap() calls. Thanks for any consideration of these concerns. --Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/