Re: inotify_rm_watch() user-space safety requirements?

2014-05-27 Thread Jeff Smith
Hi Heinrich,

Thanks for looking into this. I've never encountered the aliasing via
wraparound or otherwise, but I've tried to code against unexpected
descriptor reuse anyway through more flexible workarounds. From a user
standpoint, denser descriptor reassignment would be preferable to
facilitate simple dense linear mappings of cache objects, but it's
easy to see why things were designed as they were given the queue
entry persistence. Getting current behavior accurately documented is
my first hope, and fixing any gotcha cases my second, but being able
to have inotify_add_watch/read look and feel more like
open/epoll_ctl(EPOLL_CTL_ADD))/epoll_wait would be my ultimate wish.
The asynchronous-feeling handle removal is not a great interface in
general to deal with.

Best regards,
Jeff

On Tue, May 27, 2014 at 2:32 PM, Heinrich Schuchardt  wrote:
> On 27.05.2014 19:25, Jeff Smith wrote:
>>
>> inotify's behavior concerning events from removed watches (they do
>> happen) and watch descriptor reuse (beyond my knowledge) is currently
>> undocumented.
>>
>> Although it mimics a standard multiplexing interface in most regards,
>> writing a robust user-space handler is comparatively more complex due
>> to the atypical delivery of "stale" wd events preceding an IN_IGNORE
>> event and a lack of guarantees about how quickly a wd can be reused
>> via inotify_add_watch(). Not being familiar with inotify/fsnotify
>> internals, it's not trivially obvious to me how the fsnotify_group
>> management is being done. Up to the present, I've maintained queues of
>> "dead" wd wrappers (or at least a counter) to filter stale events, but
>> I am clueless whether or not this is overkill.
>>
>> If removed descriptors are reserved until the IN_IGNORE event is
>> drained from the read queue, could that be formally guaranteed? If
>> it's not, is it functionality that could ever reasonably be expected
>> to be added, short of some other form of new (optional?)
>> queue-filter-on-rm functionality? It's my experience that the
>> asynchronous handling of watch removals is a cost that seldom serves
>> much user benefit.
>>
>> Regards,
>> Jeff
>
> Hello Jeff,
>
> I tried to dive a bit into the code. This is what I understand:
>
> Function inotify_ignored_and_remove_idr is called after the mark has been
> removed. This function puts an IN_IGNORED event onto the inotify queue and
> removes the watch descriptor from the list of used watch descriptors using
> function idr_remove.
>
> With a test program I could receive the IN_IGNORED event. This behavior is
> currently not documented in the manpages (inotify.7 and inotify_rm_watch.2).
>
> When inotify_add_watch is called it uses function idr_alloc_cyclic to assign
> a watch descriptor ID. This function starts looking for an unused id
> starting with the id after the last assigned watch descriptor.
>
> This implies that in most cases inotify_add_watch will return a watch
> descriptor different to the one released by a prior call to
> inotify_rm_watch. But there is no guarantee.
>
> I consider this a bug.
>
> I CCed the maintainers of the inotify interface hoping that they can provide
> a better solution.
>
> Until such a solution is provided I suggest you use the following
> workaround. After calling inotify_rm_watch read from the inotify file
> descriptor until you reach the matching IN_IGNORED event.
>
> Only thereafter you can safely call inotify_add_watch again.
>
> Best regards
>
> Heinrich Schuchardt
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


inotify_rm_watch() user-space safety requirements?

2014-05-27 Thread Jeff Smith
inotify's behavior concerning events from removed watches (they do
happen) and watch descriptor reuse (beyond my knowledge) is currently
undocumented.

Although it mimics a standard multiplexing interface in most regards,
writing a robust user-space handler is comparatively more complex due
to the atypical delivery of "stale" wd events preceding an IN_IGNORE
event and a lack of guarantees about how quickly a wd can be reused
via inotify_add_watch(). Not being familiar with inotify/fsnotify
internals, it's not trivially obvious to me how the fsnotify_group
management is being done. Up to the present, I've maintained queues of
"dead" wd wrappers (or at least a counter) to filter stale events, but
I am clueless whether or not this is overkill.

If removed descriptors are reserved until the IN_IGNORE event is
drained from the read queue, could that be formally guaranteed? If
it's not, is it functionality that could ever reasonably be expected
to be added, short of some other form of new (optional?)
queue-filter-on-rm functionality? It's my experience that the
asynchronous handling of watch removals is a cost that seldom serves
much user benefit.

Regards,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


inotify_rm_watch() user-space safety requirements?

2014-05-27 Thread Jeff Smith
inotify's behavior concerning events from removed watches (they do
happen) and watch descriptor reuse (beyond my knowledge) is currently
undocumented.

Although it mimics a standard multiplexing interface in most regards,
writing a robust user-space handler is comparatively more complex due
to the atypical delivery of stale wd events preceding an IN_IGNORE
event and a lack of guarantees about how quickly a wd can be reused
via inotify_add_watch(). Not being familiar with inotify/fsnotify
internals, it's not trivially obvious to me how the fsnotify_group
management is being done. Up to the present, I've maintained queues of
dead wd wrappers (or at least a counter) to filter stale events, but
I am clueless whether or not this is overkill.

If removed descriptors are reserved until the IN_IGNORE event is
drained from the read queue, could that be formally guaranteed? If
it's not, is it functionality that could ever reasonably be expected
to be added, short of some other form of new (optional?)
queue-filter-on-rm functionality? It's my experience that the
asynchronous handling of watch removals is a cost that seldom serves
much user benefit.

Regards,
Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: inotify_rm_watch() user-space safety requirements?

2014-05-27 Thread Jeff Smith
Hi Heinrich,

Thanks for looking into this. I've never encountered the aliasing via
wraparound or otherwise, but I've tried to code against unexpected
descriptor reuse anyway through more flexible workarounds. From a user
standpoint, denser descriptor reassignment would be preferable to
facilitate simple dense linear mappings of cache objects, but it's
easy to see why things were designed as they were given the queue
entry persistence. Getting current behavior accurately documented is
my first hope, and fixing any gotcha cases my second, but being able
to have inotify_add_watch/read look and feel more like
open/epoll_ctl(EPOLL_CTL_ADD))/epoll_wait would be my ultimate wish.
The asynchronous-feeling handle removal is not a great interface in
general to deal with.

Best regards,
Jeff

On Tue, May 27, 2014 at 2:32 PM, Heinrich Schuchardt xypron.g...@gmx.de wrote:
 On 27.05.2014 19:25, Jeff Smith wrote:

 inotify's behavior concerning events from removed watches (they do
 happen) and watch descriptor reuse (beyond my knowledge) is currently
 undocumented.

 Although it mimics a standard multiplexing interface in most regards,
 writing a robust user-space handler is comparatively more complex due
 to the atypical delivery of stale wd events preceding an IN_IGNORE
 event and a lack of guarantees about how quickly a wd can be reused
 via inotify_add_watch(). Not being familiar with inotify/fsnotify
 internals, it's not trivially obvious to me how the fsnotify_group
 management is being done. Up to the present, I've maintained queues of
 dead wd wrappers (or at least a counter) to filter stale events, but
 I am clueless whether or not this is overkill.

 If removed descriptors are reserved until the IN_IGNORE event is
 drained from the read queue, could that be formally guaranteed? If
 it's not, is it functionality that could ever reasonably be expected
 to be added, short of some other form of new (optional?)
 queue-filter-on-rm functionality? It's my experience that the
 asynchronous handling of watch removals is a cost that seldom serves
 much user benefit.

 Regards,
 Jeff

 Hello Jeff,

 I tried to dive a bit into the code. This is what I understand:

 Function inotify_ignored_and_remove_idr is called after the mark has been
 removed. This function puts an IN_IGNORED event onto the inotify queue and
 removes the watch descriptor from the list of used watch descriptors using
 function idr_remove.

 With a test program I could receive the IN_IGNORED event. This behavior is
 currently not documented in the manpages (inotify.7 and inotify_rm_watch.2).

 When inotify_add_watch is called it uses function idr_alloc_cyclic to assign
 a watch descriptor ID. This function starts looking for an unused id
 starting with the id after the last assigned watch descriptor.

 This implies that in most cases inotify_add_watch will return a watch
 descriptor different to the one released by a prior call to
 inotify_rm_watch. But there is no guarantee.

 I consider this a bug.

 I CCed the maintainers of the inotify interface hoping that they can provide
 a better solution.

 Until such a solution is provided I suggest you use the following
 workaround. After calling inotify_rm_watch read from the inotify file
 descriptor until you reach the matching IN_IGNORED event.

 Only thereafter you can safely call inotify_add_watch again.

 Best regards

 Heinrich Schuchardt

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-26 Thread Jeff Smith
I've got no real issues at this point, but could you perhaps elaborate
a bit on the rough order of magnitude of "long" time and what cases
would be slower? I'm pretty sure that some places I've worked that
still have some remap_file_pages() logic in place aren't too religious
about checking dmesg.
--Jeff

On Mon, May 26, 2014 at 8:47 AM, Kirill A. Shutemov
 wrote:
> Jeff Smith wrote:
>> OK, I misinterpreted "the overlapped part of the mapping(s) will be
>> discarded" as discarding the -new- mappings. My objections about
>> needing a replacement for remap_file_pages() are gone, but my concerns
>> about existing code still remain.
>
> As I said, emulation will be there for long time. With warning in dmesg.
> The emulation is interface-compatible, but slower in some cases (not
> your's).
>
> --
>  Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-26 Thread Jeff Smith
OK, I misinterpreted "the overlapped part of the mapping(s) will be
discarded" as discarding the -new- mappings. My objections about
needing a replacement for remap_file_pages() are gone, but my concerns
about existing code still remain.
--Jeff

On Mon, May 26, 2014 at 8:35 AM, Paolo Bonzini  wrote:
> Il 26/05/2014 15:24, Jeff Smith ha scritto:
>
>> Your addr2 mmap() call is a bit incorrect semantically and
>> syntactically (you skipped the length arg). The addr2 request will
>> fail because mmap() does not implicitly munmap() occupied virtual
>> address space.
>
>
> With MAP_FIXED it does.  It is in the man page.
>
> Paolo
>
>
>> Even if you did that, the following still has a race
>> condition between the addr2 request and another thread grabbing the
>> same virtual space, which nothing short of a lock on all threads'
>> mmap()-ing logic can protect:
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-26 Thread Jeff Smith
>> Mirrored mapping is absolutely required by several
>> independent proprietary platforms I'm aware of, and remap_file_pages()
>> has historically been the only sane way to accomplish this. (i.e.,
>> shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd
>> half).
>
> Em.. What's wrong with shm_open() + two mmap()s to cover both halfs?
>
> fd = shm_open();
> addr1 = mmap(NULL, 2*SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> addr2 = mmap(addr1 + SIZE, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 
> 0);
>
> Is there a reason why it doens't work?

Your addr2 mmap() call is a bit incorrect semantically and
syntactically (you skipped the length arg). The addr2 request will
fail because mmap() does not implicitly munmap() occupied virtual
address space. Even if you did that, the following still has a race
condition between the addr2 request and another thread grabbing the
same virtual space, which nothing short of a lock on all threads'
mmap()-ing logic can protect:

addr1 = mmap(NULL, 2*SIZE, PROT_READ, MAP_SHARED, fd, 0);
munmap(addr1 + SIZE, SIZE);
/* race on virtual address space here, but n/a for remap_file_pages() ... */
addr2 = mmap(addr1, SIZE, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);

remap_file_pages() is not subject to this problem and allows the
creation of considerably cleaner code. Protecting the address space
corner cases with locks or arbitrarily bounded munmap()-and-retry
loops is a substantial burden over the historically provided approach.

>> but failing that, a reservation API would need
>> to be created (possibly a MAP_RESERVE flag) that would set aside a
>> region that could only be subsequently mapped via explicit
>> address-requesting mmap() calls.
>
> I don't get this part.

I'm proposing that a call along the lines of mmap(NULL, len, prot,
MAP_RESERVED | ..., fd, offset) could return a virtual address block
that is -not- actually mapped but -is- protected from other mmap()
calls not explicitly requesting the space via their addr parameters.
Unfortunately, you'd also need to define separate semantics to
un-reserving not-mapped space, etc.

The important issue is that users need to be able to trivially protect
themselves from transient virtual address space congestion problems
and only fail early on long-term exhaustion situations.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-26 Thread Jeff Smith
 Mirrored mapping is absolutely required by several
 independent proprietary platforms I'm aware of, and remap_file_pages()
 has historically been the only sane way to accomplish this. (i.e.,
 shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd
 half).

 Em.. What's wrong with shm_open() + two mmap()s to cover both halfs?

 fd = shm_open();
 addr1 = mmap(NULL, 2*SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
 addr2 = mmap(addr1 + SIZE, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 
 0);

 Is there a reason why it doens't work?

Your addr2 mmap() call is a bit incorrect semantically and
syntactically (you skipped the length arg). The addr2 request will
fail because mmap() does not implicitly munmap() occupied virtual
address space. Even if you did that, the following still has a race
condition between the addr2 request and another thread grabbing the
same virtual space, which nothing short of a lock on all threads'
mmap()-ing logic can protect:

addr1 = mmap(NULL, 2*SIZE, PROT_READ, MAP_SHARED, fd, 0);
munmap(addr1 + SIZE, SIZE);
/* race on virtual address space here, but n/a for remap_file_pages() ... */
addr2 = mmap(addr1, SIZE, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);

remap_file_pages() is not subject to this problem and allows the
creation of considerably cleaner code. Protecting the address space
corner cases with locks or arbitrarily bounded munmap()-and-retry
loops is a substantial burden over the historically provided approach.

 but failing that, a reservation API would need
 to be created (possibly a MAP_RESERVE flag) that would set aside a
 region that could only be subsequently mapped via explicit
 address-requesting mmap() calls.

 I don't get this part.

I'm proposing that a call along the lines of mmap(NULL, len, prot,
MAP_RESERVED | ..., fd, offset) could return a virtual address block
that is -not- actually mapped but -is- protected from other mmap()
calls not explicitly requesting the space via their addr parameters.
Unfortunately, you'd also need to define separate semantics to
un-reserving not-mapped space, etc.

The important issue is that users need to be able to trivially protect
themselves from transient virtual address space congestion problems
and only fail early on long-term exhaustion situations.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-26 Thread Jeff Smith
OK, I misinterpreted the overlapped part of the mapping(s) will be
discarded as discarding the -new- mappings. My objections about
needing a replacement for remap_file_pages() are gone, but my concerns
about existing code still remain.
--Jeff

On Mon, May 26, 2014 at 8:35 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 26/05/2014 15:24, Jeff Smith ha scritto:

 Your addr2 mmap() call is a bit incorrect semantically and
 syntactically (you skipped the length arg). The addr2 request will
 fail because mmap() does not implicitly munmap() occupied virtual
 address space.


 With MAP_FIXED it does.  It is in the man page.

 Paolo


 Even if you did that, the following still has a race
 condition between the addr2 request and another thread grabbing the
 same virtual space, which nothing short of a lock on all threads'
 mmap()-ing logic can protect:


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-26 Thread Jeff Smith
I've got no real issues at this point, but could you perhaps elaborate
a bit on the rough order of magnitude of long time and what cases
would be slower? I'm pretty sure that some places I've worked that
still have some remap_file_pages() logic in place aren't too religious
about checking dmesg.
--Jeff

On Mon, May 26, 2014 at 8:47 AM, Kirill A. Shutemov
kirill.shute...@linux.intel.com wrote:
 Jeff Smith wrote:
 OK, I misinterpreted the overlapped part of the mapping(s) will be
 discarded as discarding the -new- mappings. My objections about
 needing a replacement for remap_file_pages() are gone, but my concerns
 about existing code still remain.

 As I said, emulation will be there for long time. With warning in dmesg.
 The emulation is interface-compatible, but slower in some cases (not
 your's).

 --
  Kirill A. Shutemov
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-25 Thread Jeff Smith
On Mon, May 19, 2014 at 9:38 AM, Christoph Hellwig  wrote:
> On Mon, May 19, 2014 at 05:35:40PM +0300, Kirill A. Shutemov wrote:
>> >From functional POV, emulation *should* be identical to original
>> remap_file_pages(), but slower. It would be nice, if you test it early.
>>
>> It's not clear yet how long emulation will be there.
>
> Stop right there.  We found out about two real life users of
> remap_file_pages() already, without even committing the patches to warn
> about using it to any tree.
>
> I think at this point the whole idea of removing the API should be dead
> on the floor, as we do not needlessly break userspace programs.
>
> If we can get rid of the ugly guts and provide a good enough emulation
> that the user won't cry I'd love to get rid of this cruft, but even
> that doesn't look certain yet.

Sorry for being late to the party, but I just noticed this proposal
via the LWN summary byline.

I wanted to comment that Kenny's use case is (I believe) quite
widespread. I've used the technique since ~2008, and I've come across
other people in subsequent jobs who independently developed the same
technique. Mirrored mapping is absolutely required by several
independent proprietary platforms I'm aware of, and remap_file_pages()
has historically been the only sane way to accomplish this. (i.e.,
shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd
half).

It may not be individuals who are involved in the kernel development
scene to any great extent, but I am sure that remap_file_pages() being
deprecated would seriously piss off a lot of individuals. The pattern
has even had a section in the Wikipedia article for quite some time:
http://en.wikipedia.org/wiki/Circular_buffer#Mirroring

It would be most preferable from a user standpoint to keep the
existing system intact, but failing that, a reservation API would need
to be created (possibly a MAP_RESERVE flag) that would set aside a
region that could only be subsequently mapped via explicit
address-requesting mmap() calls.

Thanks for any consideration of these concerns.
--Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remap_file_pages() use

2014-05-25 Thread Jeff Smith
On Mon, May 19, 2014 at 9:38 AM, Christoph Hellwig h...@infradead.org wrote:
 On Mon, May 19, 2014 at 05:35:40PM +0300, Kirill A. Shutemov wrote:
 From functional POV, emulation *should* be identical to original
 remap_file_pages(), but slower. It would be nice, if you test it early.

 It's not clear yet how long emulation will be there.

 Stop right there.  We found out about two real life users of
 remap_file_pages() already, without even committing the patches to warn
 about using it to any tree.

 I think at this point the whole idea of removing the API should be dead
 on the floor, as we do not needlessly break userspace programs.

 If we can get rid of the ugly guts and provide a good enough emulation
 that the user won't cry I'd love to get rid of this cruft, but even
 that doesn't look certain yet.

Sorry for being late to the party, but I just noticed this proposal
via the LWN summary byline.

I wanted to comment that Kenny's use case is (I believe) quite
widespread. I've used the technique since ~2008, and I've come across
other people in subsequent jobs who independently developed the same
technique. Mirrored mapping is absolutely required by several
independent proprietary platforms I'm aware of, and remap_file_pages()
has historically been the only sane way to accomplish this. (i.e.,
shm_open(), mmap(NULL, 2^(n+1) pages), remap_file_pages() on 2nd
half).

It may not be individuals who are involved in the kernel development
scene to any great extent, but I am sure that remap_file_pages() being
deprecated would seriously piss off a lot of individuals. The pattern
has even had a section in the Wikipedia article for quite some time:
http://en.wikipedia.org/wiki/Circular_buffer#Mirroring

It would be most preferable from a user standpoint to keep the
existing system intact, but failing that, a reservation API would need
to be created (possibly a MAP_RESERVE flag) that would set aside a
region that could only be subsequently mapped via explicit
address-requesting mmap() calls.

Thanks for any consideration of these concerns.
--Jeff
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/