from:"Davide Libenzi"

Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi

On Thu, 26 Mar 2015, David Rientjes wrote:

> Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
> hugetlb vma is long standing and there may be applications that break as a 
> result of changing the behavior: a database that reserves all allocated 
> hugetlb memory with mmap() so that it always has exclusive access to those 
> hugepages, whether they are faulted or not, and maintains its own hugepage 
> pool (which is common), may test the return value of munmap() and depend 
> on it returning -EINVAL to determine if it is freeing memory that was 
> either dynamically allocated or mapped from the hugetlb reserved pool.

You went a long way to create such a case.
But, in your case, that application will erroneously considering hugepage 
mmaped memory, as dynamically allocated, since it will always get EINVAL, 
unless it passes an aligned size. Aligned size, which a fix like the one 
posted in the patch will still leave as success.
OTOH, an application, which might be more common than the one you posted,
which calls munmap() to release a pointer which it validly got from a 
previous mmap(), will leak huge pages as all the issued munmaps will fail.

> If we were to go back in time and decide this when the munmap() behavior 
> for hugetlb vmas was originally introduced, that would be valid.  The 
> problem is that it could lead to userspace breakage and that's a 
> non-starter.
> 
> What we can do is improve the documentation and man-page to clearly 
> specify the long-standing behavior so that nobody encounters unexpected 
> results in the future.

This way you will leave the mmap API with broken semantics.
In any case, I am done arguing.
I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
the mmap man pages with the new funny behaviour.

- Davide

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi

On Wed, 25 Mar 2015, David Rientjes wrote:

> I looked at this thread at http://marc.info/?t=14139250881 since I 
> didn't have it in my mailbox, and I didn't get a chance to actually run 
> your test code.
> 
> In short, I think what you're saying is that
> 
>   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>   munmap(ptr, 4KB) == EINVAL

I am not sure you have read the email correctly:

munmap(mmap(size, HUGETLB), size) = EFAIL

For every size not multiple of the huge page size.
Whereas:

munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK


> Respecting the mmap(2) POSIX specification?  I don't think 
> mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates 
> POSIX.1-2001 and not only because it obviously doesn't address 
> MAP_HUGETLB, but I don't think the spec says the system cannot map more 
> memory than len.
> 
> Using MAP_HUGETLB is really more a library function than anything else 
> since you could easily implement the same behavior in a library.  That 
> function includes aligning len to the hugepage size, so doing
> 
>   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> 
> is the equivalent to doing
> 
>   ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
> 
> and that doesn't violate any spec.  But your patch doesn't change mmap() 
> at all, so let's forget about that.

That is what every mmap() implementation does, irrespectively of any page 
size. And that is also what the POSIX spec states.
The size will be automatically rounded up to a multiple of the underline 
physical page size.
The problem is not mmap() though, in this case.


> The question you pose is whether munmap(ptr, 4KB) should succeed for a 
> hugetlb vma and in your patch you align this to the hugepage size of the 
> vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
> for a non-hugetlb vma.
> 
> The munmap() spec says the whole pages that include any part of the passed 
> length should be unmapped.  In spirit, I would agree with you that the 
> page size for the vma is the hugepage size so that would be what would be 
> unmapped.
> 
> But that's going by a spec that doesn't address hugepages and is worded in 
> a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
> done.  It carries no notion of variable page sizes and how hugepages 
> should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
> this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
> but the munmap() behavior for hugetlb vmas is not restricted.
> 
> It would seem too dangerous at this point to change the behavior of 
> munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
> arise from aligning to the hugepage size.

You mean, there is an harder failure than the current failure? :)


> Some applications purposefully reserve hugetlb pages by mmap() and never 
> munmap() them so they have exclusive access to hugepages that were 
> allocated either at boot or runtime by the sysadmin.  If they depend on 
> the return value of munmap() to determine if memory to free is memory 
> dynamically allocated by the application or reserved as hugetlb memory, 
> then this would cause them to break.  I can't say for certain that no such 
> application exists.

The fact that certain applications will seldomly call an API, should be no 
reason for API to have bugs, or at the very least, a bahviour which not 
only in not documented in the man pages, but also totally unrespectful of 
the normal mmap/munmap semantics.
Again, the scenario that you are picturing, is one where an application 
relies on a permanent (that is what it is - it always fails unless the 
munmap size is multiple than huge page size) failure of munmap, to do some 
productive task.
An munmap() of huge page aligned size, will succeed in both case (vanilla, 
and patch).


> Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() 
> specification, and there's a potential userspace breakage if the length 
> becomes hugepage aligned, I think the do_unmap() implementation is correct 
> as it stands.

If the length is huge page aligned, it will be working with or without 
patch applied.
The problem is for the other 2097151 out of 2097152 cases, where length is 
not indeed aligned to 2MB (or whatever hugepage size is for the 
architecture).


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-25 Thread Davide Libenzi

On Wed, 25 Mar 2015, Hugh Dickins wrote:

> When you say "tracking back to 3.2.x", I think you mean you've tried as
> far back as 3.2.x and found the same behaviour, but not tried further?
> 
> From the source, it looks like this is unchanged since MAP_HUGETLB was
> introduced in 2.6.32.  And is the same behaviour as you've been given
> with hugetlbfs since it arrived in 2.5.46.

Went back checking the application logs, an I had to patch the code (the 
application one - to align size on munmap()) on May 2014.
But it might be we started noticing it at that time, because before the 
allocation pattern was simply monotonic, so it could be it was always 
there.
The bug test app is ten lines of code, to verify that.

> The patch looks to me as if it will do what you want, and I agree
> that it's displeasing that you can mmap a size, and then fail to
> munmap that same size.
> 
> But I tend to think that's simply typical of the clunkiness we offer
> you with hugetlbfs and MAP_HUGETLB: that it would have been better to
> make a different choice all those years ago, but wrong to change the
> user interface now.
> 
> Perhaps others will disagree.  And if I'm wrong, and the behaviour
> got somehow changed in 3.2, then that's a different story and we
> should fix it back.

This is not an interface change, in the sense old clients will continue to 
work.
This is simply respecting the mmap(2) POSIX specification, for a feature 
which has been exposed via mmap(2).

- Davide

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-22 Thread Davide Libenzi

[Resending with proper CC list suggested by Andrew]

Calling munmap on a MAP_HUGETLB area, and a size which is not 2MB aligned, 
causes munmap to fail.  Tested on 3.13.x but tracking back to 3.2.x.
In do_munmap() we forcibly want a 4KB default page, and we wrongly 
calculate the end of the map.  Since the calculated end is within the end 
address of the target vma, we try to do a split with an address right in 
the middle of a huge page, which would fail with EINVAL.

Tentative (untested) patch and test case attached (be sure you have a few 
huge pages available via /proc/sys/vm/nr_hugepages tinkering).


Signed-Off-By: Davide Libenzi 


- Davide


diff --git a/mm/mmap.c b/mm/mmap.c
index 7f85520..6dba257 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;
 
-   len = PAGE_ALIGN(len);
-   if (len == 0)
-   return -EINVAL;
-
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
prev = vma->vm_prev;
/* we have  start < vma->vm_end  */
 
+   if (likely(!is_vm_hugetlb_page(vma)))
+   len = PAGE_ALIGN(len);
+   else {
+   unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+   len = ALIGN(len, hpage_size);
+   }
+   if (unlikely(len == 0))
+   return -EINVAL;
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (vma->vm_start >= end)




[hugebug.c]

#include 
#include 
#include 
#include 
#include 

static void test(int flags, size_t size)
{
void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (addr == MAP_FAILED)
{
perror("mmap");
exit(1);
}
*(char*) addr = 17;

if (munmap(addr, size) != 0)
{
perror("munmap");
exit(1);
}
}

int main(int ac, const char** av)
{
static const size_t hugepage_size = 2 * 1024 * 1024;

printf("Testing normal pages with 2MB size ...\n");
test(0, hugepage_size);
printf("OK\n");

printf("Testing huge pages with 2MB size ...\n");
test(MAP_HUGETLB, hugepage_size);
printf("OK\n");


printf("Testing normal pages with 4KB byte size ...\n");
test(0, 4096);
printf("OK\n");

printf("Testing huge pages with 4KB byte size ...\n");
test(MAP_HUGETLB, 4096);
printf("OK\n");


printf("Testing normal pages with 1 byte size ...\n");
test(0, 1);
printf("OK\n");

printf("Testing huge pages with 1 byte size ...\n");
test(MAP_HUGETLB, 1);
printf("OK\n");

return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-21 Thread Davide Libenzi

On Tue, 21 Oct 2014, Davide Libenzi wrote:

> Calling mmap with MAP_HUGETLB and a size which is not 2MB aligned, causes 
> mmap to fail. Tested on 3.13.x but tracking back to 3.2.x.

A couple of "un" were found under my desk, as it's clearly munmap which is 
failing.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-21 Thread Davide Libenzi

Calling mmap with MAP_HUGETLB and a size which is not 2MB aligned, causes 
mmap to fail. Tested on 3.13.x but tracking back to 3.2.x.
In do_munmap() we forcibly want a 4KB default page, and we wrongly calculate
the end of the map, by doing a split right in the middle of a huge page, 
which will result in EINVAL.
Tentative (untested) patch and test case attached (be sure you have a few 
huge pages available via /proc/sys/vm/nr_hugepages tinkering).


Signed-Off-By: Davide Libenzi 


- Davide


diff --git a/mm/mmap.c b/mm/mmap.c
index 7f85520..6dba257 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;
 
-   len = PAGE_ALIGN(len);
-   if (len == 0)
-   return -EINVAL;
-
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
prev = vma->vm_prev;
/* we have  start < vma->vm_end  */
 
+   if (likely(!is_vm_hugetlb_page(vma)))
+   len = PAGE_ALIGN(len);
+   else {
+   unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+   len = ALIGN(len, hpage_size);
+   }
+   if (unlikely(len == 0))
+   return -EINVAL;
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (vma->vm_start >= end)





[hugebug.c]

#include 
#include 
#include 
#include 
#include 

static void test(int flags, size_t size)
{
void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (addr == MAP_FAILED)
{
perror("mmap");
exit(1);
}
*(char*) addr = 17;

if (munmap(addr, size) != 0)
{
perror("munmap");
exit(1);
}
}

int main(int ac, const char** av)
{
static const size_t hugepage_size = 2 * 1024 * 1024;

printf("Testing normal pages with 2MB size ...\n");
test(0, hugepage_size);
printf("OK\n");

printf("Testing huge pages with 2MB size ...\n");
test(MAP_HUGETLB, hugepage_size);
printf("OK\n");


printf("Testing normal pages with 4KB byte size ...\n");
test(0, 4096);
printf("OK\n");

printf("Testing huge pages with 4KB byte size ...\n");
test(MAP_HUGETLB, 4096);
printf("OK\n");


printf("Testing normal pages with 1 byte size ...\n");
test(0, 1);
printf("OK\n");

printf("Testing huge pages with 1 byte size ...\n");
test(MAP_HUGETLB, 1);
printf("OK\n");

return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: epoll and shared fd's

2008-02-26 Thread Davide Libenzi

On Tue, 26 Feb 2008, Michael Kerrisk wrote:

> Okay -- I'll look at it some more.  I am however loathe to drop the
> term open file description, because POSIX uses, as well as a number of
> other Linux man pages by now.

Heh, POSIX. Now doesn't take a genius to see that "file description" and 
"file descriptor" looks amazingly similar, does it? :)


> > That'd mean placing an eventpoll custom hook into sys_close(). Looks 
> > very bad to me, and probably will look even worse to other kernel 
> > folks. Is not much a performance issue (a check to see if a file* is 
> > an eventpoll file is as easy as comparing the f_op pointer), but a 
> > design/style issue.
>
> Oh -- I wasn't suggesting we could make the change now -- it would
> break the ABI and all that.  I was just wondering why the decision
> wasn't made to do it the other way to begin with.  The existing
> semantics are somewhat couterintuitive, and potentially interact
> libraries that do private manipulations with file descriptors.

For the same reason that a custom hook in sys_close wouldn't have passed 
the radar ;)
As far as problems with libraries doing tricks with fds, that's an issue 
that goes beyond epoll.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: epoll and shared fd's

2008-02-26 Thread Davide Libenzi

On Tue, 26 Feb 2008, Michael Kerrisk wrote:

> Following up after quite some time:
> 
> Davide Libenzi wrote:
> > On Sat, 26 Jan 2008, Michael Kerrisk wrote:
> > 
> >> On Jan 25, 2008 12:57 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> >>> On Thu, 24 Jan 2008, Pierre Habouzit wrote:
> >>>
> >>>> On Fri, Jan 18, 2008 at 09:10:18PM +, Davide Libenzi wrote:
> >>>>> On Fri, 18 Jan 2008, Pierre Habouzit wrote:
> >>>>>
> >>>>>>   Hi,
> >>>>>>
> >>>>>>   I just came across a strange behavior of epoll that seems to
> >>>>>> contradict the documentation. Here is what happens:
> >>>>>>
> >>>>>> * I have two processes P1 and P2, P1 accept()s connections, and send 
> >>>>>> the
> >>>>>>   resulting file descriptors to P2 through a unix socket.
> >>>>>>
> >>>>>> * P2 registers the received socket in his epollfd.
> >>>>>>
> >>>>>>   [time passes]
> >>>>>>
> >>>>>> * P2 is done with the socket and closes it
> >>>>>>
> >>>>>> * P2 gets events for the socket again !
> >>>>>>
> >>>>>>
> >>>>>>   Though the documentation says that if a process closes a file
> >>>>>> descriptor, it gets unregistered. And yes I'm sure that P2 doens't 
> >>>>>> dup()
> >>>>>> the file descriptor. Though (because of a bug) it was still open in
> >>>>>> P1[0], hence the referenced socket still live at the kernel level.
> >>>>>>
> >>>>>>   Of course the userland workaround is to force the EPOLL_CTL_DEL 
> >>>>>> before
> >>>>>> the close, which I now do, but costs me a syscall where I wanted to
> >>>>>> spare one :|
> >>>>> For epoll, a close is when the kernel file* is released (that is, when 
> >>>>> all
> >>>>> its instances are gone).
> >>>>> We could put a special handling in filp_close(), but I don't think is a
> >>>>> good idea, and we're better live with the current behaviour.
> >>>>   Okay, maybe updating the linux manpages to be more clear about that is
> >>>> the way to go then. Thanks
> >>> Sure. I'll send Michael Kerrisk and updated statement for the A6 answer in
> >>> the epoll man page.
> >> Thanks Davide -- yes please send me a patch.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to [EMAIL PROTECTED]
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at  http://www.tux.org/lkml/
> >>
> > 
> > Something like the one below ...
> > 
> > 
> > - Davide
> > 
> > 
> > 
> > --- epoll.4 2008-01-26 12:58:21.0 -0800
> > +++ epoll.4.new 2008-01-26 13:06:36.0 -0800
> > @@ -285,7 +285,19 @@
> >  sets automatically?
> >  .TP
> >  .B A6
> > -Yes.
> > +A file descriptor is the userspace counterpart of an internal kernel 
> > handle.
> > +Every time a process calls functions liks
> > +.BR dup (2),
> > +.BR dup2 (2)
> > +or
> > +.BR fork (2),
> > +a new file descriptor referring to the same internal kernel handle is
> > +created. The internal kernel handle remains alive until all the userspace
> > +file descriptors have been closed.
> > +The
> > +.BR epoll (4)
> > +interface automatically removes the internal kernel handle from the set,
> > +once all the file descriptor instances have been closed.
> >  .TP
> >  .B Q7
> >  If more than one event occurs between
> 
> Davide,
> 
> Two points.
> 
> a) I did a
> 
> s/internal kernel handle/open file description/
> 
> since that is the POSIX term for the internal handle.
> 
> b) It seems to me that you text doesn't quite make the point explicit
> enough.  I've tried to rewrite it; could you please check:
> 
>A6 Yes, but be aware of the following point.  A  file
>   descriptor is a reference to an open file descrip-
>   tion (see  open(2)).   Whenever  a  descriptor  is
>   duplicated  via dup(2), dup2(2), fcntl(2) F_DUPFD,
>   or

Re: epoll design problems with common fork/exec patterns

2008-02-26 Thread Davide Libenzi

On Tue, 26 Feb 2008, Michael Kerrisk wrote:

> Davide Libenzi wrote:
> > On Sun, 28 Oct 2007, David Schwartz wrote:
> > 
> >> Eric Dumazet wrote:
> >>
> >>> Events are not necessarly reported "by descriptors". epoll uses an opaque
> >>> field provided by the user.
> >>>
> >>> It's up to the user to properly chose a tag that will makes sense
> >>> if the user
> >>> app is playing dup()/close() games for example.
> >> Great. So the only issue then is that the documentation is confusing. It
> >> frequently uses the term "fd" where it means file. For example, it says:
> >>
> >>   Q1 What  happens  if  you  add  the  same fd to an
> >> epoll_set
> >>  twice?
> >>
> >>   A1 You will probably get EEXIST.  However,  it  is
> >> possible
> >>  that  two  threads  may  add the same fd twice. This 
> >> is
> >> a
> >>  harmless condition.
> >>
> >> This gives no reason to think there's anything wrong with adding the same
> >> file twice so long as you do so through different descriptors. (One can
> >> imagine an application that does this to segregate read and write 
> >> operations
> >> to avoid a race where the descriptor is closed from under a writer due to
> >> handling a fatal read error.) Obviously, that won't work.
> > 
> > I agree, that is confusing. However, you can safely add two different file 
> > descriptors pointing to the same file*, with different event masks, and 
> > that will work as expected.
> 
> So can I summarize what I understand:
> 
> a) Adding the same file descriptor twice to an epoll set will cause an
> error (EEXIST).

Yes.



> b) In a separate message to linux-man, Chris Heath says that two threads
> *can't* add the same fd twice to an epoll set, despite what the existing
> man page text says.  I haven't tested that, but it sounds to me as though
> it is likely to be true.  Can you comment please Davide?

Yes, you can't add the same fd twice. Think about a DB where "file*,fd" is 
the key.



> c) It is possible to add duplicated file descriptors referring to the same
> underlying open file description ("file *").  As you note, this can be a
> useful filtering technique, if the two file descriptors specify different
> masks.
> 
> Assuming that is all correct, for man-pages-2.79, I've reworked the text
> for Q1/A1 as follows:
> 
>Q1 What  happens  if you add the same file descriptor
>   to an epoll set twice?
> 
>A1 You will probably get EEXIST.  However, it is pos-
>   sible   to   add  a  duplicate  (dup(2),  dup2(2),
>   fcntl(2) F_DUPFD, fork(2)) descriptor to the  same
>   epoll  set.   This  can  be a useful technique for
>   filtering events, if the duplicate  file  descrip-
>   tors are registered with different events masks.
> 
> Seem okay Davide?

Looks sane to me.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi

Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by simplifying the 
code to get rid of a few lines that became superfluous after the previous 
epoll changes.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-11 15:31:02.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-11 15:32:46.0 -0800
@@ -257,25 +257,6 @@
(p1->file < p2->file ? -1 : p1->fd - p2->fd));
 }
 
-/* Special initialization for the RB tree node to detect linkage */
-static inline void ep_rb_initnode(struct rb_node *n)
-{
-   rb_set_parent(n, n);
-}
-
-/* Removes a node from the RB tree and marks it for a fast is-linked check */
-static inline void ep_rb_erase(struct rb_node *n, struct rb_root *r)
-{
-   rb_erase(n, r);
-   rb_set_parent(n, n);
-}
-
-/* Fast check to verify that the item is linked to the main RB tree */
-static inline int ep_rb_linked(struct rb_node *n)
-{
-   return rb_parent(n) != n;
-}
-
 /* Tells us if the item is currently linked */
 static inline int ep_is_linked(struct list_head *p)
 {
@@ -283,13 +264,13 @@
 }
 
 /* Get the "struct epitem" from a wait queue pointer */
-static inline struct epitem * ep_item_from_wait(wait_queue_t *p)
+static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
return container_of(p, struct eppoll_entry, wait)->base;
 }
 
 /* Get the "struct epitem" from an epoll queue wrapper */
-static inline struct epitem * ep_item_from_epqueue(poll_table *p)
+static inline struct epitem *ep_item_from_epqueue(poll_table *p)
 {
return container_of(p, struct ep_pqueue, pt)->epi;
 }
@@ -411,8 +392,7 @@
list_del_init(&epi->fllink);
spin_unlock(&file->f_ep_lock);
 
-   if (ep_rb_linked(&epi->rbn))
-   ep_rb_erase(&epi->rbn, &ep->rbr);
+   rb_erase(&epi->rbn, &ep->rbr);
 
spin_lock_irqsave(&ep->lock, flags);
if (ep_is_linked(&epi->rdllink))
@@ -728,7 +708,6 @@
goto error_return;
 
/* Item initialization follow here ... */
-   ep_rb_initnode(&epi->rbn);
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi

On Mon, 11 Feb 2008, Andrew Morton wrote:

> On Sun, 10 Feb 2008 13:32:01 -0800 (PST)
> Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
> > rb_set_parent() accesses node's pointer in its code. This creates a 
> > warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
> > memory access. The warning is harmless since the following rb-tree node 
> > insert is going to overwrite the node data. In any case I think it's 
> > better to not have that happening at all, and fix it by properly 
> > initializing the data.
> > 
> > 
> > Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>
> > 
> > 
> > - Davide
> > 
> > 
> > ---
> >  fs/eventpoll.c |2 +-
> >  include/linux/rbtree.h |   12 
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6.mod/fs/eventpoll.c
> > ===
> > --- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 
> > -0800
> > +++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
> > @@ -260,7 +260,7 @@
> >  /* Special initialization for the RB tree node to detect linkage */
> >  static inline void ep_rb_initnode(struct rb_node *n)
> >  {
> > -   rb_set_parent(n, n);
> > +   rb_init_node(n, n);
> >  }
> >  
> >  /* Removes a node from the RB tree and marks it for a fast is-linked check 
> > */
> > Index: linux-2.6.mod/include/linux/rbtree.h
> > ===
> > --- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 
> > 12:36:13.0 -0800
> > +++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
> > -0800
> > @@ -112,6 +112,18 @@
> > struct rb_node *rb_node;
> >  };
> >  
> > +/**
> > + * rb_init_node - Initializes the node internal data
> > + *
> > + * @node: Pointer to the RB-Tree node
> > + * @parent: Pointer to the parent node, or NULL
> > + *
> > + */
> > +static inline void rb_init_node(struct rb_node *node, struct rb_node 
> > *parent)
> > +{
> > +   node->rb_parent_color = (unsigned long) parent;
> > +   node->rb_left = node->rb_right = NULL;
> > +}
> 
> Is epoll the only rbtree-using code which exhibits this problem?  If so,
> what is epoll doing differently from all the others?

Dunno, but I don't think so. Epoll initializes the node to a state so that 
later on can check if the file-fd item is inserted or not. And it uses the 
"parent" information for that. But rb_set_parent() (that is used in the 
current initialization code) uses the data in the node, and this triggers 
the uninitialized memory access.
Taking a better look at it, the code (after the latest changes) has no 
more point-of-failures after the rb-tree insert, so I can probably avoid 
doing the node's parent-initialization and check altogether.
Yeah, let's that ;)

 eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Andrew, drop that one. I'm gonna send the new one after some test...



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] avoid kmemcheck warning in epoll

2008-02-10 Thread Davide Libenzi

Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by properly 
initializing the data.


Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/eventpoll.c |2 +-
 include/linux/rbtree.h |   12 
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
@@ -260,7 +260,7 @@
 /* Special initialization for the RB tree node to detect linkage */
 static inline void ep_rb_initnode(struct rb_node *n)
 {
-   rb_set_parent(n, n);
+   rb_init_node(n, n);
 }
 
 /* Removes a node from the RB tree and marks it for a fast is-linked check */
Index: linux-2.6.mod/include/linux/rbtree.h
===
--- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 12:36:13.0 
-0800
+++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
-0800
@@ -112,6 +112,18 @@
struct rb_node *rb_node;
 };
 
+/**
+ * rb_init_node - Initializes the node internal data
+ *
+ * @node: Pointer to the RB-Tree node
+ * @parent: Pointer to the parent node, or NULL
+ *
+ */
+static inline void rb_init_node(struct rb_node *node, struct rb_node *parent)
+{
+   node->rb_parent_color = (unsigned long) parent;
+   node->rb_left = node->rb_right = NULL;
+}
 
 #define rb_parent(r)   ((struct rb_node *)((r)->rb_parent_color & ~3))
 #define rb_color(r)   ((r)->rb_parent_color & 1)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: epoll and shared fd's

2008-01-24 Thread Davide Libenzi

On Thu, 24 Jan 2008, Pierre Habouzit wrote:

> On Fri, Jan 18, 2008 at 09:10:18PM +0000, Davide Libenzi wrote:
> > On Fri, 18 Jan 2008, Pierre Habouzit wrote:
> > 
> > >   Hi,
> > > 
> > >   I just came across a strange behavior of epoll that seems to
> > > contradict the documentation. Here is what happens:
> > > 
> > > * I have two processes P1 and P2, P1 accept()s connections, and send the
> > >   resulting file descriptors to P2 through a unix socket.
> > > 
> > > * P2 registers the received socket in his epollfd.
> > > 
> > >   [time passes]
> > > 
> > > * P2 is done with the socket and closes it
> > > 
> > > * P2 gets events for the socket again !
> > > 
> > > 
> > >   Though the documentation says that if a process closes a file
> > > descriptor, it gets unregistered. And yes I'm sure that P2 doens't dup()
> > > the file descriptor. Though (because of a bug) it was still open in
> > > P1[0], hence the referenced socket still live at the kernel level.
> > > 
> > >   Of course the userland workaround is to force the EPOLL_CTL_DEL before
> > > the close, which I now do, but costs me a syscall where I wanted to
> > > spare one :|
> > 
> > For epoll, a close is when the kernel file* is released (that is, when all 
> > its instances are gone).
> > We could put a special handling in filp_close(), but I don't think is a 
> > good idea, and we're better live with the current behaviour.
> 
>   Okay, maybe updating the linux manpages to be more clear about that is
> the way to go then. Thanks

Sure. I'll send Michael Kerrisk and updated statement for the A6 answer in 
the epoll man page.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] lockdep: annotate epoll

2008-01-22 Thread Davide Libenzi

On Tue, 22 Jan 2008, Stefan Richter wrote:

> On 22 Jan, Stefan Richter wrote:
> > On 22 Jan, Peter Zijlstra wrote:
> >> Curious though that this gets reported frequently the last few weeks,
> >> afaics this problem is way old.
> > 
> > Here is a report against Fedora's 2.6.23-0.222.rc9.git4.fc8, filed in
> > October:  https://bugzilla.redhat.com/show_bug.cgi?id=323411
> 
> Upstream bug: http://bugzilla.kernel.org/show_bug.cgi?id=9786
> 
> Date: Sun, 13 Jan 2008 19:44:26 +0100
> From: Peter Zijlstra <[EMAIL PROTECTED]>
> 
> On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
> 
> > I remember I talked with Arjan about this time ago. Basically, since 1) 
> > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
> > are used, you can see a wake_up() from inside another wake_up(), but they 
> > will never refer to the same lock instance.
> > Think about:
> > 
> > dfd = socket(...);
> > efd1 = epoll_create();
> > efd2 = epoll_create();
> > epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
> > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> > 
> > When a packet arrives to the device underneath "dfd", the net code will 
> > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
> > callback wakeup entry on that queue, and the wake_up() performed by the 
> > "dfd" net code will end up in ep_poll_callback(). At this point epoll 
> > (efd1) notices that it may have some event ready, so it needs to wake up 
> > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
> > that ends up in another wake_up(), after having checked about the 
> > recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
> > avoid stack blasting. Never hit the same queue, to avoid loops like:
> > 
> > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> > epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
> > epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
> > epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
> > 
> > The code "if (tncur->wq == wq || ..." prevents re-entering the same 
> > queue/lock.
> 
> Since the epoll code is very careful to not nest same instance locks
> allow the recursion.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> Tested-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>  fs/eventpoll.c   |2 +-
>  include/linux/wait.h |   16 
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> Index: linux/fs/eventpoll.c
> ===
> --- linux.orig/fs/eventpoll.c
> +++ linux/fs/eventpoll.c
> @@ -353,7 +353,7 @@ static void ep_poll_safewake(struct poll
>   spin_unlock_irqrestore(&psw->lock, flags);
>  
>   /* Do really wake up now */
> - wake_up(wq);
> + wake_up_nested(wq, 1 + wake_nests);
>  
>   /* Remove the current task from the list */
>   spin_lock_irqsave(&psw->lock, flags);
> Index: linux/include/linux/wait.h
> ===
> --- linux.orig/include/linux/wait.h
> +++ linux/include/linux/wait.h
> @@ -161,6 +161,22 @@ wait_queue_head_t *FASTCALL(bit_waitqueu
>  #define  wake_up_locked(x)   __wake_up_locked((x), 
> TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE)
>  #define wake_up_interruptible_sync(x)   
> __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
>  
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +/*
> + * macro to avoid include hell
> + */
> +#define wake_up_nested(x, s) \
> +do { \
> + unsigned long flags;\
> + \
> + spin_lock_irqsave_nested(&(x)->lock, flags, (s));   \
> + wake_up_locked(x);          \
> + spin_unlock_irqrestore(&(x)->lock, flags);  \
> +} while (0)
> +#else
> +#define wake_up_nested(x, s) wake_up(x)
> +#endif
> +
>  #define __wait_event(wq, condition)  \
>  do { \
>   DEFINE_WAIT(__wait);\
> 

Looks fine to me.


Acked-by: Davide Libenzi <[EMAIL PROTECTED]>



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-rc7 to 2.6.24-rc8 possible regression

2008-01-22 Thread Davide Libenzi

On Tue, 22 Jan 2008, Stefan Richter wrote:

> (adding Cc: Davide and akpm)
> 
> On 22 Jan, Peter Zijlstra wrote:
> > 
> > On Tue, 2008-01-22 at 17:23 +0100, Stefan Richter wrote:
> >> Denys Fedoryshchenko wrote:
> >> > No, i am using vanilla kernel. It is one of production machines, and as 
> >> > i 
> >> > know screen is not using epoll.
> >> 
> >> OK, but the trace shows that it is the epoll recursion again.
> >> 
> >> > I will try to apply on all my production machines this patch. Sorry if 
> >> > it is 
> >> > related.
> >> 
> >> Well, let's hope that the lockdep annotation or whatever other fix gets
> >> into mainline sooner than later.  Which reminds me to test my setup
> >> again which appeared to be able to reproduce the __wake_up recursion on
> >> my command...
> > 
> > Would be appreciated, I have been waiting on testing feedback because
> > I'm not fully certain here.
> > 
> > Curious though that this gets reported frequently the last few weeks,
> > afaics this problem is way old.
> 
> Here is a report against Fedora's 2.6.23-0.222.rc9.git4.fc8, filed in
> October:  https://bugzilla.redhat.com/show_bug.cgi?id=323411
> 
> I did several tests with a pristine 2.6.24-rc8 now.  The lockdep warning
> below is 100% reliably triggered by grabbing video from a DV camcorder
> with dvgrab v3 via firewire-core's character device file ABI.
> 
> Also, this warning is 100% reliably suppressed by your annotation patch.
> I will reply to this message with this patch for kind consideration by
> those concerned. :-)

Patch? Where is it?



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: epoll and shared fd's

2008-01-18 Thread Davide Libenzi

On Fri, 18 Jan 2008, Pierre Habouzit wrote:

>   Hi,
> 
>   I just came across a strange behavior of epoll that seems to
> contradict the documentation. Here is what happens:
> 
> * I have two processes P1 and P2, P1 accept()s connections, and send the
>   resulting file descriptors to P2 through a unix socket.
> 
> * P2 registers the received socket in his epollfd.
> 
>   [time passes]
> 
> * P2 is done with the socket and closes it
> 
> * P2 gets events for the socket again !
> 
> 
>   Though the documentation says that if a process closes a file
> descriptor, it gets unregistered. And yes I'm sure that P2 doens't dup()
> the file descriptor. Though (because of a bug) it was still open in
> P1[0], hence the referenced socket still live at the kernel level.
> 
>   Of course the userland workaround is to force the EPOLL_CTL_DEL before
> the close, which I now do, but costs me a syscall where I wanted to
> spare one :|

For epoll, a close is when the kernel file* is released (that is, when all 
its instances are gone).
We could put a special handling in filp_close(), but I don't think is a 
good idea, and we're better live with the current behaviour.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-09 Thread Davide Libenzi

On Thu, 10 Jan 2008, Rusty Russell wrote:

> On Thursday 10 January 2008 05:16:57 Zach Brown wrote:
> > > The latter.  A ring is optimal for processing a huge number of requests,
> > > but if you're really going to be firing off syslet threads all over the
> > > place you're not going to be optimal anyway.  And being able to point the
> > > return value to the stack or into some datastructure is way nicer to code
> > > (zero setup == easy to understand and easy to convert).
> >
> > One of Linus' rhetorical requirements for the syslets work is that we be
> > able to wait for the result without spending overhead building up state
> > in some completion context.  The userland rings in the current syslet
> > patches achieve that and don't seem outrageously complicated.
> 
> I'd have to read his original statement, but eventfd doesn't build up state, 
> so I think it qualifies.

I think you and Zach are talking about different issues ;)
Eventfd born for two different reasons. First, to be able to have 
userspace to signal to a poll/select/epoll based listener an event. This 
can elso be done with pipes, but eventfd has a few advantages over pipes 
(3-4 times faster and *a lot* less memory footprint). Second, as a generic 
way for kernel subsystems to signal completions to a poll/select/epoll 
userspace listener. And this is what is used in the new KAIO eventfd 
feature (patch was like 5 lines IIRC).
This allow for KAIO events to be signaled to poll/select/epoll in a pretty 
easy way, using a simple extension of the AIO API.
What we talked originally with Ingo, when the first syslet code came up, 
was the ability to do the reverse thing. That is, host an epoll_wait() 
inside a syslet, and gather the completion using whatever the syslet code 
was/is going to use for it.
Given that 1) the eventfd completion patch was trivial and immediately 
available 2) the future of the whole syslet concept was/is still unclear, 
I believe it makes/made sense. If the syslets will become mainline, it'll 
mean that userspace will then be able to select the event-completion 
"hosting" method that better suits their needs. That are, either AIO 
completion hosted inside an epoll_wait() via an eventfd, or an 
epoll_wait() hosted inside a syslet.

- Davide

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Improve scalability of epoll_ctl

2008-01-08 Thread Davide Libenzi

On Tue, 8 Jan 2008, Eric Dumazet wrote:

> Changli Gao a écrit :
> > Replace the epitem rbtree with a dynamic array to get the constant 
> > insertion/deletion/modification time of the file descriptors. Reuse the 
> > size argument of epoll_create, however the size must be smaller than the 
> > max file descriptor number: ether the resource limitation or the compiling 
> > time limitation.
> >
> >   
> Hum, you should read this :
> 
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc3/2.6.7-rc3-mm2/broken-out/epoll-uses-rbtrees.patch
> 
> 
> Your patch is a revert of this change, it probably wont be accepted.
> 
> epoll_ctl() is scalable, since it's O(log2(N)) : With 1 millions files
> descriptors, that means less than 20 nodes to lookup in the tree.

Indeed, and we aren't going back.

> In what situation do you think epoll_ctl() performance is bad ?

I'm sure you can cook up a microbench that only does epoll_ctl(), and with 
a good hash implementation you get better numbers. This is what I did 
before I changed to rbtree, and the numbers did not justify the code (the 
"size" in epoll_create() is just an hint, and that means you need to 
handle resizing - plus other things like locking up memory when the number 
of fds shrinks, and other stuff I don't even remember).

- Davide

Re: 2.6.24-rc6: possible recursive locking detected

2008-01-07 Thread Davide Libenzi

On Sun, 6 Jan 2008, Christian Kujau wrote:

> On Sat, 5 Jan 2008, Davide Libenzi wrote:
> > A solution may be to move the call to ep_poll_safewake() (that'd become a
> > simple wake_up()) inside a tasklet or whatever is today trendy for delayed
> > work. But his kinda scares me to be honest, since epoll has already a
> > bunch of places where it could be asynchronously hit (plus performance
> > regression will need to be verified).
> 
> Although I'm not able to reproduce this one right now, I'm happy to test any
> patches you guys come up with.

There's no need to reproduce it. It's there, it's among us ;)



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-rc6: possible recursive locking detected

2008-01-05 Thread Davide Libenzi

On Sat, 5 Jan 2008, Peter Zijlstra wrote:

> 
> On Sat, 2008-01-05 at 17:53 +0100, Peter Zijlstra wrote:
> > On Sat, 2008-01-05 at 18:12 +1100, Herbert Xu wrote:
> > > On Fri, Jan 04, 2008 at 09:30:49AM +0100, Ingo Molnar wrote:
> > > >
> > > > > > [ 1310.670986] =
> > > > > > [ 1310.671690] [ INFO: possible recursive locking detected ]
> > > > > > [ 1310.672097] 2.6.24-rc6 #1
> > > > > > [ 1310.672421] -
> > > > > > [ 1310.672828] FahCore_a0.exe/3692 is trying to acquire lock:
> > > > > > [ 1310.673238]  (&q->lock){++..}, at: [] 
> > > > > > __wake_up+0x1b/0x50
> > > > > > [ 1310.673869]
> > > > > > [ 1310.673870] but task is already holding lock:
> > > > > > [ 1310.674567]  (&q->lock){++..}, at: [] 
> > > > > > __wake_up+0x1b/0x50
> > > > > > [ 1310.675267]
> > > > > > [ 1310.675268] other info that might help us debug this:
> > > > > > [ 1310.675952] 5 locks held by FahCore_a0.exe/3692:
> > > > > > [ 1310.676334]  #0:  (rcu_read_lock){..--}, at: [] 
> > > > > > net_rx_action+0x60/0x1b0
> > > > > > [ 1310.677251]  #1:  (rcu_read_lock){..--}, at: [] 
> > > > > > netif_receive_skb+0x100/0x470
> > > > > > [ 1310.677924]  #2:  (rcu_read_lock){..--}, at: [] 
> > > > > > ip_local_deliver_finish+0x32/0x210
> > > > > > [ 1310.678460]  #3:  (clock-AF_INET){-.-?}, at: [] 
> > > > > > sock_def_readable+0x1e/0x80
> > > > > > [ 1310.679250]  #4:  (&q->lock){++..}, at: [] 
> > > > > > __wake_up+0x1b/0x50
> > > 
> > > The net part might just be a red herring, since the problem is that
> > > __wake_up is somehow reentering itself.
> > 
> > /*
> >  * Perform a safe wake up of the poll wait list. The problem is that
> >  * with the new callback'd wake up system, it is possible that the
> >  * poll callback is reentered from inside the call to wake_up() done
> >  * on the poll wait queue head. The rule is that we cannot reenter the
> >  * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
> >  * and we cannot reenter the same wait queue head at all. This will
> >  * enable to have a hierarchy of epoll file descriptor of no more than
> >  * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
> >  * because this one gets called by the poll callback, that in turn is called
> >  * from inside a wake_up(), that might be called from irq context.
> >  */
> > 
> > Seems to suggest that the epoll code can indeed recurse into wakeup.
> > 
> > Davide, Johannes, any ideas?
> 
> Since EP_MAX_POLLWAKE_NESTS < MAX_LOCKDEP_SUBCLASSES we could perhaps do
> something like:
> 
>   wake_up_nested(..., wake_nests);
> 
> although I'm not quite sure that is correct, my understanding of this
> code is still fragile at best.

I remember I talked with Arjan about this time ago. Basically, since 1) 
you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
are used, you can see a wake_up() from inside another wake_up(), but they 
will never refer to the same lock instance.
Think about:

dfd = socket(...);
efd1 = epoll_create();
efd2 = epoll_create();
epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);

When a packet arrives to the device underneath "dfd", the net code will 
issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
callback wakeup entry on that queue, and the wake_up() performed by the 
"dfd" net code will end up in ep_poll_callback(). At this point epoll 
(efd1) notices that it may have some event ready, so it needs to wake up 
the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
that ends up in another wake_up(), after having checked about the 
recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
avoid stack blasting. Never hit the same queue, to avoid loops like:

epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);

The code "if (tncur->wq == wq || ..." prevents re-entering the same 
queue/lock.
I don't know how the lockdep code works, so I can't say about 
wake_up_nested(). Although I have a feeling is not enough in this case.
A solution may be to move the call to ep_poll_safewake() (that'd become a 
simple wake_up()) inside a tasklet or whatever is today trendy for delayed 
work. But his kinda scares me to be honest, since epoll has already a 
bunch of places where it could be asynchronously hit (plus performance 
regression will need to be verified).

- Davide

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 2/2] timerfd - make the returned time to be the remaining time till the next expiration

2007-12-17 Thread Davide Libenzi

Make the returned time to be the remaining time till the next expiration.
If the timer is already expired, and there's no next expiration, zero will
be returned.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/timerfd.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-12-14 16:04:36.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-12-14 16:05:32.0 -0800
@@ -49,6 +49,15 @@
return HRTIMER_NORESTART;
 }
 
+static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx) {
+   ktime_t now, remaining;
+
+   now = ctx->tmr.base->get_time();
+   remaining = ktime_sub(ctx->tmr.expires, now);
+
+   return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
+}
+
 static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
@@ -240,7 +249,7 @@
if (ctx->expired && ctx->tintv.tv64)
hrtimer_forward_now(&ctx->tmr, ctx->tintv);
 
-   kotmr.it_value = ktime_to_timespec(ctx->tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx->tintv);
 
/*
@@ -274,7 +283,7 @@
hrtimer_forward_now(&ctx->tmr, ctx->tintv) - 1;
hrtimer_restart(&ctx->tmr);
}
-   kotmr.it_value = ktime_to_timespec(ctx->tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx->tintv);
spin_unlock_irq(&ctx->wqh.lock);
fput(file);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/2] timerfd - make hrtimer_forward() to return a u64

2007-12-17 Thread Davide Libenzi

This patch makes hrtimer_forward() to return a u64 instead of unsigned long.
Since timerfd returns the number of timer ticks in a u64 variable, and
hrtimer_forward() is used to calculate the timer ticks, the patch allow full
64 bit usage even on 32 bit platforms. The core of the hrtimer_forward() ticks
calculation, ktime_divns(), was already having the result in u64 and it was
chopping it to unsigned long.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer->base->get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG < 64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG < 64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc >>= sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG >= 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer->expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr->it.real.interval.tv64 == 0)
return;
 
-   timr->it_overrun += hrtimer_forward(timer, timer->base->get_time(),
-   timr->it.real.interval);
+   timr->it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer->base->get_time(),
+   timr->it.real.interval);
 
timr->it_overrun_last = timr->it_overrun;
timr->it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr->it_overrun +=
+   timr->it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr->it.real.interval);
ret = HRTIMER_RESTART;
@@ -662,7 +663,7 @@
 */
if (iv.tv64 && (timr->it_requeue_pending & REQUEUE_PENDING ||
(timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE))
-   timr->it_overrun += hrtimer_forward(timer, now, iv);
+   timr->it_overrun += (unsigned int) hrtimer_forward(timer, now, 
iv);
 
remaining = ktime_sub(timer->expires, now);
/* Return 0 only, when the timer is expired and not pending */
Index: linux-2.6.mod/fs/timerfd.c
===

Re: Tesing of / bugs in new timerfd API

2007-12-17 Thread Davide Libenzi

On Mon, 17 Dec 2007, Michael Kerrisk wrote:

> > Can you try the two patches below? I tried them on my 32 bit box (one of
> > the rare beasts still lingering around here) and it seems to be working
> > fine (those go on top of the previous ones).
> 
> Against 2.6.24-rc5, I applied first your earlier patches ("v3") and
> then the newest patch.  My tests confirm that:
> 
> > This fixed the 32 bit tick-count truncation, and makes the time returned
> > to be the remaining time till the next expiration.
> 
> Are you going to resubmit a new patch set that includes these latest changes?

Yes, I'll send them out to Andrew today.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Tesing of / bugs in new timerfd API

2007-12-16 Thread Davide Libenzi

On Fri, 14 Dec 2007, Michael Kerrisk wrote:

> You snipped my example that demonstrated the problem.  Both of the
> following runs create a timer that expires 10 seconds from "now", but
> observe the difference in the value returned by timerfd_gettime():
> 
> $ ./timerfd_test 10 # does not use TFD_TIMER_ABSTIME
> Initial setting for settime:   value=10.000, interval=0.000
> ./timerfd_test> g
> (elapsed time=  1)
> Current value: value=346.448, interval=0.000
> 
> $ ./timerfd_test -a 10  # uses TFD_TIMER_ABSTIME
> Initial setting for settime:   value=1197630855.254, interval=0.000
> ./timerfd_test> g
> (elapsed time=  1)
> Current value: value=1197630855.254, interval=0.000
> 
> Either there's an inconsistency here depending on the use of
> TFD_TIMER_ABSTIME, or there is a bug in my understanding or my test program
> (but so far I haven't spotted that bug ;-).).

Can you try the two patches below? I tried them on my 32 bit box (one of 
the rare beasts still lingering around here) and it seems to be working 
fine (those go on top of the previous ones).
This fixed the 32 bit tick-count truncation, and makes the time returned 
to be the remaining time till the next expiration.




- Davide



---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer->base->get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG < 64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG < 64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc >>= sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG >= 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer->expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr->it.real.interval.tv64 == 0)
return;
 
-   timr->it_overrun += hrtimer_forward(timer, timer->base->get_time(),
-   timr->it.real.interval);
+   timr->it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer->base->get_time(),
+   timr->it.real.interval);
 
timr->it_overrun_last = timr->it_overrun;
timr->it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr->it_overrun +=
+   timr->it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr->it.real.interv

Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi

On Thu, 13 Dec 2007, Michael Kerrisk wrote:

> > > BUG 2:
> > > The last sentence does not match the implementation.
> > > (Nor is it consistent with the behavior of POSIX timers.
> > > And I *think* things did work correctly in the original
> > > timerfd() implementation, but I have not gone back to check.)
> > >
> > > Suppose that we set an absolute timer to expire 100 seconds
> > > in the future.  Then according to this sentence of the man
> > > page then each subsequent call to timerfd_gettime() should
> > > retrun an itimerspec structure whose it_value steadily
> > > decreases from 100 to 0 (when the timer expires).  (This
> > > is the behavior in the analogous situation with POSIX timers
> > > and with setitimer()/getitimer().)
> > >
> > > However, the implementation of timerfd_gettime() always
> > > returns the "time when the timer would next expire", and
> > > this value depends on whether TFD_TIMER_ABSTIME was specified
> > > when setting the timer.
> >
> > This is been like that from the beginning of the new API. So no, the
> > previous was behaving exactly the same WRT this feature.
> > Is this something really needed?
> 
> Three reasons that I think of off the top of my head (and there might
> well be more reasosn) why this should change:
> 
> a) consistency with the other two timer APIs (POSIX timers
> (timer_create(), etc.), and setitimer()/getitimer()).
> 
> b) Returning the amount of time until the next expiration  is more
> useful to userland: I'd say the most common case is for userland to
> want to know how long until the next expiration occurs, or to adjust
> that time by adding/subtracting some value to the existing setting.
> That is difficult to with the current implementation: the userland app
> must use timer_gettime(), call clock_gettime(), and calculate the
> difference between the two, in order to know how much time remains
> until the next timer expiration.

Bah, I don't want to argue with that because otherwise it starts going 
towards the "typical" use cases listing, that can be found the same exact 
reasons to have one or the other way. And we end up wasting lots of time.
We'd just have to move another thing that userspace could do, inside the 
kernel (subtract the current value returned by hrtimer_forward() in 
->expires with "now").




> c) Currently, the information returned differs depending on whether
> TFD_TIMER_ABSTIME is specified -- this is not the case for the
> analogous situation for POSIX timers.  For POSIX timers, the returned
> setting is always the amount of time until the timer next expires.
> This inconsistency is messy for applications -- the application may
> not (be able to) know whether or not the timer it is examining was set
> using TFD_TIMER_ABSTIME (the timerfd might have been created by a
> library, for example).

Hmm... the time returned is always the next absolute time when the next 
timer tick will go off. It does not depend on TFD_TIMER_ABSTIME. I return 
the ->expires field of the timer, and hrtimer_forward() sets it to the 
next absolute time.




- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi

On Thu, 13 Dec 2007, Michael Kerrisk wrote:

> Davide, Andrew,
> 
> I applied Davide's v3 patchset (sent into LKML on 25 Nov) against
> 2.4.24-rc3, and did various tests (all on x86).  Several tests
> were done using the program at the foot of this mail.  Various others
> were done by cobbling together bits of code that I haven't included
> here.

Thanks for such a thorough test Michael.



> Tested: after setting a CLOCK_REALTIME timer with the
> TFD_TIMER_ABSTIME flag to expire at some time in the past with
> a non-zero interval (e.g., setting 100 seconds in the past, with
> a 5 second interval), read() from the file descriptor returns
> the correct number of expirations (e.g., 20).
> 
> This seems a reasonable thing to do, I suppose.  However, while
> playing around to test this, I found what looks like a bug (see
> below).
> 
> BUG 1:
> However, this test exposed what looks like a bug: if I set a
> CLOCK_REALTIME clock to expire in the past, with a very small
> interval, then the maximum number of expirations that can be
> returned via read seems to be limited to 32 bits, even though
> we have a 64-bit value for returning this information.
> I haven't checked the kernel source to determine where this
> bug is.

Yes, true. Right now hrtimer_forward() (that we use to get the "missed" 
ticks) returns an unsigned long, that on 32 bit is (duh!) 32 bit :)
But hrtimer_forward() has all in place to just return an u64. So, Thomas, 
would it be OK to have hrtimer_forward() (and the new hrtimer_forward_now())
to return a u64?




> BUG 2:
> The last sentence does not match the implementation.
> (Nor is it consistent with the behavior of POSIX timers.
> And I *think* things did work correctly in the original
> timerfd() implementation, but I have not gone back to check.)
> 
> Suppose that we set an absolute timer to expire 100 seconds
> in the future.  Then according to this sentence of the man
> page then each subsequent call to timerfd_gettime() should
> retrun an itimerspec structure whose it_value steadily
> decreases from 100 to 0 (when the timer expires).  (This
> is the behavior in the analogous situation with POSIX timers
> and with setitimer()/getitimer().)
> 
> However, the implementation of timerfd_gettime() always
> returns the "time when the timer would next expire", and
> this value depends on whether TFD_TIMER_ABSTIME was specified
> when setting the timer.

This is been like that from the beginning of the new API. So no, the 
previous was behaving exactly the same WRT this feature.
Is this something really needed?



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] move the related code from exit_notify() to exit_signals()

2007-12-06 Thread Davide Libenzi

On Thu, 6 Dec 2007, Oleg Nesterov wrote:

> The previous bugfix was not optimal, we shouldn't care about group stop when
> we are the only thread or the group stop is in progress. In that case nothing
> special is needed, just set PF_EXITING and return.
> 
> Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().
> 
> So, from the performance POV the only difference is that we don't trust
> !signal_pending() until we take ->siglock. But this in fact fixes another
> ___pure___ theoretical minor race. __group_complete_signal() finds the task
> without PF_EXITING and chooses it as the target for signal_wake_up(). But
> nothing prevents this task from exiting in between without noticing the
> pending signal and thus unpredictably delaying the actual delivery.

For a second there, you got me confused, since it wasn't clear to me this 
patch was going over the previous one. When posting patches that are not 
in any tree, it may be wise to include the set they nest onto ;)


> + if (!signal_pending(tsk))
> + goto out;
> +
> + /* It could be that __group_complete_signal() choose us to
> +  * notify about group-wide signal. Another thread should be
> +  * woken now to take the signal since we will not.
> +  */
> + for (t = tsk; (t = next_thread(t)) != tsk; )
> + if (!signal_pending(t) && !(t->flags & PF_EXITING))
> + recalc_sigpending_and_wake(t);
> +
> + if (unlikely(tsk->signal->group_stop_count) &&
> + !--tsk->signal->group_stop_count) {
> + tsk->signal->flags = SIGNAL_STOP_STOPPED;
> + group_stop = 1;
> + }
> +out:

Looks fine to me, even though the one above could simply be an:

if (signal_pending(tsk)) {
...
}

(since the "out" label is used by that "if" only).



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] fix group stop with exit race

2007-12-05 Thread Davide Libenzi

On Wed, 5 Dec 2007, Oleg Nesterov wrote:

> do_signal_stop() counts all sub-thread and sets ->group_stop_count 
> accordingly.
> Every thread should decrement ->group_stop_count and stop, the last one should
> notify the parent.
> 
> However a sub-thread can exit before it notices the signal_pending(), or it 
> may
> be somewhere in do_exit() already. In that case the group stop never finishes
> properly.
> 
> Note: this is a minimal fix, we can add some optimizations later. Say we can
> return quickly if thread_group_empty(). Also, we can move some signal related
> code from exit_notify() to exit_signals().
> 
> Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>

Looks OK for me, even though we're doing more work on the exit path. OTOH 
I don't see a non-racy way of doing it w/out grabbing the lock. Did you 
try to bench how much this change costs?
Anyway, looks sane to me...

Acked-by: Davide Libenzi <[EMAIL PROTECTED]>



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] update sys_ni.c with the new timerfd syscalls

2007-11-28 Thread Davide Libenzi

Update sys_ni.c with the new timerfd syscalls.


Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 kernel/sys_ni.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/kernel/sys_ni.c
===
--- linux-2.6.mod.orig/kernel/sys_ni.c  2007-11-28 11:44:48.0 -0800
+++ linux-2.6.mod/kernel/sys_ni.c   2007-11-28 11:46:01.0 -0800
@@ -153,7 +153,9 @@
 
 /* New file descriptors */
 cond_syscall(sys_signalfd);
-cond_syscall(sys_timerfd);
+cond_syscall(sys_timerfd_create);
+cond_syscall(sys_timerfd_settime);
+cond_syscall(sys_timerfd_gettime);
 cond_syscall(compat_sys_signalfd);
 cond_syscall(compat_sys_timerfd);
 cond_syscall(sys_eventfd);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-rc3-mm2 - Build Failure on powerpc timerfd() undeclared

2007-11-28 Thread Davide Libenzi

On Wed, 28 Nov 2007, Andrew Morton wrote:

> On Wed, 28 Nov 2007 14:32:07 +0100 Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> 
> > On Wednesday 28 November 2007, Kamalesh Babulal wrote:
> > > Kernel build fails, with build error
> > > 
> > >   CC      arch/powerpc/platforms/cell/spu_callbacks.o
> > > In file included from arch/powerpc/platforms/cell/spu_callbacks.c:49:
> > > include/asm/systbl.h:312: error: ‘sys_timerfd’ undeclared here (not in a 
> > > function)
> > > make[2]: *** [arch/powerpc/platforms/cell/spu_callbacks.o] Error 1
> > > make[1]: *** [arch/powerpc/platforms/cell] Error 2
> > > make: *** [arch/powerpc/platforms] Error 2
> > > 
> > 
> > I guess all architectures except x86 are currently broken because they
> > reference the old sys_timerfd function.
> 
> None of them were broken in my testing and I'm unsure why powerpc broke
> here.
> 
> > This patch should add the missing
> > bits to powerpc.
> > 
> 
> Because the patches in -mm left the stubs in place in sys_ni.c and powerpc
> _should_ have (incorrectly) picked those up.

My fault. I forgot to update sys_ni.c with the new functions (and with the
sys_timerfd->sys_timerfd_create name change).
Do you want a patch Andrew?



- Davide

Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi

On Tue, 27 Nov 2007, Andrew Morton wrote:

> On Tue, 27 Nov 2007 12:47:46 -0800 (PST)
> Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 27 Nov 2007, Andrew Morton wrote:
> > 
> > > On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi <[EMAIL PROTECTED]> 
> > > wrote:
> > > 
> > > > +static struct file *timerfd_fget(int fd)
> > > > +{
> > > > +   struct file *file;
> > > > +
> > > > +   file = fget(fd);
> > > > +   if (!file)
> > > > +   return ERR_PTR(-EBADF);
> > > > +   if (file->f_op != &timerfd_fops) {
> > > > +   fput(file);
> > > > +   return ERR_PTR(-EINVAL);
> > > > +   }
> > > > +
> > > > +   return file;
> > > > +}
> > > 
> > > I suppose we could use fget_light() in here sometime.  It is significantly
> > > quicker in microbenchmarks.  Or it was - nobody has checked that in a few
> > > years afaik.
> > 
> > Should I now?
> 
> No great rush.  It'd be fun to see if it actually makes any measurable
> difference on modern hardware (hint ;)).

I was going to say BS, given that at the time of the tests, the files 
struct was protected by an rwlock whereas now there's a rcu one.
But that seems not the case:

http://www.xmailserver.org/fget-light-test.c


$ fget-light-test -r 9   
warming up ...
testing non-shared ...
time = 314.354000 ms
testing shared ...
time = 390.781000 ms



And here is the oprofile output:

[SHARED CASE]
samples  %app name symbol name
7436 28.9339  vmlinux  __clear_user
2369  9.2179  vmlinux  system_call
1710  6.6537  vmlinux  fget_light
1244  4.8405  vmlinux  inotify_dentry_parent_queue_event
1128  4.3891  vmlinux  sys_read
1041  4.0506  libpthread-2.6.1.so  __read_nocancel
1027  3.9961  Xorg (no symbols)
978   3.8054  vmlinux  read_zero
755   2.9377  vmlinux  vfs_read
545   2.1206  vmlinux  inotify_inode_queue_event
507   1.9728  vmlinux  sysret_check
414   1.6109  vmlinux  rw_verify_area
405   1.5759  vmlinux  unix_poll
371   1.4436  nvidia   (no symbols)
333   1.2957  vmlinux  acpi_pm_read
311   1.2101  nvidia_drv.so(no symbols)
290   1.1284  vmlinux  do_select
288   1.1206  vmlinux  dnotify_parent
253   0.9844  libc-2.6.1.so(no symbols)
232   0.9027  bash (no symbols)
216   0.8405  libpthread-2.6.1.so  __pthread_enable_asynccancel


[UNSHARED CASE]
samples  %app name symbol name
6542 27.6922  vmlinux  __clear_user
4074 17.2452  vmlinux  vfs_read
1266  5.3590  vmlinux  inotify_inode_queue_event
1091  4.6182  Xorg (no symbols)
1059  4.4827  vmlinux  system_call
937   3.9663  libpthread-2.6.1.so  __read_nocancel
544   2.3027  vmlinux  clear_user
484   2.0488  vmlinux  dnotify_parent
445   1.8837  vmlinux  read_zero
438   1.8540  vmlinux  sysret_check
414   1.7525  vmlinux  unix_poll
407   1.7228  nvidia   (no symbols)
389   1.6466  vmlinux  acpi_pm_read
315   1.3334  nvidia_drv.so(no symbols)
312   1.3207  vmlinux  inotify_dentry_parent_queue_event
312   1.3207  vmlinux  sys_read
305   1.2911  vmlinux  rw_verify_area
304   1.2868  vmlinux  do_select
222   0.9397  libc-2.6.1.so(no symbols)
214   0.9059  bash (no symbols)
196   0.8297  fget-light-test  read_test
185   0.7831  vmlinux  fget_light



You can clearly notice the fget_light() drop out of the relevance window.
BTW, are all those "notify" noises supposed to be there?




- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi

On Tue, 27 Nov 2007, Andrew Morton wrote:

> On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > +static struct file *timerfd_fget(int fd)
> > +{
> > +   struct file *file;
> > +
> > +   file = fget(fd);
> > +   if (!file)
> > +   return ERR_PTR(-EBADF);
> > +   if (file->f_op != &timerfd_fops) {
> > +   fput(file);
> > +   return ERR_PTR(-EINVAL);
> > +   }
> > +
> > +   return file;
> > +}
> 
> I suppose we could use fget_light() in here sometime.  It is significantly
> quicker in microbenchmarks.  Or it was - nobody has checked that in a few
> years afaik.

Should I now? Half of the internet traffic of the last three month was 
generated by me posting those timerfd patches :)



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-26 Thread Davide Libenzi

On Mon, 26 Nov 2007, H. Peter Anvin wrote:

> Ingo Molnar wrote:
> > 
> > So it's not like sys_indirect() would break some magic pristine state of a
> > flat parameter space - on the contrary, most of the nontrivial syscalls take
> > pointers to structures or pointers to streams of information. The parameter
> > count histogram i believe further underlines this point:
> > 
> >   #args   #syscalls
> >   -
> >   0   22
> >   1   51
> >   2   83
> >   3   85
> >   4   40
> >   5   23
> >   68
> > 
> > the natural 'center' of function call parameter counts is around 1-4
> > parameters, and that is natural. (most operators that the human brain
> > prefers to operate with are like that - having higher complexity than that
> > often defeats the purpose of getting an API used by ... humans.)
> > 
> 
> I was preparing a response to Linus' email, but I really feel this needs to be
> addressed specifically.
> 
> When it comes to dealing with the operator-visible state, what matters is what
> happens on the API level, NOT on the system call level. Furthermore, the
> proposed sys_indirect interface just means that there are parameters hidden
> from immediately view, even though they fundamentally change the operation
> performed, and that it is much harder to correlate, say, the output of
> strace(1) with what actually happened in the program.  So from a
> *psychological* point of view, this seems to be an insane design choice.

I think there are two different issues. One is the proliferation of system 
calls, and the other is the sane design of internal kernel APIs.
The first one is not very interesting to me, since I don't have any strong 
opinions in either cases.
The second is the one I'd care most. I think that, whatever is the 
solution used to address the first, internal kernel APIs should be 
designed so that parameters flow down from the system call to the 
parameter's user code. IMO, besides very few cases where it could make 
some sense [*], setting some thread-global bits in the upper layer, to be 
magically picked up by code in the lower layers, does not lead to readable 
interfaces.

[*] Things that already read from a shared context, that is already 
exposed to the user through some sort of set/get APIs.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 3/4] Timerfd v3 - wire the new timerfd API to the x86 family

2007-11-25 Thread Davide Libenzi

Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-24 12:49:28.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-24 12:49:28.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-24 
12:49:28.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-24 12:49:28.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 4/4] Timerfd v3 - un-break CONFIG_TIMERFD

2007-11-25 Thread Davide Libenzi

Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:55:15.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-24 12:49:30.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool "Enable timerfd() system call" if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/4] Timerfd v3 - introduce a new hrtimer_forward_now() function

2007-11-25 Thread Davide Libenzi

I think that advancing the timer against the timer's current "now" can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:55:16.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-24 12:48:05.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer->base->get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 2/4] Timerfd v3 - new timerfd API

2007-11-25 Thread Davide Libenzi

This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not 
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  199 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 166 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:55:16.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-24 12:49:21.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
ctx->expired = 1;
+   ctx->ticks++;
wake_up_locked(&ctx->wqh);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
+   ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(&ctx->tmr, clockid, htmode);
+   hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, &ctx->wqh, wait);
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
-   if (ctx->expired)
+   if (ctx->ticks)
events |= POLLIN;
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(&ctx->wqh.lock);
res = -EAGAIN;
-   if (!ctx->expired && !(file->f_flags & O_NONBLOCK)) {
+   if (!ctx->ticks && !(file->f_flags & O_NONBLOCK)) {
__add_wait_queue(&ctx->wqh, &wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->expired) {
+   if (ctx->ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx->expired) {
-   ctx->expired = 0;
-   if (ctx->tintv.tv64 != 0) {
+   if (ctx->ticks) {
+   ticks = ctx->ticks;
+   if (ctx->expired && ctx->tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(&ctx->tmr,
-   hrtimer_cb_get_time(&ctx->tmr),
-

Re: [patch 2/4] Timerfd v2 - new timerfd API

2007-11-24 Thread Davide Libenzi

On Sat, 24 Nov 2007, Michael Kerrisk wrote:

> > +asmlinkage long sys_timerfd_create(int clockid, int flags)
> >  {
> > -   int error;
> > +   int error, ufd;
> > struct timerfd_ctx *ctx;
> > struct file *file;
> > struct inode *inode;
> > -   struct itimerspec ktmr;
> > -
> > -   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> > -   return -EFAULT;
> >  
> > if (clockid != CLOCK_MONOTONIC &&
> > clockid != CLOCK_REALTIME)
> > return -EINVAL;
> 
> Could I suggest here, the following placeholder addition:
> 
>   if (flags != 0)
>   return -EINVAL;

Make sense, will repost.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 4/4] Timerfd v2 - un-break CONFIG_TIMERFD

2007-11-23 Thread Davide Libenzi

Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:13:16.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-23 13:36:42.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool "Enable timerfd() system call" if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 3/4] Timerfd v2 - wire the new timerfd API to the x86 family

2007-11-23 Thread Davide Libenzi

Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-23 13:36:40.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-23 13:36:40.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-23 
13:36:40.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-23 13:36:40.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/4] Timerfd v2 - introduce a new hrtimer_forward_now() function

2007-11-23 Thread Davide Libenzi

I think that advancing the timer against the timer's current "now" can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:13:21.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-23 13:36:36.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer->base->get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 2/4] Timerfd v2 - new timerfd API

2007-11-23 Thread Davide Libenzi

This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not 
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  197 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 164 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:13:19.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-23 13:36:39.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
ctx->expired = 1;
+   ctx->ticks++;
wake_up_locked(&ctx->wqh);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
+   ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(&ctx->tmr, clockid, htmode);
+   hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, &ctx->wqh, wait);
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
-   if (ctx->expired)
+   if (ctx->ticks)
events |= POLLIN;
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(&ctx->wqh.lock);
res = -EAGAIN;
-   if (!ctx->expired && !(file->f_flags & O_NONBLOCK)) {
+   if (!ctx->ticks && !(file->f_flags & O_NONBLOCK)) {
__add_wait_queue(&ctx->wqh, &wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->expired) {
+   if (ctx->ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx->expired) {
-   ctx->expired = 0;
-   if (ctx->tintv.tv64 != 0) {
+   if (ctx->ticks) {
+   ticks = ctx->ticks;
+   if (ctx->expired && ctx->tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(&ctx->tmr,
-   hrtimer_cb_get_time(&ctx->tmr),
-

Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi

On Fri, 23 Nov 2007, Ulrich Drepper wrote:

> On Nov 23, 2007 9:29 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > Yes, it's disabled, and yes, I'll repost today ...
> 
> I haven't seen the patch and don't feel like searching.  So I say it
> here: please mak sure you add a flags parameter to the system call
> itself (instead of adding it on as for eventfd and signalfd).  We need
> to be able to use O_CLOEXEC some way or another.

I'm more then OK about adding a flags parameter. If it was for me, I'd add 
it even to eventfd and signalfd. I asked Linus if he was OK about adding 
the flags parameter to all. He didn't reply, and I read that as "no".



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi

On Fri, 23 Nov 2007, Andrew Morton wrote:

> > I suppose this means that timerfd will only go in for 2.6.25.  I don't
> > have a problem with that, but we better make sure that the existing
> > timerfd in 2.6.24 is still disabled.  (Andrew had a one liner for
> > that, but I haven't checked if it's in place.)
> > 
> 
> I have no timerfd patches here.

Yes, it's disabled, and yes, I'll repost today ...


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi

On Thu, 22 Nov 2007, Andrew Morton wrote:

> On Thu, 22 Nov 2007 11:46:13 -0800 (PST) Davide Libenzi <[EMAIL PROTECTED]> 
> wrote:
> 
> > On Thu, 22 Nov 2007, Michael Kerrisk wrote:
> > 
> > > On Nov 22, 2007 6:34 PM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > > > On Thu, 22 Nov 2007, Michael Kerrisk wrote:
> > > >
> > > > > Hey Davide,
> > > > >
> > > > > Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
> > > >
> > > > Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it 
> > > > was
> > > > merged. Some screw up reverted it?
> > > 
> > > t looks that way.
> > 
> > I'm looking at the log now. It never went in actually. Andrew-san, what 
> > happened?
> > 
> 
> Last I recall, we removed the API for 2.6.23 because we intended to do a
> different interface for 2.6.24.
> 
> But I don't recall seeing any timerfd patches in maybe a month.

Was sent on Sep 23, Subject: new timerfd API
Do you want me to repost?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi

On Thu, 22 Nov 2007, Michael Kerrisk wrote:

> On Nov 22, 2007 6:34 PM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > On Thu, 22 Nov 2007, Michael Kerrisk wrote:
> >
> > > Hey Davide,
> > >
> > > Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
> >
> > Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was
> > merged. Some screw up reverted it?
> 
> t looks that way.

I'm looking at the log now. It never went in actually. Andrew-san, what 
happened?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi

On Thu, 22 Nov 2007, Michael Kerrisk wrote:

> Hey Davide,
> 
> Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...

Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was 
merged. Some screw up reverted it?

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Davide Libenzi

On Tue, 20 Nov 2007, Ingo Molnar wrote:

> * H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> 
> > It seems that you're doing the same thing in both cases, except you're 
> > now extending it to include other random functionality, which means 
> > other things than syslets are suddenly affected.
> >
> > syslets are arguably a little bit different, since what you're 
> > effectively doing there is running a miniature interpreted language in 
> > kernel space.  A higher startup overhead should be acceptable, since 
> > you're amortizing it over a larger number of calls.  Extending that 
> > mechanism suddenly means you HAVE to use that interpreted language 
> > message mechanism to access certain system calls, which really does 
> > not seem like a good thing neither for performance nor for encouraging 
> > sane design of interfaces.
> 
> whether that interpreted syslet language survives is still an open 
> question - it was extremely ugly when i wrote the first version of it 
> and it only got uglier since then :-)

Aha! You admitted it finally :)



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: epoll design problems with common fork/exec patterns

2007-10-29 Thread Davide Libenzi

On Sun, 28 Oct 2007, David Schwartz wrote:

> 
> Eric Dumazet wrote:
> 
> > Events are not necessarly reported "by descriptors". epoll uses an opaque
> > field provided by the user.
> >
> > It's up to the user to properly chose a tag that will makes sense
> > if the user
> > app is playing dup()/close() games for example.
> 
> Great. So the only issue then is that the documentation is confusing. It
> frequently uses the term "fd" where it means file. For example, it says:
> 
>   Q1 What  happens  if  you  add  the  same fd to an
> epoll_set
>  twice?
> 
>   A1 You will probably get EEXIST.  However,  it  is
> possible
>  that  two  threads  may  add the same fd twice. This is
> a
>  harmless condition.
> 
> This gives no reason to think there's anything wrong with adding the same
> file twice so long as you do so through different descriptors. (One can
> imagine an application that does this to segregate read and write operations
> to avoid a race where the descriptor is closed from under a writer due to
> handling a fatal read error.) Obviously, that won't work.

I agree, that is confusing. However, you can safely add two different file 
descriptors pointing to the same file*, with different event masks, and 
that will work as expected.




> And this part:
> 
>   Q6 Will  the  close of an fd cause it to be removed from
> all
>  epoll sets automatically?
> 
>   A6 Yes.
> 
> This is incorrect. Closing an fd will not cause it to be removed from all
> epoll sets automatically. Only closing a file will. This is what caused the
> OP's confusion, and it is at best imprecise and, at worst, flat out wrong.

OTOH you cannot list *every* possible scenario in a man page, otherwise 
you end up writing a book instead of a man page. I will try to find some 
time with Michael to refine the man page.



> PS: It is customary to trim individuals off of CC lists when replying to a
> list when the subject matter of the post is squarely inside the subject of
> the list. If the person CC'd was interested in the list's subject, he or she
> would presumably subscribe to the list. Not everyone wants two copies of
> every post. Not everyone wants a personal copy of every sub-thread that
> results from a post they make. In the past few years, I've received
> approximately an equal number of complaints about trimming CC's on posts to
> LKML and not trimming CC's on such posts.

Does anyone that in 2007 still did not manage to find a way to avoid dups 
in hitting his mailbox, deserve any consideration at all?
OTOH many ppl, like myself, uses To and Cc header to direct email to 
proper folders, where they are treated with a different level of 
attention. And your stripp-all-headers mania screws that up badly.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: epoll design problems with common fork/exec patterns

2007-10-28 Thread Davide Libenzi

On Sat, 27 Oct 2007, David Schwartz wrote:

>   I don't see how that can be. Suppose I add fd 8 to an epoll set. 
> Suppose fd
> 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
> since there no longer is an fd 8? Events on files registered for epoll
> notification are reported by descriptor, so the set membership has to be
> associated (as reflected into userspace) with the descriptor, not the file.

Eric already answered to your question (epoll deals with internal kernel 
objects - aka file*).
I just want to answer this one for another reason. WTF is wrong with all 
of you Cc-list-trimmers?
Could you *please* stop trimming Cc-lists?

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: epoll design problems with common fork/exec patterns

2007-10-27 Thread Davide Libenzi

On Sat, 27 Oct 2007, Willy Tarreau wrote:

> On Sat, Oct 27, 2007 at 09:59:07AM -0700, Davide Libenzi wrote:
> > On Sat, 27 Oct 2007, Marc Lehmann wrote:
> > 
> > > > Please provide some code to illustrate one exact problem you have.
> > > 
> > >// assume there is an open epoll set that listens for events on fd 5
> > >if (fork () = 0)
> > >  {
> > >close (5);
> > >// fd 5 is now removed from the epoll set of the parent.
> > >_exit (0);
> > >  }
> > 
> > Hmmm ... what? I assume you know that:
> > 
> > 1) A file descriptor is a userspace view/handle of a kernel object
> > 
> > 2) The kernel object has a use-count for as many file descriptors that 
> >have been handed out to userspace
> > 
> > 3) A close() decreases the internal counter by one
> > 
> > 4) The kernel object gets effectively closed when the internal counter 
> >goes to zero
> > 
> > 5) A fork() acts as a dup() on the file descriptors by hence bumping up 
> >its internal counter
> > 
> > 6) Epoll removes the file from the set, when the *kernel* object gets 
> >closed (internal use-count goes to zero)
> > 
> > With that in mind, how can the code snippet above trigger a removal from 
> > the epoll set?
> 
> Davide,
> 
> from what I understand, Marc is not asking for the code above to remove
> the fd from the epoll set, but he's in fact complaining that he *observed*
> that the fd was removed from the epoll set in the *parent* process when
> the child closes it, which is of course not expected at all. As strange
> as it looks like, this might need investigation. It is possible that there
> is some strange bug somewhere in some kernel versions.

That would be *really* strange, since epoll hooks in __fput() in order to 
perform proper cleanup. This means that, in the case above, the file will 
be really closed in the parent too. That, I think, would trigger way more 
serious problems in userspace.



> Marc, I think that if you indicate the last kernel version on which you
> observed this and provide a very short and easy reproducer, it would
> help everyone investigating this. Basically something which reports "OK"
> or "KO".

Of course. That'd be great.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: epoll design problems with common fork/exec patterns

2007-10-27 Thread Davide Libenzi

On Sat, 27 Oct 2007, Marc Lehmann wrote:

> > Please provide some code to illustrate one exact problem you have.
> 
>// assume there is an open epoll set that listens for events on fd 5
>if (fork () = 0)
>  {
>close (5);
>// fd 5 is now removed from the epoll set of the parent.
>_exit (0);
>  }

Hmmm ... what? I assume you know that:

1) A file descriptor is a userspace view/handle of a kernel object

2) The kernel object has a use-count for as many file descriptors that 
   have been handed out to userspace

3) A close() decreases the internal counter by one

4) The kernel object gets effectively closed when the internal counter 
   goes to zero

5) A fork() acts as a dup() on the file descriptors by hence bumping up 
   its internal counter

6) Epoll removes the file from the set, when the *kernel* object gets 
   closed (internal use-count goes to zero)

With that in mind, how can the code snippet above trigger a removal from 
the epoll set?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Revised signalfd man-page

2007-10-17 Thread Davide Libenzi

On Wed, 17 Oct 2007, Michael Kerrisk wrote:

> > With the new code Linus already merged, signalfd does not attach to the 
> > sighand anymore, so the "orphaned sighand" behaviour is no more there.
> > An exec() will carry the fd over, and you will be able to use the fd in 
> > the same way you did before the exec(). If sigpending()/sigwaitinfo() will 
> > show signals available, so it should signalfd.
> 
> So I wrote:
> 
>execve(2) semantics
>Just  like  any  other  file  descriptor, a signalfd file
>descriptor remains open across an  execve(2),  unless  it
>has  been  marked  for close-on-exec (see fcntl(2)).  Any
>signals  that  were  available  for  reading  before  the
>execve(2)  remain  available to the newly loaded program.
>(This is analogous to traditional signal semantics, where
>a  blocked  signal that is pending remains pending across
>an execve(2).)  (This is analogous to traditional  signal
>semantics, where a blocked signal that is pending remains
>pending across an execve(2).)
> 
> Okay?

Ok



> > It'll return the signals that would be normally returned to the thread 
> > with the standard signal delivery methods. That is, calling thread private 
> > signals, and calling thread-group shared signals.
> 
> So I wrote:
> 
>Thread semantics
>The semantics of signalfd file descriptors  in  a  multi-
>threaded  program  mirror the standard semantics for sig-
>nals.  In other words, when a thread reads  from  a  sig-
>nalfd  file descriptor, it will read the signals that are
>directed to the thread itself and the  signals  that  are
>directed  to the process (i.e., the entire thread group).
>(A thread will not be  able  to  read  signals  that  are
>directed to other threads in the process.)
> 
> Okay?

Ok, although this looks more specific:

(A thread will not be able to read other threads private signals)



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Revised signalfd man-page

2007-10-15 Thread Davide Libenzi

On Mon, 15 Oct 2007, Michael Kerrisk wrote:

> Hi Davide,
> 
> There were two questions that you overlooked in my earlier draft of the
> signalfd man page.  I've revised one of the questions slightly.  Could you
> look at these please:

I think I already answered those, no? Anyway ...



> .SS execve(2) semantics
> [TO BE COMPLETED]
> .\" FIXME
> .\" Davide, what are the intended semantics after an execve()?
> .\" I would hope that the descriptor remains available, and can
> .\" be used to read any queued signals.  This is analogous with
> .\" traditional behavior, where blocked signals that are pending
> .\" prior to an execve() remain pending after the execve().
> .\"
> .\" Below, was my original question, based on how things
> .\" worked at one point, but perhaps they have changed by now:
> .\"===
> .\" As far as I can work out, after an execve() the file descriptor
> .\" is still available, but reads from it always return 0, even if:
> .\"
> .\" a) there were signals pending before the execve().
> .\"However, sigpending() shows the signal as pending,
> .\"and the signal can be accepted using sigwaitinfo().
> .\"
> .\" b) we generate the signal after the exec.
> .\"
> .\" Is this intended behavior (the "orphaned sighand" condition
> .\" described above?)?  Is it a bug?

With the new code Linus already merged, signalfd does not attach to the 
sighand anymore, so the "orphaned sighand" behaviour is no more there.
An exec() will carry the fd over, and you will be able to use the fd in 
the same way you did before the exec(). If sigpending()/sigwaitinfo() will 
show signals available, so it should signalfd.



> .SS Thread semantics
> [TO BE COMPLETED]
> .\" FIXME Davide, a signal can be directed to the process as
> .\" a whole, or to a particular thread.  What are the intended
> .\" semantics for signalfd()?  If a thread calls signalfd(),
> .\" does the resulting file descriptor return just those
> .\" signals directed to [the thread and the process as a whole],
> .\" or will it also receive signals that are targeted at
> .\" other threads in the process?

It'll return the signals that would be normally returned to the thread 
with the standard signal delivery methods. That is, calling thread private 
signals, and calling thread-group shared signals.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: F_DUPFD_CLOEXEC implementation

2007-10-02 Thread Davide Libenzi

On Tue, 2 Oct 2007, Denys Vlasenko wrote:

> I have following proposals:
> 
> * make recv(..., MSG_DONTWAIT) work on any fd
> 
> Sounds neat, but not trivial to implement in current kernel.

This is mildly ugly, if you ask me. Those are socket functions, and the 
flags parameter contain some pretty specific network meanings.

> * new fcntl command F_DUPFL: fcntl(fd, F_DUPFL, n):
>   Analogous to F_DUPFD, but gives you *unshared* copy of the fd.
>   Further seeks, fcntl(fd, F_SETFL, O_NONBLOCK), etc won't affect
>   any other process.

You'll need an ad-hoc copy function though, since your memcpy-based one is 
gonna explode even before memcpy returns ;) You'll have problems with 
ref-counting too. And that layer is not designed to cleanly support that 
operation.
Unfortunately the "clean" solution would involve changing a whole bunch of 
code, and I don't feel exactly sure it'd be worth it.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Davide Libenzi

On Mon, 1 Oct 2007, Paul E. McKenney wrote:

> That would indeed be one approach that CPU designers could take to
> avoid being careless or sadistic.  ;-)

That'd be the easier (unique maybe) approach too for them, from an silicon 
complexity POV. Distinguishing between different CPUs stores once inside a 
shared store buffer, would require tagging them in some way. That'd defeat 
most of the pros of having a shared store buffer ;)

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: F_DUPFD_CLOEXEC implementation

2007-10-01 Thread Davide Libenzi

On Mon, 1 Oct 2007, Denys Vlasenko wrote:

> On Monday 01 October 2007 19:16, Al Viro wrote:
> > * it's on a bunch of cyclic lists.  Have its neighbor
> > go away while you are doing all that crap => boom
> > * there's that thing call current position...  It gets buggered.
> > * overwriting it while another task might be in the middle of
> > syscall involving it => boom
> 
> Hm, I suspected that it's herecy. Any idea how to do it cleanly?
> 
> > * non-cooperative tasks reading *in* *parallel* from the same
> > opened file are going to have a lot more serious problems than agreeing
> > on O_NONBLOCK anyway, so I really don't understand what the hell is that 
> > for.
> 
> They don't even need to read in parallel, just having shared fd is enough.
> Think about pipes, sockets and terminals. A real-world scenario:
> 
> * a process started from shell (interactive or shell script)
> * it sets O_NONBLOCK and does a read from fd 0...
> * it gets killed (kill -9, whatever)
> * shell suddenly has it's fd 0 in O_NONBLOCK mode
> * shell and all subsequent commands started from it unexpectedly have
>   O_NONBLOCKed stdin.

I told you how in the previous email. You cannot use the:

1) set O_NONBLOCK
2) read/write
3) unset O_NONBLOCK

in a racy-free fashion, w/out wrapping it with a lock (thing that we 
don't want to do).



PS: send/recv are socket functions, and you really don't want to overload 
them for other fds.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-10-01 Thread Davide Libenzi

On Sun, 30 Sep 2007, Paul E. McKenney wrote:

> On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote:
> > On Sun, 30 Sep 2007, Oleg Nesterov wrote:
> > 
> > > Ah, but I asked the different question. We must see CPU 1's stores by
> > > definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> > > 
> > > Let's take a "real life" example,
> > > 
> > > A = B = X = 0;
> > > P = Q = &A;
> > > 
> > > CPU_0   CPU_1   CPU_2
> > > 
> > > P = &B; *P = 1; if (X) {
> > > wmb();  rmb();
> > > X = 1;  BUG_ON(*P != 1 && *Q != 1);
> > > }
> > > 
> > > So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?
> > 
> > That can't be. CPU_2 sees X=1, that happened after (or same time at most - 
> > from a cache inv. POV) to *P=1, that must have happened after P=&B (in 
> > order for *P to assign B). So P=&B happened, from a pure time POV, before 
> > the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.
> 
> Actually, CPU designers have to go quite a ways out of their way to
> prevent this BUG_ON from happening.  One way that it would happen
> naturally would be if the cache line containing P were owned by CPU 2,
> and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
> here is what could happen given careless or sadistic CPU designers:

Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted
that it was a cross-CPU sync point (ala read_barrier_depends). If that's a
local CPU load ordering only, things are different, clearly. But ...



> 
> o CPU 0 stores &B to P, but misses the cache, so puts the
>   result in the store buffer.  This means that only CPUs 0 and 1
>   can see it.
> 
> o CPU 1 fetches P, and sees &B, so stores a 1 to B.  Again,
>   this value for P is visible only to CPUs 0 and 1.
> 
> o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
>   in order.  But it does nothing about CPU 0's stores, nor about CPU
>   1's loads, for that matter (and the only reason that POWER ends
>   up working the way you would like is because wmb() turns into
>   "sync" rather than the "eieio" instruction that would have been
>   used for smp_wmb() -- which is maybe what Oleg was thinking of,
>   but happened to abbreviate.  If my analysis is buggy, Anton and
>   Paulus will no doubt correct me...)

If a store buffer is shared between CPU_0 and CPU_1, it is very likely 
that a sync done on CPU_1 is going to sync even CPU_0 stores that are 
held in the buffer at the time of CPU_1's sync.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: F_DUPFD_CLOEXEC implementation

2007-09-30 Thread Davide Libenzi

On Mon, 1 Oct 2007, Denys Vlasenko wrote:

> My use case is: I want to do a nonblocking read on descriptor 0 (stdin).
> It may be a pipe or a socket.
> 
> There may be other processes which share this descriptor with me,
> I simply cannot know that. And they, too, may want to do reads on it.
> 
> I want to do nonblocking read in such a way that neither those other
> processes will ever see fd switching to O_NONBLOCK and back, and
> I also want to be safe from other processes doing the same.
> 
> I don't see how this can be done using standard unix primitives.

Indeed. You could simulate non-blocking using poll with zero timeout, but 
if another task may read/write on it, your following read/write may end up 
blocking even after a poll returned the required events.
One way to solve this would be some sort of readx/writex where you pass an 
extra flags parameter (this could be done with sys_indirect, assuming 
we'll ever get that mainline) where you specify the non-blocking 
requirement for-this-call, and not as global per-file flag. Then, of 
course, you'll have to modify all the "file->f_flags & O_NONBLOCK" tests 
(and there are many of them) to check for that flag too (that can be a 
per task_struct flag).

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: F_DUPFD_CLOEXEC implementation

2007-09-30 Thread Davide Libenzi

On Sun, 30 Sep 2007, Denys Vlasenko wrote:

> Hi Ulrich,
> 
> On Friday 28 September 2007 18:34, Ulrich Drepper wrote:
> > One more small change to extend the availability of creation of
> > file descriptors with FD_CLOEXEC set.  Adding a new command to
> > fcntl() requires no new system call and the overall impact on
> > code size if minimal.
> 
> Tangential question: do you have any idea how userspace can
> safely do nonblocking read or write on a potentially-shared fd?
> 
> IIUC, currently it cannot be done without races:
> 
> old_flags = fcntl(fd, F_GETFL);
> ...other process may change flags!...
> fcntl(fd, F_SETFL, old_flags | O_NONBLOCK);
> read(fd, ...)
> ...other process may see flags changed under its feet!...
> fcntl(fd, F_SETFL, old_flags);
> 
> Can this be fixed?

I'm not sure I understood correctly your use case. But, if you have two 
processes/threads randomly switching O_NONBLOCK on/off, your problems 
arise not only the F_SETFL time.
If one of the tasks is not expecting an fd to be O_NONBLOCK, that will 
likely end up not handling correctly read/write-miss situations.
In that case it'd be better to keep the fd as O_NONBLOCK, and manually 
create blocking behaviour (when needed) with poll+read/write.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 3/9] RCU: Preemptible RCU

2007-09-30 Thread Davide Libenzi

On Sun, 30 Sep 2007, Oleg Nesterov wrote:

> Ah, but I asked the different question. We must see CPU 1's stores by
> definition, but what about CPU 0's stores (which could be seen by CPU 1)?
> 
> Let's take a "real life" example,
> 
> A = B = X = 0;
> P = Q = &A;
> 
> CPU_0   CPU_1   CPU_2
> 
> P = &B; *P = 1; if (X) {
> wmb();  rmb();
> X = 1;  BUG_ON(*P != 1 && *Q != 1);
> }
> 
> So, is it possible that CPU_1 sees P == &B, but CPU_2 sees P == &A ?

That can't be. CPU_2 sees X=1, that happened after (or same time at most - 
from a cache inv. POV) to *P=1, that must have happened after P=&B (in 
order for *P to assign B). So P=&B happened, from a pure time POV, before 
the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [rfc][patch] i386: remove comment about barriers

2007-09-29 Thread Davide Libenzi

On Sat, 29 Sep 2007, Nick Piggin wrote:

> [ This is true for x86's sfence/lfence, but raises a question about Linux's
> memory barriers. Does anybody expect that a sequence of smp_wmb and smp_rmb
> would ever provide a full smp_mb barrier? I've always assumed no, but I
> don't know if it is actually documented? ]

No, that can't be. rmb+wmb can't be considered a full mb. There was a 
recent discussion about this in the thread originated by peterz scalable 
rw_mutex patches.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: F_DUPFD_CLOEXEC implementation

2007-09-28 Thread Davide Libenzi

On Fri, 28 Sep 2007, Ulrich Drepper wrote:

> One more small change to extend the availability of creation of
> file descriptors with FD_CLOEXEC set.  Adding a new command to
> fcntl() requires no new system call and the overall impact on
> code size if minimal.
> 
> If this patch gets accepted we will also add this change to the
> next revision of the POSIX spec.
> 
> To test the patch, use the following little program.  Adjust the
> value of F_DUPFD_CLOEXEC appropriately.

I think new system calls would have been a cleaner way to accomplish this. 
The "small pill at a time" may have better chance to go in, but will 
likely result in an uglier userspace interface.
In any case, this is better than *nothing*, if it makes it easier to use 
fds inside system libraries.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] anon-inodes use open coded atomic_inc for the shared inode

2007-09-27 Thread Davide Libenzi

Since we know the shared inode count is always >0, we can avoid igrab() 
and use an open coded atomic_inc().


Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/anon_inodes.c |   25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

Index: linux-2.6.mod/fs/anon_inodes.c
===
--- linux-2.6.mod.orig/fs/anon_inodes.c 2007-09-27 11:26:57.0 -0700
+++ linux-2.6.mod/fs/anon_inodes.c  2007-09-27 11:51:42.0 -0700
@@ -76,7 +76,6 @@
 {
struct qstr this;
struct dentry *dentry;
-   struct inode *inode;
struct file *file;
int error, fd;
 
@@ -86,15 +85,9 @@
if (!file)
return -ENFILE;
 
-   inode = igrab(anon_inode_inode);
-   if (IS_ERR(inode)) {
-   error = PTR_ERR(inode);
-   goto err_put_filp;
-   }
-
error = get_unused_fd();
if (error < 0)
-   goto err_iput;
+   goto err_put_filp;
fd = error;
 
/*
@@ -108,14 +101,22 @@
dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
if (!dentry)
goto err_put_unused_fd;
+
+   /*
+* We know the anon_inode inode count is always greater than zero,
+* so we can avoid doing an igrab() and we can use an open-coded
+* atomic_inc().
+*/
+   atomic_inc(&anon_inode_inode->i_count);
+
dentry->d_op = &anon_inodefs_dentry_operations;
/* Do not publish this dentry inside the global dentry hash table */
dentry->d_flags &= ~DCACHE_UNHASHED;
-   d_instantiate(dentry, inode);
+   d_instantiate(dentry, anon_inode_inode);
 
file->f_path.mnt = mntget(anon_inode_mnt);
file->f_path.dentry = dentry;
-   file->f_mapping = inode->i_mapping;
+   file->f_mapping = anon_inode_inode->i_mapping;
 
file->f_pos = 0;
file->f_flags = O_RDWR;
@@ -127,14 +128,12 @@
fd_install(fd, file);
 
*pfd = fd;
-   *pinode = inode;
+   *pinode = anon_inode_inode;
*pfile = file;
return 0;
 
 err_put_unused_fd:
put_unused_fd(fd);
-err_iput:
-   iput(inode);
 err_put_filp:
put_filp(file);
return error;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] rename signalfd_siginfo fields

2007-09-27 Thread Davide Libenzi


For Michael Kerrisk request, the following patch renames signalfd_siginfo 
fields in order to keep them consistent with the siginfo_t ones.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/signalfd.c|   44 ++--
 include/linux/signalfd.h |   32 
 2 files changed, 38 insertions(+), 38 deletions(-)

Index: linux-2.6.mod/fs/signalfd.c
===
--- linux-2.6.mod.orig/fs/signalfd.c2007-09-27 10:31:48.0 -0700
+++ linux-2.6.mod/fs/signalfd.c 2007-09-27 10:33:45.0 -0700
@@ -74,45 +74,45 @@
 * If you change siginfo_t structure, please be sure
 * this code is fixed accordingly.
 */
-   err |= __put_user(kinfo->si_signo, &uinfo->signo);
-   err |= __put_user(kinfo->si_errno, &uinfo->err);
-   err |= __put_user((short)kinfo->si_code, &uinfo->code);
+   err |= __put_user(kinfo->si_signo, &uinfo->ssi_signo);
+   err |= __put_user(kinfo->si_errno, &uinfo->ssi_errno);
+   err |= __put_user((short) kinfo->si_code, &uinfo->ssi_code);
switch (kinfo->si_code & __SI_MASK) {
case __SI_KILL:
-   err |= __put_user(kinfo->si_pid, &uinfo->pid);
-   err |= __put_user(kinfo->si_uid, &uinfo->uid);
+   err |= __put_user(kinfo->si_pid, &uinfo->ssi_pid);
+   err |= __put_user(kinfo->si_uid, &uinfo->ssi_uid);
break;
case __SI_TIMER:
-err |= __put_user(kinfo->si_tid, &uinfo->tid);
-err |= __put_user(kinfo->si_overrun, &uinfo->overrun);
-err |= __put_user((long)kinfo->si_ptr, &uinfo->svptr);
+err |= __put_user(kinfo->si_tid, &uinfo->ssi_tid);
+err |= __put_user(kinfo->si_overrun, &uinfo->ssi_overrun);
+err |= __put_user((long) kinfo->si_ptr, &uinfo->ssi_ptr);
break;
case __SI_POLL:
-   err |= __put_user(kinfo->si_band, &uinfo->band);
-   err |= __put_user(kinfo->si_fd, &uinfo->fd);
+   err |= __put_user(kinfo->si_band, &uinfo->ssi_band);
+   err |= __put_user(kinfo->si_fd, &uinfo->ssi_fd);
break;
case __SI_FAULT:
-   err |= __put_user((long)kinfo->si_addr, &uinfo->addr);
+   err |= __put_user((long) kinfo->si_addr, &uinfo->ssi_addr);
 #ifdef __ARCH_SI_TRAPNO
-   err |= __put_user(kinfo->si_trapno, &uinfo->trapno);
+   err |= __put_user(kinfo->si_trapno, &uinfo->ssi_trapno);
 #endif
break;
case __SI_CHLD:
-   err |= __put_user(kinfo->si_pid, &uinfo->pid);
-   err |= __put_user(kinfo->si_uid, &uinfo->uid);
-   err |= __put_user(kinfo->si_status, &uinfo->status);
-   err |= __put_user(kinfo->si_utime, &uinfo->utime);
-   err |= __put_user(kinfo->si_stime, &uinfo->stime);
+   err |= __put_user(kinfo->si_pid, &uinfo->ssi_pid);
+   err |= __put_user(kinfo->si_uid, &uinfo->ssi_uid);
+   err |= __put_user(kinfo->si_status, &uinfo->ssi_status);
+   err |= __put_user(kinfo->si_utime, &uinfo->ssi_utime);
+   err |= __put_user(kinfo->si_stime, &uinfo->ssi_stime);
break;
case __SI_RT: /* This is not generated by the kernel as of now. */
case __SI_MESGQ: /* But this is */
-   err |= __put_user(kinfo->si_pid, &uinfo->pid);
-   err |= __put_user(kinfo->si_uid, &uinfo->uid);
-   err |= __put_user((long)kinfo->si_ptr, &uinfo->svptr);
+   err |= __put_user(kinfo->si_pid, &uinfo->ssi_pid);
+   err |= __put_user(kinfo->si_uid, &uinfo->ssi_uid);
+   err |= __put_user((long) kinfo->si_ptr, &uinfo->ssi_ptr);
break;
default: /* this is just in case for now ... */
-   err |= __put_user(kinfo->si_pid, &uinfo->pid);
-   err |= __put_user(kinfo->si_uid, &uinfo->uid);
+   err |= __put_user(kinfo->si_pid, &uinfo->ssi_pid);
+   err |= __put_user(kinfo->si_uid, &uinfo->ssi_uid);
break;
}
 
Index: linux-2.6.mod/include/linux/signalfd.h
===
--- linux-2.6.mod.orig/include/linux/signalfd.h 2007-09-27 10:28:59.0 
-0700
+++ linux-2.6.mod/include/linux/signalfd.h

Re: Revised signalfd man-page

2007-09-27 Thread Davide Libenzi

On Thu, 27 Sep 2007, Michael Kerrisk wrote:

> .\" Copyright (C) 2007 Michael Kerrisk <[EMAIL PROTECTED]>
> .\" starting from a version by Davide Libenzi <[EMAIL PROTECTED]>
> .\"
> .\" This program is free software; you can redistribute it and/or modify
> .\" it under the terms of the GNU General Public License as published by
> .\" the Free Software Foundation; either version 2 of the License, or
> .\" (at your option) any later version.
> .\"
> .\" This program is distributed in the hope that it will be useful,
> .\" but WITHOUT ANY WARRANTY; without even the implied warranty of
> .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> .\" GNU General Public License for more details.
> .\"
> .\" You should have received a copy of the GNU General Public License
> .\" along with this program; if not, write to the Free Software
> .\" Foundation, Inc., 59 Temple Place, Suite 330, Boston,
> .\" MA  02111-1307  USA
> .\"
> .TH SIGNALFD 2 2007-09-27 Linux "Linux Programmer's Manual"
> .SH NAME
> signalfd \- create a file descriptor for accepting signals
> .SH SYNOPSIS
> .\" FIXME . This header file may well change
> .\" FIXME . Probably _GNU_SOURCE will be required
> .\" FIXME . May require: Link with \fI\-lrt\f
> .B #include 
> .sp
> .BI "int signalfd(int " fd ", const sigset_t *" mask );
> .\" Almost certainly the glibc wrapper will hide this argument:
> .\" ", size_t " sizemask
> .SH DESCRIPTION
> .BR signalfd (2)
> creates a file descriptor that can be used to accept signals
> targeted at the caller.
> This provides an alternative to the use of a signal handler or
> .BR sigwaitinfo (2),
> and has the advantage that the file descriptor may be monitored by
> .BR select (2),
> .BR poll (2),
> and
> .BR epoll (7).
> The
> .I mask
> argument specifies the set of signals that the caller
> wishes to accept via the file descriptor.
> This argument is a signal set whose contents can be initialized
> using the macros described in
> .BR sigsetops (3).
> Normally, the set of signals to be received via the
> file descriptor should be blocked using
> .BR sigprocmask (2),
> to prevent the signals being handled according to their default
> dispositions.
> It is not possible to receive
> .B SIGKILL
> or
> .B SIGSTOP
> signals via a signalfd file descriptor;
> these signals are silently ignored if specified in
> .IR mask .
> 
> If the
> .I fd
> argument is \-1,
> then the call creates a new file descriptor and associates the
> signal set specified in
> .I mask
> with that descriptor.
> If
> .I fd
> is not \-1,
> then it must specify a valid existing signalfd file descriptor, and
> .I mask
> is used to replace the signal set associated with that descriptor.
> 
> .BR signalfd (2)
> returns a file descriptor that supports the following operations:
> .TP
> .BR read (2)
> If one or more of the signals specified in
> .I mask
> is pending for the process, then the buffer supplied to
> .BR read (2)
> is used to return one or more
> .I signalfd_siginfo
> structures (see below) that describe the signals.
> The
> .BR read (2)
> returns information for as many signals as are pending and will
> fit in the supplied buffer.
> The buffer must be at least
> .I "sizeof(struct signalfd_siginfo)"
> bytes.
> The return value of the
> .BR read (2)
> is the total number of bytes read.
> .IP
> As a consequence of the
> .BR read (2),
> the signals are consumed,
> so that they are no longer pending for the process
> (i.e., will not be caught by signal handlers,
> and cannot be accepted using
> .BR sigwaitinfo (2)).
> .IP
> If none of the signals in
> .I mask
> is pending for the process, then the
> .BR read (2)
> either blocks until one of the signals in
> .I mask
> is generated for the process,
> or fails with the error
> .B EAGAIN
> if the file descriptor has been made non-blocking
> (via the use of the
> .BR fcntl (2)
> .B F_SETFL
> operation to set the
> .B O_NONBLOCK
> flag).
> 
> .\" FIXME Davide, what does the following mean?  How (in userspace
> .\" terms) does a sighand structure become orphaned?
> The
> .BR read (2)
> call can also return 0,
> in case the sighand structure to which the signalfd was attached,
> has been orphaned.

You can remove the five lines above, in virtue of the fact that Linus 
merged my simplification patch.




> .TP
> .BR poll "(2), " select "(2) (and similar)"
> The file descriptor is readable
>

Re: eventfd.2 man page

2007-09-27 Thread Davide Libenzi

On Thu, 27 Sep 2007, Michael Kerrisk wrote:

> Hi Davide,
> 
> I've slightly tweaked the eventfd.2 man page in preparation for adding it
> to the man-pages set.  Could you please review the text below, and confirm
> that it correctly describes intended behavior.

Looks good to me. At the time that I was tsting it, I cooked an example 
about how to use an eventfd with KAIO:

http://www.xmailserver.org/eventfd-aio-test.c

In case you'd like to add a sketch to the man page.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Man page for revised timerfd API

2007-09-27 Thread Davide Libenzi

On Thu, 27 Sep 2007, Michael Kerrisk wrote:

> Davide,
> 
> A further question: what is the expected behavior in the
> following scenario:
> 
> 1. Create a timerfd and arm it.
> 2. Wait until M timer expirations have occurred
> 3. Modify the settings of the timer
> 4. Wait for N further timer expirations have occurred
> 5. read() from the timerfd
> 
> Does the buffer returned by the read() contain the value
> N or (M+N)?  In other words, should modifying the timer
> settings reset the expiration count to zero?

Every timerfd_settime() zeroes the tick counter. So in your scenario it'll 
return N.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Man page for revised timerfd API

2007-09-27 Thread Davide Libenzi

On Thu, 27 Sep 2007, Michael Kerrisk wrote:

> Hi Davide,
> 
> A follow up to the man page text.  Does passing a timerfd file
> descriptor via a Unix domain socket to another process do the
> expected thing?  That is, the receiving process will be able to 
> read from the file descriptor in order to obtain notification
> of timer expirations that are occurring for the process
> that sent the file descriptor, right?

Yup.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Man page for revised timerfd API

2007-09-26 Thread Davide Libenzi


Michael, SCB ...


On Wed, 26 Sep 2007, Michael Kerrisk wrote:

> .TH TIMERFD_CREATE 2 2007-09-26 Linux "Linux Programmer's Manual"
> .SH NAME
> timerfd_create, timerfd_settime, timer_gettime \-
> timers that notify via file descriptors
> .SH SYNOPSIS
> .\" FIXME . This header file may well change
> .\" FIXME . Probably _GNU_SOURCE will be required
> .\" FIXME . May require: Link with \fI\-lrt\f
> .nf
> .B #include 
> .sp
> .BI "int timerfd_create(int " clockid );
> .sp
> .BI "int timerfd_settime(int " fd ", int " flags ,
> .BI "const struct itimerspec *" new_value ,
> .BI "struct itimerspec *" curr_value );
> .sp
> .BI "int timerfd_gettime(int " fd ", struct itimerspec *" curr_value );
> .fi
> .SH DESCRIPTION
> These system calls create and operate on a timer
> that delivers timer expiration notifications via a file descriptor.
> They provide an alternative to the use of
> .BR setitimer (2)
> or
> .BR timer_create (3),
> with the advantage that the file descriptor may be monitored by
> .BR poll (2)
> and
> .BR select (2).

epoll, no?




> The use of these three system calls is analogous to the use of
> .BR timer_create (2),
> .BR timer_settime (2),
> and
> .BR timer_gettime (2).
> .\"
> .SS timerfd_create()
> .BR timerfd_create ()
> creates a new timer object,
> and returns a file descriptor that refers to that timer.
> The
> .I clockid
> argument specifies the clock that is used to mark the progress
> of the timer, and must be either
> .B CLOCK_REALTIME
> or
> .BR CLOCK_MONOTONIC .
> .B CLOCK_REALTIME
> is a settable system-wide clock.
> .B CLOCK_MONOTONIC
> is a non-settable clock that is not affected
> by discontinuous changes in the system clock
> (e.g., manual changes to system time).
> The current value of each of these clocks can be retrieved using
> .BR clock_gettime (3).
> .\"
> .SS timerfd_settime()
> .BR timerfd_settime ()
> arms (starts) or disarms (stops)
> the timer referred to by the file descriptor
> .IR fd .
> 
> The
> .I new_value
> argument specifies the initial expiration and interval for the timer.
> The
> .I itimer
> structure used for this argument contains two fields,
> each of which is in turn a structure of type
> .IR timespec :
> .in +0.25i
> .nf
> 
> struct timespec {
> time_t tv_sec;/* Seconds */
> long   tv_nsec;   /* Nanoseconds */
> };
> 
> struct itimerspec {
> struct timespec it_interval;  /* Interval for periodic timer */
> struct timespec it_value; /* Initial expiration */
> };
> .fi
> .in
> .PP
> .I new_value.it_value
> specifies the initial expiration of the timer,
> in seconds and nanoseconds.
> Setting either field of
> .I new_value.it_value
> to a non-zero value arms the timer.
> Setting both fields of
> .I new_value.it_value
> to zero disarms the timer.
> 
> Setting one or both fields of
> .I new_value.it_interval
> to non-zero values specifies the period, in seconds and nanoseconds,
> for repeated timer expirations after the initial expiration.
> If both fields of
> .I new_value.it_interval
> are zero, the timer expires just once, at the time specified by
> .IR new_value.it_value .
> 
> The
> .I flags
> argument is either 0, to start a relative timer
> .RI ( new_value.it_interval
> specifies a time relative to the current value of the clock specified by
> .IR clockid ),
> or
> .BR TFD_TIMER_ABSTIME ,
> to start an absolute timer
> .RI ( new_value.it_interval
> specifies an absolute time for the clock specified by
> .IR clockid ;
> that is, the timer will expire when the value of that
> clock reaches the value specified in
> .IR new_value.it_interval ).
> 
> The
> .I curr_value
> argument returns a structure containing the setting of the timer that
> was current at the time of the call; see the description of
> .BR timerfd_gettime ()
> following.
> .\"
> .SS timerfd_gettime()
> .BR timerfd_gettime ()
> returns, in
> .IR curr_value ,
> an
> .IR itimerspec
> that contains the current setting of the timer
> referred to by the file descriptor
> .IR fd .
> 
> The
> .I it_value
> field returns the amount of time
> until the timer will next expire.
> If both fields of this structure are zero,
> then the timer is currently disarmed.
> This field always contains a relative value, regardless of whether the
> .BR TFD_TIMER_ABSTIME
> flag was specified when setting the timer.
> 
> The
> .I it_interval
> field returns the interval of the timer.
> If both fields of this structure are zero,
> then the timer is set to expire just once, at the time specified by
> .IR curr_value.it_value .
> .SS Operating on a timer file descriptor
> The file descriptor returned by
> .BR timerfd_create (2)
> supports the following operations:
> .TP
> .BR read (2)
> If the timer has already expired one or more times since it was created,
> or since the last
> .BR read (2),
> then the buffer given to
> .BR read (2)
> returns an unsigned 8-byte integer
> .RI ( uint64_t )
> containing the number of expirations tha

Re: [patch 2/4] new timerfd API v2 - new timerfd API

2007-09-25 Thread Davide Libenzi

On Tue, 25 Sep 2007, Jonathan Corbet wrote:

> One quick question:
> 
> > Like the previous timerfd API implementation, read(2) and poll(2) are 
> > supported
> > (with the same interface).
> 
> Looking at that interface, it appears that a process doing a read() on a
> timerfd with no timer set will block for a very long time.  It's an
> obvious "don't do that" situation, but perhaps we could help an
> occasional developer get a clue by returning something like -EINVAL when
> the timer has not been set?

That is the same as you try to read once more after an expired timer. You 
won't wake up until the next timer event will show up. That is, after at 
most TP time for periodic timers, or after the time  the next 
timerfd_settime() will setup.
I'd like to keep the "timerfd not set yet" and the "timerfd already 
expired and not re-armed" acting the same way. That is, wait till next 
event happen (unless O_NONBLOCK of course).

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/2] Enable link power management for ata drivers

2007-09-24 Thread Davide Libenzi

On Tue, 25 Sep 2007, roel wrote:

> > +   if (!(ap->flags & ATA_FLAG_IPM) || !ata_dev_enabled(dev)) {
> 
>   if (!((ap->flags & ATA_FLAG_IPM) && ata_dev_enabled(dev))) {

int foo(int i, int j) {

return !(i & 8) || !j;
}

int moo(int i, int j) {

return !((i & 8) && j);
}


gcc -O2 -S:


.globl foo
.type   foo, @function
foo:
shrl$3, %edi
xorl$1, %edi
testl   %esi, %esi
sete%al
orl %eax, %edi
andl$1, %edi
movl%edi, %eax
ret
.globl moo
.type   moo, @function
moo:
shrl$3, %edi
xorl$1, %edi
testl   %esi, %esi
sete%al
orl %eax, %edi
andl$1, %edi
movl%edi, %eax
ret



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 4/4] new timerfd API v2 - un-break CONFIG_TIMERFD

2007-09-24 Thread Davide Libenzi

Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-09-24 12:13:25.0 -0700
+++ linux-2.6.mod/init/Kconfig  2007-09-24 12:31:40.0 -0700
@@ -488,7 +488,6 @@
 config TIMERFD
bool "Enable timerfd() system call" if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 2/4] new timerfd API v2 - new timerfd API

2007-09-24 Thread Davide Libenzi

This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not 
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  197 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 164 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-09-24 12:26:19.0 -0700
+++ linux-2.6.mod/fs/timerfd.c  2007-09-24 12:31:22.0 -0700
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
ctx->expired = 1;
+   ctx->ticks++;
wake_up_locked(&ctx->wqh);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
+   ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(&ctx->tmr, clockid, htmode);
+   hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, &ctx->wqh, wait);
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
-   if (ctx->expired)
+   if (ctx->ticks)
events |= POLLIN;
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(&ctx->wqh.lock);
res = -EAGAIN;
-   if (!ctx->expired && !(file->f_flags & O_NONBLOCK)) {
+   if (!ctx->ticks && !(file->f_flags & O_NONBLOCK)) {
__add_wait_queue(&ctx->wqh, &wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->expired) {
+   if (ctx->ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx->expired) {
-   ctx->expired = 0;
-   if (ctx->tintv.tv64 != 0) {
+   if (ctx->ticks) {
+   ticks = ctx->ticks;
+   if (ctx->expired && ctx->tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(&ctx->tmr,
-   hrtimer_cb_get_time(&ctx->tmr),
-

[patch 3/4] new timerfd API v2 - wire the new timerfd API to the x86 family

2007-09-24 Thread Davide Libenzi

Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 arch/i386/kernel/syscall_table.S |5 -
 arch/x86_64/ia32/ia32entry.S |4 +++-
 include/asm-i386/unistd.h|6 --
 include/asm-x86_64/unistd.h  |8 ++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S 2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/arch/i386/kernel/syscall_table.S  2007-09-24 
12:31:39.0 -0700
@@ -321,6 +321,9 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
+
Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S 2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S  2007-09-24 12:31:39.0 
-0700
@@ -730,7 +730,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:
Index: linux-2.6.mod/include/asm-i386/unistd.h
===
--- linux-2.6.mod.orig/include/asm-i386/unistd.h2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/include/asm-i386/unistd.h 2007-09-24 12:31:39.0 
-0700
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86_64/unistd.h
===
--- linux-2.6.mod.orig/include/asm-x86_64/unistd.h  2007-09-24 
12:13:28.0 -0700
+++ linux-2.6.mod/include/asm-x86_64/unistd.h   2007-09-24 12:31:39.0 
-0700
@@ -626,12 +626,16 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/4] new timerfd API v2 - introduce a new hrtimer_forward_now() function

2007-09-24 Thread Davide Libenzi

I think that advancing the timer against the timer's current "now" can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-09-24 12:27:20.0 
-0700
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-09-24 12:29:39.0 
-0700
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer->base->get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec __user *rmtp,

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/3] new timerfd API - wire the new timerfd API to the x86 family

2007-09-24 Thread Davide Libenzi

On Mon, 24 Sep 2007, Michael Kerrisk wrote:

> Hi Davide,
> 
> Davide Libenzi wrote:
> > On Mon, 24 Sep 2007, Michael Kerrisk wrote:
> >> Is it perhaps not better to group the three syscalls contiguously with
> >> respect to syscall numbers?  The old timerfd slot can be re-used for some
> >> other syscall later.
> > 
> > There's no problem if they're not contiguous. 
> 
> I realise there is no problem, in a technical sense.  But it strikes me as
> more aesthetic to make related syscalls numerically contiguous.  Thus, we
> see such as the following in the kernel source
> 
> #define __NR_epoll_create   254
> #define __NR_epoll_ctl  255
> #define __NR_epoll_wait 256
> 
> and
> 
> #define __NR_timer_create   259
> #define __NR_timer_settime  (__NR_timer_create+1)
> #define __NR_timer_gettime  (__NR_timer_create+2)
> #define __NR_timer_getoverrun   (__NR_timer_create+3)
> #define __NR_timer_delete   (__NR_timer_create+4)
> 
> and
> 
> #define __NR_inotify_init   291
> #define __NR_inotify_add_watch  292
> #define __NR_inotify_rm_watch   293
> 
> > Holes, unless filled 
> > immediately, need to be remembered to be filled.
> 
> Well, in the past it seems they do get filled soon enough though.  There's
> fair odds that you'll be the one to fill it with the next syscall you write
> ;-).

You have to talk to arch mantainers. I do not care. I simply provided the 
x86 hooks because I tested on x86.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/3] new timerfd API - new timerfd API

2007-09-24 Thread Davide Libenzi

On Mon, 24 Sep 2007, Thomas Gleixner wrote:

> >  struct timerfd_ctx {
> > struct hrtimer tmr;
> > +   int clockid;
> > ktime_t tintv;
> > wait_queue_head_t wqh;
> > int expired;
> > +   u64 ticks;
> >  };
> 
> Can you please restructure the struct in a way which does not result in
> padding by the compiler ?
> 
> struct timerfd_ctx {
>   struct hrtimer tmr;
>   ktime_t tintv;
>   wait_queue_head_t wqh;
>   u64 ticks;
>   int expired;
>   int clockid;
> };

Sure.



> > +   ticks += (u64)
> > hrtimer_forward(&ctx->tmr,
> > hrtimer_cb_get_time(&ctx->tmr),
> 
> You need to use ctx->tmr.base->get_time() here, otherwise you might read
> a stale time value (in case that CONFIG_HIGH_RES_TIMERS is off).

Is the particular position of hrtimer_cb_get_time() in the code that would 
break here? Because function was added by your patch ;)
Did something change later?



> > +err_kfree_ctx:
> > +   kfree(ctx);
> > +   return error;
> 
> You really can avoid the goto here.

Ack.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/3] new timerfd API - wire the new timerfd API to the x86 family

2007-09-24 Thread Davide Libenzi

On Mon, 24 Sep 2007, Michael Kerrisk wrote:

> Davide,
> 
> Is it perhaps not better to group the three syscalls contiguously with
> respect to syscall numbers?  The old timerfd slot can be re-used for some
> other syscall later.

There's no problem if they're not contiguous. Holes, unless filled 
immediately, need to be remembered to be filled.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 3/3] new timerfd API - un-break CONFIG_TIMERFD

2007-09-23 Thread Davide Libenzi

Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-09-23 15:18:06.0 -0700
+++ linux-2.6.mod/init/Kconfig  2007-09-23 15:28:54.0 -0700
@@ -488,7 +488,6 @@
 config TIMERFD
bool "Enable timerfd() system call" if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/3] new timerfd API - new timerfd API

2007-09-23 Thread Davide Libenzi

This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not 
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  199 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 168 insertions(+), 77 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-09-23 15:18:09.0 -0700
+++ linux-2.6.mod/fs/timerfd.c  2007-09-23 15:25:55.0 -0700
@@ -23,15 +23,17 @@
 
 struct timerfd_ctx {
struct hrtimer tmr;
+   int clockid;
ktime_t tintv;
wait_queue_head_t wqh;
int expired;
+   u64 ticks;
 };
 
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
ctx->expired = 1;
+   ctx->ticks++;
wake_up_locked(&ctx->wqh);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
+   ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(&ctx->tmr, clockid, htmode);
+   hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, &ctx->wqh, wait);
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
-   if (ctx->expired)
+   if (ctx->ticks)
events |= POLLIN;
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(&ctx->wqh.lock);
res = -EAGAIN;
-   if (!ctx->expired && !(file->f_flags & O_NONBLOCK)) {
+   if (!ctx->ticks && !(file->f_flags & O_NONBLOCK)) {
__add_wait_queue(&ctx->wqh, &wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->expired) {
+   if (ctx->ticks) {
res = 0;
break;
}
@@ -121,22 +125,23 @@
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx->expired) {
-   ctx->expired = 0;
-   if (ctx->tintv.tv64 != 0) {
+   if (ctx->ticks) {
+   ticks = ctx->ticks;
+   if (ctx->expired && ctx->tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
+   ticks += (u64)
hrtimer_forward(&ctx->tmr,

[patch 2/3] new timerfd API - wire the new timerfd API to the x86 family

2007-09-23 Thread Davide Libenzi

Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 arch/i386/kernel/syscall_table.S |5 -
 arch/x86_64/ia32/ia32entry.S |4 +++-
 include/asm-i386/unistd.h|6 --
 include/asm-x86_64/unistd.h  |8 ++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S 2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/arch/i386/kernel/syscall_table.S  2007-09-23 
15:28:52.0 -0700
@@ -321,6 +321,9 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
+
Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S 2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S  2007-09-23 15:28:52.0 
-0700
@@ -730,7 +730,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:
Index: linux-2.6.mod/include/asm-i386/unistd.h
===
--- linux-2.6.mod.orig/include/asm-i386/unistd.h2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/include/asm-i386/unistd.h 2007-09-23 15:28:52.0 
-0700
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86_64/unistd.h
===
--- linux-2.6.mod.orig/include/asm-x86_64/unistd.h  2007-09-23 
15:28:48.0 -0700
+++ linux-2.6.mod/include/asm-x86_64/unistd.h   2007-09-23 15:28:52.0 
-0700
@@ -626,12 +626,16 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RFC: A revised timerfd API

2007-09-23 Thread Davide Libenzi

On Sun, 23 Sep 2007, Davide Libenzi wrote:

> On Sun, 23 Sep 2007, Michael Kerrisk wrote:
> 
> > I applied this patch against 2.6.27-rc7, and wired up the syscalls as shown
> > in the definitions below.  When I ran the the program below, my system
> > immediately froze.  Can you try it on your system please.
> 
> There's an hrtimer_init() missing in timerfd_create(). I'll refactor the 
> patch.

There's the case of a timerfd_gettime return status when the timerfd has 
not been set yet (ie, soon after a timerfd_create), to handle.
Current way is to return an (itimerspec) { 0, 0 }. Ok?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RFC: A revised timerfd API

2007-09-23 Thread Davide Libenzi

On Sun, 23 Sep 2007, Michael Kerrisk wrote:

> I applied this patch against 2.6.27-rc7, and wired up the syscalls as shown
> in the definitions below.  When I ran the the program below, my system
> immediately froze.  Can you try it on your system please.

There's an hrtimer_init() missing in timerfd_create(). I'll refactor the 
patch.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RFC: A revised timerfd API

2007-09-22 Thread Davide Libenzi

On Sat, 22 Sep 2007, Thomas Gleixner wrote:

> On Sat, 2007-09-22 at 14:07 -0700, Davide Libenzi wrote:
> > On Sat, 22 Sep 2007, Michael Kerrisk wrote:
> > 
> > > So I'm inclined to implement option (b), unless someone has strong
> > > objections.  Davide, could I persuade you to help?
> > 
> > I guess I better do, otherwise you'll continue to stress me ;)
> > 
> > int timerfd_create(int clockid);
> > int timerfd_settime(int ufd, int flags,
> > const struct itimerspec *utmr,
> > struct itimerspec *otmr);
> > int timerfd_gettime(int ufd, struct itimerspec *otmr);
> > 
> > Patch below. Builds, not tested yet (you need to remove the "broken" 
> > status from CONFIG_TIMERFD in case you want to test - and plug the new 
> > syscall to arch/xxx).
> > May that work for you?
> > Thomas-san, hrtimer_try_to_cancel() does not touch ->expires and I assume
> > it'll never do, granted?
> 
> Davide-san, I have no intention to change that, but remember there is
> this file "Documentation/stable_api_nonsense.txt" :)

Heh, I guess that'll work then ;)



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RFC: A revised timerfd API

2007-09-22 Thread Davide Libenzi

On Sat, 22 Sep 2007, Michael Kerrisk wrote:

> So I'm inclined to implement option (b), unless someone has strong
> objections.  Davide, could I persuade you to help?

I guess I better do, otherwise you'll continue to stress me ;)

int timerfd_create(int clockid);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

Patch below. Builds, not tested yet (you need to remove the "broken" 
status from CONFIG_TIMERFD in case you want to test - and plug the new 
syscall to arch/xxx).
May that work for you?
Thomas-san, hrtimer_try_to_cancel() does not touch ->expires and I assume
it'll never do, granted?



- Davide



---
 fs/compat.c  |   32 --
 fs/timerfd.c |  144 +--
 include/linux/compat.h   |7 +-
 include/linux/syscalls.h |7 +-
 4 files changed, 126 insertions(+), 64 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-09-22 12:22:19.0 -0700
+++ linux-2.6.mod/fs/timerfd.c  2007-09-22 13:21:21.0 -0700
@@ -23,6 +23,7 @@
 
 struct timerfd_ctx {
struct hrtimer tmr;
+   int clockid;
ktime_t tintv;
wait_queue_head_t wqh;
int expired;
@@ -46,7 +47,7 @@
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -58,7 +59,7 @@
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(&ctx->tmr, clockid, htmode);
+   hrtimer_init(&ctx->tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -150,76 +151,109 @@
.read   = timerfd_read,
 };
 
-asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
-   const struct itimerspec __user *utmr)
+asmlinkage long sys_timerfd_create(int clockid)
 {
-   int error;
+   int error, ufd;
struct timerfd_ctx *ctx;
struct file *file;
struct inode *inode;
-   struct itimerspec ktmr;
-
-   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
-   return -EFAULT;
 
if (clockid != CLOCK_MONOTONIC &&
clockid != CLOCK_REALTIME)
return -EINVAL;
+
+   ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+   if (!ctx)
+   return -ENOMEM;
+
+   init_waitqueue_head(&ctx->wqh);
+   ctx->clockid = clockid;
+
+   error = anon_inode_getfd(&ufd, &inode, &file, "[timerfd]",
+&timerfd_fops, ctx);
+   if (error)
+   goto err_kfree_ctx;
+
+   return ufd;
+
+err_kfree_ctx:
+   kfree(ctx);
+   return error;
+}
+
+asmlinkage long sys_timerfd_settime(int ufd, int flags,
+   const struct itimerspec __user *utmr,
+   struct itimerspec __user *otmr)
+{
+   struct file *file;
+   struct timerfd_ctx *ctx;
+   struct itimerspec ktmr, kotmr;
+
+   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
+   return -EFAULT;
+
if (!timespec_valid(&ktmr.it_value) ||
!timespec_valid(&ktmr.it_interval))
return -EINVAL;
 
-   if (ufd == -1) {
-   ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
-   if (!ctx)
-   return -ENOMEM;
-
-   init_waitqueue_head(&ctx->wqh);
-
-   timerfd_setup(ctx, clockid, flags, &ktmr);
-
-   /*
-* When we call this, the initialization must be complete, since
-* anon_inode_getfd() will install the fd.
-*/
-   error = anon_inode_getfd(&ufd, &inode, &file, "[timerfd]",
-&timerfd_fops, ctx);
-   if (error)
-   goto err_tmrcancel;
-   } else {
-   file = fget(ufd);
-   if (!file)
-   return -EBADF;
-   ctx = file->private_data;
-   if (file->f_op != &timerfd_fops) {
-   fput(file);
-   return -EINVAL;
-   }
-   /*
-* We need to stop the existing timer before reprogramming
-* it to the new values.
-*/
-   for (;;) {
-   spin_lock_irq(&ctx->wqh.lock);
-   if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
-   break;
-   spin_unlock_irq(&ctx->

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-21 Thread Davide Libenzi

On Thu, 20 Sep 2007, Nagendra Tomar wrote:

> > That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the 
> > ability to write, and it is not meant as to signal every time a packet 
> > (skb) is sent on the wire (and the buffer released).
> 
> Aren't they both the same ? Everytime an incoming ACK frees up a buffer
> from the retransmit queue, the writability condition is freshly asserted,
> much the same way as the readability condition is asserted everytime a 
> new data is queued in the socket receive queue (irrespective of 
> whether there was data already waiting to be read in the receive queue).
> 
> This difference in meaning of POLLOUT only arises in the ET case, which was
> not what traditional Unix poll referred to. 

Again, events here are "readability" and "writeability" and the fact that 
the lower network layer freed a buffer is not, per se, an event (unless 
before, "writeability" was tested and reported as unavailable).
In you case the solution looks pretty simple. Just create appropriately 
sized buffers, split the single sendfile into multiple buffer-sized ones, 
and recycle the buffer once each of them completes.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Davide Libenzi

On Thu, 20 Sep 2007, Nagendra Tomar wrote:

> 
> --- Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > Looking back at it, I think the current TCP code is right, once you look 
> > at the "event" to be a output buffer full->with_space transition.
> > If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event 
> > (free space on the output buffer), if you do not consume it (say a 
> > tcp_sendmsg that re-fill the buffer), you can't see other OUT event 
> > anymore since they happen on the full->with_space transition.
> > Yes, I know, the read size (EPOLLIN) works differently and you get an 
> > event for every packet you receive. And yes, I do not like asymmetric 
> > things. But that does not make the EPOLLOUT|EPOLLET wrong IMO.
> > 
> 
> I agree that ET means the event should happen at the transition
> from nospace->space condition, but isn't the other case (event is 
> delivered every time the event actually happens) more usable.
> Also the epoll man page says so
> 
> "... Edge Triggered event distribution delivers events only when 
> events happens on the  monitored file."
> 
> This serves the purpose of ET (reducing the number of poll events) and
> at the same time makes userspace coding easier. My userspace code
> has the liberty of deciding when it can write to the socket. f.e. the
> sendfile buffer management example that I quoted in my earlier post
> will be difficult with the current ET|POLLOUT behaviour. I cannot 
> write in full-buffer units. I'll ve to write partial buffers just to 
> fill the TCP writeq which is needed to trigger the event.

That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the 
ability to write, and it is not meant as to signal every time a packet 
(skb) is sent on the wire (and the buffer released).
In your particular application, you could simply split the sendfile into 
appropriately sized chunks, and handle the buffer realease in there.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch] signalfd simplification

2007-09-20 Thread Davide Libenzi

The following patch simplifies signalfd code, by avoiding it to remain 
attached to the sighand during its lifetime.
In this way, the signalfd remain attached to the sighand only during 
poll(2) (and select and epoll) and read(2). This also allows to remove all 
the custom "tsk == current" checks in kernel/signal.c, since 
dequeue_signal() will only be called by "current".
I think this is also what Ben was suggesting time ago.
The external effect of this, is that a thread can extract only its own 
private signals and the group ones. I think this is an acceptable 
behaviour, in that those are the signals the thread would be able to fetch 
w/out signalfd.
Oleg, as far as the smart wakeup thing. Can't be done w/out huge 
revolutions to the kernel code.
It can easily be done for the ->read(), but not for the ->poll(). You need 
to be the provider of the poll_table in order to install custom wakeups, 
and inside ->poll(), the providers are either fs/select.c or 
fs/eventpoll.c (or fs/aio.c). We could add a new poll_wait_func() to pass 
the function to be called, and that would work for fs/select.c (in there 
you can filter based on the signal and eventually issue the real wakeup). 
But it won't work for epoll, since it already expect to install its own 
callback, and ATM callbacks can't be chained.
Since the thundering herd effect on signalfds should be pretty limited, I 
think it's not worth the change.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide



---
 fs/exec.c |3 
 fs/signalfd.c |  190 +++---
 include/linux/init_task.h |2 
 include/linux/sched.h |2 
 include/linux/signalfd.h  |   40 -
 kernel/exit.c |9 --
 kernel/fork.c |2 
 kernel/signal.c   |8 -
 8 files changed, 39 insertions(+), 217 deletions(-)

Index: linux-2.6.mod/fs/signalfd.c
===
--- linux-2.6.mod.orig/fs/signalfd.c2007-09-19 15:54:21.0 -0700
+++ linux-2.6.mod/fs/signalfd.c 2007-09-19 15:54:42.0 -0700
@@ -11,8 +11,10 @@
  *  Now using anonymous inode source.
  *  Thanks to Oleg Nesterov for useful code review and suggestions.
  *  More comments and suggestions from Arnd Bergmann.
- * Sat May 19, 2007: Davi E. M. Arnaut <[EMAIL PROTECTED]>
+ *  Sat May 19, 2007: Davi E. M. Arnaut <[EMAIL PROTECTED]>
  *  Retrieve multiple signals with one read() call
+ *  Sun Jul 15, 2007: Davide Libenzi <[EMAIL PROTECTED]>
+ *  Attach to the sighand only during read() and poll().
  */
 
 #include 
@@ -27,102 +29,12 @@
 #include 
 
 struct signalfd_ctx {
-   struct list_head lnk;
-   wait_queue_head_t wqh;
sigset_t sigmask;
-   struct task_struct *tsk;
 };
 
-struct signalfd_lockctx {
-   struct task_struct *tsk;
-   unsigned long flags;
-};
-
-/*
- * Tries to acquire the sighand lock. We do not increment the sighand
- * use count, and we do not even pin the task struct, so we need to
- * do it inside an RCU read lock, and we must be prepared for the
- * ctx->tsk going to NULL (in signalfd_deliver()), and for the sighand
- * being detached. We return 0 if the sighand has been detached, or
- * 1 if we were able to pin the sighand lock.
- */
-static int signalfd_lock(struct signalfd_ctx *ctx, struct signalfd_lockctx *lk)
-{
-   struct sighand_struct *sighand = NULL;
-
-   rcu_read_lock();
-   lk->tsk = rcu_dereference(ctx->tsk);
-   if (likely(lk->tsk != NULL))
-   sighand = lock_task_sighand(lk->tsk, &lk->flags);
-   rcu_read_unlock();
-
-   if (!sighand)
-   return 0;
-
-   if (!ctx->tsk) {
-   unlock_task_sighand(lk->tsk, &lk->flags);
-   return 0;
-   }
-
-   if (lk->tsk->tgid == current->tgid)
-   lk->tsk = current;
-
-   return 1;
-}
-
-static void signalfd_unlock(struct signalfd_lockctx *lk)
-{
-   unlock_task_sighand(lk->tsk, &lk->flags);
-}
-
-/*
- * This must be called with the sighand lock held.
- */
-void signalfd_deliver(struct task_struct *tsk, int sig)
-{
-   struct sighand_struct *sighand = tsk->sighand;
-   struct signalfd_ctx *ctx, *tmp;
-
-   BUG_ON(!sig);
-   list_for_each_entry_safe(ctx, tmp, &sighand->signalfd_list, lnk) {
-   /*
-* We use a negative signal value as a way to broadcast that the
-* sighand has been orphaned, so that we can notify all the
-* listeners about this. Remember the ctx->sigmask is inverted,
-* so if the user is interested in a signal, that corresponding
-* bit will be zero.
-*/
-   if (sig < 0) {
-   if (ctx->tsk == tsk) {
-

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Davide Libenzi

On Thu, 20 Sep 2007, Eric Dumazet wrote:

> Does it means that with your patch each ACK on a ET managed socket will
> trigger an epoll event   ?
> 
> Maybe your very sensitive high throuput appication needs to set a flag or
> something at socket level to ask for such a behavior.
> 
> The default should stay as is. That is an event should be sent only if someone
> cared about the wakeup.

Unfortunately f_op->poll() does not let the caller to specify the events 
it's interested in, that would allow to split send/recevie wait queues and 
better detect read/write cases.
The detection of a waitqueue_active(->sk_wr_sleep) would work fine in 
detecting is someone is actually waiting for a write, w/out the false 
positives triggered by the read-waiters.
That would be a very sane thing to do, but would require a big&dumb change 
to all the ->poll around (that could be automated by a script - devices 
not caring about the events hint can just continue to use the single queue 
like they currently do), and a more critical and gradual change of all the 
devices that wants to take advantage of it.
That way, no more magic bits are needed, and a simple waitqueue_active() 
would tell you if someone is waiting for write-space events.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Davide Libenzi

On Wed, 19 Sep 2007, Nagendra Tomar wrote:

> The tcp_check_space() function calls tcp_new_space() only if the
> SOCK_NOSPACE bit is set in the socket flags. This is causing Edge Triggered
> EPOLLOUT events to be missed for TCP sockets, as the ep_poll_callback() 
> is not called from the wakeup routine.
> 
> The SOCK_NOSPACE bit indicates the user's intent to perform writes
> on that socket (set in tcp_sendmsg and tcp_poll). I believe the idea 
> behind the SOCK_NOSPACE check is to optimize away the tcp_new_space call
> in cases when user is not interested in writing to the socket. These two
> take care of all possible scenarios in which a user can convey his intent
> to write on that socket.
> 
> Case 1: tcp_sendmsg detects lack of sndbuf space
> Case 2: tcp_poll returns not writable
> 
> This is fine if we do not deal with epoll's Edge Triggered events (EPOLLET).
> With ET events we can have a scenario where the SOCK_NOSPACE bit is not set,
> as the user has neither done a sendmsg nor a poll/epoll call that returned
> with the POLLOUT condition not set. 

Looking back at it, I think the current TCP code is right, once you look 
at the "event" to be a output buffer full->with_space transition.
If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event 
(free space on the output buffer), if you do not consume it (say a 
tcp_sendmsg that re-fill the buffer), you can't see other OUT event 
anymore since they happen on the full->with_space transition.
Yes, I know, the read size (EPOLLIN) works differently and you get an 
event for every packet you receive. And yes, I do not like asymmetric 
things. But that does not make the EPOLLOUT|EPOLLET wrong IMO.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Davide Libenzi

On Wed, 19 Sep 2007, Nagendra Tomar wrote:

> Definitely not ! 
> 
> The point is that the "tcp write space available" 
> wakeup does not get called if SOCK_NOSPACE bit is not set. This was
> fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit 
> indicated that someone really cared abt the wakeup). Now after the
> introduction of callback'ed wakeups, we might have some work to
> do inside the callback even if there is nobody interested in the wakeup
> at that point of time. 
> 
> In this particular case the ep_poll_callback is not getting called and
> hence the socket fd is not getting added to the ready list.

I know, I saw the patch. I was just commenting the point where DaveM was 
heading to ;)
This things needs to be looked at a little bit more deeply.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Davide Libenzi

On Wed, 19 Sep 2007, David Miller wrote:

> From: Nagendra Tomar <[EMAIL PROTECTED]>
> Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
> 
> > With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will 
> > not return, even when the incoming acks free the buffers.
> > Note that this patch assumes that the SOCK_NOSPACE check in
> > tcp_check_space is a trivial optimization which can be safely removed.
> 
> I already replied to your patch posting explaining that whatever is
> not setting SOCK_NOSPACE should be fixed instead.
> 
> Please address that, thanks.

You're not planning of putting the notion of a SOCK_NOSPACE bit inside a 
completely device-unaware interface like epoll, I hope?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.23-rc6] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Davide Libenzi

On Wed, 19 Sep 2007, Nagendra Tomar wrote:

[ ... ]

Seems like epoll ate your message too :)
Can you repost your bug report and the patch?


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RFC: A revised timerfd API

2007-09-18 Thread Davide Libenzi

On Tue, 18 Sep 2007, Michael Kerrisk wrote:

> The four designs are:
> 
> a) A multiplexing timerfd() system call.
> b) Creating three syscalls analogous to the POSIX timers API (i.e.,
>timerfd_create/timerfd_settime/timerfd_gettime).
> c) Creating a simplified timerfd() system call that is integrated
>with the POSIX timers API.
> d) Extending the POSIX timers API to support the timerfd concept.

If you really want to shoot yourself in your foot, I'd pick bullet B.
Bullet A makes me sea-sick, and bullets C and D, well, let's leave POSIX 
APIs being *POSIX* APIs.
Once you remove all the "ifs" and "elses" that resulted from your previous 
bullet A multiplexing implementation, timerfd_gettime and timerfd_settime 
should result in being pretty slick.
I still think we could have survived w/out all this done inside the 
kernel though.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Revised timerfd() interface

2007-09-06 Thread Davide Libenzi

On Thu, 6 Sep 2007, Michael Kerrisk wrote:

> You are asserting this in the face of two previous APIs designed 
> by people who (at least in the case of POSIX timers) probably 
> thoroughly examined and discussed existing APIs and practice.

You really think that. Uhmm, ok.

> This function is *not at all* equivalent to the "get"
> functionality of the previous APIs.  The "get" functionality
> of POSIX timers (for example) returns a structure that contains
> the timer interval and the *time until the next expiration of
> the timer* (not the initial timer string, as your code above
> does).
> So come up with a reliable, race-free way of doing that in
> userspace.  Then make it work for both CLOCK_MONOTONIC and
> CLOCK_REALTIME timers.  (You'll certainly need to be making
> some additional system calls, by the way: at least some
> calls to clock_gettime().)

No, I don't think I will. Since 1) it's so trivial it's not even 
challenging 2) you know it can be done (I assume) 3) it solves a non-case 
made up to justify an API/kernel-code bloating.
But you don't need *my* signoff. Cook a working patch, make a case for it, 
and gather support and signoffs. I won't be acking it because, as I said 
many times, I think it's useless bloating.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Revised timerfd() interface

2007-09-05 Thread Davide Libenzi

On Thu, 6 Sep 2007, Michael Kerrisk wrote:

> Hi Davide,
> 
> > > > > > > As I think about this more, I see more problems with
> > > > > > > your argument.  timerfd needs the ability to get and 
> > > > > > > get-while-setting just as much as the earlier APIs.
> > > > > > > Consider a library that creates a timerfd file descriptor that
> > > > > > > is handed off to an application: that library may want
> > > > > > > to modify the timer settings without having to create a
> > > > > > > new file descriptor (the app mey not be able to be told about
> > > > > > > the new fd).  Your argument just doesn't hold, AFAICS.
> > > > > > 
> > > > > > Such hypotethical library, in case it really wanted to offer such 
> > > > > > functionality, could simply return an handle instead of the raw
> > > > > > fd, and take care of all that stuff in userspace.
> > > > > 
> > > > > Did I miss something?  Is it not the case that as soon as the
> > > > > library returns a handle, rather than an fd, then the whole
> > > > > advantage of timerfd() (being able to select/poll/epoll on 
> > > > > the timer as well as other fds) is lost?  
> > > > 
> > > > Why? The handle would simply be a little struct where the timerfd fd
> > > > is 
> > > > stored, and a XXX_getfd() would return it.
> > > > So my point is, I doubt such functionalities are really needed, and I 
> > > > also argue that the kernel is the best place for such wrapper code
> > > > to go.
> > > 
> > > So what happens if one thread (via the library) wants modify
> > > a timer's settings at the same timer as another thread is 
> > > select()ing on it?  The first thread can't do this by creating
> > > a new timerfd timer, since it wants to affect the select()
> > > in the other thread?
> > 
> > It can be done w/out any problems. The select thread will be notified 
> > whenever the new timer setting expires.
> 
> We are going in circles here.  I think you are missing my point.
> Consider the following
> 
> [[
> Thread A: calls library function which creates a timerfd file
> descriptor.
> 
> Thread B: calls select() on the timerfd file descriptor.
> 
> Thread A: calls library function which wants to:
>a) modify timer settings, and retrieve copy of current timer
>   settings, and later
>b) restore old timer settings.
> ]]
> 
> This seems a quite reasonable use-case to me, and the existing
> interface simply can't support it.

"Quite reasonable"? :)
I honestly doubt it, but anyway. Modulo error checking:

struct tfd {
int fd, clockid;
struct itimerspec ts;
};

struct tfd *tfd_create(int clockid, int flags, const struct itimerspec *ts) {
struct tfd *th;
th = malloc(sizeof(*th));
th->clockid = clockid;
th->ts = *ts;
th->fd = timerfd(-1, clockid, flags, ts);
return th;
}

void tfd_close(struct tfd *th) {
close(th->fd);
free(th);
}

int tfd_getfd(const struct tfd *th) {
return th->fd;
}

int tfd_gettime(const struct tfd *th, int *clockid, struct itimerspec *ts) {
*clockid = th->clockid;
*ts = th->ts;
return 0;
}

int tfd_settime(struct tfd *th, int clockid, int flags,
const struct itimerspec *ts) {
th->fd = timerfd(th->fd, clockid, flags, ts);
th->clockid = clockid;
th->ts = *ts;
return 0;
}

Wrap the get/set with a mutex in case you plan to shoot yourself in a foot 
by doing get/set from multiple threads ;)
So, once again:

- I sincerly doubt the above is common usage/design patters for timerfds

  * timerfds are not a common global resource, ala signals, that requires 
get+set+restore pattern - you can have many of them set to different 
times

- Those IMO *very* special use cases can be handled in userspace with few 
  lines of code, *if* really needed



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Revised timerfd() interface

2007-09-05 Thread Davide Libenzi

On Wed, 5 Sep 2007, Michael Kerrisk wrote:

> Hi Davide,
> 
> > > > > As I think about this more, I see more problems with
> > > > > your argument.  timerfd needs the ability to get and 
> > > > > get-while-setting just as much as the earlier APIs.
> > > > > Consider a library that creates a timerfd file descriptor that
> > > > > is handed off to an application: that library may want
> > > > > to modify the timer settings without having to create a
> > > > > new file descriptor (the app mey not be able to be told about
> > > > > the new fd).  Your argument just doesn't hold, AFAICS.
> > > > 
> > > > Such hypotethical library, in case it really wanted to offer such 
> > > > functionality, could simply return an handle instead of the raw fd,
> > > > and take care of all that stuff in userspace.
> > > 
> > > Did I miss something?  Is it not the case that as soon as the
> > > library returns a handle, rather than an fd, then the whole
> > > advantage of timerfd() (being able to select/poll/epoll on 
> > > the timer as well as other fds) is lost?  
> > 
> > Why? The handle would simply be a little struct where the timerfd fd is 
> > stored, and a XXX_getfd() would return it.
> > So my point is, I doubt such functionalities are really needed, and I 
> > also argue that the kernel is the best place for such wrapper code
> > to go.
> 
> So what happens if one thread (via the library) wants modify
> a timer's settings at the same timer as another thread is 
> select()ing on it?  The first thread can't do this by creating
> a new timerfd timer, since it wants to affect the select()
> in the other thread?

It can be done w/out any problems. The select thread will be notified 
whenever the new timer setting expires.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Revised timerfd() interface

2007-09-05 Thread Davide Libenzi

On Wed, 5 Sep 2007, Michael Kerrisk wrote:

> Davide,

A Michael!


> > > As I think about this more, I see more problems with
> > > your argument.  timerfd needs the ability to get and 
> > > get-while-setting just as much as the earlier APIs.
> > > Consider a library that creates a timerfd file descriptor that
> > > is handed off to an application: that library may want
> > > to modify the timer settings without having to create a
> > > new file descriptor (the app mey not be able to be told about
> > > the new fd).  Your argument just doesn't hold, AFAICS.
> > 
> > Such hypotethical library, in case it really wanted to offer such 
> > functionality, could simply return an handle instead of the raw fd, and 
> > take care of all that stuff in userspace.
> 
> Did I miss something?  Is it not the case that as soon as the
> library returns a handle, rather than an fd, then the whole
> advantage of timerfd() (being able to select/poll/epoll on 
> the timer as well as other fds) is lost?  

Why? The handle would simply be a little struct where the timerfd fd is 
stored, and a XXX_getfd() would return it.
So my point is, I doubt such functionalities are really needed, and I also 
argue that the kernel is the best place for such wrapper code to go.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Revised timerfd() interface

2007-09-04 Thread Davide Libenzi

On Tue, 4 Sep 2007, Michael Kerrisk wrote:

> > Useless like it'd be a motorcycle w/out a cup-holder :)
> > Seriously, the ability to get the previous values from "something" could 
> > have a meaning if this something is a shared global resource (like 
> > signals
> > for example). In the timerfd case this makes little sense, since you can 
> > create as many timerfd as you like and you do not need to share a single 
> > one by changing/restoring the original context.
> 
> Davide,
> 
> As I think about this more, I see more problems with
> your argument.  timerfd needs the ability to get and 
> get-while-setting just as much as the earlier APIs.
> Consider a library that creates a timerfd file descriptor that
> is handed off to an application: that library may want
> to modify the timer settings without having to create a
> new file descriptor (the app mey not be able to be told about
> the new fd).  Your argument just doesn't hold, AFAICS.

Such hypotethical library, in case it really wanted to offer such 
functionality, could simply return an handle instead of the raw fd, and 
take care of all that stuff in userspace.
Again, mimicking POSIX APIs doesn't always take you in the right place.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 7 8 >

1 - 100 of 765 matches

Mail list logo