Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Thu, 26 Mar 2015, David Rientjes wrote:

> Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a 
> hugetlb vma is long standing and there may be applications that break as a 
> result of changing the behavior: a database that reserves all allocated 
> hugetlb memory with mmap() so that it always has exclusive access to those 
> hugepages, whether they are faulted or not, and maintains its own hugepage 
> pool (which is common), may test the return value of munmap() and depend 
> on it returning -EINVAL to determine if it is freeing memory that was 
> either dynamically allocated or mapped from the hugetlb reserved pool.

You went a long way to create such a case.
But, in your case, that application will erroneously considering hugepage 
mmaped memory, as dynamically allocated, since it will always get EINVAL, 
unless it passes an aligned size. Aligned size, which a fix like the one 
posted in the patch will still leave as success.
OTOH, an application, which might be more common than the one you posted,
which calls munmap() to release a pointer which it validly got from a 
previous mmap(), will leak huge pages as all the issued munmaps will fail.


> If we were to go back in time and decide this when the munmap() behavior 
> for hugetlb vmas was originally introduced, that would be valid.  The 
> problem is that it could lead to userspace breakage and that's a 
> non-starter.
> 
> What we can do is improve the documentation and man-page to clearly 
> specify the long-standing behavior so that nobody encounters unexpected 
> results in the future.

This way you will leave the mmap API with broken semantics.
In any case, I am done arguing.
I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
the mmap man pages with the new funny behaviour.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Wed, 25 Mar 2015, David Rientjes wrote:

> I looked at this thread at http://marc.info/?t=14139250881 since I 
> didn't have it in my mailbox, and I didn't get a chance to actually run 
> your test code.
> 
> In short, I think what you're saying is that
> 
>   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>   munmap(ptr, 4KB) == EINVAL

I am not sure you have read the email correctly:

munmap(mmap(size, HUGETLB), size) = EFAIL

For every size not multiple of the huge page size.
Whereas:

munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK


> Respecting the mmap(2) POSIX specification?  I don't think 
> mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates 
> POSIX.1-2001 and not only because it obviously doesn't address 
> MAP_HUGETLB, but I don't think the spec says the system cannot map more 
> memory than len.
> 
> Using MAP_HUGETLB is really more a library function than anything else 
> since you could easily implement the same behavior in a library.  That 
> function includes aligning len to the hugepage size, so doing
> 
>   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
> 
> is the equivalent to doing
> 
>   ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
> 
> and that doesn't violate any spec.  But your patch doesn't change mmap() 
> at all, so let's forget about that.

That is what every mmap() implementation does, irrespectively of any page 
size. And that is also what the POSIX spec states.
The size will be automatically rounded up to a multiple of the underline 
physical page size.
The problem is not mmap() though, in this case.


> The question you pose is whether munmap(ptr, 4KB) should succeed for a 
> hugetlb vma and in your patch you align this to the hugepage size of the 
> vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
> for a non-hugetlb vma.
> 
> The munmap() spec says the whole pages that include any part of the passed 
> length should be unmapped.  In spirit, I would agree with you that the 
> page size for the vma is the hugepage size so that would be what would be 
> unmapped.
> 
> But that's going by a spec that doesn't address hugepages and is worded in 
> a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
> done.  It carries no notion of variable page sizes and how hugepages 
> should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
> this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
> but the munmap() behavior for hugetlb vmas is not restricted.
> 
> It would seem too dangerous at this point to change the behavior of 
> munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
> arise from aligning to the hugepage size.

You mean, there is an harder failure than the current failure? :)


> Some applications purposefully reserve hugetlb pages by mmap() and never 
> munmap() them so they have exclusive access to hugepages that were 
> allocated either at boot or runtime by the sysadmin.  If they depend on 
> the return value of munmap() to determine if memory to free is memory 
> dynamically allocated by the application or reserved as hugetlb memory, 
> then this would cause them to break.  I can't say for certain that no such 
> application exists.

The fact that certain applications will seldomly call an API, should be no 
reason for API to have bugs, or at the very least, a bahviour which not 
only in not documented in the man pages, but also totally unrespectful of 
the normal mmap/munmap semantics.
Again, the scenario that you are picturing, is one where an application 
relies on a permanent (that is what it is - it always fails unless the 
munmap size is multiple than huge page size) failure of munmap, to do some 
productive task.
An munmap() of huge page aligned size, will succeed in both case (vanilla, 
and patch).


> Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() 
> specification, and there's a potential userspace breakage if the length 
> becomes hugepage aligned, I think the do_unmap() implementation is correct 
> as it stands.

If the length is huge page aligned, it will be working with or without 
patch applied.
The problem is for the other 2097151 out of 2097152 cases, where length is 
not indeed aligned to 2MB (or whatever hugepage size is for the 
architecture).


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Wed, 25 Mar 2015, David Rientjes wrote:

 I looked at this thread at http://marc.info/?t=14139250881 since I 
 didn't have it in my mailbox, and I didn't get a chance to actually run 
 your test code.
 
 In short, I think what you're saying is that
 
   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
   munmap(ptr, 4KB) == EINVAL

I am not sure you have read the email correctly:

munmap(mmap(size, HUGETLB), size) = EFAIL

For every size not multiple of the huge page size.
Whereas:

munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK


 Respecting the mmap(2) POSIX specification?  I don't think 
 mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...) mapping 2MB violates 
 POSIX.1-2001 and not only because it obviously doesn't address 
 MAP_HUGETLB, but I don't think the spec says the system cannot map more 
 memory than len.
 
 Using MAP_HUGETLB is really more a library function than anything else 
 since you could easily implement the same behavior in a library.  That 
 function includes aligning len to the hugepage size, so doing
 
   ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
 
 is the equivalent to doing
 
   ptr = mmap(..., hugepage_size, ..., MAP_HUGETLB | ..., ...)
 
 and that doesn't violate any spec.  But your patch doesn't change mmap() 
 at all, so let's forget about that.

That is what every mmap() implementation does, irrespectively of any page 
size. And that is also what the POSIX spec states.
The size will be automatically rounded up to a multiple of the underline 
physical page size.
The problem is not mmap() though, in this case.


 The question you pose is whether munmap(ptr, 4KB) should succeed for a 
 hugetlb vma and in your patch you align this to the hugepage size of the 
 vma in the same manner that munmap(ptr, 2KB) would be aligned to PAGE_SIZE 
 for a non-hugetlb vma.
 
 The munmap() spec says the whole pages that include any part of the passed 
 length should be unmapped.  In spirit, I would agree with you that the 
 page size for the vma is the hugepage size so that would be what would be 
 unmapped.
 
 But that's going by a spec that doesn't address hugepages and is worded in 
 a way that {PAGE_SIZE} is the base unit that both mmap() and munmap() is 
 done.  It carries no notion of variable page sizes and how hugepages 
 should be handled with respect to pages of {PAGE_SIZE} length.  So I think 
 this is beyond the scope of the spec: any length is aligned to PAGE_SIZE, 
 but the munmap() behavior for hugetlb vmas is not restricted.
 
 It would seem too dangerous at this point to change the behavior of 
 munmap(ptr, 4KB) on a hugetlb vma and that userspace bugs could actually 
 arise from aligning to the hugepage size.

You mean, there is an harder failure than the current failure? :)


 Some applications purposefully reserve hugetlb pages by mmap() and never 
 munmap() them so they have exclusive access to hugepages that were 
 allocated either at boot or runtime by the sysadmin.  If they depend on 
 the return value of munmap() to determine if memory to free is memory 
 dynamically allocated by the application or reserved as hugetlb memory, 
 then this would cause them to break.  I can't say for certain that no such 
 application exists.

The fact that certain applications will seldomly call an API, should be no 
reason for API to have bugs, or at the very least, a bahviour which not 
only in not documented in the man pages, but also totally unrespectful of 
the normal mmap/munmap semantics.
Again, the scenario that you are picturing, is one where an application 
relies on a permanent (that is what it is - it always fails unless the 
munmap size is multiple than huge page size) failure of munmap, to do some 
productive task.
An munmap() of huge page aligned size, will succeed in both case (vanilla, 
and patch).


 Since hugetlb memory is beyond the scope of the POSIX.1-2001 munmap() 
 specification, and there's a potential userspace breakage if the length 
 becomes hugepage aligned, I think the do_unmap() implementation is correct 
 as it stands.

If the length is huge page aligned, it will be working with or without 
patch applied.
The problem is for the other 2097151 out of 2097152 cases, where length is 
not indeed aligned to 2MB (or whatever hugepage size is for the 
architecture).


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-26 Thread Davide Libenzi
On Thu, 26 Mar 2015, David Rientjes wrote:

 Yes, this munmap() behavior of lengths = hugepage_size - PAGE_SIZE for a 
 hugetlb vma is long standing and there may be applications that break as a 
 result of changing the behavior: a database that reserves all allocated 
 hugetlb memory with mmap() so that it always has exclusive access to those 
 hugepages, whether they are faulted or not, and maintains its own hugepage 
 pool (which is common), may test the return value of munmap() and depend 
 on it returning -EINVAL to determine if it is freeing memory that was 
 either dynamically allocated or mapped from the hugetlb reserved pool.

You went a long way to create such a case.
But, in your case, that application will erroneously considering hugepage 
mmaped memory, as dynamically allocated, since it will always get EINVAL, 
unless it passes an aligned size. Aligned size, which a fix like the one 
posted in the patch will still leave as success.
OTOH, an application, which might be more common than the one you posted,
which calls munmap() to release a pointer which it validly got from a 
previous mmap(), will leak huge pages as all the issued munmaps will fail.


 If we were to go back in time and decide this when the munmap() behavior 
 for hugetlb vmas was originally introduced, that would be valid.  The 
 problem is that it could lead to userspace breakage and that's a 
 non-starter.
 
 What we can do is improve the documentation and man-page to clearly 
 specify the long-standing behavior so that nobody encounters unexpected 
 results in the future.

This way you will leave the mmap API with broken semantics.
In any case, I am done arguing.
I will leave to Andrew to sort it out, and to Michael Kerrisk to update 
the mmap man pages with the new funny behaviour.



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-25 Thread Davide Libenzi
On Wed, 25 Mar 2015, Hugh Dickins wrote:

> When you say "tracking back to 3.2.x", I think you mean you've tried as
> far back as 3.2.x and found the same behaviour, but not tried further?
> 
> From the source, it looks like this is unchanged since MAP_HUGETLB was
> introduced in 2.6.32.  And is the same behaviour as you've been given
> with hugetlbfs since it arrived in 2.5.46.

Went back checking the application logs, an I had to patch the code (the 
application one - to align size on munmap()) on May 2014.
But it might be we started noticing it at that time, because before the 
allocation pattern was simply monotonic, so it could be it was always 
there.
The bug test app is ten lines of code, to verify that.


> The patch looks to me as if it will do what you want, and I agree
> that it's displeasing that you can mmap a size, and then fail to
> munmap that same size.
> 
> But I tend to think that's simply typical of the clunkiness we offer
> you with hugetlbfs and MAP_HUGETLB: that it would have been better to
> make a different choice all those years ago, but wrong to change the
> user interface now.
> 
> Perhaps others will disagree.  And if I'm wrong, and the behaviour
> got somehow changed in 3.2, then that's a different story and we
> should fix it back.

This is not an interface change, in the sense old clients will continue to 
work.
This is simply respecting the mmap(2) POSIX specification, for a feature 
which has been exposed via mmap(2).



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-25 Thread Davide Libenzi
On Wed, 25 Mar 2015, Hugh Dickins wrote:

 When you say tracking back to 3.2.x, I think you mean you've tried as
 far back as 3.2.x and found the same behaviour, but not tried further?
 
 From the source, it looks like this is unchanged since MAP_HUGETLB was
 introduced in 2.6.32.  And is the same behaviour as you've been given
 with hugetlbfs since it arrived in 2.5.46.

Went back checking the application logs, an I had to patch the code (the 
application one - to align size on munmap()) on May 2014.
But it might be we started noticing it at that time, because before the 
allocation pattern was simply monotonic, so it could be it was always 
there.
The bug test app is ten lines of code, to verify that.


 The patch looks to me as if it will do what you want, and I agree
 that it's displeasing that you can mmap a size, and then fail to
 munmap that same size.
 
 But I tend to think that's simply typical of the clunkiness we offer
 you with hugetlbfs and MAP_HUGETLB: that it would have been better to
 make a different choice all those years ago, but wrong to change the
 user interface now.
 
 Perhaps others will disagree.  And if I'm wrong, and the behaviour
 got somehow changed in 3.2, then that's a different story and we
 should fix it back.

This is not an interface change, in the sense old clients will continue to 
work.
This is simply respecting the mmap(2) POSIX specification, for a feature 
which has been exposed via mmap(2).



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-22 Thread Davide Libenzi
[Resending with proper CC list suggested by Andrew]

Calling munmap on a MAP_HUGETLB area, and a size which is not 2MB aligned, 
causes munmap to fail.  Tested on 3.13.x but tracking back to 3.2.x.
In do_munmap() we forcibly want a 4KB default page, and we wrongly 
calculate the end of the map.  Since the calculated end is within the end 
address of the target vma, we try to do a split with an address right in 
the middle of a huge page, which would fail with EINVAL.

Tentative (untested) patch and test case attached (be sure you have a few 
huge pages available via /proc/sys/vm/nr_hugepages tinkering).


Signed-Off-By: Davide Libenzi 


- Davide


diff --git a/mm/mmap.c b/mm/mmap.c
index 7f85520..6dba257 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;
 
-   len = PAGE_ALIGN(len);
-   if (len == 0)
-   return -EINVAL;
-
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
prev = vma->vm_prev;
/* we have  start < vma->vm_end  */
 
+   if (likely(!is_vm_hugetlb_page(vma)))
+   len = PAGE_ALIGN(len);
+   else {
+   unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+   len = ALIGN(len, hpage_size);
+   }
+   if (unlikely(len == 0))
+   return -EINVAL;
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (vma->vm_start >= end)




[hugebug.c]

#include 
#include 
#include 
#include 
#include 

static void test(int flags, size_t size)
{
void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (addr == MAP_FAILED)
{
perror("mmap");
exit(1);
}
*(char*) addr = 17;

if (munmap(addr, size) != 0)
{
perror("munmap");
exit(1);
}
}

int main(int ac, const char** av)
{
static const size_t hugepage_size = 2 * 1024 * 1024;

printf("Testing normal pages with 2MB size ...\n");
test(0, hugepage_size);
printf("OK\n");

printf("Testing huge pages with 2MB size ...\n");
test(MAP_HUGETLB, hugepage_size);
printf("OK\n");


printf("Testing normal pages with 4KB byte size ...\n");
test(0, 4096);
printf("OK\n");

printf("Testing huge pages with 4KB byte size ...\n");
test(MAP_HUGETLB, 4096);
printf("OK\n");


printf("Testing normal pages with 1 byte size ...\n");
test(0, 1);
printf("OK\n");

printf("Testing huge pages with 1 byte size ...\n");
test(MAP_HUGETLB, 1);
printf("OK\n");

return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-22 Thread Davide Libenzi
[Resending with proper CC list suggested by Andrew]

Calling munmap on a MAP_HUGETLB area, and a size which is not 2MB aligned, 
causes munmap to fail.  Tested on 3.13.x but tracking back to 3.2.x.
In do_munmap() we forcibly want a 4KB default page, and we wrongly 
calculate the end of the map.  Since the calculated end is within the end 
address of the target vma, we try to do a split with an address right in 
the middle of a huge page, which would fail with EINVAL.

Tentative (untested) patch and test case attached (be sure you have a few 
huge pages available via /proc/sys/vm/nr_hugepages tinkering).


Signed-Off-By: Davide Libenzi davi...@xmailserver.org


- Davide


diff --git a/mm/mmap.c b/mm/mmap.c
index 7f85520..6dba257 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
if ((start  ~PAGE_MASK) || start  TASK_SIZE || len  TASK_SIZE-start)
return -EINVAL;
 
-   len = PAGE_ALIGN(len);
-   if (len == 0)
-   return -EINVAL;
-
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
prev = vma-vm_prev;
/* we have  start  vma-vm_end  */
 
+   if (likely(!is_vm_hugetlb_page(vma)))
+   len = PAGE_ALIGN(len);
+   else {
+   unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+   len = ALIGN(len, hpage_size);
+   }
+   if (unlikely(len == 0))
+   return -EINVAL;
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (vma-vm_start = end)




[hugebug.c]

#include sys/mman.h
#include stdio.h
#include stdlib.h
#include string.h
#include errno.h

static void test(int flags, size_t size)
{
void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (addr == MAP_FAILED)
{
perror(mmap);
exit(1);
}
*(char*) addr = 17;

if (munmap(addr, size) != 0)
{
perror(munmap);
exit(1);
}
}

int main(int ac, const char** av)
{
static const size_t hugepage_size = 2 * 1024 * 1024;

printf(Testing normal pages with 2MB size ...\n);
test(0, hugepage_size);
printf(OK\n);

printf(Testing huge pages with 2MB size ...\n);
test(MAP_HUGETLB, hugepage_size);
printf(OK\n);


printf(Testing normal pages with 4KB byte size ...\n);
test(0, 4096);
printf(OK\n);

printf(Testing huge pages with 4KB byte size ...\n);
test(MAP_HUGETLB, 4096);
printf(OK\n);


printf(Testing normal pages with 1 byte size ...\n);
test(0, 1);
printf(OK\n);

printf(Testing huge pages with 1 byte size ...\n);
test(MAP_HUGETLB, 1);
printf(OK\n);

return 0;
}

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-21 Thread Davide Libenzi
On Tue, 21 Oct 2014, Davide Libenzi wrote:

> Calling mmap with MAP_HUGETLB and a size which is not 2MB aligned, causes 
> mmap to fail. Tested on 3.13.x but tracking back to 3.2.x.

A couple of "un" were found under my desk, as it's clearly munmap which is 
failing.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-21 Thread Davide Libenzi
Calling mmap with MAP_HUGETLB and a size which is not 2MB aligned, causes 
mmap to fail. Tested on 3.13.x but tracking back to 3.2.x.
In do_munmap() we forcibly want a 4KB default page, and we wrongly calculate
the end of the map, by doing a split right in the middle of a huge page, 
which will result in EINVAL.
Tentative (untested) patch and test case attached (be sure you have a few 
huge pages available via /proc/sys/vm/nr_hugepages tinkering).


Signed-Off-By: Davide Libenzi 


- Davide


diff --git a/mm/mmap.c b/mm/mmap.c
index 7f85520..6dba257 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;
 
-   len = PAGE_ALIGN(len);
-   if (len == 0)
-   return -EINVAL;
-
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
prev = vma->vm_prev;
/* we have  start < vma->vm_end  */
 
+   if (likely(!is_vm_hugetlb_page(vma)))
+   len = PAGE_ALIGN(len);
+   else {
+   unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+   len = ALIGN(len, hpage_size);
+   }
+   if (unlikely(len == 0))
+   return -EINVAL;
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (vma->vm_start >= end)





[hugebug.c]

#include 
#include 
#include 
#include 
#include 

static void test(int flags, size_t size)
{
void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (addr == MAP_FAILED)
{
perror("mmap");
exit(1);
}
*(char*) addr = 17;

if (munmap(addr, size) != 0)
{
perror("munmap");
exit(1);
}
}

int main(int ac, const char** av)
{
static const size_t hugepage_size = 2 * 1024 * 1024;

printf("Testing normal pages with 2MB size ...\n");
test(0, hugepage_size);
printf("OK\n");

printf("Testing huge pages with 2MB size ...\n");
test(MAP_HUGETLB, hugepage_size);
printf("OK\n");


printf("Testing normal pages with 4KB byte size ...\n");
test(0, 4096);
printf("OK\n");

printf("Testing huge pages with 4KB byte size ...\n");
test(MAP_HUGETLB, 4096);
printf("OK\n");


printf("Testing normal pages with 1 byte size ...\n");
test(0, 1);
printf("OK\n");

printf("Testing huge pages with 1 byte size ...\n");
test(MAP_HUGETLB, 1);
printf("OK\n");

return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-21 Thread Davide Libenzi
Calling mmap with MAP_HUGETLB and a size which is not 2MB aligned, causes 
mmap to fail. Tested on 3.13.x but tracking back to 3.2.x.
In do_munmap() we forcibly want a 4KB default page, and we wrongly calculate
the end of the map, by doing a split right in the middle of a huge page, 
which will result in EINVAL.
Tentative (untested) patch and test case attached (be sure you have a few 
huge pages available via /proc/sys/vm/nr_hugepages tinkering).


Signed-Off-By: Davide Libenzi davi...@xmailserver.org


- Davide


diff --git a/mm/mmap.c b/mm/mmap.c
index 7f85520..6dba257 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2528,10 +2528,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
if ((start  ~PAGE_MASK) || start  TASK_SIZE || len  TASK_SIZE-start)
return -EINVAL;
 
-   len = PAGE_ALIGN(len);
-   if (len == 0)
-   return -EINVAL;
-
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2539,6 +2535,16 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
prev = vma-vm_prev;
/* we have  start  vma-vm_end  */
 
+   if (likely(!is_vm_hugetlb_page(vma)))
+   len = PAGE_ALIGN(len);
+   else {
+   unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+   len = ALIGN(len, hpage_size);
+   }
+   if (unlikely(len == 0))
+   return -EINVAL;
+
/* if it doesn't overlap, we have nothing.. */
end = start + len;
if (vma-vm_start = end)





[hugebug.c]

#include sys/mman.h
#include stdio.h
#include stdlib.h
#include string.h
#include errno.h

static void test(int flags, size_t size)
{
void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
  flags | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

if (addr == MAP_FAILED)
{
perror(mmap);
exit(1);
}
*(char*) addr = 17;

if (munmap(addr, size) != 0)
{
perror(munmap);
exit(1);
}
}

int main(int ac, const char** av)
{
static const size_t hugepage_size = 2 * 1024 * 1024;

printf(Testing normal pages with 2MB size ...\n);
test(0, hugepage_size);
printf(OK\n);

printf(Testing huge pages with 2MB size ...\n);
test(MAP_HUGETLB, hugepage_size);
printf(OK\n);


printf(Testing normal pages with 4KB byte size ...\n);
test(0, 4096);
printf(OK\n);

printf(Testing huge pages with 4KB byte size ...\n);
test(MAP_HUGETLB, 4096);
printf(OK\n);


printf(Testing normal pages with 1 byte size ...\n);
test(0, 1);
printf(OK\n);

printf(Testing huge pages with 1 byte size ...\n);
test(MAP_HUGETLB, 1);
printf(OK\n);

return 0;
}

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] MAP_HUGETLB munmap fails with size not 2MB aligned

2014-10-21 Thread Davide Libenzi
On Tue, 21 Oct 2014, Davide Libenzi wrote:

 Calling mmap with MAP_HUGETLB and a size which is not 2MB aligned, causes 
 mmap to fail. Tested on 3.13.x but tracking back to 3.2.x.

A couple of un were found under my desk, as it's clearly munmap which is 
failing.


- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-02-26 Thread Davide Libenzi
On Tue, 26 Feb 2008, Michael Kerrisk wrote:

> Okay -- I'll look at it some more.  I am however loathe to drop the
> term open file description, because POSIX uses, as well as a number of
> other Linux man pages by now.

Heh, POSIX. Now doesn't take a genius to see that "file description" and 
"file descriptor" looks amazingly similar, does it? :)


> > That'd mean placing an eventpoll custom hook into sys_close(). Looks 
> > very bad to me, and probably will look even worse to other kernel 
> > folks. Is not much a performance issue (a check to see if a file* is 
> > an eventpoll file is as easy as comparing the f_op pointer), but a 
> > design/style issue.
>
> Oh -- I wasn't suggesting we could make the change now -- it would
> break the ABI and all that.  I was just wondering why the decision
> wasn't made to do it the other way to begin with.  The existing
> semantics are somewhat couterintuitive, and potentially interact
> libraries that do private manipulations with file descriptors.

For the same reason that a custom hook in sys_close wouldn't have passed 
the radar ;)
As far as problems with libraries doing tricks with fds, that's an issue 
that goes beyond epoll.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-02-26 Thread Davide Libenzi
On Tue, 26 Feb 2008, Michael Kerrisk wrote:

> Following up after quite some time:
> 
> Davide Libenzi wrote:
> > On Sat, 26 Jan 2008, Michael Kerrisk wrote:
> > 
> >> On Jan 25, 2008 12:57 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> >>> On Thu, 24 Jan 2008, Pierre Habouzit wrote:
> >>>
> >>>> On Fri, Jan 18, 2008 at 09:10:18PM +, Davide Libenzi wrote:
> >>>>> On Fri, 18 Jan 2008, Pierre Habouzit wrote:
> >>>>>
> >>>>>>   Hi,
> >>>>>>
> >>>>>>   I just came across a strange behavior of epoll that seems to
> >>>>>> contradict the documentation. Here is what happens:
> >>>>>>
> >>>>>> * I have two processes P1 and P2, P1 accept()s connections, and send 
> >>>>>> the
> >>>>>>   resulting file descriptors to P2 through a unix socket.
> >>>>>>
> >>>>>> * P2 registers the received socket in his epollfd.
> >>>>>>
> >>>>>>   [time passes]
> >>>>>>
> >>>>>> * P2 is done with the socket and closes it
> >>>>>>
> >>>>>> * P2 gets events for the socket again !
> >>>>>>
> >>>>>>
> >>>>>>   Though the documentation says that if a process closes a file
> >>>>>> descriptor, it gets unregistered. And yes I'm sure that P2 doens't 
> >>>>>> dup()
> >>>>>> the file descriptor. Though (because of a bug) it was still open in
> >>>>>> P1[0], hence the referenced socket still live at the kernel level.
> >>>>>>
> >>>>>>   Of course the userland workaround is to force the EPOLL_CTL_DEL 
> >>>>>> before
> >>>>>> the close, which I now do, but costs me a syscall where I wanted to
> >>>>>> spare one :|
> >>>>> For epoll, a close is when the kernel file* is released (that is, when 
> >>>>> all
> >>>>> its instances are gone).
> >>>>> We could put a special handling in filp_close(), but I don't think is a
> >>>>> good idea, and we're better live with the current behaviour.
> >>>>   Okay, maybe updating the linux manpages to be more clear about that is
> >>>> the way to go then. Thanks
> >>> Sure. I'll send Michael Kerrisk and updated statement for the A6 answer in
> >>> the epoll man page.
> >> Thanks Davide -- yes please send me a patch.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to [EMAIL PROTECTED]
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at  http://www.tux.org/lkml/
> >>
> > 
> > Something like the one below ...
> > 
> > 
> > - Davide
> > 
> > 
> > 
> > --- epoll.4 2008-01-26 12:58:21.0 -0800
> > +++ epoll.4.new 2008-01-26 13:06:36.0 -0800
> > @@ -285,7 +285,19 @@
> >  sets automatically?
> >  .TP
> >  .B A6
> > -Yes.
> > +A file descriptor is the userspace counterpart of an internal kernel 
> > handle.
> > +Every time a process calls functions liks
> > +.BR dup (2),
> > +.BR dup2 (2)
> > +or
> > +.BR fork (2),
> > +a new file descriptor referring to the same internal kernel handle is
> > +created. The internal kernel handle remains alive until all the userspace
> > +file descriptors have been closed.
> > +The
> > +.BR epoll (4)
> > +interface automatically removes the internal kernel handle from the set,
> > +once all the file descriptor instances have been closed.
> >  .TP
> >  .B Q7
> >  If more than one event occurs between
> 
> Davide,
> 
> Two points.
> 
> a) I did a
> 
> s/internal kernel handle/open file description/
> 
> since that is the POSIX term for the internal handle.
> 
> b) It seems to me that you text doesn't quite make the point explicit
> enough.  I've tried to rewrite it; could you please check:
> 
>A6 Yes, but be aware of the following point.  A  file
>   descriptor is a reference to an open file descrip-
>   tion (see  open(2)).   Whenever  a  descriptor  is
>   duplicated  via dup(2), dup2(2), fcntl(2) F_DUPFD,
>   or fork(2), a new file des

Re: epoll design problems with common fork/exec patterns

2008-02-26 Thread Davide Libenzi
On Tue, 26 Feb 2008, Michael Kerrisk wrote:

> Davide Libenzi wrote:
> > On Sun, 28 Oct 2007, David Schwartz wrote:
> > 
> >> Eric Dumazet wrote:
> >>
> >>> Events are not necessarly reported "by descriptors". epoll uses an opaque
> >>> field provided by the user.
> >>>
> >>> It's up to the user to properly chose a tag that will makes sense
> >>> if the user
> >>> app is playing dup()/close() games for example.
> >> Great. So the only issue then is that the documentation is confusing. It
> >> frequently uses the term "fd" where it means file. For example, it says:
> >>
> >>   Q1 What  happens  if  you  add  the  same fd to an
> >> epoll_set
> >>  twice?
> >>
> >>   A1 You will probably get EEXIST.  However,  it  is
> >> possible
> >>  that  two  threads  may  add the same fd twice. This 
> >> is
> >> a
> >>  harmless condition.
> >>
> >> This gives no reason to think there's anything wrong with adding the same
> >> file twice so long as you do so through different descriptors. (One can
> >> imagine an application that does this to segregate read and write 
> >> operations
> >> to avoid a race where the descriptor is closed from under a writer due to
> >> handling a fatal read error.) Obviously, that won't work.
> > 
> > I agree, that is confusing. However, you can safely add two different file 
> > descriptors pointing to the same file*, with different event masks, and 
> > that will work as expected.
> 
> So can I summarize what I understand:
> 
> a) Adding the same file descriptor twice to an epoll set will cause an
> error (EEXIST).

Yes.



> b) In a separate message to linux-man, Chris Heath says that two threads
> *can't* add the same fd twice to an epoll set, despite what the existing
> man page text says.  I haven't tested that, but it sounds to me as though
> it is likely to be true.  Can you comment please Davide?

Yes, you can't add the same fd twice. Think about a DB where "file*,fd" is 
the key.



> c) It is possible to add duplicated file descriptors referring to the same
> underlying open file description ("file *").  As you note, this can be a
> useful filtering technique, if the two file descriptors specify different
> masks.
> 
> Assuming that is all correct, for man-pages-2.79, I've reworked the text
> for Q1/A1 as follows:
> 
>Q1 What  happens  if you add the same file descriptor
>   to an epoll set twice?
> 
>A1 You will probably get EEXIST.  However, it is pos-
>   sible   to   add  a  duplicate  (dup(2),  dup2(2),
>   fcntl(2) F_DUPFD, fork(2)) descriptor to the  same
>   epoll  set.   This  can  be a useful technique for
>   filtering events, if the duplicate  file  descrip-
>   tors are registered with different events masks.
> 
> Seem okay Davide?

Looks sane to me.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-02-26 Thread Davide Libenzi
On Tue, 26 Feb 2008, Michael Kerrisk wrote:

 Following up after quite some time:
 
 Davide Libenzi wrote:
  On Sat, 26 Jan 2008, Michael Kerrisk wrote:
  
  On Jan 25, 2008 12:57 AM, Davide Libenzi [EMAIL PROTECTED] wrote:
  On Thu, 24 Jan 2008, Pierre Habouzit wrote:
 
  On Fri, Jan 18, 2008 at 09:10:18PM +, Davide Libenzi wrote:
  On Fri, 18 Jan 2008, Pierre Habouzit wrote:
 
Hi,
 
I just came across a strange behavior of epoll that seems to
  contradict the documentation. Here is what happens:
 
  * I have two processes P1 and P2, P1 accept()s connections, and send 
  the
resulting file descriptors to P2 through a unix socket.
 
  * P2 registers the received socket in his epollfd.
 
[time passes]
 
  * P2 is done with the socket and closes it
 
  * P2 gets events for the socket again !
 
 
Though the documentation says that if a process closes a file
  descriptor, it gets unregistered. And yes I'm sure that P2 doens't 
  dup()
  the file descriptor. Though (because of a bug) it was still open in
  P1[0], hence the referenced socket still live at the kernel level.
 
Of course the userland workaround is to force the EPOLL_CTL_DEL 
  before
  the close, which I now do, but costs me a syscall where I wanted to
  spare one :|
  For epoll, a close is when the kernel file* is released (that is, when 
  all
  its instances are gone).
  We could put a special handling in filp_close(), but I don't think is a
  good idea, and we're better live with the current behaviour.
Okay, maybe updating the linux manpages to be more clear about that is
  the way to go then. Thanks
  Sure. I'll send Michael Kerrisk and updated statement for the A6 answer in
  the epoll man page.
  Thanks Davide -- yes please send me a patch.
  --
  To unsubscribe from this list: send the line unsubscribe linux-kernel in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  Please read the FAQ at  http://www.tux.org/lkml/
 
  
  Something like the one below ...
  
  
  - Davide
  
  
  
  --- epoll.4 2008-01-26 12:58:21.0 -0800
  +++ epoll.4.new 2008-01-26 13:06:36.0 -0800
  @@ -285,7 +285,19 @@
   sets automatically?
   .TP
   .B A6
  -Yes.
  +A file descriptor is the userspace counterpart of an internal kernel 
  handle.
  +Every time a process calls functions liks
  +.BR dup (2),
  +.BR dup2 (2)
  +or
  +.BR fork (2),
  +a new file descriptor referring to the same internal kernel handle is
  +created. The internal kernel handle remains alive until all the userspace
  +file descriptors have been closed.
  +The
  +.BR epoll (4)
  +interface automatically removes the internal kernel handle from the set,
  +once all the file descriptor instances have been closed.
   .TP
   .B Q7
   If more than one event occurs between
 
 Davide,
 
 Two points.
 
 a) I did a
 
 s/internal kernel handle/open file description/
 
 since that is the POSIX term for the internal handle.
 
 b) It seems to me that you text doesn't quite make the point explicit
 enough.  I've tried to rewrite it; could you please check:
 
A6 Yes, but be aware of the following point.  A  file
   descriptor is a reference to an open file descrip-
   tion (see  open(2)).   Whenever  a  descriptor  is
   duplicated  via dup(2), dup2(2), fcntl(2) F_DUPFD,
   or fork(2), a new file descriptor referring to the
   same  open  file  description is created.  An open
   file description continues to exist until all file
   descriptors referring to it have been closed.  The
   epoll  interface  automatically  removes  a   file
   descriptor  from  an  epoll set only after all the
   file descriptors referring to the underlying  open
   file  handle  have  been  closed.  This means that
   even after a file descriptor that is  part  of  an
   epoll  set has been closed, events may be reported
   for that file descriptor if other file descriptors
   referring  to the same underlying file description
   remain open.
 
 Does that seem okay?  I plan to include the text in man-pages-2.79.

I agree with Bodo, it is kinda confusing. The name open file description,
even though POSIX, looks very similar to file descriptor.
I honestly don't know how more easily such concept could be expressed. 
IMHO at least internal kernel handle does not play look-alike games with 
file descriptor.



 Was there some reason why removing a file descriptor couldn't have been
 made to do the expected thing (i.e., remove notifications for that file
 descriptor, regardless of whether the underlying file description remains
 open)?

That'd mean placing an eventpoll custom hook into sys_close(). Looks very 
bad to me, and probably will look even worse to other kernel folks.
Is not much a performance issue (a check

Re: epoll design problems with common fork/exec patterns

2008-02-26 Thread Davide Libenzi
On Tue, 26 Feb 2008, Michael Kerrisk wrote:

 Davide Libenzi wrote:
  On Sun, 28 Oct 2007, David Schwartz wrote:
  
  Eric Dumazet wrote:
 
  Events are not necessarly reported by descriptors. epoll uses an opaque
  field provided by the user.
 
  It's up to the user to properly chose a tag that will makes sense
  if the user
  app is playing dup()/close() games for example.
  Great. So the only issue then is that the documentation is confusing. It
  frequently uses the term fd where it means file. For example, it says:
 
Q1 What  happens  if  you  add  the  same fd to an
  epoll_set
   twice?
 
A1 You will probably get EEXIST.  However,  it  is
  possible
   that  two  threads  may  add the same fd twice. This 
  is
  a
   harmless condition.
 
  This gives no reason to think there's anything wrong with adding the same
  file twice so long as you do so through different descriptors. (One can
  imagine an application that does this to segregate read and write 
  operations
  to avoid a race where the descriptor is closed from under a writer due to
  handling a fatal read error.) Obviously, that won't work.
  
  I agree, that is confusing. However, you can safely add two different file 
  descriptors pointing to the same file*, with different event masks, and 
  that will work as expected.
 
 So can I summarize what I understand:
 
 a) Adding the same file descriptor twice to an epoll set will cause an
 error (EEXIST).

Yes.



 b) In a separate message to linux-man, Chris Heath says that two threads
 *can't* add the same fd twice to an epoll set, despite what the existing
 man page text says.  I haven't tested that, but it sounds to me as though
 it is likely to be true.  Can you comment please Davide?

Yes, you can't add the same fd twice. Think about a DB where file*,fd is 
the key.



 c) It is possible to add duplicated file descriptors referring to the same
 underlying open file description (file *).  As you note, this can be a
 useful filtering technique, if the two file descriptors specify different
 masks.
 
 Assuming that is all correct, for man-pages-2.79, I've reworked the text
 for Q1/A1 as follows:
 
Q1 What  happens  if you add the same file descriptor
   to an epoll set twice?
 
A1 You will probably get EEXIST.  However, it is pos-
   sible   to   add  a  duplicate  (dup(2),  dup2(2),
   fcntl(2) F_DUPFD, fork(2)) descriptor to the  same
   epoll  set.   This  can  be a useful technique for
   filtering events, if the duplicate  file  descrip-
   tors are registered with different events masks.
 
 Seem okay Davide?

Looks sane to me.



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-02-26 Thread Davide Libenzi
On Tue, 26 Feb 2008, Michael Kerrisk wrote:

 Okay -- I'll look at it some more.  I am however loathe to drop the
 term open file description, because POSIX uses, as well as a number of
 other Linux man pages by now.

Heh, POSIX. Now doesn't take a genius to see that file description and 
file descriptor looks amazingly similar, does it? :)


  That'd mean placing an eventpoll custom hook into sys_close(). Looks 
  very bad to me, and probably will look even worse to other kernel 
  folks. Is not much a performance issue (a check to see if a file* is 
  an eventpoll file is as easy as comparing the f_op pointer), but a 
  design/style issue.

 Oh -- I wasn't suggesting we could make the change now -- it would
 break the ABI and all that.  I was just wondering why the decision
 wasn't made to do it the other way to begin with.  The existing
 semantics are somewhat couterintuitive, and potentially interact
 libraries that do private manipulations with file descriptors.

For the same reason that a custom hook in sys_close wouldn't have passed 
the radar ;)
As far as problems with libraries doing tricks with fds, that's an issue 
that goes beyond epoll.



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by simplifying the 
code to get rid of a few lines that became superfluous after the previous 
epoll changes.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-11 15:31:02.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-11 15:32:46.0 -0800
@@ -257,25 +257,6 @@
(p1->file < p2->file ? -1 : p1->fd - p2->fd));
 }
 
-/* Special initialization for the RB tree node to detect linkage */
-static inline void ep_rb_initnode(struct rb_node *n)
-{
-   rb_set_parent(n, n);
-}
-
-/* Removes a node from the RB tree and marks it for a fast is-linked check */
-static inline void ep_rb_erase(struct rb_node *n, struct rb_root *r)
-{
-   rb_erase(n, r);
-   rb_set_parent(n, n);
-}
-
-/* Fast check to verify that the item is linked to the main RB tree */
-static inline int ep_rb_linked(struct rb_node *n)
-{
-   return rb_parent(n) != n;
-}
-
 /* Tells us if the item is currently linked */
 static inline int ep_is_linked(struct list_head *p)
 {
@@ -283,13 +264,13 @@
 }
 
 /* Get the "struct epitem" from a wait queue pointer */
-static inline struct epitem * ep_item_from_wait(wait_queue_t *p)
+static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
return container_of(p, struct eppoll_entry, wait)->base;
 }
 
 /* Get the "struct epitem" from an epoll queue wrapper */
-static inline struct epitem * ep_item_from_epqueue(poll_table *p)
+static inline struct epitem *ep_item_from_epqueue(poll_table *p)
 {
return container_of(p, struct ep_pqueue, pt)->epi;
 }
@@ -411,8 +392,7 @@
list_del_init(>fllink);
spin_unlock(>f_ep_lock);
 
-   if (ep_rb_linked(>rbn))
-   ep_rb_erase(>rbn, >rbr);
+   rb_erase(>rbn, >rbr);
 
spin_lock_irqsave(>lock, flags);
if (ep_is_linked(>rdllink))
@@ -728,7 +708,6 @@
goto error_return;
 
/* Item initialization follow here ... */
-   ep_rb_initnode(>rbn);
INIT_LIST_HEAD(>rdllink);
INIT_LIST_HEAD(>fllink);
INIT_LIST_HEAD(>pwqlist);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi
On Mon, 11 Feb 2008, Andrew Morton wrote:

> On Sun, 10 Feb 2008 13:32:01 -0800 (PST)
> Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
> > rb_set_parent() accesses node's pointer in its code. This creates a 
> > warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
> > memory access. The warning is harmless since the following rb-tree node 
> > insert is going to overwrite the node data. In any case I think it's 
> > better to not have that happening at all, and fix it by properly 
> > initializing the data.
> > 
> > 
> > Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>
> > 
> > 
> > - Davide
> > 
> > 
> > ---
> >  fs/eventpoll.c |2 +-
> >  include/linux/rbtree.h |   12 
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6.mod/fs/eventpoll.c
> > ===
> > --- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 
> > -0800
> > +++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
> > @@ -260,7 +260,7 @@
> >  /* Special initialization for the RB tree node to detect linkage */
> >  static inline void ep_rb_initnode(struct rb_node *n)
> >  {
> > -   rb_set_parent(n, n);
> > +   rb_init_node(n, n);
> >  }
> >  
> >  /* Removes a node from the RB tree and marks it for a fast is-linked check 
> > */
> > Index: linux-2.6.mod/include/linux/rbtree.h
> > ===
> > --- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 
> > 12:36:13.0 -0800
> > +++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
> > -0800
> > @@ -112,6 +112,18 @@
> > struct rb_node *rb_node;
> >  };
> >  
> > +/**
> > + * rb_init_node - Initializes the node internal data
> > + *
> > + * @node: Pointer to the RB-Tree node
> > + * @parent: Pointer to the parent node, or NULL
> > + *
> > + */
> > +static inline void rb_init_node(struct rb_node *node, struct rb_node 
> > *parent)
> > +{
> > +   node->rb_parent_color = (unsigned long) parent;
> > +   node->rb_left = node->rb_right = NULL;
> > +}
> 
> Is epoll the only rbtree-using code which exhibits this problem?  If so,
> what is epoll doing differently from all the others?

Dunno, but I don't think so. Epoll initializes the node to a state so that 
later on can check if the file-fd item is inserted or not. And it uses the 
"parent" information for that. But rb_set_parent() (that is used in the 
current initialization code) uses the data in the node, and this triggers 
the uninitialized memory access.
Taking a better look at it, the code (after the latest changes) has no 
more point-of-failures after the rb-tree insert, so I can probably avoid 
doing the node's parent-initialization and check altogether.
Yeah, let's that ;)

 eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Andrew, drop that one. I'm gonna send the new one after some test...



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by simplifying the 
code to get rid of a few lines that became superfluous after the previous 
epoll changes.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-11 15:31:02.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-11 15:32:46.0 -0800
@@ -257,25 +257,6 @@
(p1-file  p2-file ? -1 : p1-fd - p2-fd));
 }
 
-/* Special initialization for the RB tree node to detect linkage */
-static inline void ep_rb_initnode(struct rb_node *n)
-{
-   rb_set_parent(n, n);
-}
-
-/* Removes a node from the RB tree and marks it for a fast is-linked check */
-static inline void ep_rb_erase(struct rb_node *n, struct rb_root *r)
-{
-   rb_erase(n, r);
-   rb_set_parent(n, n);
-}
-
-/* Fast check to verify that the item is linked to the main RB tree */
-static inline int ep_rb_linked(struct rb_node *n)
-{
-   return rb_parent(n) != n;
-}
-
 /* Tells us if the item is currently linked */
 static inline int ep_is_linked(struct list_head *p)
 {
@@ -283,13 +264,13 @@
 }
 
 /* Get the struct epitem from a wait queue pointer */
-static inline struct epitem * ep_item_from_wait(wait_queue_t *p)
+static inline struct epitem *ep_item_from_wait(wait_queue_t *p)
 {
return container_of(p, struct eppoll_entry, wait)-base;
 }
 
 /* Get the struct epitem from an epoll queue wrapper */
-static inline struct epitem * ep_item_from_epqueue(poll_table *p)
+static inline struct epitem *ep_item_from_epqueue(poll_table *p)
 {
return container_of(p, struct ep_pqueue, pt)-epi;
 }
@@ -411,8 +392,7 @@
list_del_init(epi-fllink);
spin_unlock(file-f_ep_lock);
 
-   if (ep_rb_linked(epi-rbn))
-   ep_rb_erase(epi-rbn, ep-rbr);
+   rb_erase(epi-rbn, ep-rbr);
 
spin_lock_irqsave(ep-lock, flags);
if (ep_is_linked(epi-rdllink))
@@ -728,7 +708,6 @@
goto error_return;
 
/* Item initialization follow here ... */
-   ep_rb_initnode(epi-rbn);
INIT_LIST_HEAD(epi-rdllink);
INIT_LIST_HEAD(epi-fllink);
INIT_LIST_HEAD(epi-pwqlist);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] avoid kmemcheck warning in epoll

2008-02-11 Thread Davide Libenzi
On Mon, 11 Feb 2008, Andrew Morton wrote:

 On Sun, 10 Feb 2008 13:32:01 -0800 (PST)
 Davide Libenzi [EMAIL PROTECTED] wrote:
 
  Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
  rb_set_parent() accesses node's pointer in its code. This creates a 
  warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
  memory access. The warning is harmless since the following rb-tree node 
  insert is going to overwrite the node data. In any case I think it's 
  better to not have that happening at all, and fix it by properly 
  initializing the data.
  
  
  Signed-off-by: Davide Libenzi [EMAIL PROTECTED]
  
  
  - Davide
  
  
  ---
   fs/eventpoll.c |2 +-
   include/linux/rbtree.h |   12 
   2 files changed, 13 insertions(+), 1 deletion(-)
  
  Index: linux-2.6.mod/fs/eventpoll.c
  ===
  --- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 
  -0800
  +++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
  @@ -260,7 +260,7 @@
   /* Special initialization for the RB tree node to detect linkage */
   static inline void ep_rb_initnode(struct rb_node *n)
   {
  -   rb_set_parent(n, n);
  +   rb_init_node(n, n);
   }
   
   /* Removes a node from the RB tree and marks it for a fast is-linked check 
  */
  Index: linux-2.6.mod/include/linux/rbtree.h
  ===
  --- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 
  12:36:13.0 -0800
  +++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
  -0800
  @@ -112,6 +112,18 @@
  struct rb_node *rb_node;
   };
   
  +/**
  + * rb_init_node - Initializes the node internal data
  + *
  + * @node: Pointer to the RB-Tree node
  + * @parent: Pointer to the parent node, or NULL
  + *
  + */
  +static inline void rb_init_node(struct rb_node *node, struct rb_node 
  *parent)
  +{
  +   node-rb_parent_color = (unsigned long) parent;
  +   node-rb_left = node-rb_right = NULL;
  +}
 
 Is epoll the only rbtree-using code which exhibits this problem?  If so,
 what is epoll doing differently from all the others?

Dunno, but I don't think so. Epoll initializes the node to a state so that 
later on can check if the file-fd item is inserted or not. And it uses the 
parent information for that. But rb_set_parent() (that is used in the 
current initialization code) uses the data in the node, and this triggers 
the uninitialized memory access.
Taking a better look at it, the code (after the latest changes) has no 
more point-of-failures after the rb-tree insert, so I can probably avoid 
doing the node's parent-initialization and check altogether.
Yeah, let's that ;)

 eventpoll.c |   27 +++
 1 file changed, 3 insertions(+), 24 deletions(-)

Andrew, drop that one. I'm gonna send the new one after some test...



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] avoid kmemcheck warning in epoll

2008-02-10 Thread Davide Libenzi
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by properly 
initializing the data.


Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/eventpoll.c |2 +-
 include/linux/rbtree.h |   12 
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
@@ -260,7 +260,7 @@
 /* Special initialization for the RB tree node to detect linkage */
 static inline void ep_rb_initnode(struct rb_node *n)
 {
-   rb_set_parent(n, n);
+   rb_init_node(n, n);
 }
 
 /* Removes a node from the RB tree and marks it for a fast is-linked check */
Index: linux-2.6.mod/include/linux/rbtree.h
===
--- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 12:36:13.0 
-0800
+++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
-0800
@@ -112,6 +112,18 @@
struct rb_node *rb_node;
 };
 
+/**
+ * rb_init_node - Initializes the node internal data
+ *
+ * @node: Pointer to the RB-Tree node
+ * @parent: Pointer to the parent node, or NULL
+ *
+ */
+static inline void rb_init_node(struct rb_node *node, struct rb_node *parent)
+{
+   node->rb_parent_color = (unsigned long) parent;
+   node->rb_left = node->rb_right = NULL;
+}
 
 #define rb_parent(r)   ((struct rb_node *)((r)->rb_parent_color & ~3))
 #define rb_color(r)   ((r)->rb_parent_color & 1)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] avoid kmemcheck warning in epoll

2008-02-10 Thread Davide Libenzi
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but 
rb_set_parent() accesses node's pointer in its code. This creates a 
warning in kmemcheck (reported by Vegard Nossum) about an uninitialized 
memory access. The warning is harmless since the following rb-tree node 
insert is going to overwrite the node data. In any case I think it's 
better to not have that happening at all, and fix it by properly 
initializing the data.


Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/eventpoll.c |2 +-
 include/linux/rbtree.h |   12 
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/fs/eventpoll.c
===
--- linux-2.6.mod.orig/fs/eventpoll.c   2008-02-10 12:36:20.0 -0800
+++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800
@@ -260,7 +260,7 @@
 /* Special initialization for the RB tree node to detect linkage */
 static inline void ep_rb_initnode(struct rb_node *n)
 {
-   rb_set_parent(n, n);
+   rb_init_node(n, n);
 }
 
 /* Removes a node from the RB tree and marks it for a fast is-linked check */
Index: linux-2.6.mod/include/linux/rbtree.h
===
--- linux-2.6.mod.orig/include/linux/rbtree.h   2008-02-10 12:36:13.0 
-0800
+++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 
-0800
@@ -112,6 +112,18 @@
struct rb_node *rb_node;
 };
 
+/**
+ * rb_init_node - Initializes the node internal data
+ *
+ * @node: Pointer to the RB-Tree node
+ * @parent: Pointer to the parent node, or NULL
+ *
+ */
+static inline void rb_init_node(struct rb_node *node, struct rb_node *parent)
+{
+   node-rb_parent_color = (unsigned long) parent;
+   node-rb_left = node-rb_right = NULL;
+}
 
 #define rb_parent(r)   ((struct rb_node *)((r)-rb_parent_color  ~3))
 #define rb_color(r)   ((r)-rb_parent_color  1)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-01-24 Thread Davide Libenzi
On Thu, 24 Jan 2008, Pierre Habouzit wrote:

> On Fri, Jan 18, 2008 at 09:10:18PM +0000, Davide Libenzi wrote:
> > On Fri, 18 Jan 2008, Pierre Habouzit wrote:
> > 
> > >   Hi,
> > > 
> > >   I just came across a strange behavior of epoll that seems to
> > > contradict the documentation. Here is what happens:
> > > 
> > > * I have two processes P1 and P2, P1 accept()s connections, and send the
> > >   resulting file descriptors to P2 through a unix socket.
> > > 
> > > * P2 registers the received socket in his epollfd.
> > > 
> > >   [time passes]
> > > 
> > > * P2 is done with the socket and closes it
> > > 
> > > * P2 gets events for the socket again !
> > > 
> > > 
> > >   Though the documentation says that if a process closes a file
> > > descriptor, it gets unregistered. And yes I'm sure that P2 doens't dup()
> > > the file descriptor. Though (because of a bug) it was still open in
> > > P1[0], hence the referenced socket still live at the kernel level.
> > > 
> > >   Of course the userland workaround is to force the EPOLL_CTL_DEL before
> > > the close, which I now do, but costs me a syscall where I wanted to
> > > spare one :|
> > 
> > For epoll, a close is when the kernel file* is released (that is, when all 
> > its instances are gone).
> > We could put a special handling in filp_close(), but I don't think is a 
> > good idea, and we're better live with the current behaviour.
> 
>   Okay, maybe updating the linux manpages to be more clear about that is
> the way to go then. Thanks

Sure. I'll send Michael Kerrisk and updated statement for the A6 answer in 
the epoll man page.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-01-24 Thread Davide Libenzi
On Thu, 24 Jan 2008, Pierre Habouzit wrote:

 On Fri, Jan 18, 2008 at 09:10:18PM +, Davide Libenzi wrote:
  On Fri, 18 Jan 2008, Pierre Habouzit wrote:
  
 Hi,
   
 I just came across a strange behavior of epoll that seems to
   contradict the documentation. Here is what happens:
   
   * I have two processes P1 and P2, P1 accept()s connections, and send the
 resulting file descriptors to P2 through a unix socket.
   
   * P2 registers the received socket in his epollfd.
   
 [time passes]
   
   * P2 is done with the socket and closes it
   
   * P2 gets events for the socket again !
   
   
 Though the documentation says that if a process closes a file
   descriptor, it gets unregistered. And yes I'm sure that P2 doens't dup()
   the file descriptor. Though (because of a bug) it was still open in
   P1[0], hence the referenced socket still live at the kernel level.
   
 Of course the userland workaround is to force the EPOLL_CTL_DEL before
   the close, which I now do, but costs me a syscall where I wanted to
   spare one :|
  
  For epoll, a close is when the kernel file* is released (that is, when all 
  its instances are gone).
  We could put a special handling in filp_close(), but I don't think is a 
  good idea, and we're better live with the current behaviour.
 
   Okay, maybe updating the linux manpages to be more clear about that is
 the way to go then. Thanks

Sure. I'll send Michael Kerrisk and updated statement for the A6 answer in 
the epoll man page.



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lockdep: annotate epoll

2008-01-22 Thread Davide Libenzi
On Tue, 22 Jan 2008, Stefan Richter wrote:

> On 22 Jan, Stefan Richter wrote:
> > On 22 Jan, Peter Zijlstra wrote:
> >> Curious though that this gets reported frequently the last few weeks,
> >> afaics this problem is way old.
> > 
> > Here is a report against Fedora's 2.6.23-0.222.rc9.git4.fc8, filed in
> > October:  https://bugzilla.redhat.com/show_bug.cgi?id=323411
> 
> Upstream bug: http://bugzilla.kernel.org/show_bug.cgi?id=9786
> 
> Date: Sun, 13 Jan 2008 19:44:26 +0100
> From: Peter Zijlstra <[EMAIL PROTECTED]>
> 
> On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
> 
> > I remember I talked with Arjan about this time ago. Basically, since 1) 
> > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
> > are used, you can see a wake_up() from inside another wake_up(), but they 
> > will never refer to the same lock instance.
> > Think about:
> > 
> > dfd = socket(...);
> > efd1 = epoll_create();
> > efd2 = epoll_create();
> > epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
> > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> > 
> > When a packet arrives to the device underneath "dfd", the net code will 
> > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
> > callback wakeup entry on that queue, and the wake_up() performed by the 
> > "dfd" net code will end up in ep_poll_callback(). At this point epoll 
> > (efd1) notices that it may have some event ready, so it needs to wake up 
> > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
> > that ends up in another wake_up(), after having checked about the 
> > recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
> > avoid stack blasting. Never hit the same queue, to avoid loops like:
> > 
> > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> > epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
> > epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
> > epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
> > 
> > The code "if (tncur->wq == wq || ..." prevents re-entering the same 
> > queue/lock.
> 
> Since the epoll code is very careful to not nest same instance locks
> allow the recursion.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> Tested-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>  fs/eventpoll.c   |2 +-
>  include/linux/wait.h |   16 
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> Index: linux/fs/eventpoll.c
> ===
> --- linux.orig/fs/eventpoll.c
> +++ linux/fs/eventpoll.c
> @@ -353,7 +353,7 @@ static void ep_poll_safewake(struct poll
>   spin_unlock_irqrestore(>lock, flags);
>  
>   /* Do really wake up now */
> - wake_up(wq);
> + wake_up_nested(wq, 1 + wake_nests);
>  
>   /* Remove the current task from the list */
>   spin_lock_irqsave(>lock, flags);
> Index: linux/include/linux/wait.h
> ===
> --- linux.orig/include/linux/wait.h
> +++ linux/include/linux/wait.h
> @@ -161,6 +161,22 @@ wait_queue_head_t *FASTCALL(bit_waitqueu
>  #define  wake_up_locked(x)   __wake_up_locked((x), 
> TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE)
>  #define wake_up_interruptible_sync(x)   
> __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
>  
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +/*
> + * macro to avoid include hell
> + */
> +#define wake_up_nested(x, s) \
> +do { \
> + unsigned long flags;\
> + \
> + spin_lock_irqsave_nested(&(x)->lock, flags, (s));   \
> + wake_up_locked(x);      \
> + spin_unlock_irqrestore(&(x)->lock, flags);  \
> +} while (0)
> +#else
> +#define wake_up_nested(x, s) wake_up(x)
> +#endif
> +
>  #define __wait_event(wq, condition)  \
>  do { \
>   DEFINE_WAIT(__wait);\
> 

Looks fine to me.


Acked-by: Davide Libenzi <[EMAIL PROTECTED]>



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc7 to 2.6.24-rc8 possible regression

2008-01-22 Thread Davide Libenzi
On Tue, 22 Jan 2008, Stefan Richter wrote:

> (adding Cc: Davide and akpm)
> 
> On 22 Jan, Peter Zijlstra wrote:
> > 
> > On Tue, 2008-01-22 at 17:23 +0100, Stefan Richter wrote:
> >> Denys Fedoryshchenko wrote:
> >> > No, i am using vanilla kernel. It is one of production machines, and as 
> >> > i 
> >> > know screen is not using epoll.
> >> 
> >> OK, but the trace shows that it is the epoll recursion again.
> >> 
> >> > I will try to apply on all my production machines this patch. Sorry if 
> >> > it is 
> >> > related.
> >> 
> >> Well, let's hope that the lockdep annotation or whatever other fix gets
> >> into mainline sooner than later.  Which reminds me to test my setup
> >> again which appeared to be able to reproduce the __wake_up recursion on
> >> my command...
> > 
> > Would be appreciated, I have been waiting on testing feedback because
> > I'm not fully certain here.
> > 
> > Curious though that this gets reported frequently the last few weeks,
> > afaics this problem is way old.
> 
> Here is a report against Fedora's 2.6.23-0.222.rc9.git4.fc8, filed in
> October:  https://bugzilla.redhat.com/show_bug.cgi?id=323411
> 
> I did several tests with a pristine 2.6.24-rc8 now.  The lockdep warning
> below is 100% reliably triggered by grabbing video from a DV camcorder
> with dvgrab v3 via firewire-core's character device file ABI.
> 
> Also, this warning is 100% reliably suppressed by your annotation patch.
> I will reply to this message with this patch for kind consideration by
> those concerned. :-)

Patch? Where is it?



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc7 to 2.6.24-rc8 possible regression

2008-01-22 Thread Davide Libenzi
On Tue, 22 Jan 2008, Stefan Richter wrote:

 (adding Cc: Davide and akpm)
 
 On 22 Jan, Peter Zijlstra wrote:
  
  On Tue, 2008-01-22 at 17:23 +0100, Stefan Richter wrote:
  Denys Fedoryshchenko wrote:
   No, i am using vanilla kernel. It is one of production machines, and as 
   i 
   know screen is not using epoll.
  
  OK, but the trace shows that it is the epoll recursion again.
  
   I will try to apply on all my production machines this patch. Sorry if 
   it is 
   related.
  
  Well, let's hope that the lockdep annotation or whatever other fix gets
  into mainline sooner than later.  Which reminds me to test my setup
  again which appeared to be able to reproduce the __wake_up recursion on
  my command...
  
  Would be appreciated, I have been waiting on testing feedback because
  I'm not fully certain here.
  
  Curious though that this gets reported frequently the last few weeks,
  afaics this problem is way old.
 
 Here is a report against Fedora's 2.6.23-0.222.rc9.git4.fc8, filed in
 October:  https://bugzilla.redhat.com/show_bug.cgi?id=323411
 
 I did several tests with a pristine 2.6.24-rc8 now.  The lockdep warning
 below is 100% reliably triggered by grabbing video from a DV camcorder
 with dvgrab v3 via firewire-core's character device file ABI.
 
 Also, this warning is 100% reliably suppressed by your annotation patch.
 I will reply to this message with this patch for kind consideration by
 those concerned. :-)

Patch? Where is it?



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lockdep: annotate epoll

2008-01-22 Thread Davide Libenzi
On Tue, 22 Jan 2008, Stefan Richter wrote:

 On 22 Jan, Stefan Richter wrote:
  On 22 Jan, Peter Zijlstra wrote:
  Curious though that this gets reported frequently the last few weeks,
  afaics this problem is way old.
  
  Here is a report against Fedora's 2.6.23-0.222.rc9.git4.fc8, filed in
  October:  https://bugzilla.redhat.com/show_bug.cgi?id=323411
 
 Upstream bug: http://bugzilla.kernel.org/show_bug.cgi?id=9786
 
 Date: Sun, 13 Jan 2008 19:44:26 +0100
 From: Peter Zijlstra [EMAIL PROTECTED]
 
 On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
 
  I remember I talked with Arjan about this time ago. Basically, since 1) 
  you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
  are used, you can see a wake_up() from inside another wake_up(), but they 
  will never refer to the same lock instance.
  Think about:
  
  dfd = socket(...);
  efd1 = epoll_create();
  efd2 = epoll_create();
  epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
  epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
  
  When a packet arrives to the device underneath dfd, the net code will 
  issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
  callback wakeup entry on that queue, and the wake_up() performed by the 
  dfd net code will end up in ep_poll_callback(). At this point epoll 
  (efd1) notices that it may have some event ready, so it needs to wake up 
  the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
  that ends up in another wake_up(), after having checked about the 
  recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
  avoid stack blasting. Never hit the same queue, to avoid loops like:
  
  epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
  epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
  epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
  epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
  
  The code if (tncur-wq == wq || ... prevents re-entering the same 
  queue/lock.
 
 Since the epoll code is very careful to not nest same instance locks
 allow the recursion.
 
 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 Tested-by: Stefan Richter [EMAIL PROTECTED]
 ---
  fs/eventpoll.c   |2 +-
  include/linux/wait.h |   16 
  2 files changed, 17 insertions(+), 1 deletion(-)
 
 Index: linux/fs/eventpoll.c
 ===
 --- linux.orig/fs/eventpoll.c
 +++ linux/fs/eventpoll.c
 @@ -353,7 +353,7 @@ static void ep_poll_safewake(struct poll
   spin_unlock_irqrestore(psw-lock, flags);
  
   /* Do really wake up now */
 - wake_up(wq);
 + wake_up_nested(wq, 1 + wake_nests);
  
   /* Remove the current task from the list */
   spin_lock_irqsave(psw-lock, flags);
 Index: linux/include/linux/wait.h
 ===
 --- linux.orig/include/linux/wait.h
 +++ linux/include/linux/wait.h
 @@ -161,6 +161,22 @@ wait_queue_head_t *FASTCALL(bit_waitqueu
  #define  wake_up_locked(x)   __wake_up_locked((x), 
 TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE)
  #define wake_up_interruptible_sync(x)   
 __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
  
 +#ifdef CONFIG_DEBUG_LOCK_ALLOC
 +/*
 + * macro to avoid include hell
 + */
 +#define wake_up_nested(x, s) \
 +do { \
 + unsigned long flags;\
 + \
 + spin_lock_irqsave_nested((x)-lock, flags, (s));   \
 + wake_up_locked(x);  \
 + spin_unlock_irqrestore((x)-lock, flags);  \
 +} while (0)
 +#else
 +#define wake_up_nested(x, s) wake_up(x)
 +#endif
 +
  #define __wait_event(wq, condition)  \
  do { \
   DEFINE_WAIT(__wait);\
 

Looks fine to me.


Acked-by: Davide Libenzi [EMAIL PROTECTED]



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-01-18 Thread Davide Libenzi
On Fri, 18 Jan 2008, Pierre Habouzit wrote:

>   Hi,
> 
>   I just came across a strange behavior of epoll that seems to
> contradict the documentation. Here is what happens:
> 
> * I have two processes P1 and P2, P1 accept()s connections, and send the
>   resulting file descriptors to P2 through a unix socket.
> 
> * P2 registers the received socket in his epollfd.
> 
>   [time passes]
> 
> * P2 is done with the socket and closes it
> 
> * P2 gets events for the socket again !
> 
> 
>   Though the documentation says that if a process closes a file
> descriptor, it gets unregistered. And yes I'm sure that P2 doens't dup()
> the file descriptor. Though (because of a bug) it was still open in
> P1[0], hence the referenced socket still live at the kernel level.
> 
>   Of course the userland workaround is to force the EPOLL_CTL_DEL before
> the close, which I now do, but costs me a syscall where I wanted to
> spare one :|

For epoll, a close is when the kernel file* is released (that is, when all 
its instances are gone).
We could put a special handling in filp_close(), but I don't think is a 
good idea, and we're better live with the current behaviour.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and shared fd's

2008-01-18 Thread Davide Libenzi
On Fri, 18 Jan 2008, Pierre Habouzit wrote:

   Hi,
 
   I just came across a strange behavior of epoll that seems to
 contradict the documentation. Here is what happens:
 
 * I have two processes P1 and P2, P1 accept()s connections, and send the
   resulting file descriptors to P2 through a unix socket.
 
 * P2 registers the received socket in his epollfd.
 
   [time passes]
 
 * P2 is done with the socket and closes it
 
 * P2 gets events for the socket again !
 
 
   Though the documentation says that if a process closes a file
 descriptor, it gets unregistered. And yes I'm sure that P2 doens't dup()
 the file descriptor. Though (because of a bug) it was still open in
 P1[0], hence the referenced socket still live at the kernel level.
 
   Of course the userland workaround is to force the EPOLL_CTL_DEL before
 the close, which I now do, but costs me a syscall where I wanted to
 spare one :|

For epoll, a close is when the kernel file* is released (that is, when all 
its instances are gone).
We could put a special handling in filp_close(), but I don't think is a 
good idea, and we're better live with the current behaviour.



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-09 Thread Davide Libenzi
On Thu, 10 Jan 2008, Rusty Russell wrote:

> On Thursday 10 January 2008 05:16:57 Zach Brown wrote:
> > > The latter.  A ring is optimal for processing a huge number of requests,
> > > but if you're really going to be firing off syslet threads all over the
> > > place you're not going to be optimal anyway.  And being able to point the
> > > return value to the stack or into some datastructure is way nicer to code
> > > (zero setup == easy to understand and easy to convert).
> >
> > One of Linus' rhetorical requirements for the syslets work is that we be
> > able to wait for the result without spending overhead building up state
> > in some completion context.  The userland rings in the current syslet
> > patches achieve that and don't seem outrageously complicated.
> 
> I'd have to read his original statement, but eventfd doesn't build up state, 
> so I think it qualifies.

I think you and Zach are talking about different issues ;)
Eventfd born for two different reasons. First, to be able to have 
userspace to signal to a poll/select/epoll based listener an event. This 
can elso be done with pipes, but eventfd has a few advantages over pipes 
(3-4 times faster and *a lot* less memory footprint). Second, as a generic 
way for kernel subsystems to signal completions to a poll/select/epoll 
userspace listener. And this is what is used in the new KAIO eventfd 
feature (patch was like 5 lines IIRC).
This allow for KAIO events to be signaled to poll/select/epoll in a pretty 
easy way, using a simple extension of the AIO API.
What we talked originally with Ingo, when the first syslet code came up, 
was the ability to do the reverse thing. That is, host an epoll_wait() 
inside a syslet, and gather the completion using whatever the syslet code 
was/is going to use for it.
Given that 1) the eventfd completion patch was trivial and immediately 
available 2) the future of the whole syslet concept was/is still unclear, 
I believe it makes/made sense. If the syslets will become mainline, it'll 
mean that userspace will then be able to select the event-completion 
"hosting" method that better suits their needs. That are, either AIO 
completion hosted inside an epoll_wait() via an eventfd, or an 
epoll_wait() hosted inside a syslet.




- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-09 Thread Davide Libenzi
On Thu, 10 Jan 2008, Rusty Russell wrote:

 On Thursday 10 January 2008 05:16:57 Zach Brown wrote:
   The latter.  A ring is optimal for processing a huge number of requests,
   but if you're really going to be firing off syslet threads all over the
   place you're not going to be optimal anyway.  And being able to point the
   return value to the stack or into some datastructure is way nicer to code
   (zero setup == easy to understand and easy to convert).
 
  One of Linus' rhetorical requirements for the syslets work is that we be
  able to wait for the result without spending overhead building up state
  in some completion context.  The userland rings in the current syslet
  patches achieve that and don't seem outrageously complicated.
 
 I'd have to read his original statement, but eventfd doesn't build up state, 
 so I think it qualifies.

I think you and Zach are talking about different issues ;)
Eventfd born for two different reasons. First, to be able to have 
userspace to signal to a poll/select/epoll based listener an event. This 
can elso be done with pipes, but eventfd has a few advantages over pipes 
(3-4 times faster and *a lot* less memory footprint). Second, as a generic 
way for kernel subsystems to signal completions to a poll/select/epoll 
userspace listener. And this is what is used in the new KAIO eventfd 
feature (patch was like 5 lines IIRC).
This allow for KAIO events to be signaled to poll/select/epoll in a pretty 
easy way, using a simple extension of the AIO API.
What we talked originally with Ingo, when the first syslet code came up, 
was the ability to do the reverse thing. That is, host an epoll_wait() 
inside a syslet, and gather the completion using whatever the syslet code 
was/is going to use for it.
Given that 1) the eventfd completion patch was trivial and immediately 
available 2) the future of the whole syslet concept was/is still unclear, 
I believe it makes/made sense. If the syslets will become mainline, it'll 
mean that userspace will then be able to select the event-completion 
hosting method that better suits their needs. That are, either AIO 
completion hosted inside an epoll_wait() via an eventfd, or an 
epoll_wait() hosted inside a syslet.




- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Improve scalability of epoll_ctl

2008-01-08 Thread Davide Libenzi
On Tue, 8 Jan 2008, Eric Dumazet wrote:

> Changli Gao a écrit :
> > Replace the epitem rbtree with a dynamic array to get the constant 
> > insertion/deletion/modification time of the file descriptors. Reuse the 
> > size argument of epoll_create, however the size must be smaller than the 
> > max file descriptor number: ether the resource limitation or the compiling 
> > time limitation.
> >
> >   
> Hum, you should read this :
> 
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc3/2.6.7-rc3-mm2/broken-out/epoll-uses-rbtrees.patch
> 
> 
> Your patch is a revert of this change, it probably wont be accepted.
> 
> epoll_ctl() is scalable, since it's O(log2(N)) : With 1 millions files
> descriptors, that means less than 20 nodes to lookup in the tree.

Indeed, and we aren't going back.


> In what situation do you think epoll_ctl() performance is bad ?

I'm sure you can cook up a microbench that only does epoll_ctl(), and with 
a good hash implementation you get better numbers. This is what I did 
before I changed to rbtree, and the numbers did not justify the code (the 
"size" in epoll_create() is just an hint, and that means you need to 
handle resizing - plus other things like locking up memory when the number 
of fds shrinks, and other stuff I don't even remember).



- Davide



Re: [PATCH] Improve scalability of epoll_ctl

2008-01-08 Thread Davide Libenzi
On Tue, 8 Jan 2008, Eric Dumazet wrote:

 Changli Gao a écrit :
  Replace the epitem rbtree with a dynamic array to get the constant 
  insertion/deletion/modification time of the file descriptors. Reuse the 
  size argument of epoll_create, however the size must be smaller than the 
  max file descriptor number: ether the resource limitation or the compiling 
  time limitation.
 

 Hum, you should read this :
 
 http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc3/2.6.7-rc3-mm2/broken-out/epoll-uses-rbtrees.patch
 
 
 Your patch is a revert of this change, it probably wont be accepted.
 
 epoll_ctl() is scalable, since it's O(log2(N)) : With 1 millions files
 descriptors, that means less than 20 nodes to lookup in the tree.

Indeed, and we aren't going back.


 In what situation do you think epoll_ctl() performance is bad ?

I'm sure you can cook up a microbench that only does epoll_ctl(), and with 
a good hash implementation you get better numbers. This is what I did 
before I changed to rbtree, and the numbers did not justify the code (the 
size in epoll_create() is just an hint, and that means you need to 
handle resizing - plus other things like locking up memory when the number 
of fds shrinks, and other stuff I don't even remember).



- Davide



Re: 2.6.24-rc6: possible recursive locking detected

2008-01-07 Thread Davide Libenzi
On Sun, 6 Jan 2008, Christian Kujau wrote:

> On Sat, 5 Jan 2008, Davide Libenzi wrote:
> > A solution may be to move the call to ep_poll_safewake() (that'd become a
> > simple wake_up()) inside a tasklet or whatever is today trendy for delayed
> > work. But his kinda scares me to be honest, since epoll has already a
> > bunch of places where it could be asynchronously hit (plus performance
> > regression will need to be verified).
> 
> Although I'm not able to reproduce this one right now, I'm happy to test any
> patches you guys come up with.

There's no need to reproduce it. It's there, it's among us ;)



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc6: possible recursive locking detected

2008-01-07 Thread Davide Libenzi
On Sun, 6 Jan 2008, Christian Kujau wrote:

 On Sat, 5 Jan 2008, Davide Libenzi wrote:
  A solution may be to move the call to ep_poll_safewake() (that'd become a
  simple wake_up()) inside a tasklet or whatever is today trendy for delayed
  work. But his kinda scares me to be honest, since epoll has already a
  bunch of places where it could be asynchronously hit (plus performance
  regression will need to be verified).
 
 Although I'm not able to reproduce this one right now, I'm happy to test any
 patches you guys come up with.

There's no need to reproduce it. It's there, it's among us ;)



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc6: possible recursive locking detected

2008-01-05 Thread Davide Libenzi
On Sat, 5 Jan 2008, Peter Zijlstra wrote:

> 
> On Sat, 2008-01-05 at 17:53 +0100, Peter Zijlstra wrote:
> > On Sat, 2008-01-05 at 18:12 +1100, Herbert Xu wrote:
> > > On Fri, Jan 04, 2008 at 09:30:49AM +0100, Ingo Molnar wrote:
> > > >
> > > > > > [ 1310.670986] =
> > > > > > [ 1310.671690] [ INFO: possible recursive locking detected ]
> > > > > > [ 1310.672097] 2.6.24-rc6 #1
> > > > > > [ 1310.672421] -
> > > > > > [ 1310.672828] FahCore_a0.exe/3692 is trying to acquire lock:
> > > > > > [ 1310.673238]  (>lock){++..}, at: [] 
> > > > > > __wake_up+0x1b/0x50
> > > > > > [ 1310.673869]
> > > > > > [ 1310.673870] but task is already holding lock:
> > > > > > [ 1310.674567]  (>lock){++..}, at: [] 
> > > > > > __wake_up+0x1b/0x50
> > > > > > [ 1310.675267]
> > > > > > [ 1310.675268] other info that might help us debug this:
> > > > > > [ 1310.675952] 5 locks held by FahCore_a0.exe/3692:
> > > > > > [ 1310.676334]  #0:  (rcu_read_lock){..--}, at: [] 
> > > > > > net_rx_action+0x60/0x1b0
> > > > > > [ 1310.677251]  #1:  (rcu_read_lock){..--}, at: [] 
> > > > > > netif_receive_skb+0x100/0x470
> > > > > > [ 1310.677924]  #2:  (rcu_read_lock){..--}, at: [] 
> > > > > > ip_local_deliver_finish+0x32/0x210
> > > > > > [ 1310.678460]  #3:  (clock-AF_INET){-.-?}, at: [] 
> > > > > > sock_def_readable+0x1e/0x80
> > > > > > [ 1310.679250]  #4:  (>lock){++..}, at: [] 
> > > > > > __wake_up+0x1b/0x50
> > > 
> > > The net part might just be a red herring, since the problem is that
> > > __wake_up is somehow reentering itself.
> > 
> > /*
> >  * Perform a safe wake up of the poll wait list. The problem is that
> >  * with the new callback'd wake up system, it is possible that the
> >  * poll callback is reentered from inside the call to wake_up() done
> >  * on the poll wait queue head. The rule is that we cannot reenter the
> >  * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
> >  * and we cannot reenter the same wait queue head at all. This will
> >  * enable to have a hierarchy of epoll file descriptor of no more than
> >  * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
> >  * because this one gets called by the poll callback, that in turn is called
> >  * from inside a wake_up(), that might be called from irq context.
> >  */
> > 
> > Seems to suggest that the epoll code can indeed recurse into wakeup.
> > 
> > Davide, Johannes, any ideas?
> 
> Since EP_MAX_POLLWAKE_NESTS < MAX_LOCKDEP_SUBCLASSES we could perhaps do
> something like:
> 
>   wake_up_nested(..., wake_nests);
> 
> although I'm not quite sure that is correct, my understanding of this
> code is still fragile at best.

I remember I talked with Arjan about this time ago. Basically, since 1) 
you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
are used, you can see a wake_up() from inside another wake_up(), but they 
will never refer to the same lock instance.
Think about:

dfd = socket(...);
efd1 = epoll_create();
efd2 = epoll_create();
epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);

When a packet arrives to the device underneath "dfd", the net code will 
issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
callback wakeup entry on that queue, and the wake_up() performed by the 
"dfd" net code will end up in ep_poll_callback(). At this point epoll 
(efd1) notices that it may have some event ready, so it needs to wake up 
the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
that ends up in another wake_up(), after having checked about the 
recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
avoid stack blasting. Never hit the same queue, to avoid loops like:

epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);

The code "if (tncur->wq == wq || ..." prevents re-entering the same 
queue/lock.
I don't know how the lockdep code works, so I can't say about 
wake_up_nested(). Although I have a feeling is not enough in this case.
A solution may be to move the call to ep_poll_safewake() (that'd become a 
simple wake_up()) inside a tasklet or whatever is today trendy for delayed 
work. But his kinda scares me to be honest, since epoll has already a 
bunch of places where it could be asynchronously hit (plus performance 
regression will need to be verified).



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc6: possible recursive locking detected

2008-01-05 Thread Davide Libenzi
On Sat, 5 Jan 2008, Peter Zijlstra wrote:

 
 On Sat, 2008-01-05 at 17:53 +0100, Peter Zijlstra wrote:
  On Sat, 2008-01-05 at 18:12 +1100, Herbert Xu wrote:
   On Fri, Jan 04, 2008 at 09:30:49AM +0100, Ingo Molnar wrote:
   
  [ 1310.670986] =
  [ 1310.671690] [ INFO: possible recursive locking detected ]
  [ 1310.672097] 2.6.24-rc6 #1
  [ 1310.672421] -
  [ 1310.672828] FahCore_a0.exe/3692 is trying to acquire lock:
  [ 1310.673238]  (q-lock){++..}, at: [c011544b] 
  __wake_up+0x1b/0x50
  [ 1310.673869]
  [ 1310.673870] but task is already holding lock:
  [ 1310.674567]  (q-lock){++..}, at: [c011544b] 
  __wake_up+0x1b/0x50
  [ 1310.675267]
  [ 1310.675268] other info that might help us debug this:
  [ 1310.675952] 5 locks held by FahCore_a0.exe/3692:
  [ 1310.676334]  #0:  (rcu_read_lock){..--}, at: [c038b620] 
  net_rx_action+0x60/0x1b0
  [ 1310.677251]  #1:  (rcu_read_lock){..--}, at: [c0388d60] 
  netif_receive_skb+0x100/0x470
  [ 1310.677924]  #2:  (rcu_read_lock){..--}, at: [c03a7fb2] 
  ip_local_deliver_finish+0x32/0x210
  [ 1310.678460]  #3:  (clock-AF_INET){-.-?}, at: [c038164e] 
  sock_def_readable+0x1e/0x80
  [ 1310.679250]  #4:  (q-lock){++..}, at: [c011544b] 
  __wake_up+0x1b/0x50
   
   The net part might just be a red herring, since the problem is that
   __wake_up is somehow reentering itself.
  
  /*
   * Perform a safe wake up of the poll wait list. The problem is that
   * with the new callback'd wake up system, it is possible that the
   * poll callback is reentered from inside the call to wake_up() done
   * on the poll wait queue head. The rule is that we cannot reenter the
   * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times,
   * and we cannot reenter the same wait queue head at all. This will
   * enable to have a hierarchy of epoll file descriptor of no more than
   * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock
   * because this one gets called by the poll callback, that in turn is called
   * from inside a wake_up(), that might be called from irq context.
   */
  
  Seems to suggest that the epoll code can indeed recurse into wakeup.
  
  Davide, Johannes, any ideas?
 
 Since EP_MAX_POLLWAKE_NESTS  MAX_LOCKDEP_SUBCLASSES we could perhaps do
 something like:
 
   wake_up_nested(..., wake_nests);
 
 although I'm not quite sure that is correct, my understanding of this
 code is still fragile at best.

I remember I talked with Arjan about this time ago. Basically, since 1) 
you can drop an epoll fd inside another epoll fd 2) callback-based wakeups 
are used, you can see a wake_up() from inside another wake_up(), but they 
will never refer to the same lock instance.
Think about:

dfd = socket(...);
efd1 = epoll_create();
efd2 = epoll_create();
epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);

When a packet arrives to the device underneath dfd, the net code will 
issue a wake_up() on its poll wake list. Epoll (efd1) has installed a 
callback wakeup entry on that queue, and the wake_up() performed by the 
dfd net code will end up in ep_poll_callback(). At this point epoll 
(efd1) notices that it may have some event ready, so it needs to wake up 
the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() 
that ends up in another wake_up(), after having checked about the 
recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to 
avoid stack blasting. Never hit the same queue, to avoid loops like:

epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);

The code if (tncur-wq == wq || ... prevents re-entering the same 
queue/lock.
I don't know how the lockdep code works, so I can't say about 
wake_up_nested(). Although I have a feeling is not enough in this case.
A solution may be to move the call to ep_poll_safewake() (that'd become a 
simple wake_up()) inside a tasklet or whatever is today trendy for delayed 
work. But his kinda scares me to be honest, since epoll has already a 
bunch of places where it could be asynchronously hit (plus performance 
regression will need to be verified).



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/2] timerfd - make the returned time to be the remaining time till the next expiration

2007-12-17 Thread Davide Libenzi
Make the returned time to be the remaining time till the next expiration.
If the timer is already expired, and there's no next expiration, zero will
be returned.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/timerfd.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-12-14 16:04:36.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-12-14 16:05:32.0 -0800
@@ -49,6 +49,15 @@
return HRTIMER_NORESTART;
 }
 
+static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx) {
+   ktime_t now, remaining;
+
+   now = ctx->tmr.base->get_time();
+   remaining = ktime_sub(ctx->tmr.expires, now);
+
+   return remaining.tv64 < 0 ? ktime_set(0, 0): remaining;
+}
+
 static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
@@ -240,7 +249,7 @@
if (ctx->expired && ctx->tintv.tv64)
hrtimer_forward_now(>tmr, ctx->tintv);
 
-   kotmr.it_value = ktime_to_timespec(ctx->tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx->tintv);
 
/*
@@ -274,7 +283,7 @@
hrtimer_forward_now(>tmr, ctx->tintv) - 1;
hrtimer_restart(>tmr);
}
-   kotmr.it_value = ktime_to_timespec(ctx->tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx->tintv);
spin_unlock_irq(>wqh.lock);
fput(file);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/2] timerfd - make hrtimer_forward() to return a u64

2007-12-17 Thread Davide Libenzi
This patch makes hrtimer_forward() to return a u64 instead of unsigned long.
Since timerfd returns the number of timer ticks in a u64 variable, and
hrtimer_forward() is used to calculate the timer ticks, the patch allow full
64 bit usage even on 32 bit platforms. The core of the hrtimer_forward() ticks
calculation, ktime_divns(), was already having the result in u64 and it was
chopping it to unsigned long.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer->base->get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG < 64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG < 64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc >>= sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG >= 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer->expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr->it.real.interval.tv64 == 0)
return;
 
-   timr->it_overrun += hrtimer_forward(timer, timer->base->get_time(),
-   timr->it.real.interval);
+   timr->it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer->base->get_time(),
+   timr->it.real.interval);
 
timr->it_overrun_last = timr->it_overrun;
timr->it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr->it_overrun +=
+   timr->it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr->it.real.interval);
ret = HRTIMER_RESTART;
@@ -662,7 +663,7 @@
 */
if (iv.tv64 && (timr->it_requeue_pending & REQUEUE_PENDING ||
(timr->it_sigev_notify & ~SIGEV_THREAD_ID) == SIGEV_NONE))
-   timr->it_overrun += hrtimer_forward(timer, now, iv);
+   timr->it_overrun += (unsigned int) hrtimer_forward(timer, now, 
iv);
 
remaining = ktime_sub(timer->expires, now);
/* Return 0 only, when the timer is expired and not pending */
Index: linux-2.6.mod/fs/timerfd.c
===

Re: Tesing of / bugs in new timerfd API

2007-12-17 Thread Davide Libenzi
On Mon, 17 Dec 2007, Michael Kerrisk wrote:

> > Can you try the two patches below? I tried them on my 32 bit box (one of
> > the rare beasts still lingering around here) and it seems to be working
> > fine (those go on top of the previous ones).
> 
> Against 2.6.24-rc5, I applied first your earlier patches ("v3") and
> then the newest patch.  My tests confirm that:
> 
> > This fixed the 32 bit tick-count truncation, and makes the time returned
> > to be the remaining time till the next expiration.
> 
> Are you going to resubmit a new patch set that includes these latest changes?

Yes, I'll send them out to Andrew today.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-17 Thread Davide Libenzi
On Mon, 17 Dec 2007, Michael Kerrisk wrote:

  Can you try the two patches below? I tried them on my 32 bit box (one of
  the rare beasts still lingering around here) and it seems to be working
  fine (those go on top of the previous ones).
 
 Against 2.6.24-rc5, I applied first your earlier patches (v3) and
 then the newest patch.  My tests confirm that:
 
  This fixed the 32 bit tick-count truncation, and makes the time returned
  to be the remaining time till the next expiration.
 
 Are you going to resubmit a new patch set that includes these latest changes?

Yes, I'll send them out to Andrew today.


- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/2] timerfd - make hrtimer_forward() to return a u64

2007-12-17 Thread Davide Libenzi
This patch makes hrtimer_forward() to return a u64 instead of unsigned long.
Since timerfd returns the number of timer ticks in a u64 variable, and
hrtimer_forward() is used to calculate the timer ticks, the patch allow full
64 bit usage even on 32 bit platforms. The core of the hrtimer_forward() ticks
calculation, ktime_divns(), was already having the result in u64 and it was
chopping it to unsigned long.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer-base-get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG  64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG  64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc = sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG = 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer-expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr-it.real.interval.tv64 == 0)
return;
 
-   timr-it_overrun += hrtimer_forward(timer, timer-base-get_time(),
-   timr-it.real.interval);
+   timr-it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer-base-get_time(),
+   timr-it.real.interval);
 
timr-it_overrun_last = timr-it_overrun;
timr-it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr-it_overrun +=
+   timr-it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr-it.real.interval);
ret = HRTIMER_RESTART;
@@ -662,7 +663,7 @@
 */
if (iv.tv64  (timr-it_requeue_pending  REQUEUE_PENDING ||
(timr-it_sigev_notify  ~SIGEV_THREAD_ID) == SIGEV_NONE))
-   timr-it_overrun += hrtimer_forward(timer, now, iv);
+   timr-it_overrun += (unsigned int) hrtimer_forward(timer, now, 
iv);
 
remaining = ktime_sub(timer-expires, now);
/* Return 0 only, when the timer is expired and not pending */
Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-12-13 16:49:46.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-12-14 16:04:36.0

[patch 2/2] timerfd - make the returned time to be the remaining time till the next expiration

2007-12-17 Thread Davide Libenzi
Make the returned time to be the remaining time till the next expiration.
If the timer is already expired, and there's no next expiration, zero will
be returned.
Andrew, this goes on top of the ones you already have in -mm.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/timerfd.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-12-14 16:04:36.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-12-14 16:05:32.0 -0800
@@ -49,6 +49,15 @@
return HRTIMER_NORESTART;
 }
 
+static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx) {
+   ktime_t now, remaining;
+
+   now = ctx-tmr.base-get_time();
+   remaining = ktime_sub(ctx-tmr.expires, now);
+
+   return remaining.tv64  0 ? ktime_set(0, 0): remaining;
+}
+
 static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
@@ -240,7 +249,7 @@
if (ctx-expired  ctx-tintv.tv64)
hrtimer_forward_now(ctx-tmr, ctx-tintv);
 
-   kotmr.it_value = ktime_to_timespec(ctx-tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx-tintv);
 
/*
@@ -274,7 +283,7 @@
hrtimer_forward_now(ctx-tmr, ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
}
-   kotmr.it_value = ktime_to_timespec(ctx-tmr.expires);
+   kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
kotmr.it_interval = ktime_to_timespec(ctx-tintv);
spin_unlock_irq(ctx-wqh.lock);
fput(file);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-16 Thread Davide Libenzi
On Fri, 14 Dec 2007, Michael Kerrisk wrote:

> You snipped my example that demonstrated the problem.  Both of the
> following runs create a timer that expires 10 seconds from "now", but
> observe the difference in the value returned by timerfd_gettime():
> 
> $ ./timerfd_test 10 # does not use TFD_TIMER_ABSTIME
> Initial setting for settime:   value=10.000, interval=0.000
> ./timerfd_test> g
> (elapsed time=  1)
> Current value: value=346.448, interval=0.000
> 
> $ ./timerfd_test -a 10  # uses TFD_TIMER_ABSTIME
> Initial setting for settime:   value=1197630855.254, interval=0.000
> ./timerfd_test> g
> (elapsed time=  1)
> Current value: value=1197630855.254, interval=0.000
> 
> Either there's an inconsistency here depending on the use of
> TFD_TIMER_ABSTIME, or there is a bug in my understanding or my test program
> (but so far I haven't spotted that bug ;-).).

Can you try the two patches below? I tried them on my 32 bit box (one of 
the rare beasts still lingering around here) and it seems to be working 
fine (those go on top of the previous ones).
This fixed the 32 bit tick-count truncation, and makes the time returned 
to be the remaining time till the next expiration.




- Davide



---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer->base->get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG < 64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG < 64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc >>= sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG >= 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer->expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr->it.real.interval.tv64 == 0)
return;
 
-   timr->it_overrun += hrtimer_forward(timer, timer->base->get_time(),
-   timr->it.real.interval);
+   timr->it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer->base->get_time(),
+   timr->it.real.interval);
 
timr->it_overrun_last = timr->it_overrun;
timr->it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr->it_overrun +=
+   timr->it_overrun += (unsigned int)
hrtimer_forward(timer, now,

Re: Tesing of / bugs in new timerfd API

2007-12-16 Thread Davide Libenzi
On Fri, 14 Dec 2007, Michael Kerrisk wrote:

 You snipped my example that demonstrated the problem.  Both of the
 following runs create a timer that expires 10 seconds from now, but
 observe the difference in the value returned by timerfd_gettime():
 
 $ ./timerfd_test 10 # does not use TFD_TIMER_ABSTIME
 Initial setting for settime:   value=10.000, interval=0.000
 ./timerfd_test g
 (elapsed time=  1)
 Current value: value=346.448, interval=0.000
 
 $ ./timerfd_test -a 10  # uses TFD_TIMER_ABSTIME
 Initial setting for settime:   value=1197630855.254, interval=0.000
 ./timerfd_test g
 (elapsed time=  1)
 Current value: value=1197630855.254, interval=0.000
 
 Either there's an inconsistency here depending on the use of
 TFD_TIMER_ABSTIME, or there is a bug in my understanding or my test program
 (but so far I haven't spotted that bug ;-).).

Can you try the two patches below? I tried them on my 32 bit box (one of 
the rare beasts still lingering around here) and it seems to be working 
fine (those go on top of the previous ones).
This fixed the 32 bit tick-count truncation, and makes the time returned 
to be the remaining time till the next expiration.




- Davide



---
 fs/timerfd.c|6 +++---
 include/linux/hrtimer.h |   10 +-
 kernel/hrtimer.c|9 -
 kernel/posix-timers.c   |9 +
 4 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-12-13 16:37:36.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-12-14 16:05:23.0 
-0800
@@ -295,12 +295,12 @@
 }
 
 /* Forward a hrtimer so it expires after now: */
-extern unsigned long
+extern u64
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
 /* Forward a hrtimer so it expires after the hrtimer's current now */
-static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
-   ktime_t interval)
+static inline u64 hrtimer_forward_now(struct hrtimer *timer,
+ ktime_t interval)
 {
return hrtimer_forward(timer, timer-base-get_time(), interval);
 }
@@ -322,9 +322,9 @@
 extern void __init hrtimers_init(void);
 
 #if BITS_PER_LONG  64
-extern unsigned long ktime_divns(const ktime_t kt, s64 div);
+extern u64 ktime_divns(const ktime_t kt, s64 div);
 #else /* BITS_PER_LONG  64 */
-# define ktime_divns(kt, div)  (unsigned long)((kt).tv64 / (div))
+# define ktime_divns(kt, div)  (u64)((kt).tv64 / (div))
 #endif
 
 /* Show pending timers: */
Index: linux-2.6.mod/kernel/hrtimer.c
===
--- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800
+++ linux-2.6.mod/kernel/hrtimer.c  2007-12-13 16:41:42.0 -0800
@@ -306,7 +306,7 @@
 /*
  * Divide a ktime value by a nanosecond value
  */
-unsigned long ktime_divns(const ktime_t kt, s64 div)
+u64 ktime_divns(const ktime_t kt, s64 div)
 {
u64 dclc, inc, dns;
int sft = 0;
@@ -321,7 +321,7 @@
dclc = sft;
do_div(dclc, (unsigned long) div);
 
-   return (unsigned long) dclc;
+   return dclc;
 }
 #endif /* BITS_PER_LONG = 64 */
 
@@ -655,10 +655,9 @@
  * Forward the timer expiry so it will expire in the future.
  * Returns the number of overruns.
  */
-unsigned long
-hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
+u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 {
-   unsigned long orun = 1;
+   u64 orun = 1;
ktime_t delta;
 
delta = ktime_sub(now, timer-expires);
Index: linux-2.6.mod/kernel/posix-timers.c
===
--- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 
-0800
+++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800
@@ -256,8 +256,9 @@
if (timr-it.real.interval.tv64 == 0)
return;
 
-   timr-it_overrun += hrtimer_forward(timer, timer-base-get_time(),
-   timr-it.real.interval);
+   timr-it_overrun += (unsigned int) hrtimer_forward(timer,
+   timer-base-get_time(),
+   timr-it.real.interval);
 
timr-it_overrun_last = timr-it_overrun;
timr-it_overrun = -1;
@@ -386,7 +387,7 @@
now = ktime_add(now, kj);
}
 #endif
-   timr-it_overrun +=
+   timr-it_overrun += (unsigned int)
hrtimer_forward(timer, now,
timr-it.real.interval);
ret = 

Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi
On Thu, 13 Dec 2007, Michael Kerrisk wrote:

> > > BUG 2:
> > > The last sentence does not match the implementation.
> > > (Nor is it consistent with the behavior of POSIX timers.
> > > And I *think* things did work correctly in the original
> > > timerfd() implementation, but I have not gone back to check.)
> > >
> > > Suppose that we set an absolute timer to expire 100 seconds
> > > in the future.  Then according to this sentence of the man
> > > page then each subsequent call to timerfd_gettime() should
> > > retrun an itimerspec structure whose it_value steadily
> > > decreases from 100 to 0 (when the timer expires).  (This
> > > is the behavior in the analogous situation with POSIX timers
> > > and with setitimer()/getitimer().)
> > >
> > > However, the implementation of timerfd_gettime() always
> > > returns the "time when the timer would next expire", and
> > > this value depends on whether TFD_TIMER_ABSTIME was specified
> > > when setting the timer.
> >
> > This is been like that from the beginning of the new API. So no, the
> > previous was behaving exactly the same WRT this feature.
> > Is this something really needed?
> 
> Three reasons that I think of off the top of my head (and there might
> well be more reasosn) why this should change:
> 
> a) consistency with the other two timer APIs (POSIX timers
> (timer_create(), etc.), and setitimer()/getitimer()).
> 
> b) Returning the amount of time until the next expiration  is more
> useful to userland: I'd say the most common case is for userland to
> want to know how long until the next expiration occurs, or to adjust
> that time by adding/subtracting some value to the existing setting.
> That is difficult to with the current implementation: the userland app
> must use timer_gettime(), call clock_gettime(), and calculate the
> difference between the two, in order to know how much time remains
> until the next timer expiration.

Bah, I don't want to argue with that because otherwise it starts going 
towards the "typical" use cases listing, that can be found the same exact 
reasons to have one or the other way. And we end up wasting lots of time.
We'd just have to move another thing that userspace could do, inside the 
kernel (subtract the current value returned by hrtimer_forward() in 
->expires with "now").




> c) Currently, the information returned differs depending on whether
> TFD_TIMER_ABSTIME is specified -- this is not the case for the
> analogous situation for POSIX timers.  For POSIX timers, the returned
> setting is always the amount of time until the timer next expires.
> This inconsistency is messy for applications -- the application may
> not (be able to) know whether or not the timer it is examining was set
> using TFD_TIMER_ABSTIME (the timerfd might have been created by a
> library, for example).

Hmm... the time returned is always the next absolute time when the next 
timer tick will go off. It does not depend on TFD_TIMER_ABSTIME. I return 
the ->expires field of the timer, and hrtimer_forward() sets it to the 
next absolute time.




- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi
On Thu, 13 Dec 2007, Michael Kerrisk wrote:

> Davide, Andrew,
> 
> I applied Davide's v3 patchset (sent into LKML on 25 Nov) against
> 2.4.24-rc3, and did various tests (all on x86).  Several tests
> were done using the program at the foot of this mail.  Various others
> were done by cobbling together bits of code that I haven't included
> here.

Thanks for such a thorough test Michael.



> Tested: after setting a CLOCK_REALTIME timer with the
> TFD_TIMER_ABSTIME flag to expire at some time in the past with
> a non-zero interval (e.g., setting 100 seconds in the past, with
> a 5 second interval), read() from the file descriptor returns
> the correct number of expirations (e.g., 20).
> 
> This seems a reasonable thing to do, I suppose.  However, while
> playing around to test this, I found what looks like a bug (see
> below).
> 
> BUG 1:
> However, this test exposed what looks like a bug: if I set a
> CLOCK_REALTIME clock to expire in the past, with a very small
> interval, then the maximum number of expirations that can be
> returned via read seems to be limited to 32 bits, even though
> we have a 64-bit value for returning this information.
> I haven't checked the kernel source to determine where this
> bug is.

Yes, true. Right now hrtimer_forward() (that we use to get the "missed" 
ticks) returns an unsigned long, that on 32 bit is (duh!) 32 bit :)
But hrtimer_forward() has all in place to just return an u64. So, Thomas, 
would it be OK to have hrtimer_forward() (and the new hrtimer_forward_now())
to return a u64?




> BUG 2:
> The last sentence does not match the implementation.
> (Nor is it consistent with the behavior of POSIX timers.
> And I *think* things did work correctly in the original
> timerfd() implementation, but I have not gone back to check.)
> 
> Suppose that we set an absolute timer to expire 100 seconds
> in the future.  Then according to this sentence of the man
> page then each subsequent call to timerfd_gettime() should
> retrun an itimerspec structure whose it_value steadily
> decreases from 100 to 0 (when the timer expires).  (This
> is the behavior in the analogous situation with POSIX timers
> and with setitimer()/getitimer().)
> 
> However, the implementation of timerfd_gettime() always
> returns the "time when the timer would next expire", and
> this value depends on whether TFD_TIMER_ABSTIME was specified
> when setting the timer.

This is been like that from the beginning of the new API. So no, the 
previous was behaving exactly the same WRT this feature.
Is this something really needed?



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi
On Thu, 13 Dec 2007, Michael Kerrisk wrote:

 Davide, Andrew,
 
 I applied Davide's v3 patchset (sent into LKML on 25 Nov) against
 2.4.24-rc3, and did various tests (all on x86).  Several tests
 were done using the program at the foot of this mail.  Various others
 were done by cobbling together bits of code that I haven't included
 here.

Thanks for such a thorough test Michael.



 Tested: after setting a CLOCK_REALTIME timer with the
 TFD_TIMER_ABSTIME flag to expire at some time in the past with
 a non-zero interval (e.g., setting 100 seconds in the past, with
 a 5 second interval), read() from the file descriptor returns
 the correct number of expirations (e.g., 20).
 
 This seems a reasonable thing to do, I suppose.  However, while
 playing around to test this, I found what looks like a bug (see
 below).
 
 BUG 1:
 However, this test exposed what looks like a bug: if I set a
 CLOCK_REALTIME clock to expire in the past, with a very small
 interval, then the maximum number of expirations that can be
 returned via read seems to be limited to 32 bits, even though
 we have a 64-bit value for returning this information.
 I haven't checked the kernel source to determine where this
 bug is.

Yes, true. Right now hrtimer_forward() (that we use to get the missed 
ticks) returns an unsigned long, that on 32 bit is (duh!) 32 bit :)
But hrtimer_forward() has all in place to just return an u64. So, Thomas, 
would it be OK to have hrtimer_forward() (and the new hrtimer_forward_now())
to return a u64?




 BUG 2:
 The last sentence does not match the implementation.
 (Nor is it consistent with the behavior of POSIX timers.
 And I *think* things did work correctly in the original
 timerfd() implementation, but I have not gone back to check.)
 
 Suppose that we set an absolute timer to expire 100 seconds
 in the future.  Then according to this sentence of the man
 page then each subsequent call to timerfd_gettime() should
 retrun an itimerspec structure whose it_value steadily
 decreases from 100 to 0 (when the timer expires).  (This
 is the behavior in the analogous situation with POSIX timers
 and with setitimer()/getitimer().)
 
 However, the implementation of timerfd_gettime() always
 returns the time when the timer would next expire, and
 this value depends on whether TFD_TIMER_ABSTIME was specified
 when setting the timer.

This is been like that from the beginning of the new API. So no, the 
previous was behaving exactly the same WRT this feature.
Is this something really needed?



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Tesing of / bugs in new timerfd API

2007-12-13 Thread Davide Libenzi
On Thu, 13 Dec 2007, Michael Kerrisk wrote:

   BUG 2:
   The last sentence does not match the implementation.
   (Nor is it consistent with the behavior of POSIX timers.
   And I *think* things did work correctly in the original
   timerfd() implementation, but I have not gone back to check.)
  
   Suppose that we set an absolute timer to expire 100 seconds
   in the future.  Then according to this sentence of the man
   page then each subsequent call to timerfd_gettime() should
   retrun an itimerspec structure whose it_value steadily
   decreases from 100 to 0 (when the timer expires).  (This
   is the behavior in the analogous situation with POSIX timers
   and with setitimer()/getitimer().)
  
   However, the implementation of timerfd_gettime() always
   returns the time when the timer would next expire, and
   this value depends on whether TFD_TIMER_ABSTIME was specified
   when setting the timer.
 
  This is been like that from the beginning of the new API. So no, the
  previous was behaving exactly the same WRT this feature.
  Is this something really needed?
 
 Three reasons that I think of off the top of my head (and there might
 well be more reasosn) why this should change:
 
 a) consistency with the other two timer APIs (POSIX timers
 (timer_create(), etc.), and setitimer()/getitimer()).
 
 b) Returning the amount of time until the next expiration  is more
 useful to userland: I'd say the most common case is for userland to
 want to know how long until the next expiration occurs, or to adjust
 that time by adding/subtracting some value to the existing setting.
 That is difficult to with the current implementation: the userland app
 must use timer_gettime(), call clock_gettime(), and calculate the
 difference between the two, in order to know how much time remains
 until the next timer expiration.

Bah, I don't want to argue with that because otherwise it starts going 
towards the typical use cases listing, that can be found the same exact 
reasons to have one or the other way. And we end up wasting lots of time.
We'd just have to move another thing that userspace could do, inside the 
kernel (subtract the current value returned by hrtimer_forward() in 
-expires with now).




 c) Currently, the information returned differs depending on whether
 TFD_TIMER_ABSTIME is specified -- this is not the case for the
 analogous situation for POSIX timers.  For POSIX timers, the returned
 setting is always the amount of time until the timer next expires.
 This inconsistency is messy for applications -- the application may
 not (be able to) know whether or not the timer it is examining was set
 using TFD_TIMER_ABSTIME (the timerfd might have been created by a
 library, for example).

Hmm... the time returned is always the next absolute time when the next 
timer tick will go off. It does not depend on TFD_TIMER_ABSTIME. I return 
the -expires field of the timer, and hrtimer_forward() sets it to the 
next absolute time.




- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] move the related code from exit_notify() to exit_signals()

2007-12-06 Thread Davide Libenzi
On Thu, 6 Dec 2007, Oleg Nesterov wrote:

> The previous bugfix was not optimal, we shouldn't care about group stop when
> we are the only thread or the group stop is in progress. In that case nothing
> special is needed, just set PF_EXITING and return.
> 
> Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().
> 
> So, from the performance POV the only difference is that we don't trust
> !signal_pending() until we take ->siglock. But this in fact fixes another
> ___pure___ theoretical minor race. __group_complete_signal() finds the task
> without PF_EXITING and chooses it as the target for signal_wake_up(). But
> nothing prevents this task from exiting in between without noticing the
> pending signal and thus unpredictably delaying the actual delivery.

For a second there, you got me confused, since it wasn't clear to me this 
patch was going over the previous one. When posting patches that are not 
in any tree, it may be wise to include the set they nest onto ;)


> + if (!signal_pending(tsk))
> + goto out;
> +
> + /* It could be that __group_complete_signal() choose us to
> +  * notify about group-wide signal. Another thread should be
> +  * woken now to take the signal since we will not.
> +  */
> + for (t = tsk; (t = next_thread(t)) != tsk; )
> + if (!signal_pending(t) && !(t->flags & PF_EXITING))
> + recalc_sigpending_and_wake(t);
> +
> + if (unlikely(tsk->signal->group_stop_count) &&
> + !--tsk->signal->group_stop_count) {
> + tsk->signal->flags = SIGNAL_STOP_STOPPED;
> + group_stop = 1;
> + }
> +out:

Looks fine to me, even though the one above could simply be an:

if (signal_pending(tsk)) {
...
}

(since the "out" label is used by that "if" only).



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] move the related code from exit_notify() to exit_signals()

2007-12-06 Thread Davide Libenzi
On Thu, 6 Dec 2007, Oleg Nesterov wrote:

 The previous bugfix was not optimal, we shouldn't care about group stop when
 we are the only thread or the group stop is in progress. In that case nothing
 special is needed, just set PF_EXITING and return.
 
 Also, take the related TIF_SIGPENDING re-targeting code from exit_notify().
 
 So, from the performance POV the only difference is that we don't trust
 !signal_pending() until we take -siglock. But this in fact fixes another
 ___pure___ theoretical minor race. __group_complete_signal() finds the task
 without PF_EXITING and chooses it as the target for signal_wake_up(). But
 nothing prevents this task from exiting in between without noticing the
 pending signal and thus unpredictably delaying the actual delivery.

For a second there, you got me confused, since it wasn't clear to me this 
patch was going over the previous one. When posting patches that are not 
in any tree, it may be wise to include the set they nest onto ;)


 + if (!signal_pending(tsk))
 + goto out;
 +
 + /* It could be that __group_complete_signal() choose us to
 +  * notify about group-wide signal. Another thread should be
 +  * woken now to take the signal since we will not.
 +  */
 + for (t = tsk; (t = next_thread(t)) != tsk; )
 + if (!signal_pending(t)  !(t-flags  PF_EXITING))
 + recalc_sigpending_and_wake(t);
 +
 + if (unlikely(tsk-signal-group_stop_count) 
 + !--tsk-signal-group_stop_count) {
 + tsk-signal-flags = SIGNAL_STOP_STOPPED;
 + group_stop = 1;
 + }
 +out:

Looks fine to me, even though the one above could simply be an:

if (signal_pending(tsk)) {
...
}

(since the out label is used by that if only).



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fix group stop with exit race

2007-12-05 Thread Davide Libenzi
On Wed, 5 Dec 2007, Oleg Nesterov wrote:

> do_signal_stop() counts all sub-thread and sets ->group_stop_count 
> accordingly.
> Every thread should decrement ->group_stop_count and stop, the last one should
> notify the parent.
> 
> However a sub-thread can exit before it notices the signal_pending(), or it 
> may
> be somewhere in do_exit() already. In that case the group stop never finishes
> properly.
> 
> Note: this is a minimal fix, we can add some optimizations later. Say we can
> return quickly if thread_group_empty(). Also, we can move some signal related
> code from exit_notify() to exit_signals().
> 
> Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]>

Looks OK for me, even though we're doing more work on the exit path. OTOH 
I don't see a non-racy way of doing it w/out grabbing the lock. Did you 
try to bench how much this change costs?
Anyway, looks sane to me...

Acked-by: Davide Libenzi <[EMAIL PROTECTED]>



- Davide


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fix group stop with exit race

2007-12-05 Thread Davide Libenzi
On Wed, 5 Dec 2007, Oleg Nesterov wrote:

 do_signal_stop() counts all sub-thread and sets -group_stop_count 
 accordingly.
 Every thread should decrement -group_stop_count and stop, the last one should
 notify the parent.
 
 However a sub-thread can exit before it notices the signal_pending(), or it 
 may
 be somewhere in do_exit() already. In that case the group stop never finishes
 properly.
 
 Note: this is a minimal fix, we can add some optimizations later. Say we can
 return quickly if thread_group_empty(). Also, we can move some signal related
 code from exit_notify() to exit_signals().
 
 Signed-off-by: Oleg Nesterov [EMAIL PROTECTED]

Looks OK for me, even though we're doing more work on the exit path. OTOH 
I don't see a non-racy way of doing it w/out grabbing the lock. Did you 
try to bench how much this change costs?
Anyway, looks sane to me...

Acked-by: Davide Libenzi [EMAIL PROTECTED]



- Davide


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] update sys_ni.c with the new timerfd syscalls

2007-11-28 Thread Davide Libenzi
Update sys_ni.c with the new timerfd syscalls.


Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 kernel/sys_ni.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/kernel/sys_ni.c
===
--- linux-2.6.mod.orig/kernel/sys_ni.c  2007-11-28 11:44:48.0 -0800
+++ linux-2.6.mod/kernel/sys_ni.c   2007-11-28 11:46:01.0 -0800
@@ -153,7 +153,9 @@
 
 /* New file descriptors */
 cond_syscall(sys_signalfd);
-cond_syscall(sys_timerfd);
+cond_syscall(sys_timerfd_create);
+cond_syscall(sys_timerfd_settime);
+cond_syscall(sys_timerfd_gettime);
 cond_syscall(compat_sys_signalfd);
 cond_syscall(compat_sys_timerfd);
 cond_syscall(sys_eventfd);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm2 - Build Failure on powerpc timerfd() undeclared

2007-11-28 Thread Davide Libenzi
On Wed, 28 Nov 2007, Andrew Morton wrote:

> On Wed, 28 Nov 2007 14:32:07 +0100 Arnd Bergmann <[EMAIL PROTECTED]> wrote:
> 
> > On Wednesday 28 November 2007, Kamalesh Babulal wrote:
> > > Kernel build fails, with build error
> > > 
> > >   CC      arch/powerpc/platforms/cell/spu_callbacks.o
> > > In file included from arch/powerpc/platforms/cell/spu_callbacks.c:49:
> > > include/asm/systbl.h:312: error: ‘sys_timerfd’ undeclared here (not in a 
> > > function)
> > > make[2]: *** [arch/powerpc/platforms/cell/spu_callbacks.o] Error 1
> > > make[1]: *** [arch/powerpc/platforms/cell] Error 2
> > > make: *** [arch/powerpc/platforms] Error 2
> > > 
> > 
> > I guess all architectures except x86 are currently broken because they
> > reference the old sys_timerfd function.
> 
> None of them were broken in my testing and I'm unsure why powerpc broke
> here.
> 
> > This patch should add the missing
> > bits to powerpc.
> > 
> 
> Because the patches in -mm left the stubs in place in sys_ni.c and powerpc
> _should_ have (incorrectly) picked those up.

My fault. I forgot to update sys_ni.c with the new functions (and with the
sys_timerfd->sys_timerfd_create name change).
Do you want a patch Andrew?



- Davide


Re: 2.6.24-rc3-mm2 - Build Failure on powerpc timerfd() undeclared

2007-11-28 Thread Davide Libenzi
On Wed, 28 Nov 2007, Andrew Morton wrote:

 On Wed, 28 Nov 2007 14:32:07 +0100 Arnd Bergmann [EMAIL PROTECTED] wrote:
 
  On Wednesday 28 November 2007, Kamalesh Babulal wrote:
   Kernel build fails, with build error
   
     CC      arch/powerpc/platforms/cell/spu_callbacks.o
   In file included from arch/powerpc/platforms/cell/spu_callbacks.c:49:
   include/asm/systbl.h:312: error: ‘sys_timerfd’ undeclared here (not in a 
   function)
   make[2]: *** [arch/powerpc/platforms/cell/spu_callbacks.o] Error 1
   make[1]: *** [arch/powerpc/platforms/cell] Error 2
   make: *** [arch/powerpc/platforms] Error 2
   
  
  I guess all architectures except x86 are currently broken because they
  reference the old sys_timerfd function.
 
 None of them were broken in my testing and I'm unsure why powerpc broke
 here.
 
  This patch should add the missing
  bits to powerpc.
  
 
 Because the patches in -mm left the stubs in place in sys_ni.c and powerpc
 _should_ have (incorrectly) picked those up.

My fault. I forgot to update sys_ni.c with the new functions (and with the
sys_timerfd-sys_timerfd_create name change).
Do you want a patch Andrew?



- Davide


[patch] update sys_ni.c with the new timerfd syscalls

2007-11-28 Thread Davide Libenzi
Update sys_ni.c with the new timerfd syscalls.


Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 kernel/sys_ni.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/kernel/sys_ni.c
===
--- linux-2.6.mod.orig/kernel/sys_ni.c  2007-11-28 11:44:48.0 -0800
+++ linux-2.6.mod/kernel/sys_ni.c   2007-11-28 11:46:01.0 -0800
@@ -153,7 +153,9 @@
 
 /* New file descriptors */
 cond_syscall(sys_signalfd);
-cond_syscall(sys_timerfd);
+cond_syscall(sys_timerfd_create);
+cond_syscall(sys_timerfd_settime);
+cond_syscall(sys_timerfd_gettime);
 cond_syscall(compat_sys_signalfd);
 cond_syscall(compat_sys_timerfd);
 cond_syscall(sys_eventfd);

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi
On Tue, 27 Nov 2007, Andrew Morton wrote:

> On Tue, 27 Nov 2007 12:47:46 -0800 (PST)
> Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 27 Nov 2007, Andrew Morton wrote:
> > 
> > > On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi <[EMAIL PROTECTED]> 
> > > wrote:
> > > 
> > > > +static struct file *timerfd_fget(int fd)
> > > > +{
> > > > +   struct file *file;
> > > > +
> > > > +   file = fget(fd);
> > > > +   if (!file)
> > > > +   return ERR_PTR(-EBADF);
> > > > +   if (file->f_op != _fops) {
> > > > +   fput(file);
> > > > +   return ERR_PTR(-EINVAL);
> > > > +   }
> > > > +
> > > > +   return file;
> > > > +}
> > > 
> > > I suppose we could use fget_light() in here sometime.  It is significantly
> > > quicker in microbenchmarks.  Or it was - nobody has checked that in a few
> > > years afaik.
> > 
> > Should I now?
> 
> No great rush.  It'd be fun to see if it actually makes any measurable
> difference on modern hardware (hint ;)).

I was going to say BS, given that at the time of the tests, the files 
struct was protected by an rwlock whereas now there's a rcu one.
But that seems not the case:

http://www.xmailserver.org/fget-light-test.c


$ fget-light-test -r 9   
warming up ...
testing non-shared ...
time = 314.354000 ms
testing shared ...
time = 390.781000 ms



And here is the oprofile output:

[SHARED CASE]
samples  %app name symbol name
7436 28.9339  vmlinux  __clear_user
2369  9.2179  vmlinux  system_call
1710  6.6537  vmlinux  fget_light
1244  4.8405  vmlinux  inotify_dentry_parent_queue_event
1128  4.3891  vmlinux  sys_read
1041  4.0506  libpthread-2.6.1.so  __read_nocancel
1027  3.9961  Xorg (no symbols)
978   3.8054  vmlinux  read_zero
755   2.9377  vmlinux  vfs_read
545   2.1206  vmlinux  inotify_inode_queue_event
507   1.9728  vmlinux  sysret_check
414   1.6109  vmlinux  rw_verify_area
405   1.5759  vmlinux  unix_poll
371   1.4436  nvidia   (no symbols)
333   1.2957  vmlinux  acpi_pm_read
311   1.2101  nvidia_drv.so(no symbols)
290   1.1284  vmlinux  do_select
288   1.1206  vmlinux  dnotify_parent
253   0.9844  libc-2.6.1.so(no symbols)
232   0.9027  bash (no symbols)
216   0.8405  libpthread-2.6.1.so  __pthread_enable_asynccancel


[UNSHARED CASE]
samples  %app name symbol name
6542 27.6922  vmlinux  __clear_user
4074 17.2452  vmlinux  vfs_read
1266  5.3590  vmlinux  inotify_inode_queue_event
1091  4.6182  Xorg (no symbols)
1059  4.4827  vmlinux  system_call
937   3.9663  libpthread-2.6.1.so  __read_nocancel
544   2.3027  vmlinux  clear_user
484   2.0488  vmlinux  dnotify_parent
445   1.8837  vmlinux  read_zero
438   1.8540  vmlinux  sysret_check
414   1.7525  vmlinux  unix_poll
407   1.7228  nvidia   (no symbols)
389   1.6466  vmlinux  acpi_pm_read
315   1.3334  nvidia_drv.so(no symbols)
312   1.3207  vmlinux  inotify_dentry_parent_queue_event
312   1.3207  vmlinux  sys_read
305   1.2911  vmlinux  rw_verify_area
304   1.2868  vmlinux  do_select
222   0.9397  libc-2.6.1.so(no symbols)
214   0.9059  bash (no symbols)
196   0.8297  fget-light-test  read_test
185   0.7831  vmlinux  fget_light



You can clearly notice the fget_light() drop out of the relevance window.
BTW, are all those "notify" noises supposed to be there?




- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi
On Tue, 27 Nov 2007, Andrew Morton wrote:

> On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi <[EMAIL PROTECTED]> wrote:
> 
> > +static struct file *timerfd_fget(int fd)
> > +{
> > +   struct file *file;
> > +
> > +   file = fget(fd);
> > +   if (!file)
> > +   return ERR_PTR(-EBADF);
> > +   if (file->f_op != _fops) {
> > +   fput(file);
> > +   return ERR_PTR(-EINVAL);
> > +   }
> > +
> > +   return file;
> > +}
> 
> I suppose we could use fget_light() in here sometime.  It is significantly
> quicker in microbenchmarks.  Or it was - nobody has checked that in a few
> years afaik.

Should I now? Half of the internet traffic of the last three month was 
generated by me posting those timerfd patches :)



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi
On Tue, 27 Nov 2007, Andrew Morton wrote:

 On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi [EMAIL PROTECTED] wrote:
 
  +static struct file *timerfd_fget(int fd)
  +{
  +   struct file *file;
  +
  +   file = fget(fd);
  +   if (!file)
  +   return ERR_PTR(-EBADF);
  +   if (file-f_op != timerfd_fops) {
  +   fput(file);
  +   return ERR_PTR(-EINVAL);
  +   }
  +
  +   return file;
  +}
 
 I suppose we could use fget_light() in here sometime.  It is significantly
 quicker in microbenchmarks.  Or it was - nobody has checked that in a few
 years afaik.

Should I now? Half of the internet traffic of the last three month was 
generated by me posting those timerfd patches :)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v3 - new timerfd API

2007-11-27 Thread Davide Libenzi
On Tue, 27 Nov 2007, Andrew Morton wrote:

 On Tue, 27 Nov 2007 12:47:46 -0800 (PST)
 Davide Libenzi [EMAIL PROTECTED] wrote:
 
  On Tue, 27 Nov 2007, Andrew Morton wrote:
  
   On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi [EMAIL PROTECTED] 
   wrote:
   
+static struct file *timerfd_fget(int fd)
+{
+   struct file *file;
+
+   file = fget(fd);
+   if (!file)
+   return ERR_PTR(-EBADF);
+   if (file-f_op != timerfd_fops) {
+   fput(file);
+   return ERR_PTR(-EINVAL);
+   }
+
+   return file;
+}
   
   I suppose we could use fget_light() in here sometime.  It is significantly
   quicker in microbenchmarks.  Or it was - nobody has checked that in a few
   years afaik.
  
  Should I now?
 
 No great rush.  It'd be fun to see if it actually makes any measurable
 difference on modern hardware (hint ;)).

I was going to say BS, given that at the time of the tests, the files 
struct was protected by an rwlock whereas now there's a rcu one.
But that seems not the case:

http://www.xmailserver.org/fget-light-test.c


$ fget-light-test -r 9   
warming up ...
testing non-shared ...
time = 314.354000 ms
testing shared ...
time = 390.781000 ms



And here is the oprofile output:

[SHARED CASE]
samples  %app name symbol name
7436 28.9339  vmlinux  __clear_user
2369  9.2179  vmlinux  system_call
1710  6.6537  vmlinux  fget_light
1244  4.8405  vmlinux  inotify_dentry_parent_queue_event
1128  4.3891  vmlinux  sys_read
1041  4.0506  libpthread-2.6.1.so  __read_nocancel
1027  3.9961  Xorg (no symbols)
978   3.8054  vmlinux  read_zero
755   2.9377  vmlinux  vfs_read
545   2.1206  vmlinux  inotify_inode_queue_event
507   1.9728  vmlinux  sysret_check
414   1.6109  vmlinux  rw_verify_area
405   1.5759  vmlinux  unix_poll
371   1.4436  nvidia   (no symbols)
333   1.2957  vmlinux  acpi_pm_read
311   1.2101  nvidia_drv.so(no symbols)
290   1.1284  vmlinux  do_select
288   1.1206  vmlinux  dnotify_parent
253   0.9844  libc-2.6.1.so(no symbols)
232   0.9027  bash (no symbols)
216   0.8405  libpthread-2.6.1.so  __pthread_enable_asynccancel


[UNSHARED CASE]
samples  %app name symbol name
6542 27.6922  vmlinux  __clear_user
4074 17.2452  vmlinux  vfs_read
1266  5.3590  vmlinux  inotify_inode_queue_event
1091  4.6182  Xorg (no symbols)
1059  4.4827  vmlinux  system_call
937   3.9663  libpthread-2.6.1.so  __read_nocancel
544   2.3027  vmlinux  clear_user
484   2.0488  vmlinux  dnotify_parent
445   1.8837  vmlinux  read_zero
438   1.8540  vmlinux  sysret_check
414   1.7525  vmlinux  unix_poll
407   1.7228  nvidia   (no symbols)
389   1.6466  vmlinux  acpi_pm_read
315   1.3334  nvidia_drv.so(no symbols)
312   1.3207  vmlinux  inotify_dentry_parent_queue_event
312   1.3207  vmlinux  sys_read
305   1.2911  vmlinux  rw_verify_area
304   1.2868  vmlinux  do_select
222   0.9397  libc-2.6.1.so(no symbols)
214   0.9059  bash (no symbols)
196   0.8297  fget-light-test  read_test
185   0.7831  vmlinux  fget_light



You can clearly notice the fget_light() drop out of the relevance window.
BTW, are all those notify noises supposed to be there?




- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-26 Thread Davide Libenzi
On Mon, 26 Nov 2007, H. Peter Anvin wrote:

> Ingo Molnar wrote:
> > 
> > So it's not like sys_indirect() would break some magic pristine state of a
> > flat parameter space - on the contrary, most of the nontrivial syscalls take
> > pointers to structures or pointers to streams of information. The parameter
> > count histogram i believe further underlines this point:
> > 
> >   #args   #syscalls
> >   -
> >   0   22
> >   1   51
> >   2   83
> >   3   85
> >   4   40
> >   5   23
> >   68
> > 
> > the natural 'center' of function call parameter counts is around 1-4
> > parameters, and that is natural. (most operators that the human brain
> > prefers to operate with are like that - having higher complexity than that
> > often defeats the purpose of getting an API used by ... humans.)
> > 
> 
> I was preparing a response to Linus' email, but I really feel this needs to be
> addressed specifically.
> 
> When it comes to dealing with the operator-visible state, what matters is what
> happens on the API level, NOT on the system call level. Furthermore, the
> proposed sys_indirect interface just means that there are parameters hidden
> from immediately view, even though they fundamentally change the operation
> performed, and that it is much harder to correlate, say, the output of
> strace(1) with what actually happened in the program.  So from a
> *psychological* point of view, this seems to be an insane design choice.

I think there are two different issues. One is the proliferation of system 
calls, and the other is the sane design of internal kernel APIs.
The first one is not very interesting to me, since I don't have any strong 
opinions in either cases.
The second is the one I'd care most. I think that, whatever is the 
solution used to address the first, internal kernel APIs should be 
designed so that parameters flow down from the system call to the 
parameter's user code. IMO, besides very few cases where it could make 
some sense [*], setting some thread-global bits in the upper layer, to be 
magically picked up by code in the lower layers, does not lead to readable 
interfaces.



[*] Things that already read from a shared context, that is already 
exposed to the user through some sort of set/get APIs.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-26 Thread Davide Libenzi
On Mon, 26 Nov 2007, H. Peter Anvin wrote:

 Ingo Molnar wrote:
  
  So it's not like sys_indirect() would break some magic pristine state of a
  flat parameter space - on the contrary, most of the nontrivial syscalls take
  pointers to structures or pointers to streams of information. The parameter
  count histogram i believe further underlines this point:
  
#args   #syscalls
-
0   22
1   51
2   83
3   85
4   40
5   23
68
  
  the natural 'center' of function call parameter counts is around 1-4
  parameters, and that is natural. (most operators that the human brain
  prefers to operate with are like that - having higher complexity than that
  often defeats the purpose of getting an API used by ... humans.)
  
 
 I was preparing a response to Linus' email, but I really feel this needs to be
 addressed specifically.
 
 When it comes to dealing with the operator-visible state, what matters is what
 happens on the API level, NOT on the system call level. Furthermore, the
 proposed sys_indirect interface just means that there are parameters hidden
 from immediately view, even though they fundamentally change the operation
 performed, and that it is much harder to correlate, say, the output of
 strace(1) with what actually happened in the program.  So from a
 *psychological* point of view, this seems to be an insane design choice.

I think there are two different issues. One is the proliferation of system 
calls, and the other is the sane design of internal kernel APIs.
The first one is not very interesting to me, since I don't have any strong 
opinions in either cases.
The second is the one I'd care most. I think that, whatever is the 
solution used to address the first, internal kernel APIs should be 
designed so that parameters flow down from the system call to the 
parameter's user code. IMO, besides very few cases where it could make 
some sense [*], setting some thread-global bits in the upper layer, to be 
magically picked up by code in the lower layers, does not lead to readable 
interfaces.



[*] Things that already read from a shared context, that is already 
exposed to the user through some sort of set/get APIs.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/4] Timerfd v3 - wire the new timerfd API to the x86 family

2007-11-25 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-24 12:49:28.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-24 12:49:28.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-24 
12:49:28.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-24 12:49:28.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] Timerfd v3 - un-break CONFIG_TIMERFD

2007-11-25 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:55:15.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-24 12:49:30.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool "Enable timerfd() system call" if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/4] Timerfd v3 - introduce a new hrtimer_forward_now() function

2007-11-25 Thread Davide Libenzi
I think that advancing the timer against the timer's current "now" can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:55:16.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-24 12:48:05.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer->base->get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] Timerfd v3 - new timerfd API

2007-11-25 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not 
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  199 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 166 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:55:16.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-24 12:49:21.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(>wqh.lock, flags);
ctx->expired = 1;
+   ctx->ticks++;
wake_up_locked(>wqh);
spin_unlock_irqrestore(>wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
+   ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(>tmr, clockid, htmode);
+   hrtimer_init(>tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, >wqh, wait);
 
spin_lock_irqsave(>wqh.lock, flags);
-   if (ctx->expired)
+   if (ctx->ticks)
events |= POLLIN;
spin_unlock_irqrestore(>wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(>wqh.lock);
res = -EAGAIN;
-   if (!ctx->expired && !(file->f_flags & O_NONBLOCK)) {
+   if (!ctx->ticks && !(file->f_flags & O_NONBLOCK)) {
__add_wait_queue(>wqh, );
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->expired) {
+   if (ctx->ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(>wqh, );
__set_current_state(TASK_RUNNING);
}
-   if (ctx->expired) {
-   ctx->expired = 0;
-   if (ctx->tintv.tv64 != 0) {
+   if (ctx->ticks) {
+   ticks = ctx->ticks;
+   if (ctx->expired && ctx->tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(>tmr,
-   hrtimer_cb_get_time(>tmr),
-   ctx->tintv);
+   ticks += (u64) hrtimer_forward_now(>tmr,
+  

[patch 1/4] Timerfd v3 - introduce a new hrtimer_forward_now() function

2007-11-25 Thread Davide Libenzi
I think that advancing the timer against the timer's current now can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:55:16.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-24 12:48:05.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer-base-get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] Timerfd v3 - new timerfd API

2007-11-25 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The clockid
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the otmr parameter is not 
NULL).
The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit
is set in the flags parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  199 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 166 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:55:16.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-24 12:49:21.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the expired
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(ctx-wqh.lock, flags);
ctx-expired = 1;
+   ctx-ticks++;
wake_up_locked(ctx-wqh);
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
+   ctx-ticks = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, ctx-wqh, wait);
 
spin_lock_irqsave(ctx-wqh.lock, flags);
-   if (ctx-expired)
+   if (ctx-ticks)
events |= POLLIN;
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
res = -EAGAIN;
-   if (!ctx-expired  !(file-f_flags  O_NONBLOCK)) {
+   if (!ctx-ticks  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-expired) {
+   if (ctx-ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx-expired) {
-   ctx-expired = 0;
-   if (ctx-tintv.tv64 != 0) {
+   if (ctx-ticks) {
+   ticks = ctx-ticks;
+   if (ctx-expired  ctx-tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(ctx-tmr,
-   hrtimer_cb_get_time(ctx-tmr),
-   ctx-tintv);
+   ticks += (u64) hrtimer_forward_now(ctx-tmr,
+  ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
-   } else
-   ticks = 1;
+   }
+   ctx-expired

[patch 3/4] Timerfd v3 - wire the new timerfd API to the x86 family

2007-11-25 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-24 12:49:28.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:55:15.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-24 12:49:28.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-24 
12:49:28.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:55:16.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-24 12:49:28.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] Timerfd v3 - un-break CONFIG_TIMERFD

2007-11-25 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:55:15.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-24 12:49:30.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool Enable timerfd() system call if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v2 - new timerfd API

2007-11-24 Thread Davide Libenzi
On Sat, 24 Nov 2007, Michael Kerrisk wrote:

> > +asmlinkage long sys_timerfd_create(int clockid, int flags)
> >  {
> > -   int error;
> > +   int error, ufd;
> > struct timerfd_ctx *ctx;
> > struct file *file;
> > struct inode *inode;
> > -   struct itimerspec ktmr;
> > -
> > -   if (copy_from_user(, utmr, sizeof(ktmr)))
> > -   return -EFAULT;
> >  
> > if (clockid != CLOCK_MONOTONIC &&
> > clockid != CLOCK_REALTIME)
> > return -EINVAL;
> 
> Could I suggest here, the following placeholder addition:
> 
>   if (flags != 0)
>   return -EINVAL;

Make sense, will repost.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 2/4] Timerfd v2 - new timerfd API

2007-11-24 Thread Davide Libenzi
On Sat, 24 Nov 2007, Michael Kerrisk wrote:

  +asmlinkage long sys_timerfd_create(int clockid, int flags)
   {
  -   int error;
  +   int error, ufd;
  struct timerfd_ctx *ctx;
  struct file *file;
  struct inode *inode;
  -   struct itimerspec ktmr;
  -
  -   if (copy_from_user(ktmr, utmr, sizeof(ktmr)))
  -   return -EFAULT;
   
  if (clockid != CLOCK_MONOTONIC 
  clockid != CLOCK_REALTIME)
  return -EINVAL;
 
 Could I suggest here, the following placeholder addition:
 
   if (flags != 0)
   return -EINVAL;

Make sense, will repost.


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] Timerfd v2 - un-break CONFIG_TIMERFD

2007-11-23 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:13:16.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-23 13:36:42.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool "Enable timerfd() system call" if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/4] Timerfd v2 - wire the new timerfd API to the x86 family

2007-11-23 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-23 13:36:40.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-23 13:36:40.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-23 
13:36:40.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-23 13:36:40.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/4] Timerfd v2 - introduce a new hrtimer_forward_now() function

2007-11-23 Thread Davide Libenzi
I think that advancing the timer against the timer's current "now" can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:13:21.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-23 13:36:36.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer->base->get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] Timerfd v2 - new timerfd API

2007-11-23 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not 
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  197 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 164 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:13:19.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-23 13:36:39.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the "expired"
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(>wqh.lock, flags);
ctx->expired = 1;
+   ctx->ticks++;
wake_up_locked(>wqh);
spin_unlock_irqrestore(>wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr->it_value);
ctx->expired = 0;
+   ctx->ticks = 0;
ctx->tintv = timespec_to_ktime(ktmr->it_interval);
-   hrtimer_init(>tmr, clockid, htmode);
+   hrtimer_init(>tmr, ctx->clockid, htmode);
ctx->tmr.expires = texp;
ctx->tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, >wqh, wait);
 
spin_lock_irqsave(>wqh.lock, flags);
-   if (ctx->expired)
+   if (ctx->ticks)
events |= POLLIN;
spin_unlock_irqrestore(>wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(>wqh.lock);
res = -EAGAIN;
-   if (!ctx->expired && !(file->f_flags & O_NONBLOCK)) {
+   if (!ctx->ticks && !(file->f_flags & O_NONBLOCK)) {
__add_wait_queue(>wqh, );
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->expired) {
+   if (ctx->ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(>wqh, );
__set_current_state(TASK_RUNNING);
}
-   if (ctx->expired) {
-   ctx->expired = 0;
-   if (ctx->tintv.tv64 != 0) {
+   if (ctx->ticks) {
+   ticks = ctx->ticks;
+   if (ctx->expired && ctx->tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(>tmr,
-   hrtimer_cb_get_time(>tmr),
-   ctx->tintv);
+   ticks += (u64) hrtimer_forward_now(>tmr,
+  

Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi
On Fri, 23 Nov 2007, Ulrich Drepper wrote:

> On Nov 23, 2007 9:29 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > Yes, it's disabled, and yes, I'll repost today ...
> 
> I haven't seen the patch and don't feel like searching.  So I say it
> here: please mak sure you add a flags parameter to the system call
> itself (instead of adding it on as for eventfd and signalfd).  We need
> to be able to use O_CLOEXEC some way or another.

I'm more then OK about adding a flags parameter. If it was for me, I'd add 
it even to eventfd and signalfd. I asked Linus if he was OK about adding 
the flags parameter to all. He didn't reply, and I read that as "no".



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi
On Fri, 23 Nov 2007, Andrew Morton wrote:

> > I suppose this means that timerfd will only go in for 2.6.25.  I don't
> > have a problem with that, but we better make sure that the existing
> > timerfd in 2.6.24 is still disabled.  (Andrew had a one liner for
> > that, but I haven't checked if it's in place.)
> > 
> 
> I have no timerfd patches here.

Yes, it's disabled, and yes, I'll repost today ...


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 4/4] Timerfd v2 - un-break CONFIG_TIMERFD

2007-11-23 Thread Davide Libenzi
Remove the broken status to CONFIG_TIMERFD.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 init/Kconfig |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.mod/init/Kconfig
===
--- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:13:16.0 -0800
+++ linux-2.6.mod/init/Kconfig  2007-11-23 13:36:42.0 -0800
@@ -566,7 +566,6 @@
 config TIMERFD
bool Enable timerfd() system call if EMBEDDED
select ANON_INODES
-   depends on BROKEN
default y
help
  Enable the timerfd() system call that allows to receive timer

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/4] Timerfd v2 - introduce a new hrtimer_forward_now() function

2007-11-23 Thread Davide Libenzi
I think that advancing the timer against the timer's current now can
be a pretty common usage, so, w/out exposing hrtimer's internals, we add
a new hrtimer_forward_now() function.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 include/linux/hrtimer.h |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.mod/include/linux/hrtimer.h
===
--- linux-2.6.mod.orig/include/linux/hrtimer.h  2007-11-23 13:13:21.0 
-0800
+++ linux-2.6.mod/include/linux/hrtimer.h   2007-11-23 13:36:36.0 
-0800
@@ -298,6 +298,13 @@
 extern unsigned long
 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
 
+/* Forward a hrtimer so it expires after the hrtimer's current now */
+static inline unsigned long hrtimer_forward_now(struct hrtimer *timer,
+   ktime_t interval)
+{
+   return hrtimer_forward(timer, timer-base-get_time(), interval);
+}
+
 /* Precise sleep: */
 extern long hrtimer_nanosleep(struct timespec *rqtp,
  struct timespec *rmtp,

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi
On Fri, 23 Nov 2007, Andrew Morton wrote:

  I suppose this means that timerfd will only go in for 2.6.25.  I don't
  have a problem with that, but we better make sure that the existing
  timerfd in 2.6.24 is still disabled.  (Andrew had a one liner for
  that, but I haven't checked if it's in place.)
  
 
 I have no timerfd patches here.

Yes, it's disabled, and yes, I'll repost today ...


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-23 Thread Davide Libenzi
On Fri, 23 Nov 2007, Ulrich Drepper wrote:

 On Nov 23, 2007 9:29 AM, Davide Libenzi [EMAIL PROTECTED] wrote:
  Yes, it's disabled, and yes, I'll repost today ...
 
 I haven't seen the patch and don't feel like searching.  So I say it
 here: please mak sure you add a flags parameter to the system call
 itself (instead of adding it on as for eventfd and signalfd).  We need
 to be able to use O_CLOEXEC some way or another.

I'm more then OK about adding a flags parameter. If it was for me, I'd add 
it even to eventfd and signalfd. I asked Linus if he was OK about adding 
the flags parameter to all. He didn't reply, and I read that as no.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 2/4] Timerfd v2 - new timerfd API

2007-11-23 Thread Davide Libenzi
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd. The clockid
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the otmr parameter is not 
NULL).
The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit
is set in the flags parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or {0, 
0}
if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are supported
(with the same interface).
Here's a simple test program I used to exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 fs/compat.c  |   32 ++-
 fs/timerfd.c |  197 ++-
 include/linux/compat.h   |7 +
 include/linux/syscalls.h |7 +
 4 files changed, 164 insertions(+), 79 deletions(-)

Index: linux-2.6.mod/fs/timerfd.c
===
--- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:13:19.0 -0800
+++ linux-2.6.mod/fs/timerfd.c  2007-11-23 13:36:39.0 -0800
@@ -25,13 +25,15 @@
struct hrtimer tmr;
ktime_t tintv;
wait_queue_head_t wqh;
+   u64 ticks;
int expired;
+   int clockid;
 };
 
 /*
  * This gets called when the timer event triggers. We set the expired
  * flag, but we do not re-arm the timer (in case it's necessary,
- * tintv.tv64 != 0) until the timer is read.
+ * tintv.tv64 != 0) until the timer is accessed.
  */
 static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
 {
@@ -40,13 +42,14 @@
 
spin_lock_irqsave(ctx-wqh.lock, flags);
ctx-expired = 1;
+   ctx-ticks++;
wake_up_locked(ctx-wqh);
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
return HRTIMER_NORESTART;
 }
 
-static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+static void timerfd_setup(struct timerfd_ctx *ctx, int flags,
  const struct itimerspec *ktmr)
 {
enum hrtimer_mode htmode;
@@ -57,8 +60,9 @@
 
texp = timespec_to_ktime(ktmr-it_value);
ctx-expired = 0;
+   ctx-ticks = 0;
ctx-tintv = timespec_to_ktime(ktmr-it_interval);
-   hrtimer_init(ctx-tmr, clockid, htmode);
+   hrtimer_init(ctx-tmr, ctx-clockid, htmode);
ctx-tmr.expires = texp;
ctx-tmr.function = timerfd_tmrproc;
if (texp.tv64 != 0)
@@ -83,7 +87,7 @@
poll_wait(file, ctx-wqh, wait);
 
spin_lock_irqsave(ctx-wqh.lock, flags);
-   if (ctx-expired)
+   if (ctx-ticks)
events |= POLLIN;
spin_unlock_irqrestore(ctx-wqh.lock, flags);
 
@@ -102,11 +106,11 @@
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
res = -EAGAIN;
-   if (!ctx-expired  !(file-f_flags  O_NONBLOCK)) {
+   if (!ctx-ticks  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-expired) {
+   if (ctx-ticks) {
res = 0;
break;
}
@@ -121,22 +125,21 @@
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (ctx-expired) {
-   ctx-expired = 0;
-   if (ctx-tintv.tv64 != 0) {
+   if (ctx-ticks) {
+   ticks = ctx-ticks;
+   if (ctx-expired  ctx-tintv.tv64) {
/*
 * If tintv.tv64 != 0, this is a periodic timer that
 * needs to be re-armed. We avoid doing it in the timer
 * callback to avoid DoS attacks specifying a very
 * short timer period.
 */
-   ticks = (u64)
-   hrtimer_forward(ctx-tmr,
-   hrtimer_cb_get_time(ctx-tmr),
-   ctx-tintv);
+   ticks += (u64) hrtimer_forward_now(ctx-tmr,
+  ctx-tintv) - 1;
hrtimer_restart(ctx-tmr);
-   } else
-   ticks = 1;
+   }
+   ctx-expired

[patch 3/4] Timerfd v2 - wire the new timerfd API to the x86 family

2007-11-23 Thread Davide Libenzi
Wires up the new timerfd API to the x86 family.



Signed-off-by: Davide Libenzi [EMAIL PROTECTED]


- Davide


---
 arch/x86/ia32/ia32entry.S  |4 +++-
 arch/x86/kernel/syscall_table_32.S |4 +++-
 include/asm-x86/unistd_32.h|6 --
 include/asm-x86/unistd_64.h|9 +++--
 4 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6.mod/include/asm-x86/unistd_32.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_32.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_32.h   2007-11-23 13:36:40.0 
-0800
@@ -327,13 +327,15 @@
 #define __NR_epoll_pwait   319
 #define __NR_utimensat 320
 #define __NR_signalfd  321
-#define __NR_timerfd   322
+#define __NR_timerfd_create322
 #define __NR_eventfd   323
 #define __NR_fallocate 324
+#define __NR_timerfd_settime   325
+#define __NR_timerfd_gettime   326
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 325
+#define NR_syscalls 327
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86/unistd_64.h
===
--- linux-2.6.mod.orig/include/asm-x86/unistd_64.h  2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/include/asm-x86/unistd_64.h   2007-11-23 13:36:40.0 
-0800
@@ -629,12 +629,17 @@
 __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait)
 #define __NR_signalfd  282
 __SYSCALL(__NR_signalfd, sys_signalfd)
-#define __NR_timerfd   283
-__SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_timerfd_create283
+__SYSCALL(__NR_timerfd_create, sys_timerfd_create)
 #define __NR_eventfd   284
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate 285
 __SYSCALL(__NR_fallocate, sys_fallocate)
+#define __NR_timerfd_settime   286
+__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime)
+#define __NR_timerfd_gettime   287
+__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime)
+
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S
===
--- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S   2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-23 
13:36:40.0 -0800
@@ -321,6 +321,8 @@
.long sys_epoll_pwait
.long sys_utimensat /* 320 */
.long sys_signalfd
-   .long sys_timerfd
+   .long sys_timerfd_create
.long sys_eventfd
.long sys_fallocate
+   .long sys_timerfd_settime   /* 325 */
+   .long sys_timerfd_gettime
Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 
13:13:18.0 -0800
+++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-23 13:36:40.0 
-0800
@@ -723,7 +723,9 @@
.quad sys_epoll_pwait
.quad compat_sys_utimensat  /* 320 */
.quad compat_sys_signalfd
-   .quad compat_sys_timerfd
+   .quad sys_timerfd_create
.quad sys_eventfd
.quad sys32_fallocate
+   .quad compat_sys_timerfd_settime/* 325 */
+   .quad compat_sys_timerfd_gettime
 ia32_syscall_end:

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Andrew Morton wrote:

> On Thu, 22 Nov 2007 11:46:13 -0800 (PST) Davide Libenzi <[EMAIL PROTECTED]> 
> wrote:
> 
> > On Thu, 22 Nov 2007, Michael Kerrisk wrote:
> > 
> > > On Nov 22, 2007 6:34 PM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > > > On Thu, 22 Nov 2007, Michael Kerrisk wrote:
> > > >
> > > > > Hey Davide,
> > > > >
> > > > > Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
> > > >
> > > > Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it 
> > > > was
> > > > merged. Some screw up reverted it?
> > > 
> > > t looks that way.
> > 
> > I'm looking at the log now. It never went in actually. Andrew-san, what 
> > happened?
> > 
> 
> Last I recall, we removed the API for 2.6.23 because we intended to do a
> different interface for 2.6.24.
> 
> But I don't recall seeing any timerfd patches in maybe a month.

Was sent on Sep 23, Subject: new timerfd API
Do you want me to repost?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Michael Kerrisk wrote:

> On Nov 22, 2007 6:34 PM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > On Thu, 22 Nov 2007, Michael Kerrisk wrote:
> >
> > > Hey Davide,
> > >
> > > Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
> >
> > Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was
> > merged. Some screw up reverted it?
> 
> t looks that way.

I'm looking at the log now. It never went in actually. Andrew-san, what 
happened?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Michael Kerrisk wrote:

> Hey Davide,
> 
> Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...

Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was 
merged. Some screw up reverted it?


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Michael Kerrisk wrote:

 Hey Davide,
 
 Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...

Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was 
merged. Some screw up reverted it?


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Michael Kerrisk wrote:

 On Nov 22, 2007 6:34 PM, Davide Libenzi [EMAIL PROTECTED] wrote:
  On Thu, 22 Nov 2007, Michael Kerrisk wrote:
 
   Hey Davide,
  
   Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
 
  Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was
  merged. Some screw up reverted it?
 
 t looks that way.

I'm looking at the log now. It never went in actually. Andrew-san, what 
happened?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Where is the new timerfd?

2007-11-22 Thread Davide Libenzi
On Thu, 22 Nov 2007, Andrew Morton wrote:

 On Thu, 22 Nov 2007 11:46:13 -0800 (PST) Davide Libenzi [EMAIL PROTECTED] 
 wrote:
 
  On Thu, 22 Nov 2007, Michael Kerrisk wrote:
  
   On Nov 22, 2007 6:34 PM, Davide Libenzi [EMAIL PROTECTED] wrote:
On Thu, 22 Nov 2007, Michael Kerrisk wrote:
   
 Hey Davide,

 Where is the new timerfd API.  In 2.6.24-rc3, I see the *old* API...
   
Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it 
was
merged. Some screw up reverted it?
   
   t looks that way.
  
  I'm looking at the log now. It never went in actually. Andrew-san, what 
  happened?
  
 
 Last I recall, we removed the API for 2.6.23 because we intended to do a
 different interface for 2.6.24.
 
 But I don't recall seeing any timerfd patches in maybe a month.

Was sent on Sep 23, Subject: new timerfd API
Do you want me to repost?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Davide Libenzi
On Tue, 20 Nov 2007, Ingo Molnar wrote:

> * H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> 
> > It seems that you're doing the same thing in both cases, except you're 
> > now extending it to include other random functionality, which means 
> > other things than syslets are suddenly affected.
> >
> > syslets are arguably a little bit different, since what you're 
> > effectively doing there is running a miniature interpreted language in 
> > kernel space.  A higher startup overhead should be acceptable, since 
> > you're amortizing it over a larger number of calls.  Extending that 
> > mechanism suddenly means you HAVE to use that interpreted language 
> > message mechanism to access certain system calls, which really does 
> > not seem like a good thing neither for performance nor for encouraging 
> > sane design of interfaces.
> 
> whether that interpreted syslet language survives is still an open 
> question - it was extremely ugly when i wrote the first version of it 
> and it only got uglier since then :-)

Aha! You admitted it finally :)



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Davide Libenzi
On Tue, 20 Nov 2007, Ingo Molnar wrote:

 * H. Peter Anvin [EMAIL PROTECTED] wrote:
 
  It seems that you're doing the same thing in both cases, except you're 
  now extending it to include other random functionality, which means 
  other things than syslets are suddenly affected.
 
  syslets are arguably a little bit different, since what you're 
  effectively doing there is running a miniature interpreted language in 
  kernel space.  A higher startup overhead should be acceptable, since 
  you're amortizing it over a larger number of calls.  Extending that 
  mechanism suddenly means you HAVE to use that interpreted language 
  message mechanism to access certain system calls, which really does 
  not seem like a good thing neither for performance nor for encouraging 
  sane design of interfaces.
 
 whether that interpreted syslet language survives is still an open 
 question - it was extremely ugly when i wrote the first version of it 
 and it only got uglier since then :-)

Aha! You admitted it finally :)



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: epoll design problems with common fork/exec patterns

2007-10-29 Thread Davide Libenzi
On Sun, 28 Oct 2007, David Schwartz wrote:

> 
> Eric Dumazet wrote:
> 
> > Events are not necessarly reported "by descriptors". epoll uses an opaque
> > field provided by the user.
> >
> > It's up to the user to properly chose a tag that will makes sense
> > if the user
> > app is playing dup()/close() games for example.
> 
> Great. So the only issue then is that the documentation is confusing. It
> frequently uses the term "fd" where it means file. For example, it says:
> 
>   Q1 What  happens  if  you  add  the  same fd to an
> epoll_set
>  twice?
> 
>   A1 You will probably get EEXIST.  However,  it  is
> possible
>  that  two  threads  may  add the same fd twice. This is
> a
>  harmless condition.
> 
> This gives no reason to think there's anything wrong with adding the same
> file twice so long as you do so through different descriptors. (One can
> imagine an application that does this to segregate read and write operations
> to avoid a race where the descriptor is closed from under a writer due to
> handling a fatal read error.) Obviously, that won't work.

I agree, that is confusing. However, you can safely add two different file 
descriptors pointing to the same file*, with different event masks, and 
that will work as expected.




> And this part:
> 
>   Q6 Will  the  close of an fd cause it to be removed from
> all
>  epoll sets automatically?
> 
>   A6 Yes.
> 
> This is incorrect. Closing an fd will not cause it to be removed from all
> epoll sets automatically. Only closing a file will. This is what caused the
> OP's confusion, and it is at best imprecise and, at worst, flat out wrong.

OTOH you cannot list *every* possible scenario in a man page, otherwise 
you end up writing a book instead of a man page. I will try to find some 
time with Michael to refine the man page.



> PS: It is customary to trim individuals off of CC lists when replying to a
> list when the subject matter of the post is squarely inside the subject of
> the list. If the person CC'd was interested in the list's subject, he or she
> would presumably subscribe to the list. Not everyone wants two copies of
> every post. Not everyone wants a personal copy of every sub-thread that
> results from a post they make. In the past few years, I've received
> approximately an equal number of complaints about trimming CC's on posts to
> LKML and not trimming CC's on such posts.

Does anyone that in 2007 still did not manage to find a way to avoid dups 
in hitting his mailbox, deserve any consideration at all?
OTOH many ppl, like myself, uses To and Cc header to direct email to 
proper folders, where they are treated with a different level of 
attention. And your stripp-all-headers mania screws that up badly.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: epoll design problems with common fork/exec patterns

2007-10-29 Thread Davide Libenzi
On Sun, 28 Oct 2007, David Schwartz wrote:

 
 Eric Dumazet wrote:
 
  Events are not necessarly reported by descriptors. epoll uses an opaque
  field provided by the user.
 
  It's up to the user to properly chose a tag that will makes sense
  if the user
  app is playing dup()/close() games for example.
 
 Great. So the only issue then is that the documentation is confusing. It
 frequently uses the term fd where it means file. For example, it says:
 
   Q1 What  happens  if  you  add  the  same fd to an
 epoll_set
  twice?
 
   A1 You will probably get EEXIST.  However,  it  is
 possible
  that  two  threads  may  add the same fd twice. This is
 a
  harmless condition.
 
 This gives no reason to think there's anything wrong with adding the same
 file twice so long as you do so through different descriptors. (One can
 imagine an application that does this to segregate read and write operations
 to avoid a race where the descriptor is closed from under a writer due to
 handling a fatal read error.) Obviously, that won't work.

I agree, that is confusing. However, you can safely add two different file 
descriptors pointing to the same file*, with different event masks, and 
that will work as expected.




 And this part:
 
   Q6 Will  the  close of an fd cause it to be removed from
 all
  epoll sets automatically?
 
   A6 Yes.
 
 This is incorrect. Closing an fd will not cause it to be removed from all
 epoll sets automatically. Only closing a file will. This is what caused the
 OP's confusion, and it is at best imprecise and, at worst, flat out wrong.

OTOH you cannot list *every* possible scenario in a man page, otherwise 
you end up writing a book instead of a man page. I will try to find some 
time with Michael to refine the man page.



 PS: It is customary to trim individuals off of CC lists when replying to a
 list when the subject matter of the post is squarely inside the subject of
 the list. If the person CC'd was interested in the list's subject, he or she
 would presumably subscribe to the list. Not everyone wants two copies of
 every post. Not everyone wants a personal copy of every sub-thread that
 results from a post they make. In the past few years, I've received
 approximately an equal number of complaints about trimming CC's on posts to
 LKML and not trimming CC's on such posts.

Does anyone that in 2007 still did not manage to find a way to avoid dups 
in hitting his mailbox, deserve any consideration at all?
OTOH many ppl, like myself, uses To and Cc header to direct email to 
proper folders, where they are treated with a different level of 
attention. And your stripp-all-headers mania screws that up badly.



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: epoll design problems with common fork/exec patterns

2007-10-28 Thread Davide Libenzi
On Sat, 27 Oct 2007, David Schwartz wrote:

>   I don't see how that can be. Suppose I add fd 8 to an epoll set. 
> Suppose fd
> 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
> since there no longer is an fd 8? Events on files registered for epoll
> notification are reported by descriptor, so the set membership has to be
> associated (as reflected into userspace) with the descriptor, not the file.

Eric already answered to your question (epoll deals with internal kernel 
objects - aka file*).
I just want to answer this one for another reason. WTF is wrong with all 
of you Cc-list-trimmers?
Could you *please* stop trimming Cc-lists?



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: epoll design problems with common fork/exec patterns

2007-10-28 Thread Davide Libenzi
On Sat, 27 Oct 2007, David Schwartz wrote:

   I don't see how that can be. Suppose I add fd 8 to an epoll set. 
 Suppose fd
 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
 since there no longer is an fd 8? Events on files registered for epoll
 notification are reported by descriptor, so the set membership has to be
 associated (as reflected into userspace) with the descriptor, not the file.

Eric already answered to your question (epoll deals with internal kernel 
objects - aka file*).
I just want to answer this one for another reason. WTF is wrong with all 
of you Cc-list-trimmers?
Could you *please* stop trimming Cc-lists?



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   10   >