[PATCH] Port leak when using clisp

2015-08-26 Thread Flávio Cruz
Hi all

I'm trying to port the SBCL compiler to the Hurd but I've found a serious
port leak which prevents me from completing the port.

So, the first thing I did was to compile CLISP in order to bootstrap SBCL.
CLISP compiled just fine and runs well as far as I know.

Next, I started the bootstrapping process to compile SBCL but it would
always fail with a kernel panic since Mach was unable to allocate more
slabs for the ipc_port data structure. I figured out that the kernel had
thousands of inactive ports (which had a single reference and one send-only
right). I wrote a very small script in lisp to reproduce the bug, which
simply loads a file several times, like this:

(loop for i from 0 to 100
do (load hello-world))

and the number of inactive ports in the kernel will increase (never
reclaimed), even when the CLISP process is killed. I've also attempted to
use a separate partition and the ports would still hang around after the
ext2fs translator was terminated.

After looking around and using KDB, I've figured out that the following
loop in kern/exceptions.c/exception_raise_continue_slow was the culprit:

while (mr == MACH_RCV_INTERRUPTED) {
/*
* Somebody is trying to force this thread
* to a clean point.  We must cooperate
* and then resume the receive.
*/

while (thread_should_halt(self)) {
/* don't terminate while holding a reference */
if (self-ast  AST_TERMINATE)
ipc_port_release(reply_port);
thread_halt_self();
}

ip_lock(reply_port);

}

The reply port of the target thread (from CLISP?) is not being released
since it enters the while(thread_should_halt(self)) loop and the thread
terminates inside thread_halt_self but the previous condition does not
hold, which fails to release the port. I also think that once the code
enters thread_halt_self, it never comes back since the stack is discarded
(for both if cases inside the function).

I've changed the code to make sure the port is released when
thread_should_halt is true. So far, the kernel works great and I was
finally able to complete the second bootstrapping stage for SBCL. Please
see the attached patch and let me know what you think.

Regards
Flavio


port-leak.patch
Description: Binary data


Re: [RFC mach] vm: fix locking issues

2015-08-26 Thread Samuel Thibault
Justus Winter, le Mon 17 Aug 2015 17:51:09 +0200, a écrit :
 @@ -2157,10 +2162,13 @@ start_pass_1:
   /*
*  Check for permanent objects in the destination.
*/
 -
 - if ((entry-object.vm_object != VM_OBJECT_NULL) 
 -!entry-object.vm_object-temporary)
 - contains_permanent_objects = TRUE;
 + object = entry-object.vm_object;
 + if ((object != VM_OBJECT_NULL)
 +  ! contains_permanent_objects) {
 + vm_object_lock(object);
 + contains_permanent_objects = object-temporary;

This looks wrong, shouldn't it be !object-temporary?

Also, taking the lock just to read a boolean usually means there's
something bigger wrong somewhere.  Either the boolean was not meant to
be protected (e.g. because it is constant), or more than just reading it
should be locked, for the whole coherency.  Here it never changes except
when memory_object_set_attributes_common is called, but then there is
nothing that will prevent memory_object_set_attributes_common from doing
it between the time you take the lock, get the temporary field, release
the lock, and the time when you actually do something about it, e.g.
here return KERN_FAILURE because you believe that there's a permanent
object while it has just been turned temporary actually. That's probably
not a problem in practice because the two calls will probably have some
dependency, but that's the actual issue here.

 @@ -2275,8 +2284,15 @@ start_pass_1:
*/
  
   object = entry-object.vm_object;
 + temporary = 0;
 + if (object != VM_OBJECT_NULL) {
 + vm_object_lock(object);
 + temporary = object-temporary;
 + vm_object_unlock(object);
 + }
 +

Here it's worse: the copy strategy changes. Perhaps we could think that
it doesn't pose problem in practice because the two calls would probably
have a dependency, I don't know about memory objects to know.

 diff --git a/vm/vm_object.c b/vm/vm_object.c
 index deac0c2..1d3e727 100644
 --- a/vm/vm_object.c
 +++ b/vm/vm_object.c
 @@ -217,9 +217,11 @@ static void _vm_object_setup(
   vm_size_t   size)
  {
   *object = vm_object_template;
 - queue_init(object-memq);
   vm_object_lock_init(object);
 + vm_object_lock(object);
 + queue_init(object-memq);
   object-size = size;
 + vm_object_unlock(object);

Here it's a fresh object, that's why there's no need to lock, but oh
well.

 @@ -244,8 +246,11 @@ vm_object_t vm_object_allocate(
   port = ipc_port_alloc_kernel();
   if (port == IP_NULL)
   panic(vm_object_allocate);
 +
 + vm_object_lock(object);
   object-pager_name = port;
   ipc_kobject_set(port, (ipc_kobject_t) object, IKOT_PAGING_NAME);
 + vm_object_unlock(object);

Ditto (and later with new_copy and result)

 @@ -1474,14 +1491,11 @@ vm_object_t vm_object_copy_delayed(
*  must be done carefully, to avoid deadlock.
*/
  
 - /*
 -  *  Allocate a new copy object before locking, even
 -  *  though we may not need it later.
 -  */
 + vm_object_lock(src_object);
  
   new_copy = vm_object_allocate(src_object-size);
  
 - vm_object_lock(src_object);

Better make a copy of src_object-size to avoid keeping the object
locked while we allocate stuff.

 @@ -252,6 +252,8 @@ vm_pageout_setup(
   vm_object_unlock(new_object);
   }
  
 + vm_object_lock(old_object);
 +
   if (flush) {
   /*
*  Create a place-holder page where the old one was,
 @@ -262,7 +264,6 @@ vm_pageout_setup(
   == VM_PAGE_NULL)
   vm_page_more_fictitious();
 - vm_object_lock(old_object);

Why keeping the lock while allocating a page?

   vm_page_lock_queues();
   vm_page_remove(m);
   vm_page_unlock_queues();
 @@ -281,8 +282,6 @@ vm_pageout_setup(
   VM_EXTERNAL_STATE_EXISTS);
  #endif   /* MACH_PAGEMAP */
  
 - vm_object_unlock(old_object);

Why keeping it locked here?  For accessing old_object-internal?  It is
constant across the life of an object, thus does not need to be
protected by the lock.

Samuel



Re: [PATCH] Port leak when using clisp

2015-08-26 Thread Flávio Cruz
Em qua, 26 de ago de 2015 às 12:18, Justus Winter 
4win...@informatik.uni-hamburg.de escreveu:


  After looking around and using KDB, I've figured out that the following
 loop in
  kern/exceptions.c/exception_raise_continue_slow was the culprit:

 Could you elaborate a little on your method?


I changed the code to chain all the allocated ports together in a linked
list. Then, I added new fields to ipc_port to track send-only rights and to
understand where those rights were being created and where the port was
being manipulated. Finally, I added a new KDB command to show all the
inactive ports. It was then easy to track exactly how the port was handled
inside exception_raise_continue_slow.


  while (mr == MACH_RCV_INTERRUPTED) {
  /*
  * Somebody is trying to force this thread
  * to a clean point.  We must cooperate
  * and then resume the receive.
  */
 
  while (thread_should_halt(self)) {
  /* don't terminate while holding a reference */
  if (self-ast  AST_TERMINATE)
  ipc_port_release(reply_port);
  thread_halt_self();
  }
 
  ip_lock(reply_port);
  
  }
 
  The reply port of the target thread (from CLISP?) is not being released
 since
  it enters the while(thread_should_halt(self)) loop and the thread
 terminates
  inside thread_halt_self but the previous condition does not hold, which
 fails
  to release the port. I also think that once the code enters
 thread_halt_self,
  it never comes back since the stack is discarded (for both if cases
 inside the
  function).

 While I agree that the function thread_halt_self does not return, it
 only terminates the thread if the AST_TERMINATE flag is set.
 Otherwise, it returns to userspace.


Yes, it returns to userspace using thread_exception_return, but notice that
a reference must be released anyway (check first statement after the while
loop). For the else case in thread_halt_self, the thread may throw away the
current stack and run thread_exception_return in thread_block, which is
also executed in the case of success in exception_raise_continue_slow.


  I've changed the code to make sure the port is released when
 thread_should_halt
  is true. So far, the kernel works great and I was finally able to
 complete the
  second bootstrapping stage for SBCL. Please see the attached patch and
 let me
  know what you think.

 Fwiw, I prefer inlined patches so I can reply easier ;)


OK ;)



  diff --git a/kern/exception.c b/kern/exception.c
  index 6cb3bfb..c68c5cb 100644
  --- a/kern/exception.c
  +++ b/kern/exception.c
  @@ -875,12 +875,10 @@ exception_raise_continue_slow(
 *  and then resume the receive.
 */
 
  - while (thread_should_halt(self)) {
  - /* don't terminate while holding a reference */
  - if (self-ast  AST_TERMINATE)

 Unless I'm missing something, you're basically removing this
 conditional.  For me, it is not obvious that this is correct.  B/c now
 we've given up our reference, return to userspace, and userspace is
 likely re-trying the same again.  So when it does, isn't the reference
 missing now?


See above. Furthermore, the port has 2 references and 1 must always be
released (see line 387, where those ports are created).



  - ipc_port_release(reply_port);
  + if (thread_should_halt(self))
  + ipc_port_release(reply_port);
  + while (thread_should_halt(self))
thread_halt_self();
  - }

 I find the use of `while' dodgy in the first place b/c if
 thread_halt_self doesn't return, an if would do the trick as well and
 don't imply the chance of that code looping...


I think there might be some situations where it will loop around depending
on the scheduling order (see line 798 on sched_prim.c).

I've added a continuation argument to thread_halt_self so that the port is
released if the thread resumes using the continuation (followed by
 thread_exception_return). Otherwise, it may loop and it will eventually
also release the reference and call thread_exception_return.

I've also made some cosmetic changes to the code.

diff --git a/kern/ast.c b/kern/ast.c
index 4b9d63d..2772ed3 100644
--- a/kern/ast.c
+++ b/kern/ast.c
@@ -96,7 +96,7 @@ ast_taken(void)
  if (self != current_processor()-idle_thread) {
 #ifndef MIGRATING_THREADS
  while (thread_should_halt(self))
- thread_halt_self();
+ thread_halt_self(thread_exception_return);
 #endif

  /*
diff --git a/kern/exception.c b/kern/exception.c
index 6cb3bfb..9ad1c42 100644
--- a/kern/exception.c
+++ b/kern/exception.c
@@ -231,7 +231,7 @@ exception_no_server(void)
  */

  while (thread_should_halt(self))
- thread_halt_self();
+ thread_halt_self(thread_exception_return);


 #if 0
@@ -257,7 +257,7 @@ exception_no_server(void)
  */

  (void) task_terminate(self-task);
- thread_halt_self();
+ thread_halt_self(thread_exception_return);
  panic(terminating the task didn't kill us);
  /*NOTREACHED*/
 }
@@ 

Re: [PATCH] Port leak when using clisp

2015-08-26 Thread Justus Winter
Hi Flávio :)

Quoting Flávio Cruz (2015-08-26 01:46:16)
 Next, I started the bootstrapping process to compile SBCL but it would always
 fail with a kernel panic since Mach was unable to allocate more slabs for the
 ipc_port data structure. I figured out that the kernel had thousands of
 inactive ports (which had a single reference and one send-only right).

A very interesting find :)

 and the number of inactive ports in the kernel will increase (never 
 reclaimed),
 even when the CLISP process is killed. I've also attempted to use a separate
 partition and the ports would still hang around after the ext2fs translator 
 was
 terminated.

Indeed we have suspected that there is a bug like this for some time now.

 After looking around and using KDB, I've figured out that the following loop 
 in
 kern/exceptions.c/exception_raise_continue_slow was the culprit:

Could you elaborate a little on your method?

 while (mr == MACH_RCV_INTERRUPTED) {
 /*
 * Somebody is trying to force this thread
 * to a clean point.  We must cooperate
 * and then resume the receive.
 */
 
 while (thread_should_halt(self)) {
 /* don't terminate while holding a reference */
 if (self-ast  AST_TERMINATE)
 ipc_port_release(reply_port);
 thread_halt_self();
 }
 
 ip_lock(reply_port);
 
 }
 
 The reply port of the target thread (from CLISP?) is not being released since
 it enters the while(thread_should_halt(self)) loop and the thread terminates
 inside thread_halt_self but the previous condition does not hold, which fails
 to release the port. I also think that once the code enters thread_halt_self,
 it never comes back since the stack is discarded (for both if cases inside the
 function).

While I agree that the function thread_halt_self does not return, it
only terminates the thread if the AST_TERMINATE flag is set.
Otherwise, it returns to userspace.

 I've changed the code to make sure the port is released when 
 thread_should_halt
 is true. So far, the kernel works great and I was finally able to complete the
 second bootstrapping stage for SBCL. Please see the attached patch and let me
 know what you think.

Fwiw, I prefer inlined patches so I can reply easier ;)

 diff --git a/kern/exception.c b/kern/exception.c
 index 6cb3bfb..c68c5cb 100644
 --- a/kern/exception.c
 +++ b/kern/exception.c
 @@ -875,12 +875,10 @@ exception_raise_continue_slow(
*  and then resume the receive.
*/

 - while (thread_should_halt(self)) {
 - /* don't terminate while holding a reference */
 - if (self-ast  AST_TERMINATE)

Unless I'm missing something, you're basically removing this
conditional.  For me, it is not obvious that this is correct.  B/c now
we've given up our reference, return to userspace, and userspace is
likely re-trying the same again.  So when it does, isn't the reference
missing now?

 - ipc_port_release(reply_port);
 + if (thread_should_halt(self))
 + ipc_port_release(reply_port);
 + while (thread_should_halt(self))
   thread_halt_self();
 - }

I find the use of `while' dodgy in the first place b/c if
thread_halt_self doesn't return, an if would do the trick as well and
don't imply the chance of that code looping...

Cheers,
Justus



Re: [PATCH] Rump on GNU/Hurd (3): system limits

2015-08-26 Thread Antti Kantee

On 24/08/15 21:24, Robert Millan wrote:

El 16/08/15 a les 15:07, Antti Kantee ha escrit:


Can you submit the patches against NetBSD tools directly to NetBSD?
It's hard to test that one didn't break any platform when mucking
around with the crosstools, so it never hurts to have the maximal
amount of eyeballs on changes there.  Specifically, I'm not sure what
the fallout from AC_CANONICAL_HOST() will be (if any).  If nobody else
has comments or handles the PR, I can commit your patches in a few days.

https://www.netbsd.org/cgi-bin/sendpr.cgi?gndb=netbsd


Here:

https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=50166


So, I tried it, and indeed there was fallout from AC_CANONICAL_HOST() 
(as noted in the PR).  Can you try to work around it somehow? 
src/tools/make/configure.ac contains the official way to regenerate 
configure.




Re: [PATCH] Check sysheaders when looking for Mach and Hurd headers.

2015-08-26 Thread Samuel Thibault
Hello,

Manolis Ragkousis, le Fri 14 Aug 2015 14:42:46 +0300, a écrit :
 With this patch, glibc's configure, when checking for Mach and Hurd
 headers, takes
 into account the --with-headers argument.

Thanks!

(we need it for the Debian bootstrap too, indeed :)

 I am not sending this to libc-alpha because I noticed there are some
 differences in the
 files I am patching and the patch won't apply there. I will send to
 libc-alpha an updated
 patch that applies to their tree.

Please do, the patch looks completely sane (and is actually what other
configure.ac files do).

Samuel



Re: [PATCH] Check sysheaders when looking for Mach and Hurd headers.

2015-08-26 Thread Manolis Ragkousis
On 27 August 2015 at 00:44, Samuel Thibault samuel.thiba...@gnu.org wrote:
 Please do, the patch looks completely sane (and is actually what other
 configure.ac files do).

Will do :)