SMP profiling -- patch, if someone cares to bring it up to date

2012-02-02 Thread Thor Lancelot Simon
I sent a patch considerably improving kernel profiling support to
tech-kern while I was with Coyote Point.  It is at:

http://mail-index.netbsd.org/tech-kern/2010/12/11/msg009519.html

I got a few comments about the code organization but not much else.  I
simply ran out of time to work on updating and integrating the patch.

I mention it here because anyone wanting to take good kernel profiles
on SMP NetBSD systems might want to look into integrating this.  I do
not have the time at present.

-- 
Thor Lancelot Simon    t...@panix.com
  All of my opinions are consistent, but I cannot present them all
   at once. -Jean-Jacques Rousseau, On The Social Contract


Re: Question about tcp ephemeral ports

2012-02-02 Thread Olivier MATZ
Hi,

Attached is a patch that makes my small test program work. It
applies to 5.1 and 5.1.1 only. Porting it to current would be a bit
harder due to the port randomization, as described by Eric
previously.

This is just a proof of concept and I would be happy to have some
feedback about how to write it better and what the potential
issues are.

Olivier
From 61c4012c89cd088f8f6e3f16f5e1306104232b28 Mon Sep 17 00:00:00 2001
From: Olivier Matz <olivier.m...@6wind.com>
Date: Thu, 2 Feb 2012 16:51:05 +0100
Subject: tcp: allow to reuse an ephemeral port if dest addr/port is different

When a TCP client calls connect(), an implicit bind is done by the
network stack to choose an ephemeral port. Currently, there is a
limitation that prevents the TCP client from opening many ephemeral
ports even if the destination port or address is different.

The problem is described in detail here:
http://mail-index.netbsd.org/tech-kern/2012/01/30/msg012602.html

This patch duplicates the code of in_pcbbind() in a new function,
in_pcbbind_before_connect(), that is called specifically by the TCP
connect code when doing an implicit bind. The behaviour is a bit
different compared to the initial in_pcbbind():

- only the (nam == NULL) case is allowed
- the function is aware of remote address that will be given to the
  connect(). The duplication of the ephemeral port is checked by a
  in_pcblookup_connect() instead of a in_pcblookup_port().
- the socket state is not changed to BOUND (but the pcb is added in
  the INPCBHASH_PORT table). The connect() will change the state to
  CONNECTED if it is successful.

If in_pcbconnect() fails, we need to restore the initial state:
inp->inp_lport set to 0, pcb moved back to INPCBHASH_PORT table[0],
INP_ANONPORT flag removed.

Note: this patch is just a proof of concept and should probably be
cleaned up and enhanced. Currently, only IPv4 is handled.
---
 netinet/in_pcb.c     |   88 ++
 netinet/in_pcb.h     |    2 +
 netinet/tcp_usrreq.c |   10 +-
 3 files changed, 99 insertions(+), 1 deletions(-)

diff --git a/netinet/in_pcb.c b/netinet/in_pcb.c
index 5d662ce..498a344 100644
--- a/netinet/in_pcb.c
+++ b/netinet/in_pcb.c
@@ -371,6 +371,94 @@ noname:
 	return (0);
 }
 
+int
+in_pcbbind_before_connect(void *v, struct in_addr raddr,
+	   u_int rport, struct lwp *l)
+{
+	struct inpcb *inp = v;
+	struct socket *so = inp->inp_socket;
+	struct inpcbtable *table = inp->inp_table;
+	struct sockaddr_in *sin = NULL; /* XXXGCC */
+	u_int16_t lport = 0;
+#ifndef IPNOPRIVPORTS
+	kauth_cred_t cred = l->l_cred;
+#endif
+	int	   cnt;
+	u_int16_t  mymin, mymax;
+	u_int16_t *lastport;
+
+	if (inp->inp_af != AF_INET)
+		return (EINVAL);
+
+	if (TAILQ_FIRST(&in_ifaddrhead) == 0)
+		return (EADDRNOTAVAIL);
+	if (inp->inp_lport || !in_nullhost(inp->inp_laddr))
+		return (EINVAL);
+
+	if (inp->inp_flags & INP_LOWPORT) {
+#ifndef IPNOPRIVPORTS
+		if (kauth_authorize_network(cred,
+		    KAUTH_NETWORK_BIND,
+		    KAUTH_REQ_NETWORK_BIND_PRIVPORT, so,
+		    sin, NULL))
+			return (EACCES);
+#endif
+		mymin = lowportmin;
+		mymax = lowportmax;
+		lastport = &table->inpt_lastlow;
+	} else {
+		mymin = anonportmin;
+		mymax = anonportmax;
+		lastport = &table->inpt_lastport;
+	}
+	if (mymin > mymax) {	/* sanity check */
+		u_int16_t swp;
+
+		swp = mymin;
+		mymin = mymax;
+		mymax = swp;
+	}
+
+	lport = *lastport - 1;
+	for (cnt = mymax - mymin + 1; cnt; cnt--, lport--) {
+		if (lport < mymin || lport > mymax)
+			lport = mymax;
+		if (!in_pcblookup_connect(table, inp->inp_laddr,
+		    htons(lport), raddr, htons(rport)))
+			goto found;
+	}
+	if (!in_nullhost(inp->inp_laddr))
+		inp->inp_laddr.s_addr = INADDR_ANY;
+	return (EAGAIN);
+
+ found:
+	inp->inp_flags |= INP_ANONPORT;
+	*lastport = lport;
+	lport = htons(lport);
+
+	inp->inp_lport = lport;
+	LIST_REMOVE(&inp->inp_head, inph_lhash);
+	LIST_INSERT_HEAD(INPCBHASH_PORT(table, inp->inp_lport), &inp->inp_head,
+			 inph_lhash);
+
+	return (0);
+}
+
+void
+in_pcbbind_revert(void *v)
+{
+	struct inpcb *inp = v;
+	struct inpcbtable *table = inp->inp_table;
+
+	/* Called from tcp_usrreq if the connect failed after an
+	 * implicit bind. This will restore the initial state */
+	inp->inp_flags &= ~INP_ANONPORT;
+	inp->inp_lport = 0;
+	LIST_REMOVE(&inp->inp_head, inph_lhash);
+	LIST_INSERT_HEAD(INPCBHASH_PORT(table, inp->inp_lport), &inp->inp_head,
+			 inph_lhash);
+}
+
 /*
  * Connect from a socket to a specified address.
  * Both address and port must be specified in argument sin.
diff --git a/netinet/in_pcb.h b/netinet/in_pcb.h
index 8e1d929..51a0a5c 100644
--- a/netinet/in_pcb.h
+++ b/netinet/in_pcb.h
@@ -125,6 +125,8 @@ struct inpcb {
 void	in_losing(struct inpcb *);
 int	in_pcballoc(struct socket *, void *);
 int	in_pcbbind(void *, struct mbuf *, struct lwp *);
+int	in_pcbbind_before_connect(void *, struct in_addr, u_int, struct lwp *);
+void	in_pcbbind_revert(void *v);
 int	in_pcbconnect(void *, struct mbuf *, struct lwp *);
 void	
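
The difference between the port-only check and the address/port tuple check described in the commit message can be illustrated with a small userland toy model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical, the table is a flat array rather than NetBSD's inpcb hash tables, and none of it is taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy connection record; not NetBSD's struct inpcb. */
struct toy_conn {
	uint32_t laddr, raddr;	/* local / remote address */
	uint16_t lport, rport;	/* local / remote port */
};

#define TOY_MAXCONN 16

struct toy_table {
	struct toy_conn conn[TOY_MAXCONN];
	size_t n;		/* number of entries in use */
};

/* Port-only check: busy as soon as any connection uses lport locally. */
static bool
toy_port_in_use(const struct toy_table *t, uint16_t lport)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->conn[i].lport == lport)
			return true;
	return false;
}

/* Tuple check: busy only if local and remote address/port all match. */
static bool
toy_tuple_in_use(const struct toy_table *t, const struct toy_conn *c)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->conn[i].laddr == c->laddr &&
		    t->conn[i].lport == c->lport &&
		    t->conn[i].raddr == c->raddr &&
		    t->conn[i].rport == c->rport)
			return true;
	return false;
}

/*
 * Pick a local port in [min, max] for a connection to (raddr, rport),
 * scanning downward from max, and record the connection.
 * Returns 0 when no port is available or the table is full.
 */
static uint16_t
toy_pick_port(struct toy_table *t, uint32_t laddr, uint32_t raddr,
    uint16_t rport, uint16_t min, uint16_t max, bool tuple_check)
{
	assert(min != 0 && min <= max);
	if (t->n == TOY_MAXCONN)
		return 0;
	for (uint32_t p = max; p >= min; p--) {
		struct toy_conn c = {
			.laddr = laddr, .raddr = raddr,
			.lport = (uint16_t)p, .rport = rport,
		};
		bool busy = tuple_check ?
		    toy_tuple_in_use(t, &c) : toy_port_in_use(t, (uint16_t)p);
		if (!busy) {
			t->conn[t->n++] = c;
			return (uint16_t)p;
		}
	}
	return 0;
}
```

With a one-port range, the port-only check refuses a second connection even to a different destination, while the tuple check accepts it and only refuses an exact duplicate.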

Re: Second stage bootloader (i386) hangs on ls command for ext2

2012-02-02 Thread Evgeniy Ivanov
On Sun, Dec 25, 2011 at 11:54 AM, Evgeniy Ivanov <lolkaanti...@gmail.com> wrote:
> Hi,
>
> On Sun, Dec 25, 2011 at 10:20 AM, Izumi Tsutsui <tsut...@ceres.dti.ne.jp> wrote:
>> Hi,
>>
>> Evgeniy Ivanov wrote:
>>
>>> Izumi, thank you for reviewing! New patches are attached.
>>  :
>>>> I think it's better to use a positive LIBSA_ENABLE_LS_OP option rather
>>>> than LIBSA_NO_LS_OP, and make whole (fs_ops)->ls op part optional because
>>>>  - there are many primary bootloaders (bootxx_foo) which don't need
>>>>    the ls op and have size restrictions (alpha, atari, pmax ...)
>>>>  - there are few bootloaders which support command prompt mode where
>>>>    the `ls' op is actually required (some ports don't have even getchar())
>>>
>>> Done.
>>>
>>>> We also have to check all other non-x86 bootloaders which refer ufs_ls().
>>>> (ews4800mips, ia64, landisk, x68k, zaurus etc)
>>>
>>> Done. I'm not able to check though, but the modification is trivial
>>> and almost the same as for i386.
>>
>> Committed all changes (with several fixes for ews4800mips and x68k)
>> http://mail-index.NetBSD.org/source-changes/2011/12/25/msg02.html
>
> Great!
>
>> Thank you for your great work!
>
> np :-)
>
>> Now it's time for someone[TM] to try PR/30866 :-)
>> http://gnats.NetBSD.org/30866
>
> Seems to be a useful feature, I'll work on this in Jan if it doesn't
> violate [TM] :P

Unfortunately I ran out of time, and I doubt I will get any time for
this soon...
So anybody is welcome to work on this feature.



-- 
Evgeniy


Re: kmem change related trouble

2012-02-02 Thread Frank Wille
Lars Heidieker wrote:

> I've just posted a patch ( http://www.netbsd.org/~para/fix.patch )
> - It moves uareas and buffer cache back to the kernel_map restoring
> the previous behavior. Sizing the kmem_arena is changed accordingly
> (Something I stepped on while checking evbmips on gxemul).
> - Code to drain pools if the kmem_arena runs out of space.

I tried your patch on sandpoint and ofppc. Unfortunately it doesn't change
anything.

Here are the last lines before the crash on ofppc (note that the warning
"no /dev/console" is wrong):

[...]
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
warning: no /dev/console
trap: kernel read DSI trap @ 0xa00011c8 by 0x3d6324 (DSISR 0x4000,
err=14), lr 0x3d6310
Press a key to panic.
panic: trap


Entering ddb shows the crash happened in pool_cache_get_paddr():

kernel DSI read trap @ 0xa00011c8 by pool_cache_get_paddr+0x4c: srr1=0x9032
r1=0xa22b9aa0 cr=0x28284084 xer=0x02000 ctr=0x1642c dsisr=0x4000

The backtrace:
copyright
kmem_intr_alloc
exec_elf32_makecmds
check_exec
execve1
start_init,
setfunc_trampoline

-- 
Frank Wille



Re: kmem change related trouble

2012-02-02 Thread Matt Thomas
> kernel DSI read trap @ 0xa00011c8 by pool_cache_get_paddr+0x4c: srr1=0x9032
> r1=0xa22b9aa0 cr=0x28284084 xer=0x02000 ctr=0x1642c dsisr=0x4000
>
> The backtrace:
> copyright
> kmem_intr_alloc
> exec_elf32_makecmds
> check_exec
> execve1
> start_init,
> setfunc_trampoline

Is that with the latest exec_elf.c?  I'd like to see if the location changes
with the latest one.

rw_lock vs mutex

2012-02-02 Thread Paul Goyette
While digging around looking into another problem, I noticed that the 
piixpm(4) driver uses an rw_lock for its ic_acquire_bus/ic_release_bus 
routines.  ic_acquire_bus() uses rw_enter(..., RW_WRITER) and there 
doesn't appear to be any use anywhere of RW_READER for that lock.


The man page for rw_lock implies that it is a superset of a mutex.  So 
I'm wondering if it makes any sense to use the simpler mutex instead?



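-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------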


Re: rw_lock vs mutex

2012-02-02 Thread Matt Thomas

On Feb 2, 2012, at 5:38 PM, Paul Goyette wrote:

> While digging around looking into another problem, I noticed that the
> piixpm(4) driver uses an rw_lock for its ic_acquire_bus/ic_release_bus
> routines.  ic_acquire_bus() uses rw_enter(..., RW_WRITER) and there doesn't
> appear to be any use anywhere of RW_READER for that lock.
>
> The man page for rw_lock implies that it is a superset of a mutex.  So I'm
> wondering if it makes any sense to use the simpler mutex instead?

Switch to a mutex, it's much less overhead than an r/w lock.

RE: rw_lock vs mutex

2012-02-02 Thread Paul_Koning
A rw_lock allows multiple readers, correct?  If there's a non-trivial
probability of concurrent reads, that would make a difference.  If not, then a
mutex would be just as good, especially if it has lower overhead.

paul

-----Original Message-----
From: tech-kern-ow...@netbsd.org [mailto:tech-kern-ow...@netbsd.org] On Behalf Of Matt Thomas
Sent: Thursday, February 02, 2012 8:53 PM
To: p...@whooppee.com
Cc: tech-kern@netbsd.org
Subject: Re: rw_lock vs mutex


On Feb 2, 2012, at 5:38 PM, Paul Goyette wrote:

> While digging around looking into another problem, I noticed that the
> piixpm(4) driver uses an rw_lock for its ic_acquire_bus/ic_release_bus
> routines.  ic_acquire_bus() uses rw_enter(..., RW_WRITER) and there doesn't
> appear to be any use anywhere of RW_READER for that lock.
>
> The man page for rw_lock implies that it is a superset of a mutex.  So I'm
> wondering if it makes any sense to use the simpler mutex instead?

Switch to a mutex, it's much less overhead than an r/w lock.


RE: rw_lock vs mutex

2012-02-02 Thread Paul Goyette

On Thu, 2 Feb 2012, paul_kon...@dell.com wrote:

> A rw_lock allows multiple readers, correct?  If there's a non-trivial
> probability of concurrent reads that would make a difference.  If not,
> then a mutex would be just as good especially if that is lower
> overhead.


The rwlock in question is contained within the driver's softc.  The only
exported accessors are i2c_bus_acquire() (which grabs a RW_WRITER lock)
and i2c_bus_release().


A mutex makes much more sense.
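
As a rough illustration of why a lock that is only ever taken exclusively can be a plain mutex, here is a userland analogy using POSIX threads. It is a sketch only: `struct toy_softc` and the `toy_*` functions are hypothetical stand-ins, not the piixpm(4) code, and in the kernel the corresponding calls would presumably be the mutex(9) `mutex_enter()`/`mutex_exit()` pair rather than pthreads.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the driver's softc; field names are made up. */
struct toy_softc {
	pthread_mutex_t sc_i2c_lock;	/* replaces the writer-only rwlock */
	int sc_bus_owned;		/* 1 while a caller owns the bus */
};

/* Initialise the lock; returns 0 on success, like pthread_mutex_init(). */
static int
toy_softc_init(struct toy_softc *sc)
{
	sc->sc_bus_owned = 0;
	return pthread_mutex_init(&sc->sc_i2c_lock, NULL);
}

/* Exclusive acquire: the only mode the bus accessors ever need. */
static int
toy_acquire_bus(struct toy_softc *sc)
{
	int error = pthread_mutex_lock(&sc->sc_i2c_lock);

	if (error == 0) {
		assert(sc->sc_bus_owned == 0);
		sc->sc_bus_owned = 1;
	}
	return error;
}

/* Release the bus; must be called by the current owner. */
static int
toy_release_bus(struct toy_softc *sc)
{
	assert(sc->sc_bus_owned == 1);
	sc->sc_bus_owned = 0;
	return pthread_mutex_unlock(&sc->sc_i2c_lock);
}
```

Since there is no shared (reader) path at all, nothing is lost by dropping the read/write distinction.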


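-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------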


Re: extended attributes and lsextattr/extattr_list_file

2012-02-02 Thread YAMAMOTO Takashi
hi,

> YAMAMOTO Takashi <y...@mwd.biglobe.ne.jp> wrote:
>
>> we need to decide what to be shipped for netbsd-6.  (hope it isn't too late.)
>> is anyone against the removal of freebsd-style syscalls?
>
> We will need some macro to discover what API is available: FreeBSD-like
> in 5.x and Linux-like in 6.0.

i was assuming we have no releases on which the freebsd-style API is
actually usable.  is it wrong?

YAMAMOTO Takashi

 
 -- 
 Emmanuel Dreyfus
 http://hcpnet.free.fr/pubz
 m...@netbsd.org


Re: extended attributes and lsextattr/extattr_list_file

2012-02-02 Thread Emmanuel Dreyfus
YAMAMOTO Takashi <y...@mwd.biglobe.ne.jp> wrote:

>> We will need some macro to discover what API is available: FreeBSD-like
>> in 5.x and Linux-like in 6.0.
> i was assuming we have no releases on which the freebsd-style API is
> actually usable.  is it wrong?

Yes, you are right.

I committed code in glusterfs that used it, but it is inside #ifndef
HAVE_SYS_XATTR_H, so it will automatically revert to the
Linux-style API when sys/xattr.h is available.

Therefore we have no problem.
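
The compile-time selection described above might look roughly like the following. This is a hedged sketch, not the glusterfs code: the wrapper name `toy_get_user_xattr()` and the `HAVE_SYS_EXTATTR_H` macro are assumptions made for illustration (the message only mentions `HAVE_SYS_XATTR_H`), and the fallback branch exists only so that the sketch builds when neither header was detected.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

#if defined(HAVE_SYS_XATTR_H)
#include <sys/xattr.h>		/* Linux-style API: getxattr() */
#elif defined(HAVE_SYS_EXTATTR_H)	/* hypothetical configure macro */
#include <sys/extattr.h>	/* FreeBSD-style API: extattr_get_file() */
#endif

/*
 * Hypothetical wrapper: read the user-namespace attribute `name` of
 * `path` into `buf`.  Returns the attribute size, or -1 with errno set.
 */
static ssize_t
toy_get_user_xattr(const char *path, const char *name, void *buf, size_t len)
{
#if defined(HAVE_SYS_XATTR_H)
	/* The Linux-style API encodes the namespace as a name prefix. */
	char full[256];
	int n = snprintf(full, sizeof(full), "user.%s", name);

	if (n < 0 || (size_t)n >= sizeof(full)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return getxattr(path, full, buf, len);
#elif defined(HAVE_SYS_EXTATTR_H)
	/* The FreeBSD-style API takes the namespace as a separate argument. */
	return extattr_get_file(path, EXTATTR_NAMESPACE_USER, name, buf, len);
#else
	/* Neither header was detected at configure time. */
	(void)path; (void)name; (void)buf; (void)len;
	errno = ENOTSUP;
	return -1;
#endif
}
```

Keying the choice on header availability rather than on an OS version means the caller picks up the Linux-style API as soon as sys/xattr.h exists, without any source change.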

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org