Re: Network stack changes
Sam Fourman Jr. wrote:
> > And any time you increase latency, that will have a negative impact
> > on NFS performance. NFS RPCs are usually small messages (except Write
> > requests and Read replies) and the RTT for these (mostly small,
> > bidirectional) messages can have a significant impact on NFS perf.
> > rick
>
> This may be a bit off topic, but not much... I have wondered, with all
> of the new TCP congestion control algorithms
> http://freebsdfoundation.blogspot.com/2011/03/summary-of-five-new-tcp-congestion.html
> which algorithm is best suited for NFS over gigabit Ethernet, say
> FreeBSD to FreeBSD. And furthermore, would an NFS-optimized TCP
> algorithm be useful?

I have no idea what effect they might have. NFS traffic is quite
different from streaming or bulk data transfer. I think this might make
a nice research project for someone.

rick

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Network stack changes
George Neville-Neil wrote:

On Aug 29, 2013, at 7:49, Adrian Chadd adr...@freebsd.org wrote:
> Hi,
>
> There's a lot of good stuff to review here, thanks!
>
> Yes, the ixgbe RX lock needs to die in a fire. It's kind of pointless
> to keep locking things like that on a per-packet basis. We should be
> able to do this in a cleaner way - we can defer RX into a CPU-pinned
> taskqueue and convert the interrupt handler to a fast handler that
> just schedules that taskqueue. We can ignore the ithread entirely
> here. What do you think?
>
> Totally pie-in-the-sky handwaving at this point:
>
> * create an array of mbuf pointers for completed mbufs;
> * populate the mbuf array;
> * pass the array up to ether_demux().
>
> For VLAN handling, it may end up populating its own list of mbufs to
> push up to ether_demux(). So maybe we should extend the API to have a
> bitmap of packets to actually handle from the array, so we can pass up
> a larger array of mbufs, note which ones are for the destination, and
> then the upcall can mark which frames it has consumed.
>
> I specifically wonder how much work/benefit we may see by doing:
>
> * batching packets into lists, so various steps can batch-process
>   things rather than run to completion;
> * batching the processing of a list of frames under a single lock
>   instance - e.g., if the forwarding code could do the forwarding
>   lookup for 'n' packets under a single lock, then pass that list of
>   frames up to inet_pfil_hook() to do the work under one lock, etc.
>
> Here, the processing would look less like "grab lock and process to
> completion" and more like "mark and sweep" - i.e., we have a list of
> frames that we mark as needing processing and as having been processed
> at each layer, so we know where to next dispatch them.

One quick note here. Every time you increase batching you may increase
bandwidth, but you will also increase per-packet latency for the last
packet in a batch. That is fine, so long as we remember that and treat
it as a tuning knob to balance the two.
> > And any time you increase latency, that will have a negative impact
> > on NFS performance. NFS RPCs are usually small messages (except
> > Write requests and Read replies) and the RTT for these (mostly
> > small, bidirectional) messages can have a significant impact on NFS
> > perf.
> > rick
>
> I still have some tool coding to do with PMC before I even think about
> tinkering with this, as I'd like to measure stuff like per-packet
> latency as well as top-level processing overhead (i.e.,
> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
> interrupts on that core, etc.)

This would be very useful in identifying the actual hot spots, and
would be helpful to anyone who can generate a decent stream of packets
with, say, an IXIA.

Best,
George
review of patches for the gssd that handle getpwXX_r ERANGE return
Hi,

I have attached two patches, which can also be found at:
http://people.freebsd.org/~rmacklem/getpw.patch1 and getpw.patch2

They are almost identical and handle the ERANGE error return from
getpw[nam|uid]_r() when buf[128] isn't large enough. Is anyone
interested in reviewing these?

(This has been discussed some time ago, but the patch was never
reviewed. Actually, I reviewed a patch similar to this, but the
submitter subsequently requested that I not use their patch, so I wrote
similar ones.)

Thanks in advance for any review, rick

--- usr.sbin/gssd/gssd.c.sav	2013-04-26 20:38:45.0 -0400
+++ usr.sbin/gssd/gssd.c	2013-04-26 20:38:53.0 -0400
@@ -37,6 +37,7 @@ __FBSDID("$FreeBSD: head/usr.sbin/gssd/g
 #include <ctype.h>
 #include <dirent.h>
 #include <err.h>
+#include <errno.h>
 #ifndef WITHOUT_KERBEROS
 #include <krb5.h>
 #endif
@@ -557,8 +558,11 @@ gssd_pname_to_uid_1_svc(pname_to_uid_arg
 {
 	gss_name_t name = gssd_find_resource(argp->pname);
 	uid_t uid;
-	char buf[128];
+	char buf[1024], *bufp;
 	struct passwd pwd, *pw;
+	size_t buflen;
+	int error;
+	static size_t buflen_hint = 1024;
 
 	memset(result, 0, sizeof(*result));
 	if (name) {
@@ -567,7 +571,24 @@ gssd_pname_to_uid_1_svc(pname_to_uid_arg
 		    name, argp->mech, &uid);
 		if (result->major_status == GSS_S_COMPLETE) {
 			result->uid = uid;
-			getpwuid_r(uid, &pwd, buf, sizeof(buf), &pw);
+			buflen = buflen_hint;
+			for (;;) {
+				pw = NULL;
+				bufp = buf;
+				if (buflen > sizeof(buf))
+					bufp = malloc(buflen);
+				if (bufp == NULL)
+					break;
+				error = getpwuid_r(uid, &pwd, bufp, buflen,
+				    &pw);
+				if (error != ERANGE)
+					break;
+				if (buflen > sizeof(buf))
+					free(bufp);
+				buflen += 1024;
+				if (buflen > buflen_hint)
+					buflen_hint = buflen;
+			}
 			if (pw) {
 				int len = NGRPS;
 				int groups[NGRPS];
@@ -584,6 +605,8 @@ gssd_pname_to_uid_1_svc(pname_to_uid_arg
 				result->gidlist.gidlist_len = 0;
 				result->gidlist.gidlist_val = NULL;
 			}
+			if (bufp != NULL && buflen > sizeof(buf))
+				free(bufp);
 		}
 	} else {
 		result->major_status = GSS_S_BAD_NAME;

--- kerberos5/lib/libgssapi_krb5/pname_to_uid.c.sav	2013-04-26 20:37:45.0 -0400
+++ kerberos5/lib/libgssapi_krb5/pname_to_uid.c	2013-04-27 16:25:14.0 -0400
@@ -26,6 +26,7 @@
  */
 /* $FreeBSD: head/kerberos5/lib/libgssapi_krb5/pname_to_uid.c 181344 2008-08-06 14:02:05Z dfr $ */
 
+#include <errno.h>
 #include <pwd.h>
 
 #include <krb5/gsskrb5_locl.h>
@@ -37,8 +38,12 @@ _gsskrb5_pname_to_uid(OM_uint32 *minor_s
 	krb5_context context;
 	krb5_const_principal name = (krb5_const_principal) pname;
 	krb5_error_code kret;
-	char lname[MAXLOGNAME + 1], buf[128];
+	char lname[MAXLOGNAME + 1], buf[1024], *bufp;
 	struct passwd pwd, *pw;
+	size_t buflen;
+	int error;
+	OM_uint32 ret;
+	static size_t buflen_hint = 1024;
 
 	GSSAPI_KRB5_INIT (&context);
@@ -49,11 +54,30 @@ _gsskrb5_pname_to_uid(OM_uint32 *minor_s
 	}
 	*minor_status = 0;
-	getpwnam_r(lname, &pwd, buf, sizeof(buf), &pw);
+	buflen = buflen_hint;
+	for (;;) {
+		pw = NULL;
+		bufp = buf;
+		if (buflen > sizeof(buf))
+			bufp = malloc(buflen);
+		if (bufp == NULL)
+			break;
+		error = getpwnam_r(lname, &pwd, bufp, buflen, &pw);
+		if (error != ERANGE)
+			break;
+		if (buflen > sizeof(buf))
+			free(bufp);
+		buflen += 1024;
+		if (buflen > buflen_hint)
+			buflen_hint = buflen;
+	}
 	if (pw) {
 		*uidp = pw->pw_uid;
-		return (GSS_S_COMPLETE);
+		ret = GSS_S_COMPLETE;
 	} else {
-		return (GSS_S_FAILURE);
+		ret = GSS_S_FAILURE;
 	}
+	if (bufp != NULL && buflen > sizeof(buf))
+		free(bufp);
+	return (ret);
 }
Re: stupid UFS behaviour on random writes
Stefan Esser wrote:
> On 18.01.2013 00:01, Rick Macklem wrote:
> > Wojciech Puchar wrote:
> > > create 10GB file (on 2GB RAM machine, with some swap used, to make
> > > sure little cache would be available for the filesystem):
> > >
> > >   dd if=/dev/zero of=file bs=1m count=10k
> > >
> > > block size is 32KB, fragment size 4k. now test random read access
> > > to it (10 threads):
> > >
> > >   randomio test 10 0 0 4096
> > >
> > > normal result on such a not-so-fast disk in my laptop:
> > >
> > >   118.5 | 118.5  5.8  82.3  383.2  85.6 |  0.0  inf  nan  0.0  nan
> > >   138.4 | 138.4  3.9  72.2  499.7  76.1 |  0.0  inf  nan  0.0  nan
> > >   142.9 | 142.9  5.4  69.9  297.7  60.9 |  0.0  inf  nan  0.0  nan
> > >   133.9 | 133.9  4.3  74.1  480.1  75.1 |  0.0  inf  nan  0.0  nan
> > >   138.4 | 138.4  5.1  72.1  380.0  71.3 |  0.0  inf  nan  0.0  nan
> > >   145.9 | 145.9  4.7  68.8  419.3  69.6 |  0.0  inf  nan  0.0  nan
> > >
> > > systat shows 4kB I/O size. all is fine. BUT random 4kB writes:
> > >
> > >   randomio test 10 1 0 4096
> > >
> > >   total |  read:  latency (ms)          |  write:  latency (ms)
> > >    iops |  iops  min  avg   max    sdev |  iops  min  avg    max     sdev
> > >   ------+-------------------------------+--------------------------------
> > >    38.5 |   0.0  inf  nan   0.0    nan  |  38.5  9.0  166.5  1156.8  261.5
> > >    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  0.1  251.2  2616.7  492.7
> > >    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  7.6  178.3  1895.4  330.0
> > >    45.0 |   0.0  inf  nan   0.0    nan  |  45.0  0.0  239.8  3457.4  522.3
> > >    45.5 |   0.0  inf  nan   0.0    nan  |  45.5  0.1  249.8  5126.7  621.0
> > >
> > > results are horrific. systat shows 32kB I/O; gstat shows half are
> > > reads, half are writes. Why does UFS need to read the full block,
> > > change one 4kB part, and then write it back, instead of just
> > > writing the 4kB part?
> > Because that's the way the buffer cache works. It writes an entire
> > buffer cache block (unless at the end of file), so it must read the
> > rest of the block into the buffer, so that it doesn't write garbage
> > (the rest of the block) out.
> Without having looked at the code or testing: I assume using O_DIRECT
> when opening the file should help for that particular test (on kernels
> compiled with "options DIRECTIO").
> > I'd argue that using an I/O size smaller than the file system block
> > size is simply sub-optimal and that most apps don't do random I/O of
> > blocks. OR, if you had an app that does random I/O of 4K blocks (at
> > 4K byte offsets), then using a 4K/1K file system would be better.
> A 4k/1k file system has higher overhead (more indirect blocks) and is
> clearly sub-optimal for most general uses, today.

Yes, but if the sysadmin knows that most of the I/O is random 4K
blocks, that's his specific case, not a general use. Sorry, I didn't
mean to imply that a 4K file system was a good choice in general.

> > NFS is the exception, in that it keeps track of a dirty byte range
> > within a buffer cache block and writes that byte range. (NFS writes
> > are byte granular, unlike a disk.)
> It should be easy to add support for a fragment mask to the buffer
> cache, which allows identifying valid fragments. Such a mask should be
> set to 0xff for all current uses of the buffer cache (meaning the full
> block is valid), but a special case could then be added for writes of
> exactly one or multiple fragments, where only the corresponding valid
> flag bits were set. In addition, a possible later read from disk must
> obviously skip fragments for which the valid mask bits are already
> set. This bit mask could then be used to update the affected fragments
> only, without a read-modify-write of the containing block. But I doubt
> that such a change would improve performance in the general case, just
> in random update scenarios (which might still be relevant, in case of
> a DBMS knowing the fragment size and using it for DB files).
>
> Regards, Stefan

Yes. And for some I/O patterns the fragment change would degrade
performance. You mentioned that a later read might have to skip
fragments with the valid bit set. I think this would translate to doing
multiple reads for the other fragments, in practice. Also, when an app
goes to write a partial fragment, that fragment would have to be read
in, and this could result in several reads of fragments instead of one
read for the entire block. It's the old "the OS doesn't have a crystal
ball that predicts future I/O activity" problem.

Btw, although I did a dirty byte range for NFS in the buffer cache ages
ago (late 1980s), it is also a performance hit for certain cases. The
linker/loaders love to write random-sized chunks to files. For the NFS
code, if the new write isn't contiguous with the old one, a synchronous
write of the old dirty byte range is forced to the server. I have a
patch that replaces the single byte range with a list in order to avoid
this synchronous write, but it has not made it into head. (I hope to do
so someday, after more testing and when I figure out all the implications
Re: stupid UFS behaviour on random writes
Wojciech Puchar wrote:
> create 10GB file (on 2GB RAM machine, with some swap used, to make
> sure little cache would be available for the filesystem):
>
>   dd if=/dev/zero of=file bs=1m count=10k
>
> block size is 32KB, fragment size 4k. now test random read access to
> it (10 threads):
>
>   randomio test 10 0 0 4096
>
> normal result on such a not-so-fast disk in my laptop:
>
>   118.5 | 118.5  5.8  82.3  383.2  85.6 |  0.0  inf  nan  0.0  nan
>   138.4 | 138.4  3.9  72.2  499.7  76.1 |  0.0  inf  nan  0.0  nan
>   142.9 | 142.9  5.4  69.9  297.7  60.9 |  0.0  inf  nan  0.0  nan
>   133.9 | 133.9  4.3  74.1  480.1  75.1 |  0.0  inf  nan  0.0  nan
>   138.4 | 138.4  5.1  72.1  380.0  71.3 |  0.0  inf  nan  0.0  nan
>   145.9 | 145.9  4.7  68.8  419.3  69.6 |  0.0  inf  nan  0.0  nan
>
> systat shows 4kB I/O size. all is fine. BUT random 4kB writes:
>
>   randomio test 10 1 0 4096
>
>   total |  read:  latency (ms)          |  write:  latency (ms)
>    iops |  iops  min  avg   max    sdev |  iops  min  avg    max     sdev
>   ------+-------------------------------+--------------------------------
>    38.5 |   0.0  inf  nan   0.0    nan  |  38.5  9.0  166.5  1156.8  261.5
>    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  0.1  251.2  2616.7  492.7
>    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  7.6  178.3  1895.4  330.0
>    45.0 |   0.0  inf  nan   0.0    nan  |  45.0  0.0  239.8  3457.4  522.3
>    45.5 |   0.0  inf  nan   0.0    nan  |  45.5  0.1  249.8  5126.7  621.0
>
> results are horrific. systat shows 32kB I/O; gstat shows half are
> reads, half are writes. Why does UFS need to read the full block,
> change one 4kB part, and then write it back, instead of just writing
> the 4kB part?

Because that's the way the buffer cache works. It writes an entire
buffer cache block (unless at the end of file), so it must read the
rest of the block into the buffer, so that it doesn't write garbage
(the rest of the block) out.

I'd argue that using an I/O size smaller than the file system block
size is simply sub-optimal and that most apps don't do random I/O of
blocks. OR, if you had an app that does random I/O of 4K blocks (at 4K
byte offsets), then using a 4K/1K file system would be better.

NFS is the exception, in that it keeps track of a dirty byte range
within a buffer cache block and writes that byte range.
(NFS writes are byte granular, unlike a disk.)
Re: iSCSI vs. SMB with ZFS.
Wojciech Puchar wrote:
> > With a network file system (either SMB or NFS, it doesn't matter),
> > you need to ask the server in *each* of the following situations:
> > * to ask the server if a file has been changed, so the client can
> >   use cached data (if the protocol supports it)
> > * to ask the server if a file (or a portion of a file) has been
> >   locked by another client
> not really. if there is only one user of a file, then Windows knows
> this, but changes to the behaviour you described when there are more
> users. AND FINALLY, the latter behaviour has failed to work properly
> since Windows XP (it worked fine with Windows 98). If you use programs
> that read/write shared files, you may be sure data corruption will
> happen. you have to set
>   locking = yes
>   oplocks = no
>   level2 oplocks = no
> to make it work properly, but even more slowly!

Btw, NFSv4 has delegations, which are essentially level2 oplocks. They
can be enabled for a server if the volumes exported via NFSv4 are not
being accessed locally (including Samba). For them to work, the nfscbd
needs to be running on the client(s), and the clients must have IP
addresses visible to the server for a callback TCP connection (no
firewalls or NAT gateways). Even with delegations working, the client
caching is limited to the buffer cache. I have an experimental patch
that uses on-disk caching in the client for delegated files (I call it
"packrats"), but it is not ready for production use. Now that I have
the 4.1 client in place, I plan to get back to working on it.

rick

> > This basically means that for almost every single I/O, you need to
> > ask the server for something, which involves network traffic and
> > round-trip delays.
> Not that. The problem is that Windows does not use all free memory for
> caching, as it does with a local or local (iSCSI) disk.
Re: iSCSI vs. SMB with ZFS.
Zaphod Beeblebrox wrote:
> Does Windows 7 support NFSv4, then? Is it expected (i.e., is it
> worthwhile trying) that NFSv4 would perform at a similar speed to
> iSCSI? It would seem that this at least requires Active Directory (or
> this user name mapping ... which I remember being hard).

As far as I know, there is no NFSv4 in Windows. I only made the comment
(which I admit was a bit off topic) because the previous post had
stated "SMB or NFS, they're the same", or something like that. There
was work on an NFSv4 client for Windows being done by CITI at the
University of Michigan, funded by Microsoft Research, but I have no
idea if it was ever released.

rick
request for review: gssd patch for alternate cred cache files
Hi,

A couple of people have reported problems using NFS mounts with
sec=krb5, because the version of sshd they use doesn't use credential
cache files named /tmp/krb5cc_N. The attached patch modifies the gssd
so that, when a new -s option is used, it roughly emulates what the
gssd used by most Linux distros does. This has been tested by the
reporters and fixed their issue. Would someone like to review this?

rick
ps: The patch can also be found at
http://people.freebsd.org/~rmacklem/gssd-ccache.patch

--- usr.sbin/gssd/gssd.c.sav2	2012-10-08 16:49:50.0 -0400
+++ usr.sbin/gssd/gssd.c	2012-12-12 19:19:51.0 -0500
@@ -35,7 +35,9 @@ __FBSDID("$FreeBSD: head/usr.sbin/gssd/g
 #include <sys/queue.h>
 #include <sys/syslog.h>
 #include <ctype.h>
+#include <dirent.h>
 #include <err.h>
+#include <krb5.h>
 #include <pwd.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -64,8 +66,12 @@ int gss_resource_count;
 uint32_t gss_next_id;
 uint32_t gss_start_time;
 int debug_level;
+static char ccfile_dirlist[PATH_MAX + 1], ccfile_substring[NAME_MAX + 1];
+static char pref_realm[1024];
 
 static void gssd_load_mech(void);
+static int find_ccache_file(const char *, uid_t, char *);
+static int is_a_valid_tgt_cache(const char *, uid_t, int *, time_t *);
 
 extern void gssd_1(struct svc_req *rqstp, SVCXPRT *transp);
 extern int gssd_syscall(char *path);
@@ -82,14 +88,45 @@ main(int argc, char **argv)
 	int fd, oldmask, ch, debug;
 	SVCXPRT *xprt;
 
+	/*
+	 * Initialize the credential cache file name substring and the
+	 * search directory list.
+	 */
+	strlcpy(ccfile_substring, "krb5cc_", sizeof(ccfile_substring));
+	ccfile_dirlist[0] = '\0';
+	pref_realm[0] = '\0';
 	debug = 0;
-	while ((ch = getopt(argc, argv, "d")) != -1) {
+	while ((ch = getopt(argc, argv, "ds:c:r:")) != -1) {
 		switch (ch) {
 		case 'd':
 			debug_level++;
 			break;
+		case 's':
+			/*
+			 * Set the directory search list. This enables use of
+			 * find_ccache_file() to search the directories for a
+			 * suitable credentials cache file.
+			 */
+			strlcpy(ccfile_dirlist, optarg, sizeof(ccfile_dirlist));
+			break;
+		case 'c':
+			/*
+			 * Specify a non-default credential cache file
+			 * substring.
+			 */
+			strlcpy(ccfile_substring, optarg,
+			    sizeof(ccfile_substring));
+			break;
+		case 'r':
+			/*
+			 * Set the preferred realm for the credential cache tgt.
+			 */
+			strlcpy(pref_realm, optarg, sizeof(pref_realm));
+			break;
 		default:
-			fprintf(stderr, "usage: %s [-d]\n", argv[0]);
+			fprintf(stderr,
+			    "usage: %s [-d] [-s dir-list] [-c file-substring]"
+			    " [-r preferred-realm]\n", argv[0]);
 			exit(1);
 			break;
 		}
@@ -267,13 +304,36 @@ gssd_init_sec_context_1_svc(init_sec_con
 	gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;
 	gss_ctx_id_t ctx = GSS_C_NO_CONTEXT;
 	gss_name_t name = GSS_C_NO_NAME;
-	char ccname[strlen("FILE:/tmp/krb5cc_") + 6 + 1];
+	char ccname[PATH_MAX + 5 + 1], *cp, *cp2;
+	int gotone;
 
-	snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
-	    (int) argp->uid);
+	memset(result, 0, sizeof(*result));
+	if (ccfile_dirlist[0] != '\0' && argp->cred == 0) {
+		gotone = 0;
+		cp = ccfile_dirlist;
+		do {
+			cp2 = strchr(cp, ':');
+			if (cp2 != NULL)
+				*cp2 = '\0';
+			gotone = find_ccache_file(cp, argp->uid, ccname);
+			if (gotone != 0)
+				break;
+			if (cp2 != NULL)
+				*cp2++ = ':';
+			cp = cp2;
+		} while (cp != NULL && *cp != '\0');
+		if (gotone == 0)
+			snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+			    (int) argp->uid);
+	} else
+		/*
+		 * If there wasn't a -s option or the credentials have
+		 * been provided as an argument, do it the old way.
+		 */
+		snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+		    (int) argp->uid);
 	setenv("KRB5CCNAME", ccname, TRUE);
-	memset(result, 0, sizeof(*result));
 	if (argp->cred) {
 		cred = gssd_find_resource(argp->cred);
 		if (!cred) {
@@ -516,13 +576,37 @@ gssd_acquire_cred_1_svc(acquire_cred_arg
 {
 	gss_name_t desired_name = GSS_C_NO_NAME;
 	gss_cred_id_t cred;
-	char ccname[strlen("FILE:/tmp/krb5cc_") + 6 + 1];
+	char ccname[PATH_MAX + 5 + 1], *cp, *cp2;
+	int gotone;
 
-	snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
-	    (int) argp->uid);
+	memset(result, 0, sizeof(*result));
+	if (ccfile_dirlist[0] != '\0' && argp->desired_name == 0) {
+		gotone = 0;
+		cp = ccfile_dirlist;
+		do {
+			cp2 = strchr(cp, ':');
+			if (cp2 != NULL)
+				*cp2 = '\0';
+			gotone = find_ccache_file(cp, argp->uid, ccname);
+			if (gotone != 0)
+				break;
+			if (cp2 != NULL)
+				*cp2++ = ':';
+			cp = cp2;
+		} while (cp != NULL && *cp != '\0');
+		if (gotone == 0)
+			snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+			    (int) argp->uid);
+	} else
+		/*
+		 * If there wasn't a -s option or the name has
+		 * been provided as an argument, do it the old way.
+		 * (The name is provided for host based initiator credentials.)
+		 */
+		snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+		    (int) argp->uid);
 	setenv("KRB5CCNAME", ccname, TRUE);
-	memset(result,
any arch not pack uint32_t x[2]?
Hi,

The subject line pretty well says it. I am about ready to commit the
NFSv4.1 client patches, but I had better ask this dumb question first.

Is there any architecture where:
	uint32_t x[2];
isn't packed? (Or, sizeof(x) != 8, if you prefer.)

As you might have guessed, if the answer is yes, I have some code
fixin' to do, rick
naming a .h file for kernel use only
Hi,

For my NFSv4.1 client work, I've taken a few definitions out of a
kernel RPC .c file and put them in a .h file, so that they can be
included in other sys/rpc .c files. I've currently named the file
_krpc.h. I thought I'd check whether this is a reasonable name before
doing the big commit of the NFSv4.1 stuff to head. (I have a vague
notion that a leading _ indicates "not for public use", but I am not
sure.)

Thanks in advance for naming suggestions for this file, rick
Re: NFS server bottlenecks
Ivan Voras wrote:
> On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote:
> > Here are the results from testing both patches:
> > http://home.totalterror.net/freebsd/nfstest/results.html
> > Both tests ran for about 14 hours (a bit too much, but I wanted to
> > compare different zfs recordsize settings), and were done first
> > after a fresh reboot. The only noticeable difference seems to be
> > many more context switches with Ivan's patch.
>
> Thank you very much for your extensive testing! I don't know how to
> interpret the rise in context switches; as this is kernel code, I'd
> expect no context switches. I hope someone else can explain.
>
> But you have also shown that my patch doesn't do any better than
> Rick's, even on a fairly large configuration, so I don't think there's
> value in adding the extra complexity, and Rick knows NFS much better
> than I do.
>
> But there are a few other things I'm interested in: like why does your
> load average spike almost to 20, and how come that with 24 drives in
> RAID-10 you only push 600 Mbit/s through the 10 Gbit/s Ethernet? Have
> you tested your drive setup locally (AESNI shouldn't be a bottleneck;
> you should be able to encrypt well into the Gbyte/s range) and the
> network?
>
> If you have the time, could you repeat the tests with a recent Samba
> server and a CIFS mount on the client side? This is probably not
> important, but I'm just curious how it would perform on your machine.

Oh, I realized that, if you are testing 9/stable (and not head), you
won't have r227809. Without that, all reads on a given file will be
serialized, because the server will acquire an exclusive lock on the
vnode. The patch for r227809 in head is at:
http://people.freebsd.org/~rmacklem/lkshared.patch
This should apply fine to a 9 system (but not 8.n), I think.
Good luck with it and have fun, rick
Re: NFS server bottlenecks
Ivan Voras wrote:
> On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote:
> > Here are the results from testing both patches:
> > http://home.totalterror.net/freebsd/nfstest/results.html
> > Both tests ran for about 14 hours (a bit too much, but I wanted to
> > compare different zfs recordsize settings), and were done first
> > after a fresh reboot. The only noticeable difference seems to be
> > many more context switches with Ivan's patch.
>
> Thank you very much for your extensive testing! I don't know how to
> interpret the rise in context switches; as this is kernel code, I'd
> expect no context switches. I hope someone else can explain.

Don't the mtx_lock() calls spin for a little while and then context
switch if another thread still has it locked?

> But you have also shown that my patch doesn't do any better than
> Rick's, even on a fairly large configuration, so I don't think there's
> value in adding the extra complexity, and Rick knows NFS much better
> than I do.

Hmm, I didn't look, but were there any tests using UDP mounts? (I would
have thought that your patch would mainly affect UDP mounts, since that
is when my version still has the single LRU queue/mutex. As I think you
know, my concern with your patch would be correctness for UDP, not
performance.)

Anyhow, it sounds like you guys are having fun with it and learning
some useful things. Keep up the good work, rick

> But there are a few other things I'm interested in: like why does your
> load average spike almost to 20, and how come that with 24 drives in
> RAID-10 you only push 600 Mbit/s through the 10 Gbit/s Ethernet? Have
> you tested your drive setup locally (AESNI shouldn't be a bottleneck;
> you should be able to encrypt well into the Gbyte/s range) and the
> network?
>
> If you have the time, could you repeat the tests with a recent Samba
> server and a CIFS mount on the client side? This is probably not
> important, but I'm just curious how it would perform on your machine.
Re: NFS server bottlenecks
Outback Dingo wrote:
> On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote:
> > On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote:
> > > Ivan Voras wrote:
> > > > I don't know how to interpret the rise in context switches; as
> > > > this is kernel code, I'd expect no context switches. I hope
> > > > someone else can explain.
> > > Don't the mtx_lock() calls spin for a little while and then
> > > context switch if another thread still has it locked?
> > Yes, but are in-kernel context switches also counted? I was assuming
> > they are light-weight enough not to count.
> > > Hmm, I didn't look, but were there any tests using UDP mounts? (I
> > > would have thought that your patch would mainly affect UDP mounts,
> > > since that is when my version still has the single LRU queue/mutex.
> > Another assumption - I thought UDP was the default.

TCP has been the default for a FreeBSD client for a long time. It was
changed for the old NFS client before I became a committer. (You can
explicitly set one or the other as mount options, or check via
wireshark/tcpdump.)

> > > As I think you know, my concern with your patch would be
> > > correctness for UDP, not performance.)
> > Yes.
>
> I've got a similar box config here, with 2x 10GB Intel NICs and 24 2TB
> drives on an LSI controller. I'm watching the thread patiently; I'm
> kind of looking for results and answers, though I'm also tempted to
> run benchmarks on my system to see if I get similar results. I also
> considered that netmap might be one, but I'm not quite sure if it
> would help NFS, since it's hard to tell if it's a network bottleneck,
> though it appears to be network related.

NFS network traffic looks very different from a TCP stream (a la bit
torrent or ...). I've seen this cause issues before. You can look at a
packet trace in wireshark and see if TCP is retransmitting segments.
rick
Re: NFS server bottlenecks
Ivan Voras wrote:
> On 13/10/2012 17:22, Nikolay Denev wrote:
> > drc3.patch applied and built cleanly and shows nice improvement!
> > I've done a quick benchmark using iozone over the NFS mount from the
> > Linux host.
>
> Hi,
>
> If you are already testing, could you please also test this patch:
> http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch
> It should apply to HEAD without Rick's patches. It's a bit different
> approach than Rick's, breaking down locks even more.

I don't think (it is hard to test this) your trim cache algorithm will
choose the correct entries to delete. The problem is that UDP entries
very seldom time out (unless the NFS server is seeing hardly any load)
and are mostly trimmed because the size exceeds the highwater mark.
With your code, it will clear out all of the entries in the first hash
buckets that aren't currently busy, until the total count drops below
the high water mark. (If you monitor a busy server with "nfsstat -e -s",
you'll see the cache never goes below the high water mark, which is 500
by default.) This would delete entries for fairly recent requests.

If you are going to replace the global LRU list with one for each hash
bucket, then you'll have to compare the time stamps on the least
recently used entries of all the hash buckets and delete those. If you
keep the timestamp of the least recent entry for each hash bucket in
the hash bucket head, you could at least use that to select which
bucket to delete from next, but you'll still need to:
- lock that hash bucket
- delete a few entries from that bucket's LRU list
- unlock the hash bucket
- repeat for various buckets until the count is below the high water
  mark
Or something like that. I think you'll find it a lot more work than one
LRU list and one mutex. Remember that the mutex isn't held for long.

Btw, the code looks very nice. (If I were being a style(9) zealot, I'd
remind you that it likes "return (X);" and not "return X;".)

rick
Re: NFS server bottlenecks
Ivan Voras wrote: On 15 October 2012 22:58, Rick Macklem rmack...@uoguelph.ca wrote: The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the highwater mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries for fairly recent requests. You are right about that; if testing by Nikolay goes reasonably well, I'll work on that. If you are going to replace the global LRU list with one for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent entry for a hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:
- lock that hash bucket
- delete a few entries from that bucket's LRU list
- unlock the hash bucket
- repeat for various buckets until the count is below the high water mark
Ah, I think I get it: is the reliance on the high watermark as a criterion for cache expiry the reason the list is an LRU instead of an ordinary unordered list? Yes, I think you've got it;-) Have fun with it, rick Or something like that. I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long. It could be, but the current state of my code is just groundwork for the next things I have in plan: 1) Move the expiry code (the trim function) into a separate thread, run periodically (or as a callout; I'll need to talk with someone about which one is cheaper) 2) Replace the mutex with a rwlock.
The only thing which is preventing me from doing this right away is the LRU list, since each read access modifies it (and requires a write lock). This is why I was asking you if we can do away with the LRU algorithm. Btw, the code looks very nice. (If I was being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.) Thanks, I'll make it more style(9) compliant as I go along.
Re: NFS server bottlenecks
Garrett Wollman wrote: On Fri, 12 Oct 2012 22:05:54 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect. Well, I'll admit I don't know how to do this. What the code does need is a set of mutexes, where any of the mutexes can be referred to by an index. I could easily define a structure that has:
struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];
- but all that does is leave a small structure between each struct mtx, and I wouldn't have thought that would make much difference. (How big is a typical hardware cache line these days? I have no idea.)
- I suppose I could waste space and define a glob of unused space between them, like:
struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	char garbage[N];
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];
- If this makes sense, how big should N be? (Somewhat less than the length of a cache line, I'd guess. It seems that the structure should be at least a cache line length in size.)
All this seems kinda hokey to me and beyond what code at this level should be worrying about, but I'm game to make changes, if others think it's appropriate. I've never used mtx_pool(9) mutexes, but it doesn't sound like they would be the right fit, from reading the man page. (Assuming mtx_pool_find() is guaranteed to return the same mutex for the same address passed in as an argument, it would seem that they would work, since I can pass &nfsrvcachehead[i] in as the pointer arg to index a mutex.)
Hopefully jhb@ can say if using mtx_pool(9) for this would be better than an array: struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE]; Does anyone conversant with mutexes know what the best coding approach is? (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the drc2 code but haven't gotten my users to put a load on it yet.) No rush. At this point, the earliest I could commit something like this to head would be December. rick ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects. -GAWollman
Re: NFS server bottlenecks
I wrote: Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at: http://people.freebsd.org/~rmacklem/drc2.patch http://people.freebsd.org/~rmacklem/drc3.patch in case the attachments don't get through. rick ps: I haven't tested drc3.patch a lot, but I think it's ok?

--- fs/nfsserver/nfs_nfsdcache.c.orig	2012-02-29 21:07:53.0 -0500
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-03 08:23:24.0 -0400
@@ -164,8 +164,19 @@ NFSCACHEMUTEX;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
+SYSCTL_DECL(_vfs_nfsd);
+
+static int nfsrc_tcphighwater = 0;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
+    &nfsrc_tcphighwater, 0,
+    "High water mark for TCP cache entries");
+static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, udphighwater, CTLFLAG_RW,
+    &nfsrc_udphighwater, 0,
+    "High water mark for UDP cache entries");
+
 static int nfsrc_tcpnonidempotent = 1;
-static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER, nfsrc_udpcachesize = 0;
+static int nfsrc_udpcachesize = 0;
 static TAILQ_HEAD(, nfsrvcache) nfsrvudplru;
 static struct nfsrvhashhead nfsrvhashtbl[NFSRVCACHE_HASHSIZE],
     nfsrvudphashtbl[NFSRVCACHE_HASHSIZE];
@@ -781,8 +792,15 @@ nfsrc_trimcache(u_int64_t sockref, struc
 {
 	struct nfsrvcache *rp, *nextrp;
 	int i;
+	static time_t lasttrim = 0;
 
+	if (NFSD_MONOSEC == lasttrim &&
+	    nfsrc_tcpsavedreplies <= nfsrc_tcphighwater &&
+	    nfsrc_udpcachesize < (nfsrc_udphighwater +
+	    nfsrc_udphighwater / 2))
+		return;
 	NFSLOCKCACHE();
+	lasttrim = NFSD_MONOSEC;
 	TAILQ_FOREACH_SAFE(rp, &nfsrvudplru, rc_lru, nextrp) {
 		if (!(rp->rc_flag & (RC_INPROG|RC_LOCKED|RC_WANTED)) &&
 		    rp->rc_refcnt == 0

--- fs/nfsserver/nfs_nfsdcache.c.sav	2012-10-10 18:56:01.0 -0400
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-12 21:04:21.0 -0400
@@ -160,7 +160,8 @@ __FBSDID("$FreeBSD: head/sys/fs/nfsserve
 #include <fs/nfs/nfsport.h>
 
 extern struct nfsstats newnfsstats;
-NFSCACHEMUTEX;
+extern struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];
+extern struct mtx nfsrc_udpmtx;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
@@ -208,10 +209,11 @@ static int newnfsv2_procid[NFS_V3NPROCS]
 	NFSV2PROC_NOOP,
 };
 
+#define	nfsrc_hash(xid)	(((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE)
 #define	NFSRCUDPHASH(xid) \
-	(&nfsrvudphashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvudphashtbl[nfsrc_hash(xid)])
 #define	NFSRCHASH(xid) \
-	(&nfsrvhashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvhashtbl[nfsrc_hash(xid)])
 #define	TRUE	1
 #define	FALSE	0
 #define	NFSRVCACHE_CHECKLEN	100
@@ -262,6 +264,18 @@ static int nfsrc_getlenandcksum(mbuf_t m
 static void nfsrc_marksametcpconn(u_int64_t);
 
 /*
+ * Return the correct mutex for this cache entry.
+ */
+static __inline struct mtx *
+nfsrc_cachemutex(struct nfsrvcache *rp)
+{
+
+	if ((rp->rc_flag & RC_UDP) != 0)
+		return (&nfsrc_udpmtx);
+	return (&nfsrc_tcpmtx[nfsrc_hash(rp->rc_xid)]);
+}
+
+/*
  * Initialize the server request cache list
  */
 APPLESTATIC void
@@ -336,10 +350,12 @@ nfsrc_getudp(struct nfsrv_descript *nd,
 	struct sockaddr_in6 *saddr6;
 	struct nfsrvhashhead *hp;
 	int ret = 0;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(newrp);
 	hp = NFSRCUDPHASH(newrp->rc_xid);
 loop:
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	LIST_FOREACH(rp, hp, rc_hash) {
 		if (newrp->rc_xid == rp->rc_xid &&
 		    newrp->rc_proc == rp->rc_proc &&
@@ -347,8 +363,8 @@ loop:
 		    nfsaddr_match(NETFAMILY(rp), &rp->rc_haddr, nd->nd_nam)) {
 			if ((rp->rc_flag & RC_LOCKED) != 0) {
 				rp->rc_flag |= RC_WANTED;
-				(void)mtx_sleep(rp, NFSCACHEMUTEXPTR,
-				    (PZERO - 1) | PDROP, "nfsrc", 10 * hz);
+				(void)mtx_sleep(rp, mutex, (PZERO - 1) | PDROP,
+				    "nfsrc", 10 * hz);
 				goto loop;
 			}
 			if (rp->rc_flag == 0)
@@ -358,14 +374,14 @@ loop:
 			TAILQ_INSERT_TAIL(&nfsrvudplru, rp, rc_lru);
 			if (rp->rc_flag & RC_INPROG) {
 				newnfsstats.srvcache_inproghits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				ret = RC_DROPIT;
 			} else if (rp->rc_flag & RC_REPSTATUS) {
 				/*
 				 * V2 only.
 				 */
 				newnfsstats.srvcache_nonidemdonehits++;
-
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. 
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 11, 2012, at 7:20 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. 
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all
Re: NFS server bottlenecks
Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick
Re: NFS server bottlenecks
Garrett Wollman wrote: On Tue, 9 Oct 2012 20:18:00 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). Yes. My comment was simply meant to imply that his testing isn't a realistic load for most NFS servers. It was not meant to imply that reducing the CPU overhead/lock contention of the DRC is a useless exercise. rick -GAWollman
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. 
There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out of curiousity, why do you use 8K reads instead of 64K reads. Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out of curiousity, why do you use 8K reads instead of 64K reads. Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values generate too large a burst of traffic for a network interface to handle and then the rsize/wsize has to be decreased to avoid this issue.) And, although
Re: NFS server bottlenecks
Nikolay Deney wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL machine over a 10G network, and noticed the same nfsd threads issue. Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server with dd if=/tank/32G.bin of=/dev/null bs=1M to cache it completely in ARC (the machine has 196G RAM); if I then do this again locally I get close to 4GB/sec read, completely from the cache... But if I try to read the file over NFS from the Linux machine I only get about 100MB/sec, sometimes a bit more, and all of the nfsd threads are clearly visible in top. pmcstat also showed the same mutex
Re: NFS server bottlenecks
Garrett Wollman wrote: [Adding freebsd-fs@ to the Cc list, which I neglected the first time around...] On Tue, 2 Oct 2012 08:28:29 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you. It will be a while before I have another server that isn't in production (it's on my deployment plan, but getting the production servers going is taking first priority). The approaches that I was going to look at: Simplest: only do the cache trim once every N requests (for some reasonable value of N, e.g., 1000). Maybe keep track of the number of entries in each hash bucket and ignore those buckets that only have one entry even if it is stale. Well, the patch I have does it when it gets too big. This made sense to me, since the cache is trimmed to keep it from getting too large. It also does the trim at least once/sec, so that really stale entries are removed. Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? A mutex in each element could be used for changes (not insertion/removal) to an individual element. 
However, the current code manipulates the lists and makes minimal changes to the individual elements, so I'm not sure if a mutex in each element would be useful or not, but it wouldn't help for the trimming case, imho. I modified the patch slightly, so it doesn't bother to acquire the mutex when it is checking if it should trim now. I think this results in a slight risk that the test will use an out-of-date cached copy of one of the global vars, but since the code isn't modifying them, I don't think it matters. This modified patch is attached and is also here: http://people.freebsd.org/~rmacklem/drc2.patch Moderately complicated: figure out if a different synchronization type can safely be used (e.g., rmlock instead of mutex) and do so. More complicated: move all cache trimming to a separate thread and just have the rest of the code wake it up when the cache is getting too big (or just once a second since that's easy to implement). Maybe just move all cache processing to a separate thread. Only doing it once/sec would result in a very large cache when bursts of traffic arrive. The above patch does it when it is too big or at least once/sec. I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? Isilon did use separate threads (I never saw their code, so I am going by what they told me), but it sounded to me like they were trimming the cache too aggressively to be effective for TCP mounts. (ie. 
It sounded to me like they had broken the algorithm to achieve better perf.) Remember that the DRC is weird, in that it is a cache to improve correctness at the expense of overhead. It never improves performance. On the other hand, turn it off or throw away entries too aggressively and data corruption, due to retries of non-idempotent operations, can be the outcome. Good luck with whatever you choose, rick It's pretty clear from the profile that the cache mutex is heavily contended, so anything that reduces the length of time it's held is probably a win. That URL again, for the benefit of people on freebsd-fs who didn't see it on hackers, is: http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png. (This graph is slightly modified from my previous post as I removed some spurious edges to make the formatting look better. Still looking for a way to get a profile that includes all kernel modules with the kernel.) -GAWollman
Re: NFS server bottlenecks
Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrive. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman
Re: NFS server bottlenecks
Garrett Wollman wrote: I had an email conversation with Rick Macklem about six months ago about NFS server bottlenecks. I'm now in a position to observe my large-scale NFS server under an actual production load, so I thought I would update folks on what it looks like. This is a 9.1 prerelease kernel (I hope 9.1 will be released soon as I have four more of these servers to deploy!). When under nearly 100% load on an 8-core (16-thread) Quanta QSSC-S99Q storage server, with a 10G network interface, pmcstat tells me this:

PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%), 27 unresolved
Key: q = exiting...
%SAMP IMAGE   FUNCTION            CALLERS
 29.3 kernel  _mtx_lock_sleep     nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
  9.5 kernel  cpu_search_highest  cpu_search_highest:8.1 sched_idletd:1.4
  7.4 zfs.ko  lzjb_decompress     zio_decompress
  4.3 kernel  _mtx_lock_spin      turnstile_trywait:2.2 pmclog_reserve:1.0 ...
  4.0 zfs.ko  fletcher_4_native   zio_checksum_error:3.1 zio_checksum_compute:0.8
  3.6 kernel  cpu_search_lowest   cpu_search_lowest
  3.3 kernel  nfsrc_trimcache     nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
  2.3 kernel  ipfw_chk            ipfw_check_hook
  2.1 pmcstat _init
  1.1 kernel  _sx_xunlock
  0.9 kernel  _sx_xlock
  0.9 kernel  spinlock_exit

This does seem to confirm my original impression that the NFS replay cache is quite expensive. Running a gprof(1) analysis on the same PMC data reveals a bit more detail (I've removed some uninteresting parts of the call graph): I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you. 
jwd@ was going to test it, but he moved to a different job away from NFS, so the patch has just been collecting dust. If you could test it, that would be nice, rick ps: Also, the current patch still locks before checking if it needs to do the trim. I think that could safely be changed so that it doesn't lock/unlock when it isn't doing the trim, if that makes a significant difference.

                called/total      parents
index  %time   self  descendents  called+self    name    index
                called/total      children

              4881.00  2004642.70   932627/932627     svc_run_internal [2]
[4]    45.1   4881.00  2004642.70   932627         nfssvc_program [4]
             13199.00   504436.33   584319/584319     nfsrvd_updatecache [9]
             23075.00   403396.18   468009/468009     nfsrvd_getcache [14]
              1032.25   416249.44     2239/2284       svc_sendreply_mbuf [15]
              6168.00   381770.44    11618/11618      nfsrvd_dorpc [24]
              3526.87    86869.88   112478/112514     nfsrvd_sentcache [74]
               890.00    50540.89     4252/4252       svc_getcred [101]
             14876.60    32394.26     4177/24500      crfree <cycle 3> [263]
             11550.11    25150.73     3243/24500      free <cycle 3> [102]
              1348.88    15451.66     2716/16831      m_freem [59]
              4066.61      216.81     1434/1456       svc_freereq [321]
              2342.15      677.40      557/1459       malloc_type_freed [265]
                59.14     1916.84      134/2941       crget [113]
              1602.25        0.00      322/9682       bzero [105]
               690.93        0.00       43/44         getmicrotime [571]
               287.22        7.33      138/1205       prison_free [384]
               233.61        0.00       60/798        PHYS_TO_VM_PAGE [358]
               203.12        0.00       94/230        nfsrv_mallocmget_limit [632]
               151.76        0.00       51/1723       pmap_kextract [309]
                 0.78       70.28        9/3281       _mtx_unlock_sleep [154]
                19.22       16.88       38/400403     nfsrc_trimcache [26]
                11.05       21.74        7/197        crsetgroups [532]
                30.37        0.00       11/6592       critical_enter [190]
                25.50        0.00        9/36         turnstile_chain_unlock [844]
                24.86        0.00        3/7          nfsd_errmap [913]
                12.36        8.57        8/2145       in_cksum_skip [298]
                 9.10        3.59        5/12455      mb_free_ext [140]
                 1.84        4.85        2/2202       VOP_UNLOCK_APV [269]
-----------------------------------------------
                 0.49        0.15        1/1129009    uhub_explore [1581]
                 0.49        0.15        1/1129009    tcp_output [10]
                 0.49        0.15        1/1129009    pmap_remove_all [1141]
                 0.49        0.15        1/1129009    vm_map_insert [236]
                 0.49        0.15        1/1129009    vnode_create_vobject [281]
                 0.49        0.15        1/1129009    biodone [351]
                 0.49        0.15        1/1129009    vm_object_madvise [670]
                 0.49        0.15        1/1129009    xpt_done [483]
                 0.49        0.15        1/1129009    vputx [80]
                 0.49        0.15        1/1129009    vm_map_delete <cycle 3> [49]
                 0.49        0.15        1/1129009    vm_object_deallocate <cycle 3> [356]
                 0.49        0.15        1/1129009    vm_page_unwire [338]
                 0.49        0.15        1/1129009    pmap_change_wiring [318]
                 0.98        0.31        2/1129009    getnewvnode [227]
                 0.98        0.31        2/1129009    pmap_clear_reference [1004]
                 0.98        0.31        2/1129009    usbd_do_request_flags [1282]
                 0.98        0.31        2/1129009    vm_object_collapse <cycle 3> [587]
                 0.98        0.31        2/1129009    vm_object_page_remove [122]
                 1.48        0.46        3/1129009    mpt_pci_intr [487]
                 1.48        0.46        3/1129009    pmap_extract [355]
                 1.48        0.46        3/1129009    vm_fault_unwire [171]
                 1.97        0.62        4/1129009    vgonel [270]
                 1.97        0.62        4/1129009    vm_object_shadow [926]
                 1.97        0.62        4/1129009    zone_alloc_item [434]
                 2.46        0.77        5/1129009    vnlru_free [235]
                 2.46        0.77        5/1129009    insmntque1 [737]
                 2.95        0.93        6/1129009    zone_free_item [409]
                 3.94        1.24        8
Re: Upcoming release schedule - 8.4 ?
Mark Saad wrote: I'll share my 2 cents here, as someone who maintains a decent sized FreeBSD install. 1. FreeBSD needs to make end users more comfortable with using a Dot-Ohh release; and at the time of the dot-ohh release a timeline for the next point releases should be made. * 2. Having three supported releases is showing issues, and brings up the point of why was 9.0 not released as 8.3 ? ** 3. The end users appear to want fewer releases, and for them to be supported longer. * A rough outline would do and it should be on the main release page http://www.freebsd.org/releases/ ** Yes I understand that 9.0 had tons of new features that were added and it's not exactly a point release upgrade from 8.2, however one can argue that if it were there would be less yelling about when version X is going to be EOL'd and when will version Y be released. One thought here might be to revisit the Kernel APIs can only change on a major release rule. It seems to me that some KPIs could be frozen for longer periods than others, maybe? For example: - If device driver KPIs were frozen for a longer period of time, there wouldn't be the challenge of backporting drivers for newer hardware to the older systems. vs - The VFS/VOP interface. As far as I know, there are currently 2 out-of-source-tree file systems (OpenAFS and FUSE) and there are FreeBSD committers involved in both of these. As such, making a VFS change within a minor release cycle might not be a big problem, so long as all the file systems in the source tree are fixed and the maintainers for the above 2 file systems were aware of the change and when they needed to release a patch/rebuild their module. - Similarly, are there any out-of-source-tree network stacks? It seems that this rule is where the controversy of major vs minor release changes comes in? 
Just a thought, rick -- mark saad | nones...@longcount.org
Re: pxe + nfs + microsoft dhcp
pacija wrote: - Original Message - Dear list readers, I am having a problem with pxe loader on FreeBSD 9.0 i386 release. No matter what value I put for DHCP option 017 (Root Path) in Microsoft DHCP server, pxe always sets root path: pxe_open: server path: / I've read src/sys/boot/i386/libi386/pxe.c as instructed in handbook, and i learned there that root path is a failover value which gets set if no valid value is supplied by DHCP server. At first i thought that Microsoft DHCP does not send it but i confirmed with windump it does: -- 15:46:49.505748 IP (tos 0x0, ttl 128, id 6066, offset 0, flags [none], proto: UDP (17), length: 392) dhcp.domain.tld.67 255.255.255.255.68: [bad udp cksum 4537!] BOOTP/DHCP, Reply, length 364, xid 0xdcdb5309, Flags [ none ] (0x) Your-IP 192.168.218.32 Server-IP dhcp.domain.tld Client-Ethernet-Address 00:19:db:db:53:09 (oui Unknown) file FreeBSD/install/boot/pxeboot Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: Offer Subnet-Mask Option 1, length 4: 255.255.255.0 RN Option 58, length 4: 345600 RB Option 59, length 4: 604800 Lease-Time Option 51, length 4: 691200 Server-ID Option 54, length 4: dhcp.domain.tld Default-Gateway Option 3, length 4: gate.domain.tld Domain-Name-Server Option 6, length 4: dhcp.domain.tld Domain-Name Option 15, length 1: ^@ RP Option 17, length 42: 192.168.218.32:/b/tftpboot/FreeBSD/install/^@ BF Option 67, length 29: FreeBSD/install/boot/pxeboot^@ What about getting rid of the ^@ characters at the end of the strings? rick -- I do not understand code well enough to fix it, or at least send pxeloader static value of /b/tftpboot/FreeBSD/install/, so if someone would instruct me how to do it i would be very grateful. Thank you in advance for your help. 
Re: NFS - slow
David Brodbeck wrote: On Mon, Apr 30, 2012 at 10:00 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote: i tried nfsv4, tested under FreeBSD over localhost and it is roughly the same. am i doing something wrong? I found NFSv4 to be much *slower* than NFSv3 on FreeBSD, when I benchmarked it a year or so ago. If delegations are not enabled, there is additional overhead doing the Open operations against the server. Delegations are not enabled by default in the server, because there isn't code to handle conflicts with opens done locally on the server. (ie. Delegations work iff the volumes exported over NFSv4 are not accessed locally in the server.) I think there are also some issues w.r.t. name caching in the client that still need to be resolved. NFSv4 should provide better byte range locking, plus NFSv4 ACLs and a few other things. However, it is more complex and will not perform better than NFSv3, at least until delegations are used (or pNFS, which is a part of NFSv4.1). rick -- David Brodbeck System Administrator, Linguistics University of Washington
Re: NFS - slow
Wojciech Puchar wrote: i tried nfsv4, tested under FreeBSD over localhost and it is roughly the same. am i doing something wrong? Probably not. NFSv4 writes are done exactly the same as NFSv3. (It changes other stuff, like locking, adding support for ACLs, etc.) I do have a patch that allows the client to do more extensive caching to local disk in the client (called Packrats), but that isn't ready for prime time yet. NFSv4.1 optionally supports pNFS, where reading and writing can be done to Data Servers (DSs) separate from the NFS server (called the Metadata Server or MDS). I'm working on the client side of this, but it is also a work-in-progress and no work on an NFSv4.1 server for FreeBSD has been done yet, as far as I know. If you have increased MAXBSIZE in both the client and server machines and use the new (experimental in 8.x) client and server, they will use a larger rsize, wsize for NFSv3 as well as NFSv4. (Capturing packets and looking at them in wireshark will tell you what the actual rsize, wsize is.) A patch to nfsstat to get the actual mount options in use is another of my 'to do' items. If anyone else wants to work on this, I'd be happy to help them. On Mon, 30 Apr 2012, Peter Jeremy wrote: On 2012-Apr-27 22:05:42 +0200, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote: is there any way to speed up NFS server? ... - write works terribly. it performs sync on every write IMHO, You don't mention which NFS server or NFS version you are using, but for traditional NFS, this is by design. The NFS server is stateless and NFS server failures are transparent (other than time-wise) to the client. This means that once the server acknowledges a write, it guarantees the client will be able to later retrieve that data, even if the server crashes. This implies that the server needs to do a synchronous write to disk before it can return the acknowledgement back to the client. -- Peter Jeremy Btw, for NFSv3 and 4, the story is slightly different than the above. 
A client can do writes with a flag that is either FILESYNC or UNSTABLE. For FILESYNC, the server must do exactly what the above says. That is, the data and any required metadata changes must be on stable storage before the server replies to the RPC. For UNSTABLE, the server can simply save the data in memory and reply OK to the RPC. For this case, the client needs to do a separate Commit RPC later and the server must store the data on stable storage at that time. (For this case, the client needs to keep the data written UNSTABLE in its cache and be prepared to re-write it, if the server reboots before the Commit RPC is done.) - When any app. does a fsync(2), the client needs to do a Commit RPC if it has been doing UNSTABLE writes. Most clients, including FreeBSD, do writes with UNSTABLE. However, one limitation on the FreeBSD client is that it currently only keeps track of one contiguous modified byte range in a buffer cache block. When an app. in the client does non-contiguous writes to the same buffer cache block, it must write the old modified byte range to the server with FILESYNC before it copies the newly written data into the buffer cache block. This happens frequently for builds during the loader phase. (jhb and I have looked at this. I have an experimental patch that makes the modified byte range a list, but it requires changes to struct buf. I think it is worth pursuing. It is a client side patch, since that is where things can be improved, if clients avoid doing FILESYNC or frequent Commit RPCs.) rick
Re: NFS - slow
Wojciech Puchar wrote: the server is required to do that. (ie. Make sure the data is stored on stable storage, so it can't be lost if the server crashes/reboots.) Expensive NFS servers can use non-volatile RAM to speed this up, but a generic FreeBSD box can't do that. Some clients (I believe ESXi is one of these) request FILE_SYNC all the time, but all clients will do so sooner or later. If you are exporting ZFS volumes and don't mind violating the NFS RFCs and risking data loss, there is a ZFS option that helps. I don't use ZFS, but I think the option is (sync=disabled) or something like that. (ZFS folks can help out, if you want that.) Even using vfs.nfsrv.async=1 breaks the above. thank you for answering. i don't use or plan to use ZFS. and i am aware of this NFS feature but i don't understand - even with syncs disabled, why writes are not clustered. i always see 32kB writes in systat The old (default on NFSv3) server sets the maximum wsize to 32K. The new (default on 9) sets it to MAXBSIZE, which is currently 64K, but I would like to get that increased. (A quick test suggested that the kernel works when MAXBSIZE is set to 128K, but I haven't done much testing yet.) when running unfsd from ports it doesn't have that problem and works FASTER than kernel nfs. But you had taken out the fsync() calls, which breaks the protocol, as above. rick
Re: NFS - slow
Wojciech Puchar wrote: is there any way to speed up NFS server? from what i noticed: - reads works fast and good, like accessed locally, readahead up to maxbsize works fine on large files etc. - write works terribly. it performs sync on every write IMHO, setting vfs.nfsrv.async=1 improves things SLIGHTLY, but still - writes are sent to hard disk every single block - no clustering. am i doing something wrong or is it that broken? Since I haven't seen anyone else answer this, I'll throw out my $0.00 worth one more time. (This topic comes up regularly on the mailing lists.) Not broken, it's just a feature of NFS. When the client says FILE_SYNC, the server is required to do that. (ie. Make sure the data is stored on stable storage, so it can't be lost if the server crashes/reboots.) Expensive NFS servers can use non-volatile RAM to speed this up, but a generic FreeBSD box can't do that. Some clients (I believe ESXi is one of these) request FILE_SYNC all the time, but all clients will do so sooner or later. If you are exporting ZFS volumes and don't mind violating the NFS RFCs and risking data loss, there is a ZFS option that helps. I don't use ZFS, but I think the option is (sync=disabled) or something like that. (ZFS folks can help out, if you want that.) Even using vfs.nfsrv.async=1 breaks the above. Once you do this, when an application in a client does a successful fsync() and assumes the data is safely stored and then the server crashes, the data can still be lost. rick i tried user space nfs from ports, it's funny but its performance is actually better after i removed fsync from the code.
Re: Ways to promote FreeBSD?
Steven Hartland wrote: - Original Message - From: Mehmet Erol Sanliturk My opinion is that the most important obstacle in front of FreeBSD is its installation structure: It is NOT possible to install and use a FreeBSD distribution directly as it is. I disagree, we find quite the opposite; FreeBSD's current install is perfect: it's quick, doesn't install stuff we don't need and leaves a very nice base. Linux on the other hand takes ages, asks way too many questions, has issues with some hardware with mouse and gui not working properly making the install difficult to navigate, but most importantly it's quite hard to get a nice simple base as there are so many options, which is default with FreeBSD. In essence it depends on what you want and how you use the OS. For the way we use FreeBSD on our servers it's perfect. So if you're trying to suggest it's not suitable for all, that's incorrect as it depends on what you want :) I worked for the CS dept. at a university for 30 years. What I observed was that students were usually enthusiastic about trying a new os. However, these days, they have almost no idea how to work in a command line environment. If they installed FreeBSD, it would be zapped off their disk within minutes of the install completing and they'd forget about it. They install and like distros like Ubuntu, which install and work the way they expect (yes, they expect a GUI desktop, etc). When they get out in industry, they remember Linux, but won't remember FreeBSD (at least not in a good way). Now, I am not suggesting that FreeBSD try and generate Ubuntu-like desktop distros. However, it might be nice if the top level web page let people know that the installs there are not desktop systems and point them to PC-BSD (or whatever other desktop distro there might be?) for a desktop install. (I know, the original poster wasn't a PC-BSD fan, but others seem happy with it. 
I'll admit I've never tried it, but then, I'm not a GUI desktop guy.:-) Just my $0.00 worth, rick Regards Steve This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk.
Re: mount_nfs does not like exports longer then 88 chars
Mark Saad wrote: On Thu, Apr 19, 2012 at 3:51 PM, Andrew Duane adu...@juniper.net wrote: MNAMELEN is used to bound the Mount NAMe LENgth, and is used in many many places. It may seem to work fine, but there are lots of utilities and such that will almost certainly fail managing it. Search the source code for MNAMELEN. I see that this is used in a number of mount and fs bits. Do you know why mount_nfs would care how long the exported path and hostname are? Well, it's copied to f_mntfromname in struct statfs. If one longer than MNAMELEN is allowed, it gets truncated when copied. I have no idea which userland apps. will get upset with a truncated value in f_mntfromname. (To change the size of f_mntfromname would require a new revision of the statfs syscall, I think?) Does this answer what you were asking? rick  ... Andrew Duane Juniper Networks +1 978-589-0551 (o) +1 603-770-7088 (m) adu...@juniper.net -Original Message- From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-hack...@freebsd.org] On Behalf Of Mark Saad Sent: Thursday, April 19, 2012 3:46 PM To: freebsd-hackers@freebsd.org Subject: mount_nfs does not like exports longer then 88 chars Hello Hackers  I was wondering if anyone has come across this issue. This exists in FreeBSD 6, 7, and 9, and probably in 8 but I am not using it at this time. When an nfs export path and host name total to more than 88 characters, mount_nfs bombs out with the following error when it attempts to mount it. mount_nfs: nyisilon2-13.grp2:/ifs/clients/www/csar884520456/files_cms-stage-BK/imagefield_default_images: File name too long I traced this down to a check in mount_nfs.c. This is about line 560 in the 7-STABLE version and 734 in the 9-STABLE version:

	/*
	 * If there has been a trailing slash at mounttime it seems
	 * that some mountd implementations fail to remove the mount
	 * entries from their mountlist while unmounting.
	 */
	for (speclen = strlen(spec);
	    speclen > 1 && spec[speclen - 1] == '/';
	    speclen--)
		spec[speclen - 1] = '\0';
	if (strlen(hostp) + strlen(spec) + 1 > MNAMELEN) {
		warnx("%s:%s: %s", hostp, spec,
		    strerror(ENAMETOOLONG));
		return (0);
	}

Does any one know why the check for hostp + spec + 1 to be less than MNAMELEN is there? I removed the check on my 9-STABLE box and it mounts the long mounts fine. I submitted a pr for this, its kern/167105 http://www.freebsd.org/cgi/query-pr.cgi?pr=167105 as there is no mention of this in the man page and I can't find any reason for the check at all. -- mark saad | nones...@longcount.org
Re: Kerberos and FreeBSD
Benjamin Kaduk wrote: On Wed, 8 Feb 2012, Ansar Mohammed wrote: Hello All, Is the port of Heimdal on FreeBSD being maintained? The version that ships with 9.0 seems a bit old. # /usr/libexec/kdc -v kdc (Heimdal 1.1.0) Copyright 1995-2008 Kungliga Tekniska Högskolan Send bug-reports to heimdal-b...@h5l.org My understanding is that every five years or so, someone becomes fed up enough with the staleness of the current version and puts in the effort to merge in a newer version. It looks like 3 years ago, dfr brought in that Heimdal 1.1 you see, to replace the Heimdal 0.6 that nectar brought in 8 years ago. I don't know of anyone with active plans to bring in a new version, at present. -Ben Kaduk I think it's a little trickier than it sounds. The Kerberos in FreeBSD isn't vanilla Heimdal 1.1, but a somewhat modified variant. Heimdal libraries have a separate source file for each function, plus a source file that defines all global storage used by functions in the library. One difference w.r.t. the FreeBSD variant that I am aware of is: - Some of the functions were moved from one library to another. (I don't know why, but maybe it was to avoid a POLA violation which would require apps to be linked with additional libraries?) - To do this, some global variables were added to the source file in the library these functions were moved to. As such, if you statically link an app. to both libraries, the global variable can come up multiply defined. (I ran into this when I was developing a gssd prior to the one introduced as part of the kernel rpc.) You can get around this by dynamically linking, being careful about the order in which the libraries are specified. (The command krb5-config --libs helps w.r.t. this.) I don't know what else was changed, but I do know that it isn't as trivial as replacing the sources with ones from a newer Heimdal release. I think it would be nice if a newer Heimdal release was brought in, with the minimal changes required to make it work.
(If that meant that apps. needed more libraries, the make files could use krb5-config --libs to handle it, I think?) Oh, and I'm not volunteering to try and do it;-) rick
Re: FreeBSD has serious problems with focus, longevity, and lifecycle
Mark Blackman wrote: On 26 Jan 2012, at 14:37, John Baldwin wrote: On Thursday, January 19, 2012 4:33:40 pm Adrian Chadd wrote: On 19 January 2012 09:47, Mark Saad nones...@longcount.org wrote: What could I do to help make 7.5-RELEASE a reality ? Put your hand up and volunteer to run the 7.5-RELEASE release cycle. That's not actually true or really fair. There has to be some buy-in from the project to do an official release; it is not something that a single person can do off in a corner and then have the Project bless the bits as an official release. And raises the interesting question for an outsider of a) who is the project in this case and b) what does it take for a release to be a release? Wasn't there a freebsd-releng (or similar) mailing list ages ago? I am going to avoid the above question, since I don't know the answer and I believe other(s) have already answered it. However, I will throw out the following comment: I can't seem to find the post, but someone suggested a release mechanism where stable/N would simply be branched when it appeared to be in good shape. Although I have no idea if this is practical for all releases, it seems that it might be a low-overhead approach for releases off old stable branches like stable/7 currently is? (i.e., since there aren't a lot of commits happening to stable/7, just branch it. You could maybe give a one/two week warning email about when this will happen. I don't think it would cause a flurry of commits like happens when code slush/freeze approaches for a new .0 one.) Just a thought, rick
Re: [ANN] host-setup 4.0 released
Devin Teske wrote: -Original Message- From: Mohacsi Janos [mailto:moha...@niif.hu] Sent: Tuesday, January 03, 2012 3:59 AM To: Devin Teske Cc: freebsd-hackers@freebsd.org; Dave Robison; Devin Teske Subject: Re: [ANN] host-setup 4.0 released Hi Devin, I had a look at the code. It is very nice, Thank you. however there are some missing elements: - IPv6 support Open to suggestions. Maybe adding an ipaddr6 below ipaddr in the interface configuration menu. Also, do you happen to know what the RFC number is for the IPv6 address format? I need to know all the special features (for example, I know you can specify ::1 for localhost, but can you simply omit octets at-will? e.g., ::ff:12:00::: ?) The basics are in RFC 4291, but I think that inet_pton(3) knows how to deal with it. (I think :: can be used once to specify the longest # of 16-bit fields that are all zeros.) After inet_pton() has translated it to a binary address, then the macros in sys/netinet6/in6.h can be used to determine if the address is a loopback, etc. I'm no IPv6 guy by any means, so others, please correct/improve on this, as required. rick - VLAN tagging support - creation/deleting How is that done these days? and how might we present it in the user interface? -- Devin Best Regards, Janos Mohacsi Head of HBONE+ project Network Engineer, Deputy Director of Network Planning and Projects NIIF/HUNGARNET, HUNGARY Key 70EF9882: DEC2 C685 1ED4 C95A 145F 4300 6F64 7B00 70EF 9882 On Mon, 2 Jan 2012, Devin Teske wrote: Hi fellow -hackers, I'd like to announce the release of a major new revision (4.0) of my FreeBSD setup utility host-setup. http://druidbsd.sourceforge.net/ Direct Link: http://druidbsd.sourceforge.net/download/host-setup.txt NOTE: Make sure to hit refresh to defeat the cache Major highlights of this version are listed on the druidbsd homepage. For those unfamiliar with my host-setup, it's a manly shell script designed to make it super-easy to configure the following: 1. Timezone 2. Hostname/Domain 3. 
Network Interface Settings 4. Default Router/Gateway 5. DNS nameservers All from an easy-to-use dialog(1) or Xdialog(1)* interface * Fully compatible and tested -- simply pass `-X' while in a usable X environment -- Devin P.S. Feedback most certainly is welcomed!
Re: Dumping core over NFS
Andrew Duane wrote: We have a strange problem in 6.2 that we're wondering if anyone else has seen. If a process is dumping core to an NFS-mounted directory, sending SIGINT, SIGTERM, or SIGKILL to that process causes NFS to wedge. The nfs_asyncio starts complaining that 20 iods are already processing the mount, but nothing makes any forward progress. Sending SIGUSR1, SIGUSR2, or SIGABRT seem to work fine, as does any signal if the core dump is going to a local filesystem. Before I dig into this apparent deadlock, just wondering if it's been seen before. The only thing I can tell you is that SIGINT and SIGTERM are signals that are handled differently by mounts with the intr option set. For this case, the client tries to make the syscall in progress fail with EINTR when one of these signals is posted. I have no idea what effect this might have on a core dump in progress or if you are using intr mounts. There was an issue in FreeBSD 8.[01] (for the intr case) where the termination signal could get the krpc code in a loop when trying to re-establish a TCP connection, because an msleep() would always return EINTR right away without waiting for the connection attempt to complete and then code outside that would just try it again and again and... This bug was fixed for FreeBSD 8.2. Obviously it's not the same bug, since FreeBSD 6 didn't have a krpc subsystem, but you might look for something similar. (i.e., a sleep(...PCATCH...) and then a caller that just tries again when it returns EINTR.) If you use intr, you might also try without intr and see if that has any effect. Good luck with it, rick ... 
Andrew Duane Juniper Networks o +1 978 589 0551 m +1 603-770-7088 adu...@juniper.net
Re: Check for 0 ino_t in readdir(3)
mdf wrote: There is a check in the function implementing readdir(3) for a zero inode number:

struct dirent *
_readdir_unlocked(dirp, skip)
	DIR *dirp;
	int skip;
{
	/* ... */
	if (dp->d_ino == 0 && skip)
		continue;
	/* ... */
}

skip is 1 except for when coming from _seekdir(3). I don't recall any requirement that a filesystem not use an inode numbered 0, though for obvious reasons it's a poor choice for a file's inode. So... is this code in libc incorrect? Or is there documentation that 0 cannot be a valid inode number for a filesystem? Well, my recollection (if I'm incorrect, please correct me:-) is that, for real BSD directories (the ones generated by UFS/FFS, which everything else is expected to emulate), the d_ino field is set to 0 when the first entry in a directory block is unlink'd. This is because directory entries are not permitted to straddle blocks, so the first entry cannot be subsumed by the last dirent in the previous block. In other words, when d_ino == 0, the dirent is free. rick
Re: Mount_nfs question
Maybe you can use showmount -a SERVER-IP, for each server you have... That might work. NFS doesn't actually have a notion of a mount, but the mount protocol daemon (typically called mountd) does try and keep track of NFSv3 mounts from the requests it sees. How well this works for NFSv3 will depend on how well the server keeps track of these things and how easily they are lost during a server reboot or similar. Since NFSv4 doesn't use the mount protocol, it will be useless for NFSv4. Thiago 2011/5/30 Mark Saad nones...@longcount.org: On Mon, May 30, 2011 at 8:13 PM, Rick Macklem rmack...@uoguelph.ca wrote: Hello All So I am stumped on this one. I want to know the IP of each nfs server that is providing each nfs export. I am running 7.4-RELEASE When I run mount -t nfs I see something like this VIP-01:/export/source on /mnt/src VIP-02:/export/target on /mnt/target VIP-01:/export/logs on /mnt/logs VIP-02:/export/package on /mnt/pkg The issue is I use a load balanced nfs server, from Isilon. So VIP-01 could be any one of a group of IPs. I am trying to track down a network congestion issue and I can't find a way to match the output of lsof and netstat to the output of mount -t nfs. Does anyone have any ideas how I could track this down; is there a way to run mount and have it show the IP and not the name of the source server? Just fire up wireshark (or tcpdump) and watch the traffic. tcpdump doesn't know much about NFS, but if all you want are the IP#s, it'll do. But, no, mount won't tell you more than what the argument looked like. rick Wireshark seems like using a tank to swat a fly. Maybe, but watching traffic isn't that scary and over the years I've discovered things I would have never expected from doing it. Like a case where one specific TCP segment was being dropped by a network switch (it was a hardware problem in the switch that didn't manifest itself any other way). Or, that one client was generating a massive number of Getattr and Lookup RPCs. 
(That one turned out to be a grad student who had made themselves an app. that had a bunch of threads continually scanning for fs changes. Not a bad idea, but the threads never took a break and continually did it.) I've always found watching traffic kinda fun, but then I'm weird, rick
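Since mount(8) only shows the name given at mount time, one low-tech complement to watching traffic is resolving that name yourself to see which addresses it can map to. A sketch using getaddrinfo(3); print_addrs() is a hypothetical helper, and the host name would of course be the VIP-01 part of f_mntfromname rather than localhost:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve a host name and print each address it maps to; returns the
 * number of addresses printed (0 on resolution failure). */
static int
print_addrs(const char *host)
{
	struct addrinfo hints, *res, *ai;
	char buf[INET6_ADDRSTRLEN];
	int n = 0;

	memset(&hints, 0, sizeof(hints));
	hints.ai_socktype = SOCK_STREAM;	/* one entry per address */
	if (getaddrinfo(host, NULL, &hints, &res) != 0)
		return (0);
	for (ai = res; ai != NULL; ai = ai->ai_next) {
		void *addr = (ai->ai_family == AF_INET)
		    ? (void *)&((struct sockaddr_in *)ai->ai_addr)->sin_addr
		    : (void *)&((struct sockaddr_in6 *)ai->ai_addr)->sin6_addr;
		if (inet_ntop(ai->ai_family, addr, buf, sizeof(buf)) != NULL) {
			printf("%s -> %s\n", host, buf);
			n++;
		}
	}
	freeaddrinfo(res);
	return (n);
}

int
main(void)
{
	print_addrs("localhost");
	return (0);
}
```

Note this only shows what the name resolves to now; with a load balancer handing out addresses, only the traffic itself (or the server's own logs) tells you which member actually served a given mount.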
Re: Mount_nfs question
Hello All So I am stumped on this one. I want to know the IP of each nfs server that is providing each nfs export. I am running 7.4-RELEASE When I run mount -t nfs I see something like this VIP-01:/export/source on /mnt/src VIP-02:/export/target on /mnt/target VIP-01:/export/logs on /mnt/logs VIP-02:/export/package on /mnt/pkg The issue is I use a load balanced nfs server, from Isilon. So VIP-01 could be any one of a group of IPs. I am trying to track down a network congestion issue and I can't find a way to match the output of lsof and netstat to the output of mount -t nfs. Does anyone have any ideas how I could track this down; is there a way to run mount and have it show the IP and not the name of the source server? Just fire up wireshark (or tcpdump) and watch the traffic. tcpdump doesn't know much about NFS, but if all you want are the IP#s, it'll do. But, no, mount won't tell you more than what the argument looked like. rick
should I use a SYSCTL_STRUCT?
Hi, I am at the point where I need to fix the -z option of nfsstat. Currently the stats are acquired/zeroed for the old NFS subsystem via sysctl. The setup in the kernel is:

SYSCTL_STRUCT(_vfs_nfs, NFS_NFSSTATS, nfsstats, CTLFLAG_RW,
    &nfsstats, nfsstats, "S,nfsstats");

The new NFS subsystem currently gets the contents of the structure via a flag on nfssvc(2). So, I could either: - add another flag for nfssvc(2) to zero the structure OR - switch the new NFS subsystem over to using a SYSCTL_STRUCT() like the above. Which do you think would be preferable? Thanks in advance for any info, rick ps: I got completely lost on the SYSCTL thread in Jan. and would rather not start another one like it:-)
Re: SMP question w.r.t. reading kernel variables
On Tue, Apr 19, 2011 at 12:00:29PM +, freebsd-hackers-requ...@freebsd.org wrote: Subject: Re: SMP question w.r.t. reading kernel variables To: Rick Macklem rmack...@uoguelph.ca Cc: freebsd-hackers@freebsd.org Message-ID: 201104181712.14457@freebsd.org [John Baldwin] On Monday, April 18, 2011 4:22:37 pm Rick Macklem wrote: On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote: ... All of this makes sense. What I was concerned about was memory cache consistency and whet (if anything) has to be done to make sure a thread doesn't see a stale cached value for the memory location. Here's a generic example of what I was thinking of: (assume x is a global int and y is a local int on the thread's stack) - time proceeds down the screen

	thread X on CPU 0		thread Y on CPU 1
	x = 0;
					x = 0; /* 0 for x's location in CPU 1's memory cache */
	x = 1;
					y = x;

-- now, is y guaranteed to be 1 or can it get the stale cached 0 value? if not, what needs to be done to guarantee it? Well, the bigger problem is getting the CPU and compiler to order the instructions such that they don't execute out of order, etc. Because of that, even if your code has 'x = 0; x = 1;' as adjacent statements in thread X, the 'x = 1' may actually execute a good bit after the 'y = x' on CPU 1. Actually, as I recall the rules for C, it's worse than that. For this (admittedly simplified scenario), x=0; in thread X may never execute unless it's declared volatile, as the compiler may optimize it out and emit no code for it. Locks force that to synchronize as the CPUs coordinate around the lock cookie (e.g. the 'mtx_lock' member of 'struct mutex'). Also, I see cases of: mtx_lock(np); np->n_attrstamp = 0; mtx_unlock(np); in the regular NFS client. Why is the assignment mutex locked? (I had assumed it was related to the above memory caching issue, but now I'm not so sure.) In general I think writes to data that are protected by locks should always be protected by locks. 
In some cases you may be able to read data using weaker locking (where no locking can be a form of weaker locking, but also a read/shared lock is weak, and if a variable is protected by multiple locks, then any single lock is weak, but sufficient for reading, while all of the associated locks must be held for writing) than writing, but writing generally requires full locking (write locks, etc.). Oops, I now see that you've differentiated between writing and reading. (I mistakenly just stated that you had recommended a lock for reading. Sorry about my misinterpretation of the above on the first quick read.) What he said. In addition to all that, lock operations generate atomic barriers which a compiler or optimizer is prevented from moving code across. All good and useful comments, thanks. The above example was meant to be contrived, to indicate what I was worried about w.r.t. memory caches. Here's a somewhat simplified version of what my actual problem is: (Mostly fyi, in case you are interested.) Thread X is doing a forced dismount of an NFS volume, it (in dounmount()): - sets MNTK_UNMOUNTF - calls VFS_SYNC()/nfs_sync() - so this doesn't get hung on an unresponsive server, it must test for MNTK_UNMOUNTF and return an error if it is set. This seems fine, since it is the same thread and in a called function. (I can't imagine that the optimizer could move setting of a global flag to after a function call which might use it.) - calls VFS_UNMOUNT()/nfs_unmount() - now the fun begins... after some other stuff, it calls nfscl_umount() to get rid of the state info (opens/locks...) nfscl_umount() - synchronizes with other threads that will use this state (see below) using the combination of a mutex and a shared/exclusive sleep lock. (Because of various quirks in the code, this shared/exclusive lock is a locally coded version and I happened to call the shared case a refcnt and the exclusive case just a lock.) Other threads that will use state info (open/lock...) 
will: - call nfscl_getcl() - this function does two things that are relevant 1 - it allocates a new clientid, as required, while holding the mutex - this case needs to check for MNTK_UNMOUNTF and return an error, in case the clientid has already been deleted by nfscl_umount() above. (This happens before #2 because the sleep lock is in the clientid structure.) -- it must see MNTK_UNMOUNTF set if it happens after (in a temporal sense) being set by dounmount() 2 - while holding the mutex, it acquires the shared lock - if this happens before nfscl_umount() gets the exclusive lock, it is fine, since acquisition of the exclusive lock above will wait for its
Re: SMP question w.r.t. reading kernel variables
[good stuff snipped for brevity] 1. Set MNTK_UNMOUNTF 2. Acquire a standard FreeBSD mutex m. 3. Update some data structures. 4. Release mutex m. Then, other threads that acquire m after step 4 has occurred will see MNTK_UNMOUNTF as set. But, other threads that beat thread X to step 2 may or may not see MNTK_UNMOUNTF as set. First off, Alan, thanks for the great explanation. I think it would be nice if this was captured somewhere in the docs, if it isn't already there somewhere (I couldn't spot it, but that doesn't mean anything:-). The question that I have about your specific scenario is concerned with VOP_SYNC(). Do you care if another thread performing nfscl_getcl() after thread X has performed VOP_SYNC() doesn't see MNTK_UNMOUNTF as set? Well, no and yes. It doesn't matter if it doesn't see it after thread X performed nfs_sync(), but it does matter that the threads calling nfscl_getcl() see it before they compete with thread X for the sleep lock. Another relevant question is Does VOP_SYNC() acquire and release the same mutex as nfscl_umount() and nfscl_getcl()? No. So, to get this to work correctly it sounds like I have to do one of the following: 1 - mtx_lock(m); mtx_unlock(m); in nfs_sync(), where m is the mutex used by nfscl_getcl() for the NFS open/lock state. or 2 - mtx_lock(m); mtx_unlock(m); mtx_lock(m); before the point where I care that the threads executing nfscl_getcl() see MNTK_UNMOUNTF set in nfscl_umount(). or 3 - mtx_lock(m2); mtx_unlock(m2); in nfscl_getcl(), where m2 is the mutex used by thread X when setting MNTK_UNMOUNTF, before mtx_lock(m); and then testing MNTK_UNMOUNTF plus acquiring the sleep lock. (By doing it before, I can avoid any LOR issue and do an msleep() without worrying about having two mutex locks.) I think #3 reads the best, so I'll probably do that one. One more question, if you don't mind. Is step 3 in your explanation necessary for this to work? 
If it is, I can just create some global variable that I assign a value to between mtx_lock(m2); mtx_unlock(m2); but it won't be used for anything, so I thought I'd check if it is necessary? Thanks again for the clear explanation, rick
Re: SMP question w.r.t. reading kernel variables
[good stuff snipped for brevity] 1. Set MNTK_UNMOUNTF 2. Acquire a standard FreeBSD mutex m. 3. Update some data structures. 4. Release mutex m. Then, other threads that acquire m after step 4 has occurred will see MNTK_UNMOUNTF as set. But, other threads that beat thread X to step 2 may or may not see MNTK_UNMOUNTF as set. First off, Alan, thanks for the great explanation. I think it would be nice if this was captured somewhere in the docs, if it isn't already there somewhere (I couldn't spot it, but that doesn't mean anything:-). The question that I have about your specific scenario is concerned with VOP_SYNC(). Do you care if another thread performing nfscl_getcl() after thread X has performed VOP_SYNC() doesn't see MNTK_UNMOUNTF as set? Well, no and yes. It doesn't matter if it doesn't see it after thread X performed nfs_sync(), but it does matter that the threads calling nfscl_getcl() see it before they compete with thread X for the sleep lock. Another relevant question is Does VOP_SYNC() acquire and release the same mutex as nfscl_umount() and nfscl_getcl()? No. So, to get this to work correctly it sounds like I have to do one of the following: 1 - mtx_lock(m); mtx_unlock(m); in nfs_sync(), where m is the mutex used by nfscl_getcl() for the NFS open/lock state. or 2 - mtx_lock(m); mtx_unlock(m); mtx_lock(m); before the point where I care that the threads executing nfscl_getcl() see MNTK_UNMOUNTF set in nfscl_umount(). or 3 - mtx_lock(m2); mtx_unlock(m2); in nfscl_getcl(), where m2 is the mutex used by thread X when setting MNTK_UNMOUNTF, before mtx_lock(m); and then testing MNTK_UNMOUNTF plus acquiring the sleep lock. (By doing it before, I can avoid any LOR issue and do an msleep() without worrying about having two mutex locks.) I think #3 reads the best, so I'll probably do that one. One more question, if you don't mind. Is step 3 in your explanation necessary for this to work? 
If it is, I can just create some global variable that I assign a value to between mtx_lock(m2); mtx_unlock(m2); but it won't be used for anything, so I thought I'd check if it is necessary? Oops, I screwed up this question. For my #3, all that needs to be done in nfscl_getcl() before I care if it sees MNTK_UNMOUNTF set is mtx_lock(m2); since that has already gone through your steps 1-4. The question w.r.t. do you really need your step 3 would apply to the cases where I was using m (the mutex nfscl_umount() and nfscl_getcl() already use instead of the one used by thread X). rick
Re: SMP question w.r.t. reading kernel variables
On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote: Hi, I should know the answer to this, but... When reading a global kernel variable, where its modifications are protected by a mutex, is it necessary to get the mutex lock to just read its value? For example:

A:
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0)
		return (EPERM);

versus B:
	MNT_ILOCK(mp);
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) {
		MNT_IUNLOCK(mp);
		return (EPERM);
	}
	MNT_IUNLOCK(mp);

My hunch is that B is necessary if you need an up-to-date value for the variable (mp->mnt_kern_flag in this case). Is that correct? You already have good followups from Attilio and Kostik, but one thing to keep in mind is that if a simple read is part of a larger atomic operation then it may still need a lock. In this case Kostik points out that another lock prevents updates to mnt_kern_flag so that this is safe. However, if not for that you would need to consider the case that another thread sets the flag on the next instruction. Even the B case above might still have that problem since you drop the lock right after checking it and the rest of the function is implicitly assuming the flag is never set perhaps (or it needs to handle the case that the flag might become set in the future while MNT_ILOCK() is dropped). One way you can make that code handle that race is by holding MNT_ILOCK() around the entire function, but that approach is often only suitable for a simple routine. All of this makes sense. What I was concerned about was memory cache consistency and whet (if anything) has to be done to make sure a thread doesn't see a stale cached value for the memory location. Here's a generic example of what I was thinking of: (assume x is a global int and y is a local int on the thread's stack) - time proceeds down the screen

	thread X on CPU 0		thread Y on CPU 1
	x = 0;
					x = 0; /* 0 for x's location in CPU 1's memory cache */
	x = 1;
					y = x;

-- now, is y guaranteed to be 1 or can it get the stale cached 0 value? 
if not, what needs to be done to guarantee it? For the original example, I am fine so long as the bit is seen as set after dounmount() has set it. Also, I see cases of: mtx_lock(np); np->n_attrstamp = 0; mtx_unlock(np); in the regular NFS client. Why is the assignment mutex locked? (I had assumed it was related to the above memory caching issue, but now I'm not so sure.) Thanks a lot for all the good responses, rick ps: I guess it comes down to whether or not atomic includes ensuring memory cache consistency. I'll admit I assumed atomic meant that the memory access or modify couldn't be interleaved with one done to the same location by another CPU, but not memory cache consistency.
Re: SMP question w.r.t. reading kernel variables
All of this makes sense. What I was concerned about was memory cache consistency and whet (if anything) has to be done to make sure a Oops, whet should have been what..
SMP question w.r.t. reading kernel variables
Hi, I should know the answer to this, but... When reading a global kernel variable, where its modifications are protected by a mutex, is it necessary to get the mutex lock to just read its value? For example:

A:
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0)
		return (EPERM);

versus B:
	MNT_ILOCK(mp);
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) {
		MNT_IUNLOCK(mp);
		return (EPERM);
	}
	MNT_IUNLOCK(mp);

My hunch is that B is necessary if you need an up-to-date value for the variable (mp->mnt_kern_flag in this case). Is that correct? Thanks in advance for help with this, rick
Re: SMP question w.r.t. reading kernel variables
On Sun, Apr 17, 2011 at 03:49:48PM -0400, Rick Macklem wrote: Hi, I should know the answer to this, but... When reading a global kernel variable, where its modifications are protected by a mutex, is it necessary to get the mutex lock to just read its value? For example: A: if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) return (EPERM); versus B: MNT_ILOCK(mp); if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) { MNT_IUNLOCK(mp); return (EPERM); } MNT_IUNLOCK(mp); My hunch is that B is necessary if you need an up-to-date value for the variable (mp->mnt_kern_flag in this case). Is that correct? A read of mnt_kern_flag is atomic on all architectures. If, as I suspect, the fragment is for the VFS_UNMOUNT() fs method, then VFS guarantees the stability of mnt_kern_flag by blocking other attempts to unmount until the current one is finished. If not, then either you do not need the lock, or the provided snippet which takes a lock is insufficient, since you are dropping the lock but continuing the action that depends on the flag not being set. Sounds like A should be ok then. The tests matter when dounmount() calls VFS_SYNC() and VFS_UNMOUNT(), pretty much as you guessed. To be honest, most of it will be the thread doing the dounmount() call, although other threads fall through VOP_INACTIVE() while they are terminating in VFS_UNMOUNT() and these need to do the test, too. I just don't know much about the SMP stuff, so I don't know when a cache on another core might still have a stale copy of a value. I've heard the term memory barrier, but don't really know what it means. :-) Thanks, rick
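The lockless read in variant A can be illustrated in userspace with C11 atomics; the kernel uses its own primitives (atomic_load_acq_int() and friends), so this is only an analogous sketch with made-up names:

```c
#include <assert.h>
#include <stdatomic.h>

/* Analogue of mnt_kern_flag: a writer sets a bit; a reader that can
 * tolerate a slightly-stale answer loads the word without the mutex. */
#define MNTK_UNMOUNTF 0x01

static _Atomic unsigned kern_flag;

/* Writer side: set the flag with release semantics so that everything
 * written before the flag becomes visible to an acquire-side reader. */
static void
set_unmount_flag(void)
{
    atomic_fetch_or_explicit(&kern_flag, MNTK_UNMOUNTF,
        memory_order_release);
}

/* Reader side: variant "A" from the mail -- no mutex, just an acquire
 * load.  The value may lag a concurrent writer, but it is never torn,
 * and once the writer's store is visible the flag is seen as set. */
static int
unmount_in_progress(void)
{
    return (atomic_load_explicit(&kern_flag, memory_order_acquire)
        & MNTK_UNMOUNTF) != 0;
}
```

The tradeoff is exactly the one in the thread: the unlocked read can observe a momentarily stale value, but never a half-written one, so it is fine when "seen as set eventually after the writer sets it" is all you need.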
Re: Getting vnode + credentials of a file from a struct mount and UFS inode #
Hi, Yes, I am.. that was my suspicion (e.g., that it was the parameters of the process which called open()/creat()/socket()/... originally). What's the quickest way to get back to the v/inode's uid/gid? Also, calling VFS_VGET() seems to give me a lockmgr panic with unknown type 0x0. VFS_VGET() returns a vnode ptr; it doesn't need the argument set to one. The flags argument (assuming a recent kernel) needs to be LK_EXCLUSIVE or LK_SHARED, not 0 (I suspect that's your panic). What is odd is that the only way I can get a vnode for VFS_VGET is through struct file, and then shouldn't I be able to use that? I tried flipping that vnode to an inode with VTOI() and it was also giving me zeros for i_uid, i_gid, etc., when it shouldn't have been. After VFS_VGET() returns a vp, I'd do a VOP_GETATTR() and then vput() the vp to release it. Look for examples of these calls in the kernel sources. The struct vattr filled in by VOP_GETATTR() has va_uid and va_gid in it, which are the uid/gid that own the file, which is what I think you are trying to get. (Credentials generally refer to the effective uid + gids etc. of the process/thread trying to do the syscall.) rick
Re: NFS: file too large
:Well, since a server specifies the maximum file size it can :handle, it seems good form to check for that in the client. :(Although I'd agree that a server shouldn't crash on a read/write :that goes beyond that limit.) : :Also, as Matt notes, off_t is signed. As such, it looks to me like :the check could mess up if uio_offset is right near 0x7fff, :so that uio->uio_offset + uio->uio_resid ends up negative. I think the :check a little above that for uio_offset < 0 should also check :uio_offset + uio_resid < 0 to avoid this. : :rick Yes, though doing an overflow check in C, at least with newer versions of GCC, requires a separate comparison. The language has been mangled pretty badly over the years.

if (a + b < a)          - can be optimized out by the compiler
if (a + b < 0)          - also can be optimized out by the compiler
x = a + b; if (x < a)   - this is ok (best method)
x = a + b; if (x < 0)   - this is ok

Ok, thanks. I'll admit to being an old K&R type guy. My question, badly written, was why not let the underlying fs (ufs, zfs, etc.) have the last word, instead of the nfsclient having to guess? Is there a problem in sending back the error? Well, the principle I try to apply in the name of interoperability is: 1 - The client should adhere to the RFCs as strictly as possible. 2 - The server should assume the loosest interpretation of the RFCs. For me #1 applies, i.e. if a server specifies a maximum file size, the client should not violate that. (Meanwhile the server should assume that clients will exceed the maximum sooner or later.) Remember that the server might be a Netapp, EMC, ... and those vendors mostly test their servers against Linux and Solaris clients. (I've tried to convince them to fire up FreeBSD systems in-house for testing and even volunteered to help with the setup, but if they've done so, I've never heard about it. Their usual response is come to connectathon. See below.)
Here's an NFSv4.0 example: RFC 3530 describes the dircount argument for Readdir as a hint of the maximum number of bytes of directory information (in the 4th para of pg 191). One vendor ships an NFSv4 client that always sets this value to 0. Their argument is that, since it is only a hint, it can be anything they feel like putting there. (Several servers crapped out because of this in the early days.) Part of the problem is that I am not in a position to attend the interoperability testing events like www.connectathon.org, where these things are usually discovered (and since they are covered under an NDA that attendees sign, I don't find out the easy way when problems occur). rick
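The overflow hazard discussed above can also be avoided without the extra temporary by checking before the addition ever happens; since signed overflow is undefined behavior in C, this form can never be optimized away. A userspace sketch; io_range_ok and OFF_MAX_ are made-up names for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel types/limits discussed in the thread. */
typedef int64_t off_t_;              /* off_t is signed */
#define OFF_MAX_ INT64_MAX

/* Return nonzero if an I/O starting at 'offset' with 'resid' bytes
 * stays within [0, maxfilesize].  The comparison is rearranged as a
 * subtraction (resid > maxfilesize - offset), so the signed sum
 * offset + resid is never formed and can never wrap. */
static int
io_range_ok(off_t_ offset, off_t_ resid, off_t_ maxfilesize)
{
    if (offset < 0 || resid < 0)
        return 0;
    if (resid > maxfilesize - offset)   /* i.e. offset + resid would
                                         * exceed maxfilesize */
        return 0;
    return 1;
}
```

With offset and maxfilesize both non-negative, maxfilesize - offset cannot overflow, so the check is safe even when offset is right at the top of the off_t range.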
Re: NFS: file too large
BTW, why not do away with the test altogether? Well, since a server specifies the maximum file size it can handle, it seems good form to check for that in the client. (Although I'd agree that a server shouldn't crash on a read/write that goes beyond that limit.) Also, as Matt notes, off_t is signed. As such, it looks to me like the check could mess up if uio_offset is right near 0x7fff, so that uio->uio_offset + uio->uio_resid ends up negative. I think the check a little above that for uio_offset < 0 should also check uio_offset + uio_resid < 0 to avoid this. rick
Re: NFS: file too large
I'm getting 'File too large' when copying via NFS (v3, tcp/udp) a file that is larger than 1T. The server is ZFS, which has no problem with large files. Is this fixable? As I understand it, there is no FreeBSD VFS op that returns the maximum file size supported. As such, the NFS servers just take a guess. You can either switch to the experimental NFS server, which guesses the largest size expressible in 64 bits, OR you can edit sys/nfsserver/nfs_serv.c and change the assignment of a value to maxfsize = XXX; at around line #3671 to a larger value. I didn't check to see if there are additional restrictions in the clients. (They should believe what the server says it can support.) rick well, after some more experimentation, it seems to be a FreeBSD client issue. if the client is linux there is no problem. Try editing line #1226 of sys/nfsclient/nfs_vfsops.c, where it sets nm_maxfilesize = (u_int64_t)0x80000000 * DEV_BSIZE - 1; and make it something larger. I have no idea why the limit is set that way. (I'm guessing it was the limit for UFS.) Hopefully not some weird buffer cache restriction or similar, but you'll find out when you try increasing it. :-) I think I'll ask freebsd-fs@ about increasing this for NFSv3 and 4, since the server does provide a limit. (The client currently only reduces nm_maxfilesize from the above initial value using the server's limit.) Just grep nm_maxfilesize *.c in sys/nfsclient and you'll see it. BTW, I 'think' I'm using the experimental server, but how can I be sure? I have the -e set for both nfs_server and mountd, I don't have option NFSD, but nfsd.ko gets loaded. You can check by: # nfsstat -s # nfsstat -e -s and see which one reports non-zero RPC counts. If you happen to be running the regular server (probably not, given the above), you need to edit the server code as well as the client side.
Good luck with it, rick
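For reference, the client's initial value in nfs_vfsops.c, (u_int64_t)0x80000000 * DEV_BSIZE - 1 with the usual DEV_BSIZE of 512, works out to exactly 1 TB minus one byte (2^31 blocks of 2^9 bytes = 2^40 bytes), which lines up with the reported failure on files larger than 1T. A quick check:

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSIZE 512   /* the usual FreeBSD value */

/* The nfs client's default cap, as set in sys/nfsclient/nfs_vfsops.c. */
static uint64_t
default_nfs_maxfilesize(void)
{
    /* 0x80000000 blocks * 512 bytes/block = 2^31 * 2^9 = 2^40 bytes. */
    return (uint64_t)0x80000000 * DEV_BSIZE - 1;
}
```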
Re: NFS: file too large
I'm getting 'File too large' when copying via NFS (v3, tcp/udp) a file that is larger than 1T. The server is ZFS, which has no problem with large files. Is this fixable? As I understand it, there is no FreeBSD VFS op that returns the maximum file size supported. As such, the NFS servers just take a guess. You can either switch to the experimental NFS server, which guesses the largest size expressible in 64 bits, OR you can edit sys/nfsserver/nfs_serv.c and change the assignment of a value to maxfsize = XXX; at around line #3671 to a larger value. I didn't check to see if there are additional restrictions in the clients. (They should believe what the server says it can support.) rick
Re: NFS Performance
Rick, do you have more details on the issue? Is it 8.x only? Can you point us to the stable thread about this? The bug is in the krpc, which means it's 8.x specific (at least for NFS; I'm not sure if the nlm used the krpc in 7.x?). David P. Discher reported a performance problem some time ago when testing the FreeBSD 8 client against certain servers. (I can't find the thread, so maybe it never had a freebsd-stable@ cc after all.) Fortunately John Gemignani spotted the cause (for at least his case, because he tested a patch that seemed to resolve the problem). The bug is basically that the client side krpc for TCP assumed that the 4 bytes of data that hold the length of the RPC message are in one mbuf and don't straddle multiple mbufs. If the 4 bytes do straddle multiple mbufs, the krpc gets a garbage message length and then typically wedges, eventually recovering by starting a fresh TCP connection and retrying the outstanding RPCs. I have no idea if George is seeing the same problem, but the 1.5-minute logjams suggest that it might be. I emailed him a patch and, hopefully, he will report back on whether or not it helped. A patch for the above bug is in the works for head, rick
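Handling the 4-byte big-endian RPC record mark correctly means accumulating bytes across receive buffers until all four have arrived, rather than assuming they sit in one mbuf. A userspace sketch of that accumulation (struct recmark and recmark_feed are made-up names for illustration, not the actual krpc code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulates the 4-byte big-endian RPC record mark even when it
 * arrives split across several receive buffers -- the case the broken
 * code assumed away by reading it from a single mbuf. */
struct recmark {
    uint8_t buf[4];
    int     have;       /* bytes collected so far */
};

/* Feed bytes from one buffer; returns the number consumed.  Once all
 * four bytes are in hand, *done is set to 1 and *len_out holds the
 * fragment length. */
static size_t
recmark_feed(struct recmark *rm, const uint8_t *p, size_t n,
    int *done, uint32_t *len_out)
{
    size_t used = 0;

    while (rm->have < 4 && used < n)
        rm->buf[rm->have++] = p[used++];
    if (rm->have == 4) {
        *len_out = ((uint32_t)rm->buf[0] << 24) |
                   ((uint32_t)rm->buf[1] << 16) |
                   ((uint32_t)rm->buf[2] << 8) |
                    (uint32_t)rm->buf[3];
        *len_out &= 0x7fffffff;  /* low 31 bits are the length; the
                                  * top bit is the last-fragment flag */
        *done = 1;
    } else
        *done = 0;
    return used;
}
```

Reading the mark from one buffer works almost all the time, which is exactly why this kind of bug sits undetected until some server's segmentation happens to split those 4 bytes.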
Re: NFS Performance
George, I remember reading there was some sort of NFS issue in 8.1-RELEASE, a regression of some sort noted early in the release. Have you tried this with 8.2-RC1, and what are your NFS client mount options? On 1/8/11, george+free...@m5p.com wrote: Among four machines on my network, I'm observing startling differences in NFS performance. All machines are AMD64, and rpc_statd, rpc_lockd, and amd are enabled on all four machines.

wonderland: hw.model: AMD Athlon(tm) II Dual-Core M32; hw.physmem: 293510758; ethernet: 100Mb/s; partition 1: FreeBSD 8.1-STABLE; partition 2: FreeBSD 7.3-STABLE
scollay: hw.model: AMD Sempron(tm) 140 Processor; hw.physmem: 186312294; ethernet: 1000Mb/s; FreeBSD 8.1-PRERELEASE
sullivan: hw.model: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+; hw.physmem: 4279980032; ethernet: 1000Mb/s; FreeBSD 7.2-RELEASE
mattapan: hw.model: AMD Sempron(tm) Processor 2600+; hw.physmem: 456380416; ethernet: 1000Mb/s; FreeBSD 7.1-RELEASE

Observed bytes per second (dd if=filename of=/dev/null bs=65536):

                      Source machine:
Destination machine:  mattapan  scollay  sullivan
wonderland/7.3          870K      5.2M     1.8M
wonderland/8.1          496K      690K     420K
mattapan                          38M      28M
scollay                  33M               33M
sullivan                 38M       5M

There is one 10/100/1000Mb/s ethernet switch between the various pairs of machines. I'm startled by the numbers for wonderland, first because of how much the 100Mb/s interface slows things down, but even more because of how much difference there is on identical hardware between FreeBSD 7 and FreeBSD 8.
Even more annoying: when running 8.1 on wonderland, NFS simply locks up at random for roughly a minute and a half under high load (such as when firefox does a gazillion locked references to my places.sqlite file), leading to entertaining log message clusters such as:

Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:17:47 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:17:47 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:01 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:01 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:08 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:08 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:21 wonderland last message repeated 2 times
Dec 29 08:20:22 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:22 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:36 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:36 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:21:05 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:21:10 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:22 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:22 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:24 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:24 wonderland last message repeated 2 times
Re: NFS server hangs (was no subject)
I have a similar problem. I have an NFS server (8.0, upgraded a couple of times since Feb 2010) that locks up and requires a reboot. The clients are busy VMs from VMware ESXi using the NFS server for vmdk virtual disk storage. ESXi reports the NFS server inactive and all the VMs post disk write errors when trying to write to their disks. /etc/rc.d/nfsd restart fails to work (it cannot kill the nfsd process). The nfsd process runs at 100% cpu at rc_lo state in top. A reboot is the only fix. It has only happened under two circumstances: 1) Installation of a VM using Windows 2008. 2) Migrating 16 million mail messages from a physical server to a VM running FreeBSD with a ZFS file system, on the ESXi box that uses NFS to store the VM's ZFS disk. The NFS server uses ZFS also. I don't think what you are seeing is the same as what others have reported. (I have a hunch that your problem might be a replay cache problem.) Please try the attached patch and make sure that your sys/rpc/svc.c is at r205562 (upgrade if it isn't). If this patch doesn't help, you could try using the experimental NFS server (which doesn't use the generic replay cache), by adding -e to mountd and nfsd. Please let me know if the patch or switching to the experimental NFS server helps, rick

--- rpc/replay.c.sav	2010-08-08 18:05:50.0 -0400
+++ rpc/replay.c	2010-08-08 18:16:43.0 -0400
@@ -90,8 +90,10 @@
 replay_setsize(struct replay_cache *rc, size_t newmaxsize)
 {
+	mtx_lock(&rc->rc_lock);
 	rc->rc_maxsize = newmaxsize;
 	replay_prune(rc);
+	mtx_unlock(&rc->rc_lock);
 }
 
 void
@@ -144,8 +146,8 @@
 	bool_t freed_one;
 
 	if (rc->rc_count >= REPLAY_MAX || rc->rc_size > rc->rc_maxsize) {
-		freed_one = FALSE;
 		do {
+			freed_one = FALSE;
 			/*
 			 * Try to free an entry.  Don't free in-progress entries
 			 */
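To see why the second hunk of the patch matters: with freed_one reset only once, outside the do loop, a later pass that frees nothing can inherit freed_one = TRUE from an earlier pass, so a while (freed_one && ...) exit condition never becomes false once only in-progress entries remain, and the pruning loop spins forever (consistent with the nfsd-at-100%-cpu symptom). A simplified userspace sketch of the corrected loop; the data structures here are made up for illustration, not the real replay cache:

```c
#include <assert.h>
#include <stdbool.h>

#define NENT 8

/* Simplified stand-in for the replay cache: entries are either
 * in-progress (must not be freed) or completed (freeable). */
struct cache {
    bool present[NENT];
    bool in_progress[NENT];
    int  count;
    int  maxcount;
};

/* Corrected prune loop, as in the patch: freed_one is reset at the top
 * of EVERY pass, so a pass that frees nothing exits the loop instead
 * of spinning on a stale TRUE left over from an earlier pass. */
static void
prune(struct cache *rc)
{
    bool freed_one;

    if (rc->count > rc->maxcount) {
        do {
            freed_one = false;          /* the fix: reset per pass */
            for (int i = 0; i < NENT; i++) {
                /* Try to free an entry; skip in-progress ones. */
                if (rc->present[i] && !rc->in_progress[i]) {
                    rc->present[i] = false;
                    rc->count--;
                    freed_one = true;
                    break;
                }
            }
        } while (freed_one && rc->count > rc->maxcount);
    }
}
```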
Re: possible NFS lockups
From: Sam Fourman On Tue, Jul 27, 2010 at 10:29 AM, krad kra...@googlemail.com wrote: I have a production mail system with an nfs backend. Every now and again we see the nfs die on a particular head end. However, it doesn't die across all the nodes. This suggests to me there isn't an issue with the filer itself, and the stats from the filer concur with that. The symptoms are lines like this appearing in dmesg: nfs server 10.44.17.138:/vol/vol1/mail: not responding nfs server 10.44.17.138:/vol/vol1/mail: is alive again trussing df, it seems to hang on getfsstat; this is presumably when it tries the nfs mounts I also have this problem, where nfs locks up on a FreeBSD 9 server and a FreeBSD RELENG_8 client If by RELENG_8 you mean 8.0 (or pre-8.1), there are a number of patches for the client side krpc. They can be found at: http://people.freebsd.org/~rmacklem/freebsd8.0-patches (These are all in FreeBSD 8.1, so ignore this if your client is already running FreeBSD 8.1.) rick ps: lock up can mean many things. The more specific you can be w.r.t. the behaviour, the more likely it can be resolved. For example: - No more access to the subtree under the mount point is possible until the client is rebooted. When you do a ps axlH, one process that was accessing a file under the mount point is shown with WCHAN rpclock and STAT DL. vs - All access to the mount point stops for about 1 minute and then recovers. Also, showing what mount options are being used by the client and whether or not rpc.lockd and rpc.statd are running can be useful. And looking at the net traffic with wireshark when it is locked up, to see if any NFS traffic is happening, can also be useful.
Re: possible NFS lockups
From: krad kra...@googlemail.com To: freebsd-hackers@freebsd.org, FreeBSD Questions freebsd-questi...@freebsd.org Sent: Tuesday, July 27, 2010 11:29:20 AM Subject: possible NFS lockups I have a production mail system with an nfs backend. Every now and again we see the nfs die on a particular head end. However, it doesn't die across all the nodes. This suggests to me there isn't an issue with the filer itself, and the stats from the filer concur with that. The symptoms are lines like this appearing in dmesg: nfs server 10.44.17.138:/vol/vol1/mail: not responding nfs server 10.44.17.138:/vol/vol1/mail: is alive again trussing df, it seems to hang on getfsstat; this is presumably when it tries the nfs mounts, eg

__sysctl(0xbfbfe224,0x2,0xbfbfe22c,0xbfbfe230,0x0,0x0) = 0 (0x0)
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1746583552 (0x681ac000)
mmap(0x682ac000,344064,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1747632128 (0x682ac000)
munmap(0x681ac000,344064) = 0 (0x0)
getfsstat(0x68201000,0x1270,0x2,0xbfbfe960,0xbfbfe95c,0x1) = 9 (0x9)

I have played with mount options a fair bit but they don't make much difference. This is what they are set to at present: 10.44.17.138:/vol/vol1/mail /mail/0 nfs rw,noatime,tcp,acdirmax=320,acdirmin=180,acregmax=320,acregmin=180 0 0 When this locking is occurring, I find that if I do a showmount, or mount 10.44.17.138:/vol/vol1/mail again under another mount point, I can access it fine. One thing I have just noticed is that lockd and statd always seem to have died when this happens. Restarting does not help lockd and statd implement separate protocols (NLM and NSM) that do locking. The protocols were poorly designed and fundamentally broken imho. (That refers to the protocols and not the implementation.)
I am not familiar with the lockd and statd implementations, but if you don't need file locking to work for the same file when accessed from multiple clients (heads) concurrently, you can use the nolockd mount option to avoid using them. (I have no idea if the mail system you are using will work without lockd or not? It should be ok to use nolockd if file locking is only done on a given file in one client node.) I suspect that some interaction between your server and the lockd/statd client causes them to crash, and then the client is stuck trying to talk to them, but I don't really know? Looking at where all the processes and threads are sleeping via ps axlH may tell you what is stuck and where. As others noted, intermittent server not responding.../server ok messages just indicate slow response from the server and don't mean much. However, if a given process is hung and doesn't recover, knowing what it is sleeping on can help w.r.t. diagnosis. rick
Re: NFS write corruption on 8.0-RELEASE
On Thu, 11 Feb 2010, John Baldwin wrote: [good stuff snipped] Case 1: single corrupted block 3779CF88-3779 (12408 bytes). Data in the block is shifted 68 bytes up, losing the first 68 bytes and filling the last 68 bytes with garbage. Interestingly, among that garbage is my hostname. Is it the hostname of the server or the client? My guess is that hades.panopticon (or something like that :-) is the client. The garbage is 4 bytes (80 00 80 84) followed by the first part of the RPC header. (Bytes 5-8 vary because they are the xid, and then the host name is part of the AUTH_SYS authenticator.) For Case 2 and Case 3, you see less of it, but it's the same stuff. Why? I have no idea, although it smells like some sort of corruption of the mbuf list. (It would be nice if you could switch to a different net interface/driver. Just a thought, since others don't seem to be seeing this?) As John said, it would be nice to try and narrow it down to client or server side, too. Don't know if this helps or is just noise, rick
Re: NFS write corruption on 8.0-RELEASE
On Fri, 12 Feb 2010, Dmitry Marakasov wrote: Interesting, I'll try disabling it. However, now I really wonder why such a dangerous option is available (given it's the cause) at all, especially without a notice. Silent data corruption is possibly the worst thing to happen, ever. I doubt that the data corruption you are seeing would be because of soft. soft will cause various problems w.r.t. consistency, but in the case of a write through the buffer cache, I think it will leave the buffer dirty and eventually it will get another write attempt. However, without the soft option NFS would be a strange thing to use - network problems are a kind of inevitable thing, and having all processes locked in an unkillable state (with hard mounts) when the network dies is not fun. Or am I wrong? Well, using NFS over an unreliable network is going to cause grief sooner or later. The problem is that POSIX apps don't expect I/O system calls to fail with EIO and generally don't handle that gracefully. For the future, I think umount -F (a forced dismount that accepts data loss) is the best compromise, since at least then a sysadmin knows that data corruption could have occurred when they do it and can choose to wait until the network is fixed as an alternative to the corruption. rick
Re: NFS write corruption on 8.0-RELEASE
On Fri, 12 Feb 2010, Dmitry Marakasov wrote: * Oliver Fromme (o...@lurza.secnetix.de) wrote: I'm sorry for the confusion ... I do not think that it's the cause for your data corruption, in this particular case. I just mentioned the potential problems with soft mounts because they could cause additional problems for you. (And it's important to know anyhow.) Oh, then I really misunderstood. If the corruption implied is like when you copy a file via NFS and the net goes down, and in the case of a soft mount you have half of a file (read: corruption), while with a hard mount the copy process will finish when the net is back up, that's definitely OK and expected. The problem is that the client can't distinguish between a slow network/server and a partitioned/failed network. In your case (one client) it may work out ok. (I can't remember how long it takes for soft to time out and give up.) For many clients talking to an NFS server, the NFS server's response time can degrade to the point where soft mounted clients start timing out, and that can get ugly. rick
Re: NFS ( amd?) dysfunction descending a hierarchy
On Tue, 9 Dec 2008, David Wolfskill wrote: On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote: I seem to have a fairly- (though not deterministically so) reproducible mode of failure with an NFS-mounted directory hierarchy: An attempt to traverse a sufficiently large hierarchy (e.g., via tar zcpf or rm -fr) will fail to visit some subdirectories, typically apparently acting as if the subdirectories in question do not actually exist (despite the names having been returned in the output of a previous readdir()). ... I was able to reproduce the external symptoms of the failure running CURRENT as of yesterday, using rm -fr of a copy of a recent /usr/ports hierarchy on an NFS-mounted file system as a test case. However, I believe the mechanism may be a bit different -- while still being other than what I would expect. One aspect in which the externally-observable symptoms were different (under CURRENT vs. RELENG_7) is that under CURRENT, once the error condition occurred, the NFS client machine was in a state where it merely kept repeating nfs server [EMAIL PROTECTED]:/volume: not responding until I logged in as root and rebooted it. The different behaviour for -CURRENT could be the newer RPC layer that was recently introduced, but that doesn't explain the basic problem. All I can think of is to ask the obvious question: Are you using interruptible or soft mounts? If so, switch to hard mounts and see if the problem goes away. (imho, neither interruptible nor soft mounts are a good idea. You can use a forced dismount if there is a crashed NFS server that isn't coming back anytime soon.) If you are getting this with hard mounts, I'm afraid I have no idea what the problem is, rick.