Re: Network stack changes
Sam Fourman Jr. wrote:
> > And any time you increase latency, that will have a negative impact
> > on NFS performance. NFS RPCs are usually small messages (except Write
> > requests and Read replies) and the RTT for these (mostly small,
> > bidirectional) messages can have a significant impact on NFS perf.
> > rick
>
> This may be a bit off topic, but not much... I have wondered, with all
> of the new TCP congestion control algorithms
> http://freebsdfoundation.blogspot.com/2011/03/summary-of-five-new-tcp-congestion.html
> which algorithm is best suited for NFS over gigabit Ethernet, say
> FreeBSD to FreeBSD. And furthermore, would an NFS-optimized TCP
> algorithm be useful?

I have no idea what effect they might have. NFS traffic is quite
different from streaming or bulk data transfer. I think this might make
a nice research project for someone.

rick

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Network stack changes
George Neville-Neil wrote:

On Aug 29, 2013, at 7:49, Adrian Chadd adr...@freebsd.org wrote:
> Hi,
>
> There's a lot of good stuff to review here, thanks!
>
> Yes, the ixgbe RX lock needs to die in a fire. It's kind of pointless
> to keep locking things like that on a per-packet basis. We should be
> able to do this in a cleaner way - we can defer RX into a CPU-pinned
> taskqueue and convert the interrupt handler to a fast handler that
> just schedules that taskqueue. We can ignore the ithread entirely
> here. What do you think?
>
> Totally pie-in-the-sky handwaving at this point:
>
> * create an array of mbuf pointers for completed mbufs;
> * populate the mbuf array;
> * pass the array up to ether_demux().
>
> For VLAN handling, it may end up populating its own list of mbufs to
> push up to ether_demux(). So maybe we should extend the API to have a
> bitmap of packets to actually handle from the array, so we can pass up
> a larger array of mbufs, note which ones are for the destination, and
> then the upcall can mark which frames it has consumed.
>
> I specifically wonder how much work/benefit we may see by doing:
>
> * batching packets into lists, so various steps can batch-process
>   things rather than run to completion;
> * batching the processing of a list of frames under a single lock
>   instance - e.g., if the forwarding code could do the forwarding
>   lookup for 'n' packets under a single lock, then pass that list of
>   frames up to inet_pfil_hook() to do the work under one lock, etc.
>
> Here, the processing would look less like "grab lock and process to
> completion" and more like "mark and sweep" - i.e., we have a list of
> frames that we mark as needing processing and as having been processed
> at each layer, so we know where to next dispatch them.

One quick note here. Every time you increase batching you may increase
bandwidth, but you will also increase per-packet latency for the last
packet in a batch. That is fine, so long as we remember that and treat
it as a tuning knob to balance the two.
> > And any time you increase latency, that will have a negative impact
> > on NFS performance. NFS RPCs are usually small messages (except
> > Write requests and Read replies) and the RTT for these (mostly
> > small, bidirectional) messages can have a significant impact on NFS
> > perf.
> > rick
>
> I still have some tool coding to do with PMC before I even think about
> tinkering with this, as I'd like to measure stuff like per-packet
> latency as well as top-level processing overhead (i.e.,
> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
> interrupts on that core, etc.)

This would be very useful in identifying the actual hot spots, and
would be helpful to anyone who can generate a decent stream of packets
with, say, an IXIA.

Best,
George
review of patches for the gssd that handle getpwXX_r ERANGE return
Hi,

I have attached two patches, which can also be found at:
http://people.freebsd.org/~rmacklem/getpw.patch1 and getpw.patch2

They are almost identical and handle the ERANGE error return from
getpw[nam|uid]_r() when buf[128] isn't large enough. Is anyone
interested in reviewing these?

(This has been discussed some time ago, but the patch was never
reviewed. Actually, I reviewed a patch similar to this, but the
submitter subsequently requested that I not use their patch, so I wrote
similar ones.)

Thanks in advance for any review, rick

--- usr.sbin/gssd/gssd.c.sav	2013-04-26 20:38:45.0 -0400
+++ usr.sbin/gssd/gssd.c	2013-04-26 20:38:53.0 -0400
@@ -37,6 +37,7 @@ __FBSDID("$FreeBSD: head/usr.sbin/gssd/g
 #include <ctype.h>
 #include <dirent.h>
 #include <err.h>
+#include <errno.h>
 #ifndef WITHOUT_KERBEROS
 #include <krb5.h>
 #endif
@@ -557,8 +558,11 @@ gssd_pname_to_uid_1_svc(pname_to_uid_arg
 {
 	gss_name_t name = gssd_find_resource(argp->pname);
 	uid_t uid;
-	char buf[128];
+	char buf[1024], *bufp;
 	struct passwd pwd, *pw;
+	size_t buflen;
+	int error;
+	static size_t buflen_hint = 1024;
 
 	memset(result, 0, sizeof(*result));
 	if (name) {
@@ -567,7 +571,24 @@ gssd_pname_to_uid_1_svc(pname_to_uid_arg
 		    name, argp->mech, &uid);
 		if (result->major_status == GSS_S_COMPLETE) {
 			result->uid = uid;
-			getpwuid_r(uid, &pwd, buf, sizeof(buf), &pw);
+			buflen = buflen_hint;
+			for (;;) {
+				pw = NULL;
+				bufp = buf;
+				if (buflen > sizeof(buf))
+					bufp = malloc(buflen);
+				if (bufp == NULL)
+					break;
+				error = getpwuid_r(uid, &pwd, bufp, buflen,
+				    &pw);
+				if (error != ERANGE)
+					break;
+				if (buflen > sizeof(buf))
+					free(bufp);
+				buflen += 1024;
+				if (buflen > buflen_hint)
+					buflen_hint = buflen;
+			}
 			if (pw) {
 				int len = NGRPS;
 				int groups[NGRPS];
@@ -584,6 +605,8 @@ gssd_pname_to_uid_1_svc(pname_to_uid_arg
 				result->gidlist.gidlist_len = 0;
 				result->gidlist.gidlist_val = NULL;
 			}
+			if (bufp != NULL && buflen > sizeof(buf))
+				free(bufp);
 		}
 	} else {
 		result->major_status = GSS_S_BAD_NAME;

--- kerberos5/lib/libgssapi_krb5/pname_to_uid.c.sav	2013-04-26 20:37:45.0 -0400
+++ kerberos5/lib/libgssapi_krb5/pname_to_uid.c	2013-04-27 16:25:14.0 -0400
@@ -26,6 +26,7 @@
  */
 /* $FreeBSD: head/kerberos5/lib/libgssapi_krb5/pname_to_uid.c 181344 2008-08-06 14:02:05Z dfr $ */
 
+#include <errno.h>
 #include <pwd.h>
 
 #include <krb5/gsskrb5_locl.h>
@@ -37,8 +38,12 @@ _gsskrb5_pname_to_uid(OM_uint32 *minor_s
 	krb5_context context;
 	krb5_const_principal name = (krb5_const_principal) pname;
 	krb5_error_code kret;
-	char lname[MAXLOGNAME + 1], buf[128];
+	char lname[MAXLOGNAME + 1], buf[1024], *bufp;
 	struct passwd pwd, *pw;
+	size_t buflen;
+	int error;
+	OM_uint32 ret;
+	static size_t buflen_hint = 1024;
 
 	GSSAPI_KRB5_INIT (&context);
@@ -49,11 +54,30 @@ _gsskrb5_pname_to_uid(OM_uint32 *minor_s
 	}
 	*minor_status = 0;
-	getpwnam_r(lname, &pwd, buf, sizeof(buf), &pw);
+	buflen = buflen_hint;
+	for (;;) {
+		pw = NULL;
+		bufp = buf;
+		if (buflen > sizeof(buf))
+			bufp = malloc(buflen);
+		if (bufp == NULL)
+			break;
+		error = getpwnam_r(lname, &pwd, bufp, buflen, &pw);
+		if (error != ERANGE)
+			break;
+		if (buflen > sizeof(buf))
+			free(bufp);
+		buflen += 1024;
+		if (buflen > buflen_hint)
+			buflen_hint = buflen;
+	}
 	if (pw) {
 		*uidp = pw->pw_uid;
-		return (GSS_S_COMPLETE);
+		ret = GSS_S_COMPLETE;
 	} else {
-		return (GSS_S_FAILURE);
+		ret = GSS_S_FAILURE;
 	}
+	if (bufp != NULL && buflen > sizeof(buf))
+		free(bufp);
+	return (ret);
 }
Re: stupid UFS behaviour on random writes
Stefan Esser wrote:
> On 18.01.2013 00:01, Rick Macklem wrote:
> > Wojciech Puchar wrote:
> > > create 10GB file (on 2GB RAM machine, with some swap used, to make
> > > sure little cache would be available for the filesystem):
> > >
> > >   dd if=/dev/zero of=file bs=1m count=10k
> > >
> > > block size is 32KB, fragment size 4k. now test random read access
> > > to it (10 threads):
> > >
> > >   randomio test 10 0 0 4096
> > >
> > > normal result on such a not-so-fast disk in my laptop:
> > >
> > >   118.5 | 118.5  5.8  82.3  383.2  85.6 |  0.0  inf  nan  0.0  nan
> > >   138.4 | 138.4  3.9  72.2  499.7  76.1 |  0.0  inf  nan  0.0  nan
> > >   142.9 | 142.9  5.4  69.9  297.7  60.9 |  0.0  inf  nan  0.0  nan
> > >   133.9 | 133.9  4.3  74.1  480.1  75.1 |  0.0  inf  nan  0.0  nan
> > >   138.4 | 138.4  5.1  72.1  380.0  71.3 |  0.0  inf  nan  0.0  nan
> > >   145.9 | 145.9  4.7  68.8  419.3  69.6 |  0.0  inf  nan  0.0  nan
> > >
> > > systat shows 4kB I/O size. all is fine. BUT random 4kB writes:
> > >
> > >   randomio test 10 1 0 4096
> > >
> > >   total |  read:  latency (ms)          |  write:  latency (ms)
> > >    iops |  iops  min  avg   max    sdev |  iops  min  avg    max     sdev
> > >   ------+-------------------------------+--------------------------------
> > >    38.5 |   0.0  inf  nan   0.0    nan  |  38.5  9.0  166.5  1156.8  261.5
> > >    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  0.1  251.2  2616.7  492.7
> > >    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  7.6  178.3  1895.4  330.0
> > >    45.0 |   0.0  inf  nan   0.0    nan  |  45.0  0.0  239.8  3457.4  522.3
> > >    45.5 |   0.0  inf  nan   0.0    nan  |  45.5  0.1  249.8  5126.7  621.0
> > >
> > > results are horrific. systat shows 32kB I/O; gstat shows half are
> > > reads, half are writes. Why does UFS need to read the full block,
> > > change one 4kB part, and then write it back, instead of just
> > > writing the 4kB part?
> > Because that's the way the buffer cache works. It writes an entire
> > buffer cache block (unless at the end of file), so it must read the
> > rest of the block into the buffer, so that it doesn't write garbage
> > (the rest of the block) out.
> Without having looked at the code or testing: I assume using O_DIRECT
> when opening the file should help for that particular test (on kernels
> compiled with "options DIRECTIO").
> > I'd argue that using an I/O size smaller than the file system block
> > size is simply sub-optimal and that most apps don't do random I/O of
> > blocks. OR, if you had an app that does random I/O of 4K blocks (at
> > 4K byte offsets), then using a 4K/1K file system would be better.
> A 4k/1k file system has higher overhead (more indirect blocks) and is
> clearly sub-optimal for most general uses, today.

Yes, but if the sysadmin knows that most of the I/O is random 4K
blocks, that's his specific case, not a general use. Sorry, I didn't
mean to imply that a 4K file system was a good choice in general.

> > NFS is the exception, in that it keeps track of a dirty byte range
> > within a buffer cache block and writes that byte range. (NFS writes
> > are byte granular, unlike a disk.)
> It should be easy to add support for a fragment mask to the buffer
> cache, which allows identifying valid fragments. Such a mask should be
> set to 0xff for all current uses of the buffer cache (meaning the full
> block is valid), but a special case could then be added for writes of
> exactly one or multiple fragments, where only the corresponding valid
> flag bits were set. In addition, a possible later read from disk must
> obviously skip fragments for which the valid mask bits are already
> set. This bit mask could then be used to update the affected fragments
> only, without a read-modify-write of the containing block. But I doubt
> that such a change would improve performance in the general case, just
> in random update scenarios (which might still be relevant, in case of
> a DBMS knowing the fragment size and using it for DB files).
>
> Regards, Stefan

Yes. And for some I/O patterns the fragment change would degrade
performance. You mentioned that a later read might have to skip
fragments with the valid bit set. I think this would translate to doing
multiple reads for the other fragments, in practice. Also, when an app
goes to write a partial fragment, that fragment would have to be read
in, and this could result in several reads of fragments instead of one
read for the entire block. It's the old "the OS doesn't have a crystal
ball that predicts future I/O activity" problem.

Btw, although I did a dirty byte range for NFS in the buffer cache ages
ago (late 1980s), it is also a performance hit for certain cases. The
linker/loaders love to write random-sized chunks to files. For the NFS
code, if the new write isn't contiguous with the old one, a synchronous
write of the old dirty byte range is forced to the server. I have a
patch that replaces the single byte range with a list in order to avoid
this synchronous write, but it has not made it into head. (I hope to do
so someday, after more testing and when I figure out all the implications
Re: stupid UFS behaviour on random writes
Wojciech Puchar wrote:
> create 10GB file (on 2GB RAM machine, with some swap used, to make
> sure little cache would be available for the filesystem):
>
>   dd if=/dev/zero of=file bs=1m count=10k
>
> block size is 32KB, fragment size 4k. now test random read access to
> it (10 threads):
>
>   randomio test 10 0 0 4096
>
> normal result on such a not-so-fast disk in my laptop:
>
>   118.5 | 118.5  5.8  82.3  383.2  85.6 |  0.0  inf  nan  0.0  nan
>   138.4 | 138.4  3.9  72.2  499.7  76.1 |  0.0  inf  nan  0.0  nan
>   142.9 | 142.9  5.4  69.9  297.7  60.9 |  0.0  inf  nan  0.0  nan
>   133.9 | 133.9  4.3  74.1  480.1  75.1 |  0.0  inf  nan  0.0  nan
>   138.4 | 138.4  5.1  72.1  380.0  71.3 |  0.0  inf  nan  0.0  nan
>   145.9 | 145.9  4.7  68.8  419.3  69.6 |  0.0  inf  nan  0.0  nan
>
> systat shows 4kB I/O size. all is fine. BUT random 4kB writes:
>
>   randomio test 10 1 0 4096
>
>   total |  read:  latency (ms)          |  write:  latency (ms)
>    iops |  iops  min  avg   max    sdev |  iops  min  avg    max     sdev
>   ------+-------------------------------+--------------------------------
>    38.5 |   0.0  inf  nan   0.0    nan  |  38.5  9.0  166.5  1156.8  261.5
>    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  0.1  251.2  2616.7  492.7
>    44.0 |   0.0  inf  nan   0.0    nan  |  44.0  7.6  178.3  1895.4  330.0
>    45.0 |   0.0  inf  nan   0.0    nan  |  45.0  0.0  239.8  3457.4  522.3
>    45.5 |   0.0  inf  nan   0.0    nan  |  45.5  0.1  249.8  5126.7  621.0
>
> results are horrific. systat shows 32kB I/O; gstat shows half are
> reads, half are writes. Why does UFS need to read the full block,
> change one 4kB part, and then write it back, instead of just writing
> the 4kB part?

Because that's the way the buffer cache works. It writes an entire
buffer cache block (unless at the end of file), so it must read the
rest of the block into the buffer, so that it doesn't write garbage
(the rest of the block) out.

I'd argue that using an I/O size smaller than the file system block
size is simply sub-optimal and that most apps don't do random I/O of
blocks. OR, if you had an app that does random I/O of 4K blocks (at 4K
byte offsets), then using a 4K/1K file system would be better.

NFS is the exception, in that it keeps track of a dirty byte range
within a buffer cache block and writes that byte range.
(NFS writes are byte granular, unlike a disk.)
Re: iSCSI vs. SMB with ZFS.
Wojciech Puchar wrote:
> > With a network file system (either SMB or NFS, it doesn't matter),
> > you need to ask the server in *each* of the following situations:
> > * to ask the server if a file has been changed, so the client can
> >   use cached data (if the protocol supports it)
> > * to ask the server if a file (or a portion of a file) has been
> >   locked by another client
> not really. if there is only one user of a file, then Windows knows
> this, but changes to the behaviour you described when there are more
> users. AND FINALLY, the latter behaviour has failed to work properly
> since Windows XP (it worked fine with Windows 98). If you use programs
> that read/write shared files, you may be sure data corruption will
> happen. you have to set
>   locking = yes
>   oplocks = no
>   level2 oplocks = no
> to make it work properly, but even more slowly!

Btw, NFSv4 has delegations, which are essentially level2 oplocks. They
can be enabled for a server if the volumes exported via NFSv4 are not
being accessed locally (including Samba). For them to work, the nfscbd
needs to be running on the client(s), and the clients must have IP
addresses visible to the server for a callback TCP connection (no
firewalls or NAT gateways). Even with delegations working, the client
caching is limited to the buffer cache. I have an experimental patch
that uses on-disk caching in the client for delegated files (I call it
"packrats"), but it is not ready for production use. Now that I have
the 4.1 client in place, I plan to get back to working on it.

rick

> > This basically means that for almost every single I/O, you need to
> > ask the server for something, which involves network traffic and
> > round-trip delays.
> Not that. The problem is that Windows does not use all free memory for
> caching, as it does with a local or local (iSCSI) disk.
Re: iSCSI vs. SMB with ZFS.
Zaphod Beeblebrox wrote:
> Does Windows 7 support NFSv4, then? Is it expected (i.e., is it
> worthwhile trying) that NFSv4 would perform at a similar speed to
> iSCSI? It would seem that this at least requires Active Directory (or
> this user name mapping ... which I remember being hard).

As far as I know, there is no NFSv4 in Windows. I only made the comment
(which I admit was a bit off topic) because the previous post had
stated "SMB or NFS, they're the same", or something like that. There
was work on an NFSv4 client for Windows being done by CITI at the
University of Michigan, funded by Microsoft Research, but I have no
idea if it was ever released.

rick
request for review: gssd patch for alternate cred cache files
Hi,

A couple of people have reported problems using NFS mounts with
sec=krb5, because the version of sshd they use doesn't use credential
cache files named /tmp/krb5cc_N. The attached patch modifies the gssd
so that, when a new -s option is used, it roughly emulates what the
gssd used by most Linux distros does. This has been tested by the
reporters and fixed their issue. Would someone like to review this?

rick
ps: The patch can also be found at
http://people.freebsd.org/~rmacklem/gssd-ccache.patch

--- usr.sbin/gssd/gssd.c.sav2	2012-10-08 16:49:50.0 -0400
+++ usr.sbin/gssd/gssd.c	2012-12-12 19:19:51.0 -0500
@@ -35,7 +35,9 @@ __FBSDID("$FreeBSD: head/usr.sbin/gssd/g
 #include <sys/queue.h>
 #include <sys/syslog.h>
 #include <ctype.h>
+#include <dirent.h>
 #include <err.h>
+#include <krb5.h>
 #include <pwd.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -64,8 +66,12 @@ int gss_resource_count;
 uint32_t gss_next_id;
 uint32_t gss_start_time;
 int debug_level;
+static char ccfile_dirlist[PATH_MAX + 1], ccfile_substring[NAME_MAX + 1];
+static char pref_realm[1024];
 
 static void gssd_load_mech(void);
+static int find_ccache_file(const char *, uid_t, char *);
+static int is_a_valid_tgt_cache(const char *, uid_t, int *, time_t *);
 
 extern void gssd_1(struct svc_req *rqstp, SVCXPRT *transp);
 extern int gssd_syscall(char *path);
@@ -82,14 +88,45 @@ main(int argc, char **argv)
 	int fd, oldmask, ch, debug;
 	SVCXPRT *xprt;
 
+	/*
+	 * Initialize the credential cache file name substring and the
+	 * search directory list.
+	 */
+	strlcpy(ccfile_substring, "krb5cc_", sizeof(ccfile_substring));
+	ccfile_dirlist[0] = '\0';
+	pref_realm[0] = '\0';
 	debug = 0;
-	while ((ch = getopt(argc, argv, "d")) != -1) {
+	while ((ch = getopt(argc, argv, "ds:c:r:")) != -1) {
 		switch (ch) {
 		case 'd':
 			debug_level++;
 			break;
+		case 's':
+			/*
+			 * Set the directory search list. This enables use of
+			 * find_ccache_file() to search the directories for a
+			 * suitable credentials cache file.
+			 */
+			strlcpy(ccfile_dirlist, optarg, sizeof(ccfile_dirlist));
+			break;
+		case 'c':
+			/*
+			 * Specify a non-default credential cache file
+			 * substring.
+			 */
+			strlcpy(ccfile_substring, optarg,
+			    sizeof(ccfile_substring));
+			break;
+		case 'r':
+			/*
+			 * Set the preferred realm for the credential cache tgt.
+			 */
+			strlcpy(pref_realm, optarg, sizeof(pref_realm));
+			break;
 		default:
-			fprintf(stderr, "usage: %s [-d]\n", argv[0]);
+			fprintf(stderr,
+			    "usage: %s [-d] [-s dir-list] [-c file-substring]"
+			    " [-r preferred-realm]\n", argv[0]);
 			exit(1);
 			break;
 		}
@@ -267,13 +304,36 @@ gssd_init_sec_context_1_svc(init_sec_con
 	gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;
 	gss_ctx_id_t ctx = GSS_C_NO_CONTEXT;
 	gss_name_t name = GSS_C_NO_NAME;
-	char ccname[strlen("FILE:/tmp/krb5cc_") + 6 + 1];
+	char ccname[PATH_MAX + 5 + 1], *cp, *cp2;
+	int gotone;
 
-	snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
-	    (int) argp->uid);
+	memset(result, 0, sizeof(*result));
+	if (ccfile_dirlist[0] != '\0' && argp->cred == 0) {
+		gotone = 0;
+		cp = ccfile_dirlist;
+		do {
+			cp2 = strchr(cp, ':');
+			if (cp2 != NULL)
+				*cp2 = '\0';
+			gotone = find_ccache_file(cp, argp->uid, ccname);
+			if (gotone != 0)
+				break;
+			if (cp2 != NULL)
+				*cp2++ = ':';
+			cp = cp2;
+		} while (cp != NULL && *cp != '\0');
+		if (gotone == 0)
+			snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+			    (int) argp->uid);
+	} else
+		/*
+		 * If there wasn't a -s option or the credentials have
+		 * been provided as an argument, do it the old way.
+		 */
+		snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+		    (int) argp->uid);
 	setenv("KRB5CCNAME", ccname, TRUE);
-	memset(result, 0, sizeof(*result));
 	if (argp->cred) {
 		cred = gssd_find_resource(argp->cred);
 		if (!cred) {
@@ -516,13 +576,37 @@ gssd_acquire_cred_1_svc(acquire_cred_arg
 {
 	gss_name_t desired_name = GSS_C_NO_NAME;
 	gss_cred_id_t cred;
-	char ccname[strlen("FILE:/tmp/krb5cc_") + 6 + 1];
+	char ccname[PATH_MAX + 5 + 1], *cp, *cp2;
+	int gotone;
 
-	snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
-	    (int) argp->uid);
+	memset(result, 0, sizeof(*result));
+	if (ccfile_dirlist[0] != '\0' && argp->desired_name == 0) {
+		gotone = 0;
+		cp = ccfile_dirlist;
+		do {
+			cp2 = strchr(cp, ':');
+			if (cp2 != NULL)
+				*cp2 = '\0';
+			gotone = find_ccache_file(cp, argp->uid, ccname);
+			if (gotone != 0)
+				break;
+			if (cp2 != NULL)
+				*cp2++ = ':';
+			cp = cp2;
+		} while (cp != NULL && *cp != '\0');
+		if (gotone == 0)
+			snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+			    (int) argp->uid);
+	} else
+		/*
+		 * If there wasn't a -s option or the name has
+		 * been provided as an argument, do it the old way.
+		 * (The name is provided for host based initiator credentials.)
+		 */
+		snprintf(ccname, sizeof(ccname), "FILE:/tmp/krb5cc_%d",
+		    (int) argp->uid);
 	setenv("KRB5CCNAME", ccname, TRUE);
-	memset(result,
any arch not pack uint32_t x[2]?
Hi,

The subject line pretty well says it. I am about ready to commit the
NFSv4.1 client patches, but I had better ask this dumb question first.

Is there any architecture where:
	uint32_t x[2];
isn't packed? (Or, sizeof(x) != 8, if you prefer.)

As you might have guessed, if the answer is yes, I have some code
fixin' to do, rick
naming a .h file for kernel use only
Hi,

For my NFSv4.1 client work, I've taken a few definitions out of a
kernel RPC .c file and put them in a .h file, so that they can be
included in other sys/rpc .c files. I've currently named the file
_krpc.h. I thought I'd check whether this is a reasonable name before
doing the big commit of the NFSv4.1 stuff to head. (I have a vague
notion that a leading _ indicates "not for public use", but I am not
sure.)

Thanks in advance for naming suggestions for this file, rick
Re: NFS server bottlenecks
Ivan Voras wrote:
> On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote:
> > Here are the results from testing both patches:
> > http://home.totalterror.net/freebsd/nfstest/results.html
> > Both tests ran for about 14 hours (a bit too much, but I wanted to
> > compare different zfs recordsize settings), and were done first
> > after a fresh reboot. The only noticeable difference seems to be
> > many more context switches with Ivan's patch.
>
> Thank you very much for your extensive testing! I don't know how to
> interpret the rise in context switches; as this is kernel code, I'd
> expect no context switches. I hope someone else can explain.
>
> But you have also shown that my patch doesn't do any better than
> Rick's, even on a fairly large configuration, so I don't think there's
> value in adding the extra complexity, and Rick knows NFS much better
> than I do.
>
> But there are a few other things I'm interested in: like why does your
> load average spike almost to 20, and how come that with 24 drives in
> RAID-10 you only push 600 Mbit/s through the 10 Gbit/s Ethernet? Have
> you tested your drive setup locally (AESNI shouldn't be a bottleneck;
> you should be able to encrypt well into the Gbyte/s range) and the
> network?
>
> If you have the time, could you repeat the tests with a recent Samba
> server and a CIFS mount on the client side? This is probably not
> important, but I'm just curious how it would perform on your machine.

Oh, I realized that, if you are testing 9/stable (and not head), you
won't have r227809. Without that, all reads on a given file will be
serialized, because the server will acquire an exclusive lock on the
vnode. The patch for r227809 in head is at:
http://people.freebsd.org/~rmacklem/lkshared.patch
This should apply fine to a 9 system (but not 8.n), I think.
Good luck with it and have fun, rick
Re: NFS server bottlenecks
Ivan Voras wrote:
> On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote:
> > Here are the results from testing both patches:
> > http://home.totalterror.net/freebsd/nfstest/results.html
> > Both tests ran for about 14 hours (a bit too much, but I wanted to
> > compare different zfs recordsize settings), and were done first
> > after a fresh reboot. The only noticeable difference seems to be
> > many more context switches with Ivan's patch.
>
> Thank you very much for your extensive testing! I don't know how to
> interpret the rise in context switches; as this is kernel code, I'd
> expect no context switches. I hope someone else can explain.

Don't the mtx_lock() calls spin for a little while and then context
switch if another thread still has it locked?

> But you have also shown that my patch doesn't do any better than
> Rick's, even on a fairly large configuration, so I don't think there's
> value in adding the extra complexity, and Rick knows NFS much better
> than I do.

Hmm, I didn't look, but were there any tests using UDP mounts? (I would
have thought that your patch would mainly affect UDP mounts, since that
is when my version still has the single LRU queue/mutex. As I think you
know, my concern with your patch would be correctness for UDP, not
performance.)

Anyhow, it sounds like you guys are having fun with it and learning
some useful things. Keep up the good work, rick

> But there are a few other things I'm interested in: like why does your
> load average spike almost to 20, and how come that with 24 drives in
> RAID-10 you only push 600 Mbit/s through the 10 Gbit/s Ethernet? Have
> you tested your drive setup locally (AESNI shouldn't be a bottleneck;
> you should be able to encrypt well into the Gbyte/s range) and the
> network?
>
> If you have the time, could you repeat the tests with a recent Samba
> server and a CIFS mount on the client side? This is probably not
> important, but I'm just curious how it would perform on your machine.
Re: NFS server bottlenecks
Outback Dingo wrote:
> On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote:
> > On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote:
> > > Ivan Voras wrote:
> > > > I don't know how to interpret the rise in context switches; as
> > > > this is kernel code, I'd expect no context switches. I hope
> > > > someone else can explain.
> > > Don't the mtx_lock() calls spin for a little while and then
> > > context switch if another thread still has it locked?
> > Yes, but are in-kernel context switches also counted? I was assuming
> > they are light-weight enough not to count.
> > > Hmm, I didn't look, but were there any tests using UDP mounts? (I
> > > would have thought that your patch would mainly affect UDP mounts,
> > > since that is when my version still has the single LRU queue/mutex.
> > Another assumption - I thought UDP was the default.

TCP has been the default for a FreeBSD client for a long time. It was
changed for the old NFS client before I became a committer. (You can
explicitly set one or the other as mount options, or check via
wireshark/tcpdump.)

> > > As I think you know, my concern with your patch would be
> > > correctness for UDP, not performance.)
> > Yes.
>
> I've got a similar box config here, with 2x 10GB Intel NICs and 24 2TB
> drives on an LSI controller. I'm watching the thread patiently; I'm
> kind of looking for results and answers, though I'm also tempted to
> run benchmarks on my system to see if I get similar results. I also
> considered that netmap might be one, but I'm not quite sure if it
> would help NFS, since it's hard to tell if it's a network bottleneck,
> though it appears to be network related.

NFS network traffic looks very different from a TCP stream (a la bit
torrent or ...). I've seen this cause issues before. You can look at a
packet trace in wireshark and see if TCP is retransmitting segments.
rick
Re: NFS server bottlenecks
Ivan Voras wrote:
> On 13/10/2012 17:22, Nikolay Denev wrote:
> > drc3.patch applied and built cleanly and shows nice improvement!
> > I've done a quick benchmark using iozone over the NFS mount from the
> > Linux host.
>
> Hi,
>
> If you are already testing, could you please also test this patch:
> http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch
> It should apply to HEAD without Rick's patches. It's a bit different
> approach than Rick's, breaking down locks even more.

I don't think (it is hard to test this) your trim cache algorithm will
choose the correct entries to delete. The problem is that UDP entries
very seldom time out (unless the NFS server is seeing hardly any load)
and are mostly trimmed because the size exceeds the highwater mark.
With your code, it will clear out all of the entries in the first hash
buckets that aren't currently busy, until the total count drops below
the high water mark. (If you monitor a busy server with "nfsstat -e -s",
you'll see the cache never goes below the high water mark, which is 500
by default.) This would delete entries for fairly recent requests.

If you are going to replace the global LRU list with one for each hash
bucket, then you'll have to compare the time stamps on the least
recently used entries of all the hash buckets and delete those. If you
keep the timestamp of the least recent entry for each hash bucket in
the hash bucket head, you could at least use that to select which
bucket to delete from next, but you'll still need to:
- lock that hash bucket
- delete a few entries from that bucket's LRU list
- unlock the hash bucket
- repeat for various buckets until the count is below the high water
  mark
Or something like that. I think you'll find it a lot more work than one
LRU list and one mutex. Remember that the mutex isn't held for long.

Btw, the code looks very nice. (If I were being a style(9) zealot, I'd
remind you that it likes "return (X);" and not "return X;".)

rick
Re: NFS server bottlenecks
Ivan Voras wrote: On 15 October 2012 22:58, Rick Macklem rmack...@uoguelph.ca wrote: The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the highwater mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries for fairly recent requests. You are right about that; if testing by Nikolay goes reasonably well, I'll work on that. If you are going to replace the global LRU list with one for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent entry for a hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:
- lock that hash bucket
- delete a few entries from that bucket's LRU list
- unlock the hash bucket
- repeat for various buckets until the count is below the high water mark
Ah, I think I get it: is the reliance on the high watermark as a criterion for cache expiry the reason the list is an LRU instead of an ordinary unordered list? Yes, I think you've got it;-) Have fun with it, rick Or something like that. I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long. It could be, but the current state of my code is just groundwork for the next things I have in plan: 1) Move the expiry code (the trim function) into a separate thread, run periodically (or as a callout; I'll need to talk with someone about which one is cheaper) 2) Replace the mutex with a rwlock.
The only thing which is preventing me from doing this right away is the LRU list, since each read access modifies it (and requires a write lock). This is why I was asking you if we can do away with the LRU algorithm. Btw, the code looks very nice. (If I was being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.) Thanks, I'll make it more style(9) compliant as I go along.
Re: NFS server bottlenecks
Garrett Wollman wrote: On Fri, 12 Oct 2012 22:05:54 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect. Well, I'll admit I don't know how to do this. What the code does need is a set of mutexes, where any of the mutexes can be referred to by an index. I could easily define a structure that has:
struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];
- but all that does is leave a small structure between each struct mtx, and I wouldn't have thought that would make much difference. (How big is a typical hardware cache line these days? I have no idea.)
- I suppose I could waste space and define a glob of unused space between them, like:
struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	char garbage[N];
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];
- If this makes sense, how big should N be? (Somewhat less than the length of a cache line, I'd guess. It seems that the structure should be at least a cache line length in size.)
All this seems kinda hokey to me and beyond what code at this level should be worrying about, but I'm game to make changes, if others think it's appropriate. I've never used mtx_pool(9) mutexes, but it doesn't sound like they would be the right fit, from reading the man page. (Assuming mtx_pool_find() is guaranteed to return the same mutex for the same address passed in as an argument, it would seem that they would work, since I can pass &nfsrvcachehead[i] in as the pointer arg to index a mutex.)
Hopefully jhb@ can say if using mtx_pool(9) for this would be better than an array: struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE]; Does anyone conversant with mutexes know what the best coding approach is? (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the drc2 code but haven't gotten my users to put a load on it yet.) No rush. At this point, the earliest I could commit something like this to head would be December. rick ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects. -GAWollman
Re: NFS server bottlenecks
I wrote: Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at: http://people.freebsd.org/~rmacklem/drc2.patch http://people.freebsd.org/~rmacklem/drc3.patch in case the attachments don't get through. rick ps: I haven't tested drc3.patch a lot, but I think it's ok?

--- fs/nfsserver/nfs_nfsdcache.c.orig	2012-02-29 21:07:53.0 -0500
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-03 08:23:24.0 -0400
@@ -164,8 +164,19 @@ NFSCACHEMUTEX;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
+SYSCTL_DECL(_vfs_nfsd);
+
+static int nfsrc_tcphighwater = 0;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
+    &nfsrc_tcphighwater, 0,
+    "High water mark for TCP cache entries");
+static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, udphighwater, CTLFLAG_RW,
+    &nfsrc_udphighwater, 0,
+    "High water mark for UDP cache entries");
+
 static int nfsrc_tcpnonidempotent = 1;
-static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER, nfsrc_udpcachesize = 0;
+static int nfsrc_udpcachesize = 0;
 static TAILQ_HEAD(, nfsrvcache) nfsrvudplru;
 static struct nfsrvhashhead nfsrvhashtbl[NFSRVCACHE_HASHSIZE],
     nfsrvudphashtbl[NFSRVCACHE_HASHSIZE];
@@ -781,8 +792,15 @@ nfsrc_trimcache(u_int64_t sockref, struc
 {
 	struct nfsrvcache *rp, *nextrp;
 	int i;
+	static time_t lasttrim = 0;
 
+	if (NFSD_MONOSEC == lasttrim &&
+	    nfsrc_tcpsavedreplies <= nfsrc_tcphighwater &&
+	    nfsrc_udpcachesize < (nfsrc_udphighwater +
+	    nfsrc_udphighwater / 2))
+		return;
 	NFSLOCKCACHE();
+	lasttrim = NFSD_MONOSEC;
 	TAILQ_FOREACH_SAFE(rp, &nfsrvudplru, rc_lru, nextrp) {
 		if (!(rp->rc_flag & (RC_INPROG|RC_LOCKED|RC_WANTED)) &&
 		    rp->rc_refcnt == 0

--- fs/nfsserver/nfs_nfsdcache.c.sav	2012-10-10 18:56:01.0 -0400
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-12 21:04:21.0 -0400
@@ -160,7 +160,8 @@ __FBSDID("$FreeBSD: head/sys/fs/nfsserve
 #include <fs/nfs/nfsport.h>
 
 extern struct nfsstats newnfsstats;
-NFSCACHEMUTEX;
+extern struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];
+extern struct mtx nfsrc_udpmtx;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
@@ -208,10 +209,11 @@ static int newnfsv2_procid[NFS_V3NPROCS]
 	NFSV2PROC_NOOP,
 };
 
+#define	nfsrc_hash(xid)	(((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE)
 #define	NFSRCUDPHASH(xid) \
-	(&nfsrvudphashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvudphashtbl[nfsrc_hash(xid)])
 #define	NFSRCHASH(xid) \
-	(&nfsrvhashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvhashtbl[nfsrc_hash(xid)])
 #define	TRUE	1
 #define	FALSE	0
 #define	NFSRVCACHE_CHECKLEN	100
@@ -262,6 +264,18 @@ static int nfsrc_getlenandcksum(mbuf_t m
 static void nfsrc_marksametcpconn(u_int64_t);
 
 /*
+ * Return the correct mutex for this cache entry.
+ */
+static __inline struct mtx *
+nfsrc_cachemutex(struct nfsrvcache *rp)
+{
+
+	if ((rp->rc_flag & RC_UDP) != 0)
+		return (&nfsrc_udpmtx);
+	return (&nfsrc_tcpmtx[nfsrc_hash(rp->rc_xid)]);
+}
+
+/*
  * Initialize the server request cache list
  */
 APPLESTATIC void
@@ -336,10 +350,12 @@ nfsrc_getudp(struct nfsrv_descript *nd,
 	struct sockaddr_in6 *saddr6;
 	struct nfsrvhashhead *hp;
 	int ret = 0;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(newrp);
 	hp = NFSRCUDPHASH(newrp->rc_xid);
 loop:
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	LIST_FOREACH(rp, hp, rc_hash) {
 		if (newrp->rc_xid == rp->rc_xid &&
 		    newrp->rc_proc == rp->rc_proc &&
@@ -347,8 +363,8 @@ loop:
 		    nfsaddr_match(NETFAMILY(rp), &rp->rc_haddr, nd->nd_nam)) {
 			if ((rp->rc_flag & RC_LOCKED) != 0) {
 				rp->rc_flag |= RC_WANTED;
-				(void)mtx_sleep(rp, NFSCACHEMUTEXPTR,
-				    (PZERO - 1) | PDROP, "nfsrc", 10 * hz);
+				(void)mtx_sleep(rp, mutex, (PZERO - 1) | PDROP,
+				    "nfsrc", 10 * hz);
 				goto loop;
 			}
 			if (rp->rc_flag == 0)
@@ -358,14 +374,14 @@ loop:
 			TAILQ_INSERT_TAIL(&nfsrvudplru, rp, rc_lru);
 			if (rp->rc_flag & RC_INPROG) {
 				newnfsstats.srvcache_inproghits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				ret = RC_DROPIT;
 			} else if (rp->rc_flag & RC_REPSTATUS) {
 				/*
 				 * V2 only.
 				 */
 				newnfsstats.srvcache_nonidemdonehits++;
-
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. 
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 11, 2012, at 7:20 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. 
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all
Re: NFS server bottlenecks
Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick
Re: NFS server bottlenecks
Garrett Wollman wrote: On Tue, 9 Oct 2012 20:18:00 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). Yes. My comment was simply meant to imply that his testing isn't a realistic load for most NFS servers. It was not meant to imply that reducing the CPU overhead/lock contention of the DRC is a useless exercise. rick -GAWollman
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. 
There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out of curiousity, why do you use 8K reads instead of 64K reads. Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-fs To unsubscribe, send any mail to freebsd-fs-unsubscr...@freebsd.org My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out of curiousity, why do you use 8K reads instead of 64K reads. Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values generate too large a burst of traffic for a network interface to handle and then the rsize/wsize has to be decreased to avoid this issue.) And, although
Re: NFS server bottlenecks
Nikolay Deney wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL machine over a 10G network, and noticed the same nfsd threads issue. Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server with dd if=/tank/32G.bin of=/dev/null bs=1M to cache it completely in ARC (the machine has 196G RAM); if I then do this again locally I get close to 4GB/sec read, completely from the cache... But if I try to read the file over NFS from the Linux machine I only get about 100MB/sec, sometimes a bit more, and all of the nfsd threads are clearly visible in top. pmcstat also showed the same mutex
Re: NFS server bottlenecks
Garrett Wollman wrote: [Adding freebsd-fs@ to the Cc list, which I neglected the first time around...] On Tue, 2 Oct 2012 08:28:29 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you. It will be a while before I have another server that isn't in production (it's on my deployment plan, but getting the production servers going is taking first priority). The approaches that I was going to look at: Simplest: only do the cache trim once every N requests (for some reasonable value of N, e.g., 1000). Maybe keep track of the number of entries in each hash bucket and ignore those buckets that only have one entry even if it is stale. Well, the patch I have does it when it gets too big. This made sense to me, since the cache is trimmed to keep it from getting too large. It also does the trim at least once/sec, so that really stale entries are removed. Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? A mutex in each element could be used for changes (not insertion/removal) to an individual element. 
However, the current code manipulates the lists and makes minimal changes to the individual elements, so I'm not sure if a mutex in each element would be useful or not, but it wouldn't help for the trimming case, imho. I modified the patch slightly, so it doesn't bother to acquire the mutex when it is checking if it should trim now. I think this results in a slight risk that the test will use an out-of-date cached copy of one of the global vars, but since the code isn't modifying them, I don't think it matters. This modified patch is attached and is also here: http://people.freebsd.org/~rmacklem/drc2.patch Moderately complicated: figure out if a different synchronization type can safely be used (e.g., rmlock instead of mutex) and do so. More complicated: move all cache trimming to a separate thread and just have the rest of the code wake it up when the cache is getting too big (or just once a second since that's easy to implement). Maybe just move all cache processing to a separate thread. Only doing it once/sec would result in a very large cache when bursts of traffic arrive. The above patch does it when it is too big or at least once/sec. I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? Isilon did use separate threads (I never saw their code, so I am going by what they told me), but it sounded to me like they were trimming the cache too aggressively to be effective for TCP mounts. (ie. 
It sounded to me like they had broken the algorithm to achieve better perf.) Remember that the DRC is weird, in that it is a cache to improve correctness at the expense of overhead. It never improves performance. On the other hand, turn it off or throw away entries too aggressively and data corruption, due to retries of non-idempotent operations, can be the outcome. Good luck with whatever you choose, rick It's pretty clear from the profile that the cache mutex is heavily contended, so anything that reduces the length of time it's held is probably a win. That URL again, for the benefit of people on freebsd-fs who didn't see it on hackers, is: http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png. (This graph is slightly modified from my previous post as I removed some spurious edges to make the formatting look better. Still looking for a way to get a profile that includes all kernel modules with the kernel.) -GAWollman
Re: NFS server bottlenecks
Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrive. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick -GAWollman
Re: NFS server bottlenecks
Garrett Wollman wrote: I had an email conversation with Rick Macklem about six months ago about NFS server bottlenecks. I'm now in a position to observe my large-scale NFS server under an actual production load, so I thought I would update folks on what it looks like. This is a 9.1 prerelease kernel (I hope 9.1 will be released soon as I have four more of these servers to deploy!). When under nearly 100% load on an 8-core (16-thread) Quanta QSSC-S99Q storage server, with a 10G network interface, pmcstat tells me this:

PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%), 27 unresolved
Key: q = exiting...
%SAMP IMAGE   FUNCTION            CALLERS
 29.3 kernel  _mtx_lock_sleep     nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
  9.5 kernel  cpu_search_highest  cpu_search_highest:8.1 sched_idletd:1.4
  7.4 zfs.ko  lzjb_decompress     zio_decompress
  4.3 kernel  _mtx_lock_spin      turnstile_trywait:2.2 pmclog_reserve:1.0 ...
  4.0 zfs.ko  fletcher_4_native   zio_checksum_error:3.1 zio_checksum_compute:0.8
  3.6 kernel  cpu_search_lowest   cpu_search_lowest
  3.3 kernel  nfsrc_trimcache     nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
  2.3 kernel  ipfw_chk            ipfw_check_hook
  2.1 pmcstat _init
  1.1 kernel  _sx_xunlock
  0.9 kernel  _sx_xlock
  0.9 kernel  spinlock_exit

This does seem to confirm my original impression that the NFS replay cache is quite expensive. Running a gprof(1) analysis on the same PMC data reveals a bit more detail (I've removed some uninteresting parts of the call graph): I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you. 
jwd@ was going to test it, but he moved to a different job away from NFS, so the patch has just been collecting dust. If you could test it, that would be nice, rick ps: Also, the current patch still locks before checking if it needs to do the trim. I think that could safely be changed so that it doesn't lock/unlock when it isn't doing the trim, if that makes a significant difference.

                called/total      parents
index  %time   self  descendents  called+self    name    index
                called/total      children

              4881.00  2004642.70   932627/932627     svc_run_internal [2]
[4]    45.1   4881.00  2004642.70   932627         nfssvc_program [4]
             13199.00   504436.33   584319/584319     nfsrvd_updatecache [9]
             23075.00   403396.18   468009/468009     nfsrvd_getcache [14]
              1032.25   416249.44     2239/2284       svc_sendreply_mbuf [15]
              6168.00   381770.44    11618/11618      nfsrvd_dorpc [24]
              3526.87    86869.88   112478/112514     nfsrvd_sentcache [74]
               890.00    50540.89     4252/4252       svc_getcred [101]
             14876.60    32394.26     4177/24500      crfree <cycle 3> [263]
             11550.11    25150.73     3243/24500      free <cycle 3> [102]
              1348.88    15451.66     2716/16831      m_freem [59]
              4066.61      216.81     1434/1456       svc_freereq [321]
              2342.15      677.40      557/1459       malloc_type_freed [265]
                59.14     1916.84      134/2941       crget [113]
              1602.25        0.00      322/9682       bzero [105]
               690.93        0.00       43/44         getmicrotime [571]
               287.22        7.33      138/1205       prison_free [384]
               233.61        0.00       60/798        PHYS_TO_VM_PAGE [358]
               203.12        0.00       94/230        nfsrv_mallocmget_limit [632]
               151.76        0.00       51/1723       pmap_kextract [309]
                 0.78       70.28        9/3281       _mtx_unlock_sleep [154]
                19.22       16.88       38/400403     nfsrc_trimcache [26]
                11.05       21.74        7/197        crsetgroups [532]
                30.37        0.00       11/6592       critical_enter [190]
                25.50        0.00        9/36         turnstile_chain_unlock [844]
                24.86        0.00        3/7          nfsd_errmap [913]
                12.36        8.57        8/2145       in_cksum_skip [298]
                 9.10        3.59        5/12455      mb_free_ext [140]
                 1.84        4.85        2/2202       VOP_UNLOCK_APV [269]
-----------------------------------------------
                 0.49        0.15        1/1129009    uhub_explore [1581]
                 0.49        0.15        1/1129009    tcp_output [10]
                 0.49        0.15        1/1129009    pmap_remove_all [1141]
                 0.49        0.15        1/1129009    vm_map_insert [236]
                 0.49        0.15        1/1129009    vnode_create_vobject [281]
                 0.49        0.15        1/1129009    biodone [351]
                 0.49        0.15        1/1129009    vm_object_madvise [670]
                 0.49        0.15        1/1129009    xpt_done [483]
                 0.49        0.15        1/1129009    vputx [80]
                 0.49        0.15        1/1129009    vm_map_delete <cycle 3> [49]
                 0.49        0.15        1/1129009    vm_object_deallocate <cycle 3> [356]
                 0.49        0.15        1/1129009    vm_page_unwire [338]
                 0.49        0.15        1/1129009    pmap_change_wiring [318]
                 0.98        0.31        2/1129009    getnewvnode [227]
                 0.98        0.31        2/1129009    pmap_clear_reference [1004]
                 0.98        0.31        2/1129009    usbd_do_request_flags [1282]
                 0.98        0.31        2/1129009    vm_object_collapse <cycle 3> [587]
                 0.98        0.31        2/1129009    vm_object_page_remove [122]
                 1.48        0.46        3/1129009    mpt_pci_intr [487]
                 1.48        0.46        3/1129009    pmap_extract [355]
                 1.48        0.46        3/1129009    vm_fault_unwire [171]
                 1.97        0.62        4/1129009    vgonel [270]
                 1.97        0.62        4/1129009    vm_object_shadow [926]
                 1.97        0.62        4/1129009    zone_alloc_item [434]
                 2.46        0.77        5/1129009    vnlru_free [235]
                 2.46        0.77        5/1129009    insmntque1 [737]
                 2.95        0.93        6/1129009    zone_free_item [409]
                 3.94        1.24        8
Re: Upcoming release schedule - 8.4 ?
Mark Saad wrote: I'll share my 2 cents here, as someone who maintains a decent sized FreeBSD install. 1. FreeBSD needs to make end users more comfortable with using a Dot-Ohh release; and at the time of the dot-ohh release a timeline for the next point releases should be made. * 2. Having three supported releases is showing issues, and brings up the point of why was 9.0 not released as 8.3 ? ** 3. The end users appear to want fewer releases, and for them to be supported longer. * A rough outline would do and it should be on the main release page http://www.freebsd.org/releases/ ** Yes I understand that 9.0 had tons of new features that were added and it's not exactly a point release upgrade from 8.2, however one can argue that if it were there would be less yelling about when version X is going to be EOL'd and when will version Y be released. One thought here might be to revisit the Kernel APIs can only change on a major release rule. It seems to me that some KPIs could be frozen for longer periods than others, maybe? For example: - If device driver KPIs were frozen for a longer period of time, there wouldn't be the challenge of backporting drivers for newer hardware to the older systems. vs - The VFS/VOP interface. As far as I know, there are currently 2 out-of-source-tree file systems (OpenAFS and FUSE) and there are FreeBSD committers involved in both of these. As such, making a VFS change within a minor release cycle might not be a big problem, so long as all the file systems in the source tree are fixed and the maintainers for the above 2 file systems were aware of the change and when they needed to release a patch/rebuild their module. - Similarly, are there any out-of-source-tree network stacks? It seems that this rule is where the controversy of major vs minor release changes comes in? 
Just a thought, rick -- mark saad | nones...@longcount.org
Re: pxe + nfs + microsoft dhcp
pacija wrote: - Original Message - Dear list readers, I am having a problem with pxe loader on FreeBSD 9.0 i386 release. No matter what value I put for DHCP option 017 (Root Path) in Microsoft DHCP server, pxe always sets root path: pxe_open: server path: / I've read src/sys/boot/i386/libi386/pxe.c as instructed in handbook, and i learned there that root path is a failover value which gets set if no valid value is supplied by DHCP server. At first i thought that Microsoft DHCP does not send it but i confirmed with windump it does: -- 15:46:49.505748 IP (tos 0x0, ttl 128, id 6066, offset 0, flags [none], proto: UDP (17), length: 392) dhcp.domain.tld.67 255.255.255.255.68: [bad udp cksum 4537!] BOOTP/DHCP, Reply, length 364, xid 0xdcdb5309, Flags [ none ] (0x) Your-IP 192.168.218.32 Server-IP dhcp.domain.tld Client-Ethernet-Address 00:19:db:db:53:09 (oui Unknown) file FreeBSD/install/boot/pxeboot Vendor-rfc1048 Extensions Magic Cookie 0x63825363 DHCP-Message Option 53, length 1: Offer Subnet-Mask Option 1, length 4: 255.255.255.0 RN Option 58, length 4: 345600 RB Option 59, length 4: 604800 Lease-Time Option 51, length 4: 691200 Server-ID Option 54, length 4: dhcp.domain.tld Default-Gateway Option 3, length 4: gate.domain.tld Domain-Name-Server Option 6, length 4: dhcp.domain.tld Domain-Name Option 15, length 1: ^@ RP Option 17, length 42: 192.168.218.32:/b/tftpboot/FreeBSD/install/^@ BF Option 67, length 29: FreeBSD/install/boot/pxeboot^@ What about getting rid of the ^@ characters at the end of the strings? rick -- I do not understand code well enough to fix it, or at least send pxeloader static value of /b/tftpboot/FreeBSD/install/, so if someone would instruct me how to do it i would be very grateful. Thank you in advance for your help. 
Re: NFS - slow
David Brodbeck wrote: On Mon, Apr 30, 2012 at 10:00 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote: i tried nfsv4, tested under FreeBSD over localhost and it is roughly the same. am i doing something wrong? I found NFSv4 to be much *slower* than NFSv3 on FreeBSD, when I benchmarked it a year or so ago. If delegations are not enabled, there is additional overhead doing the Open operations against the server. Delegations are not enabled by default in the server, because there isn't code to handle conflicts with opens done locally on the server. (ie. Delegations work iff the volumes exported over NFSv4 are not accessed locally in the server.) I think there are also some issues w.r.t. name caching in the client that still need to be resolved. NFSv4 should provide better byte range locking, plus NFSv4 ACLs and a few other things. However, it is more complex and will not perform better than NFSv3, at least until delegations are used (or pNFS, which is a part of NFSv4.1). rick -- David Brodbeck System Administrator, Linguistics University of Washington
Re: NFS - slow
Wojciech Puchar wrote: i tried nfsv4, tested under FreeBSD over localhost and it is roughly the same. am i doing something wrong? Probably not. NFSv4 writes are done exactly the same as NFSv3. (It changes other stuff, like locking, adding support for ACLs, etc.) I do have a patch that allows the client to do more extensive caching to local disk in the client (called Packrats), but that isn't ready for prime time yet. NFSv4.1 optionally supports pNFS, where reading and writing can be done to Data Servers (DSs) separate from the NFS server (called the Metadata Server or MDS). I'm working on the client side of this, but it is also a work-in-progress and no work on an NFSv4.1 server for FreeBSD has been done yet, as far as I know. If you have increased MAXBSIZE in both the client and server machines and use the new (experimental in 8.x) client and server, they will use a larger rsize, wsize for NFSv3 as well as NFSv4. (Capturing packets and looking at them in wireshark will tell you what the actual rsize, wsize is.) A patch to nfsstat to get the actual mount options in use is another of my 'to do' items. If anyone else wants to work on this, I'd be happy to help them. On Mon, 30 Apr 2012, Peter Jeremy wrote: On 2012-Apr-27 22:05:42 +0200, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote: is there any way to speed up NFS server? ... - write works terribly. it performs sync on every write IMHO, You don't mention which NFS server or NFS version you are using, but for traditional NFS, this is by design. The NFS server is stateless and NFS server failures are transparent (other than time-wise) to the client. This means that once the server acknowledges a write, it guarantees the client will be able to later retrieve that data, even if the server crashes. This implies that the server needs to do a synchronous write to disk before it can return the acknowledgement back to the client. -- Peter Jeremy Btw, for NFSv3 and 4, the story is slightly different than the above. 
A client can do writes with a flag that is either FILESYNC or UNSTABLE. For FILESYNC, the server must do exactly what the above says. That is, the data and any required metadata changes must be on stable storage before the server replies to the RPC. For UNSTABLE, the server can simply save the data in memory and reply OK to the RPC. For this case, the client needs to do a separate Commit RPC later and the server must store the data on stable storage at that time. (For this case, the client needs to keep the data written UNSTABLE in its cache and be prepared to re-write it, if the server reboots before the Commit RPC is done.) - When any app. does a fsync(2), the client needs to do a Commit RPC if it has been doing UNSTABLE writes. Most clients, including FreeBSD, do writes with UNSTABLE. However, one limitation on the FreeBSD client is that it currently only keeps track of one contiguous modified byte range in a buffer cache block. When an app. in the client does non-contiguous writes to the same buffer cache block, it must write the old modified byte range to the server with FILESYNC before it copies the newly written data into the buffer cache block. This happens frequently for builds during the loader phase. (jhb and I have looked at this. I have an experimental patch that makes the modified byte range a list, but it requires changes to struct buf. I think it is worth pursuing. It is a client side patch, since that is where things can be improved, if clients avoid doing FILESYNC or frequent Commit RPCs.) rick
Re: NFS - slow
Wojciech Puchar wrote: the server is required to do that. (ie. Make sure the data is stored on stable storage, so it can't be lost if the server crashes/reboots.) Expensive NFS servers can use non-volatile RAM to speed this up, but a generic FreeBSD box can't do that. Some clients (I believe ESXi is one of these) request FILE_SYNC all the time, but all clients will do so sooner or later. If you are exporting ZFS volumes and don't mind violating the NFS RFCs and risking data loss, there is a ZFS option that helps. I don't use ZFS, but I think the option is (sync=disabled) or something like that. (ZFS folks can help out, if you want that.) Even using vfs.nfsrv.async=1 breaks the above. thank you for answering. i don't use or plan to use ZFS. and i am aware of this NFS feature but i don't understand - even with syncs disabled, why writes are not clustered. i always see 32kB writes in systat The old (default on NFSv3) server sets the maximum wsize to 32K. The new (default on 9) sets it to MAXBSIZE, which is currently 64K, but I would like to get that increased. (A quick test suggested that the kernel works when MAXBSIZE is set to 128K, but I haven't done much testing yet.) when running unfsd from ports it doesn't have that problem and works FASTER than kernel nfs. But you had taken out the fsync() calls, which breaks the protocol, as above. rick
Re: NFS - slow
Wojciech Puchar wrote: is there any way to speed up NFS server? from what i noticed: - reads works fast and good, like accessed locally, readahead up to maxbsize works fine on large files etc. - write works terribly. it performs sync on every write IMHO, setting vfs.nfsrv.async=1 improves things SLIGHTLY, but still - writes are sent to hard disk every single block - no clustering. am i doing something wrong or is it that broken? Since I haven't seen anyone else answer this, I'll throw out my $0.00 worth one more time. (This topic comes up regularly on the mailing lists.) Not broken, it's just a feature of NFS. When the client says FILE_SYNC, the server is required to do that. (ie. Make sure the data is stored on stable storage, so it can't be lost if the server crashes/reboots.) Expensive NFS servers can use non-volatile RAM to speed this up, but a generic FreeBSD box can't do that. Some clients (I believe ESXi is one of these) request FILE_SYNC all the time, but all clients will do so sooner or later. If you are exporting ZFS volumes and don't mind violating the NFS RFCs and risking data loss, there is a ZFS option that helps. I don't use ZFS, but I think the option is (sync=disabled) or something like that. (ZFS folks can help out, if you want that.) Even using vfs.nfsrv.async=1 breaks the above. Once you do this, when an application in a client does a successful fsync() and assumes the data is safely stored and then the server crashes, the data can still be lost. rick i tried user space nfs from ports, it's funny but its performance is actually better after i removed fsync from the code.
Re: Ways to promote FreeBSD?
Steven Hartland wrote: - Original Message - From: Mehmet Erol Sanliturk My opinion is that the most important obstacle in front of FreeBSD is its installation structure: It is NOT possible to install and use a FreeBSD distribution directly as it is. I disagree, we find quite the opposite; FreeBSD's current install is perfect: it's quick, doesn't install stuff we don't need and leaves a very nice base. Linux on the other hand takes ages, asks way too many questions, has issues with some hardware with mouse and gui not working properly making the install difficult to navigate, but most importantly it's quite hard to get a nice simple base as there are so many options, which is default with FreeBSD. In essence it depends on what you want and how you use the OS. For the way we use FreeBSD on our servers it's perfect. So if you're trying to suggest it's not suitable for all, that's incorrect as it depends on what you want :) I worked for the CS dept. at a university for 30 years. What I observed was that students were usually enthusiastic about trying a new os. However, these days, they have almost no idea how to work in a command line environment. If they installed FreeBSD, it would be zapped off their disk within minutes of the install completing and they'd forget about it. They install and like distros like Ubuntu, which install and work the way they expect (yes, they expect a GUI desktop, etc). When they get out in industry, they remember Linux, but won't remember FreeBSD (at least not in a good way). Now, I am not suggesting that FreeBSD try and generate Ubuntu-like desktop distros. However, it might be nice if the top level web page let people know that the installs there are not desktop systems and point them to PC-BSD (or whatever other desktop distro there might be?) for a desktop install. (I know, the original poster wasn't a PC-BSD fan, but others seem happy with it. 
I'll admit I've never tried it, but then, I'm not a GUI desktop guy.:-) Just my $0.00 worth, rick Regards Steve This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337 or return the E.mail to postmas...@multiplay.co.uk.
Re: mount_nfs does not like exports longer then 88 chars
Mark Saad wrote: On Thu, Apr 19, 2012 at 3:51 PM, Andrew Duane adu...@juniper.net wrote: MNAMELEN is used to bound the Mount NAMe LENgth, and is used in many many places. It may seem to work fine, but there are lots of utilities and such that will almost certainly fail managing it. Search the source code for MNAMELEN. I see that this is used in a number of mount and fs bits. Do you know why mount_nfs would care how long the exported path and hostname are? Well, it's copied to f_mntfromname in struct statfs. If one longer than MNAMELEN is allowed, it gets truncated when copied. I have no idea which userland apps. will get upset with a truncated value in f_mntfromname. (To change the size of f_mntfromname would require a new revision of the statfs syscall, I think?) Does this answer what you were asking? rick  ... Andrew Duane Juniper Networks +1 978-589-0551 (o) +1 603-770-7088 (m) adu...@juniper.net -Original Message- From: owner-freebsd-hack...@freebsd.org [mailto:owner-freebsd-hack...@freebsd.org] On Behalf Of Mark Saad Sent: Thursday, April 19, 2012 3:46 PM To: freebsd-hackers@freebsd.org Subject: mount_nfs does not like exports longer then 88 chars Hello Hackers  I was wondering if anyone has come across this issue. This exists in FreeBSD 6, 7, and 9, and probably in 8 but I am not using it at this time. When an nfs export path and host name total to more than 88 characters, mount_nfs bombs out with the following error when it attempts to mount it. mount_nfs: nyisilon2-13.grp2:/ifs/clients/www/csar884520456/files_cms-stage-BK/imagefield_default_images: File name too long I traced this down to a check in mount_nfs.c. This is about line 560 in the 7-STABLE version and 734 in the 9-STABLE version:

	/*
	 * If there has been a trailing slash at mounttime it seems
	 * that some mountd implementations fail to remove the mount
	 * entries from their mountlist while unmounting.
	 */
	for (speclen = strlen(spec);
	    speclen > 1 && spec[speclen - 1] == '/';
	    speclen--)
		spec[speclen - 1] = '\0';
	if (strlen(hostp) + strlen(spec) + 1 > MNAMELEN) {
		warnx("%s:%s: %s", hostp, spec,
		    strerror(ENAMETOOLONG));
		return (0);
	}

Does any one know why the check for hostp + spec + 1 to be less than MNAMELEN is there? I removed the check on my 9-STABLE box and it mounts the long mounts fine. I submitted a pr for this, its kern/167105 http://www.freebsd.org/cgi/query-pr.cgi?pr=167105 as there is no mention of this in the man page and I can't find any reason for the check at all. -- mark saad | nones...@longcount.org
Re: Kerberos and FreeBSD
Benjamin Kaduk wrote: On Wed, 8 Feb 2012, Ansar Mohammed wrote: Hello All, Is the port of Heimdal on FreeBSD being maintained? The version that ships with 9.0 seems a bit old. # /usr/libexec/kdc -v kdc (Heimdal 1.1.0) Copyright 1995-2008 Kungliga Tekniska Högskolan Send bug-reports to heimdal-b...@h5l.org My understanding is that every five years or so, someone becomes fed up enough with the staleness of the current version and puts in the effort to merge in a newer version. It looks like 3 years ago, dfr brought in that Heimdal 1.1 you see, to replace the Heimdal 0.6 that nectar brought in 8 years ago. I don't know of anyone with active plans to bring in a new version, at present. -Ben Kaduk I think it's a little trickier than it sounds. The Kerberos in FreeBSD isn't vanilla Heimdal 1.1, but a somewhat modified variant. Heimdal libraries have a separate source file for each function, plus a source file that defines all global storage used by functions in the library. One difference w.r.t. the FreeBSD variant that I am aware of is: - Some of the functions were moved from one library to another. (I don't know why, but maybe it was to avoid a POLA violation which would require apps to be linked with additional libraries?) - To do this, some global variables were added to the source file in the library these functions were moved to. As such, if you statically link an app. to both libraries, the global variable can come up multiply defined. (I ran into this when I was developing a gssd prior to the one introduced as part of the kernel rpc.) You can get around this by dynamically linking, being careful about the order in which the libraries are specified. (The command krb5-config --libs helps w.r.t. this.) I don't know what else was changed, but I do know that it isn't as trivial as replacing the sources with ones from a newer Heimdal release. I think it would be nice if a newer Heimdal release was brought in, with the minimal changes required to make it work.
(If that meant that apps. needed more libraries, the make files could use krb5-config --libs to handle it, I think?) Oh, and I'm not volunteering to try and do it;-) rick
Re: FreeBSD has serious problems with focus, longevity, and lifecycle
Mark Blackman wrote: On 26 Jan 2012, at 14:37, John Baldwin wrote: On Thursday, January 19, 2012 4:33:40 pm Adrian Chadd wrote: On 19 January 2012 09:47, Mark Saad nones...@longcount.org wrote: What could I do to help make 7.5-RELEASE a reality ? Put your hand up and volunteer to run the 7.5-RELEASE release cycle. That's not actually true or really fair. There has to be some buy-in from the project to do an official release; it is not something that a single person can do off in a corner and then have the Project bless the bits as an official release. And raises the interesting question for an outsider of a) who is the project in this case and b) what does it take for a release to be a release? Wasn't there a freebsd-releng (or similar) mailing list ages ago? I am going to avoid the above question, since I don't know the answer and I believe other(s) have already answered it. However, I will throw out the following comment: I can't seem to find the post, but someone suggested a release mechanism where stable/N would simply be branched when it appeared to be in good shape. Although I have no idea if this is practical for all releases, it seems that it might be a low-overhead approach for releases off old stable branches like stable/7 currently is? (i.e., since there aren't a lot of commits happening to stable/7, just branch it. You could maybe give a one/two week warning email about when this will happen. I don't think it would cause a flurry of commits like happens when code slush/freeze approaches for a new .0 one.) Just a thought, rick
Re: [ANN] host-setup 4.0 released
Devin Teske wrote: -Original Message- From: Mohacsi Janos [mailto:moha...@niif.hu] Sent: Tuesday, January 03, 2012 3:59 AM To: Devin Teske Cc: freebsd-hackers@freebsd.org; Dave Robison; Devin Teske Subject: Re: [ANN] host-setup 4.0 released Hi Devin, I had a look at the code. It is very nice, Thank you. however there are some missing elements: - IPv6 support Open to suggestions. Maybe adding an ipaddr6 below ipaddr in the interface configuration menu. Also, do you happen to know what the RFC number is for the IPv6 address format? I need to know all the special features (for example, I know you can specify ::1 for localhost, but can you simply omit octets at-will? e.g., ::ff:12:00::: ?) The basics are in RFC 4291, but I think that inet_pton(3) knows how to deal with it. (I think :: can be used once to specify the longest # of 16-bit fields that are all zeros.) After inet_pton() has translated it to a binary address, then the macros in sys/netinet6/in6.h can be used to determine if the address is a loopback, etc. I'm no IPv6 guy by any means, so others, please correct/improve on this, as required. rick - VLAN tagging support - creation/deleting How is that done these days? and how might we present it in the user interface? -- Devin Best Regards, Janos Mohacsi Head of HBONE+ project Network Engineer, Deputy Director of Network Planning and Projects NIIF/HUNGARNET, HUNGARY Key 70EF9882: DEC2 C685 1ED4 C95A 145F 4300 6F64 7B00 70EF 9882 On Mon, 2 Jan 2012, Devin Teske wrote: Hi fellow -hackers, I'd like to announce the release of a major new revision (4.0) of my FreeBSD setup utility host-setup. http://druidbsd.sourceforge.net/ Direct Link: http://druidbsd.sourceforge.net/download/host-setup.txt NOTE: Make sure to hit refresh to defeat the cache Major highlights of this version are listed on the druidbsd homepage. For those unfamiliar with my host-setup, it's a manly shell script designed to make it super-easy to configure the following: 1. Timezone 2. Hostname/Domain 3. 
Network Interface Settings 4. Default Router/Gateway 5. DNS nameservers All from an easy-to-use dialog(1) or Xdialog(1)* interface * Fully compatible and tested -- simply pass `-X' while in a usable X environment -- Devin P.S. Feedback most certainly is welcomed!
Re: Dumping core over NFS
Andrew Duane wrote: We have a strange problem in 6.2 that we're wondering if anyone else has seen. If a process is dumping core to an NFS-mounted directory, sending SIGINT, SIGTERM, or SIGKILL to that process causes NFS to wedge. The nfs_asyncio starts complaining that 20 iods are already processing the mount, but nothing makes any forward progress. Sending SIGUSR1, SIGUSR2, or SIGABRT seem to work fine, as does any signal if the core dump is going to a local filesystem. Before I dig into this apparent deadlock, just wondering if it's been seen before. The only thing I can tell you is that SIGINT and SIGTERM are signals that are handled differently by mounts with the intr option set. For this case, the client tries to make the syscall in progress fail with EINTR when one of these signals is posted. I have no idea what effect this might have on a core dump in progress or if you are using intr mounts. There was an issue in FreeBSD 8.[01] (for the intr case) where the termination signal could get the krpc code in a loop when trying to re-establish a TCP connection, because an msleep() would always return EINTR right away without waiting for the connection attempt to complete and then code outside that would just try it again and again and... This bug was fixed for FreeBSD 8.2. Obviously it's not the same bug, since FreeBSD 6 didn't have a krpc subsystem, but you might look for something similar. (i.e., a sleep(...PCATCH...) and then a caller that just tries again when it returns EINTR.) If you use intr, you might also try without intr and see if that has any effect. Good luck with it, rick ... 
Andrew Duane Juniper Networks o +1 978 589 0551 m +1 603-770-7088 adu...@juniper.net
Re: Check for 0 ino_t in readdir(3)
mdf wrote: There is a check in the function implementing readdir(3) for a zero inode number:

struct dirent *
_readdir_unlocked(dirp, skip)
	DIR *dirp;
	int skip;
{
	/* ... */
	if (dp->d_ino == 0 && skip)
		continue;
	/* ... */
}

skip is 1 except for when coming from _seekdir(3). I don't recall any requirement that a filesystem not use an inode numbered 0, though for obvious reasons it's a poor choice for a file's inode. So... is this code in libc incorrect? Or is there documentation that 0 cannot be a valid inode number for a filesystem? Well, my recollection (if I'm incorrect, please correct me:-) is that, for real BSD directories (the ones generated by UFS/FFS, which everything else is expected to emulate), the d_ino field is set to 0 when the first entry in a directory block is unlink'd. This is because directory entries are not permitted to straddle blocks, so the first entry cannot be subsumed by the last dirent in the previous block. In other words, when d_ino == 0, the dirent is free. rick
Re: Mount_nfs question
Maybe you can use showmount -a SERVER-IP, for each server you have... That might work. NFS doesn't actually have a notion of a mount, but the mount protocol daemon (typically called mountd) does try and keep track of NFSv3 mounts from the requests it sees. How well this works for NFSv3 will depend on how well the server keeps track of these things and how easily they are lost during a server reboot or similar. Since NFSv4 doesn't use the mount protocol, it will be useless for NFSv4. Thiago 2011/5/30 Mark Saad nones...@longcount.org: On Mon, May 30, 2011 at 8:13 PM, Rick Macklem rmack...@uoguelph.ca wrote: Hello All So I am stumped on this one. I want to know the IP of each nfs server that is providing each nfs export. I am running 7.4-RELEASE When I run mount -t nfs I see something like this VIP-01:/export/source on /mnt/src VIP-02:/export/target on /mnt/target VIP-01:/export/logs on /mnt/logs VIP-02:/export/package on /mnt/pkg The issue is I use a load balanced nfs server, from Isilon. So VIP-01 could be any one of a group of IPs. I am trying to track down a network congestion issue and I can't find a way to match the output of lsof and netstat to the output of mount -t nfs. Does anyone have any ideas how I could track this down; is there a way to run mount and have it show the IP and not the name of the source server? Just fire up wireshark (or tcpdump) and watch the traffic. tcpdump doesn't know much about NFS, but if all you want are the IP#s, it'll do. But, no, mount won't tell you more than what the argument looked like. rick Wireshark seems like using a tank to swat a fly. Maybe, but watching traffic isn't that scary and over the years I've discovered things I would have never expected from doing it. Like a case where one specific TCP segment was being dropped by a network switch (it was a hardware problem in the switch that didn't manifest itself any other way). Or, that one client was generating a massive number of Getattr and Lookup RPCs. 
(That one turned out to be a grad student who had made themselves an app. that had a bunch of threads continually scanning for fs changes. Not a bad idea, but the threads never took a break and continually did it.) I've always found watching traffic kinda fun, but then I'm weird, rick
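Since mount(8) only shows the name given at mount time, one low-tech complement to watching traffic is resolving that name yourself to see which addresses it can map to. A sketch using getaddrinfo(3); print_addrs() is a hypothetical helper, and the host name would of course be the VIP-01 part of f_mntfromname rather than localhost:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve a host name and print each address it maps to; returns the
 * number of addresses printed (0 on resolution failure). */
static int
print_addrs(const char *host)
{
	struct addrinfo hints, *res, *ai;
	char buf[INET6_ADDRSTRLEN];
	int n = 0;

	memset(&hints, 0, sizeof(hints));
	hints.ai_socktype = SOCK_STREAM;	/* one entry per address */
	if (getaddrinfo(host, NULL, &hints, &res) != 0)
		return (0);
	for (ai = res; ai != NULL; ai = ai->ai_next) {
		void *addr = (ai->ai_family == AF_INET)
		    ? (void *)&((struct sockaddr_in *)ai->ai_addr)->sin_addr
		    : (void *)&((struct sockaddr_in6 *)ai->ai_addr)->sin6_addr;
		if (inet_ntop(ai->ai_family, addr, buf, sizeof(buf)) != NULL) {
			printf("%s -> %s\n", host, buf);
			n++;
		}
	}
	freeaddrinfo(res);
	return (n);
}

int
main(void)
{
	print_addrs("localhost");
	return (0);
}
```

Note this only shows what the name resolves to now; with a load balancer handing out addresses, only the traffic itself (or the server's own logs) tells you which member actually served a given mount.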
Re: Mount_nfs question
Hello All So I am stumped on this one. I want to know the IP of each nfs server that is providing each nfs export. I am running 7.4-RELEASE When I run mount -t nfs I see something like this VIP-01:/export/source on /mnt/src VIP-02:/export/target on /mnt/target VIP-01:/export/logs on /mnt/logs VIP-02:/export/package on /mnt/pkg The issue is I use a load balanced nfs server, from Isilon. So VIP-01 could be any one of a group of IPs. I am trying to track down a network congestion issue and I can't find a way to match the output of lsof and netstat to the output of mount -t nfs. Does anyone have any ideas how I could track this down; is there a way to run mount and have it show the IP and not the name of the source server? Just fire up wireshark (or tcpdump) and watch the traffic. tcpdump doesn't know much about NFS, but if all you want are the IP#s, it'll do. But, no, mount won't tell you more than what the argument looked like. rick
should I use a SYSCTL_STRUCT?
Hi, I am at the point where I need to fix the -z option of nfsstat. Currently the stats are acquired/zeroed for the old NFS subsystem via sysctl. The setup in the kernel is:

SYSCTL_STRUCT(_vfs_nfs, NFS_NFSSTATS, nfsstats, CTLFLAG_RW,
    &nfsstats, nfsstats, "S,nfsstats");

The new NFS subsystem currently gets the contents of the structure via a flag on nfssvc(2). So, I could either: - add another flag for nfssvc(2) to zero the structure OR - switch the new NFS subsystem over to using a SYSCTL_STRUCT() like the above. Which do you think would be preferable? Thanks in advance for any info, rick ps: I got completely lost on the SYSCTL thread in Jan. and would rather not start another one like it:-)
Re: SMP question w.r.t. reading kernel variables
On Tue, Apr 19, 2011 at 12:00:29PM +, freebsd-hackers-requ...@freebsd.org wrote: Subject: Re: SMP question w.r.t. reading kernel variables To: Rick Macklem rmack...@uoguelph.ca Cc: freebsd-hackers@freebsd.org Message-ID: 201104181712.14457@freebsd.org [John Baldwin] On Monday, April 18, 2011 4:22:37 pm Rick Macklem wrote: On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote: ... All of this makes sense. What I was concerned about was memory cache consistency and whet (if anything) has to be done to make sure a thread doesn't see a stale cached value for the memory location. Here's a generic example of what I was thinking of: (assume x is a global int and y is a local int on the thread's stack) - time proceeds down the screen

	thread X on CPU 0		thread Y on CPU 1
	x = 0;
					x = 0; /* 0 for x's location in CPU 1's memory cache */
	x = 1;
					y = x;

-- now, is y guaranteed to be 1 or can it get the stale cached 0 value? if not, what needs to be done to guarantee it? Well, the bigger problem is getting the CPU and compiler to order the instructions such that they don't execute out of order, etc. Because of that, even if your code has 'x = 0; x = 1;' as adjacent statements in thread X, the 'x = 1' may actually execute a good bit after the 'y = x' on CPU 1. Actually, as I recall the rules for C, it's worse than that. For this (admittedly simplified scenario), x=0; in thread X may never execute unless it's declared volatile, as the compiler may optimize it out and emit no code for it. Locks force that to synchronize as the CPUs coordinate around the lock cookie (e.g. the 'mtx_lock' member of 'struct mutex'). Also, I see cases of: mtx_lock(np); np->n_attrstamp = 0; mtx_unlock(np); in the regular NFS client. Why is the assignment mutex locked? (I had assumed it was related to the above memory caching issue, but now I'm not so sure.) In general I think writes to data that are protected by locks should always be protected by locks. 
In some cases you may be able to read data using weaker locking (where no locking can be a form of weaker locking, but also a read/shared lock is weak, and if a variable is protected by multiple locks, then any single lock is weak, but sufficient for reading, while all of the associated locks must be held for writing) than writing, but writing generally requires full locking (write locks, etc.). Oops, I now see that you've differentiated between writing and reading. (I mistakenly just stated that you had recommended a lock for reading. Sorry about my misinterpretation of the above on the first quick read.) What he said. In addition to all that, lock operations generate atomic barriers which a compiler or optimizer is prevented from moving code across. All good and useful comments, thanks. The above example was meant to be contrived, to indicate what I was worried about w.r.t. memory caches. Here's a somewhat simplified version of what my actual problem is: (Mostly fyi, in case you are interested.) Thread X is doing a forced dismount of an NFS volume, it (in dounmount()): - sets MNTK_UNMOUNTF - calls VFS_SYNC()/nfs_sync() - so this doesn't get hung on an unresponsive server, it must test for MNTK_UNMOUNTF and return an error if it is set. This seems fine, since it is the same thread and in a called function. (I can't imagine that the optimizer could move setting of a global flag to after a function call which might use it.) - calls VFS_UNMOUNT()/nfs_unmount() - now the fun begins... after some other stuff, it calls nfscl_umount() to get rid of the state info (opens/locks...) nfscl_umount() - synchronizes with other threads that will use this state (see below) using the combination of a mutex and a shared/exclusive sleep lock. (Because of various quirks in the code, this shared/exclusive lock is a locally coded version and I happened to call the shared case a refcnt and the exclusive case just a lock.) Other threads that will use state info (open/lock...) 
will: - call nfscl_getcl() - this function does two things that are relevant 1 - it allocates a new clientid, as required, while holding the mutex - this case needs to check for MNTK_UNMOUNTF and return an error, in case the clientid has already been deleted by nfscl_umount() above. (This happens before #2 because the sleep lock is in the clientid structure.) -- it must see MNTK_UNMOUNTF set if it happens after (in a temporal sense) being set by dounmount() 2 - while holding the mutex, it acquires the shared lock - if this happens before nfscl_umount() gets the exclusive lock, it is fine, since acquisition of the exclusive lock above will wait for its
Re: SMP question w.r.t. reading kernel variables
[good stuff snipped for brevity] 1. Set MNTK_UNMOUNTF 2. Acquire a standard FreeBSD mutex m. 3. Update some data structures. 4. Release mutex m. Then, other threads that acquire m after step 4 has occurred will see MNTK_UNMOUNTF as set. But, other threads that beat thread X to step 2 may or may not see MNTK_UNMOUNTF as set. First off, Alan, thanks for the great explanation. I think it would be nice if this was captured somewhere in the docs, if it isn't already there somewhere (I couldn't spot it, but that doesn't mean anything:-). The question that I have about your specific scenario is concerned with VOP_SYNC(). Do you care if another thread performing nfscl_getcl() after thread X has performed VOP_SYNC() doesn't see MNTK_UNMOUNTF as set? Well, no and yes. It doesn't matter if it doesn't see it after thread X performed nfs_sync(), but it does matter that the threads calling nfscl_getcl() see it before they compete with thread X for the sleep lock. Another relevant question is Does VOP_SYNC() acquire and release the same mutex as nfscl_umount() and nfscl_getcl()? No. So, to get this to work correctly it sounds like I have to do one of the following: 1 - mtx_lock(m); mtx_unlock(m); in nfs_sync(), where m is the mutex used by nfscl_getcl() for the NFS open/lock state. or 2 - mtx_lock(m); mtx_unlock(m); mtx_lock(m); before the point where I care that the threads executing nfscl_getcl() see MNTK_UNMOUNTF set in nfscl_umount(). or 3 - mtx_lock(m2); mtx_unlock(m2); in nfscl_getcl(), where m2 is the mutex used by thread X when setting MNTK_UNMOUNTF, before mtx_lock(m); and then testing MNTK_UNMOUNTF plus acquiring the sleep lock. (By doing it before, I can avoid any LOR issue and do an msleep() without worrying about having two mutex locks.) I think #3 reads the best, so I'll probably do that one. One more question, if you don't mind. Is step 3 in your explanation necessary for this to work? 
If it is, I can just create some global variable that I assign a value to between mtx_lock(m2); mtx_unlock(m2); but it won't be used for anything, so I thought I'd check if it is necessary? Thanks again for the clear explanation, rick
Re: SMP question w.r.t. reading kernel variables
[good stuff snipped for brevity] 1. Set MNTK_UNMOUNTF 2. Acquire a standard FreeBSD mutex m. 3. Update some data structures. 4. Release mutex m. Then, other threads that acquire m after step 4 has occurred will see MNTK_UNMOUNTF as set. But, other threads that beat thread X to step 2 may or may not see MNTK_UNMOUNTF as set. First off, Alan, thanks for the great explanation. I think it would be nice if this was captured somewhere in the docs, if it isn't already there somewhere (I couldn't spot it, but that doesn't mean anything:-). The question that I have about your specific scenario is concerned with VOP_SYNC(). Do you care if another thread performing nfscl_getcl() after thread X has performed VOP_SYNC() doesn't see MNTK_UNMOUNTF as set? Well, no and yes. It doesn't matter if it doesn't see it after thread X performed nfs_sync(), but it does matter that the threads calling nfscl_getcl() see it before they compete with thread X for the sleep lock. Another relevant question is Does VOP_SYNC() acquire and release the same mutex as nfscl_umount() and nfscl_getcl()? No. So, to get this to work correctly it sounds like I have to do one of the following: 1 - mtx_lock(m); mtx_unlock(m); in nfs_sync(), where m is the mutex used by nfscl_getcl() for the NFS open/lock state. or 2 - mtx_lock(m); mtx_unlock(m); mtx_lock(m); before the point where I care that the threads executing nfscl_getcl() see MNTK_UNMOUNTF set in nfscl_umount(). or 3 - mtx_lock(m2); mtx_unlock(m2); in nfscl_getcl(), where m2 is the mutex used by thread X when setting MNTK_UNMOUNTF, before mtx_lock(m); and then testing MNTK_UNMOUNTF plus acquiring the sleep lock. (By doing it before, I can avoid any LOR issue and do an msleep() without worrying about having two mutex locks.) I think #3 reads the best, so I'll probably do that one. One more question, if you don't mind. Is step 3 in your explanation necessary for this to work? 
If it is, I can just create some global variable that I assign a value to between mtx_lock(m2); mtx_unlock(m2); but it won't be used for anything, so I thought I'd check if it is necessary? Oops, I screwed up this question. For my #3, all that needs to be done in nfscl_getcl() before I care if it sees MNTK_UNMOUNTF set is mtx_lock(m2); since that has already gone through your steps 1-4. The question w.r.t. do you really need your step 3 would apply to the cases where I was using m (the mutex nfscl_umount() and nfscl_getcl() already use instead of the one used by thread X). rick
Re: SMP question w.r.t. reading kernel variables
On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote: Hi, I should know the answer to this, but... When reading a global kernel variable, where its modifications are protected by a mutex, is it necessary to get the mutex lock to just read its value? For example:

A:
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0)
		return (EPERM);

versus B:
	MNT_ILOCK(mp);
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) {
		MNT_IUNLOCK(mp);
		return (EPERM);
	}
	MNT_IUNLOCK(mp);

My hunch is that B is necessary if you need an up-to-date value for the variable (mp->mnt_kern_flag in this case). Is that correct? You already have good followups from Attilio and Kostik, but one thing to keep in mind is that if a simple read is part of a larger atomic operation then it may still need a lock. In this case Kostik points out that another lock prevents updates to mnt_kern_flag so that this is safe. However, if not for that you would need to consider the case that another thread sets the flag on the next instruction. Even the B case above might still have that problem since you drop the lock right after checking it and the rest of the function is implicitly assuming the flag is never set perhaps (or it needs to handle the case that the flag might become set in the future while MNT_ILOCK() is dropped). One way you can make that code handle that race is by holding MNT_ILOCK() around the entire function, but that approach is often only suitable for a simple routine. All of this makes sense. What I was concerned about was memory cache consistency and whet (if anything) has to be done to make sure a thread doesn't see a stale cached value for the memory location. Here's a generic example of what I was thinking of: (assume x is a global int and y is a local int on the thread's stack) - time proceeds down the screen

	thread X on CPU 0		thread Y on CPU 1
	x = 0;
					x = 0; /* 0 for x's location in CPU 1's memory cache */
	x = 1;
					y = x;

-- now, is y guaranteed to be 1 or can it get the stale cached 0 value? 
if not, what needs to be done to guarantee it? For the original example, I am fine so long as the bit is seen as set after dounmount() has set it. Also, I see cases of: mtx_lock(np); np->n_attrstamp = 0; mtx_unlock(np); in the regular NFS client. Why is the assignment mutex locked? (I had assumed it was related to the above memory caching issue, but now I'm not so sure.) Thanks a lot for all the good responses, rick ps: I guess it comes down to whether or not atomic includes ensuring memory cache consistency. I'll admit I assumed atomic meant that the memory access or modify couldn't be interleaved with one done to the same location by another CPU, but not memory cache consistency.
Re: SMP question w.r.t. reading kernel variables
All of this makes sense. What I was concerned about was memory cache consistency and whet (if anything) has to be done to make sure a Oops, whet should have been what..
SMP question w.r.t. reading kernel variables
Hi, I should know the answer to this, but... When reading a global kernel variable, where its modifications are protected by a mutex, is it necessary to get the mutex lock to just read its value? For example:

A:
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0)
		return (EPERM);

versus B:
	MNT_ILOCK(mp);
	if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) {
		MNT_IUNLOCK(mp);
		return (EPERM);
	}
	MNT_IUNLOCK(mp);

My hunch is that B is necessary if you need an up-to-date value for the variable (mp->mnt_kern_flag in this case). Is that correct? Thanks in advance for help with this, rick
Re: SMP question w.r.t. reading kernel variables
On Sun, Apr 17, 2011 at 03:49:48PM -0400, Rick Macklem wrote: Hi, I should know the answer to this, but... When reading a global kernel variable, where its modifications are protected by a mutex, is it necessary to get the mutex lock to just read its value? For example: A: if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) return (EPERM); versus B: MNT_ILOCK(mp); if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) { MNT_IUNLOCK(mp); return (EPERM); } MNT_IUNLOCK(mp); My hunch is that B is necessary if you need an up-to-date value for the variable (mp->mnt_kern_flag in this case). Is that correct? A read of mnt_kern_flag is atomic on all architectures. If, as I suspect, the fragment is for the VFS_UNMOUNT() fs method, then VFS guarantees the stability of mnt_kern_flag by blocking other attempts to unmount until the current one is finished. If not, then either you do not need the lock, or the provided snippet which takes a lock is insufficient, since you are dropping the lock but continuing the action that depends on the flag not being set. Sounds like A should be ok then. The tests matter when dounmount() calls VFS_SYNC() and VFS_UNMOUNT(), pretty much as you guessed. To be honest, most of it will be the thread doing the dounmount() call, although other threads fall through VOP_INACTIVE() while they are terminating in VFS_UNMOUNT() and these need to do the test, too. I just don't know much about the SMP stuff, so I don't know when a cache on another core might still have a stale copy of a value. I've heard the term memory barrier, but don't really know what it means. :-) Thanks, rick
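The lockless read in variant A can be illustrated in userspace with C11 atomics; the kernel uses its own primitives (atomic_load_acq_int() and friends), so this is only an analogous sketch with made-up names:

```c
#include <assert.h>
#include <stdatomic.h>

/* Analogue of mnt_kern_flag: a writer sets a bit; a reader that can
 * tolerate a slightly-stale answer loads the word without the mutex. */
#define MNTK_UNMOUNTF 0x01

static _Atomic unsigned kern_flag;

/* Writer side: set the flag with release semantics so that everything
 * written before the flag becomes visible to an acquire-side reader. */
static void
set_unmount_flag(void)
{
    atomic_fetch_or_explicit(&kern_flag, MNTK_UNMOUNTF,
        memory_order_release);
}

/* Reader side: variant "A" from the mail -- no mutex, just an acquire
 * load.  The value may lag a concurrent writer, but it is never torn,
 * and once the writer's store is visible the flag is seen as set. */
static int
unmount_in_progress(void)
{
    return (atomic_load_explicit(&kern_flag, memory_order_acquire)
        & MNTK_UNMOUNTF) != 0;
}
```

The tradeoff is exactly the one in the thread: the unlocked read can observe a momentarily stale value, but never a half-written one, so it is fine when "seen as set eventually after the writer sets it" is all you need.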
Re: Getting vnode + credentials of a file from a struct mount and UFS inode #
Hi, Yes, I am.. that was my suspicion (e.g., that it was the parameters of the process which called open()/creat()/socket()/... originally). What's the quickest way to get back to the v/inode's uid/gid? Also, calling VFS_VGET() seems to give me a lockmgr panic with unknown type 0x0. VFS_VGET() returns a vnode ptr; it doesn't need the argument set to one. The flags argument (assuming a recent kernel) needs to be LK_EXCLUSIVE or LK_SHARED, not 0 (I suspect that's your panic). What is odd is that the only way I can get a vnode for VFS_VGET is through struct file, and then shouldn't I be able to use that? I tried flipping that vnode to an inode with VTOI() and it was also giving me zeros for i_uid, i_gid, etc., when it shouldn't have been. After VFS_VGET() returns a vp, I'd do a VOP_GETATTR() and then vput() the vp to release it. Look for examples of these calls in the kernel sources. The struct vattr filled in by VOP_GETATTR() has va_uid and va_gid in it, which are the uid/gid that own the file, which is what I think you are trying to get. (Credentials generally refer to the effective uid + gids etc. of the process/thread trying to do the syscall.) rick
Re: NFS: file too large
:Well, since a server specifies the maximum file size it can :handle, it seems good form to check for that in the client. :(Although I'd agree that a server shouldn't crash on a read/write :that goes beyond that limit.) : :Also, as Matt notes, off_t is signed. As such, it looks to me like :the check could mess up if uio_offset is right near 0x7fff, :so that uio->uio_offset + uio->uio_resid ends up negative. I think the :check a little above that for uio_offset < 0 should also check :uio_offset + uio_resid < 0 to avoid this. : :rick Yes, though doing an overflow check in C, at least with newer versions of GCC, requires a separate comparison. The language has been mangled pretty badly over the years.

if (a + b < a)          - can be optimized out by the compiler
if (a + b < 0)          - also can be optimized out by the compiler
x = a + b; if (x < a)   - this is ok (best method)
x = a + b; if (x < 0)   - this is ok

Ok, thanks. I'll admit to being an old K&R type guy. My question, badly written, was why not let the underlying fs (ufs, zfs, etc.) have the last word, instead of the nfsclient having to guess? Is there a problem in sending back the error? Well, the principle I try to apply in the name of interoperability is: 1 - The client should adhere to the RFCs as strictly as possible. 2 - The server should assume the loosest interpretation of the RFCs. For me #1 applies, i.e. if a server specifies a maximum file size, the client should not violate that. (Meanwhile the server should assume that clients will exceed the maximum sooner or later.) Remember that the server might be a Netapp, EMC, ... and those vendors mostly test their servers against Linux and Solaris clients. (I've tried to convince them to fire up FreeBSD systems in-house for testing and even volunteered to help with the setup, but if they've done so, I've never heard about it. Their usual response is come to connectathon. See below.)
Here's an NFSv4.0 example: RFC 3530 describes the dircount argument for Readdir as a hint of the maximum number of bytes of directory information (in the 4th para of pg 191). One vendor ships an NFSv4 client that always sets this value to 0. Their argument is that, since it is only a hint, it can be anything they feel like putting there. (Several servers crapped out because of this in the early days.) Part of the problem is that I am not in a position to attend the interoperability testing events like www.connectathon.org, where these things are usually discovered (and since they are covered under an NDA that attendees sign, I don't find out the easy way when problems occur). rick
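The overflow hazard discussed above can also be avoided without the extra temporary by checking before the addition ever happens; since signed overflow is undefined behavior in C, this form can never be optimized away. A userspace sketch; io_range_ok and OFF_MAX_ are made-up names for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel types/limits discussed in the thread. */
typedef int64_t off_t_;              /* off_t is signed */
#define OFF_MAX_ INT64_MAX

/* Return nonzero if an I/O starting at 'offset' with 'resid' bytes
 * stays within [0, maxfilesize].  The comparison is rearranged as a
 * subtraction (resid > maxfilesize - offset), so the signed sum
 * offset + resid is never formed and can never wrap. */
static int
io_range_ok(off_t_ offset, off_t_ resid, off_t_ maxfilesize)
{
    if (offset < 0 || resid < 0)
        return 0;
    if (resid > maxfilesize - offset)   /* i.e. offset + resid would
                                         * exceed maxfilesize */
        return 0;
    return 1;
}
```

With offset and maxfilesize both non-negative, maxfilesize - offset cannot overflow, so the check is safe even when offset is right at the top of the off_t range.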
Re: NFS: file too large
BTW, why not do away with the test altogether? Well, since a server specifies the maximum file size it can handle, it seems good form to check for that in the client. (Although I'd agree that a server shouldn't crash on a read/write that goes beyond that limit.) Also, as Matt notes, off_t is signed. As such, it looks to me like the check could mess up if uio_offset is right near 0x7fff, so that uio->uio_offset + uio->uio_resid ends up negative. I think the check a little above that for uio_offset < 0 should also check uio_offset + uio_resid < 0 to avoid this. rick
Re: NFS: file too large
I'm getting 'File too large' when copying via NFS (v3, tcp/udp) a file that is larger than 1T. The server is ZFS, which has no problem with large files. Is this fixable? As I understand it, there is no FreeBSD VFS op that returns the maximum file size supported. As such, the NFS servers just take a guess. You can either switch to the experimental NFS server, which guesses the largest size expressible in 64 bits, OR you can edit sys/nfsserver/nfs_serv.c and change the assignment of a value to maxfsize = XXX; at around line #3671 to a larger value. I didn't check to see if there are additional restrictions in the clients. (They should believe what the server says it can support.) rick well, after some more experimentation, it seems to be a FreeBSD client issue. if the client is linux there is no problem. Try editing line #1226 of sys/nfsclient/nfs_vfsops.c, where it sets nm_maxfilesize = (u_int64_t)0x80000000 * DEV_BSIZE - 1; and make it something larger. I have no idea why the limit is set that way. (I'm guessing it was the limit for UFS.) Hopefully not some weird buffer cache restriction or similar, but you'll find out when you try increasing it. :-) I think I'll ask freebsd-fs@ about increasing this for NFSv3 and 4, since the server does provide a limit. (The client currently only reduces nm_maxfilesize from the above initial value using the server's limit.) Just grep nm_maxfilesize *.c in sys/nfsclient and you'll see it. BTW, I 'think' I'm using the experimental server, but how can I be sure? I have the -e set for both nfs_server and mountd, I don't have option NFSD, but nfsd.ko gets loaded. You can check by: # nfsstat -s # nfsstat -e -s and see which one reports non-zero RPC counts. If you happen to be running the regular server (probably not, given the above), you need to edit the server code as well as the client side.
Good luck with it, rick
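For reference, the client's initial value in nfs_vfsops.c, (u_int64_t)0x80000000 * DEV_BSIZE - 1 with the usual DEV_BSIZE of 512, works out to exactly 1 TB minus one byte (2^31 blocks of 2^9 bytes = 2^40 bytes), which lines up with the reported failure on files larger than 1T. A quick check:

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSIZE 512   /* the usual FreeBSD value */

/* The nfs client's default cap, as set in sys/nfsclient/nfs_vfsops.c. */
static uint64_t
default_nfs_maxfilesize(void)
{
    /* 0x80000000 blocks * 512 bytes/block = 2^31 * 2^9 = 2^40 bytes. */
    return (uint64_t)0x80000000 * DEV_BSIZE - 1;
}
```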
Re: NFS: file too large
I'm getting 'File too large' when copying via NFS (v3, tcp/udp) a file that is larger than 1T. The server is ZFS, which has no problem with large files. Is this fixable? As I understand it, there is no FreeBSD VFS op that returns the maximum file size supported. As such, the NFS servers just take a guess. You can either switch to the experimental NFS server, which guesses the largest size expressible in 64 bits, OR you can edit sys/nfsserver/nfs_serv.c and change the assignment of a value to maxfsize = XXX; at around line #3671 to a larger value. I didn't check to see if there are additional restrictions in the clients. (They should believe what the server says it can support.) rick
Re: NFS Performance
Rick, do you have more details on the issue? Is it 8.x only? Can you point us to the stable thread about this? The bug is in the krpc, which means it's 8.x specific (at least for NFS; I'm not sure if the nlm used the krpc in 7.x?). David P. Discher reported a performance problem some time ago when testing the FreeBSD 8 client against certain servers. (I can't find the thread, so maybe it never had a freebsd-stable@ cc after all.) Fortunately John Gemignani spotted the cause (for at least his case, because he tested a patch that seemed to resolve the problem). The bug is basically that the client side krpc for TCP assumed that the 4 bytes of data that hold the length of the RPC message are in one mbuf and don't straddle multiple mbufs. If the 4 bytes do straddle multiple mbufs, the krpc gets a garbage message length and then typically wedges, eventually recovering by starting a fresh TCP connection and retrying the outstanding RPCs. I have no idea if George is seeing the same problem, but the 1.5-minute logjams suggest that it might be. I emailed him a patch and, hopefully, he will report back on whether or not it helped. A patch for the above bug is in the works for head, rick
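Handling the 4-byte big-endian RPC record mark correctly means accumulating bytes across receive buffers until all four have arrived, rather than assuming they sit in one mbuf. A userspace sketch of that accumulation (struct recmark and recmark_feed are made-up names for illustration, not the actual krpc code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulates the 4-byte big-endian RPC record mark even when it
 * arrives split across several receive buffers -- the case the broken
 * code assumed away by reading it from a single mbuf. */
struct recmark {
    uint8_t buf[4];
    int     have;       /* bytes collected so far */
};

/* Feed bytes from one buffer; returns the number consumed.  Once all
 * four bytes are in hand, *done is set to 1 and *len_out holds the
 * fragment length. */
static size_t
recmark_feed(struct recmark *rm, const uint8_t *p, size_t n,
    int *done, uint32_t *len_out)
{
    size_t used = 0;

    while (rm->have < 4 && used < n)
        rm->buf[rm->have++] = p[used++];
    if (rm->have == 4) {
        *len_out = ((uint32_t)rm->buf[0] << 24) |
                   ((uint32_t)rm->buf[1] << 16) |
                   ((uint32_t)rm->buf[2] << 8) |
                    (uint32_t)rm->buf[3];
        *len_out &= 0x7fffffff;  /* low 31 bits are the length; the
                                  * top bit is the last-fragment flag */
        *done = 1;
    } else
        *done = 0;
    return used;
}
```

Reading the mark from one buffer works almost all the time, which is exactly why this kind of bug sits undetected until some server's segmentation happens to split those 4 bytes.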
Re: NFS Performance
George, I remember reading there was some sort of NFS issue in 8.1-RELEASE, a regression of some sort noted early in the release. Have you tried this with 8.2-RC1, and what are your NFS client mount options? On 1/8/11, george+free...@m5p.com wrote: Among four machines on my network, I'm observing startling differences in NFS performance. All machines are AMD64, and rpc_statd, rpc_lockd, and amd are enabled on all four machines.

wonderland: hw.model: AMD Athlon(tm) II Dual-Core M32; hw.physmem: 293510758; ethernet: 100Mb/s; partition 1: FreeBSD 8.1-STABLE; partition 2: FreeBSD 7.3-STABLE
scollay: hw.model: AMD Sempron(tm) 140 Processor; hw.physmem: 186312294; ethernet: 1000Mb/s; FreeBSD 8.1-PRERELEASE
sullivan: hw.model: AMD Athlon(tm) 64 X2 Dual Core Processor 4800+; hw.physmem: 4279980032; ethernet: 1000Mb/s; FreeBSD 7.2-RELEASE
mattapan: hw.model: AMD Sempron(tm) Processor 2600+; hw.physmem: 456380416; ethernet: 1000Mb/s; FreeBSD 7.1-RELEASE

Observed bytes per second (dd if=filename of=/dev/null bs=65536):

                      Source machine:
Destination machine:  mattapan  scollay  sullivan
wonderland/7.3          870K      5.2M     1.8M
wonderland/8.1          496K      690K     420K
mattapan                          38M      28M
scollay                  33M               33M
sullivan                 38M       5M

There is one 10/100/1000Mb/s ethernet switch between the various pairs of machines. I'm startled by the numbers for wonderland, first because of how much the 100Mb/s interface slows things down, but even more because of how much difference there is on identical hardware between FreeBSD 7 and FreeBSD 8.
Even more annoying: when running 8.1 on wonderland, NFS simply locks up at random for roughly a minute and a half under high load (such as when firefox does a gazillion locked references to my places.sqlite file), leading to entertaining log message clusters such as:

Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:17:41 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:17:47 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:17:47 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:01 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:01 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:02 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:08 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:08 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:18:09 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:21 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:21 wonderland last message repeated 2 times
Dec 29 08:20:22 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:22 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:20:36 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:20:36 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:21:05 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:21:10 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:20 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:22 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:22 wonderland kernel: nfs server home:/usr: is alive again
Dec 29 08:22:24 wonderland kernel: nfs server home:/usr: not responding
Dec 29 08:22:24 wonderland last message repeated 2 times
Re: NFS server hangs (was no subject)
I have a similar problem. I have an NFS server (8.0, upgraded a couple of times since Feb 2010) that locks up and requires a reboot. The clients are busy VMs from VMware ESXi using the NFS server for vmdk virtual disk storage. ESXi reports the NFS server inactive and all the VMs post disk write errors when trying to write to their disks. /etc/rc.d/nfsd restart fails to work (it cannot kill the nfsd process). The nfsd process runs at 100% cpu at rc_lo state in top. A reboot is the only fix. It has only happened under two circumstances: 1) Installation of a VM using Windows 2008. 2) Migrating 16 million mail messages from a physical server to a VM running FreeBSD with a ZFS file system, on the ESXi box that uses NFS to store the VM's ZFS disk. The NFS server uses ZFS also. I don't think what you are seeing is the same as what others have reported. (I have a hunch that your problem might be a replay cache problem.) Please try the attached patch and make sure that your sys/rpc/svc.c is at r205562 (upgrade if it isn't). If this patch doesn't help, you could try using the experimental NFS server (which doesn't use the generic replay cache), by adding -e to mountd and nfsd. Please let me know if the patch or switching to the experimental NFS server helps, rick

--- rpc/replay.c.sav	2010-08-08 18:05:50.0 -0400
+++ rpc/replay.c	2010-08-08 18:16:43.0 -0400
@@ -90,8 +90,10 @@
 replay_setsize(struct replay_cache *rc, size_t newmaxsize)
 {
+	mtx_lock(&rc->rc_lock);
 	rc->rc_maxsize = newmaxsize;
 	replay_prune(rc);
+	mtx_unlock(&rc->rc_lock);
 }
 
 void
@@ -144,8 +146,8 @@
 	bool_t freed_one;
 
 	if (rc->rc_count >= REPLAY_MAX || rc->rc_size > rc->rc_maxsize) {
-		freed_one = FALSE;
 		do {
+			freed_one = FALSE;
 			/*
 			 * Try to free an entry.  Don't free in-progress entries
 			 */
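To see why the second hunk of the patch matters: with freed_one reset only once, outside the do loop, a later pass that frees nothing can inherit freed_one = TRUE from an earlier pass, so a while (freed_one && ...) exit condition never becomes false once only in-progress entries remain, and the pruning loop spins forever (consistent with the nfsd-at-100%-cpu symptom). A simplified userspace sketch of the corrected loop; the data structures here are made up for illustration, not the real replay cache:

```c
#include <assert.h>
#include <stdbool.h>

#define NENT 8

/* Simplified stand-in for the replay cache: entries are either
 * in-progress (must not be freed) or completed (freeable). */
struct cache {
    bool present[NENT];
    bool in_progress[NENT];
    int  count;
    int  maxcount;
};

/* Corrected prune loop, as in the patch: freed_one is reset at the top
 * of EVERY pass, so a pass that frees nothing exits the loop instead
 * of spinning on a stale TRUE left over from an earlier pass. */
static void
prune(struct cache *rc)
{
    bool freed_one;

    if (rc->count > rc->maxcount) {
        do {
            freed_one = false;          /* the fix: reset per pass */
            for (int i = 0; i < NENT; i++) {
                /* Try to free an entry; skip in-progress ones. */
                if (rc->present[i] && !rc->in_progress[i]) {
                    rc->present[i] = false;
                    rc->count--;
                    freed_one = true;
                    break;
                }
            }
        } while (freed_one && rc->count > rc->maxcount);
    }
}
```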
Re: possible NFS lockups
From: Sam Fourman On Tue, Jul 27, 2010 at 10:29 AM, krad kra...@googlemail.com wrote: I have a production mail system with an nfs backend. Every now and again we see the nfs die on a particular head end. However, it doesn't die across all the nodes. This suggests to me there isn't an issue with the filer itself, and the stats from the filer concur with that. The symptoms are lines like this appearing in dmesg: nfs server 10.44.17.138:/vol/vol1/mail: not responding nfs server 10.44.17.138:/vol/vol1/mail: is alive again trussing df, it seems to hang on getfsstat; this is presumably when it tries the nfs mounts I also have this problem, where nfs locks up on a FreeBSD 9 server and a FreeBSD RELENG_8 client If by RELENG_8 you mean 8.0 (or pre-8.1), there are a number of patches for the client side krpc. They can be found at: http://people.freebsd.org/~rmacklem/freebsd8.0-patches (These are all in FreeBSD 8.1, so ignore this if your client is already running FreeBSD 8.1.) rick ps: lock up can mean many things. The more specific you can be w.r.t. the behaviour, the more likely it can be resolved. For example: - No more access to the subtree under the mount point is possible until the client is rebooted. When you do a ps axlH, one process that was accessing a file under the mount point is shown with WCHAN rpclock and STAT DL. vs - All access to the mount point stops for about 1 minute and then recovers. Also, showing what mount options are being used by the client and whether or not rpc.lockd and rpc.statd are running can be useful. And looking at the net traffic with wireshark when it is locked up, to see if any NFS traffic is happening, can also be useful.
Re: possible NFS lockups
From: krad kra...@googlemail.com To: freebsd-hackers@freebsd.org, FreeBSD Questions freebsd-questi...@freebsd.org Sent: Tuesday, July 27, 2010 11:29:20 AM Subject: possible NFS lockups I have a production mail system with an nfs backend. Every now and again we see the nfs die on a particular head end. However, it doesn't die across all the nodes. This suggests to me there isn't an issue with the filer itself, and the stats from the filer concur with that. The symptoms are lines like this appearing in dmesg: nfs server 10.44.17.138:/vol/vol1/mail: not responding nfs server 10.44.17.138:/vol/vol1/mail: is alive again trussing df, it seems to hang on getfsstat; this is presumably when it tries the nfs mounts, eg

__sysctl(0xbfbfe224,0x2,0xbfbfe22c,0xbfbfe230,0x0,0x0) = 0 (0x0)
mmap(0x0,1048576,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1746583552 (0x681ac000)
mmap(0x682ac000,344064,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 1747632128 (0x682ac000)
munmap(0x681ac000,344064) = 0 (0x0)
getfsstat(0x68201000,0x1270,0x2,0xbfbfe960,0xbfbfe95c,0x1) = 9 (0x9)

I have played with mount options a fair bit but they don't make much difference. This is what they are set to at present: 10.44.17.138:/vol/vol1/mail /mail/0 nfs rw,noatime,tcp,acdirmax=320,acdirmin=180,acregmax=320,acregmin=180 0 0 When this locking is occurring, I find that if I do a showmount, or mount 10.44.17.138:/vol/vol1/mail again under another mount point, I can access it fine. One thing I have just noticed is that lockd and statd always seem to have died when this happens. Restarting does not help lockd and statd implement separate protocols (NLM and NSM) that do locking. The protocols were poorly designed and fundamentally broken imho. (That refers to the protocols and not the implementation.)
I am not familiar with the lockd and statd implementations, but if you don't need file locking to work for the same file when accessed from multiple clients (heads) concurrently, you can use the nolockd mount option to avoid using them. (I have no idea if the mail system you are using will work without lockd or not? It should be ok to use nolockd if file locking is only done on a given file in one client node.) I suspect that some interaction between your server and the lockd/statd client causes them to crash, and then the client is stuck trying to talk to them, but I don't really know? Looking at where all the processes and threads are sleeping via ps axlH may tell you what is stuck and where. As others noted, intermittent server not responding.../server ok messages just indicate slow response from the server and don't mean much. However, if a given process is hung and doesn't recover, knowing what it is sleeping on can help w.r.t. diagnosis. rick
Re: NFS write corruption on 8.0-RELEASE
On Thu, 11 Feb 2010, John Baldwin wrote: [good stuff snipped] Case 1: single corrupted block 3779CF88-3779 (12408 bytes). Data in the block is shifted 68 bytes up, losing the first 68 bytes and filling the last 68 bytes with garbage. Interestingly, among that garbage is my hostname. Is it the hostname of the server or the client? My guess is that hades.panopticon (or something like that :-) is the client. The garbage is 4 bytes (80 00 80 84) followed by the first part of the RPC header. (Bytes 5-8 vary because they are the xid, and then the host name is part of the AUTH_SYS authenticator.) For Case 2 and Case 3, you see less of it, but it's the same stuff. Why? I have no idea, although it smells like some sort of corruption of the mbuf list. (It would be nice if you could switch to a different net interface/driver. Just a thought, since others don't seem to be seeing this?) As John said, it would be nice to try and narrow it down to client or server side, too. Don't know if this helps or is just noise, rick
Re: NFS write corruption on 8.0-RELEASE
On Fri, 12 Feb 2010, Dmitry Marakasov wrote: Interesting, I'll try disabling it. However, now I really wonder why such a dangerous option is available (given it's the cause) at all, especially without a notice. Silent data corruption is possibly the worst thing to happen, ever. I doubt that the data corruption you are seeing would be because of soft. soft will cause various problems w.r.t. consistency, but in the case of a write through the buffer cache, I think it will leave the buffer dirty and eventually it will get another write attempt. However, without the soft option NFS would be a strange thing to use - network problems are a kind of inevitable thing, and having all processes locked in an unkillable state (with hard mounts) when the network dies is not fun. Or am I wrong? Well, using NFS over an unreliable network is going to cause grief sooner or later. The problem is that POSIX apps don't expect I/O system calls to fail with EIO and generally don't handle that gracefully. For the future, I think umount -F (a forced dismount that accepts data loss) is the best compromise, since at least then a sysadmin knows that data corruption could have occurred when they do it and can choose to wait until the network is fixed as an alternative to the corruption. rick
Re: NFS write corruption on 8.0-RELEASE
On Fri, 12 Feb 2010, Dmitry Marakasov wrote: * Oliver Fromme (o...@lurza.secnetix.de) wrote: I'm sorry for the confusion ... I do not think that it's the cause for your data corruption, in this particular case. I just mentioned the potential problems with soft mounts because they could cause additional problems for you. (And it's important to know anyhow.) Oh, then I really misunderstood. If the corruption implied is like when you copy a file via NFS and the net goes down, and in the case of a soft mount you have half of a file (read: corruption), while with a hard mount the copy process will finish when the net is back up, that's definitely OK and expected. The problem is that the client can't distinguish between a slow network/server and a partitioned/failed network. In your case (one client) it may work out ok. (I can't remember how long it takes for soft to time out and give up.) For many clients talking to an NFS server, the NFS server's response time can degrade to the point where soft mounted clients start timing out, and that can get ugly. rick
Re: NFS ( amd?) dysfunction descending a hierarchy
On Tue, 9 Dec 2008, David Wolfskill wrote: On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote: I seem to have a fairly- (though not deterministically so) reproducible mode of failure with an NFS-mounted directory hierarchy: An attempt to traverse a sufficiently large hierarchy (e.g., via tar zcpf or rm -fr) will fail to visit some subdirectories, typically apparently acting as if the subdirectories in question do not actually exist (despite the names having been returned in the output of a previous readdir()). ... I was able to reproduce the external symptoms of the failure running CURRENT as of yesterday, using rm -fr of a copy of a recent /usr/ports hierarchy on an NFS-mounted file system as a test case. However, I believe the mechanism may be a bit different -- while still being other than what I would expect. One aspect in which the externally-observable symptoms were different (under CURRENT vs. RELENG_7) is that under CURRENT, once the error condition occurred, the NFS client machine was in a state where it merely kept repeating nfs server [EMAIL PROTECTED]:/volume: not responding until I logged in as root and rebooted it. The different behaviour for -CURRENT could be the newer RPC layer that was recently introduced, but that doesn't explain the basic problem. All I can think of is to ask the obvious question: Are you using interruptible or soft mounts? If so, switch to hard mounts and see if the problem goes away. (imho, neither interruptible nor soft mounts are a good idea. You can use a forced dismount if there is a crashed NFS server that isn't coming back anytime soon.) If you are getting this with hard mounts, I'm afraid I have no idea what the problem is, rick.