On Jul 6, 2011, at 12:15 AM, Jeremy Chadwick wrote: > On Wed, Jul 06, 2011 at 01:54:12PM +1000, Peter Ross wrote: >> Quoting "Jeremy Chadwick" <free...@jdc.parodius.com>: >> >>> On Wed, Jul 06, 2011 at 01:07:53PM +1000, Peter Ross wrote: >>>> Quoting "Jeremy Chadwick" <free...@jdc.parodius.com>: >>>> >>>>> On Wed, Jul 06, 2011 at 12:23:39PM +1000, Peter Ross wrote: >>>>>> Quoting "Jeremy Chadwick" <free...@jdc.parodius.com>: >>>>>> >>>>>>> On Tue, Jul 05, 2011 at 01:03:20PM -0400, Scott Sipe wrote: >>>>>>>> I'm running virtualbox 3.2.12_1 if that has anything to do with it. >>>>>>>> >>>>>>>> sysctl vfs.zfs.arc_max: 6200000000 >>>>>>>> >>>>>>>> While I'm trying to scp, kstat.zfs.misc.arcstats.size is >>>>>>>> hovering right around that value, sometimes above, sometimes >>>>>>>> below (that's as it should be, right?). I don't think that it >>>>>>>> dies when crossing over arc_max. I can run the same scp 10 times >>>>>>>> and it might fail 1-3 times, with no correlation to the >>>>>>>> arcstats.size being above/below arc_max that I can see. >>>>>>>> >>>>>>>> Scott >>>>>>>> >>>>>>>> On Jul 5, 2011, at 3:00 AM, Peter Ross wrote: >>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> just as an addition: an upgrade to last Friday's >>>>>>>>> FreeBSD-Stable and to VirtualBox 4.0.8 does not fix the >>>>>>>>> problem. >>>>>>>>> >>>>>>>>> I will experiment a bit more tomorrow after hours and grab >>>>>> some statistics. >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> Peter >>>>>>>>> >>>>>>>>> Quoting "Peter Ross" <peter.r...@bogen.in-berlin.de>: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> I noticed a similar problem last week. 
It is also very >>>>>>>>>> similar to one reported last year: >>>>>>>>>> >>>>>>>>>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058708.html >>>>>>>>>> >>>>>>>>>> My server is a Dell T410 server with the same bge card (the >>>>>>>>>> same pciconf -lvc output as described by Mahlon: >>>>>>>>>> >>>>>>>>>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058711.html >>>>>>>>>> >>>>>>>>>> Yours, Scott, is a em(4).. >>>>>>>>>> >>>>>>>>>> Another similarity: In all cases we are using VirtualBox. I >>>>>>>>>> just want to mention it, in case it matters. I am still >>>>>>>>>> running VirtualBox 3.2. >>>>>>>>>> >>>>>>>>>> Most of the time kstat.zfs.misc.arcstats.size was reaching >>>>>>>>>> vfs.zfs.arc_max then, but I could catch one or two cases >>>>>>>>>> then the value was still below. >>>>>>>>>> >>>>>>>>>> I added vfs.zfs.prefetch_disable=1 to sysctl.conf but it >>>> does not help. >>>>>>>>>> >>>>>>>>>> BTW: It looks as ARC only gives back the memory when I >>>>>>>>>> destroy the ZFS (a cloned snapshot containing virtual >>>>>>>>>> machines). Even if nothing happens for hours the buffer >>>>>>>>>> isn't released.. >>>>>>>>>> >>>>>>>>>> My machine was still running 8.2-PRERELEASE so I am upgrading. >>>>>>>>>> >>>>>>>>>> I am happy to give information gathered on old/new kernel if it >>>>>>>>>> helps. >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> Peter >>>>>>>>>> >>>>>>>>>> Quoting "Scott Sipe" <csco...@gmail.com>: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Jul 2, 2011, at 12:54 AM, jhell wrote: >>>>>>>>>>> >>>>>>>>>>>> On Fri, Jul 01, 2011 at 03:22:32PM -0700, Jeremy Chadwick wrote: >>>>>>>>>>>>> On Fri, Jul 01, 2011 at 03:13:17PM -0400, Scott Sipe wrote: >>>>>>>>>>>>>> I'm running 8.2-RELEASE and am having new problems >>>>>>>>>>>>>> with scp. When scping >>>>>>>>>>>>>> files to a ZFS directory on the FreeBSD server -- >>>>>>>>>>>>>> most notably large files >>>>>>>>>>>>>> -- the transfer frequently dies after just a few >>>>>>>>>>>>>> seconds. 
In my last test, I >>>>>>>>>>>>>> tried to scp an 800mb file to the FreeBSD system and >>>>>>>>>>>>>> the transfer died after >>>>>>>>>>>>>> 200mb. It completely copied the next 4 times I >>>>>>>>>>>>>> tried, and then died again on >>>>>>>>>>>>>> the next attempt. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On the client side: >>>>>>>>>>>>>> >>>>>>>>>>>>>> "Connection to home closed by remote host. >>>>>>>>>>>>>> lost connection" >>>>>>>>>>>>>> >>>>>>>>>>>>>> In /var/log/auth.log: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jul 1 14:54:42 freebsd sshd[18955]: fatal: Write >>>>>>>>>>>>>> failed: Cannot allocate >>>>>>>>>>>>>> memory >>>>>>>>>>>>>> >>>>>>>>>>>>>> I've never seen this before and have used scp before >>>>>>>>>>>>>> to transfer large files >>>>>>>>>>>>>> without problems. This computer has been used in >>>>>>>>>>>>>> production for months and >>>>>>>>>>>>>> has a current uptime of 36 days. I have not been >>>>>>>>>>>>>> able to notice any problems >>>>>>>>>>>>>> copying files to the server via samba or netatalk, or >>>>>> any problems in >>>>>>>>>>>>>> apache. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Uname: >>>>>>>>>>>>>> >>>>>>>>>>>>>> FreeBSD xeon 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Sat >>>>>>>>>>>>>> Feb 19 01:02:54 EST >>>>>>>>>>>>>> 2011 root@xeon:/usr/obj/usr/src/sys/GENERIC amd64 >>>>>>>>>>>>>> >>>>>>>>>>>>>> I've attached my dmesg and output of vmstat -z. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I have not restarted the sshd daemon or rebooted the computer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Am glad to provide any other information or test anything else. >>>>>>>>>>>>>> >>>>>>>>>>>>>> {snip vmstat -z and dmesg} >>>>>>>>>>>>> >>>>>>>>>>>>> You didn't provide details about your networking setup (rc.conf, >>>>>>>>>>>>> ifconfig -a, etc.). netstat -m would be useful too. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Next, please see this thread circa September 2010, titled "Network >>>>>>>>>>>>> memory allocation failures": >>>>>>>>>>>>> >>>>>>>>>>>>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/thread.html#58708 >>>>>>>>>>>>> >>>>>>>>>>>>> The user in that thread is using rsync, which relies on >>>>>> scp by default. >>>>>>>>>>>>> I believe this problem is similar, if not identical, to yours. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Please also provide your output of ( /usr/bin/limits -a ) >>>>>> for the server >>>>>>>>>>>> end and the client. >>>>>>>>>>>> >>>>>>>>>>>> I am not quite sure I agree with the need for ifconfig -a but some >>>>>>>>>>>> information about the networking driver your using for the >>>>>>>>>>>> interface >>>>>>>>>>>> would be helpful, uptime of the boxes. And configuration >>>> of the pool. >>>>>>>>>>>> e.g. ( zpool status -a ;zfs get all <poolname> ) You should >>>>>>>>>>>> probably >>>>>>>>>>>> prop this information up somewhere so you can reference by >>>>>> URL whenever >>>>>>>>>>>> needed. >>>>>>>>>>>> >>>>>>>>>>>> rsync(1) does not rely on scp(1) whatsoever but rsync(1) >>>>>> can be made to >>>>>>>>>>>> use ssh(1) instead of rsh(1) and I believe that is what Jeremy is >>>>>>>>>>>> stating here but correct me if I am wrong. It does use ssh(1) by >>>>>>>>>>>> default. >>>>>>>>>>>> >>>>>>>>>>>> Its a possiblity as well that if using tmpfs(5) or mdmfs(8) for >>>>>>>>>>>> /tmp >>>>>>>>>>>> type filesystems that rsync(1) may be just filling up your >>>>>> temp ram area >>>>>>>>>>>> and causing the connection abort which would be >>>>>>>>>>>> expected. ( df -h ) would >>>>>>>>>>>> help here. >>>>>>>>>>> >>>>>>>>>>> Hello, >>>>>>>>>>> >>>>>>>>>>> I'm not using tmpfs/mdmfs at all. The clients yesterday >>>>>>>>>>> were 3 different OSX computers (over gigabit). The FreeBSD >>>>>>>>>>> server has 12gb of ram and no bce adapter. 
For what it's >>>>>>>>>>> worth, the server is backed up remotely every night with >>>>>>>>>>> rsync (remote FreeBSD uses rsync to pull) to an offsite >>>>>>>>>>> (slow cable connection) FreeBSD computer, and I have not >>>>>>>>>>> seen any errors in the nightly rsync. >>>>>>>>>>> >>>>>>>>>>> Sorry for the omission of networking info, here's the >>>>>>>>>>> output of the requested commands and some that popped up >>>>>>>>>>> in the other thread: >>>>>>>>>>> >>>>>>>>>>> http://www.cap-press.com/misc/ >>>>>>>>>>> >>>>>>>>>>> In rc.conf: ifconfig_em1="inet 10.1.1.1 netmask 255.255.0.0" >>>>>>>>>>> >>>>>>>>>>> Scott >>>>>>> >>>>>>> Just to make it crystal clear to everyone: >>>>>>> >>>>>>> There is no correlation between this problem and use of ZFS. People are >>>>>>> attempting to correlate "cannot allocate memory" messages with "anything >>>>>>> on the system that uses memory". The VM is much more complex than that. >>>>>>> >>>>>>> Given the nature of this problem, it's much more likely the issue is >>>>>>> "somewhere" within a networking layer within FreeBSD, whether it be >>>>>>> driver-level or some sort of intermediary layer. >>>>>>> >>>>>>> Two people who have this issue in this thread are both using VirtualBox. >>>>>>> Can one, or both, of you remove VirtualBox from the configuration >>>>>>> entirely (kernel, etc. -- not sure what is required) and then see if the >>>>>>> issue goes away? >>>>>> >>>>>> On the machine in question I only can do it after hours so I will do >>>>>> it tonight. >>>>>> >>>>>> I was _successfully_ sending the file over the loopback interface using >>>>>> >>>>>> cat /zpool/temp/zimbra_oldroot.vdi | ssh localhost "cat > /dev/null" >>>>>> >>>>>> I did it, btw, with the IPv6 localhost address first (accidently), >>>>>> and then using IPv4. Both worked. 
>>>>>> >>>>>> It always fails if I am sending it through the bce(4) interface, >>>>>> even if my target is the VirtualBox bridged to the bce card (so it >>>>>> does not "leave" the computer physically). >>>>>> >>>>>> Below the uname -a, ifconfig -a, netstat -rn, pciconf -lv and >>>>>> kldstat output. >>>>>> >>>>>> I have another box where I do not see that problem. It copies files >>>>>> happily over the net using ssh. >>>>>> >>>>>> It is an an older HP ML 150 with 3GB RAM only but with a bge(4) >>>>>> driver instead. It runs the same last week's RELENG_8. I installed >>>>>> VirtualBox and enabled vboxnet (so it loads the kernel modules). But >>>>>> I do not run VirtualBox on it (because it hasn't enough RAM). >>>>>> >>>>>> Regards >>>>>> Peter >>>>>> >>>>>> DellT410one# uname -a >>>>>> FreeBSD DellT410one.vv.fda 8.2-STABLE FreeBSD 8.2-STABLE #1: Thu Jun >>>>>> 30 17:07:18 EST 2011 >>>>>> r...@dellt410one.vv.fda:/usr/obj/usr/src/sys/GENERIC amd64 >>>>>> DellT410one# ifconfig -a >>>>>> bce0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> >>>>>> metric 0 mtu 1500 >>>>>> >>>>>> options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE> >>>>>> ether 84:2b:2b:68:64:e4 >>>>>> inet 192.168.50.220 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.221 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.223 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.224 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.225 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.226 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.227 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> inet 192.168.50.219 netmask 0xffffff00 broadcast 192.168.50.255 >>>>>> media: Ethernet autoselect (1000baseT <full-duplex>) >>>>>> status: active >>>>>> bce1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 >>>>>> >>>>>> 
options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE> >>>>>> ether 84:2b:2b:68:64:e5 >>>>>> media: Ethernet autoselect >>>>>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384 >>>>>> options=3<RXCSUM,TXCSUM> >>>>>> inet6 fe80::1%lo0 prefixlen 64 scopeid 0xb >>>>>> inet6 ::1 prefixlen 128 >>>>>> inet 127.0.0.1 netmask 0xff000000 >>>>>> nd6 options=3<PERFORMNUD,ACCEPT_RTADV> >>>>>> vboxnet0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500 >>>>>> ether 0a:00:27:00:00:00 >>>>>> DellT410one# netstat -rn >>>>>> Routing tables >>>>>> >>>>>> Internet: >>>>>> Destination Gateway Flags Refs Use Netif >>>>>> Expire >>>>>> default 192.168.50.201 UGS 0 52195 bce0 >>>>>> 127.0.0.1 link#11 UH 0 6 lo0 >>>>>> 192.168.50.0/24 link#1 U 0 1118212 bce0 >>>>>> 192.168.50.219 link#1 UHS 0 9670 lo0 >>>>>> 192.168.50.220 link#1 UHS 0 8347 lo0 >>>>>> 192.168.50.221 link#1 UHS 0 103024 lo0 >>>>>> 192.168.50.223 link#1 UHS 0 43614 lo0 >>>>>> 192.168.50.224 link#1 UHS 0 8358 lo0 >>>>>> 192.168.50.225 link#1 UHS 0 8438 lo0 >>>>>> 192.168.50.226 link#1 UHS 0 8338 lo0 >>>>>> 192.168.50.227 link#1 UHS 0 8333 lo0 >>>>>> 192.168.165.0/24 192.168.50.200 UGS 0 3311 bce0 >>>>>> 192.168.166.0/24 192.168.50.200 UGS 0 699 bce0 >>>>>> 192.168.167.0/24 192.168.50.200 UGS 0 3012 bce0 >>>>>> 192.168.168.0/24 192.168.50.200 UGS 0 552 bce0 >>>>>> >>>>>> Internet6: >>>>>> Destination Gateway >>>>>> Flags Netif Expire >>>>>> ::1 ::1 UH >>>>>> lo0 >>>>>> fe80::%lo0/64 link#11 U >>>>>> lo0 >>>>>> fe80::1%lo0 link#11 UHS >>>>>> lo0 >>>>>> ff01::%lo0/32 fe80::1%lo0 U >>>>>> lo0 >>>>>> ff02::%lo0/32 fe80::1%lo0 U >>>>>> lo0 >>>>>> DellT410one# kldstat >>>>>> Id Refs Address Size Name >>>>>> 1 19 0xffffffff80100000 dbf5d0 kernel >>>>>> 2 3 0xffffffff80ec0000 4c358 vboxdrv.ko >>>>>> 3 1 0xffffffff81012000 131998 zfs.ko >>>>>> 4 1 0xffffffff81144000 1ff1 opensolaris.ko >>>>>> 5 2 0xffffffff81146000 2940 vboxnetflt.ko >>>>>> 6 2 0xffffffff81149000 
8e38 netgraph.ko >>>>>> 7 1 0xffffffff81152000 153c ng_ether.ko >>>>>> 8 1 0xffffffff81154000 e70 vboxnetadp.ko >>>>>> DellT410one# pciconf -lv >>>>>> .. >>>>>> bce0@pci0:1:0:0: class=0x020000 card=0x028d1028 >>>>>> chip=0x163b14e4 rev=0x20 hdr=0x00 >>>>>> vendor = 'Broadcom Corporation' >>>>>> class = network >>>>>> subclass = ethernet >>>>>> bce1@pci0:1:0:1: class=0x020000 card=0x028d1028 >>>>>> chip=0x163b14e4 rev=0x20 hdr=0x00 >>>>>> vendor = 'Broadcom Corporation' >>>>>> class = network >>>>>> subclass = ethernet >>>>> >>>>> Could you please provide "pciconf -lvcb" output instead, specific to the >>>>> bce chips? Thanks. >>>> >>>> Her it is: >>>> >>>> bce0@pci0:1:0:0: class=0x020000 card=0x028d1028 >>>> chip=0x163b14e4 rev=0x20 hdr=0x00 >>>> vendor = 'Broadcom Corporation' >>>> class = network >>>> subclass = ethernet >>>> bar [10] = type Memory, range 64, base 0xda000000, size >>>> 33554432, enabled >>>> cap 01[48] = powerspec 3 supports D0 D3 current D0 >>>> cap 03[50] = VPD >>>> cap 05[58] = MSI supports 16 messages, 64 bit enabled with 1 message >>>> cap 11[a0] = MSI-X supports 9 messages in map 0x10 >>>> cap 10[ac] = PCI-Express 2 endpoint max data 256(512) link x4(x4) >>>> ecap 0003[100] = Serial 1 842b2bfffe6864e4 >>>> ecap 0001[110] = AER 1 0 fatal 0 non-fatal 1 corrected >>>> ecap 0004[150] = unknown 1 >>>> ecap 0002[160] = VC 1 max VC0 >>> >>> Thanks Peter. >>> >>> Adding Yong-Hyeon and David to the discussion, since they've both worked >>> on the bce(4) driver in recent months (most of the changes made recently >>> are only in HEAD), and also adding Jack Vogel of Intel who maintains >>> em(4). 
Brief history for the devs: >>> >>> The issue is described "Network memory allocation failures" and was >>> reported last year, but two users recently (Scott and Peter) have >>> reported the issue again: >>> >>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/thread.html#58708 >>> >>> And was mentioned again by Scott here, which also contains some >>> technical details: >>> >>> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063172.html >>> >>> What's interesting is that Scott's issue is identical in form but he's >>> using em(4), which isn't known to behave like this. Both individuals >>> are using VirtualBox, though we're not sure at this point if that is the >>> piece which is causing the anomaly. >>> >>> Relevant details of Scott's system (em-based): >>> >>> http://www.cap-press.com/misc/ >>> >>> Relevant details of Peter's system (bce-based): >>> >>> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063221.html >>> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063223.html >>> >>> I think the biggest complexity right now is figuring out how/why scp >>> fails intermittently in this nature. The errno probably "trickles down" >>> to userland from the kernel, but the condition regarding why it happens >>> is unknown. >> >> BTW: I also saw 2 of the errors coming from a BIND9 running in a >> jail on that box. 
>> >> DellT410one# fgrep -i allocate /jails/bind/20110315/var/log/messages >> Apr 13 05:17:41 bind named[23534]: internal_send: >> 192.168.50.145#65176: Cannot allocate memory >> Jun 21 23:30:44 bind named[39864]: internal_send: >> 192.168.50.251#36155: Cannot allocate memory >> Jun 24 15:28:00 bind named[39864]: internal_send: >> 192.168.50.251#28651: Cannot allocate memory >> Jun 28 12:57:52 bind named[2462]: internal_send: >> 192.168.165.154#1201: Cannot allocate memory >> >> My initial guess: it happens sooner or later somehow - whether it is >> a lot of traffic in one go (ssh/scp copies of virtual disks) or a >> lot of traffic over a longer period (a nameserver gets asked again >> and again). > > Scott, are you also using jails? If both of you are: is there any > possibility you can remove use of those? I'm not sure how VirtualBox > fits into the picture (jails + VirtualBox that is), but I can imagine > jails having different environmental constraints that might cause this. > > Basically the troubleshooting process here is to remove pieces of the > puzzle until you figure out which piece is causing the issue. I don't > want to get the NIC driver devs all spun up for something that, for > example, might be an issue with the jail implementation.
No jails here. I do have one bind error message in all my logs:

daemon:Jun 20 10:52:28 xeon named[399]: internal_send: 10.1.2.95#51946: Cannot allocate memory

Grepping my logs for "allocate" turned up a handful of memory allocation errors with netatalk too.

afpd.log:Jul 01 16:13:04.828835 afpd[18303] {dsi_stream.c:427} (E:DSI): dsi_stream_send: Cannot allocate memory
afpd.log:Jun 23 13:34:01.000987 afpd[17970] {fork.c:980} (E:AFPDaemon): afp_read(final file.pdf): Cannot allocate memory

And a handful from samba:

[2011/07/05 23:43:22.483224, 0] lib/util_sock.c:675(write_data)
  write_data: write failure in writing to client 10.1.1.10. Error Cannot allocate memory
[2011/07/05 23:43:22.493839, 0] smbd/process.c:79(srv_send_smb)
  Error writing 51 bytes to client. -1. (Cannot allocate memory)

I haven't personally seen any errors on the client side with samba/netatalk (and when scp was failing regularly I transferred the same files over netatalk+samba without error), nor have I had any reports of problems, but I guess there's a good chance all these log messages are related.

I've been trying to trigger the scp failure remotely tonight with no luck. I was triggering it regularly during the work day today, but not tonight. I will try to experiment tomorrow during the day with stopping VirtualBox and removing the kernel modules and seeing what happens.

Scott
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
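
For anyone who wants to repeat the log search above on their own box, here is a minimal sketch that counts "Cannot allocate memory" lines per daemon. It is only an illustration: the function name count_enomem, the DAEMON_RE pattern, and the SAMPLE list are assumptions made up for this example (the sample lines are copied from messages quoted in this thread), and real log formats may need a different pattern. Lines with no "name[pid]" token, such as the samba ones, are counted under "unknown".

```python
import re
from collections import Counter

# Matches syslog-style "name[pid]" tokens such as "named[399]" or "afpd[18303]".
# This pattern is an assumption for illustration; adjust it to your log format.
DAEMON_RE = re.compile(r"\b([A-Za-z_][\w.-]*)\[\d+\]")

NEEDLE = "Cannot allocate memory"


def count_enomem(lines):
    """Count lines containing the ENOMEM message, grouped by daemon name."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = DAEMON_RE.search(line)
        # Lines without a "name[pid]" token are grouped as "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Sample lines taken from the messages quoted in this thread.
SAMPLE = [
    "Jul 1 14:54:42 freebsd sshd[18955]: fatal: Write failed: Cannot allocate memory",
    "Jun 20 10:52:28 xeon named[399]: internal_send: 10.1.2.95#51946: Cannot allocate memory",
    "Jun 28 12:57:52 bind named[2462]: internal_send: 192.168.165.154#1201: Cannot allocate memory",
    "Jul 01 16:13:04.828835 afpd[18303] {dsi_stream.c:427} (E:DSI): dsi_stream_send: Cannot allocate memory",
    "Jun 23 13:34:01.000987 afpd[17970] {fork.c:980} (E:AFPDaemon): afp_read(final file.pdf): Cannot allocate memory",
    "write_data: write failure in writing to client 10.1.1.10. Error Cannot allocate memory",
    "Jul 1 14:54:43 freebsd sshd[18955]: unrelated line without the error",
]

if __name__ == "__main__":
    for daemon, hits in sorted(count_enomem(SAMPLE).items()):
        print(f"{daemon}: {hits}")
```

To run it against real logs, pass an open file (or several chained together) to count_enomem instead of SAMPLE. A per-daemon count like this might make it easier to see whether the failures cluster around one service or show up across everything that writes to the network.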