ZFS, SSDs, and TRIM performance
Me again. I have a new issue and I'm not sure whether it is hardware or software. I have nine servers running 10.2-RELEASE-p5 with Dell-OEM'd Samsung XS1715 NVMe SSDs, paired up in a single mirrored zpool on each server. They perform great most of the time. However, I have a problem when ZFS fires off TRIMs. Not during vdev creation, but when I destroy a 20GB snapshot or delete large files: ZFS fires off tons of TRIMs to the disks, and I can see the kstat.zfs.misc.zio_trim.success and kstat.zfs.misc.zio_trim.bytes sysctls skyrocket.

While this is happening, synchronous writes seem to block. For example, we're running PostgreSQL, which does fsync()s all the time. While these TRIMs happen, Postgres just hangs on writes, which causes reads to block as well due to lock contention. If I set sync=disabled on my tank/pgsql dataset while this is happening, it unblocks for the most part. But obviously that is not an ideal way to run PostgreSQL.

I'm working with my vendor to get some Intel SSDs to test, but any ideas whether this could somehow be a software issue? Or does the Samsung XS1715 just suck at TRIM and SYNC? We're thinking of just setting the vfs.zfs.trim.enabled=0 tunable for now, since WAL segment turnover causes frequent TRIM operations, but unfortunately changing that tunable requires a reboot. Disabling TRIM does seem to fix the issue on other servers I've tested with the same hardware config.

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
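For anyone who wants to watch this happen, here is a minimal sketch of a TRIM-rate monitor built on the kstat sysctls mentioned above (the helper names `rate` and `trim_rate` are mine, not anything standard; assumes the FreeBSD 10.x kstat names from the post):

```shell
#!/bin/sh
# rate: per-second delta between two counter samples (pure arithmetic).
#   $1 = first sample, $2 = second sample, $3 = interval in seconds
rate() {
    echo $(( ($2 - $1) / $3 ))
}

# trim_rate: sample the ZFS TRIM kstats over an interval and print
# TRIM operations/s and MB/s.  Only meaningful on FreeBSD with
# vfs.zfs.trim.enabled=1; run it while destroying a snapshot.
trim_rate() {
    interval=${1:-5}
    ops0=$(sysctl -n kstat.zfs.misc.zio_trim.success)
    bytes0=$(sysctl -n kstat.zfs.misc.zio_trim.bytes)
    sleep "$interval"
    ops1=$(sysctl -n kstat.zfs.misc.zio_trim.success)
    bytes1=$(sysctl -n kstat.zfs.misc.zio_trim.bytes)
    echo "TRIM: $(rate "$ops0" "$ops1" "$interval") ops/s," \
         "$(( $(rate "$bytes0" "$bytes1" "$interval") / 1048576 )) MB/s"
}
```

Running `trim_rate 5` in a loop alongside the Postgres hang would show whether the stalls line up with TRIM bursts.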
Dell NVMe issues
Back in May, I posted about issues I was having with a Dell PE R630 with 4x800GB NVMe SSDs. I would get kernel panics due to the inability to assign all the interrupts because of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321. Jim Harris helped fix this issue, so I bought several more of these servers, including ones with 4x1.6TB drives.

While the new servers with 4x800GB drives still work, the ones with 4x1.6TB drives do not. When I do a

    zpool create tank mirror nvd0 nvd1 mirror nvd2 nvd3

the command never returns and the kernel logs:

    nvme0: resetting controller
    nvme0: controller ready did not become 0 within 2000 ms

I've tried several different things trying to understand where the actual problem is:

WORKS: dd if=/dev/nvd0 of=/dev/null bs=1m
WORKS: dd if=/dev/zero of=/dev/nvd0 bs=1m
WORKS: newfs /dev/nvd0
FAILS: zpool create tank mirror nvd[01]
FAILS: gpart add -t freebsd-zfs nvd[01] && zpool create tank mirror nvd[01]p1
FAILS: gpart add -t freebsd-zfs -s 1400g nvd[01] && zpool create tank nvd[01]p1
WORKS: gpart add -t freebsd-zfs -s 800g nvd[01] && zpool create tank nvd[01]p1

NOTE: The above commands are more about getting the point across than validity. I wiped the disks clean between gpart attempts and used GPT.

So it seems like zpool works as long as I don't cross past ~800GB, while other things like dd and newfs work regardless. When I get the kernel messages about the controller resetting and then not responding, the NVMe subsystem hangs entirely. Since my boot disks are not NVMe, the system continues to work, but no more NVMe operations are possible. Further, attempting to reboot hangs and I have to power cycle. Any thoughts on what the deal may be here?
10.2-RELEASE-p5

nvme0@pci0:132:0:0: class=0x010802 card=0x1f971028 chip=0xa820144d rev=0x03 hdr=0x00
    vendor   = 'Samsung Electronics Co Ltd'
    class    = mass storage
    subclass = NVM

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org
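The manual size march above could be automated as a bisection to pin down the exact threshold. A sketch of that idea, with the caveat that it is DESTRUCTIVE to the named disks and that `try_size`/`bisect` are hypothetical glue of mine around the real gpart/zpool commands:

```shell
#!/bin/sh
# mid: integer midpoint of two sizes in GB (pure arithmetic helper).
mid() {
    echo $(( ($1 + $2) / 2 ))
}

# try_size: wipe both disks, create a GPT partition of the given size
# on each, and see whether 'zpool create' survives.  DESTROYS DATA.
try_size() {
    size_gb=$1
    gpart destroy -F nvd0 2>/dev/null; gpart destroy -F nvd1 2>/dev/null
    gpart create -s gpt nvd0 && gpart create -s gpt nvd1
    gpart add -t freebsd-zfs -s "${size_gb}g" nvd0
    gpart add -t freebsd-zfs -s "${size_gb}g" nvd1
    zpool create -f tank mirror nvd0p1 nvd1p1 && zpool destroy tank
}

# bisect: narrow the working/failing boundary to within 10 GB,
# starting from a known-good size ($1) and a known-bad size ($2).
bisect() {
    lo=$1 hi=$2
    while [ $(( hi - lo )) -gt 10 ]; do
        m=$(mid "$lo" "$hi")
        if try_size "$m"; then lo=$m; else hi=$m; fi
    done
    echo "threshold between ${lo}g and ${hi}g"
}
```

With `bisect 800 1600` the boundary would surface in a handful of iterations, assuming the hang is reproducible and the pool can be destroyed between attempts.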
Re: Dell NVMe issues
> On Oct 6, 2015, at 10:29 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
>
> On Tue, Oct 06, 2015 at 10:18:11AM -0500, Sean Kelly wrote:
>
>> Back in May, I posted about issues I was having with a Dell PE R630 with
>> 4x800GB NVMe SSDs. I would get kernel panics due to the inability to
>> assign all the interrupts because of
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321
>> [...]
>> NOTE: The above commands are more about getting the point across, not
>> validity. I wiped the disk clean between gpart attempts and used GPT.
>
> Just for purity of the experiment: do you try zpool on raw disk, w/o
> GPT? I.e. zpool create tank mirror nvd0 nvd1

Yes, that was actually what I tried first. I headed down the path of GPT because it allowed me a way to restrict how much of the disk zpool touched. zpool on the bare NVMe disks also triggers the issue.
Re: Dell NVMe issues
> On Oct 6, 2015, at 11:06 AM, Eric van Gyzen wrote:
>
> Try this:
>
>    sysctl vfs.zfs.vdev.trim_on_init=0
>    zpool create tank mirror nvd[01]

That worked. So my guess is the controller/FreeBSD is timing out while zpool asks the drive to TRIM all 1.6TB?
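For anyone hitting the same hang, the workaround as a command transcript (a sketch; the step restoring the default afterwards is my addition, on the assumption you still want TRIM-on-init behavior for future pools):

```shell
# Skip the whole-device TRIM that zpool issues during vdev initialization;
# on these 1.6TB drives the full-device deallocate appears to outlast the
# nvme controller-reset timeout and wedges the NVMe subsystem.
sysctl vfs.zfs.vdev.trim_on_init=0
zpool create tank mirror nvd0 nvd1 mirror nvd2 nvd3
# Optionally restore the default once the pool exists.
sysctl vfs.zfs.vdev.trim_on_init=1
```

Note this only avoids the creation-time TRIM; steady-state TRIM (vfs.zfs.trim.enabled) is a separate, reboot-only tunable.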
Re: 10.1 NVMe kernel panic
Jim,

Thanks for the reply. I set hw.nvme.force_intx=1 and get a new form of kernel panic:
http://smkelly.org/stuff/nvme_crash_force_intx.txt

It looks like the NVMe devices are just failing to initialize at all now. As long as that tunable is in the kenv, I get this behavior. If I kldload the drivers after boot, the init fails as well. But if I kldunload, kenv -u, kldload, it then works again; the only difference is that kldload doesn't result in a panic, just timeouts initializing them all.

I also compiled and tried stable/10 and it crashed in a similar way, but I've not captured the panic yet. It crashes even without the tunable in place. I'll see if I can capture it.

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org

> On Jun 2, 2015, at 6:10 PM, Jim Harris <jim.har...@gmail.com> wrote:
>
> On Thu, May 21, 2015 at 8:33 AM, Sean Kelly <smke...@smkelly.org> wrote:
>
>> Greetings. I have a Dell R630 server with four of Dell's 800GB NVMe SSDs
>> running FreeBSD 10.1-p10. [...]
>
> Hi Sean,
>
> Can you try adding hw.nvme.force_intx=1 to /boot/loader.conf?
> I suspect you are able to load the drivers successfully after boot because
> interrupt assignments are not restricted to CPU0 at that point; see
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321 for a related issue.
>
> Your logs clearly show that vectors were allocated for the first 2 NVMe
> SSDs, but the third could not get its full allocation. There is a bug in
> the INTx fallback code that needs to be fixed; you do not hit this bug when
> loading after boot because bug #199321 only affects interrupt allocation
> during boot.
>
> If the force_intx test works, would you be able to upgrade your nvme
> drivers to the latest on stable/10? There are several patches (one related
> to interrupt vector allocation) that have been pushed to stable/10 since
> 10.1 was released, and I will be pushing another patch for the issue you
> have reported shortly.
>
> Thanks,
>
> -Jim
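For reference, the configuration under test lives in /boot/loader.conf. A sketch of what that file looked like during this exchange (my reconstruction from the thread; the force_intx line is a diagnostic knob, not a fix, and should be removed once a patched driver is in place):

```shell
# /boot/loader.conf
nvme_load="YES"
nvd_load="YES"
# Diagnostic only: force legacy INTx instead of MSI-X to sidestep the
# boot-time interrupt-vector allocation bug (PR 199321).  On this box it
# produced a different panic, so treat it purely as a test knob.
hw.nvme.force_intx=1
```

Loader tunables like this take effect only at boot; the kldunload/kenv -u/kldload dance described above is what removing the tunable at runtime looks like.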
10.1 NVMe kernel panic
Greetings. I have a Dell R630 server with four of Dell's 800GB NVMe SSDs running FreeBSD 10.1-p10. According to the PCI vendor ID, they are some sort of rebranded Samsung drive.

If I boot the system and then load nvme.ko and nvd.ko from a command line, the drives show up okay. If I put nvme_load="YES" and nvd_load="YES" in /boot/loader.conf, the box panics on boot:

    panic: nexus_setup_intr: NULL irq resource!

If I boot the system with "Safe Mode: ON" from the loader menu, it also boots successfully and the drives show up. You can see a full 'boot -v' here:
http://smkelly.org/stuff/nvme-panic.txt

Anyone have any insight into what the issue may be here? Ideally I need to get this working in the next few days or return this thing to Dell.

Thanks!

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org
RE: RELENG_9 panic with PERC 6/i (mfi)
No, it remains an outstanding issue. We've begun moving services to a spare server to give us more time to investigate it.

From: Wiley, Glen [gwi...@verisign.com]
Sent: Wednesday, January 02, 2013 9:52 AM
To: Sean Kelly; Daniel Braniss
Cc: freebsd-stable@freebsd.org
Subject: Re: RELENG_9 panic with PERC 6/i (mfi)

Did you guys end up identifying the cause of that panic?

--
Glen Wiley
Systems Architect
Verisign Inc.

On 12/23/12 12:56 PM, Sean Kelly <smke...@flightaware.com> wrote:

> Greetings. All I have to do to panic it is boot it. As you can see from
> the dump, it died after about 30 seconds without me doing anything. [...]
RE: RELENG_9 panic with PERC 6/i (mfi)
Greetings. All I have to do to panic it is boot it. As you can see from the dump, it died after about 30 seconds without me doing anything. I can't provide those sysctl values easily, as it panics too quickly. I suppose I can convince it to drop to DDB and pick them out if that would be helpful. Here they are from the working 8.2-R kernel:

vm.kmem_map_free: 49870348288
vm.kmem_map_size: 68964352

This box, unlike most of our others, isn't even utilizing ZFS.

root@papa:~# gpart show
=>         63  1141899192  mfid0  MBR  (545G)
           63  1141884072      1  freebsd  [active]  (544G)
   1141884135       15120         - free -  (7.4M)

=>          0  1141884072  mfid0s1  BSD  (544G)
            0     8388608        1  freebsd-ufs   (4.0G)
      8388608    16777216        4  freebsd-ufs   (8.0G)
     25165824    33554432        5  freebsd-ufs   (16G)
     58720256    67108864        2  freebsd-swap  (32G)
    125829120    67108864        7  freebsd-swap  (32G)
    192937984    67108864        8  freebsd-swap  (32G)
    260046848   881837224        6  freebsd-ufs   (420G)

From: Daniel Braniss [da...@cs.huji.ac.il]
Sent: Sunday, December 23, 2012 1:43 AM
To: Sean Kelly
Subject: Re: RELENG_9 panic with PERC 6/i (mfi)

btw:

sysctl -a | grep kmem_map
vm.kmem_map_free: 8859570176
vm.kmem_map_size: 6037008384

danny
RELENG_9 panic with PERC 6/i (mfi)
Greetings. I have a Dell R710 with an mfi device (PERC 6/i Integrated) that panics almost immediately on FreeBSD 9. It works fine on FreeBSD 8.2-RELEASE, but I've now had it panic on FreeBSD 9.0-STABLE and 9.1-RELEASE. Output of 'mfiutil show adapter' and the panic backtrace are below. Anybody seen this or have any ideas?

# mfiutil show adapter
mfi0 Adapter:
     Product Name: PERC 6/i Integrated
    Serial Number: redacted
         Firmware: 6.3.1-0003
      RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
   Battery Backup: present
            NVRAM: 32K
   Onboard Memory: 256M
   Minimum Stripe: 8K
   Maximum Stripe: 1M

# kgdb -n 5
panic: kmem_malloc(-8192): kmem_map too small: 82677760 total allocated
cpuid = 2
KDB: stack backtrace:
#0 0x809208a6 at kdb_backtrace+0x66
#1 0x808ea8be at panic+0x1ce
#2 0x80b44930 at vm_map_locked+0
#3 0x80b3b41a at uma_large_malloc+0x4a
#4 0x808d5a69 at malloc+0xd9
#5 0x805b2985 at mfi_user_command+0x35
#6 0x805b2f2d at mfi_ioctl+0x2fd
#7 0x807db28b at devfs_ioctl_f+0x7b
#8 0x80932325 at kern_ioctl+0x115
#9 0x8093255d at sys_ioctl+0xfd
#10 0x80bd7ae6 at amd64_syscall+0x546
#11 0x80bc3447 at Xfast_syscall+0xf7
Uptime: 35s
Dumping 2032 out of 49122 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

(kgdb) list *0x805b2985
0x805b2985 is in mfi_user_command (/usr/src/sys/dev/mfi/mfi.c:2836).
2831            int error = 0, locked;
2832
2833
2834            if (ioc->buf_size > 0) {
2835                    ioc_buf = malloc(ioc->buf_size, M_MFIBUF, M_WAITOK);
2836                    if (ioc_buf == NULL) {
2837                            return (ENOMEM);
2838                    }
2839                    error = copyin(ioc->buf, ioc_buf, ioc->buf_size);
2840                    if (error) {
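Since the panic is kmem_map exhaustion, one quick way to compare boxes in this thread is the fraction of the kernel kmem map already allocated. A small sketch using the two sysctls quoted earlier (the helper name `kmem_pct` is mine):

```shell
#!/bin/sh
# kmem_pct: percent of the kernel kmem map currently allocated, given
#   $1 = vm.kmem_map_size (allocated bytes)
#   $2 = vm.kmem_map_free (free bytes)
# Integer percentage; relies on 64-bit shell arithmetic.
kmem_pct() {
    size=$1 free=$2
    echo $(( size * 100 / (size + free) ))
}

# Live usage on a FreeBSD box (not run here):
#   kmem_pct "$(sysctl -n vm.kmem_map_size)" "$(sysctl -n vm.kmem_map_free)"
```

Plugging in the values from the thread: the healthy 8.2-R box sits near 0% used, while danny's box is around 40%, which is the kind of gap worth watching when a driver path can malloc an attacker-of-itself-sized buffer as in mfi_user_command above.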
Re: REMINDER: 4.2 code freeze starts tomorrow!
On Wed, Nov 01, 2000 at 10:52:14AM -0800, Gordon Tetlow wrote:
> Hello there...
>
> On Wed, 1 Nov 2000, Vivek Khera wrote:
>
>> There's one "bad" default that might like to get changed. That is the
>> time that cron runs the daily scripts. The current setting in
>> /etc/crontab is 1:59 in the morning. Well, last Sunday that time
>> occurred twice as we switched from daylight to standard time. The
>> times between 1am and 3am should be avoided for any system cron jobs
>> just because of this problem.
>
> From what I recall (off the top of my head no less), the time change in
> the fall occurs at 3am [ECMP]DT and jumps back to 2am [ECMP]ST. In the
> spring, at 2am it jumps to 3am. 1:59am will reliably occur once every
> day of the year. At least that is how it is done in the US.
>
> -gordon

That is incorrect; the US does not operate like this. The fall-back and the jump forward both occur at 2AM local time. When we shift back an hour, we go from 1:59AM to 1:00AM. When we jump forward an hour, we go from 1:59AM to 3:00AM. There is never a second 3AM; instead, the 1AM hour can occur twice in the fall, and the 2AM hour is skipped entirely in the spring.

From /usr/src/share/zoneinfo/northamerica:

# Rule NAME  FROM  TO   TYPE  IN   ON       AT    SAVE  LETTER/S
Rule   US    1967  max  -     Oct  lastSun  2:00  0     S
Rule   US    1987  max  -     Apr  Sun>=1   2:00  1:00  D

--
Sean Kelly [EMAIL PROTECTED] or [EMAIL PROTECTED]
PGP KeyID: 4AC781C7  http://www.sean-kelly.org

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-stable" in the body of the message
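The repeated 1AM hour is easy to demonstrate from the tz database. A quick illustration I added (assumes GNU date and installed tzdata; the epoch values are for the US fall-back on 2000-10-29, the Sunday Vivek mentions, when 06:00 UTC mapped back to 01:00 EST):

```shell
#!/bin/sh
# US fall-back on 2000-10-29: DST ended at 02:00 EDT (06:00 UTC), so
# local clocks passed through the 01:00 hour twice.
TZ=America/New_York date -d @972795600 '+%H:%M %Z'   # 05:00 UTC, first pass
TZ=America/New_York date -d @972799200 '+%H:%M %Z'   # 06:00 UTC, second pass
```

The first command reports the 1AM hour in EDT and the second reports it again in EST, which is exactly why a 1:59am cron entry fired twice that morning.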