ZFS, SSDs, and TRIM performance

2015-10-29 Thread Sean Kelly
Me again. I have a new issue and I’m not sure if it is hardware or software. I 
have nine servers running 10.2-RELEASE-p5 with Dell OEM’d Samsung XS1715 NVMe 
SSDs. They are paired up in a single mirrored zpool on each server. They 
perform great most of the time. However, I have a problem when ZFS fires off 
TRIMs: not during vdev creation, but when, for example, I destroy a 20GB snapshot.

If I destroy a 20GB snapshot or delete large files, ZFS fires off tons of TRIMs 
to the disks. I can see the kstat.zfs.misc.zio_trim.success and 
kstat.zfs.misc.zio_trim.bytes sysctls skyrocket. While this is happening, any 
synchronous writes seem to block. For example, we’re running PostgreSQL which 
does fsync()s all the time. While these TRIMs happen, Postgres just hangs on 
writes. This causes reads to block due to lock contention as well.
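
For anyone wanting to reproduce this, a minimal sh loop to watch those two 
kstats spike while destroying a snapshot (nothing here is specific to my setup):

    while :; do
        sysctl kstat.zfs.misc.zio_trim.success kstat.zfs.misc.zio_trim.bytes
        sleep 1
    done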

If I set sync=disabled on my tank/pgsql dataset while this is happening, things 
mostly unblock. But obviously this is not an ideal way to run PostgreSQL.
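
For reference, the toggle is just the standard zfs(8) property; sync=standard 
is the default to return to afterwards:

    zfs set sync=disabled tank/pgsql     # emergency relief while the TRIMs drain
    zfs set sync=standard tank/pgsql     # back to safe synchronous behavior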

I’m working with my vendor to get some Intel SSDs to test, but any ideas if 
this could somehow be a software issue? Or does the Samsung XS1715 just suck at 
TRIM and SYNC?

We’re thinking of just setting the vfs.zfs.trim.enabled=0 tunable for now, since 
WAL segment turnover generates a lot of TRIM operations, but unfortunately 
changing that tunable requires a reboot. Disabling TRIM does seem to fix the 
issue on other servers I’ve tested with the same hardware config.
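
If we do go that route, it is a one-line change (a sketch; on 10.x this is a 
boot-time tunable, which is why it costs a reboot):

    # /boot/loader.conf
    vfs.zfs.trim.enabled=0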

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org


Dell NVMe issues

2015-10-06 Thread Sean Kelly
Back in May, I posted about issues I was having with a Dell PE R630 with 
4x800GB NVMe SSDs. I would get kernel panics due to the inability to assign all 
the interrupts because of 
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321. Jim Harris helped 
fix this issue, so I bought several more of these servers, including ones with 
4x1.6TB drives…

While the new servers with 4x800GB drives still work, the ones with 4x1.6TB 
drives do not. When I do a
zpool create tank mirror nvd0 nvd1 mirror nvd2 nvd3
the command never returns and the kernel logs:
nvme0: resetting controller
nvme0: controller ready did not become 0 within 2000 ms
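
While the create is wedged, the controller can be poked from another shell with 
nvmecontrol from the base system (just a way to observe the hang, not a fix):

    nvmecontrol devlist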

I’ve tried several different things trying to understand where the actual 
problem is.
WORKS: dd if=/dev/nvd0 of=/dev/null bs=1m
WORKS: dd if=/dev/zero of=/dev/nvd0 bs=1m
WORKS: newfs /dev/nvd0
FAILS: zpool create tank mirror nvd[01]
FAILS: gpart add -t freebsd-zfs nvd[01] && zpool create tank mirror nvd[01]p1
FAILS: gpart add -t freebsd-zfs -s 1400g nvd[01] && zpool create tank nvd[01]p1
WORKS: gpart add -t freebsd-zfs -s 800g nvd[01] && zpool create tank nvd[01]p1

NOTE: The above commands are meant to get the point across rather than be 
exact. I wiped the disks clean between gpart attempts and used GPT.
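
The wipe between attempts was roughly this (a sketch, assuming the GPT scheme 
used above; -F forces destruction of a non-empty partition table):

    gpart destroy -F nvd0 && gpart create -s gpt nvd0
    gpart destroy -F nvd1 && gpart create -s gpt nvd1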

So it seems like zpool create works only as long as it stays below ~800GB, 
while dd and newfs work across the whole drive.

When I get the kernel messages about the controller resetting and then not 
responding, the NVMe subsystem hangs entirely. Since my boot disks are not 
NVMe, the system continues to work but no more NVMe stuff can be done. Further, 
attempting to reboot hangs and I have to do a power cycle.

Any thoughts on what the deal may be here?

10.2-RELEASE-p5

nvme0@pci0:132:0:0: class=0x010802 card=0x1f971028 chip=0xa820144d rev=0x03 hdr=0x00
    vendor     = 'Samsung Electronics Co Ltd'
    class      = mass storage
    subclass   = NVM

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org


Re: Dell NVMe issues

2015-10-06 Thread Sean Kelly


> On Oct 6, 2015, at 10:29 AM, Slawa Olhovchenkov <s...@zxy.spb.ru> wrote:
> 
> On Tue, Oct 06, 2015 at 10:18:11AM -0500, Sean Kelly wrote:
> 
>> [...]
>> 
>> NOTE: The above commands are meant to get the point across rather than be
>> exact. I wiped the disks clean between gpart attempts and used GPT.
> 
> Just for the purity of the experiment: did you try zpool on the raw disks,
> without GPT? I.e., zpool create tank mirror nvd0 nvd1
> 

Yes, that was actually what I tried first. I headed down the path of GPT 
because it allowed me a way to restrict how much disk zpool touched. zpool on 
the bare NVMe disks also triggers the issue.



Re: Dell NVMe issues

2015-10-06 Thread Sean Kelly

> On Oct 6, 2015, at 11:06 AM, Eric van Gyzen wrote:
> 
> Try this:
> 
>    sysctl vfs.zfs.vdev.trim_on_init=0
>    zpool create tank mirror nvd[01]
> 

That worked. So my guess is that the controller (or FreeBSD) is timing out 
while zpool asks the drive to TRIM all 1.6TB?
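
To make that stick across reboots, presumably one line in /etc/sysctl.conf is 
enough, since vfs.zfs.vdev.trim_on_init is a runtime sysctl rather than a 
loader tunable:

    # /etc/sysctl.conf
    vfs.zfs.vdev.trim_on_init=0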


Re: 10.1 NVMe kernel panic

2015-06-02 Thread Sean Kelly
Jim,

Thanks for the reply. I set hw.nvme.force_intx=1 and get a new form of kernel 
panic:
http://smkelly.org/stuff/nvme_crash_force_intx.txt
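
For completeness, the loader.conf under test combines the module loads from my 
original report with Jim's tunable:

    nvme_load="YES"
    nvd_load="YES"
    hw.nvme.force_intx=1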

It looks like the NVMes are just failing to initialize at all now. As long as 
that tunable is in the kenv, I get this behavior. If I kldload them after boot, 
the init fails as well. But if I kldunload, kenv -u, kldload, it then works 
again. The only difference is that kldload doesn’t result in a panic, just 
timeouts while initializing them all.
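
Spelled out, the recovery sequence is (kldunload and kldload both accept 
multiple module names; nvd unloads first since it depends on nvme):

    kldunload nvd nvme            # unload in dependency order
    kenv -u hw.nvme.force_intx    # drop the tunable from the kernel environment
    kldload nvme nvd              # reload; initialization now succeeds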

I also compiled and tried stable/10 and it crashed in a similar way, but I’ve 
not captured the panic yet. It crashes even without the tunable in place. I’ll 
see if I can capture it.

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org

> On Jun 2, 2015, at 6:10 PM, Jim Harris <jim.har...@gmail.com> wrote:
> 
> On Thu, May 21, 2015 at 8:33 AM, Sean Kelly <smke...@smkelly.org> wrote:
>> [...]
>> 
>> Anyone have any insight into what the issue may be here? Ideally I need to
>> get this working in the next few days or return this thing to Dell.
> 
> Hi Sean,
> 
> Can you try adding hw.nvme.force_intx=1 to /boot/loader.conf?
> 
> I suspect you are able to load the drivers successfully after boot because
> interrupt assignments are not restricted to CPU0 at that point - see
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199321 for a related
> issue.  Your logs clearly show that vectors were allocated for the first 2
> NVMe SSDs, but the third could not get its full allocation.  There is a bug
> in the INTx fallback code that needs to be fixed - you do not hit this bug
> when loading after boot because bug #199321 only affects interrupt
> allocation during boot.
> 
> If the force_intx test works, would you be able to upgrade your nvme
> drivers to the latest on stable/10?  There are several patches (one related
> to interrupt vector allocation) that have been pushed to stable/10 since
> 10.1 was released, and I will be pushing another patch for the issue you
> have reported shortly.
> 
> Thanks,
> 
> -Jim


10.1 NVMe kernel panic

2015-05-21 Thread Sean Kelly
Greetings.

I have a Dell R630 server with four of Dell’s 800GB NVMe SSDs running FreeBSD 
10.1-p10. According to the PCI vendor, they are some sort of rebranded Samsung 
drive. If I boot the system and then load nvme.ko and nvd.ko from a command 
line, the drives show up okay. If I put
nvme_load="YES"
nvd_load="YES"
in /boot/loader.conf, the box panics on boot:
panic: nexus_setup_intr: NULL irq resource!

If I boot the system with “Safe Mode: ON” from the loader menu, it also boots 
successfully and the drives show up.
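
To be explicit, the manual load that works is nothing more than the same two 
modules by hand:

    kldload nvme
    kldload nvd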

You can see a full ‘boot -v’ here:
http://smkelly.org/stuff/nvme-panic.txt

Anyone have any insight into what the issue may be here? Ideally I need to get 
this working in the next few days or return this thing to Dell.

Thanks!

-- 
Sean Kelly
smke...@smkelly.org
http://smkelly.org


RE: RELENG_9 panic with PERC 6/i (mfi)

2013-01-02 Thread Sean Kelly
No, it remains an outstanding issue. We've begun moving services to a spare 
server to give us more time to investigate it.


From: Wiley, Glen [gwi...@verisign.com]
Sent: Wednesday, January 02, 2013 9:52 AM
To: Sean Kelly; Daniel Braniss
Cc: freebsd-stable@freebsd.org
Subject: Re: RELENG_9 panic with PERC 6/i (mfi)

Did you guys end up identifying the cause of that panic?

--
Glen Wiley
Systems Architect
Verisign Inc.




On 12/23/12 12:56 PM, Sean Kelly smke...@flightaware.com wrote:

[...]



RE: RELENG_9 panic with PERC 6/i (mfi)

2012-12-23 Thread Sean Kelly
Greetings.

All I have to do to panic it is boot it. As you can see from the dump, it died 
after about 30 seconds without me doing anything. I can't provide those sysctl 
values easily, as it panics too quickly. I suppose I can convince it to drop to 
DDB and pick them out if that would be helpful.

Here they are from the working 8.2-R kernel.
vm.kmem_map_free: 49870348288
vm.kmem_map_size: 68964352

This box, unlike most of our others, doesn't even utilize ZFS.
root@papa:~# gpart show
=>        63  1141899192  mfid0  MBR  (545G)
          63  1141884072      1  freebsd  [active]  (544G)
  1141884135       15120         - free -  (7.4M)

=>         0  1141884072  mfid0s1  BSD  (544G)
           0     8388608        1  freebsd-ufs   (4.0G)
     8388608    16777216        4  freebsd-ufs   (8.0G)
    25165824    33554432        5  freebsd-ufs    (16G)
    58720256    67108864        2  freebsd-swap   (32G)
   125829120    67108864        7  freebsd-swap   (32G)
   192937984    67108864        8  freebsd-swap   (32G)
   260046848   881837224        6  freebsd-ufs   (420G)


From: Daniel Braniss [da...@cs.huji.ac.il]
Sent: Sunday, December 23, 2012 1:43 AM
To: Sean Kelly
Subject: Re: RELENG_9 panic with PERC 6/i (mfi)

btw:
sysctl -a | grep kmem_map
vm.kmem_map_free: 8859570176
vm.kmem_map_size: 6037008384


danny




RELENG_9 panic with PERC 6/i (mfi)

2012-12-22 Thread Sean Kelly
Greetings.

I have a Dell R710 with a mfi device (PERC 6/i Integrated) that panics almost 
immediately on FreeBSD 9. It works fine on FreeBSD 8.2-RELEASE, but I've now 
had it panic in FreeBSD 9.0-STABLE and 9.1-RELEASE.

Output of mfiutil show adapter and panic backtrace below. Anybody seen this or 
have any ideas?

# mfiutil show adapter:
mfi0 Adapter:
    Product Name: PERC 6/i Integrated
   Serial Number: redacted
        Firmware: 6.3.1-0003
     RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
  Battery Backup: present
           NVRAM: 32K
  Onboard Memory: 256M
  Minimum Stripe: 8K
  Maximum Stripe: 1M

# kgdb -n 5
panic: kmem_malloc(-8192): kmem_map too small: 82677760 total allocated
cpuid = 2
KDB: stack backtrace:
#0 0x809208a6 at kdb_backtrace+0x66
#1 0x808ea8be at panic+0x1ce
#2 0x80b44930 at vm_map_locked+0
#3 0x80b3b41a at uma_large_malloc+0x4a
#4 0x808d5a69 at malloc+0xd9
#5 0x805b2985 at mfi_user_command+0x35
#6 0x805b2f2d at mfi_ioctl+0x2fd
#7 0x807db28b at devfs_ioctl_f+0x7b
#8 0x80932325 at kern_ioctl+0x115
#9 0x8093255d at sys_ioctl+0xfd
#10 0x80bd7ae6 at amd64_syscall+0x546
#11 0x80bc3447 at Xfast_syscall+0xf7
Uptime: 35s
Dumping 2032 out of 49122 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

(kgdb) list *0x805b2985
0x805b2985 is in mfi_user_command (/usr/src/sys/dev/mfi/mfi.c:2836).
2831            int error = 0, locked;
2832
2833
2834            if (ioc->buf_size > 0) {
2835                    ioc_buf = malloc(ioc->buf_size, M_MFIBUF, M_WAITOK);
2836                    if (ioc_buf == NULL) {
2837                            return (ENOMEM);
2838                    }
2839                    error = copyin(ioc->buf, ioc_buf, ioc->buf_size);
2840                    if (error) {



Re: REMINDER: 4.2 code freeze starts tomorrow!

2000-11-01 Thread Sean Kelly

On Wed, Nov 01, 2000 at 10:52:14AM -0800, Gordon Tetlow wrote:
> Hello there...
> 
> On Wed, 1 Nov 2000, Vivek Khera wrote:
> 
>> There's one "bad" default that might like to get changed.  That is the
>> time that cron runs the daily scripts.  The current setting in
>> /etc/crontab is 1:59 in the morning.  Well, last Sunday that time
>> occurred twice as we switched from daylight to standard time.  The
>> times between 1am and 3am should be avoided for any system cron jobs
>> just because of this problem.
> 
> From what I recall (off the top of my head no less) is that the time
> change in the fall occurs at 3am [ECMP]DT and jumps back to 2am [ECMP]ST.
> In the spring, at 2am it jumps to 3am. 1:59am will reliably occur once
> every day of the year. At least that is how it is done in the US.
> 
> -gordon

That is incorrect.  The US does not operate like this.  The fall-back and the
jump forward both occur at 2AM local time.  When we shift back an hour, we go
from 1:59AM to 1:00AM; when we jump forward an hour, we go from 1:59AM to
3:00AM.  So there are never two 3AMs, there can be two 1AMs and two 1:59AMs
(which is exactly the problem with the cron setting), and on the
spring-forward day 2AM never occurs at all.

/usr/src/share/zoneinfo/northamerica:
# Rule  NAME    FROM    TO      TYPE    IN      ON      AT      SAVE    LETTER/S
Rule    US      1967    max     -       Oct     lastSun 2:00    0       S
Rule    US      1987    max     -       Apr     Sun>=1  2:00    1:00    D
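
An easy way to check the resulting transitions empirically is zdump from the
base system (the zone name here is just an example; use your own):

    zdump -v America/Chicago | grep 2000

Each transition prints as a pair of instants one second apart, showing the
April jump from 1:59:59 to 3:00:00 and the October fall-back from 1:59:59
to 1:00:00.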

-- 
Sean Kelly [EMAIL PROTECTED] or [EMAIL PROTECTED]
   PGP KeyID: 4AC781C7    http://www.sean-kelly.org

