Re: Terrible disk performance with LSI / FreeBSD 9.2-RC1
To follow up on this issue, at one point the stats were down to this:

                        extended device statistics
device       r/s      w/s     kr/s     kw/s  qlen   svc_t   %b
da0          0.0      0.0      0.0      0.0     0     0.0    0
da1          0.0      0.0      0.0      0.0     0     0.0    0
da2        127.9      0.0    202.3      0.0     1    47.5  100
da3        125.9      0.0    189.3      0.0     1    43.1   97
da4        127.9      0.0    189.8      0.0     1    45.8  100
da5        128.9      0.0    206.3      0.0     0    42.5   99
da6        127.9      0.0    202.3      0.0     1    46.2   98
da7          0.0    249.7      0.0    334.2    10    39.5  100

At some point I figured out that ~125 random IOPS is pretty much the limit for a 7200 RPM SATA drive. So what we're mostly looking at here is a raidz2 resilver hitting its pathological worst case. Lesson learned: raidz2 is just not really viable without some kind of sorting of the resilver operations. I wish I understood ZFS well enough to do something about that, but research suggests the problem is non-trivial. :(

There also seems to be a separate ZFS issue related to having a very large number of snapshots (e.g. hourly for several months on a couple of filesystems). Some combination of the OS updates we've been doing while trying to get this machine to 9.2-RC1 and deleting a ton of snapshots seems to have cleared that up. It would be nice to know which it was; I guess we'll find out in a few months.

So the combination of these two issues seems to be mostly what is/was plaguing us.

Thanks!
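For anyone hitting the same wall, a rough sketch of the checks behind this follow-up: snapshot count, resilver progress, and the resilver throttling knobs that ZFS v28 on 9.x exposes. The pool name "tank" is a placeholder and the sysctl names may differ between versions:

    # How many snapshots the pool is carrying (large counts slow scans down)
    zfs list -H -t snapshot | wc -l

    # Resilver progress and estimated completion ("tank" is a placeholder)
    zpool status -v tank

    # Resilver/scrub throttling tunables; exact names may vary by ZFS version
    sysctl vfs.zfs.resilver_delay vfs.zfs.scrub_delay vfs.zfs.resilver_min_time_ms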
Re: Terrible disk performance with LSI / FreeBSD 9.2-RC1
On 08/08/2013 12:42, Terje Elde wrote:
> If not too inconvenient, it'd be very interesting to see what'd happen if
> you were to physically disconnect (data and power) 5 of the 6 drives, then
> boot and dd from the remaining disk to /dev/null. Then repeat with another
> drive. You could boot from USB to leave the system itself otherwise
> untouched.

And while you're at it, could you post the output of diskinfo -v /dev/[slices]? Check the cylinder alignment and so on, if you haven't already.

Regards, Frank.
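For reference, a sketch of the kind of output being asked for here; the device name is a placeholder, and the gpart and zdb checks assume the pool disks are da2-da7:

    # Sector size, stripe size/offset and geometry as the kernel reports them
    diskinfo -v /dev/da2

    # If the vdevs sit on partitions rather than whole disks, check the offsets
    gpart show da2

    # Pool ashift versus the reported sector size (reads the cached pool config)
    zdb | grep ashift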
Re: Terrible disk performance with LSI / FreeBSD 9.2-RC1
On 8. aug. 2013, at 00:08, Frank Leonhardt wrote:
> As a suggestion, what happens if you read from the drives directly? Boot in
> single user and try reading a GB or two using /bin/dd. It might eliminate or
> confirm a problem with ZFS.

If not too inconvenient, it'd be very interesting to see what'd happen if you were to physically disconnect (data and power) 5 of the 6 drives, then boot and dd from the remaining disk to /dev/null. Then repeat with another drive. You could boot from USB to leave the system itself otherwise untouched.

The reason I'm suggesting it is that I'm wondering if this could come down to a power or cabling issue, locking things up or causing retransmits, etc. I'm not sure whether that would always be logged; others might be able to shed light on that.

Terje
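For concreteness, a rough sketch of the one-disk-at-a-time test, plus where retries tend to show up if they were logged at all; da2-da7 are placeholders for the six data disks:

    # Sequential read from each data disk in turn (run while the pool is idle)
    for d in da2 da3 da4 da5 da6 da7; do
        echo "== $d =="
        dd if=/dev/$d of=/dev/null bs=1m count=1024
    done

    # Retries/timeouts that never became hard errors sometimes only appear here
    dmesg | egrep -i 'mpt0|timeout|retr'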
Re: Terrible disk performance with LSI / FreeBSD 9.2-RC1
On 07/08/2013 21:36, J David wrote:
> It feels like some sort of issue with the bus/controller/kernel/driver/ZFS
> that is affecting all the drives equally.
>
> Also, even ls takes forever (10-30 seconds for "ls -lh /") but when it
> eventually does finish, "time ls -lh /" reports:
>
> 0.02 real 0.00 user 0.00 sys
>
> Really not sure what to make of that. An attempt to do "ps axlww | fgrep ls"
> while the ls was running failed, because the ps hangs just as long as the
> ls. So it's like the system is just repeatedly putting anything that touches
> the disks on hold, even if all the data being requested is clearly in cache.
> (Even apparently loading the binary for /bin/ls or doing "ls -lh /" twice in
> a row.)

As a suggestion, what happens if you read from the drives directly? Boot in single user and try reading a GB or two using /bin/dd. It might eliminate or confirm a problem with ZFS.

Regards, Frank.
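As an illustration of the suggestion, a minimal sketch contrasting a raw-device read with a read through ZFS; the device name and the file path are hypothetical:

    # Raw sequential read, bypassing ZFS entirely (single-user mode, pool idle)
    dd if=/dev/da2 of=/dev/null bs=1m count=2048

    # Roughly the same amount of data read back through ZFS, for comparison
    dd if=/tank/some/large/file of=/dev/null bs=1m count=2048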
Re: Terrible disk performance with LSI / FreeBSD 9.2-RC1
On Wed, Aug 7, 2013 at 3:15 PM, James Gosnell wrote:
> Maybe one of your drives is bad, so it's constantly doing error correction?

Not according to SMART; all the drives report no problems. Also, all the drives seem to perform in lock-step for both reading and writing. Normally, when one drive in an array is failing, all the drives may be pulling the same number of reads, but the failing drive will often report 100% busy and/or multi-second svc_t's while the others sit at 4% with 20 ms svc_t's or similar. In this case it's acting like the disks are all hugely overloaded, except without even the high svc_t's I typically associate with an overworked array. The speeds do fluctuate; last night it was down to 64 KB/sec of reads per drive (about 15 reads/sec) and still reporting 90% busy on all drives.

It feels like some sort of issue with the bus/controller/kernel/driver/ZFS that is affecting all the drives equally.

Also, even ls takes forever (10-30 seconds for "ls -lh /"), but when it eventually does finish, "time ls -lh /" reports:

0.02 real 0.00 user 0.00 sys

I'm really not sure what to make of that. An attempt to do "ps axlww | fgrep ls" while the ls was running failed, because the ps hangs just as long as the ls. So it's like the system is repeatedly putting anything that touches the disks on hold, even when all the data being requested is clearly in cache (even, apparently, loading the binary for /bin/ls or doing "ls -lh /" twice in a row).

Thanks!
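One way to see exactly where those hung processes are parked, sketched here with a placeholder PID, is to grab their kernel stacks from another shell while they hang:

    # Find the stuck ls; a STAT of D means it is blocked in disk wait
    ps -axl | grep ' ls'

    # Kernel stack of the stuck process (1234 is a placeholder PID)
    procstat -kk 1234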
Re: Terrible disk performance with LSI / FreeBSD 9.2-RC1
Maybe one of your drives is bad, so it's constantly doing error correction?

On Tue, Aug 6, 2013 at 9:48 PM, J David wrote:
> We have a machine running 9.2-RC1 that's getting terrible disk I/O
> performance. Its performance has always been pretty bad, but it didn't
> really become clear how bad until we did a zpool replace on one of the
> drives and realized it was going to take 3 weeks to rebuild a <1TB drive.
> [...]
> Does anyone have any suggestions for how I could troubleshoot this
> further? At this point, I'm kind of at a loss as to where to go from
> here.

--
James Gosnell, ACP
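For what it's worth, a sketch of how that could be checked from FreeBSD; smartctl comes from the sysutils/smartmontools port, da2 is a placeholder, and whether the SAT passthrough flag is needed behind this controller may vary:

    # Overall health plus the attributes that point at a sick drive or bad cabling
    smartctl -a -d sat /dev/da2 | egrep -i 'health|realloc|pending|uncorrect|crc'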
Terrible disk performance with LSI / FreeBSD 9.2-RC1
We have a machine running 9.2-RC1 that's getting terrible disk I/O performance. Its performance has always been pretty bad, but it didn't really become clear how bad until we did a zpool replace on one of the drives and realized it was going to take 3 weeks to rebuild a <1TB drive.

The hardware specs are:
- 2 x Xeon L5420
- 32 GiB RAM
- LSI Logic SAS 1068E
- 2 x 32GB SSD's
- 6 x 1TB Western Digital RE3 7200RPM SATA

The LSI controller has the most recent firmware I'm aware of (6.36.00.00 / 1.33.00.00, dated 2011.08.24), is in IT mode, and appears to be working fine:

mpt0 Adapter:
       Board Name: USASLP-L8i
   Board Assembly: USASLP-L8i
        Chip Name: C1068E
    Chip Revision: B3
      RAID Levels: none

mpt0 Configuration: 0 volumes, 8 drives
    drive da0 (30G) ONLINE SATA
    drive da1 (29G) ONLINE SATA
    drive da2 (931G) ONLINE SATA
    drive da3 (931G) ONLINE SATA
    drive da4 (931G) ONLINE SATA
    drive da5 (931G) ONLINE SATA
    drive da6 (931G) ONLINE SATA
    drive da7 (931G) ONLINE SATA

The eight drives are configured as ZIL and L2ARC on the SSDs, plus a six-drive raidz2 on the spinning disks.

We did a ZFS replace on the last drive in the line, and the resilver is proceeding at less than 800 KB/sec:

                        extended device statistics
device       r/s      w/s     kr/s     kw/s  qlen   svc_t   %b
da0          0.0      0.0      0.0      0.1     0     0.9    0
da1          0.0      8.2      0.0     19.9     0     0.1    0
da2        125.6     23.0    768.2     40.5     4    33.0   88
da3        126.6     23.1    769.0     41.3     4    32.3   89
da4        126.0     24.0    768.5     42.7     4    32.1   88
da5        125.9     22.0    768.2     40.1     4    31.6   87
da6        124.0     22.0    766.6     39.9     5    31.4   84
da7          0.0    136.9      0.0    801.3     0     0.6    4

The system has plenty of free RAM, is 99.7% idle, has nothing else going on, and runs like a one-legged dog.

There are no error messages or any sign of a problem anywhere, other than the really terrible performance. (When not rebuilding, it does light NFS duty. That performance is similarly bad, but has never really mattered.)

Similar systems running Solaris put out 10x these numbers while claiming 30% busy instead of 90% busy.

Does anyone have any suggestions for how I could troubleshoot this further? At this point I'm kind of at a loss as to where to go from here. My goal is to try to phase out the Solaris machines, but this is kind of a roadblock.

Thanks for any advice!
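A sketch of the kind of side-by-side monitoring that can help separate a saturated controller or driver from saturated disks during the resilver; the da[2-7] filter and the pool name "tank" are placeholders:

    # Per-disk latency and queue depth as GEOM sees it, refreshed live
    gstat -f 'da[2-7]'

    # The pool's own per-vdev view of the same workload, every 5 seconds
    zpool iostat -v tank 5

    # Confirm the mpt(4) controller still presents all eight drives
    camcontrol devlist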