Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey there, Bob! Looks like you and Akhilesh (thanks, Akhilesh!) are driving at a similar, very valid point. I'm currently using the default recordsize (128K) on all of the ZFS pools (those of the iSCSI target nodes and the aggregate pool on the head node). I should have mentioned how the storage will be used in my original post, so I'm glad you brought it up. It will all be presented over NFS and CIFS as a 10GbE+InfiniBand NAS which will serve a number of organizations. Some organizations will simply use their area for end-user file sharing, others will use it as a disk backup target, others for databases, and still others for HPC data crunching (gene sequences). Each of these uses will be on a different filesystem, of course, so I expect it would be good to set a different recordsize parameter for each one. Do you have any suggestions on good starting sizes for each? I'd imagine file sharing might benefit from a relatively small record size (64K?), image-based backup targets might like a pretty large record size (256K?), databases just need recordsizes to match their block sizes, and HPC... I have no idea. Heh. I expect I'll need to get in contact with the HPC lab to see what kind of profile they have (whether they deal with tiny files or big files, etc.). What do you think?

Today I'm going to try a few non-ZFS-related tweaks (disabling the Nagle algorithm on the iSCSI initiator and increasing the MTU everywhere to 9000). I'll give those a shot and see if they yield performance enhancements.
-Gray

On Tue, Oct 14, 2008 at 10:36 PM, Bob Friesenhahn <[EMAIL PROTECTED]> wrote:
> On Tue, 14 Oct 2008, Gray Carper wrote:
>>
>> So, how concerned should we be about the low scores here and there? Any
>> suggestions on how to improve our configuration? And how excited should we
>> be about the 8GB tests? ;>
>
> The level of concern should depend on how you expect your storage pool to
> actually be used. It seems that it should work great for bulk storage, but
> not to support a database, or ultra high-performance super-computing
> applications. The good 8GB performance is due to successful ZFS ARC caching
> in RAM, and because the record size is reasonable given the ZFS block size
> and the buffering ability of the intermediate links. You might see somewhat
> better performance using a 256K record size.
>
> It may take quite a while to fill 150TB up.
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED] | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
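For anyone following along, per-filesystem recordsize tuning of the sort discussed above looks roughly like this (a sketch only: the dataset names are invented, the values are just the starting points floated in this thread, and as far as I know 128K is both the default and the maximum recordsize on these builds; the setting only affects newly written files):

  zfs set recordsize=64K data/fileshare   # general NFS/CIFS file sharing
  zfs set recordsize=128K data/backup     # large sequential backup images (128K is the maximum here)
  zfs set recordsize=8K data/db           # match the database block size (often 8K or 16K)
  zfs get recordsize data/fileshare data/backup data/db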
[zfs-discuss] HELP! SNV_97,98,99 zfs with iscsitadm and VMWare!
I'm not sure if this is a problem with the iSCSI target or ZFS. I'd greatly appreciate it if it gets moved to the proper list. Well, I'm just about out of ideas on what might be wrong.

Quick history: I installed OS 2008.05 when it was SNV_86 to try out ZFS with VMware. Found out that multiple LUNs were being treated as multipaths, so I waited until SNV_94 came out to fix the issues with VMware and iscsitadm/zfs shareiscsi=on. I installed OS 2008.05 on a virtual machine as a test bed, ran pkg image-update to SNV_94 a month ago, made some thin-provisioned partitions, shared them with iscsitadm, and mounted them on VMware without any problems. Ran Storage VMotion and all went well. So with this success I purchased a Dell 1900 with a PERC 5/i controller and 6 x 15K SAS drives in a ZFS RAIDZ1 configuration. I shared the ZFS partitions and mounted them on VMware. Everything is great till I have to write to the disks. It won't write!

Steps I took creating the disks:
1) Installed mega_sas drivers.
2) zpool create tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0
3) zfs create -V 1TB tank/disk1
4) zfs create -V 1TB tank/disk2
5) iscsitadm create target -b /dev/zvol/rdsk/tank/disk1 LABEL1
6) iscsitadm create target -b /dev/zvol/rdsk/tank/disk2 LABEL2

Now both drives are LUN 0, but with unique VMHBA device identifiers, so they are detected as separate drives. I then redid (deleted) steps 5 and 6 and changed them to:

5) iscsitadm create target -u 0 -b /dev/zvol/rdsk/tank/disk1 LABEL1
6) iscsitadm create target -u 1 -b /dev/zvol/rdsk/tank/disk2 LABEL1

VMware discovers the separate LUNs on the device identifier, but I am still unable to write to the iSCSI LUNs. Why is it that the steps I've conducted in SNV_94 work, but in SNV_97, 98, or 99 they don't? Any ideas? Any log files I can check? I am still an ignorant Linux user, so I only know to look in /var/log. :)
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
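In case it helps anyone following up, a few places to look on the target side (a sketch; this assumes the stock iscsitgt SMF service and default log locations):

  # confirm the target daemon is online and find its SMF log file
  svcs -l svc:/system/iscsitgt:default
  # list configured targets, LUNs, and their backing stores
  iscsitadm list target -v
  # watch for SCSI or zvol errors while the initiator attempts a write
  tail -f /var/adm/messages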
Re: [zfs-discuss] zpool CKSUM errors since drive replace
> So this is where I stand. I'd like to ask zfs-discuss if they've seen any
> ZIL/Replay style bugs associated with u3/u5 x86? Again, I'm confident in my
> hardware, and /var/adm/messages is showing no warnings/errors.

Are you absolutely sure the hardware is OK? Is there another disk you can test in its place?

If I read your post correctly, your first disk was having errors logged against it, and now the second disk -- plugged into the same port -- is also logging errors. This seems to me more like the port is bad. Is there a third disk you can try in that same port?

I have a hard time seeing that this could be a zfs bug - I've been doing lots of testing on u5 and the only time I see checksum errors is when I deliberately induce them.
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
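If it helps narrow things down, these commands show whether the errors follow the disk or the port (nothing here is specific to u3/u5; the pool name is whatever yours is called):

  # per-device soft/hard/transport error counters since boot
  iostat -En
  # the fault-management error log records which device each checksum/I/O error was charged to
  fmdump -eV | more
  # and the pool's own per-vdev view
  zpool status -v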
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Erast Benson wrote:
> James, all serious ZFS bug fixes have been back-ported to b85, as well as the
> Marvell and other SATA drivers. Not everything is possible to back-port, of
> course, but I would say all the critical things are there. This includes the
> ZFS ARC optimization patches, for example.

Excellent!

James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive
Well, I haven't solved everything yet, but I do feel better now that I realize that it was setting mountpoint=none that caused the zfs send/recv to hang. Allowing the default mountpoint setting fixed that problem. I'm now trying with mountpoint=legacy, because I'd really rather leave it unmounted, especially during the backup itself, to prevent changes happening while the incrementals are copying over, and also, in the end, to hopefully let me avoid using -F.

The incrementals (copying all the snapshots beyond the first one copied) are really slow, however. Is there anything that can be done to speed that up?

I'm using compression (gzip-1) on the source filesystem, and I want the backup to retain the same compression. Can ZFS copy the compressed version over to the backup, or does it really have to uncompress it and recompress it? That takes time and lots of CPU cycles. I'm dealing with highly compressible data (at least 6.5:1).
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
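On the incremental question, the usual pattern is a single -I stream covering the whole snapshot range, with the compression property applied independently on each side (a sketch; the names are invented, and it assumes a build new enough to have send -I; as far as I know these builds always send the data uncompressed, so it is decompressed on read and recompressed on write regardless):

  # initial full copy
  zfs send tank/data@snap1 | zfs receive backup/data
  # everything from snap1 up to the newest snapshot in one stream
  zfs send -I tank/data@snap1 tank/data@snap9 | zfs receive backup/data
  # keep gzip-1 on the backup side too; as long as the target is never modified
  # between receives, -F stays unnecessary
  zfs set compression=gzip-1 backup/data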
[zfs-discuss] L2ARC on iSER ramdisks?
Hello,

The idea introduced by Chris Greer to use servers as solid state disks has kept my brain busy the last few days. Perhaps it makes sense to put L2ARC devices in memory as well, to increase the in-memory part of a database beyond the capacity of a single server. I already wrote about the idea in my blog at http://www.c0t0d0s0.org/archives/4919-L2ARC-on-ramdisks.html so I will just copy-and-paste the article here:

"I thought a little bit about the idea of transforming servers into solid state disks. The idea in the mail of Chris Greer on zfs-discuss was to use mirrored iSCSI-shared ramdisks as storage for separated ZILs. But I think you could use the concept for L2ARC as well - e.g. for large databases.

One of the sizing rules for databases: more main memory never hurts. Nothing helps the performance of a database more than even more memory. The rule of "main memory never hurts" is based on the fact that a hard disk has only a few IOPS compared with main memory, and hard drive access massively hurts the performance of your database. But obviously the size of memory is limited, albeit this limit is quite high, with systems offering memory sizes in the range of 512 GB in 4 rack units. But how can you get more memory into your database system when all DIMM slots are filled with the biggest available DIMMs?

I had an idea while cooking tea this evening, while thinking about a discussion with a colleague: let's assume an architecture based on an X4600 as a head in front of four X4600s, each fully maxed out at 512GB. All the nodes are connected with InfiniBand. The first X4600 is your normal database server (for example MySQL or LarryBase). You put your data into a ZFS storage pool. This storage pool is augmented with L2ARC devices. But now comes the plot twist: let's use the 512GB X4600s as huge ramdisks (yes, I know, every engineer's heart will be crying now), speaking via iSER (no TCP/IP, just RDMA) at 20 Gbit/s to the central database node. This would give you a cache of almost 2 TB in addition to the cache on the database server itself.

By using L2ARC you could use the memory of other systems as database cache without the database having to combine the memory resources by other means, for example the Cache Fusion stuff of Oracle. You don't have to fuse the caches of other database servers. The other servers are the caches. You don't have to partition the databases. It would be interesting to see how such a system would perform in comparison to an Oracle RAC or other in-memory implementations. Anybody out there willing to test this ... my InfiniBand switches are in the laundry at the moment ;-)"

I hope this idea is not complete nonsense ...

Regards
Joerg
--
Joerg Moellenkamp              Tel: (+49 40) 25 15 23 - 460
Senior Systems Engineer        Fax: (+49 40) 25 15 23 - 425
Sun Microsystems GmbH          Mobile: (+49 172) 83 18 433
Nagelsweg 55                   mailto:[EMAIL PROTECTED]
D-20097 Hamburg                http://www.sun.de
Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht München: HRB 161028
Geschäftsführer: Thomas Schröder, Wolfgang Engels, Dr. Roland Bömer
Vorsitzender des Aufsichtsrates: Martin Häring
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
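In case anyone wants to experiment, wiring such a ramdisk in as L2ARC is mostly standard plumbing (a sketch only; the names, sizes, and the resulting device name on the head node are invented, and the iSER/RDMA transport configuration itself is not shown):

  # on each memory node: carve out a ramdisk and export it as an iSCSI target
  ramdiskadm -a l2arc0 400g
  iscsitadm create target -b /dev/ramdisk/l2arc0 l2arc0
  # on the database head: add the imported LUN as a cache (L2ARC) device;
  # cache devices can be added and removed at any time
  zpool add dbpool cache c7t1d0
  zpool remove dbpool c7t1d0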
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <[EMAIL PROTECTED]> wrote:
> Hey, all!
>
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets
> over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200
> head node. In trying to discover optimal ZFS pool construction settings,
> we've run a number of iozone tests, so I thought I'd share them with you and
> see if you have any comments, suggestions, etc.
>
> First, on a single Thumper, we ran baseline tests on the direct-attached
> storage (which is collected into a single ZFS pool comprised of four raidz2
> groups)...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> Write: 123919
> Rewrite: 146277
> Read: 383226
> Reread: 383567
> Random Read: 84369
> Random Write: 121617
>
> [8GB file size, 512KB record size]
> Command:
> Write: 373345
> Rewrite: 665847
> Read: 2261103
> Reread: 2175696
> Random Read: 2239877
> Random Write: 666769
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
> Write: 517092
> Rewrite: 541768
> Read: 682713
> Reread: 697875
> Random Read: 89362
> Random Write: 488944
>
> These results look very nice, though you'll notice that the random read
> numbers tend to be pretty low on the 1GB and 64GB tests (relative to their
> sequential counterparts), but the 8GB random (and sequential) read is
> unbelievably good.
>
> Now we move to the head node's iSCSI aggregate ZFS pool...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest
> Write: 127108
> Rewrite: 120704
> Read: 394073
> Reread: 396607
> Random Read: 63820
> Random Write: 5907
>
> [8GB file size, 512KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest
> Write: 235348
> Rewrite: 179740
> Read: 577315
> Reread: 662253
> Random Read: 249853
> Random Write: 274589
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest
> Write: 190535
> Rewrite: 194738
> Read: 297605
> Reread: 314829
> Random Read: 93102
> Random Write: 175688
>
> Generally speaking, the results look good, but you'll notice that random
> writes are atrocious on the 1GB tests and random reads are not so great on
> the 1GB and 64GB tests, but the 8GB test looks great across the board.
> Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk,
> raidz1, and raidz2 modes - there were no significant changes in the results.
>
> So, how concerned should we be about the low scores here and there? Any
> suggestions on how to improve our configuration? And how excited should we be
> about the 8GB tests? ;>
>
> Thanks so much for any input you have!
> -Gray
> ---
> University of Michigan
> Medical School Information Services
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

Your setup sounds very interesting. Can you give me some more details on how you export the iSCSI targets to the head unit, your filesystem layout, and how you mount it all on the head unit? Sounds like a pretty clever way to export awesomely large volumes!

Regards,
--
Brent Jones
[EMAIL PROTECTED]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
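For context, pulling several iSCSI LUNs into one pool on a head node generally looks something like this with the stock Solaris initiator (a sketch of the usual workflow, not necessarily Gray's actual configuration; the addresses and device names are invented):

  # on the head node: point the initiator at each Thumper's target portal
  iscsiadm add discovery-address 192.168.10.11:3260
  iscsiadm modify discovery -t enable
  # create device nodes for the discovered LUNs; they then show up as ordinary disks
  devfsadm -i iscsi
  # and the pool is simply built over those disks
  zpool create data-iscsi c2t600144F01D2B0001d0 c2t600144F01D2B0002d0 c2t600144F01D2B0003d0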
Re: [zfs-discuss] Change the volblocksize of a ZFS volume
Nick Smith wrote:
> Dear all,
>
> Background:
>
> I have a ZFS volume with the incorrect volume blocksize for the filesystem
> (NTFS) that it is supporting.
>
> This volume contains important data that is proving impossible to copy from
> within the Windows XP Xen HVM that "owns" the data.
>
> The disparity in volume blocksize (currently set to 512 bytes!!) is causing
> significant performance problems.
>
> Question:
>
> Is there a way to change the volume blocksize, say via 'zfs snapshot' and
> send/receive?
>
> As I see things, this isn't possible, as the target volume (including property
> values) gets overwritten by 'zfs receive'.

By default, properties are not received. To pass properties, you need to use the -R flag. For examples, see the ZFS Administration Guide,
http://www.opensolaris.org/os/community/zfs/docs/zfsadmin.pdf
 -- richard

> Many thanks for any help.
>
> Nick Smith
> --
> This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
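A rough illustration of the -R usage Richard mentions, plus the blunt alternative if the block size itself has to change (a sketch; the names and sizes are invented, and note that volblocksize is normally fixed at creation time, so a stream replicated with -R carries the old value along, which is why the second route exists):

  # replication stream that carries snapshots and properties along
  zfs snapshot tank/ntfsvol@move
  zfs send -R tank/ntfsvol@move | zfs receive -d backup
  # alternative: create a fresh zvol with a saner block size and copy at the
  # device level while the Windows guest is shut down
  zfs create -b 8k -V 100G tank/ntfsvol-new
  dd if=/dev/zvol/rdsk/tank/ntfsvol of=/dev/zvol/rdsk/tank/ntfsvol-new bs=1M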
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
James, all serious ZFS bug fixes have been back-ported to b85, as well as the Marvell and other SATA drivers. Not everything is possible to back-port, of course, but I would say all the critical things are there. This includes the ZFS ARC optimization patches, for example.

On Tue, 2008-10-14 at 22:33 +1000, James C. McPherson wrote:
> Gray Carper wrote:
> > Hey there, James!
> >
> > We're actually running NexentaStor v1.0.8, which is based on b85. We
> > haven't done any tuning ourselves, but I suppose it is possible that
> > Nexenta did. If there's something specific you'd like me to look for,
> > I'd be happy to.
>
> Hi Gray,
> So, build 85... that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, 14 Oct 2008, Gray Carper wrote:
>
> So, how concerned should we be about the low scores here and there?
> Any suggestions on how to improve our configuration? And how excited
> should we be about the 8GB tests? ;>

The level of concern should depend on how you expect your storage pool to actually be used. It seems that it should work great for bulk storage, but not to support a database, or ultra high-performance super-computing applications. The good 8GB performance is due to successful ZFS ARC caching in RAM, and because the record size is reasonable given the ZFS block size and the buffering ability of the intermediate links. You might see somewhat better performance using a 256K record size.

It may take quite a while to fill 150TB up.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
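For anyone who wants to see how much of a run like the 8GB case is being absorbed by the ARC, the arcstats kstats give a quick read (a sketch; available on recent OpenSolaris/Solaris Express builds):

  # ARC size plus hit/miss counters, sampled before and after a benchmark run
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:hits zfs:0:arcstats:misses
  # or sample the whole arcstats group every 5 seconds
  kstat -p -n arcstats 5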
[zfs-discuss] Change the volblocksize of a ZFS volume
Dear all,

Background:

I have a ZFS volume with the incorrect volume blocksize for the filesystem (NTFS) that it is supporting.

This volume contains important data that is proving impossible to copy from within the Windows XP Xen HVM that "owns" the data.

The disparity in volume blocksize (currently set to 512 bytes!!) is causing significant performance problems.

Question:

Is there a way to change the volume blocksize, say via 'zfs snapshot' and send/receive?

As I see things, this isn't possible, as the target volume (including property values) gets overwritten by 'zfs receive'.

Many thanks for any help.

Nick Smith
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Just a random spectator here, but I think the artifacts you're seeing are not due to file size, but rather to record size. What is the ZFS record size?

On a personal note, I wouldn't run non-concurrent (?) benchmarks. They are at best useless and at worst misleading for ZFS.

- Akhilesh.
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
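For reference, iozone's throughput mode is one way to get a concurrent load instead of a single stream (a sketch; the thread count, sizes, and paths are arbitrary):

  # four parallel threads, 2GB each, 128k records: write/rewrite, read/reread, random read/write
  iozone -t 4 -s 2g -r 128k -i 0 -i 1 -i 2 \
      -F /volumes/data-iscsi/perftest/t1 /volumes/data-iscsi/perftest/t2 \
         /volumes/data-iscsi/perftest/t3 /volumes/data-iscsi/perftest/t4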
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy! Sounds good. We'll upgrade to 1.1 (b101) as soon as it is released, re-run our battery of tests, and see where we stand. Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:47 PM, James C. McPherson <[EMAIL PROTECTED]> wrote:
> Gray Carper wrote:
>> Hello again! (And hellos to Erast, who has been a huge help to me many,
>> many times! :>)
>>
>> As I understand it, Nexenta 1.1 should be released in a matter of weeks
>> and it'll be based on build 101. We are waiting for that with bated breath,
>> since it includes some very important Active Directory integration fixes,
>> but this sounds like another reason to be excited about it. Maybe this is a
>> discussion that should be tabled until we are able to upgrade?
>
> Yup, I think that's probably the best thing. And thanks
> for passing on the info about the 1.1 release, I'll keep
> that in my back pocket :)
>
> cheers,
> James
>
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED] | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hello again! (And hellos to Erast, who has been a huge help to me many,
> many times! :>)
>
> As I understand it, Nexenta 1.1 should be released in a matter of weeks
> and it'll be based on build 101. We are waiting for that with bated
> breath, since it includes some very important Active Directory
> integration fixes, but this sounds like another reason to be excited
> about it. Maybe this is a discussion that should be tabled until we are
> able to upgrade?

Yup, I think that's probably the best thing. And thanks for passing on the info about the 1.1 release, I'll keep that in my back pocket :)

cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello again! (And hellos to Erast, who has been a huge help to me many, many times! :>)

As I understand it, Nexenta 1.1 should be released in a matter of weeks and it'll be based on build 101. We are waiting for that with bated breath, since it includes some very important Active Directory integration fixes, but this sounds like another reason to be excited about it. Maybe this is a discussion that should be tabled until we are able to upgrade?
-Gray

On Tue, Oct 14, 2008 at 8:33 PM, James C. McPherson <[EMAIL PROTECTED]> wrote:
> Gray Carper wrote:
>> Hey there, James!
>>
>> We're actually running NexentaStor v1.0.8, which is based on b85. We
>> haven't done any tuning ourselves, but I suppose it is possible that Nexenta
>> did. If there's something specific you'd like me to look for, I'd be happy
>> to.
>
> Hi Gray,
> So, build 85... that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hey there, James!
>
> We're actually running NexentaStor v1.0.8, which is based on b85. We
> haven't done any tuning ourselves, but I suppose it is possible that
> Nexenta did. If there's something specific you'd like me to look for,
> I'd be happy to.

Hi Gray,
So, build 85... that's getting a bit long in the tooth now.

I know there have been *lots* of ZFS, Marvell SATA and iSCSI fixes and enhancements since then which went into OpenSolaris. I know they're in Solaris Express and the updated binary distro form of os2008.05 - I just don't know whether Erast and the Nexenta clan have included them in what they are releasing as 1.0.8.

Erast - could you chime in here please? Unfortunately I've got no idea about Nexenta.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey there, James! We're actually running NexentaStor v1.0.8, which is based on b85. We haven't done any tuning ourselves, but I suppose it is possible that Nexenta did. If there's something specific you have in mind, I'd be happy to look for it. Thanks! -Gray On Tue, Oct 14, 2008 at 8:10 PM, James C. McPherson <[EMAIL PROTECTED] > wrote: > Gray Carper wrote: > >> Hey, all! >> >> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI >> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on >> an x4200 head node. In trying to discover optimal ZFS pool construction >> settings, we've run a number of iozone tests, so I thought I'd share them >> with you and see if you have any comments, suggestions, etc. >> > > [snip] > > > Which build are you running? Have you done any system > or ZFS tuning? > > > James C. McPherson > -- > Senior Kernel Software Engineer, Solaris > Sun Microsystems > http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog > -- Gray Carper MSIS Technical Services University of Michigan Medical School ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote: > Hey, all! > > We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI > targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on > an x4200 head node. In trying to discover optimal ZFS pool construction > settings, we've run a number of iozone tests, so I thought I'd share them > with you and see if you have any comments, suggestions, etc. [snip] Which build are you running? Have you done any system or ZFS tuning? James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS confused about disk controller
For the sake of completeness, in the end I simply created links in /dev/rdsk for c1t0d0sX to point to my disk and was able to reactivate the current BE. The shroud of mystery hasn't lifted, though, because when I did eventually reboot, I performed a reconfigure (boot -r) and the format and cfgadm commands now report my disk to be attached to the c1 controller. I don't know how or why Solaris can go from detecting c1 to c5 and back to c1 again, but at least now ZFS and format agree on where my disk is. Maybe the last time I did a reconfigure was when I booted from the LiveCD?

Caryl

On 10/07/08 11:58, Caryl Takvorian wrote:
> Hi all,
>
> Please keep me on cc: since I am not subscribed to either list.
>
> I have a weird problem with my OpenSolaris 2008.05 installation (build 96) on my Ultra 20 workstation.
> For some reason, ZFS has been confused and has recently started to believe that my zpool is using a device which does not exist!
>
> prodigal:zfs #zpool status
>   pool: rpool
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         rpool       ONLINE       0     0     0
>           c1t0d0s0  ONLINE       0     0     0
>
> errors: No known data errors
>
> The c1t0d0s0 device doesn't exist on my system. Instead, my disk is attached to c5t0d0s0 as shown by
>
> prodigal:zfs #format
> Searching for disks...done
>
> AVAILABLE DISK SELECTIONS:
>        0. c5t0d0
>           /[EMAIL PROTECTED],0/pci108e,[EMAIL PROTECTED]/[EMAIL PROTECTED],0
>
> or
>
> prodigal:zfs #cfgadm
> Ap_Id                 Type   Receptacle   Occupant     Condition
> sata0/0::dsk/c5t0d0   disk   connected    configured   ok
>
> What is really annoying is that I attempted to update my current OpenSolaris build 96 to the latest (b98) by using
>
> # pkg image-update
>
> The update went well, and at the end it selected the new BE to be activated upon reboot, but failed when attempting to modify the grub entry, because install_grub asks ZFS what my boot device is and gets back the wrong device (of course, I am using ZFS as my root filesystem, otherwise it wouldn't be fun).
>
> When I manually try to run install_grub, this is the error message I get:
>
> prodigal:zfs #/tmp/tmpkkEF1W/boot/solaris/bin/update_grub -R /tmp/tmpkkEF1W
> Creating GRUB menu in /tmp/tmpkkEF1W
> bootadm: fstyp -a on device /dev/rdsk/c1t0d0s0 failed
> bootadm: failed to get pool for device: /dev/rdsk/c1t0d0s0
> bootadm: fstyp -a on device /dev/rdsk/c1t0d0s0 failed
> bootadm: failed to get pool name from /dev/rdsk/c1t0d0s0
> bootadm: failed to create GRUB boot signature for device: /dev/rdsk/c1t0d0s0
> bootadm: failed to get grubsign for root: /tmp/tmpkkEF1W, device /dev/rdsk/c1t0d0s0
> Installing grub on /dev/rdsk/c1t0d0s0
> cannot open/stat device /dev/rdsk/c1t0d0s2
>
> The worst bit is that now beadm refuses to reactivate my current running OS to be used upon the next reboot.
> So, the next time I reboot, my system is probably never going to come back up.
>
> prodigal:zfs #beadm list
>
> BE            Active Mountpoint     Space   Policy Created
> --            ------ ----------     -----   ------ -------
> opensolaris-5 N      /              128.50M static 2008-09-09 13:03
> opensolaris-6 R      /tmp/tmpkkEF1W 52.19G  static 2008-10-07 10:14
>
> prodigal:zfs #export BE_PRINT_ERR=true
> prodigal:zfs #beadm activate opensolaris-5
> be_do_installgrub: installgrub failed for device c1t0d0s0.
> beadm: Unable to activate opensolaris-5
>
> So, how can I force zpool to accept that my disk device really is on c5t0d0s0 and forget about c1?
>
> Since the file /etc/zfs/zpool.cache contains a reference to /dev/dsk/c1t0d0s0, I have rebuilt the boot_archive after removing it from the ramdisk, but I've got cold feet about rebooting without confirmation.
>
> Has anyone seen this before or has any idea how to fix this situation?
>
> Thanks
>
> Caryl

--
~~~
Caryl Takvorian          [EMAIL PROTECTED]
ISV Engineering          phone : +44 (0)1252 420 686
Sun Microsystems UK      mobile: +44 (0)771 778 5646
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
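For anyone else untangling stale device paths, the less hand-crafted route is usually to let devfsadm rebuild the /dev links and then have ZFS re-read the paths (a sketch; an export/import is not an option for an active root pool, where a reconfiguration boot as described above is the equivalent):

  # prune dangling /dev/dsk and /dev/rdsk links and recreate current ones
  devfsadm -Cv
  # for a non-root pool, export/import makes ZFS rediscover the devices under their current names
  zpool export tank
  zpool import tank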
Re: [zfs-discuss] Improving zfs send performance
Carsten Aulbert schrieb:
> Hi again,
>
> Thomas Maier-Komor wrote:
>> Carsten Aulbert schrieb:
>>> Hi Thomas,
>> I don't know socat or what benefit it gives you, but have you tried
>> using mbuffer to send and receive directly (options -I and -O)?
>
> I thought we tried that in the past and with socat it seemed faster, but
> I just made a brief test and I got (/dev/zero -> remote /dev/null) 330
> MB/s with mbuffer+socat and 430 MB/s with mbuffer alone.
>
>> Additionally, try to set the block size of mbuffer to the recordsize of
>> zfs (usually 128k):
>> receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive
>> sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1
>
> We are using 32k since many of our users use tiny files (and then I need
> to reduce the buffer size because of this 'funny' error):
>
> mbuffer: fatal: Cannot address so much memory
> (32768*65536=2147483648>1544040742911).
>
> Does this qualify for a bug report?
>
> Thanks for the hint of looking into this again!
>
> Cheers
>
> Carsten

Yes, this qualifies for a bug report. As a workaround for now, you can compile in 64-bit mode, i.e.:

$ ./configure CFLAGS="-g -O -m64"
$ make && make install

This works for Sun Studio 12 and gcc. For older versions of Sun Studio, you need to pass -xarch=v9 instead of -m64.

I am planning to release an updated version of mbuffer this week. I'll include a patch for this issue.

Cheers,
Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving zfs send performance
Hi again,

Thomas Maier-Komor wrote:
> Carsten Aulbert schrieb:
>> Hi Thomas,
> I don't know socat or what benefit it gives you, but have you tried
> using mbuffer to send and receive directly (options -I and -O)?

I thought we tried that in the past and with socat it seemed faster, but I just made a brief test and I got (/dev/zero -> remote /dev/null) 330 MB/s with mbuffer+socat and 430 MB/s with mbuffer alone.

> Additionally, try to set the block size of mbuffer to the recordsize of
> zfs (usually 128k):
> receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive
> sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1

We are using 32k since many of our users use tiny files (and then I need to reduce the buffer size because of this 'funny' error):

mbuffer: fatal: Cannot address so much memory (32768*65536=2147483648>1544040742911).

Does this qualify for a bug report?

Thanks for the hint of looking into this again!

Cheers

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey, all!

We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200 head node. In trying to discover optimal ZFS pool construction settings, we've run a number of iozone tests, so I thought I'd share them with you and see if you have any comments, suggestions, etc.

First, on a single Thumper, we ran baseline tests on the direct-attached storage (which is collected into a single ZFS pool comprised of four raidz2 groups)...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 123919
Rewrite: 146277
Read: 383226
Reread: 383567
Random Read: 84369
Random Write: 121617

[8GB file size, 512KB record size]
Command:
Write: 373345
Rewrite: 665847
Read: 2261103
Reread: 2175696
Random Read: 2239877
Random Write: 666769

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 517092
Rewrite: 541768
Read: 682713
Reread: 697875
Random Read: 89362
Random Write: 488944

These results look very nice, though you'll notice that the random read numbers tend to be pretty low on the 1GB and 64GB tests (relative to their sequential counterparts), but the 8GB random (and sequential) read is unbelievably good.

Now we move to the head node's iSCSI aggregate ZFS pool...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest
Write: 127108
Rewrite: 120704
Read: 394073
Reread: 396607
Random Read: 63820
Random Write: 5907

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest
Write: 235348
Rewrite: 179740
Read: 577315
Reread: 662253
Random Read: 249853
Random Write: 274589

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest
Write: 190535
Rewrite: 194738
Read: 297605
Reread: 314829
Random Read: 93102
Random Write: 175688

Generally speaking, the results look good, but you'll notice that random writes are atrocious on the 1GB tests and random reads are not so great on the 1GB and 64GB tests, but the 8GB test looks great across the board. Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk, raidz1, and raidz2 modes - there were no significant changes in the results.

So, how concerned should we be about the low scores here and there? Any suggestions on how to improve our configuration? And how excited should we be about the 8GB tests? ;>

Thanks so much for any input you have!
-Gray
---
University of Michigan
Medical School Information Services
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving zfs send performance
Carsten Aulbert schrieb:
> Hi Thomas,
>
> Thomas Maier-Komor wrote:
>
>> Carsten,
>>
>> the summary looks like you are using mbuffer. Can you elaborate on what
>> options you are passing to mbuffer? Maybe changing the blocksize to be
>> consistent with the recordsize of the zpool could improve performance.
>> Is the buffer running full or is it empty most of the time? Are you sure
>> that the network connection is 10Gb/s all the way through from machine
>> to machine?
>
> Well spotted :)
>
> Right now plain mbuffer with plenty of buffer (-m 2048M) on both ends,
> and I have not seen any buffer exceeding the 10% watermark level. The
> network connections are via Neterion XFrame II Sun Fire NICs, then via CX4
> cables to our core switch, where both boxes are directly connected
> (Woven Systems EFX1000). netperf tells me that the TCP performance is
> close to 7.5 Gbit/s duplex, and if I use
>
> cat /dev/zero | mbuffer | socat ---> socat | mbuffer > /dev/null
>
> I easily see speeds of about 350-400 MB/s, so I think the network is fine.
>
> Cheers
>
> Carsten

I don't know socat or what benefit it gives you, but have you tried using mbuffer to send and receive directly (options -I and -O)?

Additionally, try to set the block size of mbuffer to the recordsize of zfs (usually 128k):

receiver$ mbuffer -I sender:1 -s 128k -m 2048M | zfs receive
sender$ zfs send blabla | mbuffer -s 128k -m 2048M -O receiver:1

As transmitting from /dev/zero to /dev/null runs at a rate of 350 MB/s, I guess you are really hitting the maximum speed of your zpool. From my understanding, I'd guess sending is always slower than receiving, because reads are random and writes are sequential. So it should be quite normal that mbuffer's buffer doesn't really see a lot of usage.

Cheers,
Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss