[zfs-discuss] Best way/issues with large ZFS send?
I'm preparing to replicate about 200TB of data between two data centers using zfs send. We have ten 10TB zpools in each data center, which are further broken down into zvols of various sizes. One DC is primary and the other will be the replication target, and there is plenty of bandwidth between them (10 gig dark fiber). Are there any gotchas I should be aware of? Also, at what level should I take the snapshot for the zfs send: at the pool level or at the zvol level? Since the targets are to be exact replicas, I presume at the pool level (e.g. tank) rather than for every zvol (e.g. tank/prod/vol1)? This is all on Solaris 11 Express, snv_151a. Thanks, Eff
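To make the question concrete, here is the pool-level approach I have in mind; a rough sketch only, with made-up snapshot and host names, assuming ssh connectivity between the DCs:

  # one consistent recursive snapshot of the pool and all children
  zfs snapshot -r tank@repl-1
  # full replication stream: child datasets, snapshots and properties
  zfs send -R tank@repl-1 | ssh dr-host zfs receive -Fdu tank
  # later rounds would be incremental between recursive snapshots
  zfs snapshot -r tank@repl-2
  zfs send -R -i tank@repl-1 tank@repl-2 | ssh dr-host zfs receive -Fdu tank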
Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
We tried all combinations of OCZ SSDs, including their PCI-based SSDs, and they do NOT work as a ZIL. After a very short time performance degrades horribly, and the OCZ drives eventually fail completely. We also tried Intel drives, which performed a little better and didn't outright fail over time, but they still did not work out as a ZIL. We use the DDRdrive X1 now for all of our ZIL applications and could not be happier. The cards are great, support is great, and performance is incredible. We use them to provide NFS storage to 50K VMware VDI users. As you stated, the DDRdrive is ideal. Go with that and you'll be very happy you did!
Re: [zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
They have been incredibly reliable, with zero downtime or issues. As a result, we use two in every system, striped. For one application outside of VDI we use a pair of them mirrored, but that is very unusual and driven by the customer rather than by us.
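For the archives, the two layouts in zpool terms are simply the following (pool and device names are placeholders):

  # two independent (striped) log devices
  zpool add tank log c5t0d0 c6t0d0
  # or, for the unusual customer-driven case, a mirrored pair
  zpool add tank log mirror c5t0d0 c6t0d0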
[zfs-discuss] Does a zvol use the zil?
Let me frame this specifically in the context of VMware ESXi 4.x. If I create a zvol and give it to ESXi via iSCSI, our experience has been that it is very fast and guest response is excellent. If we use NFS without a ZIL accelerator (ours is the DDRdrive X1 == awesome), NFS performance is not very good, because VMware uses sync (Stable = FSYNC) writes. Once we enable our ZIL accelerator, NFS performance is approximately as fast as iSCSI. Enabling or disabling the ZIL accelerator has no measurable impact on iSCSI performance for us. So does a zvol use the ZIL or not? If it does, then iSCSI performance seems like it should also be slower without a ZIL accelerator, but it's not. If it doesn't, then is it true that if the power goes off while I'm writing to iSCSI and I have no battery-backed HBA or RAID card, I'll lose data?
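One rough way to check this empirically (a sketch, not a polished script; run as root on the storage server while generating iSCSI and then NFS write load) is to count kernel zil_commit() calls per second with dtrace:

  dtrace -qn 'fbt::zil_commit:entry { @c = count(); }
      tick-1s { printa("zil_commit calls/sec: %@d\n", @c); clear(@c); }'

If the counter stays near zero under iSCSI write load but climbs under NFS load, the zvol writes are not being treated as synchronous.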
Re: [zfs-discuss] SSD partitioned into multiple L2ARC read cache
We tried this in our environment and found that it didn't work out: the more partitions we used, the slower it went. We decided just to use the entire SSD as a read cache, and that worked fine. It still has the TRIM issue, of course, until the next version.
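For reference, the whole-device setup is just the following (pool and device names are placeholders):

  # give the entire SSD to the pool as L2ARC
  zpool add tank cache c5t0d0
  # cache devices can also be removed again cleanly if needed
  zpool remove tank c5t0d0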
Re: [zfs-discuss] Bursty writes - why?
The NFS client in this case was the VMware ESXi 4.1 release build. What happened is that the file uploader behavior was changed in 4.1 to prevent I/O contention with the VM guests. That means when you upload something to the datastore, it only sends chunks of the file instead of streaming it all at once like ESXi 4.0 did. To end users something appeared to be broken, because file uploads now took 95 seconds instead of 30. It turns out that is by design in 4.1. This is the behavior *only* for the uploader and not for the VM guests; their I/O is as expected. As a side note, I have to say the DDRdrive X1s make a night and day difference with VMware. If you use VMware via NFS, I highly recommend the X1s as the ZIL. Otherwise the VMware O_SYNC (Stable = FSYNC) writes will kill your performance dead. We also tried SSDs as the ZIL, which worked OK until they got full, then performance tanked. As I have posted before: SSDs as your ZIL - don't do it!
Re: [zfs-discuss] Bursty writes - why?
The NFS client we're using always uses O_SYNC, which is why it was critical for us to use the DDRdrive X1 as the ZIL. I was unclear about the system we're using; my apologies. It is:

OS: OpenSolaris snv_134
Motherboard: SuperMicro X8DAH
RAM: 72GB
CPU: dual Intel 5503 @ 2.0GHz
ZIL: DDRdrive X1 (two of these, independent and not mirrored)
Drives: 24 x Seagate 1TB SAS, 7200 RPM
Network: 3 x gigabit links as LACP + 1 gigabit backup, with IPMP on top of those

The output I posted is from zpool iostat, and I used it because it corresponds to what users are seeing. Whenever zpool iostat shows write activity, file copies to the system work as expected. As soon as zpool iostat shows no activity, the writes all pause. The simple test case is to copy a CD-ROM ISO image to the server while running zpool iostat.
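For anyone reproducing this, the invocation is of the form below (our pool is xpool; the trailing number is the interval in seconds):

  # per-second pool statistics; note the first line is the since-boot average
  zpool iostat xpool 1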
Re: [zfs-discuss] Bursty writes - why?
Figured it out - it was the NFS client. I used snoop and then some dtrace magic to prove that the client (which was using O_SYNC) was sending very bursty requests to the system. I tried a number of other NFS clients with O_SYNC as well and got excellent performance once they were configured correctly. Just for fun I disabled the pair of DDRdrive X1s that I use for the ZIL, and performance tanked across the board when using O_SYNC. I can't recommend the DDRdrive X1 enough as a ZIL! There is a great article on this behavior here: http://blogs.sun.com/brendan/entry/slog_screenshots Thanks for the help, all!
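For the record, the snoop side of it was nothing fancy; something like the following (interface and client names are made up) shows the gaps between the client's WRITE calls via the timestamps:

  # absolute timestamps on each NFS RPC from the suspect client
  snoop -ta -d e1000g0 host vmclient and rpc nfs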
[zfs-discuss] Bursty writes - why?
I have a 24 x 1TB system being used as an NFS file server: Seagate SAS disks connected via an LSI 9211-8i SAS controller, disk layout 2 x 11-disk RAIDZ2 + 2 spares. I am using 2 x DDRdrive X1s as the ZIL. When we write anything to it, the writes are always very bursty, like this:

xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0    232      0  29.0M
xpool    488K  20.0T      0    101      0  12.7M
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0     50      0  6.37M
xpool    488K  20.0T      0    477      0  59.7M
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool    488K  20.0T      0      0      0      0
xpool   74.7M  20.0T      0    702      0  76.2M
xpool   74.7M  20.0T      0    577      0  72.2M
xpool   74.7M  20.0T      0    110      0  13.9M
xpool   74.7M  20.0T      0      0      0      0
xpool   74.7M  20.0T      0      0      0      0
xpool   74.7M  20.0T      0      0      0      0
xpool   74.7M  20.0T      0      0      0      0

Whenever you see 0, the writes are just hanging. What I would like to see is at least some writing happening every second. What can I look at for this issue? Thanks
Re: [zfs-discuss] VM's on ZFS - 7210
As I said, please by all means try it and post your benchmarks for the first hour, first day, first week, and then the first month. The data will be of interest to you. On a subjective basis, if you feel that an SSD is working just fine as your ZIL, run with it. Good luck!
Re: [zfs-discuss] VM's on ZFS - 7210
I can't think of an easy way to measure pages that have not been consumed, since that is really an SSD controller function hidden from the OS, and over-provisioning adds another variable on top of that. If anyone would like to really get into what goes on inside an SSD that makes it a bad choice for a ZIL, you can start here: http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29 and http://en.wikipedia.org/wiki/Write_amplification which will be more than you ever wanted to know. :)
Re: [zfs-discuss] VM's on ZFS - 7210
Saso is correct - ESX/i always uses FSYNC for all writes, and that is for sure your performance killer. Run snoop and grep for sync (example below) and you'll see the sync write calls from VMware. We use DDRdrives in our production VMware storage and they are excellent for solving this problem. Our cluster supports 50,000 users and we've had no issues at all. Do not use an SSD for the ZIL - as soon as it fills up you will be very unhappy.
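Something along these lines (the interface name is a placeholder) will show the stable flag on each write:

  # NFSv3 WRITE calls show the stable mode (ASYNC/DSYNC/FSYNC) in the summary
  snoop -d e1000g0 rpc nfs | grep -i sync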
Re: [zfs-discuss] VM's on ZFS - 7210
David asked me what I meant by "filled up". If you make the unwise decision to use an SSD as your ZIL, then at some point, days to weeks after you install it, all of the pages will have been allocated and you will suddenly find the device slower than a conventional disk drive. This is due to the way SSDs work. A great write-up about how this works is here: http://www.anandtech.com/show/2738/8 The industry workaround for this issue is called TRIM, and AFAIK the current implementation of TRIM in Solaris does not work for ZIL devices, only for pool devices. If it did, then SSDs would not be a bad option, but the DDRdrive is so much better I wouldn't waste the time. If you don't believe me, try it and post your benchmarks for hour one, day one, and week one. ;)
Re: [zfs-discuss] VM's on ZFS - 7210
By all means please try it to validate this yourself, and post your results from hour one, day one, and week one. In the ZIL use case, although the data set is small, the ZIL is always writing a small, ever-changing (from the SSD's perspective) data set. The SSD does not know to release previously written pages, and without TRIM there is no way to tell it to. That means every time a ZIL write happens, new SSD pages are consumed. After some amount of time all of the empty pages will have been consumed, and the SSD then has to fall into the read-erase-write cycle, which is incredibly slow and is the whole point of TRIM. I can assure you from my extensive benchmarking of all the major SSDs in the ZIL role that you will eventually not be happy. Depending on your use case it might take months, but eventually all those free pages will be consumed, and read-erase-write is how the SSD world works after that - unless you have TRIM, which we don't yet.
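To put rough, purely illustrative numbers on it (page and block sizes vary by drive): with 4KB pages and 512KB erase blocks, one block holds 128 pages. Once every page has been written, updating a single 4KB page means reading the other 127 pages, erasing the whole 512KB block, and writing all 128 pages back, i.e. up to 128x write amplification for one small ZIL write.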
Re: [zfs-discuss] ZFS and VMware
Don't waste your time with anything other than the DDRdrive for an NFS ZIL. If it's RAM-based it might work, but why risk it? If it's an SSD, forget it. No SSD will work well as the ZIL long term. Short term the only SSD to consider would be Intel, but again, long term even that will not work out for you. The 100% write characteristics of the ZIL are an SSD's worst-case scenario, especially without TRIM support. We have tried them all - Samsung, SanDisk, OCZ - and none of them worked out. In particular, anything SandForce 1500-based was the worst, so avoid those at all costs if you dare to try an SSD ZIL. Don't. :)

As for the queue depths, here's the command from the ZFS Evil Tuning Guide:

  echo zfs_vdev_max_pending/W0t10 | mdb -kw

The W0t10 part is what you change: 0t10 is decimal 10. The old default was 35 outstanding I/Os per device; 10 is the new value. For our NFS environment we found W0t2 was best, based on looking at the actual I/O with dtrace scripts. Email me if you want those scripts. They are here, but need to be edited before they work: http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io
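Note the mdb write only lasts until reboot. Once you've settled on a value, the usual way to make it persistent is the corresponding /etc/system entry (shown here with the value 10):

  * limit outstanding I/Os per vdev; takes effect on next boot
  set zfs:zfs_vdev_max_pending = 10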
Re: [zfs-discuss] ZFS and VMware
We are doing NFS with VMware 4.0U2 in production, 50K users, using OpenSolaris snv_134 on SuperMicro boxes with SATA drives. Yes, I am crazy. Our experience has been that iSCSI on ESXi 4.x is fast and works well with minimal fussing, until there is a problem. When that problem happens, getting to the data on VMFS LUNs, even with the free Java VMFS utility, is problematic at best and game over at worst. With NFS, data access in problem situations is a non-event: snapshots happen and everyone is happy. The problem with it is the VMware NFS client, which makes every write an FSYNC write, and that kills NFS performance dead. To get around that we're using DDRdrive X1s for our ZIL, and the problem is solved. I have not looked at the NFS client changes in 4.1; perhaps it's better, or at least tunable, now. I would recommend NFS as the overall strategy, but you must get a good ZIL device to make that happen. Do not disable the ZIL. Do make sure you set your I/O queue depths correctly.
Re: [zfs-discuss] Best usage of SSD-disk in ZFS system
Our experience has been that a new, out-of-the-box SSD works well as the ZIL, but as soon as it has been completely filled, performance drops below that of a regular SAS hard drive. This is due to the write penalty inherent in their fundamental design, their LBA mapping strategy, and the not-yet-released (to me at least) TRIM support in OpenSolaris. Considering this, we only use (safe) DRAM-based products for our ZILs, like the DDRdrive X1, which is incredible. For the L2ARC, SSDs are OK, but again, once they are full you still incur the write penalty when writing to the cache so you can read from it, which in some cases is also slower than a regular SAS drive. More RAM costs more but works better. I would recommend that you test not just initially but for at least a week. That should give you time to fill up the drive and see what I call its native performance, meaning how fast it can do the read-erase-write operation under full load.
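A crude way to force that filled state quickly, rather than waiting a week, is to write the raw device end to end once or twice before re-running your benchmark. This is only a sketch: the device name is made up, and it destroys everything on the device.

  # WARNING: wipes the device; use only on a scratch SSD
  dd if=/dev/zero of=/dev/rdsk/c5t0d0p0 bs=1024k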
[zfs-discuss] Victor L could you *please* help me with ZFS bug #6924390 - dedup hosed it
OpenSolaris snv_131 (the problem is also still present in snv_132) on an X4500, bug #6924390. Victor, I see from researching this issue that you know ZFS really well. I would very much appreciate your help, and this problem seems interesting. I created a large zpool named xpool and then created 3 filesystems on that pool, called vms, bkp, and alt. Of course I enabled dedup for the entire zpool - why not. Then today we decided to delete bkp, which was a 13TB filesystem with around 900GB of data in it. And now I am very familiar with bug #6924390. When I try to import the pool it seems to hang, but it's really just going very slowly; someone from OpenSolaris calculated it might take 2 weeks to import, so now I know why it's a bug. My main objective is to rescue the data in vms - none of the rest matters. I am currently booted into snv_132 via the ILOM and can boot to the network for ssh as well. Thank you very much in advance!