Re: [zfs-discuss] file concatenation with ZFS copy-on-write
Hi all,

I wonder if there has been any new development on this matter over the past six months. Today I pondered the idea of a ZFS-aware mv, capable of doing zero read/write of file data when moving files between datasets of one pool. This resembles the (z)cp idea proposed in this thread, and it seems like a trivial job for Sun, who have all the APIs and functional implementations for cloning and dedup as a means to reference the same block from different files. Such an extension to cp should be cheaper than generic dedup and useful for copying any templated file sets. I thought of local zones first, but most people probably init them from packages (though zoneadm says it is copying thousands of files), so /etc/skel might be a better example of the use case - though nearly useless ;)

Jim
Re: [zfs-discuss] Native ZFS for Linux
Peter Jeremy peter.jer...@alcatel-lucent.com wrote:
> On 2010-Jun-11 17:41:38 +0800, Joerg Schilling joerg.schill...@fokus.fraunhofer.de wrote:
>> PP.S.: Did you know that FreeBSD has _included_ the GPLd Reiserfs in the FreeBSD kernel for a while now, and that nobody has complained about this? See e.g.: http://svn.freebsd.org/base/stable/8/sys/gnu/fs/reiserfs/
>
> That is completely irrelevant and somewhat misleading. FreeBSD has never prohibited non-BSD-licensed code in their kernel or userland; however, it has always been optional and, AFAIR, the GENERIC kernel has always defaulted to containing only BSD code. Non-BSD code (whether GPL or CDDL) is carefully segregated (note the 'gnu' in the above URI).

Sorry, but your reply is completely misleading: the people who claim that there is a legal problem with having ZFS in the Linux kernel would of course also claim that Reiserfs cannot be in the FreeBSD kernel.

Jörg

-- 
Jörg Schilling, D-13353 Berlin
EMail: jo...@schily.isdn.cs.tu-berlin.de (home), j...@cs.tu-berlin.de (uni), joerg.schill...@fokus.fraunhofer.de (work)
Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
[zfs-discuss] High-Performance ZFS (2000MB/s+)
Hi,

We are currently building a storage box based on OpenSolaris/Nexenta using ZFS. Our hardware specifications are as follows:

- Quad AMD G34 12-core 2.3 GHz (~110 GHz aggregate)
- 10 Crucial RealSSD (6Gb/s)
- 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
- LSI2008 SAS (two 4x ports)
- Mellanox InfiniBand 40 Gbit NICs
- 128 GB RAM

This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box. Both L2ARC and ZIL share the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k read/write scenario with a 70/30 distribution.

Now, I know that you should have a mirrored ZIL for safety, but the entire box is synchronized with an active standby at a different site (18km away - round trip of 0.16ms plus equipment latency). So in case the ZIL in site A takes a fall, or the motherboard/disk group dies - we still have safety.

DDT requirements for dedup on 16k blocks should be about 640GB when the main pool is full (capacity).

Without going into details about chipsets and such, do any of you on this list have experience with a similar setup and can share your thoughts, do's and don'ts, and any other information that could help while building and configuring this? What I want to achieve is 2 GB/s+ of NFS traffic against our ESX clusters (also InfiniBand-based), with both dedup and compression enabled in ZFS.

Let's talk moon landings.

Regards,
Arve
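[A rough sanity check of that 640GB estimate, assuming the ~250 bytes per DDT entry quoted later in this thread - the figures are assumptions, and the actual entry size varies by build:

    # entries = usable pool size / average block size
    entries=$(( 40 * 1024 * 1024 * 1024 * 1024 / (16 * 1024) ))   # 40 TB of 16k blocks
    echo "$(( entries * 250 / 1024 / 1024 / 1024 )) GiB of DDT"   # prints ~625

So ~625 GiB, consistent with the "about 640GB" figure above.]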
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 6/15/2010 4:42 AM, Arve Paalsrud wrote:
> [hardware specification and questions quoted; trimmed]

Given that for the ZIL, random write IOPS is paramount, the RealSSD isn't a good choice. SLC SSDs still spank any MLC device, and random IOPS for something like an Intel X25-E or OCZ Vertex EX are more than twice that of the RealSSD. I don't know where they got the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they measure it?), but that's not what others are reporting:
http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7

Sadly, none of the current crop of SSDs has a capacitor or battery to back up its local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series are likely much more cost-effective than a RealSSD, especially considering price/performance.

Also, given the limitations of a x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. The SSDs together are likely to be able to overwhelm a x4 PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for them. For the 42 disks, it depends more on what your workload looks like. If it is mostly small or random I/O to the disks, you can get away with fewer HBAs. Large, sequential I/O to the disks is going to require more HBAs. Remember, a modern 7200RPM SATA drive can pump out well over 100MB/s sequential, but well under 10MB/s random. Do the math to see how fast that overwhelms the x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.

I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so.
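[The device layout suggested here would look roughly like this - a sketch with hypothetical device names, not the poster's actual configuration:

    # mirrored slog from two X25-Es; slog redundancy matters
    zpool add tank log mirror c2t0d0 c2t1d0
    # L2ARC cache devices need no redundancy - they only hold copies
    zpool add tank cache c2t2d0 c2t3d0 c2t4d0
]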
Really, I suspect that paying for 6Gb/s SAS isn't worth it at all, as only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.

I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): about 200 bytes of RAM are needed for each L2ARC entry. I.e., if you have a 1k average record size, then for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM. If you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

-- 
Erik Trimble
Java System Support
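[Erik's RAM arithmetic can be checked in a shell; the 200-bytes-per-L2ARC-entry figure is the assumption here:

    # ARC memory needed to index 600GB of L2ARC at various record sizes
    for rs in 1 8 128; do   # record size in KB
        echo "${rs}K records: $(( 600 * 1024 * 1024 / rs * 200 / 1024 / 1024 )) MB of ARC"
    done
    # prints ~120000 MB, ~15000 MB and ~937 MB respectively
]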
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 15/06/2010 14:09, Erik Trimble wrote:
> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT,

The point of L2ARC is that you start adding L2ARC when you can no longer physically fit (or afford) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.

-- 
Darren J Moffat
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 6/15/2010 6:17 AM, Darren J Moffat wrote:
> The point of L2ARC is that you start adding L2ARC when you can no longer physically fit (or afford) any more DRAM, so if the OP can afford to put in 128GB of RAM then they should.

True. I was speaking price/performance. Those 8GB DIMMs are still pretty darned pricey...

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. [...] About 200 bytes of RAM are needed for each L2ARC entry. I.e., if you have a 1k average record size, then for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM. If you have a more manageable 8k record size, then 600GB / 8kB * 200B = 15GB.

Now I'm confused. The first thing I heard was that about 160 bytes are needed per DDT entry. Later, someone else told me 270. Now you say 200. Also, there should be a good way to list the total number of blocks (zdb just crashed after filling memory on my 10TB test box). I tried browsing the source to find the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?

Vennlige hilsener / Best regards

roy
-- 
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
-- 
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 6/15/2010 6:40 AM, Roy Sigurd Karlsbakk wrote:
> Now I'm confused. The first thing I heard was that about 160 bytes are needed per DDT entry. Later, someone else told me 270. Now you say 200. [...]

A DDT entry takes up about 250 bytes, regardless of where it is stored. For every normal (i.e. block, metadata, etc. - NOT DDT) L2ARC entry, about 200 bytes have to be stored in main memory (the ARC).

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Jun 15, 2010, at 6:40 AM, Roy Sigurd Karlsbakk wrote:
> I tried browsing the source to find the size of the DDT struct, but I got lost. Can someone with an osol development environment please just check sizeof that struct?

Why read source when you can read the output of zdb -D? :-)
-- richard

-- 
Richard Elling
rich...@nexenta.com +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
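[For example - pool name hypothetical, and the output shape shown is only illustrative; more Ds give more detail:

    zdb -D tank     # DDT summary: entry counts plus dedup/compress ratios
    zdb -DD tank    # adds per-table lines reporting bytes per entry,
                    # e.g. "DDT-sha256-zap-duplicate: N entries,
                    #       size X on disk, Y in core"
]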
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Jun 15, 2010, at 4:42 AM, Arve Paalsrud wrote:
> [hardware specification quoted; trimmed]
> What I want to achieve is 2 GB/s+ of NFS traffic against our ESX clusters (also InfiniBand-based), with both dedup and compression enabled in ZFS.

In general, both dedup and compression gain space by trading off performance. You should take a closer look at snapshots + clones, because they gain performance by trading off systems management. Also, you can't size by ESX server, because ESX works (mostly) as a pass-through of the client VM workload. In your sizing calculations, think of ESX as a fancy network switch.
-- richard

-- 
Richard Elling
rich...@nexenta.com +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
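[The snapshot + clone route mentioned here looks like this for a golden VM image - dataset names hypothetical; clones are near-instant and share all unmodified blocks:

    zfs snapshot tank/esx/golden@deploy
    zfs clone tank/esx/golden@deploy tank/esx/vm01
    zfs clone tank/esx/golden@deploy tank/esx/vm02
]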
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, 2010-06-15 at 04:42 -0700, Arve Paalsrud wrote:
> Mellanox InfiniBand 40 Gbit NICs

Just recognize that those NICs are IB only. Solaris currently does not support 10GbE using Mellanox products, even though other operating systems do. (There are folks working on resolving this, but I think we're still a couple of months from seeing the results of that effort.)

> This setup gives us about 40TB of storage after mirroring (two disks as spares), 2.5TB of L2ARC and 64GB of ZIL [...] So in case the ZIL in site A takes a fall, or the motherboard/disk group dies - we still have safety.

I expect that you need more space for L2ARC and a lot less for ZIL. Furthermore, you'd be better served by an even lower-latency/higher-IOPS ZIL. If you're going to spend this kind of cash, I'd recommend at least one or two DDRdrive X1 units or something similar. While not very big, you don't need much to get a huge benefit from the ZIL, and I think the vastly superior IOPS of these units will pay off in the end.

> DDT requirements for dedup on 16k blocks should be about 640GB when the main pool is full (capacity).

Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.

-- Garrett
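[One way to "look hard at your data" before committing: zdb can simulate dedup on an existing pool without enabling it. It walks the whole pool, so it takes a while; pool name hypothetical:

    zdb -S tank
    # prints a simulated DDT histogram ending in the expected dedup ratio
]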
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, 2010-06-15 at 07:36 -0700, Richard Elling wrote:
> In general, both dedup and compression gain space by trading off performance.

It depends on the usage. Note that for some uses, compression can be a performance *win*, because CPUs are generally fast enough that the cost of decompression beats the cost of the larger I/Os required to transfer uncompressed data. Of course, that assumes you have CPU cycles to spare.

-- Garrett
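[Compression is a per-dataset property, so this point is easy to test on a representative dataset and measure afterwards - dataset name hypothetical; lzjb was the cheap default codec of this era:

    zfs set compression=lzjb tank/vmdata
    # ...write some representative data, then:
    zfs get compressratio tank/vmdata
]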
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, Jun 15, 2010 at 3:09 PM, Erik Trimble erik.trim...@oracle.com wrote:
> Given that for the ZIL, random write IOPS is paramount, the RealSSD isn't a good choice. [...] I don't know where they got the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they measure it?), but that's not what others are reporting.

See http://www.anandtech.com/show/2944/3 and http://www.crucial.com/pdf/Datasheets-letter_C300_RealSSD_v2-5-10_online.pdf

But I agree that we should look into using the Vertex instead.

> Sadly, none of the current crop of SSDs has a capacitor or battery to back up its local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Noted.

> Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series are likely much more cost-effective than a RealSSD, especially considering price/performance.

Our other option is to use two Fusion-io ioDrive Duo SLC/MLC cards, or the SMLC when available (as well as drivers for Solaris) - so the price we're currently talking about is not an issue.

> Also, given the limitations of a x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. [...] Do the math to see how fast that overwhelms the x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.

We're talking about 4x SAS 6Gb/s lanes - 4800MB/s per port. See http://www.lsi.com/DistributionSystem/AssetDocument/SCG_LSISAS2008_PB_043009.pdf for specifications of the LSI chip. In other words, it utilizes PCIe 2.0 8x.

> I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gbit/s SAS is wasted on HDs, so don't bother paying for it if you can avoid doing so. Really, I suspect that paying for 6Gb/s SAS isn't worth it at all, as only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.

What about bandwidth in this scenario? Won't the ZIL be limited to the throughput of only one X25-E? The SATA disks operate at 3Gb/s through the SAS expanders, so no 6Gb/s there.

> I'm going to say something sacrilegious here: 128GB of RAM may be overkill. [...]
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, Jun 15, 2010 at 4:20 PM, Erik Trimble erik.trim...@oracle.com wrote:
> On 6/15/2010 6:57 AM, Arve Paalsrud wrote:
>> What about bandwidth in this scenario? Won't the ZIL be limited to the throughput of only one X25-E? [...]
>
> Yes - though I'm not sure how the slog devices work when there is more than one. I *don't* think they work like the L2ARC devices, which work round-robin. You'd have to ask. If they're doing a true stripe, then I doubt you'll get much more performance, as weird as that sounds. Also, even with a single X25-E, you can service a huge number of IOPS - likely more small IOPS than can be pushed over even an InfiniBand interface. The place where InfiniBand would certainly outpace the X25-E's capacity is large writes, where a single 100MB write would suck up all the X25-E's throughput capability.

But the Intel X25-E is limited to about 200 MB/s of writes, regardless of IOPS. So when throwing a lot of 16k IOPS at it (about 13,000), it will still be limited to 200 MB/s - or about 6-7% of the capacity of a QDR InfiniBand link. So I hereby officially ask: can I have multiple slogs striped to handle higher bandwidth than a single device can - is that supported in ZFS?

- Arve
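[For the record, the syntax distinction being asked about - device names hypothetical; listing log devices without 'mirror' creates separate top-level log vdevs, across which ZFS spreads slog writes:

    zpool add tank log c2t0d0 c2t1d0          # two independent (striped) slogs
    zpool add tank log mirror c2t0d0 c2t1d0   # one mirrored slog
]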
[zfs-discuss] Dedup... still in beta status
Data: 90% of current computers have less than 9 GB of RAM, and less than 5% have SSDs. Take a standard storage computer with a capacity of 4 TB: dedup on, dataset with 32 kb blocks, 2 TB of data in use... you need about 16 GB of memory just for the DDT. But you will not see this until it's too late. That is, we will work with the system and performance will be good... little by little we will see that write performance is dropping... then we will see that the system crashes randomly (when deleting automatic snapshots)... and finally we will see that disabling dedup doesn't solve it.

You may say that dedup has some requirements, and that is true. But what is also true is that even on systems with large amounts of RAM (by the usual standards), everyday operations such as deleting files or destroying datasets/snapshots give us decreased performance, even totally blocking the system, and that is not admissible. So maybe it would be desirable to place dedup in a freeze (beta or development status) until we can get a stable version, so that any necessary changes can be made in the core of ZFS to allow its use without compromising the integrity of the entire system (e.g., making the erasing of blocks multithreaded).

And what can we do if we have a system already contaminated with dedup?

1. Disable snapshots.
2. Create a new dataset without dedup and copy the data to the new dataset.
3. After copying the data, delete the snapshots - first the smaller ones; if there is a bigger snapshot (more than 10 Gb), make a progressive rollback to it (thus the snapshot will use 0 bytes) and then we can delete it.
4. When there are no snapshots left in the dataset, remove all files slowly (in batches).
5. Finally, when there are no files left, destroy the dataset.

If we skip any of these steps (and directly try to delete a snapshot with 95 Gb), the system will crash. If we try to destroy the dataset and the system crashes, the system will crash again on reboot (since the operation will keep trying to complete the erase).

My test system: AMD Athlon X2 5400, 8 Gb RAM, RAIDZ 3 TB, dataset 1.7 Tb, snapshot: 87 Gb. Tested with OSOL b134, EON 0.6, Nexenta Core 3.02 and NexentaStor Enterprise 3.02. All systems freeze when trying to delete snapshots. Finally, with rollback I could delete all snapshots, but when trying to destroy the dataset... the system is still processing the order (after 20 hours...).
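[The recovery sequence above, as a hedged shell sketch - dataset names hypothetical; the ordering (snapshots first, then files in batches, then the dataset) is the poster's, not an official procedure:

    zfs create -o dedup=off tank/clean
    rsync -a /tank/dirty/ /tank/clean/      # re-writes the data without dedup
    zfs rollback -r tank/dirty@big          # progressive rollback empties the big snapshot
    zfs destroy tank/dirty@big              # then delete snapshots, smallest first
    # ...remove the remaining files in batches, then finally:
    zfs destroy tank/dirty
]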
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, Jun 15, 2010 at 3:33 PM, Erik Trimble erik.trim...@oracle.com wrote:
> True. I was speaking price/performance. Those 8GB DIMMs are still pretty darned pricey...

The motherboard has 32 DIMM slots - using 32 4GB modules makes 128GB quite affordable :)

-Arve
[zfs-discuss] OCZ Devena line of enterprise SSD
OCZ has a new line of enterprise SSDs based on the SandForce 1500 controller. These three have a supercapacitor:

OCZ Deneva Reliability 2.5 MLC SSD: http://www.oczenterprise.com/products/details/ocz-deneva-reliability-2-5-mlc-ssd.html
OCZ Deneva Reliability 2.5 SLC SSD: http://www.oczenterprise.com/products/details/ocz-deneva-reliability-2-5-slc-ssd.html
OCZ Deneva Reliability 2.5 eMLC SSD: http://www.oczenterprise.com/products/details/ocz-deneva-reliability-2-5-emlc-ssd.html

Any thoughts on how well they would work as L2ARC/ZIL drives?

Roger Hernandez
Re: [zfs-discuss] COMSTAR dropouts with dedup enabled
Thanks Brandon,

This system has 24GB of RAM and currently no L2ARC. The total deduplicated data was about 250GB, so I wouldn't have thought I would be out of RAM. I've removed the LUN for the time being, so I can't get the DDT size at the moment. I have some X25-Es to go in as L2ARC and slog, so I'll revisit dedup soon to see if that helps the issue.

-Matt
Re: [zfs-discuss] ZFS Data Loss on system crash/upgrade
Hi Austin,

Not much help, as it turns out. I don't see any evidence that a recovery mechanism, where you might lose a few seconds of data transactions, was triggered. It almost sounds like your file system was rolled back to a previous snapshot, because the data is lost as of a certain date, but I don't see any evidence of a rollback either. I'm stumped at this point, but maybe someone else has ideas.

Is it possible that hardware failures caused the outright removal of all data after a certain date? It doesn't seem possible. You can review how the critical hardware failures were impacting your ZFS pools by reviewing the contents of fmdump -eV. It's a lot of output to sort through, but look for checksum errors and other problems. Still, ongoing checksum errors might result in data corruption, but not the total loss of data after a certain date.

Can you recover your data from your existing snapshots?

Cindy

On 06/14/10 21:44, Austin Rotondo wrote:
> Cindy,
> The log is quite long, so I've attached a text file of the command output. The last command in the log before the system crash was:
> 2010-04-07.20:09:06 zpool scrub zarray1
> The system crashed sometime after 4/20/10, which is the date of the last file I have a record of creating. I looked through the log and didn't see anything unusual, but I'm definitely no expert on the subject.
> Thanks for your help,
> Austin
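[A minimal sketch of sifting that fmdump output - the class name is the standard ZFS checksum ereport class, and the date format follows fmdump(1M):

    fmdump -e -t 01Apr10                    # one-line ereport summaries since April 1
    fmdump -eV -c ereport.fs.zfs.checksum   # full detail, checksum errors only
]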
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
From: Garrett D'Amore [mailto:garr...@nexenta.com]
Sent: 15. juni 2010 17:43

> Just recognize that those NICs are IB only. Solaris currently does not support 10GbE using Mellanox products, even though other operating systems do. (There are folks working on resolving this, but I think we're still a couple of months from seeing the results of that effort.)
>
> I expect that you need more space for L2ARC and a lot less for ZIL. Furthermore, you'd be better served by an even lower-latency/higher-IOPS ZIL. If you're going to spend this kind of cash, I'd recommend at least one or two DDRdrive X1 units or something similar. While not very big, you don't need much to get a huge benefit from the ZIL, and I think the vastly superior IOPS of these units will pay off in the end.

What about the ZIL bandwidth in this case? I mean, could I stripe across multiple devices to handle higher throughput? Otherwise I would still be limited to the performance of the unit itself (155 MB/s).

> Dedup is not always a win, I think. I'd look hard at your data and usage to determine whether to use it.

-Arve
[zfs-discuss] zpool export / import discrepancy
Hello All,

I've migrated a JBOD of 16 drives from one server to another. I did a zpool export from the old system and a zpool import on the new system. One thing I did notice: since the drives are on a different controller card, the naming is different (as expected), but the order is also different. I set up the drives as passthrough on the controller card and went through each drive incrementally. I assumed the zpool import would have listed the drives in the order c10t2d0, d1, d2, ... c10t3d7. As shown below, the order the drives were imported in is c10t2d0, d2, d3, d1, then c10t3d0 through d7.

Original zpool setup on the old server:

    # zpool status backup
      pool: backup
     state: ONLINE
    config:
            NAME         STATE     READ WRITE CKSUM
            backup       ONLINE       0     0     0
              raidz2     ONLINE       0     0     0
                c7t1d0   ONLINE       0     0     0
                c7t2d0   ONLINE       0     0     0
                c7t3d0   ONLINE       0     0     0
                c7t4d0   ONLINE       0     0     0
                c7t5d0   ONLINE       0     0     0
                c7t6d0   ONLINE       0     0     0
                c7t7d0   ONLINE       0     0     0
                c7t8d0   ONLINE       0     0     0
                c7t9d0   ONLINE       0     0     0
                c7t10d0  ONLINE       0     0     0
                c7t11d0  ONLINE       0     0     0
                c7t12d0  ONLINE       0     0     0
                c7t13d0  ONLINE       0     0     0
                c7t14d0  ONLINE       0     0     0
                c7t15d0  ONLINE       0     0     0
            spares
              c7t16d0    AVAIL

Imported zpool on the new server:

    # zpool status backup
      pool: backup
     state: ONLINE
    config:
            NAME         STATE     READ WRITE CKSUM
            backup       ONLINE       0     0     0
              raidz2     ONLINE       0     0     0
                c10t2d0  ONLINE       0     0     0
                c10t2d2  ONLINE       0     0     0
                c10t2d3  ONLINE       0     0     0
                c10t2d1  ONLINE       0     0     0
                c10t2d4  ONLINE       0     0     0
                c10t2d5  ONLINE       0     0     0
                c10t2d6  ONLINE       0     0     0
                c10t2d7  ONLINE       0     0     0
                c10t3d0  ONLINE       0     0     0
                c10t3d1  ONLINE       0     0     0
                c10t3d2  ONLINE       0     0     0
                c10t3d3  ONLINE       0     0     0
                c10t3d4  ONLINE       0     0     0
                c10t3d5  ONLINE       0     0     0
                c10t3d6  ONLINE       0     0     0
            spares
              c10t3d7    AVAIL

Is ZFS dependent on the order of the drives? Will this cause any issues down the road?

Thank you all,
Scott
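[Device order should not matter: pool members are identified by the GUIDs stored in their on-disk labels, not by controller path or listing order. You can see this on any member disk (device name taken from the listing above):

    zdb -l /dev/rdsk/c10t2d0s0
    # each of the four labels carries the pool name, pool GUID, and this
    # vdev's guid/top_guid, which is what import actually matches on
]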
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On Tue, 2010-06-15 at 18:33 +0200, Arve Paalsrud wrote:
> What about the ZIL bandwidth in this case? I mean, could I stripe across multiple devices to handle higher throughput? Otherwise I would still be limited to the performance of the unit itself (155 MB/s).

I think so. By the way, I've gotten better performance than that with my driver (not sure about the production driver). I seem to recall about 220 MB/sec. (I was basically driving the PCIe x1 bus to its limit.) This was with large transfers (sized at 64k, IIRC). Shrinking the job size down, I could get up to 150K IOPS with 512-byte jobs. (That high IOP rate is unrealistic for ZFS - for ZFS, the bus bandwidth limitation comes into play long before you start hitting IOPS limitations.) One issue, of course, is that each of these units occupies a PCIe x1 slot.

On another note, if your dataset and usage requirements don't demand strict I/O flush/sync guarantees, you could probably get away without any ZIL at all and just use lots of RAM to get really good performance. (You'd then disable the ZIL on filesystems that don't have this need. This is a very new feature in OpenSolaris.) Of course, you don't want to do this for data sets where loss of the data would be tragic. (But it's ideal for situations such as filesystems used for compiling, etc. - where the data being written can easily be regenerated in the event of a failure.)

-- Garrett
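[The per-dataset control Garrett refers to - integrated into OpenSolaris around build 140; on older builds only the global zil_disable tunable existed:

    zfs set sync=disabled tank/build    # unsafe for data you cannot regenerate
    zfs set sync=standard tank/build    # back to normal POSIX sync semantics
]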
[zfs-discuss] NexentaStor Community edition 3.0.3 released
Hi All,

On behalf of the NexentaStor team, I'm happy to announce the release of NexentaStor Community Edition 3.0.3. This release is the result of the community efforts of Nexenta partners and users.

Changes over 3.0.2 include:
* Many fixes to ON/ZFS backported to b134.
* Multiple bug fixes in the appliance.

With the addition of many new features, NexentaStor CE is the *most complete* and feature-rich gratis unified storage solution today.

Quick Summary of Features
-
* ZFS additions: deduplication (based on OpenSolaris b134).
* Free for up to 12 TB of *used* storage.
* Community edition supports easy upgrades.
* Many new features in the easy-to-use management interface.
* Integrated search.

Grab the ISO from http://www.nexentastor.org/projects/site/wiki/CommunityEdition

If you are a storage solution provider, we invite you to join our growing social network at http://people.nexenta.com.

-- 
Thanks,
Anil Gulecha
Community Leader
http://www.nexentastor.org
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
On Mon, Jun 14, 2010 at 2:07 PM, Roger Hernandez rhvar...@gmail.com wrote:
> OCZ has a new line of enterprise SSDs, based on the SandForce 1500 controller.

The SLC-based drive should be great as a ZIL, and the MLC drives should be a close second. Neither is cost-effective as an L2ARC, since the cache device doesn't require resiliency or high random IOPS. A previous-generation drive (such as the Vertex or X25-M) is probably sufficient.

-B

-- 
Brandon High : bh...@freaks.com
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 9:03 AM, Fco Javier Garcia wrote:
> [dedup stability report quoted; trimmed]

Frankly, dedup isn't practical for anything but enterprise-class machines. It's certainly not practical for desktops or anything remotely low-end. This isn't just a ZFS issue - all the implementations I've seen so far require enterprise-class solutions.

Realistically, I think people are overly enamored with dedup as a feature - I would generally only consider it worthwhile in cases where you get significant savings. And by significant, I'm talking an order of magnitude of space savings. A 2x saving isn't really enough to counteract the downsides, especially when even enterprise disk space is (relatively) cheap.

That all said, ZFS dedup is still definitely beta. There are known severe bugs and performance issues which will take time to fix, as not all of them have obvious solutions. Given current schedules, I predict that it should be production-ready some time in 2011. *When* in 2011, I couldn't hazard... maybe in time to make Solaris 10 Update 12 or so? <grin>

-- 
Erik Trimble
Java System Support
[zfs-discuss] SSDs adequate ZIL devices?
There have been many threads in the past asking about ZIL devices. Most of them end up recommending the Intel X25 as an adequate device. Nevertheless, there is always the warning about them not heeding cache flushes. But what use is a ZIL that ignores cache flushes? If I'm willing to tolerate that (I'm not), I can just as well take a mechanical drive and force ZFS not to issue cache flushes to it. In that case it can easily compete with an SSD in regard to IOPS and bandwidth. In case of a power failure I will likely lose about as many writes as I would with an SSD - a few milliseconds' worth. So why buy an SSD for the ZIL at all?
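[Forcing ZFS not to issue cache flushes, as the poster describes, is a real if dangerous knob - and a global one, affecting every pool on the host:

    # permanently, via /etc/system (takes effect after reboot):
    #   set zfs:zfs_nocacheflush = 1
    # or on a live system, for testing only:
    echo zfs_nocacheflush/W0t1 | mdb -kw
]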
Re: [zfs-discuss] Dedup... still in beta status
> Realistically, I think people are overly enamored with dedup as a feature - I would generally only consider it worthwhile in cases where you get significant savings. And by significant, I'm talking an order of magnitude of space savings. A 2x saving isn't really enough to counteract the downsides, especially when even enterprise disk space is (relatively) cheap.

I think dedup may have its greatest appeal in VDI environments. Think about an environment where 85% of the data that a virtual machine needs is in the ARC or L2ARC - it's like a dream: almost instantaneous response, and you can boot a new machine in a few seconds.

> That all said, ZFS dedup is still definitely beta. There are known severe bugs and performance issues which will take time to fix, as not all of them have obvious solutions. Given current schedules, I predict that it should be production-ready some time in 2011. *When* in 2011, I couldn't hazard... maybe in time to make Solaris 10 Update 12 or so?

Yes... so you can start patching Solaris on Monday... and perhaps it will be finished on Tuesday (but next week).
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 10:52 AM, Erik Trimble wrote:
> [previous message quoted; trimmed]

One thing here - I forgot to say that this is my opinion, based on my observations and conversations on this list, and I in no way speak for Oracle officially, or as a member of the ZFS team (which I'm not).

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] Dedup... still in beta status
From: Fco Javier Garcia
Sent: Tuesday, June 15, 2010 11:21 AM

> I think dedup may have its greatest appeal in VDI environments. Think about an environment where 85% of the data that a virtual machine needs is in the ARC or L2ARC - it's like a dream: almost instantaneous response, and you can boot a new machine in a few seconds.

Does dedup help in the ARC/L2ARC space? For some reason, I have it in my head that each time a block is requested from storage it is copied into the cache; therefore, if I had 10 VMs requesting the same deduped block, there would be 10 copies of the same block in the ARC/L2ARC.

Geoff
Re: [zfs-discuss] Dedup... still in beta status
> or as a member of the ZFS team (which I'm not).

Then you have to be brutally good with Java.
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 11:49 AM, Geoff Nordli wrote:
> Does dedup help in the ARC/L2ARC space? For some reason, I have it in my head that each time a block is requested from storage it is copied into the cache; therefore, if I had 10 VMs requesting the same deduped block, there would be 10 copies of the same block in the ARC/L2ARC.

No, that's not correct. It's the *same* block, regardless of where it was referenced from. The cached block has no idea where it was referenced from (that's in the metadata). So even if I have 10 VMs requesting access to 10 different files, if those files have been deduped, then any common (i.e. deduped) blocks will be stored only once in the ARC/L2ARC.

-- 
Erik Trimble
Java System Support
Re: [zfs-discuss] Dedup... still in beta status
On 6/15/2010 11:53 AM, Fco Javier Garcia wrote:
> Then you have to be brutally good with Java.

Thanks, but I do get it wrong every so often (hopefully rarely). More importantly, I don't know anything about the internal goings-on of the ZFS team, so I have nothing extra to say about schedules, plans, timing, etc. that everyone else doesn't already know. I can only speculate based on what's been said publicly on those topics. E.g., I wish I knew when certain bugs would be fixed, but I don't have any more visibility into that than the public does.

-- 
Erik Trimble
Java System Support
[zfs-discuss] Complete Linux Noob
I have been researching different types of RAID, and I happened across raidz, and I am blown away. I have been trying to find resources to answer some of my questions, but many of them are either over my head in terms of detail or foreign to me, as I am a Linux noob, and I have to admit I have never even looked at Solaris.

Are the parity drives just that - drives assigned to parity - or is the parity shared over several drives?

I understand that you can build a raidz2 that will have 2 parity disks. So in theory I could lose 2 disks and still rebuild my array, so long as they are not both the parity disks, correct?

I understand that you can have spares assigned to the RAID, so that if a drive fails, it will immediately grab the spare and rebuild the damaged drive's contents. Is this correct?

Now, I cannot find anything on how much space is taken up in a raidz1 or raidz2. If all the drives are the same size, does a raidz2 take up the space of 2 of the drives for parity, or is the space calculation different?

I get that you cannot expand a raidz as you would a normal RAID, by simply slapping on a drive. Instead, it seems that the preferred method is to create a new raidz. Now let's say that I want to add another raidz1 to my system - can I get the OS to present this as one big drive with the space from both raid pools?

How do I share these types of raid pools across the network? Or more specifically, how do I access them from Windows-based systems? Is there any special trick?
Re: [zfs-discuss] Complete Linux Noob
> snip/
> How do I share these types of raid pools across the network? Or more specifically, how do I access them from Windows-based systems? Is there any special trick?

Most of your questions are answered here:
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/zfslast.pdf

Vennlige hilsener / Best regards

roy
-- 
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
Re: [zfs-discuss] Native ZFS for Linux
On Tue, 15 Jun 2010, Joerg Schilling wrote:
> Sorry, but your reply is completely misleading: the people who claim that there is a legal problem with having ZFS in the Linux kernel would of course also claim that Reiserfs cannot be in the FreeBSD kernel.

It seems that it is a license violation to link a computer containing GPLed code to the Internet. I think I heard on Usenet or a blog that it was illegal to link GPLed code with non-GPLed code. The Internet itself is obviously a derived work and is therefore subject to the GPL.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
Price? I cannot find it. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Native ZFS for Linux
Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 15 Jun 2010, Joerg Schilling wrote: Sorry but your reply is completely misleading as the people who claim that there is a legal problem with having ZFS in the Linux kernel would of course also claim that Reiserfs cannot be in the FreeBSD kernel. It seems that it is a license violation to link a computer containing GPLed code to the Internet. I think I heard on Usenet or a blog that it was illegal to link GPLed code with non-GPLed code. The Internet itself is obviously a derived work and is therefore subject to the GPL.

This is what e.g. Lawrence Rosen also mentions ;-) BTW: Our preliminary license compatibility information is now on-line: http://www.osscc.net/en/licenses.html#compatibility To switch to German, use the top level at: http://www.osscc.net/en/index.html

Most people may know the Open Source book by Lawrence Rosen (see the link on our web page). I have a new paper on license combinations by my colleague Tom Gordon (a US lawyer) on our server at: http://www.osscc.net/pdf/QualipsoA1D113.pdf

Hope this helps to understand things better. Jörg -- EMail: jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin j...@cs.tu-berlin.de (uni) joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Complete Linux Noob
On Tue, June 15, 2010 14:13, CarlPalmer wrote: I have been researching different types of RAID, and I happened across raidz, and I am blown away. I have been trying to find resources to answer some of my questions, but many of them are either over my head in terms of details, or foreign to me as I am a Linux noob, and I have to admit I have never even looked at Solaris.

Heh; caught another one :-) .

Are the parity drives just that, a drive assigned to parity, or is the parity shared over several drives?

No drives are formally designated for parity; all n drives in the RAIDZ vdev are used together in such a way that you can lose one drive without loss of data, but exactly which bits are data and which bits are parity, and where they are stored, is not something the admin has to think about or know (and in fact cannot know).

I understand that you can build a raidz2 that will have 2 parity disks. So in theory I could lose 2 disks and still rebuild my array so long as they are not both the parity disks, correct?

Any two disks out of a raidz2 vdev can be lost. Lose a third before the recovery completes and your data is toast.

I understand that you can have spares assigned to the RAID, so that if a drive fails, it will immediately grab the spare and rebuild the damaged drive. Is this correct?

Yes, RAIDZ (including z2 and z3) and mirror vdevs will grab a hot spare if one is assigned and needed, and start the resilvering operation immediately.

Now I cannot find anything on how much space is taken up in a raidz1 or raidz2. If all the drives are the same size, does a raidz2 take up the space of 2 of the drives for parity, or is the space calculation different?

That's the right calculation.

I get that you cannot expand a raidz as you would a normal RAID, by simply slapping on a drive. Instead it seems that the preferred method is to create a new raidz. Now let's say that I want to add another raidz1 to my system; can I get the OS to present this as one big drive with the space from both raid pools?

You can't expand a normal RAID, either, anywhere I've ever seen. A pool can contain multiple vdevs. You can add additional vdevs to a pool, and the new space becomes immediately available to the pool, and hence to anything (like a filesystem) drawing from that pool. (The zpool command will attempt to stop you from mixing vdevs of different redundancy in the same pool, but you can force it to let you. Mixing a RAIDZ vdev and a RAIDZ3 vdev in the same pool is a silly thing to do, since you don't control where in the pool any new data goes, and it's likely to be striped across the vdevs in the pool.)

You can also replace all the drives in a vdev, serially (waiting for the resilver to complete at each step before continuing to the next drive), and if the new drives are larger than the old drives, when you've replaced all of them the new space will be usable in that vdev. This is particularly useful with mirrors, where there are only two drives to replace. (Well, actually, ZFS mirrors can have any number of drives. To avoid the risk of loss when upgrading the drives in a mirror, attach the new bigger drive FIRST, wait for the resilver, and THEN detach one of the smaller original drives; repeat for the second drive, and you will never go to a redundancy lower than 2. You can even attach BOTH new disks at once, if you have the slots and controller space, and have a 4-way mirror for a while. Somebody reported configuring ALL the drives in a 'Thumper' as a mirror, a 48-way mirror, just to see if it worked.
It did.)

How do I share these types of raid pools across the network? Or more specifically, how do I access them from Windows-based systems? Is there any special trick?

Nothing special. The in-kernel CIFS server is better than Samba, and supports full NTFS ACLs. I hear it also attaches to AD cleanly, but I haven't done that; I don't run AD at home. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
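To make the mechanics above concrete, here is a minimal sketch of the operations described in this message. The device names (c0t0d0 and friends) and pool/dataset names are placeholders; check zpool(1M) and zfs(1M) on your own system before running anything like this on real disks:

    # create a pool with one raidz2 vdev and a hot spare
    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 spare c0t5d0

    # grow the pool later by adding a second raidz2 vdev;
    # the new space is available to the pool (and its filesystems) immediately
    zpool add tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0

    # upgrade a two-way mirror to bigger disks without dropping redundancy:
    # attach the new disk first, wait for the resilver, then detach an old one
    zpool create mpool mirror c2t0d0 c2t1d0
    zpool attach mpool c2t0d0 c3t0d0     # temporarily a 3-way mirror
    zpool status mpool                   # wait until the resilver completes
    zpool detach mpool c2t0d0

    # share a filesystem to Windows clients via the in-kernel CIFS server
    zfs create tank/share
    zfs set sharesmb=on tank/share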
Re: [zfs-discuss] zpool export / import discrepancy
On Tue, Jun 15, 2010 at 1:56 PM, Scott Squires ssqui...@gmail.com wrote: Is ZFS dependent on the order of the drives? Will this cause any issue down the road? Thank you all;

No. In your case the logical device names changed, but ZFS still identified the disks correctly, just as they were before. -- Giovanni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSDs adequate ZIL devices?
On Tue, 15 Jun 2010, Arne Jansen wrote: In case of a power failure I will likely lose about as many writes as I do with SSDs, a few milliseconds. I agree with your concerns, but the data loss may span as much as 30 seconds rather than just a few milliseconds. Using an SSD as the ZIL allows zfs to turn a synchronous write into a normal batched async write which is scheduled for the next TXG. Zfs intentionally postpones writes. Without the SSD, zfs needs to write to an intent log in the main pool (consuming precious IOPS) or write directly to the main pool (consuming precious response latency). Battery-backed RAM in the adaptor card or storage array can do almost as well as the SSD as long as the amount of data does not overrun the limited write cache. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
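For reference, attaching a dedicated slog (or L2ARC) to an existing pool is a one-liner; the pool and device names below are hypothetical:

    # add a single slog device to an existing pool
    zpool add tank log c4t0d0

    # or, more safely, a mirrored slog
    zpool add tank log mirror c4t0d0 c4t1d0

    # an L2ARC (cache) device is added the same way
    zpool add tank cache c4t2d0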
Re: [zfs-discuss] ZFS Data Loss on system crash/upgrade
Cindy, I've attached the results of the fmdump -eV command. They don't tell me anything, but someone more knowledgeable might be able to decipher them. As for snapshots, the earliest one I have is from after the new hardware. It doesn't appear that there are any snapshots from before the crash, and I'm positive I had them enabled through the GUI with Time Slider.

The more research I do into figuring out how to solve the issue, the curiouser and curiouser this issue becomes. I knew ZFS was designed to be fault tolerant, but *nothing* I've read has even hinted that something like this might have a remote possibility of occurring. Before I blew away the previous system install, I took a snapshot of it and dumped it on a spare hard drive I had lying around. My next step might be figuring out how to get that snapshot back on a hard drive to boot again, but that seems very difficult because I stupidly upgraded the ZFS version, so the live CD will not allow me to import/export it. I'll try that as a last-ditch effort, but at this point I'm very curious as to how something like this could have happened, and how one could recover from it.

Thanks again for all your help, Austin -- This message posted from opensolaris.org

TIME                           CLASS
May 28 2010 07:57:25.712193068 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xd953c9a23d51
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x9cb8636d530e0eab
                vdev = 0x3d14c4fd46fc236d
        (end detector)
        pool = zarray1
        pool_guid = 0x9cb8636d530e0eab
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x3d14c4fd46fc236d
        vdev_type = disk
        vdev_path = /dev/dsk/c11d1s0
        vdev_devid = id1,c...@asamsung_hd103sj=s246jdwz407200/a
        parent_guid = 0xfca8d0350c5e7014
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2e3fe600
        zio_size = 0x400
        zio_objset = 0x1e
        zio_object = 0x36172
        zio_level = 1
        zio_blkid = 0x24
        cksum_expected = 0x679a94912e 0x44a636ab8ab3 0x19330206d17513 0x69869956c85878c
        cksum_actual = 0x4a4c39042c 0x25f430d983f8 0xce6275bab1357 0x34cba59b47c2e80
        cksum_algorithm = fletcher4
        bad_ranges = 0x0 0x400
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0x63d
        bad_range_clears = 0x83a
        bad_set_histogram = 0x20 0xd 0x27 0x19 0x13 0xf 0x11 0x2e 0x15 0xc 0x15 0x23 0x1e 0x1e 0x18 0x2a
            0xd 0x13 0xe 0xe 0x13 0x19 0x18 0x2a 0x1d 0xb 0x10 0x17 0x1d 0xe 0x21 0x26
            0x2c 0x12 0x2b 0x12 0x1e 0x10 0xf 0x2e 0x15 0xb 0x11 0x21 0x1f 0x19 0x19 0x22
            0xd 0x15 0x13 0x13 0x18 0x1b 0x1c 0x2d 0x1e 0x9 0xd 0x1b 0x1e 0xe 0x1b 0x28
        bad_cleared_histogram = 0x17 0x36 0x17 0x26 0x26 0x27 0x25 0xb 0x24 0x3b 0x1e 0x23 0x20 0x1c 0x23 0x9
            0x29 0x2f 0x23 0x27 0x25 0x22 0x1f 0xa 0x18 0x3c 0x2a 0x1c 0x1e 0x27 0x1c 0xa
            0x14 0x30 0x12 0x21 0x19 0x21 0x23 0xf 0x22 0x3d 0x24 0x1c 0x20 0x1f 0x22 0x11
            0x27 0x32 0x26 0x28 0x26 0x20 0x22 0xd 0x1c 0x3e 0x23 0x1f 0x1e 0x28 0x1a 0x8
        __ttl = 0x1
        __tod = 0x4bffafa5 0x2a73342c

May 28 2010 07:58:34.612731735 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0xda54764ec1b00801
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x9cb8636d530e0eab
                vdev = 0x3d14c4fd46fc236d
        (end detector)
        pool = zarray1
        pool_guid = 0x9cb8636d530e0eab
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0x3d14c4fd46fc236d
        vdev_type = disk
        vdev_path = /dev/dsk/c11d1s0
        vdev_devid = id1,c...@asamsung_hd103sj=s246jdwz407200/a
        parent_guid = 0xfca8d0350c5e7014
        parent_type = mirror
        zio_err = 50
        zio_offset = 0x2c269600
        zio_size = 0x400
        zio_objset = 0x1e
        zio_object = 0x36172
        zio_level = 1
        zio_blkid = 0x14
        cksum_expected = 0x6de977087a 0x4702ac30c13e 0x19a6384bcbc3b5 0x6a7654eca5ac6b8
        cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80
        cksum_algorithm = fletcher4
        bad_ranges = 0x0 0x400
        bad_ranges_min_gap = 0x8
        bad_range_sets = 0xd60
        bad_range_clears = 0x33a
        bad_set_histogram = 0x4c 0x33 0x4e 0x4c 0x52 0x4f 0x53 0x0 0x52 0x31 0x0 0x0 0x50 0x0 0x50 0x0
            0x4d 0x35 0x0 0x51 0x4d 0x4e 0x0 0x6a 0x53 0x0 0x4e 0x50 0x4f 0x0 0x50 0x0
            0x4e 0x36 0x53 0x4d 0x4e 0x4f 0x50 0x0 0x4c 0x34 0x0 0x0 0x50 0x0 0x53 0x0
            0x53 0x38 0x0 0x57 0x4e 0x50 0x0 0x74 0x4e 0x0 0x53 0x53 0x54 0x0 0x58 0x0
        bad_cleared_histogram = 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x14 0x0 0x0 0x2e 0x2a 0x0 0x31 0x0 0x10
            0x0 0x0 0x32 0x0 0x0 0x0
Re: [zfs-discuss] SSDs adequate ZIL devices?
So why buy SSD for ZIL at all? For the record, not all SSDs ignore cache flushes. There are at least two SSDs sold today that guarantee synchronous write semantics: the Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more accurate to describe the root cause as not power-protecting on-board volatile caches: the X25-E does implement the ATA FLUSH CACHE command, but it does not have the power protection required to avoid transaction (data) loss. Best regards, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] High-Performance ZFS (2000MB/s+)
On 15/06/2010 12:42, Arve Paalsrud arve.paals...@gmail.com wrote: Hi, We are currently building a storage box based on OpenSolaris/Nexenta using ZFS. Our hardware specifications are as follows: Quad AMD G34 12-core 2.3 GHz (~110 GHz) 10 Crucial RealSSD (6Gb/s) 42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders LSI2008SAS (two 4x ports) Mellanox InfiniBand 40 Gbit NICs

I was told that IB support in Nexenta is scheduled to be released in 3.0.4 (beginning of July).

128 GB RAM This setup gives us about 40TB storage after mirror (two disks in spare), 2.5TB L2ARC and 64GB Zil, all fit into a single 5U box. Both L2ARC and Zil shares the same disks (striped) due to bandwidth requirements. Each SSD has a theoretical performance of 40-50k IOPS on 4k read/write scenario with 70/30 distribution. Now, I know that you should have mirrored Zil for safety, but the entire box are synchronized with an active standby on a different site location (18km distance - round trip of 0.16ms + equipment latency). So in case the Zil in Site A takes a fall, or the motherboard/disk group/motherboard dies - we still have safety. DDT requirements for dedupe on 16k blocks should be about 640GB when main pool are full (capacity). Without going into details about chipsets and such, do any of you on this list have any experience with a similar setup and can share with us your thoughts, do's and dont's, and any other information that could be of help while building and configuring this? What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also InfiniBand-based), with both dedupe and compression enabled in ZFS.

As VMware does not currently support NFS over RDMA, you will need to stick with IPoIB, which will suffer from some performance implications inherent to the traditional TCP/IP stack. You could also use iSER or SRP, which are both supported.

Let's talk moon landings. Regards, Arve -- Przem ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
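As a rough sketch of the NFS side only, here is how a datastore for the ESX hosts might be exported; the dataset name and subnet are invented, none of the IPoIB/iSER tuning is shown, and the recordsize choice is just an assumption to match the 16k dedup block size mentioned above:

    zfs create tank/vmstore
    zfs set sharenfs='rw,root=@192.168.10.0/24' tank/vmstore   # export to the ESX subnet
    zfs set recordsize=16k tank/vmstore                        # hypothetical, matching 16k dedup blocks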
Re: [zfs-discuss] SSDs adequate ZIL devices?
On 15/06/2010 23:46, Christopher George cgeo...@ddrdrive.com wrote: So why buy SSD for ZIL at all? For the record, not all SSDs ignore cache flushes. There are at least two SSDs sold today that guarantee synchronous write semantics: the Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more accurate to describe the root cause as not power-protecting on-board volatile caches: the X25-E does implement the ATA FLUSH CACHE command, but it does not have the power protection required to avoid transaction (data) loss. Best regards, Christopher George Founder/CTO www.ddrdrive.com

Often forgotten (most probably due to the price) are the latest Pliant SSDs. -- Przem ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool export / import discrepancy
http://blogs.sun.com/constantin/entry/csi_munich_how_to_save

2010/6/15 Scott Squires ssqui...@gmail.com

Hello All, I've migrated a JBOD of 16 drives from one server to another. I did a zpool export from the old system and a zpool import to the new system. One thing I did notice is since the drives are on a different controller card, the naming is different (as expected) but the order is also different. I set up the drives as passthrough on the controller card and went through each drive incrementally. I assumed the zpool import would have listed the drives in the order of c10t2d0, d1, d2, ... c10t3d7. As shown below, the order the drives were imported in is c10t2d0, d2, d3, d1, c10t3d0 through d7.

Original zpool setup on old server:

zpool status backup
  pool: backup
 state: ONLINE
config:
        NAME         STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c7t1d0   ONLINE       0     0     0
            c7t2d0   ONLINE       0     0     0
            c7t3d0   ONLINE       0     0     0
            c7t4d0   ONLINE       0     0     0
            c7t5d0   ONLINE       0     0     0
            c7t6d0   ONLINE       0     0     0
            c7t7d0   ONLINE       0     0     0
            c7t8d0   ONLINE       0     0     0
            c7t9d0   ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t12d0  ONLINE       0     0     0
            c7t13d0  ONLINE       0     0     0
            c7t14d0  ONLINE       0     0     0
            c7t15d0  ONLINE       0     0     0
        spares
          c7t16d0    AVAIL

Imported zpool on new server:

zpool status backup
  pool: backup
 state: ONLINE
config:
        NAME         STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t2d2  ONLINE       0     0     0
            c10t2d3  ONLINE       0     0     0
            c10t2d1  ONLINE       0     0     0
            c10t2d4  ONLINE       0     0     0
            c10t2d5  ONLINE       0     0     0
            c10t2d6  ONLINE       0     0     0
            c10t2d7  ONLINE       0     0     0
            c10t3d0  ONLINE       0     0     0
            c10t3d1  ONLINE       0     0     0
            c10t3d2  ONLINE       0     0     0
            c10t3d3  ONLINE       0     0     0
            c10t3d4  ONLINE       0     0     0
            c10t3d5  ONLINE       0     0     0
            c10t3d6  ONLINE       0     0     0
        spares
          c10t3d7    AVAIL

Is ZFS dependent on the order of the drives? Will this cause any issue down the road? Thank you all; Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Frank Contrepois Coblan srl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
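For the archives, the move itself is just the export/import pair described above (run as root; pool name as in the listing):

    # on the old server
    zpool export backup

    # on the new server: scan for importable pools, then import by name
    zpool import
    zpool import backup

ZFS locates the member disks by the labels written on them rather than by device path, which is why the controller renumbering is harmless.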
Re: [zfs-discuss] Dedup... still in beta status
On 06/15/10 10:52, Erik Trimble wrote: Frankly, dedup isn't practical for anything but enterprise-class machines. It's certainly not practical for desktops or anything remotely low-end.

We're certainly learning a lot about how zfs dedup behaves in practice. I've enabled dedup on two desktops and a home server and so far haven't regretted it on those three systems. However, they each have a more than typical amount of memory (4G and up), a data pool on two or more large-capacity SATA drives, plus an X25-M ssd sliced into a root pool as well as l2arc and slog slices for the data pool (see below [1]).

I tried enabling dedup on a smaller system (with only 1G memory and a single very slow disk), observed serious performance problems, and turned it off pretty quickly.

I think, with current bits, it's not a simple matter of "ok for enterprise, not ok for desktops". With an ssd for either main storage or l2arc, and/or enough memory, and/or a not very demanding workload, it seems to be ok. For one such system, I'm seeing:

# zpool list z
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
z      464G   258G   206G    55%  1.25x  ONLINE  -

# zdb -D z
DDT-sha256-zap-duplicate: 432759 entries, size 304 on disk, 156 in core
DDT-sha256-zap-unique: 1094244 entries, size 298 on disk, 151 in core

dedup = 1.25, compress = 1.44, copies = 1.00, dedup * compress / copies = 1.80

- Bill

[1] To forestall responses of the form "you're nuts for putting a slog on an X25-M", which is off-topic for this thread and being discussed elsewhere: yes, I'm aware of the write-cache issues on power failure on the X25-M. For my purposes, it's a better robustness/performance tradeoff than either zil-on-spinning-rust or zil disabled, because: a) for many potential failure cases on whitebox hardware running bleeding-edge opensolaris bits, the X25-M will not lose power and thus the write cache will stay intact across a crash; b) even if it loses power and loses some writes in flight, it's not likely to lose *everything* since the last txg sync. It's good enough for my personal use. Your mileage will vary. As always, system design involves tradeoffs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
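For anyone who wants to repeat that measurement, a short sketch follows; the pool and dataset names are placeholders, and the memory estimate simply multiplies the entry counts and in-core sizes quoted above:

    zfs set dedup=on z/data    # dedup is a per-dataset property
    zpool list z               # the DEDUP column is the pool-wide ratio
    zdb -D z                   # DDT entry counts and per-entry sizes

    # rough in-core DDT footprint from the figures quoted above:
    #   432759 * 156 + 1094244 * 151 bytes  =~  222 MiB of ARC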
Re: [zfs-discuss] Dedup... still in beta status
On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote: I think dedup may have its greatest appeal in VDI environments (think about an environment where 85% of the data that the virtual machines need is in ARC or L2ARC... it is like a dream... almost instantaneous response... and you can boot a new machine in a few seconds)...

This may also be accomplished by using snapshots and clones of data sets. At least for OS images: user profiles and documents could be something else entirely.

Another situation that comes to mind is perhaps as the back end to a mail store: if you send out a message with attachments to a lot of people, the attachment blocks could be deduped (and perhaps compressed as well, since base-64 adds 1/3 overhead). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On Tue, Jun 15, 2010 at 7:28 PM, David Magda dma...@ee.ryerson.ca wrote: On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote: I think dedup may have its greatest appeal in VDI environments (think about an environment where 85% of the data that the virtual machines need is in ARC or L2ARC... it is like a dream... almost instantaneous response... and you can boot a new machine in a few seconds)... This may also be accomplished by using snapshots and clones of data sets. At least for OS images: user profiles and documents could be something else entirely.

It all depends on the nature of the VDI environment. If the VMs are regenerated on each login, the snapshot + clone mechanism is sufficient. Deduplication is not needed. However, if VMs have a long life and get periodic patches and other software updates, deduplication will be required if you want to remain at somewhat constant storage utilization.

It probably makes a lot of sense to be sure that swap or page files are on a non-dedup dataset. Executables and shared libraries shouldn't be getting paged out to it, and the likelihood that multiple VMs page the same thing to swap or a page file is very small.

Another situation that comes to mind is perhaps as the back end to a mail store: if you send out a message with attachments to a lot of people, the attachment blocks could be deduped (and perhaps compressed as well, since base-64 adds 1/3 overhead).

It all depends on how this is stored. If the attachments are stored as they were in 1990, as part of an mbox format, you will be very unlikely to get the proper block alignment. Even storing the message body (including headers) in the same file as the attachment may not align the attachments, because the mail headers may differ (e.g. different recipients' messages took different paths, some were forwarded, etc.). If the attachments are stored in separate files, or a database format is used that stores attachments separately from the message (with matching database and zfs block sizes), things may work out favorably. However, a system that detaches messages and stores attachments separately might just as well store each attachment in a file named by its SHA256 hash, assuming that file doesn't already exist. If it does exist, it can just increment a reference count. In other words, an intelligent mail system should already dedup. Or at least that is how I would have written it for the last decade or so... -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
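A toy sketch of that last idea, using Solaris digest(1) and invented paths: store each attachment under its SHA256 hash and let the filesystem's hard-link count act as the reference count:

    h=$(digest -a sha256 attach.pdf)                 # content hash of the attachment
    [ -f "/store/$h" ] || cp attach.pdf "/store/$h"  # first sender creates the blob
    ln "/store/$h" /mail/msg-42.attach               # every later message just adds a link
    # the inode's link count is the reference count; unlinking a message's
    # copy decrements it, and the space is freed when the last link goes away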
Re: [zfs-discuss] size of slog device
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn: It is good to keep in mind that only small writes go to the dedicated slog. Large writes go to the main store. A succession of that many small writes (to fill RAM/2) is highly unlikely. Also, the ZIL is not read back unless the system is improperly shut down.

Can anyone verify this? I thought the decision for small vs. large sync writes going to the log vs. the main store was determined by zfs_immediate_write_sz and logbias. logbias was introduced in snv_122, which is zpool 18 or 19. zfs_immediate_write_sz seems to have been around forever (I see comments about it as early as 2006). Then again, I can't seem to find my zfs_immediate_write_sz via either zpool or zfs. Can anybody say which zpool version introduced zfs_immediate_write_sz, or perhaps I'm using the wrong commands to try and see mine?

zpool get all rpool | grep zfs_immediate_write_sz ; zfs get all rpool | grep zfs_immediate_write_sz

I thought, if you didn't explicitly tune these, all sync writes go to the ZIL before the main store. Can't seem to find any way to verify this. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] size of slog device
On Jun 15, 2010, at 8:13 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Bob Friesenhahn It is good to keep in mind that only small writes go to the dedicated slog. Large writes to to main store. A succession of that many small writes (to fill RAM/2) is highly unlikely. Also, that the zil is not read back unless the system is improperly shut down. Can anyone verify this? I thought the decision for small vs large sync writes to go to log vs main store was determined by zfs_immediate_write_sz and logbias. logbias was introduced in snv_122, which is zpool 18 or 19. zfs_immediate_write_sz seems to have been around forever (I see comments about it as early as 2006). Then again, I can't seem to find my zfs_immediate_write_sz, via either zpool or zfs. Can anybody say what version zpool introduced zfs_immediate_write_sz, or perhaps I'm using the wrong commands to try and see mine? zpool get all rpool | grep zfs_immediate_write_sz ; zfs get all rpool | grep zfs_immediate_write_sz It is an int, as in C, not a parameter tunable by zpool or zfs commands. For NFS service, it can be tuned by the client via wsize. I thought, if you didn't explicitly tune these, all sync writes go to ZIL before the main store. Can't seem to find any way to verify this. Cake. All sync writes go to the ZIL. The ZIL may be in the pool or in the separate log device :-) -- richard -- Richard Elling rich...@nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
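Concretely, since it is a kernel variable rather than a pool or dataset property, the usual ways to inspect or tune it are mdb and /etc/system; the value below is purely illustrative:

    # read the current value from the running kernel (as root)
    echo zfs_immediate_write_sz/D | mdb -k

    # set it persistently for the next boot, via a line in /etc/system:
    set zfs:zfs_immediate_write_sz = 0x8000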
Re: [zfs-discuss] SSDs adequate ZIL devices?
Bob Friesenhahn wrote: On Tue, 15 Jun 2010, Arne Jansen wrote: In case of a power failure I will likely lose about as many writes as I do with SSDs, a few milliseconds. I agree with your concerns, but the data loss may span as much as 30 seconds rather than just a few milliseconds.

Wait, I'm talking about using an SSD for the ZIL vs. using a dedicated hard drive for the ZIL that is configured to ignore cache flushes. Are you saying I can also lose 30 seconds if I use a badly behaving SSD?

Using an SSD as the ZIL allows zfs to turn a synchronous write into a normal batched async write which is scheduled for the next TXG. Zfs intentionally postpones writes. Without the SSD, zfs needs to write to an intent log in the main pool (consuming precious IOPS) or write directly to the main pool (consuming precious response latency). Battery-backed RAM in the adaptor card or storage array can do almost as well as the SSD as long as the amount of data does not overrun the limited write cache. Bob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] size of slog device
On Jun 15, 2010, at 8:51 PM, Richard Elling wrote I thought, if you didn't explicitly tune these, all sync writes go to ZIL before the main store. Can't seem to find any way to verify this. Cake. All sync writes go to the ZIL. The ZIL may be in the pool or in the separate log device :-) go to may be too confusing. s/go to/are handled by/ -- richard -- Richard Elling rich...@nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss