Re: [zfs-discuss] ZFS performance degradation when backups are running
2008/9/30 Jean Dion [EMAIL PROTECTED]: iSCSI requires a dedicated network, not a shared network or even a VLAN. Backups cause large I/O that fills your network quickly, like any SAN today. Could you clarify why it is not suitable to use VLANs for iSCSI? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
The good news is that even though the answer to your question is no, it doesn't matter, because it sounds like what you are doing is a piece of cake :) Given how cheap hardware is, and how modest your requirements sound, I expect you could build multiple custom systems for the cost of an EMC system. Even that Pogo Linux box is overshooting the mark compared to what a custom system might be. The price is typical too, considering they're trying to sell 1TB drives for $260 when similar drives are less than $150 for regular folks. The manageability of the NexentaStor software might be worth it to you over a Solaris terminal, but for a small shop with one machine and one guy who knows it well, you might just do the hardware from scratch :) Especially given what there is to know about ZFS and your use case, such as being able to use slower disks with more RAM and an SSD ZIL cache to produce deceptively fast results. If cost continues to be a concern over performance, also consider that these pre-made systems are not designed for power conservation at all. They're still shipping old, inefficient processors and other such parts, hoping to take advantage of IT people who don't care or know any better. A custom system could potentially cut the total power cost in half...

The original message: Hi everyone, We're a small Linux shop (20 users). I am currently using a Linux server to host our 2TB of data. I am considering better options for our data storage needs. I mostly need instant snapshots and better data protection. I have been considering EMC NS20 filers and ZFS-based solutions. For the ZFS solutions, I am considering the NexentaStor product installed on a Pogo Linux StorageDirector box. The box will mostly be sharing 2TB over NFS, nothing fancy.

Now, my question is that I need to assess ZFS reliability today (Q4 2008) in comparison to an EMC solution. Something like EMC is pretty mature and used at the most demanding sites. ZFS is fairly new, and from time to time I have heard it had some pretty bad bugs. However, the EMC solution is about 4X more expensive. I need to somehow "quantify" the relative quality level, in order to judge whether or not I should be paying all that much to EMC. The only really important reliability measure to me is not having data loss! Is there any real measure, like "percentage of total corruption of a pool", that can assess such quality, so you'd tell me ZFS has a pool failure rate of 1 in 10^6 while EMC has a rate of 1 in 10^7? If not, would you rate such a ZFS solution as ??% of the reliability of an EMC solution?

I know it's a pretty difficult question to answer, but it's the one I need to answer and weigh against the cost. Thanks a million, I really appreciate your help -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Thanks for all the answers. Please find more questions below :)
- Good to know EMC filers do not have end-to-end checksums! What about NetApp?
- Any other limitations of the big two NAS vendors as compared to ZFS?
- I still don't have my original question answered: I want to somehow assess the reliability of that ZFS storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his impression of the reliability compared to the big two, that would be great!
- Regarding building my own hardware, I don't really want to do that (I am scared enough putting our small but very important data on ZFS). If you know of any Dell box (we usually deal with Dell) that can host say 10 drives minimum (for expandability) and that is *known* to work very well with NexentaStor, then please, please let me know about it. I am not confident about the hardware quality of the Pogo Linux solution, but forced to go with it for Nexenta. The Sun Thumper solution is too expensive for me; I am looking for a solution around $10k. I don't need all those disks or RAM in a Thumper!
- Assuming I plan to host a maximum of 8TB usable data on the Pogo box as seen in http://www.pogolinux.com/quotes/editsys?sys_id=8498 :
  * Would I need one or two of those quad-core Xeon CPUs?
  * How much RAM is needed?
  * I'm planning on using Seagate 1TB SATA 7200 rpm disks. Is that crazy? The EMC guy insisted we use 10k Fibre/SAS drives at least. We're currently on three 1TB SATA disks on my current Linux box, and it's fine for me, at least when it's not rsnapshotting. The workload is 20-user NFS for home directories and some software shares.
  * Assuming the Pogo SATA controller dies, do you suppose I could plug the disks into any other machine and work with them? I wonder why the Pogo box does not come with two controllers; doesn't Solaris support that?
Thanks a lot for your replies
On Tue, Sep 30, 2008 at 10:31 AM, MC [EMAIL PROTECTED] wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org
Re: [zfs-discuss] Quantifying ZFS reliability
On Sep 30, 2008, at 06:58, Ahmed Kamal wrote: - I still don't have my original question answered, I want to somehow assess the reliability of that zfs storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his impression of the reliability compared to the big two, that would be great! What would you consider hard data? Can you give examples of hard data for EMC and NetApp (or anyone else)? Then perhaps similar things can be found for ZFS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Solaris
Hi, can anyone please tell me what is the maximum number of files that can be in one folder in Solaris with the ZFS file system? I am working on an application in which I have to support 1 million users. In my application I am using MySQL MyISAM, and in MyISAM there are 3 files created for 1 table. I am using an application architecture in which each user will have a separate table, so the expected number of files in the database folder is 3 million. I have read somewhere that each OS has a limit on how many files can be created in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Solaris
On Tue, 30 Sep 2008, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be in one folder in Solaris with the ZFS file system? By folder, I assume you mean directory and not, say, pool. In any case, the 'limit' is 2^48, but that's effectively no limit at all. Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
Simple. You cannot go faster than the slowest link. VLANs share the bandwidth of the underlying link and do not provide dedicated bandwidth for each of them. That means if you have multiple VLANs coming out of the same wire on your server you do not have "n" times the bandwidth but only a fraction of it. Simple network math. Also, iSCSI works better on segregated IP network switches. Beware that some switches do not guarantee full 1 Gbit/s speed on all ports when all are active at the same time. Plan multiple uplinks if you have more than one switch. Once again, you cannot go faster than the slowest link. Jean gm_sjo wrote: 2008/9/30 Jean Dion [EMAIL PROTECTED]: iSCSI requires a dedicated network, not a shared network or even a VLAN. Backups cause large I/O that fills your network quickly, like any SAN today. Could you clarify why it is not suitable to use VLANs for iSCSI? -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
On Mon, Sep 29, 2008 at 06:01:18PM -0700, Jean Dion wrote: Do you have dedicated iSCSI ports from your server to your NetApp? Yes, it's a dedicated redundant gigabit network. iSCSI requires a dedicated network, not a shared network or even a VLAN. Backups cause large I/O that fills your network quickly, like any SAN today. Backups are extremely demanding on hardware (CPU, memory, I/O ports, disks, etc.). It is not rare to see performance issues during backups with several thousand small files. Each small file causes seeks on your disks and file system; as the number and size of files grow, you will be impacted. That means thousands of small files cause thousands of small I/Os but not a lot of throughput. What statistics can I generate to observe this contention? ZFS pool I/O statistics are not that different when the backup is running. The bigger your files are, the more likely the blocks will be consecutive on the file system. Small files can be spread across the entire file system, causing seeks, latency, and bottlenecks. The Legato client and server contain tuning parameters to avoid such small-file problems. Check your Legato buffer parameters. These buffers will use your server memory as disk cache. I'll ask our backup person to investigate those settings. I assume that Networker should not be buffering files, since those files won't be read again. How can I see memory usage by ZFS and by applications? Here is a good source of network tuning parameters for your T2000: http://www.solarisinternals.com/wiki/index.php/Networks#Tunable_for_general_workloads_on_T1000.2FT2000 The soft_ring setting is one of the best ones. Here is another interesting place to look: http://www.solarisinternals.com/wiki/index.php/Solaris_Internals_and_Performance_FAQ Thanks. I'll review those documents. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
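A quick, hedged sketch of commands that could help answer Gary's two questions (what to watch for contention during the backup window, and how much memory ZFS is using); the pool name matches the one shown later in the thread, everything else is generic:

  # Per-vdev I/O and service times while the backup runs; compare with a quiet window
  zpool iostat -v space 10
  iostat -xnz 10

  # ZFS ARC size and hit/miss counters
  kstat -n arcstats | egrep 'size|hits|misses'

  # Coarse kernel vs. application memory breakdown (the ARC is counted in the
  # kernel bucket on Solaris 10); run as root
  echo "::memstat" | mdb -k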
Re: [zfs-discuss] ZFS Solaris
ZFS has no limit on snapshots and filesystems either, but try creating a lot of snapshots and filesystems and you will also have to wait a long time for your pool to import... ;-) I think you should not think about the limits, but about performance. Any filesystem with *too many* entries per directory will suffer. So, my advice is to configure your app to create a better hierarchy. Leal. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
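One common way to follow Leal's advice is to fan the tables out over many directories instead of one flat folder; with MySQL/MyISAM that can be done by creating one database (one directory under the datadir) per name prefix. A minimal bash sketch that pre-creates such a fan-out, with hypothetical paths:

  # ~676 prefix directories instead of one folder holding ~3 million files
  # (paths are placeholders; each directory can back one MySQL database)
  for a in {a..z}; do
    for b in {a..z}; do
      mkdir -p /tank/mysql/users_${a}${b}
    done
  done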
Re: [zfs-discuss] ZFS performance degradation when backups are running
For Solaris internal debugging tools, look here: http://opensolaris.org/os/community/advocacy/events/techdays/seattle/OS_SEA_POD_JMAURO.pdf ZFS specifics are available here: http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide Jean Gary Mills wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] c1t0d0 to c3t1d0
I have a ZFS disk (c1t0d0) in an eSATA/USB2 enclosure. If I were to install this drive in the machine (internal SATA) it would become c3t1d0. When I did that (for testing), zpool status did not see it. What do I have to do to be able to switch this drive? -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D ++ http://nagual.nl/ + SunOS sxce snv95 ++ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] c1t0d0 to c3t1d0
On Tue, Sep 30, 2008 at 14:00, dick hoogendijk [EMAIL PROTECTED] wrote: What do I have to do to be able to switch this drive? I'd suggest running zpool import. If that doesn't show the pool, put it back in the external enclosure, run zpool export mypool and then see if it shows up in zpool import when it's internal. Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
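A minimal sketch of the sequence Will describes, with a hypothetical pool name:

  # While the disk is still in the external eSATA/USB2 enclosure:
  zpool export mypool
  # Shut down, move the disk to the internal SATA bay, boot, then:
  zpool import            # should list mypool on its new c3t1d0 device
  zpool import mypool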
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? If they are not at the end, they can't do end-to-end data validation. Ideally, application writers would do this, but it is a lot of work. ZFS does this on behalf of applications which use ZFS. Hence my comment about ZFS being complementary to your storage device decision. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Ahmed Kamal wrote: - I still don't have my original question answered, I want to somehow assess the reliability of that zfs storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his impression of the reliability compared to the big two, that would be great The reliability of that zfs storage stack primarily depends on the reliability of the hardware it runs on. Note that there is a huge difference between 'reliability' and 'mean time to data loss' (MTDL). There is also the concern about 'availability' which is a function of how often the system fails, and the time to correct a failure. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
I guess I am mostly interested in MTDL for a ZFS system on whitebox hardware (like the Pogo), vs. Data ONTAP on NetApp hardware. Any numbers? On Tue, Sep 30, 2008 at 4:36 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Ahmed Kamal wrote: I guess I am mostly interested in MTDL for a zfs system on whitebox hardware (like pogo), vs dataonTap on netapp hardware. Any numbers ? Barring kernel bugs or memory errors, Richard Elling's blog entry seems to be the best place to use as a guide: http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl It is pretty easy to build a ZFS pool with data loss probabilities (on paper) which are about as low as your chances of winning the state jumbo lottery jackpot with a ticket you found on the ground. However, if you want to compete with an EMC system, then you will want to purchase hardware of similar grade. If you purchase a cheapo system from Dell without ECC memory then the actual data reliability will suffer. ZFS protects you against corruption in the data storage path. It does not protect you against main memory errors or random memory overwrites due to a horrific kernel bug. ZFS also does not protect against data loss due to user error, which remains the primary factor in data loss. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: I guess I am mostly interested in MTDL for a zfs system on whitebox hardware (like pogo), vs dataonTap on netapp hardware. Any numbers ? It depends to a large degree on the disks chosen. NetApp uses enterprise class disks and you can expect better reliability from such disks. I've blogged about a few different MTTDL models and posted some model results. http://blogs.sun.com/relling/tags/mttdl -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
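To get a feel for what those models produce, the simplest of them (ignoring unrecoverable read errors during reconstruction, which the blog's other models add) is MTTDL = MTBF^2 / (N * (N-1) * MTTR) for a single-parity or two-way-mirror set. A worked example with assumed numbers (1,200,000-hour disk MTBF, 24-hour resilver, 2-disk mirror):

  MTTDL = (1,200,000 h)^2 / (2 * 1 * 24 h)
        = 1.44e12 / 48
        = 3.0e10 hours, roughly 3.4 million years per mirror pair

The per-pool number is lower, since every vdev in the pool contributes its own chance of failure.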
Re: [zfs-discuss] zpool import of bootable root pool renders it
Stephen Quintero [EMAIL PROTECTED] writes: I am running OpenSolaris 2008.05 as a PV guest under Xen. If you import the bootable root pool of a VM into another Solaris VM, the root pool is no longer bootable. I had a similar problem: after installing and booting OpenSolaris 2008.05, I succeeded in locking myself out through some passwd/shadow inconsistency (totally my own fault). Not a problem, I thought -- I booted from the install disk, imported the root pool, fixed the inconsistency, and rebooted. Lo, instant panic. No idea why, though; I am not that familiar with the underlying code. I just did a reinstall. Regards, Juergen. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool error: must be a block device or regular file
The zfs kernel modules handle the caching/flushing of data across all the devices in the zpools. It uses a different method for this than the standard virtual memory system used by traditional file systems like UFS. Try defining your NVRAM card with ZFS as a log device using the /dev/dsk/xyz path and let us know how it goes. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
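A minimal sketch of what that looks like, assuming a hypothetical device path for the NVRAM card and a pool named tank:

  # Add the card as a dedicated ZIL (separate intent log) device
  zpool add tank log /dev/dsk/c5t0d0s0
  zpool status tank     # the device should now show up under a 'logs' section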
Re: [zfs-discuss] ZFS performance degradation when backups are running
Gary - Besides the network questions... What does your zpool status look like? Are you using compression on the file systems? (Was single-threaded and fixed in s10u4 or equiv patches) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
2008/9/30 Jean Dion [EMAIL PROTECTED]: Simple. You cannot go faster than the slowest link. That is indeed correct, but what is the slowest link when using a Layer 2 VLAN? You made a broad statement that iSCSI 'requires' a dedicated, standalone network. I do not believe this is the case. VLANs share the bandwidth of the underlying link and do not provide dedicated bandwidth for each of them. That means if you have multiple VLANs coming out of the same wire on your server you do not have n times the bandwidth but only a fraction of it. Simple network math. I can only assume that you are only referring to VLAN trunks, e.g. using a NIC on a server for both 'normal' traffic and having another virtual interface on it bound to a 'storage' VLAN. If this is the case then what you say is true; of course you are sharing the same physical link, so ultimately that will be the limit. However, and this should be clarified before anyone gets the wrong idea, there is nothing wrong with segmenting a switch by using VLANs to have some ports for storage traffic and some ports for 'normal' traffic. You can have one or multiple NICs for storage, and another NIC or NICs for everything else (or however you please to use your interfaces!). These can be hooked up to switch ports that are on different physical VLANs with no performance degradation. It's best not to assume that every use of a VLAN is a trunk. Also, iSCSI works better on segregated IP network switches. Beware that some switches do not guarantee full 1 Gbit/s speed on all ports when all are active at the same time. Plan multiple uplinks if you have more than one switch. Once again you cannot go faster than the slowest link. I think it's fairly safe to assume that you're going to get per-port line speed across anything other than the cheapest budget switches. Most SMB (and above) switches will be rated at, say, 48 Gbit/s backplane on a 24-port item, for example. However, I am keen to see any benchmarks you may have that show the performance difference between running a single switch with Layer 2 VLANs vs. two separate switches. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
A normal iSCSI setup splits network traffic at the physical layer, not the logical layer. That means separate physical ports, and often a separate physical PCI bridge chip if you can. A VLAN will be fine for small traffic, but we are talking about backup performance issues. The IP network and the number of small files are very often the bottlenecks. If you want performance you do not put all your I/O across the same physical wire. Once again, you cannot go faster than the physical wire can support (CAT5E, CAT6, fibre), no matter whether it is Layer 2 or not. Using VLANs on a single port you "share" the bandwidth; Layer 2 does not create more Gbit/s of speed. iSCSI best practice requires a separate physical network. Many books and white papers have been written about this. This is like any FC SAN implementation: we always split the workload between disk and tape using more than one HBA. Never forget, backups are intensive I/O and will fill the entire I/O path. Jean gm_sjo wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
On Tue, Sep 30, 2008 at 10:32:50AM -0700, William D. Hathaway wrote: Gary - Besides the network questions... Yes, I suppose I should see if traffic on the iSCSI network is hitting a limit of some sort. What does your zpool status look like? Pretty simple:

$ zpool status
  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        space                                    ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A644A74d0  ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A696579d0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F6B385Ad0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F664E4Fd0  ONLINE       0     0     0

errors: No known data errors

The four LUNs use the built-in I/O multipathing, with separate iSCSI networks, switches, and ethernet interfaces. Are you using compression on the file systems? (Was single-threaded and fixed in s10u4 or equiv patches) No, I've never enabled compression there. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
On Mon, Sep 29, 2008 at 06:01:18PM -0700, Jean Dion wrote: Legato client and server contains tuning parameters to avoid such small file problems. Check your Legato buffer parameters. These buffer will use your server memory as disk cache. Our backup person tells me that there are no settings in Networker that affect buffering on the client side. Here is a good source of network tuning parameters for your T2000 http://www.solarisinternals.com/wiki/index.php/Networks#Tunable_for_general_workloads_on_T1000.2FT2000 The soft_ring is one of the best one. Those references are for network tuning. I don't want to change things blindly. How do I tell if they are necessary, that is if the network is the bottleneck in the I/O system? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
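One way to answer that without tuning anything is to watch the interface counters during the backup window and compare them against the wire speed; a sketch with placeholder interface names:

  # Packet counters every 10 seconds on the iSCSI-facing interface
  netstat -I e1000g0 10

  # Byte counters per link, if this release's dladm supports interval statistics
  dladm show-link -s -i 10 e1000g0

If rbytes/obytes sit near the ~110 MB/s that a single gigabit link can carry, the network is the bottleneck; if they stay well below that, look elsewhere.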
Re: [zfs-discuss] Quantifying ZFS reliability
ak == Ahmed Kamal [EMAIL PROTECTED] writes: ak I need to answer and weigh against the cost. I suggest translating the reliability problems into a cost for mitigating them: price the ZFS alternative as two systems, and keep the second system offline except for nightly backup. Since you care mostly about data loss, not availability, this should work okay. You can lose 1 day of data, right? I think you need two zpools, or zpool + LVM2/XFS, some kind of two-filesystem setup, because of the ZFS corruption and panic/freeze-on-import problems. Having two zpools helps with other things, too, like if you need to destroy and recreate the pool to remove a slog or a vdev, or change from mirroring to raidz2, or something like that. I don't think it's realistic to give a quantitative MTDL for loss caused by software bugs, from netapp or from ZFS. ak The EMC guy insisted we use 10k Fibre/SAS drives at least. I'm still not experienced at dealing with these guys without wasting huge amounts of time. I guess one strategy is to call a bunch of them, so they are all wasting your time in parallel. Last time I tried, the EMC guy wanted to meet _in person_ in the financial district, and then he just stopped calling, so I had to guesstimate his quote from some low-end iSCSI/FC box that Dell was reselling. Have you called netapp, hitachi, storagetek? The IBM NAS is netapp, so you could call IBM if netapp ignores you, but you probably want the StoreVault, which is sold differently. The HP NAS looks weird because it runs your choice of Linux or Windows instead of WeirdNASplatform---maybe read some more about that one. Of course you don't get source, but it surprised me that these guys are MUCH worse than ordinary proprietary software. At least with netapp stuff, you may as well consider it leased. They leverage the ``appliance'' aspect, and then have sneaky licenses that attempt to obliterate any potential market for used filers. When you're cut off from support you can't even download manuals. If you're accustomed to the ``first sale doctrine'' then ZFS with source has a huge advantage over netapp, beyond even ZFS's advantage over proprietary software. The idea of dumping all my data into some opaque DRM canister lorded over by asshole CEOs who threaten to sic their corporate lawyers on users on the mailing list offends me just a bit, but I guess we have to follow the ``market forces.'' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs allow interaction with file system privileges
On Tue, 23 Sep 2008, Darren J Moffat wrote: Run the service with the file_chown privilege. See privileges(5), rbac(5) and if it runs as an SMF service smf_method(5). Thanks for the pointer. After reviewing this documentation, it seems that file_chown_self is the best privilege to delegate, as the service account only needs to give away the filesystems it has created to the appropriate owner; it should never need to arbitrarily chown other things. I'm actually running a separate instance of Apache/mod_perl which exposes my ZFS management API as a web service to our central identity management server. So it does run under SMF, but I'm having trouble getting the privilege delegation the way I need it to be. The method_credential option in the manifest only seems to apply to the initial start of the service. Apache needs to start as root, and then gives up the privileges when it spawns children. I can't have SMF control the privileges of the initial parent Apache process or it won't start. Started with full privileges, the parent process looks like:

        E: all
        I: basic
        P: all
        L: all

And the children:

        flags = none
        E: basic
        I: basic
        P: basic
        L: all

I manually ran 'ppriv -s I+file_chown_self' on the parent Apache process, which resulted in:

        flags = none
        E: all
        I: basic,file_chown_self
        P: all
        L: all

And the children:

        flags = none
        E: basic,file_chown_self
        I: basic,file_chown_self
        P: basic,file_chown_self
        L: all

Which worked perfectly. Is there any syntax available for the SMF manifest that would allow starting the original process with all privileges, but configure the inheritable privileges to include the additional file_chown_self? If not, the only other option I can think of offhand is to put together a small Apache module that runs during server initialization and changes the inheritable privileges before the children are spawned. Thanks... -- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | [EMAIL PROTECTED] California State Polytechnic University | Pomona CA 91768 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Thanks guys, it seems the problem is even more difficult than I thought, and it seems there is no real measure of the software quality of the ZFS stack vs. others, neutralizing the hardware used under both. I will be using ECC RAM, since you mentioned it, and I will shift to using enterprise disks (I had initially thought ZFS always recovers from cheapo SATA disks, making other disks only faster but not also safer), so now I am shifting to 10k rpm SAS disks. So, I am changing my question: do you see any obvious problems with the following setup I am considering?
- CPU: 1 Xeon Quad Core E5410 2.33GHz 12MB Cache 1333MHz
- 16GB ECC FB-DIMM 667MHz (8 x 2GB)
- 10 Seagate 400GB 10K 16MB SAS HDD
The 10 disks will be: 2 spare + 2 parity for raidz2 + 6 data = 2.4TB usable space.
* Do I need more CPU power? How do I measure that? What about RAM?
* Now that I'm using ECC RAM and enterprisey disks, does this put this solution on par with a low-end NetApp 2020, for example? I will be replicating the important data daily to a Linux box, just in case I hit a wonderful zpool bug.
Any final advice before I take the blue pill ;) Thanks a lot
On Tue, Sep 30, 2008 at 8:40 PM, Miles Nordin [EMAIL PROTECTED] wrote: [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
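A minimal sketch of the pool layout Ahmed describes (an 8-disk raidz2 set, i.e. 6 data + 2 parity, plus 2 hot spares); the device names are hypothetical:

  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
      spare  c1t8d0 c1t9d0
  zpool status tank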
Re: [zfs-discuss] ZFS performance degradation when backups are running
2008/9/30 Jean Dion [EMAIL PROTECTED]: If you want performance you do not put all your I/O across the same physical wire. Once again you cannot go faster than the physical wire can support (CAT5E, CAT6, fibre), no matter whether it is Layer 2 or not. Using VLANs on a single port you share the bandwidth; Layer 2 does not create more Gbit/s of speed. iSCSI best practice requires a separate physical network. Many books and white papers have been written about this. Yes, that's true, but I don't believe you mentioned single-NIC implementations in your original statement. Just seeking clarification to help others :-) I think it's worth clarifying that iSCSI and VLANs are okay as long as people appreciate that you will require separate interfaces to get the best performance. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Oracle DB sequential dump questions
Server: T5120 on Solaris 10 U5. Storage: 8 internal drives on SAS HW RAID (RAID-5). Oracle: ZFS filesystem, recordsize=8K and atime=off. Tape: LTO-4 (half height) on a SAS interface. Dumping a large file from memory using tar to LTO yields 44 MB/s ... I suspect the CPU cannot push more, since it's a single thread doing all the work. Dumping Oracle db files from the filesystem yields ~25 MB/s. The interesting bit (apart from it being a rather slow speed) is the fact that the speed fluctuates on the disk side but stays constant to the tape. I see spikes of up to 50-60 MB/s over 5 seconds, while the tape continues to push its steady 25 MB/s. There has been NO tuning ... the above is absolutely standard. Where should I investigate to increase throughput? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
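One low-cost experiment (a sketch, not a recommendation) is to put a re-blocking stage between the filesystem reads and the tape, so the drive sees large steady writes even while the disk-side rate fluctuates; the tape device path and block size here are assumptions:

  # Read with tar, re-block to 256 KB writes on the no-rewind tape device
  tar cf - /u01/oradata | dd of=/dev/rmt/0cbn obs=256k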
Re: [zfs-discuss] c1t0d0 to c3t1d0
Inserting the drive does not automatically mount the ZFS filesystem on it. You need to use the zpool import command, which lists any pools available to import, then zpool import -f {name of pool} to force the import if you haven't exported the pool first. Cheers Andrew. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 2:10 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: * Now that I'm using ECC RAM and enterprisey disks, does this put this solution on par with a low-end NetApp 2020, for example? *sort of*. What are you going to be using it for? Half the beauty of NetApp is all the add-on applications you run server side, such as the SnapManager products. If you're just using it for basic single-head file serving, I'd say you're pretty much on par. IMO, NetApp's clustering is still far superior (yes folks, from a fileserver perspective, not an application clustering perspective) to anything Solaris has to offer right now, and also much, much, MUCH easier to configure/manage. Let me know when I can plug an infiniband cable between two Solaris boxes and type cf enable and we'll talk :) --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance degradation when backups are running
gm_sjo wrote: 2008/9/30 Jean Dion [EMAIL PROTECTED]: If you want performance you do not put all your I/O across the same physical wire. Once again you cannot go faster than the physical wire can support (CAT5E, CAT6, fibre), no matter whether it is Layer 2 or not. Using VLANs on a single port you share the bandwidth; Layer 2 does not create more Gbit/s of speed. iSCSI best practice requires a separate physical network. Many books and white papers have been written about this. Yes, that's true, but I don't believe you mentioned single-NIC implementations in your original statement. Just seeking clarification to help others :-) I think it's worth clarifying that iSCSI and VLANs are okay as long as people appreciate that you will require separate interfaces to get the best performance. Separate interfaces or networks may not be required, but properly sized networks are highly desirable. For example, a back-of-the-envelope analysis shows that a single 10GbE pipe is sufficient to drive 8 T10KB drives. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
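For reference, the arithmetic behind that back-of-the-envelope claim, assuming roughly 120 MB/s native throughput per T10000B drive:

  8 drives x 120 MB/s = 960 MB/s ~= 7.7 Gbit/s, which fits within a single 10GbE link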
Re: [zfs-discuss] Oracle DB sequential dump questions
Louwtjie Burger wrote: Dumping a large file from memory using tar to LTO yields 44 MB/s ... I suspect the CPU cannot push more since it's a single thread doing all the work. Dumping oracle db files from filesystem yields ~ 25 MB/s. The interesting bit (apart from it being a rather slow speed) is the fact that the speed fluctuates from the disk area.. but stays constant to the tape. I see up to 50-60 MB/s spikes over 5 seconds, while the tape continues to push it's steady 25 MB/s. There has been NO tuning .. above is absolutely standard. Where should I investigate to increase throughput ... Does your tape drive compress (most do)? If so, you may be seeing compressible vs. uncompressible data effects. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
No apology necessary and I'm glad you figured it out - I was just reading this thread and thinking I'm missing something here - this can't be right. If you have the budget to run a few more experiments, try this SuperMicro card: http://www.springsource.com/repository/app/faq that others have had success with. Regards, -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ Wrong link? --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
Is there more information that I need to post in order to help diagnose this problem? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
On Mon, Sep 29, 2008 at 12:57 PM, Ross Becker [EMAIL PROTECTED] wrote: I have to come back and face the shame; this was a total newbie mistake by myself. I followed the ZFS shortcuts for noobs guide off bigadmin; http://wikis.sun.com/display/BigAdmin/ZFS+Shortcuts+for+Noobs What that had me doing was creating a UFS filesystem on top of a ZFS volume, so I was using only 2 layers of ZFS. I just re-did this against end-to-end ZFS, and the results are pretty freaking impressive; ZFS is handily outrunning the hardware RAID. Bonnie++ is achieving 257 mb/sec write, and 312 mb/sec read. My apologies for wasting folks time; this is my first experience with a solaris of recent vintage. No apology necessary and I'm glad you figured it out - I was just reading this thread and thinking I'm missing something here - this can't be right. If you have the budget to run a few more experiments, try this SuperMicro card: http://www.springsource.com/repository/app/faq that others have had success with. Regards, -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Just to confuse you more, I mean, give you another point of view: - CPU: 1 Xeon Quad Core E5410 2.33GHz 12MB Cache 1333MHz The reason the Xeon line is good is because it allows you to squeeze maximum performance out of a given processor technology from Intel, possibly getting the highest performance density. The reason it is bad is because it isn't that much better for a lot more money. A mainstream processor is 80% of the performance for 20% of the price, so unless you need the highest possible performance density, you can save money going mainstream. Not that you should. Intel mainstream (and indeed many tech companies') stuff is purposely stratified from the enterprise stuff by cutting out features like ECC and higher memory capacity and using different interface form factors. - 10 Seagate 400GB 10K 16MB SAS HDD There is nothing magical about SAS drives. Hard drives are for the most part all built with the same technology. The MTBF on that is 1.4M hours vs 1.2M hours for the enterprise 1TB SATA disk, which isn't a big difference. And for comparison, the WD3000BLFS is a consumer drive with 1.4M hours MTBF. And we know that enterprise SATA drives are the same as the consumer drives, just with different firmware optimized for server workloads and longer testing designed to detect infant mortality, which affects MTBF just as much as old-age failure. The MTBF difference from this extra testing at the start is huge. So you can tell right there that the perceived extra reliability scam they're running is bunk. The SAS interface is a psychological tool to help disguise the fact that we're all using roughly the same stuff :) Do your own 24 hour or 7-day stress-testing before deployment to weed out bad drives. Apparently old humans don't live that much longer than they did in years gone by, instead much fewer of our babies die, which makes the average lifespan of everyone go up :) You know that 1TB SATA works for you now. Don't let some big greedy company convince you otherwise. That extra money should be spent on your payroll, not on filling EMC's coffers. ZFS provides a new landscape for storage. It is entirely possible that a server built with mainstream hardware can be cheaper, faster, and at least as reliable as an EMC system. Manageability and interoperability and all those things are another issue however. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
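A minimal sketch of the kind of burn-in MC describes, assuming the disks already sit in a pool named tank; let it run for a day or a week and then check the error counters:

  # Sequential read pass over every disk in parallel (device names are placeholders)
  for d in c1t0d0 c1t1d0 c1t2d0; do
    dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k &
  done
  wait

  # Exercise the pool end to end and look for READ/WRITE/CKSUM errors
  zpool scrub tank
  zpool status -v tank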
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
BJ Quinn wrote: Is there more information that I need to post in order to help diagnose this problem? Segmentation faults should be correctly handled by the software. Please file a bug and attach the core. http://bugs.opensolaris.org -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? Bluntly - no remote storage can have it, by definition. The checksum needs to be computed as close as possible to the application. That's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). --Toby ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import of bootable root pool renders it
Hello Juergen, Tuesday, September 30, 2008, 5:43:56 PM, you wrote: JN Stephen Quintero [EMAIL PROTECTED] writes: I am running OpenSolaris 2008.05 as a PV guest under Xen. If you import the bootable root pool of a VM into another Solaris VM, the root pool is no longer bootable. JN I had a similar problem: After installing and booting Opensolaris JN 2008.05, I succeded to lock myself out through some passwd/shadow JN inconsistency (totally my own fault). Not a problem, I thought -- I JN booted from the install disk, imported the root pool, fixed the JN inconsistency, and rebooted. Lo, instant panic. JN No idea why, though, I am not that familiar with the underlying JN code. I just did a reinstall. I hit the same issue - once I tried to boot OpenSolaris from within VirtualBox with the disk partition exposed to VB, the kernel couldn't mount the root fs either from VB or directly from the notebook - I had to import/export the pool while booting from CD. I haven't investigated it further, but I'm surprised it's not working OOB. -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
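A sketch of the boot-CD dance being described, with a hypothetical root pool and boot environment name; the point of the final export is to leave the pool cleanly released before rebooting from disk:

  # Booted from the install/live CD:
  zpool import -f -R /a rpool          # import the root pool under an alternate root
  zfs mount rpool/ROOT/opensolaris     # BE dataset name is an assumption
  vi /a/etc/shadow                     # ...fix whatever needed fixing...
  zpool export rpool                   # then reboot from the disk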
Re: [zfs-discuss] ZFS Solaris
On 30-Sep-08, at 7:50 AM, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be in one folder in Solaris with the ZFS file system. I am working on an application in which I have to support 1 million users. In my application I am using MySQL MyISAM and in MyISAM there are 3 files created for 1 table. I am using an application architecture in which each user will have a separate table, so the expected number of files in the database folder is 3 million. That sounds like a disastrous schema design. Apart from that, you're going to run into problems on several levels, including O/S resources (file descriptors) and filesystem scalability. --Toby I have read somewhere that each OS has a limit on how many files can be created in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
On Tue, Sep 30, 2008 at 3:51 PM, Tim [EMAIL PROTECTED] wrote: [...] Wrong link? Sorry! :( http://www.supermicro.com/products/accessories/addon/AOC-USASLP-L8i.cfm --Tim -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Toby Thain wrote: On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? Bluntly - no remote storage can have it, by definition. The checksum needs to be computed as close as possible to the application. That's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). --Toby Well, that's not _strictly_ true. ZFS can still munge things up as a result of faulty memory. And it's entirely possible to build a hardware end-to-end system which is at least as reliable as ZFS (i.e. one that is only faultable due to host memory failures). It's just neither easy, nor currently available from anyone I know of. Doing such checking is far easier at the filesystem level than any other place, which is a big strength of ZFS over other hardware solutions. I do believe several of the storage vendors (EMC and NetApp included) support hardware checksumming on the SAN/NAS device, but that still leaves them vulnerable to HBA and transport medium (e.g. FibreChannel/SCSI/Ethernet) errors, which they don't currently have a solution for. I'd be interested in seeing if anyone has statistics about where errors occur in the data stream. My gut tells me that (from most common to least):
(1) hard drives
(2) transport medium (particularly if it's Ethernet)
(3) SAN/NAS controller cache
(4) Host HBA
(5) SAN/NAS controller
(6) Host RAM
(7) Host bus issues
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 4:26 PM, Toby Thain [EMAIL PROTECTED]wrote: On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote: Thanks for all the answers .. Please find more questions below :) - Good to know EMC filers do not have end2end checksums! What about netapp ? Blunty - no remote storage can have it by definition. The checksum needs to be computed as close as possible to the application. What's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). --Toby ... So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD any different than a NetApp filer, running ONTAP with a QLogic HBA directly connected to an FC JBOD? How is it several unreliable subsystems away from the data? That's a great talking point but it's far from accurate. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import of bootable root pool renders it
On Tue, 30 Sep 2008, Robert Milkowski wrote: Hello Juergen, Tuesday, September 30, 2008, 5:43:56 PM, you wrote: JN Stephen Quintero [EMAIL PROTECTED] writes: I am running OpenSolaris 2008.05 as a PV guest under Xen. If you import the bootable root pool of a VM into another Solaris VM, the root pool is no longer bootable. JN I had a similar problem: After installing and booting Opensolaris JN 2008.05, I succeded to lock myself out through some passwd/shadow JN inconsistency (totally my own fault). Not a problem, I thought -- I JN booted from the install disk, imported the root pool, fixed the JN inconsistency, and rebooted. Lo, instant panic. JN No idea why, though, I am not that familiar with the underlying JN code. I just did a reinstall. I hit the same issue - once I tried to boot OS from within virtualbox with disk partition exposed to VB - kernel couldn't mount root fs either from VB or directly from notebook - I had to import/export pool while booting from CD. I haven't investigated it further but I'm surprised it's not working OOB. I think this is 6737463 -- Dave ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
Please forgive my ignorance. I'm fairly new to Solaris (Linux convert), and although I recognize that Linux has the same concept of Segmentation faults / core dumps, I believe my typical response to a Segmentation Fault was to upgrade the kernel and that always fixed the problem (i.e. somebody else filed the bug and fixed the problem before I got around to doing it myself). So - I'm running stock OpenSolaris 2008.05. Even if the bug was fixed, I imagine it would require a Solaris kernel upgrade anyway, right? Perhaps I could simply try that first? Are the kernel upgrades stable? I know for a while there, before the 2008.05 release, Solaris just released a new development kernel every two weeks. I don't think I want to just haphazardly upgrade to some random bi-weekly development kernel. Are there actually stable kernel upgrades for OS, and how would I go about upgrading it if there are? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
At this point, ZFS is performing admirably with the Areca card. Also, that card is only 8-port, and the Areca controllers I have are 12-port. My chassis has 24 SATA bays, so being able to cover all the drives with 2 controllers is preferable. Also, the driver for the Areca controllers is being integrated into OpenSolaris as we discuss, so the next spin of Opensolaris won't even require me to add the driver for it. --Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
BJ Quinn wrote: Please forgive my ignorance. I'm fairly new to Solaris (Linux convert), and although I recognize that Linux has the same concept of Segmentation faults / core dumps, I believe my typical response to a Segmentation Fault was to upgrade the kernel and that always fixed the problem (i.e. somebody else filed the bug and fixed the problem before I got around to doing it myself). So - I'm running stock OpenSolaris 2008.05. Even if the bug was fixed, I imagine it would require a Solaris kernel upgrade anyway, right? Perhaps I could simply try that first? Are the kernel upgrades stable? I know for a while there, before the 2008.05 release, Solaris just released a new development kernel every two weeks. I don't think I want to just haphazardly upgrade to some random bi-weekly development kernel. Are there actually stable kernel upgrades for OS, and how would I go about upgrading it if there are? If there was a bug already filed and fixed, then it should be in the bugs database, which is searchable at: http://bugs.opensolaris.org -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Will Murnane wrote: On Tue, Sep 30, 2008 at 21:48, Tim [EMAIL PROTECTED] wrote: why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data). So how is a Server running Solaris with a QLogic HBA connected to an FC JBOD any different than a NetApp filer, running ONTAP with a QLogic HBA directly connected to an FC JBOD? How is it several unreliable subsystems away from the data? That's a great talking point but it's far from accurate. Do your applications run on the NetApp filer? The idea of ZFS as I see it is to checksum the data from when the application puts the data into memory until it reads it out of memory again. Separate filers can checksum from when data is written into their buffers until they receive the request for that data, but to get from the filer to the machine running the application the data must be sent across an unreliable medium. If data is corrupted between the filer and the host, the corruption cannot be detected. Perhaps the filer could use a special protocol and include the checksum for each block, but then the host must verify the checksum for it to be useful. Contrast this with ZFS. It takes the application data, checksums it, and writes the data and the checksum out across the (unreliable) wire to the (unreliable) disk. Then when a read request comes, it reads the data and checksum across the (unreliable) wire, and verifies the checksum on the *host* side of the wire. If the data is corrupted any time between the checksum being calculated on the host and checked on the host, it can be detected. This adds a couple more layers of verifiability than filer-based checksums. Will To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damage cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine if a block is written correctly, but I don't know if they keep the block checksum around permanently, and, how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine if a READ block is OK, and hasn't suffered from bit-rot (aka disk block failure). And, if it's not either multiply stored, then they have the potential to lose the ability to do READ verification. Neither are problems of ZFS. In many of my production environments, I've got at least 2 different FC switches between my hosts and disks. And, with longer cables, comes more of the chance that something gets bent a bit too much. Finally, HBAs are not the most reliable things I've seen (sadly). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
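Will's point about which side of the wire the verification happens on can be sketched in shell terms (hostnames and paths below are invented, digest(1) stands in for the per-block checksums ZFS keeps automatically, and a real test would also have to defeat the NFS client cache on the read-back):

digest -a sha256 /local/data/payload.dat > /tmp/payload.sha256    # checksum computed at the source
cp /local/data/payload.dat /net/filer/export/payload.dat         # write across the wire
cp /net/filer/export/payload.dat /tmp/payload.readback           # read it back
digest -a sha256 /tmp/payload.readback | diff - /tmp/payload.sha256 \
  && echo "round trip verified on the host" \
  || echo "corruption somewhere between host memory and the filer"

The filer can run whatever internal checksums it likes; only a comparison done on the host side of the wire, like the one above, covers the HBA, switches and cabling in between.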
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 03:19:40PM -0700, Erik Trimble wrote: To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damaged cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. It sounds like you mean the Netapp can't detect silent errors in its own storage. It can (in a manner similar, but not identical to ZFS). The difference is that the Netapp is always remote from the application, and cannot detect corruption introduced before it arrives at the filer. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine if a block is written correctly, but I don't know if they keep the block checksum around permanently, and, how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine if a READ block is OK, and hasn't suffered from bit-rot (aka disk block failure). And, if it's not either multiply stored, then they have the potential to lose the ability to do READ verification. Neither are problems of ZFS. A netapp filer does have a permanent block checksum that can verify reads. To my knowledge, it is not redundant. But then if it fails, you can just declare that block bad and fall back on the RAID/mirror redundancy to supply the data. -- Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble [EMAIL PROTECTED] wrote: To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damaged cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine if a block is written correctly, but I don't know if they keep the block checksum around permanently, and, how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine if a READ block is OK, and hasn't suffered from bit-rot (aka disk block failure). And, if it's not either multiply stored, then they have the potential to lose the ability to do READ verification. Neither are problems of ZFS. In many of my production environments, I've got at least 2 different FC switches between my hosts and disks. And, with longer cables, comes more of the chance that something gets bent a bit too much. Finally, HBAs are not the most reliable things I've seen (sadly). NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS poor performance on Areca 1231ML
On Tue, Sep 30, 2008 at 5:04 PM, Ross Becker [EMAIL PROTECTED]wrote: At this point, ZFS is performing admirably with the Areca card. Also, that card is only 8-port, and the Areca controllers I have are 12-port. My chassis has 24 SATA bays, so being able to cover all the drives with 2 controllers is preferable. Also, the driver for the Areca controllers is being integrated into OpenSolaris as we discuss, so the next spin of Opensolaris won't even require me to add the driver for it. --Ross -- All very valid points... if you don't mind spending 8x as much for the cards :) --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool import of bootable root pool renders it unbootable
I have not tried importing bootable root pools onto other VMs, but there have been recent ZFS bug fixes in the area of importing and exporting bootable root pools - the panic might not occur on Solaris Nevada releases after approximately 97. There are still issues with renaming of bootable root pools - particularly if they are renamed during import - newpool from zpool(1m). If you import with a different name, at the moment, you will have to export, then import by the original name before it can be booted without GRUB menu changes. Check your Solaris version, check to see if your zpool import is using an alternate pool name. If so, try re-importing using the original name before trying to reboot. Let us know what happens. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
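For anyone who hits this, the recovery described above looks roughly like the following when booted from the install CD (the pool names and numeric pool ID are examples only, not taken from the report):

zpool export altrpool                        # let go of the pool imported under the alternate name
zpool import                                 # list exportable pools along with their numeric IDs
zpool import -f 6930485479747913425 rpool    # re-import by ID under the original name
# reboot; GRUB should again find the root pool under the name its menu expects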
Re: [zfs-discuss] Quantifying ZFS reliability
Intel mainstream (and indeed many tech companies') stuff is purposely stratified from the enterprise stuff by cutting out features like ECC and higher memory capacity and using different interface form factors. Well I guess I am getting a Xeon anyway There is nothing magical about SAS drives. Hard drives are for the most part all built with the same technology. The MTBF on that is 1.4M hours vs 1.2M hours for the enterprise 1TB SATA disk, which isn't a big difference. And for comparison, the WD3000BLFS is a consumer drive with 1.4M hours MTBF. Hmm ... well, there is a considerable price difference, so unless someone says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200 drives. By the way, how many of those would saturate a single (non trunked) Gig ethernet link ? Workload NFS sharing of software and homes. I think 4 disks should be about enough to saturate it ? BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
True, but a search for zfs segmentation fault returns 500 bugs. It's possible one of those is related to my issue, but it would take all day to find out. If it's not flaky or unstable, I'd like to try upgrading to the newest kernel first, unless my Linux mindset is truly out of place here, or if it's not relatively easy to do. Are these kernels truly considered stable? How would I upgrade? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: Hmm ... well, there is a considerable price difference, so unless someone says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200 drives. By the way, how many of those would saturate a single (non trunked) Gig ethernet link ? Workload NFS sharing of software and homes. I think 4 disks should be about enough to saturate it ? SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Won't be happening anytime soon. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! We've actually prototyped an NFS protocol extension that does this, but the challenges are integrating it with ZFS to form a single protection domain, and getting the protocol to be a standard. For now, an option you have is Kerberos with data integrity; the sender computes a CRC of the data and the receiver can verify it to rule out OTW corruption. This is, of course, not end-to-end from platter to memory, but introduces a separate protection domain for the NFS link. Rob T ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
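As a concrete sketch of the Kerberos option Rob mentions -- this assumes a working realm, host principals and keytabs are already in place on both ends:

share -F nfs -o sec=krb5i,rw /export/home             # on the Solaris NFS server
mount -F nfs -o sec=krb5i filer:/export/home /mnt     # on a Solaris client

sec=krb5i adds a per-RPC integrity checksum computed by the sender and verified by the receiver, which is the separate protection domain for the NFS link described above.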
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Miles Nordin wrote: I think you need two zpools, or zpool + LVM2/XFS, some kind of two-filesystem setup, because of the ZFS corruption and panic/freeze-on-import problems. Having two zpools helps with other If ZFS provides such a terrible experience for you can I be brave enough to suggest that perhaps you are on the wrong mailing list and perhaps you should be watching the pinwheels with HFS+? ;-) While we surely do hear all the horror stories on this list, I don't think that ZFS is as wildly unstable as you make it out to be. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Ahmed Kamal wrote: So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
Actually, the one that'll hurt most is ironically the most closely related to bad database schema design... With a zillion files in the one directory, if someone does an 'ls' in that directory, it'll not only take ages, but steal a whole heap of memory and compute power... Provided the only things that'll be doing *anything* in that directory are using indexed methods, there is no real problem from a ZFS perspective, but if something decides to list (or worse, list and sort) that directory, it won't be that pleasant. Oh - That's of course assuming you have sufficient memory in the system to cache all that metadata somewhere... If you don't then that's another zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) My box has the ARC limited to about 1GB, so it's obviously undersized for such a workload, but still gives you an indication... I generally look to keep directories to a size that allows the utilities that work on and in it to perform at a reasonable rate... which for the most part is around the 100K files or less... Perhaps you are using larger hardware than I am for some of this stuff? :) Nathan. On 1/10/08 07:29 AM, Toby Thain wrote: On 30-Sep-08, at 7:50 AM, Ram Sharma wrote: Hi, can anyone please tell me what is the maximum number of files that can be there in 1 folder in Solaris with ZSF file system. I am working on an application in which I have to support 1mn users. In my application I am using MySql MyISAM and in MyISAM there is 3 files created for 1 table. I am having application architechture in which each user will be having separate table, so the expected number of files in database folder is 3mn. That sounds like a disastrous schema design. Apart from that, you're going to run into problems on several levels, including O/S resources (file descriptors) and filesystem scalability. --Toby I have read somewhere that there is a limit of each OS to create files in a folder. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- // // Nathan Kroenert [EMAIL PROTECTED] // // Senior Systems Engineer Phone: +61 3 9869 6255 // // Global Systems Engineering Fax:+61 3 9869 6288 // // Level 7, 476 St. Kilda Road // // Melbourne 3004 VictoriaAustralia // // ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
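One way to see how much of that time is the directory read itself versus ls sorting and stat-ing every entry is to stream the entries instead (the directory name is an example; the perl one-liner just counts entries as readdir returns them):

cd /tank/bigdir
ptime perl -e 'opendir(D, "."); $n++ while defined(readdir(D)); print "$n\n"'
ptime ls -1rt > /dev/null        # the sorted-by-mtime case, for comparison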
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 06:09:30PM -0500, Tim wrote: On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: BTW, for everyone saying zfs is more reliable because it's closer to the application than a netapp, well at least in my case it isn't. The solaris box will be NFS sharing and the apps will be running on remote Linux boxes. So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Won't be happening anytime soon. If you use RPCSEC_GSS with integrity protection then you've got it already. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
rt == Robert Thurlow [EMAIL PROTECTED] writes: rt introduces a separate protection domain for the NFS link. There are checksums in the ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC layer checksum, too, not sure. Different arguments can be made against each, I suppose, but did you have a particular argument in mind? Have you experienced corruption with NFS that you can blame on the network, not the CPU/memory/busses of the server and client? I've experienced enough to make me buy stories of corruption in disks, disk interfaces, and memory. but not yet with TCP so I'd like to hear the story as well as the hypothetical argument, if there is one. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
BJ Quinn wrote: True, but a search for zfs segmentation fault returns 500 bugs. It's possible one of those is related to my issue, but it would take all day to find out. If it's not flaky or unstable, I'd like to try upgrading to the newest kernel first, unless my Linux mindset is truly out of place here, or if it's not relatively easy to do. Are these kernels truly considered stable? How would I upgrade? Searching bug databases can be an art... Project Indiana is where notifications of package repository changes are made. b98 is available, with instructions posted recently http://www.opensolaris.org/jive/thread.jspa?threadID=75115&tstart=15 Be sure to read the release notes http://opensolaris.org/os/project/indiana/resources/rn3/image-update/ -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
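For reference, the update path described in that thread boils down to roughly the following on a stock 2008.05 install (read the linked release notes first; the exact package set and repository configuration may differ):

pfexec pkg refresh
pfexec pkg install SUNWipkg      # update the packaging tools themselves first
pfexec pkg image-update          # builds a new boot environment at the newer build
# reboot and choose the new boot environment from the GRUB menu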
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
On Tue, 30 Sep 2008, BJ Quinn wrote: True, but a search for zfs segmentation fault returns 500 bugs. It's possible one of those is related to my issue, but it would take all day to find out. If it's not flaky or unstable, I'd like to try upgrading to the newest kernel first, unless my Linux mindset is truly out of place here, or if it's not relatively easy to do. Are these kernels truly considered stable? How would I upgrade? -- This Linux and Solaris are quite different when it comes to kernel strategies. Linux documents and stabilizes its kernel interfaces while Solaris does not document its kernel interfaces, but focuses on stable shared library interfaces. Most Linux system APIs have a direct kernel API equivalent but Solaris often uses a completely different kernel interface. Segmentation faults in user applications are generally due to user-space bugs rather than due to the kernel. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On Wed, 1 Oct 2008, Nathan Kroenert wrote: zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) A million files in ZFS is no big deal:

% ptime ls -1rt > /dev/null
real       17.277
user        8.992
sys         8.231
% ptime ls -1rt | wc -l
real       17.045
user        8.607
sys         8.413
 100

Maybe the problem is that you need to increase your screen's scroll rate. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Sep 30, 2008, at 19:44, Miles Nordin wrote: There are checksums in the ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC layer checksum, too, not sure. None of which helped Amazon when their S3 service went down due to a flipped bit: More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. http://status.aws.amazon.com/s3-20080720.html ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Pool Question
Hello, I'm looking for info on adding a disk to my current zfs pool. I am running OpenSolaris snv_98. I have upgraded my pool since my image-update. When I installed OpenSolaris it was a machine with 2 hard disks (regular IDE). Is it possible to add the second hard disk to the pool to increase my storage capacity without a raid controller? From what I've found, the command should be zpool add rpool device. Is that right? If so, how do I track down the device name? zpool status tells me my current device (hdd0) is named c3d0s0. Where do I find the other device name? Thanks! -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 7:15 PM, David Magda [EMAIL PROTECTED] wrote: On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). More disks will not solve SATA's problem. I run into this on a daily basis working on enterprise storage. If it's for just archive/storage, or even sequential streaming, it shouldn't be a big deal. If it's random workload, there's pretty much nothing you can do to get around it short of more front-end cache and intelligence which is simply a band-aid, not a fix. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). Hmm, that's actually cool ! If I configure the system with 10 x 400G 10k rpm disk == cost == 13k$ 10 x 1TB SATA 7200 == cost == 9k$ Always assuming 2 spare disks, and Using the sata disks, I would configure them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?), and better data reliability ?? (don't really know about that one) ? Is this a recommended setup ? It looks too good to be true ? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Bob Friesenhahn wrote: On Wed, 1 Oct 2008, Ahmed Kamal wrote: So, I guess this makes them equal. How about a new reliable NFS protocol, that computes the hashes on the client side, sends it over the wire to be written remotely on the zfs storage node ?! Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Less than we'd sometimes like :-) The TCP checksum isn't very strong, and we've seen corruption tied to a broken router, where the Ethernet checksum was recomputed on bad data, and the TCP checksum didn't help. It sucked. Rob T ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Fwd: Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 7:30 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). Hmm, that's actually cool ! If I configure the system with 10 x 400G 10k rpm disk == cost == 13k$ 10 x 1TB SATA 7200 == cost == 9k$ Always assuming 2 spare disks, and Using the sata disks, I would configure them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?), and better data reliability ?? (don't really know about that one) ? Is this a recommended setup ? It looks too good to be true ? I *HIGHLY* doubt you'll see better performance out of the SATA, but it is possible. You don't need 2 spares with SAS, 1 is more than enough with that few disks. I'd suggest doing RAID-Z (raid-5) as well if you've only got 9 data disks. 8+1 is more than acceptable with SAS drives. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Miles Nordin wrote: There are checksums in the ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC layer checksum, too, not sure. Different arguments can be made against each, I suppose, but did you have a particular argument in mind? Have you experienced corruption with NFS that you can blame on the network, not the CPU/memory/busses of the server and client? Absolutely. See my recent post in this thread. The TCP checksum is not that strong, and a router broken the right way can regenerate a correct-looking Ethernet checksum on bad data. krb5i fixed it nicely. Rob T ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On Wed, 1 Oct 2008, Nathan Kroenert wrote: That being said, there is a large delta in your results and mine... If I get a chance, I'll look into it... I suspect it's a cached versus I/O issue... The first time I posted was the first time the directory has been read in well over a month so it was not currently cached. You might find this to be interesting since it shows that the 'rt' options are taking most of the time:

% ptime ls -1 | wc -l
real        5.497
user        4.825
sys         0.654
 100

I will certainly agree that huge directories can cause problems for many applications, particularly ones that access the files over a network. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Tim wrote: On Tue, Sep 30, 2008 at 7:15 PM, David Magda [EMAIL PROTECTED] wrote: On Sep 30, 2008, at 19:09, Tim wrote: SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads. Well, if you can probably afford more SATA drives for the purchase price, you can put them in a striped-mirror set up, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power not withstanding). More disks will not solve SATA's problem. I run into this on a daily basis working on enterprise storage. If it's for just archive/storage, or even sequential streaming, it shouldn't be a big deal. If it's random workload, there's pretty much nothing you can do to get around it short of more front-end cache and intelligence which is simply a band-aid, not a fix. I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA). -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA). I don't really need more performance than what's needed to saturate a gig link (4 sata disks?) So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA). I don't really need more performance than what's needed to saturate a gig link (4 sata disks?) It depends on the disk. A Seagate Barracuda 500 GByte SATA disk is rated at a media speed of 105 MBytes/s which is near the limit of a GbE link. In theory, one disk would be close, two should do it. So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, It's was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 1 Oct 2008, Ahmed Kamal wrote: Always assuming 2 spare disks, and Using the sata disks, I would configure them in raid1 mirror (raid6 for the 400G), Besides being cheaper, I would get more useable space (4TB vs 2.4TB), Better performance of raid1 (right?), and better data reliability ?? (don't really know about that one) ? Is this a recommended setup ? It looks too good to be true ? Using mirrors will surely make up quite a lot for disks with slow seek times. Reliability is acceptable for most purposes. Resilver should be pretty fast. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, It's was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) SAS's main benefits are seek time and max IOPS. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, 30 Sep 2008, Robert Thurlow wrote: Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Less than we'd sometimes like :-) The TCP checksum isn't very strong, and we've seen corruption tied to a broken router, where the Ethernet checksum was recomputed on bad data, and the TCP checksum didn't help. It sucked. TCP does not see the router. The TCP and ethernet checksums are at completely different levels. Routers do not pass ethernet packets. They pass IP packets. Your statement does not make technical sense. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Hm, richard's excellent Graphs here http://blogs.sun.com/relling/tags/mttdl as well as his words say he prefers mirroring over raidz/raidz2 almost always. It's better for performance and MTTDL. Since 8 sata raid1 is cheaper and probably more reliable than 8 raidz2 sas (and I dont need extra sas performance), and offers better performance and MTTDL than 8 sata raidz2, I guess I will go with 8-sata-raid1 then! Hope I'm not horribly mistaken :) On Wed, Oct 1, 2008 at 3:18 AM, Tim [EMAIL PROTECTED] wrote: On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal [EMAIL PROTECTED] wrote: So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, It's was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) SAS's main benefits are seek time and max IOPS. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
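For what it's worth, the two layouts being weighed here would be created along these lines (device names are invented; eight data disks plus one spare):

# striped mirrors ("8 sata raid1"):
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
                  mirror c1t4d0 c1t5d0 mirror c1t6d0 c1t7d0
zpool add tank spare c1t8d0

# the raidz2 alternative, for comparison:
# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0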
Re: [zfs-discuss] Quantifying ZFS reliability
On 30-Sep-08, at 6:31 PM, Tim wrote: On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble [EMAIL PROTECTED] wrote: To make Will's argument more succinct (wink), with a NetApp, undetectable (by the NetApp) errors can be introduced at the HBA and transport layer (FC Switch, slightly damage cable) levels. ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability. Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. ... NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit. This is not end to end protection; they are merely saying the data arrived in the storage subsystem's memory verifiably intact. The data still has a long way to go before it reaches the application. --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
rt == Robert Thurlow [EMAIL PROTECTED] writes: dm == David Magda [EMAIL PROTECTED] writes: dm None of which helped Amazon when their S3 service went down due dm to a flipped bit: ok, I get that S3 went down due to corruption, and that the network checksums I mentioned failed to prevent the corruption. The missing piece is: belief that the corruption occurred on the network rather than somewhere else. Their post-mortem sounds to me as though a bit flipped inside the memory of one server could be spread via this ``gossip'' protocol to infect the entire cluster. The replication and spreadability of the data makes their cluster into a many-terabyte gamma ray detector. I wonder if they even use a meaningful VPN. Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps. Yeah fine, but IP and UDP and Ethernet also have checksums. The one in TCP isn't much fancier. rt The TCP checksum isn't very strong, and we've seen corruption rt tied to a broken router, where the Ethernet checksum was rt recomputed on bad data, and the TCP checksum didn't help. It rt sucked. That's more like what I was looking for. The other concept from your first post of ``protection domains'' is interesting, too (of one domain including ZFS and NFS). Of course, what do you do when you get an error on an NFS client, throw ``stale NFS file handle?'' Even speaking hypothetically, it depends on good exception handling for its value, which has been a big trouble spot for ZFS so far. This ``protection domain'' concept is already enshrined in IEEE 802.1d---bridges are not supposed to recalculate the FCS, and if they need to mangle the packet they're supposed to update the FCS algorithmically based on fancy math and only the bits they changed, not just recalculate it over the whole packet. They state this is to protect against bad RAM inside the bridge. I don't know if anyone DOES that, but it's written into the spec. But if the network is L3, then FCS and IP checksums (ttl decrement) will have to be recalculated, so the ``protection domain'' is partly split leaving only the UDP/TCP checksum contiguous. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain [EMAIL PROTECTED]wrote: * NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.* This is not end to end protection; they are merely saying the data arrived in the storage subsystem's memory verifiably intact. The data still has a long way to go before it reaches the application. --Toby As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Ahmed Kamal wrote: So, performance aside, does SAS have other benefits ? Data integrity ? How would a 8 raid1 sata compare vs another 8 smaller SAS disks in raidz(2) ? Like apples and pomegranates. Both should be able to saturate a GbE link. You're the expert, but isn't the 100M/s for streaming not random read/write. For that, I suppose the disk drops to around 25M/s which is why I was mentioning 4 sata disks. When I was asking for comparing the 2 raids, it was aside from performance, basically sata is obviously cheaper, it will saturate the gig link, so performance yes too, so the question becomes which has better data protection ( 8 sata raid1 or 8 sas raidz2) Good question. Since you are talking about different disks, the vendor specs are different. The 500 GByte Seagate Barracuda 7200.11 I described above is rated with an MTBF of 750,000 hours, even though it comes in either a SATA or SAS interface -- but that isn't so interesting. A 450 GByte Seagate Cheetah 15k.6 (SAS) has a rated MTBF of 1.6M hours. Putting that into RAIDoptimizer we see:

Disk        RAID   MTTDL[1] (yrs)    MTTDL[2] (yrs)
Barracuda   1+0    284,966           5,351
            z2     180,663,117       6,784,904
Cheetah     1+0    1,316,385         126,839
            z2     1,807,134,968     348,249,968

For ZFS, 50% space used, logistical MTTR=24 hours, mirror resync time = 60 GBytes/hr. In general, (2-way) mirrors are single parity, raidz2 is double parity. If you use a triple mirror, then the numbers will be closer to the raidz2 numbers. For explanations of these models, see my blog, http://blogs.sun.com/relling/entry/a_story_of_two_mttdl -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
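To get a feel for where numbers of this shape come from, a back-of-the-envelope version of the single- and double-parity MTTDL models can be run with bc (the MTBF and MTTR inputs below are assumptions and this is not the full RAIDoptimizer model, so expect the right order of magnitude rather than matching digits):

bc -l <<'EOF'
m = 1200000              /* assumed drive MTBF, hours */
r = 24 + 500/60          /* logistical MTTR plus resync of ~500 GB at 60 GB/hr */
n = 8                    /* disks in the vdev */
/* single-parity flavor: MTTF^2 / (N*(N-1)*MTTR), converted to years */
m^2 / (n * (n-1) * r) / 8760
/* double-parity (raidz2) flavor: MTTF^3 / (N*(N-1)*(N-2)*MTTR^2), in years */
m^3 / (n * (n-1) * (n-2) * r^2) / 8760
EOF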
[zfs-discuss] ZFS, NFS and Auto Mounting
I am in the process of beefing up our development environment. In essence I am really going to simply replicate what we have spread across here and there (that's what happens when you keep running out of disk space). Unfortunately, I inherited all of this and the guy who dreamed up the conflagration is lonnnggg gone. So here's the way it works today. There are two top level directories called GroupWS and ReleaseWS. The auto mount map (auto.ws) looks like this:

Upgrades      chekov:/mnt/dsk1/GroupWS/
cstools       chekov:/mnt/dsk1/GroupWS/
com           chekov:/mnt/dsk1/GroupWS
Integration   chekov:/mnt/dsk1/GroupWS/

Everything is fine. Do a cd to /ws/Integration and you are taken to chekov:/mnt/dsk1/GroupWS/Integration. The directory Integration is a real directory that lives in GroupWS. If you cd to /ws/com, you are taken to chekov:/mnt/dsk1/GroupWS and then you can move about as one sees fit. To replicate this in ZFS, I did the following:

1) Parked all of the drives (except c1t0d0 and c1t1d0) into several RAIDZ configurations in a zpool called dpool.
2) Created a file system called dpool/GroupWS and set the mountpoint to /mnt/zfs1/GroupWS. The sharenfs properties were set to sharenfs=rw,log,root=msc-servers.
3) Next I created another file system called dpool/GroupWS/Integration. Its mount point was inherited from GroupWS and is /mnt/zfs1/GroupWS/Integration. Essentially I only allowed the new file system to inherit from its parent.
4) I changed the auto.ws map thusly:

Integration   chekov:/mnt/zfs1/GroupWS/
Upgrades      chekov:/mnt/zfs1/GroupWS/
cstools       chekov:/mnt/zfs1/GroupWS/
com           chekov:/mnt/zfs1/GroupWS

Now the odd behavior. You will notice that the directories Upgrades and cstools are just that. Directories in GroupWS. You can cd /ws/cstools from any server without a problem. Perform an ls and you see what you expect to see. Now the rub. If on chekov, one does a cd /ws/Integration you end up in chekov:/mnt/zfs1/GroupWS/Integration and everything is great. Do a cd to /ws/com and everything is fine. You can do a cd to Integration and everything is fine. But. If you go to another server and do a cd /ws/Integration all is well. However, if you do a cd to /ws/com and then a cd Integration, Integration is EMPTY!! I know this was long winded but it is a strange problem. The workaround is to destroy the dpool/GroupWS/Integration file system and recreate it as a regular directory in GroupWS. But I was hoping to be able to use file systems in this way for snapshot ease. Any ideas? Thanks, Doug -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
On Tue, Sep 30, 2008 at 6:30 PM, Nathan Kroenert [EMAIL PROTECTED] wrote: Actually, the one that'll hurt most is ironically the most closely related to bad database schema design... With a zillion files in the one directory, if someone does an 'ls' in that directory, it'll not only take ages, but steal a whole heap of memory and compute power... Provided the only things that'll be doing *anything* in that directory are using indexed methods, there is no real problem from a ZFS perspective, but if something decides to list (or worse, list and sort) that directory, it won't be that pleasant. Oh - That's of course assuming you have sufficient memory in the system to cache all that metadata somewhere... If you don't then that's another zillion I/O's you need to deal with each time you list the entire directory. an ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to ^^^ Here's your problem! in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.) My box has the ARC limited to about 1GB, so it's obviously undersized for such a workload, but still gives you an indication... I generally look to keep directories to a size that allows the utilities that work on and in it to perform at a reasonable rate... which for the most part is around the 100K files or less... Perhaps you are using larger hardware than I am for some of this stuff? :)

I've seen this problem where Solaris has issues with many files created with this type of file naming pattern. For example, the file naming pattern produced by tmpfile(3C). I saw it originally on a tmpfs and it can be easily reproduced by: [note: I'm writing this from memory - so don't beat me up over specific details]

1) pick a number for the number of files you want to test with (try different numbers - start with 1,500 and then increase it). Call this test#
2) cd /tmp
3) IMPORTANT: Make a test directory for this experiment - let's call it temp
4) cd /tmp/temp (your playground)
5) using your favorite language generate your test# of files using a pattern similar to the one above by (ultimately) calling tmpfile()
6) ptime ls -al; - it will be quick the first time
7) ptime rm * ; - it will be quick the first time
8) repeat steps 5, 6 and 7. Your ptimes will be a little slower
9) repeat steps 5, 6 and 7. Your ptimes will be much slower
10) repeat steps 5, 6 and 7. Your ptimes will be *really* slow. Now you'll understand that you have a problem.
11) repeat 5, 6 and 7 a couple more times. Notice how bad your ptimes are now!
12) look at the size of /tmp/temp using ls -ald /tmp/temp and you'll notice that it has grown substantially.

The larger this directory grows, the slower the filesystem operations will get. This behavior is common to tmpfs, UFS and I tested it on early ZFS releases. I have no idea why - I have not made the time to figure it out. What I have observed is that all operations on your (victim) test directory will max out (100% utilization) one CPU or one CPU core - and all directory operations become single-threaded and limited by the performance of one CPU (or core). Now for the weird part: the *only* way to return everything to normal performance levels (that I've found) is to rmdir the (victim) directory. This is why I recommend you perform this experiment in a subdirectory. If you do it in /tmp - you'll have to reboot the box to get reasonable performance back - and you don't want to do it in your home directory either!!
I'll try to set aside some time tomorrow to re-run this experiment. But I'm nearly sure this is why your directory related file ops are so slow and *dramatically* slower than they should be. This problem/bug is insideous - because using tmpfile() in /tmp is a very common practice and the application(s) using /tmp will slow down dramatically while maxing out (100% utilization) one CPU (or core). And if your system only has a single CPU... :( Let me know what you find out. I know that the file name pattern is what causes this bug to bite bigtime - and not so much the number of files you use to test it. I *suspect* that there might be something like a hash table that is degenerating into a singly linked list as the root cause of this issue. But this is only my WAG. Regards, -- Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED] Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
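A rough transcription of that recipe as a script, for anyone who wants to reproduce it (the file count and names are arbitrary, and a loop of touch calls only approximates tmpfile(3C)-style naming, which is what Al suspects is the trigger):

#!/bin/sh
# Create and remove batches of files in a scratch directory, timing ls and rm
# on each pass and watching the directory itself grow.
mkdir -p /tmp/temp && cd /tmp/temp || exit 1
pass=1
while [ $pass -le 5 ]; do
    i=0
    while [ $i -lt 1500 ]; do
        touch file$$.$pass.$i          # crude stand-in for tmpfile()-generated names
        i=`expr $i + 1`
    done
    echo "pass $pass:"
    ptime ls -al > /dev/null
    ptime rm -f ./file*
    ls -ald /tmp/temp                  # the directory's own size keeps growing across passes
    pass=`expr $pass + 1`
done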
Re: [zfs-discuss] Quantifying ZFS reliability
On 30-Sep-08, at 9:54 PM, Tim wrote: On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain [EMAIL PROTECTED] wrote: NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit. This is not end to end protection; they are merely saying the data arrived in the storage subsystem's memory verifiably intact. The data still has a long way to go before it reaches the application. --Toby As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. ZFS allows the architectural option of separate storage without losing end to end protection, so the distinction is still important. Of course this means ZFS itself runs on the application server, but so what? --Toby That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
Bob Friesenhahn wrote:

On Wed, 1 Oct 2008, Nathan Kroenert wrote: [...] zillion I/Os you need to deal with each time you list the entire directory. An ls -1rt on a directory with about 1.2 million files with names like afile1202899 takes minutes to complete on my box, and we see 'ls' get to in excess of 700MB rss... (and that's not including the memory zfs is using to cache whatever it can.)

A million files in ZFS is no big deal:

But how similar were your file names?

Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pool Question
Josh Hardman wrote:

Hello, I'm looking for info on adding a disk to my current zfs pool. I am running OpenSolaris snv_98, and I have upgraded my pool since my image-update. When I installed OpenSolaris it was a machine with 2 hard disks (regular IDE). Is it possible to add the second hard disk to the pool to increase my storage capacity without a raid controller? From what I've found, the command should be zpool add rpool device. Is that right? If so, how do I track down the device name? zpool status tells me my current device (hdd0) is named c3d0s0. Where do I find the other device name?

Do not try zpool add on your rpool! IIRC, it will not be allowed, but if it were, your system would be unbootable and recovery would be difficult... very uncool. A better idea is to create a new storage pool. Alas, it seems that OpenSolaris 2008.05 does not include the ZFS BUI, so you might need to descend to the command line. format is the command to set up your disk slices (and a gateway to managing partitions). Once you set up a slice, a simple zpool create will do the trick. Many more details are available in the ZFS Administration Guide http://www.opensolaris.org/os/community/zfs/docs/zfsadmin.pdf -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
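A minimal sketch of the create-a-second-pool route; Richard suggests slicing with format first, but for a non-root pool you can also hand ZFS the whole disk as shown here. The device name c4d0 and the pool name datapool are hypothetical - check what format actually reports before running anything destructive:

  echo | format                  # lists the disks the system sees, e.g. c3d0 (rpool) and c4d0
  zpool create datapool c4d0     # whole-disk data pool on the second drive (erases its contents)
  zpool status datapool          # confirm the new pool is ONLINE
  zfs create datapool/data       # optional child filesystem, mounted at /datapool/data by default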
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

First off, there's an issue of design. Wherever possible, end-to-end protection is better (and easier to implement and deploy) than hop-by-hop protection. Hop-by-hop protection implies a lot of trust. Yes, in a NAS you're going to have at least one hop: from the client to the server. But how does the necessity of one hop mean that N hops is fine? One hop is manageable. N hops is a disaster waiting to happen.

Second, NAS is not the only way to access remote storage. There's also SAN (e.g., iSCSI). So you might host a DB on a ZFS pool backed by iSCSI targets. If you do that with a random iSCSI target implementation then you get end-to-end integrity protection regardless of what else the vendor does for you in terms of hop-by-hop integrity protection. And you can even host the target on a ZFS pool, in which case there are two layers of integrity protection - and so some waste of disk space - but you get the benefit of very flexible volume management on both the initiator and the target.

Third, who's to say that end-to-end integrity protection can't possibly be had in a NAS environment? Sure, with today's protocols you can't have it - you can get hop-by-hop protection with at least one hop (see above) - but having end-to-end integrity protection built into the filesystem may enable new NAS protocols that do provide end-to-end protection. (This is a variant of the first point above: good design decisions pay off.)

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
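A minimal sketch of the ZFS-on-iSCSI layering Nico describes, in circa-2008 syntax; the addresses, pool and volume names, size, and the device name the initiator ends up seeing are all hypothetical, so treat this as an outline rather than a recipe:

  # --- storage host: export a ZFS-backed iSCSI LUN (legacy shareiscsi property)
  zfs create -V 200g tank/lun0
  zfs set shareiscsi=on tank/lun0

  # --- application host: discover the LUN and build a pool on top of it
  iscsiadm add discovery-address 192.168.10.5:3260
  iscsiadm modify discovery --sendtargets enable
  devfsadm -i iscsi                            # create device nodes for the new LUN(s)
  echo | format                                # note the new cXtYdZ disk name
  zpool create appdata c2t600144F0XXXXXXXXd0   # hypothetical device; the initiator's ZFS now checksums end to end

With this layering, the pool on the application host checksums every block it reads, so corruption anywhere along the iSCSI path is detected regardless of what the target does internally.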
Re: [zfs-discuss] ZSF Solaris
On Tue, Sep 30, 2008 at 09:44:21PM -0500, Al Hopper wrote: This behavior is common to tmpfs, UFS and I tested it on early ZFS releases. I have no idea why - I have not made the time to figure it out. What I have observed is that all operations on your (victim) test directory will max out (100% utilization) one CPU or one CPU core - and all directory operations become single-threaded and limited by the performance of one CPU (or core).

And sometimes it's just a little bug. E.g. with a recent version of Solaris (i.e. >= snv_95 || >= S10U5) on UFS:

SunOS graf 5.10 Generic_137112-07 i86pc i386 i86pc (X4600, S10U5)
=
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 9.78s 0:29.42 33.4%
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 293.37s 5:13.67 93.5%
admin.graf /var/tmp > rm xx
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 9.92s 0:31.75 31.4%
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 305.15s 5:28.67 92.8%
admin.graf /var/tmp > time dd if=/dev/zero of=xx bs=1k count=2048
2048+0 records in
2048+0 records out
0.00u 298.40s 4:58.46 99.9%
admin.graf /var/tmp > time sh -c 'mkfile 2g xx ; sync'
0.05u 394.06s 6:52.79 95.4%

SunOS kaiser 5.10 Generic_137111-07 sun4u sparc SUNW,Sun-Fire-V440 (S10, U5)
=
admin.kaiser /var/tmp > time mkfile 1g xx
0.14u 5.24s 0:26.72 20.1%
admin.kaiser /var/tmp > time mkfile 1g xx
0.13u 64.23s 1:25.67 75.1%
admin.kaiser /var/tmp > time mkfile 1g xx
0.13u 68.36s 1:30.12 75.9%
admin.kaiser /var/tmp > rm xx
admin.kaiser /var/tmp > time mkfile 1g xx
0.14u 5.79s 0:29.93 19.8%
admin.kaiser /var/tmp > time mkfile 1g xx
0.13u 66.37s 1:28.06 75.5%

SunOS q 5.11 snv_98 i86pc i386 i86pc (U40, S11b98)
=
elkner.q /var/tmp > time mkfile 2g xx
0.05u 3.63s 0:42.91 8.5%
elkner.q /var/tmp > time mkfile 2g xx
0.04u 315.15s 5:54.12 89.0%

SunOS dax 5.11 snv_79a i86pc i386 i86pc (U40, S11b79)
=
elkner.dax /var/tmp > time mkfile 2g xx
0.05u 3.09s 0:43.09 7.2%
elkner.dax /var/tmp > time mkfile 2g xx
0.05u 4.95s 0:43.62 11.4%

Regards, jel. -- Otto-von-Guericke University http://www.cs.uni-magdeburg.de/ Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2 39106 Magdeburg, Germany Tel: +49 391 67 12768 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

I think you'd be surprised how large an organisation can migrate most, if not all, of its application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances?

Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 10:44 PM, Toby Thain [EMAIL PROTECTED] wrote: ZFS allows the architectural option of separate storage without losing end-to-end protection, so the distinction is still important. Of course this means ZFS itself runs on the application server, but so what? --Toby

The "so what" would be that the application has to run on Solaris - and requires a LUN to function. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, Oct 1, 2008 at 12:24 AM, Ian Collins [EMAIL PROTECTED] wrote: Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

I think you'd be surprised how large an organisation can migrate most, if not all, of its application servers to zones on one or two Thumpers. Isn't that the reason for buying in server appliances? Ian

I think you'd be surprised how quickly they'd be fired for putting that much risk into their enterprise. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Tue, Sep 30, 2008 at 11:58 PM, Nicolas Williams [EMAIL PROTECTED] wrote: On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote: As it does in ANY fileserver scenario, INCLUDING zfs. He is building a FILESERVER. This is not an APPLICATION server. You seem to be stuck on this idea that everyone is using ZFS on the server they're running the application. That does a GREAT job of creating disparate storage islands, something EVERY enterprise is trying to get rid of. Not create more of.

First off, there's an issue of design. Wherever possible, end-to-end protection is better (and easier to implement and deploy) than hop-by-hop protection. Hop-by-hop protection implies a lot of trust. Yes, in a NAS you're going to have at least one hop: from the client to the server. But how does the necessity of one hop mean that N hops is fine? One hop is manageable. N hops is a disaster waiting to happen.

Who's talking about N hops? WAFL gives you the exact same number of hops as ZFS.

Second, NAS is not the only way to access remote storage. There's also SAN (e.g., iSCSI). So you might host a DB on a ZFS pool backed by iSCSI targets. If you do that with a random iSCSI target implementation then you get end-to-end integrity protection regardless of what else the vendor does for you in terms of hop-by-hop integrity protection. And you can even host the target on a ZFS pool, in which case there are two layers of integrity protection - and so some waste of disk space - but you get the benefit of very flexible volume management on both the initiator and the target.

I don't recall saying it was. The original poster is talking about a FILESERVER, not iSCSI targets. As off-topic as it is, the current iSCSI target is hardly fully baked or production ready.

Third, who's to say that end-to-end integrity protection can't possibly be had in a NAS environment? Sure, with today's protocols you can't have it - you can get hop-by-hop protection with at least one hop (see above) - but having end-to-end integrity protection built into the filesystem may enable new NAS protocols that do provide end-to-end protection. (This is a variant of the first point above: good design decisions pay off.)

Which would apply to WAFL as well as ZFS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZSF Solaris
Hi Guys, Thanks for so many good comments. Perhaps I got even more than what I asked for! I am targeting 1 million users for my application. My DB will be on a Solaris machine, and the reason I am making one table per user is that it is a simpler design than keeping all the data in a single table; in that case I would need to worry about things like horizontal partitioning, which in turn would require a higher level of management. So for storing 1 million MyISAM tables (MyISAM being a good performer when the data is not very large), I need to save 3 million data files in a single folder on disk - this is the way MyISAM saves data. I will never need to do an ls on this folder. This folder (~database) will be used just by the MySQL engine to execute my SQL queries and fetch me results. And now that ZFS allows me to do this easily, I believe I can go forward with this design. Correct me if I am missing something. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
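Given the naming-pattern slowdown discussed earlier in this thread, it may be worth running a quick synthetic test with MyISAM-style names on a scratch ZFS dataset before committing to one table per user; the path, the count, and the user<N>.frm/.MYD/.MYI names below are placeholders, not the poster's real schema:

  #!/bin/sh
  # Create COUNT fake per-user tables (3 files each) and time basic metadata ops.
  DIR=/tank/myisam-test            # a scratch dataset, not the live MySQL datadir
  COUNT=${1:-10000}                # scale toward the real table count gradually
  mkdir -p $DIR && cd $DIR || exit 1

  i=0
  while [ $i -lt $COUNT ]; do
      touch user$i.frm user$i.MYD user$i.MYI
      i=`expr $i + 1`
  done

  ptime ls -f > /dev/null              # unsorted full listing of the directory
  ptime ls -l user5000.* > /dev/null   # lookup of one user's three files by name

If the per-name lookups stay fast as the count grows, the "never ls this folder" assumption holds; if even name lookups degrade, the design needs a rethink (for example, hashing users into subdirectories).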