Re: [zfs-discuss] zfs fragmentation
On Fri, 2009-08-07 at 19:33, Richard Elling wrote:
> This is very unlikely to be a fragmentation problem. It is a scalability
> problem and there may be something you can do about it in the short term.

You could be right. Our test mail server is the exact same design and same
hardware (SUN4V) in a smaller configuration (less memory and 4 x 25GB SAN
LUNs), and it has a backup/copy throughput of 30GB/hour. The data used for
testing was copied from our production mail server.

Adding another pool and copying all/some data over to it would only be a
short-term solution. I'll have to disagree. What is the point of a filesystem
that can grow to such a huge size and not have functionality built in to
optimize data layout? Real-world implementations of filesystems that are
intended to live for years/decades need this functionality, don't they?

Our mail system works well; only the backup doesn't perform well. All the
features of ZFS that make reads perform well (prefetch, ARC) have little
effect. We think backup is quite important. We do quite a few restores of
months-old data. Snapshots help in the short term, but for longer-term
restores we need to go to tape.

Of course, as you can tell, I'm kinda stuck on this idea that file and
directory fragmentation is causing our issues with the backup. I don't know
how to analyze the pool to better understand the problem.

If we did chop the pool up into, let's say, 7 pools (one for each current
filesystem), then over time these 7 pools would grow and we would end up with
the same issues. That's why it seems to me to be a short-term solution. If
our issues with ZFS are scalability, then you could say ZFS is not scalable.
Is that true? (It certainly is if the solution is to create more pools!)

-- Ed
Re: [zfs-discuss] zfs fragmentation
> Adding another pool and copying all/some data over to it would only be a
> short-term solution. I'll have to disagree. What is the point of a
> filesystem that can grow to such a huge size and not have functionality
> built in to optimize data layout? Real-world implementations of filesystems
> that are intended to live for years/decades need this functionality, don't
> they?
>
> Our mail system works well; only the backup doesn't perform well. All the
> features of ZFS that make reads perform well (prefetch, ARC) have little
> effect. We think backup is quite important. We do quite a few restores of
> months-old data. Snapshots help in the short term, but for longer-term
> restores we need to go to tape.

Your scalability problem may be in your backup solution. The problem is not
how many GB of data you have but the number of files.

It has been a while since I worked with NetWorker, so things may have
changed. If you are doing backups directly to tape you may have a buffering
problem; by simply staging backups on disk we got a lot faster backups.

Have you configured NetWorker to do several simultaneous backups from your
pool? You can do that by having several ZFS filesystems on the same pool, or
by telling NetWorker to do backups one directory level down so that it thinks
you have more file systems. And don't forget to play with the parallelism
settings in NetWorker. This made a huge difference for us on VxFS.
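Something along these lines is what I mean by splitting the pool into more
filesystems so the backup software can run parallel save streams. This is
only a sketch; the pool and dataset names are a guess at your layout:

  # Break one big mail filesystem into several datasets so each one
  # becomes its own mount point / save set / backup stream.
  zfs create space/imap-a
  zfs create space/imap-b
  zfs create space/imap-c
  zfs list -r space

Then raise the parallelism settings in NetWorker until the extra streams stop
helping; the idea is to keep many spindles busy at once instead of one stream
waiting on one small file at a time.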
Re: [zfs-discuss] zfs fragmentation
On Sat, 8 Aug 2009, Ed Spencer wrote:
> What is the point of a filesystem that can grow to such a huge size and not
> have functionality built in to optimize data layout? Real-world
> implementations of filesystems that are intended to live for years/decades
> need this functionality, don't they?

Enterprise storage should work fine without needing to run a tool to optimize
data layout or repair the filesystem. Well-designed software uses an approach
which does not unravel through use.

> Our mail system works well; only the backup doesn't perform well. All the
> features of ZFS that make reads perform well (prefetch, ARC) have little
> effect.

It is already known that ZFS prefetch is often not aggressive enough for bulk
reads, and sometimes gets lost entirely. I think that is the first issue to
resolve in order to get your backups going faster. Many of us here have
already tested our own systems and found that under some conditions ZFS was
offering up only 30MB/second for bulk data reads regardless of how exotic our
storage pool and hardware was.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
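A quick way to see the single-stream ceiling for yourself, assuming you can
find a large file on the pool that is not already in the ARC (the path below
is made up):

  # Time a sequential read of a large, cold file; bytes divided by elapsed
  # seconds gives the per-stream bulk read rate to compare with ~30MB/s.
  ptime dd if=/space/imap/some/large/file of=/dev/null bs=128k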
Re: [zfs-discuss] zfs fragmentation
On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> Many of us here already tested our own systems and found that under some
> conditions ZFS was offering up only 30MB/second for bulk data reads
> regardless of how exotic our storage pool and hardware was.

Just so we are using the same units of measurement: backup/copy throughput on
our development mail server is 8.5MB/sec. The people running our backups
would be overjoyed with that performance. However, backup/copy throughput on
our production mail server is 2.25MB/sec.

The underlying disks are 15000 RPM 146GB FC drives. Our performance may be
hampered somewhat because the LUNs are on a Network Appliance accessed via
iSCSI, but not to the extent that we are seeing, and it does not account for
the throughput difference between the development and production pools.

When I talk about fragmentation it's not in the normal sense. I'm not talking
about blocks in a file not being sequential. I'm talking about files in a
single directory that end up spread across the entire filesystem/pool.

My problem right now is diagnosing the performance issues. I can't address
them without understanding the underlying cause. There is a lack of tools to
help in this area. There is also a lack of acceptance that I'm actually
having a problem with ZFS. It's frustrating.

Anyone know how to significantly increase the performance of a ZFS filesystem
without causing any downtime to an enterprise email system used by 30,000
intolerant people, when you don't really know what is causing the performance
issues in the first place? (Yeah, it sucks to be me!)

-- Ed
Re: [zfs-discuss] zfs fragmentation
On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:
> Your scalability problem may be in your backup solution.

We've eliminated the backup system as being involved with the performance
issues.

The servers are Solaris 10 with the OS on UFS filesystems. (In ZFS terms, the
pool is old/mature.) Solaris has been patched to a fairly current level.
Copying data from the ZFS filesystem to the local UFS filesystem gets the
same throughput as the backup system.

The test was simple. Create a test filesystem on the ZFS pool. Restore
production email data to it. Reboot the server. Back up the data (29 minutes
for 15.8GB of data). Reboot the server. Copy the data from ZFS to UFS using a
'cp -pr ...' command, which also took 29 minutes.

And if anyone is interested, it only took 15 minutes to restore (write) the
15.8GB of data over the network.

-- Ed
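For the record, the test boiled down to something like this (dataset and
target paths are approximate, not the exact names we used):

  # Scratch dataset holding a copy of ~15.8GB of production mail
  zfs create space/backup-test
  # ...restore production email data into /space/backup-test, then
  # reboot so the ARC is cold...

  # Time a plain local copy from ZFS to a UFS filesystem; no backup
  # software or network involved.
  ptime cp -pr /space/backup-test /var/tmp/zfs-copy-test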
Re: [zfs-discuss] zfs fragmentation
On Sat, Aug 8, 2009 at 3:02 PM, Ed Spencer <ed_spen...@umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>> Enterprise storage should work fine without needing to run a tool to
>> optimize data layout or repair the filesystem. Well-designed software uses
>> an approach which does not unravel through use.
>
> Hmmm, this is counter to my understanding. I always thought that to
> optimize sequential read performance you must store the data according to
> how the device will read the data. Spinning rust reads data in a sequential
> fashion. In order to optimize read performance it has to be laid down that
> way. When reading files in a directory, the files need to be laid out on
> the physical device sequentially for optimal read performance. I'm probably
> not the person to argue this point though... is there a DBA around?

The DBAs that I know use files that are at least hundreds of megabytes in
size. Your problem is very different.

> Maybe my problems will go away once we move into the next generation of
> storage devices, SSDs! I'm starting to think that ZFS will really shine on
> SSDs.

Your problem seems to be related to cold reads in a pretty large data set.
With SSDs (L2ARC) you are likely to see a performance boost for a larger set
of recently read files, but my guess is that backups will still be pretty
slow. There is likely more benefit in restore speed with SSDs than there is
in read speeds. However, the NVRAM on the NetApp that is backing your iSCSI
LUNs is probably already giving you most of this benefit (assuming low
latency on network connections).

-- Mike Gerdts
http://mgerdts.blogspot.com/
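If you do go down the SSD road, the L2ARC piece is just a cache device added
to the existing pool. This is a sketch only, with a made-up device name, and
it needs a ZFS version that supports cache devices:

  # Attach an SSD as a read cache (L2ARC) to the pool; cold first reads
  # still hit the backing LUNs, which is why backups may not improve much.
  zpool add space cache c5t2d0
  zpool status space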
Re: [zfs-discuss] zfs fragmentation
On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:
> The DBAs that I know use files that are at least hundreds of megabytes in
> size. Your problem is very different.

Yes, definitely. I'm relating records in a table to my small files because
our email system treats the filesystem as a database. And in the back of my
mind I'm also thinking that you have to rebuild/repair a database once in a
while to improve performance. And in my case, since the filesystem is the
database, I want to do that to ZFS!

At least that's what I'm thinking. However, and I always come back to this,
I'm not certain what is causing my problem. I need certainty before taking
action on the production system.

-- Ed
Re: [zfs-discuss] zfs fragmentation
On Sat, Aug 8, 2009 at 3:25 PM, Ed Spencer <ed_spen...@umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:
>> The DBAs that I know use files that are at least hundreds of megabytes in
>> size. Your problem is very different.
>
> Yes, definitely. I'm relating records in a table to my small files because
> our email system treats the filesystem as a database.

Right... but ZFS doesn't understand your application. The reason that a file
system would put files that are in the same directory in the same general
area on a disk is to minimize seek time. I would argue that seek time doesn't
matter a whole lot here - at least from the vantage point of ZFS.

The LUNs that you have presented from the filer are probably RAID6 across
many disks. ZFS seems to be doing a 4-way stripe (or are you mirroring or
using raidz?). Assuming you are doing something like a 7+2 RAID6 on the back
end, the contents would be spread across 36 drives.[1] The trick to making
this perform well is to have 36 * N worker threads. Mail is a great thing to
keep those spindles kinda busy while getting decent performance. A small
number of sequential readers - particularly with small files where you can't
do a reasonable job with read-ahead - has little chance of keeping that
number of drives busy.

1. Or you might have 4 LUNs presented from one 4+1 RAID5, in which case you
may be forcing more head movement because ZFS thinks it can speed things up
by striping data across the LUNs.

ZFS can recognize a database (or other application) doing a sequential read
on a large file. While data located sequentially on disk can be helpful for
reads, this is much less important when the pool sits across tens of disks.
This is because it has the ability to spread the iops across lots of disks,
potentially reading a heavily fragmented file much faster than a purely
sequential file. In either case, your backup application is competing for
iops (and seeks) with other workload. With the NetApp backend there are
likely other applications on the same aggregate that are forcing head
movement away from any data belonging to these LUNs.

> And in the back of my mind I'm also thinking that you have to
> rebuild/repair the database once in a while to improve performance.

Certainly. Databases become fragmented and are reorganized to fix this.

> And in my case, since the filesystem is the database, I want to do that to
> ZFS! At least that's what I'm thinking. However, and I always come back to
> this, I'm not certain what is causing my problem. I need certainty before
> taking action on the production system.

Most databases are written in such a way that they can be optimized for
sequential reads (table scans) and for backups, whether on raw disk or on a
file system. The more advanced the database is, the more likely it is to ask
the file system to get out of its way and *not* do anything fancy. It seems
that cyrus was optimized for operations that make sense for a mail program
(deliver messages, retrieve messages, delete messages) and nothing else. I
would argue that any application that creates lots of tiny files is not
optimized for backing up using a small number of streams.

-- Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] zfs fragmentation
On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:
> A SSD slog backed by a SAS 15K JBOD array should perform much better than a
> big iSCSI LUN.

Now... yes. We implemented this pool years ago. I believe, back then, the
server would crash if a ZFS drive failed, so we decided to let the NetApp
handle the disk redundancy. It has worked out well.

I've looked at those really nice Sun products adoringly. A 7000 series
appliance would also be a nice addition to our central NFS service, not to
mention more cost effective than expanding our Network Appliance (we have
researchers who are quite hungry for storage, and NFS is always our first
choice). But we now have quite an investment in the current implementation,
and it's difficult to move away from. The NetApp is quite a reliable product.
We are quite happy with ZFS and our implementation. We just need to address
our backup performance and improve it just a little bit!

We were almost lynched this spring because we encountered some pretty severe
ZFS bugs. We are still running the IDR named "A wad of ZFS bug fixes for
Solaris 10 Update 6". It took over a month to resolve the issues. I work at a
university, and final exams and year end occur at the same time. I don't
recommend having email problems during this time! People are intolerant of
email problems.

I live in hope that a NetApp OS update, or a Solaris patch, or a ZFS patch,
or an iSCSI patch, or something will come along that improves our performance
just a bit so our backup people get off my back!

-- Ed
Re: [zfs-discuss] zfs fragmentation
On Sat, 2009-08-08 at 15:05, Mike Gerdts wrote:
> On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer <ed_spen...@umanitoba.ca> wrote:
>> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>>> Many of us here already tested our own systems and found that under some
>>> conditions ZFS was offering up only 30MB/second for bulk data reads
>>> regardless of how exotic our storage pool and hardware was.
>>
>> Just so we are using the same units of measurement: backup/copy throughput
>> on our development mail server is 8.5MB/sec. The people running our
>> backups would be overjoyed with that performance. However, backup/copy
>> throughput on our production mail server is 2.25MB/sec. The underlying
>> disks are 15000 RPM 146GB FC drives. Our performance may be hampered
>> somewhat because the LUNs are on a Network Appliance accessed via iSCSI,
>> but not to the extent that we are seeing, and it does not account for the
>> throughput difference between the development and production pools.
>
> NetApp filers run WAFL - Write Anywhere File Layout. Even if ZFS arranged
> everything perfectly (however that is defined) WAFL would undo its hard
> work. Since you are using iSCSI, I assume that you have disabled the Nagle
> algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat. If not, go do
> that now.

We've tried many different iSCSI parameter changes on our development server:
jumbo frames, disabling Nagle. I'll double check next week on tcp_xmit_hiwat
and tcp_recv_hiwat. Nothing has made any real difference. We are only using
about 5% of the bandwidth on our IP SAN. We use two Cisco ethernet switches
on the IP SAN. The iSCSI initiators use MPxIO in a round-robin configuration.

>> When I talk about fragmentation it's not in the normal sense. I'm not
>> talking about blocks in a file not being sequential. I'm talking about
>> files in a single directory that end up spread across the entire
>> filesystem/pool.
>
> It's tempting to think that if the files were in roughly the same area of
> the block device that ZFS sees, reading the files sequentially would at
> least trigger a read-ahead at the filer. I suspect that even a moderate
> amount of file creation and deletion would cause the I/O pattern to be
> random enough (not purely sequential) that the back-end storage would not
> have a reasonable chance of recognizing it as a good time for read-ahead.
> Further, the backup application is probably in a loop of:
>
>   while there are more files in the directory
>     if next file mtime > last backup time
>       open file
>       read file contents, send to backup stream
>       close file
>     end if
>   end while
>
> In other words, other I/O operations are interspersed between the
> sequential data reads, some files are likely to be skipped, and there is
> latency introduced by writing to the data stream. I would be surprised to
> see any file system do intelligent read-ahead here. In other words, lots of
> small file operations make backups and especially restores go slowly. More
> backup and restore streams will almost certainly help. Multiplex the
> streams so that you can keep your tapes moving at a constant speed.

We backup to disk first and then put to tape later.

> Do you have statistics on network utilization to ensure that you aren't
> stressing it? Have you looked at iostat data to be sure that you are seeing
> asvc_t + wsvc_t that supports the number of operations that you need to
> perform? That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a
> workload that waits for the completion of one I/O before issuing the next
> will max out at 100 iops.
> Presumably ZFS should hide some of this from you[1], but it does suggest
> that each backup stream would be limited to about 100 files per second[2].
> This is because the read request for one file does not happen before the
> close of the previous file[3]. Since cyrus stores each message as a
> separate file, this suggests that 2.5 MB/s corresponds to an average mail
> message size of 25 KB.
>
> 1. via metadata caching, read-ahead on file data reads, etc.
> 2. Assuming wsvc_t + asvc_t = 10 ms
> 3. Assuming that networker is about as smart as tar, zip, cpio, etc.

There is a backup of a single filesystem in the pool going on right now:

# zpool iostat 5 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.05T   965G     97     69  5.24M  2.71M
space       1.05T   965G    113     10  6.41M   996K
space       1.05T   965G    100    112  2.87M  1.81M
space       1.05T   965G    112      8  2.35M  35.9K
space       1.05T   965G    106      3  1.76M  55.1K

Here are examples: iostat -xpn 5 5

                            extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   17.1   29.2  746.7  317.1  0.0  0.6    0.0   12.5   0  27 c4t60A98000433469764E4A2D456A644A74d0
   25.0   11.9  991.9  277.0  0.0  0.6    0.0   16.1   0  36
Re: [zfs-discuss] Fwd: zfs fragmentation
On Sat, 2009-08-08 at 17:25, Mike Gerdts wrote:
> ndd -get /dev/tcp tcp_xmit_hiwat
> ndd -get /dev/tcp tcp_recv_hiwat
> grep tcp-nodelay /kernel/drv/iscsi.conf

# ndd -get /dev/tcp tcp_xmit_hiwat
2097152
# ndd -get /dev/tcp tcp_recv_hiwat
2097152
# grep tcp-nodelay /kernel/drv/iscsi.conf
#

> While backups are running (which is probably all the time given the backup
> rate)
> # look at service times
> iostat -xzn 10

Oh crap. Looks like there are no backup jobs running right now. It must have
just ended.

> # is networker cpu bound?

No. The server is barely tasked by either the email system or networker.

> prstat -mL
> Some indication of how many backup jobs run concurrently would probably
> help frame any future discussion.

I'll get more info on the backups next week when the full backups run.

-- Ed
Re: [zfs-discuss] zfs fragmentation
On Aug 8, 2009, at 5:02 AM, Ed Spencer wrote:
> On Fri, 2009-08-07 at 19:33, Richard Elling wrote:
>> This is very unlikely to be a fragmentation problem. It is a scalability
>> problem and there may be something you can do about it in the short term.
>
> You could be right. Our test mail server is the exact same design and same
> hardware (SUN4V) in a smaller configuration (less memory and 4 x 25GB SAN
> LUNs), and it has a backup/copy throughput of 30GB/hour. The data used for
> testing was copied from our production mail server.
>
> Adding another pool and copying all/some data over to it would only be a
> short-term solution. I'll have to disagree. What is the point of a
> filesystem that can grow to such a huge size and not have functionality
> built in to optimize data layout? Real-world implementations of filesystems
> that are intended to live for years/decades need this functionality, don't
> they?
>
> Our mail system works well; only the backup doesn't perform well. All the
> features of ZFS that make reads perform well (prefetch, ARC) have little
> effect.

The best workload is one that doesn't read from disk to begin with :-)

For workloads with millions of files (eg large-scale mail servers) you will
need to increase the size of the Directory Name Lookup Cache (DNLC). By
default, it is way too small for such workloads. If the directory names are
in cache, then they do not have to be read from disk -- a big win. You can
see how well the DNLC is working by looking at the output of "vmstat -s" and
looking for the total name lookups. You can size the DNLC by tuning the
ncsize parameter, but it requires a reboot. See the Solaris Tunable
Parameters Guide for details:
http://docs.sun.com/app/docs/doc/817-0404/chapter2-35?a=view

I'd like to revisit the backup problem, but that is much more complicated and
probably won't fit in a mail thread very easily (hence, the white paper :-)
-- richard
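As a rough illustration (the ncsize value below is only a placeholder; size
it to the number of files and directories your backups actually traverse):

  # See how the DNLC is doing today; the hit rate is printed on the
  # "total name lookups" line.
  vmstat -s | grep 'name lookups'

  # To enlarge the cache, add a line like this to /etc/system and reboot:
  #   set ncsize = 1000000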
Re: [zfs-discuss] zfs fragmentation
On Sat, 8 Aug 2009, Ed Spencer wrote:
>     r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>    11.9   43.0  528.9  1972.8  0.0  2.1    0.0   38.9   0  31 c4t60A98000433469764E4A2D456A644A74d0
>    17.0   19.6  496.9  1499.0  0.0  1.4    0.0   38.8   0  39 c4t60A98000433469764E4A2D456A696579d0
>    14.0   30.0  670.2  1971.3  0.0  1.7    0.0   38.0   0  34 c4t60A98000433469764E4A476D2F664E4Fd0
>    19.7   28.7  985.2  1647.6  0.0  1.6    0.0   32.5   0  37 c4t60A98000433469764E4A476D2F6B385Ad0

I have this in my /etc/system file:

* Set device I/O maximum concurrency
* http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
set zfs:zfs_vdev_max_pending = 5

This parameter may be worthwhile to look at to reduce your asvc_t. It seems
that the default (35) is tuned for a true JBOD setup and not a SAN-hosted
LUN. As I recall, you can use the kernel debugger to set it while the system
is running and immediately see differences in iostat output.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
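Something like the following is the live change I have in mind; double-check
it against the Evil Tuning Guide before poking a production kernel:

  # Show the current value, then set it to 5 on the running system
  # ("0t" means decimal); watch iostat for the effect on asvc_t.
  echo "zfs_vdev_max_pending/D" | mdb -k
  echo "zfs_vdev_max_pending/W0t5" | mdb -kw
  iostat -xzn 10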
Re: [zfs-discuss] zfs fragmentation
On Sat, Aug 8, 2009 at 20:20, Ed Spencer <ed_spen...@umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:
>> Your scalability problem may be in your backup solution.
>
> We've eliminated the backup system as being involved with the performance
> issues.
>
> The servers are Solaris 10 with the OS on UFS filesystems. (In ZFS terms,
> the pool is old/mature.) Solaris has been patched to a fairly current
> level. Copying data from the ZFS filesystem to the local UFS filesystem
> gets the same throughput as the backup system.
>
> The test was simple. Create a test filesystem on the ZFS pool. Restore
> production email data to it. Reboot the server. Back up the data
> (29 minutes for 15.8GB of data). Reboot the server. Copy the data from ZFS
> to UFS using a 'cp -pr ...' command, which also took 29 minutes.

Yes, that was expected. What happens if you run two cp -pr commands at the
same time? I am guessing that two cp will take almost the same time as one.
If you get twice the throughput from two cp, then you will get twice the
performance from doing two backups in parallel.
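The experiment could look roughly like this (the directory names are
placeholders for whatever your test dataset contains):

  # Two concurrent copies of different subtrees, each timed separately;
  # compare the wall-clock times against the single-stream 29 minutes.
  ptime cp -pr /space/backup-test/dir-a /var/tmp/copy-a &
  ptime cp -pr /space/backup-test/dir-b /var/tmp/copy-b &
  wait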
Re: [zfs-discuss] zfs-discuss Digest, Vol 46, Issue 50
Does the DNLC even play a part in ZFS, or are the docs out of date?

   "Defines the number of entries in the directory name look-up cache
   (DNLC). This parameter is used by UFS and NFS to cache elements of path
   names that have been resolved."

No mention of ZFS. I noticed that when discussing it with a customer of mine.

> The best workload is one that doesn't read from disk to begin with :-)
>
> For workloads with millions of files (eg large-scale mail servers) you will
> need to increase the size of the Directory Name Lookup Cache (DNLC). By
> default, it is way too small for such workloads. If the directory names are
> in cache, then they do not have to be read from disk -- a big win. You can
> see how well the DNLC is working by looking at the output of "vmstat -s"
> and looking for the total name lookups. You can size the DNLC by tuning the
> ncsize parameter, but it requires a reboot. See the Solaris Tunable
> Parameters Guide for details:
> http://docs.sun.com/app/docs/doc/817-0404/chapter2-35?a=view
Re: [zfs-discuss] zfs-discuss Digest, Vol 46, Issue 50
Allen Eastwood wrote:
> Does the DNLC even play a part in ZFS, or are the docs out of date?
>
>    "Defines the number of entries in the directory name look-up cache
>    (DNLC). This parameter is used by UFS and NFS to cache elements of path
>    names that have been resolved."
>
> No mention of ZFS. I noticed that when discussing it with a customer of
> mine.

Yes, ZFS uses the DNLC as well.

-tim
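One way to confirm it on a running box (kstat names as I remember them; worth
verifying on your release):

  # DNLC hit/miss counters; watch them move while walking a ZFS directory
  # tree, e.g. with "du -s /space/imap".
  kstat -n dnlcstats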