Re: [zfs-discuss] ZFS/WAFL lawsuit
It's Columbia Pictures v. Bunnell:

    http://www.eff.org/legal/cases/torrentspy/columbia_v_bunnell_magistrate_order.pdf

The Register syndicated a SecurityFocus article that summarizes the potential impact of the court decision:

    http://www.theregister.co.uk/2007/08/08/litigation_data_retention/

-j

On Thu, Sep 06, 2007 at 08:14:56PM +0200, [EMAIL PROTECTED] wrote:
> > It really is a shot in the dark at this point; you never know what will
> > happen in court (take the example of the recent court decision that all
> > data in RAM be held for discovery -- ?!WHAT, HEAD HURTS!?). But at the
> > end of the day, if you waited for a sure bet on any technology or
> > potential patent dispute, you would never implement anything.
>
> Do you have a reference for "all data in RAM must be held"? I guess we
> need to build COW RAM as well.
>
> Casper
Re: [zfs-discuss] Extremely long creat64 latencies on highly utilized zpools
You might also consider taking a look at this thread:

    http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041760.html

Although I'm not certain, this sounds a lot like the other pool fragmentation issues.

-j

On Wed, Aug 15, 2007 at 01:11:40AM -0700, Yaniv Aknin wrote:
> Hello friends,
>
> I've recently seen a strange phenomenon with ZFS on Solaris 10u3 and was
> wondering if someone may have more information.
>
> The system uses several zpools, each a bit under 10T, each containing one
> zfs with lots and lots of small files (way too many, about 100m files and
> 75m directories). I have absolutely no control over the directory
> structure, and believe me I tried to change it. Filesystem usage patterns
> are create and read, never delete and never rewrite.
>
> When volumes approach 90% usage, and under medium/light load (zpool
> iostat reports 50mb/s and 750iops reads), some creat64 system calls take
> over 50 seconds to complete (observed with 'truss -D touch'). When doing
> manual tests, I've seen similar times on unlink() calls (truss -D rm).
> I'd like to stress that this happens on /some/ of the calls, maybe every
> 100th manual call (I scripted the test), which (along with normal system
> operations) would probably be every 10,000th or 100,000th call. Other
> system parameters (memory usage, loadavg, process count, etc.) appear
> nominal. The machine is an NFS server, though the crazy latencies were
> observed both locally and remotely.
>
> What would you suggest to further diagnose this? Has anyone seen trouble
> with high utilization and medium load, with or without an insanely high
> file count?
>
> Many thanks in advance,
>
>  - Yaniv
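For anyone wanting to quantify the outliers without truss overhead, here is a minimal DTrace sketch (the aggregation label is just illustrative) that histograms creat64 latency system-wide:

    # dtrace -n '
        syscall::creat64:entry { self->ts = timestamp; }
        syscall::creat64:return /self->ts/
        {
                @["creat64 latency (ns)"] = quantize(timestamp - self->ts);
                self->ts = 0;
        }'

Left running across one of the slow periods, the quantize() histogram should make it obvious whether the 50-second calls are a separate population or just the tail of one distribution.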
Re: [zfs-discuss] is send/receive incremental
You can do it either way. Eric Kustarz has a good explanation of how to set up incremental send/receive on your laptop; the description is on his blog:

    http://blogs.sun.com/erickustarz/date/20070612

The technique he uses is applicable to any ZFS filesystem.

-j

On Wed, Aug 08, 2007 at 04:44:16PM -0600, Peter Baumgartner wrote:
> I'd like to send a backup of my filesystem offsite nightly using zfs
> send/receive. Are those done incrementally, so only changes move, or
> would a full copy get shuttled across every time?
>
> --
> Pete
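As a rough sketch of the nightly cycle (the pool, host, and snapshot names here are illustrative, not taken from Eric's post):

    # zfs snapshot tank/data@2007-08-08
    # zfs send tank/data@2007-08-08 | ssh backuphost zfs receive backup/data

On each following night, send only the delta:

    # zfs snapshot tank/data@2007-08-09
    # zfs send -i tank/data@2007-08-08 tank/data@2007-08-09 | \
            ssh backuphost zfs receive backup/data

The first full send moves everything; each later send -i moves only the blocks that changed between the two snapshots.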
Re: [zfs-discuss] si3124 controller problem and fix (fwd)
In an attempt to speed up progress on some of the si3124 bugs that Roger reported, I've created a workspace with the fixes for:

    6565894 sata drives are not identified by si3124 driver
    6566207 si3124 driver loses interrupts.

I'm attaching a driver which contains these fixes, as well as a diff of the changes I used to produce them. I don't have access to a si3124 chipset, unfortunately. Would somebody be able to review these changes and try the new driver on a si3124 card?

Thanks,

-j

On Tue, Jul 17, 2007 at 02:39:00AM -0700, Nigel Smith wrote:
> You can see the status of the bug here:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6566207
> Unfortunately, it's showing no progress since 20th June.
> This fix really could do to be in place for S10u4 and snv_70.
> Thanks
> Nigel Smith

[attachment: si3124.tar.gz]

Index: usr/src/uts/common/io/sata/adapters/si3124/si3124.c
--- /ws/onnv-clone/usr/src/uts/common/io/sata/adapters/si3124/si3124.c  Mon Nov 13 23:20:01 2006
+++ /export/johansen/si-fixes/usr/src/uts/common/io/sata/adapters/si3124/si3124.c  Tue Jul 17 14:37:17 2007
@@ -22,11 +22,11 @@
 /*
  * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */

-#pragma ident   "@(#)si3124.c   1.4     06/11/14 SMI"
+#pragma ident   "@(#)si3124.c   1.5     07/07/17 SMI"

 /*
  * SiliconImage 3124/3132 sata controller driver
@@ -381,11 +381,11 @@
 extern struct mod_ops mod_driverops;

 static struct modldrv modldrv = {
        &mod_driverops,         /* driverops */
-       "si3124 driver v1.4",
+       "si3124 driver v1.5",
        &sictl_dev_ops,         /* driver ops */
 };

 static struct modlinkage modlinkage = {
        MODREV_1,
@@ -2808,10 +2808,13 @@
        si_portp = si_ctlp->sictl_ports[port];
        mutex_enter(&si_portp->siport_mutex);

        /* Clear Port Reset. */
        ddi_put32(si_ctlp->sictl_port_acc_handle,
+           (uint32_t *)PORT_CONTROL_SET(si_ctlp, port),
+           PORT_CONTROL_SET_BITS_PORT_RESET);
+       ddi_put32(si_ctlp->sictl_port_acc_handle,
            (uint32_t *)PORT_CONTROL_CLEAR(si_ctlp, port),
            PORT_CONTROL_CLEAR_BITS_PORT_RESET);

        /*
         * Arm the interrupts for: Cmd completion, Cmd error,
@@ -3509,16 +3512,16 @@
            port);

        if (port_intr_status & INTR_COMMAND_COMPLETE) {
                (void) si_intr_command_complete(si_ctlp, si_portp,
                    port);
-       }
-
+       } else {
                /* Clear the interrupts */
                ddi_put32(si_ctlp->sictl_port_acc_handle,
                    (uint32_t *)(PORT_INTERRUPT_STATUS(si_ctlp, port)),
                    port_intr_status & INTR_MASK);
+       }

        /*
         * Note that we did not clear the interrupt for command
         * completion interrupt.  Reading of slot_status takes care
         * of clearing the interrupt for command completion case.
Re: [zfs-discuss] Re: [storage-discuss] NCQ performance
> When sequential I/O is done to the disk directly there is no performance
> degradation at all.

All filesystems impose some overhead compared to the rate of raw disk I/O. It's going to be hard to store data on a disk unless some kind of filesystem is used. All the tests that Eric and I have performed show regressions for multiple sequential I/O streams. If you have data that shows otherwise, please feel free to share.

> [I]t does not take any additional time in ldi_strategy(),
> bdev_strategy(), mv_rw_dma_start(). In some instances it actually takes
> less time. The only thing that sometimes takes additional time is
> waiting for the disk I/O.

Let's be precise about what was actually observed. Eric and I saw increased service times for the I/O on devices with NCQ enabled when running multiple sequential I/O streams. Everything that we observed indicated that it actually took the disk longer to service requests when many sequential I/Os were queued.

-j
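For anyone who wants to reproduce this, the service times in question are the asvc_t column (with actv showing the queue depth) in something like:

    # iostat -xn 5

Running several concurrent sequential dd streams with NCQ enabled and then disabled, and comparing asvc_t for the same device under both configurations, is one way to see the effect described above.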
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
> > *sata_hba_list::list sata_hba_inst_t satahba_next | ::print
> > sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* |
> > ::grep .!=0 | ::print sata_cport_info_t cport_devp.cport_sata_drive |
> > ::print -a sata_drive_info_t satadrv_features_support satadrv_settings
> > satadrv_features_enabled
>
> This gives me "mdb: failed to dereference symbol: unknown symbol name."

You may not have the SATA module installed. If you type:

    > ::modinfo ! grep sata

and don't get any output, your sata driver is attached some other way. My apologies for the confusion.

-K
Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
At Matt's request, I did some further experiments and have found that this appears to be particular to your hardware. This is not a general 32-bit problem. I re-ran this experiment on a 1-disk pool using a 32- and a 64-bit kernel and got identical results:

64-bit
======

$ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       20.1
user        0.0
sys         1.2

(62 MB/s)

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       19.0
user        0.0
sys         2.6

(65 MB/s)

32-bit
======

$ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       20.1
user        0.0
sys         1.7

(62 MB/s)

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       19.1
user        0.0
sys         4.3

(65 MB/s)

-j

On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
> Marko Milisavljevic wrote:
> > now lets try: set zfs:zfs_prefetch_disable=1
> >
> > bingo!
> >
> >     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >   609.0    0.0 77910.0    0.0  0.0  0.8    0.0    1.4   0  83 c0d0
> >
> > only 1-2% slower than dd from /dev/dsk. Do you think this is a general
> > 32-bit problem, or specific to this combination of hardware?
>
> I suspect that it's fairly generic, but more analysis will be necessary.
>
> > Finally, should I file a bug somewhere regarding prefetch, or is this
> > a known issue?
>
> It may be related to 6469558, but yes, please do file another bug
> report. I'll have someone on the ZFS team take a look at it.
>
> --matt
Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko,

Matt and I discussed this offline some more, and he had a couple of ideas about double-checking your hardware. It looks like your controller (or disks, maybe?) is having trouble with multiple simultaneous I/Os to the same disk, and it looks like prefetch aggravates this problem. When I asked Matt what we could do to verify that it's the number of concurrent I/Os that is causing performance to be poor, he had the following suggestions (a concrete /etc/system sketch appears below):

    set zfs_vdev_{min,max}_pending=1 and run with prefetch on; then iostat
    should show 1 outstanding I/O, and performance should be good.

    or, turn prefetch off and have multiple threads reading concurrently;
    then iostat should show multiple outstanding I/Os, and performance
    should be bad.

Let me know if you have any additional questions.

-j

On Wed, May 16, 2007 at 11:38:24AM -0700, [EMAIL PROTECTED] wrote:
> At Matt's request, I did some further experiments and have found that
> this appears to be particular to your hardware. This is not a general
> 32-bit problem.
> [snip]
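As a sketch of the first suggestion, the tunables can be set in /etc/system and picked up at the next reboot (tunable names as of this era of ZFS; double-check them on your build before relying on this):

    set zfs:zfs_vdev_min_pending = 1
    set zfs:zfs_vdev_max_pending = 1

With these in place and prefetch left on, iostat -xn should show actv staying at about 1 for the pool's disks.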
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
> Each drive is freshly formatted with one 2G file copied to it.

How are you creating each of these files? Also, would you please include the output from the isalist(1) command?

> These are snapshots of iostat -xnczpm 3 captured somewhere in the middle
> of the operation.

Have you double-checked that this isn't a measurement problem, by measuring zfs with zpool iostat (see zpool(1M)) and verifying that the outputs from both iostats match?

> single drive, zfs file
>
>     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   258.3    0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1
>
> Now that is odd. Why so much waiting? Also, unlike with raw or UFS,
> kr/s / r/s gives 128K, as I would imagine it should.

Not sure. If we can figure out why ZFS is slower than raw disk access in your case, it may explain why you're seeing these results.

> What if we read a UFS file from the PATA disk and ZFS from SATA:
>
>     r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
>   224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0
>
> Now that is confusing! Why did SATA/ZFS slow down too? I've retried this
> a number of times; it's not a fluke.

This could be cache interference. ZFS and UFS use different caches. How much memory is in this box?

> I have no idea what to make of all this, except that ZFS has a problem
> with this hardware/drivers that UFS and other traditional file systems
> don't. Is it a bug in the driver that ZFS is inadvertently exposing? A
> specific feature that ZFS assumes the hardware to have, but it doesn't?
> Who knows!

This may be a more complicated interaction than just ZFS and your hardware. There are a number of layers of drivers underneath ZFS that may also be interacting with your hardware in an unfavorable way.

If you'd like to do a little poking with MDB, we can see the features that your SATA disks claim they support. As root, type "mdb -k", and then at the prompt that appears, enter the following command (this is one very long line):

    *sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | ::grep .!=0 | ::print sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t satadrv_features_support satadrv_settings satadrv_features_enabled

This should show satadrv_features_support, satadrv_settings, and satadrv_features_enabled for each SATA disk on the system. The values for these variables are defined in:

    http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/sata/impl/sata.h

This is the relevant snippet for interpreting the values (a worked decode follows below):

    /*
     * Device feature_support (satadrv_features_support)
     */
    #define SATA_DEV_F_DMA                  0x01
    #define SATA_DEV_F_LBA28                0x02
    #define SATA_DEV_F_LBA48                0x04
    #define SATA_DEV_F_NCQ                  0x08
    #define SATA_DEV_F_SATA1                0x10
    #define SATA_DEV_F_SATA2                0x20
    #define SATA_DEV_F_TCQ                  0x40    /* Non NCQ tagged queuing */

    /*
     * Device features enabled (satadrv_features_enabled)
     */
    #define SATA_DEV_F_E_TAGGED_QING        0x01    /* Tagged queuing enabled */
    #define SATA_DEV_F_E_UNTAGGED_QING      0x02    /* Untagged queuing enabled */

    /*
     * Drive settings flags (satdrv_settings)
     */
    #define SATA_DEV_READ_AHEAD             0x0001  /* Read Ahead enabled */
    #define SATA_DEV_WRITE_CACHE            0x0002  /* Write cache ON */
    #define SATA_DEV_SERIAL_FEATURES        0x8000  /* Serial ATA feat. enabled */
    #define SATA_DEV_ASYNCH_NOTIFY          0x2000  /* Asynch-event enabled */

This may give us more information if this is indeed a problem with hardware/drivers supporting the right features.
-j
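As a worked example of decoding the mdb output above (the value here is made up for illustration), a satadrv_features_support of 0x2f breaks down as:

    0x2f = 0x01 | 0x02 | 0x04 | 0x08 | 0x20
         = DMA | LBA28 | LBA48 | NCQ | SATA2

and, similarly, a satadrv_settings of 0x8003 would mean read-ahead and write cache enabled, with serial ATA features on.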
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
This certainly isn't the case on my machine.

$ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real        1.3
user        0.0
sys         1.2

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       22.3
user        0.0
sys         2.2

This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool. My pool is configured as a 46-disk RAID-0 stripe; I'm going to omit the zpool status output for the sake of brevity.

> What I am seeing is that ZFS performance for sequential access is about
> 45% of raw disk access, while UFS (as well as ext3 on Linux) is around
> 70%. For a workload consisting mostly of reading large files
> sequentially, it would seem then that ZFS is the wrong tool
> performance-wise. But it could be just my setup, so I would appreciate
> more data points.

This isn't what we've observed in much of our performance testing. It may be a problem with your config, although I'm not an expert on storage configurations. Would you mind providing more details about your controller, disks, and machine setup?

-j
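For gathering those details, something along these lines usually suffices (a sketch; exact commands and paths vary by platform):

    # prtconf -D | head            memory size plus bound drivers
    # cfgadm -al                   controller and disk attachment points
    # zpool status                 pool layout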
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko,

I tried this experiment again using 1 disk and got nearly identical times:

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.4
user        0.0
sys         2.4

$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.0
user        0.0
sys         0.7

> [I]t is not possible for dd to meaningfully access multiple-disk
> configurations without going through the file system. I find it curious
> that there is such a large slowdown by going through the file system
> (with a single drive configuration), especially compared to UFS or ext3.

Comparing a filesystem to raw dd access isn't a completely fair comparison either. Few filesystems actually lay out all of their data and metadata so that every read is a completely sequential read.

> I simply have a small SOHO server, and I am trying to evaluate which OS
> to use to keep a redundant disk array. With unreliable consumer-level
> hardware, ZFS and the checksum feature are very interesting and the
> primary selling point compared to a Linux setup, for as long as ZFS can
> generate enough bandwidth from the drive array to saturate a single
> gigabit ethernet.

I would take Bart's recommendation and go with Solaris on something like a dual-core box with 4 disks.

> My hardware at the moment is the wrong choice for Solaris/ZFS - PCI 3114
> SATA controller on a 32-bit AthlonXP, according to many posts I found.

Bill Moore lists some controller recommendations here:

    http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

> However, since dd over raw disk is capable of extracting 75+MB/s from
> this setup, I keep feeling that surely I must be able to get at least
> that much from reading a pair of striped or mirrored ZFS drives. But I
> can't - single drive or 2-drive stripes or mirrors, I only get around
> 34MB/s going through ZFS. (I made sure the mirror was rebuilt, and I
> resilvered the stripes.)

Maybe this is a problem with your controller? What happens when you have two simultaneous dd's to different disks running? This would simulate the case where you're reading from the two disks at the same time (a sketch follows below).

-j
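As a sketch of that test (device names are illustrative; substitute your two drives):

    # dd if=/dev/dsk/c0d0p0 of=/dev/null bs=128k count=10000 &
    # dd if=/dev/dsk/c0d1p0 of=/dev/null bs=128k count=10000 &
    # wait

If the combined throughput of the two streams falls well below the sum of the single-stream rates, the controller is the likely bottleneck.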
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
A couple more questions here.

> [mpstat]
>
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
>   0    0   0 3109  3616  316 1965   17   48   45    0   245   0  85   0  15
>   1    0   0 3127  3797  592 2174   17   63   46    0   176   0  84   0  15
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
>   0    0   0 3051  3529  277 2012   14   25   48    0   216   0  83   0  17
>   1    0   0 3065  3739  606 1952   14   37   47    0   153   0  82   0  17
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
>   0    0   0 3011  3538  316 2423   26   16   52    0   202   0  81   0  19
>   1    0   0 3019  3698  578 2694   25   23   56    0   309   0  83   0  17
>
> # lockstat -kIW -D 20 sleep 30
>
> Profiling interrupt: 6080 events in 31.341 seconds (194 events/sec)
>
> Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
> ----------------------------------------------------------------------
>  2068  34%  34% 0.00     1767 cpu[0]           deflate_slow
>  1506  25%  59% 0.00     1721 cpu[1]           longest_match
>  1017  17%  76% 0.00     1833 cpu[1]           mach_cpu_idle
>   454   7%  83% 0.00     1539 cpu[0]           fill_window
>   215   4%  87% 0.00     1788 cpu[1]           pqdownheap
> [snip]

What do you have zfs compression set to? The gzip level is tunable, according to zfs set, anyway:

    PROPERTY     EDIT  INHERIT  VALUES
    compression  YES   YES      on | off | lzjb | gzip | gzip-[1-9]

You still have idle time in this lockstat (and mpstat). What do you get for a "lockstat -A -D 20 sleep 30"? Do you see anyone with long lock hold times, long sleeps, or excessive spinning?

The largest numbers from mpstat are for interrupts and cross calls. What does intrstat(1M) show (an example follows below)? Have you run dtrace to determine the most frequent cross-callers?

    #!/usr/sbin/dtrace -s

    sysinfo:::xcalls
    {
            @a[stack(30)] = count();
    }

    END
    {
            trunc(@a, 30);
    }

is an easy way to do this.

-j
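On the intrstat question, a simple starting point is five-second samples:

    # intrstat 5

which break interrupt activity out per device and per CPU, and should show whether the interrupt load correlates with the cross-call load reported by the DTrace script above.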
Re: [zfs-discuss] C'mon ARC, stay small...
This seems a bit strange. What's the workload, and also, what's the output for:

    > ARC_mru::print size lsize
    > ARC_mfu::print size lsize

and

    > ARC_anon::print size

For obvious reasons, the ARC can't evict buffers that are in use. Buffers that are available to be evicted should be on the mru or mfu list, so this output should be instructive.

-j

On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
>
> > arc::print -tad
> {
> . . .
>     c02e29e8 uint64_t size = 0t10527883264
>     c02e29f0 uint64_t p = 0t16381819904
>     c02e29f8 uint64_t c = 0t1070318720
>     c02e2a00 uint64_t c_min = 0t1070318720
>     c02e2a08 uint64_t c_max = 0t1070318720
> . . .
>
> Perhaps c_max does not do what I think it does?
>
> Thanks,
> /jim
>
> Jim Mauro wrote:
> > Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> > (update 3). All file IO is mmap(file), read memory segment, unmap,
> > close. Tweaked the arc size down via mdb to 1GB. I used that value
> > because c_min was also 1GB, and I was not sure if c_max could be
> > larger than c_min... Anyway, I set c_max to 1GB. After a workload run:
> >
> > > arc::print -tad
> > {
> > . . .
> >     c02e29e8 uint64_t size = 0t3099832832
> >     c02e29f0 uint64_t p = 0t16540761088
> >     c02e29f8 uint64_t c = 0t1070318720
> >     c02e2a00 uint64_t c_min = 0t1070318720
> >     c02e2a08 uint64_t c_max = 0t1070318720
> > . . .
> >
> > size is at 3GB, with c_max at 1GB. What gives? I'm looking at the code
> > now, but was under the impression c_max would limit ARC growth.
> > Granted, it's not a factor of 10, and it's certainly much better than
> > the out-of-the-box growth to 24GB (this is a 32GB x4500), so clearly
> > ARC growth is being limited, but it still grew to 3X c_max.
> >
> > Thanks,
> > /jim
Re: [zfs-discuss] C'mon ARC, stay small...
Gar. This isn't what I was hoping to see. Buffers that aren't available for eviction aren't counted in lsize. It looks like the MRU has grown to 10GB, and most of it could be successfully evicted.

The calculation for determining whether we evict from the MRU is in arc_adjust() and looks something like:

    top_sz = ARC_anon.size + ARC_mru.size

Then, if top_sz > arc.p and ARC_mru.lsize > 0, we evict the smaller of ARC_mru.lsize and top_sz - arc.p.

In your previous message it looks like arc.p is > (ARC_mru.size + ARC_anon.size). It might make sense to double-check these numbers together, so when you check size and lsize again, also check arc.p.

How/when did you configure arc_c_max? arc.p is supposed to be initialized to half of arc.c. Also, I assume that there's a reliable test case for reproducing this problem?

Thanks,

-j

On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
>
> > ARC_mru::print -d size lsize
> size = 0t10224433152
> lsize = 0t10218960896
> > ARC_mfu::print -d size lsize
> size = 0t303450112
> lsize = 0t289998848
> > ARC_anon::print -d size
> size = 0
>
> So it looks like the MRU is running at 10GB... What does this tell us?
>
> Thanks,
> /jim
> [snip]
Re: [zfs-discuss] C'mon ARC, stay small...
Something else to consider: depending upon how you set arc_c_max, you may just want to set arc_c and arc_p at the same time. If you try setting arc_c_max, then setting arc_c to arc_c_max, and then setting arc_p to arc_c / 2, do you still get this problem? (A sketch of doing this from mdb follows below.)

-j

On Thu, Mar 15, 2007 at 05:18:12PM -0700, [EMAIL PROTECTED] wrote:
> Gar. This isn't what I was hoping to see. Buffers that aren't available
> for eviction aren't counted in lsize. It looks like the MRU has grown to
> 10GB, and most of it could be successfully evicted.
> [snip]
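A sketch of setting all three at once from mdb (the addresses below are placeholders; read the real ones from your own "arc::print -a" output, and note that 0x40000000 is 1GB):

    # mdb -kw
    > arc::print -a c c_max p
    c02e29f8 c = ...
    c02e2a08 c_max = ...
    c02e29f0 p = ...
    > c02e2a08/Z 0x40000000
    > c02e29f8/Z 0x40000000
    > c02e29f0/Z 0x20000000

The three /Z writes set c_max and c to 1GB and p to half of that; /Z writes an 8-byte value at the given address.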
Re: [zfs-discuss] C'mon ARC, stay small...
I suppose I should have been more forward about making my last point. If arc_c_max isn't set in /etc/system, I don't believe that the ARC will initialize arc.p to the correct value. I could be wrong about this; however, the next time you set c_max, set c to the same value as c_max and set p to half of c. Let me know whether or not this addresses the problem.

-j

> > How/when did you configure arc_c_max?
>
> Immediately following a reboot, I set arc.c_max using mdb, then verified
> by reading the arc structure again.
>
> > arc.p is supposed to be initialized to half of arc.c. Also, I assume
> > that there's a reliable test case for reproducing this problem?
>
> Yep. I'm using a x4500 in-house to sort out performance of a customer
> test case that uses mmap. We acquired the new DIMMs to bring the x4500
> to 32GB, since the workload has a 64GB working set size, and we were
> clobbering a 16GB thumper. We wanted to see how doubling memory may
> help.
>
> I'm trying to clamp the ARC size because for mmap-intensive workloads,
> it seems to hurt more than help (although, based on experiments up to
> this point, it's not hurting a lot).
>
> I'll do another reboot, and run it all down for you serially...
>
> /jim
> [snip]
Re: [zfs-discuss] understanding zfs/thumper bottlenecks?
> it seems there isn't an algorithm in ZFS that detects sequential writes;
> in a traditional fs such as ufs, one would trigger directio.

There is no directio for ZFS. Are you encountering a situation in which you believe directio support would improve performance? If so, please explain.

-j
Re: [zfs-discuss] ZFS multi-threading
> Would the logic behind ZFS take full advantage of a heavily multicored
> system, such as on the Sun Niagara platform? Would it utilize all of the
> 32 concurrent threads for generating its checksums? Has anyone compared
> ZFS on a Sun Tx000 to that of a 2-4 thread x64 machine?

Pete and I are working on resolving ZFS scalability issues with Niagara and StarCat right now. I'm not sure if any official numbers about ZFS performance on Niagara have been published.

As far as concurrent threads generating checksums goes, the system doesn't work quite the way you have postulated. The checksum is generated in the ZIO_STAGE_CHECKSUM_GENERATE pipeline stage for writes, and verified in the ZIO_STAGE_CHECKSUM_VERIFY pipeline stage for reads. Whichever thread happens to advance the pipeline to the checksum generate stage is the thread that actually performs the work. ZFS does not break the work of the checksum into chunks and have multiple CPUs perform the computation. However, it is possible to have concurrent writes simultaneously in the checksum_generate stage.

More details about this can be found in zfs/zio.c and zfs/sys/zio_impl.h.

-j
Re: [zfs-discuss] Re: ZFS direct IO
> And this feature is independent of whether or not the data is DMA'ed
> straight into the user buffer.

I suppose so; however, it seems like it would make more sense to configure a dataset property that specifically describes the caching policy that is desired. When directio implies different semantics for different filesystems, customers are going to get confused.

> The other feature is to avoid a bcopy by DMAing full filesystem block
> reads straight into the user buffer (and verifying the checksum after).
> The I/O is high latency; bcopy adds a small amount. The kernel memory
> can be freed/reused straight after the user read completes. This is
> where I ask, how much CPU is lost to the bcopy in workloads that
> benefit from DIO?

Right, except that if we try to DMA into user buffers with ZFS, there's a bunch of other things we need the VM to do on our behalf to protect the integrity of the kernel data that's living in user pages. Assume you have a high-latency I/O and you've locked some user pages for this I/O. In a pathological case, when another thread tries to access the locked pages and then also blocks, it does so for the duration of the first thread's I/O. At that point, it seems like it might be easier to accept the cost of the bcopy instead of blocking another thread. I'm not even sure how to assess the impact of the VM operations required to change the permissions on the pages before we start the I/O.

> The quickest return on investment I see for the directio hint would be
> to tell ZFS not to grow the ARC when servicing such requests.

Perhaps if we had an option that specifies not to cache data from a particular dataset, that would suffice. I think you've filed a CR along those lines already (6429855)?

-j
Re: [zfs-discuss] Re: ZFS direct IO
> Basically speaking - there needs to be some sort of strategy for
> bypassing the ARC, or even parts of the ARC, for applications that may
> need to advise the filesystem of either:
>
> 1) the delicate nature of imposing additional buffering for their data
>    flow
> 2) already well optimized applications that need more adaptive cache in
>    the application instead of the underlying filesystem or volume
>    manager

This advice can't be sensibly delivered to ZFS via a Direct I/O mechanism. Anton's characterization of Direct I/O as "an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy," is concise and accurate. Trying to intuit advice from this is unlikely to be useful. It would be better to develop a separate mechanism for delivering advice about the application to the filesystem. (fadvise, perhaps?)

A DIO implementation for ZFS is more complicated than UFS's and adversely impacts well optimized applications. I looked into this late last year when we had a customer who was suffering from too much bcopy overhead. Billm found another workaround instead of bypassing the ARC.

The challenge in implementing DIO for ZFS is dealing with access to the pages mapped by the user application. Since ZFS has to checksum all of its data, the user's pages that are involved in the direct I/O cannot be written to by another thread during the I/O. If this policy isn't enforced, it is possible for the data written to or read from disk to differ from its checksum.

In order to protect the user pages while a DIO is in progress, we want support from the VM that isn't presently implemented. To prevent a page from being accessed by another thread, we have to unmap the TLB/PTE entries and lock the page. There's a cost associated with this, as it may be necessary to cross-call other CPUs. Any thread that accesses the locked pages will block. While it's possible to lock pages in the VM today, there isn't a neat set of interfaces the filesystem can use to maintain the integrity of the user's buffers. Without an experimental prototype to verify the design, it's impossible to say whether the overhead of manipulating the page permissions is more than the cost of bypassing the cache.

What do you see as potential use cases for ZFS Direct I/O? I'm having a hard time imagining a situation in which this would be useful to a customer. The application would probably have to be single-threaded, and if not, it would have to be pretty careful about how its threads access buffers involved in I/O.

-j
Re: [zfs-discuss] Re: ZFS direct IO
> Note also that for most applications, the size of their IO operations
> would often not match the current page size of the buffer, causing
> additional performance and scalability issues.

Thanks for mentioning this; I forgot about it. Since ZFS's default block size is configured to be larger than a page, the application would have to issue page-aligned, block-sized I/Os. Anyone adjusting the block size would presumably be responsible for ensuring that the new size is a multiple of the page size. (If they would want Direct I/O to work...) I believe UFS also has a similar requirement, but I've been wrong before. (A quick way to check the relevant sizes is sketched below.)

-j
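A quick way to check the two sizes in play on a given system (the dataset name is illustrative):

    $ pagesize
    4096
    $ zfs get recordsize tank/data
    NAME       PROPERTY    VALUE    SOURCE
    tank/data  recordsize  128K     default

A hypothetical ZFS direct I/O path would then need application buffers aligned to the 4K page and transfers issued in multiples of the 128K record.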
Re: [zfs-discuss] Re: slow reads question...
Harley:

> I had tried other sizes with much the same results, but hadn't gone as
> large as 128K. With bs=128K, it gets worse:
>
> | # time dd if=zeros-10g of=/dev/null bs=128k count=102400
> | 81920+0 records in
> | 81920+0 records out
> |
> | real    2m19.023s
> | user    0m0.105s
> | sys     0m8.514s

I may have done my math wrong, but if we assume that the real time is the actual amount of time we spent performing the I/O (which may be incorrect), haven't you done better here? In this case you pushed 81920 128k records in ~139 seconds -- approx 75437 k/sec.

Using ZFS with 8k bs, you pushed 102400 8k records in ~68 seconds -- approx 12047 k/sec. Using the raw device, you pushed 102400 8k records in ~23 seconds -- approx 35617 k/sec.

I may have missed something here, but isn't this newest number the highest performance so far? What does iostat(1M) say about your disk read performance?

> Is there any other info I can provide which would help?

Are you just trying to measure ZFS's read performance here? It might be interesting to change your outfile (of) argument and see if we're actually running into some other performance problem. If you change of=/tmp/zeros, does performance improve or degrade? Likewise, if you write the file out to another disk (UFS, ZFS, whatever), does this improve performance? (A sketch of both runs follows below.)

-j
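As a sketch of those two runs (paths are illustrative; note /tmp is swap-backed, so keep the byte count modest):

    # /usr/bin/time dd if=zeros-10g of=/tmp/zeros-out bs=128k count=8192
    # /usr/bin/time dd if=zeros-10g of=/otherdisk/zeros-out bs=128k count=8192

Comparing these against the of=/dev/null runs shows whether the write target changes the picture at all, or whether the bottleneck is entirely on the read side.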
Re: [zfs-discuss] Re: slow reads question...
Harley:

> Old 36GB drives:
>
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> |
> | real    2m31.991s
> | user    0m0.007s
> | sys     0m0.923s
>
> Newer 300GB drives:
>
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> |
> | real    0m8.425s
> | user    0m0.010s
> | sys     0m1.809s

This is a pretty dramatic difference. What type of drives were your old 36g drives? I am wondering if there is something other than capacity and seek time which has changed between the drives. Would a different scsi command set or features have this dramatic a difference? I'm hardly the authority on hardware, but there are a couple of possibilities: your newer drives may have a write cache, and it's also quite likely that the newer drives have a faster rotational speed and a shorter seek time.

If you subtract the usr + sys time from the real time in these measurements, I suspect the result is the amount of time you were actually waiting for the I/O to finish. In the first case, you spent 99% of your total time waiting for stuff to happen, whereas in the second case it was only ~78% of your overall time.

-j