Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?

2010-03-04 Thread Adam Leventhal
> It seems they kind of rushed the appliance into the market. We've a few 7410s 
> and replication (with zfs send/receive) doesn't work after shares reach ~1TB 
> (broken pipe error). 

While it's the case that the 7000 series is a relatively new product, the 
characterization of "rushed to market" is inaccurate. While the product 
certainly has had bugs, we've been pretty quick to address them (for example, 
the issue you described).

> It's frustrating and we can't do anything because every time we type "shell" 
> in the CLI, it freaks us out with a message saying the warranty will be 
> voided if we continue. I bet that we could work around that bug but we're not 
> allowed and the workarounds provided by Sun haven't worked.

I can understand why it might be frustrating to feel shut out of your customary 
Solaris interfaces, but it's not Solaris: it's an appliance. Arbitrary actions 
that might seem benign to someone familiar with Solaris can have disastrous 
consequences -- I'd be happy to give some examples of the amusing ways our 
customers have taken careful aim and shot themselves in the foot.

> Regarding dedup, Oracle is very courageous for including it in the 2010.Q1 
> release if this comes to be true. But I understand the pressure on then. 
> Every other vendor out there is releasing products with deduplication. 
> Personally, I would just wait 2-3 releases before using it in a black box 
> like the 7000s.

We're including dedup in the 2010.Q1 release, and as always we would not 
release a product we didn't stand behind. ZFS dedup still has some performance 
pathologies and surprising results at times; we're working with our customers to 
ensure that their deployments are successful, and fixing problems as they come 
up.

> The hardware on the other hand is incredible in terms of resilience and 
> performance, no doubt. Which makes me think the pretty interface becomes an 
> annoyance sometimes. Let's wait for 2010.Q1 :)

As always, we welcome feedback (although zfs-discuss is not the appropriate 
forum), and are eager to improve the product.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Adam Leventhal
Hey Karsten,

Very interesting data. Your test is inherently single-threaded so I'm not 
surprised that the benefits aren't more impressive -- the flash modules on the 
F20 card are optimized more for concurrent IOPS than single-threaded latency.
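
For anyone wanting to reproduce the log configurations below, the striped
and mirrored slog layouts are created along these lines (a sketch only --
"tank" and the device names are placeholders, not Karsten's actual config):

  # two FMods as independent (striped) log devices
  zpool add tank log c2t0d0 c2t1d0

  # or the same two FMods as a single mirrored log
  zpool add tank log mirror c2t0d0 c2t1d0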

Adam

On Mar 30, 2010, at 3:30 AM, Karsten Weiss wrote:

> Hi, I did some tests on a Sun Fire x4540 with an external J4500 array 
> (connected via two
> HBA ports). I.e. there are 96 disks in total configured as seven 12-disk 
> raidz2 vdevs
> (plus system, spares, unused disks) providing a ~ 63 TB pool with fletcher4 
> checksums.
> The system was recently equipped with a Sun Flash Accelerator F20 with 4 FMod
> modules to be used as log devices (ZIL). I was using the latest snv_134 
> software release.
> 
> Here are some first performance numbers for the extraction of an uncompressed 
> 50 MB
> tarball on a Linux (CentOS 5.4 x86_64) NFS-client which mounted the test 
> filesystem
> (no compression or dedup) via NFSv3 (rsize=wsize=32k,sync,tcp,hard).
> 
> standard ZIL:         7m40s   (ZFS default)
> 1x SSD ZIL:           4m07s   (Flash Accelerator F20)
> 2x SSD ZIL:           2m42s   (Flash Accelerator F20)
> 2x SSD mirrored ZIL:  3m59s   (Flash Accelerator F20)
> 3x SSD ZIL:           2m47s   (Flash Accelerator F20)
> 4x SSD ZIL:           2m57s   (Flash Accelerator F20)
> disabled ZIL:         0m15s
> (local extraction:    0m0.269s)
> 
> I was not so much interested in the absolute numbers but rather in the 
> relative
> performance differences between the standard ZIL, the SSD ZIL and the disabled
> ZIL cases.
> 
> Any opinions on the results? I wish the SSD ZIL performance was closer to the
> disabled ZIL case than it is right now.
> 
> ATM I tend to use two F20 FMods for the log and the two other FMods as L2ARC 
> cache
> devices (although the system has lots of system memory i.e. the L2ARC is not 
> really
> necessary). But the speedup of disabling the ZIL altogether is appealing (and 
> would
> probably be acceptable in this environment).
> -- 
> This message posted from opensolaris.org
> _______
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Adam Leventhal
Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity 
distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths 
to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

> Hi,
> 
> 
> zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
>  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
>  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
>  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
>  [...]
>  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
> 
> zfs set atime=off test
> zfs set recordsize=16k test
> (I know...)
> 
> now if I create a one large file with filebench and simulate a randomread 
> workload with 1 or more threads then disks on c2 and c3 controllers are 
> getting about 80% more reads. This happens both on 111b and snv_134. I would 
> rather expect all of them to get about the same number of iops.
> 
> Any idea why?
> 
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Adam Leventhal
> Does it mean that for dataset used for databases and similar environments 
> where basically all blocks have fixed size and there is no other data all 
> parity information will end-up on one (z1) or two (z2) specific disks?

No. There are always smaller writes to metadata that will distribute parity. 
What is the total width of your raidz1 stripe?

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Adam Leventhal
Hey Robert,

I've filed a bug to track this issue. We'll try to reproduce the problem and 
evaluate the cause. Thanks for bringing this to our attention.

Adam

On Jun 24, 2010, at 2:40 AM, Robert Milkowski wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar environments 
>>> where basically all blocks have fixed size and there is no other data all 
>>> parity information will end-up on one (z1) or two (z2) specific disks?
>>> 
>> No. There are always smaller writes to metadata that will distribute parity. 
>> What is the total width of your raidz1 stripe?
>> 
>>   
> 
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
> 
> -- 
> Robert Milkowski
> http://milek.blogspot.com
> 
> 


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS compression

2010-07-25 Thread Adam Leventhal
>> I've read a small amount about compression, enough to find that it'll affect 
>> performance (not a problem for me) and that once you enable compression it 
>> only affects new files written to the file system.  
> 
> Yes, that's true. Compression on defaults to lzjb which is fast; but gzip-9 
> can be twice as good. (I've just done some tests on the MacZFS port on my 
> blog for more info)

Here's a good blog comparing some ZFS compression modes in the context of the 
Sun Storage 7000:

  http://blogs.sun.com/dap/entry/zfs_compression
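
As a quick sketch (the dataset name here is hypothetical), switching
algorithms and checking the result looks like this; note that only newly
written blocks pick up the new setting:

  zfs set compression=gzip-9 tank/data
  zfs get compression,compressratio tank/data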

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz - what is stored in parity?

2010-08-12 Thread Adam Leventhal
> In my case, it gives an error that I need at least 11 disks (which I don't) 
> but the point is that raidz parity does not seem to be limited to 3. Is this 
> not true?

RAID-Z is limited to 3 parity disks. The error message is giving you false hope 
and that's a bug. If you had plugged in 11 disks or more in the example you 
provided you would have simply gotten a different error.

- ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Checksums

2009-10-23 Thread Adam Leventhal
On Fri, Oct 23, 2009 at 06:55:41PM -0500, Tim Cook wrote:
> So, from what I gather, even though the documentation appears to state
> otherwise, default checksums have been changed to SHA256.  Making that
> assumption, I have two questions.

That's false. The default checksum has changed from fletcher2 to fletcher4;
that is to say, the definition of the value of 'on' has changed.

> First, is the default updated from fletcher2 to SHA256 automatically for a
> pool that was created with an older version of zfs and then upgraded to the
> latest?  Second, would all of the blocks be re-checksummed with a zfs
> send/receive on the receiving side?

As with all property changes, new writes get the new properties. Old data
is not rewritten.
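
For example (a sketch only; the dataset name is hypothetical):

  zfs set checksum=sha256 tank/fs
  zfs get checksum tank/fs    # new writes now use sha256
  # blocks written earlier keep their original checksum until they are
  # rewritten, e.g. by copying the data or receiving it into a dataset
  # that already has checksum=sha256 set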

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Checksums

2009-10-25 Thread Adam Leventhal
Thank you for the correction.  My next question is, do you happen to  
know what the overhead difference between fletcher4 and SHA256 is?   
Is the checksumming multi-threaded in nature?  I know my fileserver  
has a lot of spare cpu cycles, but it would be good to know if I'm  
going to take a substantial hit in throughput moving from one to the  
other.


Tim,

That all really depends on your specific system and workload. As with any
performance-related matter, experimentation is vital for making your final
decision.
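
One simple way to run that experiment (a sketch, assuming a pool named
tank with room for a couple of gigabytes of scratch data):

  zfs create -o checksum=fletcher4 tank/cksum-f4
  zfs create -o checksum=sha256 tank/cksum-sha
  # time a large sequential write against each dataset and compare;
  # for a fairer test, make the file considerably larger than memory
  ptime dd if=/dev/zero of=/tank/cksum-f4/file bs=128k count=16384
  ptime dd if=/dev/zero of=/tank/cksum-sha/file bs=128k count=16384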

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs code and fishworks "fork"

2009-10-26 Thread Adam Leventhal
With that said I'm concerned that there appears to be a fork between  
the opensource version of ZFS and ZFS that is part of the Sun/Oracle  
FishWorks 7nnn series appliances.  I understand (implicitly) that  
Sun (/Oracle) as a commercial concern, is free to choose their own  
priorities in terms of how they use their own IP (Intellectual  
Property) - in this case, the source for the ZFS filesystem.


Hey Al,

I'm unaware of specific plans from management either at Sun or at  
Oracle, but from an engineering perspective suffice it to say that it  
is simpler and therefore more cost effective to develop for a single,  
unified code base, to amortize the cost of testing those  
modifications, and to leverage the enthusiastic ZFS community to  
assist with the development and testing of ZFS.


Again, this isn't official policy, just the simple facts on the ground  
from engineering.


I'm not sure what would lead you to believe that there is a fork between  
the open source / OpenSolaris ZFS and what we have in Fishworks.  
Indeed, we've made efforts to make sure there is a single ZFS for the  
reason stated above. Any differences that exist are quickly migrated  
to ON as you can see from the consistent work of Eric Schrock.


Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread Adam Leventhal
Hi Kjetil,

Unfortunately, dedup will only apply to data written after the setting is 
enabled. That also means that new blocks cannot dedup against old blocks 
regardless of how they were written. There is therefore no way to "prepare" 
your pool for dedup -- you just have to enable it when you have the new bits.
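
In other words, to get the benefit for existing data you have to rewrite
it after dedup is enabled. A sketch (pool and dataset names are
placeholders):

  zpool upgrade tank           # pool must be at a dedup-capable version
  zfs set dedup=on tank/data
  # existing blocks only enter the dedup table once they are rewritten,
  # e.g. by copying the files or by zfs send | zfs receive into a
  # dataset that has dedup=on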

Adam

On Dec 9, 2009, at 3:40 AM, Kjetil Torgrim Homme wrote:

> I'm planning to try out deduplication in the near future, but started
> wondering if I can prepare for it on my servers.  one thing which struck
> me was that I should change the checksum algorithm to sha256 as soon as
> possible.  but I wonder -- is that sufficient?  will the dedup code know
> about old blocks when I store new data?
> 
> let's say I have an existing file img0.jpg.  I turn on dedup, and copy
> it twice, to img0a.jpg and img0b.jpg.  will all three files refer to the
> same block(s), or will only img0a and img0b share blocks?
> 
> -- 
> Kjetil T. Homme
> Redpill Linpro AS - Changing the game
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] will deduplication know about old blocks?

2009-12-09 Thread Adam Leventhal
> What happens if you snapshot, send, destroy, recreate (with dedup on this 
> time around) and then write the contents of the cloned snapshot to the 
> various places in the pool - which properties are in the ascendancy here? the 
> "host pool" or the contents of the clone? The host pool I assume, because 
> clone contents are (in this scenario) "just some new data"?

The dedup property applies to all writes so the settings for the pool of origin 
don't matter, just those on the destination pool.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings

2009-12-17 Thread Adam Leventhal
Hi Giridhar,

The size reported by ls can include things like holes in the file. What space 
usage does the zfs(1M) command report for the filesystem?
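
For an apples-to-apples comparison, something along these lines is more
useful than ls (run against your own pool name):

  zfs get used,referenced,compressratio TestPool
  zpool get dedupratio TestPool
  zdb -DD TestPool    # the per-DDT breakdown you already collected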

Adam

On Dec 16, 2009, at 10:33 PM, Giridhar K R wrote:

> Hi,
> 
> Reposting as I have not gotten any response.
> 
> Here is the issue. I created a zpool with 64k recordsize and enabled dedupe 
> on it.
> -->zpool create -O recordsize=64k TestPool device1
> -->zfs set dedup=on TestPool
> 
> I copied files onto this pool over nfs from a windows client.
> 
> Here is the output of zpool list
> --> zpool list
> NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
> TestPool 696G 19.1G 677G 2% 1.13x ONLINE -
> 
> I ran "ls -l /TestPool" and saw the total size reported as 51,193,782,290 
> bytes.
> The alloc size reported by zpool along with the DEDUP of 1.13x does not add up 
> to 51,193,782,290 bytes.
> 
> According to the DEDUP (Dedupe ratio) the amount of data copied is 21.58G 
> (19.1G * 1.13) 
> 
> Here is the output from zdb -DD
> 
> --> zdb -DD TestPool
> DDT-sha256-zap-duplicate: 33536 entries, size 272 on disk, 140 in core
> DDT-sha256-zap-unique: 278241 entries, size 274 on disk, 142 in core
> 
> DDT histogram (aggregated over all DDTs):
> 
> bucket              allocated                        referenced
> ______   _____________________________    _____________________________
> refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
> ------   ------   -----   -----   -----   ------   -----   -----   -----
>      1     272K   17.0G   17.0G   17.0G     272K   17.0G   17.0G   17.0G
>      2    32.7K   2.05G   2.05G   2.05G    65.6K   4.10G   4.10G   4.10G
>      4       15    960K    960K    960K       71   4.44M   4.44M   4.44M
>      8        4    256K    256K    256K       53   3.31M   3.31M   3.31M
>     16        1     64K     64K     64K       16      1M      1M      1M
>    512        1     64K     64K     64K      854   53.4M   53.4M   53.4M
>     1K        1     64K     64K     64K    1.08K   69.1M   69.1M   69.1M
>     4K        1     64K     64K     64K    5.33K    341M    341M    341M
>  Total     304K   19.0G   19.0G   19.0G     345K   21.5G   21.5G   21.5G
> 
> dedup = 1.13, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.13
> 
> 
> Am I missing something?
> 
> Your inputs are much appreciated.
> 
> Thanks,
> Giri
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings

2009-12-17 Thread Adam Leventhal
> Thanks for the response Adam.
> 
> Are you talking about ZFS list?
> 
> It displays 19.6 as allocated space.
> 
> What does ZFS treat as hole and how does it identify?

ZFS will compress blocks of zeros down to nothing and treat them like
sparse files. 19.6G is pretty close to your computed total. Does your pool
happen to be 10+1 RAID-Z?

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-21 Thread Adam Leventhal
Hey James,

> Personally, I think mirroring is safer (and 3 way mirroring) than raidz/z2/5. 
>  All my "boot from zfs" systems have 3 way mirrors root/usr/var disks (using 
> 9 disks) but all my data partitions are 2 way mirrors (usually 8 disks or 
> more and a spare.)

Double-parity (or triple-parity) RAID is certainly more resilient against some 
failure modes than 2-way mirroring. For example, bit errors can arise at a 
certain rate from disks. In the case of a disk failure in a mirror, it's 
possible to encounter a bit error such that data is lost.

I recently wrote an article for ACM Queue that examines recent trends in hard 
drives and makes the case for triple-parity RAID. It's at least peripherally 
relevant to this conversation:

  http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz data loss stories?

2009-12-25 Thread Adam Leventhal
>> Applying classic RAID terms to zfs is just plain
>> wrong and misleading 
>> since zfs does not directly implement these classic
>> RAID approaches 
>> even though it re-uses some of the algorithms for
>> data recovery. 
>> Details do matter.
> 
> That's not entirely true, is it?
> * RAIDZ is RAID5 + checksum + COW
> * RAIDZ2 is RAID6 + checksum + COW
> * A stack of mirror vdevs is RAID10 + checksum + COW

Others have noted that RAID-Z isn't really the same as RAID-5 and RAID-Z2 isn't 
the same as RAID-6 because RAID-5 and RAID-6 define not just the number of 
parity disks (which would have made far more sense in my mind), but instead 
also include in the definition a notion of how the data and parity are laid 
out. The RAID levels were used to describe groupings of existing 
implementations and conflate things like the number of parity devices with, 
say, how parity is distributed across devices.

For example, RAID-Z1 lays out data most like RAID-3, that is, a single block is 
carved up and spread across many disks, but it distributes parity as RAID-5 
requires, though in a different manner. It's an unfortunate state of affairs which is 
why further RAID levels should identify only the most salient aspect (the 
number of parity devices) or we should use unambiguous terms like single-parity 
and double-parity RAID.

> If we can compare apples and oranges, would you same recommendation ("use 
> raidz2 and/or raidz3") be the same when comparing to mirror with the same 
> number of drives?  In other words, a 2 drive mirror compares to raidz{1} the 
> same as a 3 drive mirror compares to raidz2 and a 4 drive mirror compares to 
> raidz3?  If you were enterprise (in other words card about perf) why would 
> you ever use raidz instead of throwing more drives at the problem and doing 
> mirroring with identical parity?

You're right that a mirror is a degenerate form of raidz1, for example, but 
mirrors allow for specific optimizations. While the redundancy would be the 
same, the performance would not.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz stripe size (not stripe width)

2010-01-04 Thread Adam Leventhal
Hi Brad,

RAID-Z will carve up the 8K block into chunks at the granularity of the sector 
size -- today 512 bytes but soon going to 4K. In this case a 9-disk RAID-Z vdev 
will look like this:

|  P  | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
|  P  | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

1K per device with an additional 1K for parity.
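
Spelling out the arithmetic (assuming 512-byte sectors):

  8K block                       = 16 data sectors
  16 data sectors / 8 data disks = 2 sectors = 1K per data disk
  2 rows x 1 parity sector       = 2 sectors = 1K of parity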

Adam

On Jan 4, 2010, at 3:17 PM, Brad wrote:

> If an 8K file system block is written on a 9 disk raidz vdev, how is the data 
> distributed (written) between all devices in the vdev since a zfs write is 
> one continuous IO operation?
> 
> Is it distributed evenly (1.125KB) per device?
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-13 Thread Adam Leventhal
Hey Chris,

> The DDRdrive X1 OpenSolaris device driver is now complete,
> please join us in our first-ever ZFS Intent Log (ZIL) beta test 
> program.  A select number of X1s are available for loan,
> preferred candidates would have a validation background 
> and/or a true passion for torturing new hardware/driver :-)
> 
> We are singularly focused on the ZIL device market, so a test
> environment bound by synchronous writes is required.  The
> beta program will provide extensive technical support and a
> unique opportunity to have direct interaction with the product
> designers.

Congratulations! This is great news for ZFS. I'll be very interested to
see the results members of the community can get with your device as part
of their pool. COMSTAR iSCSI performance should be dramatically improved
in particular.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hybrid storage ... thing

2010-02-05 Thread Adam Leventhal
> I saw this in /. and thought I'd point it out to this list. It appears
> to act as a L2 cache for a single drive, in theory providing better
> performance.
> 
> http://www.silverstonetek.com/products/p_contents.php?pno=HDDBOOST&area

It's a neat device, but the notion of a hybrid drive is nothing new. As
with any block-based caching, this device has no notion of the semantic
meaning of a given block so there's only so much intelligence it can bring
to bear on the problem.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-18 Thread Adam Leventhal
Hey Bob,

> My own conclusions (supported by Adam Leventhal's excellent paper) are that
> 
> - maximum device size should be constrained based on its time to
>   resilver.
> 
> - devices are growing too large and it is about time to transition to
>   the next smaller physical size.

I don't disagree with those conclusions necessarily, but the HDD vendors have 
significant momentum built up in their efforts to improve density -- that's not 
going to change in the next 5 years. If the industry did transition to reducing 
physical size while improving density, that would imply that there would be 
many more end-points to deal with, bigger switches, etc. All reasonable, but 
there are some significant implications.

> It is unreasonable to spend more than 24 hours to resilver a single drive.

Why?

> It is unreasonable to spend very much time at all on resilvering (using 
> current rotating media) since the resilvering process kills performance.

Maybe, but then it depends on how much you rely on your disks for performance. 

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS dedup for VAX COFF data type

2010-02-21 Thread Adam Leventhal
> Hi Any idea why zfs does not dedup files with this format ?
> file /opt/XXX/XXX/data
> VAX COFF executable - version 7926

With dedup enabled, ZFS will identify and remove duplicated blocks regardless of the 
data format.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal Setup: RAID-5, Areca, etc!

2008-07-25 Thread Adam Leventhal
> > But, is there a performance boost with mirroring the drives? That is what
> > I'm unsure of.
> 
> Mirroring will provide a boost on reads, since the system to read from
> both sides of the mirror. It will not provide an increase on writes,
> since the system needs to wait for both halves of the mirror to
> finish. It could be slightly slower than a single raid5.

That's not strictly correct. Mirroring will, in fact, deliver better IOPS for
both reads and writes. For reads, as Brandon stated, mirroring will deliver
better performance because it can distribute the reads between both devices.
For writes, however, RAID-Z with an N+1 wide stripe will divide the data into
N+1 chunks (N data plus one parity), and reads will need to access all N data
chunks. This reduces
the total IOPS by a factor of N+1 for reads and writes whereas mirroring
reduces the IOPS by a factor of 2 for writes and not at all for reads.
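
To put rough, purely illustrative numbers on that, take ten disks, each
capable of about 200 random IOPS:

  mirrored (5 x 2-way):  reads  ~ 10 x 200 = 2000 IOPS
                         writes ~  5 x 200 = 1000 IOPS
  9+1 RAID-Z (1 vdev):   reads and writes ~ 2000 / 10 = 200 IOPS,
                         since every I/O touches the whole stripe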

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which is better for root ZFS: mlc or slc SSD?

2008-09-26 Thread Adam Leventhal
For a root device it doesn't matter that much. You're not going to be  
writing to the device at a high data rate so write/erase cycles don't  
factor much (SLC can sustain about a factor of 10 more than MLC). With MLC  
you'll get 2-4x the capacity for the same price, but again that  
doesn't matter much for a root device. Performance is typically a bit  
better with SLC -- especially on the write side -- but it's not such a  
huge difference.

The reason you'd use a flash SSD for a boot device is power (with  
maybe a dash of performance), and either SLC or MLC will do just fine.

Adam

On Sep 24, 2008, at 11:41 AM, Erik Trimble wrote:

> I was under the impression that MLC is the preferred type of SSD,  
> but I
> want to prevent myself from having a think-o.
>
>
> I'm looking to get (2) SSD to use as my boot drive. It looks like I  
> can
> get 32GB SSDs composed of either SLC or MLC for roughly equal pricing.
> Which would be the better technology?  (I'll worry about rated access
> times/etc of the drives, I'm just wondering about general tech for  
> an OS
> boot drive usage...)
>
>
>
> -- 
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)

2008-10-05 Thread Adam Leventhal
> So what are the downsides to this?  If both nodes were to crash and  
> I used the same technique to recreate the ramdisk I would lose any  
> transactions in the slog at the time of the crash, but the physical  
> disk image is still in a consistent state right (just not from my  
> apps point of view)?

You would lose transactions, but the pool would still reflect a consistent
state.

> So is this idea completely crazy?


On the contrary; it's very clever.
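
For reference, the kind of setup being described would look roughly like
this (names and sizes are placeholders, and as noted you can lose recent
synchronous transactions if the ramdisk goes away):

  ramdiskadm -a slog 1024m
  zpool add tank log /dev/ramdisk/slog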

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Storage 7000

2008-11-10 Thread Adam Leventhal
On Nov 10, 2008, at 10:55 AM, Tim wrote:
> Just got an email about this today.  Fishworks finally unveiled?


Yup, that's us! On behalf of the Fishworks team, we'd like to extend a
big thank you to the ZFS team and the ZFS community here who have
contributed to such a huge building block in our new line of storage
appliances.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenStorage GUI

2008-11-11 Thread Adam Leventhal
On Nov 11, 2008, at 9:38 AM, Bryan Cantrill wrote:

> Just to throw some ice-cold water on this:
>
>  1.  It's highly unlikely that we will ever support the x4500 -- only the
>      x4540 is a real possibility.


And to warm things up a bit: there's already an upgrade path from the
x4500 to the x4540 so that would be required before any upgrade to the
equivalent of the Sun Storage 7210.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenStorage GUI

2008-11-11 Thread Adam Leventhal
On Nov 11, 2008, at 10:41 AM, Brent Jones wrote:
> Wish I could get my hands on a beta of this GUI...


Take a look at the VMware version that you can run on any machine:

   http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenStorage GUI

2008-11-11 Thread Adam Leventhal
> Is this software available for people who already have thumpers?

We're considering offering an upgrade path for people with existing
thumpers. Given the feedback we've been hearing, it seems very likely
that we will. No word yet on pricing or availability.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] continuous replication

2008-11-14 Thread Adam Leventhal
On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote:
> That is _not_ active-active, that is active-passive.
> 
> If you have a active-active system I can access the same data via both
> controllers at the same time. I can't if it works like you just
> described. You can't call it active-active just because different
> volumes are controlled by different controllers. Most active-passive
> RAID controllers can do that.
> 
> The data sheet talks about active-active clusters, how does that work?

What the Sun Storage 7000 Series does would more accurately be described as
dual active-passive.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Storage 7000

2008-11-17 Thread Adam Leventhal
On Mon, Nov 17, 2008 at 12:35:38PM -0600, Tim wrote:
> I'm not sure if this is the right place for the question or not, but I'll
> throw it out there anyways.  Does anyone know, if you create your pool(s)
> with a system running fishworks, can that pool later be imported by a
> standard solaris system?  IE: If for some reason the head running fishworks
> were to go away, could I attach the JBOD/disks to a system running
> snv/mainline solaris/whatever, and import the pool to get at the data?  Or
> is the zfs underneath fishworks proprietary as well?

Yes. The Sun Storage 7000 Series uses the same ZFS that's in OpenSolaris
today. A pool created on the appliance could potentially be imported on an
OpenSolaris system; that is, of course, not explicitly supported in the
service contract.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Storage 7000

2008-11-17 Thread Adam Leventhal
> Would be interesting to hear more about how Fishworks differs from 
> Opensolaris, what build it is based on, what package mechanism you are 
> using (IPS already?), and other differences...

I'm sure these details will be examined in the coming weeks on the blogs
of members of the Fishworks team. Keep an eye on blogs.sun.com/fishworks.

> A little off topic: Do you know when the SSDs used in the Storage 7000 are 
> available for the rest of us?

I don't think they will be, but it will be possible to purchase them as
replacement parts.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Storage 7000

2008-11-19 Thread Adam Leventhal
On Tue, Nov 18, 2008 at 09:09:07AM -0800, Andre Lue wrote:
> Is the web interface on the appliance available for download or will it make
> it to opensolaris sometime in the near future?

It's not, and it's unlikely to make it to OpenSolaris.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Comparison between the S-TEC Zeus and the Intel X25-E ??

2009-01-16 Thread Adam Leventhal
The Intel part does about a fourth as many synchronous write IOPS at  
best.

Adam

On Jan 16, 2009, at 5:34 PM, Erik Trimble wrote:

> I'm looking at the newly-orderable (via Sun) STEC Zeus SSDs, and  
> they're
> outrageously priced.
>
> http://www.stec-inc.com/product/zeusssd.php
>
> I just looked at the Intel X25-E series, and they look comparable in
> performance.  At about 20% of the cost.
>
> http://www.intel.com/design/flash/nand/extreme/index.htm
>
>
> Can anyone enlighten me as to any possible difference between an STEC
> Zeus and an Intel X25-E ?  I mean, other than those associated with  
> the
> fact that you can't get the Intel one orderable through Sun right now.
>
> -- 
> Erik Trimble
> Java System Support
> Mailstop:  usca22-123
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-18 Thread Adam Leventhal
> Right, which is an absolutely piss poor design decision and why  
> every major storage vendor right-sizes drives.  What happens if I  
> have an old maxtor drive in my pool whose "500g" is just slightly  
> larger than every other mfg on the market?  You know, the one who is  
> no longer making their own drives since being purchased by seagate.   
> I can't replace the drive anymore?  *GREAT*.


Sun does "right size" our drives. Are we talking about replacing a  
device bought from Sun with another device bought from Sun? If these  
are just drives that fell off the back of some truck, you may not have  
that assurance.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disks in each RAIDZ group

2009-01-19 Thread Adam Leventhal
> "The recommended number of disks per group is between 3 and 9. If you have
> more disks, use multiple groups."
> 
> Odd that the Sun Unified Storage 7000 products do not allow you to control
> this, it appears to put all the hdd's into one group.  At least on the 7110
> we are evaluating there is no control to allow multiple groups/different
> raid types.

Our experience has shown that that initial guess of 3-9 per parity device was
surprisingly narrow. We see similar performance out to much wider stripes
which, of course, offer the user more usable capacity.

We don't allow you to manually set the RAID stripe widths on the 7000 series
boxes because frankly the stripe width is an implementation detail. If you
want the best performance, choose mirroring; capacity, double-parity RAID;
for something in the middle, we offer 3+1 single-parity RAID. Other than
that you're micro-optimizing for gains that would hardly be measurable given
the architecture of the Hybrid Storage Pool. Recall that unlike other
products in the same space, we get our IOPS from flash rather than from
a bazillion spindles spinning at 15,000 RPM.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-19 Thread Adam Leventhal
> Since it's done in software by HDS, NetApp, and EMC, that's complete
> bullshit.  Forcing people to spend 3x the money for a "Sun" drive that's
> identical to the seagate OEM version is also bullshit and a piss-poor
> answer.

I didn't know that HDS, NetApp, and EMC all allow users to replace their
drives with stuff they've bought at Fry's. Is this still covered by their
service plan or would this only be in an unsupported config?

Thanks.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-19 Thread Adam Leventhal
> > > Since it's done in software by HDS, NetApp, and EMC, that's complete
> > > bullshit.  Forcing people to spend 3x the money for a "Sun" drive that's
> > > identical to the seagate OEM version is also bullshit and a piss-poor
> > > answer.
> >
> > I didn't know that HDS, NetApp, and EMC all allow users to replace their
> > drives with stuff they've bought at Fry's. Is this still covered by their
> > service plan or would this only be in an unsupported config?
> 
> So because an enterprise vendor requires you to use their drives in their
> array, suddenly zfs can't right-size?  Vendor requirements have absolutely
> nothing to do with their right-sizing, and everything to do with them
> wanting your money.

Sorry, I must have missed your point. I thought that you were saying that
HDS, NetApp, and EMC had a different model. Were you merely saying that the
software in those vendors' products operates differently than ZFS?

> Are you telling me zfs is deficient to the point it can't handle basic
> right-sizing like a 15$ sata raid adapter?

How do these $15 SATA RAID adapters solve the problem? The more details you
could provide, the better, obviously.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disks in each RAIDZ group

2009-01-19 Thread Adam Leventhal
> BWAHAHAHAHA.  That's a good one.  "You don't need to setup your raid, that's
> micro-managing, we'll do that."
> 
> Remember that one time when I talked about limiting snapshots to protect a
> user from themselves, and you joined into the fray of people calling me a
> troll?

I don't remember this, but I don't doubt it.

> Can you feel the irony oozing out between your lips, or are you
> completely oblivious to it?

The irony would be that on one hand I object to artificial limitations to
business-critical features while on the other hand I think that users don't
need to tweak settings that add complexity and little to no value? They seem
very different to me, so I suppose the answer to your question is: no I cannot
feel the irony oozing out between my lips, and yes I'm oblivious to the same.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-19 Thread Adam Leventhal
On Mon, Jan 19, 2009 at 01:35:22PM -0600, Tim wrote:
> > > Are you telling me zfs is deficient to the point it can't handle basic
> > > right-sizing like a 15$ sata raid adapter?
> >
> > How do there $15 sata raid adapters solve the problem? The more details you
> > could provide the better obviously.
> 
> They short stroke the disk so that when you buy a new 500GB drive that isn't
> the exact same number of blocks you aren't screwed.  It's a design choice to
> be both sane, and to make the end-users life easier.  You know, sort of like
> you not letting people choose their raid layout...

Drive vendors, it would seem, have an incentive to make their "500GB" drives
as small as possible. Should ZFS then choose some amount of padding at the
end of each device and chop it off as insurance against a slightly smaller
drive? How much of the device should it chop off? Conversely, should users
have the option to use the full extent of the drives they've paid for, say,
if they're using a vendor that already provides that guarantee?

> You know, sort of like you not letting people choose their raid layout...

Yes, I'm not saying it shouldn't be done. I'm asking what the right answer
might be.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-19 Thread Adam Leventhal
> And again, I say take a look at the market today, figure out a percentage,
> and call it done.  I don't think you'll find a lot of users crying foul over
> losing 1% of their drive space when they don't already cry foul over the
> false advertising that is drive sizes today.

Perhaps it's quaint, but 5GB still seems like a lot to me to throw away.

> In any case, you might as well can ZFS entirely because it's not really fair
> that users are losing disk space to raid and metadata... see where this
> argument is going?

Well, I see where this _specious_ argument is going.

> I have two disks in one of my systems... both maxtor 500GB drives, purchased
> at the same time shortly after the buyout.  One is a rebadged Seagate, one
> is a true, made in China Maxtor.  Different block numbers... same model
> drive, purchased at the same time.
> 
> Wasn't zfs supposed to be about using software to make up for deficiencies
> in hardware?  It would seem this request is exactly that...

That's a fair point, and I do encourage you to file an RFE, but a) Sun has
already solved this problem in a different way as a company with our products
and b) users already have the ability to right-size drives.
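
For example, one way to right-size by hand today (a sketch, not a
recommendation of any particular geometry) is to build the pool on slices
sized a bit under the drive's nominal capacity rather than on whole disks:

  # with format(1M), create an s0 slice slightly smaller than the disk
  # on each drive, then build the pool from the slices
  zpool create tank mirror c1t0d0s0 c1t1d0s0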

Perhaps a better solution would be to handle the procedure of replacing a disk
with a slightly smaller one by migrating data and then treating the extant
disks as slightly smaller as well. This would have the advantage of being far
more dynamic and of only applying the space tax in situations where it actually
applies.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD drives in Sun Fire X4540 or X4500 for dedicated ZIL device

2009-01-23 Thread Adam Leventhal
This is correct, and you can read about it here:

  http://blogs.sun.com/ahl/entry/fishworks_launch

Adam

On Fri, Jan 23, 2009 at 05:03:57PM +, Ross Smith wrote:
> That's my understanding too.  One (STEC?) drive as a write cache,
> basically a write optimised SSD.  And cheaper, larger, read optimised
> SSD's for the read cache.
> 
> I thought it was an odd strategy until I read into SSD's a little more
> and realised you really do have to think about your usage cases with
> these.  SSD's are very definitely not all alike.
> 
> 
> On Fri, Jan 23, 2009 at 4:33 PM, Greg Mason  wrote:
> > If i'm not mistaken (and somebody please correct me if i'm wrong), the Sun
> > 7000 series storage appliances (the Fishworks boxes) use enterprise SSDs,
> > with dram caching. One such product is made by STEC.
> >
> > My understanding is that the Sun appliances use one SSD for the ZIL, and one
> > as a read cache. For the 7210 (which is basically a Sun Fire X4540), that
> > gives you 46 disks and 2 SSDs.
> >
> > -Greg
> >
> >
> > Bob Friesenhahn wrote:
> >>
> >> On Thu, 22 Jan 2009, Ross wrote:
> >>
> >>> However, now I've written that, Sun use SATA (SAS?) SSD's in their high
> >>> end fishworks storage, so I guess it definately works for some use cases.
> >>
> >> But the "fishworks" (Fishworks is a development team, not a product) write
> >> cache device is not based on FLASH.  It is based on DRAM.  The difference 
> >> is
> >> like night and day. Apparently there can also be a read cache which is 
> >> based
> >> on FLASH.
> >>
> >> Bob
> >> ==
> >> Bob Friesenhahn
> >> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> >> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
> >>
> >> ___
> >> zfs-discuss mailing list
> >> zfs-discuss@opensolaris.org
> >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>
> >>
> >
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD - slow down with age

2009-02-16 Thread Adam Leventhal

On Feb 14, 2009, at 12:45 PM, Nicholas Lee wrote:
A useful article about long term use of the Intel SSD X25-M: http://www.pcper.com/article.php?aid=669 
 - 	Long-term performance analysis of Intel Mainstream SSDs.


Would a zfs cache (ZIL or ARC) based on a SSD device see this kind  
of issue?  Maybe a periodic scrub via a full disk erase would be a  
useful process.


Indeed SSDs can have certain properties that would cause their  
performance to degrade over time. We've seen this to varying degrees  
with different devices we've tested in our lab. We're working on  
adapting our use of SSDs with ZFS as a ZIL device, an L2ARC device,  
and eventually as primary storage. We'll first focus on the specific  
SSDs we certify for use in our general purpose servers and the Sun  
Storage 7000 series, and help influence the industry to move to  
standards that we can then use.


Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS 15K drives as L2ARC

2009-05-06 Thread Adam Leventhal
>> After all this discussion, I am not sure if anyone adequately answered the 
>> original poster's question as to whether at 2540 with SAS 15K drives would 
>> provide substantial synchronous write throughput improvement when used as 
>> a L2ARC device.
>
> I was under the impression that the L2ARC was to speed up reads, as it 
> allows things to be cached on something faster than disks (usually MLC 
> SSDs). Offloading the ZIL is what handles synchronous writes, isn't it?
>
> How would adding an L2ARC speed up writes?

You're absolutely right. The L2ARC is for accelerating reads only and will
not affect write performance.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7110 questions

2009-06-18 Thread Adam Leventhal
On Thu, Jun 18, 2009 at 11:51:44AM -0400, Dan Pritts wrote:
> I'm curious about a couple things that would be "unsupported."
> 
> Specifically, whether they are "not supported" if they have specifically
> been crippled in the software.

We have not crippled the software in any way, but we have designed an
appliance with some specific uses. Doing things from the Solaris shell
by hand may damage your system and void your support contract.

> 1) SSD's 
> 
> I can imagine buying an intel SSD, slotting it into the 7110, and using
> it as a ZFS L2ARC (? i mean the equivalent of "readzilla")

That's not supported, it won't work easily, and if you get it working you'll
be out of luck if you have a problem.

> 2) expandability
> 
> I can imagine buying a SAS card and a JBOD and hooking it up to
> the 7110; it has plenty of PCI slots.

Ditto.

> finally, one question - I presume that I need to devote a pair of disks
> to the OS, so I really only get 14 disks for data.  Correct?

That's right. We market the 7110 as either 2TB = 146GB x 14 or 4.2TB =
300GB x 14 raw capacity.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7110 questions

2009-06-18 Thread Adam Leventhal
Hey Lawrence,

Make sure you're running the latest software update. Note that this forum
is not the appropriate place to discuss support issues. Please contact your
official Sun support channel.

Adam

On Thu, Jun 18, 2009 at 12:06:02PM -0700, lawrence ho wrote:
> We have a 7110 on try and buy program. 
> 
> We tried using the 7110 with XEN Server 5 over iSCSI and NFS. Nothing seems 
> to solve the slow write problem. Within the VM, we observed around 8MB/s on 
> writes. Read performance is fantastic. Some troubleshooting was done with 
> local SUN rep. The conclusion is that 7110 does not have write cache in forms 
> of SSD or controller DRAM write cache. The solution from SUN is to buy 
> StorageTek or 7000 series model with SSD write cache.
> 
> Adam, please advise if there any fixes for 7110. I am still shopping for SAN 
> and would rather buy a 7100 than a StorageTek or something else.
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] triple-parity: RAID-Z3

2009-07-21 Thread Adam Leventhal

Hey Bob,

MTTDL analysis shows that given normal evironmental conditions, the  
MTTDL of RAID-Z2 is already much longer than the life of the  
computer or the attendant human.  Of course sometimes one encounters  
unusual conditions where additional redundancy is desired.


To what analysis are you referring? Today the absolute fastest you can  
resilver a 1TB drive is about 4 hours. Real-world speeds might be half  
that. In 2010 we'll have 3TB drives meaning it may take a full day to  
resilver. The odds of hitting a latent bit error are already
reasonably high, especially with a large pool that's infrequently
scrubbed. What then are the odds of a second drive failing in
the 24 hours it takes to resilver?


I do think that it is worthwhile to be able to add another parity  
disk to an existing raidz vdev but I don't know how much work that  
entails.


It entails a bunch of work:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Matt Ahrens is working on a key component after which it should all be  
possible.


Zfs development seems to be overwelmed with marketing-driven  
requirements lately and it is time to get back to brass tacks and  
make sure that the parts already developed are truly enterprise- 
grade.



While I don't disagree that the focus for ZFS should be ensuring  
enterprise-class reliability and performance, let me assure you that  
requirements are driven by the market and not by marketing.


Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] triple-parity: RAID-Z3

2009-07-21 Thread Adam Leventhal

which gap?

'RAID-Z should mind the gap on writes' ?

Message was edited by: thometal


I believe this is in reference to the raid 5 write hole, described  
here:

http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance


It's not.

So I'm not sure what the 'RAID-Z should mind the gap on writes'  
comment is getting at either.


Clarification?



I'm planning to write a blog post describing this, but the basic  
problem is that RAID-Z, by virtue of supporting variable stripe writes  
(the insight that allows us to avoid the RAID-5 write hole), must  
round the number of sectors up to a multiple of nparity+1. This means  
that we may have sectors that are effectively skipped. ZFS generally  
lays down data in large contiguous streams, but these skipped sectors  
can stymie both ZFS's write aggregation as well as the hard drive's  
ability to group I/Os and write them quickly.
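
To make the roundup concrete (assuming 512-byte sectors and raidz1, so
allocations are rounded to a multiple of nparity+1 = 2 sectors):

  2K block = 4 data sectors + 1 parity sector = 5 sectors
  roundup to a multiple of 2 = 6 sectors -> 1 skipped sector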


Jeff Bonwick added some code to mind these gaps on reads. The key  
insight there is that if we're going to read 64K, say, with a 512 byte  
hole in the middle, we might as well do one big read rather than two  
smaller reads and just throw out the data that we don't care about.


Of course, doing this for writes is a bit trickier since we can't just  
blithely write over gaps as those might contain live data on the disk.  
To solve this we push the knowledge of those skipped sectors down to  
the I/O aggregation layer in the form of 'optional' I/Os purely for  
the purpose of coalescing writes into larger chunks.


I hope that's clear; if it's not, stay tuned for the aforementioned  
blog post.


Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] triple-parity: RAID-Z3

2009-07-21 Thread Adam Leventhal

Don't hear about triple-parity RAID that often:


Author: Adam Leventhal
Repository: /hg/onnv/onnv-gate
Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
Total changesets: 1
Log message:
6854612 triple-parity RAID-Z


http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612

(Via Blog O' Matty.)

Would be curious to see performance characteristics.



I just blogged about triple-parity RAID-Z (raidz3):

  http://blogs.sun.com/ahl/entry/triple_parity_raid_z

As for performance, on the system I was using (a max config Sun Storage
7410), I saw about a 25% improvement to 1GB/s for a streaming write
workload. YMMV, but I'd be interested in hearing your results.
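
If you'd like to try it, a minimal sketch (the device names are
placeholders; you need a build with raidz3 integrated):

  zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
      c0t5d0 c0t6d0 c0t7d0 c0t8d0
  zpool iostat -v tank 1    # watch streaming write throughput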

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] triple-parity: RAID-Z3

2009-07-22 Thread Adam Leventhal

Don't hear about triple-parity RAID that often:


I agree completely.  In fact, I have wondered (probably in these  
forums), why we don't bite the bullet and make a generic raidzN,  
where N is any number >=0.


I agree, but raidzN isn't simple to implement and it's potentially difficult
to get it to perform well. That said, it's something I intend to bring to
ZFS in the next year or so.

If memory serves, the second parity is calculated using Reed-Solomon  
which implies that any number of parity devices is possible.


True; it's a degenerate case.

In fact, get rid of mirroring, because it clearly is a variant of  
raidz with two devices.  Want three way mirroring?  Call that raidz2  
with three devices.  The truth is that a generic raidzN would roll  
up everything: striping, mirroring, parity raid, double parity, etc.  
into a single format with one parameter.


That's an interesting thought, but there are some advantages to calling out
mirroring for example as its own vdev type. As has been pointed out, reading
from either side of the mirror involves no computation whereas reading from
a RAID-Z 1+2 for example would involve more computation. This would
complicate the calculus of balancing read operations over the mirror
devices.

Let's not stop there, though.  Once we have any number of parity  
devices, why can't I add a parity device to an array?  That should  
be simple enough with a scrub to set the parity.  In fact, what is  
to stop me from removing a parity device?  Once again, I think the  
code would make this rather easy.


With RAID-Z, stripes can be of variable width, meaning that, say, a single
row in a 4+2 configuration might have two stripes of 1+2. In other words,
there might not be enough space in the new parity device. I did write up the
steps that would be needed to support RAID-Z expansion; you can find it here:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Ok, back to the real world.  The one downside to triple parity is  
that I recall the code discovered the corrupt block by excluding it  
from the stripe, reconstructing the stripe and comparing that with  
the checksum.  In other words, for a given cost of X to compute a  
stripe and a number P of corrupt blocks, the cost of reading a  
stripe is approximately X^P.  More corrupt blocks would radically  
slow down the system.  With raidz2, the maximum number of corrupt  
blocks would be two, putting a cap on how costly the read can be.


Computing the additional parity of triple-parity RAID-Z is slightly more
expensive, but not much -- it's just bitwise operations. Recovering from
a read failure is identical (and performs identically) to raidz1 or raidz2
until you actually have sustained three failures. In that case, performance
is slower as more computation is involved -- but aren't you just happy to
get your data back?
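
If you're curious what "just bitwise operations" means in practice, here's a
toy illustration -- not the vdev_raidz code itself, and the data bytes are
arbitrary -- of computing the three parity values: P is a plain XOR, while Q
and R are Reed-Solomon syndromes over GF(2^8) built from shifts and XORs:

#include <stdint.h>
#include <stdio.h>

/* multiply a GF(2^8) element by 2: one shift plus a conditional XOR */
static uint8_t
gf_mul2(uint8_t a)
{
    return ((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

int
main(void)
{
    uint8_t data[4] = { 0xde, 0xad, 0xbe, 0xef };   /* one byte per data disk */
    uint8_t p = 0, q = 0, r = 0;

    for (int d = 0; d < 4; d++) {
        p ^= data[d];                       /* P: simple parity */
        q = gf_mul2(q) ^ data[d];           /* Q: generator 2 */
        r = gf_mul2(gf_mul2(r)) ^ data[d];  /* R: generator 4 */
    }

    printf("P=%02x Q=%02x R=%02x\n", p, q, r);
    return (0);
}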

If there is silent data corruption, then and only then can you encounter
the O(n^3) algorithm that you alluded to, but only as a last resort. If we
don't know which drives failed, we try to reconstruct your data by assuming
that one drive, then two drives, then three drives are returning bad data.
For raidz1, this was a linear operation; for raidz2, quadratic; now raidz3
is N-cubed. There's really no way around it. Fortunately, with proper
scrubbing, encountering data corruption in one stripe on three different
drives is highly unlikely.

Adam

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] triple-parity: RAID-Z3

2009-07-23 Thread Adam Leventhal
Robert,

On Fri, Jul 24, 2009 at 12:59:01AM +0100, Robert Milkowski wrote:
>> To what analysis are you referring? Today the absolute fastest you can 
>> resilver a 1TB drive is about 4 hours. Real-world speeds might be half 
>> that. In 2010 we'll have 3TB drives meaning it may take a full day to 
>> resilver. The odds of hitting a latent bit error are already reasonably 
>> high especially with a large pool that's infrequently scrubbed. What then 
>> are the odds of a second drive failing in the 24 hours it takes to 
>> resilver?
>
> I wish it was so good with raid-zN.
> In real life, at least in my experience, it can take several days to 
> resilver a disk for vdevs in raid-z2 made of 11x sata disk drives with real 
> data.
> While the way zfs synchronizes data is way faster under some circumstances 
> it is also much slower under others.
> IIRC some builds ago there were some fixes integrated so maybe it is 
> different now.

Absolutely. I was talking more or less about optimal timing. I realize that
due to the priorities within ZFS and real-world loads it can take far
longer.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD (SLC) for cache...

2009-08-12 Thread Adam Leventhal
My question is about SSD, and the differences between use SLC for  
readzillas instead of MLC.


Sun uses MLCs for Readzillas for their 7000 series. I would think  
that if SLCs (which are generally more expensive) were really  
needed, they would be used.


That's not entirely accurate. In the 7410 and 7310 today (the members  
of the Sun Storage 7000 series that support Readzilla) we use SLC  
SSDs. We're exploring the use of MLC.


Perhaps someone on the Fishworks team could give more details, but  
by going what I've read and seen, MLCs should be sufficient for the  
L2ARC. Save your money.



That's our assessment, but it's highly dependent on the specific  
characteristics of the MLC NAND itself, the SSD controller, and, of  
course, the workload.


Adam

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool

2009-08-27 Thread Adam Leventhal

Hey Gary,

There appears to be a bug in the RAID-Z code that can generate  
spurious checksum errors. I'm looking into it now and hope to have it  
fixed in build 123 or 124. Apologies for the inconvenience.


Adam

On Aug 25, 2009, at 5:29 AM, Gary Gendel wrote:

I have a 5-500GB disk Raid-Z pool that has been producing checksum  
errors right after upgrading SXCE to build 121.  They seem to be  
randomly occurring on all 5 disks, so it doesn't look like a disk  
failure situation.


Repeatingly running a scrub on the pools randomly repairs between 20  
and a few hundred checksum errors.


Since I hadn't physically touched the machine, it seems a very  
strong coincidence that it started right after I upgraded to 121.


This machine is a SunFire v20z with a Marvell SATA 8-port controller  
(the same one as in the original thumper).  I've seen this kind of  
problem way back around build 40-50 ish, but haven't seen it after  
that until now.


Anyone else experiencing this problem or knows how to isolate the  
problem definitively?


Thanks,
Gary
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?

2009-08-29 Thread Adam Leventhal
Will BP rewrite allow adding a drive to raidz1 to get raidz2? And  
how is status on BP rewrite? Far away? Not started yet? Planning?



BP rewrite is an important component technology, but there's a bunch beyond
that. It's not a high priority right now for us at Sun.

Adam

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?

2009-08-30 Thread Adam Leventhal

Hi David,

BP rewrite is an important component technology, but there's a  
bunch beyond that. It's not a high priority right now for us at Sun.


What's the bug / RFE number for it? (So those of us with contracts  
can add a request for it.)


I don't have the number handy, but while it might be satisfying to add
another request for it, Matt is already cranking on it as fast as he can and
more requests for it are likely to have the opposite of the intended effect.

Adam


--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool

2009-09-01 Thread Adam Leventhal

Hi James,

After investigating this problem a bit I'd suggest avoiding deploying RAID-Z
until this issue is resolved. I anticipate having it fixed in build 124.

Apologies for the inconvenience.

Adam

On Aug 28, 2009, at 8:20 PM, James Lever wrote:



On 28/08/2009, at 3:23 AM, Adam Leventhal wrote:

There appears to be a bug in the RAID-Z code that can generate  
spurious checksum errors. I'm looking into it now and hope to have  
it fixed in build 123 or 124. Apologies for the inconvenience.


Are the errors being generated likely to cause any significant  
problem running 121 with a RAID-Z volume or should users of RAID-Z*  
wait until this issue is resolved?


cheers,
James




--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snv_110 -> snv_121 produces checksum errors on Raid-Z pool

2009-09-02 Thread Adam Leventhal
Hey Bob,

> I have seen few people more prone to unsubstantiated conjecture than you.  
> The raidz checksum code was recently reworked to add raidz3. It seems 
> likely that a subtle bug was added at that time.

That appears to be the case. I'm investigating the problem and hope to have
an update to the list either later today or tomorrow.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 7110: Would it self upgrade the system zpool?

2009-09-02 Thread Adam Leventhal

Hi Trevor,

We intentionally install the system pool with an old ZFS version and don't
provide the ability to upgrade. We don't need or use (or even expose) any
of the features of the newer versions so using a newer version would only
create problems rolling back to earlier releases.

Adam

On Sep 2, 2009, at 7:01 PM, Trevor Pretty wrote:


Just Curious

The 7110 I've on loan has an old zpool. I *assume* because it's been  
upgraded and it gives me the ability to downgrade. Anybody know if I  
delete the old version of Amber Road whether the pool would then  
upgrade (I don't want to do it as I want to show the up/downgrade  
feature).


OS pool:-
  pool: system
  state: ONLINE
  status: The pool is formatted using an older on-disk format.  The pool can
          still be used, but some features are unavailable.

And yes I may have invalidated my support. If you have a 7000 box  
don't ask me how to access the system like this, you can see the  
warning. Remember I've a loan box and am just being nosey, a sort  
of looking under the bonnet and going "OOOHHH" an engine, but being  
too scared to even pull the dip stick  :-)


+------------------------------------------------------------------------------+
|  You are entering the operating system shell.  By confirming this action in  |
|  the appliance shell you have agreed that THIS ACTION MAY VOID ANY SUPPORT   |
|  AGREEMENT.  If you do not agree to this -- or do not otherwise understand   |
|  what you are doing -- you should type "exit" at the shell prompt.  EVERY    |
|  COMMAND THAT YOU EXECUTE HERE IS AUDITED, and support personnel may use     |
|  this audit trail to substantiate invalidating your support contract.  The   |
|  operating system shell is NOT a supported mechanism for managing this       |
|  appliance, and COMMANDS EXECUTED HERE MAY DO IRREPARABLE HARM.              |
|                                                                              |
|  NOTHING SHOULD BE ATTEMPTED HERE BY UNTRAINED SUPPORT PERSONNEL UNDER ANY   |
|  CIRCUMSTANCES.  This appliance is a non-traditional operating system        |
|  environment, and expertise in a traditional operating system environment    |
|  in NO WAY constitutes training for supporting this appliance.  THOSE WITH   |
|  EXPERTISE IN OTHER SYSTEMS -- HOWEVER SUPERFICIALLY SIMILAR -- ARE MORE     |
|  LIKELY TO MISTAKENLY EXECUTE OPERATIONS HERE THAT WILL DO IRREPARABLE       |
|  HARM.  Unless you have been explicitly trained on supporting this           |
|  appliance via the operating system shell, you should immediately return     |
|  to the appliance shell.                                                     |
|                                                                              |
|  Type "exit" now to return to the appliance shell.                           |
+------------------------------------------------------------------------------+



Trevor

www.eagle.co.nz
This email is confidential and may be legally privileged. If  
received in error please destroy and immediately notify us.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123

2009-09-03 Thread Adam Leventhal
        disks
      0   1   2
    _____________
    |   |   |   |           P = parity
    | P | D | D |   LBAs    D = data
    |___|___|___|     |     X = skipped sector
    |   |   |   |     |
    | X | P | D |     v
    |___|___|___|
    |   |   |   |
    | D | X |   |
    |___|___|___|

The logic for the optional IOs effectively (though not literally) in this
case would fill in the next LBA on the disk with a 0:

    _____________
    |   |   |   |           P = parity
    | P | D | D |   LBAs    D = data
    |___|___|___|     |     X = skipped sector
    |   |   |   |     |     0 = zero-data from aggregation
    | 0 | P | D |     v
    |___|___|___|
    |   |   |   |
    | D | X |   |
    |___|___|___|

We can see the problem when the parity undergoes the swap described  
above:


        disks
      0   1   2
    _____________
    |   |   |   |           P = parity
    | D | P | D |   LBAs    D = data
    |___|___|___|     |     X = skipped sector
    |   |   |   |     |     0 = zero-data from aggregation
    | X | 0 | P |     v
    |___|___|___|
    |   |   |   |
    | D | X |   |
    |___|___|___|

Note that the 0 incorrectly is also swapped, thus inadvertently overwriting
a data sector in the subsequent stripe. This only occurs if there is IO
aggregation, making it much more likely with small, synchronous IOs. It's
also only possible with an odd number (N) of child vdevs since to induce the
problem the size of the data written must consume a multiple of N-1 sectors
_and_ the total number of sectors used for data and parity must be odd (to
create the need for a skipped sector).

The number of data sectors is simply size / 512 and the number of parity
sectors is ceil(size / 512 / (N-1)).

  1) size / 512 = K * (N-1)
  2) size / 512 + ceil(size / 512 / (N-1)) is odd
therefore
 K * (N-1) + K = K * N is odd

If N is even, K * N cannot be odd and therefore the situation cannot arise.

If N is odd, it is possible to satisfy (1) and (2).
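
A quick brute-force check of that argument for the single-parity case (the
ranges here are arbitrary) counts, for each number of child vdevs N, the
write sizes that both fill whole data rows and need a skipped sector:

#include <stdio.h>

int
main(void)
{
    for (int n = 3; n <= 8; n++) {              /* child vdevs */
        int hits = 0;
        for (int d = 1; d <= 128; d++) {        /* data sectors */
            int parity = (d + (n - 1) - 1) / (n - 1);   /* ceil(d / (N-1)) */
            if (d % (n - 1) == 0 && (d + parity) % 2 == 1)
                hits++;
        }
        printf("N=%d: %d vulnerable sizes\n", n, hits);
    }
    return (0);
}

For even N the count is always zero, matching the argument above.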

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123

2009-09-03 Thread Adam Leventhal
Hey Simon,

> Thanks for the info on this. Some people, including myself, reported seeing
> checksum errors within mirrors too. Is it considered that these checksum
> errors within mirrors could also be related to this bug, or is there another
> bug related to checksum errors within mirrors that I should take a look at?

Absolutely not. That is an unrelated issue. This problem is isolated to
RAID-Z.

> And good luck with the fix for build 124. Are talking days or weeks for the
> fix to be available, do you think? :) -- 

Days or hours.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ versus mirrroed

2009-09-17 Thread Adam Leventhal
On Thu, Sep 17, 2009 at 01:32:43PM +0200, Eugen Leitl wrote:
> > reasons), you will lose 2 disks worth of storage to parity leaving 12
> > disks worth of data. With raid10 you will lose half, 7 disks to
> > parity/redundancy. With two raidz2 sets, you will get (5+2)+(5+2), that
> > is 5+5 disks worth of storage and 2+2 disks worth of redundancy. The
> > actual redudancy/parity is spread over all disks, not like raid3 which
> > has a dedicated parity disk.
> 
> So raidz3 has a dedicated parity disk? I couldn't see that from
> skimming http://blogs.sun.com/ahl/entry/triple_parity_raid_z

Note that Tomas was talking about RAID-3 not raidz3. To summarize the RAID
levels:

  RAID-0striping
  RAID-1mirror
  RAID-2ECC (basically not used)
  RAID-3bit-interleaved parity (basically not used)
  RAID-4block-interleaved parity
  RAID-5block-interleaved distributed parity
  RAID-6block-interleaved double distributed parity

raidz1 is most like RAID-5; raidz2 is most like RAID-6. There's no standard
RAID level that covers more than two parity disks; raidz3 is most like
RAID-6, but with triple distributed parity.

Adam

-- 
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can you turn on zfs compression when the fs is already populated?

2007-01-25 Thread Adam Leventhal
For what it's worth, there is a plan to allow data to be scrubbed so that
you can enable compression for extant data. No ETA, but it's on the roadmap.

In fact, I was recently reminded that I filed a bug on this in 2004:

  5029294 there should be a way to compress an extant file system
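
In the meantime, keep in mind that compression only applies to blocks written
after the property is set, so a crude workaround is to enable it and then
rewrite the data you care about. The dataset and path names here are made up:

  # zfs set compression=on tank/data
  # cp -rp /tank/data/projects /tank/data/projects.new

and then move the copy back over the original once you're satisfied.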

Adam

On Wed, Jan 24, 2007 at 06:50:22PM +0100, [EMAIL PROTECTED] wrote:
> 
> >I have an 800GB raidz2 zfs filesystem.  It already has approx 142Gb of data.
> >Can I simply turn on compression at this point, or do you need to start 
> >with compression
> >at the creation time?  If I turn on compression now, what happens to the 
> >existing data?
> 
> Yes.  Nothing.
> 
> Casper
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Adding my own compression to zfs

2007-01-29 Thread Adam Leventhal
On Mon, Jan 29, 2007 at 02:39:13PM -0800, roland wrote:
> > # zfs get compressratio
> > NAME   PROPERTY   VALUE  SOURCE
> > pool/gzip  compressratio  3.27x  -
> > pool/lzjb  compressratio  1.89x  -
> 
> this looks MUCH better than i would have ever expected for smaller files. 
> 
> any real-world data how good or bad compressratio goes with lots of very 
> small but good compressible files , for example some (evil for those solaris 
> evangelists) untarred linux-source tree ?
> 
> i'm rather excited how effective gzip will compress here.
> 
> for comparison:
> 
> sun1:/comptest #  bzcat /tmp/linux-2.6.19.2.tar.bz2 |tar xvf -
> --snipp--
> 
> sun1:/comptest # du -s -k *
> 143895  linux-2.6.19.2
> 1   pax_global_header
> 
> sun1:/comptest # du -s -k --apparent-size *
> 224282  linux-2.6.19.2
> 1   pax_global_header
> 
> sun1:/comptest # zfs get compressratio comptest
> NAME  PROPERTY   VALUE  SOURCE
> comptest tank  compressratio  1.79x  -

Don't start sending me your favorite files to compress (it really should
work about the same as gzip), but here's the result for the above (I found
a tar file that's about 235M uncompressed):

# du -ks linux-2.6.19.2/
80087   linux-2.6.19.2
# zfs get compressratio pool/gzip
NAME   PROPERTY   VALUE  SOURCE
pool/gzip  compressratio  3.40x  -

Doing a gzip with the default compression level (6 -- the same setting I'm
using in ZFS) yields a file that's about 52M. The small files are hurting
a bit here, but it's still pretty good -- and considerably better than LZJB.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Need help making lsof work with ZFS

2007-02-18 Thread Adam Leventhal
On Wed, Feb 14, 2007 at 01:56:33PM -0700, Matthew Ahrens wrote:
> These files are not shipped with Solaris 10.  You can find them in 
> opensolaris: usr/src/uts/common/fs/zfs/sys/
> 
> The interfaces in these files are not supported, and may change without 
> notice at any time.

Even if they're not supported, shouldn't the header files be shipped so
people can make sense of kernel data structure types?

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs received vol not appearing on iscsi target list

2007-02-26 Thread Adam Leventhal
On Sat, Feb 24, 2007 at 09:29:48PM +1300, Nicholas Lee wrote:
> I'm not really a Solaris expert, but I would have expected vol4 to appear on
> the iscsi target list automatically.  Is there a way to refresh the target
> list? Or is this a bug.

Hi Nicholas,

This is a bug either in ZFS or in the iSCSI target. Please file a bug.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS overhead killed my ZVOL

2007-03-20 Thread Adam Leventhal
On Tue, Mar 20, 2007 at 06:01:28PM -0400, Brian H. Nelson wrote:
> Why does this happen? Is it a bug? I know there is a recommendation of 
> 20% free space for good performance, but that thought never occurred to 
> me when this machine was set up (zvols only, no zfs proper).

It sounds like this bug:

  6430003 record size needs to affect zvol reservation size on RAID-Z

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS overhead killed my ZVOL

2007-03-20 Thread Adam Leventhal
On Wed, Mar 21, 2007 at 01:23:06AM +0100, Robert Milkowski wrote:
> Adam, while you are here, what about gzip compression in ZFS?
> I mean are you going to integrate changes soon?

I submitted the RTI today.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS overhead killed my ZVOL

2007-03-20 Thread Adam Leventhal
On Wed, Mar 21, 2007 at 01:36:10AM +0100, Robert Milkowski wrote:
> btw: I assume that compression level will be hard coded after all,
> right?

Nope. You'll be able to choose from gzip-N with N ranging from 1 to 9 just
like gzip(1).

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] gzip compression support

2007-03-23 Thread Adam Leventhal
I recently integrated this fix into ON:

  6536606 gzip compression for ZFS

With this, ZFS now supports gzip compression. To enable gzip compression
just set the 'compression' property to 'gzip' (or 'gzip-N' where N=1..9).
Existing pools will need to upgrade in order to use this feature, and, yes,
this is the second ZFS version number update this week. Recall that once
you've upgraded a pool older software will no longer be able to access it
regardless of whether you're using the gzip compression algorithm.
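
For example, on a hypothetical existing pool named 'tank', the sequence would
look something like this:

  # zpool upgrade tank
  # zfs set compression=gzip-9 tank/archive
  # zfs get compression,compressratio tank/archive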

I did some very simple tests to look at relative size and time requirements:

  http://blogs.sun.com/ahl/entry/gzip_for_zfs_update

I've also asked Roch Bourbonnais and Richard Elling to do some more
extensive tests.

Adam


From zfs(1M):

 compression=on | off | lzjb | gzip | gzip-N

 Controls  the  compression  algorithm  used   for   this
 dataset.  The  "lzjb" compression algorithm is optimized
 for performance while providing decent data compression.
 Setting  compression to "on" uses the "lzjb" compression
 algorithm. The "gzip"  compression  algorithm  uses  the
 same  compression  as  the  gzip(1)  command.   You  can
 specify the gzip level  by  using  the  value  "gzip-N",
 where  N  is  an  integer  from  1  (fastest) to 9 (best
 compression ratio). Currently, "gzip" is  equivalent  to
 "gzip-6" (which is also the default for gzip(1)).

 This property can also be referred to by  its  shortened
 column name "compress".

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gzip compression support

2007-03-23 Thread Adam Leventhal
On Fri, Mar 23, 2007 at 11:41:21AM -0700, Rich Teer wrote:
> > I recently integrated this fix into ON:
> > 
> >   6536606 gzip compression for ZFS
> 
> Cool!  Can you recall into which build it went?

I put it back yesterday so it will be in build 62.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: ZFS layout for 10 disk?

2007-03-23 Thread Adam Leventhal
I'd take your 10 data disks and make a single raidz2 stripe. You can sustain
two disk failures before losing data, and presumably you'd replace the failed
disks before that was likely to happen. If you're very concerned about
failures, I'd have a single 9-wide raidz2 stripe with a hot spare.
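
For example (the device names are hypothetical), the two layouts would be
created along these lines:

  # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0

  # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      c1t5d0 c1t6d0 c1t7d0 c1t8d0 spare c1t9d0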

Adam

On Fri, Mar 23, 2007 at 01:44:06PM -0700, John-Paul Drawneek wrote:
> Just to clarify
> 
> pool1 -> 5 disk raidz2
> pool2 -> 4 disk raid 10
> 
> spare for both pools
> 
> Is that correct?
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over iSCSI question

2007-03-23 Thread Adam Leventhal
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
> >I'm in a way still hoping that it's a iSCSI related Problem as detecting
> >dead hosts in a network can be a non trivial problem and it takes quite
> >some time for TCP to timeout and inform the upper layers. Just a
> >guess/hope here that FC-AL, ... do better in this case
> 
> iscsi doesn't use TCP, does it?  Anyway, the problem is really transport
> independent.

It does use TCP. Were you thinking UDP?

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Convert raidz

2007-04-02 Thread Adam Leventhal
On Mon, Apr 02, 2007 at 12:37:24AM -0700, homerun wrote:
> Is it possible to convert live 3 disks zpool from raidz to raidz2
> And is it possible to add 1 new disk to raidz configuration without
> backups and recreating zpool from scratch.

The reason that's not possible is that RAID-Z uses a variable stripe
width. This solves some problems (notably the RAID-5 write hole [1]), but
it means that a given 'stripe' over N disks in a raidz1 configuration may
contain as many as floor(N/2) parity blocks -- clearly a single additional
disk wouldn't be sufficient to grow the stripe properly.

It would be possible to have a different type of RAID-Z where stripes were
variable-width to avoid the RAID-5 write hole, but the remainder of the
stripe was left unused. This would allow users to add an additional parity
disk (or several if we ever implement further redundancy) to an existing
configuration, BUT would potentially make much less efficient use of storage. 

Adam


[1] http://blogs.sun.com/bonwick/entry/raid_z

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up for zfsboot

2007-04-04 Thread Adam Leventhal
On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:
> - RAID-Z is _very_ slow when one disk is broken.

Do you have data on this? The reconstruction should be relatively cheap
especially when compared with the initial disk access.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Gzip compression for ZFS

2007-04-04 Thread Adam Leventhal
On Wed, Apr 04, 2007 at 07:57:21PM +1000, Darren Reed wrote:
> From: "Darren J Moffat" <[EMAIL PROTECTED]>
> ...
> >The other problem is that you basically need a global unique registry 
> >anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is  
> >etc etc.  Similarly for crypto and any other transform.
> 
> I've two thoughts on that:
> 1) if there is to be a registry, it should be hosted by OpenSolaris
>   and be open to all and

I think there already is such a registry:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zio.h#89

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up for zfsboot

2007-04-04 Thread Adam Leventhal
On Wed, Apr 04, 2007 at 11:04:06PM +0200, Robert Milkowski wrote:
> If I stop all activity to x4500 with a pool made of several raidz2 and
> then I issue spare attach I get really poor performance (1-2MB/s) on a
> pool with lot of relatively small files.

Does that mean the spare is resilvering when you collect the performance
data? I think a fair test would be to compare the performance of a fully
functional RAID-Z stripe against a one with a missing (absent) device.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Linux

2007-04-12 Thread Adam Leventhal
On Thu, Apr 12, 2007 at 06:59:45PM -0300, Toby Thain wrote:
> >Hey, then just don't *keep on* asking to relicense ZFS (and anything
> >else) to GPL.
> 
> I never would. But it would be horrifying to imagine it relicensed to  
> BSD. (Hello, Microsoft, you just got yourself a competitive filesystem.)

There's nothing today preventing Microsoft (or Apple) from sticking ZFS
into their OS. They'd just have to release the (minimal) diffs to
ZFS-specific files.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
[EMAIL PROTECTED]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Status Update before Reinstall?

2007-04-26 Thread Adam Leventhal
On Wed, Apr 25, 2007 at 09:30:12PM -0700, Richard Elling wrote:
> IMHO, only a few people in the world care about dumps at all (and you
> know who you are :-).  If you care, setup dump to an NFS server somewhere,
> no need to have it local.

Well IMHO, every Solaris customer cares about crash dumps (although they
may not know it). There are failures that occur once -- no dump means no
solution.

And you're not going to be dumping directly over NFS if you care about
your crash dump (see previous point).

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] software RAID vs. HW RAID - part III

2007-04-26 Thread Adam Leventhal
Hey Robert,

This is very cool. Thanks for doing the analysis. What a terrific validation
of software RAID and of RAID-Z in particular.

Adam


On Tue, Apr 24, 2007 at 11:35:32PM +0200, Robert Milkowski wrote:
> Hello zfs-discuss,
> 
> http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-software-raid-part-iii.html
> 
> 
> 
> -- 
> Best regards,
>  Robert Milkowskimailto:[EMAIL PROTECTED]
>  http://milek.blogspot.com
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-09 Thread Adam Leventhal
On Thu, May 03, 2007 at 11:43:49AM -0500, [EMAIL PROTECTED] wrote:
> I think this may be a premature leap -- It is still undetermined if we are
> running up against a yet unknown bug in the kernel implementation of gzip
> used for this compression type. From my understanding the gzip code has
> been reused from an older kernel implementation,  it may be possible that
> this code has some issues with kernel stuttering when used for zfs
> compression that may have not been exposed with its original usage.  If it
> turns out that it is just a case of high cpu trade-off for buying faster
> compression times, then the talk of a tunable may make sense (if it is even
> possible given the constraints of the gzip code in kernelspace).

The in-kernel version of zlib is the latest version (1.2.3). It's not
surprising that we're spending all of our time in zlib if the machine is
being driven by I/O. There are outstanding problems with compression in
the ZIO pipeline that may contribute to the bursty behavior.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-09 Thread Adam Leventhal
On Wed, May 09, 2007 at 11:52:06AM +0100, Darren J Moffat wrote:
> Can you give some more info on what these problems are.

I was thinking of this bug:

  6460622 zio_nowait() doesn't live up to its name

Which I was surprised to find was fixed by Eric in build 59.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iscsitadm local_name in ZFS

2007-05-11 Thread Adam Leventhal
That would be a great RFE. Currently the iSCSI Alias is the dataset name
which should help with identification.

Adam

On Fri, May 04, 2007 at 02:02:34PM +0200, cedric briner wrote:
> cedric briner wrote:
> >hello dear community,
> >
> >Is there a way to have a ``local_name'' as defined in iscsitadm.1m when 
> >you shareiscsi a zvol. This way, it will give an even easier 
> >way to identify a device through IQN.
> >
> >Ced.
> >
> 
> Okay no reply from you so... maybe I didn't make myself well understandable.
> 
> Let me try to re-explain you what I mean:
> when you use zvol and enable shareiscsi, could you add a suffix to the 
> IQN (Iscsi Qualified Name). This suffix will be given by myself and will 
> help me to identify which IQN correspond to which zvol : this is just a 
> more human readable tag on an IQN.
> 
> Similarly, this tag is also given when you do an iscsitadm. And in the 
> man page of iscsitadm it is called a .
> 
> iscsitadm iscsitadm create target -b  /dev/dsk/c0d0s5  tiger
> or
> iscsitadm iscsitadm create target -b  /dev/dsk/c0d0s5  hd-1
> 
> tiger and hd-1 are 
> 
> Ced.
> 
> -- 
> 
> Cedric BRINER
> Geneva - Switzerland
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS over a layered driver interface

2007-05-14 Thread Adam Leventhal
Try 'trace((int)arg1);' -- 4294967295 is the unsigned representation of -1.
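
In other words, rerun the same experiment with the cast applied:

  # dtrace -n 'fbt::ldi_get_size:return{trace((int)arg1);}' \
      -c 'zpool create adsl-pool /dev/layerzfsminor1'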

Adam

On Mon, May 14, 2007 at 09:57:23AM -0700, Shweta Krishnan wrote:
> Thanks Eric and Manoj.
> 
> Here's what ldi_get_size() returns:
> bash-3.00# dtrace -n 'fbt::ldi_get_size:return{trace(arg1);}' -c 'zpool 
> create adsl-pool /dev/layerzfsminor1' dtrace: description 
> 'fbt::ldi_get_size:return' matched 1 probe
> cannot create 'adsl-pool': invalid argument for this pool operation
> dtrace: pid 2582 has exited
> CPU IDFUNCTION:NAME
>   0  20927  ldi_get_size:return4294967295
> 
> 
> This is strange because I looked at the code for ldi_get_size() and the only 
> possible return values in the code are DDI_SUCCESS (0) and DDI_FAILURE(-1).
> 
> Looks like what I'm looking at either isn't the return value, or some bad 
> address is being reached. Any hints?
> 
> Thanks,
> Swetha.
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: ISCSI alias when shareiscsi=on

2007-05-24 Thread Adam Leventhal
Right now -- as I'm sure you have noticed -- we use the dataset name for 
the alias. To let users explicitly set the alias we could add a new property
as you suggest or allow other options for the existing shareiscsi property:

  shareiscsi='alias=potato'

This would sort of match what we do for the sharenfs property.

Adam

On Thu, May 24, 2007 at 02:39:24PM +0200, cedric briner wrote:
> Starting from this thread:
> http://www.opensolaris.org/jive/thread.jspa?messageID=118786
> 
> I would love to have the possibility to set an ISCSI alias when doing an 
> shareiscsi=on on ZFS. This will greatly facilate to identify where an 
> IQN is hosted.
> 
> the ISCSI alias is defined in rfc 3721
> e.g. http://www.apps.ietf.org/rfc/rfc3721.html#sec-2
> 
> and the CLI could be something like:
> zfs set shareiscsi=on shareisicsiname= tank
> 
> 
> Ced.
> -- 
> 
> Cedric BRINER
> Geneva - Switzerland
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mac OS X "Leopard" to use ZFS

2007-06-07 Thread Adam Leventhal
On Thu, Jun 07, 2007 at 08:38:10PM -0300, Toby Thain wrote:
> When should we expect Solaris kernel under OS X? 10.6? 10.7? :-)

I'm sure Jonathan will be announcing that soon. ;-)

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: LZO compression?

2007-06-17 Thread Adam Leventhal
Those are interesting results. Does this mean you've already written lzo
support into ZFS? If not, that would be a great next step -- licensing
issues can be sorted out later...

Adam

On Sat, Jun 16, 2007 at 04:40:48AM -0700, roland wrote:
> btw - is there some way to directly compare lzjb vs lzo compression - to see 
> which performs better and using less cpu ?
> 
> here those numbers from my little benchmark:
> 
> |lzo |6m39.603s |2.99x
> |gzip |7m46.875s |3.41x
> |lzjb |7m7.600s |1.79x
> 
> i`m just curious about these numbers - with lzo i got better speed and better 
> compression in comparison to lzjb
> 
> nothing against lzjb compression - it's pretty nice - but why not taking a 
> closer look  here? maybe here is some room for improvement
> 
> roland
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Mac OS X 10.5 read-only support for ZFS

2007-06-18 Thread Adam Leventhal
On Sun, Jun 17, 2007 at 09:38:51PM -0700, Anton B. Rang wrote:
> Sector errors on DVD are not uncommon. Writing a DVD in ZFS format
> with duplicated data blocks would help protect against that problem, at
> the cost of 50% or so disk space. That sounds like a lot, but with
> BluRay etc. coming along, maybe paying a 50% penalty isn't too bad.
> (And if ZFS eventually supports RAID on a single disk, the penalty
> would be less.)

It would be an interesting project to create some software that took a
directory (or ZFS filesystem) to be written to a CD or DVD and optimized
the layout for redundancy. That is, choose the compression method (if
any), and then, in effect, partition the CD for RAID-Z or mirroring to
stretch the data to fill the entire disc. It wouldn't necessarily be all
that efficient to access, but it would give you resiliency against media
errors.
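
As a very rough sketch of the idea -- the file names and sizes are made up,
and I haven't thought through importing the pool straight off the disc -- you
could stage the image with file-backed vdevs and then burn the backing files:

  # mkfile 2g /stage/d0 /stage/d1
  # zpool create dvdpool mirror /stage/d0 /stage/d1
  # zfs set compression=on dvdpool
  # cp -rp /export/photos /dvdpool
  # zpool export dvdpool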

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to take advantage of PSARC 2007/171: ZFS Separate Intent Log

2007-07-03 Thread Adam Leventhal
Flash SSDs typically boast a huge number of _read_ IOPS (thousands), but
very few write IOPS (tens). The write throughput numbers quoted are almost
certainly for non-synchronous writes whose latency can easily be in the
millisecond range. STEC makes an interesting device which offers fast
_synchronous_ writes on an SSD, but at a pretty steep cost.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal

2007-07-05 Thread Adam Leventhal
This is a great idea. I'd like to add a couple of suggestions:

It might be interesting to focus on compression algorithms which are
optimized for particular workloads and data types, an Oracle database for
example.

It might be worthwhile to have some sort of adaptive compression whereby
ZFS could choose a compression algorithm based on its detection of the
type of data being stored.

Adam

On Thu, Jul 05, 2007 at 08:29:38PM -0300, Domingos Soares wrote:
> Bellow, follows a proposal for a new opensolaris project. Of course,
> this is open to change since I just wrote down some ideas I had months
> ago, while researching the topic as a graduate student in Computer
> Science, and since I'm not an opensolaris/ZFS expert at all. I would
> really appreciate any suggestion or comments.
> 
> PROJECT PROPOSAL: ZFS Compression Algorithms.
> 
> The main purpose of this project is the development of new
> compression schemes for the ZFS file system. We plan to start with
> the development of a fast implementation of a Burrows Wheeler
> Transform based algorithm (BWT). BWT is an outstanding tool
> and the currently known lossless compression algorithms
> based on it outperform the compression ratio of algorithms derived from the 
> well
> known Ziv-Lempel algorithm, while being a little more time and space
> expensive. Therefore, there is space for improvement: recent results
> show that the running time and space needs of such algorithms can be
> significantly reduced and the same results suggests that BWT is
> likely to become the new standard in compression
> algorithms[1]. Suffixes Sorting (i.e. the problem of sorting suffixes of a
> given string) is the main bottleneck of BWT and really significant
> progress has been made in this area since the first algorithms of
> Manbers and Myers[2] and Larsson and Sadakane[3], notably the new
> linear time algorithms of Karkkainen and Sanders[4]; Kim, Sim and
> Park[5] and Ko e aluru[6] and also the promising O(nlogn) algorithm of
> Karkkainen and Burkhardt[7].
> 
> As a conjecture, we believe that some intrinsic properties of ZFS and
> file systems in general (e.g. sparseness and data entropy in blocks)
> could be exploited in order to produce brand new and really efficient
> compression algorithms, as well as the adaptation of existing ones to
> the task. The study might be extended to the analysis of data in
> specific applications (e.g. web servers, mail servers and others) in
> order to develop compression schemes for specific environments and/or
> modify the existing Ziv-Lempel based scheme to deal better with such
> environments.
> 
> [1] "The Burrows-Wheeler Transform: Theory and Practice". Manzini,
> Giovanni. Proc. 24th Int. Symposium on Mathematical Foundations of
> Computer Science
> 
> [2] "Suffix Arrays: A New Method for
> On-Line String Searches". Manber, Udi and Myers, Eugene W..  SIAM
> Journal on Computing, Vol. 22 Issue 5. 1990
> 
> [3] "Faster suffix sorting". Larsson, N Jasper and Sadakane,
> Kunihiko. TECHREPORT, Department of Computer Science, Lund University,
> 1999
> 
> [4] "Simple Linear Work Suffix Array Construction". Karkkainen, Juha
> and Sanders,Peter. Proc. 13th International Conference on Automata,
> Languages and Programming, 2003
> 
> [5]"Linear-time construction of suffix arrays" D.K. Kim, J.S. Sim,
> H. Park, K. Park, CPM, LNCS, Vol. 2676, 2003
> 
> [6] "Space efficient linear time construction of suffix arrays", P. Ko and
> S. Aluru, CPM 2003.
> 
> [7] "Fast Lightweight Suffix Array Construction and
> Checking". Burkhardt, Stefan and Kärkkäinen, Juha. 14th Annual
> Symposium, CPM 2003,
> 
> 
> Domingos Soares Neto
> University of Sao Paulo
> Institute of Mathematics and Statistics
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs vol issue?.

2007-08-17 Thread Adam Leventhal
On Thu, Aug 16, 2007 at 05:20:25AM -0700, ramprakash wrote:
> #zfs mount  -a 
> 1.   mounts "c"  again.
> 2.   but not "vol1"..  [ ie /dev/zvol/dsk/mytank/b/c does not contain "vol1" 
> ] 
> 
> Is this the normal behavior or is it a bug?

That looks like a bug. Please file it.

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirrored zpool across network

2007-08-20 Thread Adam Leventhal
On Sun, Aug 19, 2007 at 05:45:18PM -0700, Mark wrote:
> Basically, the setup is a large volume of Hi-Def video is being streamed
> from a camera, onto an editing timeline. This will be written to a
> network share. Due to the large amounts of data, ZFS is a really good
> option for us. But we need a backup. We need to do it on generic
> hardware (i was thinking AMD64 with an array of large 7200rpm hard
> drives), and therefore i think im going to have one box mirroring the
> other box. They will be connected by gigabit ethernet. So my question
> is how do I mirror one raidz Array across the network to the other?

One big decision you need to make in this scenario is whether you want
true synchronous replication or if asynchronous replication possibly with
some time-bound is acceptable. For the former, each byte must traverse the
network before it is acknowledged to the client; for the latter, data is
written locally and then transmitted shortly after that.

Synchronous replication obviously imposes a much larger performance hit,
but asynchronous replication means you may lose data over some recent
period (but the data will always be consistent).
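
As a sketch of the asynchronous flavor (host, pool, and snapshot names are
made up), you'd periodically snapshot and send the deltas to the second box:

  # zfs snapshot tank/video@now
  # zfs send -i tank/video@prev tank/video@now | \
      ssh backuphost zfs recv -F tank/video

True synchronous replication would have to happen below or beside ZFS -- for
example, mirroring each local disk against an iSCSI device exported by the
other box -- which is a much bigger hammer.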

Adam

-- 
Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Adam Leventhal
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
> And here are the results:
> 
> RAIDZ:
> 
>   Number of READ requests: 4.
>   Number of WRITE requests: 0.
>   Number of bytes to transmit: 695678976.
>   Number of processes: 8.
>   Bytes per second: 1305213
>   Requests per second: 75
> 
> RAID5:
> 
>   Number of READ requests: 4.
>   Number of WRITE requests: 0.
>   Number of bytes to transmit: 695678976.
>   Number of processes: 8.
>   Bytes per second: 2749719
>   Requests per second: 158

I'm a bit surprised by these results. Assuming relatively large blocks
written, RAID-Z and RAID-5 should be laid out on disk very similarly
resulting in similar read performance.

Did you compare the I/O characteristic of both? Was the bottleneck in
the software or the hardware?

Very interesting experiment...

Adam

-- 
Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS gzip compression

2007-10-01 Thread Adam Leventhal
On Sat, Sep 29, 2007 at 05:03:29PM -0700, Scott wrote:
> Thanks for the reply.  I suppose my next question, then, is how
> difficult would it be for me to apply a patch against U4 to gain the
> gzip compression functionality in ZFS?  I come from a FreeBSD
> background, so I have no problems with compiling OpenSolaris source,
> but I would like to retain as much of the code from the production
> S10U4 as I can for stability reasons.

Unfortunately, that's going to be quite difficult because we don't release
the source code for Solaris 10 updates (a position I personally find a bit
dubious). The ZFS team may be able to give you some guidance about what
build of ON most closely corresponds to what's in Solaris 10 U4, and you
could try to work from there.

Adam

-- 
Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-08 Thread Adam Leventhal
On Wed, Nov 07, 2007 at 01:47:04PM -0800, can you guess? wrote:
> I do consider the RAID-Z design to be somewhat brain-damaged [...]

How so? In my opinion, it seems like a cure for the brain damage of RAID-5.

Adam

-- 
Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-16 Thread Adam Leventhal
On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:
> > How so? In my opinion, it seems like a cure for the brain damage of RAID-5.
> 
> Nope.
> 
> A decent RAID-5 hardware implementation has no 'write hole' to worry about, 
> and one can make a software implementation similarly robust with some effort 
> (e.g., by using a transaction log to protect the data-plus-parity 
> double-update or by using COW mechanisms like ZFS's in a more intelligent 
> manner).

Can you reference a software RAID implementation which implements a solution
to the write hole and performs well? My understanding (and this is based on
what I've been told from people more knowledgeable in this domain than I) is
that software RAID has suffered from being unable to provide both
correctness and acceptable performance.

> The part of RAID-Z that's brain-damaged is its 
> concurrent-small-to-medium-sized-access performance (at least up to request 
> sizes equal to the largest block size that ZFS supports, and arguably 
> somewhat beyond that):  while conventional RAID-5 can satisfy N+1 
> small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in 
> parallel (though the latter also take an extra rev to complete), RAID-Z can 
> satisfy only one small-to-medium access request at a time (well, plus a 
> smidge for read accesses if it doesn't verify the parity) - effectively 
> providing RAID-3-style performance.

Brain damage seems a bit of an alarmist label. While you're certainly right
that for a given block we do need to access all disks in the given stripe,
it seems like a rather quaint argument: aren't most environments that matter
trying to avoid waiting for the disk at all? Intelligent prefetch and large
caches -- I'd argue -- are far more important for performance these days.

> The easiest way to fix ZFS's deficiency in this area would probably be to map 
> each group of N blocks in a file as a stripe with its own parity - which 
> would have the added benefit of removing any need to handle parity groups at 
> the disk level (this would, incidentally, not be a bad idea to use for 
> mirroring as well, if my impression is correct that there's a remnant of 
> LVM-style internal management there).  While this wouldn't allow use of 
> parity RAID for very small files, in most installations they really don't 
> occupy much space compared to that used by large files so this should not 
> constitute a significant drawback.

I don't really think this would be feasible given how ZFS is stratified
today, but go ahead and prove me wrong: here are the instructions for
bringing over a copy of the source code:

  http://www.opensolaris.org/os/community/tools/scm

- ahl

-- 
Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Expanding a RAIDZ based Pool...

2007-12-10 Thread Adam Leventhal
On Mon, Dec 10, 2007 at 03:59:22PM +, Karl Pielorz wrote:
> e.g. If I build a RAIDZ pool with 5 * 400Gb drives, and later add a 6th 
> 400Gb drive to this pool, will its space instantly be available to volumes 
> using that pool? (I can't quite see this working myself)

Hi Karl,

You can't currently expand the width of a RAID-Z stripe. It has been
considered, but implementing that would require a fairly substantial change
in the way RAID-Z works. Sun's current ZFS priorities are elsewhere, but
there's nothing preventing an interested member of the community from
undertaking this project...

Adam

-- 
Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz in zfs questions

2008-03-05 Thread Adam Leventhal
>> 2. in a raidz do all the disks have to be the same size?

Disks don't have to be the same size, but only as much space will be used
on each of the larger disks as is available on the smallest disk. In
other words, there's no benefit to be gained from this approach.

> Related question:
> Does a raidz have to be either only full disks or only slices, or can
> it be mixed? E.g., can you do a 3-way raidz with 2 complete disks and
> one slice (of equal size as the disks) on a 3rd, larger, disk?

Sure. One could do this, but it's kind of a hack. I imagine you'd like
to do something like match a disk of size N with another disk of size 2N
and use RAID-Z to turn them into a single vdev. At that point it's
probably a better idea to build a striped vdev and use ditto blocks to do
your data redundancy by setting copies=2.
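
Something like this, with hypothetical device names:

  # zpool create tank c1t0d0 c1t1d0
  # zfs set copies=2 tank

Note that ZFS will try to place the two copies on different disks, but unlike
a real mirror that placement isn't guaranteed.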

Adam

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mixing RAIDZ and RAIDZ2 zvols in the same zpool

2008-03-12 Thread Adam Leventhal
On Wed, Mar 12, 2008 at 09:59:53PM +, A Darren Dunham wrote:
> It's not *bad*, but as far as I'm concerned, it's wasted space.
> 
> You have to deal with the pool as a whole as having single-disk
> redundancy for failure modes.  So the fact that one section of it has
> two-disk redundancy doesn't give you anything in failure planning.
> 
> And you can't assign filesystems or particular data to that vdev, so the
> added redundancy can't be concentrated anywhere.

Well, one can imagine a situation where two different types of disks have
different failure probabilities such that the same reliability could be
garnered with one using single-parity RAID as with the other using double-
parity RAID. That said, it would be a fairly uncommon scenario.

Adam

-- 
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Per filesystem scrub

2008-04-01 Thread Adam Leventhal
On Mar 31, 2008, at 10:41 AM, kristof wrote:
> I would be very happy having a filesystem based zfs scrub
>
> We have a 18TB big zpool, it takes more then 2 days to do the scrub.
>
> Since we cannot take snapshots during the scrub, this is unacceptable

While per-dataset scrubbing would certainly be a coarse-grained solution to
your problem, work is underway to address the problematic interaction between
scrubs and snapshots.

Adam

--
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Algorithm for expanding RAID-Z

2008-04-08 Thread Adam Leventhal
After hearing many vehement requests for expanding RAID-Z vdevs, Matt Ahrens
and I sat down a few weeks ago to figure out a mechanism that would work.
While Sun isn't committing resources to implementing a solution, I've written
up our ideas here:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

I'd encourage anyone interested in getting involved with ZFS development to
take a look.

Adam

-- 
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Periodic ZFS maintenance?

2008-04-20 Thread Adam Leventhal
On Mon, Apr 21, 2008 at 10:41:35AM +1200, Ian Collins wrote:
> Sam wrote:
> > I have a 10x500 disc file server with ZFS+, do I need to perform any sort 
> > of periodic maintenance to the filesystem to keep it in tip top shape?
> >
> No, but if there are problems, a periodic scrub will tip you off sooner
> rather than later.

Well, tip you off _and_ correct the problems if possible. I believe a long-
standing RFE has been to scrub periodically in the background to ensure that
correctable problems don't turn into uncorrectable ones.
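
Until that exists, a root cron job is a serviceable substitute; the pool name
and schedule here are arbitrary:

  # crontab -l | grep scrub
  0 3 * * 0 /usr/sbin/zpool scrub tank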

Adam

-- 
Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

