Re: [zfs-discuss] Recommendation for home NAS external JBOD

2012-06-17 Thread Daniel Carosone
On Sun, Jun 17, 2012 at 03:19:18PM -0500, Timothy Coalson wrote:
> Replacing devices will not change the ashift, it is set permanently
> when a vdev is created, and zpool will refuse to replace a device in
> an ashift=9 vdev with a device that it would use ashift=12 on. 

Yep.

> [..] while hitachi and seagate offer 512 emulated disks

> I did some rudimentary testing on a large pool of hitachi 3TB 512 emulated 
> disks with ashift=9 vs ashift=12 with bonnie, and it didn't seem to matter a
> whole lot 

Hitachi drives are native 512-byte sectors.  At least, the 5k3000 and
7k3000 are, in the 2T and 3T sizes.  I haven't noticed whether they
have a newer model which is 4k native.

How long that remains the case, and how long these models remain
available (e.g. for replacements), is another matter entirely.  The
concern applies even to under-warranty cases; I know someone who
recently had a 4k-only drive supplied as a warranty replacement for a
512-native drive (not, in this case, from Hitachi).

As for performance: at least in my experience with WD disks
emulating 512-byte sectors, you *will* notice the difference, with
heavy metadata updates being the most obvious impact.

The conclusion is that unless your environment is tightly controlled,
the time has probably come when new general-purpose pools should be
created with ashift=12, to allow future flexibility.
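
For anyone wanting to check what an existing pool got, a quick sketch
(the pool name 'tank' is just a placeholder); zdb prints the ashift
recorded in each top-level vdev's config:

  # zdb -C tank | grep ashift
    (ashift: 9 means 512-byte sectors were assumed; 12 means 4k)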

> I'm wondering, based on the comment about routing 4 eSATA cables, what
> kind of options your NAS case has, if your LSI controller has SFF-8087
> connectors (or possibly even if it doesn't), you might be able to use
> an adapter to the SFF-8088 external 4 lane SAS connector, which may
> increase your options.  It seems that support for SATA port multiplier
> is not mandatory in a controller, so you will want to check with LSI
> before trying it (I would hope they support it on SAS controllers,
> since I think it is a vastly simplified version of SAS expanders).

SATA port-multipliers and SAS expanders are not related in any sense
of common driver support; they're similar only in general concept. 

Do not conflate them.

--
Dan.




Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)

2012-06-13 Thread Daniel Carosone
On Tue, Jun 12, 2012 at 03:46:00PM +1000, Scott Aitken wrote:
> Hi all,

Hi Scott. :-)

> I have a 5 drive RAIDZ volume with data that I'd like to recover.

Yeah, still..

> I tried using Jeff Bonwick's labelfix binary to create new labels but it
> carps because the txg is not zero.

Can you provide details of invocation and error response?

For the benefit of others, this was at my suggestion; I've been
discussing this problem with Scott for.. some time. 

> I can also make the solaris machine available via SSH if some wonderful
> person wants to poke around. 

Will take a poke, as discussed.  May well raise more discussion here
as a result.

--
Dan.




Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Daniel Carosone
On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
> client: ubuntu 11.10
> /etc/fstab entry: :/mainpool/storage   /mnt/myelin nfs
> bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async   0
> 0

nfsvers=3

> NAME  PROPERTY  VALUE SOURCE
> mainpool/storage  sync  standard  default

sync=standard

This is expected behaviour for this combination.  NFSv3 semantics
call for writes to be made persistent at the server regardless - and
mostly the same is true of NFSv4.

The async client mount option relates to when the writes get shipped
to the server (immediately or delayed in dirty pages), rather than to
how the server should handle those writes once they arrive.

You could set sync=disabled if you're happy with the consequences, or
even just as a temporary test to confirm the impact.  It sounds like
you would be happy with them, since that's effectively the behaviour
you're trying to achieve.

There is a difference, though: async on the client means data can be
lost on a client reboot; async on the server means data can be lost
on a server reboot (and the client/application may be confused by the
resulting inconsistencies).

Separate datasets (and mounts) for data with different persistence
requirements can help.
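
A rough sketch of what that looks like, reusing the dataset name from
above (treat it as illustrative, and only disable sync where you
really can afford to lose recent writes):

  # zfs set sync=disabled mainpool/storage    (temporary test only)
  # zfs inherit sync mainpool/storage         (revert to the default)
  # zfs create -o sync=disabled mainpool/scratch
    (a separate dataset for data with weaker persistence requirements)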

--
Dan.






Re: [zfs-discuss] Remedies for suboptimal mmap performance on zfs

2012-05-28 Thread Daniel Carosone
On Mon, May 28, 2012 at 01:34:18PM -0700, Richard Elling wrote:
> I'd be interested in the results of such tests. 

Me too, especially for databases like postgresql, where there's a
complementary cache-size tunable within the db that often needs to be
turned up, since such databases implicitly rely on some filesystem
caching as an L2.

That's where this gets tricky: L2ARC has the opportunity to make a big
difference, where the entire db won't all fit in memory (regardless of
which subsystem has jurisdiction over that memory).  If you exclude
data from ARC, you can't spill it to L2ARC.
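
If you do experiment with excluding data from the caches, the knobs
are per-dataset properties; a minimal sketch (dataset name
hypothetical):

  # zfs get primarycache,secondarycache tank/pgdata
  # zfs set primarycache=metadata tank/pgdata

Note that primarycache=metadata keeps data blocks out of the ARC
entirely, and since the L2ARC is fed from ARC evictions, those blocks
will never reach the L2ARC either - which is exactly the spill
problem above.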

For the mmap case: does the ARC keep a separate copy, or does the vm
system map the same page into the process's address space?  If a
separate copy is made, that seems like a potential source of many
kinds of problems - if it's the same page then the whole premise is
essentially moot and there's no "double caching".

--
Dan.



Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-28 Thread Daniel Carosone
On Mon, May 28, 2012 at 09:23:25AM -0600, Nigel W wrote:
> After a snafu
> last week at $work where a 512 byte pool would not resilver with a 4K
> drive plugged in, it appears that (keep in mind that these are
> consumer drives) Seagate no longer manufactures the 7200.12 series
> drives which has a select-able sector size.  The new 7200.14 series is
> 4k only.  

Does this mean they actually present with 4k sectors externally,
rather than use 4k internally and emulate 512b externally?  If so,
this is a good thing - and good to know.

> WD for the time being appears to still present 512 byte
> sectors in their current lineup. What kind of performance penalty this
> carries I don't know as we have not tested any as of yet.  Presumably
> though, WD is going to stop doing that eventually just like Seagate
> already has.

One hopes so.

There are two problems using ZFS on drives with 4k sectors:

 1) if the drive lies and presents 512-byte sectors, and you don't
manually force ashift=12, then the emulation can be slow (and
possibly error prone). There is essentially an internal RMW cycle
when a 4k sector is partially updated.  We use ZFS to get away
from the perils of RMW :) 

 2) with ashift=12, whether forced manually or automatically because
    the disks present 4k sectors, ZFS is less space-efficient for
    metadata and keeps fewer historical uberblocks.

For choosing a tradeoff today, I'll take 2 over 1, after experience
with both.  Option 1 bites, especially (it seems) with raidz vdevs,
but also with mirrors.  Also, a future code change could at least
improve the metadata packing.
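
For completeness, a sketch of forcing the larger shift at pool
creation, on zpool builds that expose the ashift property (device
names hypothetical; on builds without -o ashift you need a patched
binary or a device that presents 4k sectors):

  # zpool create -o ashift=12 tank mirror c0t0d0 c0t1d0
  # zdb -C tank | grep ashift      (confirm it took effect)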

AFAIK, Hitachi is the only vendor still offering 512-native consumer
drives in the 2&3T sizes.  They cost a little more, so that's another
tradeoff. 

--
Dan.



Re: [zfs-discuss] How does resilver/scrub work?

2012-05-22 Thread Daniel Carosone
On Tue, May 22, 2012 at 12:42:02PM +0400, Jim Klimov wrote:
> 2012-05-22 7:30, Daniel Carosone wrote:
>> I've done basically this kind of thing before: dd a disk and then
>> scrub rather than replace, treating errors as expected.
>
> I got into similar situation last night on that Thumper -
> it is now migrating a flaky source disk in the array from
> an original old 250Gb disk into a same-sized partition on
> the new 3Tb drive (as I outlined as IDEA7 in another thread).
> The source disk itself had about 300 CKSUM errors during
> the process, and for reasons beyond my current understanding,
> the resilver never completed.
>
> In zpool status it said that the process was done several
> hours before the time I looked at it, but the TLVDEV still
> had a "spare" component device comprised of the old disk
> and new partition, and the (same) hotspare device in the
> pool was "INUSE".

I think this is at least in part an issue with older code.  There
have been various fixes since for hangs, restarts and incomplete
replace/spare operations.

> After a while we just detached the old disk from the pool
> and ran scrub, which first found some 178 CKSUM errors on
> the new partition right away, and degraded the TLVDEV and
> pool.
>
> We cleared the errors, and ran the script below to log
> the detected errors and clear them, so the disk is fixed
> and not kicked out of the pool due to mismatches.
>
> So in effect, this methodology works for two of us :)
>
> Since you did similar stuff already, I have a few questions:
> 1) How/what did you DD? The whole slice with the zfs vdev?
>Did the system complain (much) about the renaming of the
>device compared to paths embedded in pool/vdev headers?
>Did you do anything manually to remedy that (forcing
>import, DDing some handcrafted uberblocks, anything?)

I've done it a couple of times at least:

 * A failed disk in a raidz1, where I didn't trust that the other
   disks didn't also have errors.  Basically did a ddrescue from the
   old disk to the new one.  I think these days a 'replace' where the
   original disk is still online will use that content, like a
   hotspare replace, rather than assume it has gone away and must be
   recreated, but that wasn't the case at the time.

 * An iscsi mirror of a laptop hard disk, which was out of date and
   had been detached when the laptop iscsi initiator refused to
   start.  Later, the disk developed a few bad sectors.  I made a new
   submirror, let it sync (with the errors still present), then
   blatted bits of the old image over the new one in the areas where
   the bad sectors were being reported.  Another scrub, and they were
   fixed (and some blocks on the new submirror were repaired,
   bringing it back up to date).

> 2) How did you "treat errors as expected" during scrub?

Pretty much as you did: decline to panic and restart scrubs.
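
For anyone wanting to repeat the experiment, a rough sketch of the
copy-then-scrub approach, with entirely hypothetical device names and
GNU ddrescue assumed - think before pasting:

  # zpool offline tank c1t2d0
  # ddrescue -f /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t3d0s0 /var/tmp/rescue.map
    (copy as much as possible from the flaky disk to the new one)
  ... swap or re-cable so the copy takes the old disk's place ...
  # zpool online tank c1t2d0
  # zpool scrub tank
  # zpool status -v tank    (expect some CKSUM errors on that disk)
  # zpool clear tank        (once the scrub has repaired them)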

--
Dan.



Re: [zfs-discuss] How does resilver/scrub work?

2012-05-21 Thread Daniel Carosone
On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
> On Mon, 21 May 2012, Jim Klimov wrote:
>> This is so far a relatively raw idea and I've probably missed
>> something. Do you think it is worth pursuing and asking some
>> zfs developers to make a POC? ;)
>
> I did read all of your text. :-)
>
> This is an interesting idea and could be of some use but it would be  
> wise to test it first a few times before suggesting it as a general  
> course. 

I've done basically this kind of thing before: dd a disk and then
scrub rather than replace, treating errors as expected. 

> Zfs will try to keep the data compacted at the beginning of the  
> partition so if you have a way to know how far out it extends, then the 
> initial 'dd' could be much faster when the pool is not close to full.

zdb will show you usage per metaslab, so you could use that to select
offset ranges and skip any empty ones.  After a while, though, once
the pool has filled past low percentages of usage, most metaslabs
will have some usage, so you might not save much time.  Going to
finer detail within a metaslab is not worthwhile - it is much more
involved, and reintroduces the seeks you're trying to avoid.
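
A sketch of the zdb incantation (pool name hypothetical); each
metaslab line shows how much space it has allocated, so empty ones
are easy to spot (add a second -m to dump the space maps themselves):

  # zdb -m tank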

--
Dan.



Re: [zfs-discuss] How does resilver/scrub work?

2012-05-18 Thread Daniel Carosone
On Fri, May 18, 2012 at 04:18:12PM +1000, Daniel Carosone wrote:
> 
> When doing a scrub, you start at the root bp and walk the tree, doing
> reads for everything, verifying checksums, and letting repair happen
> for any errors. That traversal is either a breadth-first or
> depth-first traversal of the tree (I'm not sure which) done in TXG
> order.  
> 
> [..]
> 
> Note that there can be a lot of fanout in the tree;

Given the latter point, I'm going to guess depth-first.  Yes, I should
look at the code instead of posting speculation. 

--
Dan.




Re: [zfs-discuss] How does resilver/scrub work?

2012-05-17 Thread Daniel Carosone
On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
>   While waiting for that resilver to complete last week,
> I caught myself wondering how the resilvers (are supposed
> to) work in ZFS?

The devil finds work for idle hands... :-)

>   Based on what I see in practice and read in this list
> and some blogs, I've built a picture and would be grateful
> if some experts actually familiar with code and architecture
> would say how far off I guessed from the truth ;)

Well, I'm not that - certainly not on the code.  It would probably be
best (for both of us) to spend idle time looking at the code, before
spending too much on speculation. Nonetheless, let's have at it! :)

>   Ultimately I wonder if there are possible optimizations
> to make the scrub process more resembling a sequential
> drive-cloning (bandwidth/throughput-bound), than an
> IOPS-bound random seek thrashing for hours that we
> often see now, at least on (over?)saturated pools.

The tradeoff will be code complexity and resulting fragility. Choose
wisely what you wish for.

> This may possibly improve zfs send speeds as well.

Less likely; that's pretty much always going to have to go in txg
order.

>   First of all, I state (and ask to confirm): I think
> resilvers are a subset of scrubs, in that:
> 1) resilvers are limited to a particular top-level VDEV
> (and its number is a component of each block's DVA address)
> and
> 2) when scrub finds a block mismatching its known checksum,
> scrub reallocates the whole block anew using the recovered
> known-valid data - in essence it is a newly written block
> with a new path in BP tree and so on; a resilver expects
> to have a disk full of known-missing pieces of blocks,
> and reconstructed pieces are written on the resilvering
> disk "in-place" at an address dictated by the known DVA -
> this allows to not rewrite the other disks and BP tree
> as COW would otherwise require.

No. Scrub (and any other repair, such as for errors found in the
course of normal reads) rewrites the reconstructed blocks in-place:
to the original DVA as referenced by its parents in the BP tree, even
if the device underneath that DVA is actually a new disk.

There is no COW. This is not a rewrite, and there is no original data
to preserve; this is a repair: making the disk sector contain what the
rest of the filesystem tree 'expects' it to contain. More
specifically, making it contain data that checksums to the value that
block pointers elsewhere say it should, via reconstruction using
redundant information (the same DVA on a mirror/RAIDZ reconstruction,
or ditto blocks at different DVAs found in the parent BP for
copies>1, including metadata).

BTW, if a new BP tree was required to repair blocks, we'd have
bp-rewrite already (or we wouldn't have repair yet).

>   Other than these points, resilvers and scrubs should
> work the same, perhaps with nuances like separate tunables
> for throttling and such - but generic algorithms should
> be nearly identical.
>
> Q1: Is this assessment true?

In a sense, yes, despite the correction above.  There is less
difference between these cases than you expected, so they are nearly
identical :-)

>   So I'll call them both a "scrub" below - it's shorter :)

Call them all repair.

The difference is not in how repair happens, but in how the need for a
given sector to be repaired is discovered.

Let's go over those, and clarify terminology, before going through the
rest of your post:

 * Normal reads: a device error or checksum failure triggers a
   repair. 

 * Scrub: Devices may be fine, but we want to verify that and fix any
   errors. In particular, we want to check all redundant copies.

 * Resilver: A device has been offline for a while, and needs to be
   'caught up', from its last known-good TXG to current.

 * Replace: A device has gone, and needs to be completely
   reconstructed.

Scrub is very similar to normal reads, apart from checking all copies
rather than serving the data from whichever copy successfully returns
first. Errors are not expected; they are counted and repaired as/if
found.

Resilver and Replace are very similar, and the terms are often used
interchangably. Replace is essentially resilver with a starting TXG of
0 (plus some labelling). In both cases, an error is expected or
assumed from the device in question, and repair initiated
unconditionally (and without incrementing error counters). 

You're suggesting an asymmetry between Resilver and Replace to exploit
the possible speedup of sequential access; ok, seems attractive at
first blush, let's explore the idea.

>   Now, as everybody knows, at least by word-of-mouth on
> this list, the scrub tends to be slow on pools with a rich
> life (many updates and deletions, causing fragmentation,
> with "old" and "young" blocks intermixed on disk), more
> so if the pools are quite full (over about 80% for some
> reporters). This slowness (on non-SSD disks with non-zero
> seek latency) is attributed to several reasons I've seen
> stat

Re: [zfs-discuss] Two disks giving errors in a raidz pool, advice needed

2012-04-22 Thread Daniel Carosone
On Mon, Apr 23, 2012 at 05:48:16AM +0200, Manuel Ryan wrote:
> After a reboot of the machine, I have no more write errors on disk 2 (only
> 4 checksum, not growing), I was able to access data which I previously
> couldn't and now only the checksum errors on disk 5 are growing.

Well, that's good, but what changed?   If it was just a reboot and
perhaps power-cycle of the disks, I don't think you've solved much in
the long term.. 

> Fortunately, I was able to recover all important data in those conditions
> (yeah !),

.. though that's clearly the most important thing!

If you're down to just checksum errors now, then run a scrub and see
if they can all be repaired, before replacing the disk.  If you
haven't been able to get a scrub complete, then either:
 * delete unimportant / rescued data, until none of the problem
   sectors are referenced any longer, or
 * "replace" the disk like I suggested last time, with a copy under
   zfs' nose and switch

> And since I can live with loosing the pool now, I'll gamble away and
> replace drive 5 tomorrow and if that fails i'll just destroy the pool,
> replace the 2 physical disks and build a new one (maybe raidz2 this time :))

You know what?  If you're prepared to do that in the worst of
circumstances, it would be a very good idea to do that under the best
of circumstances.  If you can, just rebuild it raidz2 and be happier
next time something flaky happens with this hardware.
 
> I'll try to leave all 6 original disks in the machine while replacing,
> maybe zfs will be smart enough to use the 6 drives to build the replacement
> disk ?

I don't think it will.. others who know the code, feel free to comment
otherwise.

If you've got the physical space for the extra disk, why not keep it
there and build the pool raidz2 with the same capacity? 

> It's a miracle that zpool still shows disk 5 as "ONLINE", here's a SMART
> dump of disk 5 (1265 Current_Pending_Sector, ouch) 

That's all indicative of read errors. Note that your reallocated
sector count on that disk is still low, so most of those will probably
clear when overwritten and given a chance to re-map.

If these all appeared suddenly, clearly the disk has developed a
problem. Normally, they appear gradually as head sensitivity
diminishes. 

How often do you normally run a scrub, before this happened?  It's
possible they were accumulating for a while but went undetected for
lack of read attempts to the disk.  Scrub more often!
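
A minimal sketch of automating that, assuming a root crontab and a
pool called 'tank' (adjust the frequency to taste and to how long a
scrub takes on your pool):

  # scrub every Sunday at 03:00
  0 3 * * 0 /usr/sbin/zpool scrub tank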

--
Dan.





Re: [zfs-discuss] Two disks giving errors in a raidz pool, advice needed

2012-04-22 Thread Daniel Carosone
On Mon, Apr 23, 2012 at 02:16:40PM +1200, Ian Collins wrote:
> If it were my data, I'd set the pool read only, backup, rebuild and  
> restore.  You do risk further data loss (maybe even pool loss) while the  
> new drive is resilvering.

You're definitely in a pickle.  The first priority is to try and
ensure that no further damage is done.  Check and make sure you have
an ample power supply.

Setting the pool readonly would be a good start.  Powering down and
checking all the connectors and cables would be another. 
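
If your zpool version supports it, a read-only import is a one-liner;
a sketch (pool name hypothetical; the readonly import option is
relatively recent, so an older installed OS may not have it):

  # zpool export tank
  # zpool import -o readonly=on tank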

Write errors are an interesting result. Check the smart data on that
disk - either it is totally out of sectors to reallocate, or it has
some kind of interface problem.

If you can, image all the disks elsewhere, with something like
ddrescue.  Doing so sequentially rather than random IO through the
filesystem can sometimes have better results for marginal
disks/sectors.  That gives you scratch copies to work on or fall back
to, as you try other recovery methods. 

zfs15 is fairly old..  Consider presenting a copy of the pool to a
newer solaris that may have more robust recovery, as one experiment.

I wouldn't "zpool replace" anything at this point - the moment you do,
you throw away any of the good data on that disk, which might help you
recover sectors that are bad on other disks.  If you have to swap
disks, I would try and get as many of the readable sectors copied 
across to the new disk as possible (ddrescue again) with the pool
offline, and then just physically swap disks, so at least the good
data remains usable.

Try and get some clarity on what's happening with the hardware at an
individual disk level - what reads successfully (at least at the
physical layer, below zfs checksums).  Try and get at the root cause of
the write errors first; they're impeding zfs's recovery of what looks
like other 

> I would only use raidz for unimportant data, or for a copy of data from  
> a more robust pool.

Well, yeah, but a systemic problem (like bad ram or power or
controller) can manifest as a multi-disk failure no matter how many
redundant disks.

--
Dan.




Re: [zfs-discuss] Drive upgrades

2012-04-15 Thread Daniel Carosone
On Sat, Apr 14, 2012 at 09:04:45AM -0400, Edward Ned Harvey wrote:
> Then, about 2 weeks later, the support rep emailed me to say they
> implemented a new feature, which could autoresize +/- some small
> percentage difference, like 1Mb difference or something like that. 

There are two elements to this:
 - the size of actual data on the disk
 - the logical block count, and the resulting LBAs of the labels
   positioned relative to the end of the disk.

The available size of the disk has always been rounded to a whole
number of metaslabs, once the front and back label space is trimmed
off. Combined with the fact that metaslab size is determined
dynamically at vdev creation time based on device size, there can
easily be some amount of unused space at the end, after the last
metaslab and before the end labels. 

It is slop in this space that allows for the small differences you
describe above, even for disks laid out in earlier zpool versions.  
A little poking with zdb and a few calculations will show you just how
much a given disk has. 
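
A sketch of that poking, with a hypothetical pool name; the vdev
config that zdb prints includes both values you need:

  # zdb -C tank | egrep 'metaslab_shift|asize'

Roughly speaking, the metaslab size is 1 << metaslab_shift; divide
the vdev's usable size by that, and whatever is left over after the
last whole metaslab is the slop available for a slightly smaller
replacement.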

However, to make the replacement actually work, the zpool code needed
to not insist on an absolute >= number of blocks (rather, to check the
more proper condition: that there was room for all the metaslabs).
There was also testing to ensure that it handled the end labels moving
inwards in absolute position, for a replacement onto slightly smaller
rather than same-sized/larger disks. That was the change that happened
at the time.

(If you somehow had disks that fit exactly a whole number of
metaslabs, you might still have an issue, I suppose. Perhaps that's
likely if you carefully calculated LUN sizes to carve out of some
other storage, in which case you can do the same for replacements.)

--
Dan.





Re: [zfs-discuss] Accessing Data from a detached device.

2012-03-29 Thread Daniel Carosone
On Thu, Mar 29, 2012 at 05:54:47PM +0200, casper@oracle.com wrote:
> >Is it possible to access the data from a detached device from an 
> >mirrored pool.
> 
> If it is detached, I don't think there is a way to get access
> to the mirror.  Had you used split, you should be able to reimport it.
> 
> (You can try aiming "zpool import" at the disk but I'm not hopeful)

The uberblocks have been invalidated "as a precaution", so no.

If it's too late to use split instead of detach, see this thread:

 http://thread.gmane.org/gmane.os.solaris.opensolaris.zfs/15796/focus=15929

I renew my request for someone to adopt and nurture this tool.

--
Dan.




Re: [zfs-discuss] Cannot mount encrypted filesystems.

2012-02-21 Thread Daniel Carosone
On Tue, Feb 21, 2012 at 11:12:14AM +, Darren J Moffat wrote:
> Did you ever do a send|recv of these filesystems ?  There was a bug with  
> send|recv in 151a that has since been fixed that could cause the salt to  
> be zero'd out in some cases.

Ah, so that's what that was.

I hit this problem some time ago, as was discussed here.
Unfortunately, I also wrote more data into the recv'd filesystem
before the next reboot, and only after that did the new fs become
unmountable.

So, now that the bug is understood and fixed, if I still have the
original dataset (source of the send), can I use that to recover the
salt, and thus the keys and the new data?

--
Dan.



Re: [zfs-discuss] ZFS Dedup and bad checksums

2012-01-12 Thread Daniel Carosone
On Fri, Jan 13, 2012 at 05:16:36AM +0400, Jim Klimov wrote:
> 2012-01-13 4:26, Richard Elling wrote:
>> On Jan 12, 2012, at 4:12 PM, Jim Klimov wrote:
>>> Alternatively (opportunistically), a flag might be set
>>> in the DDT entry requesting that a new write mathching
>>> this stored checksum should get committed to disk - thus
>>> "repairing" all files which reference the block (at least,
>>> stopping the IO errors).
>>
>> verify eliminates this failure mode.
>
> Thinking about it... got more questions:
>
> In this case: DDT/BP contain multiple references with
> correct checksums, but the on-disk block is bad.
> Newly written block has the same checksum, and verification
> proves that on-disk data is different byte-to-byte.
>
> 1) How does the write-stack interact with those checksums
>that do not match the data? Would any checksum be tested
>for this verification read of existing data at all?
>
> 2) It would make sense for the failed verification to
>have the new block committed to disk, and a new DDT
>entry with same checksum created. I would normally
>expect this to be the new unique block of a new file,
>and have no influence on existing data (block chains).
>However in the discussed problematic case, this safe
>behavior would also mean not contributing to reparation
>of those existing block chains which include the
>mismatching on-disk block.
>
> Either I misunderstand some of the above, or I fail to
> see how verification would eliminate this failure mode
> (namely, as per my suggestion, replace the bad block
> with a good one and have all references updated and
> block-chains -> files fixed with one shot).

It doesn't update past data.

It gets treated as if there were a hash collision: the new data is
considered really different despite having the same checksum, and so
gets written out instead of incrementing the existing DDT entry's
refcount.  So it restores your ability to repair the primary
filesystem by overwriting with the same data, which dedup was
previously defeating.

--
Dan.





Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Daniel Carosone
On Thu, Jan 12, 2012 at 05:01:48PM -0800, Richard Elling wrote:
> > This thread is about checksums - namely, now, what are
> > our options when they mismatch the data? As has been
> > reported by many blog-posts researching ZDB, there do
> > happen cases when checksums are broken (i.e. bitrot in
> > block pointers, or rather in RAM while the checksum was
> > calculated - so each ditto copy of BP has the error),
> > but the file data is in fact intact (extracted from
> > disk with ZDB or DD, and compared to other copies).
> 
> Metadata is at least doubly redundant and checksummed.

The implication is that the original calculation of the checksum was
bad in RAM (undetected due to lack of ECC), and then written out
redundantly and fed as bad input to the rest of the Merkle tree.
The data blocks on disk are correct, but they fail to verify against
the bad metadata.

The complaint appears to be that ZFS makes this 'worse' because the
(independently verified) valid data blocks are inaccessible. 

Worse than what? Corrupted file data that is then accurately
checksummed and readable as valid? Accurate data that is read without
any assertion of validity, in a traditional filesystem? There's an
inherent value judgement here that will vary by judge, but in each
case it's as much a judgement on the value of ECC and reliable
hardware, and of your data and the time spent on various kinds of
recovery, as it is on the value of ZFS.

The same circumstance could, in principle, happen due to bad CPU even
with ECC.  In either case, the value of ZFS includes that an error has
been detected you would otherwise have been unaware of, and you get a
clue that you need to fix hardware and spend time. 

--
Dan.




Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-12 Thread Daniel Carosone
On Fri, Jan 13, 2012 at 04:48:44AM +0400, Jim Klimov wrote:
> As Richard reminded me in another thread, both metadata
> and DDT can contain checksums, hopefully of the same data
> block. So for deduped data we may already have a means
> to test whether the data or the checksum is incorrect...

It's the same checksum, calculated once - this is why turning
dedup=on implies setting checksum=sha256.

> Incdentally, the problem also seems more critical for
> the deduped data ;)

Yes.  Add this to the list of reasons to use ECC, and add 'have ECC'
to the list of constraints to circumstances where using dedup is
appropriate. 

--
Dan.



Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-11 Thread Daniel Carosone
On Thu, Jan 12, 2012 at 03:05:32PM +1100, Daniel Carosone wrote:
> On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:
> > ZIL makes zero impact on resilver.  I'll have to check to see if L2ARC is 
> > still used, but
> > due to the nature of the ARC design, read-once workloads like backup or 
> > resilver do 
> > not tend to negatively impact frequently used data.
> 
> This is true, in a strict sense (they don't help resilver itself) but
> it misses the point. They (can) help the system, when resilver is
> underway. 
> 
> ZIL helps reduce the impact busy resilvering disks have on other system

Well, since I'm being strict and picky, I should of course say ZIL-on-slog.

> operation (sync write syscalls and vfs ops by apps).  L2ARC, likewise
> for reads.  Both can hide the latency increases that resilvering iops
> cause for the disks (and which the throttle you mentioned also
> attempts to minimise). 

--
Dan.




Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-11 Thread Daniel Carosone
On Sun, Jan 08, 2012 at 06:25:05PM -0800, Richard Elling wrote:
> ZIL makes zero impact on resilver.  I'll have to check to see if L2ARC is 
> still used, but
> due to the nature of the ARC design, read-once workloads like backup or 
> resilver do 
> not tend to negatively impact frequently used data.

This is true, in a strict sense (they don't help resilver itself) but
it misses the point. They (can) help the system, when resilver is
underway. 

ZIL helps reduce the impact busy resilvering disks have on other system
operation (sync write syscalls and vfs ops by apps).  L2ARC, likewise
for reads.  Both can hide the latency increases that resilvering iops
cause for the disks (and which the throttle you mentioned also
attempts to minimise). 

--
Dan.




Re: [zfs-discuss] questions about the DDT and other things

2011-12-01 Thread Daniel Carosone
On Fri, Dec 02, 2011 at 01:59:37AM +0100, Ragnar Sundblad wrote:
> 
> I am sorry if these are dumb questions. If there are explanations
> available somewhere for those questions that I just haven't found, please
> let me know! :-)

I'll give you a brief summary.

> 1. It has been said that when the DDT entries, some 376 bytes or so, are
> rolled out on L2ARC, there still is some 170 bytes in the ARC to reference
> them (or rather the ZAP objects I believe). In some places it sounds like
>  those 170 bytes refers to ZAP objects that contain several DDT entries.
> In other cases it sounds like for each DDT entry in the L2ARC there must
> be one 170 byte reference in the ARC. What is the story here really?

Currently, every object (not just DDT entries) stored in L2ARC is
tracked in memory. This metadata identifies the object and where on
L2ARC it is stored. The L2ARC on-disk doesn't contain metadata and is
not self-describing. This is one reason why the L2ARC starts out
empty/cold after every reboot, and why the usable size of L2ARC is
limited by memory.
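
As a rough worked example of that limit, using the ~170-byte figure
quoted above (treat the numbers as illustrative only):

   256 GB of L2ARC holding 8 KB records  -> roughly 32 million records
   ~32 million records x ~170 bytes each -> roughly 5-6 GB of ARC
                                            consumed just to track them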

DDT entries in core are used directly.  If the relevant DDT node is
not in core, it must be fetched from the pool, which may in turn be
assisted by an L2ARC.  It's my understanding that, yes, several DDT
entries are stored in each on-disk "block", though I'm not certain of
the number.  The on-disk size of the DDT entry is different, too.

> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't
> understand. It is said that the DDT entries have to be updated for each
> deleted reference to that block. Since zfs already have a mechanism for 
> sharing
> blocks (for example with snapshots), I don't understand why the DDT has to
> contain any more block references at all, or why deletion should be much 
> harder
> just because there are checksums (DDT entries) tied to those blocks, and even
> if they have to, why it would be much harder than the other block reference
> mechanism. If anyone could explain this (or give me a pointer to an
> explanation), I'd be very happy!

DDT entries are reference-counted.  Unlike other things that look like
multiple references, these are truly block-level independent.

Everything else is either tree-structured or highly aggregated (metaslab
free-space tracking).

Snapshots, for example, are references to a certain internal node (the
root of a filesystem tree at a certain txg), and that counts as a
reference to the entire subtree underneath.  Note that any changes to
this subtree later (via writes into the live filesystem) diverge
completely via CoW; an update produces a new CoW block tree all the way
back to the root, above the snapshot node. 

When a snapshot is created, it starts out owning (almost) nothing. As
data is overwritten, the ownership of the data that might otherwise be
freed is transferred to the snapshot.

When the oldest snapshot is freed, any data blocks it owns can be
freed. When an intermediate snapshot is freed, data blocks it owns are
either transferred to the previous older snapshot because they were
shared with it (txg < snapshot's) or they're unique to this snapshot
and can be freed.

Either way, these decisions are tree based and can potentially free
large swathes of space with a single decision, whereas the DDT needs
refcount updates individually for each block (in random order, as per
below).

(This is not the same as the ZPL directory tree used for naming,
however, don't get those confused, it's flatter than that).
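
To get a feel for how big that per-block refcount table is on a real
pool, zdb can summarize it; a sketch (pool name hypothetical):

  # zdb -D tank      (summary: entries, in-core and on-disk sizes)
  # zdb -DD tank     (adds a histogram of reference counts)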

> 3. I, as many others, would of course like to be able to have very large
> datasets deduped without having to have enormous amounts of RAM.
> Since the DDT is a AVL tree, couldn't just that entire tree be cached on
> for example a SSD and be searched there without necessarily having to store
> anything of it in RAM? That would probably require some changes to the DDT
> lookup code, and some mechanism to gather the tree to be able to lift it
> over to the SSD cache, and some other stuff, but still that sounds - with
> my very basic (non-)understanding of zfs - like a not to overwhelming change.

Think of this the other way round. One could do this, and could
require a dedicated device (SSD) in order to use dedup at all.  Now,
every DDT lookup requires IO to bring the DDT entry into memory.  This
would be slow, so we could add an in-memory cache for the DDT... and
we're back to square one.

The major issue with the DDT is that, being content-hash indexed, it
is random-access, even for sequential-access data.  There's no getting
around that; it's in its job description.

> 4. Now and then people mention that the problem with bp_rewrite has been
> explained, on this very mailing list I believe, but I haven't found that
> explanation. Could someone please give me a pointer to that description
> (or perhaps explain it again :-) )?

This relates to the answer for 2; all the pointers in the tree
discussed there are block pointers to device virtual addresses.  If
you're go

Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive

2011-11-10 Thread Daniel Carosone
On Tue, Oct 11, 2011 at 08:17:55PM -0400, John D Groenveld wrote:
> Under both Solaris 10 and Solaris 11x, I receive the evil message:
> | I/O request is not aligned with 4096 disk sector size.
> | It is handled through Read Modify Write but the performance is very low.

I got something similar with 4k-sector 'disks' (a comstar target with
blk=4096) when trying to use them to force a pool to ashift=12. The
labels are found at the wrong offset when the block numbers change,
and maybe the GPT label has issues too.

--
Dan.




Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Daniel Carosone
On Wed, Nov 09, 2011 at 11:09:45AM +1100, Daniel Carosone wrote:
> On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
> >   Recently I stumbled upon a Nexenta+Supermicro report [1] about
> > cluster-in-a-box with shared storage boasting an "active-active
> > cluster" with "transparent failover". Now, I am not certain how
> > these two phrases fit in the same sentence, and maybe it is some
> > marketing-people mixup,
> 
> One way they can not be in conflict, is if each host normally owns 8
> disks and is active with it, and standby for the other 8 disks. 

Which, now that I reread it more carefully, is your case 1. 

Sorry for the noise.

--
Dan.



Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Daniel Carosone
On Wed, Nov 09, 2011 at 03:52:49AM +0400, Jim Klimov wrote:
>   Recently I stumbled upon a Nexenta+Supermicro report [1] about
> cluster-in-a-box with shared storage boasting an "active-active
> cluster" with "transparent failover". Now, I am not certain how
> these two phrases fit in the same sentence, and maybe it is some
> marketing-people mixup,

One way they can not be in conflict, is if each host normally owns 8
disks and is active with it, and standby for the other 8 disks. 

Not sure if this is what the solution in question is doing, just
saying. 

--
Dan.




Re: [zfs-discuss] Log disk with all ssd pool?

2011-11-01 Thread Daniel Carosone
On Tue, Nov 01, 2011 at 06:17:57PM -0400, Edward Ned Harvey wrote:
> You can do both poorly for free, or you can do both very well for big bucks. 
> That's what opensolaris was doing.

That mess was costing someone money and considered very well done?
Good riddance.  

--
Dan.





Re: [zfs-discuss] zpool replace not concluding + duplicate drive label

2011-10-27 Thread Daniel Carosone
On Thu, Oct 27, 2011 at 10:49:22AM +1100, afree...@mac.com wrote:
> Hi all,
> 
> I'm seeing some puzzling behaviour with my RAID-Z.
> 

Indeed.  Start with zdb -l on each of the disks to look at the labels in more 
detail.
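
Something like this, per disk (device path hypothetical); it prints
all four label copies so you can see which ones disagree:

  # zdb -l /dev/rdsk/c0t1d0s0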

--
Dan.



Re: [zfs-discuss] Importing old vdev

2011-10-10 Thread Daniel Carosone
On Mon, Oct 10, 2011 at 04:43:30PM -0400, James Lee wrote:
> I found an old post by Jeff Bonwick with some code that does EXACTLY
> what I was looking for [1].  I had to update the 'label_write' function
> to support the newer ZFS interfaces:

That's great!

Would someone in the community please kindly adopt this little snippet
so that it is maintained as further zfs format updates occur?  Perhaps
even fold it into zdb?

--
Dan.



Re: [zfs-discuss] Fwd: Re: zvol space consumption vs ashift, metadata packing

2011-10-10 Thread Daniel Carosone
On Wed, Oct 05, 2011 at 08:19:20AM +0400, Jim Klimov wrote:
>
> Hello, Daniel,
>
> Apparently your data is represented by rather small files (thus
> many small data blocks)

It's a zvol, default 8k block size, so yes.

> , so proportion of metadata is relatively
> high, and your<4k blocks are now using at least 4k disk space.
> For data with small blocks (a 4k volume on an ashift=12 pool)
> I saw metadata use up most of my drive - becoming equal to
> data size.

That's pretty much my assumption, yes.
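
One knob worth remembering for future zvols (it can't be changed
after creation, and it trades more small-write RMW for less
metadata): a larger volblocksize means far fewer blocks, and hence
proportionally less metadata.  A sketch, names hypothetical:

  # zfs create -V 200G -o volblocksize=64k tank/iscsi_02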

> Just for the sake of completeness, I brought up a similar problem
> and a non-intrusive (compatibility-wise) solution in this bug:
> https://www.illumos.org/issues/1044
>
> Main idea was to let ZFS users specify a minumum data block
> size and alignment, while formal ashift=9 remains in place and
> metadata blocks remain 512b. This fix would be a code-only
> change without on-disk-format changes. 

Well, other than needing to re-make a pool already created with
ashift=12..   By the time I do that, I might as well remake it onto
5k3000's instead.. :)  

> There could be some
> further cleverness when working with (updating) metadata,
> so that hardware 4k blocks would need to be rewritten as
> rarely and as fully as possible - to reduce wear and increase
> efficiency - but the main idea is hopefully simple.

If only the implementation were too..

--
Dan.



Re: [zfs-discuss] ZFS on a RAID Adapter?

2011-10-10 Thread Daniel Carosone
On Mon, Oct 03, 2011 at 07:34:07PM -0400, Edward Ned Harvey wrote:
> It is also very similar to running iscsi targets on ZFS,
> while letting some other servers use iscsi to connect to the ZFS server.

The SAS, IB and FCoE targets, too..

SAS might be the most direct comparison for replacing a traditional
RAID controller in a host.  Most other HBAs already look enough like a
RAID controller to potentially confuse the issue, and also run more
directly head-to-head with the full-scale SAN model.  Some machines
have iSCSI boot in firmware for the motherboard NICs, so for those
that would be a viable comparison.

I'm talking primarily about sticker price and customer confusion here;
of course the architectural block diagram is the same regardless of
the PHY layer for scsi transport.

--
Dan.




Re: [zfs-discuss] zvol space consumption vs ashift, metadata packing

2011-10-10 Thread Daniel Carosone
On Tue, Oct 04, 2011 at 09:28:36PM -0700, Richard Elling wrote:
> On Oct 4, 2011, at 4:14 PM, Daniel Carosone wrote:
> 
> > I sent it twice, because something strange happened on the first send,
> > to the ashift=12 pool.  "zfs list -o space" showed figures at least
> > twice those on the source, maybe roughly 2.5 times.
> 
> Can you share the output?

Source machine, zpool v14 snv_111b:

NAME          AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
int/iscsi_01  99.2G  237G     37.9G    199G              0          0     200G

Destination machine, zpool v31 snv_151b:

NAME           AVAIL  USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  VOLSIZE
geek/iscsi_01  3.64T  550G     88.4G    461G              0          0     200G
uext/iscsi_01  1.73T  245G     39.2G    206G              0          0     200G

geek is the ashift=12 pool, obviously.  I'm assuming the smaller
difference for uext is due to other layout differences in the pool
versions.

> > What is going on? Is there really that much metadata overhead?  How
> > many metadata blocks are needed for each 8k vol block, and are they
> > each really only holding 512 bytes of metadata in a 4k allocation?
> > Can they not be packed appropriately for the ashift?
> 
> Doesn't matter how small metadata compresses, the minimum size you can write
> is 4KB.

This isn't about whether the metadata compresses; this is about
whether ZFS is smart enough to use all the space in a 4k block for
metadata, rather than assuming it can fit at best 512 bytes,
regardless of ashift.  By packing, I meant packing the blocks full
rather than leaving them mostly empty and wasted - nothing to do with
compression.

> I think we'd need to see the exact layout of the internal data. This can be 
> achieved with the zfs_blkstats macro in mdb. Perhaps we can take this offline
> and report back?

Happy to - what other details / output would you like?

--
Dan.



[zfs-discuss] zvol space consumption vs ashift, metadata packing

2011-10-04 Thread Daniel Carosone
I sent a zvol from host a, to host b, twice.  Host b has two pools,
one ashift=9, one ashift=12.  I sent the zvol to each of the pools on
b.  The original source pool is ashift=9, and an old revision (2009_06
because it's still running xen). 

I sent it twice, because something strange happened on the first send,
to the ashift=12 pool.  "zfs list -o space" showed figures at least
twice those on the source, maybe roughly 2.5 times.

I suspected this might be related to ashift, so I tried the second
send to the ashift=9 pool; those received snapshots line up with the
same space consumption as the source.

What is going on? Is there really that much metadata overhead?  How
many metadata blocks are needed for each 8k vol block, and are they
each really only holding 512 bytes of metadata in a 4k allocation?
Can they not be packed appropriately for the ashift?

Longer term, if zfs were to pack metadata into full blocks by ashift,
is it likely that this could be introduced via a zpool upgrade, with
space recovered as metadata is rewritten - or would it need the pool
to be recreated?  Or is there some other solution in the works?

--
Dan.



Re: [zfs-discuss] Mirror Gone

2011-09-27 Thread Daniel Carosone
On Tue, Sep 27, 2011 at 08:29:03PM -0400, Edward Ned Harvey wrote:
> There is only one way for this to make sense:  You did not have mirror-1 in
> the first place.  You accidentally added 4 & 5 without mirroring.  

Not true. 4 & 5 may have been added initially as a mirror, then 5
detached from the mirror and later added as a single drive (rather than
reattached to 4 as a mirror again).   

It can be easy to confuse "add" and "attach" commands.

> The only
> way to fix it is to (a) add redundancy to both 4 & 5

Yes, attach 2 new disks, one to each of 4 and 5, to turn those
vdevs (back) into mirrors.
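
A sketch of that, with hypothetical device names; 'attach' (not
'add') is the key word - each new disk attaches to an existing
single-disk top-level vdev and turns it into a mirror:

  # zpool attach tank c0t4d0 c0t6d0
  # zpool attach tank c0t5d0 c0t7d0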

>, or (b) destroy and
> recreate the pool, and this time be very careful that you mirror 4&5.

Unfortunately, yes, if you can't attach more disks, this is necessary,
since it's not currently possible to remove a top-level vdev.

--
Dan.




Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug

2011-09-14 Thread Daniel Carosone
On Wed, Sep 14, 2011 at 05:36:53PM -0400, Paul Kraus wrote:
> T2000 with 32 GB RAM
> 
> zpool that hangs the machine by running it out of kernel memory when
> trying to import the zpool
> 
> zpool has an "incomplete" snapshot from a zfs recv that it is trying
> to destroy on import
> 
> I *can* import the zpool readonly

Can you import it booting from a newer kernel (say liveDVD), and allow
that to complete the deletion? Or does this not help until the pool is
upgraded past the on-disk format in question, for which it must first
be imported writable?  

If you can import it read-only, would it be faster to just send it
somewhere else?  Is there a new-enough snapshot near the current data?

--
Dan.





Re: [zfs-discuss] booting from ashift=12 pool..

2011-09-14 Thread Daniel Carosone
On Wed, Sep 14, 2011 at 04:08:19PM +0200, Hans Rosenfeld wrote:
> On Mon, Sep 05, 2011 at 02:18:48AM -0400, Daniel Carosone wrote:
> > I see via the issue tracker that there have been several updates
> > since, and an integration back into the main Illumos tree.   How do I
> > go about getting hold of current boot blocks?
> 
> The OpenIndiana release that was announced earlier today has the fixed
> boot blocks.

Yep, saw that and have it here ready to boot and install grub.  I hope
the fact that the pool itself is v31 for zfs crypto will not be a
problem.. 

If it should be the case that the pool version is an issue running
from the OI CD, can I take the updated stage* files and use them with
the installgrub from solaris express b151?

I guess I'll find out in due course. 

--
Dan.




Re: [zfs-discuss] Deskstars and CCTL (aka TLER)

2011-09-07 Thread Daniel Carosone
On Wed, Sep 07, 2011 at 11:20:06AM +0200, Roy Sigurd Karlsbakk wrote:
> Hi all
> 
> Reading the docs for the Hitachi drives, it seems CCTL (aka TLER) is settable 
> for Deskstar drives. See page 97 in http://goo.gl/ER0WD

Looks like another positive for these drives over the "competition".
The same appears to be the case for the 5k3000's as well (page 96 in
that document).

Note, however:

   These timers do not apply to streaming commands, or to queued
   commands when out-of-order data delivery is enabled. 

I presume the latter is the common case for NCQ reads?  That would
appear to limit the usefulness of this specific knob, even if it's
better than binding it to the model number as other vendors do.
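
For what it's worth, recent smartmontools can read and (where the
drive allows it) set these timers via SCT Error Recovery Control; a
sketch, with a hypothetical device path and support varying by drive
and platform:

  # smartctl -l scterc /dev/rdsk/c0t0d0
  # smartctl -l scterc,70,70 /dev/rdsk/c0t0d0   (7.0s read/write limits)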

--
Dan.





Re: [zfs-discuss] zfs send and dedupe

2011-09-07 Thread Daniel Carosone
On Wed, Sep 07, 2011 at 08:47:36AM -0600, Lori Alt wrote:
> On 09/ 6/11 11:45 PM, Daniel Carosone wrote:
>> My understanding was that 'zfs send -D' would use the pool's DDT in
>> building its own, if present.
> It does not use the pool's DDT, but it does use the SHA-256 checksums  
> that have already been calculated for on-disk dedup, thus speeding the  
> generation of the send stream.

Ah, thanks for the clarification.  Presumably the same is true if the
pool is using checksum=sha256, without dedup? 

Still a moot point for now :)

--
Dan.




Re: [zfs-discuss] zfs send and dedupe

2011-09-06 Thread Daniel Carosone
On Tue, Sep 06, 2011 at 10:05:54PM -0700, Richard Elling wrote:
> On Sep 6, 2011, at 9:01 PM, Freddie Cash wrote:
> 
> > For example, does 'zfs send -D' use the same DDT as the pool?
> 
> No.

My understanding was that 'zfs send -D' would use the pool's DDT in 
building its own, if present. If blocks were known by the filesystem
to be duplicate, it would use that knowledge to skip some work seeding
its own ddt and stream back-references. This doesn't change the stream
contents vs what it would have generated without these hints, so "No"
still works as a short answer :) 

That understanding was based on discussions and blog posts at the
time, not looking at code. At least in theory, it should help avoid
reading and checksumming extra data blocks if this knowledge can be
used, so less work regardless of measurable impact on send throughput.
(It's more about diminished impact to other concurrent activities)

The point has mostly been moot in practice, though, because I've found
"zfs send -D" just plain doesn't work and often generates invalid
streams, as you note. Good to know there are fixes.

--
Dan.



Re: [zfs-discuss] booting from ashift=12 pool..

2011-09-04 Thread Daniel Carosone
On Tue, Aug 09, 2011 at 10:51:37AM +1000, Daniel Carosone wrote:
> On Mon, Aug 01, 2011 at 01:25:35PM +1000, Daniel Carosone wrote:
> > To be clear, the system I was working on the other day is now running
> > with a normal ashift=9 pool, on a mirror of WD 2TB EARX.  Not quite
> > what I was hoping for, but hopefully it will be OK; I won't have much
> > chance to mess with it again for a little while. 
> 
> That turned out to be a false hope.  The system is almost unusable.
> 
> Soon after anything creates a lot of metadata updates, it grinds into
> the ground doing ~350 write iops, 0 read, forever trying to write them
> out. Processes start blocking on reads and/or txg closes and the
> system never comes back. rsync and atimes were the first and worst
> culprit, but atime=off wasn't enough to prevent it completely. 

With a bit of tweaking of workload, I bought enough time to put this off.

> It can't just be that these are slow to write out, because I would
> expect it to eventually finish. I suspect something else is going on
> here. 

That something, apparently, was thrashing on swap space. :-/

> I'll have to find time to convert this back to ashift=12 and try your
> boot blocks soon. 

I see via the issue tracker that there have been several updates
since, and an integration back into the main Illumos tree.   How do I
go about getting hold of current boot blocks?

--
Dan.




Re: [zfs-discuss] Advice with SSD, ZIL and L2ARC

2011-08-30 Thread Daniel Carosone
On Tue, Aug 30, 2011 at 03:53:48PM +0100, Darren J Moffat wrote:
> On 08/30/11 15:31, Edward Ned Harvey wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Jesus Cea
>>>
>>> 1. Is the L2ARC data stored in the SSD checksummed?. If so, can I
>>> expect that ZFS goes directly to the disk if the checksum is wrong?.
>>
>> Yup.
>
> Note the following is an implementation detail subject to change:
>
> It is NOT checksumed on disk only in memory, but the L2ARC data on disk  
> is not used after reboot anyway just now.

It's not checksummed on disk with a separate L2ARC-specific checksum
because that's unnecessary - the cached data is verified against the
original zfs block checksums. And, yes, if that check fails it counts
as a "bad read" and zfs tries again from the data pool.

It's checksummed on the way into memory, from either pool disk or
l2arc disk.  If it's already in ARC memory, it's just a hit and the
checksum is not done each time - that would be ludicrously expensive,
and is one of the ways non-ECC systems can corrupt data.

L2ARC persistence may require adding checksumming to the L2ARC on-disk
format, but presumably for the L2ARC metadata that will need to be
stored persistently (and now only exists in ram), not so much for the
cached pool data. 
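(If you want to watch the verification happening, the arc kstats keep
counters for it - the names below are from memory, so check against
what your build actually exports:

   # kstat -p zfs:0:arcstats:l2_cksum_bad
   # kstat -p zfs:0:arcstats:l2_io_error

a non-zero l2_cksum_bad, with no errors charged against the pool, is
exactly the "bad read, retried from the pool" case described above.)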

--
Dan.



Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-30 Thread Daniel Carosone
On Mon, Aug 29, 2011 at 11:40:34PM -0400, Edward Ned Harvey wrote:
> > On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> > > I'm getting a bit tired of people designing for fast resilvering.
> > 
> > It is a design consideration, regardless, though your point is valid
> > that it shouldn't be the overriding consideration.
> 
> I disagree.  I think if you build a system that will literally never
> complete a resilver, or if the resilver requires weeks or months to
> complete, then you've fundamentally misconfigured your system.  Avoiding
> such situations should be a top priority.  Such a misconfiguration is
> sometimes the case with people building 21-disk raidz3 and similar
> configurations...

Ok, yes, for these extreme cases, any of the considerations gets a
veto for "pool is unservicable". 

Beyond that, though, Richard's point is that optimising for resilver
time to the exclusion of other requirements will produce bad designs.
In my extended example, I mentioned resilver and recovery times and
impacts, but only in amongst other factors.

Another way of putting it is that pool configs that will be pessimal for
resilver will likely also be pessimal for other considerations
(general iops performance being the obvious closely-linked case).

--
Dan.



Re: [zfs-discuss] ZFS raidz on top of hardware raid0

2011-08-28 Thread Daniel Carosone
On Sat, Aug 27, 2011 at 08:44:13AM -0700, Richard Elling wrote:
> I'm getting a bit tired of people designing for fast resilvering. 

It is a design consideration, regardless, though your point is valid
that it shouldn't be the overriding consideration. 

To the original question and poster: 

This often arises out of another type of consideration, that of the
size of a "failure unit".  When plugging together systems at almost
any scale beyond a handful of disks, there are many kinds of groupings
of disks whereby the whole group may disappear if a certain component
fails: controllers, power supplies, backplanes, cables, network/fabric
switches, backplanes, etc.  The probabilities of each of these varies,
often greatly, but they can shape and constrain a design.

I'm going to choose a deliberately exaggerated example, to illustrate
the discussion and recommendations in the thread, using the OP's
numbers.

Let's say that I have 20 5-disk little NAS boxes, each with their own
single power supply and NIC.  Each is an iSCSI target, and can serve
up either 5 bare-disk LUNs, or a single LUN for the whole box, backed
by internal RAID. Internal RAID can be 0 or 5. 

Clearly, a-box-of-5-disks is an independent failure unit, at
non-trivial probability via a variety of possible causes. I better
plan my pool accordingly. 

The first option is to "simplify" the configuration, representing
the obvious failure unit as a single LUN, just a big disk.  There is
merit in simplicity, especially for the humans involved if they're not
sophisticated and experienced ZFS users (or else why would they be
asking these questions?). This may prevent confusion and possible
mistakes (at 3am under pressure, even experienced admins make those). 

This gives us 20 "disks" to make a pool, of whatever layout suits our
performance and resiliency needs.  Regardless of what disks are used,
a 20-way RAIDZ is unlikely to be a good answer.  2x 10-way raidz2, 4x
5-way raidz1, 2-way and 3-way mirrors, might all be useful depending
on circumstances. (As an aside, mirrors might be the layout of choice
if switch failures are also to be taken into consideration, for
practical network topologies.)

The second option is to give ZFS all the disks individually. We will
embed our knowledge of the failure domains into the pool structure,
choosing which disks go in which vdev accordingly. 

The simplest expression of this is to take the same layout we chose
above for 20 big disks, and make 5 of them, each as a top-level vdev
in the same pattern, for each of the 5 individual disks. Think about
making 5 separate pools with the same layout as the previous case, and
stacking them together into one. (As another aside, in previous
discussions I've also recommended considering multiple pools vs
multiple vdevs, that still applies but I won't reiterate here.)

If our pool had enough redundancy for our needs before, we will now
have 5 times as many top-level vdevs, which will experience tolerable
failures in groups of 5 if a disk box dies, for the same overall
result.  

ZFS generally does better this way.  We will have more direct
concurrency, because ZFS's device tree maps to spindles, rather than
to a more complex interaction of underlying components. Physical disk
failures can now be seen by ZFS as such, and don't get amplified to
whole LUN failures (RAID0) or performance degradations during internal
reconstruction (RAID5). ZFS will prefer not to allocate new data on a
degraded vdev until it is repaired, but needs to know about it in the
first place. Even before we talk about recovery, ZFS can likely report
errors better than the internal RAID, which may just hide an issue
long enough for it to become a real problem during another later event.

If we can (e.g.) assign the WWN's of the exported LUNs according to a
scheme that makes disk location obvious, we're less likely to get
confused because of all the extra disks.  The structure is still
apparent.  

(There are more layouts we can now create using the extra disks, but
we lose the simplicity, and they don't really enhance this example for
the general case.  Very careful analysis will be required, and
errors under pressure might result in a situation where the system
works, but later resiliency is compromised.  This is especially true
if hot-spares are involved.) 

So, the ZFS preference is definitely for individual disks.  What might
override this preference, and cause us to use LUNs over the internal
raid, other than the perception of simplicity due to inexperience?
Some possibilities are below.

Because local reconstructions within a box may be much faster than
over the network.  Remember, though, that we trust ZFS more than
RAID5 (even before any specific implementation has a chance to add its
own bugs and wrinkles). So, effectively, after such a local RAID5
reconstruction, we'd want to run a scrub anyway - at which point we
might as well just have let ZFS resilver.  If we have more than one
top-level vdev, whi

Re: [zfs-discuss] booting from ashift=12 pool..

2011-08-08 Thread Daniel Carosone
On Mon, Aug 01, 2011 at 01:25:35PM +1000, Daniel Carosone wrote:
> To be clear, the system I was working on the other day is now running
> with a normal ashift=9 pool, on a mirror of WD 2TB EARX.  Not quite
> what I was hoping for, but hopefully it will be OK; I won't have much
> chance to mess with it again for a little while. 

That turned out to be a false hope.  The system is almost unusable.

Soon after anything creates a lot of metadata updates, it grinds into
the ground doing ~350 write iops, 0 read, forever trying to write them
out. Processes start blocking on reads and/or txg closes and the
system never comes back. rsync and atimes were the first and worst
culprit, but atime=off wasn't enough to prevent it completely. 

It can't just be that these are slow to write out, because I would
expect it to eventually finish. I suspect something else is going on
here. Would zfs re-issue writes if they haven't gotten to the disk
yet, somehow?  I'm not talking about ata commands timing out, but
something at a higher level making a long queue worse.

I'll have to find time to convert this back to ashift=12 and try your
boot blocks soon. 

--
Dan.



Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-03 Thread Daniel Carosone
On Wed, Aug 03, 2011 at 12:32:56PM -0700, Brandon High wrote:
> On Mon, Aug 1, 2011 at 4:27 PM, Daniel Carosone  wrote:
> > The other thing that can cause a storm of tiny IOs is dedup, and this
> > effect can last long after space has been freed and/or dedup turned
> > off, until all the blocks corresponding to DDT entries are rewritten.
> > I wonder if this was involved here.
> 
> Using dedup on a pool that houses an Oracle DB is Doing It Wrong in so
> many ways...

Indeed, but alas people still Do It Wrong.  In particular, when a pool
is approaching full, turning on dedup might seem like an attractive
proposition to someone who doesn't understand the cost. 

So I just wonder if they have, or had at some point in the past, enabled it.
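(Easy enough to check after the fact, too - something like, with an
invented pool name:

   # zpool history tank | grep -i dedup
   # zdb -DD tank

the first shows whether the property was ever turned on or off, the
second prints DDT statistics; a non-trivial DDT on a pool that now has
dedup=off is the tell-tale for this lingering effect.)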

--
Dan.



Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL

2011-08-01 Thread Daniel Carosone
On Mon, Aug 01, 2011 at 03:10:28PM -0700, Richard Elling wrote:
> On Aug 1, 2011, at 2:16 PM, Neil Perrin wrote:
> 
> > In general the blogs conclusion is correct . When file systems get full 
> > there is
> > fragmentation (happens to all file systems) and for ZFS the pool uses gang
> > blocks of smaller blocks when there are insufficient large blocks.
> > However, the ZIL never allocates or uses gang blocks. It directly allocates
> > blocks (outside of the zio pipeline) using zio_alloc_zil() -> 
> > metaslab_alloc().
> > Gang blocks are only used by the main pool when the pool transaction
> > group (txg) commit occurs.  Solutions to the problem include:
> >   - add a separate intent log
> 
> Yes, I thought that it was odd that someone who is familiar with Oracle 
> databases,
> and their redo logs, didn't use separate intent logs.
> 
> >   - add more top level devices (hopefully replicated)
> >   - delete unused files/snapshots etc within the pool?
> 
> If gang activity is the root cause of the performance, then they must be at 
> the
> edge of effective space utilization.
>  -- richard

The other thing that can cause a storm of tiny IOs is dedup, and this
effect can last long after space has been freed and/or dedup turned
off, until all the blocks corresponding to DDT entries are rewritten.
I wonder if this was involved here.  

--
Dan.



Re: [zfs-discuss] booting from ashift=12 pool..

2011-07-31 Thread Daniel Carosone
On Mon, Aug 01, 2011 at 11:22:36AM +1000, Daniel Carosone wrote:
> On Fri, Jul 29, 2011 at 05:58:49PM +0200, Hans Rosenfeld wrote:
> 
> > I'm working on a patch for grub that fixes the ashift=12 issue. 
> 
> Oh, great - and from the looks of the patch, for other values of 12 as
> well :)
> 
> > I'm probably not going to fix the div-by-zero reboot.
> 
> Fair enough, if it's an existing unrelated error we no longer
> expose. Perhaps it's even fixed/irrelevant for grub2, can this be
> checked easily? 
> 

FWIW, this seems to be a live issue with the zfs-on-linux folks too,
perhaps some coordination would be helpful?

See, e.g.:
http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/0c80103a8d5c0bb0#

> > If you want to try it, the patch can be found at
> > http://cr.illumos.org/view/6qc99xkh/illumos-1303-webrev/illumos-1303-webrev.patch
> 
> Any chance of providing an alternate stage1/stage2 binary I can feed
> to installgrub?  When you're ready..

To be clear, the system I was working on the other day is now running
with a normal ashift=9 pool, on a mirror of WD 2TB EARX.  Not quite
what I was hoping for, but hopefully it will be OK; I won't have much
chance to mess with it again for a little while.  I will be building
something else useful for testing this, sometime in the next couple of
weeks.

--
Dan.



Re: [zfs-discuss] booting from ashift=12 pool..

2011-07-31 Thread Daniel Carosone
On Fri, Jul 29, 2011 at 05:58:49PM +0200, Hans Rosenfeld wrote:

> I'm working on a patch for grub that fixes the ashift=12 issue. 

Oh, great - and from the looks of the patch, for other values of 12 as
well :)

> I'm probably not going to fix the div-by-zero reboot.

Fair enough, if it's an existing unrelated error we no longer
expose. Perhaps it's even fixed/irrelevant for grub2, can this be
checked easily? 

> If you want to try it, the patch can be found at
> http://cr.illumos.org/view/6qc99xkh/illumos-1303-webrev/illumos-1303-webrev.patch

Any chance of providing an alternate stage1/stage2 binary I can feed
to installgrub?  When you're ready..
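(For the archives, I'd expect the invocation to be the usual one, just
pointed at the replacement binaries instead of the ones under
/boot/grub - paths and device name invented:

   # installgrub -m /path/to/new/stage1 /path/to/new/stage2 /dev/rdsk/c0t0d0s0

with -m only if stage1 also belongs in the MBR on that box.)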

--
Dan.




[zfs-discuss] booting from ashift=12 pool..

2011-07-28 Thread Daniel Carosone
.. evidently doesn't work.  GRUB reboots the machine moments after
loading stage2, and doesn't recognise the fstype when examining the
disk when loaded from an alternate source.

This is with SX-151.  Here's hoping a future version (with grub2?)
resolves this, as well as lets us boot from raidz.

Just a note for the archives in case it helps someone else get back
the afternoon I just burnt.

--
Dan.



Re: [zfs-discuss] snapshots in solaris11 express

2011-07-27 Thread Daniel Carosone
On Wed, Jul 27, 2011 at 08:27:42AM +0200, Carsten John wrote:
> Hello everybody,
> 
> is there any known way to configure the point-in-time *when* the time-slider 
> will snapshot/rotate?
> 
> With hundreds of zfs filesystems, the daily snapshot rotation slows down a 
> big file server significantly, so it would be better to  have the snapshots 
> rotated outside the usual workhours.
> 
> As as I found out so far, the first snapshot is taken when the service is 
> restartet and then the next occurs 24 hour later (as supposed). Do I need to 
> restart the service at 2:00 AM to get the desired result (not a big deal 
> with /usr/bin/at, but not as straightforward as I would expect).
> 
> Any suggestions?

You could try manually making a "zfs-auto-snap_daily-blahblah"
snapshot at the desired time, and then restarting the service, which
should then follow accordingly for subsequent days.
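Something along these lines - the snapshot name format and SMF FMRI
are from memory, so check them against what time-slider actually
creates on your box:

   # zfs snapshot -r rpool@zfs-auto-snap_daily-2011-07-28-02h00
   # svcadm restart svc:/system/filesystem/zfs/auto-snapshot:daily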

--
Dan.



Re: [zfs-discuss] ZFS resilvering loop from hell

2011-07-27 Thread Daniel Carosone
On Wed, Jul 27, 2011 at 08:00:43PM -0500, Bob Friesenhahn wrote:
> On Tue, 26 Jul 2011, Charles Stephens wrote:
>
>> I'm on S11E 150.0.1.9 and I replaced one of the drives and the pool  
>> seems to be stuck in a resilvering loop.  I performed a 'zpool clear' 
>> and 'zpool scrub' and just complains that the drives I didn't replace 
>> are degraded because of too many errors.  Oddly the replaced drive is 
>> reported as being fine.  The CKSUM counts get up to about 108 or so 
>> when the resilver is completed.
>
> This sort of problem (failing disks during a recovery) is a good reason 
> not to use raidz1 in modern systems.  Use raidz2 or raidz3.
>
> Assuming that the system is good and it is really a problem with the  
> disks experiencing bad reads, it seems that the only path forward is to 
> wait for the resilver to complete or see if creating a new pool from a 
> recent backup is better.

Indeed, but that assumption may be too strong.  If you're getting
errors across all the members, you are likely to have some other
systemic problem, such as: 
 * bad ram / cpu / motherboard
 * too-weak power supply
 * faulty disk controller / driver

Had you scrubbed the pool regularly before the replacement? Were those
clean?  If not, the possibility is that the scrubs are telling you
that bad data was written originally, especially if it's repeatable on
the same files.  If it hits different counts and files each scrub, you
may be seeing corruption on reads, due to the same causes. Or you may
have both.

--
Dan.




Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?

2011-07-27 Thread Daniel Carosone
> > "Processing" the request just means flagging the blocks, though, right?
> > And the actual benefits only acrue if the garbage collection / block
> > reshuffling background tasks get a chance to run?
> 
> I think that's right. TRIM just gives hints to the garbage collector that
> sectors are no longer in use. When the GC runs, it can find more flash
> blocks more easily that aren't used or combine several mostly-empty
> blocks and erase or otherwise free them for reuse later.

Absent TRIM support, there's another way to do this, too.  It's pretty
easy to dd /dev/zero to a file now and then.  Just make sure zfs
doesn't prevent these being written to the SSD (compress and dedup are
off).  I have a separate "fill" dataset for this purpose, to avoid
keeping these zeros in auto-snapshots too.

At least the sandforce controllers recognise this, via their internal
compression and dedup, and know that the blocks that have to be
presented to the host as full of zeros can be reclaimed internally and
reassigned to the spare pool. 

As long as they have enough blocks to work with, it's fine.
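A minimal sketch of that, with invented names - size the zero file to
however much spare area you want to hand back:

   # zfs create -o compression=off -o dedup=off \
       -o com.sun:auto-snapshot=false tank/fill
   # dd if=/dev/zero of=/tank/fill/zeros bs=1M count=8192
   # sync                        # let the zeros actually reach the flash
   # rm /tank/fill/zeros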

--
Dan.





Re: [zfs-discuss] revisiting aclmode options

2011-07-19 Thread Daniel Carosone
On Mon, Jul 18, 2011 at 06:44:25PM -0700, Paul B. Henson wrote:
> It would be really  
> nice if the aclmode could be specified on a per object level rather than  
> a per file system level, but that would be considerably more difficult  
> to achieve 8-/.

If there were an acl permission for "set legacy permission bits",
as distinct from write_acl, that could be set to "deny" at whatever
granularity you needed...  

--
Dan.



Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)

2011-07-14 Thread Daniel Carosone
On Fri, Jul 15, 2011 at 07:56:25AM +0400, Jim Klimov wrote:
> 2011-07-15 6:21, Daniel Carosone wrote:
>> um, this is what xargs -P is for ...
>
> Thanks for the hint. True, I don't often use xargs.
>
> However from the man pages, I don't see a "-P" option
> on OpenSolaris boxes of different releases, and there
> is only a "-p" (prompt) mode. I am not eager to enter
> "yes" 40 times ;)

you want the /usr/gnu/{bin,share/man} version, at least in this case.
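i.e. something like this in place of the helper script (path invented,
tune -P and -n to taste):

   # /usr/gnu/bin/find /export/OLD/PATH/TO/REMOVE -type f -print0 | \
       /usr/gnu/bin/xargs -0 -P 16 -n 100 rm -f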

--
Dan.




Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)

2011-07-14 Thread Daniel Carosone
um, this is what xargs -P is for ...

--
Dan.

On Thu, Jul 14, 2011 at 07:24:52PM +0400, Jim Klimov wrote:
> 2011-07-14 15:48, Frank Van Damme wrote:
>> It seems counter-intuitive - you'd say: concurrent disk access makes  
>> things only slower - , but it turns out to be true. I'm deleting a  
>> dozen times faster than before. How completely ridiculous. Thank you 
>> :-)
>
> Well, look at it this way: it is not only about singular disk accesses
> (i.e. unlike other FSes, you do not in-place modify a directory entry),
> with ZFS COW it is about rewriting a tree of block pointers, with any
> new writes going into free (unreferenced ATM) disk blocks anyway.
>
> So by hoarding writes you have a chance to reduce mechanical
> IOPS required for your tasks. Until you run out of RAM ;)
>
> Just in case it helps, to quickly fire up removals of the specific  
> directory
> after yet another reboot of the box, and not overwhelm it with hundreds
> of thousands queued "rm"processes either, I made this script as /bin/RM:
>
> ===
> #!/bin/sh
>
> SLEEP=10
> [ x"$1" != x ] && SLEEP=$1
>
> A=0
> # To rm small files: find ... -size -10
> find /export/OLD/PATH/TO/REMOVE -type f | while read LINE; do
>   du -hs "$LINE"
>   rm -f "$LINE" &
>   A=$(($A+1))
>   [ "$A" -ge 100 ] && ( date; while [ `ps -ef | grep -wc rm` -gt 50 ]; do
>  echo "Sleep $SLEEP..."; ps -ef | grep -wc rm ; sleep $SLEEP; ps -ef 
> | grep -wc rm;
>   done
>   date ) && A="`ps -ef | grep -wc rm`"
> done ; date
> ===
>
> Essentially, after firing up 100 "rm attempts" it waits for the "rm"
> process count to go below 50, then goes on. Sizing may vary
> between systems, phase of the moon and computer's attitude.
> Sometimes I had 700 processes stacked and processed quickly.
> Sometimes it hung on 50...
>
> HTH,
> //Jim
>




Re: [zfs-discuss] Changed to AHCI, can not access disk???

2011-07-06 Thread Daniel Carosone
On Tue, Jul 05, 2011 at 09:03:50AM -0400, Edward Ned Harvey wrote:
> > I suspect the problem is because I changed to AHCI. 
> 
> This is normal, no matter what OS you have.  It's the hardware.

That is simply false.

> If you start using a disk in non-AHCI mode, you must always continue to use
> it in non-AHCI mode.  If you switch, it will make the old data inaccessible.

Utterly not true.

Even in this case, the problem is not access to the data. The problem
is with booting from the device / mounting as root, because solaris
(and windows) embed device name/path information into configuration
data critical for booting. 

Even these OS's will be able to access data disks via either
controller mode/type if the boot time issue is removed.

Other operating systems that don't depend on embedded device path
information in the boot sequence can switch easily between IDE/AHCI
modes for boot disks, or indeed between other controller types
(different scsi controllers, booting native vs as a VM, moving disks
between boxes, etc).

The fact that Solaris fails to tolerate this is a bug.  In addition
to other problems, this bug manifests when trying to use removable usb
sticks as boot media/rpool, because usb device names are constructed
based on port and topology in some cases.  I have also been bitten by
it in the past when rearranging controllers into different slots.

> Just change it back in BIOS and you'll have your data back.  Then backup
> your data, change to AHCI mode (because it's supposed to perform better) and
> restore your data.

In this case, the recorded device path has probably been mangled, and
will need to be repaired before the pool is bootable again.  The pool
should, however, be accessible as data from an independent boot.

--
Dan.



Re: [zfs-discuss] 512b vs 4K sectors

2011-07-04 Thread Daniel Carosone
On Mon, Jul 04, 2011 at 01:11:09PM -0700, Richard Elling wrote:
> Thomas,
> 
> On Jul 4, 2011, at 9:53 AM, Thomas Nau wrote:
> This is a roundabout way to do this, but it can be done without changing any 
> source :-)
> With the Nexenta or Solaris iSCSI target, you can set the blocksize for a LUN.
> When you create the pool for the first time, make one of the devices be an 
> iSCSI
> LUN with a 4KB block size. This will cause the top-level vdev to use 
> ashift=12.
> You can then replace the iSCSI LUN with a different device using "zpool 
> replace"

Thomas, 

I wrote a little more detailed recipe for this a month or two ago, look in the 
archives.
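The very short version, from memory and with all names invented (the
archived post has the details, including the iSCSI plumbing elided
here):

   # zfs create -V 1g rpool/tmp4k
   # stmfadm create-lu -p blk=4096 /dev/zvol/rdsk/rpool/tmp4k
     ... export that LU over iSCSI and log in from the initiator ...
   # zpool create tank mirror <iscsi-lun> c7t1d0
   # zpool replace tank <iscsi-lun> c7t0d0

The top-level vdev keeps ashift=12 from creation onwards, and the
temporary LU can then be torn down.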

--
Dan.



Re: [zfs-discuss] about leaving zpools exported for future use

2011-07-03 Thread Daniel Carosone
On Sun, Jul 03, 2011 at 05:44:34PM -0500, Harry Putnam wrote:
> My zfs machine has croaked to the point that it just quits after some
> 10 15 minutes of uptime.  No interesting logs or messages what so
> ever.  At least not that I've found.  It just quietly quits.
> 
> I'm not interested in dinking around with this setup... its well
> ready for upgrade. 
> 
> What I'd like to do is during one or two of those 15 minutes of
> uptime, set things up so that when I do get the replace setup, I can
> just attache those disks and import the pools.
> 
> It may be complicated by rpool which also has some keeper data stored
> on it.

By all means export a data pool if you can/like.

You won't be able to export rpool while booted off it, but that won't
matter.  Attach it to your new machines, and import -f with a new name
(so as not to conflict with the new rpool).  It may be easier to do
this attach after boot, to avoid potential confusion by the bootloader
as to which rpool to use. 

If hotplugging is inconvenient, you could boot the old machine from CD,
and import -f the rpool with a new name, then export it, before moving
the disks. This will then be the autodetected name on the new machine.
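Roughly, with invented names:

   # zpool export tank                  # data pool, while the old box is still up
   # zpool import -f rpool oldrpool     # on the new box, or from the CD boot
   # zpool export oldrpool              # only for the CD route, before moving disks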

--
Dan.




Re: [zfs-discuss] 700GB gone?

2011-06-30 Thread Daniel Carosone
On Thu, Jun 30, 2011 at 11:40:53PM +0100, Andrew Gabriel wrote:
>  On 06/30/11 08:50 PM, Orvar Korvar wrote:
>> I have a 1.5TB disk that has several partitions. One of them is 900GB. Now I 
>> can only see 300GB. Where is the rest? Is there a command I can do to reach 
>> the rest of the data? Will scrub help?
>
> Not much to go on - no one can answer this.
>
> How did you go about partitioning the disk?
> What does the fdisk partitioning look like (if its x86)?
> What does the VToC slice layout look like?
> What are you using each partition and slice for?
> What tells you that you can only see 300GB?

Are you using 32-bit or 64-bit solaris? 

--
Dan.



[zfs-discuss] SandForce SSD internal dedup

2011-06-29 Thread Daniel Carosone
This article raises the concern that SSD controllers (in particular
SandForce) do internal dedup, and in particular that this could defeat
ditto-block style replication of critical metadata as done by
filesystems including ZFS.

 http://storagemojo.com/2011/06/27/de-dup-too-much-of-good-thing/

Along with discussion of risk evaluation, it also suggests that
filesystems could vary each copy in some way (internal serial / nonce)
to defeat the mechanism.

Comments and suggestions, aside from the risk evaluation piece?

This doesn't appear to mean that zfs dedup is of no use on such
drives, since they still present the same number of externally
accessible blocks.  The internal dedup evidently allows them to
maintain more spare/free sectors for g/c and performance reasons.

PS. I intend to ignore the risk discussion because in most cases zfs
users concerned about these and many other related risks will mitigate
them using multiple independent devices.

--
Dan.



Re: [zfs-discuss] Zpool metadata corruption from S10U9 to S11 express

2011-06-22 Thread Daniel Carosone
On Wed, Jun 22, 2011 at 12:49:27PM -0700, David W. Smith wrote:
> # /home/dws# zpool import
>   pool: tank
> id: 13155614069147461689
>  state: FAULTED
> status: The pool metadata is corrupted.
> action: The pool cannot be imported due to damaged devices or data.
>see: http://www.sun.com/msg/ZFS-8000-72
> config:
> 
> tank FAULTED  corrupted data
> logs
>   mirror-6   ONLINE
> c9t57d0  ONLINE
> c9t58d0  ONLINE
>   mirror-7   ONLINE
> c9t59d0  ONLINE
> c9t60d0  ONLINE
> 
> Is there something else I can do to see what is wrong.

Can you tell us more about the setup, in particular the drivers and
hardware on the path?  There may be labelling, block size, offset or
even bad drivers or other issues getting in the way, preventing ZFS
from doing what should otherwise be expected to work.   Was there
something else in the storage stack on the old OS, like a different
volume manager or some multipathing?

Can you show us the zfs labels with zdb -l /dev/foo ?

Does import -F get any further?

> Original attempt when specifying the name resulted in:
> 
> # /home/dws# zpool import tank
> cannot import 'tank': I/O error

Some kind of underlying driver problem odour here.

--
Dan.




Re: [zfs-discuss] Cannot format 2.5TB ext disk (EFI)

2011-06-22 Thread Daniel Carosone
On Wed, Jun 22, 2011 at 02:01:12PM -0700, Larry Liu wrote:
> You can try
>
> #fdisk /dev/rdsk/c5d0t0p0

Or just dd /dev/zero over the raw device, eject and start from clean.
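e.g., using the same device node as above - zeroing the whole thing is
the thorough version, or just the start if you're impatient (bearing in
mind that GPT/EFI and ZFS both keep backup labels at the end of the
device too):

   # dd if=/dev/zero of=/dev/rdsk/c5d0t0p0 bs=1M count=256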

--
Dan.




Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-20 Thread Daniel Carosone
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess).  I observe that data from a fast, 
> modern 
> HDD, for  1-10 threads (outstanding I/Os) the IOPS ranges from 309 to 333 
> IOPS. 
> But as we add threads, the average response time increases from 2.3ms to 
> 137ms.

Interesting.  What happens to total throughput, since that's the
expected tradeoff against latency here.  I might guess that in your
tests with a constant io size, it's linear with IOPS - but I wonder if
that remains so for larger IO or with mixed sizes?

> Since the whole idea is to get lower response time, and we know disks are not 
> simple queues so there is no direct IOPS to response time relationship, maybe 
> it
> is simply better to limit the number of outstanding I/Os.

I also wonder if we're seeing a form of "bufferbloat" here in these
latencies.

As I wrote in another post yesterday, remember that you're not
counting actual outstanding IO's here, because the write IO's are
being acknowledged immediately and tracked internally. The disk may
therefore be getting itself into a state where either the buffer/queue
is effectively full, or the number of requests it is tracking
internally becomes inefficient (as well as the head-thrashing). 

Even before you get to that state and writes start slowing down too,
your averages are skewed by write cache. All the writes are fast,
while a longer queue exposes reads to contention with eachother, as
well as to a much wider window of writes.  Can you look at the average
response time for just the reads, even amongst a mixed r/w workload?
Perhaps some statistic other than the average, too.

Can you repeat the tests with write-cache disabled, so you're more
accurately exposing the controller's actual workload and backlog?
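(On Solaris that's per-disk via the expert mode of format, from memory
and driver-permitting:

   # format -e
     ... select the disk ...
   format> cache
   cache> write_cache
   write_cache> disable

with "display" in the same menu showing the current state.)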

I hypothesise that this will avoid those latencies getting so
ridiculously out of control, and potentially also show better
(relative) results for higher concurrency counts.  Alternately, it
will show that your disk firmware really is horrible at managing
concurrency even for small values :)

Whether it shows better absolute results than a shorter queue + write
cache is an entirely different question.  The write cache will
certainly make things faster in the common case, which is another way
of saying that your lower-bound average latencies are artificially low
and making the degradation look worse.

> > This comment seems to indicate that the drive queues up a whole bunch of
> > requests, and since the queue is large, each individual response time has
> > become large.  It's not that physical actual performance has degraded with
> > the cache enabled, it's that the queue has become long.  For async writes,
> > you don't really care how long the queue is, but if you have a mixture of
> > async writes and occasional sync writes...  Then the queue gets long, and
> > when you sync, the sync operation will take a long time to complete.  You
> > might actually benefit by disabling the disk cache.
> > 
> > Richard, have I gotten the gist of what you're saying?
> 
> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

And, in particlar, better latency for specific (read) requests that zfs
prioritises; these are often the ones that contribute most to a system
feeling unresponsive.  If this prioritisation is lost once passed to
the disk, both because the disk doesn't have a priority mechanism and
because it's contending with the deferred cost of previous writes,
then you'll get better latency for the requests you care most about
with a shorter queue.

--
Dan.






Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-19 Thread Daniel Carosone
On Fri, Jun 17, 2011 at 07:41:41AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Thursday, June 16, 2011 11:05 PM
> > 
> > the [sata] channel is idle, blocked on command completion, while
> > the heads seek.
> 
> I'm interested in proving this point.  Because I believe it's false.
> 
> Just hand waving for the moment ... Presenting the alternative viewpoint
> that I think is correct...
> 
> All drives, regardless of whether or not their disk cache or buffer is
> enabled, support PIO and DMA.  This means no matter the state of the cache
> or buffer, the bus will deliver information to/from the memory of the disk
> as fast as possible, and the disk will optimize the visible workload to the
> best of its ability, and the disk will report back an interrupt when each
> operation is completed out-of-order.

Yes, up to that last "out-of-order". Without NCQ, requests are
in-order and wait for completion with the channel idle. 

> It would be stupid for a disk to hog the bus in an idle state.

Yes, but remember that ATA was designed originally to be stupid
(simple).  The complexity has crept in over time.  Understanding the
history and development order is important here.

So, for older ATA disks, commands would transfer relatively quickly
over the channel, which would then remain idle until a completion
interrupt. Systems got faster.  Write cache was added to make writes
"complete" faster, read cache (with prefetch) was added in the hope
of satisfying read requests faster and freeing up the channel. Systems
got faster. NCQ was added (rather, TCQ was reinvented and crippled) to
try and get better concurrency. NCQ supports only a few outstanding
ops, in part because write-cache was by then established practice
(turning it off would adversely impact benchmarks, especially for
software that couldn't take advantage of concurrency).

So, today with NCQ, writes are again essentially in-order (to cache)
until the cache is full and request start blocking.  NCQ may offer
some benefit to concurrent reads, but again of little value if the cache
is full.

Furthermore, the disk controllers may not be doing such a great job
when given concurrent requests anyway, as Richard mentions elsewhere.
Will reply to those points a little later.

--
Dan.



Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-16 Thread Daniel Carosone
On Thu, Jun 16, 2011 at 10:40:25PM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Thursday, June 16, 2011 10:27 PM
> > 
> > Is it still the case, as it once was, that allocating anything other
> > than whole disks as vdevs forces NCQ / write cache off on the drive
> > (either or both, forget which, guess write cache)?
> 
> I will only say, that regardless of whether or not that is or ever was true,
> I believe it's entirely irrelevant.  Because your system performs read and
> write caching and buffering in ram, the tiny little ram on the disk can't
> possibly contribute anything.

I disagree.  It can vastly help improve the IOPS of the disk and keep
the channel open for more transactions while one is in progress.
Otherwise, the channel is idle, blocked on command completion, while
the heads seek. 

> When it comes to reads:  The OS does readahead more intelligently than the
> disk could ever hope.  Hardware readahead is useless.

Little argument here, although the disk is aware of physical geometry
and may well read an entire track. 

> When it comes to writes:  Categorize as either async or sync.
> 
> When it comes to async writes:  The OS will buffer and optimize, and the
> applications have long since marched onward before the disk even sees the
> data.  It's irrelevant how much time has elapsed before the disk finally
> commits to platter.

To the application in the short term, but not to the system. TXG closes
have to wait for that, and applications have to wait for those to
close so the next can open and accept new writes.

> When it comes to sync writes:  The write will not be completed, and the
> application will block, until all the buffers have been flushed.  Both ram
> and disk buffer.  So neither the ram nor disk buffer is able to help you.

Yes. With write cache on in the drive, and especially with multiple
outstanding commands, the async writes can all be streamed quickly to
the disk. Then a cache sync can be issued, before the sync/FUA writes
to close the txg are done.

Without write cache, each async write (though deferred and perhaps
coalesced) is synchronous to platters.  This adds latency and
decreases IOPS, impacting other operations (reads) as well.
Please measure it, you will find this impact significant and even
perhaps drastic for some quite realistic workloads.

All this before the disk write cache has any chance to provide
additional benefit by seek optimisations - ie, regardless of whether
it is successful or not in doing so.  

--
Dan.



[zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-16 Thread Daniel Carosone
On Thu, Jun 16, 2011 at 09:15:44PM -0400, Edward Ned Harvey wrote:
> My personal preference, assuming 4 disks, since the OS is mostly reads and
> only a little bit of writes, is to create a 4-way mirrored 100G partition
> for the OS, and the remaining 900G of each disk (or whatever) becomes either
> a stripe of mirrors or raidz, as appropriate in your case, for the
> storagepool.

Is it still the case, as it once was, that allocating anything other
than whole disks as vdevs forces NCQ / write cache off on the drive
(either or both, forget which, guess write cache)? 

If so, can this be forced back on somehow to regain performance when
known to be safe?  

I think the original assumption was that zfs-in-a-partition likely
implied the disk was shared with ufs, rather than another async-safe
pool. 

--
Dan.





Re: [zfs-discuss] # disks per vdev

2011-06-16 Thread Daniel Carosone
On Thu, Jun 16, 2011 at 07:06:48PM +0200, Roy Sigurd Karlsbakk wrote:
> > I have decided to bite the bullet and change to 2TB disks now rather
> > than go through all the effort using 1TB disks and then maybe changing
> > in 6-12 months time or whatever. The price difference between 1TB and
> > 2TB disks is marginal and I can always re-sell my 6x 1TB disks.
> > 
> > I think I have also narrowed down the raid config to these 4;
> > 
> > 2x 7 disk raid-z2 with 1 hot spare - 20TB usable
> > 3x 5 disk raid-z2 with 0 hot spare - 18TB usable
> > 2x 6 disk raid-z2 with 2 hot spares - 16TB usable
> > 
> > with option 1 probably being preferred at the moment.
> 
> I would choose option 1. I have similar configurations in
> production. A hot spare can be very good when a drive dies while
> you're not watching. 

I would probably also go for option 1, with some additional
considerations:

1 - are the 2 vdevs in the same pool, or two separate pools?

If the majority of your bulk data can be balanced manually or by
application software across 2 filesystems/pools, this offers you the
opportunity to replicate smaller more critical data between pools (and
controllers).  This offers better protection against whole-pool
problems (bugs, fat fingers).  With careful arrangement, you could
even have one pool spun down most of the time. 

You mentioned something early on that implied this kind of thinking,
but it seems to have gone by the wayside since.

If you can, I would recommend 2 pools if you go for 2
vdevs. Conversely, in one pool, you might as well go for 15xZ3 since
even this will likely cover performance needs (and see #4).

2 - disk purchase schedule

With 2 vdevs, regardless of 1 or 2 pools, you could defer purchase of
half the 2Tb drives.  With 2 pools, you can use the 6x1Tb and change
that later to 7x with the next purchase, with some juggling of
data. You might be best to buy 1 more 1Tb to get the shape right at 
the start for in-place upgrades, and in a single pool this is
essentially mandatory.

By the time you need more space to buy the second tranche of drives,
3+Tb drives may be the better option.

3 - spare temperature

for levels raidz2 and better, you might be happier with a warm spare
and manual replacement, compared to overly-aggressive automated
replacement if there is a cascade of errors.  See recent threads.

You may also consider a cold spare, leaving a drive bay free for
disks-as-backup-tapes swapping.  If you replace the 1Tb's now,
repurpose them for this rather than reselling.  

Whatever happens, if you have a mix of drive sizes, your spare should
be of the larger size. Sorry for stating the obvious! :-)

4 - the 16th port

Can you find somewhere inside the case for an SSD as L2ARC on your
last port?  Could be very worthwhile for some of your other data and
metadata (less so the movies).

--
Dan.



Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow

2011-06-15 Thread Daniel Carosone
On Wed, Jun 15, 2011 at 07:19:05PM +0200, Roy Sigurd Karlsbakk wrote:
> 
> Dedup is known to require a LOT of memory and/or L2ARC, and 24GB isn't really 
> much with 34TBs of data.

The fact that your second system lacks the l2arc cache device is absolutely 
your prime suspect.

--
Dan.



Re: [zfs-discuss] L2ARC and poor read performance

2011-06-08 Thread Daniel Carosone
On Wed, Jun 08, 2011 at 11:44:16AM -0700, Marty Scholes wrote:
> And I looked in the source.  My C is a little rusty, yet it appears
> that prefetch items are not stored in L2ARC by default.  Prefetches
> will satisfy a good portion of sequential reads but won't go to
> L2ARC.  

Won't go to L2ARC while they're still speculative reads, maybe.
Once they're actually used by the app to satisfy a good portion of the
actual reads, they'll have hit stats and will.

I suspect the problem is the threshold for l2arc writes.  Sequential
reads can be much faster than this rate, meaning it can take a lot of
effort/time to fill.

You could test by doing slow sequential reads, and see if the l2arc
fills any more for the same reads spread over a longer time.
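(The knob for that, if you want to experiment, is the l2arc feed rate -
from memory l2arc_write_max / l2arc_write_boost, e.g. 64MB per feed
interval instead of the default 8MB, either live via mdb or
persistently in /etc/system:

   # echo 'l2arc_write_max/Z 0x4000000' | mdb -kw       (live)

   set zfs:l2arc_write_max = 0x4000000                  (/etc/system)

cranking it up also costs more SSD wear, so treat it as an experiment.)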

--
Dan.



Re: [zfs-discuss] Metadata (DDT) Cache Bias

2011-06-05 Thread Daniel Carosone
On Sun, Jun 05, 2011 at 01:26:20PM -0500, Tim Cook wrote:
> I'd go with the option of allowing both a weighted and a forced option.  I
> agree though, if you do primarycache=metadata, the system should still
> attempt to cache userdata if there is additional space remaining.

I think I disagree.  Remember that this is a per-dataset
attribute/option.  One of the reasons to set it on a particular
dataset is precisely to leave room in the cache for other datasets,
because I know something about the access pattern, desired service
level, or underlying storage capability. 

For example, for a pool on SSD, I will set secondarycache=none (since
l2arc offers no benefit, only cost in overhead and ssd wear).  I may
also set primarycache=metadata since a data miss is
still pretty fast, and I will get more value using my l1/l2 cache
resources for other datasets on slower media.
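(i.e., per dataset, something like this, with invented names:

   # zfs set secondarycache=none fastpool/stuff
   # zfs set primarycache=metadata fastpool/stuff )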

This is starting to point out that these tunables are a blunt
instrument.  Perhaps what may be useful is some kind of service-level
priority attribute (default 0, values +/- small ints).  This could be
used in a number of places, including when deciding which of two
otherwise-equal pages to evict/demote in cache.

That's effectively what happens anyway since the blocks do go into arc
while in use, they're just freed immediately after.

--
Dan.



Re: [zfs-discuss] Metadata (DDT) Cache Bias

2011-06-03 Thread Daniel Carosone
Edward Ned Harvey writes:
>  > If you consider the extreme bias...  If the system would never give up
>  > metadata in cache until all the cached data were gone...  Then it would be
>  > similar to the current primarycache=metadata, except that the system would
>  > be willing to cache data too, whenever there was available cache otherwise
>  > going to waste.

I like this, and it could be another value for the same property:
metabias, metadata-bias, prefer-metadata, whatever. 

On Fri, Jun 03, 2011 at 06:25:45AM -0700, Roch wrote:
> Interesting. Now consider this :
> 
> We have an indirect block in memory (those are 16K
> referencing 128 individual data blocks). We also have an
> unrelated data block say 16K. Neither are currently being
> reference nor have they been for a long time (otherwise they
> move up to the head of the cache lists).  They reach the
> tail of the primary cache together. I have room for one of
> them in the secondary cache. 
> 
> Absent other information, do we think that the indirect
> block is more valuable than the data block ? At first I also
> wanted to say that metadata should be favored. Now I can't come
> up with an argument to favor either one. 

The effectiveness of a cache depends on the likelihood of a hit
against a cached value, vs the cost of keeping it.

Including data that may allow us to predict this future likelihood
based on past access patterns can improve this immensely. This is what
the arc algorithm does quite well.  

Absent this information, we assume the probability of future access to
all data blocks not currently in ARC is approximately equal.  The
indirect metadata block is therefore 127x as likely to be needed as
the one data block, since if any of the data blocks is needed, so will
the indirect block to find it.

> Therefore I think we need to include more information than just data
> vs metadata in the decision process.

If we have the information to hand, it may help - but we don't. 

The only thing I can think of we may have is whether either block was
ever on the "frequent" list, or only on the "recent" list, to catch
the single-pass sequential access pattern and make it the lower
priority for cache residence.

I don't know how feasible it is to check whether any of the blocks
referenced by the indirect block are themselves in arc, nor what that
might imply about the future likelihood of further accesses to other
blocks indirectly referenced by this one.

> Instant Poll : Yes/No ?

Yes for this as an RFE, or at least as a q&d implementation to measure
potential benefit.

--
Dan.




Re: [zfs-discuss] Is another drive worth anything? [Summary]

2011-06-02 Thread Daniel Carosone
On Thu, Jun 02, 2011 at 09:59:39PM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Thursday, June 02, 2011 9:03 PM
> > 
> > Separately, with only 4G of RAM, I think an L2ARC is likely about a
> > wash, since L2ARC entries also consume RAM.
> 
> True the L2ARC requires some ARC consumption to support it, but for typical
> user data, it's a huge multiplier... The ARC consumption is static per entry
> (say, 176 bytes, depending on your platform) but a typical payload for user
> data would be whatever your average blocksize is ... 40K, 127K, or something
> similar probably.

Yes, but that's not the whole story.  In order for the L2ARC to be an
effective performance boost, it itself needs to be large enough to
save enough hits on the disks.  Further, the penalty of these hits is
more in IOPS than size.  Both these tend to reduce or nullify the
(space) scaling factor, other than getting the very largest blocks out
of primary cache.

Adding read iops with a third submirror, at no cost, is the way to go
(or at least the way to start) in this case.  

--
Dan.





Re: [zfs-discuss] Is another drive worth anything? [Summary]

2011-06-02 Thread Daniel Carosone
Thanks, I like this summary format and the effort it took
to produce seems well-spent. 

On Thu, Jun 02, 2011 at 08:50:58PM -0400, Edward Ned Harvey wrote:
> > but I figured spending 500G on ZIL
> > would be unwise. 
> 
> You couldn't possibly ever use 500G of ZIL, because the ZIL is required to
> be flushed to disk at least once every 5sec to 30sec (depending on which
> build you're running.)  Even if you have a 4G dedicated log device, that's
> more than plenty for most purposes.

It is also limited to at most half of physical memory, as I
recall. Remember that SZIL is nonvolatile backing store for in-memory
write structures that have to remain until txg close anyway.

Separately, with only 4G of RAM, I think an L2ARC is likely about a
wash, since L2ARC entries also consume RAM.

The extra details provided just confirm that the 3-way-mirror is the
best tweak for this existing system with no cost.

--
Dan.



Re: [zfs-discuss] NFS acl inherit problem

2011-06-01 Thread Daniel Carosone
On Wed, Jun 01, 2011 at 07:42:24AM -0600, Mark Shellenbaum wrote:
>
> Looks like the linux client did a chmod(2) after creating the file.

I bet this is it, and this seems to have been ignored in the later thread.

> what happens when you create a file locally in that directory on the  
> solaris system?

No, what happens when you touch(1) the file from the client in
question without the rest of the application behaviour that follows,
and then what happens when you chmod(1) it? 

Can you observe the client application behaviour, via truss or
equivalent?

--
Dan.





Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Daniel Carosone
On Wed, Jun 01, 2011 at 05:45:14AM +0400, Jim Klimov wrote:
> Also, in a mirroring scenario is there any good reason to keep a warm spare
> instead of making a three-way mirror right away (beside energy saving)? 
> Rebuild times and non-redundant windows can be decreased considerably ;)

Perhaps where the spare may be used for any of several pools,
whichever has a failure first. Not relevant to this case..

In this case, if the drive is warm, it might as well be live.

My point was that, even as a cold spare it is worth something, and
that the sata port may be worth more, since the OP is more interested
in performance than extra redundancy.

--
Dan.





Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE

2011-05-31 Thread Daniel Carosone
On Tue, May 31, 2011 at 05:32:47PM +0100, Matt Keenan wrote:
> Jim,
>
> Thanks for the response, I've nearly got it working, coming up against a  
> hostid issue.
>
> Here's the steps I'm going through :
>
> - At end of auto-install, on the client just installed before I manually  
> reboot I do the following :
>   $ beadm mount solaris /a
>   $ zpool export data
>   $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
>   $ beadm umount solaris
>   $ reboot
>
> - Before rebooting I check /a/etc/zfs/zpool.cache and it does contain  
> references to "data".
>
> - On reboot, the automatic import of data is attempted however following  
> message is displayed :
>
>  WARNING: pool 'data' could not be loaded as it was last accessed by  
> another system (host: ai-client hostid: 0x87a4a4). See  
> http://www.sun.com/msg/ZFS-8000-EY.
>
> - Host id on booted client is :
>   $ hostid
>   000c32eb
>
> As I don't control the import command on boot i cannot simply add a "-f"  
> to force the import, any ideas on what else I can do here ?

Can you simply export the pool again before rebooting, but only after
/a (and with it the cachefile) has been unmounted?
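
In other words, something like this ordering (untested, but it's what
I have in mind):

  $ beadm mount solaris /a
  $ zpool export data
  $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
  $ beadm umount solaris
  $ zpool export data     (only once /a, and so the cachefile, is out of the way)
  $ reboot

so that the pool is cleanly exported and the hostid check shouldn't
trip on the next boot.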
 
--
Dan.

pgp7IC9jTUesC.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Daniel Carosone
On Wed, Jun 01, 2011 at 10:16:28AM +1000, Daniel Carosone wrote:
> On Tue, May 31, 2011 at 06:57:53PM -0400, Edward Ned Harvey wrote:
> > If you make it a 3-way mirror, your write performance will be unaffected,
> > but your read performance will increase 50% over a 2-way mirror.  All 3
> > drives can read different data simultaneously for the net effect of 3x a
> > single disk read performance.
> 
> This would be my recommendation too, but for the sake of completeness,
> there are other options that may provide better performance
> improvement (at a cost) depending on your needs. 

In fact, I should state even more clearly: do this, since there is
very little reason not to.  Measure the benefit.  Move on to the other
things if the benefit is not enough. When doing so, consider what kind
of benefit you're looking for.

> Namely, leave the third drive on the shelf as a cold spare, and use
> the third sata connector for an ssd, as L2ARC, ZIL or even possibly
> both (which will affect selection of which device to use).
> 
> L2ARC is likely to improve read latency (on average) even more than a
> third submirror.  ZIL will be unmirrored, but may improve writes at an
> acceptable risk for development system.  If this risk is acceptable,
> you may wish to consider whether setting sync=disabled is also
> acceptable at least for certain datasets.
> 
> Finally, if you're considering spending money, can you increase the
> RAM instead?  If so, do that first.
> 
> --
> Dan.


> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



pgpHRSk23bsVr.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Daniel Carosone
On Tue, May 31, 2011 at 06:57:53PM -0400, Edward Ned Harvey wrote:
> If you make it a 3-way mirror, your write performance will be unaffected,
> but your read performance will increase 50% over a 2-way mirror.  All 3
> drives can read different data simultaneously for the net effect of 3x a
> single disk read performance.

This would be my recommendation too, but for the sake of completeness,
there are other options that may provide better performance
improvement (at a cost) depending on your needs. 

Namely, leave the third drive on the shelf as a cold spare, and use
the third sata connector for an ssd, as L2ARC, ZIL or even possibly
both (which will affect selection of which device to use).

L2ARC is likely to improve read latency (on average) even more than a
third submirror.  ZIL will be unmirrored, but may improve writes at an
acceptable risk for development system.  If this risk is acceptable,
you may wish to consider whether setting sync=disabled is also
acceptable at least for certain datasets.
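
For concreteness, and with entirely made-up pool, device and dataset
names, that might look like:

  # zpool add devpool log c3t0d0s0       (small slice of the SSD as slog)
  # zpool add devpool cache c3t0d0s1     (the rest of it as L2ARC)
  # zfs set sync=disabled devpool/build  (per-dataset, only where the risk is ok)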

Finally, if you're considering spending money, can you increase the
RAM instead?  If so, do that first.

--
Dan.

pgpt1w2jn0CGs.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] offline dedup

2011-05-29 Thread Daniel Carosone
On Fri, May 27, 2011 at 07:28:06AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Thursday, May 26, 2011 8:19 PM
> > 
> > Once your data is dedup'ed, by whatever means, access to it is the
> > same.  You need enough memory+l2arc to indirect references via
> > DDT.  
> 
> I don't think this is true.

> The reason you need arc+l2arc to store your DDT
> is because when you perform a write, the system will need to check and see
> if that block is a duplicate of an already existing block.  If you dedup
> once, and later disable dedup, the system won't bother checking to see if
> there are duplicate blocks anymore.  So the DDT won't need to be in
> arc+l2arc.  I should say "shouldn't."

dedup'd blocks are found via the ddt, no matter how many references to
them exist.  The ddt 'owns' the actual data block, and the regular
referencing files' metadata (bp) indicates that this block is dedup'd
(indirect) rather than regular (direct). 

At least that's my somewhat-rusty recollection.
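
(If you're curious about the shape of that table, zdb can dump it,
read-only; the pool name here is just a stand-in:

  # zdb -DD tank

gives a histogram of DDT entries by reference count.)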

--
Dan.


pgpxBzmPuAMXW.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] pool history length

2011-05-26 Thread Daniel Carosone
On Wed, May 25, 2011 at 11:54:16PM -0700, Matthew Ahrens wrote:
> >
> > On Thu, May 12, 2011 at 08:52:04PM +1000, Daniel Carosone wrote:
> > > Other than the initial create, and the most
> > > recent scrub, the history only contains a sequence of auto-snapshot
> > > creations and removals. None of the other commands I'd expect, like
> > > the filesystem creations and recv, the device replacements (as I
> > > described in my other post), previous scrubs, or anything else:
> >
> 
> We keep a limited amount of history data (up to 32MB of raw data).  So if
> you have a ton of auto-snapshot activity, earlier operations may have fallen
> out of the history.  But we always keep the "zpool create" line.

That sounds right for behaviour, but not for numbers.

dan@ventus:~# zpool history geek | wc
  50100  247555 4359702

I presume the on-disk format is at least a little more compact than this, too.

--
Dan.



pgp4eGo6sFwuz.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DDT sync?

2011-05-26 Thread Daniel Carosone
On Thu, May 26, 2011 at 10:25:04AM -0400, Edward Ned Harvey wrote:
> (2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a
> test file requires 10m38s to write and 2m54s to delete, but with dedup
> disabled it only requires 0m40s to write and 0m13s to delete exactly the
> same file.  So ... 13x performance degradation.  
> 
> zpool iostat is indicating the disks are fully utilized doing writes.  No
> reads.  During this time, it is clear the only bottleneck is write iops.
> There is still oodles of free mem.  I am not near arc_meta_limit, nor c_max.
> The cpu is 99% idle.  It is write iops limited.  Period.

Ok.

> Assuming DDT maintenance is the only disk write overhead that dedup adds, I
> can only conclude that with dedup enabled, and a couple million unique
> blocks in the pool, the DDT must require substantial maintenance.  In my
> case, something like 12 DDT writes for every 1 actual intended new unique
> file block write.

Where did that number come from?  Are there actually 13x as many IOs, or is
that just extrapolated from elapsed time?  It won't be anything like a
linear extrapolation, especially if the heads are thrashing.

Note that DDT blocks have their own allocation metadata to be updated
as well.

Try to get a number for actual total IOs and scaling factor.

> For the heck of it, since this machine has no other purpose at the present
> time, I plan to do two more tests.  And I'm open to suggestions if anyone
> can think of anything else useful to measure: 
> 
> (1) I'm currently using a recordsize of 512b, because the intended purpose
> of this test has been to rapidly generate a high number of new unique
> blocks.  Now just to eliminate the possibility that I'm shooting myself in
> the foot by systematically generating a worst case scenario, I'll try to
> systematically generate a best-case scenario.  I'll push the recordsize back
> up to 128k, and then repeat this test something slightly smaller than 128k.
> Say, 120k. That way there should be plenty of room available for any write
> aggregation the system may be trying to perform.
> 
> (2) For the heck of it, why not.  Disable ZIL and confirm that nothing
> changes.  (Understanding so far is that all these writes are async, and
> therefore ZIL should not be a factor.  Nice to confirm this belief.)

Good tests. See how the IO expansion factor changes with block size.

(3) Experiment with the maximum number of outstanding concurrent IOs
allowed per disk (I forget the specific tunable OTTOMH).  If the load
really is ~100% async write, this might well be a case where raising
that figure lets the disk firmware maximise throughput without causing
the latency impact that can happen otherwise (and leads to
recommendations to shorten the limit in general cases).

(4) See if changing the txg sync interval to (much) longer
helps. Multiple DDT entries can live in the same block, and a longer
interval may allow coalescing of these writes.
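
For (3) and (4), if the tunables I'm thinking of are the right ones
(zfs_vdev_max_pending and zfs_txg_timeout), a quick experiment would
look something like the following; the values are only illustrative:

  # echo zfs_vdev_max_pending/D | mdb -k        (check the current value)
  # echo zfs_vdev_max_pending/W0t35 | mdb -kw   (raise the per-vdev queue depth)
  # echo zfs_txg_timeout/W0t60 | mdb -kw        (stretch the txg sync interval)

Both take effect immediately and revert on reboot.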

--
Dan.

pgppptUIsk3CQ.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] offline dedup

2011-05-26 Thread Daniel Carosone
On Fri, May 27, 2011 at 04:32:03AM +0400, Jim Klimov wrote:
> One more rationale in this idea is that with deferred dedup
> in place, the DDT may be forced to hold only non-unique
> blocks (2+ references), and would require less storage in
> RAM, disk, L2ARC, etc. - in case we agree to remake the
> DDT on every offline-dedup operation.

This is an interesting point.  In this case, the deferred dedup pass
would be the only way for a given block hash to reach 2 or more
references and so enter the DDT, but once in there, further copies
could be deduped as normal.  This probably gives you most of the
(space) benefit for much less (memory) cost.

In reverse, pruning the DDT of single-instance blocks could be a
useful operation, for recovery from a case where you made a DDT too
large for the system.  It would still need a complex bp_rewrite.

--
Dan.

pgpTdDmsGgpfp.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DDT sync?

2011-05-26 Thread Daniel Carosone
On Thu, May 26, 2011 at 07:38:05AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:d...@geek.com.au]
> > Sent: Wednesday, May 25, 2011 10:10 PM
> > 
> > These are additional
> > iops that dedup creates, not ones that it substitutes for others in
> > roughly equal number.
> 
> Hey ZFS developers - Of course there are many ways to possibly address these
> issues.  Tweaking ARC prioritization and the like...  Has anybody considered
> the possibility of making an option to always keep DDT on a specific vdev?
> Presumably a nonvolatile mirror with very fast iops.  It is likely a lot of
> people already have cache devices present...  Perhaps a property could be
> set, which would store the DDT exclusively on that device.  Naturally there
> are implications - you would need to recommend mirroring the device, which
> you can't do, so maybe we're talking about slicing the cache device...  As I
> said, a lot of ways to address the issue.

I think l2arc persistence will just about cover that nicely, perhaps
in combination with some smarter auto-tuning for arc percentages with
large DDT. 

The writes are async, and aren't so much a problem in themselves other
than that they can get in the way of other more important things.  The
best thing you can do with them is spread them as widely as possible,
rather than bottlenecking specific devices/channels/etc. 

If you have a capacity shortfall overall, either you make the other
things faster in preference (zil, nv write cache, more arc for reads)
or you make the whole pool faster for iops (different layout, more
spindles) or you limit dedup usage within your capacity.

Another thing that can happen is that you have enough other sync
writes going on that DDT writes lag behind and delay the txg close. 
In this case, the same solutions above apply, as does judicious use of
"sync=disabled" to allow more of the writes to be async.

> Both the necessity to read & write the primary storage pool...  That's very
> hurtful.  And even with infinite ram, it's going to be unavoidable for
> things like destroying snapshots, or anything at all you ever want to do
> after a reboot.

Yeah, again, persistent l2arc helps the post-reboot case.  With
infinite ram, I'm not sure I'd have much use for dedup :)

--
Dan.

pgpqUsz7d3iF6.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] offline dedup

2011-05-26 Thread Daniel Carosone
On Thu, May 26, 2011 at 09:04:04AM -0700, Brandon High wrote:
> On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey
>  wrote:
> > Question:? Is it possible, or can it easily become possible, to periodically
> > dedup a pool instead of keeping dedup running all the time?? It is easy to
> 
> I think it's been discussed before, and the conclusion is that it
> would require bp_rewrite.

Yes, and possibly would require more of bp_rewrite than any other use
case (ie, a more complex bp_rewrite).

> Offline (or deferred) dedup certainly seems more attractive given the
> current real-time performance.

I'm not so sure.

Or, rather, if it were there and available now, I'm sure some people
would use it and prefer it for their circumstances.  Nothing comes for
free, in terms of development or operational complexity.

It seems attractive for retroactively recovering space, as a rare
operation, while maintaining snapshot integrity (and not taking
everything offline for a send|recv). But you want to be sure you can
carry the cost of that space saving.

Once your data is dedup'ed, by whatever means, access to it is the
same.  You need enough memory+l2arc to indirect references via
DDT.  If this is your performance problem today, it will not be helped
much by deferral. Reads will still have the same issue, as will the
deferred dedup write workload (with more work overall).

But I don't think it solves the core overhead of freeing deduped
blocks, and once that's no longer a problem for you, neither is the
synchronous dedup.  Plus, if you're just on the edge, that can be
deferred as noted previously, though that's not a very nice place to
be. 

I tend to think that background/deferred dedup is a task more similar
to HSM / archival type activities, that will involve some level of
application responsibility as well as fs-level assistance hooks.  For
all the work it would involve, I'd like to get more value than just a
few saved disk blocks. 

--
Dan.

pgpSDYMJroHHU.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS issues and the choice of platform

2011-05-26 Thread Daniel Carosone
On Thu, May 26, 2011 at 08:20:03AM -0400, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Daniel Carosone
> > 
> > On Wed, May 25, 2011 at 10:59:19PM +0200, Roy Sigurd Karlsbakk wrote:
> > > The systems where we have had issues, are two 100TB boxes, with some
> > > 160TB "raw" storage each, so licensing this with nexentastor will be
> > > rather expensive. What would you suggest? Will a solaris express
> > > install give us good support when the shit hits the fan?
> > 
> > No more so than what you have now, without a support contract.
> 
> Are you suggesting that support contracts on sol11exp are useless?  Maybe I
> should go tell my boss to cancel ours...  *sic*

No, not at all.  The OP didn't mention having or intending to buy one,
and talked only about an "install" of SX, vs paid support for nexenta.
I was just pointing out the gap, in case wrong assumptions were being
made. 

--
Dan.


pgppqLJkbUMgE.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bug? ZFS crypto vs. scrub

2011-05-25 Thread Daniel Carosone
Just a ping for any further updates, as well as a crosspost to migrate
the thread to zfs-discuss (from -crypto-). 

Is there any further information I can provide?  What's going on with
that "zpool history", and does it tell you anything about the chances
of recovering the actual key used?

On Thu, May 12, 2011 at 08:52:04PM +1000, Daniel Carosone wrote:
> On Thu, May 12, 2011 at 10:04:19AM +0100, Darren J Moffat wrote:
> > There is a possible bug in in that area too, and it is only for the  
> > keysource=passphrase case. 
> 
> Ok, sounds like it's not yet a known one.  If there's anything I can
> do to help track it down, let me know.  
> 
> > It isn't anything to do with the terminal.
> 
> Heh, ok.. just a random WAG.
> 
> >> More importantly, what are the prospects of correctly reproducing that
> >> key so as to get at data?  I still have the original staging pool, but
> >> some additions made since the transfer would be lost otherwise. It's
> >> not especially important data, but would be annoying to lose or have
> >> to reproduce.
> >
> > I'm not sure, can you send me the ouput of 'zpool history' on the pool  
> > that the recv was done to.  I'll be able to determine from that if I can  
> > fix up the problem or not.
> 
> Can do - but it's odd.   Other than the initial create, and the most
> recent scrub, the history only contains a sequence of auto-snapshot
> creations and removals. None of the other commands I'd expect, like
> the filesystem creations and recv, the device replacements (as I
> described in my other post), previous scrubs, or anything else:
> 
> dan@ventus:~# zpool history geek | grep -v auto-snap
> History for 'geek':
> 2011-04-01.08:48:15 zpool create -f geek raidz2 /rpool1/stage/file0 
> /rpool1/stage/file1 /rpool1/stage/file2 /rpool1/stage/file3 
> /rpool1/stage/file4 /rpool1/stage/file5 /rpool1/stage/file6 
> /rpool1/stage/file7 /rpool1/stage/file8 c2t600144F0DED90A004D9590440001d0
> 2011-05-10.14:03:34 zpool scrub geek
> 
> If you want the rest, I'm happy to send it, but I don't expect it will
> tell you anything.  I do wonder why that is...
> 
> --
> Dan.




pgpIJLjpYwQNX.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS issues and the choice of platform

2011-05-25 Thread Daniel Carosone
On Wed, May 25, 2011 at 10:59:19PM +0200, Roy Sigurd Karlsbakk wrote:
> The systems where we have had issues, are two 100TB boxes, with some
> 160TB "raw" storage each, so licensing this with nexentastor will be
> rather expensive. What would you suggest? Will a solaris express
> install give us good support when the shit hits the fan? 

No more so than what you have now, without a support contract.

--
Dan.



pgpDkEAYOO8xi.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] DDT sync?

2011-05-25 Thread Daniel Carosone
On Wed, May 25, 2011 at 03:50:09PM -0700, Matthew Ahrens wrote:
>  That said, for each block written (unique or not), the DDT must be updated,
> which means reading and then writing the block that contains that dedup
> table entry, and the indirect blocks to get to it.  With a reasonably large
> DDT, I would expect about 1 write to the DDT for every block written to the
> pool (or "written" but actually dedup'd).

That, right there, illustrates exactly why some people are
disappointed wrt performance expectations from dedup.

To paraphrase, and in general: 

 * for write, dedup may save bandwidth but will not save write iops.
 * dedup may amplify iops with more metadata reads 
 * dedup may turn larger sequential io into smaller random io patterns 
 * many systems will be iops bound before they are bandwidth or space
   bound (and l2arc only mitigates read iops)
 * any iops benefit will only come on later reads of dedup'd data, so
   is heavily dependent on access pattern.

Assessing whether these amortised costs are worth it for you can be
complex, especially when the above is not clearly understood.

To me, the thing that makes dedup most expensive in iops is the writes
for update when a file (or snapshot) is deleted.  These are additional
iops that dedup creates, not ones that it substitutes for others in
roughly equal number.  

This load is easily forgotten in a cursory analysis, and yet is always
there in a steady state with rolling auto-snapshots.  As I've written
before, I've had some success managing this load using deferred deletes
and snapshot holds, either to spread the load or to shift it to
otherwise-quiet times, as the case demanded.  I'd rather not have to. :-)
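
For what it's worth, the mechanics of that are just holds plus
deferred destroys; the dataset and snapshot names below are made up:

  # zfs hold -r backlog tank/data@zfs-auto-snap_hourly-2011-05-25-10h00
  # zfs destroy -d -r tank/data@zfs-auto-snap_hourly-2011-05-25-10h00
  ... later, at a quiet time:
  # zfs release -r backlog tank/data@zfs-auto-snap_hourly-2011-05-25-10h00

The blocks are only actually freed (and the DDT updated) once the hold
is released, at a time of your choosing.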

--
Dan.

pgpS8sJWxBEVR.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] howto: make a pool with ashift=X

2011-05-12 Thread Daniel Carosone
On Thu, May 12, 2011 at 12:23:55PM +1000, Daniel Carosone wrote:
> They were also sent from an ashift=9 to an ashift=12 pool

This reminded me to post a note describing how I made pools with
different ashift.  I do this both for pools on usb flash sticks, and
on disks with an underlying 4k blocksize, such as my 2Tb WD EARS
drives.  If I had pools on SATA Flash SSDs, I'd do it for those too.

The trick comes from noting that stmfadm create-lu has a blk option
for the block size of the iscsi volume to be presented.  Creating a
pool with at least one disk (per top-level vdev) on an iscsi initiator
pointing at such a target will cause zpool to set ashift for the vdev 
accordingly.  

This works even when the initiator and target are the same host, over
the loopback interface.  Oddly, however, it does not work if the host
is solaris express b151 - it does work on OI b148.  Something has
changed in zpool create in the interim.

Anyway, my recipe is to:

 * boot OI b148 in a vm. 
 * make a zfs dataset to house the working files (reason will be clear
   below).
 * In that dataset, I make sparse files corresponding in size and
   number to the disks that will eventually hold the pool (this makes
   a pool with the same size and number of metaslabs as it would have
   had natively).
 * Also make a sparse zvol of the same size.
 * stmfadm create-lu -p blk=4096 (or whatever, as desired) on the
   zvol, and make available.
 * get the iscsi initiator to connect the lu as a new disk device
 * zpool create, using all bar 1 of the files, and the iscsi disk, in
   the shape you want your pool (raidz2, etc).
 * zpool replace the iscsi disk with the last unused file (now you can
   tear down the lu and zvol)
 * zpool export the pool-on-files.
 * zfs send the dataset housing these files to the machine that has
   the actual disks (much faster than rsync even with the sparse files
   option, since it doesn't have to scan for holes).
 * zpool import the pool from the files
 * zpool upgrade, if you want newer pool features, like crypto.
 * zpool set autoexpand=on, if you didn't actually use files of the
   same size.
 * zpool replace a file at a time onto the real disks.

Hmm.. when written out like that, it looks a lot more complex than it
really is.. :-)
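
For the archives, the core of it in command form - the sizes, GUIDs
and device names below are placeholders, not what I actually typed:

  # zfs create rpool1/stage
  # cd /rpool1/stage; for i in 0 1 2 3 4 5 6 7 8; do mkfile -n 2000g file$i; done
  # zfs create -s -V 2000g rpool1/stage4k
  # stmfadm create-lu -p blk=4096 /dev/zvol/rdsk/rpool1/stage4k
  # stmfadm add-view 600144F0...          (plus the usual itadm/iscsiadm setup)
  # devfsadm -i iscsi
  # zpool create geek raidz2 /rpool1/stage/file[0-7] c2t600144F0...d0
  # zpool replace geek c2t600144F0...d0 /rpool1/stage/file8
  # zpool export geek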

Note that if you want lots of mirrors, you'll need an iscsi device per
mirror top-level vdev.

Note also that the image created inside the iscsi device is not
identical to what you want on a device with 512-byte sector emulation,
since the label is constructed for a 4k logical sector size.  zpool
replace takes care of this when labelling the replacement disk/file.

I also played around with another method, using mdb to overwrite the
disk model table to match my disks and make the pool directly on them
with the right ashift.

  http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;im=10;i=sd_flash_dev_table

This also no longer works on b151 (though the table still exists), so I
need the vm anyway, and the iscsi method is easier. 

Finally, because this doesn't work on b151, it's also only good for
creating new pools; I don't know how to expand a pool with new vdevs
to have the right ashift in those vdevs. 

--
Dan.

pgp0weQ3gx757.pgp
Description: PGP signature
___
zfs-crypto-discuss mailing list
zfs-crypto-disc...@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-crypto-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hashing files rapidly on ZFS

2010-07-07 Thread Daniel Carosone
On Tue, Jul 06, 2010 at 05:29:54PM +0200, Arne Jansen wrote:
> Daniel Carosone wrote:
> > Something similar would be useful, and much more readily achievable,
> > from ZFS from such an application, and many others.  Rather than a way
> > to compare reliably between two files for identity, I'ld liek a way to
> > compare identity of a single file between two points in time.  If my
> > application can tell quickly that the file content is unaltered since
> > last time I saw the file, I can avoid rehashing the content and use a
> > stored value. If I can achieve this result for a whole directory
> > tree, even better.
> 
> This would be great for any kind of archiving software. Aren't zfs checksums
> already ready to solve this? If a file changes, it's dnodes' checksum changes,
> the checksum of the directory it is in and so forth all the way up to the
> uberblock.

Not quite.  The merkle tree of file and metadata blocks gets to the root
via a path that is largely separate from the tree of directory objects
that name them.  Changing the inode(equivalent) metadata doesn't
change the contents of any of the directories that have entries
pointing to that file.   

Put another way, we don't have named file versions - a directory name
refers to an object and all future versions of that object.  Future
versions will be found at different disk addresses, but are the same
object id.  There's no need to rewrite the directory object unless the
name changes or the link is removed.   We have named filesystem
versions (snapshots), that name the entire tree of data and metadata
objects, once they do finally converge.

> There may be ways a checksum changes without a real change in the files 
> content,
> but the other way round should hold. If the checksum didn't change, the file
> didn't change.

That's true for individual files.  (Checksum changes can happen when
the checksum algorithm is changed and the same data is rewritten, or if
zero-filled blocks are replaced/rewritten with holes or vice versa,
and in some other cases.)

> So the only missing link is a way to determine zfs's checksum for a
> file/directory/dataset. 

That is missing, yes.  Perhaps in part because no one could see a valid
application-level use for this implementation-specific metadata.  The
original purpose of my post above was to illustrate a use case for it.

The throwaway line at the end, about a directory tree, was more in the
vein of wishful thinking.. :)

> Am I missing something here? Of course atime update
> should be turned off, otherwise the checksum will get changed by the archiving
> agent.

Separate checksums exist that cover the difference between content and
metadata changes, so at least in theory an interface that exposes
enough detail could avoid this restriction.

--
Dan.

pgp3k9u39IfHh.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Announce: zfsdump

2010-07-03 Thread Daniel Carosone
On Wed, Jun 30, 2010 at 12:54:19PM -0400, Edward Ned Harvey wrote:
> If you're talking about streaming to a bunch of separate tape drives (or
> whatever) on a bunch of separate systems because the recipient storage is
> the bottleneck instead of the network ... then "split" probably isn't the
> most useful way to distribute those streams.  Because "split" is serial.
> You would really want to "stripe" your data to all those various
> destinations, so they could all be writing simultaneously.  But this seems
> like a very specialized scenario, that I think is probably very unusual.

At this point, I will repeat my recommendation about using
zpool-in-files as a backup (staging) target.  Depending on where you
host the files, and how you combine them, you can achieve these scenarios
without clunkery, and with all the benefits a zpool provides.

 1 - Create a bunch of files, sized appropriately for your eventual backup
 media unit (e.g. tape).  

 2 - make a zpool out of them, in whatever vdev arrangement suits your
 space and error tolerance needs (plain stripe or raidz or both).
 Set compression, dedup etc (encryption, one day) as suits you, too.

 3 - zfs send | zfs recv into this pool-of-files.  rsync from non-zfs
 hosts, too, if you like.

 4 - scrub, if you like

 5 - write the files to tape, or into whatever file-oriented backup
 solution you prefer (perhaps at a less frequent schedule than
 sends).

 6 - goto 3 (incremental sends for later updates)

I came up with this scheme when zpool was the only forwards-compatible
format, before the send stream format was a committed interface too.
However, there are still several other reasons why this is preferable
to backing up send streams directly.
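
A minimal sketch of steps 1-5, with sizes and names purely invented:

  # mkfile -n 100g /stage/t0 /stage/t1 /stage/t2 /stage/t3
  # zpool create bkpool raidz /stage/t0 /stage/t1 /stage/t2 /stage/t3
  # zfs set compression=on bkpool
  # zfs snapshot -r tank@backup-2010-07
  # zfs send -R tank@backup-2010-07 | zfs recv -Fd bkpool
  # zpool scrub bkpool
  # zpool export bkpool
  ... then write /stage/t* to tape with whatever you normally use.

Later increments are just zfs send -R -i oldsnap newsnap into the same
pool-of-files.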

--
Dan.

pgpEHsmYFFyjp.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hashing files rapidly on ZFS

2010-05-17 Thread Daniel Carosone
On Tue, May 11, 2010 at 04:15:24AM -0700, Bertrand Augereau wrote:
> Is there a O(nb_blocks_for_the_file) solution, then?
> 
> I know O(nb_blocks_for_the_file) == O(nb_bytes_in_the_file), from Mr. 
> Landau's POV, but I'm quite interested in a good constant factor.

If you were considering the hashes of each zfs block as a precomputed
value, it might be tempting to think of getting all of these and
hashing them together.  You could thereby avoid reading the file data,
and the file metadata containing the hashes you'd have needed to read
anyway.  This seems appealing, eliminating seeks and cpu work.

However, there are some issues that make the approach basically
infeasible and unreliable for comparing the results of two otherwise
identical files.

First, you're assuming there's an easy interface to get the stored
hashes of a block, which there isn't.  Even if we ignore that for a
moment, the hashes zfs records depend on factors other than just the
file content, including the way the file has been written over time.  

The blocks of the file may not be constant size; a file that grew
slowly may have different hashes to a copy of it or one extracted
from an archive in a fast stream.  Filesystem properties, including
checksum (obvious), dedup (which implies checksum), compress (which
changes written data and can make holes), blocksize and maybe others
may be different between filesystems or even change over the time a
file has been written, and again change results and defeat
comparisons.

These things can defeat zfs's dedup too, even though it does have
access to the block level checksums.

If you're going to do an application-level dedup, you want to utilise
the advantage of being independent of these things - or even of the
underlying filesystem at all (e.g. dedup between two NAS shares).

Something similar would be useful, and much more readily achievable,
from ZFS, for such an application and many others.  Rather than a way
to compare reliably between two files for identity, I'd like a way to
compare the identity of a single file between two points in time.  If my
application can tell quickly that the file content is unaltered since
the last time I saw the file, I can avoid rehashing the content and use a
stored value.  If I can achieve this result for a whole directory
tree, even better.

--
Dan.





pgp1HgRATGs5S.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is it safe to disable the swap partition?

2010-05-09 Thread Daniel Carosone
On Sun, May 09, 2010 at 09:24:38PM -0500, Mike Gerdts wrote:
> The best thing to do with processes that can be swapped out forever is
> to not run them.

Agreed, however:

#1  Shorter values of "forever" (like, say, "daily") may still be useful.
#2  This relies on knowing in advance what these processes will be.
#3  Where are the JeOS builds without all the gnome-infested likely suspects?

--
Dan.

pgpHYkrXDUgqQ.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirroring USB Drive with Laptop for Backup purposes

2010-05-05 Thread Daniel Carosone
On Wed, May 05, 2010 at 04:34:13PM -0400, Edward Ned Harvey wrote:
> The suggestion I would have instead, would be to make the external drive its
> own separate zpool, and then you can incrementally "zfs send | zfs receive"
> onto the external.

I'd suggest doing both, to different destinations :)  Each kind of
"backup" serves different, complementary purposes.

> #1 I think all the entire used portion of the filesystem needs to resilver
> every time.  I don't think there's any such thing as an incremental
> resilver.  

Incorrect. It will play forward all the (still-live) blocks from txg's
between the time it was last online and now. 

That said, I'd also recommend a scrub on a regular basis, once the
resilver has completed, and that will trawl through all the data and
take all that time you were worried about anyway.  For a 200G disk,
full, over usb, I'd expect around 4-5 hours.  That's fine for a "leave
running overnight" workflow.

This is the benefit of this kind of "backup" - as well as being almost
brainless to initiate, it's able to automatically repair marginal
sectors on the laptop disk if they become unreadable, saving you from
the hassle of trying to restore damaged files.

The send|recv kind of backup is much better for restoring data from
old snapshots (if the target is larger than the source and keeps them
longer), and recovering from accidentally destroying both mirrored
copies of data due to operator error. 

> #2 How would you plan to disconnect the drive?  If you zpool detach it, I
> think it's no longer a mirror, and not mountable.

That's correct - which is why you would use "zpool offline".
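
i.e. (pool and device names invented):

  # zpool offline tank c5t0d0       (before unplugging the USB disk)
  ... travel, work, whatever ...
  # zpool online tank c5t0d0        (after plugging it back in)
  # zpool status tank               (watch the catch-up resilver)

plus a "zpool scrub tank" overnight every so often, per the above.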

--
Dan.


pgpgbQjfYhj6R.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?

2010-04-27 Thread Daniel Carosone
On Tue, Apr 27, 2010 at 10:36:37AM +0200, Roy Sigurd Karlsbakk wrote:
> - "Daniel Carosone"  skrev:
> > SAS:  Full SCSI TCQ
> > SATA: Lame ATA NCQ
> 
> What's so lame about NCQ?

Primarily, the meager number of outstanding requests; write cache is
needed to pretend the writes are done straight away and free up the
slots for reads.  

If you want throughput, you want to hand the disk controller as many
requests as possible, so it can optimise seek order.  If you have
especially latency-sensitive requests, you need to manage the queue
carefully with either system.

--
Dan.

pgpf0r3L8VyeA.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS version information changes (heads up)

2010-04-27 Thread Daniel Carosone
On Tue, Apr 27, 2010 at 11:29:04AM -0600, Cindy Swearingen wrote:
> The revised ZFS Administration Guide describes the ZFS version
> descriptions and the Solaris OS releases that provide the version
> and feature, starting on page 293, here:
>
> http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs

It's not entirely clear how much of the text above you're quoting 
as the addition, but surely referring to a page number is even more 
volatile than a url?

--
Dan.



pgpmg908CTKgT.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?

2010-04-26 Thread Daniel Carosone
On Mon, Apr 26, 2010 at 10:02:42AM -0700, Chris Du wrote:
> SAS: full duplex
> SATA: half duplex
> 
> SAS: dual port
> SATA: single port (some enterprise SATA has dual port)
> 
> SAS: 2 active channel - 2 concurrent write, or 2 read, or 1 write and 1 read
> SATA: 1 active channel - 1 read or 1 write
> 
> SAS: Full error detection and recovery on both read and write
> SATA: error detection and recovery on write, only error detection on read

SAS:  Full SCSI TCQ
SATA: Lame ATA NCQ

--
Dan.



pgpfPAxGyNIbj.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-22 Thread Daniel Carosone
On Thu, Apr 22, 2010 at 09:58:12PM -0700, thomas wrote:
> Assuming newer version zpools, this sounds like it could be even
> safer since there is (supposedly) less of a chance of catastrophic
> failure if your ramdisk setup fails. Use just one remote ramdisk or
> two with battery backup.. whatever meets your paranoia level.   

If the iscsi initiator worked for me at all, I would be trying this.
I liked the idea, but it's just not accessible now.

--
Dan.


pgpfrImp3sC6A.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making an rpool smaller?

2010-04-20 Thread Daniel Carosone
On Tue, Apr 20, 2010 at 12:55:10PM -0600, Cindy Swearingen wrote:
> You can use the OpenSolaris beadm command to migrate a ZFS BE over
> to another root pool, but you will also need to perform some manual
> migration steps, such as
> - copy over your other rpool datasets
> - recreate swap and dump devices
> - install bootblocks
> - update BIOS and GRUB entries to boot from new root pool

I've also found it handy to use different names for each rpool.
Sometimes it's useful to boot an image that's entirely on a removable
disk, for example, and move that between hosts. The last thing you
want is a name clash or confusion about which pool is which.

In addition to the "import name" of the pool, there's another name
that needs to be changed. This is the "boot name" of the pool; it's
the name grub looks for in the "findroot(pool_rpool,...)" line.
That name is found in the root fs of the pool, in ./etc/bootsign (so
typically mounted at /poolname/etc/bootsign).

--
Dan.

pgpTeuorcSbDh.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Making an rpool smaller?

2010-04-20 Thread Daniel Carosone
I have certainly moved a root pool from one disk to another, with the
same basic process, ie:  

 - fuss with fdisk and SMI labels (sigh)
 - zpool create
 - snapshot, send and recv
 - installgrub
 - swap disks

I looked over the "root pool recovery" section in the Best Practices guide
at the time, it has details of all these steps.
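
For the archives, the bare bones of that process look roughly like
this, once the fdisk/SMI label fussing is done; disk names and the BE
name are stand-ins for whatever applies:

  # zpool create -f rpool2 c1t1d0s0
  # zfs snapshot -r rpool@move
  # zfs send -R rpool@move | zfs recv -Fdu rpool2
  # zpool set bootfs=rpool2/ROOT/snv_134 rpool2
  # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0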

In my case, it was to move to a larger disk (in my laptop) rather than
a smaller, but as long as it all fits it won't matter.  

(I did it this way, instead of by attaching and detaching a mirror, in
order to go through dedup and upgrade checksums, and also to get
comfortable with the process for the day I'm really doing a
recovery.)

--
Dan.

pgpnel8QFR6Yq.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Daniel Carosone
On Mon, Apr 19, 2010 at 03:37:43PM +1000, Daniel Carosone wrote:
> the filesystem holding /etc/zpool.cache

or, indeed, /etc/zfs/zpool.cache  :-)

--
Dan.


pgpSCBv4eR19k.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Daniel Carosone
On Sun, Apr 18, 2010 at 07:37:10PM -0700, Don wrote:
> I'm not sure to what you are referring when you say my "running BE"

Running boot environment - the filesystem holding /etc/zpool.cache

--
Dan.

pgpbKUgqnePjv.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Daniel Carosone
On Sun, Apr 18, 2010 at 10:33:36PM -0500, Bob Friesenhahn wrote:
> Probably the DDRDrive is able to go faster since it should have lower  
> latency than a FLASH SSD drive. However, it may have some bandwidth  
> limits on its interface.

It clearly has some.  They're just as clearly well in excess of those
applicable to a SATA-interface SSD, even a DRAM-based one like the acard.  

In return, the SATA SSD has some deployment options (in an external
JBOD, for example) not as readily accessible to a PCI device.

I'd be curious to try mirroring these kinds of devices across
server heads, using comstar and some suitable interconnect, as a
comparison to slogs colocated with the drives.

--
Dan.

pgpjsKj1KdWMD.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

