Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/26/2012 12:54 PM, Grégory Giannoni wrote:
> [snip] I switched a few months ago from Sun X45x0 to HP gear: my fast NAS boxes are now DL180 G6. I got better perf using the LSI 9240-8i rather than HP SmartArray (tried P410 & P812). I'm using only 600GB SSD drives.

That LSI controller supports SATA III, i.e. 6Gb/s SATA. The Px1x controllers do 6Gb/s SAS, but only 3Gb/s SATA, so that's your likely perf difference. The SmartArray Px2x series should do both SATA and SAS at 6Gb/s.

That said, I do think you're right that the LSI controller is probably a better fit for connections requiring a SATA SSD. The only exception is having to give up the 1GB of NVRAM on the HP controller. :-(

> In one of the servers I replaced the 25-disk bay with 3 8-disk bays, allowing me to connect 3 LSI 9240-8i rather than only one. This NAS achieved 4.4GB/sec reading and 4.1GB/sec writing with 48 io/s, running Solaris 11. Using raidz2, perfs dropped to 3.1 / 3.0 GB/sec.

Is the bottleneck the LSI controller, or the SAS/SATA bus, or the PCI-E bus itself? That is, have you tested with the LSI 9240-4i (one per 8-drive cage, which I *believe* can use the HP multi-lane cable), and with an LSI 9260-16i or LSI 9280-24i? My instinct would be to say it's the PCI-E bus, and you could probably get away with the 4-channel cards. I.e. 4 channels @ 6Gbit/s = 3GB/s > 4x PCI-E 2.0 at 2GB/s.

Also, the HP H220 is simply the OEM version of the LSI 9240-8i.

-Erik

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
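The "my instinct says it's the PCI-E bus" hunch is easy to sanity-check with link-rate arithmetic. A small sketch (my own illustration, assuming 8b/10b coding on both SATA III and PCIe 2.0, and ignoring higher-level protocol overhead, so the numbers come out a bit below Erik's rounder figures but support the same conclusion):

```python
# Back-of-the-envelope bus math, all numbers in decimal MB/s.

def sata3_MBps(ports):
    # SATA III: 6 Gbit/s line rate, 8b/10b coding -> 600 MB/s usable per port
    return ports * 600

def pcie2_MBps(lanes):
    # PCIe 2.0: 5 GT/s, 8b/10b coding -> 500 MB/s usable per lane
    return lanes * 500

# A 4-port card with all ports saturated needs 2400 MB/s of host bandwidth,
# but a PCIe 2.0 x4 slot tops out at 2000 MB/s -> the slot is the bottleneck.
print(sata3_MBps(4), pcie2_MBps(4))

# Even an 8-port card in an x8 slot is slot-bound: 4800 vs 4000 MB/s.
print(sata3_MBps(8), pcie2_MBps(8))
```

Either way the PCIe slot, not the SATA links, caps a fully loaded card, which is consistent with spreading the drives across three controllers helping.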
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/24/2012 5:17 AM, Edmund White wrote:
> Heh, I wouldn't be using G5s for ZFS purposes now. G6 and better ProLiants are a better deal for RAM capacity and CPU core count… Either way, I also use HP systems as the basis for my ZFS/Nexenta storage systems. Typically DL380s, since I have expansion room for either 16 drive bays, or for using them as a head unit to a D2700 or D2600 JBOD. The right replacement for the old DL320s storage server is the DL180 G6. This model was available in a number of configurations, but the best solutions for storage were the 2U 12-bay 3.5" model and the 2U 25-bay 2.5" model. Both models have a SAS expander on the backplane, but with a nice controller (LSI 9211-4i), make good ZFS storage servers.

Really? I mean, sure, the G6 is beefier, but I can still get 8 cores of decently fast CPU and 64GB of RAM in a G5, which, unless I'm doing dedup and need a *stupid* amount of RAM, is more than sufficient for anything I've ever seen as a ZFS appliance. I'd agree that the 64GB RAM limit can be annoying if you really want to run a Super App Server + ZFS server on them, but they're so much more powerful than the X4500/X4540 that I'd think they make an excellent drop-in replacement when paired with an MSA70, particularly on cost. The G6 is over double the cost of the G5.

One thing that I do know about the G6 is that they have Nehalem CPUs (X5500-series), which support VT-d, Intel's virtualization I/O acceleration technology, while the G5's X5400-series Harpertowns don't. If you're running zones on the system, it won't matter, but VirtualBox will care.

---

Thanks for the DL180 link. Once again, I think I'd go for the G5 rather than the G6 - it's roughly half the cost (or less, as the 2.5"-enabled G6s seem to be expensive), and these boxes make nice log servers, not app servers.
The DL180 G5 seems to be pretty much a DL380 G5 with a different hard-drive layout (12x 3.5" rather than 8x 2.5").

---

One word here for everyone getting HP equipment: you want the Px1x or Px2x (e.g. P812) series of SmartArray controllers if you plan on running SATA drives attached to them. The older Px0x series only supports SATA I (1.5Gb/s) alongside 3Gb/s SAS, which is a serious handicap if you want to put SSDs on that channel. The newer series do SATA II (3Gb/s) and SAS at 6Gb/s.

http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/index.html

-Erik
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/23/2012 5:50 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> I wonder if it would make weird sense to get the boxes, forfeit the cool-looking Fishworks, and install Solaris/OI/Nexenta/whatever to get the most flexibility and bang for a buck from the owned hardware...
>
> This is what we decided to do at work, and this is the reason why. But we didn't buy the appliance-branded boxes; we just bought normal servers running Solaris.

I gave up and am now buying HP-branded hardware for running Solaris on it. Particularly if you get off-lease used hardware (for which HP is still very happy to let you buy a HW support contract), it's cheap, and HP has a lot of Solaris drivers for their branded stuff. Their whole SmartArray line of adapters has much better Solaris driver coverage than the generic stuff or the equivalent IBM or Dell items.

For instance, I just got a couple of DL380 G5 systems with dual Harpertown CPUs, fully loaded with 8 2.5" SAS drives and 32GB of RAM, for about $800 total. You can attach their MSA30/50/70-series (or D2700-series, if you want new) as dumb JBODs via SAS, and the nicer SmartArray controllers have 1GB of NVRAM, which is sufficient for many purposes, so you don't even have to cough up the dough for a nice ZIL SSD.

HP even made a sweet little "appliance" thing that was designed for Windows, but happens to run Solaris really, really well: the DL320s (the "s" is part of the model designation). 14x 3.5" SAS/SATA hot-swap bays, a Xeon 3070 dual-core CPU, a SmartArray controller, 2x GbE NICs, LOM, and a free 1x PCI-E expansion slot. The only drawback is that it only takes up to 8GB of RAM. It makes a *fabulous* little backup system for logs and stuff, and it's under about $2000 even after you splurge for 1TB drives and an SSD for the thing. I am in the market for something newer than that, though.
Anyone know what HP's using as a replacement for the DL320s?

-Erik
Re: [zfs-discuss] FC HBA for openindiana
Do make sure you're getting one that has the proper firmware. Those with a BIOS don't work in SPARC boxes, and those with OpenBoot don't work in x64 stuff.

A quick "Sun FC HBA" search on eBay turns up a whole list of "official" Sun HBAs, which will give you an idea of the (max) pricing you'll be paying. There's currently a *huge* price difference between the 4Gb and 2Gb adapters. Also, keep in mind that PCI-X adapters are far more common at the 1/2Gb range, while PCI-E starts to be the most common choice at 4Gb+.

Here's a list of all the old Sun FC HBAs (which can help you sort out which are for x64 systems, and which were for SPARC systems):

http://www.oracle.com/technetwork/documentation/oracle-storage-networking-190061.html

As Tim said, these should all have built-in drivers in the Illumos codebase.

-Erik

On 10/20/2012 4:24 PM, Tim Cook wrote:
> The built-in drivers support MPxIO, so you're good to go.
>
> On Friday, October 19, 2012, Christof Haemmerle wrote:
>> Yep, I need 4Gb with multipathing if possible.
>>
>> On Oct 19, 2012, at 10:34 PM, Tim Cook wrote:
>>> On Friday, October 19, 2012, Christof Haemmerle wrote:
>>>> hi there, i need to connect some old raid subsystems to an opensolaris box via fibre channel. can you recommend any FC HBA? thanx
>>>
>>> How old? If it's 1Gbit you'll need a 4Gb or slower HBA. Qlogic would be my preference. You should be able to find a 2340 for cheap on eBay. Or a 2460 if you want 4Gb.
Re: [zfs-discuss] what have you been buying for slog and l2arc?
On 8/6/2012 2:53 PM, Bob Friesenhahn wrote:
> On Mon, 6 Aug 2012, Stefan Ring wrote:
>>> Intel's brief also clears up a prior controversy of what types of data are actually cached; per the brief it's both user and system data!
>>
>> So you're saying that SSDs don't generally flush data to stable medium when instructed to? So data written before an fsync is not guaranteed to be seen after a power-down? If that -- ignoring cache flush requests -- is the whole reason why SSDs are so fast, I'm glad I haven't got one yet.
>
> Testing has shown that many SSDs do not flush the data prior to claiming that they have done so. The flush request may hasten the time until the next actual cache flush.

Honestly, I don't think this last point can be emphasized enough. SSDs of all flavors and manufacturers have a track record of *consistently* lying when returning from a cache-flush command. There might exist somebody out there who actually does it across all products, but I've tested and used enough of the variety (both consumer and enterprise) NOT to trust any SSD that tells you it actually flushed its local cache. ALWAYS insist on some form of power protection, whether it be a supercap, battery, or external power supply. That way, even if they lie to you, you're covered from a power loss.

-Erik
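The "testing has shown" above refers to pull-the-plug experiments. A minimal sketch of that kind of test harness (my own illustration, not an existing tool's interface; the idea is the same as the well-known diskchecker-style scripts):

```python
import os
import struct
import tempfile
import zlib

REC = struct.Struct("<IQ")  # crc32 of the seq field, then the seq number

def write_records(path, n):
    """Append sequenced, checksummed records, fsyncing after each one.

    Once fsync() returns, the drive has *claimed* the record is on
    stable media; that claim is exactly what gets audited after a
    power cut.
    """
    with open(path, "wb") as f:
        for seq in range(n):
            payload = struct.pack("<Q", seq)
            f.write(struct.pack("<I", zlib.crc32(payload)) + payload)
            f.flush()
            os.fsync(f.fileno())

def count_intact(path):
    """Count records that survived, stopping at the first hole or corruption."""
    with open(path, "rb") as f:
        data = f.read()
    good = 0
    for off in range(0, len(data) - REC.size + 1, REC.size):
        crc, seq = REC.unpack_from(data, off)
        if seq != good or crc != zlib.crc32(struct.pack("<Q", seq)):
            break
        good += 1
    return good
```

Usage: run the writer against the SSD under test, cut power mid-run while noting the last sequence number fsync() acknowledged, then run count_intact() after reboot. Any acknowledged record that's missing means the drive lied about the flush.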
Re: [zfs-discuss] IOzone benchmarking
On 5/5/2012 8:04 AM, Bob Friesenhahn wrote:
> On Fri, 4 May 2012, Erik Trimble wrote:
>> predictable, and the backing store is still only giving 1 disk's IOPS. The RAIDZ* may, however, give you significantly more throughput (in MB/s) than a single disk if you do a lot of sequential read or write.
>
> Has someone done real-world measurements which indicate that raidz* actually provides better sequential read or write than simple mirroring with the same number of disks? While it seems that there should be an advantage, I don't recall seeing posted evidence of such. If there was a measurable advantage, it would be under conditions which are unlikely in the real world. The only thing totally clear to me is that raidz* provides better storage efficiency than mirroring, and that raidz1 is dangerous with large disks. Provided that the media reliability is sufficiently high, there are still many performance and operational advantages obtained from simple mirroring (duplex mirroring) with zfs.
>
> Bob

I'll see what I can do about actual measurements. Given that we're really recommending a minimum of RAIDZ2 nowadays (with disks > 1TB), that means, for N disks, you get N-2 data disks in a RAIDZ2, and N/2 data disks in a standard striped mirror. My brain says that even with the overhead of parity calculation, for sequential read/write of at least the slab size (i.e. involving all the data drives in a RAIDZ2), performance for the RAIDZ2 should be better for N >= 6. But that's my theoretical brain, and we should do some decent benchmarking to put some hard facts to that.

-Erik
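The N >= 6 crossover claimed above follows from simple disk counting. A sketch (my assumption: sequential throughput scales with the number of data disks, and parity computation is free):

```python
def raidz2_data_disks(n):
    # RAIDZ2: two parity disks per vdev, the rest carry data
    return n - 2

def mirror_data_disks(n):
    # striped 2-way mirrors: half the disks hold unique data
    return n // 2

# Sequential-throughput model: MB/s scales with the data-disk count.
# For even N, a 4-disk pool ties (2 vs 2); RAIDZ2 first pulls ahead
# at N = 6 (4 data disks vs 3) and the gap widens from there.
for n in (4, 6, 8, 10):
    print(n, raidz2_data_disks(n), mirror_data_disks(n))
```

This is exactly the "theoretical brain" model; whether real hardware tracks it is what the proposed benchmarking would have to show.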
Re: [zfs-discuss] IOzone benchmarking
On 5/4/2012 1:24 PM, Peter Tribble wrote:
> On Thu, May 3, 2012 at 3:35 PM, Edward Ned Harvey wrote:
>> I think you'll get better, both performance & reliability, if you break each of those 15-disk raidz3's into three 5-disk raidz1's. Here's why:
>
> Incorrect on reliability; see below.
>
>> Now, to put some numbers on this... A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write sequential. This means resilvering the entire disk sequentially, including unused space (which is not what ZFS does), would require 2.2 hours. In practice, on my 1T disks, which are in a mirrored configuration, I find resilvering takes 12 hours. I would expect this to be ~4 days if I were using 5-disk raidz1, and I would expect it to be ~12 days if I were using 15-disk raidz3.
>
> Based on your use of "I would expect", I'm guessing you haven't done the actual measurement. I see ~12-16 hour resilver times on pools using 1TB drives in raidz configurations. The resilver times don't seem to vary with whether I'm using raidz1 or raidz2.
>
>> Suddenly the prospect of multiple failures overlapping doesn't seem so unlikely.
>
> Which is *exactly* why you need multiple-parity solutions. Put simply, if you're using single-parity redundancy with 1TB drives or larger (raidz1 or 2-way mirroring), then you're putting your data at risk. I'm seeing - at a very low level, but clearly non-zero - occasional read errors during rebuild of raidz1 vdevs, leading to data loss. Usually just one file, so it's not too bad (and zfs will tell you which file has been lost). And the observed error rates we're seeing in terms of uncorrectable (and undetectable) errors from drives are actually slightly better than you would expect from the manufacturers' spec sheets. So you definitely need raidz2 rather than raidz1; I'm looking at going to raidz3 for solutions using current high-capacity (i.e. 3TB) drives.
>
> (On performance, I know what the theory says about getting one disk's worth of IOPS out of each vdev in a raidz configuration. In practice we're finding that our raidz systems actually perform pretty well when compared with dynamic stripes, mirrors, and hardware raid LUNs.)

Really, guys: Richard, myself, and several others have covered how ZFS does resilvering (and disk reliability, a related issue), and included very detailed calculations on IOPS required and discussions about slabs, recordsize, and how disks operate with regard to seek/access times and OS caching. Please search the archives, as it's not fruitful to repost the exact same thing repeatedly.

Short version: assuming identical drives and the exact same usage pattern and /amount/ of data, the time it takes the various ZFS configurations to resilver is N for ANY mirrored config and a bit less than N*M for an M-disk RAIDZ*, where M = the number of data disks in the RAIDZ* - thus a 6-drive (total) RAIDZ2 will have the same resilver time as a 5-drive (total) RAIDZ1. Calculating what N is depends entirely on the pattern in which the data was written on the drive. You're always going to be IOPS-bound on the disk being resilvered.

Which RAIDZ* config to use (assuming you have a fixed tolerance for data loss) depends entirely on what your data usage pattern does to resilver times; configurations needing very long resilver times had better have more redundancy. And remember, larger configs allow for more data to be stored, which also increases resilver time.

Oh, and a RAIDZ* will /only/ ever get you slightly more than 1 disk's worth of IOPS (averaged over a reasonable time period). Caching may make it appear to give more IOPS in certain cases, but that's neither sustainable nor predictable, and the backing store is still only giving 1 disk's IOPS. The RAIDZ* may, however, give you significantly more throughput (in MB/s) than a single disk if you do a lot of sequential read or write.

-Erik
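The "short version" resilver rule stated above can be written down directly. A sketch (N is the time to resilver the given amount of data in a mirror; the N*M figure is treated as the upper bound Erik describes):

```python
def resilver_bound(n, total_disks=None, parity=None):
    """Upper-bound resilver time under the model in this thread.

    n is the mirror resilver time for the same data; a RAIDZ* vdev
    takes a bit less than n times its data-disk count M.
    """
    if total_disks is None:            # a mirror of any width
        return n
    return n * (total_disks - parity)  # M = data disks in the RAIDZ*

# A 6-drive RAIDZ2 and a 5-drive RAIDZ1 both have 4 data disks,
# so the model gives them identical resilver bounds.
print(resilver_bound(12, 6, 2), resilver_bound(12, 5, 1))
```

Note the model deliberately says nothing about what N is; as the post says, that depends entirely on the on-disk write pattern, and the resilvering disk stays IOPS-bound throughout.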
Re: [zfs-discuss] Good tower server for around 1,250 USD?
On 3/24/2012 4:54 PM, The Honorable Senator and Mrs. John Blutarsky wrote:
> laotsu said:
>> well check this link
>> https://shop.oracle.com/pls/ostore/product?p1=SunFireX4270M2server&p2=&p3=&p4=&sc=ocom_x86_SunFireX4270M2server&tz=-4:00
>> you may not like the price
>
> Hahahah! Thanks for the laugh. The dual 10GbE PCI card breaks my budget. I'm not going to try to configure a server and see how much it costs... I can't even get to the site from my country, btw. I had to use a proxy through my company in America to get pricing. Oracle doesn't want to sell certain things everywhere, or they don't know how to run a website.

I posted this a while ago when people were asking for a good recommendation. I suggest pretty much any of the IBM stuff - the vast majority of them seem to have really good compatibility (excepting the ServeRAID controllers) with Solaris/IllumOS. The baseboard controllers are usually some flavor of well-supported SATA or LSI SAS controller. They're otherwise quite well put together, and parts are easy to come by (and IBM's support site for information is fabulous, even if you DON'T have a contract). You can get reconditioned/used stuff for really cheap, and it's even possible to get support for IBM-label stuff if it's not out of warranty (or buy a new warranty, if you so want, from your local IBM reseller, of which there are a lot, world-wide). You can also likely get a Solaris contract for this, either through IBM or through Oracle (that is, if Oracle hasn't completely stopped selling support contracts for Solaris on non-Oracle hardware already).

I personally own an X3500, which uses E5[1,3]00-series CPUs (dual or quad-core) and DDR2 RAM. They usually come with a systems-management card (or you can get one for cheap, under $50).
Parts are easy to come by, cheap, and covered by any IBM warranty if you buy an IBM-labeled part (even if it was a 3rd-party, non-"authorized" reseller that sold you the part). The X3500 and X3500 M2, plus the X3400 or X3400 M2, are likely your best bets. The IBM Xref for Withdrawn Hardware is an excellent place to start looking for a compatible system (plus it gives you the IBM part numbers for everything):

http://www.redbooks.ibm.com/abstracts/redpxref.html

Here's what you want for under $1000:
http://www.ebay.com/itm/IBM-x3500-Tower-2x-Quad-Core-2-66GHz-8GB-4x73GB-8K-Raid-/140730930501?pt=COMP_EN_Servers&hash=item20c4379545

Cheaper, and a better CPU, but smaller:
http://www.ebay.com/itm/IBM-x3400-M3-737942U-5U-Tower-Entry-level-Server-Intel-Xeon-E5507-2-26-GHz-/390390337076?pt=COMP_EN_Servers&hash=item5ae513ce34

-Erik
Re: [zfs-discuss] 2.5" to 3.5" bracket for SSD
On 1/14/2012 8:15 AM, Anil Jangity wrote:
> I have a couple of Sun/Oracle X2270 boxes and am planning to get some 2.5" Intel 320 SSDs for the rpool. Do you happen to know what kind of bracket is required to get the 2.5" SSD to fit into the 3.5" slots? Thanks

Anything that looks like this:

http://www.amazon.com/2-5-3-5-Ssd-sata-Convert/dp/B002Z2QDNE/ref=dp_cp_ob_e_title_0

The X2270 takes a standard low-profile 3.5" hard drive, so any converter that puts the power/SATA connectors in the same location as a 3.5" drive's will work.

-Erik
Re: [zfs-discuss] Stress test zfs
On 1/4/2012 2:59 PM, grant lowe wrote:
> Hi all, I've got Solaris 10 9/10 running on a T3. It's an Oracle box with 128GB memory. I've been trying to load test the box with bonnie++. I can seem to get 80 to 90K reads, but can't seem to get more than a couple K for writes. Any suggestions? Or should I take this to a bonnie++ mailing list? Any help is appreciated. I'm kinda new to load testing. Thanks.

Also, note that bonnie++ is single-threaded, and a T3's single-thread performance isn't stellar, by any means. It's entirely possible you're CPU-bound during the test. Though a listing of your ZFS config would be nice, as previously mentioned...

-Erik
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On 12/12/2011 12:23 PM, Richard Elling wrote:
> On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:
>> Not exactly. What is dedup'ed is the stream only, which is in fact not very efficient. Real dedup-aware replication is taking the necessary steps to avoid sending a block that exists on the other storage system.
>
> These exist outside of ZFS (eg rsync) and scale poorly. Given that dedup is done at the pool level and ZFS send/receive is done at the dataset level, how would you propose implementing a dedup-aware ZFS send command?
> -- richard

I'm with Richard. There is no practical "optimally efficient" way to dedup a stream from one system to another. The only way to do so would be to have total information about the pool composition on BOTH the receiver and sender side. That would involve sending the checksums for the complete pool's blocks between the receiver and sender, which is a non-trivial overhead and, indeed, would usually be far worse than simply doing what 'zfs send -D' does now (dedup the sending stream itself).

The only possible way such a scheme would work would be if the receiver and sender were the same machine (note: not VMs or zones on the same machine, but the same OS instance, since you would need the DDT to be shared). And that's not a use case that 'zfs send' is generally optimized for - that is, while it's entirely possible, it's not the primary use case for 'zfs send'.

Given the overhead of network communications, there's no way that sending block checksums between hosts can ever be more efficient than just sending the self-deduped whole stream (except in pedantic cases). Let's look at possible implementations (all assume that the local sending machine does its own dedup - that is, the stream-to-be-sent is already deduped within itself):

(1) When constructing the stream, every time a block is read from a fileset (or volume), its checksum is sent to the receiving machine. The receiving machine then looks up that checksum in its DDT and sends back a "needed" or "not-needed" reply to the sender. While this lookup is being done, the sender must hold the original block in RAM and cannot write it out to the to-be-sent stream.

(2) The sending machine reads all the to-be-sent blocks, creates a stream, AND creates a checksum table (a mini-DDT, if you will). The sender communicates this mini-DDT to the receiver. The receiver diffs it against its own master pool DDT, and then sends back an edited mini-DDT containing only the checksums that match blocks which aren't on the receiver. The original sending machine must then go back and reconstruct the stream (either as a whole, or by parsing the stream as it is being sent) to leave out the unneeded blocks.

(3) Some combo of #1 and #2 where several checksums are stuffed into a packet and sent over the wire to be checked at the destination, with the receiver sending back only those to be included in the stream.

In the first scenario, you produce a huge amount of small-packet network traffic, which trashes network throughput, with no real expectation that the reduction in the send stream will be worth it. In the second case, you induce a huge amount of latency into the construction of the sending stream - that is, the sender has to wait around and then spend a non-trivial amount of processing power essentially double-processing the send stream when, in the current implementation, it just sends out stuff as soon as it gets it. The third scenario is only an optimization of #1 and #2, and doesn't avoid the pitfalls of either.

That is, even if ZFS did pool-level sends, you're still trapped by the need to share the DDT, which induces an overhead that can't reasonably be made up vs simply sending an internally-deduped source stream in the first place. I'm sure I can construct an instance where such DDT sharing would be better than the current 'zfs send' implementation; I'm just as sure that such an instance would be the small minority of usage, and that such a required implementation would radically alter the "typical" use case's performance for the worse.

In any case, as 'zfs send' works on filesets and volumes, and ZFS maintains DDT information at the pool level, there's no way to share an existing whole DDT between two systems (and, given the potential size of a pool-level DDT, that's a bad idea anyway). I see no ability to optimize the 'zfs send/receive' concept beyond what is currently done.

-Erik
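Scenario #2 above can be sketched in a few lines (purely illustrative: sha256 stands in for the ZFS block checksum, the "stream" is just a list of blocks, and none of this is real 'zfs send' machinery):

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # stand-in for the per-block checksum ZFS already computes
    return hashlib.sha256(block).digest()

def build_mini_ddt(blocks):
    """Sender pass 1: hash every block in the would-be stream."""
    return {checksum(b) for b in blocks}

def receiver_diff(mini_ddt, receiver_ddt):
    """Receiver: report the checksums it already holds (not needed)."""
    return mini_ddt & receiver_ddt

def strip_stream(blocks, not_needed):
    """Sender pass 2: re-walk the blocks, dropping ones the receiver has.

    This second pass is exactly the 'double-processing' objection above:
    the sender can't emit anything until the round-trip completes.
    """
    return [b for b in blocks if checksum(b) not in not_needed]

sender_blocks = [b"block-A", b"block-B", b"block-C"]
receiver_ddt = build_mini_ddt([b"block-B", b"block-Z"])
skip = receiver_diff(build_mini_ddt(sender_blocks), receiver_ddt)
print(strip_stream(sender_blocks, skip))  # only block-A and block-C get sent
```

Even in this toy form, the costs Erik lists are visible: a full extra hashing pass, a synchronous exchange sized by the whole stream's checksum table, and a second walk over the data before the first byte goes out.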
Re: [zfs-discuss] questions about the DDT and other things
On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
> Thanks for your answers!
>
> On 2 dec 2011, at 02:54, Erik Trimble wrote:
>> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>>> 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out on L2ARC, there still is some 170 bytes in the ARC to reference them (or rather the ZAP objects, I believe). In some places it sounds like those 170 bytes refer to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170-byte reference in the ARC. What is the story here really?
>>
>> Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation.
>
> Yes, it is a bad situation. But how many DDT entries can there be in each ZAP object? Some have suggested a 1:1 relationship; others have suggested that it isn't.

I'm pretty sure it's NOT 1:1, but I'd have to go look at the code. In any case, it's not a very big number, so you're still looking at the same O(n) as the number of DDT entries (n).

>>> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already has a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block-reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy!
>>
>> Remember that, when using dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block's usage. Dedup and snapshot deletion share the same problem; it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated.
>
> What is it that must be updated in the DDT entries - a ref count? And how does that differ from the snapshot case, which seems like a very similar mechanism?

It is similar to the snapshot case, in that the block itself has a reference count in its structure (for use in both dedup and snapshots) that would get updated upon delete, but you also have to consider that the DDT entry itself, which is a separate structure from the block structure, also has to be updated. That's a whole extra IOP to get at that additional structure. So, more or less, a dedup delete has to do two operations for every one that a snapshot delete does.

>> Plus, the problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPS limits are going to get hammered hard.
>
> Indeed!
>
>>> 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is an AVL tree, couldn't just that entire tree be cached on, for example, an SSD and be searched there without necessarily having to store anything of it in RAM?
>>
>> L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists.
>
> Well, it rather seems to be ZAP objects, referenced from the ARC, which happen to contain DDT entries, that are in the L2ARC. I mean that you could just move the entire AVL tree onto the SSD, completely outside of zfs if you will, and have it be searched there, not dependent on what is in RAM at all. Every DDT lookup would take up to [tree depth] number of reads, but that could be OK if you have an SSD which is fast on reading (which many are).

ZFS currently treats all metadata (of which DDT entries are a part) and data slabs the same when it comes to choosing to migrate them from ARC to L2ARC, so the most-frequently-accessed info is in the ARC (regardless of what that info is), and everything else sits in the L2ARC.
Re: [zfs-discuss] questions about the DDT and other things
On 12/1/2011 4:59 PM, Ragnar Sundblad wrote: I am sorry if these are dumb questions. If there are explanations available somewhere for those questions that I just haven't found, please let me know! :-) 1. It has been said that when the DDT entries, some 376 bytes or so, are rolled out on L2ARC, there still is some 170 bytes in the ARC to reference them (or rather the ZAP objects I believe). In some places it sounds like those 170 bytes refers to ZAP objects that contain several DDT entries. In other cases it sounds like for each DDT entry in the L2ARC there must be one 170 byte reference in the ARC. What is the story here really? Yup. Each entry (not just a DDT entry, but any cached reference) in the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC also consume ARC space. It's a bad situation. 2. Deletion with dedup enabled is a lot heavier for some reason that I don't understand. It is said that the DDT entries have to be updated for each deleted reference to that block. Since zfs already have a mechanism for sharing blocks (for example with snapshots), I don't understand why the DDT has to contain any more block references at all, or why deletion should be much harder just because there are checksums (DDT entries) tied to those blocks, and even if they have to, why it would be much harder than the other block reference mechanism. If anyone could explain this (or give me a pointer to an explanation), I'd be very happy! Remember that, when using Dedup, each block can potentially be part of a very large number of files. So, when you delete a file, you have to go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT updates. It's essentially the same problem that erasing snapshots has - for each block you delete, you have to find and update the metadata for all the other files that share that block usage. 
Dedup and snapshot deletion share the same problem, it's just usually worse for dedup, since there's a much larger number of blocks that have to be updated. The problem is that you really need to have the entire DDT in some form of high-speed random-access memory in order for things to be efficient. If you have to search the entire hard drive to get the proper DDT entry every time you delete a block, then your IOPs limits are going to get hammered hard. 3. I, as many others, would of course like to be able to have very large datasets deduped without having to have enormous amounts of RAM. Since the DDT is a AVL tree, couldn't just that entire tree be cached on for example a SSD and be searched there without necessarily having to store anything of it in RAM? That would probably require some changes to the DDT lookup code, and some mechanism to gather the tree to be able to lift it over to the SSD cache, and some other stuff, but still that sounds - with my very basic (non-)understanding of zfs - like a not to overwhelming change. L2ARC typically sits on an SSD, and the DDT is usually held there, if the L2ARC device exists. There does need to be serious work on changing how the DDT in the L2ARC is referenced, however; the ARC memory requirements for DDT-in-L2ARC definitely need to be removed (which requires a non-trivial rearchitecting of dedup). There are some other changes that have to happen for Dedup to be really usable. Unfortunately, I can't see anyone around willing to do those changes, and my understanding of the code says that it is much more likely that we will simply remove and replace the entire dedup feature rather than trying to fix the existing design. 4. Now and then people mention that the problem with bp_rewrite has been explained, on this very mailing list I believe, but I haven't found that explanation. Could someone please give me a pointer to that description (or perhaps explain it again :-) )? Thanks for any enlightenment! 
/ragge bp_rewrite refers to the (as yet unimplemented) system call of the same name, which does Block Pointer re-writing. That is, it would allow ZFS to change the physical location on media of an existing ZFS data slab. In other words, bp_rewrite is necessary to allow ZFS to change the Physical layout of data on media without changing the Conceptual arrangement of that data. It's been the #1 most-wanted feature of ZFS for as long as I can remember, probably 10 years now. -Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
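To put the ARC-pointer overhead Erik describes into numbers, here is a back-of-envelope sketch. The per-entry byte counts (~376 bytes per DDT entry, ~170 bytes of ARC pointer per L2ARC-resident entry) are the figures quoted in the question; the pool size and record size are assumptions chosen purely for illustration:

```shell
# Rough DDT memory sizing, using the per-entry figures quoted in the
# thread (~376 bytes per DDT entry in L2ARC, ~170 bytes of ARC pointer
# per L2ARC-resident entry). Pool and record sizes are assumptions.
POOL_BYTES=$((1024 * 1024 * 1024 * 1024))   # 1 TB of unique data
RECORD=$((128 * 1024))                       # 128K default recordsize
ENTRIES=$((POOL_BYTES / RECORD))             # one DDT entry per block
L2ARC_DDT_MB=$((ENTRIES * 376 / 1024 / 1024))
ARC_PTR_MB=$((ENTRIES * 170 / 1024 / 1024))
echo "DDT entries:              $ENTRIES"
echo "L2ARC consumed by DDT:    ${L2ARC_DDT_MB} MB"
echo "ARC consumed by pointers: ${ARC_PTR_MB} MB"
```

Shrink the record size to 8K and the entry count grows 16-fold - roughly 47 GB of L2ARC and 21 GB of ARC for the same 1 TB of unique data - which is why dedup of small-record datasets is so punishing.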
Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
It occurs to me that your filesystems may not be in the same state. That is, destroy both pools, recreate them, and run the tests. This will eliminate any possibility of allocation issues. -Erik On 10/27/2011 10:37 AM, weiliam.hong wrote: Hi, Thanks for the replies. In the beginning, I only had SAS drives installed when I observed the behavior; the SATA drives were added later for comparison and troubleshooting. The slow behavior is observed only after 10-15 mins of running dd, where the file size is about 15GB; the throughput drops suddenly from 70 to 50 to 20 to <10MB/s in a matter of seconds and never recovers. This couldn't be right no matter how I look at it. Regards, WL On 10/27/2011 9:59 PM, Brian Wilson wrote: On 10/27/11 07:03 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of weiliam.hong 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas driver is used. Are SAS and SATA drives handled differently? If they're all on the same HBA, they may be all on the same bus. It may be *because* you're mixing SATA and SAS disks on the same bus. I'll suggest separating the tests - don't run them concurrently - and see if there's any difference. Also, the HBA might have different defaults for SAS vs SATA; look in the HBA to see if the write-back / write-through settings are the same... I don't know if the HBA gives you some way to enable/disable the on-disk cache, but take a look and see. Also, maybe the SAS disks are only doing SATA. If the HBA is only able to do SATA, then SAS disks will work, but might not work as optimally as they would if they were connected to a real SAS HBA. And one final thing - if you're planning to run ZFS (as I suspect you are, posting on this list running OI) ... it actually works *better* without any HBA.* *Footnote: ZFS works the worst if you have ZIL enabled, no log device, and no HBA. 
It's a significant improvement if you add a battery-backed or nonvolatile HBA with writeback. It's a significant improvement again if you get rid of the HBA and add a log device. It's a significant improvement yet again if you get rid of the HBA and log device, and run with ZIL disabled (if your workload is compatible with a disabled ZIL). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss First, ditto everything Edward says above. I'd add that your "dd" test creates a lot of straight sequential IO, not anything that's likely to be random IO. I can't speak to why your SAS might not be performing any better than Edward did, but your SATAs are probably screaming on straight sequential IO, where on something more random I would bet they won't perform as well as they do in this test. The tool I've seen used for that sort of testing is iozone - I'm sure there are others as well, and I can't attest to what's better or worse. cheers, Brian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
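For the random-vs-sequential comparison Brian suggests, a minimal iozone sketch follows. The mount point is hypothetical, the file sizes are arbitrary, and the guard makes the script a no-op on systems without iozone installed:

```shell
# Compare sequential vs. random I/O on the pool under test (sketch).
# -i 0 = write/rewrite, -i 1 = sequential read, -i 2 = random read/write
# (-i 0 must always be included so the test file exists); -O reports
# results in operations/sec rather than KB/sec.
command -v iozone >/dev/null 2>&1 || { echo "iozone not installed"; exit 0; }
cd /tank/testfs || exit 1                # hypothetical test filesystem
iozone -i 0 -i 1 -r 128k -s 2g           # sequential, 128K records
iozone -i 0 -i 2 -r 8k -s 2g -O          # random 8K, reported in ops/sec
```

Comparing the two runs separates streaming bandwidth (where the SATA drives look heroic) from random IOPS (where they usually don't).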
Re: [zfs-discuss] Thumper (X4500), and CF SSD for L2ARC = ?
On 10/14/2011 5:49 AM, Darren J Moffat wrote: On 10/14/11 13:39, Jim Klimov wrote: Hello, I was asked if the CF port in Thumpers can be accessed by the OS? In particular, would it be a good idea to use a modern 600x CF card (some reliable one intended for professional photography) as an L2ARC device using this port? I don't know about the Thumpers internal CF slot. I can say I have tried using a fast (at the time, this was about 3 years ago) CF card via a CF to IDE adaptor before and it turned out to be a really bad idea because the spinning rust disk (which was SATA) was actually faster to access. Same went for USB to CF adaptors at the time too. Last I'd checked, the CF port was fully functional. However, I'd not use it as L2ARC (and, certainly not ZIL). CF is not good in terms of either random write or read - professional-grade CF cards are optimized for STREAMING write - that is, the ability to write a big-ass JPG or BMP or TIFF as quickly as possible. The CF controller isn't good on lots of little read/write ops. In Casper's case, the CF->IDE adapter makes this even worse, since IDE is spectacularly bad at IOPS. I can't remember - does the X4500 have any extra SATA ports free on the motherboard? And, does it have any extra HD power connectors? http://www.amazon.com/dp/B002MWDRD6/ref=asc_df_B002MWDRD61280186?smid=A2YLYLTN75J8LR&tag=shopzilla_mp_1382-20&linkCode=asn&creative=395105&creativeASIN=B002MWDRD6 Is a great way to add a 2.5" drive slot, but it's just a physical slot adapter - you need to attach a standard SATA cable and HD power connector to it. If that's not an option, find yourself a cheap PCI-E adapter with eSATA ports on it, then use an external HD enclosure with eSATA for a small SSD. As a last resort, remove one of the 3.5" SATA drives, and put in an SSD in a 2.5"->3.5" converter enclosure. Remember, you can generally get by fine with a lower-end SSD as L2ARC, so a 60GB SSD should be $100 or less. 
-Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] All (pure) SSD pool rehash
On 9/27/2011 10:39 AM, Bob Friesenhahn wrote: On Tue, 27 Sep 2011, Matt Banks wrote: Also, maybe I read it wrong, but why is it that (in the previous thread about hw raid and zpools) zpools with large numbers of physical drives (eg 20+) were frowned upon? I know that ZFS!=WAFL There is no concern with a large number of physical drives in a pool. The primary concern is with the number of drives per vdev. Any variation in the latency of the drives hinders performance, and each I/O to a vdev consumes 1 "IOP" across all of the drives in the vdev (or stripe) when raidzN is used. Having more vdevs is better for consistent performance and more available IOPS. Bob To expound just a bit on Bob's reply: the reason that large numbers of disks in a RAIDZ* vdev are frowned upon has to do with the fact that IOPS for a RAIDZ vdev are pretty much O(C) - constant - regardless of how many disks are in the actual vdev. So, the IOPS throughput of a 20-disk vdev is the same as that of a 5-disk vdev. Streaming throughput is significantly higher (i.e. it scales as O(N)), but you're unlikely to get that for the vast majority of workloads. Given that resilvering a RAIDZ* is IOPS-bound, you quickly run into the situation where the time to resilver X amount of data on a 5-drive RAIDZ is the same as on a 30-drive RAIDZ. Given that you're highly likely to store much more data on a larger vdev, your resilver time to replace a drive goes up linearly with the number of drives in a RAIDZ vdev. 
This leads to this situation: if I have 20 x 1TB drives, here are several possible configurations and the relative resilver times (relative, because without knowing the exact configuration of the data itself, I can't estimate wall-clock resilver times):

(a) 5 x 4-disk RAIDZ: 15TB usable, takes N amount of time to replace a failed disk
(b) 4 x 5-disk RAIDZ: 16TB usable, takes 1.25N time to replace a disk
(c) 2 x 10-disk RAIDZ: 18TB usable, takes 2.5N time to replace a disk
(d) 1 x 20-disk RAIDZ: 19TB usable, takes 5N time to replace a disk

Notice that by doubling the number of drives in a RAIDZ, you double the resilver time for the same amount of data in the zpool. The above also applies to RAIDZ[23], as the additional parity disk doesn't materially impact resilver times in either direction (and, yes, it's not really a "parity disk"; I'm just being sloppy). Also, the other main reason is that larger numbers of drives in a single vdev mean there is a higher probability that multiple disk failures will result in loss of data. Richard Elling had some data on the exact calculations, but it boils down to the fact that your chance of total data loss from multiple drive failures goes up MORE THAN LINEARLY as you add drives to a vdev. Thus, a 1 x 10-disk RAIDZ has well over 2x the chance of failure that a 2 x 5-disk RAIDZ zpool has. -Erik ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
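Erik's relative resilver times follow directly from the IOPS-bound argument: with the total data fixed, the data per vdev (and hence resilver time) scales with the inverse of the vdev count. A sketch of the arithmetic for 20 x 1TB drives, taking the 5 x 4-disk layout as the baseline N:

```shell
# Relative resilver times for 20 x 1TB drives split into single-parity
# RAIDZ vdevs of width W, reproducing the (a)-(d) figures above.
# Resilver time is proportional to data per vdev, i.e. 1/(vdev count).
DRIVES=20
for W in 4 5 10 20; do
    VDEVS=$((DRIVES / W))
    USABLE=$((VDEVS * (W - 1)))        # TB usable: one parity disk/vdev
    REL=$((5 * 100 / VDEVS))           # resilver vs. 5-vdev baseline, x100
    echo "${VDEVS} x ${W}-disk raidz: ${USABLE} TB usable, resilver ~$((REL / 100)).$((REL % 100))N"
done
```

The usable-capacity gain from wider vdevs is marginal (15 to 19 TB) while the resilver penalty is 5x, which is the whole argument in two columns.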
[zfs-discuss] I'm back!
Hi folks. I'm now no longer at Oracle, and the past couple of weeks have been a bit of a mess for me as I disentangle myself from it. I apologize to those who may have tried to contact me during August, as my @oracle.com email is no longer being read by myself, and I didn't have a lot of extra time to devote to things like making sure my email subscription lists pointed to my personal email. I've done that now. I now have a free(er) hand to do some work in IllumOS (hopefully, in ZFS in particular), so I'm looking forward to getting back into the swing of things. And, hopefully, not be too much of a PITA. :-) -Erik Trimble tr...@netdemons.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NexentaCore 3.1 - ZFS V. 28
On 7/31/2011 4:29 AM, Eugen Leitl wrote: On Sat, Jul 30, 2011 at 12:56:38PM +0200, Eugen Leitl wrote: apt-get update apt-clone upgrade Any first impressions? I finally came around to installing NexentaCore 3.1 along with napp-it and AMP on an HP N36L with 8 GBytes RAM. I'm testing it with 4x 1 and 1.5 TByte consumer SATA drives (Seagate) with raidz2 and raidz3 and like what I see so far. Given http://opensolaris.org/jive/thread.jspa?threadID=139315 I've ordered an Intel 311 series for ZIL/L2ARC. I hope to use the above with 4x 3 TByte Hitachi Deskstar 5K3000 HDS5C3030ALA630 drives, given the data from Backblaze in regard to their reliability. For the above layout (8 GByte RAM, 4x 3 TByte as raidz2), I should go with 4 GByte for slog and 16 GByte for L2ARC, right? Is it possible to attach slog/L2ARC to a pool after the fact? I'd rather not wear out the small SSD with ~5 TByte of avoidable writes. Yes. You can attach a ZIL or L2ARC device anytime after the pool is created. Also, I think you want an Intel 320, NOT the 311, for use as a ZIL. The 320 includes capacitors, so if you lose power, your ZIL doesn't lose data. The 311 DOESN'T include capacitors. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
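For reference, attaching the devices after the fact is one command each. A sketch with hypothetical pool and slice names - here the SSD is assumed to be partitioned into a 4 GByte slice for the slog and a 16 GByte slice for L2ARC, matching the layout Eugen describes:

```shell
# Hypothetical names: pool "tank", SSD slices c2t1d0s0 (4 GB slog)
# and c2t1d0s1 (16 GB L2ARC). Verify your device names before running.
zpool add tank log c2t1d0s0      # dedicated ZIL (slog)
zpool add tank cache c2t1d0s1    # L2ARC
zpool status tank                # both should now appear in the layout
```

The cache device can later be dropped with zpool remove; log device removal requires pool version 19 or later, which NexentaCore 3.1's v28 pools satisfy.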
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 8:03 AM, Orvar Korvar wrote: "There is at least a common perception (misperception?) that devices cannot process TRIM requests while they are 100% busy processing other tasks." Just to confirm; SSD disks can do TRIM while processing other tasks? I heard that Illumos is working on TRIM support for ZFS and will release something soon. Anyone knows more? SSDs do Garbage Collection when the controller has spare cycles. I'm not certain if there is a time factor (i.e. is it periodic, or just when there's time in the controller's queue). So, theoretically, TRIM helps GC when the drive is at low utilization, but not when the SSD is under significant load. Under high load, the SSD doesn't have the luxury of searching the NAND for "unused" blocks, aggregating them, writing them to a new page, and then erasing the old location. It has to allocate stuff NOW, so it goes right to the dreaded read-modify-erase-write cycle. Even under high load, the SSD can "process" the TRIM request (i.e. it will mark a block as unused), but that's not useful until a GC is performed (unless you are so lucky as to mark an *entire* page as unused), so, it doesn't really matter. The GC run is what "fixes" the NAND allocation, not the TRIM command itself. I can't speak for the ZFS developers as to TRIM support. I *believe* this would have to happen both at the device level and the filesystem level. But, I certainly could be wrong. (IllumOS currently supports TRIM in the SATA framework - not sure about the SAS framework) -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 4:49 AM, joerg.schill...@fokus.fraunhofer.de wrote: Erik Trimble wrote: On 7/25/2011 3:32 AM, Orvar Korvar wrote: How long have you been using a SSD? Do you see any performance decrease? I mean, ZFS does not support TRIM, so I wonder about long term effects... Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no impact whatsoever. TRIM is primarily useful for low-volume changes - that is, for a filesystem that generally has few deletes over time (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device puts a very high write load on the device (even as an L2ARC, there is a considerably higher write load than a "typical" filesystem use). SSDs in such a configuration can't really make use of TRIM, and depend on the internal SSD controller block re-allocation algorithms to improve block layout. Now, if you're using the SSD as primary media (i.e. in place of a Hard Drive), there is a possibility that TRIM could help. I honestly can't be sure that it would help, however, as ZFS's Copy-on-Write nature means that it tends to write entire pages of blocks, rather than just small blocks. Which is fine from the SSD's standpoint. Writing to an SSD is: clear + write + verify. As the SSD cannot know that the rewritten blocks have been unused for a while, it cannot handle the clear operation at a time when there is no interest in the block; the TRIM command is needed to give this knowledge to the SSD. Jörg Except that in many cases with ZFS, that data is irrelevant by the time it can be used, or is much less useful than with other filesystems. Copy-on-Write tends to end up with whole SSD pages of blocks being rendered "unused", rather than individual blocks inside pages. So, the SSD can often avoid the read-erase-modify-write cycle, and just do erase-write instead. TRIM *might* help somewhat when you have a relatively quiet ZFS filesystem, but I'm not really convinced of how much of a benefit it would be. 
As I've mentioned in other posts, ZIL and L2ARC are too "hot" for TRIM to have any noticeable impact - the SSD is constantly being used, and has no time for GC. It's stuck in the read-erase-modify-write cycle even with TRIM. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 4:28 AM, Tomas Ögren wrote: On 25 July, 2011 - Erik Trimble sent me these 2,0K bytes: On 7/25/2011 3:32 AM, Orvar Korvar wrote: How long have you been using a SSD? Do you see any performance decrease? I mean, ZFS does not support TRIM, so I wonder about long term effects... Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no impact whatsoever. TRIM is primarily useful for low-volume changes - that is, for a filesystem that generally has few deletes over time (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device puts a very high write load on the device (even as an L2ARC, there is a considerably higher write load than a "typical" filesystem use). SSDs in such a configuration can't really make use of TRIM, and depend on the internal SSD controller block re-allocation algorithms to improve block layout. Now, if you're using the SSD as primary media (i.e. in place of a Hard Drive), there is a possibility that TRIM could help. I honestly can't be sure that it would help, however, as ZFS's Copy-on-Write nature means that it tends to write entire pages of blocks, rather than just small blocks. Which is fine from the SSD's standpoint. You still need the flash erase cycle. On a related note: I've been using an OCZ Vertex 2 as my primary drive in a laptop, which runs Windows XP (no TRIM support). I haven't noticed any dropoff in performance in the year it's been in service. I'm doing typical productivity laptop-ish things (no compiling, etc.), so it appears that the internal SSD controller is more than smart enough to compensate even without TRIM. Honestly, I think TRIM isn't really useful for anyone. It took too long to get pushed out to the OSes, and the SSD vendors seem to have just compensated by making a smarter controller able to do better reallocation. Which, to me, is the better idea, in any case. Bullshit. I just got an OCZ Vertex 3, and the first fill was 450-500MB/s. Second and subsequent fills are at half that speed. 
I'm quite confident that it's due to the flash erase cycle that's needed, and if stuff can be TRIM:ed (and thus flash erased as well), speed would be regained. Overwriting a previously used block requires a flash erase, and if that can be done in the background when the timing is not critical, instead of just before you can actually write the block you want, performance will increase. /Tomas I should have been more clear: I consider the "native" speed of an SSD to be that which is obtained AFTER you've filled the entire drive once. That is, once you've blown through the extra reserve NAND, and are now into the full read/erase/write cycle. IMHO, that's really what the sustained performance of an SSD is, not the bogus numbers reported by vendors. TRIM is really only useful for drives which have a low enough load factor to do background GC on unused blocks. For ZFS, that *might* be the case when the SSD is used as primary backing store, but certainly isn't the case when it's used as ZIL or L2ARC. Even with TRIM, performance after a complete fill of the SSD will drop noticeably, as the SSD has to do GC sometime. You might not notice it right away given your usage pattern, but, with OR without TRIM, a "used" SSD under load will perform the same. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 6:43 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Erik Trimble Honestly, I think TRIM isn't really useful for anyone. I'm going to have to disagree. There are only two times when TRIM isn't useful: 1) Your demand of the system is consistently so low that it never adds up to anything meaningful... Basically you always have free unused blocks so adding more unused blocks to the pile doesn't matter at all, or you never bother to delete anything... Or it's just a lightweight server processing requests where network latency greatly outweighs any disk latency, etc. AKA your demand is very low. or 2) Your demand of the system is consistently so high that even with TRIM, the device would never be able to find any idle time to perform an erase cycle on blocks marked for TRIM. In case #2, it is at least theoretically possible for devices to become smart enough to process the TRIM block erasures in parallel even while there are other operations taking place simultaneously. I don't know if device mfgrs implement things that way today. There is at least a common perception (misperception?) that devices cannot process TRIM requests while they are 100% busy processing other tasks. Or your disk is always 100% full. I guess that makes 3 cases, but the 3rd one is esoteric. What I'm saying is that #2 occurs all the time with ZFS, at least as a ZIL or L2ARC. TRIM is really only useful when the SSD has some "downtime" to work. As a ZIL or L2ARC, the SSD *has* no pauses, and can't do GC in the background usefully (which is what TRIM helps). Instead, what I've seen is that the increased "smarts" of the new generation SSD controllers do a better job of on-the-fly reallocation. 
-- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On 7/25/2011 3:32 AM, Orvar Korvar wrote: How long have you been using a SSD? Do you see any performance decrease? I mean, ZFS does not support TRIM, so I wonder about long term effects... Frankly, for the kind of use that ZFS puts on a SSD, TRIM makes no impact whatsoever. TRIM is primarily useful for low-volume changes - that is, for a filesystem that generally has few deletes over time (i.e. rate of change is low). Using a SSD as a ZIL or L2ARC device puts a very high write load on the device (even as an L2ARC, there is a considerably higher write load than a "typical" filesystem use). SSDs in such a configuration can't really make use of TRIM, and depend on the internal SSD controller block re-allocation algorithms to improve block layout. Now, if you're using the SSD as primary media (i.e. in place of a Hard Drive), there is a possibility that TRIM could help. I honestly can't be sure that it would help, however, as ZFS's Copy-on-Write nature means that it tends to write entire pages of blocks, rather than just small blocks. Which is fine from the SSD's standpoint. On a related note: I've been using an OCZ Vertex 2 as my primary drive in a laptop, which runs Windows XP (no TRIM support). I haven't noticed any dropoff in performance in the year it's been in service. I'm doing typical productivity laptop-ish things (no compiling, etc.), so it appears that the internal SSD controller is more than smart enough to compensate even without TRIM. Honestly, I think TRIM isn't really useful for anyone. It took too long to get pushed out to the OSes, and the SSD vendors seem to have just compensated by making a smarter controller able to do better reallocation. Which, to me, is the better idea, in any case. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pure SSD Pool
FYI - virtually all non-super-low-end SSDs are already significantly over-provisioned, for GC and scratch use inside the controller. In fact, the only difference between the OCZ "extended" models and the non-extended models (e.g. Vertex 2 50G (OCZSSD2-2VTX50G) and Vertex 2 Extended 60G (OCZSSD2-2VTXE60G)) is the amount of extra flash dedicated to scratch. Both the aforementioned drives have 64G of flash chips - it's just that the 50G one uses significantly more for scratch, and thus will perform better under heavy use. Over-provisioning at the filesystem level is unlikely to significantly improve things, as the SSD controller generally only uses what it considers "scratch" as such - that is, while not using 10G at the filesystem level might seem useful, my understanding of SSD controller usage patterns is that this generally isn't that much of a performance gain. E.g. you'd be better off buying the 50G Vertex 2 and fully using it than the 60G model and only using 50G on it. -Erik On Tue, 2011-07-12 at 10:10 -0700, Henry Lau wrote: > It is hard to say, 90% or 80%. SSD has already reserved overprovisioning > places > for garbage collection and wear leveling. The OS level only knows file LBA, > not > the physical LBA mapping to flash pages/block. Uberblock updates and COW from > ZFS will use a new page/block each time. A TRIM command from ZFS level should > be > a better solution but RAID is still a problem for TRIM at the OS level. > > Henry > > > > - Original Message From: Jim Klimov > Cc: ZFS Discussions > Sent: Tue, July 12, 2011 4:18:28 AM > Subject: Re: [zfs-discuss] Pure SSD Pool > > 2011-07-12 9:06, Brandon High wrote: > > On Mon, Jul 11, 2011 at 7:03 AM, Eric Sproul wrote: > >> Interesting-- what is the suspected impact of not having TRIM support? > > There shouldn't be much, since zfs isn't changing data in place. 
Any > > drive with reasonable garbage collection (which is pretty much > > everything these days) should be fine until the volume gets very full. > > I wonder if in this case it would be beneficial to slice i.e. 90% > of an SSD for use in ZFS pool(s) and leave the rest of the > disk unassigned to any partition or slice? This would reserve > some sectors as never-written-to-by-OS. Would this ease the > life for SSD devices without TRIM between them and the OS? > > Curious, > //Jim > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Erik Trimble Java Platform Group - Infrastructure Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
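The spare-area difference Erik describes for the two Vertex 2 models can be roughed out numerically - both carry 64G of raw flash and differ only in exposed capacity (integer GB figures here, ignoring the GB/GiB marketing wrinkle):

```shell
# Approximate over-provisioning for the two OCZ Vertex 2 models named
# above: same 64G of raw flash, different exposed capacity.
RAW=64
for EXPOSED in 50 60; do
    SPARE=$((RAW - EXPOSED))
    PCT=$((SPARE * 100 / RAW))
    echo "${EXPOSED}G model: ${SPARE} GB reserve (~${PCT}% of flash)"
done
```

Which is why the 50G model holds up better under sustained load: its controller has roughly 3.5x the scratch area available for garbage collection.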
Re: [zfs-discuss] Move rpool from external hd to internal hd
On 6/29/2011 12:51 AM, Stephan Budach wrote: Hi, what are the steps necessary to move the OS rpool from an external USB drive to an internal drive? I thought about adding the internal hd as a mirror to the rpool and then detaching the USB drive, but I am unsure if I'll have to mess with Grub as well. Cheers, budy -- Yup. Look at the 'installgrub' man page. It's rather straightforward, particularly since you're already booting off the USB drive. Remember that the boot drive can't be the whole disk - you have to partition it, and then mirror onto the partition. E.g. if your internal drive is c1t0d0, you'll have to use c1t0d0s0. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
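The full sequence Erik describes, sketched with hypothetical device names (USB rpool on c3t0d0s0, internal disk already carrying an SMI label with a slice 0):

```shell
# Mirror the rpool onto the internal slice, wait for the resilver,
# make the new disk bootable, then drop the USB side.
# Device names here are examples only - substitute your own.
zpool attach rpool c3t0d0s0 c1t0d0s0
zpool status rpool          # repeat until the resilver shows complete
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0
zpool detach rpool c3t0d0s0
```

After the detach, point the BIOS boot order at the internal disk before removing the USB drive.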
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 6/27/2011 1:13 PM, David Magda wrote: On Mon, June 27, 2011 15:24, Erik Trimble wrote: [...] I'm always kind of surprised that there hasn't been a movement to create standardized crypto commands, like the various FP-specific commands that are part of MMX/SSE/etc. That way, most of this could be done in hardware seamlessly. The (Ultra)SPARC T-series processors do, but to a certain extent it goes against a CPU manufacturer's best (financial) interest to provide this: crypto is very CPU intensive using 'regular' instructions, so if you need to do a lot of it, it would force you to purchase a manufacturer's top-of-the-line CPUs, and to have as many sockets as you can to handle the load (and presumably you need to do "useful" work besides just en/decrypting traffic). If you have special instructions that do the operations efficiently, it means that you're not chewing up cycles as much, so a less powerful (and cheaper) processor can be purchased. I'm sure all the Web 2.0 companies would love to have these (and OpenSSL linked to use the instructions), so they could simply enable HTTPS for everything. (Of course it'd also be helpful for data-at-rest, on-disk encryption as well.) The last benchmarks I saw indicated that the SPARC T-series could do 45 Gb/s AES or some such, with gobs of RSA operations as well. The T-series crypto isn't what I'm thinking of. AFAIK, you still need to use the Crypto framework in Solaris to access the on-chip functionality. Which makes the T-series no different than CPUs without a crypto module but with a crypto add-in board instead. What I'm thinking of is something along the lines of what AMD proposed a while ago, in combination with how we used to handle hardware that had FP optional. That is, you continue to make CPUs without any crypto functionality, EXCEPT that they support certain extensions a la MMX. If no Crypto accelerator was available, the CPU would trap any Crypto calls, and force them to be done in software. 
You could then stick a crypto accelerator in a second CPU socket, and the CPU would recognize that it was there, and pipe crypto calls to the dedicated co-processor. Think about how things were done with the i386 and i387. That's what I'm after. With modern CPU buses like those AMD & Intel support, plopping a "co-processor" into another CPU socket would really, really help. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 6/27/2011 9:55 AM, Roberto Waltman wrote: I recently bought an HP Proliant Microserver for a home file server. (pics and more here: http://arstechnica.com/civis/viewtopic.php?p=20968192 ) I installed 5 1.5TB (5900 RPM) drives, upgraded the memory to 8GB, and installed Solaris 11 Express without a hitch. A few simple tests using "dd" with 1gb and 2gb files showed excellent transfer rates: ~200 MB/sec on a 5 drive raidz2 pool, ~310 MB/sec on a five drive pool with no redundancy. That is, until I enabled encryption, which brought the transfer rates down to around 20 MB/sec... Obviously the CPU is the bottleneck here, and I'm wondering what to do next. I can split the storage into file systems with and without encryption and allocate data accordingly. No need, for example, to encrypt open source code, or music. But I would like to have everything encrypted by default. My concern is not industrial espionage from a hacker in Belarus, but having a disk fail and sending it for repair with my credit card statements easily readable on it, etc. I am new to (open or closed) Solaris. I found there is something called the Encryption Framework, and that there is hardware support for encryption. This server has two unused PCI-e slots, so I thought a card could be the solution, but the few I found seem to be geared to protecting SSH and VPN connections, etc., not the file system. Cost is a factor also. I could build a similar server with a much faster processor for a few hundred dollars more, so a $1000 card for a < $1000 file server is not a reasonable option. Is there anything out there I could use? Thanks, Roberto Waltman You're out of luck. The hardware-encryption device is seen as a small market by the vendors, and they price accordingly. All the solutions are FIPS-compliant, which makes them non-trivially expensive to build/test/verify. I have yet to see the "basic" crypto accelerator - which should be as simple as an FPGA with downloadable (and updateable) firmware. 
The other major cost point is the crypto plugins - sadly, there is no way to simply have the CPU farm off crypto jobs to a co-processor. That is, there's no way for the CPU to go "oh, that looks like I'm running a crypto algorithm - I should hand it over to the crypto co-processor". Instead, you have to write custom plugin/drivers/libraries for each OS, and pray that each OS has some standardized crypto framework. Otherwise, you have to link apps with custom libraries. I'm always kind of surprised that there hasn't been a movement to create standardized crypto commands, like the various FP-specific commands that are part of MMX/SSE/etc. That way, most of this could be done in hardware seamlessly. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
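On Roberto's plan to encrypt by default with carve-outs: on Solaris 11 Express this maps naturally onto ZFS dataset inheritance, since the encryption property is set at creation time and inherited by children. A sketch with hypothetical pool and dataset names:

```shell
# Hypothetical pool/dataset names. "encryption" can only be set at
# dataset creation and is inherited, so everything created under
# "secure" is encrypted by default; "media" stays cleartext/full-speed.
zfs create -o encryption=on tank/secure   # prompts for a passphrase
zfs create tank/secure/documents          # inherits encryption=on
zfs create tank/media                     # unencrypted, full throughput
zfs get -r encryption tank                # verify what got which setting
```

This keeps credit-card statements and the like under the encrypted subtree while music and source code avoid the CPU penalty entirely.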
Re: [zfs-discuss] # disks per vdev
On 6/17/2011 6:52 AM, Marty Scholes wrote: Lights. Good. Agreed. In a fit of desperation and stupidity I once enumerated disks by pulling them one by one from the array to see which zfs device faulted. On a busy array it is hard even to use the leds as indicators. It makes me wonder how large shops with thousands of spindles handle this. We pay for the brand-name disk enclosures or servers where the fault-management stuff is supported by Solaris. Including the blinky lights. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] # disks per vdev
On 6/17/2011 12:55 AM, Lanky Doodle wrote: Thanks Richard. How does ZFS enumerate the disks? In terms of listing them does it do them logically, i.e.: controller #1 (motherboard) | |--- disk1 |--- disk2 controller #3 |--- disk3 |--- disk4 |--- disk5 |--- disk6 |--- disk7 |--- disk8 |--- disk9 |--- disk10 controller #4 |--- disk11 |--- disk12 |--- disk13 |--- disk14 |--- disk15 |--- disk16 |--- disk17 |--- disk18 or is it completely random, leaving me with some trial and error to work out what disk is on what port? This is not a ZFS issue; this is a Solaris device driver issue. Solaris uses a location-based disk naming scheme, NOT the BSD/Linux style of simply incrementing the disk numbers. I.e. drives are usually named something like c#t#d# (controller number, target ID, and disk/LUN number). In most cases, the on-board controllers receive a lower controller number than any add-in adapters, and add-in adapters are enumerated in PCI ID order. However, there is no good explanation of exactly *what* number a given controller may be assigned. After receiving a controller number, disks are enumerated in ascending order by ATA ID, SCSI ID, SAS WWN, or FC WWN. The naming rules can get a bit complex. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
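For readers unused to the c#t#d# scheme, the name itself encodes the location. A small sketch of pulling it apart (the device name is made up; on a live system you would see these names from format(1M) or under /dev/dsk):

```shell
# Split a Solaris-style c#t#d# device name into its components.
parse_ctd() {
    echo "$1" | sed -n \
        's/^c\([0-9]\{1,\}\)t\([0-9]\{1,\}\)d\([0-9]\{1,\}\)$/controller=\1 target=\2 disk=\3/p'
}

parse_ctd c3t2d0   # controller 3, target (SCSI/SAS ID) 2, disk 0
```

The point is that the name stays stable for a given physical slot, which is exactly what you want when mapping pool members back to drive bays.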
Re: [zfs-discuss] question about COW and snapshots
On 6/16/2011 1:32 PM, Paul Kraus wrote: On Thu, Jun 16, 2011 at 4:20 PM, Richard Elling wrote: You can run OpenVMS :-) Since *you* brought it up (I was not going to :-), how does VMS' versioning FS handle those issues ? It doesn't, per se. VMS's filesystem has a "versioning" concept (i.e. every time you do a close() on a file, it creates a new file with the version number appended, e.g. foo;1 and foo;2 are the same file, different versions). However, it is completely missing the rest of the features we're talking about, like data *consistency* in that file. It's still up to the app using the file to figure out what data consistency means, and such. Really, all VMS adds is versioning, nothing else (no API, no additional features, etc.). I know that SAM-FS has rules for _when_ copies of a file are made, so that intermediate states are not captured. The last time I touched SAM-FS there was _not_ a nice user interface to the previous version; you had to trudge through log files and then pull the version you wanted directly from secondary storage (but they did teach us how to do that in the SAM-FS / QFS class). I'd have to look, but I *think* there is a better way to get to the file history/version information now. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
On 6/16/2011 12:09 AM, Simon Walter wrote: On 06/16/2011 09:09 AM, Erik Trimble wrote: We had a similar discussion a couple of years ago here, under the title "A Versioning FS". Look through the archives for the full discussion. The gist is that application-level versioning (and consistency) is completely orthogonal to filesystem-level snapshots and consistency. IMHO, they should never be mixed together - there are way too many corner cases and application-specific memes for a filesystem to ever fully handle file-level versioning and *application*-level data consistency. Don't mistake one for the other, and don't try to *use* one for the other. They're completely different creatures. I guess that is true of the current FSs available. Though it would be nice to essentially have a versioning FS in the kernel rather than an application in userspace. But I digress. I'll use SVN and WebDAV. Thanks for the advice everyone. It's not really a technical problem, it's a knowledge locality problem. The *knowledge* of where to checkpoint, where to version, and what data consistency means is held at the application level, and can ONLY be known by each individual application. There's no way a filesystem (or anything like that) can make the proper decisions without the application telling it what those decisions should be. So, what would the point be in having a "smart" versioning FS, since the intelligence can't be built into the FS - it would still have to be built into each and every application. So, if your apps have to be programmed to be versioning/consistency/checkpointing-aware in any case, how would having a fancy versioning filesystem be any better than using what we do now? (i.e. svn/hg/cvs/git on top of ZFS/btrfs/et al.) ZFS at least makes significant practical advances by rolling the logical volume manager into the filesystem level, but I can't see any such advantage for a versioning FS. 
-- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
We had a similar discussion a couple of years ago here, under the title "A Versioning FS". Look through the archives for the full discussion. The gist is that application-level versioning (and consistency) is completely orthogonal to filesystem-level snapshots and consistency. IMHO, they should never be mixed together - there are way too many corner cases and application-specific memes for a filesystem to ever fully handle file-level versioning and *application*-level data consistency. Don't mistake one for the other, and don't try to *use* one for the other. They're completely different creatures. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for Linux?
On 6/14/2011 12:50 PM, Roy Sigurd Karlsbakk wrote: Are there estimates on how performant and stable would it be to run VirtualBox with a Solaris-derived NAS with dedicated hardware disks, and use that from the same desktop? I did actually suggest this as a considered variant as well ;) I am going to try and build such a VirtualBox for my ailing HomeNAS as well - so it would import that iSCSI "dcpool" and try to process its defer-free blocks. At least if the hardware box doesn't stall so that a human has to be around to go and push reset, this would be a more viable solution for my repair-reboot cycles... If you want good performance and ZFS, I'd suggest using something like OpenIndiana or Solaris 11 Express or perhaps FreeBSD for the host and VirtualBox for a Linux guest if that's needed. Doing so, you'll get good I/O performance, and you can use the operating system or distro you like for the rest of the services. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ The other option is to make sure you have a newer CPU that supports virtualized I/O. I'd have to look at the desktop CPUs, but all Intel Nehalem and later CPUs have this feature, and I'm pretty sure all AMD Magny-Cours and later CPUs do also. Without V-IO, doing anything that pounds on a disk under *any* virtualization product is sure to make you cry. -- Erik Trimble Java Platform Group Infrastructure Mailstop: usca22-317 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (UTC-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On 6/12/2011 5:08 AM, Dimitar Hadjiev wrote: I can lay them out as 4*3-disk raidz1, 3*4-disk raidz1 or a 1*12-disk raidz3 with nearly the same capacity (8-9 data disks plus parity). I see that with more vdevs the IOPS will grow - does this translate to better resilver and scrub times as well? Yes, it would translate into better resilver times, as any failure will affect only one of the vdevs, leading to a shorter parity restore time as opposed to rebuilding the whole raidz3. As for scrubbing, it would be as fast as the scrub of each vdev, since the whole pool does not have parity data to synchronize. Go look through the mail archives, and there's at least a couple of posts from me and Richard Elling (amongst others) about the workload that a resilver requires on a raidz* vdev. Essentially, "typical" usage of a vdev will result in resilver times linearly degrading with each additional DATA disk in the raidz*, as a resilver is IOPS-bound on the single replaced disk. So, a 3-disk raidz1 (2 data disks) should, on average, resilver 4.5 times faster than a 12-disk raidz3 (9 data disks). How good or bad is the expected reliability of 3*4-disk raidz1 vs 1*12-disk raidz3, so which of the tradeoffs is better - more vdevs, or more parity to survive loss of ANY 3 disks vs. the "right" 3 disks? I'd say the chances of losing a whole vdev in a 4*3 configuration equal the chances of losing 4 drives in a 1*12 raidz3 configuration - it might happen, nothing is foolproof. No, the reliability of a 1x12 raidz3 is *significantly* better than that of 4x3 raidz1 (or, frankly, ANY raidz1 configuration using 12 disks). Richard has some stats around here somewhere... basically, the math (singular, you damn Brits! 
:-) says that while a 3-disk raidz1 will certainly take less time to resilver after a loss than a 12-disk raidz3, this is more than counterbalanced by the ability of a 12-disk raidz3 to handle additional disk losses, where the 4x3 config is only *probabilistically* likely to handle a 2nd or 3rd drive failure. I'd have to re-look at the exact numbers, but I'd generally say that 2x6raidz2 vdevs would be better than either 1x12raidz3 or 4x3raidz1 (or 3x4raidz1) for a home server not looking for super-critical protection (in which case, you should be using mirrors with spares, not raidz*). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
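The linear-scaling argument above reduces to a one-line calculation, and the 4.5x figure falls straight out. This is only a model sketch, assuming (as the post does) that resilver work is IOPS-bound on the replaced disk and grows linearly with the number of data disks in the vdev:

```shell
# Expected resilver-time ratio between two raidz configs, given the
# number of DATA disks (i.e. excluding parity) in each vdev.
resilver_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1fx\n", b / a }'
}

# 3-disk raidz1 has 2 data disks; 12-disk raidz3 has 9.
resilver_ratio 2 9   # -> 4.5x
```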
Re: [zfs-discuss] SATA disk perf question
On 6/2/2011 5:12 PM, Jens Elkner wrote: On Wed, Jun 01, 2011 at 06:17:08PM -0700, Erik Trimble wrote: On Wed, 2011-06-01 at 12:54 -0400, Paul Kraus wrote: Here's how you calculate (average) how long a random IOPs takes: seek time + ((60 / RPM) / 2) A truly sequential IOPs is: (60 / RPM) / 2 For that series of drives, seek time averages 8.5ms (per Seagate). So, you get 1 Random IOPs takes [8.5ms + 4.13ms] = 12.6ms, which translates to 78 IOPS 1 Sequential IOPs takes 4.13ms, which gives 120 IOPS. Note that due to averaging, the above numbers may be slightly higher or lower for any actual workload. Nahh, shouldn't it read "numbers may be _significant_ higher or lower" ...? ;-) Regards, jel. Nope. In terms of actual, obtainable IOPS, a 7200RPM drive isn't going to be able to do more than 200 under ideal conditions, and should be able to manage 50 under anything other than the pedantically worst-case situation. That's only about a 50% deviation, not like an order of magnitude or so. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SATA disk perf question
On Wed, 2011-06-01 at 12:54 -0400, Paul Kraus wrote: > I figure this group will know better than any other I have contact > with, is 700-800 I/Ops reasonable for a 7200 RPM SATA drive (1 TB Sun > badged Seagate ST31000N in a J4400) ? I have a resilver running and am > seeing about 700-800 writes/sec. on the hot spare as it resilvers. > There is no other I/O activity on this box, as this is a remote > replication target for production data. I have the replication > disabled until the resilver completes. > > Solaris 10U9 > zpool version 22 > Server is a T2000 > Here's how you calculate (average) how long a random IOPs takes: seek time + ((60 / RPM) / 2) A truly sequential IOPs is: (60 / RPM) / 2 For that series of drives, seek time averages 8.5ms (per Seagate). So, you get 1 Random IOPs takes [8.5ms + 4.13ms] = 12.6ms, which translates to 78 IOPS 1 Sequential IOPs takes 4.13ms, which gives 120 IOPS. Note that due to averaging, the above numbers may be slightly higher or lower for any actual workload. In your case, since ZFS does write aggregation (turning multiple write requests into a single larger one), you might see what appears to be more than the above number from something like 'iostat', which is measuring not the *actual* writes to physical disk, but the *requested* write operations. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
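Erik's average-case model is easy to replay numerically. Note the half-rotation term at 7200 RPM actually works out to ~4.17ms (close to the 4.13ms quoted), and the resulting random-IOPS figure matches the post:

```shell
# Average random-IOP latency = average seek time + half a rotation,
# per the model in the post above.
random_iops() {
    awk -v seek_ms="$1" -v rpm="$2" 'BEGIN {
        rot_ms = (60 / rpm / 2) * 1000    # average rotational latency in ms
        t = seek_ms + rot_ms
        printf "%.1f ms -> %d IOPS\n", t, 1000 / t
    }'
}

random_iops 8.5 7200   # -> 12.7 ms -> 78 IOPS
```

Plugging in other drives (15k RPM SAS, with ~3.5ms seeks) shows why enterprise spindles post 3-4x the random IOPS of consumer SATA.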
Re: [zfs-discuss] Compatibility between Sun-Oracle Fishworks appliance zfs and other zfs implementations
On Thu, 2011-05-26 at 09:36 -0700, Freddie Cash wrote: > On Wed, May 25, 2011 at 9:30 PM, Matthew Ahrens > wrote: > On Wed, May 25, 2011 at 8:01 PM, Matt Weatherford > wrote: > pike# zpool get version internal > NAME PROPERTY VALUE SOURCE > internal version 28 default > pike# zpool get version external-J4400-12x1TB > NAME PROPERTY VALUE SOURCE > external-J4400-12x1TB version 28 default > pike# > > Can I expect to move my JBOD over to a different OS > such as FreeBSD, Illumos, or Solaris and be able to > get my data off still? (by this I mean perform a > zpool import on another platform) > > > Yes, because zpool version 28 is supported in Illumos. I'm > sure Oracle Solaris does or will soon support it too. > According to Wikipedia, "the 9-current development branch [of > FreeBSD] uses ZFS Pool version 28". > > Correct. FreeBSD 9-CURRENT (dev branch that will be released as 9.0 > at some point) as of March or April includes support for ZFSv28. > > And FreeBSD 8-STABLE (dev branch that will be released as 8.3 at some > point) has patches available to support ZFSv28 here: > http://people.freebsd.org/~mm/patches/zfs/v28/ > > > ZFS-on-FUSE for Linux currently only supports ZFSv23. > > So you can "safely" use Illumos, Nexenta, FreeBSD, etc with ZFSv28. > You can also use Solaris 11 Express, so long as you don't upgrade the > pool version (SolE includes ZFSv31). > > -- > Freddie Cash > fjwc...@gmail.com [Note: none of this is proprietary Oracle knowledge, and I do not speak as an Oracle representative in any way] The Fishworks stuff runs on a version of Solaris 11 - it's not some forked Solaris branch. So, when Solaris 11 finally ships, you can expect that the latest Solaris 11 Update + CRU (patches) should always be able to fully utilize any Fishworks-written volume. 
As pointed out above by others, however, I would never count on a Fishworks volume (particularly one which used the latest-available Fishworks software update) being able to be read on *anything* other than Solaris 11. As S11 advances, Fishworks gets updated too, so the ZFS/zpool version number advances. Solaris 11 is due out RSN, which means probably sometime before the end of the calendar year. But who knows, and Oracle hasn't officially announced a launch date for S11. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
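The portability rule running through this thread boils down to one comparison: the pool's on-disk version must not exceed the highest version the destination OS can import. A sketch (the version numbers are illustrative; on a real system they'd come from `zpool get version <pool>` on the source and `zpool upgrade -v` on the destination):

```shell
# Can a pool at on-disk version $1 be imported on an OS whose maximum
# supported zpool version is $2?
can_import() {
    if [ "$1" -le "$2" ]; then
        echo "import OK (pool v$1 <= supported v$2)"
    else
        echo "REFUSED: pool v$1 is newer than supported v$2"
    fi
}

can_import 28 28   # v28 pool vs. FreeBSD 9 / Illumos (both support v28)
can_import 31 28   # a pool upgraded under Solaris 11 Express (v31) won't import elsewhere
```

This is why the advice above is to leave a pool at the lowest version you might ever need to import it on: `zpool upgrade` is a one-way door.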
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On 5/25/2011 4:37 AM, Frank Van Damme wrote: On 24-05-11 22:58, LaoTsao wrote: With the various forks of the open-source project (e.g. ZFS, OpenSolaris, OpenIndiana, etc.), they are all different; there is no guarantee they will be compatible. I hope at least they'll try. Just in case I want to import/export zpools between Nexenta and OpenIndiana Given the new "versioning" governing board, I think that's highly likely. However, do remember that you might not be able to import a pool from another system, simply because your system can't support the featureset. Ideally, it would be nice if you could just import the pool and use the features your current OS supports, but that's pretty darned dicey, and I'd be very happy if importing worked when both systems supported the same featureset. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On 5/24/2011 8:28 AM, Orvar Korvar wrote: The netapp lawsuit is solved. No conflicts there. Regarding ZFS, it is open under CDDL license. The leaked source code that is already open is open. Nexenta is using the open sourced version of ZFS. Oracle might close future ZFS versions, but Nexenta's ZFS is open and can not be closed. There is no threat to Nexenta from the ZFS code itself; the license that it was made available under explicitly has Oracle allow use for any patents *Oracle* might have. However, since the terms of the NetApp/Oracle suit aren't available publicly, and I seriously doubt that NetApp gave up its patent claims, it could still be feasible for NetApp to sue Nexenta or whomever for alleged violations of *NetApp's* patents in the ZFS code. That is, ZFS has no copyright infringement issues for 3rd parties. It has no patent issues from Oracle. It *could* have patent issues from NetApp. The possible impact of that is beyond my knowledge. IANAL. Nor do I speak for Oracle in any manner, official or unofficial. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/7/2011 6:47 AM, Edward Ned Harvey wrote: See below. Right around 400,000 blocks, dedup is suddenly an order of magnitude slower than without dedup. 40 10.7sec 136.7sec 143 MB 195 MB 80 21.0sec 465.6sec 287 MB 391 MB The interesting thing is - In all these cases, the complete DDT and the complete data file itself should fit entirely in ARC comfortably. So it makes no sense for performance to be so terrible at this level. So I need to start figuring out exactly what's going on. Unfortunately I don't know how to do that very well. I'm looking for advice from anyone - how to poke around and see how much memory is being consumed for what purposes. I know how to lookup c_min and c and c_max... But that didn't do me much good. The actual value for c barely changes at all over time... Even when I rm the file, c does not change immediately. All the other metrics from kstat ... have less than obvious names ... so I don't know what to look for... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Some minor issues that might affect the above: (1) I'm assuming you run your script repeatedly in the same pool, without deleting the pool. If that is the case, that means that a run of X+1 should dedup completely with the run of X. E.g. a run with 12 blocks will dedup the first 11 blocks with the prior run of 11. (2) can you NOT enable "verify" ? Verify *requires* a disk read before writing for any potential dedup-able block. If case #1 above applies, then by turning on dedup, you *rapidly* increase the amount of disk I/O you require on each subsequent run. E.g. the run of 10 requires no disk I/O due to verify, but the run of 11 requires 10 I/O requests, while the run of 12 requires 11 requests, etc. This will skew your results as the ARC buffering of file info changes over time. (3) fflush is NOT the same as fsync. 
If you're running the script in a loop, it's entirely possible that ZFS hasn't completely committed things to disk yet, which means that you get I/O requests to flush out the ARC write buffer in the middle of your runs. Honestly, I'd do the following for benchmarking:

i=0
while [ $i -lt 80 ]; do
    j=$((10 + i))
    ./run_your_script $j
    sync
    sleep 10
    i=$((i + 1))
done

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 5:46 PM, Richard Elling wrote: On May 6, 2011, at 3:24 AM, Erik Trimble wrote: Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check to see if it can actually delete each block in the to-be-deleted snapshot/dataset, or if it just needs to update the dedup reference count. AIUI, the issue is not that the DDT is scanned - it is an AVL tree for a reason. The issue is that each reference update means that one, small bit of data is changed. If the reference is not already in ARC, then a small, probably random read is needed. If you have a typical consumer disk, especially a "green" disk, and have not tuned zfs_vdev_max_pending, then that itty bitty read can easily take more than 100 milliseconds(!) Consider that you can have thousands or millions of reference updates to do during a zfs destroy, and the math gets ugly. This is why fast SSDs make good dedup candidates. Just out of curiosity - I'm assuming that a delete works like this: (1) find the list of blocks associated with the file to be deleted (2) using the DDT, find out if any other files are using those blocks (3) delete/update any metadata associated with the file (dirents, ACLs, etc.) (4) for each block in the file (4a) if the DDT indicates there ARE other files using this block, update the DDT entry to change the refcount (4b) if the DDT indicates there AREN'T any other files, move the physical block to the free list, and delete the DDT entry In a bulk delete scenario (not just snapshot deletion), I'd presume #1 above almost always causes a random I/O request to disk, as all the relevant metadata for every (to-be-deleted) file is unlikely to be stored in ARC. If you can't fit the DDT in ARC/L2ARC, #2 above would require you to pull in the remainder of the DDT info from disk, right? #3 and #4 can be batched up, so they don't hurt that much. 
Is that a (roughly) correct deletion methodology? Or can someone give a more accurate view of what's actually going on? If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk, as it brings in the appropriate DDT entry. Worst case for insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude. E.g. days, weeks, or even months to do a deletion. I've never seen months, but I have seen days, especially for low-perf disks. I've seen an estimate of 5 weeks for removing a snapshot on a 1TB dedup pool made up of 1 disk. Not an optimal set up. :-) If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so. Yes, perhaps a bit longer for recursive destruction, but everyone here knows recursion is evil, right? :-) -- richard You, my friend, have obviously never worshipped at the Temple of the Lambda Calculus, nor been exposed to the Holy Writ that is "Structure and Interpretation of Computer Programs" (http://mitpress.mit.edu/sicp/full-text/book/book.html). I sentence you to a semester of 6.001 problem sets, written by Prof Sussman sometime in the 1980s. (yes, I went to MIT.) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
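The decrement-or-free logic in step (4) of the methodology above can be sketched in a few lines. This is a toy model only - the real DDT is an on-disk AVL-tree structure keyed by block checksum, and (as Richard notes) the expensive part is that touching an entry not already in ARC means a random read:

```shell
# Toy model of dedup-aware deletion: a DDT maps block checksums to
# refcounts; deleting a file decrements each, and only a refcount of
# zero moves the physical block to the free list.
out=$(awk 'BEGIN {
    ddt["blkA"] = 2; ddt["blkB"] = 1      # blkA is shared with another file
    n = split("blkA blkB", del, " ")      # blocks of the file being deleted
    for (i = 1; i <= n; i++) {
        b = del[i]
        if (--ddt[b] == 0) print b ": freed"
        else               print b ": still referenced (" ddt[b] " refs left)"
    }
}')
echo "$out"
```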
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 1:37 AM, casper@oracle.com wrote: Op 06-05-11 05:44, Richard Elling schreef: As the size of the data grows, the need to have the whole DDT in RAM or L2ARC decreases. With one notable exception, destroying a dataset or snapshot requires the DDT entries for the destroyed blocks to be updated. This is why people can go for months or years and not see a problem, until they try to destroy a dataset. So what you are saying is "you with your ram-starved system, don't even try to start using snapshots on that system". Right? I think it's more like "don't use dedup when you don't have RAM". (It is not possible to not use snapshots in Solaris; they are used for everything) Casper Casper and Richard are correct - RAM starvation seriously impacts snapshot or dataset deletion when a pool has dedup enabled. The reason behind this is that ZFS needs to scan the entire DDT to check to see if it can actually delete each block in the to-be-deleted snapshot/dataset, or if it just needs to update the dedup reference count. If it can't store the entire DDT in either the ARC or L2ARC, it will be forced to do considerable I/O to disk, as it brings in the appropriate DDT entry. Worst case for insufficient ARC/L2ARC space can increase deletion times by many orders of magnitude. E.g. days, weeks, or even months to do a deletion. If dedup isn't enabled, snapshot and data deletion is very light on RAM requirements, and generally won't need to do much (if any) disk I/O. Such deletion should take milliseconds to a minute or so. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
Using the standard c_max value of 80%, remember that this is 80% of the TOTAL system RAM, including that RAM normally dedicated to other purposes. So long as the total amount of RAM you expect to dedicate to ARC usage (for all ZFS uses, not just dedup) is less than 4 times that of all other RAM consumption, you don't need to "overprovision". So the end result is: On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.) On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.) We calculated that I need 7G for DDT and 3.4G for L2ARC. That is 10.4G. Multiply by 5/4 and it means I need 13G. My system needs to be built with at least 8G + 13G = 21G. Of this, 20% (4.2G) is more than enough to run the OS and processes, while 80% (16.8G) is available for ARC. Of the 16.8G ARC, the DDT and L2ARC references will consume 10.4G, which leaves 6.4G for "normal" ARC caching. These numbers are all fuzzy. Anything from 16G to 24G might be reasonable. That's it. I'm done. P.S. I'll just throw this out there: It is my personal opinion that you probably won't have the whole DDT in ARC and L2ARC at the same time. Because the L2ARC is populated from the soon-to-expire list of the ARC, it seems unlikely that all the DDT entries will get into ARC, and then onto the soon-to-expire list and then pulled back into ARC and stay there. The above calculation is a sort of worst case. I think the following is likely to be a more realistic actual case: There is a *very* low probability that a DDT entry will exist in both the ARC and L2ARC at the same time. That is, such a condition will occur ONLY in the very short period of time when the DDT entry is being migrated from the ARC to the L2ARC. Each DDT entry is tracked separately, so each can be migrated from ARC to L2ARC as needed. 
Any entry that is migrated back from L2ARC into ARC is considered "stale" data in the L2ARC, and thus, is no longer tracked in the ARC's reference table for L2ARC. As such, you can safely assume the DDT-related memory requirements for the ARC are (maximally) just slightly bigger than size of the DDT itself. Even then, that is a worst-case scenario; a typical use case would have the actual ARC consumption somewhere closer to the case where the entire DDT is in the L2ARC. Using your numbers, that would mean the worst-case ARC usage would be a bit over 7G, and the more likely case would be somewhere in the 3.3-3.5G range. Personally, I would model the ARC memory consumption of the L2ARC entries using the average block size of the data pool, and just neglect the DDT entries in the L2ARC. Well ... inflate some. Say 10% of the DDT is in the L2ARC and the ARC at the same time. I'm making up this number from thin air. My revised end result is: On my test system I guess the OS and processes consume 1G. (I'm making that up without any reason.) On my test system I guess I need 8G in the system to get reasonable performance without dedup or L2ARC. (Again, I'm just making that up.) We calculated that I need 7G for DDT and (96M + 10% of 3.3G = 430M) for L2ARC. Multiply by 5/4 and it means I need 7.5G * 1.25 = 9.4G My system needs to be built with at least 8G + 9.4G = 17.4G. Of this, 20% (3.5G) is more than enough to run the OS and processes, while 80% (13.9G) is available for ARC. Of the 13.9G ARC, the DDT and L2ARC references will consume 7.5G, which leaves 6.4G for "normal" ARC caching. I personally think that's likely to be more accurate in the observable world. My revised end result is still basically the same: These numbers are all fuzzy. Anything from 16G to 24G might be reasonable. 
For total system RAM, you need the GREATER of these two values: (1) the sum of your OS & application requirements, plus your standard ZFS-related ARC requirements, plus the DDT size (2) 1.25 times the sum of size of your standard ARC needs and the DDT size Redoing your calculations based on my adjustments: (a) worst case scenario is that you need 7GB for dedup-related ARC requirements (b) you presume to need 8GB for standard ARC caching not related to dedup (c) your system needs 1GB for basic operation According to those numbers: Case #1: 1 + 8 + 7 = 16GB Case #2: 1.25 * (8 + 7) =~ 19GB Thus, you should have 19GB of RAM in your system, with 16GB being a likely reasonable amount under most conditions (e.g. typical dedup ARC size is going to be ~3.5G, not the 7G maximum used above). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
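The two-case rule above can be wrapped into one small calculation (all figures in GB, matching the worked numbers; the 1.25 factor is just the inverse of the 80% c_max default):

```shell
# Total RAM = max( base OS + normal ARC + DDT, 1.25 * (normal ARC + DDT) )
ram_needed() {
    awk -v base="$1" -v arc="$2" -v ddt="$3" 'BEGIN {
        c1 = base + arc + ddt      # everything must fit alongside the OS
        c2 = 1.25 * (arc + ddt)    # ARC may only use ~80% of system RAM
        printf "%.0fG\n", (c1 > c2 ? c1 : c2)
    }'
}

ram_needed 1 8 7   # 1G OS + 8G normal ARC + 7G worst-case DDT -> 19G
```

With the more typical ~3.5G dedup ARC estimate substituted for the 7G worst case, the same function lands near the 16G lower bound mentioned above.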
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 5:11 PM, Brandon High wrote: On Wed, May 4, 2011 at 4:36 PM, Erik Trimble wrote: If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. They are, it's in their docs. A volume is dedup'd when 20% of non-deduped data is added to it, or something similar. 8 volumes can be processed at once though, I believe, and it could be that weaker systems are not able to do as many in parallel. Sounds rational. Block usage has a significant 4k presence. One way I reduced this initially was to have the VMdisk image stored on local disk, then copied the *entire* image to the ZFS server, so the server saw a single large file, which meant it tended to write full 128k blocks. Do note, that my 30 images only takes Wouldn't you have been better off cloning datasets that contain an unconfigured install and customizing from there? -B Given that my "OS" installs include a fair amount of 3rd-party add-ons (compilers, SDKs, et al), I generally find the best method for me is to fully configure a client (with the VMdisk on local storage), then copy that VMdisk to the ZFS server as a "golden image". I can then clone that image for my other clients of that type, and only have to change the network information. Initially, each new VM image consumes about 1MB of space. :-) Overall, I've found that as I have to patch each image, it's worthwhile to take a new "golden-image" snapshot every so often, and then reconfigure each client machine again from that new golden image. I'm sure I could do some optimization here, but the method works well enough. What you want to avoid is having the OS image written to the ZFS server first and then doing the configuration and customization after it's there - that's sub-optimal. 
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 4:44 PM, Tim Cook wrote:
On Wed, May 4, 2011 at 6:36 PM, Erik Trimble wrote:
On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:
I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup). NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them. -B

That is very true, although it's worth mentioning you can have quite a few of the dedupe/SIS-enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS-enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and not get horrid performance penalties? If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. ZFS's problem is that it needs ALL the resources for EACH pool ALL the time, and can't really share them well if it expects to keep performance from tanking... (no pun intended)

On a 2050? Probably not. It's got a single-core mobile Celeron CPU and 2GB of RAM.
You couldn't even run ZFS on that box, much less ZFS+dedup. Can you do it on a model that isn't 4 years old without tanking performance? Absolutely. Outside of those two 2000 series, the reason there are dedup limits isn't performance. --Tim

Indirectly, yes, it's performance, since NetApp has plainly chosen post-write dedup as a method to restrict the required hardware capabilities. The dedup limits on Volsize are almost certainly driven by the local RAM requirements for post-write dedup. It also looks like NetApp isn't providing for a dedicated DDT cache, which means that when the NetApp is doing dedup, it's consuming the normal filesystem cache (i.e. chewing through RAM). Frankly, I'd be very surprised if you didn't see a noticeable performance hit during the period that the NetApp appliance is performing the dedup scans.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 4:17 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:
On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much, if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB of SSD L2ARC and get away with "only" 16GB of RAM, for example.

It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC cache device, the DDT must be stored in RAM. That's about 376 bytes per dedup block. If you have an L2ARC cache device, then the ARC must contain a reference to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT entry reference. So, adding an L2ARC reduces the ARC consumption by about 55%. Of course, the other benefit from an L2ARC is the data/metadata caching, which is likely worth it just by itself.

Great info. Thanks Erik.
For dedupe workloads on larger file systems (8TB+), I wonder if it makes sense to use SLC / enterprise-class SSD (or better) devices for L2ARC instead of lower-end MLC stuff? Seems like we'd be seeing more writes to the device than in a non-dedupe scenario. Thanks, Ray

I'm using enterprise-class MLC drives (without a supercap), and they work fine with dedup. I'd have to test, but I don't think that the increase in writes is that much, so I don't expect SLC to really make much of a difference. (The fill rate of the L2ARC is limited, so I can't imagine we'd bump up against the MLC's limits.)

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
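The 376-byte vs. 176-byte figures quoted above imply the following per-entry arithmetic (a quick sketch; the struct sizes are the ones measured in this thread for that particular build, not fixed constants of ZFS):

```python
DDT_ENTRY = 376  # bytes: ddt_entry_t, the in-ARC cost per deduped block with no L2ARC
ARC_REF   = 176  # bytes: arc_buf_hdr_t, the ARC cost per DDT entry held in the L2ARC

savings = 1 - ARC_REF / DDT_ENTRY
print(f"ARC bytes per deduped block: {DDT_ENTRY} -> {ARC_REF} ({savings:.0%} less)")
```

Strictly this works out to about 53%, which Erik rounds up to "about 55%" in the post above; either way, an L2ARC roughly halves the ARC footprint of the DDT rather than eliminating it.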
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:
I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup). NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them. -B

That is very true, although it's worth mentioning you can have quite a few of the dedupe/SIS-enabled FlexVols on even the lower-end filers (our FAS2050 has a bunch of 2TB SIS-enabled FlexVols).

Stupid question - can you hit all the various SIS volumes at once, and not get horrid performance penalties? If so, I'm almost certain NetApp is doing post-write dedup. That way, the strictly controlled max FlexVol size helps with keeping the resource limits down, as it will be able to round-robin the post-write dedup to each FlexVol in turn. ZFS's problem is that it needs ALL the resources for EACH pool ALL the time, and can't really share them well if it expects to keep performance from tanking... (no pun intended)

The FAS2050 of course has a fairly small memory footprint... I do like the additional flexibility you have with ZFS, just trying to get a handle on the memory requirements. Are any of you out there using dedupe ZFS file systems to store VMware VMDK (or any VM tech, really)?
Curious what recordsize you use and what your hardware specs / experiences have been. Ray

Right now, I use it for my Solaris 8 containers and VirtualBox images. The VB images are mostly Windows (XP and Win2003). I tend to put the OS image in one VMdisk, and my scratch disks in another. That is, I generally don't want my apps writing much to my OS images. My scratch/data disks aren't deduped.

Overall, I'm running about 30 deduped images served out over NFS. My recordsize is set to 128k, but, given that they're OS images, my actual disk block usage has a significant 4k presence. One way I reduced this initially was to have the VMdisk image stored on local disk, then copy the *entire* image to the ZFS server, so the server saw a single large file, which meant it tended to write full 128k blocks. Do note that my 30 images only take about 20GB of actual space, after dedup. I figure about 5GB of dedup space per OS type (and, I have 4 different setups).

My data VMdisks, however, chew through about 4TB of disk space, which is nondeduped. I'm still trying to determine if I'm better off serving those data disks as NFS mounts to my clients, or as VMdisk images available over iSCSI or NFS. Right now, I'm doing VMdisks over NFS.

The setup I'm using is an older X4200 (non-M2), with 3rd-party SSDs as L2ARC, hooked to an old 3500FC array. It has 8GB of RAM in total, and runs just fine with that. I definitely am going to upgrade to something much larger in the near future, since I expect to up my number of VM images by at least a factor of 5.

That all said, if you're relatively careful about separating OS installs from active data, you can get really impressive dedup ratios using a relatively small amount of actual space. In my case, I expect to eventually be serving about 10 different configs out to a total of maybe 100 clients, and probably never exceed 100GB max on the deduped end. Which means that I'll be able to get away with 16GB of RAM for the whole server, comfortably.
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for memory (at least not much, if you're talking about 500 bytes combined)? I was hoping we could slap in 80GB of SSD L2ARC and get away with "only" 16GB of RAM, for example.

It reduces *somewhat* the need for RAM. Basically, if you have no L2ARC cache device, the DDT must be stored in RAM. That's about 376 bytes per dedup block. If you have an L2ARC cache device, then the ARC must contain a reference to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT entry reference. So, adding an L2ARC reduces the ARC consumption by about 55%. Of course, the other benefit from an L2ARC is the data/metadata caching, which is likely worth it just by itself.
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Deduplication Memory Requirements
On 5/4/2011 9:57 AM, Ray Van Dolson wrote:

There are a number of threads (this one[1] for example) that describe memory requirements for deduplication. They're pretty high. I'm trying to get a better understanding... on our NetApps we use 4K block sizes with their post-process deduplication and get pretty good dedupe ratios for VM content. Using ZFS we are using 128K record sizes by default, which nets us less impressive savings... however, to drop to a 4K record size would theoretically require that we have nearly 40GB of memory for only 1TB of storage (based on 150 bytes per block for the DDT). This obviously becomes prohibitively higher for 10+ TB file systems. I will note that our NetApps are using only 2TB FlexVols, but I would like to better understand ZFS's (apparently) higher memory requirements... or maybe I'm missing something entirely. Thanks, Ray

[1] http://markmail.org/message/wile6kawka6qnjdw

I'm not familiar with NetApp's implementation, so I can't speak to why it might appear to use fewer resources. However, there are a couple of possible issues here:

(1) Pre-write vs. post-write deduplication. ZFS does pre-write dedup, where it looks for duplicates before it writes anything to disk. In order to do pre-write dedup, you really have to store the ENTIRE deduplication block lookup table in some sort of fast (random) access media, realistically Flash or RAM. The win is that you get significantly lower disk utilization (i.e. better I/O performance), as (potentially) much less data is actually written to disk. Post-write dedup is done via batch processing - that is, such a design has the system periodically scan the saved data, looking for duplicates. While this method also greatly benefits from being able to store the dedup table in fast random storage, it's not anywhere near as critical.
The downside here is that you see much higher disk utilization - the system must first write all new data to disk (without looking for dedup), and then must also perform significant I/O later on to do the dedup.

(2) Block size: a 4k block size will yield better dedup than a 128k block size, presuming reasonable data turnover. This is inherent, as any single bit change in a block will make it non-duplicated. With 32x the block size, there is a much greater chance that a small change in data will require a large loss of dedup ratio. That is, 4k blocks should almost always yield much better dedup ratios than larger ones. Also, remember that the ZFS block size is a SUGGESTION for zfs filesystems (i.e. it will use UP TO that block size, but not always that size), but is FIXED for zvols.

(3) Method of storing (and data stored in) the dedup table. ZFS's current design is (IMHO) rather piggy on DDT and L2ARC lookup requirements. Right now, ZFS requires a record in the ARC (RAM) for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it boils down to 500+ bytes of combined L2ARC & RAM usage per block entry in the DDT. Also, the actual DDT entry itself is perhaps larger than absolutely necessary.

I suspect that NetApp does the following to limit their resource usage: they presume the presence of some sort of cache that can be dedicated to the DDT (and, since they also control the hardware, they can make sure there is always one present). Thus, they can make their code completely avoid the need for an equivalent to the ARC-based lookup. In addition, I suspect they have a smaller DDT entry itself. Which boils down to probably needing 50% of the total resource consumption of ZFS, and NO (or an extremely small, fixed) RAM requirement.

Honestly, ZFS's cache (L2ARC) requirements aren't really a problem.
The big issue is the ARC requirements, which, until they can be seriously reduced (or, best case, simply eliminated), really are a significant barrier to adoption of ZFS dedup. Right now, ZFS treats DDT entries like any other data or metadata in how it ages from ARC to L2ARC to gone. IMHO, the better way to do this is simply to require the DDT to be entirely stored on the L2ARC (if present), and not ever keep any DDT info in the ARC at all (that is, the ARC should contain a pointer to the DDT in the L2ARC, and that's it, regardless of the amount or frequency of access of the DDT). Frankly, at this point, I'd almost change the design to REQUIRE an L2ARC device in order to turn on dedup.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
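Ray's back-of-envelope figure from the start of the thread (~40GB of DDT for 1TB of 4K blocks) can be checked directly. A sketch of the arithmetic, where 150 bytes is the rough per-block DDT estimate Ray cites and 376 bytes is the ddt_entry_t size measured later in this thread:

```python
TIB = 2**40  # 1TB of (post-dedup) data

def ddt_bytes(data_bytes, block_size, entry_size):
    # One DDT entry per unique block; pre-write dedup wants all of
    # these in fast random-access media (RAM or flash).
    blocks = data_bytes // block_size
    return blocks * entry_size

# 1TB of 4k blocks at ~150 bytes/entry -> close to the ~40GB Ray quotes
print(ddt_bytes(TIB, 4096, 150) / 2**30)  # -> 37.5 (GiB)
# Same data at the 376-byte ddt_entry_t measured with mdb later in the thread
print(ddt_bytes(TIB, 4096, 376) / 2**30)  # -> 94.0 (GiB)
```

So the per-entry structure size is the whole game: the same 1TB of data costs 2.5x more DDT space under the measured entry size than under the 150-byte estimate.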
Re: [zfs-discuss] Faster copy from UFS to ZFS
On 5/3/2011 8:55 AM, Brandon High wrote:
On Tue, May 3, 2011 at 5:47 AM, Joerg Schilling wrote:
But this is most likely slower than star, and does rsync support sparse files?

'rsync -ASHXavP'
-A: ACLs
-S: Sparse files
-H: Hard links
-X: Xattrs
-a: archive mode; equals -rlptgoD (no -H,-A,-X)

You don't need to specify --whole-file, it's implied when copying on the same system. --inplace can play badly with hard links and shouldn't be used. It probably will be slower than other options, but it may be more accurate, especially with -H. -B

rsync is indeed slower than star; so far as I can tell, this is due almost exclusively to the fact that rsync needs to build an in-memory table of all work being done *before* it starts to copy. After that, it copies at about the same rate as star (my observations). I'd have to look at the code, but rsync appears to internally buffer a significant amount (due to its expected network use pattern), which helps for ZFS copying. The one thing I'm not sure of is whether rsync uses a socket, pipe, or semaphore method when doing same-host copying. I presume socket (which would slightly slow it down vs. star).

That said, rsync is really the only solution if you have a partial or interrupted copy. It's also really the best method to do verification.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 4/29/2011 9:44 AM, Brandon High wrote:
On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakk wrote:
This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a combination with verify (which I would use anyway, since there are always tiny chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and replaced with snv_128b, which removed fletcher4 as an option. The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup datasets, and is in fact the default. As an aside: Erik, any idea when the 159 bits will make it to the public? -B

Yup, fletcher4 is still the default for any fileset not using dedup. It's "good enough", and I can't see any reason to change it for those purposes (since its collision problems aren't much of an issue when just doing data integrity checks).

Sorry, no idea on release date stuff. I'm completely out of the loop on release info. I'm lucky if I can get a heads up before it actually gets published internally. :-( I'm just a lowly Java Platform Group dude. Solaris ain't my silo.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] Finding where dedup'd files are
On Thu, 2011-04-28 at 15:50 -0700, Brandon High wrote:
> On Thu, Apr 28, 2011 at 3:48 PM, Ian Collins wrote:
> > Dedup is at the block, not file level.
>
> Files are usually composed of blocks.
>
> -B

I think the point was that it may not be easy to determine which file a given block is part of. I don't think stuff is stored as a doubly-linked list. That is, the file metadata lists the blocks associated with that file, but the block itself doesn't refer back to the file metadata. Which means that, while I can get a list of blocks which are deduped, it may not be possible to generate a list of files from that list of blocks.

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, 2011-04-28 at 14:33 -0700, Brandon High wrote:
> On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey wrote:
> > Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
> > to (not instead of) the fletcher2 integrity checksum. So after bootup,
>
> My understanding is that enabling dedup forces sha256.
>
> "The default checksum used for deduplication is sha256 (subject to
> change). When dedup is enabled, the dedup checksum algorithm overrides
> the checksum property."
>
> -B

From the man page for zfs(1):

dedup=on | off | verify | sha256[,verify]

Controls whether deduplication is in effect for a dataset. The default value is off. The default checksum used for deduplication is sha256 (subject to change). When dedup is enabled, the dedup checksum algorithm overrides the checksum property. Setting the value to verify is equivalent to specifying sha256,verify. If the property is set to verify, then, whenever two blocks have the same signature, ZFS will do a byte-for-byte comparison with the existing block to ensure that the contents are identical.

This is from b159. A careful reading of the man page seems to imply that there's no way to change the dedup checksum algorithm from sha256, as the dedup property ignores the checksum property, and there's no provided way to explicitly set a checksum algorithm specific to dedup (i.e. there's no way to override the default for dedup).

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, 2011-04-28 at 13:59 -0600, Neil Perrin wrote:
> On 4/28/11 12:45 PM, Edward Ned Harvey wrote:
> > In any event, thank you both for your input. Can anyone answer these
> > authoritatively? (Neil?) I'll send you a pizza. ;-)
>
> I wouldn't consider myself an authority on the dedup code.
> The size of these structures will vary according to the release you're
> running. You can always find out the size for a particular system using
> ::sizeof within mdb. For example, as super user:
>
> : xvm-4200m2-02 ; echo ::sizeof ddt_entry_t | mdb -k
> sizeof (ddt_entry_t) = 0x178
> : xvm-4200m2-02 ; echo ::sizeof arc_buf_hdr_t | mdb -k
> sizeof (arc_buf_hdr_t) = 0x100
> : xvm-4200m2-02 ;

Yup, that's how I got them. Just to add to the confusion, there are typedefs in the code which can make names slightly different:

typedef struct arc_buf_hdr arc_buf_hdr_t;
typedef struct ddt_entry ddt_entry_t;

I got my values from an x86 box running b159, and a SPARC box running S10u9. The values were the same from both. E.g.:

root@invisible:~# uname -a
SunOS invisible 5.11 snv_159 i86pc i386 i86pc Solaris
root@invisible:~# isainfo
amd64 i386
root@invisible:~# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs ip hook neti arp usba uhci fctl stmf kssl stmf_sbd sockfs lofs random sata sd fcip cpc crypto nfs logindmux ptm ufs sppp ipc ]
> ::sizeof struct arc_buf_hdr
sizeof (struct arc_buf_hdr) = 0xb0
> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178

> This shows yet another size. Also there are more changes planned within
> the arc. Sorry, I can't talk about those changes nor when you'll
> see them.
>
> However, that's not the whole story. It looks like the arc_buf_hdr_t
> use their own kmem cache so there should be little wastage, but the
> ddt_entry_t are allocated from the generic kmem caches and so will
> probably have some roundup and unused space. Caches for small buffers
> are aligned to 64 bytes.
See kmem_alloc_sizes[] and comment:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#920

Ugg. I hadn't even thought of memory alignment/allocation issues.

> Pizza: Mushroom and anchovy - er, just kidding.
>
> Neil.

And, let me say: Yuck! What is that, an ISO-standard pizza? Disgusting. ANSI-standard pizza, all the way! (pepperoni & mushrooms)

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
OK, I just re-looked at a couple of things, and here's what I /think/ are the correct numbers.

A single entry in the DDT is defined in the struct "ddt_entry":
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108
I just checked, and the current size of this structure is 0x178, or 376 bytes.

Each ARC entry, which points to either an L2ARC item (of any kind, cached data, metadata, or a DDT line) or actual data/metadata/etc., is defined in the struct "arc_buf_hdr":
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#431
Its current size is 0xb0, or 176 bytes.

These are fixed-size structures. PLEASE - someone correct me if these two structures AREN'T what we should be looking at.

So, our estimate calculations have to be based on these new numbers. Back to the original scenario: 1TB (after dedup) of 4k blocks: how much space is needed for the DDT, and how much ARC space is needed if the DDT is kept in an L2ARC cache device?

Step 1) 1TB (2^40 bytes) stored in blocks of 4k (2^12) = 2^28 blocks total, which is about 268 million.
Step 2) 2^28 blocks of information in the DDT requires 376 bytes/block * 2^28 blocks = 94 * 2^30 bytes = 94GB of space.
Step 3) Storing a reference to 268 million (2^28) DDT entries in the L2ARC will consume the following amount of ARC space: 176 bytes/entry * 2^28 entries = 44GB of RAM.

That's pretty ugly. So, to summarize, for 1TB of data, broken into the following block sizes:

Block size   DDT size          ARC consumption
512b         752GB   (73%)     352GB  (34%)
4k           94GB    (9%)      44GB   (4.3%)
8k           47GB    (4.5%)    22GB   (2.1%)
32k          11.75GB (1.1%)    5.5GB  (0.5%)
64k          5.9GB   (0.6%)    2.75GB (0.3%)
128k         2.9GB   (0.3%)    1.4GB  (0.1%)

ARC consumption presumes the whole DDT is stored in the L2ARC. Percentage size is relative to the original 1TB total data size.

Of course, the trickier proposition here is that we DON'T KNOW what our dedup value is ahead of time on a given data set.
That is, given a data set of X size, we don't know how big the deduped data size will be. The above calculations are for DDT/ARC size for a data set that has already been deduped down to 1TB in size. Perhaps it would be nice to have some sort of userland utility that builds its own DDT as a test and does all the above calculations, to see how dedup would work on a given dataset. 'zdb -S' sorta, kinda does that, but...

-- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
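The summary table above can be regenerated from just the two struct sizes (this is only the arithmetic from the post, not the output of any ZFS tool):

```python
DDT_ENTRY, ARC_REF = 376, 176  # bytes, per the mdb ::sizeof results in this thread
TIB = 2**40                    # 1TB of already-deduped data

print(f"{'block':>6}  {'DDT size':>10}  {'ARC (L2ARC refs)':>16}")
for bs in (512, 4096, 8192, 32768, 65536, 131072):
    blocks = TIB // bs                      # one DDT entry per block
    ddt_gb = blocks * DDT_ENTRY / 2**30     # RAM if there is no L2ARC
    arc_gb = blocks * ARC_REF / 2**30       # RAM if the DDT lives in the L2ARC
    print(f"{bs:>6}  {ddt_gb:>8.2f}GB  {arc_gb:>14.2f}GB")
```

Running this reproduces the 752GB/352GB row for 512-byte blocks down through the 2.94GB/1.38GB row for 128k blocks, which is where the rounded 2.9GB/1.4GB figures in the table come from.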
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/26/2011 9:29 AM, Fred Liu wrote:
> From: Erik Trimble [mailto:erik.trim...@oracle.com]
>> It is true, quota is in charge of logical data, not physical data.
>> Let's assume an interesting scenario -- say the pool is 100% full in logical data
>> (such as 'df' tells you 100% used) but not full in physical data (such as 'zpool list' tells
>> you still some space available): can we continue writing data into this pool?
>
> Sure, you can keep writing to the volume. What matters to the OS is what
> *it* thinks, not what some userland app thinks.
>
> OK. And then what will the output of 'df' be?
>
> Thanks.
>
> Fred

110% full. Or whatever. df will just keep reporting what it sees. Even if what it *thinks* doesn't make sense to the human reading it.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/26/2011 3:59 AM, Fred Liu wrote:
>> -Original Message-
>> From: Erik Trimble [mailto:erik.trim...@oracle.com]
>> Sent: Tuesday, April 26, 2011 12:47
>> To: Ian Collins
>> Cc: Fred Liu; ZFS discuss
>> Subject: Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
>>
>> On 4/25/2011 6:23 PM, Ian Collins wrote:
>>> On 04/26/11 01:13 PM, Fred Liu wrote:
>>>> Hmm, it seems dedup is pool-based, not filesystem-based.
>>> That's correct. Although it can be turned off and on at the filesystem
>>> level (assuming it is enabled for the pool).
>> Which is effectively the same as choosing per-filesystem dedup. Just
>> the inverse. You turn it on at the pool level, and off at the filesystem
>> level, which is identical to the "off at the pool level, on at the
>> filesystem level" that NetApp does.
> My original thought was just enabling dedup on one file system to check whether it
> is mature enough for the production env. And I have only one pool.
> If dedup were filesystem-based, the effect of dedup would be throttled within
> one file system and wouldn't propagate to the whole pool. Just disabling dedup
> cannot get rid of all the effects (such as the possible performance degradation, etc.),
> because the already-dedup'd data is still there and the DDT is still there. The only
> thorough way I can think of is totally removing all the dedup'd data. But is that
> really the thorough way?

You can do that now. Enable dedup at the pool level. Turn it OFF on all the existing filesystems. Make a new "test" filesystem, and run your tests.

Remember, only data written AFTER dedup is turned on will be de-duped. Existing data will NOT. And, though dedup is enabled at the pool level, it will only consider data written into filesystems that have the dedup value as ON. Thus, in your case, writing to the single filesystem with dedup on will NOT have ZFS check for duplicates from the other filesystems. It will check only inside itself, as it's the only filesystem with dedup enabled.
If the experiment fails, you can safely destroy your test dedup filesystem, then unset dedup at the pool level, and you're fine.

> And also the dedup space saving is kind of indirect.
> We cannot directly get the space saving in the file system where
> dedup is actually enabled, for it is pool-based. Even from the pool perspective,
> it is still sort of indirect and obscure in my opinion; the real space saving
> is the absolute delta between the output of 'zpool list' and the sum of 'du' on
> all the folders in the pool
> (or 'df' on the mount point folder; not sure if a percentage like 123% will
> occur or not... grinning ^:^ ).
>
> But in NetApp, we can use 'df -s' to directly and easily get the space saving.

That is true. Honestly, however, it would be hard to do this on a per-filesystem basis. ZFS allows for the creation of an arbitrary number of filesystems in a pool, far more than NetApp does. The result is that the "filesystem" concept is much more flexible in ZFS. The downside is that keeping dedup statistics for a given arbitrary set of data is logistically difficult. An analogy with NetApp is thus: can you use any tool to find the dedup ratio of an arbitrary directory tree INSIDE a NetApp filesystem?

> It is true, quota is in charge of logical data, not physical data.
> Let's assume an interesting scenario -- say the pool is 100% full in logical data
> (such as 'df' tells you 100% used) but not full in physical data (such as 'zpool list' tells
> you still some space available): can we continue writing data into this pool?

Sure, you can keep writing to the volume. What matters to the OS is what *it* thinks, not what some userland app thinks.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/25/2011 6:23 PM, Ian Collins wrote: On 04/26/11 01:13 PM, Fred Liu wrote: Hmm, it seems dedup is pool-based, not filesystem-based. That's correct. Although it can be turned off and on at the filesystem level (assuming it is enabled for the pool). Which is effectively the same as choosing per-filesystem dedup, just the inverse. You turn it on at the pool level, and off at the filesystem level, which is identical to the "off at the pool level, on at the filesystem level" that NetApp does. If it could have finer granularity (per filesystem), that would be great! It is a pity! NetApp is sweet in this aspect.

So what happens to user B's quota if user B stores a ton of data that is a duplicate of user A's data, and then user A deletes the original?

Actually, right now, nothing happens to B's quota. He's always charged the un-deduped amount for his quota usage, whether or not dedup is enabled, and regardless of how much of his data is actually deduped. Which is as it should be, as quotas are about limiting how much a user is consuming, not how much the backend needs to store that data.

E.g. A, B, C, & D each have 100MB of data in the pool, with dedup on. 20MB of storage has a dedup factor of 3:1 (common to A, B, & C). 50MB of storage has a dedup factor of 2:1 (common to A & B). Thus, the amount of unique data would be:

A: 100 - 20 - 50 = 30MB
B: 100 - 20 - 50 = 30MB
C: 100 - 20 = 80MB
D: 100MB

Summing it all up, you would have an actual storage consumption of 70MB (50+20 deduped) + 30+30+80+100 (unique data) = 310MB of actual storage, for 400MB of apparent storage (i.e. a dedup ratio of 1.29:1). A, B, C, & D would each still have a quota usage of 100MB.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
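The quota accounting in that A/B/C/D example can be checked with a few lines of arithmetic. This is just a sketch of the bookkeeping in the post, with the shared-block layout hard-coded to match the scenario above:

```python
# Four users each store 100MB of logical data; some of it is shared.
# Quotas charge the un-deduped (logical) amount, so dedup never
# changes a user's quota usage -- only the pool's physical consumption.
shared = [
    (20, ["A", "B", "C"]),  # 20MB common to A, B, C -> stored once (3:1)
    (50, ["A", "B"]),       # 50MB common to A, B    -> stored once (2:1)
]
users = ["A", "B", "C", "D"]
quota_per_user = 100  # MB, charged to each user regardless of dedup

apparent = quota_per_user * len(users)                    # 400MB logical
savings = sum(mb * (len(who) - 1) for mb, who in shared)  # copies not stored
actual = apparent - savings                               # physical storage
ratio = apparent / actual

print(f"apparent={apparent}MB actual={actual}MB dedup ratio={ratio:.2f}:1")
```

Running it reproduces the post's figures: 310MB of actual storage for 400MB of apparent storage, a 1.29:1 dedup ratio.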
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 4/25/2011 8:20 AM, Edward Ned Harvey wrote: There are a lot of conflicting references on the Internet, so I'd really like to solicit actual experts (ZFS developers or people who have physical evidence) to weigh in on this... After searching around, the reference I found to be the most seemingly useful was Erik's post here: http://opensolaris.org/jive/thread.jspa?threadID=131296 Unfortunately it looks like there's an arithmetic error (1TB of 4k blocks means 268 million blocks, not 1 billion). Also, IMHO it seems important to make the distinction, #files != #blocks. Due to the existence of larger files, there will sometimes be more than one block per file; and if I'm not mistaken, thanks to write aggregation, there will sometimes be more than one file per block. YMMV. Average block size could be anywhere between 1 byte and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs property, not a zpool property. So how can you know or configure the blocksize for something like a zvol iscsi target?)

I meant 2^28, which is roughly a quarter billion. But I should have been more exact. And the file != block difference is important to note.

zvols also take a recordsize attribute. And zvols tend to be sticklers about all blocks being /exactly/ the recordsize value, unlike filesystems, which use it as a *maximum* block size. Min block size is 512 bytes.

(BTW, is there any way to get a measurement of the number of blocks consumed per zpool? Per vdev? Per zfs filesystem?) The calculations below are based on the assumption of 4KB blocks adding up to a known total data consumption. The actual thing that matters is the number of blocks consumed, so the conclusions drawn will vary enormously when people actually have average block sizes != 4KB.

You need to use zdb to see what the current block usage is for a filesystem. I'd have to look up the particular CLI usage for that, as I don't know it off the top of my head.
And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in RAM? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is?

Nope. The DDT is stored in only one place: the cache device if present, /or/ RAM otherwise (technically, ARC, but that's in RAM). If a cache device is present, the DDT is stored there, BUT RAM also must store a basic lookup table for the DDT (yeah, I know: a lookup table for a lookup table).

My minor corrections here: the rule of thumb is 270 bytes per DDT entry, and 200 bytes of ARC for every L2ARC entry; since the DDT is stored on the cache device, the DDT itself doesn't consume any ARC space when stored on an L2ARC device.

E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out that I have about a 5:1 dedup ratio. I'd also like to see how much ARC usage I eat up by using a 160GB L2ARC to store my DDT.

(1) How many entries are there in the DDT? 1TB of 4k blocks means there are 268 million blocks. However, at a 5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 54 million blocks. Thus, I need a DDT of about 270 bytes * 54 million =~ 14GB in size.

(2) How much ARC space does this DDT take up? The 54 million entries in my DDT take up about 200 bytes * 54 million =~ 10GB of ARC space, so I need to have 10GB of RAM dedicated just to storing the references to the DDT in the L2ARC.

(3) How much space do I have left on the L2ARC device, and how many blocks can that hold? Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the device, which, assuming I'm still using 4k blocks, means I can cache about 37 million 4k blocks, or about two-thirds of my deduped data. This extra cache of blocks in the L2ARC would eat up 200 bytes * 37 million =~ 7.5GB of ARC entries.
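The sizing steps above can be rolled into a small calculator. The 270-bytes-per-DDT-entry and 200-bytes-per-L2ARC-reference figures are the rules of thumb from the post; the function itself is just an illustrative sketch, and it reproduces the post's round numbers to within rounding:

```python
def dedup_sizing(logical_bytes, block_size, dedup_ratio, l2arc_bytes,
                 ddt_entry=270, arc_ref=200):
    """Rule-of-thumb DDT/L2ARC/ARC sizing per the figures above:
    270 B per DDT entry on the cache device, 200 B of ARC per L2ARC entry."""
    blocks = logical_bytes // block_size        # logical block count
    unique = int(blocks / dedup_ratio)          # blocks actually stored
    ddt_bytes = unique * ddt_entry              # DDT size on the L2ARC
    arc_for_ddt = unique * arc_ref              # RAM to reference the DDT
    spare = l2arc_bytes - ddt_bytes             # L2ARC left for data caching
    cached = spare // block_size                # extra blocks it can cache
    arc_for_cache = cached * arc_ref            # RAM to reference that cache
    return blocks, unique, ddt_bytes, arc_for_ddt, cached, arc_for_cache

GB = 10**9
blocks, unique, ddt, arc_ddt, cached, arc_cache = dedup_sizing(
    2**40, 4096, 5.0, 160 * GB)  # 1TB of 4k blocks, 5:1, 160GB L2ARC
print(f"{blocks/1e6:.0f}M blocks, {unique/1e6:.0f}M unique, "
      f"DDT {ddt/GB:.1f}GB, ARC for DDT {arc_ddt/GB:.1f}GB, "
      f"{cached/1e6:.0f}M extra cached blocks, ARC {arc_cache/GB:.1f}GB")
```

The output matches the post's back-of-envelope numbers: ~268M logical blocks, ~54M unique, a ~14.5GB DDT, ~10.7GB of ARC for DDT references, and roughly 35-36M additional cacheable blocks costing another ~7GB of ARC.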
Thus, for the aforementioned dedup scenario, I'd better spec the machine with (whatever base RAM the OS, ordinary ZFS caching, and applications require) plus at least a 14GB L2ARC device for the DDT, plus 10GB more of RAM for the DDT's L2ARC references, plus 1GB of RAM for every 20GB of additional space in the L2ARC cache beyond that used by the DDT. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 9:19 PM, Ian Collins wrote: On 04/ 9/11 03:53 PM, Mark Sandrock wrote: I'm not arguing. If it were up to me, we'd still be selling those boxes. Maybe you could whisper in the right ear? :) Three little words are all that Oracle Product Managers hear: "Business case justification" I want my J4000's back, too. And, I still want something like HP's MSA 70 (25 x 2.5" drive JBOD in a 2U formfactor) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 4:50 PM, Bob Friesenhahn wrote: On Fri, 8 Apr 2011, J.P. King wrote: I can't speak for this particular situation or solution, but I think in principle you are wrong. Networks are fast. Hard drives are slow. Put a But memory is much faster than either. In most situations the data would already be buffered in the X4540's memory so that it is instantly available. Bob

Certainly, as a low-end product, the X4540 (and X4500) offered unmatched flexibility and performance per dollar. It *is* sad to see them go. But, given Oracle's strategic direction, is anyone really surprised?

PS - Nexenta, I think you've got a product positioning opportunity here...

PPS - about the closest thing Oracle makes to the X4540 now is the X4270 M2 in the 2.5" drive config - 24 x 2.5" drives, 2 x Westmere-EP CPUs, in a 2U rack cabinet, somewhere around $25k (list) for the 24x500GB SATA model with (2) 6-core Westmeres + 16GB RAM.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 1:58 PM, Chris Banal wrote: Sounds like many of us are in a similar situation. To clarify my original post: the goal here was to continue with what was a cost-effective solution to some of our storage requirements. I'm looking for hardware that wouldn't cause me to get the runaround from the Oracle support folks, finger-pointing between vendors, or lots of grief from an untested combination of parts. If this isn't possible we'll certainly find another solution. I already know it won't be the 7000 series. Thank you, Chris Banal

Talk to HP then. They still sell Officially Supported Solaris servers and disk storage systems in more varieties than Oracle does. The StorageWorks 600 Modular Disk System may be what you're looking for (70 x 3.5" drives per enclosure, 5U, SAS/SATA/FC attachment to any server, $35k list price for 70TB). Or the StorageWorks MSA70 (25 x 2.5" drives, 2U, SAS attachment, $11k list price for 12.5TB). -Erik

Marion Hakanson wrote: jp...@cam.ac.uk said: I can't speak for this particular situation or solution, but I think in principle you are wrong. Networks are fast. Hard drives are slow. Put a 10G connection between your storage and your front ends and you'll have the bandwidth[1]. Actually if you really were hitting 1000x8Mbits I'd put 2, but that is just a question of scale. In a different situation I have boxes which peak at around 7 Gb/s down a 10G link (in reality I don't need that much because it is all about the IOPS for me). That is with just twelve 15k disks. Your situation appears to be pretty ideal for storage hardware, so perfectly achievable from an appliance. Depending on usage, I disagree with your bandwidth and latency figures above. An X4540, or an X4170 with J4000 JBODs, has more bandwidth to its disks than 10Gbit Ethernet. You would need three 10GbE interfaces between your CPU and the storage appliance to equal the bandwidth of a single 8-port 3Gb/s SAS HBA (five of them for 6Gb/s SAS).
It's also the case that the Unified Storage platform doesn't have enough bandwidth to drive more than four 10GbE ports at their full speed: http://dtrace.org/blogs/brendan/2009/09/22/7410-hardware-update-and-analyzing-the-hypertransport/

We have a customer (internal to the university here) that does high-throughput gene sequencing. They like a server which can hold the large amounts of data, do a first-pass analysis on it, and then serve it up over the network to a compute cluster for further computation. Oracle has nothing in their product line (anymore) to meet that need. They ended up ordering an 8U chassis w/ 40x 2TB drives in it, and are willing to pay the $2k/yr retail ransom to Oracle to run Solaris (ZFS) on it, at least for the first year. Maybe OpenIndiana next year, we'll see. Bye Oracle. Regards, Marion

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] X4540 no next-gen product?
On 4/8/2011 12:37 AM, Ian Collins wrote: On 04/ 8/11 06:30 PM, Erik Trimble wrote: On 4/7/2011 10:25 AM, Chris Banal wrote: While I understand everything at Oracle is "top secret" these days. Does anyone have any insight into a next-gen X4500 / X4540? Does some other Oracle / Sun partner make a comparable system that is fully supported by Oracle / Sun? http://www.oracle.com/us/products/servers-storage/servers/previous-products/index.html What do X4500 / X4540 owners use if they'd like more comparable zfs based storage and full Oracle support? I'm aware of Nexenta and other cloned products but am specifically asking about Oracle supported hardware. However, does anyone know if these type of vendors will be at NAB this year? I'd like to talk to a few if they are... The move seems to be to the Unified Storage (aka ZFS Storage) line, which is a successor to the 7000-series OpenStorage stuff. http://www.oracle.com/us/products/servers-storage/storage/unified-storage/index.html Which is not a lot of use to those of us who use X4540s for what they were intended: storage appliances. We have had to take the retrograde step of adding more, smaller servers (like the ones we consolidated on the X4540s!). Sorry, I read the question differently, as in "I have X4500/X4540 now, and want more of them, but Oracle doesn't sell them anymore, what can I buy?". The 7000-series (now: Unified Storage) *are* storage appliances. If you have an X4540/X4500 (and some cash burning a hole in your pocket), Oracle will be happy to sell you a support license (which should include later versions of ZFS software). But, don't quote me on that - talk to a Sales Rep if you want a Quote. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4540 no next-gen product?
On 4/7/2011 10:25 AM, Chris Banal wrote: While I understand everything at Oracle is "top secret" these days. Does anyone have any insight into a next-gen X4500 / X4540? Does some other Oracle / Sun partner make a comparable system that is fully supported by Oracle / Sun? http://www.oracle.com/us/products/servers-storage/servers/previous-products/index.html What do X4500 / X4540 owners use if they'd like more comparable zfs based storage and full Oracle support? I'm aware of Nexenta and other cloned products but am specifically asking about Oracle supported hardware. However, does anyone know if these type of vendors will be at NAB this year? I'd like to talk to a few if they are... The move seems to be to the Unified Storage (aka ZFS Storage) line, which is a successor to the 7000-series OpenStorage stuff. http://www.oracle.com/us/products/servers-storage/storage/unified-storage/index.html -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practice for boot partition layout in ZFS
On 4/6/2011 7:50 AM, Lori Alt wrote: On 04/ 6/11 07:59 AM, Arjun YK wrote: Hi, I am trying to use ZFS for boot, and am kind of confused about how boot partitions like /var are to be laid out. With old UFS, we create /var as a separate filesystem to avoid various logs filling up the / filesystem. I believe that creating /var as a separate file system was a common practice, but not a universal one. It really depended on the environment and local requirements. With ZFS, during the OS install it gives the option to "Put /var on a separate dataset", but no option is given to set a quota. Maybe others set quotas manually. Having a separate /var dataset gives you the option of setting a quota on it later. That's why we provided the option. It was a way of enabling administrators to get the same effect as having a separate /var slice did with ufs. Administrators can choose to use it or not, depending on local requirements. So, I am trying to understand what's the best practice for /var in ZFS. Is it exactly the same as in UFS, or is there anything different? I'm not sure there's a defined "best practice". Maybe someone else can answer that question. My guess is that in environments where, before, a separate ufs /var slice was used, a separate zfs /var dataset with a quota might now be appropriate. Lori Could someone share some thoughts? Thanks Arjun

Traditionally, the reason for a separate /var was one of two major items:

(a) /var was writable, and / wasn't - this was typical of diskless or minimal local-disk configurations. Modern packaging systems are making this kind of configuration increasingly difficult.

(b) /var held a substantial amount of data, which needed to be handled separately from / - mail and news servers are a classic example.

For typical machines nowadays, with large root disks, there is very little chance of /var suddenly exploding and filling / (the classic example of being screwed...).
Outside of the above two cases, about the only other place I can see that having /var separate is a good idea is for certain test machines, where you expect frequent memory dumps (in /var/crash) - if you have a large amount of RAM, you'll need a lot of disk space, so it might be good to limit /var in this case by making it a separate dataset. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot remove zil device
On 3/27/2011 7:48 AM, Jordan McQuown wrote: The pool was originally created on 2008.11, then upgraded to 2009.06 and again upgraded to snv-134. Originally I was using Intel X25-Es as the ZIL, but recently purchased a ZeusRAM; when trying to remove the Intel, I issue the zpool remove command and it appears to work. Yet when I do a zpool status, the Intel device is still a pool member. Any thoughts?

Did you remember to upgrade the pool itself? Not just the OS, but the pool, too, needs to be upgraded. You'll need to run a 'zpool upgrade' on the pool with the log device. Only later versions of ZFS supported removal of log devices. IIRC, the zpool version supported by 2008.11 definitely didn't, and I'm pretty sure the 2009.06 version didn't either, but b134 definitely does.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] best migration path from Solaris 10
On 3/23/2011 6:14 AM, Deano wrote: OpenIndiana and others (e.g. BeleniX) are distributions that actively support full desktop workstations based on the Illumos base. It is true that the storage server application is a popular one, and so it has supporters both commercial and otherwise. ZFS is amazing, and quite rightly it stands out; it works even better when used with zones, crossbow, dtrace, etc., so it's obvious why it's a focus and often seems the only priority. However, it isn't the only interest, by a long shot. The SFE package repositories have many packages available to install for when the binary packages aren't up to date. OpenIndiana is hard at work trying to build bigger binary repositories with more apps and newer versions. A simple "pkg install APPLICATION" is the aim for the majority of main applications. Is it not moving fast enough, or missing the packages you need? Well, that's the beauty of Open Source: we welcome newcomers, and have systems to help them add and update the packages and applications they want, so we all benefit. Ultimately I'd (and I'm sure many would) like to have binary repositories on a level similar to Debian's, with stable and faster-changing repos and support for many different applications; however, that requires a lot of work and manpower. Bye, Deano

Honestly (and I say this from purely personal preference and bias, not any official statement), I see the long-term future of Solaris (and Illumos-based distros) as the new engine for appliances, supplanting Linux and the *BSDs in that space. For a lot of reasons, Solaris has a long list of very superior functionality that makes it very appealing for appliance makers. Right now, we see that in two areas: ZFS for storage, and high scalability for DBs (the various Oracle Exadata stuff). I'm expecting to see a whole raft of things start to show up - JVM container systems (Run Your App Server in SUPERMAN MODE!
), online backup devices, firewall appliances, spam and mail filter systems, intrusion detection systems, maybe even software routers, etc... It's here that I think Solaris' strengths can beat its competitors, and where its weaknesses aren't significant. Sadly, I think Solaris' future as a general-purpose OS is likely finished. Of course, that's just my reading of the tea leaves... -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A resilver record?
On 3/21/2011 3:25 PM, Richard Elling wrote: On Mar 21, 2011, at 5:09 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Richard Elling How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev. What the heck? Yes it is. Indirectly. When you say it depends on the amount of data, speed of resilvering device, etc, what you really mean (correctly) is that it depends on the total number of used blocks that must be resilvered on the resilvering device, multiplied by the access time for the resilvering device. And of course, throttling and usage during resilver can have a big impact. And various other factors. But the controllable big factor is the number of blocks used in the degraded vdev. There is no direct correlation between the number of blocks and resilver time. Just to be clear here, remember block != slab. Slab is the allocation unit often seen through the "recordsize" attribute. The number of data *slabs* directly correlates to resilver time. So here is how the number of devices in the vdev matter: If you have your whole pool made of one vdev, then every block in the pool will be on the resilvering device. You must spend time resilvering every single block in the whole pool. If you have the same amount of data, on a pool broken into N smaller vdev's, then approximately speaking, 1/N of the blocks in the pool must be resilvered on the resilvering vdev. And therefore the resilver goes approximately N times faster. Nope. The resilver time is dependent on the speed of the resilvering disk. Well, unless my previous posts are completely wrong, I can't see how resilver time is primarily bounded by speed (i.e bandwidth/throughput) of the HD for the vast majority of use cases. 
The IOPS and raw speed of the underlying backing store help define how fast the workload (i.e. total used slabs) gets processed. The layout of the vdev, and the on-disk data distribution, define the total IOPS required to resilver the slab workload. Most data distribution/vdev layout combinations will result in an IOPS-bound resilver disk, not a bandwidth-saturated resilver disk.

So if you assume the size of the pool or the number of total disks is a given, determined by outside constraints and design requirements, and you then face the decision of how to architect the vdevs in your pool, then yes, the number of devices in a vdev does dramatically impact the resilver time - only because the number of blocks written in each vdev depends on these decisions you made earlier.

I do not think it is wise to set the vdev configuration based on a model for resilver time. Choose the configuration to get the best data protection. -- richard

Depends on the needs of the end-user. I can certainly see places where it would be better to build a pool out of RAIDZ2 devices rather than RAIDZ3 devices. And, of course, the converse. Resilver times should be a consideration in building your pool, just like performance and disk costs are. How much you value it, of course, is up to the end-user.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
Re: [zfs-discuss] A resilver record?
On 3/20/2011 2:23 PM, Richard Elling wrote: On Mar 20, 2011, at 12:48 PM, David Magda wrote: On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote: It all depends on the number of drives in the VDEV(s), traffic patterns during resilver, speed VDEV fill, of drives etc. Still, close to 6 days is a lot. Can you detail your configuration? How many times do we have to rehash this? The speed of resilver is dependent on the amount of data, the distribution of data on the resilvering device, speed of the resilvering device, and the throttle. It is NOT dependent on the number of drives in the vdev. Thanks for clearing this up - I've been told large VDEVs lead to long resilver times, but then, I guess that was wrong. There was a thread ("Suggested RaidZ configuration...") a little while back where the topic of IOps and resilver time came up: http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633 I think this message by Erik Trimble is a good summary: hmmm... I must've missed that one, otherwise I would have said... Scenario 1:I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk. (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive. Here, the IOPS doesn't matter because the limit will be the media write speed of the resilvering disk -- bandwidth. Scenario 2:I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means, each IOPS to the failed drive only write 14k. So, it takes 1TB/14k = 71e6 IOPS to complete. Here, IOPS might matter, but I doubt it. Where we see IOPS matter is when the block sizes are small (eg. metadata). In some cases you can see widely varying resilver times when the data is large versus small. 
These changes follow the temporal distribution of the original data. For example, if a pool's life begins with someone loading their MP3 collection (large blocks, mostly sequential) and then working on source code (small blocks, more random, lots of creates/unlinks), then the resilver will be bandwidth-bound as it resilvers the MP3s and then IOPS-bound as it resilvers the source. Hence, the prediction of when resilver will finish is not very accurate.

From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS-bound by the single disk resilvering, you have a fixed limit.

You will not always be IOPS bound by the resilvering disk. You will be speed bound by the resilvering disk, where speed is either write bandwidth or random write IOPS. -- richard

Really? Can you really be bandwidth-limited on a (typical) RAIDZ resilver? I can see where you might be on a mirror, with large slabs and essentially sequential read/write - that is, since the drivers can queue up several read/write requests at a time, you have the potential to be reading/writing several (let's say 4) 128k slabs per single IOPS. That means you read/write at 512k per IOPS for a mirror (best-case scenario). For a 7200RPM drive, that's 100 IOPS x .5MB/IOPS = 50MB/s, which is lower than the maximum throughput of a modern SATA drive. For one of the 15k SAS drives able to do 300 IOPS, you get 150MB/s, which indeed exceeds a SAS drive's write bandwidth.

For RAIDZn configs, however, you're going to be limited on the size of an individual read/write. As Roy pointed out before, the max size of an individual portion of a slab is 128k/X, where X = number of data drives in RAIDZn.
So, for a typical 4-data-drive RAIDZn, even in the best-case scenario where I can queue multiple slab requests (say 4) into a single IOPS, that means I'm likely to top out at about 128k of data to write to the resilvered drive per IOPS. Which leads to 12MB/s for the 7200RPM drive, and 36MB/s for the 15k drive, both well under their respective bandwidth capability. Even with large slab sizes, I really can't see any place where a RAIDZ resilver isn't going to be IOPS-bound when using HDs as backing store. Mirrors are more likely, but still, even in that case, I think you're going to hit the IOPS barrier far more often than the bandwidth barrier.

Now, with SSDs as backing store, yes, you become bandwidth-limited, because the IOPS values of SSDs are at least an order of magnitude greater than HDs, though both have similar max bandwidth characteristics.

Now, the *total* time it takes to resilver either a mirror or RAIDZ is indeed primarily dependent on the number of allocated slabs in the vdev, and the level of fragmentation of those slabs. That essentially sets the size of the resilver workload; the vdev layout and the speed of the resilvering disk then determine how fast that workload can be processed.
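The back-of-envelope throughput figures in the last two posts can be sketched as follows. The 4-slabs-per-I/O queuing assumption comes from the discussion above; the exact MB/s values differ slightly from the rounded 50/12/36 figures in the posts:

```python
def resilver_mb_per_s(iops, slab_bytes, data_disks, queued_slabs=4):
    """Back-of-envelope write rate to the resilvering disk.

    Each slab contributes slab_bytes/data_disks to the rebuilt disk
    (data_disks=1 models a mirror), and we assume the driver can queue
    queued_slabs slabs per physical I/O, as in the discussion above."""
    per_io = (slab_bytes // data_disks) * queued_slabs
    return iops * per_io / 1e6  # decimal MB/s

slab = 128 * 1024
# Mirror: whole 128k slabs, 4 per I/O -> ~50MB/s at 7200rpm (~100 IOPS)
mirror_7200 = resilver_mb_per_s(100, slab, 1)
# RAIDZ, 4 data disks: 32k per slab per disk -> ~13MB/s at 7200rpm
raidz_7200 = resilver_mb_per_s(100, slab, 4)
# Same layout on a 15k drive (~300 IOPS) -> ~39MB/s
raidz_15k = resilver_mb_per_s(300, slab, 4)
print(f"mirror@7200rpm: {mirror_7200:.1f} MB/s, "
      f"raidz@7200rpm: {raidz_7200:.1f} MB/s, "
      f"raidz@15k: {raidz_15k:.1f} MB/s")
```

All three RAIDZ figures stay well under a modern drive's sequential write bandwidth, which is the point of the argument: on spinning disks, a RAIDZ resilver is IOPS-bound, not bandwidth-bound.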
Re: [zfs-discuss] sorry everyone was: Re: External SATA drive enclosures + ZFS?
On Fri, 2011-02-25 at 20:29 -0800, Yaverot wrote: > Sorry all, didn't realize that half of Oracle would auto-reply to a public > mailing list since they're out of the office 9:30 Friday nights. I'll try to > make my initial post each month during daylight hours in the future. > ___ Nah, probably just a Beehive (our mail system) burp. Happens a lot. Besides, it's 8:45 PST here, and I'm still at work. :-) -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv initial data load
On 2/16/2011 8:08 AM, Richard Elling wrote: On Feb 16, 2011, at 7:38 AM, white...@gmail.com wrote: Hi, I have a very limited amount of bandwidth between main office and a colocated rack of servers in a managed datacenter. My hope is to be able to zfs send/recv small incremental changes on a nightly basis as a secondary offsite backup strategy. My question is about the initial "seed" of the data. Is it possible to use a portable drive to copy the initial zfs filesystem(s) to the remote location and then make the subsequent incrementals over the network? If so, what would I need to do to make sure it is an exact copy? Thank you, Yes, and this is a good idea. Once you have replicated a snapshot, it will be an exact replica -- it is an all-or-nothing operation. You can then make more replicas or incrementally add snapshots. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss To follow up on Richard's post, what you want to do is a perfectly good way to deal with moving large amounts of data via Sneakernet. :-) I'd suggest that you create a full zfs filesystem on the external drive, and use 'zfs send/receive' to copy a snapshot from the production box to there, rather than try to store just a file from the output of 'zfs send'. You can then 'zfs send/receive' that backup snapshot from the external drive onto your remote backup machine when you carry the drive over there later. As Richard mentioned, that snapshot is unique, and it doesn't matter that you "recovered" it onto an external drive first, then copied that snapshot over to the backup machine. It's a frozen snapshot, so you're all good for future incrementals. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] One LUN per RAID group
On 2/15/2011 1:37 PM, Torrey McMahon wrote: On 2/14/2011 10:37 PM, Erik Trimble wrote: That said, given that SAN NVRAM caches are true write caches (and not a ZIL-like thing), it should be relatively simple to swamp one with write requests (most SANs have little more than 1GB of cache), at which point, the SAN will be blocking on flushing its cache to disk. Actually, most array controllers now have 10s if not 100s of GB of cache. The 6780 has 32GB, DMX-4 has - if I remember correctly - 256. The latest HDS box is probably close if not more. Of course you still have to flush to disk and the cache flush algorithms of the boxes themselves come into play but 1GB was a long time ago. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss STK2540 and the STK6140 have at most 1GB. STK6180 has 4GB. The move to large GB caches is only recent - only large setups (i.e. big arrays with a dedicated SAN head) have had multi-GB NVRAM cache for any length of time. In particular, pretty much all base arrays still have 4GB or less on the enclosure controller - only in the SAN heads do you find big multi-GB caches. And, lots (I'm going to be brave and say the vast majority) of ZFS deployments use direct-attach arrays or internal storage, rather than large SAN configs. Lots of places with older SAN heads are also going to have much smaller caches. Given the price tag of most large SANs, I'm thinking that there are still huge numbers of 5+ year-old SANs out there, and practically all of them have only a dozen GB or less of cache. So, yes, big modern SAN configurations have lots of cache. But they're also the ones most likely to be hammered with huge amounts of I/O from multiple machines. All of which makes it relatively easy to blow through the cache capacity and slow I/O back down to the disk speed. 
Once you get back down to raw disk speed, having multiple LUNs per raid array is almost certainly going to perform worse than a single LUN, due to thrashing. That is, it would certainly be better (i.e. faster) for an array to have to commit 1 128k slab than 4 32k slabs. So, the original recommendation is interesting, but needs the caveat that you'd really only use it if you can either limit the amount of sustained I/O you have, or are using very-large-cache disk setups. I would think the idea might also apply (i.e. be useful) for something like the F5100 or similar RAM/Flash arrays. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] One LUN per RAID group
On 2/14/2011 3:52 PM, Gary Mills wrote: On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote: On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills wrote: Is there any reason not to use one LUN per RAID group? [...] In other words, if you build a zpool with one vdev of 10GB and another with two vdev's each of 5GB (both coming from the same array and raid set) you get almost exactly twice the random read performance from the 2x5 zpool vs. the 1x10 zpool. This finding is surprising to me. How do you explain it? Is it simply that you get twice as many outstanding I/O requests with two LUNs? Is it limited by the default I/O queue depth in ZFS? After all, all of the I/O requests must be handled by the same RAID group once they reach the storage device. Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot spares), you get substantially better random read performance using 10 LUNs vs. 1 LUN. While inconvenient, this just reflects the scaling of ZFS with the number of vdevs, not the number of "spindles". I'm going to go out on a limb here and say that you get the extra performance under one condition: you don't overwhelm the NVRAM write cache on the SAN device head. So long as the SAN's NVRAM cache can acknowledge the write immediately (i.e. it isn't full with pending commits to backing store), then, yes, having multiple write commits coming from different ZFS vdevs will obviously give more performance than a single ZFS vdev. That said, given that SAN NVRAM caches are true write caches (and not a ZIL-like thing), it should be relatively simple to swamp one with write requests (most SANs have little more than 1GB of cache), at which point, the SAN will be blocking on flushing its cache to disk. So, if you can arrange your workload to avoid exceeding the maximum write load of the SAN's raid array over a defined period, then, yes, go with the multiple LUN/array setup. 
In particular, I would think this would be excellent for small-write/latency-sensitive applications, where the total amount of data written (over several seconds) isn't large, but where latency is critical. For larger I/O requests (or for consistent, sustained I/O of more than small amounts), all bets are off as far as any possible advantage of multiple LUNs/array. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
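For reference, the 1x10 vs. 2x5 experiment described in this thread boils down to something like the following. The device names are placeholders for two LUN layouts exported from the same RAID group:

```shell
# Pool A: one 10GB LUN as a single top-level vdev
zpool create poolA c2t0d0

# Pool B: two 5GB LUNs from the same RAID group, giving two top-level vdevs
# (and therefore twice the concurrent I/O queues from ZFS's point of view)
zpool create poolB c2t1d0 c2t2d0

# Watch per-vdev throughput while a random-read benchmark runs against each
zpool iostat -v poolA 5
zpool iostat -v poolB 5
```

The layout is the only variable; the backing spindles are identical, which is what makes the 2x result about ZFS queueing rather than disk count.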
Re: [zfs-discuss] CPU Limited on Checksums?
On 2/8/2011 8:41 AM, Krunal Desai wrote: Hi all, My system is powered by an Intel Core 2 Duo (E6600) with 8GB of RAM. Running into some very heavy CPU usage. First, a copy from one zpool to another (cp -aRv /oldtank/documents* /tank/documents/*), both in the same system. Load averages are around ~4.8. I think I used lockstat correctly, and found the following:

movax@megatron:/tank# lockstat -kIW -D 20 sleep 30
Profiling interrupt: 2960 events in 30.516 seconds (97 events/sec)
Count indv cuml rcnt   nsec Hottest CPU+PIL  Caller
---
 1518  51%  51% 0.00   1800 cpu[0]           SHA256TransformBlocks
  334  11%  63% 0.00   2820 cpu[0]           vdev_raidz_generate_parity_pq
  261   9%  71% 0.00   3493 cpu[0]           bcopy_altentry
  119   4%  75% 0.00   3033 cpu[0]           mutex_enter
   73   2%  78% 0.00   2818 cpu[0]           i86_mwait

So, obviously here it seems checksum calculation is, to put it mildly, eating up CPU cycles like none other. I believe it's bad(TM) to turn off checksums? (zfs property just has checksum=on; I guess it has defaulted to SHA256 checksums?) Second, a copy from my desktop PC to my new zpool (5900rpm drive over GigE to 2 6-drive RAID-Z2s). Load averages are around ~3. Again, with lockstat:

movax@megatron:/tank# lockstat -kIW -D 20 sleep 30
Profiling interrupt: 2919 events in 30.089 seconds (97 events/sec)
Count indv cuml rcnt   nsec Hottest CPU+PIL  Caller
---
 1298  44%  44% 0.00   1853 cpu[0]           i86_mwait
  301  10%  55% 0.00   2700 cpu[0]           vdev_raidz_generate_parity_pq
  144   5%  60% 0.00   3569 cpu[0]           bcopy_altentry
  103   4%  63% 0.00   3933 cpu[0]           ddi_getl
   83   3%  66% 0.00   2465 cpu[0]           mutex_enter

Here it seems as if 'i86_mwait' is occupying the top spot (is this because I have power-management set to poll my CPU?). Is something odd happening drive-buffer-wise? (i.e. coming in on NIC, buffered in the HBA somehow, and then flushed to disks?) In either case, it seems I'm hitting a ceiling of around 65MB/s. I assume CPU is bottlenecking, since bonnie++ benches resulted in much better performance for the vdev. 
In the latter case, though, it may just be a limitation of the source drive (if it can't read data faster than 65MB/s, I can't write faster than that...). e: E6600 is a first-generation 65nm LGA775 CPU, clocked at 2.40GHz. Dual-core, no hyper-threading. In the second case, you're almost certainly hitting a network bottleneck. If you don't have Jumbo Frames turned on at both ends of the connection, and a switch that understands JF, then the CPU is spending a horrible amount of time doing interrupts, trying to re-assemble small packets. I also think you might be running into an erratum around old Xeons, CR 6588054. This was fixed in kernel patch 127128-11, included in s10u5 or later. Otherwise, it might be an issue with the powersave functionality of certain Intel CPUs. In either case, try putting this in your /etc/system: set idle_cpu_prefer_mwait = 0 If that fix causes an issue (and there are reports that it occasionally does), you'll need to boot without your /etc/system: append the '-a' flag to the end of the GRUB menu entry that you boot from. This will push you into an interactive boot, where, when it asks you for a /etc/system to use, specify /dev/null. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
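For anyone wanting to reproduce the diagnosis and then apply the workaround from this thread, the steps look roughly like this (the lockstat invocation is the one used in the original post; the rest is a sketch, not a verified recipe):

```shell
# Sample kernel profiling interrupts for 30 seconds, showing the top 20 call
# sites; a dominant i86_mwait entry points at the idle-loop issue, not real work
lockstat -kIW -D 20 sleep 30

# Workaround from the thread: disable the mwait idle-loop preference.
# Requires root; takes effect on the next boot.
echo 'set idle_cpu_prefer_mwait = 0' >> /etc/system
init 6
```

If the machine then misbehaves at boot, the recovery path described above (boot with '-a' and give /dev/null as the /etc/system to use) undoes the setting for that boot.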
Re: [zfs-discuss] Sil3124 Sata controller for ZFS on Sparc OpenSolaris Nevada b130
On 2/8/2011 2:17 AM, Tomas Ögren wrote: On 08 February, 2011 - Robert Soubie sent me these 1,1K bytes: On 08/02/2011 07:10, Jerry Kemp wrote: As part of a small home project, I have purchased a SIL3124 hba in hopes of attaching an external drive/drive enclosure via eSATA. The host in question is an old Sun Netra T1 currently running OpenSolaris Nevada b130. The card in question is this Sil3124 card: http://www.newegg.com/product/product.aspx?item=N82E16816124003 although I did not purchase it from Newegg. I specifically purchased this card as I have seen specific reports of it working under Solaris/OpenSolaris distros on several Solaris mailing lists. I use a non-eSATA version of this card under Solaris Express 11 for a boot mirrored ZFS pool. And another one for a Windows 7 machine that does backups of the server. BIOS and drivers are available from the Silicon Image site, but nothing for Solaris. The problem itself is SPARC vs x86 and firmware for the card. AFAIK, there is no SATA card with drivers for Solaris SPARC. Use a SAS card. /Tomas Tomas is correct. This is a hardware issue, not an OS driver one. In order to use a card with SPARC, its firmware must be OpenBoot-aware. Pretty much all consumer SATA cards only have PC BIOS firmware, as there is no market for sales to SPARC folks. However, several of the low-end SAS cards ($100 or so) also have OpenBoot firmware available, in addition to PC BIOS firmware. In particular, the LSI1068 series-based HBAs are a good place to look. Note that you *might* have to flash the new OpenBoot firmware onto the card - cards don't come with both PC-BIOS and OpenBoot firmware. Be sure to check the OEM's web site to make sure that the card is explicitly supported for SPARC, not just "Solaris". -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 2/7/2011 1:10 PM, Yi Zhang wrote: [snip] This is actually what I did for 2.a) in my original post. My concern there is that ZFS' internal write buffering makes it hard to get a grip on my application's behavior. I want to present my application's "raw" I/O performance without too many outside factors... UFS plus directio gives me exactly (or close to) that but ZFS doesn't... Of course, in the final deployment, it would be great to be able to take advantage of ZFS' advanced features such as I/O optimization. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss And there's your answer. You seem to care about doing "bare-metal" I/O for tuning of your application, so you can do consistent measurements - not for actual usage in production. Therefore, do what's implied in the above: develop your app using UFS w/directio to work out the application issues and tune it. When you deploy it, use ZFS and its caching techniques to get maximum (though not absolutely consistently measurable) performance for the already-tuned app. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
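A minimal sketch of the tuning-phase setup suggested here - UFS with forced direct I/O so the filesystem cache stays out of the measurements. The device path and mount point are hypothetical:

```shell
# Mount a UFS filesystem with directio for "bare-metal" application benchmarks
mount -F ufs -o forcedirectio /dev/dsk/c0t1d0s6 /appdata

# Or flip the option on an already-mounted UFS filesystem via remount
mount -F ufs -o remount,forcedirectio /dev/dsk/c0t1d0s6 /appdata
```

With forcedirectio, reads and writes bypass the page cache, so the numbers you measure are much closer to what the application actually issues.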
Re: [zfs-discuss] deduplication requirements
On 2/7/2011 1:06 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Michael Core i7 2600 CPU 16gb DDR3 Memory 64GB SSD for ZIL (optional) Would this produce decent results for deduplication of 16TB worth of pools or would I need more RAM still? What matters is the amount of unique data in your pool. I'll just assume it's all unique, but of course that's ridiculous because if it's all unique then why would you want to enable dedup. But anyway, I'm assuming 16T of unique data. The rule is a little less than 3G of ram for every 1T of unique data. In your case, 16*2.8 = 44.8G ram required in addition to your base ram configuration. You need at least 48G of ram. Or less unique data. To follow up on Ned's estimation, please let us know what kind of data you're planning on putting in the Dedup'd zpool. That can really give us a better idea as to the number of slabs that the pool will have, which is what drives dedup RAM and L2ARC usage. You also want to use an SSD for L2ARC, NOT for ZIL (though, you *might* also want one for ZIL, depending on your write patterns). In all honesty, these days, it doesn't pay to dedup a pool unless you can count on large amounts of common data. Virtual Machine images, incremental backups, ISO images of data CD/DVDs, and some Video are your best bet. Pretty much everything else is going to cost you more in RAM/L2ARC than it's worth. IMHO, you don't want Dedup unless you can *count* on a 10x savings factor. Also, for reasons discussed here before, I would not recommend a Core i7 for use as a fileserver CPU. It's an Intel Desktop CPU, and almost certainly won't support ECC RAM on your motherboard, and it's seriously overpowered for your use. See if you can find a nice socket AM3+ motherboard for a low-range Athlon X3/X4. You can get ECC RAM for it (even in a desktop motherboard), it will cost less, and perform at least as well. Dedup is not CPU intensive. 
Compression is, and you may very well want to enable that, but you're still very unlikely to hit a CPU bottleneck before RAM starvation or disk wait occurs. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
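Ned's "~3G of ram for every 1T" rule is just arithmetic over the expected block count. A sketch of the estimate, assuming the commonly quoted figure of roughly 320 bytes of DDT entry per block and a 128K average block size (both are rules of thumb from this era, not exact):

```shell
# Rough DDT memory estimate for 16TB of unique data at 128K average block size
pool_tb=16
blocks=$(( pool_tb * 1024 * 1024 * 1024 / 128 ))   # KB of data / 128KB per block
ddt_bytes=$(( blocks * 320 ))                      # ~320 bytes per DDT entry
ddt_gb=$(( ddt_bytes / 1024 / 1024 / 1024 ))
echo "~${ddt_gb} GB of RAM/L2ARC needed for the DDT"
```

This comes out around 40GB, versus Ned's 44.8GB; he used a slightly fatter per-TB constant. Either way the answer is "tens of GB on top of your base RAM."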
Re: [zfs-discuss] ZFS and TRIM - No need for TRIM
On 2/6/2011 3:51 AM, Orvar Korvar wrote: Ok, so can we say that the conclusion for a home user is: 1) Using SSD without TRIM is acceptable. The only drawback is that without TRIM, the SSD will write much more, which affects lifetime. Because when the SSD has written enough, it will break. I don't have high demands for my OS disk, so battery backup is overkill for my needs. So I can happily settle for the next-gen Intel G3 SSD disk, without worrying the SSD will break because Solaris has no TRIM support yet? Yes. All modern SSDs will wear out, but, even without TRIM support, it will be a significant time (5+ years) before they do. Internal wear-leveling by the SSD controller results in an expected lifespan about the same as hard drives. TRIM really only impacts performance. For the ZFS ZIL use case, TRIM has only a small impact on performance - SSD performance for ZIL drops off quickly from peak, and supporting TRIM would only slightly mitigate this. For home use, lack of TRIM support won't noticeably change your performance as a ZIL cache or lower the lifespan of the SSD. The Intel X25-M (either G3 or G2) would be sufficient for your purposes. In general, we do strongly recommend you put a UPS on your system, to avoid cache corruption in case of power outages. 2) And later, when Solaris gets TRIM support, should I reformat or is there no need to reformat? I mean, maybe I must format and reinstall to get TRIM all over the disk. Or will TRIM immediately start to do its magic? If/when TRIM is supported by ZFS, I would expect this to be transparent to you, the end-user. You'd have to upgrade the OS to the proper new patchlevel, and *possibly* run a 'zpool upgrade' to update the various pools to the latest version, but I suspect the latter will be completely unnecessary. TRIM support would come in ZFS's guts, not in the pool format. 
Worst case is that you'd have to enable TRIM at the device layer, which would probably entail either editing a config file and rebooting, or just running some command to enable the feature. I can't imagine it would require any reformatting or reinstalling. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM - No need for TRIM
On 2/5/2011 5:44 AM, Orvar Korvar wrote: So... Sun's SSD used for ZIL and L2ARC does not use TRIM, so how big a problem is lack of TRIM in ZFS really? It should not hinder anyone from running without TRIM? I didn't really understand the answer to this question. Because Sun's SSD does not use TRIM - and it is not considered a hindrance? A home user could use SSD without greater problems? If I format the disk every year and reinstall, would that help? I am concerned as a home user... As a home user, you don't really care about support for TRIM. Even a reasonable SSD (i.e. not "Enterprise" level) will provide a very significant boost when used as a ZIL, over just having bare drives (in particular, if you just have SATA 7200 or 5400 RPM drives, you will really notice the boost). Ideally, you want an SSD with a battery or supercapacitor, but they're pretty expensive right now (i.e. OCZ's Vertex 2 Pro series). If you have a dependable UPS for the system, and can accept the (small) risk that you might lose the last write commit in certain power-loss scenarios, a mid-line SSD is entirely OK. Note that a ZIL cache device is NOT A WRITE CACHE. ZIL is there for synchronous writes only, so that ZFS can essentially turn synchronous writes into asynchronous writes. NFS is a big sync() write user, but Samba is not. You won't notice any improvement over Samba when using a ZIL cache. Don't bother with a reformat of the SSD. It won't help - at most, a yearly reformat would reset absolutely everything inside it, but you won't notice any real difference even after you do. Regarding the DDRAM SSDs that don't need TRIM, they seem interesting. I got a mail from the CTO of DDRdrive who did not want to mail publicly to this list (advertisement, he said), but it seems expensive; the DDRdrive X1, which is targeted at the Enterprise, costs $2000 USD. Are there any home user variants? 
But reading this presentation, it seems to have a point of using DDRAM variants: http://www.ddrdrive.com/zil_accelerator.pdf Quite interesting info. There are a couple of things similar to the DDRdrive, but they're all Enterprise-class, and going to cost you. There's no $200 solution. Which is kind of sad, since building a DDRdrive-like thing really just requires two 4Gbit DRAM chips, two 4Gbit NAND chips, a small battery, and an FPGA that does some basic PCI-E to Memory mapping (and a little bit of extra DRAM->NAND copying if power is lost). Really, it's entirely doable for someone with decent VLSI experience. The DDRdrive folks have got the experience to do it, but, honestly, it's not a complex product. Bottom line, it's maybe $50 in parts, plus a $100k VLSI Engineer to do the design. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and TRIM
On 2/4/2011 7:39 AM, Christopher George wrote: So, the bottom line is that Solaris 11 Express can not use TRIM and SSD? Correct. So, it might not be a good idea to use a SSD? It is true that a Flash based SSD will be adversely impacted by ZFS not supporting TRIM, especially for the ZIL accelerator. But a DRAM based SSD is immune to TRIM support status and thus unaffected. Actually, TRIM support would only add unnecessary overhead to the DDRdrive X1's device driver. Best regards, Christopher George Founder/CTO www.ddrdrive.com Bottom line here is this: for a ZIL, you have a hierarchy of performance, each about two orders of magnitude faster than the prior: 1. hard drive 2. NAND-based SSD 3. DRAM-based SSD You'll still get a very noticeable improvement from using a NAND (flash) SSD over not using a dedicated ZIL device. It just won't be the improvement "promised" by the SSD packaging. If that performance isn't sufficient for you, then a DRAM SSD is your best bet. Note that even if TRIM were supported, it wouldn't remove the whole penalty that a fully-written-to NAND SSD suffers. NAND requires that any block that was previously written to be erased BEFORE it can be written to again. TRIM only helps with using unwritten blocks inside pages, and with scheduling whole-page erasures inside the SSD controller. I can't put real numbers on it, but I would suspect that rather than suffer a 10x loss of performance, you might only lose 5x or so if TRIM were properly usable. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS not usable (was ZFS Dedup question)
On 1/28/2011 2:24 PM, Roy Sigurd Karlsbakk wrote: I created a zfs pool with dedup with the following settings: zpool create data c8t1d0 zfs create data/shared zfs set dedup=on data/shared The thing I was wondering about was it seems like ZFS only dedup at the file level and not the block. When I make multiple copies of a file to the store I see an increase in the dedup ratio, but when I copy similar files the ratio stays at 1.00x. I've done some rather intensive tests on zfs dedup on this 12TB test system we have. I have concluded that with some 150GB worth of L2ARC and 8GB ARC, ZFS dedup is unusable for volumes even at 2TB storage. It works, but it's dead slow in write terms, and the time to remove a dataset is still very long. I wouldn't recommend using ZFS dedup unless your name were Ahmed Nazif or Silvio Berlusconi, where the damage might be used for some good. Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- If you want Dedup to perform well, you *absolutely* must have an L2ARC device which can hold the *entire* Dedup Table. Remember, the size of the DDT is not dependent on the size of your data pool, but on the number of zfs slabs which are contained in that pool (slab = record, for this purpose). Thus, 12TB worth of DVD iso images (record size about 128k) will consume 256 times less DDT space than will 12TB filled with text configuration files (average record size < 512b). And, I doubt 8GB for ARC is sufficient, either, for a DDT consuming over 100GB of space. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
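The "256 times less DDT space" figure above falls directly out of the record counts; a quick sanity check in shell, using the record sizes quoted in the message:

```shell
# Number of DDT entries for 12TB of data at two different average record sizes
tb=12
bytes=$(( tb * 1024 * 1024 * 1024 * 1024 ))
iso_entries=$(( bytes / (128 * 1024) ))   # 128K records (DVD ISO images)
txt_entries=$(( bytes / 512 ))            # 512-byte records (tiny text files)
ratio=$(( txt_entries / iso_entries ))
echo "DDT entry ratio: ${ratio}x"         # 131072 / 512 = 256
```

Same pool size, same data volume: the DDT cost is driven entirely by how many records that volume is cut into.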
Re: [zfs-discuss] ZFS Dedup question
On 1/28/2011 1:48 PM, Nicolas Williams wrote: On Fri, Jan 28, 2011 at 01:38:11PM -0800, Igor P wrote: I created a zfs pool with dedup with the following settings: zpool create data c8t1d0 zfs create data/shared zfs set dedup=on data/shared The thing I was wondering about was it seems like ZFS only dedup at the file level and not the block. When I make multiple copies of a file to the store I see an increase in the deup ratio, but when I copy similar files the ratio stays at 1.00x. Dedup is done at the block level, not file level. "Similar files" does not mean that they actually share common blocks. You'll have to look more closely to determine if they do. Nico What Nico said. The big reason here is that blocks have to be ALIGNED on the same block boundaries to be dedup'd. That is, if I have a file which contains: AAABBCCDD if I have 4-character wide blocks, then if I copy the file, and append an "X" to the above file, making it look like: XAAABBCCDD There will be NO DEDUP in that case. This is what trips people up most of the time - they see "similar" files, but don't realize that "similar" for dedup has to mean aligned on block boundaries, not just "I've got the same 3k of data in both files". -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
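The AAABBCCDD example above can be mimicked mechanically. The toy sketch below chops strings into fixed 4-character "blocks" and counts matches; it models only the alignment effect, nothing about real ZFS checksumming or record sizes:

```shell
# Chop a string into fixed-width 4-character "blocks", one per line
blocks() { printf '%s\n' "$1" | fold -w 4; }

a="AAABBCCDD"      # original file contents
b="XAAABBCCDD"     # same bytes with one character prepended

# Blocks of a: AAAB BCCD D   Blocks of b: XAAA BBCC DD
# The one-byte shift means no block matches, so dedup would save nothing
common=$(printf '%s\n%s\n' "$(blocks "$a")" "$(blocks "$b")" | sort | uniq -d | wc -l)
echo "shared blocks: ${common}"
```

Copy the file unmodified instead and every block lines up, which is why exact copies dedup perfectly while "similar" files often don't.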
Re: [zfs-discuss] Shrinking a pool, Increasing hotspares
On Mon, 2011-01-24 at 13:56 -0800, Phillip V wrote:
> Hey all,
>
> I have a 10 TB root pool setup like so:
>   pool: s78
>  state: ONLINE
>  scrub: resilver completed after 2h0m with 0 errors on Wed Jan 19 22:04:39 2011
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         s78          ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c14t0d0  ONLINE       0     0     0
>             c7t0d0   ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t1d0   ONLINE       0     0     0
>             c14t1d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t2d0   ONLINE       0     0     0
>             c14t2d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t3d0   ONLINE       0     0     0
>             c14t3d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t4d0   ONLINE       0     0     0
>             c15t6d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t5d0   ONLINE       0     0     0
>             c14t5d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t6d0   ONLINE       0     0     0
>             c14t6d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c7t7d0   ONLINE       0     0     0
>             c14t7d0  ONLINE       0     0     0
>           mirror     ONLINE       0     0     0
>             c15t7d0  ONLINE       0     0     0
>             c14t4d0  ONLINE       0     0 1.11M  251G resilvered
>         logs         ONLINE       0     0     0
>           c15t2d0    ONLINE       0     0     0
>         spares
>           c15t0d0    AVAIL
>
> errors: No known data errors
>
> Only 2 TB of the pool are used and I would like to remove one mirror set from the pool and use the two drives in that mirror set as hot spares (for a total of 3 hot spares). Is this possible?

No. You cannot shrink the size of a pool, regardless of what type of vdev it is composed of. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 13:34 -0800, Philip Brown wrote: > > On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon > > wrote: > > > > ZFS's ability to handle "short-term" interruptions > > depend heavily on the > > underlying device driver. > > > > If the device driver reports the device as > > "dead/missing/etc" at any > > point, then ZFS is going to require a "zpool replace" > > action before it > > re-accepts the device. If the underlying driver > > simply stalls, then > > it's more graceful (and no user interaction is > > required). > > > > As far as what the resync does: ZFS does "smart" > > resilvering, in that > > it compares what the "good" side of the mirror has > > against what the > > "bad" side has, and only copies the differences over > > to sync them up. > > > > Hmm. Well, we're talking fibre, so we're very concerned with the recovery > mode when the fibre drivers have marked it as "failed". (except it hasnt > "really" failed, we've just had a switch drop out) > > I THINK what you are saying, is that we could, in this situation, do: > > zpool replace (old drive) (new drive) > > and then your "smart" recovery, should do the limited resilvering only. Even > for potentially long outages. > > Is that what you are saying? Yes. It will always look at the "replaced" drive to see if it was a prior member of the mirror, and do smart resilvering if possible. If the device path stays the same (which, hopefully, it should), you can even do: zpool replace (old device) (old device) -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
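A sketch of the recovery described above, once the fibre path is restored. The pool name 'tank' and device name are placeholders; the same-device form of zpool replace is exactly what the reply suggests:

```shell
# Reattach the same device; ZFS recognizes it as a former mirror member
# and resilvers only the blocks that changed during the outage
zpool replace tank c1t0d0 c1t0d0

# Monitor the (incremental) resilver progress
zpool status tank
```

If the outage renamed the device path, give the new path as the second argument instead and the smart resilver still applies, since ZFS identifies the disk by its label, not its path.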
Re: [zfs-discuss] How well does zfs mirror handle temporary disk offlines?
On Tue, 2011-01-18 at 14:51 -0500, Torrey McMahon wrote:
> On 1/18/2011 2:46 PM, Philip Brown wrote:
> > My specific question is, how easily does ZFS handle *temporary* SAN
> > disconnects, to one side of the mirror? What if the outage is only
> > 60 seconds? 3 minutes? 10 minutes? an hour?
>
> Depends on the multipath drivers and the failure mode. For example, if
> the link drops completely at the host hba connection, some failover
> drivers will mark the path down immediately, which will propagate up the
> stack faster than an intermittent connection or something farther
> downstream failing.
>
> > If we have 2x1TB drives in a simple zfs mirror, if one side goes
> > temporarily off line, will zfs attempt to resync **1 TB** when it comes
> > back? Or does it have enough intelligence to say, "oh hey I know this
> > disk... and I know [these bits] are still good, so I just need to resync
> > [that bit]"?
>
> My understanding is yes, though I can't find the reference for this. (I'm
> sure someone else will find it in short order.)

ZFS's ability to handle "short-term" interruptions depends heavily on the underlying device driver.

If the device driver reports the device as "dead/missing/etc" at any point, then ZFS is going to require a "zpool replace" action before it re-accepts the device. If the underlying driver simply stalls, then it's more graceful (and no user interaction is required).

As far as what the resync does: ZFS does "smart" resilvering, in that it compares what the "good" side of the mirror has against what the "bad" side has, and only copies the differences over to sync them up. This is one of ZFS's great strengths, in that most other RAID systems can't do this.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-317
Phone: x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
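The "smart" resilvering described above can be sketched as a toy model: give every block a birth transaction group (txg) and copy back only blocks born, or freed, while the mirror half was absent. This is only an illustration of the idea; it is not ZFS code, and the function and field names are invented for the sketch (ZFS itself tracks the outage window with its dirty-time logs).

```python
# Toy model of ZFS "smart" resilvering: after a temporary outage, only
# blocks written (or freed) while the device was absent are copied, so
# the work is proportional to the delta, not to the device size.
# Illustrative only -- not ZFS code; names are invented for the sketch.

def smart_resilver(good_side, stale_side, detach_txg):
    """Bring stale_side back in sync with good_side.

    Each side maps block_id -> (birth_txg, data); detach_txg is the
    transaction group at which the device dropped out of the mirror.
    """
    touched = 0
    for block_id, (birth_txg, data) in good_side.items():
        # Copy only blocks born after the device went away (or never seen).
        if birth_txg > detach_txg or block_id not in stale_side:
            stale_side[block_id] = (birth_txg, data)
            touched += 1
    # Drop blocks freed on the good side during the outage.
    for block_id in list(stale_side):
        if block_id not in good_side:
            del stale_side[block_id]
            touched += 1
    return touched

# Seven old blocks survive untouched; only the three post-outage writes move.
good = {i: (5, "data%d" % i) for i in range(7)}
good.update({i: (9, "new%d" % i) for i in range(7, 10)})
stale = {i: (5, "data%d" % i) for i in range(7)}
print(smart_resilver(good, stale, detach_txg=5))  # -> 3
```

A full resync would have copied all ten blocks; the diff-based pass touches only the three written during the outage, which is why a "zpool replace" of a briefly-absent mirror half completes quickly.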
Re: [zfs-discuss] Is my bottleneck RAM?
You can't really do that. Adding an SSD for L2ARC will help a bit, but L2ARC storage also consumes RAM to maintain a cache table of what's in the L2ARC. Using 2GB of RAM with an SSD-based L2ARC (even without Dedup) likely won't help you too much vs not having the SSD. If you're going to turn on Dedup, you need at least 8GB of RAM to go with the SSD. -Erik On Tue, 2011-01-18 at 18:35 +, Michael Armstrong wrote: > Thanks everyone, I think overtime I'm gonna update the system to include an > ssd for sure. Memory may come later though. Thanks for everyone's responses > > Erik Trimble wrote: > > >On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote: > >> I've since turned off dedup, added another 3 drives and results have > >> improved to around 148388K/sec on average, would turning on compression > >> make things more CPU bound and improve performance further? > >> > >> On 18 Jan 2011, at 15:07, Richard Elling wrote: > >> > >> > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: > >> > > >> >> Hi guys, sorry in advance if this is somewhat a lowly question, I've > >> >> recently built a zfs test box based on nexentastor with 4x samsung 2tb > >> >> drives connected via SATA-II in a raidz1 configuration with dedup > >> >> enabled compression off and pool version 23. 
From running bonnie++ I > >> >> get the following results: > >> >> > >> >> Version 1.03b --Sequential Output-- --Sequential Input- > >> >> --Random- > >> >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > >> >> --Seeks-- > >> >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP > >> >> /sec %CP > >> >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 > >> >> 429.8 1 > >> >> --Sequential Create-- Random > >> >> Create > >> >> -Create-- --Read--- -Delete-- -Create-- --Read--- > >> >> -Delete-- > >> >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > >> >> /sec %CP > >> >>16 7181 29 + +++ + +++ 21477 97 + +++ > >> >> + +++ > >> >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ > >> >> > >> >> > >> >> I'd expect more than 105290K/s on a sequential read as a peak for a > >> >> single drive, let alone a striped set. The system has a relatively > >> >> decent CPU, however only 2GB memory, do you think increasing this to > >> >> 4GB would noticeably affect performance of my zpool? The memory is only > >> >> DDR1. > >> > > >> > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, > >> > turn off dedup > >> > and enable compression. > >> > -- richard > >> > > >> > >> ___ > >> zfs-discuss mailing list > >> zfs-discuss@opensolaris.org > >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > > >Compression will help speed things up (I/O, that is), presuming that > >you're not already CPU-bound, which it doesn't seem you are. > > > >If you want Dedup, you pretty much are required to buy an SSD for L2ARC, > >*and* get more RAM. > > > > > >These days, I really don't recommend running ZFS as a fileserver without > >a bare minimum of 4GB of RAM (8GB for anything other than light use), > >even with Dedup turned off. 
> > > > > >-- > >Erik Trimble > >Java System Support > >Mailstop: usca22-317 > >Phone: x67195 > >Santa Clara, CA > >Timezone: US/Pacific (GMT-0800) > > -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
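The RAM cost of an L2ARC mentioned above can be put in rough numbers: every record cached on the SSD keeps a header in main memory. The per-header figure used below (~180 bytes) is an assumed round number often quoted for ZFS builds of that era, and the function is invented for illustration; check your release for the real value.

```python
# Back-of-envelope for the L2ARC/RAM trade-off: each record on the
# L2ARC SSD needs an ARC header held in RAM.  The 180-byte header size
# is an assumed, era-typical figure, not a constant from the source.

def l2arc_ram_overhead(l2arc_bytes, avg_record_bytes, header_bytes=180):
    """RAM (bytes) consumed indexing an L2ARC of l2arc_bytes."""
    records = l2arc_bytes // avg_record_bytes
    return records * header_bytes

GiB = 1024 ** 3
# A 100 GiB SSD full of 8 KiB records (random-read workload):
print(l2arc_ram_overhead(100 * GiB, 8 * 1024) / GiB)    # ~2.2 GiB of RAM
# The same SSD holding 128 KiB records is far cheaper to index:
print(l2arc_ram_overhead(100 * GiB, 128 * 1024) / GiB)  # ~0.14 GiB
```

On a 2 GB machine the first case alone would consume all of RAM, which is why an L2ARC device cannot substitute for memory.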
Re: [zfs-discuss] Is my bottleneck RAM?
On Tue, 2011-01-18 at 15:11 +, Michael Armstrong wrote: > I've since turned off dedup, added another 3 drives and results have improved > to around 148388K/sec on average, would turning on compression make things > more CPU bound and improve performance further? > > On 18 Jan 2011, at 15:07, Richard Elling wrote: > > > On Jan 15, 2011, at 4:21 PM, Michael Armstrong wrote: > > > >> Hi guys, sorry in advance if this is somewhat a lowly question, I've > >> recently built a zfs test box based on nexentastor with 4x samsung 2tb > >> drives connected via SATA-II in a raidz1 configuration with dedup enabled > >> compression off and pool version 23. From running bonnie++ I get the > >> following results: > >> > >> Version 1.03b --Sequential Output-- --Sequential Input- > >> --Random- > >> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > >> --Seeks-- > >> MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP > >> /sec %CP > >> nexentastor 4G 60582 54 20502 4 12385 3 53901 57 105290 10 > >> 429.8 1 > >> --Sequential Create-- Random > >> Create > >> -Create-- --Read--- -Delete-- -Create-- --Read--- > >> -Delete-- > >> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec > >> %CP > >>16 7181 29 + +++ + +++ 21477 97 + +++ + > >> +++ > >> nexentastor,4G,60582,54,20502,4,12385,3,53901,57,105290,10,429.8,1,16,7181,29,+,+++,+,+++,21477,97,+,+++,+,+++ > >> > >> > >> I'd expect more than 105290K/s on a sequential read as a peak for a single > >> drive, let alone a striped set. The system has a relatively decent CPU, > >> however only 2GB memory, do you think increasing this to 4GB would > >> noticeably affect performance of my zpool? The memory is only DDR1. > > > > 2GB or 4GB of RAM + dedup is a recipe for pain. Do yourself a favor, turn > > off dedup > > and enable compression. 
> > -- richard > > > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Compression will help speed things up (I/O, that is), presuming that you're not already CPU-bound, which it doesn't seem you are. If you want Dedup, you pretty much are required to buy an SSD for L2ARC, *and* get more RAM. These days, I really don't recommend running ZFS as a fileserver without a bare minimum of 4GB of RAM (8GB for anything other than light use), even with Dedup turned off. -- Erik Trimble Java System Support Mailstop: usca22-317 Phone: x67195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 1/3/2011 8:28 AM, Richard Elling wrote:
> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:
> > On 12/26/10 05:40 AM, Tim Cook wrote:
> > > On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.ell...@gmail.com> wrote:
> > > > There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.
> > >
> > > Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.
> >
> > Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet.
>
> I am quite sure you understand how pipelines work :-)
>  -- richard

I'm getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature's holding up a big chunk of work I'd like to push.

If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I'm pretty much at the point where I'm going to start diving into that chunk of the source to see if there's something little old me can do, and I'd far rather help on someone else's implementation than have to do it myself from scratch. I'd prefer a private contact, as I realize that such work may not be ready for public discussion yet.

Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL
On 12/29/2010 4:55 PM, Jason Warr wrote: HyperDrive5 = ACard ANS9010 I have personally been wanting to try one of these for some time as a ZIL device. Yes, but do remember these require a half-height 5.25" drive bay, and you really, really should buy the extra CF card for backup. Also, stay away from the ANS-9010S with LVD SCSI interface. As (I think) Bob pointed out a long time ago, parallel SCSI isn't good for a high-IOPS interface. It (the LVD interface) will throttle long before the drive does... I've been waiting for them to come out with a 3.5" version, one which I can plug directly into a standard 3.5" SAS/SATA hotswap bay... And, of course, the ANS9010 is limited to the SATA2 interface speed, so it is cheaper and lower-performing (but still better than an SSD) than the DDRdrive. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 12:16 PM, joerg.schill...@fokus.fraunhofer.de wrote:
> Erik Trimble wrote:
> > I've read Joerg's paper, and I've read several of the patents in question, and nowhere around is there any real code. A bit of
>
> Netapp filed patents (without code) in 1993. I of course have working code for SunOS-4.9 from 1991. See below for more information.
>
> > Joerg - your paper used to be available here (which is where I read it awhile ago), but not anymore:
> > http://www.fokus.gmd.de/research/cc/glone/employees/joerg.schilling/private/wofs.ps.gz

I just re-looked, and I now remember that I got it from that URL via the Internet Archive (archive.org). It's still available at the URL above from 2001 at the Archive.

> This address did go away in 2001 when the German government enforced integration of GMD into Fraunhofer. The old postscript version (created from troff) is here: http://cdrecord.berlios.de/private/wofs.ps.gz
>
> A few years ago, a friend helped me to add the images that originally had been created outside of troff and inserted the old way (using glue). Since 2006, there is a pdf version that includes the images: http://cdrecord.berlios.de/private/WoFS.pdf
>
> > Is there a better location? (and, a full English translation? I read it in German, but my German is maybe at 7th-grade level, so I might have missed some subtleties...)
>
> There is currently no English translation, and as a result of the legal situation in 1991, I could not publish the related implementation. Even getting the SunOS-4.0 source code in 1988 in order to allow the implementation was a bit tricky. Horst Winterhoff (Chief, Sun Germany and Sun Europe) asked Bill Joy for permission to give away the source for my Diploma Thesis. As a result of this, and the fact that there was no official howto from Sun for writing filesystems, I was forced to keep the implementation unpublished (as for the implementation of mmap() in wofs, I was forced to copy approx. 100 lines from the UFS code).
If the code you copied is currently still in the OpenSolaris codebase, then you're OK. But the SunOS codebase is significantly different from the Solaris one, so I wouldn't automatically assume that you can publish that code. Though, if your borrowing was restricted to the UFS implementation (and not the Virtual Memory/Filesystem caching stuff), your chances are good that it's still in the OpenSolaris codebase.

> Since June 2005, I would assume that the situation is different and there is no longer a problem with publishing the WOFS source. If people are interested, I could publish the unedited original state from 1991 (including the SCCS history for my implementation) even though it looks a bit messy.

For at least historical reasons, that would be nice. Though, I don't want to offer legal advice as to the possibility of problems, particularly for someone outside the US system. :-)

> I tried to verify whether the submission of the diploma thesis in 1991 is an official publication, and in theory it should be, as a copy is stored in the university library. Unfortunately, the university library is unable to find the paper. There are, however, many people who could confirm that the development really happened between 1988 and 1991.

If your thesis paper was available via LexisNexis, then it certainly should count as officially published for any legal system. If not, I suspect that different countries would have different standards for university theses.

> Maybe it is a good idea to send a mail to someone from eff.org?
>
> Jörg

Yup. They'd be the right people to talk to.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 11:19 AM, Tim Cook wrote:
> On Sat, Dec 25, 2010 at 1:10 PM, Erik Trimble <erik.trim...@oracle.com> wrote:
> > On 12/25/2010 6:25 AM, Edward Ned Harvey wrote:
> > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Joerg Schilling
> > > >
> > > > And people should note that Netapp filed their patents starting from 1993. This is 5 years after I started to develop WOFS, which is copy on write. This still
> > > >
> > > > In any case, this is 20 year old technology. Aren't patents something to protect new ideas?
> > >
> > > Boy, those guys must be really dumb to waste their time filing billion dollar lawsuits, protecting 20-year old technology, when it's so obvious that you and other people clearly invented it before them, and all the money they waste on lawyers can never achieve anything. They should all fire themselves. And anybody who defends against it can safely hire a law student for $20/hr to represent them, and just pull out your documents as defense, because that's so easy. Plus, as you said, the technology is so old, it should be worthless by now. Why are we all wasting our time in this list talking about irrelevant old technology, anyway?
> >
> > While that's a bit sarcastic there, Ned, it *should* be the literal truth. But, as the SCO/Linux suit showed, having no realistic basis for a lawsuit doesn't prevent one from being dragged through the (U.S.) courts for the better part of a decade. Why can't we have a loser-pays civil system like every other civilized country?
> >
> > --
> > Erik Trimble
> > Java System Support
> > Mailstop: usca22-123
> > Phone: x17195
> > Santa Clara, CA
> > Timezone: US/Pacific (GMT-0800)
>
> If you've got enough money, we do. You just have to make it to the end of the trial, and have a judge who feels similarly. They often award monetary settlements for the cost of legal defense to the victor.
>
> --Tim

Which is completely useless as a system.
I'm still significantly out-of-pocket for a suit that I shouldn't have had to fight in the first place, and the likelihood that I get to recover that money isn't good (defense cost awards aren't common). There's no disincentive to trolling the legal system, forcing settlements on those unable to fight a protracted suit, even if they're sure to win the case. Using the US legal system as a business strategy is evil, pure and simple, and one all too common nowadays. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 10:59 AM, Tim Cook wrote:
> On Sat, Dec 25, 2010 at 8:25 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
> > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Joerg Schilling
> > > >
> > > > And people should note that Netapp filed their patents starting from 1993. This is 5 years after I started to develop WOFS, which is copy on write. This still
> > > >
> > > > In any case, this is 20 year old technology. Aren't patents something to protect new ideas?
> >
> > Boy, those guys must be really dumb to waste their time filing billion dollar lawsuits, protecting 20-year old technology, when it's so obvious that you and other people clearly invented it before them, and all the money they waste on lawyers can never achieve anything. They should all fire themselves. And anybody who defends against it can safely hire a law student for $20/hr to represent them, and just pull out your documents as defense, because that's so easy. Plus, as you said, the technology is so old, it should be worthless by now. Why are we all wasting our time in this list talking about irrelevant old technology, anyway?
>
> Indeed. Isn't the Oracle database itself at least 20 years old? And Windows? And Solaris itself? All the employees of those companies should probably just start donating their time for free instead of collecting a paycheck since it's quite obvious they should no longer be able to charge for their product.
>
> What I find most entertaining is all the armchair lawyers on this mailing list that think they've got prior art when THEY'VE NEVER EVEN SEEN THE CODE IN QUESTION!
>
> --Tim

Well... I've read Joerg's paper, and I've read several of the patents in question, and nowhere around is there any real code. A bit of pseudo-code and some math, but no full, working code.
And, granted, I'm not an IP lawyer, but it does look like Joerg's work is prior art (and, given that the standard is supposed to be what someone in the industry would consider obvious, based on their knowledge, I think I qualify).

Which all points to the real problem of software patents - they're really patents on IDEAS, not on a specific implementation. Whoever the moron was that really thought that was OK (yes, I know who specifically, but in general...) should be shot. Copyright is fine for protecting software work, but patents?

Joerg - your paper used to be available here (which is where I read it awhile ago), but not anymore:
http://www.fokus.gmd.de/research/cc/glone/employees/joerg.schilling/private/wofs.ps.gz

Is there a better location? (and, a full English translation? I read it in German, but my German is maybe at 7th-grade level, so I might have missed some subtleties...)

[As obvious as it is, it should be pointed out, I'm making these statements as a very personal opinion, and I'm certain Oracle wouldn't have the same one. I in no way represent Oracle.]

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 12/25/2010 6:25 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Joerg Schilling And people should note that Netapp filed their patents starting from 1993. This is 5 years after I started to develop WOFS, which is copy on write. This still In any case, this is 20 year old technology. Aren't patents something to protect new ideas? Boy, those guys must be really dumb to waste their time filing billion dollar lawsuits, protecting 20-year old technology, when it's so obvious that you and other people clearly invented it before them, and all the money they waste on lawyers can never achieve anything. They should all fire themselves. And anybody who defends against it can safely hire a law student for $20/hr to represent them, and just pull out your documents as defense, because that's so easy. Plus, as you said, the technology is so old, it should be worthless by now. Why are we all wasting our time in this list talking about irrelevant old technology, anyway? While that's a bit sarcastic there Ned, it *should* be the literal truth. But, as the SCO/Linux suit showed, having no realistic basis for a lawsuit doesn't prevent one from being dragged through the (U.S.) courts for the better part of a decade. Why can't we have a loser-pays civil system like every other civilized country? -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
On 12/23/2010 7:57 AM, Deano wrote:
> In an ideal world, if we could obtain details on how to reset/format blocks of an SSD, we could do it automatically, running behind the ZIL. As a log, it's going in one direction, so a background task could clean up behind it, making the performance lowering over time a non-issue for the ZIL. A first start may be calling unmap/trim on those blocks (which I was surprised to find in the source is already coded up in the SATA driver, just not used yet), but really a reset would be better. But as you say, a tool to say if it needs doing would be a good start. They certainly exist in closed source form...
>
> Deano
>
> -Original Message-
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> Sent: 23 December 2010 15:46
> To: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
>
> On Thu, Dec 23, 2010 at 07:35:29AM -0800, Deano wrote:
> > If anybody does know of any source to the secure erase/reformatters, I'll happily volunteer to do the port and then maintain it. I'm currently in talks with several SSD and driver chip hardware peeps with regard to getting datasheets for some SSD products etc. for the purpose of better support under the OI/Solaris driver model, but these things can take a while to obtain, so if anybody knows of existing open source versions I'll jump on it.
> >
> > Thanks, Deano
>
> A tool to help the end user know *when* they should run the reformatter tool would be helpful too. I know we can just wait until performance "degrades", but it would be nice to see what % of blocks are in use, etc.
>
> Ray

AFAIK, all the reformatter utilities are closed-source, direct from the SSD manufacturer. They talk directly to the drive firmware, so they're decidedly implementation-specific (I'd be flabbergasted if one worked on two different manufacturers' SSDs, even if they used the same basic controller). Many are DOS-based.
As Christopher noted, you'll get a drop-off in performance as soon as you collect enough sync writes to have written (in the aggregate) slightly more than the total capacity of the SSD (including the "extra" that most SSDs now have). That said, I would expect full TRIM support to make this somewhat better, as it could free up partially-used pages more frequently, and thus increase the time before performance drops (which is due to the page remapping/reshuffling demands on the SSD controller).

But, yes, SSDs are inherently less fast than DRAM. Their utility is entirely dependent on what your use case (and performance demands) are.

The longer-term solution is to have SSDs change how they are designed, moving away from the current one-page-of-multiple-blocks as the atomic unit of writing, and straight to a one-block-per-page setup. Don't hold your breath.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
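The performance cliff described above can be shown with a toy model: while the drive still has pre-erased pages, writes are cheap; once cumulative writes exceed raw capacity and nothing has been TRIMmed or low-level formatted, every write first pays for an erase. The timing constants are invented for illustration, not measurements of any real drive.

```python
# Toy model of SSD write-performance fall-off: writes into pre-erased
# pages are fast; once the free list is exhausted (aggregate writes >
# raw capacity, no TRIM/reformat), each write pays an erase first.
# The microsecond costs are made-up, illustrative figures.

FAST_WRITE_US = 50    # program a pre-erased page (assumed)
SLOW_WRITE_US = 2000  # erase-then-program cycle (assumed)

def mean_write_cost(total_pages, writes):
    """Average cost per page write over `writes` writes, starting fresh."""
    erased = total_pages  # factory-fresh: every page pre-erased
    cost = 0
    for _ in range(writes):
        if erased > 0:
            erased -= 1
            cost += FAST_WRITE_US
        else:
            cost += SLOW_WRITE_US  # free list gone: erase before program
    return cost / writes

print(mean_write_cost(1000, 1000))  # within capacity: 50.0 us/write
print(mean_write_cost(1000, 2000))  # past capacity: 1025.0 us/write
```

A vendor low-level reformat (or an effective TRIM) corresponds to resetting the erased-page count back up, which is why reconditioning restores the original write speed.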
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
On 12/22/2010 7:05 AM, Christopher George wrote:
> > I'm not sure if TRIM will work with ZFS.
>
> Neither ZFS nor the ZIL code in particular support TRIM.
>
> > I was concerned that with trim support the SSD life and write throughput will get affected.
>
> Your concerns about sustainable write performance (IOPS) for a Flash based SSD are valid; the resulting degradation will vary depending on the controller used.
>
> Best regards,
> Christopher George
> Founder/CTO
> www.ddrdrive.com

Christopher is correct, in that SSDs will suffer from (non-trivial) performance degradation after they've exhausted their free list and haven't been told to reclaim emptied space. True battery-backed DRAM is the only permanent solution currently available which never runs into this problem. Even TRIM-supported SSDs eventually need reconditioning.

However, this *can* be overcome by frequently re-formatting the SSD (not the Solaris format, but a low-level format using a vendor-supplied utility). It's generally a simple thing, but requires pulling the SSD from the server, connecting it to either a Linux or Windows box, running the reformatter, then replacing the SSD. Which is a PITA. But still a bit cheaper than buying a DDRdrive.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Looking for 3.5" SSD for ZIL
On 12/22/2010 10:04 PM, Christopher George wrote: How about comparing a non-battery backed ZIL to running a ZFS dataset with sync=disabled. Which is more risky? Most likely, the 3.5" SSD's on-board volatile (not power protected) memory would be small relative to the transaction group (txg) size and thus less "risky" than sync=disabled. Best regards, Christopher George Founder/CTO www.ddrdrive.com To the OP: First off, what do you mean by "sync=disabled"??? There is no such parameter for a mount option or attribute for ZFS, nor is there for exporting anything in NFS, nor for client-side NFS mounts. If you meant "disable the ZIL", well, DON'T. http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29 Moreover, disabling the ZIL on a per-dataset basis is not possible. As noted in the ETG, disabling ZIL can cause possible NFS-client-side corruption. If you absolutely must turn it off, however, you will get More Reliable transactions than a non-SuperCap'd SSD, by virtue that any sync-write on such a fileserver will not return as complete until the data has reach backing store. Which, in most cases, will tank (no pun intended) your synchronous performance. About the only case it won't cripple performance is when your backing store is using some sort of NVRAM to buffer writes to the disks (as most large array controllers do - but make sure that cache is battery backed). But even there, it can be a relatively simple thing to overwhelm the very limited cache on such a controller, in which case your performance tanks again. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] stupid ZFS question - floating point operations
On 12/22/2010 11:49 AM, Tomas Ögren wrote:
> On 22 December, 2010 - Jerry Kemp sent me these 1,0K bytes:
> > I have a coworker whose primary expertise is in another flavor of Unix. This coworker lists floating point operations as one of ZFS's detriments. I'm not really sure what he means specifically, or where he got this reference from.
>
> Then maybe ask him first? Guilty until proven innocent isn't the regular path...
>
> > In an effort to refute what I believe is an error or misunderstanding on his part, I have spent time on Yahoo, Google, the ZFS section of OpenSolaris.org, etc. I really haven't turned up much of anything that would prove or disprove his comments. The one thing I haven't done is to go through the ZFS source code, but it's been years since I have done any serious programming. If someone from Oracle, or anyone on this mailing list, could point me towards any documentation, or give me a definitive word, I would sure appreciate it. If there were floating point operations going on within ZFS, at this point I am uncertain as to what they would be.
> >
> > TIA for any comments,
> > Jerry
>
> /Tomas

So far as my understanding of the codebase goes (and, while I've read a significant portion, I'm not really an expert here): assuming he means that ZFS has a weakness of heavy floating-point calculation requirements (i.e., using ZFS requires heavy FP usage), that's wrong. Like all normal filesystems, the "ordinary" operations are all integer, load, and store. The ordinary work of caching, block allocation, and fetching/writing is of course all integer-based. I can't imagine someone writing a filesystem which does such operations using floating point. A quick grep through the main ZFS sources doesn't find anything of type "double" or "float". I think he might be confused by what is happening with checksums (which is still all integer, but looks/sounds "expensive").
Yes, ZFS is considerably *more* compute intensive than other filesystems. However, it's all Integer, and one of the base assumptions of ZFS is that modern systems have lots of excess CPU cycles around, so stealing 5% for use with ZFS won't impact performance much, and the added features of ZFS more than make up for any CPU cycles lost. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
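The integer-only point can be made concrete with fletcher4, one of ZFS's standard block checksums: it is nothing but integer additions of 32-bit words into four 64-bit accumulators. The sketch below follows the published algorithm in simplified form (the real C code deals with byte order and buffer alignment); it is an illustration, not the ZFS implementation.

```python
# fletcher4, a ZFS block checksum, in sketch form: four 64-bit integer
# accumulators updated once per 32-bit word -- no floating point
# anywhere.  Simplified relative to the real C implementation.

import struct

MASK64 = (1 << 64) - 1  # the uint64_t accumulators are allowed to wrap

def fletcher4(data: bytes):
    a = b = c = d = 0
    # Treat the buffer as a stream of 32-bit words (little-endian assumed).
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d

print(fletcher4(b"\x01\x00\x00\x00\x02\x00\x00\x00"))  # -> (3, 4, 5, 6)
```

Four integer adds per 4-byte word is cheap on any modern CPU, which is exactly the "steal a few percent of CPU" trade-off described above.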
Re: [zfs-discuss] A few questions
On 12/20/2010 11:56 AM, Mark Sandrock wrote: Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.) I'm supposing that a block-level snapshot is not doable -- or is it? Mark Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time. The problem is this: Let's say I write block A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: A, B, E, D rather than A, B, [space], D, E So, in the first case, I can do a sequential read to get A & B, but then must do a seek to get D, and a seek to get E. The "fragmentation" problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
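The A/B/C/D example above can be run as a toy first-fit allocator: deleting C opens a hole in the middle, the next write (E) fills it, and on-disk order stops matching logical write order, so a later sweep through A, B, D, E must seek. Purely illustrative; this is not how ZFS's real allocator is coded.

```python
# Toy first-fit allocator reproducing the fragmentation example:
# write A,B,C,D; delete C; write E.  E lands in C's hole, so the
# on-disk order becomes A,B,E,D rather than A,B,_,D,E.

def first_fit_write(disk, name):
    disk[disk.index(None)] = name  # earliest free slot wins

disk = [None] * 5
for name in "ABCD":
    first_fit_write(disk, name)
disk[disk.index("C")] = None  # delete C: hole in the middle
first_fit_write(disk, "E")    # new write fills the hole
print(disk)                   # -> ['A', 'B', 'E', 'D', None]
```

Reading the live data in logical order (A, B, D, E) now requires jumping over E and back, which is the seek penalty the deletion-driven fragmentation causes.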