Re: [zfs-discuss] Petabytes on a budget - blog
On Sat, Sep 5, 2009 at 12:30 AM, Marc Bevand wrote:
> Tim Cook cook.ms> writes:
> >
> > What's the point of arguing what the back-end can do anyways? This is bulk data storage. Their MAX input is ~100MB/sec. The backend can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.
> > --Tim
>
> True, what they have is sufficient to match GbE speed. But internal I/O throughput matters for resilvering RAID arrays, scrubbing, local data analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize this number) & 10 drives behind PCI-E links per array, so this means the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, so 20MB/s per (1.5TB-)drive, so it is going to take a minimum of 20.8 hours to resilver one of their arrays.
>
> -mrb

But none of that matters. The data is replicated at a higher layer, combined with RAID-6. They'd have to see a triple disk failure across multiple arrays at the same time. They aren't concerned with performance; the home users they're backing up aren't ever going to get anything remotely close to GigE speeds. The absolute BEST case scenario *MIGHT* push 20Mbit if the end user is lucky enough to have FiOS or DOCSIS 3.0 in their area, and has large files with a clean link. Even rebuilding two failed disks, that setup will push 2MB/sec all day long.

--Tim
Re: [zfs-discuss] Petabytes on a budget - blog
Tim Cook cook.ms> writes:
>
> What's the point of arguing what the back-end can do anyways? This is bulk data storage. Their MAX input is ~100MB/sec. The backend can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.
> --Tim

True, what they have is sufficient to match GbE speed. But internal I/O throughput matters for resilvering RAID arrays, scrubbing, local data analysis/processing, etc. In their case they have 3 15-drive RAID6 arrays per pod. If their layout is optimal they put 5 drives on the PCI bus (to minimize this number) & 10 drives behind PCI-E links per array, so this means the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, so 20MB/s per (1.5TB-)drive, so it is going to take a minimum of 20.8 hours to resilver one of their arrays.

-mrb
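For reference, the 20.8-hour figure is just the stated assumptions worked through (a full 1.5 TB drive rewritten at a sustained 20 MB/s, decimal units):

    1.5 TB / 20 MB/s = 1,500,000 MB / 20 MB/s = 75,000 s ≈ 20.8 hours

A real resilver that also has to compete with client I/O would take longer still.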
[zfs-discuss] incremental send/recv larger than sum of snapshots?
I've been sending daily incrementals off-site for a while now, but recently they failed, so I had to send an incremental covering a number of snapshots. I expected the incremental to be approximately the sum of the snapshots, but it seems to be considerably larger and still going.

The source machine is nv72 and the destination is nv99. I send/recv with this command:

/usr/sbin/zfs send -i tank/v...@2009-08-15 tank/v...@2009-08-26 | bzip2 -c | ssh offsite-computer "bzcat | /usr/sbin/zfs recv -F tank/vm"

The sum of the 11 days of snapshots is about 100G, but I see the remote computer registering over 130G. I'm pushing this over a single T1, so the process has been running for about a week. Is this expected? If so, is there any way I can calculate how much data will need to be transferred?

Here is a snippet of zfs list on the source:

tank/v...@2009-08-14   8.46G      -   440G  -
tank/v...@2009-08-15   7.49G      -   440G  -
tank/v...@2009-08-16   7.42G      -   440G  -
tank/v...@2009-08-17   7.45G      -   441G  -
tank/v...@2009-08-18   11.0G      -   538G  -
tank/v...@2009-08-19   11.1G      -   479G  -
tank/v...@2009-08-20   11.1G      -   479G  -
tank/v...@2009-08-21   7.61G      -   480G  -
tank/v...@2009-08-22   6.45G      -   481G  -
tank/v...@2009-08-23   7.31G      -   481G  -
tank/v...@2009-08-24   9.66G      -   481G  -
tank/v...@2009-08-25   10.1G      -   481G  -
tank/v...@2009-08-26   12.5G      -   481G  -

And the remote:

tank/v...@2009-08-14    8.46G     -   440G  -
tank/v...@2009-08-15    9.38G     -   440G  -
tank/vm/%2009-08-26      136G  867G   475G  /tank/vm/%2009-08-26
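One crude way to see how big an incremental stream really is, before committing it to a slow link, is to run the same send locally and just count the bytes. This reads the whole stream, so it takes a while; the snapshot names below assume the dataset is tank/vm, as the recv target in the command above suggests:

    # measure the uncompressed size of the incremental stream
    /usr/sbin/zfs send -i tank/vm@2009-08-15 tank/vm@2009-08-26 | wc -c

Note that the per-snapshot USED column in 'zfs list' only counts blocks unique to each snapshot, so summing it is at best a rough lower bound on the stream size.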
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 10:02 PM, David Magda wrote:

> On Sep 4, 2009, at 21:44, Ross Walker wrote:
>
>> Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.
>
> What's your pool configuration? Striped mirrors? RAID-Z with SSDs? Other?

Striped mirrors off an NVRAM-backed controller (Dell PERC 6/E). RAID-Z isn't the best for many VMs as the whole vdev acts as a single disk for random I/O.

-Ross
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
On Thu, Sep 3, 2009 at 4:57 AM, Karel Gardas wrote:
> Hello,
> your "(open)solaris for ECC support (which seems to have been dropped from 200906)" is a misunderstanding. OS 2009.06 also supports ECC as 2005 did. Just install it and use my updated ecccheck.pl script to get informed about errors. Also you might verify that Solaris' memory scrubber is really running if you are that curious:
> http://developmentonsolaris.wordpress.com/2009/03/06/how-to-make-sure-memory-scrubber-is-running/
> Karel

Is there something that needs to be done on the Solaris side for memscrub scans to occur? I'm running snv_118, with a Supermicro board running ECC memory and AMD Opteron CPUs. It would appear it's doing a lot of nothing.

Aug  8 03:56:23 fserv unix: [ID 950921 kern.info] cpu0: x86 (chipid 0x0 AuthenticAMD 40F13 family 15 model 65 step 3 clock 2010 MHz)
Aug  8 03:56:23 fserv unix: [ID 950921 kern.info] cpu0: Dual-Core AMD Opteron(tm) Processor 2212

r...@fserv:~# isainfo -v
64-bit amd64 applications
        tscp ahf cx16 sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
        tscp ahf cx16 sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu

r...@fserv:~# echo "memscrub_scans_done/U" | mdb -k
memscrub_scans_done:
memscrub_scans_done:            0
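If it helps, a couple of other scrubber variables can be poked the same way. The names below are the ones used by the x86 memscrub code; treat them as an assumption and cross-check against the blog post above:

    # is the scrubber disabled, and how long is one scrub interval?
    echo "disable_memscrub/X" | mdb -k
    echo "memscrub_period_sec/D" | mdb -k

If disable_memscrub is non-zero, or the period is longer than the box has been up, zero completed scans would be the expected result.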
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 21:44, Ross Walker wrote:

> Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

What's your pool configuration? Striped mirrors? RAID-Z with SSDs? Other?
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Ross Walker wrote:

I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads. The ARC makes it hard to tell if they are satisfied from cache or blocked due to writes. The existing prefetch bug makes it doubly hard. :-)

First I complained about the blocking reads, and then I complained about the blocking writes (presumed responsible for the blocking reads), and now I am waiting for working prefetch in order to feed my hungry application.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 8:59 PM, Bob Friesenhahn wrote:

> On Fri, 4 Sep 2009, Ross Walker wrote:
>> I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes. Anyway I have never seen any reads happen during these write flushes.
>
> I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads. The ARC makes it hard to tell if they are satisfied from cache or blocked due to writes.

I suppose if you have the hardware to go sync, that might be the best bet. That and limiting the write cache.

Though I have only heard good comments from my ESX admins since moving the VMs off iSCSI and on to ZFS over NFS, so it can't be that bad.

-Ross
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Ross Walker wrote:

> I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes. Anyway I have never seen any reads happen during these write flushes.

I have yet to see a read happen during the write flush either. That impacts my application since it needs to read in order to proceed, and it does a similar amount of writes as it does reads.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] PMP support in Opensolaris
On Fri, Sep 4, 2009 at 1:12 PM, Nigel Smith wrote:
> Let us know if you can get the port multipliers working..
>
> But remember, there is a problem with ZFS raidz in that release, so be careful:

I saw that, so I think I'll be waiting until snv_124 to update. The system that I'm thinking of using currently only has mirrored vdevs, however, so it shouldn't be any risk.

Something like one of the following seems reasonable to add a few drives to an existing system, although eSATA just seems like a bad idea for a number of reasons:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816132016
http://www.newegg.com/Product/Product.aspx?Item=N82E16816111057

A good use that I can see is combining an Intel D945GCLF2 board with a case that has more than 2 drive bays, using an internal PMP. One of the systems I have is an Atom board in a small Chenbro 2-bay case, which gives surprisingly good performance and is . There is a 4-bay version available, but lack of SATA ports on the motherboard kept me from using it.
http://www.cooldrives.com/siseata5pomu.html
http://www.newegg.com/Product/Product.aspx?Item=N82E16811123122

-B
--
Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 5:25 PM, Scott Meilicke wrote:

> I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.
>
> For what it is worth, here is the pattern I see when load testing NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):
>
> data01   59.6G  20.4T     46     24   757K  3.09M
> data01   59.6G  20.4T     39     24   593K  3.09M
> data01   59.6G  20.4T     45     25   687K  3.22M
> data01   59.6G  20.4T     45     23   683K  2.97M
> data01   59.6G  20.4T     33     23   492K  2.97M
> data01   59.6G  20.4T     16     41   214K  1.71M
> data01   59.6G  20.4T      3  2.36K  53.4K  30.4M
> data01   59.6G  20.4T      1  2.23K  20.3K  29.2M
> data01   59.6G  20.4T      0  2.24K  30.2K  28.9M
> data01   59.6G  20.4T      0  1.93K  30.2K  25.1M
> data01   59.6G  20.4T      0  2.22K      0  28.4M
> data01   59.7G  20.4T     21    295   317K  4.48M
> data01   59.7G  20.4T     32     12   495K  1.61M
> data01   59.7G  20.4T     35     25   515K  3.22M
> data01   59.7G  20.4T     36     11   522K  1.49M
> data01   59.7G  20.4T     33     24   508K  3.09M
>
> LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

With that setup you'll see at most 3x the IOPS of the underlying disk type, not really the kind of setup for a 60% random workload. Assuming 2TB SATA drives, the max IOPS would be around 240. Now if it were mirror vdevs you'd get 7x, or 560 IOPS.

Is this for VMware or data warehousing?

You'll also need an SSD drive in the mix if you're not using a controller with NVRAM write-back, especially when sharing over NFS.

I guess since it's 15 drives it's an MD1000; I might have gone with the newer 2.5" drive enclosure as it holds 24 over 15, and most SSDs come in 2.5". Since you got it already, invest in a PERC 6/E with 512MB of cache and stick it in the other PCIe 8x slot.

-Ross
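To spell out the arithmetic behind those numbers (assuming the usual rule of thumb of roughly 80 random IOPS per 7200 RPM SATA drive): each raidz vdev delivers about the random-read IOPS of a single member drive, so 3 vdevs ≈ 3 × 80 = 240 IOPS, while the same 15 drives arranged as 7 mirror pairs (plus a spare) ≈ 7 × 80 = 560 IOPS.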
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 6:33 PM, Bob Friesenhahn wrote:

> On Fri, 4 Sep 2009, Scott Meilicke wrote:
>> I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.
>
> The flipside of the "pulsing" is that the deferred writes diminish contention for precious read IOPS, and quite a few programs have a habit of updating/rewriting a file over and over again. If the file is completely asynchronously rewritten once per second and zfs writes a transaction group every 30 seconds, then 29 of those updates avoided consuming write IOPS. Another benefit is that if zfs has more data in hand to write, then it can do a much better job of avoiding fragmentation, avoid unnecessary COW by diminishing short tail writes, and achieve more optimum write patterns.

I guess one can find a silver lining in any grey cloud, but for myself I'd just rather see a more linear approach to writes.

Anyway, I have never seen any reads happen during these write flushes.

-Ross
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.

The flipside of the "pulsing" is that the deferred writes diminish contention for precious read IOPS, and quite a few programs have a habit of updating/rewriting a file over and over again. If the file is completely asynchronously rewritten once per second and zfs writes a transaction group every 30 seconds, then 29 of those updates avoided consuming write IOPS.

Another benefit is that if zfs has more data in hand to write, then it can do a much better job of avoiding fragmentation, avoid unnecessary COW by diminishing short tail writes, and achieve more optimum write patterns.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, 4 Sep 2009, Louis-Frédéric Feuillette wrote:

> JPEG2000 uses arithmetic encoding to do the final compression step. Arithmetic encoding has a higher compression rate (in general) than gzip-9, lzjb or others. There is an opensource implementation of jpeg2000 called jasper[1]. Jasper is the reference implementation for jpeg2000, meaning that all other jpeg2000 programs must verify their output against that of jasper (kinda).

Jasper is incredibly slow and consumes a large amount of memory. Other JPEG2000 programs are validated by how many times faster they are than Jasper. :-)

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] check a zfs rcvd file
On 09/04/09 10:17, dick hoogendijk wrote:

> Lori Alt wrote:
>> The -u option to zfs recv (which was just added to support flash archive installs, but it's useful for other reasons too) suppresses all mounts of the received file systems. So you can mount them yourself afterward in whatever order is appropriate, or do a 'zfs mount -a'.
>
> You misunderstood my problem. It is very convenient that the filesystems are not mounted. I only wish they could stay that way! Alas, they ARE mounted (even if I don't want them to be) when I *reboot* the system. And THAT's when things get ugly. I then have different zfs filesystems using the same mountpoints! The backed-up ones have the same mountpoints as their origin :-/
>
> -> The only way to stop it is to *export* the "backup" zpool OR to change *manually* the zfs prop "canmount=noauto" in all backed-up snapshots/filesystems. As I understand it, I cannot give this "canmount=noauto" to the zfs receive command.
>
> # zfs send -Rv rp...@0909 | zfs receive -Fdu backup/snaps

There is an RFE to allow zfs recv to assign properties, but I'm not sure whether it would help in your case. I would have thought that "canmount=noauto" would have already been set on the sending side, however. In that case, the property should be preserved when the stream is received. But if for some reason you're not setting that property on the sending side, but want it set on the receiving side, you might have to write a script to set the properties for all those datasets after they are received.

lori
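A minimal sketch of such a script, using the backup/snaps name from the example above (canmount is not a recursive property, so it has to be set per dataset):

    # set canmount=noauto on every filesystem under the received backup tree
    for fs in $(zfs list -H -o name -t filesystem -r backup/snaps); do
        zfs set canmount=noauto "$fs"
    done

Run it on the receiving side after each receive; only filesystems take the property, snapshots do not.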
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, Sep 04, 2009 at 01:41:15PM -0700, Richard Elling wrote:
> On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
> > We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500.
>
> Wouldn't it be easier to compress at the application, or between the application and the archiving file system?

Especially when it comes to reading the images back! ZFS compression is transparent. You can't write uncompressed data then read back compressed data. And compression is at the block level, not for the whole file, so even if you could read it back compressed, it wouldn't be in a useful format.

Most people want to transfer data compressed, particularly images. So compressing at the application level in this case seems best to me.

Nico
--
Re: [zfs-discuss] Pulsing write performance
I only see the blocking while load testing, not during regular usage, so I am not so worried. I will try the kernel settings to see if that helps if/when I see the issue in production.

For what it is worth, here is the pattern I see when load testing NFS (iometer, 60% random, 65% read, 8k chunks, 32 outstanding I/Os):

data01   59.6G  20.4T     46     24   757K  3.09M
data01   59.6G  20.4T     39     24   593K  3.09M
data01   59.6G  20.4T     45     25   687K  3.22M
data01   59.6G  20.4T     45     23   683K  2.97M
data01   59.6G  20.4T     33     23   492K  2.97M
data01   59.6G  20.4T     16     41   214K  1.71M
data01   59.6G  20.4T      3  2.36K  53.4K  30.4M
data01   59.6G  20.4T      1  2.23K  20.3K  29.2M
data01   59.6G  20.4T      0  2.24K  30.2K  28.9M
data01   59.6G  20.4T      0  1.93K  30.2K  25.1M
data01   59.6G  20.4T      0  2.22K      0  28.4M
data01   59.7G  20.4T     21    295   317K  4.48M
data01   59.7G  20.4T     32     12   495K  1.61M
data01   59.7G  20.4T     35     25   515K  3.22M
data01   59.7G  20.4T     36     11   522K  1.49M
data01   59.7G  20.4T     33     24   508K  3.09M

LSI SAS HBA, 3 x 5 disk raidz, Dell 2950, 16GB RAM.

-Scott
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Fri, 2009-09-04 at 13:41 -0700, Richard Elling wrote:
> On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:
> > We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500.
>
> Wouldn't it be easier to compress at the application, or between the application and the archiving file system?

Preamble: I am actively doing research into image set compression, specifically jpeg2000, so this is my point of reference.

I think it would be easier to compress at the application level. I would suggest getting the image from the source, then using lossless jpeg2000 compression on it, and saving the result to an uncompressed ZFS pool.

JPEG2000 uses arithmetic encoding to do the final compression step. Arithmetic encoding has a higher compression rate (in general) than gzip-9, lzjb or others. There is an opensource implementation of jpeg2000 called jasper[1]. Jasper is the reference implementation for jpeg2000, meaning that all other jpeg2000 programs must verify their output against that of jasper (kinda).

Saving the jpeg2000 image to an uncompressed ZFS filesystem will be the fastest thing. Since jpeg2000 is already compressed, trying to compress it will not yield any storage space reduction; in fact it may _increase_ the size of the data stored on disk. Since good compression algorithms produce random-looking data, you can see why running on a compressed pool would be bad for performance.

[1] http://www.ece.uvic.ca/~mdadams/jasper

On a side note, if you want to know how arithmetic encoding works, Wikipedia[2] has a really nice explanation. Suffice it to say, in theory (without considering implementation details) arithmetic encoding can encode _any_ data at the rate of data_entropy*num_of_symbols + data_symbol_table. In practice this doesn't happen due to floating point overflows and some other issues.

[2] http://en.wikipedia.org/wiki/Arithmetic_coding

--
Louis-Frédéric Feuillette
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 4:33 PM, Scott Meilicke wrote:

> Yes, I was getting confused. Thanks to you (and everyone else) for clarifying. Sync or async, I see the txg flushing to disk starve read IO.

Well, try the kernel setting and see how it helps.

Honestly though, if you can say it's all sync writes with certainty and IO is still blocking, you need a better storage sub-system, or an additional pool.

-Ross
Re: [zfs-discuss] zfs compression algorithm : jpeg ??
On Sep 4, 2009, at 12:23 PM, Len Zaifman wrote:

> We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500.

Wouldn't it be easier to compress at the application, or between the application and the archiving file system?

> We have tried
>   lzjb:    compressratio = 1.13 in 11 hours, 1.3 TB -> 1.1 TB
>   gzip-9:  compressratio = 1.68 in > 37 hours, 1.3 TB -> 0.75 TB
>
> The filesystem performance was noticeably laggy (ie ls took > 10 seconds) while gzip-9 compression was used.
>
> Do you have any idea if lossless jpeg compression is being planned for ZFS? We envisage that of the 1.3 TB, > 0.8 TB will be images, and if we could get better or equivalent compression with jpeg lossless compression, with less impact on the filesystem than gzip-9 compression, that would be worthwhile, if it worked.

I don't know of anyone working on that specific compression scheme, but I've put together some thoughts on the subject of adding a new compressor to ZFS. Perhaps others could comment?
http://richardelling.blogspot.com/2009/08/justifying-new-compression-algorithms.html

-- richard
Re: [zfs-discuss] Pulsing write performance
Yes, I was getting confused. Thanks to you (and everyone else) for clarifying. Sync or async, I see the txg flushing to disk starve read IO.

Scott
Re: [zfs-discuss] Pulsing write performance
On Sep 4, 2009, at 2:22 PM, Scott Meilicke wrote:

> So, I just re-read the thread, and you can forget my last post. I had thought the argument was that the data were not being written to disk twice (assuming no separate device for the ZIL), but it was just explaining to me that the data are not read from the ZIL to disk, but rather from memory to disk.
>
> I need more coffee...

I think you're confusing ARC write-back with the ZIL, and it isn't the sync writes that are blocking IO, it's the async writes that have been cached and are now being flushed.

Just tell the ARC to cache less IO for your hardware with the kernel config Bob mentioned way back.

-Ross
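The thread doesn't restate the exact tunable here; on builds of this vintage the usual knob for capping how much dirty data gets batched into each transaction group is zfs_write_limit_override, so a sketch of what is being suggested might look like this in /etc/system (the 512MB value is only an example to size against your own hardware):

    * cap the amount of data ZFS will accept per transaction group
    set zfs:zfs_write_limit_override = 0x20000000

A reboot is needed for /etc/system changes to take effect; on systems that have the variable it can also be adjusted live with mdb -kw.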
Re: [zfs-discuss] PMP support in Opensolaris
Hi Brandon

To answer your question, all you need to do is look up those bug numbers:
http://bugs.opensolaris.org/view_bug.do?bug_id=6422924
http://bugs.opensolaris.org/view_bug.do?bug_id=6691950
..and you see the fix should be in release snv_122.

You're in luck, as the OpenSolaris dev repository was updated to snv_122 yesterday:
http://mail.opensolaris.org/pipermail/opensolaris-announce/2009-September/001256.html
http://pkg.opensolaris.org/dev/en/index.shtml

Let us know if you can get the port multipliers working..

But remember, there is a problem with ZFS raidz in that release, so be careful:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-September/031434.html

Regards
Nigel Smith
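For anyone following along, pulling a build from the dev repository is roughly the following (shown as an illustration; the publisher name and URL are the standard ones for a 2009.06-era install):

    # point the image at the dev repository and update
    pfexec pkg set-publisher -O http://pkg.opensolaris.org/dev/ opensolaris.org
    pfexec pkg image-update

The update creates a new boot environment, so you can boot back into the old one if snv_122 gives you trouble.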
[zfs-discuss] PMP support in Opensolaris
On Wed, Sep 2, 2009 at 4:56 PM, David Magda wrote:
> Said support was committed only two to three weeks ago:
>
>> PSARC/2009/394 SATA Framework Port Multiplier Support
>> 6422924 sata framework has to support port multipliers
>> 6691950 ahci driver needs to support SIL3726/4726 SATA port multiplier

When is this going to show up in the repo at http://pkg.opensolaris.org/dev/ ? Is it already there? Sorry if it's a dumb question, but I'm not sure where to look, so the release process is a bit opaque to me.

-B
--
Brandon High : bh...@freaks.com
[zfs-discuss] zfs compression algorithm : jpeg ??
We have groups generating terabytes a day of image data from lab instruments and saving them to an X4500. We have tried

  lzjb:    compressratio = 1.13 in 11 hours, 1.3 TB -> 1.1 TB
  gzip-9:  compressratio = 1.68 in > 37 hours, 1.3 TB -> 0.75 TB

The filesystem performance was noticeably laggy (ie ls took > 10 seconds) while gzip-9 compression was used.

Do you have any idea if lossless jpeg compression is being planned for ZFS? We envisage that of the 1.3 TB, > 0.8 TB will be images, and if we could get better or equivalent compression with jpeg lossless compression, with less impact on the filesystem than gzip-9 compression, that would be worthwhile, if it worked.
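For reference, the two runs above presumably correspond to setting the compression property on the dataset holding the images, something like the following (the dataset name is only illustrative):

    # lzjb is the lightweight default algorithm; gzip-9 trades CPU time for ratio
    zfs set compression=lzjb tank/images
    zfs set compression=gzip-9 tank/images
    zfs get compressratio tank/images

Note the property only affects blocks written after it is set, so comparing algorithms requires rewriting the data each time.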
Re: [zfs-discuss] Pulsing write performance
Scott Meilicke wrote:
> I am still not buying it :) I need to research this to satisfy myself. I can understand that the writes come from memory to disk during a txg write for async, and that is the behavior I see in testing. But for sync, data must be committed, and a SSD/ZIL makes that faster because you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data on the SSD must get to spinning disk.

But the txg (which may contain more data than just the sync data that was written to the ZIL) is still written from memory. Just because the sync data was written to the ZIL doesn't mean it's not still in memory.

-Kyle

> To the books I go!
> -Scott
Re: [zfs-discuss] Petabytes on a budget - blog
On Fri, Sep 4, 2009 at 5:36 AM, Marc Bevand wrote:
> Marc Bevand gmail.com> writes:
> >
> > So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on 1 of their storage pods is about 1000MB/s.
>
> Correction: the SiI3132 are on x1 (not x2) links, so my guess as to the aggregate throughput when reading from all the disks is:
> 3*150+100 = 550MB/s.
> (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)
>
> And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards to exploit closer to the max theoretical bandwidth of an x1 PCI-E link, it would be:
> 3*250+100 = 850MB/s.
>
> -mrb

What's the point of arguing what the back-end can do anyways? This is bulk data storage. Their MAX input is ~100MB/sec. The backend can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.

--Tim
Re: [zfs-discuss] Pulsing write performance
So, I just re-read the thread, and you can forget my last post. I had thought the argument was that the data were not being written to disk twice (assuming no separate device for the ZIL), but it was just explaining to me that the data are not read from the ZIL to disk, but rather from memory to disk.

I need more coffee...
Re: [zfs-discuss] Pulsing write performance
Doh! I knew that, but then forgot...

So, for the case of no separate device for the ZIL, the ZIL lives on the disk pool. In which case, the data are written to the pool twice during a sync:

1. To the ZIL (on disk)
2. From RAM to disk during the txg commit

If this is correct (and my history in this thread is not so good, so...), would that then explain some sort of pulsing write behavior for sync write operations?
Re: [zfs-discuss] Pulsing write performance
Scott Meilicke wrote:
> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example, does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the spinning disk

#1 is correct. #2 is incorrect. The TXG commit goes from memory into the main pool. The SSD data is simply left there in case something bad happens before the TXG commit succeeds. Once it succeeds, then the SSD data can be overwritten. The only time you need to read from a ZIL device is if a crash occurs and you need those blocks to repair the pool.

Eric
Re: [zfs-discuss] Understanding when (and how) ZFS will use spare disks
This sounds like the same behavior as OpenSolaris 2009.06. I had several disks recently go UNAVAIL, and the spares did not take over. But as soon as I physically removed a disk, the spare started replacing the removed disk. It seems UNAVAIL is not the same as the disk not being there. I wish the spare *would* take over in these cases, since the pool is degraded.

-Scott
Re: [zfs-discuss] Pulsing write performance
I am still not buying it :) I need to research this to satisfy myself.

I can understand that the writes come from memory to disk during a txg write for async, and that is the behavior I see in testing. But for sync, data must be committed, and a SSD/ZIL makes that faster because you are writing to the SSD/ZIL, and not to spinning disk. Eventually that data on the SSD must get to spinning disk.

To the books I go!

-Scott
Re: [zfs-discuss] Pulsing write performance
On Fri, 4 Sep 2009, Scott Meilicke wrote:

> So what happens during the txg commit?
>
> For example, if the ZIL is a separate device, SSD for this example, does it not work like:
>
> 1. A sync operation commits the data to the SSD
> 2. A txg commit happens, and the data from the SSD are written to the spinning disk
>
> So this is two writes, correct?

From past descriptions, the slog is basically a list of pending write system calls. The only time the slog is read is after a reboot. Otherwise, the slog is simply updated as write operations proceed.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Pulsing write performance
So what happens during the txg commit?

For example, if the ZIL is a separate device, SSD for this example, does it not work like:

1. A sync operation commits the data to the SSD
2. A txg commit happens, and the data from the SSD are written to the spinning disk

So this is two writes, correct?

-Scott
[zfs-discuss] Understanding when (and how) ZFS will use spare disks
We have a number of shared spares configured in our ZFS pools, and we're seeing weird issues where spares don't get used under some circumstances. We're running Solaris 10 U6 using pools made up of mirrored vdevs, and what I've seen is:

* if ZFS detects enough checksum errors on an active disk, it will automatically pull in a spare.

* if the system reboots without some of the disks available (so that half of the mirrored pairs drop out, for example), spares will *not* get used. ZFS recognizes that the disks are not there; they are marked as UNAVAIL and the vdevs (and pools) as DEGRADED, but it doesn't try to use spares. (This is in a SAN environment where half of all of the mirrors come from one controller and half come from another one.)

All of this makes me think that I don't understand how ZFS spares really work, and under what circumstances they'll get used. Does anyone know if there's a writeup of this somewhere?

(What I've gathered so far from reading zfs-discuss archives is that ZFS spares are not handled automatically in the kernel code but are instead deployed to pools by a fmd ZFS management module[*], doing more or less 'zpool replace ' (presumably through an internal code path, since 'zpool history' doesn't seem to show spare deployment). Is this correct?)

Also, searching turns up some old zfs-discuss messages suggesting that not bringing in spares in response to UNAVAIL disks was a bug that's now fixed in at least OpenSolaris. If so, does anyone know if the fix has made it into S10 U7 (or is planned or available as a patch)?

Thanks in advance.

- cks

[*: http://blogs.sun.com/eschrock/entry/zfs_hot_spares suggests that it is 'zfs-retire', which is separate from 'zfs-diagnosis'.]
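In the meantime, when the retire agent doesn't kick in, a spare can be attached by hand with essentially the same command the agent would issue (device names below are purely illustrative):

    # replace the UNAVAIL disk with one of the pool's configured spares
    zpool replace tank c3t2d0 c5t1d0
    zpool status tank

Once the resilver completes, the spare shows as INUSE; detaching the failed device with 'zpool detach' then makes the spare a permanent member of the mirror.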
Re: [zfs-discuss] Pulsing write performance
On 09/04/09 09:54, Scott Meilicke wrote:
> Roch Bourbonnais wrote:
>> "100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds."
>>
>> This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since this is not sustainable, you see here ZFS trying to balance the 2 numbers.
>
> When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping.

The ZIL does not work like that. It is not a journal. Under a typical write load, write transactions are batched and written out in a group transaction (txg). This txg sync occurs every 30s under light load, but more frequently or continuously under heavy load.

When writing synchronous data (eg NFS), the transactions get written immediately to the intent log and are made stable. When the txg later commits, the intent log blocks containing those committed transactions can be freed.

So as you can see, there is no periodic dumping of the ZIL to disk. What you are probably observing is the periodic txg commit.

Hope that helps: Neil.
Re: [zfs-discuss] check a zfs rcvd file
Lori Alt wrote:
> The -u option to zfs recv (which was just added to support flash archive installs, but it's useful for other reasons too) suppresses all mounts of the received file systems. So you can mount them yourself afterward in whatever order is appropriate, or do a 'zfs mount -a'.

You misunderstood my problem. It is very convenient that the filesystems are not mounted. I only wish they could stay that way! Alas, they ARE mounted (even if I don't want them to be) when I *reboot* the system. And THAT's when things get ugly. I then have different zfs filesystems using the same mountpoints! The backed-up ones have the same mountpoints as their origin :-/

-> The only way to stop it is to *export* the "backup" zpool OR to change *manually* the zfs prop "canmount=noauto" in all backed-up snapshots/filesystems. As I understand it, I cannot give this "canmount=noauto" to the zfs receive command.

# zfs send -Rv rp...@0909 | zfs receive -Fdu backup/snaps

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 B121
+ All that's really worth doing is what we do for others (Lewis Carrol)
Re: [zfs-discuss] Archiving and Restoring Snapshots
On Sep 3, 2009, at 10:32 PM, Tim Cook wrote:

> On Fri, Sep 4, 2009 at 12:17 AM, Ross wrote:
>> Hi Richard,
>>
>> Actually, reading your reply has made me realise I was overlooking something when I talked about tar, star, etc... How do you backup a ZFS volume? That's something traditional tools can't do. Are snapshots the only way to create a backup or archive of those?

Below the application, dd would do it. But if you want incrementals, then either use the application's backup scheme or zfs send.

>> Personally I'm quite happy with snapshots - we have a ZFS system at work that's replicating all of its data to an offsite ZFS store using snapshots. Using ZFS as a backup store is something I'm quite happy with, it's just storing just a snapshot file that makes me nervous.
>
> The correct answer is ndmp. Whether Sun will ever add it to opensolaris is another subject entirely though.

Available since b78, with source integrated in b102.
http://www.opensolaris.org/os/project/ndmp/

But NDMP is just part of an overall data management architecture...
-- richard
[zfs-discuss] question about my hardware choice
Hi zfs cognoscenti,

a few quick questions about my hardware choice (a bit late, since the box is up already):

A 3U Supermicro chassis with 16x SATA/SAS hotplug
Supermicro X8DDAi (2x Xeon QC 1.26 GHz S1366, 24 GByte RAM, IPMI)
2x LSI SAS3081E-R
16x WD2002FYPS

Right now I'm running Solaris 10 5/09 (Oracle doesn't support OpenSolaris, unfortunately). I would like to run Oracle in a zone/container, and use the rest for random storage and network serving. My questions:

* does the hardware choice make sense? Particularly the LSI host adapters. Should I change anything hardware-side?
* what kind of zfs layout would you recommend if I want to run Oracle in a container?
* should I put some SSD (e.g. Intel 80 GByte 2nd gen) into the system if I can, or doesn't Solaris 10 5/09 zfs support it?
* is there a reason speaking against containers and Oracle?
* how many hot spares would you suggest?

Thanks.

--
Eugen* Leitl <leitl> http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Re: [zfs-discuss] check a zfs rcvd file
On 09/04/09 09:41, dick hoogendijk wrote:
> Lori Alt wrote:
>> The -n option does some verification. It verifies that the record headers distributed throughout the stream are syntactically valid. Since each record header contains a length field which allows the next header to be found, one bad header will cause the processing of the stream to abort. But it doesn't verify the content of the data associated with each record. So, storing the stream in a zfs received filesystem is the better option.
>
> Alas, it also is the most difficult one. Storing to a file with "zfs send -Rv" is easy. The result is just a file, and if you reboot the system all is OK. However, if I "zfs receive -Fdu" into a zfs filesystem I'm in trouble when I reboot the system. I get confusion on mountpoints!
>
> Let me explain: Some time ago I backed up my rpool and my /export ; /export/home to /backup/snaps (with zfs receive -Fdu). All's OK because the newly created zfs FS's stay unmounted 'till the next reboot(!). When I rebooted my system (due to a kernel upgrade) the system would not boot, because it had mounted the zfs FS "backup/snaps/export" on /export and "backup/snaps/export/home" on /export/home. The system itself had those FS's too, of course. So, there was a mix up.
>
> It would be nice if the backup FS's would not be mounted (canmount=noauto), but I cannot give this option when I create the zfs send | receive, can I? And giving this option later on is very difficult, because "canmount" is NOT recursive! And I don't want to set it manually on all those backed-up FS's.
>
> I wonder how other people overcome this mountpoint issue.

The -u option to zfs recv (which was just added to support flash archive installs, but it's useful for other reasons too) suppresses all mounts of the received file systems. So you can mount them yourself afterward in whatever order is appropriate, or do a 'zfs mount -a'.

lori
Re: [zfs-discuss] Pulsing write performance
Roch Bourbonnais wrote:
> "100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds."
>
> This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since this is not sustainable, you see here ZFS trying to balance the 2 numbers.

When I have tested using 50% reads, 60% random using iometer over NFS, I can see the data going straight to disk due to the sync nature of NFS. But I also see writes coming to a standstill every 10 seconds or so, which I have attributed to the ZIL dumping to disk. Therefore I conclude that it is the process of dumping the ZIL to disk that (mostly?) blocks writes during the dumping.

I do agree with Bob and others that suggest making the size of the dump smaller will mask this behavior, and that seems like a good idea, although I have not yet tried and tested it myself.

-Scott
Re: [zfs-discuss] check a zfs rcvd file
Lori Alt wrote:
> The -n option does some verification. It verifies that the record headers distributed throughout the stream are syntactically valid. Since each record header contains a length field which allows the next header to be found, one bad header will cause the processing of the stream to abort. But it doesn't verify the content of the data associated with each record. So, storing the stream in a zfs received filesystem is the better option.

Alas, it also is the most difficult one. Storing to a file with "zfs send -Rv" is easy. The result is just a file, and if you reboot the system all is OK. However, if I "zfs receive -Fdu" into a zfs filesystem I'm in trouble when I reboot the system. I get confusion on mountpoints!

Let me explain: Some time ago I backed up my rpool and my /export ; /export/home to /backup/snaps (with zfs receive -Fdu). All's OK because the newly created zfs FS's stay unmounted 'till the next reboot(!). When I rebooted my system (due to a kernel upgrade) the system would not boot, because it had mounted the zfs FS "backup/snaps/export" on /export and "backup/snaps/export/home" on /export/home. The system itself had those FS's too, of course. So, there was a mix up.

It would be nice if the backup FS's would not be mounted (canmount=noauto), but I cannot give this option when I create the zfs send | receive, can I? And giving this option later on is very difficult, because "canmount" is NOT recursive! And I don't want to set it manually on all those backed-up FS's.

I wonder how other people overcome this mountpoint issue.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u7 5/09 | OpenSolaris 2010.02 b122
+ All that's really worth doing is what we do for others (Lewis Carrol)
Re: [zfs-discuss] Change the volblocksize of a ZFS volume
stuart anderson writes:

> > > > Question: Is there a way to change the volume blocksize, say via 'zfs snapshot send/receive'? As I see things, this isn't possible as the target volume (including property values) gets overwritten by 'zfs receive'.
> > >
> > > By default, properties are not received. To pass properties, you need to use the -R flag.
> >
> > I have tried that, and while it works for properties like compression, I have not found a way to preserve a non-default volblocksize across zfs send | zfs receive. The zvol created on the receive side is always defaulting to 8k. Is there a way to do this?
>
> I spoke too soon. More particularly, during the zfs send/recv processes the receiving side reports 8k, but once the receive is done the volblocksize is reporting the expected value as sent with -R.
>
> Hopefully, this is just a reporting bug during an active receive.
>
> Note, this was observed with s10u7 (x86).

Sounds like so. I would be very surprised if one would be able to change the volblocksize of a zvol through send/receive (with or without -R). It's an immutable property of the zvol.

-r

> Thanks.
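Since volblocksize is fixed at creation time, the only place it can be chosen is on the 'zfs create' command line; for illustration (names and sizes are just placeholders):

    # create a zvol with a non-default 64K block size
    zfs create -V 100G -o volblocksize=64K tank/newvol
    zfs get volblocksize tank/newvol

Copying data into such a pre-created volume would then have to happen at the block or application level (dd, for instance) rather than via zfs receive, since a received stream recreates the volume with the sender's block size.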
Re: [zfs-discuss] Pulsing write performance
"100% random writes produce around 200 IOPS with a 4-6 second pause around every 10 seconds. " This indicates that the bandwidth you're able to transfer through the protocol is about 50% greater than the bandwidth the pool can offer to ZFS. Since, this is is not sustainable, you see here ZFS trying to balance the 2 numbers. -r David Bond writes: > Hi, > > I was directed here after posting in CIFS discuss (as i first thought that > it could be a CIFS problem). > > I posted the following in CIFS: > > When using iometer from windows to the file share on opensolaris > svn101 and svn111 I get pauses every 5 seconds of around 5 seconds > (maybe a little less) where no data is transfered, when data is > transfered it is at a fair speed and gets around 1000-2000 iops with 1 > thread (depending on the work type). The maximum read response time is > 200ms and the maximum write response time is 9824ms, which is very > bad, an almost 10 seconds delay in being able to send data to the > server. > This has been experienced on 2 test servers, the same servers have > also been tested with windows server 2008 and they havent shown this > problem (the share performance was slightly lower than CIFS, but it > was consistent, and the average access time and maximums were very > close. > > > I just noticed that if the server hasnt hit its target arc size, the > pauses are for maybe .5 seconds, but as soon as it hits its arc > target, the iops drop to around 50% of what it was and then there are > the longer pauses around 4-5 seconds. and then after every pause the > performance slows even more. So it appears it is definately server > side. > > This is with 100% random io with a spread of 33% write 66% read, 2KB > blocks. over a 50GB file, no compression, and a 5.5GB target arc > size. > > > > Also I have just ran some tests with different IO patterns and 100 > sequencial writes produce and consistent IO of 2100IOPS, except when > it pauses for maybe .5 seconds every 10 - 15 seconds. > > 100% random writes produce around 200 IOPS with a 4-6 second pause > around every 10 seconds. > > 100% sequencial reads produce around 3700IOPS with no pauses, just > random peaks in response time (only 16ms) after about 1 minute of > running, so nothing to complain about. > > 100% random reads produce around 200IOPS, with no pauses. > > So it appears that writes cause a problem, what is causing these very > long write delays? > > A network capture shows that the server doesnt respond to the write > from the client when these pauses occur. > > Also, when using iometer, the initial file creation doesnt have and > pauses in the creation, so it might only happen when modifying > files. > > Any help on finding a solution to this would be really appriciated. > > David > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Petabytes on a budget - blog
Marc Bevand gmail.com> writes:
>
> So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on 1 of their storage pods is about 1000MB/s.

Correction: the SiI3132 are on x1 (not x2) links, so my guess as to the aggregate throughput when reading from all the disks is:
3*150+100 = 550MB/s.
(150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link)

And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards to exploit closer to the max theoretical bandwidth of an x1 PCI-E link, it would be:
3*250+100 = 850MB/s.

-mrb
Re: [zfs-discuss] Petabytes on a budget - blog
Bill Moore sun.com> writes:
>
> Moving on, modern high-capacity SATA drives are in the 100-120MB/s range. Let's call it 125MB/s for easier math. A 5-port port multiplier (PM) has 5 links to the drives, and 1 uplink. SATA-II speed is 3Gb/s, which after all the framing overhead, can get you 300MB/s on a good day. So 3 drives can more than saturate a PM. 45 disks (9 backplanes at 5 disks + PM each) in the box won't get you more than about 21 drives worth of performance, tops. So you leave at least half the available drive bandwidth on the table, in the best of circumstances. That also assumes that the SiI controllers can push 100% of the bandwidth coming into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting close to a 4x PCIe-gen2 slot.

Wrong. The theoretical bandwidth of an x4 PCI-E v2.0 slot is 2GB/s per direction (5Gbit/s before 8b-10b encoding per lane, times 0.8, times 4), amply sufficient to deal with 600MB/s. However they don't have this kind of slot; they have x2 PCI-E v1.0 slots (500MB/s per direction). Moreover the SiI3132 default to a MAX_PAYLOAD_SIZE of 128 bytes, therefore my guess is that each 2-port SATA card is only able to provide 60% of the theoretical throughput[1], or about 300MB/s. Then they have 3 such cards: total throughput of 900MB/s.

Finally the 4th SATA card (with 4 ports) is in a 32-bit 33MHz PCI slot (not PCI-E). In practice such a bus can only provide a usable throughput of about 100MB/s (out of 133MB/s theoretical).

All the bottlenecks are obviously the PCI-E links and the PCI bus. So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on 1 of their storage pods is about 1000MB/s. This is poor compared to a Thumper for example, but the most important factor for them was GB/$, not GB/sec. And they did a terrific job at that!

> And I'd re-iterate what myself and others have observed about SiI and silent data corruption over the years.

Irrelevant, because it seems they have built fault-tolerance higher in the stack, à la Google. Commodity hardware + reliable software = great combo.

[1] http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-mrb
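To spell out the link-rate arithmetic (standard PCI-E figures, nothing specific to this chassis): a v2.0 lane signals at 5 Gbit/s and 8b/10b coding leaves 80% as payload, so 5 × 0.8 = 4 Gbit/s = 500 MB/s per lane, and an x4 slot is 4 × 500 MB/s = 2 GB/s per direction. A v1.0 lane signals at 2.5 Gbit/s, so 2.5 × 0.8 = 250 MB/s per lane, i.e. 500 MB/s for an x2 slot and 250 MB/s for the x1 links assumed in the later correction; 60% efficiency with 128-byte payloads is where the ~300 MB/s per card (or ~150 MB/s per x1 link) figures come from.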
Re: [zfs-discuss] ARC limits not obeyed in OSol 2009.06
Do you have the zfs primarycache property on this release? If so, you could set it to 'metadata' or 'none'.

     primarycache=all | none | metadata

         Controls what is cached in the primary cache (ARC). If this property is set to "all", then both user data and metadata is cached. If this property is set to "none", then neither user data nor metadata is cached. If this property is set to "metadata", then only metadata is cached. The default value is "all".

-r

Udo Grabowski writes:
> Hi,
> we've capped arcsize via set zfs:zfs_arc_max = 0x20000000 in /etc/system to 512 MB, since ARC still does not release memory when applications need it (this is another bug). But this hard limit is not obeyed; instead, when traversing all files in a large and deep directory, we see the values below (arc started with 300 MB). After a while, the machine (Ultra 20 M2 with 6GB) swaps and then, hours later, freezes completely (even no reaction on a quick push of the power button, no ping, no mouse, have to hard reset). arc_summary shows clearly that limits are not what they are supposed to be. If this is working as intended, then the intention must be changed. As poorly as ARC is working now, it's absolutely necessary that a hard limit is indeed a hard limit for ARC. Please fix this. Is there anything I can do to really limit or switch off the ARC completely? It's breaking our production work often since we've installed OSol (we came from SXDE 1/08 which worked better); we must find a way to stop this problem as fast as possible!
>
> arcstat:
>     Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz  c
> 13:22:16  95M 23M 24 10M 14 12M 64 22M 24 963M 536M
> 13:22:17  2K 256 10796 177 15 2229 965M 536M
> 13:22:18  2K 490 22 119 10 371 38 482 22 970M 536M
> 13:22:19  4K 214 4 1506643 1403 971M 536M
> 13:22:20  2K 427 19574 370 37 419 19 971M 536M
> 13:22:21  1K 208 19 103 17 105 21 202 19 971M 536M
>
> 13:23:16  1K 481 27808 401 47 478 27 1G 536M
> 13:23:17  2K 255 11 125 10 130 13 218 10 1G 536M
> and counting...
>
> arc_summary:
> System Memory:
>          Physical RAM:  6134 MB
>          Free Memory :  1739 MB
>          LotsFree:      95 MB
>
> ZFS Tunables (/etc/system):
>          set zfs:zfs_arc_max = 0x20000000
>
> ARC Size:
>          Current Size:             1357 MB (arcsize)
>          Target Size (Adaptive):   512 MB (c)
>          Min Size (Hard Limit):    191 MB (zfs_arc_min)
>          Max Size (Hard Limit):    512 MB (zfs_arc_max)
>
> ARC Size Breakdown:
>          Most Recently Used Cache Size:    93%  479 MB (p)
>          Most Frequently Used Cache Size:   6%   32 MB (c-p)
>
> ARC Efficency:
>          Cache Access Total:        97131108
>          Cache Hit Ratio:   75%     7321       [Defined State for buffer]
>          Cache Miss Ratio:  24%     23886667   [Undefined State for Buffer]
>          REAL Hit Ratio:    67%     65874421   [MRU/MFU Hits Only]
>
>          Data Demand Efficiency:    66%
>          Data Prefetch Efficiency:   8%
>
>          CACHE HITS BY CACHE LIST:
>            Anon:                      --%  Counter Rolled.
>            Most Recently Used:        15%  11463028 (mru)        [ Return Customer ]
>            Most Frequently Used:      74%  54411393 (mfu)        [ Frequent Customer ]
>            Most Recently Used Ghost:  10%  7537123 (mru_ghost)   [ Return Customer Evicted, Now Back ]
>            Most Frequently Used Ghost: 19% 14619417 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
>          CACHE HITS BY DATA TYPE:
>            Demand Data:                3%  2716192
>            Prefetch Data:              0%  3506
>            Demand Metadata:           86%  63089419
>            Prefetch Metadata:         10%  7435324
>          CACHE MISSES BY DATA TYPE:
>            Demand Data:                5%  1365132
>            Prefetch Data:              0%  36544
>            Demand Metadata:           40%  9664064
>            Prefetch Metadata:         53%  12820927
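For completeness, the property Roch is referring to is set per dataset, e.g. (the dataset name here is just an example):

    # stop caching file data in the ARC for this dataset, keep metadata
    zfs set primarycache=metadata tank/data
    zfs get primarycache tank/data

On builds that also have the secondarycache property, the same syntax controls what an L2ARC device is allowed to hold.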
Re: [zfs-discuss] How to find poor performing disks
Scott Lawson writes:
> Also you may wish to look at the output of 'iostat -xnce 1' as well.
>
> You can post those to the list if you have a specific problem.
>
> You want to be looking for error counts increasing and specifically 'asvc_t' for the service times on the disks. A higher number for asvc_t may help to isolate poorly performing individual disks.

I blast the pool with dd, and look for drives that are *always* active, while others in the same group have completed their transaction group and get no more activity. Within a group, drives should be getting the same amount of data per 5 seconds (zfs_txg_synctime), and the ones that are always active are the ones slowing you down. If whole groups are unbalanced, that's a sign that they have different amounts of free space, and the expectation is that you will be gated by the speed of the group that needs to catch up.

-r

> Scott Meilicke wrote:
> > You can try:
> >
> > zpool iostat pool_name -v 1
> >
> > This will show you IO on each vdev at one second intervals. Perhaps you will see different IO behavior on any suspect drive.
> >
> > -Scott
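A rough sketch of that kind of test (file and pool names are only examples; point it at a scratch file, not live data):

    # stream writes at the pool, and watch per-disk activity in another terminal
    dd if=/dev/zero of=/tank/scratch/ddtest bs=1024k count=16384 &
    iostat -xn 1

Disks in the same vdev that stay near 100% busy (%b) long after their neighbours have gone idle are the ones to look at.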
Re: [zfs-discuss] Read about ZFS backup - Still confused
Let me explain what I have and you decide if it's what you're looking for.

I run a home NAS based on ZFS (due to hardware issues I am using FreeBSD 7.2 as my OS, but all the data is on ZFS). This system has multiple uses. I have about 10 users and 4 HTPCs connected via gigabit. I have ZFS filesystems for Video, Audio and Data. I have no problem using it for my main iTunes library or storing downloaded and recorded video. Each user also has their own share to store data and backups.

The system itself is made up of 3 raidz vdevs right now, each with 4 1TB hard drives, so I have about 9 TB total space. Having a setup like this sort of changes how you do things. I have several computers, but all the stuff I care about is on the NAS.

I am very happy with ZFS for this purpose. I originally used a Linux backend with mdadm and xfs, but I am very much in love with my new system. I love the ability to clone and snapshot and I use it often. It's already saved me from human error on 2 occasions. It's also very fast. I'm using cheap parts and have seen speeds over 250 MB/s, although I get around 30 MB/s per client average with Samba. For streaming music and video it has never shuddered or skipped. I have mostly 720p video but a large amount of 1080p as well. It's not uncommon to have 3 HTPCs streaming at the same time and 2 people using the network for other stuff. I'm very happy with it.

I'm SURE you can find a method to backup/restore your data with ZFS. Just think of it more as a backend solution. You'll still probably use whatever method you're used to for transferring data, although I use a combination of samba/nfs and even FTP. If you're used to tar, no need to stop using it. You might also look at rsync. You could set up a ZFS filesystem on the NAS and set up rsync on your client, then set up automatic snapshots on the ZFS machine. This way you'd have multiple methods of restoring (you could just dump back the latest rsync, or you could clone one of the older snapshots and dump THAT back).

On Thu, Sep 3, 2009 at 4:58 PM, Cork Smith wrote:
> Let me try rephrasing this. I would like the ability to restore so my system mirrors its state at the time when I backed it up, given the old hard drive is now a door stop.
>
> Cork
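As a concrete sketch of that last suggestion (hostnames, pool and dataset names are made up for illustration):

    # on the client: push the data to its share on the NAS
    rsync -a --delete /home/cork/ nas:/tank/backups/cork/

    # on the NAS: keep a dated snapshot of what was just received
    zfs snapshot tank/backups/cork@$(date +%Y-%m-%d)

Restoring the latest copy is an rsync in the other direction; restoring an older state means copying out of the .zfs/snapshot directory (or a clone) of the snapshot you want.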
[zfs-discuss] one time passwords - apache infrastructure incident report 8/28/2009
Hi,

Just been reading about the apache.org incident report for 8/28/2009 ( https://blogs.apache.org/infra/entry/apache_org_downtime_report ).

The use of Solaris and ZFS on the European server was interesting, including the recovery. However, what I found more interesting was the use of one-time passwords, which is supported by FreeBSD ( http://www.freebsd.org/doc/en/books/handbook/one-time-passwords.html ).

Could or should this technology be incorporated into OpenSolaris?