Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Ragnar Sundblad

On 4 apr 2010, at 06.01, Richard Elling wrote:

Thank you for your reply! Just wanted to make sure.

 Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

Thanks, I have seen that mistake several times with other
(file)systems, and hope I'll never ever make it myself! :-)

/ragge s

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Willard Korfhage
I would like to get some help diagnosing permanent errors on my files. The 
machine in question has 12 1TB disks connected to an Areca raid card. I 
installed OpenSolaris build 134 and according to zpool history, created a pool 
with

zpool create bigraid raidz2 c4t0d0 c4t0d1 c4t0d2 c4t0d3 c4t0d4 c4t0d5 c4t0d6 
c4t0d7 c4t1d0 c4t1d1 c4t1d2 c4t1d3

I then backed up 806G of files to the machine and had the backup program 
verify the files. It failed. The check is still running, but so far it has 
found 4 files whose backup checksums don't match the checksums of the 
originals. zpool status shows problems:

 $ sudo zpool status -v
  pool: bigraid
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
bigraid     DEGRADED     0     0   536
  raidz2-0  DEGRADED     0     0 3.14K
    c4t0d0  ONLINE       0     0     0
    c4t0d1  ONLINE       0     0     0
    c4t0d2  ONLINE       0     0     0
    c4t0d3  ONLINE       0     0     0
    c4t0d4  ONLINE       0     0     0
    c4t0d5  ONLINE       0     0     0
    c4t0d6  ONLINE       0     0     0
    c4t0d7  ONLINE       0     0     0
    c4t1d0  ONLINE       0     0     0
    c4t1d1  ONLINE       0     0     0
    c4t1d2  ONLINE       0     0     0
    c4t1d3  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

metadata:0x18
metadata:0x3a

So, it appears that one of the disks is bad, but if one disk failed, how would 
a raidz2 pool develop permanent errors? The numbers in the CKSUM column are 
continuing to grow, but is that because the backup verification is tickling the 
errors as it runs?

Previous postings on permanent errors said to look at fmdump -eV, but that has 
437543 lines, and I don't really know how to interpret what I see. I did check 
the vdev_path with  fmdump -eV | grep  vdev_path | sort | uniq -c to see if 
it was only certain disks, but every disk in the array is listed in the file, 
albeit with different frequencies:

2189  vdev_path = /dev/dsk/c4t0d0s0
1077  vdev_path = /dev/dsk/c4t0d1s0
1077  vdev_path = /dev/dsk/c4t0d2s0
1097  vdev_path = /dev/dsk/c4t0d3s0
  25  vdev_path = /dev/dsk/c4t0d4s0
  25  vdev_path = /dev/dsk/c4t0d5s0
  20  vdev_path = /dev/dsk/c4t0d6s0
1072  vdev_path = /dev/dsk/c4t0d7s0
1092  vdev_path = /dev/dsk/c4t1d0s0
      vdev_path = /dev/dsk/c4t1d1s0
2221  vdev_path = /dev/dsk/c4t1d2s0
1149  vdev_path = /dev/dsk/c4t1d3s0
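
I suppose I could also group the ereports by error class and run a scrub so
ZFS re-reads and verifies every block. A rough sketch of what I have in mind,
assuming the counts above come from fmdump ereports:

# Count ereports by class (checksum vs. transport vs. device errors)
fmdump -eV | grep class | sort | uniq -c | sort -rn

# Force ZFS to re-verify every block, then recheck the error counts
zpool scrub bigraid
zpool status -v bigraid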

What should I make of this? All the disks are bad? That seems unlikely. I found 
another thread

http://opensolaris.org/jive/thread.jspa?messageID=399988

where it finally came down to bad memory, so I'll test that. Any other 
suggestions?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] vPool unavailable but RaidZ1 is online

2010-04-04 Thread Kevin
I am trying to recover a RAID set; there are only three drives that are part of 
the set.  I attached a disk and discovered it was bad; it was never part of 
the RAID set.  The disk is now gone, and when I try to import the pool I get the 
error listed below.  Is there a chance to recover?  TIA!

Sun Microsystems Inc.   SunOS 5.11  snv_112 November 2008
# zpool import
  pool: vpool
id: 14231674658037629037
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

vpool       UNAVAIL  missing device
  raidz1    ONLINE
    c0t0d0  ONLINE
    c0t1d0  ONLINE
    c0t2d0  ONLINE

Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.
# bash
bash-3.2# zpool import -fF
  pool: vpool
id: 14231674658037629037
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

vpool       UNAVAIL  missing device
  raidz1    ONLINE
    c0t0d0  ONLINE
    c0t1d0  ONLINE
    c0t2d0  ONLINE

Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Frank Middleton

On 04/ 4/10 10:00 AM, Willard Korfhage wrote:


What should I make of this? All the disks are bad? That seems
unlikely. I found another thread

http://opensolaris.org/jive/thread.jspa?messageID=399988

where it finally came down to bad memory, so I'll test that. Any
other suggestions?


It could be the CPU. I had a very bizarre case where the CPU would
sometimes miscalculate the checksums of certain files, mostly when
the CPU was also busy doing other things. Probably the cache.

Days of running memtest and SUNWvts didn't turn up any errors,
because this was a weirdly pattern-sensitive problem. However, I
too am of the opinion that you shouldn't even think of running ZFS
without ECC memory (lots of threads about that!) and that bad memory
is far, far more likely to be your problem, but I wouldn't count on
diagnostics finding it, either. Of course, it could be the controller too.

For laughs, the cpu calculating bad checksums was discussed in
http://opensolaris.org/jive/message.jspa?messageID=469108
(see last message in the thread).

If you are seriously contemplating using a system with
non-ECC RAM, check out the Google research mentioned in
http://opensolaris.org/jive/thread.jspa?messageID=423770
http://www.cs.toronto.edu/%7Ebianca/papers/sigmetrics09.pdf

Cheers -- Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Diagnosing Permanent Errors

2010-04-04 Thread Willard Korfhage
Yeah, this morning I concluded I really should be running ECC RAM. I sometimes 
wonder why people don't run ECC RAM more often. I remember a decade ago, when 
RAM was much, much less dense, people fretted about alpha particles randomly 
flipping bits, but that concern seems to have died down.

I know, of course, there is some added expense, but browsing on Newegg, the 
additional RAM cost is pretty minimal. I see 2GB ECC sticks going for about $12 
more than similar non-ECC sticks. It's the motherboards that can handle ECC 
which are the expensive part. Now I've got to see what is a good motherboard 
for a file server.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Which zfs options are replicated

2010-04-04 Thread Lutz Schumann
Hello list, 

I started playing around with Comstar in snv_134. In the snv_116 version of ZFS, a 
new hidden property for the Comstar metadata was introduced (stmf_sbd_lu). 
This makes it possible to migrate from the legacy iSCSI target daemon to Comstar 
without data loss, which is great. 

Before this property existed, you always lost the first 64k of your zvol data, where 
Comstar wrote its metadata - which is bad.

When testing send/receive with the latest OpenSolaris, I found that this property 
is not replicated. Without send/receive support for it, it is very difficult 
to use send/receive for disaster recovery. Why? Because the disk IDs of 
the devices change on the target side, so clients must be reconfigured, which 
is difficult with many clients.

After investigating this, I tried iscsioptions - the old-style property of 
the legacy target. This is also not replicated. So it seems to be by design.
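
For what it's worth, one way I can check which visible properties survive is to
simply dump the property lists on both sides and diff them; a rough sketch, with
pool, dataset, and host names as placeholders (hidden properties like stmf_sbd_lu
do not show up in zfs get all at all, which is part of the problem):

# On the source: dump visible properties and their values
zfs get -H -o property,value all tank/vol1 > /tmp/src.props

# Replicate, then dump the same list on the target
zfs send tank/vol1@snap1 | ssh target zfs receive backup/vol1
ssh target zfs get -H -o property,value all backup/vol1 > /tmp/dst.props

# Properties that changed or reverted to their defaults stand out here
diff /tmp/src.props /tmp/dst.props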

So I wonder - where can I find information about which properties are replicated 
and which are not?

Can someone help ? 

Regards,
Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] It's alive, and thank you all for the help.

2010-04-04 Thread R.G. Keen
I finally achieved critical mass on enough parts to put my zfs server together. 
It basically ran the first time, any non-function being my own 
misunderstandings. I wanted to issue a thank you to those of you who suffered 
through my questions and pointed me in the right direction. Many pieces of my 
learning were done right here.

I moaned about the difficulty of figuring out what would run opensolaris and 
zfs before buying hardware. In the end, I used a recipe largely copied from 
someone who had already built a home server. I'd like to return the favor. 

This combination of hardware runs with no problems with the OpenSolaris live CD 
install:
 - ASUS M3A78-CM which implies AMD 780V and SB700
   ...the onboard ethernet on 100Mb wiring with the rge driver
   ... the onboard video runs 1024x768
   ... I did not try onboard sound, DVI, etc.; don't care, it's a server.
 - AMD Athlon II 240e 
 - Kingston 800MHz DDR2 unbuffered ECC ram, 2x 2GB 
 - Syba  SD-SA2PEX-2IR PCIe x1 dual port SATA card with 3124 driver
   ... could not get a disk attached to this to boot the system yet
 - 2x 40GB 2.5 SATA drives, mirrored as rpool for boot
 - 6x Seagate 750GB raid-rated SATA for main storage
 - Corsair 400W 80+ rated PS with single 30A +12V rail for spin-up surge
 - Norco RC470 enclosure, 4U rackmount with between 11 and 15 spaces for 3.5 
disks, internal fans, air filter, etc.; it's big and ungainly, but modestly 
priced and not difficult to do the internal wiring as a result of the size. 
 - the usual clot of cables and adapters

There was no messing about finding new drivers, and nothing failed to work (other 
than the Syba card not being able to boot the system). 

As measured by a Kill-a-watt, the thing peaks at 200W from the wall at spinup, 
but settles to 105W at idle, 10W of which is the five bulkhead fans in the 
case. I suspect that I could pull the plug on maybe 2-3 of them and still not 
have overheating because of the low idle power.

I get a reported 4.06TB of available storage from the six 750GB drives in 
Raidz2, and another 30GB left over unused in the boot pool. 

As yet, I have no performance numbers. I suspect that it will be entirely 
sufficient for my needs, as I don't intend to serve anything with real time 
requirements. It's intended as a simple, large bit-bucket.

Again, thank you to those of you who helped me.

R.G.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-04 Thread Brad
I had always thought that with mpxio, IO requests are load-balanced across your 
storage ports, but this article 
http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ has 
got me thinking it's not true.

The available bandwidth is 2 or 4Gb/s (200 or 400MB/s – FC frames are 10 bytes 
long -) per port. As load balancing software (Powerpath, MPXIO, DMP, etc.) are 
most of the times used both for redundancy and load balancing, I/Os coming from 
a host can take advantage of an aggregated bandwidth of two ports. However, 
reads can use only one path, but writes are duplicated, i.e. a host write ends 
up as one write on each host port. 

Is this true?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mpxio load-balancing...it doesn't work??

2010-04-04 Thread Tim Cook
On Sun, Apr 4, 2010 at 8:55 PM, Brad bene...@yahoo.com wrote:

 I had always thought that with mpxio, IO requests are load-balanced across
 your storage ports, but this article
 http://christianbilien.wordpress.com/2007/03/23/storage-array-bottlenecks/ has 
 got me thinking it's not true.

 The available bandwidth is 2 or 4Gb/s (200 or 400MB/s – FC frames are 10
 bytes long -) per port. As load balancing software (Powerpath, MPXIO, DMP,
 etc.) are most of the times used both for redundancy and load balancing,
 I/Os coming from a host can take advantage of an aggregated bandwidth of two
 ports. However, reads can use only one path, but writes are duplicated, i.e.
 a host write ends up as one write on each host port. 

 Is this true?
 --



I have no idea what MPIO stack he's talking about, but I've never heard of
anything operating the way he describes.  Writes aren't duplicated on
each port.  The path a read OR write goes down depends on the host-side
MPIO stack and how you have it configured to load-balance.  It could be
simple round-robin, it could be based on queue depth, it could be most
recently used, etc. etc.
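
If you want to see what the Solaris stack is actually doing, mpathadm will show
the current load-balance policy and path count per LUN; a quick sketch, with the
device path below just a placeholder for one of your LUNs:

# List all multipathed logical units
mpathadm list lu

# Show details for one LUN, including the current load-balance setting
# and the operational path count
mpathadm show lu /dev/rdsk/c4t60060160ABCD1234d0s2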

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with zfs and a STK RAID INT SAS HBA

2010-04-04 Thread Edward Ned Harvey
 When running the card in copyback write cache mode, I got horrible
 performance (with zfs), much worse than with copyback disabled
 (which I believe should mean it does write-through), when tested
 with filebench.

When I benchmark my disks, I also find that the system is slower with
WriteBack enabled.  I would not call it much worse, I'd estimate about 10%
worse.  This, naturally, is counterintuitive.  I do have an explanation,
however, which is partly conjecture:  With the WriteBack enabled, when the
OS tells the HBA to write something, it seems to complete instantly.  So the
OS will issue another, and another, and another.  The HBA has no knowledge
of the underlying pool data structure, so it cannot consolidate the smaller
writes into larger sequential ones.  It will brainlessly (or
less-brainfully) do as it was told, and write the blocks to precisely the
addresses that it was instructed to write.  Even if those are many small
writes, scattered throughout the platters.  ZFS is smarter than that.  It's
able to consolidate a zillion tiny writes, as well as some larger writes,
all into a larger sequential transaction.  ZFS has flexibility, in choosing
precisely how large a transaction it will create, before sending it to disk.
One of the variables used to decide how large the transaction should be is
... Is the disk busy writing, right now?  If the disks are still busy, I
might as well wait a little longer and continue building up my next
sequential block of data to write.  If it appears to have completed the
previous transaction already, no need to wait any longer.  Don't let the
disks sit idle.  Just send another small write to the disk.

Long story short, I think ZFS simply does a better job of write buffering
than the HBA could possibly do.  So you benefit by disabling the WriteBack,
in order to allow ZFS to handle that instead.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-04 Thread Edward Ned Harvey
 Your experience is exactly why I suggested ZFS start doing some right
 sizing if you will.  Chop off a bit from the end of any disk so that
 we're guaranteed to be able to replace drives from different
 manufacturers.  The excuse being no reason to, Sun drives are always
 of identical size.  If your drives did indeed come from Sun, their
 response is clearly not true.  Regardless, I guess I still think it
 should be done.  Figure out what the greatest variation we've seen from
 drives that are supposedly of the exact same size, and chop it off the
 end of every disk.  I'm betting it's no more than 1GB, and probably
 less than that.  When we're talking about a 2TB drive, I'm willing to
 give up a gig to be guaranteed I won't have any issues when it comes
 time to swap it out.

My disks are Sun-branded Intel disks.  Same model number.  The first
replacement disk had newer firmware, so we jumped to the conclusion that that
was the cause of the problem, and caused Oracle plenty of trouble locating an
older-firmware drive in some warehouse somewhere.  But the second
replacement disk is truly identical to the original.  Same firmware and
everything.  Only the serial number is different.  Still the same problem
behavior.

I have reason to believe that both the drive and the OS are correct.  I
suspect that the HBA simply handled the creation of this volume
somehow differently than how it handled the original.  Don't know the answer
for sure yet.
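
One way to test that suspicion would be to compare the geometry the OS actually
sees for the original volume and the replacement; a rough sketch, with device
names as placeholders:

# Sector counts and label geometry as the OS sees them
prtvtoc /dev/rdsk/c0t0d0s2
prtvtoc /dev/rdsk/c0t1d0s2

# Reported capacity per device, for a quick side-by-side check
iostat -En c0t0d0 c0t1d0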

Either way, yes, I would love zpool to automatically waste a little space at
the end of the drive, to avoid this sort of situation, whether it's caused
by drive manufacturers, or HBA, or any other factor.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-04 Thread Edward Ned Harvey
 CR 6844090, zfs should be able to mirror to a smaller disk
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090
 b117, June 2009

Awesome.  Now if someone would only port that to solaris, I'd be a happy
man.   ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-04 Thread Tim Cook
On Sun, Apr 4, 2010 at 9:46 PM, Edward Ned Harvey solar...@nedharvey.comwrote:

  CR 6844090, zfs should be able to mirror to a smaller disk
  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090
  b117, June 2009

 Awesome.  Now if someone would only port that to solaris, I'd be a happy
 man.   ;-)



Have you tried pointing that bug out to the support engineers who have your
case at Oracle?  If the fixed code is already out there, it's just a matter
of porting the code, right?  :)

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Edward Ned Harvey
 Hmm, when you did the write-back test was the ZIL SSD included in the
 write-back?

 What I was proposing was write-back only on the disks, and ZIL SSD
 with no write-back.

The tests I did were:
All disks write-through
All disks write-back
With/without SSD for ZIL

All the permutations of the above.

So, unfortunately, no, I didn't test with WriteBack enabled only for
spindles, and WriteThrough on SSD.  

It has been suggested, and this is actually what I now believe based on my
experience, that precisely the opposite would be the better configuration:
spindles configured WriteThrough, and the SSD configured WriteBack.  I believe
that would be optimal.

If I get the opportunity to test further, I'm interested and I will.  But
who knows when/if that will happen.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-04 Thread Edward Ned Harvey
 Actually, It's my experience that Sun (and other vendors) do exactly
 that for you when you buy their parts - at least for rotating drives, I
 have no experience with SSD's.
 
 The Sun disk label shipped on all the drives is setup to make the drive
 the standard size for that sun part number. They have to do this since
 they (for many reasons) have many sources (diff. vendors, even diff.
 parts from the same vendor) for the actual disks they use for a
 particular Sun part number.

Actually, if there is an fdisk partition and/or disk label on a drive when it
arrives, I'm pretty sure that's irrelevant, because when I first connect a
new drive to the HBA, the HBA has to sign and initialize the drive
at a lower level than what the OS normally sees.  So unless I do some sort
of special operation to tell the HBA to preserve/import a foreign disk, the
HBA will make the disk blank before the OS sees it anyway.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-04 Thread Edward Ned Harvey
  There is some question about performance.  Is there any additional
 overhead caused by using a slice instead of the whole physical device?
 
 No.
 
 If the disk is only used for ZFS, then it is ok to enable volatile disk
 write caching
 if the disk also supports write cache flush requests.
 
 If the disk is shared with UFS, then it is not ok to enable volatile
 disk write caching.

Thank you.  If you don't know the answer to this off the top of your head,
I'll go attempt the internet, but thought you might just know the answer in
2 seconds ...

Assuming the disk's write cache is disabled because of the slice (as
documented in the Best Practices Guide) how do you enable it?  I would only
be using ZFS on the drive.  The existence of a slice is purely to avoid
future mirror problems and the like.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-04 Thread Edward Ned Harvey
I haven't taken that approach, but I guess I'll give it a try.

From: Tim Cook [mailto:t...@cook.ms] 
Sent: Sunday, April 04, 2010 11:00 PM
To: Edward Ned Harvey
Cc: Richard Elling; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] To slice, or not to slice

On Sun, Apr 4, 2010 at 9:46 PM, Edward Ned Harvey solar...@nedharvey.com
wrote:

 CR 6844090, zfs should be able to mirror to a smaller disk
 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844090
 b117, June 2009

Awesome.  Now if someone would only port that to solaris, I'd be a happy
man.   ;-)



Have you tried pointing that bug out to the support engineers who have your
case at Oracle?  If the fixed code is already out there, it's just a matter
of porting the code, right?  :)

--Tim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] writeback vs writethrough [was: Sun Flash Accelerator F20 numbers]

2010-04-04 Thread Richard Elling
On Apr 2, 2010, at 5:03 AM, Edward Ned Harvey wrote:

 Seriously, all disks configured WriteThrough (spindle and SSD disks
 alike)
 using the dedicated ZIL SSD device, very noticeably faster than
 enabling the
 WriteBack.
 
 What do you get with both SSD ZIL and WriteBack disks enabled?
 
 I mean if you have both why not use both? Then both async and sync IO
 benefits.
 
 Interesting, but unfortunately false.  Soon I'll post the results here.  I
 just need to package them in a way suitable to give the public, and stick it
 on a website.  But I'm fighting IT fires for now and haven't had the time
 yet.
 
 Roughly speaking, the following are approximately representative.  Of course
 it varies based on tweaks of the benchmark and stuff like that.
   Stripe 3 mirrors write through:  450-780 IOPS
   Stripe 3 mirrors write back:  1030-2130 IOPS
   Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
   Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

Thanks for sharing these interesting numbers.

 Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
 ZIL is 3-4 times faster than naked disk.  And for some reason, having the
 WriteBack enabled while you have SSD ZIL actually hurts performance by
 approx 10%.  You're better off to use the SSD ZIL with disks in Write
 Through mode.

YMMV. The write workload for ZFS is best characterized by looking at
the txg commit.  In a very short period of time ZFS sends a lot[1] of write
I/O to the vdevs. It is not surprising that this can blow through the 
relatively small caches on controllers. Once you blow through the cache,
then the [in]efficiency of the disks behind the cache is experienced as
well as the [in]efficiency of the cache controller. Alas, little public 
information seems to be published regarding how those caches work. 

Changing to write-through effectively changes the G/M/1 queue [2]
at the controller to a G/M/n queue at the disks.  Sorta like:
1. write-back controller
(ZFS) N*#vdev I/Os -- controller -- disks
(ZFS) M/M/n -- G/M/1 -- M/M/n

2. write-through controller
(ZFS) N*#vdev I/Os  -- disks
(ZFS) M/M/n  -- G/M/n

This can simply be a case of the middleman becoming the bottleneck.

[1] a lot means up to 35 I/Os per vdev for older releases, 4-10 I/Os per
vdev for more recent releases

[2] queuing theory enthusiasts will note that ZFS writes do not exhibit an
exponential arrival rate at the controller or disks except for sync writes.
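
A simple way to watch this burst behaviour on a live system is to sample the
pool and the devices at one-second intervals during a steady write load.
Roughly (pool name is a placeholder):

# Per-vdev bandwidth once a second; the txg commit shows up as a burst
# of writes every few seconds rather than a steady trickle
zpool iostat -v tank 1

# The same bursts, seen from the device side, with service times
iostat -xn 1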

 That result is surprising to me.  But I have a theory to explain it.  When
 you have WriteBack enabled, the OS issues a small write, and the HBA
 immediately returns to the OS:  Yes, it's on nonvolatile storage.  So the
 OS quickly gives it another, and another, until the HBA write cache is full.
 Now the HBA faces the task of writing all those tiny writes to disk, and the
 HBA must simply follow orders, writing a tiny chunk to the sector it said it
 would write, and so on.  The HBA cannot effectively consolidate the small
 writes into a larger sequential block write.  But if you have the WriteBack
 disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
 SSD, and immediately return to the process:  Yes, it's on nonvolatile
 storage.  So the application can issue another, and another, and another.
 ZFS is smart enough to aggregate all these tiny write operations into a
 single larger sequential write before sending it to the spindle disks.  

I agree, though this paragraph has 3 different thoughts embedded.
Taken separately:
1. queuing surprises people :-)
2. writeback inserts a middleman with its own queue
3. separate logs radically change the write workload seen by
   the controller and disks

 Long story short, the evidence suggests if you have SSD ZIL, you're better
 off without WriteBack on the HBA.  And I conjecture the reasoning behind it
 is because ZFS can write buffer better than the HBA can.

I think the way the separate log works is orthogonal. However, not 
having a separate log can influence the ability of the controller and
disks to respond to read requests during this workload.  

Perhaps this is a long way around to saying that a well tuned system
will have harmony among its parts.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-04 Thread Richard Elling
On Apr 4, 2010, at 8:11 PM, Edward Ned Harvey wrote:
 There is some question about performance.  Is there any additional
 overhead caused by using a slice instead of the whole physical device?
 
 No.
 
 If the disk is only used for ZFS, then it is ok to enable volatile disk
 write caching
 if the disk also supports write cache flush requests.
 
 If the disk is shared with UFS, then it is not ok to enable volatile
 disk write caching.
 
 Thank you.  If you don't know the answer to this off the top of your head,
 I'll go attempt the internet, but thought you might just know the answer in
 2 seconds ...
 
 Assuming the disk's write cache is disabled because of the slice (as
 documented in the Best Practices Guide) how do you enable it?  I would only
 be using ZFS on the drive.  The existence of a slice is purely to avoid
 future mirror problems and the like.

This is a trick question -- some drives ignore efforts to disable the write 
cache :-P

Use format -e for access to the expert mode where you can enable
the write cache. 
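
For the record, the expert-mode path looks roughly like this; treat it as a
sketch, since the menu wording can differ between releases and drive types:

# format -e, then pick the disk from the menu
format -e
format> cache
cache> write_cache
write_cache> display        # show the current state
write_cache> enable         # turn the volatile write cache on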

As for performance benefits, YMMV.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS getting slower over time

2010-04-04 Thread Marcus Wilhelmsson
I have a problem with my ZFS system: it's getting slower and slower over time. 
When the OpenSolaris machine has just been rebooted I get about 30-35MB/s 
in read and write, but after 4-8 hours I'm down to maybe 10MB/s, and it varies 
between 4-18MB/s. If I reboot the machine, the problem is gone and I have perfect 
speed again.

Does it have something to do with the cache? I use a separate SSD as a cache 
disk.
Anyways, here's my setup:
OpenSolaris 1.34 dev
C2D with 4GB ram
4x 1,5TB WD SATA drives and 1x Corsair 32GB SSD as cache

It doesn't seem to matter whether I copy files locally on the computer or use 
CIFS; I still get the same degradation in speed. Last night I left my 
workstation copying files to/from the server for about 8 hours, and you could 
see the performance dropping from about 28MB/s down to under 10MB/s after a 
couple of hours.

Any suggestion on what to do?

I've tried some tuning by setting the following variables in /etc/system:
set zfs:zfs_txg_timeout = 1
set zfs:zfs_vdev_max_pending = 1

But it doesn't seem to make any difference.
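
I'm also wondering whether the ARC has simply grown to consume most of the 4GB
and is fighting the rest of the system for memory. Here is roughly what I plan
to check next (I'm not sure of the exact output format on this build):

# Current ARC size, target, and maximum, in bytes
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max

# Summary view from the kernel debugger
echo ::arc | mdb -k

# Free memory over time while the copy is running
vmstat 5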

Regards
/Marcus Wilhelmsson, Kalmar, Sweden

-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss