Re: [zfs-discuss] ZFS performance falls off a cliff

2011-05-13 Thread Don
~# uname -a
SunOS nas01a 5.11 oi_147 i86pc i386 i86pc Solaris

~# zfs get version pool0
NAME   PROPERTY  VALUE    SOURCE
pool0  version   5        -

~# zpool get version pool0
NAME   PROPERTY  VALUE    SOURCE
pool0  version   28       default


Re: [zfs-discuss] Performance problem suggestions?

2011-05-12 Thread Don
 This is a slow operation which can only be done about 180-250 times per second
 for very random I/Os (may be more with HDD/Controller caching, queuing and
 faster spindles).
 I'm afraid that seeking to very dispersed metadata blocks, such as traversing
 the tree during a scrub on a fragmented drive, may qualify as a very random I/O.
And that's the thing- I would understand if my scrub was slow because the disks 
were just being hammered by IOPS, but- all joking aside- my pool is almost 
entirely idle according to iostat -xn.
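
For reference, a minimal way to double-check that observation while the scrub runs is to watch the extended device statistics and the per-vdev counters side by side (pool name pool0 assumed from the version output above):

  # Per-device service time (asvc_t) and %busy every 5 seconds; these should
  # climb if the spindles really are the bottleneck during the scrub.
  iostat -xn 5

  # Per-vdev read/write ops and bandwidth for the same pool and interval.
  zpool iostat -v pool0 5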


Re: [zfs-discuss] Modify stmf_sbd_lu properties

2011-05-11 Thread Don
I can't actually disable the STMF framework to do this but I can try renaming 
things and dumping the properties from one device to another and see if it 
works- it might actually do it. I will let you know.


Re: [zfs-discuss] Performance problem suggestions?

2011-05-11 Thread Don
 It sent a series of blocks to write from the queue, newer disks wrote them 
 and stay
 dormant, while older disks seek around to fit that piece of data... When old 
 disks
 complete the writes, ZFS batches them a new set of tasks.
The thing is- as far as I know the OS doesn't ask the disk to find a place to 
fit the data. Instead the OS tracks what space on the disk is free and then 
tells the disk where to write the data.

Even if ZFS were waiting for the I/O to complete, I would expect to see that delay 
reflected in the disk service times. In our case we see no high service times, 
no busy disks, nothing. It seems like ZFS is just sitting there quietly, 
thinking to itself. If the processor were busy that might make sense, but even 
there- our processor seems largely idle.

At the same time- even a scrub on this system is a joke right now, and that's a 
read-intensive operation. I'm seeing a scrub speed of 400K/s but almost no I/Os 
to my disks.
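
A hedged way to watch the same thing from the ZFS side (pool name again assumed to be pool0) is to compare the reported scrub progress against the pool-level I/O rate:

  # Scrub progress (percent done, estimated time remaining) while it runs.
  zpool status -v pool0

  # Aggregate pool read/write ops and bandwidth over 5-second intervals.
  zpool iostat pool0 5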


Re: [zfs-discuss] Modify stmf_sbd_lu properties

2011-05-11 Thread Don
It turns out this was actually as simple as:
stmfadm create-lu -p guid=XXX..

I kept looking at modify-lu to change this and never thought to check the 
create-lu options.

Thanks to Evaldas for the suggestion.
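
For the archives, the sequence being described looks roughly like the following; the GUID and zvol path are made up for illustration, and this assumes the old LU has already been removed with sbdadm delete-lu so the GUID is free to reuse:

  # Drop the stale logical unit (the backing zvol and its data are untouched).
  sbdadm delete-lu 600144f0abcdef0000004dca12340001

  # Recreate the LU over the new/renamed zvol, forcing the original GUID so
  # existing views and initiator configurations keep working.
  stmfadm create-lu -p guid=600144f0abcdef0000004dca12340001 \
      /dev/zvol/rdsk/pool0/newvol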


Re: [zfs-discuss] Performance problem suggestions?

2011-05-10 Thread Don
I've been going through my iostat, zilstat, and other outputs all to no avail. 
None of my disks ever seem to show outrageous service times, the load on the 
box is never high, and if the darned thing is CPU bound- I'm not even sure 
where to look.

 (traversing DDT blocks even if in memory, etc - and kernel times indeed are
 above 50%) as I'm zeroing deleted blocks inside the internal pool. This
 took several days already, but recovered lots of space in my main pool also...
When you say you are zeroing deleted blocks- how are you going about doing that?

Despite claims to the contrary- I can understand ZFS needing some tuning. What 
I can't understand are the baffling differences in performance I see. For 
example- after deleting a large volume- suddenly my performance will skyrocket- 
then gradually degrade- but the question is why?

I'm not running dedup. My disks seem to be largely idle. I have 8 3GHz cores 
that also seem to be idle. I seem to have enough memory. What is ZFS doing 
during this time?

Everything I've read suggests one of two possible causes- too full, or bad 
hardware. Is there anything else that might be an issue here? Another ZFS 
factor I haven't taken into account?

Space seems to be the biggest factor in my performance difference- more free 
space = more performance- but as my fullest disks are less than 70% full, and 
my emptiest disks are less than 10% full- I can't understand why space is an 
issue.
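
One way to sanity-check the free-space theory is to look at allocation per top-level vdev rather than per disk (pool name assumed):

  # Allocated vs. free space broken out by top-level vdev.
  zpool iostat -v pool0

  # Overall pool size, allocation and capacity percentage.
  zpool list pool0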

I have a few hardware errors for one of my pool disks- but we're talking about 
a very small number of errors over a long period of time. I'm considering 
replacing this disk, but the pool is so slow at times that I'm loath to slow it 
down further by doing a replace unless I can be more certain that it is going to 
fix the problem.
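
Before committing to the replace, the cheap checks below can show whether that one disk really is misbehaving; device and pool names will differ, so treat this as a sketch:

  # Cumulative soft/hard/transport error counters for every device.
  iostat -En

  # ZFS-level read/write/checksum error counters for the pool.
  zpool status -v pool0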


Re: [zfs-discuss] Performance problem suggestions?

2011-05-10 Thread Don
 # dd if=/dev/zero of=/dcpool/nodedup/bigzerofile
Ahh- I misunderstood your pool layout earlier. Now I see what you were doing.
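
For readers who missed the earlier thread, the quoted technique amounts to the following (path taken from the quote; it only helps where the layer underneath can reclaim or compress zeroed blocks):

  # Fill the free space of a non-dedup filesystem with zeroes...
  dd if=/dev/zero of=/dcpool/nodedup/bigzerofile bs=1024k

  # ...then remove the file so the zeroed blocks are freed again.
  rm /dcpool/nodedup/bigzerofile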

 People on this forum have seen and reported that adding a 100Mb file tanked
 their multiterabyte pool's performance, and removing the file boosted it back up.
Sadly I think several of those posts were mine or those of coworkers.

 Disks that have been in use for a longer time may have very fragmented free
 space on one hand, and not so much of it on another, but ZFS is still trying 
 to push
 bits around evenly. And while it's waiting on some disks, others may be 
 blocked as
 well. Something like that...
This could explain why performance would go up after a large delete, but I've 
not seen large wait times for any of my disks. The service time, percent busy, 
and every other metric continue to show nearly idle disks.

If this is the problem- it would be nice if there were a simple zfs or dtrace 
query that would show it to you.
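
In the absence of a canned query, a rough DTrace sketch like the one below (standard io provider, run as root and interrupted with Ctrl-C) prints per-device I/O latency histograms, which should make a stalled-but-idle pool easier to spot:

  # Latency from I/O issue to completion, in nanoseconds, keyed by device.
  dtrace -n '
  io:::start { start[arg0] = timestamp; }
  io:::done /start[arg0]/ {
          @lat[args[1]->dev_statname] = quantize(timestamp - start[arg0]);
          start[arg0] = 0;
  }'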


Re: [zfs-discuss] zfs send receive problem/questions

2010-12-03 Thread Don Jackson
 Try using the -d option to zfs receive.  The ability to do zfs send -R ... |
 zfs receive [without -d] was added relatively recently, and you may be
 encountering a bug that is specific to receiving a send of a whole pool.

I just tried this, didn't work, new error:

 # zfs send -R naspool/openbsd@xfer-11292010 | zfs recv -d npool/openbsd
 cannot receive new filesystem stream: out of space

The destination pool is much larger (by several TB)  than the source pool, so I 
don't see how it can not have enough disk space:

# zfs list -r npool/openbsd
NAME  USED  AVAIL  REFER  MOUNTPOINT
npool/openbsd  82.5G  7.18T  23.5G  /npool/openbsd
npool/openbsd@xfer-11292010  0  -  23.5G  -
npool/openbsd/openbsd  59.0G  7.18T  23.5G  /npool/openbsd/openbsd
npool/openbsd/openbsd@xfer-11292010  0  -  23.5G  -
npool/openbsd/openbsd/4.5  22.3G  7.18T  1.54G  /npool/openbsd/openbsd/4.5
npool/openbsd/openbsd/4.5@xfer-11292010  0  -  1.54G  -
npool/openbsd/openbsd/4.5/packages  18.7G  7.18T  18.7G  /npool/openbsd/openbsd/4.5/packages
npool/openbsd/openbsd/4.5/packages@xfer-11292010  0  -  18.7G  -
npool/openbsd/openbsd/4.5/packages-local  49.7K  7.18T  49.7K  /npool/openbsd/openbsd/4.5/packages-local
npool/openbsd/openbsd/4.5/packages-local@xfer-11292010  0  -  49.7K  -
npool/openbsd/openbsd/4.5/ports  288M  7.18T  259M  /npool/openbsd/openbsd/4.5/ports
npool/openbsd/openbsd/4.5/ports@patch000  47.2K  -  49.7K  -
npool/openbsd/openbsd/4.5/ports@patch005  29.0M  -  261M  -
npool/openbsd/openbsd/4.5/ports@xfer-11292010  0  -  259M  -
npool/openbsd/openbsd/4.5/release  462M  7.18T  462M  /npool/openbsd/openbsd/4.5/release
npool/openbsd/openbsd/4.5/release@xfer-11292010  0  -  462M  -
npool/openbsd/openbsd/4.5/src  728M  7.18T  703M  /npool/openbsd/openbsd/4.5/src
npool/openbsd/openbsd/4.5/src@patch000  47.2K  -  49.7K  -
npool/openbsd/openbsd/4.5/src@patch005  25.1M  -  709M  -
npool/openbsd/openbsd/4.5/src@xfer-11292010  0  -  703M  -
npool/openbsd/openbsd/4.5/xenocara  572M  7.18T  565M  /npool/openbsd/openbsd/4.5/xenocara
npool/openbsd/openbsd/4.5/xenocara@patch000  47.2K  -  49.7K  -
npool/openbsd/openbsd/4.5/xenocara@patch005  6.52M  -  565M  -
npool/openbsd/openbsd/4.5/xenocara@xfer-11292010  0  -  565M  -
npool/openbsd/openbsd/4.8  13.2G  7.18T  413M  /npool/openbsd/openbsd/4.8
npool/openbsd/openbsd/4.8@xfer-11292010  0  -  413M  -
npool/openbsd/openbsd/4.8/packages  11.9G  7.18T  11.9G  /npool/openbsd/openbsd/4.8/packages
npool/openbsd/openbsd/4.8/packages@xfer-11292010  0  -  11.9G  -
npool/openbsd/openbsd/4.8/packages-local  49.7K  7.18T  49.7K  /npool/openbsd/openbsd/4.8/packages-local
npool/openbsd/openbsd/4.8/packages-local@xfer-11292010  0  -  49.7K  -
npool/openbsd/openbsd/4.8/ports  277M  7.18T  277M  /npool/openbsd/openbsd/4.8/ports
npool/openbsd/openbsd/4.8/ports@patch000  47.2K  -  49.7K  -
npool/openbsd/openbsd/4.8/ports@xfer-11292010  0  -  277M  -
npool/openbsd/openbsd/4.8/release  577M  7.18T  577M  /npool/openbsd/openbsd/4.8/release
npool/openbsd/openbsd/4.8/release@xfer-11292010  0  -  577M  -
npool/openbsd/openbsd/4.8/src  96.9K  7.18T  49.7K  /npool/openbsd/openbsd/4.8/src
npool/openbsd/openbsd/4.8/src@patch000  47.2K  -  49.7K  -
npool/openbsd/openbsd/4.8/src@xfer-11292010  0  -  49.7K  -
npool/openbsd/openbsd/4.8/xenocara  96.9K  7.18T  49.7K  /npool/openbsd/openbsd/4.8/xenocara
npool/openbsd/openbsd/4.8/xenocara@patch000  47.2K  -  49.7K  -
npool/openbsd/openbsd/4.8/xenocara@xfer-11292010  0  -  49.7K  -
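
One hedged thing to rule out when a receive into a visibly larger pool reports out of space is a quota or reservation somewhere on the target hierarchy, e.g.:

  # Any quota/reservation on the target tree caps what the receive can allocate.
  zfs get -r quota,refquota,reservation,refreservation npool/openbsd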


[zfs-discuss] zfs send receive problem/questions

2010-12-01 Thread Don Jackson
Hello, 

I am attempting to move a bunch of zfs filesystems from one pool to another.

Mostly this is working fine, but one collection of file systems is causing me 
problems, and repeated re-reading of man zfs and the ZFS Administrators Guide 
is not helping.  I would really appreciate some help/advice.

Here is the scenario.
I have a nested hierarchy of ZFS file systems.
Some of the deeper file systems have snapshots.
All of this exists on the source zpool.
First I recursively snapshotted the whole subtree:

   zfs snapshot -r naspool@xfer-11292010

Here is a subset of the source zpool:

# zfs list -r naspool
NAME  USED  AVAIL  REFER  MOUNTPOINT
naspool  1.74T  42.4G  37.4K  /naspool
naspool@xfer-11292010  0  -  37.4K  -
naspool/openbsd  113G  42.4G  23.3G  /naspool/openbsd
naspool/openbsd@xfer-11292010  0  -  23.3G  -
naspool/openbsd/4.4  21.6G  42.4G  2.33G  /naspool/openbsd/4.4
naspool/openbsd/4.4@xfer-11292010  0  -  2.33G  -
naspool/openbsd/4.4/ports  592M  42.4G  200M  /naspool/openbsd/4.4/ports
naspool/openbsd/4.4/ports@patch000  52.5M  -  169M  -
naspool/openbsd/4.4/ports@patch006  54.7M  -  194M  -
naspool/openbsd/4.4/ports@patch007  54.9M  -  194M  -
naspool/openbsd/4.4/ports@patch013  55.1M  -  194M  -
naspool/openbsd/4.4/ports@patch016  35.1M  -  200M  -
naspool/openbsd/4.4/ports@xfer-11292010  0  -  200M  -

Now I want to send this whole hierarchy to a new pool.

# zfs create npool/openbsd
# zfs send -R naspool/openbsd@xfer-11292010 | zfs receive -Fv npool/openbsd
receiving full stream of naspool/openbsd@xfer-11292010 into npool/openbsd@xfer-11292010
received 23.5GB stream in 883 seconds (27.3MB/sec)
cannot receive new filesystem stream: destination has snapshots (eg. npool/openbsd@xfer-11292010)
must destroy them to overwrite it

What am I doing wrong?  What is the proper way to accomplish my goal here?
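
For what it's worth, a common pattern for this kind of whole-tree copy is to not pre-create the target filesystem and let the receive build the hierarchy itself; a sketch using the names above, assuming nothing else lives under npool/openbsd yet:

  # Start from a clean slate on the target side...
  zfs destroy -r npool/openbsd

  # ...then let -d derive the dataset names from the sent stream, which
  # recreates naspool/openbsd and its children as npool/openbsd/... in one pass.
  zfs send -R naspool/openbsd@xfer-11292010 | zfs receive -d npool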

And I have a follow up question:

I had to snapshot the source zpool filesystems in order to zfs send them.

Once they are received on the new zpool, I really don't need nor want this 
snapshot on the receiving side.
Is it OK to zfs destroy that snapshot?
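
On the follow-up question: the transfer snapshot only matters as the common base for any later incremental sends, so if no incrementals are planned it can be removed recursively on the receiving side once the copy is verified, e.g.:

  # Remove the xfer snapshot from the received tree (and all descendants).
  zfs destroy -r npool/openbsd@xfer-11292010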

I've been pounding my head against this problem for a couple of days, and I 
would definitely appreciate any tips/pointers/advice.

Don


Re: [zfs-discuss] zfs send receive problem/questions

2010-12-01 Thread Don Jackson
Here is some more info on my system:

This machine is running Solaris 10 U9, with all the patches as of 11/10/2010.

The source zpool I am attempting to transfer from was originally created on an 
older OpenSolaris (specifically Nevada) release, I think it was build 111.
I did a zpool export on that zpool, physically transferred those drives to the 
new machine, did a zpool import there, and then upgraded the ZFS version on the 
imported zpool; now:

# zpool upgrade
This system is currently running ZFS pool version 22.
All pools are formatted using this version.

The reference to OpenBSD in the directory paths in the listings I provided 
refers only to the data that is stored therein, the actual OS I am running here 
is Solaris 10.

# zpool status naspool npool
  pool: naspool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
naspool     ONLINE       0     0     0
  raidz2-0  ONLINE       0     0     0
    c0t1d0  ONLINE       0     0     0
    c0t2d0  ONLINE       0     0     0
    c0t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: npool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
npool       ONLINE       0     0     0
  raidz3-0  ONLINE       0     0     0
    c0t4d0  ONLINE       0     0     0
    c0t5d0  ONLINE       0     0     0
    c0t6d0  ONLINE       0     0     0
    c0t7d0  ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
    c1t6d0  ONLINE       0     0     0
    c1t7d0  ONLINE       0     0     0

errors: No known data errors


[zfs-discuss] Resizing ZFS block devices and sbdadm

2010-11-30 Thread Don
sbdadm can be used with a regular ZFS file or a ZFS block device.

Is there an advantage to using a ZFS block device and exporting it to COMSTAR 
via sbdadm, as opposed to using a file and exporting it (e.g. performance or 
manageability)?

Also- let's say you have a 5G block device called pool/test

You can resize it by doing:
zfs set volsize=10G pool/test

However, if the device was already imported into COMSTAR, then stmfadm list-lu -v 
<guid> will still report the original 5G size. You can use sbdadm 
modify-lu -s 10G /path_to_block_device, but I'm not sure whether there is a chance 
you might run into a size difference between ZFS and sbd.

i.e.- if I specify 10G in ZFS and I do an sbdadm modify-lu -s 10G, is there any 
chance they won't align and I'll try to write past the end of the zvol?
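
A hedged way to answer the alignment question for a specific device is simply to read both sizes back in bytes after the change and compare them (pool/test from the example above):

  # Grow the zvol, then grow the LU to match.
  zfs set volsize=10G pool/test
  sbdadm modify-lu -s 10G /dev/zvol/rdsk/pool/test

  # Exact byte counts as seen by ZFS and by COMSTAR.
  zfs get -Hp volsize pool/test
  stmfadm list-lu -v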

Thanks in advance-


Re: [zfs-discuss] zpool lockup and dedupratio meanings

2010-11-26 Thread Don
 I've previously posted about some lockups I've experienced with ZFS.

 There were two suspected causes at the time: one was deduplication, and one
 was the 2009.06 code we were running.

After upgrading the zpools and adding some more disks to the pool I initiated a 
zpool scrub and was rewarded with an immediate zfs lockup. I switched to my 
backup head, killed and restarted the scrub and poof- lockup.

Anyone have any ideas why a scrub would lockup my pool? The system itself, and 
the root pool have no problems. The lockup occurs whether I try to write 
directly to the pool from the system or to the pool via comstar.


[zfs-discuss] ZFS performance questions

2010-11-24 Thread Don
I have an OpenSolaris (technically OI 147) box running ZFS with Comstar (zpool 
version 28, zfs version 5)

The box is a 2950 with 32 GB of RAM, Dell SAS5/e card connected to 6 Promise 
vTrak J610sD (dual controller SAS) disk shelves spread across both channels of 
the card (2 chains of 3 shelves).

We currently have:
4 x OCZ Vertex 2 SSD's configured as a ZIL (We've been experimenting without a 
dedicated ZIL, with 2 mirrors, and with 4 individual drives- these are not 
meant to be a permanent part of the array- they were installed to evaluate 
limited SSD benefits)
2 x 300GB 15k RPM Hot Spare drives- one on each channel
2 x 600GB 15k RPM Hot Spare drives- one on each channel
52 x 300GB 15k RPM disks configured as 4-disk RAIDZ (13 vdevs)
20 x 600GB 15k RPM disks configured as 4-disk RAIDZ (5 vdevs)

(Eventually there will be 16 more 600GB disks - 4 more vdevs for a total of 22 
vdevs)

Most of our disk access is through COMSTAR via iSCSI. That said- even 
performance tests direct to the local disks reveal good, but not great 
performance.

Most of our sequential write performance tests show about 200 MB/sec to the 
storage, which seems pretty low given the number of disks and their individual 
performance.

I'd love to have configured the disks as mirrors but I needed a minimum of 20 
TB in the space provided and I could not achieve that when using mirrors.

Can anyone provide a link to good performance analysis resources so I can try 
to track down where my limited write performance is coming from?
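
Not a link, but as a first pass the usual counters on the box itself tend to narrow things down; pool name assumed to be pool0:

  # Per-vdev throughput and per-device latency while a test write runs.
  zpool iostat -v pool0 5
  iostat -xn 5

  # CPU spread across the cores, to rule out a single saturated thread.
  mpstat 5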


[zfs-discuss] zpool lockup and dedupratio meanings

2010-11-20 Thread Don
I've previously posted about some lockups I've experienced with ZFS.

There were two suspected causes at the time: one was deduplication, and one was 
the 2009.06 code we were running.

I upgraded to OpenIndiana 147 (initially without upgrading the zpool and zfs 
disk versions). The lockups reduced in frequency but still occurred. I've since 
upgraded the zpool and zfs versions and we'll see what happens.

Dedup was the more likely cause, so we turned it off and recreated all the 
iSCSI LUNs that were being exported in order to eliminate the deduplicated data. 
That almost entirely eliminated the lockups.

Having said all that- I have two questions:
When I query for the dedupratio, I still see a value of 2.37x:
--
root@nas:~# zpool get dedupratio pool0
NAME   PROPERTY    VALUE  SOURCE
pool0  dedupratio  2.37x  -
--
Considering that all of the iSCSI LUNs that were created while dedup was on were 
subsequently deleted and recreated with dedup disabled- I don't understand why 
the value is still 2.37x. It should be near zero (there are probably a couple of 
small LUNs that were not removed, but they are rarely used). Am I misinterpreting 
the meaning of this number?
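
One hedged way to see what is still backing that ratio is to dump the dedup-table summary; if the remaining small LUNs were written while dedup was on, their blocks keep the DDT (and the pool-wide ratio) alive:

  # Summary of dedup table entries and referenced vs. allocated totals.
  zdb -DD pool0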

Second question:
The most recent pool lockup was caused when a zpool scrub was kicked off. 
Initially we see 0 values for the write bandwidth in a zpool iostat and average 
numbers for the read. After a few minutes we see the read numbers jump to 
several hundred megs/second and the write performance fluctuate between 0 and a 
few kilobytes/second. Anyone else see this behavior and can provide some 
insight?


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread Don
 Now, if someone would make a Battery FOB, that gives broken SSD 60
 seconds of power, then we could use the consumer  SSD's in servers
 again with real value instead of CYA value.
You know- it would probably be sufficient to provide the SSD with _just_ a big 
capacitor bank. If the host lost power it would stop writing, and if the SSD 
still had power it would probably use the idle time to flush its buffers. Then 
there would be world peace!

Yeah- got a little carried away there. Still this seems like an experiment I'm 
going to have to try on my home server out of curiosity more than anything else 
:)


Re: [zfs-discuss] New SSD options

2010-05-21 Thread Don
 I just spoke with a co-worker about doing something about it.
 
 He says he can design a small in-line UPS that will deliver 20-30
 seconds of 3.3V, 5V, and 12V to the SATA power connector for about $50
 in parts. It would be even less if only one voltage was needed. That
 should be enough for most any SSD to finish any pending writes.
Oh I wasn't kidding when I said I was going to have to try this with my home 
server. I actually do some circuit board design and this would be an amusing 
project. All you probably need is 5v- I'll look into it.


Re: [zfs-discuss] New SSD options

2010-05-21 Thread Don
 The SATA power connector supplies 3.3, 5 and 12v. A complete
 solution will have all three. Most drives use just the 5v, so you can
 probably ignore 3.3v and 12v.
I'm not interested in building something that's going to work for every 
possible drive config- just my config :) Both the Intel X25-e and the OCZ only 
uses the 5V rail.

 You'll need to use a step up DC-DC converter and be able to supply ~
 100mA at 5v.
 It's actually easier/cheaper to use a LiPoly battery  charger and get a
 few minutes of power than to use an ultracap for a few seconds of
 power. Most ultracaps are ~ 2.5v and LiPoly is 3.7v, so you'll need a
 step up converter in either case.
Ultracapacitors are available in voltage ratings beyond 12 volts, so there is no 
reason to use a boost converter with them. That eliminates high-frequency 
switching transients right next to our SSD, which is always helpful.

In this case- we have lots of room. We have a 3.5" x 1" drive bay, but a 2.5" x 
1/4" hard drive. There is ample room for several of the 6.3V ELNA 1F capacitors 
(and our SATA power rail is a regulated 5V rail, so they should suffice)- either 
in series or in parallel (depending on voltage or runtime requirements).
http://www.elna.co.jp/en/capacitor/double_layer/catalog/pdf/dk_e.pdf 

You could put 2 caps in series for better voltage tolerance or in parallel for 
longer runtimes. Either way you probably don't need a charge controller, a 
boost or buck converter, or in fact any ICs at all. It's just a small board 
with some caps on it.

 Cost for a 5v only system should be $30 - $35 in one-off
 prototype-ready components with a 1100mAH battery (using prices from
 Sparkfun.com),
You could literally split a SATA cable and add in some capacitors for just the 
cost of the caps themselves. The issue there is whether the caps would present 
too large a current drain on initial charge-up- if they do, then you need to add 
in charge controllers and you've got the same problems as with a LiPo battery- 
although without the shorter service life.

At the end of the day the real problem is whether we believe the drives 
themselves will actually use the quiet period on the now dead bus to write out 
their caches. This is something we should ask the manufacturers, and test for 
ourselves.


Re: [zfs-discuss] New SSD options

2010-05-20 Thread Don
 So, IMHO, a cheap consumer ssd used as a zil may still be worth it (for
 some use cases) to narrow the window of data loss from ~30 seconds to a
 sub-second value.
There are lots of reasons to enable the ZIL now- I can throw four very 
inexpensive SSD's in there now in a pair of mirrors, and then when a better 
drive comes along I can replace each half of the mirror without bringing 
anything down. My slots are already allocated and it would be nice to save a 
few extra seconds of writes- just in case. It's not a great solution- but 
nothing is. I don't have access to a ZEUS- and even if I did- I wouldn't pay 
that kind of money for what amounts to a Vertex 2 Pro but with SLC flash.

I'm kind of flabbergasted that no one has simply stuck a capacitor on a more 
reasonable drive. I guess the market just isn't big enough- but I find that 
hard to believe.

Right now it seems like the options are all or nothing. There's just no %^$#^ 
middle ground.


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
Well- 40k IOPS is the current claim from ZEUS- and they're the benchmark. They 
used to be 17k IOPS. How real any of these numbers are from any manufacturer is 
a guess.

Given the Intel's refusal to honor a cache flush, and their performance 
problems with the cache disabled- I don't trust them any more than anyone else 
right now.

As for the Vertex drives- if they are within +-10% of the Intel they're still 
doing it for half of what the Intel drive costs- so it's an option- not a great 
option- but still an option.


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
Well the larger size of the Vertex, coupled with their smaller claimed write 
amplification should result in sufficient service life for my needs. Their 
claimed MTBF also matches the Intel X25-E's.


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
 Since it ignores the Cache Flush command and it doesn't have any persistent
 buffer storage, disabling the write cache is the best you can do.

This actually brings up another question I had: What is the risk, beyond a few 
seconds of lost writes, if I lose power, there is no capacitor and the cache is 
not disabled?

My ZFS system is shared storage for a large VMWare based QA farm. If I lose 
power then a few seconds of writes are the least of my concerns. All of the QA 
tests will need to be restarted and all of the file systems will need to be 
checked. A few seconds of writes won't make any difference unless it has the 
potential to affect the integrity of the pool itself.

Considering the performance trade-off, I'd happily give up a few seconds worth 
of writes for significantly improved IOPS.


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
 You can lose all writes from the last committed transaction (i.e., the
 one before the currently open transaction).

And I don't think that bothers me. As long as the array itself doesn't go belly 
up- then a few seconds of lost transactions are largely irrelevant- all of the 
QA virtual machines are going to have to be rolled back to their initial states 
anyway.


Re: [zfs-discuss] New SSD options

2010-05-19 Thread Don
 You can lose all writes from the last committed transaction (i.e., the
 one before the currently open transaction).

I'll pick one- performance :)

Honestly- I wish I had a better grasp on the real world performance of these 
drives. 50k IOPS is nice- and considering the incredible likelihood of data 
duplication in my environment- the SandForce controller seems like a win. That 
said- does anyone have a good set of real world performance numbers for these 
drives that you can link to?


[zfs-discuss] New SSD options

2010-05-18 Thread Don
I'm looking for alternative SSD options to the Intel X25-E and the ZEUS IOPS.

The ZEUS IOPS would probably cost as much as my entire current disk system (80 
15k SAS drives)- and that's just silly.

The Intel is much less expensive, and while fast- pales in comparison to the 
ZEUS.

I've allocated 4 disk slots in my array for ZIL SSD's and I'm trying to find 
the best performance for my dollar.

With that in mind- Is anyone using the new OCZ Vertex 2 SSD's as a ZIL?

http://www.ocztechnology.com/products/solid-state-drives/2-5--sata-ii/performance-enterprise-solid-state-drives/ocz-vertex-2-sata-ii-2-5--ssd.html

They're claiming 50k IOPS (4k Write- Aligned), 2 million hour MTBF, TRIM 
support, etc. That's more write IOPS than the ZEUS (40k IOPS, $) but at 
half the price of an Intel X25-E (3.3k IOPS, $400).

Needless to say I'd love to know if anyone has evaluated these drives to see if 
they make sense as a ZIL- for example- do they honor cache flush requests? Are 
those sustained IOPS numbers?


Re: [zfs-discuss] Oracle to no longer support ZFS on OpenSolaris?

2010-04-20 Thread Don Turnbull
Not to be a conspiracy nut but anyone anywhere could have registered 
that gmail account and supplied that answer.  It would be a lot more 
believable from Mr Kay's Oracle or Sun account.


On 4/20/2010 9:40 AM, Ken Gunderson wrote:

On Tue, 2010-04-20 at 13:57 +0100, Dominic Kay wrote:
   

Oracle has no plan to move from ZFS as the principle storage platform
for Solaris 10 and OpenSolaris. It remains key to both data management
and to the OS infrastructure such as root/boot, install and upgrade.
Thanks

Dominic Kay
Product Manager, Filesystems
Oracle
 

I'll take that as a definitive answer;)

Much appreciated. Thank you.

   



Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
Yes yes- /etc/zfs/zpool.cache - we all hate typos :)


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
I must note that you haven't answered my question...

If the zpool.cache file differs between the two heads for some reason- how do I 
ensure that the second head has an accurate copy without importing the ZFS pool?


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
I'm not certain if I'm misunderstanding you- or if you didn't read my post 
carefully.

Why would the zpool.cache file be current on the _second_ node? The first node 
is where I've added my zpools and so on. The second node isn't going to have an 
updated cache file until I export the zpool from the first system and import it 
to the second system no?

In my case- I believe both nodes have exactly the same view of the disks- all 
the controllers and targets are identical- but there is no reason they have to 
be, as far as I know. As such- simply backing up the primary system's zpool.cache 
to the secondary could cause problems.

I'm simply curious if there is a way for a node to keep its zpool.cache up to 
date without actually importing the zpool, i.e. is there a "scandisks"-style 
command that can scan for a zpool without importing it?
Am I misunderstanding something here?


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
Ok- I think perhaps I'm failing to explain myself.

I want to know if there is a way for a second node- connected to a set of 
shared disks- to keep its zpool.cache up to date _without_ actually importing 
the ZFS pool.

As I understand it- keeping the zpool.cache up to date on the second node would 
provide additional protection should the slog fail at the same time as my primary 
head (it should also improve import times, if what I've read is true).

I understand that importing the disks on the second node will update the cache 
file- but by that time it may be too late. I'd like to update the cache file 
_before_ then. I see no reason why the second node couldn't scan the disks 
being used by the first node and then update its zpool.cache.


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
That section of the man page is actually helpful- as I wasn't sure what I was 
going to do to ensure the nodes didn't try to bring up the zpool on their own- 
outside of clustering software or my own intervention.

That said- it still doesn't explain how I would keep the secondary nodes 
zpool.cache up to date.

If I create a zpool on the first node, import it on the second, then move it 
back to the first, both nodes now have a current zpool.cache. If I then add 
additional disks to the pool on the first node- how do I get the second node's 
cache file current without first importing the disks?


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
Now I'm simply confused.

Do you mean one cachefile shared between the two nodes for this zpool? How, may 
I ask, would this work?

The rpool should be in /etc/zfs/zpool.cache.

The shared pool should be in /etc/cluster/zpool.cache (or wherever you prefer 
to put it) so it won't come up on system start.

What I don't understand is how the second node is either a) supposed to share 
the first node's cachefile or b) create its own without importing the pool.

You say this is the job of the cluster software- does ha-cluster already handle 
this with its ZFS modules?

I've asked this question 5 different ways and I either still haven't gotten an 
answer- or still don't understand the problem.

Is there a way for a passive node to generate its _own_ zpool.cache without 
importing the file system? If so- how? If not- why is this unimportant?
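
For reference, the moving parts being discussed are the pool's cachefile property and the import-time options; a minimal sketch of the pattern, using the path mentioned above:

  # Keep the shared pool out of the default cache so it is never auto-imported
  # at boot on either head.
  zpool set cachefile=/etc/cluster/zpool.cache pool0

  # On failover, the passive head can import via that cachefile, or fall back
  # to a full device scan if the file is stale or missing.
  zpool import -c /etc/cluster/zpool.cache pool0
  zpool import -d /dev/dsk pool0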


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
I apologize- I didn't mean to come across as rude- I'm just not sure if I'm 
asking the right question.

I'm not ready to use the ha-cluster software yet as I haven't finished testing 
it. For now I'm manually failing over from the primary to the backup node. That 
will change- but I'm not ready to go there yet. As such I'm trying to make sure 
both my nodes have a current cache file so that the targets and GUID's are 
ready.


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
I understand that the important bit about having the cachefile is the GUIDs 
(although the disk record is, I believe, helpful in improving import speeds), so 
we can recover in certain oddball cases. As such- I'm still confused why you 
say it's unimportant.

Is it enough to simply copy the /etc/cluster/zpool.cache file from the primary 
node to the secondary so that I at least have the GUIDs, even if the disk 
references (the /dev/dsk sections) might not match?


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
Continuing on the best practices theme- how big should the ZIL slog disk be?

The ZFS evil tuning guide suggests enough space for 10 seconds of my 
synchronous write load- even assuming I could cram 20 gigabits/sec into the 
host (2 x 10 GigE NICs), that only comes out to 200 gigabits, which is 25 gigabytes.

I'm currently planning to use 4 x 32GB SSDs arranged in 2 two-way mirrors, which 
should give me 64GB of log space. Is there any reason to believe that this 
would be insufficient (especially considering I can't begin to imagine being 
able to cram 5 Gb/s into the host- let alone 20)?

Are there any guidelines on how much ZIL performance should increase with 2 SSD 
slogs (4 disks with mirrors) over a single SSD slog (2 disks mirrored).
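
For completeness, the layout being described (two mirrored slog pairs, which ZFS stripes across automatically) would be added roughly like this; the device names are placeholders:

  # Two mirrored log pairs; ZIL writes are spread across both mirrors.
  zpool add pool0 log mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0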


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
 I think the size of the ZIL log is basically irrelevant
That was the understanding I got from reading the various blog posts and tuning 
guide.

 only a single SSD, just due to the fact that you've probably got dozens of 
 disks attached, and you'll probably use multiple log devices striped just for 
 the sake of performance.

I've got 72 (possibly 76) 15k RPM 300GB and 600GB SAS drives and my head has 16 
GB of RAM though that can be increased at any time to 32GB. My current plan is 
to use 4 x 32GB SLC write optimized SSD's in a striped mirrors configuration.

I'm curious if anyone knows how ZIL slog performance scales. For example- how 
much benefit would you expect from 2 SSD slogs over 1? Would there be a 
significant benefit to 3 over 2 or does it begin to taper off? I'm sure a lot 
of this is dependent on the environment- but rough ideas are good to know.

Is it safe to assume that a stripe across two mirrored write optimized SSD's is 
going to give me the best performance for 4 available drive bays (assuming I 
want the ZIL to remain safe)?

 Is it even physically possible to write 4G to any device in less than 10 
 seconds?
I wasn't actually sure the 10 second number was still accurate- that was 
definitely part of my question. If it is- then yes- I could never fill a 32 GB 
ZIL, let alone a 64GB one.

Thanks for all of the help and advice.


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
I always try to plan for the worst case- I just wasn't sure how to arrive at 
the worst case. Thanks for providing the information- and I will definitely 
checkout the dtrace zilstat script.

Considering the smallest SSD I can buy from a manufacturer that I trust seems 
to be 32GB- that's probably going to be my choice.

As for the choice of striping across two mirrored pairs- I want every last IOP 
I can get my hands on- an extra $700 isn't going to make much of a difference 
in a system involving 2 heads, 5 storage shelves, and 76 SAS drives- if I could 
think of something better to spend that money on- I would- but right now- it 
seems like the best option.


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Don
 A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an Intel 
 X-25E (~3.3K IOPS).
Where can you even get the Zeus drives? I thought they were only in the OEM 
market, and the last time I checked they were ludicrously expensive. I'm looking 
for between 5k and 10k IOPS using up to 4 drive bays (so a 2 x 2 striped mirror 
would be fine). Right now we peak at about 3k IOPS (though that's not to a ZFS 
system), but I would like to be able to burst to double that. We do have a lot 
of small bursty writes, hence our ZIL concerns.

 A SRAM or DRAM-based drive (with FLASH backup) will behave
 dramatically differently than a typical SSD.
As long as it can speak SAS or SATA and I can put it in a drive shelf, I'd 
happily consider using it. All the DRAM devices I know of are host-based, and 
that won't help my cluster.

On that note- what write optimized SSD's do you recommend? I don't actually 
know where to buy the Zeus drives even if they've become more reasonably priced.

Thanks for taking the time to share- it's been very informative.


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Don
So if the Intel X25E is a bad device- can anyone recommend an SLC device with 
good firmware? (Or an MLC drive that performs as well?)

I've got 80 spindles in 5 16-bay drive shelves (76 15k RPM SAS drives in 19 
4-disk raidz sets, 2 hot spares, and 2 bays set aside for a mirrored ZIL) 
connected to two servers (so if one fails I can import on the other one). 
Host-based cards are not an option for my ZIL- I need something that sits in the 
array and can be imported by the other system.

I was planning on using a pair of mirrored SLC based Intel X25E's because of 
their superior write performance but if it's going to destroy my pool- then 
it's useless.

Does anyone else have something that can match their write performance without 
breaking ZFS?


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Don
If you have a pair of heads talking to shared disks with ZFS- what can you do 
to ensure the second head always has a current copy of the zpool.cache file? 
I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't 
import the pool on my second head.


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Don
But if the X25E doesn't honor cache flushes then it really doesn't matter if 
they are mirrored- they both may cache the data, not write it out, and leave me 
screwed.

I'm running 2009.06 and not one of the newer developer candidates that handle 
ZIL losses gracefully (or at all- at least as far as I understand things).

As for the optimal performance- a single pair probably won't give me optimal 
performance- but based on all the numbers I've seen it's still going to beat 
using the pool disks. If I find the ZIL is still a bottleneck I'll definitely 
add a second set of SSD's- but I've got a lot of testing to do before I get 
there.


Re: [zfs-discuss] SSD best practices

2010-04-18 Thread Don
I'm not sure what you are referring to when you say "my running BE".

I haven't looked at the zpool.cache file too closely but if the devices don't 
match between the two systems for some reason- isn't that going to cause a 
problem? I was really asking if there is a way to build the cache file without 
importing the disks.


Re: [zfs-discuss] Replacing faulty disk in ZFS pool

2009-08-06 Thread Don Turnbull
I believe there are a couple of ways that work.  The commands I've 
always used are to attach the new disk as a spare (if not already) and 
then replace the failed disk with the spare.  I don't know if there are 
advantages or disadvantages, but I also have never had a problem doing it 
this way.


Andreas Höschler wrote:

Dear managers,

one of our servers (X4240) shows a faulty disk:


-bash-3.00# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME          STATE     READ WRITE CKSUM
rpool         ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t0d0s0  ONLINE       0     0     0
    c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  mirror    ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
  mirror    DEGRADED     0     0     0
    c1t6d0  FAULTED      0    19     0  too many errors
    c1t7d0  ONLINE       0     0     0

errors: No known data errors

I derived the following possible approaches to solve the problem:

1) A way to reestablish redundancy would be to use the command

   zpool attach tank c1t7d0 c1t15d0

to add c1t15d0 to the virtual device c1t6d0 + c1t7d0. We still would
have the faulty disk in the virtual device.

We could then detach the faulty disk with the command

   zpool detach tank c1t6d0

2) Another approach would be to add a spare disk to tank

   zpool add tank spare c1t15d0

and then replace the faulty disk.

   zpool replace tank c1t6d0 c1t15d0

In theory that is easy, but since I have never done that and since this
is a production server I would appreciate it if someone with more
experience would look over my agenda before I issue these commands.

What is the difference between the two approaches? Which one do you
recommend? And is that really all that has to be done or am I missing a
bit? I mean can c1t6d0 be physically replaced after issuing zpool
detach tank c1t6d0 or zpool replace tank c1t6d0 c1t15d0? I also
found the command

   zpool offline tank  ...

but am not sure whether this should be used in my case. Hints are
greatly appreciated!

Thanks a lot,

  Andreas



Re: [zfs-discuss] Replacing faulty disk in ZFS pool

2009-08-06 Thread Don Turnbull
If he adds the spare and then manually forces a replace, it will take 
no more time than any other way.  I do this quite frequently and without 
needing the scrub, which does take quite a lot of time.


cindy.swearin...@sun.com wrote:

Hi Andreas,

Good job for using a mirrored configuration. :-)

Your various approaches would work.

My only comment about #2 is that it might take some time for the spare
to kick in for the faulted disk.

Both 1 and 2 would take a bit more time than just replacing the faulted
disk with a spare disk, like this:

# zpool replace tank c1t6d0 c1t15d0

Then you could physically replace c1t6d0 and add it back to the pool as
a spare, like this:

# zpool add tank spare c1t6d0

For a production system, the steps above might be the most efficient.
Get the faulted disk replaced with a known good disk so the pool is
no longer degraded, then physically replace the bad disk when you have
the time and add it back to the pool as a spare.

It is also good practice to run a zpool scrub to ensure the
replacement is operational and use zpool clear to clear the previous
errors on the pool. If the system is used heavily, then you might want 
to run the zpool scrub when system use is reduced.


If you were going to physically replace c1t6d0 while it was still
attached to the pool, then you might offline it first.

Cindy

On 08/06/09 13:17, Andreas Höschler wrote:
  

Dear managers,

one of our servers (X4240) shows a faulty disk:


-bash-3.00# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAME          STATE     READ WRITE CKSUM
rpool         ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c1t0d0s0  ONLINE       0     0     0
    c1t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  mirror    ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c1t5d0  ONLINE       0     0     0
    c1t4d0  ONLINE       0     0     0
  mirror    DEGRADED     0     0     0
    c1t6d0  FAULTED      0    19     0  too many errors
    c1t7d0  ONLINE       0     0     0

errors: No known data errors

I derived the following possible approaches to solve the problem:

1) A way to reestablish redundancy would be to use the command

   zpool attach tank c1t7d0 c1t15d0

to add c1t15d0 to the virtual device c1t6d0 + c1t7d0. We still would
have the faulty disk in the virtual device.

We could then detach the faulty disk with the command

   zpool detach tank c1t6d0

2) Another approach would be to add a spare disk to tank

   zpool add tank spare c1t15d0

and then replace the faulty disk.

   zpool replace tank c1t6d0 c1t15d0

In theory that is easy, but since I have never done that and since this
is a production server I would appreciate it if someone with more
experience would look over my agenda before I issue these commands.

What is the difference between the two approaches? Which one do you
recommend? And is that really all that has to be done or am I missing a
bit? I mean can c1t6d0 be physically replaced after issuing zpool
detach tank c1t6d0 or zpool replace tank c1t6d0 c1t15d0? I also
found the command

   zpool offline tank  ...

but am not sure whether this should be used in my case. Hints are
greatly appreciated!

Thanks a lot,

  Andreas



Re: [zfs-discuss] Fed up with ZFS causing data loss

2009-08-03 Thread Don Turnbull
This may have been mentioned elsewhere and, if so, I apologize for 
repeating. 

Is it possible your difficulty here is with the Marvell driver and not, 
strictly speaking, ZFS?  The Solaris Marvell driver has had many, MANY 
bug fixes and continues to this day to be supported by IDR patches and 
other quick-fix work-arounds.  It is the source of many problems.  
Granted, ZFS handles these poorly at times (it got a lot better with ZFS 
v10), but it is difficult to expect the file system to deal well with 
underlying instability in the hardware driver, I think.


I'd be interested to hear if your experiences are the same using the LSI 
controllers which have a much better driver in Solaris.


Ross wrote:

Supermicro AOC-SAT2-MV8, based on the Marvell chipset.  I figured it was the 
best available at the time since it's using the same chipset as the x4500 
Thumper servers.

Our next machine will be using LSI controllers, but I'm still not entirely 
happy with the way ZFS handles timeout type errors.  It seems that it handles 
drive reported read or write errors fine, and also handles checksum errors, but 
it's completely missed drive timeout errors as used by hardware raid 
controllers.

Personally, I feel that when a pool usually responds to requests in the order 
of milliseconds, a timeout of even a tenth of a second is too long.  Several 
minutes before a pool responds is just a joke.

I'm still a big fan of ZFS, and modern hardware may have better error handling, 
but I can't help but feel this is a little short sighted.
  




[zfs-discuss] Losts of small files vs fewer big files

2009-07-07 Thread Don Turnbull
I work with Greenplum, which is essentially a number of Postgres database 
instances clustered together.  Being Postgres, the data is held in a lot 
of individual files, which can each be fairly big (hundreds of MB or 
several GB) or very small (50MB or less).  We've noticed a performance 
difference when our database files are many and small versus few and large.


To test this outside the database, we built a zpool using RAID-10 (it 
works for RAID-Z too) and filled it with 800 5MB files.  Then we used 4 
concurrent dd processes to read 1/4 of the files each.  This required 
123 seconds.


Then we destroyed the pool, recreated it, and filled it with 20 files 
of 200MB each and 780 files of 0 bytes (same number of files, same total 
space consumed).  The same dd reads took 15 seconds.
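
For anyone wanting to reproduce the comparison, the test as described boils down to something like the following Bourne-shell sketch (file counts and sizes from the post, paths hypothetical); the 20 x 200MB case is the same loop with count=200 and 20 files:

  # Populate the test filesystem with 800 x 5MB files.
  i=1
  while [ $i -le 800 ]; do
          dd if=/dev/zero of=/testpool/data/file$i bs=1024k count=5 2>/dev/null
          i=`expr $i + 1`
  done

  # reader.sh first last - read files first..last sequentially; launch four
  # copies in the background (1 200, 201 400, 401 600, 601 800) under time(1).
  i=$1
  while [ $i -le $2 ]; do
          dd if=/testpool/data/file$i of=/dev/null bs=1024k 2>/dev/null
          i=`expr $i + 1`
  done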


Any idea why this is?  Various configurations of our product can divide 
data in the databases into an enormous number of small files.  Varying 
the ARC cache size limit did not have any effect.  Are there other 
tunables available to Solaris 10 U7 (not OpenSolaris) that might affect 
this behavior?


Thanks!
   -dt


Re: [zfs-discuss] Losts of small files vs fewer big files

2009-07-07 Thread Don Turnbull

Thanks for the suggestion!

We've fiddled with this in the past.  Our app uses 32k blocks instead of 
8k, and it is data warehousing, so the I/O model is generally a lot more 
long sequential reads.  Changing the blocksize has very little 
effect on us.  I'll have to look at fsync; I hadn't considered that.  
Compression is a killer; it costs us up to 50% of the performance, 
sadly.  CPU is not always a problem for us, but it can be depending on 
the query workload and the servers involved.
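
For the archive, the knob being discussed is set per filesystem and only affects files written after the change, something like (filesystem name hypothetical):

  # Match the record size to the application's 32k block size; existing files
  # keep their old record size until they are rewritten.
  zfs set recordsize=32k gpdata/seg1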


Bryan Allen wrote:

Have you set the recordsize for the filesystem to the blocksize Postgres is
using (8K)? Note this has to be done before any files are created.

Other thoughts: Disable postgres's fsync, enable filesystem compression if disk
I/O is your bottleneck as opposed to CPU. I do this with MySQL and it has
proven useful. My rule of thumb there is 60% for InnoDB cache, 40% for ZFS ARC,
but YMMV.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
  



[zfs-discuss] Large zpool design considerations

2008-07-03 Thread Don Enrique
Hi,

I am looking for some best practice advice on a project that I am working on.

We are looking at migrating ~40TB of backup data to ZFS, with an annual data 
growth of 20-25%.

Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs 
(7 + 2), with one hot spare per 10 drives, and just continue to expand that 
pool as needed.
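
As a concrete sketch of that layout (device names are placeholders), the initial pool and each later expansion step would look roughly like this:

  # Initial pool: one 9-disk RAIDZ-2 (7 data + 2 parity) plus a hot spare.
  zpool create backup raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      c1t6d0 c1t7d0 c1t8d0 spare c1t9d0

  # Growth: add another 7+2 vdev (and another spare) to the same pool.
  zpool add backup raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
      c2t6d0 c2t7d0 c2t8d0 spare c2t9d0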

Between calculating the MTTDL and performance models, I was hit by a rather 
scary thought.

A pool comprised of X vdevs is no more resilient to data loss than the weakest 
vdev since loss
of a vdev would render the entire pool unusable.

This means that I could potentially lose 40TB+ of data if three disks within 
the same RAIDZ-2 vdev should die before the resilvering of at least one disk is 
complete. Since most disks will be filled, I do expect rather long resilvering 
times.

We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project, with 
as much hardware redundancy as we can get (multiple controllers, dual cabling, 
I/O multipathing, redundant PSUs, etc.).

I could use multiple pools, but that would make data management harder, which 
in itself is a lengthy process in our shop.

The MTTDL figures seem OK, so how much do I need to worry? Does anyone have 
experience with this kind of setup?

/Don E.
 
 


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Don Enrique
 Don Enrique wrote:
  Now, my initial plan was to create one large pool comprised of X RAIDZ-2
  vdevs ( 7 + 2 ) with one hotspare per 10 drives and just continue to expand
  that pool as needed.

  Between calculating the MTTDL and performance models i was hit by a rather
  scary thought.

  A pool comprised of X vdevs is no more resilient to data loss than the
  weakest vdev since loss of a vdev would render the entire pool unusable.

  This means that i potentially could loose 40TB+ of data if three disks
  within the same RAIDZ-2 vdev should die before the resilvering of at least
  one disk is complete. Since most disks will be filled i do expect rather
  long resilvering times.

 Why are you planning on using RAIDZ-2 rather than mirroring ?

Mirroring would increase the cost significantly and is not within the budget of 
this project. 

 -- 
 Darren J Moffat