Re: [zfs-discuss] about write balancing

2011-07-01 Thread Tuomas Leikola
Sorry everyone, this one was indeed a case of root stupidity. I had
forgotten to upgrade to OI 148, which apparently fixed the write balancer.
Duh. (didn't find full changelog from google tho.)
On Jun 30, 2011 3:12 PM, "Tuomas Leikola"  wrote:
> Thanks for the input. This was not a case of a degraded vdev, only a
> missing log device (which I cannot get rid of..). I'll try offlining some
> vdevs and see what happens - although this should be automatic at all
> times IMO.
> On Jun 30, 2011 1:25 PM, "Markus Kovero"  wrote:
>>
>>
>>> To me it seems that writes are not directed properly to the devices that
>>> have most free space - almost exactly the opposite. The writes seem to go
>>> to the devices that have _least_ free space, instead of the devices that
>>> have most free space. The same effect that can be seen in these 60s
>>> averages can also be observed in a shorter timespan, like a second or so.
>>
>>> Is there something obvious I'm missing?
>>
>>
>> Not sure how OI should behave; I've managed to even out writes & space
>> usage between vdevs by bringing a device offline in the vdev you don't
>> want writes to end up on.
>> If you have a degraded vdev in your pool, zfs will try not to write there,
>> and this may be the case here as well, as I don't see zpool status output.
>>
>> Yours
>> Markus Kovero
>>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send not working when i/o errors in pool

2011-07-01 Thread Tuomas Leikola
Rsync with some ignore-errors option, maybe? In any case you've lost some
data, so make sure to keep a record of 'zpool status -v' output.
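
An untested sketch of what I mean (pool name taken from your error message,
the destination is made up):

  # keep a record of what ZFS already knows is damaged
  zpool status -v pent > /root/pent-damage.txt

  # copy whatever is still readable; rsync reports the files it cannot
  # read and keeps going, finishing with a non-zero exit code
  rsync -av /pent/ newhost:/tank/pent-rescue/

You can then compare rsync's error list against the zpool status output.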
On Jul 1, 2011 12:26 AM, "Tom Demo"  wrote:
> Hi there.
>
> I am trying to get my filesystems off a pool that suffered irreparable
damage due to 2 disks partially failing in a 5 disk raidz.
>
> One of the filesystems has an io error when trying to read one of the
files off it.
>
> This filesystem cannot be sent - zfs send stops with this error:
>
> "warning: cannot send 'pent@wdFailuresAndSol11Migrate': I/O error"
>
> I have tried using "zfs set checksum=off" but that doesn't change
anything.
>
> Any tips how I can get these filesystems over to the new machine please ?
>
> Thanks,
>
> Tom.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about write balancing

2011-06-30 Thread Tuomas Leikola
Thanks for the input. This was not a case of a degraded vdev, only a
missing log device (which I cannot get rid of..). I'll try offlining some
vdevs and see what happens - although this should be automatic at all times
IMO.
On Jun 30, 2011 1:25 PM, "Markus Kovero"  wrote:
>
>
>> To me it seems that writes are not directed properly to the devices that
>> have most free space - almost exactly the opposite. The writes seem to go to
>> the devices that have _least_ free space, instead of the devices that have
>> most free space. The same effect that can be seen in these 60s averages can
>> also be observed in a shorter timespan, like a second or so.
>
>> Is there something obvious I'm missing?
>
>
> Not sure how OI should behave; I've managed to even out writes & space usage
> between vdevs by bringing a device offline in the vdev you don't want writes
> to end up on.
> If you have a degraded vdev in your pool, zfs will try not to write there,
> and this may be the case here as well, as I don't see zpool status output.
>
> Yours
> Markus Kovero
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] about write balancing

2011-06-29 Thread Tuomas Leikola
Hi!

I've been monitoring my arrays lately, and to me it seems like the zfs
allocator might be misfiring a bit. This is all on OI 147, and if there
is a problem and a fix, I'd like to see it in the next image-update =D

Here's some 60s iostat cleaned up a bit:

               capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
tank         3.76T   742G  1.03K  1.02K  3.96M  4.26M
  raidz1     2.77T   148G    718    746  2.73M  3.03M
  raidz1     1023G   593G    333    300  1.23M  1.22M

and another pool:

               capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
fast          130G   169G  2.72K    110  11.0M   614K
  mirror     37.9G  61.6G  1.35K     29  5.48M   154K
  mirror     27.6G  71.9G  1.36K     37  5.51M   203K
  mirror     64.1G  35.9G      3     42  21.8K   240K

To me it seems that writes are not directed properly to the devices
that have most free space - almost exactly the opposite: the writes
seem to go to the devices that have the _least_ free space. The same
effect that can be seen in these 60s averages can also be observed
over a shorter timespan, like a second or so.

Is there something obvious I'm missing?


-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA disk perf question

2011-06-01 Thread Tuomas Leikola
> I have a resilver running and am
> seeing about 700-800 writes/sec. on the hot spare as it resilvers.

IIRC resilver works in block birth order (write order), which is
commonly more or less sequential unless the fs is fragmented. So it
might or might not be sequential in your case. I don't think you can
get that kind of performance from a fully random load - more like 100
IOPS or so per disk.

-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test

2011-04-17 Thread Tuomas Leikola
It's been quiet, it seems.

On Fri, Apr 15, 2011 at 5:09 PM, Jerry Kemp  wrote:
> I have not seen any email from this list in a couple of days.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>



-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs over iscsi not recovering from timeouts

2011-04-17 Thread Tuomas Leikola
Hei,

I'm crossposting this to zfs as i'm not sure which bit is to blame here.

I've been having this issue that i cannot really fix myself:

I have an OI 148 server, which hosts a lot of disks on SATA
controllers. Now it's full and needs some data-moving work done, so
I've acquired another box which runs Linux and has several SATA
enclosures. I'm using the Solaris iSCSI initiator with a static
configuration to connect the device.

Normally, when everything is fine, there are no problems. I can even
restart the iet daemon and there's just a short hiccup in the I/O stream.

Things go bad when I turn the iSCSI target off for a longer period
(reboot, etc). The Solaris initiator times out and reports these
timeouts as errors to ZFS. ZFS increases its error counts (and maybe
loses writes) and eventually marks all devices as failed, and the pool
halts (failmode=wait).

Once in this state, I have had no luck returning the pool to a running
state. The failed condition doesn't clear itself after the target
becomes reachable again. I've tried 'zpool clear', but it still reports
data errors and faulted devices, and 'zpool export' hangs.

How I see this problem:
a) the iSCSI initiator reports timeouts as permanent errors
b) ZFS treats them as such
c) there is no "never" timeout to choose, as far as I can see

What I would like is a mode equivalent to an NFS hard mount - wait
forever for the device to become available again (with the ability to
kick the array from the command line if it is really dead).
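
For reference, this is what I've been poking at so far (pool name is mine,
commands are from the zpool man page):

  # failmode controls what happens when the pool loses its devices;
  # 'wait' blocks I/O (my current setting), 'continue' returns EIO instead
  zpool get failmode tank
  zpool set failmode=wait tank

  # once the target is reachable again, this is what should bring the
  # pool back - but for me it still reports faulted devices
  zpool clear tank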

Any clues?



-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What drives?

2011-02-25 Thread Tuomas Leikola
I'd pick Samsung and use the savings for additional redundancy. YMMV.
On Feb 25, 2011 8:46 AM, "Markus Kovero"  wrote:
>> So, does anyone know which drives to choose for the next setup? Hitachis
>> look good so far, perhaps also seagates, but right now, I'm dubious about
>> the blacks.
>
> Hi! I'd go for WD RE edition. Blacks and Greens are for desktop use and
therefore lack proper TLER settings and have useless power saving features
that could induce errors and mysterious slowness.
>
> Yours
> Markus Kovero
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz recovery

2010-12-18 Thread Tuomas Leikola
On Wed, Dec 15, 2010 at 3:29 PM, Gareth de Vaux  wrote:
> On Mon 2010-12-13 (16:41), Marion Hakanson wrote:
>> After you "clear" the errors, do another "scrub" before trying anything
>> else.  Once you get a complete scrub with no new errors (and no checksum
>> errors), you should be confident that the damaged drive has been fully
>> re-integrated into the pool.
>
> Ok I did a scrub after zero'ing, and the array came back clean, apparently, 
> but
> same final result - the array faults as soon as I 'offline' a different vdev.
> The zero'ing is just a pretend-the-errors-aren't-there directive, and the 
> scrub
> seems to be listening to that. What I need in this situation is a way to
> prompt ad6 to resilver from scratch.
>

I think scrub only repairs blocks referenced by the active datasets -
it doesn't rewrite things like the drive labels or other on-disk
structures outside them.

Have you tried zpool replace? E.g. remove ad6, fill it with zeroes,
reinsert it, and run "zpool replace tank ad6". That should simulate a
drive failure and replacement with a new disk.
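
Untested, but roughly like this (device name from your mail, dd syntax is
FreeBSD-style):

  zpool offline tank ad6              # take the suspect disk out of the pool
  dd if=/dev/zero of=/dev/ad6 bs=1m   # wipe it so no stale labels remain
  zpool replace tank ad6              # treat it as a brand-new disk
  zpool status -v tank                # watch the full resilver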

-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Newbie question : snapshots, replication and recovering failure of Site B

2010-10-27 Thread Tuomas Leikola
On Tue, Oct 26, 2010 at 5:21 PM, Matthieu Fecteau
 wrote:
> My question : in the event that there's no more common snapshot between Site 
> A and Site B, how can we replicate again ? (example : Site B has a power 
> failure and then Site A cleanup his snapshots before Site B is brought back, 
> so that there's no more common snapshots between the sites).

In that event you cannot send incrementals; you need to transfer
everything again. It would be advisable not to delete snapshots during
a backup outage.
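
Concretely, something like this (dataset and host names made up):

  # no common snapshot left: re-seed Site B with a full stream
  zfs send tank/data@now | ssh siteB zfs receive -F backup/data

  # once both sides share @now again, incrementals work as usual
  zfs send -i tank/data@now tank/data@later | ssh siteB zfs receive backup/data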
-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing space nearly full zpool

2010-10-26 Thread Tuomas Leikola
On Mon, Oct 25, 2010 at 4:57 PM, Cuyler Dingwell  wrote:
> It's not just this directory in the example - it's any directory or file.  
> The system was running fine up until it hit 96%.  Also, a full scrub of the 
> file system was done (took nearly two days).
> --

I'm just stabbing in the dark here, but are you certain
/tank/directory_to_clear is not a separate dataset, visible with
'zfs list -t filesystem'?
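
i.e. something like (pool name from your mail):

  zfs list -r -t filesystem tank
  zfs list -r -t snapshot tank

A separate dataset - or snapshots holding on to deleted data - would
explain space that refuses to be freed.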

-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Balancing LVOL fill?

2010-10-21 Thread Tuomas Leikola
On Thu, Oct 21, 2010 at 12:06 AM, Peter Jeremy
 wrote:
> On 2010-Oct-21 01:28:46 +0800, David Dyer-Bennet  wrote:
>>On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:
>>
>>> I wished for a more aggressive write balancer but that may be too much
>>> to ask for.
>>
>>I don't think it can be too much to ask for.  Storage servers have long
>>enough lives that adding disks to them is a routine operation; to the
>>extent that that's a problem, that really needs to be fixed.
>
> It will (should) arrive as part of the mythical block pointer rewrite project.
>

Actually BP rewrite would be needed for rebalancing data after the
fact, whereas I was referring to write balancing, which tries to
mitigate the problem before it occurs.

I was thinking of having a tunable like
"writebalance=conservative|aggressive", where conservative would be the
current mode and aggressive would aim for all devices to reach 90% full
at exactly the same time, avoiding writes to devices over 90%
altogether. The 90% limit is of course arbitrary, but it commonly seems
to be a tipping point.

The downside of aggressive balancing would of course be lower write
bandwidth, and since the data written would not be striped across all
vdevs, subsequent reads might suffer as well. The impact would depend
heavily on the usage pattern, obviously, but I expect most use cases
would not suffer much from this, and it is arguable whether somewhat
reduced bandwidth now is worse than a serious write slowdown later down
the road - the difference seems to be orders of magnitude, anyway.
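
From the admin side it could look as simple as this (a purely hypothetical
property - nothing like it exists today):

  zpool set writebalance=aggressive tank    # hypothetical: aim all vdevs at ~90% full at the same time
  zpool set writebalance=conservative tank  # hypothetical: today's bandwidth-oriented behaviour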

-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Balancing LVOL fill?

2010-10-20 Thread Tuomas Leikola
On Wed, Oct 20, 2010 at 5:00 PM, Richard Elling
 wrote:
>>> Now, is there a way, manually or automatically, to somehow balance the data 
>>> across these LVOLs? My first guess is that doing this _automatically_ will 
>>> require block pointer rewrite, but then, is there way to hack this thing by 
>>> hand?
>>
>>
>> I described a similar issue in
>> http://opensolaris.org/jive/thread.jspa?threadID=134581&tstart=30. My
>> solution was to copy some datasets over to a new directory, delete the
>> old ones and destroy any snapshots that retain them. Data is read from
>> the old device and written on all, causing large chunks of space to be
>> freed on the old device.
>>
>> I wished for a more aggressive write balancer but that may be too much
>> to ask for.
>
> This can, of course, be tuned.  Would you be interested in characterizing the
> benefits and costs of a variety of such tunings?

If you're asking whether I'd be willing to test and document my
findings with such tunables, then yes, I'm interested, though this is a
home file server so it's not exactly a laboratory environment. I also
think I can scrape together enough spare parts to do synthetic tests
(maybe in a VM environment).

I was not aware of such tunables, though it appeared there might be
some emergency mode when a vdev has only a few percent space left.

My server is currently running OI_147 but I haven't yet upgraded the
pool so it's still version 14. I also have 111b and 134 boot
environments standing by.

-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Optimal raidz3 configuration

2010-10-20 Thread Tuomas Leikola
On Wed, Oct 20, 2010 at 4:05 PM, Edward Ned Harvey  wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3,
>> and
>> they're trying to resilver at the same time.  Does the system ignore
>> subsequently failed disks and concentrate on restoring a single disk
>> quickly?  Or does the system try to resilver them all simultaneously
>> and
>> therefore double or triple the time before any one disk is fully
>> resilvered?
>
> This is a legitimate question.  If anyone knows, I'd like to know...
>

My recent experience with os_111b, os_134 and oi_147 was that a
subsequent failure and disk replacement causes the resilver to restart
from the beginning, including the new disks in the later pass. If the
disk is not replaced, the resilver runs to completion (and a replace
can then be performed, triggering a new resilver).

This however is an issue that is being developed further so changes
may be coming.

-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vdev failure -> pool loss ?

2010-10-20 Thread Tuomas Leikola
On Wed, Oct 20, 2010 at 3:50 AM, Bob Friesenhahn
 wrote:
> On Tue, 19 Oct 2010, Cindy Swearingen wrote:
>>>
>>> unless you use copies=2 or 3, in which case your data is still safe
>>> for those datasets that have this option set.
>>
>> This advice is a little too optimistic. Increasing the copies property
>> value on datasets might help in some failure scenarios, but probably not
>> in more catastrophic failures, such as multiple device or hardware
>> failures.
>
> It is 100% too optimistic.  The copies option only duplicates the user data.
>  While zfs already duplicates the metadata (regardless of copies setting),
> it is not designed to function if a vdev fails.

Sorry about that. I thought it was already supported with some --force
option. My bad.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Balancing LVOL fill?

2010-10-20 Thread Tuomas Leikola
On Tue, Oct 19, 2010 at 7:13 PM, Roy Sigurd Karlsbakk  
wrote:
> I have this server with some 50TB disk space. It originally had 30TB on WD 
> Greens, was filled quite full, and another storage chassis was added. Now, 
> space problem gone, fine, but what about speed? Three of the VDEVs are quite 
> full, as indicated below. VDEV #3 (the one with the spare active) just spent 
> some 72 hours resilvering a 2TB drive. Now, those green drives suck quite 
> hard, but not _that_ hard. I'm guessing the reason for this slowdown is the 
> fill of those three first VDEVs.
>
> Now, is there a way, manually or automatically, to somehow balance the data 
> across these LVOLs? My first guess is that doing this _automatically_ will 
> require block pointer rewrite, but then, is there way to hack this thing by 
> hand?


I described a similar issue in
http://opensolaris.org/jive/thread.jspa?threadID=134581&tstart=30. My
solution was to copy some datasets over to a new directory, delete the
old ones and destroy any snapshots that retain them. Data is read from
the old device and written on all, causing large chunks of space to be
freed on the old device.

I wished for a more aggressive write balancer but that may be too much
to ask for.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-19 Thread Tuomas Leikola
On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey  wrote:
> Thank you, but, the original question was whether a scrub would identify
> just corrupt blocks, or if it would be able to map corrupt blocks to a list
> of corrupt files.
>

Just in case this wasn't already clear.

After a scrub sees read or checksum errors, 'zpool status -v' will
list the affected filenames. At least in my experience.
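
i.e. (pool name made up):

  zpool scrub tank
  zpool status -v tank   # after the scrub, affected files are listed under "errors:"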
-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vdev failure -> pool loss ?

2010-10-19 Thread Tuomas Leikola
On Mon, Oct 18, 2010 at 8:18 PM, Simon Breden  wrote:
> So are we all agreed then, that a vdev failure will cause pool loss ?
> --

unless you use copies=2 or 3, in which case your data is still safe
for those datasets that have this option set.
-- 
- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Tuomas Leikola
On Tue, Oct 12, 2010 at 9:39 AM, Stephan Budach  wrote:
> You are implying that the issues resulted from the H/W raid(s) and I don't 
> think that this is appropriate.
>

Not exactly. Because the RAID is managed in hardware, and not by ZFS,
ZFS cannot fix these errors when it encounters them.

> I configured a striped pool using two raids - this is exactly the same as 
> using two single hard drives without mirroring them. I simply cannot see what 
> zfs would be able to do in case of a block corruption in that matter.

It cannot, exactly.

> You are not stating that a single hard drive is more reliable than a HW raid 
> box, are you? Actually my pool has no mirror capabilities at all, unless I am 
> seriously mistaken.

No, but ZFS-managed RAID is more reliable than hardware RAID.

> What scrub has found out is that none of the blocks had any issue, but the 
> filesystem was not "clean" either, so if scrub does it's job right and 
> doesn't report any errors, the error must have occurred somewhere else up the 
> stack, way before the checksum had been calculated.

If the case is, as speculated, that one half of the (hardware) mirror
has bad data and the other has good data, then a scrub or any other
read has a 50% chance of seeing the corruption. Scrub does verify
checksums.

Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] free space fragmentation causing slow write speeds

2010-10-11 Thread Tuomas Leikola
Hello everybody.

I am experiencing terribly slow writes on my home server. This is from
zpool iostat:

               capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
tank         3.69T   812G    148    255   755K  1.61M
  raidz1     2.72T   192G     86    112   554K   654K
  raidz1      995G   621G     61    143   201K   962K

The case is that one vdev is almost full, while the other one has
plenty of space. I remember that at least at one point it was known
that writes slow down when the fs becomes full, due to CPU time spent
looking for free space.

I can see that "zpool-tank" is using half a CPU (on a quad Opteron
setup) the whole time this write load is running. That seems weird, as
I'd expect almost a full CPU to be consumed if the CPU were the
bottleneck. The disks aren't the bottleneck according to iostat -xcn,
and there are multiple writers writing large files, so I would not
expect a bottleneck there either.

History of the pool: I added a second vdev when the first vdev was
about 70% full. There are large files as well as small files, virtual
machines and a heavily loaded database, probably making all the free
space very fragmented. After adding the second vdev, writes didn't seem
to be biased towards the new device enough; the older one filled up
anyway, and now speed has slowed to a crawl.

If I delete old snapshots or some data, write speeds bump back up to a
healthier 70MB/s, but after a while the problem comes back. I know I
could rewrite most of the old data to move half of it to the new
device, but that seems a rather inelegant solution to the problem.

I had a look with zdb, and there are many metaslabs that have several
hundred megabytes of free space, the best ones almost a gigabyte (out
of 4 gigabytes) - in other words, something like 75-90% full. Is that
too heavy for the allocator? Maybe the space map could be rewritten
into a more optimal structure when a metaslab is opened for writing. Or
maybe that is exactly what causes the high CPU usage, I don't know. And
there are still perfectly empty metaslabs on the other device..
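
For anyone who wants to repeat this, I was looking at the metaslab space
maps with something like the following (pool name is mine; output format
varies by build):

  zdb -m tank     # per-vdev metaslab summary with free space
  zdb -mm tank    # additionally dumps the space map segments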

Last time this occurred I devised some synthetic tests to recreate the
condition repeatedly, and noticed that at some point zfs appeared to
stop allocating space on the fuller device, except for ditto metadata
blocks. This time around that doesn't seem to happen; maybe the trigger
is less obvious than a simple percentage of free space. Such an
'emergency bias' seems simple enough to add, IIRC, in the code that
chooses which vdev to allocate from - aside from the triggering
condition maybe being complicated to set accurately.

Is there such a trigger and can it be adjusted to occur earlier? Any
other remedies?

Is there a way to confirm that finding free space is indeed the cause
for slow writes, or whether there is possibly another reason?

I wonder if the write balancing code should bias more aggressively.
This condition should be expected if say, one has a system 80% full
and adds another rack of disks, and does not touch existing data.
Having speed slow to a crawl a month later is a bit unexpected.

Thanks,
Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver endlessly restarting at completion

2010-10-05 Thread Tuomas Leikola
This seems to have been a false alarm, sorry for that. As soon as I started
paying attention (logging zpool status, peeking around with zdb & mdb) the
resilver didn't restart unless provoked. A cleartext log would have been
nice ("restarted due to c11t7 becoming online").

A slight problem I can see is that the resilver always restarts when a
device is added to the array. In my case devices were absent for a short
period (some SATA failure that I corrected by running cfgadm -c
disconnect & connect), and it would have been beneficial to let the
resilver run to completion and only restart it after that, to resilver the
missing data on the re-added device. ZFS does have some intelligence in
these cases, in that not all data is resilvered - only blocks born after
the outage.
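
For the record, the reinsertion dance was roughly this (the attachment
point is made up - check yours with plain cfgadm first):

  cfgadm | grep sata            # find the attachment point of the dropped disk
  cfgadm -c disconnect sata1/3
  cfgadm -c connect sata1/3
  cfgadm -c configure sata1/3   # bring the disk back under the OS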

Also, as I had a spare in the array, it kicked in, which probably was not
what I wanted, as that triggered a full resilver rather than a partial
one. After the fact I could not kick the spare out, and could not make the
resilvering process forget about doing a full resilver. Plus, now I have
to replace it back out and return it to being a cold spare.

But all's well that ends well.. mostly. Devices still seem to be dropping
off the SATA bus randomly. Maybe I'll put together a report and post it to
storage-discuss.

On Wed, Sep 29, 2010 at 8:13 PM, Tuomas Leikola wrote:

> The endless resilver problem still persists on OI b147. Restarts when it
> should complete.
>
> I see no other solution than to copy the data to safety and recreate the
> array. Any hints would be appreciated as that takes days unless i can stop
> or pause the resilvering.
>
>
> On Mon, Sep 27, 2010 at 1:13 PM, Tuomas Leikola 
> wrote:
>
>> Hi!
>>
>> My home server had some disk outages due to flaky cabling and whatnot, and
>> started resilvering to a spare disk. During this another disk or two
>> dropped, and were reinserted into the array. So no devices were actually
>> lost, they just were intermittently away for a while each.
>>
>> The situation is currently as follows:
>>   pool: tank
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>> attempt was made to correct the error.  Applications are
>> unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>> using 'zpool clear' or replace the device with 'zpool replace'.
>>see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go
>> config:
>>
>> NAME   STATE READ WRITE CKSUM
>> tank   ONLINE   0 0 0
>>   raidz1-0 ONLINE   0 0 0
>> c11t1d0p0  ONLINE   0 0 0
>> c11t2d0ONLINE   0 0 5
>> c11t6d0p0  ONLINE   0 0 0
>> spare-3ONLINE   0 0 0
>>   c11t3d0p0ONLINE   0 0 0  106M
>> resilvered
>>   c9d1 ONLINE   0 0 0  104G
>> resilvered
>> c11t4d0p0  ONLINE   0 0 0
>> c11t0d0p0  ONLINE   0 0 0
>> c11t5d0p0  ONLINE   0 0 0
>> c11t7d0p0  ONLINE   0 0 0  93.6G
>> resilvered
>>   raidz1-2 ONLINE   0 0 0
>> c6t2d0 ONLINE   0 0 0
>> c6t3d0 ONLINE   0 0 0
>> c6t4d0 ONLINE   0 0 0  2.50K
>> resilvered
>> c6t5d0 ONLINE   0 0 0
>> c6t6d0 ONLINE   0 0 0
>> c6t7d0 ONLINE   0 0 0
>> c6t1d0 ONLINE   0 0 1
>> logs
>>   /dev/zvol/dsk/rpool/log  ONLINE   0 0 0
>> cache
>>   c6t0d0p0 ONLINE   0 0 0
>> spares
>>   c9d1 INUSE currently in use
>>
>> errors: No known data errors
>>
>> And this has been going on for a week now, always restarting when it
>> should complete.
>>
>> The questions in my mind atm:
>>
>> 1. How can i determine the cause for each resilver? Is there a log?
>>
>> 2. Why does it resilver the same data over and over, and not just the
>> changed 

Re: [zfs-discuss] Resliver making the system unresponsive

2010-09-30 Thread Tuomas Leikola
On Thu, Sep 30, 2010 at 1:16 AM, Scott Meilicke <
scott.meili...@craneaerospace.com> wrote:

> Resliver speed has been beaten to death I know, but is there a way to avoid
> this? For example, is more enterprisy hardware less susceptible to
> reslivers? This box is used for development VMs, but there is no way I would
> consider this for production with this kind of performance hit during a
> resliver.
>
>
According to

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473

resilver should, in later builds, have some option to limit rebuild speed
in order to allow more I/O during reconstruction, but I haven't found any
guides on how to actually make use of this feature. Maybe someone can shed
some light on this?
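
The only knobs I'm aware of are undocumented kernel tunables rather than a
supported interface, so treat this as an experiment and check first that
the symbols exist on your build (names as I remember them from the newer
scrub/resilver code):

  # inspect and change on a live system (values are only examples)
  echo "zfs_resilver_delay/D" | mdb -k
  echo "zfs_resilver_delay/W0t4" | mdb -kw           # extra delay per resilver I/O when the pool is busy, I believe
  echo "zfs_resilver_min_time_ms/W0t1000" | mdb -kw  # minimum time spent resilvering per txg

  # or persistently in /etc/system:
  set zfs:zfs_resilver_delay = 4
  set zfs:zfs_resilver_min_time_ms = 1000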
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unusual Resilver Result

2010-09-30 Thread Tuomas Leikola
On Thu, Sep 30, 2010 at 9:08 AM, Jason J. W. Williams <
jasonjwwilli...@gmail.com> wrote:

>
> Should I be worried about these checksum errors?
>
>
Maybe. Your disks, cabling or disk controller probably has some issue that
caused them - or maybe sunspots are to blame.

Run a scrub often and monitor if there are more, and if there is a pattern
to them. Have backups. Maybe switch hardware one by one to see if that
helps.


> What caused the small resilverings on c8t5d0 and c11t5d0 which were not
> replaced or otherwise touched?
>
>
It was the checksum errors. ZFS automatically read the good data from the
other mirror side and rewrote the broken blocks with correct data. If you
run zpool clear and zpool scrub, you will notice these checksum errors have
vanished. If they were caused by botched writes, no new errors should
appear; if they were botched reads, you may see new ones appearing :(

So, not critical yet but something to keep an eye on.
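
i.e. (pool name made up):

  zpool clear tank       # reset the error counters
  zpool scrub tank
  zpool status -v tank   # any *new* checksum errors here point at a live problem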

Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver endlessly restarting at completion

2010-09-29 Thread Tuomas Leikola
Thanks for taking an interest. Answers below.

On Wed, Sep 29, 2010 at 9:01 PM, George Wilson
wrote:

> On Mon, Sep 27, 2010 at 1:13 PM, Tuomas Leikola 
> > tuomas.leik...@gmail.com>> wrote:
>>
>
>>  (continuous resilver loop) has been going on for a week now, always
>> restarting when it
>>should complete.
>>
>>The questions in my mind atm:
>>1. How can i determine the cause for each resilver? Is there a log?
>>
>
> If you're running OI b147 then you should be able to do the following:
>
> # echo "::zfs_dbgmsg" | mdb -k > /var/tmp/dbg.out
>
> Send me the output.


Sending verbose output in a separate email. I'm not very familiar with this
but it does show some "restarting" lines.


>2. Why does it resilver the same data over and over, and not just
>>the changed bits?
>>
>
> If you're having drives fail prior to the initial resilver finishing then
> it will restart and do all the work over again. Are drives still failing
> randomly for you?
>
>
>
Drives haven't been dropping since the initial incidents. It's run to
completion a few times now without (visible) issues with the drives.

Then again, I think there is some magic that reinserts a device back into
the array after an intermittent SATA disconnection.


>
>>3. Can i force remove c9d1 as it is no longer needed but c11t3 can
>>be resilvered instead?
>>
>
> You can detach the spare and let the resilver work on only c11t3. Can you
> send me the output of 'zdb - tank 0'?


The detach command complains that there are not enough replicas. Of course
I could physically remove the device, at which point a scrub should
suffice (the disks must be reasonably up to date by now..).

Sending zdb output in a separate mail as soon as it completes..
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver endlessly restarting at completion

2010-09-29 Thread Tuomas Leikola
The endless resilver problem still persists on OI b147. It restarts when it
should complete.

I see no other solution than to copy the data to safety and recreate the
array. Any hints would be appreciated, as that takes days unless I can stop
or pause the resilvering.

On Mon, Sep 27, 2010 at 1:13 PM, Tuomas Leikola wrote:

> Hi!
>
> My home server had some disk outages due to flaky cabling and whatnot, and
> started resilvering to a spare disk. During this another disk or two
> dropped, and were reinserted into the array. So no devices were actually
> lost, they just were intermittently away for a while each.
>
> The situation is currently as follows:
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
> attempt was made to correct the error.  Applications are
> unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> using 'zpool clear' or replace the device with 'zpool replace'.
>see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go
> config:
>
> NAME   STATE READ WRITE CKSUM
> tank   ONLINE   0 0 0
>   raidz1-0 ONLINE   0 0 0
> c11t1d0p0  ONLINE   0 0 0
> c11t2d0ONLINE   0 0 5
> c11t6d0p0  ONLINE   0 0 0
> spare-3ONLINE   0 0 0
>   c11t3d0p0ONLINE   0 0 0  106M
> resilvered
>   c9d1 ONLINE   0 0 0  104G
> resilvered
> c11t4d0p0  ONLINE   0 0 0
> c11t0d0p0  ONLINE   0 0 0
> c11t5d0p0  ONLINE   0 0 0
> c11t7d0p0  ONLINE   0 0 0  93.6G
> resilvered
>   raidz1-2 ONLINE   0 0 0
> c6t2d0 ONLINE   0 0 0
> c6t3d0 ONLINE   0 0 0
> c6t4d0 ONLINE   0 0 0  2.50K
> resilvered
> c6t5d0 ONLINE   0 0 0
> c6t6d0 ONLINE   0 0 0
> c6t7d0 ONLINE   0 0 0
> c6t1d0 ONLINE   0 0 1
> logs
>   /dev/zvol/dsk/rpool/log  ONLINE   0 0 0
> cache
>   c6t0d0p0 ONLINE   0 0 0
> spares
>   c9d1 INUSE currently in use
>
> errors: No known data errors
>
> And this has been going on for a week now, always restarting when it should
> complete.
>
> The questions in my mind atm:
>
> 1. How can i determine the cause for each resilver? Is there a log?
>
> 2. Why does it resilver the same data over and over, and not just the
> changed bits?
>
> 3. Can i force remove c9d1 as it is no longer needed but c11t3 can be
> resilvered instead?
>
> I'm running opensolaris 134, but the event originally happened on 111b. I
> upgraded and tried quiescing snapshots and IO, none of which helped.
>
> I've already ordered some new hardware to recreate this entire array as
> raidz2 among other things, but there's about a week of time when I can run
> debuggers and traces if instructed to.
>
> - Tuomas
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Resilver endlessly restarting at completion

2010-09-27 Thread Tuomas Leikola
Hi!

My home server had some disk outages due to flaky cabling and whatnot, and
started resilvering to a spare disk. During this another disk or two
dropped, and were reinserted into the array. So no devices were actually
lost, they just were intermittently away for a while each.

The situation is currently as follows:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go
config:

NAME   STATE READ WRITE CKSUM
tank   ONLINE   0 0 0
  raidz1-0 ONLINE   0 0 0
c11t1d0p0  ONLINE   0 0 0
c11t2d0ONLINE   0 0 5
c11t6d0p0  ONLINE   0 0 0
spare-3ONLINE   0 0 0
  c11t3d0p0ONLINE   0 0 0  106M
resilvered
  c9d1 ONLINE   0 0 0  104G
resilvered
c11t4d0p0  ONLINE   0 0 0
c11t0d0p0  ONLINE   0 0 0
c11t5d0p0  ONLINE   0 0 0
c11t7d0p0  ONLINE   0 0 0  93.6G
resilvered
  raidz1-2 ONLINE   0 0 0
c6t2d0 ONLINE   0 0 0
c6t3d0 ONLINE   0 0 0
c6t4d0 ONLINE   0 0 0  2.50K
resilvered
c6t5d0 ONLINE   0 0 0
c6t6d0 ONLINE   0 0 0
c6t7d0 ONLINE   0 0 0
c6t1d0 ONLINE   0 0 1
logs
  /dev/zvol/dsk/rpool/log  ONLINE   0 0 0
cache
  c6t0d0p0 ONLINE   0 0 0
spares
  c9d1 INUSE currently in use

errors: No known data errors

And this has been going on for a week now, always restarting when it should
complete.

The questions in my mind atm:

1. How can i determine the cause for each resilver? Is there a log?

2. Why does it resilver the same data over and over, and not just the
changed bits?

3. Can i force remove c9d1 as it is no longer needed but c11t3 can be
resilvered instead?

I'm running opensolaris 134, but the event originally happened on 111b. I
upgraded and tried quiescing snapshots and IO, none of which helped.

I've already ordered some new hardware to recreate this entire array as
raidz2 among other things, but there's about a week of time when I can run
debuggers and traces if instructed to.

- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs log on another zfs pool

2010-05-01 Thread Tuomas Leikola
Hi.

I have a simple question: is it safe to place a log device on another ZFS
pool?

I'm planning on placing the log on my mirrored root partition. Using latest
opensolaris.
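
To be concrete, what I mean is something like (size made up):

  zfs create -V 1G rpool/log
  zpool add tank log /dev/zvol/dsk/rpool/log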
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS over multiple iSCSI targets

2008-09-08 Thread Tuomas Leikola
On Mon, Sep 8, 2008 at 8:35 PM, Miles Nordin <[EMAIL PROTECTED]> wrote:
>ps> iSCSI with respect to write barriers?
>
> +1.
>
> Does anyone even know of a good way to actually test it?  So far it
> seems the only way to know if your OS is breaking write barriers is to
> trade gossip and guess.
>

Write a program that writes backwards (every other block, to avoid
write merges) with and without O_DSYNC, and measure the speed.

I think you can also deduce driver and drive cache-flush correctness
by calculating the best theoretically correct speed (which should be
really slow - one write per disk revolution).

This has been on my TODO list for ages.. :(
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best option for my home file server?

2007-10-01 Thread Tuomas Leikola
Let A, B, C, D be the 250GB disks and X, Y the 500GB ones.

My choice here would be raidz over (A+B), (C+D), X, Y -

which means something like

zpool create tank raidz (stripe A B) (stripe C D) X Y

(how do you actually write that up as zpool commands?)
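
As far as I know zpool can't nest a stripe inside a raidz directly; one
workaround is to concatenate the small disks with SVM first and hand the
metadevices to ZFS. Untested sketch, all device names made up:

  # SVM needs state database replicas before any metadevices exist
  metadb -a -f c0t0d0s7

  # concatenate the 250GB pairs (A+B and C+D)
  metainit d10 2 1 c1t0d0s2 1 c1t1d0s2
  metainit d11 2 1 c1t2d0s2 1 c1t3d0s2

  # raidz over the two concats plus the 500GB disks X and Y
  zpool create tank raidz /dev/md/dsk/d10 /dev/md/dsk/d11 c2t0d0 c2t1d0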
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-30 Thread Tuomas Leikola
On 9/20/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
>
> Next application modifies D0 -> D0' and also writes other
> data D3, D4. Now you have
>
> Disk0   Disk1   Disk2   Disk3
>
> D0  D1  D2  P0,1,2
> D0' D3  D4  P0',3,4
>
> But if D1 and D2 stays immutable for long time then we can
> run out of pool blocks with D0 held down in an half-freed state.
> So as we near full pool capacity, a scrubber would have to walk
> the stripes  and look for partially freed ones. Then it
> would need to do a scrubbing "read/write" on D1, D2 so that
> they become part of a new stripe with some other data
> freeing the full initial stripe.
>

Or, given a list of partial stripes (and sufficient cache), next write
of D5 could be combined with D1,D2:

 Disk0   Disk1   Disk2   Disk3

 D0      D1      D2      P0,1,2
 D0'     D3      D4      P0',3,4
 D5      free    free    P5,1,2

therefore freeing D0 and P0,1,2:

 Disk0   Disk1   Disk2   Disk3

 free    D1      D2      free
 D0'     D3      D4      P0',3,4
 D5      free    free    P5,1,2

(I assumed no need for alignment.) Performance-wise, I'm guessing it
might be beneficial to "quickly" write mirrored blocks on the disk and
later combine them, freeing the then-unneeded mirror copies.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-15 Thread Tuomas Leikola
On 9/10/07, Pawel Jakub Dawidek <[EMAIL PROTECTED]> wrote:
> The problem with RAID5 is that different blocks share the same parity,
> which is not the case for RAIDZ. When you write a block in RAIDZ, you
> write the data and the parity, and then you switch the pointer in
> uberblock. For RAID5, you write the data and you need to update parity,
> which also protects some other data. Now if you write the data but
> don't update the parity before a crash, you have a hole. If you update
> the parity before the write and then crash, you have an inconsistency
> with a different block in the same stripe.

This is why you should consider "old" data and parity as being "live".
The old data (being overwritten) is live as it is needed for the
parity to be consistent - and the old parity is live because it
protects the other blocks.

What IMO should be done is object-level raid: write the new parity and
new data into blocks not yet used - and since the new parity also
protects the "neighbouring" data, the old parity can be freed, and once
it is no longer live, the "overwritten" data block can also be freed.
Note that this is very different from traditional raid5 as it requires
intimate knowledge about the FS structure. Traditional raids also keep
parity "in line" with the data blocks it protects - but that is not
necessary if the FS can store information about where the parity is
located.

Define "live data" well enough and you're safe if you never overwrite any of it.

> My idea was to have one sector every 1GB on each disk for a "journal" to
> keep list of blocks beeing updated.

This would be called a "write intent log" or "bitmap" (as in Linux
software raid). It speeds up recovery, but doesn't protect against the
write hole problem.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best option for my home file server?

2007-08-19 Thread Tuomas Leikola
On 8/19/07, James <[EMAIL PROTECTED]> wrote:
> Raidz can only be as big as your smallest disk. For example if I had a 320gig 
> with a 250gig and 200gig I could only have 400gig of storage.
>

Correct - variable sized disks are not (yet?) supported.

However, you can circumvent this by slicing up the disks (or using
partitions) and building multiple raidz sets.

In your example, you would first use 200G from each disk to make a
3x200G raidz, leaving you with 120G and 50G of "excess" on two disks.
Then you can make a 2x50G mirror out of those, leaving 70G of excess on
the first disk (which can be used as temporary space or such). It gets
better when there are more disks - if your two largest disks are of
equal size, you even get away without any wasted space.

This process is quite clumsy, and gets really complicated when you want
to add disks - but as it stands, it is the only option in this
scenario.
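
A sketch with made-up device names (320G = c0t0d0, 250G = c0t1d0,
200G = c0t2d0), after carving the slices with format(1M):

  # s0 on every disk is a 200G slice, s1 on the two larger disks is a 50G slice
  zpool create -f tank raidz c0t0d0s0 c0t1d0s0 c0t2d0s0 mirror c0t0d0s1 c0t1d0s1
  # (-f because zpool warns about mixing raidz and mirror vdevs in one pool)

The leftover ~70G slice on the 320G disk can go into a scratch pool of its own.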

Wishlist: object level raidz..
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on entire disk?

2007-08-11 Thread Tuomas Leikola
On 8/11/07, Russ Petruzzelli <[EMAIL PROTECTED]> wrote:
>
>  Is it possible/recommended to create a zpool and zfs setup such that the OS
> itself (in root /)  is in its own zpool?

Yes. You're looking for "zfs root", and it's easiest if your installer
does that for you. At least the latest Nexenta unstable installs a zfs
root by default.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Tuomas Leikola
On 8/10/07, Darren Dunham <[EMAIL PROTECTED]> wrote:
> For instance, it might be nice to create a "mirror" with a 100G disk and
> two 50G disks.  Right now someone has to create slices on the big disk
> manually and feed them to zpool.  Letting ZFS handle everything itself
> might be a win for some cases.

Especially performance-wise. AFAIK ZFS doesn't understand that the two
vdevs actually share a physical disk and therefore should not be used
as raid0-like stripes.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Tuomas Leikola
On 8/10/07, Moore, Joe <[EMAIL PROTECTED]> wrote:

> Wishlist: It would be nice to put the whole redundancy definitions into
> the zfs filesystem layer (rather than the pool layer):  Imagine being
> able to "set copies=5+2" for a filesystem... (requires a 7-VDEV pool,
> and stripes via RAIDz2, otherwise the zfs create/set fails)

Yes please ;)

This is practically the holy grail of "dynamic raid" - the ability to
dynamically use different redundancy settings on a per-directory
level, and to use a mix of different sized devices and add/remove them
at will.

I guess one would call this feature "ditto blocks as stripe+parity".
It's doable, but probably requires large(ish) changes to the on-disk
structures, as the block pointer will look different.

James, did you look at this? With vdev removal (which I suppose will
be implemented with some kind of "rewrite block" -type code) in place,
"reshape" and rebalance functionality would probably be relatively
small improvements.

BTW here's more wishlist items now that we're at it:

- copies=max+2 (use as many stripes as possible, with border case of
3-way mirror)
- minchunk=8kb (dont spread smaller stripes than this - performance
optimization)
- checksum on every disk independently (instead of full stripe) -
fixes raidz random read performance

.. And one crazy idea just popped into my head: fs-level raid could be
implemented with separate parity blocks instead of the ditto
mechanism. Say, when data is first written, a normal ditto block is
used. Then later, asynchronously, the block is combined with some
other blocks (that may be unrelated), the parity is written to a new
allocation and the ditto block(s) are freed. When data blocks are
freed (by COW) the parity needs to be recalculated before the data
block can actually be forgotten. This can be thought of as combining a
number of ditto blocks into a parity block.

That may be easier or more complicated to implement than saving the
block as stripe+parity in the first place. Depends on the data
structures, which I don't yet know intimately.

Come to think of it, it's probably best to get all these ideas out
there _before_ I start looking into the code - knowing the details has
the tendency to kill all the crazy ideas :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Tuomas Leikola
On 8/10/07, Darren J Moffat <[EMAIL PROTECTED]> wrote:
> Tuomas Leikola wrote:
> >>>> We call that a "mirror" :-)
> >>>>
> >>> Mirror and raidz suffer from the classic blockdevice abstraction
> >>> problem in that they need disks of equal size.
> >> Not that I'm aware of.  Mirror and raid-z will simply use the smallest
> >> size of your available disks.
> >>
> >
> > Exactly. The rest is not usable.
>
> For what you are asking, forcing ditto blocks on to separate vdevs, to
> work you effectively end up with the same restriction as mirroing.

In theory, correct. In practice, administration is much simpler when
there are multiple devices.

Simplicity of administration really being the point here - sorry I
didn't make it clear at first.

I'm skipping the two-disk example as trivial - which it is. However:
administration becomes a real mess when you have multiple (say, 10)
disks, all of differing sizes, and want to use all the space - think
about the home user with a constrained budget, or just a huge pile of
random oldish disks lying around.

It is possible to merge disks before (or after) setting up the
mirrors, but it is a tedious job, especially when you start replacing
small disks one by one with larger ones, etc.

This can be - relatively easily - automated by zfs block allocation
strategies and this is why I consider it a worthwhile feature.

> However I suspect you will say that unlike mirroring only some of your
> datasets will have ditto blocks turned on.
>

That's one good point. Maybe I don't want to decide in advance how
much mirrored storage I really need - or I'd just use all the "free"
mirrored space for non-mirrored temporary storage. I'd call this
flexibility.

> The only way I could see this working is if *all* datasets that have
> copies > 1 were "quotaed" down to the size of the smallest disk.
>

Admittedly, in the two-disk scenario the benefit is relatively low,
but in most multi-disk scenarios the disks can be practically full
before you run out of ditto locations - minus the last block(s). (This
holds for copies=2 if the largest disk < the sum of the others.)

> Which basically ends up back at a real mirror or a really hard to
> understand system IMO.

I find the volume manager mess hard to understand - and it is a mess
in the multi-disk scenario once you start adding and removing disks.

For a real-world use case, I'll present my home fileserver: 11 disks,
with sizes varying between 80 and 400 gigabytes. The disks are
concatenated into 6 "stacks" that are raid6'd together - with only 40G
or so of "wasted" space. I had to write a program to optimize the disk
arrangement. Raid6 isn't exactly mirroring, but the administrative
hurdles are the same.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Tuomas Leikola
> >> We call that a "mirror" :-)
> >>
> >
> > Mirror and raidz suffer from the classic blockdevice abstraction
> > problem in that they need disks of equal size.
>
> Not that I'm aware of.  Mirror and raid-z will simply use the smallest
> size of your available disks.
>

Exactly. The rest is not usable.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Tuomas Leikola
On 8/9/07, Richard Elling <[EMAIL PROTECTED]> wrote:
> > What I'm looking for is a disk full error if ditto cannot be written
> > to different disks. This would guarantee that a mirror is written on a
> > separate disk - and the entire filesystem can be salvaged from a full
> > disk failure.
>
> We call that a "mirror" :-)
>

Mirror and raidz suffer from the classic blockdevice abstraction
problem in that they need disks of equal size. Not really a problem
for most people, but inconvenient for everyone.

Isn't flexibility and ease of administration "the zfs way"? ;)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-10 Thread Tuomas Leikola
On 8/9/07, Mario Goebbels <[EMAIL PROTECTED]> wrote:
> If you're that bent on having maximum redundancy, I think you should
> consider implementing real redundancy. I'm also biting the bullet and
> going mirrors (cheaper than RAID-Z for home, less disks needed to start
> with).

Currently I am, and as I'm stuck with different-sized disks, I first
have to slice them up into similarly sized chunks and... well, you get
the idea. It's a pain.

> The problem here is that the filesystem, especially with a considerable
> fill factor, can't guarantee the necessary allocation balance across the
> vdevs (that is maintaining necessary free space) to spread the ditto
> blocks as optimal as you'd like. Implementing the required code would
> increase the overhead a lot. Not to mention that ZFS may have to defrag
> on the fly more than not to make sure the ditto spread can be maintained
> balanced.

I feel that, for most purposes, this could be fixed with an allocator
strategy option, like "prefer vdevs with the most free space" (which is
not that good a default, as it has performance implications).

> And then snapshots on top of that, which are supposed to be physically
> and logically immovable (unless you execute commands affecting the pool,
> like a vdev remove, I suppose), just increase the existing complexity,
> where all that would have to be hammered into.

I'm not that familiar with the code, but I get the feeling that if
vdev removal is a given, rebalance would not be a huge step? The code
to migrate data blocks would already be there.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Force ditto block on different vdev?

2007-08-09 Thread Tuomas Leikola
>
> Actually, ZFS is already supposed to try to write the ditto copies of a
> block on different vdevs if multiple are available.
>

*TRY*  being the keyword here.

What I'm looking for is a disk-full error if the ditto copy cannot be
written to a different disk. This would guarantee that a mirror copy is
written on a separate disk - and the entire filesystem could be
salvaged after a whole-disk failure.

Think about the classic case of 50M, 100M and 200M disks: only 150M
can really be mirrored, and the remaining 50M can only be used
non-redundantly.

> ...But I think in a
> non-redundant setup, the pool refuses to start if a disk is missing (I
> think that should be changed, to allow evacuation of properly dittoed data).

IIRC this is already considered a bug.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Force ditto block on different vdev?

2007-08-09 Thread Tuomas Leikola
Hi!

I'm having a hard time finding out whether it's possible to force
ditto blocks onto different devices.

This mode has many benefits, not the least of which is that it
practically creates a fully dynamic form of mirroring (replacing raid1
and raid10 variants), especially when combined with the upcoming vdev
removal and defrag/rebalance features.

Is this already available? Is it scheduled? Why not?

- Tuomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss