Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Christo Kutrovsky
Bob,

Using a separate pool would dictate other limitations, such as not being able to 
use more space than what's allocated in the pool. You could "add" space as 
needed, but you can't remove (move) devices freely.

By using a shared pool with a hint of the desired vdev/space allocation policy, you 
could have the best of both worlds.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Ethan
On Wed, Feb 17, 2010 at 00:27, Ethan  wrote:

> On Tue, Feb 16, 2010 at 23:57, Daniel Carosone  wrote:
>
>> On Tue, Feb 16, 2010 at 11:39:39PM -0500, Ethan wrote:
>> > If slice 2 is the whole disk, why is zpool trying to use slice 8 for all
>> > but one disk?
>>
>> Because it's finding at least part of the labels for the pool member
>> there.
>>
>> Please check the partition tables of all the disks, and use zdb -l on
>> the various partitions, to make sure that you haven't got funny
>> offsets or other problems hiding the data from import.
>>
>> In a default solaris label, s2 and s8 start at cylinder 0 but are
>> vastly different sizes.  You need to arrange for your labels to match
>> however the data you copied got laid out.
>>
>> > Can I explicitly tell zpool to use slice 2 for each device?
>>
>> Not for import, only at creation time.  On import, devices are chosen
>> by inspection of the zfs labels within.  zdb -l will print those for
>> you; when you can see all 4 labels for all devices your import has a
>> much better chance of success.
>>
>> --
>> Dan.
>>
>
> How would I go about arranging labels?
> I only see labels 0 and 1 (and do not see labels 2 and 3) on every device,
> for both slice 8 (which makes sense if 8 is just part of the drive; the zfs
> devices take up the whole drive) and slice 2 (which doesn't seem to make
> sense to me).
>
> Since only two of the four labels are showing up for each of the drives on
> both slice 2 and slice 8, I guess that causes zpool to not have a preference
> between slice 2 and slice 8? So it just picks whichever it sees first, which
> happened to be slice 2 for one of the drives, but 8 for the others? (I am
> really just guessing at this.)
>
> So, on one hand, the fact that it says slice 2 is online for one drive
> makes me think that if I could get it to use slice 2 for the rest maybe it
> would work.
> On the other hand, the fact that I can't see labels 2 and 3 on slice 2 for
> any drive (even the one that says it's online) is worrisome and I want to
> figure out what's up with that.
>
> Labels 2 and 3 _do_ show up (and look right) in zdb -l running in zfs-fuse
> on linux, on the truecrypt volumes.
>
> If it might just be a matter of arranging the labels so that the beginning
> and end of a slice are in the right place, that sounds promising, although I
> have no idea how I go about arranging labels. Could you point me in the
> direction of what utility I might use or some documentation to get me
> started in that direction?
>
> Thanks,
> -Ethan
>

And I just realized - yes, labels 2 and 3 are in the wrong place relative to
the end of the drive; I did not take into account the overhead taken up by
truecrypt when dd'ing the data. The raw drive is 1500301910016 bytes; the
truecrypt volume is 1500301647872 bytes. Off by 262144 bytes - I need a
slice that is sized like the truecrypt volume.
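For anyone hitting the same thing, here is the arithmetic plus a rough way to check
and adjust the label (sketch only; device name taken from earlier in this thread, and
note that SMI slices are cylinder-aligned, so an exact fit may take some care):

    # overhead: 1500301910016 - 1500301647872 = 262144 bytes (presumably truecrypt header space)
    # target slice length: 1500301647872 / 512 = 2930276656 512-byte sectors
    prtvtoc /dev/rdsk/c9t2d0s2   # inspect the current slice layout
    format                       # then use format's partition menu (or fmthard with an
                                 # edited VTOC) to make a slice that starts where the copied
                                 # data starts and runs for 2930276656 sectors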
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Ethan
On Tue, Feb 16, 2010 at 23:57, Daniel Carosone  wrote:

> On Tue, Feb 16, 2010 at 11:39:39PM -0500, Ethan wrote:
> > If slice 2 is the whole disk, why is zpool trying to use slice 8 for all
> > but one disk?
>
> Because it's finding at least part of the labels for the pool member there.
>
> Please check the partition tables of all the disks, and use zdb -l on
> the various partitions, to make sure that you haven't got funny
> offsets or other problems hiding the data from import.
>
> In a default solaris label, s2 and s8 start at cylinder 0 but are
> vastly different sizes.  You need to arrange for your labels to match
> however the data you copied got laid out.
>
> > Can I explicitly tell zpool to use slice 2 for each device?
>
> Not for import, only at creation time.  On import, devices are chosen
> by inspection of the zfs labels within.  zdb -l will print those for
> you; when you can see all 4 labels for all devices your import has a
> much better chance of success.
>
> --
> Dan.
>

How would I go about arranging labels?
I only see labels 0 and 1 (and do not see labels 2 and 3) on every device,
for both slice 8 (which makes sense if 8 is just part of the drive; the zfs
devices take up the whole drive) and slice 2 (which doesn't seem to make
sense to me).

Since only two of the four labels are showing up for each of the drives on
both slice 2 and slice 8, I guess that causes zpool to not have a preference
between slice 2 and slice 8? So it just picks whichever it sees first, which
happened to be slice 2 for one of the drives, but 8 for the others? (I am
really just guessing at this.)

So, on one hand, the fact that it says slice 2 is online for one drive makes
me think that if I could get it to use slice 2 for the rest maybe it would
work.
On the other hand, the fact that I can't see labels 2 and 3 on slice 2 for
any drive (even the one that says it's online) is worrisome and I want to
figure out what's up with that.

Labels 2 and 3 _do_ show up (and look right) in zdb -l running in zfs-fuse
on linux, on the truecrypt volumes.

If it might just be a matter of arranging the labels so that the beginning
and end of a slice are in the right place, that sounds promising, although I
have no idea how I go about arranging labels. Could you point me in the
direction of what utility I might use or some documentation to get me
started in that direction?

Thanks,
-Ethan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Duplicating a system rpool

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 10:33:26PM -0600, David Dyer-Bennet wrote:
> Here's what I've started:  I've created a mirrored pool called rp2 on  
> the new disks, and I'm zfs send -R'ing a current snapshot over to the new  
> disks.  In fact it just finished.  I've got an altroot set, and  
> obviously I gave them a new name so as not to conflict with the existing  
> rpool.
>
> I probably lost properties in the zfs send/receive (111b doesn't support  
> -p on zfs send).   I suppose I could boot from a LiveCd of something  
> more recent and re-do the send/receive with a version that supports -p,  
> to save me trouble later reconstructing the properties.

That would be handy, but be careful that rp2 isn't created with a
newer pool version than what you're copying to it can use (ok, since
you already created the pool).  

Booting from a livecd also allows you to export both pools, then
import rp2 as "rpool" before exporting it again.  That way, when you
boot, it will have the name "rpool".  This is mostly important for the
name of your swap zvol, but is handy if you prefer to keep things just
the same as before.
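In case it helps, the rename dance from a livecd would look roughly like this (a
sketch only; -R keeps the imported pool from mounting over the livecd's own
filesystems):

    zpool export rpool            # the old root pool, kept exported from here on
    zpool export rp2
    zpool import -R /a rp2 rpool  # re-import the copy under the name "rpool"
    zpool export rpool            # export again so the installed system imports it cleanly at boot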

> So does anybody have strong opinions about the best, meaning easiest,  
> way to do this, remembering that when done I need to import my existing  
> data pool and have the user still have access to their data?

Look at the "root pool recovery" processes document I referred to just
this morning.

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-16 Thread Eric D. Mudama

On Tue, Feb 16 at  9:44, Brian E. Imhoff wrote:

But, at the end of the day, this is quite a bomb: "A single raidz2
vdev has about as many IOs per second as a single disk, which could
really hurt iSCSI performance."

If I have to break 24 disks up into multiple vdevs to get the
expected performance, that might be a deal breaker.  To keep raidz2
redundancy, I would have to lose... almost half of the available
storage to get reasonable IO speeds.


ZFS is quite flexible.  You can put multiple vdevs in a pool, and dial
your performance/redundancy just about wherever you want them.

24 disks could be:

12x mirrored vdevs (best random IO, 50% capacity, any 1 failure absorbed, up to 12 w/ limits)
6x 4-disk raidz vdevs (75% capacity, any 1 failure absorbed, up to 6 with limits)
4x 6-disk raidz vdevs (~83% capacity, any 1 failure absorbed, up to 4 with limits)
4x 6-disk raidz2 vdevs (~66% capacity, any 2 failures absorbed, up to 8 with limits)
1x 24-disk raidz2 vdev (~92% capacity, any 2 failures absorbed, worst random IO perf)
etc.

I think the 4x 6-disk raidz2 vdev setup is quite commonly used with 24
disks available, but each application is different.  We use mirrored
vdevs at work, with a separate box as a "live" backup using raidz of
larger SATA drives.
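For concreteness, the 4x 6-disk raidz2 layout mentioned above would be created with
something like the following (controller/target names are invented; substitute your
own):

    zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
      raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
      raidz2 c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0
    # the 12x mirror layout is the same idea, with twelve "mirror diskA diskB" groups instead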

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 11:39:39PM -0500, Ethan wrote:
> If slice 2 is the whole disk, why is zpool trying to use slice 8 for all
> but one disk? 

Because it's finding at least part of the labels for the pool member there.

Please check the partition tables of all the disks, and use zdb -l on
the various partitions, to make sure that you haven't got funny
offsets or other problems hiding the data from import. 

In a default solaris label, s2 and s8 start at cylinder 0 but are
vastly different sizes.  You need to arrange for your labels to match
however the data you copied got laid out.

> Can I explicitly tell zpool to use slice 2 for each device?

Not for import, only at creation time.  On import, devices are chosen
by inspection of the zfs labels within.  zdb -l will print those for
you; when you can see all 4 labels for all devices your import has a
much better chance of success.
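A quick way to compare the labels across the devices in this thread (a sketch; the
grep just counts how many of the four labels fail to unpack, so 0 is what you want):

    for d in c9t0d0 c9t1d0 c9t2d0 c9t4d0 c9t5d0; do
      for s in s2 s8; do
        echo "== $d$s =="
        zdb -l /dev/dsk/$d$s | grep -c 'failed to unpack'   # 0 means all four labels are readable
      done
    done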

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Ethan
On Tue, Feb 16, 2010 at 23:24, Richard Elling wrote:

> On Feb 16, 2010, at 7:57 PM, Ethan wrote:
> > On Tue, Feb 16, 2010 at 22:35, Daniel Carosone  wrote:
> > On Wed, Feb 17, 2010 at 02:30:28PM +1100, Daniel Carosone wrote:
> > > > c9t4d0s8  UNAVAIL  corrupted data
> > > > c9t5d0s2  ONLINE
> > > > c9t2d0s8  UNAVAIL  corrupted data
> > > > c9t1d0s8  UNAVAIL  corrupted data
> > > > c9t0d0s8  UNAVAIL  corrupted data
>
> slice 8 tends to be tiny and slice 2 is the whole disk, which is why
> you can't find label 2 or 3, which are at the end of the disk.
>
> Try exporting the pool and then import.
>  -- richard
>
>
The pool never successfully imports, so I can't export it. The import just
gives the output I pasted, and the pool is not imported.
If slice 2 is the whole disk, why is zpool trying to use slice 8 for all
but one disk? Can I explicitly tell zpool to use slice 2 for each device?

-Ethan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Duplicating a system rpool

2010-02-16 Thread David Dyer-Bennet
I've got the new controller and the new system disks running in the 
system, for anybody keeping score at home.


So I'm looking at how to migrate to the new system disks.  They're a 
different size (160GB vs 80GB) and form factor (2.5" vs 3.5") from the 
old disks (I've got a mirrored pool for my old rpool).


Here's what I've started:  I've created a mirrored pool called rp2 on 
the new disks, and I'm zfs send -R'ing a current snapshot over to the new 
disks.  In fact it just finished.  I've got an altroot set, and 
obviously I gave them a new name so as not to conflict with the existing 
rpool.


I probably lost properties in the zfs send/receive (111b doesn't support 
-p on zfs send).   I suppose I could boot from a LiveCd of something 
more recent and re-do the send/receive with a version that supports -p, 
to save me trouble later reconstructing the properties.


So, I was thinking I could reconstruct the properties, rename the pool 
and filesystem (possibly by booting from a livecd), and put Grub on them 
the same way you do when adding a mirror disk to rpool.   Other 
possibilities seem to include using dd to copy one drive physically and 
then booting from it and proceeding from there, and installing from 
scratch on the new drives (and then having to recreate my configuration 
down to UIDs and GIDs manually).


So does anybody have strong opinions about the best, meaning easiest, 
way to do this, remembering that when done I need to import my existing 
data pool and have the user still have access to their data?


I haven't actually investigated if the bios can boot from the new 
controller, but if not I can finagle the cables around to put the new 
disks on the old controller.


Speaking of which, to make the controller card physically fit I had to 
remove the end bracket from it, since it didn't match up with the 
cutouts in the back of the case.  This leaves the controller almost 
hanging free (currently supported by an inch of gaffer's tape), with one 
now and soon two rather stiff SAS-to-4-SATA cable sets connected to it.  
This seems like a bad idea, really.  I suppose I could glue some wood 
blocks to the back of the chassis (or the next covering strip down) to 
provide something solid for the card to rest on, but being restrained in 
both directions seems better.  How do people handle this?  And how did 
this particular standard come to have competition?


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Richard Elling
On Feb 16, 2010, at 7:57 PM, Ethan wrote:
> On Tue, Feb 16, 2010 at 22:35, Daniel Carosone  wrote:
> On Wed, Feb 17, 2010 at 02:30:28PM +1100, Daniel Carosone wrote:
> > > c9t4d0s8  UNAVAIL  corrupted data
> > > c9t5d0s2  ONLINE
> > > c9t2d0s8  UNAVAIL  corrupted data
> > > c9t1d0s8  UNAVAIL  corrupted data
> > > c9t0d0s8  UNAVAIL  corrupted data

slice 8 tends to be tiny and slice 2 is the whole disk, which is why
you can't find label 2 or 3, which are at the end of the disk.

Try exporting the pool and then import.
 -- richard

> 
> >  - zdb -l  for each of the devs above, compare and/or post.  this
> >helps ensure that you copied correctly, with respect to all the various
> >translations, labelling, partitioning etc differences between the
> >platforms.  Since you apparently got at least one right, hopefully this
> >is less of an issue if you did the same for all.
> 
> > Actually, looking again, is there any significance to the fact that s2
> on one disk is ok, and s8 on the others are not?  Perhaps start with
> the zdb -l, and make sure you're pointing at the right data before
> importing.
> 
> --
> Dan.
> 
> 
> I do not know if there is any significance to the okay disk being s2 and the 
> others s8 - in fact I do not know what the numbers mean at all, being out of 
> my element in opensolaris (but trying to learn as much as I can, as quickly as I can). 
> As for the copying, all that I did was `dd if=<truecrypt volume> of=<new disk>` for each of the five disks. 
> Output of zdb -l looks identical for each of the five volumes, apart from 
> guid. I have pasted one of them below. 
> Thanks for your help.
> 
> -Ethan
> 
> 
> et...@save:/dev# zdb -l dsk/c9t2d0s8
> 
> LABEL 0
> 
> version=13
> name='q'
> state=1
> txg=361805
> pool_guid=5055543090570728034
> hostid=8323328
> hostname='that'
> top_guid=441634638335554713
> guid=13840197833631786818
> vdev_tree
> type='raidz'
> id=0
> guid=441634638335554713
> nparity=1
> metaslab_array=23
> metaslab_shift=32
> ashift=9
> asize=7501483868160
> is_log=0
> children[0]
> type='disk'
> id=0
> guid=459016284133602
> path='/dev/mapper/truecrypt3'
> whole_disk=0
> children[1]
> type='disk'
> id=1
> guid=12502103998258102871
> path='/dev/mapper/truecrypt2'
> whole_disk=0
> children[2]
> type='disk'
> id=2
> guid=13840197833631786818
> path='/dev/mapper/truecrypt1'
> whole_disk=0
> children[3]
> type='disk'
> id=3
> guid=3763020893739678459
> path='/dev/mapper/truecrypt5'
> whole_disk=0
> children[4]
> type='disk'
> id=4
> guid=4929061713231157616
> path='/dev/mapper/truecrypt4'
> whole_disk=0
> 
> LABEL 1
> 
> version=13
> name='q'
> state=1
> txg=361805
> pool_guid=5055543090570728034
> hostid=8323328
> hostname='that'
> top_guid=441634638335554713
> guid=13840197833631786818
> vdev_tree
> type='raidz'
> id=0
> guid=441634638335554713
> nparity=1
> metaslab_array=23
> metaslab_shift=32
> ashift=9
> asize=7501483868160
> is_log=0
> children[0]
> type='disk'
> id=0
> guid=459016284133602
> path='/dev/mapper/truecrypt3'
> whole_disk=0
> children[1]
> type='disk'
> id=1
> guid=12502103998258102871
> path='/dev/mapper/truecrypt2'
> whole_disk=0
> children[2]
> type='disk'
> id=2
> guid=13840197833631786818
> path='/dev/mapper/truecrypt1'
> whole_disk=0
> children[3]
> type='disk'
> id=3
> guid=3763020893739678459
> path='/dev/mapper/truecrypt5'
> whole_disk=0
> children[4]
> type='disk'
> id=4
> guid=4929061713231157616
> path='/dev/mapper/truecrypt4'
> whole_disk=0
> 
> LABEL 2
> 
> failed to unpack label 2
> 
> LABEL 3
> 
> failed to unpack label 3

Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Ethan
On Tue, Feb 16, 2010 at 22:35, Daniel Carosone  wrote:

> On Wed, Feb 17, 2010 at 02:30:28PM +1100, Daniel Carosone wrote:
> > > c9t4d0s8  UNAVAIL  corrupted data
> > > c9t5d0s2  ONLINE
> > > c9t2d0s8  UNAVAIL  corrupted data
> > > c9t1d0s8  UNAVAIL  corrupted data
> > > c9t0d0s8  UNAVAIL  corrupted data
>
> >  - zdb -l  for each of the devs above, compare and/or post.  this
> >helps ensure that you copied correctly, with respect to all the
> various
> >translations, labelling, partitioning etc differences between the
> >platforms.  Since you apparently got at least one right, hopefully this
> >is less of an issue if you did the same for all.
>
> Actually, looking again, is there any significance to the fact that s2
> on one disk is ok, and s8 on the others are not?  Perhaps start with
> the zdb -l, and make sure you're pointing at the right data before
> importing.
>
> --
> Dan.
>
>
I do not know if there is any significance to the okay disk being s2 and the
others s8 - in fact I do not know what the numbers mean at all, being out of
my element in opensolaris (but trying to learn as much as I can, as quickly as I
can).
As for the copying, all that I did was `dd if=<truecrypt volume> of=<new disk>` for each of the five disks.
Output of zdb -l looks identical for each of the five volumes, apart from
guid. I have pasted one of them below.
Thanks for your help.

-Ethan


et...@save:/dev# zdb -l dsk/c9t2d0s8

LABEL 0

version=13
name='q'
state=1
txg=361805
pool_guid=5055543090570728034
hostid=8323328
hostname='that'
top_guid=441634638335554713
guid=13840197833631786818
vdev_tree
type='raidz'
id=0
guid=441634638335554713
nparity=1
metaslab_array=23
metaslab_shift=32
ashift=9
asize=7501483868160
is_log=0
children[0]
type='disk'
id=0
guid=459016284133602
path='/dev/mapper/truecrypt3'
whole_disk=0
children[1]
type='disk'
id=1
guid=12502103998258102871
path='/dev/mapper/truecrypt2'
whole_disk=0
children[2]
type='disk'
id=2
guid=13840197833631786818
path='/dev/mapper/truecrypt1'
whole_disk=0
children[3]
type='disk'
id=3
guid=3763020893739678459
path='/dev/mapper/truecrypt5'
whole_disk=0
children[4]
type='disk'
id=4
guid=4929061713231157616
path='/dev/mapper/truecrypt4'
whole_disk=0

LABEL 1

version=13
name='q'
state=1
txg=361805
pool_guid=5055543090570728034
hostid=8323328
hostname='that'
top_guid=441634638335554713
guid=13840197833631786818
vdev_tree
type='raidz'
id=0
guid=441634638335554713
nparity=1
metaslab_array=23
metaslab_shift=32
ashift=9
asize=7501483868160
is_log=0
children[0]
type='disk'
id=0
guid=459016284133602
path='/dev/mapper/truecrypt3'
whole_disk=0
children[1]
type='disk'
id=1
guid=12502103998258102871
path='/dev/mapper/truecrypt2'
whole_disk=0
children[2]
type='disk'
id=2
guid=13840197833631786818
path='/dev/mapper/truecrypt1'
whole_disk=0
children[3]
type='disk'
id=3
guid=3763020893739678459
path='/dev/mapper/truecrypt5'
whole_disk=0
children[4]
type='disk'
id=4
guid=4929061713231157616
path='/dev/mapper/truecrypt4'
whole_disk=0

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Bonnie++ stats

2010-02-16 Thread Marc Nicholas
Anyone else got stats to share?

Note: the below is 4*Caviar Black 500GB drives, 1*Intel x-25m setup as both
ZIL and L2ARC, decent ASUS mobo, 2GB of fast RAM.

-marc

r...@opensolaris130:/tank/myfs# /usr/benchmarks/bonnie++/bonnie++ -u root -d
/tank/myfs -f -b
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03c       --Sequential Output-- --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
opensolaris130   4G   49503  13 30468   9   67882   6 320.1   1
                    --Sequential Create-- Random Create
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4225  30 + +++  4709  24  3407  38 + +++  4572  22
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 04:47:11PM -0800, Christo Kutrovsky wrote:
> One of the ideas that sparkled is have a "max devices" property for
> each data set, and limit how many mirrored devices a given data set
> can be spread on. I mean if you don't need the performance, you can
> limit (minimize) the device, should your capacity allow this. 

There have been some good responses, around better ways to do damage
control.  I thought I'd respond separately, with a different use case
for essentially the same facility.

If your suggestion were to be implemented, it would take the form of
a different allocation policy for selecting vdevs and metaslabs for
writes.  There is scope for several alternate policies addressing
different requirements in future development, and there are some nice XXX
comments about "cool stuff could go here" accordingly.

One of these is for power-saving, with MAID-style pools, whereby the
majority of disks (vdevs) in a pool would be idle and spun down, most
of the time.  This requires expressing very similar kinds of
preferences, for what data goes where (and when).

AIX's LVM (not the nasty linux knock-off) had similar layout
preferences, for different purposes - you could mark lv's with
allocation preferences to the centre of spindles for performance, or
other options, and then relayout the data accordingly.  I say "had",
it presumably still does, but I haven't touched it in 15 years or
more. 

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 02:30:28PM +1100, Daniel Carosone wrote:
> > c9t4d0s8  UNAVAIL  corrupted data
> > c9t5d0s2  ONLINE
> > c9t2d0s8  UNAVAIL  corrupted data
> > c9t1d0s8  UNAVAIL  corrupted data
> > c9t0d0s8  UNAVAIL  corrupted data

>  - zdb -l  for each of the devs above, compare and/or post.  this
>helps ensure that you copied correctly, with respect to all the various
>translations, labelling, partitioning etc differences between the
>platforms.  Since you apparenly got at least one right, hopefully this
>is less of an issue if you did  the same for all. 

Actually, looking again, is there any signifigance to the fact that s2
on one disk is ok, and s8 on the others are not?  Perhaps start with
the zdb -l, and make sure you're pointing at the right data before
importing. 

--
Dan.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 10:06:13PM -0500, Ethan wrote:
> This is the current state of my pool:
> 
> et...@save:~# zpool import
>   pool: q
> id: 5055543090570728034
>  state: UNAVAIL
> status: One or more devices contains corrupted data.
> action: The pool cannot be imported due to damaged devices or data.
>see: http://www.sun.com/msg/ZFS-8000-5E
> config:
> 
> q UNAVAIL  insufficient replicas
>   raidz1  UNAVAIL  insufficient replicas
> c9t4d0s8  UNAVAIL  corrupted data
> c9t5d0s2  ONLINE
> c9t2d0s8  UNAVAIL  corrupted data
> c9t1d0s8  UNAVAIL  corrupted data
> c9t0d0s8  UNAVAIL  corrupted data
> 
> 
> back story:
> I was previously running and using this pool on linux using zfs-fuse.

Two things to try:

 - import -F (with -n, first time) on a recent build
 - zdb -l  for each of the devs above, compare and/or post.  this
   helps ensure that you copied correctly, with respect to all the various
   translations, labelling, partitioning etc differences between the
   platforms.  Since you apparently got at least one right, hopefully this
   is less of an issue if you did the same for all. 
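For the first point, on a build recent enough to have the pool recovery support, that
would look roughly like this (pool name from this thread; -n is the dry run):

    zpool import -F -n q    # report what recovery would discard, without changing anything
    zpool import -F q       # actually attempt the import, rolling back a few txgs if needed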

--
Dan.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-16 Thread Fajar A. Nugraha
On Sun, Feb 14, 2010 at 12:51 PM, Tracey Bernath  wrote:
> I went from all four disks of the array at 100%, doing about 170 read
> IOPS/25MB/s
> to all four disks of the array at 0%, once hitting nearly 500 IOPS/65MB/s
> off the cache drive (@ only 50% load).


> And, keep in mind this was on less than $1000 of hardware.

really? complete box and all, or is it just the disks? Cause the 4
disks alone should cost about $400. Did you use ECC RAM?

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 06:28:05PM -0800, Richard Elling wrote:
> The problem is that MTBF measurements are only one part of the picture.
> Murphy's Law says something will go wrong, so also plan on backups.

+n

> > Imagine this scenario:
> > You lost 2 disks, and unfortunately you lost the 2 sides of a mirror.
> 
> Doing some simple math, and using the simple MTTDL[1] model, you can
> figure the probability of that happening in one year for a pair of 700k hours
> disks and a 24 hour MTTR as:
>   Pfailure =  0.86%  (trust me, I've got a spreadsheet :-)

Which is close enough to zero, but doesn't consider all the other
things that can go wrong: power surge, fire, typing destructive
commands in the wrong window, animals and small children, capricious
deities, forgetting to run backups, etc.  

These small numbers just tell you to be more worried about defending
against the other stuff.

> > You have 2 choices to pick from:
> > - loose entirely Mary, Gary's and Kelly's "documents"
> > or
> > - loose a small piece of Everyone's "documents".

Back to the OP's question, it's worth making the distinction here
between "lose" as in not-able-to-recover-because-there-are-no-backups,
and some data being out of service and inaccessible for a while, until
restored.  Perhaps this is what "loose" means? :)

If the goal is partitioning "service disruption" rather than "data loss",
then splitting things into multiple pools and even multiple servers is
a valid tactic - as well as then allowing further opportunities via
failover. That's well covered ground, and one reason it's done at pool
level is that it allows concrete reasoning about what will and won't be
affected in each scenario.  Setting preferences, such as copies or the
suggested similar alternatives, will never be able to provide the same
concrete assurance.

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Help with corrupted pool

2010-02-16 Thread Ethan
This is the current state of my pool:

et...@save:~# zpool import
  pool: q
id: 5055543090570728034
 state: UNAVAIL
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

q UNAVAIL  insufficient replicas
  raidz1  UNAVAIL  insufficient replicas
c9t4d0s8  UNAVAIL  corrupted data
c9t5d0s2  ONLINE
c9t2d0s8  UNAVAIL  corrupted data
c9t1d0s8  UNAVAIL  corrupted data
c9t0d0s8  UNAVAIL  corrupted data


back story:
I was previously running and using this pool on linux using zfs-fuse.
one day the zfs-fuse daemon behaved strangely. zpool and zfs commands gave a
message about not being able to connect to the daemon. the filesystems for
the pool q were still available and seemed to be working correctly. I
started the zfs-fuse daemon again. I'm not sure if this meant that there
were two daemons running, since the filesystem was still available but I
couldn't get any response from the zfs or zpool commands. I then decided
instead just to reboot.
after rebooting, the pool appeared to import successfully, but `zfs list`
showed no filesystems.
I rebooted again, not really having any better ideas. after that `zpool
import` just hung forever.
I decided I should get off of the fuse/linux implementation and use a more
recent version of zfs in its native environment, so I installed
opensolaris.
I had been running the pool on truecrypt encrypted volumes, so I copied them
off of the encrypted volumes onto blank volumes, and put them on
opensolaris. I got the above when I tried to import.
Now, no idea where to go from here.

It doesn't seem like my data should be just gone - there is no problem with
the physical drives. It seems unlikely that a misbehaving zfs-fuse would
completely corrupt the data of 4 out of 5 drives (or so I am hoping).

Is there any hope for my data? I have some not-very-recent backups of some
fraction of it, but if recovering this is possible that would of course be
infinitely preferable.

-Ethan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Richard Elling
On Feb 16, 2010, at 4:47 PM, Christo Kutrovsky wrote:
> Just finished reading the following excellent post:
> 
> http://queue.acm.org/detail.cfm?id=1670144
> 
> And started thinking what would be the best long term setup for a home 
> server, given limited number of disk slots (say 10).
> 
> I considered something like simply do a 2way mirror. What are the chances for 
> a very specific drive to fail in 2 way mirror? What if I do not want to take 
> that chance?

The probability of a device failing in a time interval (T), given its MTBF (or
AFR, but be careful about how the vendors publish such specs [*]), is:
1 - e^(-T/MTBF)

so if you have a consumer-grade disk with 700,000 hours rated MTBF, then over a
time period of 1 year (8760 hours) you get:
Pfailure = 1 - e^(-8760/700,000) = 1.24%
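If anyone wants to check the arithmetic, a one-liner (bc's -l option loads the math
library, whose e() function is the exponential):

    echo "1 - e(-8760/700000)" | bc -l    # prints roughly .0124, i.e. about 1.24%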

> I could always put "copies=2" (or more) to my important datasets and take 
> some risk and tolerate such a failure.

+1

> But chances are, everything that is not copies=2 will have some data on those 
> devices, and will be lost.
> 
> So I was thinking, how can I limit the damage, how to inject some kind of 
> "damage control".

The problem is that MTBF measurements are only one part of the picture.
Murphy's Law says something will go wrong, so also plan on backups.

> One of the ideas that sparkled is have a "max devices" property for each data 
> set, and limit how many mirrored devices a given data set can be spread on. I 
> mean if you don't need the performance, you can limit (minimize) the device, 
> should your capacity allow this.
> 
> Imagine this scenario:
> You lost 2 disks, and unfortunately you lost the 2 sides of a mirror.

Doing some simple math, and using the simple MTTDL[1] model, you can
figure the probability of that happening in one year for a pair of 700k hours
disks and a 24 hour MTTR as:
Pfailure =  0.86%  (trust me, I've got a spreadsheet :-)

> You have 2 choices to pick from:
> > - lose entirely Mary's, Gary's and Kelly's "documents"
> or
> > - lose a small piece of Everyone's "documents".
> 
> This could be implemented via something similar to:
> read/write property "target device spread"
> read only property of "achieved device spread" as this will be size dependent.
> 
> Opinions? 

I use mirrors. For the important stuff, like my wife's photos and articles, I set
copies=2. And I take regular backups via snapshots to multiple disks, some
of which are offsite. With an appliance, like NexentaStor, it is trivial to set up
a replication scheme between multiple ZFS sites.
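Roughly what that looks like at the command line, with made-up pool and dataset
names (a sketch of the idea, not a prescription):

    zfs set copies=2 tank/photos                   # extra in-pool redundancy for the important stuff
    zfs snapshot tank/photos@2010-02-16
    zfs send tank/photos@2010-02-16 | zfs receive backup/photos    # first full copy to another pool
    # later, send only the changes:
    # zfs send -i @2010-02-16 tank/photos@2010-02-17 | zfs receive backup/photos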

> Remember. The goal is damage control. I know 2x raidz2 offers better 
> protection for more capacity (although less performance, but that's not the 
> point).

Notes:
* http://blogs.sun.com/relling/entry/awesome_disk_afr_or_is
** http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Bob Friesenhahn

On Tue, 16 Feb 2010, Christo Kutrovsky wrote:


The goal was to do "damage control" in a disk failure scenario 
involving data loss. Back to the original question/idea.


Which would you prefer: lose a couple of datasets, or lose a 
little bit of every file in every dataset?


This ignores the fact that zfs is based on complex hierarchical data 
structures which support the user data.  When a pool breaks, it is 
usually because one of these complex data structures has failed, and 
not because user data has failed.


It seems easiest to support your requirement by simply creating 
another pool.


The vast majority of complaints to this list are about pool-wide 
problems and not lost files due to media/disk failure.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Bob Friesenhahn

On Tue, 16 Feb 2010, Christo Kutrovsky wrote:


Just finished reading the following excellent post:

http://queue.acm.org/detail.cfm?id=1670144


A nice article, even if I don't agree with all of its surmises and 
conclusions. :-)


In fact, I would reach a different conclusion.

I considered something like simply do a 2way mirror. What are the 
chances for a very specific drive to fail in 2 way mirror? What if I 
do not want to take that chance?


The probability of whole drive failure, or individual sector failure, 
has not increased over the years.  The probability of individual 
sector failure has diminished substantially over the years.  The 
probability of losing a whole mirror pair has gone down since the 
probability of individual drive failure has gone down.


I could always put "copies=2" (or more) to my important datasets and 
take some risk and tolerate such a failure.


I don't believe that "copies=2" buys much at all when using mirror 
disks (or raidz).  It assumes that there is a concurrency of 
simultaneous media failure, which is actually quite rare indeed.  The 
"copies=2" setting only buys something when there is no other 
redundancy available.


One of the ideas that sparkled is have a "max devices" property for 
each data set, and limit how many mirrored devices a given data set 
can be spread on. I mean if you don't need the performance, you can 
limit (minimize) the device, should your capacity allow this.


What you seem to be suggesting is a sort of targeted hierarchical vdev 
without extra RAID.


Remember. The goal is damage control. I know 2x raidz2 offers better 
protection for more capacity (although less performance, but that's 
not the point).


It seems that Adam Leventhal's excellent paper reaches the wrong 
conclusions because it assumes that history is a predictor for the 
future.  However, history is a rather poor predictor in this case. 
Imagine if 9" floppies had increased their density to support 20GB 
each (up from 160KB), but that did not happen, and now we don't use 
floppies at all.  We already see many cases where history was no 
longer a good predictor of the future, and (as an example) increased 
integration has brought us multi-core CPUs rather than 20GHz CPUs.


My own conclusions (supported by Adam Leventhal's excellent paper) are 
that


 - maximum device size should be constrained based on its time to
   resilver.

 - devices are growing too large and it is about time to transition to
   the next smaller physical size.

It is unreasonable to spend more than 24 hours to resilver a single 
drive.  It is unreasonable to spend more than 6 days resilvering all 
of the devices in a RAID group (the 7th day is reserved for the system 
administrator).  It is unreasonable to spend very much time at all on 
resilvering (using current rotating media) since the resilvering 
process kills performance.


When looking at the possibility of data failure it is wise to consider 
physical issues such as


 - shared power supply

 - shared chassis

 - shared physical location

 - shared OS kernel or firmware instance

all of which are very bad for data reliability since a problem with 
anything shared can lead to destruction of all copies of the data.


In New York City, all of the apartment doors seem to be fitted with 
three deadlocks, all of which lock into the same flimsy splintered 
door frame.  It is important to consider each significant system 
weakness in turn in order to achieve the least chance of loss, while 
providing the best service.


Bob

P.S. NASA is tracking large asteroids and meteors with the hope that 
they will eventually be able to deflect any which will strike our 
planet, in an effort to save your precious data.

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-16 Thread Richard Elling
On Feb 16, 2010, at 12:39 PM, Daniel Carosone wrote:
> On Mon, Feb 15, 2010 at 09:11:02PM -0600, Tracey Bernath wrote:
>> On Mon, Feb 15, 2010 at 5:51 PM, Daniel Carosone  wrote:
>>> Just be clear: mirror ZIL by all means, but don't mirror l2arc, just
>>> add more devices and let them load-balance.   This is especially true
>>> if you're sharing ssd writes with ZIL, as slices on the same devices.
>>> 
>> Well, the problem I am trying to solve is wouldn't it read 2x faster with
>> the mirror?  It seems once I can drive the single device to 10 queued
>> actions, and 100% busy, it would be more useful to have two channels to the
>> same data. Is ZFS not smart enough to understand that there are two
>> identical mirror devices in the cache to split requests to? Or, are you
>> saying that ZFS is smart enough to cache it in two places, although not
>> mirrored?
> 
> First, Bob is right, measurement trumps speculation.  Try it.
> 
> As for speculation, you're thinking only about reads.  I expect
> reading from l2arc devices will be the same as reading from any other
> zfs mirror, and largely the same in both cases above; load balanced
> across either device.  In the rare case of a bad read from unmirrored
> l2arc, data will be fetched from the pool, so mirroring l2arc doesn't
> add any resiliency benefit.
> 
> However, your cache needs to be populated and maintained as well, and
> this needs writes.  Twice as many of them for the mirror as for the
> "stripe". Half of what is written never needs to be read again. These
> writes go to the same ssd devices you're using for ZIL, on commodity
> ssd's which are not well write-optimised, they may be hurting zil
> latency by making the ssd do more writing, stealing from the total
> iops count on the channel, and (as a lesser concern) adding wear
> cycles to the device.  

The L2ARC writes are throttled to 8MB/sec, except during cold
start, where the throttle is 16MB/sec.  This should not be noticeable
on the channels.

> When you're already maxing out the IO, eliminating wasted cycles opens
> your bottleneck, even if only a little. 

+1 
 -- richard

> Once you reach steady state, I don't know how much turnover in l2arc
> contents you will have, and therefore how many extra writes we're
> talking about.  It may not be many, but they are unnecessary ones.  
> 
> Normally, we'd talk about measuring a potential benefit, and then
> choosing based on the results.  In this case, if I were you I'd
> eliminate the unnecessary writes, and measure the difference more as a
> matter of curiosity and research, since I was already set up to do so.
> 
> --
> Dan.
> 

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Christo Kutrovsky
Thanks for your feedback James, but that's not the direction I wanted 
this discussion to go.

The goal was not how to create a better solution for an enterprise.

The goal was to do "damage control" in a disk failure scenario involving data 
loss. Back to the original question/idea. 

Which would you prefer: lose a couple of datasets, or lose a little bit of 
every file in every dataset?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-16 Thread Richard Elling
On Feb 16, 2010, at 9:44 AM, Brian E. Imhoff wrote:

> Some more back story.  I initially started with Solaris 10 u8, and was 
> getting 40ish MB/s reads, and 65-70MB/s writes, which was still a far cry 
> from the performance I was getting with OpenFiler.  I decided to try 
> Opensolaris 2009.06, thinking that since it was more "state of the art & up 
> to date" then main Solaris. Perhaps there would be some performance tweaks or 
> bug fixes which might bring performance closer to what I saw with OpenFiler.  
>  But, then on an untouched clean install of OpenSolaris 2009.06, ran into 
> something...else...apparently causing this far far far worse performance.

You thought a release dated 2009.06 was further along than a release dated
2009.10? :-)   CR 6794730 was fixed in April, 2009, after the freeze for the 
2009.06
release, but before the freeze for 2009.10.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6794730

The schedule is published here, so you can see that there is a freeze now for
the 2010.03 OpenSolaris release.
http://hub.opensolaris.org/bin/view/Community+Group+on/schedule

As they say in comedy, timing is everything :-(

> But, at the end of the day, this is quite a bomb:  "A single raidz2 vdev has 
> about as many IOs per second as a single disk, which could really hurt iSCSI 
> performance."  

The context for this statement is for small, random reads.  40 MB/sec of 8KB 
reads is 5,000 IOPS, or about 50 HDDs worth of small random reads @ 100 
IOPS/disk,
or one decent SSD.

> If I have to break 24 disks up into multiple vdevs to get the expected 
> performance, that might be a deal breaker.  To keep raidz2 redundancy, I would have 
> to lose... almost half of the available storage to get reasonable IO speeds.

Are your requirements for bandwidth or IOPS?

> Now knowing about vdev IO limitations, I believe the speeds I saw with 
> Solaris 10u8 are inline with those limitations, and instead of fighting with 
> whatever issue I have with this clean install of OpenSolaris, I reverted back 
> to 10u8.  I guess I'll just have to see if the speeds that Solaris ISCSI 
> w/ZFS is capable of, is workable for what I want to do, and what the size 
> sacrifice/performace acceptability point is at.

In Solaris 10 you are stuck with the legacy iSCSI target code. In OpenSolaris, 
you
have the option of using COMSTAR which performs and scales better, as Roch
describes here:
http://blogs.sun.com/roch/entry/iscsi_unleashed
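From memory, exporting a zvol over COMSTAR is roughly the following (sketch only;
names are invented, the GUID placeholder comes from the create-lu output, and the
stmf/iscsi target services need to be enabled -- see Roch's link and the COMSTAR docs
for the real walkthrough):

    zfs create -V 100G tank/lun0
    sbdadm create-lu /dev/zvol/rdsk/tank/lun0   # register the zvol as a SCSI logical unit
    stmfadm add-view <lu-guid-from-create-lu>   # export the LU (here: to all initiators)
    itadm create-target                         # create an iSCSI target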

> Thanks for all the responses and help.  First time posting here, and this 
> looks like an excellent community.

We try hard, and welcome the challenges :-)
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread James Dickens
On Tue, Feb 16, 2010 at 6:47 PM, Christo Kutrovsky wrote:

> Just finished reading the following excellent post:
>
> http://queue.acm.org/detail.cfm?id=1670144
>
> And started thinking what would be the best long term setup for a home
> server, given limited number of disk slots (say 10).
>
> I considered something like simply do a 2way mirror. What are the chances
> for a very specific drive to fail in 2 way mirror? What if I do not want to
> take that chance?
>
> I could always put "copies=2" (or more) to my important datasets and take
> some risk and tolerate such a failure.
>
> But chances are, everything that is not copies=2 will have some data on
> those devices, and will be lost.
>
> So I was thinking, how can I limit the damage, how to inject some kind of
> "damage control".
>
> One of the ideas that sparkled is have a "max devices" property for each
> data set, and limit how many mirrored devices a given data set can be spread
> on. I mean if you don't need the performance, you can limit (minimize) the
> device, should your capacity allow this.
>
> Imagine this scenario:
> You lost 2 disks, and unfortunately you lost the 2 sides of a mirror.
>
> You have 2 choices to pick from:
> - lose entirely Mary's, Gary's and Kelly's "documents"
> or
> - lose a small piece of Everyone's "documents".
>
> This could be implemented via something similar to:
> read/write property "target device spread"
> read only property of "achieved device spread" as this will be size
> dependent.
>
> Opinions?
>

RAID is not designed to protect data; it's designed to ensure uptime. If you
can't afford to lose the data, then you should back it up daily and store
more than one copy, with at least one copy being off site. If your site
burns to the ground, your data is gone no matter how many disks you have in
the system.

After this, you should allocate a number of hot spares to the system in case
one fails. If you are truly paranoid, a 3-way mirror can be used; then you can
lose 2 disks without loss of data.

Spread disks across multiple controllers, and get disks from different
companies and different lots to lessen the likelihood of getting hit by a bad
batch taking out your pool.

Replace disks early, as soon as you see disk errors. And above all, back up all
data you can't afford to lose.

James Dickens
http://uadmin.blogspot.com



> Remember. The goal is damage control. I know 2x raidz2 offers better
> protection for more capacity (although less performance, but that's not the
> point).
> --
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-16 Thread Brandon High
On Tue, Feb 16, 2010 at 3:13 PM, Tiernan OToole  wrote:
> Cool... Thanks for the advice! But why would it be a good idea to change 
> layout on bigger disks?

On top of the reasons Bob gave, your current layout will be very
unbalanced after adding devices. You can't currently add more devices
to a raidz vdev or remove a top level vdev from a pool, so you'll be
stuck with 8 drives in a raidz2, 3 drives in a raidz, and any future
additions in additional vdevs.

When you say you have 2 pools, do you mean two vdevs in one pool, or
actually two pools?

-B

-- 
Brandon High : bh...@freaks.com
Indecision is the key to flexibility.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Proposed idea for enhancement - damage control

2010-02-16 Thread Christo Kutrovsky
Just finished reading the following excellent post:

http://queue.acm.org/detail.cfm?id=1670144

And started thinking what would be the best long term setup for a home server, 
given limited number of disk slots (say 10).

I considered something like simply do a 2way mirror. What are the chances for a 
very specific drive to fail in 2 way mirror? What if I do not want to take that 
chance?

I could always put "copies=2" (or more) to my important datasets and take some 
risk and tolerate such a failure.

But chances are, everything that is not copies=2 will have some data on those 
devices, and will be lost.

So I was thinking, how can I limit the damage, how to inject some kind of 
"damage control".

One of the ideas that sparkled is to have a "max devices" property for each data 
set, and limit how many mirrored devices a given data set can be spread on. I 
mean if you don't need the performance, you can limit (minimize) the device, 
should your capacity allow this.

Imagine this scenario:
You lost 2 disks, and unfortunately you lost the 2 sides of a mirror.

You have 2 choices to pick from:
- lose entirely Mary's, Gary's and Kelly's "documents"
or
- lose a small piece of Everyone's "documents".

This could be implemented via something similar to:
read/write property "target device spread"
read only property of "achieved device spread" as this will be size dependent.

Opinions? 

Remember. The goal is damage control. I know 2x raidz2 offers better protection 
for more capacity (although less performance, but that's not the point).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-16 Thread Bob Friesenhahn

On Tue, 16 Feb 2010, Tiernan OToole wrote:

Cool... Thanks for the advice! But why would it be a good idea to 
change layout on bigger disks?


Larger disks take longer to resilver, have a higher probability of 
encountering an error during resilvering or normal use, and are often 
slower.  This may cause one to want fewer disks per raidz-N vdev, or 
to use a higher level of raidz protection (e.g. raidz2 rather than 
raidz1, or raidz3 rather than raidz2).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Christo Kutrovsky
Robert,

That would be pretty cool, especially if it makes it into the 2010.02 release. I 
hope there are no weird special cases that pop up from this improvement.

Regarding the workaround: 

That's not my experience, unless it behaves differently on ZVOLs and datasets.

On ZVOLs it appears the setting kicks in live. I've tested this by turning it 
off/on and testing with iometer on an exported iSCSI device (iscsitgtd, not 
COMSTAR).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Christo Kutrovsky
Ok, now that you explained it, it makes sense. Thanks for replying Daniel.

Feel better now :) Suddenly, that Gigabyte i-Ram is no longer a necessity but a 
"nice to have" thing.

What would be really good to have is that per-dataset ZIL control in 
2010.02. And perhaps add another mode "sync no wait" where the sync is issued, 
but the application doesn't wait for it. Similar to Oracle's "commit nowait" vs 
"commit batch nowait" (current idea for delayed).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Robert Milkowski

On 16/02/2010 22:53, Christo Kutrovsky wrote:

Jeff, thanks for link, looking forward to per data set control.

6280630 zil synchronicity 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6280630)

It's been open for 5 years now :) Looking forward to not compromising my entire 
storage with disabled ZIL when I only need it on a few devices.

   
I quickly looked at the code and it seems to be rather simple to 
implement.

I will try to do it in the next couple of weeks if I find enough time.


btw: zil_disable is taken into account each time a zfs filesystem is 
being mounted, so as a workaround you may unmount all filesystems you 
want to disable zil for, set zil_disable to 1, mount these filesystems 
and set zil_disable back to 0. That way it will affect only the 
filesystems which were mounted while zil_disable=1. This is of course 
not a bullet-proof solution as other filesystems might be 
created/mounted during that period but it still might be a good enough 
workaround for you if you know no other filesystems are being mounted 
during that time.
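In other words, something like this (a sketch; the filesystem name is made up, and
zil_disable is poked with mdb -- per the note above it is only consulted at mount
time):

    zfs umount tank/myfs
    echo zil_disable/W0t1 | mdb -kw    # disable ZIL for anything mounted from now on
    zfs mount tank/myfs
    echo zil_disable/W0t0 | mdb -kw    # restore the default for later mounts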


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-16 Thread Tiernan OToole
Cool... Thanks for the advice! But why would it be a good idea to change layout 
on bigger disks?

-Original Message-
From: Brandon High 
Sent: 16 February 2010 18:26
To: Tiernan OToole 
Cc: Robert Milkowski ; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

On Tue, Feb 16, 2010 at 8:25 AM, Tiernan OToole  wrote:
> So, does that work with RAIDZ1 and 2 pools?

Yes. Replace all the disks in one vdev, and that vdev will become
larger. Your disk layout won't change, though - you'll still have a
raidz vdev and a raidz2 vdev. It might be a good idea to revise the
layout a bit with larger disks.

If you do change the layout, then a send/receive is the easiest way to
move your data. It can be used to copy everything on the pool
(including snapshots, etc) to your new system.
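A minimal sketch of that, with invented pool and snapshot names:

    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -F -d newtank   # replicates datasets, snapshots and properties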

-B

-- 
Brandon High : bh...@freaks.com
Suspicion Breeds Confidence

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 02:53:18PM -0800, Christo Kutrovsky wrote:
> looking to answer myself the following question: 
> Do I need to rollback all my NTFS volumes on iSCSI to the last available 
> snapshot every time there's a power failure involving the ZFS storage server 
> with a disabled ZIL.

No, but not for the reasons you think.  If the issue you're concerned
about applies, it applies whether the txg is tagged with a snapshot
name or not, whether it is the most recent or not. 

I don't think the issue applies; write reordering might happen within
a txg, because it has the freedom to do so within the whole-txg commit
boundary.  Out-of-order writes to the disk won't be valid until the
txg commits, which is what makes them reachable. If other boundaries also
apply (sync commitments via iscsi commands) they will be respected, at least
at that granularity. 

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Christo Kutrovsky
Jeff, thanks for link, looking forward to per data set control.

6280630 zil synchronicity 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6280630)

It's been open for 5 years now :) Looking forward to not compromising my entire 
storage with a disabled ZIL when I only need it disabled on a few devices.

I would like to get back on the NTFS corruption on ZFS iSCSI device during 
power loss.

Think of a home-server scenario. When power goes down, everything goes down, so 
having to restart the client for cache consistency is no problem.

The question is: can written data cause corruption due to write coalescing, 
out-of-order writing, etc.?

I'm looking to answer the following question for myself: 
Do I need to roll back all my NTFS volumes on iSCSI to the last available 
snapshot every time there's a power failure involving the ZFS storage server 
with a disabled ZIL?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Jeff Bonwick
> People used fastfs for years in specific environments (hopefully 
> understanding the risks), and disabling the ZIL is safer than fastfs. 
> Seems like it would be a useful ZFS dataset parameter.

We agree.  There's an open RFE for this:

6280630 zil synchronicity

No promise on date, but it will bubble to the top eventually.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/receive : panic and reboot

2010-02-16 Thread Lori Alt

Hi Bruno,

I've tried to reproduce this panic you are seeing.  However, I had 
difficulty following your procedure.  See below:




On 02/08/10 15:37, Bruno Damour wrote:

On 02/ 8/10 06:38 PM, Lori Alt wrote:


Can you please send a complete list of the actions taken:  The 
commands you used to create the send stream, the commands used to 
receive the stream.  Also the output of `zfs list -t all` on both the 
sending and receiving sides.  If you were able to collect a core dump 
(it should be in /var/crash/), it would be good to upload it.


The panic you're seeing is in the code that is specific to receiving 
a dedup'ed stream.  It's possible that you could do the migration if 
you turned off dedup (i.e. didn't specify -D) when creating the send 
stream.  However, then we wouldn't be able to diagnose and fix what 
appears to be a bug.


The best way to get us the crash dump is to upload it here:

https://supportfiles.sun.com/upload

We need either both vmcore.X and unix.X OR you can just send us 
vmdump.X.


Sometimes big uploads have mixed results, so if there is a problem, 
some helpful hints are on 
http://wikis.sun.com/display/supportfiles/Sun+Support+Files+-+Help+and+Users+Guide, 
specifically in section 7.

It's best to include your name or your initials or something in the 
name of the file you upload.  As you might imagine, we get a lot of files 
uploaded named vmcore.1

You might also create a defect report at 
http://defect.opensolaris.org/bz/


Lori


On 02/08/10 09:41, Bruno Damour wrote:



I kept on trying to migrate my pool with children (see previous 
threads) and had the (bad) idea to try the -d option on the receive 
part.

The system reboots immediately.

Here is the log in /var/adm/messages

Feb 8 16:07:09 amber unix: [ID 836849 kern.notice]
Feb 8 16:07:09 amber ^Mpanic[cpu1]/thread=ff014ba86e40:
Feb 8 16:07:09 amber genunix: [ID 169834 kern.notice] avl_find() 
succeeded inside avl_add()

Feb 8 16:07:09 amber unix: [ID 10 kern.notice]
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4660 genunix:avl_add+59 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c46c0 zfs:find_ds_by_guid+b9 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c46f0 zfs:findfunc+23 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c47d0 zfs:dmu_objset_find_spa+38c ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4810 zfs:dmu_objset_find+40 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4a70 zfs:dmu_recv_stream+448 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4c40 zfs:zfs_ioc_recv+41d ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4cc0 zfs:zfsdev_ioctl+175 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4d00 genunix:cdev_ioctl+45 ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4d40 specfs:spec_ioctl+5a ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4dc0 genunix:fop_ioctl+7b ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4ec0 genunix:ioctl+18e ()
Feb 8 16:07:09 amber genunix: [ID 655072 kern.notice] 
ff00053c4f10 unix:brand_sys_syscall32+1ca ()

Feb 8 16:07:09 amber unix: [ID 10 kern.notice]
Feb 8 16:07:09 amber genunix: [ID 672855 kern.notice] syncing file 
systems...

Feb 8 16:07:09 amber genunix: [ID 904073 kern.notice] done
Feb 8 16:07:10 amber genunix: [ID 111219 kern.notice] dumping to 
/dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Feb 8 16:07:10 amber ahci: [ID 405573 kern.info] NOTICE: ahci0: 
ahci_tran_reset_dport port 3 reset port

Feb 8 16:07:35 amber genunix: [ID 10 kern.notice]
Feb 8 16:07:35 amber genunix: [ID 665016 kern.notice] ^M100% done: 
107693 pages dumped,

Feb 8 16:07:35 amber genunix: [ID 851671 kern.notice] dump succeeded
  



Hello,
I'll try to do my best.

Here are the commands :

amber ~ # zfs unmount data
amber ~ # zfs snapshot -r d...@prededup
amber ~ # zpool destroy ezdata
amber ~ # zpool create ezdata c6t1d0
amber ~ # zfs set dedup=on ezdata
amber ~ # zfs set compress=on ezdata
amber ~ # zfs send -RD d...@prededup |zfs receive ezdata/data
cannot receive new filesystem stream: destination 'ezdata/data' exists
must specify -F to overwrite it
amber ~ # zpool destroy ezdata
amber ~ # zpool create ezdata c6t1d0
amber ~ # zfs set compression=on ezdata
amber ~ # zfs set dedup=on ezdata
amber ~ # zfs send -RD d...@prededup |zfs receive -F ezdata/data
cannot receive new filesystem stream: destination has snapshots
(eg. ezdata/d...@prededup)
must destroy them to overwrite it




This send piped to recv didn't even get started because of the above error.

Are you saying that the command ran for several hours and THEN produced 
that message?


I created a hierarchy of datasets and snapshots to match yours (as shown 
below), though with only a sm

Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Andrew Gabriel

Darren J Moffat wrote:
You have done a risk analysis and if you are happy that your NTFS 
filesystems could be corrupt on those ZFS ZVOLs if you lose data then 
you could consider turning off the ZIL.  Note though that it isn't

just those ZVOLs you are serving to Windows that lose access to a ZIL
but *ALL* datasets on *ALL* pools and that includes your root pool.

For what it's worth I personally run with the ZIL disabled on my home 
NAS system which is serving over NFS and CIFS to various clients, but 
I wouldn't recommend it to anyone.  The reason I say never to turn off 
the ZIL is because in most environments outside of home usage it just 
isn't worth the risk to do so (not even for a small business).


People used fastfs for years in specific environments (hopefully 
understanding the risks), and disabling the ZIL is safer than fastfs. 
Seems like it would be a useful ZFS dataset parameter.


--
Andrew

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Mount Errors

2010-02-16 Thread Daniel Carosone
On Tue, Feb 16, 2010 at 06:20:05PM +0100, Juergen Nickelsen wrote:
> Tony MacDoodle  writes:
> 
> > Mounting ZFS filesystems: (1/6)cannot mount '/data/apache': directory is not
> > empty
> > (6/6)
> > svc:/system/filesystem/local:default: WARNING: /usr/sbin/zfs mount -a
> > failed: exit status 1
> >
> > And yes, there is data in the /data/apache file system...
> 
> I think it is complaining about entries in the *mountpoint
> directory*. See this:
> 
> # mkdir /gaga
> # zfs create -o mountpoint=/gaga rpool/gaga
> # zfs umount rpool/gaga
> # touch /gaga/boo
> # zfs mount rpool/gaga
> cannot mount '/gaga': directory is not empty
> # rm /gaga/boo
> # zfs mount rpool/gaga
> # 

Another way to get here is with two datasets with the same mountpoint
property, as can happen when doing send|recv to a backup pool.  Since
you have only 6, that seems less likely here :)  
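
If you want to rule that case out quickly, one way (pool names are 
placeholders) is to list every mountpoint on both pools and look for 
duplicates:

# zfs get -r -o name,value mountpoint tank backup | sort -k 2

Two datasets sharing a mountpoint end up adjacent in the sorted output.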

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-16 Thread Daniel Carosone
On Mon, Feb 15, 2010 at 09:11:02PM -0600, Tracey Bernath wrote:
> On Mon, Feb 15, 2010 at 5:51 PM, Daniel Carosone  wrote:
> > Just be clear: mirror ZIL by all means, but don't mirror l2arc, just
> > add more devices and let them load-balance.   This is especially true
> > if you're sharing ssd writes with ZIL, as slices on the same devices.
> >
> Well, the problem I am trying to solve is wouldn't it read 2x faster with
> the mirror?  It seems once I can drive the single device to 10 queued
> actions, and 100% busy, it would be more useful to have two channels to the
> same data. Is ZFS not smart enough to understand that there are two
> identical mirror devices in the cache to split requests to? Or, are you
> saying that ZFS is smart enough to cache it in two places, although not
> mirrored?

First, Bob is right, measurement trumps speculation.  Try it.

As for speculation, you're thinking only about reads.  I expect
reading from l2arc devices will be the same as reading from any other
zfs mirror, and largely the same in both cases above; load balanced
across either device.  In the rare case of a bad read from unmirrored
l2arc, data will be fetched from the pool, so mirroring l2arc doesn't
add any resiliency benefit.

However, your cache needs to be populated and maintained as well, and
this needs writes: twice as many of them for the mirror as for the
"stripe", and half of what is written never needs to be read again. These
writes go to the same ssd devices you're using for ZIL; on commodity
ssds, which are not well write-optimised, they may be hurting zil
latency by making the ssd do more writing, stealing from the total
iops count on the channel, and (as a lesser concern) adding wear
cycles to the device.  

When you're already maxing out the IO, eliminating wasted cycles opens
your bottleneck, even if only a little. 

Once you reach steady state, I don't know how much turnover in l2arc
contents you will have, and therefore how many extra writes we're
talking about.  It may not be many, but they are unnecessary ones.  

Normally, we'd talk about measuring a potential benefit, and then
choosing based on the results.  In this case, if I were you I'd
eliminate the unnecessary writes, and measure the difference more as a
matter of curiosity and research, since I was already set up to do so.
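
For the measurement side, the standard tools are enough; something like the 
following (pool name is a placeholder) shows per-device traffic for the log 
and cache devices as well as the pool disks, so the two layouts can be 
compared directly:

# zpool iostat -v tank 5      # per-vdev read/write ops and bandwidth, 5s samples
# iostat -xn 5                # per-device service times and %busy for the ssds

Watching the write column of the cache devices under load gives a feel for 
how much extra write traffic the mirrored-l2arc layout really generates.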

--
Dan.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Moshe Vainer
Eric, is this answer by George wrong?

http://opensolaris.org/jive/message.jspa?messageID=439187#439187

Are we to expect the fix soon or is there still no schedule?

Thanks, 
Moshe
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs snapshot of zone fails with permission denied (EPERM [sys_mount])

2010-02-16 Thread Pavel Heimlich
Hi,
when I delegate zfs permissions to a user, the user can create a snapshot of a zfs 
filesystem, but cannot snapshot a zone contained in that filesystem.
The output is:
$  /usr/sbin/zfs snapshot tank/zones/dashboardbuild/ROOT/z...@1install
cannot create snapshot 'tank/zones/dashboardbuild/ROOT/z...@1install': 
permission denied

The root user can create the snapshot just fine.
This is with OSOL b132/amd64

Am I doing something wrong?

TIA

full session log follows:
# cat /tank/zones/dashboardbuild.cfg
create -b
set zonepath=/tank/zones/dashboardbuild
set autoboot=true
add net
set address=10.10.2.43
set physical=e1000g0
end
add fs
set dir=/home
set special=/export/home
set type=lofs
end


# zfs create tank/zones/dashboardbuild
# chmod 700 /tank/zones/dashboardbuild
# zonecfg -z dashboardbuild -f /tank/zones/dashboardbuild.cfg
# zoneadm -z dashboardbuild install
   Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ ).
   Publisher: Using contrib.opensolaris.org 
(http://pkg.opensolaris.org/contrib/).
   Image: Preparing at /tank/zones/dashboardbuild/root.
   Cache: Using /var/pkg/download.
Sanity Check: Looking for 'entire' incorporation.
  Installing: Core System (output follows)




DOWNLOAD                                PKGS       FILES    XFER (MB)
Completed                              43/43 12186/12186    84.7/84.7

PHASE                                        ACTIONS
Install Phase                            17622/17622
No updates necessary for this image.
  Installing: Additional Packages (output follows)


DOWNLOAD                                PKGS       FILES    XFER (MB)
Completed                              37/37   3345/3345    21.8/21.8

PHASE                                        ACTIONS
Install Phase                              4519/4519

Note: Man pages can be obtained by installing SUNWman
 Postinstall: Copying SMF seed repository ... done.
 Postinstall: Applying workarounds.
Done: Installation completed in 543.818 seconds.

  Next Steps: Boot the zone, then log into the zone console (zlogin -C)
  to complete the configuration process.

# zfs list |grep dashboard
tank/zones/dashboardbuild            513M   397G    21K  /tank/zones/dashboardbuild
tank/zones/dashboardbuild/ROOT       513M   397G    19K  legacy
tank/zones/dashboardbuild/ROOT/zbe   513M   397G   513M  legacy

# zfs allow hajma snapshot,rollback,mount tank/zones/dashboardbuild
# zfs allow hajma snapshot,rollback,mount tank/zones/dashboardbuild/ROOT
# zfs allow hajma snapshot,rollback,mount tank/zones/dashboardbuild/ROOT/zbe

# zfs allow  tank/zones/dashboardbuild/ROOT/zbe
---- Permissions on tank/zones/dashboardbuild/ROOT/zbe ----
Local+Descendent permissions:
        user hajma mount,rollback,snapshot
---- Permissions on tank/zones/dashboardbuild/ROOT ----
Local+Descendent permissions:
        user hajma mount,rollback,snapshot
---- Permissions on tank/zones/dashboardbuild ----
Local+Descendent permissions:
        user hajma mount,rollback,snapshot
#
-bash-4.0$  pfexec /usr/sbin/zfs snapshot 
tank/zones/dashboardbuild/ROOT/z...@1install
cannot create snapshot 'tank/zones/dashboardbuild/ROOT/z...@1install': 
permission denied
-bash-4.0$  pfexec /usr/sbin/zfs snapshot tank/zones/dashboardbu...@test
-bash-4.0$


this is what I see when I run the command in truss:

2116:   ioctl(3, ZFS_IOC_OBJSET_STATS, 0x08044930)  = 0
2116:   brk(0x080D4000) = 0
2116:   ioctl(3, ZFS_IOC_POOL_STATS, 0x08043300)= 0
2116:   brk(0x080E4000) = 0
2116:   ioctl(3, ZFS_IOC_SNAPSHOT, 0x080462C0)  Err#1 EPERM [sys_mount]
2116:   fstat64(2, 0x08045260)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool import hang - possibly dedup related?

2010-02-16 Thread Chris Murray
I'm trying to import a pool into b132 which once had dedup enabled, after the 
machine was shut down with an "init 5".

However, the import hangs the whole machine and I eventually get kicked off my 
SSH sessions. As it's a VM, I can see that processor usage jumps up to near 
100% very quickly, and stays there. Longest I've left it is 12 hours. Before I 
shut down the VM, there was only around 5GB of data in that zpool. There 
doesn't appear to be any disk activity while it's in this stuck state.

Are there any troubleshooting tips on where I can start to look for answers?
The virtual machine is running on ESXi 4, with two virtual CPUs and 3GB RAM.

Thanks in advance,
Chris
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-16 Thread Brandon High
On Tue, Feb 16, 2010 at 8:25 AM, Tiernan OToole  wrote:
> So, does that work with RAIDZ1 and 2 pools?

Yes. Replace all the disks in one vdev, and that vdev will become
larger. Your disk layout won't change though - you'll still have a
raidz vdev and a raidz2 vdev. It might be a good idea to revise the
layout a bit with larger disks.

If you do change the layout, then a send/receive is the easiest way to
move your data. It can be used to copy everything on the pool
(including snapshots, etc) to your new system.
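
A minimal sketch of that kind of migration (pool and snapshot names are 
placeholders; put ssh in the pipeline if the new pool lives on another host):

# zfs snapshot -r tank@migrate                         # recursive snapshot of the old pool
# zfs send -R tank@migrate | zfs receive -Fd newpool   # replicate datasets, snapshots, properties

The -R on the sending side is what carries the whole dataset tree and its 
snapshots; a later incremental pass with zfs send -R -I can pick up anything 
written after the first snapshot.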

-B

-- 
Brandon High : bh...@freaks.com
Suspicion Breeds Confidence
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-16 Thread Brian E. Imhoff
Some more back story.  I initially started with Solaris 10 u8 and was getting 
40ish MB/s reads and 65-70 MB/s writes, which was still a far cry from the 
performance I was getting with OpenFiler.  I decided to try OpenSolaris 
2009.06, thinking that since it was more "state of the art & up to date" than 
mainline Solaris, perhaps there would be some performance tweaks or bug fixes 
that might bring performance closer to what I saw with OpenFiler.  But then, 
on an untouched clean install of OpenSolaris 2009.06, I ran into something 
else apparently causing this far, far worse performance.

But, at the end of the day, this is quite a bomb:  "A single raidz2 vdev has 
about as many IOs per second as a single disk, which could really hurt iSCSI 
performance."  

If I have to break 24 disks up into multiple vdevs to get the expected 
performance, that might be a deal breaker.  To keep raidz2 redundancy, I would 
have to lose almost half of the available storage to get reasonable IO speeds.
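
For a sense of scale, a common way to split 24 disks while keeping raidz2 
redundancy is four 6-disk raidz2 vdevs (device names below are placeholders):

# zpool create tank \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
    raidz2 c0t6d0 c0t7d0 c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
    raidz2 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 \
    raidz2 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

That gives roughly four times the random IOPS of a single 24-disk raidz2, at 
the cost of 8 parity disks out of 24 - about a third of the raw capacity 
rather than half.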

Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 
10u8 are in line with those limitations, and instead of fighting with whatever 
issue I have with this clean install of OpenSolaris, I reverted back to 10u8.  
I guess I'll just have to see whether the speeds that Solaris iSCSI with ZFS is 
capable of are workable for what I want to do, and where the capacity 
sacrifice / performance trade-off becomes acceptable.

Thanks for all the responses and help.  First time posting here, and this looks 
like an excellent community.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Speed question: 8-disk RAIDZ2 vs 10-disk RAIDZ3

2010-02-16 Thread Bob Friesenhahn

On Tue, 16 Feb 2010, Dave Pooser wrote:


If I go to 10x 2TB in a RAIDZ3, will the extra spindles increase
speed, or will the extra parity writes reduce speed, or will the two factors
offset and leave things a wash?


I should mention that the usage of this system is as storage for large
(5-300GB) video files, so what's most important is sequential write speed.


With 10 disks, I would go for two raidz2 vdevs of 5 disks each.  This 
is not as space efficient as your one raidz3 vdev with 10 disks but it 
is also likely to be a bit more responsive, and withstand a 
(temporary) slowdown due to a single slow disk a bit better.  With a 
single raidz3 vdev, write performance will go into the toilet if even 
one disk becomes a bit balky.  Resilver times when a disk is replaced 
are also likely to be longer.


Disks are far more likely to fail than the controller they are 
attached to, and disk failures are often slow to occur, or not obvious 
as failures without close scrutiny with tools like 'iostat -x'.
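
A quick way to spot such a disk (the interval in seconds is arbitrary) is to 
watch per-device service times and busy percentages and look for one member of 
the vdev that is consistently worse than its peers:

# iostat -xn 5       # compare asvc_t and %b across the disks in the vdev

One drive with a service time several times higher than the others is the 
usual signature of a balky disk dragging the whole raidz vdev down.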


There have been many reports here about raidz-based pools which became 
very slow, with the eventual finding that the slowness was due to just 
one balky disk.  I think that this is what you need to prepare for, 
particularly with hardware going out on a truck to the field.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Mount Errors

2010-02-16 Thread Juergen Nickelsen
Tony MacDoodle  writes:

> Mounting ZFS filesystems: (1/6)cannot mount '/data/apache': directory is not
> empty
> (6/6)
> svc:/system/filesystem/local:default: WARNING: /usr/sbin/zfs mount -a
> failed: exit status 1
>
> And yes, there is data in the /data/apache file system...

I think it is complaining about entries in the *mountpoint
directory*. See this:

# mkdir /gaga
# zfs create -o mountpoint=/gaga rpool/gaga
# zfs umount rpool/gaga
# touch /gaga/boo
# zfs mount rpool/gaga
cannot mount '/gaga': directory is not empty
# rm /gaga/boo
# zfs mount rpool/gaga
# 

Regards, Juergen.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-16 Thread Tiernan OToole
So, does that work with RAIDZ1 and 2 pools?

On Tue, Feb 16, 2010 at 1:47 PM, Robert Milkowski  wrote:

>
>
> On Mon, 15 Feb 2010, Tiernan OToole wrote:
>
>  Good morning all.
>>
>> I am in the process of building my V1 SAN for media storage in house, and
>> i
>> am already thinkg ov the V2 build...
>>
>> Currently, there are 8 250Gb hdds and 3 500Gb disks. the 8 250s are in a
>> RAIDZ2 array, and the 3 500s will be in RAIDZ1...
>>
>> At the moment, the current case is quite full. i am looking at a 20 drive
>> hotswap case, which i plan to order soon. when the time comes, and i start
>> upgrading the drives to larger drives, say 1Tb drives, would it be easy to
>> migrate the contents of the RAIDZ2 array to the new Array? I see mentions
>> of
>> ZFS Send and ZFS recieve, but i have no idea if they would do the job...
>>
>>
>
> if you can expose both disk arrays to the host then you can replace (zpool
> replace) a disk (vdev) one-by-one. Once you replaced all disks with larger
> ones zfs will automatically enlarge your pool.
>
> --
> Robert Milkowski
> http://milek.blogspot.com
>
>


-- 
Tiernan O'Toole
blog.lotas-smartman.net
www.tiernanotoolephotography.com
www.the-hairy-one.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-16 Thread Bob Friesenhahn

On Mon, 15 Feb 2010, Tracey Bernath wrote:


If the device itself was full, and items were falling off the L2ARC, then I 
could see having two
separate cache devices, but since I am only at about 50% utilization of the 
available capacity, and
maxing out the IO, then mirroring seemed smarter.

Am I missing something here?


I doubt it.  The only way to know for sure is to test it but it seems 
unlikely to me that zfs implementors would fail to load share the 
reads from mirrored L2ARC.  Richard's points about L2ARC bandwidth vs 
pool disk bandwidth are still good ones.  L2ARC is all about read 
latency, but L2ARC does not necessarily help with read bandwidth.  It 
is also useful to keep in mind that L2ARC offers at least 40x less 
bandwidth than ARC in RAM.  So always populate RAM first if you can 
afford it.
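
One way to sanity-check where reads are being served from is the ARC kstats; 
for example (these are the standard arcstats counters):

# kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max          # current ARC size vs its ceiling
# kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses         # ARC hit/miss counters
# kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses   # same for the L2ARC

If the RAM-backed ARC is nowhere near c_max, adding more L2ARC is unlikely to 
be the first thing that helps.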


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-16 Thread Richard Elling
On Feb 15, 2010, at 11:34 PM, Ragnar Sundblad wrote:
> 
> On 15 feb 2010, at 23.33, Bob Beverage wrote:
> 
>>> On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff
>>>  wrote:
>>> I've seen exactly the same thing. Basically, terrible
>>> transfer rates
>>> with Windows
>>> and the server sitting there completely idle.
>> 
>> I am also seeing this behaviour.  It started somewhere around snv111 but I 
>> am not sure exactly when.  I used to get 30-40MB/s transfers over cifs but 
>> at some point that dropped to roughly 7.5MB/s.
> 
> Wasn't zvol changed a while ago from asynchronous to
> synchronous? Could that be it?

Yes.

> I don't understand that change at all - of course a zvol with or
> without iscsi to access it should behave exactly as a (not broken)
> disk, strictly obeying the protocol for write cache, cache flush, etc.
> Having it entirely synchronous is in many cases almost as useless
> as having it asynchronous.

There are two changes at work here, and OpenSolaris 2009.06 is
in the middle of them -- and therefore is at the least optimal spot.
You have the choice of moving to a later build, after b113, which
has the proper fix.

> Just as zfs itself demands this from its disks, as it does, I believe it
> should provide this itself when used as storage for others. To me it seems
> that the zvol+iscsi functionality is not ready for production and needs more
> work. If anyone has any better explanation, please share it with me!

The fix is in Solaris 10 10/09 and the OpenStorage software.  For some
reason, this fix is not available in the OpenSolaris supported bug fixes.
Perhaps someone from Oracle can shed light on that (non)decision?
So until next month, you will need to use an OpenSolaris dev release
after b113.

> I guess a good slog could help a bit, especially if you have a bursty
> write load.

Yes.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Mount Errors

2010-02-16 Thread Tony MacDoodle
Why would I get the following error:

Reading ZFS config: done.
Mounting ZFS filesystems: (1/6)cannot mount '/data/apache': directory is not
empty
(6/6)
svc:/system/filesystem/local:default: WARNING: /usr/sbin/zfs mount -a
failed: exit status 1

And yes, there is data in the /data/apache file system...

This was created during the jumpstart process.

Thanks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Speed question: 8-disk RAIDZ2 vs 10-disk RAIDZ3

2010-02-16 Thread Dave Pooser
> If I go to 10x 2TB in a RAIDZ3, will the extra spindles increase
> speed, or will the extra parity writes reduce speed, or will the two factors
> offset and leave things a wash?

I should mention that the usage of this system is as storage for large
(5-300GB) video files, so what's most important is sequential write speed.
-- 
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Speed question: 8-disk RAIDZ2 vs 10-disk RAIDZ3

2010-02-16 Thread Dave Pooser
I am currently getting good speeds out of my existing system (8x 2TB in a
RAIDZ2 exported over fibre channel), but there's no such thing as too much
speed, and these other two drive bays are just begging for drives in
them. If I go to 10x 2TB in a RAIDZ3, will the extra spindles increase
speed, or will the extra parity writes reduce speed, or will the two factors
offset and leave things a wash?
(My goal is to be able to survive one controller failure, so if I add
more drives I'll have to add redundancy to compensate for the fact that one
controller would then be able to take out three drives.)
I've considered adding a drive for the ZIL instead, but my experiments
in disabling the ZIL (following the evil tuning guide) didn't show any speed
increase. (I know it's a bad idea to run the system with the ZIL disabled; I
disabled it only to measure its impact on my write speeds and re-enabled it
after testing was complete.)

Current system:
OpenSolaris dev release b132
Intel S5500BC mainboard (latest firmware)
Intel E5506 Xeon 2.13GHz
8GB RAM
3x LSI 3018 PCIe SATA controllers (latest IT firmware)
8x 2TB Hitachi 7200RPM SATA drives (2 connected to each LSI and 2 to
motherboard SATA ports)
2x 60GB Imation M-class SSD (boot mirror)
Qlogic 2440 PCIe Fibre Channel HBA
-- 
Dave Pooser, ACSA
Manager of Information Services
Alford Media  http://www.alfordmedia.com


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-16 Thread Robert Milkowski



On Mon, 15 Feb 2010, Tiernan OToole wrote:


Good morning all.

I am in the process of building my V1 SAN for media storage in house, and I
am already thinking of the V2 build...

Currently, there are 8 250Gb hdds and 3 500Gb disks. the 8 250s are in a
RAIDZ2 array, and the 3 500s will be in RAIDZ1...

At the moment, the current case is quite full. I am looking at a 20-drive
hotswap case, which I plan to order soon. When the time comes and I start
upgrading the drives to larger drives, say 1TB drives, would it be easy to
migrate the contents of the RAIDZ2 array to the new array? I see mentions of
ZFS send and ZFS receive, but I have no idea if they would do the job...




if you can expose both disk arrays to the host then you can replace (zpool 
replace) each disk (vdev) one by one. Once you have replaced all disks with 
larger ones, zfs will automatically enlarge your pool.
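
A sketch of that replace cycle (device names are placeholders; repeat for 
every disk in the vdev, waiting for each resilver to finish before starting 
the next):

# zpool replace tank c1t0d0 c2t0d0     # old disk, new larger disk
# zpool status tank                    # watch until the resilver completes

On builds that have the autoexpand pool property, setting it beforehand 
(zpool set autoexpand=on tank) lets the pool grow as soon as the last disk in 
the vdev has been replaced; on older builds the extra space typically appears 
once all members are larger, possibly after an export/import.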


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD and ZFS

2010-02-16 Thread Tracey Bernath
On Mon, Feb 15, 2010 at 5:51 PM, Daniel Carosone  wrote:

> On Sun, Feb 14, 2010 at 11:08:52PM -0600, Tracey Bernath wrote:
> > Now, to add the second SSD ZIL/L2ARC for a mirror.
>
> Just be clear: mirror ZIL by all means, but don't mirror l2arc, just
> add more devices and let them load-balance.   This is especially true
> if you're sharing ssd writes with ZIL, as slices on the same devices.
>
Well, the problem I am trying to solve is wouldn't it read 2x faster with
the mirror?  It seems once I can drive the single device to 10 queued
actions, and 100% busy, it would be more useful to have two channels to the
same data. Is ZFS not smart enough to understand that there are two
identical mirror devices in the cache to split requests to? Or, are you
saying that ZFS is smart enough to cache it in two places, although not
mirrored?

If the device itself was full, and items were falling off the L2ARC, then I
could see having two separate cache devices, but since I am only at about
50% utilization of the available capacity, and maxing out the IO, then
mirroring seemed smarter.

Am I missing something here?

Tracey



> > I may even splurge for one more to get a three way mirror.
>
> With more devices, questions about selecting different devices
> appropriate for each purpose come into play.
>
> > Now I need a bigger server
>
> See? :)
>
> --
> Dan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Disk Issues

2010-02-16 Thread Nigel Smith
I have booted up an osol-dev-131 live CD on a Dell Precision T7500,
and the AHCI driver successfully loaded, to give access
to the two sata DVD drives in the machine.

(Unfortunately, I did not have the opportunity to attach
any hard drives, but I would expect that also to work.)

'scanpci' identified the southbridge as an
Intel 82801JI (ICH10 family)
Vendor 0x8086, device 0x3a22

AFAIK, as long as the SATA interface reports a PCI ID
class-code of 010601, the AHCI device driver 
should load.

The mode of the SATA interface will need to be selected in the BIOS.
There are normally three modes: Native IDE, RAID or AHCI.

'scanpci' should report different class-codes depending
on the mode selected in the BIOS.

RAID mode should report a class-code of 010400
IDE mode should report a class-code of 0101xx

With OpenSolaris, you can see the class-code in the
output from 'prtconf -pv'.
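
For example, something like the following should show it (the value below is 
just the AHCI class-code, used as an illustration):

# prtconf -pv | grep -i class-code
    class-code:  00010601

The 06 subclass and 01 programming interface at the end are what mark the 
controller as SATA in AHCI mode.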

If Native IDE is selected the ICH10 SATA interface should
appear as two controllers, the first for ports 0-3,
and the second for ports 4 & 5.

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs questions wrt unused blocks

2010-02-16 Thread heinz zerbes

Richard,

thanks for the heads-up. I found some material here that sheds a bit 
more light on it:



http://en.wikipedia.org/wiki/ZFS
http://all-unix.blogspot.com/2007/04/transaction-file-system-and-cow.html

Regards,
heinz


Richard Elling wrote:

On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:
  

Gents,

We want to understand the mechanism of zfs a bit better.

Q: what is the design/algorithm of zfs in terms of reclaiming unused blocks?
Q: what criteria is there for zfs to start reclaiming blocks



The answer to these questions is too big for an email. Think of
ZFS as a very dynamic system with many different factors influencing
block allocation.

  

The issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk 
on an NFS server, with a zpool inside that vdisk.
This vdisk tends to grow in size even if the user writes and then deletes a file.
The question is whether this reclaiming of unused blocks can kick in 
earlier, so that the filesystem doesn't grow much more than what is actually 
allocated.



ZFS is a COW file system, which partly explains what you are seeing.
Snapshots, deduplication, and the ZIL complicate the picture.
 -- richard


ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)



  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss