Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-17 Thread Matt
Just out of curiosity - what Supermicro chassis did you get?  I've got the 
following items shipping to me right now, with SSD drives and 2TB main drives 
coming as soon as the system boots and performs normally (using 8 extra 500GB 
Barracuda ES.2 drives as test drives).


http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4
http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-17 Thread Matt
No SSD Log device yet.  I also tried disabling the ZIL, with no effect on 
performance.

Also - what's the best way to test local performance?  I'm _somewhat_ dumb as 
far as opensolaris goes, so if you could provide me with an exact command line 
for testing my current setup (exactly as it appears above) I'd love to report 
the local I/O readings.
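The best I've come up with on my own is a big sequential dd into the pool's 
mountpoint - something like the lines below, assuming the pool ends up mounted 
at /data and using a file large enough that caching doesn't hide the result - 
but I don't know whether that counts as a meaningful test:

  dd if=/dev/zero of=/data/ddtest bs=1M count=4096
  zpool iostat -v data 10      # watched from a second terminal while dd runs
  rm /data/ddtest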


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-17 Thread Brent Jones
On Wed, Feb 17, 2010 at 10:42 PM, Matt  wrote:


>
> I've got a very similar rig to the OP's showing up next week (plus an 
> InfiniBand card). I'd love to get this performing at GB Ethernet speeds; 
> otherwise I may have to abandon the iSCSI project if I can't get it to 
> perform.


Do you have an SSD log device? If not, try disabling the ZIL
temporarily to see if that helps. Your workload will likely benefit
from a log device.
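If you want to test without an slog, the ZIL switch on builds of this vintage
is a kernel tunable rather than a dataset property - roughly the following
(a sketch, for testing only; the tunable is only consulted when a dataset or
zvol is opened, so a remount, re-share or reboot may be needed):

  # live, via mdb:
  echo zil_disable/W0t1 | mdb -kw
  # or persistently, in /etc/system:
  set zfs:zil_disable = 1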

-- 
Brent Jones
br...@servuhome.net


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-17 Thread Matt
Just wanted to add that I'm in the exact same boat - I'm connecting from a 
Windows system and getting just horrid iSCSI transfer speeds.

I've tried updating to COMSTAR (although I'm not certain that I'm actually 
using it) to no avail, and I tried updating to the latest DEV version of 
OpenSolaris.  All that resulted from updating to the latest DEV version was a 
completely broken system that I couldn't even access the command line on.  
Fortunately I was able to roll back to the previous version and keep tinkering.

Anyone have any ideas as to what could really be causing this slowdown?

I've got five 500GB Seagate Barracuda ES.2 drives that I'm using for my zpools, 
and I've done the following:

1 - zpool create data mirror c0t0d0 c0t1d0
2 - zfs create -s -V 600g data/iscsitarget
3 - sbdadm create-lu /dev/zvol/rdsk/data/iscsitarget
4 - stmfadm add-view xx

So I've got a 500GB RAID1 zpool, and I've created a 600GB sparse volume on top 
of it, shared it via iSCSI, and connected to it.  Everything works great up 
until I copy files to it; then everything turns sluggish.
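For reference, the rest of the COMSTAR side (beyond the sbdadm/stmfadm steps
above) looks roughly like this - service and command names from memory, so
treat it as a sketch and double-check them on your build:

  svcs stmf iscsi/target                     # is COMSTAR actually in use?
  svcadm enable stmf
  svcadm enable -r svc:/network/iscsi/target:default
  itadm create-target                        # creates the iqn the Windows initiator connects to
  stmfadm list-lu -v                         # shows the GUID used in 'stmfadm add-view <GUID>'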

I start to copy a file from my Windows 7 system to the iSCSI target, then pull 
up iostat using this command: zpool iostat -v data 10

It shows me this : 

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         895M   463G      0    666      0  7.93M
  mirror     895M   463G      0    666      0  7.93M
    c0t0d0      -      -      0    269      0  7.91M
    c0t1d0      -      -      0    272      0  7.93M
----------  -----  -----  -----  -----  -----  -----

So I figure, since ZFS is pretty sweet, how about I add some additional drives. 
 That should bump up my performance.

I execute this : 

zpool add data mirror c1t0d0 c1t1d0

It adds it to my zpool, and I run IOSTAT again, while the copy is still running.

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.17G   927G      0    738  1.58K  8.87M
  mirror    1.17G   463G      0    390  1.58K  4.61M
    c0t0d0      -      -      0    172  1.58K  4.61M
    c0t1d0      -      -      0    175      0  4.61M
  mirror    42.5K   464G      0    348      0  4.27M
    c1t0d0      -      -      0    156      0  4.27M
    c1t1d0      -      -      0    159      0  4.27M
----------  -----  -----  -----  -----  -----  -----


I get a whopping extra 1MB/sec by adding two drives.  It fluctuates a lot, 
sometimes dropping down to 4MB/sec, sometimes rocketing all the way up to 
20MB/sec, but nothing consistent.

Basically, my transfer rates are the same no matter how many drives I add to 
the zpool.

Is there anything I am missing on this?

BTW - "test" server specs

AMD dual core 6000+
2GB RAM
Onboard SATA controller
Onboard Ethernet (gigabit)

I've got a very similar rig to the OP's showing up next week (plus an InfiniBand 
card). I'd love to get this performing at GB Ethernet speeds; otherwise I may 
have to abandon the iSCSI project if I can't get it to perform.


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Christo Kutrovsky
Dan,

Exactly what I meant. An allocation policy that would help distribute the data 
in such a way that, when one disk (an entire mirror) is lost, some data remains 
fully accessible, as opposed to not being able to access pieces all over the 
storage pool.


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Damon Atkins
Create a new empty pool on the Solaris system and let it format the disks, 
i.e. use the disk names cXtXd0. This should put the EFI label on the disks and 
then set up the partitions for you. Just in case, here is an example.

Go back to the Linux box and see if you can use tools to see the same 
partition layout; if you can, then dd it to the correct spot, which in Solaris 
is c5t2d0s0. (zfs send | zfs recv would be easier.)

-bash-4.0$ pfexec fdisk -R -W - /dev/rdsk/c5t2d0p0

* /dev/rdsk/c5t2d0p0 default fdisk table
* Dimensions:
*      512 bytes/sector
*      126 sectors/track
*      255 tracks/cylinder
*    60800 cylinders
*
* systid:
*    1: DOSOS12
*  238: EFI_PMBR
*  239: EFI_FS
*
* Id    Act  Bhead  Bsect  Bcyl   Ehead  Esect  Ecyl   Rsect       Numsect
  238   0    255    63     1023   255    63     1023   1           1953525167
  0     0    0      0      0      0      0      0      0           0
  0     0    0      0      0      0      0      0      0           0
  0     0    0      0      0      0      0      0      0           0


-bash-4.0$ pfexec prtvtoc /dev/rdsk/c5t2d0
* /dev/rdsk/c5t2d0 partition map
*
* Dimensions:
*     512 bytes/sector
* 1953525168 sectors
* 1953525101 accessible sectors
*
* Flags:
*   1: unmountable
*  10: read-only
*
* Unallocated space:
*       First     Sector    Last
*       Sector     Count    Sector
*           34       222       255
*
*                          First      Sector      Last
* Partition  Tag  Flags    Sector      Count      Sector  Mount Directory
       0      4    00         256  1953508495  1953508750
       8     11    00  1953508751       16384  1953525134
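Putting it together, roughly (the pool name and the Linux device names below 
are only placeholders, and zfs send | zfs recv is still the safer route):

  # on Solaris: let ZFS write the EFI label and slice 0, then drop the scratch pool
  zpool create -f scratch c5t2d0
  zpool destroy scratch
  prtvtoc /dev/rdsk/c5t2d0       # confirm where slice 0 starts (sector 256 above)

  # on Linux, if it shows the same layout: copy the old data into the partition matching s0
  dd if=/dev/mapper/truecrypt1 of=/dev/sdb1 bs=1M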


Re: [zfs-discuss] zpool status output confusing

2010-02-17 Thread Moshe Vainer
The links look fine, and I am pretty sure (though not 100%) that this is 
related to the vdev id assignment. What I am not sure of is whether this is 
still an Areca firmware issue or an OpenSolaris issue.

 ls -l /dev/dsk/c7t1d?p0
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d0p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,0:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d1p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,1:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d2p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,2:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d3p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,3:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d4p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,4:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d5p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,5:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d6p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,6:q
lrwxrwxrwx 1 root root 62 2010-02-08 17:43 /dev/dsk/c7t1d7p0 -> 
../../devices/p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,7:q


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Ethan
On Wed, Feb 17, 2010 at 23:21, Bob Friesenhahn  wrote:

> On Wed, 17 Feb 2010, Ethan wrote:
>
>>
>> I should have a partition table, for one thing, I suppose. The partition
>> table is EFI GUID Partition
>> Table, looking at the relevant documentation. So, I'll need to somehow
>> shift my zfs data down by 17408
>> bytes (34 512-byte LBA's, the size of the GPT's stuff at the beginning of
>> the disk) - perhaps just by
>>
>> Does that sound correct / sensible? Am I missing or mistaking anything?
>>
>
> It seems to me that you could also use the approach of 'zpool replace' for
> each device in turn until all of the devices are re-written to normal
> Solaris/zfs defaults.  This would also allow you to expand the partition
> size a bit for a larger pool.
>
> Bob
> --
> Bob Friesenhahn
> bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


That is true. It seems like it would then have to rebuild from parity for every
drive, though, which I think would take rather a long while, wouldn't it?
I could put in a new drive to overwrite. Then the replace command would just
copy from the old drive rather than rebuilding from parity (I think? That
seems like the sensible thing for it to do, anyway.) But I don't have a
spare drive for this - I have the original drives that still contain the
truecrypt volumes, but I am disinclined to start overwriting these, this
episode having given me a healthy paranoia about having good backups.
I guess this question just comes down to weighing whether rebuilding each
from parity or re-copying from the truecrypt volumes to a different offset
is more of a hassle.

-Ethan


[zfs-discuss] Killing an EFI label

2010-02-17 Thread David Dyer-Bennet
Since this seems to be a ubiquitous problem for people running ZFS, even 
though it's really a general Solaris admin issue, I'm guessing the 
expertise is actually here, so I'm asking here.


I found lots of online pages telling how to do it.

None of them were correct or complete.  I think.  I seem to have 
accomplished it in a somewhat hackish fashion, possibly not cleanly, and 
I'm now trying to really understand this (I've always found SunOS' idea 
of overlapping partitions so insanely stupid that it turns my brain off, 
and combining that with x86-style real disk partitions and calling them 
both the same thing except when we don't has probably induced permanent 
brain damage by this point).


First step:  invoke format -e and use "label" to write an SMI label to 
the disk.  This part is sort-of documented, and if you let format 
present you the list of disks and choose one, everything works out.  
What about the other syntax, where you specify the "disk" to format on 
the command line? What are valid device files? Is it any file that ends 
up pointing to some portion of the correct physical disk?


But after this, it appears to be necessary to manually set up partitions 
(slices).  At least, without doing that, I couldn't attach the disk to 
my zpool, which was my goal.  Am I missing something?


And, when manually setting up partitions, I have no idea if what I did 
is right.  Well, a bit of idea; I know that installgrub did NOT 
overwrite anything that a scrub detected, so that means I left enough 
blank space somewhere.  Not sure it's the right place, though. Did I 
have to do this?  Every way I tried to avoid this resulted in failure to 
attach, but none of the instructions listed this step.


This is how format prints the partitions I created:

partition> p
Current partition table (original):
Total disk cylinders available: 19454 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders        Size            Blocks
  0       root    wm      1 - 19453      149.02GB    (19453/0/0)  312512445
  1 unassigned    wm      0                 0         (0/0/0)             0
  2     backup    wu      0 - 19453      149.03GB    (19454/0/0)  312528510
  3 unassigned    wm      0                 0         (0/0/0)             0
  4 unassigned    wm      0                 0         (0/0/0)             0
  5 unassigned    wm      0                 0         (0/0/0)             0
  6 unassigned    wm      0                 0         (0/0/0)             0
  7 unassigned    wm      0                 0         (0/0/0)             0
  8       boot    wu      0 -     0        7.84MB     (1/0/0)         16065
  9 unassigned    wm      0                 0         (0/0/0)             0

And here's the one on the other disk -- yikes, it looks like it ended up 
with a completely different geometry!  (these are two identical drives).


Current partition table (original):
Total disk cylinders available: 152615 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm      1 - 152614      149.04GB    (152614/0/0)  312553472
  1       swap    wu      0                  0         (0/0/0)              0
  2     backup    wu      0 - 152616      149.04GB    (152617/0/0)  312559616
  3 unassigned    wm      0                  0         (0/0/0)              0
  4 unassigned    wm      0                  0         (0/0/0)              0
  5 unassigned    wm      0                  0         (0/0/0)              0
  6        usr    wm      1 - 152614      149.04GB    (152614/0/0)  312553472
  7 unassigned    wm      0                  0         (0/0/0)              0
  8       boot    wu      0 -      0        1.00MB     (1/0/0)           2048
  9 alternates    wm      0                  0         (0/0/0)              0


How do I fix that?  For SCSI disks (these are SATA disks on an SAS 
controller, does that count?) it's supposed to figure that out itself, I 
thought?  I certainly never entered disk geometry figures.
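I gather that, if the disks really are the same size, one standard trick is 
to clone the VTOC from the disk that came out right onto the other one - a 
sketch with the device names guessed, and it only works cleanly when the 
slices land on cylinder boundaries of the target disk:

  prtvtoc /dev/rdsk/c4t0d0s2 | fmthard -s - /dev/rdsk/c4t1d0s2

But I haven't tried that yet either.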


The pool is using s0:

  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c4t0d0s0  ONLINE       0     0     0
            c4t1d0s0  ONLINE       0     0     0
            c9t0d0s0  ONLINE       0     0     0
            c9t2d0s0  ONLINE       0     0     0

errors: No known data errors

Once I decide that these

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Bob Friesenhahn

On Wed, 17 Feb 2010, Ethan wrote:


I should have a partition table, for one thing, I suppose. The partition table 
is EFI GUID Partition
Table, looking at the relevant documentation. So, I'll need to somehow shift my 
zfs data down by 17408
bytes (34 512-byte LBA's, the size of the GPT's stuff at the beginning of the 
disk) - perhaps just by

Does that sound correct / sensible? Am I missing or mistaking anything? 


It seems to me that you could also use the approach of 'zpool replace' 
for each device in turn until all of the devices are re-written to 
normal Solaris/zfs defaults.  This would also allow you to expand the 
partition size a bit for a larger pool.
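Roughly, one device at a time (the device names here are just the ones from 
earlier in the thread; wait for each resilver to complete before replacing 
the next):

  zpool replace q /export/home/ethan/qdsk/c9t4d0p0 c9t4d0
  zpool status q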


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] zpool status output confusing

2010-02-17 Thread David Dyer-Bennet

On 2/17/2010 9:59 PM, Moshe Vainer wrote:

I have another very weird one, looks like a recurrence of the same issue but 
with the new firmware.

We have the following disks:

AVAILABLE DISK SELECTIONS:
0. c7t1d0
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,0
1. c7t1d1
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,1
2. c7t1d2
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,2
3. c7t1d3
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,3
4. c7t1d4
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,4
5. c7t1d5
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,5
6. c7t1d6
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,6
7. c7t1d7
   /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,7

rpool uses c7t1d7

# zpool status
   pool: rpool
  state: ONLINE
  scrub: none requested
config:

 NAMESTATE READ WRITE CKSUM
 rpool   ONLINE   0 0 0
   c7t1d7s0  ONLINE   0 0 0

errors: No known data errors


I tried to create the following tank:

zpool create -f tank \
 raidz2 \
 c7t1d0 \
 c7t1d1 \
 c7t1d2 \
 c7t1d3 \
 c7t1d4 \
 c7t1d5 \
 spare \
 c7t1d6

# ./mktank.sh
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c7t1d7s0 is part of active ZFS pool rpool. Please see zpool(1M).

So clearly, it confuses one of the other drives with c7t1d7

What's even weirder - this is after a clean reinstall of solaris (it's a test 
box).
Any ideas on how to clean the state?
James, if you read this, is this the same issue?
   


Well, I'd certainly chase through the symbolic links to find out whether the 
device files end up pointing to the wrong places, or whether the 
problem is lower in the stack than that.  Since it's a clean install, 
it's a Solaris bug at some level either way, it sounds like.


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] zpool status output confusing

2010-02-17 Thread Moshe Vainer
I have another very weird one, looks like a recurrence of the same issue but 
with the new firmware.

We have the following disks:

AVAILABLE DISK SELECTIONS:
   0. c7t1d0 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,0
   1. c7t1d1 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,1
   2. c7t1d2 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,2
   3. c7t1d3 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,3
   4. c7t1d4 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,4
   5. c7t1d5 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,5
   6. c7t1d6 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,6
   7. c7t1d7 
  /p...@0,0/pci8086,3...@3/pci17d3,1...@0/d...@1,7

rpool uses c7t1d7

# zpool status
  pool: rpool
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  c7t1d7s0  ONLINE   0 0 0

errors: No known data errors


I tried to create the following tank:

zpool create -f tank \
raidz2 \
c7t1d0 \
c7t1d1 \
c7t1d2 \
c7t1d3 \
c7t1d4 \
c7t1d5 \
spare \
c7t1d6

# ./mktank.sh
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c7t1d7s0 is part of active ZFS pool rpool. Please see zpool(1M).

So clearly, it confuses one of the other drives with c7t1d7

What's even weirder - this is after a clean reinstall of solaris (it's a test 
box). 
Any ideas on how to clean the state?
James, if you read this, is this the same issue?

Thanks in advance, 
Moshe


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Ethan
On Wed, Feb 17, 2010 at 18:24, Daniel Carosone  wrote:

> On Wed, Feb 17, 2010 at 06:15:25PM -0500, Ethan wrote:
> > Success!
>
> Awesome.  Let that scrub finish before celebrating completely, but
> this looks like a good place to stop and consider what you want for an
> end state.
>
> --
> Dan.
>

True. Thinking about where to end up - I will be staying on opensolaris
despite having no truecrypt. My paranoia likes having encryption, but it's
not really necessary for me, and it looks like encryption will be coming to
zfs itself soon enough. So, no need to consider getting things working on
zfs-fuse again.

I should have a partition table, for one thing, I suppose. The partition
table is EFI GUID Partition Table, looking at the relevant documentation.
So, I'll need to somehow shift my zfs data down by 17408 bytes (34 512-byte
LBA's, the size of the GPT's stuff at the beginning of the disk) - perhaps
just by copying from the truecrypt volumes as I did before, but with offset
of 17408 bytes. Then I should be able to use format to make the correct
partition information, and use the s0 partition for each drive as seems to
be the standard way of doing things. Or maybe I can format (write the GPT)
first, then get linux to recognize the GPT, and copy from truecrypt into the
partition.
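In dd terms, the shifted copy would be something like the line below - seek 
counts output blocks, so 34 blocks of 512 bytes is the 17408-byte GPT area (a 
bigger block size works too as long as the offset stays a whole number of 
blocks, e.g. bs=1024 seek=17):

  dd if=/dev/mapper/truecrypt1 of=/dev/sdb bs=512 seek=34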

Does that sound correct / sensible? Am I missing or mistaking anything?

Thanks,
-Ethan

PS: scrub in progress for 4h4m, 65.43% done, 2h9m to go - no errors yet.
Looking good.


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 02:38:04PM -0500, Miles Nordin wrote:
> copies=2 has proven to be mostly useless in practice.

I disagree.  Perhaps my cases fit under the weasel-word "mostly", but
single-disk laptops are a pretty common use-case.

> If there were a real-world device that tended to randomly flip bits,
> or randomly replace swaths of LBA's with zeroes

As an aside, there can be non-device causes of this, especially when
sharing disks with other operating systems, booting livecd's and etc.

>  * drives do not immediately turn red and start brrk-brrking when they
>go bad.  In the real world, they develop latent sector errors,
>which you will not discover and mark the drive bad until you scrub
>or coincidentally happen to read the file that accumulated the
>error.

Yes, exactly - at this point, with copies=1, you get a signal that
your drive is about to go bad, and that data has been lost.  With
copies=2, you get a signal that your drive is about to go bad, but
less disruption and data loss to go with it.  Note that pool metadata
is inherently using ditto blocks for precisely this reason.

I dunno about BER spec, but I have seen sectors go unreadable many
times. Sometimes, as you note, in combination with other problems or
further deterioration, sometimes not. Regardless of what you do in
response, and how soon you replace the drive, copies >1 can cover that
interval. 

I agree fully, copies=2 is not a substitute for backup replication of
whatever flavour you prefer.  It is a useful supplement.
Misunderstanding this is dangerous.

--
Dan.





Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Casper . Dik

>
>
>>If there were a real-world device that tended to randomly flip bits,
>>or randomly replace swaths of LBA's with zeroes, but otherwise behave
>>normally (not return any errors, not slow down retrying reads, not
>>fail to attach), then copies=2 would be really valuable, but so far it
>>seems no such device exists.  If you actually explore the errors that
>>really happen I venture there are few to no cases copies=2 would save
>>you.
>
>I had a device which had 256 bytes of the 32MB broken (some were "1", some
>were always "0").  But I never put it online because it was so broken.

Of the 32MB cache, sorry.

Casper



Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Casper . Dik


>If there were a real-world device that tended to randomly flip bits,
>or randomly replace swaths of LBA's with zeroes, but otherwise behave
>normally (not return any errors, not slow down retrying reads, not
>fail to attach), then copies=2 would be really valuable, but so far it
>seems no such device exists.  If you actually explore the errors that
>really happen I venture there are few to no cases copies=2 would save
>you.

I had a device which had 256 bytes of the 32MB broken (some were "1", some
were always "0").  But I never put it online because it was so broken.

Casper



Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread David Magda

On Feb 17, 2010, at 12:34, Richard Elling wrote:

I'm not sure how to connect those into the system (USB 3?), but when
you build it, let us know how it works out.


FireWire 3200 preferably. Anyone know if USB 3 sucks as much CPU as  
previous versions?


If I'm burning CPU on I/O, I'd rather have it going to checksums than 
polling cheap-ass USB controllers that need to be babysat.




Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 06:15:25PM -0500, Ethan wrote:
> Success!

Awesome.  Let that scrub finish before celebrating completely, but
this looks like a good place to stop and consider what you want for an
end state. 

--
Dan.




Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Ethan
On Wed, Feb 17, 2010 at 17:44, Daniel Carosone  wrote:

> On Wed, Feb 17, 2010 at 04:48:23PM -0500, Ethan wrote:
> > It looks like using p0 is exactly what I want, actually. Are s2 and p0
> both
> > the entire disk?
>
> No. s2 depends on there being a solaris partition table (Sun or EFI),
> and if there's also an fdisk partition table (disk shared with other
> OS), s2 will only cover the solaris part of the disk.  It also
> typically doesn't cover the last 2 cylinders, which solaris calls
> "reserved" for hysterical raisins.
>
> > The idea of symlinking to the full-disk devices from a directory and
> using
> > -d had crossed my mind, but I wasn't sure about it. I think that is
> > something worth trying.
>
> Note, I haven't tried it either..
>
> > I'm not too concerned about it not working at boot -
> > I just want to get something working at all, at the moment.
>
> Yup.
>
> --
> Dan.



Success!

I made a directory and symlinked p0's for all the disks:

et...@save:~/qdsk# ls -al
total 13
drwxr-xr-x   2 root root   7 Feb 17 23:06 .
drwxr-xr-x  21 ethanstaff 31 Feb 17 14:16 ..
lrwxrwxrwx   1 root root  17 Feb 17 23:06 c9t0d0p0 ->
/dev/dsk/c9t0d0p0
lrwxrwxrwx   1 root root  17 Feb 17 23:06 c9t1d0p0 ->
/dev/dsk/c9t1d0p0
lrwxrwxrwx   1 root root  17 Feb 17 23:06 c9t2d0p0 ->
/dev/dsk/c9t2d0p0
lrwxrwxrwx   1 root root  17 Feb 17 23:06 c9t4d0p0 ->
/dev/dsk/c9t4d0p0
lrwxrwxrwx   1 root root  17 Feb 17 23:06 c9t5d0p0 ->
/dev/dsk/c9t5d0p0


I attempt to import using -d:


et...@save:~/qdsk# zpool import -d .
  pool: q
id: 5055543090570728034
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier,
though
some features will not be available without an explicit 'zpool
upgrade'.
config:

q ONLINE
  raidz1  ONLINE
/export/home/ethan/qdsk/c9t4d0p0  ONLINE
/export/home/ethan/qdsk/c9t5d0p0  ONLINE
/export/home/ethan/qdsk/c9t2d0p0  ONLINE
/export/home/ethan/qdsk/c9t1d0p0  ONLINE
/export/home/ethan/qdsk/c9t0d0p0  ONLINE


The pool is not imported. This does look promising though. I attempt to
import using the name:


et...@save:~/qdsk# zpool import -d . q


It sits there for a while. I worry that it's going to hang forever like it
did on Linux.
But then it returns!


et...@save:~/qdsk# zpool status
  pool: q
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: scrub in progress for 0h2m, 0.43% done, 8h57m to go
config:

NAME  STATE READ WRITE CKSUM
q ONLINE   0 0 0
  raidz1  ONLINE   0 0 0
/export/home/ethan/qdsk/c9t4d0p0  ONLINE   0 0 0
/export/home/ethan/qdsk/c9t5d0p0  ONLINE   0 0 0
/export/home/ethan/qdsk/c9t2d0p0  ONLINE   0 0 0
/export/home/ethan/qdsk/c9t1d0p0  ONLINE   0 0 0
/export/home/ethan/qdsk/c9t0d0p0  ONLINE   0 0 0

errors: No known data errors


All the filesystems are there, all the files are there. Life is good.
Thank you all so much.

-Ethan


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 04:44:19PM -0500, Ethan wrote:
> There was no partitioning on the truecrypt disks. The truecrypt volumes
> occupied the whole raw disks (1500301910016 bytes each). The devices that I
> gave to the zpool on linux were the whole raw devices that truecrypt exposed
> (1500301647872 bytes each). There were no partition tables on either the raw
> disks or the truecrypt volumes, just truecrypt headers on the raw disk and
> zfs on the truecrypt volumes.
> I copied the data simply using
> 
> dd if=/dev/mapper/truecrypt1 of=/dev/sdb

Ok, then as you noted, you want to start with the ..p0 device, as the 
equivalent. 

> The labels 2 and 3 should be on the drives, but they are 262144 bytes
> further from the end of slice 2 than zpool must be looking.

I don't think so.. They're found by counting from the start; the end
can move out further (LUN expansion), and with autoexpand the vdev can
be extended (adding metaslabs) and the labels will be rewritten at the
new end after the last metaslab.  

I think the issue is that there are no partitions on the devices that allow
import to read that far.  Fooling it into using p0 would work around this.

> I could create a partition table on each drive, specifying a partition with
> the size of the truecrypt volume, and re-copy the data onto this partition
> (would have to re-copy as creating the partition table would overwrite zfs
> data, as zfs starts at byte 0). Would this be preferable?

Eventually, probably, yes - once you've confirmed all the speculation,
gotten past the partitioning issue to whatever other damage is in the
pool, resolved that, and have some kind of access to your data.

There are other options as well, including using replace
one at a time, or send|recv.

> I was under some
> impression that zpool devices were preferred to be raw drives, not
> partitions, but I don't recall where I came to believe that much less
> whether it's at all correct.

Sort of. zfs commands can be handed bare disks; internally they put
EFI labels on them automatically (though evidently not the fuse
variants).

ZFS mostly makes partitioning go away (hooray), but it still becomes
important in cases like this - shared disks and migration between
operating systems.

> as for using import -F, I am on snv_111b, which I am not sure has -F for
> import. 

Nope.

> I tried to update to the latest dev build (using the instructions
> at http://pkg.opensolaris.org/dev/en/index.shtml ) but things are behaving
> very strangely. I get error messages on boot - "gconf-sanity-check-2 exited
> with error status 256", and when I dismiss this and go into gnome, terminal
> is messed up and doesn't echo anything I type, and I can't ssh in (error
> message about not able to allocate a TTY). anyway, zfs mailing list isn't
> really the place to be discussing that I suppose.

Not really, but read the release notes.

Alternately, if this is a new machine, you could just reinstall (or
boot from) a current livecd/usb, download from genunix.org

--
Dan.




Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 05:28:03PM -0500, Dennis Clarke wrote:
> Good theory, however, this disk is fully external with its own power.

It can still be commanded to offline state.

--
Dan.



Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 04:48:23PM -0500, Ethan wrote:
> It looks like using p0 is exactly what I want, actually. Are s2 and p0 both
> the entire disk?

No. s2 depends on there being a solaris partition table (Sun or EFI),
and if there's also an fdisk partition table (disk shared with other
OS), s2 will only cover the solaris part of the disk.  It also
typically doesn't cover the last 2 cylinders, which solaris calls
"reserved" for hysterical raisins. 

> The idea of symlinking to the full-disk devices from a directory and using
> -d had crossed my mind, but I wasn't sure about it. I think that is
> something worth trying. 

Note, I haven't tried it either..

> I'm not too concerned about it not working at boot -
> I just want to get something working at all, at the moment.

Yup.

--
Dan.



Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.

2010-02-17 Thread Cindy Swearingen

Hi Dennis,

You might be running into this issue:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6856341

The workaround is to force load the drivers.
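That is, add forceload entries for the relevant HBA drivers to /etc/system 
and reboot. The driver names below are only examples; use whichever drivers 
your controllers actually attach with:

  * /etc/system
  forceload: drv/mpt
  forceload: drv/glm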

Thanks,

Cindy

On 02/17/10 14:33, Dennis Clarke wrote:

I find that some servers display a DEGRADED zpool status at boot. More
troubling is that this seems to be silent and no notice is given on the
console or via a snmp message or other notification process.

Let me demonstrate :

{0} ok boot -srv

Sun Blade 2500 (Silver), No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.17.3, 4096 MB memory installed, Serial #64510477.
Ethernet address 0:3:ba:d8:5a:d, Host ID: 83d85a0d.



Rebooting with command: boot -srv
Boot device: /p...@1d,70/s...@4,1/d...@0,0:a  File and args: -srv
module /platform/sun4u/kernel/sparcv9/unix: text at [0x100, 0x10a3695]
data at 0x180
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10a3698,
0x126bbf7] data at 0x1866840
module /platform/SUNW,Sun-Blade-2500/kernel/misc/sparcv9/platmod: text at
[0x126bbf8, 0x126c1e7] data at 0x18bc0c8
.
.
. many lines of verbose messages
.
.

dump on /dev/zvol/dsk/mercury_rpool/swap size 0 MB
Loading smf(5) service descriptions: 2/2
Requesting System Maintenance Mode
SINGLE USER MODE

Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

# zpool list
NAMESIZE   USED  AVAILCAP  HEALTH  ALTROOT
mercury_rpool68G  27.4G  40.6G40%  DEGRADED  -
# zpool status mercury_rpool
  pool: mercury_rpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas
exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
mercury_rpool  DEGRADED 0 0 0
  mirror  DEGRADED 0 0 0
c3t0d0s0  ONLINE   0 0 0
c1t2d0s0  UNAVAIL  0 0 0  cannot open

errors: No known data errors

This is trivial to remedy :

# zpool online mercury_rpool c1t2d0s0
# zpool list
NAMESIZE   USED  AVAILCAP  HEALTH  ALTROOT
mercury_rpool68G  27.4G  40.6G40%  ONLINE  -
# zpool status mercury_rpool
  pool: mercury_rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h0m with 0 errors on Wed Feb 17 21:26:11
2010
config:

NAME  STATE READ WRITE CKSUM
mercury_rpool  ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c3t0d0s0  ONLINE   0 0 0
c1t2d0s0  ONLINE   0 0 0  14.5M resilvered

errors: No known data errors
#

I have many systems where I keep mirrors on multiple controllers, either
fibre or SCSI. It seems that the SCSI devices don't get detected at boot
on the Sparc systems. The x86/AMD64 systems do not seem to have this
problem but I may be wrong.

Is this a known bug or am I seeing something due to a missing line in
/etc/system ?

Oh, also, I should point out that it does not matter if I boot with init S
or 3 or 6.




Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.

2010-02-17 Thread Dennis Clarke

> On Wed, 17 Feb 2010, Dennis Clarke wrote:
>>
>>NAME  STATE READ WRITE CKSUM
>>mercury_rpool  ONLINE   0 0 0
>>  mirror  ONLINE   0 0 0
>>c3t0d0s0  ONLINE   0 0 0
>>c1t2d0s0  ONLINE   0 0 0  14.5M resilvered
>>
>> errors: No known data errors
>> #
>>
>> I have many systems where I keep mirrors on multiple controllers, either
>> fibre or SCSI. It seems that the SCSI devices don't get detected at boot
>> on the Sparc systems. The x86/AMD64 systems do not seem to have this
>> problem but I may be wrong.
>>
>> Is this a known bug or am I seeing something due to a missing line in
>> /etc/system ?
>
> My Sun Blade 2500 (Red) does see both boot disks.  However, I do
> recall an issue at one time with the Solaris power management daemon
> in that it shut down the second disk during boot so that it was not
> seen.  It was mentioned in the Solaris release notes (maybe U5 or U6?)
> and it happened to me.  A fix to /etc/power.conf was required.
> Perhaps that is what is happening to you.

Good theory, however, this disk is fully external with its own power.

Strange.

I'll go have a look at a V490 I have here ( snv_130 ) and install a few
SCSI cards just to see what happens. Maybe this is specific to the SB2500
workstations.


-- 
Dennis Clarke
dcla...@opensolaris.ca  <- Email related to the open source Solaris
dcla...@blastwave.org   <- Email related to open source for Solaris




Re: [zfs-discuss] false DEGRADED status based on "cannot open" device at boot.

2010-02-17 Thread Bob Friesenhahn

On Wed, 17 Feb 2010, Dennis Clarke wrote:


        NAME           STATE     READ WRITE CKSUM
        mercury_rpool  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c3t0d0s0   ONLINE       0     0     0
            c1t2d0s0   ONLINE       0     0     0  14.5M resilvered

errors: No known data errors
#

I have many systems where I keep mirrors on multiple controllers, either
fibre or SCSI. It seems that the SCSI devices don't get detected at boot
on the Sparc systems. The x86/AMD64 systems do not seem to have this
problem but I may be wrong.

Is this a known bug or am I seeing something due to a missing line in
/etc/system ?


My Sun Blade 2500 (Red) does see both boot disks.  However, I do 
recall an issue at one time with the Solaris power management daemon 
in that it shut down the second disk during boot so that it was not 
seen.  It was mentioned in the Solaris release notes (maybe U5 or U6?) 
and it happened to me.  A fix to /etc/power.conf was required. 
Perhaps that is what is happening to you.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Ethan
On Wed, Feb 17, 2010 at 16:25, Daniel Carosone  wrote:

> On Thu, Feb 18, 2010 at 08:14:03AM +1100, Daniel Carosone wrote:
> > I think
> > you probably want to make a slice 0 that spans the right disk sectors.
> [..]
> > you could try zdb -l on /dev/dsk/c...p[01234] as well.
>
> Depending on how and what you copied, you may have zfs data that start
> at sector 0, with no space for any partitioning labels at all.   If
> zdb -l /dev/rdsk/c..p0 shows a full set, this is what has happened.
> Trying to write partition tables may overwrite some of the zfs labels.
>
> zfs won't import such a pool by default (it doesn't check those
> devices).  You could cheat, by making a directory with symlinks to the
> p0 devices, and using import -d, but this will not work at boot.  It
> would be a way to verify current state, so you can then plan next
> steps.
>
> --
> Dan.
>

It looks like using p0 is exactly what I want, actually. Are s2 and p0 both
the entire disk?
The idea of symlinking to the full-disk devices from a directory and using
-d had crossed my mind, but I wasn't sure about it. I think that is
something worth trying. I'm not too concerned about it not working at boot -
I just want to get something working at all, at the moment.

-Ethan


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Ethan
On Wed, Feb 17, 2010 at 16:14, Daniel Carosone  wrote:

> On Wed, Feb 17, 2010 at 03:37:59PM -0500, Ethan wrote:
> > On Wed, Feb 17, 2010 at 15:22, Daniel Carosone  wrote:
> > I have not yet successfully imported. I can see two ways of making
> progress
> > forward. One is forcing zpool to attempt to import using slice 2 for each
> > disk rather than slice 8. If this is how autoexpand works, as you say, it
> > seems like it should work fine for this. But I don't know how, or if it
> is
> > possible to, make it use slice 2.
>
> Just get rid of 8? :-)
>

That sounds like an excellent idea, but, being very new to opensolaris, I
have no idea how to do this. I'm reading through
http://multiboot.solaris-x86.org/iv/3.html at the moment. You mention the
'format' utility below, which I will read more into.


>
> Normally, when using the whole disk, convention is that slice 0 is
> used, and there's a small initial offset (for the EFI label).  I think
> you probably want to make a slice 0 that spans the right disk sectors.
>
> Were you using some other partitioning inside the truecrypt "disks"?
> What devices were given to zfs-fuse, and what was their starting
> offset? You may need to account for that, too.  How did you copy the
> data, and to what target device, on what platform?  Perhaps the
> truecrypt device's partition table is now at the start of the physical
> disk, but solaris can't read it properly? If that's an MBR partition
> table (which you look at with fdisk), you could try zdb -l on
> /dev/dsk/c...p[01234] as well.
>

There was no partitioning on the truecrypt disks. The truecrypt volumes
occupied the whole raw disks (1500301910016 bytes each). The devices that I
gave to the zpool on linux were the whole raw devices that truecrypt exposed
(1500301647872 bytes each). There were no partition tables on either the raw
disks or the truecrypt volumes, just truecrypt headers on the raw disk and
zfs on the truecrypt volumes.
I copied the data simply using

dd if=/dev/mapper/truecrypt1 of=/dev/sdb

on linux, where /dev/mapper/truecrypt1 is the truecrypt volume for one hard
disk (which was on /dev/sda) and /dev/sdb is a new blank drive of the same
size as the old drive (but slightly larger than the truecrypt volume). And
repeat likewise for each of the five drives.

The labels 2 and 3 should be on the drives, but they are 262144 bytes
further from the end of slice 2 than zpool must be looking.

I could create a partition table on each drive, specifying a partition with
the size of the truecrypt volume, and re-copy the data onto this partition
(would have to re-copy as creating the partition table would overwrite zfs
data, as zfs starts at byte 0). Would this be preferable? I was under some
impression that zpool devices were preferred to be raw drives, not
partitions, but I don't recall where I came to believe that much less
whether it's at all correct.



>
> We're just guessing here.. to provide more concrete help, you'll need
> to show us some of the specifics, both of what you did and what you've
> ended up with. fdisk and format partition tables and zdb -l output
> would be a good start.
>
> Figuring out what is different about the disk where s2 was used would
> be handy too.  That may be a synthetic label because something is
> missing from that disk that the others have.
>
> > The other way is to make a slice that is the correct size of the volumes
> as
> > I had them before (262144 bytes less than the size of the disk). It seems
> > like this should cause zpool to prefer to use this slice over slice 8, as
> it
> > can find all 4 labels, rather than just labels 0 and 1. I don't know how
> to
> > go about this either, or if it's possible. I have been starting to read
> > documentation on slices in solaris but haven't had time to get far enough
> to
> > figure out what I need.
>
> format will let you examine and edit these.  Start by making sure they
> have all the same partitioning, flags, etc.
>

I will have a look at format, but if this operates on partition tables,
well, my disks have none at the moment so I'll have to remedy that.


>
> > I also have my doubts about this solving my actual issues - the ones that
> > caused me to be unable to import in zfs-fuse. But I need to solve this
> issue
> > before I can move forward to figuring out/solving whatever that issue
> was.
>
> Yeah - my suspicion is that import -F may help here.  That is a pool
> recovery mode, where it rolls back progressive transactions until it
> finds one that validates correctly.  It was only added recently and is
> probably missing from the fuse version.
>
> --
> Dan.
>
>
As for using import -F, I am on snv_111b, which I am not sure has -F for
import. I tried to update to the latest dev build (using the instructions
at http://pkg.opensolaris.org/dev/en/index.shtml ) but things are behaving
very strangely. I get error messages on boot - "gconf-sanity-check-2 exited
with error status 256", and when I dismiss this and go into gnome, terminal
is messed up and doesn't echo anything I type, and I can't ssh in (error
message about not able to allocate a TTY). Anyway, the zfs mailing list isn't
really the place to be discussing that, I suppose.

[zfs-discuss] false DEGRADED status based on "cannot open" device at boot.

2010-02-17 Thread Dennis Clarke

I find that some servers display a DEGRADED zpool status at boot. More
troubling is that this seems to be silent and no notice is given on the
console or via a snmp message or other notification process.

Let me demonstrate :

{0} ok boot -srv

Sun Blade 2500 (Silver), No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.17.3, 4096 MB memory installed, Serial #64510477.
Ethernet address 0:3:ba:d8:5a:d, Host ID: 83d85a0d.



Rebooting with command: boot -srv
Boot device: /p...@1d,70/s...@4,1/d...@0,0:a  File and args: -srv
module /platform/sun4u/kernel/sparcv9/unix: text at [0x100, 0x10a3695]
data at 0x180
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10a3698,
0x126bbf7] data at 0x1866840
module /platform/SUNW,Sun-Blade-2500/kernel/misc/sparcv9/platmod: text at
[0x126bbf8, 0x126c1e7] data at 0x18bc0c8
.
.
. many lines of verbose messages
.
.

dump on /dev/zvol/dsk/mercury_rpool/swap size 0 MB
Loading smf(5) service descriptions: 2/2
Requesting System Maintenance Mode
SINGLE USER MODE

Root password for system maintenance (control-d to bypass):
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

# zpool list
NAMESIZE   USED  AVAILCAP  HEALTH  ALTROOT
mercury_rpool68G  27.4G  40.6G40%  DEGRADED  -
# zpool status mercury_rpool
  pool: mercury_rpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas
exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
mercury_rpool  DEGRADED 0 0 0
  mirror  DEGRADED 0 0 0
c3t0d0s0  ONLINE   0 0 0
c1t2d0s0  UNAVAIL  0 0 0  cannot open

errors: No known data errors

This is trivial to remedy :

# zpool online mercury_rpool c1t2d0s0
# zpool list
NAMESIZE   USED  AVAILCAP  HEALTH  ALTROOT
mercury_rpool68G  27.4G  40.6G40%  ONLINE  -
# zpool status mercury_rpool
  pool: mercury_rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
pool will no longer be accessible on older software versions.
 scrub: resilver completed after 0h0m with 0 errors on Wed Feb 17 21:26:11
2010
config:

NAME  STATE READ WRITE CKSUM
mercury_rpool  ONLINE   0 0 0
  mirror  ONLINE   0 0 0
c3t0d0s0  ONLINE   0 0 0
c1t2d0s0  ONLINE   0 0 0  14.5M resilvered

errors: No known data errors
#

I have many systems where I keep mirrors on multiple controllers, either
fibre or SCSI. It seems that the SCSI devices don't get detected at boot
on the Sparc systems. The x86/AMD64 systems do not seem to have this
problem but I may be wrong.

Is this a known bug or am I seeing something due to a missing line in
/etc/system ?

Oh, also, I should point out that it does not matter if I boot with init S
or 3 or 6.

-- 
Dennis Clarke
dcla...@opensolaris.ca  <- Email related to the open source Solaris
dcla...@blastwave.org   <- Email related to open source for Solaris




Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Daniel Carosone
On Thu, Feb 18, 2010 at 08:14:03AM +1100, Daniel Carosone wrote:
> I think
> you probably want to make a slice 0 that spans the right disk sectors.
[..]
> you could try zdb -l on /dev/dsk/c...p[01234] as well. 

Depending on how and what you copied, you may have zfs data that start
at sector 0, with no space for any partitioning labels at all.   If
zdb -l /dev/rdsk/c..p0 shows a full set, this is what has happened.
Trying to write partition tables may overwrite some of the zfs labels.

zfs won't import such a pool by default (it doesn't check those
devices).  You could cheat, by making a directory with symlinks to the
p0 devices, and using import -d, but this will not work at boot.  It
would be a way to verify current state, so you can then plan next
steps. 

--
Dan.




Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 03:37:59PM -0500, Ethan wrote:
> On Wed, Feb 17, 2010 at 15:22, Daniel Carosone  wrote:
> I have not yet successfully imported. I can see two ways of making progress
> forward. One is forcing zpool to attempt to import using slice 2 for each
> disk rather than slice 8. If this is how autoexpand works, as you say, it
> seems like it should work fine for this. But I don't know how, or if it is
> possible to, make it use slice 2.

Just get rid of 8? :-)

Normally, when using the whole disk, convention is that slice 0 is
used, and there's a small initial offset (for the EFI label).  I think
you probably want to make a slice 0 that spans the right disk sectors.

Were you using some other partitioning inside the truecrypt "disks"?
What devices were given to zfs-fuse, and what was their starting
offset? You may need to account for that, too.  How did you copy the
data, and to what target device, on what platform?  Perhaps the
truecrypt device's partition table is now at the start of the physical
disk, but solaris can't read it properly? If that's an MBR partition
table (which you look at with fdisk), you could try zdb -l on
/dev/dsk/c...p[01234] as well. 

We're just guessing here.. to provide more concrete help, you'll need
to show us some of the specifics, both of what you did and what you've
ended up with. fdisk and format partition tables and zdb -l output
would be a good start. 

Figuring out what is different about the disk where s2 was used would
be handy too.  That may be a synthetic label because something is
missing from that disk that the others have.

> The other way is to make a slice that is the correct size of the volumes as
> I had them before (262144 bytes less than the size of the disk). It seems
> like this should cause zpool to prefer to use this slice over slice 8, as it
> can find all 4 labels, rather than just labels 0 and 1. I don't know how to
> go about this either, or if it's possible. I have been starting to read
> documentation on slices in solaris but haven't had time to get far enough to
> figure out what I need.

format will let you examine and edit these.  Start by making sure they
have all the same partitioning, flags, etc.
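
A quick way to compare them before changing anything (a sketch; device
names are placeholders, with s2 as the conventional whole-disk slice):

# prtvtoc /dev/rdsk/c8t0d0s2 > /tmp/d0.map
# prtvtoc /dev/rdsk/c8t1d0s2 > /tmp/d1.map
# diff /tmp/d0.map /tmp/d1.map

Any slice that starts or ends differently on one disk shows up in the
diff; the actual editing is then done in format's partition menu.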

> I also have my doubts about this solving my actual issues - the ones that
> caused me to be unable to import in zfs-fuse. But I need to solve this issue
> before I can move forward to figuring out/solving whatever that issue was.

Yeah - my suspicion is that import -F may help here.  That is a pool
recovery mode, where it rolls back progressive transactions until it
finds one that validates correctly.  It was only added recently and is
probably missing from the fuse version. 
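
Once the partitioning is sorted out, it looks something like this (pool
name is a placeholder; -n is a dry run that reports whether the rewind
would succeed without actually doing it):

# zpool import -nF tank
# zpool import -F tank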

--
Dan.



pgpZ5Qt78UfHj.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] naming zfs disks

2010-02-17 Thread Günther
Hello,
have a look at format's volname command:

FORMAT MENU:
disk   - select a disk
type   - select (define) a disk type
partition  - select (define) a partition table
current- describe the current disk
format - format and analyze the disk
fdisk  - run the fdisk program
repair - repair a defective sector
show   - translate a disk address
label  - write label to the disk
analyze- surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save   - save new disk/partition definitions
volname- set 8-character volume name
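
A rough interactive sketch (the disk number is whatever format lists for
your disk; format will want to write a new label so the name sticks):

# format
Specify disk (enter its number): 4
format> volname backup01
format> label
format> quit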


gea
www.napp-it.org 
zfs server
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Frank Middleton

On 02/17/10 02:38 PM, Miles Nordin wrote:


copies=2 has proven to be mostly useless in practice.


Not true. Take an ancient PC with a mirrored root pool, no
bus error checking and non-ECC memory, that flawlessly
passes every known diagnostic (SMC included).

Reboot with copies=1 and the same files in /usr/lib will get
trashed every time and you'll have to reboot from some other
media to repair it.

Set copies=2 (copy all of /usr/lib, of course) and it will reboot
every time with no problem, albeit with a varying number of
repaired checksum errors, almost always on the same set of
files.
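
For reference, that's just the following, with the dataset name being an
assumption for an OpenSolaris root BE (copies only applies to blocks
written after the property is set, hence re-copying /usr/lib):

# zfs set copies=2 rpool/ROOT/opensolaris
# zfs get -r copies rpool/ROOT/opensolaris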

Without copies=2 this hardware would be useless (well, it ran
Linux just fine), but with it, it has a new lease of life. There is
an ancient CR about this, but AFAIK no one has any idea what
the problem is or how to fix it.

IMO it proves that copies=2 can help avoid data loss in the
face of flaky buses and perhaps memory. I don't think you
should be able to lose data on mirrored drives unless both
drives fail simultaneously, but with ZFS you can. Certainly, on
any machine without ECC memory, or buses without ECC (is
parity good enough?) my suggestion would be to set copies=2,
and I have it set for critical datasets even on machines with
ECC on both. Just waiting for the bus that those SAS controllers
are on to burp at the wrong moment...

Is one counter-example enough?

Cheers -- Frank
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Ethan
On Wed, Feb 17, 2010 at 15:22, Daniel Carosone  wrote:

> On Wed, Feb 17, 2010 at 12:31:27AM -0500, Ethan wrote:
> > And I just realized - yes, labels 2 and 3 are in the wrong place relative
> to
> > the end of the drive; I did not take into account the overhead taken up
> by
> > truecrypt when dd'ing the data. The raw drive is 1500301910016 bytes; the
> > truecrypt volume is 1500301647872 bytes. Off by 262144 bytes - I need a
> > slice that is sized like the truecrypt volume.
>
> It shouldn't matter if the slice is larger than the original; this is
> how autoexpand works.   2 should be near the start (with 1), 3 should
> be near the logical end (with 4).
>
> Did this resolve the issue? You didn't say, and I have my doubts. I'm
> not sure this is your problem, but it seems you're on the track to
> finding the real problem.
>
> In the labels you can see, are the txg's the same for all pool
> members?  If not, you may still need import -F, once all the
> partitioning gets sorted out.
>
> Also, re-reading what I wrote above, I realised I was being ambiguous
> in my use of "label".  Sometimes I meant the zfs labels that zdb -l
> prints, and sometimes I meant the vtoc that format uses for slices. In
> the BSD world we call those labels too, and I didn't realise I was
> mixing terms.  Sorry for any confusion but it seems you figured out
> what I meant :)
>
> --
> Dan.


I have not yet successfully imported. I can see two ways of making progress
forward. One is forcing zpool to attempt to import using slice 2 for each
disk rather than slice 8. If this is how autoexpand works, as you say, it
seems like it should work fine for this. But I don't know how, or if it is
possible to, make it use slice 2.

The other way is to make a slice that is the correct size of the volumes as
I had them before (262144 bytes less than the size of the disk). It seems
like this should cause zpool to prefer to use this slice over slice 8, as it
can find all 4 labels, rather than just labels 0 and 1. I don't know how to
go about this either, or if it's possible. I have been starting to read
documentation on slices in solaris but haven't had time to get far enough to
figure out what I need.

I also have my doubts about this solving my actual issues - the ones that
caused me to be unable to import in zfs-fuse. But I need to solve this issue
before I can move forward to figuring out/solving whatever that issue was.

txg is the same for every volume.

-Ethan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help with corrupted pool

2010-02-17 Thread Daniel Carosone
On Wed, Feb 17, 2010 at 12:31:27AM -0500, Ethan wrote:
> And I just realized - yes, labels 2 and 3 are in the wrong place relative to
> the end of the drive; I did not take into account the overhead taken up by
> truecrypt when dd'ing the data. The raw drive is 1500301910016 bytes; the
> truecrypt volume is 1500301647872 bytes. Off by 262144 bytes - I need a
> slice that is sized like the truecrypt volume.

It shouldn't matter if the slice is larger than the original; this is
how autoexpand works.   2 should be near the start (with 1), 3 should
be near the logical end (with 4).

Did this resolve the issue? You didn't say, and I have my doubts. I'm
not sure this is your problem, but it seems you're on the track to
finding the real problem.  

In the labels you can see, are the txg's the same for all pool
members?  If not, you may still need import -F, once all the
partitioning gets sorted out.

Also, re-reading what I wrote above, I realised I was being ambiguous
in my use of "label".  Sometimes I meant the zfs labels that zdb -l
prints, and sometimes I meant the vtoc that format uses for slices. In
the BSD world we call those labels too, and I didn't realise I was
mixing terms.  Sorry for any confusion but it seems you figured out
what I meant :) 

--
Dan.

pgpAYV8T7I5RZ.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] naming zfs disks

2010-02-17 Thread Brad
Is there any way to assign a unique name or ID to a disk that is part of a zpool?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-17 Thread Miles Nordin
> "k" == Khyron   writes:

 k> The RFE is out there.  Just like SLOGs, I happen to think it a
 k> good idea, personally, but that's my personal opinion.  If it
 k> makes dedup more usable, I don't see the harm.

slogs and l2arcs, modulo the current longstanding ``cannot import pool
with attached missing slog'' bug, are disposable: You will lose
either a little data or no data if the device goes away (once the bug
is finally fixed).  This makes them less ponderous because these days
we are looking for raidz2 or raidz3 amount of redundancy, so in a
separate device that wasn't disposable we'd need a 3- or 4-way
mirror.  It also makes their separateness more separable since they
can go away at any time, so maybe they do deserve to be separate.  The
two together make the complexity more bearable.

Would an sddt be disposable, or would it be a critical top-level vdev
needed for import?  If it's critical, well, that's kind of annoying,
because now we need 3-way mirrors of sddt to match the minimum
best-practice redundancy of the rest of the pool, and my
first reaction would be ``can you spread it throughout the normal
raidz{,2,3} vdevs at least in backup form?''

once I say a copy should be kept in the main pool even after it becomes
an sddt, well, what's that imply?

 * In the read case it means caching, so it could go in the l2arc.
   How's DDT different from anything else in the l2arc?

 * In the write case it means sometimes committing it quickly without
   waiting on the main pool so we can release some lock or answer some
   RPC and continue.  Why not write it to the slog?  Then if we lose
   the slog we can do what we always do without the slog and roll back
   to the last valid txg, losing whatever writes were associated with
   that lost ddt update.

The two cases fit fine with the types of SSD's we're using for each
role and the type of special error recovery we have if we lose the
device.  Why litter a landscape so full of special cases and tricks
(like the ``import pool with missing slog'' bug that is taking so long
to resolve) with yet another kind of vdev that will take 1 year to
discover special cases and a half-decade to address them?

Maybe there is a reason.  Are DDT write patterns different than slog
write patterns?  Is it possible to make a DDT read cache using less
ARC for pointers than the l2arc currently uses?  Is the DDT
particularly hard on the l2arc by having small block sizes?  Will the
sddt be delivered with a separate offline ``not an fsck!!!'' tool for
slowly regenerating it from pool data if it's lost, or maybe after an
sddt goes bad the pool can be mounted space-wastingly (no dedup is
done and deletes do not free space) with an empty DDT, and the
sddt regenerated by a scrub?  If the performance or recovery behavior
is different than what we're working towards with optional-slog and
persistent-l2arc then maybe sddt does deserve to be another vdev type.

So, I dunno.  On one hand I'm clearly nowhere near informed enough
to weigh in on an architectural decision like this and shouldn't even
be discussing it, and the same applies to you Khyron, in my view,
since our input seems obvious at best and misinformed at worst.  On
the other hand, another major architectural change (slog) was delivered
incomplete in a cripplingly bad yet silly, trivial way for, AFAICT,
nothing but lack of sufficient sysadmin bitching and moaning, leaving
heaps of multi-terabyte naked pools out there for half a decade with
fancy triple redundancy that will be totally lost if a single SSD +
zpool.cache goes bad, so apparently thinking things through even at
this trivial level might have some value to the ones actually doing
the work.


pgp8j6Y2dtxrq.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Miles Nordin
> "ck" == Christo Kutrovsky  writes:

ck> I could always put "copies=2" (or more) to my important
ck> datasets and take some risk and tolerate such a failure.

copies=2 has proven to be mostly useless in practice.

If there were a real-world device that tended to randomly flip bits,
or randomly replace swaths of LBA's with zeroes, but otherwise behave
normally (not return any errors, not slow down retrying reads, not
fail to attach), then copies=2 would be really valuable, but so far it
seems no such device exists.  If you actually explore the errors that
really happen I venture there are few to no cases copies=2 would save
you.

one case where such a device appears to exist but doesn't really, is
what I often end up doing for family/friend laptops and external USB
drives: wait for drives to start going bad, then rescue them with 'dd
conv=noerror,sync', fsck, and hope for ``most of'' the data.  copies=2
would help get more out of the rescued drive for some but not all of
the times I've done this, but there is not much point: Time Machine or
rsync backups, or NFS/iSCSI-booting, or zfs send|zfs recv replication
to a backup pool, are smarter.  I've never been recovering a
stupidly-vulnerable drive like that in a situation where I had ZFS on
it, so I'm not sure copies=2 will get used here much either.

One particular case of doom: a lot of people want to make two
unredundant vdevs and then set 'copies=2' and rely on ZFS's promise to
spread the two copies out as much as possible.  Then they expect to
import the pool with only one of the two vdevs and read ``some but
not all'' of the data---``I understand I won't get all of it but I
just want ZFS to try its best and we'll see.''  Maybe you want to do
this instead of a mirror so you can have scratch datasets that consume
space at 1/2 the rate they would on a mirror.  Nope, nice try but
won't happen.  ZFS is full of all sorts of webbed assertions to
ratchet you safely through sane pool states that are
regression-testable and supportable so it will refuse to import a pool
that isn't vdev-complete, and no negotiation is possible on this.  The
dream is a FAQ and the answer is a clear ``No'' followed by ``you'd
better test with file vdevs next time you have such a dream.''
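
Such a test takes about a minute with file vdevs (paths and sizes here
are arbitrary):

# mkfile 128m /var/tmp/v0 /var/tmp/v1
# zpool create dreampool /var/tmp/v0 /var/tmp/v1
# zfs set copies=2 dreampool
# zpool export dreampool
# mv /var/tmp/v1 /var/tmp/v1.gone
# zpool import -d /var/tmp dreampool

The import is refused, because the pool is missing a top-level vdev,
and copies=2 does not change that.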

ck> What are the chances for a very specific drive to fail in 2
ck> way mirror?

This may not be what you mean, but in general single device redundancy
isn't ideal for two reasons:

 * speculation (although maybe not operational experience?) that
   modern drives are so huge that even a good drive will occasionally
   develop an unreadable spot and still be within its BER spec.  So,
   without redundancy, you cannot read for sure at all, even if all
   the drives are ``good''.

 * drives do not immediately turn red and start brrk-brrking when they
   go bad.  In the real world, they develop latent sector errors,
   which you will not discover and mark the drive bad until you scrub
   or coincidentally happen to read the file that accumulated the
   error.  It's possible for a second drive to go bad in the interval
   you're waiting to discover the first.  This usually gets retold as
   ``a drive went bad while I was resilvering!  what bad luck.  If
   only I could've resilvered faster to close this window of
   vulnerability I'd not be in such a terrible situation'' but the
   retelling's wrong: what's really happening is that resilver implies
   a scrub, so it uncovers the second bad drive you didn't yet know
   was bad at the time you discovered the first.


pgpXXJ9hwSSUa.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Bob Friesenhahn

On Wed, 17 Feb 2010, Marty Scholes wrote:


Bob, the vast majority of your post I agree with.  At the same time, I might 
disagree with a couple of things.

I don't really care how long a resilver takes (hours, days, months) given a 
couple things:
* Sufficient protection exists on the degraded array during rebuild
** Put another way, the array is NEVER in danger
* Rebuild takes a back seat to production demands


Most data loss is due to human error.  To me it seems like disks which 
take a week to resilver introduce more opportunity for human error. 
The maintenance operation fades from human memory while it is still 
underway.  If an impeccable log book is not kept and understood, then 
it is up to (potentially) multiple administrators with varying levels 
of experience to correctly understand and interpret the output of 
'zpool status'.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Marty Scholes
I can't stop myself; I have to respond.  :-)

Richard wrote:
> The ideal pool has one inexpensive, fast, and reliable device :-)

My ideal pool has become one inexpensive, fast and reliable "device" built on 
whatever I choose.

> I'm not sure how to connect those into the system (USB 3?)

Me neither, but if I had to start guessing about host connections, I would 
probably think FC.

> but when you build it, let us know how it works out.

While it would be a fun project, a toy like that would certainly exceed my 
feeble hardware experience and even more feeble budget.

At the same time, I could make a compelling argument that this sort of 
arrangement: stripes of flash, is the future of tier-one storage.  We already 
are seeing SSD devices which internally are stripes of flash.  More and more 
disks farms are taking on the older roles of tape.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Richard Elling
On Feb 17, 2010, at 9:09 AM, Marty Scholes wrote:
> Bob Friesenhahn wrote:
>> It is unreasonable to spend more than 24 hours to resilver a single
>> drive. It is unreasonable to spend more than 6 days resilvering all 
>> of the devices in a RAID group (the 7th day is reserved for the system 
>> administrator). It is unreasonable to spend very much time at all on 
>> resilvering (using current rotating media) since the resilvering 
>> process kills performance.
> 
> Bob, the vast majority of your post I agree with.  At the same time, I might 
> disagree with a couple of things.
> 
> I don't really care how long a resilver takes (hours, days, months) given a 
> couple things:
> * Sufficient protection exists on the degraded array during rebuild
> ** Put another way, the array is NEVER in danger
> * Rebuild takes a back seat to production demands
> 
> Since I am on a rant, I suspect there is also room for improvement in the 
> scrub.  Why would I rescrub a stripe that was read (and presumably validated) 
> 30 seconds ago by a production application?  

Because scrub reads all copies of the data, not just one.

> Wouldn't it make more sense for scrub to "play nice" with production, moving 
> a leisurely pace and only scrubbing stripes not read in the past X 
> hours/days/weeks/whatever?

Scrubs are done at the lowest priority and are throttled.  I suppose there could
be a knob to adjust the throttle, but then you get into the problem we had with
SVM -- large devices could take weeks to (resilver) scrub.

From a reliability perspective, there is a weak argument for not scrubbing
recent writes. And scrubs are done in txg order (oldest first).  So perhaps
there is some merit to a scrub limit. However, it is not clear to me that
this really buys you anything.
Scrubs are rare events, so the impact of shaving a few minutes or hours off the
scrub time is low.

> I also agree that an ideal pool would be lowering the device capacity and 
> radically increasing the device count.

The ideal pool has one inexpensive, fast, and reliable device :-)

>  In my perfect world, I would have a RAID set of 200+ cheap, low-latency, 
> low-capacity flash drives backed by an additional N% parity, e.g. 40-ish 
> flash drives.  A setup like this would give massive throughput: 200x each 
> flash drive, amazing IOPS and incredible resiliency.  Rebuild times would be 
> low due to lower capacity.  One could probably build such a beast in 1U using 
> MicroSDHC cards or some such thing.

I'm not sure how to connect those into the system (USB 3?), but when you build 
it, let us
know how it works out.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Marty Scholes
Bob Friesenhahn wrote:
> It is unreasonable to spend more than 24 hours to resilver a single
> drive. It is unreasonable to spend more than 6 days resilvering all 
> of the devices in a RAID group (the 7th day is reserved for the system 
> administrator). It is unreasonable to spend very much time at all on 
> resilvering (using current rotating media) since the resilvering 
> process kills performance.

Bob, the vast majority of your post I agree with.  At the same time, I might 
disagree with a couple of things.

I don't really care how long a resilver takes (hours, days, months) given a 
couple things:
* Sufficient protection exists on the degraded array during rebuild
** Put another way, the array is NEVER in danger
* Rebuild takes a back seat to production demands

Since I am on a rant, I suspect there is also room for improvement in the 
scrub.  Why would I rescrub a stripe that was read (and presumably validated) 
30 seconds ago by a production application?  Wouldn't it make more sense for 
scrub to "play nice" with production, moving a leisurely pace and only 
scrubbing stripes not read in the past X hours/days/weeks/whatever?

I also agree that an ideal pool would be lowering the device capacity and 
radically increasing the device count.  In my perfect world, I would have a 
RAID set of 200+ cheap, low-latency, low-capacity flash drives backed by an 
additional N% parity, e.g. 40-ish flash drives.  A setup like this would give 
massive throughput: 200x each flash drive, amazing IOPS and incredible 
resiliency.  Rebuild times would be low due to lower capacity.  One could 
probably build such a beast in 1U using MicroSDHC cards or some such thing.

End rant.

Cheers,
Marty
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] getting tangled with recieved mountpoint properties

2010-02-17 Thread David Dyer-Bennet

On Wed, February 17, 2010 03:35, Daniel Carosone wrote:

> I'd be happy enough if none of these was mounted, but annoyingly in
> this case, the canmount property is not inheritable, so I can't just
> set this somewhere near the top and be done.  My workaround so far:
>  # zfs list -t filesystem -o name | egrep ^dpool | xargs -n 1 zfs set
> canmount=noauto
> but this is annoying and manual and I sometimes forget when a new
> filesystem is added and sent from the source host.

That's exactly where I've ended up, except it's not manual, it's built
into the send / receive command (mine are local USB disks rather than a
remote system, so things are just a bit different).  I haven't found a
better solution yet, but I'm still back on build 111b, so I don't have the
new property replication capabilities (or complexities) to deal with.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Bob Friesenhahn

On Wed, 17 Feb 2010, Daniel Carosone wrote:


These small numbers just tell you to be more worried about defending
against the other stuff.


Let's not forget that the most common cause of data loss is human 
error!


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

2010-02-17 Thread Tiernan OToole
At the moment it's just one pool, with a plan to add the 500GB drives... What
would be recommended?

-Original Message-
From: Brandon High 
Sent: 17 February 2010 01:00
To: Tiernan OToole 
Cc: Robert Milkowski ; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Plan for upgrading a ZFS based SAN

On Tue, Feb 16, 2010 at 3:13 PM, Tiernan OToole  wrote:
> Cool... Thanks for the advice! Buy why would it be a good idea to change 
> layout on bigger disks?

On top of the reasons Bob gave, your current layout will be very
unbalanced after adding devices. You can't currently add more devices
to a raidz vdev or remove a top level vdev from a pool, so you'll be
stuck with 8 drives in a raidz2, 3 drives in a raidz, and any future
additions in additional vdevs.
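
For example, adding the 500GB drives later would end up as yet another
top-level vdev rather than growing an existing one (pool and device
names are placeholders; zpool will also warn about the mismatched
replication level and want -f):

# zpool add tank raidz c10t0d0 c10t1d0 c10t2d0
# zpool status tank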

When you say you have 2 pools, do you mean two vdevs in one pool, or
actually two pools?

-B

-- 
Brandon High : bh...@freaks.com
Indecision is the key to flexibility.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-17 Thread Robert Milkowski

On 16/02/2010 23:59, Christo Kutrovsky wrote:

Robert,

That would be pretty cool, especially if it makes it into the 2010.02 release. I
hope there are no weird special cases that pop up from this improvement.

   

I'm pretty sure it won't make 2010.03


Regarding workaround.

That's not my experience, unless it behaves differently on ZVOLs and datasets.

On ZVOLs it appears the setting kicks in live. I've tested this by turning it 
off/on and testing with iometer on an exported iSCSI device (iscsitgtd not 
comstar).
   
I haven't looked at zvol's code handling zil_disable, but with datasets 
I'm sure I'm right.



--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reading ZFS config for an extended period

2010-02-17 Thread Khyron
Ugh!  If you received a direct response from me instead of via the list,
apologies for that.


Rob:

I'm just reporting the news.  The RFE is out there.  Just like SLOGs, I
happen to think it a good idea, personally, but that's my personal opinion.
If it makes dedup more usable, I don't see the harm.


Taemun:

The issue, as I understand it, is not "use-lots-of-cpu" or "just dies from
paging".  I believe it is more to do with all of the small, random
reads/writes in updating the DDT.

Remember, the DDT is stored within the pool, just as the ZIL is if you
don't have a SLOG.  (The S in SLOG standing for "separate".)  So all the
DDT updates are in competition for I/O with the actual data deletion.  If
the DDT could be stored as a separate VDEV already, I'm sure a way would
have been hacked together by someone (likely someone on this list).  Hence,
the need for the RFE to create this functionality where it does not
currently exist.  The DDT is separate from the ARC or L2ARC.
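
If you want a feel for how big the DDT on a pool actually is (and hence
how much random I/O a mass delete implies), a build with dedup should be
able to summarise it (the pool name here is a placeholder):

# zdb -DD tank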

Here's the bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566

If I'm incorrect, someone please let me know.


Markus:

Yes, the issue would appear to be dataset size vs. RAM size.  Sounds like
an area ripe for testing, much like RAID Z3 performance.


Cheers all!

On Tue, Feb 16, 2010 at 00:20, taemun  wrote:

> The system in question has 8GB of ram. It never paged during the
> import (unless I was asleep at that point, but anyway).
>
> It ran for 52 hours, then started doing 47% kernel cpu usage. At this
> stage, dtrace stopped responding, and so iopattern died, as did
> iostat. It was also increasing ram usage rapidly (15mb / minute).
> After an hour of that, the cpu went up to 76%. An hour later, CPU
> usage stopped. Hard drives were churning throughout all of this
> (albeit at a rate that looks like each vdev is being controlled by a
> single threaded operation).
>
> I'm guessing that if you don't have enough ram, it gets stuck on the
> use-lots-of-cpu phase, and just dies from too much paging. Of course,
> I have absolutely nothing to back that up.
>
> Personally, I think that if L2ARC devices were persistent, we already
> have the mechanism in place for storing the DDT as a "separate vdev".
> The problem is, there is nothing you can run at boot time to populate
> the L2ARC, so the dedup writes are ridiculously slow until the cache
> is warm. If the cache stayed warm, or there was an option to forcibly
> warm up the cache, this could be somewhat alleviated.
>
> Cheers
>



-- 
"You can choose your friends, you can choose the deals." - Equity Private

"If Linux is faster, it's a Solaris bug." - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] getting tangled with recieved mountpoint properties

2010-02-17 Thread Daniel Carosone
I have a machine whose purpose is to be a backup server.  It has a
pool for holding backups from other machines, using zfs send|recv. 

Call the pool dpool. Inside there are datasets for hostname/poolname,
for each of the received pools.  All hosts have an rpool, some have
other pools as well.  So the upper level datasets look similar to:

 dpool/foo/rpool
 dpool/bar/rpool
 dpool/bar/lake
 dpool/bar/tank
 dpool/baz/rpool
 dpool/baz/pond

and so on.  Under each of those are the datasets from the respective
pools, received using zfs recv -d 

The trouble comes from the received properties, mountpoint in
particular.  Every one of those rpools has a dataset rpool/export with
mountpoint=/export, and home and so forth under there.

There was an issue until recently that meant these mountpoints got
some strangely-doubled prefix strings added, so they were mounted in
odd places. That was fixed, and now with an upgrade to b132, I have
many filesystems all with the same mountpoint property. At boot time,
one of these wins, and the rest fail to mount, filesystem/local svc
fails and not much else starts from there.

With the new distinction between received and local properties, I was
hoping for a solution. The zfs manpage says that I can use inherit to
mask a received property with an inherited one.  This flat-out seems
not to work, at least for mountpoint.
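
For concreteness, the attempt looks something like this (dataset name
follows the layout above):

# zfs inherit mountpoint dpool/bar/rpool/export
# zfs get -o name,property,value,source mountpoint dpool/bar/rpool/export

which, as described, still comes back with the received /export.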

I'd be happy enough if none of these was mounted, but annoyingly in
this case, the canmount property is not inheritable, so I can't just
set this somewhere near the top and be done.  My workaround so far:
 # zfs list -t filesystem -o name | egrep ^dpool | xargs -n 1 zfs set 
canmount=noauto
but this is annoying and manual and I sometimes forget when a new
filesystem is added and sent from the source host.
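
A slightly less manual version of the same idea (a sketch: the paths are
made up, and it assumes this build's zfs receive accepts -u so nothing
gets mounted at receive time):

# zfs receive -u -d dpool/foo < /backup/foo_rpool.zfs
# zfs list -H -r -t filesystem -o name dpool | xargs -n 1 zfs set canmount=noauto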

I can't use the altroot pool property, it's not persistent and anyway
would just prefix the same thing to all of the datasets, for them to
collide under there.  It would at least ensure I get the backup
machine's own /export/home/dan and /var and some others, though :)

The same issue applies to other properties as well, though they don't
cause boot failures. For example, I might set copies=2 on some dataset
on a single-disk laptop, but there's no need for that on the backup
server. I still want that to be remembered for the restore case, but
not to be applied on the backup server.

How else can I solve this problem? What else have others done on their
backup servers?

Separately, if I override recvd properties, in the restore use case
when I want to send the dataset back to the original host, is there
some way I can tell zfs send to look at the received properties rather
than the local override?  

I was hoping the new property source stuff would help with all this,
but I'm not really clear on the whole process; it doesn't quite seem
to be fully fleshed out yet.  I can imagine something like a
property-sed, that you could insert in between send and recv pipeline,
as being helpful - but a properly thought-out framework would be much
better. 

--
Dan.


pgpqNN8B2TQv1.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Christo Kutrovsky
Dan,

"loose" was a typo. I meant "lose". Interesting how a typo (write error) can 
cause a lot of confusion about what exactly I mean :) resulting in a corrupted 
interpretation.

Note that my idea/proposal is targeted at a growing number of home users. For
them, value for money is usually a much more significant factor than it is
for other users.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss