Re: [zfs-discuss] bootfs ID on zfs root

2011-05-12 Thread Jim Klimov
Sorry, I guess I'm running out of reasonable ideas then.

One thing you could try (or may already have tried) is installing Solaris not by 
JumpStart or WANBoot but from the original media (DVD or network install) to see 
if the problem persists. Maybe your flash image lacks some controller drivers, 
etc.? (I am not sure how that would be possible if you made the archive on another 
domain in the same/similar box - but perhaps some paths were marked for exclusion?)

Another idea would be to start the system with the mdb debugger (though then you'd 
need to know what to type or kick) and/or with higher verbosity (boot -m verbose 
from the eeprom, or reboot -- -m verbose from single-user), just to maybe get some 
more insight into what fails.
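For example, something like this should do it (only a sketch from memory - check 
your platform's boot docs):

ok boot -m verbose              (at the OBP "ok" prompt)
# reboot -- -m verbose          (from a running system, e.g. single-user)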

//Jim


Re: [zfs-discuss] Disk space size, used, available mismatch

2011-05-12 Thread Jim Klimov
Well, the Sun-supported way is separating /var from the common root.

In our systems we use a more fine-grained hierarchy: /usr, /opt and /var are 
separated within each BE, and /var/adm, /var/log, /var/cores, /var/crash and 
/var/mail are further split out and shared between boot environments. This requires 
quite a few tricks to set up, and is often a pain to maintain with e.g. LiveUpgrade 
(often the system doesn't come up on the first reboot after an update, because 
something got mixed up - luckily these boxes have remote consoles). However, it also 
allows quotas on specific datasets, e.g. so that core dumps don't eat up the whole 
root FS.

In your case it might make sense to separate the application software's paths 
(e.g. /opt/programname/ and /var/opt/programname/) into individual datasets with 
quotas and migrate the data via cpio.
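Just as an illustration (the pool, dataset and mountpoint names and the quota size 
below are made up - adjust them to your layout), something along these lines:

# zfs create -o mountpoint=/opt/programname.new datapool/opt-programname
# zfs set quota=10G datapool/opt-programname
# cd /opt/programname && find . -print | cpio -pdmu /opt/programname.new
  (then swap the mountpoint over to /opt/programname and remove the old copy)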

LU does not fetch system information straight from the system itself; it consults 
its own configuration (with copies in each boot environment). See /etc/lutab and 
the /etc/lu/ICF.* files (and other /etc/lu/*), but beware that manual mangling of 
these is not supported by Sun. Not that it doesn't work or help in some cases ;)

A common approach is to have a separate root pool (a slice, or better a mirror of 
two slices); depending on your base OS installation footprint, anywhere from 4GB 
(no graphics) to 20GB (enough for several BE revisions) would do. The remainder of 
the disk is given to a separate data pool, where our local zone roots live, as well 
as distros, backup data, etc. Thus there is very little third-party software 
installed in the global zones, which act more like hypervisors for many local zones 
running the actual applications and server software.

You may or may not want to keep the swap and dump volumes in the root pool - if you 
do, the required root pool size grows with your RAM size. For example, on boxes 
with 4 disks we can use one 2*20GB mirror as the root pool and another 2*20GB 
mirror as a pool for swap and dump, leaving equal amounts of disk for the separate 
data pool.
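For example, the swap/dump mirror in such a layout might look roughly like this 
(slice names and volume sizes are only placeholders):

# zpool create swapdump mirror c0t2d0s0 c0t3d0s0
# zfs create -V 8G swapdump/swap
# zfs create -V 8G swapdump/dump
# swap -a /dev/zvol/dsk/swapdump/swap
# dumpadm -d /dev/zvol/dsk/swapdump/dump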

With 4 slices for the data pool you can choose to make it RAID10, RAIDZ1 or 
RAIDZ2 - with different processing overhead/performance, different redundancy 
levels and different available space. For demo boxes with no critical performance 
requirements we use RAIDZ2, since it protects against the failure of any two disks, 
while RAID10 only protects against the failure of two specific disks (from 
different mirrors); we use RAIDZ1 when we need more space available.
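For instance, with four data slices (names are placeholders) the variants would be 
created roughly like:

# zpool create datapool raidz2 c0t0d0s3 c0t1d0s3 c0t2d0s3 c0t3d0s3
# zpool create datapool raidz  c0t0d0s3 c0t1d0s3 c0t2d0s3 c0t3d0s3
# zpool create datapool mirror c0t0d0s3 c0t1d0s3 mirror c0t2d0s3 c0t3d0s3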

HTH,
//Jim Klimov


Re: [zfs-discuss] Disk space size, used, available mismatch

2011-05-12 Thread Jim Klimov
Ah yes, regarding the backdoor to the root FS: if you choose to keep some 
un-quota'ed space hogs in the same pool as your root FS, you can look into setting 
a reservation (and/or refreservation) for the root FS datasets. For example, if 
your OS installation uses 4GB and you don't think it would exceed 6GB by itself 
(and your ETL software uses a separate dataset), you can reserve 6GB for the root 
FS alone (and hopefully its descendants - but better check the manpages):

# zfs set reservation=6G rpool/ROOT/myBeName
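Note the difference: reservation accounts for the dataset plus its descendants and 
snapshots, while refreservation covers only the dataset itself. A quick sketch to 
set and verify (same example dataset name as above):

# zfs set refreservation=6G rpool/ROOT/myBeName
# zfs get reservation,refreservation rpool/ROOT/myBeName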

HTH,
//Jim


Re: [zfs-discuss] Performance problem suggestions?

2011-05-12 Thread Jim Klimov
 The thing is- as far as I know the OS doesn't ask the disk to find a place
 to fit the data. Instead the OS tracks what space on the disk is free and
 then tells the disk where to write the data.

Yes and no - I did not formulate my idea clearly enough, sorry for the confusion ;)

Yes - the disks don't care about free blocks at all. To them these are just LBA 
sector numbers.

No - the OS does track which sectors correspond to the logical blocks it deems 
suitable for a write, and asks the disk to position its mechanical head over a 
specific track and access a specific sector. This is a slow operation which can 
only be done about 180-250 times per second for very random I/O (maybe more with 
HDD/controller caching, queuing and faster spindles).

I'm afraid that seeking to very dispersed metadata blocks, such as when traversing 
the tree during a scrub on a fragmented drive, may well qualify as very random I/O.

This reminds me of the long-standing BP Rewrite project, which would allow live 
re-arranging of ZFS data - in particular, some extent of defragmentation. More 
useful applications would be changing RAIDZ levels and the number of disks, though, 
maybe even removal of top-level VDEVs from a sufficiently empty pool... Hopefully 
the Illumos team or some other developers will push this idea into reality ;)

There was a good tip from Jim Litchfield regarding VDEV queue sizing, though. The 
current default for zfs_vdev_max_pending is probably 10, which is okay (or maybe 
even too much) for individual drives, but is not very much for arrays of many disks 
hidden behind a smart controller with its own caching and queuing, be it a SAN box 
controller or a PCI one which intercepts and reinterprets your ZFS's calls.

So maybe this is indeed a bottleneck - you would see it in iostat -Xn 1 as actv 
values close to the configured queue size.
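To check or bump the tunable, something like the following should work on 
OpenSolaris-era kernels (35 is just an illustrative value - test before relying on it):

# echo zfs_vdev_max_pending/D | mdb -k
  (print the current value)
# echo zfs_vdev_max_pending/W0t35 | mdb -kw
  (change it on the live system)
  or persistently, in /etc/system:
set zfs:zfs_vdev_max_pending = 35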

//Jim


Re: [zfs-discuss] Disk space size, used, available mismatch

2011-05-12 Thread Brad Kroeger
Thank you for your insight. This is a system that was handed down to me when 
another sysadmin went off to greener pastures. There were no quotas set on the 
system. I used zfs destroy to free up some space and did put a quota on it. I 
still have 0 free space available; I think this is due to the quota limit. Before 
I rebooted I had about a 68GB boot pool. After the zfs destroy I had about 1.7GB 
free. I put a 66.5GB quota on it, which I am now hitting, so services will not 
start up.

I don't want to saw off the tree branch I am sitting on so I am reluctant to 
increase the quota too much.  Here are some questions I have:

1) zfs destroy did free up a snapshot but it is still showing up in lustatus.  
How do I correct this?

2) This system is installed with everything under / so the ETL team can fill up 
root without bounds.  What are the best practices for separating filesystems in 
ZFS so I can bound the ETL team without affecting the OS?

3) I have captured all the critical data on to SAN disk and am thinking about 
jumpstarting the host cleanly.  That way I will have a known baseline to start 
with.  Does anyone have any suggestions here?  

4) We deal with very large data sets.  These usually live just in Oracle, but this 
host is for ETL and Informatica processing.  What would be a good quota to set so 
that I have a back door onto the system to take care of problems?

Thanks for your feedback.


Re: [zfs-discuss] ZFS backup and restore

2011-05-12 Thread Naveen surisetty
Hi,

Thanks for the response. Here is my problem.

I have a ZFS stream backup taken on ZFS version 15; I have since upgraded my OS, 
so the new ZFS version is 22. The restore from the old stream backup into a new 
ZFS pool went well, but on reboot I got an error: unable to mount pool tank.

So there are incompatibilities between ZFS versions, especially around send/receive.

So I am looking for alternative backup solutions.

Thanks
kumar


[zfs-discuss] howto: make a pool with ashift=X

2011-05-12 Thread Daniel Carosone
On Thu, May 12, 2011 at 12:23:55PM +1000, Daniel Carosone wrote:
 They were also sent from an ashift=9 to an ashift=12 pool

This reminded me to post a note describing how I make pools with a
different ashift.  I do this both for pools on USB flash sticks, and
on disks with an underlying 4KB block size, such as my 2TB WD EARS
drives.  If I had pools on SATA flash SSDs, I'd do it for those too.

The trick comes from noting that stmfadm create-lu has a blk option
for the block size of the iSCSI volume to be presented.  Creating a
pool with at least one disk (per top-level vdev) on an iSCSI initiator
pointing at such a target will cause zpool to set ashift for the vdev
accordingly.

This works even when the initiator and target are the same host, over
the loopback interface.  Oddly, however, it does not work if the host
is Solaris Express b151 - it does work on OI b148.  Something has
changed in zpool creation in the interim.

Anyway, my recipe is to:

 * boot OI b148 in a vm. 
 * make a zfs dataset to house the working files (reason will be clear
   below).
 * In that dataset, I make sparse files corresponding in size and
   number to the disks that will eventually hold the pool (this makes
   a pool with the same size and number of metaslabs as it would have
   had natively).
 * Also make a sparse zvol of the same size.
 * stmfadm create-lu -p blk=4096 (or whatever, as desired) on the
   zvol, and make available.
 * get the iscsi initiator to connect the lu as a new disk device
 * zpool create, using all bar 1 of the files, and the iscsi disk, in
   the shape you want your pool (raidz2, etc).
 * zpool replace the iscsi disk with the last unused file (now you can
   tear down the lu and zvol)
 * zpool export the pool-on-files.
 * zfs send the dataset housing these files to the machine that has
   the actual disks (much faster than rsync even with the sparse files
   option, since it doesn't have to scan for holes).
 * zpool import the pool from the files
 * zpool upgrade, if you want newer pool features, like crypto.
 * zpool set autoexpand=on, if you didn't actually use files of the
   same size.
 * zpool replace a file at a time onto the real disks.

Hmm.. when written out like that, it looks a lot more complex than it
really is.. :-)
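A rough sketch of the core commands, in case it helps (all names, sizes and the LU 
GUID here are placeholders, and the usual COMSTAR target / iSCSI initiator plumbing 
is omitted - treat it as an outline rather than something to paste):

# zfs create tank/ashift-work
# cd /tank/ashift-work ; mkfile -n 2000g d1 d2 d3 d4 d5
# zfs create -s -V 2000g tank/fakedisk
# stmfadm create-lu -p blk=4096 /dev/zvol/rdsk/tank/fakedisk
# stmfadm add-view <lu-guid>
  (connect the initiator; the LU shows up as a new cXtYdZ disk)
# zpool create mypool raidz2 /tank/ashift-work/d1 /tank/ashift-work/d2 \
    /tank/ashift-work/d3 /tank/ashift-work/d4 cXtYdZ
# zpool replace mypool cXtYdZ /tank/ashift-work/d5
# zpool export mypool
  (zfs send tank/ashift-work to the target box, import the pool there, and
   zpool replace the files onto the real disks one at a time)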

Note that if you want lots of mirrors, you'll need an iscsi device per
mirror top-level vdev.

Note also that the image created inside the iscsi device is not
identical to what you want on a device with 512-byte sector emulation,
since the label is constructed for a 4k logical sector size.  zpool
replace takes care of this when labelling the replacement disk/file.

I also played around with another method, using mdb to overwrite the
disk model table to match my disks and make the pool directly on them
with the right ashift.

  http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;im=10;i=sd_flash_dev_table

This also no longer works on b151 (though the table still exists), so I
need the vm anyway, and the iscsi method is easier. 

Finally, because this doesn't work on b151, it's also only good for
creating new pools; I don't know how to expand a pool with new vdevs
to have the right ashift in those vdevs. 

--
Dan.



Re: [zfs-discuss] Backup complete rpool structure and data to tape

2011-05-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Peter Jeremy
 
 Finally, the send/recv protocol is not guaranteed to be compatible
 between ZFS versions.  

Years ago, there was a comment in the man page that said this.  Here it is:
The format of the stream is evolving. No backwards  compatibility is
guaranteed. You may not be able to receive your streams on future versions
of ZFS.

But in the last several years, backward/forward compatibility has always
been preserved, so despite the warning, it was never a problem.

In more recent versions, the man page says:  The format of the stream is
committed. You will be  able to receive your streams on future versions of
ZFS.



Re: [zfs-discuss] Backup complete rpool structure and data to tape

2011-05-12 Thread Arjun YK
Thanks everyone. Your inputs helped me a lot.

The 'rpool/ROOT' mountpoint is set to 'legacy' as I don't see any reason to
mount it. But I am not certain whether that can cause any issues in the future,
or whether it is the right thing to do. Any suggestions?


Thanks
Arjun


Re: [zfs-discuss] ZFS backup and restore

2011-05-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Naveen surisetty
 
 I have a ZFS stream backup taken on ZFS version 15; I have since upgraded
 my OS, so the new ZFS version is 22. The restore from the old stream backup
 into a new ZFS pool went well, but on reboot I got an error: unable to mount
 pool tank.
 
 So there are incompatibilities between ZFS versions, especially around
 send/receive.
 
 So I am looking for alternative backup solutions.

Sorry, not correct.  As Richard said, please post the exact error message
you are seeing, because what you said, the way you said it, doesn't make
sense.

And since I just posted this a minute ago in a different thread, I'll just
quote myself again here:

Years ago, there was a comment in the man page that said this:  The format
of the stream is evolving. No backwards  compatibility is guaranteed. You
may not be able to receive your streams on future versions of ZFS.

But in the last several years, backward/forward compatibility has always
been preserved, so despite the warning, it was never a problem.

In more recent versions, the man page says:  The format of the stream is
committed. You will be able to receive your streams on future versions of
ZFS.



Re: [zfs-discuss] Backup complete rpool structure and data to tape

2011-05-12 Thread Fajar A. Nugraha
On Thu, May 12, 2011 at 8:31 PM, Arjun YK arju...@gmail.com wrote:
 Thanks everyone. Your inputs helped me a lot.
 The 'rpool/ROOT' mountpoint is set to 'legacy' as I don't see any reason to
 mount it. But I am not certain whether that can cause any issues in the future,
 or whether it is the right thing to do. Any suggestions?

The general answer is if it ain't broken, don't fix it.

See
http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Complete_Solaris_ZFS_Root_Pool_Recovery
for an example of bare-metal rpool recovery using nfs + zfs send/receive.
For your purpose, it's probably easier to just follow that example and have
Legato back up the images created from zfs send.
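For example (pool name, snapshot name and the backup path are just illustrative):

# zfs snapshot -r rpool@backup
# zfs send -Rv rpool@backup > /net/backuphost/export/backups/rpool.zfssnap
  (then point Legato at the resulting file)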

-- 
Fajar


Re: [zfs-discuss] zpool scrub on b123

2011-05-12 Thread Karl Rossing
I have an outage tonight and would like to swap out the LSI 3801 for an 
LSI 9200.


Should I zpool export before swapping the card?


On 04/16/2011 10:45 AM, Roy Sigurd Karlsbakk wrote:

I'm going to wait until the scrub is complete before diving in some
more.

I'm wondering if replacing the LSI SAS 3801E with an LSI SAS 9200-8e
might help too.

I've seen similar errors with 3801 - seems to be SAS timeouts. Reboot the box 
and it'll probably work well again for a while. I replaced the 3801 with a 9200 
and the problem was gone.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It 
is an elementary imperative for all pedagogues to avoid excessive use of idioms of 
foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.






Re: [zfs-discuss] Performance problem suggestions?

2011-05-12 Thread Don
 This is a slow operation which can only be done about 180-250 times per second
 for very random I/Os (may be more with HDD/Controller caching, queuing and
 faster spindles).
 I'm afraid that seeking to very dispersed metadata blocks, such as traversing 
 the
 tree during a scrub on a fragmented drive, may qualify as a very random I/O.
And that's the thing - I would understand if my scrub was slow because the disks 
were being hammered by IOPS, but - all joking aside - my pool is almost entirely 
idle according to iostat -Xn.


Re: [zfs-discuss] zpool scrub on b123

2011-05-12 Thread Richard Elling
On May 12, 2011, at 1:53 PM, Karl Rossing wrote:

 I have an outage tonight and would like to swap out the LSI 3801 for an LSI 
 9200
 
 Should I zpool export before the swaping the card?

A clean shutdown is sufficient. You might need to run devfsadm -c disk to rebuild 
the device tree.
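For example (a quick sketch):

# devfsadm -c disk
  (recreate the /dev/dsk links for the new HBA)
# zpool status -x
  (verify the pools came back healthy)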
 -- richard
