Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On Wed, Feb 20, 2013 at 4:49 PM, Markus Grundmann wrote:
> Whenever I modify zfs pools or filesystems it's possible to destroy [on a
> bad day :-)] my data. A new property "protected=on|off" in the pool and/or
> filesystem can help the administrator against data loss (e.g. a "zpool
> destroy tank" or "zfs destroy" command will be rejected when the
> "protected=on" property is set).
>
> Is there anywhere on this list where this feature request can be
> discussed or forwarded? I hope you have understood my post ;-)

I like the idea and it is likely not very hard to implement. This is very
similar to how snapshot holds work.

# zpool upgrade -v | grep -i hold
 18  Snapshot user holds

So long as you aren't using a really ancient zpool version, you could use
this feature to protect your file systems.

# zfs create a/b
# zfs snapshot a/b@snap
# zfs hold protectme a/b@snap
# zfs destroy a/b
cannot destroy 'a/b': filesystem has children
use '-r' to destroy the following datasets:
a/b@snap
# zfs destroy -r a/b
cannot destroy 'a/b@snap': snapshot is busy

Of course, snapshots aren't free if you write to the file system. A way
around that is to create an empty file system within the one that you are
trying to protect.

# zfs create a/1
# zfs create a/1/hold
# zfs snapshot a/1/hold@hold
# zfs hold 'saveme!' a/1/hold@hold
# zfs holds a/1/hold@hold
NAME           TAG      TIMESTAMP
a/1/hold@hold  saveme!  Wed Feb 20 15:06:29 2013
# zfs destroy -r a/1
cannot destroy 'a/1/hold@hold': snapshot is busy

Extending the hold mechanism to filesystems and volumes would be quite nice.

Mike
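As an aside, a minimal sketch of how the "protection" in the example above is
lifted when a destroy really is intended (same dataset and hold tag as in the
example):

# show the user holds that are blocking the destroy
zfs holds a/1/hold@hold

# release the hold, after which the recursive destroy succeeds
zfs release 'saveme!' a/1/hold@hold
zfs destroy -r a/1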
Re: [zfs-discuss] Very poor small-block random write performance
iozone doesn't vary the blocksize during the test; it's a very artificial
test, but it's useful for gauging performance under different scenarios. So
for this test all of the writes would have been 64k blocks, 128k, etc. for
that particular step.

Just as another point of reference, I reran the test with a Crucial M4 SSD
and the result for 16G/64k was 35 MB/s (a 5x improvement). I'll rerun that
part of the test with zpool iostat and see what it says.

Mike

On Thu, Jul 19, 2012 at 7:27 PM, Jim Klimov wrote:
>> This is normal. The problem is that with zfs 128k block sizes, zfs
>> needs to re-read the original 128k block so that it can compose and
>> write the new 128k block. With sufficient RAM, this is normally avoided
>> because the original block is already cached in the ARC.
>>
>> If you were to reduce the zfs blocksize to 64k then the performance dive
>> at 64k would go away but there would still be write performance loss at
>> sizes other than a multiple of 64k.
>
> I am not sure if I misunderstood the question or Bob's answer,
> but I have a gut feeling it is not fully correct: ZFS block
> sizes for files (filesystem datasets) are, at least by default,
> dynamically-sized depending on the contiguous write size as
> queued by the time a ZFS transaction is closed and flushed to
> disk. In case of RAIDZ layouts, this logical block is further
> striped over several sectors on several disks in one of the
> top-level vdevs, starting with parity sectors for each "row".
>
> So, if the test logically overwrites full blocks of test data
> files, reads for recombination are not needed (but that can
> be checked for with "iostat 1" or "zpool iostat" - to see how
> many reads do happen during write-tests?) Note that some reads
> will show up anyway, i.e. to update ZFS metadata (the block
> pointer tree).
>
> However, if the test file was written in 128K blocks and then
> is rewritten with 64K blocks, then Bob's answer is probably
> valid - the block would have to be re-read once for the first
> rewrite of its half; it might be taken from cache for the
> second half's rewrite (if that comes soon enough), and may be
> spooled to disk as a couple of 64K blocks or one 128K block
> (if both changes come soon after each other - within one TXG).
>
> HTH,
> //Jim Klimov
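Since the follow-up above mentions rerunning the test while watching zpool
iostat, here is a hedged sketch of what that check could look like (the pool
name "tank" is assumed); a significant read column during a pure rewrite
phase is the read-modify-write signature Jim describes:

# one-second samples while the iozone rewrite pass runs
zpool iostat tank 1

# per-vdev breakdown, useful for spotting an imbalanced or slow device
zpool iostat -v tank 1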
Re: [zfs-discuss] Very poor small-block random write performance
vfs.zfs.txg.synctime_ms: 1000
vfs.zfs.txg.timeout: 5

On Thu, Jul 19, 2012 at 8:47 PM, John Martin wrote:
> On 07/19/12 19:27, Jim Klimov wrote:
>
>> However, if the test file was written in 128K blocks and then
>> is rewritten with 64K blocks, then Bob's answer is probably
>> valid - the block would have to be re-read once for the first
>> rewrite of its half; it might be taken from cache for the
>> second half's rewrite (if that comes soon enough), and may be
>> spooled to disk as a couple of 64K blocks or one 128K block
>> (if both changes come soon after each other - within one TXG).
>
> What are the values for zfs_txg_synctime_ms and zfs_txg_timeout
> on this system (FreeBSD, IIRC)?
Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones
On Tue, Jul 10, 2012 at 6:29 AM, Jordi Espasa Clofent wrote:
> Thanks for your explanation Fajar. However, take a look at the next lines:
>
> # available ZFS in the system
>
> root@sct-caszonesrv-07:~# zfs list
> NAME                         USED  AVAIL  REFER  MOUNTPOINT
> opt                          532M  34.7G   290M  /opt
> opt/zones                    243M  34.7G    32K  /opt/zones
> opt/zones/sct-scw02-shared   243M  34.7G   243M  /opt/zones/sct-scw02-shared
> static                       104K  58.6G    34K  /var/www/
>
> # creating a file in /root (UFS)
>
> root@sct-caszonesrv-07:~# dd if=/dev/zero of=file.bin count=1024 bs=1024
> 1024+0 records in
> 1024+0 records out
> 1048576 bytes (1.0 MB) copied, 0.0545957 s, 19.2 MB/s
> root@sct-caszonesrv-07:~# pwd
> /root
>
> # enable compression in some ZFS zone
>
> root@sct-caszonesrv-07:~# zfs set compression=on opt/zones/sct-scw02-shared
>
> # copying the previous file to this zone
>
> root@sct-caszonesrv-07:~# cp /root/file.bin /opt/zones/sct-scw02-shared/root/
>
> # checking the file size in the origin dir (UFS) and the destination one
> # (ZFS with compression enabled)
>
> root@sct-caszonesrv-07:~# ls -lh /root/file.bin
> -rw-r--r-- 1 root root 1.0M Jul 10 13:21 /root/file.bin
>
> root@sct-caszonesrv-07:~# ls -lh /opt/zones/sct-scw02-shared/root/file.bin
> -rw-r--r-- 1 root root 1.0M Jul 10 13:22 /opt/zones/sct-scw02-shared/root/file.bin
>
> # both files have exactly the same cksum!
>
> root@sct-caszonesrv-07:~# cksum /root/file.bin
> 3018728591 1048576 /root/file.bin
>
> root@sct-caszonesrv-07:~# cksum /opt/zones/sct-scw02-shared/root/file.bin
> 3018728591 1048576 /opt/zones/sct-scw02-shared/root/file.bin
>
> So... I don't see any size variation with this test.

ls(1) tells you how much data is in the file - that is, how many bytes of
data an application will see if it reads the whole file. du(1) tells you how
many disk blocks are used. If you look at the stat structure in stat(2), ls
reports st_size, du reports st_blocks.

Blocks full of zeros are special to zfs compression - it recognizes them and
stores no data. Thus, a file that contains only zeros will only require
enough space to hold the file metadata.

$ zfs list -o compression ./
COMPRESS
      on
$ dd if=/dev/zero of=1gig count=1024 bs=1024k
1024+0 records in
1024+0 records out
$ ls -l 1gig
-rw-r--r--   1 mgerdts  staff  1073741824 Jul 10 07:52 1gig
$ du -k 1gig
0       1gig

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
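If you want both numbers on a single line, a minimal sketch (using the 1gig
file from the example above):

# ls -s adds a leading column of allocated blocks (512-byte units on
# Solaris), so one line shows both st_blocks and st_size; a large
# st_size with a tiny block count indicates compression or sparseness
$ ls -ls 1gig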
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On Tue, Jun 12, 2012 at 11:17 AM, Sašo Kiselkov wrote:
> On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote:
>> find where your nics are bound too
>>
>> mdb -k
>> ::interrupts
>>
>> create a processor set including those cpus [ so just the nic code will
>> run there ]
>>
>> andy
>
> Tried and didn't help, unfortunately. I'm still seeing drops. What's
> even funnier is that I'm seeing drops when the machine is sync'ing the
> txg to the zpool. So looking at a little UDP receiver I can see the
> following input stream bandwidth (the stream is constant bitrate, so
> this shouldn't happen):

If processing in interrupt context (use intrstat) is dominating cpu usage,
you may be able to use pcitool to cause the device generating all of those
expensive interrupts to be moved to another CPU.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
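A minimal sketch of the first check suggested above (intrstat ships with
Solaris; the five-second interval is arbitrary):

# sample interrupt activity every 5 seconds; the output shows, per device
# instance, how much time each CPU spends in that device's interrupt
# handler - a single CPU pegged by one NIC is the pattern to look for
intrstat 5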
Re: [zfs-discuss] Strange hang during snapshot receive
On Thu, May 10, 2012 at 5:37 AM, Ian Collins wrote:
> I have an application I have been using to manage data replication for a
> number of years. Recently we started using a new machine as a staging
> server (not that new, an x4540) running Solaris 11 with a single pool built
> from 7x6 drive raidz. No dedup and no reported errors.
>
> On that box and nowhere else I see empty snapshots taking 17 or 18 seconds
> to write. Everywhere else they return in under a second.
>
> Using truss and the last published source code, it looks like the pause is
> between a printf and the call to zfs_ioctl and there aren't any other
> function calls between them:

For each snapshot in a stream, there is one zfs_ioctl() call. During that
time, the kernel will read the entire substream (that is, for one snapshot)
from the input file descriptor.

> 100.5124  0.0004  open("/dev/zfs", O_RDWR|O_EXCL)              = 10
> 100.7582  0.0001  read(7, "\0\0\0\0\0\0\0\0ACCBBAF5".., 312)    = 312
> 100.7586  0.      read(7, 0x080464F8, 0)                        = 0
> 100.7591  0.      time()                                        = 1336628656
> 100.7653  0.0035  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040CF0)    = 0
> 100.7699  0.0022  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900)    = 0
> 100.7740  0.0016  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040580)    = 0
> 100.7787  0.0026  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x080405B0)    = 0
> 100.7794  0.0001  write(1, " r e c e i v i n g   i n".., 75)    = 75
> 118.3551  0.6927  ioctl(8, ZFS_IOC_RECV, 0x08042570)            = 0
> 118.3596  0.0010  ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900)    = 0
> 118.3598  0.      time()                                        = 1336628673
> 118.3600  0.      write(1, " r e c e i v e d   3 1 2".., 45)    = 45
>
> zpool iostat (1 second interval) for the period is:
>
> tank        12.5T  6.58T    175      0   271K      0
> tank        12.5T  6.58T    176      0   299K      0
> tank        12.5T  6.58T    189      0   259K      0
> tank        12.5T  6.58T    156      0   231K      0
> tank        12.5T  6.58T    170      0   243K      0
> tank        12.5T  6.58T    252      0   295K      0
> tank        12.5T  6.58T    179      0   200K      0
> tank        12.5T  6.58T    214      0   258K      0
> tank        12.5T  6.58T    165      0   210K      0
> tank        12.5T  6.58T    154      0   178K      0
> tank        12.5T  6.58T    186      0   221K      0
> tank        12.5T  6.58T    184      0   215K      0
> tank        12.5T  6.58T    218      0   248K      0
> tank        12.5T  6.58T    175      0   228K      0
> tank        12.5T  6.58T    146      0   194K      0
> tank        12.5T  6.58T     99    258   209K  1.50M
> tank        12.5T  6.58T    196    296   294K  1.31M
> tank        12.5T  6.58T    188    130   229K   776K
>
> Can anyone offer any insight or further debugging tips?

I have yet to see a time when zpool iostat tells me something useful. I'd
take a look at "iostat -xzn 1" or similar output. It could point to
imbalanced I/O or a particular disk that has abnormally high service times.

Have you installed any SRUs? If not, you could be seeing:

7060894 zfs recv is excruciatingly slow

which is fixed in Solaris 11 SRU 5. If you are using zones and are using any
https pkg(5) origins (such as https://pkg.oracle.com/solaris/support), I
suggest reading

https://forums.oracle.com/forums/thread.jspa?threadID=2380689&tstart=15

before updating to SRU 6 (SRU 5 is fine, however). The fix for the problem
mentioned in that forums thread should show up in an upcoming SRU via CR
7157313.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
On Mon, Mar 26, 2012 at 6:18 PM, Bob Friesenhahn wrote:
> On Mon, 26 Mar 2012, Andrew Gabriel wrote:
>
>> I just played and knocked this up (note the stunning lack of comments,
>> missing optarg processing, etc)...
>> Give it a list of files to check...
>
> This is a cool program, but programmers were asking (and answering) this
> same question 20+ years ago before there was anything like SEEK_HOLE.
>
> If file space usage is less than file directory size then it must contain
> a hole. Even for compressed files, I am pretty sure that Solaris reports
> the uncompressed space usage.

That's not the case.

# zfs create -o compression=on rpool/junk
# perl -e 'print "foo" x 10' > /rpool/junk/foo
# ls -ld /rpool/junk/foo
-rw-r--r--   1 root     root          30 Mar 26 18:25 /rpool/junk/foo
# du -h /rpool/junk/foo
 16K   /rpool/junk/foo
# truss -t stat -v stat du /rpool/junk/foo
...
lstat64("foo", 0x08047C40)                      = 0
    d=0x02B90028 i=8 m=0100644 l=1  u=0     g=0     sz=30
        at = Mar 26 18:25:25 CDT 2012  [ 1332804325.742827733 ]
        mt = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
        ct = Mar 26 18:25:25 CDT 2012  [ 1332804325.889143166 ]
    bsz=131072 blks=32    fs=zfs

Notice that it says it has 32 512 byte blocks. The mechanism you suggest
does work for every other file system that I've tried it on.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] test for holes in a file?
2012/3/26 ольга крыжановская:
> How can I test if a file on ZFS has holes, i.e. is a sparse file,
> using the C api?

See SEEK_HOLE in lseek(2).

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Any rhyme or reason to disk dev names?
On Wed, Dec 21, 2011 at 1:58 AM, Matthew R. Wilson wrote:
> Hello,
>
> I am curious to know if there is an easy way to guess or identify the
> device names of disks. Previously the /dev/dsk/c0t0d0s0 system made sense
> to me... I had a SATA controller card with 8 ports, and they showed up
> with the numbers 1-8 in the "t" position of the device name.
>
> But I just built a new system with two LSI SAS HBAs in it, and my device
> names are along the lines of:
> /dev/dsk/c0t5000CCA228C0E488d0
>
> I could not find any correlation between that identifier and the a)
> controller the disk was plugged in to, or b) the port number on the
> controller. The only way I could make a mapping of device name to
> controller port was to add one drive at a time, reboot the system, and run
> "format" to see which new disk name shows up.
>
> I'm guessing there's a better way, but I can't find any obvious answer as
> to how to determine which port on my LSI controller card will correspond
> with which seemingly random device name. Can anyone offer any suggestions
> on a way to predict the device naming, or at least get the system to list
> the disks after I insert one without rebooting?

Depending on the hardware you are using, you may be able to benefit from
croinfo.

$ croinfo
D:devchassis-path                   t:occupant-type  c:occupant-compdev
----------------------------------  ---------------  ---------------------
/dev/chassis//SYS/SASBP/HDD0/disk   disk             c0t5000CCA012B66E90d0
/dev/chassis//SYS/SASBP/HDD1/disk   disk             c0t5000CCA012B68AC8d0

The text in the left column represents text that should be printed on the
corresponding disk slots.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 4:40 PM, Francois Dion wrote:
> It is on openindiana 151a, no separate /var as far as I can tell. But I'll
> have to test this on solaris11 too when I get a chance.
>
> The problem is that if I
>
> zfs mount -o mountpoint=/tmp/rescue (or whatever) rpool/ROOT/openindiana
>
> I get a "cannot mount /mnt/rpool: directory is not empty".
>
> The reason for that is that I had to do a zpool import -R /mnt/rpool
> rpool (or wherever I mount it, it doesn't matter) before I could do a
> zfs mount, else I don't have access to the rpool zpool for zfs to do
> its thing.
>
> chicken / egg situation? I miss the old failsafe boot menu...

You can mount it pretty much anywhere:

mkdir /tmp/foo
zfs mount -o mountpoint=/tmp/foo ...

I'm not sure when the temporary mountpoint option (-o mountpoint=...) came
in. If it's not valid syntax then:

mount -F zfs rpool/ROOT/solaris /tmp/foo

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 3:01 PM, Francois Dion wrote:
> I've hit an interesting (not) problem. I need to remove a problematic
> ld.config file (due to an improper crle...) to boot my laptop. This is
> OI 151a, but fundamentally this is zfs, so I'm asking here.
>
> What I did after booting the live cd and su:
> mkdir /tmp/disk
> zpool import -R /tmp/disk -f rpool
>
> export shows up in there and rpool also, but in rpool there is only
> boot and etc.
>
> zfs list shows rpool/ROOT/openindiana as mounted on /tmp/disk and I
> see dump and swap, but no var. rpool/ROOT shows as legacy, so I
> figured, maybe mount that.
>
> mount -F zfs rpool/ROOT /mnt/rpool

That dataset (rpool/ROOT) should never have any files in it. It is just a
"container" for boot environments. You can see which boot environments exist
with:

zfs list -r rpool/ROOT

If you are running Solaris 11, the boot environment's root dataset will show
a mountpoint property value of /. Assuming it is called "solaris" you can
mount it with:

zfs mount -o mountpoint=/mnt/rpool rpool/ROOT/solaris

If the system is running Solaris 11 (and was not updated from Solaris 11
Express), it will have a separate /var dataset.

zfs mount -o mountpoint=/mnt/rpool/var rpool/ROOT/solaris/var

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] FS Reliability WAS: about btrfs and zfs
On Fri, Oct 21, 2011 at 8:02 PM, Fred Liu wrote:
>> 3. Do NOT let a system see drives with more than one OS zpool at the
>> same time (I know you _can_ do this safely, but I have seen too many
>> horror stories on this list that I just avoid it).
>
> Can you elaborate on #3? In what situation will it happen?

Some people have trained their fingers to use the -f option on every command
that supports it to force the operation. For instance, how often do you do
rm -rf vs. rm -r and answer questions about every file? If various zpool
commands (import, create, replace, etc.) are used against the wrong disk
with a force option, you can clobber a zpool that is in active use by
another system.

In a previous job, my lab environment had a bunch of LUNs presented to
multiple boxes. This was done for convenience in an environment where there
would be little impact if an errant command were issued. I'd never do that
in production without some form of I/O fencing in place.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!
On Thu, Aug 4, 2011 at 2:47 PM, Stuart James Whitefish wrote:
> # zpool import -f tank
>
> http://imageshack.us/photo/my-images/13/zfsimportfail.jpg/

I encourage you to open a support case and ask for an escalation on CR
7056738.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] zfs rename query
On Wed, Jul 27, 2011 at 6:37 AM, Nishchaya Bahuguna wrote:
> Hi,
>
> I have a query regarding the zfs rename command.
>
> There are 5 zones and my requirement is to change the zone paths using zfs
> rename.
>
> + zoneadm list -cv
>   ID NAME        STATUS   PATH                 BRAND    IP
>    0 global      running  /                    native   shared
>   34 public      running  /txzone/public       native   shared
>   35 internal    running  /txzone/internal     native   shared
>   36 restricted  running  /txzone/restricted   native   shared
>   37 needtoknow  running  /txzone/needtoknow   native   shared
>   38 sandbox     running  /txzone/sandbox      native   shared
>
> A whole root zone was configured and installed. The rest of the 4 zones
> were cloned from it.
>
> zoneadm -z clone public
>
> zfs get origin lists the origin as for all 4 zones.
>
> I run zfs rename on 4 of these cloned zones and it throws a device busy
> error because of the parent-child relationship.

I think you are getting the device busy error for a different reason. I
just did the following:

zfs create -o mountpoint=/zones rpool/zones
zonecfg -z z1 'create; set zonepath=/zones/z1'
zoneadm -z z1 install
zonecfg -z z1c1 'create -t z1; set zonepath=/zones/z1c1'
zonecfg -z z1c2 'create -t z1; set zonepath=/zones/z1c2'
zoneadm -z z1c1 clone z1
zoneadm -z z1c2 clone z1

At this point, I have the following:

bash-3.2# zfs list -r -o name,origin rpool/zones
NAME                       ORIGIN
rpool/zones                -
rpool/zones/z1             -
rpool/zones/z1@SUNWzone1   -
rpool/zones/z1@SUNWzone2   -
rpool/zones/z1c1           rpool/zones/z1@SUNWzone1
rpool/zones/z1c2           rpool/zones/z1@SUNWzone2

Next, I decide that I would like z1c1 to be rpool/new/z1c1 instead of its
current place. Note that this will also change the mountpoint, which breaks
the zone.

bash-3.2# zfs create -o mountpoint=/new rpool/new
bash-3.2# zfs rename rpool/zones/z1c1 rpool/new/z1c1
bash-3.2# zfs list -o name,origin -r /new
NAME            ORIGIN
rpool/new       -
rpool/new/z1c1  rpool/zones/z1@SUNWzone1

To get a "device busy" error, I need to cause a situation where the zonepath
cannot be unmounted. Having the zone running is a good way to do that:

bash-3.2# zoneadm -z z1c2 boot
WARNING: zone z1c1 is installed, but its zonepath /zones/z1c1 does not exist.
bash-3.2# zfs rename rpool/zones/z1c2 rpool/new/z1c2
cannot unmount '/zones/z1c2': Device busy

> I guess that can be handled with zfs promote because promote would swap
> the parent and child.

You would need to do this to rename a dataset that is the origin (one that
is cloned), not the clones. That is, if you wanted to rename the dataset for
your public zone or I wanted to rename the dataset for z1, then you would
need to promote the datasets for all of the clones. This is a known issue.

6472202 'zfs rollback' and 'zfs rename' require that clones be unmounted

> So, how do I make it work when there are multiple zones cloned from a
> single parent? Is there a way that zfs rename can work for ALL the zones
> rather than working with two zones at a time?

As I said above.

> Also, is there a command line option available for sorting the datasets in
> correct dependency order?

"zfs list -r -o name,origin" is a good starting point. I suspect that it
doesn't give you exactly the output you are looking for.
FWIW, the best way to achieve what you are after without breaking the zones
is going to be along the lines of:

zlogin z1c1 init 0
zoneadm -z z1c1 detach
zfs rename rpool/zones/z1c1 rpool/new/z1c1
zonecfg -z z1c1 'set zonepath=/new/z1c1'
zoneadm -z z1c1 attach
zoneadm -z z1c1 boot

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] What is ".$EXTEND/$QUOTA" ?
On Tue, Jul 19, 2011 at 2:39 PM, Orvar Korvar wrote:
> I am using S11E, and have created a zpool on a single disk as storage. In
> several directories, I can see a directory called ".$EXTEND/$QUOTA". What
> is it for? Can I delete it?

Perhaps this is of help.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/smbsrv/smb_pathname.c#752

752 /*
753  * smb_pathname_preprocess_quota
754  *
755  * There is a special file required by windows so that the quota
756  * tab will be displayed by windows clients.  This is created in
757  * a special directory, $EXTEND, at the root of the shared file
758  * system.  To hide this directory prepend a '.' (dot).
759  */

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> not only on hardware built for dedicated storage.
>
> Sparse-root vs. full-root zones, or disk images of VMs;
> are they stuffed in one rpool or spread between rpool and
> data pools - that detail is not actually the point of the thread.
>
> Actual useability of dedup for savings and gains on these
> tasks (preferably working also on low-mid-range boxes,
> where adding a good enterprise SSD would double the
> server cost - not only on those big good systems with
> tens of GB of RAM), and hopefully simplifying the system
> configuration and maintenance - that is indeed the point
> in question.
>
> //Jim

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Non-Global zone recovery
On Thu, Jul 7, 2011 at 2:41 PM, Ram kumar wrote:
> Hi Cindy,
>
> Thanks for the email.
>
> We are using Solaris 10 without Live Upgrade.
>
> Tested the following in the sandbox environment:
>
> 1) We have one non-global zone (TestZone) which is running on the Test
>    zpool (SAN)
>
> 2) Don't see the zpool or the non-global zone after re-image of the global
>    zone.
>
> 3) Imported zpool Test
>
> Now I am trying to create the non-global zone and it is giving an error:
>
> bash-3.00# zonecfg -z Test
> Test: No such zone configured
> Use 'create' to begin configuring a new zone.
> zonecfg:Test> create -a /zones/Test
> invalid path to detached zone

If you use create -a, it requires that SUNWdetached.xml exist as a means for
configuring the various properties (e.g. zonepath, brand, etc.) and
resources (inherit-pkg-dir, net, fs, device, etc.) for the zone. Since you
don't have the SUNWdetached.xml, you can't use it.

Assuming you have a backup of the system, you could restore a copy of
/etc/zones/ to /etc/zones/restored-.xml, then run:

zonecfg -z create -t restored-

If that's not an option or is just too inconvenient, use zonecfg to
configure the zone just like you did initially. That is, do not use
"create -a"; use "create", "create -b", or "create -t " followed by
whatever property settings and added resources are appropriate. After you
get past zonecfg, you should be able to:

zoneadm -z attach

If the package and patch levels don't match up (the global zone perhaps was
installed from a newer update or has newer patches):

zoneadm -z attach -U
or
zoneadm -z attach -u

Since you seem to be doing this in a test environment to prepare for bad
things to happen, I'd suggest that you make it a standard practice when you
are done configuring a zone to do:

zonecfg -z export > /zonecfg.export

Then if you need to recover the zone using only the things that are on the
SAN, you can do:

zpool import ...
zonecfg -z -f /zonecfg.export
zoneadm -z attach [-u|-U]

Any follow-ups should probably go to Oracle Support or zones-discuss. Your
problems are not related to zfs.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] FW: Solaris panic
ippy genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11
> Version snv_151a 64-bit
> Mar 17 15:28:51 zippy genunix: [ID 877030 kern.notice] Copyright (c) 1983,
> 2010, Oracle and/or its affiliates. All rights reserved.
>
> Can anyone help?
>
> Regards
> Karl

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
[zfs-discuss] zfs-nfs-sun 7000 series
Hello,

I have a Sun 7000 series NAS device. I am trying to back it up via an NFS
mount on a Solaris 10 server running Networker 7.6.1. It works, but it is
extremely slow; I have tested other mounts and they work much faster. The
only difference (that I can see) between the two mounts is the underlying
file system, zfs vs ufs. Any thoughts on how to speed up the backup of the
Sun 7000 nfs mount?

Thank you.

Mike MacNeil
Global IT Infrastructure
4281 Harvester Rd. Burlington, ON L7L 5M4 Canada
Phone: 905 632 2999 ext. 2920
Fax: 905 632 2055
Email: mike.macn...@gennum.com
www.gennum.com
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On 2/25/2011 7:34 PM, Rich Teer wrote:
>
> One product that seems to fit the bill is the StarTech.com S352U2RER,
> an external dual SATA disk enclosure with USB and eSATA connectivity
> (I'd be using the USB port). Here's a link to the specific product
> I'm considering:
>
> http://ca.startech.com/product/S352U2RER-35in-eSATA-USB-Dual-SATA-Hot-Swap-External-RAID-Hard-Drive-Enclosure

I have had mixed results with their 4 bay version. When they work, they are
great, but we have had a number of DOA/almost DOA units.

I have had good luck with products from http://www.addonics.com/ (they ship
to Canada as well without issue).

Why use USB? You will get much better performance/throughput on eSATA (if
you have good drivers, of course). I use their sil3124 eSATA controller on
FreeBSD as well as a number of PM units and they work great.

        ---Mike

-- 
---
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada   http://www.tancsa.com/
Re: [zfs-discuss] multiple disk failure (solved?)
On 1/31/2011 4:19 PM, Mike Tancsa wrote:
> On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
>> Hi Mike,
>>
>> Yes, this is looking much better.
>>
>> Some combination of removing corrupted files indicated in the zpool
>> status -v output, running zpool scrub and then zpool clear should
>> resolve the corruption, but it depends on how bad the corruption is.
>>
>> First, I would try the least destructive method: Try to remove the
>> files listed below by using the rm command.
>>
>> This entry probably means that the metadata is corrupted or some
>> other file (like a temp file) no longer exists:
>>
>> tank1/argus-data:<0xc6>
>
> Hi Cindy,
>     I removed the files that were listed, and now I am left with
>
> errors: Permanent errors have been detected in the following files:
>
>         tank1/argus-data:<0xc5>
>         tank1/argus-data:<0xc6>
>         tank1/argus-data:<0xc7>
>
> I have started a scrub
>  scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

Looks like that was it! The scrub finished in the time it estimated and that
was all I needed to do. I did not have to do zpool clear or any other
commands. Is there anything beyond scrub to check the integrity of the pool?

0(offsite)# zpool status -v
  pool: tank1
 state: ONLINE
 scrub: scrub completed after 7h32m with 0 errors on Mon Jan 31 23:00:46 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada8    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
            ada6    ONLINE       0     0     0

errors: No known data errors
0(offsite)#

        ---Mike
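As a hedged aside on the closing question (whether anything beyond scrub can
verify pool integrity): zdb can do an independent, read-only checksum
traversal. This is only a sketch, not an official procedure - zdb is a
debugging tool, it is slow, and its results are most trustworthy on a quiet
or exported pool (pool name taken from the thread):

# -c verifies checksums of metadata while gathering block statistics;
# giving the flag twice (-cc) also verifies data block checksums
zdb -cc tank1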
Re: [zfs-discuss] multiple disk failure (solved?)
On 1/31/2011 3:14 PM, Cindy Swearingen wrote:
> Hi Mike,
>
> Yes, this is looking much better.
>
> Some combination of removing corrupted files indicated in the zpool
> status -v output, running zpool scrub and then zpool clear should
> resolve the corruption, but it depends on how bad the corruption is.
>
> First, I would try the least destructive method: Try to remove the
> files listed below by using the rm command.
>
> This entry probably means that the metadata is corrupted or some
> other file (like a temp file) no longer exists:
>
> tank1/argus-data:<0xc6>

Hi Cindy,
    I removed the files that were listed, and now I am left with

errors: Permanent errors have been detected in the following files:

        tank1/argus-data:<0xc5>
        tank1/argus-data:<0xc6>
        tank1/argus-data:<0xc7>

I have started a scrub

 scrub: scrub in progress for 0h48m, 10.90% done, 6h35m to go

I will report back once the scrub is done!

        ---Mike
Re: [zfs-discuss] multiple disk failure (solved?)
On 1/29/2011 6:18 PM, Richard Elling wrote:
>
> On Jan 29, 2011, at 12:58 PM, Mike Tancsa wrote:
>
>> On 1/29/2011 12:57 PM, Richard Elling wrote:
>>>> 0(offsite)# zpool status
>>>>   pool: tank1
>>>>  state: UNAVAIL
>>>> status: One or more devices could not be opened. There are insufficient
>>>>         replicas for the pool to continue functioning.
>>>> action: Attach the missing device and online it using 'zpool online'.
>>>>    see: http://www.sun.com/msg/ZFS-8000-3C
>>>>  scrub: none requested
>>>> config:
>>>>
>>>>         NAME        STATE     READ WRITE CKSUM
>>>>         tank1       UNAVAIL      0     0     0  insufficient replicas
>>>>           raidz1    ONLINE       0     0     0
>>>>             ad0     ONLINE       0     0     0
>>>>             ad1     ONLINE       0     0     0
>>>>             ad4     ONLINE       0     0     0
>>>>             ad6     ONLINE       0     0     0
>>>>           raidz1    ONLINE       0     0     0
>>>>             ada4    ONLINE       0     0     0
>>>>             ada5    ONLINE       0     0     0
>>>>             ada6    ONLINE       0     0     0
>>>>             ada7    ONLINE       0     0     0
>>>>           raidz1    UNAVAIL      0     0     0  insufficient replicas
>>>>             ada0    UNAVAIL      0     0     0  cannot open
>>>>             ada1    UNAVAIL      0     0     0  cannot open
>>>>             ada2    UNAVAIL      0     0     0  cannot open
>>>>             ada3    UNAVAIL      0     0     0  cannot open
>>>> 0(offsite)#
>>>
>>> This is usually easily solved without data loss by making the
>>> disks available again. Can you read anything from the disks using
>>> any program?
>>
>> That's the strange thing, the disks are readable. The drive cage just
>> reset a couple of times prior to the crash. But they seem OK now. Same
>> order as well.
>>
>> # camcontrol devlist
>> at scbus0 target 0 lun 0 (pass0,ada0)
>> at scbus0 target 1 lun 0 (pass1,ada1)
>> at scbus0 target 2 lun 0 (pass2,ada2)
>> at scbus0 target 3 lun 0 (pass3,ada3)
>>
>> # dd if=/dev/ada2 of=/dev/null count=20 bs=1024
>> 20+0 records in
>> 20+0 records out
>> 20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
>> 0(offsite)#
>
> The next step is to run "zdb -l" and look for all 4 labels. Something like:
>     zdb -l /dev/ada2
>
> If all 4 labels exist for each drive and appear intact, then look more
> closely at how the OS locates the vdevs. If you can't solve the "UNAVAIL"
> problem, you won't be able to import the pool.
>  -- richard

On 1/29/2011 10:13 PM, James R. Van Artsdalen wrote:
> On 1/28/2011 4:46 PM, Mike Tancsa wrote:
>>
>> I had just added another set of disks to my zfs array. It looks like the
>> drive cage with the new drives is faulty. I had added a couple of files
>> to the main pool, but not much. Is there any way to restore the pool
>> below? I have a lot of files on ad0,1,4,6 and ada4,5,6,7 and perhaps
>> one file on the new drives in the bad cage.
>
> Get another enclosure and verify it works OK. Then move the disks from
> the suspect enclosure to the tested enclosure and try to import the pool.
>
> The problem may be cabling or the controller instead - you didn't
> specify how the disks were attached or which version of FreeBSD you're
> using.

First off, thanks to all who responded on and off list! Good news (for me),
it seems. New cage and all seems to be recognized correctly. The history is...

2010-04-22.14:27:38 zpool add tank1 raidz /dev/ada4 /dev/ada5 /dev/ada6 /dev/ada7
2010-06-11.13:49:33 zfs create tank1/argus-data
2010-06-11.13:49:41 zfs create tank1/argus-data/previous
2010-06-11.13:50:38 zfs set compression=off tank1/argus-data
2010-08-06.12:20:59 zpool replace tank1 ad1 ad1
2010-09-16.10:17:51 zpool upgrade -a
2011-01-28.11:45:43 zpool add tank1 raidz /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3

FreeBSD RELENG_8 from last week, 8G of RAM, amd64.

zpool status -v
  pool: tank1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       ONLINE       0     0     0
Re: [zfs-discuss] multiple disk failure
On 1/30/2011 12:39 AM, Richard Elling wrote:
>> Hmmm, doesn't look good on any of the drives.
>
> I'm not sure of the way BSD enumerates devices. Some clever person thought
> that hiding the partition or slice would be useful. I don't find it
> useful. On a Solaris system, ZFS can show a disk something like c0t1d0,
> but that doesn't exist. The actual data is in slice 0, so you need to use
> c0t1d0s0 as the argument to zdb.

I think it's the right syntax. On the older drives,

0(offsite)# zdb -l /dev/ada0
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3

0(offsite)# zdb -l /dev/ada4
LABEL 0
    version=15
    name='tank1'
    state=0
    txg=44593174
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338
LABEL 1
    version=15
    name='tank1'
    state=0
    txg=44592523
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338
LABEL 2
    version=15
    name='tank1'
    state=0
    txg=44593174
    pool_guid=7336939736750289319
    hostid=3221266864
    hostname='offsite.sentex.ca'
    top_guid=6980939370923808328
    guid=16144392433229115618
    vdev_tree
        type='raidz'
        id=1
        guid=6980939370923808328
        nparity=1
        metaslab_array=38
        metaslab_shift=35
        ashift=9
        asize=4000799784960
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=16144392433229115618
                path='/dev/ada4'
                whole_disk=0
                DTL=341
        children[1]
                type='disk'
                id=1
                guid=1210677308003674848
                path='/dev/ada5'
                whole_disk=0
                DTL=340
        children[2]
                type='disk'
                id=2
                guid=2517076601231706249
                path='/dev/ada6'
                whole_disk=0
                DTL=339
        children[3]
                type='disk'
                id=3
                guid=16621760039941477713
                path='/dev/ada7'
                whole_disk=0
                DTL=338
Re: [zfs-discuss] multiple disk failure
On 1/29/2011 6:18 PM, Richard Elling wrote:
>> 0(offsite)#
>
> The next step is to run "zdb -l" and look for all 4 labels. Something like:
>     zdb -l /dev/ada2
>
> If all 4 labels exist for each drive and appear intact, then look more
> closely at how the OS locates the vdevs. If you can't solve the "UNAVAIL"
> problem, you won't be able to import the pool.

Hmmm, doesn't look good on any of the drives. Before I give up, I will try
the drives in a different cage Monday. Unfortunately, it's 150km away from
me at our DR site.

# zdb -l /dev/ada0
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3
Re: [zfs-discuss] multiple disk failure
On 1/29/2011 11:38 AM, Edward Ned Harvey wrote:
> That is precisely the reason why you always want to spread your
> mirror/raidz devices across multiple controllers or chassis. If you lose a
> controller or a whole chassis, you lose one device from each vdev, and
> you're able to continue production in a degraded state...

Thanks. These are backups of backups. It would be nice to restore them, as
it will take a while to sync up once again. But if I need to start fresh, is
there a resource you can point me to with the current best practices for
laying out large storage like this? It's just for backups of backups in a DR
site.

        ---Mike
Re: [zfs-discuss] multiple disk failure
On 1/29/2011 12:57 PM, Richard Elling wrote:
>> 0(offsite)# zpool status
>>   pool: tank1
>>  state: UNAVAIL
>> status: One or more devices could not be opened. There are insufficient
>>         replicas for the pool to continue functioning.
>> action: Attach the missing device and online it using 'zpool online'.
>>    see: http://www.sun.com/msg/ZFS-8000-3C
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         tank1       UNAVAIL      0     0     0  insufficient replicas
>>           raidz1    ONLINE       0     0     0
>>             ad0     ONLINE       0     0     0
>>             ad1     ONLINE       0     0     0
>>             ad4     ONLINE       0     0     0
>>             ad6     ONLINE       0     0     0
>>           raidz1    ONLINE       0     0     0
>>             ada4    ONLINE       0     0     0
>>             ada5    ONLINE       0     0     0
>>             ada6    ONLINE       0     0     0
>>             ada7    ONLINE       0     0     0
>>           raidz1    UNAVAIL      0     0     0  insufficient replicas
>>             ada0    UNAVAIL      0     0     0  cannot open
>>             ada1    UNAVAIL      0     0     0  cannot open
>>             ada2    UNAVAIL      0     0     0  cannot open
>>             ada3    UNAVAIL      0     0     0  cannot open
>> 0(offsite)#
>
> This is usually easily solved without data loss by making the
> disks available again. Can you read anything from the disks using
> any program?

That's the strange thing, the disks are readable. The drive cage just reset
a couple of times prior to the crash. But they seem OK now. Same order as
well.

# camcontrol devlist
at scbus0 target 0 lun 0 (pass0,ada0)
at scbus0 target 1 lun 0 (pass1,ada1)
at scbus0 target 2 lun 0 (pass2,ada2)
at scbus0 target 3 lun 0 (pass3,ada3)

# dd if=/dev/ada2 of=/dev/null count=20 bs=1024
20+0 records in
20+0 records out
20480 bytes transferred in 0.001634 secs (12534561 bytes/sec)
0(offsite)#

        ---Mike
[zfs-discuss] multiple disk failure
Hi,

I am using FreeBSD 8.2 and went to add 4 new disks today to expand my
offsite storage. All was working fine for about 20 min and then the new
drive cage started to fail. Silly me for assuming new hardware would be
fine :(

The new drive cage started to fail, it hung the server and the box rebooted.
After it rebooted, the entire pool is gone and in the state below. I had
only written a few files to the new larger pool and I am not concerned about
restoring that data. However, is there a way to get back the original pool
data? Going to http://www.sun.com/msg/ZFS-8000-3C gives a 503 error on the
web page listed, BTW.

0(offsite)# zpool status
  pool: tank1
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank1       UNAVAIL      0     0     0  insufficient replicas
          raidz1    ONLINE       0     0     0
            ad0     ONLINE       0     0     0
            ad1     ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada7    ONLINE       0     0     0
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            ada0    UNAVAIL      0     0     0  cannot open
            ada1    UNAVAIL      0     0     0  cannot open
            ada2    UNAVAIL      0     0     0  cannot open
            ada3    UNAVAIL      0     0     0  cannot open
0(offsite)#
[zfs-discuss] zpool import crashes system
I am trying to bring my zpool from build 121 into build 134, and every time
I do a zpool import the system crashes. I have read other posts about this
and have tried setting zfs_recover = 1 and aok = 1 in /etc/system. I have
used mdb to verify that they are in the kernel, but the system still crashes
as soon as import is called.

On this system I can rebuild the entire pool from scratch, but my next
system is 4 TB and I don't have space on any other system to store that much
data. Anyone have a way to import and upgrade an older pool to a newer OS?

TIA
mic
Re: [zfs-discuss] hardware going bad
On Wed, Oct 27, 2010 at 3:41 PM, Harry Putnam wrote:
> I'm guessing it was probably more like 60 to 62 C under load. The
> temperature I posted was after something like 5 minutes of being
> totally shut down and the case having been open for a long while
> (months if not years).

What happens if the case is closed (and all PCI slot, disk, etc. slots are
closed)? Having the case open likely changes the way that air flows across
the various components. Also, if there is tobacco smoke near the machine, it
will cause a sticky build-up that likely contributes to heat dissipation
problems.

Perhaps this belongs somewhere other than zfs-discuss - it has nothing to do
with zfs.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Wed, Oct 27, 2010 at 9:27 AM, bhanu prakash wrote:
> Hi Mike,
>
> Thanks for the information...
>
> Actually the requirement is like this. Please let me know whether it
> matches the below requirement or not.
>
> Question:
>
> The SAN team will assign the new LUNs on EMC DMX4 (currently IBM Hitachi
> is there). We need to move the 17 containers which exist on the server
> Host1 to the new LUNs.
>
> Please give me the steps to do this activity.

Without knowing the layout of the storage, it is impossible to give you
precise instructions. This sounds like it is a production Solaris 10 system
in an enterprise environment. In most places that I've worked, I would be
hesitant to provide the required level of detail on a public mailing list.
Perhaps you should open a service call to get the assistance you need.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Tue, Oct 26, 2010 at 9:40 AM, bhanu prakash wrote:
> Hi Team,
>
> There are 17 zones on the machine, a T5120. I want to move all the zones,
> which are on a ZFS filesystem, to another new LUN.
>
> Can you give me the steps to proceed with this.

If the only thing on the source LUN is the pool that contains the zones and
the new LUN is at least as big as the old LUN:

zpool replace

The above can be done while the zones are booted. Depending on the
characteristics of the server and workloads, the workloads may feel a bit
sluggish during this time due to increased I/O activity. If that works for
you, stop reading now. In the event that the scenario above doesn't apply,
read on.

Assuming all the zones are under oldpool/zones, oldpool/zones is mounted at
/zones, and you have done "zpool create newpool ..."

Be sure to test this procedure - I didn't!

zfs create newlun/zones
# optionally, shut down the zones
zfs snapshot -r oldpool/zones@phase1
zfs send -r oldpool/zones@phase1 | zfs receive newpool/zones@phase1
# If you did not shut down the zones above, shut them down now.
# If the zones were shut down, skip the next two commands
zfs snapshot -r oldpool/zones@phase2
zfs send -rI oldpool/zones@phase1 oldpool/zones@phase2 \
    | zfs receive newpool/zones@phase2
# Adjust mount points and restart the zones
zfs set mountpoint=none oldpool/zones
zfs set mountpoint=/zones newpool/zones
for zone in $zonelist; do zoneadm -z $zone boot; done

At such a time that you are comfortable that the zone data moved over ok...

zfs destroy -r oldpool/zones

Again, verify the procedure works on a test/lab/whatever box before trying
it for real.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] making sense of arcstat.pl output
przemol,

Thanks for the feedback. I had incorrectly assumed that any machine running
the script would have L2ARC implemented (which is not the case with Solaris
10). I've added a check for this that allows the script to work on non-L2ARC
machines as long as you don't specify L2ARC stats on the command line.

http://github.com/mharsch/arcstat
Re: [zfs-discuss] making sense of arcstat.pl output
Hello Christian,

Thanks for bringing this to my attention. I believe I've fixed the rounding
error in the latest version.

http://github.com/mharsch/arcstat
Re: [zfs-discuss] making sense of arcstat.pl output
For posterity, I'd like to point out the following:

neel's original arcstat.pl uses a crude scaling routine that results in a
large loss of precision as numbers cross from Kilobytes to Megabytes to
Gigabytes. The 1G reported arc size case described here could actually be
anywhere between 1,000,000KB and 1,999,999KB. Use 'kstat zfs::arcstats' to
read the arc size directly from the kstats (for comparison).

I've updated arcstat.pl with a better scaling routine that returns more
appropriate results (similar to df -h human-readable output). I've also
added support for L2ARC stats. The updated version can be found here:

http://github.com/mharsch/arcstat
Re: [zfs-discuss] file level clones
On Mon, Sep 27, 2010 at 6:23 AM, Robert Milkowski wrote:
> Also see http://www.symantec.com/connect/virtualstoreserver

And
http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
On 9/23/2010 at 12:38 PM Erik Trimble wrote:
| [snip]
| If you don't really care about ultra-low-power, then there's absolutely
| no excuse not to buy a USED server-class machine which is 1- or 2-
| generations back. They're dirt cheap, readily available,
| [snip]

Anyone have a link or two to a place where I can buy some dirt-cheap,
readily available last gen servers?
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Sep 16, 2010 at 08:15:53AM -0700, Rich Teer wrote:
> On Thu, 16 Sep 2010, Erik Ableson wrote:
>
> > OpenSolaris snv129
>
> Hmm, SXCE snv_130 here. Did you have to do any server-side tuning
> (e.g., allowing remote connections), or did it just work out of the
> box? I know that Sendmail needs some gentle persuasion to accept
> remote connections out of the box; perhaps lockd is the same?

So, you've been having this problem since April. Did you ever try getting
packet traces to see where the problem is? As I previously stated, if you
want, you can forward the traces to me to look at. Let me know if you need
the directions on how to capture them.

--macko
[zfs-discuss] recordsize
What are the ramifications of changing the recordsize of a zfs filesystem
that already has data on it? I want to tune down the recordsize to speed up
very small reads, to a size that is more in line with the read size. Can I
do this on a filesystem that already has data on it, and how does it affect
that data? The zpool consists of 8 SAN LUNs.

Thanks
mike
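For reference, a minimal sketch of the operation being asked about (the
pool/dataset names below are made up). A key point is that recordsize only
governs blocks written after the change; files already on disk keep their
existing block size until they are rewritten:

# check the current value (128K is the default)
zfs get recordsize tank/db

# lower it; only newly written files/blocks use the new size
zfs set recordsize=8k tank/db

# existing data is unaffected until rewritten, e.g. by copying it in place
cp /tank/db/datafile /tank/db/datafile.new && \
    mv /tank/db/datafile.new /tank/db/datafile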
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Wed, Sep 15, 2010 at 12:08:20PM -0700, Nabil wrote:
> any resolution to this issue? I'm experiencing the same annoying
> lockd thing with mac osx 10.6 clients. I am at pool ver 14, fs ver
> 3. Would somehow going back to the earlier 8/2 setup make things
> better?

As noted in the earlier thread, the "annoying lockd thing" is not a ZFS
issue, but rather a networking issue. FWIW, I never saw a resolution. But
the suggestions for how to debug situations like this still stand:

> So, it looks like you need to investigate why the client isn't
> getting responses from the server's "lockd".
>
> This is usually caused by a firewall or NAT getting in the way.
> I would also check /var/log/system.log and /var/log/kernel.log on the Mac
> to see if any other useful messages are getting logged.
>
> Then I'd grab packet traces with wireshark/tcpdump/snoop *simultaneously*
> on the client and the server, reproduce the problem, and then determine
> which packets are being sent and which packets are being received.

HTH
--macko
Re: [zfs-discuss] How to migrate to 4KB sector drives?
On Sun, Sep 12, 2010 at 5:42 PM, Richard Elling wrote:
> On Sep 12, 2010, at 10:11 AM, Brandon High wrote:
>
>> On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar wrote:
>>> No replies. Does this mean that you should avoid large drives with 4KB
>>> sectors, that is, new drives? ZFS does not handle new drives?
>>
>> Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of
>> osol.
>
> OSol source yes, binaries no :-( You will need another distro besides
> OpenSolaris.

The needed support in sd was added around the b137 timeframe. OpenIndiana,
to be released on Tuesday, is based on b146 or later.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] VM's on ZFS - 7210
On Sat, Aug 28, 2010 at 8:19 AM, Ray Van Dolson wrote:
> On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote:
>> I can't think of an easy way to measure pages that have not been consumed
>> since it's really an SSD controller function which is obfuscated from the
>> OS, and add the variable of over provisioning on top of that. If anyone
>> would like to really get into what's going on inside of an SSD that makes
>> it a bad choice for a ZIL, you can start here:
>>
>> http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29
>>
>> and
>>
>> http://en.wikipedia.org/wiki/Write_amplification
>>
>> Which will be more than you might have ever wanted to know. :)
>
> So has anyone on this list actually run into this issue? Tons of
> people use SSD-backed slog devices...
>
> The theory sounds "sound", but if it's not really happening much in
> practice then I'm not too worried. Especially when I can replace a
> drive from my slog mirror for $400 or so if problems do arise... (the
> alternative being much more expensive DRAM backed devices)

Presumably this problem is being worked...

http://hg.genunix.org/onnv-gate.hg/rev/d560524b6bb6

Notice that it implements:

866610 Add SATA TRIM support

With this in place, I would imagine a next step is for zfs to issue TRIM
commands as zil entries have been committed to the data disks.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Update: version 3.2.5 out now, with changes to better support snv_134: http://forums.halcyoninc.com/showthread.php?t=368 If you've downloaded v3.2.4 and are on 09/06, there is no reason to upgrade. Regards, mike.k...@halcyoninc.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] shrink zpool
Is it currently possible, or likely in the near future, to shrink a zpool ("remove a disk")? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Hi zfs user, > Is the beta free? for how long? if not how much for 5 machines? Everything on our web site (including the beta) runs for 30 days with the baked-in license. After 30 days it will stop collecting fresh numbers, unless you add a license key, or a demo extension file from the sales team (or reinstall it and start over again). > If you are going to post about your commercial products - please include > some price points, so people know whether to ignore the info based on > their budget. You're right, it would be nice if people could just go to our version of "shop.oracle.com", but we're not there yet, and I don't have the price sheets the sales guys do to put those numbers in the forum. If you're still interested, please email me and I can put you in touch with someone who can directly deal with your pricing questions, without going through our web page or the sales alias etc. Thanks! mike.k...@halcyoninc.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Hi wonslung, Thanks for posting to our forum: I'll respond there and take things off-list. Sounds like it's the same bug that appeared with the Sol10 July EIS: (which snv_134 obviously got the changes for first, and that wasn't in 09/06). Fixing it now... mike.k...@halcyoninc.com -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Halcyon ZFS and system monitoring software for OpenSolaris (beta)
Hi all, Halcyon recently started to add ZFS pool stats to our Solaris Agent, and because many people were interested in the previous OpenSolaris beta* we've rolled it into our OpenSolaris build as well. I've already heard some great feedback about supporting ZIL and ARC stats, which we're hoping to add soon. If you'd like to see what we have now, and maybe try it on your OpenSolaris system, please see the download/screenshot page here: http://forums.halcyoninc.com/showthread.php?p=1018 I know this isn't the best time to be posting about legacy OpenSolaris: we're keeping our eyes on Solaris 11 Express / Illumos and aim to support the more advanced features of Solaris 11 the day it's pushed out the door. Thanks for your time! Regards, Mike dot Kirk at HalcyonInc dot com * previous build: http://opensolaris.org/jive/thread.jspa?threadID=130507 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New Supermicro SAS/SATA controller: AOC-USAS2-L8e in SOHO NAS and HD HTPC
What I would really like to know is why PCIe RAID controller cards cost more than an entire motherboard with processor. Some cards cost over $1,000. For what? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] EMC migration and zfs
Bump this up. Anyone? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS development moving behind closed doors
On 8/13/2010 at 8:56 PM Eric D. Mudama wrote: |On Fri, Aug 13 at 19:06, Frank Cusack wrote: |>Interesting POV, and I agree. Most of the many "distributions" of |>OpenSolaris had very little value-add. Nexenta was the most interesting |>and why should Oracle enable them to build a business at their expense? | |These distributions are, in theory, the "gateway drug" where people |can experiment inexpensively to try out new technologies (ZFS, dtrace, |crossbow, comstar, etc.) and eventually step up to Oracle's "big iron" |as their business grows. = Think: strategic business advantage. Oracle are not stupid, they recognize a jewel when they see one. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving /export to another zpool
On Fri, Aug 13, 2010 at 1:07 PM, Handojo wrote: >> Are the old /opt and /export still listed in your >> vfstab(4) file? > > I can't access /etc/vfstab because I can't even log in as my username. I can't > even log in as root from the Login Screen. > > And when I boot using the LiveCD, how can I mount my first drive that has > opensolaris installed? To list the zpools it can see: zpool import To import one called rpool at an alternate root: zpool import -R /mnt rpool -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] EMC migration and zfs
We are going to be migrating to a new EMC frame using Open Replicator. ZFS is sitting on volumes that are running MPxIO, so the controller number/disk number is going to change when we reboot the server. I would like to know if anyone has done this, and whether the ZFS filesystems will "just work" and find the new disk IDs when we go to import the pools. Our process would be:
- zpool export any and all pools on the server
- shut down the server
- re-zone the storage to the new EMC frame; EMC on the backend will present the old drives through the new frame/drives using Open Replicator
- boot the server to single-user mode
- zpool import the pools
- reboot the server.
-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
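For what it's worth, a minimal sketch of that sequence as commands (the pool name "tank" is only a placeholder):
  # zpool export tank          (repeat for every pool)
  ... shut down, re-zone the storage, boot -s ...
  # zpool import               (with no arguments: scans the device tree and lists importable pools under their new device names)
  # zpool import tank
ZFS identifies pool members by the labels written on the disks rather than by the c#t#d# path, so a changed controller/target number should not matter as long as all of the LUNs are visible to the host.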
Re: [zfs-discuss] zfs allow does not work for rpool
That looks like it will work. I won't be able to test until late tonight. Thanks, mike -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs allow does not work for rpool
Thanks. Adding 'mount' did allow me to create the filesystem, but it still does not allow me to create the mountpoint. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs allow does not work for rpool
I am trying to give a general user permission to create ZFS filesystems in the rpool. zpool set delegation=on rpool zfs allow create rpool Both run without any issues. zfs allow rpool reports the user does have create permission. zfs create rpool/test cannot create rpool/test: permission denied. Can you not 'zfs allow' on the rpool? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
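For comparison, a minimal sketch of a delegation setup that should work (the account name "someuser" is only a placeholder, and this is from memory rather than a tested transcript):
  # zpool set delegation=on rpool
  # zfs allow someuser create,mount rpool
  # zfs allow rpool            (confirms the permissions are recorded)
Two things commonly trip this up: the delegated user needs 'mount' as well as 'create', since a new filesystem is mounted as part of 'zfs create', and the user must also be able to create the mountpoint directory itself at the POSIX level, e.g. the parent directory (/rpool here) has to be writable by, or owned by, that user.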
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 2:56 PM, Miles Nordin wrote: >>>>>> "mg" == Mike Gerdts writes: > mg> it is rather common to have multiple 1 Gb links to > mg> servers going to disparate switches so as to provide > mg> resilience in the face of switch failures. This is not unlike > mg> (at a block diagram level) the architecture that you see in > mg> pretty much every SAN. In such a configuation, it is > mg> reasonable for people to expect that load balancing will > mg> occur. > > nope. spanning tree removes all loops, which means between any two > points there will be only one enabled path. An L2-switched network > will look into L4 headers for splitting traffic across an aggregated > link (as long as it's been deliberately configured to do that---by > default probably only looks to L2), but it won't do any multipath > within the mesh. I was speaking more of IPMP, which is at layer 3. > Even with an L3 routing protocol it usually won't do multipath unless > the costs of the paths match exactly, so you'd want to build the > topology to achieve this and then do all switching at layer 3 by > making sure no VLAN is larger than a switch. By default, IPMP does outbound load spreading. Inbound load spreading is not practical with a single (non-test) IP address. If you have multiple virtual IP's you can spread them across all of the NICs in the IPMP group and get some degree of inbound spreading as well. This is the default behavior of the OpenSolaris IPMP implementation, last I looked. I've not seen any examples (although I can't say I've looked real hard either) of the Solaris 10 IPMP configuration set up with multipe IP's to encourage inbound load spreading as well. > > There's actually a cisco feature to make no VLAN larger than a *port*, > which I use a little bit. It's meant for CATV networks I think, or > DSL networks aggregated by IP instead of ATM like maybe some European > ones? but the idea is not to put edge ports into vlans any more but > instead say 'ip unnumbered loopbackN', and then some black magic they > have built into their DHCP forwarder adds /32 routes by watching the > DHCP replies. If you don't use DHCP you can add static /32 routes > yourself, and it will work. It does not help with IPv6, and also you > can only use it on vlan-tagged edge ports (what? arbitrary!) but > neat that it's there at all. > > http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html Interesting... however this seems to limit you to < 4096 edge ports per VTP domain, as the VID field in the 802.1q header is only 12 bits. It is also unclear how this works when you have one physical host with many guests. And then there is the whole thing that I don't really see how this helps with resilience in the face of a switch failure. Cool technology, but I'm not certain that it addresses what I was talking about. > > The best thing IMHO would be to use this feature on the edge ports, > just as I said, but you will have to teach the servers to VLAN-tag > their packets. not such a bad idea, but weird. > > You could also use it one hop up from the edge switches, but I think > it might have problems in general removing the routes when you unplug > a server, and using it one hop up could make them worse. I only use > it with static routes so far, so no mobility for me: I have to keep > each server plugged into its assigned port, and reconfigure switches > if I move it. 
Once you have ``no vlan larger than 1 switch,'' if you > actually need a vlan-like thing that spans multiple switches, the new > word for it is 'vrf'. There was some other Cisco dark magic that our network guys were touting a while ago that would make each edge switch look like a blade in a 6500 series. This would then allow them to do link aggregation across edge switches. At least two of "organizational changes", "personnel changes", and "roadmap changes" happened so I've not seen this in action. > > so, yeah, it means the server people will have to take over the job of > the networking people. The good news is that networking people don't > like spanning tree very much because it's always going wrong, so > AFAICT most of them who are paying attention are already moving in > this direction. > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore wrote: > On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote: >> On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote: >> > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote: >> >> >> >> I think there may be very good reason to use iSCSI, if you're limited >> >> to gigabit but need to be able to handle higher throughput for a >> >> single client. I may be wrong, but I believe iSCSI to/from a single >> >> initiator can take advantage of multiple links in an active-active >> >> multipath scenario whereas NFS is only going to be able to take >> >> advantage of 1 link (at least until pNFS). >> > >> > There are other ways to get multiple paths. First off, there is IP >> > multipathing. which offers some of this at the IP layer. There is also >> > 802.3ad link aggregation (trunking). So you can still get high >> > performance beyond single link with NFS. (It works with iSCSI too, >> > btw.) >> >> With both IPMP and link aggregation, each TCP session will go over the >> same wire. There is no guarantee that load will be evenly balanced >> between links when there are multiple TCP sessions. As such, any >> scalability you get using these configurations will be dependent on >> having a complex enough workload, wise cconfiguration choices, and and >> a bit of luck. > > If you're really that concerned, you could use UDP instead of TCP. But > that may have other detrimental performance impacts, I'm not sure how > bad they would be in a data center with generally lossless ethernet > links. Heh. My horror story with reassembly was actually with connectionless transports (LLT, then UDP). Oracle RAC's cache fusion sends 8 KB blocks via UDP by default, or LLT when used in the Veritas + Oracle RAC certified configuration from 5+ years ago. The use of Sun trunking with round robin hashing and the lack of use of jumbo packets made every cache fusion block turn into 6 LLT or UDP packets that had to be reassembled on the other end. This was on a 15K domain with the NICs spread across IO boards. I assume that interrupts for a NIC are handled by a CPU on the closest system board (Solaris 8, FWIW). If that assumption is true then there would also be a flurry of inter-system board chatter to put the block back together. In any case, performance was horrible until we got rid of round robin and enabled jumbo frames. > Btw, I am not certain that the multiple initiator support (mpxio) is > necessarily any better as far as guaranteed performance/balancing. (It > may be; I've not looked closely enough at it.) I haven't paid close attention to how mpxio works. The Veritas analog, vxdmp, does a very good job of balancing traffic down multiple paths, even when only a single LUN is accessed. The exact mode that dmp will use is dependent on the capabilities of the array it is talking to - many arrays work in an active/passive mode. As such, I would expect that with vxdmp or mpxio the balancing with iSCSI would be at least partially dependent on what the array said to do. > I should look more closely at NFS as well -- if multiple applications on > the same client are access the same filesystem, do they use a single > common TCP session, or can they each have separate instances open? > Again, I'm not sure. It's worse than that. A quick experiment with two different automounted home directories from the same NFS server suggests that both home directories share one TCP session to the NFS server. The latest version of Oracle's RDBMS supports a userland NFS client option. 
It would be very interesting to see if this does a separate session per data file, possibly allowing for better load spreading. >> Note that with Sun Trunking there was an option to load balance using >> a round robin hashing algorithm. When pushing high network loads this >> may cause performance problems with reassembly. > > Yes. Reassembly is Evil for TCP performance. > > Btw, the iSCSI balancing act that was described does seem a bit > contrived -- a single initiator and a COMSTAR server, both client *and > server* with multiple ethernet links instead of a single 10GbE link. > > I'm not saying it doesn't happen, but I think it happens infrequently > enough that its reasonable that this scenario wasn't one that popped > immediately into my head. :-) It depends on whether the people that control the network gear are the same ones that control servers. My experience suggests that if there is a disconnect, it seems rather likely that each group's standardization efforts, procurement cycles, and capacity plans will work against any attempt t
Re: [zfs-discuss] NFS performance?
On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote: > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote: >> >> I think there may be very good reason to use iSCSI, if you're limited >> to gigabit but need to be able to handle higher throughput for a >> single client. I may be wrong, but I believe iSCSI to/from a single >> initiator can take advantage of multiple links in an active-active >> multipath scenario whereas NFS is only going to be able to take >> advantage of 1 link (at least until pNFS). > > There are other ways to get multiple paths. First off, there is IP > multipathing, which offers some of this at the IP layer. There is also > 802.3ad link aggregation (trunking). So you can still get high > performance beyond single link with NFS. (It works with iSCSI too, > btw.) With both IPMP and link aggregation, each TCP session will go over the same wire. There is no guarantee that load will be evenly balanced between links when there are multiple TCP sessions. As such, any scalability you get using these configurations will be dependent on having a complex enough workload, wise configuration choices, and a bit of luck. Note that with Sun Trunking there was an option to load balance using a round robin hashing algorithm. When pushing high network loads this may cause performance problems with reassembly. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
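As a concrete illustration of the "configuration choices" part: with 802.3ad aggregation on OpenSolaris the outbound hash policy is set on the aggregation itself, and including L3/L4 headers in the hash is what gives multiple TCP sessions a chance of landing on different links. A sketch (link names are placeholders, and the switch side has to be configured to match):
  # dladm create-aggr -P L3,L4 -l e1000g0 -l e1000g1 aggr1
  # dladm modify-aggr -P L3,L4 aggr1     (change the policy on an existing aggregation)
Even then, a single TCP (or iSCSI) session still hashes to one link, which is exactly the limitation described above.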
Re: [zfs-discuss] Hashing files rapidly on ZFS
On Tue, Jul 6, 2010 at 10:29 AM, Arne Jansen wrote: > Daniel Carosone wrote: >> Something similar would be useful, and much more readily achievable, >> from ZFS from such an application, and many others. Rather than a way >> to compare reliably between two files for identity, I'ld liek a way to >> compare identity of a single file between two points in time. If my >> application can tell quickly that the file content is unaltered since >> last time I saw the file, I can avoid rehashing the content and use a >> stored value. If I can achieve this result for a whole directory >> tree, even better. > > This would be great for any kind of archiving software. Aren't zfs checksums > already ready to solve this? If a file changes, it's dnodes' checksum changes, > the checksum of the directory it is in and so forth all the way up to the > uberblock. > There may be ways a checksum changes without a real change in the files > content, > but the other way round should hold. If the checksum didn't change, the file > didn't change. > So the only missing link is a way to determine zfs's checksum for a > file/directory/dataset. Am I missing something here? Of course atime update > should be turned off, otherwise the checksum will get changed by the archiving > agent. What is the likelihood that the same data is re-written to the file? If that is unlikely, it looks as though znode_t's z_seq may be useful. While it isn't a checksum, it seems to be incremented on every file change. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
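If someone wants to poke at what ZFS already records for a file, one way to look (a rough sketch; the dataset name and object number are examples, not a tested transcript) is to take the file's inode number, which is its ZFS object number, and dump that object with zdb:
  # ls -i /tank/fs/somefile
  8432 /tank/fs/somefile
  # zdb -dddddd tank/fs 8432
With enough -d's the dump includes the znode bookkeeping (gen, timestamps, size) and the object's block pointers with their checksums. The caveat is that these are checksums of the on-disk blocks, so settings such as compression affect them, and zdb is an unstable debugging interface rather than something an archiver should depend on.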
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 2:08 PM, Ian D wrote: > Mem: 74098512k total, 73910728k used, 187784k free, 96948k buffers > Swap: 2104488k total, 208k used, 2104280k free, 63210472k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 17652 mysql 20 0 3553m 3.1g 5472 S 38 4.4 247:51.80 mysqld > 16301 mysql 20 0 4275m 3.3g 5980 S 4 4.7 5468:33 mysqld > 16006 mysql 20 0 4434m 3.3g 5888 S 3 4.6 5034:06 mysqld > 12822 root 15 -5 0 0 0 S 2 0.0 22:00.50 scsi_wq_39 Is that 38% of one CPU or 38% of all CPU's? How many CPU's does the Linux box have? I don't mean the number of sockets, I mean number of sockets * number of cores * number of threads per core. My recollection of top is that the CPU percentage is: (pcpu_t2 - pcpu_t1) / (interval * ncpus) Where pcpu_t* is the process CPU time at a particular time. If you have a two socket quad core box with hyperthreading enabled, that is 2 * 4 * 2 = 16 CPU's. 38% of 16 CPU's can be roughly 6 CPU's running as fast as they can (and 10 of them idle) or 16 CPU's each running at about 38%. In the "I don't have a CPU bottleneck" argument, there is a big difference. If PID 16301 has a single thread that is doing significant work, on the hypothetical 16 CPU box this means that it is spending about 2/3 of the time on CPU. If the workload does: while ( 1 ) { issue I/O request get response do cpu-intensive work work } It is only trying to do I/O 1/3 of the time. Further, it has put a single high latency operation between its bursts of CPU activity. One other area of investigation that I didn't mention before: Your stats imply that the Linux box is getting data 32 KB at a time. How does 32 KB compare to the database block size? How does 32 KB compare to the block size on the relevant zfs filesystem or zvol? Are blocks aligned at the various layers? http://blogs.sun.com/dlutz/entry/partition_alignment_guidelines_for_unified -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
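To take the guesswork out of it on the Linux side, it may be worth looking at per-thread CPU rather than the per-process rollup (the PID below is one of the mysqld processes from the top output above; pidstat assumes the sysstat package is installed):
  $ top -H -p 16301          (per-thread view of that mysqld)
  $ pidstat -t -p 16301 1    (per-thread %usr/%system, once a second)
If one thread sits at or near 100% of a single CPU while the others are idle, the workload is serialized on that thread even though the box as a whole looks mostly idle.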
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 10:08 AM, Ian D wrote: > What I don't understand is why, when I run a single query I get <100 IOPS > and <3MB/sec. The setup can obviously do better, so where is the > bottleneck? I don't see any CPU core on any side being maxed out so it > can't be it... In what way is CPU contention being monitored? "prstat" without options is nearly useless for a multithreaded app on a multi-CPU (or multi-core/multi-thread) system. mpstat is only useful if threads never migrate between CPU's. "prstat -mL" gives a nice picture of how busy each LWP (thread) is. When viewed with "prstat -mL", A thread that has usr+sys at 100% cannot go any faster, unless you can get the CPU to go faster, as I suggest below. From my understanding (perhaps not 100% correct on the rest of this paragraph): The time spent in TRP may be reclaimed by running the application in a processor set with interrupts disabled on all of its processors. If TFL or DFL are high, optimizing the use of cache may be beneficial. Examples of how you can optimize the use of cache include using the FX scheduler with a priority that gives relatively long time slices, using processor sets to keep other processes off of the same caches (which are often shared by multiple cores), or perhaps disabling CPU's (threads) to ensure that only a single core is using each cache. With current generation Intel CPU's, this can allow the CPU clock rate to increase, thereby allowing more work to get done. > The database is MySQL, it runs on a Linux box that connects to the Nexenta Oh, since the database runs on Linux I guess you need to dig up top's equivalent of "prstat -mL". Unfortunately, I don't think that Linux has microstate accounting and as such you may not have visibility into time spent on traps, text faults, and data faults on a per-process basis. > server through 10GbE using iSCSI. Have you done any TCP tuning? Based on the numbers you cite above, it looks like you are doing about 32 KB I/O's. I think you can perform a test that involves mainly the network if you use netperf with options like: netperf -H $host -t TCP_RR -r 32768 -l 30 That is speculation based on reading http://www.netperf.org/netperf/training/Netperf.html. Someone else (perhaps on networking or performance lists) may have better tests to run. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 11:28 AM, Bob Friesenhahn wrote: >> >> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having >> one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS >> (according to zpool iostat) and 20-30MB/sec in random read on a big >> database. Does that sounds right? > > I am not sure who wrote the above text since the attribution quoting is all > botched up (Gmail?) in this thread. Regardless, it is worth pointing out > that 'zpool iostat' only reports the I/O operations which were actually > performed. It will not report the operations which did not need to be > performed due to already being in cache. A quite busy system can still > report very little via 'zpool iostat' if it has enough RAM to cache the > requested data. > > Bob Very good point. You can use a combination of "zpool iostat" and fsstat to see the effect of reads that didn't turn into physical I/Os. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
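A minimal sketch of watching both layers at once (the pool name is a placeholder):
  # fsstat zfs 5             (logical read/write operations arriving at ZFS, whether or not they hit the ARC)
  # zpool iostat -v tank 5   (physical I/O actually issued to the vdevs)
If fsstat shows substantially more read activity than zpool iostat does, the difference is being satisfied from cache.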
Re: [zfs-discuss] Use of blocksize (-b) during zfs zvol create, poor performance
Hi Eff, There are a significant number of variables to work through with dedup and compression enabled, so the first suggestion I have is to disable those features for now so you're not working with too many elements. With those features set aside, an NTFS cluster operation does not equal a 64k raw I/O block, and the ZFS 64k blocksize does not equal one I/O operation. We may also need to consider the overall network performance behavior, iSCSI protocol characteristics, and the Windows network stack; iperf is a good tool to rule those out. What I primarily suspect is that write I/O operations are not aligned and are waiting for I/O completion over multiple vdevs. Alignment is important for write I/O optimization, and how the I/O maps onto the software RAID layout has a significant impact on the DMU and SPA operations for a specific vdev layout. You may also have an issue with write cache operations: by default, large I/O calls such as 64k will not use a ZIL cache vdev (if you have one defined) but will be written directly to your array vdevs, which also involves a transaction group write operation. To ensure ZIL log usage with 64k I/Os you can apply the following: edit the /etc/system file with set zfs:zfs_immediate_write_sz = 131071 (a reboot is required to activate the change). You have also not indicated what your zpool configuration looks like; that would be helpful in this discussion. It appears that you're applying the x4500 as a backup target, which means you should (if not already) enable write caching on the COMSTAR LU properties for this type of application, e.g. stmfadm modify-lu -p wcd=false 600144F02F2280004C1D62010001 To help triage the perf issue further you could post two 'kstat zfs' and two 'kstat stmf' outputs at a 5 minute interval and a 'zpool iostat -v 30 5', which would help visualize the I/O behavior. Regards, Mike http://blog.laspina.ca/ -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] COMSTAR ISCSI - configuration export/import
I haven't tried it yet, but supposedly this will back up/restore the COMSTAR config: $ svccfg export -a stmf > comstar.bak.${DATE} If you ever need to restore the configuration, you can attach the storage and run an import: $ svccfg import comstar.bak.${DATE} - Mike On 6/28/10, bso...@epinfante.com wrote: > Hi all, > > Having osol b134 exporting a couple of iscsi targets to some hosts, how can > the COMSTAR configuration be migrated to another host? > I can use ZFS send/receive to replicate the LUNs, but how can I > "replicate" the targets/views from serverA to serverB? > > Are there any best procedures to follow to accomplish this? > Thanks for all your time, > > Bruno > > Sent from my HTC > -- > This message has been scanned for viruses and > dangerous content by MailScanner, and is > believed to be clean. > > -- Sent from my mobile device ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VXFS to ZFS Quota
On Fri, Jun 18, 2010 at 8:09 AM, David Magda wrote: > You could always split things up into groups of (say) 50. A few jobs ago, > I was in an environment where we have a /home/students1/ and > /home/students2/, along with a separate faculty/ (using Solaris and UFS). > This had more to do with IOps than anything else. A decade or so ago when I managed similar environments and had (I think) 6 file systems handling about 5000 students. Each file system had about 1/6 of the students. Challenges I found in this were: - Students needed to work on projects together. The typical way to do this was for them to request a group, then create a group writable directory in one of their home directories. If all students in the group had home directories on the same file system, there was nothing special to consider. If they were on different file systems then at least one would need to have a non-zero quota (that is, not 0 blocks soft, 1 block hard) quota on the file system where the group directory resides. - Despite your best efforts things will get imbalanced. If you are tight on space, this means that you will need to migrate users. This will become apparent only at the times of the semester where even per-user outages are most inconvenient (i.e. at 6 and 13 weeks when big projects tend to be due). Its probably a good idea to consider these types of situations in the transition plan, or at least determine they don't apply. I was working in a college of engineering where group projects were common and CAD, EDA, and simulation tools could generate big files very quickly. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On Tue, Jun 15, 2010 at 7:28 PM, David Magda wrote: > On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote: > >> I think dedup may have its greatest appeal in VDI environments (think >> about a environment with 85% if the data that the virtual machine needs is >> into ARC or L2ARC... is like a dream...almost instantaneous response... and >> you can boot a new machine in a few seconds)... > > This may also be accomplished by using snapshots and clones of data sets. At > least for OS images: user profiles and documents could be something else > entirely. It all depends on the nature of the VDI environment. If the VMs are regenerated on each login, the snapshot + clone mechanism is sufficient. Deduplication is not needed. However, if VMs have a long life and get periodic patches and other software updates, deduplication will be required if you want to remain at somewhat constant storage utilization. It probably makes a lot of sense to be sure that swap or page files are on a non-dedup dataset. Executables and shared libraries shouldn't be getting paged out to it and the likelihood that multiple VMs page the same thing to swap or a page file is very small. > Another situation that comes to mind is perhaps as the back-end to a mail > store: if you send out a message(s) with an attachment(s) to a lot of > people, the attachment blocks could be deduped (and perhaps compressed as > well, since base-64 adds 1/3 overhead). It all depends on how this is stored. If the attachments are stored like they were in 1990 as part of an mbox format, you will be very unlikely to get the proper block alignment. Even storing the message body (including headers) in the same file as the attachment may not align the attachments because the mail headers may be different (e.g. different recipients messages took different paths, some were forwarded, etc.). If the attachments are stored in separate files or a database format is used that stores attachments separate from the message (with matching database + zfs block size) things may work out favorably. However, a system that detaches messages and stores them separately may just as well store them in a file that matches the SHA256 hash, assuming that file doesn't already exist. If does exist, it can just increment a reference count. In other words, an intelligent mail system should already dedup. Or at least that is how I would have written it for the last decade or so... -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin wrote: > On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski wrote: >> >> On 21/10/2009 03:54, Bob Friesenhahn wrote: >>> >>> I would be interested to know how many IOPS an OS like Solaris is able to >>> push through a single device interface. The normal driver stack is likely >>> limited as to how many IOPS it can sustain for a given LUN since the driver >>> stack is optimized for high latency devices like disk drives. If you are >>> creating a driver stack, the design decisions you make when requests will be >>> satisfied in about 12ms would be much different than if requests are >>> satisfied in 50us. Limitations of existing software stacks are likely >>> reasons why Sun is designing hardware with more device interfaces and more >>> independent devices. >> >> >> Open Solaris 2009.06, 1KB READ I/O: >> >> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& > > /dev/null is usually a poor choice for a test like this. Just to be on the > safe side, I'd rerun it with /dev/random. > Regards, > Andrey (aside from other replies about read vs. write and /dev/random...) Testing the performance of a disk by reading from /dev/random and writing to the disk is misguided. From random(7d): Applications retrieve random bytes by reading /dev/random or /dev/urandom. The /dev/random interface returns random bytes only when sufficient amount of entropy has been collected. In other words, when the kernel doesn't think that it can give high quality random numbers, it stops providing them until it has gathered enough entropy. It will pause your reads. If instead you use /dev/urandom, the above problem doesn't exist, but the generation of random numbers is CPU-intensive. There is a reasonable chance (particularly with slow CPUs and fast disks) that you will be testing the speed of /dev/urandom rather than the speed of the disk or other I/O components. If your goal is to provide data that is not all 0's to prevent ZFS compression from making the file sparse, or you want to be sure that compression doesn't otherwise make the actual writes smaller, you could try something like: # create a file just over 100 MB dd if=/dev/random of=/tmp/randomdata bs=513 count=204401 # repeatedly feed that file to dd while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=... The above should make it so that it will take a while before there are two blocks that are identical, thus confounding deduplication as well. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Sorry, turned on html mode to avoid gmail's line wrapping. On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness wrote: > On 05/31/2010 02:52 PM, Mike Gerdts wrote: > > On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness > wrote: > > > >> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote: > >> > >>> There are multiple factors at work. Your OpenSolaris should be new > >>> enough to have the fix in which the zfs I/O tasks are run in in a > >>> scheduling class at lower priority than normal user processes. > >>> However, there is also a throttling mechanism for processes which > >>> produce data faster than can be consumed by the disks. This > >>> throttling mechanism depends on the amount of RAM available to zfs and > >>> the write speed of the I/O channel. More available RAM results in > >>> more write buffering, which results in a larger chunk of data written > >>> at the next transaction group write interval. The maximum size of a > >>> transaction group may be configured in /etc/system similar to: > >>> > >>> * Set ZFS maximum TXG group size to 2684354560 > >>> set zfs:zfs_write_limit_override = 0xa000 > >>> > >>> If the transaction group is smaller, then zfs will need to write more > >>> often. Processes will still be throttled but the duration of the > >>> delay should be smaller due to less data to write in each burst. I > >>> think that (with multiple writers) the zfs pool will be "healthier" > >>> and less fragmented if you can offer zfs more RAM and accept some > >>> stalls during writing. There are always tradeoffs. > >>> > >>> Bob > >>> > >> well it seems like when messing with the txg sync times and stuff like > >> that it did make the transfer more smooth but didn't actually help with > >> speeds as it just meant the hangs happened for a shorter time but at a > >> smaller interval and actually lowering the time between writes just > >> seemed to make things worse (slightly). > >> > >> I think I have came to the conclusion that the problem here is CPU due > >> to the fact that its only doing this with parity raid. I would think if > >> it was I/O based then it would be the same as if anything its heavier on > >> I/O on non parity raid due to the fact that it is no longer CPU > >> bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with > >> parity raidz2). > >> > > To see if the CPU is pegged, take a look at the output of: > > > > mpstat 1 > > prstat -mLc 1 > > > > If mpstat shows that the idle time reaches 0 or the process' latency > > column is more then a few tenths of a percent, you are probably short > > on CPU. > > > > It could also be that interrupts are stealing cycles from rsync. > > Placing it in a processor set with interrupts disabled in that > > processor set may help. > > > > > > Unfortunately none of these utilies make it possible to ge values for <1 > second which is what the hang is (its happening for about 1/2 of a second). 
> > Here is with mpstat: > > > Here is what i get with prstat: > > Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15 > PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG > PROCESS/LWPID > 604 root 0.0 33 0.0 0.0 0.0 0.0 42 25 18 13 0 0 > zpool-data/13 > 604 root 0.0 30 0.0 0.0 0.0 0.0 41 29 12 12 0 0 > zpool-data/15 > 1326 root 12 2.9 0.0 0.0 0.0 0.0 85 0.4 1K 12 11K 0 rsync/1 > 604 root 0.0 15 0.0 0.0 0.0 0.0 41 44 111 9 0 0 > zpool-data/27 > 604 root 0.0 14 0.0 0.0 0.0 0.0 43 42 72 3 0 0 > zpool-data/33 > 604 root 0.0 5.9 0.0 0.0 0.0 0.0 41 53 109 6 0 0 > zpool-data/19 > 604 root 0.0 5.4 0.0 0.0 0.0 0.0 42 53 106 8 0 0 > zpool-data/25 > 604 root 0.0 5.3 0.0 0.0 0.0 0.0 43 51 107 7 0 0 > zpool-data/21 > 604 root 0.0 4.5 0.0 0.0 0.0 0.0 41 54 110 4 0 0 > zpool-data/31 > 604 root 0.0 3.9 0.0 0.0 0.0 0.0 41 55 109 3 0 0 > zpool-data/23 > 604 root 0.0 3.7 0.0 0.0 0.0 0.0 44 52 111 2 0 0 > zpool-data/29 > 1322 root 0.0 0.4 0.0 0.0 0.0 0.0 98 2.0 1K 0 1 0 rsync/1 > 22644 root 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 16 13 255 0 prstat/1 > 14409 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 5 3 69 0 sshd/1 > 196 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 15 2 105 0 nscd/17 > In the interval abo
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness wrote: > On 05/31/2010 01:51 PM, Bob Friesenhahn wrote: >> There are multiple factors at work. Your OpenSolaris should be new >> enough to have the fix in which the zfs I/O tasks are run in in a >> scheduling class at lower priority than normal user processes. >> However, there is also a throttling mechanism for processes which >> produce data faster than can be consumed by the disks. This >> throttling mechanism depends on the amount of RAM available to zfs and >> the write speed of the I/O channel. More available RAM results in >> more write buffering, which results in a larger chunk of data written >> at the next transaction group write interval. The maximum size of a >> transaction group may be configured in /etc/system similar to: >> >> * Set ZFS maximum TXG group size to 2684354560 >> set zfs:zfs_write_limit_override = 0xa000 >> >> If the transaction group is smaller, then zfs will need to write more >> often. Processes will still be throttled but the duration of the >> delay should be smaller due to less data to write in each burst. I >> think that (with multiple writers) the zfs pool will be "healthier" >> and less fragmented if you can offer zfs more RAM and accept some >> stalls during writing. There are always tradeoffs. >> >> Bob > well it seems like when messing with the txg sync times and stuff like > that it did make the transfer more smooth but didn't actually help with > speeds as it just meant the hangs happened for a shorter time but at a > smaller interval and actually lowering the time between writes just > seemed to make things worse (slightly). > > I think I have came to the conclusion that the problem here is CPU due > to the fact that its only doing this with parity raid. I would think if > it was I/O based then it would be the same as if anything its heavier on > I/O on non parity raid due to the fact that it is no longer CPU > bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with > parity raidz2). To see if the CPU is pegged, take a look at the output of: mpstat 1 prstat -mLc 1 If mpstat shows that the idle time reaches 0 or the process' latency column is more then a few tenths of a percent, you are probably short on CPU. It could also be that interrupts are stealing cycles from rsync. Placing it in a processor set with interrupts disabled in that processor set may help. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
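A rough sketch of the processor-set idea (the CPU ids and set id below are examples; pick CPUs that leave enough for the rest of the system):
  # psrset -c 2 3            (create a set from CPUs 2 and 3; the command prints the new set id, say 1)
  # psrset -f 1              (stop that set's CPUs from handling interrupts)
  # psrset -e 1 rsync ...    (run the rsync bound to the set)
  # psrset -d 1              (tear the set down afterwards)
That isolates rsync from interrupt load; whether it helps here depends on whether the stalls really are CPU starvation rather than the txg flushes.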
Re: [zfs-discuss] Is it safe to disable the swap partition?
On Sun, May 9, 2010 at 7:40 PM, Edward Ned Harvey wrote: > > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > > boun...@opensolaris.org] On Behalf Of Richard Elling > > > > For a storage server, swap is not needed. If you notice swap being used > > then your storage server is undersized. > > Indeed, I have two solaris 10 fileservers that have uptime in the range of a > few months. I just checked swap usage, and they're both zero. > > So, Bob, rub it in if you wish. ;-) I was wrong. I knew the behavior in > Linux, which Roy seconded as "most OSes," and apparently we both assumed the > same here, but that was wrong. I don't know if solaris and opensolaris both > have the same swap behavior. I don't know if there's *ever* a situation > where solaris/opensolaris would swap idle processes. But there's at least > evidence that my two servers have not, or do not. If Solaris is under memory pressure, pages may be paged to swap. Under severe memory pressure, entire processes may be swapped. This will happen after freeing up the memory used for file system buffers, ARC, etc. If the processes never page in the pages that have been paged out (or the processes that have been swapped out are never scheduled) then those pages will not consume RAM. The best thing to do with processes that can be swapped out forever is to not run them. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
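For anyone who wants to check their own boxes, a quick sketch of what "swap being used" looks like from the Solaris command line:
  # swap -s     (reservation/allocation summary)
  # swap -l     (per-device usage; 'free' close to 'blocks' means nothing has been pushed out)
  # vmstat 5    (a persistently non-zero 'sr' scan-rate column indicates genuine memory pressure)
Zero swap used plus a zero scan rate over time is a reasonable sign the server is not undersized in the sense Richard describes.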
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Apr 22, 2010 at 02:53:37PM -0700, Rich Teer wrote: > On Thu, 22 Apr 2010, Mike Mackovitch wrote: > > > I would also check /var/log/system.log and /var/log/kernel.log on the Mac to > > see if any other useful messages are getting logged. > > Ah, we're getting closer. The latter shows nothing interesting, but > system.log > has this line appended the minute I try the copy: > > sandboxd[78312]: portmap(78311) deny network-outbound > /private/var/tmp/launchd/sock > > Then, when the attempt times out, these appear: > > KernelEventAgent[36]: tid received event(s) VQ_NOTRESP (1) > KernelEventAgent[36]: tid type 'nfs', mounted on > '/net/zen/export/home' from 'zen:/export/home', not responding > KernelEventAgent[36]: tid found 1 filesystem(s) with problem(s) > > Does that shed any morelight on this? Nope. The first message is a known annoyance that gets logged whenever portmap starts. It can be ignored. The second message is just the daemon responsible for inducing the "disconnect dialog" noticing that there is an unresponsive file system. Oh, and the kernel.log should at least have the "lockd not responding" messages in it. So, I presume you meant nothing *else* interesting. I think it's time to look at the packets... (...and perhaps time to move this off of zfs-discuss seeing as this is really an NFS/networking issue and not a ZFS issue.) HTH --macko ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Apr 22, 2010 at 01:54:26PM -0700, Rich Teer wrote: > On Thu, 22 Apr 2010, Mike Mackovitch wrote: > > Hi Mike, > > > So, it looks like you need to investigate why the client isn't > > getting responses from the server's "lockd". > > > > This is usually caused by a firewall or NAT getting in the way. > > Great idea--I was indeed connected to my network using the AirPort interface, > thorugh a Wifi router. So as an experiment, I tried using a hard-wired, > manually set up Ethernet connection. Same result: no dice. :-( > > I checked the firewall settings on my laptop, and the firewall is turned off. > > Do you have any other ideas? It'd be really nice to get this working! I would also check /var/log/system.log and /var/log/kernel.log on the Mac to see if any other useful messages are getting logged. Then I'd grab packet traces with wireshark/tcpdump/snoop *simultaneously* on the client and the server, reproduce the problem, and then determine which packets are being sent and which packets are being received. HTH --macko ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X clients with ZFS server
On Thu, Apr 22, 2010 at 12:40:37PM -0700, Rich Teer wrote: > On Thu, 22 Apr 2010, Tomas Ögren wrote: > > > Copying via terminal (and cp) works. > > Interesting: if I copy a file *which has no extended attributes* using cp in > a terminal, it works fine. If I try to cp a file that has EA (to the same > destination), it hangs. But I get this error message after a few seconds: > > cp file_without_EA /net/zen/export/home/rich > cp file_with_EA /net/zen/export/home/rich > nfs server zen:/export/home: lockd not responding So, it looks like you need to investigate why the client isn't getting responses from the server's "lockd". This is usually caused by a firewall or NAT getting in the way. HTH --macko ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Why does ARC grow above hard limit?
I would appreciate if somebody can clarify a few points. I am doing some random WRITES (100% writes, 100% random) testing and observe that ARC grows way beyond the "hard" limit during the test. The hard limit is set 512 MB via /etc/system and I see the size going up to 1 GB - how come is it happening? mdb's ::memstat reports 1.5 GB used - does this include ARC as well or is it separate? I see on the backed only reads (205 MB/s) and almost no writes (1.1 MB/s) - any ides what is being read? --- BEFORE TEST # ~/bin/arc_summary.pl System Memory: Physical RAM: 12270 MB Free Memory : 7108 MB LotsFree: 191 MB ZFS Tunables (/etc/system): set zfs:zfs_prefetch_disable = 1 set zfs:zfs_arc_max = 0x2000 set zfs:zfs_arc_min = 0x1000 ARC Size: Current Size: 136 MB (arcsize) Target Size (Adaptive): 512 MB (c) Min Size (Hard Limit):256 MB (zfs_arc_min) Max Size (Hard Limit):512 MB (zfs_arc_max) ... > ::memstat Page SummaryPagesMB %Tot Kernel 800895 3128 25% ZFS File Data 394450 1540 13% Anon 106813 4173% Exec and libs4178160% Page cache 14333550% Free (cachelist)22996891% Free (freelist) 1797511 7021 57% Total 3141176 12270 Physical 3141175 12270 --- DURING THE TEST # ~/bin/arc_summary.pl System Memory: Physical RAM: 12270 MB Free Memory : 6687 MB LotsFree: 191 MB ZFS Tunables (/etc/system): set zfs:zfs_prefetch_disable = 1 set zfs:zfs_arc_max = 0x2000 set zfs:zfs_arc_min = 0x1000 ARC Size: Current Size: 1336 MB (arcsize) Target Size (Adaptive): 512 MB (c) Min Size (Hard Limit):256 MB (zfs_arc_min) Max Size (Hard Limit):512 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 87%446 MB (p) Most Frequently Used Cache Size:12%65 MB (c-p) ARC Efficency: Cache Access Total: 51681761 Cache Hit Ratio: 52% 27056475 [Defined State for buffer] Cache Miss Ratio: 47% 24625286 [Undefined State for Buffer] REAL Hit Ratio: 52% 27056475 [MRU/MFU Hits Only] Data Demand Efficiency:35% Data Prefetch Efficiency:DISABLED (zfs_prefetch_disable) CACHE HITS BY CACHE LIST: Anon: --%Counter Rolled. Most Recently Used: 13%3627289 (mru) [ Return Customer ] Most Frequently Used: 86%23429186 (mfu) [ Frequent Customer ] Most Recently Used Ghost: 17%4657584 (mru_ghost)[ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 32%8712009 (mfu_ghost)[ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data:30%8308866 Prefetch Data: 0%0 Demand Metadata:69%18747609 Prefetch Metadata: 0%0 CACHE MISSES BY DATA TYPE: Demand Data:61%15113029 Prefetch Data: 0%0 Demand Metadata:38%9511898 Prefetch Metadata: 0%359 - -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS behavior under limited resources
I am trying to see how ZFS behaves under resource starvation - corner cases in embedded environments. I see some very strange behavior. Any help/explanation would really be appreciated. My current setup is : OpenSolaris 111b (iSCSI seems to be broken in 132 - unable to get multiple connections/mutlipathing) iSCSI Storage Array that is capable of 20 MB/s random writes @ 4k and 70 MB random reads @ 4k 150 MB/s random writes @ 128k and 180 MB/S random reads @ 128K 180+ MB/S for sequntial reads and write at both 4k and 128k. 8 Intel CPU and 12 GB of RAM (DELL poweredge 610) The ARC size is limited to 512MB (hard limit). No L2 Cache. In both test below the file system size is about 300 GB. This file system conatins a single directory with about 15'000 files totalling to 200 GB (so the file system is 2/3 full). The tests are run within the same directory. Test 1: Random writes @ 4k to 1000 1MB files (1000 threads, 1 per file). First I observe that ARC size grows (momentarily) above 512 MB limit (via kstat and arcstat.pl). Q: It seems that zfs:zfs_arc_max is not really a hard limit? I tried setting primarycache to none, metadata and all. The I/O reported is similar in the NONE and METADATA case (17 MB/S) while when set to ALL, I/O is 3 - 4 time less (4-5 MB/S). Q: Any explanation would be useful. In this test I observe for backend on average I/O is 132 MB/s for READs and 51 MB/s WRITES Q: Why is more read than wtritten? Test 2: Random writes @ 4k to 10'000 1MB files (10'000 threads, 1 per file). - ARC size now goes to 1 GB during the entire test (way above the hard limit) - ::memstat reports that zfs grew from the original 430 MB to about 1.5 GB Q: Does mdb memstat reporting include ARC? Q: On the backend I see 170 MB/s reads and 0.5 MB.s writes -- What is happening here? SOME sample output ... --- > ::memstat Page SummaryPagesMB %Tot Kernel 800933 3128 25% ZFS File Data 394450 1540 13% Anon 128909 5034% Exec and libs4172160% Page cache 14749570% Free (cachelist)21884851% Free (freelist) 1776079 6937 57% Total 3141176 12270 Physical 3141175 12270 -- System Memory: Physical RAM: 12270 MB Free Memory : 6966 MB LotsFree: 191 MB ZFS Tunables (/etc/system): set zfs:zfs_prefetch_disable = 1 set zfs:zfs_arc_max = 0x2000 set zfs:zfs_arc_min = 0x1000 ARC Size: Current Size: 669 MB (arcsize) Target Size (Adaptive): 512 MB (c) Min Size (Hard Limit):256 MB (zfs_arc_min) Max Size (Hard Limit):512 MB (zfs_arc_max) ARC Size Breakdown: Most Recently Used Cache Size: 6%32 MB (p) Most Frequently Used Cache Size:93%480 MB (c-p) ARC Efficency: Cache Access Total: 47002757 Cache Hit Ratio: 52% 24657634 [Defined State for buffer] Cache Miss Ratio: 47% 22345123 [Undefined State for Buffer] REAL Hit Ratio: 52% 24657634 [MRU/MFU Hits Only] Data Demand Efficiency:36% Data Prefetch Efficiency:DISABLED (zfs_prefetch_disable) CACHE HITS BY CACHE LIST: Anon: --%Counter Rolled. 
Most Recently Used: 13%3420349 (mru) [ Return Customer ] Most Frequently Used: 86%21237285 (mfu) [ Frequent Customer ] Most Recently Used Ghost: 16%4057965 (mru_ghost)[ Return Customer Evicted, Now Back ] Most Frequently Used Ghost: 31%7837353 (mfu_ghost)[ Frequent Customer Evicted, Now Back ] CACHE HITS BY DATA TYPE: Demand Data:31%7793822 Prefetch Data: 0%0 Demand Metadata:68%16863812 Prefetch Metadata: 0%0 CACHE MISSES BY DATA TYPE: Demand Data:60%13573358 Prefetch Data: 0%0 Demand Metadata:39%8771406 Prefetch Metadata: 0%359 - -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff
On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams wrote: > One really good use for zfs diff would be: as a way to index zfs send > backups by contents. Or to generate the list of files for incremental backups via NetBackup or similar. This is especially important for file systems will millions of files with relatively few changes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
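For reference, the interface as it was being discussed (and roughly as it later shipped) is along these lines; the dataset and snapshot names are only examples:
  # zfs diff tank/home@monday tank/home@tuesday
Each output line is a change record: M for modified, + for created, - for removed, R for renamed, followed by the path. That is exactly the kind of list an incremental backup tool wants, instead of walking millions of inodes to find the handful that changed.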
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Fri, Mar 19, 2010 at 11:57 PM, Edward Ned Harvey wrote: >> 1. NDMP for putting "zfs send" streams on tape over the network. So > > Tell me if I missed something here. I don't think I did. I think this > sounds like crazy talk. > > I used NDMP up till November, when we replaced our NetApp with a Solaris Sun > box. In NDMP, to choose the source files, we had the ability to browse the > fileserver, select files, and specify file matching patterns. My point is: > NDMP is file based. It doesn't allow you to spawn a process and backup a > data stream. > > Unless I missed something. Which I doubt. ;-) 5+ years ago the variety of NDMP that was available with the combination of NetApp's OnTap and Veritas NetBackup did backups at the volume level. When I needed to go to tape to recover a file that was no longer in snapshots, we had to find space on a NetApp to restore the volume. It could not restore the volume to a Sun box, presumably because the contents of the backup used a data stream format that was proprietary to NetApp. An expired Internet Draft for NDMPv4 says: butype_name Specifies the name of the backup method to be used for the transfer (dump, tar, cpio, etc). Backup types are NDMP Server implementation dependent and MUST match one of the Data Server implementation specific butype_name strings accessible via the NDMP_CONFIG_GET_BUTYPE_INFO request. http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt It seems pretty clear from this that an NDMP data stream can contain most anything and is dependent on the device being backed up. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Wed, Mar 17, 2010 at 9:15 AM, Edward Ned Harvey wrote: >> I think what you're saying is: Why bother trying to backup with "zfs >> send" >> when the recommended practice, fully supportable, is to use other tools >> for >> backup, such as tar, star, Amanda, bacula, etc. Right? >> >> The answer to this is very simple. >> #1 ... >> #2 ... > > Oh, one more thing. "zfs send" is only discouraged if you plan to store the > data stream and do "zfs receive" at a later date. > > If instead, you are doing "zfs send | zfs receive" onto removable media, or > another server, where the data is immediately fed through "zfs receive" then > it's an entirely viable backup technique. Richard Elling made an interesting observation that suggests that storing a zfs send data stream on tape is a quite reasonable thing to do. Richard's background makes me trust his analysis of this much more than I trust the typical person that says that zfs send output is poison. http://opensolaris.org/jive/thread.jspa?messageID=465973&tstart=0#465861 I think that a similar argument could be made for storing the zfs send data streams on a zfs file system. However, it is not clear why you would do this instead of just zfs send | zfs receive. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
On Mon, Feb 8, 2010 at 9:04 PM, grarpamp wrote: > PS: Is there any way to get a copy of the list since inception > for local client perusal, not via some online web interface? You can get monthly .gz archives in mbox format from http://mail.opensolaris.org/pipermail/zfs-discuss/. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
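For example, a rough sketch of pulling one month down and reading it locally; the exact file name pattern is an assumption based on the usual pipermail layout:
$ wget http://mail.opensolaris.org/pipermail/zfs-discuss/2010-January.txt.gz
$ gunzip 2010-January.txt.gz
$ mutt -f 2010-January.txt
Any client that understands mbox files will do in place of mutt.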
Re: [zfs-discuss] zero out block / sectors
On Mon, Jan 25, 2010 at 2:32 AM, Kjetil Torgrim Homme wrote: > Mike Gerdts writes: > >> John Hoogerdijk wrote: >>> Is there a way to zero out unused blocks in a pool? I'm looking for >>> ways to shrink the size of an opensolaris virtualbox VM and using the >>> compact subcommand will remove zero'd sectors. >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. > > you'll need to (temporarily) enable compression for this to have an > effect, AFAIK. > > (dedup will obviously work, too, if you dare try it.) You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Sat, Jan 23, 2010 at 11:55 AM, John Hoogerdijk wrote: > Mike Gerdts wrote: >> >> On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk >> wrote: >> >>> >>> Is there a way to zero out unused blocks in a pool? I'm looking for ways >>> to >>> shrink the size of an opensolaris virtualbox VM and >>> using the compact subcommand will remove zero'd sectors. >>> >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. >> > > I tried this with mkfile - no joy. Let me ask a couple of the questions that come just after "are you sure your computer is plugged in?" Did you wait enough time for the data to be flushed to disk (or do sync and wait for it to complete) prior to removing the file? You did "mkfile $huge /var/tmp/junk" not "mkfile -n $huge /var/tmp/junk", right? If not, I suspect that "zpool replace" to a thin provisioned disk is going to be your best bet (as suggested in another message). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
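To illustrate the difference the -n flag makes, a quick check along these lines (path, size, and the exact du output are illustrative only) shows whether the file really consumed space:
# mkfile 1g /var/tmp/full
# mkfile -n 1g /var/tmp/sparse
# du -h /var/tmp/full /var/tmp/sparse
 1.0G   /var/tmp/full
 1K     /var/tmp/sparse
Only the first form actually writes zeroes to disk; the -n variant creates a sparse file with no blocks allocated, which does nothing to help the back end reclaim space.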
Re: [zfs-discuss] zero out block / sectors
On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk wrote: > Is there a way to zero out unused blocks in a pool? I'm looking for ways to > shrink the size of an opensolaris virtualbox VM and > using the compact subcommand will remove zero'd sectors. I've long suspected that you should be able to just use mkfile or "dd if=/dev/zero ..." to create a file that consumes most of the free space then delete that file. Certainly it is not an ideal solution, but seems quite likely to be effective. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
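A minimal sketch of that approach, assuming the pool inside the VM is mounted at /tank (the path and block size are only examples):
# dd if=/dev/zero of=/tank/zerofill bs=1024k
# sync
# rm /tank/zerofill
# sync
dd exits once the pool runs out of space; after the file is removed the zeroed blocks are free again, and the VirtualBox compact step can then drop them from the disk image.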
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Thu, Jan 21, 2010 at 11:28 AM, Richard Elling wrote: > On Jan 21, 2010, at 3:55 AM, Julian Regel wrote: >> >> Until you try to pick one up and put it in a fire safe! >> >> >Then you backup to tape from x4540 whatever data you need. >> >In case of enterprise products you save on licensing here as you need a one >> >client license per x4540 but in fact can >backup data from many clients >> >which are there. >> >> Which brings us full circle... >> >> What do you then use to backup to tape bearing in mind that the Sun-provided >> tools all have significant limitations? > > Poor choice of words. Sun resells NetBackup and (IIRC) that which was > formerly called NetWorker. Thus, Sun does provide enterprise backup > solutions. (Symantec nee Veritas) NetBackup and (EMC nee Legato) NetWorker are different products that compete in the enterprise backup space. Under the covers NetBackup uses GNU tar to gather file data for the backup stream. At one point (maybe still the case), one of the claimed features of NetBackup was that if a tape is written without multiplexing, you can use GNU tar to extract the data. This seems to be most useful when you need to recover master and/or media servers and to be able to extract your data after you no longer use NetBackup. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
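Purely as an illustration of that claim, recovering files from a non-multiplexed image with nothing but GNU tar might look roughly like the following; the tape device path, the number of tape marks to skip past NetBackup's image header, and any blocking factor are all assumptions that depend on how the image was written:
# mt -f /dev/rmt/0cbn rewind
# mt -f /dev/rmt/0cbn fsf 1
# gtar -xvf /dev/rmt/0cbn
As noted above, this only applies when the tape was written without multiplexing.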
Re: [zfs-discuss] Dedup memory overhead
On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin wrote: > Looking at dedupe code, I noticed that on-disk DDT entries are > compressed less efficiently than possible: key is not compressed at > all (I'd expect roughly 2:1 compression ratio with sha256 data), A cryptographic hash such as sha256 should not be compressible. A trivial example shows this to be the case: $ for i in {1..10000} ; do echo $i | openssl dgst -sha256 -binary ; done > /tmp/sha256 $ cd /tmp $ gzip -c sha256 > sha256.gz $ compress -c sha256 > sha256.Z $ bzip2 -c sha256 > sha256.bz2 $ ls -go sha256* -rw-r--r-- 1 320000 Jan 22 04:13 sha256 -rw-r--r-- 1 428411 Jan 22 04:14 sha256.Z -rw-r--r-- 1 321846 Jan 22 04:14 sha256.bz2 -rw-r--r-- 1 320068 Jan 22 04:14 sha256.gz -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
I use zfs send/recv in the enterprise and in smaller environments all the time and it is excellent. Have a look at how awesome the functionality is in this example. http://blog.laspina.ca/ubiquitous/provisioning_disaster_recovery_with_zfs Regards, Mike -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Sat, Jan 16, 2010 at 5:31 PM, Toby Thain wrote: > On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote: > >>> I am considering building a modest sized storage system with zfs. Some >>> of the data on this is quite valuable, some small subset to be backed >>> up "forever", and I am evaluating back-up options with that in mind. >> >> You don't need to store the "zfs send" data stream on your backup media. >> This would be annoying for the reasons mentioned - some risk of not being able >> to restore in future (although that's a pretty small risk) and inability >> to >> restore with any granularity, i.e. you have to restore the whole FS if you >> restore anything at all. >> >> A better approach would be "zfs send" and pipe directly to "zfs receive" >> on >> the external media. This way, in the future, anything which can read ZFS >> can read the backup media, and you have granularity to restore either the >> whole FS, or individual things inside there. > > There have also been comments about the extreme fragility of the data stream > compared to other archive formats. In general it is strongly discouraged for > these purposes. > Yet it is used in ZFS flash archives on Solaris 10 and is slated for use in the successor to flash archives. This initial proposal seems to imply using the same mechanism for a system image backup (instead of just system provisioning). http://mail.opensolaris.org/pipermail/caiman-discuss/2010-January/015909.html -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 12:28 PM, Torrey McMahon wrote: > On 1/8/2010 10:04 AM, James Carlson wrote: >> >> Mike Gerdts wrote: >> >>> >>> This unsupported feature is supported with the use of Sun Ops Center >>> 2.5 when a zone is put on a "NAS Storage Library". >>> >> >> Ah, ok. I didn't know that. >> >> > > Does anyone know how that works? I can't find it in the docs, no one inside > of Sun seemed to have a clue when I asked around, etc. RTFM gladly taken. Storage libraries are discussed very briefly at: http://wikis.sun.com/display/OC2dot5/Storage+Libraries Creation of zones is discussed at: http://wikis.sun.com/display/OC2dot5/Creating+Zones I've found no documentation that explains the implementation details. From looking at a test environment that I have running, it seems to go like: 1. The storage admin carves out some NFS space and exports it with the appropriate options to the various hosts (global zones). 2. In the Ops Center BUI, the Ops Center admin creates a new storage library. He selects type NFS and specifies the hostname and path that was allocated. 3. The Ops Center admin associates the storage library with various hosts. This causes it to be mounted at /var/mnt/virtlibs/ on those hosts. I'll call this $libmnt. 4. When the sysadmin provisions a zone through Ops Center, a UUID is allocated and associated with this zone. I'll call it $zuuid. A directory $libmnt/$zuuid is created with a set of directories under it. 5. As the sysadmin provisions the zone, Ops Center prompts for the virtual disk size. A file of that size is created at $libmnt/$zuuid/virtdisk/data. 6. Ops Center creates a zpool: zpool create -m /var/mnt/oc-zpools/$zuuid/ z$zuuid \ $libmnt/$zuuid/virtdisk/data 7. The zonepath is created using a uuid that is unique to the zonepath ($puuid) z$zuuid/$puuid. It has a quota and a reservation set (8G each in the zpool history I am looking at). 8. The zone is configured with zonepath=/var/mnt/oc-zpools/$zuuid/$puuid, then installed. Just in case anyone sees this as the right way to do things, I think it is generally OK with a couple of caveats. The key areas that I would suggest for improvement are: - Mount the NFS space with -o forcedirectio. There is no need to cache data twice. - Never use UUIDs in paths. This makes it nearly impossible for a sysadmin or a support person to look at the output of commands on the system and understand what it is doing. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
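As a sketch of the first suggestion, the forcedirectio option would simply be added to the NFS mount used for the storage library; the server, export, and mount point below are placeholders:
# mount -F nfs -o vers=4,forcedirectio filer:/export/virtlib /var/mnt/virtlibs/library1
With forcedirectio the NFS client stops caching the file's pages itself, so the zone's data is cached once (in the ARC of the zpool living inside the file) instead of twice.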
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 9:11 AM, Mike Gerdts wrote: > I've seen similar errors on Solaris 10 in the primary domain and on a > M4000. Unfortunately Solaris 10 doesn't show the checksums in the > ereport. There I noticed a mixture between read errors and checksum > errors - and lots more of them. This could be because the S10 zone > was a full root SUNWCXall compared to the much smaller default ipkg > branded zone. On the primary domain running Solaris 10... I've written a dtrace script to get the checksums on Solaris 10. Here's what I see with NFSv3 on Solaris 10. # zoneadm -z zone1 halt ; zpool export pool1 ; zpool import -d /mnt/pool1 pool1 ; zoneadm -z zone1 boot ; sleep 30 ; pkill dtrace # ./zfs_bad_cksum.d Tracing... dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x301b363a000) in action #4 at DIF offset 20 dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3037f746000) in action #4 at DIF offset 20 cccdtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3026e7b) in action #4 at DIF offset 20 cc Checksum errors: 3 : 0x130e01011103 0x20108 0x0 0x400 (fletcher_4_native) 3 : 0x220125cd8000 0x62425980c08 0x16630c08296c490c 0x82b320c082aef0c (fletcher_4_native) 3 : 0x2f2a0a202a20436f 0x7079726967687420 0x2863292032303031 0x2062792053756e20 (fletcher_4_native) 3 : 0x3c21444f43545950 0x452048544d4c2050 0x55424c494320222d 0x2f2f5733432f2f44 (fletcher_4_native) 3 : 0x6005a8389144 0xc2080e6405c200b6 0x960093d40800 0x9eea007b9800019c (fletcher_4_native) 3 : 0xac044a6903d00163 0xa138c8003446 0x3f2cd1e100b10009 0xa37af9b5ef166104 (fletcher_4_native) 3 : 0xbaddcafebaddcafe 0xc 0x0 0x0 (fletcher_4_native) 3 : 0xc4025608801500ff 0x1018500704528210 0x190103e50066 0xc34b90001238f900 (fletcher_4_native) 3 : 0xfe00fc01fc42fc42 0xfc42fc42fc42fc42 0xfffc42fc42fc42fc 0x42fc42fc42fc42fc (fletcher_4_native) 4 : 0x4b2a460a 0x0 0x4b2a460a 0x0 (fletcher_4_native) 4 : 0xc00589b159a00 0x543008a05b673 0x124b60078d5be 0xe3002b2a0b605fb3 (fletcher_4_native) 4 : 0x130e010111 0x32000b301080034 0x10166cb34125410 0xb30c19ca9e0c0860 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x104381285501102 0x418016996320408 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x1043812c5501102 0x81802325c080864 (fletcher_4_native) 4 : 0x130e010111 0x3a0001c01080038 0x1383812c550111c 0x818975698080864 (fletcher_4_native) 4 : 0x1f81442e9241000 0x2002560880154c00 0xff10185007528210 0x19010003e566 (fletcher_4_native) 5 : 0xbab10c 0xf 0x53ae 0xdd549ae39aa1ba20 (fletcher_4_native) 5 : 0x130e010111 0x3ab01080038 0x1163812c550110b 0x8180a7793080864 (fletcher_4_native) 5 : 0x61626300 0x0 0x0 0x0 (fletcher_4_native) 5 : 0x8003 0x3df0d6a1 0x0 0x0 (fletcher_4_native) 6 : 0xbab10c 0xf 0x5384 0xdd549ae39aa1ba20 (fletcher_4_native) 7 : 0xbab10c 0xf 0x0 0x9af5e5f61ca2e28e (fletcher_4_native) 7 : 0x130e010111 0x3a201080038 0x104381265501102 0xc18c7210c086006 (fletcher_4_native) 7 : 0x275c222074650a2e 0x5c222020436f7079 0x7269676874203139 0x38392041540a2e5c (fletcher_4_native) 8 : 0x130e010111 0x3a0003101080038 0x1623812c5501131 0x8187f66a4080864 (fletcher_4_native) 9 : 0x8a000801010c0682 0x2eed0809c1640513 0x70200ff00026424 0x18001d16101f0059 (fletcher_4_native) 12 : 0xbab10c 0xf 0x0 0x45a9e1fc57ca2aa8 (fletcher_4_native) 30 : 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe (fletcher_4_native) 47 : 0x0 0x0 0x0 0x0 (fletcher_4_native) 92 : 0x130e01011103 0x10108 0x0 0x200 
(fletcher_4_native) Since I had to guess at what the Solaris 10 source looks like, some extra eyeballs on the dtrace script are in order. Mike -- Mike Gerdts http://mgerdts.blogspot.com/ zfs_bad_cksum.d Description: Binary data ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
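For anyone who just wants a rough count of checksum failures rather than the checksum values themselves, a one-liner built on the same return probe that shows up in the errors above may be enough; it only assumes that zio_checksum_error() returns non-zero on a mismatch:
# dtrace -n 'fbt:zfs:zio_checksum_error:return /arg1 != 0/ { @failures = count(); }'
Pulling out the actual expected and computed checksums requires walking the zio structures, which is exactly the part that needed guesswork against the Solaris 10 source.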
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 5:28 AM, Frank Batschulat (Home) wrote: [snip] > Hey Mike, you're not the only victim of these strange CHKSUM errors, I hit > the same during my slightely different testing, where I'm NFS mounting an > entire, pre-existing remote file living in the zpool on the NFS server and use > that to create a zpool and install zones into it. What does your overall setup look like? Mine is: T5220 + Sun System Firmware 7.2.4.f 2009/11/05 18:21 Primary LDom Solaris 10u8 Logical Domains Manager 1.2,REV=2009.06.25.09.48 + 142840-03 Guest Domain 4 vcpus + 15 GB memory OpenSolaris snv_130 (this is where the problem is observed) I've seen similar errors on Solaris 10 in the primary domain and on a M4000. Unfortunately Solaris 10 doesn't show the checksums in the ereport. There I noticed a mixture between read errors and checksum errors - and lots more of them. This could be because the S10 zone was a full root SUNWCXall compared to the much smaller default ipkg branded zone. On the primary domain running Solaris 10... (this command was run some time ago) primary-domain# zpool status myzone pool: myzone state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAMESTATE READ WRITE CKSUM myzone DEGRADED 0 0 0 /foo/20g DEGRADED 4.53K 0 671 too many errors errors: No known data errors (this was run today, many days after previous command) primary-domain# fmdump -eV | egrep zio_err | uniq -c | head 1zio_err = 5 1zio_err = 50 1zio_err = 5 1zio_err = 50 1zio_err = 5 1zio_err = 50 2zio_err = 5 1zio_err = 50 3zio_err = 5 1zio_err = 50 Note that even though I had thousands of read errors the zone worked just fine. I would have never known (suspected?) there was a problem if I hadn't run "zpool status" or the various FMA commands. > I've filed today: > > 6915265 zpools on files (over NFS) accumulate CKSUM errors with no apparent > reason Thanks. I'll open a support call to help get some funding on it... > here's the relevant piece worth investigating out of it (leaving out the > actual setup etc..) > as in your case, creating the zpool and installing the zone into it still > gives > a healthy zpool, but immediately after booting the zone, the zpool served > over NFS > accumulated CHKSUM errors. > > of particular interest are the 'cksum_actual' values as reported by Mike for > his > test case here: > > http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html > > if compared to the 'chksum_actual' values I got in the fmdump error output on > my test case/system: > > note, the NFS servers zpool that is serving and sharing the file we use is > healthy. 
> > zone halted now on my test system, and checking fmdump: > > osoldev.batschul./export/home/batschul.=> fmdump -eV | grep cksum_actual | > sort | uniq -c | sort -n | tail > 2 cksum_actual = 0x4bea1a77300 0xf6decb1097980 0x217874c80a8d9100 > 0x7cd81ca72df5ccc0 > 2 cksum_actual = 0x5c1c805253 0x26fa7270d8d2 0xda52e2079fd74 > 0x3d2827dd7ee4f21 > 6 cksum_actual = 0x28e08467900 0x479d57f76fc80 0x53bca4db5209300 > 0x983ddbb8c4590e40 > *A 6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 > 0x89715e34fbf9cdc0 > *B 7 cksum_actual = 0x0 0x0 0x0 0x0 > *C 11 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 > 0x280934efa6d20f40 > *D 14 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 > 0x7e0aef335f0c7f00 > *E 17 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 > 0xd4f1025a8e66fe00 > *F 20 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 > 0x7f84b11b3fc7f80 > *G 25 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 > 0x82804bc6ebcfc0 > > osoldev.root./export/home/batschul.=> zpool status -v > pool: nfszone > state: DEGRADED > status: One or more devices has experienced an unrecoverable error. An > attempt was made to correct the error. Applications are unaffected. > action: Determine if the device needs to be replaced, and clear the errors > using 'zpool clear' or replace the device with 'zpool replace'. > see: http://www.sun.com/msg/ZFS-8000-9P > scrub: none requested > config: > > NAME
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:51 AM, James Carlson wrote: > Frank Batschulat (Home) wrote: >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in NFS ? > > One possible cause would be a lack of substantial exercise. The man > page says: > > A regular file. The use of files as a backing store is > strongly discouraged. It is designed primarily for > experimental purposes, as the fault tolerance of a file > is only as good as the file system of which it is a > part. A file must be specified by a full path. > > Could it be that "discouraged" and "experimental" mean "not tested as > thoroughly as you might like, and certainly not a good idea in any sort > of production environment?" > > It sounds like a bug, sure, but the fix might be to remove the option. This unsupported feature is supported with the use of Sun Ops Center 2.5 when a zone is put on a "NAS Storage Library". -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:55 AM, Darren J Moffat wrote: > Frank Batschulat (Home) wrote: >> >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in >> NFS ? > > What are you using for on the wire protection with NFS ? Is it shared using > krb5i or do you have IPsec configured ? If not I'd recommend trying one of > those and see if your symptoms change. Shouldn't a scrub pick that up? Why would there be no errors from "zoneadm install", which under the covers does a pkg image create followed by *multiple* pkg install invocations. No checksum errors pop up there. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zones on shared storage - a warning
[removed zones-discuss after sending heads-up that the conversation will continue at zfs-discuss] On Mon, Jan 4, 2010 at 5:16 PM, Cindy Swearingen wrote: > Hi Mike, > > It is difficult to comment on the root cause of this failure since > the several interactions of these features are unknown. You might > consider seeing how Ed's proposal plays out and let him do some more > testing... Unfortunately Ed's proposal is not funded last I heard. Ops Center uses many of the same mechanisms for putting zones on ZFS. This is where I saw the problem initially. > If you are interested in testing this with NFSv4 and it still fails > the same way, then also consider testing this with a local file > instead of a NFS-mounted file and let us know the results. I'm also > unsure of using the same path for the pool and the zone root path, > rather than one path for pool and a pool/dataset path for zone > root path. I will test this myself if I get some time. I have been unable to reproduce with a local file. I have been able to reproduce with NFSv4 on build 130. Rather surprisingly the actual checksums found in the ereports are sometimes "0x0 0x0 0x0 0x0" or "0xbaddcafe00 ...". Here's what I did: - Install OpenSolaris build 130 (ldom on T5220) - Mount some NFS space at /nfszone: mount -F nfs -o vers=4 $file:/path /nfszone - Create a 10gig sparse file cd /nfszone mkfile -n 10g root - Create a zpool zpool create -m /zones/nfszone nfszone /nfszone/root - Configure and install a zone zonecfg -z nfszone set zonepath = /zones/nfszone set autoboot = false verify commit exit chmod 700 /zones/nfszone zoneadm -z nfszone install - Verify that the nfszone pool is clean. First, pkg history in the zone shows the timestamp of the last package operation 2010-01-07T20:27:07 install pkg Succeeded At 20:31 I ran: # zpool status nfszone pool: nfszone state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 0 errors: No known data errors I booted the zone. By 20:32 it had accumulated 132 checksum errors: # zpool status nfszone pool: nfszone state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM nfszone DEGRADED 0 0 0 /nfszone/root DEGRADED 0 0 132 too many errors errors: No known data errors fmdump has some very interesting things to say about the actual checksums. 
The 0x0 and 0xbaddcafe00 seem to shout that these checksum errors are not due to a couple bits flipped # fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail 2cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 0x290cbce13fc59dce 3cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 0x7e0aef335f0c7f00 3cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 0xd4f1025a8e66fe00 4cksum_actual = 0x0 0x0 0x0 0x0 4cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 0x330107da7c4bcec0 5cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 0x4e0b3a8747b8a8 6cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 0x280934efa6d20f40 6cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 0x89715e34fbf9cdc0 16cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 0x7f84b11b3fc7f80 48cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 0x82804bc6ebcfc0 I halted the zone, exported the pool, imported the pool, then did a scrub. Everything seemed to be OK: # zpool export nfszone # zpool import -d /nfszone nfszone # zpool status nfszone pool: nfszone state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 0 errors: No known data errors # zpool scrub nfszone # zpool status nfszone pool: nfszone state: ONLINE scrub: scrub completed after 0h0m with 0 errors on Thu Jan 7 21:56:47 2010 config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 0 errors: No known data errors But then I booted the zone... # zoneadm -z nfszone boot # zpool status nfszone pool: nfszone state: ONLINE status: One or more devices has experienced an unrecoverabl
Re: [zfs-discuss] Clearing a directory with more than 60 million files
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi wrote: > Hello, > > As a result of one badly designed application running loose for some time, > we now seem to have over 60 million files in one directory. Good thing > about ZFS is that it allows it without any issues. Unfortunatelly now that > we need to get rid of them (because they eat 80% of disk space) it seems > to be quite challenging. > > Traditional approaches like "find ./ -exec rm {} \;" seem to take forever > - after running several days, the directory size still says the same. The > only way how I've been able to remove something has been by giving "rm > -rf" to problematic directory from parent level. Running this command > shows directory size decreasing by 10,000 files/hour, but this would still > mean close to ten months (over 250 days) to delete everything! > > I also tried to use "unlink" command to directory as a root, as a user who > created the directory, by changing directory's owner to root and so forth, > but all attempts gave "Not owner" error. > > Any commands like "ls -f" or "find" will run for hours (or days) without > actually listing anything from the directory, so I'm beginning to suspect > that maybe the directory's data structure is somewhat damaged. Is there > some diagnostics that I can run with e.g "zdb" to investigate and > hopefully fix for a single directory within zfs dataset? In situations like this, ls will be exceptionally slow partially because it will sort the output. Find is slow because it needs to call lstat() on every entry. In similar situations I have found the following to work. perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }' Replace print with unlink if you wish... > > To make things even more difficult, this directory is located in rootfs, > so dropping the zfs filesystem would basically mean reinstalling the > entire system, which is something that we really wouldn't wish to go. > > > OS is Solaris 10, zpool version is 10 (rather old, I know, but is there > easy path for upgrade that might solve this problem?) and the zpool > consists two 146 GB SAS drivers in a mirror setup. > > > Any help would be appreciated. > > Thanks, > Mikko > > -- > Mikko Lammi | l...@lmmz.net | http://www.lmmz.net > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
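For the deletion itself, a variant of that one-liner which unlinks instead of printing (and skips the . and .. entries) might look like this, run from inside the problem directory:
# perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { next if $d eq "." or $d eq ".."; unlink($d) or warn "$d: $!\n"; }'
Because readdir() streams entries without sorting them or calling lstat() on each one, it avoids both of the costs described above.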
Re: [zfs-discuss] Zpool creation best practices
Thanks for the response Marion. I'm glad that I'm not the only one. :) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zpool creation best practices
I'm just wondering what some of you might do with your systems. We have an EMC Clariion unit that I connect several Sun machines to. I allow the EMC to do its hardware raid5 for several luns and then I stripe them together. I considered using raidz and just configuring the EMC as a JBOD, but I thought it would defeat the purpose of paying so much for a system with the advanced redundancy system. I also like to add luns on the fly when a system needs more file space and I know you can't do that with raidz. I've never had a lun go bad but bad things do happen. Does anyone else use ZFS in this way? Is this an unrecommended setup? It's too late to change my setup, but in the future when I'm planning new systems, should I consider the effort to allow zfs to fully control all the disks? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
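For what it's worth, the layout described above boils down to something like this (device names invented for the example):
# zpool create tank c3t0d0 c3t1d0
# zpool add tank c3t2d0
Each device is a hardware raid5 LUN from the Clariion, and the zpool add grows the stripe later when more space is needed. The trade-off is that ZFS can detect corruption through its checksums but has no redundant copy of its own to repair from; copies=2 or a ZFS-level mirror or raidz would be needed for self-healing.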
Re: [zfs-discuss] Changing ZFS drive pathing
Just thought I would let you all know that I followed what Alex suggested along with what many of you pointed out and it worked! Here are the steps I followed: 1. Break root drive mirror 2. zpool export filesystem 3. run the command to start MPxIO and reboot the machine 4. zpool import filesystem 5. Check the system 6. Recreate the mirror. Thank you all for the help! I feel much better and it worked without a single problem! I'm very impressed with MPxIO and wish I had known about it before spending thousands of dollars on PowerPath. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
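For anyone following the same path, those steps map roughly onto the commands below (pool name assumed; stmsboot is the standard way to enable MPxIO and it offers to do the reboot for you):
# zpool export tank
# stmsboot -e
# init 6
# zpool import tank
# zpool status tank
The export before enabling MPxIO and the import afterwards are what let ZFS pick the vdevs back up under their new scsi_vhci device names.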
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling wrote: > If the allocator can change, what sorts of policies should be > implemented? Examples include: > + should the allocator stick with best-fit and encourage more > gangs when the vdev is virtual? > + should the allocator be aware of an SSD's page size? Is > said page size available to an OS? > + should the metaslab boundaries align with virtual storage > or SSD page boundaries? Wandering off topic a little bit... Should the block size be a tunable so that it can match the page size of an SSD (typically 4K, right?) and of upcoming hard disks that sport a sector size > 512 bytes? http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt > And, perhaps most important, how can this be done automatically > so that system administrators don't have to be rocket scientists > to make a good choice? Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, options to those subcommands, evil tuning that is sometimes needed, and effects of redundancy choices, then there is no need for any rocket scientists. :) -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
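As a point of reference, the closest existing knob is the per-dataset recordsize property, which caps the block size used for new writes; the dataset name and value here are just an example:
# zfs set recordsize=4k tank/ssdtest
That only controls the logical block size, though. It says nothing about whether the allocator lines those blocks up with SSD page or metaslab boundaries, which is the harder part of the question above.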
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling wrote: > On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote: > >> Devzero, >> >> Unfortunately that was my assumption as well. I don't have source level >> knowledge of ZFS, though based on what I know it wouldn't be an easy way to >> do it. I'm not even sure it's only a technical question, but a design >> question, which would make it even less feasible. > > It is not hard, because ZFS knows the current free list, so walking that > list > and telling the storage about the freed blocks isn't very hard. > > What is hard is figuring out if this would actually improve life. The > reason > I say this is because people like to use snapshots and clones on ZFS. > If you keep snapshots, then you aren't freeing blocks, so the free list > doesn't grow. This is a very different use case than UFS, as an example. It seems as though the oft mentioned block rewrite capabilities needed for pool shrinking and changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be re-written in such a way as to minimize the number of chunks of storage that are allocated. The current HDS chunk size is 42 MB. The most benefit would seem to be to have ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned LUN. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txgs old (about 1 day old if 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk. This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers and the sysadmin can control the amount of space really used by quotas. In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocated SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides. > > There are a few minor bumps in the road. The ATA PASSTHROUGH > command, which allows TRIM to pass through the SATA drivers, was > just integrated into b130. This will be more important to small servers > than SANs, but the point is that all parts of the software stack need to > support the effort. As such, it is not clear to me who, if anyone, inside > Sun is champion for the effort -- it crosses multiple organizational > boundaries. > >> >> Apart from the technical possibilities, this feature looks really >> inevitable to me in the long run especially for enterprise customers with >> high-end SAN as cost is always a major factor in a storage design and it's a >> huge difference if you have to pay based on the space used vs space >> allocated (for example). > > If the high cost of SAN storage is the problem, then I think there are > better ways to solve that :-) The "SAN" could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used and the "SAN" could reclaim those blocks.
This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
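The quota dance described above is plain property management; the dataset name and sizes here are invented for the example:
# zfs set quota=500g tank/apps
# zfs set quota=600g tank/apps
# zfs set quota=500g tank/apps
The first setting is the day-to-day cap, well inside what the SAN has allocated; the second gives emergency headroom while demand surges; the third brings it back down once the surge (or errant process) subsides. Shrinking the space the pool has actually touched, of course, still depends on the block rewrite feature discussed above.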
Re: [zfs-discuss] Zones on shared storage - a warning
On Tue, Dec 22, 2009 at 8:02 PM, Mike Gerdts wrote: > I've been playing around with zones on NFS a bit and have run into > what looks to be a pretty bad snag - ZFS keeps seeing read and/or > checksum errors. This exists with S10u8 and OpenSolaris dev build > snv_129. This is likely a blocker for anything thinking of > implementing parts of Ed's Zones on Shared Storage: > > http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss > > The OpenSolaris example appears below. The order of events is: > > 1) Create a file on NFS, turn it into a zpool > 2) Configure a zone with the pool as zonepath > 3) Install the zone, verify that the pool is healthy > 4) Boot the zone, observe that the pool is sick [snip] An off list conversation and a bit of digging into other tests I have done shows that this is likely limited to NFSv3. I cannot say that this problem has been seen with NFSv4. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zones on shared storage - a warning
I've been playing around with zones on NFS a bit and have run into what looks to be a pretty bad snag - ZFS keeps seeing read and/or checksum errors. This exists with S10u8 and OpenSolaris dev build snv_129. This is likely a blocker for anything thinking of implementing parts of Ed's Zones on Shared Storage: http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss The OpenSolaris example appears below. The order of events is: 1) Create a file on NFS, turn it into a zpool 2) Configure a zone with the pool as zonepath 3) Install the zone, verify that the pool is healthy 4) Boot the zone, observe that the pool is sick r...@soltrain19# mount filer:/path /mnt r...@soltrain19# cd /mnt r...@soltrain19# mkdir osolzone r...@soltrain19# mkfile -n 8g root r...@soltrain19# zpool create -m /zones/osol osol /mnt/osolzone/root r...@soltrain19# zonecfg -z osol osol: No such zone configured Use 'create' to begin configuring a new zone. zonecfg:osol> create zonecfg:osol> info zonename: osol zonepath: brand: ipkg autoboot: false bootargs: pool: limitpriv: scheduling-class: ip-type: shared hostid: zonecfg:osol> set zonepath=/zones/osol zonecfg:osol> set autoboot=false zonecfg:osol> verify zonecfg:osol> commit zonecfg:osol> exit r...@soltrain19# chmod 700 /zones/osol r...@soltrain19# zoneadm -z osol install Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ http://pkg-na-2.opensolaris.org/dev/). Publisher: Using contrib (http://pkg.opensolaris.org/contrib/). Image: Preparing at /zones/osol/root. Cache: Using /var/pkg/download. Sanity Check: Looking for 'entire' incorporation. Installing: Core System (output follows) DOWNLOAD PKGS FILESXFER (MB) Completed46/46 12334/1233493.1/93.1 PHASEACTIONS Install Phase18277/18277 No updates necessary for this image. Installing: Additional Packages (output follows) DOWNLOAD PKGS FILESXFER (MB) Completed36/36 3339/333921.3/21.3 PHASEACTIONS Install Phase 4466/4466 Note: Man pages can be obtained by installing SUNWman Postinstall: Copying SMF seed repository ... done. Postinstall: Applying workarounds. Done: Installation completed in 2139.186 seconds. Next Steps: Boot the zone, then log into the zone console (zlogin -C) to complete the configuration process. 6.3 Boot the OpenSolaris zone r...@soltrain19# zpool status osol pool: osol state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM osol ONLINE 0 0 0 /mnt/osolzone/root ONLINE 0 0 0 errors: No known data errors r...@soltrain19# zoneadm -z osol boot r...@soltrain19# zpool status osol pool: osol state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM osol DEGRADED 0 0 0 /mnt/osolzone/root DEGRADED 0 0 117 too many errors errors: No known data errors r...@soltrain19# zlogin osol uptime 5:31pm up 1 min(s), 0 users, load average: 0.69, 0.38, 0.52 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss