Re: [zfs-discuss] Drive showing as "removed"

2010-06-08 Thread Joe Auty




Richard Elling wrote:

  On Jun 7, 2010, at 4:50 PM, besson3c wrote:

  
  
Hello,

I have a drive that was a part of the pool showing up as "removed". I made no changes to the machine, and there are no errors being displayed, which is rather weird:

# zpool status nm
 pool: nm
state: DEGRADED
scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nm          DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  REMOVED      0     0     0


What would your advice be here? What do you think happened, and what is the smartest way to bring this disk back up?

  
  
Can you send the output of "zpool history nm" ?

  


It consists of a ton of renames and snapshots executed by my ZFS
snapshot cron job that runs every night. If you want, I can set up a job
to grep for lines that are not snapshots and renames, but there is
nothing other than snapshots and renames on my last page of results.
What sort of thing were you expecting to find, anyway?
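
For reference, a quick way to filter that history down to everything
other than snapshots and renames (just plain egrep over the pool named
in this thread; adjust the pattern as needed):

# zpool history nm | egrep -v 'snapshot|rename'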



  
  
Since there are no errors I'm inclined to throw it back into the pool and see what happens rather than trying to replace it straight away. 

Thoughts?

  
  
Sounds like a reasonable plan to me.
  


How would I do so? Attach? Replace? Some other command? I'm not sure
how to add a disk to an already established Raid-Z pool safely. In
fact, I didn't think it could be done...




   -- richard

  



-- 

Joe Auty, NetMusician
NetMusician
helps musicians, bands and artists create beautiful,
professional, custom designed, career-essential websites that are easy
to maintain and to integrate with popular social networks.
www.netmusician.org
j...@netmusician.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Someone is porting ZFS to Linux(again)!!

2010-06-08 Thread ???
Just found this project: http://github.com/behlendorf/zfs

Does this mean we will be able to use ZFS as a Linux kernel module in the near future? :)

Looking forward to it!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive showing as "removed"

2010-06-08 Thread Cindy Swearingen

Hi Joe,

The REMOVED status generally means that a device was physically removed
from the system.

If necessary, physically reconnect c0t7d0 or if connected, check
cabling, power, and so on.

If the device is physically connected, see what cfgadm says about this
device. For example, a device that was unconfigured from the system
would look like  this:

# cfgadm -al | grep c4t2d0
c4::dsk/c4t2d0   disk   connected   unconfigured   unknown

(Finding the right cfgadm format for your h/w is another challenge.)
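
If it does show up as unconfigured, reconfiguring the attachment point
is usually along these lines; treat it as a sketch, since the exact
attachment-point name depends on your controller:

# cfgadm -c configure c4::dsk/c4t2d0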

I'm very cautious about other people's data so consider this issue:

If possible, you might import the pool while you are physically
inspecting the device or changing it physically. Depending on your
hardware, I've heard of device paths changing if another device is
reseated or changes.

Thanks,

Cindy

On 06/07/10 17:50, besson3c wrote:

Hello,

I have a drive that was a part of the pool showing up as "removed". I made no 
changes to the machine, and there are no errors being displayed, which is rather weird:

# zpool status nm
  pool: nm
 state: DEGRADED
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nm          DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  REMOVED      0     0     0


What would your advice be here? What do you think happened, and what is the smartest way to bring this disk back up? Since there are no errors I'm inclined to throw it back into the pool and see what happens rather than trying to replace it straight away. 


Thoughts?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-08 Thread Hillel Lubman
A very interesting video from DebConf, which addresses CDDL and GPL 
incompatibility issues, and some original reasoning behind CDDL usage:

http://caesar.acc.umu.se/pub/debian-meetings/2006/debconf6/theora-small/2006-05-14/tower/OpenSolaris_Java_and_Debian-Simon_Phipps__Alvaro_Lopez_Ortega.ogg
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-08 Thread besson3c
It would be helpful if you posted more information about your 
configuration. 
Numbers *are* useful too, but minimally, describing your setup, use case, 
the hardware and other such facts would provide people a place to start. 

There are much brighter stars on this list than myself, but if you are sharing 
your ZFS dataset(s) via NFS with a heavy traffic load (particularly writes), 
a mirrored SLOG will probably be useful.  (The ZIL is a component of every 
ZFS pool.  A SLOG is a device, usually an SSD or mirrored pair of SSDs, 
on which you can locate your ZIL for enhanced *synchronous* write 
performance.)  Since ZFS does sync writes, that might be a win for you, but 
again it depends on a lot of factors.
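
(For reference, adding a mirrored log device to an existing pool is a
one-liner; the pool and SSD device names below are placeholders:)

# zpool add nm log mirror c1t0d0 c1t1d0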

Sure! The pool consists of 6 SATA drives configured as RAID-Z. There are no 
special read or write cache drives. This pool is shared to several VMs via NFS, 
these VMs manage email, web, and a Quickbooks server running on FreeBSD, Linux, 
and Windows.

On heavy reads or writes (writes seem to be more problematic) my load averages 
on my VM host shoot up and overall performance is bogged down. I suspect that I 
do need a mirrored SLOG, but I'm wondering what the best way is to go about 
assessing this so that I can be more certain. I'm also wondering what other 
sorts of things can be tweaked software-wise on either the VM host (running 
CentOS) or the Solaris side to give me a little more headroom. The thought has 
crossed my mind that a dedicated SLOG pair of SSDs might be overkill for my 
needs; this is not a huge business (yet :)

Thanks for your help!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive showing as "removed"

2010-06-08 Thread Cindy Swearingen

Joe,

Yes, the device should resilver when it's back online.

You can use the fmdump -eV command to discover when this device was
removed and other hardware-related events to help determine when this
device was removed.
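
For example, to list the logged error events and pull out just the ones
for this disk (device name taken from this thread):

# fmdump -e
# fmdump -eV | grep c0t7d0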

I would recommend exporting (not importing) the pool before physically
changing the hardware. After the device is back online and the pool is
imported, you might need to use zpool clear to clear the pool status.
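
Roughly, the sequence being described, using the pool and disk names
from this thread (a sketch, not a recipe):

# zpool export nm
  (reseat or reconnect c0t7d0)
# zpool import nm
# zpool clear nm
# zpool status nm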

Thanks,

Cindy

On 06/08/10 11:11, Joe Auty wrote:

Cindy Swearingen wrote:

Hi Joe,

The REMOVED status generally means that a device was physically removed
from the system.

If necessary, physically reconnect c0t7d0 or if connected, check
cabling, power, and so on.

If the device is physically connected, see what cfgadm says about this
device. For example, a device that was unconfigured from the system
would look like  this:

# cfgadm -al | grep c4t2d0
c4::dsk/c4t2d0   disk   connected   unconfigured   unknown

(Finding the right cfgadm format for your h/w is another challenge.)

I'm very cautious about other people's data so consider this issue:

If possible, you might import the pool while you are physically
inspecting the device or changing it physically. Depending on your
hardware, I've heard of device paths changing if another device is
reseated or changes.



Thanks Cindy!

Here is what cfgadm is showing me:

# cfgadm -al | grep c0t7d0
c0::dsk/c0t7d0   disk   connected   configured   unknown



I'll definitely start with a reseating of the drive. I'm assuming that 
once Solaris thinks the drive is no longer removed it will start 
leveling on its own?




Thanks,

Cindy

On 06/07/10 17:50, besson3c wrote:

Hello,

I have a drive that was a part of the pool showing up as "removed". I 
made no changes to the machine, and there are no errors being 
displayed, which is rather weird:


# zpool status nm
  pool: nm
 state: DEGRADED
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nm          DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  REMOVED      0     0     0


What would your advice be here? What do you think happened, and what 
is the smartest way to bring this disk back up? Since there are no 
errors I'm inclined to throw it back into the pool and see what 
happens rather than trying to replace it straight away.
Thoughts? 



--
Joe Auty, NetMusician
NetMusician helps musicians, bands and artists create beautiful, 
professional, custom designed, career-essential websites that are easy 
to maintain and to integrate with popular social networks.

www.netmusician.org 
j...@netmusician.org 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] combining series of snapshots

2010-06-08 Thread BJ Quinn
I have a series of daily snapshots against a set of data that go for several 
months, but then the server crashed.  In a hurry, we set up a new server and 
just copied over the live data and didn't bother with the snapshots (since zfs 
send/recv was too slow and would have taken hours and hours to restore).  We've 
now run on the replacement server for a year or so and it's time to upgrade to 
a new, faster server.

As part of building up our newest server, I wanted to combine the older 
snapshots with the daily snapshots generated on the server that is currently 
running.  I was wondering what the proper way to do this might be.

I was considering the following process for building up the new server :

1.  Copy over all the snapshots from a backup of the server that crashed 
(11/01/2008 - 7/14/2009) using zfs send/recv
2.  Copy over the oldest snapshot from the current server (7/15/2009) using 
rsync so that the data from that snapshot is the live filesystem data on the 
new server.
3.  Take a snapshot on the new server and call it the same thing as the 
snapshot that I copied the data from (i.e. datap...@nightly20090715)
4.  Do an incremental zfs send/recv from 7/15/2009 to today from the current 
server.
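
(A minimal sketch of step 4 as described, assuming the dataset is named
datapool and the newest snapshot on the current server is
nightly20100608, both placeholders; -I sends every intermediate
snapshot between the two in one stream:)

# zfs send -I datapool@nightly20090715 datapool@nightly20100608 | \
    ssh newserver zfs recv datapool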

I don't know if this would work, or if it would leave me in a consistent state 
even if I did make it work.  Any suggestions?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-08 Thread Brandon High
On Tue, Jun 8, 2010 at 10:33 AM, besson3c  wrote:
> On heavy reads or writes (writes seem to be more problematic) my load 
> averages on my VM host shoot up and overall performance is bogged down. I 
> suspect that I do need a mirrored SLOG, but I'm wondering what the best way is

The load that you're seeing is probably iowait. If that's the case,
it's almost certainly the write speed of your pool. A raidz will be
slow for your purposes, and adding a zil may help. There's been lots
of discussion in the archives about how to determine if a log device
will help, such as using zilstat or disabling the zil and testing.

You may want to set the recordsize smaller for the datasets that
contain vmdk files as well. With the default recordsize of 128k, a 4k
write by the VM host can result in 128k being read from and written to
the dataset.
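
For example, on the dataset holding the vmdk files (dataset name as
given later in this thread; 8k is just an illustrative starting point,
and it only affects files written after the change):

# zfs set recordsize=8k nm/myshare
# zfs get recordsize nm/myshare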

What VM software are you using? There are a few knobs you can turn in
VBox which will help with slow storage. See
http://www.virtualbox.org/manual/ch12.html#id2662300 for instructions
on reducing the flush interval.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread Brandon High
On Tue, Jun 8, 2010 at 10:51 AM, BJ Quinn  wrote:
> 3.  Take a snapshot on the new server and call it the same thing as the 
> snapshot that I copied the data from (i.e. datap...@nightly20090715)

It won't work, because the two snapshots are different. It doesn't
matter if they have the same name; the snapshots are of two separate
filesystems.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] SATA/SAS Interposer Cards

2010-06-08 Thread Steve D. Jost
Hello all,
We have 2 Solaris 10u8 boxes in a small cluster (active/passive) 
serving up a ZFS-formatted shared SAS tray as an NFS share.  We are going to be 
adding a few SSDs into our disk pool and have determined that we need a 
SATA/SAS Interposer AAMUX card.  Currently the storage tray vendor does not 
have a supported solution for 2.5" SATA disks or SSDs.  We've found a few 
interposer cards based on the LSISS9252 but haven't found one we can directly 
purchase in smaller quantities than bags of 144.  Does anyone have any advice 
in acquiring smaller quantities (4) of interposer cards?  Is there another card 
we should be looking at?  Thanks,

Steve Jost

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog / log recovery is here!

2010-06-08 Thread R. Eulenberg
Hi,

yesterday I changed the /etc/system file and ran:
zdb -e -bcsvL tank1
which produced no output and never returned to the prompt (the process hangs),
and I got the same result from running:
zdb -eC tank1

Regards 
Ron
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-08 Thread Brandon High
On Tue, Jun 8, 2010 at 11:27 AM, Joe Auty  wrote:

> things. I've also read this on a VMWare forum, although I don't know if
> this is correct? This is in context to me questioning why I don't seem to have
> these same load average problems running Virtualbox:
>
> The problem with the comparison VirtualBox comparison is that caching is
> known to be broken in VirtualBox (ignores cache flush, which, by continuing
> to cache, can "speed up" IO at the expense of data integrity or loss). This
> could be playing in your favor from a performance perspective, but puts your
> data at risk. Disabling disk caching altogether would be a big hit on the
> Virtualbox side... Neither solution is ideal.
>
>
Check the link that I posted earlier, under "Responding to guest IDE/SATA
flush requests". Setting IgnoreFlush to 0 will turn off the extra caching.


> I've actually never seen much, if any iowait (%w in iostat output, right?).
> I've run the zilstat script and am happy to share that output with you if
> you wouldn't mind taking a look at it? I'm not sure I'm understanding its
> output correctly...
>

You'll see iowait on the VM, not on the zfs server.



> Will this tuning have an impact on my existing VMDK files? Can you kindly
> tell me more about this, how I can observe my current recordsize and play
> around with this setting if it will help? Will adjusting ZFS compression on
> my share hosting my VMDKs be of any help too? Compression is disabled on my
> ZFS share where my VMDKs are hosted.
>

No, your existing files will keep whatever recordsize they were created
with. You can view or change the recordsize property the same as any other
zfs property. You'll have to recreate the files to re-write them with a
different recordsize. (e.g.: cp file.vmdk file.vmdk.foo && mv
file.vmdk.foo file.vmdk)


> This ZFS host hosts regular data shares in addition to the VMDKs. All user
> data on my VM guests that is subject to change is hosted on a ZFS share,
> only the OS and basic OS applications are saved to my VMDKs.
>

The property is per dataset. If the vmdk files are in separate datasets
(which I recommend) you can adjust the properties or take snapshots of each
VM's data separately.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-08 Thread Joerg Schilling
Hillel Lubman  wrote:

> A very interesting video from DebConf, which addresses CDDL and GPL 
> incompatibility issues, and some original reasoning behind CDDL usage:
>
> http://caesar.acc.umu.se/pub/debian-meetings/2006/debconf6/theora-small/2006-05-14/tower/OpenSolaris_Java_and_Debian-Simon_Phipps__Alvaro_Lopez_Ortega.ogg

This video is not interesting, it is wrong.

Danese Cooper makes incorrect claims, and her claims have already been 
refuted by Simon Phipps. 

http://www.opensolaris.org/jive/message.jspa?messageID=55013#55008

Hope this helps.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive showing as "removed"

2010-06-08 Thread Cindy Swearingen

According to this report, I/O to this device caused a probe failure on
May 31 because the device wasn't available.

I was curious if this device had any previous issues over a longer
period of time.

Failing or faulted drives can also kill your pool's performance.

Thanks,

Cindy

On 06/08/10 11:39, Joe Auty wrote:

Cindy Swearingen wrote:

Joe,

Yes, the device should resilver when it's back online.

You can use the fmdump -eV command to discover when this device was
removed and other hardware-related events to help determine when this
device was removed.

I would recommend exporting (not importing) the pool before physically
changing the hardware. After the device is back online and the pool is
imported, you might need to use zpool clear to clear the pool status.



Here is the output of that command, does this reveal anything useful? 
c0t7d0 is the drive that is marked as removed... I'll look into the 
import and export functions to learn more about them. Thanks!



# fmdump -eV
TIME   CLASS
May 31 2010 05:33:36.363381880 ereport.fs.zfs.probe_failure
nvlist version: 0
class = ereport.fs.zfs.probe_failure
ena = 0x5d2206865ac00401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0x28ebd14a56dfe4df
vdev = 0xdbdc49ecb5479c40
(end detector)

pool = nm
pool_guid = 0x28ebd14a56dfe4df
pool_context = 0
pool_failmode = wait
vdev_guid = 0xdbdc49ecb5479c40
vdev_type = disk
vdev_path = /dev/dsk/c0t7d0s0
vdev_devid = id1,s...@n5000c5001e7cf7a7/a
parent_guid = 0x16cbb2c1f07c5f51
parent_type = raidz
prev_state = 0x0
__ttl = 0x1
__tod = 0x4c038270 0x15a8c478





Thanks,

Cindy

On 06/08/10 11:11, Joe Auty wrote:

Cindy Swearingen wrote:

Hi Joe,

The REMOVED status generally means that a device was physically removed
from the system.

If necessary, physically reconnect c0t7d0 or if connected, check
cabling, power, and so on.

If the device is physically connected, see what cfgadm says about this
device. For example, a device that was unconfigured from the system
would look like  this:

# cfgadm -al | grep c4t2d0
c4::dsk/c4t2d0   disk   connected   unconfigured   unknown


(Finding the right cfgadm format for your h/w is another challenge.)

I'm very cautious about other people's data so consider this issue:

If possible, you might import the pool while you are physically
inspecting the device or changing it physically. Depending on your
hardware, I've heard of device paths changing if another device is
reseated or changes. 


Thanks Cindy!

Here is what cfgadm is showing me:

# cfgadm -al | grep c0t7d0
c0::dsk/c0t7d0   disk   connected   configured   unknown



I'll definitely start with a reseating of the drive. I'm assuming 
that once Solaris thinks the drive is no longer removed it will start 
leveling on its own?




Thanks,

Cindy

On 06/07/10 17:50, besson3c wrote:

Hello,

I have a drive that was a part of the pool showing up as "removed". 
I made no changes to the machine, and there are no errors being 
displayed, which is rather weird:


# zpool status nm
  pool: nm
 state: DEGRADED
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        nm          DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  REMOVED      0     0     0


What would your advice be here? What do you think happened, and 
what is the smartest way to bring this disk back up? Since there 
are no errors I'm inclined to throw it back into the pool and see 
what happens rather than trying to replace it straight away.
Thoughts? 



--
Joe Auty, NetMusician
NetMusician helps musicians, bands and artists create beautiful, 
professional, custom designed, career-essential websites that are 
easy to maintain and to integrate with popular social networks.

www.netmusician.org 
j...@netmusician.org  



--
Joe Auty, NetMusician
NetMusician helps musicians, bands and artists create beautiful, 
professional, custom designed, career-essential websites that are easy 
to maintain and to integrate with popular social networks.

www.netmusician.org 
j...@netmusician.org 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-08 Thread Hillel Lubman
Joerg Schilling wrote:

> This video is not interesting, it is wrong.
> Danese Cooper makes incorrect claims, and her claims have already been
> refuted by Simon Phipps.
> 
> http://www.opensolaris.org/jive/message.jspa?messageID=55013#55008
>
> Hope this helps.
>
>Jörg

I see it's a pretty heated and involved discussion :) So according to Simon 
Phipps, the reason behind using CDDL was simply pragmatic (to push the code 
out earlier). But whatever the original intent was, now it's Oracle who will 
decide whether to change it or not. And Oracle is not too talkative about such 
things :)
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-08 Thread Miles Nordin
> "re" == Richard Elling  writes:

re> Please don't confuse Ethernet with IP.

okay, but I'm not.  seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

Ethernet output queues are either FIFO or RED, and are large compared
to FC and IB.  FC is buffer-credit, which HOL-blocks to prevent the
small buffers from overflowing, and IB is...blocking (almost no buffer
at all---about 2KB per port and bandwidth*delay product of about 1KB
for the whole mesh, compared to ARISTA which has about 48MB per port,
so, except to the pedantic, IB is bufferless, i.e. it does not even buffer one
full frame).  Unlike Ethernet, both are lossless fabrics (sounds good)
and have an HOL-blocking character (sounds bad).  They're
fundamentally different at L2, so this is not about IP.  If you run IP
over IB, it is still blocking and lossless.  It does not magically
start buffering when you use IP because the fabric is simply unable to
buffer---there is no RAM in the mesh anywhere.  Both L2 and L3
switches have output queues, and both L3 and L2 output queues can be
FIFO or RED because the output buffer exists in the same piece of
silicon of an L3 switch no matter whether it's set to forward in L2 or
L3 mode, so L2 and L3 switches are like each other and unlike FC & IB.
This is not about IP.  It's about Ethernet.

a relevant congestion difference between L3 and L2 switches (confusing
ethernet with IP) might be ECN, because only an L3 switch can do ECN.
But I don't think anyone actually uses ECN.  It's disabled by default
in Solaris and, I think, all other Unixes.  AFAICT my Extreme
switches, a very old L3 flow-forwarding platform, are not able to flip
the bit.  I think 6500 can, but I'm not certain.

re> no back-off other than that required for the link. Since
re> GbE and higher speeds are all implemented as switched fabrics,
re> the ability of the switch to manage contention is paramount.
re> You can observe this on a Solaris system by looking at the NIC
re> flow control kstats.

You're really confused, though I'm sure you're going to deny it.
Ethernet flow control mostly isn't used at all, and it is never used
to manage output queue congestion except in hardware that everyone
agrees is defective.  I almost feel like I've written all this stuff
already, even the part about ECN.

Ethernet flow control is never correctly used to signal output queue
congestion.  The ethernet signal for congestion is a dropped packet.
flow control / PAUSE frames are *not* part of some magic mesh-wide
mechanism by which switches ``manage'' congestion.  PAUSE are used,
when they're used at all, for oversubscribed backplanes: for
congestion on *input*, which in Ethernet is something you want to
avoid.  You want to switch ethernet frames to the output port where it
may or may not encounter congestion so that you don't hold up input
frames headed toward other output ports.  If you did hold them up,
you'd have something like HOL blocking.  IB takes a different
approach: you simply accept the HOL blocking, but tend to design a
mesh with little or no oversubscription unlike ethernet LAN's which
are heavily oversubscribed on their trunk ports.  so...the HOL
blocking happens, but not as much as it would with a typical Ethernet
topology, and it happens in a way that in practice probably increases
the performance of storage networks.

This is interesting for storage because when you try to shove a
128kByte write into an Ethernet fabric, part of it may get dropped in
an output queue somewhere along the way.  In IB, never will part of
the write get dropped, but sometimes you can't shove it into the
network---it just won't go, at L2.  With Ethernet you rely on TCP to
emulate this can't-shove-in condition, and it does not work perfectly
in that it can introduce huge jitter and link underuse (``incast'' problem:

 http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf

), and secondly leave many kilobytes in transit within the mesh or TCP
buffers, like tens of megabytes and milliseconds per hop, requiring
large TCP buffers on both ends to match the bandwidth*jitter and
frustrating storage QoS by queueing commands on the link instead of in
the storage device, but in exchange you get from Ethernet no HOL
blocking and the possibility of end-to-end network QoS.  It is a fair
tradeoff but arguably the wrong one for storage based on experience
with iSCSI sucking so far.

But the point is, looking at those ``flow control'' kstats will only
warn you if your switches are shit, and shit in one particular way
that even cheap switches rarely are.  The metric that's relevant is
how many packets are being dropped, and in what pattern (a big bucket
of them at once like FIFO, or a scattering like RED), and how TCP is
adapting to these drops.  For this you might look at TCP stats in
solaris, or at output queue drop and output queue size stats on
managed switches, or simply at the overall bandwidth, t

Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread BJ Quinn
Is there any way to merge them back together?  I really need the history data 
going back as far as possible, and I'd like to be able to access it from the 
same place. I mean, worst case scenario, I could rsync the contents of each 
snapshot to the new filesystem and take a snapshot for each one, but surely 
there's a better way than that?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-08 Thread Brandon High
On Tue, Jun 8, 2010 at 12:04 PM, Joe Auty  wrote:
>
>   Cool, so maybe this guy was going off of earlier information? Was there
> a time when there was no way to enable cache flushing in Virtualbox?
>

The default is to ignore cache flushes, so he was correct for the default
setting. The IgnoreFlush setting has existed since 2.0 at least.

> My mistake, yes I see pretty significant iowait times on the host... Right
> now "iostat" is showing 9.30% wait times.
>

That's not too bad, but not great. Here's from a system at work:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.99    0.00    3.98   92.54    0.50    0.00

The problem is that I/O gets bursty, so you'll have good speeds for the most
part, followed by some large waits. Small writes to the vmdk will have the
worst performance, since the 128k block has to be read and written out with
the change. Because your guest has /var on the vmdk, there are constant
small writes going to the pool.


> Do you have a recommendation for a good size to start with for the dataset
> hosting VMDKs? Half of 128K? A third?
>

There are inherent tradeoffs using smaller blocks, notably more overhead for
checksums.

zvols use an 8k volblocksize by default, which is probably a decent size.


> In general large files are better served with smaller recordsizes, whereas
> small files are better served with the 128k default?
>

Files that have random small writes in the middle of the data will have poor
performance. Things such as database files, vmdk files, etc. Other than
specific cases like what you've run into, you shouldn't ever need to adjust
the recordsize.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread Scott Meilicke
You might bring over all of your old data and snaps, then clone that into a new 
volume. Bring your recent stuff into the clone. Since the clone only updates 
blocks that are different than the underlying snap, you may see a significant 
storage savings.

Two clones could even be made - one for your live data, another to access the 
historical data.

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive showing as "removed"

2010-06-08 Thread Joe Auty




Cindy Swearingen wrote:
Joe,
  
  
Yes, the device should resilver when it's back online.
  
  
You can use the fmdump -eV command to discover when this device was
  
removed and other hardware-related events to help determine when this
  
device was removed.
  
  
I would recommend exporting (not importing) the pool before physically
  
changing the hardware. After the device is back online and the pool is
  
imported, you might need to use zpool clear to clear the pool status.
  
  


Here is the output of that command, does this reveal anything useful?
c0t7d0 is the drive that is marked as removed... I'll look into the
import and export functions to learn more about them. Thanks!

# fmdump -eV
TIME   CLASS
May 31 2010 05:33:36.363381880 ereport.fs.zfs.probe_failure
nvlist version: 0
    class = ereport.fs.zfs.probe_failure
    ena = 0x5d2206865ac00401
    detector = (embedded nvlist)
    nvlist version: 0
    version = 0x0
    scheme = zfs
    pool = 0x28ebd14a56dfe4df
    vdev = 0xdbdc49ecb5479c40
    (end detector)
  
    pool = nm
    pool_guid = 0x28ebd14a56dfe4df
    pool_context = 0
    pool_failmode = wait
    vdev_guid = 0xdbdc49ecb5479c40
    vdev_type = disk
    vdev_path = /dev/dsk/c0t7d0s0
    vdev_devid = id1,s...@n5000c5001e7cf7a7/a
    parent_guid = 0x16cbb2c1f07c5f51
    parent_type = raidz
    prev_state = 0x0
    __ttl = 0x1
    __tod = 0x4c038270 0x15a8c478



Thanks,
  
  
Cindy
  
  
On 06/08/10 11:11, Joe Auty wrote:
  
  Cindy Swearingen wrote:

Hi Joe,
  
  
The REMOVED status generally means that a device was physically removed
  
from the system.
  
  
If necessary, physically reconnect c0t7d0 or if connected, check
  
cabling, power, and so on.
  
  
If the device is physically connected, see what cfgadm says about this
  
device. For example, a device that was unconfigured from the system
  
would look like  this:
  
  
# cfgadm -al | grep c4t2d0
  
c4::dsk/c4t2d0   disk   connected   unconfigured   unknown
  
  
(Finding the right cfgadm format for your h/w is another challenge.)
  
  
I'm very cautious about other people's data so consider this issue:
  
  
If possible, you might import the pool while you are physically
  
inspecting the device or changing it physically. Depending on your
  
hardware, I've heard of device paths changing if another device is
  
reseated or changes.


Thanks Cindy!


Here is what cfgadm is showing me:


# cfgadm -al | grep c0t7d0

c0::dsk/c0t7d0   disk   connected   configured   unknown



I'll definitely start with a reseating of the drive. I'm assuming that
once Solaris thinks the drive is no longer removed it will start
leveling on its own?



Thanks,
  
  
Cindy
  
  
On 06/07/10 17:50, besson3c wrote:
  
  Hello,


I have a drive that was a part of the pool showing up as "removed". I
made no changes to the machine, and there are no errors being
displayed, which is rather weird:


# zpool status nm

  pool: nm

 state: DEGRADED

 scrub: none requested

config:


        NAME        STATE     READ WRITE CKSUM
        nm          DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  REMOVED      0     0     0



What would your advice be here? What do you think happened, and what is
the smartest way to bring this disk back up? Since there are no errors
I'm inclined to throw it back into the pool and see what happens rather
than trying to replace it straight away.

Thoughts? 



-- 
Joe Auty, NetMusician

NetMusician helps musicians, bands and artists create beautiful,
professional, custom designed, career-essential websites that are easy
to maintain and to integrate with popular social networks.

www.netmusician.org 

j...@netmusician.org 
  



-- 

Joe Auty, NetMusician
NetMusician
helps musicians, bands and artists create beautiful,
professional, custom designed, career-essential websites that are easy
to maintain and to integrate with popular social networks.
www.netmusician.org
j...@netmusician.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive showing as "removed"

2010-06-08 Thread Joe Auty




Cindy Swearingen wrote:
Hi Joe,
  
  
The REMOVED status generally means that a device was physically removed
  
from the system.
  
  
If necessary, physically reconnect c0t7d0 or if connected, check
  
cabling, power, and so on.
  
  
If the device is physically connected, see what cfgadm says about this
  
device. For example, a device that was unconfigured from the system
  
would look like  this:
  
  
# cfgadm -al | grep c4t2d0
  
c4::dsk/c4t2d0   disk   connected   unconfigured   unknown
  
  
(Finding the right cfgadm format for your h/w is another challenge.)
  
  
I'm very cautious about other people's data so consider this issue:
  
  
If possible, you might import the pool while you are physically
  
inspecting the device or changing it physically. Depending on your
  
hardware, I've heard of device paths changing if another device is
  
reseated or changes.
  
  


Thanks Cindy!

Here is what cfgadm is showing me:

# cfgadm -al | grep c0t7d0 
c0::dsk/c0t7d0   disk   connected   configured   unknown


I'll definitely start with a reseating of the drive. I'm assuming that
once Solaris thinks the drive is no longer removed it will start
leveling on its own?


Thanks,
  
  
Cindy
  
  
On 06/07/10 17:50, besson3c wrote:
  
  Hello,


I have a drive that was a part of the pool showing up as "removed". I
made no changes to the machine, and there are no errors being
displayed, which is rather weird:


# zpool status nm

  pool: nm

 state: DEGRADED

 scrub: none requested

config:


        NAME        STATE     READ WRITE CKSUM
        nm          DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  REMOVED      0     0     0



What would your advice be here? What do you think happened, and what is
the smartest way to bring this disk back up? Since there are no errors
I'm inclined to throw it back into the pool and see what happens rather
than trying to replace it straight away. 
Thoughts?
  



-- 

Joe Auty, NetMusician
NetMusician
helps musicians, bands and artists create beautiful,
professional, custom designed, career-essential websites that are easy
to maintain and to integrate with popular social networks.
www.netmusician.org
j...@netmusician.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-08 Thread Joe Auty




Brandon High wrote:

  On Tue, Jun 8, 2010 at 10:33 AM, besson3c  wrote:
  
  
On heavy reads or writes (writes seem to be more problematic) my load averages on my VM host shoot up and overall performance is bogged down. I suspect that I do need a mirrored SLOG, but I'm wondering what the best way is

  
  
The load that you're seeing is probably iowait. If that's the case,
it's almost certainly the write speed of your pool. A raidz will be
slow for your purposes, and adding a zil may help. There's been lots
of discussion in the archives about how to determine if a log device
will help, such as using zilstat or disabling the zil and testing.

You may want to set the recordsize smaller for the datasets that
contain vmdk files as well. With the default recordsize of 128k, a 4k
write by the VM host can result in 128k being read from and written to
the dataset.

What VM software are you using? There are a few knobs you can turn in
VBox which will help with slow storage. See
http://www.virtualbox.org/manual/ch12.html#id2662300 for instructions
on reducing the flush interval.

-B

  


I'd love to use Virtualbox, but right now it (3.2.2 commercial which
I'm evaluating, I haven't been able to compile OSE on the CentOS 5.5
host yet) is giving me kernel panics on the host while starting up VMs
which are obviously bothersome, so I'm exploring continuing to use
VMWare Server and seeing what I can do on the Solaris/ZFS side of
things. I've also read this on a VMWare forum, although I don't know if
this is correct? This is in context to me questioning why I don't seem to
have these same load average problems running Virtualbox:

The problem with the comparison VirtualBox
comparison is that caching is known to be broken in VirtualBox (ignores
cache flush, which, by continuing to cache, can "speed up" IO at the
expense of data integrity or loss). This could be playing in your favor
from a performance perspective, but puts your data at risk. Disabling
disk caching altogether would be a big hit on the Virtualbox side...
Neither solution is ideal. 


If this is incorrect and I can get Virtualbox working stably, I'm happy
to switch to it. It has definitely performed better prior to my panics,
and others on the internet seem to agree that it outperforms VMWare
products in general. I'm definitely not opposed to this idea.

I've actually never seen much, if any iowait (%w in iostat output,
right?). I've run the zilstat script and am happy to share that output
with you if you wouldn't mind taking a look at it? I'm not sure I'm
understanding its output correctly...

As far as the recordsizes, the evil tuning guide says this:

Depending on workloads, the current ZFS
implementation can, at times, cause much more I/O to be requested than
other page-based file systems. If the throughput flowing toward the
storage, as observed by iostat, nears the capacity of the channel
linking the storage and the host, tuning down the zfs recordsize should
improve performance. This tuning is dynamic, but only impacts new file
creations. Existing files keep their old recordsize.

Will this tuning have an impact on my existing VMDK files? Can you
kindly tell me more about this, how I can observe my current recordsize
and play around with this setting if it will help? Will adjusting ZFS
compression on my share hosting my VMDKs be of any help too?
Compression is disabled on my ZFS share where my VMDKs are hosted.

This ZFS host hosts regular data shares in addition to the VMDKs. All
user data on my VM guests that is subject to change is hosted on a ZFS
share, only the OS and basic OS applications are saved to my VMDKs.



-- 

Joe Auty, NetMusician
NetMusician
helps musicians, bands and artists create beautiful,
professional, custom designed, career-essential websites that are easy
to maintain and to integrate with popular social networks.
www.netmusician.org
j...@netmusician.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-08 Thread Joe Auty




Brandon High wrote:

  On Tue, Jun 8, 2010 at 11:27 AM, Joe Auty 
wrote:
  


things. I've also read this on a VMWare forum,
although I don't know if
this is correct? This is in context to me questioning why I don't seem to
have these same load average problems running Virtualbox:


The problem with the comparison VirtualBox
comparison is that caching is known to be broken in VirtualBox (ignores
cache flush, which, by continuing to cache, can "speed up" IO at the
expense of data integrity or loss). This could be playing in your favor
from a performance perspective, but puts your data at risk. Disabling
disk caching altogether would be a big hit on the Virtualbox side...
Neither solution is ideal. 

  
  
  
  Check the link that I posted earlier, under "Responding to guest
IDE/SATA flush requests". Setting IgnoreFlush to 0 will turn off the
extra caching.
   
  

Cool, so maybe this guy was going off of earlier information? Was there
a time when there was no way to enable cache flushing in Virtualbox?



  
  
I've actually never seen
much, if any iowait (%w in iostat output,
right?). I've run the zilstat script and am happy to share that output
with you if you wouldn't mind taking a look at it? I'm not sure I'm
understanding its output correctly...

  
  
  
  You'll see iowait on the VM, not on the zfs server.
   
  

My mistake, yes I see pretty significant iowait times on the host...
Right now "iostat" is showing 9.30% wait times.



  
   
  
Will this tuning have an
impact on my existing VMDK files? Can you
kindly tell me more about this, how I can observe my current recordsize
and play around with this setting if it will help? Will adjusting ZFS
compression on my share hosting my VMDKs be of any help too?
Compression is disabled on my ZFS share where my VMDKs are hosted.

  
  
  
  No, your existing files will keep whatever recordsize they were
created with. You can view or change the recordsize property the same
as any other zfs property. You'll have to recreate the files to
re-write them with a different recordsize. (e.g.: cp file.vmdk
file.vmdk.foo && mv file.vmdk.foo file.vmdk)
   
  
This ZFS host hosts regular
data shares in addition to the VMDKs. All
user data on my VM guests that is subject to change is hosted on a ZFS
share, only the OS and basic OS applications are saved to my VMDKs.

  
  
  
  The property is per dataset. If the vmdk files are in separate
datasets (which I recommend) you can adjust the properties or take
snapshots of each VM's data separately.
   
  
  
  


Ahhh! Yes, my VMDKs are on a separate dataset, and recordsizes are set
to 128k:

# zfs get recordsize nm/myshare
NAME        PROPERTY    VALUE    SOURCE
nm/myshare  recordsize  128K     default

Do you have a recommendation for a good size to start with for the
dataset hosting VMDKs? Half of 128K? A third?


In general large files are better served with smaller recordsizes,
whereas small files are better served with the 128k default?




-- 

Joe Auty, NetMusician
NetMusician
helps musicians, bands and artists create beautiful,
professional, custom designed, career-essential websites that are easy
to maintain and to integrate with popular social networks.
www.netmusician.org
j...@netmusician.org




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread Brandon High
On Tue, Jun 8, 2010 at 12:52 PM, BJ Quinn  wrote:
> Is there any way to merge them back together?  I really need the history data 
> going back as far as possible, and I'd like to be able to access it from the 
> same place .  I mean, worst case scenario, I could rsync the contents of each 
> snapshot to the new filesystem and take a snapshot for each one, but surely 
> there's a better way than that?

You won't be able to keep them in the same dataset. Because you
started your second server from a separate point, there's no way to
merge them.

If you'd copied only the most recent snapshot from your old server
onto the new one, you could merge them because they would have had a
snapshot in common.

You'll have to use two datasets to hold the data from your old server
and the data from your current server if you want to keep everything
exactly as it is on the sources.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] are these errors dangerous

2010-06-08 Thread Gary Mitchell
I have seen this too

I'm guessing you have SATA disks which are on an iSCSI target.
I'm also guessing you have used something like

iscsitadm create target --type raw -b /dev/dsk/c4t0d00 c4t0d0

i.e. you are not using a zfs shareiscsi property on a zfs volume but creating 
the target from  the device
cNtNdN (dsk or rdsk it doesn't seem to matter)
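
(For contrast, the zvol-backed approach referred to here looks roughly
like this; the pool name, volume name, and size are placeholders, and
the shareiscsi property only exists on older releases:)

# zfs create -V 100G tank/iscsivol
# zfs set shareiscsi=on tank/iscsivol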




You see these errors (always block 0) when the iSCSI initiator accesses the 
disks

annoying ... but the iSCSI transactions seem to be OK.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread BJ Quinn
Not exactly sure how to do what you're recommending -- are you suggesting I go 
ahead with using rsync to bring in each snapshot, but to bring it into to a 
clone of the old set of snapshots?  Is there another way to bring my recent 
stuff in to the clone?

If so, then as for the storage savings, I learned a long time ago that rsync 
--inplace --no-whole-file has the same effect - it makes sure to only touch 
blocks that changed, so in theory I ought to be able to rsync over my snapshots 
intelligently without wasting any more space than they took up to begin with.

Not sure if that's what you meant.

Thanks!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread BJ Quinn
Ugh, yeah, I've learned by now that you always want at least that one snapshot 
in common to keep the continuity in the dataset.  Wouldn't I be able to 
recreate effectively the same thing by rsync'ing over each snapshot one by one? 
 It may take a while, and I'd have to use the --inplace and --no-whole-file 
switches to ensure that I didn't overwrite anything except changed blocks when 
bringing over each snapshot (avoiding marking all blocks as changed and wasting 
all sorts of space), but shouldn't that work at least?  I'd hate to have to 
resort to the two data sets thing.
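
(A rough sketch of that per-snapshot rsync loop, assuming the old
snapshots are reachable under .zfs/snapshot and that oldpool/data and
newpool/data stand in for the real datasets; --delete keeps each pass an
exact mirror of its snapshot:)

# for snap in $(zfs list -H -t snapshot -o name -r oldpool/data | cut -d@ -f2); do
      rsync -a --delete --inplace --no-whole-file \
          /oldpool/data/.zfs/snapshot/$snap/ /newpool/data/
      zfs snapshot newpool/data@$snap
  done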
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Native ZFS for Linux

2010-06-08 Thread Anurag Agarwal
Hi Brandon,

Thanks for providing update on this.

We at KQInfotech initially started an independent port of ZFS to Linux.
When we posted our progress about the port last year, we came to know about
the work on the LLNL port. Since then we have been working to re-base our
changes on top of Brian's changes.

We are working on porting the ZPL on top of that code. Our current status is
that mount/unmount is working. Most of the directory operations and read/write
are also working. There is still a lot more development work and testing that
needs to go into this, but we are committed to making this happen, so
please stay tuned.

Regards,
Anurag.

On Tue, Jun 8, 2010 at 1:55 AM, Brandon High  wrote:

> http://www.osnews.com/story/23416/Native_ZFS_Port_for_Linux
>
> Native ZFS Port for Linux
> posted by Thom Holwerda  on Mon 7th Jun 2010 10:15 UTC, submitted by kragil
>
> Employees of Lawrence Livermore National Laboratory have ported
> Sun's/Oracle's ZFS natively to Linux. Linux already had a ZFS port in
> userspace via FUSE, since license incompatibilities between the CDDL
> and GPL prevent ZFS from becoming part of the Linux kernel. This
> project solves the licensing issue by distributing ZFS as a separate
> kernel module users will have to download and build for themselves.
> I'm assuming most of us are aware of the licensing issues when it
> comes to the CDDL and the GPL. ZFS is an awesome piece of work, but
> because of this, it was never ported to the Linux kernel - at least,
> not as part of the actual kernel. ZFS has been available as a
> userspace implementation via FUSE for a while now.
>
> Main developer Brian Behlendorf has also stated that the Lawrence
> Livermore National Laboratory has repeatedly urged Oracle to do
> something about the licensing situation so that ZFS can become a part
> of the kernel. "We have been working on this for some time now and
> have been strongly urging Sun/Oracle to make a change to the
> licensing," he explains, "I'm sorry to say we have not yet had any
> luck."
>
> There's still some major work to be done, so this is not
> production-ready code. The ZFS Posix Layer has not been implemented
> yet, therefore mounting file systems is not yet possible; direct
> database access, however, is. Supposedly, KQ Infotech is working on
> this, but it has been rather quiet around those parts for a while now.
>
> "Currently in the ZFS for Linux port the only interface available from
> user space is the zvol," the project's website reads, "The zvol allows
> you to create a virtual block device dataset in a zfs storage pool.
> While this may not immediately seem like a big deal it does open up
> some interesting possibilities."
>
> As for the ZFS FUSE implementation, Behlendorf hopes that they can
> share the same codebase. "In the long term I would love to support
> both a native in-kernel posix layer and a fuse based posix layer," he
> explains, "The way the code is structured you actually build the same
> ZFS code once in the kernel as a set of modules and a second time as a
> set of shared libraries. The in-kernel version is used by Lustre, the
> ZVOL, and will eventually be used by the native posix layer."
>
> This sounds like good news, but a lot of work still needs to be done.
> By the way, I hope I got all the details right on this one - this is
> hardly my field of expertise. Feel free to correct me.
>
> --
> Brandon High : bh...@freaks.com
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>



-- 
Anurag Agarwal
CEO, Founder
KQ Infotech, Pune
www.kqinfotech.com
9881254401
Coordinator Akshar Bharati
www.aksharbharati.org
Spreading joy through reading
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS host to host replication with AVS?

2010-06-08 Thread Moazam Raja
Hi all, I'm trying to accomplish server to server storage replication
in synchronous mode where each server is a Solaris/OpenSolaris machine
with its own local storage.

For Linux, I've been able to achieve what I want with DRBD but I'm
hoping I can find a similar solution on Solaris so that I can leverage
ZFS. It seems that solution is Sun Availability Suite (AVS)?

One of the major concerns I have is what happens when the primary
storage server fails. Will the secondary take over automatically
(using some sort of heartbeat mechanism)? Once the secondary node
takes over, can it fail-back to the primary node once the primary node
is back?

My concern is that AVS is not able to repair the primary node after it
has failed, as per the conversation in this forum:

http://discuss.joyent.com/viewtopic.php?id=19096

"AVS is essentially one-way replication. If your primary fails, your
secondary can take over as the primary but the disks remain in the
secondary state. There is no way to reverse the replication while the
secondary is acting as the primary."


Is AVS even the right solution here, or should I be looking at some
other technology?

Thanks.

-Moazam
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread Brandon High
On Tue, Jun 8, 2010 at 4:29 PM, BJ Quinn  wrote:
> Ugh, yeah, I've learned by now that you always want at least that one 
> snapshot in common to keep the continuity in the dataset.  Wouldn't I be able 
> to recreate effectively the same thing by rsync'ing over each snapshot one by 
> one?  It may take a while, and I'd have to use the --inplace and 
> --no-whole-file switches to ensure that I didn't overwrite anything except 
> changed blocks when bringing over each snapshot (avoiding marking all blocks 
> as changed and wasting all sorts of space), but shouldn't that work at least? 
>  I'd hate to have to resort to the two data sets thing.

You could do that, but the snapshot's creation property is going to be
wrong for any snaps that you create that way. Some file properties
(such as atime, mtime, and ctime) won't be preserved by the rsync
method.

If it's important to keep an exact copy of the snapshots, you'll have
to receive from each of your old hosts into different datasets, one
for each source host. If disk space is a concern, you could enable
dedup, which should do well between the last snap of the old server
and the first snap of the current server.
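
(If your build supports it, dedup is just a property on the receiving
dataset; the dataset name below is a placeholder:)

# zfs set dedup=on newpool/data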

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] combining series of snapshots

2010-06-08 Thread BJ Quinn
In my case, snapshot creation time and atime don't matter.  I think rsync can 
preserve mtime and ctime, though.  I'll have to double check that.

I'd love to enable dedup.  Trying to stay on "stable" releases of OpenSolaris 
for whatever that's worth, and I can't seem to find a link to download 2010.06. 
 :)

At any rate, thanks for the help!  I tried rsync'ing a few snapshots, and it 
doesn't look like it will take as long as I thought.  At first I feared it 
might run for weeks!
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs corruptions in pool

2010-06-08 Thread Toby Thain


On 6-Jun-10, at 7:11 AM, Thomas Maier-Komor wrote:


On 06.06.2010 08:06, devsk wrote:
I had an unclean shutdown because of a hang and suddenly my pool is  
degraded (I realized something is wrong when python dumped core a  
couple of times).


This is before I ran scrub:

  pool: mypool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h7m with 0 errors on Mon May 31 09:00:27 2010
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          c6t0d0s0  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        mypool/ROOT/May25-2010-Image-Update:<0x3041e>
        mypool/ROOT/May25-2010-Image-Update:<0x31524>
        mypool/ROOT/May25-2010-Image-Update:<0x26d24>
        mypool/ROOT/May25-2010-Image-Update:<0x37234>
        //var/pkg/download/d6/d6be0ef348e3c81f18eca38085721f6d6503af7a
        mypool/ROOT/May25-2010-Image-Update:<0x25db3>
        //var/pkg/download/cb/cbb0ff02bcdc6649da3763900363de7cff78ec72
        mypool/ROOT/May25-2010-Image-Update:<0x26cf6>


I ran a scrub and this is what it had to say afterwards.

  pool: mypool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h11m with 0 errors on Sat Jun  5 22:43:54 2010
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          c6t0d0s0  DEGRADED     0     0     0  too many errors

errors: No known data errors

A few questions:

1. Have the errors really gone away? Can I just run 'zpool clear' and be content that the errors are really gone?

2. Why did the errors occur at all if ZFS guarantees on-disk consistency? I wasn't writing anything, and those files were definitely not being touched when the hang and unclean shutdown happened.

I don't mind if a file I create or modify doesn't land on disk because an unclean shutdown happened, but a bunch of unrelated files getting corrupted is sort of painful to digest.

3. The action says "Determine if the device needs to be replaced". How the heck do I do that?



Is it possible that this system runs on VirtualBox? At least, I've seen such a thing happen on VirtualBox but never on a real machine.


As I postulated in the relevant forum thread there:
http://forums.virtualbox.org/viewtopic.php?t=13661
(can't check URL, the site seems down for me atm)



The reason why the errors have gone away might be that metadata has three copies, IIRC. So if your disk only had corruption in the metadata area, these errors can be repaired by scrubbing the pool.
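
If you want to clear the DEGRADED state and then convince yourself nothing else is wrong, something along these lines should do it (pool and device names taken from your output):

# zpool clear mypool c6t0d0s0
# zpool scrub mypool
# zpool status -v mypool

If the error counters stay at zero after another scrub or two, the transient-glitch explanation looks much more likely.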

The smartmontools might help you figure out if the disk is broken. But if you only had an unexpected shutdown and now everything is clean after a scrub, I wouldn't expect the disk to be broken. You can get the smartmontools from opencsw.org.
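
For example, once smartmontools is installed, something like this (the exact device path and any -d type option depend on your controller, so treat it as a sketch):

# smartctl -H /dev/rdsk/c6t0d0s0
# smartctl -a /dev/rdsk/c6t0d0s0

-H gives the overall health verdict; -a dumps the full attribute table and error log, which is where a genuinely failing disk usually shows up.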

If your system is really running on VirtualBox, I'd recommend that you turn off VirtualBox's disk write caching.


Specifically, stop it from ignoring cache flushes. Caching is irrelevant if flushes are being correctly handled.
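
If it really is VirtualBox, the knob in question is, as far as I recall, the per-disk IgnoreFlush extradata key, set while the VM is powered off. Something like the following, where "MyVM" is a placeholder and the piix3ide device and LUN number depend on which controller the virtual disk hangs off (check the VirtualBox manual for your setup):

# VBoxManage setextradata "MyVM" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

Setting the value to 0 makes VirtualBox honor the guest's flush requests instead of ignoring them.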


ZFS isn't the only software system that will suffer inconsistencies/corruption in the guest if flushes are ignored, of course.


--Toby



Search the OpenSolaris forum of VirtualBox. There is an article somewhere about how to do this. IIRC the subject is something like 'zfs pool corruption'. But it is also covered somewhere in the docs.

HTH,
Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-08 Thread Bob Friesenhahn

On Tue, 8 Jun 2010, Miles Nordin wrote:


"re" == Richard Elling  writes:


   re> Please don't confuse Ethernet with IP.

okay, but I'm not.  seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

You're really confused, though I'm sure you're going to deny it.


I don't think so.  I think that it is time to reset and reboot 
yourself on the technology curve.  FC semantics have been ported onto 
ethernet.  This is not your grandmother's ethernet but it is capable 
of supporting both FCoE and normal IP traffic.  The FCoE gets 
per-stream QOS similar to what you are used to from Fibre Channel. 
Quite naturally, you get to pay a lot more for the new equipment and 
you have the opportunity to discard the equipment you bought already.


Richard is not out in the weeds although there are probably plenty of 
weeds growing at the ranch.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS host to host replication with AVS?

2010-06-08 Thread David Magda

On Jun 8, 2010, at 20:17, Moazam Raja wrote:

One of the major concerns I have is what happens when the primary  
storage server fails. Will the secondary take over automatically  
(using some sort of heartbeat mechanism)? Once the secondary node  
takes over, can it fail-back to the primary node once the primary  
node is back?


My concern is that AVS is not able to repair the primary node after  
it has failed, as per the conversation in this forum:


Either the primary node OR the secondary node can have active writes  
to a volume, but NOT BOTH at the same time. Once the secondary becomes  
active, and has made changes, you have to replicate the changes back  
to the primary. Here's a good (though dated) demo of the basic  
functionality:


http://hub.opensolaris.org/bin/view/Project+avs/Demos

The reverse replication is in Part 2, but I recommend watching them in  
order for proper context. For making the secondary send data to the  
primary:



-r

Reverses the direction of the synchronization so the primary volume  
is synchronized from the secondary volume. [...]


http://docs.sun.com/app/docs/doc/819-2240/sndradm-1m
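
In practice that reverse pass is the normal sync options combined with -r, run once the original primary is healthy again and the secondary has been quiesced. A sketch, to be checked against the man page above before use:

# sndradm -n -u -r

for an update (incremental) resync in the reverse direction, or

# sndradm -n -m -r

for a full reverse copy if an update sync isn't possible. If you have more than one replication set configured, you would also name the specific set; see sndradm(1M) for the exact form.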

For detecting a node failure, and automatic fail over, you could use  
Solaris Cluster:


http://en.wikipedia.org/wiki/Solaris_Cluster
http://hub.opensolaris.org/bin/view/Community+Group+ha-clusters/
http://mail.opensolaris.org/pipermail/ha-clusters-discuss/

If you have a SAN (or iSCSI?), you can give two machines read-write access to the same LUN using something like QFS:


http://en.wikipedia.org/wiki/QFS
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-08 Thread Erik Trimble

On 6/8/2010 6:33 PM, Bob Friesenhahn wrote:

On Tue, 8 Jun 2010, Miles Nordin wrote:


"re" == Richard Elling  writes:


   re> Please don't confuse Ethernet with IP.

okay, but I'm not.  seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

You're really confused, though I'm sure you're going to deny it.


I don't think so.  I think that it is time to reset and reboot 
yourself on the technology curve.  FC semantics have been ported onto 
ethernet.  This is not your grandmother's ethernet but it is capable 
of supporting both FCoE and normal IP traffic.  The FCoE gets 
per-stream QOS similar to what you are used to from Fibre Channel. 
Quite naturally, you get to pay a lot more for the new equipment and 
you have the opportunity to discard the equipment you bought already.


Richard is not out in the weeds although there are probably plenty of 
weeds growing at the ranch.


Bob


Well, are you saying we might want to put certain folks out to pasture?



That said, I had a good look at FCoE about a year ago and, unlike ATAoE, which effectively ran over standard managed or smart switches, FCoE required specialized switch hardware that was non-trivially expensive.  Still, it did seem to be a mature protocol implementation, so it would be a viable option once the hardware price came down (and we had wider, better software implementations).


Also, FCoE really doesn't seem to play well with regular IP on the same 
link, so you really should dedicate a link (not necessarily a switch) to 
FCoE, and pipe your IP traffic via another link. It is NOT iSCSI.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss