Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-02-24 Thread Lutz Schumann
I fully agree. This needs fixing. I can think of so many situations where 
device names change in OpenSolaris (especially with movable pools). This 
problem can lead to serious data corruption. 

Besides persistent L2ARC (which is much more difficult, I would say), making the 
L2ARC also rely on labels instead of device paths is essential.

Can someone open a CR for this ??
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-02-08 Thread Daniel Carosone
On Mon, Feb 01, 2010 at 12:22:55PM -0800, Lutz Schumann wrote:
> > > Created a pool on head1 containing just the cache device (c0t0d0). 
> > 
> > This is not possible, unless there is a bug. You cannot create a pool
> > with only a cache device.  I have verified this on b131:
> > # zpool create norealpool cache /dev/ramdisk/rc1
> > invalid vdev specification: at least one toplevel vdev must be specified
> > 
> > This is also consistent with the notion that cache devices are auxiliary
> > devices and do not have pool configuration information in the label.
> 
> Sorry for the confusion ... a little misunderstanding. I created a pool 
> whose only data disk is the disk formerly used as the cache device in the 
> pool that switched. Then I exported this pool, made from just that single 
> (data) disk, and switched back. The exported pool was picked up as a cache 
> device ... this seems really problematic. 

This is exactly the scenario I was concerned about earlier in the
thread.  Thanks for confirming that it occurs.  Please verify that the
pool had autoreplace=off (just to avoid that distraction), and file a
bug.  
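(Checking and, if needed, clearing it is just the standard pool property 
commands -- the pool name below is a placeholder:)

# zpool get autoreplace tank
# zpool set autoreplace=off tank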

Cache devices should not automatically destroy disk contents based
solely on device path, especially where that device path came along
with a pool import.  Cache devices need labels to confirm their
identity. This is irrespective of whether the cache contents after the
label are persistent or volatile, i.e. it should be fixed without waiting
for the CR about persistent l2arc.

--
Dan.

pgpjdt4tg1JNp.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-02-01 Thread Lutz Schumann
> > Created a pool on head1 containing just the cache device (c0t0d0). 
> 
> This is not possible, unless there is a bug. You cannot create a pool
> with only a cache device.  I have verified this on b131:
> # zpool create norealpool cache /dev/ramdisk/rc1
> invalid vdev specification: at least one toplevel vdev must be specified
> 
> This is also consistent with the notion that cache devices are auxiliary
> devices and do not have pool configuration information in the label.

Sorry for the confusion ... a little misunderstanding. I created a pool whose 
only data disk is the disk formerly used as the cache device in the pool that 
switched. Then I exported this pool, made from just that single (data) disk, 
and switched back. The exported pool was picked up as a cache device ... this 
seems really problematic. 

Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-02-01 Thread Richard Elling
On Feb 1, 2010, at 5:53 AM, Lutz Schumann wrote:

> I tested some more and found that Pool disks are picked UP. 
> 
> Head1: Cachedevice1 (c0t0d0)
> Head2: Cachedevice2 (c0t0d0)
> Pool: Shared, c1td
> 
> I created a pool on shared storage. 
> Added the cache device on Head1. 
> Switched the pool to Head2 (export + import). 
> Created a pool on head1 containing just the cache device (c0t0d0). 

This is not possible, unless there is a bug. You cannot create a pool
with only a cache device.  I have verified this on b131:
# zpool create norealpool cache /dev/ramdisk/rc1
invalid vdev specification: at least one toplevel vdev must be specified

This is also consistent with the notion that cache devices are auxiliary
devices and do not have pool configuration information in the label.
 -- richard

> Exported the pool on Head1. 
> Switched back the pool from head2 to head1 (export + import)
> The disk c0t0d0 is picked up as cache device ... 
> 
> This practically means my exported pool was destroyed. 
> 
> In production this would have been hell.
> 
> Am I missing something here ?
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-02-01 Thread Lutz Schumann
I tested some more and found that Pool disks are picked UP. 

Head1: Cachedevice1 (c0t0d0)
Head2: Cachedevice2 (c0t0d0)
Pool: Shared, c1td

I created a pool on shared storage. 
Added the cache device on Head1. 
Switched the pool to Head2 (export + import). 
Created a pool on head1 containing just the cache device (c0t0d0). 
Exported the pool on Head1. 
Switched back the pool from head2 to head1 (export + import)
The disk c0t0d0 is picked up as cache device ... 
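(Roughly the same sequence as plain zpool commands -- pool names pool1/tmppool 
are placeholders, the shared pool lives on the c1tXd0 disks, and the cluster 
failover is shown simply as export/import:)

head1# zpool add pool1 cache c0t0d0
head1# zpool export pool1
head2# zpool import pool1
head1# zpool create tmppool c0t0d0      (former cache disk reused as the only data disk)
head1# zpool export tmppool
head2# zpool export pool1
head1# zpool import pool1
head1# zpool status pool1               (c0t0d0 shows up again as pool1's cache device)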

This practically means my exported pool was destroyed. 

In production this would have been hell.

Am I missing something here ?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-28 Thread Lutz Schumann
Yes, here it is (performance is VMware on a laptop, so sorry for that).

How did I test ? 

1) My Disks: 

LUN          ID    Device Type  Size      Volume   Mounted  Remov  Attach
c0t0d0       sd4   cdrom        No Media           no       yes    ata
c1t0d0       sd0   disk         8GB       syspool  no       no     mpt
c1t1d0       sd1   disk         20GB      data     no       no     mpt
c1t2d0       sd2   disk         20GB      data     no       no     mpt
c1t3d0       sd3   disk         20GB      data     no       no     mpt
c1t4d0       sd8   disk         4GB                no       no     mpt
syspo~/swap        zvol         768.0MB   syspool  no       no

2) My Pools:
  
volume: data
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
data        ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    c1t1d0  ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0

errors: No known data errors

volume: syspool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
syspool     ONLINE       0     0     0
  c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

3) Add the cache device to syspool:
zpool add -f syspool cache c1t4d0s2


r...@nexenta:/volumes# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
data        ONLINE       0     0     0
  raidz1    ONLINE       0     0     0
    c1t1d0  ONLINE       0     0     0
    c1t2d0  ONLINE       0     0     0
    c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
 scrub: none requested
config:

NAME        STATE     READ WRITE CKSUM
syspool     ONLINE       0     0     0
  c1t0d0s0  ONLINE       0     0     0
cache
  c1t4d0s2  ONLINE       0     0     0

errors: No known data errors

4) Do I/O on the data volume and watch with "zpool iostat" whether the l2arc 
gets filled: 

cmd: 
cd /volumes/data
iozone -s 1G -i 0 -i 1 (for I/O) 

Typically looks like this: 

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.47G  58.0G      0    131      0  9.47M
  raidz1    1.47G  58.0G      0    131      0  9.47M
    c1t1d0      -      -      0    100      0  8.45M
    c1t2d0      -      -      0     77      0  4.74M
    c1t3d0      -      -      0     77      0  5.48M
----------  -----  -----  -----  -----  -----  -----
syspool     1.87G  6.06G      2      0  23.8K      0
  c1t0d0s0  1.87G  6.06G      2      0  23.7K      0
cache           -      -      -      -      -      -
  c1t4d0s2  95.9M  3.89G      0      0      0   127K
----------  -----  -----  -----  -----  -----  -----
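(The per-vdev breakdown above and below is presumably from the verbose form run 
with an interval, i.e. something like:)

# zpool iostat -v 5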

5) Do the same I/O on the syspool: 

cd /volumes
iozone -s 1G -i 0 -i 1 (for I/O)

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         407K  59.5G      0      0      0      0
  raidz1     407K  59.5G      0      0      0      0
    c1t1d0      -      -      0      0      0      0
    c1t2d0      -      -      0      0      0      0
    c1t3d0      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
syspool     2.35G  5.59G      0    167  6.25K  14.2M
  c1t0d0s0  2.35G  5.59G      0    167  6.25K  14.2M
cache           -      -      -      -      -      -
  c1t4d0s2   406M  3.59G      0     80      0  9.59M
----------  -----  -----  -----  -----  -----  -----


6) You can see that the l2arc in syspool is used only when I/O goes to the 
syspool itself. 

Release is build 104 with some patches.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-28 Thread Richard Elling
On Jan 28, 2010, at 10:54 AM, Lutz Schumann wrote:

> Actuall I tested this. 
> 
> If I add a l2arc device to the syspool it is not used when issuing I/O to 
> the data pool (note: on the root pool it must not be a whole disk, but only a 
> slice of it; otherwise ZFS complains that root disks may not contain an EFI 
> label). 

In my tests it does work. Can you share your test plan?
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-28 Thread Lutz Schumann
Actuall I tested this. 

If I add a l2arc device to the syspool it is not used when issuing I/O to the 
data pool (note: on the root pool it must not be a whole disk, but only a slice 
of it; otherwise ZFS complains that root disks may not contain an EFI label). 
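(Concretely -- device names as in my test setup, the error text paraphrased 
from memory:)

# zpool add syspool cache c1t4d0        (rejected on a root pool: the whole disk would get an EFI label)
# zpool add -f syspool cache c1t4d0s2   (accepted: a slice)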

So this does not work - unfortunately :(

Just for Info. 
Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-24 Thread Lutz Schumann
Thanks for the feedback Richard. 

Does that mean that the L2ARC can be part of ANY pool and that there is only 
ONE L2ARC for all pools active on the machine ? 

Thesis:

  - There is one L2ARC on the machine for all pools
  - all Pools active share the same L2ARC
  - the L2ARC can be part of any pool, also the root (syspool) pool 

If this true, the solution would be like this: 

a) Add L2ARC to the syspool 

or

b) Add another two (standby) L2ARC devices in the head that are used in case of 
a failover. (Thus a configuration that accepts degraded performance after a 
failover has to live with this "corrupt data" effect).  

True ?
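(For (a), at the zpool level that would presumably be as simple as the command 
below -- untested here, subject to the NexentaStor limitation Richard mentioned, 
and on a root pool the cache device apparently has to be a slice rather than a 
whole disk:)

# zpool add syspool cache c0t2d0s0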
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-23 Thread Richard Elling
AIUI, this works as designed. 

I think the best practice will be to add the L2ARC to syspool (nee rpool).
However, for current NexentaStor releases, you cannot add cache devices
to syspool.

Earlier I mentioned that this made me nervous.  I no longer hold any 
reservation against it. It should work just fine as-is.
 -- richard


On Jan 23, 2010, at 9:53 AM, Lutz Schumann wrote:

> Hi, 
> 
> i found some time and was able to test again.
> 
> - verify with unique uid of the device 
> - verify with autoreplace = off
> 
> Indeed autoreplace was set to yes for the pools. So I disabled the 
> autoreplace. 
> 
> VOL     PROPERTY     VALUE   SOURCE
> nxvol2  autoreplace  off     default
> 
> Erased the labels on the cache disk and added it again to the pool. Now both 
> cache disks have different GUIDs: 
> 
> # cache device in node1
> r...@nex1:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0
> 
> LABEL 0
> 
>version=14
>state=4
>guid=15970804704220025940
> 
> # cache device in node2
> r...@nex2:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0
> 
> LABEL 0
> 
>version=14
>state=4
>guid=2866316542752696853
> 
> GUIDs are different. 
> 
> However after switching the pool nxvol2 to node1 (where nxvol1 was active), 
> the disk was picked up as a cache device: 
> 
> # nxvol2 switched to this node ... 
> volume: nxvol2
> state: ONLINE
> scrub: none requested
> config:
> 
>NAME STATE READ WRITE CKSUM
>nxvol2   ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t10d0  ONLINE   0 0 0
>c4t13d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t9d0   ONLINE   0 0 0
>c4t12d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t8d0   ONLINE   0 0 0
>c4t11d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t18d0  ONLINE   0 0 0
>c4t22d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t17d0  ONLINE   0 0 0
>c4t21d0  ONLINE   0 0 0
>cache
>  c0t2d0 FAULTED  0 0 0  corrupted data
> 
> # nxvol1 was active here before ...
> n...@nex1:/$ show volume nxvol1 status
> volume: nxvol1
> state: ONLINE
> scrub: none requested
> config:
> 
>NAME STATE READ WRITE CKSUM
>nxvol1   ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t15d0  ONLINE   0 0 0
>c4t18d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t14d0  ONLINE   0 0 0
>c4t17d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t13d0  ONLINE   0 0 0
>c4t16d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t12d0  ONLINE   0 0 0
>c4t15d0  ONLINE   0 0 0
>  mirror ONLINE   0 0 0
>c3t11d0  ONLINE   0 0 0
>c4t14d0  ONLINE   0 0 0
>cache
>  c0t2d0 ONLINE  0 0 0  
> 
> So this is true with and without autoreplace, and with different GUIDs on the 
> devices.
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-23 Thread Lutz Schumann
Hi, 

i found some time and was able to test again.

 - verify with unique uid of the device 
 - verify with autoreplace = off

Indeed autoreplace was set to yes for the pools. So I disabled the autoreplace. 

VOL     PROPERTY     VALUE   SOURCE
nxvol2  autoreplace  off     default

Erased the labels on the cache disk and added it again to the pool. Now both 
cache disks have different GUIDs: 

# cache device in node1
r...@nex1:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0

LABEL 0

version=14
state=4
guid=15970804704220025940

# cache device in node2
r...@nex2:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0

LABEL 0

version=14
state=4
guid=2866316542752696853

GUIDs are different. 

However after switching the pool nxvol2 to node1 (where nxvol1 was active), the 
disk was picked up as a cache device: 

# nxvol2 switched to this node ... 
volume: nxvol2
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
nxvol2   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t10d0  ONLINE   0 0 0
c4t13d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t9d0   ONLINE   0 0 0
c4t12d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t8d0   ONLINE   0 0 0
c4t11d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t18d0  ONLINE   0 0 0
c4t22d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t17d0  ONLINE   0 0 0
c4t21d0  ONLINE   0 0 0
cache
  c0t2d0 FAULTED  0 0 0  corrupted data

# nxvol1 was active here before ...
n...@nex1:/$ show volume nxvol1 status
volume: nxvol1
 state: ONLINE
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
nxvol1   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t15d0  ONLINE   0 0 0
c4t18d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t14d0  ONLINE   0 0 0
c4t17d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t13d0  ONLINE   0 0 0
c4t16d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t12d0  ONLINE   0 0 0
c4t15d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c3t11d0  ONLINE   0 0 0
c4t14d0  ONLINE   0 0 0
cache
  c0t2d0 ONLINE  0 0 0  

So this is true with and without autoreplace, and with different GUIDs on the 
devices.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-21 Thread Daniel Carosone
On Thu, Jan 21, 2010 at 05:52:57PM -0800, Richard Elling wrote:
> I agree with this, except for the fact that the most common installers
> (LiveCD, Nexenta, etc.) use the whole disk for rpool[1]. 

Er, no. You certainly get the option of "whole disk" or "make
partitions", at least with the opensolaris livecd.

--
Dan.




pgpBWoV2Vz5kt.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-21 Thread Richard Elling
On Jan 21, 2010, at 4:32 PM, Daniel Carosone wrote:
>> I propose a best practice of adding the cache device to rpool and be 
>> happy.
> 
> It is *still* not that simple.  Forget my slow disks caching an even
> slower pool (which is still fast enough for my needs, thanks to the
> cache and zil).
> 
> Consider a server config thus:
> - two MLC SSDs (x25-M, OCZ Vertex, whatever)
> - SSDs partitioned in two, mirrored rpool & 2x l2arc
> - a bunch of disks for a data pool
> 
> This is a likely/common configuration, commodity systems being limited
> mostly by number of sata ports.  I'd even go so far as to propose it
> as another best practice, for those circumstances.

> Now, why would I waste l2arc space, bandwidth, and wear cycles to
> cache rpool to the same ssd's that would be read on a miss anyway?  
> 
> So, there's at least one more step required for happiness:
> # zfs set secondarycache=none rpool
> 
> (plus relying on property inheritance through the rest of rpool)

I agree with this, except for the fact that the most common installers
(LiveCD, Nexenta, etc.) use the whole disk for rpool[1].  So the likely
and common configuration today is moving towards one whole
root disk.  That could change in the future.

[1] Solaris 10?  well... since installation is hard anyway, might as well do this.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-21 Thread Daniel Carosone
On Thu, Jan 21, 2010 at 03:33:28PM -0800, Richard Elling wrote:
> [Richard makes a hobby of confusing Dan :-)]

Heh.

> > Lutz, is the pool autoreplace property on?  If so, "god help us all"
> > is no longer quite so necessary.
> 
> I think this is a different issue.

I agree. For me, it was the main issue, and I still want clarity on
it.  However, at this point I'll go back to the start of the thread
and look at what was actually reported again in more detail.  

> But since the label in a cache device does
> not associate it with a pool, it is possible that any pool which expects a
> cache will find it.  This seems to be as designed.

Hm. My recollection was that node b's disk in that controller slot was
totally unlabelled, but perhaps I'm misremembering.. as above.

> > For example, if I have a pool of very slow disks (usb or remote
> > iscsi), and a pool of faster disks, and l2arc for the slow pool on the
> > same faster disks, it's pointless having the faster pool using l2arc
> > on the same disks or even the same type of disks.  I'd need to set the
> > secondarycache properties of one pool according to the configuration
> > of another. 
> 
> Don't use slow devices for L2ARC.

Slow is entirely relative, as we discussed here just recently.  They
just need to be faster than the pool devices I want to cache.  The
wrinkle here is that it's now clear they should be faster than the
devices in all other pools as well (or I need to take special
measures).

Faster is better regardless, and suitable l2arc ssd's are "cheap
enough" now.  It's mostly academic that, previously, faster/local hard
disks were "fast enough", since now you can have both.

> Secondarycache is a dataset property, not a pool property.  You can
> definitely manage the primary and secondary cache policies for each
> dataset.

Yeah, properties of the root fs and of the pool are easily conflated.

> >> such devices. But perhaps we can live with the oddity for a while?
> > 
> > This part, I expect, will be resolved or clarified as part of the
> > l2arc persistence work, since then their attachment to specific pools
> > will need to be clear and explicit.
> 
> Since the ARC is shared amongst all pools, it makes sense to share
> L2ARC amongst all pools.

Of course it does - apart from the wrinkles we now know we need to
watch out for.

> > Perhaps the answer is that the cache devices become their own pool
> > (since they're going to need filesystem-like structured storage
> > anyway). The actual cache could be a zvol (or new object type) within
> > that pool, and then (if necessary) an association is made between
> > normal pools and the cache (especially if I have multiple of them).
> > No new top-level commands needed. 
> 
> I propose a best practice of adding the cache device to rpool and be 
> happy.

It is *still* not that simple.  Forget my slow disks caching an even
slower pool (which is still fast enough for my needs, thanks to the
cache and zil).

Consider a server config thus:
 - two MLC SSDs (x25-M, OCZ Vertex, whatever)
 - SSDs partitioned in two, mirrored rpool & 2x l2arc
 - a bunch of disks for a data pool

This is a likely/common configuration, commodity systems being limited
mostly by number of sata ports.  I'd even go so far as to propose it
as another best practice, for those circumstances.

Now, why would I waste l2arc space, bandwidth, and wear cycles to
cache rpool to the same ssd's that would be read on a miss anyway?  

So, there's at least one more step required for happiness:
 # zfs set secondarycache=none rpool

(plus relying on property inheritance through the rest of rpool)
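(In zpool terms, that layout plus the extra step would look roughly like this -- 
device names and slice numbers are made up, with s0 holding the mirrored rpool 
halves and s1 the two l2arc partitions:)

# zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
# zpool add tank cache c1t0d0s1 c1t1d0s1
# zfs set secondarycache=none rpool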

--
Dan.



pgph2OAJgbY6C.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-21 Thread Richard Elling
[Richard makes a hobby of confusing Dan :-)]
more below..

On Jan 21, 2010, at 1:13 PM, Daniel Carosone wrote:

> On Thu, Jan 21, 2010 at 09:36:06AM -0800, Richard Elling wrote:
>> On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
>> 
>>> On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
 Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
 googling and the source that L2ARC devices are considered auxiliary,
 in the same category as spares. If so, then it is perfectly reasonable to
 expect that it gets picked up regardless of the GUID. This also implies
 that it is shareable between pools until assigned. Brief testing confirms
 this behaviour.  I learn something new every day :-)
 
 So, I suspect Lutz sees a race when both pools are imported onto one
 node.  This still makes me nervous though...
>>> 
>>> Yes. What if device reconfiguration renumbers my controllers, will
>>> l2arc suddenly start trashing a data disk?  The same problem used to
>>> be a risk for swap,  but less so now that we swap to named zvol. 
>> 
>> This will not happen unless the labels are rewritten on your data disk, 
>> and if that occurs, all bets are off.
> 
> It occurred to me later yesterday, while offline, that the pool in
> question might have autoreplace=on set.  If that were true, it would
> explain why a disk in the same controller slot was overwritten and
> used.
> 
> Lutz, is the pool autoreplace property on?  If so, "god help us all"
> is no longer quite so necessary.

I think this is a different issue. But since the label in a cache device does
not associate it with a pool, it is possible that any pool which expects a
cache will find it.  This seems to be as designed.

>>> There's work afoot to make l2arc persistent across reboot, which
>>> implies some organised storage structure on the device.  Fixing this
>>> shouldn't wait for that.
>> 
>> Upon further review, the ruling on the field is confirmed ;-)  The L2ARC
>> is shared amongst pools just like the ARC. What is important is that at
>> least one pool has a cache vdev. 
> 
> Wait, huh?  That's a totally separate issue from what I understood
> from the discussion.  What I was worried about was that disk Y, that
> happened to have the same cLtMdN address as disk X on another node,
> was overwritten and trashed on import to become l2arc.  
> 
> Maybe I missed some other detail in the thread and reached the wrong
> conclusion? 
> 
>> As such, for Lutz's configuration, I am now less nervous. If I understand
>> correctly, you could add the cache vdev to rpool and forget about how
>> it works with the shared pools.
> 
> The fact that l2arc devices could be caching data from any pool in the
> system is .. a whole different set of (mostly performance) wrinkles.
> 
> For example, if I have a pool of very slow disks (usb or remote
> iscsi), and a pool of faster disks, and l2arc for the slow pool on the
> same faster disks, it's pointless having the faster pool using l2arc
> on the same disks or even the same type of disks.  I'd need to set the
> secondarycache properties of one pool according to the configuration
> of another. 

Don't use slow devices for L2ARC.

Secondarycache is a dataset property, not a pool property.  You can
definitely manage the primary and secondary cache policies for each
dataset.
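For example, to keep one dataset's data out of both cache tiers (dataset name 
made up; valid values for both properties are all, none, and metadata):

# zfs set primarycache=metadata tank/scratch
# zfs set secondarycache=none tank/scratch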

>> I suppose one could make the case
>> that a new command is needed in addition to zpool and zfs (!) to manage
>> such devices. But perhaps we can live with the oddity for a while?
> 
> This part, I expect, will be resolved or clarified as part of the
> l2arc persistence work, since then their attachment to specific pools
> will need to be clear and explicit.

Since the ARC is shared amongst all pools, it makes sense to share
L2ARC amongst all pools.

> Perhaps the answer is that the cache devices become their own pool
> (since they're going to need filesystem-like structured storage
> anyway). The actual cache could be a zvol (or new object type) within
> that pool, and then (if necessary) an association is made between
> normal pools and the cache (especially if I have multiple of them).
> No new top-level commands needed. 

I propose a best practice of adding the cache device to rpool and be 
happy.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-21 Thread Daniel Carosone
On Thu, Jan 21, 2010 at 09:36:06AM -0800, Richard Elling wrote:
> On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
> 
> > On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
> >> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
> >> googling and the source that L2ARC devices are considered auxiliary,
> >> in the same category as spares. If so, then it is perfectly reasonable to
> >> expect that it gets picked up regardless of the GUID. This also implies
> >> that it is shareable between pools until assigned. Brief testing confirms
> >> this behaviour.  I learn something new every day :-)
> >> 
> >> So, I suspect Lutz sees a race when both pools are imported onto one
> >> node.  This still makes me nervous though...
> > 
> > Yes. What if device reconfiguration renumbers my controllers, will
> > l2arc suddenly start trashing a data disk?  The same problem used to
> > be a risk for swap,  but less so now that we swap to named zvol. 
> 
> This will not happen unless the labels are rewritten on your data disk, 
> and if that occurs, all bets are off.

It occurred to me later yesterday, while offline, that the pool in
question might have autoreplace=on set.  If that were true, it would
explain why a disk in the same controller slot was overwritten and
used.

Lutz, is the pool autoreplace property on?  If so, "god help us all"
is no longer quite so necessary.

> > There's work afoot to make l2arc persistent across reboot, which
> > implies some organised storage structure on the device.  Fixing this
> > shouldn't wait for that.
> 
> Upon further review, the ruling on the field is confirmed ;-)  The L2ARC
> is shared amongst pools just like the ARC. What is important is that at
> least one pool has a cache vdev. 

Wait, huh?  That's a totally separate issue from what I understood
from the discussion.  What I was worried about was that disk Y, that
happened to have the same cLtMdN address as disk X on another node,
was overwritten and trashed on import to become l2arc.  

Maybe I missed some other detail in the thread and reached the wrong
conclusion? 

> As such, for Lutz's configuration, I am now less nervous. If I understand
> correctly, you could add the cache vdev to rpool and forget about how
> it works with the shared pools.

The fact that l2arc devices could be caching data from any pool in the
system is .. a whole different set of (mostly performance) wrinkles.

For example, if I have a pool of very slow disks (usb or remote
iscsi), and a pool of faster disks, and l2arc for the slow pool on the
same faster disks, it's pointless having the faster pool using l2arc
on the same disks or even the same type of disks.  I'd need to set the
secondarycache properties of one pool according to the configuration
of another. 

> I suppose one could make the case
> that a new command is needed in addition to zpool and zfs (!) to manage
> such devices. But perhaps we can live with the oddity for a while?

This part, I expect, will be resolved or clarified as part of the
l2arc persistence work, since then their attachment to specific pools
will need to be clear and explicit.

Perhaps the answer is that the cache devices become their own pool
(since they're going to need filesystem-like structured storage
anyway). The actual cache could be a zvol (or new object type) within
that pool, and then (if necessary) an association is made between
normal pools and the cache (especially if I have multiple of them).
No new top-level commands needed. 

--
Dan.


pgp0MK26F4Jvy.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-21 Thread Richard Elling
On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:

> On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
>> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
>> googling and the source that L2ARC devices are considered auxiliary,
>> in the same category as spares. If so, then it is perfectly reasonable to
>> expect that it gets picked up regardless of the GUID. This also implies
>> that it is shareable between pools until assigned. Brief testing confirms
>> this behaviour.  I learn something new every day :-)
>> 
>> So, I suspect Lutz sees a race when both pools are imported onto one
>> node.  This still makes me nervous though...
> 
> Yes. What if device reconfiguration renumbers my controllers, will
> l2arc suddenly start trashing a data disk?  The same problem used to
> be a risk for swap,  but less so now that we swap to named zvol. 

This will not happen unless the labels are rewritten on your data disk, 
and if that occurs, all bets are off.

> There's work afoot to make l2arc persistent across reboot, which
> implies some organised storage structure on the device.  Fixing this
> shouldn't wait for that.

Upon further review, the ruling on the field is confirmed ;-)  The L2ARC
is shared amongst pools just like the ARC. What is important is that at
least one pool has a cache vdev. I suppose one could make the case
that a new command is needed in addition to zpool and zfs (!) to manage
such devices. But perhaps we can live with the oddity for a while?

As such, for Lutz's configuration, I am now less nervous. If I understand
correctly, you could add the cache vdev to rpool and forget about how
it works with the shared pools.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-20 Thread Daniel Carosone
On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
> googling and the source that L2ARC devices are considered auxiliary,
> in the same category as spares. If so, then it is perfectly reasonable to
> expect that it gets picked up regardless of the GUID. This also implies
> that it is shareable between pools until assigned. Brief testing confirms
> this behaviour.  I learn something new every day :-)
> 
> So, I suspect Lutz sees a race when both pools are imported onto one
> node.  This still makes me nervous though...

Yes. What if device reconfiguration renumbers my controllers, will
l2arc suddenly start trashing a data disk?  The same problem used to
be a risk for swap,  but less so now that we swap to named zvol. 

There's work afoot to make l2arc persistent across reboot, which
implies some organised storage structure on the device.  Fixing this
shouldn't wait for that.

--
Dan.

pgp1Mb4Zg7Mxp.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-20 Thread Richard Elling
Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
googling and the source that L2ARC devices are considered auxiliary,
in the same category as spares. If so, then it is perfectly reasonable to
expect that it gets picked up regardless of the GUID. This also implies
that it is shareable between pools until assigned. Brief testing confirms
this behaviour.  I learn something new every day :-)

So, I suspect Lutz sees a race when both pools are imported onto one
node.  This still makes me nervous though...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-20 Thread Tomas Ögren
On 20 January, 2010 - Richard Elling sent me these 2,7K bytes:

> Hi Lutz,
> 
> On Jan 20, 2010, at 3:17 AM, Lutz Schumann wrote:
> 
> > Hello, 
> > 
> > we tested clustering with ZFS and the setup looks like this: 
> > 
> > - 2 head nodes (nodea, nodeb)
> > - head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc)
> 
> This makes me nervous. I suspect this is not in the typical QA 
> test plan.
> 
> > - two external jbods
> > - two mirror zpools (pool1,pool2)
> >   - each mirror is a mirror of one disk from each jbod
> > - no ZIL (does anyone know a well-priced SAS SSD ?)
> > 
> > We want active/active and added the l2arc to the pools. 
> > 
> > - pool1 has nodea_l2arc as cache
> > - pool2 has nodeb_l2arc as cache
> > 
> > Everything is great so far. 
> > 
> > One thing to note is that the nodea_l2arc and nodeb_l2arc are named equally! 
> > (c0t2d0 on both nodes).
> > 
> > What we found is that during tests, the pool just picked up the device 
> > nodeb_l2arc automatically, although it was never explicitly added to the 
> > pool pool1.
> 
> This is strange. Each vdev is supposed to be uniquely identified by its GUID.
> This is how ZFS can identify the proper configuration when two pools have 
> the same name. Can you check the GUIDs (using zdb) to see if there is a
> collision?

Reproducable:

itchy:/tmp/blah# mkfile 64m 64m disk1
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# zpool create blah /tmp/blah/disk1 
itchy:/tmp/blah# zpool add blah cache /dev/zvol/dsk/rpool/blahcache 
itchy:/tmp/blah# zpool status blah
  pool: blah
 state: ONLINE
 scrub: none requested
config:

NAME                             STATE     READ WRITE CKSUM
blah                             ONLINE       0     0     0
  /tmp/blah/disk1                ONLINE       0     0     0
cache
  /dev/zvol/dsk/rpool/blahcache  ONLINE       0     0     0

errors: No known data errors
itchy:/tmp/blah# zpool export blah
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache 

LABEL 0

version=15
state=4
guid=6931317478877305718

itchy:/tmp/blah# zfs destroy rpool/blahcache
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# dd if=/dev/zero of=/dev/zvol/dsk/rpool/blahcache bs=1024k count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.559299 seconds, 120 MB/s
itchy:/tmp/blah# zpool import -d /tmp/blah
  pool: blah
id: 16691059548146709374
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

blah                             ONLINE
  /tmp/blah/disk1                ONLINE
cache
  /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache

LABEL 0


LABEL 1


LABEL 2


LABEL 3

itchy:/tmp/blah# zpool import -d /tmp/blah blah
itchy:/tmp/blah# zpool status
  pool: blah
 state: ONLINE
 scrub: none requested
config:

NAME                             STATE     READ WRITE CKSUM
blah                             ONLINE       0     0     0
  /tmp/blah/disk1                ONLINE       0     0     0
cache
  /dev/zvol/dsk/rpool/blahcache  ONLINE       0     0     0

errors: No known data errors
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache

LABEL 0

version=15
state=4
guid=6931317478877305718
...


It did indeed overwrite my formerly clean blahcache.

Smells like a serious bug.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se

>  -- richard
> 
> > We had a setup stage when pool1 was configured on nodea with nodea_l2arc 
> > and pool2 was configured on nodeb without a l2arc. Then we did a failover. 
> > Then pool1 picked up the (until then) unconfigured nodeb_l2arc. 
> > 
> > Is this intended ? Why is a L2ARC device automatically picked up if the 
> > device name is the same ? 
> > 
> > In a later stage we had both pools configured with the corresponding l2arc 
> > device. (po...@nodea with nodea_l2arc and po...@nodeb with nodeb_l2arc). 
> > Then we also did a failover. The l2arc device of the pool failing over was 
> > marked as "too many corruptions" instead of "missing". 
> > 
> > So from these tests it looks like ZFS just picks up the device with the same 
> > name and replaces the l2arc without looking at the device signatures to 
> > only consider devices being part of a pool.

Re: [zfs-discuss] L2ARC in Cluster is picked up although not part of the pool

2010-01-20 Thread Richard Elling
Hi Lutz,

On Jan 20, 2010, at 3:17 AM, Lutz Schumann wrote:

> Hello, 
> 
> we tested clustering with ZFS and the setup looks like this: 
> 
> - 2 head nodes (nodea, nodeb)
> - head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc)

This makes me nervous. I suspect this is not in the typical QA 
test plan.

> - two external jbods
> - two mirror zpools (pool1,pool2)
>   - each mirror is a mirror of one disk from each jbod
> - no ZIL (does anyone know a well-priced SAS SSD ?)
> 
> We want active/active and added the l2arc to the pools. 
> 
> - pool1 has nodea_l2arc as cache
> - pool2 has nodeb_l2arc as cache
> 
> Everything is great so far. 
> 
> One thing to note is that the nodea_l2arc and nodeb_l2arc are named equally! 
> (c0t2d0 on both nodes).
> 
> What we found is that during tests, the pool just picked up the device 
> nodeb_l2arc automatically, although it was never explicitly added to the pool 
> pool1.

This is strange. Each vdev is supposed to be uniquely identified by its GUID.
This is how ZFS can identify the proper configuration when two pools have 
the same name. Can you check the GUIDs (using zdb) to see if there is a
collision?
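For example, something along these lines on each head (device path as in your 
description) should show each cache device's guid from its label:

# zdb -l /dev/rdsk/c0t2d0s0 | grep guid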
 -- richard

> We had a setup stage when pool1 was configured on nodea with nodea_l2arc and 
> pool2 was configured on nodeb without a l2arc. Then we did a failover. Then 
> pool1 picked up the (until then) unconfigured nodeb_l2arc. 
> 
> Is this intended ? Why is a L2ARC device automatically picked up if the 
> device name is the same ? 
> 
> In a later stage we had both pools configured with the corresponding l2arc 
> device. (po...@nodea with nodea_l2arc and po...@nodeb with nodeb_l2arc). Then 
> we also did a failover. The l2arc device of the pool failing over was marked 
> as "too many corruptions" instead of "missing". 
> 
> So from these tests it looks like ZFS just picks up the device with the same 
> name and replaces the l2arc without looking at the device signatures to only 
> consider devices being part of a pool.
> 
> We have not tested with a data disk as "c0t2d0" but if the same behaviour 
> occurs - god save us all.
> 
> Can someone clarify the logic behind this ? 
> 
> Can someone also give a hint on how to rename SAS disk devices in OpenSolaris? 
> (As a workaround I would like to rename c0t2d0 on nodea (nodea_l2arc) to 
> c0t24d0 and c0t2d0 on nodeb (nodeb_l2arc) to c0t48d0). 
> 
> P.s. Release is build 104 (NexentaCore 2). 
> 
> Thanks!
> -- 
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss