Re: [zfs-discuss] Sun Storage 7320 system

2011-05-26 Thread Giovanni Tirloni
On Thu, May 26, 2011 at 8:13 AM, Gustav  wrote:

> Hi All,
>
> Can someone please give us some advice on the following?
>
> We are installing a 7320 with two 18 GB write accelerators, 20 x 1 TB disks
> and 96 GB of RAM.
>
> Postgres will be running on an Oracle X6270 server with 96 GB of RAM
> installed and two quad-core CPUs, with a local WAL on 4 hard drives and
> a 7320 LUN via 8 Gb FC.
>
> I am going to configure the 7320 as Mirrored, with the following options
> available to me (read and write cache enabled):
> Double parity RAID
> Mirrored
> Single parity RAID, narrow stripes
> Striped
> Triple mirrored
>
> What do the above options mean in real terms,
> what is optimal for a database with heavy writes (PostgreSQL 9, or any
> performance tips for Postgres on a 7320),
> and are there any comments that can help us improve performance?
>


Hello,

 I would advise against creating many small pools of disks unless you have
very different capacity/performance/reliability requirements. Even then, try
to limit the number of pools as much as you can. I know newer firmware
allowed you to create two pools, and I believe that limitation has since been
removed, but I still advise against it. Check the documentation in the "Help"
link within the appliance; it's usually very detailed.

 Although the 7320 is an appliance and comes with its own documentation, I
think you'd benefit from reading the ZFS Best Practices Guide (
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide).

 You're probably looking for maximum performance with availability, so that
narrows it down to a mirrored pool, unless your PostgreSQL workload is so
specific that RAID-Z would fit; even then, beware of the performance hit.
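
For reference, on a plain Solaris/ZFS box the "Mirrored" profile corresponds
roughly to a layout like the one below (device names are placeholders; on the
7320 itself you simply pick the Mirrored profile in the management interface
rather than run zpool by hand):

 # zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 \
     mirror c0t4d0 c0t5d0
 # zpool add tank log mirror c0t6d0 c0t7d0    # the 18 GB write accelerators

Each extra "mirror" group adds another two-disk vdev, and writes are striped
across all of them, which is what gives mirrors their IOPS advantage for a
write-heavy database.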

Regards,

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reboots when importing old rpool

2011-05-17 Thread Giovanni Tirloni
On Tue, May 17, 2011 at 6:38 PM, Brandon High  wrote:
> On Tue, May 17, 2011 at 11:10 AM, Hung-ShengTsao (Lao Tsao) Ph.D.
>  wrote:
>>
>> may be do
>> zpool import  -R /a rpool
>
> 'zpool import -N' may work as well.

It looks like a crash dump is in order. The system shouldn't panic
just because it can't import a pool.

Try booting with the kernel debugger enabled (add "-kv" to the GRUB kernel
line), and take a look at dumpadm(1M) to make sure crash dumps are captured;
a sketch is below.
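
A sketch of what that looks like, assuming the stock OpenSolaris GRUB entry
(yours may differ):

 kernel$ /platform/i86pc/kernel/$ISADIR/unix -kv -B $ZFS-BOOTFS

and, before reproducing the panic, confirm the dump device and savecore
settings with:

 # dumpadm

After the panic you should end up with a crash dump under the savecore
directory, which is what's needed to see why the import panics the box.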

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS File / Folder Management

2011-05-17 Thread Giovanni Tirloni
On Mon, May 16, 2011 at 8:54 PM, MasterCATZ  wrote:
>
> Is it possible for ZFS to keep data that belongs together on the same pool
>
> (or would these questions be more related to RAID-Z?)
>
> That way, if there is a failure, only the data on the pool that failed needs
> to be replaced.
> (Or, if one pool failed, does that mean all the other pools fail as well,
> without a way to recover data?)
>
> I am wanting to be able to expand my array over time by adding either 4- or
> 8-HDD pools.
>
> Most of the data will probably never be deleted,
>
> but say I have 1 GB remaining on the first pool and I add an 8 GB file;
> does this mean the data will then be put onto pool 1 and pool 2?
> (1 GB on pool 1, 7 GB on pool 2)
>
> Or would ZFS be able to put it onto the 2nd pool instead of splitting it?
>
> The other scenario would be folder structure: would ZFS be able to
> understand that data contained in a folder tree belongs together and store
> it on a dedicated pool?
>
> If so it would be great, or else you would be spending forever restoring
> data from backup if something does go wrong.
>
> Sorry if this goes in the wrong spot; I could not find
> » OpenSolaris Forums » zfs » discuss
> in the drop-down menu.

You can create a single pool and grow it over time by adding vdevs; from
that pool, you create filesystems (datasets) as needed. A rough sketch
follows below.

If you want to create multiple pools (because the redundancy/performance
requirements differ), ZFS will keep them completely separate, and again
you will create filesystems/datasets from each one independently.
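
A minimal sketch, with hypothetical device names (each raidz2 group is one
vdev, and ZFS stripes data across all vdevs in the pool):

 # zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
 # zfs create tank/media
 # zfs create tank/backups
 # zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

After the `zpool add`, the existing filesystems simply see the extra space;
a large file may end up with blocks on both vdevs, but that is transparent
and not something you manage per folder.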

http://download.oracle.com/docs/cd/E19963-01/html/821-1448/index.html
http://download.oracle.com/docs/cd/E18752_01/html/819-5461/index.html

--
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 350TB+ storage solution

2011-05-16 Thread Giovanni Tirloni
On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote:

>
> Actually I have seen resilvers take a very long time (weeks) on
> Solaris/raidz2, when I almost never see a hardware RAID controller take more
> than a day or two. In one case I thrashed the disks absolutely as hard as I
> could (hardware controller) and finally was able to get the rebuild to take
> almost 1 week. Here is an example of one right now:
>
>   pool: raid3060
>   state: ONLINE
>   status: One or more devices is currently being resilvered. The pool will
>   continue to function, possibly in a degraded state.
>   action: Wait for the resilver to complete.
>   scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
>   config:
>
>
Resilver has been a problem with RAIDZ volumes for a while. I've routinely
seen it take >300 hours, and sometimes >600 hours, on 13TB pools at 80% full.
All disks are maxed out on IOPS while still reading only 1-2MB/s, and there
are rarely any writes. I've written about it before here (and provided data).

My only guess is that fragmentation is a real problem in a scrub/resilver
situation, but whenever the conversation turns to pointing out weaknesses in
ZFS we start seeing "that is not a problem" comments. With the 7000-series
appliances I've heard that a 900-hour estimated resilver time was "normal" and
that "everything is working as expected". I can't help but think there is some
walled-garden syndrome floating around.

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-05 Thread Giovanni Tirloni
On Wed, May 4, 2011 at 9:04 PM, Brandon High  wrote:

> On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni 
> wrote:
> >   The problem we've started seeing is that a zfs send -i is taking hours
> to
> > send a very small amount of data (eg. 20GB in 6 hours) while a zfs send
> full
> > transfer everything faster than the incremental (40-70MB/s). Sometimes we
> > just give up on sending the incremental and send a full altogether.
>
> Does the send complete faster if you just pipe to /dev/null? I've
> observed that if recv stalls, it'll pause the send, and they two go
> back and forth stepping on each other's toes. Unfortunately, send and
> recv tend to pause with each individual snapshot they are working on.
>
> Putting something like mbuffer
> (http://www.maier-komor.de/mbuffer.html) in the middle can help smooth
> it out and speed things up tremendously. It prevents the send from
> pausing when the recv stalls, and allows the recv to continue working
> when the send is stalled. You will have to fiddle with the buffer size
> and other options to tune it for your use.
>


We've done various tests piping it to /dev/null and then transferring the
files to the destination. What seems to stall is the recv, because it doesn't
complete (through mbuffer, ssh, locally, etc.). The zfs send always completes
at the same rate.

mbuffer is being used but doesn't seem to help. When things start to stall,
the in/out buffers quickly fill up and nothing gets sent, probably because
the mbuffer on the other side can't accept any more data until the zfs recv
gives it some air to breathe. (A sketch of the kind of pipeline we use is
below.)
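
Roughly, the pipeline looks like this (dataset, snapshot and host names, and
the buffer sizes, are illustrative rather than our exact values):

 sender#   zfs send -i tank/vm@snap1 tank/vm@snap2 | \
             mbuffer -s 128k -m 1G -O recvhost:9090

 receiver# mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/vm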

What I find curious is that it only happens with incrementals. Full sends go
as fast as possible (monitored with mbuffer). I was just wondering if other
people have seen this, whether there is a bug (b111 is quite old), etc.

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread Giovanni Tirloni
On Tue, May 3, 2011 at 11:42 PM, Peter Jeremy <
peter.jer...@alcatel-lucent.com> wrote:

> - Is the source pool heavily fragmented with lots of small files?
>

Peter,

  We have some servers holding Xen VMs, and the setup was created with a
default VM from which the others are cloned, so the space savings are quite
good.

  The problem we've started seeing is that a zfs send -i takes hours to send
a very small amount of data (e.g. 20GB in 6 hours), while a full zfs send
transfers everything faster than the incremental (40-70MB/s). Sometimes we
just give up on sending the incremental and send a full stream instead.

  I'm wondering if it has to do with fragmentation too. Has anyone
experienced this? This is OpenSolaris b111. As a data point, we also have
servers holding VMware VMs (not cloned) and there is no problem there. Does
anyone know what's special about Xen's cloned VMs? Sparse files, maybe?

Thanks,

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] A resilver record?

2011-03-23 Thread Giovanni Tirloni
 0.3    0.1    2.3   1  16 c4t10d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  113.0   23.0 3226.9   28.0  0.0  1.2    0.1    8.9   2  40 c4t1d0
  159.0    0.0 3286.9    0.0  0.0  0.6    0.1    3.9   2  24 c4t8d0
  176.0    0.0 3545.9    0.0  0.0  0.5    0.1    3.0   2  26 c4t10d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  147.4   34.4 3888.9   52.1  0.0  1.5    0.2    8.3   3  43 c4t1d0
  181.7    0.0 3515.1    0.0  0.0  0.6    0.1    3.1   2  24 c4t8d0
  193.5    0.0 3489.9    0.0  0.0  0.6    0.2    3.3   4  22 c4t10d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  151.2   33.9 3792.7   42.7  0.0  1.5    0.1    7.9   1  36 c4t1d0
  197.5    0.0 3856.9    0.0  0.0  0.4    0.1    2.3   2  19 c4t8d0
  164.6    0.0 3928.1    0.0  0.0  0.7    0.1    4.2   1  24 c4t10d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  171.0   90.0 4426.3  121.5  0.0  1.3    0.1    4.9   3  51 c4t1d0
  184.0    0.0 4426.8    0.0  0.0  0.7    0.1    4.0   2  30 c4t8d0
  195.0    0.0 4430.3    0.0  0.0  0.7    0.1    3.7   2  32 c4t10d0
^C

Anyone else with over 600 hours of resilver time? :-)

Thank you,


Giovanni Tirloni (gtirl...@sysdroid.com)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver misleading output

2010-12-14 Thread Giovanni Tirloni
On Tue, Dec 14, 2010 at 6:34 AM, Bruno Sousa  wrote:

> Hello everyone,
>
> I have a pool consisting of 28 1TB SATA disks configured as 15*2 vdevs
> (2 disks per mirror), 2 SSDs in a mirror for the ZIL and 3 SSDs for L2ARC,
> and recently I added two more disks.
> For some reason the resilver process kicked in, and the system is
> noticeably slower, but I'm clueless as to what I should do, because zpool
> status says that the resilver process has finished.
>
> This system is running opensolaris snv_134, has 32GB of memory, and here's
> the zpool output
>
> zpool status -xv vol0
>  pool: vol0
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
>continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>  scrub: resilver in progress for 13h24m, 100.00% done, 0h0m to go
> config:
>
> zpool iostat snip
>
>   mirror-12                ONLINE       0     0     0
>     c8t5000C5001A11A4AEd0  ONLINE       0     0     0
>     c8t5000C5001A10CFB7d0  ONLINE       0     0     0  1.71G resilvered
>   mirror-13                ONLINE       0     0     0
>     c8t5000C5001A0F621Dd0  ONLINE       0     0     0
>     c8t5000C50019EB3E2Ed0  ONLINE       0     0     0
>   mirror-14                ONLINE       0     0     0
>     c8t5000C5001A0F543Dd0  ONLINE       0     0     0
>     c8t5000C5001A105D8Cd0  ONLINE       0     0     0
>   mirror-15                ONLINE       0     0     0
>     c8t5000C5001A0FEB16d0  ONLINE       0     0     0
>     c8t5000C50019C1D460d0  ONLINE       0     0     0  4.06G resilvered
>
>
> Any idea for this type of situation?
>


http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6899970

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS2-L8i

2010-10-16 Thread Giovanni Tirloni
On Fri, Oct 15, 2010 at 5:18 PM, Maurice Volaski <
maurice.vola...@einstein.yu.edu> wrote:

> The mpt_sas driver supports it. We've had LSI 2004 and 2008 controllers
>> hang
>> for quite some time when used with SuperMicro chassis and Intel X25-E SSDs
>> (OSOL b134 and b147). It seems to be a firmware issue that isn't fixed
>> with
>> the last update.
>>
>
> Do you mean to include all the PCIe cards, not just the AOC-USAS2-L8i, and
> when it's directly connected rather than through the backplane? Prior
> reports here seemed to implicate the card only when it was connected to the
> backplane.
>
>
I only tested the LSI 2004/2008 HBAs connected to the backplane (both 3Gb/s
and 6Gb/s).

The MegaRAID 8888ELP, when connected to the same backplane, doesn't exhibit
that behavior.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-USAS2-L8i

2010-10-13 Thread Giovanni Tirloni
On Tue, Oct 12, 2010 at 9:30 AM, Alexander Lesle  wrote:

> Hello guys,
>
> I want to build a new NAS and I am searching for a controller.
> At Supermicro I found this new one with the LSI 2008 controller:
>
> http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I
>
> Who can confirm that this card runs under OSOL build 134 or Solaris 10?
> Why this card? Because it supports 6.0 Gb/s SATA.
>

The mpt_sas driver supports it. We've had LSI 2004 and 2008 controllers hang
for quite some time when used with SuperMicro chassis and Intel X25-E SSDs
(OSOL b134 and b147). It seems to be a firmware issue that isn't fixed by the
latest update. While running any heavy workload you'll see as many as
zfs_vdev_max_pending operations stuck on each SSD at random. Others have
reported success with them though, so YMMV. LSI says the boards are not
supported under Solaris and refuses to investigate.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] non-ECC Systems and ZFS for home users

2010-09-25 Thread Giovanni Tirloni
On Thu, Sep 23, 2010 at 1:08 PM, Dick Hoogendijk  wrote:

>  And what Sun systems are you thinking of for 'home use'?
> The likelihood of a memory failure might be much higher than that of
> becoming a millionaire, but in years past I have never had one. And my home
> systems are rather cheap. Mind you, not the cheapest, but rather cheap. I do
> buy good memory though. So, to me, with a good backup I feel rather safe
> using ZFS. I also had it running for quite some time on a 32-bit machine and
> that also worked out fine.
>

We see correctable memory errors on ECC systems on a monthly basis. It's not
a question of whether they'll happen, but how often.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacing a disk never completes

2010-09-20 Thread Giovanni Tirloni
On Thu, Sep 16, 2010 at 9:36 AM, Ben Miller wrote:

> I have an X4540 running b134 where I'm replacing 500GB disks with 2TB disks
> (Seagate Constellation) and the pool seems sick now.  The pool has four
> raidz2 vdevs (8+2) where the first set of 10 disks were replaced a few
> months ago.  I replaced two disks in the second set (c2t0d0, c3t0d0) a
> couple of weeks ago, but have been unable to get the third disk to finish
> replacing (c4t0d0).
>
> I have tried the resilver for c4t0d0 four times now and the pool also comes
> up with checksum errors and a permanent error (:<0x0>).  The first
> resilver was from 'zpool replace', which came up with checksum errors.  I
> cleared the errors which triggered the second resilver (same result).  I
> then did a 'zpool scrub' which started the third resilver and also
> identified three permanent errors (the two additional were in files in
> snapshots which I then destroyed).  I then did a 'zpool clear' and then
> another scrub which started the fourth resilver attempt.  This last attempt
> identified another file with errors in a snapshot that I have now destroyed.
>
> Any ideas how to get this disk finished being replaced without rebuilding
> the pool and restoring from backup?  The pool is working, but is reporting
> as degraded and with checksum errors.
>
>
[...]

Try running `zpool clear pool2` and see if it clears the errors. If not, you
may have to detach `c4t0d0s0/o`; a sketch of both steps is below.

I believe it's a bug that was fixed in recent builds.
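
Something along these lines (pool and device names taken from your output,
so double-check them first):

 # zpool clear pool2
 # zpool status -v pool2
 # zpool detach pool2 c4t0d0s0/o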

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] what is zfs doing during a log resilver?

2010-09-03 Thread Giovanni Tirloni
On Thu, Sep 2, 2010 at 10:18 AM, Jeff Bacon wrote:

> So, when you add a log device to a pool, it initiates a resilver.
>
> What is it actually doing, though? Isn't the slog a copy of the
> in-memory intent log? Wouldn't it just simply replicate the data that's
> in the other log, checked against what's in RAM? And presumably there
> isn't that much data in the slog so there isn't that much to check?
>
> Or is it just doing a generic resilver for the sake of argument because
> you changed something?
>

Good question. Here it takes a little over 1 hour to resilver a 32GB SSD in a
mirror. I've always wondered what exactly it was doing, since it was supposed
to be only about 30 seconds' worth of data. It also generates lots of
checksum errors.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replaced pool device shows up in zpool status

2010-08-16 Thread Giovanni Tirloni
On Mon, Aug 16, 2010 at 11:47 AM, Mark J Musante
 wrote:
> On Mon, 16 Aug 2010, Matthias Appel wrote:
>
>> Can anybody tell me how to get rid of c1t3d0 and heal my zpool?
>
> Can you do a "zpool detach performance c1t3d0/o"?  If that works, then
> "zpool replace performance c1t3d0 c1t0d0" should replace the bad disk with
> the new hot spare.  Once the resilver completes, do a "zpool detach
> performance c1t3d0" to remove the bad disk and promote the hot spare to a
> full member of the pool.
>
> Or, if that doesn't work, try the same thing with c1t3d0 and c1t3d0/o
> swapped around.

Recently fixed in b147:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=67825



-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] autoreplace not kicking in

2010-08-11 Thread Giovanni Tirloni
On Wed, Aug 11, 2010 at 4:06 PM, Cindy Swearingen
 wrote:
> Hi Giovanni,
>
> The spare behavior and the autoreplace property behavior are separate
> but they should work pretty well in recent builds.
>
> You should not need to perform a zpool replace operation if the
> autoreplace property is set. If autoreplace is set and a replacement
> disk is inserted into the same physical location of the removed
> failed disk, then a new disk label is applied to the replacement
> disk and ZFS should recognize it.

That's what I'm having to do in b111. I will try to simulate the same
situation in b134.

>
> Let the replacement disk resilver from the spare. When the resilver
> completes, the spare should detach automatically. We saw this happen on
> a disk replacement last week on a system running a recent Nevada build.
>
> If the spare doesn't detach after the resilver is complete, then just
> detach it manually.

Yes, that's working as expected (spare detaches after resilver).


-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] autoreplace not kicking in

2010-08-11 Thread Giovanni Tirloni
Hello,

 In OpenSolaris b111, with autoreplace=on and a pool without spares,
ZFS does not kick off a resilver after a faulty disk is replaced and
shows up under the same device name, even after waiting several
minutes. The workaround is to do a manual `zpool replace`, which returns
the following:

# zpool replace tank c3t17d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c3t17d0s0 is part of active ZFS pool tank. Please see zpool(1M).

 ... and resilvering starts immediately. It looks like the `zpool
replace` kicked off the autoreplace function.

 Since b111 is a little old, there is a chance this has already
been reported and fixed. Does anyone know anything about it?

 Also, if autoreplace is on and the pool has spares: when a disk fails,
the spare is automatically used (works fine), but when the faulty disk
is replaced, nothing really happens. Is the autoreplace code
supposed to replace the faulty disk and release the spare when the
resilver is done? (The property was enabled as sketched below.)
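
For reference, enabling the property amounts to (pool name hypothetical):

 # zpool set autoreplace=on tank
 # zpool get autoreplace tank

and the replacement disk comes back under the same device name (same
physical slot), which is the case autoreplace is meant to handle.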

Thank you,

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Upgrading 2009.06 to something current

2010-08-01 Thread Giovanni Tirloni
On Sun, Aug 1, 2010 at 2:57 PM, David Dyer-Bennet  wrote:
> What's a good choice for a decently stable upgrade?  I'm unable to run
> backups because ZFS send/receive won't do full-pool replication reliably, it
> hangs better than 2/3 of the time, and people here have told me later
> versions (later than 111b) fix this.  I was originally waiting for the
> "spring" release, but okay, I've kind of given up on that.  This is a home
> "production" server; it's got all my photos on it.  And the backup isn't as
> current as I'd like, and I'm having trouble getting a better backup.  (I'll
> do *something* before I risk the upgrade; maybe brute force, rsync to an
> external drive, to at least give me a clean copy of the current state; I can
> live without ACLs.)
>
> I find various blogs with instructions for how to do such an upgrade, and
> they don't agree, and each one has posts from people for whom it didn't
> work, too.  Is there any kind of consensus on what the best way to do this
> is?

You've got to point pkg to pkg.opensolaris.org/dev and then choose one
of the development builds.

If you run a `pkg image-update` right away, the latest bits you'll get
are from build 134, which people have reported works OK.

If you want to try something in between b111 and b134, see the
following instructions:

http://blogs.sun.com/observatory/entry/updating_to_a_specific_build
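
For the straight jump to the latest development bits, the commands amount to
roughly the following (b134-era syntax; `pkg image-update` creates a new boot
environment you can fall back to from GRUB if anything goes wrong):

 # pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org
 # pkg image-update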

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Increase resilver priority

2010-07-23 Thread Giovanni Tirloni
On Fri, Jul 23, 2010 at 12:50 PM, Bill Sommerfeld
 wrote:
> On 07/23/10 02:31, Giovanni Tirloni wrote:
>>
>>  We've seen some resilvers on idle servers that are taking ages. Is it
>> possible to speed up resilver operations somehow?
>>
>>  Eg. iostat shows<5MB/s writes on the replaced disks.
>
> What build of opensolaris are you running?  There were some recent
> improvements (notably the addition of prefetch to the pool traverse used by
> scrub and resilver) which sped this up significantly for my systems.

b111. Thanks for the heads up regarding these improvements, I'll try
that in b134.

> Also: if there are large numbers of snapshots, pools seem to take longer to
> resilver, particularly when there's a lot of metadata divergence between
> snapshots.  Turning off atime updates (if you and your applications can cope
> with this) may also help going forward.

There are 7 snapshots and atime is disabled.
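
(For anyone else following the thread: atime updates can be toggled per
dataset, with the dataset name below being purely illustrative.)

 # zfs set atime=off tank/data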

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Increase resilver priority

2010-07-23 Thread Giovanni Tirloni
On Fri, Jul 23, 2010 at 11:59 AM, Richard Elling  wrote:
> On Jul 23, 2010, at 2:31 AM, Giovanni Tirloni wrote:
>
>> Hello,
>>
>> We've seen some resilvers on idle servers that are taking ages. Is it
>> possible to speed up resilver operations somehow?
>>
>> Eg. iostat shows <5MB/s writes on the replaced disks.
>
> This is lower than I expect, but It may be IOPS bound. What does
> iostat say about the IOPS and asvc_t ?
>  -- richard

It seems to have improved a bit.

 scrub: resilver in progress for 7h19m, 75.37% done, 2h23m to go
config:

        NAME              STATE     READ WRITE CKSUM
        storage           DEGRADED     0     1     0
          mirror          DEGRADED     0     0     0
            c3t2d0        ONLINE       0     0     0
            replacing     DEGRADED 1.29M     0     0
              c3t3d0s0/o  FAULTED      0     0     0  corrupted data
              c3t3d0      DEGRADED     0     0 1.29M  too many errors
          mirror          ONLINE       0     0     0
            c3t4d0        ONLINE       0     0     0
            c3t5d0        ONLINE       0     0     0
          mirror          DEGRADED     0     0     0
            c3t6d0        ONLINE       0     0     0
            c3t7d0        REMOVED      0     0     0
          mirror          ONLINE       0     0     0
            c3t8d0        ONLINE       0     0     0
            c3t9d0        ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t10d0       ONLINE       0     0     0
            c3t11d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t12d0       ONLINE       0     0     0
            c3t13d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t14d0       ONLINE       0     0     0
            c3t15d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t16d0       ONLINE       0     0     0
            c3t17d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t18d0       ONLINE       0     0     0
            c3t19d0       ONLINE       0     0     0
          mirror          ONLINE       0     0     0
            c3t20d0       ONLINE       0     0     0
            c3t21d0       ONLINE       0     0     0
        logs              DEGRADED     0     1     0
          mirror          ONLINE       0     0     0
            c3t1d0        ONLINE       0     0     0
            c3t22d0       ONLINE       0     0     0


extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  582.2  864.3 68925.2 37511.2  0.0 52.7    0.0   36.4   0 610 c3
    0.0  201.7     0.0   806.7  0.0  0.0    0.0    0.1   0   2 c3t0d0
    0.0  268.2     0.0 10531.2  0.0  0.1    0.0    0.4   0  10 c3t1d0
  144.1    0.0 18375.9     0.0  0.0  9.5    0.0   65.7   0 100 c3t2d0
   79.5  125.2 10109.9 15634.3  0.0 35.0    0.0  171.0   0 100 c3t3d0
   10.9    0.0  1181.3     0.0  0.0  0.1    0.0   13.3   0  10 c3t4d0
   19.9    0.0  2120.6     0.0  0.0  0.3    0.0   15.6   0  19 c3t5d0
   35.8    0.0  3819.5     0.0  0.0  0.6    0.0   18.1   0  28 c3t6d0
    0.0    0.0     0.0     0.0  0.0  0.0    0.0    0.0   0   0 c3t7d0
   22.9    0.0  2506.6     0.0  0.0  0.5    0.0   22.0   0  22 c3t8d0
   15.9    0.0  1639.8     0.0  0.0  0.3    0.0   20.5   0  15 c3t9d0
   23.8    0.0  2889.6     0.0  0.0  0.5    0.0   19.8   0  27 c3t10d0
   21.9    0.0  2558.3     0.0  0.0  0.6    0.0   28.9   0  19 c3t11d0
   32.8    0.0  3151.9     0.0  0.0  1.2    0.0   37.4   0  25 c3t12d0
   25.8    0.0  2707.8     0.0  0.0  0.5    0.0   18.8   0  26 c3t13d0
   19.9    0.0  2281.1     0.0  0.0  0.3    0.0   17.5   0  24 c3t14d0
   23.8    0.0  2782.3     0.0  0.0  0.3    0.0   14.6   0  20 c3t15d0
   18.9    0.0  2249.8     0.0  0.0  0.4    0.0   19.7   0  23 c3t16d0
   21.9    0.0  2519.5     0.0  0.0  0.5    0.0   22.6   0  27 c3t17d0
   12.9    0.0  1653.2     0.0  0.0  0.2    0.0   16.8   0  18 c3t18d0
   26.8    0.0  3262.7     0.0  0.0  0.8    0.0   28.4   0  29 c3t19d0
    9.9    0.0  1271.7     0.0  0.0  0.1    0.0   14.3   0  13 c3t20d0
   14.9    0.0  1843.9     0.0  0.0  0.3    0.0   20.5   0  19 c3t21d0
    0.0  269.2     0.0 10539.0  0.0  0.4    0.0    1.3   0  33 c3t22d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  405.1  745.2 51057.2 29893.8  0.0 53.3    0.0   46.4   0 457 c3
    0.0  252.1     0.0  1008.3  0.0  0.0    0.0    0.1   0   2 c3t0d0
    0.0  177.1     0.0  5485.8  0.0  0.0    0.0    0.3   0   4 c3t1d0
  145.0    0.0 18438.0     0.0  0.0 15.1    0.0  104.0   0 100 c3t2d0
   80.0  140.0 10147.8 17925.9  0.0 35.0    0.0  159.0   0 100 c3t3d0
    8.0    0.0  1024.3     0.0  0.0  0.2    0.0   19.9   0  12 c3t4d0
    7.0    0.0   768.3     0.0  0.0  0.1    0.0   15.

[zfs-discuss] Increase resilver priority

2010-07-23 Thread Giovanni Tirloni
Hello,

 We've seen some resilvers on idle servers that are taking ages. Is it
possible to speed up resilver operations somehow?

 Eg. iostat shows <5MB/s writes on the replaced disks.

 I'm thinking a small performance degradation would be sometimes
better than the increased risk window (where a vdev is degraded).

Thank you,

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?

2010-07-20 Thread Giovanni Tirloni
On Tue, Jul 20, 2010 at 12:59 AM, Edward Ned Harvey  wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Richard Jahnel
>>
>> I'vw also tried mbuffer, but I get broken pipe errors part way through
>> the transfer.
>
> The standard answer is mbuffer.  I think you should ask yourself what's
> going wrong with mbuffer.  You're not, by any chance, sending across a LACP
> aggregated link, are you?  Most people don't notice it, but I sure do, that
> usually LACP introduces packet errors.  Just watch your error counter, and
> start cramming data through there.  So far I've never seen a single
> implementation that passed this test...  Although I'm sure I've just had bad
> luck.
>
> If you're having some packet loss, that might explain the poor performance
> of ssh too.  Although ssh is known to slow things down in the best of cases
> ... I don't know if the speed you're seeing is reasonable considering.

We have hundreds of servers using LACP and so far have not noticed any
increase in the error rate.

Could you share which implementations (OS, switch) you have tested and
how the testing was done? I would like to try to simulate these issues.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt pool?

2010-07-19 Thread Giovanni Tirloni
On Mon, Jul 19, 2010 at 1:42 PM, Wolfraider  wrote:
> Our server locked up hard yesterday and we had to hard power it off and back 
> on. The server locked up again on reading ZFS config (I left it trying to 
> read the zfs config for 24 hours). I went through and removed the drives for 
> the data pool we created and powered on the server and it booted 
> successfully. I removed the pool from the system and reattached the drives 
> and tried to re-import the pool. It has now been trying to import for about 6 
> hours. Does anyone know how to recover this pool? Running version 134.

Have you enabled compression or deduplication?

Check the disks with `iostat -XCn 1` (look for high asvc_t times) and
`iostat -En` (hard and soft errors).

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] carrying on

2010-07-19 Thread Giovanni Tirloni
On Mon, Jul 19, 2010 at 7:12 AM, Joerg Schilling
 wrote:
> Giovanni Tirloni  wrote:
>
>> On Sun, Jul 18, 2010 at 10:19 PM, Miles Nordin  wrote:
>> > IMHO it's important we don't get stuck running Nexenta in the same
>> > spot we're now stuck with OpenSolaris: with a bunch of CDDL-protected
>> > source that few people know how to use in practice because the build
>> > procedure is magical and secret.  This is why GPL demands you release
>> > ``all build scripts''!
>>
>> I don't know if the GPL demands that but I think we've all learned a
>> lesson from Oracle/Sun regarding that.
>
> The missing requirement to provide build scripts is a drawback of the CDDL.
>
> ...But believe me that the GPL would not help you here, as the GPL cannot
> force the original author (in this case Sun/Oracle or whoever) to supply the
> scripts in question.

 I have no doubt that the GPL (or any other license) would not have
prevented the current situation.

 It's more of a strategic/business decision.

>> I hope that if we want to be able to move OpenSolaris to the next
>> level, we can this time avoid falling into the same mouse trap.
>
> This is a community issue.
>
> Do we have people that are willing to help?

Yep! Just need a little guidance in the beginning :)

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] carrying on

2010-07-18 Thread Giovanni Tirloni
On Sun, Jul 18, 2010 at 10:19 PM, Miles Nordin  wrote:
> IMHO it's important we don't get stuck running Nexenta in the same
> spot we're now stuck with OpenSolaris: with a bunch of CDDL-protected
> source that few people know how to use in practice because the build
> procedure is magical and secret.  This is why GPL demands you release
> ``all build scripts''!

I don't know if the GPL demands that but I think we've all learned a
lesson from Oracle/Sun regarding that.

Releasing source code and expecting people to figure out the rest
could be called "open source" but it won't create the kind of
collaboration people usually expect.

For any "fork" (or whatever people want to call it; there are many
shades of gray) to succeed, releasing and documenting the
build/testing infrastructure used to create the end product is as
important as the main source code itself.

I'm not saying Oracle/Sun should have released each and every thing they
used to create the OpenSolaris binary distribution (their product).
I'm saying they should have first stopped treating it as a proprietary
product and then released those bits to further foster external
collaboration. But now that's all history, and discussing how
things could have been done won't change anything.

I hope that if we want to be able to move OpenSolaris to the next
level, we can this time avoid falling into the same mouse trap.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lost zpool after reboot

2010-07-17 Thread Giovanni Tirloni
On Sat, Jul 17, 2010 at 3:07 PM, Amit Kulkarni  wrote:
> I don't know if the devices are renumbered. How do you know if the devices 
> are changed?
>
> Here is output of format, the middle one is the boot drive and selection 0 & 
> 2 are the ZFS mirrors
>
> AVAILABLE DISK SELECTIONS:
>       0. c8t0d0 
>          /p...@0,0/pci108e,5...@7/d...@0,0
>       1. c8t1d0 
>          /p...@0,0/pci108e,5...@7/d...@1,0
>       2. c9t0d0 
>          /p...@0,0/pci108e,5...@8/d...@0,0

It seems that the devices that ZFS is trying to open exist. I wonder
why it's failing.

Please send the output of:

zpool status
zpool import
zdb -C (dump config)
zdb -l /dev/dsk/c8t0d0s0 (dump label contents)
zdb -l /dev/dsk/c9t0d0s0 (dump label contents)
check /var/adm/messages

Perhaps with the additional information someone here can help you
better. I don't have enough experience with Windows 7 to guarantee that
it hasn't messed with the disk contents.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lost zpool after reboot

2010-07-17 Thread Giovanni Tirloni
On Sat, Jul 17, 2010 at 10:55 AM, Amit Kulkarni  wrote:
> I did a zpool status and it gave me zfs 8000-3C error, saying my pool is 
> unavailable. Since I am able to boot & access browser, I tried a zpool import 
> without arguments, with trying to export my pool, more fiddling. Now I can't 
> get zpool status to show my pool.

>    vdev_path = /dev/dsk/c9t0d0s0
>    vdev_devid = id1,s...@ahitachi_hds7225scsun250g_0719bn9e3k=vfa100r1dn9e3k/a
>    parent_guid = 0xb89f3c5a72a22939

Does format(1M) show the devices where they once were?

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fmadm warnings about media erros

2010-07-17 Thread Giovanni Tirloni
On Sat, Jul 17, 2010 at 10:49 AM, Bob Friesenhahn
 wrote:
> On Sat, 17 Jul 2010, Bruno Sousa wrote:
>>
>> Jul 15 12:30:48 storage01 SOURCE: eft, REV: 1.16
>> Jul 15 12:30:48 storage01 EVENT-ID: 859b9d9c-1214-4302-8089-b9447619a2a1
>> Jul 15 12:30:48 storage01 DESC: The command was terminated with a
>> non-recovered error condition that may have been caused by a flaw in the
>> media or an error in the recorded data.
>
> This sounds like a hard error to me.  I suggest using 'iostat -xe' to check
> the hard error counts and check the system log files.  If your storage array
> was undergoing maintenance and had a cable temporarily disconnected or
> controller rebooted, then it is possible that hard errors could be counted.
>  FMA usually waits until several errors have been reported over a period of
> time before reporting a fault.

Speaking of that, is there a place where one can see/change these thresholds?

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-14 Thread Giovanni Tirloni
On Wed, Jul 14, 2010 at 12:57 PM, Edward Ned Harvey
 wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> When you pay for the higher prices for OEM hardware, you're paying for
>> the
>> knowledge of parts availability and compatibility.  And a single point
>> vendor who supports the system as a whole, not just one component.
>
> For the record:
>
> I'm not saying this is always worth while.  Sometimes I buy the enterprise
> product and triple-platinum support.  Sometimes I buy generic blackboxes
> with mfgr warranty on individual components.  It depends on your specific
> needs at the time.
>
> I will say, that I am a highly paid senior admin.  I only buy the generic
> black boxes if I have interns or junior (no college level) people available
> to support them.

Generic != black box. Quite the opposite.

Some companies are successfully doing the opposite of what you describe: they
are using standard parts and a competent staff that knows how to build
solutions out of them, without having to pay for GUI-powered systems and a
4-hour on-site part-swapping service.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-help] ZFS list snapshots incurs large delay

2010-07-13 Thread Giovanni Tirloni
On Tue, Jul 13, 2010 at 2:44 PM, Brent Jones  wrote:
> I have been running a pair of X4540's for almost 2 years now, the
> usual spec (Quad core, 64GB RAM, 48x 1TB).
> I have a pair of mirrored drives for rpool, and a Raidz set with 5-6
> disks in each vdev for the rest of the disks.
> I am running snv_132 on both systems.
>
> I noticed an oddity on one particular system, that when running a
> scrub, or a zfs list -t snapshot, the results take forever.
> Mind you, these are identical systems in hardware, and software. The
> primary system replicates all data sets to the secondary nightly, so
> there isn't much of a discrepancy of space used.
>
> Primary system:
> # time zfs list -t snapshot | wc -l
> 979
>
> real    1m23.995s
> user    0m0.360s
> sys     0m4.911s
>
> Secondary system:
> # time zfs list -t snapshot | wc -l
> 979
>
> real    0m1.534s
> user    0m0.223s
> sys     0m0.663s
>
>
> At the time of running both of those, no other activity was happening,
> load average of .05 or so. Subsequent runs also take just as long on
> the primary, no matter how many times I run it, it will take about 1
> minute and 25 seconds each time, very little drift (+- 1 second if
> that)
>
> Both systems are at about 77% used space on the storage pool, no other
> distinguishing factors that I can discern.
> Upon a reboot, performance is respectable for a little while, but
> within days, it will sink back to those levels. I suspect a memory
> leak, but both systems run the same software versions and packages, so
> I can't envision that.
>
> Would anyone have any ideas what may cause this?

It could be a disk failing and dragging I/O down with it.

Try checking for high asvc_t with `iostat -XCn 1` and for errors with
`iostat -En`.

Any timeouts or retries in /var/adm/messages?

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/recv hanging in 2009.06

2010-07-09 Thread Giovanni Tirloni
On Fri, Jul 9, 2010 at 6:49 PM, BJ Quinn  wrote:
> I have a couple of systems running 2009.06 that hang on relatively large zfs 
> send/recv jobs.  With the -v option, I see the snapshots coming across, and 
> at some point the process just pauses, IO and CPU usage go to zero, and it 
> takes a hard reboot to get back to normal.  The same script running against 
> the same data doesn't hang on 2008.05.

There are known issues running concurrent zfs receive operations in 2009.06.
Try to run just one at a time.

Switching to a development build (b134) is probably the answer until
we have a new release.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NexentaStor 3.0.3 vs OpenSolaris - Patches more up to date?

2010-07-06 Thread Giovanni Tirloni
On Tue, Jul 6, 2010 at 4:06 PM, Spandana Goli  wrote:
> Release Notes information:
> If there are new features, each release is added to
> http://www.nexenta.com/corp/documentation/release-notes-support.
>
> If just bug fixes, then the Changelog listing is updated:
> http://www.nexenta.com/corp/documentation/nexentastor-changelog

Is there a bug tracker where one can see a definitive list of all the bugs
(with details) that went into a release?

"Many bug fixes" is a bit too general.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Giovanni Tirloni
On Wed, Jun 23, 2010 at 2:43 PM, Jeff Bacon  wrote:
>> >> Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
>> >> performance by 30-40% instantly and there are no hangs anymore so
> I'm
>> >> guessing it's something related to the mpt_sas driver.
>
> Wait. The mpt_sas driver by default uses scsi_vhci, and scsi_vhci by
> default does load-balance round-robin. Have you tried setting
> load-balance="none" in scsi_vhci.conf?

That didn't help.
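
(For anyone following along, the change suggested above amounts to editing
/kernel/drv/scsi_vhci.conf, replacing the stock load-balance="round-robin";
entry with the line below, and rebooting. Verify the path and default value
on your own build.)

 load-balance="none";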

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Giovanni Tirloni
On Wed, Jun 23, 2010 at 10:14 AM, Jeff Bacon  wrote:
>> > Have I missed any changes/updates in the situation?
>>
>> I'm been getting very bad performance out of a LSI 9211-4i card
>> (mpt_sas) with Seagate Constellation 2TB SAS disks, SM SC846E1 and
>> Intel X-25E/M SSDs. Long story short, I/O will hang for over 1 minute
>> at random under heavy load.
>
> Hm. That I haven't seen. Is this hang as in some drive hangs up with
> iostat busy% at 100 and nothing else happening (can't talk to a disk) or
> a hang as perceived by applications under load?
>
> What's your read/write mix, and what are you using for CPU/mem? How many
> drives?

I'm using iozone to get some performance numbers, and I/O hangs during the
write phase.

This pool has:

18 x 2TB SAS disks as 9 data mirrors
2 x 32GB X-25E as log mirror
1 x 160GB X-160M as cache

iostat shows "2" I/O operations active and SSDs at 100% busy when it's stuck.

There are timeout messages when this happens:

Jun 23 00:05:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba Disconnected command timeout for Target 11
Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:05:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:05:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Jun 23 00:11:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba Disconnected command timeout for Target 11
Jun 23 00:11:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:11:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Jun 23 00:11:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:11:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc



> I wonder if maybe your SSDs are flooding the channel. I have a (many)
> 847E2 chassis, and I'm considering putting in a second pair of
> controllers and splitting the drives front/back so it's 24/12 vs all 36
> on one pair.

My plan is to use the newest SC846E26 chassis with 2 cables, but right
now what I have available for testing is the SC846E1.

I like the fact that SM uses LSI chipsets in their backplanes.
It's been a good experience so far.


>> Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
>> performance by 30-40% instantly and there are no hangs anymore so I'm
>> guessing it's something related to the mpt_sas driver.
>
> Well, I sorta hate to swap out all of my controllers (bother, not to
> mention the cost) but it'd be nice to have raidutil/lsiutil back.

As much as I would like to blame faulty hardware for this issue, I
only pointed out that the MegaRAID doesn't show the problem
because that's what I've been using without any issues in this
particular setup.

This system will be available to me for quite some time, so if anyone
wants any kind of test run to understand what's happening, I would be
happy to provide the results.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-22 Thread Giovanni Tirloni
On Fri, Jun 18, 2010 at 9:53 AM, Jeff Bacon  wrote:
> I know that this has been well-discussed already, but it's been a few months 
> - WD caviars with mpt/mpt_sas generating lots of retryable read errors, 
> spitting out lots of beloved " Log info 3108 received for target" 
> messages, and just generally not working right.
>
> (SM 836EL1 and 836TQ chassis - though I have several variations on theme 
> depending on date of purchase: 836EL2s, 846s and 847s - sol10u8, 
> 1.26/1.29/1.30 LSI firmware on LSI retail 3801 and 3081E controllers. Not 
> that it works any better on the brace of 9211-8is I also tried these drives 
> on.)
>
> Before signing up for the list, I "accidentally" bought a wad of caviar black 
> 2TBs. No, they are new enough to not respond to WDTLER.EXE, and yes, they are 
> generally unhappy with my boxen. I have them "working" now, running 
> direct-attach off 3 3081E-Rs with breakout cables in the SC836TQ (passthru 
> backplane) chassis, set up as one pool of 2 6+2 raidz2 vdevs (16 drives 
> total), but they still toss the occasional error and performance is, well, 
> abysmal - zpool scrub runs at about a third the speed of the 1TB cudas that 
> they share the machine with, in terms of iostat reported ops/sec or 
> bytes/sec. They don't want to work in an expander chassis at all - spin up 
> the drives and connect them and they'll run great for a while, then after 
> about 12 hours they start throwing errors. (Cycling power on the enclosure 
> does seem to reset them to run for another 12 hours, but...)
>
> I've caved in and bought a brace of replacement cuda XTs, and I am currently 
> going to resign these drives to other lesser purposes (attached to si3132s 
> and ICH10 in a box to be used to store backups, running Windoze). It's kind 
> of a shame, because their single-drive performance is quite good - I've been 
> doing single-drive tests in another chassis against cudas and constellations, 
> and they seem quite a bit faster except on random-seek.
>
> Have I missed any changes/updates in the situation?

I've been getting very bad performance out of an LSI 9211-4i card
(mpt_sas) with Seagate Constellation 2TB SAS disks, an SM SC846E1 chassis and
Intel X25-E/M SSDs. Long story short, I/O will hang for over 1 minute
at random under heavy load.

Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
performance by 30-40% instantly and there are no hangs anymore so I'm
guessing it's something related to the mpt_sas driver.

I submitted bug #6963321 a few minutes ago (not available yet).

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool export / import discrepancy

2010-06-15 Thread Giovanni Tirloni
On Tue, Jun 15, 2010 at 1:56 PM, Scott Squires  wrote:
> Is ZFS dependent on the order of the drives?  Will this cause any issue down 
> the road?  Thank you all;

No. In your case the logical device names changed, but ZFS still identified
and ordered the disks correctly, as they were before.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-27 Thread Giovanni Tirloni
On Thu, May 27, 2010 at 2:39 AM, Marc Bevand  wrote:
> Hi,
>
> Brandon High  freaks.com> writes:
>>
>> I only looked at the Megaraid  that he mentioned, which has a PCIe
>> 1.0 4x interface, or 1000MB/s.
>
> You mean x8 interface (theoretically plugged into that x4 slot below...)
>
>> The board also has a PCIe 1.0 4x electrical slot, which is 8x
>> physical. If the card was in the PCIe slot furthest from the CPUs,
>> then it was only running 4x.

The tests were done connecting both cards to PCIe 2.0 x8 slot #6,
which connects directly to the Intel 5520 chipset.

I completely overlooked the differences between PCIe 1.0 and 2.0. My fault.


>
> If Giovanni had put the Megaraid  in this slot, he would have seen
> an even lower throughput, around 600MB/s:
>
> This slot is provided by the ICH10R which as you can see on:
> http://www.supermicro.com/manuals/motherboard/5500/MNL-1062.pdf
> is connected to the northbridge through a DMI link, an Intel-
> proprietary PCIe 1.0 x4 link. The ICH10R supports a Max_Payload_Size
> of only 128 bytes on the DMI link:
> http://www.intel.com/Assets/PDF/datasheet/320838.pdf
> And as per my experience:
> http://opensolaris.org/jive/thread.jspa?threadID=54481&tstart=45
> a 128-byte MPS allows using just about 60% of the theoretical PCIe
> throughput, that is, for the DMI link: 250MB/s * 4 links * 60% = 600MB/s.
> Note that the PCIe x4 slot supports a larger, 256-byte MPS but this is
> irrevelant as the DMI link will be the bottleneck anyway due to the
> smaller MPS.
>
>> > A single 3Gbps link provides in theory 300MB/s usable after 8b-10b
> encoding,
>> > but practical throughput numbers are closer to 90% of this figure, or
> 270MB/s.
>> > 6 disks per link means that each disk gets allocated 270/6 = 45MB/s.
>>
>> ... except that a SFF-8087 connector contains four 3Gbps connections.
>
> Yes, four 3Gbps links, but 24 disks per SFF-8087 connector. That's
> still 6 disks per 3Gbps (according to Giovanni, his LSI HBA was
> connected to the backplane with a single SFF-8087 cable).


Correct. The backplane on the SC846E1 only has one SFF-8087 cable to the HBA.


>> It may depend on how the drives were connected to the expander. You're
>> assuming that all 18 are on 3 channels, in which case moving drives
>> around could help performance a bit.
>
> True, I assumed this and, frankly, this is probably what he did by
> using adjacent drive bays... A more optimal solution would be spread
> the 18 drives in a 5+5+4+4 config so that the 2 most congested 3Gbps
> links are shared by only 5 drives, instead of 6, which would boost the
> througput by 6/5 = 1.2x. Which would change my first overall 810MB/s
> estimate to 810*1.2 = 972MB/s.

The chassis has 4 columns of 6 disks. The 18 disks I was testing were
all in columns #1, #2 and #3.

Column #0 still has a pair of SSDs and more disks which I haven't used
in this test. I'll try to move things around to make use of the 4 port
multipliers and test again.

SuperMicro is going to release a 6Gb/s backplane that uses the LSI
SAS2X36 chipset in the near future, I've been told.

Good thing this is still a lab experiment. Thanks very much for the
invaluable help!

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-26 Thread Giovanni Tirloni
On Wed, May 26, 2010 at 9:22 PM, Brandon High  wrote:
> On Wed, May 26, 2010 at 4:27 PM, Giovanni Tirloni  
> wrote:
>> SuperMicro X8DTi motherboard
>> SuperMicro SC846E1 chassis (3Gb/s backplane)
>> LSI 9211-4i (PCIex x4) connected to backplane with a SFF-8087 cable (4-lane).
>> 18 x Seagate 1TB SATA 7200rpm
>>
>> I was able to saturate the system at 800MB/s with the 18 disks in
>> RAID-0. Same performance was achieved swapping the 9211-4i for a
>> MegaRAID ELP.
>>
>> I'm guessing the backplane and cable are the bottleneck here.
>
> I'd wager it's the PCIe x4. That's about 1000MB/s raw bandwidth, about
> 800MB/s after overhead.

Makes perfect sense. I was calculating the bottlenecks using the
full-duplex bandwidth, so the one-way bottleneck wasn't apparent.
In any case the solution is limited externally by the 4 x Gigabit
Ethernet NICs, unless we add more, which isn't necessary for our
requirements.
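
For the archives, the back-of-the-envelope math (assuming PCIe 1.0 lanes):

 4 lanes x 250 MB/s per lane (after 8b/10b encoding) = 1000 MB/s
 minus roughly 20% packet/protocol overhead          =  ~800 MB/s

which lines up with the ~800MB/s point where we saturated.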

Thanks!

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Ideal SATA/SAS Controllers for ZFS

2010-05-26 Thread Giovanni Tirloni
On Thu, May 20, 2010 at 2:19 AM, Marc Bevand  wrote:
> Deon Cui  gmail.com> writes:
>>
>> So I had a bunch of them lying around. We've bought a 16x SAS hotswap
>> case and I've put in an AMD X4 955 BE with an ASUS M4A89GTD Pro as
>> the mobo.
>>
>> In the two 16x PCI-E slots I've put in the 1068E controllers I had
>> lying around. Everything is still being put together and I still
>> haven't even installed opensolaris yet but I'll see if I can get
>> you some numbers on the controllers when I am done.
>
> This is a well-architected config with no bottlenecks on the PCIe
> links to the 890GX northbridge or on the HT link to the CPU. If you
> run 16 concurrent dd if=/dev/rdsk/c?d?t?p0 of=/dev/zero bs=1024k and
> assuming your drives can do ~100MB/s sustained reads at the
> beginning of the platter, you should literally see an aggregate
> throughput of ~1.6GB/s...

SuperMicro X8DTi motherboard
SuperMicro SC846E1 chassis (3Gb/s backplane)
LSI 9211-4i (PCIe x4) connected to the backplane with an SFF-8087 cable (4 lanes).
18 x Seagate 1TB SATA 7200rpm

I was able to saturate the system at 800MB/s with the 18 disks in
RAID-0. Same performance was achieved swapping the 9211-4i for a
MegaRAID ELP.

I'm guessing the backplane and cable are the bottleneck here.

Any comments?

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard disk buffer at 100%

2010-05-07 Thread Giovanni Tirloni
On Fri, May 7, 2010 at 8:07 AM, Emily Grettel  wrote:

>  Hi,
>
> I've had my RAIDZ volume working well on snv_131, but it has come to my
> attention that there have been some read issues with the drives. Previously
> I thought this was a CIFS problem, but I'm noticing that when transferring
> files or uncompressing some fairly large 7z files (1-2GB), or even smaller
> RARs (200-300MB), running iostat will occasionally show %b at 100 for a
> drive or two.
>


That's the percentage of time the disk is busy (has transactions in
progress); see iostat(1M).



>
> I have the Western Digital EADS 1TB drives (the Green ones) and not the more
> expensive Black or enterprise drives (our sysadmin's fault).
>
> The pool in question spans 4x 1TB drives.
>
> What exactly does this mean? Is it a controller problem, disk problem or
> cable problem? I've got this on commodity hardware as it's only used for a
> small business with 4-5 staff accessing our media server. It's using the
> Intel ICHR SATA controller. I've already changed the cables and swapped out
> the odd drive that exhibited this issue, and the only thing I can think of
> is to buy an Intel or LSI SATA card.
>
> The scrub sessions take almost a day and a half now (previously at most
> 12hours!) but theres also 70% of space being used (files wise they're chunky
> MPG files) or compressed artwork but there are no errors reported.
>
> Does anyone have any ideas?
>

You might be maxing out your drives' I/O capacity. That can happen when
ZFS is committing the transactions to disk every 30 seconds, but if %b is
constantly high your disks might not be keeping up with the performance
requirements.

We've had some servers showing high asvc_t times but it turned out to be a
firmware issue in the disk controller. It was very erratic (1-2 drives out
of 24 would show that).

If you look in the archives, people have posted a few averaged I/O
performance numbers that you could compare against your workload.
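
Something like this, run while the slowness is happening, is usually the
quickest way to tell (standard iostat(1M) options):

# iostat -xn 5

Watch the asvc_t (average service time) and %b columns per device. One disk
with asvc_t far above its peers, or pinned at %b 100 while the others are
idle, usually points at a dying drive or a cable/firmware problem rather
than a general overload.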

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Loss of L2ARC SSD Behaviour

2010-05-06 Thread Giovanni Tirloni
On Thu, May 6, 2010 at 1:18 AM, Edward Ned Harvey wrote:

> > From the information I've been reading about the loss of a ZIL device,
> What the heck?  Didn't I just answer that question?
> I know I said this is answered in ZFS Best Practices Guide.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Sepa
> rate_Log_Devices
>
> Prior to pool version 19, if you have an unmirrored log device that fails,
> your whole pool is permanently lost.
> Prior to pool version 19, mirroring the log device is highly recommended.
> In pool version 19 or greater, if an unmirrored log device fails during
> operation, the system reverts to the default behavior, using blocks from
> the
> main storage pool for the ZIL, just as if the log device had been
> gracefully
> removed via the "zpool remove" command.
>


This week I had a bad experience replacing an SSD device that was in a
hardware RAID-1 volume. While rebuilding, the source SSD failed and the
volume was brought off-line by the controller.

The server kept working just fine but appeared to have switched from
batching writes at the 30-second interval to sending all writes directly to
the disks. I could confirm this with iostat.

We've had some compatibility issues between LSI MegaRAID cards and a few
MTRON SSDs and I didn't believe the SSD had really died. So I brought it
off-line and back on-line and everything started to work.

ZFS showed the log device c3t1d0 as removed. After the RAID-1 volume was
back I replaced that device with itself and a resilver process started. I
don't know what it was resilvering against but it took 2h10min. I should
have probably tried a zpool offline/online too.

So I think that if a log device fails AND you have to import your pool
later (server rebooted, etc.), then you lose your pool (prior to version
19). Right ?

This happened on OpenSolaris 2009.6.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What about this status report

2010-03-27 Thread Giovanni Tirloni
On Sat, Mar 27, 2010 at 6:02 PM, Harry Putnam  wrote:

> Bob Friesenhahn  writes:
>
> > On Sat, 27 Mar 2010, Harry Putnam wrote:
> >
> >> What to do with a status report like the one included below?
> >>
> >> What does it mean to have an unrecoverable error but no data errors?
> >
> > I think that this summary means that the zfs scrub did not encounter
> > any reported read/write errors from the disks, but on one of the
> > disks, 7 of the returned blocks had a computed checksum error.  This
> > could be a problem with the data that the disk previously
> > wrote. Perhaps there was an undetected data transfer error, the drive
> > firmware glitched, the drive experienced a cache memory glitch, or the
> > drive wrote/read data from the wrong track.
> >
> > If you clear the error information, make sure you keep a record of it
> > in case it happens again.
>
> Thanks.
>
> So its not a serious matter?  Or maybe more of a potentially serious
> matter?
>

Not really. That's exactly the kind of problem ZFS is designed to catch.


>
> Is there specific documentation somewhere that tells how to read these
> status reports?
>

Your pool is not degraded, so I don't think anything will show up in fmdump.

But check 'fmdump -eV' to see the actual error reports that were generated.
You might find something there.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moving drives around...

2010-03-23 Thread Giovanni Tirloni
On Tue, Mar 23, 2010 at 2:00 PM, Ray Van Dolson  wrote:

> Kind of a newbie question here -- or I haven't been able to find great
> search terms for this...
>
> Does ZFS recognize zpool members based on drive serial number or some
> other unique, drive-associated ID?  Or is it based off the drive's
> location (c0t0d0, etc).
>

ZFS makes use of on-disk labels and will detect your drives even if you
move them around.

You can check that with 'zdb -l /dev/rdsk/cXtXdXs0'



>
> I'm wondering because I have a zpool set up across a bunch of drives
> and I am planning to move those drives to another port on the
> controller potentially changing their location -- as well as the
> location of my "boot" zpool (two disks).
>
> Will ZFS detect this and be smart about it or do I need to do something
> like a zfs export ahead of time?  What about for the root pool?
>

No need. The same goes for the rpool; you only need to make sure your
system will boot from the correct disk.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-20 Thread Giovanni Tirloni
On Sat, Mar 20, 2010 at 4:07 PM, Svein Skogen  wrote:

> We all know that data corruption may happen, even on the most reliable of
> hardware. That's why zfs har pool scrubbing.
>
> Could we introduce a zpool option (as in zpool set  ) for
> "scrub period", in "number of hours" (with 0 being no automatic scrubbing).
>
> I see several modern raidcontrollers (such as the LSI Megaraid MFI line)
> has such features (called "patrol reads") already built into them. Why
> should zfs have the same? Having the zpool automagically handling this
> (probably a good thing to default it on 168 hours or one week) would also
> mean that the scrubbing feature is independent from cron, and since scrub
> already has lower priority than ... actual work, it really shouldn't annoy
> anybody (except those having their server under their bed).
>
> Of course I'm more than willing to stand corrected if someone can tell me
> where this is already implemented, or why it's not needed. Proper flames
> over this should start with a "warning, flame" header, so I can don my
> asbestos longjohns. ;)
>

That would add unnecessary code to the ZFS layer for something that cron can
handle in one line.
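
For example, a weekly scrub is a single crontab entry for root (day and
time here are just an illustration):

0 3 * * 0 /usr/sbin/zpool scrub tank

That runs at 03:00 every Sunday; replace "tank" with your pool name.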

Someone could hack zfs.c to automatically handle editing the crontab but I
don't know if it's worth the effort.

Are you worried that cron will fail or is it just an aesthetic requirement ?

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool I/O error

2010-03-19 Thread Giovanni Tirloni
On Fri, Mar 19, 2010 at 1:26 PM, Grant Lowe  wrote:

> Hi all,
>
> I'm trying to delete a zpool and when I do, I get this error:
>
> # zpool destroy oradata_fs1
> cannot open 'oradata_fs1': I/O error
> #
>
> The pools I have on this box look like this:
>
> #zpool list
> NAME  SIZE   USED  AVAILCAP  HEALTH  ALTROOT
> oradata_fs1   532G   119K   532G 0%  DEGRADED  -
> rpool 136G  28.6G   107G21%  ONLINE  -
> #
>
> Why can't I delete this pool? This is on Solaris 10 5/09 s10s_u7.
>

Please send the result of zpool status.

Your devices are probably all offline, but that shouldn't stop you from
destroying the pool, at least not on OpenSolaris.


-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] lazy zfs destroy

2010-03-18 Thread Giovanni Tirloni
On Thu, Mar 18, 2010 at 1:19 AM, Chris Paul wrote:

> OK I have a very large zfs snapshot I want to destroy. When I do this, the
> system nearly freezes during the zfs destroy. This is a Sun Fire X4600 with
> 128GB of memory. Now this may be more of a function of the IO device, but
> let's say I don't care that this zfs destroy finishes quickly. I actually
> don't care, as long as it finishes before I run out of disk space.
>
> So a suggestion for room for growth for the zfs suite is the ability to
> lazily destroy snapshots, such that the destroy goes to sleep if the cpu
> idle time falls under a certain percentage.
>

What build of OpenSolaris are you using ?

Is it nearly freezing during the whole process or just at the end ?

There was another thread where a similar issue was discussed a week ago.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Scrub not completing?

2010-03-17 Thread Giovanni Tirloni
On Wed, Mar 17, 2010 at 7:09 PM, Bill Sommerfeld  wrote:

> On 03/17/10 14:03, Ian Collins wrote:
>
>> I ran a scrub on a Solaris 10 update 8 system yesterday and it is 100%
>> done, but not complete:
>>
>>   scrub: scrub in progress for 23h57m, 100.00% done, 0h0m to go
>>
>
> Don't panic.  If "zpool iostat" still shows active reads from all disks in
> the pool, just step back and let it do its thing until it says the scrub is
> complete.
>
> There's a bug open on this:
>
> 6899970 scrub/resilver percent complete reporting in zpool status can be
> overly optimistic
>
> scrub/resilver progress reporting compares the number of blocks read so far
> to the number of blocks currently allocated in the pool.
>
> If blocks that have already been visited are freed and new blocks are
> allocated, the seen:allocated ratio is no longer an accurate estimate of how
> much more work is needed to complete the scrub.
>
> Before the scrub prefetch code went in, I would routinely see scrubs last
> 75 hours which had claimed to be "100.00% done" for over a day.


I've routinely seen that happen with resilvers on builds 126/127 on
raidz/raidz2. It reaches 100% and then stays in progress for as much as 50
hours at times. We just wait and let it do its work.

The bugs database doesn't show whether the developers have added comments
about that. Would you have access to check if resilvers were mentioned ?

BTW, since this bug only exists in the bug database, does it mean it was
filed by a Sun engineer or a customer ? What's the relationship between
that and the defect database ? I'm still trying to understand the flow of
information here, since both databases seem to be used exclusively for
OpenSolaris but one is less open.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems

2010-03-17 Thread Giovanni Tirloni
On Wed, Mar 17, 2010 at 11:23 AM,  wrote:

>
>
> >IMHO, what matters is that pretty much everything from the disk controller
> >to the CPU and network interface is advertised in power-of-2 terms and
> disks
> >sit alone using power-of-10. And students are taught that computers work
> >with bits and so everything is a power of 2.
>
> That is simply not true:
>
>Memory: power of 2(bytes)
>Network: power of 10  (bits/s))
>Disk: power of 10 (bytes)
>CPU Frequency: power of 10 (cycles/s)
>SD/Flash/..: power of 10 (bytes)
>Bus speed: power of 10
>
> Main memory is the odd one out.
>

My bad on generalizing that information.

Perhaps the software stack dealing with disks should be changed to use
power-of-10 as well. Unlikely, though.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to reserve space for a file on a zfs filesystem

2010-03-17 Thread Giovanni Tirloni
On Wed, Mar 17, 2010 at 6:43 AM, wensheng liu wrote:

> Hi all,
>
> How to reserve a space on a zfs filesystem? For mkfiel or dd will write
> data to the
> block, it is time consuming. whiel "mkfile -n" will not really hold the
> space.
> And zfs's set reservation only work on filesytem, not on file?
>
> Could anyone provide a solution for this?
>

Do you mean you want files created with "mkfile -n" to count against the
total filesystem usage ?

Since they have not allocated any blocks yet, ZFS would need to know about
each sparse file and read its metadata before enforcing the filesystem
reservation. I'm not sure it's doable.
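
To illustrate the problem (path and size are just an example): mkfile -n
only writes the metadata, so the file shows its full size in ls but
occupies almost nothing on disk until blocks are actually written:

# mkfile -n 1g /tank/fs/sparse.img
# ls -l /tank/fs/sparse.img      <- the size column shows the full 1GB
# du -k /tank/fs/sparse.img      <- only the few KB actually allocated

A reservation would have to account for that 1GB without any blocks backing
it, which is exactly the part I'm not sure ZFS can do per file.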

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems

2010-03-17 Thread Giovanni Tirloni
On Wed, Mar 17, 2010 at 9:34 AM, David Dyer-Bennet  wrote:

> On 3/16/2010 23:21, Erik Trimble wrote:
>
>> On 3/16/2010 8:29 PM, David Dyer-Bennet wrote:
>>
>>> On 3/16/2010 17:45, Erik Trimble wrote:
>>>
 David Dyer-Bennet wrote:

> On Tue, March 16, 2010 14:59, Erik Trimble wrote:
>
>  Has there been a consideration by anyone to do a class-action lawsuit
>> for false advertising on this?  I know they now have to include the
>> "1GB
>> = 1,000,000,000 bytes" thing in their specs and somewhere on the box,
>> but just because I say "1 L = 0.9 metric liters" somewhere on the box,
>> it shouldn't mean that I should be able to avertise in huge letters "2
>> L
>> bottle of Coke" on the outside of the package...
>>
>
> I think "giga" is formally defined as a prefix meaning 10^9; that is,
> the
> definition the disk manufacturers are using is the standard metric one
> and
> very probably the one most people expect.  There are international
> standards for these things.
>
> I'm well aware of the history of power-of-two block and disk sizes in
> computers (the first computers I worked with pre-dated that period);
> but I
> think we need to recognize that this is our own weird local usage of
> terminology, and that we can't expect the rest of the world to change
> to
> our way of doing things.
>

 That's RetConn-ing.  The only reason the stupid GiB / GB thing came
 around in the past couple of years is that the disk drive manufacturers
 pushed SI to do it.
 Up until 5 years ago (or so), GigaByte meant a power of 2 to EVERYONE,
 not just us techies.   I would hardly call 40+ years of using the various
 giga/mega/kilo  prefixes as a power of 2 in computer science as
 non-authoritative.  In fact, I would argue that the HD manufacturers don't
 have a leg to stand on - it's not like they were "outside" the field and
 used to the "standard" SI notation of powers of 10.  Nope. They're inside
 the industry, used the powers-of-2 for decades, then suddenly decided to
 "modify" that meaning, as it served their marketing purposes.

>>>
>>> The SI meaning was first proposed in the 1920s, so far as I can tell.
>>>  Our entire history of special usage took place while the SI definition was
>>> in place.  We simply mis-used it.  There was at the time no prefix for what
>>> we actually wanted (not giga then, but mega), so we borrowed and repurposed
>>> mega.
>>>
>>>  Doesn't matter whether the "original" meaning of K/M/G was a
>> power-of-10.  What matters is internal usage in the industry.  And that has
>> been consistent with powers-of-2 for 40+ years.  There has been NO outside
>> understanding that GB = 1 billion bytes until the Storage Industry decided
>> it wanted it that way.  That's pretty much the definition of distorted
>> advertising.
>>
>
> That's simply not true.  The first computer I programmed, an IBM 1620, was
> routinely referred to as having "20K" of core.  That meant 20,000 decimal
> digits; not 20,480.  The other two memory configurations were similarly
> "40K" for 40,000 and "60K" for 60,000.  The first computer I was *paid* for
> programming, the 1401, had "8K" of core, and that was 8,000 locations, not
> 8,192.  This was right on 40 years ago (fall of 1969 when I started working
> on the 1401).  Yes, neither was brand new, but IBM was still leasing them to
> customers (it came in configurations of 4k, 8k, 12k, and I think 16k; been a
> while!).


At this point in history it doesn't matter much who's right or wrong
anymore.

IMHO, what matters is that pretty much everything from the disk controller
to the CPU and network interface is advertised in power-of-2 terms and disks
sit alone using power-of-10. And students are taught that computers work
with bits and so everything is a power of 2.

Just last week I had to remind people that a 24-disk JBOD with 1TB disks
wouldn't provide 24TB of storage since disks show up as 931GB.
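
The arithmetic, for anyone who wants to double-check it:

$ echo 'scale=2; 10^12 / 2^30' | bc        # one "1TB" drive, in GiB
931.32
$ echo 'scale=1; 24 * 10^12 / 2^40' | bc   # the whole 24-disk JBOD, in TiB
21.8

So the "24TB" JBOD is really about 21.8TiB before any redundancy.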

It *is* an anomaly and I don't expect it to be fixed.

Perhaps some disk vendor could add more bits to its drives and advertise a
"real 1TB disk" using power-of-2 and show how people are being misled by
other vendors that use power-of-10. Highly unlikely but would sure get some
respect from the storage community.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] persistent L2ARC

2010-03-15 Thread Giovanni Tirloni
On Mon, Mar 15, 2010 at 5:39 PM, Abdullah Al-Dahlawi wrote:

> Greeting ALL
>
>
> I understand that L2ARC is still under enhancement. Does any one know if
> ZFS can be upgrades to include "Persistent L2ARC", ie. L2ARC will not loose
> its contents after system reboot ?
>

There is a bug open for that but it doesn't seem to have been implemented
yet.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6662467

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Hardware Failure Best Practices

2010-03-08 Thread Giovanni Tirloni
On Mon, Mar 8, 2010 at 2:00 PM, Chris Dunbar  wrote:

> Hello,
>
> I just found this list and am very excited that you all are here! I have a
> homemade ZFS server that serves as our poor man's Thumper (we named it
> thumpthis) and provides primarily NFS shares for our VMware environment. As
> is often the case, the server has developed a hardware problem mere days
> before I am ready to go live with a new replacement server (thumpthat). At
> first the problem appeared to be a bad drive, but now I am not so sure. I
> would like to sanity check my thought process with this list and see if
> anybody has some different ideas. Here is a quick timeline of the trouble:
>
> 1. I noticed the following when running a routine zpool status:
>
> 
>  mirrorDEGRADED 0 0 0
>c3t2d0  ONLINE   0 0 0
>c3t3d0  REMOVED  0  368K 0
> 
>
> 2. I determined which drive appeared to be offline by watching drive lights
> and then rebooted the server.
>
> 3. Initially the drive appeared to be fine and ZFS picked it backup and
> resilvered the mirror. About 30 minutes later I noticed that the same drive
> was again marked REMOVED.
>
> 4. I shut the server down and replaced the drives with a new, larger disk.
>
> 5. I ran zpool replace tank c3t3d0 and it happily went to work on the
> replacement drive. A few hours later the resilver was complete and all
> seemed well.
>
> 6. The next day, about 12 hours after installing the new drive I found the
> same error message (here's the whole pool):
>
> config:
>
>NAMESTATE READ WRITE CKSUM
>tankDEGRADED 0 0 0
>  mirrorONLINE   0 0 0
>c3t0d0  ONLINE   0 0 0
>c3t1d0  ONLINE   0 0 0
>  mirrorDEGRADED 0 0 0
>c3t2d0  ONLINE   0 0 0
>c3t3d0  REMOVED  0  370K 0
>  mirrorONLINE   0 0 0
>c4t0d0  ONLINE   0 0 0
>c4t1d0  ONLINE   0 0 0
>  mirrorONLINE   0 0 0
>c4t2d0  ONLINE   0 0 0
>c4t3d0  ONLINE   0 0 0
>
> errors: No known data errors
>
> This is where I am now. Either my new hard drive is bad (not impossible) or
> I am looking at some other hardware failure, possibly the AOC-SAT2-MV8
> controller card. I have a spare controller card (same make and model
> purchased at the same time we built the server) and plan to replace that
> tonight. Does that seem like the correct course of action? Are there any
> steps I can take beforehand to zero in on the problem? Any words of
> encouragement or wisdom?
>

What does `iostat -En` say ?

My suggestion is to replace the cable that's connecting the c3t3d0 disk.

IMHO, the cable is much more likely to be faulty than a single port on the
disk controller.

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can you manually trigger spares?

2010-03-08 Thread Giovanni Tirloni
On Mon, Mar 8, 2010 at 3:33 PM, Tim Cook  wrote:

> Is there a way to manually trigger a hot spare to kick in?  Mine doesn't
> appear to be doing so.  What happened is I exported a pool to reinstall
> solaris on this system.  When I went to re-import it, one of the drives
> refused to come back online.  So, the pool imported degraded, but it doesn't
> seem to want to use the hot spare... I've tried triggering a scrub to see if
> that would give it a kick, but no-go.


uts/common/fs/zfs/vdev.c says:

/*
 * If we fail to open a vdev during an import, we mark it as
 * "not available", which signifies that it was never there to
 * begin with.  Failure to open such a device is not considered
 * an error.
*/

If there is no error then the fault management code probably doesn't kick in
and autoreplace isn't triggered.



>
> r...@fserv:~$ zpool status
>   pool: fserv
>  state: DEGRADED
> status: One or more devices could not be opened.  Sufficient replicas exist
> for
> the pool to continue functioning in a degraded state.
> action: Attach the missing device and online it using 'zpool online'.
>see: http://www.sun.com/msg/ZFS-8000-2Q
>  scrub: scrub completed after 3h19m with 0 errors on Mon Mar  8 02:28:08
> 2010
> config:
>
> NAME  STATE READ WRITE CKSUM
> fserv DEGRADED 0 0 0
>   raidz2-0DEGRADED 0 0 0
> c2t0d0ONLINE   0 0 0
> c2t1d0ONLINE   0 0 0
> c2t2d0ONLINE   0 0 0
> c2t3d0ONLINE   0 0 0
> c2t4d0ONLINE   0 0 0
> c2t5d0ONLINE   0 0 0
> c3t0d0ONLINE   0 0 0
> c3t1d0ONLINE   0 0 0
> c3t2d0ONLINE   0 0 0
> c3t3d0ONLINE   0 0 0
> c3t4d0ONLINE   0 0 0
> 12589257915302950264  UNAVAIL  0 0 0  was
> /dev/dsk/c7t5d0s0
> spares
>   c3t6d0  AVAIL
>

That crazy device name is the vdev GUID (you can see it with e.g. zdb -l
/dev/rdsk/c3t1d0s0).

I was able to replicate your situation here.

# uname -a
SunOS osol-dev 5.11 snv_133 i86pc i386 i86pc Solaris

# zpool status tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

NAMESTATE READ WRITE CKSUM
tankONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c6t0d0  ONLINE   0 0 0
c6t1d0  ONLINE   0 0 0
cache
  c6t2d0ONLINE   0 0 0
spares
  c6t3d0AVAIL

errors: No known data errors

# zpool export tank



# zpool import tank

# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist
for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
  mirror-0   DEGRADED 0 0 0
6462738093222634405  UNAVAIL  0 0 0  was
/dev/dsk/c6t0d0s0
c6t1d0   ONLINE   0 0 0
cache
  c6t2d0 ONLINE   0 0 0
spares
  c6t3d0 AVAIL

errors: No known data errors

# zpool get autoreplace tank
NAME  PROPERTY VALUESOURCE
tank  autoreplace  on   local

# fmdump -e -t 08Mar2010
TIME CLASS

As you can see, no error report was posted. You can try to import the pool
again and see if `fmdump -e` lists any errors afterwards.

You can bring in the spare manually with `zpool replace`.
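
In your case that would be something along these lines (using the GUID that
zpool status printed for the missing disk):

# zpool replace fserv 12589257915302950264 c3t6d0

That should kick off a resilver onto c3t6d0.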

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-05 Thread Giovanni Tirloni
On Fri, Mar 5, 2010 at 7:41 AM, Abdullah Al-Dahlawi wrote:

> Hi Geovanni
>
> I was monitering the ssd cache using zpool iostat -v like you said. the
> cache device within the pool was showing a persistent write IOPS during the
> ten (1GB) file creation phase by the benchmark.
>
> The benchmark even gave an insufficient space and terminated which proves
> that it was writing on the ssd cache (my HDD is 50GB free space) 
>

The L2ARC cache is not accessible to end user applications. It's only used
for reads that miss the ARC and it's managed internally by ZFS.

I can't comment on the specifics of how ZFS evicts objects from ARC to L2ARC
but that should never give you insufficient space errors.

Your data is not being stored on the cache device; the writes you see on
the SSD are ZFS moving objects from the ARC to the L2ARC. It has to write
data there, otherwise there is nothing to read back later when a read()
misses the ARC and checks the L2ARC.

I don't know what your OLTP benchmark does, but my advice is to check
whether it's really writing its files under the 'hdd' zpool's mount point.
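
If you want to see what the L2ARC is actually doing (as opposed to where
your files live), the ARC kstats are more informative than zpool iostat;
something like:

# kstat -p zfs:0:arcstats | grep l2_

l2_size shows how much data has been cached so far, and l2_hits/l2_misses
tell you whether reads are actually being served from the SSD.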

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why L2ARC device is used to store files ?

2010-03-05 Thread Giovanni Tirloni
On Fri, Mar 5, 2010 at 6:46 AM, Abdullah Al-Dahlawi wrote:

> Greeting All
>
> I have create a pool that consists oh a hard disk and a ssd as a cache
>
> zpool create hdd c11t0d0p3
> zpool add hdd cache c8t0d0p0 - cache device
>
> I ran an OLTP bench mark to emulate a DMBS
>
> One I ran the benchmark, the pool started create the database file on the
> ssd cache device ???
>
>
> can any one explain why this happening ?
>
> is not L2ARC is used to absorb the evicted data from ARC ?
>
> why it is used this way ???
>
>
Hello Abdullah,

 I don't think I understand. How are you seeing files being created on the
SSD disk ?

 You can check device usage with `zpool iostat -v hdd`. Please also send the
output of `zpool status hdd`.

Thank you,

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshot recycle freezes system activity

2010-03-04 Thread Giovanni Tirloni
On Thu, Mar 4, 2010 at 7:28 PM, Ian Collins  wrote:

> Gary Mills wrote:
>
>> We have an IMAP e-mail server running on a Solaris 10 10/09 system.
>> It uses six ZFS filesystems built on a single zpool with 14 daily
>> snapshots.  Every day at 11:56, a cron command destroys the oldest
>> snapshots and creates new ones, both recursively.  For about four
>> minutes thereafter, the load average drops and I/O to the disk devices
>> drops to almost zero.  Then, the load average shoots up to about ten
>> times normal and then declines to normal over about four minutes, as
>> disk activity resumes.  The statistics return to their normal state
>> about ten minutes after the cron command runs.
>>
>> Is it destroying old snapshots or creating new ones that causes this
>> dead time?  What does each of these procedures do that could affect
>> the system?  What can I do to make this less visible to users?
>>
>>
>>
> I have a couple of Solaris 10 boxes that do something similar (hourly
> snaps) and I've never seen any lag in creating and destroying snapshots.
>  One system with 16 filesystems takes 5 seconds to destroy the 16 oldest
> snaps and create 5 recursive new ones.  I logged load average on these boxes
> and there is a small spike on the hour, but this is down to sending the
> snaps, not creating them.
>

We've seen the behaviour that Gary describes while destroying datasets
recursively (>600GB and with 7 snapshots). It seems that close to the end
the server stalls for 10-15 minutes and NFS activity stops. For small
datasets/snapshots that doesn't happen or is harder to notice.

Does ZFS have to do something special when it's done releasing the data
blocks at the end of the destroy operation ?

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?

2010-03-04 Thread Giovanni Tirloni
On Thu, Mar 4, 2010 at 4:40 PM, zfs ml  wrote:

> On 3/4/10 9:17 AM, Brent Jones wrote:
>
>> My rep says "Use dedupe at your own risk at this time".
>>
>> Guess they've been seeing a lot of issues, and regardless if its
>> 'supported' or not, he said not to use it.
>>
>
> So its not a feature, its a bug. They should release some official
> statement if they are going to have the sales reps saying that. Either it
> works or it doesn't and if it doesn't, then all parts of Oracle should be
> saying the same thing, not just after they have your money (oh btw, that
> dedup thing...).
>
> As discussed in a couple other threads, if Oracle wants to treat the
> fishworks boxes like closed appliances, then it should "just work" and if it
> doesn't then it should be treated like a toaster that doesn't work and they
> should take it back. They seem to want to sell them with the benefits of
> being closed for them - you shouldn't use the command line, etc but then act
> like your unique workload/environment is somehow causing them to break when
> they break. If they seal the box and put 5 knobs on the outside, don't blame
> the customer when they turn all the knobs to 10 and the box doesn't work.
> Take the box back, remove the knobs or fix the guts so all the knobs work as
> advertised.


It seems they kind of rushed the appliance into the market. We have a few
7410s and replication (with zfs send/receive) doesn't work after shares
reach ~1TB (broken pipe error). It's frustrating and we can't do anything
about it, because every time we type "shell" in the CLI it scares us off
with a message saying the warranty will be voided if we continue. I bet we
could work around that bug, but we're not allowed to, and the workarounds
provided by Sun haven't worked.

Regarding dedup, Oracle is very courageous for including it in the 2010.Q1
release, if that turns out to be true. But I understand the pressure on
them: every other vendor out there is releasing products with
deduplication. Personally, I would wait 2-3 releases before using it in a
black box like the 7000s.

The hardware, on the other hand, is incredible in terms of resilience and
performance, no doubt, which makes the pretty but closed interface feel
like an annoyance at times. Let's wait for 2010.Q1 :)

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Huge difference in reporting disk usage via du and zfs list. Fragmentation?

2010-03-04 Thread Giovanni Tirloni
On Thu, Mar 4, 2010 at 10:52 AM, Holger Isenberg wrote:

> Do we have enormous fragmentation here on our X4500 with Solaris 10, ZFS
> Version 10?
>
> What except zfs send/receive can be done to free the fragmented space?
>
> One ZFS was used for some month to store some large disk images (each
> 50GByte large) which are copied there with rsync. This ZFS then reports
> 6.39TByte usage with zfs list and only 2TByte usage with du.
>
> The other ZFS was used for similar sized disk images, this time copied via
> NFS as whole files. On this ZFS du and zfs report exactly the same usage of
> 3.7TByte.
>


Please check the ZFS FAQ:
http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq

There is a question regarding the difference between du, df and zfs list.

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirror Stripe

2010-03-01 Thread Giovanni Tirloni
On Mon, Mar 1, 2010 at 1:16 PM, Tony MacDoodle  wrote:

> What is the following syntax?
>
> *zpool create tank mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0 spare c1t6d0
> *
> *
> *
> Is this RAID 0+1 or 1+0?*
> *
>

That's RAID1+0. You are mirroring devices and then striping the mirrors
together.

AFAIK, RAID0+1 is not supported, since a vdev can only be of type disk,
mirror or raidz, and all top-level vdevs are striped together. Someone more
experienced in ZFS can probably confirm/deny this.
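
As a rough sketch of the layout that command creates (not literal zpool
status output):

tank
  mirror   c1t2d0  c1t3d0
  mirror   c1t4d0  c1t5d0
  spare    c1t6d0

Writes are striped across the two mirror vdevs, which is why it behaves as
RAID1+0 and not RAID0+1.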


-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] device mixed-up while tying to import.

2010-02-28 Thread Giovanni Tirloni
On Sat, Feb 27, 2010 at 6:21 PM, Yariv Graf  wrote:

>
> Hi,
> It seems I can't import a single external HDD.
>
> pool: HD
> id: 8012429942861870778
>  state: UNAVAIL
> status: One or more devices are missing from the system.
> action: The pool cannot be imported. Attach the missing
> devices and try again.
>see: http://www.sun.com/msg/ZFS-8000-6X
> config:
>
> HD  UNAVAIL  missing device
>   c16t0d0   ONLINE
>

You're probably missing the device that was used as a slog in this pool.
Try to re-attach that device and import the pool again.

Right now ZFS cannot import a pool in that state but it's being worked on,
according to Eric Schrock on Feb 6th.

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] copies=2 and dedup

2010-02-27 Thread Giovanni Tirloni
On Sat, Feb 27, 2010 at 10:40 AM, Dick Hoogendijk  wrote:

> I want zfs on a single drive so I use copies=2 for -some- extra safety. But
> I wonder if dedup=on could mean something in this case too? That way the
> same blocks would never be written more than twice. Or would that harm the
> reliability of the drive and should I just use copies=2?
>

ZFS will honor copies=2 and keep two physical copies, even with
deduplication enabled.
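
Both are ordinary per-dataset properties, so you can combine them, e.g.
(the dataset name is just an example):

# zfs set copies=2 tank/data
# zfs set dedup=on tank/data

Keep in mind that copies (and dedup) only apply to data written after the
property is set.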

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Slowing down "zfs destroy"

2010-02-26 Thread Giovanni Tirloni
Hello,

 While destroying a dataset, sometimes ZFS kind of hangs the machine. I
imagine it's starving all I/O while deleting the blocks, right ?

 Here logbias=latency, the commit interval is the default (30 seconds) and
we have SSDs for logs and cache.

 Is there a way to "slow down" the destroy a little bit in order to reserve
I/O for the NFS clients ? Degraded performance isn't as bad as a total loss
of availability in our case.

 I was thinking we could set logbias=throughput and decrease the commit
interval to 10 seconds to keep it running more smoothly.
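
 Concretely, something like this is what I have in mind (the txg tunable
name is from memory, so please correct me if it has changed in recent
builds):

# zfs set logbias=throughput trunk
# echo zfs_txg_timeout/W0t10 | mdb -kw

 The second command would drop the commit interval from 30 to 10 seconds on
the running kernel.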

 Here's the pool configuration. Note the two slog devices: they were
supposed to be a mirror but were added as separate devices by mistake.

NAME STATE READ WRITE CKSUM
trunk  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t4d0   ONLINE   0 0 0
c7t5d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t6d0   ONLINE   0 0 0
c7t7d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t8d0   ONLINE   0 0 0
c7t9d0   ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t10d0  ONLINE   0 0 0
c7t11d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t12d0  ONLINE   0 0 0
c7t13d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t14d0  ONLINE   0 0 0
c7t15d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t16d0  ONLINE   0 0 0
c7t17d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t18d0  ONLINE   0 0 0
c7t19d0  ONLINE   0 0 0
  mirror ONLINE   0 0 0
c7t20d0  ONLINE   0 0 0
c7t21d0  ONLINE   0 0 0
logs ONLINE   0 0 0
  c7t1d0 ONLINE   0 0 0
  c7t2d0 ONLINE   0 0 0
cache
  c7t22d0ONLINE   0 0 0
spares
  c7t3d0 AVAIL


 Any ideas?

Thank you,

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS replace - many to one

2010-02-25 Thread Giovanni Tirloni
On Thu, Feb 25, 2010 at 12:44 PM, Chad  wrote:

> I'm looking to migrate a pool from using multiple smaller LUNs to one
> larger LUN. I don't see a way to do a zpool replace for multiple to one.
> Anybody know how to do this? It needs to be non disruptive.
>

As others have noted, it doesn't seem possible.

You could create a new zpool with this larger LUN and use zfs send/receive
to migrate your data.
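
A minimal sketch of that approach (pool and device names are placeholders;
check zfs(1M) for the receive options, -d in particular, to control how the
datasets get laid out under the new pool):

# zpool create newpool c5t0d0            (c5t0d0 being the new large LUN)
# zfs snapshot -r oldpool@migrate
# zfs send -R oldpool@migrate | zfs receive -d -F newpool

Since you need this to be non-disruptive, plan on a final incremental send
and a short cutover window; there is no way to make the switch completely
transparent with this method.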

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris

2010-02-25 Thread Giovanni Tirloni
On Thu, Feb 25, 2010 at 9:47 AM, Jacob Ritorto wrote:

> It's a kind gesture to say it'll continue to exist and all, but
> without commercial support from the manufacturer, it's relegated to
> hobbyist curiosity status for us.  If I even mentioned using an
> unsupported operating system to the higherups here, it'd be considered
> absurd.  I like free stuff to fool around with in my copious spare
> time as much as the next guy, don't get me wrong, but that's not the
> issue.  For my company, no support contract equals 'Death of
> OpenSolaris.'
>

OpenSolaris is not dying just because there is no support contract available
for it, yet.

Last time I looked Red Hat didn't offer support contracts for Fedora and
that project is doing quite well.

So please be a little more realistic and say "For my company, no support
contracts for OpenSolaris means that we will not use it in our
mission-critical servers". That's much more reasonable than saying the whole
project is jeopardized.

It's useless to try to decide your strategy right now when things are
changing. Wait for some official word from Oracle and then decide what your
company is going to do. You can always install Solaris if that makes sense
for you.

-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with sudden zfs capacity loss on snv_79a

2010-02-18 Thread Giovanni Tirloni
On Thu, Feb 18, 2010 at 1:19 AM, Julius Roberts wrote:

> Yes snv_79a is old, yes we're working separately on migrating to
> snv_111b or later.  But i need to solve this problem ASAP to buy me
> some more time for that implementation.
>
> We pull data from a variety of sources onto our zpool called Backups,
> then we snapshot them.  We keep around 20 or so and then delete them
> automatically.  We've been doing this for around two years on this
> system and it's been absolutely fantastic.  Free-space hovers around
> 300G.  But suddenly something has changed:
>
> r...@darling(/)$:zfs list | head -1 && zfs list | tail -7
> NAME
>USED  AVAIL  REFER  MOUNTPOINT
> Backups/natoffice/ons...@20091231_2347_triggeredby_20091231_2330
>   30.1G  -   287G  -
> Backups/natoffice/ons...@20100131_2349_triggeredby_20100131_2330
>   17.7G  -   287G  -
> Backups/natoffice/ons...@20100205_0001_triggeredby_20100204_2330
>   15.9G  -   287G  -
> Backups/natoffice/ons...@20100212_0424_triggeredby_20100211_2330
>   152G  -   285G  -
> Backups/natoffice/ons...@20100216_0430_triggeredby_20100215_2330
>   154G  -   287G  -
> Backups/natoffice/ons...@20100217_0431_triggeredby_20100216_2330
>   154G  -   287G  -
> Backups/natoffice/ons...@20100218_0423_triggeredby_20100217_2330
>   0  -   287G  -
>
> Normally a snapshot shows USED around 15G to 30G.  But suddenly,
> snapshots of the same filesystem are showing USED ~150G.  There are no
> corresponding increases in any of the machines we copy data from, nor
> has any of that data changed significantly.  You can see that the
> REFER hasn't changed much at all, this is normal.  So we're backing up
> the same amount of data, but it now occupies so much more on disk.
> That of course means we can't keep nearly as many snapshots, and that
> makes us all very nervous.
>
> Any ideas?
>


Is it possible that your users (or the backup jobs) are now deleting or
rewriting everything before writing the backup data ? Rewriting whole files
would make each snapshot hold on to all of the replaced blocks, which would
explain the jump in USED while REFER stays the same.


-- 
Giovanni Tirloni
sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-10 Thread Giovanni Tirloni
On Tue, Feb 9, 2010 at 2:04 AM, Thomas Burgess  wrote:

>
> On Mon, Feb 08, 2010 at 09:33:12PM -0500, Thomas Burgess wrote:
>> > This is a far cry from an apples to apples comparison though.
>>
>> As much as I'm no fan of Apple, it's a pity they dropped ZFS because
>> that would have brought considerable attention to the opportunity of
>> marketing and offering zfs-suitable hardware to the consumer arena.
>> Port-multiplier boxes already seem to be targetted most at the Apple
>> crowd, even it's only in hope of scoring a better margin.
>>
>> Otherwise, bad analogies, whether about cars or fruit, don't help.
>>
>>
> It might help people to understand how ridiculous they sound going on and
> on about buying a premium storage appliance without any storage.  I think
> the car analogy was dead on.  You don't have to agree with a vendors
> practices to understand them.  If you have a more fitting analogy, then by
> all means lets hear it.
>


Dell joins the party:
http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-07 Thread Giovanni Tirloni
On Tue, Feb 2, 2010 at 9:07 PM, Marc Nicholas  wrote:

> I believe magical unicorn controllers and drives are both bug-free and
> 100% spec compliant. The leprichorns sell them if you're trying to
> find them ;)
>

Well, "perfect" and "bug free" sure don't exist in our industry.

The problem is that we see disk firmware that is stupidly flawed, and the
revisions that get released aren't making it any better. Otherwise people
looking for quality would not have to spend extra on third-party-reviewed
drives from storage vendors.

It's all a bit too convenient how the industry is organized: convenient for
the disk and storage vendors, that is, not for customers.

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives

2010-02-07 Thread Giovanni Tirloni
On Tue, Feb 2, 2010 at 1:58 PM, Tim Cook  wrote:

>
> It's called spreading the costs around.  Would you really rather pay 10x
> the price on everything else besides the drives?  This is essentially Sun's
> way of tiered pricing.  Rather than charge you a software fee based on how
> much storage you have, they increase the price of the drives.  Seems fairly
> reasonable to me... it gives a low point of entry for people that don't need
> that much storage without using ridiculous capacity based licensing on
> software.
>


Smells like the Razor and Blades business model [1].

I think the industry is in a sad state when you buy enterprise-level drives
that don't work as expected (see the thread about TLER settings on WD
enterprise drives) and you have to spend extra on drives that were reviewed
by a third party (Sun/EMC/etc). It just shows how bad the disk vendors are.

I would be curious to know how the internal process of testing these drives
works at Sun/EMC/etc when they find bugs and performance problems. Do they
have access to the firmware's source code to fix it ? Or do they report the
bugs back to Seagate/WD, who then provide new firmware for testing ? Do
those bugs get fixed in the other drives that Seagate/WD sell ?

For me it's just hard to objectively point out the differences between
Seagate's enterprise drives and the ones provided by Sun, other than the
extra testing.

1 - http://en.wikipedia.org/wiki/Freebie_marketing

-- 
Giovanni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2010-01-04 Thread Giovanni Tirloni
On Mon, Jan 4, 2010 at 3:51 PM, Joerg Schilling
 wrote:
> Giovanni Tirloni  wrote:
>
>> We use Seagate Barracuda ES.2 1TB disks and every time the OS starts
>> to bang on a region of the disk with bad blocks (which essentially
>> degrades the performance of the whole pool) we get a call from our
>> clients complaining about NFS timeouts. They usually last for 5
>> minutes but I've seen it last for a whole hour while the drive is
>> slowly dying. Off-lining the faulty disk fixes it.
>>
>> I'm trying to find out how the disks' firmware is programmed
>> (timeouts, retries, etc) but so far nothing in the official docs. In
>> this case the disk's retry timeout seem way too high for our needs and
>> I believe a timeout limit imposed by the OS would help.
>
> Did you upgrade the firmware last spring?
>
> There is a known bug in the firmware that may let them go into alzheimer mode.

No, as their "serial number check utility" was not returning any
upgrades for the disks I checked... but now I see in the forums that
they released some new versions.

Thanks for the heads up. I'll give it a try and hopefully we can see
some improvement here.

-- 
Giovanni P. Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hard drive choice, TLER/ERC/CCTL

2010-01-04 Thread Giovanni Tirloni
On Sat, Jan 2, 2010 at 4:07 PM, R.G. Keen  wrote:
> OK. From the above suppositions, if we had a desktop (infinitely
> long retry on fail) disk and a soft-fail error in a sector, then the
> disk would effectively hang each time the sector was accessed.
> This would lead to
> (1) ZFS->SD-> disk read of failing sector
> (2) disk does not reply within 60 seconds (default)
> (3) disk is reset by SD
> (4) operation is retried by SD(?)
> (5) disk does not reply within 60 seconds (default)
> (6) disk is reset by SD ?
>
> then what? If I'm reading you correctly, the following string of
> events happens:
>
>> The drivers will retry and fail the I/O. By default, for SATA
>> disks using the sd driver, there are 5 retries of 60 seconds.
>> After 5 minutes, the I/O will be declared failed and that info
>> is passed back up the stack to ZFS, which will start its
>> recovery.  This is why the T part of N in T doesn't work so
>> well for the TLER case.
>
> Hmmm... actually, it may be just fine for my personal wants.
> If I had a desktop drive which went unresponsive for 60 seconds
> on an I/O soft error, then the timeout would be five minutes.
> at that time, zfs would... check me here... mark the block as
> failed, and try to relocate the block on the disk. If that worked
> fine, the previous sectors would be marked as unusable, and
> work goes on, but with the actions noted in the logs.

We use Seagate Barracuda ES.2 1TB disks and every time the OS starts
to bang on a region of the disk with bad blocks (which essentially
degrades the performance of the whole pool) we get a call from our
clients complaining about NFS timeouts. The timeouts usually last for 5
minutes, but I've seen them last for a whole hour while the drive is
slowly dying. Off-lining the faulty disk fixes it.

I'm trying to find out how the disks' firmware is programmed (timeouts,
retries, etc.) but so far I've found nothing in the official docs. In this
case the disk's retry timeout seems way too high for our needs and I
believe a timeout limit imposed by the OS would help.
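
For what it's worth, the sd driver's per-command timeout is tunable; the
default is 60 seconds per attempt and it can be lowered in /etc/system
(needs a reboot, and I haven't verified how it interacts with every HBA
driver, so treat it as something to test first):

* /etc/system: fail I/Os after 10 seconds instead of 60
set sd:sd_io_time=10

That only shortens each attempt, though; the retry count on top of it is
what turns one bad region into minutes of stalled I/O.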

-- 
Giovanni P. Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss