Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Sanjeev
Lutz,

On Mon, Jan 11, 2010 at 09:38:16PM -0800, Lutz Schumann wrote:
> Since you mention the fixed bugs, I have a more general question.
> 
> Is there a way to see all commits to OSOL that are related to a bug report?

You can go to src.opensolaris.org, enter the bug-id in the history field,
select the ON gate, and search. That should list all the files that were modified by
the fix for that bug.

Now for each file you can go to the history and get a diff between the version where
the fix was integrated and the previous version.
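
Alternatively, if you keep a local Mercurial clone of the onnv-gate, something
like this should find the changeset(s) whose comment mentions the bug id and
show the full diff (a rough sketch; 6844191 is just an example bug id taken
from elsewhere in this thread):

# search changeset comments (and file names) for the bug id
hg log -k 6844191
# then show the complete diff of a changeset it reports
hg diff -c <changeset-id>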

Hope that helps.

Regards,
Sanjeev
-- 

Sanjeev Bagewadi
Solaris RPE 
Bangalore, India
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] internal backup power supplies?

2010-01-11 Thread Lutz Schumann
Actually, for the ZIL you can use the a-card (memory SATA disk + BBU + compact
flash write-out).

For the data disks there is no solution yet - it would be nice.  However, I prefer
the "supercapacitor on disk" method.

Why? Because the recharge logic is challenging. There needs to be
communication between the disk and the power supply. The interesting cases are
"fluctuating power" (see below) and battery maintenance.

If the battery is charged everything runs fine, but the corner cases are tricky.

Imagine the following scenario:

1) Operations: Normal
2) Power outage: 1 hour 
3) UPS fails after 30 minutes
4) Power comes back
5) ALL servers power on at the same time (e.g. misconfiguration)
6) Peak load -> power goes down again

At 3) your batteries are empty.
At 6) your batteries are not fully charged; however, because the device does not
know the "status" of the local UPS, the write cache is still enabled.

Thus a simple design does not solve the problem well enough.

Another thing is maintenance of a battery. You have to check that your battery
still works (charge cycle). You have to raise an alarm if it does not (monitoring).
You have to replace batteries online. So in general - batteries are bad if your server
lives longer than 3 years :)

For Google it works fine, maybe because the server will live < 3 years anyway
and because they can "just replace" the server thanks to their internal redundancy
options (the Google backend technology is designed to handle failure well).  For a
storage system I don't see that.

The BBU / capacitor needs to implement the same logic a RAID BBU implements.

if (not_working_fully(BBU)) {
  disable_write_cache();
} else {
  enable_write_cache();
}

Or better (explicit state whitelisting, guaranteeing data integrity also for
unexpected states):

if (working_fully(BBU)) {
  enable_write_cache();
} else {
  disable_write_cache();
}

P.S. While writing this I'm wondering whether the a-card handles this case well ...
maybe not.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Lutz Schumann
> > .. however ... a lot of snaps still have a impact
> on system performance. After the import of the 1
> snaps volume, I saw "devfsadm" eating up all CPU: 
> 
> If you are snapshotting ZFS volumes, then each will
> create an entry in the
> device tree. In other words, if these were file
> systems instead of volumes, you
> would not see devfsadm so busy.

Ok, nice to know that. For our use case we are focused on zvols (COMSTAR iSCSI
to virtualized hosts). And it still works fine for a reasonable number of zvols
with a nice backup/snapshot cycle (~ 12 (5min) + 24 (1h) + 7 (daily) + 4
(weekly) + 12 (monthly) + 1 for each year -> ~60 snaps for each zvol).
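
As a sketch, one slot of such a rolling cycle only needs a handful of zfs
commands (here the 5-minute slot for a single zvol; the dataset name and
retention count are illustrative, and a cron job would drive it):

#!/bin/ksh
# take a new 5-minute snapshot and keep only the 12 most recent ones
zvol=ssd/vol1
keep=12
zfs snapshot "$zvol@5min-$(date +%Y%m%d-%H%M)"
# list this zvol's 5-minute snapshots oldest-first, destroy the excess
zfs list -H -t snapshot -o name -s creation | grep "^$zvol@5min-" > /tmp/snaps.$$
excess=$(( $(wc -l < /tmp/snaps.$$) - keep ))
if [ "$excess" -gt 0 ]; then
    head -n "$excess" /tmp/snaps.$$ | while read snap; do
        zfs destroy "$snap"
    done
fi
rm -f /tmp/snaps.$$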

> > .. another strange issue: 
> > r...@osol_dev130:/dev/zvol/dsk/ssd# ls -al
> > 
> > load averages:  1.16,  2.17,  1.50;
>   up 0+00:22:0718:54:00
> : 97 sleeping, 2 on cpu
> > CPU states: 49.1% idle,  0.1% user, 50.8% kernel,
> I don't see the issue, could you elaborate?

How? (I know how to "truss", but I'm not so familiar with debugging at the
kernel level :)

> > .. so having a 1 snaps of a single zvol is not
> nice :)
> 
> AIUI, devfsadm creates a database. Could you try the
> last experiment again,

Will try, but have no access to test equipment right now. 

Thanks for the feedback.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Lutz Schumann
Since you mention the fixed bugs, I have a more general question.

Is there a way to see all commits to OSOL that are related to a bug report?

Background: I'm interested in how e.g. the zfs import bug was fixed.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] internal backup power supplies?

2010-01-11 Thread Daniel Carosone
> [google server with batteries]

These are cool, and a clever rethink of the typical data centre power
supply paradigm.  They keep the server running, until either a
generator is started or a graceful shutdown can be done.

Just to be clear, I'm talking about something much smaller, that
provides power only for drives, for a few moments after the host
powers down (for whatever reason) to let the drives sync their caches
safely.   

Basically, just wrapping the drive with the supercap (or equivalent)
the manufacturer didn't include, plus whatever minimal power supply
circuitry is needed (to avoid big inrush recharge currents on startup,
to avoid sending power back out into the rest of the case, etc). 

Because there's no integration for an emergency "sync now!" signal,
we have to rely on timeouts and wait "long enough" for the cache to be
sync'ed. It might be larger and need to hold longer than an on-board
supercap, but not very long in absolute terms.

There seems to be lots of room for a comfortable niche in the gap
between common commodity hardware (that would be plenty good enough
otherwise) and the $5k F20's and LogZilla's and similar.

--
Dan.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] opensolaris-vmware

2010-01-11 Thread Tim Cook
On Mon, Jan 11, 2010 at 6:17 PM, Greg  wrote:

> Hello All,
> I hope this makes sense. I have two OpenSolaris machines with a bunch of
> hard disks; one acts as an iSCSI SAN, and the other is identical other than
> the hard disk configuration. The only thing being served are VMware ESXi raw
> disks, which hold either virtual machines or data that the particular
> virtual machine uses, i.e. we have Exchange 2007 virtualized and through its
> iSCSI initiator we are mounting two LUNs, one for the database and another
> for the logs, all on different arrays of course. Anyhow, we are then
> snapshotting this data across the SAN network to the other box using
> snapshot send/recv. In case the other box fails, this box can immediately
> serve all of the iSCSI LUNs. The problem, and I don't really know if it's a
> problem, is: when I snapshot a running VM, will it come up alive in ESXi, or
> do I have to accomplish this in a different way? These snapshots will then
> be written to tape with bacula. I hope I am posting this in the correct
> place.
>
> Thanks,
> Greg
> --
>
>
What you've got are crash-consistent snapshots.  The disks are in the same
state they would be in if you pulled the power plug.  They may come up just
fine, or they may be in a corrupt state.  If you take snapshots frequently
enough, you should have at least one good snapshot.  Your other option is
scripting.  You can build custom scripts to leverage the VSS providers in
Windows... but it won't be easy.
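
For what it's worth, the frequent-snapshot approach can be driven by something
as small as this (a rough sketch, not your actual setup: the dataset name,
snapshot naming, and the ssh transport to the standby box are all placeholders):

#!/bin/ksh
# snapshot the zvol and send only the delta since the previous snapshot
ds=tank/exchange-logs
prev=$(zfs list -H -t snapshot -o name -s creation | grep "^$ds@" | tail -1)
now="$ds@repl-$(date +%Y%m%d-%H%M)"
zfs snapshot "$now"
if [ -n "$prev" ]; then
    zfs send -i "$prev" "$now" | ssh standby zfs recv -F "$ds"
else
    # first run: full send
    zfs send "$now" | ssh standby zfs recv -F "$ds"
fi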

Any reason in particular you're using iSCSI?  I've found NFS to be much simpler
to manage, and performance to be equivalent if not better (in large
clusters).

-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly

2010-01-11 Thread Eric Schrock

On Jan 11, 2010, at 6:35 PM, Paul B. Henson wrote:

> On Mon, 11 Jan 2010, Eric Schrock wrote:
> 
>> No, there is no way to tell if a pool has DTL (dirty time log) entries.
> 
> Hmm, I hadn't heard that term before, but based on a quick search I take it
> that's the list of data in the pool that is not fully redundant? So if a
> 2-way mirror vdev lost a half, everything written after the loss would be
> on the DTL, and if the same device came back, recovery would entail just
> running through the DTL and writing out what it missed? Although presumably
> if the failed device was replaced with another device entirely all of the
> data would need to be written out.
> 
> I'm not quite sure that answered my question. My original question was, for
> example, given a 2-way mirror, one half fails. There is a hot spare
> available, which is pulled in, and while the pool isn't optimal, it does
> have the same number of devices that it's supposed to. On the other hand,
> the same mirror loses a device, there's no hot spare, and the pool is short
> one device. My understanding is that in both scenarios the pool status
> would be "DEGRADED", but it seems there's an important difference. In the
> first case, another device could fail, and the pool would still be ok. In
> the second, another device failing would result in complete loss of data.
> 
> While you can tell the difference between these two different states by
> looking at the detailed output and seeing if a hot spare is in use, I was
> just saying that it would be nice for the short status to have some
> distinction between "device failed, hot spare in use" and "device failed,
> keep fingers crossed" ;).
> 
> Back to your answer, if the existence of DTL entries means the pool doesn't
> have full redundancy for some data, and you can't tell if a pool has DTL
> entries, are you saying there's no way to tell if the current state of your
> pool could survive a device failure? If a resilver successfully completes,
> barring another device failure, doesn't that mean the pool is restored to
> full redundancy? I feel like I must be misunderstanding something :(.

DTLs are a more specific answer to your question.  It implies that a toplevel
vdev has a known time when there is invalid data for it or one of its children.
This may be because a device failed and is accumulating DTL time, because a new
replacing or spare vdev was attached, or because a device was unplugged and
then plugged back in.  Your example (hot spares) is but one of the ways in
which this can happen, but in any of the cases it implies that data is not
fully replicated.

There is obviously a way to detect this in the kernel, it's simply not exported 
to userland in any useful way.  The reason I focused on DTLs is that if any 
mechanism were provided to distinguish a pool lacking full redundancy, it would 
be based on DTLs - nothing else makes sense.
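
In the meantime, a rough userland approximation (only a heuristic built on
zpool status output, nothing like the DTL-based answer above) might be:

#!/bin/ksh
# flag DEGRADED pools and note whether a hot spare is currently in use
zpool list -H -o name,health | while read pool health; do
    [ "$health" = "DEGRADED" ] || continue
    if zpool status "$pool" | grep INUSE > /dev/null; then
        echo "$pool: DEGRADED, hot spare in use"
    else
        echo "$pool: DEGRADED, no spare attached - redundancy reduced"
    fi
done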

- Eric

> 
> Thanks...
> 
> 
> -- 
> Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
> Operating Systems and Network Analyst  |  hen...@csupomona.edu
> California State Polytechnic University  |  Pomona CA 91768

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly

2010-01-11 Thread Paul B. Henson
On Mon, 11 Jan 2010, Eric Schrock wrote:

> No, there is no way to tell if a pool has DTL (dirty time log) entries.

Hmm, I hadn't heard that term before, but based on a quick search I take it
that's the list of data in the pool that is not fully redundant? So if a
2-way mirror vdev lost a half, everything written after the loss would be
on the DTL, and if the same device came back, recovery would entail just
running through the DTL and writing out what it missed? Although presumably
if the failed device was replaced with another device entirely all of the
data would need to be written out.

I'm not quite sure that answered my question. My original question was, for
example, given a 2-way mirror, one half fails. There is a hot spare
available, which is pulled in, and while the pool isn't optimal, it does
have the same number of devices that it's supposed to. On the other hand,
the same mirror loses a device, there's no hot spare, and the pool is short
one device. My understanding is that in both scenarios the pool status
would be "DEGRADED", but it seems there's an important difference. In the
first case, another device could fail, and the pool would still be ok. In
the second, another device failing would result in complete loss of data.

While you can tell the difference between these two different states by
looking at the detailed output and seeing if a hot spare is in use, I was
just saying that it would be nice for the short status to have some
distinction between "device failed, hot spare in use" and "device failed,
keep fingers crossed" ;).

Back to your answer, if the existence of DTL entries means the pool doesn't
have full redundancy for some data, and you can't tell if a pool has DTL
entries, are you saying there's no way to tell if the current state of your
pool could survive a device failure? If a resilver successfully completes,
barring another device failure, doesn't that mean the pool is restored to
full redundancy? I feel like I must be misunderstanding something :(.

Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach

2010-01-11 Thread Daniel Carosone
On Mon, Jan 11, 2010 at 06:03:40PM -0800, Richard Elling wrote:
> IMHO, a split mirror is not as good as a decent backup :-)

I know.. that was more by way of introduction and background.  It's
not the only method of backup, but since this disk does get plugged
into the netbook frequently enough it seemed like a useful measure, at
the cost of a couple of setup commands (I hoped).

The technical questions here, however, are "why does zpool offline not
close the device", and/or "what else do I have to do to get the device
closed". 

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Damon Atkins
One thing which may help: zfs import used to be single-threaded, i.e. it opened
every disk (or slice) one at a time and processed it. As of build 128b it is
multi-threaded, i.e. it opens and processes N disks/slices at once, where N is
the number of threads it decides to use.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191

This most likely also caused other parts of the process to become
multi-threaded as well.

It would be nice to no longer need /etc/zfs/zpool.cache now that zfs import is fast
enough (which is a second reason I logged the bug).
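
As an aside, the cachefile pool property already lets you opt out per pool (a
small sketch, borrowing the ssd pool / c9d1 device names from Lutz's test
elsewhere in this thread):

# create a pool that is never recorded in /etc/zfs/zpool.cache
zpool create -o cachefile=none ssd c9d1
# or switch an existing pool over
zpool set cachefile=none ssd
# such a pool is then found by scanning the devices at import time
zpool import ssd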
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach

2010-01-11 Thread Richard Elling
On Jan 11, 2010, at 4:42 PM, Daniel Carosone wrote:

> I have a netbook with a small internal ssd as rpool. I have an
> external usb HDD with much larger storage, as a separate pool, which
> is sometimes attached to the netbook. 
> 
> I created a zvol on the external pool, the same size as the internal
> ssd, and attached it as a mirror to rpool for backup.  I don't care so
> much about it not being bootable, as long as I can read whatever data
> I might need in case of failure or loss. 

IMHO, a split mirror is not as good as a decent backup :-)

At one time, Tim Foster's scripts had a backup feature where you
could specify a backup flag on a file system which would automatically
back up to removable media.  This used send/recv, which is a clean
way of managing such things.  There has been some talk recently
about the future of that feature; see the zfs-auto-snapshot forum to
catch up on the conversation.

NB: one reason send/recv in ZFS works as well as a well-designed split
mirror in some other RAID software is that the same method
is used to send incremental snapshots and to resilver mirrors.

> The mirror works fine, and resilvers properly and selectively when I
> use "zpool offline" and "zpool online" on the zvol submirror.  

Not really. If you want to split mirrors for "backup" purposes, then
you need "zpool split", which was recently integrated in b131. It takes
care of the dangling participles.
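
For reference, the b131 usage is roughly this (a sketch with placeholder names;
the listed device is the half that ends up in the new pool, and splitting a
root pool has its own caveats):

# split the zvol half of the rpool mirror off into its own pool
zpool split rpool rpool_backup /dev/zvol/dsk/black/rpool-mirror
# later, on this host or another one
zpool import rpool_backup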
 -- richard

> I don't want to have the usb disk attached all the time, nor even to
> run with the mirror always active (usb is - just - slower than the
> internal ssd). I'd like to be able to move this external disk between
> hosts, and potentially repeat the rpool mirror for each, having them
> resilver whenever the disk is attached.
> 
> However, with the rpool mirror in place, I can't find a way to "zpool
> export black".  It complains that the pool is busy, because of the
> zvol in use.  This happens regardless of whether I have set the zvol
> submirror offline.  I expected that, with the subdevice in the offline
> state, the zvol would be closed.
> 
> Any suggestions?  Is this worth filing as a bug (is the device really
> offline)?  Would it work differently if I used a file on the external
> pool, instead of a zvol? (I haven't tried that yet, but don't really
> expect a difference unless umount -f can help). 
> 
> --
> Dan.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach

2010-01-11 Thread Daniel Carosone
On Tue, Jan 12, 2010 at 02:38:56PM +1300, Ian Collins wrote:
> How did you set the subdevice in the off line state? 

# zpool offline rpool /dev/zvol/dsk/

sorry if that wasn't clear.

> Did you detach the device from the mirror?

No, because then:
 - it will have to resilver fully on next attach, no quick update
 - it will be marked detached, and harder to import for recovery.  I'm
   not even sure it can be easily recovered, maybe with import -D?

If I do detach the subdevice, of course then the external pool can be
exported - but that's not helpful for the original goal.  If I
shutdown ungracefully, and boot without the external disk attached,
the expected components are faulted, but again that's not especially
desirable for normal operations.

Perhaps the new "zpool split" will be better for the second issue, 
but won't address the first issue.  One of the several reasons for
wanting to use a zvol (or file on zfs) is to be able to clone the
backing store for an "import -f" while keeping the original for
further incremental resilvers. 

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly

2010-01-11 Thread Eric Schrock

On 01/11/10 17:42, Paul B. Henson wrote:

On Sat, 9 Jan 2010, Eric Schrock wrote:


No, it's fine.  DEGRADED just means the pool is not operating at the
ideal state.  By definition a hot spare is always DEGRADED.  As long as
the spare itself is ONLINE it's fine.


One more question on this; so there's no way to tell just from the status
the difference between a pool degraded due to disk failure but still with
full redundancy from a hot spare vs a pool degraded due to disk failure
that has lost redundancy due to that failure? I guess you can review the
pool details for the specifics but for large pools it seems it would be
valuable to be able to quickly distinguish these states from the short
status.


No, there is no way to tell if a pool has DTL (dirty time log) entries.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly

2010-01-11 Thread Paul B. Henson
On Sat, 9 Jan 2010, Eric Schrock wrote:

> No, it's fine.  DEGRADED just means the pool is not operating at the
> ideal state.  By definition a hot spare is always DEGRADED.  As long as
> the spare itself is ONLINE it's fine.

One more question on this; so there's no way to tell just from the status
the difference between a pool degraded due to disk failure but still with
full redundancy from a hot spare vs a pool degraded due to disk failure
that has lost redundancy due to that failure? I guess you can review the
pool details for the specifics but for large pools it seems it would be
valuable to be able to quickly distinguish these states from the short
status.

Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach

2010-01-11 Thread Ian Collins

Daniel Carosone wrote:

However, with the rpool mirror in place, I can't find a way to "zpool
export black".  It complains that the pool is busy, because of the
zvol in use.  This happens regardless of whether I have set the zvol
submirror offline.  I expected that, with the subdevice in the offline
state, the zvol would be closed.

  
How did you set the subdevice in the off line state?  Did you detach the 
device from the mirror?


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] internal backup power supplies?

2010-01-11 Thread David Magda


On Jan 11, 2010, at 19:00, Toby Thain wrote:


On 11-Jan-10, at 5:59 PM, Daniel Carosone wrote:


Does anyone know of such a device being made and sold? Feel like
designing and marketing one, or publising the design?


FWIW I think Google server farm uses something like this.


It looks slightly "ghetto", but it seems like it works for them:

http://blogs.sun.com/geekism/entry/holy_battery_backup_batman

http://tinyurl.com/cpt4yq
http://arstechnica.com/hardware/news/2009/04/the-beast-unveiled-inside-a-google-server.ars

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rpool mirror on zvol, can't offline and detach

2010-01-11 Thread Daniel Carosone
I should have mentioned:
 - opensolaris b130
 - of course I could use partitions on the usb disk, but that's so much less 
flexible.

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] rpool mirror on zvol, can't offline and detach

2010-01-11 Thread Daniel Carosone
I have a netbook with a small internal ssd as rpool. I have an
external usb HDD with much larger storage, as a separate pool, which
is sometimes attached to the netbook. 

I created a zvol on the external pool, the same size as the internal
ssd, and attached it as a mirror to rpool for backup.  I don't care so
much about it not being bootable, as long as I can read whatever data
I might need in case of failure or loss. 

The mirror works fine, and resilvers properly and selectively when I
use "zpool offline" and "zpool online" on the zvol submirror.  

I don't want to have the usb disk attached all the time, nor even to
run with the mirror always active (usb is - just - slower than the
internal ssd). I'd like to be able to move this external disk between
hosts, and potentially repeat the rpool mirror for each, having them
resilver whenever the disk is attached.

However, with the rpool mirror in place, I can't find a way to "zpool
export black".  It complains that the pool is busy, because of the
zvol in use.  This happens regardless of whether I have set the zvol
submirror offline.  I expected that, with the subdevice in the offline
state, the zvol would be closed.

Any suggestions?  Is this worth filing as a bug (is the device really
offline)?  Would it work differently if I used a file on the external
pool, instead of a zvol? (I haven't tried that yet, but don't really
expect a difference unless umount -f can help). 

--
Dan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] opensolaris-vmware

2010-01-11 Thread Greg
Hello All,
I hope this makes sense. I have two OpenSolaris machines with a bunch of hard
disks; one acts as an iSCSI SAN, and the other is identical other than the hard
disk configuration. The only thing being served are VMware ESXi raw disks,
which hold either virtual machines or data that the particular virtual machine
uses, i.e. we have Exchange 2007 virtualized and through its iSCSI initiator we
are mounting two LUNs, one for the database and another for the logs, all on
different arrays of course. Anyhow, we are then snapshotting this data across
the SAN network to the other box using snapshot send/recv. In case the other
box fails, this box can immediately serve all of the iSCSI LUNs. The problem,
and I don't really know if it's a problem, is: when I snapshot a running VM,
will it come up alive in ESXi, or do I have to accomplish this in a different
way? These snapshots will then be written to tape with bacula. I hope I am
posting this in the correct place.

Thanks, 
Greg
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] internal backup power supplies?

2010-01-11 Thread Toby Thain


On 11-Jan-10, at 5:59 PM, Daniel Carosone wrote:


With all the recent discussion of SSD's that lack suitable
power-failure cache protection, surely there's an opportunity for a
separate modular solution?

I know there used to be (years and years ago) small internal UPS's
that fit in a few 5.25" drive bays. They were designed to power the
motherboard and peripherals, with the advantage of simplicity and
efficiency that comes from being behind the PC PSU and working
entirely on DC.
...
Does anyone know of such a device being made and sold? Feel like
designing and marketing one, or publishing the design?


FWIW I think Google server farm uses something like this.

--Toby



--
Dan.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread Ross Walker
On Jan 11, 2010, at 2:23 PM, Bob Friesenhahn wrote:



On Mon, 11 Jan 2010, bank kus wrote:


Are we still trying to solve the starvation problem?


I would argue the disk I/O model is fundamentally broken on Solaris  
if there is no fair I/O scheduling between multiple read sources  
until that is fixed individual I_am_systemstalled_while_doing_xyz  
problems will crop up. Started a new thread focussing on just this  
problem.


While I will readily agree that zfs has a I/O read starvation  
problem (which has been discussed here many times before), I doubt  
that it is due to the reasons you are thinking.


A true fair I/O scheduling model would severely hinder overall  
throughput in the same way that true real-time task scheduling  
cripples throughput.  ZFS is very much based on its ARC model.  ZFS  
is designed for maximum throughput with minimum disk accesses in  
server systems.  Most reads and writes are to and from its ARC.   
Systems with sufficient memory hardly ever do a read from disk and  
so you will only see writes occurring in 'zpool iostat'.


The most common complaint is read stalls while zfs writes its  
transaction group, but zfs may write this data up to 30 seconds  
after the application requested the write, and the application might  
not even be running any more.


Maybe what's needed is an I/O scheduler like Linux's 'deadline' I/O scheduler,
whose purpose is to reduce the effect of writers starving readers while
providing some form of guaranteed latency.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] internal backup power supplies?

2010-01-11 Thread Daniel Carosone
With all the recent discussion of SSD's that lack suitable
power-failure cache protection, surely there's an opportunity for a
separate modular solution? 

I know there used to be (years and years ago) small internal UPS's
that fit in a few 5.25" drive bays. They were designed to power the
motherboard and peripherals, with the advantage of simplicity and
efficiency that comes from being behind the PC PSU and working
entirely on DC.

Something similar in a smaller form factor, similar to the drive bay
sleds that mount one or two 2.5" disks in a 3.5" (or even 5.25") bay,
with a small and simple power storage and circuit, would be great.
Alternately, something that took up a drive bay and provided power for
multiple disks in other bays, though that might be messier for
cabling. 

It wouldn't need to hold power long.  We could then use any SSD
selected on other design and performance and price criteria.

Does anyone know of such a device being made and sold? Feel like
designing and marketing one, or publishing the design?

--
Dan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?

2010-01-11 Thread James C. McPherson

On 11/01/10 11:57 PM, Arnaud Brand wrote:


According to various posts the LSI SAS3081E-R seems to work well with 
OpenSolaris.
But I've got pretty chilled-out from my recent problems with Areca-1680's.

Could anyone please confirm that the LSI SAS3081E-R works well ?
Is hotplug supported ?

Anything else I should know before buying one of these cards ?



These cards work very well with OpenSolaris, and attach using
the mpt(7d) driver - supports hotplugging and MPxIO too.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] abusing zfs boot disk for fun and DR

2010-01-11 Thread Ben Taylor
> Ben,
> I have found that booting from cdrom and importing
> the pool on the new host, then boot the hard disk
> will prevent these issues.
> That will reconfigure the zfs to use the new disk
> device.
> When running, zpool detach the missing mirror device
> and attach a new one.

Thanks.  I'm well versed in dealing with zfs issues. The reason
I reported this boot/rpool issue was that it was similar in
nature to issues that occurred while trying to remediate an
x4500 which had suffered many SATA disks going offline (due to
the buggy Marvell driver), as well as corruption that occurred
while trying to fix said issue.  Backline spent a fair amount
of time just trying to remediate the issue with hot spares
that looked exactly like the faulted config in my rpool.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Richard Elling
comment below...

On Jan 11, 2010, at 10:00 AM, Lutz Schumann wrote:

> Ok, tested this myself ... 
> 
> (same hardware used for both tests) 
> 
> OpenSolaris svn_104 (actually Nexenta Core 2):
> 
> 100 Snaps
> 
> r...@nexenta:/volumes# time for i in $(seq 1 100); do zfs snapshot 
> ssd/v...@test1_$i; done
> 
> 
> real0m24.991s
> user0m0.297s
> sys 0m0.679s
> 
> Import: 
> r...@nexenta:/volumes# time zpool import ssd
> 
> real0m25.053s
> user0m0.031s
> sys 0m0.216s
> 
> 2) 500 snaps (400 Created, 500 imported)
> -
> 
> r...@nexenta:/volumes# time for i in $(seq 101 500); do zfs snapshot 
> ssd/v...@test1_$i; done
> 
> real3m6.257s
> user0m1.190s
> sys 0m2.896s
> 
> r...@nexenta:/volumes# time zpool import ssd
> 
> real3m59.206s
> user0m0.091s
> sys 0m0.956s
> 
> 3) 1500 Snaps (1000 created, 1500 imported)
> -
> 
> r...@nexenta:/volumes# time for i in $(seq 501 1500); do zfs snapshot 
> ssd/v...@test1_$i; done
> 
> real22m23.206s
> user0m3.041s
> sys 0m8.785s
> 
> r...@nexenta:/volumes# time zpool import ssd
> 
> real36m26.765s
> user0m0.233s
> sys 0m4.545s
> 
> ... you see where this goes - it's exponential!!
> 
> Now with svn_130 (same pool, still 1500 snaps on it) 
> 
> .. now we are booting OpenSolaris svn_130.
> Sun Microsystems Inc.  SunOS 5.11  snv_130 November 2008
> 
> r...@osol_dev130:~# zpool import
>  pool: ssd
>id: 16128137881522033167
> state: ONLINE
> status: The pool is formatted using an older on-disk version.
> action: The pool can be imported using its name or numeric identifier, though
>some features will not be available without an explicit 'zpool 
> upgrade'.
> config:
> 
>ssd ONLINE
>  c9d1  ONLINE
> 
> r...@osol_dev130:~# time zpool import ssd
> 
> real0m0.756s
> user0m0.014s
> sys 0m0.056s
> 
> r...@osol_dev130:~# zfs list -t snapshot | wc -l
>1502
>   
> r...@osol_dev130:~# time zpool export ssd
> 
> real0m0.425s
> user0m0.003s
> sys 0m0.029s
> 
> I like this one :)
> 
> ... just for fun ... (5K Snaps)
> 
> r...@osol_dev130:~# time for i in $(seq 1501 5000); do zfs snapshot 
> ssd/v...@test1_$i; done
> 
> real1m18.977s
> user0m9.889s
> sys 0m19.969s
> 
> r...@osol_dev130:~# zpool export ssd
> r...@osol_dev130:~# time zpool import  ssd
> 
> real0m0.421s
> user0m0.014s
> sys 0m0.055s
> 
> ... just for fun ... (10K Snaps)
> 
> r...@osol_dev130:~# time for i in $(seq 5001 1); do zfs snapshot 
> ssd/v...@test1_$i; done
> 
> real2m6.242s
> user0m14.107s
> sys 0m28.573s
> 
> r...@osol_dev130:~# time zpool import ssd
> 
> real0m0.405s
> user0m0.014s
> sys 0m0.057s
> 
> Very nice, so volume import is solved. 

cool

> 
> .. however ... a lot of snaps still have a impact on system performance. 
> After the import of the 1 snaps volume, I saw "devfsadm" eating up all 
> CPU: 

If you are snapshotting ZFS volumes, then each will create an entry in the
device tree. In other words, if these were file systems instead of volumes, you
would not see devfsadm so busy.

> 
> load averages:  5.00,  3.32,  1.58;   up 0+00:18:12
> 18:50:05
> 99 processes: 95 sleeping, 2 running, 2 on cpu
> CPU states:  0.0% idle,  4.6% user, 95.4% kernel,  0.0% iowait,  0.0% swap
> Kernel: 409 ctxsw, 14 trap, 47665 intr, 1223 syscall
> Memory: 8190M phys mem, 5285M free mem, 4087M total swap, 4087M free swap
> 
>   PID USERNAME NLWP PRI NICE  SIZE   RES STATETIMECPU COMMAND
>   167 root6  220   25M   13M run  3:14 49.41% devfsadm
> 
> ... a truss showed that it is the device node allocation eating up the CPU: 
> 
> /5:  0.0010 xstat(2, "/devices/pseudo/z...@0:8941,raw", 0xFE32FCE0) = 0
> /5:  0.0005 fcntl(7, F_SETLK, 0xFE32FED0)   = 0
> /5:  0. close(7)= 0
> /5:  0.0001 lwp_unpark(3)   = 0
> /3:  0.0200 lwp_park(0xFE61EF58, 0) = 0
> /3:  0. time()  = 1263232337
> /5:  0.0001 open("/etc/dev/.devfsadm_dev.lock", O_RDWR|O_CREAT, 0644) = 7
> /5:  0.0001 fcntl(7, F_SETLK, 0xFE32FEF0)   = 0
> /5:  0. read(7, "A7\0\0\0", 4)  = 4
> /5:  0.0001 getpid()= 167 [1]
> /5:  0. getpid()= 167 [1]
> /5:  0.0001 open("/devices/pseudo/devi...@0:devinfo", O_RDONLY) = 10
> /5:  0. ioctl(10, DINFOIDENT, 0x)   = 57311
> /5:  0.0138 ioctl(10, 0xDF06, 0xFE32FA60)   = 2258109
> /5:  0.0027 ioctl(10, DINFOUSRLD, 0x086CD000)   = 2260992
> /5:  0.0001 close(10)   

[zfs-discuss] Does ZFS use large memory pages?

2010-01-11 Thread Gary Mills
Last April we put this in /etc/system on a T2000 server with large ZFS
filesystems:

set pg_contig_disable=1

This was while we were attempting to solve a couple of ZFS problems
that were eventually fixed with an IDR.  Since then, we've removed
the IDR and brought the system up to Solaris 10 10/09 with current
patches.  It's stable now, but seems slower.

This line was a workaround for bug 6642475 that had to do with
searching for large contiguous pages. The result was high system
time and slow response.  I can't find any public information on this
bug, although I assume it's been fixed by now.  It may have only
affected Oracle database.

I'd like to remove this line from /etc/system now, but I don't know
if it will have any adverse effect on ZFS or the Cyrus IMAP server
that runs on this machine.  Does anyone know if ZFS uses large memory
pages?
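
As a data point before removing the line, something like this shows which page
sizes are supported and which ones a process of interest actually has mapped
(pid 1234 is a placeholder, e.g. the Cyrus master; trapstat is SPARC-only):

# page sizes the platform supports
pagesize -a
# per-mapping page sizes for one process (see the Pgsz column)
pmap -xs 1234
# TLB miss rates broken down by page size, sampled every 5 seconds
trapstat -T 5 5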

-- 
-Gary Mills--Unix Group--Computer and Network Services-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HW raid vs ZFS

2010-01-11 Thread Bob Friesenhahn

On Mon, 11 Jan 2010, Anil wrote:

ZFS will definitely benefit from battery backed RAM on the 
controller as long as the controller immediately acknowledges cache 
flushes (rather than waiting for battery-protected data to flush to 
the


I am a little confused with this. Do we not want the controller to
ignore these cache flushes (since the cache is battery protected)?
What did you mean by acknowledge?  If it acknowledges and flushes
the cache, then what is the benefit of the cache at all (if ZFS
keeps telling it to flush every few seconds)?


We want the controller to flush unwritten data to disk as quickly as 
it can regardless of whether it receives a cache flush request.  If 
the data is "safely" stored in battery backed RAM, then we would like 
the controller to acknowledge the flush request immediately.  The 
primary benefit of the battery-protected cache is to reduce latency 
for small writes.



I use mostly DAS for my servers. This is a x4170 with 8 drive bays.
So, what's the final recommendation?  Should I just RAID 1 on the 
hardware and put ZFS on top of it?


Unless you have a severe I/O bottleneck through your controller, you 
should do the mirroring in zfs rather than in the controller.  The 
reason for this is that zfs mirrors are highly resilient, intelligent, 
and resilver time is reduced.  Zfs will be able to detect and correct 
errors that the controller might not be aware of, or be unable to 
correct.
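
A minimal sketch of the zfs-side mirroring (hypothetical device names; if you
have two controllers, give each pair one disk from each so that either
controller can fail without losing redundancy):

zpool create tank \
    mirror c1t0d0 c2t0d0 \
    mirror c1t1d0 c2t1d0 \
    mirror c1t2d0 c2t2d0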


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread Bob Friesenhahn

On Mon, 11 Jan 2010, bank kus wrote:


Are we still trying to solve the starvation problem?


I would argue the disk I/O model is fundamentally broken on Solaris 
if there is no fair I/O scheduling between multiple read sources 
until that is fixed individual I_am_systemstalled_while_doing_xyz 
problems will crop up. Started a new thread focussing on just this 
problem.


While I will readily agree that zfs has an I/O read starvation problem 
(which has been discussed here many times before), I doubt that it is 
due to the reasons you are thinking.


A true fair I/O scheduling model would severely hinder overall 
throughput in the same way that true real-time task scheduling 
cripples throughput.  ZFS is very much based on its ARC model.  ZFS is 
designed for maximum throughput with minimum disk accesses in server 
systems.  Most reads and writes are to and from its ARC.  Systems with 
sufficient memory hardly ever do a read from disk and so you will only 
see writes occurring in 'zpool iostat'.


The most common complaint is read stalls while zfs writes its 
transaction group, but zfs may write this data up to 30 seconds after 
the application requested the write, and the application might not 
even be running any more.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HW raid vs ZFS

2010-01-11 Thread Toby Thain


On 11-Jan-10, at 1:12 PM, Bob Friesenhahn wrote:


On Mon, 11 Jan 2010, Anil wrote:


What is the recommended way to make use of a Hardware RAID  
controller/HBA along with ZFS?

...


Many people will recommend against using RAID5 in "hardware" since  
then zfs is not as capable of repairing errors, and because most  
RAID5 controller cards use a particular format on the drives so  
that the drives become tied to the controller brand/model and it is  
not possible to move the pool to a different system without using  
an identical controller.  If the controller fails and is no longer  
available for purchase, or the controller is found to have a design  
defect, then the pool may be toast.


+1 These drawbacks of proprietary RAID are frequently overlooked.

Marty Scholes had a neat summary in a posting here, 21 October 2009:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg30452.html

Back when I did storage admin for a smaller company where  
availability was hyper-critical (but we couldn't afford EMC/ 
Veritas), we had a hardware RAID5 array.  After a few years of  
service, we ran into some problems:

* Need to restripe the array?  Screwed.
* Need to replace the array because current one is EOL?  Screwed.
* Array controller barfed for whatever reason?  Screwed.
* Need to flash the controller with latest firmware?  Screwed.
* Need to replace a component on the array, e.g. NIC, controller or  
power supply?  Screwed.

* Need to relocate the array?  Screwed.

If we could stomach downtime or short-lived storage solutions, none  
of this would have mattered.







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HW raid vs ZFS

2010-01-11 Thread Anil
> ZFS will definitely benefit from battery backed RAM
> on the controller 
> as long as the controller immediately acknowledges
> cache flushes 
> (rather than waiting for battery-protected data to
> flush to the 

I am a little confused with this. Do we not want the controller to ignore these
cache flushes (since the cache is battery protected)? What did you mean by
acknowledge?  If it acknowledges and flushes the cache, then what is the
benefit of the cache at all (if ZFS keeps telling it to flush every few seconds)?

> The notion of "performance" is highly arbitrary since
> there are 
> different types of performance.  What sort of
> performance do you need 
> to optimize for?

It won't be a big deal. Just general web/small databases. Just wondering how
much of a performance difference there is.


> 
> Many people will recommend against using RAID5 in
> "hardware" since 
> then zfs is not as capable of repairing errors, and
> because most RAID5 
> controller cards use a particular format on the
> drives so that the 
> drives become tied to the controller brand/model and
> it is not 
> possible to move the pool to a different system
> without using an 
> identical controller.  If the controller fails and is
> no longer 
> available for purchase, or the controller is found to
> have a design 
> defect, then the pool may be toast.
> 


I use mostly DAS for my servers. This is a x4170 with 8 drive bays.
So, what's the final recommendation?  Should I just RAID 1 on the hardware and 
put ZFS on top of it?

Got three options:

Get the Storagetek Internal HBA (with batteries) and...:

7 disks
600gb usable
===
no hardware raid
2 rpool
3 raidz
1 hot spare
1 ssd

8 disks
600gb usable
===
2 rpool
2 mirror + 2 mirror  (the mirrors would be in hardware with zfs doing striping 
across the mirrors)
1 hot spare
1 ssd

8 disks
1200gb usable
===
no hardware raid
2 rpool
3 raidz + 3 raidz


The hot spare and the SSD are optional and I can add them at a later point.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Cache + ZIL on single SSD

2010-01-11 Thread A. Krijgsman
Thank you Thomas and Mertol for your feedback.

I was indeed aiming for the X25-E because of their write performance.
However, since these are around €350 for 32 GB, I find it disturbing to only use
them for the ZIL :-)

I will do some tests with a cheap MLC disk.
I also read about the disk cache needing to be disabled to avoid data loss after
a power loss.
If a journaling FS is run over iSCSI this should not become an issue, right?
( Or will ZFS be affected by missing data from cache? )

Since I am hanging here now:
Do people experience trouble with CPU load when exporting iSCSI zvols because
of the TCP/IP processing?
Or is it safe to ignore the TOE network cards?

Regards,
Armand




  - Original Message - 
  From: Thomas Burgess 
  To: A. Krijgsman 
  Cc: zfs-discuss@opensolaris.org 
  Sent: Monday, January 11, 2010 3:02 AM
  Subject: Re: [zfs-discuss] ZFS Cache + ZIL on single SSD





Next to that I am reading about all kinds of performance benefits from using separate
devices for the ZIL (write) and the cache (read). I was wondering if I could
share a single SSD between both the ZIL and the cache device?

Or is this not recommended?



  i asked something similar recently.  The answers i got were along these lines:

  you can use a single ssd but it's not a great idea.  If you DO need to use a 
single ssd for such a thing, make sure it's one of the more expensive SLC 
variety like the intel x25-e

  The MLC variety of SSD works well for L2ARC and is much cheaper (you can pick 
up some for less than 100 bucks)  while the ZIL really should have the SLC 
variety.  
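
For what it's worth, if you do end up slicing one SSD, attaching the pieces is
just this (a sketch; pool and device names are placeholders, s0 being a small
slice for the slog and s1 the remainder for L2ARC):

zpool add tank log c2t0d0s0
zpool add tank cache c2t0d0s1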

  I'm not an expert though, i'm just passing on advice i've been given.


  For reads and dedup L2ARC does make a dramatic difference, and for NFS and 
database stuff an SSD ZIL will make a huge difference.

  I've heard of people getting 5-10x's performance increase on reads just by 
adding a cheap ssd so i'd say it's worth it if that's the type of dataset you 
have.

  Another thing i was told on more than one occasion was that you may not even
NEED an SSD for your ZIL (basically you should run a script to see if you need a
ZIL at all; i don't remember where this script is but i'm SURE someone will
reply with it)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread bank kus
> Are we still trying to solve the starvation problem?

I would argue the disk I/O model is fundamentally broken on Solaris if there is
no fair I/O scheduling between multiple read sources; until that is fixed,
individual I_am_systemstalled_while_doing_xyz problems will crop up. Started a
new thread focussing on just this problem.

http://opensolaris.org/jive/thread.jspa?threadID=121479&tstart=0
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread Henrik Johansson
Hello,

On Jan 11, 2010, at 6:53 PM, bank kus wrote:

>> For example, you could set it to half your (8GB) memory so that 4GB is
>> immediately available for other uses.
>> 
>> * Set maximum ZFS ARC size to 4GB
> 
> capping max sounds like a good idea.


Are we still trying to solve the starvation problem?

I filed a bug on the non-ZFS-related urandom stall problem yesterday, primarily
since it can do nasty things from inside a resource-capped zone:
CR 6915579 solaris-cryp/random Large read from /dev/urandom can stall system

Regards
Henrik
http://sparcv9.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread bank kus
> For example, you could set it to half your (8GB) memory so that 4GB is
> immediately available for other uses.
>
> * Set maximum ZFS ARC size to 4GB

capping max sounds like a good idea

thanks
banks
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HW raid vs ZFS

2010-01-11 Thread Bob Friesenhahn

On Mon, 11 Jan 2010, Anil wrote:


What is the recommended way to make use of a Hardware RAID 
controller/HBA along with ZFS?


Does it make sense to do RAID5 on the HW and then RAIDZ on the 
software? OR just stick to ZFS RAIDZ and connect the drives to the 
controller, w/o any HW RAID (to benefit from the batteries). What 
will give the most performance benefit?


ZFS will definitely benefit from battery backed RAM on the controller 
as long as the controller immediately acknowledges cache flushes 
(rather than waiting for battery-protected data to flush to the 
disks).  There will be benefit as long as the size of the write data 
backlog does not exceed controller RAM size.


The notion of "performance" is highly arbitrary since there are 
different types of performance.  What sort of performance do you need 
to optimize for?


I think that the best performance for most cases is to use mirroring. 
Use two controllers with battery-backed RAM and split the mirrors 
across the controllers so that a write to a mirror pair results in a 
write to each controller.  Unfortunately, this is not nearly as space 
efficient as RAID5 or raidz.


Many people will recommend against using RAID5 in "hardware" since 
then zfs is not as capable of repairing errors, and because most RAID5 
controller cards use a particular format on the drives so that the 
drives become tied to the controller brand/model and it is not 
possible to move the pool to a different system without using an 
identical controller.  If the controller fails and is no longer 
available for purchase, or the controller is found to have a design 
defect, then the pool may be toast.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Practical) limit on the number of snapshots?

2010-01-11 Thread Lutz Schumann
Ok, tested this myself ... 

(same hardware used for both tests) 

OpenSolaris svn_104 (actually Nexenta Core 2):

100 Snaps

r...@nexenta:/volumes# time for i in $(seq 1 100); do zfs snapshot 
ssd/v...@test1_$i; done


real0m24.991s
user0m0.297s
sys 0m0.679s

Import: 
r...@nexenta:/volumes# time zpool import ssd

real0m25.053s
user0m0.031s
sys 0m0.216s

2) 500 snaps (400 Created, 500 imported)
-

r...@nexenta:/volumes# time for i in $(seq 101 500); do zfs snapshot 
ssd/v...@test1_$i; done

real3m6.257s
user0m1.190s
sys 0m2.896s

r...@nexenta:/volumes# time zpool import ssd

real3m59.206s
user0m0.091s
sys 0m0.956s

3) 1500 Snaps (1000 created, 1500 imported)
-

r...@nexenta:/volumes# time for i in $(seq 501 1500); do zfs snapshot 
ssd/v...@test1_$i; done

real22m23.206s
user0m3.041s
sys 0m8.785s

r...@nexenta:/volumes# time zpool import ssd

real36m26.765s
user0m0.233s
sys 0m4.545s

... you see where this goes - it's exponential!!

Now with svn_130 (same pool, still 1500 snaps on it) 

.. now we are booting OpenSolaris svn_130.
Sun Microsystems Inc.  SunOS 5.11  snv_130 November 2008

r...@osol_dev130:~# zpool import
  pool: ssd
id: 16128137881522033167
 state: ONLINE
status: The pool is formatted using an older on-disk version.
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:

ssd ONLINE
  c9d1  ONLINE

r...@osol_dev130:~# time zpool import ssd

real0m0.756s
user0m0.014s
sys 0m0.056s

r...@osol_dev130:~# zfs list -t snapshot | wc -l
1502

r...@osol_dev130:~# time zpool export ssd

real0m0.425s
user0m0.003s
sys 0m0.029s

I like this one :)

... just for fun ... (5K Snaps)

r...@osol_dev130:~# time for i in $(seq 1501 5000); do zfs snapshot 
ssd/v...@test1_$i; done

real1m18.977s
user0m9.889s
sys 0m19.969s

r...@osol_dev130:~# zpool export ssd
r...@osol_dev130:~# time zpool import  ssd

real0m0.421s
user0m0.014s
sys 0m0.055s

... just for fun ... (10K Snaps)

r...@osol_dev130:~# time for i in $(seq 5001 1); do zfs snapshot 
ssd/v...@test1_$i; done

real2m6.242s
user0m14.107s
sys 0m28.573s

r...@osol_dev130:~# time zpool import ssd

real0m0.405s
user0m0.014s
sys 0m0.057s

Very nice, so volume import is solved. 

.. however ... a lot of snaps still have an impact on system performance. After
the import of the 1 snaps volume, I saw "devfsadm" eating up all CPU:

load averages:  5.00,  3.32,  1.58;   up 0+00:18:1218:50:05
99 processes: 95 sleeping, 2 running, 2 on cpu
CPU states:  0.0% idle,  4.6% user, 95.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 409 ctxsw, 14 trap, 47665 intr, 1223 syscall
Memory: 8190M phys mem, 5285M free mem, 4087M total swap, 4087M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATETIMECPU COMMAND
   167 root6  220   25M   13M run  3:14 49.41% devfsadm

... a truss showed that it is the device node allocation eating up the CPU: 

/5:  0.0010 xstat(2, "/devices/pseudo/z...@0:8941,raw", 0xFE32FCE0) = 0
/5:  0.0005 fcntl(7, F_SETLK, 0xFE32FED0)   = 0
/5:  0. close(7)= 0
/5:  0.0001 lwp_unpark(3)   = 0
/3:  0.0200 lwp_park(0xFE61EF58, 0) = 0
/3:  0. time()  = 1263232337
/5:  0.0001 open("/etc/dev/.devfsadm_dev.lock", O_RDWR|O_CREAT, 0644) = 7
/5:  0.0001 fcntl(7, F_SETLK, 0xFE32FEF0)   = 0
/5:  0. read(7, "A7\0\0\0", 4)  = 4
/5:  0.0001 getpid()= 167 [1]
/5:  0. getpid()= 167 [1]
/5:  0.0001 open("/devices/pseudo/devi...@0:devinfo", O_RDONLY) = 10
/5:  0. ioctl(10, DINFOIDENT, 0x)   = 57311
/5:  0.0138 ioctl(10, 0xDF06, 0xFE32FA60)   = 2258109
/5:  0.0027 ioctl(10, DINFOUSRLD, 0x086CD000)   = 2260992
/5:  0.0001 close(10)   = 0
/5:  0.0015 modctl(MODGETNAME, 0xFE32F060, 0x0401, 0xFE32F05C, 
0xFD1E0008) = 0
/5:  0.0010 xstat(2, "/devices/pseudo/z...@0:8941", 0xFE32FCE0) = 0
/5:  0.0005 fcntl(7, F_SETLK, 0xFE32FED0)   = 0
/5:  0. close(7)= 0
/5:  0.0001 lwp_unpark(3)   = 0
/3:  0.0201 lwp_park(0xFE61EF58, 0) = 0
/3:  0.0001 time()  = 1263232337
/5:  0.0001 open("/etc/dev/.devfsadm_dev.lock", O_RDWR

[zfs-discuss] HW raid vs ZFS

2010-01-11 Thread Anil
I am sure this is not the first discussion related to this... apologies for the 
duplication.

What is the recommended way to make use of a Hardware RAID controller/HBA along 
with ZFS?

Does it make sense to do RAID5 on the HW and then RAIDZ on the software? OR 
just stick to ZFS RAIDZ and connect the drives to the controller, w/o any HW 
RAID (to benefit from the batteries). What will give the most performance 
benefit?

If there isn't much difference, I just feel like I am spending too much money
on a RAID controller but not making 100% use of it. :)  Perhaps the
route I should take is to look for drives (SAS or SSD) with built-in
batteries/capacitors - but perhaps then it becomes much more expensive than the
HW RAID controller?

Thanks
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread Bob Friesenhahn

On Mon, 11 Jan 2010, bank kus wrote:


However I noticed something weird: long after the file operations 
are done, the free memory doesn't seem to grow back (below). 
Essentially ZFS File Data claims to use 76% of memory long after the 
file has been written. How does one reclaim it? Is ZFS File 
Data a pool that, once grown to a size, doesn't shrink back even though 
its current contents might not be used by any process?


It is normal for the ZFS ARC to retain data as long as there is no 
other memory pressure.  This should not cause a problem other than a 
small delay when starting an application which does need a lot of 
memory, since the ARC will give memory back to the kernel.


For better interactive use, you can place a cap on the maximum ARC 
size via an entry in /etc/system:


  http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE

For example, you could set it to half your (8GB) memory so that 4GB is 
immediately available for other uses.


* Set maximum ZFS ARC size to 4GB
* http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ARCSIZE
set zfs:zfs_arc_max = 0x100000000
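
After a reboot you can sanity-check that the cap took effect; one quick way 
(assuming the standard zfs arcstats kstat, which current builds provide) is:

# c_max should report 4294967296 (4GB); size is the current ARC usage
kstat -p zfs:0:arcstats:c_max zfs:0:arcstats:size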

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread bank kus
vmstat does show something interesting.  The free memory shrinks from around 
10G to 1.5G-ish while doing the first dd (generating the 8G file). The copy 
operations thereafter don't consume much, and it stays at 1.2G after all 
operations have completed. (BTW, at the point of system sluggishness there's 
1.5G of free RAM, so that shouldn't explain the problem.)

However I noticed something weird: long after the file operations are done, the 
free memory doesn't seem to grow back (below). Essentially ZFS File Data claims 
to use 76% of memory long after the file has been written. How does one reclaim 
it? Is ZFS File Data a pool that, once grown to a size, doesn't shrink back 
even though its current contents might not be used by any process?

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     234696               916    7%
ZFS File Data             2384657              9315   76%
Anon                       145915               569    5%
Exec and libs                4250                16    0%
Page cache                  28582               111    1%
Free (cachelist)            53147               207    2%
Free (freelist)            290158              1133    9%

Total                     3141405             12271
Physical                  3141404             12271
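
(For reference, the same summary can be captured non-interactively, and the 
ARC's share of that "ZFS File Data" figure watched directly via the arcstats 
kstat; the two won't match exactly, but they should move together:

echo ::memstat | mdb -k             # same table, scriptable
kstat -p zfs:0:arcstats:size        # current ARC size in bytes
)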
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?

2010-01-11 Thread Maurice Volaski
According to various posts the LSI SAS3081E-R seems to work well 
with OpenSolaris.

But my recent problems with Areca-1680s have made me rather wary.

Could anyone please confirm that the LSI SAS3081E-R works well ?
Is hotplug supported ?


It works well in Solaris 10 including hotplugging.
--

Maurice Volaski, maurice.vola...@einstein.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repeating scrub does random fixes

2010-01-11 Thread Cindy Swearingen

Hi Gary,

You might consider running OSOL on a later build, like build 130.

Have you reviewed the fmdump -eV output to determine on which devices
the ereports below have been generated? This might give you more clues
as to what the issues are. I would also be curious if you have any
driver-level errors reported in /var/adm/messages or by the iostat -En
command.

Repeated random problems across disks make me think of cable problems
or controller issues.
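
For example, something along these lines (the grep patterns are just a starting
point) pulls the affected device paths out of the ereports and cross-checks the
drivers' own error counters:

# which vdevs/devices the checksum and transport ereports point at
fmdump -eV | egrep 'class =|path'

# per-device soft/hard/transport error counters
iostat -En | grep 'Errors:'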

Thanks,

Cindy



On 01/10/10 08:40, Gary Gendel wrote:

I've been using a 5-disk raidZ for years on an SXCE machine, which I converted 
to OSOL.  The only time I ever had zfs problems in SXCE was with snv_120, which 
was fixed.

So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on 
random disks.  If I repeat the scrub, it will fix errors on other disks.  
Occasionally it runs cleanly.  That it doesn't happen in a consistent manner 
makes me believe it's not hardware related.

fmdump only reports three types of errors:

ereport.fs.zfs.checksum
ereport.io.scsi.cmd.disk.tran
ereport.io.scsi.cmd.disk.recovered

The middle one seems to be the issue, and I'd like to track down its source.  
Any docs on how to do this?

Thanks,
Gary

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 failed disk, not sure if hot spare took over correctly

2010-01-11 Thread Cindy Swearingen

Hi Paul,

Example 11-1 in this section describes how to replace a
disk on an x4500 system:

http://docs.sun.com/app/docs/doc/819-5461/gbcet?a=view
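
From memory, the gist of that example (the attachment point and device name 
below are whatever cfgadm reports on your box, and the pool name is just a 
placeholder) is:

# cfgadm | grep c1t2d0                        # find the SATA attachment point
# cfgadm -c unconfigure sata1/2::dsk/c1t2d0   # offline it, then swap the disk
# cfgadm -c configure sata1/2                 # bring the new disk online
# zpool replace tank c1t2d0                   # resilver; watch with 'zpool status'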

Cindy

On 01/09/10 16:17, Paul B. Henson wrote:

On Sat, 9 Jan 2010, Eric Schrock wrote:


If ZFS removed the drive from the pool, why does the system keep
complaining about it?

It's not failing in the sense that it's returning I/O errors, but it's
flaky, so it's attaching and detaching.  Most likely it decided to attach
again and then you got transport errors.


Ok, how do I make it stop logging messages about the drive until it is
replaced? It's still filling up the logs with the same errors about the
drive being offline.

Looks like hdadm isn't it:

r...@cartman ~ # hdadm offline disk c1t2d0
/usr/bin/hdadm[1762]: /dev/rdsk/c1t2d0d0p0: cannot open
/dev/rdsk/c1t2d0d0p0 is not available

Hmm, I was able to unconfigure it with cfgadm:

r...@cartman ~ # cfgadm -c unconfigure sata1/2::dsk/c1t2d0

It went from:

sata1/2::dsk/c1t2d0disk connectedconfigured   failed

to:

sata1/2disk connectedunconfigured failed

Hopefully that will stop the errors until it's replaced and not break
anything else :).


No, it's fine.  DEGRADED just means the pool is not operating at the
ideal state.  By definition a hot spare is always DEGRADED.  As long as
the spare itself is ONLINE it's fine.


The spare shows as "INUSE", but I'm guessing that's fine too.


Hope that helps


That was perfect, thank you very much for the review. Now I can not worry
about it until Monday :).


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool destroy -f hangs system, now zpool import hangs system.

2010-01-11 Thread Carl Rathman
On Wed, Jan 6, 2010 at 12:11 PM, Carl Rathman  wrote:
> On Tue, Jan 5, 2010 at 10:35 AM, Carl Rathman  wrote:
>> On Tue, Jan 5, 2010 at 10:12 AM, Richard Elling
>>  wrote:
>>> On Jan 5, 2010, at 7:54 AM, Carl Rathman wrote:
>>>
 I didn't mean to destroy the pool.  I used zpool destroy on a zvol,
 when I should have used zfs destroy.

 When I used zpool destroy -f mypool/myvolume the machine hard locked
 after about 20 minutes.
>>>
>>> This would be a bug.  "zpool destroy" should only destroy pools.
>>> Volumes are datasets and are destroyed by "zfs destroy."  Using
>>> "zpool destroy -f" will attempt to force unmounts of any mounted
>>> datasets, but volumes are not mounted, per se. Upon reboot, nothing
>>> will be mounted until after the pool is imported.
>>>
>>>
 I don't want to destroy the pool, I just wanted to destroy the one
 volume. -- Which is why I now want to import the pool itself. Does
 that make sense?
>>>
>>> If the pool was destroyed, then you can try to import using -D.
>>>
>>> Are you sure you didn't "zfs destroy" instead?  Once the pool is imported,
>>> "zpool history" will show all of the commands issued against the pool.
>>>  -- richard
>>>
>>>
>>
>> Hi Richard,
>>
>> If I could import the pool, I'd love to do a history on it.
>>
>> At this point, if I attempt to import the pool, the machine will have
>> heavy disk activity on the pool for approximately 10 minutes, then the
>> machine will hard lock. This will happen when I boot the machine from
>> its snv_130 rpool, or if I boot the machine from a snv_130 live cd.
>>
>> Thanks,
>> Carl
>>
>
> Any suggestions on how to begin debugging this, or if data recovery is 
> possible?
>
> Thanks,
> Carl
>

Just wanted to update everyone on this...

I installed 2009.06 (snv_111b), and gave the import of the pool one
last try. After approximately 20 minutes of grinding, the pool
imported properly! No clue why, but all seems to be working now.

Thanks for the insight from the list, I really appreciate it.

-Carl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Help needed backing ZFS to tape

2010-01-11 Thread Richard Elling
Good question.  Zmanda seems to be a popular open source solution with
commercial licenses and support available.  We try to keep the Best Practices
Guide up to date on this topic:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Using_ZFS_With_Enterprise_Backup_Solutions

Additions or corrections are greatly appreciated.
 -- richard

On Jan 11, 2010, at 7:13 AM, Julian Regel wrote:

> Hi
> 
> We have a number of customers (~150) that have a single Sun server with 
> directly attached storage and directly attached tape drive/library. These 
> servers are currently running UFS, but we are looking at deploying ZFS in 
> future builds.
> 
> At present, we backup the server to the local tape drive using ufsdump, but 
> there appears to be no equivalent for ZFS. Is anyone aware why Sun have never 
> provided a zfsdump/zfsrestore for this sort of configuration?
> 
> Can anyone advise the best way to backup a ZFS-based server to a locally 
> attached tape drive? It is possible that the filesystems to backup are bigger 
> than a single tape, so multiple volume support is a must.
> 
> I thought about doing a "zfs send tank/f...@monday > /dev/rmt/0" but this 
> won't manage multiple tapes. Also, I'm not sure if this contains all the 
> metadata required to perform a bare metal restore.
> 
> Any help much appreciated!
> 
> Thanks
> 
> JR
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Bob Friesenhahn

On Mon, 11 Jan 2010, Kjetil Torgrim Homme wrote:


(BTW, thank you for testing forceful removal of power.  the result is as
expected, but it's good to see that theory and practice match.)


Actually, the result is not "as expected" since the device should not 
have lost any data preceding a cache flush request.


These sorts of results should be cause for concern for anyone 
currently using one as a zfs log device, or using it for any 
write-sensitive application at all.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Help needed backing ZFS to tape

2010-01-11 Thread Julian Regel
Hi

We have a number of customers (~150) that have a single Sun server with 
directly attached storage and directly attached tape drive/library. These 
servers are currently running UFS, but we are looking at deploying ZFS in 
future builds.

At present, we backup the server to the local tape drive using ufsdump, but 
there appears to be no equivalent for ZFS. Is anyone aware why Sun have never 
provided a zfsdump/zfsrestore for this sort of configuration?

Can anyone advise the best way to backup a ZFS-based server to a locally 
attached tape drive? It is possible that the filesystems to backup are bigger 
than a single tape, so multiple volume support is a must.

I thought about doing a "zfs send tank/f...@monday > /dev/rmt/0" but this won't 
manage multiple tapes. Also, I'm not sure if this contains all the metadata 
required to perform a bare metal restore.
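
(One half-baked idea - untested, and the dataset name and paths below are made 
up - would be to stage the stream on disk first and then let a 
multi-volume-aware archiver such as cpio deal with the tape spanning:

# stage the full replication stream (needs scratch space roughly the size of the data)
zfs send -R tank/fs@monday > /var/tmp/fs_monday.zstream

# cpio prompts for the next volume when it hits end-of-medium
echo /var/tmp/fs_monday.zstream | cpio -oB -O /dev/rmt/0

but that doubles the I/O, and the stream still doesn't capture pool-level 
configuration for a true bare metal restore.)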

Any help much appreciated!

Thanks

JR



  ___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Lutz Schumann
Maybe it got lost in all this text :) ... hence this re-post.

Does anyone know the impact of disabling the write cache on the write 
amplification factor of the Intel SSDs?

How can I permanently disable the write cache on the Intel X25-M SSDs?
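
The only way I have found so far is format's expert-mode cache menu; whether 
that setting actually survives a power cycle on the X25-M is exactly what I am 
not sure about:

# format -e
  ... select the X25-M ...
format> cache
cache> write_cache
write_cache> disable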

Thanks, Robert
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?

2010-01-11 Thread Arnaud Brand

According to various posts the LSI SAS3081E-R seems to work well with 
OpenSolaris.
But my recent problems with Areca-1680s have made me rather wary.

Could anyone please confirm that the LSI SAS3081E-R works well ?
Is hotplug supported ?

Anything else I should know before buying one of these cards ?

Thanks,
Arnaud



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] unable to zfs destroy

2010-01-11 Thread Mark J Musante

On Fri, 8 Jan 2010, Rob Logan wrote:



this one has me a little confused. Ideas?

j...@opensolaris:~# zpool import z
cannot mount 'z/nukeme': mountpoint or dataset is busy
cannot share 'z/cle2003-1': smb add share failed
j...@opensolaris:~# zfs destroy z/nukeme
internal error: Bad exchange descriptor


EBADE is used by ZFS to indicate checksum errors, which supports the zpool 
status output:



config:

NAMESTATE READ WRITE CKSUM
z   ONLINE   0 0 2
  c3t0d0s7  ONLINE   0 0 4
  c3t1d0s7  ONLINE   0 0 0
  c2d0  ONLINE   0 0 4

errors: Permanent errors have been detected in the following files:

   z/nukeme:<0x0>


So, yeah, without a way for zfs to repair itself, it looks like the only 
way forward is to destroy the zpool and restore from a backup.
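
The underlying issue is that the pool has no redundancy (three plain top-level 
vdevs), so ZFS can detect the corruption but has no second copy to repair it 
from. For next time, attaching a mirror to each device gives it that second 
copy; roughly (the new device names here are just placeholders):

zpool attach z c3t0d0s7 c4t0d0s7
zpool attach z c3t1d0s7 c4t1d0s7
zpool attach z c2d0 c5d0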



Regards,
markm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Repeating scrub does random fixes

2010-01-11 Thread Gary Gendel
I've just run a couple of consecutive scrubs; each time it found a couple of 
checksum errors, but on different drives.  No indication of any other errors.  
That a disk scrubs cleanly on a quiescent pool in one run but fails in the next 
is puzzling.  It reminds me of the odd-number-of-disks raidz bug I reported 
against snv_120.

Looks like I've got to bite the bullet and upgrade to the dev tree and hope for 
the best.

Gary
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Kjetil Torgrim Homme
Lutz Schumann  writes:

> Actually the performance decrease when disabling the write cache on
> the SSD is approx. 3x (aka 66%).

for this reason, you want a controller with battery backed write cache.
in practice this means a RAID controller, even if you don't use the RAID
functionality.  of course you can buy SSDs with capacitors, too, but I
think that will be more expensive, and it will restrict your choice of
model severely.

(BTW, thank you for testing forceful removal of power.  the result is as
expected, but it's good to see that theory and practice match.)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread Phil Harman

Hi Banks,

Some basic stats might shed some light, e.g. vmstat 5, mpstat 5,  
iostat -xnz 5, prstat -Lmc 5 ... all running from just before you  
start the tests until things are "normal" again.


Memory starvation is certainly a possibility. The ARC can be greedy  
and slow to release memory under pressure.


Phil

Sent from my iPhone

On 10 Jan 2010, at 13:29, bank kus  wrote:


Hi Phil
You make some interesting points here:

-> yes bs=1G was a lazy thing

->  the GNU cp I'm using does __not__ appear to use mmap;
open64, open64, read, write, close, close is the relevant syscall sequence

-> replacing cp with dd (128K block size x 64K blocks, as below) does not 
help; no new apps can be launched until the copies complete.
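
(Concretely, the dd variant was along these lines, with placeholder paths:)

dd if=/tank/bigfile of=/tank/bigfile.copy bs=128k count=65536   # 128K x 64K blocks = 8G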


Regards
banks
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS import hangs with over 66000 context switches shown in top

2010-01-11 Thread Jack Kielsmeier
I should also mention that once the "lock" starts, the disk activity light on 
my case stays busy for a bit (1-2 minutes MAX), then does nothing.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS import hangs with over 66000 context switches shown in top

2010-01-11 Thread Jack Kielsmeier
Howdy All,

I made a 1 TB zfs volume within a 4.5 TB zpool called vault for testing iSCSI. 
Both dedup and compression were off. After my tests, I issued a zfs destroy to 
remove the volume.

This command hung. After 5 hours, I hard rebooted into single user mode and 
removed my zfs cache file (I had to do this in order to boot up, as with the 
zfs cache file, my system would hang at reading zfs config).

Now I cannot import my pool as the box always hangs after about 30 minutes. 
It's not a complete hang, I can still ping the box, but I cannot do anything. 
The keyboard is still responsive, but the server will do nothing with any input 
I make. I cannot ssh to the box either. The only thing I can do is hard-reboot 
the box. At first I thought I was running out of RAM, because the hang always 
happened right when my free RAM hit 0 (still had swap available however), but 
I've made tweaks to /etc/system  and now I get freezes with over a gig of RAM 
free (8GB total in the box).

The strange thing is, the context switches shown in top skyrocket from about 
2000-6000 to over 66,000 just before the freeze. Would anyone know why it 
would skyrocket like that?

If I do a ps -ef before the freeze, there is a normal number of processes 
running.
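
If it would help, I can capture more data from a second console right before 
the hang; I was planning on something like this (assuming mdb's ::stacks dcmd 
behaves on snv_130):

vmstat 5 > /var/tmp/vmstat.out &                           # memory and context switches over time
echo "::stacks -m zfs" | mdb -k > /var/tmp/zfs-stacks.out  # where the ZFS threads are sitting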

I have also tried a zpool import -f vault using a snv_130 live CD, as well as 
trying a zpool import -fFX vault. The same thing happens.

System Specs:
snv_130
AMD Phenom 925
8GB DDR2 RAM
2x 500 GB rpool mirrored drives
4x 1.5TB vault raidz1 drives

I have let the import run for over 24 hours with no luck.

Thanks for the assistance.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss