Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-16 Thread Neil Perrin


Joe Little wrote:
> On Nov 16, 2007 9:13 PM, Neil Perrin <[EMAIL PROTECTED]> wrote:
>> Joe,
>>
>> I don't think adding a slog helped in this case. In fact I
>> believe it made performance worse. Previously the ZIL would be
>> spread out over all devices but now all synchronous traffic
>> is directed at one device (and everything is synchronous in NFS).
>> Mind you, 15MB/s seems a bit on the slow side - especially if
>> cache flushing is disabled.
>>
>> It would be interesting to see what all the threads are waiting
>> on. I think the problem may be that everything is backed
>> up waiting to start a transaction because the txg train is
>> slow due to NFS requiring the ZIL to push everything synchronously.
>>
> 
> I agree completely. The log (even though slow) was an attempt to
> isolate writes away from the pool. I guess the question is how to
> provide for async access for NFS. We may have 16, 32 or whatever
> threads, but if a single writer keeps the ZIL pegged, prohibiting
> reads, it's all for naught. Is there any way to tune/configure the
> ZFS/NFS combination to balance reads and writes so that neither
> starves the other? It's either feast or famine, or so tests have shown.

No, there's currently no way to give reads preference over writes.
All transactions get equal priority to enter a transaction group.
Three txgs can be outstanding, as we use a three-phase commit model:
open, quiescing, and syncing.
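
The time each txg spends in its syncing phase can be eyeballed with a
DTrace sketch like the one below; spa_sync is an internal kernel
function rather than a committed interface, so treat this as
illustrative only:

  dtrace -n 'fbt::spa_sync:entry { self->t = timestamp }' \
         -n 'fbt::spa_sync:return /self->t/ { @["spa_sync time (ms)"] = quantize((timestamp - self->t) / 1000000); self->t = 0 }'

If syncs routinely take many seconds while the slog stays pegged, that
is consistent with the "txg train" explanation above.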

Neil.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-16 Thread Joe Little
On Nov 16, 2007 9:17 PM, Joe Little <[EMAIL PROTECTED]> wrote:
> On Nov 16, 2007 9:13 PM, Neil Perrin <[EMAIL PROTECTED]> wrote:
> > Joe,
> >
> > I don't think adding a slog helped in this case. In fact I
> > believe it made performance worse. Previously the ZIL would be
> > spread out over all devices but now all synchronous traffic
> > is directed at one device (and everything is synchronous in NFS).
> > Mind you, 15MB/s seems a bit on the slow side - especially if
> > cache flushing is disabled.
> >
> > It would be interesting to see what all the threads are waiting
> > on. I think the problem may be that everything is backed
> > up waiting to start a transaction because the txg train is
> > slow due to NFS requiring the ZIL to push everything synchronously.
> >

Roch wrote this before (thus my interest in the log or an NVRAM-like solution):


"There are 2 independant things at play here.

a) NFS sync semantics conspire againts single thread performance with
any backend filesystem.
 However NVRAM normally offers some releaf of the issue.

b) ZFS sync semantics along with the Storage Software + imprecise
protocol in between, conspire againts ZFS performance
of some workloads on NVRAM backed storage. NFS being one of the
affected workloads.

The conjunction of the 2 causes worst than expected NFS perfomance
over ZFS backend running __on NVRAM back storage__.
If you are not considering NVRAM storage, then I know of no ZFS/NFS
specific problems.

Issue b) is being delt with, by both Solaris and Storage Vendors (we
need a refined protocol);

Issue a) is not related to ZFS and rather fundamental NFS issue.
Maybe future NFS protocol will help.


Net net; if one finds a way to 'disable cache flushing' on the
storage side, then one reaches the state
we'll be, out of the box, when b) is implemented by Solaris _and_
Storage vendor. At that point,  ZFS becomes a fine NFS
server not only on JBOD as it is today , both also on NVRAM backed
storage.

It's complex enough, I thougt it was worth repeating."



>
> I agree completely. The log (even though slow) was an attempt to
> isolate writes away from the pool. I guess the question is how to
> provide for async access for NFS. We may have 16, 32 or whatever
> threads, but if a single writer keeps the ZIL pegged, prohibiting
> reads, it's all for naught. Is there any way to tune/configure the
> ZFS/NFS combination to balance reads and writes so that neither
> starves the other? It's either feast or famine, or so tests have shown.
>
>
> > Neil.
> >
> >
> > Joe Little wrote:
> > > I have historically noticed that in ZFS, whenever there is a heavy
> > > writer to a pool via NFS, reads can be held back (basically paused).
> > > An example is a RAID10 pool of 6 disks, whereby writing a directory of
> > > files, including some large 100+MB in size, can cause other clients
> > > over NFS to pause for seconds (5-30 or so). This is on B70 bits.
> > > I've gotten used to this behavior over NFS, but didn't see it perform
> > > as such when on the server itself doing similar actions.
> > >
> > > To improve upon the situation, I thought perhaps I could dedicate a
> > > log device outside the pool, in the hopes that while heavy writes went
> > > to the log device, reads would merrily be allowed to coexist from the
> > > pool itself. My test case isn't ideal per se, but I added a local 9GB
> > > SCSI (80) drive for a log, and added two LUNs for the pool itself.
> > > You'll see from the below that while the log device is pegged at
> > > 15MB/sec (sd5), my directory list requests on devices sd15 and sd16
> > > are never answered. I tried this with both no-cache-flush enabled and
> > > off, with negligible difference. Is there any way to force a better
> > > balance of reads and writes during heavy writes?
> > >
> > >                          extended device statistics
> > > device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
> > > fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd2           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd3           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd4           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd5           0.0  118.0     0.0  15099.9   0.0  35.0  296.7   0 100
> > > sd6           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd7           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd8           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd9           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd10          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd11          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd12          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd13          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd14          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > > sd15          0.0    0.0     0

Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-16 Thread Joe Little
On Nov 16, 2007 9:13 PM, Neil Perrin <[EMAIL PROTECTED]> wrote:
> Joe,
>
> I don't think adding a slog helped in this case. In fact I
> believe it made performance worse. Previously the ZIL would be
> spread out over all devices but now all synchronous traffic
> is directed at one device (and everything is synchronous in NFS).
> Mind you, 15MB/s seems a bit on the slow side - especially if
> cache flushing is disabled.
>
> It would be interesting to see what all the threads are waiting
> on. I think the problem may be that everything is backed
> up waiting to start a transaction because the txg train is
> slow due to NFS requiring the ZIL to push everything synchronously.
>

I agree completely. The log (even though slow) was an attempt to
isolate writes away from the pool. I guess the question is how to
provide for async access for NFS. We may have 16, 32 or whatever
threads, but if a single writer keeps the ZIL pegged, prohibiting
reads, it's all for naught. Is there any way to tune/configure the
ZFS/NFS combination to balance reads and writes so that neither
starves the other? It's either feast or famine, or so tests have shown.
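
For completeness, the host-side knobs that usually come up in this
context are the /etc/system tunables below; both trade data safety for
throughput (disabling the ZIL in particular breaks the stable-storage
guarantee NFS clients rely on), so they are diagnostic experiments
rather than a fix:

  * /etc/system entries - a reboot is required for them to take effect
  * stop ZFS from sending cache-flush commands to the log/array devices
  set zfs:zfs_nocacheflush = 1
  * disable the ZIL entirely (unsafe for NFS clients; testing only)
  set zfs:zil_disable = 1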

> Neil.
>
>
> Joe Little wrote:
> > I have historically noticed that in ZFS, whenever there is a heavy
> > writer to a pool via NFS, reads can be held back (basically paused).
> > An example is a RAID10 pool of 6 disks, whereby writing a directory of
> > files, including some large 100+MB in size, can cause other clients
> > over NFS to pause for seconds (5-30 or so). This is on B70 bits.
> > I've gotten used to this behavior over NFS, but didn't see it perform
> > as such when on the server itself doing similar actions.
> >
> > To improve upon the situation, I thought perhaps I could dedicate a
> > log device outside the pool, in the hopes that while heavy writes went
> > to the log device, reads would merrily be allowed to coexist from the
> > pool itself. My test case isn't ideal per se, but I added a local 9GB
> > SCSI (80) drive for a log, and added two LUNs for the pool itself.
> > You'll see from the below that while the log device is pegged at
> > 15MB/sec (sd5), my directory list requests on devices sd15 and sd16
> > are never answered. I tried this with both no-cache-flush enabled and
> > off, with negligible difference. Is there any way to force a better
> > balance of reads and writes during heavy writes?
> >
> >                          extended device statistics
> > device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
> > fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd2           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd3           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd4           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd5           0.0  118.0     0.0  15099.9   0.0  35.0  296.7   0 100
> > sd6           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd7           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd8           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd9           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd10          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd11          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd12          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd13          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd14          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd15          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> > sd16          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> ...
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-16 Thread Neil Perrin
Joe,

I don't think adding a slog helped in this case. In fact I
believe it made performance worse. Previously the ZIL would be 
spread out over all devices but now all synchronous traffic
is directed at one device (and everything is synchronous in NFS).
Mind you, 15MB/s seems a bit on the slow side - especially if
cache flushing is disabled.

It would be interesting to see what all the threads are waiting
on. I think the problem may be that everything is backed
up waiting to start a transaction because the txg train is
slow due to NFS requiring the ZIL to push everything synchronously.

Neil.

Joe Little wrote:
> I have historically noticed that in ZFS, whenever there is a heavy
> writer to a pool via NFS, reads can be held back (basically paused).
> An example is a RAID10 pool of 6 disks, whereby writing a directory of
> files, including some large 100+MB in size, can cause other clients
> over NFS to pause for seconds (5-30 or so). This is on B70 bits.
> I've gotten used to this behavior over NFS, but didn't see it perform
> as such when on the server itself doing similar actions.
>
> To improve upon the situation, I thought perhaps I could dedicate a
> log device outside the pool, in the hopes that while heavy writes went
> to the log device, reads would merrily be allowed to coexist from the
> pool itself. My test case isn't ideal per se, but I added a local 9GB
> SCSI (80) drive for a log, and added two LUNs for the pool itself.
> You'll see from the below that while the log device is pegged at
> 15MB/sec (sd5), my directory list requests on devices sd15 and sd16
> are never answered. I tried this with both no-cache-flush enabled and
> off, with negligible difference. Is there any way to force a better
> balance of reads and writes during heavy writes?
> 
>                          extended device statistics
> device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
> fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd2           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd3           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd4           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd5           0.0  118.0     0.0  15099.9   0.0  35.0  296.7   0 100
> sd6           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd7           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd8           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd9           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd10          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd11          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd12          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd13          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd14          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd15          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
> sd16          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] pls discontinue troll bait was: Yager on ZFS and ZFS + DB + "fragments"

2007-11-16 Thread Al Hopper

I've been observing two threads on zfs-discuss with the following 
Subject lines:

Yager on ZFS
ZFS + DB + "fragments"

and have reached the rather obvious conclusion that the author "can 
you guess?" is a professional spinmeister who gave up a promising 
career in political speech writing to hassle the technical list 
membership on zfs-discuss.  To illustrate my viewpoint, I offer the 
following excerpts (reformatted from an obvious WinDoze Luser Mail 
client):

Excerpt 1:  Is this premium technical BullShit (BS) or what?

- BS 301 'grad level technical BS' ---

Still, it does drive up snapshot overhead, and if you start trying to 
use snapshots to simulate 'continuous data protection' rather than 
more sparingly the problem becomes more significant (because each 
snapshot will catch any background defragmentation activity at a 
different point, such that common parent blocks may appear in more 
than one snapshot even if no child data has actually been updated). 
Once you introduce CDP into the process (and it's tempting to, since 
the file system is in a better position to handle it efficiently than 
some add-on product), rethinking how one approaches snapshots (and COW 
in general) starts to make more sense.

- end of BS 301 'grad level technical BS' ---

Comment: Amazing: so many words, so little meaningful technical 
content!

Excerpt 2: Even better than Excerpt 1 - truly exceptional BullShit:

- BS 401 'PhD level technical BS' --

No, but I described how to use a transaction log to do so and later on 
in the post how ZFS could implement a different solution more 
consistent with its current behavior.  In the case of the transaction 
log, the key is to use the log not only to protect the RAID update but 
to protect the associated higher-level file operation as well, such 
that a single log force satisfies both (otherwise, logging the RAID 
update separately would indeed slow things down - unless you had NVRAM 
to use for it, in which case you've effectively just reimplemented a 
low-end RAID controller - which is probably why no one has implemented 
that kind of solution in a stand-alone software RAID product).

...
- end of BS 401 'PhD level technical BS' --

Go ahead and look up the full context of these exceptional BS excerpts 
and see if the full context brings any further enlightenment.  I think 
you'll quickly realize that, after reading the full context, this is 
nothing more than a complete waste of time and that there is nothing 
of technical value to be learned from this text.  In fact, there is very, 
very little to be learned from any posts on this list where the 
Subject line is either:

Yager on ZFS
ZFS + DB + "fragments"

and the author is: "can you guess? <[EMAIL PROTECTED]>"

I'm not, for a moment, suggesting that one can't learn *something* 
from the posts of the author "can you guess? 
<[EMAIL PROTECTED]>"... indeed there are significant 
spinmeistering skills to be learned from these posts; including how to 
combine portions of cited published technical studies (Google Study, 
CERN study) with a line of total semi-technical bullshit worthy of any 
political spinmeister working within the DC "Beltway Bandit" area. 
In fact, if I'm trying to conn^H^H^H^H talk someone out of several 
million dollars to fund a totally BS research project, I'll pay any 
reasonable fees that "can you guess?" would demand.  Because I'm 
convinced, that with his premium spinmeistering/BS skills - nothing is 
impossible: pigs can fly, NetApp == ZFS, the world is flat  and 
ZFS is a totally deficient technical design because they did'nt 
solicit his totally invaluable technical input.

And... one note of caution for Jeff Bonwick and Team ZFS - look out ... 
for this guy - because his new ZFS competitor filesystem, called, 
appropriately, GOMFS (Guess-O-Matic-File-System) is about to be 
released and it'll basically, if I understand "can you guess?"'s email 
fully, solve all the current ZFS design deficiencies, and totally 
dominate all *nix based filesystems for the next 400 years.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from "sugar-coating school"?  Sorry - I never attended! :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-16 Thread Joe Little
I have historically noticed that in ZFS, whenever there is a heavy
writer to a pool via NFS, reads can be held back (basically paused).
An example is a RAID10 pool of 6 disks, whereby writing a directory of
files, including some large 100+MB in size, can cause other clients
over NFS to pause for seconds (5-30 or so). This is on B70 bits.
I've gotten used to this behavior over NFS, but didn't see it perform
as such when on the server itself doing similar actions.

To improve upon the situation, I thought perhaps I could dedicate a
log device outside the pool, in the hopes that while heavy writes went
to the log device, reads would merrily be allowed to coexist from the
pool itself. My test case isn't ideal per se, but I added a local 9GB
SCSI (80) drive for a log, and added two LUNs for the pool itself.
You'll see from the below that while the log device is pegged at
15MB/sec (sd5), my directory list requests on devices sd15 and sd16
are never answered. I tried this with both no-cache-flush enabled and
off, with negligible difference. Is there any way to force a better
balance of reads and writes during heavy writes?
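
For reference, the slog setup being tested amounts to something like
the following; the pool and device names are illustrative, not the ones
used here:

  # dedicate a separate intent-log (slog) device to an existing pool
  zpool add tank log c2t5d0

  # then watch per-device activity while an NFS client writes
  iostat -xn 5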

                         extended device statistics
device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd2           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd3           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd4           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd5           0.0  118.0     0.0  15099.9   0.0  35.0  296.7   0 100
sd6           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd7           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd8           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd9           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd10          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd11          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd12          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd13          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd14          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd15          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd16          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
                         extended device statistics
device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd2           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd3           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd4           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd5           0.0  117.0     0.0  14970.1   0.0  35.0  299.2   0 100
sd6           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd7           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd8           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd9           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd10          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd11          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd12          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd13          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd14          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd15          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd16          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
                         extended device statistics
device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd2           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd3           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd4           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd5           0.0  118.1     0.0  15111.9   0.0  35.0  296.4   0 100
sd6           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd7           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd8           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd9           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd10          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd11          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd12          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd13          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd14          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd15          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd16          0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
                         extended device statistics
device        r/s    w/s    kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd0           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd1           0.0    0.0     0.0      0.0   0.0   0.0    0.0   0   0
sd2           0.0    0.0     0.0

Re: [zfs-discuss] How to destory a faulted pool

2007-11-16 Thread Marco Lopes
Manoj,


# zpool destroy -f mstor0


Regards,
Marco Lopes.


Manoj Nayak wrote:

>How can I destroy the following pool?
>
>pool: mstor0
>id: 5853485601755236913
> state: FAULTED
>status: One or more devices contains corrupted data.
>action: The pool cannot be imported due to damaged devices or data.
>   see: http://www.sun.com/msg/ZFS-8000-5E
>config:
>
>mstor0  UNAVAIL   insufficient replicas
>  raidz1UNAVAIL   insufficient replicas
>c5t0d0  FAULTED   corrupted data
>c4t0d0  FAULTED   corrupted data
>c1t0d0  ONLINE
>c0t0d0  ONLINE
>
>
>pool: zpool1
>id: 14693037944182338678
> state: FAULTED
>status: One or more devices are missing from the system.
>action: The pool cannot be imported. Attach the missing
>devices and try again.
>   see: http://www.sun.com/msg/ZFS-8000-3C
>config:
>
>zpool1  UNAVAIL   insufficient replicas
>  raidz1UNAVAIL   insufficient replicas
>c0t1d0  UNAVAIL   cannot open
>c1t1d0  UNAVAIL   cannot open
>c4t1d0  UNAVAIL   cannot open
>c6t1d0  UNAVAIL   cannot open
>c7t1d0  UNAVAIL   cannot open
>  raidz1UNAVAIL   insufficient replicas
>c0t2d0  UNAVAIL   cannot open
>c1t2d0  UNAVAIL   cannot open
>c4t2d0  UNAVAIL   cannot open
>c6t2d0  UNAVAIL   cannot open
>c7t2d0  UNAVAIL   cannot open
>___
>zfs-discuss mailing list
>zfs-discuss@opensolaris.org
>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>  
>


-- 

Marco S. Lopes
Senior Technical Specialist
US Systems Practice
Professional Services Delivery
Sun Microsystems
925 984 6611

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool io to 6140 is really slow

2007-11-16 Thread Asif Iqbal
I have the following layout

A 490 with 8 x 1.8GHz CPUs and 16G mem, and 6 x 6140s with 2 FC controllers,
using the A1 and B1 controller ports at 4Gbps.
Each controller has 2G NVRAM.

On the 6140s I set up one RAID0 LUN per SAS disk with a 16K segment size.

On the 490 I created a zpool with eight 4+1 raidz1 vdevs.

I am getting zpool IO of only 125MB/s with zfs:zfs_nocacheflush = 1 in
/etc/system

Is there a way I can improve the performance? I'd like to get 1GB/sec IO.

Currently each LUN is set up as primary on A1 and secondary on B1, or vice versa.

I also have write cache enabled, according to CAM.
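
Spelled out, a layout like the one described would be built roughly as
follows; the pool name and LUN names are invented, only the first two of
the eight raidz1 groups are written out, and in practice all eight go
into the one zpool create invocation:

  zpool create tank \
    raidz1 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 \
    raidz1 c4t5d0 c4t6d0 c4t7d0 c4t8d0 c4t9d0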

-- 
Asif Iqbal
PGP Key: 0xE62693C5 KeyServer: pgp.mit.edu
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cannot mount 'mypool': Input/output error

2007-11-16 Thread Eric Ham
On Nov 15, 2007 9:42 AM, Nabeel Saad <[EMAIL PROTECTED]> wrote:
> I am sure I will not use ZFS to its fullest potential at all.. right now I'm 
> trying to recover the dead disk, so if it works to mount a single disk/boot 
> disk, that's all I need, I don't need it to be very functional.  As I 
> suggested, I will only be using this to change permissions and then return 
> the disk into the appropriate Server once I am able to log back into that 
> server.

(Sorry, forgot to CC the list.)

Ok, so assuming that all you want to do is mount your old Solaris disk
and change some permissions, then there is probably an easier solution
which is to put the hard drive back in the original machine and boot
from a (Open)Solaris CD or DVD.  This eliminates the whole Linux/FUSE
issues you're getting into.  Your easiest option might be to try the
new OpenSolaris Developer Preview distribution since it's actually a
Live CD which would give you a full GUI and networking to play with.

http://www.opensolaris.org/os/downloads/

Once the Live CD boots, you should be able to mount your drive to an
alternate path like /a and then change permissions.  If you boot from
a regular Solaris CD or DVD it will start the install process, but
then you should be able to simply cancel the install and get to a
command line and work from there.
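
As a concrete sketch of that last step - the device, slice, and path
names below are placeholders, and this assumes the old boot slice is UFS:

  # identify the disk, then mount its root slice under /a
  format                            # note the disk name, e.g. c0t0d0
  mount /dev/dsk/c0t0d0s0 /a
  ls -l /a/export/home              # inspect what is wrong
  chmod -R u+rwX /a/export/home/someuser
  umount /a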

Good luck!

Regards,
-Eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS snapshot send/receive via intermediate device

2007-11-16 Thread Ross
Hey folks,

I have no knowledge at all about how streams work in Solaris, so this might 
have a simple answer, or be completely impossible.  Unfortunately I'm a windows 
admin so haven't a clue which :)

We're looking at rolling out a couple of ZFS servers on our network, and 
instead of tapes we're considering using off-site NAS boxes for backups.  We 
think there's likely to be too much data each day to send the incremental 
snapshots to the remote systems over the wire, so we're wondering if we can use 
removable disks instead to transport just the incremental changes.

The idea is that we can do the initial "zfs send" on-site with the NAS plugged 
on the network, and from then on we just need a 500GB removable disk to take 
the changes off site each night.

Let me be clear on that:  We're not thinking of storing the whole zfs pool on 
the removable disk, there's just too much data.  Instead, we want to use "zfs 
send -i" to store just the incremental changes on a removable disk, so we can 
then take that disk home and plug it into another device and use zfs receive to 
upload the changes.  Does anybody know if that's possible?
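
In outline it would look like the commands below; the pool, dataset, and
snapshot names are invented for illustration:

  # initial full copy, done on the LAN with the NAS attached
  zfs snapshot tank/data@mon
  zfs send tank/data@mon | ssh nas zfs receive backup/data

  # each following night: write only the incremental stream to the removable disk
  zfs snapshot tank/data@tue
  zfs send -i @mon tank/data@tue > /removable/data-mon-tue.zsend

  # off-site, apply the increment from the removable disk
  zfs receive backup/data < /removable/data-mon-tue.zsend

One caveat to weigh: a send stream stored as a file is all-or-nothing, so
a single bit error on the removable disk makes that whole increment
unreceivable.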

If it works it's a nice and simple off-site backup, with the added benefit that 
we have a very rapid disaster recovery response.  No need to waste time 
restoring from tape:  the off-site backup can be brought onto the network and 
data is accessible immediately.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] read/write NFS block size and ZFS

2007-11-16 Thread Richard Elling
msl wrote:
> Hello all...
>  I'm migrating an NFS server from Linux to Solaris, and all clients (Linux) are
> using read/write block sizes of 8192. That was the best performance that I
> got, and it's working pretty well (NFSv3). I want to use all of ZFS's
> advantages, and I know I can have a performance loss, so I want to know if
> there is a "recommendation" for block size on NFS/ZFS, or what you think about it.
>   

That is the network block transfer size.  The default is normally 32 kBytes.
I don't see any reason to change ZFS's block size to match.
You should follow the best practices as described at
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
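
For completeness, if a later database-style workload did call for
matching block sizes, the relevant knob is the per-dataset recordsize
property (the dataset name below is illustrative); for general NFS file
serving it is normally left at the default:

  zfs get recordsize tank/export
  zfs set recordsize=8k tank/export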

If you notice a performance issue with metadata updates, be sure to 
check out
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
 
 -- richard
> Must I test, or is there no need to make such configuration with ZFS?
> Thanks very much for your time!
> Leal.
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-16 Thread can you guess?
> can you guess?  metrocast.net> writes:
> >
> > You really ought to read a post before responding to it:  the CERN study
> > did encounter bad RAM (and my post mentioned that) - but ZFS usually can't
> > do a damn thing about bad RAM, because errors tend to arise either
> > before ZFS ever gets the data or after it has already returned and checked
> > it (and in both cases, ZFS will think that everything's just fine).
>
> According to the memtest86 author, corruption most often occurs at the
> moment memory cells are written to, by causing bitflips in adjacent cells.
> So when a disk DMAs data to RAM, and corruption occurs when the DMA
> operation writes to the memory cells, and then ZFS verifies the checksum,
> then it will detect the corruption.
>
> Therefore ZFS is perfectly capable (and even likely) to detect memory
> corruption during simple read operations from a ZFS pool.
>
> Of course there are other cases where neither ZFS nor any other
> checksumming filesystem is capable of detecting anything (e.g. the
> sequence of events: data is corrupted, checksummed, written to disk).

Indeed - the latter was the first of the two scenarios that I sketched out.  
But at least on the read end of things ZFS should have a good chance of 
catching errors due to marginal RAM.
That must mean that most of the worrisome alpha-particle problems of yore have 
finally been put to rest (since they'd be similarly likely to trash data on the 
read side after ZFS had verified it).  I think I remember reading that 
somewhere at some point, but I'd never gotten around to reading that far in the 
admirably-detailed documentation that accompanies memtest:  thanks for 
enlightening me.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool question

2007-11-16 Thread Mark J Musante
On Thu, 15 Nov 2007, Brian Lionberger wrote:

> The question is, should I create one zpool or two to hold /export/home
> and /export/backup?
> Currently I have one pool for /export/home and one pool for /export/backup.
>
> Should it be one pool for both? Would this be better, and why?

One thing to consider is that pools are the granularity of 'export' 
operations, so if you ever want to, for example, move the /export/backup 
disks to a new computer, but keep /export/home on the current computer, 
you couldn't do that easily if you create a pair of striped 2-way mirrors.
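
In concrete terms, with separate pools the move is just the following
(assuming the pool behind /export/backup is simply named "backup"):

  # on the old machine
  zpool export backup
  # move the disks physically, then on the new machine
  zpool import backup

With a single pool holding both filesystems, the disks can only move as
one unit.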


Regards,
markm
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on a raid box

2007-11-16 Thread Tom Mooney

A little extra info:
ZFS brings in a ZFS spare device the next time the pool is accessed, not 
a raidbox hot spare. Resilvering starts automatically and increases disk 
access times by about 30%. The first hour of estimated time left ( for 
5-6 TB pools ) is wildly inaccurate, but it starts to settle down after 
that.


Tom Mooney

Dan Pritts wrote:
> On Fri, Nov 16, 2007 at 11:31:00AM +0100, Paul Boven wrote:
> > Thanks for your reply. The SCSI-card in the X4200 is a Sun Single
> > Channel U320 card that came with the system, but the PCB artwork does
> > sport a nice 'LSI LOGIC' imprint.
>
> That is probably the same card i'm using; it's actually a "Sun" card
> but as you say is OEM by LSI.
>
> > So, just to make sure we're talking about the same thing here - your
> > drives are SATA,
>
> yes
>
> > you're exporting each drive through the Western
> > Scientific raidbox as a separate volume,
>
> yes
>
> > and zfs actually brings in a
> > hot spare when you pull a drive?
>
> yes
>
> OS is Sol10U4, system is an X4200, original hardware rev.
>
> > Over here, I've still not been able to accomplish that - even after
> > installing Nevada b76 on the machine, removing a disk will not cause a
> > hot-spare to become active, nor does resilvering start. Our Transtec
> > raidbox seems to be based on a chipset by Promise, by the way.
>
> I have heard some bad things about the Promise RAID boxes but I haven't
> had any direct experience.
>
> I do own one Promise box that accepts 4 PATA drives and exports them to a
> host as scsi disks.  Shockingly, it uses a master/slave IDE configuration
> rather than 4 separate IDE controllers.  It wasn't super expensive but
> it wasn't dirt cheap, either, and it seems it would have cost another
> $5 to manufacture the "right way."
>
> I've had fine luck with Promise $25 ATA PCI cards :)
>
> The infortrend units, on the other hand, I have had generally quite good
> luck with.  When I worked at UUNet in the late '90s we had hundreds of
> their SCSI RAIDs deployed.
>
> I do have an Infortrend FC-attached raid with SATA disks, which basically
> works fine.  It has an external JBOD, also with SATA disks, connecting to
> the main raid with FC.  Unfortunately, the RAID unit boots faster than
> the JBOD.  So, if you turn them on at the same time, it thinks the JBOD
> is gone and doesn't notice it's there until you reboot the controller.
>
> That caused a little pucker for my colleagues when it happened while i
> was on vacation.  The support guy at the reseller we were working with
> (NOT Western Scientific) told them the raid was hosed and they should
> rebuild from scratch, hope you had a backup.
>
> danno
> --
> Dan Pritts, System Administrator
> Internet2
> office: +1-734-352-4953 | mobile: +1-734-834-7224

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Need a 2-port PCI-X SATA-II controller for x86

2007-11-16 Thread Brian Hechinger
I'll be setting up a small server and need two SATA-II ports for an x86
box.  The cheaper the better.

Thanks!!

-brian
-- 
"Perl can be fast and elegant as much as J2EE can be fast and elegant.
In the hands of a skilled artisan, it can and does happen; it's just
that most of the shit out there is built by people who'd be better
suited to making sure that my burger is cooked thoroughly."  -- Jonathan 
Patschke
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-16 Thread Peter Schuller
> Brain damage seems a bit of an alarmist label. While you're certainly right
> that for a given block we do need to access all disks in the given stripe,
> it seems like a rather quaint argument: aren't most environments that
> matter trying to avoid waiting for the disk at all? Intelligent prefetch
> and large caches -- I'd argue -- are far more important for performance
> these days.

The concurrent small-i/o problem is fundamental though. If you have an 
application where you care only about random concurrent reads for example, 
you would not want to use raidz/raidz2 currently. No amount of smartness in 
the application gets around this. It *is* a relevant shortcoming of 
raidz/raidz2 compared to raid5/raid6, even if in many cases it is not 
significant.

If disk space is not an issue, striping across mirrors will be okay for random 
seeks. But if you also care about disk space, it's a showstopper unless you 
can throw money at the problem.
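
To make the trade-off concrete, the two layouts being compared are
created roughly like this (device names invented): the mirrored stripe
gives half the raw space but each top-level mirror can serve a random
read independently, while the raidz group gives (N-1)/N of the space but
a single random read touches every disk in the stripe:

  zpool create fastpool  mirror c1t0d0 c1t1d0  mirror c1t2d0 c1t3d0
  zpool create bigpool   raidz  c1t0d0 c1t1d0 c1t2d0 c1t3d0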

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slightly Off-Topic: Decent replication on Solaris using rNFS.

2007-11-16 Thread Richard Elling
Razvan Corneliu VILT wrote:
> Hi,
>
> In my infinite search for a reliable work-around for the lack of bandwidth in 
> the United States*, I've reached the conclusion that I need a file-system 
> replication solution for the data stored on my ZFS partition.
> I've noticed that I'm not the only one asking for this, but I still have no 
> clear answer on my options from Google.
> After looking into some reports on rNFS on citi.umich.edu, I found out that 
> I'm not the only one with the problem (go figure). I am not really up to date 
> with the NFSv4 spec and drafts, but I am curious if rNFS is part of the 
> current NFSv4 spec or of the upcoming 4.1, and if it's considered or 
> available for OpenSolaris, or if there are any alternatives (such as a 
> replicated ZFS solution that supports simultaneous r/w access on at least 2 
> geographically separate servers).
> Some might argue that QFS + Sun Cluster is the way to go, but I need a few 
> things that ZFS currently offers (NFSv4 ACLs and snapshots that Samba can be 
> made aware of), and will want to move to CIFS server as soon as it's 
> production quality.
> Generally, the write traffic on the Samba shares that need replication is 
> light (around 1GByte/day), but it does need to happen whenever there's a 
> change.
> I've tried creating a smart cron script that runs unison every minute (lame, 
> I know), but it does not replicate the NFSv4 ACLs, and it's a rather bad 
> approach to the problem to start with. A daemonized unison with support for 
> all the ZFS features that gets the file-change notifications from the kernel 
> along with a distributed lock manager might do the job, but it's something 
> that I'm not qualified to write.
> I am sure that what I'm looking for is not unheard of. I am hopeful that the 
> ZFS+Lustre integration in the future might allow me something like this, but 
> it doesn't sound like it's close.
>
> Any suggestions?!?
>   

AVS.  See http://www.opensolaris.org/os/project/avs/
Jim Dunham has a good blog and demo on using it with ZFS.
 -- richard

> Cheers,
> Razvan
>
> * Our Bucharest branch has access to 10 Mbits/sec internationally and 100 
> Mbits/sec nationally (fiber of course) with BGP and our own IP classes, for 
> around EUR 250. This is in contrast with our San Jose, CA branch, which has a 
> connectivity budget of $700 and can get only a bonded-T1 at best in that 
> money (a T1 is $500 ($399 + taxes)). I wish that the most economically 
> advanced country in the world could have a decent internet infrastructure.
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool question

2007-11-16 Thread Brian Lionberger
I have a zpool issue that I need to discuss.

My application is going to run on a 3120 with 4 disks. Two (mirrored) 
disks will represent /export/home and the other two (mirrored) will be 
/export/backup.

The question is, should I create one zpool or two to hold /export/home 
and /export/backup?
Currently I have one pool for /export/home and one pool for /export/backup.

Should it be one pool for both? Would this be better, and why?

Thanks for any help and advice.

Brian.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS for consumers WAS:Yager on ZFS

2007-11-16 Thread Paul Kraus
Splitting this thread and changing the subject to reflect that...

On 11/14/07, can you guess? <[EMAIL PROTECTED]> wrote:

> Another prominent debate in this thread revolves around the question of
> just how significant ZFS's unusual strengths are for *consumer* use.
> WAFL clearly plays no part in that debate, because it's available only
> on closed, server systems.

I am both a large systems administrator and a 'home user' (I
prefer that term to 'consumer'). I am also very slow to adopt new
technologies in either environment. We have started using ZFS at work
due to performance improvements (for our workload) over UFS (or any
other FS we tested). At home the biggest reason I went with ZFS for my
data is ease of management. I split my data up based on what it is ...
media (photos, movies, etc.), vendor stuff (software, datasheets,
etc.), home directories, and other misc. data. This gives me a good
way to control backups based on the data type. I know, this is all
more sophisticated than the typical home user. The biggest win for me
is that I don't have to partition my storage in advance. I build one
zpool and multiple datasets. I don't set quotas or reservations
(although I could).
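
Concretely, that kind of setup boils down to a handful of commands
(pool, device, and dataset names here are invented, not the actual
ones):

  zpool create home mirror c1d0 c2d0
  zfs create home/media
  zfs create home/vendor
  zfs create home/users
  # optional: cap or guarantee space per dataset, since all share one pool
  zfs set quota=200g home/media
  zfs set reservation=20g home/users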

So I suppose my argument for ZFS in home use is not data
integrity, but much simpler management, both short and long term.

-- 
Paul Kraus
Albacon 2008 Facilities
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS children stepping on parent

2007-11-16 Thread Philip
I was doing some disaster recovery testing with ZFS, where I did a mass backup 
of a family of ZFS filesystems using snapshots, destroyed them, and then did a 
mass restore from the backups.  The ZFS filesystems I was testing with had only 
one parent in the ZFS namespace; and the backup and restore went well until it 
came time to mount the restored ZFS filesystems.

Because I had destroyed everything but the zpool, there was no mountpoint set 
for the restored parent ZFS filesystem or for its children.  They were all 
restored, but unmounted.  I set the mountpoint property for the parent ZFS 
filesystem, and all its children mounted instantly as I expected;  but the 
parent failed to mount, because ZFS had created the mountpoints for the 
children before mounting the parent.

I had to unmount the children manually, delete their mountpoints, mount the 
parent manually, and then mount the children manually.  Is it supposed to work 
that way?
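
For reference, the manual untangling described above looks roughly like
this (names invented); setting the parent's mountpoint, or mounting the
parent, before the children avoids the problem in the first place:

  zfs unmount tank/restored/child1     # repeat for each mounted child
  rmdir /restored/child1 /restored/child2
  zfs mount tank/restored              # parent first
  zfs mount -a                         # then the children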
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + "fragments"

2007-11-16 Thread can you guess?
...

> I personally believe that since most people will have
> hardware LUN's
> (with underlying RAID) and cache, it will be
> difficult to notice
> anything. Given that those hardware LUN's might be
> busy with their own
> wizardry ;) You will also have to minimize the effect
> of the database
> cache ...

By definition, once you've got the entire database in cache, none of this 
matters (though filling up the cache itself takes some added time if the table 
is fragmented).

Most real-world databases don't manage to fit all or even mostly in cache, 
because people aren't willing to dedicate that much RAM to running them.  
Instead, they either use a lot less RAM than the database size or share the 
system with other activity that shares use of the RAM.

In other words, they use a cost-effective rather than a money-is-no-object 
configuration, but then would still like to get the best performance they can 
from it.

> 
> It will be a tough assignment ... maybe someone has
> already done this?
> 
> Thinking about this (very abstract) ... does it
> really matter?
> 
> [8KB-a][8KB-b][8KB-c]
> 
> So what it 8KB-b gets updated and moved somewhere
> else? If the DB gets
> a request to read 8KB-a, it needs to do an I/O
> (eliminate all
> caching). If it gets a request to read 8KB-b, it
> needs to do an I/O.
> 
> Does it matter that b is somewhere else ...

Yes, with any competently-designed database.

 it still
> needs to go get
> it ... only in a very abstract world with read-ahead
> (both hardware or
> db) would 8KB-b be in cache after 8KB-a was read.

1.  If there's no other activity on the disk, then the disk's track cache will 
acquire the following data when the first block is read, because it has nothing 
better to do.  But if the all the disks are just sitting around waiting for 
this table scan to get to them, then if ZFS has a sufficiently intelligent 
read-ahead mechanism it could help out a lot here as well:  the differences 
become greater when the system is busier.

2.  Even a moderately smart disk will detect a sequential access pattern if one 
exists and may read ahead at least modestly after having detected that pattern 
even if it *does* have other requests pending.

3.  But in any event any competent database will explicitly issue prefetches 
when it knows (and it *does* know) that it is scanning a table sequentially - 
and will also have taken pains to try to ensure that the table data is laid out 
such that it can be scanned efficiently.  If it's using disks that support 
tagged command queuing it may just issue a bunch of single-database-block 
requests at once, and the disk will organize them such that they can all be 
satisfied by a single streaming access; with disks that don't support queuing, 
the database can elect to issue a single large I/O request covering many 
database blocks, accomplishing the same thing as long as the table is in fact 
laid out contiguously on the medium (the database knows this if it's handling 
the layout directly, but when it's using a file system as an intermediary it 
usually can only hope that the file system has minimized file fragmentation).

> 
> Hmmm... the only way is to get some data :) *hehe*

Data is good, as long as you successfully analyze what it actually means:  it 
either tends to confirm one's understanding or to refine it.

- bill
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] read/write NFS block size and ZFS

2007-11-16 Thread Anton B. Rang
If you're running over NFS, the ZFS block size most likely won't have a 
measurable impact on your performance. Unless you've got multiple gigabit 
ethernet interfaces, the network will generally be the bottleneck rather than 
your disks, and NFS does enough caching at both client & server end to 
aggregate updates into large writes.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Slightly Off-Topic: Decent replication on Solaris using rNFS.

2007-11-16 Thread Razvan Corneliu VILT
Hi,

In my infinite search for a reliable work-around for the lack of bandwidth in 
the United States*, I've reached the conclusion that I need a file-system 
replication solution for the data stored on my ZFS partition.
I've noticed that I'm not the only one asking for this, but I still have no 
clear answer on my options from Google.
After looking into some reports on rNFS on citi.umich.edu, I found out that I'm 
not the only one with the problem (go figure). I am not really up to date with 
the NFSv4 spec and drafts, but I am curious if rNFS is part of the current 
NFSv4 spec or of the upcoming 4.1, and if it's considered or available for 
OpenSolaris, or if there are any alternatives (such as a replicated ZFS 
solution that supports simultaneous r/w access on at least 2 geographically 
separate servers).
Some might argue that QFS + Sun Cluster is the way to go, but I need a few 
things that ZFS currently offers (NFSv4 ACLs and snapshots that Samba can be 
made aware of), and will want to move to CIFS server as soon as it's production 
quality.
Generally, the write traffic on the Samba shares that need replication is light 
(around 1GByte/day), but it does need to happen whenever there's a change.
I've tried creating a smart cron script that runs unison every minute (lame, I 
know), but it does not replicate the NFSv4 ACLs, and it's a rather bad approach 
to the problem to start with. A daemonized unison with support for all the ZFS 
features that gets the file-change notifications from the kernel along with a 
distributed lock manager might do the job, but it's something that I'm not 
qualified to write.
I am sure that what I'm looking for is not unheard of. I am hopeful that the 
ZFS+Lustre integration in the future might allow me something like this, but it 
doesn't sound like it's close.

Any suggestions?!?
Cheers,
Razvan

* Our Bucharest branch has access to 10 Mbits/sec internationally and 100 
Mbits/sec nationally (fiber of course) with BGP and our own IP classes, for 
around EUR 250. This is in contrast with our San Jose, CA branch, which has a 
connectivity budget of $700 and can get only a bonded-T1 at best in that money 
(a T1 is $500 ($399 + taxes)). I wish that the most economically advanced 
country in the world could have a decent internet infrastructure.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on a raid box

2007-11-16 Thread Dan Pritts
On Fri, Nov 16, 2007 at 11:31:00AM +0100, Paul Boven wrote:
> Thanks for your reply. The SCSI-card in the X4200 is a Sun Single
> Channel U320 card that came with the system, but the PCB artwork does
> sport a nice 'LSI LOGIC' imprint.

That is probably the same card i'm using; it's actually a "Sun" card
but as you say is OEM by LSI.

> So, just to make sure we're talking about the same thing here - your
> drives are SATA, 

yes

> you're exporting each drive through the Western
> Scientific raidbox as a separate volume, 

yes

> and zfs actually brings in a
> hot spare when you pull a drive?

yes

OS is Sol10U4, system is an X4200, original hardware rev.

> Over here, I've still not been able to accomplish that - even after
> installing Nevada b76 on the machine, removing a disk will not cause a
> hot-spare to become active, nor does resilvering start. Our Transtec
> raidbox seems to be based on a chipset by Promise, by the way.

I have heard some bad things about the Promise RAID boxes but I haven't
had any direct experience.  

I do own one Promise box that accepts 4 PATA drives and exports them to a
host as scsi disks.  Shockingly, it uses a master/slave IDE configuration
rather than 4 separate IDE controllers.  It wasn't super expensive but
it wasn't dirt cheap, either, and it seems it would have cost another
$5 to manufacture the "right way."

I've had fine luck with Promise $25 ATA PCI cards :)

The infortrend units, on the other hand, I have had generally quite good
luck with.  When I worked at UUNet in the late '90s we had hundreds of
their SCSI RAIDs deployed.  

I do have an Infortrend FC-attached raid with SATA disks, which basically
works fine.  It has an external JBOD also SATA disks connecting to
the main raid with FC.  Unfortunately, The RAID unit boots faster than
the JBOD.  So, if you turn them on at the same time, it thinks the JBOD
is gone and doesn't notice it's there until you reboot the controller.

That caused a little pucker for my colleagues when it happened while i
was on vacation.  The support guy at the reseller we were working with
(NOT Western Scientific) told them the raid was hosed and they should
rebuild from scratch, hope you had a backup.  

danno
--
Dan Pritts, System Administrator
Internet2
office: +1-734-352-4953 | mobile: +1-734-834-7224
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4500 device disconnect problem persists

2007-11-16 Thread roland egle
We are having the same problem.

First with 125025-05 and then also with 125205-07.
Solaris 10 update 4 - now with all patches.


We opened a Case and got

T-PATCH 127871-02

We installed the Marvell driver binary 3 days ago.

T127871-02/SUNWckr/reloc/kernel/misc/sata
T127871-02/SUNWmv88sx/reloc/kernel/drv/marvell88sx
T127871-02/SUNWmv88sx/reloc/kernel/drv/amd64/marvell88sx
T127871-02/SUNWsi3124/reloc/kernel/drv/si3124
T127871-02/SUNWsi3124/reloc/kernel/drv/amd64/si3124 

It seems that this resolves the device reset problem and the nfsd crash on
the X4500 with one raidz2 pool and a lot of ZFS filesystems.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How to destory a faulted pool

2007-11-16 Thread Manoj Nayak
How can I destroy the following pool?

pool: mstor0
id: 5853485601755236913
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

mstor0  UNAVAIL   insufficient replicas
  raidz1UNAVAIL   insufficient replicas
c5t0d0  FAULTED   corrupted data
c4t0d0  FAULTED   corrupted data
c1t0d0  ONLINE
c0t0d0  ONLINE


pool: zpool1
id: 14693037944182338678
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

zpool1  UNAVAIL   insufficient replicas
  raidz1UNAVAIL   insufficient replicas
c0t1d0  UNAVAIL   cannot open
c1t1d0  UNAVAIL   cannot open
c4t1d0  UNAVAIL   cannot open
c6t1d0  UNAVAIL   cannot open
c7t1d0  UNAVAIL   cannot open
  raidz1UNAVAIL   insufficient replicas
c0t2d0  UNAVAIL   cannot open
c1t2d0  UNAVAIL   cannot open
c4t2d0  UNAVAIL   cannot open
c6t2d0  UNAVAIL   cannot open
c7t2d0  UNAVAIL   cannot open
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Macs & compatibility (was Re: Yager on ZFS)

2007-11-16 Thread Toby Thain

On 16-Nov-07, at 4:36 AM, Anton B. Rang wrote:

> This is clearly off-topic :-) but perhaps worth correcting --
>
>> Long-time MAC users must be getting used to having their entire world
>> disrupted and having to re-buy all their software. This is at  
>> least the
>> second complete flag-day (no forward or backwards compatibility)  
>> change
>> they've been through.
>
> Actually, no; a fair number of Macintosh applications written in  
> 1984, for the original Macintosh, still run on machines/OSes  
> shipped in 2006. Apple provided processor compatibility by  
> emulating the 68000 series on PowerPC, and the PowerPC on Intel;

Absolutely Anton, original poster deserves firm correction. Very  
little broke in either transition; Apple had excellent success with  
fast and reliable emulation (68K, classic runtime on OS X, PPC on  
Rosetta).


> and OS compatibility by providing essentially a virtual machine  
> running Mac OS 9 inside Mac OS X (up through 10.4).
>
> Sadly, Mac OS 9 applications no longer run on Mac OS 10.5, so it's  
> true that "the world is disrupted" now for those with software  
> written prior to 2000 or so.

I will miss MPW. I wish they would release sources so we could bring  
it native to OS X.

--Toby  (Mac user since 1986 or so).

>
> To make this vaguely Solaris-relevant, it's impressive that SunOS  
> 4.x applications still generally run on Solaris 10, at least on  
> SPARC systems, though Sun doesn't do processor emulation. Still not  
> very ZFS-relevant. :-)
>
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] "Not owner" from a zone

2007-11-16 Thread Peter Eriksson
Yeah, this is annoying. I'm seeing this on a Thumper running Update 3 too... 
Has this issue been fixed in Update 4 and/or current releases of OpenSolaris?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs on a raid box

2007-11-16 Thread Paul Boven
Hi Dan,

Dan Pritts wrote:
> On Tue, Nov 13, 2007 at 12:25:24PM +0100, Paul Boven wrote:

>> We've building a storage system that should have about 2TB of storage
>> and good sequential write speed. The server side is a Sun X4200 running
>> Solaris 10u4 (plus yesterday's recommended patch cluster), the array we
>> bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and
>> it's connected to the Sun through U320-scsi.
> 
> We are doing basically the same thing with simliar Western Scientific
> (wsm.com) raids, based on infortrend controllers.  ZFS notices when we
> pull a disk and goes on and does the right thing.
> 
> I wonder if you've got a SCSI card/driver problem.  We tried using
> an Adaptec card with Solaris, with poor results; we switched to LSI and
> it "just works".

Thanks for your reply. The SCSI-card in the X4200 is a Sun Single
Channel U320 card that came with the system, but the PCB artwork does
sport a nice 'LSI LOGIC' imprint.

So, just to make sure we're talking about the same thing here - your
drives are SATA, you're exporting each drive through the Western
Scientific raidbox as a separate volume, and ZFS actually brings in a
hot spare when you pull a drive?

Over here, I've still not been able to accomplish that - even after
installing Nevada b76 on the machine, removing a disk will not cause a
hot-spare to become active, nor does resilvering start. Our Transtec
raidbox seems to be based on a chipset by Promise, by the way.
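
For reference, the behaviour I was hoping to see works roughly like this
(a minimal sketch with hypothetical pool and device names, not our actual
layout):

   # add a hot spare to the pool
   zpool add tank spare c3t5d0
   # after pulling a disk, check whether the spare was brought in
   zpool status tank
   # if it wasn't, a manual replace should at least start a resilver
   # onto the spare
   zpool replace tank c3t2d0 c3t5d0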

Regards, Paul Boven.
-- 
Paul Boven <[EMAIL PROTECTED]> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS mirror and sun STK 2540 FC array

2007-11-16 Thread Ben
Hi all,

we have just bought a Sun X2200 M2 (4 GB RAM / 2 Opteron 2214 CPUs /
2 x 250 GB SATA2 disks, Solaris 10 update 4)
and a Sun STK 2540 FC array (8 x 146 GB SAS disks, 1 RAID controller).
The server is attached to the array with a single 4 Gb Fibre Channel link.

I want to build a ZFS mirror on this array.

I have created 2 RAID0 volumes on the array
(128 KB stripe), presented to the host as lun0 and lun1.

So, on the host:
bash-3.00# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c1d0
      /[EMAIL PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
   1. c2d0
      /[EMAIL PROTECTED],0/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0
   2. c6t600A0B800038AFBC02F7472155C0d0
      /scsi_vhci/[EMAIL PROTECTED]
   3. c6t600A0B800038AFBC02F347215518d0
      /scsi_vhci/[EMAIL PROTECTED]
Specify disk (enter its number):

bash-3.00# zpool create tank mirror \
    c6t600A0B800038AFBC02F347215518d0 c6t600A0B800038AFBC02F7472155C0d0

bash-3.00# df -h /tank
Filesystem             size   used  avail capacity  Mounted on
tank                   532G    24K   532G     1%    /tank


I have tested the performance with a simple dd command:

time dd if=/dev/zero of=/tank/testfile bs=1024k count=1
time dd if=/tank/testfile of=/dev/null bs=1024k count=1

which gives:
# local throughput (STK 2540, ZFS mirror /tank)
read   232 MB/s
write  175 MB/s

# just to test the max perf I did:
zpool destroy -f tank
zpool create -f pool c6t600A0B800038AFBC02F347215518d0

And the same basic dd on the single-LUN pool (/pool) gives me:
read   320 MB/s
write  263 MB/s

Just to give an idea, an SVM mirror of the two local SATA2 disks
gives:
read  58 MB/s
write 52 MB/s

So, in production the ZFS mirror /tank will be used to hold
our home directories (10 users with 10 GB each),
our project files (200 GB, mostly text files and a CVS database),
and some vendor tools (100 GB).
People will access the data (/tank) over NFSv4 from their
workstations (Sun Ultra 20 M2 running CentOS 4 update 5).

On the Ultra 20 M2, the same basic test over NFSv4 gives:
read  104 MB/s
write  63 MB/s

At this point, I have the following questions:
-- Does anyone have similar figures for the STK 2540 with ZFS?

-- Instead of creating only 2 volumes on the array,
   what do you think of exporting 8 volumes (one per disk)
   and building four two-way mirrors, e.g.:
   zpool create tank mirror c6t6001.. c6t6002.. mirror c6t6003..
   c6t6004.. {...} mirror c6t6007.. c6t6008..
   (a fuller sketch follows just below)
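
   Spelled out, this is what I have in mind -- the c6t600x.. names are
   just abbreviated placeholders for the 8 single-disk LUNs:

   # four two-way mirrors, one per pair of single-disk LUNs
   zpool create tank \
       mirror c6t6001.. c6t6002.. \
       mirror c6t6003.. c6t6004.. \
       mirror c6t6005.. c6t6006.. \
       mirror c6t6007.. c6t6008..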

-- I will add 4 disks to the array next summer.
   Do you think I should create 2 new LUNs on the array
   and simply do a
   zpool add tank mirror c6t6001..(lun3) c6t6001..(lun4)
   (sketched below), or rebuild the 2 LUNs from scratch (6-disk RAID0 each)
   together with the tank pool
   (i.e.: back up /tank - zpool destroy - add the disks - reconfigure the
   array - zpool create tank ... - restore the backed-up data)?
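
   Option 1 sketched out, with placeholder names for the two new LUNs:

   # grow the existing pool with a third mirrored pair (lun3 and lun4)
   zpool add tank mirror c6t6009.. c6t6010..
   # the new mirror shows up as an additional top-level vdev
   zpool status tank

   No backup/restore or downtime is needed, but note that existing data is
   not rebalanced onto the new mirror; only new writes spread across it.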

-- I am thinking of running a scrub once a month (see the sketch below).
   Is that sufficient?
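
   What I have in mind, assuming the standard zpool tooling:

   # kick off a scrub by hand and watch its progress
   zpool scrub tank
   zpool status tank
   # or schedule it from root's crontab, e.g. 02:00 on the 1st of each month
   0 2 1 * * /usr/sbin/zpool scrub tank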

-- Do you have any comments on the performance seen from the NFSv4 client?

If you have any other advice or suggestions, feel free to share.

Thanks,  
 
 Benjamin
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Yager on ZFS

2007-11-16 Thread Adam Leventhal
On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote:
> > How so? In my opinion, it seems like a cure for the brain damage of RAID-5.
> 
> Nope.
> 
> A decent RAID-5 hardware implementation has no 'write hole' to worry about, 
> and one can make a software implementation similarly robust with some effort 
> (e.g., by using a transaction log to protect the data-plus-parity 
> double-update or by using COW mechanisms like ZFS's in a more intelligent 
> manner).

Can you reference a software RAID implementation that solves the write hole
and performs well? My understanding (based on what I've been told by people
more knowledgeable in this domain than I am) is that software RAID has
struggled to provide both correctness and acceptable performance.
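
(For context, the small-write path that opens the "write hole" is the
read-modify-write parity update -- a toy illustration in bash, not any
particular implementation:

   # RAID-5 small write: new parity = old parity XOR old data XOR new data
   # toy byte values, purely illustrative
   old_data=0x5a; new_data=0x3c; old_parity=0x99
   new_parity=$(( old_parity ^ old_data ^ new_data ))
   printf 'new parity: 0x%02x\n' "$new_parity"

The hole is the window in which the new data block and the new parity block
are written non-atomically; a crash in between leaves the stripe silently
inconsistent.)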

> The part of RAID-Z that's brain-damaged is its 
> concurrent-small-to-medium-sized-access performance (at least up to request 
> sizes equal to the largest block size that ZFS supports, and arguably 
> somewhat beyond that):  while conventional RAID-5 can satisfy N+1 
> small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in 
> parallel (though the latter also take an extra rev to complete), RAID-Z can 
> satisfy only one small-to-medium access request at a time (well, plus a 
> smidge for read accesses if it doesn't verify the parity) - effectively 
> providing RAID-3-style performance.

Brain damage seems a bit of an alarmist label. While you're certainly right
that for a given block we do need to access all disks in the given stripe,
it seems like a rather quaint argument: aren't most environments that matter
trying to avoid waiting for the disk at all? Intelligent prefetch and large
caches -- I'd argue -- are far more important for performance these days.

> The easiest way to fix ZFS's deficiency in this area would probably be to map 
> each group of N blocks in a file as a stripe with its own parity - which 
> would have the added benefit of removing any need to handle parity groups at 
> the disk level (this would, incidentally, not be a bad idea to use for 
> mirroring as well, if my impression is correct that there's a remnant of 
> LVM-style internal management there).  While this wouldn't allow use of 
> parity RAID for very small files, in most installations they really don't 
> occupy much space compared to that used by large files so this should not 
> constitute a significant drawback.

I don't really think this would be feasible given how ZFS is stratified
today, but go ahead and prove me wrong: here are the instructions for
bringing over a copy of the source code:

  http://www.opensolaris.org/os/community/tools/scm

- ahl

-- 
Adam Leventhal, FishWorks                        http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss