Re: [zfs-discuss] ZFS on a damaged disk
On 12 December, 2006 - Patrick P Korsnick sent me these 1,1K bytes:

> i have a machine with a disk that has some sort of defect and i've found that if i partition only half of the disk that the machine will still work. i tried to use 'format' to scan the disk and find the bad blocks, but it didn't work.
>
> so as i don't know where the bad blocks are but i'd still like to use some of the rest of the disk, i thought ZFS might be able to help. i partitioned the disk so slices 4,5,6 and 7 are each 5GB. i thought i'd make one or multiple zpools on those slices and then i'd be able to narrow down where the bad sections are.
>
> so my question is can i declare a zpool that spans multiple c0d0sXX but isn't a mirror and if i can, then will zfs be able to detect where the problem c0d0sXX is and not use it? if not, i'll have to make 4 different zpools and experiment with storing stuff on each to find the approximate location of the bad blocks.

Either create 4 separate pools (zpool create slice4 c0d0s4; zpool create slice5 c0d0s5; and so on) and then torture each of them to see where it's corrupted. Or you can, for instance, create a raidz(2) of those 4 and watch performance go downhill, but still work:

  zpool create broken raidz2 c0d0s4 c0d0s5 c0d0s6 c0d0s7

/Tomas
--
Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
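A rough sketch of that "torture" step, assuming the slice names above (the file name and sizes here are made up, and this is untested on a half-broken disk):

  zpool create slice4 c0d0s4
  # fill the pool with throwaway data so most blocks get touched
  dd if=/dev/urandom of=/slice4/junk bs=1024k count=4000
  # ask ZFS to re-read and verify every allocated block
  zpool scrub slice4
  zpool status -v slice4    # persistent READ/CKSUM errors point at a bad slice

Repeat for slice5, slice6 and slice7; the pools that scrub clean sit on the usable parts of the disk.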
[zfs-discuss] ZFS on a damaged disk
i have a machine with a disk that has some sort of defect and i've found that if i partition only half of the disk that the machine will still work. i tried to use 'format' to scan the disk and find the bad blocks, but it didn't work. so as i don't know where the bad blocks are but i'd still like to use some of the rest of the disk, i thought ZFS might be able to help. i partitioned the disk so slices 4,5,6 and 7 are each 5GB. i thought i'd make one or multiple zpools on those slices and then i'd be able to narrow down where the bad sections are. so my question is can i declare a zpool that spans multiple c0d0sXX but isn't a mirror and if i can, then will zfs be able to detect where the problem c0d0sXX is and not use it? if not, i'll have to make 4 different zpools and experiment with storing stuff on each to find the approximate location of the bad blocks. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Kickstart hot spare attachment
> If the SCSI commands hang forever, then there is nothing that ZFS can do, as a single write will never return. The more likely case is that the commands are continually timing out with very long response times, and ZFS will continue to talk to them forever.

It looks like the sd driver defaults to a 60-second timeout, which is quite long. It might be useful if FMA saw a potential fault for any I/O longer than some much lower value. (This gets tricky with power management, since if you have to wait for the disk to spin up, it can take a long time compared to normal I/O.)

That said, it sounds to me like your enclosure is actually powering down the drive. If so, it ought to stop responding to selection, and I/O should fail in a "hard" way within 250 ms (or less, depending on whether you've got a SCSI bus which supports QAS, as the newer, faster versions do).

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS and write caching (SATA)
It took manufacturers of SCSI drives some years to get this right. Around 1997 or so we were still seeing drives at my former employer that didn't properly flush their caches under all circumstances (and had other "interesting" behaviours WRT caching). Lots of ATA disks never did bother to implement the write cache controls. I haven't talked recently with any vendors who have been sourcing SATA disks, so I don't know what they're seeing. Generally the major players have their own disk qualification suites and often wind up with custom firmware because they want all of their detected bugs fixed before they'll accept a particular disk. If you buy a disk off-the-shelf, you get a drive that's gone through the disk manufacturer's testing (which is good, don't get me wrong) but hasn't been qualified with the particular commands or configuration that a particular operating system or file system might send. If you can do your own tests, that would be best; but that involves executing a flush (with all the various combinations of commands outstanding, dirty vs. clean cache buffers, etc.) and immediately powering off the device, which generally can't be done without special hardware. My *hunch* is that "enterprise-class" SATA disks have probably gone through more of this sort of testing than consumer SATA, even at the drive manufacturers. (It's not at all the same firmware.) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Usage in Warehousing (lengthy intro)
> http://www.norcotek.com/item_detail.php?categoryid=8&modelno=DS-1220

Yeah, SiI3726 multipliers are cool:
http://cooldrives.com/cosapomubrso.html
http://cooldrives.com/mac-port-multiplier-sata-case.html

But finding PCI-X slots for Ying Tian's si3124 or marvell88sx cards is getting tricky, even harder at 133 MHz. The 1x PCIe two-SATA si3132 card should come up
http://elektronkind.org/category/geekery/solaris/
but has issues:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6404812
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6492430
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6492427
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=2133861

What would be nice is support for Marvell's 88SX7042 4x PCIe four-SATA card:
http://www.amug.org/amug-web/html/amug/reviews/articles/sonnet/e4p/

An easier bet is AMD's 4x4 platform
http://www.tomshardware.com/2006/11/30/brute_force_quad_cores/page6.html
with its watered-down Professional 3600 chipset
http://www.nvidia.com/page/pg_20060814366736.html
that would likely "just work" with 12 SATA ports.

Man, if someone would sell me a diskless thumper... it's an impressive grouping of PCI-X slots.

Rob
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS Storage Pool advice
> We're looking for pure performance.
>
> What will be contained in the LUNs is student user account files that they will access, and department share files like MS Word documents, Excel files, PDF. There will be no applications on the ZFS storage pools or pool. Does this help on what strategy might be best?

I think so. I would suggest striping a single pool across all available LUNs, then. (I'm presuming that you would be prepared to recover from ZFS-detected errors by reloading from backup.) There doesn't seem to be any compelling reason to split your storage into multiple pools, and by using a single pool, you don't have to worry about reallocating storage if one pool fills up while another has free space.

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
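For example, a minimal sketch of that layout (the device names are placeholders, not the actual LUNs):

  # one pool dynamically striped across all three LUNs
  zpool create studata c4t0d0 c4t1d0 c4t2d0
  # separate file systems for the two kinds of data, all drawing from the same free space
  zfs create studata/students
  zfs create studata/deptshare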
[zfs-discuss] Re: Uber block corruption?
> Also note that the UB is written to every vdev (4 per disk) so the > chances of all UBs being corrupted is rather low. The chances that they're corrupted by the storage system, yes. However, they are all sourced from the same in-memory buffer, so an undetected in-memory error (e.g. kernel bug) will be replicated to all vdevs. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS behavior under heavy load (I/O that is)
I think you may be observing that fsync() is slow. The file will be written, and visible to other processes via the in-memory cache, before the data has been pushed to disk. vi forces the data out via fsync, and that can be quite slow when the file system is under load, especially before a fix which allows fsync to work on a per-file basis. (In the S10U2 aka 6/06 Solaris release, fsync on ZFS forced all changes to disk, not just those of the requested file.) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
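If you want to confirm that fsync() is what vi is stuck in, a DTrace sketch along these lines should show the latency distribution (untested; note that on Solaris the fsync(3C) call enters the kernel as the fdsync syscall):

  dtrace -n '
  syscall::fdsync:entry { self->ts = timestamp; }
  syscall::fdsync:return /self->ts/ {
          @["fsync latency (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
  }'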
Re: [zfs-discuss] Monitoring ZFS
Thanks, Neil, for the assistance.

Tom

Neil Perrin wrote On 12/12/06 19:59,:
> Tom Duell wrote On 12/12/06 17:11,:
>> Group,
>>
>> We are running a benchmark with 4000 users simulating a hospital management system running on Solaris 10 6/06 on a USIV+ based SunFire 6900 with a 6540 storage array.
>>
>> Are there any tools for measuring internal ZFS activity to help us understand what is going on during slowdowns?
>
> dtrace can be used in numerous ways to examine every part of ZFS and Solaris. lockstat(1M) (which actually uses dtrace underneath) can also be used to see the cpu activity (try lockstat -kgIW -D 20 sleep 10).
>
> You can also use iostat (eg iostat -xnpcz) to look at disk activity.

Yes, we are doing this and the disks are performing extremely well.

>> We have 192GB of RAM and while ZFS runs well most of the time, there are times where the system time jumps up to 25-40% as measured by vmstat and iostat. These times coincide with slowdowns in file access as measured by a side program that simply reads a random block in a file... these response times can exceed 1 second or longer.
>
> ZFS commits transaction groups every 5 seconds. I suspect this flurry of activity is due to that. Committing can indeed take longer than a second.
>
> You might be able to show this by changing it with:
>
> # echo txg_time/W 10 | mdb -kw
>
> then the activity should be longer but less frequent. I don't however recommend you keep it at that value.

Thanks, we may try that to see what effects it might have.

>> Any pointers greatly appreciated!
>>
>> Tom

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS behavior under heavy load (I/O that is)
I'm observing the following behavior on our E2900 (24 x 92 config), 2 FCs, and ... I've a large filesystem (~758GB) with compress mode on. When this filesystem is under heavy load (>150MB/s) I've problems saving files in 'vi'. I posted here about it and recall that the issue is addressed in Sol10U3. This morning I observed another variation of this problem as follows:

- Create a file in 'vi' and save it; the session will hang as if it is waiting for the write to complete.
- In another session you'll observe the write from 'vi' is indeed complete, as evidenced by the contents of the file.

Am I repeating myself here or is it a different problem altogether?

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Monitoring ZFS
Tom Duell wrote On 12/12/06 17:11,:
> Group,
>
> We are running a benchmark with 4000 users simulating a hospital management system running on Solaris 10 6/06 on a USIV+ based SunFire 6900 with a 6540 storage array.
>
> Are there any tools for measuring internal ZFS activity to help us understand what is going on during slowdowns?

dtrace can be used in numerous ways to examine every part of ZFS and Solaris. lockstat(1M) (which actually uses dtrace underneath) can also be used to see the cpu activity (try lockstat -kgIW -D 20 sleep 10).

You can also use iostat (eg iostat -xnpcz) to look at disk activity.

> We have 192GB of RAM and while ZFS runs well most of the time, there are times where the system time jumps up to 25-40% as measured by vmstat and iostat. These times coincide with slowdowns in file access as measured by a side program that simply reads a random block in a file... these response times can exceed 1 second or longer.

ZFS commits transaction groups every 5 seconds. I suspect this flurry of activity is due to that. Committing can indeed take longer than a second.

You might be able to show this by changing it with:

# echo txg_time/W 10 | mdb -kw

then the activity should be longer but less frequent. I don't however recommend you keep it at that value.

> Any pointers greatly appreciated!
>
> Tom

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Uber block corruption?
> Hello Toby, > > Tuesday, December 12, 2006, 4:18:54 PM, you wrote: > TT> On 12-Dec-06, at 9:46 AM, George Wilson wrote: > > >> Also note that the UB is written to every vdev (4 per disk) so the > >> chances of all UBs being corrupted is rather low. > > It depends actually - if all your vdevs are on the same array with > write back cache set to on you actually can end-up with all UB > corrupted - at least in theory. Do such caches respond to explicit flushes? My understanding is that it should try to flush between writing the front 2 and the back 2. Not that even that would guarantee anything if there are real bugs in the cache code, but it would improve the odds. -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOShttp://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area < This line left intentionally blank to confuse you. > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Monitoring ZFS
Group,

We are running a benchmark with 4000 users simulating a hospital management system running on Solaris 10 6/06 on a USIV+ based SunFire 6900 with a 6540 storage array.

Are there any tools for measuring internal ZFS activity to help us understand what is going on during slowdowns?

We have 192GB of RAM and while ZFS runs well most of the time, there are times where the system time jumps up to 25-40% as measured by vmstat and iostat. These times coincide with slowdowns in file access as measured by a side program that simply reads a random block in a file... these response times can exceed 1 second or longer.

Any pointers greatly appreciated!

Tom

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS Storage Pool advice
Also there will be no NFS services on this system. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS Storage Pool advice
We're looking for pure performance. What will be contained in the LUNs is student user account files that they will access, and department share files like MS Word documents, Excel files, and PDFs. There will be no applications on the ZFS storage pools or pool. Does this help determine what strategy might be best? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance problems during 'destroy' (and bizarre Zone problem as well)
Anantha N. Srirama wrote: - Why is the destroy phase taking so long? Destroying clones will be much faster with build 53 or later (or the unreleased s10u4 or later) -- see bug 6484044. - What can explain the unduly long snapshot/clone times - Why didn't the Zone startup? - More surprisingly why did the Zone startup after an hour? Perhaps there was so much activity on the system that we couldn't push out transaction groups in the usual < 5 seconds. 'zfs snapshot' and 'zfs clone' take at least 1 transaction group to complete, so this could explain it. We've seen this problem as well and are working on a fix... --mat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and write caching (SATA)
> PS> While I do intend to perform actual powerloss tests, it would be interesting to hear from anybody whether it is generally expected to be safe.
>
> Well, if disks honor cache flush commands then it should be reliable whether it's a SATA or SCSI disk.

Yes. Sorry, I could have stated my question clearer. What I am specifically concerned about is exactly that - whether your typical SATA drive *will* honor cache flush commands, as I understand a lot of PATA drives did/do not. Googling tends to give very little concrete information on this since very few people actually seem to care about this.

Since I wanted to confirm my understanding of ZFS semantics w.r.t. write caching anyway, I thought I might as well also ask about the general tendency among drives since, if anywhere, people here might know.

--
/ Peter Schuller, InfiDyne Technologies HB
PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
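One data point that is easy to collect from Solaris (for disks driven by sd/ssd; the menu names below are from memory, so treat them as approximate): format in expert mode will at least tell you whether the drive reports its volatile write cache as enabled:

  format -e
  # select the disk, then:
  format> cache
  cache> write_cache
  write_cache> display

That doesn't prove the drive honors flush-cache commands, but a drive that won't even report or toggle its cache setting is not an encouraging sign.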
Re: [zfs-discuss] ZFS Storage Pool advice
Hi Kory,

It depends on the capabilities of your array in our experience... and also the zpool type. If you're going to do RAID-Z in a write intensive environment you're going to have a lot more I/Os with three LUNs than with a single large LUN. Your controller may go nutty.

Also (Richard can address this better than I), you may want to disable the ZIL or have your array ignore the write cache flushes that ZFS issues.

Best Regards,
Jason

On 12/12/06, Kory Wheatley <[EMAIL PROTECTED]> wrote:
> This question is concerning ZFS. We have a Sun Fire V890 attached to an EMC disk array. Here's our plan to incorporate ZFS: On our EMC storage array we will create 3 LUNs. Now how would ZFS be used for the best performance? What I'm trying to ask is if you have 3 LUNs and you want to create a ZFS storage pool, would it be better to have a storage pool per LUN or combine the 3 LUNs as one big disk under ZFS and create 1 huge ZFS storage pool.
>
> Example:
> LUN1 200gb ZFS Storage Pool "pooldata1"
> LUN2 200gb ZFS Storage Pool "pooldata2"
> LUN3 200gb ZFS Storage Pool "pooldata3"
> or
> LUN 600gb ZFS Storage Pool "alldata"

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
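For reference, the ZIL half of that suggestion is controlled by the zil_disable tunable (disabling it trades away synchronous-write semantics, so it is generally discouraged outside of testing); a sketch from memory, for Solaris 10-era bits:

  # in /etc/system, takes effect on the next boot
  set zfs:zil_disable = 1

  # or temporarily on a live system, for testing only
  echo zil_disable/W0t1 | mdb -kw

Whether the array can be told to ignore SCSI cache-flush requests is array-specific and not something ZFS controls.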
Re: [zfs-discuss] SunCluster HA-NFS from Sol9/VxVM to Sol10u3/ZFS
Robert Milkowski wrote:
> Hello Matthew,
>
> MCA> Also, I am considering what type of zpools to create. I have a SAN with T3Bs and SE3511s. Since neither of these can work as a JBOD (at least that is what I remember) I guess I am going to have to add in the LUNs in a mirrored zpool of the RAID-5 LUNs?
>
> 1. those boxes can work as JBODs but not in a clustered environment.

Actually, those boxes can't act as JBODs. They only present LUNs created from the drives in the enclosures.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and write caching (SATA)
Hello Peter,

Tuesday, December 12, 2006, 11:18:32 PM, you wrote:

PS> Hello, my understanding is that ZFS is specifically designed to work with write caching, by instructing drives to flush their caches when a write barrier is needed. And in fact, even turns write caching on explicitly on managed devices.
PS> My question is of a practical nature: will this *actually* be safe on the average consumer grade SATA drive? I have seen offhand references to PATA drives generally not being trustworthy when it comes to this (SCSI therefore being recommended), but I have not been able to find information on the status of typical SATA drives.
PS> While I do intend to perform actual powerloss tests, it would be interesting to hear from anybody whether it is generally expected to be safe.

Well, if disks honor cache flush commands then it should be reliable whether it's a SATA or SCSI disk.

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]    http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Sol10u3 -- is "du" bug fixed?
Hello Anton,

Tuesday, December 12, 2006, 9:36:41 PM, you wrote:

ABR> Is there an easy way to determine whether a pool has this fix applied or not?

Yep. Just do 'df -h' and see what the reported size of the pool is. It should be something like N-1 times the disk size for each raid-z group. If it is N times the disk size then the pool was created before the fix.

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]    http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
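Illustrative only (the sizes below are made up for a 3 x 400GB raid-z group):

  df -h /mypool
  # ~800G reported  -> roughly (N-1) x disk size: pool created with the fix
  # ~1.2T reported  -> roughly  N    x disk size: pool created before the fix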
[zfs-discuss] ZFS and write caching (SATA)
Hello, my understanding is that ZFS is specifically designed to work with write caching, by instructing drives to flush their caches when a write barrier is needed. And in fact, even turns write caching on explicitly on managed devices. My question is of a practical nature: will this *actually* be safe on the average consumer grade SATA drive? I have seen offhand references to PATA drives generally not being trustworthy when it comes to this (SCSI therefore being recommended), but I have not been able to find information on the status of typical SATA drives. While I do intend to perform actual powerloss tests, it would be interesting to hear from anybody whether it is generally expected to be safe. -- / Peter Schuller, InfiDyne Technologies HB PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>' Key retrieval: Send an E-Mail to [EMAIL PROTECTED] E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Netapp to Solaris/ZFS issues
> NetApp can actually grow their RAID groups, but they recommend adding > an entire RAID group at once instead. If you add a disk to a RAID > group on NetApp, I believe you need to manually start a reallocate > process to balance data across the disks. There's no reallocation process that I'm aware of. Obviously adding a single column to a pretty full volume prevents you from doing the most optimal (full-stripe) writes. But since the existing parity disk covers the new column, you do have full availability of the new space. That's a different story with raidz. Hopefully you don't wait until the raid group is full before adding disks, and the blocks sort themselves out over time. -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOShttp://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area < This line left intentionally blank to confuse you. > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS Storage Pool advice
> Are you looking purely for performance, or for the added reliability that ZFS can give you? If the latter, then you would want to configure across multiple LUNs in either a mirrored or RAID configuration. This does require sacrificing some storage in exchange for the peace of mind that any “silent data corruption” in the array or storage fabric will be not only detected but repaired by ZFS.
>
> From a performance point of view, what will work best depends greatly on your application I/O pattern, how you would map the application’s data to the available ZFS pools if you had more than one, how many channels are used to attach the disk array, etc. A single pool can be a good choice from an ease-of-use perspective, but multiple pools may perform better under certain types of load (for instance, there’s one intent log per pool, so if the intent log writes become a bottleneck then multiple pools can help).

Bad example, as there's actually one intent log per file system!

> This also depends on how the LUNs are configured within the EMC array. If you can put together a test system, and run your application as a benchmark, you can get an answer. Without that, I don’t think anyone can predict which will work best in your particular situation.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Sol10u3 -- is "du" bug fixed?
Is there an easy way to determine whether a pool has this fix applied or not? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS Storage Pool advice
Are you looking purely for performance, or for the added reliability that ZFS can give you? If the latter, then you would want to configure across multiple LUNs in either a mirrored or RAID configuration. This does require sacrificing some storage in exchange for the peace of mind that any “silent data corruption” in the array or storage fabric will be not only detected but repaired by ZFS.

From a performance point of view, what will work best depends greatly on your application I/O pattern, how you would map the application’s data to the available ZFS pools if you had more than one, how many channels are used to attach the disk array, etc. A single pool can be a good choice from an ease-of-use perspective, but multiple pools may perform better under certain types of load (for instance, there’s one intent log per pool, so if the intent log writes become a bottleneck then multiple pools can help). This also depends on how the LUNs are configured within the EMC array.

If you can put together a test system, and run your application as a benchmark, you can get an answer. Without that, I don’t think anyone can predict which will work best in your particular situation.

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Kickstart hot spare attachment
Eric Schrock wrote: > Hmmm, it means that we correctly noticed that the device had failed, but > for whatever reason the ZFS FMA agent didn't correctly replace the > drive. I am cleaning up the hot spare behavior as we speak so I will > try to reproduce this. Ok, great. >> Well, as long as I know which device is affected :-> If "zpool status" >> doesn't return it may be difficult to figure out. >> >> Do you know if the SATA controllers in a Thumper can better handle this >> problem? > > I will be starting a variety of experiments in this vein in the near > future. Others may be able to describe their experiences so far. How > exactly did you 'spin down' the drives in question? Is there a > particular failure mode you're interested in? The Andataco cabinet has a button for each disk slot that if you hold down will spin the drive down so you can pull it out. I'm interested in any failure mode that might happen to my server :-> Basically, we're very interested in building a nice ZFS server box that will house a good chunk of our data, be it homes, research or whatever. I just have to know the server is as bulletproof as possible, that's why I'm doing the stress tests. >> Do you have an idea as to when this might be available? > > It will be a while before the complete functionality is finished. I > have begun the work, but there are several distinct phases. First, I > am cleaning up the existing hot spare behavior. Second, I'm adding > proper hotplug support to ZFS so that it detects device removal without > freaking out and correctly resilvers/replaces drives when they are > plugged back in. Finally, I'll be adding a ZFS diagnosis engine to both > analyze ZFS faults as well as consume SMART data to predict disk failure > and proactively offline devices. I would estimate that it will be a few > months before I get all of this into Nevada. Ok, thanks. Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Kickstart hot spare attachment
On Tue, Dec 12, 2006 at 02:38:22PM -0500, James F. Hranicky wrote: > > Dec 11 14:42:32.1271 1319464e-7a8c-e65b-962e-db386e90f7f2 ZFS-8000-D3 > 100% fault.fs.zfs.device > > Problem in: zfs://pool=2646e20c1cb0a9d0/vdev=724c128cdbc17745 >Affects: zfs://pool=2646e20c1cb0a9d0/vdev=724c128cdbc17745 >FRU: - > > I'm not really sure what it means. Hmmm, it means that we correctly noticed that the device had failed, but for whatever reason the ZFS FMA agent didn't correctly replace the drive. I am cleaning up the hot spare behavior as we speak so I will try to reproduce this. > Well, as long as I know which device is affected :-> If "zpool status" > doesn't return it may be difficult to figure out. > > Do you know if the SATA controllers in a Thumper can better handle this > problem? I will be starting a variety of experiments in this vein in the near future. Others may be able to describe their experiences so far. How exactly did you 'spin down' the drives in question? Is there a particular failure mode you're interested in? > Do you have an idea as to when this might be available? It will be a while before the complete functionality is finished. I have begun the work, but there are several distinct phases. First, I am cleaning up the existing hot spare behavior. Second, I'm adding proper hotplug support to ZFS so that it detects device removal without freaking out and correctly resilvers/replaces drives when they are plugged back in. Finally, I'll be adding a ZFS diagnosis engine to both analyze ZFS faults as well as consume SMART data to predict disk failure and proactively offline devices. I would estimate that it will be a few months before I get all of this into Nevada. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Storage Pool advice
Kory Wheatley wrote:
> This question is concerning ZFS. We have a Sun Fire V890 attached to an EMC disk array. Here's our plan to incorporate ZFS: On our EMC storage array we will create 3 LUNs. Now how would ZFS be used for the best performance? What I'm trying to ask is if you have 3 LUNs and you want to create a ZFS storage pool, would it be better to have a storage pool per LUN or combine the 3 LUNs as one big disk under ZFS and create 1 huge ZFS storage pool.

One huge zpool. Remember, the pool can contain many file systems, but the reverse is not true.
-- richard

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
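i.e. roughly this (the device names are placeholders for the three LUNs):

  zpool create alldata c4t0d0 c4t1d0 c4t2d0
  zfs create alldata/data1
  zfs create alldata/data2
  # space can still be capped per file system, without splitting the pool
  zfs set quota=200g alldata/data1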
Re: [zfs-discuss] SunCluster HA-NFS from Sol9/VxVM to Sol10u3/ZFS
Matthew C Aycock wrote:
> We are currently working on a plan to upgrade our HA-NFS cluster that uses HA-StoragePlus and VxVM 3.2 on Solaris 9 to Solaris 10 and ZFS. Is there a known procedure or best practice for this? I have enough free disk space to recreate all the filesystems and copy the data if necessary, but would like to avoid copying if possible.

You will need to copy the data from the old file system into ZFS.

> Also, I am considering what type of zpools to create. I have a SAN with T3Bs and SE3511s. Since neither of these can work as a JBOD (at least that is what I remember) I guess I am going to have to add in the LUNs in a mirrored zpool of the RAID-5 LUNs?

Lacking other information, particularly performance requirements, what you suggest is a good strategy: ZFS mirrors of RAID-5 LUNs.

> We are at the extreme start of this project and I was hoping for some guidance as to what direction to start.

By all means, read the Sun Cluster Concepts Guide first. It will answer many questions that may arise as you go through the design. Note that version 3.2, which is required for ZFS, has updates to the concepts guide regarding the use of ZFS, available RSN.
-- richard

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Kickstart hot spare attachment
Eric Schrock wrote: > On Tue, Dec 12, 2006 at 02:08:57PM -0500, James F. Hranicky wrote: >> Sure, but that's what I want to avoid. The FMA agent should do this by >> itself, but it's not, so I guess I'm just wondering why, or if there's >> a good way to get to do so. If this happens in the middle of the night I >> don't want to have to run the commands by hand. > > Yes, the FMA agent should do this. Can you run 'fmdump -v' and see if > the DE correctly identified the faulted devices? Here you go: # fmdump -v TIME UUID SUNW-MSG-ID Nov 29 16:29:12.1947 e50198f2-2eb9-c58b-d7c5-87aaae5cb935 ZFS-8000-D3 100% fault.fs.zfs.device Problem in: zfs://pool=8e63f0b8e4263e71/vdev=9272c0973ecdb27c Affects: zfs://pool=8e63f0b8e4263e71/vdev=9272c0973ecdb27c FRU: - Nov 30 10:31:48.8844 1a44a780-05c0-cb6e-d44f-f1d8999f40e5 ZFS-8000-D3 100% fault.fs.zfs.device Problem in: zfs://pool=51f1caf6cad1aa2f/vdev=769276842b0efd54 Affects: zfs://pool=51f1caf6cad1aa2f/vdev=769276842b0efd54 FRU: - Dec 11 14:04:57.8803 c46d21e0-200d-43a1-e5db-ae9c9ebf3482 ZFS-8000-D3 100% fault.fs.zfs.device Problem in: zfs://pool=2646e20c1cb0a9d0/vdev=52070de44ec80c15 Affects: zfs://pool=2646e20c1cb0a9d0/vdev=52070de44ec80c15 FRU: - Dec 11 14:42:32.1271 1319464e-7a8c-e65b-962e-db386e90f7f2 ZFS-8000-D3 100% fault.fs.zfs.device Problem in: zfs://pool=2646e20c1cb0a9d0/vdev=724c128cdbc17745 Affects: zfs://pool=2646e20c1cb0a9d0/vdev=724c128cdbc17745 FRU: - I'm not really sure what it means. >> For instance, the zpool command hanging or the system hanging trying to >> reboot normally. > > If the SCSI commands hang forever, then there is nothing that ZFS can > do, as a single write will never return. The more likely case is that > the commands are continually timining out with very long response times, > and ZFS will continue to talk to them forever. The future FMA > integration I mentioned will solve this problem. In the meantime, you > should be able to 'zpool offline' the affected devices by hand. Well, as long as I know which device is affected :-> If "zpool status" doesn't return it may be difficult to figure out. Do you know if the SATA controllers in a Thumper can better handle this problem? > There is also associated work going on to better handle asynchrounous > reponse times across devices. Currently, a single slow device will slow > the entire pool to a crawl. Do you have an idea as to when this might be available? Thanks for all your input, Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Netapp to Solaris/ZFS issues
On 12/12/06, James F. Hranicky <[EMAIL PROTECTED]> wrote: Jim Davis wrote: >> Have you tried using the automounter as suggested by the linux faq?: >> http://nfs.sourceforge.net/#section_b > > Yes. On our undergrad timesharing system (~1300 logins) we actually hit > that limit with a standard automounting scheme. So now we make static > mounts of the Netapp /home space and then use amd to make symlinks to > the home directories. Ugly, but it works. This is how we've always done it, but we use amd (am-utils) to manage two maps, a filesystem map and a homes map. The homes map is of all type:=link, so amd handles the link creation for us, plus we only have a handful of mounts on any system. It looks like if each user has a ZFS quota-ed home directory which acts as its own little filesystem, we won't be able to do this anymore, as we'll have to export and mount each user directory separately. Is this the case, or is there a way to export and mount a volume containing zfs quota-ed directories, i.e., have the quota-ed subdirs not necessarily act like they're separate filesystems? This is definitely a feature I'd love to see, whereby one can share the filesystem at a higher point in the tree (aka /pool/a/b, sharing /pool/a, but have "b" as its own filesystem). I know this breaks some of the sharing, but I'd love to have clients be able to mount /pool/a and by way of that see b as well and not have that treated as a separate share. Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Kickstart hot spare attachment
On Tue, Dec 12, 2006 at 02:08:57PM -0500, James F. Hranicky wrote:
> Sure, but that's what I want to avoid. The FMA agent should do this by itself, but it's not, so I guess I'm just wondering why, or if there's a good way to get it to do so. If this happens in the middle of the night I don't want to have to run the commands by hand.

Yes, the FMA agent should do this. Can you run 'fmdump -v' and see if the DE correctly identified the faulted devices?

> For instance, the zpool command hanging or the system hanging trying to reboot normally.

If the SCSI commands hang forever, then there is nothing that ZFS can do, as a single write will never return. The more likely case is that the commands are continually timing out with very long response times, and ZFS will continue to talk to them forever. The future FMA integration I mentioned will solve this problem. In the meantime, you should be able to 'zpool offline' the affected devices by hand.

There is also associated work going on to better handle asynchronous response times across devices. Currently, a single slow device will slow the entire pool to a crawl.

- Eric
--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
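i.e., with the pool name from earlier in this thread and a made-up device name:

  zpool offline zmir c0t3d0    # stop issuing I/O to the slow or hung disk
  zpool status zmir            # the device should now show as OFFLINE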
[zfs-discuss] Re: Re: Sol10u3 -- is "du" bug fixed?
> IIRC you have to re-create entire raid-z pool to get > it fixed - just > rewriting data or upgrading a pool won't do it. You are correct ... Now I have to find some place to stick +1TB of temp files ;) Thanks for the help, Jeb This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Netapp to Solaris/ZFS issues
Jim Davis wrote: >> Have you tried using the automounter as suggested by the linux faq?: >> http://nfs.sourceforge.net/#section_b > > Yes. On our undergrad timesharing system (~1300 logins) we actually hit > that limit with a standard automounting scheme. So now we make static > mounts of the Netapp /home space and then use amd to make symlinks to > the home directories. Ugly, but it works. This is how we've always done it, but we use amd (am-utils) to manage two maps, a filesystem map and a homes map. The homes map is of all type:=link, so amd handles the link creation for us, plus we only have a handful of mounts on any system. It looks like if each user has a ZFS quota-ed home directory which acts as its own little filesystem, we won't be able to do this anymore, as we'll have to export and mount each user directory separately. Is this the case, or is there a way to export and mount a volume containing zfs quota-ed directories, i.e., have the quota-ed subdirs not necessarily act like they're separate filesystems? Jim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
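For reference, the per-user layout being discussed looks roughly like this (the names are made up); because every home directory is its own file system, every one is also its own NFS export, which is what breaks the single static mount:

  zfs create pool/home
  zfs set sharenfs=rw pool/home      # descendants inherit the sharenfs property
  zfs create pool/home/alice
  zfs set quota=2g pool/home/alice
  zfs create pool/home/bob
  zfs set quota=2g pool/home/bob

An NFSv3 client that mounts only server:/pool/home will see empty directories where alice and bob live unless it mounts each of them separately.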
Re: [zfs-discuss] Kickstart hot spare attachment
Eric Schrock wrote:
> On Tue, Dec 12, 2006 at 07:53:32AM -0800, Jim Hranicky wrote:
>> - I know I can attach it via the zpool commands, but is there a way to kickstart the attachment process if it fails to attach automatically upon disk failure?
>
> Yep. Just do a 'zpool replace zmir <failed-device> <spare-device>'. This is what the FMA agent does in response to failed drive faults.

Sure, but that's what I want to avoid. The FMA agent should do this by itself, but it's not, so I guess I'm just wondering why, or if there's a good way to get it to do so. If this happens in the middle of the night I don't want to have to run the commands by hand.

>> - Is there something inherent to an old SCSI bus that causes spun-down drives to hang the system in some way, even if it's just hanging the zpool/zfs system calls? Would a thumper be more resilient to this?
>
> There are a number of drive failure modes that result in arbitrarily misbehaving drives, as opposed to drives which fail to open entirely. We are working on a more complete FMA diagnosis engine which will be able to diagnose this type of failure and proactively fault the device.
>
> I'm not sure exactly what behavior you're seeing by 'spun-down drives', so this may or may not address your issue.

For instance, the zpool command hanging or the system hanging trying to reboot normally.

Jim

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Sol10u3 -- is "du" bug fixed?
Jeb Campbell wrote: After upgrade you did actually re-create your raid-z pool, right? No, but I did "zpool upgrade -a". Hmm, I guess I'll try re-writing the data first. I know you have to do that if you change compression options. Ok -- rewriting the data doesn't work ... I'll create a new temp pool and see what that does ... then I'll investigate options for recreating my big pool ... Unfortunately, this bug is only fixed when you create the pool on the new bits. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SunCluster HA-NFS from Sol9/VxVM to Sol10u3/ZFS
Hello Matthew,

Tuesday, December 12, 2006, 7:13:47 PM, you wrote:

MCA> We are currently working on a plan to upgrade our HA-NFS cluster that uses HA-StoragePlus and VxVM 3.2 on Solaris 9 to Solaris 10 and ZFS. Is there a known procedure or best practice for this? I have enough free disk space to recreate all the filesystems and copy the data if necessary, but would like to avoid copying if possible.

You will have to copy data. Also keep in mind that ZFS is supported in Sun Cluster 3.2, which is not out yet (should be really soon now).

MCA> Also, I am considering what type of zpools to create. I have a SAN with T3Bs and SE3511s. Since neither of these can work as a JBOD (at least that is what I remember) I guess I am going to have to add in the LUNs in a mirrored zpool of the RAID-5 LUNs?

1. Those boxes can work as JBODs, but not in a clustered environment.
2. The configuration of the arrays - well, it depends. I would suggest doing redundancy at the ZFS level at least. For some performance numbers on those arrays with ZFS, see the list archives.

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]    http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Kickstart hot spare attachment
On Tue, Dec 12, 2006 at 07:53:32AM -0800, Jim Hranicky wrote:
> - I know I can attach it via the zpool commands, but is there a way to kickstart the attachment process if it fails to attach automatically upon disk failure?

Yep. Just do a 'zpool replace zmir <failed-device> <spare-device>'. This is what the FMA agent does in response to failed drive faults.

> - In this instance the spare is twice as big as the other drives -- does that make a difference?

Nope. The 'size' of a replacing vdev is the minimum size of its two children, so it won't affect anything.

> - Is there something inherent to an old SCSI bus that causes spun-down drives to hang the system in some way, even if it's just hanging the zpool/zfs system calls? Would a thumper be more resilient to this?

There are a number of drive failure modes that result in arbitrarily misbehaving drives, as opposed to drives which fail to open entirely. We are working on a more complete FMA diagnosis engine which will be able to diagnose this type of failure and proactively fault the device.

I'm not sure exactly what behavior you're seeing by 'spun-down drives', so this may or may not address your issue.

- Eric
--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Sol10u3 -- is "du" bug fixed?
Hello Jeb,

Tuesday, December 12, 2006, 7:11:30 PM, you wrote:

>> After upgrade you did actually re-create your raid-z pool, right?

JC> No, but I did "zpool upgrade -a".
JC> Hmm, I guess I'll try re-writing the data first. I know you have to do that if you change compression options.

IIRC you have to re-create the entire raid-z pool to get it fixed - just rewriting data or upgrading a pool won't do it.

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]    http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Performance problems during 'destroy' (and bizarre Zone problem as well)
Setting:
We've been operating in the following setup for well over 60 days.
- E2900 (24 x 92)
- 2 2Gbps FC to EMC SAN
- Solaris 10 Update 2 (06/06)
- ZFS with compression turned on
- Global zone + 1 local zone (sparse)
- Local zone is fed ZFS clones from the global Zone

Daily Routine:
- Shutdown local Zone
- Recreate ZFS clones
- Restart local Zone
- End to end timing for this refresh is anywhere between 5 to 30 minutes. The bulk of the time is spent in the ZFS 'destroy' phase.

Problem:
- We had extensive read/write activity in the global and local Zones yesterday. I estimate that we wrote 1/4 of one large ZFS filesystem, ~160GB of write.
- This morning we had a fair amount of activity on the system when the refresh started; zpool was reporting around 150MB/s of write.
- Our 'zfs destroy' commands took what I consider 'normal'; the FS that was fielding the bulk of the I/O took 15 minutes. During this time everything was crawling or, more accurately, came to a dead stop. A simple 'rm' would hang. I've reported this problem to the forum in the past. I also believe the fix for the problem is in Update 3 for Solaris 10, right?
- Surprisingly, today the ZFS 'snapshot & clone' took an inordinate amount of time. I observed each snapshot & clone activity together took 10+ minutes. In the past the same activity has taken no more than a few seconds even during busy times. The total end-to-end timing for all snapshots/clones was a whopping 1:44:00!!!
- Even more surprising was that the local Zone refused to start up (zoneadm -z bluenile boot) with no error messages.
- I was able to start the Zone only an hour or so after the completion of the ZFS commands.

Questions:
- Why is the destroy phase taking so long?
- What can explain the unduly long snapshot/clone times?
- Why didn't the Zone start up?
- More surprisingly, why did the Zone start up after an hour?

Thanks in advance.

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Storage Pool advice
This question is concerning ZFS. We have a Sun Fire V890 attached to an EMC disk array. Here's our plan to incorporate ZFS: On our EMC storage array we will create 3 LUNs. Now how would ZFS be used for the best performance? What I'm trying to ask is if you have 3 LUNs and you want to create a ZFS storage pool, would it be better to have a storage pool per LUN or combine the 3 LUNs as one big disk under ZFS and create 1 huge ZFS storage pool?

Example:
LUN1 200gb ZFS Storage Pool "pooldata1"
LUN2 200gb ZFS Storage Pool "pooldata2"
LUN3 200gb ZFS Storage Pool "pooldata3"
or
LUN 600gb ZFS Storage Pool "alldata"

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] SunCluster HA-NFS from Sol9/VxVM to Sol10u3/ZFS
We are currently working on a plan to upgrade our HA-NFS cluster that uses HA-StoragePlus and VxVM 3.2 on Solaris 9 to Solaris 10 and ZFS. Is there a known procedure or best practice for this? I have enough free disk space to recreate all the filesystems and copy the data if necessary, but would like to avoid copying if possible.

Also, I am considering what type of zpools to create. I have a SAN with T3Bs and SE3511s. Since neither of these can work as a JBOD (at least that is what I remember) I guess I am going to have to add in the LUNs in a mirrored zpool of the RAID-5 LUNs?

We are at the extreme start of this project and I was hoping for some guidance as to what direction to start in.

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re[2]: [zfs-discuss] Re: zpool import takes to long with large numbers of file systems
Hello Jason, Thursday, December 7, 2006, 11:18:17 PM, you wrote: JJWW> Hi Luke, JJWW> That's terrific! JJWW> You know you might be able to tell ZFS which disks to look at. I'm not JJWW> sure. It would be interesting, if anyone with a Thumper could comment JJWW> on whether or not they see the import time issue. What are your load JJWW> times now with MPXIO? On x4500 importing a pool made of 44 disks takes about 13 seconds. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Sol10u3 -- is "du" bug fixed?
> After upgrade you did actually re-create your raid-z > pool, right? No, but I did "zpool upgrade -a". Hmm, I guess I'll try re-writing the data first. I know you have to do that if you change compression options. Ok -- rewriting the data doesn't work ... I'll create a new temp pool and see what that does ... then I'll investigate options for recreating my big pool ... Thanks for the info, Jeb This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re[2]: [zfs-discuss] Uber block corruption?
Hello Toby,

Tuesday, December 12, 2006, 4:18:54 PM, you wrote:

TT> On 12-Dec-06, at 9:46 AM, George Wilson wrote:
>> Also note that the UB is written to every vdev (4 per disk) so the chances of all UBs being corrupted is rather low.

It depends actually - if all your vdevs are on the same array with write-back cache set to on, you actually can end up with all UBs corrupted - at least in theory.

--
Best regards,
Robert    mailto:[EMAIL PROTECTED]    http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Need Clarification on ZFS quota property.
> Hi All,
>
> Assume the device c0t0d0 size is 10 KB. I created ZFS file system on this
> $ zpool create -f mypool c0t0d0s2

This creates a pool on the entire slice.

> and to limit the size of ZFS file system I used quota property.
> $ zfs set quota = 5000K mypool

Note that this sets a quota only on the default filesystem that was created along with the zpool. There may be other filesystems created on the pool with different quotas. You are not setting a quota on the pool itself.

> Which 5000 K bytes are belongs (or reserved) to mypool first 5000KB or last 5000KB or random ?

All blocks belong to the pool. The /mypool filesystem may be allocated any particular space there depending on other filesystems and layout. Attempts to allocate space greater than 5000K will fail.

> UFS and VxFS file systems have options to limit the size of file system on the device (E.g. We can limit the size offrom 1 block to some nth block . Like this is there any sub command to limit the size of ZFS file system from 1 block to some n th block ?

I'm not sure what you're saying here. UFS and VxFS normally take the entire space of a disk slice or volume. The pool creation does the same thing. Can you clarify what you mean by limiting the size of UFS or VxFS?

--
Darren Dunham [EMAIL PROTECTED]
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
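To make the distinction concrete, a small sketch (the sizes and names are illustrative):

  zpool create mypool c0t0d0s2       # the pool owns every block on the slice
  zfs set quota=5000K mypool         # caps the top-level 'mypool' file system (and anything under it)
  zfs create mypool/scratch
  zfs set quota=2000K mypool/scratch # a child file system can carry its own, smaller quota
  zfs list                           # shows per-file-system USED/AVAIL against the shared pool space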
[zfs-discuss] Re: ZFS Usage in Warehousing (lengthy intro)
> But seriously, the big issue with SCSI, is that the SCSI commands are sent > over the SCSI bus at the original (legacy) rate of 5 Mbits/Sec in 8-bit > mode. Actually, this isn't true on the newest (Ultra320) SCSI systems, though I don't know if the 3320 supports packetized SCSI. It's definitely an issue for older SCSI buses if the reads and writes are small, less than a megabyte, say. (For data warehousing applications you should see larger reads, as long as your data is laid out contiguously on disk.) There's rather a nice chart at http://www.hitachigst.com/hdd/library/whitepap/tech/hdwpacket.htm showing how the overhead grows with the speed of the bus. > And since it takes an average of 5 SCSI commands to do something useful Urm? What's wrong with just READ(10) or WRITE(10)? > Also, it takes a lot of time to send those commands - so you have latency. Not much compared to the rotational latency if you're actually reading from media, though. (Measured latency for a read operation with disconnect/reconnect on a parallel SCSI bus is around 22 µs. [That's microseconds in case your mail program/browser doesn't get it right.]) > This is the main reason why SCSI is EOL I presume you mean parallel SCSI? I'd argue that the larger reason was the cost and cooling requirements of parallel cabling; SAS seems to be alive, at least, if not taking off quickly. FC, SAS, and SATA all have lower overhead since they're point-to-point and don't need to arbitrate (or drive multiple receivers). How noticeable this is depends on your application. For large sequential I/O, the data transfer time dominates the overhead; for random I/O, the seek time and rotational latency dominates the overhead. Only in the cases where you're doing fairly small sequential I/Os, you have a very fast caching controller, or you have so many spindles on one connection that you have enough I/O operations in flight to keep the bus busy, will this matter much. For this application, with a mix of random & sequential I/O, FC disks, or other disks with very low seek+rotation times, might perform quite a lot better than inexpensive disks with longer seek+rotation times. I'd be concerned that the updates would dominate performance, unless they're happening at a rate of fewer than about 50/second/spindle. Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sol10u3 -- is "du" bug fixed?
Hello Jeb, Tuesday, December 12, 2006, 6:04:36 PM, you wrote: JC> I updated to Sol10u3 last night, and I'm still seeing different JC> differences between "du -h" and "ls -h". JC> "du" seems to take into account raidz and compression -- if this is correct, please let me know. JC> It makes sense that "du" reports actual disk usage, but this JC> makes some scripts I wrote very broken (need real sizes of files JC> in a directory to be able to put them on dvd isos). JC> Sol10u3 on 3 disk RaidZ: JC> [EMAIL PROTECTED]:~/burnout/2006-11-30]$ ls -lh JMS-data-1-2006-11-30.iso JC> -rw-r--r-- 1 splus splus 3.5G Dec 1 10:15 JMS-data-1-2006-11-30.iso JC> [EMAIL PROTECTED]:~/burnout/2006-11-30]$ du -hs JMS-data-1-2006-11-30.iso JC> 5.2GJMS-data-1-2006-11-30.iso After upgrade you did actually re-create your raid-z pool, right? -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Snapshots impact on performance
Hello Chris, Wednesday, December 6, 2006, 6:23:48 PM, you wrote: CG> One of our file servers internally to Sun that reproduces this CG> running nv53 here is the dtrace output: Any conclusions yet? -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Sol10u3 -- is "du" bug fixed?
I updated to Sol10u3 last night, and I'm still seeing differences between "du -h" and "ls -h". "du" seems to take into account raidz and compression -- if this is correct, please let me know. It makes sense that "du" reports actual disk usage, but this makes some scripts I wrote very broken (I need the real sizes of files in a directory to be able to put them on dvd isos).

Sol10u3 on 3 disk RaidZ:

[EMAIL PROTECTED]:~/burnout/2006-11-30]$ ls -lh JMS-data-1-2006-11-30.iso
-rw-r--r-- 1 splus splus 3.5G Dec 1 10:15 JMS-data-1-2006-11-30.iso
[EMAIL PROTECTED]:~/burnout/2006-11-30]$ du -hs JMS-data-1-2006-11-30.iso
5.2G   JMS-data-1-2006-11-30.iso

Thanks, Jeb

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
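[Editorial note: for scripts like the DVD-layout one above, one workaround is to sum the logical file sizes from ls -l (column 5, in bytes) instead of the allocated blocks that du reports, since du on raidz charges for parity and credits compression. A minimal, untested sketch; the directory name is hypothetical:

  # total logical bytes of the files to burn, ignoring raidz/compression effects
  $ ls -l /export/burnout/2006-11-30 | awk '/^-/ {sum += $5} END {print sum}'

ls -l reports the plain file length, which is the number that matters when sizing an ISO.]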
Re: [zfs-discuss] ZFS Usage in Warehousing (lengthy intro)
On Dec 12, 2006, at 10:02, Al Hopper wrote:

> Another possibility, which is on my todo list to check out, is:
> http://www.norcotek.com/item_detail.php?categoryid=8&modelno=DS-1220

I would not go with this device. I picked up one along with 12 500GB SATA drives, hoping to make a dumping ground on the network for my servers to rsync to. It's possible I have something configured or tuned incorrectly in terms of Solaris & ZFS (and if so, I can't figure out what), but performance is terrible compared to my existing dumping ground based on a cheap-o raid-5 card & FreeBSD.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: zpool mirror
> Not right now (without a bunch of shell-scripting). I'm working on
> being able to "send" a whole tree of filesystems & their snapshots.
> Would that do what you want?

Exactly! When do you think that (really useful) feature will be available?

thanks, gino

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
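[Editorial note: until that feature exists, the "bunch of shell-scripting" workaround looks roughly like the sketch below. This is only an illustration; the pool, filesystem, and snapshot names are made up, and it sends full (not incremental) streams:

  # snapshot and send every filesystem under tank/home, one stream per filesystem
  $ for fs in $(zfs list -H -r -o name tank/home); do
      zfs snapshot "$fs@migrate"
      zfs send "$fs@migrate" > /backup/$(echo "$fs" | tr '/' '_').zfs
    done

Snapshots taken one-by-one in a loop are not a single consistent point in time, and this doesn't carry over properties or earlier snapshots, which is exactly why the built-in "send a whole tree" feature would be better.]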
[zfs-discuss] Re: Uber block corruption?
> [...] there is no possibility of referencing an overwritten > block unless you have to back off more than two uberblocks. At this > point, blocks that have been overwritten will show up as corrupted (bad > checksums). Hmmm. Is there some way we can warn the user to scrub their pool because we had trouble reading an überblock? (Maybe some FMA rules about what to do if an überblock read fails?) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: zfs exported a live filesystem
Jim Hranicky wrote:
>> Now having said that I personally wouldn't have expected that zpool
>> export should have worked as easily as that while there were shared
>> filesystems. I would have expected that exporting the pool should have
>> attempted to unmount all the ZFS filesystems first - which would have
>> failed without a -f flag because they were shared.
>>
>> So IMO it is a bug or at least an RFE.
>
> Ok, where should I file an RFE?

http://bugs.opensolaris.org/

-- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS related kernel panic
> UFS will panic on EIO also. Most other file systems, too.

In which cases will UFS panic on an I/O error? A quick browse through the UFS code shows several cases where we can panic if we have bad metadata on disk, but none if a disk read (or write) fails altogether.

If UFS fails to read a block, it returns EIO (in most cases, occasionally a different error depending on the context) to its caller. (In a few cases, it can continue past the error; for instance, if it can't read a cylinder group header and wants to allocate a block there, it will go on to a different cylinder group.) If UFS fails to write a block, the buffer cache or page cache will just keep retrying.

QFS won't even panic on bad metadata, unless that's enabled with an /etc/system variable; it will just return errors to its caller. (It won't panic on I/O errors at all.)

As for why expectations for ZFS are higher? I suspect that it's primarily because ZFS has been sold (deservedly) as being very good at dealing with hardware problems. This means that it should not only detect problems, but continue on past them whenever possible. Ditto blocks are a first step in this direction. Bringing down the machine when a read or write fails is so 1980s; ZFS needs a bit of fine-tuning here.

We don't need to be defensive. ZFS is a new file system. It will take some time to work all the quirks out and it will take some time to eliminate all the panic cases. But we will.

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Corruption
Bill Casale wrote:
> Please reply directly to me. Seeing the message below. Is it possible
> to determine exactly which file is corrupted? I was thinking the
> OBJECT/RANGE info may be pointing to it but I don't know how to equate
> that to a file.

This is bug:

  6410433 'zpool status -v' would be more useful with filenames

and i'm actually working on it right now!

eric

> # zpool status -v
>   pool: u01
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         u01         ONLINE       0     0     6
>           c1t102d0  ONLINE       0     0     6
>
> errors: The following persistent errors have been detected:
>
>           DATASET  OBJECT   RANGE
>           u01      4741362  600178688-600309760
>
> Thanks, Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Usage in Warehousing (lengthy intro)
On Fri, 8 Dec 2006, Jochen M. Kaiser wrote: > Dear all, > > we're currently looking forward to restructure our hardware environment for > our datawarehousing product/suite/solution/whatever. > > We're currently running the database side on various SF V440's attached via > dual FC to our SAN backend (EMC DMX3) with UFS. The storage system is > (obviously in a SAN) shared between many systems. Performance is mediocre > in terms of raw throughput at 70-150MB/sec. (lengthy, sequential reads due to > full table scan operations on the db side) and excellent is terms of I/O and > service times (averaging at 1,7ms according to sar). > >From our applications perspective sequential read is the most important > >factor. > Read-to-Write ratio is almost 20:1. > > We now want to consolidate our database servers (Oracle, btw.) to a pair of > x4600 systems running Solaris 10 (which we've already tested in a benchmark > setup). The whole system was still I/O-bound, even though the backend (3510, > 12x146GB, QFS, RAID10) delivered a sustained data rate of 250-300MB/sec. > > I'd like to target a sequential read performance of 500++MB/sec while reading > from the db on multiple tablespaces. We're experiencing massive data volume > growth of about 100% per year and are therefore looking both for an > expandable, > yet "cheap" solution. We'd like to use a DAS solution, because we had negative > experiences with SAN in the past in terms of tuning and throughput. > > Being a friend of simplicity I was thinking about using a pair (or more) of > 3320 > SCSI JBODs with multiple RAIDZ and/or RAID10 zfs disk pools on which we'd Have you not heard that SCSI is dead? :) But seriously, the big issue with SCSI, is that the SCSI commands are sent over the SCSI bus at the original (legacy) rate of 5 Mbits/Sec in 8-bit mode. And since it takes an average of 5 SCSI commands to do something useful, you can't send enough commands over the bus to busy out a modern SCSI drive. Even a single drive on a single SCSI bus. Also, it takes a lot of time to send those commands - so you have latency. And everyone understands how latency affects throughput on a LAN (or WAN) .. same issue with SCSI. This is the main reason why SCSI is EOL and could not be extended without breaking the existing standards. While I understand you don't want to build a SAN, an alternative would be a Fibre Channel (FC) box that presents SATA drives. This would be a DAS solution with one or two connections to (Qlogic) FC controllers in the host - IOW not a SAN and there is no FC switch required. Many such boxes are designed to provide expansion to a FC based hardware RAID box. For example, the DS4000 EXP100 Storage Expansion Unit from IBM. In your application you'd need to find something that supports FC rates of 4Gb/Sec, if possible. Another possiblity, which is on my todo list to checkout, is: http://www.norcotek.com/item_detail.php?categoryid=8&modelno=DS-1220 Now if I could find a Marvell based equivalent to the: http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm with external SATA ports, life would be great. Another card with external SATA ports that works with Solaris (via the si3124 driver) is: http://www.newegg.com/product/product.asp?item=N82E16816124003 which only has a 32-bit PCI connection. :( > place the database. If we need more space we'll simply connect yet another > JBOD. I'd calculate 1-2 PCIe U320 controllers (w/o raid) per jbod, starting > with a > minimum of 4 controllers per server. 
> > Regarding ZFS I'd be very interested to know, whether someone else is running > a similar setup and can provide me with some hints or point me at some > caveats. > > I'd be also very interested in the cpu usage of such a setup for the zfs raidz > pools. After searching this forum I found the rule of thumb that 200MB/sec > throughput roughly consume one 2GHz Opteron cpu, but am hoping that someone > can provide me with some in depth data. (Frankly I can hardly imagine that > this > holds true for reads). > > I'd be also be interested in you opinion on my targeted setup, so if you have > any comments - go ahead. > > Any help is appreciated, > > Jochen > > P.S. Fallback scenarios would be Oracle with ASM or a (zfs/ufs) SAN setup. > Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
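[Editorial note: since Jochen asked about layout, the "RAID10"-style zfs pool he describes would be built as a stripe of mirrors, with each side of every mirror on a different JBOD/controller so a whole enclosure or HBA can fail. A rough sketch only; the controller/target names below are made up for illustration:

  # stripe of mirrors across two JBODs (c2 = first enclosure, c3 = second)
  # zpool create dwpool \
      mirror c2t0d0 c3t0d0 \
      mirror c2t1d0 c3t1d0 \
      mirror c2t2d0 c3t2d0 \
      mirror c2t3d0 c3t3d0

  # grow the pool later by adding more mirror pairs
  # zpool add dwpool mirror c2t4d0 c3t4d0

Mirrors cost half the raw capacity, as the original post notes, but for a sequential-read-heavy warehouse they spread reads across both sides of each mirror.]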
[zfs-discuss] Kickstart hot spare attachment
For my latest test I set up a stripe of two mirrors with one hot spare like so:

  zpool create -f -m /export/zmir zmir mirror c0t0d0 c3t2d0 mirror c3t3d0 c3t4d0 spare c3t1d0

I spun down c3t2d0 and c3t4d0 simultaneously, and while the system kept running (my tar over NFS barely hiccuped), the zpool command hung again. I rebooted the machine with -dnq, and although the system didn't come up the first time, it did after a fsck and a second reboot. However, once again the hot spare isn't getting used:

# zpool status -v
  pool: zmir
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist
        for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Tue Dec 12 09:15:49 2006
config:

        NAME        STATE     READ WRITE CKSUM
        zmir        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0t0d0  ONLINE       0     0     0
            c3t2d0  UNAVAIL      0     0     0  cannot open
          mirror    DEGRADED     0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  UNAVAIL      0     0     0  cannot open
        spares
          c3t1d0    AVAIL

A few questions:

- I know I can attach it via the zpool commands, but is there a way to
  kickstart the attachment process if it fails to attach automatically
  upon disk failure? (A manual sketch follows this message.)
- In this instance the spare is twice as big as the other drives -- does
  that make a difference?
- Is there something inherent to an old SCSI bus that causes spun-down
  drives to hang the system in some way, even if it's just hanging the
  zpool/zfs system calls? Would a thumper be more resilient to this?

Jim

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
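[Editorial note: on the first question, one way to kick the replacement off by hand (a sketch only; it does not address whatever is hanging the zpool command) is to tell ZFS explicitly to substitute the spare for the failed device, using the pool and device names from the status output above:

  # manually put the hot spare in place of the failed mirror half
  # zpool replace zmir c3t2d0 c3t1d0
  # zpool status zmir     # c3t1d0 should now appear under the mirror as an in-use spare

Once the failed disk is physically replaced and resilvered (zpool replace zmir c3t2d0 <new disk>), the spare can be detached and returned to the AVAIL list with zpool detach.]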
[zfs-discuss] Re: Netapp to Solaris/ZFS issues
NetApp can actually grow their RAID groups, but they recommend adding an entire RAID group at once instead. If you add a disk to a RAID group on NetApp, I believe you need to manually start a reallocate process to balance data across the disks. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Uber block corruption?
On 12-Dec-06, at 9:46 AM, George Wilson wrote: Also note that the UB is written to every vdev (4 per disk) so the chances of all UBs being corrupted is rather low. Furthermore the time window where UBs are mutually inconsistent would be very short, since they'd be updated together? --Toby Thanks, George Darren Dunham wrote: DD> To reduce the chance of it affecting the integrety of the filesystem, DD> there are multiple copies of the UB written, each with a checksum and a DD> generation number. When starting up a pool, the oldest generation copy DD> that checks properly will be used. If the import can't find any valid DD> UB, then it's not going to have access to any data. Think of a UFS DD> filesystem where all copies of the superblock are corrupt. Actually the latest UB, not the oldest. My *other* oldest... yeah. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Netapp to Solaris/ZFS issues
Hello Jim,

Wednesday, December 6, 2006, 3:28:53 PM, you wrote:

JD> We have two aging Netapp filers and can't afford to buy new Netapp gear,
JD> so we've been looking with a lot of interest at building NFS fileservers
JD> running ZFS as a possible future approach. Two issues have come up in the
JD> discussion
JD> - Adding new disks to a RAID-Z pool (Netapps handle adding new disks very
JD> nicely). Mirroring is an alternative, but when you're on a tight budget
JD> losing N/2 disk capacity is painful.

Actually, you can add another raid-z group to the pool. I believe it's the same as what NetApp is doing (instead of actually growing the raid group).

JD> - The default scheme of one filesystem per user runs into problems with
JD> linux NFS clients; on one linux system, with 1300 logins, we already have
JD> to do symlinks with amd because linux systems can't mount more than about
JD> 255 filesystems at once. We can of course just have one filesystem
JD> exported, and make /home/student a subdirectory of that, but then we run
JD> into problems with quotas -- and on an undergraduate fileserver, quotas
JD> aren't optional!

It can with 2.6 kernels. However, there are other problems; we ended up with a limit of around 700.

-- Best regards, Robert    mailto:[EMAIL PROTECTED] http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
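[Editorial note: for the first point, the "add another raid-z group" approach looks roughly like the sketch below; pool and device names are made up. The new group is striped with the existing one, so capacity and bandwidth grow, but existing data is not rebalanced onto the new disks:

  # original pool: one 4-disk raidz group
  # zpool create home raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

  # later, grow it by adding a second raidz group
  # zpool add home raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0
  # zpool status home     # now shows two raidz vdevs striped together
]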
Re: [zfs-discuss] Uber block corruption?
[EMAIL PROTECTED] wrote:

  Hello Casper,

  Tuesday, December 12, 2006, 10:54:27 AM, you wrote:

  >> So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will
  >> become corrupt through something that doesn't also make all the data
  >> also corrupt or inaccessible.

  CDSC> So how does this work for data which is freed and overwritten; does
  CDSC> the system make sure that none of the data referenced by any of the
  CDSC> old ueberblocks is ever overwritten?

  Why should it? If blocks are not in use according to the current UB, I
  guess you can safely assume they are free.

  What if a newer UB is corrupted and you fall back to an older one?

  Casper

A block freed in transaction group N cannot be reused until transaction group N+3, so there is no possibility of referencing an overwritten block unless you have to back off more than two uberblocks. At this point, blocks that have been overwritten will show up as corrupted (bad checksums).

-Mark

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Uber block corruption?
Also note that the UB is written to every vdev (4 per disk) so the chances of all UBs being corrupted is rather low. Thanks, George Darren Dunham wrote: DD> To reduce the chance of it affecting the integrety of the filesystem, DD> there are multiple copies of the UB written, each with a checksum and a DD> generation number. When starting up a pool, the oldest generation copy DD> that checks properly will be used. If the import can't find any valid DD> UB, then it's not going to have access to any data. Think of a UFS DD> filesystem where all copies of the superblock are corrupt. Actually the latest UB, not the oldest. My *other* oldest... yeah. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
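[Editorial note: if you're curious what is actually on disk, zdb can display the pool's active uberblock. A sketch only, assuming a pool named tank; the exact flags and output format vary between builds:

  # display the currently active uberblock (transaction group, timestamp, etc.)
  # zdb -u tank
]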
Re: [zfs-discuss] How to do DIRECT IO on ZFS ?
Maybe this will help: http://blogs.sun.com/roch/entry/zfs_and_directio -r dudekula mastan writes: > Hi All, > > We have directio() system to do DIRECT IO on UFS file system. Can > any one know how to do DIRECT IO on ZFS file system. > > Regards > Masthan > > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS Usage in Warehousing (no more lengthy intro)
Hello Jochen, Sunday, December 10, 2006, 10:51:57 AM, you wrote: JMK> James, >> Just a thought. >> >> have you thought about giving thumper x4500's a trial >> for this work >> load? Oracle would seem to be IO limited in the end >> so 4 cores may be >> enough to keep oracle happy when linked with upto >> 2GB/s disk IO speed. JMK> === JMK> Actually yes, however I've doubts in regard to scalability JMK> of cpu power. I'd imagine that a RaidZ setup will increase JMK> cpu usage of zfs, so Mirroring will be the way to go. JMK> I've also browsed some info on greenplum and other appliance JMK> vendors. However none are listed as strategic products for our JMK> company (forcing a lengthy assessment process), support/consulting JMK> in Germany is usually non-existent and a port of our current setup JMK> is difficult at best. JMK> I've asked Robert Milkowski (milek.blogspot.com) if he can provide JMK> me with some cpu figures from his throughput benchmarks. It's not that bad with CPU usage. For example with RAID-Z2 while doing scrub I get something like 800MB/s read from disks (550-600MB/s from zpool iostat perspective) and all four cores are mostly consumed - I get something like 10% idle on each cpu. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
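[Editorial note: to reproduce that kind of measurement on your own pool, a rough sketch (pool name hypothetical) is to start a scrub, which reads every allocated block, and watch pool bandwidth and CPU idle side by side:

  # zpool scrub tank
  # zpool iostat tank 5    # pool-level read bandwidth during the scrub
  # mpstat 5               # per-CPU usr/sys/idle while the scrub runs
  # zpool status tank      # shows scrub progress and completion
]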
Re: [zfs-discuss] ZFS Corruption
Bill, If you want to find the file associated with the corruption you could do a "find /u01 -inum 4741362" or use the output of "zdb -d u01" to find the object associated with that id. Thanks, George Bill Casale wrote: Please reply directly to me. Seeing the message below. Is it possible to determine exactly which file is corrupted? I was thinking the OBJECT/RANGE info may be pointing to it but I don't know how to equate that to a file. # zpool status -v pool: u01 state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAMESTATE READ WRITE CKSUM u01 ONLINE 0 0 6 c1t102d0 ONLINE 0 0 6 errors: The following persistent errors have been detected: DATASET OBJECT RANGE u01 4741362 600178688-600309760 Thanks, Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Uber block corruption?
>Hello Casper, > >Tuesday, December 12, 2006, 10:54:27 AM, you wrote: > >>>So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will >>>become corrupt through something that doesn't also make all the data >>>also corrupt or inaccessible. > > >CDSC> So how does this work for data which is freed and overwritten; does >CDSC> the system make sure that none of the data referenced by any of the >CDSC> old ueberblocks is ever overwritten? > >Why it should? If blocks are not used due to current UB I guess you >can safely assume they are free. What if a newer UB is corrupted and you fall back to an older one? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Corruption
Hello Bill,

Tuesday, December 12, 2006, 2:34:01 PM, you wrote:

BC> Please reply directly to me. Seeing the message below.
BC> Is it possible to determine exactly which file is corrupted?
BC> I was thinking the OBJECT/RANGE info may be pointing to it
BC> but I don't know how to equate that to a file.

BC> # zpool status -v
BC>   pool: u01
BC>  state: ONLINE
BC> status: One or more devices has experienced an error resulting in data
BC>         corruption. Applications may be affected.
BC> action: Restore the file in question if possible. Otherwise restore the
BC>         entire pool from backup.
BC>    see: http://www.sun.com/msg/ZFS-8000-8A
BC>  scrub: none requested
BC> config:
BC>
BC>         NAME        STATE     READ WRITE CKSUM
BC>         u01         ONLINE       0     0     6
BC>           c1t102d0  ONLINE       0     0     6
BC>
BC> errors: The following persistent errors have been detected:
BC>
BC>           DATASET  OBJECT   RANGE
BC>           u01      4741362  600178688-600309760
                       ^^^^^^^
That OBJECT value is the inode number, so just use find to locate the file. There's an RFE for this, so that zpool status will give you actual file names.

-- Best regards, Robert    mailto:[EMAIL PROTECTED] http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
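[Editorial note: concretely, assuming the u01 dataset is mounted at /u01 (check with 'zfs get mountpoint u01' if unsure), the lookup is:

  # map the object number from zpool status back to a path
  # find /u01 -inum 4741362 -print

This walks the whole filesystem, so it can take a while on a large pool.]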
Re: [zfs-discuss] How to do DIRECT IO on ZFS ?
Hello dudekula, Tuesday, December 12, 2006, 9:36:24 AM, you wrote: > Hi All, We have directio() system to do DIRECT IO on UFS file system. Can any one know how to do DIRECT IO on ZFS file system. Right now you can't. -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Corruption
Please reply directly to me. Seeing the message below. Is it possible to determine exactly which file is corrupted? I was thinking the OBJECT/RANGE info may be pointing to it but I don't know how to equate that to a file.

# zpool status -v
  pool: u01
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        u01         ONLINE       0     0     6
          c1t102d0  ONLINE       0     0     6

errors: The following persistent errors have been detected:

          DATASET  OBJECT   RANGE
          u01      4741362  600178688-600309760

Thanks, Bill

-- Bill Casale - TSE, OS Team, 1 Network Drive, Burlington, MA. 01802, Sun Microsystems

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: zfs exported a live filesystem
For the record, this happened with a new filesystem. I didn't muck about with an old filesystem while it was still mounted; I created a new one, mounted it and then accidentally exported it.

> > Except that it doesn't:
> >
> > # mount /dev/dsk/c1t1d0s0 /mnt
> > # share /mnt
> > # umount /mnt
> > umount: /mnt busy
> > # unshare /mnt
> > # umount /mnt
>
> If you umount -f it will though!

Well, sure, but I was still surprised that it happened anyway.

> The system is working as designed, the NFS client did
> what it was supposed to do. If you brought the pool back in
> again with zpool import things should have picked up where they left off.

Yep -- an import/shareall made the FS available again.

> What's more, you were probably running as root when you
> did that so you got what you asked for - there is only so much protection
> we can give without being annoying!

Sure, but there are still safeguards in place even when running things as root, such as requiring "umount -f" as above, or warning you when running format on a disk with mounted partitions. Since this appeared to be an operation that may warrant such a safeguard, I thought I'd check and see if this was to be expected or if a safeguard should be put in. Annoying isn't always bad :->

> Now having said that I personally wouldn't have
> expected that zpool export should have worked as easily as that while
> there were shared filesystems. I would have expected that exporting
> the pool should have attempted to unmount all the ZFS filesystems first -
> which would have failed without a -f flag because they were shared.
>
> So IMO it is a bug or at least an RFE.

Ok, where should I file an RFE?

Jim

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
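[Editorial note: for anyone who lands in the same spot, the recovery Jim describes amounts to roughly the following; the pool name is hypothetical, and shareall assumes the filesystems have sharenfs set or /etc/dfs/dfstab entries:

  # bring the accidentally-exported pool back and re-share its filesystems
  # zpool import tank
  # zfs mount -a          # usually happens automatically on import
  # shareall -F nfs

NFS clients holding file handles from before the export should recover once the filesystems are shared again.]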
Re: [zfs-discuss] zfs exported a live filesystem
Boyd Adamson wrote:
> On 12/12/2006, at 8:48 AM, Richard Elling wrote:
>> Jim Hranicky wrote:
>>> By mistake, I just exported my test filesystem while it was up and
>>> being served via NFS, causing my tar over NFS to start throwing stale
>>> file handle errors. Should I file this as a bug, or should I just
>>> "not do that" :->
>>
>> Don't do that. The same should happen if you umount a shared UFS file
>> system (or any other file system types).
>> -- richard
>
> Except that it doesn't:
>
> # mount /dev/dsk/c1t1d0s0 /mnt
> # share /mnt
> # umount /mnt
> umount: /mnt busy
> # unshare /mnt
> # umount /mnt
>
> If you umount -f it will though!

I don't quite agree that unmounting a UFS filesystem that is exported over NFS is the same as running zpool export on the pool. The equivalent to running umount on the UFS file system is running zfs umount on the ZFS file system in the pool. Running zpool export on the pool is closer to removing (cleanly) the disks or metadevices that the ufs file system is stored on.

The system is working as designed; the NFS client did what it was supposed to do. If you brought the pool back in again with zpool import, things should have picked up where they left off. What's more, you were probably running as root when you did that, so you got what you asked for - there is only so much protection we can give without being annoying!

If you look at the RBAC profiles we currently ship for ZFS you will see that there are two distinct profiles, one for ZFS File System Management and one for ZFS Storage Management. The reason they are separate is because they work at quite different layers in the system with different protections.

Now having said that, I personally wouldn't have expected that zpool export should have worked as easily as that while there were shared filesystems. I would have expected that exporting the pool should have attempted to unmount all the ZFS filesystems first - which would have failed without a -f flag because they were shared.

So IMO it is a bug or at least an RFE.

-- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re[2]: [zfs-discuss] Uber block corruption?
Hello Casper,

Tuesday, December 12, 2006, 10:54:27 AM, you wrote:

>> So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will
>> become corrupt through something that doesn't also make all the data
>> also corrupt or inaccessible.

CDSC> So how does this work for data which is freed and overwritten; does
CDSC> the system make sure that none of the data referenced by any of the
CDSC> old ueberblocks is ever overwritten?

Why should it? If blocks are not in use according to the current UB, I guess you can safely assume they are free.

-- Best regards, Robert    mailto:[EMAIL PROTECTED] http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Need Clarification on ZFS quota property.
On 12 December, 2006 - dudekula mastan sent me these 2,7K bytes:

> Hi All,
>
> Assume the device c0t0d0 size is 10 KB.
>
> I created ZFS file system on this
>
> $ zpool create -f mypool c0t0d0s2
>
> and to limit the size of ZFS file system I used the quota property.
>
> $ zfs set quota = 5000K mypool
>
> Which 5000 KB belong to (or are reserved for) mypool: the first 5000 KB,
> the last 5000 KB, or random?

"random".. When you've stored 5000K, you can't store any more there.

> UFS and VxFS file systems have options to limit the size of a file
> system on the device (e.g. we can limit the size from block 1 to
> some nth block). Like this, is there any sub command to limit the
> size of a ZFS file system from block 1 to some nth block?

Just the amount, not specific positions on (or portions of) the FS/devices.

/Tomas -- Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
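[Editorial note: not from Tomas's mail -- as a side point, zfs set takes no spaces around the '=', and a quota is normally put on a filesystem rather than typed against the pool's root dataset. A small sketch, assuming a pool named mypool:

  # create a filesystem and cap how much space it may consume
  # zfs create mypool/data
  # zfs set quota=5000K mypool/data
  # zfs get quota mypool/data

The quota limits how much the dataset can use, not which blocks on the device it occupies.]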
[zfs-discuss] Need Clarification on ZFS quota property.
Hi All,

Assume the device c0t0d0 size is 10 KB.

I created a ZFS file system on this:

$ zpool create -f mypool c0t0d0s2

and to limit the size of the ZFS file system I used the quota property:

$ zfs set quota = 5000K mypool

Which 5000 KB belong to (or are reserved for) mypool: the first 5000 KB, the last 5000 KB, or random?

UFS and VxFS file systems have options to limit the size of a file system on the device (e.g. we can limit the size from block 1 to some nth block). Like this, is there any sub command to limit the size of a ZFS file system from block 1 to some nth block?

Your help is appreciated.

Thanks & Regards Masthan

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Doubt on solaris 10 installation ..
[EMAIL PROTECTED] looks like the more appropriate list to post questions like yours. dudekula mastan wrote: Hi Everybody, I have some problems in solaris 10 installation. After installing the first CD , I removed the CD from CDrom , after that the machine is getting rebooting again and again. It is not asking second CD to install. If you have any idea. Please tell me. Thanks & Regards Masthan -- Zoram Thanga::Sun Cluster Development::http://blogs.sun.com/zoram ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Uber block corruption?
>So 'a' UB can become corrupt, but it is unlikely that 'all' UBs will >become corrupt through something that doesn't also make all the data >also corrupt or inaccessible. So how does this work for data which is freed and overwritten; does the system make sure that none of the data referenced by any of the old ueberblocks is ever overwritten? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How to do DIRECT IO on ZFS ?
Hi All,

We have the directio() call to do direct I/O on a UFS file system. Does anyone know how to do direct I/O on a ZFS file system?

Regards Masthan

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss