Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

On 13/06/11 11:36 AM, Jim Klimov wrote:

Some time ago I wrote a script to find any "duplicate" files and replace
them with hardlinks to one inode. Apparently this is only good for identical
files which don't change separately in the future, such as distro archives.

I can send it to you offlist, but it would be slow in your case because it
is not quite the tool for the job (it will start by calculating checksums
of all of your files ;) )

What you might want to do and script up yourself is a recursive listing
"find /var/opt/SUNWmsqsr/store/partition... -ls". This would print you
the inode numbers and file sizes and link counts. Pipe it through
something like this:

find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq

And you'd get 3 columns - inode, count, size

My AWK math is a bit rusty today, so I present a monster-script like
this to multiply and sum up the values:

( find ... -ls | awk '{print $1" "$4" "$7}' | sort | uniq | awk '{ print $2"*"$3"+\\" }'; echo 0 ) | bc

This looks something like what I thought would have to be done, I was just
looking to see if there was something tried and tested before I had to invent
something. I was really hoping in zdb there might have been some magic
information I could have tapped into.. ;)


Can be done cleaner, i.e. in a PERL one-liner, and if you have
many values - that would probably complete faster too. But as
a prototype this would do.
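
For what it's worth, the same calculation can be sketched in Python rather than
awk and bc. This is only a sketch under assumptions: the function name
hardlink_savings is made up for illustration, it tallies the names found per
inode under the given root (so links living outside that root are not counted,
unlike the link-count column from "find -ls"), and it ignores directory-entry
overhead.

```python
import os
import stat
from collections import defaultdict


def hardlink_savings(root):
    """Return (actual_bytes, inflated_bytes) for regular files under root."""
    sizes = {}                 # (device, inode) -> file size in bytes
    names = defaultdict(int)   # (device, inode) -> directory entries seen
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, etc.
            key = (st.st_dev, st.st_ino)
            sizes[key] = st.st_size
            names[key] += 1
    # Each inode counted once: roughly what the store uses today.
    actual = sum(sizes.values())
    # Each name counted as a full copy: what a store without linking would need.
    inflated = sum(sizes[k] * names[k] for k in sizes)
    return actual, inflated
```

Calling hardlink_savings() on a store directory returns two byte counts; the
difference between them is the space the hard links are saving.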

HTH,
//Jim

PS: Why are you replacing the cool Sun Mail? Is it about Oracle
licensing and the now-required purchase and support cost?

Yes, it is mostly about cost. We had Sun Mail for our staff and students, with
20,000+ students on it up until Christmas time. We have now migrated them
to M$ Live@EDU. This leaves us with 1500 staff who all like to use LookOut. The
Sun connector for LookOut is a bit flaky at best. But the Oracle licensing cost
for Messaging and Calendar starts at 10,000 users plus, and so is now rather
expensive for the mailboxes we have left. M$ also heavily discounts Exchange
CALs to Edu, and Oracle is not very friendly the way Sun was with their JES
licensing. So it is bye bye Sun Messaging Server for us.



2011-06-13 1:14, Scott Lawson wrote:

Hi All,

I have an interesting question that may or may not be answerable from some
internal ZFS semantics.

I have a Sun Messaging Server which has 5 ZFS-based email stores. The Sun
Messaging Server uses hard links to link identical messages together. Messages
are stored in standard SMTP MIME format, so the binary attachments are included
in the message ASCII. Each individual message is stored in a separate file.

So, as an example, if a user sends an email with a 2MB attachment to the staff
mailing list and there are 3 staff stores with 500 users on each, it will
generate a space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking place on
each store.

If hard linking capability had been turned off, this same message would have
used 1500 x 2MB = 3GB worth of storage.

My question is: are there any simple ways of determining the space savings on
each of the stores from the usage of hard links? The reason I ask is that our
educational institute wishes to migrate these stores to M$ Exchange 2010, which
doesn't do message single instancing. I need to try and project what the storage
requirement will be on the new target environment.

If anyone has any ideas, be it ZFS based or any useful scripts that could help
here, I am all ears.

I may post this to Sun Managers as well to see if anyone there might have any
ideas on this.


Regards,

Scott.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss







Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

On 13/06/11 10:28 AM, Nico Williams wrote:

On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson wrote:
   

I have an interesting question that may or may not be answerable from some
internal
ZFS semantics.
 

This is really standard Unix filesystem semantics.
   
I understand this, I was just wanting to see if there is any easy way before I
trawl through 10 million little files.. ;)
   

[...]

So total storage used is around ~7.5MB due to the hard linking taking place
on each store.

If hard linking capability had been turned off, this same message would have
used 1500 x 2MB =3GB
worth of storage.

My question is there any simple ways of determining the space savings on
each of the stores from the usage of hard links?  [...]
 

But... you just did!  :)  It's: number of hard links * (file size +
sum(size of link names and/or directory slot size)).  For sufficiently
large files (say, larger than one disk block) you could approximate
that as: number of hard links * file size.  The key is the number of
hard links, which will typically vary, but for e-mails that go to all
users, well, you know the number of links then is the number of users.
   

Yes this number varies based on number of recipients, so could be as many a

You could write a script to do this -- just look at the size and
hard-link count of every file in the store, apply the above formula,
add up the inflated sizes, and you're done.
   
Looks like I will have to. I was just looking for a tried and tested method
before I have to sit down and develop and test a script of my own. I have
resigned from my current job of 9 years and finish in 15 days, and I have a heck
of a lot of documentation and knowledge transfer to do around other UNIX
systems, so I am running very short on time...

Nico

PS: Is it really the case that Exchange still doesn't deduplicate
e-mails?  Really?  It's much simpler to implement dedup in a mail
store than in a filesystem...
   
As a side note, Exchange 2002 + Exchange 2007 do do this. But apparently M$
decided in Exchange 2010 that they no longer wished to do this and dropped the
capability. Bizarre to say the least, but it may come down to the changes they
have made in the underlying store technology..




[zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Scott Lawson

Hi All,

I have an interesting question that may or may not be answerable from some
internal ZFS semantics.

I have a Sun Messaging Server which has 5 ZFS-based email stores. The Sun
Messaging Server uses hard links to link identical messages together. Messages
are stored in standard SMTP MIME format, so the binary attachments are included
in the message ASCII. Each individual message is stored in a separate file.

So, as an example, if a user sends an email with a 2MB attachment to the staff
mailing list and there are 3 staff stores with 500 users on each, it will
generate a space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking place on
each store.

If hard linking capability had been turned off, this same message would have
used 1500 x 2MB = 3GB worth of storage.
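
The arithmetic behind those two figures can be written out as a short Python
sketch. The variable names are made up for illustration, and the 1KB per extra
recipient is simply the figure used in the example above, not a measured value.

```python
KB = 1          # work in kilobytes
MB = 1024 * KB

stores = 3
users_per_store = 500
message = 2 * MB           # one copy of the mail with its attachment
per_link = 1 * KB          # assumed cost of each additional recipient

# One full copy per store, plus a small per-recipient cost for the rest.
with_links = stores * (message + (users_per_store - 1) * per_link)
# Every recipient on every store gets a full copy.
without_links = stores * users_per_store * message

print(f"with hard links:    {with_links / MB:.2f} MB")
print(f"without hard links: {without_links / MB:.0f} MB")
```

This gives roughly 7.5 MB with hard links against 3000 MB (about 3GB) without,
which matches the figures in the example.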

My question is: are there any simple ways of determining the space savings on
each of the stores from the usage of hard links? The reason I ask is that our
educational institute wishes to migrate these stores to M$ Exchange 2010, which
doesn't do message single instancing. I need to try and project what the storage
requirement will be on the new target environment.

If anyone has any ideas, be it ZFS based or any useful scripts that could help
here, I am all ears.

I may post this to Sun Managers as well to see if anyone there might have any
ideas on this.


Regards,

Scott.


Re: [zfs-discuss] OT: anyone aware how to obtain 1.8.0 for X2100M2?

2010-12-19 Thread Scott Lawson

Hi,

Took me a couple of minutes to find the download for this in my Oracle support.
Search for the patch like this:

Patches and Updates Panel -> Patch Search -> Patch Name or Number is: 10275731

Pretty easy really.

Scott.

PS. I found that patch by using product or family equals x2100, and it found it
for me easily.




On 20/12/2010 1:04 p.m., Jerry Kemp wrote:

Eugen,

I would 2nd your observation.

I *do* have several support contracts, and as I review my Oracle 
profile, it does show that I am authorized to download patches, among 
other items.


I really haven't downloaded a lot since SunSolve was killed off.

Do others on the list have access to download stuff like this?

Or is there some other place within Oracle's site that makes Eugen's
link obsolete?


Jerry


On 12/19/10 12:28, Eugen Leitl wrote:


I realize this is off-topic, but Oracle has completely
screwed up the support site from Sun. I figured someone
here would know how to obtain

Sun Fire X2100 M2 Server Software 1.8.0 Image contents:

 * BIOS is version 3A21
 * SP is updated to version 3.24 (ELOM)
 * Chipset driver is updated to 9.27

from

http://www.sun.com/servers/entry/x2100/downloads.jsp

I've been trying for an hour, and I'm at the end of
my rope.




--
_______


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'





Re: [zfs-discuss] ZFS & HW RAID

2009-09-18 Thread Scott Lawson



Bob Friesenhahn wrote:

On Fri, 18 Sep 2009, David Magda wrote:


If you care to keep your pool up and alive as much as possible, then 
mirroring across SAN devices is recommended.


One suggestion I heard was to get a LUN that's twice the size, and 
set "copies=2". This way you have some redundancy for incorrect 
checksums.


This only helps for block-level corruption.  It does not help much at 
all if a whole LUN goes away.  It seems best for single disk rpools.

I second this. In my experience you are more likely to have a single LUN go
missing for some reason or another, and it seems most prudent to support any
production data volume with at the very minimum a mirror. This also gives you
2 copies in a generally far more resilient way (and, per my other post, there
can be other niceties that come with it as well when coupled with SAN-based
LUNs).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS & HW RAID

2009-09-18 Thread Scott Lawson



Re: [zfs-discuss] How to find poor performing disks

2009-08-26 Thread Scott Lawson

Also you may wish to look at the output of 'iostat -xnce 1' as well.

You can post those to the list if you have a specific problem.

You want to be looking for error counts increasing, and specifically at 'asvc_t'
for the service times on the disks. A higher number for asvc_t may help to
isolate poorly performing individual disks.
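
If you capture that output to a file, picking out the slow disks can be sketched
in Python. This is only a sketch under assumptions: the function name slow_disks
is made up, the sample numbers are invented for illustration, and it assumes the
header line contains an 'asvc_t' column with the device name as the last column,
as in 'iostat -xn' style output. The threshold is arbitrary; pick one that suits
your disks.

```python
def slow_disks(iostat_text, threshold_ms):
    """Return [(device, asvc_t)] for rows whose asvc_t exceeds threshold_ms."""
    col = None
    hits = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if "asvc_t" in fields:
            col = fields.index("asvc_t")   # locate the column from the header
            continue
        if col is None or len(fields) <= col:
            continue                       # banner, CPU, or blank lines
        try:
            value = float(fields[col])
        except ValueError:
            continue                       # not a data row
        if value > threshold_ms:
            hits.append((fields[-1], value))
    return hits


# Invented sample data, not taken from a real system.
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   12.0    3.0  768.0  192.0  0.0  0.1    0.0    6.5   0   8 c1t0d0
   11.0    3.0  704.0  192.0  0.0  1.9    0.0  135.2   0  97 c1t1d0
"""

print(slow_disks(SAMPLE, 50.0))
```

A disk that shows up here interval after interval, while its neighbours in the
same vdev do not, is a reasonable suspect.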



Scott Meilicke wrote:

You can try:

zpool iostat pool_name -v 1

This will show you IO on each vdev at one second intervals. Perhaps you will 
see different IO behavior on any suspect drive.

-Scott
  





Re: [zfs-discuss] problem with zfs

2009-08-26 Thread Scott Lawson



serge goyette wrote:

actually i did apply the latest recommended patches
  

Recommended patches and upgrade clusters are different, by the way.

10_Recommended != Upgrade Cluster. An upgrade cluster effectively brings
the system up to the Solaris release that the cluster corresponds to,
minus any new features that arrived in that newer OS release.

SunOS VL-MO-ZMR01 5.10 Generic_139555-08 sun4v sparc SUNW,SPARC-Enterprise-T5120

but still 


perhaps you are not doing much import - export
because when i do not do, i do not experience much problem
but when doing it, outch ...
  

Sure, I import and export pools. But generally this is for moving the pool to
another system.

But I think we would need more information about the pool and its file systems
to be able to help you. Specifically, maybe the output of 'zpool history' and
'zfs list' for starters. This will at least give us some specific data to try
and help resolve your issues. The question as it stands is pretty generic.

Have you upgraded your pools after the patches as well?

'zpool upgrade' and 'zfs upgrade' ?

a reboot will solve until next time

-sego-
  




Re: [zfs-discuss] problem with zfs

2009-08-26 Thread Scott Lawson

The latest official Solaris 10 is actually 05/09. There are update patch bundles
available on Sunsolve for free download that will take you to 05/09. It may well
be worth applying these to see if they remedy the problem for you. From
recollection, they certainly allow you to bring ZFS up to version 10. I have
upgraded 30 plus systems with these and haven't experienced any issues (both
SPARC and x86).

http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part1.zip
http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part2.zip
http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part3.zip
http://sunsolve.sun.com/pdownload.do?target=10_sparc_0509_patchbundle_part4.zip


serge goyette wrote:

for release sorry i meant

 Solaris 10 10/08 s10s_u6wos_07b SPARC
   Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
Assembled 27 October 2008
  




Re: [zfs-discuss] zfs fragmentation

2009-08-12 Thread Scott Lawson



Ed Spencer wrote:

I don't know of any reason why we can't turn 1 backup job per filesystem
into, say, up to 26 based on the cyrus file and directory structure.
  
No reason whatsoever. Sometimes the more the better, as per the rest of this
thread. The key here is to test and tweak till you get the optimal arrangement
of backup window time and performance.

Performance tuning is a little bit of a journey that sooner or later has a
final destination. ;)

The cyrus file and directory structure is designed with users located
under the directories A,B,C,D,etc to deal with the millions of little
files issue at the  filesystem layer.
  
The Sun Messaging Server actually hashes the user names into a structure which
looks quite similar to a squid cache store. This has a top level of 128
directories, which each in turn contain 128 directories, which then contain a
folder for each user that has been mapped into that structure by the hash
algorithm on the user name. I use a wildcard mapping to split this into 16
streams to cover the 0-9, a-f of the hexadecimal directory structure names,
e.g. /mailstore1/users/0*
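
Generating the 16 wildcard selections is trivial to script. A minimal Python
sketch, where the function name stream_globs is made up for illustration and
/mailstore1/users is just the example path from above:

```python
def stream_globs(base, digits="0123456789abcdef"):
    """Return one wildcard path per leading hex digit under base."""
    return [f"{base}/{d}*" for d in digits]


for pattern in stream_globs("/mailstore1/users"):
    print(pattern)
```

Each printed pattern then becomes the file selection for one backup stream in
whatever backup software is in use.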

Our backups will have to be changed to use this design feature.
There will be a little work on the front end  to create the jobs but
once done the full backups should finish in a couple of hours.
  
The nice thing about this work is that it really is only a one-off configuration
in the backup software, and then it is done. It certainly works a lot better
than something like ALL_LOCAL_DRIVES in Netbackup, which effectively forks one
backup thread per file system.

As an aside, we are currently upgrading our backup server to a sun4v
machine.
This architecture is well suited to run more jobs in parallel.
  
I use a T5220 with staging to a J4500 with 48 x 1 TB disks in a zpool with 6
file systems. This then gets streamed to 6 LTO4 tape drives in an SL500.
Needless to say, this supports a high degree of parallelism and generally finds
the source server to be the bottleneck. I also take advantage of the 10 GigE
capability built straight into the UltraSPARC T2. The only major bottleneck in
this system is the SAS interconnect to the J4500.
 
Thanx for all your help and advice.


Ed

On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:
  

On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer wrote:


We backup 2 filesystems on tuesday, 2 filesystems on thursday, and 2 on
saturday. We backup to disk and then clone to tape. Our backup people
can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is
one solution but it's a major redesign of the mail system.
  

What is magical about a 1:1 mapping of backup job to file system?
According to the Networker manual[1], a save set in Networker can be
configured to back up certain directories.  According to some random
documentation about Cyrus[2], mail boxes fall under a pretty
predictable hierarchy.

1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html

Assuming that the way that your mailboxes get hashed fall into a
structure like $fs/b/bigbird and $fs/g/grover (and not just
$fs/bigbird and $fs/grover), you should be able to set a save set per
top level directory or per group of a few directories.  That is,
create a save set for $fs/a, $fs/b, etc. or $fs/a - $fs/d, $fs/e -
$fs/h, etc.  If you are able to create many smaller save sets and turn
the parallelism up you should be able to drive more throughput.

I wouldn't get too worried about ensuring that they all start at the
same time[3], but it would probably make sense to prioritize the
larger ones so that they start early and the smaller ones can fill in
the parallelism gaps as the longer-running ones finish.
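
The "start the larger ones first" idea can be sketched in Python as a greedy
largest-first assignment. This is only an illustration: the function name
assign_save_sets and the sizes below are made up, and a real backup scheduler
will have its own way of setting priorities.

```python
import heapq


def assign_save_sets(sizes, streams):
    """Greedy largest-first: give each save set to the least-loaded stream.

    sizes maps save set name -> estimated size (any consistent unit).
    Returns (plan, loads): the save sets per stream and the total per stream.
    """
    heap = [(0, i) for i in range(streams)]   # (current load, stream index)
    heapq.heapify(heap)
    plan = [[] for _ in range(streams)]
    loads = [0] * streams
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)          # stream that frees up first
        plan[i].append(name)
        loads[i] = load + size
        heapq.heappush(heap, (loads[i], i))
    return plan, loads


# Hypothetical sizes in GB for save sets $fs/a .. $fs/f, on 2 streams.
example = {"a": 90, "b": 70, "c": 40, "d": 30, "e": 20, "f": 10}
plan, loads = assign_save_sets(example, 2)
print(plan, loads)
```

With these made-up sizes both streams end up carrying 130 GB, whereas running
the save sets in alphabetical order would leave one stream finishing well after
the other.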

3. That is, there is sometimes benefit in having many more jobs to run
than you have concurrent streams.  This avoids having one save set
that finishes long after all the others because of poorly balanced
save sets.


Couldn't agree more Mike.

--
Mike Gerdts
http://mgerdts.blogspot.com/





Re: [zfs-discuss] zfs fragmentation

2009-08-11 Thread Scott Lawson

[...] filesystems, in the same pool.


Directory walkers, like NetBackup or rsync, will not scale well as
the number of files increases.  It doesn't matter what file system you
use, the scalability will look more-or-less similar. For millions of files,
ZFS send/receive works much better.  More details are in my paper.

I look forward to reading this, Richard. I think it will be an interesting read
for members of this list.



We will have to do something to address the problem. A combination of
what I just listed is our probable course of action. (Much testing will
have to be done to ensure our solution will address the problem because
we are not 100% sure what is the cause of performance degradation).  I'm
also dealing with Network Appliance to see if there is anything we can
do at the filer end to increase performance.  But I'm holding out little
hope.


DNLC hit rate?
Also, is atime on?

Turning atime off may make a big difference for you. It certainly does for Sun
Messaging Server. Maybe worth doing and reposting the result?




But please, don't miss the point I'm trying to make. ZFS would benefit
from a utility or a background process that would reorganize files and
directories in the pool to optimize performance. A utility to deal with
Filesystem Entropy. Currently a zfs pool will live as long as the
lifetime of the disks that it is on, without reorganization. This can be
a long long time. Not to mention slowly expanding the pool over time
contributes to the issue.


This does not come "for free" in either performance or risk. It will
do nothing to solve the directory walker's problem.

Agree. It will have little bearing on the outcome, for the reason you mention.

NB, people who use UFS don't tend to see this because UFS can't
handle millions of files.

It can, but only if you have file systems of less than 1 TB or so. Not large by
ZFS standards. They do work, but with the same performance issue for directory
walker backups. Heaven help you in fsck'ing them after a system crash. Hours and
hours.

 -- richard



Re: [zfs-discuss] grow zpool by replacing disks

2009-08-03 Thread Scott Lawson



Tobias Exner wrote:

Hi list,

some months ago I spoke with a ZFS expert at a Sun Storage event.

He told me it's possible to grow a zpool by replacing every single disk with a
larger one. After replacing and resilvering all disks of this pool, ZFS will
provide the new size automatically.

Now I found time to check that and I was not able to grow the pool. It's still
the same size as before.

Maybe I missed one step? Or this functionality is not implemented yet?

Can you post the steps that you took? Output of 'zpool history' would be
helpful.


For your info:

I tested this using VMWare and a Solaris 10u6



Any comments or ideas?

Did you export and then import the pool afterwards? From recollection, this step
is needed for the system to make the new capacity available in the pool. I did
this about 3-4 weeks ago on a mirror, and this step was required for the system
to see the change in capacity. This was on Solaris 10 10/08, which is the same
as S10u6.



Thank you in advance...


Tobias




Re: [zfs-discuss] Another user looses his pool (10TB) in this case and 40

2009-08-01 Thread Scott Lawson



Dave Stubbs wrote:

I don't mean to be offensive Russel, but if you do
ever return to ZFS, please promise me that you will
never, ever, EVER run it virtualized on top of NTFS
(a.k.a. worst file system ever) in a production
environment. Microsoft Windows is a horribly
unreliable operating system in situations where
things like protecting against data corruption are
important. Microsoft knows this



Oh WOW!  Whether or not our friend Russel virtualized on top of NTFS (he didn't - he used raw disk access) this point is amazing!  System5 - based on this thread I'd say you can't really make this claim at all.  Solaris suffered a crash and the ZFS filesystem lost EVERYTHING!  And there aren't even any recovery tools?  


HANG YOUR HEADS!!!

Recovery from the same situation is EASY on NTFS.  There are piles of tools out
there that will recover the file system, and failing that, locate and extract
data.  The key parts of the file system are stored in multiple locations on the
disk just in case.  It's been this way for over 10 years.  I'd say it seems from
this thread that my data is a lot safer on NTFS than it is on ZFS!

You mean the data that you don't know you have lost yet? ZFS allows you to be
very paranoid about data protection with things like copies=2 or copies=3.

  
I can't believe my eyes as I read all these responses blaming system engineering and hiding behind ECC memory excuses and "well, you know, ZFS is intended for more Professional systems and not consumer devices, etc etc."  My goodness!  You DO realize that Sun has this website called opensolaris.org which actually proposes to have people use ZFS on commodity hardware, don't you?  I don't see a huge warning on that site saying "ATTENTION:  YOU PROBABLY WILL LOSE ALL YOUR DATA".  


I recently flirted with putting several large Unified Storage 7000 systems on 
our corporate network.  The hype about ZFS is quite compelling and I had 
positive experience in my lab setting.  But because of not having Solaris 
capability on our staff we went in another direction instead.
  
You do realize that the 7000 series machines are appliances and have no
prerequisite for you to have any Solaris knowledge whatsoever? They are a
supported device just like any other disk storage system that you can purchase
from any vendor and have supported as such. To use one, all you need is a web
browser. That's it. This is no different than your EMC array or HP StorageWorks
hardware, except that the underpinnings of the storage system are there for all
to see in the form of open source code contributed to the community by Sun.

Reading this thread, I'm SO glad we didn't put ZFS in production in ANY way.  Guys, this is the real world.  Stuff happens.  It doesn't matter what the reason is - hardware lying about cache commits, out-of-order commits, failure to use ECC memory, whatever.  It is ABSOLUTELY unacceptable for the filesystem to be entirely lost.  No excuse or rationalization of any type can be justified.  There MUST be at least the base suite of tools to deal with this stuff.  without it, ZFS simply isn't ready yet.  
  
Sounds like you have no real-world experience of ZFS in production environments
and its true reliability. As many people here report, there are thousands if not
millions of zpools out there containing business-critical environments that are
happily working around broken hardware on a daily basis. I have personally seen
all sorts of pieces of hardware break, and ZFS corrected and fixed things for
me. I personally manage 50 plus ZFS zpools that are anywhere from 100GB to
30 TB. Works very, very, very well for me. I have never lost anything despite
having had plenty of pieces of hardware break in some form underneath ZFS.

I am saving a copy of this thread to show my colleagues and also those Sun 
Microsystems sales people that keep calling.
  




Re: [zfs-discuss] avail drops to 32.1T from 40.8T after create -o mountpoint

2009-07-29 Thread Scott Lawson



Glen Gunselman wrote:

Here is the output from my J4500 with 48 x 1 TB disks. It is almost the
exact same configuration as yours. This is used for Netbackup. As Mario just
pointed out, "zpool list" includes the parity drive in the space calculation
whereas "zfs list" doesn't.

[r...@xxx /]#> zpool status




Scott,

Thanks for the sample zpool status output.  I will be using the storage for 
NetBackup, also.  (I am booting the X4500 from a SAN - 6140 - and using a SL48 
w/2 LTO4 drives.)

Glen
  

Glen,

If you want any more info about our configuration, drop me a line. It works
very, very well and we have had no issues at all.

This system is a T5220 (323 GB RAM) with the 48 TB J4500 connected via SAS. The
system also has 3 dual-port fibre channel HBAs feeding 6 LTO4 drives in a
540-slot SL500. The server is 10 gig attached straight to our network core
routers and, needless to say, achieves very high throughput. I have quite
commonly seen it pushing the full capacity of the SAS link to the J4500. This is
probably the choke point for this system.

/Scott



Re: [zfs-discuss] avail drops to 32.1T from 40.8T after create -o mountpoint

2009-07-28 Thread Scott Lawson



Glen Gunselman wrote:

This is my first ZFS pool.  I'm using an X4500 with 48 TB drives.  Solaris is
5/09. After the create, zpool list shows 40.8T, but after creating 4
filesystems/mountpoints the available space drops 8.8TB to 32.1TB.  What
happened to the 8.8TB? Is this much overhead normal?


zpool create -f zpool1 raidz c1t0d0 c2t0d0 c3t0d0 c5t0d0 c6t0d0 \
   raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
   raidz c6t1d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 \
   raidz c5t2d0 c6t2d0 c1t3d0 c2t3d0 c3t3d0 \
   raidz c4t3d0 c5t3d0 c6t3d0 c1t4d0 c2t4d0 \
   raidz c3t4d0 c5t4d0 c6t4d0 c1t5d0 c2t5d0 \
   raidz c3t5d0 c4t5d0 c5t5d0 c6t5d0 c1t6d0 \
   raidz c2t6d0 c3t6d0 c4t6d0 c5t6d0 c6t6d0 \
   raidz c1t7d0 c2t7d0 c3t7d0 c4t7d0 c5t7d0 \
   spare c6t7d0 c4t0d0 c4t4d0
zpool list
NAME     SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
zpool1  40.8T   176K  40.8T   0%  ONLINE  -
## create multiple file systems in the pool

zfs create -o mountpoint=/backup1fs zpool1/backup1fs
zfs create -o mountpoint=/backup2fs zpool1/backup2fs
zfs create -o mountpoint=/backup3fs zpool1/backup3fs
zfs create -o mountpoint=/backup4fs zpool1/backup4fs
zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zpool1             364K  32.1T  28.8K  /zpool1
zpool1/backup1fs  28.8K  32.1T  28.8K  /backup1fs
zpool1/backup2fs  28.8K  32.1T  28.8K  /backup2fs
zpool1/backup3fs  28.8K  32.1T  28.8K  /backup3fs
zpool1/backup4fs  28.8K  32.1T  28.8K  /backup4fs

Thanks,
Glen
(PS. As I said this is my first time working with ZFS, if this is a dumb 
question - just say so.)
  
Here is the output from my J4500 with 48 x 1 TB disks. It is almost exactly the
same configuration as yours. This is used for NetBackup. As Mario just pointed
out, "zpool list" includes the parity drives in the space calculation whereas
"zfs list" doesn't.
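
As a rough check of the parity explanation, the sketch below estimates what "zfs list" should report from the size shown by "zpool list". It assumes every raidz vdev has the same width and ignores metadata and reservation overhead, so real figures will come out a little lower. The function name raidz_usable is made up for illustration.

```shell
#!/bin/sh
# Estimate usable space from the raw pool size reported by "zpool list".
# Assumptions: all vdevs have the same number of disks, and filesystem
# overhead is ignored (so the real number will be somewhat lower).
#   $1 = raw size (e.g. 40.8, in T)
#   $2 = disks per vdev
#   $3 = parity disks per vdev (1 for raidz1)
raidz_usable() {
    awk -v raw="$1" -v width="$2" -v parity="$3" \
        'BEGIN { printf "%.2f\n", raw * (width - parity) / width }'
}

# Glen's pool: 40.8T raw, built from 5-disk raidz1 vdevs
raidz_usable 40.8 5 1
```

This gives roughly 32.6T, which is close to the 32.1T that "zfs list" shows; the remainder is presumably overhead that this estimate leaves out. Applied to the USED column of nbupool, 34.4T in "zpool list" works out to about 27.5T, which matches its "zfs list" figure.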

[r...@xxx /]#> zpool status

errors: No known data errors

pool: nbupool
state: ONLINE
scrub: none requested
config:

NAME STATE READ WRITE CKSUM
nbupool ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c2t8d0 ONLINE 0 0 0
c2t9d0 ONLINE 0 0 0
c2t10d0 ONLINE 0 0 0
c2t11d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t12d0 ONLINE 0 0 0
c2t13d0 ONLINE 0 0 0
c2t14d0 ONLINE 0 0 0
c2t15d0 ONLINE 0 0 0
c2t16d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t17d0 ONLINE 0 0 0
c2t18d0 ONLINE 0 0 0
c2t19d0 ONLINE 0 0 0
c2t20d0 ONLINE 0 0 0
c2t21d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t22d0 ONLINE 0 0 0
c2t23d0 ONLINE 0 0 0
c2t24d0 ONLINE 0 0 0
c2t25d0 ONLINE 0 0 0
c2t26d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t27d0 ONLINE 0 0 0
c2t28d0 ONLINE 0 0 0
c2t29d0 ONLINE 0 0 0
c2t30d0 ONLINE 0 0 0
c2t31d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t32d0 ONLINE 0 0 0
c2t33d0 ONLINE 0 0 0
c2t34d0 ONLINE 0 0 0
c2t35d0 ONLINE 0 0 0
c2t36d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t37d0 ONLINE 0 0 0
c2t38d0 ONLINE 0 0 0
c2t39d0 ONLINE 0 0 0
c2t40d0 ONLINE 0 0 0
c2t41d0 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c2t42d0 ONLINE 0 0 0
c2t43d0 ONLINE 0 0 0
c2t44d0 ONLINE 0 0 0
c2t45d0 ONLINE 0 0 0
c2t46d0 ONLINE 0 0 0
spares
c2t47d0 AVAIL
c2t48d0 AVAIL
c2t49d0 AVAIL

errors: No known data errors
[r...@xxx /]#> zfs list
NAME USED AVAIL REFER MOUNTPOINT
NBU 113G 20.6G 113G /NBU
nbupool 27.5T 4.58T 30.4K /nbupool
nbupool/backup1 6.90T 4.58T 6.90T /backup1
nbupool/backup2 6.79T 4.58T 6.79T /backup2
nbupool/backup3 7.28T 4.58T 7.28T /backup3
nbupool/backup4 6.43T 4.58T 6.43T /backup4
nbupool/nbushareddisk 20.1G 4.58T 20.1G /nbushareddisk
nbupool/zfscachetest 69.2G 4.58T 69.2G /nbupool/zfscachetest

[r...@xxx /]#> zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
NBU 136G 113G 22.8G 83% ONLINE -
nbupool 40.8T 34.4T 6.37T 84% ONLINE -
[r...@solnbu1 /]#>


--
_______


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 




[zfs-discuss] L2ARC support in Solaris 10 (Update 8?)

2009-07-22 Thread Scott Lawson

Hi All,

Can anyone shed some light on whether L2ARC support will be included in the
next Solaris 10 update? Or whether it is included in a kernel patch over and
above the standard kernel patch rev that ships in 05/09 (AKA U7)?

The reason I ask is that I have standardised on S10 here and am not keen to
deploy OpenSolaris in production. (Just another platform and patching system
to document and maintain. I don't want to debate this here. It's the way it is.)

I am currently speccing some x4240s with SSDs for some upgraded Squid proxy
caches that will be handling caching duties for around 40 - 60 megabits/s.
Large disk caches and L1ARC for Squid will make these systems really fly.
(These are replacing two v240s that are getting a little long in the tooth
and won't keep up with the bandwidth jump.)

The plan is to have a couple of x4240s with dual quad-core processors, 16 GB
RAM and 6 x 146 GB 10K SAS drives plus 1 x 32 GB SSD as L2ARC. I can add the
SSD later if support for it is not available at build time, but is it
road-mapped for U8?

ZFS config will be a pair of 146 GB mirrored as boot drives (and possibly
access logging) and then a RAIDZ1 of 4 drives for max capacity (data is
disposable as it is purely cached object data). Compression will be enabled
on the disk cache RAIDZ1 to increase performance of cached data read from
disk. (seeing as I have many CPU cycles to burn in these systems ;) )

I am hoping that these systems will have a L1ARC of around 10GB, L2ARC of
32GB and cache volume of ~420GB RAIDZ plus compression. We may add more
drives or RAIDZ's as we tweak the Squid cached object size. We are hoping to
cache objects up to around 100 MB.
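
To make the planned layout concrete, here is a dry-run sketch that only prints the commands involved. The pool name and device names are placeholders, not taken from a real system, and the last command assumes a release where cache (L2ARC) devices are supported, which is exactly the open question for S10.

```shell
#!/bin/sh
# Dry run: print the commands for the planned cache pool instead of running them.
# Pool and device names are placeholders chosen for illustration.
# Usage: plan_cache_pool POOL SSD DISK1 DISK2 DISK3 DISK4
plan_cache_pool() {
    pool=$1; ssd=$2; shift 2
    echo "zpool create $pool raidz1 $*"   # 4-disk RAIDZ1 for the cached objects
    echo "zfs set compression=on $pool"   # spend spare CPU to read less from disk
    echo "zpool add $pool cache $ssd"     # SSD as L2ARC; needs cache device support
}

plan_cache_pool squidcache c1t6d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
```

Printing the commands rather than running them keeps the sketch safe to try on any machine; the cache device line can simply be left out until the OS supports it.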

Any comments on either system configuration and / or L2ARC support are 
invited from the list.


Thanks,

Scott.



Re: [zfs-discuss] Migrating a zfs pool to another server

2009-07-21 Thread Scott Lawson



Peter Farmer wrote:

Super!

Does the export need to be called just before I import the pool to
another server, 

Yes that is correct.

or can the export be called at the time the pool is
created?
No. It must be done on the server that is exporting the pool, so that it can be
imported as Daniel explained.

 because in a fail over I wouldn't be able to "export" the
pool before importing it.
  

In that case you would do a zpool import -f $mypoolname on the new server.

The '-f' will forcibly import the pool into the system, provided the pool isn't
horribly broken. In my experience this has worked fine on the numerous
occasions where this circumstance has arisen.
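
For a failover script, one way to combine the two cases is to try a normal import first and only force it when that fails. This is a minimal sketch, not a tested failover procedure; the function name import_pool is made up, and it assumes the old server is really down, because forcing an import while another host still has the pool imported risks corrupting it.

```shell
#!/bin/sh
# Import a pool on the takeover server: try a normal import first and fall
# back to a forced import only if that fails (e.g. the pool was never exported).
# Assumption: the old server is really down before this is run.
import_pool() {
    pool=$1
    zpool import "$pool" || {
        echo "plain import of $pool failed, retrying with -f" >&2
        zpool import -f "$pool"
    }
}

# Example (not run here): import_pool mypoolname
```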


Thanks,

Peter

2009/7/20 Daniel J. Priem :
  

Peter Farmer  writes:



Hi All,

I have a zfs pool setup on one server, the pool is made up of 4 iSCSI
luns, is it possible to migrate the zfs pool to another server? Each
of the iSCSI luns would be available on the other server.


Thanks,
  

yes.
zpool export $mypoolname  on the old server
zpool import $mypoolname  on the new server

regards
daniel

  




Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-14 Thread Scott Lawson
This system has 32 GB of RAM so I will probably need to increase the data set
size.
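
A quick way to see how far the default data set falls short: the test is only meaningful when the data set is larger than memory, and the script reports 3000 files of 8192000 bytes. The sketch below works out how many files of that size would be needed. It assumes "32 GB" means 32 GiB; the function name files_needed is made up for illustration.

```shell
#!/bin/sh
# How many 8192000-byte files does the test need so the data set exceeds RAM?
# Assumption: RAM is given in GiB; file size is the one the script reports.
#   $1 = RAM in GiB   $2 = bytes per file
files_needed() {
    ram_bytes=$(( $1 * 1024 * 1024 * 1024 ))
    file_bytes=$2
    echo $(( ram_bytes / file_bytes + 1 ))   # smallest count strictly above RAM
}

echo "default data set: $(( 3000 * 8192000 )) bytes"
echo "files needed for 32 GiB: $(files_needed 32 8192000)"
```

The default 3000 files come to about 24.6 GB, which is enough for a 16 GB machine but not for one with 32 GB.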


[r...@x tmp]#> ./zfs-cache-test.ksh nbupool
System Configuration:  Sun Microsystems  sun4v SPARC Enterprise T5220
System architecture: sparc
System release level: 5.10 Generic_141414-02
CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 
sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc


Pool configuration:
 pool: nbupool
state: ONLINE
scrub: none requested
config:

   NAME STATE READ WRITE CKSUM
   nbupool  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t2d0   ONLINE   0 0 0
   c2t3d0   ONLINE   0 0 0
   c2t4d0   ONLINE   0 0 0
   c2t5d0   ONLINE   0 0 0
   c2t6d0   ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t7d0   ONLINE   0 0 0
   c2t8d0   ONLINE   0 0 0
   c2t9d0   ONLINE   0 0 0
   c2t10d0  ONLINE   0 0 0
   c2t11d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t12d0  ONLINE   0 0 0
   c2t13d0  ONLINE   0 0 0
   c2t14d0  ONLINE   0 0 0
   c2t15d0  ONLINE   0 0 0
   c2t16d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t17d0  ONLINE   0 0 0
   c2t18d0  ONLINE   0 0 0
   c2t19d0  ONLINE   0 0 0
   c2t20d0  ONLINE   0 0 0
   c2t21d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t22d0  ONLINE   0 0 0
   c2t23d0  ONLINE   0 0 0
   c2t24d0  ONLINE   0 0 0
   c2t25d0  ONLINE   0 0 0
   c2t26d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t27d0  ONLINE   0 0 0
   c2t28d0  ONLINE   0 0 0
   c2t29d0  ONLINE   0 0 0
   c2t30d0  ONLINE   0 0 0
   c2t31d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t32d0  ONLINE   0 0 0
   c2t33d0  ONLINE   0 0 0
   c2t34d0  ONLINE   0 0 0
   c2t35d0  ONLINE   0 0 0
   c2t36d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t37d0  ONLINE   0 0 0
   c2t38d0  ONLINE   0 0 0
   c2t39d0  ONLINE   0 0 0
   c2t40d0  ONLINE   0 0 0
   c2t41d0  ONLINE   0 0 0
 raidz1 ONLINE   0 0 0
   c2t42d0  ONLINE   0 0 0
   c2t43d0  ONLINE   0 0 0
   c2t44d0  ONLINE   0 0 0
   c2t45d0  ONLINE   0 0 0
   c2t46d0  ONLINE   0 0 0
   spares
     c2t47d0  AVAIL
     c2t48d0  AVAIL
     c2t49d0  AVAIL


errors: No known data errors

zfs create nbupool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/nbupool/zfscachetest ...

Done!
zfs unmount nbupool/zfscachetest
zfs mount nbupool/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    3m37.24s
user    0m9.87s
sys     1m54.08s

Doing second 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    1m59.11s
user    0m9.93s
sys     1m49.15s

Feel free to clean up with 'zfs destroy nbupool/zfscachetest'.

Scott Lawson wrote:

Bob,

Output of my run for you. System is an M3000 with 16 GB RAM and 1 zpool called
test1, which is contained on a RAID 1 volume on a 6140 with 7.50.13.10
firmware on the RAID controllers. The RAID 1 is made up of two 146GB 15K FC
disks.

This machine is brand new with a clean install of S10 05/09. It is destined
to become an Oracle 10 server with ZFS filesystems for zones and DB volumes.

[r...@xxx /]#> uname -a
SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise
[r...@xxx /]#> cat /etc/release
  Solaris 10 5/09 s10s_u7wos_08 SPARC
  Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
   Use is subject to license terms.
Assembled 30 March 2009

[r...@xxx /]#> prtdiag -v | more
System Configuration:  Sun Microsystems  sun4u Sun SPARC Enterprise 
M3000 Server

System clock frequency: 1064 MHz
Memory size: 16384 Megabytes


Here is the run output for you.

[r...@xxx tmp]#> ./zfs-cache-test.ksh test1
zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/test1/zfscachetest ...

Done!
zfs unmount test1/zfscachetest
zfs mount test1/z

Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-14 Thread Scott Lawson



Bob Friesenhahn wrote:

On Wed, 15 Jul 2009, Scott Lawson wrote:


  NAME                                   STATE     READ WRITE CKSUM
  test1                                  ONLINE       0     0     0
    mirror                               ONLINE       0     0     0
      c3t600A0B8000562264039B4A257E11d0  ONLINE       0     0     0
      c3t600A0B8000336DE204394A258B93d0  ONLINE       0     0     0
Each of these LUNs is a pair of 146GB 15K drives in a RAID1 on Crystal
firmware on a 6140. The LUNs are 2km apart in different data centres: one LUN
where the server is, one remote.

Interestingly, by creating the mirror vdev the first run got faster, and the
second much, much slower. The second cpio took an extra 2 minutes by virtue of
it being a mirror. I ran the script once again prior to adding the mirror and
the results were pretty much the same as the first run posted (plus or minus
a couple of seconds, which is to be expected as these LUNs are on prod arrays
feeding other servers as well).


I will try these tests on some of my J4500's when I get a chance 
shortly. My interest is now piqued.




Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    3m25.13s
user    0m2.67s
sys     0m28.40s


It is quite impressive that your little two disk mirror reads as fast 
as mega Sun systems with 38+ disks and striped vdevs to boot. Incredible!


Does this have something to do with your well-managed power and 
cooling? :-)

Maybe it is Bob, maybe it is. ;) haha.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/




Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-14 Thread Scott Lawson
I added a second LUN, identical in size, as a mirror and reran the test.
Results are more in line with yours now.


./zfs-cache-test.ksh test1
System Configuration:  Sun Microsystems  sun4u Sun SPARC Enterprise 
M3000 Server

System architecture: sparc
System release level: 5.10 Generic_139555-08
CPU ISA list: sparcv9+vis2 sparcv9+vis sparcv9 sparcv8plus+vis2 
sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc


Pool configuration:
 pool: test1
state: ONLINE
scrub: resilver completed after 0h0m with 0 errors on Wed Jul 15 11:38:54 2009

config:

   NAME                                   STATE     READ WRITE CKSUM
   test1                                  ONLINE       0     0     0
     mirror                               ONLINE       0     0     0
       c3t600A0B8000562264039B4A257E11d0  ONLINE       0     0     0
       c3t600A0B8000336DE204394A258B93d0  ONLINE       0     0     0


errors: No known data errors

zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/test1/zfscachetest ...

Done!
zfs unmount test1/zfscachetest
zfs mount test1/zfscachetest

Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
48000256 blocks

real    3m25.13s
user    0m2.67s
sys     0m28.40s

Doing second 'cpio -C 131072 -o > /dev/null'

48000256 blocks

real    8m53.05s
user    0m2.69s
sys     0m32.83s

Feel free to clean up with 'zfs destroy test1/zfscachetest'.

Scott Lawson wrote:

Bob,

Output of my run for you. System is an M3000 with 16 GB RAM and 1 zpool called
test1, which is contained on a RAID 1 volume on a 6140 with 7.50.13.10
firmware on the RAID controllers. The RAID 1 is made up of two 146GB 15K FC
disks.

This machine is brand new with a clean install of S10 05/09. It is destined
to become an Oracle 10 server with ZFS filesystems for zones and DB volumes.

[r...@xxx /]#> uname -a
SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise
[r...@xxx /]#> cat /etc/release
  Solaris 10 5/09 s10s_u7wos_08 SPARC
  Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
   Use is subject to license terms.
Assembled 30 March 2009

[r...@xxx /]#> prtdiag -v | more
System Configuration:  Sun Microsystems  sun4u Sun SPARC Enterprise 
M3000 Server

System clock frequency: 1064 MHz
Memory size: 16384 Megabytes


Here is the run output for you.

[r...@xxx tmp]#> ./zfs-cache-test.ksh test1
zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/test1/zfscachetest ...

Done!
zfs unmount test1/zfscachetest
zfs mount test1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real    4m48.94s
user    0m21.58s
sys     0m44.91s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real    6m39.87s
user    0m21.62s
sys     0m46.20s

Feel free to clean up with 'zfs destroy test1/zfscachetest'.

Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained
on the first run and around 60MB/s sustained on the 2nd.

/Scott.


Bob Friesenhahn wrote:
There has been no forward progress on the ZFS read performance issue 
for a week now.  A 4X reduction in file read performance due to 
having read the file before is terrible, and of course the situation 
is considerably worse if the file was previously mmapped as well.  
Many of us have sent a lot of money to Sun and were not aware that 
ZFS is sucking the life out of our expensive Sun hardware.


It is trivially easy to reproduce this problem on multiple machines. 
For example, I reproduced it on my Blade 2500 (SPARC) which uses a 
simple mirrored rpool.  On that system there is a 1.8X read slowdown 
from the file being accessed previously.


In order to raise visibility of this issue, I invite others to see if 
they can reproduce it in their ZFS pools.  The script at


http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh 



Implements a simple test.  It requires a fair amount of disk space to 
run, but the main requirement is that the disk space consumed be more 
than available memory so that file data gets purged from the ARC. The 
script needs to run as root since it creates a filesystem and uses 
mount/umount.  The script does not destroy any data.


There are several adjustments which may be made at the front of the 
script.  The pool 'rpool' is used by default, but the name of the 
pool to test may be supplied via an argument similar to:


# ./zfs-cache-test.ksh Sun_2540
zfs create Sun_2540/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/Sun_2540/zfscachetest ...

Done!
zfs unmount Sun_2540/zfscachetest
zfs mount Sun_2540/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real    2m54.17s
user    0m7.65s
sys     0m36.59s

Doin

Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-12 Thread Scott Lawson

Bob,

Output of my run for you. System is an M3000 with 16 GB RAM and 1 zpool called
test1, which is contained on a RAID 1 volume on a 6140 with 7.50.13.10
firmware on the RAID controllers. The RAID 1 is made up of two 146GB 15K FC
disks.

This machine is brand new with a clean install of S10 05/09. It is destined
to become an Oracle 10 server with ZFS filesystems for zones and DB volumes.

[r...@xxx /]#> uname -a
SunOS xxx 5.10 Generic_139555-08 sun4u sparc SUNW,SPARC-Enterprise
[r...@xxx /]#> cat /etc/release
  Solaris 10 5/09 s10s_u7wos_08 SPARC
  Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
   Use is subject to license terms.
Assembled 30 March 2009

[r...@xxx /]#> prtdiag -v | more
System Configuration:  Sun Microsystems  sun4u Sun SPARC Enterprise 
M3000 Server

System clock frequency: 1064 MHz
Memory size: 16384 Megabytes


Here is the run output for you.

[r...@xxx tmp]#> ./zfs-cache-test.ksh test1
zfs create test1/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/test1/zfscachetest ...

Done!
zfs unmount test1/zfscachetest
zfs mount test1/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real    4m48.94s
user    0m21.58s
sys     0m44.91s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real    6m39.87s
user    0m21.62s
sys     0m46.20s

Feel free to clean up with 'zfs destroy test1/zfscachetest'.

Looks like a 25% performance loss for me. I was seeing around 80MB/s sustained
on the first run and around 60MB/s sustained on the 2nd.
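
The same figures can be cross-checked from the script output alone. The sketch below converts a cpio block count and the "real" time into MB/s, assuming cpio is reporting 512-byte blocks and taking MB as 10^6 bytes. The function name cpio_mbps is made up for illustration.

```shell
#!/bin/sh
# Convert a cpio block count and a "time" result into MB/s.
# Assumptions: cpio reports 512-byte blocks, and MB means 10^6 bytes.
#   $1 = blocks   $2 = minutes   $3 = seconds
cpio_mbps() {
    awk -v blocks="$1" -v min="$2" -v sec="$3" \
        'BEGIN { printf "%.1f\n", blocks * 512 / (min * 60 + sec) / 1000000 }'
}

cpio_mbps 48000247 4 48.94   # first run
cpio_mbps 48000247 6 39.87   # second run
```

That comes out at roughly 85 MB/s for the first run and 61 MB/s for the second, in line with the sustained rates observed and a loss somewhere between a quarter and a third.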

/Scott.


Bob Friesenhahn wrote:
There has been no forward progress on the ZFS read performance issue 
for a week now.  A 4X reduction in file read performance due to having 
read the file before is terrible, and of course the situation is 
considerably worse if the file was previously mmapped as well.  Many 
of us have sent a lot of money to Sun and were not aware that ZFS is 
sucking the life out of our expensive Sun hardware.


It is trivially easy to reproduce this problem on multiple machines. 
For example, I reproduced it on my Blade 2500 (SPARC) which uses a 
simple mirrored rpool.  On that system there is a 1.8X read slowdown 
from the file being accessed previously.


In order to raise visibility of this issue, I invite others to see if 
they can reproduce it in their ZFS pools.  The script at


http://www.simplesystems.org/users/bfriesen/zfs-discuss/zfs-cache-test.ksh 



Implements a simple test.  It requires a fair amount of disk space to 
run, but the main requirement is that the disk space consumed be more 
than available memory so that file data gets purged from the ARC. The 
script needs to run as root since it creates a filesystem and uses 
mount/umount.  The script does not destroy any data.


There are several adjustments which may be made at the front of the 
script.  The pool 'rpool' is used by default, but the name of the pool 
to test may be supplied via an argument similar to:


# ./zfs-cache-test.ksh Sun_2540
zfs create Sun_2540/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/Sun_2540/zfscachetest ...

Done!
zfs unmount Sun_2540/zfscachetest
zfs mount Sun_2540/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real    2m54.17s
user    0m7.65s
sys     0m36.59s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real    11m54.65s
user    0m7.70s
sys     0m35.06s

Feel free to clean up with 'zfs destroy Sun_2540/zfscachetest'.

And here is a similar run on my Blade 2500 using the default rpool:

# ./zfs-cache-test.ksh
zfs create rpool/zfscachetest
Creating data file set (3000 files of 8192000 bytes) under 
/rpool/zfscachetest ...

Done!
zfs unmount rpool/zfscachetest
zfs mount rpool/zfscachetest

Doing initial (unmount/mount) 'cpio -o > /dev/null'
48000247 blocks

real    13m3.91s
user    2m43.04s
sys     9m28.73s

Doing second 'cpio -o > /dev/null'
48000247 blocks

real    23m50.27s
user    2m41.81s
sys     9m46.76s

Feel free to clean up with 'zfs destroy rpool/zfscachetest'.

I am interested to hear about systems which do not suffer from this bug.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Scott Lawson



David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting seconds, 
or a minute.  Otherwise the outage is usually longer than the UPSs 
can stay up since the problem required human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't 
restored within about ten minutes, then it will probably be out for a 
few hours. If this 'statistic' is true, it would mean that your UPS 
should last (say) fifteen minutes, and after that you really need a 
generator.
Most UPSs from any vendor are designed to run for around 12 minutes at full
load. So that would appear to back that claim up, and from my experience that
is pretty much on the money...


At $WORK we currently have about thirty minutes worth of juice at full 
load, but as time drags on and we start shutting down less essential 
stuff we can increase that. The PBX and security system have their own 
UPSes in their own racks, so there are two layers of battery there.
The problem comes when the power cut happens in the middle of the night and
you aren't there. Then you either need an automated shutdown system instigated
by traps from the UPS (shutting things down in the correct order) or a
generator. About here the generator becomes a very good option. The
no-generator scenario above needs to be tested consistently to maintain its
validity, which is a royal pain in the neck. Gen sets are worth their weight
in gold. I can't even think how many times in the last few years they have
saved our bacon (through both planned and unplanned outages).




Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Scott Lawson



Monish Shah wrote:

A related question:  If you are on a UPS, is it OK to disable ZIL?
I think the answer to this is no. UPSs do fail. If you have two redundant
units, the answer *might* be maybe. But prudence says *no*.

I have seen numerous UPS failures over the years, including cascading UPS
failures caused by poorly engineered (or even more poorly managed and
maintained) electrical systems supporting server environments. One would have
to weigh up the risk against the gain, and that would be *very* specific to
any environment. The only time it would be acceptable, IMO, is if the data is
disposable and recreating your pool and data is not an issue (and all of the
accompanying downtime that would go with it is acceptable).


Really, no one should disable the ZIL; rather, look into write-optimized SSDs
for the ZIL instead, particularly if you are so interested in performance
that you are considering disabling your ZIL.
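
For reference, a dedicated log device is added with "zpool add ... log". The sketch below is a dry run that only prints the command; pool and device names are placeholders, and it assumes a ZFS release that supports separate log devices. With two or more devices it prints the mirrored form, which seems the safer choice for something as important as the ZIL.

```shell
#!/bin/sh
# Dry run: print the command that would add a dedicated log (ZIL) device.
# Pool and device names are placeholders; requires a ZFS release that
# supports separate log devices.
# Usage: plan_add_slog POOL DEVICE [DEVICE...]
plan_add_slog() {
    pool=$1; shift
    if [ "$#" -gt 1 ]; then
        echo "zpool add $pool log mirror $*"   # mirrored log devices
    else
        echo "zpool add $pool log $1"          # single log device
    fi
}

plan_add_slog mypool c4t0d0
plan_add_slog mypool c4t0d0 c4t1d0
```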


The evil tuning guide says "The ZIL is an essential part of ZFS and 
should never be disabled."  However, if you have a UPS, what can go 
wrong that really requires ZIL?


Opinions?

Monish

- Original Message - From: "Ross" 
To: 
Sent: Tuesday, June 30, 2009 3:04 PM
Subject: Re: [zfs-discuss] ZFS, power failures, and UPSes


I've seen enough people suffer from corrupted pools that a UPS is 
definitely good advice.  However, I'm running a (very low usage) ZFS 
server at home and it's suffered through at least half a dozen power 
outages without any problems at all.


I do plan to buy a UPS as soon as I can, but it seems to be surviving 
very well so far.

--
This message posted from opensolaris.org


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Scott Lawson



Haudy Kazemi wrote:

Hello,

I've looked around Google and the zfs-discuss archives but have not 
been able to find a good answer to this question (and the related 
questions that follow it):


How well does ZFS handle unexpected power failures? (e.g. 
environmental power failures, power supply dying, etc.)

Does it consistently gracefully recover?
Mostly. Unless you are unlucky. Backups are your friend in *any* 
environment though.
Should having a UPS be considered a (strong) recommendation or a 
"don't even think about running without it" item?
There has been quite an interesting thread on this over the last few months.
I won't repeat my comments, but it is there for digital posterity in the
zfs-discuss archives.

Certainly in a large environment with a lot of data being written, one should
consider this a mandatory requirement if you care about your data,
particularly if there are many links in your storage chain that can cause
data corruption on power failure.


Are there any communications/interfacing caveats to be aware of when 
choosing the UPS?


In this particular case, we're talking about a home file server 
running OpenSolaris 2009.06.  
As far as a home server goes, particularly if it is not write intensive, then
you will 'most likely' be fine. I have one at home, a v120 running S10 u6 with
a D1000 and 7 x 300 GB SCSI disks in a RAIDZ2, that has seen numerous power
interruptions with no faults. This machine is a Samba server for my Macs and
printing business.

I also have a mail / web server on another v120 which experiences the same
power faults and regularly bounces back without issues. But your mileage may
vary. It all really depends on how much you care about the data.

I haven't used OpenSolaris specifically, however, as I prefer the generally
better supported S10 releases. (Yes, I know you can get support for OS, but I
tend to be conservative and standardize as much as possible. I do have
millions of files stored on ZFS volumes for our Uni and I sleep well ;))


Actual environment power failures are generally < 1 per year.  I know 
there are a few blog articles about this type of application, but I 
don't recall seeing any (or any detailed) discussion about power 
failures and UPSes as they relate to ZFS.  I did see that the ZFS Evil 
Tuning Guide says cache flushes are done every 5 seconds.
The flush time you mention is based on older versions of ZFS; newer ones can
have a flush time as long as 30 seconds now, I believe.


Here is one post that didn't get any replies about a year ago after 
someone had a power failure, then UPS battery failure while copying 
data to a ZFS pool:

http://lists.macosforge.org/pipermail/zfs-discuss/2008-July/000670.html

Both theoretical answers and real life experiences would be 
appreciated as the former tells me where ZFS is needed while the latter 
tells me where it has been or is now.


Thanks,

-hk


Re: [zfs-discuss] Increase size of ZFS mirror

2009-06-24 Thread Scott Lawson



Thomas Maier-Komor wrote:

Ben schrieb:
  
Thomas, 


Could you post an example of what you mean (ie commands in the order to use 
them)?  I've not played with ZFS that much and I don't want to muck my system 
up (I have data backed up, but am more concerned about getting myself in a mess 
and having to reinstall, thus losing my configurations).

Many thanks for both of your replies,
Ben



I'm not an expert on this, and I haven't tried it, so beware:


1) If the pool you want to expand is not the root pool:

$ zpool export mypool
# now replace one of the disks with a new disk
$ zpool import mypool
# zpool status will show that mypool is in degraded state because of a
missing disk
$ zpool replace mypool replaceddisk
# now the pool will start resilvering

# Once it is done with resilvering:
$ zpool detach mypool 
#  now physically replace 
$ zpool replace mypool 

  
This will all work well. But I have a couple of suggestions for you as well.

If you are using mirrored vdevs then you can also grow the vdev by making it
a 3- or 4-way mirror. This way you don't lose the resiliency of your vdev
whilst you are migrating to larger disks. Of course, you have to be able to
fit the extra device in your system, either via a spare drive bay in a
storage enclosure or via SAN or iSCSI based LUNs.

When you have a lot of data and the business requires you to minimize risk as
much as possible, this is a good idea. The pool was only offline for 14
seconds to gain the extra space, and at all times there were *always* two
devices in my mirror vdev.

Here is a cut and paste of this process from just the other day, on a live
production server where the maintenance window was only 5 minutes. This pool
was increased from 300 to 500 GB on LUNs from two disparate datacentres.

2009-06-17.13:57:05 zpool attach blackboard c4t600C0FF00924686710D4CF02d0 c4t600C0FF00082CA2312B99E05d0
2009-06-17.18:12:14 zpool detach blackboard c4t600C0FF00080797CC7A87F02d0
2009-06-17.18:12:57 zpool attach blackboard c4t600C0FF00924686710D4CF02d0 c4t600C0FF00086136F22B65F05d0
2009-06-17.20:02:00 zpool detach blackboard c4t600C0FF00924686710D4CF02d0
2009-06-18.05:58:52 zpool export blackboard
2009-06-18.05:59:06 zpool import blackboard

For home users this is probably overkill, but I thought I would mention it for
more enterprise type people who are maybe familiar with DiskSuite but not so
much with ZFS.
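
The attach/detach sequence from the history above can be written out generically as follows. This is a dry-run sketch that only prints the commands in order; the pool and device names are placeholders and the function name plan_mirror_grow is made up. Each attach has to finish resilvering (check "zpool status") before the detach that follows it is run.

```shell
#!/bin/sh
# Dry run: print the attach/detach sequence for moving a two-way mirror onto
# larger LUNs without ever dropping below two devices in the vdev.
# All names are placeholders. Wait for each resilver to complete
# (see "zpool status") before running the detach that follows an attach.
# Usage: plan_mirror_grow POOL OLD_A OLD_B NEW_A NEW_B
plan_mirror_grow() {
    pool=$1 old_a=$2 old_b=$3 new_a=$4 new_b=$5
    echo "zpool attach $pool $old_a $new_a"   # now a 3-way mirror
    echo "zpool detach $pool $old_b"          # after resilver: old_a + new_a
    echo "zpool attach $pool $old_a $new_b"   # 3-way mirror again
    echo "zpool detach $pool $old_a"          # after resilver: new_a + new_b
    echo "zpool export $pool"                 # brief outage to pick up the new size
    echo "zpool import $pool"
}

plan_mirror_grow mypool old1 old2 new1 new2
```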


2) if you are working on the root pool, just skip export/import part and
boot with only one half of the mirror. Don't forget to run installgrub
after replacing a disk.

HTH,
Thomas
  


--
_______


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 




Re: [zfs-discuss] ZFS Path = ???

2009-05-19 Thread Scott Lawson

Howard,

I have certainly seen this with other apps making the ZFS console
disappear. This is what I use to make it available again. I have
commonly had this happen when I have installed CAM into the webconsole
in the past. (Don't think it happened the last time I did however ;))

# wcadmin deploy -a zfs -x zfs /usr/share/webconsole/webapps/zfs

Also if you wish to make the webconsole accessible from more than just
the localhost, use:


# svccfg -s svc:/system/webconsole setprop options/tcp_listen = true
# smcwebserver restart

Hope this helps,

Scott.





cindy.swearin...@sun.com wrote:

Hi Howard,

Which Solaris release is this?

You shouldn't have to register the ZFS app, but other problems 
prevented the ZFS GUI tool from launching successfully in the Solaris 
10 release.


If you can provide the Solaris release info and specific error messages,
I can try to get some answers.

Thanks,

Cindy

Howard Huntley wrote:
I am running ZFS on a sun Blade 100 with 2 80gig drives. When I 
upgraded the OS to the latest version of Solaris, ZFS did not 
register in the Java WEB console. I have to run the command "smreg 
add -a /directory/containing/application-file" to manually reregister 
the zfs app. Can any one tell me the correct path to use in this 
command?


Password = howard
http://imageevent.com/hhuntley/computerlab/computerlab






Re: [zfs-discuss] SAS 15K drives as L2ARC

2009-05-06 Thread Scott Lawson



Bob Friesenhahn wrote:

On Thu, 7 May 2009, Scott Lawson wrote:


A STK2540 storage array with this configuration:

* Tray 1: Twelve (12) 146 GB @ 15K SAS HDDs.
* Tray 2: Twelve (12) 1 TB @ 7200 SATA HDDs.

Just thought I would point out that these are hardware backed RAID
arrays. You might be better off using the J4200 instead for this so ZFS
can manage the disks completely as well. Will probably be cheaper too!
The savings could be put towards some SSD's or more system RAM for L1ARC.


Something nice about the STK2540 solution is that if the server system
dies, the STK2540's can quickly be swung over to another system via a
quick 'zpool import'.

Sure, provided they have it attached to a fibre channel switch or have a
nice long fibre lead. The difference is negligible other than cost.
Roger replied off list and mentioned the customer has the 2540 already,
so my suggestion is moot anyway for him.

FYI, I have relocated zpools both ways, with SAN attached 3510's,
3511's and 6140's, and SAS attached J4500's. Both ways work just fine.
One is cheaper. ;) Since he was mentioning astronomical data, which I
know means large datasets, I just thought I would point out that the
2540 probably wouldn't offer the best bang for buck for this NFS server.
That's all.

If SSDs are embedded inside the server system then it is necessary to
physically move the log devices to the new system.

It is possible to buy these J series JBODs with bundled SSD's as well
right now. The log device would be contained in this chassis, which
would make importing and exporting easy if a system shift were required.


The issue about how to quickly recover after the server dies seems to 
rarely be discussed here.  Embedded log devices tend to make issues 
more complex.


A dumb SAS array is certainly much cheaper and will perform at least 
as well, but it does seem like these newfangled embedded log devices 
cause an issue when maximum availability is desired.  With SAS it is 
necessary to physically swing the cables to the replacement server and 
of course the replacement server needs to be very close by.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/




Re: [zfs-discuss] SAS 15K drives as L2ARC

2009-05-06 Thread Scott Lawson



Roger Solano wrote:

Hello,

Does it make any sense to  use a bunch of 15K SAS drives as L2ARC 
cache for several TBs of SATA disks?


For example:

A STK2540 storage array with this configuration:

* Tray 1: Twelve (12) 146 GB @ 15K SAS HDDs.
* Tray 2: Twelve (12) 1 TB @ 7200 SATA HDDs.

Just thought I would point out that these are hardware backed RAID
arrays. You might be better off using the J4200 instead for this so ZFS
can manage the disks completely as well. Will probably be cheaper too!
The savings could be put towards some SSD's or more system RAM for L1ARC.

I was thinking about using disks from Tray 1 as L2ARC for Tray 2 and 
put all of these disks in one (1) zfs storage pool.


This pool would be used mainly as astronomical images repository, 
shared via NFS over a Sun Fire X2200.


Is it worth to do?

Thanks in advance for any help.

Regards,
Roger


--

  *Roger Solano*
Solutions Architect
ACC Region - Venezuela


*Sun Microsystems, Inc.*
Phone: +58-212-905-3800
Fax: +58-212-905-3811
Email: roger.sol...@sun.com
NOTICE: This message is intended for the exclusive use of the recipient
and may contain confidential material and information. Any unauthorized
review, use, publication or distribution of the material or information
is prohibited. If you are not the intended recipient, please contact the
sender by email and destroy all copies of the original message.








Re: [zfs-discuss] ZFS + EMC Cx310 Array (JBOD ? Or Singe MetaLUN ?)

2009-04-30 Thread Scott Lawson



Wilkinson, Alex wrote:
On Thu, Apr 30, 2009 at 11:11:55AM -0500, Bob Friesenhahn wrote:

>On Thu, 30 Apr 2009, Wilkinson, Alex wrote:
>>
>> I currently have a single 17TB MetaLUN that i am about to present to an
>> OpenSolaris initiator and it will obviously be ZFS. However, I am constantly
>> reading that presenting a JBOD and using ZFS to manage the RAID is best
>> practice ? Im not really sure why ? And isn't that a waste of a high performing
>> RAID array (EMC) ?
>
>The JBOD "advantage" is that then ZFS can schedule I/O for the disks 
>and there is less chance of an unrecoverable pool since ZFS is assured 
>to lay out redundant data on redundant hardware and ZFS uses more 
>robust error detection than the firmware on any array.  When using 
>mirrors there is considerable advantage since writes and reads can be 
>concurrent.

>
>That said, your EMC hardware likely offers much nicer interfaces for 
>indicating and replacing bad disk drives.  With the ZFS JBOD approach 
>you have to back-track from what ZFS tells you (a Solaris device ID) 
>and figure out which physical drive is not behaving correctly.  EMC 
>tech support may not be very helpful if ZFS says there is something 
>wrong but the raid array says there is not. Sometimes there is value 
>with taking advantage of what you paid for.


So, shall I forget ZFS and use UFS ?
  


Can you share more of your system configuration / intended use?

UFS has a limit of 16TB for a single filesystem, and such a filesystem
is limited to roughly 1 million inodes per TB. So if you want to store a
lot of small files you may find you have a problem. I have certainly run
into this limitation on numerous occasions. (Filesystems smaller than
~1TB have a very high limit for inodes and this generally isn't an
issue.)
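
As a rough illustration of the inode ceiling described above, the
arithmetic can be sketched in shell. The figure of one million inodes
per TB is the rule of thumb from this post, not a value read from a real
file system, and the function names ufs_max_files and ufs_fits are made
up for the example.

```shell
#!/bin/sh
# Rule of thumb from the post: ~1 million inodes per TB on a large UFS.
INODES_PER_TB=1000000

# Hypothetical helper: approximate file-count ceiling for a UFS of SIZE_TB.
ufs_max_files() {
    echo $(( $1 * INODES_PER_TB ))
}

# Will FILES files fit on a SIZE_TB file system? Prints "ok" or "too many".
ufs_fits() {
    size_tb=$1; files=$2
    if [ "$files" -le "$(ufs_max_files "$size_tb")" ]; then
        echo ok
    else
        echo "too many"
    fi
}

ufs_max_files 16        # prints 16000000
ufs_fits 16 50000000    # prints "too many"
```

Put differently, under this rule of thumb files averaging much less than
about 1 MB each will run out of inodes before they run out of space.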


Beyond what I mentioned in my other post it is hard to recommend
anything else. ZFS does tend to have higher hardware requirements than
UFS and doesn't perform particularly well with low amounts of RAM. But
without more workload information it is pretty hard to advise the best
path that you should take.


 -aW

IMPORTANT: This email remains the property of the Australian Defence 
Organisation and is subject to the jurisdiction of section 70 of the CRIMES ACT 
1914.  If you have received this email in error, you are requested to contact 
the sender and delete the email.



Re: [zfs-discuss] ZFS + EMC Cx310 Array (JBOD ? Or Singe MetaLUN ?)

2009-04-30 Thread Scott Lawson



Wilkinson, Alex wrote:
On Thu, Apr 30, 2009 at 11:11:55AM -0500, Bob Friesenhahn wrote:

>On Thu, 30 Apr 2009, Wilkinson, Alex wrote:
>>
>> I currently have a single 17TB MetaLUN that i am about to present to an
>> OpenSolaris initiator and it will obviously be ZFS. However, I am constantly
>> reading that presenting a JBOD and using ZFS to manage the RAID is best
>> practice ? Im not really sure why ? And isn't that a waste of a high performing
>> RAID array (EMC) ?
>
>The JBOD "advantage" is that then ZFS can schedule I/O for the disks 
>and there is less chance of an unrecoverable pool since ZFS is assured 
>to lay out redundant data on redundant hardware and ZFS uses more 
>robust error detection than the firmware on any array.  When using 
>mirrors there is considerable advantage since writes and reads can be 
>concurrent.

>
>That said, your EMC hardware likely offers much nicer interfaces for 
>indicating and replacing bad disk drives.  With the ZFS JBOD approach 
>you have to back-track from what ZFS tells you (a Solaris device ID) 
>and figure out which physical drive is not behaving correctly.  EMC 
>tech support may not be very helpful if ZFS says there is something 
>wrong but the raid array says there is not. Sometimes there is value 
>with taking advantage of what you paid for.


So forget ZFS and use UFS ? Or use UFS with a ZVOL ? Or Just use Vx{VM,FS} ?
It kinda sux that you get no benefit from using such a killer volume manager
+ filesystem with an EMC array :(

 -aW
  
Besides the volume management aspects of ZFS and self healing etc, you
still get other benefits by virtue of using ZFS. Depending on *your*
requirements, they can be arguably more beneficial, if you are happy
with the reliability of your underlying storage.

Specifically I am talking of ZFS snapshots, rollbacks, cloning, clone
promotion, file system quotas, multiple block copies, compression,
(encryption soon) etc etc.

I have used snapshots, rollbacks and cloning quite successfully in
complex upgrades of systems with multiple packages and complex
dependencies.

Case in point was a Blackboard upgrade which had two servers, both with
ZFS file systems: one for Blackboard and one for Oracle. The upgrade
involved going through 3 versions of Oracle and 4 versions of Blackboard,
where the process had potentially many places to go wrong. At every
point of the way we performed a snapshot on both Oracle and Blackboard
to allow us to roll back any particular part that we got wrong. This
saved us an immense amount of time and money and is a good real world
example of where this side of ZFS has been extremely helpful.

On the Oracle side this was infinitely faster than having to roll back
the database itself. BB had some very large tables!
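
The snapshot-per-step approach described above might look something
like the following sketch. The dataset names (bbpool/blackboard and
orapool/oracle) and the step label are invented for illustration, since
the real ones are not given here, and the helpers only print the zfs
commands rather than running them. In the real upgrade the two file
systems lived on separate servers, so each command would be run on its
own host.

```shell
#!/bin/sh
# Hypothetical dataset names; substitute the real file systems.
DATASETS="bbpool/blackboard orapool/oracle"

# Print the commands that snapshot every dataset before upgrade step $1.
snapshot_step() {
    for ds in $DATASETS; do
        echo "zfs snapshot $ds@before-$1"
    done
}

# Print the commands that roll every dataset back to the state before
# step $1. The -r flag also destroys snapshots newer than the target.
rollback_step() {
    for ds in $DATASETS; do
        echo "zfs rollback -r $ds@before-$1"
    done
}

snapshot_step oracle-upgrade-1
rollback_step oracle-upgrade-1
```

Taking one snapshot per step keeps every intermediate state available,
so only the step that went wrong has to be repeated.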

Of course, to take full advantage of ZFS, as everyone has mentioned, it
is a good idea to let ZFS manage the underlying raw disks if possible.




Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson



Richard Elling wrote:

Some history below...

Scott Lawson wrote:


Michael Shadle wrote:

On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson
 wrote:

 
If possible though you would be best to let the 3ware controller expose
the 16 disks as a JBOD to ZFS and create a RAIDZ2 within Solaris as you
will then gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, just
that it is not the optimal size. Longer rebuild times for larger vdev's
in a zpool (although this is proportional to how full the pool is.). Two
parity disks gives you greater cover in the event of a drive failing in
a large vdev stripe.



Hmm, this is a bit disappointing to me. I would have dedicated only 2
disks out of 16 then to a single large raidz2 instead of two 8 disk
raidz2's (meaning 4 disks went to parity)

  
No I was referring to a single RAIDZ2 vdev of 16 disks in your pool. So
you would lose ~2 disks to parity effectively. The larger the stripe,
potentially the slower the rebuild. If you had multiple vdevs in a pool
that were smaller stripes you would get less performance degradation by
virtue of IO isolation. Of course here you lose pool capacity. With
smaller vdevs, you could also potentially just use RAIDZ and not RAIDZ2
and then you would have the equivalent size pool still with two parity
disks. 1 per vdev.


A few years ago, Sun introduced the X4500 (aka Thumper) which had 48
disks in the chassis.  Of course, the first thing customers did was to
make a single-level 46 or 48 disk raidz set.  The second thing they did
was complain that the resulting performance sucked.  So the "solution"
was to try and put some sort of practical limit into the docs to help
people not hurt themselves.

After much research (down at the pub? :-) the recommendation you see in
the man page was the consensus.  It has absolutely nothing to do with
correctness of design or implementation.  It has everything to do with
setting expectations of "goodness."
Sure, I understand this. I was a beta tester for the J4500 because I
prefer SPARC systems mostly for Solaris. Certainly for these large disk
systems the preferred layout of around 5-6 drives per vdev is what I use
on my assortment of *4500 series devices. My production J4500's with
48 x 1 TB drives yield around 31 TB usable. A T5520 attached via 10 Gig
will pretty much saturate the 3Gb/s SAS HBA connecting it to the
J4500. ;)

Since this is a home NAS for Michael serving large contiguous files with
fairly low random access requirements, I would imagine that these rules
of thumb can be relaxed a little. As you state, they are a rule of thumb
for generic loads. This list does appear to be attracting people wanting
to use ZFS at home, where capacity tends to be a bigger requirement than
performance.

As I always advise people: test with *your* workload, as *your*
requirements may be different to the next man's. If you favor capacity
over performance then a larger vdev of a dozen or so disks will work
'OK' in my experience. (I do routinely get referred to Sun customers in
NZ as a site that actually uses ZFS in production and doesn't just play
with it.)

I have tested the aforementioned Thumpers with just this sort of config
myself with varying results on varying workloads. Video servers, Sun
Email etc etc... Long time ago now.

I also have hardware backed RAID 6's consisting of 16 drives in 6000
series storage on Crystal firmware which work just fine in the hardware
RAID world (where I want capacity over speed). This is real world
production class stuff. Works just fine. I have ZFS overlaid on top of
this as well.

But it is good that we are emphasizing the trade offs that any config
has. Everyone can learn from these sorts of discussions. ;)




One thing you haven't mentioned is the drive type and size that you are
planning to use as this greatly influences what people here would
recommend. RAIDZ2 is built for big, slow SATA disks as reconstruction
times in large RAIDZ's and RAIDZ2's increase the risk of vdev failure
significantly due to the time taken to resilver to a replacement drive.
Hot spares are your friend!


The concern with large drives is unrecoverable reads during resilvering.
One contributor to this is superparamagnetic decay, where the bits are
lost over time as the medium tries to revert to a more steady state.
To some extent, periodic scrubs will help repair these while the disks
are otherwise still good. At least one study found that this can occur
even when scrubs are done, so there is an open research opportunity
to determine the risk and recommend scrubbing intervals.  To a lesser
extent, hot spares can help reduce the hours it may take to physically
repair the failed drive.

+1



I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus 

Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson



Michael Shadle wrote:

On Mon, Apr 27, 2009 at 5:32 PM, Scott Lawson
 wrote:

  

One thing you haven't mentioned is the drive type and size that you are
planning to use as this greatly influences what people here would
recommend. RAIDZ2 is built for big, slow SATA disks as reconstruction
times in large RAIDZ's and RAIDZ2's increase the risk of vdev failure
significantly due to the time taken to resilver to a replacement drive.
Hot spares are your friend!



Well these are Seagate 1.5TB SATA disks. So.. big slow disks ;)

  

Then RAIDZ2 is your friend! The resilver on a large RAIDZ2 stripe of
these would take a significant amount of time. The probability of
another drive failing during this rebuild time is quite high. I have in
my time seen numerous double disk failures in hardware backed RAID5's
resulting in complete volume failure.

You did also state that this is a system to be used for backups? So
availability is five 9's?

Are you planning on using Open Solaris or mainstream Solaris 10?
Mainstream Solaris 10 is more conservative and is capable of being
placed under a support agreement if need be.



Nope. Home storage (DVDs, music, etc) - I'd be fine with mainstream
Solaris, the only reason I went with SXCE was for the in-kernel CIFS,
which I wound up not using anyway due to some weird bug.
  
I have a v240 at home with a 12 bay D1000 chassis with 11 x 300GB SCSI
drives in a RAIDZ2 plus 1 hot spare. Makes a great NAS for me. Mostly
for photos and music so the capacity is fine. Speed is very very quick
as these are 10K drives. I have a printing business on the side where we
store customer images on this and have gigabit to all the Macs that we
use for Photoshop. The assurance that RAIDZ2 gives me allows me to sleep
comfortably. (coupled with daily snapshots ;))

I use S10 10/08 with Samba for my network clients. Runs like a charm.



Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson


Michael Shadle wrote:

On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson
 wrote:

  

If possible though you would be best to let the 3ware controller expose
the 16 disks as a JBOD to ZFS and create a RAIDZ2 within Solaris as you
will then gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, just
that it is not the optimal size. Longer rebuild times for larger vdev's
in a zpool (although this is proportional to how full the pool is.). Two
parity disks gives you greater cover in the event of a drive failing in
a large vdev stripe.



Hmm, this is a bit disappointing to me. I would have dedicated only 2
disks out of 16 then to a single large raidz2 instead of two 8 disk
raidz2's (meaning 4 disks went to parity)

  
No I was referring to a single RAIDZ2 vdev of 16 disks in your pool. So
you would lose ~2 disks to parity effectively. The larger the stripe,
potentially the slower the rebuild. If you had multiple vdevs in a pool
that were smaller stripes you would get less performance degradation by
virtue of IO isolation. Of course here you lose pool capacity. With
smaller vdevs, you could also potentially just use RAIDZ and not RAIDZ2
and then you would have the equivalent size pool still with two parity
disks. 1 per vdev.
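
The capacity side of the three layouts being compared here comes down to
simple arithmetic, sketched below. It counts only whole data disks and
ignores metadata overhead and hot spares; the function name data_disks
is made up for the example.

```shell
#!/bin/sh
# Data (non-parity) disks for VDEVS equal vdevs of WIDTH disks each,
# with PARITY parity disks per vdev (1 = RAIDZ, 2 = RAIDZ2).
data_disks() {
    vdevs=$1; width=$2; parity=$3
    echo $(( vdevs * (width - parity) ))
}

data_disks 1 16 2   # one 16-disk RAIDZ2  -> 14
data_disks 2 8 2    # two 8-disk RAIDZ2   -> 12
data_disks 2 8 1    # two 8-disk RAIDZ    -> 14
```

The single 16-disk RAIDZ2 and the pair of 8-disk RAIDZ vdevs both leave
14 disks of data capacity, but the RAIDZ pair tolerates only one failure
per vdev whereas the RAIDZ2 tolerates any two failures.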

One thing you haven't mentioned is the drive type and size that you are
planning to use as this greatly influences what people here would
recommend. RAIDZ2 is built for big, slow SATA disks as reconstruction
times in large RAIDZ's and RAIDZ2's increase the risk of vdev failure
significantly due to the time taken to resilver to a replacement drive.
Hot spares are your friend!

I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus nervous.
  
You did also state that this is a system to be used for backups? So
availability is five 9's?

Are you planning on using Open Solaris or mainstream Solaris 10?
Mainstream Solaris 10 is more conservative and is capable of being
placed under a support agreement if need be.



Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson

Leon,

RAIDZ2 is ~equivalent to RAID6. ~2 disks of parity data. Allowing a
double drive failure and still having the pool available.

If possible though you would be best to let the 3ware controller expose
the 16 disks as a JBOD to ZFS and create a RAIDZ2 within Solaris as you
will then gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, just
that it is not the optimal size. Longer rebuild times for larger vdev's
in a zpool (although this is proportional to how full the pool is.). Two
parity disks gives you greater cover in the event of a drive failing in
a large vdev stripe.

/Scott

Leon Meßner wrote:

Hi,
i'm new to the list so please bear with me. This isn't an OpenSolaris
related problem but i hope it's still the right list to post to.

I'm on the way to move a backup server to using zfs based storage, but i
don't want to spend too many drives on parity (the 16 drives are attached
to a 3ware raid controller so i could also just use raid6 there).

I want to be able to sustain two parallel drive failures so i need
raidz2. The man page of zpool says the recommended vdev size is
somewhere between 3-9 drives (for raidz). Is this just for getting the
best performance or are there stability issues ?

There won't be anything like heavy multi-user IO on this machine so
couldn't i just put all 16 drive in one raidz2 and have all the benefits
of zfs without sacrificing 2 extra drives to parity (compared to raid6)?

Thanks in Advance,
Leon
  





Re: [zfs-discuss] zfs as a cache server

2009-04-09 Thread Scott Lawson

Hi Francois,

I use ZFS with Squid proxies here at MIT.  (MIT New Zealand that is ;))

My basic set up is like so.

- 2 x Sun SPARC v240's, dual CPU's with 2 x 36 GB boot disks and 2 x 73
  GB cache disks. Each machine has 4GB RAM.
- Each has a copy of Squid, SquidGuard and an Apache server.
- The Apache server serves .pac files for client machines and each .pac
  file binds you to that proxy.
- Clients request a .pac from round robin DNS "proxy.manukau.ac.nz"
  which then gives you the real system name of one of these two proxies.

Boot disks are mirrored using DiskSuite and cache and log file systems
are ZFS. My cache pool is just a mirrored pool which is then split into
three file systems. Cache volume is restricted to 30 GB in the Squid
config. Max cache object size is 2MB. Internet bandwidth available to
these machines is ~15Mbit/s.

[r...@x /]#> zpool status
 pool: proxpool
state: ONLINE
scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        proxpool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

[r...@x /]#> zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
proxpool               39.5G  27.4G    27K  /proxpool
proxpool/apache-logs   2.40G  27.4G  2.40G  /proxpool/apache-logs
proxpool/proxy-cache2  29.5G  27.4G  29.5G  /proxpool/proxy-cache2
proxpool/proxy-logs    7.54G  27.4G  7.54G  /proxpool/proxy-logs
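
A layout like the one shown in the output above could be created along
these lines. This is a sketch reconstructed from the zpool status and
zfs list output, not the commands actually used on these machines; the
helper name plan_proxpool is made up, and it only prints the commands.

```shell
#!/bin/sh
# Prints the commands for a mirrored pool split into three file systems,
# using the pool, device and file system names from the output above.
plan_proxpool() {
    echo "zpool create proxpool mirror c1t2d0 c1t3d0"
    for fs in apache-logs proxy-cache2 proxy-logs; do
        echo "zfs create proxpool/$fs"
    done
}

plan_proxpool
```

With default settings each file system mounts under /proxpool, which
matches the MOUNTPOINT column in the listing.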


This config works very well for our site and has done for several years
using ZFS and quite a few more with UFS before that. These two machines
support ~4500 desktops give or take a few. ;)

A mirror or stripe of mirrors will give you best read performance. Also
chuck in as much RAM as you can for ARC caching.

Hope this real world case is of use to you. Feel free to ask any more
questions.


Cheers,

Scott.

Francois wrote:

Hello list,

What would be the best zpool configuration for a cache/proxy server 
(probably based on squid) ?


In other words with which zpool configuration I could expect best 
reading performance ? (there'll be some writes too but much less).



Thanks.

--
Francois



Re: [zfs-discuss] Can this be done?

2009-04-07 Thread Scott Lawson



Michael Shadle wrote:

On Tue, Apr 7, 2009 at 5:22 PM, Bob Friesenhahn
 wrote:

  

No.  The two vdevs will be load shared rather than creating a mirror. This
should double your multi-user performance.



Cool - now a followup -

When I attach this new raidz2, will ZFS auto "rebalance" data between
the two, or will it keep the other one empty and do some sort of load
balancing between the two for future writes only?
  
Future writes only as far as I am aware. You will however potentially
get increased IO. (Total increase will depend on controller layouts etc
etc.)

Is there a way (perhaps a scrub? or something?) to get the data spread
around to both?
  
No. You could backup and restore though. (Or if you have a small number
of big files you could, I guess, copy them around inside the pool to get
them "rebalanced".)




Re: [zfs-discuss] Can this be done?

2009-04-07 Thread Scott Lawson



Michael Shadle wrote:

On Wed, Apr 1, 2009 at 3:19 AM, Michael Shadle  wrote:
  

I'm going to try to move one of my disks off my rpool tomorrow (since
it's a mirror) to a different controller.

According to what I've heard before, ZFS should automagically
recognize this new location and have no problem, right?



I successfully have realized how nice ZFS is with locating the proper
location of the disk across different controllers/ports. Besides for
rpool - ZFS boot. Moving those creates a huge PITA.


Now quick question - if I have a raidz2 named 'tank' already I can
expand the pool by doing:

zpool attach tank raidz2 device1 device2 device3 ... device7

It will make 'tank' larger and each group of disks (vdev? or zdev?)
  
You cannot expand a RAIDZ or RAIDZ2 at all.  You must back up the data
and destroy the pool if you wish to alter the number of disks in a
single RAIDZ or RAIDZ2 stripe.

You may however add an additional RAIDZ or RAIDZ2 to an existing
storage pool.
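
For the case in the question, growing a pool by a second RAIDZ2 group is
done with zpool add rather than zpool attach (which, in the ZFS releases
discussed here, attaches a device to a mirror). A hedged sketch: the
helper name plan_add_raidz2 is invented, the device names are the
placeholders from the question, and the function prints the command
instead of running it. The -n flag asks zpool add to display the
configuration that would result without changing the pool.

```shell
#!/bin/sh
# Hypothetical helper: print a dry-run 'zpool add' for a new RAIDZ2 vdev.
# Usage: plan_add_raidz2 POOL DEVICE...
plan_add_raidz2() {
    pool=$1; shift
    echo "zpool add -n $pool raidz2 $*"
}

plan_add_raidz2 tank device1 device2 device3 device4 device5 device6 device7
```

Once the -n output shows the new raidz2 group alongside the existing
one, the same command without -n performs the change.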

Your pool would look something like below if you add additional RAIDZ's.
This is an output from a J4500 with 48 x 1TB drives with multiple
RAIDZ's in a single pool yielding ~30TB or so.


  pool: nbupool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        nbupool      ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t2d0   ONLINE       0     0     0
            c2t3d0   ONLINE       0     0     0
            c2t4d0   ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
            c2t6d0   ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t7d0   ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c2t14d0  ONLINE       0     0     0
            c2t15d0  ONLINE       0     0     0
            c2t16d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t17d0  ONLINE       0     0     0
            c2t18d0  ONLINE       0     0     0
            c2t19d0  ONLINE       0     0     0
            c2t20d0  ONLINE       0     0     0
            c2t21d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t22d0  ONLINE       0     0     0
            c2t23d0  ONLINE       0     0     0
            c2t24d0  ONLINE       0     0     0
            c2t25d0  ONLINE       0     0     0
            c2t26d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t27d0  ONLINE       0     0     0
            c2t28d0  ONLINE       0     0     0
            c2t29d0  ONLINE       0     0     0
            c2t30d0  ONLINE       0     0     0
            c2t31d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t32d0  ONLINE       0     0     0
            c2t33d0  ONLINE       0     0     0
            c2t34d0  ONLINE       0     0     0
            c2t35d0  ONLINE       0     0     0
            c2t36d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t37d0  ONLINE       0     0     0
            c2t38d0  ONLINE       0     0     0
            c2t39d0  ONLINE       0     0     0
            c2t40d0  ONLINE       0     0     0
            c2t41d0  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c2t42d0  ONLINE       0     0     0
            c2t43d0  ONLINE       0     0     0
            c2t44d0  ONLINE       0     0     0
            c2t45d0  ONLINE       0     0     0
            c2t46d0  ONLINE       0     0     0
        spares
          c2t47d0    AVAIL
          c2t48d0    AVAIL
          c2t49d0    AVAIL
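
A rough tally of the layout above, as a minimal Python sketch. It assumes
one disk's worth of parity per raidz1 vdev and drives of exactly 10^12
bytes, and it ignores metadata and reservation overhead, so treat the
result as an upper bound rather than what 'zfs list' would report.

```python
# Rough usable-capacity estimate for the nbupool layout shown above.
# Assumptions (not from the original post): 1 TB = 10**12 bytes per drive,
# one disk per raidz1 vdev lost to parity, no metadata overhead counted.

def raidz_usable_bytes(vdevs, disks_per_vdev, parity, disk_bytes):
    """Return approximate usable bytes for a pool of identical raidz vdevs."""
    data_disks = disks_per_vdev - parity
    return vdevs * data_disks * disk_bytes

# 9 raidz1 vdevs of 5 disks each; the 3 hot spares add no capacity.
usable = raidz_usable_bytes(vdevs=9, disks_per_vdev=5, parity=1, disk_bytes=10**12)
print(f"{usable / 10**12:.0f} TB decimal, {usable / 2**40:.1f} TiB binary")
```

That works out to 36 TB in decimal units, or about 32.7 TiB, which is
consistent with the ~30TB quoted once filesystem overhead comes off.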


will be dual parity. It won't create a mirror, will it?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  




Re: [zfs-discuss] Can this be done?

2009-03-31 Thread Scott Lawson



Michael Shadle wrote:

On Mon, Mar 30, 2009 at 4:13 PM, Michael Shadle  wrote:


  
Sounds like a reasonable idea, no?


Follow up question: can I add a single disk to the existing raidz2
later on (if somehow I found more space in my chassis) so instead of a
7 disk raidz2  (5+2) it becomes a 6+2 ?
  
No. There is no way to expand a RAIDZ or RAIDZ2 at this point. It is a
feature that is often discussed and that people would like, but Sun has
seen it as more of a feature for home users than for enterprise users.
Enterprise users are expected to buy four or more disks, create another
RAIDZ2 vdev and add it to the pool to increase space. You would of
course have this option.
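
For anyone scripting this, here is a minimal Python sketch of the
'zpool add' invocation for growing a pool with a second RAIDZ2 vdev.
The pool name and the c3t*d0 device names are hypothetical, and the
function only builds the argument list; it does not run anything.

```python
# Build (but do not run) the command that adds a new raidz2 vdev to a pool.
# Pool and device names here are made up for illustration.

def zpool_add_raidz2(pool, disks):
    """Return the argv list for 'zpool add <pool> raidz2 <disks...>'."""
    disks = list(disks)
    if len(disks) < 4:
        # Mirrors the "4 or more disks" advice in the text above.
        raise ValueError("expected at least four disks for the new raidz2 vdev")
    return ["zpool", "add", pool, "raidz2"] + disks

cmd = zpool_add_raidz2("tank", ["c3t0d0", "c3t1d0", "c3t2d0", "c3t3d0"])
print(" ".join(cmd))
```

It is worth running the same command with 'zpool add -n' first, which
shows the resulting layout without changing the pool.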


However, by the time you fill it there might be a solution. Adam
Leventhal proposed on his blog a way that this could be implemented, so
I suspect that at some point in the next few years somebody will
implement it and you will possibly have the option to do so then (after
an OS and ZFS version upgrade).


http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Thanks...


Re: [zfs-discuss] Can this be done?

2009-03-28 Thread Scott Lawson



Bob Friesenhahn wrote:

On Sat, 28 Mar 2009, Michael Shadle wrote:

Well this is for a home storage array for my dvds and such. If I have 
to turn it off to swap a failed disk it's fine. It does not need to 
be highly available and I do not need extreme performance like a 
database for example. 45mb/sec would even be acceptable.


I can see that 14 disks costs a lot for a home storage array but to 
you the data on your home storage array may be just as important as 
data on some businesses enterprise storage array.  In fact, it may be 
even more critical since it seems unlikely that you will have an 
effective backup system in place like large businesses do.


The main problem with raidz1 is that if a disk fails and you replace 
it, that if a second disk substantially fails during resilvering 
(which needs to successfully read all data on remaining disks) then 
your ZFS pool (or at least part of the files) may be toast.  The more 
data which must be read during resilvering, the higher the probability 
that there will be a failure.  If 12TB of data needs to be read to 
resilver a 1TB disk, then that is a lot of successful reading which 
needs to go on.
This is a very good point for anyone following this and wondering why
RAIDZ2 is a good idea. Over the years I have seen several large hardware
RAID 5 arrays go belly up when a second drive failed during a rebuild,
with the end result that the entire RAID set was rendered useless. If
you can afford it then you should use it. RAID6 or RAIDZ2 was made for
big SATA drives. If you do use it, though, make sure that you have
reasonable CPU, as it requires a bit more grunt to run than RAIDZ.

The bigger the disks and the bigger the stripe, the more likely you are
to encounter an issue during a rebuild of a failed drive. Plain and
simple.
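
To put a rough number on that, here is a back-of-the-envelope sketch in
Python. The unrecoverable read error (URE) rate of 1 per 10^14 bits is
an assumption (a figure commonly quoted for SATA drives, not something
stated in this thread), and errors are treated as independent, which
real drives do not guarantee.

```python
import math

# Chance of hitting at least one unrecoverable read error (URE) while
# reading a given amount of data during a resilver.
# Assumptions (not from the thread): a URE rate of 1 per 1e14 bits and
# independent errors; real drives and real failures are messier.

def p_read_error(bytes_read, ure_per_bit=1e-14):
    """Return P(at least one URE) when reading bytes_read bytes."""
    bits = bytes_read * 8
    # 1 - (1 - p)**bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

for tb in (1, 4, 12):
    print(f"{tb:>2} TB read: {p_read_error(tb * 1e12):.0%} chance of a read error")
```

Under those assumptions, reading 12TB carries roughly a 60% chance of
at least one unreadable sector, against under 10% for 1TB. With single
parity and one disk already gone there is nothing left to repair that
sector from, which is the case RAIDZ2 covers.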


In order to lessen risk, you can schedule a periodic 'zpool scrub' via a
cron job so that there is less probability of encountering data which
cannot be read.  This will not save you from entirely failed disk
drives though.
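
A small Python sketch that generates a crontab entry for such a scrub.
The weekly Sunday 02:00 schedule, the pool name 'tank' and the
/usr/sbin/zpool path are all assumptions; adjust them for your system
before adding the line to root's crontab.

```python
# Generate a crontab line that starts a scrub of one pool once a week.
# Schedule, pool name and binary path are illustrative assumptions.

def scrub_crontab_line(pool, weekday=0, hour=2, zpool_path="/usr/sbin/zpool"):
    """Return 'minute hour day-of-month month day-of-week command'."""
    if not 0 <= weekday <= 6 or not 0 <= hour <= 23:
        raise ValueError("weekday must be 0-6 (0 = Sunday) and hour 0-23")
    return f"0 {hour} * * {weekday} {zpool_path} scrub {pool}"

print(scrub_crontab_line("tank"))
```

The scrub runs in the background once started, so the cron job returns
immediately; 'zpool status' shows progress and any errors found.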


As far as Tim's post that NOBODY recommends using better than RAID5, I 
hardly consider companies like IBM and NetApp to be "NOBODY".  Only 
Sun RAID hardware seems to lack RAID6, but Sun offers ZFS's raidz2 so 
it does not matter.
Plenty of Sun hardware comes with RAID6 support out of the box these
days, Bob. Certainly all of the 4140, 4150, 4240 and 4250 two-socket
x86/x64 systems have hardware controllers for this. All of the 6140,
6540 and 6780 disk arrays also have RAID 6 if they have Crystal
firmware, and of course the Open Storage 7000 series machines do as
well, being OpenSolaris and ZFS based.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


--
_____

Scott Lawson
Systems Architect
Information Communication Technology Services

Manukau Institute of Technology
Private Bag 94006
South Auckland Mail Centre
Manukau 2240
Auckland
New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz

__

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

__





Re: [zfs-discuss] ZFS on a SAN

2009-03-12 Thread Scott Lawson

Grant,

Yes this is correct. If host A goes belly up, you can deassign the LUN
from host A and assign it to host B. Because host A became inaccessible
unexpectedly and never gracefully exported its zpool, you will need to
use 'zpool import -f ' to force the pool to be imported on host B.

It is possible to have the LUN visible to both machines at the same
time, just not in use by both machines. This is in general how clusters
work. Be aware that if you do this and access the disk on both systems,
you run a very real risk of corrupting the volume.

I use the first approach here quite regularly in what I call 'poor man's
clustering'. ;) I tend to install all my software and data environments
on SAN-based LUNs, which makes them easy to move: export the zpool,
reassign the LUN, then import it on the new system. This works well as
long as the target system is at the same OS revision or later.
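
The two cases (planned move versus dead host) can be sketched in Python
as below. The pool name 'datapool' and the host labels are
hypothetical, and the function only lists the commands in order; the
LUN reassignment itself happens on the SAN and is not shown.

```python
# List the zpool commands needed to move a pool from hostA to hostB.
# Names are illustrative; reassigning the LUN on the SAN happens between
# the export and the import and is outside this sketch.

def failover_steps(pool, source_alive):
    """Return (host, argv) pairs for a planned or a forced pool move."""
    if source_alive:
        # Planned move: clean export first, then a normal import.
        return [
            ("hostA", ["zpool", "export", pool]),
            ("hostB", ["zpool", "import", pool]),
        ]
    # hostA is gone and never exported the pool, so the import is forced.
    return [("hostB", ["zpool", "import", "-f", pool])]

for host, argv in failover_steps("datapool", source_alive=False):
    print(f"{host}: {' '.join(argv)}")
```

Only use the forced form when you are sure the old host really is down,
for the corruption reason given above.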

/Scott.


Grant Lowe wrote:

Hi Erik,

A couple of questions about what you said in your email.  In synopsis 2, if 
hostA has gone belly up and is no longer accessible, then a step that is 
implied (or maybe I'm just inferring it) is to go to the SAN and reassign the 
LUN from hostA to hostB.  Correct?



- Original Message 
From: Erik Trimble 
To: Grant Lowe 
Cc: zfs-discuss@opensolaris.org
Sent: Wednesday, March 11, 2009 1:42:06 PM
Subject: Re: [zfs-discuss] ZFS on a SAN

I'm not 100% sure what your question here is, but let me give you a
(hopefully) complete answer:

(1) ZFS is NOT a clustered file system, in the sense that it is NOT
possible for two hosts to have the same LUN mounted at the same time,
even if both are hooked to a SAN and can normally see that LUN.

(2) ZFS can do failover, however.  If you have a LUN from a SAN on
hostA, create a ZFS pool in it, and use as normal.  Should you wish to
fail over the LUN to hostB, you need to do a 'zpool export ' on
hostA, then 'zpool import ' on hostB.  If hostA has been lost
completely (hung/died/etc) and you are unable to do an 'export' on it,
you can force the import on hostB via 'zpool import -f '


ZFS requires that you import/export entire POOLS, not just filesystems.
So, given what you seem to want, I'd recommend this:

On the SAN, create (2) LUNs - one for your primary data, and one for
your snapshots/backups.

On hostA, create a zpool on the primary data LUN (call it zpool A), and
another zpool on the backup LUN (zpool B).  Take snapshots on A, then
use 'zfs send' and 'zfs receive' to copy the clone/snapshot over to
zpool B. then 'zpool export B'

On hostB, import the snapshot pool:  'zpool import B'



It might just be as easy to have two independent zpools on each host,
and just do a 'zfs send' on hostA, and 'zfs receive' on hostB to copy
the snapshot/clone over the wire.
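
Both variants come down to a snapshot followed by a send piped into a
receive. A minimal Python sketch that assembles the command lines
follows; the dataset name 'A/oradata', the snapshot name 'monday' and
the use of ssh for the over-the-wire case are assumptions for
illustration.

```python
import shlex

# Assemble the snapshot and send/receive command lines for copying one
# dataset to another pool, locally or to another host over ssh.
# Dataset, snapshot and host names are hypothetical.

def replicate_commands(dataset, snap, target, remote_host=None):
    """Return the shell command lines, in the order they would be run."""
    snapshot = f"{dataset}@{snap}"
    send = f"zfs send {shlex.quote(snapshot)}"
    receive = f"zfs receive {shlex.quote(target)}"
    if remote_host:
        # Over the wire: run the receive side on the other host.
        receive = f"ssh {shlex.quote(remote_host)} {receive}"
    return [f"zfs snapshot {shlex.quote(snapshot)}", f"{send} | {receive}"]

for line in replicate_commands("A/oradata", "monday", "B/oradata"):
    print(line)
```

Passing remote_host="hostB" gives the second form, where nothing needs
to be exported or imported at all.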

-Erik



On Wed, 2009-03-11 at 13:18 -0700, Grant Lowe wrote:
  

Hi All,

I'm new to ZFS, so I hope this isn't too basic a question.  I have a host where
I set up ZFS.  The Oracle DBAs did their thing and I now have a number of ZFS
datasets with their respective clones and snapshots on serverA.  I want to
export some of the clones to serverB.  Do I need to zone serverB to see the
same LUNs as serverA?  Or does it have to have preexisting, empty LUNs to
import the clones?  Please help.  Thanks.



Re: [zfs-discuss] Comstar production-ready?

2009-03-04 Thread Scott Lawson
Right on the money there, Bob. Without knowing more detail about the
client's workload, it would be hard to advise either way. Based purely
on the small amount of info about the client's apps and workload, I
would imagine that NFS on top of ZFS would most likely be the
appropriate solution. You will make more efficient use of your ZFS
storage this way and provide all the niceties like snapshots and
rollbacks from the Solaris-based filer whilst still maintaining Linux
front ends.

Do take heed of the various list posts about ZFS/NFS with certain types
of workloads, however.


Bob Friesenhahn wrote:

On Wed, 4 Mar 2009, Stephen Nelson-Smith wrote:


The interesting alternative is to set up Comstar on SXCE, create
zpools and volumes, and make these available either over a fibre
infrastructure, or iSCSI.  I'm quite excited by this as a solution,
but I'm not sure if it's really production ready.


While this is indeed exciting, the solutions you have proposed vary 
considerably in the type of functionality they offer.  Comstar and 
iSCSI provide access to a storage "volume" similar to SAN storage. 
This volume is then formatted with some alien filesystem which is 
unlikely to support the robust features of ZFS.  Even though the 
storage volume is implemented in robust ZFS, the client still has the 
ability to scramble its own filesystem.  ZFS snapshots can help defend 
against that by allowing to rewind the entire content of the storage 
"volume" to a former point in time.


With the NFS/CIFS server model, only ZFS is used.  There is no 
dependence on a client filesystem.


With the Comstar/iSCSI approach, you are balkanizing 
(http://en.wikipedia.org/wiki/Balkanization) your storage so that each 
client owns its own filesystems without ability to share the data 
unless the client does it.  With the native ZFS server approach, all 
clients share the pool storage and can share files on the server if 
the server allows it.  A drawback of the native ZFS server approach is 
that the server needs to know about the users on the clients in order 
to support access control.


Regardless, there are cases where Comstar/iSCSI makes the most sense, 
or the ZFS fileserver makes the most sense.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Comstar production-ready?

2009-03-04 Thread Scott Lawson



Jacob Ritorto wrote:

Caution:  I built a system like this and spent several weeks trying to
get iscsi share working under Solaris 10 u6 and older.  It would work
fine for the first few hours but then performance would start to
degrade, eventually becoming so poor as to actually cause panics on
the iscsi initiator boxes.  Couldn't find resolution through the
various Solaris knowledge bases.  Closest I got was to find out that
there's a problem only in the *Solaris 10* iscsi target code that
incorrectly frobs some counter when it shouldn't, violating the iscsi
target specifications.  The problem is fixed in Nevada/OpenSolaris.

  
Can't say I have had a problem myself. The initiator is the default
Microsoft Vista initiator. Mine has been running fine for at least 6-9
months. It doesn't get an absolute hammering, though. I use it to
provide extra storage to IT staff desktops that need it, and at the same
time to let staff play with iSCSI. I run a largish Fibre Channel shop
and mostly prefer that anyway.


Long story short, I tried OpenSolaris 2008.11 and the iscsi crashes
ceased and things ran smoothly.  Not the solution I was hoping for,
since this was to eventually be a prod box, but then Sun announced
that I could purchase OpenSolaris support, so I was covered.  On OS,
my two big filers have been running really nicely for months and
months now.
  
If you get Solaris 10 support, Sun will provide fixes for that too, I
imagine. Again, I can't say I have had a problem myself. But as I
mentioned in my previous email, I can't stress enough how important it
is to *test* a solution in your environment with *your* workload on the
hardware/OS *you* choose. The Sun try and buy on hardware is a great way
to do this relatively risk free. If it doesn't work, send it back. They
also have Startup Essentials, which will potentially allow you to get
the "try and buy" hardware on the cheap if you are a new customer.

Don't try to use Solaris 10 as a filer OS unless you can identify and
resolve the iscsi target issue.
  
If iSCSI is truly broken then you could log a support call on this if
you take basic maintenance. This is cheaper than RHEL for the entry
level stuff, by the way...

Since this is a Linux shop you are selling into, OpenSolaris might be
the best way to go, as the GNU userland might be more familiar to them
and they might not understand having to change their shell paths to get
the userland that they want ;)




On Wed, Mar 4, 2009 at 2:47 AM, Scott Lawson  wrote:
  

Stephen Nelson-Smith wrote:


Hi,

I recommended a ZFS-based archive solution to a client needing to have
a network-based archive of 15TB of data in a remote datacentre.  I
based this on an X2200 + J4400, Solaris 10 + rsync.

This was enthusiastically received, to the extent that the client is
now requesting that their live system (15TB data on cheap SAN and
Linux LVM) be replaced with a ZFS-based system.

The catch is that they're not ready to move their production systems
off Linux - so web, db and app layer will all still be on RHEL 5.

  

At some point I am sure you will convince them to see the light! ;)


As I see it, if they want to benefit from ZFS at the storage layer,
the obvious solution would be a NAS system, such as a 7210, or
something built from a JBOD and a head node that does something
similar.  The 7210 is out of budget - and I'm not quite sure how it
presents its storage - is it NFS/CIFS?
  

The 7000 series devices can present NFS, CIFS and iSCSI. They look very
nice if you need a nice GUI, don't know the command line, or need nice
analytics. I had a play with one the other day and am hoping to get my
mitts on one shortly for testing. I would like to give it a real good
crack with VMware for VDI VMs.


 If so, presumably it would be
relatively easy to build something equivalent, but without the
(awesome) interface.

  

For sure the above gear would be fine for that. If you use standard Solaris
10 10/08 you have
NFS and iSCSI ability directly in the OS and also available to be supported
via a support contract
if needed. Best bet would probably be NFS for the Linux machines, but you
would need
to test in *their* environment with *their* workload.


The interesting alternative is to set up Comstar on SXCE, create
zpools and volumes, and make these available either over a fibre
infrastructure, or iSCSI.  I'm quite excited by this as a solution,
but I'm not sure if it's really production ready.

  

If you want fibre channel target then you will need to use OpenSolaris or
SXDE I believe. It's
not available in mainstream Solaris yet. I am personally waiting till then
when it has been
*well* tested in the bleeding edge community. I have too much data to take
big risks with it.


What other options are there, and what advice/experience can you share?

  

I do very similar stuff here with J450

Re: [zfs-discuss] Comstar production-ready?

2009-03-03 Thread Scott Lawson



Stephen Nelson-Smith wrote:

Hi,

I recommended a ZFS-based archive solution to a client needing to have
a network-based archive of 15TB of data in a remote datacentre.  I
based this on an X2200 + J4400, Solaris 10 + rsync.

This was enthusiastically received, to the extent that the client is
now requesting that their live system (15TB data on cheap SAN and
Linux LVM) be replaced with a ZFS-based system.

The catch is that they're not ready to move their production systems
off Linux - so web, db and app layer will all still be on RHEL 5.
  

At some point I am sure you will convince them to see the light! ;)

As I see it, if they want to benefit from ZFS at the storage layer,
the obvious solution would be a NAS system, such as a 7210, or
something built from a JBOD and a head node that does something
similar.  The 7210 is out of budget - and I'm not quite sure how it
presents its storage - is it NFS/CIFS?
The 7000 series devices can present NFS, CIFS and iSCSI. They look very
nice if you need a nice GUI, don't know the command line, or need nice
analytics. I had a play with one the other day and am hoping to get my
mitts on one shortly for testing. I would like to give it a real good
crack with VMware for VDI VMs.

  If so, presumably it would be
relatively easy to build something equivalent, but without the
(awesome) interface.
  
For sure the above gear would be fine for that. If you use standard
Solaris 10 10/08 you have NFS and iSCSI ability directly in the OS, and
it can be covered by a support contract if needed. Best bet would
probably be NFS for the Linux machines, but you would need to test in
*their* environment with *their* workload.

The interesting alternative is to set up Comstar on SXCE, create
zpools and volumes, and make these available either over a fibre
infrastructure, or iSCSI.  I'm quite excited by this as a solution,
but I'm not sure if it's really production ready.
  
If you want a Fibre Channel target then you will need to use OpenSolaris
or SXDE, I believe. It's not available in mainstream Solaris yet. I am
personally waiting until it is, by which time it will have been *well*
tested in the bleeding edge community. I have too much data to take big
risks with it.

What other options are there, and what advice/experience can you share?
  
I do very similar stuff here with J4500s and T2Ks for compliance
archives, NFS and iSCSI targets for Windows machines. Works fine for me.
The biggest system is 48TB on a J4500 for Veritas NetBackup DDT staging
volumes. Very good throughput indeed. Perfect in fact, given the large
files that are created in this environment. One of these J4500s on a
T5220 can keep 4 LTO4 drives in an SL500 saturated with data (4 streams
at ~160 MB/sec).
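
As a quick sanity check on what that means in aggregate, here is a short
Python sketch. It assumes the ~160 MB/sec figure is per stream (the
sentence could also be read as a total), uses decimal units throughout,
and takes the 48TB staging capacity from the paragraph above.

```python
# Aggregate tape throughput and the time to drain a full staging volume.
# Assumption: ~160 MB/sec is the rate of each of the 4 streams, not the
# combined rate. Decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).

STREAMS = 4
MB_PER_SEC_PER_STREAM = 160
STAGING_TB = 48

aggregate_mb_per_sec = STREAMS * MB_PER_SEC_PER_STREAM
seconds_to_drain = (STAGING_TB * 10**12) / (aggregate_mb_per_sec * 10**6)
hours_to_drain = seconds_to_drain / 3600

print(f"aggregate: {aggregate_mb_per_sec} MB/sec")
print(f"time to stream {STAGING_TB} TB: {hours_to_drain:.1f} hours")
```

On that reading the array sustains around 640 MB/sec of sequential
reads, enough to empty the whole 48TB in under a day.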

I think you have pretty much the right idea though. Certainly if you use
Sun kit you will be able to deliver a commercially supported solution
for them.

Thanks,

S.
  




Re: [zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Scott Lawson



Miles Nordin wrote:

"sl" == Scott Lawson  writes:



sl> Electricity *is* the lifeblood of available storage.

I never meant to suggest computing machinery could run without
electricity.  My suggestion is, if your focus is _reliability_ rather
than availability, meaning you don't want to lose the contents of a
pool, you should think about what happens when power goes out, not
just how to make sure power Never goes out Ever Absolutely because we
Paid and our power is PERFECT.
  

My focus is on both. And I understand that nothing is ever perfect, only
that one should strive for it if possible. But when one lives in a place
like NZ, where the power grid is creaky, it becomes a real liability
that needs mitigation, that's all. I am sure there are plenty of ZFS
users in the same boat.

 * pools should not go corrupt when power goes out.
  

Absolutely agree.

 * UPS does not replace need for NVRAM's to have batteries in it
   because there are things between the UPS and the NVRAM like cords
   and power supplies, and the UPS themselves are not reliable enough
   if you have only one, and the controller containing the NVRAM may
   need to be hard-booted because of bugs.
  
Fully understand this too. If, as I do, you use hardware RAID arrays
behind zpool vdevs, then it is very important that this stuff is
maintained: that the batteries backing the RAID array write caches are
good, and that power stays available long enough for the caches to be
flushed to disk before the batteries go flat. This is certainly true of
any file system built upon LUNs from hardware-backed RAID arrays.

 * supplying superexpensive futuristic infallible fancypower to all
   disk shelves does not mean the SYNC CACHE command can be thrown
   out.  maybe the power is still not infallible, or maybe there will
   be SAN outages or blown controllers or shelves with junky software
   in them that hang the whole array when one drive goes bad.
  
This is in general why I use mirrored vdevs with LUNs provided from two
different, geographically isolated arrays; hopefully that makes this
less likely to be a problem. But yes, anything that ignores SYNC CACHE
could pose a serious problem if an array controller hides that from ZFS.

If you really care about availability:

 * reliability crosses into availability if you are planning to have
   fragile pools backed by a single SAN LUN, which may become corrupt
   if they lose power.  Maybe you're planning to destroy the pool and
   restore from backup in that case, and you have some
   carefully-planned offsite backup hierarchy that's always recent
   enough to capture all the data you care about.  But, a restore
   could take days, which turns two minutes of unavailable power into
   one day of unavailable data.  If there were no reliability problem
   causing pool loss during power loss, two minutes unavailable power
   maybe means 10min of unavailable data.
  
Agreed, and this is why I would recommend against a single hardware RAID
SAN LUN for a zpool. At a bare minimum you would want to use copies=2 if
you really care about your data. If you don't care about the data then
no problem, go ahead. I do use zpools for transient data that I don't
care about, where I favor capacity over resiliency. (The main thing I
want for these is L2ARC; think squid proxy server caches.)

 * there are reported problems with systems that take hours to boot
   up, ex. with thousands of filesystems, snapshots, or nfs exports,
   which isn't exactly a reliability problem, but is a problem.  That
   open issue falls into the above outage-magnification category, too.
  
Have seen this myself. Not nice after a system reboot. Can't recall
whether I have seen it recently, though. I seem to recall it was more
around S10 U2 or U3.

I just don't like the idea people are building fancy space-age data
centers and then thinking they can safely run crappy storage software
that won't handle power outages because they're above having to worry
about all that little-guy nonsense.  A big selling point of the last
step-forward in filesystems (metadata logging) was that they'd handle
power failures with better consistency guarantees and faster
reboots---at the time, did metadata logging appeal only to people with
unreliable power?  I hope not.
  
I am just trying to put forward the perspective of a big user here. My
posts have already generated numerous off-list replies from people
wanting more info on the methodology that we like to use. If I can be of
help to people I will.

never mind those of us who find these filesystem features important
because we'd like cheaper or smaller systems, with cords that we
sometimes trip over, that are still useful.  I think having such
protections in the storage software and having them actually fully
working not just imaginary or fragile, is always useful,
Absolutely.  It is all part of the big picture. Albeit prob

Re: [zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Scott Lawson

Robin,

From recollection the business case for investment in power protection 
technology was relatively simple.


We calculated what an hour of downtime was worth and how frequently it
happened. We used to have several incidents per year, if not more, and
they would cause major system outages. When you have over 1000 staff
and multiple remote sites depending on your data center (now data
centers, plural), calculate the cost per hour for staff wages alone and
it becomes quite easy to justify. (I am not even going to factor in loss
of reputation and the media, or our most important customers: our
students.)
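
The shape of that calculation, as a minimal Python sketch. Only the
staff count of 1000 comes from the text above; the hourly wage, outage
length and incident count are placeholder figures, not our real
numbers.

```python
# Wages-only cost of outages per year: staff x wage x hours x incidents.
# Only the staff count comes from the email; the rest are placeholders.

def annual_downtime_cost(staff, hourly_wage, hours_per_incident, incidents_per_year):
    """Return the yearly wage cost of staff idled by outages."""
    return staff * hourly_wage * hours_per_incident * incidents_per_year

cost = annual_downtime_cost(
    staff=1000,
    hourly_wage=30.0,        # hypothetical average wage per hour
    hours_per_incident=4,    # hypothetical outage length
    incidents_per_year=3,    # "several" per year, taken here as three
)
print(f"Estimated wages lost to outages per year: ${cost:,.0f}")
```

Set that figure against the amortized yearly cost of UPS and generator
capacity and the comparison tends to make the case by itself.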


I cannot *stress* enough just how important power and environment
protection is to data. It is the main consideration I take into account
when deploying new sites. (This discussion went off list yesterday and I
was mentioning these same things there.) My analogy here is: what would
be the first thing NASA designs into a new spacecraft? Life support.
Without it you don't even leave the ground. Electricity *is* the
lifeblood of available storage.


Case in point: last year an arsonist set fire to a critical point in our
campus infrastructure and burnt down a building that just happened to
have one of the main communication and power trenches running through
it. It knocked out around 5 buildings on that campus for two weeks.
Immense upheaval and disruption followed. Our brand new DR data center
was on that site. It kept running because of redundant fibre paths to
the SAN switches and core routers, so we could still provide service to
the rest of the campus and maintain active DR to our primary site.
Emergency power via generator was also available until main power could
be rerouted to the data center.

I will take a look at the twinstrata website. (as should others).

Sorry to all if we are diverging too much from zfs-discuss.

/Scott

This stuff does happen. When you have been around for a while you see it.

Robin Harris wrote:
Calculating the availability and economic trade-offs of configurations 
is hard. Rule of thumb seems to rule.


I recently profiled an availability/reliability tool 
on StorageMojo.com that uses Bayesian analysis to estimate datacenter 
availability. You can quickly (minutes, not days) model systems and 
compare availability and recovery times as well as OpEx and CapEx 
implications. 

One hole: AFAIK, ZFS isn't in their product catalog. There's a free 
version of the tool at http://www.twinstrata.com/


Feedback on the tool from this group is invited.

Robin
StorageMojo.com



Date: Tue, 17 Feb 2009 21:36:38 -0800
From: Richard Elling <richard.ell...@gmail.com>
To: Toby Thain <t...@telegraphics.com.au>
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS on SAN?
Message-ID: <499b9e66.2010...@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Toby Thain wrote:
Not at all. You've convinced me. Your servers will never, ever lose 
power unexpectedly.


Methinks living in Auckland has something to do with that :-) 
http://en.wikipedia.org/wiki/1998_Auckland_power_crisis


When services are reliable, then complacency brings risk.
My favorite example recently is the levees in New Orleans.
Katrina didn't top the levees, they were undermined.
-- richard




Re: [zfs-discuss] qmail on zfs

2009-02-18 Thread Scott Lawson



Robert Milkowski wrote:

Hello Asif,

Wednesday, February 18, 2009, 1:28:09 AM, you wrote:

AI> On Tue, Feb 17, 2009 at 5:52 PM, Robert Milkowski  wrote:
  

Hello Asif,

Tuesday, February 17, 2009, 7:43:41 PM, you wrote:

AI> Hi All

AI> Does anyone have any experience on running qmail on solaris 10 with ZFS 
only?

AI> I would appreciate if you share your findings, suggestion and gotchas

It just works.
  


AI> Is there any performance penalty over ufs ?

I did some testing years ago and I honestly do not remember -
nevertheless we migrated to ZFS so either it wasn't slower or it was
faster.


  
I run exim (which is a pretty similar sort of MTA) on 2-3 year old
X4200s with ZFS mirrored on local SAS drives. These perform better than
they did with UFS due to the L2ARC. I have two of these. Each one quite
easily processes around 300-500K inbound messages per day with
SpamAssassin running on it as well. Spec is dual dual-core Opteron with
4 GB RAM and 4 x 73 GB 10K 2.5" SAS disks.
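
For a feel of what that daily volume means as a sustained rate, here is
a small Python sketch. It simply averages over 24 hours; real mail
traffic is bursty, so peak rates will be noticeably higher than this.

```python
# Convert the daily message counts quoted above into an average rate.
# Averaging over a full day is a simplification; peaks will be higher.

SECONDS_PER_DAY = 24 * 60 * 60

def per_second(messages_per_day):
    """Return the average messages per second for a daily total."""
    return messages_per_day / SECONDS_PER_DAY

low, high = per_second(300_000), per_second(500_000)
print(f"average load: {low:.1f} to {high:.1f} messages/sec per server")
```

That is only a handful of deliveries per second on average, which is
why a small mirrored pool copes comfortably.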

As I think Tony mentioned, the best thing to do is to test with your 
specific workload. I don't think you will see a drop
in performance at all. (and you will gain so many other lovely things 
that ZFS brings ;))




Re: [zfs-discuss] ZFS on SAN?

2009-02-18 Thread Scott Lawson

Hi Andras,

No problem writing direct. Answers inline below. (If there are any
typos, it's because it's late and I have had a very long day ;))


andras spitzer wrote:

Scott,

Sorry for writing you directly, but most likely you have missed my
questions regarding your SW design, whenever you have time, would you
reply to that? I really value your comments and appreciate it as it
seems you have great experience with ZFS in a professional
environment, and this is something not so frequent today.

That was my e-mail, response to your e-mail (it's in the thread) :

"Scott,

That is an awesome reference you wrote, I totally understand and agree
with your idea of having everything redundant (dual path, redundant
switches, dual controllers) at the SAN infrastructure, I would have
some question about the sw design you use if you don't mind.

- are you using MPxIO as DMP?
  
Yes, configured via 'stmsboot'. I have used Sun MPxIO for quite a few
years now and have found it works well (it was the SAN Foundation Kit
for many years).

- as I understood from your e-mail all of your ZFS pools are ZFS
mirrored? (you don't have non-redundant ZFS configuration)
  
Certainly the ones that are from SAN based disk. No, there are no
non-redundant ZFS configurations. All storage is doubled up. Expensive,
but we tend to stick to modular storage for this and spread the cost
over many years. Storage budget is at least 50% of the systems group
infrastructure budget.


There are many other ZFS file systems which aren't SAN attached and are
in mirrors, RAIDZs etc. I mentioned the Lokis, aka J4500s, which are in
RAIDZs. Very nice, and they have worked very reliably so far. I would
strongly advocate these units for ZFS if you want a lot of disk
reasonably cheaply that performs well...

- why you decided to use ZFS mirror instead of ZFS raidz or raidz2?
  
Because we already have hardware based RAID 5 from our arrays (Sun 3510,
3511, 6140s). The ZFS file systems are used mostly for mirroring
purposes, but also to take advantage of the other nice things ZFS
brings, like snapshots, cloning, clone promotions etc.
 
- you have RAID 5 protected LUNs from SAN, and you put ZFS mirror on
top of them?
  

Yes. Covered above I think.
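
For anyone wanting to try the same layout, one hardware RAID 5 LUN from
each array mirrored by ZFS might be set up roughly as below. This is
only a minimal sketch and not the actual configuration from this
thread: the pool name and device names are hypothetical placeholders,
and the script just prints the zpool command so it can be checked
against the real device names on the host before anything is run.

```shell
#!/bin/sh
# Sketch: ZFS mirror across two hardware RAID 5 LUNs, one per array.
# All names below are hypothetical placeholders, not from the thread.
POOL="sanpool"
LUN_ARRAY_A="c0tAAAAd0"   # LUN presented by the array at site A
LUN_ARRAY_B="c0tBBBBd0"   # LUN presented by the array at site B

# Print the zpool command instead of running it, so it can be reviewed
# against the device names reported on the real host before use.
mirror_create_cmd() {
    # $1 = pool name, $2 = first LUN, $3 = second LUN
    printf 'zpool create %s mirror %s %s\n' "$1" "$2" "$3"
}

mirror_create_cmd "$POOL" "$LUN_ARRAY_A" "$LUN_ARRAY_B"
```

With one LUN per array in the mirror, ZFS holds two copies of every
block, which is what allows it to repair a bad copy from the good one.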

Could you please share some details about your configuration regarding
SAN redundancy VS ZFS redundancy (I guess you use both here), also
some background why you decided to go with that?
  
Been doing it for many years. Not just with ZFS, but UFS and VxFS as
well. Also quite a large number of NTFS machines. We have two
geographically separate data centers which are a few kilometers apart
with redundant dark fibre links over different routes. All core
switches are in a full mesh with two cores per site, each with a
redundant connection to the two cores at the other site, one via each
route.


We believe strongly that storage is the key to our business. Servers
are but processing to work the data and are far easier to replace. We
tend to standardize on particular models and then buy a bunch of 'em,
and not necessarily maintenance for them.

There are a lot of key things to building a reliable data center. I have
been having a lively discussion on this with Toby and Richard which has
been raising some interesting points. I do firmly believe in getting
things right from the ground up. I start with power and environment.
Storage comes next in my book.

Regards,
sendai "

One point I'm really interested in is that it seems you deploy ZFS with
ZFS mirror, even when you have RAID redundancy at the HW/SAN level,
which means extra costs to you obviously. I'm looking for a fairly
decisive opinion on whether it is safe to use a ZFS configuration
without redundancy when you have RAID redundancy in your high-end SAN,
or whether you would still go with ZFS redundancy (ZFS mirror in your
case, not even raidz or raidz2) because of the extra self-healing
feature and the lowered risk of total pool failure?
  
I think this has also been covered in recent list posts. The important
thing is really to have two copies of blocks if you wish to be able to
self heal. The cost, I guess, comes down to what value you place on the
availability and reliability of your data.

ZFS mirrors are faster for resilvering as well. Much much faster in my
experience. We recently used this during a data center move and rebuild.
Our SAN fabric was extended to 3 sites and we moved blocks of storage
one piece at a time and resynced them at the new location once they were
in place, with 0% disruption to the business.
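
As a rough illustration of that kind of move, one side of a mirror can
be taken offline, relocated, and brought back online so that ZFS
resilvers it. The exact steps used in the move described here are not
given, so treat this as a sketch under assumptions: the pool and device
names are hypothetical, and the commands are printed rather than run.

```shell
#!/bin/sh
# Sketch: relocate one side of a ZFS mirror and let it resilver.
# Pool and device names are hypothetical; commands are printed, not run.
move_mirror_side_plan() {
    # $1 = pool name, $2 = device on the array being moved
    printf 'zpool offline %s %s\n' "$1" "$2"   # stop I/O to that side
    printf '# ...physically move and reconnect the array...\n'
    printf 'zpool online %s %s\n' "$1" "$2"    # bring it back; resilver starts
    printf 'zpool status %s\n' "$1"            # check resilver progress
}

move_mirror_side_plan "sanpool" "c0tBBBBd0"
```

The pool keeps serving data from the remaining side of the mirror while
the other side is offline, but it has no redundancy until the resilver
has completed.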

I do think the Fishworks stuff is going to prove to be a game changer
in the near future for many people, as it will offer many of the
features we want in our storage. Once COMSTAR has been integrated into
this line I might buy some. (I have a large investment in fibre channel
and I don't trust networking people as far as I can kick them when it
comes to understanding the potential problems that can arise from
disconnecting block targets that are coming in over Ether

Re: [zfs-discuss] ZFS on SAN?

2009-02-17 Thread Scott Lawson



David Magda wrote:


On Feb 17, 2009, at 21:35, Scott Lawson wrote:

Everything we have has dual power supplies, fed from dual power
rails, fed from separate switchboards, through separate very large
UPSs, backed by generators, fed by two substations and then cloned
to another data center 3 km away. HA


http://www.geonet.org.nz/earthquake/quakes/recent_quakes.html

Ha. Yeah, that's why we were once known to the British as "The Shaky
Isles". We do have lots of earthquakes around the Pacific Rim. We are in
Auckland however, which is north of all those little stars on the pic,
which is where the edge of the Pacific plate intersects with the
Australian plate. So not too many earthquakes to worry about in Auckland
compared to the rest of NZ. Although one of the data centers I built
recently was on the second floor of a building and had to be earthquake
restrained due to the fact that we were going to be potentially creating
up to one ton point loads on the floor. The rest of NZ gets little and
biggish quakes fairly often, so much so that the Aussies next door on
the west island see fit to warn their citizens about the potential for
earthquakes in NZ if visiting... Our capital city Wellington, on the
other hand, is built on fault lines.. think San Francisco...

Now a volcano in Auckland might be a different story... We have over 50
dormant cones and a whopping big one in the main harbor which is called
"Rangitoto". Translated from the Maori it means "Blood Sky". My UPSs
won't protect from that one...

;)

I am far, far more worried about someone with root access typing
'zpool destroy' than I am about the lights going out in the
data centers I designed that house hundreds and hundreds of servers. ;)


Yeah, this is probably more likely.



--
_______


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 




Re: [zfs-discuss] ZFS on SAN?

2009-02-17 Thread Scott Lawson



Toby Thain wrote:


On 17-Feb-09, at 3:01 PM, Scott Lawson wrote:


Hi All,
...
I have seen other people discussing power availability on other threads
recently. If you
want it, you can have it. You just need the business case for it. I
don't buy the comments
on UPS unreliability.


Hi,

I remarked on it. FWIW, my experience is that commercial data centres 
do not avoid 'unscheduled outages', no matter how many steely-eyed 
assurances they give. It seems rather imprudent to assume that power 
is never going to fail.


No matter how many diesel generators, rooftop tanks, or pebblebed 
reactors you have, somebody is inevitably going to kick out a plug... 
at least in most of the real world.


--Toby

That's why you have two plugs, if not more. I still don't buy your
argument. It comes down to procedural issues on the site when it comes
to people kicking plugs out. Everything we have has dual power supplies,
fed from dual power rails, fed from separate switchboards, through
separate very large UPSs, backed by generators, fed by two substations
and then cloned to another data center 3 km away. HA is all about
design. (I won't even comment about further up the stack than
electricity.)

We have secure data centers with strict practices of work and qualified
staff following best practice for maintenance and risk management
around maintenance.

I am far, far more worried about someone with root access typing 'zpool
destroy' than I am about the lights going out in the data centers I
designed that house hundreds and hundreds of servers. ;) And no, we
don't have unplanned outages. Not in a long time. Not all people that
design data centers know how to design power systems for them.
Sometimes the IT people don't convey their requirements exactly enough
to the electrical engineers. (I am an electrical engineer who got
sidetracked by SunOS around '91 and never went back.)


Anyway, we diverge I think. Maybe we can agree to disagree? Back to
discussions about disk caddies and overpriced hardware.. slightly
closer to the topic at hand... ;)



...



Re: [zfs-discuss] ZFS on SAN?

2009-02-17 Thread Scott Lawson

Hi All,

I have been watching this thread for a while and thought it was time I
chipped in my 2 cents' worth. I have been an aggressive adopter of ZFS
here across all of our Solaris systems and have found the benefits have
far outweighed any small issues that have arisen.

Currently I have many systems that have LUNs provided from SAN based
storage to systems for zpools. All our systems are configured with
mirrored vdevs and the reliability factor has been as good as, if not
greater than, UFS and LVM.

My rules of thumb around systems tend to stem from getting the storage
infrastructure right, as that generally leads to the best availability.
To this end, for every single SAN attached system we have dual paths to
separate switches, and every array has dual controllers dual pathed to
different switches. ZFS may be more or less susceptible to any physical
infrastructure problem, but in my experience it is on a par with UFS
(and I gave up shelling out for VxFS long ago).

The reason for the above configuration is that our storage is evenly
split between two sites, with dark fibre between them across redundant
routes. This forms a ring configuration which is around 5 km around. We
have so much storage that we need to have this in case of a data center
catastrophe. The business recognizes the time to recovery risk would be
so great that if we didn't we would be out of business in the event of
one of our data centres burning or other natural disaster.

I have seen other people discussing power availability on other threads
recently. If you want it, you can have it. You just need the business
case for it. I don't buy the comments on UPS unreliability.

Quite frequently I have rebooted arrays and removed them from mirrored
vdevs and have not had any issues with the LUNs they provided
reattaching and resilvering. Scrubs on the pools have always been
successful. Largest single mirrored pool is around 11 TB, which is from
two 6140 RAID 5s.

We also use Loki boxes for very large storage pools which are routinely
filled. (I was a beta tester for Loki.) I have two J4500s, one with 48 x
250 GB and one with 48 x 1 TB drives. No issues there either. The 48 x
1 TB is used in a Disk -> Disk -> Tape config with an SL500 to back up
our entire site. It is routinely filled to the brim and it performs
admirably attached to a T5220 which is 10 gig attached.

All of the systems I have mentioned vary from Samba servers to
compliance archives to Oracle DB servers, Blackboard content stores,
squid web caches, LDAP directory servers, mail stores, mail spools and
calendar server DBs. The list covers 60 plus systems.

I have 0% Solaris older than Solaris 10. Why would you?

In short, I hope people don't hold back from adoption of ZFS because
they are unsure about it. Judge for yourself as I have done and dip your
toes in at whatever rate you are happy to do so. That's what I did.

/Scott.

I also use it at home with an old D1000 attached to a v120 with 8 x
320 GB SCSI disks in a RAIDZ2 for all our home data and home business
(which is a printing outfit which creates a lot of very big files on our
Macs).


