Re: [zfs-discuss] Raidz2 slow read speed (under 5MB/s)

2011-07-22 Thread Jonathan Chang
Never mind this - I destroyed the RAID volume, checked each hard drive one
by one, and when I put it back together the problem had fixed itself. I'm now
getting 30-60MB/s read and write, which is still slow as heck, but works well
enough for my application.


[zfs-discuss] Raidz2 slow read speed (under 5MB/s)

2011-07-21 Thread Jonathan Chang
Hello all,
I'm building a file server (or just storage that I intend to access over a
Windows workgroup, primarily from Windows machines) using ZFS raidz2 and
OpenIndiana 148. I will be using this to stream Blu-ray movies and other media,
so I will be happy if I get just 20MB/s reads, which seems like a pretty low bar
considering some people are getting 100+. This is my first time with OI, and
with RAID for that matter, so I hope you guys have a little patience for a noob. :)

I figured out how to set up the vdevs and the SMB share after some trial and error,
and got my Windows box to see the share. Transferring a 40GB file to the share
yields 55-80MB/s - not earth-shattering, but satisfactory IMO. The problem is that
when I transfer the same file back to the Windows box, it drops to less than
5MB/s. I then copied a 1GB file to the pool and moved that 1GB file from the raidz2
drives to the root drive (SSD), in an attempt to isolate the problem. That was also
less than 5MB/s. The same file, once again copied from the root drive to the
raidz2, was fast, maybe 70-100MB/s.

As far as I can tell, the problem is either some setting within ZFS or the
HBA controller. Or maybe the timing issue with the WD Green drives, though
even that shouldn't create this much disparity.

I've attached iostat output from when activity is idle, when copying from raidz2 to
root (read), and, for comparison, when copying to raidz2 from root (write).
Please note the intermittent idling on all disks (except one?) when the file is
copied from the raidz2 volume to anywhere else. I have no idea what that's
about, but the drives drop to 0 every couple of seconds, over and over.
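In case it helps anyone reproduce this, the per-disk view I was watching came from
something along these lines (a rough sketch; 'solaris' is my pool name, adjust the
interval to taste):

    # per-vdev throughput, refreshed every 5 seconds while the copy runs
    zpool iostat -v solaris 5

    # per-disk service times and %busy
    iostat -xn 5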

My system is as follows:
10 WD20EARS (bad idea? I only found out after I bought them.) in raidz2 config
32GB SSD for root drive for OS install
Supermicro USAS-L8i HBA card (1068E chipset I believe?)
6GB RAM
500 watt power supply
AMD Athlon II X2 260 CPU

Here's my zpool:
  pool: rpool
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: solaris
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        solaris     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t6d0  ONLINE       0     0     0
            c1t7d0  ONLINE       0     0     0

errors: No known data errors

I'd greatly appreciate it if someone could give me some leads on where the 
problem might be. I've spent the past 2 days on this, and it's very frustrating 
since I would actually be very happy getting even 10MB/s read. 

Regards.

[Attachment: Idle-iostat (binary data)]
[Attachment: Read-iostat (binary data)]
[Attachment: Write-iostat (binary data)]


Re: [zfs-discuss] Raidz2 slow read speed (under 5MB/s)

2011-07-21 Thread Jonathan Chang
Do you mean that OI 148 might have a bug that Solaris 11 Express might not
have? I will download the Solaris 11 Express LiveUSB and give it a shot.


[zfs-discuss] ZFS receive checksum mismatch

2011-06-09 Thread Jonathan Walker
Hey all,

New to ZFS, I made a critical error when migrating data and
reconfiguring zpools to suit my needs - I stored a snapshot stream to
a file using zfs send -R [filesystem]@[snapshot] > [stream_file].
When I attempted to receive the stream onto the newly configured
pool, I ended up with a checksum mismatch and thought I had lost my
data.

After googling the issue and finding nil, I downloaded FreeBSD
9-CURRENT (development), installed, and recompiled the kernel making
one modification to
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c:

Comment out the following lines (1439 - 1440 at the time of writing):

    if (!ZIO_CHECKSUM_EQUAL(drre.drr_checksum, pcksum))
        ra.err = ECKSUM;
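For anyone following along, the rebuild itself was just the stock FreeBSD kernel
procedure, roughly as follows (a sketch assuming the GENERIC kernel config; adjust
for a custom config):

    cd /usr/src
    make buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC
    shutdown -r now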

Once recompiled and booted up on the new kernel, I executed zfs
receive -v [filesystem] < [stream_file]. Once it was received, I scrubbed the
zpool, which corrected a couple of checksum errors, and proceeded to
finish setting up my NAS. Hopefully this might help someone else who is
stupid enough to make the same mistake I did...

Note: changing this section of the ZFS kernel code should not be used
for anything other than special cases when you need to bypass the data
integrity checks for recovery purposes.

-Johnny Walker


[zfs-discuss] ZFS receive checksum mismatch

2011-06-09 Thread Jonathan Walker
 New to ZFS, I made a critical error when migrating data and
 reconfiguring zpools to suit my needs - I stored a snapshot stream to
 a file using zfs send -R [filesystem]@[snapshot] > [stream_file].

Why is this a critical error? I thought you were supposed to be
able to save the output from zfs send to a file (just as with tar or
ufsdump you can save the output to a file or a stream).

Well yes, you can save the stream to a file, but it is intended for
immediate use with zfs receive. Since the stream is not an image but
instead a serialization of objects, normal data recovery methods do not
apply in the event of corruption.
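In other words, the safer pattern is to pipe the stream straight into the
destination pool rather than parking it in a file. A minimal sketch (hypothetical
dataset and host names):

    # replicate directly, so zfs receive validates the stream as it arrives
    zfs send -R tank/data@snap | ssh backuphost zfs receive -d backup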

 When I attempted to receive the stream onto the newly configured
 pool, I ended up with a checksum mismatch and thought I had lost my
 data.

Was the cause of the checksum mismatch just that the stream data
was stored as a file ? That does not seem right to me.

I really can't say for sure what caused the corruption, but I think it
may have been related to a dying power supply. For more information,
check out:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storing_ZFS_Snapshot_Streams_.28zfs_send.2Freceive.29


Re: [zfs-discuss] Using multiple logs on single SSD devices

2010-08-03 Thread Jonathan Loran

On Aug 2, 2010, at 8:18 PM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jonathan Loran
 
 Because you're at pool v15, it does not matter whether the log device fails while
 you're running, or while you're offline and trying to come online, or whatever.
 If the log device fails, unmirrored, and the pool version is less than 19,
 the pool is simply lost.  There are supposedly techniques to recover, so
 it's not necessarily a data-unrecoverable-by-any-means situation, but you
 certainly couldn't recover without a server crash, or at least a shutdown.
 And it would certainly be a nightmare, at best.  The system will not fall
 back to the ZIL in the main pool.  That feature arrived in v19.

Yes, after sending my query yesterday, I found the ZFS best practices guide,
which I hadn't read for a long time, with many updates with respect to SSD
devices (many by you, Ed, no?).  I also found the long thread on this list
about SSD best practices, which somehow I missed on my first pass.  After
reading it, I became much more nervous.  My previous assumption when I added
the log was based upon the IOP rate I saw to the ZIL and the number of IOPs
an Intel X25-E could take, and it looked like the drive should last a few
years, at least.  But of course, that assumes no other failure modes.  Given
the high price of failure, now that I know the system could suddenly go south,
I realized that action needed to be taken ASAP to mirror the log.

 I'm afraid it's too late for that, unless you're willing to destroy &
 recreate your pool.  You cannot remove the existing log device.  You cannot
 shrink it.  You cannot replace it with a smaller one.  The only things you
 can do right now are:
 
 (a) Start mirroring that log device with another device of the same size or
 larger.
 or
 (b) Buy another SSD which is larger than the first.  Create a slice on the
 2nd which is equal to the size of the first.  Mirror the first onto the
 slice of the 2nd.  After resilver, detach the first drive, and replace it
 with another one of the larger drives.  Slice the 3rd drive just like the
 2nd, and mirror the 2nd drive slice onto it.  Now you've got a mirrored &
 sliced device, without any downtime, but you had to buy two of the larger drives
 in order to do it.
 or
 (c) Destroy & recreate your whole pool, but learn from your mistake.  This
 time, slice each SSD, and mirror the slices to form the log device.
 
 BTW, ask me how I know this in such detail?  It's cuz I made the same
 mistake last year.  There was one interesting possibility we considered, but
 didn't actually implement:
 
 We are running a stripe of mirrors.  We considered the possibility of
 breaking the mirrors, creating a new pool out of the other half using the
 SSD properly sliced.  Using zfs send to replicate all the snapshots over
 to the new pool, up to a very recent time. 
 
 Then, we'd be able to make a very short service window.  Shutdown briefly,
 send that one final snapshot to the new pool, destroy the old pool, rename
 the new pool to take the old name, and bring the system back up again.
 Instead of scheduling a long service window.  As soon as the system is up
 again, start mirroring and resilvering (er ... initial silvering), and of
 course, slice the SSD before attaching the mirror.
 
 Naturally there is some risk, running un-mirrored long enough to send the
 snaps... and so forth.
 
 Anyway, just an option to consider.
 

Destroying this pool is very much off the table.  It holds home directories for 
our whole lab, about 375 of them.  If I take the system offline, then no one 
works until it's back up.  You could say this machine is mission critical.  The 
host has been very reliable.  Everyone is now spoiled by how it never goes 
down, and I'm very proud of that fact.  The only way I could recreate the pool 
would be through some clever means like you give, or I thought perhaps using 
AVS to replicate one side of the mirror, then everything could be done through 
a quick reboot.

One other idea I had was using a sparse zvol for the log, but I think
eventually the sparse volume would fill up beyond its physical capacity.  On
top of that, this would mean we would have a log that is a zvol from another
zpool, which I think could cause a boot race condition.

I think the real solution to my immediate problem is this: bite the bullet
and add storage to the existing pool.  It won't be as clean as I'd like, and it
would disturb my nicely balanced mirror stripe with new, large, empty vdevs,
which I fear could impact performance down the road when the original stripe
fills up and all writes go to the new vdevs.  Perhaps by the time that
happens, the feature to rebalance the pool will be available, if that's even
being worked on.  Maybe that's wishful thinking.  At any rate, if I don't have
to add another pool, I can mirror the logs I have: problem solved.
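For the record, mirroring an existing unmirrored log device is just an attach
against that device. A rough sketch with made-up pool and device names (c3t0d0 =
current log SSD, c3t1d0 = the new SSD or slice):

    zpool attach tank c3t0d0 c3t1d0   # the log becomes a two-way mirror
    zpool status tank                 # confirm the log now shows as a mirror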

Finally, I'm told by my SE that ZFS

[zfs-discuss] modified mdb and zdb

2010-07-28 Thread Jonathan Cifuentes

Hi,
I would really appreciate it if any of you could help me get the modified mdb and zdb
(for any version of OpenSolaris) for digital forensics research purposes.
Thank you.

Jonathan Cifuentes


[zfs-discuss] Migrating ZFS/data pool to new pool on the same system

2010-05-04 Thread Jonathan
Can anyone confirm my action plan is the proper way to do this?  The reason I'm
doing this is that I want to create a 2x raidz2 pool instead of expanding my current
2x raidz1 pool.  So I'll create a pool with one raidz2 vdev, migrate my current
2x raidz1 pool over, destroy that pool, and then add its disks as a second raidz2
vdev to the new pool.

I'm running b130, sharing both with CIFS and iSCSI (not COMSTAR), with multiple
descendant file systems.  Other than a couple of VirtualBox machines that use the
pool for storage (I'll shut them down), nothing on the server should be messing
with the pool.  As I understand it, the old way of doing iSCSI is going away, so
I should plan on COMSTAR.  I'm also thinking I should just unshare the CIFS shares
to prevent any of my computers from writing to the pool.

So, migrating from pool1 to pool2:
0. Turn off auto-snapshots.
1. Create a snapshot - zfs snapshot -r pool1@snap1
2. Send/receive - zfs send -R pool1@snap1 | zfs receive -F -d test2
3. Unshare CIFS and remove the iSCSI targets.  For the iSCSI targets, it seems I
can't re-use them with COMSTAR, and the reservations aren't carried over for
block devices?  I may just destroy them beforehand; nothing important on them.
4. Create new snapshots - zfs snapshot -r pool1@snap2
5. Send an incremental stream - zfs send -Ri snap1 pool1@snap2 | zfs receive -F -d test2
Repeat steps 4 and 5 as necessary.
6. Offline pool1... if I don't plan on destroying it right away.

Other than zfs list, is there anything I should check to make sure I received 
all the data to the new pool?
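One sanity check I'm considering beyond plain zfs list (a sketch, not gospel) is
comparing the snapshot lists and space usage on both sides after the final
incremental:

    zfs list -r -t snapshot -o name,used,referenced pool1
    zfs list -r -t snapshot -o name,used,referenced test2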


[zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno

2010-04-14 Thread Jonathan
I just started replacing drives in this zpool (to increase storage). I pulled 
the first drive, and replaced it with a new drive and all was well. It 
resilvered with 0 errors. This was 5 days ago. Just today I was looking around 
and noticed that my pool was degraded (I see now that this occurred last 
night). Sure enough there are 12 read errors on the new drive.

I'm on snv_111b. I attempted to get smartmontools working, but it doesn't seem
to want to cooperate since these are all SATA drives. fmdump indicates that the
read errors occurred within about 10 minutes of one another.

Is it safe to say this drive is bad, or is there anything else I can do about 
this?

Thanks,
Jon


$ zpool status MyStorage
  pool: MyStorage
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
 scrub: scrub completed after 8h7m with 0 errors on Sun Apr 11 13:07:40 2010
config:

        NAME        STATE     READ WRITE CKSUM
        MyStorage   DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
            c7t1d0  FAULTED     12     0     0  too many errors

errors: No known data errors

$ fmdump
TIME UUID SUNW-MSG-ID
Apr 09 16:08:04.4660 1f07d23f-a4ba-cbbb-8713-d003d9771079 ZFS-8000-D3
Apr 13 22:29:02.8063 e26c7e32-e5dd-cd9c-cd26-d5715049aad8 ZFS-8000-FD

That first log is the original drive being replaced. The second is the read 
errors on the new drive.
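If more detail than that summary is useful, the underlying telemetry can be pulled
with something like this (a sketch):

    fmdump -eV | less      # full error reports behind those two fault events
    iostat -En c7t1d0      # per-device error counters for the new drive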


Re: [zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno

2010-04-14 Thread Jonathan
I just ran 'iostat -En'. This is what was reported for the drive in question
(all other drives showed 0 errors across the board).

All of the drives showed Illegal Request counts, but none showed a Predictive Failure Analysis.
--
c7t1d0   Soft Errors: 0 Hard Errors: 36 Transport Errors: 0 
Vendor: ATA  Product: SAMSUNG HD203WI  Revision: 0002 Serial No:  
Size: 2000.40GB 2000398934016 bytes
Media Error: 36 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 126 Predictive Failure Analysis: 0 
--


Re: [zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno

2010-04-14 Thread Jonathan
Yeah, 
--
$smartctl -d sat,12 -i /dev/rdsk/c5t0d0
smartctl 5.39.1 2010-01-28 r3054 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)
--

I'm thinking something changed between builds 111 and 132 (the one mentioned in the other post).


Re: [zfs-discuss] Replaced drive in zpool, was fine, now degraded - ohno

2010-04-14 Thread Jonathan

 Do worry about media errors. Though this is the most common HDD
 error, it is also the cause of data loss. Fortunately, ZFS detected
 this and repaired it for you.

Right. I assume you do recommend swapping the faulted drive out though?


 Other file systems may not be so gracious.
  -- richard


As we are all too aware I'm sure :)


[zfs-discuss] Couple Questions about replacing a drive in a zpool

2010-03-08 Thread Jonathan
First, a little background: I'm running b130 and have a zpool with two
raidz1 vdevs (4 drives each, all WD RE4-GPs).  They're in a Norco-4220 case
(home server), which just uses SAS backplanes (AOC-USAS-L8i HBA -> SFF-8087
-> backplane -> SATA drives).  A couple of the drives are showing a number
of hard/transport/media errors (weekly scrubs are fine), and I'm guessing
that explains some of the slower throughput I've been seeing lately (the
errors are increasing more rapidly).

I do have a hot spare in the zpool, and I'm thinking of doing advance
replacements (RMA), one drive at a time, to minimize risk/downtime.

Here's the first question: since the current drive is still working, when I choose
to replace it with the hot spare, does the current drive cease to have data written
to it?  E.g., if the hot spare fails, does a resilver need to occur on the current
drive?

Second question: is there a way to power down the current drive while the
system is running?  I mean from the command line, which I would think would be
more graceful than my current plan of pulling the drive.  The reason I ask is that
I'm only semi-confident I know the physical layout of the (logical) drive ordering.

Last question: is there another way to see the hard drive serial numbers?  iostat
doesn't show them.  I'm willing to power down the system and pull the slots I think
the drives are in to retrieve the serials; it would just be nice not to, since
my home network depends on a lot of the services (AD/DHCP/DNS) that run on the
server.  Though if I do have to power it down, it's not the end of the world.

Example Iostat -En output:
c8t6d0   Soft Errors: 0 Hard Errors: 160 Transport Errors: 213
Vendor: ATA  Product: WDC WD2002FYPS-0 Revision: 5G04 Serial No:
Size: 2000.40GB 2000398934016 bytes
Media Error: 50 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
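(For what it's worth, the commands I have in mind for each step are roughly the
following; the pool name, spare, and port are placeholders, so treat this as a
sketch rather than a recipe.)

    # swap the suspect drive out for the hot spare
    zpool replace tank c8t6d0 c8t7d0      # c8t7d0 = the hot spare (placeholder)

    # unconfigure the old drive before physically pulling it
    cfgadm -al                            # find the attachment point for the drive
    cfgadm -c unconfigure sata1/6         # placeholder; it will look different behind a SAS backplane

    # read a serial number without opening the case
    smartctl -d sat,12 -i /dev/rdsk/c8t6d0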


Re: [zfs-discuss] Couple Questions about replacing a drive in a zpool

2010-03-08 Thread Jonathan
 First a little background, I'm running b130, I have a
 zpool with two Raidz1(each 4 drives, all WD RE4-GPs)
 arrays (vdev?).  They're in a Norco-4220 case
 (home server), which just consists of SAS
 backplanes (aoc-usas-l8i -8087-backplane-SATA
 drives).  A couple of the drives are showing a number
 of hard/transport/media errors (weekly scrubs are
 fine) and I'm guessing it can explain some of the
 slower throughput I've been seeing lately (errors are
 increasing more rapidly).
 
 I do have a Hotspare for the zpool and I'm thinking
 of doing an advanced replacement (RMA) one at a time
 to minimize risk/downtime.  
 
 Here's the first question, since the current drive is
 working, when I choose to replace with the Hotspare,
 does the current drive cease to have data written to
 it?  Eg, if the hotspare fails, does a resliver need
 to occur on the current drive?
 
 Second question, is there a way to power-down the
 current drive while the system is running?  I mean,
 by command-line, which I would think would be more
 graceful than my current plan of pulling the drive.
 The reason I ask is I'm only semi-confident I know
 the physical layout of the (logical) drive
  ordering.
 
 Last question, is there another way to see HD
 serial#s?  Iostat doesn't show the serial#s.  I'm
 willing to power down the system to pull the slots I
 think the drives are in to retrieve the serials, it
 would just be nice not to since my home network
 depends on a lot of the services (AD/DHCP/DNS) that
 run on the server. Though, if I do have to power it
 down, its not the end of the world.
 
 Example Iostat -En output:
 c8t6d0   Soft Errors: 0 Hard Errors: 160
 Transport Errors: 213
 Vendor: ATA  Product: WDC WD2002FYPS-0 Revision:
 5G04 Serial No:
 Size: 2000.40GB 2000398934016 bytes
 Media Error: 50 Device Not Ready: 0 No Device: 0
 Recoverable: 0
 Illegal Request: 0 Predictive Failure Analysis: 0

Someone pointed me to a thread suggesting cfgadm -v to get the serial numbers,
but it doesn't work for me.  Sounds like it only works for directly attached SATA?

However, through that thread I found smartmontools (smartctl), and I was able
to use that to determine the serial numbers.  Fantastic!


Re: [zfs-discuss] This is the scrub that never ends...

2009-09-10 Thread Jonathan Edwards


On Sep 9, 2009, at 9:29 PM, Bill Sommerfeld wrote:



On Wed, 2009-09-09 at 21:30 +, Will Murnane wrote:

Some hours later, here I am again:
scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go
Any suggestions?


Let it run for another day.

A pool on a build server I manage takes about 75-100 hours to scrub, but
typically starts reporting 100.00% done, 0h0m to go at about the 50-60
hour point.

I suspect the combination of frequent time-based snapshots and a pretty
active set of users causes the progress estimate to be off..



out of curiosity - do you have a lot of small files in the filesystem?

zdb -s pool might be interesting to observe too

---
.je

(oh, and thanks for the subject line .. now i've had this song stuck  
in my head for a couple days :P)



Re: [zfs-discuss] Books on File Systems and File System Programming

2009-08-15 Thread Jonathan Edwards


On Aug 14, 2009, at 11:14 AM, Peter Schow wrote:

On Thu, Aug 13, 2009 at 05:02:46PM -0600, Louis-Frédéric Feuillette wrote:

I saw this question on another mailing list, and I too would like to
know. And I have a couple questions of my own.

== Paraphrased from other list ==
Does anyone have any recommendations for books on File Systems and/or
File Systems Programming?
== end ==


Going back ten years, but still a good tutorial:

  Practical File System Design with the Be File System
  by Dominic Giampaolo

  http://www.nobius.org/~dbg/practical-file-system-design.pdf


I think he's still at Apple now, working on Spotlight .. his fs-kit is
a good study too:

http://www.nobius.org/~dbg/fs-kit-0.4.tgz

for understanding the vnode/vfs interface - you might want to take a  
look at:

- Solaris Internals (2nd edition) - chapter 14
- Zadok's FiST paper:
http://www.fsl.cs.sunysb.edu/docs/zadok-thesis-proposal/

UFS:
- Solaris Internals (2nd edition) - chapter 15
HFS+:
- Amit Singh's Mac OS X Internals chapter 11 (see http://osxbook.com/)

then opensolaris src of course for:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/
http://opensolaris.org/os/community/zfs/source/
http://opensolaris.org/os/project/samqfs/sourcecode/
http://opensolaris.org/os/project/ext3/



Re: [zfs-discuss] Understanding SAS/SATA Backplanes and Connectivity

2009-07-16 Thread Jonathan Borden
 
  We have a SC846E1 at work; it's the 24-disk, 4U version of the 826E1.
  It's working quite nicely as a SATA JBOD enclosure.  We'll probably be
  buying another in the coming year to have more capacity.
 Good to hear. What HBA(s) are you using against it?
 

I've got one too and it works great. I use the LSI SAS 3442E, which also gives
you an external SAS port. You don't need a fancy HBA with onboard RAID; just
configure it to IT mode.


Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?

2009-07-04 Thread Jonathan Edwards


On Jul 4, 2009, at 11:57 AM, Bob Friesenhahn wrote:

This brings me to the absurd conclusion that the system must be  
rebooted immediately prior to each use.


see Phil's later email .. an export/import of the pool or a remount of
the filesystem should clear the page cache - with mmap'd files you're
essentially keeping them both in the page cache and in the ARC ..
so invalidations in the page cache are going to have effects on
dirty data in the cache



/etc/system tunables are currently:

set zfs:zfs_arc_max = 0x28000
set zfs:zfs_write_limit_override = 0xea60
set zfs:zfs_vdev_max_pending = 5



if you're on x86 - i'd also increase maxphys to 128K .. we still have  
a 56KB default value in there which is still a bad thing (IMO)
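(for reference, that's an /etc/system entry along these lines - a sketch, value in
bytes:)

    * raise the maximum physical I/O size from the 56KB default to 128KB
    set maxphys = 0x20000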


---
.je



Re: [zfs-discuss] cannot mount '/tank/home': directory is not empty

2009-06-10 Thread Jonathan Edwards
i've seen a problem where periodically a 'zfs mount -a' and sometimes  
a 'zpool import pool' can create what appears to be a race condition  
on nested mounts .. that is .. let's say that i have:


FS          mountpoint
pool        /export
pool/fs1    /export/home
pool/fs2    /export/home/bob
pool/fs3    /export/home/bob/stuff

if pool is imported (or a mount -a is done) and somehow pool/fs3  
mounts first - then it will create /export/home and /export/home/bob  
and pool/fs1 and pool/fs2 will fail to mount .. this seems to be  
happening on more recent builds, but not predictably - so i'm still  
trying to track down what's going on
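(a quick way to spot the aftermath of that race - a sketch using the example
layout above:)

    # which datasets actually mounted, and where
    zfs list -r -o name,mountpoint,mounted pool

    # any plain directories left behind under the mountpoints will block the
    # parent filesystems from mounting; they should be empty and removable
    ls -lA /export/home /export/home/bob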


On Jun 10, 2009, at 1:01 PM, Richard Elling wrote:


Something is bothering me about this thread.  It seems to me that
if the system provides an error message such as cannot mount
'/tank/home': directory is not empty, then the first plan of action
should be to look and see what is there, no?
The issue of overlaying mounts has existed for about 30 years, and
invariably one discovers that events which lead to different data in
overlapping directories are the result of some sort of procedural
issue.


Perhaps once again, ZFS is a singing canary?
-- richard



[zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Jonathan Loran


Hi list,

First off:

# cat /etc/release
   Solaris 10 6/06 s10x_u2wos_09a X86
  Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
   Use is subject to license terms.
Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the past  
week.  We have a very large zpool containing over 30TB, composed  
(foolishly) of three concatenated iSCSI SAN devices.  There's no  
redundancy in this pool at the zfs level.  We are actually in the  
process of migrating this to a x4540 + j4500 setup, but since the  
x4540 is part of the existing pool, we need to mirror it, then detach  
it so we can build out the replacement storage.


What happened was some time after I had attached the mirror to the  
x4540, the scsi_vhci/network connection went south, and the server  
panicked.  Since this system has been up, over the past 2.5 years,  
this has never happened before.  When we got the thing glued back  
together, it immediately started resilvering from the beginning, and  
reported about 1.9 million data errors.  The list from zpool status -v  
gave over 883k bad files.  This is a small percentage of the total  
number of files in this volume: over 80 million (1%).


My question is this:  When we clear the pool with zpool clear, what
happens to all of the bad files?  Are they deleted from the pool, or
do the error counters just get reset, leaving the bad files intact?
I'm going to perform a full backup of this guy (not so easy on my
budget), and I would rather only get the good files.


Thanks,

Jon




Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Jonathan Loran


Kinda scary, then.  Better make sure we delete all the bad files before
I back it up.

What's odd is we've checked a few hundred files, and most of them
don't seem to have any corruption.  I'm thinking what's wrong is that the
metadata for these files is corrupted somehow, yet we can read them
just fine.  I wish I could tell which ones are really bad, so we
wouldn't have to recreate them unnecessarily.  They are mirrored in
various places, or can be recreated via reprocessing, but
recreating/restoring that many files is no easy task.
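One crude way we're thinking about to separate the readable files from the truly
bad ones (a sketch; the pool name is a placeholder, it assumes zpool status -v
prints full paths for the files it can name, and the awk pattern may need
adjusting):

    # collect the affected paths and try to read each one; files with
    # unrecoverable blocks should fail with an I/O error
    zpool status -v bigpool | awk '$1 ~ /^\//{print $1}' > /tmp/badfiles
    while read f; do
        dd if="$f" of=/dev/null bs=1024k 2>/dev/null || echo "really bad: $f"
    done < /tmp/badfiles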


Thanks,

Jon

On Jun 1, 2009, at 2:41 PM, Paul Choi wrote:

zpool clear just clears the list of errors (and # of checksum  
errors) from its stats. It does not modify the filesystem in any  
manner. You run zpool clear to make the zpool forget that it ever  
had any issues.


-Paul

Jonathan Loran wrote:


Hi list,

First off:
# cat /etc/release
  Solaris 10 6/06 s10x_u2wos_09a X86
 Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
  Use is subject to license terms.
   Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the  
past week.  We have a very large zpool containing over 30TB,  
composed (foolishly) of three concatenated iSCSI SAN devices.   
There's no redundancy in this pool at the zfs level.  We are  
actually in the process of migrating this to a x4540 + j4500 setup,  
but since the x4540 is part of the existing pool, we need to mirror  
it, then detach it so we can build out the replacement storage.
What happened was some time after I had attached the mirror to the  
x4540, the scsi_vhci/network connection went south, and the server  
panicked.  Since this system has been up, over the past 2.5 years,  
this has never happened before.  When we got the thing glued back  
together, it immediately started resilvering from the beginning,  
and reported about 1.9 million data errors.  The list from zpool  
status -v gave over 883k bad files.  This is a small percentage of  
the total number of files in this volume: over 80 million (1%).
My question is this:  When we clear the pool with zpool clear, what
happens to all of the bad files?  Are they deleted from the pool,
or do the error counters just get reset, leaving the bad files
intact?  I'm going to perform a full backup of this guy (not so easy
on my budget), and I would rather only get the good files.


Thanks,

Jon




Re: [zfs-discuss] Data size grew.. with compression on

2009-04-09 Thread Jonathan
OpenSolaris Forums wrote:
 if you rsync data to zfs over existing files, you need to take
 something more into account:
 
 if you have a snapshot of your files and rsync the same files again,
 you need to use the --inplace rsync option, otherwise completely new
 blocks will be allocated for the new files. That's because rsync will
 write an entirely new file and rename it over the old one.

ZFS will allocate new blocks either way, check here
http://all-unix.blogspot.com/2007/03/zfs-cow-and-relate-features.html
for more information about how Copy-On-Write works.

Jonathan


Re: [zfs-discuss] Data size grew.. with compression on

2009-04-09 Thread Jonathan
Daniel Rock wrote:
 Jonathan schrieb:
 OpenSolaris Forums wrote:
 if you have a snapshot of your files and rsync the same files again,
 you need to use --inplace rsync option , otherwise completely new
 blocks will be allocated for the new files. that`s because rsync will
 write entirely new file and rename it over the old one.

 ZFS will allocate new blocks either way
 
 No it won't. --inplace doesn't rewrite blocks identical on source and
 target but only blocks which have been changed.
 
 I use rsync to synchronize a directory with a few large files (each up
 to 32 GB). Data normally gets appended to one file until it reaches the
 size limit of 32 GB. Before I used --inplace a snapshot needed on
 average ~16 GB. Now with --inplace it is just a few kBytes.

It appears I may have misread the initial post.  I don't really know how
I misread it, but I think I missed the snapshot portion of the message
and got confused.  I understand the interaction between snapshots,
rsync, and --inplace being discussed now.

My apologies,
Jonathan


Re: [zfs-discuss] Can this be done?

2009-03-28 Thread Jonathan
Michael Shadle wrote:
 On Sat, Mar 28, 2009 at 1:37 AM, Peter Tribble
peter.trib...@gmail.com wrote:

 zpool add tank raidz1 disk_1 disk_2 disk_3 ...

 (The syntax is just like creating a pool, only with add instead of
create.)

 so I can add individual disks to the existing tank zpool anytime i want?

Using the command above that Peter gave you would get you a result
similar to this

        NAME        STATE     READ WRITE CKSUM
        storage2    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad16    ONLINE       0     0     0
            ad14    ONLINE       0     0     0
            ad10    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da3     ONLINE       0     0     0

The actual setup is a RAIDZ1 of 1.5TB drives and a RAIDZ1 of 500GB
drives with the data striped across the two RAIDZs.  In your case it
would be 7 drives in each RAIDZ based on what you said before but I
don't have *that* much money for my home file server.
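One tip: a dry run shows the resulting layout before committing (a sketch with
placeholder disk names):

    # -n prints the vdev layout that would result, without changing the pool
    zpool add -n tank raidz1 disk_1 disk_2 disk_3 disk_4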

 so essentially you're tleling me to keep it at raidz1 (not raidz2 as
 many people usually stress when getting up to a certain # of disks,
 like 8 or so most people start bringing it up a lot)

This really depends on how valuable your data is.  Richard Elling has a
lot of great information about MTTDL here
http://blogs.sun.com/relling/tags/mttdl


Re: [zfs-discuss] ZFS and SNDR..., now I'm confused.

2009-03-06 Thread Jonathan Edwards


On Mar 6, 2009, at 8:58 AM, Andrew Gabriel wrote:


Jim Dunham wrote:
ZFS the filesystem is always on disk consistent, and ZFS does  
maintain filesystem consistency through coordination between the  
ZPL (ZFS POSIX Layer) and the ZIL (ZFS Intent Log). Unfortunately  
for SNDR, ZFS caches a lot of an application's filesystem data in
the ZIL, therefore the data is in memory, not written to disk, so  
SNDR does not know this data exists. ZIL flushes to disk can be  
seconds behind the actual application writes completing, and if  
SNDR is running asynchronously, these replicated writes to the SNDR  
secondary can be additional seconds behind the actual application  
writes.


Unlike UFS filesystems and lockfs -f, or lockfs -w, there is no  
'supported' way to get ZFS to empty the ZIL to disk on demand.


I'm wondering if you really meant ZIL here, or ARC?

In either case, creating a snapshot should get both flushed to disk,  
I think?
(If you don't actually need a snapshot, simply destroy it  
immediately afterwards.)


not sure if there's another way to trigger a full flush or lockfs, but
to make sure you've captured all transactions that may not have been
flushed from the ARC, you could just unmount the filesystem or export
the zpool .. with the latter, you then wouldn't have to worry about
the -f on the import
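(andrew's snapshot suggestion above boils down to something like this - a sketch
with placeholder names:)

    # taking a snapshot forces the current transaction group out to disk;
    # destroy it right away if you don't actually want to keep it
    zfs snapshot tank/fs@flush && zfs destroy tank/fs@flush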


---
.je


Re: [zfs-discuss] replace same sized disk fails with too small error

2009-01-22 Thread Jonathan Edwards
not quite .. it's 16KB at the front and 8MB at the back of the disk (16384
sectors) for the Solaris EFI label - so you need to zero out both of these

of course, since these drives are 1TB, I find it's easier to format
to SMI (vtoc) .. with format -e (choose SMI, label, save, validate -
then choose EFI)
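(a sketch of the zeroing, using the whole-disk p0 device and a placeholder sector
count - double check the geometry of your own disk first:)

    # wipe the first 16KB (protective MBR + primary EFI label)
    dd if=/dev/zero of=/dev/rdsk/c1d1p0 bs=512 count=32

    # wipe the last 8MB (backup EFI label); DISKSECS is the disk's total
    # 512-byte sector count as reported by format/prtvtoc (placeholder)
    dd if=/dev/zero of=/dev/rdsk/c1d1p0 bs=512 seek=$((DISKSECS - 16384)) count=16384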

but to Casper's point - you might want to make sure that fdisk is  
using the whole disk .. you should probably reinitialize the fdisk  
sectors either with the fdisk command or run fdisk from format (delete  
the partition, create a new partition using 100% of the disk, blah,  
blah) ..

finally - glancing at the format output - there appears to be a mix of
labels on these disks, as you've got a mix of c#d# entries and c#t#d#
entries, so I might suspect fdisk isn't consistent across the
various disks here .. I also noticed that you dumped the vtoc for c3d0
and c4d0, but you're replacing c2d1 (of unknown size/layout) with c1d1
(never dumped in your emails) .. so while this has been an animated
(slightly trollish) discussion on right-sizing (odd - I've typically
only seen that term as an ONTAPism) with some short-stroking digs ..
it's a little unclear what the c1d1s0 slice looks like here or what
the cylinder count is - I agree it should be the same - but it would
be nice to see from my armchair here

On Jan 22, 2009, at 3:32 AM, Dale Sears wrote:

 Would this work?  (to get rid of an EFI label).

   dd if=/dev/zero of=/dev/dsk/thedisk bs=1024k count=1

 Then use

   format

 format might complain that the disk is not labeled.  You
 can then label the disk.

 Dale



 Antonius wrote:
 can you recommend a walk-through for this process, or a bit more of  
 a description? I'm not quite sure how I'd use that utility to  
 repair the EFI label



Re: [zfs-discuss] Can I create ZPOOL with missing disks?

2009-01-15 Thread Jonathan
Tomas Ögren wrote:
 On 15 January, 2009 - Jim Klimov sent me these 1,3K bytes:
 
  Is it possible to create a (degraded) zpool with placeholders specified
  instead of actual disks (parity or mirrors)? This is possible in Linux
  mdadm (the missing keyword), so I kinda hoped this can be done in Solaris,
  but didn't manage to.

  Usecase scenario:

  I have a single server (or home workstation) with 4 HDD bays, sold with 2
  drives. Initially the system was set up with a ZFS mirror for data slices.
  Now we got 2 more drives and want to replace the mirror with a larger
  RAIDZ2 set (say I don't want a RAID10, which is trivial to make).

  Technically I think that it should be possible to force creation of a
  degraded raidz2 array with two actual drives and two missing drives. Then
  I'd copy data from the old mirror pool to the new degraded raidz2 pool
  (zfs send | zfs recv), destroy the mirror pool and attach its two drives
  to repair the raidz2 pool.

  While obviously not an enterprise approach, this is useful while expanding
  home systems when I don't have a spare tape backup to dump my files on it
  and restore afterwards.

  I think it's an (intended?) limitation in the zpool command itself, since
  the kernel can very well live with degraded pools.

 You can fake it..

[snip command set]

Summary: yes, that actually works and I've done it, but it's very slow!

I essentially did this myself when I migrated a 4x 2-way-mirror pool to
2x 4-disk raidz vdevs (4x 500GB and 4x 1.5TB).  I can say from experience
that it works, but since I used 2 sparse files to simulate 2 disks on a
single physical disk, performance sucked and the migration took a long
time.  IIRC it took over 2 days to transfer 2TB of data.  I used rsync;
at the time I either didn't know about or forgot about zfs send/receive,
which would probably work better.  It took a couple more days to verify
that everything transferred correctly with no bit rot (rsync -c).
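For completeness, the fake-it approach looks roughly like this - my own sketch,
not the snipped commands from Tomas, and the names and sizes are placeholders:

    # sparse files the same size as the real disks stand in for the two
    # drives that are still busy in the old mirror
    mkfile -n 500g /var/tmp/fake1 /var/tmp/fake2

    # build the raidz2 from two real disks plus the two sparse files
    zpool create newpool raidz2 c1t2d0 c1t3d0 /var/tmp/fake1 /var/tmp/fake2

    # immediately offline the fake devices so no real data lands on them;
    # the pool keeps running, degraded but still raidz2
    zpool offline newpool /var/tmp/fake1
    zpool offline newpool /var/tmp/fake2

    # ...copy the data over, destroy the old mirror, then bring the freed
    # drives in with 'zpool replace newpool /var/tmp/fake1 c1t0d0' and so on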

I think Sun avoids making things like this too easy because, from a
business standpoint, it's easier just to spend the money on enough
hardware to do it properly without the chance of data loss and the
extended downtime.  Doesn't invest the time in may be a better
phrase than avoids, though.  I doubt Sun actually goes out of their way
to make things harder for people.

Hope that helps,
Jonathan


Re: [zfs-discuss] Drive Checksum error

2008-12-16 Thread Jonathan
Glaser, David wrote:
 Hi all,

[snipped]

 So, is there a way to see if it is a bad disk, or just zfs being a pain?
 Should I reset the checksum error counter and re-run the scrub?

You could try using smartctl to query the disk directly, although I
don't recall if it works on the x4500.  Normally one error is not a big
deal.  Clearing the errors and re-running the scrub won't hurt
anything, and if you get errors again then it may be worth checking the
disk further - perhaps swapping it with a known-good drive to make sure
the disk is the problem and not the cable.

If you start seeing hundreds of errors be sure to check things like the
cable.  I had a SATA cable come loose on a home ZFS fileserver and scrub
was throwing 100's of errors even though the drive itself was fine, I
don't want to think about what could have happened with UFS...

Hope that helps,
Jonathan



Re: [zfs-discuss] Inexpensive ZFS home server

2008-11-12 Thread Jonathan Loran


David Evans wrote:
 For anyone looking for a cheap home ZFS server...

 Dell is having a sale on their PowerEdge SC440 for $199 (regular $598) 
 through 11/12/2008.

 http://www.dell.com/content/products/productdetails.aspx/pedge_sc440?c=us&cs=04&l=en&s=bsd

 It's got a Dual Core Intel® Pentium® E2180, 2.0GHz, 1MB Cache, 800MHz FSB,
 and you can upgrade the memory (ECC too) to 2GB for $19.

 @$199, I just ordered 2.

 dce
   

I don't think the Pentium E2180 has the lanes to use ECC RAM.  I'm also 
not confident the system board for this machine would make use of ECC 
memory either, which is not good from a ZFS perspective.  How many SATA 
plugs are there on the MB in this guy?

Jon



Re: [zfs-discuss] ZFS on Fit-PC Slim?

2008-11-06 Thread Jonathan Hogg
On 6 Nov 2008, at 04:09, Vincent Fox wrote:

 According to the slides I have seen, a ZFS filesystem even on a  
 single disk can handle massive amounts of sector failure before it  
 becomes unusable.   I seem to recall it said 1/8th of the disk?  So  
 even on a single disk the redundancy in the metadata is valuable.   
 And if I don't have really very much data I can set copies=2 so I  
 have better protection for the data as well.

 My goal is a compact low-powered and low-maintenance widget.   
 Eliminating the chance of fsck is always a good thing now that I  
 have tasted ZFS.

In my personal experience, disks are more likely to fail completely
than suffer from small sector failures. But don't get me wrong:
provided you have a good backup strategy and can afford the downtime
of replacing the disk and restoring, ZFS is still a great filesystem
to use for a single disk.

Don't be put off. Many of the people on this list are running
multi-terabyte enterprise solutions and are unable to think in terms
of non-redundant, small numbers of gigabytes :-)

 I'm going to try and see if Nevada will even install when it  
 arrives, and report back.  Perhaps BSD is another option.  If not I  
 will fall back to Ubuntu.

I have FreeBSD and ZFS working fine(*) on a 1.8GHz VIA C7 (32bit)  
processor. Admittedly this is with 2GB of RAM, but I set aside 1GB for  
ARC and the machine is still showing 750MB free at the moment, so I'm  
sure it could run with 256MB of ARC in under 512MB. 1.8GHz is a fair  
bit faster than the Geode in the Fit-PC, but the C7 scales back to  
900MHz and my machine still runs acceptably at that speed (although I  
wouldn't want to buildworld with it).

I say, give it a go and see what happens. I'm sure I can still dimly  
recall a time when 500MHz/512MB was a kick-ass system...

Jonathan


(*) This machine can sustain 110MB/s off of the 4-disk RAIDZ1 set,  
which is substantially more than I can get over my 100Mb network.


Re: [zfs-discuss] [storage-discuss] ZFS Success Stories

2008-10-20 Thread Jonathan Loran

We have 135 TB of capacity, with about 75 TB in use, on ZFS-based storage.
Our ZFS use started about two years ago and has grown from there. It spans
9 SAN appliances with 5 head nodes, plus 2 more recent servers running ZFS
on JBOD with vdevs made up of raidz2.

So far, the experience has been very positive. We've never lost a bit of
data. We scrub weekly, and I've started sleeping better at night. I have
also read the horror stories, but we aren't seeing them here.

We did have some performance issues, especially involving the SAN storage
on the more heavily used systems, but enabling the cache on the SAN devices
without pushing fsync through to disk basically fixed that. Your ZFS layout
can profoundly affect performance, which is a downside. It's best to test
your setup under an approximately realistic workload to balance capacity
against performance before deploying.

BTW, most of our ZFS deployment is on Solaris 10 {u4,u5}, but two large
servers are on OpenSolaris snv_86. The OpenSolaris servers seem to be
considerably faster and more feature-rich, without any reliability
issues so far.

Jon

gm_sjo wrote:
 Hi all,

 I  have built out an 8TB SAN at home using OpenSolaris + ZFS. I have
 yet to put it into 'production' as a lot of the issues raised on this
 mailing list are putting me off trusting my data onto the platform
 right now.

 Throughout time, I have stored my personal data on NetWare and now NT
 and this solution has been 100% reliable for the last 12 years. Never
 a single problem (nor have I had any issues with NTFS with the tens of
 thousands of spindles i've worked with over the years).

 I appreciate 99% of the time people only comment if they have a
 problem, which is why I think it'd be nice for some people who have
 successfully implemented ZFS, including making various use of the
 features (recovery, replacing disks, etc), could just reply to this
 post with a sentence or paragraph detailing how great it is for them.
 Not necessarily interested in very small implementations of one/two
 disks that haven't changed config since the first day it was
 installed, but more aimed towards setups that are 'organic' and have
 changed/been_administered over time (to show functionality of the
 tools, resilience of the platform, etc.)..

 .. Of course though, I guess a lot of people who may have never had a
 problem wouldn't even be signed up on this list! :-)


 Thanks!
 ___
 storage-discuss mailing list
 [EMAIL PROTECTED]
 http://mail.opensolaris.org/mailman/listinfo/storage-discuss
   



[zfs-discuss] [Fwd: Another ZFS question]

2008-09-27 Thread jonathan sai

Hi

Please see the query below.  Appreciate any help.

Rgds
jonathan



 Original Message 

Would you mind helping me ask your tech guy whether there will be
repercussions when I try to run this command, in view of the situation below:
# zpool add -f zhome raidz c6t6006016056AC1A00C8FB7A6346F8DB11d0 c6t6006016056AC1A00D034FA5246F8DB11d0


---

bash-3.00# zpool status
  pool: zhome
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress, 3.05% done, 13h52m to go
config:

        NAME                                       STATE     READ WRITE CKSUM
        zhome                                      ONLINE       0     0 2.20K
          raidz1                                   ONLINE       0     0 2.20K
            c6t60060160A16D1B003E5B94CAEC46DC11d0  ONLINE       0     0     0
            c6t60060160A16D1B005A7106AEEC46DC11d0  ONLINE       0     0     0
            c6t60060160A16D1B007AC27FB9EC46DC11d0  ONLINE       0     0     0
            c6t60060160A16D1B003870C8A5EC46DC11d0  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

  pool: zhome2
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        zhome2                                     ONLINE       0     0     0
          c6t6006016056AC1A004069225BE146DC11d0    ONLINE       0     0     0
          c6t6006016056AC1A008253EF629235DC11d0    ONLINE       0     0     0

errors: No known data errors

bash-3.00# zpool add -n zhome raidz c6t6006016056AC1A00C8FB7A6346F8DB11d0 c6t6006016056AC1A00D034FA5246F8DB11d0 c6t6006016056AC1A007860F66946F8DB11d0
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses 4-way raidz and new vdev uses 3-way raidz


Thanks and regards,
Andre



Re: [zfs-discuss] ZFS poor performance on Areca 1231ML

2008-09-26 Thread Jonathan Loran


Ross Becker wrote:
 Okay, after doing some testing, it appears that the issue is on the ZFS side. 
  I fiddled around a while with options on the areca card, and never got any 
 better performance results than my first test. So, my best out of the raidz2 
 is 42 mb/s write and 43 mb/s read.  I also tried turning off crc's (not how 
 I'd run production, but for testing), and got no performance gain.

 After fiddling with options, I destroyed my ZFS zpool and tried some
 single-drive bits.   I simply used newfs to create filesystems on single
 drives, mounted them, and ran some single-drive bonnie++ tests.  On a single
 drive, I got 50 mb/sec write & 70 mb/sec read.   I also ran the benchmark
 on two drives simultaneously, and on each of those tests the result dropped by
 about 2mb/sec, so I got a combined 96 mb/sec write & 136 mb/sec read with two
 separate UFS filesystems on two separate disks.

 So next steps? 

 --ross
   

Raidz(2) vdevs can only sustain about the IOPS of a single drive in the vdev.
I'm curious what zpool iostat would say while bonnie++ is running its
writing intelligently test.  The throughput sounds very low to me, but
the clue here is that the single-drive speed is in line with the raidz2 vdev,
so if a single drive is being limited by IOPS, not by raw throughput,
then this I/O result makes sense.  For fun, you should make two vdevs out
of two raidz sets to see if you get twice the throughput, more or less.  I'll
bet the answer is yes.
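A sketch of both suggestions, with placeholder pool and disk names:

    # watch per-vdev activity while the benchmark runs
    zpool iostat -v tank 5

    # two raidz2 vdevs in one pool, so writes stripe across both
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
                     raidz2 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0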

Jon



Re: [zfs-discuss] zfs-auto-snapshot default schedules

2008-09-25 Thread Jonathan Hogg
On 25 Sep 2008, at 14:40, Ross wrote:

 For a default setup, I would have thought a years worth of data  
 would be enough, something like:

Given that this can presumably be configured to suit everyone's  
particular data retention plan, for a default setup, what was  
originally proposed seems obvious and sensible to me.

Going slightly off-topic:

All this auto-snapshot stuff is ace, but what's really missing, in my  
view, is some easy way to actually determine where the version of the  
file you want is. I typically find myself futzing about with diff  
across a dozen mounted snapshots trying to figure out when the last  
good version is.

It would be great if there was some way to know if a snapshot contains  
blocks for a particular file, i.e., that snapshot contains an earlier  
version of the file than the next snapshot / now. If you could do that  
and make ls support it with an additional flag/column, it'd be a real  
time-saver.

The current mechanism is especially hard as the auto-mount dirs can  
only be found at the top of the filesystem so you have to work with  
long path names. An fs trick to make .snapshot dirs of symbolic links  
appear automagically would rock, i.e.,

% cd /foo/bar/baz
% ls -l .snapshot
[...] nightly.0 - /foo/.zfs/snapshot/nightly.0/bar/baz
% diff {,.snapshot/nightly.0/}importantfile

Yes, I know this last command can just be written as:

% diff /foo/{,.zfs/snapshot/nightly.0}/bar/baz/importantfile

but this requires me to a) type more; and b) remember where the top of  
the filesystem is in order to split the path. This is obviously more  
of a pain if the path is 7 items deep, and the split means you can't  
just use $PWD.

[My choice of .snapshot/nightly.0 is a deliberate nod to the  
competition ;-)]
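(In the meantime, a crude loop gets most of the way there - a sketch reusing the
example paths above:)

    # print the snapshots whose copy of the file differs from the current one
    for s in /foo/.zfs/snapshot/*; do
        cmp -s "$s/bar/baz/importantfile" /foo/bar/baz/importantfile || echo "$s"
    done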

Jonathan



Re: [zfs-discuss] zfs-auto-snapshot default schedules

2008-09-25 Thread Jonathan Hogg
On 25 Sep 2008, at 17:14, Darren J Moffat wrote:

 Chris Gerhard has a zfs_versions script that might help: 
 http://blogs.sun.com/chrisg/entry/that_there_is

Ah. Cool. I will have to try this out.

Jonathan



Re: [zfs-discuss] pulling disks was: ZFS hangs/freezes after disk failure,

2008-08-28 Thread Jonathan Loran


Miles Nordin wrote:
 What is a ``failure rate for a time interval''?

   
Failure rate = Failures/unit time
Failure rate for a time interval = (Failures/unit time) * time

For example, if we have a failure rate: 

  Fr = 46% failures/month

Then the expected number of failures in one year is:

  Fe = 46% failures/month  *  12 months = 5.52 failures


Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-15 Thread Jonathan Wheeler
Hi Richard,

Thanks for the detailed reply, and the work behind the scenes filing the CRs.
I've bookmarked both, and will keep a keen eye on them for status changes.

As Miles put it, I'll have to put these dumps into storage for possible future 
use.
I do dearly hope that I'll be able to recover most of that data in the future, 
but for the most important bits (documents/spreadsheets), I'll have to rebuild 
them by way of some rather intensive data entry based on hard copies, now.

Not fun.

I do have a working [zfs send dump!] backup from October, so it's not a total 
loss of my livelihood, but it'll be a life lesson alright.

With CR 6736794, I wonder if some extra notes could be added around the 
checksumming side of the code?
The wording that has been used doesn't quite match my scenario, but I certainly 
agree with the functionality that has been requested there.

I have a 50GB zfs send dump and zfs receive is failing (and rolling back) 
around the 20GB mark.
While the exact cause and nature of my issue remain unknown, I very much 
expect that the vast majority of my zfs send dump is in fact intact, including 
data beyond that 20GB checksum error point. I.e., there is a problem around the 
20GB mark, but I expect that the remaining 30GB contains good data, or at the 
very least, *mostly* good data.

The CR appears to be only requesting that zfs receive stop at the 20GB mark, 
but {new feature} allows the failed restore attempt to be mountable, in an 
unknown/known bad state.

I'd much prefer that zfs receive continue on error too, thus giving it the full 
50GB to process and attempt to repair, rather than only the data up until the 
point where it encountered its first problem.

Without knowing much about the actual on-disk format, metadata and structures I 
can't be sure, but the fs is going to have a much better chance at recovering 
when there is more data available across the entire length of the fs, right? I 
know from my Linux days that the ext2/3 superblocks were distributed across the 
full disk, so the more of the disk that it can attempt to read, the better the 
chance that it'll find more correct metadata to use in an attempt to repair 
the FS.

And of course the second benefit of reading more of the data stream, past an 
error is that more user data will at least have a chance of being recovered. If 
it stops half way, it has _no_ chance of recovering that data, so I favor my 
odds of letting it go on to at least try :)

Or is that an entirely new CR itself?

Jonathan
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-13 Thread Jonathan Wheeler
Hi Mattias  Miles.

To test the version mismatch theory, I set up a snv_91 VM (using VirtualBox) on 
my snv_95 desktop, and tried the zfs receive again. Unfortunately the symptoms 
are exactly the same: around the ~20GB mark, the justhome.zfs stream still 
bombs out with the checksum error.

I didn't realise that the zfs stream format wasn't backward compatible at the 
time that I made the backup, but having performed the above test, this doesn't 
actually appear to be my problem.
I wish it were - that I could have dealt with! :(

So far we've established that in this case:
*Version mismatches aren't causing the problem.
*Receiving across the network isn't the issue (because I have the exact same 
issue restoring the stream directly on my file server).
*All that's left is the initial send, and since zfs guarantees end-to-end data 
integrity, it should have been able to deal with any possible network 
randomness in the middle (zfs on both ends) - or at absolute worst, the zfs 
send command should have failed if it encountered errors. Seems fair, no?

So, is there a major bug here, or at least an oversight in the zfs send part of 
the code?
Does zfs send not do checksumming, or verification after sending? I'm not sure 
how else to interpret this data.

Today to add some more datapoints, I repeated a zfs send to the same nfs server 
from the same desktop, though this time I'm using zfs root with snv_95.
Same hardware, same network, same commands, but this time I didn't have any 
issues with the zfs receive.

?!?!?!?!

Miles:
zfs receive -nv works ok:
# zfs receive -vn rpool/test < /net/supernova/Z/backup/angelous/justhome.zfs 
would receive full stream of faith/[EMAIL PROTECTED] into rpool/[EMAIL 
PROTECTED]

Where it gets interesting is with my recursive zfs dump:
bash-3.2# zfs receive -nvF -d rpool/test < /net/supernova/Z/backup/angelous/pre-zfsroot.zfs
would receive full stream of [EMAIL PROTECTED] into rpool/[EMAIL PROTECTED]
would receive full stream of faith/[EMAIL PROTECTED] into rpool/test/[EMAIL 
PROTECTED]
would receive full stream of faith/[EMAIL PROTECTED] into rpool/test/[EMAIL 
PROTECTED]
would receive full stream of faith/[EMAIL PROTECTED] into rpool/test/[EMAIL 
PROTECTED]

[EMAIL PROTECTED] is actually empty.
faith/[EMAIL PROTECTED] bombs out around 2GB in, but I'm not really too worried 
about that fs.
faith/[EMAIL PROTECTED] is also another fs that I can live without.
faith/[EMAIL PROTECTED] is the one that we're after.

It would seem that my justhome.zfs dump (containing only faith/[EMAIL 
PROTECTED]) isn't going to work, but is there some way to recover the /home fs 
from the pre-zfsroot.zfs dump? Since there seems to be a problem with the first 
fs (faith/virtualmachines), I need to find a way to skip restoring that filesystem, so 
it can focus on the faith/home fs.
How can this be achieved with zfs receive?

Jonathan
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-13 Thread Jonathan Wheeler
Thanks for the information, I'm learning quite a lot from all this.

It seems to me that zfs send *should* be doing some kind of verification, since 
some work has clearly been put into zfs so that filesystems can be dumped into 
files/pipes. It's a great feature to have, and I can't believe that this was 
purely for zfs send | zfs receive scenarios. 

A common example used all over the place is zfs send | ssh $host. In these 
examples is ssh guaranteeing the data delivery somehow? If not, there needs to 
be some serious asterisks in these guides!
Looking at this at a level that I do understand, it's going via TCP, which 
checksums packets. Then again, I was using NFS over TCP, and look where I 
am today. So much for that!

As I google these subjects more and more, I fear that I'm hitting the 
conceptual mental block that many before me have hit too. zfs send is not 
zfsdump, even though it sure looks the same, and it's not clearly stated that 
you may end up in a situation like the one I'm in today if you don't somehow 
test your backups.
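
For anyone else reading along: one cheap way to test a send-stream backup at 
creation time is to capture a checksum of the stream as it's written and 
re-check it before you ever need to restore. A sketch only, with hypothetical 
paths, using Solaris digest(1) and tee:

# zfs send faith/home@backup | tee /net/backup/justhome.zfs | digest -a md5 > /net/backup/justhome.zfs.md5
# digest -a md5 /net/backup/justhome.zfs      (compare against the stored value)

That only proves the file on disk still matches what zfs send emitted; an 
actual test restore into a scratch dataset is the stronger test.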

As you've rightly pointed out, it's done now, and even if I did manage to 
reproduce this again, that won't help my data locked away in these 2 .zfs 
files. So, focusing on the hopeful: is there anything I can do to recover my data 
from these zfs dumps? Anything at all :)

If the problem is just that zfs receive is checksumming the data on the way 
in, can I disable this somehow within zfs? 
Can I globally disable checksumming in the kernel module? mdb something or 
other?

I read this thread where someone did successfully manage to recover data from 
a damaged zfs, which fills me with some hope:
http://www.opensolaris.org/jive/thread.jspa?messageID=220125

It's way over my head, but if anyone can tell me the mdb commands I'm happy to 
try them, even if they do kill my cat. I don't really have anything to lose 
with a copy of the data, and I'll do it all in a VM anyway.

Thanks,
Jonathan
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-12 Thread Jonathan Wheeler
Hi folks,

Perhaps I was a little verbose in my first post, putting a few people off. 
Does anyone else have any ideas on this one?
I can't be the first person to have had a problem with a zfs backup stream. Is 
there nothing that can be done to recover at least some of the stream?

As another helpful chap pointed out, if tar encounters an error in the 
bitstream it just moves on until it finds usable data again. Can zfs not do 
something similar?

I'll take whatever I can get!
Jonathan
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 dead HDD, hung server, unable to boot.

2008-08-11 Thread Jonathan Loran

Jorgen Lundman wrote:
 # /usr/X11/bin/scanpci | /usr/sfw/bin/ggrep -A1 vendor 0x11ab device
 0x6081
 pci bus 0x0001 cardnum 0x01 function 0x00: vendor 0x11ab device 0x6081
   Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller

 But it claims resolved for our version:

 SunOS x4500-02.unix 5.10 Generic_127128-11 i86pc i386 i86pc

 Perhaps I should see if there are any recommended patches for Sol 10 5/08?

   
Jorgen,

For Sol 10, you need to get the IDR patch for the Marvell controllers.  
Given the crummy support you're getting, you may have problems getting 
it.  (Can anyone on this list help Jorgen?)  From recent posts on this 
list, I don't think there's an official patch yet, but if so, get that 
instead.  This should greatly improve matters for you.

Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] corrupt zfs stream? checksum mismatch

2008-08-10 Thread Jonathan Wheeler
Hi Folks,

I'm in the very unsettling position of fearing that I've lost all of my data 
via a zfs send/receive operation, despite ZFS's legendary integrity.

The error that I'm getting on restore is:
receiving full stream of faith/[EMAIL PROTECTED] into Z/faith/[EMAIL PROTECTED]
cannot receive: invalid stream (checksum mismatch)

Background:
I was running snv_91, and decided to upgrade to snv_95 converting to the much 
awaited zfs-root in the process.

On snv_91, I was using zfs for /opt, /export/home, and a couple of other file 
systems under /export
I expected that converting to zfs root would require completely formatting my 
disk, so I needed to backup all of my critical data to a remote host beforehand.

My main file server is running snv_71, using an 8-disk raid-Z, with plenty of 
space available via nfs, so I directed a zfs send across nfs to it. So it was 
zfs -> nfs -> zfs (raid-z)

I don't remember the exact commands used, but I started off with a zfs snapshot 
-r, and then did a zfs send [EMAIL PROTECTED] > /my/nfs/server/backup.zfs
This sent each of the filesystems across and redirected them into the one, 
single backup file.
I wasn't all that confident that this was a wise move, as I didn't know how I 
was going to get just one fs (rather than all) extracted again at a later time 
using zfs receive (I'm open to answers on that one still!).
So, I decided to *also* send just the snapshot of my home directory, which 
contains all of my vital information. A bit of extra peace of mind, eh? Two 
backups are better than one.
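
Reconstructing from memory, it was roughly this (the snapshot name is a 
stand-in, and I can't vouch that -R was the exact flag used for the 
all-filesystems dump):

# zfs snapshot -r faith@backup
# zfs send -R faith@backup > /my/nfs/server/backup.zfs
# zfs send faith/home@backup > /my/nfs/server/justhome.zfs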

I then installed snv_95 from dvd, using zfs-root, destroying my previous zpool 
on the disk in the process.

Here I am now, trying to restore my vital data that I backed up onto the nfs 
server, but it's not working!

# cat justhome.zfs | zfs receive -v Z/faith/home
receiving full stream of faith/[EMAIL PROTECTED] into Z/faith/[EMAIL PROTECTED]
cannot receive: invalid stream (checksum mismatch)

I just don't understand what's going on here.
I started off restoring across nfs to my desktop with the standard options. 
I've tried disabling checksumming on the parent zfs fs, to ensure that when it 
was restoring it wouldn't be using checksumming. I still got the checksum 
mismatch error.

Next I tried restoring the zfs backup internally within the nfs server, making 
it all local disk traffic, on the off chance that it was the network on my new 
build that was somehow broken. No dice, same error, with or without 
checksumming on the parent fs.

I've also tried my other backup file, but that's also having the same problem.
In all I've tried about 8 combinations, and I'm breaking out in a sweat with 
the possibility of having lost all of my data.

The zfs backup that included all file systems bombs out fairly early, on a 
small fs that was only a few GB. 
The zfs backup that included just my home fs gets around 20GB of the way 
through, before failing with the same error (and deleting the partial zfs fs).
I don't recall how big the original home fs was, perhaps 30-40GB, so it's a 
fair way through.

What's causing this error, and if this situation is as dire as I'm fearing 
(please tell me it's not so!), why can't I at least have the 20GB of data that 
it can restore before it bombs out with that checksum error?

Thanks for any help with this!
Jonathan
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] The best motherboard for a home ZFS fileserver

2008-07-31 Thread Jonathan Loran


Miles Nordin wrote:
 s == Steve  [EMAIL PROTECTED] writes:
 

  s http://www.newegg.com/Product/Product.aspx?Item=N82E16813128354

 no ECC:

  http://en.wikipedia.org/wiki/List_of_Intel_chipsets#Core_2_Chipsets
   

This MB will take these:

http://www.intel.com/products/processor/xeon3000/index.htm

Which does support ECC.  Now I'm not sure, but I suspect that this 
Gigabyte MB doesn't have the ECC lanes. 

It's a lot more cash, but the following MB is on the HCL, and I have one 
in service working just swell:

http://www.newegg.com/Product/Product.aspx?Item=N82E16813182105

Has the plus (or minus I suppose) of four PCI-X slots to plug in the 
AOC-SAT2-MV8 cards.

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-30 Thread Jonathan Loran


 From a reporting perspective, yes, zpool status should not hang, and 
should report an error if a drive goes away, or is in any way behaving 
badly.  No arguments there.  From the data integrity perspective, the 
only event zfs needs to know about is when a bad drive is replaced, such 
that a resilver is triggered.  If a drive is suddenly gone, but it is 
only one component of a redundant set, your data should still be fine.  
Now, if enough drives go away to break the redundancy, that's a 
different story altogether.

Jon

Ross Smith wrote:
 I agree that device drivers should perform the bulk of the fault 
 monitoring, however I disagree that this absolves ZFS of any 
 responsibility for checking for errors.  The primary goal of ZFS is to 
 be a filesystem and maintain data integrity, and that entails both 
 reading and writing data to the devices.  It is no good having 
 checksumming when reading data if you are losing huge amounts of data 
 when a disk fails.
  
 I'm not saying that ZFS should be monitoring disks and drivers to 
 ensure they are working, just that if ZFS attempts to write data and 
 doesn't get the response it's expecting, an error should be logged 
 against the device regardless of what the driver says.  If ZFS is 
 really about end-to-end data integrity, then you do need to consider 
 the possibility of a faulty driver.  Now I don't know what the root 
 cause of this error is, but I suspect it will be either a bad response 
 from the SATA driver, or something within ZFS that is not working 
 correctly.  Either way however I believe ZFS should have caught this.
  
 It's similar to the iSCSI problem I posted a few months back where the 
 ZFS pool hangs for 3 minutes when a device is disconnected.  There's 
 absolutely no need for the entire pool to hang when the other half of 
 the mirror is working fine.  ZFS is often compared to hardware raid 
 controllers, but so far its ability to handle problems is falling short.
  
 Ross
  

  Date: Wed, 30 Jul 2008 09:48:34 -0500
  From: [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  CC: zfs-discuss@opensolaris.org
  Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive 
 removed
 
  On Wed, 30 Jul 2008, Ross wrote:
  
   Imagine you had a raid-z array and pulled a drive as I'm doing here.
   Because ZFS isn't aware of the removal it keeps writing to that
   drive as if it's valid. That means ZFS still believes the array is
    online when in fact it should be degraded. If any other drive now
    fails, ZFS will consider the status degraded instead of faulted, and
   will continue writing data. The problem is, ZFS is writing some of
   that data to a drive which doesn't exist, meaning all that data will
   be lost on reboot.
 
  While I do believe that device drivers. or the fault system, should
  notify ZFS when a device fails (and ZFS should appropriately react), I
  don't think that ZFS should be responsible for fault monitoring. ZFS
  is in a rather poor position for device fault monitoring, and if it
  attempts to do so then it will be slow and may misbehave in other
  ways. The software which communicates with the device (i.e. the
  device driver) is in the best position to monitor the device.
 
  The primary goal of ZFS is to be able to correctly read data which was
  successfully committed to disk. There are programming interfaces
  (e.g. fsync(), msync()) which may be used to ensure that data is
  committed to disk, and which should return an error if there is a
  problem. If you were performing your tests over an NFS mount then the
  results should be considerably different since NFS requests that its
  data be committed to disk.
 
  Bob


-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed

2008-07-29 Thread Jonathan Loran
 this 
 information, but wherever that is, zpool status should be reporting 
 the error and directing the admin to the log file.
  
 I would probably say this could be safely stored on the system drive.  
 Would it be possible to have a number of possible places to store this 
 log?  What I'm thinking is that if the system drive is unavailable, 
 ZFS could try each pool in turn and attempt to store the log there.
  
 In fact e-mail alerts or external error logging would be a great 
 addition to ZFS.  Surely it makes sense that filesystem errors would 
 be better off being stored and handled externally?
  
 Ross
  


  Date: Mon, 28 Jul 2008 12:28:34 -0700
  From: [EMAIL PROTECTED]
  Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive 
 removed
  To: [EMAIL PROTECTED]
 
  I'm trying to reproduce and will let you know what I find.
  -- richard
 


 
 

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Announcement: The Unofficial Unsupported Python ZFS API

2008-07-14 Thread Jonathan Hogg
On 14 Jul 2008, at 16:07, Will Murnane wrote:

 As long as I'm composing an email, I might as well mention that I had
 forgotten to mention Swig as a dependency (d'oh!).  I now have a
 mention of it on the page, and a spec file that can be built using
 pkgtool.  If you tried this before and gave up because of a missing
 package, please give it another shot.

Not related to the actual API itself, but just thought I'd note that  
all the cool kids are using ctypes these days to bind Python to  
foreign libraries.

http://docs.python.org/lib/module-ctypes.html

This has the advantage of requiring no other libraries and no compile  
phase at all.

Jonathan

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Largest (in number of files) ZFS instance tested

2008-07-11 Thread Jonathan Edwards

On Jul 11, 2008, at 4:59 PM, Bob Friesenhahn wrote:


 Has anyone tested a ZFS file system with at least 100 million +  
 files?
 What were the performance characteristics?

 I think that there are more issues with file fragmentation over a long
 period of time than the sheer number of files.

actually it's a similar problem .. with a maximum blocksize of 128KB  
and the COW nature of the filesystem you get indirect block pointers  
pretty quickly on a large ZFS filesystem as the size of your tree  
grows .. in this case a large constantly modified file (eg: /u01/data/ 
*.dbf) is going to behave over time like a lot of random access to  
files spread across the filesystem .. the only real difference is that  
you won't walk it every time someone does a getdirent() or an lstat64()

so ultimately the question could be framed as what's the maximum  
manageable tree size you can get to with ZFS while keeping in mind  
that there's no real re-layout tool (by design) .. the number i'm  
working with until i hear otherwise is probably about 20M, but in the  
relativistic sense - it *really* does depend on how balanced your tree  
is and what your churn rate is .. we know on QFS we can go up to 100M,  
but i trust the tree layout a little better there, can separate the  
metadata out if i need to and have planned on it, and know that we've  
got some tools to relayout the metadata or dump/restore for a tape  
backed archive

jonathan

(oh and btw - i believe this question is a query for field data ..  
architect != crash test dummy .. but some days it does feel like it)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-08 Thread Jonathan Loran


Tim Spriggs wrote:
 Does anyone know a tool that can look over a dataset and give 
 duplication statistics? I'm not looking for something incredibly 
 efficient but I'd like to know how much it would actually benefit our 
 dataset: HiRISE has a large set of spacecraft data (images) that could 
 potentially have large amounts of redundancy, or not. Also, other up and 
 coming missions have a large data volume that have a lot of duplicate 
 image info and a small budget; with d11p in OpenSolaris there is a 
 good business case to invest in Sun/OpenSolaris rather than buy the 
 cheaper storage (+ linux?) that can simply hold everything as is.

 If someone feels like coding a tool up that basically makes a file of 
 checksums and counts how many times a particular checksum gets hit over 
 a dataset, I would be willing to run it and provide feedback. :)

 -Tim

   

Me too.  Our data profile is just like Tim's: terabytes of satellite 
data.  I'm going to guess that the d11p ratio won't be fantastic for 
us.  I sure would like to measure it though.
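
A crude first cut needs nothing beyond stock Solaris tools -- file-level only, 
so it will understate what 128K-block dedup could find, and it chokes on odd 
filenames (a sketch, not a real tool; path hypothetical):

# find /tank/data -type f | xargs digest -a md5 | sort | uniq -c | sort -rn | head

Any count greater than 1 is a whole-file duplicate; summing (count - 1) times 
the file size would give a rough lower bound on what dedup could reclaim.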

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-08 Thread Jonathan Loran


Justin Stringfellow wrote:
   
 Does anyone know a tool that can look over a dataset and give 
 duplication statistics? I'm not looking for something incredibly 
 efficient but I'd like to know how much it would actually benefit our 
 

 Check out the following blog..:

 http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool

   
Unfortunately we are on Solaris 10 :(  Can I get a zdb for zfs V4 that 
will dump those checksums?

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-08 Thread Jonathan Loran


Moore, Joe wrote:

 On ZFS, sequential files are rarely sequential anyway.  The SPA tries to
 keep blocks nearby, but when dealing with snapshotted sequential files
 being rewritten, there is no way to keep everything in order.
   

In some cases, a d11p system could actually speed up data reads and 
writes.  If you are repeatedly accessing duplicate data, then you will 
more likely hit your ARC, and not have to go to disk.  With your data 
d11p'd, the ARC can hold a significantly higher percentage of your data 
set, just like the disks.  For a d11p ARC, I would expire based upon 
block reference count.  If a block has few references, it should expire 
first, and vice versa, blocks with many references should be the last 
out.  With all the savings on disks, think how much RAM you could buy ;)

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS deduplication

2008-07-07 Thread Jonathan Loran


Neil Perrin wrote:
 Mertol,

 Yes, dedup is certainly on our list and has been actively
 discussed recently, so there's hope and some forward progress.
 It would be interesting to see where it fits into our customers
 priorities for ZFS. We have a long laundry list of projects.
 In addition there's bug fixes  performance changes that customers
 are demanding.

 Neil.

   

I want to cast my vote for getting dedup on ZFS.  One place we currently 
use ZFS is as nearline storage for backup data.  I have a 16TB server 
that provides a file store for an EMC Networker server.  I'm seeing a 
compressratio of 1.73, which is mighty impressive, since we also use 
native EMC compression during the backups.  But with dedup, we should 
see way more.  Here at UCB SSL, we have demoed and investigated various 
dedup products, hardware and software, but they are all steep on the ROI 
curve.  I would be very excited to see block-level ZFS deduplication 
roll out, especially since we already have the infrastructure in place 
using Solaris/ZFS.
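
(The number above is just what comes back from

# zfs get compressratio tank/backup

with a hypothetical pool/dataset name, for anyone wondering where to look.)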

Cheers,

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot delete errored file

2008-06-13 Thread Jonathan Loran


Ben Middleton wrote:
 Hi,

 Quick update:

 I left memtest running over night - 39 passes, no errors.

 I also attempted to force the BIOS to run the memory at 800MHz  5-5-5-15 as 
 suggested - but the machine became very unstable - long boot times; 
 PCI-Express failure of Yukon network card on booting etc. I've switched it 
 back to Auto speedtiming for now. I'll just hope that it was a one-off 
 glitch that corrupted the pool.

 I'm going to rebuild the pool this weekend.

 Thanks for all the suggestions.

   
Ben,

Haven't read this whole thread, and this has been brought up before, but 
make sure your power supply is running clean.  I can't tell you how many 
times I've seen very strange and intermittent system errors occur from a 
flaky power supply.

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SATA controller suggestion

2008-06-09 Thread Jonathan Hogg
On 9 Jun 2008, at 14:59, Thomas Maier-Komor wrote:

 time gdd if=/dev/zero bs=1048576 count=10240 of=/data/video/x

 real 0m13.503s
 user 0m0.016s
 sys  0m8.981s



 Are you sure gdd doesn't create a sparse file?

One would presumably expect it to be instantaneous if it was creating  
a sparse file. It's not a compressed filesystem though is it? /dev/ 
zero tends to be fairly compressible ;-)

I think, as someone else pointed out, running zpool iostat at the same  
time might be the best way to see what's really happening.
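
For example (the gdd line is the same as above; the only new bits are watching 
the pool from a second terminal, and optionally using /dev/urandom to rule out 
compression at the cost of being CPU-bound -- pool name assumed from the /data 
path):

# zpool iostat -v data 5        (leave running in another window)
# gdd if=/dev/urandom bs=1048576 count=10240 of=/data/video/x

If zpool iostat shows far less hitting the disks than gdd reports writing, 
something (sparseness, compression, caching) is eating the difference.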

Jonathan

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-30 Thread Jonathan Hogg
On 30 May 2008, at 15:49, J.P. King wrote:

 For _my_ purposes I'd be happy with zfs send/receive, if only it was
 guaranteed to be compatible between versions.  I agree that the  
 inability
 to extract single files is an irritation - I am not sure why this is
 anything more than an implementation detail, but I haven't gone into  
 it in
 depth.

I would presume it is because zfs send/receive works at the block  
level, below the ZFS POSIX layer - i.e., below the filesystem level. I  
would guess that a stream is simply a list of the blocks that were  
modified between the two snapshots, suitable for re-playing on  
another pool. This means that the stream may not contain your entire  
file.

An interesting point regarding this is that send/receive will be  
optimal in the case of small modifications to very large files, such  
as database files or large log files. The actual modified/appended  
blocks would be sent rather than the whole changed file. This may be  
an important point depending on your file modification patterns.
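
A quick way to see this for yourself (dataset and snapshot names hypothetical):

# zfs snapshot tank/db@tue
# zfs send -i @mon tank/db@tue | wc -c

The byte count tracks the blocks changed since @mon, not the size of the 
database files themselves.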

Jonathan

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-29 Thread Jonathan Hogg
On 29 May 2008, at 15:51, Thomas Maier-Komor wrote:

 I very strongly disagree.  The closest ZFS equivalent to ufsdump is  
 'zfs
 send'.  'zfs send' like ufsdump has intimate awareness of the 
 actual on-disk layout and is an integrated part of the filesystem 
 implementation.

 star is a userland archiver.


 The man page for zfs states the following for send:

  The format of the stream is evolving. No backwards  compati-
  bility  is  guaranteed.  You may not be able to receive your
  streams on future versions of ZFS.

 I think this should be taken into account when considering 'zfs send'
 for backup purposes...

Presumably, if one is backing up to another disk, one could zfs  
receive to a pool on that disk. That way you get simple file-based  
access, full history (although it could be collapsed by deleting older  
snapshots as necessary), and no worries about stream format changes.

Jonathan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs equivalent of ufsdump and ufsrestore

2008-05-29 Thread Jonathan Hogg
On 29 May 2008, at 17:52, Chris Siebenmann wrote:

 The first issue alone makes 'zfs send' completely unsuitable for the
 purposes that we currently use ufsdump. I don't believe that we've  
 lost
 a complete filesystem in years, but we restore accidentally deleted
 files all the time. (And snapshots are not the answer, as it is common
 that a user doesn't notice the problem until well after the fact.)

 ('zfs send' to live disks is not the answer, because we cannot afford
 the space, heat, power, disks, enclosures, and servers to spin as many
 disks as we have tape space, especially if we want the fault isolation
 that separate tapes give us. most especially if we have to build a
 second, physically separate machine room in another building to put  
 the
 backups in.)

However, the original poster did say they were wanting to back up to  
another disk and said they wanted something lightweight/cheap/easy.  
zfs send/receive would seem to fit the bill in that case. Let's answer  
the question rather than getting into an argument about whether zfs  
send/receive is suitable for an enterprise archival solution.

Using snapshots is a useful practice as it costs fairly little in  
terms of disk space and provides immediate access to fairly recent,  
accidentally deleted files. If one is using snapshots, sending the  
streams to the backup pool is a simple procedure. One can then keep as  
many snapshots on the backup pool as necessary to provide the amount  
of history required. All of the files are kept in identical form on  
the backup pool for easy browsing when something needs to be restored.  
In event of catastrophic failure of the primary pool, one can quickly  
move the backup disk to the primary system and import it as the new  
primary pool.

It's a bit-perfect incremental backup strategy that requires no  
additional tools.
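
A minimal sketch of the cycle, assuming the backup disk holds a pool called 
'backup' that already has a copy up to yesterday's snapshot (all names 
hypothetical):

# zfs snapshot tank/home@2008-05-30
# zfs send -i @2008-05-29 tank/home@2008-05-30 | zfs receive -F backup/home

The -F on the receive side just rolls back any stray changes (atime updates 
and the like) on the backup copy. Old snapshots can then be pruned on the 
primary while the full history stays on the backup pool.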

Jonathan

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Inconcistancies with scrub and zdb

2008-05-06 Thread Jonathan Loran


Jonathan Loran wrote:
 Since no one has responded to my thread, I have a question:  Is zdb 
 suitable to run on a live pool?  Or should it only be run on an exported 
 or destroyed pool?  In fact, I see that it has been asked before on this 
 forum, but is there a users guide to zdb? 

   
Answering myself: I finally looked at the zdb source code, and I see the results 
of running it on a live pool are not consistent, hence the -L option.  OK, so I'm 
going to trust the scrub to tell me if there are errors, and as far as I can 
tell, my pools are clean now.  But it was scary creating the mirror from a pool 
with checksum errors.  I think there could be some more verbosity about what is 
going on, or the user could be given some options when checksum errors are found in 
the process of silvering up a mirror for the first time.  Just a comment.

Thanks,

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Video streaming and prefetch

2008-05-06 Thread Jonathan Hogg
Hi all,

I'm new to this list and ZFS, so forgive me if I'm re-hashing an old  
topic. I'm also using ZFS on FreeBSD not Solaris, so forgive me for  
being a heretic ;-)

I recently setup a home NAS box and decided that ZFS is the only  
sensible way to manage 4TB of disks. The primary use of the box is to  
serve my telly (actually a Mac mini). This is using afp (via netatalk)  
to serve space to the telly for storing and retrieving video. The  
video tends to be 2-4GB files that are read/written sequentially at a  
rate in the region of 800KB/s.

Unfortunately, the performance has been very choppy. The video  
software assumes it's talking to fast local storage and thus makes  
little attempt to buffer. I spent a long time trying to figure out the  
network problem before determining that the problem is actually in  
reading from the FS. This is a pretty cheap box, but it can still  
sustain 110GB/s off the array and low milliseconds access times. So  
there really is no excuse for not being able to serve up 800KB/s in an  
even fashion.

After some experimentation I have determined that the problem is  
prefetching. Given this thing is mostly serving sequentially at a low,  
even rate it ought to be perfect territory for prefetching. I spent  
the weekend reading the ZFS code (bank holiday fun eh?) and running  
some experiments and think the problem is in the interaction between  
the prefetching code and the running processes.

(Warning: some of the following is speculation on observed behaviour  
and may be rubbish.)

The behaviour I see is the file streaming stalling whenever the  
prefetch code decides to read some more blocks. The dmu_zfetch code is  
all run as part of the read() operation. When this finds itself  
getting close to running out of prefetched blocks it queues up  
requests for more blocks - 256 of them. At 128KB per block, that's  
32MB of data it requests. At this point it should be asynchronous and  
the caller should get back control and be able to process the data it  
just read. However, my NAS box is a uniprocessor and the issue thread  
is higher priority than user processes. So, in fact, it immediately  
begins issuing the physical reads to the disks.

Given that modern disks tend to prefetch into their own caches anyway,  
some of these reads are likely to be served up instantly. This causes  
interrupts back into the kernel to deal with the data. This queues up  
the interrupt threads, which are also higher priority than user  
processes. These consume a not-insubstantial amount of CPU time to  
gather, checksum and load the blocks into the ARC. During which time,  
the disks have located the other blocks and started serving them up.

So what I seem to get is a small perfect storm of interrupt  
processing. This delays the user process for a few hundred  
milliseconds. Even though the originally requested block was *in* the  
cache! To add insult to injury the, user process in this case, when it  
finally regains the CPU and returns the data to the the caller, then  
sleeps for a couple of hundred milliseconds. So prefetching, instead  
of evening-out reading and reducing jitter, has produced the worst  
case performance of compressing all of the jitter into one massive  
lump every 40 seconds (32MB / 800K).

I get reasonably even performance if I disable prefetching or if I  
reduce the zfetch_block_cap to 16-32 blocks instead of 256.
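
For reference, the off-switch for prefetch that I mean (exact tunable names may 
vary by ZFS version, so treat these as a sketch):

  FreeBSD, /boot/loader.conf:   vfs.zfs.prefetch_disable="1"
  Solaris, /etc/system:         set zfs:zfs_prefetch_disable = 1

The block cap can be set the same way on Solaris (set zfs:zfetch_block_cap = 32); 
I'm not sure it's exposed as a sysctl on FreeBSD, so assume a source-level tweak 
there.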

Other than just taking this opportunity to rant, I'm wondering if  
anyone else has seen similar problems and found a way around them?  
Also, to any ZFS developers: why does the prefetching logic follow the  
same path as a regular async read? Surely these ought to be way down  
the priority list? My immediate thought after a weekend of reading the  
code was to re-write it to use a low priority prefetch thread and have  
all of the dmu_zfetch() logic in that instead of in-line with the  
original dbuf_read().

Jonathan


PS: Hi Darren!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Inconcistancies with scrub and zdb

2008-05-05 Thread Jonathan Loran

Since no one has responded to my thread, I have a question:  Is zdb 
suitable to run on a live pool?  Or should it only be run on an exported 
or destroyed pool?  In fact, I see that it has been asked before on this 
forum, but is there a users guide to zdb? 

Thanks,

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Inconcistancies with scrub and zdb

2008-05-04 Thread Jonathan Loran

Hi List,

First of all:  S10u4 120011-14

So I have a weird situation.  Earlier this week, I finally mirrored up 
two iSCSI based pools.  I had been wanting to do this for some time, 
because the availability of the data in these pools is important. One 
pool mirrored just fine, but the other pool is another story.

First lesson (I think) is you should scrub your pools, at least those 
backed by a SAN, before mirroring them.  The problem pool was scrubbed 
about two weeks before I mirrored it, and it was clean. I assumed, 
wrongly, that there were no checksum errors in the time that elapsed.  
Well guess again.  When I mirrored this guy, the source mirror had two 
checksum errors.  Interestingly, the target inherited these errors, and 
so now both sides of the mirror showed two checksum errors in the counters.  I 
don't know if this was real, or if the zpool attach operation just 
incremented the counters on the second half of the mirror.

My next mistake was to assume the counters were in error on the second 
mirror, and so I zeroed out the counters with zpool clear.  OK, so now I 
scrub the pool, and no checksum errors were found on either side of the 
mirror.  Huh?!?  What about those two checksum errors on the first 
mirror.  OK, so I run zdb on the pool, and it finds scads of errors:

Traversing all blocks to verify checksums and verify nothing leaked ...

zdb_blkptr_cb: Got error 50 reading 33, 727252, 0, 4a -- skipping--
...

and then tons of:

Error counts:
errno count
50 123
leaked space: vdev 0, offset 0x4deaed800, size 2048
...


OK, this is odd, so I scrub the pool again, and this time it found 4 
checksum errors, on the initial mirror, but none on the other mirror. 
That makes some sense, (though I don't know what changed) so I break the 
mirror, taking off the original side that has the checksum errs. I then 
scrub the pool, no errors found. That's good, but just to be sure, I run 
zdb on it, and it finds tons of the same errors as it found on the 
original side of the mirror. Argh!

In the meantime, I ran 4 passes of format -> analyze -> compare on the 
initial half of the mirror that had the checksum errors, and it's totally clean 
hardware-wise.

So my questions are these:

1) Does zdb leaked space mean trouble with the pool?
2) Is it possible that the errors got injected to the new half of the 
mirror when I attached it? For now, I'm going to assume that the new 
half of the mirror is OK, hardware-wise. 
3) I'm running a scrub and zdb on the other pool that lives on these SAN 
boxes, because I want to see if they come up with the same problems. If 
not, what would be going on with this crazy pool?
4) Can I recover from this without copying the whole pool to new 
storage? If not, it will be painful for us. We will have to reboot 350 
servers and workstations on stale file handles, interrupting 100's of 
production processes. My user base is losing faith in my team.

Oh sage ones, please advise. Thanks in advance.

Jon


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] share zfs hierarchy over nfs

2008-04-30 Thread Jonathan Loran


Bob Friesenhahn wrote:
 So for Linux, I think that you will also need to figure out an
 indirect-map incantation which works for its own broken automounter. 
 Make sure that you read all available documentation for the Linux 
 automounter so you know which parts don't actually work.

   
Au contraire, Bob.  I'm not going to boost Linux, but in this department, 
they've tried to do it right.  If you use Linux autofs V4 or higher, you 
can use Sun-style maps (except there are no direct maps in V4; you need V5 
for direct maps).  For our home directories, which use an indirect map, 
we just use the Solaris map, thus:

auto_home:
*    zfs-server:/home/&

Sorry to be so off (ZFS) topic.

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS - Implementation Successes and Failures

2008-04-29 Thread Jonathan Loran


Dominic Kay wrote:
 Hi

 Firstly apologies for the spam if you got this email via multiple aliases.

 I'm trying to document a number of common scenarios where ZFS is used 
 as part of the solution such as email server, $homeserver, RDBMS and 
 so forth but taken from real implementations where things worked and 
 equally importantly threw up things that needed to be avoided (even if 
 that was the whole of ZFS!).

 I'm not looking to replace the Best Practices or Evil Tuning guides 
 but to take a slightly different slant.  If you have been involved in 
 a ZFS implementation small or large and would like to discuss it 
 either in confidence or as a referenceable case study that can be 
 written up, I'd be grateful if you'd make contact.

 -- 
 Dominic Kay
 http://blogs.sun.com/dom

For all the storage under my management, we are deploying ZFS going 
forward.  There have been issues, to be sure, though none of them 
showstoppers.  I agree with other posters that the way the z* commands 
lock up on a failed device is really not good, and it would be nice to 
be able to remove devices from a zpool.  There have been other 
performance issues that are more the fault of our SAN nodes than 
ZFS.  But the ease of management, the unlimited nature (volume size to 
number of file systems) of everything ZFS, built-in snapshots, and the 
confidence we get in our data make ZFS a winner. 

The way we've deployed ZFS has been to map iSCSI devices from our SAN.  
I know this isn't an ideal way to deploy ZFS, but SANs do offer 
flexibility that direct-attached drives do not.  Performance is now 
sufficient for our needs, but it wasn't at first.  We do everything here 
on the cheap, we have to.  After all, this is University research ;)  
Anyway, we buy commodity x86 servers, and use software iSCSI.  Most of 
our iSCSI nodes run Open-E iSCSI-R3.  The latest version is actually 
quite quick, which wasn't always the case.  I am experimenting with using ZFS 
on the iSCSI target, but haven't finished validating that yet. 

I've also rebuilt an older 24 disk SATA chassis with the following parts:

Motherboard:Supermicro PDSME+
Processor: Intel Xeon X3210 Kentsfield 2.13GHz 2 x 4MB L2 Cache LGA 775 
Quad-Core
Disk Controllers x3: Supermicro AOC-SAT2-MV8 8-Port SATA
Hard disks x24: WD-1TB RE2, GP
RAM: Crucial, 4x2GB unbuffered ECC PC2-5300 (8GB total)
New power supplies...

The PDSME+ MB was on the Solaris HCL, and it has four PCI-X slots, so 
using three of the Supermicro MV8s is no problem.  This is obviously a 
standalone system, but it will be for nearline backup data, and doesn't 
have the same expansion requirements as our other servers.  The thing 
about this guy is how smokin' fast it is.  I've set it up on snv b86, 
with 4 x 6-drive raidz2 stripes, and I'm seeing up to 450MB/sec write 
and 900MB/sec read speeds.  We can't get data into it anywhere near that 
quick, but the potential is awesome.  And it was really cheap for this 
amount of storage.
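
For the curious, the layout is nothing exotic -- roughly this, with the device 
names made up:

# zpool create nearline \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
    raidz2 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
    raidz2 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 \
    raidz2 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0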

Our total storage on ZFS is now at 103TB: some user home directories, 
some software distribution, and a whole lot of scientific data.  I 
compress almost everything, since our bandwidth tends to be pinched at the 
SAN, not at the head nodes, so we can afford it.

I sleep at night, and the users don't see problems.  I'm a happy camper.

Cheers,

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for write-only media?

2008-04-22 Thread Jonathan Loran


Bob Friesenhahn wrote:
 The problem here is that by putting the data away from your machine,
 you loose the chance to scrub
 it on a regular basis, i.e. there is always the risk of silent
 corruption.
 

 Running a scrub is pointless since the media is not writeable. :-)

   
But that's the point.  You can't correct silent errors on write-once 
media because you can't write the repair.

I think it makes more sense to save a checksum of the entire CD/DVD/etc. 
media separately, so you can check the validity of your data that way, 
instead of using ZFS on WORM media.
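
For example (device path hypothetical, and assuming the disc is a plain data 
disc you can read end to end):

# dd if=/dev/rdsk/c0t2d0s2 bs=1048576 | digest -a sha1 > dvd-0001.sha1

Re-running that later and comparing against the stored hash tells you whether 
the disc has degraded, which is about all you can do when you can't rewrite it 
anyway.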

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS for write-only media?

2008-04-22 Thread Jonathan Loran


Bob Friesenhahn wrote:
 On Tue, 22 Apr 2008, Jonathan Loran wrote:

 But that's the point.  You can't correct silent errors on write once
 media because you can't write the repair.

 Yes, you can correct the error (at time of read) due to having both 
 redundant media, and redundant blocks. That is a normal function of 
 ZFS.  It just not possible to correct the failed block on the media by 
 re-writing it or moving its data to a new location.

I suppose with ditto blocks, this has some merit.  Someone needs to 
characterize how errors propagate on different types of WORM media.  Perhaps 
this has already been done.  In my experience, when DVD-Rs go south, they really 
go bad all at once, not with a lot of small bit errors.  But a full analysis would be 
good.  Probably it would make the most sense to write mirrored WORM disks with 
different technologies to hedge your bets.

Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 24-port SATA controller options?

2008-04-15 Thread Jonathan Loran


Luke Scharf wrote:
 Maurice Volaski wrote:
   
 Perhaps providing the computations rather than the conclusions would
 be more persuasive  on a technical list ;
 
   
 2 16-disk SATA arrays in RAID 5
 2 16-disk SATA arrays in RAID 6
 1 9-disk SATA array in RAID 5.

 4 drive failures over 5 years. Of course, YMMV, especially if you 
 drive drunk :-)
   
 

 My mileage does vary!

 On a 4 year old 84 disk array (with 12 RAID 5s), I replace one drive 
 every couple of weeks (on average).  This array lives in a proper 
 machine-room with good power and cooling.  The array stays active, though.

 -Luke
   

I basically agree with this.  We have about 150TB in mostly RAID 5 
configurations, ranging from 8 to 16 disks per volume.  We also replace 
bad drives about every week or three, but in six years, have never lost 
an array.  I think our secret is this: on our 3ware controllers we run 
a verify at a minimum of three times a week.  The verify will read the 
whole array (data and parity), find bad blocks and move them if 
necessary to good media.  Because of this, we've never had a rebuild 
trigger a secondary failure.  knock wood.  Our server room has 
conditioned power and cooling as well.

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris ZFS NAS Setup

2008-04-10 Thread Jonathan Loran

Chris Siebenmann wrote:
 | What your saying is independent of the iqn id?

  Yes. SCSI objects (including iSCSI ones) respond to specific SCSI
 INQUIRY commands with various 'VPD' pages that contain information about
 the drive/object, including serial number info.

  Some Googling turns up:
   
 http://wikis.sun.com/display/StorageDev/Solaris+OS+Disk+Driver+Device+Identifier+Generation
   http://www.bustrace.com/bustrace6/sas.htm

  Since you're using Linux IET as the target, you want to set the
 'ScsiId' and 'ScsiSN' Lun parameters to unique (and different) values.

 (You can use sdparm, http://sg.torque.net/sg/sdparm.html, on Solaris
 to see exactly what you're currently reporting in the VPD data for each
 disk.)

   - cks
   

CC-ing the list, because this is of general interest.

Chris, indeed the older version of Open-E iSCSI I was using for my tests 
has no unique VPD identifiers whatsoever, so this could confuse the 
initiator:

prudhoe # sdparm -6 -i /devices/iscsi/[EMAIL PROTECTED],0:wd,raw
/devices/iscsi/[EMAIL PROTECTED],0:wd,raw: IET   VIRTUAL-DISK  0
Device identification VPD page:
  Addressed logical unit:
designator type: T10 vendor identification,  code set: Binary
  vendor id: IET
  vendor specific:


Whereas the new version of Open-E iSCSI (called iSCSI R3) does.  These 
are two LUNs from the system I will be doing a ZFS mirror on, running 
the new Open-E iSCSI-R3 on the target:


apollo # sdparm -i 
/devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw

/devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw: iSCSI DISK  0
Device identification VPD page:
  Addressed logical unit:
designator type: T10 vendor identification,  code set: Binary
  vendor id: iSCSI
  vendor specific: XBD3Qzf9pzqYrsdz

apollo # sdparm -i /devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw
/devices/scsi_vhci/[EMAIL PROTECTED]:wd,raw: iSCSI DISK  0
Device identification VPD page:
  Addressed logical unit:
designator type: T10 vendor identification,  code set: Binary
  vendor id: iSCSI
  vendor specific: ZknC2lbWA5y3M7v6


Open-E iSCSI-R3 generates a unique vendor-specific serial number, so the 
ZFS mirror will most likely fail and recover more cleanly.

Thanks for the pointers.

Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS volume export to USB-2 or Firewire?

2008-04-09 Thread Jonathan Edwards

On Apr 9, 2008, at 11:46 AM, Bob Friesenhahn wrote:
 On Wed, 9 Apr 2008, Ross wrote:

 Well the first problem is that USB cables are directional, and you
 don't have the port you need on any standard motherboard.  That

 Thanks for that info.  I did not know that.

 Adding iSCSI support to ZFS is relatively easy since Solaris already
 supported TCP/IP and iSCSI.  Adding USB support is much more
 difficult and isn't likely to happen since afaik the hardware to do
 it just doens't exist.

 I don't believe that Firewire is directional but presumably the
 Firewire support in Solaris only expects to support certain types of
 devices.  My workstation has Firewire but most systems won't have it.

 It seemed really cool to be able to put your laptop next to your
 Solaris workstation and just plug it in via USB or Firewire so it can
 be used as a removable storage device.  Or Solaris could be used on
 appropriate hardware to create a more reliable portable storage
 device.  Apparently this is not to be and it will be necessary to deal
 with iSCSI instead.

 I have never used iSCSI so I don't know how difficult it is to use as
 temporary removable storage under Windows or OS-X.

i'm not so sure what you're really after, but i'm guessing one of two  
things:

1) a global filesystem?  if so - ZFS will never be globally accessible  
from 2 hosts at the same time without an interposer layer such as NFS  
or Lustre .. zvols could be exported to multiple hosts via iSCSI or FC- 
target but that's only 1/2 the story ..
2) an easy way to export volumes?  agree - there should be some sort  
of semantics that would signal a filesystem is removable and trap on  
USB events when the media is unplugged .. of course you'll have  
problems with uncommitted transactions that would have to roll back on  
the next plug, or somehow be query-able

iSCSI will get you a block/character device level sharing from a zvol  
(pseudo device) or the equivalent of a blob filestore .. you'd have to  
format it with a filesystem, but that filesystem could be a global one  
(eg: QFS) and you could multi-host natively that way.
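
eg: the zvol route on OpenSolaris, in rough outline (names hypothetical, and  
assuming the shareiscsi property is available in your build):

# zfs create -V 50g tank/lapvol
# zfs set shareiscsi=on tank/lapvol
# iscsitadm list target

then point the laptop's iSCSI initiator at it and format the LUN with whatever  
local filesystem it likes.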

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris ZFS NAS Setup

2008-04-09 Thread Jonathan Loran

Just to report back to the list...  Sorry for the lengthy post

So I've tested the iSCSI-based zfs mirror on Sol 10u4, and it does more 
or less work as expected.  If I unplug one side of the mirror - unplug 
or power down one of the iSCSI targets - I/O to the zpool stops for a 
while, perhaps a minute, and then things free up again.  zpool commands 
seem to get unworkably slow, and error messages fly by on the console 
like fire ants running from a flood.  Worst of all, after plugging the faulted 
mirror back in (before removing it from the pool), it's very 
hard to bring the faulted device back online:

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr  8 16:34:08 2008
config:

NAME        STATE     READ WRITE CKSUM
test        DEGRADED     0     0     0
  mirror    DEGRADED     0     0     0
    c2t1d0  FAULTED      0 2.88K     0  corrupted data
    c2t1d0  ONLINE       0     0     0

errors: No known data errors

 Comment: why are there now two instances of c2t1d0??  


prudhoe # zpool replace test c2t2d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).

prudhoe # zpool replace -f test c2t2d0
invalid vdev specification
the following errors must be manually repaired:
/dev/dsk/c2t1d0s0 is part of active ZFS pool test. Please see zpool(1M).

prudhoe # zpool remove test c2t2d0
cannot remove c2t2d0: no such device in pool

prudhoe # zpool offline test c2t2d0
cannot offline c2t2d0: no such device in pool

prudhoe # zpool online test c2t2d0
cannot online c2t2d0: no such device in pool

  OK, get more drastic 

prudhoe # zpool clear test

prudhoe # zpool status
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid.  Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Tue Apr  8 16:34:08 2008
config:

NAME        STATE     READ WRITE CKSUM
test        DEGRADED     0     0     0
  mirror    DEGRADED     0     0     0
    c2t1d0  FAULTED      0     0     0  corrupted data
    c2t1d0  ONLINE       0     0     0

errors: No known data errors

  Frustration setting in.  The error counts are zero, but still 
two instances of c2t1d0 listed... 

prudhoe # zpool export test

prudhoe # zpool import test

prudhoe # zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
test   12.9G  9.54G  3.34G    74%  ONLINE  -

prudhoe # zpool status
  pool: test
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 1.11% done, 0h20m to go
config:

NAME        STATE     READ WRITE CKSUM
test        ONLINE       0     0     0
  mirror    ONLINE       0     0     0
    c2t2d0  ONLINE       0     0     0
    c2t1d0  ONLINE       0     0     0

errors: No known data errors


  Finally resilvering with the right devices.  The thing I really don't 
 like here is the pool had to be exported and then imported to make this 
 work.  For an NFS server, this is not really acceptable.  Now I know this 
 is ol' Solaris 10u4, but still, I'm surprised I needed to export/import 
 the pool to get it working correctly again.  Anyone know what I did 
 wrong?  Is there a canonical way to online the previously faulted device?

Anyway, it looks like for now I can get some sort of HA out of this iSCSI 
mirror.  The other pluses are that the pool can self-heal, and reads will be spread 
across both units.  

Cheers,

Jon

--- P.S.  Playing with this more before sending this message, if you can detach 
the faulted mirror before putting it back online, it all works well.  Hope that 
nothing bounces on your network when you have a failure:

 unplug one iscsi mirror, then: 

prudhoe # zpool status -v
  pool: test
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: scrub completed with 0 errors on Wed Apr  9 14:18:45 2008
config:

NAME        STATE     READ WRITE CKSUM
test        DEGRADED     0     0     0
  mirror   
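
A minimal sketch of the detach-then-reattach sequence described in the P.S., 
with made-up device names (detach the faulted side while it is unreachable, 
then attach it again as a fresh mirror half once it is back):

# zpool detach test c2t1d0
# zpool attach test c2t2d0 c2t1d0

The attach kicks off a resilver, much as the export/import dance ended up 
doing above.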

Re: [zfs-discuss] [storage-discuss] OpenSolaris ZFS NAS Setup

2008-04-06 Thread Jonathan Loran

kristof wrote:
 If you have a mirrored iSCSI zpool, it will NOT panic when one of the 
 submirrors is unavailable.

 zpool status will hang for some time, but after I think 300 seconds it will 
 mark the device unavailable.

 The panic was the default in the past, and it only occurs if all devices are 
 unavailable.

 Since I think b77 there is a new zpool property, failmode, which you can set 
 to prevent a panic: 

  failmode=wait | continue | panic

  Controls the system behavior  in  the  event  of  catas-
  trophic  pool  failure.  This  condition  is typically a
  result of a  loss  of  connectivity  to  the  underlying
  storage device(s) or a failure of all devices within the
  pool. The behavior of such an  event  is  determined  as
  follows:

  waitBlocks all I/O access until the device  con-
  nectivity  is  recovered  and the errors are
  cleared. This is the default behavior.

  continueReturns EIO to any new  write  I/O  requests
  but  allows  reads  to  any of the remaining
  healthy devices.  Any  write  requests  that
  have  yet  to  be committed to disk would be
  blocked.

  panic   Prints out a message to the console and gen-
  erates a system crash dump.
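
For reference, on builds that have the property, it is set and checked per 
pool, e.g. (pool name made up):

# zpool set failmode=continue tank
# zpool get failmode tank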
  
  
   
This is encouraging, but one problem:  our system is on Solaris 10 U4. 
Will this guy be immune to panics when one side of the mirror goes
down?  Seriously, I'm tempted to upgrade this box to OS b8?  However,
there are a lot of dependencies which we need to worry about in doing
that - for example, will all our off-the-shelf software run with Open
Solaris?  More things to test.



Thanks,



Jon


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris ZFS NAS Setup

2008-04-04 Thread Jonathan Loran

 This guy seems to have had lots of fun with iSCSI :)
 http://web.ivy.net/~carton/oneNightOfWork/20061119-carton.html

   
This is scaring the heck out of me.  I have a project to create a zpool 
mirror out of two iSCSI targets, and if the failure of one of them will 
panic my system, that will be totally unacceptable.  What's the point of 
having an HA mirror if one side can't fail without busting the host?  Is 
it really true, as the guy on the above link states (please read the 
link, sorry), that when one iSCSI mirror goes offline, the initiator system 
will panic?  Or even worse, not boot itself cleanly after such a 
panic?  How could this be?  Anyone else with experience with iSCSI-based 
ZFS mirrors?

Thanks,

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Backup-ing up ZFS configurations

2008-03-25 Thread Jonathan Loran


Bob Friesenhahn wrote:
 On Tue, 25 Mar 2008, Robert Milkowski wrote:
 As I wrote before - it's not only about RAID config - what if you have
 hundreds of file systems, with some share{nfs|iscsi|cifs} enabled with
 specific parameters, then specific file system options, etc.

 Some zfs-related configuration is done using non-ZFS commands.  For 
 example, a filesystem devoted to a user is typically chowned to that 
 user and the user's group.  I assume that owner, group, and any ACLs 
 associated with a filesystem would be preserved so that they are part 
 of the pool re-creation commands?

 When creating ZFS filesystems, the step of creating the pool is 
 separate from the steps of creating the filesystems.  Obviously these 
 steps need to either be separate, or separable, so that a similar 
 filesystem layout can be created with different hardware.

Correct me if I'm not interpreting this discussion properly, but aren't 
we discussing reconstruction of the container (zpool/zfs file systems 
and settings), not the data therein?  Modes, ACLs, extended attributes, 
and ownership of the data should all come over with a zfs receive, or 
the backup/recovery tool of your choice. 

I believe I could write a trivial shell script to take the listings of:

# zpool list pool

and

# zfs list -r -t filesystem,volume -o all pool

to recreate the whole pool, and all the necessary properties.
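
A rough sketch of such a script, for illustration only - the pool name is an 
example, the vdev layout printed by zpool status still has to be turned back 
into a zpool create line by hand, and the datasets themselves would need to be 
recreated before the generated zfs set commands are replayed:

#!/bin/ksh
# sketch only: error handling omitted
POOL=${1:-tank}
# capture the vdev layout
zpool status $POOL
# emit every locally-set dataset property as a zfs set command
zfs get -H -r -s local -o name,property,value all $POOL |
while read name prop value; do
        print "zfs set ${prop}='${value}' ${name}"
done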

Jon


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Jonathan Edwards

On Mar 20, 2008, at 11:07 AM, Bob Friesenhahn wrote:
 On Thu, 20 Mar 2008, Mario Goebbels wrote:

 Similarly, read block size does not make a
 significant difference to the sequential read speed.

 Last time I did a simple bench using dd, supplying the record size as
 blocksize to it instead of no blocksize parameter bumped the mirror  
 pool
 speed from 90MB/s to 130MB/s.

 Indeed.  However, as an interesting twist to things, in my own
 benchmark runs I see two behaviors.  When the file size is smaller
 than the amount of RAM the ARC can reasonably grow to, the write block
 size does make a clear difference.  When the file size is larger than
 RAM, the write block size no longer makes much difference and
 sometimes larger block sizes actually go slower.

in that case .. try fixing the ARC size .. the dynamic resizing on the  
ARC can be less than optimal IMHO

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-20 Thread Jonathan Edwards

On Mar 20, 2008, at 2:00 PM, Bob Friesenhahn wrote:
 On Thu, 20 Mar 2008, Jonathan Edwards wrote:

 in that case .. try fixing the ARC size .. the dynamic resizing on  
 the ARC
 can be less than optimal IMHO

 Is a 16GB ARC size not considered to be enough? ;-)

 I was only describing the behavior that I observed.  It seems to me
 that when large files are written very quickly, that when the file
 becomes bigger than the ARC, that what is contained in the ARC is
 mostly stale and does not help much any more.  If the file is smaller
 than the ARC, then there is likely to be more useful caching.

sure i got that - it's not the size of the arc in this case since  
caching is going to be a lost cause.. but explicitly setting a  
zfs_arc_max should result in fewer calls to arc_shrink() when you hit  
memory pressure from the application's page buffer competing with  
the arc

in other words, as soon as the arc is 50% full of dirty pages (8GB)  
it'll start evicting pages .. you can't avoid that .. but what you can  
avoid is the additional weight of constantly growing and shrinking the  
cache as it tries to keep up with your constantly changing blocks in a  
large file
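
for what it's worth, on nevada/solaris of that era pinning the arc usually 
meant an /etc/system entry along these lines (the 8GB cap is just an example 
value):

* cap the ZFS ARC at 8 GB; takes effect after a reboot
set zfs:zfs_arc_max = 0x200000000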

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs backups to tape

2008-03-16 Thread Jonathan Edwards

On Mar 14, 2008, at 3:28 PM, Bill Shannon wrote:
 What's the best way to backup a zfs filesystem to tape, where the size
 of the filesystem is larger than what can fit on a single tape?
 ufsdump handles this quite nicely.  Is there a similar backup program
 for zfs?  Or a general tape management program that can take data from
 a stream and split it across tapes reliably with appropriate headers
 to ease tape management and restore?

for now you could send snapshots to files and a file hierarchy on a  
SAM-QFS archive .. then you've got all the feature functionality there  
to be able to proactively back up the snapshots and possibly segment  
them if they're big enough (non-shared-qfs - might make sense if  
you've got multiple drives you want to take advantage of) .. I believe  
the goal is to provide this sort of functionality through a DMAPI HSM  
with ADM at some point in the near future:
http://opensolaris.org/os/project/adm/
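
A hedged sketch of the send-snapshots-to-files idea, with made-up names 
(/sam/zfsdumps standing in for a directory on the SAM-QFS archive):

# zfs snapshot tank/home@20080316
# zfs send tank/home@20080316 > /sam/zfsdumps/tank_home.20080316.zfs
# zfs send -i tank/home@20080309 tank/home@20080316 > /sam/zfsdumps/tank_home.0309-0316.zfs

SAM-QFS then archives (and can segment) those dump files onto tape like any 
other data.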

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs backups to tape

2008-03-14 Thread Jonathan Loran



Carson Gaspar wrote:

Bob Friesenhahn wrote:
  

On Fri, 14 Mar 2008, Bill Shannon wrote:



What's the best way to backup a zfs filesystem to tape, where the size
of the filesystem is larger than what can fit on a single tape?
ufsdump handles this quite nicely.  Is there a similar backup program
for zfs?  Or a general tape management program that can take data from
  
Previously it was suggested on this list to use a special version of 
tar called 'star' (ftp://ftp.berlios.de/pub/star).



Suggested by the rather biased (and extremely opinionated) author of 
'star'. Who, by the way, never out-and-out admitted that star does _not_ 
support ZFS ACLs (which it doesn't).


 Sadly I don't know of any non-commercial backup solution for ZFS that 
supports ACLs. 


That is simply not true.  Legato (EMC) Networker 7.4 does a perfect job 
of capturing the ZFS ACLs.  Just to make sure, I performed a test 
recovery of a directory where we use a complicated set of NFS4-style 
ACLs, and they were preserved exactly.
Even rsync doesn't support them, due to Sun's choice to 
use their own unique ACL API.
  
I commend Sun's choice of NFS v4 ACLs.  This is the only way to ensure 
CIFS compatibility, and it is the way the industry will be moving.


Jon


--


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs backups to tape

2008-03-14 Thread Jonathan Loran



Robert Milkowski wrote:


Hello Jonathan,


Friday, March 14, 2008, 9:48:47 PM, you wrote:








Carson Gaspar wrote: 

Bob Friesenhahn wrote:  

On Fri, 14 Mar 2008, Bill Shannon wrote:  

What's the best way to backup a zfs filesystem to tape, where the size 
of the filesystem is larger than what can fit on a single tape? 
ufsdump handles this quite nicely.  Is there a similar backup program 
for zfs?  Or a general tape management program that can take data from  

Previously it was suggested on this list to use a special version of 
tar called 'star' (ftp://ftp.berlios.de/pub/star).   

Suggested by the rather biased (and extremely opinionated) author of 
'star'. Who, by the way, never out-and-out admitted that star does 
_not_ support ZFS ACLs (which it doesn't). Sadly I don't know of any 
non-commercial backup solution for ZFS that supports ACLs.



That is simply not true.  Legato (EMC) Networker 7.4 does a perfect 
job of capturing the ZFS ACL's  Just to make sure, I just performed a 
test recover of a directory where we use a complicated set of NFS4 
style ACL's, and they were preserved exactly.





If you look closely you'll see he wrote non-commercial backup 
solution.  Unless I miss something, Legato and NetBackup (another 
poster) are commercial.




Right you are.  I read his post wrong.

Networker and NetBackup are very pricey commercial packages.

Thanks

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirroring to a smaller disk

2008-03-04 Thread Jonathan Loran



Patrick Bachmann wrote:

Jonathan,

On Tue, Mar 04, 2008 at 12:37:33AM -0800, Jonathan Loran wrote:
  
 I'm not sure I follow how this would work.  



 The keyword here is thin provisioning. The sparse zvol only uses
 as much space as the actual data needs. So, if you use a sparse
 zvol, you may mirror to a smaller disk, provided you use no more
 space than is physically available to the sparse zvol.

  
I do have tons of space on  
the old array.  It's only 15% utilized, hence my original comment.  How  
does my data get into the /test/old zvol (zpool foo)?  What would I end  
 up with?



There's no zvol on foo. After detaching /test/old, you may
reconfigure your old array. At that point, foo is on a zvol on
the pool bar.
 How to get the data over depends on how your
 reconfiguration of the old array impacts the pool and vdev size.
 If it gets smaller, you cannot attach it to the pool where your
 data currently resides and have to go the send|receive route...

 Putting the zpool on a zvol permanently might not be something
 you want, as this creates some overhead, which I can't quantify, and
 you mentioned some performance issues you're already
 experiencing.
  


Well, there's the rub.  I will be reconfiguring the old array identically 
to the new one.  It will be smaller.  It's always something, isn't it?


I have to say though, this is very slick and I can see this sparse zvol 
trick will be handy in the future.  Thanks!
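
For anyone following along, the sparse-zvol shim described above would look 
roughly like this (names and size are made up):

# zfs create -s -V 4T bar/shim
# zpool attach foo c1t0d0 /dev/zvol/dsk/bar/shim

The -s makes the zvol claim no space up front, which is what lets a nominally 
smaller pool back a larger mirror half.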


Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] periodic ZFS disk accesses

2008-03-01 Thread Jonathan Edwards

On Mar 1, 2008, at 3:41 AM, Bill Shannon wrote:
 Running just plain iosnoop shows accesses to lots of files, but none
 on my zfs disk.  Using iosnoop -d c1t1d0 or iosnoop -m /export/home/shannon
 shows nothing at all.  I tried /usr/demo/dtrace/iosnoop.d too, still
 nothing.

hi Bill

this came up sometime last year .. io:::start won't work since ZFS  
doesn't call bdev_strategy() directly .. you'll want to use something  
more like zfs_read:entry, zfs_write:entry and zfs_putpage or  
zfs_getpage for mmap'd ZFS files

here's one i hacked from our discussion back then to track some  
timings on files:

  cat zfs_iotime.d

#!/usr/sbin/dtrace -s

#pragma D option quiet

zfs_write:entry,
zfs_read:entry,
zfs_putpage:entry,
zfs_getpage:entry
{
        self->ts = timestamp;
        self->filepath = args[0]->v_path;
}

zfs_write:return,
zfs_read:return,
zfs_putpage:return,
zfs_getpage:return
/self->ts && self->filepath/
{
        printf("%s on %s took %d nsecs\n", probefunc,
            stringof(self->filepath), timestamp - self->ts);
        self->ts = 0;
        self->filepath = 0;
}

---
.je

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Does a mirror increase read performance

2008-02-28 Thread Jonathan Loran

Quick question:

If I create a ZFS mirrored pool, will the read performance get a boost?  
In other words, will the data/parity be read round robin between the 
disks, or do both mirrored sets of data and parity get read off of both 
disks?  The latter case would have a CPU expense, so I would think you 
would see a slow down.

Thanks,

Jon

  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does a mirror increase read performance

2008-02-28 Thread Jonathan Loran


Roch Bourbonnais wrote:

 Le 28 févr. 08 à 20:14, Jonathan Loran a écrit :


 Quick question:

 If I create a ZFS mirrored pool, will the read performance get a boost?
 In other words, will the data/parity be read round robin between the
 disks, or do both mirrored sets of data and parity get read off of both
 disks?  The latter case would have a CPU expense, so I would think you
 would see a slow down.


 2 disks mirrored together can read data faster than a single disk.
 So  to service a read only one side of the mirror is read.

 Raid-Z parity is only read in the presence of checksum errors.
That's what I suspected, but I'm glad to get the final word on this.  BTW, I 
guess I should have said checksums instead of parity.  My bad.

Thanks,

Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does a mirror increase read performance

2008-02-28 Thread Jonathan Loran


Roch Bourbonnais wrote:

 Le 28 févr. 08 à 21:00, Jonathan Loran a écrit :



 Roch Bourbonnais wrote:

 Le 28 févr. 08 à 20:14, Jonathan Loran a écrit :


 Quick question:

 If I create a ZFS mirrored pool, will the read performance get a 
 boost?
 In other words, will the data/parity be read round robin between the
 disks, or do both mirrored sets of data and parity get read off of 
 both
 disks?  The latter case would have a CPU expense, so I would think you
 would see a slow down.


 2 disks mirrored together can read data faster than a single disk.
 So  to service a read only one side of the mirror is read.

 Raid-Z parity is only read in the presence of checksum errors.
 That's what I suspected, but I'm glad to get the final word on this.  
 BTW, I guess I should have said checksums instead of parity.  My bad.


 OK. The checksum is a different story and is stored within the 
 metadata block pointing to the data block.
 So given that to reach the data block we've already had to read the 
 metadata block, checksum validation is never the
 source of an I/O.
I really need to read those ZFS internals docs (in all my spare time ;) 
Thanks,

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-27 Thread Jonathan Edwards

On Feb 27, 2008, at 8:36 AM, Uwe Dippel wrote:
 As much as ZFS is revolutionary, it is far away from being the  
 'ultimate file system', if it doesn't know how to handle event- 
 driven snapshots (I don't like the word), backups, versioning. As  
 long as a high-level system utility needs to be invoked by a  
 scheduler for these features (CDP), and - this is relevant - *ZFS  
 does not support these functionalities essentially different from  
 FAT or UFS*, the days of ZFS are counted. Sooner or later, and I bet  
 it is sooner, someone will design a file system (hardware, software,  
 Cairo) to which the tasks of retiring files, as well as creating  
 versions of modified files, can be passed down, together with the  
 file handlles.

meh .. don't believe all the marketing hype you hear - it's good at  
what it's good at, and is a constant WIP for many of the other  
features that people would like to see .. but the one ring to rule  
them all - not quite yet ..

as for the CDP issue - i believe the event driving would really have  
to happen below ZFS at the vnode or znode layer .. keep in mind that  
with the ZPL we're still dealing with 30+ year old structures and  
methods (which is fine btw) in the VFS/Vnode layers .. a couple of  
areas i would look at (that i haven't seen mentioned in this  
discussion) might be:

- fop_vnevent .. or the equivalent (if we have one yet) for a znode
- filesystem - door interface for event handling
- auditing

if you look at what some of the other vendors (eg: apple/timemachine)  
are doing - it's essentially a tally of file change events that get  
dumped into a database and rolled up at some point .. if you plan on  
taking more immediate action on the file changes then i believe that  
you'll run into latency (race) issues for synchronous semantics

anyhow - just a thought from another who is constantly learning (being  
corrected, learning some more, more correction, etc ..)

---
.je
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can ZFS be event-driven or not?

2008-02-25 Thread Jonathan Loran


David Magda wrote:
 On Feb 24, 2008, at 01:49, Jonathan Loran wrote:

 In some circles, CDP is big business. It would be a great ZFS offering.

 ZFS doesn't have it built-in, but AVS may be an option in some cases:

 http://opensolaris.org/os/project/avs/

Point in time copy (as AVS offers) is not the same thing as CDP.  When 
you snapshot data as in point in time copies, you predict the future, 
knowing the time slice at which your data will be needed.  Continuous 
data protection is based on the premise that you don't have a clue ahead 
of time which point in time you want to recover to.  Essentially, for 
CDP, you need to save every storage block that has ever been written, so 
you can put them back in place if you so desire. 

Anyone else on the list think it is worthwhile adding CDP to the ZFS 
list of capabilities?  It causes space management issues, but it's an 
interesting, useful idea.

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which DTrace provider to use

2008-02-14 Thread Jonathan Loran



Marion Hakanson wrote:

[EMAIL PROTECTED] said:
  

It's not that old.  It's a Supermicro system with a 3ware 9650SE-8LP.
Open-E iSCSI-R3 DOM module.  The system is plenty fast.  I can pretty
handily pull 120MB/sec from it, and write at over 100MB/sec.  It falls  apart
more on random I/O.  The server/initiator side is a T2000 with  Solaris 10u4.
 It never sees over 25% CPU, ever.  Oh yeah, and two 1GB  network links to
the SAN 
. . .

 My opinion is, if everything slowed down evenly when the array got really
 loaded up, users wouldn't mind or notice much.  But when every 20th or so
 read/write gets delayed by 10s of seconds, the users start to line up at
 my door. 



Hmm, I have no experience with iSCSI yet.  But behavior of our T2000
file/NFS server connected via 2Gbit fiber channel SAN is exactly as
you describe when our HDS SATA array gets behind.  Access to other
ZFS pools remains unaffected, but any access to the busy pool just
hangs.  Some Oracle apps on NFS clients die due to excessive delays.

In our case, this old HDS array's SATA shelves have a very limited queue
depth (four per RAID controller) in the back end loop, plus every write
is hit with the added overhead of an in-array read-back verification.
Maybe your iSCSI situation injects enough latency at higher loads to
cause something like our FC queue limitations.

  
The iSCSI array has 2GB RAM as a cache.  Writes to cache complete very 
fast.  I'm not sure, but would love to get some metering going on this 
guy to find out whether it's really the reads that cause the issue.  It 
seems like, though I'm not totally sure yet, that heavy random read loads 
are when things break down.  I'll pass on anything I find to the list, 
'cause I'm sure there are a lot of folks with ZFS on a SAN.  The 
flexibility of having the SAN is still seductive, even though the 
benefits to ZFS performance for direct-attached storage are pulling us 
the other way. 


Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which DTrace provider to use

2008-02-14 Thread Jonathan Loran

Hi Brendon,

I have been using iopending, though I'm not sure how to interpret it.  
Is it true the column on the left is how deep in the queue requests are, 
and then the histogram represents how many requests there are at each 
queue depth?  Then I would guess if there's lots of requests with high 
queue depth, that's bad.  Once in a while, I see some pretty long queues, 
but they only last a second, and then things even right out again.

I'll try your disktime.d script below and the other checks you 
recommend.  May have more questions to follow.  Thanks!

Jon

Brendan Gregg - Sun Microsystems wrote:
 G'Day Jon,

 For disk layer metrics, you could try Disk/iopending from the DTraceToolkit
 to check how saturated the disks become with requests (which answers that
 question with much higher definition than iostat).  I'd also run disktime.d,
 which should be in the next DTraceToolkit release (it's pretty obvious),
 and is included below.  disktime.d measures disk delta times - time from
 request to completion.

 #!/usr/sbin/dtrace -s

 #pragma D option quiet
 #pragma D option dynvarsize=16m

 BEGIN
 {
         trace("Tracing... Hit Ctrl-C to end.\n");
 }

 io:::start
 {
         start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
 }

 io:::done
 /start[args[0]->b_edev, args[0]->b_blkno]/
 {
         this->delta = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
         @[args[1]->dev_statname] = quantize(this->delta);
 }

 The iopattern script will also give you a measure of random vs sequential
 I/O - which would be interesting to see.

 ...

 For latencies in ZFS (such as ZIO pipeline latencies), we don't have a stable
 provider yet.  It is possible to write fbt based scripts to do this - but
 they'll only work on a particular version of Solaris.

 fsinfo would be a good provider to hit up for the VFS layer.

 I'd also check syscall latencies - it might be too obvious, but it can be
 worth checking (eg, if you discover those long latencies are only on the
 open syscall)...

 Brendan


   

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which DTrace provider to use

2008-02-14 Thread Jonathan Loran


[EMAIL PROTECTED] wrote:

On Tue, Feb 12, 2008 at 10:21:44PM -0800, Jonathan Loran wrote:
  

Thanks for any help anyone can offer.



I have faced a similar problem (although not exactly the same) and was going to
monitor the disk queue with dtrace but couldn't find any docs/urls about it.
Finally asked Chris Gerhard for help. He partially answered via his
blog: http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io

Maybe it helps you.

Regards
przemol

  

This is perfect.  Thank you.

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which DTrace provider to use

2008-02-13 Thread Jonathan Loran


Marion Hakanson wrote:

[EMAIL PROTECTED] said:
  

...

 I know, I know, I should have gone with a JBOD setup, but it's too late for
 that in this iteration of this server.  When we set this up, I had the gear
 already, and it's not in my budget to get new stuff right now. 



What kind of array are you seeing this problem with?  It sounds very much
like our experience here with a 3-yr-old HDS ATA array.  


It's not that old.  It's a Supermicro system with a 3ware 9650SE-8LP.
Open-E iSCSI-R3 DOM module.  The system is plenty fast.  I can pretty 
handily pull 120MB/sec from it, and write at over 100MB/sec.  It falls 
apart more on random I/O.  The server/initiator side is a T2000 with 
Solaris 10u4.  It never sees over 25% CPU, ever.  Oh yeah, and two 1GB 
network links to the SAN



When the crunch
came here, I didn't know enough dtrace to help, but I threw the following
into crontab to run every five minutes (24x7), and it at least collected
the info I needed to see what LUN/filesystem was busying things out.

Way crude, but effective enough:

  /bin/ksh -c "date && mpstat 2 20 && iostat -xn 2 20 && \
 fsstat $(zfs list -H -o mountpoint -t filesystem | egrep '^/') 2 20 && \
 vmstat 2 20" >> /var/tmp/iostats.log 2>&1 </dev/null

A quick scan using egrep could pull out trouble spots;  E.g. the following
would identify iostat lines that showed 90-100% busy:

  egrep '^Sun |^Mon |^Tue |^Wed |^Thu |^Fri |^Sat | 1[0-9][0-9] c6|  9[0-9] c6' \
      /var/tmp/iostats.log
  


Yeah, I have some traditional *stat utilities running.  If I 
capture more than a second at a time, things look good.  I was hoping to 
get a real distribution of service times, to catch the outliers that 
don't get absorbed into the average.  That's why I wanted to use dtrace.


My opinion is, if everything slowed down evenly when the array got really 
loaded up, users wouldn't mind or notice much.  But when every 20th or 
so read/write gets delayed by 10s of seconds, the users start to line 
up at my door.


Thanks for the tips.

Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Which DTrace provider to use

2008-02-12 Thread Jonathan Loran

Hi List,

I'm wondering if one of you expert DTrace guru's can help me.  I want to 
write a DTrace script to print out a a histogram of how long IO requests 
sit in the service queue.  I can output the results with the quantize 
method.  I'm not sure which provider I should be using for this.  Does 
anyone know?  I can easily adapt one of the DTrace Toolkit routines for 
this, if I can find the provider.

I'll also throw out the problem I'm trying to meter.  We are using ZFS 
on a large SAN array (4TB).  The pool on this array serves up a lot of 
users, (250 home file systems/directories) and also /usr/local and other 
OTS software.  It works fine most of the time, but then gets overloaded 
during busy periods.  I'm going to reconfigure the array to help with 
this, but I sure would love to have some metrics to know how big a 
difference my tweaks are making.  Basically, the problem users 
experience when the load shoots up is huge latencies.  An ls on a 
non-cached directory, which usually is instantaneous, will take 20, 30, 
40 seconds or more.  Then when the storage array catches up, things get 
better.  My clients are not happy campers. 

I know, I know, I should have gone with a JBOD setup, but it's too late 
for that in this iteration of this server.  When we set this up, I had the 
gear already, and it's not in my budget to get new stuff right now.

Thanks for any help anyone can offer.

Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris, ZFS and Hardware RAID,

2008-02-10 Thread Jonathan Loran



Anton B. Rang wrote:

Careful here.  If your workload is unpredictable, RAID 6 (and RAID 5)
for that matter will break down under highly randomized write loads. 



Oh?  What precisely do you mean by break down?  RAID 5's write performance is 
well-understood and it's used successfully in many installations for random write loads. 
Clearly if you need the very highest performance from a given amount of hardware, RAID 1 
will perform better for random writes, but RAID 5 can be quite good. (RAID 6 is slightly 
worse, since a random write requires access to 3 disks instead of 2.)

There are certainly bad implementations out there, but in general RAID 5 is a 
reasonable choice for many random-access workloads.

(For those who haven't been paying attention, note that RAIDZ and RAIDZ2 are 
closer to RAID 3 in implementation and performance than to RAID 5; neither is a 
good choice for random-write workloads.)
 
 
  


In my testing, if you have a lot of IO queues spread widely across your 
array, you do better with RAID 1 or 10. RAIDZ and RAIDZ2 are much worse, 
yes. If you add large transfers on top of this, which happen in 
multi-purpose pools, small reads can get starved out. The throughput 
curve (IO rate vs. queues*size) with RAID 5-6 flattens out a lot faster 
than with RAID 10.


The scoop is this.  On multipurpose pools, zfs often takes the place of 
many individual file systems.  Those had the advantage of separating 
IO, and some tuning was also available to each file system.  My 
experience, or should I say theory, is that RAID 5/6 hardware-accelerated 
arrays work pretty well with more predictable IO patterns.  Sometimes even 
great.  I use RAID 5/6 a lot for these.


Don't get me wrong, I love zfs, I ain't going back.  Don't start flaming 
me, I just think we have to be aware of the limitations and engineer our 
storage carefully.  I made the mistake recently of putting too much faith 
in hardware RAID 6, and as our user load grew, the performance went 
through the floor faster than I thought it would.


My 2 cents.

Jon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OpenSolaris, ZFS and Hardware RAID, a recipe for success?

2008-02-09 Thread Jonathan Loran



Richard Elling wrote:

Nick wrote:
  


Using the RAID cards capability for RAID6 sounds attractive?
  



Assuming the card works well with Solaris, this sounds like a
reasonable solution.

  
Careful here.  If your workload is unpredictable, RAID 6 (and RAID 5, 
for that matter) will break down under highly randomized write loads.  
There's a lot of trickery done with hardware RAID cards that can do some 
read-ahead caching magic, improving the read-paritycalc-paritycalc-write 
cycle, but you can't beat out the laws of physics.  If you do *know* 
you'll be streaming more than writing small random numbers of blocks, 
RAID 6 hardware can work.  But with transaction-like loads, performance 
will suck. 


Jon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-02-02 Thread Jonathan Loran


This is true, but I think it's the testing bit that worries me.  It's 
hard to lab out, and fully test, an equivalent setup that has 350 active 
clients pounding on it to test usability and stability.  One of our 
boxes has a boatload of special software running and various tweaks 
that also would need to be validated.  In other words, upgrades have 
tended to be painful. 

We don't really have any Open Solaris experience yet, and we've more or 
less trusted Sun to wring out the issues to minimize the problems, and 
make these upgrades smoother.  Of course, the irony is that the 
requirement for this very stability is why we haven't seen the features 
in the ZFS code we need in Solaris 10.


Thanks,

Jon

Mike Gerdts wrote:

On Jan 30, 2008 2:27 PM, Jonathan Loran [EMAIL PROTECTED] wrote:
  

Before ranting any more, I'll do the test of disabling the ZIL.  We may
have to build out these systems with Open Solaris, but that will be hard
as they are in production.  I would have to install the new OS on test
systems and swap out the drives during scheduled down time.  Ouch.



Live upgrade can be very helpful here, either for upgrading or
applying a flash archive.  Once you are comfortable that Nevada
performs like you want, you could prep the new OS on alternate slices
or broken mirrors.  Activating the updated OS should take only a few
seconds longer than a standard init 6.  Failback is similarly easy.

I can't remember the last time I swapped physical drives to minimize
the outage during an upgrade.

  


--


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-31 Thread Jonathan Loran

Guanghui Wang wrote:
 I don't know when U5 or U6 is coming, so I just set zfs_nocacheflush=1 in 
 /etc/system, and the performance speeds up like zil_disable=1, and that's 
 safer for the filesystem.

 The separate log feature is not in U4; the NFS performance on ZFS will be 
 too slow when you do not set zfs_nocacheflush=1 in your /etc/system file.
   

Yeah, on one of my systems, I was able to set the zfs_nocacheflush=1, 
but the other machine that's suffering isn't patched up enough to use 
it.  I have to schedule down time to patch it up.
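
For reference, on releases that have the tunable, the /etc/system entry in 
question looks like this (reboot required):

* only safe when the array cache is battery-backed
set zfs:zfs_nocacheflush = 1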

I don't have hard numbers yet, but the seat-of-the-pants impression is 
that stopping cache flushes has helped.  On our SAN arrays, I thought 
the settings I chose would have had them ignore cache flushing, but 
apparently not.  Thanks everyone for the help.  I still look forward to 
using fast SSD for the ZIL when it comes to Solaris 10 U? as a preferred 
method.

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-30 Thread Jonathan Loran


Neil Perrin wrote:


 Roch - PAE wrote:
 Jonathan Loran writes:
 Is it true that Solaris 10 u4 does not have any of the nice ZIL 
 controls   that exist in the various recent Open Solaris flavors?  I 
 would like to   move my ZIL to solid state storage, but I fear I 
 can't do it until I   have another update.  Heck, I would be happy 
 to just be able to turn the   ZIL off to see how my NFS on ZFS 
 performance is affected before spending   the $'s.  Anyone know when 
 we will see this in Solaris 10?
  
 You can certainly turn it off with any release (Jim's link).

 It's true that S10u4 does not have the Separate Intent Log to allow 
 using an SSD for ZIL blocks. I believe S10U5 will
 have that feature.

Don't think we can live with this.  Thanks
 Unfortunately it will not. A lot of ZFS fixes and features
 that had existed for a while will not be in U5 (for reasons I
 can't go into here). They should be in S10U6...

 Neil.
I feel like we're being hung out to dry here.  I've got 70TB on 9 
various Solaris 10 u4 servers, with different data sets.  All of these 
are NFS servers.  Two servers have a ton of small files, with a lot of 
read and write updating, and NFS performance on these is abysmal.  ZFS 
is installed on SAN arrays (my first mistake).  I will test by 
disabling the ZIL, but if it turns out the ZIL needs to be on a separate 
device, we're hosed. 
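
For the test itself, on releases of that vintage the ZIL was commonly disabled 
with the zil_disable tunable, e.g. an /etc/system entry like the one below - 
understanding that it trades away synchronous-write guarantees for NFS 
clients, so it is a diagnostic, not a fix:

* for testing only
set zfs:zil_disable = 1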

Before ranting any more, I'll do the test of disabling the ZIL.  We may 
have to build out these systems with Open Solaris, but that will be hard 
as they are in production.  I would have to install the new OS on test 
systems and swap out the drives during scheduled down time.  Ouch.

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-30 Thread Jonathan Loran

Vincent Fox wrote:
 Are you already running with zfs_nocacheflush=1?   We have SAN arrays with 
 dual battery-backed controllers for the cache, so we definitely have this set 
 on all our production systems.  It makes a big difference for us.

   
No, we're not using zfs_nocacheflush=1, but our SAN arrays are set 
to cache all writebacks, so it shouldn't be needed.  I may test this, if 
I get the chance to reboot one of the servers, but I'll bet the storage 
arrays are working correctly.

 As I said before I don't see the catastrophe in disabling ZIL though.

   
No catastrophe, just a potential mess.

 We actually run our production Cyrus mail servers using failover servers so 
 our downtime is typically just the small interval to switch active and idle 
 nodes anyhow.  We did this mainly for patching purposes.
   
Wish we could afford such replication.  Poor EDU environment here, I'm 
afraid.
 But we toyed with the idea of running OpenSolaris on them, then just 
 upgrading the idle node to new OpenSolaris image every month using Jumpstart 
 and switching to it.  Anything goes wrong switch back to the other node.

 What we ended up doing, for political reasons, was putting the squeeze on our 
 Sun reps and getting a 10u4 kernel spin patch with... what did they call it?  
 Oh yeah, a big wad of ZFS fixes.  So this ends up being a huge PITA because 
 for the next 6 months to a year we are tied to getting any kernel patches 
 through this other channel rather than the usual way.   But it does work for 
 us, so there you are.
   
Mmmm, for us, Open Solaris may be easier.  I mainly was after stability, 
to be honest.  Our ongoing experience with bleeding-edge Linux is 
painful at times, and on our big iron, I want them to just work.  But if 
they're so slow, they're not really working right, are they?  Sigh...
 Given my choice I'd go with OpenSolaris, but that's a hard sell for datacenter 
 management types.  I think it's no big deal in a production shop with good 
 JumpStart and CFengine setups, where any host should be rebuildable from 
 scratch in a matter of hours.  Good luck.
  
   
True, I'll think about that going forward.  Thanks,

Jon
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-29 Thread Jonathan Loran

Is it true that Solaris 10 u4 does not have any of the nice ZIL controls 
that exist in the various recent Open Solaris flavors?  I would like to 
move my ZIL to solid state storage, but I fear I can't do it until I 
have another update.  Heck, I would be happy to just be able to turn the 
ZIL off to see how my NFS on ZFS performance is affected before spending 
the $'s.  Anyone know when we will see this in Solaris 10?
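
For reference, on builds that do have the separate-log feature, the SSD is 
added with something like this (device name made up):

# zpool add tank log c3t0d0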

Thanks,

Jon

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue fixing ZFS corruption

2008-01-23 Thread Jonathan Stewart
Jeff Bonwick wrote:
 The Silicon Image 3114 controller is known to corrupt data.
 Google for silicon image 3114 corruption to get a flavor.
 I'd suggest getting your data onto different h/w, quickly.

I'll second this; the 3114 is a piece of junk if you value your data.  I 
bought a 4-port LSI SAS card (yes, a bit pricey) and have had 0 problems 
since, and hot swap actually works.  I never tried hot swap with the 3114; I 
had just never seen it actually working before, so I was quite pleasantly 
surprised.

Jonathan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue fixing ZFS corruption

2008-01-23 Thread Jonathan Stewart
Bertrand Sirodot wrote:
 Hi,
 
 if I want to stay with SATA and not go to SAS, do you have a 
 recommendation on which SATA controller is actually supported by 
 Solaris?

SAS controllers do support SATA drives actually (not the other way
around though). I'm running SATA drives on mine without a problem. As 
far as which ones are supported by Solaris someone else will have to 
answer as I actually use ZFS on FreeBSD.  SATA controllers are usually 
less expensive than SAS controllers of course.

Jonathan

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] hardware for zfs home storage

2008-01-14 Thread Jonathan Loran

Alex,

I imagine that you've spent/will spend dozens or perhaps hundreds of 
hours ripping your MP3s.  Don't even think about skipping backups.  
Budget in the cost of backups, preferably off-site backups, even 
something you can carry to work and lock in your desk.  Buy a four-drive 
USB enclosure and four 1TB drives.  That would work.  As reliable as zfs 
is, there's no technological fix for natural disasters or human error.  
My 2 cents.
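
One hedged way to do that with ZFS itself, assuming the USB enclosure holds a 
second pool named backup:

# zfs snapshot tank/music@offsite-20080114
# zfs send tank/music@offsite-20080114 | zfs receive backup/music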

Jon

Alex wrote:
 Hi,

 I'm sure this has been asked many times and though a quick search didn't 
 reveal anything illuminating, I'll post regardless.

 I am looking to make a storage system available on my home network. I need 
 storage space in the order of terabytes as I have a growing iTunes collection 
 and tons of MP3s that I converted from vinyl. At this time I am unsure of the 
 growth rate, but I suppose it isn't unreasonable to look for 4TB usable 
 storage. Since I will not be backing this up, I think I want RAIDZ2.

 Since this is for home use, I don't want to spend an inordinate amount of 
 money. I did look at the cheaper STK arrays, but they're more than what I 
 want to pay, so I am thinking that puts me in the white-box market. Power 
 consumption would be nice to keep low also.

 I don't really care if it's external or internal disks. Even though I don't 
 want to get completely skinned over the money, I also don't want to buy 
 something that is unreliable.

 I am very interested as to your thoughts and experiences on this. E.g. what 
 to buy, what to stay away from.

 Thanks in advance!
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

-- 


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2008-01-03 Thread Jonathan Loran



Joerg Schilling wrote:

Carsten Bormann [EMAIL PROTECTED] wrote:

  

On Dec 29 2007, at 08:33, Jonathan Loran wrote:



We snapshot the file as it exists at the time of
the mv in the old file system until all referring file handles are
closed, then destroy the single file snap.  I know, not easy to
implement, but that is the correct behavior, I believe.
  

Exactly.

Note that apart from open descriptors, there may be other links to the  
file on the old FS; it has to be clear whether writes to the file in  
the new FS change the file in the old FS or not.  I'd rather say they  
shouldn't.
Yes, this would be different from the normal rename(2) semantics with  
respect to multiply linked files.  And yes, the semantics of link(2)  
should also be consistent with this.



This is an interesting problem. Your proposal would imply that a file
may have different identities in different filesystems:

-   different st_dev

-   different st_ino

-   different link count

This cannot be implemented with a single set of inode data anymore.

Well, it is not impossible, as my WOFS (mentioned before) implements
hardlinks via inode-relative symlinks. In order to allow this, a file
would need a storage-pool-global serial number that allows matching different
inode sets for the file.

Jörg

  


At first, as I mentioned in my earlier email, I was thinking we needed 
to emulate the cross-fs rename/link/etc behavior as it is currently 
implemented, where a file appears to actually be copied.  But now I'm 
not so sure. 

In Unixland, the ideal has always been to have the whole file system, 
kit and caboodle, singly rooted at /.  Heck, even devices are in the 
file system.  Of course, reality required that, programmatically, we 
be aware of which file system our cwd is in.  At a minimum, 
it's returned in our various stat structs (st_dev). 

I can see I'm getting long-winded, but I'm thinking: what is the value 
of having different behavior for a cross-zfs file move within the same 
pool than for a move between directories?  I'm not addressing the previous 
discussion about how to treat file handles, etc., but more about sharing 
open file blocks, linked across zfs boundaries, before and after such a mv. 

I think the test is this: can we find a scenario where something would 
break if we did share the file blocks across zfs boundaries after such a 
mv?  For every example I've been able to think of, if I ask the 
question "what if I moved the file from one directory to the other, 
instead of across zfs boundaries, would it have been different?", the 
answer has been no.  Comments please. 


Jon

--


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

2007-12-30 Thread Jonathan Loran



Joerg Schilling wrote:

Jonathan Edwards [EMAIL PROTECTED] wrote:
  
since in the current implementation a mv between filesystems would  
have to assign new st_ino values (fsids in NFS should also be  
different), all you should need to do is assign new block pointers in  
the new side of the filesystem .. that would also be handy for cp as  
well



If the rename kept the blocks from the old file for the new name,
then the new file would inherit the identity of the old file.

If you implemented the rename in a way that caused new values for
st_dev/st_ino to be returned from a fstat(2) call, then this could confuse 
programs.


If you instead set st_nlink for the open file to 0, then this would be OK 
from the viewpoint of the old file but not OK from the view of the whole 
system. How would you implement writes into the open fd from the old name?


Jörg

  
A more concise way of putting what I'm saying: a traditional mv between two 
file systems will create two copies of the data if the source file is open.  At a 
minimum, this will have to be emulated or things will break.  Since zfs 
file systems are really distinct Unix file systems, we have to deal 
with the semantics.  It's not just a path change as in a directory mv.


Jon

--


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 [EMAIL PROTECTED]
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

