Re: [zfs-discuss] triple-parity: RAID-Z3
Hey Bob,

> MTTDL analysis shows that given normal environmental conditions, the
> MTTDL of RAID-Z2 is already much longer than the life of the computer
> or the attendant human. Of course sometimes one encounters unusual
> conditions where additional redundancy is desired.

To what analysis are you referring? Today the absolute fastest you can
resilver a 1TB drive is about 4 hours. Real-world speeds might be half
that. In 2010 we'll have 3TB drives, meaning it may take a full day to
resilver. The odds of hitting a latent bit error are already reasonably
high, especially with a large pool that's infrequently scrubbed. What,
then, are the odds of a second drive failing in the 24 hours it takes
to resilver?

> I do think that it is worthwhile to be able to add another parity
> disk to an existing raidz vdev, but I don't know how much work that
> entails.

It entails a bunch of work:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Matt Ahrens is working on a key component, after which it should all be
possible.

> Zfs development seems to be overwhelmed with marketing-driven
> requirements lately and it is time to get back to brass tacks and
> make sure that the parts already developed are truly enterprise-grade.

While I don't disagree that the focus for ZFS should be ensuring
enterprise-class reliability and performance, let me assure you that
requirements are driven by the market and not by marketing.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
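To put rough numbers on that exposure argument, here is a back-of-envelope
sketch; the failure rate, bit error rate, and resilver window below are
illustrative assumptions, not measured values:

  # Rough, illustrative math for the resilver-exposure argument above.
  # All inputs are assumptions for the sake of example, not measurements.

  drives = 8                  # surviving drives in the raidz group
  afr = 0.05                  # assumed 5% annual failure rate per drive
  resilver_hours = 24         # assumed resilver window for a 3TB drive
  ber = 1e-15                 # assumed unrecoverable-error rate per bit
  drive_bits = 3e12 * 8       # bits read per 3TB drive during resilver

  # Probability that at least one more drive fails during the window.
  p_fail_hour = afr / (365 * 24)
  p_second_failure = 1 - (1 - p_fail_hour * resilver_hours) ** drives

  # Expected latent bit errors encountered while reconstructing.
  expected_bit_errors = ber * drive_bits * drives

  print(p_second_failure)     # ~0.1% chance per resilver in this example
  print(expected_bit_errors)  # ~0.2 bad bits expected across the group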
Re: [zfs-discuss] triple-parity: RAID-Z3
> > which gap? 'RAID-Z should mind the gap on writes'?
>
> I believe this is in reference to the RAID-5 write hole, described
> here:
>
>   http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance

It's not.

> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either. Clarification?

I'm planning to write a blog post describing this, but the basic
problem is that RAID-Z, by virtue of supporting variable stripe writes
(the insight that allows us to avoid the RAID-5 write hole), must round
the number of sectors up to a multiple of nparity+1. This means that we
may have sectors that are effectively skipped. ZFS generally lays down
data in large contiguous streams, but these skipped sectors can stymie
both ZFS's write aggregation and the hard drive's ability to group I/Os
and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key
insight there is that if we're going to read 64K, say, with a 512-byte
hole in the middle, we might as well do one big read rather than two
smaller reads, and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier, since we can't just
blithely write over gaps as those might contain live data on the disk.
To solve this we push the knowledge of those skipped sectors down to
the I/O aggregation layer in the form of 'optional' I/Os, purely for
the purpose of coalescing writes into larger chunks.

I hope that's clear; if it's not, stay tuned for the aforementioned
blog post.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
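To illustrate the rounding rule, here is a small sketch of how skip
sectors arise; it mirrors the rule described above rather than the
actual ZFS allocation code:

  # Illustration of RAID-Z skip sectors: allocation sizes are rounded
  # up to a multiple of nparity+1 sectors, so small writes leave gaps.
  # This mirrors the rule described above; it is not the real ZFS code.

  def raidz_asize(data_sectors, nparity):
      """Sectors allocated for a write of data_sectors data sectors."""
      total = data_sectors + nparity          # data plus parity sectors
      rounded = -(-total // (nparity + 1)) * (nparity + 1)  # round up
      return rounded, rounded - total         # (allocated, skipped)

  for nparity in (1, 2, 3):
      for data in (1, 2, 3, 8):
          alloc, skip = raidz_asize(data, nparity)
          print(f"raidz{nparity}: {data} data -> {alloc} sectors, "
                f"{skip} skipped")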
Re: [zfs-discuss] triple-parity: RAID-Z3
> Don't hear about triple-parity RAID that often:
>
>   Author: Adam Leventhal
>   Repository: /hg/onnv/onnv-gate
>   Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
>   Total changesets: 1
>   Log message:
>   6854612 triple-parity RAID-Z
>
>   http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html
>   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612
>
> (Via Blog O' Matty.)
>
> Would be curious to see performance characteristics.

I just blogged about triple-parity RAID-Z (raidz3):

  http://blogs.sun.com/ahl/entry/triple_parity_raid_z

As for performance, on the system I was using (a max config Sun Storage
7410), I saw about a 25% improvement to 1GB/s for a streaming write
workload. YMMV, but I'd be interested in hearing your results.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] triple-parity: RAID-Z3
> > Don't hear about triple-parity RAID that often:
>
> I agree completely. In fact, I have wondered (probably in these
> forums) why we don't bite the bullet and make a generic raidzN, where
> N is any number >= 0.

I agree, but raidzN isn't simple to implement and it's potentially
difficult to get it to perform well. That said, it's something I intend
to bring to ZFS in the next year or so.

> If memory serves, the second parity is calculated using Reed-Solomon,
> which implies that any number of parity devices is possible.

True; it's a degenerate case.

> In fact, get rid of mirroring, because it clearly is a variant of
> raidz with two devices. Want three-way mirroring? Call that raidz2
> with three devices. The truth is that a generic raidzN would roll up
> everything: striping, mirroring, parity raid, double parity, etc.
> into a single format with one parameter.

That's an interesting thought, but there are some advantages to calling
out mirroring, for example, as its own vdev type. As has been pointed
out, reading from either side of a mirror involves no computation,
whereas reading from a RAID-Z 1+2, for example, would involve more
computation. This would complicate the calculus of balancing read
operations over the mirror devices.

> Let's not stop there, though. Once we have any number of parity
> devices, why can't I add a parity device to an array? That should be
> simple enough with a scrub to set the parity. In fact, what is to
> stop me from removing a parity device? Once again, I think the code
> would make this rather easy.

With RAID-Z, stripes can be of variable width, meaning that, say, a
single row in a 4+2 configuration might have two stripes of 1+2. In
other words, there might not be enough space in the new parity device.
I did write up the steps that would be needed to support RAID-Z
expansion; you can find it here:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

> Ok, back to the real world. The one downside to triple parity is that
> I recall the code discovered the corrupt block by excluding it from
> the stripe, reconstructing the stripe and comparing that with the
> checksum. In other words, for a given cost of X to compute a stripe
> and a number P of corrupt blocks, the cost of reading a stripe is
> approximately X^P. More corrupt blocks would radically slow down the
> system. With raidz2, the maximum number of corrupt blocks would be
> two, putting a cap on how costly the read can be.

Computing the additional parity of triple-parity RAID-Z is slightly
more expensive, but not much -- it's just bitwise operations.
Recovering from a read failure is identical (and performs identically)
to raidz1 or raidz2 until you actually have sustained three failures.
In that case, performance is slower as more computation is involved --
but aren't you just happy to get your data back?

If there is silent data corruption, then and only then can you
encounter the O(n^3) algorithm that you alluded to, but only as a last
resort. If we don't know which drives failed, we try to reconstruct
your data by assuming that one drive, then two drives, then three
drives are returning bad data. For raidz1, this was a linear operation;
for raidz2, quadratic; now raidz3 is N-cubed. There's really no way
around it. Fortunately, with proper scrubbing, encountering data
corruption in one stripe on three different drives is highly unlikely.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
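To make the linear/quadratic/cubic point concrete, here is a sketch of
that last-resort search; reconstruct() and checksum_ok() are
hypothetical stand-ins for ZFS's actual reconstruction and
block-checksum routines:

  # Sketch of the combinatorial search described above for silent data
  # corruption: with no failed-drive information, try reconstructing
  # with every set of 1, then 2, ... up to nparity drives assumed bad,
  # and accept the first reconstruction whose checksum verifies.
  from itertools import combinations

  def recover(drives, nparity, reconstruct, checksum_ok):
      for nbad in range(1, nparity + 1):          # 1, 2, then 3 drives
          for bad in combinations(range(len(drives)), nbad):
              data = reconstruct(drives, bad)     # assume 'bad' are bad
              if checksum_ok(data):
                  return data, bad
      return None, None   # unrecoverable: more than nparity drives bad

  # For n drives the loop tries C(n,1) + C(n,2) + C(n,3) combinations,
  # hence the linear / quadratic / cubic costs mentioned above.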
Re: [zfs-discuss] zfs IO scheduler
tester writes:

> Hello,
>
> Trying to understand the ZFS IO scheduler; because of its async
> nature it is not very apparent. Can someone give a short explanation
> for each of these stack traces and for their frequency? This is the
> command:
>
>   dd if=/dev/zero of=/test/test1/trash count=1 bs=1024k; sync
>
> No other IO is happening to the test pool. The OS is on a zfs pool
> (rpool). I don't see any zio_vdev_io_start in any of the function
> stacks, any idea why?

I assume because of tail calls. If you trace zio_vdev_io_start() you
see it being called, but (looking at the source) it then tail-calls
vdev_mirror_io_start() and so disappears from the stack.

-r

> dtrace -n 'io:::start { @a[stack()] = count(); }'
> dtrace: description 'io:::start ' matched 6 probes
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     20
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_queue_io_done+0x84
>   zfs`vdev_disk_io_done+0x4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     31
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_mirror_io_start+0x1b4
>   zfs`zio_execute+0x74
>   zfs`vdev_mirror_io_start+0x1b4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     34
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_mirror_io_start+0x1b4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     45
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_queue_io_done+0x9c
>   zfs`vdev_disk_io_done+0x4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     53
Re: [zfs-discuss] Speeding up resilver on x4500
Stuart Anderson writes:

> On Jun 21, 2009, at 10:21 PM, Nicholas Lee wrote:
>
>> On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson
>> <ander...@ligo.caltech.edu> wrote:
>>
>>> However, it is a bit disconcerting to have to run with reduced data
>>> protection for an entire week. While I am certainly not going back
>>> to UFS, it seems like it should be at least theoretically possible
>>> to do this several orders of magnitude faster, e.g., what if every
>>> block on the replacement disk had its RAIDZ2 data recomputed from
>>> the degraded [...]
>>
>> Maybe this is also saying that for large disk sets a single RAIDZ2
>> provides a false sense of security.
>
> This configuration is with 3 large RAIDZ2 devices, but I have more
> recently been building thumper/thor systems with a larger number of
> smaller RAIDZ2's.

Thanks. 170M small files reconstructed in 1 week over 3 raid-z groups
is 93 files/sec per raid-z group. That is not too far from expectations
for 7.2K RPM drives (were they?).

I don't see orders of magnitude improvements on this; however, this CR
(integrated in snv_109) might give the workload a boost:

  6801507 ZFS read aggregation should not mind the gap

This will enable more read aggregation to occur during a resilver. We
could also contemplate enabling the vdev prefetch code for data during
a resilver. Otherwise, limiting the # of small objects per raid-z
group, as you're doing now, seems wise to me.

-r
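For the record, the arithmetic behind Roch's estimate:

  # Checking the resilver-rate arithmetic quoted above.
  files = 170e6               # files resilvered
  seconds = 7 * 24 * 3600     # one week
  groups = 3                  # raid-z groups resilvering in parallel

  per_group = files / seconds / groups
  print(round(per_group))     # ~94 files/sec per raid-z group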
Re: [zfs-discuss] zio_assess
zio_assess went away with SPA 3.0:

  6754011 SPA 3.0: lock breakup, i/o pipeline refactoring, device
  failure handling

You now have:

  zio_vdev_io_assess(zio_t *zio)

Yes, it's one of the last stages of the I/O pipeline (see zio_impl.h).

-r

tester writes:

> Hi,
>
> What does zio_assess do? Is it a stage of the pipeline? I see quite a
> few of these stacks in a 5-second window. I tried to search
> src.opensolaris and did not find any reference. Thanks for any help.
>
>   zfs`zio_assess+0x58
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     1604
>
> Thanks
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Thanks for the feedback, George. I hope we get the tools soon.

At home I have now blown the ZFS pool away and am creating a HW RAID-5
set :-( Hopefully in the future, when the tools are there, I will
return to ZFS.

To All: The ECC discussion was very interesting as I had never
considered it that way! I will be buying ECC memory for my home
machine!!

Again, many many thanks to all who have replied; it has been a very
interesting and informative discussion for me.

Best regards,

Russel
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Hi.

Good to know! But how do we deal with that on older systems, which
don't have the patch applied, once it is out?

Thanks, Alexander

On Tuesday, July 21, 2009, George Wilson <george.wil...@sun.com> wrote:
> Russel wrote:
>> OK. So do we have an zpool import --txg 56574 mypoolname or help to
>> do it (script?)
>>
>> Russel
>
> We are working on the pool rollback mechanism and hope to have that
> soon. The ZFS team recognizes that not all hardware is created equal
> and thus the need for this mechanism. We are using the following CR
> as the tracker for this work:
>
>   6667683 need a way to rollback to an uberblock from a previous txg
>
> Thanks,
> George

-- 
Alexander
[[ http://zensursula.net ]]
[ Soc. = http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr = http://zyb.com/alexws77 ]
[ Chat = Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ]
[ Mehr = AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40
I don't mean to be offensive Russel, but if you do ever return to ZFS,
please promise me that you will never, ever, EVER run it virtualized on
top of NTFS (a.k.a. worst file system ever) in a production
environment. Microsoft Windows is a horribly unreliable operating
system in situations where things like protecting against data
corruption are important. Microsoft knows this, which is why they
secretly run much of Microsoft.com, their www advertisement campaigns,
and the Microsoft Updates web sites on Akamai Linux in the data center
across the hall from the data center where I work, and the invulnerable
file system behind Microsoft's cloud that secretly runs on Akamai's
content delivery system is none other than ZFS's long lost brother...
Netapp WAFL!

The first time I started to catch on to this was when the Project
Mojave advertisement campaign started and lots of people were nmap
scanning the site and noticing that it was running Apache on Linux:

http://openmanifesto.blogspot.com/2008/07/mss-blunder-with-mojave-experiment-uses.html

Eventually Microsoft realized they messed up and started to edit the
header strings like they usually do to make it look like IIS:

https://lists.mayfirst.org/pipermail/nosi-discussion/2008-August/000417.html

although you could still figure it out if you were smart enough by
using telnet like this:

http://news.netcraft.com/archives/2003/08/17/wwwmicrosoftcom_runs_linux_up_to_a_point_.html

but the cat was already out of the bag.

I did some investigating over a year ago and talked to some of my long
time friends who were senior Akamai techs, and one of them eventually
gave me a guided tour after hours and gave me a quick look at the
Netapp WAFL setup and explained how Microsoft Windows updates actually
work. Very cool! These Akamai guys are like the Wizard of Oz for the
Internet, running everything behind the curtains there. Whenever
Microsoft Updates are down - tell an Akamai tech! Everything will start
working fine within 5 minutes of you telling them (sure beats calling
in to Microsoft Tech Support in Mumbai, India). Is apple.com or itunes
running slow? Tell an Akamai tech and it'll be fixed immediately.
Cnn.com down? Jcpenny.com down? Yup. Tell an Akamai tech and it comes
right back up. It's very rare that they have a serious problem like
this one:

http://www.theregister.co.uk/2004/06/15/akamai_goes_postal/

in which case 25% of the internet (including google, yahoo, and lycos)
usually goes down with them.

So my question to you Russel is: if Microsoft can't even rely on NTFS
to run their own important infrastructure (they obviously have a Netapp
WAFL dependency), what hope can your 10TB pool possibly have? What
you're doing is the equivalent of building a 100-story-tall skyscraper
out of titanium and then making the bottom-most ground floor and
basement foundation out of glue and popsicle sticks, and then when the
entire building starts to collapse, you call the titanium metal
fabrication corporation, blame them for the problem, and tell them that
they are obligated to help you glue your popsicle sticks back together
because it's all their fault that the building collapsed! Not very fair
IMHO.

In the future, keep in mind that (as far as I understand it) the only
way to get the 100% full benefits of ZFS checksum protection is to run
it on bare metal with no virtualization. If you're going to virtualize
something, virtualize Microsoft Windows and Linux inside of OpenSolaris.
I'm running ZFS in production with my OpenSolaris operating system
zpool mirrored three times over on 3 different drives, and I've never
had a problem with it. I even created a few simulated power outages to
test my setup, and pulling the plug while twelve different users were
uploading multiple files into 12 different Solaris zones definitely
didn't faze the zpool at all. It just boots right back up and
everything works.

The thing is, though, it only seems to work when you're not running it
virtualized on top of a closed-source proprietary file system that's
made out of glue and popsicle sticks.

Just my 2 cents. I could be wrong though.
Re: [zfs-discuss] zpool import is trying to tell me something...
Maybe I should have posted the zdb -l output. Having seen another
thread which suggests that I might be looking at the most recent txg
being damaged, I went to get my pool's txg counter:

  hydra# zdb -l /dev/dsk/c3t10d0s0 | grep txg
      txg=10168474
      txg=10168474
      txg=6324561
      txg=6324561

All of the disks are like this (indeed, the only thing that differs
about their zdb -l output is their own guids, as expected). Staring at
reams of od -x output, it appears that I have txgs 10168494 through
10168621 in L0 and L1. L2 and L3 appear to have not been updated in
some time!

L0 and L1 are both version=14 and have a hostname field; L2 and L3 are
both version=3 and do not. All four labels appear to describe the same
array (guids and devices and all). The uberblocks in L2 and L3 seem to
contain txgs 6319346 through 6319473. That's, uh... funny.

A little bit of dtrace and time travel back to vdev.c as of snv_115
later, I find that the immediate cause is that vdev_raidz_open is
yielding an asize of 1280262012928, but that when vdev_open called
vdev_raidz_open, it thought the asize was 1280272498688. (Thus
vdev_open and vdev_root_open return EINVAL, and the import fails.)

That is, the array is actually one megabyte smaller than it thought...
which works out to 256K per disk, which is exactly the size of a label
pair and might explain why L2 and L3 are stale.

Help?
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Once these bits are available in OpenSolaris then users will be able to
upgrade rather easily. This would allow you to take a liveCD running
these bits and recover older pools.

Do you currently have a pool which needs recovery?

Thanks,
George

Alexander Skwar wrote:
> Hi.
>
> Good to know! But how do we deal with that on older systems, which
> don't have the patch applied, once it is out?
>
> Thanks, Alexander
> [...]
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
> To All: The ECC discussion was very interesting as I had never
> considered it that way! I will be buying ECC memory for my home
> machine!!

You have to make sure your mainboard, chipset and/or CPU support it;
otherwise any ECC modules will just work like regular modules. The
mainboard needs to have the necessary lanes to either the chipset that
supports ECC (in the case of Intel) or the CPU (in the case of AMD).

I think all Xeon chipsets do ECC, as do various consumer ones (I only
know of X38/X48; there are also some 9xx ones that do). For consumer
boards, it's hard to figure out which actually do support it. I have an
X48-DQ6 mainboard from Gigabyte, which does it.

Regards,
-mg
[zfs-discuss] zfs mirroring question
I am running basic mirroring in a server setup. When I pull out a hard
drive and put it back in, it won't detect it and resilver it until I
reboot the system. Is there a way to force it to detect it and resilver
it in real time? Thank you.

Dan
[zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
We have a Thumper that we got at a good price from the Sun Educational
Grant program (thank you Sun!) but it came populated with 500GB drives.
The box will be used as a virtual tape library and general purpose
NFS/iSCSI/Samba file server for users' stuff. Probably, in about two
years, we will want to reload it with whatever the big 1TB-class drive
of the day is. This gives me a problem with respect to planning for the
future, since currently one can't shrink a zpool. I can think of a few
approaches:

1) Initial configuration with two zpools. This lets us do the upgrade
just before utilization hits 50%. We can migrate everyone off pool 1,
destroy it, upgrade it, and either repeat the process for pool 2 or
join the pools together.

2) Replace with new, bigger disks, and slice them in half. Use one
slice to rejoin the existing pool, and the second slice to start a new
pool.

3) Unlikely: Mirror the existing zpool with some kind of external vdev.
I've tested this - I actually mirrored a physical disk with an NFS vdev
once, and to my amazement it worked. Unfortunately the Thumper is the
biggest box we have right now; we don't have any other devices with
18+TB of space.

3 1/2) Tape, like failure, is always an option.

Either way with 1 or 2 we're stuck with two pools on the same host, but
since I have 40+ disks to spread the IO over, I'm not too worried.

Option 4) If I just replace the 500GB disks one by one with 1TB disks
in an existing single zpool, will the zpool magically have twice as
much space when I am done replacing the very last disk? I don't have
any way to test this. In the past I have been able to do this with
*some* RAID5 array controllers.

If you've been through this drill, let us know how you handled it.

Thanks in advance,

-W Sanders
 St Marys College of CA
Re: [zfs-discuss] zfs mirroring question
Daniel S wrote:
> I am running basic mirroring in a server setup. When I pull out a
> hard drive and put it back in, it won't detect it and resilver it
> until I reboot the system. Is there a way to force it to detect it
> and resilver it in real time?

More info on your hardware is required - in particular, what type of
disks these are and how they are attached, e.g. IDE, SATA, SAS, USB,
FC, iSCSI...

I'm assuming since you said basic mirroring you don't have any hot
spares configured that would have kicked in.

-- 
Darren J Moffat
Re: [zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
Hi--

With 40+ drives, you might consider two pools anyway. If you want to
use a ZFS root pool, something like this:

- Mirrored ZFS root pool (2 x 500 GB drives)
- Mirrored ZFS non-root pool for everything else

Mirrored pools are flexible and provide good performance. See this site
for more tips:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Option 4 below is your best option. Depending on the Solaris release,
ZFS will see the expanded space. If not, see this section:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Changing_Disk_Capacity_Sizes

Cindy

On 07/22/09 10:31, W Sanders wrote:
> We have a Thumper that we got at a good price from the Sun
> Educational Grant program (thank you Sun!) but it came populated with
> 500GB drives.
> [...]
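The reason the space only shows up after the last disk is swapped is
that a vdev's usable size is bounded by its smallest member. A toy
model (raidz2 geometry and drive counts assumed for illustration):

  # Toy model of option 4: a vdev can only use as much of each disk as
  # its smallest member offers, so capacity jumps only once the last
  # 500GB disk in the vdev has been replaced by a 1TB one.

  def raidz2_usable_tb(disks_tb):
      return (len(disks_tb) - 2) * min(disks_tb)  # raidz2: 2 parity

  vdev = [0.5] * 8                    # eight 500GB disks
  for i in range(len(vdev)):
      vdev[i] = 1.0                   # replace one disk with a 1TB drive
      print(i + 1, "replaced ->", raidz2_usable_tb(vdev), "TB usable")
  # Stays at 3.0 TB until the 8th replacement, then jumps to 6.0 TB.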
Re: [zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
4. Yes :-D

While you can't shrink, you can already replace drives with bigger
ones, and ZFS does increase the size at the end (although I think it
needs an unmount/mount right now).

However, even though you can simply pull one drive and replace it with
a bigger one, that does degrade your array. So instead, depending on
your needs, I'd suggest something like creating one pool of a bunch of
raid-z2 vdevs, with 2-4 drives allocated as hot spares. That allows you
in the future to replace the spare drives with new 2TB drives, then
boot and run a 'zpool replace <old disk> <new disk>' for each of the
spares. That will switch the drives to the bigger size without
degrading the array. Then when that finishes, remove the replaced
drives (which are the new spares), and repeat.

The reason I suggest up to 4 spares is that it's likely to take some
time to resilver, and even doing 4 at once you'll need to do this 12
times to upgrade a Thumper. So if you are planning to upgrade,
sacrificing that space now is probably a worthwhile investment.

Sun have confirmed that 2TB drives will be supported, and probably 4TB
ones too. I've also tested this out myself (although just with a single
1TB drive) on a Thumper.
Re: [zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
Thanks! Rats, we're running GA u7 and not OpenSolaris for now:

  # zpool set autoexpand=on pool
  cannot set property for 'pool': invalid property 'autoexpand'

(My pool is, in fact, named "pool".)

We're not in production yet, but I eventually have to install Veritas
NetBackup on this thing (please feel free to pity me), and I don't know
if they are supporting OpenSolaris yet.

-w
[zfs-discuss] SSD's and ZFS...
I've started reading up on this, and I know I have a lot more reading
to do, but I've already got some questions... :)

I'm not sure yet that it will help for my purposes, but I was
considering buying 2 SSD's for mirrored boot devices anyway. My main
question is: can a pair of, say, 60GB SSD's be shared for both the root
pool and as an SSD ZIL? Can the installer be configured to make the
slice for the root pool something less than the whole disk, leaving
another slice for the ZIL? Or would a zvol in the root pool be a better
idea?

I doubt 60GB will leave enough space, but would doing this for the
L2ARC be useful also?

-Kyle
Re: [zfs-discuss] SSD's and ZFS...
I can't speak to whether it's a good idea or not, but I also wanted to
do this and it was rather difficult. The problem is that the
OpenSolaris installer doesn't let you set up slices on a device to
install to. The two ways I came up with were:

1) Use the automated installer to do everything, because it has the
option to configure slices before installing files. This requires
learning a lot about the AI just to configure slices before installing.

2) - install like normal on one drive
   - set up drive #2 with the partition map that you want to have
   - zpool replace drive #1 with drive #2 with the altered partition map
   - set up drive #1 with the new partition map
   - zpool attach drive #1
   - install grub on both drives

Even though approach #2 probably sounds more difficult, I ended up
doing it that way and set up a root slice on each, a slog slice on
each, and 2 independent swap slices.

I would also like to hear if there's any other way to make this easier,
or any problems with my approach that I might have overlooked.
[zfs-discuss] virtualization, alignment and zfs variable stripes
One of the things that commonly comes up in the server virtualization
world is making sure that all of the storage elements are aligned. This
is because there are often so many levels of abstraction, each using
their own block size, that without any tuning they'll usually overlap
and can cause 2 or even 3 times the I/O in some cases to read what
would be just one block. I guess this was also a common thing in the
SAN world many years back.

Let's say I have a simple-ish setup that uses vmware files for virtual
disks on an NFS share from zfs. I'm wondering how zfs' variable block
size comes into play? Does it make the alignment problem go away? Does
it make it worse? Or should we perhaps be creating filesystems with a
fixed block size for virtualization workloads?
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
The i7 doesn't support ECC even if the motherboard supports it; you
need a Xeon W3500, which costs the same as an i7, to support ECC.
Re: [zfs-discuss] virtualization, alignment and zfs variable stripes
On Wed, 22 Jul 2009, t. johnson wrote:
> Let's say I have a simple-ish setup that uses vmware files for
> virtual disks on an NFS share from zfs. I'm wondering how zfs'
> variable block size comes into play? Does it make the alignment
> problem go away? Does it make it worse? Or should we perhaps be

My understanding is that zfs uses fixed block sizes except for the tail
block of a file, or if the filesystem has compression enabled.

Zfs's large blocks can definitely cause performance problems if the
system has insufficient memory to cache the blocks which are accessed,
or if only part of the block is updated.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40
>>>>> "aym" == Anon Y Mous <no-re...@opensolaris.org> writes:
>>>>> "mg" == Mario Goebbels <m...@tomservo.cc> writes:

   aym> I don't mean to be offensive Russel, but if you do ever return
   aym> to ZFS, please promise me that you will never, ever, EVER run
   aym> it virtualized on top of NTFS

he said he was using raw disk devices IIRC. and once again, the host
did not crash, only the guest, so even if it were NTFS rather than raw
disks, the integrity characteristics of NTFS would have been irrelevant
since the host was always shut down cleanly.

   aym> the only way to get the 100% full benefits of ZFS checksum
   aym> protection is to run it on bare metal with no virtualization.

bullshit. That makes no sense at all. First, why should virtualization
have anything to do with checksums? Obviously checksums go straight
through it. The suspected problem lies elsewhere. Second,
virtualization is serious business. Problems need to be found and
fixed. At this point, you've become so aggressive with that broom,
anyone can see there's obviously an elephant under the rug.

   aym> I'm running ZFS in production with my OpenSolaris operating
   aym> system zpool mirrored three times over on 3 different drives,
   aym> and I've never had a problem with it.

The idea of collecting other people's problem reports is to figure out
what's causing problems before one hits you. I hear this type of thing
all the time: ``The number of problems I've had is so close to zero, it
is zero, so by extrapolation nobody else can be having any real
problems, because if I scale out my own experience the expected number
of problems in the entire world is zero.''---wtf? clearly bogus!

    mg> You have to make sure your mainboard, chipset and/or CPU
    mg> support it, otherwise any ECC modules will just work like
    mg> regular modules.

also, scrubbing is sometimes enabled separately from plain ECC. Without
scrubbing the ECC can still correct errors, but won't do so until some
actual thread reads the flipped bit, which is probably okay, but shrug.
I vaguely remember something about an idle scrub thread in solaris
where the CPU itself does the scrubbing? but at least on AMD platforms,
the memory and cache controllers will do scrubbing themselves using
only memory bandwidth, without using CPU cycles, if you ask.

On AMD you can use this script on Linux to control scrub speed and ECC
enablement if your BIOS does not support it. The script does appear to
do something on Phenom II, but I haven't tried the 10-ohm resistor test
the author suggests. I think it should be adaptable to Solaris.

  http://hyvatti.iki.fi/~jaakko/sw/

now if only we could get 4GB ECC unbuffered DDR3 for similar prices to
non-ECC. :(
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Have you considered running your script with ZFS prefetching disabled
altogether, to see if the results are consistent between runs?

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail bradley.di...@sun.com
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 15, 2009, at 9:59 AM, Bob Friesenhahn wrote:

> On Wed, 15 Jul 2009, Ross wrote:
>
>> Yes, that makes sense. For the first run, the pool has only just
>> been mounted, so the ARC will be empty, with plenty of space for
>> prefetching.
>
> I don't think that this hypothesis is quite correct. If you use
> 'zpool iostat' to monitor the read rate while reading a large
> collection of files with total size far larger than the ARC, you
> will see that there is no fall-off in read performance once the ARC
> becomes full.
>
> The performance problem occurs when there is still metadata cached
> for a file but the file data has since been expunged from the cache.
> The implication here is that zfs speculates that the file data will
> be in the cache if the metadata is cached, and this results in a
> cache miss as well as disabling the file read-ahead algorithm. You
> would not want to do read-ahead on data that you already have in a
> cache.
>
> Recent OpenSolaris seems to take a 2X performance hit rather than
> the 4X hit that Solaris 10 takes. This may be due to improvement of
> existing algorithm function performance (optimizations) rather than
> a related design improvement.
>
>> I wonder if there is any tuning that can be done to counteract
>> this? Is there any way to tell ZFS to bias towards prefetching
>> rather than preserving data in the ARC? That may provide better
>> performance for scripts like this, or for random access workloads.
>
> Recent zfs development focus has been on how to keep prefetch from
> damaging applications like database, where prefetch causes more data
> to be read than is needed. Since OpenSolaris now apparently includes
> an option setting which blocks file data caching and prefetch, this
> seems to open the door for use of more aggressive prefetch in the
> normal mode.
>
> In summary, I agree with Richard Elling's hypothesis (which is the
> same as my own).
>
> Bob
[zfs-discuss] Best Approach
I have (2) of the following boxes, exactly matching:

(2) Super Micro X7DBN Motherboard
(16) GB of RAM (8GB in each box)
(4) 1.6GHz Intel XEON Quad-Core LGA771
(2) Super Micro 2U RM (12 Bay Chassis)
(2) Super Micro AOC 8-port SATA Controller

I'd like to ZFS-replicate this box to the other; is this practical, for
one, and/or what is the best method? It will be strictly nothing but a
backup storage box via NFS and/or iSCSI.
[zfs-discuss] L2ARC support in Solaris 10 (Update 8?)
Hi All,

Can anyone shed some light on whether L2ARC support will be included in
the next Solaris 10 update? Or if it is included in a kernel patch over
and above the standard kernel patch rev that ships in 05/09 (AKA U7)?

The reason I ask is that I have standardised on S10 here and am not
keen to deploy OpenSolaris in production. (Just another platform and
patching system to document and maintain. I don't want to debate this
here. It's the way it is.)

I am currently speccing some x4240's with SSD's for some upgraded Squid
proxy caches that will be handling caching duties for around 40-60
megabits/s. Large disk caches and L1ARC for squid will make these
systems really fly. (These are replacing two v240's that are getting a
little long in the tooth and won't keep up with the bandwidth jump.)

The plan is to have a couple of x4240's with dual quad-core processors,
16 GB RAM and 6 x 146 GB 10K SAS drives plus 1 x 32 GB SSD as L2ARC. I
can add this later if support is not available at build time but is
roadmapped for U8.

ZFS config will be a pair of 146 GB drives mirrored as boot drives (and
possibly access logging) and then a RAIDZ1 of 4 drives for max capacity
(the data is disposable as it is purely cached object data).
Compression will be enabled on the disk cache RAIDZ1 to increase
performance of cached data read from disk (seeing as I have many CPU
cycles to burn in these systems ;) ).

I am hoping that these systems will have an L1ARC of around 10GB, an
L2ARC of 32GB and a cache volume of ~420GB RAIDZ plus compression. We
may add more drives or RAIDZ's as we tweak the Squid cached object
size. We are hoping to cache objects up to around 100 MB.

Any comments on either the system configuration and/or L2ARC support
are invited from the list.

Thanks,

Scott.

-- 
_______________________________________________
Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone  : +64 09 968 7611
Fax    : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz
http://www.manukau.ac.nz

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
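A quick sanity check on that ~420GB figure (raidz1 rule of thumb, raw
rather than formatted gigabytes):

  # raidz1 usable capacity is roughly (n-1) drives' worth.
  drives, size_gb = 4, 146
  print((drives - 1) * size_gb)  # 438 GB raw, i.e. the ~420GB volume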
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 22 Jul 2009, Roch wrote:
> Hi Bob, did you consider running the 2 runs with
>
>   echo zfs_prefetch_disable/W0t1 | mdb -kw
>
> and seeing if performance is constant between the 2 runs (and low)?
> That would help clear up the cause a bit. Sorry, I'd do it for you,
> but since you have the setup etc...
>
> Revert with:
>
>   echo zfs_prefetch_disable/W0t0 | mdb -kw
>
> -r

I see that if I update my test script so that prefetch is disabled
before the first cpio is executed, the read performance of the first
cpio reported by 'zpool iostat' is similar to what has been normal for
the second cpio case (i.e. 32MB/second). This seems to indicate that
prefetch is entirely disabled if the file has ever been read before.

However, there is a new wrinkle in that the second cpio completes twice
as fast with prefetch disabled, even though 'zpool iostat' indicates
the same consistent throughput. The difference goes away if I triple
the number of files.

With 3000 8.2MB files:

  Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
  14443520 blocks

  real    3m41.61s
  user    0m0.44s
  sys     0m8.12s

  Doing second 'cpio -C 131072 -o > /dev/null'
  14443520 blocks

  real    1m50.12s
  user    0m0.42s
  sys     0m7.21s

Now if I increase the number of files to 9000 8.2MB files:

  Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
  144000768 blocks

  real    35m51.47s
  user    0m4.46s
  sys     1m20.11s

  Doing second 'cpio -C 131072 -o > /dev/null'
  144000768 blocks

  real    35m22.41s
  user    0m4.40s
  sys     1m14.22s

Notice that with 3X the files, the throughput is dramatically reduced
and the time is the same for both cases.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
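As a cross-check on those timings (cpio reports 512-byte blocks):

  # Converting the cpio timings above into throughput.
  def mib_per_s(blocks, minutes, seconds):
      bytes_read = blocks * 512          # cpio counts 512-byte blocks
      elapsed = minutes * 60 + seconds
      return bytes_read / elapsed / 2**20

  print(mib_per_s(14443520, 3, 41.61))    # 1st run, 3000 files: ~32 MB/s
  print(mib_per_s(14443520, 1, 50.12))    # 2nd run, 3000 files: ~64 MB/s
  print(mib_per_s(144000768, 35, 51.47))  # 9000 files: ~33 MB/s, both runs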
Re: [zfs-discuss] L2ARC support in Solaris 10 (Update 8?)
On Thu, 23 Jul 2009, Scott Lawson wrote:
> The plan is to have a couple of x4240's with dual quad-core
> processors, 16 GB RAM and 6 x 146 GB 10K SAS drives plus 1 x 32 GB
> SSD as L2ARC. I can add this later if support is not available at
> build time but is roadmapped for U8.

I suggest maxing out your server RAM capacity before worrying about
adding an L2ARC. The reason why is that RAM is full speed and contains
the L1ARC. The only reason to do otherwise is if you can't afford it.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Understanding SAS/SATA Backplanes and Connectivity
On Fri, 17 Jul 2009 14:16:32 -0400
Miles Nordin <car...@ivy.net> wrote:

>>>>>> "rl" == Rob Logan <r...@logan.com> writes:
>
>    rl> Is there some magic that load balances the 4 SAS ports as
>    rl> this shows up as one scsi-bus?
>
> The LSI card is not SATA framework. I've the impression drive
> enumeration and topology is handled by the proprietary firmware on
> the card, so it's likely there isn't any explicit support for SAS
> expanders inside solaris's binary mpt driver at all.

There kinda is - mpt(7d) detects SAS expanders as SCSI Enclosure
Services devices (which is what the spec says), and passes the
enumeration off to ses(7d) or sgen(7d), depending on what you've got as
a device alias for scsiclass,0d. We also (in NV since build 81, S10
Update 6) detect and correctly handle Serial Management Protocol
instances, which SAS expanders hook into. The SAS HBA chip passes SMP
frames to and from the expander.

> If you have x86 I think you can explore topology using the bootup
> Blue Screens of Setup, but I don't have anything with SAS expander
> to test it.

Yes, that's correct; the bluescreenofsetup allows you to do some
minimal viewing of the config.

> I think the SAS standard itself has a concept of ``wide ports'' like
> infiniband or PCIe, so I would speculate the 4 pairs are treated as
> lanes rather than ports.

mpt(7d) bundles the phys and only shows one controller for internal and
one controller for external connections - on a physical hba basis.

cheers,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
Good news; the manual for the M4N78-VM mentions ECC and gives the
following BIOS options: disabled/basic/good/super/maxi/user. Unsure
what these mean, but that's a start.
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
Found this:

  ECC Mode [Disabled]
  Disables or sets the DRAM ECC mode that allows the hardware to report
  and correct memory errors. Set this item to [Basic], [Good] or [Max]
  to allow ECC mode auto-adjustment. Set this item to [Super] to adjust
  the DRAM BG Scrub sub-item manually. You may also adjust all
  sub-items by setting this item to [User].
  Configuration options: [Disabled] [Basic] [Good] [Super] [Max] [User]

I would have thought the checksum was either good or not. Apparently
it's not so simple. Now about that unique PCIe-16 slot?
Re: [zfs-discuss] virtualization, alignment and zfs variable stripes
Hmm.. I guess that's what I've heard as well. I do run compression and
believe a lot of others would as well. So then, it seems to me that if
I have guests that run a filesystem formatted with 4k blocks, for
example, I'm inevitably going to have this overlap when using ZFS
network storage? So if A were zfs blocks and B were virtualized guest
blocks, I think it might look like this with compression on?

    | B1 | B2 | B3 | B4 |
  | A1 | A2 | A3 | A4 |

So if the guest OS wants blocks B2 or B4, it actually has to read 2
blocks from the underlying zfs storage?
Re: [zfs-discuss] virtualization, alignment and zfs variation stripes
On Thu, Jul 23, 2009 at 12:29 PM, thomas <no-re...@opensolaris.org> wrote:
> Hmm.. I guess that's what I've heard as well. I do run compression
> and believe a lot of others would as well. So then, it seems to me
> that if I have guests that run a filesystem formatted with 4k blocks,
> for example, I'm inevitably going to have this overlap when using ZFS
> network storage?
>
> So if the guest OS wants blocks B2 or B4, it actually has to read 2
> blocks from the underlying zfs storage?

AFAIK if you use a zvol and set the zfs volblocksize to be the same as
the fs block size on the virtualized system (which is 4k by default for
a several-GB disk/partition with ext3/ntfs), every virtualized block
read should correspond to one zfs block read. If you set compression
on, the actual bytes read from the storage will not always be 4k
though; it can be less, depending on how compressible the data is.

-- 
Fajar
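A tiny model of the alignment effect under discussion; the offsets and
sizes below are illustrative:

  # How many backend (zvol) blocks a guest block read touches, given
  # the guest's block size, the zvol volblocksize, and any offset skew
  # introduced by partition tables etc. Illustrative model only.

  def backend_blocks(guest_off, guest_bs, volblocksize):
      first = guest_off // volblocksize
      last = (guest_off + guest_bs - 1) // volblocksize
      return last - first + 1

  # Aligned: 4K guest blocks on 4K volblocksize -> 1 backend read each.
  print(backend_blocks(8192, 4096, 4096))        # 1

  # Skewed by a 512-byte partition offset -> every read straddles two.
  print(backend_blocks(8192 + 512, 4096, 4096))  # 2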