[zfs-discuss] encfs on top of zfs

2012-07-30 Thread Tristan Klocke
Dear ZFS-Users,

I want to switch to ZFS, but still want to encrypt my data. Native
encryption for ZFS was added in ZFS pool version 30
(http://en.wikipedia.org/wiki/ZFS#Release_history), but I'm using ZFS on
FreeBSD with version 28. My question is: how would encfs (FUSE
encryption) affect ZFS-specific features like data integrity and
deduplication?
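
(To make the dedup half of the question concrete, here is the sort of test I
have in mind - a sketch only, the pool and mount-point names are made up, and
dedup itself needs a newer pool version than the v28 I'm on. My understanding
is that encfs encrypts each file with its own IV, so identical plaintext files
should produce different ciphertext:)

  encfs /tank/encrypted /mnt/clear     # FUSE encryption layered on a ZFS dataset
  cp big.iso /mnt/clear/copy1
  cp big.iso /mnt/clear/copy2          # identical plaintext files...
  zpool get dedupratio tank            # ...but the ciphertext differs, so the
                                       # dedup ratio should stay near 1.00x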

Regards

Tristan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is LSI SAS3081E-R suitable for a ZFS NAS ?

2010-01-25 Thread Tristan Ball
It may depend on the firmware you're running. We've got a SAS1068E-based 
card in a Dell R710 at the moment, connected to an external SAS JBOD, and 
we did have problems with the as-shipped firmware.


However we've upgraded that, and _so far_ haven't had further issues. I 
didn't do the upgrade myself, so I'm not sure what it shipped with, but 
we're currently running:



Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 1

Current active firmware version is 011a (1.26.00)
Firmware image's version is MPTFW-01.26.00.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.24.00.00 (2008.07.01)
FCode image's version is MPT SAS FCode Version 1.00.49 (2007.09.21)


We've got multiple iSCSI and NFS clients all running disk benchmarks in 
loops to produce load, as well as repeated scrubs on the server, and 
we've been running several days now without fault. We're going to give it 
several more days before we sign it off as stable, though!
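
(The scrub side of that load is just a loop along these lines - a sketch, with
a made-up pool name:)

  while true; do
      zpool scrub tank
      sleep 3600       # kick off another scrub roughly hourly
  done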


Regards,
Tristan.

On 26/01/2010 10:47 AM, Mark Nipper wrote:

As in they work without any possibility of mpt timeout issues?  I'm at my wits 
end with a machine right now that has an integrated 1068E and is dying almost 
hourly at this point.

If I could spend three hundred dollars or so and have my problems magically go 
away, I'd love to pull the trigger on one of these SAS3081E-R cards right now.

Let me know if I can provide any more feedback on my existing 1068E problems.  
I followed all the threads, but nothing apparently really fixes the problem.  I 
was running 2009.06 (111b I think) and I just updated to 131 earlier today with 
more of the same problems.  My ZFS pool is in such a sad state, it takes an 
hour or so at most with all the resilvering happening before the machine goes 
to lunch.  All of this on a SuperMicro motherboard and chassis with port 
expanders.
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Recordsize...

2010-01-17 Thread Tristan Ball

Hi Everyone,

Is it possible to use send/recv to change the recordsize, or does each 
file need to be individually recreated/copied within a given dataset?


Is there a way to check the recordsize of a given file, assuming that 
the filesystem's recordsize was changed at some point?
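
(For reference, what I already know how to do - the dataset name is made up,
and the zdb step is my guess at how to inspect a single file:)

  zfs get recordsize tank/data         # the dataset-level property
  ls -i /tank/data/somefile            # on ZFS the inode number is the object number
  zdb -dddd tank/data <object-number>  # the dblk column shows the file's block size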


Also - am I right in thinking that if a 4K write is made to a file in a 
filesystem with a recordsize of 8K, then the original block is read 
(assuming it's not in the ARC) before the new block is written elsewhere 
(the copy in copy-on-write)? That would be one of the reasons that 
aligning application IO size and filesystem recordsize is a good 
thing: where such IO is aligned, you remove the need for that 
original read.
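
(To spell out the case I mean - sizes are just an example, and the dataset
name is made up:)

  # recordsize 8K, application writes 4K:
  #   1. read the existing 8K record (a disk read unless it's already in the ARC)
  #   2. merge the new 4K of data into it
  #   3. write the full 8K record to a new location (copy-on-write)
  # With the recordsize matched to the IO size, the read in step 1 goes away:
  zfs set recordsize=4K tank/db        # affects newly written files only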


Thanks,
Tristan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-13 Thread Tristan Ball
That's very interesting tech you've got there... :-) I have a couple of 
questions, with apologies in advance if I missed them on the website..


I see the PCI card has an external power connector - can you explain 
how/why that's required, as opposed to using an on-card battery or 
similar? What happens if the power to the card fails?


The 155MB/s rate for sustained writes is low for DDR RAM? Is this because 
the backup to NAND is a constant thing, rather than happening only at power fail?


Regards
   Tristan

Christopher George wrote:

The DDRdrive X1 OpenSolaris device driver is now complete,
please join us in our first-ever ZFS Intent Log (ZIL) beta test 
program.  A select number of X1s are available for loan,
preferred candidates would have a validation background 
and/or a true passion for torturing new hardware/driver :-)


We are singularly focused on the ZIL device market, so a test
environment bound by synchronous writes is required.  The
beta program will provide extensive technical support and a
unique opportunity to have direct interaction with the product
designers.

Would you like to take part in the advancement of Open
Storage and explore the far-reaching potential of ZFS
based Hybrid Storage Pools?

If so, please send an inquiry to zfs at ddrdrive dot com.

The drive for speed,

Christopher George
Founder/CTO
www.ddrdrive.com

*** Special thanks goes out to SUN employees Garrett D'Amore and
James McPherson for their exemplary help and support.  Well done!
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!

2010-01-13 Thread Tristan Ball

Thanks for the detailed response - further questions inline...

Christopher George wrote:

Excellent questions!

  
I see the PCI card has an external power connector - can you explain 
how/why that's required, as opposed to using an on card battery or 
similar. 



DDRdrive X1 ZIL functionality is best served with an externally attached UPS;
this allows the X1 to perform as a non-volatile storage device without specific
user configuration or unique operation.  An often overlooked aspect of batteries
(irrespective of technology or internal/external) is their limited lifetime and
the varying degrees of maintenance and oversight required.  For example, a
lithium (Li-Ion) battery supply, as used by older NVRAM products and not the
X1, does have the minimum required energy density for an internal solution,
but it has a fatal flaw for enterprise applications - an ignition-mode failure
possibility.  Google "lithium battery fire".  Such an instance, even if rare,
would be catastrophic not only to the on-card data but to the host server and so
on...  Supercapacitors are another alternative which thankfully do not share
the ignition-mode failure mechanism of Li-Ion, but are hampered mainly by
cost, with some longevity concerns which can be addressed.  In the end, we
selected data integrity, cost, and serviceability as our top three priorities.
This led us to the industry-standard external lead-acid battery as sold by APC.

Key benefits of the DDRdrive X1 power solution:

1)  Data Integrity - Supports multiple back-to-back power failures. A single
DDRdrive X1 uses less than 5W when the host is powered down, so even a
small UPS is over-provisioned and, unlike an internal solution, will not normally
require a lengthy recharge time prior to the next power incident.  Optionally, a
backup to NAND can be performed to remove the UPS duration as a factor.

2)  Cost Effective / Flexible - The Smart-UPS SC 450VA (280 Watts) is an
excellent choice for most installations and retails for approximately $150.00.
Flexibility is in regard to UPS selection, as it can be right-sized (duration)
for each individual application if needed.

3)  Reliability / Maintenance - UPS front panel LED status for battery
replacement and audible alarms when the battery is low or non-operational.
Industry-standard battery form factor, backed by APC, the industry-leading
manufacturer of enterprise-class backup solutions.
  
OK, I take your point about battery fires, however we've been using 
battery-backed cards (of various types) in servers for a while now, and 
I think you might have over-emphasized those risks when compared 
to the operational complexity of maintaining a separate power circuit 
for my PCI cards! But then, I haven't actually done the research on 
battery reliability either. :-)


I'm not sure about others on the list, but I have a dislike of AC power 
bricks in my racks. Sometimes they're unavoidable, but they're also 
physically awkward - where do we put them? Using up space on a dedicated 
shelf? Cable tied to the rack itself? Hidden under the floor?


Is the state of the power input exposed to software in some way? In 
other words, can I have a Nagios check running on my server that 
triggers an alert if the power cable accidentally gets pulled out?


  

What happens if the *host* power to the card fails?



Nothing, the DDRdrive X1's data integrity is guaranteed by the attached UPS.
  
OK, which means that the UPS must be separate from the UPS powering the 
server then.
  

The 155MB/s rate for sustained writes is low for DDR RAM?



The DRAM's value-add is its extremely low latency (even compared to NAND)
and other intrinsic properties such as longevity and reliability.  The
read/write sequential bandwidth is completely bound by the PCI Express interface.
  
Any plans for a PCIe multi-lane version then? All my servers are still 
Gig-E, and I'm not likely to see more than 100MB/sec of NFS traffic; 
however, I'm sure there are plenty of NFS servers on 10G out there that 
will see quite a bit more than 155MB/sec for moderate amounts of time. I 
know we can put more than one of these cards in a server, but those 
slots are often taken up with other things!


I look forward to these being available in Australia  :-)

Thanks,
   Tristan


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-06 Thread Tristan Ball
Having gotten back to the Rep and asked further questions, I'm forced to 
agree - the rep doesn't know what they're talking about.


It does look like the Intel-based 40G Kingston may not yet be available 
in Australia.


What a drag. :-)

T

On 6/01/2010 10:46 PM, Al Hopper wrote:

On Tue, Jan 5, 2010 at 11:57 PM, Eric D. Mudama
edmud...@bounceswoosh.org  wrote:
   

On Wed, Jan  6 at 14:56, Tristan Ball wrote:
 

For those searching list archives, the SNV125-S2/40GB given below is not
based on the Intel controller.

I queried Kingston directly about this because there appears to be so
much confusion (and I'm considering using these drives!), and I got back
that:

The V series uses a JMicron Controller
The V+ series uses a Samsung Controller
The M and E series are the Intel Drives
   

I'm 99.999% sure that the 40GB V drive is based on the Intel
architecture and that the kingston rep was wrong.
 

+1

The Kingston rep is wrong.  Obviously Kingston's part numbering scheme
is totally foobarred.  The 40GB drive is half an X25-M.  I stand by
the part numbers I posted originally:

Desktop Bundle - SNV125-S2BD/40GB
Bare drive - SNV125-S2/40GB

When purchasing this drive, verify the manufacturers part number and
ignore the description and product picture(s).

   

http://www.anandtech.com/storage/showdoc.aspx?i=3667&p=4

The label in the picture clearly shows a bare letter 'V' and 40GB
markings, and the board/layout is identical to the X25-M with only 5
NAND TSOPs instead of 10 or 20.

Either way though, the Intel-branded 40GB MLC drive (X25-V) is now
available on newegg as well:

http://www.newegg.com/Product/Product.aspx?Item=N82E16820167025

Currently $129.99 in retail packaging (with an aluminum sled so it can
bolt into a 3.5" drive bay, etc.); no idea if they'll sell 'em for the
$85 that Newegg briefly had the Kingston-branded version at.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 



   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Tristan Ball

On 6/01/2010 3:00 AM, Roch wrote:

Richard Elling writes:
On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
  
  I find it baffling that RaidZ(2,3) was designed to split a record-
  size block into N (N=# of member devices) pieces and send the
  uselessly tiny requests to spinning rust when we know the massive
  delays entailed in head seeks and rotational delay. The ZFS-mirror
  and load-balanced configuration do the obviously correct thing and
  don't split records and gain more by utilizing parallel access. I
  can't imagine the code-path for RAIDZ would be so hard to fix.
  
Knock yourself out :-)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
  
  I've read posts back to 06 and all I see are lamenting about the
  horrendous drop in IOPs, about sizing RAIDZ to ~4+P and trying to
  claw back performance by combining multiple such vDEVs. I understand
  RAIDZ will never equal Mirroring, but it could get damn close if it
  didn't break requests down and better yet utilized copies=N and
  properly placed the copies on disparate spindles. This is somewhat
  analogous to what the likes of 3PAR do and it's not rocket science.
  

   

[snipped for space ]


That said, I truly am for an evolution for random read
workloads. Raid-Z on 4K sectors is quite appealing. It means
that small objects become nearly mirrored with good random read
performance while large objects are stored efficiently.

-r
   

Sold! Let's do that then! :-)

Seriously - are there design or architectural reasons why this isn't 
done by default, or at least as an option? Or is it just a "no one's had 
time to implement it yet" thing?
I understand that 4K sectors might be less space-efficient for lots of 
small files, but I suspect lots of us would happily make that trade-off!
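
(The arithmetic that sells it for me - an illustration only, assuming a
5-disk raidz1:)

  # 4K sectors, 4K block:
  #   1 data sector + 1 parity sector  -> 50% overhead, i.e. the same space
  #   cost and single-disk read profile as a mirror.
  # 4K sectors, 128K block:
  #   32 data sectors + 8 parity sectors (4 data + 1 parity per stripe)
  #   -> 25% overhead, striped across the vdev, so large blocks stay efficient.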


Thanks,
Tristan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rethinking RaidZ and Record size

2010-01-05 Thread Tristan Ball



On 6/01/2010 7:19 AM, Richard Elling wrote:


If you are doing small, random reads on dozens of TB of data, then you've
got a much bigger problem on your hands... kinda like counting grains of
sand on the beach during low tide :-).  Hopefully, you do not have to
randomly update that data because your file system isn't COW :-).
Fortunately, most workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing keeping the
HDD market alive is the low $/TB.  Moore's Law predicts that cost advantage
will pass.  SSDs are already the low $/IOPS winners.
 -- richard

These workloads (small random reads over huge datasets) might be getting 
more common in some environments - because it seems to be what you get 
when you consolidate virtual machine storage.


We've got a moderately large number of virtual machines (a mix of 
Debian, Win2K and Win2K3) running a very large set of applications, and our 
reads are all over the place! :-( I have to say I remain impressed at 
how well the ARC behaves, but even then our hit rate is often not 
wonderful.


I _dream_ about being able to afford to build out my entire storage from 
cheap/large SSDs. My guess is that in about two years I'll be able 
to, which is one of the reasons we've essentially put a hold on buying 
enterprise storage or fast FC/SCSI disks. A large part of the 
justification for FC/SCSI disks is their performance, and they're going 
to be completely eclipsed within the lifetime of any serious mid-range 
to high-end storage array. Until that day we're making do with large SATA 
drives, mirrored, with relatively high spindle counts to avoid long 
per-disk queues.
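
(For the record, the layout I mean is just many small mirrors striped
together - device names are made up:)

  zpool create tank \
      mirror c1t0d0 c1t1d0 \
      mirror c1t2d0 c1t3d0 \
      mirror c1t4d0 c1t5d0 \
      mirror c1t6d0 c1t7d0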


:-)

T

PS:  OK, I know other tier-1 storage vendors have started integrating 
SSDs as well, but they hadn't when we started our current round of 
storage upgrades, and I still think opensolaris + SATA HDDs + SSDs gives 
us a cleaner, cheaper and easier upgrade path than most tier-1 vendors 
can provide.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] need a few suggestions for a poor man's ZIL/SLOG device

2010-01-05 Thread Tristan Ball
For those searching list archives, the SNV125-S2/40GB given below is not
based on the Intel controller.

I queried Kingston directly about this because there appears to be so
much confusion (and I'm considering using these drives!), and I got back
that:

The V series uses a JMicron Controller
The V+ series uses a Samsung Controller
The M and E series are the Intel Drives

There will be G2 versions of the V and V+ series out shortly, at least
one of which will be based on a Toshiba controller.

Part number prefixes are:

V series: SNV125-S2
V+ Series: SNV225-S2
M Series: SNM225-S2
E Series: SNE125-S2

http://www.kingston.com/anz/ssd/default.asp


It does seem that there are _lots_ of 3rd party websites that claim the
variations of the SNV* parts are Intel based drives, however that's not
what Kingston's rep told me, and it's not what's on their website.

Regards,
Tristan

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Eric D. Mudama
Sent: Tuesday, 5 January 2010 7:35 PM
To: Thomas Burgess
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] need a few suggestions for a poor man's
ZIL/SLOG device

On Mon, Jan  4 at 22:01, Thomas Burgess wrote:
   I guess i got some bad advice then
   I was told the kingston snv125-s2 used almost the exact same hardware as
   an x25-m and should be considered the poor man's x25-m
...

   Right, i couldn't find any of the 40 gb's in stock so i ordered the 64
   gb... same exact model, only bigger... does your previous statement about
   the larger model ssd's not apply to the kingstons?

The SNV125-S2/40GB is the "half an X25-M" drive, which can often be
found as a bare OEM drive for about $85 w/ rebate.

Kingston does sell rebranded Intel SLC drives as well, but under a
different model number: SNE-125S2/32 or SNE-125S2/64.  I don't believe
the 64GB Kingston MLC (SNV-125S2/64) is based on Intel's controller.

The Kingston rebranding of the gen2 intel MLC design is
SNM-125S2B/80 or SNM-125S2B/160.  Those are essentially 34nm Intel
X25-M units I believe.

--eric


-- 
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Tristan Ball

To some extent it already does.

If what you're talking about is filesystems/datasets, then all 
filesystems within a pool share the same free space, which is 
functionally very similar to each filesystem within the pool being 
thin-provisioned. To get a thick filesystem, you'd need to set at 
least the filesystem's reservation, and probably quota as well. 
Basically filesystems within a pool are thin by default, with the added 
bonus that space freed within a single filesystem is available for use 
in any other filesystem within the pool.


If you're talking about volumes provisioned from a pool, then volumes 
can be provisioned as sparse, which is pretty much the same thing.


And if you happen to be providing ISCSI luns from files rather than 
volumes, then those files can be created sparse as well.
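
Concretely, with made-up names:

  # Filesystems in a pool are thin by default; to make one effectively thick:
  zfs set reservation=100G tank/fs
  zfs set quota=100G tank/fs

  # A sparse (thin) volume:
  zfs create -s -V 200G tank/vol1

  # A sparse file backing an iSCSI LUN:
  mkfile -n 200g /tank/luns/lun0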


Reclaiming space from sparse volumes and files is not so easy unfortunately!

If you're talking about the pool itself being thin... that's harder to 
do, although if you really needed it, I guess you could provision your pool 
from an array that itself provides thin provisioning.


Regards,
Tristan



On 30/12/2009 9:34 PM, Andras Spitzer wrote:

Hi,

Has anyone heard of any plans to support thin devices in ZFS? I'm 
talking about the thin device feature of SAN frames (EMC, HDS), which provides 
more efficient space utilization. The concept is similar to ZFS with its pool 
and datasets, though the pool in this case is in the SAN frame itself, so the 
pool can be shared among different systems attached to the same SAN frame.

This topic is really complex, but I'm sure support is inevitable for 
enterprise customers with SAN storage. Basically it brings the distinction 
between space used and space allocated, which can be a huge difference in a large 
environment, and this difference is major even on the financial level as well.

Veritas has already added support for thin devices: first, support in VxFS to be 
thin-aware (for example, how to handle over-subscribed thin devices); then 
a feature called SmartMove, a nice way to migrate from fat to thin 
devices; and the most brilliant feature of all (my personal opinion, of course) is they 
released the Veritas Thin Device Reclamation API, which provides an interface to the SAN 
frame to report unused space at the block level.

This API is a major hit, and even though SAN vendors don't support it today, 
HP and HDS are already working on it, and I assume EMC will have to follow as well. With 
this API Veritas can keep track of deleted files, for example, and with a simple 
command run once a day (depending on your policy) it can report the unused space 
back to the frame, so thin devices *remain* thin.

I really believe that ZFS should have support for thin devices, especially 
given what this API brings to the field, as it can result in 
a huge cost difference for enterprise customers.

Regards,
sendai
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thin device support in ZFS?

2009-12-30 Thread Tristan Ball

Ack..

I've just re-read your original post. :-) It's clear you are talking 
about support for thin devices behind the pool, not features inside the 
pool itself.


Mea culpa.

So I guess we wait for TRIM to be fully supported... :-)

T.



On 31/12/2009 8:09 AM, Tristan Ball wrote:

To some extent it already does.

If what you're talking about is filesystems/datasets, then all 
filesystems within a pool share the same free space, which is 
functionally very similar to each filesystem within the pool being 
thin-provisioned. To get a thick filesystem, you'd need to set at 
least the filesystem's reservation, and probably quota as well. 
Basically filesystems within a pool are thin by default, with the 
added bonus that space freed within a single filesystem is available 
for use in any other filesystem within the pool.


If you're talking about volumes provisioned from a pool, then volumes 
can be provisioned as sparse, which is pretty much the same thing.


And if you happen to be providing ISCSI luns from files rather than 
volumes, then those files can be created sparse as well.


Reclaiming space from sparse volumes and files is not so easy 
unfortunately!


If you're talking about the pool itself being thin... that's harder to 
do, although if you really needed it I guess if you provision your 
pool from an array that itself provides thin provisioning.


Regards,
Tristan



On 30/12/2009 9:34 PM, Andras Spitzer wrote:

Hi,

Has anyone heard of any plans to support thin devices in 
ZFS? I'm talking about the thin device feature of SAN frames (EMC, 
HDS), which provides more efficient space utilization. The concept is 
similar to ZFS with its pool and datasets, though the pool in this 
case is in the SAN frame itself, so the pool can be shared among 
different systems attached to the same SAN frame.


This topic is really complex, but I'm sure support is inevitable 
for enterprise customers with SAN storage. Basically it brings the 
distinction between space used and space allocated, which can be a huge 
difference in a large environment, and this difference is major even 
on the financial level as well.


Veritas has already added support for thin devices: first, support 
in VxFS to be thin-aware (for example, how to handle over-subscribed 
thin devices); then a feature called SmartMove, a nice 
way to migrate from fat to thin devices; and the most brilliant 
feature of all (my personal opinion, of course) is they released the 
Veritas Thin Device Reclamation API, which provides an interface to 
the SAN frame to report unused space at the block level.


This API is a major hit, and even though SAN vendors don't 
support it today, HP and HDS are already working on it, and I assume EMC will 
have to follow as well. With this API Veritas can keep track of deleted files, 
for example, and with a simple command run once a day (depending on your 
policy) it can report the unused space back to the frame, so thin 
devices *remain* thin.


I really believe that ZFS should have support for thin devices, 
especially given what this API brings to the 
field, as it can result in a huge cost difference for enterprise customers.


Regards,
sendai

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ARC not using all available RAM?

2009-12-20 Thread Tristan Ball
I've got an opensolaris snv_118 machine that does nothing except serve 
up NFS and ISCSI.


The machine has 8G of ram, and I've got an 80G SSD as L2ARC.
The ARC on this machine is currently sitting at around 2G, the kernel is 
using around 5G, and I've got about 1G free. I've pulled this from a 
combination of arc_summary.pl and 'echo ::memstat | mdb -k'.
It's my understanding that the kernel will use a certain amount of RAM 
for managing the L2ARC, and that how much is needed depends on the 
size of the L2ARC and the recordsize of the ZFS filesystems.


I have some questions that I'm hoping the group can answer...

Given that I don't believe there is any other memory pressure on the 
system, why isn't the ARC using that last 1G of ram?
Is there some way to see how much ram is used for L2Arc management? Is 
that what the l2_hdr_size kstat measures? Is it possible to see it via 
'echo ::kmastat | mdb -k '?
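
For reference, the commands I'm using to poke at this (the l2_hdr_size kstat
is the one I'm least sure about):

  kstat -p zfs:0:arcstats:size
  kstat -p zfs:0:arcstats:c_max
  kstat -p zfs:0:arcstats:l2_hdr_size
  echo ::memstat | mdb -k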


Thanks everyone.

Tristan







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ARC not using all available RAM?

2009-12-20 Thread Tristan Ball



Bob Friesenhahn wrote:

On Sun, 20 Dec 2009, Richard Elling wrote:


Given that I don't believe there is any other memory pressure on the 
system, why isn't the ARC using that last 1G of ram?


Simon says, don't do that ? ;-)


Yes, primarily since if there is no more memory immediately available, 
performance when starting new processes would suck.  You need to 
reserve some working space for processes and short term requirements.


Why is that a given? There are several systems that steal from cache 
under memory pressure. Earlier versions of Solaris that I've dealt with 
a little managed with quite a bit less than 1G free. On this system, 
lotsfree is sitting at 127MB, which seems reasonable, and isn't it 
lotsfree and the related variables and page-reclaim logic that 
maintain that pool of free memory for new allocations?


Regards,
   Tristan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] FW: ARC not using all available RAM?

2009-12-20 Thread Tristan Ball
Oops, should have sent to the list...

Richard Elling wrote:

 On Dec 20, 2009, at 12:25 PM, Tristan Ball wrote:

 I've got an opensolaris snv_118 machine that does nothing except 
 serve up NFS and ISCSI.

 The machine has 8G of ram, and I've got an 80G SSD as L2ARC.
 The ARC on this machine is currently sitting at around 2G, the kernel

 is using around 5G, and I've got about 1G free.

 Yes, the ARC max is set by default to 3/4 of memory or memory - 1GB, 
 whichever
 is greater.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3426

So I've read, but the ARC is using considerably less than 3/4 of memory, 
and 1G free is less than 1/4! On this box, c_max is about 7 gig (which 
is more than 3/4 anyway)?


 I've pulled this from a combination of arc_summary.pl, and 'echo 
 ::memstat | mdb -k'

 IMHO, it is easier to look at the c_max in kstat.
 kstat -n arcstats -s c_max
You're probably right. I've been looking at those too - actually, I've 
just started graphing them in munin through some slightly modified munin 
plugins that someone wrote for BSD. :-)


 It's my understanding  that the kernel will use a certain amount of 
 ram for managing the L2Arc, and that how much is needed  is dependent

 on the size of the L2Arc and the recordsize of the zfs filesystems

 Yes.

 I have some questions that I'm hoping the group can answer...

 Given that I don't believe there is any other memory pressure on the 
 system, why isn't the ARC using that last 1G of ram?
 Simon says, don't do that ? ;-)
Simon says lots of things. :-) It strikes me that 1G sitting free is 
quite a lot.

I guess what I'm really asking is: given that 1G free doesn't 
appear to be the 1/4 of RAM that the ARC will never touch, and that c 
is so much less than c_max, why is c so small? :-)


 Is there some way to see how much ram is used for L2Arc management? 
 Is that what the l2_hdr_size kstat measures?

 Is it possible to see it via 'echo ::kmastat | mdb -k '?

 I don't think so.

 OK, so why are you interested in tracking this?  Capacity planning?
 From what I can tell so far, DDT is a much more difficult beast to 
 measure
 and has a more direct impact on performance :-(
  -- richard
Firstly, what's DDT? :-)

Secondly, it's because I'm replacing the system. The existing one was 
proof of concept, essentially built with decommissioned parts. I've got 
a new box, with 32G of RAM, and a little bit of money left in the
budget.

For that money, I could get an extra 80-200G of SSD for L2ARC, or an 
extra 12G of RAM, or perhaps both would be a waste of money. Given the box 
will be awkward to touch once it's in, I'm going to err on the side of 
adding hardware now.

What I'm trying to find out is: is my ARC relatively small because...

1) ZFS has decided that that's all it needs (the workload is fairly 
random), and adding more won't gain me anything.
2) The system is using so much RAM for tracking the L2ARC that the ARC 
is being shrunk (we've got an 8K record size).
3) There's some other memory pressure on the system that I'm not aware 
of that is periodically chewing up then freeing the RAM.
4) There's some other memory management feature that's insisting on that 
1G free.


Actually, because it'll be easier to add SSDs later than to add RAM 
later, I might just add the RAM now and be done with it. :-) It's not 
very scientific, but I don't think I've ever had a system where 2 or 3 
years into its life I've not wished that I'd put more RAM in to start 
with!


But I really am still interested in figuring out how much RAM is used 
for the L2ARC management in our system, because while our workload is 
fairly random, there are some moderately well-defined hotter spots - so it 
might be that in 12 months a feasible upgrade to the system is to add 4 
x 256G SSDs as L2ARC. It would take a while for the L2ARC to warm up, 
but once it was, most of those hotter areas would come from cache. 
However, it may be too much L2ARC for the system to track efficiently.
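
My back-of-envelope numbers, assuming roughly 200 bytes of ARC header per
L2ARC buffer (the real per-buffer cost varies by build, so treat this as a
sketch):

  80 GB L2ARC / 8 KB recordsize   = ~10.5 million buffers
  10.5 million x ~200 bytes       = ~2 GB of RAM just for L2ARC headers

  1 TB (4 x 256 GB) / 8 KB        = ~134 million buffers
  134 million x ~200 bytes        = ~27 GB of RAM for headers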

Regards,
Tristan/




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshot question

2009-11-13 Thread Tristan Ball
I think the exception may be when doing a recursive snapshot - ZFS appears to 
halt IO so that it can take all the snapshots at the same instant.

At least, that's what it looked like to me. I've got an OpenSolaris ZFS box 
providing NFS to VMware, and I was getting SCSI timeouts within the virtual 
machines that appeared to happen exactly as the snapshots were taken.

When I turned off the recursive snapshots, and instead had each FS snapshotted 
individually, the problem went away.
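
In other words, instead of the single recursive snapshot I switched to
per-filesystem snapshots (dataset names are made up; the per-FS form is no
longer atomic across datasets):

  # what I had:
  zfs snapshot -r tank@hourly
  # what I changed to:
  zfs snapshot tank/vm1@hourly
  zfs snapshot tank/vm2@hourly
  zfs snapshot tank/vm3@hourly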

Regards,
Tristan.

-Original Message-
From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Richard Elling
Sent: Saturday, 14 November 2009 5:02 AM
To: Rodrigo E. De León Plicet
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Snapshot question


On Nov 13, 2009, at 6:43 AM, Rodrigo E. De León Plicet wrote:

 While reading about NILFS here:

 http://www.linux-mag.com/cache/7345/1.html


 I saw this:

 One of the most noticeable features of NILFS is that it can  
 continuously and automatically save instantaneous states of the  
 file system without interrupting service. NILFS refers to these as  
 checkpoints. In contrast, other file systems such as ZFS, can  
 provide snapshots but they have to suspend operation to perform the  
 snapshot operation. NILFS doesn't have to do this. The snapshots  
 (checkpoints) are part of the file system design itself.

 I don't think that's correct. Can someone clarify?

It sounds to me like they confused Solaris UFS with ZFS.  What they
say applies to UFS, but not ZFS.
  -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Tristan Ball

This is truly awesome news!

What's the best way to dedup existing datasets? Will send/recv work, or 
do we just cp things around?
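
To put the question another way (dataset names are made up, and the send/recv
route is just my guess at one approach):

  zfs set dedup=on tank/data            # only affects blocks written from now on
  # existing blocks only dedup when rewritten, e.g. via a local send/recv:
  zfs snapshot tank/data@before
  zfs send tank/data@before | zfs recv tank/data-new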


Regards,
   Tristan

Jeff Bonwick wrote:

Terrific! Can't wait to read the man pages / blogs about how to use it...



Just posted one:

http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

Enjoy, and let me know if you have any questions or suggestions for
follow-on posts.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] More Dedupe Questions...

2009-11-02 Thread Tristan Ball


I'm curious as to how send/recv intersects with dedupe... if I send/recv 
a deduped filesystem, is the data sent in its de-duped form, i.e. just 
sent once, followed by the pointers for subsequent dupe data, or is the 
data sent in expanded form, with the recv-side system then having to 
redo the dedupe process?


Obviously sending it deduped is more efficient in terms of bandwidth and 
CPU time on the recv side, but it may also be more complicated to achieve?



Also - do we know yet what effect block size has on dedupe? My guess is 
that a smaller block size will perhaps give a better duplication match 
rate, but at the cost of higher CPU usage and perhaps reduced 
performance, as the system will need to store larger de-dupe hash tables?
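
My rough reasoning on the block-size side, assuming a dedup-table entry costs
on the order of a few hundred bytes (a guess on my part - I don't know the
real figure):

  1 TB of unique data at 128K records =   ~8 million entries -> a few GB of table
  1 TB of unique data at   8K records = ~134 million entries -> tens of GB of table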


Regards,
   Tristan

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Tristan Ball
What makes you say that the X25-E's cache can't be disabled or flushed? 
The net seems to be full of references to people who are disabling the 
cache, or flushing it frequently, and then complaining about the 
performance!
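
(For what it's worth, the procedure people describe for toggling it on Solaris
is the interactive one in format's expert mode - a sketch from memory, and the
menu wording varies by release:)

  format -e
  # select the SSD from the disk list, then:
  #   cache -> write_cache -> display   (show the current setting)
  #   cache -> write_cache -> disable   (turn the volatile cache off)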


T

Frédéric VANNIERE wrote:
The ZIL is a write-only log that is only read after a power failure. Several GB is large enough for most workloads. 

You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that can't be disabled nor flushed by ZFS. 


Imagine your server has a power failure while writing data to the pool. In a 
normal situation, with the ZIL on a reliable device, ZFS will read the ZIL and come 
back to a stable state at reboot. You may have lost some data (30 seconds) but 
the zpool works.  With the Intel X25-E as ZIL, some log data would be lost in 
the power failure (32/64MB max), which leads to a corrupted log and so... you 
lose your zpool and all your data!!

For the ZIL you need 2 reliable mirrored SSD devices with a supercapacitor that can flush the write cache to NAND when a power failure occurs. 


A hard disk has a write cache, but it can be disabled or flushed by the operating 
system.

For more informations : 
http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Recv slow with high CPU

2009-09-22 Thread Tristan Ball

OK, Thanks for that.

From reading the RFE, it sounds like having a faster machine on the 
receive side will be enough to alleviate the problem in the short term?


The hardware I'm using at the moment is quite old, and not particularly 
fast - although this is the first out-and-out performance limitation I've 
had with using it as an OpenSolaris storage system.


Regards,
   Tristan.

Matthew Ahrens wrote:

Tristan Ball wrote:

Hi Everyone,

I have a couple of systems running opensolaris b118, one of which 
sends hourly snapshots to the other. This has been working well, 
however as of today, the receiving zfs process has started running 
extremely slowly, and is running at 100% CPU on one core, completely 
in kernel mode. A little bit of exploration with lockstat and dtrace 
seems to imply that the issue is around the dbuf_free_range 
function - or at least, that's what it looks like to my inexperienced 
eye!


This is probably RFE 6812603 zfs send can aggregate free records, 
which is currently being worked on.


--matt


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Recv slow with high CPU

2009-09-20 Thread Tristan Ball




Hi Everyone,

I have a couple of systems running opensolaris b118, one of which sends
hourly snapshots to the other. This has been working well, however as
of today, the receiving zfs process has started running extremely
slowly, and is running at 100% CPU on one core, completely in kernel
mode. A little bit of exploration with lockstat and dtrace seems to
imply that the issue is around the "dbuf_free_range" function - or at
least, that's what it looks like to my inexperienced eye!

The system is very unresponsive while this problem is occurring, with
frequent multi-second delays between my typing into an ssh session and
getting a response. 

Has anyone seen this before, or know if there's anything I can do about
it?

Lockstat says:

Adaptive mutex hold: 52299 events in 1.694 seconds (30873 events/sec)

---
Count indv cuml rcnt nsec Lock Caller
 50 87% 87% 0.00 23410134 0xff0273d6dcb8
dbuf_free_range+0x268

 nsec -- Time Distribution -- count
 33554432 |@@ 50
---
Count indv cuml rcnt nsec Lock Caller
 53 0% 87% 0.00 117124 0xff025f404140
txg_rele_to_quiesce+0x18

 nsec -- Time Distribution -- count
 131072 |@@ 53
---
Count indv cuml rcnt nsec Lock Caller
2536 0% 88% 0.00 2194 0xff024f09bd40
kmem_cache_alloc+0x84

 nsec -- Time Distribution -- count
 2048 |@ 1121
 4096 | 1383
 8192 | 0
 16384 | 32
---
Count indv cuml rcnt nsec Lock Caller
 415 0% 88% 0.00 7968 0xff025120ce00 anon_resvmem+0xb4

[snip]

I modified the "procsystime" script from the dtrace toolkit to trace
fbt:zfs::entry/return rather than syscalls, and I get an output like
this (approximately a 5sec sample):
Functionnsecs
l2arc_write_eligible 46155456
spa_last_synced_txg 49211711
zio_inherit_child_errors 50039132
 zil_commit 50137305
 buf_hash 50839858
 dbuf_rele 56388970
zio_wait_for_children 67533094
dbuf_update_data 70302317
dsl_dir_space_towrite 72179496
 parent_delta 77685327
 dbuf_hash 79075532
free_range_compar 82479196
dbuf_free_range 2715449416
 TOTAL: 4156078240
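
For reference, the guts of the modification amount to something like this
(a minimal sketch, not the full procsystime script):

  dtrace -n '
  fbt:zfs::entry  { self->ts = timestamp; }
  fbt:zfs::return /self->ts/ { @nsecs[probefunc] = sum(timestamp - self->ts); self->ts = 0; }
  tick-5sec { exit(0); }
  '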

Thanks,
 Tristan





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] alternative hardware configurations for zfs

2009-09-11 Thread Tristan Ball
How long have you had them in production?

Were you able to adjust the TLER settings from within solaris?

Thanks,
Tristan.

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Markus Kovero
Sent: Friday, 11 September 2009 5:00 PM
To: Eugen Leitl; Eric Sproul; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] alternative hardware configurations for zfs

We've been using caviar black 1TB with disk configurations consisting 64
disks or more. They are working just fine.

Yours
Markus Kovero

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Eugen Leitl
Sent: 11. syyskuuta 2009 9:51
To: Eric Sproul; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] alternative hardware configurations for zfs

On Thu, Sep 10, 2009 at 01:11:49PM -0400, Eric Sproul wrote:

 I would not use the Caviar Black drives, regardless of TLER settings.
The RE3
 or RE4 drives would be a better choice, since they also have better
vibration
 tolerance.  This will be a significant factor in a chassis with 20
spinning drives.

Yes, I'm aware of the issue, and am using 16x RE4 drives in my current 
box right now (which I unfortunately had to convert to CentOS 5.3 for 
Oracle/custom software compatibility reasons). I've had very bad experiences 
with Seagate 7200.11 drives in RAID in the past.

Thanks for your advice against Caviar Black. 
 
  Do you think above is a sensible choice? 
 
 All your other choices seem good.  I've used a lot of Supermicro gear
with good
 results.  The very leading-edge hardware is sometimes not supported,
but

I've been using
http://www.supermicro.com/products/motherboard/QPI/5500/X8DAi.cfm
in above box.

 anything that's been out for a while should work fine.  I presume you're going
 for an Intel Xeon solution -- the peripherals on those boards are a bit better
 supported than the AMD stuff, but even the AMD boards work well.

Yes, dual-socket quadcore Xeon.

-- 
Eugen* Leitl  http://leitl.org
__
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-08-27 Thread Tristan
I saw similar behavior when I was running under the kernel debugger (-k 
switch to the kernel). It largely went away when I went back to normal.


T

David Bond wrote:

Hi,

I was directed here after posting in CIFS discuss (as i first thought that it 
could be a CIFS problem).

I posted the following in CIFS:

When using iometer from Windows to the file share on OpenSolaris snv101 and 
snv111, I get pauses every 5 seconds of around 5 seconds (maybe a little less) 
where no data is transferred; when data is transferred it is at a fair speed and 
gets around 1000-2000 IOPS with 1 thread (depending on the work type). The 
maximum read response time is 200ms and the maximum write response time is 
9824ms, which is very bad - an almost 10-second delay in being able to send 
data to the server.
This has been experienced on 2 test servers; the same servers have also been 
tested with Windows Server 2008 and they haven't shown this problem (the share 
performance was slightly lower than CIFS, but it was consistent, and the 
average access time and maximums were very close).


I just noticed that if the server hasn't hit its target ARC size, the pauses are 
maybe .5 seconds, but as soon as it hits its ARC target, the IOPS drop to 
around 50% of what they were, and then there are the longer pauses of around 4-5 
seconds, and after every pause the performance slows even more. So it 
appears it is definitely server-side.

This is with 100% random IO with a spread of 33% write / 66% read, 2KB blocks, 
over a 50GB file, no compression, and a 5.5GB target ARC size.



Also, I have just run some tests with different IO patterns. 100% sequential 
writes produce a consistent 2100 IOPS, except when it pauses for maybe 
.5 seconds every 10-15 seconds.

100% random writes produce around 200 IOPS, with a 4-6 second pause around every 
10 seconds.

100% sequential reads produce around 3700 IOPS with no pauses, just random peaks 
in response time (only 16ms) after about 1 minute of running, so nothing to 
complain about.

100% random reads produce around 200 IOPS, with no pauses.

So it appears that writes cause the problem - what is causing these very long 
write delays?

A network capture shows that the server doesn't respond to the write from the 
client when these pauses occur.

Also, when using iometer, the initial file creation doesn't have any pauses, 
so it might only happen when modifying files.

Any help on finding a solution to this would be really appreciated.

David
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using consumer drives in a zraid2

2009-08-26 Thread Tristan Ball
The intended use is NFS storage to back some VMware servers running a
range of different VMs, including Exchange, Lotus Domino, SQL Server
and Oracle. :-) It's a very random workload, and all the research I've
done points to mirroring as the better option for providing better
total IOP/s. The server in question is limited in the amount of RAM it
can take, so the effectiveness of both the ARC and L2ARC will be limited
somewhat, so I believe mirroring to be lower risk from a performance
point of view. But I've got mirrored X25-E's in there for the ZIL, and an
X25-M for the L2ARC as well.
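
(For reference, that ZIL/L2ARC arrangement is just the following - device
names are made up:)

  zpool add tank log mirror c2t0d0 c2t1d0    # mirrored SSDs as the slog/ZIL
  zpool add tank cache c2t2d0                # one SSD as L2ARC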

 

Obviously some of this situation is not ideal, nor of my choosing. I'd
like to have a newer, faster server in there, and a second JBOD for more
drives, and probably another couple of X25-Ms. Actually, I'd like to
just grab one of the Sun 7000 series and drop that in. :-) The only way
I'll get approval for the extra expenditure is to show that the current
system is viable in an initial proof-of-concept limited deployment
project, and one of the ways I'm doing that is to ensure I get what I
believe to be the best possible performance from my existing hardware -
and I think mirroring will do that.

 

Performance was actually more important than capacity, and I wasn't
willing to bet in advance on the ARC's effectiveness. Actually, I
believe the current system will give me the requisite IOP/s without the
L2ARC; I added it because, for the cost, I considered it silly not to.
For those periods when it is effective, it really makes a difference too!

 

T.

 



From: Tim Cook [mailto:t...@cook.ms] 
Sent: Wednesday, 26 August 2009 3:48 PM
To: Tristan Ball
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Using consumer drives in a zraid2

 

 

On Wed, Aug 26, 2009 at 12:27 AM, Tristan Ball
tristan.b...@leica-microsystems.com wrote:

The remaining drive would only have been flagged as dodgy if the bad
sectors had been found, hence my comments (and general best practice)
about data scrubs being necessary. While I agree it's quite likely
that the enterprise drive would flag errors earlier, I wouldn't
necessarily bet on it. Just because a given sector has successfully been
read a number of times before doesn't guarantee that it will be read
successfully again, and again the enterprise drive doesn't try as hard.
In the absence of scrubs, resilvering can be the hardest thing the drive
does, and in my experience is likely to show up errors that haven't
occurred before. But you make a good point about retrying the resilver
until it works, presuming I don't hit a "too many errors, device
faulted" condition. :-)

 

I would have liked to go RaidZ2, but performance has dictated mirroring.
Physical, Financial and Capacity constraints have conspired together to
restrict me to 2 way mirroring rather than 3 way, which would have been
my next choice. :-)

 

 

Regards

Tristan

 

(Who is now going to spend the afternoon figuring out how to win lottery
by osmosis: http://en.wikipedia.org/wiki/Osmosis :-) )


My suggestion/question/whatever would be: why wouldn't raidz + an SSD ARC
meet both financial and performance requirements?  It would
literally be a first for me.

--Tim 



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using consumer drives in a zraid2

2009-08-25 Thread Tristan Ball
I guess it depends on whether or not you class the various Raid
Edition drives as consumer? :-)

My one concern with these RE drives is that because they return
errors early rather than retrying, they may fault when a normal
consumer drive would have returned the data eventually. If the pool is
already degraded due to a bad device, that could mean faulting the
entire pool.

Regards,
Tristan

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of thomas
Sent: Wednesday, 26 August 2009 12:55 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Using consumer drives in a zraid2

Are there *any* consumer drives that don't respond for a long time
trying to recover from an error? In my experience they all behave this
way which has been a nightmare on hardware raid controllers.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using consumer drives in a zraid2

2009-08-25 Thread Tristan Ball
Not upset as such :-)

 

What I'm worried about is the time period where the pool is resilvering to
the hot spare. For example: one half of a mirror has failed completely,
and the mirror is being rebuilt onto the spare - if I get a read error
from the remaining half of the mirror, then I've lost data. If the RE
drive returns an error for a request that a consumer drive would have
(eventually) returned, then in this specific case I would have been
better off with the consumer drive.

 

That said, my initial ZFS systems are built with consumer drives, not
Raid Editions, as much as anything because we got burned by some early RE
drives in some of our existing raid boxes here, so I had a generally low
opinion of them. However, having done a little more reading about the
error recovery time stuff, I will also be putting in RE drives for the
production systems, and moving the consumer drives to the DR systems.

 

My logic is pretty straight forward:

 

Complete disk failures are comparatively rare, while media or transient
errors are far more common. As a media I/O or transient error on the
drive can affect the performance of the entire pool, I'm best off with
the RE drives to mitigate that. The risk of a double disk failure as
described above is partially mitigated by regular scrubs. The impact of
a double disk failure is mitigated by send/recv'ing to another box, and
catastrophic and human failures are partially mitigated by backing the
whole lot up to tape. :-)
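
(The regular scrubs are just a cron entry along these lines - the pool name
is made up:)

  # weekly scrub, Sunday 2am
  0 2 * * 0  /usr/sbin/zpool scrub tank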

 

Regards

Tristan.

 

 

 



From: Tim Cook [mailto:t...@cook.ms] 
Sent: Wednesday, 26 August 2009 2:08 PM
To: Tristan Ball
Cc: thomas; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Using consumer drives in a zraid2

 

 

On Tue, Aug 25, 2009 at 10:56 PM, Tristan Ball
tristan.b...@leica-microsystems.com wrote:

I guess it depends on whether or not you class the various Raid
Edition drives as consumer? :-)

My one concern with these RE drives is that because they return
errors early rather than retrying, they may fault when a normal
consumer drive would have returned the data eventually. If the pool is
already degraded due to a bad device, that could mean faulting the
entire pool.

Regards,
   Tristan



Having it return errors when they really exist isn't a bad thing.  What
it boils down to is: you need a hot spare.  If you're running raid-z and
a drive fails, the hot spare takes over.  Once it's done resilvering,
you send the *bad drive* back in for an RMA.  (added bonus is the RE's
have a 5 year instead of 3 year warranty).  

You seem to be upset that the drive is more conservative about fail
modes.  To me, that's a good thing.  I'll admit, I was cheap at first
and my fileserver right now is consumer drives.  You can bet all my
future purchases will be of the enterprise grade.  And guess what...
none of the drives in my array are less than 5 years old, so even if
they did die, and I had bought the enterprise versions, they'd be
covered.

--Tim



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using consumer drives in a zraid2

2009-08-25 Thread Tristan Ball
The remaining drive would only have been flagged as dodgy if the bad
sectors had been found, hence my comments (and general best practice)
about data scrubs being necessary. While I agree it's quite likely
that the enterprise drive would flag errors earlier, I wouldn't
necessarily bet on it. Just because a given sector has successfully been
read a number of times before doesn't guarantee that it will be read
successfully again, and again the enterprise drive doesn't try as hard.
In the absence of scrubs, resilvering can be the hardest thing the drive
does, and in my experience is likely to show up errors that haven't
occurred before. But you make a good point about retrying the resilver
until it works, presuming I don't hit a "too many errors, device
faulted" condition. :-)

 

I would have liked to go RaidZ2, but performance has dictated mirroring.
Physical, Financial and Capacity constraints have conspired together to
restrict me to 2 way mirroring rather than 3 way, which would have been
my next choice. :-)

 

 

Regards

Tristan

 

(Who is now going to spend the afternoon figuring out how to win lottery
by osmosis: http://en.wikipedia.org/wiki/Osmosis :-) )

 



From: Tim Cook [mailto:t...@cook.ms] 
Sent: Wednesday, 26 August 2009 3:01 PM
To: Tristan Ball
Cc: thomas; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Using consumer drives in a zraid2

 

 

On Tue, Aug 25, 2009 at 11:38 PM, Tristan Ball
tristan.b...@leica-microsystems.com wrote:

Not upset as such :-)

 

What I'm worried about is the time period where the pool is resilvering to
the hot spare. For example: one half of a mirror has failed completely,
and the mirror is being rebuilt onto the spare - if I get a read error
from the remaining half of the mirror, then I've lost data. If the RE
drive returns an error for a request that a consumer drive would have
(eventually) returned, then in this specific case I would have been
better off with the consumer drive.

 

That said, my initial ZFS systems are built with consumer drives, not
Raid Editions, as much as anything because we got burned by some early RE
drives in some of our existing raid boxes here, so I had a generally low
opinion of them. However, having done a little more reading about the
error recovery time stuff, I will also be putting in RE drives for the
production systems, and moving the consumer drives to the DR systems.

 

My logic is pretty straightforward:

 

Complete disk failures are comparatively rare, while media or transient
errors are far more common. As a media I/O or transient error on the
drive can affect the performance of the entire pool, I'm best off with
the RE drives to mitigate that. The risk of a double disk failure as
described above is partially mitigated by regular scrubs. The impact of
a double disk failure is mitigated by send/recv'ing to another box, and
catastrophic and human failures are partially mitigated by backing the
whole lot up to tape. :-)
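
(To make that concrete, the sort of thing I mean is nothing fancier than a
couple of cron-driven commands; this is only a rough sketch, with invented
pool, dataset, snapshot and host names:)

  # root crontab: weekly scrub, so latent media errors are found before a
  # resilver has to find them
  0 2 * * 0 /usr/sbin/zpool scrub data

  # nightly incremental send/recv to the DR box (the previous snapshot
  # name is assumed to be tracked by whatever script drives this)
  zfs snapshot data/vm@2009-08-26
  zfs send -i data/vm@2009-08-25 data/vm@2009-08-26 | \
      ssh drbox /usr/sbin/zfs recv -F drpool/vm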

 

Regards

Tristan.


The part you're missing is that the good drive should have flagged itself
as bad long, long before a consumer drive would have.  That being the case,
the odds are far, far lower that you'd get a bad read from an enterprise
drive than that you'd get a good read out of a consumer drive by constantly
retrying.  That's ignoring the fact that you could re-issue the resilver
repeatedly until you got the response you wanted from the good drive.
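
(In zpool terms that retry is roughly the following, with a made-up pool
and device name:)

  zpool clear data c0t3d0   # clear the error counts / fault state on the device
  zpool scrub data          # then force another full pass over the pool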


In any case, unless performance is absolutely forcing you to do
otherwise, if you're that paranoid just do a raid-z2/3, and you won't
have to worry about it.  The odds of 4 drives not returning valid data
are so low (even among RE drives) that you might as well stop working and
live in a hole (you're more likely to be hit by a meteor, or to win
the lottery by osmosis).

I KIIID.

--Tim 



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Tutorial at LISA09

2009-08-21 Thread Tristan Ball



Marcus wrote:

- how to best handle broken disks/controllers without ZFS hanging or
being unable to replace the disk
  
A definite +1 here. I realise it's something that Sun probably considers 
fixed by the disk/controller drivers, however many of us are using 
OpenSolaris on non-Sun hardware, and can't necessarily test in advance 
how the system is going to behave when things fail! So any techniques 
available to mitigate against an entire pool hanging because one device 
is doing something dumb would be very useful.


I guess that might move away from pure ZFS in terms of tutorial content, 
but as a sysadmin, it's the whole system we care about!

That would be my personal wishlist. I hope the presentation will be
made public after the event. ;-)

  

+1 :-)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!

2009-08-13 Thread Tristan Ball
In my testing, VMware doesn't see the vm1 and vm2 filesystems. VMware
doesn't have an automounter, and doesn't traverse NFSv4 sub-mounts
(whatever the formal name for them is). Actually, it doesn't support
NFSv4 at all! 

Regards,
Tristan.

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of James Hess
Sent: Thursday, 13 August 2009 3:38 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!

 The real benefit of using a
 separate zvol for each vm is the instantaneous
 cloning of a machine, and the clone will take almost
 no additional space initially. In our case we build a

You don't have to use ZVOL devices to do that.
As mentioned by others...

 zfs create my_pool/group1
 zfs create my_pool/group1/vm1
 zfs create my_pool/group1/vm2

In this case,  'vm1'  and 'vm2'  are on separate filesystems, that will
show up in 'zfs list',  since 'zfs create' was used to make them.  But
they are still both under the common mount point  '/my_pool/group1'

Now, you could
zfs snapshot  my_pool/group1/v...@snap-1-2009-06-12
zfs clone  my_pool/group1/v...@snap-1-2009-06-12   my_pool/group1/vm3
zfs promote  my_pool/group1/vm3

And you would then have your clone, also under the common mount point..
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zdb CKSUM stats vary?

2009-08-05 Thread Tristan Ball
Can anyone tell me why successive runs of zdb would show very
different values for the cksum column? I had thought these counters were
"since last clear", but that doesn't appear to be the case?

If I run zdb poolname, right at the end of the output, it lists pool
statistics:

                                  capacity     operations    bandwidth       errors
description                      used  avail   read  write   read  write   read write cksum
data                            1.46T  7.63T    117      0  7.89M      0      0     0     9
  /dev/dsk/c0t21D0230F0298d0s0   150G   781G     11      0   803K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d1s0   150G   781G     11      0   791K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d2s0   150G   781G     11      0   803K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d3s0   150G   781G     11      0   807K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d6s0   150G   781G     11      0   811K      0      0     0     2
  /dev/dsk/c0t21D0230F0298d7s0   150G   781G     12      0   817K      0      0     0     4
  /dev/dsk/c0t21D0230F0298d8s0   150G   781G     11      0   815K      0      0     0     4
  /dev/dsk/c0t21D0230F0298d9s0   150G   781G     11      0   797K      0      0     0    14
  /dev/dsk/c0t21D0230F0298d10s0  150G   781G     11      0   822K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d11s0  150G   781G     11      0   814K      0      0     0     4

If I run it again:

                                  capacity     operations    bandwidth       errors
description                      used  avail   read  write   read  write   read write cksum
data                            1.46T  7.63T    108      0  5.72M      0      0     0     3
  /dev/dsk/c0t21D0230F0298d0s0   150G   781G     10      0   583K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d1s0   150G   781G     10      0   570K      0      0     0    19
  /dev/dsk/c0t21D0230F0298d2s0   150G   781G     11      0   596K      0      0     0    17
  /dev/dsk/c0t21D0230F0298d3s0   150G   781G     11      0   597K      0      0     0     3
  /dev/dsk/c0t21D0230F0298d6s0   150G   781G     10      0   591K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d7s0   150G   781G     11      0   586K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d8s0   150G   781G     10      0   591K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d9s0   150G   781G     10      0   569K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d10s0  150G   781G     10      0   586K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d11s0  150G   781G     10      0   589K      0      0     0     2

If I run zdb -vs data I get:

                                  capacity     operations    bandwidth       errors
description                      used  avail   read  write   read  write   read write cksum
data                            1.46T  7.63T     70      0  4.27M      0      0     0     0
  /dev/dsk/c0t21D0230F0298d0s0   150G   781G      8      0   526K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d1s0   150G   781G      6      0   385K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d2s0   150G   781G      6      0   385K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d3s0   150G   781G      6      0   413K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d6s0   150G   781G      8      0   522K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d7s0   150G   781G      8      0   550K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d8s0   150G   781G      6      0   385K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d9s0   150G   781G      6      0   377K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d10s0  150G   781G      6      0   404K      0      0     0     0
  /dev/dsk/c0t21D0230F0298d11s0  150G   781G      6      0   422K      0      0     0     0

A zpool status shows:

  pool: data
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME                   STATE     READ WRITE CKSUM
        data                   ONLINE       0     0     0
          c0t21D0230F0298d0    ONLINE       0     0     0
          c0t21D0230F0298d1    ONLINE       0     0     0
          c0t21D0230F0298d2    ONLINE       0     0     0
          c0t21D0230F0298d3    ONLINE       0     0     0
          c0t21D0230F0298d6    ONLINE       0     0     0
          c0t21D0230F0298d7    ONLINE       0     0     0
          c0t21D0230F0298d8    ONLINE       0     0     0
          c0t21D0230F0298d9    ONLINE       0     0     0
          c0t21D0230F0298d10   ONLINE       0     0     0
          c0t21D0230F0298d11   ONLINE       0     0     0



Thanks,
Tristan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need tips on zfs pool setup..

2009-08-03 Thread Tristan Ball
Were those tests you mentioned on RAID-5/6/RAID-Z/Z2, or on mirrored
volumes of some kind?
 
We've found here that VM loads on RAID-10 SATA volumes with relatively
high numbers of disks actually work pretty well - and depending on the
size of the drives, you quite often get more usable space too. ;-)
 
I suspect 80 VMs on 20 SATA disks might be pushing things though, but
it'll depend on the workload.
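
(For reference, a layout like that is just one pool built from lots of
small mirrors plus spares - roughly the 9 x mirror option below - along
the lines of this sketch, with invented device names:)

  zpool create tank \
    mirror c2t0d0 c2t1d0   mirror c2t2d0 c2t3d0   mirror c2t4d0 c2t5d0 \
    mirror c2t6d0 c2t7d0   mirror c2t8d0 c2t9d0   mirror c2t10d0 c2t11d0 \
    mirror c2t12d0 c2t13d0 mirror c2t14d0 c2t15d0 mirror c2t16d0 c2t17d0 \
    cache c3t0d0 c3t1d0 \
    spare c2t18d0 c2t19d0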
 
T



From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tim Cook
Sent: Tuesday, August 04, 2009 1:18 PM
To: Joachim Sandvik
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Need tips on zfs pool setup..




On Mon, Aug 3, 2009 at 3:34 PM, Joachim Sandvik
no-re...@opensolaris.org wrote:


I am looking at NAS software from Nexenta, and after some
initial testing I like what I see. So I think we will find funding in
the budget for a dual setup.

We are looking at a dual-CPU Supermicro server with about 32GB
RAM and 2 x 250GB OS disks, 21 x 1TB SATA disks, and 1 x 64GB SSD disk.

The system will use Nexenta's auto-cdp, which I think is based
on AVS, to remote mirror to a system a few miles away. The system will
mostly be serving as an NFS server for our VMware servers. We have about
80 VMs which access the VMFS datastores.

I have read that it's smart to use a few small raid groups in a
larger pool, but I am uncertain about placing 21 disks in one pool.

The setups I have thought of so far are:

1 pool with 3 x raidz2 groups of 6 x 1TB disks, 2 x 64GB SSD for
cache, and 2 spare disks. This should give us about 12TB.

Another setup I have been thinking about is:

1 pool with 9 x mirrors of 2 x 1TB, also with 2 spares and 2 x
64GB SSD.

Does anyone have a recommendation on what might be a good setup?


 

FWIW, I think you're nuts putting that many VMs on SATA disk, SSD as
cache or not.  If there's ANY kind of I/O load, those disks are going to
fall flat on their face.

VM I/O looks like completely random I/O from the storage perspective,
and it tends to be pretty darn latency sensitive.  Good luck - I'd be
happy to be proven wrong.  Every test I've ever done has shown you need
SAS/FC for VMware workloads though.

--Tim



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] feature proposal

2009-07-31 Thread Tristan Ball
Because it means you can create ZFS snapshots from a non-Solaris, non-local 
client...


Like a Linux NFS client, or a Windows CIFS client.
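
For example, if the proposed mkdir interface existed, a Linux NFS client
could do something like this (dataset and snapshot names invented):

  # dataset is NFS-mounted on the Linux box at /mnt/projects
  mkdir /mnt/projects/.zfs/snapshot/pre-upgrade    # would create the snapshot
  rmdir /mnt/projects/.zfs/snapshot/pre-upgrade    # would destroy it again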

T

dick hoogendijk wrote:

On Wed, 29 Jul 2009 17:34:53 -0700
Roman V Shaposhnik r...@sun.com wrote:

  

On the read-write front: wouldn't it be cool to be able to snapshot
things by:
$ mkdir .zfs/snapshot/snap-name



I've followed this thread but I fail to see the advantages of this. I
guess I'm missing something here. Can you explain to me why the above would
be better (nice to have) than zfs create whate...@now?

  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD's and ZFS...

2009-07-25 Thread Tristan Ball
The 128G Supertalent Ultradrive ME. The larger version of the drive 
mentioned in the original post. Sorry, should have made that a little 
clearer. :-)


T

Kyle McDonald wrote:

Tristan Ball wrote:

It just so happens I have one of the 128G and two of the 32G versions in
my drawer, waiting to go into our DR disk array when it arrives.
  

Hi Tristan,

Just so I can be clear, What model/brand are the drives you were testing?

 -Kyle


I dropped the 128G into a spare Dell 745 (2GB RAM) and used an Ubuntu
live CD to run some simple iozone tests on it. I had some stability
issues with iozone crashing; however, I did get some results...

Attached are what I've got. I intended to do a first set of tests covering
sequential reads, writes, and a random IO mix, and then a second set
running a streaming read or streaming write in parallel with the random
IO mix, as I understand many SSDs have trouble with those kinds of
workloads.

As it turns out, so did my test PC. :-)
I've used 8K IO sizes for all the stage one tests - I know I might get
it to go faster with a larger size, but I like to know how well systems
will do when I treat them badly!

The Stage_1_Ops_thru_run is interesting. 2000+ ops/sec on random writes,
5000 on reads.


The streaming write load and random overwrites were started at the
same time, although I didn't see which one finished first, so it's
possible that the stream finished first and allowed the random run to
finish strong. Basically, take these numbers with several large grains of 
salt!

Interestingly, the random IO mix doesn't slow down much, but the
streaming writes are hurt a lot.

Regards,
Tristan.



-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of thomas
Sent: Friday, 24 July 2009 5:23 AM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] SSD's and ZFS...

 

 I think it is a great idea, assuming the SSD has good write
 performance.
 This one claims up to 230MB/s read and 180MB/s write and it's only
 $196.
 
 http://www.newegg.com/Product/Product.aspx?Item=N82E16820609393
 
 Compared to this one (250MB/s read and 170MB/s write) which is $699.
 
 Are those claims really trustworthy? They sound too good to be true!

MB/s numbers are not a good indication of performance. What you should
pay attention to are usually random IOPS write and read. They tend to
correlate a bit, but those numbers on newegg are probably just best case
from the manufacturer.

In the world of consumer grade SSDs, Intel has crushed everyone on IOPS
performance.. but the other manufacturers are starting to catch up a
bit.
  



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD's and ZFS...

2009-07-25 Thread Tristan Ball



Bob Friesenhahn wrote:

On Fri, 24 Jul 2009, Tristan Ball wrote:


I've used 8K IO sizes for all the stage one tests - I know I might get
it to go faster with a larger size, but I like to know how well systems
will do when I treat them badly!

The Stage_1_Ops_thru_run is interesting. 2000+ ops/sec on random writes,
5000 on reads.


This seems like rather low random write performance.  My 12-drive 
array of rotating rust obtains 3708.89 ops/sec.  In order to be 
effective, it seems that a synchronous write log should perform 
considerably better than the backing store.


That really depends on what you're trying to achieve. Even if this 
single drive is only showing equivalent performance to a twelve-drive 
array (and I suspect your 3700 ops/sec would slow down over a bigger 
data set, as seeks make more of an impact), that still means that if the 
SSD is used as a ZIL, those sync writes don't have to be written to the 
spinning disks immediately, giving the scheduler a better chance to order 
the IOs and providing better overall latency for the requests that are 
going to disk.


And while I didn't make it clear, I actually intend to use the 128G 
drive as an L2ARC. While its effectiveness will obviously depend on the 
access patterns, the cost of adding the drive to the array is basically 
trivial, and it significantly increases the total ops/sec the array is 
capable of during those times that the access patterns allow for it. 
For my use, it was a case of "might as well". :-)
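
(For reference, adding the drive in either role is a one-liner; a rough
sketch with invented device names:)

  zpool add data log c2t0d0     # dedicated ZIL (separate intent log) device
  zpool add data cache c2t1d0   # L2ARC (cache) device, the use described above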


Tristan.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD's and ZFS...

2009-07-24 Thread Tristan Ball
It just so happens I have one of the 128G and two of the 32G versions in
my drawer, waiting to go into our DR disk array when it arrives. 

I dropped the 128G into a spare Dell 745 (2GB RAM) and used an Ubuntu
live CD to run some simple iozone tests on it. I had some stability
issues with iozone crashing; however, I did get some results...

Attached are what I've got. I intended to do a first set of tests covering
sequential reads, writes, and a random IO mix, and then a second set
running a streaming read or streaming write in parallel with the random
IO mix, as I understand many SSDs have trouble with those kinds of
workloads.

As it turns out, so did my test PC. :-) 

I've used 8K IO sizes for all the stage one tests - I know I might get
it to go faster with a larger size, but I like to know how well systems
will do when I treat them badly!

The Stage_1_Ops_thru_run is interesting. 2000+ ops/sec on random writes,
5000 on reads.


The streaming write load and random overwrites were started at the
same time, although I didn't see which one finished first, so it's
possible that the stream finished first and allowed the random run to
finish strong. Basically, take these numbers with several large grains of
salt!

Interestingly, the random IO mix doesn't slow down much, but the
streaming writes are hurt a lot.

Regards,
Tristan.



-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of thomas
Sent: Friday, 24 July 2009 5:23 AM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] SSD's and ZFS...

 I think it is a great idea, assuming the SSD has good write
performance.
 This one claims up to 230MB/s read and 180MB/s write and it's only
$196.
 
 http://www.newegg.com/Product/Product.aspx?Item=N82E16820609393
 
 Compared to this one (250MB/s read and 170MB/s write) which is $699.
 
 Are those claims really trustworthy? They sound too good to be true!


MB/s numbers are not a good indication of performance. What you should
pay attention to are usually random IOPS write and read. They tend to
correlate a bit, but those numbers on newegg are probably just best case
from the manufacturer.

In the world of consumer grade SSDs, Intel has crushed everyone on IOPS
performance.. but the other manufacturers are starting to catch up a
bit.
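
(For the curious, iozone can report ops/sec directly rather than MB/s; a
rough sketch, with an invented test file name and sizes:

  # -O reports results in operations/sec; -i 0 / -i 2 select write/rewrite
  # and random read/write tests; -I uses direct I/O to bypass the page cache
  iozone -R -O -I -i 0 -i 2 -r 8k -s 1g -f /mnt/ssd/iozone.tmp
)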
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Iozone: Performance Test of File I/O
Version $Revision: 3.287 $
Compiled for 64 bit mode.
Build: linux 

Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap, Mark Montague, Dan Million, 
 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

Run began: Fri Jul 24 20:59:50 2009

Excel chart generation enabled
POSIX Async I/O (no bcopy). Depth 12 
Microseconds/op Mode. Output is in microseconds per operation.
Record Size 8 KB
File size set to 6291456 KB
Command line used: iozone -R -k 12 -i 0 -i 2 -N -r 8K -s 6G
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                              random    random     bkwd   record   stride
              KB  reclen    write  rewrite     read   reread     read     write     read  rewrite     read
         6291456       8       46       51                         200       401


iozone test complete.
Excel output is below:

Writer report
8
6291456   46 

Re-writer report
8
6291456   51 

Random read report
8
6291456   200 

Random write report
8
6291456   401 
Iozone: Performance Test of File I/O
Version $Revision: 3.287 $
Compiled for 64 bit mode.
Build: linux 

Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap

Re: [zfs-discuss] ZFS write I/O stalls

2009-07-03 Thread Tristan Ball

Is the system otherwise responsive during the ZFS sync cycles?

I ask because I think I'm seeing a similar thing - except that it's not 
only other writers that block; it seems like other interrupts are 
blocked too. Pinging my ZFS server at 1-second intervals results in large 
delays while the system syncs, followed by normal response times while 
the system buffers more input...
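
For what it's worth, the crude check I've been using is just a timestamped
ping from another box (hostname invented), watching for multi-second gaps
that line up with the sync cycles:

  # Solaris ping wants -s for continuous per-packet output
  ping -s zfsbox | while read line; do
      echo "$(date '+%H:%M:%S') $line"
  done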


Thanks,
   Tristan.

Bob Friesenhahn wrote:
It has been quite some time (about a year) since I did testing of 
batch processing with my software (GraphicsMagick).  In between time, 
ZFS added write-throttling.  I am using Solaris 10 with kernel 141415-03.


Quite a while back I complained that ZFS was periodically stalling the 
writing process (which UFS did not do).  The ZFS write-throttling 
feature was supposed to avoid that.  In my testing today I am still 
seeing ZFS stall the writing process periodically.  When the process 
is stalled, there is a burst of disk activity, a burst of context 
switching, and total CPU use drops to almost zero. Zpool iostat says 
that read bandwidth is 15.8M and write bandwidth is 15.8M over a 60 
second averaging interval.  Since my drive array is good for writing 
over 250MB/second, this is a very small write load and the array is 
loafing.


My program uses the simple read-process-write approach.  Each file 
written (about 8MB/file) is written contiguously and written just 
once.  Data is read and written in 128K blocks.  For this application 
there is no value obtained by caching the file just written.  From 
what I am seeing, reading occurs as needed, but writes are being 
batched up until the next ZFS synchronization cycle.  During the ZFS 
synchronization cycle it seems that processes are blocked from 
writing. Since my system has a lot of memory and the ARC is capped at 
10GB, quite a lot of data can be queued up to be written.  The ARC is 
currently running at its limit of 10GB.


If I tell my software to invoke fsync() before closing each written 
file, then the stall goes away, but the program then needs to block so 
there is less beneficial use of the CPU.


If this application stall annoys me, I am sure that it would really 
annoy a user with mission-critical work which needs to get done on a 
uniform basis.


If I run this little script then the application runs more smoothly 
but I see evidence of many shorter stalls:


while true
do
  sleep 3
  sync
done

Is there a solution in the works for this problem?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, 
http://www.simplesystems.org/users/bfriesen/

GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!

2009-07-02 Thread Tristan Ball
According to the link below, VMware will only use a single TCP session
per NFS datastore, which means you're unlikely to get it to travel down more
than one interface on the VMware side, even if you can find a way to do
it on the Solaris side.

http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html
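
One common workaround (sketched here with made-up interface, address, and
share names, so treat it as an illustration only) is to publish more than
one datastore, each on its own IP on the Solaris side, so that each mount
gets its own TCP session:

  # second IP address on the storage box
  ifconfig e1000g0:1 plumb
  ifconfig e1000g0:1 192.168.10.11 netmask 255.255.255.0 up

  # second filesystem shared out as a second datastore
  zfs create data/vmware2
  zfs set sharenfs=rw,root=esxhost1 data/vmware2

  # then mount datastore1 via 192.168.10.10 and datastore2 via 192.168.10.11
  # on the ESX side, so the sessions land on different addresses/links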

T

-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brent Jones
Sent: Thursday, 2 July 2009 12:58 PM
To: HUGE | David Stahl
Cc: Steve Madden; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!

On Wed, Jul 1, 2009 at 7:29 PM, HUGE | David Stahldst...@hugeinc.com
wrote:
 The real benefit of using a separate zvol for each vm is the
 instantaneous cloning of a machine, and the clone will take almost no
 additional space initially. In our case we build a template VM and
then
 provision our development machines from this.
 However the limit of 32 nfs mounts per esx machine is kind of a
bummer.


 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org on behalf of Steve Madden
 Sent: Wed 7/1/2009 8:46 PM
 To: zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] ZFS, ESX ,and NFS. oh my!

 Why the use of zvols, why not just:

 zfs create my_pool/group1
 zfs create my_pool/group1/vm1
 zfs create my_pool/group1/vm2

 and export my_pool/group1

 If you don't want the people in group1 to see vm2 anymore just zfs
rename it
 to a different group.

 I'll admit I am coming into this green - but if you're not doing
iscsi, why
 zvols?

 SM.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



Is there a supported way to multipath NFS? That's one benefit of iSCSI:
your VMware hosts can multipath to a target to get more speed/HA...

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss