Re: [zfs-discuss] Running on Dell hardware?

2010-10-22 Thread Henrik Johansen

'Tim Cook' wrote:

[... snip ... ]


Dell requires Dell branded drives as of roughly 8 months ago.  I don't
think there was ever an H700 firmware released that didn't require
this.  I'd bet you're going to waste a lot of money to get a drive the
system refuses to recognize.


This should no longer be an issue as Dell has abandoned that practice
because of customer pressure.


--Tim





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Running on Dell hardware?

2010-10-13 Thread Henrik Johansen

'Edward Ned Harvey' wrote:

From: Henrik Johansen [mailto:hen...@scannet.dk]

The 10g models are stable - especially the R905's, which are real workhorses.


You would generally consider all your machines stable now?
Can you easily pdsh to all those machines?


Yes - the only problem child has been 1 R610 (the other 2 that we have
in production have not shown any signs of trouble)


kstat | grep current_cstate ; kstat | grep supported_max_cstates

I'd really love to see if "some current_cstate is higher than
supported_max_cstates" is an accurate indicator of system instability.


Here's a little sample from different machines : 


R610 #1

current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  0
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2

R610 #2

current_cstate  3
current_cstate  0
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
current_cstate  3
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2
supported_max_cstates   2

PE2900

current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1
supported_max_cstates   1

PER905 
current_cstate  1

current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  1
current_cstate  0
current_cstate  1
current_cstate  1
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
supported_max_cstates   0
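
To pair the two counters per CPU instead of eyeballing two grep runs,
something like this minimal shell sketch should do (the kstat output
format is assumed to match the samples above; adjust the field handling
if your build reports these counters differently):

kstat -p | egrep 'current_cstate|supported_max_cstates' | nawk -F'[:\t]+' '
    /current_cstate/        { cur[$2] = $NF }
    /supported_max_cstates/ { max[$2] = $NF }
    END {
        for (i in cur)
            printf("cpu%s: current=%s supported_max=%s%s\n", i, cur[i],
                max[i], (cur[i] + 0 > max[i] + 0) ? "  <-- exceeds" : "")
    }'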
  

Re: [zfs-discuss] Running on Dell hardware?

2010-10-13 Thread Henrik Johansen

'Edward Ned Harvey' wrote:

I have a Dell R710 which has been flaky for some time.  It crashes
about once per week.  I have literally replaced every piece of hardware
in it, and reinstalled Sol 10u9 fresh and clean.

I am wondering if other people out there are using Dell hardware, with
what degree of success, and in what configuration?


We are running (Open)Solaris on lots of 10g servers (PE2900, PE1950, PE2950,
R905) and some 11g (R610 and soon some R815) with both PERC and non-PERC
controllers and lots of MD1000's.

The 10g models are stable - especially the R905's, which are real workhorses.

We have had only one 11g server (R610) which caused trouble. The box
froze at least once a week - after replacing almost the entire box I
switched from the old iscsitgt to COMSTAR and the box has been stable
since. Go figure ...

I might add that none of these machines use the onboard Broadcom NICs.


The failure seems to be related to the perc 6i.  For some period around
the time of crash, the system still responds to ping, and anything
currently in memory or running from remote storage continues to
function fine.  But new processes that require the local storage
... Such as inbound ssh etc, or even physical login at the console
... those are all hosed.  And eventually the system stops responding to
ping.  As soon as the problem starts, the only recourse is power cycle.

I can't seem to reproduce the problem reliably, but it does happen
regularly.  Yesterday it happened several times in one day, but
sometimes it will go 2 weeks without a problem.

Again, just wondering what other people are using, and experiencing.
To see if any more clues can be found to identify the cause.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O statistics for each file system

2010-05-17 Thread Henrik Johansen

On 05/17/10 03:05 PM, eXeC001er wrote:

perfect!

I found info about kstat for Perl.

Where can I find the meaning of each field?


Most of them can be found here under the section "I/O kstat" :

http://docs.sun.com/app/docs/doc/819-2246/kstat-3kstat?a=view



r...@atom:~# kstat stmf:0:stmf_lu_io_ff00d1c2a8f8
1274100947
module: stmf                            instance: 0
name:   stmf_lu_io_ff00d1c2a8f8         class:    io
 crtime  2333040.65018394
 nread   9954962
 nwritten5780992
 rcnt0
 reads   599
 rlastupdate 2334856.48028583
 rlentime2.792307252
 rtime   2.453258966
 snaptime2335022.3396771
 wcnt0
 wlastupdate 2334856.43951113
 wlentime0.103487047
 writes  510
 wtime   0.069508209
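
If you just want rough per-LUN rates instead of raw counters, kstat's
interval mode is a quick starting point - a minimal sketch using the LU
name from the sample above (substitute your own; the delta between
samples divided by the interval gives bytes per second):

# kstat -p stmf:0:stmf_lu_io_ff00d1c2a8f8:nread \
    stmf:0:stmf_lu_io_ff00d1c2a8f8:nwritten 10 6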

2010/5/17 Henrik Johansen <hen...@scannet.dk>

Hi,


On 05/17/10 01:57 PM, eXeC001er wrote:

good.
but this utility is used to view statistics for mounted FS.
How can i view statistics for iSCSI shared FS?


fsstat(1M) relies on certain kstat counters for its operation -
last I checked, I/O against zvols does not update those counters.

If you are using newer builds and COMSTAR you can use the stmf
kstat counters to get I/O details per target and per LUN.

Thanks.

2010/5/17 Darren J Moffat <darr...@opensolaris.org>


On 17/05/2010 12:41, eXeC001er wrote:

I known that i can view statistics for the pool (zpool
iostat).
I want to view statistics for each file system on pool.
Is it
possible?


See fsstat(1M)

--
Darren J Moffat




--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss





--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O statistics for each file system

2010-05-17 Thread Henrik Johansen

Hi,

On 05/17/10 01:57 PM, eXeC001er wrote:

good.
but this utility is used to view statistics for mounted FS.
How can i view statistics for iSCSI shared FS?


fsstat(1M) relies on certain kstat counters for its operation -
last I checked, I/O against zvols does not update those counters.

If you are using newer builds and COMSTAR you can use the stmf kstat 
counters to get I/O details per target and per LUN.



Thanks.

2010/5/17 Darren J Moffat <darr...@opensolaris.org>

On 17/05/2010 12:41, eXeC001er wrote:

I known that i can view statistics for the pool (zpool iostat).
I want to view statistics for each file system on pool. Is it
possible?


See fsstat(1M)

--
Darren J Moffat





--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris

2010-02-22 Thread Henrik Johansen

On 02/22/10 09:52 PM, Tim Cook wrote:



On Mon, Feb 22, 2010 at 2:21 PM, Jacob Ritorto <jacob.rito...@gmail.com> wrote:


Since it seems you have absolutely no grasp of what's happening here,


Coming from the guy proclaiming the sky is falling without actually
having ANY official statement whatsoever to back up that train of thought.

perhaps it would be best for you to continue to sit idly by and let
this happen.  Thanks for helping out with the crude characterisations,
though.


Idly let what happen?  The unconfirmed death of opensolaris that you've
certified for us all without any actual proof?


Well - the lack of support subscriptions *is* a death sentence for 
OpenSolaris in many companies and I believe that this is what the OP 
complained about.




Do you understand that the OpenSolaris page has a sunset in
it and the Solaris page doesn't?


I understand previous versions of every piece of software Oracle sells
have Sunset pages, yes.  If you read the page I sent you, it clearly
states that every release of Opensolaris gets 5 years of support from
GA.  That doesn't mean they aren't releasing another version.  That
doesn't mean they're ending the opensolaris project.  That doesn't mean
they are no longer selling support for it.  Had you actually read the
link I posted, you'd have figured that out.

Sun provides contractual support on the OpenSolaris OS for up to five
years from the product's first General Availability (GA) date as
described <http://www.sun.com/service/eosl/eosl_opensolaris.html>.
OpenSolaris Package Updates are released approximately every 6 months.
OpenSolaris Subscriptions entitle customers during the term of the
Customer's Subscription contract to receive support on their current
version of OpenSolaris, as well as receive individual Package Updates
and OpenSolaris Support Repository Package Updates when made
commercially available by Sun. Sun may require a Customer to download
and install Package Updates or OpenSolaris OS Updates that have been
released since Customer's previous installation of OpenSolaris,
particularly when fixes have already been

  Have you spent enough (any) time
trying to renew your contracts only to see that all mentions of
OpenSolaris have been deleted from the support pages over the last few
days?


Can you tell me which Oracle rep you've spoken to who confirmed the
cancellation of Opensolaris?  It's funny, nobody I've talked to seems to
have any idea what you're talking about.  So please, a name would be
wonderful so I can direct my inquiry to this as-of-yet unnamed source.


I spoke to our local Oracle sales office last week because I 
wanted to purchase a new OpenSolaris support contract - I was informed 
that this was no longer possible and that Oracle is unable to provide 
paid support for OpenSolaris at this time.




  This, specifically, is what has been yanked out from under me
and my company.  This represents years of my and my team's effort and
investment.


Again, without some sort of official word, nothing has changed...


I take the official Oracle website to be rather ... official ?

Lets recap, shall we ?

a) Almost every trace of OpenSolaris Support subscriptions vanished from 
the official website within the last 14 days.


b) An Oracle sales rep informed me personally last week that I could no 
longer purchase support subscriptions for OpenSolaris.


Please, do me a favor and call your local Oracle rep and ask for an 
Opensolaris Support subscription quote and let us know how it goes ...




It says right here those contracts are for both solaris AND opensolaris.

http://www.sun.com/service/subscriptions/index.jsp

Click Sun System Service Plans
<http://www.sun.com/service/serviceplans/sunspectrum/index.jsp>:
http://www.sun.com/service/serviceplans/sunspectrum/index.jsp


  Sun System Service Plans for Solaris

Sun System Service Plans for the Solaris Operating System provide
integrated hardware and* Solaris OS (or OpenSolaris OS)* support service
coverage to help keep your systems running smoothly. This single price,
complete system approach is ideal for companies running Solaris on Sun
hardware.



Sun System Service Plans != (Open)Solaris Support subscriptions


But thank you for the scare, Chicken Little.





--Tim



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future of OpenSolaris

2010-02-22 Thread Henrik Johansen

On 02/22/10 03:35 PM, Jacob Ritorto wrote:

On 02/22/10 09:19, Henrik Johansen wrote:

On 02/22/10 02:33 PM, Jacob Ritorto wrote:

On 02/22/10 06:12, Henrik Johansen wrote:

Well - one thing that makes me feel a bit uncomfortable is the fact
that you can no longer buy OpenSolaris Support subscriptions.

Almost every trace of it has vanished from the Sun/Oracle website and a
quick call to our local Sun office confirmed that they apparently no
longer sell them.


I was actually very startled to see that since we're using it in
production here. After digging through the web for hours, I found that
OpenSolaris support is now included in Solaris support. This is a win
for us because we never know if a particular box, especially a dev box,
is going to remain Solaris or OpenSolaris for the duration of a support
purchase and now we're free to mix and mingle. If you refer to the
Solaris support web page (png attached if the mailing list allows),
you'll see that OpenSolaris is now officially part of the deal and is no
longer being treated as a second class support offering.


That would be *very* nice indeed. I have checked the URL in your
screenshot but I am getting a different result (png attached).

Ohwell - I'll just have to wait and see.


Confirmed your finding Henrik.  This is a showstopper for us as the
higherups are already quite leery of Sun/Oracle and the future of
Solaris.  I'm calling Oracle to see if I can get some answers.  The SUSE
folks recently took a big chunk of our UNIX business here and
OpenSolaris was my main tool in battling that.  For us, the loss of
OpenSolaris and its support likely indicates the end of Solaris altogether.


Well - I too am reluctant to put more OpenSolaris boxes into production 
until this matter has been resolved.


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future of OpenSolaris

2010-02-22 Thread Henrik Johansen

On 02/22/10 12:00 PM, Michael Ramchand wrote:

I think Oracle have been quite clear about their plans for OpenSolaris.
They have publicly said they plan to continue to support it and the
community.

They're just a little distracted right now because they are in the
process of on-boarding many thousands of Sun employees, and trying to get
them feeling happy, comfortable and at home in their new surroundings so
that they can start making money again.

The silence means that you're in a queue and they forgot to turn the
"hold" music on. Have patience. :-)


Well - one thing that makes me feel a bit uncomfortable is the fact 
that you can no longer buy OpenSolaris Support subscriptions.


Almost every trace of it has vanished from the Sun/Oracle website and a 
quick call to our local Sun office confirmed that they apparently no 
longer sell them.



On 02/22/10 09:22, Eugen Leitl wrote:

Oracle's silence is starting to become a bit ominous. What are
the future options for zfs, should OpenSolaris be left dead
in the water by Suracle? I have no insight into who core
zfs developers are (have any been fired by Sun even prior to
the merger?), and who's paying them. Assuming a worst case
scenario, what would be the best candidate for a fork? Nexenta?
Debian already included FreeBSD as a kernel flavor into its
fold, it seems Nexenta could be also a good candidate.

Maybe anyone in the know could provide a short blurb on what
the state is, and what the options are.








--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)

2010-01-29 Thread Henrik Johansen

On 01/29/10 07:36 PM, Richard Elling wrote:

On Jan 29, 2010, at 12:45 AM, Henrik Johansen wrote:

On 01/28/10 11:13 PM, Lutz Schumann wrote:

While thinking about ZFS as the next generation filesystem
without limits I am wondering if the real world is ready for this
kind of incredible technology ...

I'm actually speaking of hardware :)

ZFS can handle a lot of devices. Once the import bug
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6761786)
is fixed it should be able to handle a lot of disks.


That was fixed in build 125.


I want to ask the ZFS community and users what large scale
deployments are out there.  How many disks ? How much capacity ?
Single pool or many pools on a server ? How does resilver work in
those environments ? How do you back up ? What is the experience
so far ? Major headaches ?

It would be great if large scale users would share their setups
and experiences with ZFS.


The largest ZFS deployment that we have is currently comprised of
22 Dell MD1000 enclosures (330 750 GB Nearline SAS disks). We have
3 head nodes and use one zpool per node, comprised of rather narrow
(5+2) RAIDZ2 vdevs. This setup is exclusively used for storing
backup data.


This is an interesting design.  It looks like a good use of hardware
and redundancy for backup storage. Would you be able to share more of
the details? :-)


Each head node (Dell PE 2900's) has 3 PERC 6/E controllers (LSI 1078 
based) with 512 MB cache each.


The PERC 6/E supports both load-balancing and path failover so each 
controller has 2 SAS connections to a daisy chained group of 3 MD1000 
enclosures.


The RAIDZ2 vdev layout was chosen because it gives a reasonable 
performance vs space ratio and it maps nicely onto the 15 disk MD1000's 
( 2 x (5+2) +1 ).
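
For the curious, at pool creation time that layout looks roughly like
the following sketch - the device names below are made up for
illustration (one MD1000 worth of disks: two 7-disk RAIDZ2 vdevs plus a
hot spare), not our actual configuration:

# zpool create backup01 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
    raidz2 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
    spare c2t14d0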


There is room for improvement in the design (fewer disks per controller, 
faster PCI Express slots, etc) but performance is good enough for our 
current needs.




Resilver times could be better - I am sure that this will improve
once we upgrade from S10u9 to 2010.03.


Nit: Solaris 10 u9 is 10/03 or 10/04 or 10/05, depending on what you
read. Solaris 10 u8 is 11/09.


One of the things that I am missing in ZFS is the ability to
prioritize background operations like scrub and resilver. All our
disks are idle during daytime and I would love to be able to take
advantage of this, especially during resilver operations.


Scrub I/O is given the lowest priority and is throttled. However, I
am not sure that the throttle is in Solaris 10, because that source
is not publicly available. In general, you will not notice a resource
cap until the system utilization is high enough that the cap is
effective.  In other words, if the system is mostly idle, the scrub
consumes the bulk of the resources.


That's not what I am seeing - resilver operations crawl even when the 
pool is idle.
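
For what it is worth, the newer OpenSolaris scan code exposes a few
throttle tunables (zfs_scrub_delay, zfs_resilver_delay) that can be
inspected and nudged with mdb - a sketch only, and it assumes your
build actually has these symbols (Solaris 10 may not):

# echo "zfs_scrub_delay/D" | mdb -k
# echo "zfs_resilver_delay/D" | mdb -k
# echo "zfs_resilver_delay/W0t0" | mdb -kw     (remove the delay - at your own risk)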



This setup has been running for about a year with no major issues
so far. The only hiccups we've had were all HW related (no fun in
firmware upgrading 200+ disks).


ugh. -- richard




--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale ZFS deployments out there (>200 disks)

2010-01-29 Thread Henrik Johansen

On 01/28/10 11:13 PM, Lutz Schumann wrote:

While thinking about ZFS as the next generation filesystem without
limits I am wondering if the real world is ready for this kind of
incredible technology ...

I'm actually speaking of hardware :)

ZFS can handle a lot of devices. Once the import bug
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6761786)
is fixed it should be able to handle a lot of disks.


That was fixed in build 125.


I want to ask the ZFS community and users what large scale deployments
are out there.  How many disks ? How much capacity ? Single pool or
many pools on a server ? How does resilver work in those
environments ? How do you back up ? What is the experience so far ?
Major headaches ?

It would be great if large scale users would share their setups and
experiences with ZFS.


The largest ZFS deployment that we have is currently comprised of 22 
Dell MD1000 enclosures (330 750 GB Nearline SAS disks). We have 3 head 
nodes and use one zpool per node, comprised of rather narrow (5+2) 
RAIDZ2 vdevs. This setup is exclusively used for storing backup data.


Resilver times could be better - I am sure that this will improve once 
we upgrade from S10u9 to 2010.03.


One of the things that I am missing in ZFS is the ability to prioritize 
background operations like scrub and resilver. All our disks are idle 
during daytime and I would love to be able to take advantage of this, 
especially during resilver operations.


This setup has been running for about a year with no major issues so 
far. The only hiccups we've had were all HW related (no fun in firmware 
upgrading 200+ disks).



Will you ? :) Thanks, Robert



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pulsing write performance

2009-08-27 Thread Henrik Johansen


Ross Walker wrote:

On Aug 27, 2009, at 4:30 AM, David Bond  wrote:


Hi,

I was directed here after posting in CIFS discuss (as I first
thought that it could be a CIFS problem).


I posted the following in CIFS:

When using iometer from windows to the file share on opensolaris  
svn101 and svn111 I get pauses every 5 seconds of around 5 seconds  
(maybe a little less) where no data is transfered, when data is  
transfered it is at a fair speed and gets around 1000-2000 iops with  
1 thread (depending on the work type). The maximum read response  
time is 200ms and the maximum write response time is 9824ms, which  
is very bad, an almost 10 seconds delay in being able to send data  
to the server.
This has been experienced on 2 test servers; the same servers have
also been tested with Windows Server 2008 and they haven't shown this
problem (the share performance was slightly lower than CIFS, but it
was consistent, and the average access time and maximums were very
close).



I just noticed that if the server hasn't hit its target ARC size, the
pauses are for maybe .5 seconds, but as soon as it hits its ARC
target, the IOPS drop to around 50% of what they were and then there
are the longer pauses of around 4-5 seconds, and after every pause
the performance slows even more. So it appears it is definitely
server side.


This is with 100% random I/O with a spread of 33% write / 66% read, 2KB
blocks, over a 50GB file, no compression, and a 5.5GB target ARC size.




Also, I have just run some tests with different I/O patterns: 100%
sequential writes produce a consistent 2100 IOPS, except when
it pauses for maybe .5 seconds every 10 - 15 seconds.


100% random writes produce around 200 IOPS with a 4-6 second pause  
around every 10 seconds.


100% sequential reads produce around 3700 IOPS with no pauses, just
random peaks in response time (only 16ms) after about 1 minute of
running, so nothing to complain about.


100% random reads produce around 200 IOPS, with no pauses.

So it appears that writes cause a problem - what is causing these
very long write delays?


A network capture shows that the server doesn't respond to the write
from the client when these pauses occur.


Also, when using iometer, the initial file creation doesn't have any
pauses, so it might only happen when modifying files.


Any help on finding a solution to this would be really appreciated.


What version? And system configuration?

I think it might be the issue where ZFS/ARC write caches more than the
underlying storage can handle writing in a reasonable time.


There is a parameter to control how much is write cached, I believe it  
is zfs_write_override.


You should be able to disable the write throttle mechanism altogether
with the undocumented zfs_no_write_throttle tunable.

I never got around to testing this though ...
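
For anyone who wants to experiment, the usual ways to flip a tunable
like that are /etc/system or mdb - a sketch only (untested here, as
noted above, and the variable may not exist on every build):

# persistent, takes effect after a reboot - add to /etc/system:
set zfs:zfs_no_write_throttle = 1

# or on a live system, for testing only:
echo "zfs_no_write_throttle/W0t1" | mdb -kw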



-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Joseph L. Casale wrote:

Quick snipped from zpool iostat :

  mirror 1.12G   695G  0  0  0  0
c8t12d0  -  -  0  0  0  0
c8t13d0  -  -  0  0  0  0
  c7t2d04K  29.0G  0  1.56K  0   200M
  c7t3d04K  29.0G  0  1.58K  0   202M

The disks on c7 are both Intel X25-E 


Henrik,
So the SATA discs are in the MD1000 behind the PERC 6/E and how
have you configured/attached the 2 SSD slogs and L2ARC drive? If
I understand you, you have used 14 of the 15 slots in the MD so
I assume you have the 3 SSD's in the R905, what controller are
they running on?


The internal PERC 6/i controller - but I've had them on the PERC 6/E
during other test runs since I have a couple of spare MD1000's at hand. 


Both controllers work well with the SSD's.


Thanks!
jlc
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:

On Aug 5, 2009, at 2:49 AM, Henrik Johansen  wrote:


Ross Walker wrote:

On Aug 4, 2009, at 8:36 PM, Carson Gaspar  wrote:


Ross Walker wrote:

I get pretty good NFS write speeds with NVRAM (40MB/s 4k  
sequential  write). It's a Dell PERC 6/e with 512MB onboard.

...
there, dedicated slog device with NVRAM speed. It would be even   
better to have a pair of SSDs behind the NVRAM, but it's hard to   
find compatible SSDs for these controllers, Dell currently  
doesn't  even support SSDs in their RAID products :-(


Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support   
recently.


Yes, but the LSI support of SSDs is on later controllers.


Sure that's not just a firmware issue ?

My PERC 6/E seems to support SSD's :
# ./MegaCli -AdpAllInfo -a2 | grep -i ssd
Enable Copyback to SSD on SMART Error   : No
Enable SSD Patrol Read  : No
Allow SSD SAS/SATA Mix in VD : No
Allow HDD/SSD Mix in VD  : No


Controller info :

                Versions
            ================
Product Name    : PERC 6/E Adapter
Serial No       : 
FW Package Build: 6.0.3-0002

                Mfg. Data
            ================
Mfg. Date       : 06/08/07
Rework Date     : 06/08/07
Revision No     : 
Battery FRU     : N/A

                Image Versions in Flash:
            ================
FW Version         : 1.11.82-0473
BIOS Version       : NT13-2
WebBIOS Version    : 1.1-32-e_11-Rel
Ctrl-R Version     : 1.01-010B
Boot Block Version : 1.00.00.01-0008


I currently have 2 x Intel X25-E (32 GB) as dedicated slogs and 1 x
Intel X25-M (80 GB) for the L2ARC behind a PERC 6/i on my Dell R905
testbox.

So far there have been no problems with them.


Really?

Now you have my interest.

Two questions, did you get the X25 from Dell? Are you using it with a  
hot-swap carrier?


Knowing that these will work would be great news.


Those disks are not from Dell as they were incapable of delivering Intel
SSD's.

Just out of curiosity - do they have to be from Dell ?

I have tested the Intel SSD's on various Dell servers - they work
out-of-the-box with both their 2.5" and 3.5" trays (the 3.5" trays do
require a SATA interposer which is included with all SATA disks ordered
from them).


-Ross



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:

On Aug 5, 2009, at 3:09 AM, Henrik Johansen  wrote:


Ross Walker wrote:
On Aug 4, 2009, at 10:22 PM, Bob Friesenhahn wrote:



On Tue, 4 Aug 2009, Ross Walker wrote:
Are you sure that it is faster than an SSD?  The data is indeed   
pushed closer to the disks, but there may be considerably more   
latency associated with getting that data into the controller   
NVRAM cache than there is into a dedicated slog SSD.


I don't see how, as the SSD is behind a controller it still must   
make it to the controller.


If you take a look at 'iostat -x' output you will see that the   
system knows about a queue for each device.  If it was any other   
way, then a slow device would slow down access to all of the  
other  devices.  If there is concern about lack of bandwidth (PCI- 
E?) to  the controller, then you can use a separate controller for  
the SSDs.


It's not bandwidth. Though with a lot of mirrors that does become  
a  concern.


Well the duplexing benefit you mention does hold true. That's a   
complex real-world scenario that would be hard to benchmark in   
production.


But easy to see the effects of.


I actually meant to say, hard to bench out of production.

Tests done by others show a considerable NFS write speed  
advantage  when using a dedicated slog SSD rather than a  
controller's NVRAM  cache.


I get pretty good NFS write speeds with NVRAM (40MB/s 4k  
sequential  write). It's a Dell PERC 6/e with 512MB onboard.


I get 47.9 MB/s (60.7 MB/s peak) here too (also with 512MB  
NVRAM),  but that is not very good when the network is good for  
100 MB/s.   With an SSD, some other folks here are getting  
essentially network  speed.


In testing with ram disks I was only able to get a max of around  
60MB/s with 4k block sizes, with 4 outstanding.


I can do 64k blocks now and get around 115MB/s.


I just ran some filebench microbenchmarks against my 10 Gbit testbox
which is a Dell R905, 4 x 2.5 Ghz AMD Quad Core CPU's and 64 GB RAM.

My current pool is comprised of 7 mirror vdevs (SATA disks), 2 Intel
X25-E as slogs and 1 Intel X25-M for the L2ARC.

The pool is a MD1000 array attached to a PERC 6/E using 2 SAS cables.

The nic's are ixgbe based.

Here are the numbers :
Randomwrite benchmark - via 10Gbit NFS : IO Summary: 4483228 ops,  
73981.2 ops/s, (0/73981 r/w) 578.0mb/s, 44us cpu/op, 0.0ms latency


Randomread benchmark - via 10Gbit NFS :
IO Summary: 7663903 ops, 126467.4 ops/s, (126467/0 r/w) 988.0mb/s,  
5us cpu/op, 0.0ms latency


The real question is if these numbers can be trusted - I am currently
preparing new test runs with other software to be able to do a
comparison.


Yes, need to make sure it is sync io as NFS clients can still choose  
to use async and work out of their own cache.


Quick snipped from zpool iostat : 


  mirror 1.12G   695G  0  0  0  0
c8t12d0  -  -  0  0  0  0
c8t13d0  -  -  0  0  0  0
  c7t2d04K  29.0G  0  1.56K  0   200M
  c7t3d04K  29.0G  0  1.58K  0   202M

The disks on c7 are both Intel X25-E 


-Ross



--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:

On Aug 4, 2009, at 10:17 PM, James Lever  wrote:



On 05/08/2009, at 11:41 AM, Ross Walker wrote:


What is your recipe for these?


There wasn't one! ;)

The drive I'm using is a Dell badged Samsung MCCOE50G5MPQ-0VAD3.


So the key is the drive needs to have the Dell badging to work?

I called my rep about getting a Dell badged SSD and he told me they  
didn't support those in MD series enclosures so therefore were  
unavailable.


If the Dell branded SSD's are Samsung's then you might want to search
the archives - if I remember correctly there were mentions of
less-than-desired performance using them, but I cannot recall the
details.



Maybe it's time for a new account rep.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-05 Thread Henrik Johansen

Ross Walker wrote:
On Aug 4, 2009, at 10:22 PM, Bob Friesenhahn wrote:



On Tue, 4 Aug 2009, Ross Walker wrote:
Are you sure that it is faster than an SSD?  The data is indeed  
pushed closer to the disks, but there may be considerably more  
latency associated with getting that data into the controller  
NVRAM cache than there is into a dedicated slog SSD.


I don't see how, as the SSD is behind a controller it still must  
make it to the controller.


If you take a look at 'iostat -x' output you will see that the  
system knows about a queue for each device.  If it was any other  
way, then a slow device would slow down access to all of the other  
devices.  If there is concern about lack of bandwidth (PCI-E?) to  
the controller, then you can use a separate controller for the SSDs.


It's not bandwidth. Though with a lot of mirrors that does become a  
concern.


Well the duplexing benefit you mention does hold true. That's a  
complex real-world scenario that would be hard to benchmark in  
production.


But easy to see the effects of.


I actually meant to say, hard to bench out of production.

Tests done by others show a considerable NFS write speed advantage  
when using a dedicated slog SSD rather than a controller's NVRAM  
cache.


I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential  
write). It's a Dell PERC 6/e with 512MB onboard.


I get 47.9 MB/s (60.7 MB/s peak) here too (also with 512MB NVRAM),  
but that is not very good when the network is good for 100 MB/s.   
With an SSD, some other folks here are getting essentially network  
speed.


In testing with ram disks I was only able to get a max of around 60MB/s
with 4k block sizes, with 4 outstanding.


I can do 64k blocks now and get around 115MB/s.


I just ran some filebench microbenchmarks against my 10 Gbit testbox
which is a Dell R905, 4 x 2.5 Ghz AMD Quad Core CPU's and 64 GB RAM.

My current pool is comprised of 7 mirror vdevs (SATA disks), 2 Intel
X25-E as slogs and 1 Intel X25-M for the L2ARC.

The pool is a MD1000 array attached to a PERC 6/E using 2 SAS cables.

The nic's are ixgbe based.

Here are the numbers : 

Randomwrite benchmark - via 10Gbit NFS : 
IO Summary: 4483228 ops, 73981.2 ops/s, (0/73981 r/w) 578.0mb/s, 44us cpu/op, 0.0ms latency


Randomread benchmark - via 10Gbit NFS :
IO Summary: 7663903 ops, 126467.4 ops/s, (126467/0 r/w) 988.0mb/s, 5us cpu/op, 
0.0ms latency

The real question is if these numbers can be trusted - I am currently
preparing new test runs with other software to be able to do a
comparison. 
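
For anyone who wants to reproduce the runs, these were stock filebench
micro-workloads pointed at the NFS mount - roughly along these lines
(a sketch; the mount point and run length are placeholders, not my
exact parameters):

filebench << 'EOF'
load randomwrite
set $dir=/mnt/nfstest
run 60
EOF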

There is still bus and controller plus SSD latency. I suppose one
could use a pair of disks as a slog mirror, enable NVRAM just for
those and let the others do write-through with their disk caches.


But this encounters the problem that when the NVRAM becomes full  
then you hit the wall of synchronous disk write performance.  With  
the SSD slog, the write log can be quite large and disk writes are  
then done in a much more efficient ordered fashion similar to non- 
sync writes.


Yes, you have a point there.

So, what SSD disks do you use?

-Ross


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool iscsi /zfs performance in opensolaris 0906

2009-08-04 Thread Henrik Johansen

Ross Walker wrote:

On Aug 4, 2009, at 8:36 PM, Carson Gaspar  wrote:


Ross Walker wrote:

I get pretty good NFS write speeds with NVRAM (40MB/s 4k sequential  
write). It's a Dell PERC 6/e with 512MB onboard.

...
there, dedicated slog device with NVRAM speed. It would be even  
better to have a pair of SSDs behind the NVRAM, but it's hard to  
find compatible SSDs for these controllers, Dell currently doesn't  
even support SSDs in their RAID products :-(


Isn't the PERC 6/e just a re-branded LSI? LSI added SSD support  
recently.


Yes, but the LSI support of SSDs is on later controllers.


Sure that's not just a firmware issue ?

My PERC 6/E seems to support SSD's : 


# ./MegaCli -AdpAllInfo -a2 | grep -i ssd
Enable Copyback to SSD on SMART Error   : No
Enable SSD Patrol Read  : No
Allow SSD SAS/SATA Mix in VD : No
Allow HDD/SSD Mix in VD  : No


Controller info : 
   Versions


Product Name: PERC 6/E Adapter
Serial No   : 
FW Package Build: 6.0.3-0002

Mfg. Data

Mfg. Date   : 06/08/07
Rework Date : 06/08/07
Revision No : 
Battery FRU : N/A


Image Versions in Flash:

FW Version : 1.11.82-0473
BIOS Version   : NT13-2
WebBIOS Version: 1.1-32-e_11-Rel
Ctrl-R Version : 1.01-010B
Boot Block Version : 1.00.00.01-0008


I currently have 2 x Intel X25-E (32 GB) as dedicated slogs and 1 x
Intel X25-M (80 GB) for the L2ARC behind a PERC 6/i on my Dell R905
testbox.

So far there have been no problems with them.



-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] surprisingly poor performance

2009-07-03 Thread Henrik Johansen
# time tar xf zeroes.tar

real8m7.176s
user0m0.438s
sys 0m5.754s

While this was running, I was looking at the output of zpool iostat  
fastdata 10 to see how it was going and was surprised to see the  
seemingly low IOPS.


Have you tried running this locally on your OpenSolaris box - just to
get an idea of what it could deliver in terms of speed ? Which NFS
version are you using ?  




jam...@scalzi:~$ zpool iostat fastdata 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
fastdata    10.0G  2.02T      0    312    268  3.89M
fastdata    10.0G  2.02T      0    818      0  3.20M
fastdata    10.0G  2.02T      0    811      0  3.17M
fastdata    10.0G  2.02T      0    860      0  3.27M

Strangely, when I added a second SSD as a second slog, it made no  
difference to the write operations.


I'm not sure where to go from here, these results are appalling (about  
3x the time of the old system with 8x 10kRPM spindles) even with two  
Enterprise SSDs as separate log devices.


cheers,
James

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Med venlig hilsen / Best Regards

Henrik Johansen
hen...@scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best controller card for 8 SATA drives ?

2009-06-23 Thread Henrik Johansen

Erik Ableson wrote:
The problem I had was with the single RAID 0 volumes (I miswrote RAID 1
in the original message).


This is not a straight-to-disk connection and you'll have problems if
you ever need to move disks around or move them to another controller.


Would you mind explaining exactly what issues or problems you had ? I
have moved disks around several controllers without problems. You must
remember however to create the RAID 0 LUN through LSI's MegaRAID CLI
tool and/or to clear any foreign config before the controller will
expose the disk(s) to the OS.
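
A rough sketch of the MegaCli incantations involved (the enclosure:slot
numbers below are placeholders - check yours with -PDList first):

# clear any foreign config the controller may have picked up
./MegaCli -CfgForeign -Clear -a0

# expose a single disk (enclosure 32, slot 4 here) as a one-drive RAID 0 LD
./MegaCli -CfgLdAdd -r0 [32:4] -a0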

The only real problem that I can think of is that you cannot use the
autoreplace functionality of recent ZFS versions with these controllers.

I agree that the MD1000 with ZFS is a rocking, inexpensive setup (we  
have several!) but I'd recommend using a SAS card with a true JBOD  
mode for maximum flexibility and portability. If I remember correctly,  
I think we're using the Adaptec 3085. I've pulled 465MB/s write and  
1GB/s read off the MD1000 filled with SATA drives.


Regards,

Erik Ableson

+33.6.80.83.58.28
Sent from my iPhone

On 23 June 2009, at 21:18, Henrik Johansen wrote:


Kyle McDonald wrote:

Erik Ableson wrote:


Just a side note on the PERC labelled cards: they don't have a  
JBOD mode so you _have_ to use hardware RAID. This may or may not  
be an issue in your configuration but it does mean that moving  
disks between controllers is no longer possible. The only way to  
do a pseudo JBOD is to create broken RAID 1 volumes which is not  
ideal.



It won't even let you make single drive RAID 0 LUNs? That's a shame.


We currently have 90+ disks that are created as single drive RAID 0 LUNs
on several PERC 6/E (LSI 1078E chipset) controllers and used by ZFS.

I can assure you that they work without any problems and perform very
well indeed.

In fact, the combination of PERC 6/E and MD1000 disk arrays has worked
so well for us that we are going to double the number of disks during
this fall.

The lack of portability is disappointing. The trade-off though is  
battery backed cache if the card supports it.


-Kyle



Regards,

Erik Ableson

+33.6.80.83.58.28
Sent from my iPhone

On 23 June 2009, at 04:33, "Eric D. Mudama" wrote:


> On Mon, Jun 22 at 15:46, Miles Nordin wrote:
>>>>>>> "edm" == Eric D Mudama  writes:
>>
>>  edm> We bought a Dell T610 as a fileserver, and it comes with an
>>  edm> LSI 1068E based board (PERC6/i SAS).
>>
>> which driver attaches to it?
>>
>> pciids.sourceforge.net says this is a 1078 board, not a 1068 board.
>>
>> please, be careful.  There's too much confusion about these cards.
>
> Sorry, that may have been confusing.  We have the cheapest storage
> option on the T610, with no onboard cache.  I guess it's called the
> "Dell SAS6i/R" while they reserve the PERC name for the ones with
> cache.  I had understood that they were basically identical except for
> the cache, but maybe not.
>
> Anyway, this adapter has worked great for us so far.
>
>
> snippet of prtconf -D:
>
>
> i86pc (driver name: rootnex)
>pci, instance #0 (driver name: npe)
>pci8086,3411, instance #6 (driver name: pcie_pci)
>pci1028,1f10, instance #0 (driver name: mpt)
>sd, instance #1 (driver name: sd)
>sd, instance #6 (driver name: sd)
>sd, instance #7 (driver name: sd)
>sd, instance #2 (driver name: sd)
>sd, instance #4 (driver name: sd)
>sd, instance #5 (driver name: sd)
>
>
> For this board the mpt driver is being used, and here's the prtconf
> -pv info:
>
>
>  Node 0x1f
>    assigned-addresses:
>        81020010..fc00..0100.83020014..df2ec000..4000.8302001c..df2f..0001
>    reg:
>        0002.....01020010....0100.03020014....4000.0302001c....0001
>    compatible: 'pciex1000,58.1028.1f10.8' + 'pciex1000,58.1028.1f10' +
>        'pciex1000,58.8' + 'pciex1000,58' + 'pciexclass,01' +
>        'pciexclass,0100' + 'pci1000,58.1028.1f10.8' +
>        'pci1000,58.1028.1f10' + 'pci1028,1f10' + 'pci1000,58.8' +
>        'pci1000,58' + 'pciclass,01' + 'pciclass,0100'
>    model:  'SCSI bus controller'
>    power-consumption:  0001.0001
>    devsel-speed:  
>    interrupts:  0001
>    subsystem-vendor-id:  1028
>    subsystem-

Re: [zfs-discuss] Large zpool design considerations

2008-07-04 Thread Henrik Johansen
Chris Cosby wrote:
>I'm going down a bit of a different path with my reply here. I know that all
>shops and their need for data are different, but hear me out.
>
>1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
>insane. Perhaps it's time to look at your backup strategy no from a hardware
>perspective, but from a data retention perspective. Do you really need that
>much data backed up? There has to be some way to get the volume down. If
>not, you're at 100TB in just slightly over 4 years (assuming the 25% growth
>factor). If your data is critical, my recommendation is to go find another
>job and let someone else have that headache.

Well, we are talking about backup for ~900 servers that are in
production. Our retention period is 14 days for stuff like web servers,
and 3 weeks for SQL and such. 

We could deploy deduplication but it makes me a wee bit uncomfortable to
blindly trust our backup software.

>2) 40TB of backups is, at the best possible price, 50-1TB drives (for spares
>and such) - $12,500 for raw drive hardware. Enclosures add some money, as do
>cables and such. For mirroring, 90-1TB drives is $22,500 for the raw drives.
>In my world, I know yours is different, but the difference in a $100,000
>solution and a $75,000 solution is pretty negligible. The short description
>here: you can afford to do mirrors. Really, you can. Any of the parity
>solutions out there, I don't care what your strategy, is going to cause you
>more trouble than you're ready to deal with.

Good point. I'll take that into consideration.

>I know these aren't solutions for you, it's just the stuff that was in my
>head. The best possible solution, if you really need this kind of volume, is
>to create something that never has to resilver. Use some nifty combination
>of hardware and ZFS, like a couple of somethings that has 20TB per container
>exported as a single volume, mirror those with ZFS for its end-to-end
>checksumming and ease of management.
>
>That's my considerably more than $0.02
>
>On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn <
>[EMAIL PROTECTED]> wrote:
>
>> On Thu, 3 Jul 2008, Don Enrique wrote:
>> >
>> > This means that i potentially could loose 40TB+ of data if three
>> > disks within the same RAIDZ-2 vdev should die before the resilvering
>> > of at least one disk is complete. Since most disks will be filled i
>> > do expect rather long resilvering times.
>>
>> Yes, this risk always exists.  The probability of three disks
>> independently dying during the resilver is exceedingly low. The chance
>> that your facility will be hit by an airplane during resilver is
>> likely higher.  However, it is true that RAIDZ-2 does not offer the
>> same ease of control over physical redundancy that mirroring does.
>> If you were to use 10 independent chassis and split the RAIDZ-2
>> uniformly across the chassis then the probability of a similar
>> calamity impacting the same drives is driven by rack or facility-wide
>> factors (e.g. building burning down) rather than shelf factors.
>> However, if you had 10 RAID arrays mounted in the same rack and the
>> rack falls over on its side during resilver then hope is still lost.
>>
>> I am not seeing any options for you here.  ZFS RAIDZ-2 is about as
>> good as it gets and if you want everything in one huge pool, there
>> will be more risk.  Perhaps there is a virtual filesystem layer which
>> can be used on top of ZFS which emulates a larger filesystem but
>> refuses to split files across pools.
>>
>> In the future it would be useful for ZFS to provide the option to not
>> load-share across huge VDEVs and use VDEV-level space allocators.
>>
>> Bob
>> ==
>> Bob Friesenhahn
>> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
>> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>>
>> ___
>> zfs-discuss mailing list
>> zfs-discuss@opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>
>
>
>
>-- 
>chris -at- microcozm -dot- net
>=== Si Hoc Legere Scis Nimium Eruditionis Habes

-- 
Med venlig hilsen / Best Regards

Henrik Johansen
[EMAIL PROTECTED]


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Henrik Johansen
[Richard Elling] wrote:
> Don Enrique wrote:
>> Hi,
>>
>> I am looking for some best practice advice on a project that i am working on.
>>
>> We are looking at migrating ~40TB backup data to ZFS, with an annual data 
>> growth of
>> 20-25%.
>>
>> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 
>> vdevs ( 7 + 2 )
>> with one hotspare per 10 drives and just continue to expand that pool as 
>> needed.
>>
>> Between calculating the MTTDL and performance models i was hit by a rather 
>> scary thought.
>>
>> A pool comprised of X vdevs is no more resilient to data loss than the 
>> weakest vdev since loss
>> of a vdev would render the entire pool unusable.
>>   
>
> Yes, but a raidz2 vdev using enterprise class disks is very reliable.

That's nice to hear.

>> This means that i potentially could loose 40TB+ of data if three disks 
>> within the same RAIDZ-2
>> vdev should die before the resilvering of at least one disk is complete. 
>> Since most disks
>> will be filled i do expect rather long resilvering times.
>>
>> We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project 
>> with as much hardware
>> redundancy as we can get ( multiple controllers, dual cabeling, I/O 
>> multipathing, redundant PSUs,
>> etc.)
>>   
>
> nit: SATA disks are single port, so you would need a SAS implementation
> to get multipathing to the disks.  This will not significantly impact the
> overall availability of the data, however.  I did an availability  
> analysis of
> thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Yeah, I read your blog. Very informative indeed. 

I am using SAS HBA cards and SAS enclosures with SATA disks so I should
be fine.

>> I could use multiple pools but that would make data management harder which 
>> in itself is a lengthy
>> process in our shop.
>>
>> The MTTDL figures seem OK so how much should i need to worry ? Anyone having 
>> experience from
>> this kind of setup ?
>>   
>
> I think your design is reasonable.  We'd need to know the exact
> hardware details to be able to make more specific recommendations.
> -- richard

Well, my choice of hardware is kind of limited by 2 things :

1. We are a 100% Dell shop.
2. We already have lots of enclosures that I would like to reuse for my project.

The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards, the enclosures are
Dell MD1000 diskarrays.

>

-- 
Med venlig hilsen / Best Regards

Henrik Johansen
[EMAIL PROTECTED]


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs data corruption

2008-04-24 Thread johansen
> I'm just interested in understanding how zfs determined there was data
> corruption when I have checksums disabled and there were no
> non-retryable read errors reported in the messages file.

If the metadata is corrupt, how is ZFS going to find the data blocks on
disk?

> >  I don't believe it was a real disk read error because of the
> >  absence of evidence in /var/adm/messages.

It's not safe to jump to this conclusion.  Disk drivers that support FMA
won't log error messages to /var/adm/messages.  As more support for I/O
FMA shows up, you won't see random spew in the messages file any more.
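
If you want to see the error telemetry that no longer lands in syslog,
check the FMA logs instead, e.g.:

fmadm faulty      # diagnosed faults, if any
fmdump -eV        # raw ereports, including disk/transport errors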

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-11 Thread johansen
> Is deleting the old files/directories in the ZFS file system
> sufficient or do I need to destroy/recreate the pool and/or file
> system itself?  I've been doing the former.

The former should be sufficient, it's not necessary to destroy the pool.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-07 Thread johansen
> -Still playing with 'recsize' values but it doesn't seem to be doing
> much...I don't think I have a good understanding of what exactly is being
> written...I think the whole file might be overwritten each time
> because it's in binary format.

The other thing to keep in mind is that the tunables like compression
and recsize only affect newly written blocks.  If you have a bunch of
data that was already laid down on disk and then you change the tunable,
this will only cause new blocks to have the new size.  If you experiment
with this, make sure all of your data has the same blocksize by copying
it over to the new pool once you've changed the properties.
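
For example (dataset names here are hypothetical): set the property
on the destination first, then rewrite the data so every block picks
up the new size:

zfs create -o recordsize=8k tank/db_new
cd /tank/db && find . -depth -print | cpio -pdmu /tank/db_new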

> -Setting zfs_nocacheflush, though got me drastically increased
> throughput--client requests took, on average, less than 2 seconds
> each!
> 
> So, in order to use this, I should have a storage array, w/battery
> backup, instead of using the internal drives, correct?

zfs_nocacheflush should only be used on arrays with a battery backed
cache.  If you use this option on a disk, and you lose power, there's no
guarantee that your write successfully made it out of the cache.

A performance problem when flushing the cache of an individual disk
implies that there's something wrong with the disk or its firmware.  You
can disable the write cache of an individual disk using format(1M).  When you
do this, ZFS won't lose any data, whereas enabling zfs_nocacheflush can
lead to problems.

I'm attaching a DTrace script that will show the cache-flush times
per-vdev.  Remove the zfs_nocacheflush tuneable and re-run your test
while using this DTrace script.  If one particular disk takes longer
than the rest to flush, this should show us.  In that case, we can
disable the write cache on that particular disk.  Otherwise, we'll need
to disable the write cache on all of the disks.

The script is attached as zfs_flushtime.d

Use format(1M) with the -e option to adjust the write_cache settings for
SCSI disks.
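
The path through the menus looks roughly like this (a sketch; the exact
entries can vary with the disk type):

# format -e
(select the disk)
format> cache
cache> write_cache
write_cache> display
write_cache> disable
write_cache> quit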

-j
#!/usr/sbin/dtrace -Cs
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#define DKIOC   (0x04 << 8)
#define DKIOCFLUSHWRITECACHE    (DKIOC|34)
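
/*
 * Measure how long each SYNCHRONIZE CACHE request issued by ZFS takes,
 * aggregated per vdev device path, so a single slow disk stands out.
 */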

fbt:zfs:vdev_disk_io_start:entry
/(args[0]->io_cmd == DKIOCFLUSHWRITECACHE) && (self->traced == 0)/
{
self->traced = args[0];
self->start = timestamp;
}

fbt:zfs:vdev_disk_ioctl_done:entry
/args[0] == self->traced/
{
@a[stringof(self->traced->io_vd->vdev_path)] =
quantize(timestamp - self->start);
self->start = 0;
self->traced = 0;
}

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mdb ::memstat including zfs buffer details?

2007-11-12 Thread johansen
>  ZFS data buffers are attached to zvp; however, we still keep
>  metadata in the crashdump.  At least right now, this means that
>  cached ZFS metadata has kvp as its vnode.
>  
>Still, it's better than what you get currently.

I absolutely agree.

At one point, we discussed adding another vp for the metadata.  IIRC,
this was in the context of moving all of ZFS's allocations outside of
the cage.  There's no reason why you couldn't do the same to make
counting of buffers more understandable, though.
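
As an aside, if you just want a quick look at the current breakdown on
a running system, ::memstat itself is enough (run as root; the
categories it reports depend on the build):

    # echo ::memstat | mdb -k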

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mdb ::memstat including zfs buffer details?

2007-11-12 Thread johansen
>I don't think it should be too bad (for ::memstat), given that (at
>least in Nevada), all of the ZFS caching data belongs to the "zvp"
>vnode, instead of "kvp".

ZFS data buffers are attached to zvp; however, we still keep metadata in
the crashdump.  At least right now, this means that cached ZFS metadata
has kvp as its vnode.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fileserver performance tests

2007-10-08 Thread johansen
> statfile1 988ops/s   0.0mb/s  0.0ms/op   22us/op-cpu
> deletefile1   991ops/s   0.0mb/s  0.0ms/op   48us/op-cpu
> closefile2997ops/s   0.0mb/s  0.0ms/op4us/op-cpu
> readfile1 997ops/s 139.8mb/s  0.2ms/op  175us/op-cpu
> openfile2 997ops/s   0.0mb/s  0.0ms/op   28us/op-cpu
> closefile1   1081ops/s   0.0mb/s  0.0ms/op6us/op-cpu
> appendfilerand1   982ops/s  14.9mb/s  0.1ms/op   91us/op-cpu
> openfile1 982ops/s   0.0mb/s  0.0ms/op   27us/op-cpu
> 
> IO Summary:   8088 ops 8017.4 ops/s, (997/982 r/w) 155.6mb/s, 508us cpu/op, 0.2ms

> I expected to see some higher numbers really...
> a simple "time mkfile 16g lala" gave me something like 280Mb/s.

mkfile isn't an especially realistic test for performance.  You'll note
that the fileserver workload is performing stats, deletes, closes,
reads, opens, and appends.  Mkfile is a write benchmark.  You might
consider trying the singlestreamwrite benchmark, if you're looking for
a single-threaded write performance test.
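
From the filebench prompt it's only a couple of commands -- roughly the
following, though the exact syntax depends on your filebench version,
and the $dir path is just an example:

    filebench> load singlestreamwrite
    filebench> set $dir=/tank/test
    filebench> run 60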

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-05 Thread johansen
> But note that, for ZFS, the win with direct I/O will be somewhat
> less.  That's because you still need to read the page to compute
> its checksum.  So for direct I/O with ZFS (with checksums enabled),
> the cost is W:LPS, R:2*LPS.  Is saving one page of writes enough to
> make a difference?  Possibly not.

It's more complicated than that.  The kernel would be verifying
checksums on buffers in a user's address space.  For this to work, we
have to map these buffers into the kernel and simultaneously arrange for
these pages to be protected from other threads in the user's address
space.  We discussed some of the VM gymnastics required to properly
implement this back in January:

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-January/thread.html#36890

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/WAFL lawsuit

2007-09-06 Thread johansen-osdev
It's Columbia Pictures vs. Bunnell:

http://www.eff.org/legal/cases/torrentspy/columbia_v_bunnell_magistrate_order.pdf

The Register syndicated a Security Focus article that summarizes the
potential impact of the court decision:

http://www.theregister.co.uk/2007/08/08/litigation_data_retention/


-j

On Thu, Sep 06, 2007 at 08:14:56PM +0200, [EMAIL PROTECTED] wrote:
> 
> 
> >It really is a shot in the dark at this point, you really never know what
> >will happen in court (take the example of the recent court decision that
> >all data in RAM be held for discovery ?!WHAT, HEAD HURTS!?).  But at the
> >end of the day,  if you waited for a sure bet on any technology or
> >potential patent disputes you would not implement anything, ever.
> 
> 
> Do you have a reference for "all data in RAM most be held".  I guess we
> need to build COW RAM as well.
> 
> Casper
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Extremely long creat64 latencies on higly utilized zpools

2007-08-15 Thread johansen-osdev
You might also consider taking a look at this thread:

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-July/041760.html

Although I'm not certain, this sounds a lot like the other pool
fragmentation issues.

-j

On Wed, Aug 15, 2007 at 01:11:40AM -0700, Yaniv Aknin wrote:
> Hello friends,
> 
> I've recently seen a strange phenomenon with ZFS on Solaris 10u3, and was 
> wondering if someone may have more information.
> 
> The system uses several zpools, each a bit under 10T, each containing one zfs 
> with lots and lots of small files (way too many, about 100m files and 75m 
> directories).
> 
> I have absolutely no control over the directory structure and believe me I 
> tried to change it.
> 
> Filesystem usage patterns are create and read, never delete and never rewrite.
> 
> When volumes approach 90% usage, and under medium/light load (zpool iostat 
> reports 50mb/s and 750iops reads), some creat64 system calls take over 50 
> seconds to complete (observed with 'truss -D touch'). When doing manual 
> tests, I've seen similar times on unlink() calls (truss -D rm). 
> 
> I'd like to stress this happens on /some/ of the calls, maybe every 100th 
> manual call (I scripted the test), which (along with normal system 
> operations) would probably be every 10,000th or 100,000th call.
> 
> Other system parameters (memory usage, loadavg, process number, etc) appear 
> nominal. The machine is an NFS server, though the crazy latencies were 
> observed both local and remote.
> 
> What would you suggest to further diagnose this? Has anyone seen trouble with 
> high utilization and medium load? (with or without insanely high filecount?)
> 
> Many thanks in advance,
>  - Yaniv
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] is send/receive incremental

2007-08-08 Thread johansen-osdev
You can do it either way.  Eric Kustarz has a good explanation of how to
set up incremental send/receive on your laptop.  The description is on
his blog:

http://blogs.sun.com/erickustarz/date/20070612

The technique he uses is applicable to any ZFS filesystem.
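
In rough strokes it looks like this -- the pool, filesystem, and host
names are made up, so substitute your own:

    # zfs snapshot tank/home@monday
    # zfs send tank/home@monday | ssh backuphost zfs receive backup/home

    (the next night, send only the changes since the previous snapshot)

    # zfs snapshot tank/home@tuesday
    # zfs send -i tank/home@monday tank/home@tuesday | \
        ssh backuphost zfs receive backup/home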

-j

On Wed, Aug 08, 2007 at 04:44:16PM -0600, Peter Baumgartner wrote:
> 
>I'd like to send a backup of my filesystem offsite nightly using zfs
>send/receive. Are those done incrementally so only changes move or
>would a full copy get shuttled across everytime?
>--
>Pete

> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] si3124 controller problem and fix (fwd)

2007-07-17 Thread johansen-osdev
In an attempt to speed up progress on some of the si3124 bugs that Roger
reported, I've created a workspace with the fixes for:

   6565894 sata drives are not identified by si3124 driver
   6566207 si3124 driver loses interrupts.

I'm attaching a driver which contains these fixes as well as a diff of
the changes I used to produce them.

I don't have access to a si3124 chipset, unfortunately.

Would somebody be able to review these changes and try the new driver on
a si3124 card?

Thanks,

-j

On Tue, Jul 17, 2007 at 02:39:00AM -0700, Nigel Smith wrote:
> You can see the  status of bug here:
> 
> http://bugs.opensolaris.org/view_bug.do?bug_id=6566207
> 
> Unfortunately, it's showing no progress since 20th June.
> 
> This fix really could do to be in place for S10u4 and snv_70.
> Thanks
> Nigel Smith
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


si3124.tar.gz
Description: application/tar-gz

--- usr/src/uts/common/io/sata/adapters/si3124/si3124.c ---

Index: usr/src/uts/common/io/sata/adapters/si3124/si3124.c
--- /ws/onnv-clone/usr/src/uts/common/io/sata/adapters/si3124/si3124.c  Mon Nov 13 23:20:01 2006
+++ /export/johansen/si-fixes/usr/src/uts/common/io/sata/adapters/si3124/si3124.c  Tue Jul 17 14:37:17 2007
@@ -22,11 +22,11 @@
 /*
  * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  */
 
-#pragma ident  "@(#)si3124.c   1.4 06/11/14 SMI"
+#pragma ident  "@(#)si3124.c   1.5 07/07/17 SMI"
 
 
 
 /*
  * SiliconImage 3124/3132 sata controller driver
@@ -381,11 +381,11 @@
 
 extern struct mod_ops mod_driverops;
 
 static  struct modldrv modldrv = {
&mod_driverops, /* driverops */
-   "si3124 driver v1.4",
+   "si3124 driver v1.5",
&sictl_dev_ops, /* driver ops */
 };
 
 static  struct modlinkage modlinkage = {
MODREV_1,
@@ -2808,10 +2808,13 @@
si_portp = si_ctlp->sictl_ports[port];
mutex_enter(&si_portp->siport_mutex);
 
/* Clear Port Reset. */
ddi_put32(si_ctlp->sictl_port_acc_handle,
+   (uint32_t *)PORT_CONTROL_SET(si_ctlp, port),
+   PORT_CONTROL_SET_BITS_PORT_RESET);
+   ddi_put32(si_ctlp->sictl_port_acc_handle,
(uint32_t *)PORT_CONTROL_CLEAR(si_ctlp, port),
PORT_CONTROL_CLEAR_BITS_PORT_RESET);
 
/*
 * Arm the interrupts for: Cmd completion, Cmd error,
@@ -3509,16 +3512,16 @@
port);
 
if (port_intr_status & INTR_COMMAND_COMPLETE) {
(void) si_intr_command_complete(si_ctlp, si_portp,
port);
-   }
-
+   } else {
/* Clear the interrupts */
ddi_put32(si_ctlp->sictl_port_acc_handle,
(uint32_t *)(PORT_INTERRUPT_STATUS(si_ctlp, port)),
port_intr_status & INTR_MASK);
+   }
 
/*
 * Note that we did not clear the interrupt for command
 * completion interrupt. Reading of slot_status takes care
 * of clearing the interrupt for command completion case.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance and memory consumption

2007-07-06 Thread johansen-osdev
> But now I have another question.
> How 8k blocks will impact on performance ?

When tuning recordsize for things like databases, we try to recommend
that the customer's recordsize match the I/O size of the database
record.

I don't think that's the case in your situation.  ZFS is clever enough
that changes to recordsize only affect new blocks written to the
filesystem.  If you're seeing metaslab fragmentation problems now,
changing your recordsize to 8k is likely to increase your performance.
This is because you're out of 128k metaslabs, so using a smaller size
lets you make better use of the remaining space.  This also means you
won't have to iterate through all of the used 128k metaslabs looking for
a free one.

If you're asking, "How does setting the recordsize to 8k affect
performance when I'm not encountering fragmentation," I would guess
that there would be some reduction.  However, you can adjust the
recordsize once you encounter this problem with the default size.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] si3124 controller problem and fix (fwd)

2007-06-07 Thread johansen-osdev
> it's been assigned CR 6566207 by Linda Bernal.  Basically, if you look 
> at si_intr and read the comments in the code, the bug is pretty 
> obvious.
>
> si3124 driver's interrupt routine is incorrectly coded.  The ddi_put32 
> that clears the interrupts should be enclosed in an "else" block, 
> thereby making it consistent with the comment just below.  Otherwise, 
> you would be double clearing the interrupts, thus losing pending 
> interrupts.
> 
> Since this is a simple fix, there's really no point dealing it as a 
> contributor.

The bug report for 6566207 states that the submitter is an OpenSolaris
contributor who wishes to work on the fix.  If this is not the case, we
should clarify this CR so it doesn't languish.  It's still sitting in
the dispatched state (hasn't been accepted by anyone).

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: [storage-discuss] NCQ performance

2007-05-29 Thread johansen-osdev
> When sequential I/O is done to the disk directly there is no performance
> degradation at all.  

All filesystems impose some overhead compared to the rate of raw disk
I/O.  It's going to be hard to store data on a disk unless some kind of
filesystem is used.  All the tests that Eric and I have performed show
regressions for multiple sequential I/O streams.  If you have data that
shows otherwise, please feel free to share.

> [I]t does not take any additional time in ldi_strategy(),
> bdev_strategy(), mv_rw_dma_start().  In some instance it actually
> takes less time.   The only thing that sometimes takes additional time
> is waiting for the disk I/O.

Let's be precise about what was actually observed.  Eric and I saw
increased service times for the I/O on devices with NCQ enabled when
running multiple sequential I/O streams.  Everything that we observed
indicated that it actually took the disk longer to service requests when
many sequential I/Os were queued.

-j


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
Marko,
Matt and I discussed this offline some more and he had a couple of ideas
about double-checking your hardware.

It looks like your controller (or disks, maybe?) is having trouble with
multiple simultaneous I/Os to the same disk, and prefetch appears to
aggravate the problem.

When I asked Matt what we could do to verify that it's the number of
concurrent I/Os that is causing performance to be poor, he had the
following suggestions:

set zfs_vdev_{min,max}_pending=1 and run with prefetch on, then
iostat should show 1 outstanding io and perf should be good.

or turn prefetch off, and have multiple threads reading
concurrently, then iostat should show multiple outstanding ios
and perf should be bad.
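
If it's more convenient, those tunables can also go in /etc/system and
take effect on the next boot.  A sketch (the names match the tunables
above; the asterisk lines are just comments):

    * force one outstanding I/O per vdev
    set zfs:zfs_vdev_min_pending = 1
    set zfs:zfs_vdev_max_pending = 1

    * or, for the second experiment, disable prefetch instead
    set zfs:zfs_prefetch_disable = 1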

Let me know if you have any additional questions.

-j

On Wed, May 16, 2007 at 11:38:24AM -0700, [EMAIL PROTECTED] wrote:
> At Matt's request, I did some further experiments and have found that
> this appears to be particular to your hardware.  This is not a general
> 32-bit problem.  I re-ran this experiment on a 1-disk pool using a 32
> and 64-bit kernel.  I got identical results:
> 
> 64-bit
> ==
> 
> $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k
> count=1
> 1+0 records in
> 1+0 records out
> 
> real   20.1
> user0.0
> sys 1.2
> 
> 62 Mb/s
> 
> # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1
> 1+0 records in
> 1+0 records out
> 
> real   19.0
> user0.0
> sys 2.6
> 
> 65 Mb/s
> 
> 32-bit
> ==
> 
> /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k
> count=1
> 1+0 records in
> 1+0 records out
> 
> real   20.1
> user0.0
> sys 1.7
> 
> 62 Mb/s
> 
> # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1
> 1+0 records in
> 1+0 records out
> 
> real   19.1
> user0.0
> sys 4.3
> 
> 65 Mb/s
> 
> -j
> 
> On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
> > Marko Milisavljevic wrote:
> > >now lets try:
> > >set zfs:zfs_prefetch_disable=1
> > >
> > >bingo!
> > >
> > >   r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> > > 609.00.0 77910.00.0  0.0  0.80.01.4   0  83 c0d0
> > >
> > >only 1-2 % slower then dd from /dev/dsk. Do you think this is general
> > >32-bit problem, or specific to this combination of hardware?
> > 
> > I suspect that it's fairly generic, but more analysis will be necessary.
> > 
> > >Finally, should I file a bug somewhere regarding prefetch, or is this
> > >a known issue?
> > 
> > It may be related to 6469558, but yes please do file another bug report. 
> >  I'll have someone on the ZFS team take a look at it.
> > 
> > --matt
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
At Matt's request, I did some further experiments and have found that
this appears to be particular to your hardware.  This is not a general
32-bit problem.  I re-ran this experiment on a 1-disk pool using a 32
and 64-bit kernel.  I got identical results:

64-bit
==

$ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k
count=1
1+0 records in
1+0 records out

real   20.1
user0.0
sys 1.2

62 Mb/s

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1
1+0 records in
1+0 records out

real   19.0
user0.0
sys 2.6

65 Mb/s

32-bit
==

/usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k
count=1
1+0 records in
1+0 records out

real   20.1
user0.0
sys 1.7

62 Mb/s

# /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=1
1+0 records in
1+0 records out

real   19.1
user0.0
sys 4.3

65 Mb/s

-j

On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:
> Marko Milisavljevic wrote:
> >now lets try:
> >set zfs:zfs_prefetch_disable=1
> >
> >bingo!
> >
> >   r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> > 609.00.0 77910.00.0  0.0  0.80.01.4   0  83 c0d0
> >
> >only 1-2 % slower then dd from /dev/dsk. Do you think this is general
> >32-bit problem, or specific to this combination of hardware?
> 
> I suspect that it's fairly generic, but more analysis will be necessary.
> 
> >Finally, should I file a bug somewhere regarding prefetch, or is this
> >a known issue?
> 
> It may be related to 6469558, but yes please do file another bug report. 
>  I'll have someone on the ZFS team take a look at it.
> 
> --matt
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-16 Thread johansen-osdev
> >*sata_hba_list::list sata_hba_inst_t satahba_next | ::print 
> >sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | 
> >::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | 
> >::print -a sata_drive_info_t satadrv_features_support satadrv_settings 
> >satadrv_features_enabled

> This gives me "mdb: failed to dereference symbol: unknown symbol
> name". 

You may not have the SATA module installed.  If you type:

::modinfo !  grep sata

and don't get any output, your sata driver is attached some other way.

My apologies for the confusion.

-K
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-15 Thread johansen-osdev
> Each drive is freshly formatted with one 2G file copied to it. 

How are you creating each of these files?

Also, would you please include the output from the isalist(1) command?

> These are snapshots of iostat -xnczpm 3 captured somewhere in the
> middle of the operation.

Have you double-checked that this isn't a measurement problem by
measuring zfs with zpool iostat (see zpool(1M)) and verifying that
outputs from both iostats match?
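
Running the two side by side for a minute or so should be enough to
tell; something like this, where the pool name is just an example:

    # zpool iostat -v tank 3
    # iostat -xnczpm 3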

> single drive, zfs file
>r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>  258.30.0 33066.60.0 33.0  2.0  127.77.7 100 100 c0d1
> 
> Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s /
> r/s gives 256K, as I would imagine it should.

Not sure.  If we can figure out why ZFS is slower than raw disk access
in your case, it may explain why you're seeing these results.

> What if we read a UFS file from the PATA disk and ZFS from SATA:
>r/sw/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>  792.80.0 44092.90.0  0.0  1.80.02.2   1  98 c1d0
>  224.00.0 28675.20.0 33.0  2.0  147.38.9 100 100 c0d0
> 
> Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a
> number of times, not a fluke.

This could be cache interference.  ZFS and UFS use different caches.

How much memory is in this box?

> I have no idea what to make of all this, except that it ZFS has a problem
> with this hardware/drivers that UFS and other traditional file systems,
> don't. Is it a bug in the driver that ZFS is inadvertently exposing? A
> specific feature that ZFS assumes the hardware to have, but it doesn't? Who
> knows!

This may be a more complicated interaction than just ZFS and your
hardware.  There are a number of layers of drivers underneath ZFS that
may also be interacting with your hardware in an unfavorable way.

If you'd like to do a little poking with MDB, we can see the features
that your SATA disks claim they support.

As root, type mdb -k, and then at the ">" prompt that appears, enter the
following command (this is one very long line):

*sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t 
satahba_dev_port | ::array void* 32 | ::print void* | ::grep ".!=0" | ::print 
sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t 
satadrv_features_support satadrv_settings satadrv_features_enabled

This should show satadrv_features_support, satadrv_settings, and
satadrv_features_enabled for each SATA disk on the system.

The values for these variables are defined in:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/sata/impl/sata.h

this is the relevant snippet for interpreting these values:

/*
 * Device feature_support (satadrv_features_support)
 */
#define SATA_DEV_F_DMA                  0x01
#define SATA_DEV_F_LBA28                0x02
#define SATA_DEV_F_LBA48                0x04
#define SATA_DEV_F_NCQ                  0x08
#define SATA_DEV_F_SATA1                0x10
#define SATA_DEV_F_SATA2                0x20
#define SATA_DEV_F_TCQ                  0x40    /* Non NCQ tagged queuing */

/*
 * Device features enabled (satadrv_features_enabled)
 */
#define SATA_DEV_F_E_TAGGED_QING        0x01    /* Tagged queuing enabled */
#define SATA_DEV_F_E_UNTAGGED_QING      0x02    /* Untagged queuing enabled */

/*
 * Drive settings flags (satdrv_settings)
 */
#define SATA_DEV_READ_AHEAD             0x0001  /* Read Ahead enabled */
#define SATA_DEV_WRITE_CACHE            0x0002  /* Write cache ON */
#define SATA_DEV_SERIAL_FEATURES        0x8000  /* Serial ATA feat. enabled */
#define SATA_DEV_ASYNCH_NOTIFY          0x2000  /* Asynch-event enabled */

This may give us more information if this is indeed a problem with
hardware/drivers supporting the right features.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-14 Thread johansen-osdev
Marko,

I tried this experiment again using 1 disk and got nearly identical
times:

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=1
1+0 records in
1+0 records out

real   21.4
user0.0
sys 2.4

$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=1
1+0 records in
1+0 records out

real   21.0
user0.0
sys 0.7


> [I]t is not possible for dd to meaningfully access multiple-disk
> configurations without going through the file system. I find it
> curious that there is such a large slowdown by going through file
> system (with single drive configuration), especially compared to UFS
> or ext3.

Comparing a filesystem to raw dd access isn't a completely fair
comparison either.  Few filesystems actually lay out all of their data
and metadata so that every read is a completely sequential read.

> I simply have a small SOHO server and I am trying to evaluate which OS to
> use to keep a redundant disk array. With unreliable consumer-level hardware,
> ZFS and the checksum feature are very interesting and the primary selling
> point compared to a Linux setup, for as long as ZFS can generate enough
> bandwidth from the drive array to saturate single gigabit ethernet.

I would take Bart's recommendation and go with Solaris on something like a
dual-core box with 4 disks.

> My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114
> SATA controller on a 32-bit AthlonXP, according to many posts I found.

Bill Moore lists some controller recommendations here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

> However, since dd over raw disk is capable of extracting 75+MB/s from this
> setup, I keep feeling that surely I must be able to get at least that much
> from reading a pair of striped or mirrored ZFS drives. But I can't - single
> drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
> ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)

Maybe this is a problem with your controller?  What happens when you
have two simultaneous dd's to different disks running?  This would
simulate the case where you're reading from the two disks at the same
time.
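
Something along these lines is what I have in mind -- substitute your
actual device names:

    # dd if=/dev/dsk/c0d0 of=/dev/null bs=128k count=10000 &
    # dd if=/dev/dsk/c0d1 of=/dev/null bs=128k count=10000 &
    # wait

If the combined throughput drops well below what a single dd gets, that
points at the controller or the bus rather than ZFS.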

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?

2007-05-14 Thread johansen-osdev
This certainly isn't the case on my machine.

$ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k 
count=1
1+0 records in
1+0 records out

real1.3
user0.0
sys 1.2

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=1
1+0 records in
1+0 records out

real   22.3
user0.0
sys 2.2

This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool.

My pool is configured into a 46 disk RAID-0 stripe.  I'm going to omit
the zpool status output for the sake of brevity.

> What I am seeing is that ZFS performance for sequential access is
> about 45% of raw disk access, while UFS (as well as ext3 on Linux) is
> around 70%. For workload consisting mostly of reading large files
> sequentially, it would seem then that ZFS is the wrong tool
> performance-wise. But, it could be just my setup, so I would
> appreciate more data points.

This isn't what we've observed in much of our performance testing.
It may be a problem with your config, although I'm not an expert on
storage configurations.  Would you mind providing more details about
your controller, disks, and machine setup?

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-03 Thread johansen-osdev
A couple more questions here.

[mpstat]

> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   00   0 3109  3616  316  1965   17   48   45   2450  85   0  15
>   10   0 3127  3797  592  2174   17   63   46   1760  84   0  15
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   00   0 3051  3529  277  2012   14   25   48   2160  83   0  17
>   10   0 3065  3739  606  1952   14   37   47   1530  82   0  17
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   00   0 3011  3538  316  2423   26   16   52   2020  81   0  19
>   10   0 3019  3698  578  2694   25   23   56   3090  83   0  17
> 
> # lockstat -kIW -D 20 sleep 30
> 
> Profiling interrupt: 6080 events in 31.341 seconds (194 events/sec)
> 
> Count indv cuml rcnt nsec Hottest CPU+PILCaller  
> ---
>  2068  34%  34% 0.00 1767 cpu[0] deflate_slow
>  1506  25%  59% 0.00 1721 cpu[1] longest_match   
>  1017  17%  76% 0.00 1833 cpu[1] mach_cpu_idle   
>   454   7%  83% 0.00 1539 cpu[0] fill_window 
>   215   4%  87% 0.00 1788 cpu[1] pqdownheap  


What do you have zfs compression set to?  The gzip level is tunable,
according to zfs set, anyway:

PROPERTY   EDIT  INHERIT   VALUES
compression YES  YES   on | off | lzjb | gzip | gzip-[1-9]

You still have idle time in this lockstat (and mpstat).

What do you get for a lockstat -A -D 20 sleep 30?

Do you see anyone with long lock hold times, long sleeps, or excessive
spinning?

The largest numbers from mpstat are for interrupts and cross calls.
What does intrstat(1M) show?

Have you run dtrace to determine the most frequent cross-callers?

#!/usr/sbin/dtrace -s

sysinfo:::xcalls
{
        @a[stack(30)] = count();
}

END
{
        trunc(@a, 30);
}

is an easy way to do this.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Help me understand ZFS caching

2007-04-20 Thread johansen-osdev
Tony:

> Now to another question related to Anton's post. You mention that
> directIO does not exist in ZFS at this point. Are their plan's to
> support DirectIO; any functionality that will simulate directIO or
> some other non-caching ability suitable for critical systems such as
> databases if the client still wanted to deploy on filesystems.

I would describe DirectIO as the ability to map the application's
buffers directly for disk DMAs.  You need to disable the filesystem's
cache to do this correctly.  Having the cache disabled is an
implementation requirement for this feature.

Based upon this definition, are you seeking the ability to disable the
filesystem's cache or the ability to directly map application buffers
for DMA?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bottlenecks in building a system

2007-04-20 Thread johansen-osdev
Adam:

> Hi, hope you don't mind if I make some portions of your email public in 
> a reply--I hadn't seen it come through on the list at all, so it's no 
> duplicate to me.

I don't mind at all.  I had hoped to avoid sending the list a duplicate
e-mail, although it looks like my first post never made it here.

> > I suspect that if you have a bottleneck in your system, it would be due
> > to the available bandwidth on the PCI bus.
> 
> Mm. yeah, it's what I was worried about, too (mostly through ignorance 
> of the issues), which is why I was hoping HyperTransport and PCIe were 
> going to give that data enough room on the bus.
> But after others expressed the opinion that the Areca PCIe cards were 
> overkill, I'm now looking to putting some PCI-X cards on a different 
> (probably slower) motherboard.

I dug up a copy of the S2895 block diagram and asked Bill Moore about
it.  He said that you should be able to get about 700mb/s off of each of
the PCI-X channels and that you only need 100mb/s to saturate a GigE
link.  He also observed that the RAID card you were using was
unnecessary and would probably hamper performance.  He recommended
non-RAID SATA cards based upon the Marvell chipset.

Here's the e-mail trail on this list where he discusses Marvell SATA
cards in a bit more detail:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

It sounds like if getting disk -> network is the concern, you'll have
plenty of bandwidth, assuming you have a reasonable controller card.

> > Caching isn't going to be a huge help for writes, unless there's another
> > thread reading simultaneoulsy from the same file.
> >
> > Prefetch will definitely use the additional RAM to try to boost the
> > performance of sequential reads.  However, in the interest of full
> > disclosure, there is a pathology that we've seen where the number of
> > sequential readers exceeds the available space in the cache.  In this
> > situation, sometimes the competing prefetches for the different streams
> > will cause more temporally favorable data to be evicted from the cache
> > and performance will drop.  The workaround right now is just to disable
> > prefetch.  We're looking into more comprehensive solutions.
> 
> Interesting. So noted. I will expect to have to test thoroughly.

If you run across this problem and are willing to let me debug on your
system, shoot me an e-mail.  We've only seen this in a couple of
situations and it was combined with another problem where we were seeing
excessive overhead for kcopyout.  It's unlikely, but possible that you'll
hit this.

-K
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bottlenecks in building a system

2007-04-18 Thread johansen-osdev
Adam:

> Does anyone have a clue as to where the bottlenecks are going to be with 
> this:
> 
> 16x hot swap SATAII hard drives (plus an internal boot drive)
> Tyan S2895 (K8WE) motherboard
> Dual GigE (integral nVidia ports)
> 2x Areca 8-port PCIe (8-lane) RAID drivers
> 2x AMD Opteron 275 CPUs (2.2GHz, dual core)
> 8 GiB RAM

> The supplier is used to shipping Linux servers in this 3U chassis, but 
> hasn't dealt with Solaris. He originally suggested 2GiB RAM, but I hear 
> things about ZFS getting RAM hungry after a while.

ZFS is opportunistic when it comes to using free memory for caching.
I'm not sure what exactly you've heard.

> I guess my questions are:
> - Does anyone out there have a clue where the potential bottlenecks 
> might be?

What's your workload?  Bart is subscribed to this list, but he has a
famous saying, "One experiment is worth a thousand expert opinions."

Without knowing what you're trying to do with this box, it's going to be
hard to offer any useful advice.  However, you'll learn the most by
getting one of these boxes and running your workload.  If you have
problems, Solaris has a lot of tools that we can use to diagnose the
problem.  Then we can improve the performance and everybody wins.

> - If I focused on simple streaming IO, would giving the server less RAM 
> have an impact on performance?

The more RAM you can give your box, the more of it ZFS will use for
caching.  If your workload doesn't benefit from caching, then the impact
on performance won't be large.  Could you be more specific about what
the filesystem's consumers are doing when they're performing "simple
streaming IO?"

> - I had assumed four cores would be better than the two faster (3.0GHz) 
> single-core processors the vendor originally suggested. Agree?

I suspect that this is correct.  ZFS does many steps in its I/O path
asynchronously and they execute in the context of different threads.
Four cores are probably better than two.  Of course experimentation
could prove me wrong here, too. :)

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: C'mon ARC, stay small...

2007-03-16 Thread johansen-osdev
> I've been seeing this failure to cap on a number of (Solaris 10 update
> 2 and 3) machines since the script came out (arc hogging is a huge
> problem for me, esp on Oracle). This is probably a red herring, but my
> v490 testbed seemed to actually cap on 3 separate tests, but my t2000
> testbed doesn't even pretend to cap - kernel memory (as identified in
> Orca) sails right to the top, leaves me maybe 2GB free on a 32GB
> machine and shoves Oracle data into swap. 

What method are you using to cap this memory?  Jim and I just discussed
the required steps for doing this by hand using MDB.

> This isn't as amusing as one Stage and one Production Oracle machine
> which have 128GB and 96GB respectively. Sending in 92GB core dumps to
> support is an impressive gesture taking 2-3 days to complete.

This is solved by CR 4894692, which is in snv_56 and s10u4.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
I suppose I should have been more forward about making my last point.
If the arc_c_max isn't set in /etc/system, I don't believe that the ARC
will initialize arc.p to the correct value.   I could be wrong about
this; however, next time you set c_max, set c to the same value as c_max
and set p to half of c.  Let me know if this addresses the problem or
not.
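
In mdb -kw the sequence is roughly the following.  The addresses are
placeholders (use whatever ::print -a reports on your machine), and the
values are just an example for a 1GB cap, written as 64-bit hex with /Z:

    # mdb -kw
    > arc::print -a c c_max p
    > <addr of c_max>/Z 0x40000000
    > <addr of c>/Z 0x40000000
    > <addr of p>/Z 0x20000000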

-j

> >How/when did you configure arc_c_max?  
> Immediately following a reboot, I set arc.c_max using mdb,
> then verified reading the arc structure again.
> >arc.p is supposed to be
> >initialized to half of arc.c.  Also, I assume that there's a reliable
> >test case for reproducing this problem?
> >  
> Yep. I'm using a x4500 in-house to sort out performance of a customer test
> case that uses mmap. We acquired the new DIMMs to bring the
> x4500 to 32GB, since the workload has a 64GB working set size,
> and we were clobbering a 16GB thumper. We wanted to see how doubling
> memory may help.
> 
> I'm trying clamp the ARC size because for mmap-intensive workloads,
> it seems to hurt more than help (although, based on experiments up to this
> point, it's not hurting a lot).
> 
> I'll do another reboot, and run it all down for you serially...
> 
> /jim
> 
> >Thanks,
> >
> >-j
> >
> >On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
> >  
> >>
> >>>ARC_mru::print -d size lsize
> >>>  
> >>size = 0t10224433152
> >>lsize = 0t10218960896
> >>
> >>>ARC_mfu::print -d size lsize
> >>>  
> >>size = 0t303450112
> >>lsize = 0t289998848
> >>
> >>>ARC_anon::print -d size
> >>>  
> >>size = 0
> >>
> >>So it looks like the MRU is running at 10GB...
> >>
> >>What does this tell us?
> >>
> >>Thanks,
> >>/jim
> >>
> >>
> >>
> >>[EMAIL PROTECTED] wrote:
> >>
> >>>This seems a bit strange.  What's the workload, and also, what's the
> >>>output for:
> >>>
> >>> 
> >>>  
> ARC_mru::print size lsize
> ARC_mfu::print size lsize
>    
> 
> >>>and
> >>> 
> >>>  
> ARC_anon::print size
>    
> 
> >>>For obvious reasons, the ARC can't evict buffers that are in use.
> >>>Buffers that are available to be evicted should be on the mru or mfu
> >>>list, so this output should be instructive.
> >>>
> >>>-j
> >>>
> >>>On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> >>> 
> >>>  
> FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> 
> 
>    
> 
> >arc::print -tad
> > 
> >  
> {
> . . .
>   c02e29e8 uint64_t size = 0t10527883264
>   c02e29f0 uint64_t p = 0t16381819904
>   c02e29f8 uint64_t c = 0t1070318720
>   c02e2a00 uint64_t c_min = 0t1070318720
>   c02e2a08 uint64_t c_max = 0t1070318720
> . . .
> 
> Perhaps c_max does not do what I think it does?
> 
> Thanks,
> /jim
> 
> 
> Jim Mauro wrote:
>    
> 
> >Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> >(update 3). All file IO is mmap(file), read memory segment, unmap, 
> >close.
> >
> >Tweaked the arc size down via mdb to 1GB. I used that value because
> >c_min was also 1GB, and I was not sure if c_max could be larger than
> >c_minAnyway, I set c_max to 1GB.
> >
> >After a workload run:
> > 
> >  
> >>arc::print -tad
> >>   
> >>
> >{
> >. . .
> >c02e29e8 uint64_t size = 0t3099832832
> >c02e29f0 uint64_t p = 0t16540761088
> >c02e29f8 uint64_t c = 0t1070318720
> >c02e2a00 uint64_t c_min = 0t1070318720
> >c02e2a08 uint64_t c_max = 0t1070318720
> >. . .
> >
> >"size" is at 3GB, with c_max at 1GB.
> >
> >What gives? I'm looking at the code now, but was under the impression
> >c_max would limit ARC growth. Granted, it's not a factor of 10, and
> >it's certainly much better than the out-of-the-box growth to 24GB
> >(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> >still grew to 3X c_max.
> >
> >Thanks,
> >/jim
> >___
> >zfs-discuss mailing list
> >zfs-discuss@opensolaris.org
> >http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > 
> >  
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>    
> 
> >>___
> >>zfs-discuss mailing list
> >>zfs-discuss@opensolaris.org
> >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___

Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
Something else to consider: depending upon how you set arc_c_max, you
may just want to set arc_c and arc_p at the same time.  If you try
setting arc_c_max, and then setting arc_c to arc_c_max, and then set
arc_p to arc_c / 2, do you still get this problem?

-j

On Thu, Mar 15, 2007 at 05:18:12PM -0700, [EMAIL PROTECTED] wrote:
> Gar.  This isn't what I was hoping to see.  Buffers that aren't
> available for eviction aren't listed in the lsize count.  It looks like
> the MRU has grown to 10Gb and most of this could be successfully
> evicted.
> 
> The calculation for determining if we evict from the MRU is in
> arc_adjust() and looks something like:
> 
> top_sz = ARC_anon.size + ARC_mru.size
> 
> Then if top_sz > arc.p and ARC_mru.lsize > 0 we evict the smaller of
> ARC_mru.lsize and top_sz - arc.p
> 
> In your previous message it looks like arc.p is > (ARC_mru.size +
> ARC_anon.size).  It might make sense to double-check these numbers
> together, so when you check the size and lsize again, also check arc.p.
> 
> How/when did you configure arc_c_max?  arc.p is supposed to be
> initialized to half of arc.c.  Also, I assume that there's a reliable
> test case for reproducing this problem?
> 
> Thanks,
> 
> -j
> 
> On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
> > 
> > 
> > > ARC_mru::print -d size lsize
> > size = 0t10224433152
> > lsize = 0t10218960896
> > > ARC_mfu::print -d size lsize
> > size = 0t303450112
> > lsize = 0t289998848
> > > ARC_anon::print -d size
> > size = 0
> > >
> > 
> > So it looks like the MRU is running at 10GB...
> > 
> > What does this tell us?
> > 
> > Thanks,
> > /jim
> > 
> > 
> > 
> > [EMAIL PROTECTED] wrote:
> > >This seems a bit strange.  What's the workload, and also, what's the
> > >output for:
> > >
> > >  
> > >>ARC_mru::print size lsize
> > >>ARC_mfu::print size lsize
> > >>
> > >and
> > >  
> > >>ARC_anon::print size
> > >>
> > >
> > >For obvious reasons, the ARC can't evict buffers that are in use.
> > >Buffers that are available to be evicted should be on the mru or mfu
> > >list, so this output should be instructive.
> > >
> > >-j
> > >
> > >On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> > >  
> > >>FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> > >>
> > >>
> > >>
> > >>>arc::print -tad
> > >>>  
> > >>{
> > >>. . .
> > >>   c02e29e8 uint64_t size = 0t10527883264
> > >>   c02e29f0 uint64_t p = 0t16381819904
> > >>   c02e29f8 uint64_t c = 0t1070318720
> > >>   c02e2a00 uint64_t c_min = 0t1070318720
> > >>   c02e2a08 uint64_t c_max = 0t1070318720
> > >>. . .
> > >>
> > >>Perhaps c_max does not do what I think it does?
> > >>
> > >>Thanks,
> > >>/jim
> > >>
> > >>
> > >>Jim Mauro wrote:
> > >>
> > >>>Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> > >>>(update 3). All file IO is mmap(file), read memory segment, unmap, close.
> > >>>
> > >>>Tweaked the arc size down via mdb to 1GB. I used that value because
> > >>>c_min was also 1GB, and I was not sure if c_max could be larger than
> > >>>c_minAnyway, I set c_max to 1GB.
> > >>>
> > >>>After a workload run:
> > >>>  
> > arc::print -tad
> > 
> > >>>{
> > >>>. . .
> > >>> c02e29e8 uint64_t size = 0t3099832832
> > >>> c02e29f0 uint64_t p = 0t16540761088
> > >>> c02e29f8 uint64_t c = 0t1070318720
> > >>> c02e2a00 uint64_t c_min = 0t1070318720
> > >>> c02e2a08 uint64_t c_max = 0t1070318720
> > >>>. . .
> > >>>
> > >>>"size" is at 3GB, with c_max at 1GB.
> > >>>
> > >>>What gives? I'm looking at the code now, but was under the impression
> > >>>c_max would limit ARC growth. Granted, it's not a factor of 10, and
> > >>>it's certainly much better than the out-of-the-box growth to 24GB
> > >>>(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> > >>>still grew to 3X c_max.
> > >>>
> > >>>Thanks,
> > >>>/jim
> > >>>___
> > >>>zfs-discuss mailing list
> > >>>zfs-discuss@opensolaris.org
> > >>>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > >>>  
> > >>___
> > >>zfs-discuss mailing list
> > >>zfs-discuss@opensolaris.org
> > >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> > >>
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
Gar.  This isn't what I was hoping to see.  Buffers that aren't
available for eviction aren't listed in the lsize count.  It looks like
the MRU has grown to 10Gb and most of this could be successfully
evicted.

The calculation for determining if we evict from the MRU is in
arc_adjust() and looks something like:

top_sz = ARC_anon.size + ARC_mru.size

Then if top_sz > arc.p and ARC_mru.lsize > 0 we evict the smaller of
ARC_mru.lsize and top_sz - arc.p

In your previous message it looks like arc.p is > (ARC_mru.size +
ARC_anon.size).  It might make sense to double-check these numbers
together, so when you check the size and lsize again, also check arc.p.

How/when did you configure arc_c_max?  arc.p is supposed to be
initialized to half of arc.c.  Also, I assume that there's a reliable
test case for reproducing this problem?

Thanks,

-j

On Thu, Mar 15, 2007 at 06:57:12PM -0400, Jim Mauro wrote:
> 
> 
> > ARC_mru::print -d size lsize
> size = 0t10224433152
> lsize = 0t10218960896
> > ARC_mfu::print -d size lsize
> size = 0t303450112
> lsize = 0t289998848
> > ARC_anon::print -d size
> size = 0
> >
> 
> So it looks like the MRU is running at 10GB...
> 
> What does this tell us?
> 
> Thanks,
> /jim
> 
> 
> 
> [EMAIL PROTECTED] wrote:
> >This seems a bit strange.  What's the workload, and also, what's the
> >output for:
> >
> >  
> >>ARC_mru::print size lsize
> >>ARC_mfu::print size lsize
> >>
> >and
> >  
> >>ARC_anon::print size
> >>
> >
> >For obvious reasons, the ARC can't evict buffers that are in use.
> >Buffers that are available to be evicted should be on the mru or mfu
> >list, so this output should be instructive.
> >
> >-j
> >
> >On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> >  
> >>FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> >>
> >>
> >>
> >>>arc::print -tad
> >>>  
> >>{
> >>. . .
> >>   c02e29e8 uint64_t size = 0t10527883264
> >>   c02e29f0 uint64_t p = 0t16381819904
> >>   c02e29f8 uint64_t c = 0t1070318720
> >>   c02e2a00 uint64_t c_min = 0t1070318720
> >>   c02e2a08 uint64_t c_max = 0t1070318720
> >>. . .
> >>
> >>Perhaps c_max does not do what I think it does?
> >>
> >>Thanks,
> >>/jim
> >>
> >>
> >>Jim Mauro wrote:
> >>
> >>>Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> >>>(update 3). All file IO is mmap(file), read memory segment, unmap, close.
> >>>
> >>>Tweaked the arc size down via mdb to 1GB. I used that value because
> >>>c_min was also 1GB, and I was not sure if c_max could be larger than
> >>>c_minAnyway, I set c_max to 1GB.
> >>>
> >>>After a workload run:
> >>>  
> arc::print -tad
> 
> >>>{
> >>>. . .
> >>> c02e29e8 uint64_t size = 0t3099832832
> >>> c02e29f0 uint64_t p = 0t16540761088
> >>> c02e29f8 uint64_t c = 0t1070318720
> >>> c02e2a00 uint64_t c_min = 0t1070318720
> >>> c02e2a08 uint64_t c_max = 0t1070318720
> >>>. . .
> >>>
> >>>"size" is at 3GB, with c_max at 1GB.
> >>>
> >>>What gives? I'm looking at the code now, but was under the impression
> >>>c_max would limit ARC growth. Granted, it's not a factor of 10, and
> >>>it's certainly much better than the out-of-the-box growth to 24GB
> >>>(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> >>>still grew to 3X c_max.
> >>>
> >>>Thanks,
> >>>/jim
> >>>___
> >>>zfs-discuss mailing list
> >>>zfs-discuss@opensolaris.org
> >>>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>>  
> >>___
> >>zfs-discuss mailing list
> >>zfs-discuss@opensolaris.org
> >>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-03-15 Thread johansen-osdev
This seems a bit strange.  What's the workload, and also, what's the
output for:

> ARC_mru::print size lsize
> ARC_mfu::print size lsize
and
> ARC_anon::print size

For obvious reasons, the ARC can't evict buffers that are in use.
Buffers that are available to be evicted should be on the mru or mfu
list, so this output should be instructive.

-j

On Thu, Mar 15, 2007 at 02:08:37PM -0400, Jim Mauro wrote:
> 
> FYI - After a few more runs, ARC size hit 10GB, which is now 10X c_max:
> 
> 
> > arc::print -tad
> {
> . . .
>c02e29e8 uint64_t size = 0t10527883264
>c02e29f0 uint64_t p = 0t16381819904
>c02e29f8 uint64_t c = 0t1070318720
>c02e2a00 uint64_t c_min = 0t1070318720
>c02e2a08 uint64_t c_max = 0t1070318720
> . . .
> 
> Perhaps c_max does not do what I think it does?
> 
> Thanks,
> /jim
> 
> 
> Jim Mauro wrote:
> >Running an mmap-intensive workload on ZFS on a X4500, Solaris 10 11/06
> >(update 3). All file IO is mmap(file), read memory segment, unmap, close.
> >
> >Tweaked the arc size down via mdb to 1GB. I used that value because
> >c_min was also 1GB, and I was not sure if c_max could be larger than
> >c_minAnyway, I set c_max to 1GB.
> >
> >After a workload run:
> >> arc::print -tad
> >{
> >. . .
> >  c02e29e8 uint64_t size = 0t3099832832
> >  c02e29f0 uint64_t p = 0t16540761088
> >  c02e29f8 uint64_t c = 0t1070318720
> >  c02e2a00 uint64_t c_min = 0t1070318720
> >  c02e2a08 uint64_t c_max = 0t1070318720
> >. . .
> >
> >"size" is at 3GB, with c_max at 1GB.
> >
> >What gives? I'm looking at the code now, but was under the impression
> >c_max would limit ARC growth. Granted, it's not a factor of 10, and
> >it's certainly much better than the out-of-the-box growth to 24GB
> >(this is a 32GB x4500), so clearly ARC growth is being limited, but it
> >still grew to 3X c_max.
> >
> >Thanks,
> >/jim
> >___
> >zfs-discuss mailing list
> >zfs-discuss@opensolaris.org
> >http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] understanding zfs/thunoer "bottlenecks"?

2007-02-27 Thread johansen-osdev
> it seems there isn't an algorithm in ZFS that detects sequential write
> in traditional fs such as ufs, one would trigger directio.

There is no directio for ZFS.  Are you encountering a situation in which
you believe directio support would improve performance?  If so, please
explain.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS multi-threading

2007-02-08 Thread johansen-osdev
> Would the logic behind ZFS take full advantage of a heavily multicored
> system, such as on the Sun Niagara platform? Would it utilize of the
> 32 concurrent threads for generating its checksums? Has anyone
> compared ZFS on a Sun Tx000, to that of a 2-4 thread x64 machine?

Pete and I are working on resolving ZFS scalability issues with Niagara and
StarCat right now.  I'm not sure if any official numbers about ZFS
performance on Niagara have been published.

As far as concurrent threads generating checksums goes, the system
doesn't work quite the way you have postulated.  The checksum is
generated in the ZIO_STAGE_CHECKSUM_GENERATE pipeline state for writes,
and verified in the ZIO_STAGE_CHECKSUM_VERIFY pipeline stage for reads.
Whichever thread happens to advance the pipeline to the checksum generate
stage is the thread that will actually perform the work.  ZFS does not
break the work of the checksum into chunks and have multiple CPUs
perform the computation.  However, it is possible to have concurrent
writes simultaneously in the checksum_generate stage.

More details about this can be found in zfs/zio.c and zfs/sys/zio_impl.h
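
If you want to watch this on a live system, a DTrace one-liner along
these lines shows which CPUs wind up doing the checksum work under a
write load (assuming the fbt probe for zio_checksum_generate is
available in your build):

    # dtrace -n 'fbt:zfs:zio_checksum_generate:entry { @[cpu] = count(); }'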

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread johansen-osdev
> And this feature is independant on whether   or not the data  is
> DMA'ed straight into the user buffer.

I suppose so, however, it seems like it would make more sense to
configure a dataset property that specifically describes the caching
policy that is desired.  When directio implies different semantics for
different filesystems, customers are going to get confused.

> The  other  feature,  is to  avoid a   bcopy by  DMAing full
> filesystem block reads straight into user buffer (and verify
> checksum after). The I/O is high latency, bcopy adds a small
> amount. The kernel memory can  be freed/reuse straight after
> the user read  completes. This is  where I ask, how much CPU
> is lost to the bcopy in workloads that benefit from DIO ?

Right, except that if we try to DMA into user buffers with ZFS there's a
bunch of other things we need the VM to do on our behalf to protect the
integrity of the kernel data that's living in user pages.  Assume you
have a high-latency I/O and you've locked some user pages for this I/O.
In a pathological case, when another thread tries to access the locked
pages and then also blocks,  it does so for the duration of the first
thread's I/O.  At that point, it seems like it might be easier to accept
the cost of the bcopy instead of blocking another thread.

I'm not even sure how to assess the impact of VM operations required to
change the permissions on the pages before we start the I/O.

> The quickest return on  investement  I see for  the  directio
> hint would be to tell ZFS to not grow the ARC when servicing
> such requests.

Perhaps if we had an option that specifies not to cache data from a
particular dataset, that would suffice.  I think you've filed a CR along
those lines already (6429855)?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
> Note also that for most applications, the size of their IO operations
> would often not match the current page size of the buffer, causing
> additional performance and scalability issues.

Thanks for mentioning this, I forgot about it.

Since ZFS's default block size is configured to be larger than a page,
the application would have to issue page-aligned block-sized I/Os.
Anyone adjusting the block size would presumably be responsible for
ensuring that the new size is a multiple of the page size.  (If they
would want Direct I/O to work...)

I believe UFS also has a similar requirement, but I've been wrong
before.

-j

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-23 Thread johansen-osdev
> Basically speaking - there needs to be some sort of strategy for
> bypassing the ARC or even parts of the ARC for applications that
> may need to advise the filesystem of either:
> 1) the delicate nature of imposing additional buffering for their
> data flow
> 2) already well optimized applications that need more adaptive
> cache in the application instead of the underlying filesystem or
> volume manager

This advice can't be sensibly delivered to ZFS via a Direct I/O
mechanism.  Anton's characterization of Direct I/O as, "an optimization
which allows data to be transferred directly between user data buffers
and disk, without a memory-to-memory copy," is concise and accurate.
Trying to intuit advice from this is unlikely to be useful.  It would be
better to develop a separate mechanism for delivering advice about the
application to the filesystem.  (fadvise, perhaps?)
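
Something along the lines of posix_fadvise, where the platform provides
it, is what I have in mind: the application states its access pattern or
cache preference, and the transfer mechanism remains the filesystem's
business.  A hedged sketch only; nothing here implies ZFS acts on this
advice today.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
        int fd, err;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
                perror("open");
                return (1);
        }
        /* "I'll touch this data once; don't keep it cached for me." */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
        if (err != 0)
                (void) fprintf(stderr, "posix_fadvise: %s\n",
                    strerror(err));
        return (0);
}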

A DIO implementation for ZFS is more complicated than the one in UFS
and could adversely impact well-optimized applications.

I looked into this late last year when we had a customer who was
suffering from too much bcopy overhead.  Billm found another workaround
instead of bypassing the ARC.

The challenge for implementing DIO for ZFS is in dealing with access to
the pages mapped by the user application.  Since ZFS has to checksum all
of its data, the user's pages that are involved in the direct I/O cannot
be written to by another thread during the I/O.  If this policy isn't
enforced, it is possible for the data written to or read from disk to
differ from its checksum.
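
A contrived user-space illustration of that hazard (ordinary pwrite(2)
here, no ZFS involved): if another thread scribbles on a buffer between
the time its checksum is computed and the time the write completes, the
data on disk no longer matches the checksum.  Being a race, the outcome
varies from run to run.

#include <sys/types.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ   (128 * 1024)
static char buf[BUFSZ];
static char ondisk[BUFSZ];

static uint32_t
cksum(const char *p, size_t len)        /* toy checksum */
{
        uint32_t c = 0;
        while (len--)
                c = c * 31 + (unsigned char)*p++;
        return (c);
}

static void *
scribbler(void *arg)                    /* dirties the buffer mid-I/O */
{
        int i;
        (void) arg;
        for (i = 0; i < 1000; i++)
                buf[rand() % BUFSZ] ^= 0xff;
        return (NULL);
}

int
main(void)
{
        pthread_t t;
        uint32_t before;
        int fd = open("/tmp/dio-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        memset(buf, 'A', BUFSZ);
        before = cksum(buf, BUFSZ);             /* checksum the block ... */
        (void) pthread_create(&t, NULL, scribbler, NULL);
        (void) pwrite(fd, buf, BUFSZ, 0);       /* ... while it's written */
        (void) pthread_join(&t, NULL);
        (void) pread(fd, ondisk, BUFSZ, 0);
        (void) printf("on-disk data %s its checksum\n",
            cksum(ondisk, BUFSZ) == before ? "matches" : "does NOT match");
        return (0);
}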

In order to protect the user pages while a DIO is in progress, we want
support from the VM that isn't presently implemented.  To prevent a page
from being accessed by another thread, we have to unmap the TLB/PTE
entries and lock the page.  There's a cost associated with this, as it
may be necessary to cross-call other CPUs.  Any thread that accesses the
locked pages will block.  While it's possible to lock pages in the VM
today, there isn't a neat set of interfaces the filesystem can use to
maintain the integrity of the user's buffers.  Without an experimental
prototype to verify the design, it's impossible to say whether the
overhead of manipulating page permissions exceeds the cost of the
bcopy that bypassing the cache would save.

What do you see as potential use cases for ZFS Direct I/O?  I'm having a
hard time imagining a situation in which this would be useful to a
customer.  The application would probably have to be single-threaded,
and if not, it would have to be pretty careful about how its threads
access buffers involved in I/O.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread johansen-osdev
Robert:

> Better yet would be if the memory consumed by ZFS for caching (dnodes,
> vnodes, data, ...) behaved similarly to the page cache under UFS, so
> that applications would be able to get back almost all of the memory
> used for ZFS caches if needed.

I believe that a better response to memory pressure is a long-term goal
for ZFS.  There's also an effort in progress to improve the caching
algorithms used in the ARC.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-08 Thread johansen-osdev
> > Note that you'd actually have to verify that the blocks were the same;
> > you cannot count on the hash function.  If you didn't do this, anyone
> > discovering a collision could destroy the colliding blocks/files.
> 
> Given that nobody knows how to find sha256 collisions, you'd of course
> need to test this code with a weaker hash algorithm.
> 
> (It would almost be worth it to have the code panic in the event that a
> real sha256 collision was found)

The novel discovery of a sha256 collision will be lost on any
administrator whose system panics.  Imagine how much this will annoy the
first customer who accidentally discovers a reproducible test-case.
Perhaps generating an FMA error report would be more appropriate?
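
In code, the "verify, don't trust the hash" idea from the quoted text
might look something like this sketch (all names invented for
illustration; this is not ZFS source): a matching hash only nominates a
candidate block, a byte-wise compare decides, and a mismatch produces a
report rather than a panic.

#include <stdio.h>
#include <string.h>

static void
report_collision(const void *a, const void *b, size_t size)
{
        (void) a; (void) b; (void) size;
        (void) fprintf(stderr,
            "hash collision or corruption; keeping both copies\n");
}

/* Returns 1 only if the blocks are genuinely identical. */
static int
should_share_block(const void *newblk, const void *dupblk, size_t size,
    int hashes_match)
{
        if (!hashes_match)
                return (0);
        if (memcmp(newblk, dupblk, size) == 0)
                return (1);
        report_collision(newblk, dupblk, size);
        return (0);
}

int
main(void)
{
        char a[8] = "block-a", b[8] = "block-a", c[8] = "block-b";

        (void) printf("identical blocks shared: %d\n",
            should_share_block(a, b, sizeof (a), 1));
        (void) printf("colliding blocks shared: %d\n",
            should_share_block(a, c, sizeof (a), 1));
        return (0);
}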

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and savecore

2006-11-10 Thread johansen-osdev
This is CR 4894692: caching data in heap inflates crash dump

I have a fix which I am testing now.  It still needs review from
Matt/Mark before it's eligible for putback, though.

-j

On Fri, Nov 10, 2006 at 02:40:40PM -0800, Thomas Maier-Komor wrote:
> Hi, 
> 
> I'm not sure if this is the right forum, but I guess this topic will
> be bounced into the right direction from here.
> 
> With ZFS using as much physical memory as it can get, dumps and
> live dumps via 'savecore -L' are huge.  I just tested it on my
> workstation and got a 1.8G vmcore file when dumping only kernel
> pages.
> 
> Might it be possible to add an extension that supports dumping
> without the whole ZFS cache?  I guess this would make kernel live
> dumps smaller again, as they used to be...
> 
> Any comments?
> 
> Cheers,
> Tom
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen-osdev
Harley:

> Old 36GB drives:
> 
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> | 
> | real2m31.991s
> | user0m0.007s
> | sys 0m0.923s
> 
> Newer 300GB drives:
> 
> | # time mkfile -v 1g zeros-1g
> | zeros-1g 1073741824 bytes
> | 
> | real0m8.425s
> | user0m0.010s
> | sys 0m1.809s

This is a pretty dramatic difference.  What type of drives were your old
36GB drives?

>I am wondering if there is something other than capacity
> and seek time which has changed between the drives.  Would a
> different scsi command set or features have this dramatic a
> difference?

I'm hardly the authority on hardware, but there are a couple of
possibilities.  Your newer drives may have a write cache.  It's also
quite likely that the newer drives have a faster speed of rotation and
seek time.

If you subtract the usr + sys time from the real time in these
measurements, I suspect the result is the amount of time you were
actually waiting for the I/O to finish.  In the first case, you spent
99% of your total time waiting for stuff to happen, whereas in the
second case it was only about 78% ((8.425 - 1.819) / 8.425) of your
overall time.

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen-osdev
Harley:

>I had tried other sizes with much the same results, but
> hadn't gone as large as 128K.  With bs=128K, it gets worse:
> 
> | # time dd if=zeros-10g of=/dev/null bs=128k count=102400
> | 81920+0 records in
> | 81920+0 records out
> | 
> | real2m19.023s
> | user0m0.105s
> | sys 0m8.514s

I may have done my math wrong, but if we assume that the real
time is the actual amount of time we spent performing the I/O (which may
be incorrect), haven't you done better here?

In this case you pushed 81920 128k records in ~139 seconds -- approx
75437 k/sec.

Using ZFS with 8k bs, you pushed 102400 8k records in ~68 seconds --
approx 12047 k/sec.

Using the raw device you pushed 102400 8k records in ~23 seconds --
approx 35617 k/sec.

I may have missed something here, but isn't this newest number the
highest performance so far?
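
For anyone who wants to double-check the arithmetic, the numbers above
are just records * record size / elapsed real time.  A throwaway
calculation, with the values copied from the measurements quoted in
this thread:

#include <stdio.h>

int
main(void)
{
        struct run {
                const char *label;
                long records;
                long kb_per_record;
                double seconds;
        } runs[] = {
                { "dd bs=128k through ZFS", 81920, 128, 139.0 },
                { "dd bs=8k through ZFS",  102400,   8,  68.0 },
                { "dd bs=8k raw device",   102400,   8,  23.0 },
        };
        int i;

        for (i = 0; i < 3; i++)
                (void) printf("%-24s ~%6.0f KB/s\n", runs[i].label,
                    runs[i].records * runs[i].kb_per_record /
                    runs[i].seconds);
        return (0);
}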

What does iostat(1M) say about your disk read performance?

>Is there any other info I can provide which would help?

Are you just trying to measure ZFS's read performance here?

It might be interesting to change your outfile (of) argument and see if
we're actually running into some other performance problem.  If you
change of=/tmp/zeros, does performance improve or degrade?  Likewise, if
you write the file out to another disk (UFS, ZFS, whatever), does this
improve performance?

-j
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: slow reads question...

2006-09-22 Thread johansen
ZFS uses a 128k record size by default.  If you change dd to use bs=128k, do you 
observe any performance improvement?

> | # time dd if=zeros-10g of=/dev/null bs=8k
> count=102400
> | 102400+0 records in
> | 102400+0 records out
>
> | real1m8.763s
> | user0m0.104s
> | sys 0m1.759s

It's also worth noting that this dd used less system and user time than the 
read from the raw device, yet took longer in "real" time.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Memory Usage

2006-09-12 Thread johansen
> 1) You should be able to limit your cache max size by
> setting arc.c_max.  It's currently initialized to be
> phys-mem-size - 1GB.

Mark's assertion that this is not a best practice is something of an 
understatement.  ZFS was designed so that users/administrators wouldn't have to 
configure tunables to achieve optimal system performance.  ZFS performance is 
still a work in progress.

The problem with adjusting arc.c_max is that its definition may change from one 
release to another.  It's an internal kernel variable, so its existence isn't 
guaranteed.  There are also no guarantees about the semantics of what a future 
arc.c_max might mean.  It's possible that future implementations may change the 
definition such that reducing c_max has other unintended consequences.

Unfortunately, at the present time this is probably the only way to limit the 
cache size.  Mark and I are working on strategies to make sure that ZFS is a 
better citizen when it comes to memory usage and performance.  Mark has 
recently made a number of changes which should help ZFS reduce its memory 
footprint.  However, until these changes and others make it into a production 
build we're going to have to live with this inadvisable approach for adjusting 
the cache size.

-j
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss