Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-22 Thread Robert Milkowski
Hello Richard,

Wednesday, October 15, 2008, 6:39:49 PM, you wrote:

RE Archie Cowan wrote:
 I just stumbled upon this thread somehow and thought I'd share my zfs over 
 iscsi experience. 

 We recently abandoned a similar configuration with several pairs of x4500s 
 exporting zvols as iscsi targets and mirroring them for high availability 
 with T5220s. 
   

RE In general, such tasks would be better served by T5220 (or the new T5440 :-)
RE and J4500s.  This would change the data paths from:
RE client --net-- T5220 --net-- X4500 --SATA-- disks
RE to
RE client --net-- T5440 --SAS-- disks

RE With the J4500 you get the same storage density as the X4500, but
RE with SAS access (some would call this direct access).  You will have
RE much better bandwidth and lower latency between the T5440 (server)
RE and disks while still having the ability to multi-head the disks.  The
RE J4500 is a relatively new system, so this option may not have been
RE available at the time Archie was building his system.

Has MPxIO for J4500 (SAS) been backported to S10 yet?

-- 
Best regards,
 Robert Milkowski  mailto:[EMAIL PROTECTED]
                   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-22 Thread Richard Elling
Robert Milkowski wrote:
 Hello Richard,

 Wednesday, October 15, 2008, 6:39:49 PM, you wrote:

 RE Archie Cowan wrote:
   
 I just stumbled upon this thread somehow and thought I'd share my zfs over 
 iscsi experience. 

 We recently abandoned a similar configuration with several pairs of x4500s 
 exporting zvols as iscsi targets and mirroring them for high availability 
 with T5220s. 
   
   

 RE In general, such tasks would be better served by T5220 (or the new T5440 
 :-)
 RE and J4500s.  This would change the data paths from:
 RE client --net-- T5220 --net-- X4500 --SATA-- disks
 RE to
 RE client --net-- T5440 --SAS-- disks

 RE With the J4500 you get the same storage density as the X4500, but
 RE with SAS access (some would call this direct access).  You will have
 RE much better bandwidth and lower latency between the T5440 (server)
 RE and disks while still having the ability to multi-head the disks.  The
 RE J4500 is a relatively new system, so this option may not have been
 RE available at the time Archie was building his system.

 Has MPxIO for J4500 (SAS) been backported to S10 yet?
   

It is not a J4500 feature, it will depend on the HBA and driver.  mpt(7d)
has it in Solaris 10 5/08 (update 5) and patches are available for update 4.
When in doubt, check the man page for your driver.
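As a rough sketch (not authoritative -- command support varies by Solaris 10 update and patch level, so verify against stmsboot(1M), mpt(7D), and mpathadm(1M) on your own system), checking and enabling MPxIO for mpt-attached storage might look like:

```shell
# Check whether MPxIO is currently disabled for the mpt driver
grep mpxio-disable /kernel/drv/mpt.conf

# Enable MPxIO for mpt-attached devices only, then reboot
stmsboot -D mpt -e

# After the reboot, confirm the multipathed logical units are visible
mpathadm list lu
```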
 -- richard



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-20 Thread Gary Mills
On Thu, Oct 16, 2008 at 03:50:19PM +0800, Gray Carper wrote:
 
Sidenote: Today we made eight network/iSCSI related tweaks that, in
aggregate, have resulted in dramatic performance improvements (some I
just hadn't gotten around to yet, others suggested by Sun's Mertol
Ozyoney)...
- disabling the Nagle algorithm on the head node
- setting each iSCSI target block size to match the ZFS record size of
128K
- disabling thin provisioning on the iSCSI targets
- enabling jumbo frames everywhere (each switch and NIC)
- raising ddi_msix_alloc_limit to 8
- raising ip_soft_rings_cnt to 16
- raising tcp_deferred_acks_max to 16
- raising tcp_local_dacks_max to 16

Can you tell us which of those changes made the most dramatic
improvement?  I have a similar situation here, with a 2-TB ZFS pool on
a T2000 using iSCSI to a NetApp file server.  Is there any way to tell
in advance if any of those changes will make a difference?  Many of
them seem to be server resources.  How can I determine their current
usage?

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-20 Thread Jim Dunham
Gary,

   Sidenote: Today we made eight network/iSCSI related tweaks that, in
   aggregate, have resulted in dramatic performance improvements  
 (some I
   just hadn't gotten around to yet, others suggested by Sun's Mertol
   Ozyoney)...
   - disabling the Nagle algorithm on the head node
   - setting each iSCSI target block size to match the ZFS record  
 size of
   128K
   - disabling thin provisioning on the iSCSI targets
   - enabling jumbo frames everywhere (each switch and NIC)
   - raising ddi_msix_alloc_limit to 8
   - raising ip_soft_rings_cnt to 16
   - raising tcp_deferred_acks_max to 16
   - raising tcp_local_dacks_max to 16

 Can you tell us which of those changes made the most dramatic
 improvement?

   - disabling the Nagle algorithm on the head node

This will have a dramatic effect on most I/Os, except for large  
sequential writes.

 - setting each iSCSI target block size to match the ZFS record size  
 of 128K
  - enabling jumbo frames everywhere (each switch and NIC)


These will have a positive effect for large writes, both sequential  
and random

   - disabling thin provisioning on the iSCSI targets

This only has a benefit for file-based or dsk-based backing stores. If  
one uses rdsk backing stores of any type, this is not an issue.
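To illustrate the distinction (the tank/iscsivol and tank-vol1 names below are made up, and the syntax should be double-checked against iscsitadm(1M) and zfs(1M) on your release), exporting a zvol through its character device avoids a file- or dsk-style backing store entirely:

```shell
# Create a zvol and export it via its rdsk (character) path,
# so no file- or dsk-based backing store is involved
zfs create -V 100g tank/iscsivol
iscsitadm create target -b /dev/zvol/rdsk/tank/iscsivol tank-vol1

# Verify which backing store the target is using
iscsitadm list target -v tank-vol1
```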

Jim

 I have a similar situation here, with a 2-TB ZFS pool on
 a T2000 using iSCSI to a NetApp file server.  Is there any way to tell
 in advance if any of those changes will make a difference?  Many of
 them seem to be server resources.  How can I determine their current
 usage?

 -- 
 -Gary Mills--Unix Support--U of M Academic Computing and  
 Networking-

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-20 Thread Gray Carper
Hey, Jim! Thanks so much for the excellent assist on this - much better than
I could have ever answered it!

I thought I'd add a little bit on the other four...

 - raising ddi_msix_alloc_limit to 8

For PCI cards that use up to 8 interrupts, which our 10GbE adapters do. The
previous value of 2 could cause some CPU interrupt bottlenecks. So far, this
has been more of a preventative measure - we haven't seen a case where this
really made any performance impact.

 - raising ip_soft_rings_cnt to 16

This increases the number of kernel threads associated with packet
processing and is specifically meant to reduce the latency in handling
10GbE traffic. This showed a small performance improvement.

 - raising tcp_deferred_acks_max to 16

This reduces the number of ACK packets sent, thus reducing the overall TCP
overhead. This showed a small performance improvement.

 - raising tcp_local_dacks_max to 16

This also slows down ACK packets and showed a tiny performance improvement.

Overall, we have found these four settings to not make a whole lot of
difference, but every little bit helps. ;) The four that Jim went through
were much more impactful, particularly the enabling of jumbo frames and the
disabling of the Nagle algorithm.
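For anyone who wants to experiment with the same knobs, here is a rough sketch of how they are typically applied on Solaris 10 (the nxge0 interface name is only an example, and parameter names and defaults can differ between releases, so read back each current value first and test on a non-production box):

```shell
# Runtime TCP/IP settings (lost at reboot; persist them in an init script)
ndd -get /dev/tcp tcp_naglim_def            # note the current value first
ndd -set /dev/tcp tcp_naglim_def 1          # disable the Nagle algorithm
ndd -set /dev/tcp tcp_deferred_acks_max 16
ndd -set /dev/tcp tcp_local_dacks_max 16
ifconfig nxge0 mtu 9000                     # jumbo frames on this NIC

# Kernel tunables -- add to /etc/system and reboot:
#   set ddi_msix_alloc_limit=8
#   set ip:ip_soft_rings_cnt=16
```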

-Gray

On Tue, Oct 21, 2008 at 4:21 AM, Jim Dunham [EMAIL PROTECTED] wrote:

 Gary,

   Sidenote: Today we made eight network/iSCSI related tweaks that, in
  aggregate, have resulted in dramatic performance improvements (some I
  just hadn't gotten around to yet, others suggested by Sun's Mertol
  Ozyoney)...
  - disabling the Nagle algorithm on the head node
  - setting each iSCSI target block size to match the ZFS record size of
  128K
  - disabling thin provisioning on the iSCSI targets
  - enabling jumbo frames everywhere (each switch and NIC)
  - raising ddi_msix_alloc_limit to 8
  - raising ip_soft_rings_cnt to 16
  - raising tcp_deferred_acks_max to 16
  - raising tcp_local_dacks_max to 16


 Can you tell us which of those changes made the most dramatic
 improvement?


   - disabling the Nagle algorithm on the head node


 This will have a dramatic effect on most I/Os, except for large
 sequential writes.

  - setting each iSCSI target block size to match the ZFS record size of
 128K
  - enabling jumbo frames everywhere (each switch and NIC)



 These will have a positive effect for large writes, both sequential and
 random

   - disabling thin provisioning on the iSCSI targets


 This only has a benefit for file-based or dsk-based backing stores. If one
 uses rdsk backing stores of any type, this is not an issue.

 Jim

  I have a similar situation here, with a 2-TB ZFS pool on
 a T2000 using iSCSI to a NetApp file server.  Is there any way to tell
 in advance if any of those changes will make a difference?  Many of
 them seem to be server resources.  How can I determine their current
 usage?

 --
 -Gary Mills--Unix Support--U of M Academic Computing and
 Networking-


 Jim Dunham
 Storage Platform Software Group
 Sun Microsystems, Inc.




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-17 Thread Ross
Some of that is very worrying Miles, do you have bug ID's for any of those 
problems?

I'm guessing the problem of the device being reported ok after the reboot could 
be this one:
http://bugs.opensolaris.org/view_bug.do?bug_id=6582549

And could the errors after the reboot be one of these?
http://bugs.opensolaris.org/view_bug.do?bug_id=6558852
http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

I don't have the same concerns myself that you guys have over massive pools 
since we're working at a much smaller scale, but even so it's no good ZFS 
having "only resilvers the missing data" as one of its main selling features 
if it can't be relied upon to do that every time in real-world situations.

Incidentally, even with those resilver bugs, a few back-of-the-envelope 
calculations make me think that this might not be too bad with 10Gb Ethernet:

Server size:  28TB
Interconnect speed:  10Gb/s   (call it 8Gb/s of actual bandwidth)
Usage:  70%   (worst case scenario - pool dies while under heavy load)

That gives us an available resilver bandwidth of 3Gb/s, which I'll divide by 
two since that has to be used for both reads and writes.

28TB @ 1.5Gb/s gives a resilver time of around 42 hours, and changing some of 
the assumptions by dropping pool usage to 20% brings that down to 16 hours.  
It's still a long time, but for a rare disaster recovery scenario for a large 
pool, I think I could live with it.
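Ross's back-of-the-envelope figures can be reproduced with a small script (the 28TB pool size and the rounded 1.5Gb/s resilver bandwidth are taken straight from the assumptions above):

```shell
#!/bin/sh
# Estimate resilver time: pool size in TB, resilver bandwidth in Gb/s
POOL_TB=28
RESILVER_GBPS=1.5   # 3Gb/s spare bandwidth, halved for reads + writes

awk -v tb="$POOL_TB" -v gbps="$RESILVER_GBPS" 'BEGIN {
    # TB -> gigabits (x8 x1000), then seconds -> hours (/3600)
    hours = tb * 8 * 1000 / gbps / 3600
    printf "resilvering %d TB at %.1f Gb/s takes about %.1f hours\n",
           tb, gbps, hours
}'
```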
--
This message posted from opensolaris.org


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Ross
Well obviously recovery scenarios need testing, but I still don't see it being 
that bad.  My thinking on this is:

1.  Loss of a server is very much the worst case scenario.  Disk errors are 
much more likely, and with raid-z2 pools on the individual servers this should 
not pose a problem.  I also would not expect to see disk failures downing an 
entire x4500.  Sun have sold an awful lot of these now, enough for me to feel 
any such problems should be a thing of the past.

2.  Even when a server does fail, the nature of ZFS is such that you would not 
expect to lose your data, nor should you be expecting to resilver the entire 
28TB.  A motherboard / backplane / PSU failure will offline that server, but 
once the faulted components are replaced your pool will come back online.  Once 
the pool is online, ZFS has the ability to resilver just the changed data, 
meaning that your rebuild time will be simply proportional to the time the 
server was down.

Of course these failure modes would need testing, as would rebuild times.  I 
don't see 'zfs send' performance being an issue though, not unless Gray has 
another 150TB of storage lying around that he's not telling us about.  :-)

There are always going to be some tradeoffs between risk, capacity and price, 
but I expect that the benefits of this setup far outweigh the negatives.

Ross


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Gray Carper
Howdy!

Very valuable advice here (and from Bob, who made similar comments - thanks,
Bob!). I think, then, we'll generally stick to 128K recordsizes. In the case
of databases, we'll stray as appropriate, and we may also stray with the HPC
compute cluster if we can demonstrate that it is worth it.

To answer your questions below...

Currently, we have a single pool, in a load share configuration (no
raidz), that collects all the storage (which answers Ross' question too).
From that we carve filesystems on demand. There are many more tests planned
for that construction, though, so we are not married to it.

Redundancy abounds. ;) Since the pool doesn't employ raidz, it isn't
internally redundant, but we plan to replicate the pool's data to an
identical system (which is not yet built) at another site. Our initial
userbase doesn't need the replication, however, because they use the system
for little more than scratch space. Huge genomic datasets are dumped on the
storage, analyzed, and the results (which are much smaller) get sent
elsewhere. Everything is wiped out soon after that and the process starts
again. Future projected uses of the storage, however, would be far less
tolerant of loss, so I expect we'll want to reconfigure the pool in raidz.

I see that Archie and Miles have shared some harrowing concerns which we
take very seriously. I don't think I'll be able to reply to them today, but
I certainly will in the near future (particularly once we've completed some
more of our induced failure scenarios).

Sidenote: Today we made eight network/iSCSI related tweaks that, in
aggregate, have resulted in dramatic performance improvements (some I just
hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

- disabling the Nagle algorithm on the head node
- setting each iSCSI target block size to match the ZFS record size of 128K
- disabling thin provisioning on the iSCSI targets
- enabling jumbo frames everywhere (each switch and NIC)
- raising ddi_msix_alloc_limit to 8
- raising ip_soft_rings_cnt to 16
- raising tcp_deferred_acks_max to 16
- raising tcp_local_dacks_max to 16

Rerunning the same tests, we now see...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 143373
Rewrite: 183170
Read: 433205
Reread: 435503
Random Read: 90118
Random Write: 19488

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f
/volumes/data-iscsi/perftest/8gbtest
Write:  463260
Rewrite:  449280
Read:  1092291
Reread:  881044
Random Read:  442565
Random Write:  565565

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 357199
Rewrite: 342788
Read: 609553
Reread: 645618
Random Read: 218874
Random Write: 339624

Thanks so much to everyone for all their great contributions!
-Gray

On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai 
[EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hi Gray,

 You've got a nice setup going there, few comments:

 1. Do not tune ZFS without a proven test-case to show otherwise, except...
 2. For databases. Tune recordsize for that particular FS to match DB
 recordsize.

 Few questions...

 * How are you divvying up the space ?
 * How are you taking care of redundancy ?
 * Are you aware that each layer of ZFS needs its own redundancy ?

 Since you have got a mixed use case here, I would be surprised if a general
 config would cover all, though it might do with some luck.




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Ross
Miles makes a good point here, you really need to look at how this copes with 
various failure modes.

Based on my experience, iSCSI is something that may cause you problems.  When I 
tested this kind of setup last year I found that the entire pool hung for 3 
minutes any time an iSCSI volume went offline.  It looked like a relatively 
simple thing to fix if you can recompile the iSCSI driver, and there is talk 
about making the timeout adjustable, but for me that was enough to put our 
project on hold for now.

Ross


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Gray Carper
Oops - one thing I meant to mention: We only plan to cross-site replicate
data for those folks who require it. The HPC data crunching would have no
use for it, so that filesystem wouldn't be replicated. In reality, we only
expect a select few users, with relatively small filesystems, to actually
need replication. (Which begs the question: Why build an identical 150TB
system to support that? Good question. I think we'll reevaluate. ;)

-Gray

On Thu, Oct 16, 2008 at 3:50 PM, Gray Carper [EMAIL PROTECTED] wrote:

 Howdy!

 Very valuable advice here (and from Bob, who made similar comments -
 thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In
 the case of databases, we'll stray as appropriate, and we may also stray
 with the HPC compute cluster if we can demonstrate that it is worth it.

 To answer your questions below...

 Currently, we have a single pool, in a load share configuration (no
 raidz), that collects all the storage (which answers Ross' question too).
 From that we carve filesystems on demand. There are many more tests planned
 for that construction, though, so we are not married to it.

 Redundancy abounds. ;) Since the pool doesn't employ raidz, it isn't
 internally redundant, but we plan to replicate the pool's data to an
 identical system (which is not yet built) at another site. Our initial
 userbase doesn't need the replication, however, because they use the system
 for little more than scratch space. Huge genomic datasets are dumped on the
 storage, analyzed, and the results (which are much smaller) get sent
 elsewhere. Everything is wiped out soon after that and the process starts
 again. Future projected uses of the storage, however, would be far less
 tolerant of loss, so I expect we'll want to reconfigure the pool in raidz.

 I see that Archie and Miles have shared some harrowing concerns which we
 take very seriously. I don't think I'll be able to reply to them today, but
 I certainly will in the near future (particularly once we've completed some
 more of our induced failure scenarios).

 Sidenote: Today we made eight network/iSCSI related tweaks that, in
 aggregate, have resulted in dramatic performance improvements (some I just
 hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

 - disabling the Nagle algorithm on the head node
 - setting each iSCSI target block size to match the ZFS record size of 128K
 - disabling thin provisioning on the iSCSI targets
 - enabling jumbo frames everywhere (each switch and NIC)
 - raising ddi_msix_alloc_limit to 8
 - raising ip_soft_rings_cnt to 16
 - raising tcp_deferred_acks_max to 16
 - raising tcp_local_dacks_max to 16

 Rerunning the same tests, we now see...

 [1GB file size, 1KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
 Write: 143373
 Rewrite: 183170
 Read: 433205
 Reread: 435503
 Random Read: 90118
 Random Write: 19488

 [8GB file size, 512KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f
 /volumes/data-iscsi/perftest/8gbtest
 Write:  463260
 Rewrite:  449280
 Read:  1092291
 Reread:  881044
 Random Read:  442565
 Random Write:  565565

 [64GB file size, 1MB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
 Write: 357199
 Rewrite: 342788
 Read: 609553
 Reread: 645618
 Random Read: 218874
 Random Write: 339624

 Thanks so much to everyone for all their great contributions!
 -Gray


 On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai 
 [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

 Hi Gray,

 You've got a nice setup going there, few comments:

 1. Do not tune ZFS without a proven test-case to show otherwise, except...
 2. For databases. Tune recordsize for that particular FS to match DB
 recordsize.

 Few questions...

 * How are you divvying up the space ?
 * How are you taking care of redundancy ?
 * Are you aware that each layer of ZFS needs its own redundancy ?

 Since you have got a mixed use case here, I would be surprised if a
 general config would cover all, though it might do with some luck.




 --
 Gray Carper
 MSIS Technical Services
 University of Michigan Medical School
 [EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
 http://www.umms.med.umich.edu/msis/




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Miles Nordin
 r == Ross  [EMAIL PROTECTED] writes:

 r 1.  Loss of a server is very much the worst case scenario.
 r Disk errors are much more likely, and with raid-z2 pools on
 r the individual servers

yeah, it kind of sucks that the slow resilvering speed enforces this
two-tier scheme.

Also if you're going to have 1000 spinning platters you'll have a
drive failure every four days or so---you need to be able to do more
than one resilver at a time, and you need to do resilvers without
interrupting scrubs which could take so long to run that you run them
continuously.  The ZFS-on-zvol hack lets you do both to a point, but I
think it's an ugly workaround for lack of scalability in flat ZFS, not
the ideal way to do things.

 r A motherboard / backplane / PSU failure will offline that
 r server, but once the faulted components are replaced your pool
 r will come back online.  Once the pool is online, ZFS has the
 r ability to resilver just the changed data,

except that is not what actually happens for my iSCSI setup.  If I
'zpool offline' the target before taking it down, it usually does work
as you describe---a relatively fast resilver kicks off, and no CKSUM
errors appear later.  I've used it gently.  I haven't offlined a
raidz2 device for three weeks while writing gigabytes to the pool in
the mean time, but for my gentle use it does seem to work.

But if the iSCSI target goes down unexpectedly---ex., because I pull
the network cord---it does come back online and does resilver, but
latent CKSUM errors show up weeks later.

Also, if the head node reboots during a resilver, ZFS totally forgets
what it was doing, and upon reboot just blindly mounts the unclean
component as if it were clean, later calling all the differences CKSUM
errors.  The same thing happens if you offline a device, then reboot.  The
``persistent'' offlining doesn't seem to work, and in any case the
device comes online without a proper resilver.

SVM had dirty-region logging stored in the metadb so that resilvers
could continue where they left off across reboots.  I believe SVM
usually did a full resilver when a component disappeared, but am not
sure this was always the case.  Anyway ZFS doesn't seem to have a
similar capability, at least not one that works.

so, in practice, whenever any iSCSI component goes away
unexpectedly---target server failure, power failure, kernel panic, L2
spanning tree reconfiguration, whatever---you have to scrub the whole
pool from the head node.


It's interesting how the speed and optimisation of these maintenance
activities limit pool size.  It's not just full scrubs.  If the
filesystem is subject to corruption, you need a backup.  If the
filesystem takes two months to back up / restore, then you need really
solid incremental backup/restore features, and the backup needs to be
a cold spare, not just a backup---restoring means switching the roles
of the primary and backup system, not actually moving data.  

finally, for really big pools, even O(n) might be too slow.  The ZFS
best practice guide for converting UFS to ZFS says ``start multiple
rsync's in parallel,'' but I think we're finding zpool scrubs and zfs
sends are not well-parallelized.

These reliability limitations and performance characteristics of
maintenance tasks seem to make a sort of max-pool-size Wall beyond
which you end up painted into corners.  If they were made better, I
think you'd later hit another wall at the maximum amount of data you
could push through one head node and would have to switch to some
QFS/GFS/OCFS-type separate-data-and-metadata filesystem, and to match
ZFS this filesystem would have to do scrubs, resilvers, and backups in
a distributed way not just distribute normal data access.  A month ago
I might have ranted, ``head node speed puts a cap on how _busy_ the
filesystem can be, not how big it can be, so ZFS (modulo a lot of bug
fixes) could be fantastic for data sets of virtually unlimited size
even with its single-initiator, single-head-node limitation, so long
as the pool gets very light access.''  Now, I don't think so, because
scrubbing/resilvering/backup-restore has to flow through the head
node, too.

This observation also means my preference for a ``recovery tool'' that
treats corrupt pools as read-only over fsck (online or offline) isn't
very scalable.  The original zfs kool-aid ``online maintenance'' model
of doing a cheap fsck at import time and a long O(n) fsck through
online scrubs is the only one with a future in a world where
maintenance activities can take months.




Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 It's interesting how the speed and optimisation of these maintenance
 activities limit pool size.  It's not just full scrubs.  If the filesystem is
 subject to corruption, you need a backup.  If the filesystem takes two months
 to back up / restore, then you need really solid incremental backup/restore
 features, and the backup needs to be a cold spare, not just a
 backup---restoring means switching the roles of the primary and backup
 system, not actually moving data.   

I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach.  There
just seem to be too many moving parts depending on each other, any one of
which can make the entire pool unavailable.

For the stated usage of the original poster, I think I would aim toward
turning each of the Thumpers into an NFS server, configure the head-node
as a pNFS/NFSv4.1 metadata server, and let all the clients speak parallel-NFS
to the cluster of file servers.  You'll end up with a huge logical pool,
but a Thumper outage should result only in loss of access to the data on
that particular system.  The work of scrub/resilver/replication can be
divided among the servers rather than all living on a single head node.

Regards,

Marion




Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Erast Benson
pNFS is NFS-centric of course, and it is not yet stable, is it? BTW,
what is the ETA for the pNFS putback?

On Thu, 2008-10-16 at 12:20 -0700, Marion Hakanson wrote:
 [EMAIL PROTECTED] said:
  It's interesting how the speed and optimisation of these maintenance
  activities limit pool size.  It's not just full scrubs.  If the filesystem 
  is
  subject to corruption, you need a backup.  If the filesystem takes two 
  months
  to back up / restore, then you need really solid incremental backup/restore
  features, and the backup needs to be a cold spare, not just a
  backup---restoring means switching the roles of the primary and backup
  system, not actually moving data.   
 
 I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
 and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach.  There
 just seem to be too many moving parts depending on each other, any one of
 which can make the entire pool unavailable.
 
 For the stated usage of the original poster, I think I would aim toward
 turning each of the Thumpers into an NFS server, configure the head-node
 as a pNFS/NFSv4.1 metadata server, and let all the clients speak parallel-NFS
 to the cluster of file servers.  You'll end up with a huge logical pool,
 but a Thumper outage should result only in loss of access to the data on
 that particular system.  The work of scrub/resilver/replication can be
 divided among the servers rather than all living on a single head node.
 
 Regards,
 
 Marion
 
 



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Nicolas Williams
On Thu, Oct 16, 2008 at 12:20:36PM -0700, Marion Hakanson wrote:
 I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
 and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach.  There
 just seem to be too many moving parts depending on each other, any one of
 which can make the entire pool unavailable.

But does it work well enough?  It may be faster than NFS if there's only
one client for each volume (unless you have fast slog devices for the
ZIL).  And it'd have better semantics too (e.g., no need for the client
and server to agree on identities/domains).

Nico
-- 


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 In general, such tasks would be better served by T5220 (or the new T5440 :-)
 and J4500s.  This would change the data paths from:
 client --net-- T5220 --net-- X4500 --SATA-- disks to
 client --net-- T5440 --SAS-- disks
 
 With the J4500 you get the same storage density as the X4500, but with SAS
 access (some would call this direct access).  You will have much better
 bandwidth and lower latency between the T5440 (server) and disks while still
 having the ability to multi-head the disks.  The 

There's an odd economic factor here, if you're in the .edu sector:  The
Sun Education Essentials promotional price list has the X4540 priced
lower than a bare J4500 (not on the promotional list, but with a standard
EDU discount).

We have a project under development right now which might be served well
by one of these EDU X4540's with a J4400 attached to it.  The spec sheets
for J4400 and J4500 say you can chain together enough of them to make a
pool of 192 drives.  I'm unsure about the bandwidth of these daisy-chained
SAS interconnects, though.  Any thoughts as to how high one might scale
an X4540-plus-J4x00 solution?  How does the X4540's internal disk bandwidth
compare to that of the (non-RAID) SAS HBA?

Regards,

Marion





Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Miles Nordin
 nw == Nicolas Williams [EMAIL PROTECTED] writes:

nw But does it work well enough?  It may be faster than NFS if

You're talking about different things.  Gray is using NFS period
between the storage cluster and the compute cluster, no iSCSI.

Gray's (``does it work well enough''):  iSCSI within storage cluster
NFS to storage consumers

Marion's (less ``uncomfortable''):  nothing(?) within storage cluster
pNFS to storage consumers

but Marion's is not really possible at all, and won't be for a while
with other groups' choice of storage-consumer platform, so it'd have
to be GlusterFS or some other goofy fringe FUSEy thing or
not-very-general crude in-house hack.

I guess since Gray is copying data in and out all the time he doesn't
have to worry about the glacial-restore problem and corruption
problem.  If it were my worry, I'd definitely include NFS clients in
the performance test because iSCSI is high-latency, and the NFS
clients could be more latency-sensitive than the local benchmark.  I
might test coalescing in the big data separately from running the
crunching, because maybe the big data can be copied in with
pax-over-netcat, or something other than NFS, and maybe the crunching
could treat the big data as read-only and write its small result to a
fast standalone ZFS server which would make NFS faster.  and i'd get
the small important data that needs backup off this mess (but please
let us know how the failure simulating testing goes!).




Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Nicolas Williams
On Thu, Oct 16, 2008 at 04:30:28PM -0400, Miles Nordin wrote:
  nw == Nicolas Williams [EMAIL PROTECTED] writes:
 
 nw But does it work well enough?  It may be faster than NFS if
 
 You're talking about different things.  Gray is using NFS period
 between the storage cluster and the compute cluster, no iSCSI.

I was replying to Marion's comment about ZFS-over-ISCSI-on-ZFS, not to
Gray.

I can see why one might worry about ZFS-over-iSCSI-on-ZFS.  Two layers
of copy-on-write might interact in odd ways that kill performance.  But
if you want ZFS-over-iSCSI in the first place then ZFS-over-iSCSI-on-ZFS
sounds like the correct approach IF it can perform well enough.

ZFS-over-iSCSI could certainly perform better than NFS, but again, it
may depend on what kind of ZIL devices you have.

Nico
-- 


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Miles Nordin
 nw == Nicolas Williams [EMAIL PROTECTED] writes:
 mh == Marion Hakanson [EMAIL PROTECTED] writes:

nw I was replying to Marion's [...]
nw ZFS-over-iSCSI could certainly perform better than NFS,

better than what, ZFS-over-'mkfile'-files-on-NFS?  No one was
suggesting that.  Do you mean better than pNFS?  It sounded at first
like you meant iSCSI-over-ZFS should perform better than NFS, but no
one's suggesting that either.

 Gray:NFS over ZFS over iSCSI over ZFS over disk

 Marion: pNFS over ZFS over disk

they are both using the same amount of {,p}NFS.




Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread David Magda
On Oct 16, 2008, at 15:20, Marion Hakanson wrote:

 For the stated usage of the original poster, I think I would aim  
 toward
 turning each of the Thumpers into an NFS server, configure the head- 
 node
 as a pNFS/NFSv4.1

It's a shame that Lustre isn't available on Solaris yet either.



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-16 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 but Marion's is not really possible at all, and won't be for a while with
 other groups' choice of storage-consumer platform, so it'd have to be
 GlusterFS or some other goofy fringe FUSEy thing or not-very-general crude
 in-house hack. 

Well, of course the magnitude of fringe factor is in the eye of the beholder.
I didn't intend to make pNFS seem like a done deal.  I don't quite yet
think of OpenSolaris as a done deal either, still using Solaris-10 here
in production, but since this is an OpenSolaris mailing list I should be
more careful.

Anyway, from looking over the wiki/blog info, apparently the sticking
point with pNFS may be client-side availability -- there are only Linux and
(Open)Solaris NFSv4.1 clients just yet.  Still, pNFS claims to be backwards
compatible with NFS v3 clients:  If you point a traditional NFS client at
the pNFS metadata server, the MDS is supposed to relay the data from the
backend data servers.


[EMAIL PROTECTED] said:
 It's a shame that Lustre isn't available on Solaris yet either. 

Actually, that may not be so terribly fringey, either.  Lustre and Sun's
Scalable Storage product can make use of Thumpers:
http://www.sun.com/software/products/lustre/
http://www.sun.com/servers/cr/scalablestorage/

Apparently it's possible to have a Solaris/ZFS data-server for Lustre
backend storage:
http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU

I see they do not yet have anything other than Linux clients, so that's
a limitation.  But you can share out a Lustre filesystem over NFS, potentially
from multiple Lustre clients.  Maybe via CIFS/samba as well.

Lastly, I've considered the idea of using Shared-QFS to glue together
multiple Thumper-hosted ISCSI LUN's.  You could add shared-QFS clients
(acting as NFS/CIFS servers) if the client load needed more than one.
Then SAM-FS would be a possibility for backup/replication.

Anyway, I do feel that none of this stuff is quite there yet.  But my
experience with ZFS on fiberchannel SAN storage, that sinking feeling
I've had when a little connectivity glitch resulted in a ZFS panic,
makes me wonder if non-redundant ZFS on an ISCSI SAN is there yet,
either.  So far none of our lost-connection incidents resulted in pool
corruption, but we have only 4TB or so.  Restoring that much from tape
is feasible, but even if Gray's 150TB of data can be recreated, it would
take weeks to reload it.

If it's decided that the clustered-filesystem solutions aren't feasible yet,
the suggestion I've seen that I liked the best was Richard's, with a bad-boy
server SAS-connected to multiple J4500's.  But since Gray's project already
has the X4500's, I guess they'd have to find another use for them (:-).

Regards,

Marion




Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Bob Friesenhahn
On Wed, 15 Oct 2008, Gray Carper wrote:
 be good to set different recordsize parameters for each one. Do you have any
 suggestions on good starting sizes for each? I'd imagine filesharing might
 benefit from a relatively small record size (64K?), image-based backup
 targets might like a pretty large record size (256K?), databases just need
 recordsizes to match their block sizes, and HPC...I have no idea. Heh. I
 expect I'll need to get in contact with the HPC lab to see what kind of
 profile they have (whether they deal with tiny files or big files, etc).
 What do you think?

Pretty much the *only* reason to reduce the ZFS recordsize from its 
default of 128K is to support relatively unusual applications like 
databases which do random read/writes of small (often 8K) blocks. 
For sequential I/O, 128K is fine even if the application (or client) 
does reads/writes using much smaller blocks.

For small-block random I/O you will find that ZFS performance improves 
immensely when the ZFS recordsize matches the application recordsize. 
The reason for this is that ZFS does I/O using its full blocksize and 
so there is more latency and waste of I/O bandwidth and CPU if ZFS 
needs to process a 128K block for each 8K block update.
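Bob's read-modify-write point can be made concrete with a little arithmetic
(a sketch; the dataset name in the trailing comment is hypothetical):

```shell
# An 8K application write against the default 128K recordsize forces ZFS
# to read, modify, and rewrite the whole 128K record:
recordsize=$((128 * 1024))
appblock=$((8 * 1024))
echo "amplification: $((recordsize / appblock))x per small write"

# Matching the filesystem to the application avoids this, e.g. for an
# 8K-block database (pool/dataset names are made up):
#   zfs set recordsize=8k tank/db
```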

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Archie Cowan
I just stumbled upon this thread somehow and thought I'd share my zfs over 
iscsi experience. 

We recently abandoned a similar configuration with several pairs of x4500s 
exporting zvols as iscsi targets and mirroring them for high availability 
with T5220s. 

Initially, our performance was also good using iozone tests, but, in testing 
the resilvering processes with 10tb of data, it was abysmal. It took over a 
month for a 10tb x4500 mirror that was mostly mirrored to resilver back into 
health with its pair. So, not exactly a highly available configuration... if 
the other x4500 went unhealthy while the other was still resilvering we'd have 
been in a real bad place.  

Also, zfs send operations on filesystems hosted by the iscsi zpool couldn't 
push out more than a few kilobytes per second. Yes, we had all the 
multipathing, vlans, memory buffering and all kinds of nonsense to keep the 
network from being the bottleneck but to not much benefit. This was our plan 
for keeping our remote sites' filesystems in sync so it was vital. 

Maybe we did something completely wrong with our setup, but I'd suggest you 
verify how long it takes to resilver new x4500s into your iscsi pools and also 
see how well it does when your zpools are almost full. Our initial good 
performance test results were too good to be true and it turned out that they 
weren't the whole story. 

Good luck.
--
This message posted from opensolaris.org


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Gray Carper
Howdy, Brent!

Thanks for your interest! We're pretty enthused about this project over here
and I'd be happy to share some details with you (and anyone else who cares
to peek). In this post I'll try to hit the major configuration
bullet-points, but I can also throw you command-line level specifics if you
want them.

1. The six Thumper iSCSI target nodes, and the iSCSI initiator head node,
all have a high-availability network configuration by marrying link
aggregation and IP multipathing techniques. Each machine has four 1GBe
interfaces and one 10GBe interface (we could have had two 10GBe interfaces,
but we decided to save some cash ;). We link aggregate the four 1GBe
interfaces together to create a fatter 4GBe pipe, then we use IP
multipathing to group together the 10GBe interface and the 4GBe aggregation.
Through this we create a virtual service IP which can float back and
forth, automatically, between the two interfaces in the event of a network
path failure. The preferred home is the 10GBe interface, but if that dies
(or any part of its network path dies, like a switch somewhere down the
line), then the service IP migrates to the 4GBe aggregate (which is on a
completely separate network path) within four seconds. Whenever the 10GBe
interface is happy again, the service IP automatically migrates back to its
home.
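For reference, a hypothetical Solaris 10 sketch of the layering described
above; the interface names (e1000g0-3, nxge0), addresses, and group names
are assumptions for illustration, not Gray's actual configuration:

```shell
# 1. Aggregate the four 1GbE ports into one fat link (aggregation key 1):
#      dladm create-aggr -d e1000g0 -d e1000g1 -d e1000g2 -d e1000g3 1
#
# 2. Put the 10GbE interface and the aggregate in one IPMP group so the
#    service address floats between them, preferring the 10GbE side:
#      /etc/hostname.nxge0:  storage-svc  group storage-ipmp up
#      /etc/hostname.aggr1:  storage-data group storage-ipmp standby up
```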

2. The head node also has an InfiniBand interface which plugs it into our HPC
compute cluster network, giving the cluster direct access to whatever
storage it needs.

3. All six iSCSI nodes have a redundant disk configuration using four ZFS
raidz2 groups, each containing 10 drives which are spread across five
controllers. Six additional disks, from a sixth controller, also live in the
pool as spares. This results in a 28.4TB data pool for each node that can
survive disk and controller failures.
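A quick sanity check on that capacity figure (the per-drive size is an
assumption; the X4500 shipped with drives of varying sizes):

```shell
# 4 raidz2 groups x 10 drives, 2 parity drives per group:
vdevs=4; per_vdev=10; parity=2
data_drives=$((vdevs * (per_vdev - parity)))
echo "data-bearing drives: ${data_drives}"
# 32 drives x ~0.9TB usable (assuming nominal 1TB disks) is ~28.8TB,
# in line with the 28.4TB pool reported above.
```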

4. Each of the six iSCSI nodes are presenting the entirety of their 28TB
pools through CHAP-authenticated iSCSI targets. (See
http://docs.sun.com/app/docs/doc/819-5461/gechv?a=view  for more info on
that process.)

5. The NAS head node has wrangled up all six of the iSCSI targets (using
iscsiadm add discovery-address ...) and joined them to create ~150TB of
usable storage
(using zpool create against the devices created with iscsiadm). With that,
we've been able to carve up the storage into multiple ZFS filesystems, each
with its own recordsize, quota, permissions, NFS/CIFS shares, etc.
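A condensed, hypothetical version of steps 4-5 as run on the head node
(addresses, device names, and dataset names here are invented for
illustration; real iSCSI LUNs show up with long GUID-based device names):

```shell
# Point the initiator at each Thumper's portal (repeat for all six):
#   iscsiadm add discovery-address 10.0.0.11:3260
#   iscsiadm modify discovery --sendtargets enable
#
# Create device nodes for the new LUNs, then pool them together:
#   devfsadm -i iscsi
#   zpool create data-iscsi c2t01d0 c2t02d0 c2t03d0 c2t04d0 c2t05d0 c2t06d0
#
# Carve out per-workload filesystems, each with its own tuning:
#   zfs create -o recordsize=8k -o quota=2t data-iscsi/db
#   zfs create -o sharenfs=on data-iscsi/home
```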

I think that about covers the high-level stuff. If there's any area you want
to dive deeper into, fire away!

-Gray

On Wed, Oct 15, 2008 at 1:29 AM, Brent Jones [EMAIL PROTECTED] wrote:

 On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper [EMAIL PROTECTED] wrote:
  Hey, all!
 
  We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
 targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an
 x4200 head node. In trying to discover optimal ZFS pool construction
 settings, we've run a number of iozone tests, so I thought I'd share them
 with you and see if you have any comments, suggestions, etc.
 
  First, on a single Thumper, we ran baseline tests on the direct-attached
 storage (which is collected into a single ZFS pool comprised of four raidz2
 groups)...
 
  [1GB file size, 1KB record size]
  Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
  Write: 123919
  Rewrite: 146277
  Read: 383226
  Reread: 383567
  Random Read: 84369
  Random Write: 121617
 
  [8GB file size, 512KB record size]
  Command:
  Write:  373345
  Rewrite:  665847
  Read:  2261103
  Reread:  2175696
  Random Read:  2239877
  Random Write:  666769
 
  [64GB file size, 1MB record size]
  Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f
 /data-das/perftest/64gbtest
  Write: 517092
  Rewrite: 541768
  Read: 682713
  Reread: 697875
  Random Read: 89362
  Random Write: 488944
 
  These results look very nice, though you'll notice that the random read
 numbers tend to be pretty low on the 1GB and 64GB tests (relative to their
 sequential counterparts), but the 8GB random (and sequential) read is
 unbelievably good.
 
  Now we move to the head node's iSCSI aggregate ZFS pool...
 
  [1GB file size, 1KB record size]
  Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f
 /volumes/data-iscsi/perftest/1gbtest
  Write:  127108
  Rewrite:  120704
  Read:  394073
  Reread:  396607
  Random Read:  63820
  Random Write:  5907
 
  [8GB file size, 512KB record size]
  Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f
 /volumes/data-iscsi/perftest/8gbtest
  Write:  235348
  Rewrite:  179740
  Read:  577315
  Reread:  662253
  Random Read:  249853
  Random Write:  274589
 
  [64GB file size, 1MB record size]
  Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f
 /volumes/data-iscsi/perftest/64gbtest
  Write:  190535
  Rewrite:  194738
  Read:  297605
  Reread:  314829
  Random Read:  93102
  Random Write:  175688
 
  Generally speaking, the results look good, but you'll notice that random
 writes are atrocious on the 1GB tests and random 

Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Richard Elling
Archie Cowan wrote:
 I just stumbled upon this thread somehow and thought I'd share my zfs over 
 iscsi experience. 

 We recently abandoned a similar configuration with several pairs of x4500s 
 exporting zvols as iscsi targets and mirroring them for high availability 
 with T5220s. 
   

In general, such tasks would be better served by T5220 (or the new T5440 :-)
and J4500s.  This would change the data paths from:
client --net-- T5220 --net-- X4500 --SATA-- disks
to
client --net-- T5440 --SAS-- disks

With the J4500 you get the same storage density as the X4500, but
with SAS access (some would call this direct access).  You will have
much better bandwidth and lower latency between the T5440 (server)
and disks while still having the ability to multi-head the disks.  The
J4500 is a relatively new system, so this option may not have been
available at the time Archie was building his system.
 -- richard

 Initially, our performance was also good using iozone tests, but, in testing 
 the resilvering processes with 10tb of data, it was abysmal. It took over a 
 month for a 10tb x4500 mirror that was mostly mirrored to resilver back into 
 health with its pair. So, not exactly a highly available configuration... if 
 the other x4500 went unhealthy while the other was still resilvering we'd 
 have been in a real bad place.  

 Also, zfs send operations on filesystems hosted by the iscsi zpool couldn't 
 push out more than a few kilobytes per second. Yes, we had all the 
 multipathing, vlans, memory buffering and all kinds of nonsense to keep the 
 network from being the bottleneck but to not much benefit. This was our plan 
 for keeping our remote sites' filesystems in sync so it was vital. 

 Maybe we did something completely wrong with our setup, but I'd suggest you 
 verify how long it takes to resilver new x4500s into your iscsi pools and 
 also see how well it does when your zpools are almost full. Our initial good 
 performance test results were too good to be true and it turned out that they 
 weren't the whole story. 

 Good luck.
   



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Akhilesh Mritunjai
Hi Gray,

You've got a nice setup going there, few comments:

1. Do not tune ZFS without a proven test-case to show otherwise, except...
2. For databases. Tune recordsize for that particular FS to match DB recordsize.

Few questions...

* How are you divvying up the space ?
* How are you taking care of redundancy ?
* Are you aware that each layer of ZFS needs its own redundancy ?

Since you have got a mixed use case here, I would be surprised if a general 
config would cover all, though it might do with some luck.


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Ross
Am I right in thinking your top level zpool is a raid-z pool consisting of six 
28TB iSCSI volumes?  If so that's a very nice setup, it's what we'd be doing if 
we had that kind of cash :-)


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Miles Nordin
 gc == Gray Carper [EMAIL PROTECTED] writes:

gc 5. The NAS head node has wrangled up all six of the iSCSI
gc targets 

are you using raidz on the head node?  It sounds like simple striping,
which is probably dangerous with the current code.  This kind of sucks
because with simple striping you will get the performance of the 6
mega-spindles, while in a raidz you don't just get less storage, you
get ~1/6th the seek bandwidth.  but that's better than losing a whole
pool.  It's not even fully effective redundancy if
resilvering/scrubbing takes 3 days per 1TB, but if it just stops the
pool from becoming corrupt and unimportable then it's done its job.

how are you backing up that much storage?  or is it all ephemeral?
It's common to lose a whole pool, so I'd have thought you'd want to,
for example, keep home directories on a main pool and a backup pool,
but keep only one copy of the backup dumps since in theory they have
corresponding originals somewhere else.  

If you did split your x4500 * 6 into two pools, I wonder how you'd lay
out a ``main pool'' and ``backup pool'' such that they'd be unlikely
to get corrupt together.  make them on disjoint sets of iscsi target
nodes?  keep the backup pool exported?  

you could keep backup pool imported so other groups can write their
backups there, and spread it across all 6 targets, but declare a
recurring noon - 3pm maintenance window for the backup pool, in which
you: export, test-import, export, take snapshots of the zvols on the
target nodes, import.  Normally you would need II to use
device-snapshots for corruption protection, but since you have two
layers of ZFS you can use this remedial trick without learning how to
use AVS.  but only while the pool is exported because otherwise
there's no way to take all six snapshots at the same instant.  

or you could just hope.
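The maintenance-window trick above would look roughly like this (host and
dataset names are made up; a sketch, not a tested procedure):

```shell
# On the head node: quiesce the backup pool so the six zvols are consistent.
#   zpool export backup
#
# On each target node: snapshot the backing zvol while nothing is writing.
#   for h in thumper1 thumper2 thumper3 thumper4 thumper5 thumper6; do
#     ssh $h zfs snapshot data/backup-zvol@window
#   done
#
# Confirm the pool still imports cleanly, then return it to service.
#   zpool import backup
```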

I'm most interested in failure testing.  What happens when you reboot
nodes or break network connections while there's write activity to the
pool?  That is nice that the ``service address'' fails over and fails
back, and that you've somehow extended the heartbeat all the way from
target to head node, but does this actually work well with iSCSI?
Does the iSCSI TCP circuit get broken and remade when the address
moves, and does this cause errors or even cause corruption if it
happens while writing to the pool?  How about something more drastic
like rebooting the x4500's---does the head node patiently wait and
then continue where it left off like clients are supposed to when
their NFS server reboots, or does it panic, or does it freeze for a
couple minutes, mark the target down and continue, and then throw a
bunch of CKSUM errors when the target comes back?  The last one is
what happens for me, but I have a mirror vdev on the head node so my
setup's different.

If you can get this setup to work sanely in error scenarios, I think
it can potentially have an availability advantage because some of the
driver problems causing freezes and kernel panics and hung ports we've
seen won't hurt you---you can just reboot the whole target node, so
shitty drivers become merely irritating to the sysadmin instead of an
availability problem.  but my expectation is, you can't.

It sounds really scary to me, to be honest, like: 200 eggs, one basket.
and the basket is made of duct tape.

i'm less interested in performance.  I can think of a bunch of silly
performance-test questions but I found most interesting Archie's
experience about how performance can influence effective reliability.
Here are the silly questions:

  have you tried any other layouts?  like exporting individual disks
  with iSCSI?  My intuition is that this would not work well because
  of TCP congestion, and I also worry the iscsi target would freeze
  the whole box when one drive failed, a behavior which could be
  statistically significant to the overall system's reliability when
  there are so many drives involved.  but I wonder.  also a simple SVM
  stripe, or maybe two or three stripes per box, might be faster by
  avoiding zvol COW.

(also, know that Linux has an iSCSI target, too.  actually it has
 three right now: IET, scst, and stgt)

  any end-to-end testing yet?  how is the performance of NFS or CIFS
  or...what are you hoping to use over the infiniband again,
  comstar/iSER or is it just IP+NFS?  i don't know much about IB.

  are there fast disks in the head node that you could use to
  expermient with slog or l2arc?  since slogs can't be removed without
  destroying the pool, you might want testing of NFS+slog/NFS-slog
  before the pool has real data on it.

  can you try with and without RED on the switches?  i've always
  wondered if this makes a difference but not bothered to check it
  because my targets are too slow.



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Miles Nordin
 r == Ross  [EMAIL PROTECTED] writes:

 r Am I right in thinking your top level zpool is a raid-z pool
 r consisting of six 28TB iSCSI volumes?  If so that's a very
 r nice setup,

not if it scrubs at 400GB/day, and 'zfs send' is uselessly slow.  Also
I am thinking the J4500 Richard mentioned may be more robust to single
disk failures not taking down the whole box compared to a device with
a Solaris kernel in it.

s/very nice/stupidly large capacity/




Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Bob Friesenhahn
On Wed, 15 Oct 2008, Marcelo Leal wrote:

 Are you talking about what he had in the logic of the configuration at top 
 level, or you are saying his top level pool is a raidz?
 I would think his top level zpool is a raid0...

ZFS does not support RAID0 (simple striping).

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Tomas Ögren
On 15 October, 2008 - Bob Friesenhahn sent me these 0,6K bytes:

 On Wed, 15 Oct 2008, Marcelo Leal wrote:
 
  Are you talking about what he had in the logic of the configuration at top 
  level, or you are saying his top level pool is a raidz?
  I would think his top level zpool is a raid0...
 
 ZFS does not support RAID0 (simple striping).

zpool create mypool disk1 disk2 disk3

Sure it does.

/Tomas
-- 
Tomas Ögren, [EMAIL PROTECTED], http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Marcelo Leal
So, there is no raid10 in a solaris/zfs setup?
I'm talking about no redundancy...


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Bob Friesenhahn

On Wed, 15 Oct 2008, Tomas Ögren wrote:

ZFS does not support RAID0 (simple striping).


zpool create mypool disk1 disk2 disk3

Sure it does.


This is load-share, not RAID0.  Also, to answer the other fellow, 
since ZFS does not support RAID0, it also does not support RAID 1+0 
(10). :-)


With RAID0 and 8 drives in a stripe, if you send a 128K block of data, 
it gets split up into eight chunks, with a chunk written to each 
drive.  With ZFS's load share, that 128K block of data only gets 
written to one of the eight drives and no striping takes place.  The 
next write is highly likely to go to a different drive.  Load share 
seems somewhat similar to RAID0 but it is easy to see that it is not 
by looking at the drive LEDs on a drive array while writes are taking 
place.
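A minimal arithmetic illustration of the difference Bob describes, for a
128K write landing on eight drives:

```shell
write=$((128 * 1024))   # one 128K block of data
drives=8
echo "RAID0:          $((write / drives / 1024))K chunk to each of ${drives} drives"
echo "ZFS load-share: $((write / 1024))K to a single drive"
```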


Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-15 Thread Richard Elling
Bob Friesenhahn wrote:
 On Wed, 15 Oct 2008, Tomas Ögren wrote:
 ZFS does not support RAID0 (simple striping).

 zpool create mypool disk1 disk2 disk3

 Sure it does.

 This is load-share, not RAID0.  Also, to answer the other fellow, 
 since ZFS does not support RAID0, it also does not support RAID 1+0 
 (10). :-)

Technically correct.  But beware of operational definitions.

From the SNIA Dictionary, http://www.snia.org/education/dictionary

RAID Level 0
[Storage System] Synonym for data striping.

RAID Level 1
[Storage System] Synonym for mirroring.

RAID Level 10
not defined at SNIA, but generally agreed to be data stripes of
mirrors.

Data Striping
[Storage System] A disk array data mapping technique in which
fixed-length sequences of virtual disk data addresses are mapped to
sequences of member disk addresses in a regular rotating pattern.

Disk striping is commonly called RAID Level 0 or RAID 0 because
of its similarity to common RAID data mapping techniques. It includes
no redundancy, however, so strictly speaking, the appellation RAID
is a misnomer.

mirroring
[Storage System] A configuration of storage in which two or more
identical copies of data are maintained on separate media; also known
as RAID Level 1, disk shadowing, real-time copy, and t1 copy.

ZFS dynamic stripes are not restricted by fixed-length sequences, so
they are not, technically, data stripes by the SNIA definition.

ZFS mirrors do fit the SNIA definition of mirroring, though ZFS does
so by logical address, not a physical block offset.

You will often see people describe ZFS mirroring with multiple top-level
vdevs as RAID-1+0 (or 10), because that is a well-known thing.  But if
you see this in any of the official documentation, then please file a bug.

 With RAID0 and 8 drives in a stripe, if you send a 128K block of data, 
 it gets split up into eight chunks, with a chunk written to each 
 drive.  With ZFS's load share, that 128K block of data only gets 
 written to one of the eight drives and no striping takes place.  The 
 next write is highly likely to go to a different drive.  Load share 
 seems somewhat similar to RAID0 but it is easy to see that it is not 
 by looking at the drive LEDs on an drive array while writes are taking 
 place.

ZFS allocates data in slabs.  By default, the slabs are 1 MByte each.
So a vdev is divided into a collection of slabs and when ZFS fills a
slab, it moves onto another.  With a dynamic stripe, the next slab may
be on a different vdev, depending on how much free space is available.
So you may see many I/Os hitting one disk, just because they happen
to be allocated on the same vdev, perhaps in the same slab.
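A toy model of that allocation policy (not real ZFS code; the free-space numbers are invented) shows why consecutive slabs can land on the same vdev instead of rotating the way RAID 0 chunks do:

```shell
# Toy model of ZFS dynamic striping: each 1 MB slab goes to whichever
# vdev currently has the most free space, so consecutive slabs often
# stay on one vdev. Free-space figures (MB) are invented.
f0=702; f1=400; f2=700         # hypothetical free MB on vdevs 0, 1, 2
for slab in 1 2 3 4 5; do
  best=0; max=$f0
  if [ "$f1" -gt "$max" ]; then best=1; max=$f1; fi
  if [ "$f2" -gt "$max" ]; then best=2; max=$f2; fi
  case $best in
    0) f0=$((f0 - 1));;
    1) f1=$((f1 - 1));;
    2) f2=$((f2 - 1));;
  esac
  echo "slab $slab -> vdev $best"
done
```

With these numbers the first three slabs all go to vdev 0, the fourth to vdev 2, and the fifth back to vdev 0: many I/Os hit one disk simply because their data shares a vdev.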
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hey, all!

We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets over 
IP-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on an x4200 head node. 
In trying to discover optimal ZFS pool construction settings, we've run a 
number of iozone tests, so I thought I'd share them with you and see if you 
have any comments, suggestions, etc.

First, on a single Thumper, we ran baseline tests on the direct-attached 
storage (which is collected into a single ZFS pool comprised of four raidz2 
groups)...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest 
Write: 123919 
Rewrite: 146277 
Read: 383226 
Reread: 383567 
Random Read: 84369 
Random Write: 121617 

[8GB file size, 512KB record size]
Command:  
Write:  373345
Rewrite:  665847
Read:  2261103
Reread:  2175696
Random Read:  2239877
Random Write:  666769

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest 
Write: 517092
Rewrite: 541768 
Read: 682713
Reread: 697875
Random Read: 89362
Random Write: 488944

These results look very nice, though you'll notice that the random read numbers 
tend to be pretty low on the 1GB and 64GB tests (relative to their sequential 
counterparts), but the 8GB random (and sequential) read is unbelievably good.

Now we move to the head node's iSCSI aggregate ZFS pool...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f 
/volumes/data-iscsi/perftest/1gbtest
Write:  127108
Rewrite:  120704
Read:  394073
Reread:  396607
Random Read:  63820
Random Write:  5907

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f 
/volumes/data-iscsi/perftest/8gbtest
Write:  235348
Rewrite:  179740
Read:  577315
Reread:  662253
Random Read:  249853
Random Write:  274589

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f 
/volumes/data-iscsi/perftest/64gbtest 
Write:  190535
Rewrite:  194738
Read:  297605
Reread:  314829
Random Read:  93102
Random Write:  175688

Generally speaking, the results look good, but you'll notice that random writes 
are atrocious on the 1GB tests and random reads are not so great on the 1GB and 
64GB tests, but the 8GB test looks great across the board. Voodoo! ;) 
Incidentally, I ran all these tests against the ZFS pool in disk, raidz1, and 
raidz2 modes - there were no significant changes in the results.

So, how concerned should we be about the low scores here and there? Any 
suggestions on how to improve our configuration? And how excited should we be 
about the 8GB tests? ;) 

Thanks so much for any input you have!
-Gray
---
University of Michigan
Medical School Information Services
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Howdy!

Sounds good. We'll upgrade to 1.1 (b101) as soon as it is released, re-run
our battery of tests, and see where we stand.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:47 PM, James C. McPherson [EMAIL PROTECTED]
 wrote:

 Gray Carper wrote:

 Hello again! (And hellos to Erast, who has been a huge help to me many,
 many times! :)

 As I understand it, Nexenta 1.1 should be released in a matter of weeks
 and it'll be based on build 101. We are waiting for that with bated breath,
 since it includes some very important Active Directory integration fixes,
 but this sounds like another reason to be excited about it. Maybe this is a
 discussion that should be tabled until we are able to upgrade?


 Yup, I think that's probably the best thing. And thanks
 for passing on the info about the 1.1 release, I'll keep
 that in my back pocket :)


 cheers,
 James

 --
 Senior Kernel Software Engineer, Solaris
 Sun Microsystems
 http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hey there, James!

We're actually running NexentaStor v1.0.8, which is based on b85. We haven't
done any tuning ourselves, but I suppose it is possible that Nexenta did. If
there's something specific you have in mind, I'd be happy to look for it.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:10 PM, James C. McPherson [EMAIL PROTECTED]
 wrote:

 Gray Carper wrote:

 Hey, all!

 We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
 targets over IP-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on
 an x4200 head node. In trying to discover optimal ZFS pool construction
 settings, we've run a number of iozone tests, so I thought I'd share them
 with you and see if you have any comments, suggestions, etc.


 [snip]


 Which build are you running? Have you done any system
 or ZFS tuning?


 James C. McPherson
 --
 Senior Kernel Software Engineer, Solaris
 Sun Microsystems
 http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Gray Carper wrote:
 Hey there, James!
 
 We're actually running NexentaStor v1.0.8, which is based on b85. We 
 haven't done any tuning ourselves, but I suppose it is possible that 
 Nexenta did. If there's something specific you'd like me to look for, 
 I'd be happy to.

Hi Gray,
So, build 85... that's getting a bit long in the tooth now.

I know there have been *lots* of ZFS, Marvell SATA and iSCSI
fixes and enhancements since then which went into OpenSolaris.
I know they're in Solaris Express and the updated binary distro
form of os2008.05 - I just don't know whether Erast and the
Nexenta clan have included them in what they are releasing as 1.0.8.

Erast - could you chime in here please? Unfortunately I've got no
idea about Nexenta.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Akhilesh Mritunjai
Just a random spectator here, but I think artifacts you're seeing here are not 
due to file size, but rather due to record size.

What is the ZFS record size ?

On a personal note, I wouldn't do non-concurrent (?) benchmarks. They are at 
best useless and at worst misleading for ZFS.

- Akhilesh.
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Bob Friesenhahn
On Tue, 14 Oct 2008, Gray Carper wrote:

 So, how concerned should we be about the low scores here and there? 
 Any suggestions on how to improve our configuration? And how excited 
 should we be about the 8GB tests? ;)

The level of concern should depend on how you expect your storage pool 
to actually be used.  It seems that it should work great for bulk 
storage, but not to support a database, or ultra high-performance 
super-computing applications.  The good 8GB performance is due to 
successful ZFS ARC caching in RAM, and because the record size is 
reasonable given the ZFS block size and the buffering ability of the 
intermediate links.  You might see somewhat better performance using a 
256K record size.

It may take quite a while to fill 150TB up.
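Bob's ARC point can be checked with back-of-envelope arithmetic. The thread does not state the x4200's memory, so the RAM figure below is an assumption, and the "roughly 3/4 of RAM" ARC ceiling is an approximation for builds of that era:

```shell
# Back-of-envelope ARC check. ram_gb is an ASSUMED value (not stated
# in the thread); the ARC ceiling of ~3/4 of RAM is approximate.
ram_gb=16
arc_gb=$(( ram_gb * 3 / 4 ))   # ~12 GB potentially available for caching
file_gb=8
if [ "$file_gb" -le "$arc_gb" ]; then
  echo "${file_gb}GB working set fits in ARC: reads served from RAM"
else
  echo "${file_gb}GB working set exceeds ARC: reads go to disk"
fi
```

The 1GB test also fits in RAM, but its 1KB record size wastes most of each 128K ZFS block per request, while the 64GB file simply cannot be cached, which is consistent with the observed numbers.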

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Erast Benson
James, all serious ZFS bug fixes have been back-ported to b85, as well as
the Marvell and other SATA drivers. Not everything is possible to back-port,
of course, but I would say all the critical things are there. This includes
the ZFS ARC optimization patches, for example.

On Tue, 2008-10-14 at 22:33 +1000, James C. McPherson wrote:
 Gray Carper wrote:
  Hey there, James!
  
  We're actually running NexentaStor v1.0.8, which is based on b85. We 
  haven't done any tuning ourselves, but I suppose it is possible that 
  Nexenta did. If there's something specific you'd like me to look for, 
  I'd be happy to.
 
 Hi Gray,
 So, build 85... that's getting a bit long in the tooth now.
 
 I know there have been *lots* of ZFS, Marvell SATA and iSCSI
 fixes and enhancements since then which went into OpenSolaris.
 I know they're in Solaris Express and the updated binary distro
 form of os2008.05 - I just don't know whether Erast and the
 Nexenta clan have included them in what they are releasing as 1.0.8.
 
 Erast - could you chime in here please? Unfortunately I've got no
 idea about Nexenta.
 
 
 James C. McPherson
 --
 Senior Kernel Software Engineer, Solaris
 Sun Microsystems
 http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Brent Jones
On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper [EMAIL PROTECTED] wrote:
 Hey, all!

 We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets 
 over IP-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on an x4200 
 head node. In trying to discover optimal ZFS pool construction settings, 
 we've run a number of iozone tests, so I thought I'd share them with you and 
 see if you have any comments, suggestions, etc.

 First, on a single Thumper, we ran baseline tests on the direct-attached 
 storage (which is collected into a single ZFS pool comprised of four raidz2 
 groups)...

 [1GB file size, 1KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
 Write: 123919
 Rewrite: 146277
 Read: 383226
 Reread: 383567
 Random Read: 84369
 Random Write: 121617

 [8GB file size, 512KB record size]
 Command:
 Write:  373345
 Rewrite:  665847
 Read:  2261103
 Reread:  2175696
 Random Read:  2239877
 Random Write:  666769

 [64GB file size, 1MB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
 Write: 517092
 Rewrite: 541768
 Read: 682713
 Reread: 697875
 Random Read: 89362
 Random Write: 488944

 These results look very nice, though you'll notice that the random read 
 numbers tend to be pretty low on the 1GB and 64GB tests (relative to their 
 sequential counterparts), but the 8GB random (and sequential) read is 
 unbelievably good.

 Now we move to the head node's iSCSI aggregate ZFS pool...

 [1GB file size, 1KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f 
 /volumes/data-iscsi/perftest/1gbtest
 Write:  127108
 Rewrite:  120704
 Read:  394073
 Reread:  396607
 Random Read:  63820
 Random Write:  5907

 [8GB file size, 512KB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f 
 /volumes/data-iscsi/perftest/8gbtest
 Write:  235348
 Rewrite:  179740
 Read:  577315
 Reread:  662253
 Random Read:  249853
 Random Write:  274589

 [64GB file size, 1MB record size]
 Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f 
 /volumes/data-iscsi/perftest/64gbtest
 Write:  190535
 Rewrite:  194738
 Read:  297605
 Reread:  314829
 Random Read:  93102
 Random Write:  175688

 Generally speaking, the results look good, but you'll notice that random 
 writes are atrocious on the 1GB tests and random reads are not so great on 
 the 1GB and 64GB tests, but the 8GB test looks great across the board. 
 Voodoo! ;) Incidentally, I ran all these tests against the ZFS pool in disk, 
 raidz1, and raidz2 modes - there were no significant changes in the results.

 So, how concerned should we be about the low scores here and there? Any 
 suggestions on how to improve our configuration? And how excited should we be 
 about the 8GB tests? ;)

 Thanks so much for any input you have!
 -Gray
 ---
 University of Michigan
 Medical School Information Services
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Your setup sounds very interesting, particularly how you export iSCSI to
another head unit. Can you give me some more details on your file system
layout and how you mount it on the head unit?
Sounds like a pretty clever way to export awesomely large volumes!

Regards,

-- 
Brent Jones
[EMAIL PROTECTED]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Gray Carper wrote:
 Hey, all!
 
 We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
 targets over IP-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on
 an x4200 head node. In trying to discover optimal ZFS pool construction
 settings, we've run a number of iozone tests, so I thought I'd share them
 with you and see if you have any comments, suggestions, etc.

[snip]


Which build are you running? Have you done any system
or ZFS tuning?


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Gray Carper wrote:
 Hello again! (And hellos to Erast, who has been a huge help to me many, 
 many times! :)
 
 As I understand it, Nexenta 1.1 should be released in a matter of weeks 
 and it'll be based on build 101. We are waiting for that with bated 
 breath, since it includes some very important Active Directory 
 integration fixes, but this sounds like another reason to be excited 
 about it. Maybe this is a discussion that should be tabled until we are 
 able to upgrade?

Yup, I think that's probably the best thing. And thanks
for passing on the info about the 1.1 release, I'll keep
that in my back pocket :)


cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hello again! (And hellos to Erast, who has been a huge help to me many, many
times! :)

As I understand it, Nexenta 1.1 should be released in a matter of weeks and
it'll be based on build 101. We are waiting for that with bated breath,
since it includes some very important Active Directory integration fixes,
but this sounds like another reason to be excited about it. Maybe this is a
discussion that should be tabled until we are able to upgrade?

-Gray

On Tue, Oct 14, 2008 at 8:33 PM, James C. McPherson [EMAIL PROTECTED]
 wrote:

 Gray Carper wrote:

 Hey there, James!

 We're actually running NexentaStor v1.0.8, which is based on b85. We
 haven't done any tuning ourselves, but I suppose it is possible that Nexenta
 did. If there's something specific you'd like me to look for, I'd be happy
 to.


 Hi Gray,
  So, build 85... that's getting a bit long in the tooth now.

 I know there have been *lots* of ZFS, Marvell SATA and iSCSI
 fixes and enhancements since then which went into OpenSolaris.
 I know they're in Solaris Express and the updated binary distro
 form of os2008.05 - I just don't know whether Erast and the
 Nexenta clan have included them in what they are releasing as 1.0.8.

 Erast - could you chime in here please? Unfortunately I've got no
 idea about Nexenta.



 James C. McPherson
 --
 Senior Kernel Software Engineer, Solaris
 Sun Microsystems
 http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread James C. McPherson
Erast Benson wrote:
 James, all serious ZFS bug fixes have been back-ported to b85, as well as
 the Marvell and other SATA drivers. Not everything is possible to back-port,
 of course, but I would say all the critical things are there. This includes
 the ZFS ARC optimization patches, for example.

Excellent!


James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...

2008-10-14 Thread Gray Carper
Hey there, Bob!

Looks like you and Akhilesh (thanks, Akhilesh!) are driving at a similar,
very valid point. I'm currently using the default recordsize (128K) on all
of the ZFS pools (those of the iSCSI target nodes and the aggregate pool on
the head node).

I should've mentioned something about how the storage will be used in my
original post, so I'm glad you brought it up. It will all be presented over
NFS and CIFS as a 10GbE + InfiniBand NAS which will serve a number of
organizations. Some organizations will simply use their area for end-user
file sharing, others will use it as a disk backup target, others for
databases, and still others for HPC data crunching (gene sequences). Each of
these uses will be on different filesystems, of course, so I expect it would
be good to set different recordsize parameters for each one. Do you have any
suggestions on good starting sizes for each? I'd imagine filesharing might
benefit from a relatively small record size (64K?), image-based backup
targets might like a pretty large record size (256K?), databases just need
recordsizes to match their block sizes, and HPC...I have no idea. Heh. I
expect I'll need to get in contact with the HPC lab to see what kind of
profile they have (whether they deal with tiny files or big files, etc).
What do you think?
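Since recordsize is a per-dataset ZFS property, each filesystem can be tuned independently. A sketch of what that might look like (the dataset names are invented, and the values are starting points rather than recommendations; note the property only affects files written after the change):

```shell
# Hypothetical per-use-case tuning; dataset names are placeholders.
# recordsize affects only newly written files, so set it before loading data.
zfs set recordsize=64K  data-iscsi/fileshare   # general file sharing
zfs set recordsize=128K data-iscsi/backups     # large sequential backup images
zfs set recordsize=16K  data-iscsi/db          # match the database block size
zfs get recordsize data-iscsi/db
```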

Today I'm going to try a few non-ZFS-related tweaks (disabling the Nagle
algorithm on the iSCSI initiator and increasing MTU everywhere to 9000).
I'll give those a shot and see if they yield performance enhancements.
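For the record, those two tweaks would look something like this on Solaris of that vintage; the interface name is a placeholder, and the tunables should be verified against your own build before use:

```shell
# tcp_naglim_def=1 effectively disables the Nagle algorithm globally
# (it applies to all TCP, not just the iSCSI initiator).
ndd -set /dev/tcp tcp_naglim_def 1
# Jumbo frames on the 10GbE port; "nxge0" is a placeholder interface name,
# and every switch and target along the path must also be set to MTU 9000.
ifconfig nxge0 mtu 9000
```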

-Gray

On Tue, Oct 14, 2008 at 10:36 PM, Bob Friesenhahn 
[EMAIL PROTECTED] wrote:

 On Tue, 14 Oct 2008, Gray Carper wrote:


 So, how concerned should we be about the low scores here and there? Any
 suggestions on how to improve our configuration? And how excited should we
  be about the 8GB tests? ;)


 The level of concern should depend on how you expect your storage pool to
 actually be used.  It seems that it should work great for bulk storage, but
 not to support a database, or ultra high-performance super-computing
 applications.  The good 8GB performance is due to successful ZFS ARC caching
 in RAM, and because the record size is reasonable given the ZFS block size
 and the buffering ability of the intermediate links.  You might see somewhat
 better performance using a 256K record size.

 It may take quite a while to fill 150TB up.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/




-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
[EMAIL PROTECTED]  |  skype:  graycarper  |  734.418.8506
http://www.umms.med.umich.edu/msis/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss