On Sun, Apr 19, 2009 at 10:58 AM, Gary Mills <mi...@cc.umanitoba.ca> wrote:
> On Sat, Apr 18, 2009 at 11:45:54PM -0500, Mike Gerdts wrote:
>> Also, you may want to consider doing backups from the NetApp rather
>> than from the Solaris box.
>
> I've certainly recommended finding a different way to perform backups.
>
>> Assuming all of your LUNs are in the same
>> volume on the filer, a snapshot should be a crash-consistent image of
>> the zpool.  You could verify this by making the snapshot rw and trying
>> to import the snapshotted LUNs on another host.
>
> That part sounds scary!  The filer exports four LUNs that are combined
> into one ZFS pool on the IMAP server.  These LUNs are volumes on the
> filer.  How can we safely import them on another host?

This is just like operating on ZFS clones - operations on the clones
do not change the contents of the original.  Again, you are presenting
the snapshots to another host, not the original LUNs.  It is a bit
scary only because you will have to do "zpool import -f".  If you
were to present the real LUNs rather than the rw snapshots to your
test host, you would almost certainly corrupt the active copy.  Done
correctly, there should be no danger.  Proving out the process on
something other than your important data is highly recommended.
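
A rough sketch of the procedure, assuming a 7-mode filer and made-up
volume, LUN, igroup, and pool names (the filer commands are from
memory, so check them against your ONTAP docs):

    filer> snap create mailvol verify_snap
    filer> lun clone create /vol/mailvol/lun0.verify -b /vol/mailvol/lun0 verify_snap
    filer> lun map /vol/mailvol/lun0.verify testhost_igroup

Repeat the clone and map for each of the four LUNs, then on the test
host (never the production host), once the iSCSI devices are visible:

    # devfsadm -i iscsi
    # zpool import                  <- should list the pool found on the clones
    # zpool import -f -R /mnt/verify poolname

The -f is needed because the pool was never exported; -R gives it an
alternate root so nothing gets mounted over the live filesystems.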

In any case, this is probably something to think about outside of the
scope of this performance issue.

>> Since iSCSI is in the mix, you should also be sure that your network
>> is appropriately tuned.  Assuming that you are using the onboard
>> e1000g NICs, be sure that none of the "bad" counters are incrementing:
>>
>> $ kstat -p e1000g | nawk '$0 ~ /err|drop|fail|no/ && $NF != 0'
>>
>> If this gives any output, there is likely something amiss with your network.
>
> Only this:
>    e1000g:0:e1000g0:unknowns       1764449

I first saw this statistic a few weeks back and I'm not sure how
important it is.  A cluestick would be most appreciated.

>
> I don't know what those are, but it's e1000g1 and e1000g2 that are
> used for the Iscsi network.
>
>> The output from "iostat -xCn 10" could be interesting as well.  If
>> asvc_t is high (>30?), it means the filer is being slow to respond.
>> If wsvc_t is frequently non-zero, there is some sort of a bottleneck
>> that prevents the server from sending requests to the filer.  Perhaps
>> you have tuned ssd_max_throttle or Solaris has backed off because the
>> filer said to slow down.  (Assuming that ssd is used with iSCSI LUNs).
>
> Here's an example, taken from one of the busy periods:
>
>                    extended device statistics
>    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>    0.0    5.0    0.0    7.7  0.0  0.1    4.1   24.8   1   1 c1t2d0
>   27.0   13.8 1523.4  172.9  0.0  0.5    0.0   11.8   0  38 c4t60A98000433469764E4A2D456A644A74d0
>   42.0   21.4 2027.3  350.0  0.0  0.9    0.0   13.9   0  60 c4t60A98000433469764E4A2D456A696579d0
>   40.8   25.0 1993.5  339.1  0.0  0.8    0.0   11.8   0  52 c4t60A98000433469764E4A476D2F664E4Fd0
>   42.0   26.6 1968.4  319.1  0.0  0.8    0.0   11.8   0  56 c4t60A98000433469764E4A476D2F6B385Ad0

Surely this has been investigated already, but just in case...

I'm not sure how long that interval was.  If it wasn't extremely
short, it looks like you could be bumping up against the throughput
constraints of a 100 Mbit connection.  Have you verified that
everything is running at 1000 Mbit/s, full duplex?
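
Rough math on that sample, just summing the kr/s and kw/s columns:

    reads:  1523.4 + 2027.3 + 1993.5 + 1968.4 ~ 7513 KB/s
    writes:  172.9 +  350.0 +  339.1 +  319.1 ~ 1181 KB/s
    total:                                     ~ 8.7 MB/s, or about 70 Mbit/s

That is most of what a 100 Mbit link can carry, but well under a
tenth of gigabit.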

In a hardware and OS configuration similar to yours I can drive 10x
the throughput you are seeing - and I am certain that all of my links
are 1000 full.
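
For what it's worth, the quickest checks I know of on the Solaris
side (treat the exact kstat names as approximate - they vary a bit
between drivers):

    $ dladm show-dev
    $ kstat -p e1000g | egrep 'link_speed|link_duplex'

You want to see 1000/full (link_speed 1000, link_duplex 2, if memory
serves) on every interface carrying iSCSI traffic, and it is worth
having the switch and filer ports confirmed as well.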

>
> The service times seem okay to me.  There's no `throttle' setting in
> any of the relevant driver conf files.
>
>> What else is happening on the filer when mail gets slow?  That is, are
>> you experiencing slowness due to a mail peak or due to some research
>> project that happens to be on the same spindles?  What does the
>> network look like from the NetApp side?
>
> Our Netapp guy tells me that the filer is operating normally when the
> problem occurs.  The Iscsi network is less than 10% utilized.

If something is actually running at 100 Mbit, that "less than 10%
utilized" figure would really be closer to "less than 100% utilized."
But... given how hard you have looked at this, I doubt something that
simple has been overlooked.

Is the ipfilter service running?  If so, does it need to be?  If it
does, is your first rule one that starts with "pass in quick", so
that iSCSI packets are matched against as few rules as possible?
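
To check, something like this (interface names are just the ones from
your earlier mail):

    # svcs ipfilter
    # ipfstat -io

and, if ipfilter has to stay enabled, rules along these lines near
the top of ipf.conf should keep the iSCSI interfaces out of the rest
of the rule set:

    pass in  quick on e1000g1 all
    pass out quick on e1000g1 all
    pass in  quick on e1000g2 all
    pass out quick on e1000g2 all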

-- 
Mike Gerdts
http://mgerdts.blogspot.com/