> -----Original Message-----
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, September 23, 2015 1:43 PM
> To: Gregory Farnum; Sage Weil
> Cc: Deneau, Tom; ceph-devel@vger.kernel.org
> Subject: Re: perf counters from a performance discrepancy
> 
> 
> 
> On 09/23/2015 01:25 PM, Gregory Farnum wrote:
> > On Wed, Sep 23, 2015 at 11:19 AM, Sage Weil <s...@newdream.net> wrote:
> >> On Wed, 23 Sep 2015, Deneau, Tom wrote:
> >>> Hi all --
> >>>
> >>> Looking for guidance with perf counters...
> >>> I am trying to see whether the perf counters can tell me anything
> >>> about the following discrepancy:
> >>>
> >>> I populate a number of 40k size objects in each of two pools, poolA and poolB.
> >>> Both pools cover osds on a single node, 5 osds total.
> >>>
> >>>     * Config 1 (1p):
> >>>        * use a single rados bench client with 32 threads to do a seq read
> >>>          of 20000 objects from poolA.
> >>>
> >>>     * Config 2 (2p):
> >>>        * use two concurrent rados bench clients (running on the same
> >>>          client node) with 16 threads each,
> >>>             one reading 10000 objects from poolA,
> >>>             one reading 10000 objects from poolB.
> >>>
> >>> So in both configs, we have 32 threads total and the number of objects read is the same.
> >>> Note: in all cases, we drop the caches before doing the seq reads.
> >>>
> >>> The combined bandwidth (MB/sec) for the 2 clients in config 2 is
> >>> about 1/3 of the bandwidth for the single client in config 1.
> >>
> >> How were the objects written?  I assume the cluster is backed by
> >> spinning disks?
> >>
> >> I wonder if this is a disk layout issue.  If the 20,000 objects are
> >> written in order, they will be roughly sequential on disk, and the 32
> >> thread case will read them in order.  In the 2x 10,000 case, the two
> >> clients are reading two sequences of objects written at different
> >> times, and the disk arms will be swinging around more.
> >>
> >> My guess is that if the reads were reading the objects in a random
> >> order the performance would be the same... I'm not sure that rados
> >> bench does that though?
> >>
> >> sage
> >>
> >>>
> >>>
> >>> I gathered perf counters before and after each run and looked at the
> >>> difference between the before and after counters for both the 1p and 2p
> >>> cases.  Here are some things I noticed that differ between the two runs.
> >>> Can someone take a look and let me know whether any of these differences
> >>> are significant, in particular the throttle-msgr_dispatch_throttler ones,
> >>> since I don't know the detailed definitions of those fields.
> >>> Note: these are the numbers for one of the 5 osds; the other osds are
> >>> similar...
> >>>
> >>> * The field osd/loadavg is always about 3 times higher in the 2p case.
> >>>
> >>> some latency-related counters
> >>> ------------------------------
> >>> osd/op_latency/sum 1p=6.24801117205061, 2p=579.722513078945
> >>> osd/op_process_latency/sum 1p=3.48506945394911, 2p=42.6278494549915
> >>> osd/op_r_latency/sum 1p=6.2480111719924, 2p=579.722513079003
> >>> osd/op_r_process_latency/sum 1p=3.48506945399276, 2p=42.6278494550061
> >
> > So, yep, the individual read ops are taking much longer in the
> > two-client case. Naively that's pretty odd.
> >
> >>>
> >>>
> >>> and some throttle-msgr_dispatch_throttler related counters
> >>> ----------------------------------------------------------
> >>> throttle-msgr_dispatch_throttler-client/get 1p=1337, 2p=1339, diff=2
> >>> throttle-msgr_dispatch_throttler-client/get_sum 1p=222877, 2p=223088, diff=211
> >>> throttle-msgr_dispatch_throttler-client/put 1p=1337, 2p=1339, diff=2
> >>> throttle-msgr_dispatch_throttler-client/put_sum 1p=222877, 2p=223088, diff=211
> >>> throttle-msgr_dispatch_throttler-hb_back_server/get 1p=58, 2p=134, diff=76
> >>> throttle-msgr_dispatch_throttler-hb_back_server/get_sum 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_back_server/put 1p=58, 2p=134, diff=76
> >>> throttle-msgr_dispatch_throttler-hb_back_server/put_sum 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_front_server/get 1p=58, 2p=134, diff=76
> >>> throttle-msgr_dispatch_throttler-hb_front_server/get_sum 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hb_front_server/put 1p=58, 2p=134, diff=76
> >>> throttle-msgr_dispatch_throttler-hb_front_server/put_sum 1p=2726, 2p=6298, diff=3572
> >>> throttle-msgr_dispatch_throttler-hbclient/get 1p=168, 2p=252, diff=84
> >>> throttle-msgr_dispatch_throttler-hbclient/get_sum 1p=7896, 2p=11844, diff=3948
> >>> throttle-msgr_dispatch_throttler-hbclient/put 1p=168, 2p=252, diff=84
> >>> throttle-msgr_dispatch_throttler-hbclient/put_sum 1p=7896, 2p=11844, diff=3948
> >
> > IIRC these are just saying how many times the dispatch throttler was
> > accessed on each messenger — nothing here is surprising, you're doing
> > basically the same number of messages on the client messengers, and
> > the heartbeat messengers are passing more because the test takes
> > longer.
> >
> > I'd go with Sage's idea for what is actually causing this, or try and
> > look at how the latency changes over time — if you're going to two
> > pools instead of one, presumably you're doubling the amount of
> > metadata that needs to be read into memory during the run? Perhaps
> > that's just a significant enough effect with your settings that you're
> > seeing a bunch of extra directory lookups impact your throughput more
> > than expected... :/
> 
> FWIW, when I've seen an effect it's typically been the opposite: multiple
> rados bench processes are slightly faster (maybe simply related to the
> client-side implementation).  Running collectl or iostat would show
> various interval statistics that would help diagnose whether this is
> related to slower accesses on the disks.  blktrace of course would give a
> more nuanced view.  Might be worth doing if there are extra metadata
> accesses.
> 
> A couple of other random thoughts:
> 
> - Are any of the reads coming from buffer cache?
> - Readahead not working well?
> - Were the pools recreated between tests?
> - If not, what were the total number of objects and PGs for each pool?
> (Is it possible that the per-PG directory hierarchy was deeper for the 2nd
> set of tests?)
> - If the pools still exist, what does the following script say about them?
> 
> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
> 
> 

Thanks Sage, Greg and Mark for the suggestions --

The objects in the pools were created by running two simultaneous
client rados bench write processes, one to each pool.  I started with
empty osd disks.  The pools both cover the same osds, and since there
was a replication factor of 2, I would have expected that not all the
primary objects in each pool would be sequential on the disk.
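
For reference, a minimal sketch of how that population step can be driven
(the pool names, object size, thread count, and duration here are
placeholders, not the exact values from my runs):

    # Two concurrent "rados bench write" clients, one per pool.
    # --no-cleanup keeps the objects around for the later seq/rand reads.
    import subprocess

    def bench_write(pool, threads=16, obj_size=40960, secs=600):
        return subprocess.Popen([
            "rados", "-p", pool, "bench", str(secs), "write",
            "-t", str(threads), "-b", str(obj_size), "--no-cleanup"])

    writers = [bench_write("poolA"), bench_write("poolB")]
    for w in writers:
        w.wait()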

But when I changed the read test to do rados bench rand rather than
seq, both the 1p and 2p configs went down in bandwidth, so clearly the
disk layout had something to do with it.  With random reads, 1p is now
"only" 40% faster than 2p.

Mark had mentioned that he's often seen the opposite effect, i.e., more
client processes increasing the bandwidth, and we have seen the same,
which is why I started using multiple clients.  In fact, if I do the
same configuration tests as above but with 4M objects rather than 40K,
I see 2p beat 1p by 30% or so (and I've seen 4p help even more).

So there's something about the smaller object size that gets hurt by
multiple clients vs. a single client.

I will try to look into this with some of the low-level stat tools
that Mark mentioned.
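
Probably something along these lines to start with -- a sketch that wraps
a run with iostat so the per-device interval stats (await, %util, request
sizes) can be compared between the 1p and 2p cases; the interval, log
path, and flag choice are just one reasonable option:

    # Capture extended per-device stats for the duration of a run.
    import subprocess

    def start_iostat(logfile="iostat.log", interval=1):
        # -d: per-device report, -x: extended stats, -t: timestamp samples
        out = open(logfile, "w")
        return subprocess.Popen(["iostat", "-dxt", str(interval)], stdout=out)

    mon = start_iostat()
    # ... run the rados bench clients here ...
    mon.terminate()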

-- Tom

> > -Greg
> >
