Well, my IOwait sits around 0. Much less than 1% all the time. Any IOwait
at all indicates that the application is waiting on the disk, so that’s
where your bottleneck will be.
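
If you want to watch it programmatically instead of eyeballing sar, a minimal
sketch along these lines works (assumes Linux and the psutil package; the
1-second sample window and the 1% threshold are just placeholders):

    import psutil

    # Sample CPU time percentages over a 1-second window. The iowait
    # field is only populated on Linux.
    cpu = psutil.cpu_times_percent(interval=1)

    # Anything consistently above ~1% iowait means the broker is waiting
    # on disk and is worth digging into.
    if cpu.iowait > 1.0:
        print("iowait %.1f%% - waiting on disk" % cpu.iowait)
    else:
        print("iowait %.1f%% - disks keeping up" % cpu.iowait)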

As far as open files, it depends on how many log segments and network
connections are open. Because I have large clusters with thousands of
partitions per broker, and tons of network connections from clients, I have
the FD limit set to 400k. Basically, you don’t want to run out, so you want
enough buffer to catch a problem (like a bug with socket file descriptors
not getting released properly).
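
To see how close a broker actually is to its limit, something like this gives
a quick read (a rough sketch; the psutil calls are Linux-only, and the broker
PID here is a placeholder you'd need to fill in):

    import resource
    import psutil

    BROKER_PID = 12345  # placeholder - substitute your Kafka broker's PID

    proc = psutil.Process(BROKER_PID)
    open_fds = proc.num_fds()                         # FDs the broker holds right now
    soft, hard = proc.rlimit(resource.RLIMIT_NOFILE)  # the broker's nofile limits

    print("broker: %d open FDs of %d allowed (%.1f%% used)"
          % (open_fds, soft, 100.0 * open_fds / soft))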

-Todd


On Tue, Feb 21, 2017 at 12:52 PM, Jon Yeargers <jon.yearg...@cedexis.com>
wrote:

> Thanks for looking at this issue. I checked the max IOPS for this disk and
> we're only at about 10%. I can add more disks to spread out the work.
>
> What IOWait values should I be aiming for?
>
> Also - what do you set open files to? I have it at 65535, but I just read a
> doc that suggested >100K is better.
>
>
> On Tue, Feb 21, 2017 at 10:45 AM, Todd Palino <tpal...@gmail.com> wrote:
>
> > So I think the important thing to look at here is the IO wait on your
> > system. You’re hitting disk throughput issues, and that’s what you most
> > likely need to resolve. So just from what you’ve described, I think the
> > only thing that is going to get you more performance is more spindles (or
> > faster spindles). This is either more disks or more brokers, but at the
> > end of it you need to eliminate the disk IO bottleneck.
> >
> > -Todd
> >
> >
> > On Tue, Feb 21, 2017 at 7:29 AM, Jon Yeargers <jon.yearg...@cedexis.com>
> > wrote:
> >
> > > Running 3x 8-core brokers on Google Compute.
> > >
> > > Topic has 16 partitions (replication factor 2) and is consumed by 16
> > > docker containers on individual hosts.
> > >
> > > System seems to max out at around 40000 messages / minute. Each message
> > > is ~12K - compressed (snappy) JSON.
> > >
> > > Recently moved from 12 to the above 16 partitions with no change in
> > > throughput.
> > >
> > > Also tried increasing the consumption capacity on each container by 50%.
> > > No effect.
> > >
> > > Network is running at ~6Gb/sec (measured using iperf3). Broker load is
> > > ~1.5. IOWait % is 5-10 (via sar).
> > >
> > > What are my options for adding throughput?
> > >
> > > - more brokers?
> > > - avro/protobuf messaging?
> > > - more disks / broker? (1 / host presently)
> > > - jumbo frames?
> > >
> > > (transparent huge pages is disabled)
> > >
> > >
> > > Looking at this article (
> > > https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines)
> > > it would appear that for our message size we are at the max. This would
> > > argue that we need to shrink the message size - so perhaps switching to
> > > avro is the next step?
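> > >
> > > (As a rough sanity check on those numbers, assuming ~12K is the
> > > compressed on-wire size: 40000 msgs/min * 12KB is about 8 MB/sec into
> > > the topic, or ~16 MB/sec of writes across the 3 brokers once replication
> > > factor 2 is counted - roughly 5 MB/sec per broker.)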
> > >
> >
> >
> >
> > --
> > *Todd Palino*
> > Staff Site Reliability Engineer
> > Data Infrastructure Streaming
> >
> >
> >
> > linkedin.com/in/toddpalino
> >
>



-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino
