Oh, and I meant to say: zstd is a good compromise between CPU cost and compression
ratio; IIRC it was far less costly on CPU than gzip.
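If you want a quick local sanity check of the ratio before touching any configs, something like this works — note zstd isn't in the Python standard library, so stdlib gzip stands in here purely to show the measurement approach (the third-party "zstandard" package exposes a similar compress() call), and the payload below is made up:

```python
import gzip

# Made-up payload: a record with redundant string fields, roughly the shape
# being discussed in this thread.
record = b'{"user_id":"abc-123","session_id":"abc-123","region":"us-east-1"}'
batch = record * 100  # a batch of similar records compresses far better than one

compressed = gzip.compress(batch)
print(f"original={len(batch)}B compressed={len(compressed)}B "
      f"ratio={len(compressed) / len(batch):.2%}")
```

Repetitive string fields like these typically compress to a small fraction of their original size, which lines up with the ~75% reduction mentioned below.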

So yeah, I generally recommend setting your topic's compression to
"producer", and then going from there.
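For reference, a sketch of the two knobs involved (topic name, bootstrap address, and codec choice are placeholders — adjust for your setup):

```
# Producer side: pick the codec in the producer config, e.g.
#   compression.type=zstd
#
# Topic side: "producer" tells the broker to retain whatever codec the
# producer used, avoiding broker-side recompression:
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config compression.type=producer
```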

On Wed, 16 Mar 2022 at 11:49, Liam Clarke-Hutchinson <lclar...@redhat.com>
wrote:

> Sounds like a goer then :) Those strings in the protobuf always get ya,
> can't use clever encodings for them like you can with numbers.
>
> On Wed, 16 Mar 2022 at 11:29, Dan Hill <quietgol...@gmail.com> wrote:
>
>> We're using protos but there are still a bunch of custom fields where
>> clients specify redundant strings.
>>
>> My local test is showing 75% reduction in size if I use zstd or gzip.  I
>> care the most about Kafka storage costs right now.
>>
>> On Tue, Mar 15, 2022 at 2:25 PM Liam Clarke-Hutchinson <
>> lclar...@redhat.com>
>> wrote:
>>
>> > Hi Dan,
>> >
>> > Okay, so if you're looking for low latency, I'm guessing that you're
>> using
>> > a very low linger.ms in the producers? Also, what format are the
>> records?
>> > If they're already in a binary format like Protobuf or Avro, unless
>> they're
>> > composed largely of strings, compression may offer little benefit.
>> >
>> > With your small records, I'd suggest running some tests with your
>> current
>> > config with different compression settings - none, snappy, lz4, (don't
>> > bother with gzip unless that's all you have) and checking producer
>> metrics
>> > (available via JMX if you're using the Java clients) for batch-size-avg
>> and
>> > compression-rate-avg.
>> >
>> > You may just wish to start with no compression, and then consider
>> moving to
>> > it if/when network bandwidth becomes a bottleneck.
>> >
>> > Regards,
>> >
>> > Liam
>> >
>> > On Tue, 15 Mar 2022 at 17:05, Dan Hill <quietgol...@gmail.com> wrote:
>> >
>> > > Thanks, Liam!
>> > >
>> > > I have a mixture of Kafka record sizes.  10% are large (>100 KB) and
>> 90%
>> > of
>> > > the records are smaller than 1kb.  I'm working on a streaming
>> analytics
>> > > solution that streams impressions, user actions and serving info and
>> > > combines them together.  End-to-end latency is more important than
>> > storage
>> > > size.
>> > >
>> > >
>> > > On Mon, Mar 14, 2022 at 3:27 PM Liam Clarke-Hutchinson <
>> > > lclar...@redhat.com>
>> > > wrote:
>> > >
>> > > > Hi Dan,
>> > > >
>> > > > Decompression generally only happens in the broker if the topic has
>> a
>> > > > particular compression algorithm set, and the producer is using a
>> > > different
>> > > > one - then the broker will decompress records from the producer,
>> then
>> > > > recompress them using the topic's configured algorithm. (The
>> LogCleaner
>> > > will
>> > > > also decompress then recompress records when compacting compressed
>> > > topics).
>> > > >
>> > > > The consumer decompresses compressed record batches it receives.
>> > > >
>> > > > In my opinion, using topic compression instead of producer
>> compression
>> > > > would only make sense if the overhead of a few more CPU cycles
>> > > compression
>> > > > uses was not tolerable for the producing app. In all of my use
>> cases,
>> > > > network throughput becomes a bottleneck long before producer
>> > compression
>> > > > CPU cost does.
>> > > >
>> > > > For your "if X, do Y" formulation I'd say - if your producer is
>> sending
>> > > > tiny batches, do some analysis of compressed vs. uncompressed size
>> for
>> > > your
>> > > > given compression algorithm - you may find that compression overhead
>> > > > increases batch size for tiny batches.
>> > > >
>> > > > If you're sending a large amount of data, do tune your batching and
>> use
>> > > > compression to reduce data being sent over the wire.
>> > > >
>> > > > If you can tell us more about your problem domain, there might
>> be
>> > > more
>> > > > advice that's applicable :)
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Liam Clarke-Hutchinson
>> > > >
>> > > > On Tue, 15 Mar 2022 at 10:05, Dan Hill <quietgol...@gmail.com>
>> wrote:
>> > > >
>> > > > > Hi.  I looked around for advice about Kafka compression.  I've
>> seen
>> > > mixed
>> > > > > and conflicting advice.
>> > > > >
>> > > > > Is there any sorta "if X, do Y" type of documentation around Kafka
>> > > > > compression?
>> > > > >
>> > > > > Any advice?  Any good posts to read that talk about this trade-off?
>> > > > >
>> > > > > *Detailed comments*
>> > > > > I tried looking for producer vs topic compression.  I didn't find
>> > much.
>> > > > > Some of the information I see is back from 2011 (which I'm
>> guessing
>> > is
>> > > > > pretty stale).
>> > > > >
>> > > > > I can guess some potential benefits but I don't know if they are
>> > > actually
>> > > > > real.  I've also seen some sites claim certain trade-offs but it's
>> > > > unclear
>> > > > > if they're true.
>> > > > >
>> > > > > It looks like I can modify an existing topic's compression.  I
>> don't
>> > > know
>> > > > > if that actually works.  I'd assume it'd just impact data going
>> > > forward.
>> > > > >
>> > > > > I've seen multiple sites say that decompression happens in the
>> broker
>> > > and
>> > > > > multiple that say it happens in the consumer.
>> > > > >
>> > > >
>> > >
>> >
>>
>
