Prabhjot,
When no compression is being used, it should have only a tiny impact on 
performance.  But when compression is enabled, it makes the message payload 
effectively small and nearly constant, regardless of how large the configured 
message size is.
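To make that concrete (an illustrative sketch, not code from the perf test): 
deflating an all-zeros buffer yields output whose size barely grows with the 
input, which is why the measured throughput looks inflated.

```java
import java.util.zip.Deflater;

public class ZeroPayloadDemo {
    // Compress a buffer with DEFLATE and return the compressed byte count.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // An all-zeros payload compresses to a tiny, nearly constant size,
        // no matter how large the configured message size is.
        System.out.println(compressedSize(new byte[1024]));
        System.out.println(compressedSize(new byte[100 * 1024]));
    }
}
```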

I think the answer is that there is room for improvement in the perf test, 
especially where compression is concerned.  If you do implement an improvement, 
a patch would be helpful to the community.  But something to consider is that 
throughput alone isn't the only important performance measure.   Round-trip 
latency is also important.
Thanks,
-Erik


From: Prabhjot Bharaj <prabhbha...@gmail.com>
Date: Tuesday, August 25, 2015 at 8:41 AM
To: Erik Helleren <erik.helle...@cmegroup.com>
Cc: "us...@kafka.apache.org" <us...@kafka.apache.org>, 
"dev@kafka.apache.org" <dev@kafka.apache.org>
Subject: Re: kafka producer-perf-test.sh compression-codec not working

Hi Erik,

I have put my efforts into the producer side till now. Thanks for making me 
aware that the consumer will decompress automatically.

I'll also consider your point on creating real-life messages.

But I still have one point of confusion:

Why would the current ProducerPerformance.scala compress an array of bytes that 
is all zeros?
That will give artificially better throughput, correct?

Regards,
Prabhjot

On Tue, Aug 25, 2015 at 7:05 PM, Helleren, Erik 
<erik.helle...@cmegroup.com> wrote:
Hi Prabhjot,
There are two important things to know about Kafka compression:  First,
decompression happens automatically in the consumer
(https://cwiki.apache.org/confluence/display/KAFKA/Compression), so you
should see ASCII returned on the consumer side. The best way I know of to
see whether compression has happened is to actually look at a packet
capture.

Second, the producer does not compress individual messages; it actually
batches several sequential messages to the same topic and partition
together and compresses that compound message.
(https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Compression)
Thus, a fixed string will still see far better compression ratios than a
'typical' real-life message.

Making a real-life-like message isn't easy, and depends heavily on your
domain. But a general approach would be to generate messages from randomly
selected words in a dictionary.  Having a dictionary of around a thousand
large words means there is a reasonable chance of the same word appearing
multiple times in the same message.  The words can be nonsense like
"asdfasdfasdfasdf", or large words in the language of your choice.  The
goal is for each message to be unique, but still contain similar chunks
that a compression algorithm can detect and compress.
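A minimal sketch of that approach (the word list below is made up for
illustration; a real perf test would load roughly a thousand large words
from a file):

```java
import java.util.Random;

public class DictionaryMessageGenerator {
    // Hypothetical word list; a real test would load ~1000 large words.
    private static final String[] DICTIONARY = {
        "entertainment", "subcategory", "throughput", "partition",
        "compression", "benchmark", "asdfasdfasdfasdf", "bitrate"
    };

    private static final Random RANDOM = new Random();

    // Build a message of exactly msgSize characters from randomly chosen
    // words. Each message is unique-ish, but repeated words give the
    // compressor similar chunks to find, unlike an all-zeros payload.
    static String generate(int msgSize) {
        StringBuilder sb = new StringBuilder(msgSize + 16);
        while (sb.length() < msgSize) {
            sb.append(DICTIONARY[RANDOM.nextInt(DICTIONARY.length)])
              .append(' ');
        }
        return sb.substring(0, msgSize);
    }

    public static void main(String[] args) {
        System.out.println(generate(80));
    }
}
```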

-Erik


On 8/25/15, 6:47 AM, "Prabhjot Bharaj" 
<prabhbha...@gmail.com> wrote:

>Hi,
>
>I have been trying to use kafka-producer-perf-test.sh to arrive at certain
>benchmarks.
>When I try to run it with --compression-codec values of 1, 2 and 3, I
>notice increased throughput compared to NoCompressionCodec
>
>But, when I checked ProducerPerformance.scala, I saw that
>`producer.send` is getting its data from the method `generateProducerData`,
>and this data is just an empty array of bytes.
>
>Now, as per my basic understanding of compression algorithms, I think a
>byte sequence of zeros will eventually result in a very small message,
>because of which I thought I might be observing better throughput.
>
>So, at line 247 of ProducerPerformance.scala, I made this minor code
>change:
>
>
>
>*val message =
>"qopwr11591UPD113582260001AS1IL1-1N/A1Entertainment1-1an-example.com1-1-1-
>1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,MED,HIGH,HD,.mp4.csm
>il/bitrate=11subcategory
>71Title
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-r1VR-11591UPD113582260001AS1IL1-1N/A1Entertainment1-1a
>n-example.com1-1-1-1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,M
>ED,HIGH,HD,.mp4.csmil/bitrate=11subcategory
>71Title
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-r1VR-11591UPD113582260001AS1IL1-1N/A1Entertainment1-1a
>n-example.com1-1-1-1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,M
>ED,HIGH,HD,.mp4.csmil/bitrate=11subcategory
>71Title
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-"
>message.getBytes().slice(0, msgSize)*
>
>
>This makes sure that I have a big message, and I can slice it down to the
>message size passed in the command-line options
>
>
>But, the problem is that when I try running the same with
>--compression-codec values of 1, 2 or 3, I still see ASCII data
>(i.e. only uncompressed data)
>
>
>I want to ask whether this is a bug. And, using
>kafka-producer-perf-test.sh, how can I send my own compressed data?
>
>
>Thanks,
>
>Prabhjot




--
---------------------------------------------------------
"There are only 10 types of people in the world: Those who understand binary, 
and those who don't"
