Hi Prabhjot,
There are two important things to know about Kafka compression. First,
decompression happens automatically in the consumer
(https://cwiki.apache.org/confluence/display/KAFKA/Compression), so you
should see ASCII returned on the consumer side. The best way I know of
to check whether compression has happened is to look at a packet
capture.

Second, the producer does not compress individual messages. It batches
several sequential messages for the same topic and partition together
and compresses that compound message
(https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Compression).
Thus, a fixed string will still see far better compression ratios than
a 'typical' real-life message.
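
To illustrate the batching point, here is a minimal Scala sketch. It is
not Kafka's actual code path: it uses java.util.zip.GZIPOutputStream as
a stand-in for the producer's codec, and the names
BatchCompressionSketch, gzipSize, individualSize and batchedSize are
made up for this example. It compares compressing each message on its
own with compressing the same messages as one concatenated batch.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Hypothetical helper, not part of Kafka.
object BatchCompressionSketch {
  // Gzip the given chunks as one stream and return the compressed size in bytes.
  def gzipSize(chunks: Seq[Array[Byte]]): Int = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    chunks.foreach(chunk => gzip.write(chunk))
    gzip.close() // finishes the gzip stream and writes the trailer
    buffer.size()
  }

  // Total size when every message is compressed separately.
  def individualSize(messages: Seq[Array[Byte]]): Int =
    messages.map(m => gzipSize(Seq(m))).sum

  // Size when all messages are compressed together, roughly like a compound message.
  def batchedSize(messages: Seq[Array[Byte]]): Int =
    gzipSize(messages)
}

// Example: 200 copies of the same 100-byte message.
val fixed = ("x" * 100).getBytes("US-ASCII")
val batch = Seq.fill(200)(fixed)
println(s"individual: ${BatchCompressionSketch.individualSize(batch)} bytes")
println(s"batched:    ${BatchCompressionSketch.batchedSize(batch)} bytes")
```

With a fixed message the batched size should come out far smaller,
because every message after the first is a repeat the compressor has
already seen.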

Making a realistic message isn't easy, and depends heavily on your
domain. But a general approach would be to generate messages from
randomly selected words from a dictionary. With a dictionary of around a
thousand large words, there is a reasonable chance of the same words
appearing multiple times in the same message. The words can be nonsense
like "asdfasdfasdfasdf", or large words in the language of your choice.
The goal is for each message to be unique, but still have similar chunks
that a compression algorithm can detect and compress.
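
A rough Scala sketch of that idea follows. Everything in it is an
assumption for illustration: the object name DictionaryMessages, the
helpers makeDictionary and generateMessage, and the choice of nonsense
words built from random lowercase letters. It is not code from
ProducerPerformance.scala.

```scala
import scala.util.Random

// Hypothetical helper, not part of Kafka.
object DictionaryMessages {
  // Build `size` nonsense words, each made of `wordLength` random lowercase letters.
  def makeDictionary(size: Int, wordLength: Int, rng: Random): Vector[String] =
    Vector.fill(size)(Seq.fill(wordLength)(('a' + rng.nextInt(26)).toChar).mkString)

  // Append randomly chosen words until the text reaches msgSize characters,
  // then cut the resulting bytes to exactly msgSize.
  def generateMessage(dictionary: Vector[String], msgSize: Int, rng: Random): Array[Byte] = {
    require(dictionary.nonEmpty, "dictionary must not be empty")
    val sb = new StringBuilder
    while (sb.length < msgSize) {
      sb.append(dictionary(rng.nextInt(dictionary.length))).append(' ')
    }
    sb.toString.getBytes("US-ASCII").take(msgSize)
  }
}

// Example: a 1000-word dictionary and one 500-byte message.
val rng = new Random()
val dictionary = DictionaryMessages.makeDictionary(1000, 12, rng)
val message = DictionaryMessages.generateMessage(dictionary, 500, rng)
println(new String(message, "US-ASCII"))
```

Build the dictionary once and reuse it for every message, so that words
repeat across messages in a batch as well as within a single message.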

-Erik
  

On 8/25/15, 6:47 AM, "Prabhjot Bharaj" <prabhbha...@gmail.com> wrote:

>Hi,
>
>I have been trying to use kafka-producer-perf-test.sh to arrive at certain
>benchmarks.
>When I try to run it with --compression-codec values of 1, 2 and 3, I
>notice increased throughput compared to NoCompressionCodec
>
>But when I checked ProducerPerformance.scala, I saw that
>`producer.send` is getting data from the method `generateProducerData`.
>But, this data is just an empty array of Bytes.
>
>Now, as per my basic understanding of compression algorithms, I think a
>byte sequence of zeros will eventually result in a very small message,
>because of which I thought I might be observing better throughput.
>
>So, in line: 247 of ProducerPerformance.scala, I did this minor code
>change:-
>
>
>
>*val message = 
>"qopwr11591UPD113582260001AS1IL1-1N/A1Entertainment1-1an-example.com1-1-1-
>1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,MED,HIGH,HD,.mp4.csm
>il/bitrate=11subcategory
>71Title 
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-r1VR-11591UPD113582260001AS1IL1-1N/A1Entertainment1-1a
>n-example.com1-1-1-1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,M
>ED,HIGH,HD,.mp4.csmil/bitrate=11subcategory
>71Title 
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-r1VR-11591UPD113582260001AS1IL1-1N/A1Entertainment1-1a
>n-example.com1-1-1-1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,M
>ED,HIGH,HD,.mp4.csmil/bitrate=11subcategory
>71Title 
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-"
>message.getBytes().slice(0,msgSize)*
>
>
>This makes sure that I have a big message, and I can slice that
>message to the message size passed in the command line options
>
>
>But the problem is that when I try running the same with
>--compression-codec values of 1, 2 or 3, I am still seeing ASCII data
>(i.e. uncompressed data only)
>
>
>I want to ask whether this is a bug. And, using
>kafka-producer-perf-test.sh, how can I send my own compressed data ?
>
>
>Thanks,
>
>Prabhjot
