Hi Prabhjot,

There are two important things to know about Kafka compression:

First, decompression happens automatically in the consumer (https://cwiki.apache.org/confluence/display/KAFKA/Compression), so you should see ASCII returned on the consumer side. The best way I know of to check whether compression actually happened is to look at a packet capture.

Second, the producer does not compress individual messages. It batches several sequential messages to the same topic and partition together and compresses that compound message (https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Compression). Thus, a fixed string will still see far better compression ratios than a 'typical' real-life message.

Making a real-life-like message isn't easy, and depends heavily on your domain. But a general approach would be to generate messages from words selected at random from a dictionary. With a dictionary of around a thousand long words, there is a reasonable chance of the same words appearing multiple times in the same message. The words can be nonsense like "asdfasdfasdfasdf", or long words in the language of your choice. The goal is for each message to be unique, but still have similar chunks that a compression algorithm can detect and compress.

-Erik

On 8/25/15, 6:47 AM, "Prabhjot Bharaj" <prabhbha...@gmail.com> wrote:

>Hi,
>
>I have been trying to use kafka-producer-perf-test.sh to arrive at certain
>benchmarks.
>When I run it with --compression-codec values of 1, 2 and 3, I
>notice increased throughput compared to NoCompressionCodec.
>
>But when I checked ProducerPerformance.scala, I saw that
>`producer.send` is getting its data from the method `generateProducerData`.
>But this data is just an empty array of bytes.
>
>Now, as per my basic understanding of compression algorithms, I think a
>byte sequence of zeros will eventually result in a very small message,
>because of which I thought I might be observing better throughput.
>
>So, on line 247 of ProducerPerformance.scala, I made this minor code
>change:
>
>val message =
>"qopwr11591UPD113582260001AS1IL1-1N/A1Entertainment1-1an-example.com1-1-1-
>1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,MED,HIGH,HD,.mp4.csm
>il/bitrate=11subcategory
>71Title
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-r1VR-11591UPD113582260001AS1IL1-1N/A1Entertainment1-1a
>n-example.com1-1-1-1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,M
>ED,HIGH,HD,.mp4.csmil/bitrate=11subcategory
>71Title
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-r1VR-11591UPD113582260001AS1IL1-1N/A1Entertainment1-1a
>n-example.com1-1-1-1-1-1-1-1011413/011413_factor_points_FNC_,LOW,MED_LOW,M
>ED,HIGH,HD,.mp4.csmil/bitrate=11subcategory
>71Title
>10^D1-1-111-1-1-1-1-1-111-1-1-1-1-115101-1-1-1-1126112491-1-1-1-1-1-1-1-1-
>1-1-1-1-1-1-111-1-1-"
>message.getBytes().slice(0,msgSize)
>
>This makes sure that I have a big message, and I can slice that
>message to the message size passed in the command line options.
>
>But the problem is that when I run the same test with
>--compression-codec values of 1, 2 or 3, I am still seeing ASCII data
>(i.e. uncompressed data only).
>
>I want to ask whether this is a bug. And, using
>kafka-producer-perf-test.sh, how can I send my own compressed data?
>
>Thanks,
>
>Prabhjot
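
A footnote on the dictionary approach suggested in the reply at the top of this message: the sketch below is one way it might look, written in Python rather than in the perf tool's Scala so that it runs on its own. It is not code from Kafka. The helper names (make_dictionary, make_message, batch_ratio), the word lengths, the batch of 200 messages and the 1000-byte message size are all assumptions picked for illustration, and zlib's DEFLATE (the algorithm behind gzip) stands in for the producer's codecs. Each batch is compressed as one unit to mimic the compound-message behaviour, and the script prints the compressed/raw ratio for all-zero payloads, one fixed message repeated, and unique random-word messages.

```python
import random
import string
import zlib


def make_dictionary(rng, size=1000, min_len=8, max_len=16):
    """Build `size` nonsense words (a stand-in for a real word list)."""
    return [
        "".join(rng.choice(string.ascii_lowercase)
                for _ in range(rng.randint(min_len, max_len)))
        for _ in range(size)
    ]


def make_message(rng, words, msg_size):
    """Join randomly chosen words until long enough, then cut to msg_size bytes."""
    parts = []
    length = 0
    while length <= msg_size:
        word = rng.choice(words)
        parts.append(word)
        length += len(word) + 1  # +1 for the separating space
    return " ".join(parts).encode("ascii")[:msg_size]


def batch_ratio(messages):
    """Compressed size / raw size when the whole batch is compressed as one unit."""
    raw = b"".join(messages)
    return len(zlib.compress(raw)) / len(raw)


rng = random.Random(42)          # fixed seed so runs are repeatable
words = make_dictionary(rng)
msg_size, batch = 1000, 200      # assumed sizes, not Kafka defaults

zeros = [bytes(msg_size)] * batch                                    # all-zero payloads
fixed = [make_message(rng, words, msg_size)] * batch                 # one message repeated
varied = [make_message(rng, words, msg_size) for _ in range(batch)]  # unique messages

for name, msgs in [("zeros", zeros), ("fixed", fixed), ("varied", varied)]:
    print(f"{name:7s} compressed/raw = {batch_ratio(msgs):.3f}")
```

The varied batch should compress noticeably less well than the other two while still shrinking, which is the property a more realistic benchmark payload needs.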