We are in the process of engineering a system that will use Kafka.
The legacy system uses the local file system and a database as the
queue. In terms of scale, we process about 35 billion events per day
contained in 15 million files.



I am looking for feedback on a design decision we are discussing.



In our current system we depend heavily on compression as a performance
optimization. In Kafka the use of compression has some overhead, as the
broker needs to decompress the data to assign offsets and re-compress it.
(explained in detail here
http://geekmantra.wordpress.com/2013/03/28/compression-in-kafka-gzip-or-snappy/
)
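
For reference, the built-in route is just a producer setting. A minimal
sketch, assuming the current Java producer client (the broker address is
a placeholder, and older clients call this setting compression.codec):

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;

  Properties props = new Properties();
  props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
  props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer");
  props.put("value.serializer",
      "org.apache.kafka.common.serialization.ByteArraySerializer");
  // The broker may decompress these batches to assign offsets,
  // then re-compress them -- the overhead described above.
  props.put("compression.type", "gzip");
  KafkaProducer<String, byte[]> producer =
      new KafkaProducer<String, byte[]>(props);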



We are thinking about NOT using Kafka compression but rather compressing
multiple rows in our own code. For example, say we want to send data in
batches of 5,000 rows. Using Kafka compression, we would use a batch size
of 5,000 rows and enable compression. The other option is using a batch
size of 1 in Kafka, BUT in our code taking 5,000 messages, compressing
them, and then sending them to Kafka with the Kafka compression setting
of none (a sketch of what we have in mind is below).
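
To make the second option concrete, here is a rough sketch of what we
have in mind, assuming gzip and newline-delimited rows (the topic name
"events" and the helper names are placeholders, not our real code):

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.util.List;
  import java.util.zip.GZIPOutputStream;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class BatchSender {
      // Gzip a batch of rows (e.g. 5,000) into a single payload.
      static byte[] compressBatch(List<String> rows) throws IOException {
          ByteArrayOutputStream buf = new ByteArrayOutputStream();
          GZIPOutputStream gz = new GZIPOutputStream(buf);
          for (String row : rows) {
              gz.write(row.getBytes("UTF-8"));
              gz.write('\n');            // simple row delimiter
          }
          gz.close();                    // flushes the gzip trailer
          return buf.toByteArray();
      }

      // Ship the whole batch as ONE Kafka message; the producer itself
      // runs with compression.type=none, so the broker never inflates it.
      static void sendBatch(KafkaProducer<String, byte[]> producer,
                            List<String> rows) throws IOException {
          producer.send(new ProducerRecord<String, byte[]>(
              "events", compressBatch(rows)));
      }
  }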



Is this a pattern others have used?



Regardless of compression, I am curious whether others are using a single
Kafka message to contain multiple messages from an application standpoint.
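
And for completeness, the consumer side of that scheme would look roughly
like this (handleRow is a placeholder for whatever per-row processing the
application does; the gzip/newline framing matches the producer sketch
above):

  import java.io.BufferedReader;
  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.zip.GZIPInputStream;

  // Inflate one Kafka message back into the rows batched into it.
  static void unpackBatch(byte[] value) throws IOException {
      BufferedReader reader = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(new ByteArrayInputStream(value)), "UTF-8"));
      String row;
      while ((row = reader.readLine()) != null) {
          handleRow(row);  // placeholder per-row handler
      }
      reader.close();
  }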


Bert
