Hello, I have a problem with one of my Samza jobs: its changelog topic grows indefinitely.
The job is quite simple: it reads messages from the input Kafka topic and either creates or updates a key in a task-local Samza store. Once a minute the window method kicks in, iterates over all keys in the store, and deletes some of them based on the contents of their values.
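For reference, the task boils down to something like the sketch below (the class name, store name, and expiry check are placeholders I made up for illustration, not the real job code):

import java.util.ArrayList;
import java.util.List;
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.Entry;
import org.apache.samza.storage.kv.KeyValueIterator;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.*;

public class ExpiringStoreTask implements StreamTask, WindowableTask, InitableTask {
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) throws Exception {
        // the store is declared in the job config with a changelog stream attached
        store = (KeyValueStore<String, String>) context.getStore("my-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) throws Exception {
        // create or update the key for every incoming message
        store.put((String) envelope.getKey(), (String) envelope.getMessage());
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
        // fires once a minute (task.window.ms=60000): scan the whole store
        // and drop the entries whose value says they are no longer needed
        List<String> toDelete = new ArrayList<String>();
        KeyValueIterator<String, String> it = store.all();
        try {
            while (it.hasNext()) {
                Entry<String, String> entry = it.next();
                if (isExpired(entry.getValue())) {
                    toDelete.add(entry.getKey());
                }
            }
        } finally {
            it.close(); // store iterators must always be closed
        }
        for (String key : toDelete) {
            store.delete(key);
        }
    }

    // placeholder for the value-based check that decides what to delete
    private boolean isExpired(String value) {
        return false;
    }
}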
The message rate in the input topic is about 3000 messages per second, and the topic has 48 partitions. The number of keys kept in the store is more or less stable and does not exceed 10000 per task, with an average value size of 50 bytes. So I expected the total size of all segments in the Kafka data directory for the job's changelog topic to stay around 10000 * 50 * 48 ~= 24 MB. In fact it is more than 2.5 GB (after 6 days running from scratch) and still growing.
I tried changing the default segment size for the changelog topic in Kafka, and it helped a bit: instead of 500 MB segments I now have 50 MB segments. But it did not fix the indefinite growth.
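For the record, I changed the segment size roughly like this (assuming a Kafka version where kafka-topics.sh still accepts --alter with --config; the topic name here is illustrative):

bin/kafka-topics.sh --zookeeper localhost:2181 --alter \
    --topic my-job-changelog \
    --config segment.bytes=52428800   # 50 MB segments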
Moreover, if I stop the job and start it again, it cannot restart: it breaks right after reading all the records from the changelog topic.
Has anybody had a similar problem? How could it be resolved?

Best regards,
Vladimir

--
Vladimir Lebedev
[email protected]
