Sergio Troiano edited comment on KAFKA-13687 at 3/2/22, 5:17 PM:
-----------------------------------------------------------------

I made the small change in the source code and it looks good. Note the new --max-batches-size parameter, which limits the dump by the total size of the read batches in bytes (a standalone sketch of the idea follows the output):

{code:java}
$ bin/kafka-dump-log.sh --print-data-log --max-batches-size 3000 --files /var/lib/kafka/data/test-topic-4/00000000000183891269.log
Dumping /var/lib/kafka/data/test-topic-4/00000000000183891269.log
Starting offset: 183891269
baseOffset: 183891269 lastOffset: 183891269 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 126 isTransactional: false isControl: false deleteHorizonMs: OptionalLong.empty position: 0 CreateTime: 1645793452954 size: 803 magic: 2 compresscodec: none crc: 3610716550 isvalid: true
baseOffset: 183891270 lastOffset: 183891270 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 126 isTransactional: false isControl: false deleteHorizonMs: OptionalLong.empty position: 803 CreateTime: 1645793453044 size: 816 magic: 2 compresscodec: none crc: 1908752378 isvalid: true
baseOffset: 183891271 lastOffset: 183891271 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 126 isTransactional: false isControl: false deleteHorizonMs: OptionalLong.empty position: 1619 CreateTime: 1645793453261 size: 810 magic: 2 compresscodec: none crc: 1078688538 isvalid: true
{code}
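For illustration only, here is a minimal standalone sketch of the byte-bounded dump loop, assuming the FileRecords API from kafka-clients. BoundedDump, dumpBatches and maxBatchesSize are names invented for this sketch; the actual change lives in the attached DumpLogSegments.scala:

{code:scala}
import java.io.File
import org.apache.kafka.common.record.FileRecords

// Illustrative sketch, not the attached patch: stop dumping once the next
// batch would exceed a byte budget, so only the head of the segment is read.
object BoundedDump {
  def dumpBatches(segment: File, maxBatchesSize: Int): Unit = {
    val records = FileRecords.open(segment)
    try {
      var bytesDumped = 0
      var budgetLeft = true
      val it = records.batches().iterator()
      while (it.hasNext && budgetLeft) {
        val batch = it.next()
        if (bytesDumped + batch.sizeInBytes > maxBatchesSize) {
          budgetLeft = false
        } else {
          // Print a few of the batch header fields shown in the dump above.
          println(s"baseOffset: ${batch.baseOffset} lastOffset: ${batch.lastOffset} " +
            s"count: ${batch.countOrNull} size: ${batch.sizeInBytes} " +
            s"compresscodec: ${batch.compressionType}")
          bytesDumped += batch.sizeInBytes
        }
      }
    } finally {
      records.close()
    }
  }

  def main(args: Array[String]): Unit =
    dumpBatches(new File(args(0)), args(1).toInt)
}
{code}

Under this accounting a 3000-byte budget admits the three ~800-byte batches shown above and stops before the next one, so only the first few kilobytes of a 1 GB segment ever get read.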
> Limit number of batches when using kafka-dump-log.sh
> -----------------------------------------------------
>
>                 Key: KAFKA-13687
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13687
>             Project: Kafka
>          Issue Type: Improvement
>          Components: tools
>    Affects Versions: 2.8.1
>            Reporter: Sergio Troiano
>            Priority: Minor
>              Labels: easyfix, features
>         Attachments: DumpLogSegments.scala, FileRecords.java
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Currently kafka-dump-log.sh reads the whole file(s) and dumps the contents of the segment file(s).
>
> As we know, the savings from combining compression and batching while producing (when the payloads are good candidates) are huge.
>
> We would like a way to monitor how producers build their batches, since we do not always have access to the producer metrics. We have multi-tenant producers, so it is hard to detect when a client's usage is suboptimal.
>
> The problem with the way the dump tool currently works is that it reads the whole file. In a scenario with thousands of topics and various segment sizes (the default is 1 GB), we could end up hurting the cluster, evicting useful pages from the page cache and replacing them with data read from the segment files.
>
> Since we only need a few samples from each segment to see the producing pattern, we would like to add a new parameter called maxBatches.
>
> Based on the current script, the change is quite small: it only needs a new parameter and a counter.
>
> After adding this change we could, for example, periodically take small samples and analyze the batch headers, searching for compresscodec and the batch count (a sketch of such a check appears after this description).
>
> Doing this we could automate a tool that reads all the topics. Going even further, when we see that a client is using neither compression nor batching, we could take the payloads of those samples, simulate compressing and batching them, and with those numbers approach the client with a proposal for saving money.
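As a sketch only of that automation idea (nothing here is attached to the ticket; the parsing assumes the batch header format shown in the comment above), a small program could scan kafka-dump-log.sh output and count the batches that use neither batching nor compression:

{code:scala}
import scala.io.Source

// Illustrative sketch: parse the batch header lines produced by
// kafka-dump-log.sh and flag single-record, uncompressed batches.
object BatchHeaderScan {
  private val CountRe = """count: (\d+)""".r
  private val CodecRe = """compresscodec: (\w+)""".r

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    try {
      var total = 0
      var singleUncompressed = 0
      for (line <- source.getLines() if line.startsWith("baseOffset:")) {
        total += 1
        val count = CountRe.findFirstMatchIn(line).map(_.group(1).toInt)
        val codec = CodecRe.findFirstMatchIn(line).map(_.group(1))
        // count: 1 with compresscodec: none means neither batching nor
        // compression was used for this batch.
        if (count.contains(1) && codec.contains("none")) singleUncompressed += 1
      }
      println(s"$singleUncompressed of $total batches are single-record and uncompressed")
    } finally {
      source.close()
    }
  }
}
{code}

Run against periodic bounded dumps of each topic, a counter like this could be enough to rank the tenants worth contacting.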