[ https://issues.apache.org/jira/browse/KAFKA-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Troiano updated KAFKA-13687:
-----------------------------------
    Description: 
Currently kafka-dump-log.sh reads the whole segment file(s) and dumps their contents.

As we know, the savings from combining batching and compression on the producer side (when the payloads are good candidates) are huge.

 

We would like a way to "monitor" how the producers build their batches, as we do not always have access to the producer metrics.

We have multi-tenant producers, so it is hard to "detect" when the usage is not optimal.

 

The problem with the current way DumpLogSegments works is that it reads the whole file. In a scenario with thousands of topics and different segment sizes (the default is 1 GB), we could end up affecting the cluster's performance, because we evict useful pages from the page cache and replace them with what we read from the files.

 

As we only need a few samples from each segment to see the producing pattern, we would like to add a new parameter called maxBatches.

 

Based on the current script, the change is quite small: it only needs a parameter and a counter.
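
For illustration, here is a minimal sketch of the idea in Scala. The argument handling and the stand-in batch iterator are illustrative only, not the actual DumpLogSegments internals; in the real tool the limit would come from the new maxBatches option and the iterator would be the segment's batches.

{code:scala}
// Sketch only: read an optional batch limit and stop dumping once the
// counter reaches it.
object MaxBatchesSketch {
  def main(args: Array[String]): Unit = {
    // Default to "no limit" so the behaviour stays backwards compatible.
    val maxBatches: Int = args.headOption.map(_.toInt).getOrElse(Int.MaxValue)

    // Stand-in for iterating over the batches of a segment file.
    val batches = Iterator.tabulate(1000)(i => s"baseOffset: $i count: 1 compresscodec: none")

    var dumped = 0
    while (batches.hasNext && dumped < maxBatches) {
      println(batches.next()) // the real tool prints the batch header here
      dumped += 1
    }
  }
}
{code}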

 

After adding this change we could, for example, periodically take small samples and analyze the batch headers (looking at the compresscodec and the record count of each batch).
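
As a hypothetical usage example (the --max-batches flag name simply mirrors the proposed maxBatches parameter and does not exist yet, and the segment path is made up):

  bin/kafka-dump-log.sh --files /var/kafka-logs/my-topic-0/00000000000000000000.log --max-batches 100 | grep -o 'compresscodec: [a-zA-Z]*' | sort | uniq -c

This would summarize the compression codecs of the first 100 batches of a segment without reading the whole file.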

 

Doing this we could automate a tool to read all the topics. Going even further, when we see a client using neither compression nor batching, we could take the payloads of those samples and simulate compressing them (with batching and compression); with those numbers we could then approach the client with a proposal to save money.
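
A rough sketch of that simulation, assuming gzip as a stand-in codec (the payloads below are made up; in practice they would be the samples taken from the segment):

{code:scala}
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object CompressionSavingsSketch {
  def main(args: Array[String]): Unit = {
    // Made-up payloads standing in for samples read from a segment.
    val payloads: Seq[Array[Byte]] =
      Seq.fill(100)("""{"user":"alice","action":"click","page":"/home"}""".getBytes("UTF-8"))

    val rawSize = payloads.map(_.length).sum

    // Simulate batching + compression: concatenate the samples and gzip them.
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    payloads.foreach(p => gzip.write(p))
    gzip.close()
    val compressedSize = buffer.size()

    println(f"raw: $rawSize%d bytes, batched+gzipped: $compressedSize%d bytes, " +
      f"saving: ${100.0 * (rawSize - compressedSize) / rawSize}%.1f%%")
  }
}
{code}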

> Limit number of batches when using kafka-dump-log.sh
> ----------------------------------------------------
>
>                 Key: KAFKA-13687
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13687
>             Project: Kafka
>          Issue Type: Improvement
>          Components: tools
>            Reporter: Sergio Troiano
>            Priority: Minor
>              Labels: easyfix, features
>         Attachments: DumpLogSegments.scala
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
