Hi Divij,

First of all, thanks for your time and dedication.

About point one:
You are right, the idea is to have "in real time" visibility into the way
the clients are using the service, as that translates into a lot of money
saved.
I agree with the broader vision, although I think we are still far away
from it :)

About the resource usage, my idea is to be zero invasive: taking a few MB
of samples once every few hours will be more than enough to understand the
produce pattern, so in this case the CPU cost falls only on the producer
and the consumer.
It is worth mentioning that the additional 3% CPU usage while producing is
negligible compared to the gains from batching and compression, but maybe
that discussion is out of scope for this KIP; that is a decision between
the cluster admin and the clients.
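
To make the sampling idea concrete, here is a minimal sketch of the kind
of sampler I have in mind (the class name, segment path, sample size and
schedule are illustrative assumptions, not code from the KIP):

import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SegmentSampler {
    // "A few MB" per sample; enough to see the produce pattern.
    private static final int SAMPLE_BYTES = 2 * 1024 * 1024;

    public static void main(String[] args) {
        // Hypothetical segment location; in practice it would be discovered.
        Path segment =
            Paths.get("/var/kafka-logs/mytopic-0/00000000000000000000.log");
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        // Sample once every few hours to stay zero invasive.
        scheduler.scheduleAtFixedRate(() -> sample(segment), 0, 4,
                TimeUnit.HOURS);
    }

    static void sample(Path segment) {
        try (RandomAccessFile file =
                new RandomAccessFile(segment.toFile(), "r")) {
            byte[] buffer =
                new byte[(int) Math.min(SAMPLE_BYTES, file.length())];
            file.readFully(buffer);
            // Hand the raw batches to the batching/compression analysis.
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}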

About the "auto tuning" that is a great idea, again I think it is very
ambitious for the scope of this KIP but if the core of this is properly
done then this can be used in the future.


About point two:
Below are the benefits of batching and compression:
- Reduction of network bandwidth while data is produced.
- Reduction of disk usage to store the data, and less IO to read and write
the segments (assuming the message format does not have to be converted).
- Reduction of network traffic while data is replicated.
- Reduction of network traffic while data is consumed.
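
To put rough, purely illustrative numbers on this: with a 60% compression
saving, 1 TB/day of uncompressed produce traffic and replication factor 3,
that is ~600 GB/day less over the producer links, ~1.2 TB/day less
replication traffic, ~1.8 TB/day less written to disk across the replicas,
and the same ~60% saving again for every consumer group reading the topic.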

The script I propose will output the percentage of network traffic
reduction plus the disk space saved.
- Batching will be recommended based on the parameters
$batching-window-time (ms) and $min-records-for-batching; the idea is to
check the CreateTime of each batch. Let's suppose we use:

batching-window-time = 300
min-records-for-batching = 30

* This means we want to check whether we can batch together at least 30
records within 300 ms; today those records could be spread across 2
batches or across 30 (one record per batch).
* If the batching is achievable then we jump to the next check and
simulate the compression, even if compression is already applied, as
batching more data will improve the compression ratio.
* Finally, the payload (a few MB brought into memory) is measured for its
current size, then it is compressed and the difference is calculated.
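
Here is a rough sketch of both checks, assuming the batches from the
sample are already parsed and sorted by CreateTime (the class and method
names are made up for illustration, and GZIP merely stands in for
whichever codec the script would simulate):

import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BatchingSimulator {

    // Illustrative stand-in for a record batch parsed from the sample.
    record Batch(long createTime, int recordCount, byte[] payload) {}

    // Sliding window over the batches: can we gather
    // min-records-for-batching records within batching-window-time ms,
    // no matter how many batches they are currently spread across?
    static boolean batchingAchievable(List<Batch> batches, long windowMs,
                                      int minRecords) {
        int start = 0;
        int records = 0;
        for (int end = 0; end < batches.size(); end++) {
            records += batches.get(end).recordCount();
            while (batches.get(end).createTime()
                    - batches.get(start).createTime() > windowMs) {
                records -= batches.get(start).recordCount();
                start++;
            }
            if (records >= minRecords) {
                return true;
            }
        }
        return false;
    }

    // Compress the sampled payload and return the fraction saved; this
    // runs even when compression is already applied, since bigger batches
    // generally compress better.
    static double compressionSaving(byte[] payload) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        }
        return 1.0 - (double) out.size() / payload.length;
    }
}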


As a side note, I think that if the classes are properly designed they can
be reused in the future for a more "automagic" mode of usage. Again, I
really like the idea of allowing the cluster to configure the producers
(maybe the producer could have a parameter to opt in to this).
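
If that cluster-driven configuration ever happens, I imagine the opt-in on
the producer side could look something like this (the property name is
purely hypothetical, it does not exist today):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
// Hypothetical flag letting the cluster push tuning recommendations:
props.put("allow.cluster.tuning", "true");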

I did not go into details about the code, as I would first like to know
whether the idea is worth it. I use this "solution" in the company I work
for and it has saved us a lot of money. For now we just take the output of
the dump-logs.sh script in order to see the CreateTime and the compression
type; this is a first step, but we can't yet simulate the compression.
So for now we reach out to our clients saying "there is a potential
benefit of cost reduction if you apply these changes in the producer".
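
For reference, this is the kind of inspection we do today (assuming I mean
the standard kafka-dump-log.sh tool that ships with Kafka):

bin/kafka-dump-log.sh --files /var/kafka-logs/mytopic-0/00000000000000000000.log

Each batch line of the output includes the count, CreateTime and
compresscodec fields, which is enough to see the current batching and
compression, but not to simulate what better settings would save.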


I hope this helps; please feel free to add more feedback.

Best regards.
Sergio Troiano

On Mon, 16 May 2022 at 10:35, Divij Vaidya <divijvaidy...@gmail.com> wrote:

> Thank you for the KIP Sergio.
>
> High level thoughts:
> 1\ I understand that the idea here is to provide better visibility to the
> admins about potential improvements using compression and modifying batch
> size. I would take it a step further and say that we should be providing
> this visibility in a programmatic push based manner and make this system
> generic enough so that adding new "optimization rules" in the future is
> seamless. Perhaps, have a "diagnostic" mode in the cluster, which can be
> dynamically enabled. In such a mode, the cluster would run a set of
> "optimization" rules (at the cost of additional CPU cycles). One of such
> rules would be the compression rule you mentioned in your KIP. At the end
> of the diagnostic run, the generated report would contain a set of
> recommendations. To begin with, we can introduce this "diagnostic" as a
> one-time run by admin and later, enhance it further to be triggered
> periodically in the cluster automatically (with results being published via
> existing metric libraries). Even further down the line, this could lead to
> "auto-tuning" producer libraries based on recommendations from the server.
>
> KIP implementation specific comments/questions:
> 2\ Can you please add the algorithm that would be used to determine whether
> compression is recommended or not? I am assuming that the algorithm would
> take into account the factors impacting compression optimization such as
> CPU utilization, network bandwidth, decompression cost by the consumers
> etc.
> 3\ Can you please add the algorithm that would be used to determine whether
> batching is recommended?
>
>
> Divij Vaidya
>
>
>
> On Mon, May 16, 2022 at 8:42 AM Sergio Daniel Troiano
> <sergio.troi...@adevinta.com.invalid> wrote:
>
> > Hey guys!
> >
> > I would like to start an early discussion on this:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-838+Simulate+batching+and+compression
> >
> >
> > Thanks!
> >
>