Hi Divij,

First of all, thanks for your time and dedication.
About point one: You are right, the idea is to have "real time" visibility into the way the clients are using the service, as that translates into a lot of money saved. I agree with the further vision, although I think we are still far away from it :)

About the resource usage, my idea is to be zero-invasive, so taking a few MB of samples once every few hours will be more than enough to understand the produced pattern; in this case the CPU usage is only a cost for the producer and consumer. It is worth mentioning that the additional ~3% extra usage while producing is negligible compared to the gain from batching and compression, but maybe that discussion is not related to this KIP; that is a decision between the cluster admin and the clients.

About the "auto tuning": that is a great idea. Again, I think it is too ambitious for the scope of this KIP, but if the core of this is properly done then it can be reused in the future.

About point two: Below are the benefits of batching and compression:

- Reduction of network bandwidth while data is produced.
- Reduction of disk usage to store the data, and less IO to read and write the segments (supposing the message format does not have to be converted).
- Reduction of network traffic while data is replicated.
- Reduction of network traffic while the data is consumed.

The script I propose will output the percentage of network traffic reduction plus the disk space saved.

- Batching will be recommended based on the parameters $batching-window-time (ms) and $min-records-for-batching. The idea is to check the CreateTime of each batch. Let's suppose we use:

  batching-window-time = 300
  min-records-for-batching = 30

  * This means we want to check whether we can batch together at least 30 records in 300 ms; this could be in 2 batches or in 30 (one record per batch).
  * If the batching is achievable, then we jump to the next check to simulate the compression, even if compression is already applied, as batching more data will improve the compression ratio.
  * Finally, the payload (a few MB) is brought into memory in order to get its current size; then it is compressed and the difference is calculated.

As a side note, I think that if the classes are properly created this can be reused in the future for a more "automagic" way of usage. Again, I really like the idea of allowing the cluster to configure the producers (maybe the producer could have a parameter to allow this).

I did not go into details about the code as I would first like to know whether the idea is worth it. I use this "solution" in the company I work for and it has saved us a lot of money. For now we just get the output of the dump-logs.sh script in order to see the CreateTime and the compression type; this is a first step, but we can't yet simulate the compression. So for now we reach out to our clients saying "there is a potential benefit of cost reduction if you apply these changes in the producer".

I hope this helps, please feel free to add more feedback.

Best regards.
Sergio Troiano

On Mon, 16 May 2022 at 10:35, Divij Vaidya <divijvaidy...@gmail.com> wrote:

> Thank you for the KIP Sergio.
>
> High level thoughts:
> 1\ I understand that the idea here is to provide better visibility to the
> admins about potential improvements using compression and modifying batch
> size. I would take it a step further and say that we should be providing
> this visibility in a programmatic push based manner and make this system
> generic enough so that adding new "optimization rules" in the future is
> seamless. Perhaps, have a "diagnostic" mode in the cluster, which can be
> dynamically enabled. In such a mode, the cluster would run a set of
> "optimization" rules (at the cost of additional CPU cycles). One of such
> rules would be the compression rule you mentioned in your KIP. At the end
> of the diagnostic run, the generated report would contain a set of
> recommendations.
> To begin with, we can introduce this "diagnostic" as a
> one-time run by admin and later, enhance it further to be triggered
> periodically in the cluster automatically (with results being published via
> existing metric libraries). Even further down the line, this could lead to
> "auto-tuning" producer libraries based on recommendations from the server.
>
> KIP implementation specific comments/questions:
> 2\ Can you please add the algorithm that would be used to determine whether
> compression is recommended or not? I am assuming that the algorithm would
> take into account the factors impacting compression optimization such as
> CPU utilization, network bandwidth, decompression cost by the consumers
> etc.
> 3\ Can you please add the algorithm that would be used to determine whether
> batching is recommended?
>
>
> Divij Vaidya
>
>
>
> On Mon, May 16, 2022 at 8:42 AM Sergio Daniel Troiano
> <sergio.troi...@adevinta.com.invalid> wrote:
>
> > Hey guys!
> >
> > I would like to start an early discussion on this:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-838+Simulate+batching+and+compression
> >
> >
> > Thanks!
> >
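The batching-window and compression checks described in the reply above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the KIP's implementation: it assumes the sampled records are available as (CreateTime in ms, payload bytes) values, uses zlib as a stand-in for whichever codec the producer would actually use, and borrows the parameter names from the proposal.

```python
import zlib

def batching_achievable(create_times_ms, batching_window_time=300,
                        min_records_for_batching=30):
    """True if at least min_records_for_batching records fall within
    some batching_window_time ms window of CreateTime values."""
    times = sorted(create_times_ms)
    lo = 0
    for hi in range(len(times)):
        # Shrink the window until it spans at most batching_window_time ms.
        while times[hi] - times[lo] > batching_window_time:
            lo += 1
        if hi - lo + 1 >= min_records_for_batching:
            return True
    return False

def simulate_compression(payloads):
    """Compress a sample of payloads together and return
    (raw_bytes, compressed_bytes, percent_saved)."""
    raw = b"".join(payloads)
    compressed = zlib.compress(raw)
    saved_pct = 100.0 * (len(raw) - len(compressed)) / len(raw)
    return len(raw), len(compressed), saved_pct

# 30 records produced 10 ms apart all fit within one 300 ms window.
print(batching_achievable([i * 10 for i in range(30)]))  # -> True

# Repetitive JSON-like payloads compress well when batched together.
raw, comp, pct = simulate_compression(
    [b'{"user_id": %d, "event": "click"}' % i for i in range(1000)])
print(f"raw={raw}B compressed={comp}B saved={pct:.1f}%")
```

In practice the CreateTime values would come from the record batches in the log segments (e.g. from the dump-logs.sh output mentioned above), and the sample payloads would be the few MB brought into memory by the proposed script.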