Re: [PR] [KafkaIO] Only update size metrics once per batch [beam]

via GitHub Sun, 07 Sep 2025 03:23:25 -0700


sjvanrossum commented on PR #36077:
URL: https://github.com/apache/beam/pull/36077#issuecomment-3263660429


   Sampled on a pipeline with 4 initial machines (AMD Milan/Rome, 4 vCPUs per 
machine) processing a topic receiving 1 GiB/s across 500 partitions (1 KiB record 
size), starting with a 2 TiB backlog and plumbed straight to /dev/null. I think 
it's worth revisiting the metrics implementation: fetching the delegate metric 
for every update is unnecessary overhead. For this stress test I observed a 25% 
throughput improvement in `ReadFromKafkaDoFn` after these changes were applied.
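
   At 1 GiB/s with 1 KiB records the pipeline handles roughly a million records per second, so any per-record lookup adds up quickly. As a rough illustration of the idea only (not the code in this PR), the sketch below aggregates sizes locally and touches the metric once per batch. `SizeDistribution`, `MetricRegistry`, `resolve()` and `BatchSizeMetrics` are hypothetical stand-ins, not Beam classes, and `resolve()` merely models the cost of fetching the delegate metric.

```java
/** Hypothetical stand-in for a distribution metric; not Beam's actual class. */
final class SizeDistribution {
  long count;
  long sum;
  long min = Long.MAX_VALUE;
  long max = Long.MIN_VALUE;

  /** Records a single observation. */
  void update(long value) {
    update(value, 1, value, value);
  }

  /** Merges a pre-aggregated batch of observations in one call. */
  void update(long batchSum, long batchCount, long batchMin, long batchMax) {
    sum += batchSum;
    count += batchCount;
    min = Math.min(min, batchMin);
    max = Math.max(max, batchMax);
  }
}

/** Hypothetical registry; resolve() models "fetching the delegate metric". */
final class MetricRegistry {
  final SizeDistribution recordSizes = new SizeDistribution();
  int lookups; // counts how often the delegate was resolved

  SizeDistribution resolve() {
    lookups++;
    return recordSizes;
  }
}

final class BatchSizeMetrics {
  /** Baseline: one delegate lookup and one update per record. */
  static void updatePerRecord(MetricRegistry registry, long[] sizes) {
    for (long size : sizes) {
      registry.resolve().update(size);
    }
  }

  /** Aggregates locally, then does a single lookup and update per batch. */
  static void updatePerBatch(MetricRegistry registry, long[] sizes) {
    if (sizes.length == 0) {
      return; // nothing to report for an empty batch
    }
    long sum = 0;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long size : sizes) {
      sum += size;
      min = Math.min(min, size);
      max = Math.max(max, size);
    }
    registry.resolve().update(sum, sizes.length, min, max);
  }
}
```

   Both paths end with the same sum, count, min and max in this sketch; only the number of lookups differs (one per batch instead of one per record).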
   
   Profiler data for distribution of record sizes before:
   <img width="4470" height="1560" alt="image" src="https://github.com/user-attachments/assets/d3c99c91-0ed9-4e74-9563-3faa34367da5" />
   And after:
   <img width="4470" height="1320" alt="image" src="https://github.com/user-attachments/assets/81840221-8fe9-4db9-87d9-655245adabcf" />
   
   Profiler data for average record size before:
   <img width="4470" height="1132" alt="image" src="https://github.com/user-attachments/assets/14804341-335a-4ea9-994c-c46cf6f47e64" />
   And after:
   <img width="4470" height="1080" alt="image" src="https://github.com/user-attachments/assets/184b7a3d-4be4-4d00-b90e-116fc3841b12" />
   


