Re: [I] [Bug]: IcebergIO - Write performance issues [beam]

via GitHub Fri, 11 Oct 2024 10:12:06 -0700


DanielMorales9 commented on issue #32746:
URL: https://github.com/apache/beam/issues/32746#issuecomment-2407820005


   Yep, it works! 🥳 
   I can see that only two threads are writing now, and the commit-interval 
distribution is stable.
   
   <img width="752" alt="Screenshot 2024-10-11 at 18 25 20" 
src="https://github.com/user-attachments/assets/399d185a-7a8f-44f7-9ab5-648345117f3a">
   
   However, I'm still uncertain about how `Redistribute` behaves when 
autoscaling is enabled in Dataflow.  🤔  
   I might need to run a load test 📈
   
   My concern is that with a fixed number of buckets defined using 
`withNumBuckets`, autoscaling may cause inefficiencies. If autoscaling raises 
the number of workers above the number of buckets (`numOfBuckets < 
numWorkers`), many workers could remain idle, leading to underutilization. The 
pipeline then isn't truly elastic, because it can't scale dynamically with 
fluctuations in data volume.
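
   To make the concern concrete, here is a rough sketch of the arithmetic. It 
assumes that each bucket keeps at most one worker busy at a time, which is a 
simplification and may not match how the runner actually schedules work. The 
class and method names are hypothetical and not part of Beam.

```java
// Hypothetical helper, not part of Beam: back-of-the-envelope arithmetic only.
public class BucketUtilization {

    // Workers that can be busy at once, ASSUMING one bucket occupies at most one worker.
    public static int busyWorkers(int numBuckets, int numWorkers) {
        return Math.min(numBuckets, numWorkers);
    }

    // Workers left without a bucket to process under the same assumption.
    public static int idleWorkers(int numBuckets, int numWorkers) {
        return numWorkers - busyWorkers(numBuckets, numWorkers);
    }

    public static void main(String[] args) {
        int numBuckets = 2; // fixed, as with withNumBuckets(2)
        for (int numWorkers : new int[] {1, 2, 4, 8}) {
            System.out.printf("workers=%d busy=%d idle=%d%n",
                    numWorkers,
                    busyWorkers(numBuckets, numWorkers),
                    idleWorkers(numBuckets, numWorkers));
        }
    }
}
```

   Under this assumption, every worker that autoscaling adds beyond the bucket 
count contributes nothing to the write step.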
   
   At the same time, skipping redistribution entirely isn't feasible: as seen 
in earlier attempts, the job performs poorly without it and, in some cases, 
gets stuck indefinitely.
   
   In contrast, I would expect `Redistribute` to function more like the dynamic 
repartitioning behavior of Iceberg with Spark, where micro-batches are 
automatically adjusted based on target file sizes or other heuristics (e.g., 
the default 512 MB target file size). Such dynamic repartitioning ensures that 
as data volume grows or shrinks, the system can adjust on the fly to maintain 
efficiency.
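
   As an illustration of the kind of heuristic I have in mind, the sketch below 
derives a bucket count from the data volume of a batch and a target file size. 
It is not how Spark, Iceberg, or Beam implement this; `DynamicBuckets` and 
`bucketsFor` are hypothetical names, and the 512 MB constant is just the 
default mentioned above.

```java
// Hypothetical sketch: choose a bucket count from data volume and a target file size.
public class DynamicBuckets {

    // 512 MB, the target file size mentioned above (assumed here as the default).
    public static final long TARGET_FILE_SIZE_BYTES = 512L * 1024 * 1024;

    // Returns ceil(batchBytes / targetFileSizeBytes), with a minimum of one bucket.
    public static int bucketsFor(long batchBytes, long targetFileSizeBytes) {
        if (targetFileSizeBytes <= 0) {
            throw new IllegalArgumentException("targetFileSizeBytes must be positive");
        }
        if (batchBytes <= 0) {
            return 1; // always keep at least one bucket
        }
        long buckets = batchBytes / targetFileSizeBytes
                + (batchBytes % targetFileSizeBytes == 0 ? 0 : 1);
        // Clamp so the cast to int cannot overflow.
        return (int) Math.min(buckets, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long fiveGiB = 5L * 1024 * 1024 * 1024;
        System.out.println(bucketsFor(fiveGiB, TARGET_FILE_SIZE_BYTES));
    }
}
```

   With something like this, the bucket count would follow the data volume 
instead of being fixed when the pipeline is built.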


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
