DanielMorales9 commented on issue #32746: URL: https://github.com/apache/beam/issues/32746#issuecomment-2407820005
Yep, it works! 🥳 I can see that only two threads are writing now, and the commit-interval distribution is stable.

<img width="752" alt="Screenshot 2024-10-11 at 18 25 20" src="https://github.com/user-attachments/assets/399d185a-7a8f-44f7-9ab5-648345117f3a">

However, I'm still uncertain about how `Redistribute` behaves when autoscaling is enabled in Dataflow. 🤔 I might need to run a load test. 📈

My concern is that a fixed number of buckets, set with `withNumBuckets`, may make autoscaling inefficient. If autoscaling brings the number of workers above the number of buckets (`numOfBuckets < numWorkers`), many workers could remain idle, leading to underutilization. The pipeline then isn't truly elastic, because it can't scale dynamically with fluctuations in data volume. At the same time, skipping redistribution entirely isn't feasible: as seen in earlier attempts, without it the job becomes non-performant and, in some cases, gets stuck indefinitely.

In contrast, I would expect `Redistribute` to behave more like Iceberg's dynamic repartitioning with Spark, where micro-batches are automatically adjusted based on target file sizes or other heuristics (e.g., the default 512MB). Such dynamic repartitioning lets the system adjust on the fly as data volume grows or shrinks, so it stays efficient.
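
To make the underutilization concern concrete, here is a toy model. It assumes each bucket is handled by at most one worker at a time, which is a simplification and not a description of Dataflow's actual scheduling. The helper names `busy_workers` and `utilization` are hypothetical and not part of Beam.

```python
def busy_workers(num_buckets: int, num_workers: int) -> int:
    """Upper bound on workers that can hold at least one bucket.

    Assumption: a bucket is never split across workers.
    """
    return min(num_buckets, num_workers)


def utilization(num_buckets: int, num_workers: int) -> float:
    """Fraction of workers that can be busy under the assumption above."""
    return busy_workers(num_buckets, num_workers) / num_workers


if __name__ == "__main__":
    num_buckets = 4  # illustrative value, not taken from the issue
    for num_workers in (2, 4, 8, 16):
        print(
            f"workers={num_workers:2d} "
            f"busy<={busy_workers(num_buckets, num_workers)} "
            f"utilization<={utilization(num_buckets, num_workers):.2f}"
        )
```

Under this model, any worker that autoscaling adds beyond `num_buckets` has no bucket to process, so the utilization ceiling falls as the worker count rises.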
