Hi everyone, I'm building a pipeline where I group the elements and then run a CPU-intensive function on each group. This function performs a statistical analysis over the group's elements and returns a single value at the end.
But because each group has a different number of elements, some groups are processed very quickly, while others can take up to 30 minutes. The problem is that the pipeline processes 99% of the groups in a couple of minutes, then spends another 2 hours on the big groups. The image below illustrates what I mean:

[image: image.png]

Even worse, if I use, for example, 20 Dataflow instances with 32 cores each, and the big groups each end up on a different machine, I'll be paying for all of those instances while the job isn't done.

I know one optimization would be to split the groups into equally-sized groups, but I'm not sure that's possible given the calculation I'm performing. So I was wondering: is there any way I can "tell" the runner how long I expect the DoFn to run, so that it can do a better job of scheduling those elements?

Thanks!

--
*André Badawi Missaglia*
Data Engineer
(16) 3509-5515 | www.arquivei.com.br
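P.S. To make the "split into equally-sized groups" idea concrete, here is a rough plain-Python sketch (not actual Beam code) of what I mean by a two-stage split. It assumes the statistic is mergeable, which may not hold for my real calculation; the mean below is just a stand-in, and `chunked`, `partial_stats`, and `merge_stats` are hypothetical names:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield fixed-size chunks so no single worker gets one huge group."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def partial_stats(chunk):
    """Stage 1: bounded work per shard -- here, a (sum, count) pair."""
    return (sum(chunk), len(chunk))

def merge_stats(partials):
    """Stage 2: combine the shard results into the final value (a mean)."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

values = list(range(1, 101))  # one "big group" of 100 elements
partials = [partial_stats(c) for c in chunked(values, 10)]
print(merge_stats(partials))  # prints 50.5, same as sum(values) / len(values)
```

In Beam terms, stage 1 would run per (key, shard_id) so the expensive work is spread across workers, and stage 2 would re-group by key and do the cheap merge. My question stands for the case where the statistic can't be decomposed this way.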
