Hi everyone

I'm building a pipeline where I group the elements and then execute a
CPU-intensive function on each group. This function performs a statistical
analysis over the elements and returns a single value at the end.
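To make the pattern concrete, here is a minimal plain-Python sketch of that shape (the real pipeline presumably uses Beam's GroupByKey followed by a DoFn; `analyze` is a hypothetical stand-in for the actual statistical analysis):

```python
from collections import defaultdict
from statistics import median

def analyze(values):
    # Hypothetical stand-in for the CPU-intensive statistical analysis;
    # its cost grows with the number of elements in the group.
    return median(values)

def group_and_analyze(records):
    # Rough equivalent of GroupByKey followed by a per-group DoFn:
    # collect all values per key, then reduce each group to one value.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: analyze(values) for key, values in groups.items()}

result = group_and_analyze([("a", 1), ("a", 3), ("b", 2)])
# result == {"a": 2, "b": 2}
```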

But because each group has a different number of elements, some groups are
processed very quickly, while others can take up to 30 minutes. The problem
is that the pipeline processes 99% of the groups in a couple of minutes,
but then spends another 2 hours processing the big groups. The image below
illustrates what I mean:
[image: image.png]

Even worse, if I use, for example, 20 Dataflow instances with 32 cores
each, and the big groups each end up on a different machine, I'm going to
pay for all those instances while the job isn't done.

I know that one optimization would be to split the work into
equally-sized groups, but I'm not sure that's possible in this case, given
the calculation I'm performing.
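For what it's worth, the usual workaround for hot keys is "key salting": fan each key out across N sub-keys, compute a partial aggregate per sub-key, then re-group by the original key and merge the partials. This only works when the statistic decomposes into mergeable partials (a mean via (sum, count), for instance). A plain-Python sketch under that assumption (`SHARDS`, `salt`, `partial_stats`, and `merge` are all hypothetical names):

```python
import random
from collections import defaultdict

SHARDS = 4  # hypothetical fan-out per hot key

def salt(key):
    # Spread one hot key across SHARDS random sub-keys.
    return (key, random.randrange(SHARDS))

def partial_stats(values):
    # Decomposable partial aggregate: a mean carried as (sum, count).
    return (sum(values), len(values))

def merge(partials):
    # Combine the per-shard partials back into the final statistic.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Stage 1: group by the salted key and compute partial aggregates.
records = [("hot", v) for v in range(1000)] + [("cold", 5)]
shards = defaultdict(list)
for key, value in records:
    shards[salt(key)].append(value)
partials = defaultdict(list)
for (key, _), values in shards.items():
    partials[key].append(partial_stats(values))

# Stage 2: re-group by the original key and merge the partials.
result = {key: merge(ps) for key, ps in partials.items()}
# result["hot"] == 499.5, result["cold"] == 5.0
```

If the analysis can't be decomposed this way, this approach doesn't apply, which may well be your situation.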

So I was wondering: is there any way I can "tell" the runner how long I
expect the DoFn to run, so that it can do a better job of scheduling
those elements?

Thanks!

-- 
*André Badawi Missaglia*
Data Engineer
(16) 3509-5515 *|* www.arquivei.com.br
