Hi all

I have a Beam pipeline running on Cloud Dataflow that writes Avro files
to GCS. The window duration is 1 minute, and the job currently runs on
64 cores (16 * n1-standard-4). It produces around 2 GB of data per minute.

Is there any recommendation on the number of Avro files (shards) to specify?
Currently I'm using 64, to match the number of cores. Would a much higher
number help increase the write throughput?
I saw that BigQueryIO with FILE_LOADS uses a default of 1000 files.
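
For context, the write step looks roughly like the sketch below; the only
part I'm tuning is withNumShards. MyRecord, the GenerateSequence stand-in
source, and the bucket path are placeholders I've filled in for illustration,
not the actual pipeline code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class AvroShardSketch {

  // Placeholder record type; the real job uses its own Avro schema.
  @DefaultCoder(AvroCoder.class)
  public static class MyRecord {
    long id;
    public MyRecord() {}
    public MyRecord(long id) { this.id = id; }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("StandInSource",   // stand-in for the real unbounded source
            GenerateSequence.from(0).withRate(1000, Duration.standardSeconds(1)))
     .apply("ToRecord",
            MapElements.into(TypeDescriptor.of(MyRecord.class)).via(MyRecord::new))
     .apply("Window1m",        // 1-minute fixed windows, as in the job
            Window.<MyRecord>into(FixedWindows.of(Duration.standardMinutes(1))))
     .apply("WriteAvro",
            AvroIO.write(MyRecord.class)
                  .to("gs://<my-bucket>/avro/out")  // placeholder output prefix
                  .withWindowedWrites()
                  .withNumShards(64)                // currently 64 = number of worker cores
                  .withSuffix(".avro"));

    p.run();
  }
}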

I tried a few different values, but couldn't discern a pattern for which
setting performs better.

Any suggestions are hugely appreciated.

Best
Ziyad
