Hi everyone,

Load jobs into BigQuery are subject to various quotas and limitations. In the Python SDK, the BigQuery sink that uses file loads does not currently handle these quotas and limitations well. Improvements are needed in the following areas:
1. Handle the per-load-job limitation on total size.
2. Decide at pipeline execution time when to use temp tables so that data is loaded atomically.

I have documented the proposed changes in a design doc [1] as well as a draft pull request [2].

*TL;DR:* Partition the written files based on:
1. Total size of the files
2. Number of files

If multiple load jobs are needed to write to a single destination, the data will first be loaded into temp tables. Once all of the data has been loaded into these temp tables, it will be copied to the destination table, ensuring that data is loaded into BigQuery atomically.

Would love to get feedback on the proposed changes.

Regards,
- Tanay

[1] https://s.apache.org/beam-bqfl-hardening
[2] https://github.com/apache/beam/pull/9242
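For illustration, the partitioning step could be sketched roughly as below. This is a hypothetical standalone sketch, not the actual code in the PR; the limit values and the `partition_files` helper are assumptions for the example.

```python
def partition_files(files, max_files=10_000, max_bytes=15 << 40):
    """Group files so no partition exceeds the per-load-job limits.

    files: iterable of (path, size_in_bytes) pairs.
    max_files / max_bytes: assumed per-load-job limits, not Beam's
    actual configured values.
    Returns a list of partitions, each a list of file paths.
    """
    partitions = [[]]
    current_size = 0
    for path, size in files:
        current = partitions[-1]
        # Start a new partition when adding this file would exceed
        # either the file-count or the total-size limit.
        if current and (len(current) >= max_files
                        or current_size + size > max_bytes):
            partitions.append([])
            current_size = 0
            current = partitions[-1]
        current.append(path)
        current_size += size
    return partitions

# If more than one partition targets the same destination, each partition
# would be loaded into its own temp table, and the temp tables would then
# be copied into the final destination table in a single atomic step.
```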