Hi,
I'm working on a pipeline that reads data from Pub/Sub and splits it
into a few BigQuery tables. I have a problem with temporary tables:
the pipeline creates them (I have to compute the table id and schema
from the data), but it does not delete all of them. Is there any way
to set a different dataset for the temporary tables, or to batch the
data into a list of dicts and push more rows at once? I tried
windowing + GroupByKey, where the key is a table name and this code
produced a list of dictionaries, but I got this error message (a
simplified sketch of what I tried follows the traceback):
"""
File "fastavro/_write.pyx", line 732, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 469, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 459, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 357, in fastavro._write.write_record
TypeError: Error writing row to Avro: unhashable type: 'dict'
"""
Without the grouping and windowing everything works fine, but the
pipeline creates a lot of temporary tables that make a mess in our
datasets.
These are my BigQuery options:
method="FILE_LOADS"
triggering_frequency=60
max_files_per_bundle=1
temp_file_format="AVRO"
table="" << this is computed
dataset="" << this is computed
project="my_project"
custom_gcs_temp_location="gs://my_gcp_tmp_location/temp/"
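
For reference, this is roughly how those options map onto the write
transform (to_table, compute_schema and rows are simplified
placeholders; in reality the table id and schema are computed from
the data):

import apache_beam as beam

def to_table(row):
    # placeholder: in reality the dataset and table id are derived
    # from fields of the message
    return "my_project:my_dataset.{}".format(row["table_name"])

def compute_schema(destination):
    # placeholder: in reality the schema is computed per destination table
    return {"fields": [
        {"name": "field_a", "type": "STRING", "mode": "NULLABLE"}]}

rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
    table=to_table,                    # computed per element
    schema=compute_schema,             # computed per destination
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=60,
    max_files_per_bundle=1,
    temp_file_format="AVRO",
    custom_gcs_temp_location="gs://my_gcp_tmp_location/temp/",
)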

These are my pipeline options:
runner="DataflowRunner"
region="my_region"
project="my_project"
service_account_email="my_df_SA"
temp_location="gs://my_gcp_location/temp/"
max_num_workers=10
no_use_public_ips=true
use_public_ips=false
network="my_network_url"
subnetwork="my_subnetwork_url"
machine_type="my_machine_type"
label="pubsub-qa-label"
job_name="job-name-pubsub-qa"
save_main_session=false
streaming=true
setup_file="./setup.py"
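
In the launcher script I build them roughly like this (a sketch; I
assume the keyword arguments map one-to-one onto the flags above):

from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(
    runner="DataflowRunner",
    region="my_region",
    project="my_project",
    service_account_email="my_df_SA",
    temp_location="gs://my_gcp_location/temp/",
    max_num_workers=10,
    use_public_ips=False,
    network="my_network_url",
    subnetwork="my_subnetwork_url",
    machine_type="my_machine_type",
    labels=["pubsub-qa-label"],    # the --label flag appends here
    job_name="job-name-pubsub-qa",
    save_main_session=False,
    streaming=True,
    setup_file="./setup.py",
)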

Thank you in advance for your reply and best regards,
Paweł Jarkowski
