Greetings, I am using Beam 2.15 with Python 2.7. I am running a batch job that loads data from CSV files and uploads it to BigQuery. I like that, instead of streaming to BigQuery, I can use "file loads" to load the table all at once.
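For context, the write step in my pipeline looks roughly like this (the table, schema, and field names are simplified placeholders, and parse_csv_row stands in for my actual CSV parsing):

import apache_beam as beam

# Extra load-job settings passed to WriteToBigQuery. The partitioning and
# clustering entries follow the example from the Beam documentation; the
# maxBadRecords line is the option I would like to add for the bad geo rows.
additional_bq_parameters = {
    'timePartitioning': {'type': 'DAY', 'field': 'event_date'},
    'clustering': {'fields': ['region']},
    # 'maxBadRecords': 10,
}


def parse_csv_row(line):
    # Placeholder parser: turn one CSV line into a dict matching the schema.
    event_date, region, geo = line.split(',')
    return {'event_date': event_date, 'region': region, 'geo': geo}


with beam.Pipeline() as p:
    (p
     | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/input.csv')
     | 'ParseRows' >> beam.Map(parse_csv_row)
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',
         schema='event_date:DATE,region:STRING,geo:GEOGRAPHY',
         method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         additional_bq_parameters=additional_bq_parameters))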
In my case, there are a few "bad" records in the input (it is geo data, and during a manual upload BigQuery does not accept those rows as valid geography values; this is easily solved by setting the maximum number of bad records). If I understand correctly, WriteToBigQuery supports "additional_bq_parameters" (I pass them roughly as in the sketch above), but for some reason, when the pipeline runs on the Dataflow runner, those settings seem to be ignored.

I played with the example from the documentation (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py), adapted in this gist: https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df. The table should be created partitioned on a field and clustered, but when I run it on Dataflow that does not happen. When I run it on DirectRunner it works as expected. Interestingly, when I add the maxBadRecords parameter to additional_bq_parameters, DirectRunner complains that it does not recognize that option.

This is the first time I am using this setup/combination, so I am wondering whether I have overlooked something. I would appreciate any help.

Best regards,
Zdenko

_______________________
http://www.the-swamp.info