Greetings,

I am using Beam 2.15 and Python 2.7.
I am running a batch job that loads data from CSV files and uploads it to
BigQuery. I like the functionality that, instead of streaming to BigQuery, I
can use "file loads" to load the table all at once.

In my case, there are a few "bad" records in the input (it's geo data, and
during a manual upload BigQuery doesn't accept those rows as valid geography
values); this is easily solved by setting the maximum number of bad records.
If I understand correctly, WriteToBigQuery supports
"additional_bq_parameters", but for some reason, when the pipeline runs on
the Dataflow runner, those settings seem to be ignored.
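To be concrete, my write step boils down to something like the sketch below
(simplified, not my exact code; the table, schema, input path, parse function,
and the maxBadRecords value are placeholders):

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery


def parse_csv_line(line):
    # Placeholder parser; the real pipeline maps CSV columns to the table schema.
    name, area = line.split(',', 1)
    return {'name': name, 'area': area}


additional_bq_parameters = {
    'maxBadRecords': 10,  # placeholder value; the load-job setting I want to pass through
}

with beam.Pipeline() as p:
    lines = p | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.csv')  # placeholder path
    rows = lines | 'Parse' >> beam.Map(parse_csv_line)
    rows | 'Write' >> WriteToBigQuery(
        table='my-project:my_dataset.my_table',   # placeholder table
        schema='name:STRING,area:GEOGRAPHY',      # placeholder schema
        method=WriteToBigQuery.Method.FILE_LOADS,
        additional_bq_parameters=additional_bq_parameters,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)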

I also played with the example from the documentation
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
in this gist:
https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
The table should be created partitioned on a field and clustered, but when I
run it on Dataflow that doesn't happen. When I run it on DirectRunner it
works as expected. Interestingly, when I add the maxBadRecords parameter to
additional_bq_parameters, DirectRunner complains that it doesn't recognize
that option.
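For reference, the additional_bq_parameters in that example look roughly like
this (the partition and clustering field names below are placeholders for the
ones in the gist), passed to WriteToBigQuery the same way as in the sketch
above:

additional_bq_parameters = {
    'timePartitioning': {'type': 'DAY', 'field': 'created_at'},  # placeholder partition field
    'clustering': {'fields': ['country']},                       # placeholder clustering field
}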

This is the first time I am using this setup/combination, so I'm just
wondering if I overlooked something. I would appreciate any help.

Best regards,
Zdenko


_______________________
 http://www.the-swamp.info
