Hi Zdenko,

Sorry about the confusion. The reason is that we have not yet fully switched the batch behavior of WriteToBigQuery, so to have WriteToBigQuery use BigQueryBatchFileLoads as its implementation you need to pass 'use_beam_bq_sink' as an experiment to activate it. As you rightly figured out, you can also use BigQueryBatchFileLoads directly; a rough sketch of both routes is below.

Best
-P.
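A minimal sketch, assuming placeholder project, dataset, table, and bucket names and a trivial in-memory input (nothing here comes from your actual pipeline), of both routes: WriteToBigQuery with the 'use_beam_bq_sink' experiment, and BigQueryBatchFileLoads used directly:

```python
import apache_beam as beam
from apache_beam.io.gcp import bigquery_file_loads
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder names: my-project, my_dataset.my_table_*, gs://my-bucket/tmp.
options = PipelineOptions(
    ['--experiments=use_beam_bq_sink'],  # switches batch WriteToBigQuery to file loads
    project='my-project',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
    rows = p | 'Create' >> beam.Create([{'name': 'a', 'value': 1}])

    # Route 1: WriteToBigQuery, relying on the experiment enabled above.
    rows | 'ViaWriteToBigQuery' >> beam.io.WriteToBigQuery(
        'my-project:my_dataset.my_table_a',
        schema='name:STRING,value:INTEGER',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

    # Route 2: the file-loads transform used directly. The schema is given in
    # the dict form that WriteToBigQuery passes down internally.
    rows | 'ViaFileLoads' >> bigquery_file_loads.BigQueryBatchFileLoads(
        destination='my-project:my_dataset.my_table_b',
        schema={'fields': [
            {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name': 'value', 'type': 'INTEGER', 'mode': 'NULLABLE'}]},
        custom_gcs_temp_location='gs://my-bucket/tmp')
```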
On Thu, Sep 5, 2019 at 6:06 AM Zdenko Hrcek <zden...@gmail.com> wrote:

> Thanks for the code sample.
>
> When I switched to using bigquery_file_loads.BigQueryBatchFileLoads instead
> of bigquery.WriteToBigQuery, it works OK now. I'm not sure why it doesn't
> work with WriteToBigQuery, since it's using BigQueryBatchFileLoads under
> the hood...
>
> Thanks for the help.
> Zdenko
>
> _______________________
> http://www.the-swamp.info
>
>
> On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
>> +Pablo Estrada <pabl...@google.com> who added this.
>>
>> I don't think we have tested this specific option, but I believe the
>> additional BQ parameters option was added in a generic way to accept all
>> additional parameters.
>>
>> Looking at the code, it seems like additional parameters do get passed
>> through to load jobs:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>>
>> One thing you can try is running a BQ load job directly with the same set
>> of data and options to see if the data gets loaded.
>>
>> Thanks,
>> Cham
>>
>> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zden...@gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> I am using Beam 2.15 and Python 2.7.
>>> I am running a batch job to load data from CSV and upload it to BigQuery.
>>> I like that instead of streaming to BigQuery I can use "file load" to
>>> load the table all at once.
>>>
>>> In my case, there are a few "bad" records in the input (it's geo data,
>>> and during a manual upload BigQuery doesn't accept those as valid
>>> geography records). This is easily solved by setting the maximum number
>>> of bad records.
>>> If I understand correctly, WriteToBigQuery supports
>>> "additional_bq_parameters", but for some reason, when running the
>>> pipeline on the Dataflow runner, those settings seem to be ignored.
>>>
>>> I played with an example from the documentation
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>> in this gist
>>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>>> where the table should be created partitioned on a field and clustered,
>>> but when running on Dataflow that doesn't happen.
>>> When I run on the DirectRunner it works as expected. Interestingly, when
>>> I add the maxBadRecords parameter to additional_bq_parameters, the
>>> DirectRunner complains that it doesn't recognize that option.
>>>
>>> This is the first time I'm using this setup/combination, so I'm just
>>> wondering if I overlooked something. I would appreciate any help.
>>>
>>> Best regards,
>>> Zdenko
>>>
>>>
>>> _______________________
>>> http://www.the-swamp.info
>>>
>>>
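For completeness, a minimal sketch of the additional_bq_parameters usage discussed above, with placeholder project, table, bucket, and field names; the load-job keys (timePartitioning, clustering, maxBadRecords) are the ones mentioned in the thread and the linked gist, and the maxBadRecords note reflects the DirectRunner behavior reported there rather than anything verified here:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Load-job configuration keys from the thread/gist; field and table names are
# placeholders.
additional_bq_parameters = {
    'timePartitioning': {'type': 'DAY', 'field': 'created_at'},
    'clustering': {'fields': ['country']},
    # Reported in the thread as rejected by the DirectRunner on 2.15.
    'maxBadRecords': 10,
}

options = PipelineOptions(
    ['--experiments=use_beam_bq_sink'],  # needed so batch WriteToBigQuery uses file loads
    project='my-project',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(
         [{'country': 'SI', 'created_at': '2019-09-01 12:00:00'}])
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',
         schema='country:STRING,created_at:TIMESTAMP',
         additional_bq_parameters=additional_bq_parameters,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```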