Hello Pablo, thanks for the explanation.

Best regards,
Zdenko
_______________________
http://www.the-swamp.info
On Thu, Sep 5, 2019 at 8:10 PM Pablo Estrada <pabl...@google.com> wrote:
> Hi Zdenko,
> sorry about the confusion. The reason behind this is that we have not
> jumped to fully change the batch behavior of WriteToBigQuery, so to use
> BigQueryBatchFileLoads as the implementation of WriteToBigQuery, you need
> to pass 'use_beam_bq_sink' as an experiment to activate it.
> As you rightly figured out, you can use BigQueryBatchFileLoads directly.
> Best
> -P.
>
> On Thu, Sep 5, 2019 at 6:06 AM Zdenko Hrcek <zden...@gmail.com> wrote:
>
>> Thanks for the code sample,
>>
>> When I switched to using bigquery_file_loads.BigQueryBatchFileLoads
>> instead of bigquery.WriteToBigQuery, it works fine now. I'm not sure why
>> it doesn't work with WriteToBigQuery, since that uses
>> BigQueryBatchFileLoads under the hood...
>>
>> Thanks for the help.
>> Zdenko
>> _______________________
>> http://www.the-swamp.info
>>
>>
>> On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> +Pablo Estrada <pabl...@google.com> who added this.
>>>
>>> I don't think we have tested this specific option, but I believe the
>>> additional BQ parameters option was added in a generic way to accept
>>> all additional parameters.
>>>
>>> Looking at the code, it seems that additional parameters do get passed
>>> through to load jobs:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>>>
>>> One thing you can try is running a BQ load job directly with the same
>>> set of data and options to see if the data gets loaded.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zden...@gmail.com> wrote:
>>>
>>>> Greetings,
>>>>
>>>> I am using Beam 2.15 and Python 2.7.
>>>> I am running a batch job to load data from CSV files into BigQuery. I
>>>> like the functionality where, instead of streaming to BigQuery, I can
>>>> use a "file load" to load the table all at once.
>>>>
>>>> In my case, there are a few "bad" records in the input (it's geo data,
>>>> and during a manual upload BigQuery doesn't accept those as valid
>>>> geography records). This is easily solved by setting the maximum
>>>> number of bad records. If I understand correctly, WriteToBigQuery
>>>> supports "additional_bq_parameters", but for some reason, when running
>>>> the pipeline on the Dataflow runner, it looks like those settings are
>>>> ignored.
>>>>
>>>> I played with an example from the documentation
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>>> in this gist:
>>>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>>>> The table should be created partitioned on a field and clustered, but
>>>> when running on Dataflow that doesn't happen.
>>>> When I run on DirectRunner, it works as expected. Interestingly, when
>>>> I add the maxBadRecords parameter to additional_bq_parameters,
>>>> DirectRunner complains that it doesn't recognize that option.
>>>>
>>>> This is the first time I'm using this setup/combination, so I'm just
>>>> wondering whether I overlooked something. I would appreciate any help.
>>>>
>>>> Best regards,
>>>> Zdenko
>>>>
>>>>
>>>> _______________________
>>>> http://www.the-swamp.info
>>>>
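For reference, a minimal sketch of the setup discussed in this thread:
WriteToBigQuery with file loads, activated via the 'use_beam_bq_sink'
experiment that Pablo mentions. The project, bucket, table, schema, and
parsing function below are hypothetical, not taken from the thread.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# 'use_beam_bq_sink' is the experiment Pablo refers to; it makes
# WriteToBigQuery use BigQueryBatchFileLoads under the hood on Dataflow.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',              # hypothetical project
    '--temp_location=gs://my-bucket/tmp',  # hypothetical bucket
    '--experiments=use_beam_bq_sink',
])

# Extra load-job configuration passed through to BigQuery; keys follow
# the BigQuery load job config (timePartitioning, clustering,
# maxBadRecords, ...).
additional_bq_parameters = {
    'timePartitioning': {'type': 'DAY', 'field': 'ts'},
    'clustering': {'fields': ['geo_id']},
    'maxBadRecords': 10,
}

def parse_csv_line(line):
    # Hypothetical three-column CSV: timestamp, id, WKT geometry string.
    ts, geo_id, wkt = line.split(',')
    return {'ts': ts, 'geo_id': geo_id, 'wkt': wkt}

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/data.csv')
     | 'Parse' >> beam.Map(parse_csv_line)
     | 'Write' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',
         schema='ts:TIMESTAMP,geo_id:STRING,wkt:STRING',
         method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         additional_bq_parameters=additional_bq_parameters))

Per Pablo's explanation, without the experiment the batch behavior of
WriteToBigQuery is unchanged, which would explain the parameters being
ignored on Dataflow while working on DirectRunner.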
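And the workaround Zdenko ended up with: applying
bigquery_file_loads.BigQueryBatchFileLoads directly, which sidesteps the
experiment flag. A sketch under the same assumptions; the exact
constructor arguments accepted should be checked against the
bigquery_file_loads source linked above.

from apache_beam.io.gcp import bigquery_file_loads

# Drop-in replacement for the 'Write' step in the sketch above.
# Schema given in TableSchema-dict form; dispositions are the usual
# BigQuery string values.
    | 'Write' >> bigquery_file_loads.BigQueryBatchFileLoads(
        destination='my-project:my_dataset.my_table',
        schema={'fields': [
            {'name': 'ts', 'type': 'TIMESTAMP'},
            {'name': 'geo_id', 'type': 'STRING'},
            {'name': 'wkt', 'type': 'STRING'},
        ]},
        create_disposition='CREATE_IF_NEEDED',
        write_disposition='WRITE_APPEND',
        additional_bq_parameters=additional_bq_parameters)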
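Cham's debugging suggestion of running a load job directly, sketched
here with the google-cloud-bigquery client rather than Beam (this code
is not from the thread; names are hypothetical). If the same data and
options misbehave here too, the problem is on the BigQuery side rather
than in the Beam sink.

from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # hypothetical project

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.max_bad_records = 10  # same knob as maxBadRecords above
job_config.time_partitioning = bigquery.TimePartitioning(field='ts')
job_config.clustering_fields = ['geo_id']
job_config.schema = [
    bigquery.SchemaField('ts', 'TIMESTAMP'),
    bigquery.SchemaField('geo_id', 'STRING'),
    bigquery.SchemaField('wkt', 'STRING'),
]

load_job = client.load_table_from_uri(
    'gs://my-bucket/data.csv',
    'my-project.my_dataset.my_table',
    job_config=job_config)
load_job.result()  # blocks until done; raises if the job failed

max_bad_records, time_partitioning, and clustering_fields correspond to
the maxBadRecords, timePartitioning, and clustering keys passed via
additional_bq_parameters.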