Hello Pablo,
thanks for the explanation.

Best regards,
Zdenko
_______________________
 http://www.the-swamp.info



On Thu, Sep 5, 2019 at 8:10 PM Pablo Estrada <pabl...@google.com> wrote:

> Hi Zdenko,
> sorry about the confusion. The reason behind this is that we have not yet
> fully changed the batch behavior of WriteToBigQuery, so to use
> BigQueryBatchFileLoads as the implementation of WriteToBigQuery, you need
> to pass 'use_beam_bq_sink' as an experiment to activate it.
> As you rightly figured out, you can use BigQueryBatchFileLoads directly.
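>
> For reference, a minimal sketch of how the experiment can be passed (the
> runner, table spec and schema below are placeholders; your WriteToBigQuery
> call itself stays the same):
>
>   import apache_beam as beam
>   from apache_beam.options.pipeline_options import PipelineOptions
>
>   # 'use_beam_bq_sink' makes WriteToBigQuery use the new sink
>   # (BigQueryBatchFileLoads for batch pipelines).
>   options = PipelineOptions([
>       '--runner=DataflowRunner',
>       '--experiments=use_beam_bq_sink',
>       # ... plus the usual --project / --temp_location options
>   ])
>
>   with beam.Pipeline(options=options) as p:
>       _ = (p
>            | beam.Create([{'id': 1}])
>            | beam.io.WriteToBigQuery(
>                'my-project:my_dataset.my_table',  # placeholder table spec
>                schema='id:INTEGER'))
>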
> Best
> -P.
>
> On Thu, Sep 5, 2019 at 6:06 AM Zdenko Hrcek <zden...@gmail.com> wrote:
>
>> Thanks for the code sample.
>>
>> When I switched to bigquery_file_loads.BigQueryBatchFileLoads instead
>> of bigquery.WriteToBigQuery, it works OK now. I'm not sure why it doesn't
>> work with WriteToBigQuery, since that uses BigQueryBatchFileLoads under
>> the hood...
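>>
>> For the record, roughly what the switch looks like on my side (names are
>> placeholders, and the exact keyword arguments are worth double-checking
>> against the 2.15 source of bigquery_file_loads.py):
>>
>>   import apache_beam as beam
>>   from apache_beam.io.gcp import bigquery_file_loads
>>
>>   # schema in dict form, as WriteToBigQuery would pass it down internally
>>   table_schema = {'fields': [
>>       {'name': 'id', 'type': 'INTEGER', 'mode': 'NULLABLE'},
>>       {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
>>   ]}
>>
>>   with beam.Pipeline() as p:
>>       _ = (p
>>            | beam.Create([{'id': 1, 'name': 'a'}])
>>            | bigquery_file_loads.BigQueryBatchFileLoads(
>>                destination='my-project:my_dataset.my_table',
>>                schema=table_schema,
>>                custom_gcs_temp_location='gs://my-bucket/tmp',
>>                additional_bq_parameters={'maxBadRecords': 10}))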
>>
>> Thanks for the help.
>> Zdenko
>> _______________________
>>  http://www.the-swamp.info
>>
>>
>>
>> On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> +Pablo Estrada <pabl...@google.com> who added this.
>>>
>>> I don't think we have tested this specific option, but I believe the
>>> additional BQ parameters option was added in a generic way to accept any
>>> additional parameters.
>>>
>>> Looking at the code, it seems like additional parameters do get passed
>>> through to load jobs:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>>>
>>> One thing you can try is running a BQ load job directly with
>>> the same set of data and options to see if the data gets loaded.
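>>>
>>> For example, something along these lines with the google-cloud-bigquery
>>> client (not Beam; the bucket, table and schema below are placeholders):
>>>
>>>   from google.cloud import bigquery
>>>
>>>   client = bigquery.Client(project='my-project')
>>>   job_config = bigquery.LoadJobConfig()
>>>   job_config.source_format = bigquery.SourceFormat.CSV
>>>   job_config.skip_leading_rows = 1
>>>   job_config.max_bad_records = 10  # the option in question
>>>   job_config.schema = [
>>>       bigquery.SchemaField('id', 'INTEGER'),
>>>       bigquery.SchemaField('geo', 'GEOGRAPHY'),
>>>   ]
>>>
>>>   load_job = client.load_table_from_uri(
>>>       'gs://my-bucket/data.csv',
>>>       bigquery.TableReference.from_string('my-project.my_dataset.my_table'),
>>>       job_config=job_config)
>>>   load_job.result()  # waits for the job and raises if it failed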
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zden...@gmail.com> wrote:
>>>
>>>> Greetings,
>>>>
>>>> I am using Beam 2.15 and Python 2.7.
>>>> I am doing a batch job to load data from CSV files and upload it to
>>>> BigQuery. I like the functionality that, instead of streaming to
>>>> BigQuery, I can use a "file load" to load the table all at once.
>>>>
>>>> In my case, there are a few "bad" records in the input (it's geo data,
>>>> and during a manual upload BigQuery doesn't accept those as valid
>>>> geography records); this is easily solved by setting the maximum number
>>>> of bad records. If I understand correctly, WriteToBigQuery supports
>>>> "additional_bq_parameters", but for some reason, when running the
>>>> pipeline on the Dataflow runner, it looks like those settings are ignored.
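>>>>
>>>> Roughly, the write step I have looks like this (table and schema here are
>>>> placeholders, not my actual pipeline):
>>>>
>>>>   import apache_beam as beam
>>>>
>>>>   # maxBadRecords should end up on the BigQuery load job configuration
>>>>   write = beam.io.WriteToBigQuery(
>>>>       'my-project:my_dataset.my_table',
>>>>       schema='id:INTEGER,geo:GEOGRAPHY',
>>>>       additional_bq_parameters={'maxBadRecords': 10},
>>>>       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
>>>>       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)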
>>>>
>>>> I played with an example from the documentation
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>>> in this gist:
>>>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>>>> where the table should be created partitioned on a field and clustered,
>>>> but when running on Dataflow that doesn't happen.
>>>> When I run on DirectRunner it works as expected. Interestingly, when I
>>>> add the maxBadRecords parameter to additional_bq_parameters, DirectRunner
>>>> complains that it doesn't recognize that option.
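>>>>
>>>> The additional_bq_parameters in the gist are essentially of this shape
>>>> (the field names here are illustrative, not the exact ones from my code):
>>>>
>>>>   additional_bq_parameters = {
>>>>       'timePartitioning': {'type': 'DAY', 'field': 'created_at'},
>>>>       'clustering': {'fields': ['country']},
>>>>   }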
>>>>
>>>> This is the first time I'm using this setup/combination, so I'm just
>>>> wondering if I overlooked something. I would appreciate any help.
>>>>
>>>> Best regards,
>>>> Zdenko
>>>>
>>>>
>>>> _______________________
>>>>  http://www.the-swamp.info
>>>>
>>>>
