> wrote a generic BigQuery reader or writer

I think I have seen an example here:
https://github.com/the-dagger/dataflow-dynamic-schema/blob/28b7d075c18d6364a67129e56652f452da67a2f6/src/main/java/com/google/cloud/pso/bigquery/BigQuerySchemaMutator.java#L38

This is in Java, but you can try to adapt it for the Python SDK. I don't know
whether that is possible; I use the Java SDK myself for all stream-processing
apps, including Beam apps.
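If it helps, below is a very rough Python sketch of the same idea: patching the
destination table's schema when a row contains fields the table does not have
yet. It assumes the google-cloud-bigquery client library is available on the
workers, and it simply adds unknown columns as NULLABLE STRING, which is only a
placeholder for whatever schema-evolution rules you actually need. It is not a
direct port of the linked Java class.

import apache_beam as beam
from google.cloud import bigquery


class AddMissingColumns(beam.DoFn):
    """Adds a column to the destination table for every unknown key in a row."""

    def __init__(self, table_id):
        # table_id is a placeholder, e.g. "my-project.my_dataset.my_table".
        self.table_id = table_id

    def setup(self):
        self.client = bigquery.Client()

    def process(self, row):
        # Fetching the table per element is simple but slow; a real pipeline
        # would cache the known schema and only call the API on a miss.
        table = self.client.get_table(self.table_id)
        existing = {field.name for field in table.schema}
        missing = [key for key in row if key not in existing]
        if missing:
            # New columns default to NULLABLE STRING here; apply your own
            # business rules for types and modes instead.
            table.schema = list(table.schema) + [
                bigquery.SchemaField(name, "STRING", mode="NULLABLE")
                for name in missing
            ]
            self.client.update_table(table, ["schema"])
        yield row

You would run it as a ParDo in front of the write step, e.g.
rows | beam.ParDo(AddMissingColumns("my-project.my_dataset.my_table"))
     | beam.io.WriteToBigQuery(...).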
Best Regards,
Pavel Solomin
Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
<https://www.linkedin.com/in/pavelsolomin>

On Mon, 9 Aug 2021 at 21:55, Luke Cwik <lc...@google.com> wrote:

> The issue is that the encoding that is passed between transforms needs to
> store the metadata of what was in each column when the data is read, as it
> is passed around in the pipeline. Imagine that column X was a string, was
> then deleted, and then re-added as a datetime. These kinds of schema
> evolutions typically have business-specific rules about what to do.
>
> I believe there was a user who wrote a custom coder that encoded this
> extra information with each row and wrote a generic BigQuery reader or
> writer (don't remember which) that could do something like what you want,
> with limitations around schema evolution and at the performance cost of
> passing the metadata around, but I don't believe this was contributed back
> to the community.
>
> Try searching through the dev [1] / user [2] e-mail archives.
>
> 1: https://lists.apache.org/list.html?d...@beam.apache.org:lte=99M
> 2: https://lists.apache.org/list.html?user@beam.apache.org:lte=99M
>
> On Sun, Aug 1, 2021 at 12:06 PM Rajnil Guha <rajnil94.g...@gmail.com>
> wrote:
>
>> Hi Beam Users,
>>
>> Our pipeline reads Avro files from GCS into Dataflow and writes them
>> into BigQuery tables. I am using the WriteToBigQuery transform to write
>> my PCollection contents into BigQuery.
>> My Avro files contain about 150-200 fields. We have tested our pipeline
>> by providing the field information for all the fields in the TableSchema
>> object within the pipeline code, so every time the schema changes or
>> evolves we need to change our pipeline code.
>> I was wondering if there is any way to provide the BigQuery table schema
>> information outside the pipeline code and read it into the pipeline from
>> there, as it's much easier to maintain that way.
>>
>> Note: We are using the Python SDK to write our pipelines and running on
>> Dataflow.
>>
>> Thanks & Regards
>> Rajnil Guha
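Coming back to the original question: if the goal is only to keep the schema
outside the pipeline code (rather than evolve it at runtime), one minimal
Python SDK sketch is to keep the schema in a JSON file with a
{"fields": [...]} definition and parse it at launch time. The file name,
bucket, and table names below are placeholders.

import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json

# The JSON file holds the BigQuery schema, e.g.
# {"fields": [{"name": "id", "type": "STRING", "mode": "NULLABLE"}, ...]}
with open("table_schema.json") as f:
    table_schema = parse_table_schema_from_json(f.read())

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/input/*.avro")
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema=table_schema,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )

The same JSON file could also live in GCS and be fetched at
pipeline-construction time, so a schema change only means editing that file,
not the pipeline code.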