[Question] Infer schema from a Pcollection of Python dicts

Nivaldo Tokuda Fri, 15 Apr 2022 08:39:32 -0700

Hi,

I have a pipeline with a PCollection of dicts in Python, and I'd like to apply 
a schema to it for use with SqlTransform.


The schema is defined as follows:

import typing

import apache_beam as beam


class RowSchema(typing.NamedTuple):
    colA: str
    colB: typing.Optional[str]


beam.coders.registry.register_coder(RowSchema, beam.coders.RowCoder)


The code that ingests the PCollection of dicts and attempts to apply the schema 
is:

pcol = (p
    | 'read from BQ' >> beam.io.ReadFromBigQuery(
        gcs_location="gs://example_location",
        query=query,  # Reads only the columns defined in the schema
        use_standard_sql=True)
    | 'ToRow' >> beam.Map(
        lambda x: RowSchema(**x)).with_output_types(RowSchema)
    # | SqlTransform(...)
)


However, it results in the following error:

File "/home/lib/python3.9/site-packages/apache_beam/coders/coders.py", line 423, in encode
    return value.encode('utf-8')
AttributeError: 'int' object has no attribute 'encode' [while running 'ToRow']
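
A minimal sketch of how this kind of error can arise, written in plain Python without Beam. It assumes, without confirmation from the traceback, that one of the BigQuery columns arrives as an integer while the schema declares it as str. The record below and the helper name mismatched_fields are hypothetical and exist only for illustration.

```python
import typing


class RowSchema(typing.NamedTuple):
    colA: str
    colB: typing.Optional[str]


def mismatched_fields(record: dict) -> dict:
    """Return {field: type name} for values that are neither str nor None."""
    return {
        name: type(value).__name__
        for name, value in record.items()
        if value is not None and not isinstance(value, str)
    }


# Hypothetical record: colA holds an int even though the schema says str.
record = {'colA': 1, 'colB': None}

# typing.NamedTuple does not validate annotations, so this succeeds.
row = RowSchema(**record)

# The failure only shows up once something treats the value as a str,
# which is what a UTF-8 string encoder does.
try:
    row.colA.encode('utf-8')
except AttributeError as exc:
    message = str(exc)

print(message)
print(mismatched_fields(record))
```

Running a check like mismatched_fields over a few records read from BigQuery would show which columns, if any, do not match the declared types.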


I've tested that with a PCollection created in memory with beam.Create and 
mapped to RowSchema, such as the following, the code does in fact work:

pcol = (p
    | "Create" >> beam.Create(
        [{'colA': 'a1', 'colB': 'b1'}, {'colA': 'a2', 'colB': None}])
    | 'ToRow' >> beam.Map(lambda x: RowSchema(**x)).with_output_types(RowSchema)
    # | SqlTransform(...)
)


What can I do to apply the schema and enable SqlTransform on a PCollection of 
dicts?
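
One possible direction, sketched under the assumption that the failure comes from non-string values reaching fields declared as str: cast the values explicitly before building the NamedTuple. The helper name to_row_schema is hypothetical, and the sketch only covers schemas whose fields are all str or Optional[str], as in RowSchema above. It has not been tested against BigQuery.

```python
import typing


class RowSchema(typing.NamedTuple):
    colA: str
    colB: typing.Optional[str]


def to_row_schema(record: dict) -> RowSchema:
    """Build a RowSchema from a dict, casting non-None values to str.

    Keys missing from the dict become None; keys not in the schema
    are ignored. None is passed through unchanged.
    """
    return RowSchema(**{
        name: None if record.get(name) is None else str(record[name])
        for name in RowSchema._fields
    })


# Hypothetical record with an int where the schema expects a str.
print(to_row_schema({'colA': 1, 'colB': None}))
```

In the pipeline, beam.Map(to_row_schema).with_output_types(RowSchema) would take the place of the lambda in the 'ToRow' step.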

The structure I tried to use is based on the following references:

  *   https://beam.apache.org/documentation/programming-guide/#inferring-schemas
  *   https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.sql.html

I've also checked the io.gcp.bigquery reference 
(https://beam.apache.org/releases/pydoc/current/apache_beam.io.gcp.bigquery.html). 
It has a schema implementation, but only for writes to BigQuery, so I could not 
avoid having a PCollection of dicts as the input.
I also found this example 
(https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/sql_taxi.py) 
that uses a dynamic schema, which as far as I understand would not be a valid 
approach for my use case.

Any help with this issue would be greatly appreciated. Thanks!