[ https://issues.apache.org/jira/browse/BEAM-12955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Hulette updated BEAM-12955:
---------------------------------
    Description: 

Just as we can infer a Beam Schema from a NamedTuple type ([code|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/schemas.py]), we should support inferring a schema from a [protobuf-generated Python type|https://developers.google.com/protocol-buffers/docs/pythontutorial]. This should integrate well with the rest of the schema infrastructure. For example, it should be possible to use schema-aware transforms like [SqlTransform|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform], [Select|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.transforms.core.html#apache_beam.transforms.core.Select], or [beam.dataframe.convert.to_dataframe|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe] on a PCollection that is annotated with a protobuf type. For example (using the addressbook_pb2 example from the [tutorial|https://developers.google.com/protocol-buffers/docs/pythontutorial#reading-a-message]):

{code:python}
import addressbook_pb2

import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe
from apache_beam.transforms.sql import SqlTransform

pc = (input_pc
      | beam.Map(create_person).with_output_type(addressbook_pb2.Person))

df = to_dataframe(pc)  # deferred dataframe with fields id, name, email, ...

# OR
pc | SqlTransform("SELECT name FROM PCOLLECTION WHERE email = 'f...@bar.com'")
{code}
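One possible shape for this inference, as a minimal sketch: walk the message's Descriptor and build a NamedTuple row type that the existing NamedTuple inference in apache_beam/typehints/schemas.py can already turn into a Beam Schema. The helper name and the partial type mapping below are illustrative assumptions, not existing Beam APIs.

{code:python}
# Minimal sketch only -- proto_descriptor_to_row_type is a hypothetical helper,
# not an existing Beam API, and the type mapping is deliberately partial.
from typing import List, NamedTuple

from google.protobuf.descriptor import FieldDescriptor

# Subset of protobuf scalar types; enums are represented by their int value.
_PRIMITIVE_TYPES = {
    FieldDescriptor.TYPE_STRING: str,
    FieldDescriptor.TYPE_INT32: int,
    FieldDescriptor.TYPE_INT64: int,
    FieldDescriptor.TYPE_BOOL: bool,
    FieldDescriptor.TYPE_FLOAT: float,
    FieldDescriptor.TYPE_DOUBLE: float,
    FieldDescriptor.TYPE_BYTES: bytes,
    FieldDescriptor.TYPE_ENUM: int,
}


def proto_descriptor_to_row_type(descriptor):
  """Builds a NamedTuple type mirroring the fields of a protobuf message."""
  fields = []
  for field in descriptor.fields:
    if field.type == FieldDescriptor.TYPE_MESSAGE:
      # Nested messages (e.g. Person.PhoneNumber) become nested row types.
      field_type = proto_descriptor_to_row_type(field.message_type)
    else:
      field_type = _PRIMITIVE_TYPES[field.type]
    if field.label == FieldDescriptor.LABEL_REPEATED:
      field_type = List[field_type]
    fields.append((field.name, field_type))
  return NamedTuple(descriptor.name, fields)


# e.g.:
#   PersonRow = proto_descriptor_to_row_type(addressbook_pb2.Person.DESCRIPTOR)
#   apache_beam.typehints.schemas.named_tuple_to_schema(PersonRow)
{code}

A real implementation would presumably also need to handle map fields, enums as logical types, and the well-known types rather than the subset above.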

> Add support for inferring Beam Schemas from Python protobuf types
> -----------------------------------------------------------------
>
>                 Key: BEAM-12955
>                 URL: https://issues.apache.org/jira/browse/BEAM-12955
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P2
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)