By proto, I meant using the messages in beam/model/pipeline/src/proto/schema.proto to define a schema. You can then use the classes in SchemaTranslation to convert that to a schema.
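To make the proto route concrete (an illustrative sketch, not from the thread): a schema defined with the SchemaApi messages can be written in protobuf text format, parsed with TextFormat, and handed to SchemaTranslation.schemaFromProto. The message and field names below (fields, name, type, atomic_type, array_type, element_type, nullable) follow my reading of schema.proto — verify them against the copy shipped with your Beam version:

```proto
# A Beam schema in protobuf text format (SchemaApi.Schema):
# an INT64 "id" plus a nullable array-of-STRING "tags".
fields {
  name: "id"
  type { atomic_type: INT64 }
}
fields {
  name: "tags"
  type {
    nullable: true
    array_type {
      element_type { atomic_type: STRING }
    }
  }
}
```

Note that nullable lives on FieldType, so this representation can express a nullable array directly, which is the capability BigQueryUtils lacks.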
On Tue, Jun 22, 2021 at 8:06 PM Matthew Ouyang <[email protected]> wrote:

> I am currently using BigQueryUtils to convert a BigQuery TableSchema to a
> Beam Schema, but I am looking to switch off that approach because I need
> nullable arrays (BigQueryUtils always makes arrays not nullable) and the
> ability to add my own logical types (one of my fields was unstructured
> JSON).
>
> I'm open to using proto or Avro, since I would like to avoid the
> worst-case scenario of building my own. However, it doesn't look like
> either has support for adding logical types, and proto appears to be
> missing support for the Beam Row type.
>
> On Fri, Jun 18, 2021 at 1:56 PM Brian Hulette <[email protected]> wrote:
>
>> Are the files in some special format that you need to parse and
>> understand? Or could you opt to store the schemas as proto descriptors
>> or Avro avsc?
>>
>> On Fri, Jun 18, 2021 at 10:40 AM Matthew Ouyang <[email protected]> wrote:
>>
>>> Hello Brian. Thank you for the clarification request. I meant the
>>> first case: I have files that define field names and types.
>>>
>>> On Fri, Jun 18, 2021 at 12:12 PM Brian Hulette <[email protected]> wrote:
>>>
>>>> Could you clarify what you mean? I could interpret this two different
>>>> ways:
>>>> 1) Have a separate file that defines the literal schema (field names
>>>> and types).
>>>> 2) Infer a schema from data stored in some file in a structured
>>>> format (e.g. CSV or Parquet).
>>>>
>>>> For (1), Reuven's suggestion would work. You could also use an Avro
>>>> avsc file here, which we also support.
>>>> For (2), we don't have anything like this in the Java SDK. In the
>>>> Python SDK the DataFrame API can do this, though. When you use one of
>>>> the pandas sources with the Beam DataFrame API [1], we peek at the
>>>> file and infer the schema so you don't need to specify it. You'd just
>>>> need to use to_pcollection [2] to convert the dataframe to a
>>>> schema-aware PCollection.
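Since Avro avsc comes up as an option in the thread: an avsc file is plain JSON, and nullability is conventionally encoded as a union with "null", which the Java SDK's AvroUtils.toBeamSchema maps to a nullable Beam field. A minimal stdlib-only Python sketch (the sample schema and the field_info helper are made up for illustration) that reads an avsc and reports each field's type and nullability:

```python
import json

# A sample Avro schema (avsc). Nullable fields are unions with "null".
AVSC = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "tags", "type": ["null", {"type": "array", "items": "string"}]}
  ]
}
"""

def field_info(field):
    """Return (name, type, nullable) for one avsc field entry."""
    ftype = field["type"]
    nullable = isinstance(ftype, list) and "null" in ftype
    if nullable:
        # Take the non-null branch of the union.
        ftype = next(t for t in ftype if t != "null")
    # Complex types ({"type": "array", ...}) carry their kind in "type".
    kind = ftype["type"] if isinstance(ftype, dict) else ftype
    return field["name"], kind, nullable

schema = json.loads(AVSC)
infos = [field_info(f) for f in schema["fields"]]
print(infos)  # [('id', 'long', False), ('tags', 'array', True)]
```

Because the union-with-null convention applies to any field type, an avsc file can describe the nullable arrays that BigQueryUtils cannot produce from a BigQuery TableSchema.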
>>>>
>>>> Brian
>>>>
>>>> [1] https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html
>>>> [2] https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection
>>>>
>>>> On Fri, Jun 18, 2021 at 7:50 AM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> There is a proto format for Beam schemas. You could define it as a
>>>>> proto in a file and then parse it.
>>>>>
>>>>> On Fri, Jun 18, 2021 at 7:28 AM Matthew Ouyang <[email protected]> wrote:
>>>>>
>>>>>> I was wondering if there were any tools that would allow me to
>>>>>> build a Beam schema from a file? I looked for it in the SDK but I
>>>>>> couldn't find anything that could do it.
