Hey Matthew,

I got into a pickle a while back on something similar. I had a pre-existing 
proto definition which I wanted to use as a Beam Schema. This worked like 
complete magic until the sink: when I tried to write results as Parquet or 
BigQuery, I discovered that my proto schema provider was using types that 
couldn't be represented in Avro; from memory I think it was OneOf.

I would just double-check that both your sources and sinks are able to handle 
whatever schema you define.

Cheers,
Chris.


On 23 Jun 2021, at 06:01, Reuven Lax <[email protected]> wrote:

By proto, I meant using the messages in 
beam/model/pipeline/src/proto/schema.proto to define a schema. You can then use 
the classes in SchemaTranslation to convert that proto into a Beam Schema.
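For example, a rough sketch of reading a text-format schema.proto message from 
a file and converting it (untested; it assumes the beam-model-pipeline classes 
are on the classpath, and the class name and file path are placeholders):

    import java.io.FileReader;

    import com.google.protobuf.TextFormat;
    import org.apache.beam.model.pipeline.v1.SchemaApi;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.SchemaTranslation;

    public class SchemaFromFile {
      public static Schema load(String path) throws Exception {
        // Parse the text-format proto message from the file.
        SchemaApi.Schema.Builder builder = SchemaApi.Schema.newBuilder();
        try (FileReader reader = new FileReader(path)) {
          TextFormat.merge(reader, builder);
        }
        // Convert the proto representation into a Beam Schema.
        return SchemaTranslation.schemaFromProto(builder.build());
      }
    }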

On Tue, Jun 22, 2021 at 8:06 PM Matthew Ouyang <[email protected]> wrote:
I am currently using BigQueryUtils to convert a BigQuery TableSchema to a Beam 
Schema, but I am looking to switch away from that approach because I need 
nullable arrays (BigQueryUtils always makes arrays not nullable) and the 
ability to add my own logical types (one of my fields is unstructured JSON).
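Roughly what I have today looks like this (the wrapper class is just for 
illustration; BigQueryUtils lives in the GCP IO module):

    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
    import org.apache.beam.sdk.schemas.Schema;

    public class SchemaConversion {
      // Convert a BigQuery TableSchema into a Beam Schema.
      public static Schema toBeamSchema(TableSchema tableSchema) {
        return BigQueryUtils.fromTableSchema(tableSchema);
      }
    }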

I'm open to using proto or Avro, since I would like to avoid the worst-case 
scenario of building my own. However, it doesn't look like either supports 
adding logical types, and proto appears to be missing support for the Beam Row 
type.
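For context, the kind of thing I'd like to be able to express is a custom 
logical type along these lines (a sketch only; the identifier is made up and 
the JSON is just carried over a STRING base type):

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;

    // Hypothetical logical type for unstructured JSON, stored as a plain STRING.
    public class JsonLogicalType implements Schema.LogicalType<String, String> {
      @Override public String getIdentifier() { return "example:json"; } // made-up URN
      @Override public FieldType getArgumentType() { return null; }      // no type argument
      @Override public FieldType getBaseType() { return FieldType.STRING; }
      @Override public String toBaseType(String input) { return input; }
      @Override public String toInputType(String base) { return base; }
    }

A field would then be declared with something like 
FieldType.logicalType(new JsonLogicalType()).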

On Fri, Jun 18, 2021 at 1:56 PM Brian Hulette <[email protected]> wrote:
Are the files in some special format that you need to parse and understand? Or 
could you opt to store the schemas as proto descriptors or Avro avsc?

On Fri, Jun 18, 2021 at 10:40 AM Matthew Ouyang <[email protected]> wrote:
Hello Brian. Thank you for the clarification request. I meant the first case: 
I have files that define field names and types.

On Fri, Jun 18, 2021 at 12:12 PM Brian Hulette <[email protected]> wrote:
Could you clarify what you mean? I could interpret this two different ways:
1) Have a separate file that defines the literal schema (field names and types).
2) Infer a schema from data stored in some file in a structured format (e.g., 
CSV or Parquet).

For (1) Reuven's suggestion would work. You could also use an Avro avsc file 
here, which we support as well.
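For the avsc route, a minimal sketch (the file name is a placeholder):

    import java.io.File;
    import java.io.IOException;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;

    public class AvscToBeam {
      public static Schema load(String path) throws IOException {
        // Parse the .avsc file with Avro's own parser, then convert to a Beam Schema.
        org.apache.avro.Schema avroSchema =
            new org.apache.avro.Schema.Parser().parse(new File(path));
        return AvroUtils.toBeamSchema(avroSchema);
      }
    }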
For (2) we don't have anything like this in the Java SDK. In the Python SDK the 
DataFrame API can do this, though. When you use one of the pandas sources with 
the Beam DataFrame API [1], we peek at the file and infer the schema so you 
don't need to specify it. You'd then just use to_pcollection [2] to convert the 
dataframe to a schema-aware PCollection.
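A minimal sketch of that flow in Python (the file name and CSV source are just 
assumptions):

    import apache_beam as beam
    from apache_beam.dataframe.convert import to_pcollection
    from apache_beam.dataframe.io import read_csv

    with beam.Pipeline() as p:
        # The schema is inferred by peeking at the file; nothing to declare.
        df = p | read_csv('input.csv')
        # Convert the deferred dataframe to a schema-aware PCollection.
        rows = to_pcollection(df)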

Brian

[1] https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html
[2] https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection

On Fri, Jun 18, 2021 at 7:50 AM Reuven Lax <[email protected]> wrote:
There is a proto format for Beam schemas. You could define it as a proto in a 
file and then parse it.
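For instance, a text-format file along these lines (going from memory of 
schema.proto, so double-check the field names against the actual message 
definitions):

    fields {
      name: "user_id"
      type { atomic_type: INT64 }
    }
    fields {
      name: "payload"
      type { nullable: true atomic_type: STRING }
    }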

On Fri, Jun 18, 2021 at 7:28 AM Matthew Ouyang <[email protected]> wrote:
I was wondering if there are any tools that would allow me to build a Beam 
Schema from a file. I looked in the SDK but I couldn't find anything that 
could do it.
