By proto, I meant using the messages in
beam/model/pipeline/src/proto/schema.proto to define a schema. You can then
use the classes in SchemaTranslation to convert that to a schema.
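
As a rough sketch of what such a schema file might look like in protobuf
text format: the field names below follow my reading of the Schema, Field,
FieldType and ArrayType messages in schema.proto, so please check them
against the Beam version you are on. The field names "user_id" and "tags"
and the UUID are made up for illustration.

```textproto
# Hypothetical schema file; verify message and field names against schema.proto.
id: "123e4567-e89b-12d3-a456-426614174000"  # placeholder UUID, not a real schema id
fields {
  name: "user_id"
  type {
    atomic_type: INT64
  }
}
fields {
  name: "tags"
  type {
    nullable: true  # the array itself may be null
    array_type {
      element_type {
        atomic_type: STRING
      }
    }
  }
}
```

In Java you should be able to parse a file like this into a SchemaApi.Schema
message with protobuf's text-format parser and then pass the result to
SchemaTranslation.schemaFromProto to get a Beam Schema.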

On Tue, Jun 22, 2021 at 8:06 PM Matthew Ouyang <matthew.ouy...@gmail.com>
wrote:

> I am currently using BigQueryUtils to convert a BigQuery TableSchema to a
> Beam Schema, but I am looking to switch away from that approach because I
> need nullable arrays (BigQueryUtils always makes arrays non-nullable) and
> the ability to add my own logical types (one of my fields was
> unstructured JSON).
>
> I'm open to using proto or Avro since I would like to avoid the worst-case
> scenario of building my own.  However, it doesn't look like either has
> support for adding logical types, and proto appears to be missing support
> for the Beam Row type.
>
> On Fri, Jun 18, 2021 at 1:56 PM Brian Hulette <bhule...@google.com> wrote:
>
>> Are the files in some special format that you need to parse and
>> understand? Or could you opt to store the schemas as proto descriptors or
>> Avro avsc?
>>
>> On Fri, Jun 18, 2021 at 10:40 AM Matthew Ouyang <matthew.ouy...@gmail.com>
>> wrote:
>>
>>> Hello Brian.  Thank you for the clarification request.  I meant the
>>> first case.  I have files that define field names and types.
>>>
>>> On Fri, Jun 18, 2021 at 12:12 PM Brian Hulette <bhule...@google.com>
>>> wrote:
>>>
>>>> Could you clarify what you mean? I could interpret this in two
>>>> different ways:
>>>> 1) Have a separate file that defines the literal schema (field names
>>>> and types).
>>>> 2) Infer a schema from data stored in some file in a structured format
>>>> (e.g. csv or parquet).
>>>>
>>>> For (1) Reuven's suggestion would work. You could also use an Avro avsc
>>>> file here, which we also support.
>>>> For (2), we don't have anything like this in the Java SDK. In the Python
>>>> SDK the DataFrame API can do this, though. When you use one of the pandas
>>>> sources with the Beam DataFrame API [1], we peek at the file and infer the
>>>> schema so you don't need to specify it. You'd just need to use
>>>> to_pcollection [2] to convert the dataframe to a schema-aware PCollection.
>>>>
>>>> Brian
>>>>
>>>> [1]
>>>> https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html
>>>> [2]
>>>> https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection
>>>>
>>>> On Fri, Jun 18, 2021 at 7:50 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> There is a proto format for Beam schemas. You could define it as a
>>>>> proto in a file and then parse it.
>>>>>
>>>>> On Fri, Jun 18, 2021 at 7:28 AM Matthew Ouyang <
>>>>> matthew.ouy...@gmail.com> wrote:
>>>>>
>>>>>> I was wondering if there were any tools that would allow me to build
>>>>>> a Beam schema from a file?  I looked for it in the SDK but I couldn't
>>>>>> find anything that could do it.
>>>>>>
>>>>>
