Hi Matthew, just wanted to point out that in BigQuery, arrays can't be NULL
(this is probably why BigQueryUtils has the behaviour you observed).

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
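
That said, if you build the Beam schema yourself you can still mark array
fields as nullable on the Beam side. A minimal sketch (untested; "tags" is a
made-up field name):

  import org.apache.beam.sdk.schemas.Schema;
  import org.apache.beam.sdk.schemas.Schema.FieldType;

  // A schema with a nullable array field, which BigQueryUtils won't produce.
  Schema schema =
      Schema.builder()
          .addNullableField("tags", FieldType.array(FieldType.STRING))
          .build();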

Best,
-C

On Tue, Jun 22, 2021 at 11:06 PM Matthew Ouyang <[email protected]>
wrote:

> I am currently using BigQueryUtils to convert a BigQuery TableSchema to a
> Beam Schema, but I am looking to move away from that approach because I need
> nullable arrays (BigQueryUtils always makes arrays not nullable) and the
> ability to add my own logical types (one of my fields is unstructured JSON).
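>
> For reference, this is roughly the kind of logical type I'd like to be able
> to declare (just a sketch; the class name and the "my.company:json"
> identifier are made up, and the exact Schema.LogicalType interface should be
> checked against your Beam version):
>
>   import org.apache.beam.sdk.schemas.Schema;
>   import org.apache.beam.sdk.schemas.Schema.FieldType;
>
>   // Unstructured JSON carried as its base STRING type; no type argument.
>   public class JsonLogicalType implements Schema.LogicalType<String, String> {
>     @Override public String getIdentifier() { return "my.company:json"; }
>     @Override public FieldType getArgumentType() { return null; }
>     @Override public FieldType getBaseType() { return FieldType.STRING; }
>     @Override public String toBaseType(String input) { return input; }
>     @Override public String toInputType(String base) { return base; }
>   }
>
> A field would then be FieldType.logicalType(new JsonLogicalType())
> .withNullable(true), which is exactly what I can't get out of BigQueryUtils.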
>
> I'm open to using proto or Avro, since I would like to avoid the worst-case
> scenario of building my own.  However, it doesn't look like either supports
> adding logical types, and proto appears to be missing support for the Beam
> Row type.
>
> On Fri, Jun 18, 2021 at 1:56 PM Brian Hulette <[email protected]> wrote:
>
>> Are the files in some special format that you need to parse and
>> understand? Or could you opt to store the schemas as proto descriptors or
>> Avro avsc?
>>
>> On Fri, Jun 18, 2021 at 10:40 AM Matthew Ouyang <[email protected]>
>> wrote:
>>
>>> Hello Brian.  Thank you for asking me to clarify.  I meant the first
>>> case.  I have files that define field names and types.
>>>
>>> On Fri, Jun 18, 2021 at 12:12 PM Brian Hulette <[email protected]>
>>> wrote:
>>>
>>>> Could you clarify what you mean? I could interpret this two different
>>>> ways:
>>>> 1) Have a separate file that defines the literal schema (field names
>>>> and types).
>>>> 2) Infer a schema from data stored in some file in a structured format
>>>> (e.g. CSV or Parquet).
>>>>
>>>> For (1) Reuven's suggestion would work. You could also use an Avro avsc
>>>> file here, which we also support.
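>>>>
>>>> For the avsc route, something like this should work (a sketch; the file
>>>> name is made up):
>>>>
>>>>   import java.io.File;
>>>>   import org.apache.beam.sdk.schemas.Schema;
>>>>   import org.apache.beam.sdk.schemas.utils.AvroUtils;
>>>>
>>>>   // Schema.Parser.parse(File) throws IOException.
>>>>   org.apache.avro.Schema avroSchema =
>>>>       new org.apache.avro.Schema.Parser().parse(new File("schema.avsc"));
>>>>   Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
>>>>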
>>>> For (2) we don't have anything like this in the Java SDK. In the Python
>>>> SDK the DataFrame API can do this, though. When you use one of the pandas
>>>> sources with the Beam DataFrame API [1], we peek at the file and infer the
>>>> schema so you don't need to specify it. You'd just need to use
>>>> to_pcollection [2] to convert the DataFrame to a schema-aware PCollection.
>>>>
>>>> Brian
>>>>
>>>> [1]
>>>> https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html
>>>> [2]
>>>> https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection
>>>>
>>>> On Fri, Jun 18, 2021 at 7:50 AM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> There is a proto format for Beam schemas. You could define it as a
>>>>> proto in a file and then parse it.
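>>>>>
>>>>> Something along these lines (a sketch; the file name is made up, and
>>>>> SchemaTranslation is technically an internal API, so check it against
>>>>> your Beam version):
>>>>>
>>>>>   import java.io.FileInputStream;
>>>>>   import org.apache.beam.model.pipeline.v1.SchemaApi;
>>>>>   import org.apache.beam.sdk.schemas.Schema;
>>>>>   import org.apache.beam.sdk.schemas.SchemaTranslation;
>>>>>
>>>>>   // Assumes the file holds a binary-serialized SchemaApi.Schema message;
>>>>>   // the constructor and parseFrom throw IOException.
>>>>>   try (FileInputStream in = new FileInputStream("schema.pb")) {
>>>>>     SchemaApi.Schema proto = SchemaApi.Schema.parseFrom(in);
>>>>>     Schema schema = SchemaTranslation.schemaFromProto(proto);
>>>>>   }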
>>>>>
>>>>> On Fri, Jun 18, 2021 at 7:28 AM Matthew Ouyang <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I was wondering if there were any tools that would allow me to build
>>>>>> a Beam schema from a file?  I looked for it in the SDK but I couldn't
>>>>>> find anything that could do it.
>>>>>>
>>>>>

-- 
Christian Battista, Ph.D.
he/him
Senior Data Engineer
BenchSci
www.benchsci.com
E: [email protected]
