[
https://issues.apache.org/jira/browse/BEAM-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792722#comment-16792722
]
Juta Staes commented on BEAM-6769:
----------------------------------
> The expected way of writing bytes to bq is by passing base-64 encoded strings
> to the bigquery client
I searched for documentation on how to write bytes to bq using the bq client
or using a file upload but could not find any.
When uploading a file to bq containing b'\xab\xac\xad' I got the following
error: "Error while reading data, error message: JSON parsing error in row
starting at position 188: Could not decode base64 string to bytes." That's why
I tested writing base-64 encoded strings instead, and with those the data is
written as bytes to bq, both when using a file load and when using the bq
client.
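To illustrate, a minimal sketch of what worked (the field name 'data' is
illustrative, not taken from the actual test):
{noformat}
import base64
import json

# BigQuery expects BYTES values in JSON load files as base-64 strings;
# raw bytes such as b'\xab\xac\xad' make its JSON parser fail.
row = {'data': base64.b64encode(b'\xab\xac\xad').decode('ascii')}

# One line of a newline-delimited JSON load file:
print(json.dumps(row))  # {"data": "q6yt"}
{noformat}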
> The current test tests reading bytes from bq and then writing them.
It indeed uses a pre-populated table.
> The current test in python 2 does not work when trying to write
> b'\xab\xac\xad' to bigquery
I added a test in [https://github.com/apache/beam/pull/8056] that directly
writes to bigquery with beam.io.WriteToBigQuery to test this behavior.
I also tested how reading bytes currently works in python 2: when I read
b'\xab\xac\xad' from bq with beam.io.Read(beam.io.BigQuerySource(..)), the
output is u'q6yt' (the base-64 encoding of b'\xab\xac\xad'), which confirms
that BQ uses base-64 when handling bytes.
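For completeness, the returned value round-trips back to the original bytes
with a plain base-64 decode:
{noformat}
import base64

# u'q6yt' is what BigQuerySource currently returns for this BYTES field
assert base64.b64decode(u'q6yt') == b'\xab\xac\xad'
{noformat}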
Given that we cannot distinguish between a binary string and a non-binary
non-unicode string in python 2, I think we should pass the table schema when
writing data to and reading data from bq. Then we can change the tablerow
encoder
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L946]
to transform bytes to base-64 encoded strings when writing and to transform
them back when reading, for each field that has the "bytes" type defined in
the schema. This would work for both py2 and py3, but it would restrict the
user to always specifying the schema when writing bytes.
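As a rough sketch of the idea (the helper names and the simplified schema
objects are illustrative only, not the actual Beam API):
{noformat}
import base64
from collections import namedtuple

# Illustrative stand-ins for TableSchema / TableFieldSchema.
Field = namedtuple('Field', ['name', 'type'])
Schema = namedtuple('Schema', ['fields'])

def encode_bytes_fields(row, schema):
    """Base-64 encode every value whose schema field has type BYTES."""
    out = dict(row)
    for field in schema.fields:
        if field.type == 'BYTES' and field.name in out:
            out[field.name] = base64.b64encode(out[field.name]).decode('ascii')
    return out

def decode_bytes_fields(row, schema):
    """Undo the base-64 encoding when reading rows back."""
    out = dict(row)
    for field in schema.fields:
        if field.type == 'BYTES' and field.name in out:
            out[field.name] = base64.b64decode(out[field.name])
    return out

schema = Schema(fields=[Field('data', 'BYTES')])
encoded = encode_bytes_fields({'data': b'\xab\xac\xad'}, schema)
assert encoded == {'data': 'q6yt'}
assert decode_bytes_fields(encoded, schema) == {'data': b'\xab\xac\xad'}
{noformat}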
> BigQuery IO does not support bytes in Python 3
> ----------------------------------------------
>
> Key: BEAM-6769
> URL: https://issues.apache.org/jira/browse/BEAM-6769
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Juta Staes
> Assignee: Juta Staes
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> In Python 2 you could write bytes data to BigQuery. This is tested in
>
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py#L186]
> Python 3 does not support
> {noformat}
> json.dumps({'test': b'test'})
> {noformat}
> which is used to encode the data in
>
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L959]
>
> How should writing bytes to BigQuery be handled in Python 3?
> * Forbid writing bytes into BigQuery on Python 3
> * Guess the encoding (utf-8?)
> * Pass the encoding to BigQuery
> cc: [~tvalentyn]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)