[ https://issues.apache.org/jira/browse/BEAM-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792722#comment-16792722 ]

Juta Staes commented on BEAM-6769:
----------------------------------

> The expected way of writing bytes to bq is by passing base-64 encoded strings 
> to the bigquery client
 I searched for documentation on how to write bytes to BQ using the BQ client or a file upload, but could not find any.
 When I uploaded a file containing b'\xab\xac\xad' to BQ, I got the following error: "Error while reading data, error message: JSON parsing error in row starting at position 188: Could not decode base64 string to bytes." That is why I then tested with base-64 encoded strings, and in that case the data is written to BQ as bytes, both with a file load and with the BQ client. A minimal sketch of what worked for me is shown below.
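
This is only an illustration of my test, not the eventual fix; the column name is hypothetical. The point is that the value has to be base-64 encoded before it is handed to the load job or the client:
{noformat}
import base64

# Hypothetical BYTES column named 'test_bytes'; these are the raw bytes I
# tried to load.
raw = b'\xab\xac\xad'

# Loading the raw bytes directly fails with the JSON parsing error above;
# base-64 encoding the value first is what made the load succeed.
row = {'test_bytes': base64.b64encode(raw).decode('ascii')}
print(row)  # {'test_bytes': 'q6yt'}
{noformat}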

> The current test tests reading bytes from bq and then writing them.
 It indeed uses a pre-populated table.

> The current test in python 2 does not work when trying to write 
> b'\xab\xac\xad' to bigquery
 I added a test in [https://github.com/apache/beam/pull/8056] that writes directly to BigQuery with beam.io.WriteToBigQuery to test this behavior.
 I also tested how reading bytes currently works in Python 2: when I read b'\xab\xac\xad' from BQ with beam.io.Read(beam.io.BigQuerySource(..)), the output is u'q6yt', the base-64 encoding of b'\xab\xac\xad'. This confirms that BQ uses base-64 when handling bytes; the round trip is spelled out below.
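
The round trip can be checked directly in plain Python, independent of Beam:
{noformat}
import base64

raw = b'\xab\xac\xad'
assert base64.b64encode(raw) == b'q6yt'   # matches the value read back from BQ
assert base64.b64decode(u'q6yt') == raw   # decoding recovers the original bytes
{noformat}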

Given that we cannot distinguish between a binary string and a non-binary, non-unicode string in Python 2, I think we should pass the table schema when writing data to and reading data from BQ. Then we can change the table row encoder at [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L946] to transform bytes to their base-64 encoding when writing, and to transform them back when reading, for each field whose type is "bytes" in the schema. This would work for both py2 and py3, but it would restrict the user to always specifying the schema when writing bytes.
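
A minimal sketch of what I have in mind, with hypothetical helper names and assuming the schema exposes fields with name and type attributes:
{noformat}
import base64

def encode_bytes_fields(row, schema):
    # Hypothetical helper: base-64 encode every field whose declared type is
    # BYTES before the row is serialized with json.dumps; all other fields
    # pass through unchanged.
    out = dict(row)
    for field in schema.fields:
        if field.type == 'BYTES' and out.get(field.name) is not None:
            out[field.name] = base64.b64encode(out[field.name]).decode('ascii')
    return out

def decode_bytes_fields(row, schema):
    # The reverse transformation, applied to rows read back from BQ.
    out = dict(row)
    for field in schema.fields:
        if field.type == 'BYTES' and out.get(field.name) is not None:
            out[field.name] = base64.b64decode(out[field.name])
    return out
{noformat}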

> BigQuery IO does not support bytes in Python 3
> ----------------------------------------------
>
>                 Key: BEAM-6769
>                 URL: https://issues.apache.org/jira/browse/BEAM-6769
>             Project: Beam
>          Issue Type: Sub-task
>          Components: sdk-py-core
>            Reporter: Juta Staes
>            Assignee: Juta Staes
>            Priority: Major
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> In Python 2 you could write bytes data to BigQuery. This is tested in
>  
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py#L186]
> Python 3 does not support
> {noformat}
> json.dumps({'test': b'test'}){noformat}
> which is used to encode the data in
>  
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L959]
>  
> How should writing bytes to BigQuery be handled in Python 3?
>  * Forbid writing bytes into BigQuery on Python 3
>  * Guess the encoding (utf-8?)
>  * Pass the encoding to BigQuery
> cc: [~tvalentyn]


