[ 
https://issues.apache.org/jira/browse/HUDI-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732208#comment-17732208
 ] 

Jinpeng Zhou commented on HUDI-6333:
------------------------------------

Hi [~danny0405], could you please help review this one at your earliest 
convenience? Thanks.

> allow using the manifest file with absolute path to directly create one 
> bigquery external table over the Hudi table
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6333
>                 URL: https://issues.apache.org/jira/browse/HUDI-6333
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: meta-sync
>            Reporter: Jinpeng Zhou
>            Priority: Major
>              Labels: pull-request-available
>
> To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two 
> BigQuery external tables, one over the data files and the other over a 
> manifest file that contains the data file names. Based on these two tables, it 
> creates a view that reflects the latest version of the data using the following 
> query: "SELECT * FROM data_table WHERE _hoodie_file_name IN ( SELECT filename 
> FROM manifest_file_table)".
> The direct reason for this workaround is that BigQuery did not support 
> manifest files. However, BigQuery is rolling out manifest file support, 
> allowing users to specify a manifest file as the source URIs. Right now the 
> feature[1] roll-out seems to cover only non-partitioned external tables (using 
> hive partitioning returns the error "file_set_spec_type option is not 
> supported for hive partition"), and it should cover partitioned external 
> tables soon.
> Given this new BigQuery feature, it would be better to update 
> BigQuerySyncTool correspondingly:
>  * Allow creating a BigQuery-compatible manifest file, which expects absolute 
> paths of data files. This has been done in HUDI-6254.
>  * Allow using the new manifest file to create an external table directly. This 
> can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
>  * Avoid breaking existing user workflows. In case some users 
> rely on the view-based workaround, it probably makes sense to keep the 
> workaround alive at least for now. That would require maintaining two 
> versions of manifest files.
>  * Provide a temporary workaround for using BigQuery manifest file support 
> until this feature extends to partitioned tables. Since it currently does not 
> support hive partitioning, "CREATE EXTERNAL TABLE" can only create a table 
> over all the parquet data files without recognizing the partition columns. To 
> keep the partition columns, a possible workaround is to set 
> "hoodie.datasource.write.drop.partition.columns" to false and allow users to 
> leave "hoodie.gcp.bigquery.sync.source_uri_prefix" unset, so that the 
> partition columns are written into the parquet files and 
> BigQuerySyncTool does not try to create a hive-partitioned external table.
> [1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table
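
To make the two approaches in the description concrete, here is a minimal Python sketch that only builds the SQL text for each one. It is not the BigQuerySyncTool implementation. The function names, dataset and table names, and the gs:// manifest path are hypothetical placeholders, and the 'NEW_LINE_DELIMITED_MANIFEST' value for file_set_spec_type is an assumption based on the BigQuery external table options, so it should be checked against the current BigQuery documentation before use.

```python
def view_query(data_table: str, manifest_table: str) -> str:
    """View-based workaround: keep only rows whose file is listed in the manifest table."""
    return (
        f"SELECT * FROM `{data_table}` "
        f"WHERE _hoodie_file_name IN (SELECT filename FROM `{manifest_table}`)"
    )


def external_table_ddl(table: str, manifest_uri: str) -> str:
    """Direct approach: one external table whose source URI is the manifest file.

    The file_set_spec_type value is an assumption, not taken from the issue.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}` "
        "OPTIONS (format = 'PARQUET', "
        f"uris = ['{manifest_uri}'], "
        "file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST')"
    )


# Temporary workaround settings named in the issue: keep partition columns in
# the parquet files and leave hoodie.gcp.bigquery.sync.source_uri_prefix unset.
workaround_props = {"hoodie.datasource.write.drop.partition.columns": "false"}

if __name__ == "__main__":
    # All names and paths below are hypothetical placeholders.
    print(view_query("my_dataset.data_table", "my_dataset.manifest_file_table"))
    print(external_table_ddl("my_dataset.hudi_table", "gs://my-bucket/my-table/manifest-file"))
```

The first function mirrors the query quoted in the description; the second shows the single statement that would replace the two tables plus the view once the manifest lists absolute data file paths.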



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
