[ https://issues.apache.org/jira/browse/HUDI-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732208#comment-17732208 ]
Jinpeng Zhou commented on HUDI-6333:
------------------------------------

Hi [~danny0405], could you please help review this one at your earliest convenience? Thanks.

> allow using the manifest file with absolute path to directly create one
> bigquery external table over the Hudi table
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6333
>                 URL: https://issues.apache.org/jira/browse/HUDI-6333
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: meta-sync
>            Reporter: Jinpeng Zhou
>            Priority: Major
>              Labels: pull-request-available
>
> To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two
> BigQuery external tables, one over the data files and the other over a
> manifest file that contains the data file names. Based on these two tables, it
> creates a view that reflects the latest version of the data using the following
> query: "SELECT * FROM data_table WHERE _hoodie_file_name IN (SELECT filename
> FROM manifest_file_table)".
>
> The direct reason for this workaround is that BigQuery did not support
> manifest files. However, BigQuery is rolling out manifest file support,
> allowing users to specify a manifest file as the source URIs. Right now the
> feature[1] rollout seems to cover only non-partitioned external tables (using
> Hive partitioning returns the error "file_set_spec_type option is not
> supported for hive partition"), and it should extend to partitioned external
> tables soon.
>
> Given this new BigQuery feature, it would be better to update
> BigQuerySyncTool correspondingly:
> * Allow creating a BigQuery-compatible manifest file, which expects absolute
> paths of the data files. This has been done in HUDI-6254.
> * Allow using the new manifest file to create the external table directly. This
> can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
> * Avoid breaking existing user workflows. In case some users rely on the
> view-based workaround, it probably makes sense to keep the workaround alive,
> at least for now. That would require maintaining two versions of manifest
> files.
> * Provide a temporary workaround for using BigQuery manifest file support
> until the feature extends to partitioned tables. Since it currently does not
> support Hive partitioning, the "CREATE EXTERNAL TABLE" query can only create a
> table over all the parquet data files without recognizing the partition
> columns. To keep the partition columns, a possible workaround is to set
> "hoodie.datasource.write.drop.partition.columns" to false and allow users to
> leave "hoodie.gcp.bigquery.sync.source_uri_prefix" unset, so that the
> partition columns are written into the parquet files and the
> BigQuerySyncTool does not try to create a hive-partitioned external table.
>
> [1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
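
For reference, here is a rough sketch of the two statements discussed in the description: the view query used by the current workaround, and the single "CREATE EXTERNAL TABLE" statement that the manifest support would allow. It only builds SQL strings and does not call BigQuery. The function names, table names and the manifest URI are hypothetical placeholders, not part of Hudi. The "file_set_spec_type" option name comes from the error message quoted in the description; the value 'NEW_LINE_DELIMITED_MANIFEST' and the other options are my reading of the BigQuery DDL docs and should be checked against them.

```python
def view_query(data_table: str, manifest_table: str) -> str:
    """Query behind the existing view-based workaround (as quoted in the issue)."""
    return (
        f"SELECT * FROM {data_table} "
        f"WHERE _hoodie_file_name IN (SELECT filename FROM {manifest_table})"
    )


def manifest_external_table_ddl(table: str, manifest_uri: str) -> str:
    """One CREATE EXTERNAL TABLE statement over an absolute-path manifest file.

    Assumption: the manifest lists one absolute data file URI per line, and
    BigQuery is told to treat the URI as a manifest via file_set_spec_type.
    """
    return (
        f"CREATE EXTERNAL TABLE `{table}` "
        "OPTIONS ("
        "format = 'PARQUET', "
        f"uris = ['{manifest_uri}'], "
        "file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(view_query("data_table", "manifest_file_table"))
    print(
        manifest_external_table_ddl(
            "my_project.my_dataset.my_hudi_table",
            "gs://my-bucket/my_hudi_table/manifest-file",
        )
    )
```

Note that the second statement has no hive partitioning options, which matches the temporary workaround above: the partition columns have to be present in the parquet files themselves.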