[ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465859#comment-17465859 ]

Vinoth Govindarajan edited comment on HUDI-2438 at 12/27/21, 8:01 PM:
----------------------------------------------------------------------

Hi [~qiao.xu] - Good news!! I found a way to integrate Hudi with BigQuery; here 
is the general idea:

 
 * Let's say you have Hudi table data on Google Cloud Storage (GCS).

{code:java}
create table dwh.bq_demo_partitioned_cow (
    id bigint,
    name string,
    price double,
    ts bigint,
    dt string
) using hudi
partitioned by (dt)
options (
    type = 'cow',
    primaryKey = 'id',
    preCombineField = 'ts',
    hoodie.datasource.write.drop.partition.columns = 'true'
)
location 'gs://hudi_datasets/bq_demo_partitioned_cow/';{code}
BigQuery doesn't accept the partition column in the parquet schema, so we need 
to drop the partition columns from the schema by enabling this flag: 
{code:java}
hoodie.datasource.write.drop.partition.columns = 'true'{code}
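To sanity-check the setup before wiring up BigQuery, you can write a few rows 
through Spark SQL; the rows below are made-up sample data, not from the 
original table:
{code:java}
-- Hypothetical sample rows; any Hudi write path works here.
insert into dwh.bq_demo_partitioned_cow
values (1, 'a1', 20.0, 1000, '2021-12-26'),
       (2, 'a2', 40.0, 2000, '2021-12-27');{code}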
 * Generate a manifest file for the Hudi table that lists the latest snapshot 
parquet file names in CSV format, with a single column holding the file name. 
The manifest file should live in the .hoodie metadata folder 
(`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`); 
until the command below lands, see the Spark SQL workaround sketched right 
after it.

{code:java}
// this command is coming soon.
GENERATE symlink_format_manifest FOR TABLE dwh.bq_demo_partitioned_cow;{code}
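Until that command is available, one possible workaround (my sketch, not a 
shipped Hudi feature) is to derive the latest snapshot file names from Hudi's 
`_hoodie_file_name` metadata column, since a snapshot query only reads the 
latest file slice, and write the result as a single-column CSV to the manifest 
location:
{code:java}
-- Snapshot reads only touch the latest file slice, so the distinct
-- _hoodie_file_name values are exactly the latest snapshot files.
SELECT DISTINCT _hoodie_file_name AS filename
FROM dwh.bq_demo_partitioned_cow;
-- Save this result as a headerless one-column CSV at
-- gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv{code}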
 * Create a BQ external table named `table_name_manifest` with a single 
`filename` column, pointing at 
`gs://bucket_name/table_name/.hoodie/manifest/latest_snapshot_files.csv`.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`
(
  filename STRING
)
OPTIONS(
  format="CSV",
  uris=["gs://hudi_datasets/bq_demo_partitioned_cow/.hoodie/manifest/latest_snapshot_files.csv"]
);{code}
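A quick sanity check (assuming the manifest CSV has already been generated) is 
to count the files it lists:
{code:java}
-- Should return the number of latest-snapshot parquet files.
SELECT COUNT(*) AS num_files
FROM `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`;{code}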
 * Create another BQ external table named `table_name_history` pointing at 
`gs://bucket_name/table_name`. Don't use this table to query the data; it will 
contain duplicate records, since it scans all versions of the parquet files in 
the table/partition folders.

{code:java}
CREATE EXTERNAL TABLE `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WITH PARTITION COLUMNS
OPTIONS(
  ignore_unknown_values=true,
  format="PARQUET",
  hive_partition_uri_prefix="gs://hudi_datasets/bq_demo_partitioned_cow/",
  uris=["gs://hudi_datasets/bq_demo_partitioned_cow/dt=*"]
);{code}
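To see why the history table must not be queried directly, count the parquet 
versions per key; after an update, a key shows up more than once (a sketch, 
assuming `id` is the primary key as in the DDL above):
{code:java}
-- Keys with more than one parquet version appear here after updates.
SELECT id, COUNT(*) AS versions
FROM `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
GROUP BY id
HAVING COUNT(*) > 1;{code}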
 * Create a BQ view named `table_name` with this query: 

{code:java}
CREATE VIEW `golden-union-336019.dwh.bq_demo_partitioned_cow`
AS SELECT
  *
FROM
  `golden-union-336019.dwh.bq_demo_partitioned_cow_history`
WHERE
  _hoodie_file_name IN (
  SELECT
    filename
  FROM
    `golden-union-336019.dwh.bq_demo_partitioned_cow_manifest`);{code}
 * The view you just created exposes the Hudi table's data without any 
duplicates; use that view to query the data, for example as shown below.
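For instance, a simple aggregation against the view (the query itself is just 
an illustration):
{code:java}
-- Row counts per partition, deduplicated through the manifest filter.
SELECT dt, COUNT(*) AS num_rows
FROM `golden-union-336019.dwh.bq_demo_partitioned_cow`
GROUP BY dt;{code}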

 

To make this model work, we need a way to generate the manifest file with the 
latest snapshot files; I will create a PR for that soon. One final step: Hudi 
generates several non-parquet files in the table/partition location (.crc 
files plus .hoodie_partition_metadata):
{code:java}
..hoodie_partition_metadata.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet.crc
.a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet.crc
.hoodie_partition_metadata
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-21-1605_20211227090948.parquet
a4949325-81ca-4d6b-8ce5-5b41bea62b36-0_0-53-3220_20211227091157.parquet{code}
We need to remove these non-parquet files to make the BQ integration work; I 
will talk to the Hudi core team about how to get rid of these files from the 
partition location.
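As a stopgap (my assumption, not something the Hudi team has agreed on), the 
hidden files could be deleted from GCS with gsutil; note this might interfere 
with Hudi's own reads, so treat it purely as an experiment:
{code:java}
# Remove hidden .crc and partition-metadata files so BQ's parquet scan
# only sees data files. Paths assume the demo table used above.
gsutil -m rm "gs://hudi_datasets/bq_demo_partitioned_cow/**/.*.crc"
gsutil -m rm "gs://hudi_datasets/bq_demo_partitioned_cow/**/.hoodie_partition_metadata"{code}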

 



> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> ----------------------------------------------------------------
>
>                 Key: HUDI-2438
>                 URL: https://issues.apache.org/jira/browse/HUDI-2438
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: BigQuery, Integration
>             Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective 
> analytics data warehouse that lets you run analytics over vast amounts of 
> data in near real-time. BigQuery currently [doesn’t 
> support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache 
> Hudi file format, but it does support the Parquet file format. The 
> proposal is to implement a BigQuerySync, similar to HiveSync, that syncs a Hudi 
> table as a BigQuery external Parquet table so that users can query Hudi 
> tables using BigQuery. Uber is already syncing some of its Hudi tables to a 
> BigQuery data mart; this will help them write, sync, and query.
>  
> More details are in RFC-34: 
> [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]


