[jira] [Created] (HUDI-7459) Update hudi-gcp-bundle pom to make it consistent with hudi-gcp

2024-02-29 Thread Jinpeng Zhou (Jira)
Jinpeng Zhou created HUDI-7459:
--

 Summary: Update hudi-gcp-bundle pom to make it consistent with 
hudi-gcp
 Key: HUDI-7459
 URL: https://issues.apache.org/jira/browse/HUDI-7459
 Project: Apache Hudi
  Issue Type: Task
Reporter: Jinpeng Zhou
 Fix For: 0.14.1


In hudi-gcp/pom.xml, the libraries-bom version was updated to 26.15.0 to reflect recent 
changes to the BigQuery sync tool, but hudi-gcp-bundle/pom.xml still pins a 
very old version, 25.1.0. I guess we'd want to make them consistent?
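
A minimal sketch of how such a mismatch could be detected automatically. The POM 
snippets below are hypothetical stand-ins (the real pom.xml files may declare the 
version through a property instead of a literal), and the helper names are 
illustrative only, not part of Hudi:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def bom_version(pom_xml, artifact_id="libraries-bom"):
    """Return the literal <version> of the first dependency with this artifactId."""
    root = ET.fromstring(pom_xml)
    for element in root.iter():
        if local_name(element.tag) != "dependency":
            continue
        fields = {local_name(c.tag): (c.text or "").strip() for c in element}
        if fields.get("artifactId") == artifact_id:
            return fields.get("version")
    return None


def versions_consistent(pom_a, pom_b):
    """True when both POMs pin the same libraries-bom version."""
    return bom_version(pom_a) == bom_version(pom_b)


# Hypothetical, trimmed-down POM contents used only for illustration.
GCP_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencyManagement><dependencies><dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>libraries-bom</artifactId>
    <version>26.15.0</version>
    <type>pom</type><scope>import</scope>
  </dependency></dependencies></dependencyManagement>
</project>"""
BUNDLE_POM = GCP_POM.replace("26.15.0", "25.1.0")

if __name__ == "__main__":
    print(bom_version(GCP_POM), bom_version(BUNDLE_POM))
    print("consistent:", versions_consistent(GCP_POM, BUNDLE_POM))
```

A check like this could run in CI so the two modules do not drift apart again.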



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7459) Update hudi-gcp-bundle pom to make it consistent with hudi-gcp

2024-02-29 Thread Jinpeng Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Zhou updated HUDI-7459:
---
Component/s: meta-sync

> Update hudi-gcp-bundle pom to make it consistent with hudi-gcp
> --
>
> Key: HUDI-7459
> URL: https://issues.apache.org/jira/browse/HUDI-7459
> Project: Apache Hudi
>  Issue Type: Task
>  Components: meta-sync
>Reporter: Jinpeng Zhou
>Priority: Major
> Fix For: 0.14.1
>
>
> In hudi-gcp/pom.xml, the libraries-bom version was updated to 26.15.0 to reflect 
> recent changes to the BigQuery sync tool, but hudi-gcp-bundle/pom.xml 
> still pins a very old version, 25.1.0. I guess we'd want to make them consistent?





[jira] [Updated] (HUDI-7242) Avoid unnecessary bigquery table update when using sync tool

2023-12-19 Thread Jinpeng Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Zhou updated HUDI-7242:
---
Description: 
The [PR](https://github.com/apache/hudi/pull/9482) added a table schema 
update step to the BigQuery sync tool. There seem to be two issues (this PR 
targets the first one):

1. When it reforms the schema, which is then compared to the BigQuery table 
schema, the reformed schema puts partition fields at the beginning, while the 
BigQuery table schema by default has partition fields at the end. So it 
unnecessarily triggers a schema update due to the order difference.

2. Though the sync tool in 0.14.0 does not support a BigLake connection ID 
(a recent PR from last month adds this support), users can still 
recreate their table manually with a connection ID. The table update 
adds the new schema to the external table definition. This does not work for 
BigLake tables and causes the error: "Schema can be specified only on the 
Table.Schema field for external tables with an associated connection_id but 
schema was provided on Table.Externaldataconfig.Schema". 

  was:
The [PR](https://github.com/apache/hudi/pull/9482) added a table schema 
update step to the BigQuery sync tool. There seem to be two issues:

1. When it reforms the schema, which is then compared to the BigQuery table 
schema, the reformed schema puts partition fields at the beginning, while the 
BigQuery table schema by default has partition fields at the end. So it 
unnecessarily triggers a schema update due to the order difference.

2. Though the sync tool in 0.14.0 does not support a BigLake connection ID 
(a recent PR from last month adds this support), users can still 
recreate their table manually with a connection ID. The table update 
adds the new schema to the external table definition. This does not work for 
BigLake tables and causes the error: "Schema can be specified only on the 
Table.Schema field for external tables with an associated connection_id but 
schema was provided on Table.Externaldataconfig.Schema". 


> Avoid unnecessary bigquery table update when using sync tool
> 
>
> Key: HUDI-7242
> URL: https://issues.apache.org/jira/browse/HUDI-7242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0, 0.14.1
>
>
> The [PR](https://github.com/apache/hudi/pull/9482) added a table schema 
> update step to the BigQuery sync tool. There seem to be two issues (this PR 
> targets the first one):
> 1. When it reforms the schema, which is then compared to the BigQuery table 
> schema, the reformed schema puts partition fields at the beginning, while the 
> BigQuery table schema by default has partition fields at the end. So it 
> unnecessarily triggers a schema update due to the order difference.
> 2. Though the sync tool in 0.14.0 does not support a BigLake connection ID 
> (a recent PR from last month adds this support), users can still 
> recreate their table manually with a connection ID. The table update 
> adds the new schema to the external table definition. This does not work for 
> BigLake tables and causes the error: "Schema can be specified only on the 
> Table.Schema field for external tables with an associated connection_id but 
> schema was provided on Table.Externaldataconfig.Schema". 





[jira] [Created] (HUDI-7242) Avoid unnecessary bigquery table update when using sync tool

2023-12-19 Thread Jinpeng Zhou (Jira)
Jinpeng Zhou created HUDI-7242:
--

 Summary: Avoid unnecessary bigquery table update when using sync 
tool
 Key: HUDI-7242
 URL: https://issues.apache.org/jira/browse/HUDI-7242
 Project: Apache Hudi
  Issue Type: Bug
  Components: meta-sync
Reporter: Jinpeng Zhou
 Fix For: 0.14.1, 0.14.0


The [PR](https://github.com/apache/hudi/pull/9482) added a table schema 
update step to the BigQuery sync tool. There seem to be two issues:

1. When it reforms the schema, which is then compared to the BigQuery table 
schema, the reformed schema puts partition fields at the beginning, while the 
BigQuery table schema by default has partition fields at the end. So it 
unnecessarily triggers a schema update due to the order difference.

2. Though the sync tool in 0.14.0 does not support a BigLake connection ID 
(a recent PR from last month adds this support), users can still 
recreate their table manually with a connection ID. The table update 
adds the new schema to the external table definition. This does not work for 
BigLake tables and causes the error: "Schema can be specified only on the 
Table.Schema field for external tables with an associated connection_id but 
schema was provided on Table.Externaldataconfig.Schema". 
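
To illustrate the first issue, here is a rough sketch of an order-insensitive 
comparison. It is not the actual fix in the sync tool; the schema representation 
(a list of (name, type) tuples) and the function names are assumptions made for 
this example. Both schemas are normalized so that partition fields come last 
before they are compared:

```python
def move_partition_fields_last(fields, partition_fields):
    """Return fields with data columns first and partition columns last.

    The relative order inside each group is preserved.
    """
    partition_set = set(partition_fields)
    data_columns = [f for f in fields if f[0] not in partition_set]
    partition_columns = [f for f in fields if f[0] in partition_set]
    return data_columns + partition_columns


def needs_schema_update(new_fields, existing_fields, partition_fields):
    """True only when the schemas differ after normalizing partition position."""
    return (move_partition_fields_last(new_fields, partition_fields)
            != move_partition_fields_last(existing_fields, partition_fields))


if __name__ == "__main__":
    # Hypothetical schemas: the reformed one lists the partition field first,
    # the existing table lists it last.
    reformed = [("dt", "STRING"), ("id", "INT64"), ("name", "STRING")]
    existing = [("id", "INT64"), ("name", "STRING"), ("dt", "STRING")]
    print(needs_schema_update(reformed, existing, ["dt"]))
```

With this normalization, a pure reordering of partition fields no longer counts 
as a schema change, while an added or retyped column still does.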





[jira] [Updated] (HUDI-6831) Add back missing project_id to query statement in BigQuerySyncTool

2023-09-07 Thread Jinpeng Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Zhou updated HUDI-6831:
---
Description: Currently the DDL issued when using the BigQuery manifest file is 
"create external table dataset.table", which does not include the project ID. 
It should be "project.dataset.table"  (was: The project ID is missing from the 
query statement when using the BigQuery manifest file. Currently it's "create 
external table dataset.table", which does not work for cross-project scenarios. 
It should be fixed to "create external table project.dataset.table")

> Add back missing project_id to query statement in BigQuerySyncTool
> --
>
> Key: HUDI-6831
> URL: https://issues.apache.org/jira/browse/HUDI-6831
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Jinpeng Zhou
>Priority: Minor
> Fix For: 0.14.0
>
>
> Currently the DDL issued when using the BigQuery manifest file is "create 
> external table dataset.table", which does not include the project ID. It 
> should be "project.dataset.table"





[jira] [Created] (HUDI-6831) Add back missing project_id to query statement in BigQuerySyncTool

2023-09-07 Thread Jinpeng Zhou (Jira)
Jinpeng Zhou created HUDI-6831:
--

 Summary: Add back missing project_id to query statement in 
BigQuerySyncTool
 Key: HUDI-6831
 URL: https://issues.apache.org/jira/browse/HUDI-6831
 Project: Apache Hudi
  Issue Type: Bug
  Components: meta-sync
Reporter: Jinpeng Zhou
 Fix For: 0.14.0


The project ID is missing from the query statement when using the BigQuery 
manifest file. Currently it's "create external table dataset.table", which does 
not work for cross-project scenarios. It should be fixed to "create external 
table project.dataset.table"
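
A small sketch of the intended naming, assuming a helper that assembles the table 
reference from its three parts. The function names and sample identifiers are 
hypothetical and not taken from BigQuerySyncTool; the backticks are used because 
BigQuery requires quoting for project IDs that contain hyphens:

```python
def qualified_table_name(project_id, dataset, table):
    """Build a backtick-quoted `project.dataset.table` reference."""
    for part_name, part in (("project_id", project_id),
                            ("dataset", dataset),
                            ("table", table)):
        if not part:
            raise ValueError(f"{part_name} must not be empty")
    return f"`{project_id}.{dataset}.{table}`"


def create_external_table_prefix(project_id, dataset, table):
    """Start of the DDL statement, now including the project ID."""
    return f"CREATE EXTERNAL TABLE {qualified_table_name(project_id, dataset, table)}"


if __name__ == "__main__":
    # Hypothetical identifiers for a cross-project sync.
    print(create_external_table_prefix("other-project", "my_dataset", "my_table"))
```

Failing fast on an empty project ID avoids silently falling back to the 
two-part "dataset.table" form that breaks cross-project scenarios.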





[jira] [Commented] (HUDI-6333) allow using the manifest file with absolute path to directly create one bigquery external table over the Hudi table

2023-06-13 Thread Jinpeng Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732208#comment-17732208
 ] 

Jinpeng Zhou commented on HUDI-6333:


Hi [~danny0405], could you please help review this one at your earliest 
convenience? Thanks.

> allow using the manifest file with absolute path to directly create one 
> bigquery external table over the Hudi table
> ---
>
> Key: HUDI-6333
> URL: https://issues.apache.org/jira/browse/HUDI-6333
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: meta-sync
>Reporter: Jinpeng Zhou
>Priority: Major
>  Labels: pull-request-available
>
> To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two 
> BigQuery external tables, one over the data files and the other over a 
> manifest file that contains the data file names. Based on these two tables, it 
> creates a view to reflect the latest version of the data using the following 
> query: "SELECT * FROM data_table WHERE _hoodie_file_name IN ( SELECT filename 
> FROM manifest_file_table)".
> The direct reason for this workaround is that BigQuery did not support 
> manifest files. However, BigQuery is rolling out manifest file support, 
> allowing users to specify a manifest file as the source URIs. Right now the 
> feature[1] roll-out seems to cover only non-partitioned external tables (using 
> hive partitioning returns an error "file_set_spec_type option is not 
> supported for hive partition"); it should cover partitioned external 
> tables soon.
> Given this new BigQuery feature, it would be better to update 
> BigQuerySyncTool accordingly:
>  * Allow creating a BigQuery-compatible manifest file, which expects absolute 
> paths of data files. This has been done in HUDI-6254.
>  * Allow using the new manifest file to create the external table directly. This 
> can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
>  * Avoid breaking existing user workflows. In case some users 
> rely on the view-based workaround, it probably makes sense to keep the 
> workaround alive at least for now. That would require maintaining two 
> versions of manifest files.
>  * Provide a temporary workaround for using BigQuery manifest file support 
> until this feature extends to partitioned tables. Since it currently does not 
> support hive partitioning, the "CREATE EXTERNAL TABLE" can only create a table 
> over all the parquet data files without recognizing the partition columns. To 
> keep the partition columns, a possible workaround is to set 
> "hoodie.datasource.write.drop.partition.columns" to false and allow users to 
> not specify "hoodie.gcp.bigquery.sync.source_uri_prefix", so that the 
> partition columns are written into the parquet files and the 
> BigQuerySyncTool will not try to create a hive-partitioned external table.
> [1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table





[jira] [Created] (HUDI-6333) allow using the manifest file with absolute path to directly create one bigquery external table over the Hudi table

2023-06-07 Thread Jinpeng Zhou (Jira)
Jinpeng Zhou created HUDI-6333:
--

 Summary: allow using the manifest file with absolute path to 
directly create one bigquery external table over the Hudi table
 Key: HUDI-6333
 URL: https://issues.apache.org/jira/browse/HUDI-6333
 Project: Apache Hudi
  Issue Type: Improvement
  Components: meta-sync
Reporter: Jinpeng Zhou


To query a Hudi table from BigQuery, the current BigQuerySyncTool creates two 
BigQuery external tables, one over the data files and the other over a manifest 
file that contains the data file names. Based on these two tables, it creates a 
view to reflect the latest version of the data using the following query: "SELECT * 
FROM data_table WHERE _hoodie_file_name IN ( SELECT filename FROM 
manifest_file_table)".

The direct reason for this workaround is that BigQuery did not support 
manifest files. However, BigQuery is rolling out manifest file support, 
allowing users to specify a manifest file as the source URIs. Right now the 
feature[1] roll-out seems to cover only non-partitioned external tables (using 
hive partitioning returns an error "file_set_spec_type option is not supported 
for hive partition"); it should cover partitioned external tables soon.

Given this new BigQuery feature, it would be better to update BigQuerySyncTool 
accordingly:
 * Allow creating a BigQuery-compatible manifest file, which expects absolute 
paths of data files. This has been done in HUDI-6254.
 * Allow using the new manifest file to create the external table directly. This 
can be done by issuing one "CREATE EXTERNAL TABLE" query to BigQuery.
 * Avoid breaking existing user workflows. In case some users 
rely on the view-based workaround, it probably makes sense to keep the 
workaround alive at least for now. That would require maintaining two versions 
of manifest files.
 * Provide a temporary workaround for using BigQuery manifest file support until 
this feature extends to partitioned tables. Since it currently does not support 
hive partitioning, the "CREATE EXTERNAL TABLE" can only create a table over all 
the parquet data files without recognizing the partition columns. To keep the 
partition columns, a possible workaround is to set 
"hoodie.datasource.write.drop.partition.columns" to false and allow users to 
not specify "hoodie.gcp.bigquery.sync.source_uri_prefix", so that the 
partition columns are written into the parquet files and the 
BigQuerySyncTool will not try to create a hive-partitioned external table.

[1] https://cloud.google.com/bigquery/docs/information-schema-table-options#options_table
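
For comparison, the two approaches described above could be sketched roughly as 
follows. This only builds the SQL strings and is an illustration, not the 
BigQuerySyncTool implementation; the function names, table identifiers, and 
manifest URI are hypothetical. The `file_set_spec_type` option with the value 
`NEW_LINE_DELIMITED_MANIFEST` is the BigQuery external table option that tells 
BigQuery to treat the URI as a manifest listing data files:

```python
def legacy_view_query(data_table, manifest_table):
    """View-based workaround: filter data files by names listed in the manifest table."""
    return (f"SELECT * FROM {data_table} WHERE _hoodie_file_name IN "
            f"(SELECT filename FROM {manifest_table})")


def manifest_external_table_ddl(project_id, dataset, table, manifest_uri):
    """Single CREATE EXTERNAL TABLE statement that reads data files via a manifest.

    Assumes a non-partitioned table, since the description notes that hive
    partitioning is not yet supported together with file_set_spec_type.
    """
    return (
        f"CREATE EXTERNAL TABLE `{project_id}.{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{manifest_uri}'],\n"
        "  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    print(legacy_view_query("data_table", "manifest_file_table"))
    print(manifest_external_table_ddl(
        "my-project", "my_dataset", "my_table",
        "gs://my-bucket/my_table/manifest/latest-snapshot.csv"))
```

The second form needs one table instead of two tables plus a view, which is the 
main simplification this issue proposes.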





[jira] [Created] (HUDI-6254) Allow using absolute path in ManifestFileWriter

2023-05-23 Thread Jinpeng Zhou (Jira)
Jinpeng Zhou created HUDI-6254:
--

 Summary: Allow using absolute path in ManifestFileWriter
 Key: HUDI-6254
 URL: https://issues.apache.org/jira/browse/HUDI-6254
 Project: Apache Hudi
  Issue Type: Improvement
  Components: meta-sync
Reporter: Jinpeng Zhou
 Fix For: 0.14.0


Allow writing the manifest file with absolute paths in ManifestFileWriter. 
Currently the writer only uses the file name (excluding the full path). This 
would make the manifest file more flexible, e.g., when a downstream consumer 
expects full paths.
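
A minimal sketch of the proposed behavior, assuming a flag that switches between 
the two output styles. This is not the ManifestFileWriter API; the function name, 
the `use_absolute_path` parameter, and the sample paths are made up for 
illustration:

```python
import posixpath


def manifest_lines(base_path, relative_files, use_absolute_path=False):
    """Return one manifest entry per data file.

    By default only the file name is emitted (the current behavior described
    above); with use_absolute_path=True the full path under base_path is used.
    """
    lines = []
    for relative_file in relative_files:
        if use_absolute_path:
            lines.append(posixpath.join(base_path, relative_file))
        else:
            lines.append(posixpath.basename(relative_file))
    return lines


if __name__ == "__main__":
    # Hypothetical table location and data files.
    files = ["dt=2023-05-23/a.parquet", "dt=2023-05-24/b.parquet"]
    print("\n".join(manifest_lines("gs://my-bucket/my_table", files)))
    print("\n".join(manifest_lines("gs://my-bucket/my_table", files,
                                   use_absolute_path=True)))
```

Keeping the file-name-only mode as the default leaves existing consumers of the 
manifest unaffected.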


